BGF User's Guide

 

This is the help document for the BGF (Beijing Gene Finder) web service. BGF is an ab initio gene-finding program based on dynamic programming (DP) and a hidden semi-Markov model (HSMM).

 

 

Input

 

BGF accepts sequences in FASTA format only.

The following is an example:

 

>MT000349
AGCTATCAGCTTATCACCACACACAGACACAGAAGGAAATGTCGGCTTCT
GCGGCGCCCACGCACATCCGCTTCTCCTCCGCCGCACCTCCATCGGCCGC
CGCTCTCCGGCGGCCTCGCCGCCGGTGCGCCACGCCCGTCCGATGCTCCC
TCGCCGCAGCGCCGGGTCTCCGGGCGCCGCCTGAGCTCATCGACTCCATC
CTCTCCAAGGTGATGCTGAAACATCATCTCTCTGCTCTTCTTGGGTGCAC
TCTTTCCGGATTCCTAGCTTCTGCTCAGTGCTCGGTTTATGCAATCTTGT
ACTCCACATTAGAACACATTTCTTGACTCAGAAATTAACGAGTTCTGGTG
TTCCATAGTACTAATTTTAACCTCAATTAGTTTTTTCTTATCTTACACCA
AAAAAATTCGGATTTTTGTATAAATCCTGGAGAATATTCCTTATTATTGG
TTTCCAATTACTTACCTTTTTCCTTCTGTCATTTTCAGTTTCTTGTATAT
..................................
GGATCCGTGTATTCCTGCAACACAGTGCCTTGCGCTATGGTCTTTTTTTC
TGAGAGCTTGTACTTGTACCTGTAGAATGTAGTGTATGCATCAAGCTGCT
GCTACTGAATAAAAGAAAAAAGAAAAATATATGTTGTGGGTTGGGCTGAA
TGCCTGTACCCCATGAGCACAGGATGCTCTCGATCATTGAGCGTGCTGTG
CACGTCGTGGGCCTCCAACTAAAACTGTAATCATCCTTGGGCAGAAGACG
GCAGAAATCTTGAACTTTTTGTTTTGTCTTGTTCTTCGGCTGATAATGCT
GCTTCTTCTGATAACAATTGCCCCTGGAAATGCTAATAATGTAAGAAGAG
CACTGCTATACGT
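
As a rough illustration of the format, the following Python sketch splits FASTA text into records and flags characters outside the IUPAC nucleotide alphabet before you submit. The names `read_fasta` and `invalid_characters` are hypothetical helpers, not part of BGF, and the set of characters the service actually accepts is an assumption here.

```python
# Hypothetical pre-submission check; not part of BGF itself.
# Assumption: the accepted alphabet is the IUPAC nucleotide code set.
IUPAC_NUCLEOTIDES = set("ACGTRYSWKMBDHVN")


def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def invalid_characters(sequence):
    """Return the set of characters that are not IUPAC nucleotide codes."""
    return set(sequence.upper()) - IUPAC_NUCLEOTIDES
```

Note that a line of dots, like the one used above to abbreviate the example, would be reported by `invalid_characters`; a real submission must contain the full sequence.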
 
 

How to run it

 

First, paste your sequence in FASTA format into the Sequence box. Alternatively, click the File upload button to upload a sequence file from your computer. SNP (ambiguity) characters are allowed; each one is translated at random to one of the letters it represents. The exception is `N', which is always translated to `C' to avoid introducing unexpected stop codons. If your sequence contains runs of `N's that mark assembly gaps, set the Gap(`N's) option to the length of those indicator runs. Then choose a Species and press the Submit button to run BGF.
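
The character handling described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not BGF's actual code: the helper names `resolve_ambiguity` and `n_runs` are hypothetical, and the mapping is the standard IUPAC ambiguity table. Replacing `N' with `C' cannot create a stop codon because none of the standard stop codons (TAA, TAG, TGA) contains a C. The second helper lists runs of `N's so you can see which length to enter in the Gap(`N's) option.

```python
import random
import re

# Standard IUPAC ambiguity codes and the bases each one represents.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_ambiguity(seq, rng=None):
    """Replace ambiguity codes by a random represented base; N becomes C.

    Hypothetical helper mirroring the behaviour described in this guide.
    """
    rng = rng or random.Random()
    out = []
    for base in seq.upper():
        if base == "N":
            out.append("C")  # fixed choice, as the guide describes
        elif base in IUPAC:
            out.append(rng.choice(IUPAC[base]))
        else:
            out.append(base)
    return "".join(out)


def n_runs(seq):
    """Return (start, length) for each run of N; start is 1-based."""
    return [(m.start() + 1, len(m.group()))
            for m in re.finditer(r"N+", seq.upper())]
```

For instance, `n_runs(sequence)` on an assembled scaffold shows whether the gaps all share one run length, which is the value the Gap(`N's) option expects.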

 

 

Output

 

BGF output
Gene# - predicted gene number, counted from the start of the sequence; 
S - DNA strand (+ for direct, - for complementary); 
Exon# - predicted exon number within the current gene; 
Type - type of coding segment or transcription site: 
        Init - initial exon (begins with the start codon); 
        Intr - internal exon; 
        Term - terminal exon (last coding segment, ends with the stop codon); 
        Sngl - single-exon gene; 
        Prom - TSS (TATA-box or cap site); 
        PolA - PolyA signal site; 
Start/End - start and end positions of the feature given in Type; 
ORF_S/E - positions where the first complete codon starts and the last complete codon ends; 
Score - score of the exon or site given in Type; 
Len - length of the current exon.
 
For example:

Program    : BGF

Version    : 2.1.1

Time       : Sun Jan 15 11:48:07 2006

Parameter  : Rice

Sequence   : MT000349

Length     : 10813

GC%        : 43.29

Total Genes:    3 (  2 in + strand &   1 in - strand)

Total Exons:   18 ( 13 in + strand &   5 in - strand)

 

Gene# S Exon# Type   Start       End   ORF_S     ORF_E   Score    Len

===== = ===== ==== ======= = ======= ======= = ======= ======= ======

 

    1 +     1 Intr      27 -     124      29 -     124    1.04     98

    1 +     2 Intr     598 -     721     598 -     720    8.08    124

    1 +     3 Intr     907 -    1083     909 -    1082    4.89    177

    1 +     4 Intr    1198 -    1259    1200 -    1259    8.05     62

    1 +     5 Intr    1631 -    2030    1631 -    2029    0.05    400

    1 +     6 Intr    2264 -    2295    2266 -    2295    5.06     32

    1 +     7 Intr    2709 -    2851    2709 -    2849    8.26    143

    1 +     8 Intr    3084 -    3150    3085 -    3150   15.12     67

    1 +     9 Intr    3253 -    3330    3253 -    3330   12.25     78

    1 +    10 Intr    3448 -    3593    3448 -    3591    4.16    146

    1 +    11 Term    3839 -    3878    3840 -    3878    4.32     40

    1 +       PolA    4052 -                             -1.87

 

    2 -       PolA    4797 -                             -0.27

    2 -     1 Term    4915 -    5117    4915 -    5115    9.50    203

    2 -     2 Intr    5587 -    5729    5588 -    5728    6.02    143

    2 -     3 Intr    5958 -    6044    5960 -    6043    7.20     87

    2 -     4 Intr    6862 -    7037    6864 -    7037    5.55    176

    2 -     5 Init    7454 -    7552    7454 -    7552   14.01     99

    2 -       Prom    7872 -                             -4.24

 

    3 +       Prom    7922 -                             -5.79

    3 +     1 Init    8043 -    8487    8043 -    8486   12.66    445

    3 +     2 Term    9433 -    9497    9435 -    9497   -0.25     65

    3 +       PolA    9996 -                              0.48

 

Predicted protein(s):

>BGF:  Gene:1 Exon(s):11 AA:454 Chain+ H-T+

TEGNVGFCGAHAHPLLLRRTSIGRRSPAASPPVKGTDRGVLLPKDGHQEVADVALQLAKY

CIDDPVKSPLIFGEWEVVYCSVPTSPGGLYRTPLGRLIFKTDEMAQVVQAPDVVKNKVSF

SVFGFDGAVSLKGKLNVLDGKWIQVIFEPPEVKTNEHGYGFLVNPAMKLLLLVYTVFARR

FQHFCRQLLVTEHFWIYEHRQISIKRSRLFQTSKCISIADMPPPACSNVLYGDRTCTVEK

SPLEKENAFLEKPSCSSPHPRRGGVPSSSRVSRLLDGGVAVELPLWDKRSKYSAQSVRAM

PMRVLTVGKKRSRGAQLIVEEYKEKLGYYCDIEDTLIKSNPKLTSDVKVQVEAEDMAMML

QLKPEDFVVVLDENGKDVTSEQVADLVGDAGNTGSSRLTFCIGGPYGFGLQVRERADATI

RLSSMVLNHQVALIVLMEQLYRAWTIIKGQKYHH

>BGF:  Gene:2 Exon(s):5 AA:235 Chain- H+T+

MAEADAQTQSRAHSSTAAPVAGETAGEPVGFPQNGAINGAPLMFPVMYPMLMTGMHPQQS

LDDQAQGPGIYAIQQNQFMGSTLMPLTYRIPTESVGAVAGEEQAQDARQQHGPQRQVVVR

RYQTGAITPLLRWLQRAGGAAARPPQAPARPENRAPLAAQNDGNVQPPGGNLADPANNDQ

AAENQEPGAAAANENQQEVDGEGNRRNWLGGVFKEVQLIVVGFVASLLPGFQHND

>BGF:  Gene:3 Exon(s):2 AA:169 Chain+ H+T+

MARLLSRTLALARADSAAVPSYGRLHVRGVSSKVEFIEIDLSSEDAPSSSSSSGVEGGGF

GPREMGMRRLEDAIHGVLVRRAAPEWLPFVPGGSYWVPEMRRGVAADLVGTAVRSAIGAA

WNAEAMTEEEMMCLTTMRGWPSEAYFVEDCLEPAVVGWASCLGSFVYMG
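
If you want to post-process the table above, a small parser along the following lines may help. It is a sketch based only on the sample output shown here: the function name `parse_bgf_rows` and the dictionary keys are hypothetical, and it assumes exon rows always split into 12 whitespace-separated tokens and Prom/PolA rows into 6, as they do in the example.

```python
def parse_bgf_rows(text):
    """Extract exon and signal rows from BGF text output.

    Assumes the column layout of the sample output in this guide.
    """
    rows = []
    for line in text.splitlines():
        tok = line.split()
        # Data rows start with a gene number followed by the strand.
        if len(tok) not in (6, 12):
            continue
        if not tok[0].isdigit() or tok[1] not in ("+", "-"):
            continue
        if len(tok) == 12:
            # Exon row: Gene# S Exon# Type Start - End ORF_S - ORF_E Score Len
            rows.append({
                "gene": int(tok[0]), "strand": tok[1], "exon": int(tok[2]),
                "type": tok[3], "start": int(tok[4]), "end": int(tok[6]),
                "orf_start": int(tok[7]), "orf_end": int(tok[9]),
                "score": float(tok[10]), "len": int(tok[11]),
            })
        else:
            # Signal row (Prom or PolA): Gene# S Type Start - Score
            rows.append({
                "gene": int(tok[0]), "strand": tok[1], "exon": None,
                "type": tok[2], "start": int(tok[3]), "end": None,
                "orf_start": None, "orf_end": None,
                "score": float(tok[5]), "len": None,
            })
    return rows
```

In the sample output, Len equals End - Start + 1 for every exon row (for example, 124 - 27 + 1 = 98), which makes a convenient sanity check on parsed rows.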

 
 

 

 

 

BGF Group

 

Authors : Jin-song Liu, Zhao Xu
Tutors  : Bai-lin Hao, Hui-min Xie, Wei-mou Zheng, Guo-ying Li, Jun Wang
Partners: Lin Fang, Jiao Jin, Lei Gao, Heng Li, Hai-hong Li
          Yan Li, Zi-xing Xing, Qi-zhai Li, Shao-gen Gao