Phylogenetic tree construction using trinucleotide usage profile (TUP). Issue 13 (October 2016)
- Record Type:
- Journal Article
- Title:
- Phylogenetic tree construction using trinucleotide usage profile (TUP). Issue 13 (October 2016)
- Main Title:
- Phylogenetic tree construction using trinucleotide usage profile (TUP)
- Authors:
- Chen, Si
Deng, Lih-Yuan
Bowman, Dale
Shiau, Jyh-Jen
Wong, Tit-Yee
Madahian, Behrouz
Lu, Henry
- Abstract:
- Background: It has been a challenging task to build a genome-wide phylogenetic tree for a large group of species containing a large number of genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution for all words of a certain length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length (k) ranges from 6 to 15 and it may not be a multiple of 3 (the codon length). The total number of possible words needed for FFP-k can range from 4^6 = 4096 to 4^15.
Results: We propose a simple improvement over the popular FFP method using only a typical word length of 3. A new method, called Trinucleotide Usage Profile (TUP), is proposed based only on the (relative) frequency distribution using non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, which is much less than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each of the species by a TUP vector and then using an appropriate distance measure between pairs of the TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to the three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each reading frame. We also provide a numerical measure for comparing trees constructed with various methods.
Conclusions: Compared to the FFP method, our empirical study showed that the proposed TUP method is more capable of building phylogenetic trees with stronger biological support. We further provide some justification for this from the information theory viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known. Without this information, the FFP method could only rely on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of the three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provide less accurate classification.
- Is Part Of:
- BMC bioinformatics. Volume 17:Issue 13(2016)
- Journal:
- BMC bioinformatics
- Issue:
- Volume 17:Issue 13(2016)
- Issue Display:
- Volume 17, Issue 13 (2016)
- Year:
- 2016
- Volume:
- 17
- Issue:
- 13
- Issue Sort Value:
- 2016-0017-0013-0000
- Page Start:
- 117
- Page End:
- 130
- Publication Date:
- 2016-10
- Subjects:
- Feature frequency profile (FFP) -- Reading frame -- Summary statistics -- Phylogenetic tree construction -- Tree comparison
Bioinformatics -- Periodicals
Computational biology -- Periodicals
570.285
- Journal URLs:
- http://www.biomedcentral.com/bmcbioinformatics/
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=13
http://link.springer.com/
- DOI:
- 10.1186/s12859-016-1222-3
- Languages:
- English
- ISSNs:
- 1471-2105
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - Digital store
British Library HMNTS - ELD Digital store
- Ingest File:
- 10041.xml
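
The TUP representation described in the abstract (a matrix of three rows, one per reading frame, each holding the relative frequencies of the 64 non-overlapping words of length 3) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `tup_matrix`, `overlapping_profile`, and `euclidean_distance` are hypothetical, windows containing characters other than A, C, G, T are simply skipped, and the Euclidean distance is an assumed stand-in, since the abstract only calls for "an appropriate distance measure".

```python
from itertools import product
import math

# All 4^3 = 64 possible trinucleotide words, in a fixed order.
WORDS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {word: i for i, word in enumerate(WORDS)}


def tup_matrix(seq):
    """Return a 3 x 64 matrix of relative frequencies.

    Row f counts non-overlapping windows of length 3 that start
    at offset f (reading frame f) of the sequence.
    """
    seq = seq.upper()
    rows = []
    for frame in range(3):
        counts = [0] * 64
        for i in range(frame, len(seq) - 2, 3):
            word = seq[i:i + 3]
            if word in INDEX:  # assumption: skip ambiguous bases such as N
                counts[INDEX[word]] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])
    return rows


def overlapping_profile(seq):
    """FFP-style profile for k = 3: every overlapping window is counted,
    so the three reading frames are mixed into one distribution."""
    seq = seq.upper()
    counts = [0] * 64
    for i in range(len(seq) - 2):
        word = seq[i:i + 3]
        if word in INDEX:
            counts[INDEX[word]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def euclidean_distance(a, b):
    """Assumed distance between two TUP matrices (flattened)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


if __name__ == "__main__":
    m = tup_matrix("ATGATGATG")
    for frame, row in enumerate(m):
        used = {WORDS[i]: f for i, f in enumerate(row) if f > 0}
        print(frame, used)
```

On the toy sequence `ATGATGATG`, each reading frame concentrates on a single word, whereas the overlapping profile spreads its mass over three words, which is the mixing effect the abstract attributes to FFP.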