PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics
Metagenomic datasets are composed of DNA fragments from large numbers of different and potentially novel organisms. These datasets can contain up to several million sequences taken from heterogeneous populations of extremely varied abundance. Unlike traditional genomic studies, metagenomic
analysis requires an additional binning step. This process groups DNA fragments from the same or similar species of origin. However, existing unsupervised metagenomic binning programs cannot accurately analyze datasets containing a large number of species or with significantly unbalanced abundance
ratios. To address these limitations, we present PuzzleCluster, a novel unsupervised binning algorithm. PuzzleCluster incorporates a unique cluster refinement step that automatically groups reads sharing a nucleotide word (i.e., reverse complement pairs) of a predetermined
length. Additionally, the clustering parameters are estimated by fitting the Jensen-Shannon distance among sequences using the expectation-maximization algorithm. Because the clustering parameters are computed from each dataset, our approach can adapt to the peculiarities of each dataset and
is not confined by universal parameters. Furthermore, PuzzleCluster makes no prior assumptions about the genetic makeup or number of organisms present in the sample, making it well suited to applications with high biodiversity and completely unknown organisms. In comparison,
PuzzleCluster achieves an accuracy 9%, 19.8%, 15.7%, and 19.5% higher than MetaCluster 3.0 at the taxonomic levels of phylum, class, order, and family, respectively. PuzzleCluster source code is freely available at http://math.stanford.edu/~ksiegel/PuzzleCluster.html.
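The two ingredients named in the abstract, a shared-word test that treats reverse complements as equivalent and a Jensen-Shannon distance between sequences, can be illustrated with a minimal sketch. This is not the PuzzleCluster implementation: the function names, the word length k = 4, the use of normalized word frequencies as the compared distributions, and the definition of the distance as the square root of the base-2 Jensen-Shannon divergence are all assumptions made for illustration. The expectation-maximization fitting of clustering parameters is not attempted here.

```python
import math
from collections import Counter

# Hypothetical helpers for illustration only; not taken from PuzzleCluster.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word over A, C, G, T."""
    return word.translate(_COMPLEMENT)[::-1]


def canonical(word):
    """Map a word and its reverse complement to one shared representative."""
    return min(word, reverse_complement(word))


def word_profile(seq, k=4):
    """Normalized frequencies of canonical k-length words in seq (k is assumed)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip words with ambiguous bases
            counts[canonical(word)] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    divergence = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / m)
    return math.sqrt(max(divergence, 0.0))


def share_word(a, b, k):
    """True if reads a and b share a k-length word, up to reverse complement."""
    words_a = {canonical(a[i:i + k]) for i in range(len(a) - k + 1)}
    return any(canonical(b[i:i + k]) in words_a for i in range(len(b) - k + 1))
```

Under these assumptions, two reads would be candidates for the same group when `share_word` returns true for the chosen word length, while `js_distance(word_profile(a), word_profile(b))` gives a composition-based distance that a clustering step could threshold.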
Keywords: Clustering; Jensen-Shannon distance; expectation maximization; metagenome; quality threshold algorithm; word agreement
Document Type: Research Article
Publication date: April 1, 2015
Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely, in-depth reviews written by leaders in the field, covering a wide range of the integration of biology with computer and information science.
The journal focuses on reviews on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications on key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.
Current Bioinformatics is an essential journal for all academic and industrial researchers who want expert knowledge on all major advances in bioinformatics.