Efficient clustering of large EST data sets on parallel computers

Authors: Anantharaman Kalyanaraman; Srinivas Aluru (1); Suresh Kothari (1); Volker Brendel (2)

Source: Nucleic Acids Research, Volume 31, Number 11, 01 June 2003 , pp. 2963-2974(12)

Publisher: Oxford University Press

Abstract:

Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software package widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists with a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
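
The abstract does not spell out PaCE's algorithms, so the following is only a toy sketch of what "clustering ESTs" means in general, not the authors' method. It assumes, purely for illustration, that two ESTs belong to the same cluster when they share an exact common substring of at least `min_match` bases, and it merges clusters with a union-find structure. The names `DisjointSet`, `longest_common_substring`, `cluster_ests` and `min_match` are hypothetical. The all-pairs loop is quadratic in the number of sequences and is shown only for clarity; it is precisely the kind of cost that a tool aimed at hundreds of thousands of ESTs would need to avoid.

```python
from itertools import combinations


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def longest_common_substring(s, t):
    """Length of the longest exact common substring (simple DP)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def cluster_ests(ests, min_match=8):
    """Group sequence indices into clusters (illustrative criterion only)."""
    ds = DisjointSet(len(ests))
    for i, j in combinations(range(len(ests)), 2):
        if ds.find(i) == ds.find(j):
            continue  # already clustered together; comparison cannot change the result
        if longest_common_substring(ests[i], ests[j]) >= min_match:
            ds.union(i, j)
    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(ds.find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Made-up sequences, not real EST data.
    toy = ["ACGTACGTTTGACC", "TTTGACCGGATAGC", "GGATAGCAAACCCT", "CCCCCCCCCCCCCC"]
    print(cluster_ests(toy, min_match=7))
```

In this toy input, the first and third sequences share no long substring directly, yet they end up in one cluster because each overlaps the second. Skipping pairs that are already in the same cluster is one simple way to cut work without changing the resulting clusters; whether PaCE uses this particular idea is not stated in the abstract.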

Document Type: Research article

DOI: http://dx.doi.org/10.1093/nar/gkg379

Affiliations: 1: Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA and 2: Department of Zoology and Genetics and Department of Statistics, Iowa State University, Ames, IA 50011, USA

Publication date: 2003-06-01
