Unsupervised Classification of Chemical Compounds
Authors: Gutiérrez Toscano, P.; Marriott, F. H. C.
Source: Journal of the Royal Statistical Society: Series C (Applied Statistics), Volume 48, Number 2, 1999, pp. 153-163(11)
Abstract: Clustering chemical compounds of similar structure is important in the pharmaceutical industry. One way of describing the structure is the chemical 'fingerprint'. The fingerprint is a string of binary digits, and typical data sets consist of very large numbers of fingerprints; a suitable clustering procedure must take account of the properties of this method of coding, and must be able to handle large data sets. This paper describes the analysis of a set of fingerprint data. The analysis was based on an appropriate distance measure derived from the fingerprints, followed by metric scaling into a low-dimensional space. An approximation to metric scaling, suitable for very large data sets, was investigated. Cluster analysis using two programs, mclust and AutoClass-C, was carried out on the scaled data.
Document Type: Original Article
Affiliations: University of Oxford, UK
Publication date: 1999-01-01
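
Illustrative sketch: the abstract outlines a pipeline of a fingerprint-derived distance followed by metric scaling into a low-dimensional space. The Python code below is a minimal sketch of that idea, not the authors' method. The abstract does not name the distance measure, so the Tanimoto (Jaccard) distance used here is an assumption, chosen because it is a common choice for binary fingerprints. Classical (Torgerson) scaling stands in for "metric scaling"; the paper's approximation for very large data sets and the mclust and AutoClass-C clustering steps are not reproduced. The function names and the toy fingerprints are hypothetical.

```python
import numpy as np


def tanimoto_distance(fp):
    """Pairwise Tanimoto (Jaccard) distance between rows of a 0/1 matrix.

    Assumption: the abstract does not say which distance was used.
    """
    fp = np.asarray(fp, dtype=float)
    common = fp @ fp.T                      # bits set in both fingerprints
    counts = fp.sum(axis=1)                 # bits set in each fingerprint
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero fingerprints have an empty union; treat them as identical.
    sim = np.divide(common, union, out=np.ones_like(common), where=union > 0)
    return 1.0 - sim


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) metric scaling of a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues that arise when the distance
    # is not exactly Euclidean.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy fingerprints: two groups that share no bits.
fingerprints = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
])

dist = tanimoto_distance(fingerprints)
coords = classical_mds(dist, n_components=2)
```

The resulting `coords` array has one row per compound and could be passed to any clustering routine. Note that the full eigendecomposition costs on the order of n^3 operations for n compounds, which is the kind of cost that makes an approximation attractive for very large data sets.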