TY - JOUR
T1 - Misty Mountain clustering
T2 - Application to fast unsupervised flow cytometry gating
AU - Sugár, István P.
AU - Sealfon, Stuart C.
N1 - Funding Information:
We thank Profs. D. Stäuffer and B. Roysam for sending the source code of a Hoshen-Kopelman type cluster counting algorithm and spectral clustering, respectively. We also thank Prof. F. Hayot for the critical evaluation of the manuscript. We acknowledge Drs. B. Hartman and J. Seto for providing the barcoding FCM data, Dr. German Nudelman for making the program available on the web and Dr. Yongchao Ge for analyzing FCM data with flowClust and flowMerge. We are grateful for Prof. Ryan Brinkman for providing access to the GvHD flow cytometry data sets and to Prof. Hans Snoeck for providing the OP9 dataset. This work from the Program for Research in Immune Modeling and Experimentation (PRIME) was supported by contract NIH/NIAID HHSN266200500021C.
PY - 2010/10/9
Y1 - 2010/10/9
N2 - Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed such as affinity propagation are too expensive to be applied to datasets on the order of 106 points that are often generated by high throughput experiments.Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing only once the cross sections of the histogram. The overall run time for the composite steps of the algorithm increases linearly by the number of data points. The clustering of 106 data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicate that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
AB - Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed such as affinity propagation are too expensive to be applied to datasets on the order of 106 points that are often generated by high throughput experiments.Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing only once the cross sections of the histogram. The overall run time for the composite steps of the algorithm increases linearly by the number of data points. The clustering of 106 data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicate that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
UR - https://www.scopus.com/pages/publications/77957550165
U2 - 10.1186/1471-2105-11-502
DO - 10.1186/1471-2105-11-502
M3 - Article
C2 - 20932336
AN - SCOPUS:77957550165
SN - 1471-2105
VL - 11
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 502
ER -