Selecting Clustering Algorithms for Identity-By-Descent Mapping

Ruhollah Shemirani, Gillian M. Belbin, Keith Burghardt, Kristina Lerman, Christy L. Avery, Eimear E. Kenny, Christopher R. Gignoux, José Luis Ambite

Research output: Contribution to journal › Conference article › peer-review

Abstract

Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights into rare traits and diseases in massive biobanks through IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast, and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and used it to compare the statistical power of clustering algorithms by simulating 2.3 million clusters across 850 experiments. We found that the Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a runtime three orders of magnitude lower. We also found that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset, with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the Whole-Exome Sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping across various populations and scenarios.

Supplementary Information: The code, along with supplementary methods and figures, is available at https://github.com/roohy/localIBDClustering
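
To make the notion of a local IBD graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical segment format of (sample_a, sample_b, start_bp, end_bp), builds the graph of samples that share a segment covering one genomic position, and groups them by connected components. Connected components are used here only as a naive stand-in for the community detection methods named in the abstract (Infomap and MCL); the function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def local_ibd_graph(segments, position):
    """Build adjacency sets for samples whose shared segment covers `position`.

    `segments` is an iterable of (sample_a, sample_b, start_bp, end_bp)
    tuples. This format is an assumption made for illustration.
    """
    graph = defaultdict(set)
    for a, b, start, end in segments:
        if start <= position <= end:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Naive clustering: each connected component becomes one cluster."""
    seen = set()
    clusters = []
    for node in sorted(graph):
        if node in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack = [node]
        seen.add(node)
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Toy pairwise IBD segments (hypothetical data).
segments = [
    ("s1", "s2", 100, 500),
    ("s2", "s3", 200, 600),
    ("s4", "s5", 300, 700),
    ("s1", "s6", 800, 900),
]

clusters = connected_components(local_ibd_graph(segments, 350))
print(clusters)
```

In this toy example the samples s1, s2, and s3 fall into one cluster at position 350 even though s1 and s3 share no segment directly. A dedicated community detection method could decide to split such loosely connected groups, which is the kind of choice the benchmark described above is meant to evaluate.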

Original language: English
Pages (from-to): 121-132
Number of pages: 12
Journal: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Issue number: 2023
State: Published - 2023
Event: 28th Pacific Symposium on Biocomputing, PSB 2023 - Kohala Coast, United States
Duration: 3 Jan 2023 to 7 Jan 2023

Keywords

  • Benchmark
  • Clustering
  • Clustering Metrics
  • Community Detection
  • Comparative Analysis
  • Genome-wide Association Studies
  • Identity-By-Descent
