TY - GEN
T1 - Optimizing high-performance computing systems for biomedical workloads
AU - Kovatch, Patricia
AU - Gai, Lili
AU - Cho, Hyung Min
AU - Fluder, Eugene
AU - Jiang, Dansha
N1 - Funding Information:
2.7-GHz compute cores (48 cores/node with two sockets/node) with 192 GB of memory per node in 286 nodes, 80 TB of total memory, 350 TB of solid state storage, 48 V100 GPUs and 29 PB of raw storage. This includes a second BODE2 funded by NIH in 2019 to enable computationally and data intensive workflows for NIH-funded projects. Although we have identified compute partitions by specific names and queues with certain access policies, the overall machine is called Minerva. As of 2020, we have two partitions: Chimera and BODE2, with BODE2 only accessible to NIH-funded research in accordance with our NIH S10 award. We have three separate GPFS file systems mounted on compute nodes: Orga at GPFS 4.x, Hydra at GPFS 5.x and Arion at GPFS 5.x. Arion, which was funded by the S10 award, contains only NIH-funded research data.
Funding Information:
Research reported in this paper was supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD018522 and S10OD026880. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - The productivity of computational biologists is limited by the speed of their workflows and subsequent overall job throughput. Because most biomedical researchers are focused on better understanding scientific phenomena rather than developing and optimizing code, a computing and data system implemented in an ad hoc and/or non-optimized manner can impede the progress of scientific discovery. In our experience, most computational life-science applications do not generally leverage the full capabilities of high-performance computing, so tuning a system for these applications is especially critical. To optimize a system effectively, systems staff must understand the effects of the applications on the system. Effective stewardship of the system includes an analysis of the impact of the applications on the compute cores, file system, resource manager and queuing policies. The resulting improved system design, and enactment of a sustainability plan, help to enable a long-term resource for productive computational and data science. We present a case study of a typical biomedical computational workload at a leading academic medical center supporting over $100 million per year in computational biology research. Over the past eight years, our high-performance computing system has enabled over 900 biomedical publications in four major areas: genetics and population analysis, gene expression, machine learning, and structural and chemical biology. We have upgraded the system several times in response to trends, actual usage, and user feedback. Major components crucial to this evolution include scheduling structure and policies, memory size, compute type and speed, parallel file system capabilities, and deployment of cloud technologies. We evolved a 70-teraflop machine into a 1.4-petaflop machine in seven years and grew our user base nearly 10-fold. For long-term stability and sustainability, we established a chargeback fee structure.
Our overarching guiding principle for each progression has been to increase scientific throughput and enable enhanced scientific fidelity with minimal impact on existing user workflows or code. This highly constrained system optimization has presented unique challenges, leading us to adopt new approaches to provide constructive pathways forward. We share our practical strategies resulting from our ongoing growth and assessments.
KW - Cloud technologies
KW - Computational biology
KW - Genomics
KW - High performance computing
KW - Parallel file systems
KW - Scheduling
KW - Sustainability
KW - System optimization
UR - http://www.scopus.com/inward/record.url?scp=85091593634&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW50202.2020.00040
DO - 10.1109/IPDPSW50202.2020.00040
M3 - Conference contribution
AN - SCOPUS:85091593634
T3 - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
SP - 183
EP - 192
BT - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 34th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Y2 - 18 May 2020 through 22 May 2020
ER -