TY - GEN
T1 - Rethinking data-intensive science using scalable analytics systems
AU - Nothaft, Frank Austin
AU - Massie, Matt
AU - Danford, Timothy
AU - Zhang, Zhao
AU - Laserson, Uri
AU - Yeksigian, Carl
AU - Kottalam, Jey
AU - Ahuja, Arun
AU - Hammerbacher, Jeff
AU - Linderman, Michael
AU - Franklin, Michael J.
AU - Joseph, Anthony D.
AU - Patterson, David A.
PY - 2015/5/27
Y1 - 2015/5/27
N2 - "Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28× speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8-8.9× improvement over the state-of-the-art MPI-based system.
AB - "Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28× speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8-8.9× improvement over the state-of-the-art MPI-based system.
KW - Analytics
KW - Genomics
KW - MapReduce
KW - Scientific computing
UR - http://www.scopus.com/inward/record.url?scp=84957568769&partnerID=8YFLogxK
U2 - 10.1145/2723372.2742787
DO - 10.1145/2723372.2742787
M3 - Conference contribution
AN - SCOPUS:84957568769
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 631
EP - 646
BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
T2 - ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Y2 - 31 May 2015 through 4 June 2015
ER -