TY - JOUR
T1 - Multiple-instance learning of somatic mutations for the classification of tumour type and the prediction of microsatellite status
AU - Anaya, Jordan
AU - Sidhom, John William
AU - Mahmood, Faisal
AU - Baras, Alexander S.
N1 - Publisher Copyright: © 2023, The Author(s).
PY - 2024/1
Y1 - 2024/1
N2 - Large-scale genomic data are well suited to analysis by deep learning algorithms. However, for many genomic datasets, labels are at the level of the sample rather than for individual genomic measures. Machine learning models leveraging these datasets generate predictions by using statically encoded measures that are then aggregated at the sample level. Here we show that a single weakly supervised end-to-end multiple-instance-learning model with multi-headed attention can be trained to encode and aggregate the local sequence context or genomic position of somatic mutations, hence allowing for the modelling of the importance of individual measures for sample-level classification and thus providing enhanced explainability. The model solves synthetic tasks that conventional models fail at, and achieves best-in-class performance for the classification of tumour type and for predicting microsatellite status. By improving the performance of tasks that require aggregate information from genomic datasets, multiple-instance deep learning may generate biological insight.
UR - http://www.scopus.com/inward/record.url?scp=85175575209&partnerID=8YFLogxK
U2 - 10.1038/s41551-023-01120-3
DO - 10.1038/s41551-023-01120-3
M3 - Article
AN - SCOPUS:85175575209
SN - 2157-846X
VL - 8
SP - 57
EP - 67
JO - Nature Biomedical Engineering
JF - Nature Biomedical Engineering
IS - 1
ER -