TY - JOUR
T1 - Autoencoders for sample size estimation for fully connected neural network classifiers
AU - Gulamali, Faris F.
AU - Sawant, Ashwin S.
AU - Kovatch, Patricia
AU - Glicksberg, Benjamin
AU - Charney, Alexander
AU - Nadkarni, Girish N.
AU - Oermann, Eric
N1 - Funding Information:
This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai. Research reported in this paper was supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired level of performance is largely based on prior experience with similar models and problems or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points needed to achieve a generalizable representation of the data, which we term a minimum converging sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability strongly correlates with the autoencoder's loss. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods for guiding sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
UR - http://www.scopus.com/inward/record.url?scp=85143818608&partnerID=8YFLogxK
U2 - 10.1038/s41746-022-00728-0
DO - 10.1038/s41746-022-00728-0
M3 - Article
AN - SCOPUS:85143818608
SN - 2398-6352
VL - 5
JO - npj Digital Medicine
JF - npj Digital Medicine
IS - 1
M1 - 180
ER -