Deep learning algorithms require large training datasets to achieve optimal performance. For many AI tasks, it is unclear whether algorithm performance would improve further if more training data were added. The aim of this study is to quantify the number of CT training samples required for a deep learning AI algorithm that estimates pulmonary nodule malignancy risk to achieve radiologist-level performance.
Methods and materials: To estimate pulmonary nodule malignancy risk, we trained a deep learning algorithm on the NLST dataset (malignant nodules: 1249, benign nodules: 14828). The dataset was split into 80% training and 20% internal validation. The algorithm was trained on random subsets of the training set, with subset sizes ranging from 10% to 100% and a fixed class distribution of 7.77% malignant and 92.23% benign. The trained AI algorithms were validated on a size-matched, cancer-enriched cohort (malignant: 59, benign: 118) from DLCST. Performance was compared against a group of 11 clinicians, including 4 thoracic radiologists, who also scored the test set.
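The subset-sampling protocol above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, random seed, and pool sizes (approximating 80% of the NLST counts) are assumptions; only the subset fractions and the class distribution come from the abstract.

```python
import random

def stratified_subset(malignant, benign, fraction, seed=0):
    """Draw a random training subset that preserves the class
    distribution of the full pool (7.77% malignant, 92.23% benign)
    by sampling the same fraction from each class separately."""
    rng = random.Random(seed)
    n_mal = round(len(malignant) * fraction)
    n_ben = round(len(benign) * fraction)
    return rng.sample(malignant, n_mal), rng.sample(benign, n_ben)

# Illustrative pools approximating the 80% NLST training split
# (999 malignant, 11862 benign nodule identifiers).
malignant_pool = list(range(999))
benign_pool = list(range(11862))

# One model is trained per subset size; the class ratio stays fixed.
for frac in (0.1, 0.2, 0.3, 0.6, 1.0):
    mal, ben = stratified_subset(malignant_pool, benign_pool, frac)
```

Sampling each class independently (rather than the pooled set) keeps the malignant fraction constant across subset sizes, so the learning curve reflects data volume alone rather than shifting class balance.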
Results: Using training data subsets of 10%, 20%, and 30%, the AI achieved AUC values of 0.74 (95% CI: 0.67-0.82), 0.79 (95% CI: 0.72-0.85), and 0.81 (95% CI: 0.74-0.87), respectively. When the training dataset size reached 60% (malignant: 602, benign: 7112), performance saturated at an AUC of 0.82 (95% CI: 0.75-0.88). This was comparable to the average AUC of all clinicians (0.82, 95% CI: 0.77-0.86, p > 0.99) and of the 4 thoracic radiologists (0.82, 95% CI: 0.74-0.89, p > 0.99).
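The evaluation metric above can be illustrated with a rank-based AUC estimate and a percentile-bootstrap 95% CI. The sketch below uses synthetic scores; the abstract does not report model outputs or the exact CI method, so the score distributions, seed, and bootstrap settings are all assumptions (only the cohort sizes, 59 malignant and 118 benign, are taken from the text).

```python
import random

def auc(pos_scores, neg_scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen malignant case scores above a benign one (ties
    count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_ci(pos_scores, neg_scores, n_boot=1000, seed=0):
    """Percentile 95% CI: resample each class with replacement,
    recompute the AUC, and take the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        p = [rng.choice(pos_scores) for _ in pos_scores]
        n = [rng.choice(neg_scores) for _ in neg_scores]
        stats.append(auc(p, n))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Synthetic risk scores mimicking the 59 malignant / 118 benign cohort.
rng = random.Random(42)
malignant = [rng.gauss(0.7, 0.2) for _ in range(59)]
benign = [rng.gauss(0.4, 0.2) for _ in range(118)]
point = auc(malignant, benign)
lo, hi = bootstrap_ci(malignant, benign)
```

Resampling the two classes separately keeps each bootstrap replicate at the same 1:2 malignant-to-benign ratio as the test cohort.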
Conclusion: The AI reached the level of experienced thoracic radiologists when trained on 7714 nodules (malignant: 602) from the NLST dataset. These findings have potential implications for the allocation of resources in developing deep learning algorithms for lung cancer medical imaging diagnostics.
Limitations: The generalizability of these findings is constrained by heterogeneity and geographical limitations of the datasets used in this study.