Purpose: Artificial Intelligence (AI) algorithms often lack uncertainty estimation for classification tasks. Uncertainty estimation may, however, be an important requirement for the clinical adoption of AI algorithms. In this study, we integrate a method for uncertainty estimation into a previously developed AI algorithm and investigate its performance at different uncertainty thresholds.
Methods and materials: We used a retrospective external validation dataset from the Danish Lung Cancer Screening Trial, containing 818 benign and 65 malignant nodules. Our previously developed AI algorithm for nodule malignancy risk estimation was extended with a method for measuring prediction uncertainty. The uncertainty score (UnS) was calculated as the standard deviation over 20 different predictions of an ensemble of AI models. Two UnS thresholds, at the 90th and 95th percentiles, were applied to retain 90% and 95% of all cases as certain, respectively, with the remaining 10% and 5% classified as uncertain. For these scenarios, we calculated the area under the ROC curve (AUC) for certain cases, for uncertain cases, and for the full set of nodules.
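The uncertainty-thresholding procedure described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names, the use of the ensemble mean as the risk score, the population standard deviation (NumPy's default), and the synthetic predictions are all assumptions made for illustration. Only the counts (818 benign, 65 malignant), the 20 predictions per nodule, and the 90th and 95th percentile thresholds come from the text.

```python
import numpy as np


def uncertainty_score(ensemble_preds):
    """Per-nodule standard deviation across ensemble predictions.

    ensemble_preds has shape (n_predictions, n_nodules).
    Assumption: population standard deviation (ddof=0).
    """
    return np.std(np.asarray(ensemble_preds, dtype=float), axis=0)


def split_by_percentile(uns, percentile):
    """Boolean mask: True where UnS is at or below the percentile threshold (certain)."""
    threshold = np.percentile(uns, percentile)
    return np.asarray(uns) <= threshold


def auc(labels, scores):
    """AUC from pairwise comparisons (Mann-Whitney); ties count as 0.5."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined without both classes
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Synthetic stand-in data (hypothetical): 65 malignant and 818 benign nodules,
# 20 noisy predictions per nodule.
rng = np.random.default_rng(0)
labels = np.zeros(883, dtype=int)
labels[:65] = 1
base = np.where(labels == 1, 0.7, 0.2)
ensemble_preds = np.clip(base + rng.normal(0.0, 0.15, size=(20, 883)), 0.0, 1.0)

risk = ensemble_preds.mean(axis=0)      # assumption: risk score = ensemble mean
uns = uncertainty_score(ensemble_preds)

print("full set AUC:", auc(labels, risk))
for p in (90, 95):
    certain = split_by_percentile(uns, p)
    print(
        f"{p}th percentile:",
        "certain AUC =", auc(labels[certain], risk[certain]),
        "| uncertain AUC =", auc(labels[~certain], risk[~certain]),
    )
```

Because the data here are synthetic, the printed AUC values bear no relation to the study's results; the sketch only shows how the certain and uncertain subsets are formed and evaluated.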
Results: On the full set of 883 nodules, the AUC of the AI risk score was 0.932. At the 90th and 95th percentile thresholds, the AUC of the AI risk score was 0.934 and 0.935, respectively, for certain cases, and 0.710 and 0.688, respectively, for uncertain cases.
Conclusion: In this retrospective dataset, we demonstrate that integrating an uncertainty estimation method into a deep learning-based nodule malignancy risk estimation algorithm slightly increased the performance on cases classified as certain. The AI performance was substantially worse on uncertain cases, which are therefore in need of human visual review.
Limitations: This study is a retrospective analysis of data from a single lung cancer screening trial. More external validation is needed.