Benchmarking of Artificial Intelligence and Radiologists for Lung Cancer Screening in CT: The LUNA25 Challenge

D. Peeters, B. Obreja, N. Antonissen, R. Dinnessen, Z. Saghir, E. Scholten, R. Vliegenthart, M. Prokop and C. Jacobs

European Congress of Radiology 2025.

Purpose:

The imminent implementation of lung cancer screening and the growing workload for radiologists demonstrate the need for safe and validated AI algorithms. At present, it is challenging to adequately validate and benchmark the increasing number of AI algorithms being developed. In this study, we present the LUNA25 challenge, a public competition aiming to evaluate the diagnostic performance of AI algorithms and radiologists in lung nodule malignancy risk estimation at screening CT.

Methods:

The LUNA25 dataset will include 5051 screening CT scans from the National Lung Screening Trial (NLST), with 624 malignant and 7414 benign nodules. Participating teams can access this dataset to develop AI algorithms. For algorithm validation, a separate set of 65 malignant and 818 benign nodules from the Danish Lung Cancer Screening Trial (DLCST) will serve as a hidden test set. Additionally, a subset from DLCST with indeterminate nodules measuring 5-15 mm in diameter will be assessed by a panel of radiologists with varying experience levels to benchmark radiologists' performance against AI algorithms. Performance will be measured using the area under the ROC curve (AUC) and, at different operating points, in terms of sensitivity and specificity.
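The evaluation metrics described above can be sketched as follows. The scores and labels here are illustrative stand-ins, not challenge data, and the 0.5 threshold is an arbitrary example of an operating point:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative malignancy risk scores and ground-truth labels
# (1 = malignant, 0 = benign); not actual challenge data.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.8, 0.4, 0.9, 0.35, 0.15, 0.7, 0.05])

# Area under the ROC curve, threshold-independent.
auc = roc_auc_score(y_true, y_score)

# Sensitivity and specificity at a chosen operating point (threshold).
threshold = 0.5
pred = y_score >= threshold
tp = np.sum(pred & (y_true == 1))   # true positives
fn = np.sum(~pred & (y_true == 1))  # false negatives
tn = np.sum(~pred & (y_true == 0))  # true negatives
fp = np.sum(pred & (y_true == 0))   # false positives
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"AUC = {auc:.2f}, "
      f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Reporting multiple operating points matters in screening, where the trade-off between missed cancers (sensitivity) and unnecessary follow-up (specificity) depends on where the threshold is placed.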

Results:

With the NLST and DLCST cohorts collected, the challenge is ready to be introduced to the ECR audience. Preliminary results with an in-house-developed AI algorithm demonstrated a mean AUC of 0.91 [0.87, 0.95] on the DLCST test set.
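The bracketed interval around the AUC is typically obtained by bootstrapping over test cases. A minimal sketch of that procedure, using synthetic scores (the class sizes mirror the DLCST test set, but the data are simulated, not the actual results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores standing in for model output on a hidden test set;
# not the actual DLCST data or algorithm.
n_pos, n_neg = 65, 818  # class sizes mirroring the DLCST test set
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_score = np.concatenate([rng.normal(0.7, 0.15, n_pos),
                          rng.normal(0.3, 0.15, n_neg)])

# Bootstrap the AUC by resampling cases with replacement.
aucs = []
n = len(y_true)
for _ in range(1000):
    idx = rng.integers(0, n, n)
    if y_true[idx].min() == y_true[idx].max():
        continue  # resample contained only one class; AUC undefined
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}, "
      f"95% CI = [{lo:.2f}, {hi:.2f}]")
```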

Conclusion:

The LUNA25 challenge aims to establish a worldwide benchmark for AI algorithms in estimating lung nodule malignancy risk at screening CT and to offer insight into how AI compares to radiologists across different experience levels and operating points.

Limitations:

LUNA25 benchmarks only the stand-alone performance of AI and does not address workflow integration or radiologist-AI interaction, which are important for clinical adoption.