Assessing the agreement between a privacy-preserving Llama model and human experts when labelling radiology reports for specific significant incidental findings in lung cancer screening

F. van der Graaf, N. Antonissen, E. Scholten, M. Prokop and C. Jacobs

Annual Meeting of the European Society of Thoracic Imaging 2024.

Purpose/Objectives:

The ERS/ESTS/ESTRO/ESR/ESTI/EFOMP statement on the management of incidental findings from low-dose CT screening for lung cancer recognizes nine specific significant incidental findings (SIFs). Artificial intelligence (AI) algorithms may aid human experts in detecting these SIFs efficiently and accurately during lung cancer screening, but training such algorithms effectively requires automatic identification of the scans that contain SIFs. In this study, we investigate the agreement between an out-of-the-box, privacy-preserving large language model (LLM), Llama2-7b-chat, and human experts in labelling SIFs in thorax-abdomen radiology reports.

Methods & Materials:

In this study, 100 thorax-abdomen CT radiology reports from Radboud University Medical Center were examined for the nine SIFs. Guided by an engineered system prompt, the LLM generated an output indicating the presence (1) or absence (0) of each SIF. The agreement between the LLM outputs and two human experts, the agreement between five runs of the LLM, and the agreement between the two human experts were assessed using Fleiss' κ, with bootstrapped 95% confidence intervals (reported in parentheses below).
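For illustration, a minimal sketch of what such a labelling step could look like, assuming the model is run locally through the Hugging Face transformers library; local inference is what keeps the report text inside the institution. The system prompt wording, the output format, and the decoding settings are placeholders, as the engineered prompt used in the study is not given in the abstract.

    from transformers import pipeline

    # Placeholder system prompt; the engineered prompt from the study is not public.
    SYSTEM_PROMPT = (
        "You label thorax-abdomen CT reports for nine significant incidental "
        "findings (SIFs). For each SIF, answer 1 if it is reported and 0 if not."
    )

    # Loading the checkpoint locally keeps the reports inside the institution,
    # which is what makes this approach privacy-preserving.
    labeller = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-7b-chat-hf",
        device_map="auto",
    )

    def label_report(report_text: str) -> str:
        # Llama-2 chat format: system prompt in <<SYS>> tags, user turn in [INST].
        prompt = f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{report_text} [/INST]"
        out = labeller(prompt, max_new_tokens=128, return_full_text=False)
        return out[0]["generated_text"]

Repeating this call five times per report gives the run-to-run outputs whose agreement is analysed below.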
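The agreement statistics can be computed with statsmodels; the sketch below assumes the labels for one SIF are collected as a reports-by-raters matrix of 0/1 codes and pairs Fleiss' κ with a percentile bootstrap over reports. The function name and the number of bootstrap replicates are arbitrary choices for illustration.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    def fleiss_kappa_with_ci(labels, n_boot=2000, seed=0):
        """labels: (n_reports, n_raters) array of 0/1 codes for one SIF.

        Returns Fleiss' kappa plus a bootstrapped 95% percentile CI,
        resampling whole reports with replacement.
        """
        rng = np.random.default_rng(seed)
        # aggregate_raters converts rater codes into per-report category counts;
        # n_cat=2 keeps both columns even when a resample lacks one category.
        table, _ = aggregate_raters(labels, n_cat=2)
        kappa = fleiss_kappa(table)
        n = labels.shape[0]
        boot = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)
            boot[b] = fleiss_kappa(aggregate_raters(labels[idx], n_cat=2)[0])
        lo, hi = np.percentile(boot, [2.5, 97.5])
        return kappa, (lo, hi)

Running this once per SIF and per rater pairing yields the per-SIF κ values; the medians over the nine SIFs give the summary figures reported below.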

Results:

The interobserver agreement across the nine SIFs for the two human experts was substantial, median κ = 0.763 (0.479, 0.891). The agreement between the LLM and the first human expert was moderate, median κ = 0.427 (0.181, 0.661), and between the LLM and the second human expert was fair, median κ = 0.395 (0.101, 0.619). The agreement between the five runs of the LLM was almost perfect, κ = 0.970 (0.912, 1.000). For individual SIFs, we found substantial agreement between the LLM and each expert for bronchiectasis (κ = 0.667 (0.327, 0.884) and κ = 0.634 (0.293, 0.878)) and coronary artery calcifications (κ = 0.627 (0.433, 0.792) and κ = 0.610 (0.381, 0.793)), and poor agreement for thyroid abnormalities (κ = -0.027 (-0.048, -0.005) and κ = -0.020 (-0.042, -0.005)). Between the human experts, agreement was almost perfect for bronchiectasis, κ = 0.889 (0.693, 1.000), and thyroid abnormalities, κ = 0.884 (0.479, 1.000), and fair for interstitial lung abnormalities, κ = 0.321 (-0.053, 0.628), and mediastinal masses, κ = 0.313 (-0.036, 0.795).

Conclusion:

This study demonstrates a large difference between the agreement of each human expert with the LLM and the agreement between the two human experts. It highlights the potential of LLMs to automatically label radiology reports for SIFs and indicates which SIFs can be labelled more reliably than others. Further research is needed to fine-tune the LLM on labelled radiology reports to improve its agreement with human experts when labelling for SIFs.