Hybrid framework for lesion-aware, clinically coherent chest X-ray report generation using contrastive learning and large language models

Sci Rep. 2026 Jan 5;16(1):4645. doi: 10.1038/s41598-025-34799-2.

Won-Jun Noh #1, Sun-Woo Pi #1, Byoung-Dai Lee 2,3

Affiliations

1 Department of Computer Science, Graduate School, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, 16227, Gyeonggi-do, Republic of Korea.
2 Department of Computer Science, Graduate School, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, 16227, Gyeonggi-do, Republic of Korea.
3 Division of AI and Computer Engineering, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, 16227, Gyeonggi-do, Republic of Korea.
# Contributed equally.

PMID: 41492086 DOI: 10.1038/s41598-025-34799-2

Abstract

Automated radiology report generation from chest X-rays (CXRs) has the potential to reduce the workload of radiologists and improve diagnostic consistency. However, conventional approaches have been constrained by trade-offs between understanding global images and characterizing fine-grained lesions, often leading to omissions or clinically inconsistent narratives. This study proposed a hybrid framework, CLALA-Net, to integrate global and regional representations through three key modules: Lesion Cross-Attention (LCA), Lesion-Level Contrastive Learning (LLCL), and Image-Text Contrastive Learning (ITCL). LCA injects lesion-level cues derived from full-image classification into each region of interest (ROI), LLCL enhances discriminability by aligning lesion representations across CXRs, and ITCL improves visual-textual semantic alignment. A large language model (LLM)-based aggregator was utilized to consolidate ROI-level descriptions into a clinically coherent report. An LLM-driven label extraction pipeline was introduced to generate fine-grained lesion annotations for training and evaluation. Extensive experiments on the Chest-Imagenome dataset demonstrated that CLALA-Net outperformed existing baselines in both lesion-level accuracy (mean F1-score: 0.40) and report-level consistency (total score: 14.32/20). Ablation studies confirmed the complementary roles of LCA and LLCL, whereas the sensitivity analysis indicated strong performance gains with improved label quality. By bridging full-image contextual reasoning with regional-level lesion analysis, CLALA-Net produced accurate, semantically consistent, and clinically reliable chest radiography reports. This framework provides a robust and interpretable foundation for the real-world deployment of automated radiological reporting.
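
The abstract does not specify the loss used for image-text contrastive learning. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE objective (the CLIP-style formulation), which may differ from the objective used in CLALA-Net. The function name `symmetric_info_nce`, the `temperature` default, and the toy embeddings are all assumptions made for this example.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a matched
    pair; every other row in the batch serves as a negative. This is an
    illustrative stand-in, not the loss reported in the paper.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)

    # L2-normalise so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    paired = np.eye(4)                       # toy embeddings, perfectly aligned
    mismatched = np.roll(paired, 1, axis=0)  # same vectors, wrong pairing
    print(symmetric_info_nce(paired, paired) < symmetric_info_nce(paired, mismatched))
```

In this toy setup the correctly paired batch yields a lower loss than the mismatched one, which is the behaviour a contrastive alignment objective is meant to reward.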

Supplementary Information: The online version contains supplementary material available at 10.1038/s41598-025-34799-2.

Keywords

Chest x-ray; Contrastive learning; Large language model; Multimodal learning; Radiology report generation.
