Research Papers

ARXIV Cancer: unknown Method: multimodal large language model

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
Published 2026-01-06 17:13

The paper presents AnatomiX, a multitask multimodal large language model designed for anatomically grounded interpretation of chest X-rays. It employs a two-stage approach that first identifies anatomical structures and extracts their features, followed by leveraging a large language model for various downstream tasks. The results indicate that AnatomiX significantly enhances anatomical reasoning, achieving over 25% improvement in performance on several tasks compared to existing methods.

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at github.com/aneesurhashmi/anatomix.

ARXIV Cancer: general cancer Method: reinforcement learning

Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

Kun Zhao, Siyuan Dai, Pan Wang, Jifeng Song, Hui Ji, Chenghua Lin, Liang Zhan, Haoteng Tang
Published 2026-01-06 14:17

This paper presents a novel framework for self-consistent radiology report generation using Multimodal Large Language Models (MLLMs). The proposed 'Reason-then-Summarize' architecture, optimized through Group Relative Policy Optimization (GRPO), aims to align linguistic outputs with visual evidence while minimizing factual hallucinations. Experimental results on the MIMIC-CXR benchmark indicate that the method achieves state-of-the-art performance in clinical efficacy metrics.

Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
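As a rough illustration of the group-relative step in GRPO, the sketch below standardizes rewards within one group of sampled reports. The reward shape (a base clinical score minus a penalty when the findings in the think block disagree with the labels in the answer block), the function names, and the penalty weight are assumptions made for illustration; the abstract does not specify the composite reward.

```python
from statistics import mean, pstdev


def consistency_penalty(think_findings, answer_labels):
    """Hypothetical mismatch measure: labels present in exactly one block."""
    return len(set(think_findings) ^ set(answer_labels))


def composite_reward(base_score, think_findings, answer_labels, weight=0.5):
    """Hypothetical composite reward: base score minus a weighted mismatch penalty."""
    return base_score - weight * consistency_penalty(think_findings, answer_labels)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (population std used here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled reports for the same image: one self-consistent, one not.
rewards = [
    composite_reward(0.8, {"edema"}, {"edema"}),
    composite_reward(0.8, {"edema"}, {"pneumonia"}),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, the self-consistent report receives a positive advantage and the inconsistent one a negative advantage, without a separate value network.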

ARXIV Cancer: unknown Method: Swin Transformer U-Net 3D

Lesion Segmentation in FDG-PET/CT Using Swin Transformer U-Net 3D: A Robust Deep Learning Framework

Shovini Guha, Dwaipayan Nandi
Published 2026-01-06 09:52

This paper introduces the Swin Transformer U-Net 3D (SwinUNet3D) framework for automated lesion segmentation in FDG-PET/CT imaging, which is crucial for cancer diagnosis and therapy planning. The model combines shifted window self-attention with U-Net style skip connections to enhance both global context and fine anatomical detail. Evaluation on the AutoPET III FDG dataset shows that SwinUNet3D significantly outperforms the baseline 3D U-Net in terms of Dice score and IoU, while also providing faster inference times.

Accurate and automated lesion segmentation in Positron Emission Tomography / Computed Tomography (PET/CT) imaging is essential for cancer diagnosis and therapy planning. This paper presents a Swin Transformer UNet 3D (SwinUNet3D) framework for lesion segmentation in Fluorodeoxyglucose Positron Emission Tomography / Computed Tomography (FDG-PET/CT) scans. By combining shifted window self-attention with U-Net style skip connections, the model captures both global context and fine anatomical detail. We evaluate SwinUNet3D on the AutoPET III FDG dataset and compare it against a baseline 3D U-Net. Results show that SwinUNet3D achieves a Dice score of 0.88 and IoU of 0.78, surpassing 3D U-Net (Dice 0.48, IoU 0.32) while also delivering faster inference times. Qualitative analysis demonstrates improved detection of small and irregular lesions, reduced false positives, and more accurate PET/CT fusion. While the framework is currently limited to FDG scans and trained under modest GPU resources, it establishes a strong foundation for future multi-tracer, multi-center evaluations and benchmarking against other transformer-based architectures. Overall, SwinUNet3D represents an efficient and robust approach to PET/CT lesion segmentation, advancing the integration of transformer-based models into oncology imaging workflows.
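For readers unfamiliar with the two metrics, a minimal sketch of Dice and IoU on flat binary masks follows. The function names and the empty-mask convention are assumptions for illustration, not the paper's evaluation code. For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), and the reported pairs (0.88, 0.78) and (0.48, 0.32) agree with that identity to two decimals.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flat binary masks given as sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks empty: treated as a perfect match (a convention, assumed here).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


def dice_from_iou(iou):
    """Convert IoU to Dice for a single mask pair."""
    return 2 * iou / (1 + iou)


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

The identity holds per mask pair; scores averaged over many scans need not satisfy it exactly.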

ARXIV Cancer: general cancer Method: graph contrastive learning

Topology-aware Pathological Consistency Matching for Weakly-Paired IHC Virtual Staining

Mingzhou Jiang, Jiaying Zhou, Nan Zeng, Mickael Li, Qijie Tang, Chao He, Huazhu Fu, Honghui He
Published 2026-01-06 08:28

This paper presents a novel framework for virtual staining that converts Hematoxylin and Eosin (H&E) images to immunohistochemical (IHC) images, addressing the challenges of weakly-paired data due to spatial misalignment. The proposed Topology-aware Consistency Matching (TACM) mechanism utilizes graph contrastive learning to ensure structural consistency, while the Topology-constrained Pathological Matching (TCPM) mechanism enhances pathological consistency. Experimental results indicate that the method significantly outperforms existing approaches, achieving higher generation quality and clinical relevance.

Immunohistochemical (IHC) staining provides crucial molecular characterization of tissue samples and plays an indispensable role in the clinical examination and diagnosis of cancers. However, compared with the commonly used Hematoxylin and Eosin (H&E) staining, IHC staining involves complex procedures and is both time-consuming and expensive, which limits its widespread clinical use. Virtual staining converts H&E images to IHC images, offering a cost-effective alternative to clinical IHC staining. Nevertheless, using adjacent slides as ground truth often results in weakly-paired data with spatial misalignment and local deformations, hindering effective supervised learning. To address these challenges, we propose a novel topology-aware framework for H&E-to-IHC virtual staining. Specifically, we introduce a Topology-aware Consistency Matching (TACM) mechanism that employs graph contrastive learning and topological perturbations to learn robust matching patterns despite spatial misalignments, ensuring structural consistency. Furthermore, we propose a Topology-constrained Pathological Matching (TCPM) mechanism that aligns pathological positive regions based on node importance to enhance pathological consistency. Extensive experiments on two benchmarks across four staining tasks demonstrate that our method outperforms state-of-the-art approaches, achieving superior generation quality with higher clinical relevance.

ARXIV Cancer: unknown Method: deep learning

CutisAI: Deep Learning Framework for Automated Dermatology and Cancer Screening

Rohit Kaushik, Eva Kaushik
Published 2026-01-05 21:29

This paper presents the Conformal Bayesian Dermatological Classifier (CBDC), a deep learning framework designed for automated dermatology and cancer screening. The framework integrates Statistical Learning Theory, Topological Data Analysis, and Bayesian Conformal Inference to improve uncertainty quantification in predictions. Experimental results demonstrate that CBDC achieves high classification accuracy while providing interpretable and calibrated predictions suitable for clinical use.

The rapid growth of dermatological imaging and mobile diagnostic tools calls for systems that not only demonstrate empirical performance but also provide strong theoretical guarantees. Deep learning models have shown high predictive accuracy; however, they are often criticized for lacking well-calibrated uncertainty estimates, without which these models are hardly deployable in a clinical setting. To this end, we present the Conformal Bayesian Dermatological Classifier (CBDC), a well-founded framework that combines Statistical Learning Theory, Topological Data Analysis (TDA), and Bayesian Conformal Inference. CBDC offers distribution-dependent generalization bounds that reflect dermatological variability, proves a topological stability theorem that guarantees the invariance of convolutional neural network embeddings under photometric and morphological perturbations, and provides finite conformal coverage guarantees for trustworthy uncertainty quantification. Through exhaustive experiments on the HAM10000, PH2, and ISIC 2020 datasets, we show that CBDC not only attains high classification accuracy but also generates calibrated predictions that are interpretable from a clinical perspective. This research constitutes a theoretical and practical leap for deep dermatological diagnostics, thereby opening the interface between machine learning theory and clinical applicability.
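To make the idea of a conformal coverage guarantee concrete, here is a minimal sketch of standard split conformal prediction for classification. It is not CBDC's Bayesian variant, which the abstract does not detail; the function names, class labels, and scores are illustrative assumptions. The nonconformity score is taken to be one minus the probability assigned to the true class.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; if that rank
    exceeds n, every class must be kept, so the threshold is infinite.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep every class whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]


# Hypothetical calibration scores and one test-time probability vector.
threshold = conformal_threshold([0.3, 0.1, 0.2], alpha=0.25)
labels = prediction_set({"nevus": 0.75, "melanoma": 0.2, "keratosis": 0.05}, threshold)
```

Under exchangeability of calibration and test data, sets built this way contain the true class with probability at least 1 - alpha; an ambiguous image simply yields a larger set.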

ARXIV Cancer: thyroid cancer Method: prior-guided DETR

Prior-Guided DETR for Ultrasound Nodule Detection

Jingjing Wang, Zhuo Xiao, Xinning Yao, Bo Liu, Lijuan Niu, Xiangzhi Bai, Fugen Zhou
Published 2026-01-05 15:32

This paper presents a prior-guided DETR framework aimed at improving the detection of ultrasound nodules associated with thyroid and breast cancers. The method incorporates prior knowledge at multiple stages of the network to enhance feature extraction and detection accuracy, particularly for irregular and blurred nodules. Experimental results indicate that the proposed approach outperforms 18 existing detection methods, especially in challenging cases involving complex nodule morphology.

Accurate detection of ultrasound nodules is essential for the early diagnosis and treatment of thyroid and breast cancers. However, this task remains challenging due to irregular nodule shapes, indistinct boundaries, substantial scale variations, and the presence of speckle noise that degrades structural visibility. To address these challenges, we propose a prior-guided DETR framework specifically designed for ultrasound nodule detection. Instead of relying on purely data-driven feature learning, the proposed framework progressively incorporates different prior knowledge at multiple stages of the network. First, a Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) is embedded into the CNN backbone to inject geometric priors into deformable sampling, stabilizing feature extraction for irregular and blurred nodules. Second, a Multi-scale Spatial-Frequency Feature Mixer (MSFFM) is designed to extract multi-scale structural priors, where spatial-domain processing emphasizes contour continuity and boundary cues, while frequency-domain modeling captures global morphology and suppresses speckle noise. Furthermore, a Dense Feature Interaction (DFI) mechanism propagates and exploits these prior-modulated features across all encoder layers, enabling the decoder to enhance query refinement under consistent geometric and structural guidance. Experiments conducted on two clinically collected thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI) for thyroid and breast nodules demonstrate that the proposed method achieves superior accuracy compared with 18 detection methods, particularly in detecting morphologically complex nodules. The source code is publicly available at https://github.com/wjj1wjj/Ultrasound-DETR.

ARXIV Cancer: unknown Method: multi-source domain adaptation

Mind the Gap: Continuous Magnification Sampling for Pathology Foundation Models

Alexander Möllers, Julius Hense, Florian Schulz, Timo Milbich, Maximilian Alber, Lukas Ruff
Published 2026-01-05 15:19

This study investigates the impact of magnification sampling on the performance of pathology foundation models in histopathology. The authors propose a continuous magnification sampling method to address the limitations of traditional discrete sampling strategies. Their experiments demonstrate that continuous sampling significantly enhances classification accuracy, particularly at intermediate magnifications, and optimized distributions can further improve model performance. The findings highlight the importance of magnification in the evaluation of pathology models.

In histopathology, pathologists examine both tissue architecture at low magnification and fine-grained morphology at high magnification. Yet, the performance of pathology foundation models across magnifications and the effect of magnification sampling during training remain poorly understood. We model magnification sampling as a multi-source domain adaptation problem and develop a simple theoretical framework that reveals systematic trade-offs between sampling strategies. We show that the widely used discrete uniform sampling of magnifications (0.25, 0.5, 1.0, 2.0 mpp) leads to degradation at intermediate magnifications. We introduce continuous magnification sampling, which removes gaps in magnification coverage while preserving performance at standard scales. Further, we derive sampling distributions that optimize representation quality across magnification scales. To evaluate these strategies, we introduce two new benchmarks (TCGA-MS, BRACS-MS) with appropriate metrics. Our experiments show that continuous sampling substantially improves over discrete sampling at intermediate magnifications, with gains of up to 4 percentage points in balanced classification accuracy, and that optimized distributions can further improve performance. Finally, we evaluate current histopathology foundation models, finding that magnification is a primary driver of performance variation across models. Our work paves the way towards future pathology foundation models that perform reliably across magnifications.
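The contrast between the two sampling strategies can be sketched in a few lines. The discrete set is the one quoted in the abstract; the continuous sampler below draws log-uniformly between the smallest and largest of those values, which is an assumption for illustration and not the optimized distribution the authors derive.

```python
import math
import random

# Discrete magnifications quoted in the abstract, in microns per pixel (mpp).
DISCRETE_MPP = (0.25, 0.5, 1.0, 2.0)


def sample_discrete(rng):
    """Uniform draw over the four standard magnifications."""
    return rng.choice(DISCRETE_MPP)


def sample_continuous(rng, low=0.25, high=2.0):
    """Log-uniform draw over [low, high] (assumed distribution)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
discrete_draws = [sample_discrete(rng) for _ in range(1000)]
continuous_draws = [sample_continuous(rng) for _ in range(1000)]
```

The discrete sampler never produces an intermediate value such as 0.7 mpp, which is the coverage gap the abstract describes; the continuous sampler fills the whole range.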

ARXIV Cancer: thyroid cancer Method: detection transformer

Nodule-DETR: A Novel DETR Architecture with Frequency-Channel Attention for Ultrasound Thyroid Nodule Detection

Jingjing Wang, Qianglin Liu, Zhuo Xiao, Xinning Yao, Bo Liu, Lu Li, Lijuan Niu, Fugen Zhou
Published 2026-01-05 08:53

This study presents Nodule-DETR, a novel detection transformer architecture aimed at improving the detection of thyroid nodules in ultrasound images. The method incorporates innovative modules such as Multi-Spectral Frequency-domain Channel Attention and Hierarchical Feature Fusion to enhance the detection of low-contrast nodules. Experimental results indicate that Nodule-DETR significantly outperforms existing models, demonstrating its potential for clinical application in thyroid cancer diagnostics.

Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule-DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule-DETR introduces three key innovations: a Multi-Spectral Frequency-domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low-contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi-scale integration; and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real-world thyroid ultrasound images. The results demonstrate that Nodule-DETR achieves state-of-the-art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule-DETR highlights its significant potential for clinical application as an effective tool in computer-aided thyroid diagnosis. The code for this work is available at https://github.com/wjj1wjj/Nodule-DETR.
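The mAP@0.5:0.95 metric averages average precision over ten box-IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the matching criterion behind it: the IoU of two axis-aligned boxes and the thresholds at which a single detection would count as a match. The helper names and box coordinates are illustrative assumptions, and a full mAP computation (ranking by confidence, precision-recall integration) is deliberately omitted.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Thresholds at which the predicted box counts as a true positive."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]


# A prediction covering 77% of the ground-truth nodule box.
matched = thresholds_matched((0, 0, 100, 77), (0, 0, 100, 100))
```

A loosely localized box passes only the lower thresholds, which is why the metric rewards tight localization of small or irregular nodules.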

ARXIV Cancer: pancreatic ductal adenocarcinoma and breast cancer Method: Retrieval-Augmented Generation

Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

Udiptaman Das, Krishnasai B. Atmakuri, Duy Ho, Chi Lee, Yugyung Lee
Published 2026-01-05 07:16

This paper presents an end-to-end framework for constructing and evaluating clinical knowledge graphs (KGs) from unstructured clinical narratives using multi-agent prompting and a Retrieval-Augmented Generation (KG-RAG) strategy. The method integrates various components including entity extraction, uncertainty scoring, schema generation, and validation to enhance the accuracy and semantic consistency of the KGs. The framework was applied to two oncology cohorts, demonstrating improvements in precision and relevance compared to baseline methods.

Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
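One plausible reading of entropy-based uncertainty scoring is Shannon entropy over the answers that several LLMs give for the same candidate relation. The abstract does not say how the entropy is computed, so the function name, the voting setup, and the example labels below are all assumptions.

```python
import math
from collections import Counter


def vote_entropy(labels):
    """Shannon entropy (in bits) of the label distribution across model votes."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical relation labels proposed by three models for one entity pair.
agreed = vote_entropy(["treats", "treats", "treats"])
split = vote_entropy(["treats", "causes", "treats"])
```

Unanimous votes give zero entropy, while disagreement raises the score, so high-entropy triples could be routed to the consensus validation step described in the abstract.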

ARXIV Cancer: breast cancer Method: dual-stream architecture

CTIS-QA: Clinical Template-Informed Slide-level Question Answering for Pathology

Hao Lu, Ziniu Qian, Yifu Li, Yang Zhou, Bingzheng Wei, Yan Xu
Published 2026-01-05 03:54

This paper presents a clinical diagnosis template-based pipeline designed to extract and structure pathological information from reports. The authors developed a Clinical Pathology Report Template (CPRT) to ensure standardized extraction of diagnostic elements, validated on TCGA-BRCA. They introduced CTIS-QA, a Slide-level Question Answering model that utilizes a dual-stream architecture to enhance diagnostic accuracy. Experimental results demonstrate that CTIS-QA outperforms existing models across various metrics.

In this paper, we introduce a clinical diagnosis template-based pipeline to systematically collect and structure pathological information. In collaboration with pathologists and guided by the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a Slide-level Question Answering model, featuring a dual-stream architecture that mimics pathologists' diagnostic approach. One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through an attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics. Code and data are available at https://github.com/HLSvois/CTIS-QA.
