Research Papers

ARXIV Cancer: general cancer Method: self-supervised learning

TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

Kohei Yamamoto, Tomohiro Kikuchi
Published 2026-01-01 08:27

This study introduces TotalFM, a radiological foundation model designed to efficiently learn the relationship between 3D-CT images and linguistic expressions through organ separation. Utilizing a large-scale dataset and advanced techniques such as segmentation and Large Language Model processing, the model balances computational efficiency with representation capability. The results indicate that TotalFM outperforms existing models in zero-shot lesion classification tasks, demonstrating its potential for practical applications in radiology.

While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
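The contrastive step that pairs organ volumes with finding sentences can be illustrated with a CLIP-style symmetric loss. This is a minimal sketch, not the TotalFM implementation: the abstract does not state the exact loss, so the symmetric InfoNCE form, the function names, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(volume_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of volume_embs matches pair i of text_embs.

    The temperature value is an assumed default, not taken from the paper.
    """
    v = [l2_normalize(e) for e in volume_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j]: scaled similarity between volume i and sentence j
    logits = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Volume-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-volume direction: same idea on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily, which is the signal that pulls each organ volume toward its own finding sentence.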

PUBMED Cancer: laryngeal cancer Method: unknown

Functional Voice Restoration After Laryngeal Transplantation: A Multidisciplinary Protocol and Longitudinal Outcomes.

Bin Zeng, Hailing Gu, Zheng Jiang, Mailudan Ainiwaer, Yitao Zheng, Jimin Yang, Jia Ren, Fei Chen
Published 2026-01-01 00:00

This study presents a protocolized framework for voice rehabilitation following laryngeal transplantation, addressing the lack of standardized approaches in this area. It details the experiences of four male patients, three of whom had laryngeal cancer, undergoing structured assessments and personalized rehabilitation. The results indicate significant improvements in vocal function over time, particularly with the implementation of neuromuscular reinnervation strategies. The findings aim to guide evidence-based rehabilitation practices for laryngeal transplantation.

Laryngeal transplantation offers the potential for patients to regain vocal function, yet standardised voice rehabilitation protocols are lacking. We share our team's experience with regular follow-up evaluation of voice function and address this gap by establishing a multidisciplinary pathway for functional recovery. Four male transplant recipients (3 laryngeal cancers, 1 hypopharyngeal cancer) underwent protocolized assessments at 1, 3, 6, and 8 months after surgery: subjective assessment (GRBAS scale) and objective evaluation (multiparametric acoustic analysis and electronic laryngoscopy). Personalized rehabilitation was delivered weekly by a licensed speech therapist. The protocol evolved over the series: Patients 1-2 received conventional training; Patients 3-4 received intensive neuromuscular reinnervation strategies. All four patients showed a gradual decrease in hoarseness, breathiness, and asthenia scores, and their overall condition improved. The maximum phonation time (MPT) was about 1.8 s at 1 month after surgery and kept increasing in all patients. The third patient, who performed best of the four, had an MPT of more than 10 s at 8 months after surgery. Laryngeal mucosal sensory function was gradually established starting 3 months after surgery, and compensatory vibration of the ventricular band appeared at 8 months with the assistance of voice training. This study, anchored to neuromuscular reinnervation milestones, demonstrates that standardised evaluations coupled with individualized training progressively restore vocal function. Our protocolized framework guides evidence-based rehabilitation for institutions pursuing laryngeal transplantation.

WHAT THIS PAPER ADDS

What is already known on this subject: Laryngeal transplantation surgically restores laryngeal anatomy but faces functional recovery challenges due to delayed neuromuscular reinnervation. Existing literature focuses predominantly on immunosuppression and graft viability, with sparse evidence guiding postoperative voice rehabilitation. Standardised protocols for phonatory recovery, routine in other neurogenic voice disorders (e.g., vocal fold paralysis), are absent. Fewer than 20 human cases have been reported globally, and only two publications detail voice outcomes. Consequently, rehabilitation strategies remain ad hoc, lacking consensus on intervention timing, exercise biomarkers, or psychological support frameworks.

What this study adds to existing knowledge: This study establishes the first protocolized voice rehabilitation framework for laryngeal transplantation, anchored to two neuromuscular milestones: pharyngeal reflex recovery (3 months), signalling sensory reinnervation, and ventricular band compensation (8 months), indicating motor adaptation. We demonstrate that early, structured rehabilitation (initiated at 1 month) enables significant voice restoration (MPT: 1.8 s → >10 s). Critically, we identify modular design principles accommodating clinical interruptions (e.g., ICU admissions) without compromising core outcomes. We anticipate these findings will guide evidence-based rehabilitation for institutions pursuing laryngeal transplantation and inform standardised pathways for complex laryngologic rehabilitation.

What are the potential or actual clinical implications of this work?
Rehabilitation Standardization: Provides evidence-based timelines (1/3/6/8-month assessments) and neuromuscular biomarkers to guide intervention intensity.
Broad Applicability: The protocol shows cross-utility for bilateral vocal fold paralysis and post-traumatic neurogenic dysphonia, leveraging shared reinnervation mechanisms.
Contingency Management: Modular training design maintains efficacy despite clinical interruptions (e.g., 40% cohort ICU/oncology transfers).
Technology Integration: Validates objective metrics (MPT, mucosal wave symmetry) as targets for future AI-assisted biofeedback tools.
Clinicians should prioritise early sensorimotor retraining (<3 months) while monitoring compensatory strategies (ventricular vibration) as functional proxies.

ARXIV Cancer: head and neck cancer Method: rank-based method

friends.test: rank-based method for feature selection in interaction matrices

Alexandra Suvorikova, Alexey Kroshnin, Dmirijs Lvovs, Vera Mukhina, Andrey Mironov, Elana J. Fertig, Ludmila Danilova, Alexander Favorov
Published 2025-12-31 13:03

This paper presents friends.test, a rank-based method designed to enhance feature selection in interaction matrices, particularly in the context of identifying specific interactions amidst background noise. The method utilizes model fitting to detect structural breaks in entity interactions, allowing for the integration of heterogeneous data sources. The effectiveness of friends.test is demonstrated using transnational data from head and neck cancer.

The analysis of the interaction matrix between two distinct sets is essential across diverse fields, from pharmacovigilance to transcriptomics. Not all interactions are equally informative: a marker gene associated with a few specific biological processes is more informative than a highly expressed non-specific gene associated with most observed processes. Identifying these interactions is challenging due to background connections. Furthermore, data heterogeneity across sources precludes universal identification criteria. To address this challenge, we introduce friends.test, a method for identifying specificity by detecting structural breaks in entity interactions. Rank-based representation of the interaction matrix ensures invariance to heterogeneous data and allows for integrating data from diverse sources. To automatically locate the boundary between specific interactions and background activity, we employ model fitting. We demonstrate the applicability of friends.test on the GSE112026 dataset, transnational data from head and neck cancer. A computationally efficient R implementation is available at https://github.com/favorov/friends.test.
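Why a rank-based representation is robust to heterogeneous scales can be shown with a toy example. The sketch below is not the friends.test algorithm (which is distributed as an R package and also fits a model to locate the break point); it only ranks entries within each column and checks that a monotone rescaling leaves the ranks unchanged. The function name and the toy matrix are hypothetical.

```python
import math

def column_ranks(matrix):
    """Rank entries within each column (1 = largest value).

    Ties are broken by row order, a simplification chosen for this sketch.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    ranks = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        order = sorted(range(n_rows), key=lambda i: -matrix[i][j])
        for rank, i in enumerate(order, start=1):
            ranks[i][j] = rank
    return ranks

# Toy interaction matrix: rows are entities of one set, columns of the other.
raw = [[0.1, 30.0],
       [5.0, 10.0],
       [2.0, 20.0]]

# The same data after a monotone transformation, as a differently scaled source might report it.
rescaled = [[math.log1p(x) for x in row] for row in raw]

raw_ranks = column_ranks(raw)
rescaled_ranks = column_ranks(rescaled)
```

Both matrices produce identical ranks, so any downstream criterion defined on ranks applies equally to sources that report values on different scales.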

ARXIV Cancer: unknown Method: vision transformer

VL-OrdinalFormer: Vision Language Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading

Zahid Ullah, Jihie Kim
Published 2025-12-31 03:01

This study introduces VL-OrdinalFormer, a vision-language-guided ordinal learning framework designed for the automated grading of knee osteoarthritis (KOA) using knee radiographs. The method integrates a ViT-L/16 backbone with CORAL-based ordinal regression and a CLIP-driven semantic alignment module, enhancing the model's ability to interpret subtle radiographic distinctions. Experimental results on the OAI kneeKL224 dataset demonstrate that VL-OrdinalFormer outperforms existing CNN and ViT baselines, particularly in accurately classifying early disease stages.

Knee osteoarthritis (KOA) is a leading cause of disability worldwide, and accurate severity assessment using the Kellgren-Lawrence (KL) grading system is critical for clinical decision-making. However, radiographic distinctions between early disease stages, particularly KL1 and KL2, are subtle and frequently lead to inter-observer variability among radiologists. To address these challenges, we propose VL-OrdinalFormer, a vision-language-guided ordinal learning framework for fully automated KOA grading from knee radiographs. The proposed method combines a ViT-L/16 backbone with CORAL-based ordinal regression and a Contrastive Language-Image Pretraining (CLIP)-driven semantic alignment module, allowing the model to incorporate clinically meaningful textual concepts related to joint space narrowing, osteophyte formation, and subchondral sclerosis. To improve robustness and mitigate overfitting, we employ stratified five-fold cross-validation, class-aware re-weighting to emphasize challenging intermediate grades, and test-time augmentation with global threshold optimization. Experiments conducted on the publicly available OAI kneeKL224 dataset demonstrate that VL-OrdinalFormer achieves state-of-the-art performance, outperforming CNN and ViT baselines in terms of macro F1 score and overall accuracy. Notably, the proposed framework yields substantial performance gains for KL1 and KL2 without compromising classification accuracy for mild or severe cases. In addition, interpretability analyses using Grad-CAM and CLIP similarity maps confirm that the model consistently attends to clinically relevant anatomical regions. These results highlight the potential of vision-language-aligned ordinal transformers as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.
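How CORAL-style ordinal regression turns a single score into a KL grade can be sketched as follows. CORAL uses one shared score plus K-1 ordered bias terms, each answering "is the grade greater than k?", and the predicted grade is the number of thresholds passed. The bias values, the default decision threshold of 0.5, and the function names below are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coral_decode(shared_score, biases, threshold=0.5):
    """Decode an ordinal grade from one shared score and K-1 bias terms.

    Each probability estimates P(grade > k). The grade is the count of
    probabilities above the decision threshold.
    """
    probs = [sigmoid(shared_score + b) for b in biases]
    grade = sum(p > threshold for p in probs)
    return grade, probs

# Four thresholds separate the five KL grades (0 to 4).
# Non-increasing biases keep the probabilities rank-consistent.
BIASES = [2.0, 0.5, -1.0, -2.5]  # hypothetical values

low_grade, _ = coral_decode(-3.0, BIASES)
mid_grade, mid_probs = coral_decode(0.0, BIASES)
high_grade, _ = coral_decode(3.0, BIASES)
```

Because every threshold shares the same score, the probabilities fall monotonically across grades, which avoids contradictory outputs such as "greater than KL2 but not greater than KL1". The `threshold` argument is where a globally tuned decision threshold would plug in.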

ARXIV Cancer: unknown Method: convolutional neural network

Deep Learning Approach for the Diagnosis of Pediatric Pneumonia Using Chest X-ray Imaging

Fatemeh Hosseinabadi, Mohammad Mojtaba Rohani
Published 2025-12-31 00:07

This study explores the use of convolutional neural networks (CNNs) for the automated classification of pediatric chest X-ray images to diagnose pneumonia. Three CNN architectures—ResNetRS, RegNet, and EfficientNetV2—were evaluated using transfer learning on a curated dataset of 1,000 images. RegNet demonstrated the highest classification performance with an accuracy of 92.4%.

Pediatric pneumonia remains a leading cause of morbidity and mortality in children worldwide. Timely and accurate diagnosis is critical but often challenged by limited radiological expertise and the physiological and procedural complexity of pediatric imaging. This study investigates the performance of three state-of-the-art convolutional neural network (CNN) architectures, ResNetRS, RegNet, and EfficientNetV2, using transfer learning for the automated classification of pediatric chest X-ray images as either pneumonia or normal. A curated subset of 1,000 chest X-ray images was extracted from a publicly available dataset originally comprising 5,856 pediatric images. All images were preprocessed and labeled for binary classification. Each model was fine-tuned using pretrained ImageNet weights and evaluated based on accuracy and sensitivity. RegNet achieved the highest classification performance with an accuracy of 92.4% and a sensitivity of 90.1%, followed by ResNetRS (accuracy: 91.9%, sensitivity: 89.3%) and EfficientNetV2 (accuracy: 88.5%, sensitivity: 88.1%).
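The two reported metrics follow directly from the binary confusion matrix. The sketch below computes both from label lists; the function name and the toy labels are hypothetical and do not reproduce the study's data.

```python
def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for binary labels.

    Sensitivity is the true positive rate: TP / (TP + FN).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity

# Toy example: 1 = pneumonia, 0 = normal.
acc, sens = accuracy_and_sensitivity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Sensitivity matters here because a missed pneumonia case (a false negative) is usually the costlier error in a screening setting, and accuracy alone does not expose it.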

ARXIV Cancer: unknown Method: YOLOv5 and YOLOv8

Using Large Language Models To Translate Machine Results To Human Results

Trishna Niraula, Jonathan Stubblefield
Published 2025-12-30 23:32

This study presents a novel pipeline that combines YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The integration aims to improve the translation of structured AI predictions into comprehensive diagnostic narratives. Results indicate that the AI-generated reports exhibit strong semantic similarity to human-authored reports, although there are stylistic differences.

Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
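Cosine similarity between a generated report and a reference report can be sketched with simple word-count vectors. The abstract does not say how the texts were embedded, so the bag-of-words representation here is a stand-in assumption; a sentence-embedding model would replace it in practice. The function name and example sentences are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using bag-of-words counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)  # missing words count as 0
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("no focal consolidation", "no focal consolidation")
partial = cosine_similarity("no focal consolidation", "focal consolidation present")
disjoint = cosine_similarity("clear lungs", "enlarged heart")
```

A score near 1 indicates heavy overlap in content, but the measure is blind to word order and tone, which is consistent with the finding that high similarity can coexist with lower ratings for natural writing flow.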

ARXIV Cancer: lung cancer Method: deep learning

Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Md. Enamul Hoq, Linda Larson-Prior, Fred Prior
Published 2025-12-30 15:34

This study presents Virtual-Eyes, a quality-control pipeline designed for low-dose CT lung cancer screening. The pipeline enhances the performance of generalist foundation models in cancer risk prediction by enforcing strict imaging standards and preprocessing techniques. Results indicate that Virtual-Eyes significantly improves the predictive accuracy of the RAD-DINO model while negatively impacting specialist models, highlighting the importance of tailored preprocessing in AI workflows.

Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
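The difference between mean and max pooling, and the Brier score used for calibration, can be sketched in a few lines. This assumes pooling is applied element-wise to slice-level embedding vectors before the patient-level head; the abstract does not spell out that detail, and the function names and toy values are hypothetical.

```python
def pool_embeddings(slice_embeddings, how="mean"):
    """Collapse per-slice embedding vectors into one patient-level vector."""
    columns = list(zip(*slice_embeddings))  # one tuple per embedding dimension
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {how}")

def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)

slices = [[1.0, 4.0], [3.0, 2.0]]          # two slices, two embedding dimensions
mean_vec = pool_embeddings(slices, "mean")
max_vec = pool_embeddings(slices, "max")
score = brier_score([0.9, 0.2], [1, 0])    # lower is better calibrated
```

Max pooling keeps the strongest activation per dimension, so a single suspicious slice can dominate the patient-level vector, whereas mean pooling dilutes it across the series. That may be one reason the two pooling modes respond differently to cleaner inputs.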

ARXIV Cancer: general cancer Method: generative framework

One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Zhihua Wang, Fei Wu, Quanlin Li, Pinghong Zhou, Shuo Wang, Xian Yang
Published 2025-12-30 15:07

The study introduces EndoRare, a generative framework designed to synthesize high-fidelity exemplars of rare gastrointestinal lesions from a single reference image. By employing language-guided concept disentanglement, the method enhances the training of AI classifiers and improves diagnostic accuracy for novice clinicians. Validation across four rare pathologies showed significant improvements in recall and precision when using synthetic images for data augmentation.

Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

ARXIV Cancer: brain tumor Method: meta-guided multi-modal learning

MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou, Bo Liu, Cun-Jing Zheng, Yuan-ming Geng, Siyue Li, Qiankun Zuo, Shuihua Wang, Yudong Zhang, Jin Hong
Published 2025-12-30 01:37

This paper presents a novel meta-guided multi-modal learning (MGML) framework aimed at improving brain tumor segmentation using incomplete multimodal MRI data. The framework includes a meta-parameterized adaptive modality fusion component and a consistency regularization module to enhance segmentation performance. Experimental results on the BraTS2020 and BraTS2023 datasets demonstrate that the proposed method outperforms several state-of-the-art techniques, achieving high Dice scores for various tumor types.

Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
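The "fifteen missing modality combinations" are the non-empty subsets of the four BraTS MRI sequences (2^4 - 1 = 15), and the Dice score measures overlap between predicted and reference masks. The sketch below enumerates the subsets and computes Dice on flattened binary masks; the function names are hypothetical and the code is independent of the MGML repository.

```python
from itertools import combinations

MODALITIES = ("T1", "T1ce", "T2", "FLAIR")

def available_modality_subsets(modalities=MODALITIES):
    """Every non-empty subset of modalities that could be present at test time."""
    return [subset
            for size in range(1, len(modalities) + 1)
            for subset in combinations(modalities, size)]

def dice_score(pred, target):
    """Dice coefficient for flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 2 * intersection / total if total else 1.0

subsets = available_modality_subsets()
overlap = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Averaging Dice over all fifteen subsets, as the abstract reports for WT, TC, and ET, rewards models that degrade gracefully rather than ones tuned only for the full four-sequence input. The abstract states Dice on a 0 to 100 scale, while this function returns a value between 0 and 1.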

ARXIV Cancer: pancreatic neoplasm Method: Vision Transformer

Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi
Published 2025-12-29 16:51

This study presents a Scalable Residual Feature Aggregation (SRFA) framework aimed at improving the early detection of pancreatic neoplasms using multimodal CT imaging. The framework employs a combination of preprocessing, segmentation with MAGRes-UNet, and feature extraction using DenseNet-121, enhanced by a hybrid metaheuristic optimization strategy. Experimental results demonstrate significant performance improvements, achieving 96.23% accuracy and outperforming traditional CNNs and contemporary transformer-based models.

Early detection of pancreatic neoplasms is a major clinical challenge, mainly because tumors tend to present with low-contrast margins and because pancreatic anatomy varies widely between patients on CT. Addressing these difficulties requires an effective, scalable system that enhances the salience of subtle visual cues and generalizes well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework combines a preprocessing pipeline with segmentation by MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. Features are extracted with DenseNet-121 using residual feature storage, so that deep hierarchical features can be aggregated without loss of information. A hybrid HHO-BA metaheuristic feature selection strategy then refines the features to the best-performing subset. For classification, the system uses a new hybrid model that combines the global attention of the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO tunes the hyperparameters to improve robustness and reduce overfitting. Experimental results show a significant improvement in performance: the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful tool for the early detection of pancreatic tumors.
