Publications
2025
Pushing the boundaries of radiotherapy-immunotherapy combinations: highlights from the 7th immunorad conference.
Laurent, Pierre-Antoine, Fabrice André, Alexandre Bobard, Desiree Deandreis, Sandra Demaria, Stephane Depil, Stefan B. Eichmüller et al. (2025)
Over the last decade, the annual Immunorad Conference, held under the joint auspices of Gustave Roussy (Villejuif, France) and the Weill Cornell Medical College (New York, USA), has aimed at exploring the latest advancements in the fields of tumor immunology and radiotherapy-immunotherapy combinations for the treatment of cancer. Gathering medical oncologists, radiation oncologists, physicians and researchers with esteemed expertise in these fields, the Immunorad Conference bridges the gap between preclinical outcomes and clinical opportunities. Thus, it paves a promising way toward optimizing radiotherapy-immunotherapy combinations and, from a broader perspective, improving therapeutic strategies for patients with cancer. Herein, we report on the topics developed by key-opinion leaders during the 7th Immunorad Conference held in Paris-Les Cordeliers (France) from September 27th to 29th 2023, and set the stage for the 8th edition of Immunorad which will be held at Weill Cornell Medical College (New York, USA) in October 2024.
Laurent, Pierre-Antoine, Fabrice André, Alexandre Bobard, Desiree Deandreis, Sandra Demaria, Stephane Depil, Stefan B. Eichmüller et al. Pushing the boundaries of radiotherapy-immunotherapy combinations: highlights from the 7th immunorad conference. OncoImmunology 14, no. 1 (2025): 2432726. doi: https://doi.org/10.1080/2162402X.2024.2432726
FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare
Lekadir, Karim; Feragen, Aasa; Fofanah, Abdul Joseph; Frangi, Alejandro F; Zuluaga, Maria A.; et al.
Despite major advances in artificial intelligence (AI) research for healthcare, the deployment and adoption of AI technologies remain limited in clinical practice. This paper describes the FUTURE-AI framework, which provides guidance for the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI Consortium was founded in 2021 and comprises 117 interdisciplinary experts from 50 countries representing all continents, including AI scientists, clinical researchers, biomedical ethicists, and social scientists. Over a two-year period, the FUTURE-AI guideline was established through consensus based on six guiding principles: fairness, universality, traceability, usability, robustness, and explainability. To operationalise trustworthy AI in healthcare, a set of 30 best practices was defined, addressing technical, clinical, socioethical, and legal dimensions. The recommendations cover the entire lifecycle of healthcare AI, from design, development, and validation to regulation, deployment, and monitoring.
Summary points
- Despite major advances in medical artificial intelligence (AI) research, clinical adoption of emerging AI solutions remains challenging owing to limited trust and ethical concerns
- The FUTURE-AI Consortium unites 117 experts from 50 countries to define international guidelines for trustworthy healthcare AI
- The FUTURE-AI framework is structured around six guiding principles: fairness, universality, traceability, usability, robustness, and explainability
- The guideline addresses the entire AI lifecycle, from design and development to validation and deployment, ensuring alignment with real-world needs and ethical requirements
- The framework includes 30 detailed recommendations for building trustworthy and deployable AI systems, emphasising multistakeholder collaboration
- Continuous risk assessment and mitigation are fundamental, addressing biases, data variations, and evolving challenges during the AI lifecycle
- FUTURE-AI is designed as a dynamic framework, which will evolve with technological advancements and stakeholder feedback
Lekadir, Karim; Feragen, Aasa; Fofanah, Abdul Joseph; Frangi, Alejandro F; Zuluaga, Maria A.; et al. FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 2025;388:r340. doi: https://doi.org/10.1136/bmj.r340 (Published 17 February 2025)
Generating Explanations in Medical Question-Answering by Expectation Maximization Inference over Evidence
Sun, Wei; Li, Mingxiao; Sileo, Damien; Davis, Jesse J.; Moens, Marie-Francine
Medical Question Answering (medical QA) systems play an essential role in assisting healthcare workers in finding answers to their questions. However, merely providing answers is not sufficient, because users might want explanations, that is, more analytic statements in natural language that describe the elements and context that support the answer. To this end, we propose a novel approach for generating natural language explanations for answers predicted by medical QA systems. Because high-quality medical explanations require additional medical knowledge, our system extracts knowledge from medical textbooks to enhance the quality of explanations during the explanation generation process. Concretely, we designed an Expectation-Maximization approach that makes inferences about the evidence found in these texts, offering an efficient way to focus attention on lengthy evidence passages. Experimental results on two datasets, MQAE-diag and MQAE, demonstrate the effectiveness of our framework for reasoning with textual evidence. Our approach outperforms state-of-the-art models, achieving a significant improvement of 6.13 and 5.47 percentage points on the Rouge-L score and 6.49 and 5.28 percentage points on the Bleu-4 score on the MQAE-diag and MQAE datasets, respectively.
Wei Sun, Mingxiao Li, Damien Sileo, Jesse Davis, and Marie-Francine Moens. 2025. Generating Explanations in Medical Question-Answering by Expectation Maximization Inference over Evidence. ACM Trans. Comput. Healthcare 6, 2, Article 23 (April 2025), 23 pages. https://doi.org/10.1145/3712296
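For readers unfamiliar with the Rouge-L metric reported in the abstract above, the following is a minimal sketch of how a sentence-level Rouge-L score can be computed from the longest common subsequence (LCS) of two token sequences. It assumes whitespace tokenization, lowercasing, and a balanced F1 combination of precision and recall; the function names are illustrative, and the paper's own evaluation tooling may differ in all of these choices.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced-F1 variant of Rouge-L (assumption: simple whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A generated explanation that shares a long in-order word sequence with the reference explanation scores close to 1, while one with no words in common scores 0.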
Radiomics in Dermatological Optical Coherence Tomography (OCT): Feature Repeatability, Reproducibility, and Integration into Diagnostic Models in a Prospective Study
Widaatalla, Yousif; Wolswijk, Tom; Khan, Muhammad Danial; Halilaj, Iva; Mosterd, Klara; Woodruff, Henry C.; Lambin, Philippe
Objectives: Radiomics has seen substantial growth in medical imaging; however, its potential in optical coherence tomography (OCT) has not been widely explored. We systematically evaluate the repeatability and reproducibility of handcrafted radiomics features (HRFs) from OCT scans of benign nevi and examine the impact of bin width (BW) selection on HRF stability. The effect of using stable features on a radiomics classification model was also assessed. Methods: In this prospective study, 20 volunteers underwent test–retest OCT imaging of 40 benign nevi, resulting in 80 scans. The repeatability and reproducibility of HRFs extracted from manually delineated regions of interest (ROIs) were assessed using concordance correlation coefficients (CCCs) across BWs ranging from 5 to 50. A unique set of stable HRFs was identified at each BW after removing highly correlated features to eliminate redundancy. These robust features were incorporated into a multiclass radiomics classifier trained to distinguish benign nevi, basal cell carcinoma (BCC), and Bowen’s disease. Results: Six stable HRFs were identified across all BWs, with a BW of 25 emerging as the optimal choice, balancing repeatability and the ability to capture meaningful textural details. Additionally, intermediate BWs (20–25) yielded 53 reproducible features. A classifier trained with six stable features achieved a 90% accuracy and AUCs of 0.96 and 0.94 for BCC and Bowen’s disease, respectively, compared to a 76% accuracy and AUCs of 0.86 and 0.80 for a conventional feature selection approach. Conclusions: This study highlights the critical role of BW selection in enhancing HRF stability and provides a methodological framework for optimizing preprocessing in OCT radiomics. By demonstrating the integration of stable HRFs into diagnostic models, we establish OCT radiomics as a promising tool to aid non-invasive diagnosis in dermatology.
Widaatalla, Y.; Wolswijk, T.; Khan, M.D.; Halilaj, I.; Mosterd, K.; Woodruff, H.C.; Lambin, P. Radiomics in Dermatological Optical Coherence Tomography (OCT): Feature Repeatability, Reproducibility, and Integration into Diagnostic Models in a Prospective Study. Cancers 2025, 17, 768. https://doi.org/10.3390/cancers17050768
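The concordance correlation coefficient (CCC) used above to assess test–retest feature stability can be sketched as follows. This is a minimal illustration of Lin's CCC using population (1/n) variances; the abstract does not state which estimator or stability threshold the authors used, so the function name and the example values below are assumptions for illustration only.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between paired measurements.

    Unlike Pearson correlation, CCC also penalizes a systematic shift
    between the two series (the squared difference of the means).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov_xy / denom


# Hypothetical feature values from a test scan and a retest scan.
test_scan = [1.0, 2.0, 3.0, 4.0]
retest_scan = [2.0, 3.0, 4.0, 5.0]
ccc = concordance_correlation(test_scan, retest_scan)
```

In this example the two series are perfectly correlated but offset by a constant, so the CCC falls below 1, which is the behaviour that makes it suitable for repeatability analysis.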
Harmonizing CT scanner acquisition variability in an anthropomorphic phantom: A comparative study of image-level and feature-level harmonization using GAN, ComBat, and their combination
Mali, Shruti Atul, Nastaran Mohammadian Rad, Henry C. Woodruff, Adrien Depeursinge, Vincent Andrearczyk, and Philippe Lambin
Purpose
Radiomics allows for the quantification of medical images and facilitates precision medicine. Many radiomic features derived from computed tomography (CT) are sensitive to variations across scanners, reconstruction settings, and acquisition protocols. In this phantom study, eight different CT reconstruction parameters were varied to explore image- and feature-level harmonization approaches to improve tissue classification.
Methods
Varying reconstructions of an anthropomorphic radiopaque phantom containing three lesion categories (metastasis, hemangioma, and benign cyst) and normal liver tissue were used for evaluating two harmonization methods and their combination: (i) generative adversarial networks (GANs) at the image level; (ii) ComBat at the feature level, and (iii) a combination of (i) and (ii). A total of 93 texture and intensity features were extracted from each tissue class before and after image-level harmonization and were also harmonized at the feature level. Reproducibility and stability were assessed via the Concordance Correlation Coefficient (CCC) and pairwise comparisons using paired stability tests. The ability of features to discriminate between tissue classes was assessed by measuring the area under the receiver operating characteristic curve. The global reproducibility and discriminative power were assessed by averaging over the entire dataset and across all tissue types.
Results
ComBat improved reproducibility by 31.58% and stability by 5.24%, whereas GAN increased reproducibility by 8% but reduced stability by 4.33%. Classification analysis revealed that ComBat increased average AUC by 15.19%, whereas GAN decreased AUC by 2.56%.
Conclusion
While GAN qualitatively enhances image harmonization, ComBat provides superior statistical improvements in feature stability and classification performance, highlighting the importance of robust feature-level harmonization in radiomics.
Mali SA, Rad NM, Woodruff HC, Depeursinge A, Andrearczyk V, et al. (2025) Harmonizing CT scanner acquisition variability in an anthropomorphic phantom: A comparative study of image-level and feature-level harmonization using GAN, ComBat, and their combination. PLOS ONE 20(5): e0322365. https://doi.org/10.1371/journal.pone.0322365
Impact of synthetic data on training a deep learning model for lesion detection and classification in contrast-enhanced mammography
Van Camp, Astrid, Henry C. Woodruff, Lesley Cockmartin, Marc Lobbes, Michael Majer, Corinne Balleyguier, Nicholas W. Marshall, Hilde Bosmans, and Philippe Lambin
Purpose: Predictive models for contrast-enhanced mammography often perform better at detecting and classifying enhancing masses than (non-enhancing) microcalcification clusters. We aim to investigate whether incorporating synthetic data with simulated microcalcification clusters during training can enhance model performance.
Approach: Microcalcification clusters were simulated in low-energy images of lesion-free breasts from 782 patients, considering local texture features. Enhancement was simulated in the corresponding recombined images. A deep learning (DL) model for lesion detection and classification was trained with varying ratios of synthetic and real (850 patients) data. In addition, a handcrafted radiomics classifier was trained using delineations and class labels from real data, and predictions from both models were ensembled. Validation was performed on internal (212 patients) and external (279 patients) real datasets.
Results: The DL model trained exclusively with synthetic data detected over 60% of malignant lesions. Adding synthetic data to smaller real training sets improved detection sensitivity for malignant lesions but decreased precision. Performance plateaued at a detection sensitivity of 0.80. The ensembled DL and radiomics models performed worse than the standalone DL model, decreasing the area under the receiver operating characteristic curve from 0.75 to 0.60 on the external validation set, likely due to falsely detected suspicious regions of interest.
Conclusions: Synthetic data can enhance DL model performance, provided model setup and data distribution are optimized. The possibility to detect malignant lesions without real data present in the training set confirms the utility of synthetic data. It can serve as a helpful tool, especially when real data are scarce, and it is most effective when complementing real data.
Van Camp A, Woodruff HC, Cockmartin L, Lobbes M, Majer M, Balleyguier C, Marshall NW, Bosmans H, Lambin P. Impact of synthetic data on training a deep learning model for lesion detection and classification in contrast-enhanced mammography. J Med Imaging (Bellingham). 2025 Nov;12(Suppl 2):S22006. doi: 10.1117/1.JMI.12.S2.S22006. Epub 2025 Apr 28. PMID: 40302983; PMCID: PMC12036226.
Assessing Data Quality in Heterogeneous Healthcare Integration: The AIDAVA Framework
Jens Declerck; Ömer Durukan Kiliç; Ensar Emir Erol; Shervin Mehryar; Dipak Kalra; Isabelle de Zegher; Remzi Celebi
Background:
Integrated health data is foundational for secondary use, research, and policy making. However, data quality issues – such as missing values and inconsistencies – are common due to the heterogeneity of health data sources. Existing frameworks often apply static, one-time assessments, limiting their ability to address quality problems across evolving data pipelines.
Objective:
This study evaluates the AIDAVA data quality framework, which introduces dynamic, lifecycle-based validation of health data using knowledge graph technologies and SHACL-based rules. The framework is assessed for its ability to detect and manage data quality issues – specifically, completeness and consistency – during integration.
Methods:
Using the MIMIC-III dataset, we simulated real-world data quality challenges by introducing structured noise, including missing values and logical inconsistencies. The data was transformed into Source Knowledge Graphs (SKGs) and integrated into a unified Personal Health Knowledge Graph (PHKG). SHACL validation rules were applied iteratively during the integration process, and data quality was assessed under varying noise levels and integration orders.
Results:
The AIDAVA framework effectively detected completeness and consistency issues across all scenarios. Completeness was shown to influence the interpretability of consistency scores, and domain-specific attributes (e.g., diagnoses, procedures) were more sensitive to integration order and data gaps.
Conclusions:
AIDAVA supports dynamic, rule-based validation throughout the data lifecycle. By addressing both dimension-specific vulnerabilities and cross-dimensional effects, it lays the groundwork for scalable, high-quality health data integration. Future work should explore deployment in live clinical settings and expand to additional quality dimensions.
Declerck, Jens, Ömer Durukan Kiliç, Ensar Emir Erol, Shervin Mehryar, Dipak Kalra, Isabelle de Zegher, and Remzi Celebi. Assessing Data Quality in Heterogeneous Health Care Integration: Simulation Study of the AIDAVA Framework. JMIR Medical Informatics 13 (2025): e75275. doi: https://doi.org/10.2196/75275.
An AI-powered data curation and publishing virtual assistant: usability and explainability/causability of, and patient interest in the first-generation prototype
van Mierlo, Rutger; Liang, Wenjie; Norak, Kerli; Kargl, Michaela; Maasik, Mall; Bynens, Anne-Lore; Plass, Markus; Kreuzthaler, Markus; Benedikt, Martin; Hochstenbach, Laura; van 't Hof, Arnoud; Celebi, Remzi; Dekker, Andre; de Zegher, Isabelle; Kalendralis, Petros
Introduction: Ensuring high quality and reusability of personal health data is costly and time-consuming. An AI-powered virtual assistant for health data curation and publishing could support patients to ensure harmonization and data quality enhancement, which improves interoperability and reusability. This formative evaluation study aimed to assess the usability of the first-generation (G1) prototype developed during the AI-powered data curation and publishing virtual assistant (AIDAVA) Horizon Europe project.
Methods: In this formative evaluation study, we planned to recruit 45 patients with breast cancer and 45 patients with cardiovascular disease from three European countries. An intuitive front-end, supported by AI and non-AI data curation tools, is being developed across two generations. G1 was based on existing curation tools and early prototypes of tools being developed. Patients were tasked with ingesting and curating their personal health data, creating a personal health knowledge graph that represented their integrated, high-quality medical records. Usability of G1 was assessed using the system usability scale. The subjective importance of the explainability/causability of G1, the perceived fulfillment of these needs by G1, and interest in AIDAVA-like technology were explored using study-specific questionnaires.
Results: A total of 83 patients were recruited; 70 patients completed the study, of whom 19 were unable to successfully curate their health data due to configuration issues when deploying the curation tools. Patients rated G1 as marginally acceptable on the system usability scale (59.1 ± 19.7/100) and moderately positive for explainability/causability (3.3–3.8/5), and were moderately positive to positive regarding their interest in AIDAVA-like technology (3.4–4.4/5).
Discussion: Despite its marginal acceptability, G1 shows potential in automating data curation into a personal health knowledge graph, but it has not reached full maturity yet. G1 deployed very early prototypes of tools planned for the second-generation (G2) prototype, which may have contributed to the lower usability and explainability/causability scores. Conversely, patient interest in AIDAVA-like technology seems quite high at this stage of development, likely due to the promising potential of data curation and data publication technology. Improvements in the library of data curation and publishing tools are planned for G2 and are necessary to fully realize the value of the AIDAVA solution.
van Mierlo, Rutger, Wenjie Liang, Kerli Norak, Michaela Kargl, Mall Maasik, Anne-Lore Bynens, Markus Plass, Markus Kreuzthaler, Martin Benedikt, Laura Hochstenbach, Arnoud van 't Hof, Remzi Celebi, Andre Dekker, Isabelle de Zegher, Petros Kalendralis, and the AIDAVA consortium. An AI-powered data curation and publishing virtual assistant: usability and explainability/causability of, and patient interest in the first-generation prototype. Frontiers in Digital Health 7 (2025): 1629413. doi: https://doi.org/10.3389/fdgth.2025.1629413.
Journey of a Data Element to Data Interoperability and Reuse
Isabelle de Zegher, Remzi Celebi
Poor health data interoperability has been an issue for more than 25 years. With the emergence of data-hungry AI solutions, the need for interoperable and reusable health data is growing. This paper presents a prototype virtual assistant, developed and tested as part of the AIDAVA Horizon Europe project, that creates an interoperable and reusable personal health knowledge graph, compliant with a semantic standard, by maximising automation in the curation of multimodal heterogeneous health data. The paper describes the different transformation steps of a data element from collection to its reusable form, with associated issues. We conclude that the lack of health data interoperability could be resolved by addressing these issues in a structural way through a validated roadmap.
de Zegher, Isabelle, and Remzi Celebi. Journey of a Data Element to Data Interoperability and Reuse. Studies in Health Technology and Informatics 327 (2025): 975–980. doi: https://doi.org/10.3233/SHTI250517.
Radiomics Quality Score 2.0: towards radiomics readiness levels and clinical translation for personalized medicine
Philippe Lambin, Henry C. Woodruff, Shruti Atul Mali, Xian Zhong, Sheng Kuang, Elizaveta Lavrova, Hamza Khan, Karim Lekadir, Alex Zwanenburg, Joseph Deasy, Maciej Bobowicz, Luis Marti-Bonmati, Andrew Maidment, Michel Dumontier, Paul E. Kinahan, J. Martijn Nobel, Sina Amirrajab & Zohaib Salahuddin
Radiomics is a tool for medical imaging analysis that could have a relevant role in precision oncology by offering precise quantitative support for clinical decision-making. The Radiomics Quality Score (RQS) is a tool developed to assess the rigour of radiomics studies that has now been widely adopted by researchers. Although RQS version 1.0 established a benchmark, an updated framework is required to account for evolving knowledge and ensure optimal evaluation of the quality of radiomics studies through the inclusion of fairness, explainability, rigorous quality control and harmonization. In this Review, we introduce the updated RQS 2.0, which maintains the scientific rigour of its predecessor and addresses these contemporary needs, and therefore could potentially accelerate clinical translation. Moreover, we introduce the radiomics readiness levels, inspired by the technology readiness level framework, which are integrated in RQS 2.0 and reflect nine distinct levels of incremental improvement in radiomics research with the ultimate aim of clinical implementation. We also detail anticipated future directions in radiomics, outlining a strategic vision to advance precision oncology, which is the ultimate aim of RQS 2.0.
Lambin, Philippe, Henry C. Woodruff, Shruti Atul Mali, Xian Zhong, Sheng Kuang, Elizaveta Lavrova, Hamza Khan, Karim Lekadir, Alex Zwanenburg, Joseph Deasy, Maciej Bobowicz, Luis Marti-Bonmati, Andrew Maidment, Michel Dumontier, Paul E. Kinahan, J. Martijn Nobel, Sina Amirrajab, and Zohaib Salahuddin. Radiomics Quality Score 2.0: towards radiomics readiness levels and clinical translation for personalized medicine. Nature Reviews Clinical Oncology 22 (2025): 831–846. doi: https://doi.org/10.1038/s41571-025-01067-1.
Improving Patient Engagement in Phase 2 Clinical Trials With a Trial-Specific Patient Decision Aid: Development and Usability Study
Halilaj, Iva, Relinde Lieverse, Cary Oberije, Lizza Hendriks, Charlotte Billiet, Ines Joye, Brice Van Eeckhout, Anke Wind, Anshu Ankolekar, and Philippe Lambin
Background: Making informed decisions about clinical trial participation can be overwhelming for patients due to the complexity of trial information, potential risks and benefits, and the emotional burden of a recent diagnosis. Patient decision aids (PDAs) simplify this process by providing clear information on treatment options, empowering patients to actively participate in shared decision-making with their doctors. While PDAs have shown promise in various health care contexts, their use in clinical trials, particularly in the form of trial-specific patient decision aids (tPDAs), remains underused.
Objective: This study aims to address the challenge of patient comprehension of traditional clinical trial materials. We developed a freely accessible, user-friendly tPDA within the context of the ImmunoSABR phase 2 trial. The tPDA aimed to enhance informed decision-making regarding trial participation. The primary endpoint was usability, quantitatively measured by the System Usability Scale (SUS). Secondary endpoints included time spent on the tPDA, patient satisfaction ratings, and participants’ self-reported level of understanding of the trial.
Methods: We developed the tPDA following the International Patient Decision Aid Standards and validated it through a structured, 3-phase iterative evaluation process. An initial evaluation was performed with 17 computer scientists who had expertise in biomedical applications, ensuring technical robustness. The content and usability were further refined through evaluations involving 10 clinicians and 8 medical students, focusing on clinical accuracy and user-friendliness. Finally, the tool was tested by 6 patients eligible for the ImmunoSABR trial to assess real-world applicability and patient-centered design.
Results: Evaluations demonstrated the tPDA’s effectiveness in enhancing informed decision-making, directly addressing our primary endpoint of usability with an overall mean SUS score of 79.4 (SD 15.9), indicative of good usability. Addressing our secondary endpoints, patients completed the tPDA efficiently, with the majority (4/6) finishing in under 30 minutes, and all but 1 within 60 minutes. Qualitative feedback highlighted significant improvements in patients’ understanding of the trial details, reinforcing the tPDA’s role in facilitating better patient engagement and comprehension.
Conclusions: Our study demonstrates the feasibility and potential of tPDAs to enhance patient comprehension and engagement in clinical trials. Integrating tPDAs offers a valuable addition to traditional paper-based and verbal communication methods, promoting informed decision-making and patient-centered care.
Halilaj, Iva, Relinde Lieverse, Cary Oberije, Lizza Hendriks, Charlotte Billiet, Ines Joye, Brice Van Eeckhout, Anke Wind, Anshu Ankolekar, and Philippe Lambin. Improving Patient Engagement in Phase 2 Clinical Trials With a Trial-Specific Patient Decision Aid: Development and Usability Study. Journal of Medical Internet Research 27 (2025): e71817. doi: https://doi.org/10.2196/71817.
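The System Usability Scale (SUS) scores reported in the usability studies above follow a fixed scoring rule, which the sketch below illustrates. It assumes the standard 10-item SUS questionnaire with responses on a 1 to 5 scale, where odd-numbered items are positively worded and even-numbered items are negatively worded; the function name and sample responses are hypothetical and are not taken from either study.

```python
def sus_score(responses):
    """Convert ten 1-5 SUS item responses into a 0-100 usability score."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            # Odd items are positively worded: higher agreement is better.
            total += response - 1
        else:
            # Even items are negatively worded: lower agreement is better.
            total += 5 - response
    # Each item contributes 0-4 points; scale the 0-40 sum to 0-100.
    return total * 2.5


# Hypothetical responses from a single participant.
participant = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]
score = sus_score(participant)
```

A study-level result such as a mean SUS score with a standard deviation is then obtained by averaging these per-participant scores.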
2024
An automated toolbox for microcalcification cluster modeling for mammographic imaging
Astrid Van Camp, Eva Punter, Katrien Houbrechts, Lesley Cockmartin, Renate Prevos, Nicholas W. Marshall, Henry C. Woodruff, Philippe Lambin, Hilde Bosmans (2024)
Background
Mammographic imaging is essential for breast cancer detection and diagnosis. In addition to masses, calcifications are of concern and the early detection of breast cancer also heavily relies on the correct interpretation of suspicious microcalcification clusters. Even with advances in imaging and the introduction of novel techniques such as digital breast tomosynthesis and contrast-enhanced mammography, a correct interpretation can still be challenging given the subtle nature and large variety of calcifications.
Purpose
Computer simulated lesion models can serve to develop, optimize, or improve imaging techniques. In addition to their use in comparative (virtual clinical trial) detection experiments, these models have potential application in training deep learning models and in the understanding and interpretation of breast lesions. Existing simulation methods, however, often lack the capacity to model the diversity occurring in breast lesions or to generate models relevant for a specific case. This study focuses on clusters of microcalcifications and introduces an automated, flexible toolbox designed to generate microcalcification cluster models customized to specific tasks.
Methods
The toolbox allows users to control a large number of simulation parameters related to model characteristics such as lesion size, calcification shape, or number of microcalcifications per cluster. This leads to the capability of creating models that range from regular to complex clusters. Based on the input parameters, which are either tuned manually or pre-set for a specific clinical type, different sets of models can be simulated depending on the use case. Two lesion generation methods are described. The first method generates three-dimensional microcalcification cluster models based on geometrical shapes and transformations. The second method creates two-dimensional (2D) microcalcification cluster models for a specific 2D mammographic image. This novel method employs radiomics analysis to account for local textures, ensuring the simulated microcalcification cluster is appropriately integrated within the existing breast tissue. The toolbox is implemented in the Python language and can be conveniently run through a Jupyter Notebook interface, openly accessible at https://gitlab.kuleuven.be/medphysqa/deploy/breast-calcifications. Validation studies performed by radiologists assessed the level of malignancy and realism of clusters tuned with specific parameters and inserted in mammographic images.
Results
The flexibility of the toolbox with multiple simulation methods is illustrated, as well as the compatibility with different simulation frameworks and image types. The automation allows for the straightforward and fast generation of diverse microcalcification cluster models. The generated models are most likely applicable for various tasks as they can be configured in a variety of ways and inserted in different types of mammographic images of multiple acquisition systems. Validation studies confirmed the capacity to simulate realistic clusters and capture clinical properties when tuned with appropriate parameter settings.
Conclusion
This simulation toolbox offers a flexible means of simulating microcalcification cluster models with potential use in both technical and clinical research in mammography imaging. The 3D generation methods allow for specifying many characteristics regarding the calcification shape and cluster architecture, and the 2D generation method presents a novel manner to create microcalcification clusters tailored to existing breast textures.
Van Camp A, Punter E, Houbrechts K, et al. An automated toolbox for microcalcification cluster modeling for mammographic imaging. Med Phys. 2024;1-15. https://doi.org/10.1002/mp.17521
Artificial intelligence based data curation: enabling a patient-centric European health data space
de Zegher I, Norak K, Steiger D, Müller H, Kalra D, Scheenstra B, Cina I, Schulz S, Uma K, Kalendralis P, Lotman E-M, Benedikt M, Dumontier M and Celebi R
The emerging European Health Data Space (EHDS) Regulation opens new prospects for large-scale sharing and re-use of health data. Yet, the proposed regulation suffers from two important limitations: it is designed to benefit the whole population with limited consideration for individuals, and the generation of secondary datasets from heterogeneous, unlinked patient data will remain burdensome. AIDAVA, a Horizon Europe project that started in September 2022, proposes to address both shortcomings by providing patients with an AI-based virtual assistant that maximises automation in the integration and transformation of their health data into an interoperable, longitudinal health record. This personal record can then be used to inform patient-related decisions at the point of care, whether this is the usual point of care or a possible cross-border point of care. The personal record can also be used to generate population datasets for research and policymaking. The proposed solution will enable a much-needed paradigm shift in health data management, implementing a ‘curate once at patient level, use many times’ approach, primarily for the benefit of patients and their care providers, but also for more efficient generation of high-quality secondary datasets. After 15 months, the project shows promising preliminary results in achieving automation in the integration and transformation of heterogeneous data of each individual patient, once the content of the data sources managed by the data holders has been formally described. Additionally, the conceptualization phase of the project identified a set of recommendations for the development of a patient-centric EHDS, significantly facilitating the generation of data for secondary use.
Citation: de Zegher I, Norak K, Steiger D, Müller H, Kalra D, Scheenstra B, Cina I, Schulz S, Uma K, Kalendralis P, Lotman E-M, Benedikt M, Dumontier M and Celebi R (2024) Artificial intelligence based data curation: enabling a patient-centric European health data space. Front. Med. 11:1365501. doi: 10.3389/fmed.2024.1365501 https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1365501/full
Disambiguation of acronyms in clinical narratives with large language models.
Kugic, A., Schulz, S., & Kreuzthaler, M. (2024)
Objective: To assess the performance of large language models (LLMs) for zero-shot disambiguation of acronyms in clinical narratives.
Materials and Methods: Clinical narratives in English, German, and Portuguese were applied for testing the performance of four LLMs: GPT-3.5, GPT-4, Llama-2-7b-chat, and Llama-2-70b-chat. For English, the anonymized Clinical Abbreviation Sense Inventory (CASI, University of Minnesota) was used. For German and Portuguese, at least 500 text spans were processed. The output of LLM models, prompted with contextual information, was analyzed to compare their acronym disambiguation capability, grouped by document-level metadata, the source language, and the LLM.
Results: On CASI, GPT-3.5 achieved 0.91 in accuracy. GPT-4 outperformed GPT-3.5 across all datasets, reaching 0.98 in accuracy for CASI, 0.86 and 0.65 for two German datasets, and 0.88 for Portuguese. Llama models only reached 0.73 for CASI and failed severely for German and Portuguese. Across LLMs, performance decreased from English to German and Portuguese processing languages. There was no evidence that additional document-level metadata had a significant effect.
Conclusion: For English clinical narratives, acronym resolution by GPT-4 can be recommended to improve readability of clinical text by patients and professionals. For German and Portuguese, better models are needed. Llama models, which are particularly interesting for processing sensitive content on premise, cannot yet be recommended for acronym resolution.
Kugic, A., Schulz, S., & Kreuzthaler, M. (2024). Disambiguation of acronyms in clinical narratives with large language models. Journal of the American Medical Informatics Association, ocae157. https://doi.org/10.1093/jamia/ocae157
Unraveling Clinical Insights: A Lightweight and Interpretable Approach for Multimodal and Multilingual Knowledge Integration.
Uma, K., & Moens, M. F. (2024).
In recent years, the analysis of clinical texts has evolved significantly, driven by the emergence of BERT-like language models such as PubMedBERT and ClinicalBERT, which have been tailored for the (bio)medical domain and rely on extensive archives of medical documents. While they boast high accuracy, their lack of interpretability and language transfer limitations restrict their clinical utility. To address this, we propose a new, lightweight graph-based embedding method designed specifically for radiology reports. This approach considers the report’s structure and content, connecting medical terms through the multilingual SNOMED Clinical Terms knowledge base. The resulting graph embedding reveals intricate relationships among clinical terms, enhancing both clinician comprehension and clinical accuracy without the need for large pre-training datasets. Demonstrating the versatility of our method, we apply this embedding to two tasks: disease and image classification in X-ray reports. In disease classification, our model competes effectively with BERT-based approaches, yet it is significantly smaller and requires less training data. Additionally, in image classification, we illustrate the efficacy of the graph embedding by leveraging cross-modal knowledge transfer, highlighting its applicability across diverse languages.
Uma, K., & Moens, M. F. (2024). Unraveling Clinical Insights: A Lightweight and Interpretable Approach for Multimodal and Multilingual Knowledge Integration. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health)@ LREC-COLING 2024 (pp. 197-203). https://aclanthology.org/2024.cl4health-1.24
Kommunikationsfähigkeit und Interoperabilität von Gesundheitsdaten in einem vernetzten Gesundheitssystem.
Daumke, P., Haverkamp, C., Heckmann, S., Kuper, M., Müller, A., Oemig, F., ... & Schulz, S. (2024).
Interoperability is indispensable for a networked healthcare system. Based on terminology standards such as ICD, LOINC, and SNOMED CT, it requires a correct interpretation of patient data in the respective application context. This is supported by syntactic standards such as FHIR, which embed codes in the patient-specific context. To make routine data interoperable, the gap between clinical language and standardized documentation must be bridged. Natural language processing (NLP) is a technology for this purpose that is currently evolving rapidly under the banner of artificial intelligence. Communicating with computers in human language will gain considerably in importance. The chapter gives an insight into current techniques and resources that support interoperability. It also presents perspectives from healthcare delivery, health administration, science, industry, and self-governing bodies.
Daumke, P., Haverkamp, C., Heckmann, S., Kuper, M., Müller, A., Oemig, F., ... & Schulz, S. (2024). Kommunikationsfähigkeit und Interoperabilität von Gesundheitsdaten in einem vernetzten Gesundheitssystem. In Health Data Management: Schlüsselfaktor für erfolgreiche Krankenhäuser (pp. 457-496). Wiesbaden: Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-43236-2_41
Towards Explainability in Automated Medical Code Prediction from Clinical Records.
Uma, K., Francis, S., Sun, W., Moens, MF. (2024).
The International Statistical Classification of Diseases and Related Health Problems (ICD) is a global standard, a diagnostic tool that is frequently used for endemic research, health management, and clinical diagnosis, and it plays a crucial role in providing shrewd medical treatment. Comparable statistics on the causes of mortality and morbidity across locations and throughout time have been based on the ICD. The traditional procedure of assigning codes is expensive, error-prone and time-consuming, and automated mapping of ICD codes is now a significant area of scholarly research. With the help of statistical modeling, rule-engines, conventional machine learning, and deep learning techniques like graph embedding, attention mechanisms, adversarial learning, and pre-trained language models (PLMs), this paper aims to analyze and document inferences on the evolution of clinical coding automation. We try to summarize with comparative performance analysis various approaches addressed towards codification of free-text clinical narratives on the publicly available Medical Information Mart. This study investigates whether clinicians and researchers could benefit from an adequate interpretation of model predictions from an Explainable Artificial Intelligence (XAI) perspective. Finally, the survey illustrates ICD coding and disease classification applications and its challenges, evaluation metrics, datasets, and directions towards automating explanatory medical code predictions.
Uma, K., Francis, S., Sun, W., Moens, MF. (2024). Towards Explainability in Automated Medical Code Prediction from Clinical Records. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2023. Lecture Notes in Networks and Systems, vol 825. Springer, Cham. https://doi.org/10.1007/978-3-031-47718-8_40
Deep Learning of Multimodal Ultrasound: Stratifying the Response to Neoadjuvant Chemotherapy in Breast Cancer Before Treatment
Gu, Jionghui ; Zhong, Xian ; Fang, Chengyu ; Lou, Wenjing ; Fu, Peifen ; Woodruff, Henry C. ; Wang, Baohua ; Jiang, Tian’an ; Lambin, Philippe
Background: Not only should resistance to neoadjuvant chemotherapy (NAC) be considered in patients with breast cancer but also the possibility of achieving a pathologic complete response (PCR) after NAC. Our study aims to develop 2 multimodal ultrasound deep learning (DL) models to noninvasively predict resistance and PCR to NAC before treatment.
Methods: From January 2017 to July 2022, a total of 170 patients with breast cancer were prospectively enrolled. All patients underwent multimodal ultrasound examination (grayscale 2D ultrasound and ultrasound elastography) before NAC. We combined clinicopathological information to develop 2 DL models, DL_Clinical_resistance and DL_Clinical_PCR, for predicting resistance and PCR to NAC, respectively. In addition, these 2 models were combined to stratify the prediction of response to NAC.
Results: In the test cohort, DL_Clinical_resistance had an AUC of 0.911 (95%CI, 0.814-0.979) with a sensitivity of 0.905 (95%CI, 0.765-1.000) and an NPV of 0.882 (95%CI, 0.708-1.000). Meanwhile, DL_Clinical_PCR achieved an AUC of 0.880 (95%CI, 0.751-0.973) and sensitivity and NPV of 0.875 (95%CI, 0.688-1.000) and 0.895 (95%CI, 0.739-1.000), respectively. By combining DL_Clinical_resistance and DL_Clinical_PCR, 37.1% of patients with resistance and 25.7% of patients with PCR were successfully identified by the combined model, suggesting that these patients could benefit by an early change of treatment strategy or by implementing an organ preservation strategy after NAC.
Conclusions: The proposed DL_Clinical_resistance and DL_Clinical_PCR models and combined strategy have the potential to predict resistance and PCR to NAC before treatment and allow stratified prediction of NAC response.
Gu J, Zhong X, Fang C, Lou W, Fu P, Woodruff HC, Wang B, Jiang T, Lambin P. Deep Learning of Multimodal Ultrasound: Stratifying the Response to Neoadjuvant Chemotherapy in Breast Cancer Before Treatment. Oncologist. 2024 Feb 2;29(2):e187-e197. doi: 10.1093/oncolo/oyad227. PMID: 37669223; PMCID: PMC10836325.
Simulated image-specific microcalcification clusters and associated mass enhancement to enhance training of a deep learning model for cancer detection in contrast-enhanced mammography
Van Camp, Astrid ; Woodruff, Henry C. ; Cockmartin, Lesley ; Marshall, Nicholas William ; Bosmans, Hilde T.C. ; Lambin, Philippe
We present an automated method to generate synthetic contrast-enhanced mammography cases with simulated microcalcification clusters. This method accounts for existing textures in the breast, with the simulated clusters inserted in the low-energy image. In parallel, potential mass-like enhancement is modelled from real values in the recombined image. The same deep learning model was trained with different amounts and ratios of real and synthetic data. When trained with real data only, malignant masses are more often correctly detected and classified than malignant microcalcification clusters. The addition of synthetic data with simulated clusters during training could increase detection sensitivity for all types of malignant lesions and maintained similar levels of AUC for classification. This enhanced performance was consistent on both internal and external test sets. These findings demonstrate the potential applicability of synthetic data to enhance deep learning models, especially when real data are scarce or imbalanced.
Astrid Van Camp, Henry C. Woodruff, Lesley Cockmartin, Nicholas W. Marshall, Hilde Bosmans, and Philippe Lambin "Simulated image-specific microcalcification clusters and associated mass enhancement to enhance training of a deep learning model for cancer detection in contrast-enhanced mammography", Proc. SPIE 13174, 17th International Workshop on Breast Imaging (IWBI 2024), 1317404 (29 May 2024); https://doi.org/10.1117/12.3026879
2023
Towards principles of ontology-based annotation of clinical narratives.
Schulz, S., Del-Pinto, W., Han, L., Kreuzthaler, M., Aghaei, S., & Nenadic, G. (2023).
Despite the increasing availability of ontology-based semantic resources for biomedical content representation, large amounts of clinical data are in narrative form only. Therefore, many clinical information management tasks require information extraction using natural language processing (NLP).
Clinical corpora annotated by humans are crucial resources for this purpose. On the one hand, they are needed to domain-fine-tune language models (LMs) with the purpose of formally representing clinical information extracted from unstructured free text. On the other hand, annotated corpora are indispensable for assessing the results of information extraction using NLP.
The effectiveness of annotations crucially depends on annotation quality. Detailed annotation guidelines, which define the form that extracted information should take, prevent human annotators from taking erratic annotation decisions and guarantee a good inter-annotator agreement. Our hypothesis is that, to this end, annotations should (i) be based on ontological principles and (ii) be consistent with existing clinical documentation standards.
Drawing on the experience of several annotation projects, we highlight the need for sophisticated guidelines. We formulate a set of abstract principles on which such guidelines should be based, followed by examples of how to keep them, on the one hand, user-friendly and consistent, and on the other hand compatible with the international semantic standards SNOMED CT and FHIR, including their areas of overlap.
We sketch the representation of the resulting representations in a knowledge graph as a state-of-the-art semantic representation paradigm, which can be enriched by additional content on A-Box and T-Box level and on which symbolic and neural reasoning tasks can be applied.
Schulz, S., Del-Pinto, W., Han, L., Kreuzthaler, M., Aghaei, S., & Nenadic, G. (2023). Towards principles of ontology-based annotation of clinical narratives. In Proceedings of the International Conference on Biomedical Ontologies (Vol. 2023). https://ceur-ws.org/Vol-3603/Paper4.pdf
Semantic Annotation of Tabular Data for Machine-to-Machine Interoperability via Neuro-Symbolic Anchoring
Shervin Mehryar, Remzi Celebi
In this paper we investigate automated annotation of tabular data using semantic technologies in combination with neural network embedding. Specifically, we propose an anchoring model in which property and cell types from the data embedding space are aligned with ontology relation and entity types. We show that by combining the power of symbolic reasoning, neural embeddings, and loss function design, a significant performance improvement as high as 86% for column property, 82% for column type, and 87% for column qualifier annotations can be achieved based on DBpedia and Wikidata table extractions.
Shervin Mehryar, Remzi Celebi. Semantic Annotation of Tabular Data for Machine-to-Machine Interoperability via Neuro-Symbolic Anchoring. SemTab’23: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2023, co-located with the 22nd International Semantic Web Conference (ISWC), November 6-10, 2023, Athens, Greece
AI for life: Trends in artificial intelligence for biotechnology
Andreas Holzinger, Katharina Keiblinger, Petr Holub, Kurt Zatloukal, Heimo Müller
Due to popular successes (e.g., ChatGPT), Artificial Intelligence (AI) is on everyone's lips today. When advances in biotechnology are combined with advances in AI, unprecedented new potential solutions become available. This can help with many global problems and contribute to important Sustainability Development Goals. Current examples include Food Security, Health and Well-being, Clean Water, Clean Energy, Responsible Consumption and Production, Climate Action, Life below Water, or protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss. AI is ubiquitous in the life sciences today. Topics include a wide range from machine learning and Big Data analytics, knowledge discovery and data mining, biomedical ontologies, knowledge-based reasoning, natural language processing, decision support and reasoning under uncertainty, temporal and spatial representation and inference, and methodological aspects of explainable AI (XAI) with applications of biotechnology. In this pre-Editorial paper, we provide an overview of open research issues and challenges for each of the topics addressed in this special issue. Potential authors can directly use this as a guideline for developing their paper.
Holzinger A, Keiblinger K, Holub P, Zatloukal K, Müller H. AI for life: Trends in artificial intelligence for biotechnology. N Biotechnol. 2023;74:16-24. doi: 10.1016/j.nbt.2023.02.001. PMID: 36754147
Masking Language Model Mechanism with Event-Driven Knowledge Graphs for Temporal Relations Extraction from Clinical Narratives.
Uma, K., Francis, S., & Moens, M. F. (2023).
For many natural language processing systems, the extraction of temporal links and associations from clinical narratives has been a critical challenge. To understand such processes, we must be aware of the occurrences of events and their time or temporal aspect by constructing a chronology for the sequence of events. The primary objective of temporal relation extraction is to identify relationships and correlations between entities, events, and expressions. We propose a novel architecture that leverages a Transformer-based graph neural network, combining textual data with event graph embeddings to predict temporal links across events, entities, document creation time and expressions. We demonstrate our preliminary findings on the i2b2 temporal relations corpus for predicting BEFORE, AFTER and OVERLAP links with an event graph for the correct set of relations. Various Biomedical-BERT embedding types were benchmarked, with PubMed BERT combined with the language model masking (LMM) mechanism yielding the best performance in our methodology. This illustrates the effectiveness of our proposed strategy.
Uma, K., Francis, S., & Moens, M. F. (2023). Masking Language Model Mechanism with Event-Driven Knowledge Graphs for Temporal Relations Extraction from Clinical Narratives. In International Conference on Complex Networks and Their Applications (pp. 162-174). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-53468-3_14
Toward human-level concept learning: Pattern benchmarking for AI algorithms
Holzinger, A., Saranti, A., Angerschmid, A., Finzel, B., Schmid, U., & Mueller, H. (2023).
Artificial intelligence (AI) today is very successful at standard pattern-recognition tasks due to the availability of large amounts of data and advances in statistical data-driven machine learning. However, there is still a large gap between AI pattern recognition and human-level concept learning. Humans can learn amazingly well even under uncertainty from just a few examples and are capable of generalizing these concepts to solve new conceptual problems. The growing interest in explainable machine intelligence requires experimental environments and diagnostic/benchmark datasets to analyze existing approaches and drive progress in pattern analysis and machine intelligence. In this paper, we provide an overview of current AI solutions for benchmarking concept learning, reasoning, and generalization; discuss the state-of-the-art of existing diagnostic/benchmark datasets (such as CLEVR, CLEVRER, CLOSURE, CURI, Bongard-LOGO, V-PROM, RAVEN, Kandinsky Patterns, CLEVR-Humans, CLEVRER-Humans, and their extension containing human language); and provide an outlook of some future research directions in this exciting research domain.
Holzinger A, Saranti A, Angerschmid A, Finzel B, Schmid U, Mueller H. Toward human-level concept learning: Pattern benchmarking for AI algorithms. Patterns (N Y). 2023 Jul 5;4(8):100788. doi: 10.1016/j.patter.2023.100788. PMC: 10435961
Combining Deep Learning and Handcrafted Radiomics for Classification of Suspicious Lesions on Contrast-enhanced Mammograms
Beuque, Manon P.L. ; Lobbes, Marc B.I. ; van Wijk, Yvonka ; Widaatalla, Yousif ; Primakov, Sergey P. ; Majer, Michael ; Balleyguier, Corinne S. ; Woodruff, Henry C. ; Lambin, Philippe
Background
Handcrafted radiomics and deep learning (DL) models individually achieve good performance in lesion classification (benign vs malignant) on contrast-enhanced mammography (CEM) images.
Purpose
To develop a comprehensive machine learning tool able to fully automatically identify, segment, and classify breast lesions on the basis of CEM images in recall patients.
Materials and Methods
CEM images and clinical data were retrospectively collected between 2013 and 2018 for 1601 recall patients at Maastricht UMC+ and 283 patients at Gustave Roussy Institute for external validation. Lesions with a known status (malignant or benign) were delineated by a research assistant overseen by an expert breast radiologist. Preprocessed low-energy and recombined images were used to train a DL model for automatic lesion identification, segmentation, and classification. A handcrafted radiomics model was also trained to classify both human- and DL-segmented lesions. Sensitivity for identification and the area under the receiver operating characteristic curve (AUC) for classification were compared between individual and combined models at the image and patient levels.
Results
After the exclusion of patients without suspicious lesions, the total number of patients included in the training, test, and validation data sets were 850 (mean age, 63 years ± 8 [SD]), 212 (62 years ± 8), and 279 (55 years ± 12), respectively. In the external data set, lesion identification sensitivity was 90% and 99% at the image and patient level, respectively, and the mean Dice coefficient was 0.71 and 0.80 at the image and patient level, respectively. Using manual segmentations, the combined DL and handcrafted radiomics classification model achieved the highest AUC (0.88 [95% CI: 0.86, 0.91]) (P < .05 except compared with DL, handcrafted radiomics, and clinical features model, where P = .90). Using DL-generated segmentations, the combined DL and handcrafted radiomics model showed the highest AUC (0.95 [95% CI: 0.94, 0.96]) (P < .05).
Conclusion
The DL model accurately identified and delineated suspicious lesions on CEM images, and the combined output of the DL and handcrafted radiomics models achieved good diagnostic performance.
Beuque, Manon P.L. ; Lobbes, Marc B.I. ; van Wijk, Yvonka ; Widaatalla, Yousif ; Primakov, Sergey P. ; Majer, Michael ; Balleyguier, Corinne S. ; Woodruff, Henry C. ; Lambin, Philippe. Combining Deep Learning and Handcrafted Radiomics for Classification of Suspicious Lesions on Contrast-enhanced Mammograms (2023) Radiology, 307 (5), art. no. e221843. DOI: 10.1148/radiol.221843.