Genome-Centric Multimodal Data Integration in Personalised Cardiovascular Medicine.

Deliverables

The NextGen Data Management Plan v1.0

NextGen addresses the complex problem of integrating multiple types of data (multimodal data), including genomic data, into research pathways that might include AI tools. NextGen develops tools to overcome barriers to data integration and access, with the aim of improving clinical outcomes from healthcare research.
NextGen exploits federated technology and decentralized data management techniques; centrally aggregating data is outside the declared project scope (which is particularly relevant in the context of data subject to the GDPR).
Starting with datasets accessible by NextGen participants locally or through biobanks, the integration tooling listed will first be deployed locally, while a concept for a decentralized platform supporting research portability across different sites is developed around specific pilots defined during the project.

Data Discovery Functionality D1.3 – Part 1

The NextGen Project aims to develop tooling for integrating genome-centric multimodal data for research analytics. These tools promote a federated approach to dealing with the sensitive data required for personalised cardiovascular medicine, and will be deployed in pilots as NextGen's pathfinder projects. The vision of the NextGen federated approach is to create a health data ecosystem where data remains under the control of each research organisation. As a result, these tools will extend the reach of participants, allowing them to perform research activities beyond their own institutions but within the governance of the ecosystem.
Achieving this vision requires new mechanisms to lawfully discover which data is available for research purposes while the datasets remain secured by each research organisation. The search has to work while the data stays behind each organisation's walls and cannot be moved a priori. Providing such a tool is the goal of NextGen's Federated Catalogues.
Current discovery mechanisms are based on search solutions that require data and information about the data (the metadata) to be moved to a hosted system (i.e. the catalogue). Existing federated catalogue solutions are not suitable for sensitive health data because a participating research institution cannot maintain sufficient control over the structure and usage of the exposed metadata. This is particularly true for health data spaces spanning multiple jurisdictions.
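The discovery idea described above can be illustrated with a minimal sketch, which is an assumption-laden toy and not the NextGen Federated Catalogue design. Each site evaluates a query against its own locally held records and returns only an aggregate answer, so no record leaves the site. The names `Site`, `federated_discover` and `has_wgs_and_dcm`, the record fields, and the small-count suppression threshold `min_count` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Site:
    """A research organisation that keeps its records behind its own walls."""
    name: str
    records: list = field(default_factory=list, repr=False)
    min_count: int = 5  # hypothetical threshold: smaller counts are suppressed

    def answer(self, predicate: Callable[[dict], bool]) -> dict:
        """Evaluate the query locally and return only an aggregate answer."""
        matches = sum(1 for record in self.records if predicate(record))
        if matches < self.min_count:
            # Too few matches: report neither availability nor a count.
            return {"site": self.name, "available": False, "count": None}
        return {"site": self.name, "available": True, "count": matches}


def federated_discover(sites, predicate):
    """Send the same query to every site and collect the aggregate answers."""
    return [site.answer(predicate) for site in sites]


def has_wgs_and_dcm(record: dict) -> bool:
    """Hypothetical query: whole-genome sequencing data with a DCM diagnosis."""
    return record.get("modality") == "WGS" and record.get("diagnosis") == "DCM"


# Illustrative, made-up site contents.
site_a = Site("site-a", [{"modality": "WGS", "diagnosis": "DCM"}] * 6)
site_b = Site("site-b", [{"modality": "WGS", "diagnosis": "DCM"}] * 2)
results = federated_discover([site_a, site_b], has_wgs_and_dcm)
```

In this sketch the coordinator only ever sees the per-site summaries in `results`; how a real catalogue would authenticate queries, describe metadata, or enforce governance rules is outside what the sketch covers.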

Pathfinder Platform Design and Architecture Blueprint and Assessment and Demonstration of MVP Technology

The NextGen Pathfinder represents a tangible implementation of core European Health Data Space (EHDS) specifications, serving as a federated, multi-site "mini-EHDS" network. It integrates advanced tools for genomic and multimodal data analysis while adhering to FAIR principles and cross-border governance frameworks. This document is a blueprint for the design of the Pathfinder platform architecture and core services implementation as part of MVP technology.

D1.1: Iterative Data Management Plan & Data Integration Framework

D1.1: Iterative Data Management Plan & Data Integration Framework (Months 6, 24, 36) This deliverable establishes the data management framework, detailing how health data integration barriers will be lowered using lawful searchability and consent mechanisms. It will document data types (including imaging, WGS, WES, and synthetic data), FAIR principles, and authentication mechanisms for data portability.

D2.1: Design and architecture of the NextGen platform and D2.2: Technology Assessment

D2.1: Design and architecture of the NextGen platform (blueprint, sandbox, deployment) (Months 18, 36, 48) This includes a blueprint of the Data Oriented Architecture (DOA) and dataspace services, designed to support distributed data storage, processing, and remote collaboration.

D2.2: Technology Assessment (Months 18, 36) Assessment of the developed data space services, such as the Data Space catalogue, Security tools, Governance repository, and Marketplace.

D3.1: Accelerated secondary genomic analysis software (iterative)

D3.1: Accelerated secondary genomic analysis software (iterative) (Months 18, 36, 42) Development of open-source, hardware-accelerated pipelines (utilising SYCL) for secondary genomic analysis to reduce execution time significantly compared to proprietary CPU implementations.

D3.2: Annotation methods

D3.2: Annotation methods (Months 12, 42) Implementation of AI-driven genomic data quality control pipelines and variant annotation/prioritisation tools to rank variants based on the likelihood of gene-disease relationships.

D3.4: Privacy assessment of federated approaches

D3.4: Privacy assessment of federated approaches (Months 18, 36) Definition of metrics and strategies to assess and mitigate privacy risks (e.g., privacy leakage) in federated learning algorithms.

D3.5: Synthetic data for testing and piloting

D3.5: Synthetic data for testing and piloting (Months 24, 36) Generation of synthetic datasets, ranging from low fidelity for platform testing to high fidelity for algorithm refinement, covering modalities such as tabular, imaging, and genomic data.

Deliverable 3.6 Synthetic datasets for testing and piloting – 2

Our definition of Synthetic Data. In NextGen, we use a broad definition and consider synthetic data to be data constructed to reflect some characteristics of a real or imagined dataset. Specifically, we define a synthetic dataset as a constructed entity that shares specific characteristics of actual (or imagined) reference data.
The characteristics referred to above include, but are not limited to:
- Technical characteristics of dataset representations (e.g. file size and format)
- Semantics - how information is represented
- Measurements and statistical properties related to the domain of the data
Defined in this manner, examples of synthetic data include both:
- “Technical synthetic data”: Data which is generated without reference to an existing dataset (e.g. for developing or testing technical functionality),
- “Healthcare synthetic data”: Data that is mathematically modelled on an existing source dataset to replicate some characteristics of its informational content (e.g. to remove personal information while retaining specific medical characteristics or summary statistics).
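The two kinds of synthetic data defined above can be contrasted in a short sketch. This is a toy illustration under stated assumptions, not the method used in the deliverable: the column names, value ranges, and the choice of a Gaussian fitted to the mean and standard deviation of one column are all hypothetical, and such a simple model makes no privacy guarantee by itself.

```python
import random
import statistics


def technical_synthetic(n_rows: int, seed: int = 0) -> list:
    """Rows with a plausible schema, generated without any reference dataset."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": f"SYN{i:05d}",
            "age": rng.randint(18, 90),  # arbitrary range, for testing only
            "ldl_mmol_l": round(rng.uniform(1.0, 6.0), 2),
        }
        for i in range(n_rows)
    ]


def healthcare_synthetic(reference: list, column: str, n_rows: int, seed: int = 0) -> list:
    """Rows sampled from a Gaussian fitted to one column of a reference dataset.

    Only the summary statistics (mean, standard deviation) of the reference
    column are used; no reference row is copied into the output.
    """
    rng = random.Random(seed)
    values = [row[column] for row in reference]
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [
        {"patient_id": f"SYN{i:05d}", column: rng.gauss(mu, sigma)}
        for i in range(n_rows)
    ]


# Made-up reference values standing in for a real source dataset.
reference_rows = [{"ldl_mmol_l": v} for v in (2.1, 2.9, 3.4, 3.8, 4.2, 4.9)]
tech_rows = technical_synthetic(10)
health_rows = healthcare_synthetic(reference_rows, "ldl_mmol_l", 5000)
```

The first function is enough for exercising file formats and platform plumbing, while the second preserves a chosen statistical property of the source, which is the distinction the definition draws.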

Deliverable 3.8 Accelerated secondary genomic analysis software

D3.8 is a demo deliverable of the SYCL-Genomics Acceleration Library (SYCL-GAL) that is being developed by EURECOM. The goal of the demo is to build a preliminary version of the library to demonstrate feasibility and to enable integration with other components of NextGen in the context of pilots and pathfinder projects. As planned, a preliminary version of SYCL-GAL was developed by M18 and demonstrated to all consortium partners at the second NextGen General Assembly meeting. This document provides a high-level outline of the work done on re-implementing the GATK secondary analysis pipeline and optimising it for enhanced performance through multi-threaded CPU-based processing.

D4.1: Model validation

D4.1: Model validation (Month 6) Definition of metrics, benchmarks, and risk management frameworks to assess the accuracy, reliability, and explainability of AI/ML models.

D5.1: Pathfinder Frameworks and Pathways

D5.1: Pathfinder Frameworks and Pathways (Month 6) Development of high-level functional designs and agile deployment plans for the Pathfinder pilots.

D5.2: Pathfinder functionality and specification

D5.2: Pathfinder functionality and specification (Months 12, 24) Development of pilots focused on genomic curation, interpretation, and acceleration functionality.

D6.1: Legal Framework and structured guidance (iterative)

D6.1: Legal Framework and structured guidance (iterative) (Months 12, 24, 36) Development of normative governance frameworks and risk management assessments for key project deliverables.

D7.1: NextGen Stakeholder platform workshops

D7.1: NextGen Stakeholder platform workshops (Months 7, 13, 19, 25, 31, 37, 43) Engagement activities within the stakeholder platform, including actively moderated discussions and webinars.

D7.2: Drivers, barriers and Cost-Benefit Analysis

D7.2: Drivers, barriers and Cost-Benefit Analysis (Months 12, 24, 36) Assessment and validation of costs, benefits, drivers, and barriers regarding the adoption of the NextGen platform.

D8.1: Dissemination and Communication Plan (iterative)

D8.1: Dissemination and Communication Plan (iterative) (Months 4, 12, 24, 36) Establishment of key performance indicators and a strategy for disseminating project results to target audiences.

D8.2: Dissemination and communication activity report

D8.2: Dissemination and communication activity report (Months 12, 24, 36, 48) Reporting on activities such as educational webinars, scientific publications, workshops, and hackathons.

D8.3 – Updated Dissemination and Communication Plan

This document presents the updated Dissemination and Communication (D&C) Plan for the Horizon Europe project NextGen - Next Generation Tools for Genome-Centric Multimodal Data Integration in Personalised Cardiovascular Medicine. It is intended to ensure that project information is communicated accurately, consistently, and effectively to relevant stakeholders and target audiences.
The purpose of this periodic update is twofold:
1. To highlight dissemination and communication activities accomplished during the first 24 months of the project (M1-M24).
2. To outline activities planned for the remainder of the project (M24-M48).
The updated D&C Plan is designed to ensure that the project’s results are disseminated in a timely manner, using appropriate formats and networks to reach the identified stakeholders. During the period January 2024 - December 2025, the primary target audience has been the scientific community, with a particular focus on clinicians, researchers, and healthcare professionals. This approach aims to maximise impact and societal benefit, while remaining responsive to the project’s development pace and evolving external factors.