Genome-Centric Multimodal Data Integration in Personalised Cardiovascular Medicine

Federated Learning Logic and Architecture

Federated learning in NextGen is adopted as a practical operating model for collaboration: multiple clinical and research sites can contribute to shared analyses on genetics and cardiovascular disease while keeping data under their local governance. Within WP3 (Optimised multimodal and genomic processing, analytical and integration tools), this is anchored in an explicit comparison with adjacent approaches such as distributed computing, meta-analysis, and Trusted Research Environments (TREs). The functional claim is clear: federated learning is treated as a scalable alternative to TRE-style setups because it lets a consortium expand access while preserving privacy guarantees, with data governance remaining within each participating organisation.

That framing directly drives architectural choices: the work focuses on a cross-silo, centralised topology with an aggregation server, consistent with how bioinformatics collaborations typically operate across institutions. This matters less as a technical preference than as an organisational one: it provides a predictable coordination point, clarifies responsibilities, and supports repeatable operational workflows for consortia that must run analyses across heterogeneous environments. In the same functional register, the work positions privacy-enhancing technologies (homomorphic encryption, differential privacy, and secure multiparty computation) as core enablers of the collaboration model.
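The cross-silo, centralised topology can be sketched in a few lines: an aggregation server combines per-site model updates by weighted averaging (FedAvg-style), without ever seeing raw data. The function and data below are illustrative assumptions, not the project's actual implementation; a minimal sketch only.

```python
import numpy as np

def fedavg(client_updates, client_sizes):
    """Weighted average of per-site parameter vectors (FedAvg-style).

    client_updates: list of 1-D parameter arrays, one per site.
    client_sizes: number of local samples behind each update.
    The aggregation server only ever sees these updates, never raw records.
    """
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(client_updates)            # shape: (n_sites, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Three hypothetical sites contribute parameter vectors of a shared model.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 100, 200]   # the third site holds twice as many samples
global_params = fedavg(updates, sizes)
print(global_params)      # weighted toward the larger site: [3.5 4.5]
```

The weighting by local sample size is one common design choice; it keeps the aggregate faithful to the pooled cohort when site sizes differ widely.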

The Work Package turns this concept into a set of concrete analysis capabilities, reviewing how federated learning maps onto bioinformatics domains that matter for NextGen, including proteomics, GWAS, single-cell RNA sequencing, multi-omics, and medical imaging, and it surveys ready-to-use tooling such as sfkit and FeatureCloud. The practical objective here is adoption: federated learning becomes relevant when it can be expressed as familiar analytical operations, rather than remaining an abstract infrastructure choice. The report’s implementation guide follows that logic by showing how to build federated analogues of existing computations, distinguishing simpler sum-based operations from iterative model training patterns, and using frameworks such as Flower to make these patterns actionable.
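The distinction between sum-based operations and iterative training can be illustrated with the simplest sum-based case: a federated mean assembled from per-site sufficient statistics. This is a plain-Python sketch of the pattern rather than actual Flower code; all names are illustrative.

```python
def local_summary(values):
    """Runs at each site: share only the sufficient statistics (sum, count)."""
    return sum(values), len(values)

def federated_mean(site_summaries):
    """Runs at the aggregator: combines summaries; no raw value leaves a site."""
    total = sum(s for s, _ in site_summaries)
    count = sum(c for _, c in site_summaries)
    return total / count

# Three hypothetical sites with local cohorts of different sizes.
sites = [[120.0, 135.0], [110.0], [125.0, 140.0, 150.0]]
summaries = [local_summary(v) for v in sites]
print(federated_mean(summaries))  # identical to the pooled mean: 130.0
```

Iterative model training follows the same contract — local computation, exchange of aggregates only — but repeats it over many rounds, which is what frameworks such as Flower orchestrate.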

The work then progresses from “how it should work” to “what was built and tested”. Two methods are singled out as operational proofs. First, a federated GWAS implementation using SF-GWAS, based on a workflow comparable to established practice (similar to PLINK) and benchmarked against widely used centralised baselines (PLINK and REGENIE): SF-GWAS PCA appears viable but demands substantial resources, while SF-GWAS LMM is characterised as immature and memory-heavy. Second, a federated implementation of PLIER (for transcriptomic dimensionality reduction) is reported using Flower, with results equivalent to centralised training and a latent space described as “indistinguishable” from that of centralised models across random seeds.
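Latent spaces from factor methods such as PLIER are identifiable only up to component order and sign, so “indistinguishable” comparisons are typically made after matching components. The helper below is a hypothetical sketch of such a check (mean best-match absolute correlation), not the report's actual evaluation code.

```python
import numpy as np

def latent_agreement(Z_a, Z_b):
    """Mean best-match absolute correlation between two latent spaces.

    Z_a, Z_b: (n_samples, n_components) latent matrices, e.g. from a
    federated and a centralised run. Each component of Z_a is matched to
    its most-correlated component of Z_b, ignoring sign and ordering.
    """
    k = Z_a.shape[1]
    # Cross-block of the joint correlation matrix: components of A vs B.
    corr = np.abs(np.corrcoef(Z_a.T, Z_b.T)[:k, k:])
    return corr.max(axis=1).mean()

# Sanity check: a permuted, sign-flipped copy should agree almost perfectly.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))
print(latent_agreement(Z, -Z[:, [2, 0, 1]]))
```

A value near 1.0 across seeds would support the equivalence claim; divergence on specific components would localise where federation changes the model.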

Privacy is treated as an operational constraint: a realistic threat model is adopted that differentiates risk profiles across data types, highlighting re-identification risk for high-density DNA data via linkage attacks and a lower-risk profile for bulk transcriptomics. The mitigation package combines secure aggregation (SecAgg+), so the server observes only aggregates, with differential privacy via noise addition to protect against membership inference on the final model. These principles translate into concrete rules for specific algorithms: for PLIER, local covariance matrices must never be shared, because small sample sizes can enable reconstruction attacks; for deep learning models such as scVI/scANVI, the approach combines secure aggregation with central differential privacy to balance accuracy and privacy.
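The combination of secure aggregation and central differential privacy can be sketched as: clip each site's update to bound its influence, average (in deployment this average would emerge from SecAgg+, so individual updates stay hidden), then add calibrated Gaussian noise to the aggregate. The function and parameters below are illustrative assumptions, not NextGen's actual mechanism or calibration.

```python
import numpy as np

def dp_fedavg(client_updates, clip_norm, noise_multiplier, rng):
    """Central-DP aggregation sketch: clip, average, add Gaussian noise.

    Clipping bounds each site's sensitivity; the noise scale is tied to
    that bound so the released aggregate resists membership inference.
    """
    clipped = [u * min(1.0, clip_norm / max(np.linalg.norm(u), 1e-12))
               for u in client_updates]
    aggregate = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(client_updates)
    return aggregate + rng.normal(0.0, sigma, size=aggregate.shape)

# Two hypothetical site updates; with noise_multiplier > 0 the output is noisy.
rng = np.random.default_rng(7)
updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
noisy = dp_fedavg(updates, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The accuracy–privacy balance mentioned for scVI/scANVI corresponds to choosing `clip_norm` and `noise_multiplier`: tighter clipping and larger noise strengthen the guarantee but degrade the trained model.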

Finally, WP3 activities connect federated learning to the wider “make it usable” work needed for real deployments. A synthetic data strategy for safe testing and piloting is defined, including the use of HAPGEN2 and documented datasets (with Model Cards and Data Cards). A highlighted example is a HapMap-derived set generated from the CEU population panel (chromosome 18), comprising 1,000 synthetic individuals with simulated binary phenotypes, enabling GWAS pipeline testing while avoiding privacy exposure. In parallel, D3.8 tackles the throughput side of the pipeline by demonstrating SYCL-GAL, a re-implementation of GATK pre-processing designed for speed, showing a 7-fold CPU speedup (64 threads) over standard Java GATK in the reported benchmark. Together, these elements support the same functional end-state: federated learning and federated analysis become dependable when they can be tested safely, deployed repeatedly, and executed within practical time budgets.
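One way simulated binary phenotypes could be attached to synthetic genotypes (e.g. HAPGEN2 output) is a logistic model over a chosen causal variant. This is a hypothetical sketch for pipeline testing, not the report's documented procedure; the function name, effect size, and baseline are illustrative.

```python
import numpy as np

def simulate_binary_phenotypes(genotypes, causal_idx, effect, baseline, rng):
    """Simulate case/control labels via a logistic model on one causal SNP.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts,
    such as synthetic individuals produced by a genotype simulator.
    effect: per-allele log-odds increase at the causal SNP (illustrative).
    baseline: intercept controlling the background case probability.
    """
    logits = baseline + effect * genotypes[:, causal_idx]
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(len(probs)) < probs).astype(int)

# Toy usage: three synthetic individuals, one candidate causal SNP.
geno = np.array([[0], [1], [2]])
labels = simulate_binary_phenotypes(geno, causal_idx=0, effect=0.8,
                                    baseline=-1.0,
                                    rng=np.random.default_rng(1))
```

Because the causal variant and effect size are known by construction, a GWAS pipeline run on such data has a ground truth to recover, which is exactly what safe end-to-end testing needs.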