Next-Generation Approaches to Genomic Data Curation and Interpretation
Genome curation involves a series of steps that integrate genomic data with validated experimental evidence and literature from diverse databases. Effective data curation and quality control are essential to ensure the accuracy, consistency, and reliability of genomic analysis results.
In this context, the NextGen genomic tools focus on the following areas.
Accelerating genomic data analysis pipelines. Through our open-source software development, we are helping to ease the computational burden of genomic analyses, making alignment and variant calling more accessible and scalable.
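To make the variant-calling step concrete, here is a deliberately simplified, pure-Python sketch of pileup-based calling: count the bases covering each reference position and flag positions where a non-reference allele dominates. This is a didactic stand-in under toy assumptions (tuple-encoded alignments, fixed depth and frequency thresholds), not the project's actual pipeline or a replacement for production callers.

```python
from collections import Counter

def call_variants(reference, aligned_reads, min_depth=3, min_alt_frac=0.5):
    """Toy pileup caller. aligned_reads: list of (start_pos, read_seq),
    0-based, already aligned to `reference` with no indels (assumption)."""
    # Build a per-position counter of observed bases (the "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to call anything
        alt, alt_count = counts.most_common(1)[0]
        if alt != reference[pos] and alt_count / depth >= min_alt_frac:
            variants.append((pos, reference[pos], alt, alt_count / depth))
    return variants

ref = "ACGTACGT"
reads = [(0, "ACGTACGT"), (0, "ACCTACGT"), (1, "CCTACG"), (2, "CTACGT")]
print(call_variants(ref, reads))  # one G->C variant at position 2
```

Real pipelines replace each piece with scalable tools (aligners, duplicate marking, statistical genotype likelihoods), but the core idea of summarizing evidence per position is the same.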
Pathogenic variant annotation and interpretation. While current annotation tools are generally accurate in scoring missense variants, in many clinical cases they fail to prioritize the true causal variant in monogenic disorders. To address this, we are benchmarking pathogenic variant annotation methods across diverse datasets and exploring machine learning models that integrate structural and functional genomic annotations with patient phenotypes. Our goal is to improve the prioritization of causal variants in monogenic diseases, enhancing diagnostic accuracy and clinical relevance.
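A minimal sketch of what "integrating annotations with patient phenotypes" can mean in practice: combine a precomputed pathogenicity score with the overlap between the patient's phenotype terms and the candidate gene's known phenotype associations. The weights, HPO terms, and linear combination below are illustrative assumptions, not the benchmarked models described above.

```python
def prioritize(variants, patient_terms, w_patho=0.6, w_pheno=0.4):
    """Rank candidate variants by a weighted mix of pathogenicity score
    and phenotype overlap. variants: list of dicts with keys
    'id', 'patho_score' (0-1), and 'gene_terms' (phenotype term list)."""
    ranked = []
    for v in variants:
        gene_terms = set(v["gene_terms"])
        # Fraction of the gene's phenotype terms matching the patient.
        overlap = (len(gene_terms & set(patient_terms)) / len(gene_terms)
                   if gene_terms else 0.0)
        score = w_patho * v["patho_score"] + w_pheno * overlap
        ranked.append((v["id"], round(score, 3)))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

# Hypothetical cardiomyopathy-related HPO terms for illustration.
patient = {"HP:0001638", "HP:0001644"}
cands = [
    {"id": "varA", "patho_score": 0.9, "gene_terms": []},
    {"id": "varB", "patho_score": 0.7,
     "gene_terms": ["HP:0001638", "HP:0001644"]},
]
print(prioritize(cands, patient))
```

Note how varB outranks varA despite a lower raw pathogenicity score, because its gene matches the patient's phenotype; this is the failure mode of score-only ranking that phenotype integration is meant to fix.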
Annotation of Variants in Complex Diseases. Most current missense annotation models are optimized for monogenic disorders and are not designed to assess the role of variants in complex, polygenic diseases. We have demonstrated that machine learning models can be effectively trained for this purpose as well. Our ongoing work in the NextGen Project focuses on leveraging these models to improve the annotation of variants associated with complex cardiovascular disorders and to support more accurate patient profiling in clinical practice.
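For polygenic settings, the relevant quantity is the aggregate contribution of many variants rather than a single causal hit. The sketch below shows the simplest such aggregation, a polygenic risk score (effect size times allele dosage, summed over variants); the variant IDs and effect sizes are invented for illustration, and a trained annotation model would supply the per-variant weights in practice.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of per-variant effect size (e.g. a GWAS beta) times allele
    dosage (0, 1, or 2 copies). Both arguments: dicts keyed by variant id;
    variants without a known effect size are skipped."""
    return sum(effect_sizes[v] * d for v, d in dosages.items()
               if v in effect_sizes)

betas = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}   # hypothetical effects
patient = {"rs1": 2, "rs2": 1, "rs3": 0}           # allele counts
print(round(polygenic_risk_score(patient, betas), 3))  # 0.24 - 0.05 = 0.19
```

Models for complex disease extend this idea with non-linear interactions and functional annotations, but the input shape (many small per-variant contributions per patient) is the same.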
Coordinated data processing across multiple devices. In the context of federated learning, we are developing tools that enable consistent quality control and data pre-processing across datasets stored on different partners' servers, without the need to move the data. When necessary, these processes can also be coordinated globally to obtain joint information from the distributed data.
Genomic data imputation and integration. We are developing deep learning methods to generate robust representations of genomic data that can handle missing or inconsistent information. These models enable both the imputation of incomplete data and the integration of genomic information with other data modalities. This approach opens new possibilities for multi-modal analyses, even in the presence of missing data types, and significantly enhances the overall quality of genomic analysis results.
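A minimal sketch of the federated pre-processing idea: each site reports only summary statistics (count, sum, sum of squares) for a feature, and a coordinator combines them into global mean and standard deviation that all sites can then use for consistent normalization. Raw data never leaves the sites. This is a simplified illustration of the pattern, not the project's actual protocol.

```python
import math

def local_summary(values):
    """Computed on each partner's server; only these three numbers
    are shared with the coordinator, never the raw values."""
    return len(values), sum(values), sum(x * x for x in values)

def combine(summaries):
    """Coordinator merges per-site summaries into global mean and std."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    sq = sum(s[2] for s in summaries)
    mean = total / n
    var = sq / n - mean ** 2          # population variance
    return mean, math.sqrt(var)

site_a = [1.0, 2.0, 3.0]   # stays on site A
site_b = [4.0, 5.0]        # stays on site B
mean, std = combine([local_summary(site_a), local_summary(site_b)])
print(mean, round(std, 4))  # same result as pooling all five values
```

The same exchange pattern (local computation, small shared summaries, global combination) generalizes to the quality-control checks and model updates used in federated learning.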
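To make the imputation idea concrete, here is a deliberately simple baseline: fill each missing feature (marked `None`) with the per-feature mean of the observed values. This toy stands in for the deep representation models described above, which learn far richer structure than a column mean, but it shows the input/output contract of an imputation step.

```python
def impute_column_means(samples):
    """samples: list of equal-length feature lists; None marks missing.
    Returns a copy with each None replaced by that feature's mean."""
    n_feat = len(samples[0])
    means = []
    for j in range(n_feat):
        observed = [s[j] for s in samples if s[j] is not None]
        # If a feature is entirely missing, fall back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if s[j] is None else s[j] for j in range(n_feat)]
            for s in samples]

data = [[1.0, None],
        [3.0, 4.0],
        [None, 8.0]]
print(impute_column_means(data))
```

A learned model would replace the column mean with a prediction conditioned on the sample's other features and modalities, which is what makes multi-modal integration possible even when some data types are absent.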