The Acceleration of Genomic Sequencing
Whole-genome and whole-exome sequencing data are first-class citizens in NextGen, as variants derived from these data fundamentally dictate research and clinical outcomes both in our consortium and beyond.
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) are two ways of reading our genetic code. The genome is the complete set of DNA in a person – every letter of the biological instructions that make us who we are. Sequencing the whole genome means decoding all of it, every base pair from start to finish. The exome, by contrast, is the small portion of the genome – about one percent – that contains genes directly responsible for making proteins. These are the sections most often linked to disease, so sequencing just the exome is faster and cheaper while still highly informative for many medical questions.
In practice, both techniques start with fragments of DNA that are read by sequencing machines and then reassembled computationally into a continuous genetic map. Whole-genome sequencing gives the broadest picture, including regulatory and non-coding regions that influence how genes work. Whole-exome sequencing focuses on the areas most likely to reveal disease-causing mutations.
However, before the raw sequencing data can be turned into actionable variants, it needs to undergo a series of data processing steps. The gold-standard pipeline for this processing is the Genome Analysis Toolkit (GATK).
Genomic acceleration refers to speeding up the vast amount of computation required to transform raw sequencing data into meaningful insights. Modern sequencing machines can produce terabytes of data for a single genome, and analyzing this information – aligning sequences, identifying variants, and linking them to clinical outcomes – can be extremely resource-intensive.
Although it provides a highly accurate, fully automated method for analyzing sequenced reads, the GATK pipeline has emerged as a computational bottleneck: it is CPU-based, uses memory-intensive managed programming languages, and performs a lot of disk I/O to stage intermediate data. This makes large-scale genomic data analysis time-consuming and costly. As a result, a new market has grown for hardware-accelerated pipelines that use GPUs, FPGAs, or other accelerators to scale genomic data analysis. Unfortunately, all popular solutions that provide scalable acceleration are closed source and tightly integrated with high-end accelerators from a single manufacturer, which leaves hardware-accelerated genomics a niche solution available only to well-endowed clinics and laboratories.
Our goal in NextGen is to break this cost barrier by developing an open-source Genomics Acceleration Library that provides the same functionality as GATK but is built on open standards to deliver portable, vendor-agnostic hardware acceleration of genomic data analysis. To do this, we are building on the SYCL programming standard and developing SYCL-GAL, a library of SYCL-based implementations of the key algorithms required during post-alignment processing of sequencing data, which can be executed in a performance-portable manner on multi-vendor CPUs and GPUs.
Raja Appuswami, Luca Alessandro Remotti