Data-Driven Genomic Computing: Making Sense of the Signals from the Genome
Politecnico di Milano
Genomic computing is a new science focused on understanding how the genome functions, as a prerequisite for fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) can now produce an entire human genome sequence at a cost of about US$1,000, and many algorithms exist for extracting genome features, or “signals”, including peaks (enriched regions), variants (mutated DNA sequences), and gene expression (intensity of transcription activity). What is missing is a system that supports data integration and exploration and gives a “biological meaning” to the available information; such a system can be used, for example, in precision medicine, which aims to assign the best treatment to each patient.
In this talk, I will describe a new data-driven framework for extracting and integrating genomic features, which is made available to the scientific community. The framework builds on foundational data management abstractions, with the objective of simplifying and improving on the many low-level bioinformatics tools currently in use. We developed a new query language and system for managing genomic datasets in the cloud, with programmatic interfaces for R and Python; we also developed a repository that integrates open data produced by large international consortia, after designing and extracting a common core of semantically aligned metadata. I will also touch on some big data management problems that we face in providing optimized data access in the cloud, and on biological and clinical applications that have been developed using our systems. The framework internally uses the Spark big data engine and can be accessed at Cineca in Italy and at the Broad Institute in Cambridge (US).
This work is funded by an ERC Advanced Grant, “Data-Driven Genomic Computing” (GeCo).
Stefano Ceri is Professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research spans four decades (1978-2018) and has generally been concerned with extending database technologies to incorporate new features: distribution, object-orientation, rules, and streaming data. With the advent of the Web, his research turned to the engineering of Web-based applications and to search systems; more recently, he turned to genomic computing. He has authored over 350 publications (H-index 76) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: “Search Computing (SeCo)” (2008-2013), focused on the rank-aware integration of search engines to support multi-domain queries, and “Data-Driven Genomic Computing (GeCo)” (2016-2021), focused on new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM SIGMOD “Edgar F. Codd Innovations Award” (New York, June 26, 2013), an ACM Fellow, and a member of Academia Europaea.