Development of new bioinformatic approaches – PanCanRisk

eDiVA is an application for genomic variant annotation, variant ranking using multiple features, and family/trio analysis to uncover potential causal gene(s) from NGS experiments.

eDiVA covers the whole NGS data processing workflow, from alignment to variant calling to quality control and filtering, to ensure high-quality output data. The output is a multi-sample variant file, which is processed by the eDiVA-Annotate module to provide functional annotation for each variant. The resulting files can be further processed using Rare-variant Exome Wide Association Studies (REWAS) for large cohorts, or eDiVA-prioritize for small families, trios or single-case diagnostics.

The PanCanRisk project developed an adaptation of the eDiVA pipeline for distributed processing of large-scale whole exome sequencing (WES) data. The eDiVA code has been adapted to run reproducibly on virtually any computing infrastructure through a combination of Docker containers and the Nextflow workflow management system.

eDiVA software is available online.

ANNOTATED SET OF SOMATIC VARIANTS

Tumors are characterized by the accumulation of somatic mutations. Specific somatic mutations, or the somatic mutation load, can be used as a phenotype and tested for association with specific germline risk variants. Furthermore, germline variants in DNA damage repair genes can lead to an increased number of somatic mutations (higher mutation load). Hence, association of germline coding variants with mutation load can be used to identify cancer risk genes, similar to the use of expression QTL (eQTL) analysis for the identification of regulatory variants. Here we report the creation of a high-quality somatic variation dataset for TCGA and ICGC samples.
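The idea of treating mutation load as a phenotype can be illustrated with a minimal sketch. The code below is not the project's analysis method. It assumes a simple permutation test comparing the mean log-transformed mutation load of carriers and non-carriers of a rare germline variant in one gene. The function name, the input format and the toy numbers are hypothetical, and a real analysis would also need to adjust for covariates such as cancer type and population structure.

```python
import math
import random
from statistics import mean


def mutation_load_association(loads, is_carrier, n_permutations=10000, seed=0):
    """Permutation test for higher mutation load in germline variant carriers.

    loads      -- somatic mutation count per tumor (hypothetical input)
    is_carrier -- True if the patient carries a germline variant in the gene
    Returns (observed difference in mean log10 load, one-sided p-value).
    """
    if len(loads) != len(is_carrier):
        raise ValueError("loads and is_carrier must have the same length")
    n_carriers = sum(bool(flag) for flag in is_carrier)
    if n_carriers == 0 or n_carriers == len(loads):
        raise ValueError("need at least one carrier and one non-carrier")

    # Mutation counts are heavily right-skewed, so compare them on a log scale.
    log_loads = [math.log10(count + 1) for count in loads]

    def mean_difference(flags):
        carriers = [v for v, f in zip(log_loads, flags) if f]
        others = [v for v, f in zip(log_loads, flags) if not f]
        return mean(carriers) - mean(others)

    observed = mean_difference(is_carrier)

    # Shuffle carrier labels to build the null distribution of the difference.
    rng = random.Random(seed)
    flags = list(is_carrier)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(flags)
        if mean_difference(flags) >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data: the first three tumors come from carriers and have higher loads.
toy_loads = [500, 620, 480, 90, 110, 75, 130, 95, 105, 88]
toy_carriers = [True, True, True] + [False] * 7
difference, p_value = mutation_load_association(toy_loads, toy_carriers)
print(f"difference in mean log10 load: {difference:.2f}, p = {p_value:.4f}")
```

Repeating such a test gene by gene, with correction for multiple testing, mirrors the eQTL-like logic described above, with mutation load taking the place of expression level.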

We used the newly developed method cDriver to identify driver genes in 6,870 cases across 21 cancer types from the TCGA and ICGC cohorts. We identified 98 novel tumor type – driver gene connections and found that chromatin modifiers have often been overlooked as driver genes in previous studies.

ANNOTATED SET OF PREDICTED CNVS

Germline copy number variants (CNVs) are a common source of genomic variation involved in many genomic disorders, such as cancer or autism. Genomic microarrays, FISH, MLPA and many other technologies are widely used for the detection of CNVs. Whole-genome sequencing (WGS) is a well-established, highly accurate tool for the detection of point mutations and small indels. CNV detection using WGS data has been emerging as a competitive alternative for interrogating CNVs, but remains challenging. We have developed a new method, termed ClinCNV, for multi-sample germline CNV detection based on read depth and B-allele frequency. Using ClinCNV we analyzed a cohort of 2,834 WGS germline (normal) samples from 28 cancer types collected by the PCAWG consortium (ICGC and TCGA samples), 2,590 of which passed ClinCNV’s quality control. We detected 16,951 bi-allelic deletions, 6,512 bi-allelic duplications and 1,100 multi-allelic events (mCNVs) of size greater than or equal to 3 kb. False discovery rates (FDRs) for the three variant types were estimated using the IRS method [1] and available microarray intensity data. The estimated FDRs for bi-allelic deletions, bi-allelic duplications and multi-allelic events were 0.031, 0.037 and 0.014, respectively.
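The read-depth principle behind multi-sample CNV detection can be sketched as follows. This is not ClinCNV's actual model, which the text describes only as based on read depth and B-allele frequency. It is a deliberately naive illustration that assumes fixed genomic windows, median normalization per sample, and a cohort median as the diploid baseline. All names and numbers are hypothetical.

```python
from statistics import median


def estimate_copy_numbers(sample_depths, cohort_depths, ploidy=2):
    """Naive per-window copy number estimate from read depth.

    sample_depths -- read depth per genomic window for the sample of interest
    cohort_depths -- one list of per-window depths for each reference sample
    Returns one integer copy number per window (None if no baseline exists).
    """

    def normalise(depths):
        # Dividing by the sample median removes differences in sequencing depth.
        centre = median(depths)
        if centre <= 0:
            raise ValueError("sample has no usable coverage")
        return [depth / centre for depth in depths]

    sample_norm = normalise(sample_depths)
    cohort_norm = [normalise(depths) for depths in cohort_depths]

    copy_numbers = []
    for index, value in enumerate(sample_norm):
        # The cohort median per window captures window-specific coverage bias.
        baseline = median(ref[index] for ref in cohort_norm)
        if baseline <= 0:
            copy_numbers.append(None)
            continue
        copy_numbers.append(round(ploidy * value / baseline))
    return copy_numbers


# Toy cohort of three samples sequenced at different depths, five windows each.
toy_cohort = [[100] * 5, [200] * 5, [50] * 5]
toy_sample = [80, 80, 40, 120, 80]
print(estimate_copy_numbers(toy_sample, toy_cohort))
```

In this toy input the third window has half the expected depth and the fourth has one and a half times the expected depth, which corresponds to a heterozygous deletion and a duplication. A production caller would additionally model noise, segment adjacent windows and use B-allele frequencies to support or reject calls.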

Our high-quality germline CNV callset, predicted from a large number of normal tissue samples from cancer patients, may be used to identify CNVs overlapping known cancer predisposition (risk) genes, e.g. deletions of BRCA1 in breast and ovarian cancer. Furthermore, it can be used as control data for analyzing CNVs in other (non-cancer) disease cohorts.
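Identifying CNVs that overlap predisposition genes reduces to an interval overlap query. The sketch below assumes half-open coordinates and a simple tuple format for both CNV calls and gene annotations. The gene names and coordinates are placeholders for illustration, not real annotations, and the function name is hypothetical.

```python
def overlapping_genes(cnvs, genes):
    """Report every (gene, CNV) pair whose genomic intervals overlap.

    cnvs  -- list of (chromosome, start, end, cnv_type) tuples
    genes -- list of (gene_name, chromosome, start, end) tuples
    Intervals are treated as half-open: start is included, end is excluded.
    """
    hits = []
    for chrom, start, end, cnv_type in cnvs:
        for name, gene_chrom, gene_start, gene_end in genes:
            # Two half-open intervals overlap if each starts before the other ends.
            if chrom == gene_chrom and start < gene_end and gene_start < end:
                hits.append((name, cnv_type, chrom, start, end))
    return hits


# Placeholder genes and CNV calls with made-up coordinates.
toy_genes = [("GENE_A", "chr1", 1000, 5000), ("GENE_B", "chr2", 200, 900)]
toy_cnvs = [
    ("chr1", 4000, 9000, "deletion"),     # overlaps the end of GENE_A
    ("chr1", 6000, 7000, "duplication"),  # lies outside GENE_A
    ("chr2", 900, 1500, "deletion"),      # starts exactly where GENE_B ends
]
for hit in overlapping_genes(toy_cnvs, toy_genes):
    print(hit)
```

The nested loop is adequate for a short list of risk genes. For genome-wide annotation against a full callset, a sorted sweep or an interval tree would be the more usual choice.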

We conclude that the PCAWG WGS cohort is highly suitable for the identification of common and rare CNVs, but is problematic for the identification of CNVs associated with cancer risk due to population stratification issues.

OPEN-SOURCE SOFTWARE FOR INDEL AND CNV PREDICTIONS

We focused on the development of a new detection method for copy number variants (CNVs) that supports whole-genome sequencing (WGS), whole exome sequencing (WES) and targeted sequencing (‘gene panel’) analysis of both germline and somatic CNVs. As we identified suitable high-quality methods for indel prediction early in the project, we decided not to develop an additional indel caller, but to include the indel calling algorithms of GATK, freebayes, samtools and RTG in our pipeline (see deliverable D1.3). These indel callers were selected for the PCAWG cohort and extensively benchmarked in the PCAWG paper (https://www.biorxiv.org/content/early/2017/11/01/208330).

Here we describe the ClinCNV method for CNV calling, which is publicly released as open-source software at https://github.com/imgag/ClinCNV. A paper describing the ClinCNV method is in preparation.