Analysing genetic variation data within cancer types

TOOLS TO ACCOUNT FOR BATCH EFFECTS AND POPULATION STRUCTURE

We have established workflows for the study of regulatory variants, specifically for pan-cancer cohorts, which are compiled from heterogeneous datasets and in which confounding factors need to be appropriately accounted for. This development took place in the context of data from PCAWG, a collection of cancer samples from approximately 1000 individuals from 25 cancer types (http://pancancer.info/). The heterogeneity of these datasets is both genetic, i.e. due to differences in population structure between cohorts, and technical, i.e. due to processing factors and other unobserved covariates such as lab effects, batch effects and RNA extraction protocols. An additional challenge is the large variation in sample size, which requires pan-cancer analysis strategies to identify regulatory associations between individual genetic variants and gene expression levels.

We have established an integrated pipeline that makes it possible to compare and assess alternative strategies for adjusting for both types of confounding: genetic confounders and expression confounders. The workflow covers all major steps, including appropriate filters, the compilation of the final expression and genetic datasets, and the parallel execution of the genetic analysis on compute clusters.

We also evaluated alternative strategies to adjust for confounding factors and decided to use the PEER pipeline for further analyses of the PCAWG dataset.
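
The general idea of adjusting for genetic and expression confounders before testing a variant against a gene can be sketched as follows. This is a minimal illustration and not the PEER pipeline: hidden expression factors are approximated here by principal components, and all function names (top_pcs, residualize, adjusted_effect) are hypothetical. Covariates would typically combine genotype principal components (population structure) with expression-derived factors (batch and other technical effects).

```python
import numpy as np


def top_pcs(matrix, k):
    """Return the top-k principal component scores of a samples x features matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def residualize(y, covariates):
    """Remove the linear effect of the covariates (plus an intercept) from y."""
    design = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def adjusted_effect(genotype, expression, covariates):
    """Slope of expression on genotype after residualizing both on the covariates.

    This equals the genotype coefficient of a joint linear regression that
    includes the covariates (Frisch-Waugh-Lovell theorem).
    """
    g = residualize(genotype, covariates)
    e = residualize(expression, covariates)
    return float(g @ e / (g @ g))
```

In such a setup, the covariate matrix could be built with np.column_stack from top_pcs of the genotype matrix and top_pcs of the expression matrix, and adjusted_effect would then be evaluated for each variant-gene pair.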

NOVEL GLMM-BASED RVAS METHOD FOR RARE AND COMMON VARIANT ASSOCIATION

We developed a pipeline for analysing the association of rare variants with disease, i.e. for rare variant association studies (RVAS). Our approach was originally optimised for cohorts consisting of cases of a single disease or cancer type and controls studied by whole-exome sequencing (WES); hence the tool has been termed Rare-variant Exome Wide Association Study (REWAS). REWAS has since been adapted for whole-genome sequencing (WGS) as well as for multi-disease and pan-cancer studies.

REWAS integrates several analysis steps for data quality control, filtering, gene nomenclature standardization, selection of rare variants, multiple methods for the association analysis, permutation testing and multiple testing correction.
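
Three of the steps listed above (selection of rare variants, permutation testing and multiple testing correction) can be illustrated with a simple burden test. This is a minimal sketch under stated assumptions and not the REWAS implementation: genotypes are assumed to be coded as minor-allele counts (0/1/2), the 1% frequency threshold is an arbitrary example value, and all function names are hypothetical.

```python
import numpy as np


def burden_scores(genotypes, maf_threshold=0.01):
    """Count rare minor alleles per sample.

    genotypes: samples x variants array of minor-allele counts (0/1/2).
    Monomorphic variants and variants at or above the threshold are ignored.
    """
    maf = genotypes.mean(axis=0) / 2.0
    rare = (maf > 0) & (maf < maf_threshold)
    return genotypes[:, rare].sum(axis=1)


def permutation_pvalue(burden, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for an excess rare-allele burden in cases."""
    rng = np.random.default_rng(seed)
    is_case = np.asarray(is_case, dtype=bool)

    def stat(labels):
        return burden[labels].mean() - burden[~labels].mean()

    observed = stat(is_case)
    hits = sum(stat(rng.permutation(is_case)) >= observed for _ in range(n_perm))
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted
```

In a gene-based analysis, burden_scores and permutation_pvalue would be applied once per gene, and the resulting vector of p-values would be passed to benjamini_hochberg.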

The current version of REWAS, including a detailed manual and all adaptations for WGS and pan-cancer rare variant association studies, is deposited on GitHub.

Our REWAS tool now includes a novel association test based on the Integrated Nested Laplace Approximation (INLA; Rue et al., 2009), which is used to perform Bayesian inference within the generalised linear mixed model (GLMM) framework.
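
For orientation, a generic logistic GLMM for case-control data with variant-level random effects can be written as below. This is a textbook form given as an assumption; the exact model specification, priors and test statistic used in REWAS are not stated here. The symbols are illustrative: y_i is the case/control status of individual i, x_i a vector of fixed-effect covariates, g_ij the allele count of individual i at variant j of the tested unit, and u_j the random effect of that variant.

```latex
\begin{align}
y_i \mid p_i &\sim \mathrm{Bernoulli}(p_i), \\
\operatorname{logit}(p_i) &= \alpha + \mathbf{x}_i^{\top}\boldsymbol{\beta} + \sum_{j=1}^{m} g_{ij}\, u_j, \\
u_j &\sim \mathcal{N}(0, \sigma_u^{2}), \qquad j = 1, \dots, m.
\end{align}
```

In a model of this kind, evidence for association comes from the variance component \sigma_u^2 being different from zero. INLA approximates the posterior marginals of the latent effects and hyperparameters deterministically, without resorting to Markov chain Monte Carlo sampling.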

One of the main goals of the algorithm development performed within this project has been to increase the power to detect rare variant associations. In addition, the INLA implementation provides a very flexible approximation. Taking advantage of this flexibility, we have generalised the method to perform association tests at the network level instead of the gene level, thereby allowing prior knowledge of protein-protein interactions or pathways to be used to maximise the power to detect rare variants that act through a combination of genes in the same pathway.
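
Moving from gene-level to network-level testing amounts to changing the unit over which rare variants are aggregated. The sketch below illustrates this with a simple sum of per-gene burden scores over the members of each pathway. It is an assumption made for illustration and not the INLA-based test itself; the function name and the pathway dictionary format are hypothetical.

```python
import numpy as np


def pathway_burden(gene_burden, gene_names, pathways):
    """Sum per-gene burden scores over the genes belonging to each pathway.

    gene_burden: samples x genes array of rare-allele counts per gene.
    gene_names:  list of gene identifiers, one per column of gene_burden.
    pathways:    dict mapping a pathway name to an iterable of gene identifiers.
    Pathways with no gene present in gene_names are skipped.
    """
    index = {gene: i for i, gene in enumerate(gene_names)}
    result = {}
    for name, members in pathways.items():
        cols = [index[g] for g in members if g in index]
        if cols:
            result[name] = gene_burden[:, cols].sum(axis=1)
    return result
```

Each resulting pathway-level score vector can then be tested against case/control status in the same way as a gene-level score, with gene membership taken from a protein-protein interaction or pathway resource.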

We have finished implementing the approach and have evaluated its performance on the PCAWG data, which include germline variants from 2,834 patients affected by 20 cancer types.

In summary, while pathway analysis may help to identify complex gene networks involved in disease, it probably needs a stricter set of filters to reduce background noise, so that only variants likely to be functionally relevant to the disease of interest are included.