clinvaR vignette

James Diao

2017-08-03

This is a vignette intended to introduce new users to clinvaR (https://github.com/jamesdiao/clinvaR).

Purpose

For clinvaR, the first goal is to go from genes to a vcf of related variants, along with classifications from any version of ClinVar.
This involves 3 steps:

  1. Select a list of genes that you are interested in. clinvaR makes it easy to download and import relevant variants from 1KG.
  2. Select a version of ClinVar to assign pathogenicity labels with. All previous ClinVar VCFs are stored as binary files and can be retrieved with a “closest date” function.
  3. Select an analysis method. These include frequency, submission, and assertion counts over time, and aggregate frequency by ancestral group.

Dependencies

clinvaR has a number of file dependencies that are generally useful for these types of projects. I have attached them here. remotes and devtools are interchangeable.

## [1] remotes pander  ggplot2 tibble  tidyr   dplyr   stringr

Select a List of Genes

The first step is to have a list of genes that you are interested in.

Method 1: Direct assignment

The easiest way is to directly save your genes of interest in gene_panel as a character vector (all caps). We have taken the example of the 20 genes on the Laboratory of Molecular Medicine’s gene testing panel for hypertrophic cardiomyopathy.

hcm_panel <- c("ACTC1", "ACTN2", "CSRP3", "GLA", "LAMP2", "MYBPC3", "MYH7", 
"MYL2", "MYL3", "MYOZ2", "NEXN", "PLN", "PRKAG2", "PTPN11", "RAF1", 
"TNNC1", "TNNI3", "TNNT2", "TPM1", "TTR")

Method 2: Import from text file

From own file

Another way is to save your gene list in a text file (1 gene per line) and import it.

path <- system.file("extdata/MacArthur_Gene_Lists/lists/lmm_hcm.tsv", package = "clinvaR")
print(path)
## [1] "/Users/jamesdiao/Documents/Kohane_Lab/clinvaR/inst/extdata/MacArthur_Gene_Lists/lists/lmm_hcm.tsv"
hcm_panel <- get_genes(path)
print(hcm_panel)
##  [1] "ACTC1"  "ACTN2"  "CSRP3"  "GLA"    "LAMP2"  "MYBPC3" "MYH7"  
##  [8] "MYL2"   "MYL3"   "MYOZ2"  "NEXN"   "PLN"    "PRKAG2" "PTPN11"
## [15] "RAF1"   "TNNC1"  "TNNI3"  "TNNT2"  "TPM1"   "TTR"

From clinvaR stored lists

The MacArthur Lab has assembled a compilation of gene lists (https://github.com/macarthur-lab/gene_lists) that may be a useful starting point. We have included all sets (and a few of our own) with abbreviated descriptions. Please visit the GitHub repo for citation information.

List File Name Count Description
Universe universe.tsv 18,991 Approved symbols for 18,991 protein-coding genes according to HGNC. All other lists are subsets.
Hypertrophic cardiomyopathy lmm_hcm.tsv 20 Gene testing panel for hypertrophic cardiomyopathy from the Laboratory of Molecular Medicine
ACMG secondary findings recommendations v2.0 acmg_59.tsv 59 The American College of Medical Genetics and Genomics has recommended that sequencing laboratories seek and report secondary findings in a minimum list of 59 genes.
BROCA - Cancer Risk Panel BROCA_Cancer_Risk_Panel.tsv 66 Suspected hereditary cancer predisposition, with a focus on breast or ovarian cancers. May co-occur with other cancer types (such as colorectal, endometrial, pancreatic, endocrine, or melanoma).
DNA Repair Genes, KangJ DRG_KangJ.tsv 151 151 DNA repair genes from DNA repair pathways: ATM, BER, FA/HR, MMR, NHEJ, NER, TLS, XLR, RECQ, and other.
ClinGen haploinsufficient genes clingen_level3_genes_2015_02_27.tsv 221 Genes with sufficient evidence for dosage pathogenicity (level 3) as determined by the ClinGen Dosage Sensitivity Map
Genes with any disease association reported in ClinVar clinvar_path_likelypath.tsv 3078 All gene symbols for which there is at least one variant with an assertion of pathogenic or likely pathogenic in ClinVar.
FDA-approved drug targets fda_approved_drug_targets.tsv 286 Genes whose protein products are known to be the mechanistic targets of FDA-approved drugs.
Drug targets by Nelson et al 2012 drug_targets_nelson.tsv 201 Drug targets according to Nelson et al 2012.
X-linked ClinVar genes x-linked_clinvar.tsv 61 X chromosome genes in the August 6, 2015 ClinVar release that have at least 3 reportedly pathogenic, non-conflicted variants in ClinVar with at least one submitter other than OMIM or GeneReviews.
All dominant genes all_ad.tsv 709 Currently the union of the Berg and Blekhman autosomal dominant OMIM gene lists, may add more lists later.
All recessive genes all_ar.tsv 1183 Currently the union of the Berg and Blekhman autosomal recessive OMIM gene lists, may add more lists later.
Essential in culture core_essentials_hart.tsv 285 Genes deemed essential in multiple cultured cell lines based on shRNA screen data
Essential in mice mgi_essential.tsv 2,454 Genes where homozygous knockout in mice results in pre-, peri- or post-natal lethality.
Genes nearest to GWAS peaks gwascatalog.tsv 3,762 Closest gene 3’ and 5’ of GWAS hits in the NHGRI GWAS catalog as of Feb 9, 2015
Kinases kinases.tsv 351 From UniProt’s pkinfam list
G-protein-coupled receptors gpcr.tsv 1705 GPCR list from guidetopharmacology.org
Natural product targets natural_product_targets.tsv 37 List of hand-curated targets of natural products

These are easily imported using get_genes() by directly referring to the file name. Use the get_genes() command without arguments to list possibilities.

get_genes()
##  [1] "BROCA_Cancer_Risk_Panel.tsv"        
##  [2] "DRG_KangJ.tsv"                      
##  [3] "DRG_WoodRD.tsv"                     
##  [4] "acmg_59.tsv"                        
##  [5] "all_ad.tsv"                         
##  [6] "all_ar.tsv"                         
##  [7] "berg_ad.tsv"                        
##  [8] "berg_ar.tsv"                        
##  [9] "berg_xd.tsv"                        
## [10] "berg_xr.tsv"                        
## [11] "blekhman_ad.tsv"                    
## [12] "blekhman_ar.tsv"                    
## [13] "blekhman_x.tsv"                     
## [14] "clingen_level3_genes_2015_02_27.tsv"
## [15] "clinvar_path_likelypath.tsv"        
## [16] "core_essentials_hart.tsv"           
## [17] "drug_targets_nelson.tsv"            
## [18] "fda_approved_drug_targets.tsv"      
## [19] "fmrp_list_gencode.tsv"              
## [20] "gpcr.tsv"                           
## [21] "gwascatalog.tsv"                    
## [22] "kinases.tsv"                        
## [23] "lmm_hcm.tsv"                        
## [24] "mgi_essential.tsv"                  
## [25] "natural_product_targets.tsv"        
## [26] "olfactory_receptors.tsv"            
## [27] "universe.tsv"                       
## [28] "x-linked_clinvar.tsv"
hcm_panel <- get_genes('lmm_hcm.tsv')
print(hcm_panel)
##  [1] "ACTC1"  "ACTN2"  "CSRP3"  "GLA"    "LAMP2"  "MYBPC3" "MYH7"  
##  [8] "MYL2"   "MYL3"   "MYOZ2"  "NEXN"   "PLN"    "PRKAG2" "PTPN11"
## [15] "RAF1"   "TNNC1"  "TNNI3"  "TNNT2"  "TPM1"   "TTR"

The last way is to search OMIM for related genes. This has not been added as a full feature yet.

Download and Import 1000 Genomes VCF

#hcm_download_log <- download_1000g(genes = hcm_panel)
vcf <- import_file_1000g(genes = hcm_panel)
## [1] Importing [1/20] ACTC1
## [1] Importing [2/20] ACTN2
## [1] Importing [3/20] CSRP3
## [1] Importing [4/20] GLA
## [1] Importing [5/20] LAMP2
## [1] Importing [6/20] MYBPC3
## [1] Importing [7/20] MYH7
## [1] Importing [8/20] MYL2
## [1] Importing [9/20] MYL3
## [1] Importing [10/20] MYOZ2
## [1] Importing [11/20] NEXN
## [1] Importing [12/20] PLN
## [1] Importing [13/20] PRKAG2
## [1] Importing [14/20] PTPN11
## [1] Importing [15/20] RAF1
## [1] Importing [16/20] TNNC1
## [1] Importing [17/20] TNNI3
## [1] Importing [18/20] TNNT2
## [1] Importing [19/20] TPM1
## [1] Importing [20/20] TTR
vcf[1:3,1:8]
##    GENE    AF_1000G          VAR_ID CHROM      POS          ID REF ALT
## 1 ACTC1 0.000199681 15_35080338_G_A    15 35080338 rs140004011   G   A
## 2 ACTC1 0.000399361 15_35080409_A_G    15 35080409 rs184599823   A   G
## 3 ACTC1 0.000199681 15_35080521_T_C    15 35080521 rs142735669   T   C

Note that there are another 2504+ columns, including individual-level data (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).

Import ClinVar VCF

This must be annotated with some version of ClinVar. This can be done either using the latest saved version (July 5, 2017), or by specifying a date from the list.

get_date_list()
##  [1] "2012-06-16" "2012-10-26" "2012-11-05" "2012-12-31" "2013-01-14"
##  [6] "2013-01-18" "2013-02-26" "2013-05-06" "2013-08-08" "2013-09-05"
## [11] "2013-09-30" "2013-11-05" "2013-12-03" "2013-12-30" "2014-02-11"
## [16] "2014-03-03" "2014-04-01" "2014-04-30" "2014-06-04" "2014-07-02"
## [21] "2014-08-07" "2014-09-02" "2014-09-29" "2014-11-05" "2014-12-02"
## [26] "2015-01-06" "2015-02-03" "2015-03-05" "2015-03-30" "2015-05-04"
## [31] "2015-06-03" "2015-06-29" "2015-08-04" "2015-09-01" "2015-09-29"
## [36] "2015-11-02" "2015-12-01" "2016-01-04" "2016-02-03" "2016-03-02"
## [41] "2016-04-05" "2016-05-02" "2016-05-31" "2016-07-05" "2016-08-02"
## [46] "2016-08-31" "2016-10-03" "2016-11-01" "2016-11-28" "2017-01-04"
## [51] "2017-01-30" "2017-02-28" "2017-04-04" "2017-05-01" "2017-05-30"
## [56] "2017-07-05"

Use get_clinvar() to collect from a certain date (if date is not present, the closest date is used).

newest_clinvar <- get_clinvar(Sys.Date())
newest_clinvar[1:4,] #%>% select(-VAR_ID, -CLNDBN)
##         VAR_ID  GENE CHROM    POS REF ALT          ID CLNSIG
## 1 1_949523_C_T ISG15     1 949523   C   T rs786201005      5
## 2 1_949608_G_A ISG15     1 949608   G   A      rs1921      2
## 3 1_949739_G_T ISG15     1 949739   G   T rs672601312      5
## 4 1_955563_G_C  AGRN     1 955563   G   C rs539283387      3
##                                                 CLNDBN
## 1 Immunodeficiency_38_with_basal_ganglia_calcification
## 2                                        not_specified
## 3 Immunodeficiency_38_with_basal_ganglia_calcification
## 4                                        not_specified
##   pathogenic_incl_conflicts pathogenic_no_conflicts
## 1                      TRUE                    TRUE
## 2                     FALSE                   FALSE
## 3                      TRUE                    TRUE
## 4                     FALSE                   FALSE

Annotate Gene-Variant VCF

The ClinVar VCF can be used to annotate the Gene-Variant VCF. Note that this removes all variants without pathogenic assertions.

hcm_vcf <- annotate_1000g(vcf = vcf, clinvar = newest_clinvar, conflicts = TRUE)
hcm_vcf[1:3,1:12]
##    GENE    AF_1000G          VAR_ID CHROM       POS REF ALT          ID
## 1 TNNT2 0.000199681 1_201333455_G_A     1 201333455   G   A rs483352832
## 2 ACTN2 0.000599042 1_236849999_A_G     1 236849999   A   G rs121434525
## 3 CSRP3 0.000199681 11_19207878_C_T    11  19207878   C   T rs138218523
##             CLNSIG
## 1                5
## 2 5, 0, 3, 0, 3, 3
## 3     255, 4, 2, 2
##                                                                                                                                                                  CLNDBN
## 1                                                                                                                                            Dilated_cardiomyopathy_1DD
## 2 Dilated_cardiomyopathy_1AA, not_specified, Dilated_cardiomyopathy, Cardiovascular_phenotype, Dilated_cardiomyopathy_1AA, Primary_familial_hypertrophic_cardiomyopathy
## 3                                       not_specified, Primary_familial_hypertrophic_cardiomyopathy, Dilated_cardiomyopathy_1M, Familial_hypertrophic_cardiomyopathy_12
##   pathogenic_incl_conflicts pathogenic_no_conflicts
## 1                      TRUE                    TRUE
## 2                      TRUE                   FALSE
## 3                      TRUE                   FALSE

If we wanted to do this in fewer lines of code (with all the defaults used above), we could have done this more quickly by simply specifying the genes in annotate. We can also specify whether we want to exclude the (likely) pathogenic variants with any (likely) benign conflicts as well. These are included by default.

Simple Analyses

Plot variant frequency by ancestry

You can use the plot function (plot.annotated_vcf) to look at the frequency of these variants across ancestral groups.

  1. Fraction = TRUE returns the fraction of the population with at least 1 of the genes.
  2. Fraction = FALSE returns the average number of variants per person.
plot(hcm_vcf, fraction = TRUE)

plot(hcm_vcf, fraction = FALSE)

Similar plots can be generated for the un-annotated VCF using a slightly different plot function (plot.plain_vcf).

plot(vcf, fraction = FALSE)

Plot ClinVar Time-Series: over_time(vcf)

over_time() works for unannotated vcfs. It generates a plot of aggregated allele frequency over time.

over_time(vcf, verbose = FALSE) -> output
plot(output$Frequencies)

plot(output$Submissions)

plot(output$Significances)

#output$Frequencies %>% kable
#output$Submissions %>% kable

To add penetrance computations, specify the prevalence and case frequency.

over_time(vcf, prevalence = 0.002, case_freq = 0.4, verbose = FALSE) -> output
plot(output$Penetrance)