GenomeVIP: A Genomics Analysis Pipeline for Cloud Computing with Germline and Somatic Calling on Amazon's Cloud
  R. Jay Mashl, The Genome Institute at Washington University
  October 20, 2014
Turnkey Variant Analysis Project
  Provides a collection of analysis tools and computational frameworks for streamlined discovery and interpretation of genetic variants.
  Genome Variant Investigation Portal: VarScan, Pindel, BreakDancer
  Multi-tool variant discovery, cloud computing, scalability, extensibility
  Runs locally or on the cloud (AWS)
  Poster #1678M (Monday)
  tvap.genome.wustl.edu
Genome Variant Investigation Portal
  Web server and interface for germline and somatic variant-discovery tools:
  VarScan: heuristic/statistical calling of single nucleotide variants (SNVs)
  Pindel: indel detection for paired reads based on local realignment
  BreakDancer: structural variant (SV) detection for paired reads
  GenomeSTRiP (Harvard U.): structural variant detection and genotyping
  Concurrent pipelines (SNV, indel, SV) with parallelization
  Launchable on local machines or on the cloud through Amazon Web Services (AWS)
  Download results from AWS via web browser
Biological Discoveries (selected)
  Comprehensive molecular portraits of human breast tumours. Nature 490, 61-70 (2012)
    Identified four main types by combining data from five platforms.
  Clonal evolution in relapsed acute myeloid leukaemia. Nature 481, 506-510 (2012)
    Cancer consists of multiple variants; the founding clone may give rise to the relapse clone; subclones may survive therapy and mutate further.
  Genomic Landscape of Non-Small Cell Lung Cancer in Smokers and Never-Smokers. Cell 150, 1121-34 (2012)
    Among patients with lung cancer, smokers were found to have 10x more mutations than non-smokers.
  Discovery & genotyping for structural variants in populations. Nature Genetics 43, 269-276 (2011)
    ~14,000 deletion polymorphisms with allelic states (1000G pilot).
Application to APOL1: Demo
  Representative samples from the PUR population from 1000 Genomes.
  Analyze within the range chr22:36-37 Mbp for known variants:
  Sample    Region          Variant        Isoforms
  HG01242   22:36,661,906   A / G          G1 (non-silent)
  HG01101   22:36,662,041   AATAATT / A    G2 (Δ6)
  HG01049   22:36,133,448   Δ 767 bp
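As a quick sanity check on the sample list above, the following Python sketch encodes the three known variants and confirms that each lies inside the 36-37 Mbp window on chromosome 22. The names KNOWN_VARIANTS, WINDOW, and in_window are illustrative assumptions and are not part of GenomeVIP; the window is read here as 36,000,000 to 37,000,000 with inclusive bounds.

```python
# Illustrative only: these names are hypothetical, not GenomeVIP identifiers.
# Positions are copied from the slide; "36-37 Mbp" is assumed to mean
# 36,000,000-37,000,000 with both ends included.
KNOWN_VARIANTS = [
    {"sample": "HG01242", "chrom": "22", "pos": 36_661_906, "variant": "A / G"},
    {"sample": "HG01101", "chrom": "22", "pos": 36_662_041, "variant": "AATAATT / A"},
    {"sample": "HG01049", "chrom": "22", "pos": 36_133_448, "variant": "767 bp deletion"},
]

WINDOW = ("22", 36_000_000, 37_000_000)


def in_window(variant, window=WINDOW):
    """Return True if the variant's position falls inside the window."""
    chrom, start, end = window
    return variant["chrom"] == chrom and start <= variant["pos"] <= end


for v in KNOWN_VARIANTS:
    print(v["sample"], v["pos"], in_window(v))
```

All three positions fall inside the window, so a single region restriction covers the whole demo.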
Login
  Select AWS.
  Click Next.
Sample & Reference Selection
  Specify path & retrieve: copy the given URI, then click Retrieve.
  Select samples: click on all the PUR low_coverage items to transfer them to the Selected bams textbox.
  Select reference: hs37d5.chr22.fa
  Click Next.
SNV Detection: VarScan
  Check VarScan.
  Select Germline.
  Select SNVs only.
  Select All (pooled) samples.
  Select User-defined region and enter 22:36130000-36700000
  Keep p-value: 0.99
  Set Output vcf: True
  Click Next.
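The region string entered in this step follows the common chrom:start-end pattern. A minimal Python sketch of how such a string could be parsed and validated is shown below. The function parse_region is a hypothetical helper, not part of GenomeVIP or VarScan, and treating both coordinates as inclusive is an assumption.

```python
import re

# Hypothetical helper, not a GenomeVIP or VarScan function.
# Accepts strings such as "22:36130000-36700000"; commas in numbers are ignored.
REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Split a chrom:start-end string into (chrom, start, end)."""
    match = REGION_RE.match(text.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return match["chrom"], start, end


chrom, start, end = parse_region("22:36130000-36700000")
print(chrom, start, end)   # 22 36130000 36700000
print(end - start)         # 570000
```

The demo region therefore spans 570 kbp of chromosome 22.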
Indel Detection: Pindel
  Check Run Pindel.
  Select All (pooled) samples.
  Select User-defined region and enter 22:36130000-36700000
  Click Next.
SV Detection: BreakDancer
  Check BreakDancer.
  In Step 1, select All (pooled) samples.
  In Step 3, select Intra (ITX) only, then select User-defined region and enter 22:36130000-36700000
  Click Next.
SV Detection & Genotyping: GenomeSTRiP
  1. Check Run GenomeSTRiP
  2. Verify reference is hs37d5.chr22.fa
  3. Select mask Hs37d5 human_g1k_v37.mask.36.fasta.chr22
  4. GC normalization: True, with cn2_mask_g1k_v37.fasta
  5. Chromosome: User-defined with 22:36130000-36700000
  6. Variant size: 100 bp to 100 kbp
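To illustrate what the variant size setting implies, the Python sketch below checks whether an event length falls within 100 bp to 100 kbp. The names MIN_SIZE_BP, MAX_SIZE_BP, and within_size_range are hypothetical, and inclusive bounds are an assumption; how GenomeSTRiP itself applies the limits is not stated here.

```python
# Hypothetical names; only the two bounds come from the slide's size setting.
MIN_SIZE_BP = 100        # 100 bp
MAX_SIZE_BP = 100_000    # 100 kbp


def within_size_range(svlen, lo=MIN_SIZE_BP, hi=MAX_SIZE_BP):
    """Return True if the absolute event length lies in [lo, hi] (assumed inclusive)."""
    return lo <= abs(svlen) <= hi


print(within_size_range(-767))  # True: a 767 bp deletion is inside the range
print(within_size_range(-6))    # False: a 6 bp deletion is below the lower bound
```

Under these assumed bounds, the 767 bp deletion from the demo is in range, while a 6 bp deletion is not.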
Amazon AWS Submission
  Select machine type. Jobs have been tested to finish within a few minutes.
  Choose where to send results.
  Validate & submit.
Results
  22 36133341 DEL_1 T <DEL> SVLEN=-762;SVTYPE=DEL
  22 36662041 . AATAATT A . PASS END=36662047;HOMLEN=4;HOMSEQ=ATAA;SVLEN=-6;SVTYPE=DEL;
  22 36661906 . A G . PASS ADP=7;WT=1;HET=0;HOM=0;NC=2
  22 36662041 . AATAATT A . PASS ADP=4;WT=0;HET=1;HOM=0;NC=2;
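The semicolon-separated annotations in these records are VCF-style INFO fields. The Python sketch below shows one way to split such a field into key/value pairs and cross-check the 6 bp deletion record. The function parse_info is a hypothetical helper written for this illustration; it is not part of GenomeVIP or any of the callers.

```python
# Hypothetical helper for illustration; not part of GenomeVIP or the callers.
def parse_info(info):
    """Split a VCF-style INFO string into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:            # skip the empty piece left by a trailing ';'
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# INFO field, position, and alleles copied from the 6 bp deletion record.
pos, ref, alt = 36662041, "AATAATT", "A"
info = parse_info("END=36662047;HOMLEN=4;HOMSEQ=ATAA;SVLEN=-6;SVTYPE=DEL;")

print(info["SVTYPE"])              # DEL
print(int(info["END"]) - pos)      # 6
print(len(alt) - len(ref))         # -6, matching SVLEN
```

The END coordinate, the allele lengths, and SVLEN agree on a 6 bp deletion at 22:36,662,041.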
Jay Mashl (rmashl@genome.wustl.edu)
  Kai Ye (kye@genome.wustl.edu)
  Li Ding (lding@genome.wustl.edu)
  ...and with thanks to the Ding Lab members
  Poster #1678 / M (this afternoon)
  http://tvap.genome.wustl.edu/
  National Human Genome Research Institute
Alternate slides
Amazon AWS S3 Data Retrieval
  Links to the actual files to be generated, along with the merged VCF.
  Click links to download.
  Participants will identify variants in the output.
  (Left) Prepared results available, in case of technology problems.