Analyzing Genomic Data with PyEnsembl and Varcode. Alex Rubinsteyn SciPy - July 9th, 2015

Similar documents
Machine Learning For Personalized Cancer Vaccines. Alex Rubinsteyn February 9th, 2018 Data Science Salon Miami

Session 4 Rebecca Poulos

Hands-On Ten The BRCA1 Gene and Protein

Session 4 Rebecca Poulos

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Bioinformatics Laboratory Exercise

Introduction. Introduction

Finding subtle mutations with the Shannon human mrna splicing pipeline

Reporting TP53 gene analysis results in CLL

The Cancer Genome Atlas & International Cancer Genome Consortium

Trinity: Transcriptome Assembly for Genetic and Functional Analysis of Cancer [U24]

Section Chapter 14. Go to Section:

MutationTaster & RegulationSpotter

Golden Helix s End-to-End Solution for Clinical Labs

Genome. Institute. GenomeVIP: A Genomics Analysis Pipeline for Cloud Computing with Germline and Somatic Calling on Amazon s Cloud. R. Jay Mashl.

Functional analysis of DNA variants

Bio 111 Study Guide Chapter 17 From Gene to Protein

The feasibility of circulating tumour DNA as an alternative to biopsy for mutational characterization in Stage III melanoma patients

PRECISION INSIGHTS. GPS Cancer. Molecular Insights You Can Rely On. Tumor-normal sequencing of DNA + RNA expression.

Data mining with Ensembl Biomart. Stéphanie Le Gras

BWA alignment to reference transcriptome and genome. Convert transcriptome mappings back to genome space

NGS in tissue and liquid biopsy

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Breast and ovarian cancer in Serbia: the importance of mutation detection in hereditary predisposition genes using NGS

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

Studying Alternative Splicing

A Practical Guide to Integrative Genomics by RNA-seq and ChIP-seq Analysis

RNA and Protein Synthesis Guided Notes

Identifying Mutations Responsible for Rare Disorders Using New Technologies

Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009

Introduction to Systems Biology of Cancer Lecture 2

CANCER GENETICS PROVIDER SURVEY

CHR POS REF OBS ALLELE BUILD CLINICAL_SIGNIFICANCE

Molecular Characterization of Tumors Using Next-Generation Sequencing

6.3 DNA Mutations. SBI4U Ms. Ho-Lau

Chapter 18. Viral Genetics. AP Biology

For all of the following, you will have to use this website to determine the answers:

RNA-seq Introduction

Protein Synthesis

Gene Finding in Eukaryotes

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep

Central Dogma. Central Dogma. Translation (mrna -> protein)

MEDICAL GENOMICS LABORATORY. Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG)

Analysis with SureCall 2.1

AGENDA for 02/11/14 AGENDA: HOMEWORK: Due Thurs, OBJECTIVES: Quiz tomorrow, Thurs, : The Genetic Code

Enterprise Interest Thermo Fisher Scientific / Employee

PSSV User Manual (V1.0)

CITATION FILE CONTENT/FORMAT

DNA codes for RNA, which guides protein synthesis.

The neoepitope landscape of breast cancer: implications for immunotherapy

PROTEIN SYNTHESIS. It is known today that GENES direct the production of the proteins that determine the phonotypical characteristics of organisms.

University of Pittsburgh Cancer Institute UPMC CancerCenter. Uma Chandran, MSIS, PhD /21/13

Clinical Genetics. Functional validation in a diagnostic context. Robert Hofstra. Leading the way in genetic issues

Analysis of Region-Specific Transcriptomic Changes in the Autistic Brain

An Analysis of MDM4 Alternative Splicing and Effects Across Cancer Cell Lines

The Cancer Genome Atlas

6/12/2018. Disclosures. Clinical Genomics The CLIA Lab Perspective. Outline. COH HopeSeq Heme Panels

LESSON 3.2 WORKBOOK. How do normal cells become cancer cells? Workbook Lesson 3.2

SpliceDB: database of canonical and non-canonical mammalian splice sites

Cancer Informatics Lecture

Multiple sequence alignment

Proteogenomic analysis of alternative splicing: the search for novel biomarkers for colorectal cancer Gosia Komor

Ambient temperature regulated flowering time

George R. Honig Junius G. Adams III. Human Hemoglobin. Genetics. Springer-Verlag Wien New York

GENE EXPRESSION. Amoeba Sisters video 3pk9YVo. Individuality & Mutations

ABS04. ~ Inaugural Applied Bayesian Statistics School EXPRESSION

SUPPLEMENTARY INFORMATION. Intron retention is a widespread mechanism of tumor suppressor inactivation.

Alternative RNA processing: Two examples of complex eukaryotic transcription units and the effect of mutations on expression of the encoded proteins.

Supplemental Information For: The genetics of splicing in neuroblastoma

Characterisation of structural variation in breast. cancer genomes using paired-end sequencing on. the Illumina Genome Analyser

Bacterial Gene Finding CMSC 423

Correspondence to Nature Genetics: Exploring pediatric cancer mutation information using ProteinPaint

ACE ImmunoID Biomarker Discovery Solutions ACE ImmunoID Platform for Tumor Immunogenomics

Session 6: Integration of epigenetic data. Peter J Park Department of Biomedical Informatics Harvard Medical School July 18-19, 2016

Cancer Gene Panels. Dr. Andreas Scherer. Dr. Andreas Scherer President and CEO Golden Helix, Inc. Twitter: andreasscherer

Multi-omics data integration colon cancer using proteogenomics approach

No mutations were identified.

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

Long non-coding RNAs

MUTATIONS, MUTAGENESIS, AND CARCINOGENESIS

MEDICAL GENOMICS LABORATORY. Non-NF1 RASopathy panel by Next-Gen Sequencing and Deletion/Duplication Analysis of SPRED1 (NNP-NG)

AACR GENIE Data Guide

SUPPLEMENTARY INFORMATION

Investigating rare diseases with Agilent NGS solutions

THE UMD TP53 MUTATION DATABASE UPDATES AND BENEFITS. Pr. Thierry Soussi

Gene Expression. From a gene to a protein

PERSONALIZED GENETIC REPORT CLIENT-REPORTED DATA PURPOSE OF THE X-SCREEN TEST

TOWARDS ACCURATE GERMLINE AND SOMATIC INDEL DISCOVERY WITH MICRO-ASSEMBLY. Giuseppe Narzisi, PhD Bioinformatics Scientist

Supplementary Figure 1. Estimation of tumour content

Summary... 2 TRANSLATIONAL RESEARCH Tumour gene expression used to direct clinical decision-making for patients with advanced cancers...

Deploying the full transcriptome using RNA sequencing. Jo Vandesompele, CSO and co-founder The Non-Coding Genome May 12, 2016, Leuven

The 100,000 Genomes Project Harnessing the power of genomics for NHS rare disease and cancer patients

Genetic Testing and Analysis. (858) MRN: Specimen: Saliva Received: 07/26/2016 GENETIC ANALYSIS REPORT

Concurrent Practical Session ACMG Classification

Supplementary Figure 1. Schematic diagram of o2n-seq. Double-stranded DNA was sheared, end-repaired, and underwent A-tailing by standard protocols.

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Accel-Amplicon Panels

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

Transcription:

Analyzing Genomic Data with PyEnsembl and Varcode Alex Rubinsteyn SciPy - July 9th, 2015

HammerLab @ Mount Sinai http://www.hammerlab.org/research/

Not biologists! Most lab members have a background in Computer Science and programming: Distributed data management Machine learning/statistics Programming languages Data visualization

Lab focus Software development: building high-quality open source software for biomedicine Research: use that software to change how a physician treats a patient

Lab research projects Can we predict response to immunotherapy from melanoma & lung cancer mutations? (collaboration with Alex Snyder & Matt Hellman) Does chemotherapy increase heterogeneity of mutations across cells in ovarian cancer? (collaborations with John Martignetti and Alex Snyder) Can we reprogram the immune system to attack cancer cells making mutated proteins? (collaboration with Nina Bhardwaj s lab) Personalized cancer vaccine pipeline written in Python Phase I clinical trial (~20 patients) starting soon (!!)

tl;dr of high throughput DNA sequencing for cancer Peripheral Blood Normal DNA Fragments Sequenced Reads Cancer ( somatic ) Mutations Tumor Tumor DNA Fragments Sequenced Reads

How do we find mutations? Normal DNA Reads Tumor DNA Reads Reference Genome Sequence

Variant file format: VCF Entry for variant chr20:1235678 GTC>G Source: wiki.nci.nih.gov

Cancer-specific variant file format: MAF Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbsnp_rs dbsnp_val_status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_file Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID AGL 178 genome.wustl.edu 37 1 100349684 100349684 + Missense_Mutation SNP G G A TCGA-13-1405-01A-01W-0494-09 TCGA-13-1405-10A-01W-0495-09 G G G A G G Unknown Valid Somatic 4 WXS 454_PCR_WGA 1 dbgap Illumina GAIIx c0d1de72-4cce-4d74-93f0-29c462dc1426 89f04056-0478-4305-b1ce-486ae469b4dd SASS6 163786 genome.wustl.edu 37 1 100573197 100573197 + Missense_Mutation SNP G G A TCGA-04-1542-01A-01W-0553-09 TCGA-04-1542-10A-01W-0553-09 G G G A G G Unknown Valid Somatic 4 WXS 454_PCR_WGA 1 dbgap Illumina GAIIx 317a63afe862-43df-8ef5-7c555b2cb678 b94052a8-c3d2-4e47-81e2-62242bc0841a

Working With Genomic Mutation Data in Python PyEnsembl Python interface to the Ensembl genome annotations Where is each gene/transcript/exon? What s the biotype of each transcript? (e.g. long noncoding RNA, protein coding) Varcode Read VCF (& MAF) files Filter variants (e.g. only protein coding genes) Compare collections of variants (e.g. do these two patients cancers share mutations?) Predict effect of mutations on protein sequence

Getting Started Download reference genome annotation data: $ pip install pyensembl varcode $ pyensembl install --release 75 and wait for ~10 minutes for download and indexing of several GB of reference sequence and annotation data. Ensembl releases specific to the version of the human genome variants are aligned against: GRCh37/NCBI37/hg19 = Ensembl release 75 GRCh38 = Ensembl release 80 (newer releases coming)

PyEnsembl Example: Looking up info about the TP53 gene

PyEnsembl Example: What s the sequence of TP53-001?

Varcode Example: Loading a MAF file Each mutation represented by a Variant object Collection of Variant objects is called a VariantCollection!

Varcode: Which gene is most mutated in ovarian cancer?

Varcode: What are the protein effects of mutations in TP53? Many effect classes corresponding to kinds of changes in protein sequence (or describing which non-coding region a mutation affects), examples: Substitution (change on amino acid into another) Frameshift (change in codon translation frame) Intronic (mutation gets spliced out of transcript)

Varcode: What s the sequence of a mutant protein? Don t just want to know what kind of effect a mutation has, but specifically what will the sequence of the new protein be.

Thanks! PyEnsembl www.github.com/hammerlab/pyensembl Varcode www.github.com/hammerlab/varcode