Analysis of Massively Parallel Sequencing Data: Application of Illumina Sequencing to the Genetics of Human Cancers

Gordon Blackshields, Senior Bioinformatician, Source BioScience

Introduction: Applications of NGS to Cancer Genetics Studies
Next Generation Sequencing (NGS) on the Illumina platform is suitable for clinical applications that require large amounts of information, accurate quantification and high-sensitivity detection:
- Mutation detection in tumours (from biopsies / circulating tumour cells (CTC))
- Pathogen detection, e.g. organism identification for epidemiological investigations
- Gut microbial flora genomics
- Detection of the presence of antibiotic resistance genes
- Comparison of novel sequences / genes to those in public databases

Applications of NGS to Cancer Genetics: Some Commonly Applied Techniques
Sequencing the Genome
- Reference alignment and targeted resequencing for polymorphism and mutation discovery
- De novo assembly for characterisation of novel genes and genomes
- Paired-end sequencing highlights larger structural variants (inherited/acquired)
Sequencing the Transcriptome
- RNA-Seq allows absolute quantification of gene expression across the transcriptome
- No prior knowledge of content needed: can quantify expression of unknown genes
- Profiling of mRNA, ncRNA, miRNA
Sequencing the Cistrome
- ChIP-Seq allows profiling of the cis-acting targets (DNA binding sites) of a trans-acting factor (transcription factor, restriction enzyme, etc.) on a genome scale
- Determine how proteins interact with DNA to regulate gene expression
- Determine how TFs and other proteins influence phenotype-affecting mechanisms
- A similar approach can be used to characterise genomic methylation patterns (the methylome)

Applications of NGS to Cancer Genetics: Levels of Information Extraction, Data Integration
[Figure: workflows (Variant Detection; RNA-Seq Quantification; RNA-Seq Discovery; ChIP-Seq) arranged by level of information extraction, highest level first]
- Integrate: associate observed variants with regulation/transcriptional changes; link to external databases
- Analyse: overlapping genes; differential expression; novel isoforms; associated genes
- Identify: variant detection; expression levels; novel gene models; motif finding
- Process: targeted resequencing; density on known exons; novel transcripts; binding sources; consensus sequence; identify splice-crossing reads; enriched regions
- Generate: de novo assembly / reads mapped to (un)annotated reference sequence, starting from 10^8-10^9 short DNA fragments

Human Resequencing and Variant Detection: Reference Assembly, Targeted Resequencing and Variant Detection
Search for alterations at the nucleotide level to explain changes in regulation/transcription
Single-end (SE) sequencing
- ~85% of complex genome accessible
- Suitable for SNPs and small indels (DIPs)
Paired-end (PE) sequencing
- ~99% of complex genome accessible
- Find longer DIPs
- Find larger structural variations
- Span repeat regions
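
One way paired-end data can point to larger structural variation is through read pairs whose mapped insert size is far from the library's typical value. The following is a minimal sketch of that idea only, assuming insert sizes have already been extracted from an alignment; the function name `flag_discordant_pairs`, the `tolerance` parameter and the toy values are hypothetical and not taken from the slides.

```python
from statistics import median

def flag_discordant_pairs(insert_sizes, tolerance=3.0):
    """Return indices of read pairs whose insert size is far from the library median.

    The median absolute deviation (MAD) is used as a robust estimate of spread.
    The tolerance multiplier is an illustrative assumption, not a recommended value.
    """
    med = median(insert_sizes)
    mad = median(abs(s - med) for s in insert_sizes)
    # Guard against a zero MAD when most pairs share the same insert size.
    spread = mad if mad > 0 else 1.0
    return [i for i, s in enumerate(insert_sizes) if abs(s - med) > tolerance * spread]

# Hypothetical insert sizes (bp) for ten read pairs from one library.
sizes = [298, 305, 301, 296, 310, 300, 2150, 303, 299, 40]
discordant = flag_discordant_pairs(sizes)
print(discordant)
```

Pairs mapping much further apart than expected may indicate a deletion between the mates, and pairs mapping much closer may indicate an insertion; real structural-variant callers also use orientation and clustering of such pairs.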

Human Resequencing and Variant Detection: 2009 Nature Paper
- Cytogenetically normal AML genome sequenced (32x)
- Comparison with matched normal tissue (14x)
- 98 full runs on Illumina GA to achieve required depth
- Alignment and variant discovery performed by MAQ
- 97.7% of variants in AML genome also in normal
- Further restricted to annotated gene-coding regions
- Across all tumour cells: found 10 genes with acquired mutations (8 novel), present in all cells at presentation and relapse
"Our study establishes whole genome sequencing as an unbiased method for discovering initiating mutations in cancer genomes, and for identifying novel genes that may respond to targeted therapies"
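
The tumour-versus-normal comparison above amounts to removing variants that are also seen in the matched normal sample. A minimal sketch of that subtraction, assuming each variant is represented as a (chromosome, position, reference base, alternate base) tuple; the function name and the toy variants are hypothetical, and the published analysis applied further filters that are not shown here.

```python
def split_variants(tumour, normal):
    """Split tumour variants into those shared with normal tissue and those acquired."""
    shared = tumour & normal    # also present in the matched normal: likely inherited
    acquired = tumour - normal  # tumour only: candidate somatic mutations
    return shared, acquired

# Hypothetical toy data: (chromosome, position, reference base, alternate base).
tumour = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "C", "T"),
    ("chr13", 2500, "G", "A"),
    ("chr17", 7500, "C", "A"),
}
normal = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "C", "T"),
    ("chr13", 2500, "G", "A"),
    ("chrX", 42, "T", "C"),
}

shared, acquired = split_variants(tumour, normal)
shared_fraction = len(shared) / len(tumour)
print(f"{shared_fraction:.1%} of tumour variants also in normal")
print("Candidate acquired variants:", sorted(acquired))
```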

Polymorphism Detection within P53: P53 Variant Detection Study
P53
- "Guardian of the genome" (Lane, 1992)
- Protects fidelity of DNA replication
- Directs cell arrest/apoptosis when stressed
- Mutated in more than half of human cancers
- Human TP53 gene located on 17p13.1
- Region sometimes deleted in human cancer
Study
- Search for variants on the P53 gene in matched tumour samples
- Use gene-specific PCR to amplify exons only, to maximise depth of coverage
- Use MAQ for alignment and variant discovery against the P53 reference gene
- Comparison with results from 454 and Sanger sequencing
[Figure: Coverage of p53 gene. x-axis: gene position (12000 to 19000); y-axis: coverage per base position (up to 35000)]
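
Coverage per base position, as plotted for the p53 amplicons, can be computed from the start and end coordinates of the aligned reads. The sketch below is one simple way to do this and is not the method used in the study; it assumes half-open (start, end) intervals on the reference, and the function name and toy coordinates are hypothetical.

```python
def per_base_coverage(reads, region_start, region_end):
    """Count how many reads cover each base in [region_start, region_end).

    reads: iterable of (start, end) half-open intervals on the reference.
    Uses a difference array so each read costs O(1) regardless of its length.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in reads:
        # Clip the read to the region of interest.
        s = max(start, region_start) - region_start
        e = min(end, region_end) - region_start
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage

# Hypothetical aligned reads around a 10-base region starting at position 100.
reads = [(100, 105), (102, 108), (104, 110), (95, 101)]
cov = per_base_coverage(reads, 100, 110)
print(cov)
```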

Polymorphism Detection within BRCA1: BRCA1 Variant Detection Study
BRCA1
- Human tumour suppressor gene
- Primarily expressed in breast tissue
- Helps repair damaged DNA (if possible)
- Mutations to BRCA1 allow uncontrolled replication of damaged cells
Pilot Study
- Search for variants on the BRCA1 gene
- Use gene-specific PCR to amplify exons only, to maximise depth of coverage
- Multiplexed: 11 samples loaded into one lane
- Use CASAVA for de-multiplexing and alignment
- Use SAMtools for consensus/indel calling and filtering
- Validation of results against known variants
Pipeline
- CASAVA: demultiplex (11 samples), then map reads to reference (BRCA1)
- SAMtools: conversion to SAM format, conversion to pileup format, consensus/indel calling, filter for variants
- Comparison with known variants
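
The last pipeline steps (calling from a pileup, filtering, and comparing with known variants) can be illustrated with a deliberately simplified sketch. It is not how SAMtools calls variants, which relies on a statistical model; here a position is called when depth and alternate-allele fraction pass fixed cut-offs. The thresholds, function name, data layout and toy values are all assumptions made for illustration.

```python
def call_variants(pileup, ref, min_depth=20, min_alt_fraction=0.2):
    """Call simple base substitutions from per-position base counts.

    pileup: {position: {base: count}}; ref: {position: reference base}.
    Returns {(position, ref_base, alt_base): alt_fraction}.
    Thresholds are illustrative assumptions, not SAMtools defaults.
    """
    calls = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        for base, n in counts.items():
            if base != ref[pos] and n / depth >= min_alt_fraction:
                calls[(pos, ref[pos], base)] = n / depth
    return calls

# Hypothetical pileup counts for three positions of an amplicon.
pileup = {
    101: {"A": 95, "G": 5},   # low-level mismatch, likely sequencing error
    102: {"C": 48, "T": 52},  # roughly heterozygous C>T
    103: {"G": 6, "A": 4},    # depth below the cut-off
}
ref = {101: "A", 102: "C", 103: "G"}
known = {(102, "C", "T"), (150, "G", "A")}  # hypothetical list of known variants

calls = call_variants(pileup, ref)
confirmed = set(calls) & known  # called and already known
novel = set(calls) - known      # called but not in the known list
missed = known - set(calls)     # known but not called in this sample
print(calls, confirmed, novel, missed)
```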

RNA-Seq: Transcriptome Analysis
RNA-Seq
- Sequence RNA (reverse-transcribed to cDNA)
- Reads are mapped to an annotated reference genome (annotated genes, known variants)
- Expression levels are deduced from the total number of reads that map to the exons of a gene
RNA-Seq versus Microarray
- More sensitive to low-abundance transcripts
- Absolute gene expression levels detectable
- Can detect single molecules
- No prior knowledge of content required
- Greater ability to distinguish isoforms
- Ability to determine allelic expression
- Less biased
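
Raw read counts per gene depend on gene length and sequencing depth, so they are usually normalised before comparison. The slide does not name a normalisation; the sketch below uses RPKM (reads per kilobase of exon per million mapped reads) as one common choice. Gene names, counts and lengths are hypothetical, and the total of the counted reads stands in for the total number of mapped reads.

```python
def rpkm(read_counts, exon_lengths):
    """Reads per kilobase of exon per million mapped reads.

    read_counts: {gene: reads mapped to the gene's exons}
    exon_lengths: {gene: summed exon length in bases}
    RPKM = reads * 1e9 / (exon length in bases * total mapped reads)
    """
    total = sum(read_counts.values())  # stand-in for the total mapped reads
    return {
        gene: n * 1e9 / (exon_lengths[gene] * total)
        for gene, n in read_counts.items()
    }

# Hypothetical counts: GENE_B has three times the reads but is three times as long.
counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 0}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

values = rpkm(counts, lengths)
print(values)
```

After length normalisation GENE_A and GENE_B come out equally expressed, which the raw counts alone would not suggest.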

RNA-Seq: Transcriptome Analysis
RNA-Seq Study of Ovarian Cancer Cell Lines
- Identification of changes in gene expression in strains with acquired drug resistance
- Special interest in ncRNA expression data
- Use Bowtie and TopHat to map reads and identify splice sites
- Use Cufflinks to assemble transcripts and calculate abundances
- ~87% of reads mapped to genome
- Use DESeq to perform differential expression tests
- Use DAVID (Database for Annotation, Visualisation and Integrated Discovery, http://david.abcc.ncifcrf.gov/) for pathway analysis
- Found significant representation of cancer pathways and focal adhesion genes
Pipeline
- Bowtie: maps reads to reference genome (hg19)
- TopHat: identifies splice sites (known/novel)
- Cufflinks: transcript assembly, quantification
- DESeq: differential gene expression of RNA-Seq data
- DAVID: pathway analysis of deregulated genes
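
Before differential expression can be tested, counts from different samples have to be put on a common scale. The sketch below illustrates the median-of-ratios size-factor idea described for DESeq, followed by a log2 fold change on normalised counts. It is a simplified illustration in Python, not the DESeq implementation (an R/Bioconductor package that also fits a negative binomial model and performs the statistical test); gene names and counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {gene: [count in sample 0, count in sample 1, ...]}
    For each gene, take the geometric mean across samples; a sample's size
    factor is the median of its count / geometric-mean ratios over all genes.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) == 0:
            continue  # geometric mean would be zero
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]

def log2_fold_change(row, factors, a=0, b=1, pseudocount=1.0):
    """log2 ratio of normalised counts, sample b over sample a."""
    na = row[a] / factors[a] + pseudocount
    nb = row[b] / factors[b] + pseudocount
    return math.log2(nb / na)

# Hypothetical counts: sample 1 was sequenced twice as deeply as sample 0,
# and only g5 is genuinely more abundant in sample 1.
counts = {
    "g1": [100, 200],
    "g2": [50, 100],
    "g3": [10, 20],
    "g4": [0, 5],
    "g5": [30, 480],
}
factors = size_factors(counts)
lfc = {g: log2_fold_change(row, factors) for g, row in counts.items()}
print(factors, lfc)
```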

ChIP-Seq: Genome-wide Protein-DNA Interactions
ChIP-Seq
- Chromatin immunoprecipitation (ChIP) isolates protein-bound DNA
- Followed by deep sequencing of the DNA fragments (Seq)
- Facilitates genome-wide mapping of DNA-protein interactions
- Shows how TFs and other chromatin-associated factors can affect phenotype
- Regulation/structural analysis
ChIP-Seq vs. ChIP-chip
- No prior knowledge of content required
- Similar approach can be used to map genomic methylation

ChIP-Seq: Genome-wide Protein-DNA Interactions
ChIP-Seq Study of Haematopoietic Stem Cells
- Interest in haematopoiesis and the genetic circuitry of blood cell development
- Tal1 (T-cell acute lymphocytic leukaemia protein 1): TF that controls development and differentiation of Haematopoietic Stem Cells (HSCs)
- Very few target genes had been validated
- ChIP-Seq approach taken to generate a genome-wide catalogue of Tal1 binding events in a stem cell line
- Use Illumina BeadStudio ChIP-Seq module to identify peaks (potential chromatin binding sites)
- Followed by in vivo validation (foetal liver, transgenic mice)
- Allows construction of an in vivo validated network of 17 factors and their respective regulatory elements
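
Peak identification looks for regions where ChIP reads pile up more than expected. The algorithm used by the BeadStudio module is not described in the slides, so the sketch below shows only a generic fixed-window approach: count read start positions per window and keep windows that are enriched relative to a control sample. The use of a control sample, the window size, the thresholds and the toy positions are all assumptions made for illustration.

```python
from collections import Counter

def enriched_windows(chip_positions, control_positions,
                     window=200, min_fold=4.0, min_reads=5):
    """Return (start, end, chip_read_count) for windows enriched in the ChIP sample.

    Positions are read start coordinates on a single chromosome.
    A pseudocount of 1 is added to the control count to avoid division by zero.
    """
    chip = Counter(p // window for p in chip_positions)
    ctrl = Counter(p // window for p in control_positions)
    peaks = []
    for w in sorted(chip):
        n = chip[w]
        fold = n / (ctrl[w] + 1)  # Counter returns 0 for windows with no control reads
        if n >= min_reads and fold >= min_fold:
            peaks.append((w * window, (w + 1) * window, n))
    return peaks

# Hypothetical read start positions on one chromosome.
chip_reads = [1010, 1020, 1050, 1100, 1150, 1190, 1195, 1199, 5000, 5010, 9000]
control_reads = [1100, 5005, 9100]

peaks = enriched_windows(chip_reads, control_reads)
print(peaks)
```

Dedicated peak callers additionally model fragment length, strand shift and local background, and merge adjacent enriched windows into a single peak.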