Characterisation of structural variation in breast cancer genomes using paired-end sequencing on the Illumina Genome Analyser

Phil Stephens, Cancer Genome Project

Why is it important to study cancer?
- Causes of cancer
- How cancers develop
- Pathways involved
- Development of tests for early detection
- New targets for anticancer drug development
- Monitor whether treatment is working

Challenges of studying solid tumours

Breast cancer spectral karyotype: 73 chromosomes, 37 structural abnormalities
(Courtesy of Paul Edwards, Departments of Pathology and Oncology, University of Cambridge)

Amplification of MYC in breast cancer
[Figure: plot against genomic location (Mb), roughly 60 to 140 Mb, with MYC marked]

Point mutations & indels
- Stages shown: pre-cancerous lesion, in situ cancer, invasive cancer, metastatic cancer
- ~20 cancer-causing mutations
- 100,000 passenger mutations

Pessimism around studying solid tumours

BRAF is an oncogene
- Mutated in 70% of melanoma: T1799A (V600E)
- BRAF is not important in melanoma
- BRAF is not a good drug target
- Melanoma is too complex to treat effectively

Patient with malignant melanoma Courtesy of Dr Grant McArthur

Selective inhibitor of BRAF V600E (Plexxikon PLX4032)
[Images: before treatment and 15 days after]
Courtesy of Dr Grant McArthur

New sequencing technologies

Rearrangement characterisation

Summary of Illumina GA protocol
- Randomly shear DNA
- Size select, ligate Illumina adaptors and amplify
- Sequence from both ends
- Align reads to the reference sequence with the MAQ algorithm (Li Heng, http://maq.sourceforge.net/; Genome Res., Nov 2008)
[Diagram labels: 500 bps; 400 bps fragment with a 37 bps read at each end]

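Correctly mapping paired-end reads
[Diagram: the two 37 bps reads from a 400 bps fragment both map to chromosome 11]

Incorrectly mapping paired-end reads
[Diagram: the two 37 bps reads from a 400 bps fragment map to different chromosomes, one to chromosome 11 and one to chromosome 8]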
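The distinction between correctly and incorrectly mapping pairs can be sketched in code. This is a minimal illustration and not the pipeline used in the talk: the `MappedRead` type, the `classify_pair` function, the 100 bps tolerance and the category names are assumptions made for this example. Only the 400 bps fragment size and the 37 bps read length come from the slides. The sketch assumes a standard paired-end library in which the two reads of a normal pair face each other on the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedRead:
    chrom: str   # chromosome name, e.g. "11"
    pos: int     # leftmost aligned coordinate of the read
    strand: str  # "+" or "-"


def classify_pair(a, b, expected_insert=400, tolerance=100, read_len=37):
    """Classify one read pair by chromosome, orientation and span.

    Thresholds and labels are illustrative assumptions, not values
    taken from the talk.
    """
    if a.chrom != b.chrom:
        return "interchromosomal"      # e.g. chromosome 11 and chromosome 8
    left, right = sorted((a, b), key=lambda r: r.pos)
    if left.strand == right.strand:
        return "inversion"             # both reads on the same strand
    if left.strand == "-" and right.strand == "+":
        return "tandem_duplication"    # reads point away from each other
    span = right.pos + read_len - left.pos
    if span > expected_insert + tolerance:
        return "deletion"              # reads face each other but too far apart
    return "concordant"


if __name__ == "__main__":
    ok = classify_pair(MappedRead("11", 1000, "+"), MappedRead("11", 1363, "-"))
    odd = classify_pair(MappedRead("11", 1000, "+"), MappedRead("8", 5000, "-"))
    print(ok, odd)
```

Only pairs that fall outside the `concordant` class would be carried forward as candidate rearrangements for confirmation.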

PCR amplify in tumour and matched normal
[Gel image: tumour (T) and normal (N) lanes for three outcomes: somatic, germline, artefact]
6 bps of microhomology at break
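The three outcomes on this slide follow from which lanes show a PCR product across the predicted breakpoint. The sketch below states that logic explicitly. It assumes the usual interpretation (product in tumour only is somatic, product in both is germline, product in neither is an artefact); the function name and the `unexpected` label are hypothetical and not from the talk.

```python
def classify_pcr_result(band_in_tumour: bool, band_in_normal: bool) -> str:
    """Interpret a breakpoint-spanning PCR run on tumour and matched normal DNA.

    Assumed interpretation, not a protocol quoted from the talk.
    """
    if band_in_tumour and not band_in_normal:
        return "somatic"     # rearrangement acquired by the tumour
    if band_in_tumour and band_in_normal:
        return "germline"    # inherited variant, present in both samples
    if not band_in_tumour and not band_in_normal:
        return "artefact"    # predicted rearrangement does not confirm
    return "unexpected"      # product in normal only: recheck the samples


if __name__ == "__main__":
    for tumour, normal in [(True, False), (True, True), (False, False)]:
        print(tumour, normal, classify_pcr_result(tumour, normal))
```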

Structural variation screen
- 9 matched breast cancer cell lines
- 49.0 Gb total data
- 8.5 mean physical coverage
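Physical coverage counts how many sequenced fragments span each base, including the unsequenced middle of each fragment, so it is much higher than sequence coverage when reads are short. A minimal sketch of the two definitions follows. The pair count and the 3.0 Gb genome size are hypothetical inputs chosen for round numbers; they are not figures from the talk and the example does not attempt to reproduce the 8.5 quoted above.

```python
def sequence_coverage(n_pairs: int, read_len: int, genome_size: int) -> float:
    """Mean number of sequenced bases covering each genome position."""
    return 2 * n_pairs * read_len / genome_size


def physical_coverage(n_pairs: int, fragment_len: int, genome_size: int) -> float:
    """Mean number of fragments (read 1 to read 2) spanning each position."""
    return n_pairs * fragment_len / genome_size


if __name__ == "__main__":
    # Hypothetical inputs: 60 million mapped pairs, 3.0 Gb genome.
    n_pairs = 60_000_000
    genome_size = 3_000_000_000
    print(sequence_coverage(n_pairs, 37, genome_size))
    print(physical_coverage(n_pairs, 400, genome_size))
```

With 37 bps reads on 400 bps fragments, physical coverage is 400 / 74, or roughly 5.4 times, the sequence coverage, which is why low-depth paired-end data can still detect rearrangements.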

Illumina GA sequencing: broad objectives
- Structural variation
- Fusion genes
- Copy number changes
- Structure of amplicons

Breast cancer HCC38 spectral karyotype: 73 chromosomes, 37 structural abnormalities
(Courtesy of Paul Edwards, Departments of Pathology and Oncology, University of Cambridge)

Breast cancer HCC38 (triple negative): 238 somatic structural variants
[Figure: patterns of variation]

What are these structural variants doing?

Copy number: ~160 kb copy number increase, 164 kb tandem duplication
[Figure: copy number (0 to 6) against chromosome 3 position (48 to 49.4 Mb)]
- Genomic PCR in tumour (T) and normal (N) [gel; size markers 200, 400 and 600 bps]
- SLC26A6/PRKAR2A in-frame fusion gene: SLC26A6 (exons 1-17) joined to PRKAR2A (exons 4-11), with 5' UTR and 3' UTR labelled
- FISH confirmation of tandem duplication

RT-PCR in tumour (T) and normal (N) [gel; size markers 200, 400 and 600 bps]

Predicted 914 amino acid fusion protein:
MGLADASGPRDTQALLSATQAMDLRRRDYHMERPLLNQEHLEELGRWGSAPRTHQWRTWLQCSRARAYALLLQHLPVLVWLPRYPVRDWLLGDLLSGL
SVAIMQLPQGLAYALLAGLPPVFGLYSSFYPVFIYFLFGTSRHISVGTFAVMSVMVGSVTESLAPQALNDSMINETARDAARVQVASTLSVLVGLFQVGLGLIH
FGFVVTYLSEPLVRGYTTAAAVQVFVSQLKYVFGLHLSSHSGPLSLIYTVLEVCWKLPQSKVGTVVTAAVAGVVLVVVKLLNDKLQQQLPMPIPGELLTLIGAT
GISYGMGLKHRFEVDVVGNIPAGLVPPVAPNTQLFSKLVGSAFTIAVVGFAIAISLGKIFALRHGYRVDSNQELVALGLSNLIGGIFQCFPVSCSMSRSLVQEST
GGNSQVAGAISSLFILLIIVKLGELFHDLPKAVLAAIIIVNLKGMLRQLSDMRSLWKANRADLLIWLVTFTATILLNLDLGLVVAVIFSLLLVVVRTQMPHYSVLGQ
VPDTDIYRDVAEYSEAKEVRGVKVFRSSATVYFANAEFYSDALKQRCGVDVDFLISQKKKLLKKQEQLKLKQLQKEEKLRKQAASPKGASVSINVNTSLEDMR
SNNVEDCKMVIHPKTDEQRCRLQEACKDILLFKNLDQEQLSQVLDAMFERIVKADEHVIDQGDDGDNFYVIERGTYDILVTKDNQTRSVGQYDNRGSFGEL
ALMYNTPRAATIVATSEGSLWGLDRVTFRRIIVKNNAKKRKMFESFIESVPLLKSLEVSERMKIVDVIGEKIYKDGERIITQGEKADSFYIIESGEVSILIRSRTKSN
KDGGNQEVEIARCHKGQYFGELALVTNKPRAASAYAVGDVKCLVMDVQAFERLLGPCMDIMKRNISHYEEQLVKMFGSSVDLGNLGQ
Stop

Five expressed in-frame fusion genes
- 2 generated by tandem duplications
- 3 generated by large inversions
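Whether a fusion is in frame depends on codon phase at the junction: the coding bases kept from the 5' partner must leave the 3' partner's remaining exons in their original reading frame. The sketch below expresses that rule. The function name and the example lengths are hypothetical; no coding lengths are given in the talk.

```python
def is_in_frame(five_prime_cds_retained: int, three_prime_cds_skipped: int) -> bool:
    """Return True if joining the two partners preserves the reading frame.

    five_prime_cds_retained: coding bases of the 5' gene upstream of the junction.
    three_prime_cds_skipped: coding bases of the 3' gene lost upstream of the junction.
    A base at offset k in the 3' gene moves to offset
    five_prime_cds_retained + k - three_prime_cds_skipped in the fusion, so the
    frame is kept only when both lengths are equal modulo 3.
    """
    return five_prime_cds_retained % 3 == three_prime_cds_skipped % 3


if __name__ == "__main__":
    # Hypothetical lengths for illustration only.
    print(is_in_frame(1000, 301))   # both leave phase 1
    print(is_in_frame(1000, 300))   # phases 1 and 0 differ
```

Under this rule roughly one in three random exon-to-exon junctions would be in frame by chance, which is why expression of the fusion transcript is checked separately by RT-PCR.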

Very small rearrangements: below the resolution of copy number graphs
- 5 kb tandem duplication, spanned by 4 reads [gel: tumour (T) and normal (N)]
- TNRC15: in-frame duplication of exons 14 & 15 [exon diagram: 12 13 14 15 14 15 16 17 18 19]

Different patterns of structural variation are emerging from other breast cancers

Patterns of somatic structural variation
- HCC38: triple negative
- HCC1937: BRCA1 null
- HCC1954: ERBB2 positive

Complex patterns of structural variation
[Figure: Solexa copy number (0 to 10) along chromosome 6 (0 to 150 Mb). Annotated events: tandem duplication; deletion & tandem duplication; t(6;7) translocation; inversion & tandem duplication & t(6;7) translocation; complex]

Summary of breast cancer cell line data
- 9 breast cancer cell lines
- 1152 somatic rearrangements
- 16 in-frame fusion genes (15/16 expressed)

Structural variation screen on 15 primary breast cancers
- 2 Luminal A
- 2 Luminal B
- 3 Basal like
- 2 BRCA1 neg
- 2 BRCA2 neg
- 3 ERBB2 +
- 1 Lobular
- 66.6 Gb total data
- 7.4 mean physical coverage

Patterns of variation in individual primary breast cancers [one bar chart per tumour]
- Triple negative: 199 somatic structural variants
- Triple negative: 7 somatic structural variants
- ERBB2 positive: 62 somatic structural variants
- ERBB2 positive: 94 somatic structural variants
- Luminal B expression: 231 somatic structural variants
- Invasive lobular carcinoma: 1 somatic structural variant

Novel expressed in-frame fusion gene: 16.8 Mb inversion

Very small rearrangements
- 3 kb deletion [gel: tumour (T) and normal (N)]
- RB1: deletion of exon 13

Summary of primary breast cancer data
- 15 primary breast cancers
- 1014 somatic rearrangements
- 10 in-frame fusion genes (5 expressed)
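The two summary slides give enough numbers to work out per-sample averages, which is useful context for the round figures quoted in the conclusions. The short calculation below uses only the counts stated in the summaries; the variable names are illustrative, and pooling cell lines with primary tumours is an assumption of this sketch rather than something the talk states.

```python
# Counts taken from the two summary slides.
cell_lines, cell_line_rearrangements, cell_line_expressed_fusions = 9, 1152, 15
primaries, primary_rearrangements, primary_expressed_fusions = 15, 1014, 5

per_cell_line = cell_line_rearrangements / cell_lines
per_primary = primary_rearrangements / primaries

# Pooling both sets is an assumption made here for illustration.
samples = cell_lines + primaries
pooled_rearrangements = (cell_line_rearrangements + primary_rearrangements) / samples
pooled_expressed_fusions = (
    cell_line_expressed_fusions + primary_expressed_fusions
) / samples

if __name__ == "__main__":
    print(f"rearrangements per cell line: {per_cell_line:.1f}")
    print(f"rearrangements per primary:   {per_primary:.1f}")
    print(f"pooled rearrangements/sample: {pooled_rearrangements:.2f}")
    print(f"pooled expressed fusions:     {pooled_expressed_fusions:.2f}")
```

The cell lines average 128 rearrangements each and the primary tumours about 68, so the pooled figure of about 90 per sample sits close to the ~100 quoted later.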

How do we identify the driver mutations?
- 139 genes rearranged in more than one sample
- 22 gene fusions
- Follow-up in ~300 primary breast cancers: FISH (Jorge Reis-Filho) and cDNA PCR (John Foekens)
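Counting genes rearranged in more than one sample is a simple recurrence filter. A minimal sketch follows, assuming the rearrangement calls have already been reduced to (sample, gene) pairs. The function name, the sample labels and the gene labels are all placeholders invented for this example, not data from the screen.

```python
from collections import defaultdict


def recurrent_genes(sample_gene_pairs, min_samples=2):
    """Return genes rearranged in at least `min_samples` distinct samples."""
    samples_by_gene = defaultdict(set)
    for sample, gene in sample_gene_pairs:
        samples_by_gene[gene].add(sample)   # a set ignores repeat hits per sample
    return sorted(
        gene for gene, samples in samples_by_gene.items()
        if len(samples) >= min_samples
    )


if __name__ == "__main__":
    # Placeholder calls for illustration only.
    calls = [
        ("sample_1", "GENE_A"),
        ("sample_1", "GENE_A"),   # second breakpoint in the same sample
        ("sample_2", "GENE_A"),
        ("sample_2", "GENE_B"),
        ("sample_3", "GENE_C"),
    ]
    print(recurrent_genes(calls))
```

Recurrence alone does not establish a driver, since large genes and fragile regions collect rearrangements by chance, which is why the candidates are then tested in a larger series.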

Whole exome sequencing (Agilent SureSelect)
- Shear genomic DNA, ligate Illumina adaptors
- Hybridise to baits that span exons & wash
- Sequence from both ends

Breast cancer samples (28 total)
- 21 x ER+ HER2-
- 6 x ER+ HER2+
- 3 x triple neg

Substitutions in known cancer genes

Sample    Gene     Mutation
PD3995a   AKT1     E17K
PD3995a   NF1      G-1T
PD3994a   PIK3CA   N345K
PD3989a   PIK3CA   E545K
PD3856a   PIK3CA   H1047R
PD3857a   PIK3CA   H1047R
PD3888a   PIK3CA   H1047R
PD3983a   PIK3CA   H1047R
PD3985a   PIK3CA   H1047R
PD3992a   PIK3CA   H1047R
PD3996a   PTEN     Y27D
PD3991a   TP53     G245S
PD4002a   TP53     H179Y
PD3987a   TP53     Y220C
PD3986a   TP53     G+1A
PD3985a   TP53     R306X

High confidence indels

Sample    Gene     Mutation
PD3849a   CDH1     V193X
PD3984a   MAP2K4   V151X
PD3992a   MAP2K4   I81X
PD3989a   PTEN     L370X
PD3995a   GATA3    N352X
PD3988a   GATA3    N352X
PD4004a   GATA3    Read through

Novel cancer genes
- Integrating rearrangement and exome data
- 2 novel pathways deregulated in a large proportion of ER pos breast cancer

Potential applications in healthcare

Personalised haematology
- Risk stratification
- Therapy duration
- Prediction of relapse
[Figure: tumour cells against time (months)]

Tumour-specific rearrangements

Plasma DNA Normal tissues Cancer DNA with tumour-specific mutation

Work flow
1. Identify tumour-specific rearrangements from cancer DNA: 2 weeks, ~2,500
2. Design & test quantitative assay for rearrangements: 1 week, 250
3. Extract DNA from plasma (10-20 mL blood sample): ½ day, 50

Assay design

Detecting 1 copy of tumour genome

Serial measurements

Potential healthcare applications
- Monitoring tumour response to therapy in real time: reduce toxicity, prevent drug wastage
- Identifying disease relapse before it is clinically evident: pre-emptive therapy
- Choosing intensity of adjuvant therapy based on risk stratification

Potential healthcare applications
- 100 breast cancers
- 100 colorectal cancers
- 100 osteosarcomas

Conclusions
- Paired-end sequencing is an effective tool for characterising structural variation in complex cancer genomes
- The average breast cancer has ~100 somatic structural variants
- The average breast cancer has ~1.0 expressed in-frame fusion gene
- Exome sequencing will unveil a multitude of novel drug targets in the next few years