Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Dr Elaine Kenny
Neuropsychiatric Genetics Research Group, Institute of Molecular Medicine, Trinity College Dublin

Schizophrenia - background
- Common, complex brain disorder
- Presents with a significant heterogeneity of symptoms
  - Psychotic symptoms: delusions and hallucinations
  - Negative symptoms: affective flattening, alogia and avolition
  - Cognitive deficits: working memory, attention
- Age of onset: males between the ages of 16 and 30; females largely presenting after the age of 30
- Variable course of illness; complete remission is probably uncommon

Schizophrenia - background
- Chronic, debilitating
- Devastating for individual, family and society
- Life expectancy reduced by 10 years
- ~20% of affected in full-time employment
- Substantial societal burden from disease
- Biology poorly understood and treatments partially effective
- Higher incidence in men

Heritability of Schizophrenia

International Schizophrenia Consortium (ISC) GWAS study

Chromosomal Deletions
- Large chromosomal deletions (100Kb-3Mb) occur more frequently in schizophrenia cases than in healthy controls
- Most of these deletions are spread across the genome and do not yet implicate specific genes in schizophrenia
- However, some schizophrenia deletions do cluster at certain regions, implicating specific genes in the illness

Where are the rest of the risk genes?
- Few replicated common SNPs for schizophrenia

Challenges for common disorders

How do we find the disease-causing rare variants?
- Sequence whole genomes: expensive!
  - Sample number required for statistical power means huge costs
  - A large portion of the genome will be non-functional and unlikely to harbour risk mutations
- From Mendelian disease, we know that:
  (i) mutations causing amino acid changes account for ~60% of disease mutations
  (ii) small indels in genes account for ~25% of disease mutations
  (iii) <1% of disease mutations have been found in regulatory regions

Target Enrichment
- Best compromise: sequence portions of the genome
  - Genomic regions of interest, e.g. genes/promoters etc.
  - Exome

The logical extension of sample pooling is to perform multiplexed target enrichments in which many samples are barcoded before capture

- Library preparation: <3µg; Illumina SE/PE, SOLiD
- Hybridization / capture: 0.5µg, 24 hours
- Baits:
  - cRNA probes
  - Long (120bp)
  - Biotin labeled
  - User-defined (eArray)
  - SurePrint synthesis
- Bead separation
- Wash / elution / amplification

                         Whole Genome     Whole Exome      N=200 genes
Samples per flowcell     1                8                100+
Direct consumables cost  25K per genome   2.5K per exome   250 per sample

x3 Illumina flowcell; 3Gb sequence post QC and alignment

How to barcode/index samples?

Sequencing library prep
- Shear: sonicate
- End repair: T4 and Klenow DNA polymerases, T4 PNK
- Add A base: Klenow exo-

Ligation of adapters
(Diagram: adapters with a T overhang ligated to both ends of the A-tailed fragment.)

Ligation of indexed adapters
(Diagram: indexed adapters with a T overhang ligated to both ends of the A-tailed fragment.)

Indexing / barcoding of DNA samples
- Each sample library carries its own index: AACCAT, CAACCT, GCATGT, TCAGTT
- Indexed libraries are combined
- Target enrichment (single Agilent SureSelect reaction)
- Sequencing (single lane of an Illumina GA II)

Sequencing Output: indexed 40bp reads AACCATTCCGTGTACTGACTGCTCGATATA CAACCTTCCGTGTACTGACTGCTCGATCTA AACCATTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTCGATATA CAACCTTCCGTGTACTGACTGCTCGATATA AACCATTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA CAACCTTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA AACCATTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTCGATATA CAACCTTCCGTGTACTGACTGCTCGATCTA GCATGTTCCGTGTACTGACTGCTCGATATA Separate reads and analyze on a per-individual basis AACCATTCCGTGTACTGACTGCTCGATATA AACCATTCCGTGTACTGACTGCTCGATATA AACCATTCCGTGTACTGACTGCTCGATATA AACCATTCCGTGTACTGACTGCTCGATATA CAACCTTCCGTGTACTGACTGCTCGATATA CAACCTTCCGTGTACTGACTGCTCGATCTA CAACCTTCCGTGTACTGACTGCTCGATATA CAACCTTCCGTGTACTGACTGCTCGATCTA SNP Detection GCATGTTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA GCATGTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA TCAGTTTCCGTGTACTGACTGCTCGATATA CNV Detection
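The read-separation step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: it assumes the 6-base index sits at the very start of each read and must match exactly, and the sample names (`sample_1` to `sample_4`) are hypothetical labels for the four indexes shown on the slide.

```python
from collections import defaultdict

INDEX_LENGTH = 6  # the four indexes on the slide are 6 bases long

# Hypothetical sample labels; only the index sequences come from the slide.
INDEX_TO_SAMPLE = {
    "AACCAT": "sample_1",
    "CAACCT": "sample_2",
    "GCATGT": "sample_3",
    "TCAGTT": "sample_4",
}


def demultiplex(reads, index_to_sample, index_length=INDEX_LENGTH):
    """Group reads by their leading index and strip the index from each read.

    Returns (reads_by_sample, unassigned), where unassigned holds reads whose
    first `index_length` bases match no known index (assumption: exact match).
    """
    reads_by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        sample = index_to_sample.get(index)
        if sample is None:
            unassigned.append(read)
        else:
            reads_by_sample[sample].append(insert)
    return dict(reads_by_sample), unassigned


if __name__ == "__main__":
    pooled = [
        "AACCATTCCGTGTACTGACTGCTCGATATA",
        "CAACCTTCCGTGTACTGACTGCTCGATCTA",
        "GCATGTTCCGTGTACTGACTGCTCGATATA",
        "TCAGTTTCCGTGTACTGACTGCTCGATATA",
    ]
    by_sample, unassigned = demultiplex(pooled, INDEX_TO_SAMPLE)
    for sample, inserts in sorted(by_sample.items()):
        print(sample, len(inserts))
```

Stripping the index leaves the genomic portion of each read for alignment, which is consistent with the 34bp read length used in the coverage calculation later in the talk (40bp minus a 6bp index).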

SureSelect pilot study: combine indexed samples and target enrich using 1 capture library
- Targeted resequencing of 9 HapMap samples
- Libraries compared:
  - Single non-indexed sample library
  - Single indexed sample library
  - 3-index sample library
  - 9-index sample library

Sequence coverage across PTBP2 for indexed vs non-indexed sample

Results: Number of sequence reads per indexed sample in sequenced libraries (pre-alignment to reference genome)
- 3 samples: median coverage of 41x
- 9 samples: median coverage of 11x
- Overall SNP concordance >99%

Target enrichment achieved

Metric                                             non-index   1-index   3-index      9-index
                                                   sample      sample    sample [1]   sample [1]
Percentage reads in targeted regions +/-50bp [2]   20%         22%       23%          19%
Fold enrichment in targeted regions [3]            1708        1885      1912         1608
Percentage target bases covered [4]                98%         98%       98%          93%
Median coverage of target [5]                      169x [6]    93x [6]   41x          11x

[1] Average values given for multi-sample libraries.
[2] Number of reads uniquely mapping to the target region (+/-50bp) as a % of the number of reads uniquely mapping to hg18.
[3] (Sequence reads uniquely mapping to the target regions / sequence reads mapping to hg18) x Maximum Enrichment, where Maximum Enrichment is the ratio of genome length (3,080,419,510bp) to target length (377,388bp).
[4] Percentage of target bases covered by at least one sequence read.
[5] (Number of 34bp reads matching target x 34) / target length.
[6] The difference in median read coverage between the non-indexed and indexed sample is reflective of the higher number of clusters on the flowcell and also the higher number of clusters passing QC filters in the non-indexed sample (83.48% vs 57.65%,
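The fold-enrichment and coverage definitions in the footnotes can be written out as a short calculation. The sketch below uses the genome and target lengths quoted on the slide; the function names and the read counts in the usage example are hypothetical and are not taken from the study.

```python
GENOME_LENGTH_BP = 3_080_419_510  # hg18 genome length quoted on the slide
TARGET_LENGTH_BP = 377_388        # target length quoted on the slide
READ_LENGTH_BP = 34               # read length used in the coverage footnote


def maximum_enrichment(genome_length=GENOME_LENGTH_BP, target_length=TARGET_LENGTH_BP):
    """Ratio of genome length to target length (upper bound on enrichment)."""
    return genome_length / target_length


def fold_enrichment(reads_on_target, reads_mapped):
    """(reads on target / reads mapped to the genome) x maximum enrichment."""
    if reads_mapped <= 0:
        raise ValueError("reads_mapped must be positive")
    return (reads_on_target / reads_mapped) * maximum_enrichment()


def coverage_of_target(reads_on_target, read_length=READ_LENGTH_BP,
                       target_length=TARGET_LENGTH_BP):
    """(number of reads matching target x read length) / target length."""
    return reads_on_target * read_length / target_length


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    on_target, mapped = 1_000_000, 5_000_000
    print(f"maximum enrichment: {maximum_enrichment():.1f}")
    print(f"fold enrichment:    {fold_enrichment(on_target, mapped):.0f}")
    print(f"coverage of target: {coverage_of_target(on_target):.1f}x")
```

With a maximum enrichment of roughly 8,162, an on-target fraction of about one fifth corresponds to enrichment on the order of 1,600-fold, which is the range reported in the table.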

CNV detection
- Targeted 17 CNVs known to be polymorphic in HapMap samples

CNV detection in raw data: 9-index sample
- Known CNV region targeted on chr 22
- Strong correlation between sequence coverage and CNV genotype in the indexed samples (rho=1, p<0.0005)
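The slide does not name the statistic behind rho. Assuming it is Spearman's rank correlation between each sample's known CNV genotype and its read coverage over the region, a minimal pure-Python sketch looks like this. The copy numbers and coverage values in the usage example are hypothetical, and no p-value is computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    copy_number = [0, 1, 1, 2, 2, 3]
    coverage = [0.0, 5.2, 4.8, 10.1, 9.7, 15.3]
    print(round(spearman_rho(copy_number, coverage), 3))
```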

Development of CNV detection algorithm
- In cases where raw read numbers are not even across indexed samples, how do we identify CNVs?

Development of CNV detection algorithm Step 1: Normalise read counts
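The slide does not specify how read counts were normalised. One common approach, shown here purely as an assumption, is to divide each region's read count by the sample's total read count so that samples sequenced to different depths become comparable. The function name and the example data are hypothetical.

```python
def normalise_read_counts(counts_by_sample):
    """Convert raw per-region read counts into fractions of each sample's total.

    counts_by_sample: {sample: {region: raw_read_count}}
    Returns the same structure with counts replaced by count / sample_total.
    """
    normalised = {}
    for sample, region_counts in counts_by_sample.items():
        total = sum(region_counts.values())
        if total == 0:
            raise ValueError(f"sample {sample!r} has no reads to normalise")
        normalised[sample] = {
            region: count / total for region, count in region_counts.items()
        }
    return normalised


if __name__ == "__main__":
    # Hypothetical counts: sample_b was sequenced to half the depth of sample_a.
    raw = {
        "sample_a": {"region_1": 100, "region_2": 300},
        "sample_b": {"region_1": 50, "region_2": 150},
    }
    for sample, fractions in normalise_read_counts(raw).items():
        print(sample, fractions)
```

After this step the two hypothetical samples have identical normalised profiles despite a twofold difference in raw read number.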

Development of CNV detection algorithm Step 2: Identify regions where normalised read counts differ significantly between samples
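The statistical test used in the study is not given on the slide. As a stand-in, the sketch below compares each sample's normalised count for a region with the median across all samples and flags ratios outside fixed thresholds. The thresholds (0.75 and 1.25), the function name and the example values are all assumptions chosen for illustration; a real implementation would replace the threshold with a proper significance test.

```python
from statistics import median


def flag_candidate_cnvs(normalised, low=0.75, high=1.25):
    """Flag (region, sample) pairs whose normalised count departs from the median.

    normalised: {sample: {region: normalised_read_count}}
    Returns a list of (region, sample, "loss" | "gain", ratio_to_median).
    """
    regions = sorted({region for counts in normalised.values() for region in counts})
    calls = []
    for region in regions:
        values = {
            sample: counts[region]
            for sample, counts in normalised.items()
            if region in counts
        }
        reference = median(values.values())
        if reference == 0:
            continue  # no usable reference level for this region
        for sample, value in sorted(values.items()):
            ratio = value / reference
            if ratio < low:
                calls.append((region, sample, "loss", ratio))
            elif ratio > high:
                calls.append((region, sample, "gain", ratio))
    return calls


if __name__ == "__main__":
    # Hypothetical normalised counts; s5 has half the expected reads in region_1.
    example = {
        "s1": {"region_1": 0.10, "region_2": 0.20},
        "s2": {"region_1": 0.10, "region_2": 0.20},
        "s3": {"region_1": 0.10, "region_2": 0.20},
        "s4": {"region_1": 0.10, "region_2": 0.20},
        "s5": {"region_1": 0.05, "region_2": 0.20},
    }
    for call in flag_candidate_cnvs(example):
        print(call)
```

Using the cross-sample median as the reference assumes that most samples carry the common copy number at each region, which may not hold for highly polymorphic CNVs.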

Application to schizophrenia and autism
Agilent eArray design:
- Coding exons (210 genes): 687,784bp
- Expanded exons: 742,017bp
  - Exons <121bp in size were expanded to 121bp so that all exonic sequence would be targeted by 2 cRNA baits
- Baited sequence: 1,033,568bp
- Coverage of exons by cRNA baits: 98%
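The exon expansion rule can be illustrated with a small helper. The slide gives only the 121bp minimum; how the extra bases were distributed around each exon is not stated, so this sketch assumes symmetric padding around the exon and 0-based, half-open coordinates. The function name and example coordinates are hypothetical.

```python
MIN_TARGET_LENGTH = 121  # minimum target size stated on the slide


def expand_exon(start, end, min_length=MIN_TARGET_LENGTH):
    """Pad an exon shorter than min_length so that it spans exactly min_length.

    Coordinates are 0-based and half-open: the exon covers [start, end).
    Assumption: padding is split as evenly as possible on both sides, with
    any odd base added on the right; the start is clamped at position 0.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    length = end - start
    if length >= min_length:
        return start, end
    padding = min_length - length
    new_start = max(0, start - padding // 2)
    return new_start, new_start + min_length


if __name__ == "__main__":
    # Hypothetical exon coordinates for illustration only.
    for exon in [(1000, 1050), (2000, 2300)]:
        print(exon, "->", expand_exon(*exon))
```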

How many DNA samples can be indexed in a single enrichment and sequencing reaction?
- Specificity and sensitivity of enrichment reaction (target = ~1Mb)
- 16-24 samples per lane
- Quantity of sequence reads post QC and alignment (80bp PE seq)

Acknowledgements
Dr Derek Morris, Dr Paul Cormican, Dr Eleisa Heron, William Gilks, Sarah Furlong, Dr Colm O Dushlaine, Dr Carlos Pinto, Dr Ric Anney, Dr Aiden Corvin, Dr Louise Gallagher, Prof Michael Gill
www.medicine.tcd.ie/sequencing