SUPPLEMENTARY INFORMATION

Size: px

Start display at page:

Download "SUPPLEMENTARY INFORMATION"

Georgiana Simon
6 years ago
Views:

1 doi: /nature13394 Genome sequencing identifies major causes of severe intellectual disability Table of Contents Participants...4 Patient recruitment... 4 Patient selection... 4 Cohort representativeness... 7 Counseling and sample acquisition... 7 Whole genome sequencing and data analysis...8 Whole genome sequencing... 8 Validation of WGS variant calling... 8 Identification of de novo SNVs...8 Proof-of-principle experiments for de novo SNV detection Prioritization of candidate (de novo) mutations Prioritization of autosomal inherited SNVs for recessive inheritance Prioritization of maternally-inherited X-linked recessive SNVs in male patients Validation of candidate (de novo) SNVs Identification of (de novo) CNV and SV event Prioritization for clinically relevant dominant (de novo) and recessive CNV and SVs Validation of clinically relevant CNV and SV events

2 Cross variant type analysis for clinically relevant recessive mutations Generation of gene lists involved in ID Assessment of rare missense burden of known ID genes, candidate ID genes and genes with de novo mutations Statistical analysis for overrepresentation of deleterious mutations in patients with severe ID Overrepresentation of de novo mutations Statistical analysis of clinically relevant CNV and SV events RNA confirmation studies for IQSEC2-TENM3 gene fusion Amplicon-based deep sequencing using Ion Torrent Clinical interpretation of mutations using established diagnostic criteria Mutation impact of SNVs Functional relevance of SNVs Mutation impact and functional relevance of (de novo) CNV and SVs Classification of (de novo) variants as cause of disease Phenotypic re-evaluation Statistical evaluation of diagnostic criteria using de novo mutations identified in control individuals Calculation of de novo mutations as cause of severe ID Supplementary tables Supplementary Table 1: Detailed cohort description of 50 trios studied by WGS

3 Supplementary Table 2: Average genome sequencing statistics per individual Supplementary Table 3: Comparison of variants between WES and WGS identified in regions targeted by exome enrichment kit Supplementary Table 4: Overview of putative de novo SNVs detected through WGS Supplementary Table 5: Validation rates of autosomal candidate de novo SNVs grouped by genomic location and confidence level Supplementary Table 6: Distribution of de novo SNVs in coding sequence identified through WGS Supplementary Table 7: De novo substitution rates and type of mutations identified through WGS Supplementary Table 8: Coding de novo SNVs identified in 50 patients with severe ID Supplementary Table 9: Coding de novo mutation rates of published studies Supplementary Table 10: List of 528 known ID genes used for prioritization of variants Supplementary Table 11: List of 628 Candidate ID genes used for prioritization of variants Supplementary Table 12: Summary of de novo non-coding SNVs identified in 528 known ID genes Supplementary Table 13: De novo SNVs in non-coding sequence identified in patients with severe ID Supplementary Table 14: Phenotypic comparison of patients with mutations in known and candidate ID genes Supplementary Table 15: Molecular diagnosis per patient after WGS References

4 Participants Patient recruitment The Department of Human Genetics from the Radboud university medical center is a tertiary referral center for clinical genetics. Approximately 350 patients with unexplained intellectual disability (ID) are referred annually to our clinic for diagnostic evaluation. Prior to this referral, the majority of patients have received extensive clinical evaluations by other medical specialists such as pediatricians and (pediatric) neurologists. After referral, all patients are clinically evaluated by a clinical geneticist and, in most cases, receive a routine genetic diagnostic assessment, including a 250K Affymetrix SNP microarray analysis for the detection of copy number variations and dedicated DNA diagnostic single-gene tests based on clinical suspicion of specific syndromes. The majority of patients also receive a routine metabolic screen in serum and urine. Patient selection The clinical data of all patients tested with a SNP microarray are stored in an in-house database, which contains information of >5,600 patients with ID and/or multiple congenital anomalies. 31 The vast majority of this cohort (92%) consists of isolated patients from non-consanguineous families. In total, 62% of patients are reported to have an IQ<50. Of these, we previously recruited 100 patients with severe unexplained ID (IQ 50) for family-based exome sequencing. 32 In this series, we reported 16 patients to have a conclusive diagnostic report, and another 19 patients to have a de novo mutation in a candidate ID gene. Since the publication of this report, we performed additional experiments to increase the diagnostic yield in this series, including (i) familial segregation studies for X-linked variants identified in 3 patients; (ii) bioinformatic re-analysis of WES data in all 100 patients using new software (LifeScope v2.1) for mutations in a set of 495 known ID genes (as reported in Supplementary Table 12), (iii) CNV calling using CoNIFER 33 as described by de Ligt et al and (iv) systematic searches for additional patients carrying de novo mutations in candidate ID genes by e.g. literature reports and/or in-house diagnostic WES analyses. A summary of these analyses is presented in Table 1. Combined, this re-evaluated molecular screening increased the number of conclusive diagnoses for ID to 27/100 patients (27%). A possible cause for ID remains in 11% of patients. For a total of 62 patients WES did not reveal a genetic cause for ID. From this series of 62 WES-negative patients, 50 patients were recruited for the current study on family-based whole genome sequencing. 4

5 Table 1: Re-evaluated molecular diagnosis in 100 patients with severe ID after publication of diagnostic exome sequencing in patients with severe ID. Patient ID in de Ligt et al Molecular diagnosis 4 Highly likely Based on [gene] (inheritance) PDHA1 (X-linked) Re-evalutated molecular diagnosis Based on [gene] (inheritance) NO - Change due to: X-linked variant does not segregate 5 Possibly PHIP (de novo) Highly likely PHIP (de novo) In house 2nd case identified through diagnostic WES in patient with severe ID 6 Possibly PSMA7 (de novo) Highly likely 14 NO Highly likely EHMT1 deletion (de novo) DDX3X (de novo) CNV calling on WES data identified deletion of EHMT1 (de novo) Bioinformatic re-analysis of WES data identified de novo mutation in DDX3X; A second patient was subsequently identified by diagnostic WES in a patient with ID 18 Highly likely ARHGEF9 (X-linked) NO X-linked variant does not segregate 40 Possibly WAC/ MIB1 (de novo) Highly likely 60 NO Highly likely WAC (de novo) TCF4 (de novo) In house 2nd case identified through diagnostic WES in patient with severe ID Bioinformatic re-analysis identified a de novo mutation in TCF4 72 Possibly PPP2R5D (de novo) Highly likely PPP2R5D (de novo) 2nd case identified in house in patient with severe ID 74 Possibly KIF5C (de novo) Highly likely KIF5C (de novo) Literature reports additional patient with de novo mutation in KIF5C 77 NO Highly likely 85 NO Highly likely STXBP1 (de novo) SCN2A (de novo) Bioinformatic re-analysis identified a de novo mutation STXBP1 Bioinformatic re-analysis identified a de novo mutation in SCN2A 87 Possibly RAPGEF1 (de novo) Highly likely MECP2 (de novo) Bioinformatic re-analysis identified MECP2 de novo mutation 91 Possibly EEF1A2 (de novo) Highly likely EEF1A2 (de novo) 2nd patient in literature with exact same de novo mutation 92 Possibly MYT1L (de novo) Highly likely MYT1L (de novo) Overlapping small CNV containing MYT1L described in literature 96 NO Highly likely ADNP (de novo) Bioinformatic re-analysis of WES data identified a de novo 5

6 mutation in ADNP; A second patient was subsequently identified by diagnostic WES in a patient with ID 6

7 Cohort representativeness Details on the clinical characteristics of all patients are provided by de Ligt et al For the purpose of clarity, Table 2 provides the patient identifiers used in this study as well as the ones used in the previously published WES study. 32 Based on the extensive clinical and diagnostic work-up performed for this series, including genomic microarray analysis (250K Affymetrix SNP microarray) as well as diagnostic exome sequencing our series represents the end point of current diagnostic strategies for patients with unexplained severe ID. Table 2: Cross-references for patient numbers used in de Ligt et al. and the current study Trio ID current study Trio ID de Ligt et al Trio ID current study Trio ID de Ligt et al Trio ID current study Trio ID de Ligt et al Counseling and sample acquisition All 50 patients were part of our previous study on diagnostic exome sequencing in patients with severe ID and were seen and counseled by clinical specialist (MHW, BvB, TK, BBAdV or HB). This study was approved by the Medical Ethics Committee of the Radboud university medical center (NL ; NL ; NL ). All participants or their legal representatives signed informed consent. 7

8 Whole genome sequencing and data analysis Whole genome sequencing Whole Genome Sequencing (WGS) was performed by Complete Genomics (Mountain View, California, USA) as previously described by Drmanac et al. 35 Prior to sequencing, a minimum set of quality controls was performed including DNA measurement using picogreen and gel electrophoreses to exclude sample degradation. Sequence reads were mapped to the reference genome (GRCh37) and variants were called in the aligned genome according to the methods described by Carnevali et al. using version 2.4 of the Complete Genomics software. 36,37 All samples passed internal Complete Genomics quality control parameters and a gender check and 96-SNP genotyping assay were performed to exclude sample-swaps. For additional quality control metrics Complete Genomics calculated the ratio of variants reported in dbsnpv137 to all variants identified (on average 97.3% of all called variants and 98.5% of high confidence variants). Validation of WGS variant calling As proof-of-principle experiments, we performed WGS on samples from 8 patients with known pathogenic mutations. All eight known pathogenic mutations were accurately identified (Table 3a). In addition, for all 50 trios (totaling 150 samples), we compared the overlap between variants previously called by WES and WGS. To compare the performance of both platforms, we first filtered the WGS variants to only include variants from within the exome targets (Agilent SureSelect version 2). The differences in variants identified could be explained by annotation differences between WGS and WES variant calling pipelines. For example, WGS might have called a three base substitution (e.g. ACG > TCC) whereas WES called two separate variants (e.g. A > T and G > C). In addition, we annotated all variants with the average exome target coverage for the same sample. Variants not called by WES but only called by WGS could be found predominantly in regions of low WES coverage. Vice versa, variants only identified in WES could be found more frequently in regions of above average exome coverage, likely indicating alignment problems due to repeats (Supplementary Table 3). Additionally, we identified a bias in our initial variant calling algorithm that only calle variants if they were found in reads on both strands, thereby missing some of the de novo variants. Identification of de novo SNVs Variants were identified by Complete Genomics reported in var format. The initial set of candidate de novo mutations, including both SNVs and insertion/deletion events, was generated using the cgatools calldiff program (Complete Genomics). This program compares 8

9 Table 3: De novo pathogenic and non-pathogenic variants previously detected using WES used to optimize de novo variant detection Trio ID Gene Genomic position (GRCh37) Protein level A: Known pathogenic mutations in patients with severe ID identified using WES Variant in WGS data Identified as de novo mutation? Pos. control 1 TCF4 Chr18(GRCh37):g C>T p.(arg576gln) + n.a. Pos. control 2 SCN2A Chr2(GRCh37):g G>A p.(trp1398*) + n.a Pos. control 3 GRIN2B Chr12(GRCh37):g G>A p.(pro553leu) + n.a Pos. control 4 GRIN2A Chr16(GRCh37):g G>C p.(pro522arg) + n.a Pos. control 5 PDHA1 ChrX(GRCh37):g delinsAGA p.(pro110argfs*71) + n.a Pos. control 6 GATAD2B Chr1(GRCH37):g G>A p.(gln470*) + n.a Pos. control 7 CTNNB1 Chr3(GRCh37):g del p.(ser425thrfs*11) + n.a Pos. control 8 LRP2 Chr2(GRCh37):g del p.(gly4146glufs*2) + n.a B: De novo mutations identified using WES 2 that are not considered to be the cause of ID 2 TRIO Chr5(GRCh37):g A>T p.(aps1368val) C15orf40 Chr15(GRCh37):g C>T p.(ala89thr) RB1 Chr13(GRCh37):g T>C p.(tyr173his) CNGA3 Chr2(GRCh37):g G>T p.(trp335cys) CXXC11 Chr2(GRCh37):g G>A p.(gly412asp) CLRN3 Chr10(GRCh37):g A>T p.(ile22asn) OSBPL9 Chr1(GRCh37):g C>T p.(pro276leu) MFAP3 Chr5(GRCh37):g A>G p.(ser53gly) FHDC1 Chr4(GRCh37):g G>C p.(leu297phe) ALG13 ChrX(GRCh37):g A>G p.(asn107ser) KRT32 Chr17(GRCh37):g C>T p.(glu89lys) HADHA Chr2(GRCh37):g C>T p.(=) ZNF831 Chr20(GRCh37):g G>A p.(arg1615his) USP8 Chr15(GRCh37):g G>A p.(=) NEK1 Chr4(GRCh37):g T>G p.(lys901asn) + -* 36 ATP7B Chr13(GRCh37):g G>A p.(=) EFS Chr14(GRCh37):g G>A p.(=) ZFYVE16 Chr5(GRCh37):g C>T p.(ala380val) C20orf26 Chr20(GRCh37):g A>C p.(thr790pro) RBL2 Chr16(GRCh37):g G>T p.(val99phe) ABCC8 Chr11(GRCh37):g G>A p.(=) + + +: Identified; -: not identified; n.a.: not assessed. *Variant was detected as somatic variant in the patient after deep-sequencing (see also Figure 1). Trio ID refer to numbers used in this study. 9

10 two variant files to determine where and how two genomes differ and returns a confidence score for each comparison (e.g. for de novo mutations, two scores are assigned one for the comparison of the child to his/her father and one for the comparison to his/her mother). This program can be used under the assumption of a non-diploid genome (suited for the calling of somatic mutations) as well as under the assumption of a diploid genome (suited for the calling of germline mutations). In our analysis for candidate de novo mutations, we combined these approaches; first, we compared the set of all initial candidate autosomal variants in the child separately to those of each of the parents under the assumption of a non-diploid genome. To this end, calldiff thus determined whether a variant is truly found in only the child s genome by gathering variants into superloci and performing a local refinement of variant calls. Next, we performed the same analysis under the explicit assumption of a diploid genome. For both analyses, the individual comparisons to each parent were merged and filtered for variants with varquality = PASS. The above analyses resulted in two separate lists of candidate de novo variants, diploid and non-diploid, which were merged into a single list. Note that the diploid calls were for the most part a subset of the non-dipolid calls. Variants occurring in both lists were assigned scores from the non-diploid list. Based on the rank order of the two confidence scores for each candidate de novo mutation, we binned the variants in three groups: Low confidence: at least one score < 0 Medium confidence: both scores 0 but at least one score < 5 High confidence: both scores 5 On average, 82 high, 99 medium and 1,279 low confidence de novo SNVs were called on the autosomes of each patient (Supplementary Table 4). As calldiff requires the same ploidy state for comparison of variants, we applied a different method for calling of de novo variants on sex chromosomes, including a separate analysis for male and female offspring. Variants on the X chromosome in female patients were compared to the maternal genome (using the diploid assumption) and binning of candidate de novo variants was based on a single confidence score (i.e. comparison to the single parent rather than both as performed for the autosomes). For male patients, variants were prioritized based only on variant call quality (i.e. varfilter = PASS). Proof-of-principle experiments for de novo SNV detection In the cohort of 50 patients selected for this study, WES had not revealed a de novo mutation as the cause for ID. Nonetheless, 21 de novo mutations were detected in these 50 patients that most likely represent background mutations explained by the human per-generation de novo 10

11 mutation rate. The de novo substitution rate in this set of patients (0.42) is lower than the full series of 100 patients. Analysis of previous WES for quality metrics, including coverage, however, did not reveal a significant difference between this subset of 50 patients compared to the 50 patients that were not selected for our current study (median WES coverage for patients in WGS study: 62.9-fold, and for patients not part of this study 62.2-fold; Mann-Whitney U-test, P=0.85). To further verify de novo mutation detection, we assessed whether these de novo mutations were accurately detected using WGS. Using the method described above, 21/21 mutations were present in the patient s genome (present in var file). In the downstream analysis for de novo mutation detection, 19/21 mutations were routinely detected as a de novo (Table 3b). Further evaluation of why 2/21 mutations were not identified as a potential de novo mutation revealed that one mutation (in CXXC11) was absent from the de novo candidate list due to low QC values after comparison to parental samples ( PASS;SQLOW, i.e. a somatic score < -10). Interestingly, the other mutation (in NEK1) was not correctly annotated as candidate de novo since the variant was suggested to be present in mosaic state in the patient (confidence scores at for comparison to parental genomes were -4 and -4). Follow-up experiments with deep-sequencing up to >25,000 sequence depth using Ion Torrent (Life Technologies) indeed confirmed this variant to be present at 32%, thereby confirming its somatic origin and explaining why it was not correctly identified in our pipeline. Prioritization of candidate (de novo) mutations All variants reported in the candidate de novo lists were annotated using our custom in-house bioinformatic pipeline, including information on their genomic position (exon, splice site, intron, UTR, promoter region), their predicted effect on protein level, and their frequency in variant databases (in-house; dbsnpv37; Exome variant server (ESP), 38 the genome of the Netherlands (GoNL), 39 and the Wellderly dataset 40 ). We then proceeded to systematically validate all variants that according to RefSeq were: within the coding sequence and canonical splice sites (+2bp/-2bp) and that o had a high or medium confidence of being de novo o disrupted the gene by causing a frameshift or stop mutation o were within a known ID gene (Supplementary Table 10) within a acceptor (+20bp) or donor (8bp) splice site and that had a high or medium confidence of being de novo outside of the coding regions, had a high confidence of being de novo and were within: o the introns of a known ID gene o the UTR of a known ID gene

12 o the promoter region of a known ID gene according to ENCODE information (Chromatin state segments of nine human cell types (Broad ChromHMM) and transcription factor binding sites (Txn Factor ChIP)). 41 This prioritization scheme resulted in 382 potential de novo autosomal mutations for further validation using Sanger sequencing. For the X-chromosome, candidate de novo mutations were selected in a similar way as autosomal variants, albeit that only a single confidence score was used (see Identification of de novo SNV mutations ). For male patients we selected variants with VarFilter = PASS, followed by the same filter steps as used in female patients. The systematic filtering process identified 36 candidate de novo mutations on the X-chromosome for further validation using Sanger sequencing. Prioritization of autosomal inherited SNVs for recessive inheritance Using the Complete Genomics software cgatools listvariants and cgatools testvariants, variants identified in the patient were annotated with the parent-of-origin. Using these annotated files, we selected coding variants (or affecting canonical splice sites) in known ID genes (Supplementary Table 10) that segregated according to a recessive inheritance pattern, including both homozygous variants (genotype calls for patient, father, mother respectively) as well as compound heterozygous mutations (two variants in same gene, one showing genotype call and one for patient, father, mother respectively). To further prioritize for clinically relevant mutations, variants were filtered for a minor allele frequency <=2% using our in-house variant database, GoNL 39, dbsnpv137, and the Wellderly data set 40. Additionally, synonymous variants as well as missense mutations with a PhyloP score <3.5 were excluded for analysis. 42 This analysis did not identify recessive SNV mutations, either compound heterozygous or homozygous. Prioritization of maternally-inherited X-linked recessive SNVs in male patients X-linked variants that segregated from mother to male offspring were identified using the Complete Genomics cgatools listvariants and testvariants programs by genotype calls (genotype child, father, mother). We selected according to an X-linked inheritance assuming a pattern of a heterozygous variant identified in the mother and the same hemizygous variant in the male offspring. Variants were subsequently annotated using a custom annotation pipeline and prioritized for coding and canonical splice-site variants located in known ID genes (Supplementary Table 10). To further prioritize for clinically relevant mutations, variants were filtered for a minor allele frequency <=2% using our in-house variant server, GoNL 39,

13 dbsnpv137, and the Wellderly data set. 40 Additionally, synonymous variants as well as missense mutations with a PhyloP score 5.0 were excluded for analysis. 42 This filter process revealed one X-linked variant in patient 24, for which segregation analysis has so far been inconclusive in determining its relevance to the patient s phenotype. Validation of candidate (de novo) SNVs Validation and de novo testing for all candidate de novo mutations was performed using standard Sanger sequencing approaches. Primers were designed to surround the candidate mutation, and PCR reactions were performed using RedTaq Readymix PCR reaction mix (Sigma-Aldrich, St. Louis, MO, USA), AmpliTaqGold 360 Master mix (Life Technologies, Bleiswijk, NL) and/or Phire Hot Start Master mix (Finnzymes, Bioke, NL). Primer sequences and PCR conditions are available upon request. Of 382 autosomal candidate de novo mutations, 136 variants (35.6%) were confirmed by Sanger sequencing of DNA of the patient. Analyses of parental DNA samples subsequently established de novo occurrence for 131 of these (34.3%). Of note, validation rates correlated well with the confidence levels, being 3.0%, 20.0% and 78.9% for low, medium and high confidence autosomal candidate de novo variants respectively (Supplementary Table 5). For candidate de novo SNVs on the X-chromosome, 8/36 (22%) were validated and shown to be of de novo origin using Sanger sequencing. Identification of (de novo) CNV and SV event Two different approaches were used to detect copy number variation, i.e. one based on read depth information (referred to as CNV calling), and one based on discordant read pairs (referred to as SV calling), both provided by Complete Genomics. The latter method (SV calling) was used both to provide additional evidence to support for CNV calls, as well as to detect copy neutral events such as genomic inversions and/or balanced translocations. That is, a CNV identified using the read depth approach (with a resolution determined by the window bin-size) could be fine-mapped using the SV algorithm to single nucleotide level (using the discordant read pairs). In addition, the SV detection method was able to detect copy number variants below the bin-size used by the read depth method. In total, 13,799 CNVs were detected by Complete Genomics based on deviations in the sequencing read depth in comparison to a baseline using 2kb, GC-corrected windows and a Hidden Markov Model. 37 The called CNV segments were filtered on the ploidity score > 10 and subsequently filtered based on frequency in the population and likely functional impact. Annotations were based on gene components (as per RefSeq), overlap with CNVs from GoNL release 1 ref 39, overlap with non-related parents from the dataset, as well as Wellderly data set 40 using both a 80%, 50% reciprocal and 1bp overlap. Similarly regions were annotated for overlap

14 with the ID gene list (Supplementary Table 10) using a 1bp overlap. Subsequently CNVs were filtered based on frequency in the population and likely functional impact. CNVs were considered rare when i) they were found in less than 1% of the Wellderly population, ii) less than or equal to two non-related parents out of a subset of 48 parents, iii) less than or equal to three out of 250 GoNL trios. A copy number aware reciprocal overlap of 80% was used that considered two segments to overlap when the control CNV had the same copy number or higher than the patient s CNV gains or the same copy number or lower than the patient s CNV losses. In addition, only male individuals were included as control samples for chromosome X. 551 rare CNV segments were identified outside hypervariable regions, of which 130 overlaped with a coding region of the genome, these segements were then considered for prioritization based on inheritance. CNVs were visualized using Biodiscovery Nexus software. Secondly, SV calling was performed using uniquely mapped discordant read pairs. Discordant read pairs were clustered into junctions based on genomic location and mapping orientation and finally merged into 194,913 events: tandem duplications (n=11,422), distal duplications (n=6,532), deletions (n=99,471), inversions (n=68,461), translocations and complex events (n=8,980). The junction breakpoints were refined where possible by local reassembly of neighboring reads. 37 After SV calling, filtering was performed on all duplication types, deletions, inversions and translocations to identify rare de novo SVs occurring, i) in 1% of Wellderly samples 43, ii) occurring in neither parent, iii) the least frequent junction involved in the event should have a Complete Genomics baseline frequency of 0.1 and iv) mean read depth of at least 10 reads across the event. SV events were considered overlapping when all genomic regions contributing to the event combined overlapped reciprocally by more than 80% or had matching breakpoints within a 50 bp window. In total 1,350 rare events were identified of which 67 had at least one high confidence junction and were considered for prioritization based on inheritance. Prioritization for clinically relevant dominant (de novo) and recessive CNV and SVs A candidate list of CNV events was created to include regions overlapping with RefSeq genes and segments with a higher copy number gain or a lower copy number loss than the patient s own parents. After filtering, the CNVs were analyzed for de novo occurrence, autosomal recessive inheritance, as well as X-linked inheritance and compound heterozygosity. Losses or gains in the patient were considered de novo when overlapping parental segments (80% reciprocal) were copy number neutral. Potential autosomal recessive inheritance CNVs were extracted by identifying CNVs in which the parents had overlapping segments of the same type (i.e. gain or loss) thereby extracting homozygous deletions and triplications. X-linked inheritance was considered for male offspring when the mother had an overlapping segment of

15 the same CNV type. Finally, compound events were investigated by comparing events that overlap the same RefSeq gene when one CNV was maternally inherited and the other was paternally inherited. SVs were annotated for overlap with coding sequence or to occur in non-coding sequencing of a known ID gene (Supplementary Table 10). To identify genic SVs with autosomal recessive or X- linked inheritance the filtering was adjusted to include events overlapping SVs from the same SV class segregating from parent to patient. To detect SVs occurring via X-linked inheritance SVs were identified occurring in male offspring and their mothers. Finally compound SV-SV events were investigated by comparing events that overlap the same RefSeq gene. Variants were only paired when one variant was maternally inherited and the other paternally. The resulting compound events were filtered to identify high confidence rare events (occurring <= 1% in the Wellderly population and having at least 10 supporting read pairs). Validation of clinically relevant CNV and SV events All candidate pathogenic CNVs and SVs were validated using the Affymetrix CytoscanHD microarray platform and/or breakpoint spanning PCRs. Affymetrix arrays were used according the manufacturer s instructions. For breakpoint spanning PCRs, primer sequences were chosen to be located ~ base pairs from the predicted junction. Primer sequences are available upon request. PCR reactions were performed using RedTaq Readymix PCR reaction mix (Sigma-Aldrich, St. Louis, MO, USA), AmpliTaqGold 360 Master mix (Life Technologies, Bleiswijk, NL) and/or Phire Hot Start Master mix (Finnzymes, Bioke, NL). Afterwards, Sanger sequencing was used to verify the junction fragments. Cross variant type analysis for clinically relevant recessive mutations Cross variant compound events were considered when two variants of different mutation types affected the same gene but were inherited from different parents (e.g. an inherited SNV from father and an inherited CNV from mother, both affecting the same gene). Cross variant compound analysis was performed between all variant types called by Complete Genomics. SNVs that were heterozygous in the child and found in heterozygous state in only one of the parents were considered for the compound analysis. Such SNVs were extracted by filtering the output of cgatools listvariants and cgatools testvariants by genotype calls or for patient, father, mother, respectively. CNVs that were considered for compound analysis must have only one parent with an overlapping CNV of the same type with more than 80% reciprocal overlap. Likewise, SVs were considered for compound analysis if overlap was identified in only one of the parents using a reciprocal overlap of 80% or more. Of note, the

16 analysis between SVs and CNVs excluded regions that had an reciprocal overlap of 50% or matching breakpoints within a 200bp window, to avoid predicting events that were in fact the same variant. All resulting cross compound variants were filtered on frequency in the control population (see above) and ranked on likely functional importance, e.g. coding exons having a higher priority than introns. This analysis did not reveal any candidates of potential clinical relevance. Generation of gene lists involved in ID To systematically search for genes involved in ID phenotypes and to classify them as known, or candidate ID gene the following actions were performed: 1. All protein-coding variants reported in HGMD (Professional th December 2013) that were reported to occur in patients with a phenotype of intellectual disability or mental retardation (n=368 genes). HGMD attempts to catalog known (published) gene lesions responsible for human genetic disease. Recent publications however have shown that some mutations collated do not in fact represent disease-causing mutations, hence, we subsequently annotated and prioritized all mutations based on mutation type (synonymous, missense, nonsense, splice site, insertion/deletion, gross CNV) and genomic conservation (phylop 3.5). We considered mutations to be pathogenic when they were either (i) deleterious (nonsense, splice, insertion/deletion and gross CNV) or (ii) represented missense mutation with a disease mutation (DM) annotation. Next, we counted the number of patients known to HGMD with pathogenic mutations in each of the genes. 2. We systematically collected all de novo mutations reported by large-scale exome sequencing studies that focused on the detection of de novo mutations in patients with intellectual disability. 32,44,45 These data were supplemented with de novo mutations identified in patients with epilepsy 46, autism and schizophrenia because of phenotypic overlap with ID. To prioritize likely pathogenic mutations, variants were prioritized according to the following: The variant was identified in a proband The variant was validated by an alternative technique and proven of de novo origin De novo variants that were previously identified by de Ligt et al Ref 32 and occurring in the current series of 50 patients with severe ID were excluded (to prevent over-interpretation of genes Table 3b in supplement) - Only non-synonymous mutations were taken into account The PhyloP conservation of missense mutations was

17 For all genes (n=462) that remained after this selection, we subsequently interrogated HGMD to see how many additional patients were reported to have mutations in these genes. 3. We performed a systematic literature search in PubMed using the search terms Whole exome sequencing AND Intellectual Disability (23 rd January 2014). This search returned 110 papers describing an individual gene and its relation to ID as well as 24 studies describing multiple genes. All genes (n=154) were manually extracted from these publications and subsequently annotated with the number of patients reported. Of note, for publications describing families with autosomal recessive ID, each family was counted as one patient. 4. Within the diagnostic section of our Department of Human Genetics, we routinely screen patient-parents trios with severe intellectual disability using exome sequencing. During data analysis, prioritization of (de novo) variants is based on a list of ID genes previously reported in patients with ID (n=494 genes). Due to the phenotypic overlap of ID with epilepsy and/or several metabolic disorders, this gene lists represents a wide spectrum of genes involved in ID. We prioritized this gene list in based on the number of patients with pathogenic mutations using the variants from action 1 as well as from our in-house diagnostic and research databases collecting patients with de novo mutations. Based on published studies screening large patient cohorts we calculated that 1,476 patients were systematically screened. Including the single gene studies, we therefore estimated the total number of patients to be roughly 2,500. We then calculated the required number of patients with a de novo mutation in a single gene to achieve a 0.1/18,424 (the number of genes being tested) probability of a false positive. More than 90% of all 18,424 genes have a coding size of 3,400 bases (or less). We subsequently used this as a conservative gene size estimate. Applying a Poisson distribution showed that mutations in a single gene in 5 patients or more are required for it to be a ID gene, whereas genes with mutations in one or more but less than 5 patients were used as candidate ID gene. Table 4: Classification of genes according to the frequency of reported pathogenic mutations Total genes Known ID gene Candidate ID gene HGMD with phenotype ID or MR PubMed search WES+ID Diagnostic gene list Large-scale sequencing studies with overlapping phenotypes Total unique genes

18 Based on these calculations, the four lists of known ID genes and four lists of candidate ID genes were combined (Table 4), checked for overlap and adjusted for gene category when necessary (when after combining multiple searches 5 patients were reported). Both lists were then filtered for unique genes. These efforts collectively resulted in a known ID gene list containing 528 genes (Supplementary Table 10), and a candidate ID gene list containing 628 genes (Supplementary Table 11). Assessment of rare missense burden of known ID genes, candidate ID genes and genes with de novo mutations Similar to Petrovski et al. 54 and Kurana et al. 55 we evaluated the tolerance of genes for variation by assessing variation burden in the normal population. We downloaded all variants called in 6,503 exomes from the Washington variation server (ESP6500SI) 38 from which we selected all missense and synonymous variants not associated to disease, hence removing 1) insertions and deletions 2) nonsense variants 3) variants that incorporated multiple bases or alleles 4) variants of which the effect on protein could not be inferred using refseq annotation and 5) variants that were reported in the HGMD database. 56 This resulted in a total of 1,118,178 SNVs (696,013 missense, 422,165 synonymous) across 18,424 refseq genes. Based on the hypothesis that functionally more important genes harbor less variation we calculated for each gene two separate features: the number of positions where a rare variant in the population (<1%) results in a missense change, per 1000 base pairs of coding sequence (Ns/Kb) and the ratio of the number of positions where a variant in the population results in a missense change to the number of positions resulting in a rare synonymous change (Ns/Ss). For a given set of genes (e.g. all RefSeq genes) we calculated the overall measures by taking the median of scores of the individual genes in the set. Gene lengths were calculated by summing the lengths of the individual coding exons of a gene. We tested for significant differences in tolerance to varation between gene sets by permutation testing against all RefSeq genes; the median values for each gene sets was compared to that from generating 1,000,000 gene sets of the same size with randomly chosen genes from the complete list of refseq genes. As a proof-of-principle, we validated the method by comparing rare missense burden of the known ID genes (Supplementary Table 10) to genes that are known to be loss-of-function tolerant. 57 From the list by MacArthur et al. 57 we excluded genes in which variants were identified that (1) do not lead to nonsense-mediated decay (i.e. truncating variants located within the last exon), (2) that truncated only specific transcripts of a gene, (3) were not validated by an alternative technique, or are otherwise indicated as potential false positives. This resulted in a list of 242 variants compromising 225 unique genes of which 169 had a refseq annotation. We found a significant lower tolerance for rare missense variants in known ID genes than expected (Median Ns/Kb =

19 17.68, P < 1.0E-06; Median Ns/Ss = 1.5, P<1.0E-06) as well as in candidate ID genes (Median Ns/Kb = 17.61, P<1.0E-06; Median Ns/Ss = 1.545, P<1.0E-06), whereas we found a significantly higher tolerance in loss-of-function tolerant genes (Median Ns/Kb = 26.49, P<1.0E-06; Median Ns/Ss = 2.214, P<1.0E-06) (Extended Data Figure 1). We repeated this analysis for the 9 known and 10 candidate ID genes in which we found de novo mutations in our patients. These analyses showed a significant lower tolerance for rare missense variants for the 9 known ID genes (Median Ns/Kb = 9.45, P = ; Median Ns/Ss = 0.71, P=2.03E-04) as well as for the 10 candidate ID genes (Median Ns/Kb = 12.8, P = 0.013; Median Ns/Ss = 1.27, P=0.037). P-values from both scores were combined using Fisher s combined probability test. Statistical analysis for overrepresentation of deleterious mutations in patients with severe ID We compared the proportion of de novo loss-of-function (LoF) mutations (n=18, including nonsense, frameshift and canonical splice site) identified in our cohort to frequencies reported in healthy controls from multiple published studies (Table 5). 45,47,49-51,53 To test significance, we used a two-sided Fisher s exact test as implemented in R. 29 Overrepresentation of de novo mutations Overrepresentation of de novo mutations in known and candidate ID gene lists was calculated by first calculating the total coding size in the genome based on refseq annotation (32,342,354) and comparing the coding de novo rate to the coding de novo rate within the coding sequence Table 5: Overview of de novo mutations in healthy individuals Number of controls (c) or siblings (s) Total LoF Missense Synonymous Rauch et al. 20 (c) O'Roak et al. 50 (s) Sanders et al.* 200 (s) Iossifov et al.** 343 (s) Gulsuner et al. 84 (c) Xu et al.*** 34 (c) Total Putative LoF includes nonsense, frameshift and canonical splice site mutations based on gene annotation; * study did not consider indels; ** not all de novo mutations were validated by Sanger sequencing; ***excluded splice site variants if not in canonical di-nucleotide of splice site. For some studies the numbers deviate from the numbers given in the main publication, numbers considered here were retrieved from de novo mutation overviews from supplementary tables

20 of the gene list using a two-sided Fisher s Exact test. Overrepresentation for loss-of-function mutations was subsequently calculated using a two-sided Fisher s exact test and control data described in Table 4. Statistical analysis of clinically relevant CNV and SV events Enrichment analysis for known ID genes was performed in comparison to de novo CNVs identified in controls 58,59 using Fisher s Exact test. In addition, CNV and SV events identified as clinically relevant were tested for the occurrence of exonic CNVs within the same gene in a larger cohort of both inhouse individuals affected with ID (n=7,743) and public as well as inhouse control individuals (n=4,056 of which 2,108 were male). 60,61 Microarray experiments were performed on these cohorts to identify CNVs using the Affymetrix CytoScanHD (n=10,544) and Affymetrix 6.0 (n=1,255) microarray platforms. All experiments passed quality control (MAPD 0.25 and SNP QC 15), after which CNVs were identified using Affymetrix Power Tools v The frequency of CNVs identified in these two groups was then compared and odds ratios calculated. Multiple testing correction was performed using Benjamini Hochberg with a FDR of 0.1. In the case of events occurring on the X-Chromosome only male controls were used in the comparison. RNA confirmation studies for IQSEC2-TENM3 gene fusion To assess the presence of a gene fusion of IQSEC2-TENM3 in patient 31, Epstein Barr Virus transformed lymphoblastoid cells (EBV-LCLs) were cultured with and without the addition of cyclohexamide. As negative control, EBV-LCLs from a non-related control sample was used. Total RNA was isolated using a RNeasy mini columns (QIAGEN) according the instructions of the manufacturer. A total of 2 µg was subsequently processed for cdna generation using the M MLV Reverse Transcriptase mix (Invitrogen). As a result of the inverted insertion of TENM3- ex22-27 duplication in intron 2 IQSEC2, it was predicted that an inframe cdna were to be generated. As such, forward primers were chosen to be located in exon 2 of IQSEC2, whereas the reverse primers were chosen to be located in exon 23 of TENM3. PCRs were then performed using AmpliTaqGold 360 Master mix (Life Technologies, Bleiswijk, NL). Generated PCR products were Sanger sequenced and mapped against a the predicted fusion gene sequence using Vector NTI (Life Technologies)

21 Amplicon-based deep sequencing using Ion Torrent To confirm the somatic presence of mutations, PCR amplicons were generated using standard PCR protocols and used for deep-sequencing on an Ion PGM sequencer (Life Technologies). PCR products were sheared to 200-bp fragments by use of Covaris E210 (Covaris Inc.). Library preparation was performed using the Ion Xpress Plus Fragment Library Kit in combination with the Ion Xpress Barcode adapters 1-96 kit. Emulsion PCR and enrichment were carried out on the Ion OneTouch System using the Ion PGM TemplateOT2 200 Kit as described by the manufacturer. Sequencing was performed using the 200 kit v2, with Ion Sphere particles loaded to an Ion314 or Ion316 chip (Life Technologies). Raw sequencing reads were mapped to the reference genome (hg19) using the Torrent Server (part of the PGM software) with use of the Torrent suite software. FASTQ files were subsequently analyzed and visualized in SeqNext (JSI Medical systems). Clinical interpretation of mutations using established diagnostic criteria For each mutation a standardized pipeline was applied to interpret gene function and mutation impact (Supplementary Table 8). This pipeline was previously described by de Ligt et al and is in concordance with protocols for variant interpretation by Bell et al and Berg et al Essential steps in classification and interpretation are described in more detail below. As a consequence of this systematic analysis, de novo mutations in known/candidate ID genes may be classified as not relevant due to lack of mutation impact. Also, other seemingly severely disruptive mutations may be classified as not relevant for the ID phenotype as they either (i) are involved in phenotypes not related to ID (e.g. cardiomyopathies) or (ii) evidence for biological functioning related to ID is (currently) absent. Despite these potential pitfalls, the classification and interpretation as described below has been shown to be very powerful and is used on daily basis in our diagnostic laboratory of the department (accredited to the CCKL Code of practice, which is based on EN/ISO (2003) (Registration numbers R114/R115, Accreditation numbers 095/103)). Furthermore, the use of this classification system in defining new candidate ID genes (e.g. for genes in which, to our knowledge, this report describes the first de novo mutation in a patient with ID), has proven successful as demonstrated in Table 1 of this supplement (page 5); For approximately 66% of all candidate genes we previously reported 32,44, additional de novo mutations in patients with overlapping clinical phenotypes have since been identified

22 Mutation impact of SNVs Mutation impact for (de novo) SNVs on protein level was assessed in two different ways. Firstly, mutations were classified as deleterious (reported as D in Supplementary Table 8) if the mutation resulted in (i) a nonsense mutation, (ii) frameshift, (iii) inframe insertion/deletion event, or (iv) splice defect (assessed using Interactive Biosoftware Alamut version 2.3-2). For missense mutations prediction of pathogenicity was evaluated using Condel 64 (Referred to as P in Supplementary Table 8). As an additional indicator for mutation severity, the genomic conservation was used, assessed as a PhyloP 3.5 (referred to as C, Supplementary Table 8). 42,44 This specific PhyloP score is based on comparison of the PhyloP score for all missense mutations listed in HGMD 56 (a catalog of disease causing mutation) and dbsnp (as proxy for variants in healthy individuals). Whereas no PhyloP score is mutually exclusive to either type of database, the relative frequency of missense variants with a PhyloP score 3.5 is higher in HGMD than in dbsnp. 44 Functional relevance of SNVs A systematic literature study using various resources such as PubMed was conducted for all mutated genes to identify their potential link to ID (referred to as F, Supplementary Table 8). Additional functional support as provided by enrichment for Gene Ontology (GO) Biological Processes and/or mouse phenotypes (MP) known to be involved in known ID genes 32. To determine which GO and MP terms were enriched, the list of 528 known ID genes (Supplementary Table 10) was used. Next, all GO and MP terms attributed to genes with de novo mutations were systematically compared to those ID enrichmed terms; Mutations scored positive for these analysis if there was overlap for the enriched terms (referred to as G and M in Supplementary Table 8, respectively). In addition, the brain expression profile for each gene was assessed using data derived from the Human Brain Transcriptome database (version 2-dec-2013), which provides spatio-temporal information on genes in human brain development. 65 A gene was considered to be brain expressed if the Log2 intensity was 6 for a minimum of three periods as described by Neale et al. 48 (referred to as B in Supplementary Table 8). When no brain expression data was available, the expression was considered to be unknown (referred to as u ). Of note, variants that were previously identified using WES were re-evaluated in this process. Mutation impact and functional relevance of (de novo) CNV and SVs Mutation impact of (de novo) CNVs and SVs was evaluated using established clinical diagnostic methods. 31,66,67 In brief, genomic regions affected by copy number alterations are evaluated for the frequency in the normal population and/or occurrence in other patients with clinical

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc. Variant Classification Author: Mike Thiesen, Golden Helix, Inc. Overview Sequencing pipelines are able to identify rare variants not found in catalogs such as dbsnp. As a result, variants in these datasets