For the article titled An atlas of human long non-coding RNAs with accurate 5′ ends: TABLE OF CONTENTS



SUPPLEMENTARY INFORMATION for the article titled
An atlas of human long non-coding RNAs with accurate 5′ ends
doi:10.1038/nature21374
by Division of Genome Technologies, RIKEN, Yokohama, Japan (2017)

TABLE OF CONTENTS

A Supplementary Notes
1 Rationale and Implementation of TIEScore
2 Benchmarking Performance of TIEScore
3 TIEScore Cutoffs and TSS Validation Rates
4 Definition of Genes, Coding Potential and Comparison with GENCODEv
5 Directionality, Exosome Sensitivity and Transcript Properties
6 Sequence Features at LncRNA TSS Supported by Different DHS Types

B Online Resources
Assembly, Expression Atlas and Other Resources
Overview of FANTOM CAT Browser
Use Case 1: Download a List of Novel e-lncRNAs
Use Case 2: Explore LncRNAs with Conserved Exons and Implicated in
Use Case 3: Explore LncRNAs Enriched in Classical Monocytes
Use Case 4: Explore Cell Types Associated with Crohn's Diseases

C Supplementary Tables
CAGE Library Information
RNA-seq Library Information
FANTOM CAT Gene Information
Directionality, Exosome Sensitivity and Transcript Properties
FANTOM CAT Genes in LncRNAdb
Conservation of TIR and Exon
Transposons at TIR
Orthologous Transcription
Expression Levels and Specificity in Primary Cell Facets
Sample Ontology Information
Gene Association with Cell Types
Trait Information
Gene Association with Traits
Curation of Cell type and Trait Pairs
Genes Involved in Cell Type and Trait pairs
linked lncRNA and mRNA Pairs
Gene-based Functional Evidence
Grouping of Samples for Differential Expression
Differential Expression Results

D References

A Supplementary Notes

1 Rationale and Implementation of TIEScore

[Supplementary Fig. 1 panels. a, TIEScore criteria: C = CAGE cluster expression level (cpm) (CAGE criterion); E = exon number × total exon length (kb) (Exon criterion); D = distance from CAGE cluster to transcript (nt) (Distance criterion). The criteria are transformed based on linear regression with the TSS validation rate by Roadmap DHS as in b, and weighted based on benchmark performance (ROC-AUC) against gold standards as in c. TIEScore = w_e f_e(E) + w_d f_d(D) + w_c f_c(C), with w_e =., f_e(E) = 5.5 log2 E; w_d =.55, f_d(D) =.9 D; w_c =.35, f_c(C) = 6.63 log2 C. b, Transformation: CAGE cluster or transcript TSS validation rate by DHS (%) plotted against the Exon (E), Distance (D) and CAGE (C) criteria in linear, square root and log2 scales, for actual TSS and permutated TSS, with r² of each fit. c, Weighting: TIEScore performance (ROC-AUC) for combinations (n=86) of relative weights of f_e(E), f_d(D) and f_c(C), ranked by performance.]

Supplementary Fig. 1 Rationale and Implementation of TIEScore. a, For each pair of transcript and CAGE cluster, TIEScore is calculated as the sum of values from the exon (E), distance (D) and CAGE (C) criteria, which were transformed by f_e(E), f_d(D) and f_c(C) as in b and weighted by w_e, w_d and w_c as in c, respectively. The resulting TIEScore can be interpreted as an empirical estimation of the TSS validation rate by DHS. b, Transformation of TIEScore criteria values. We investigated the correlation between TIEScore criteria values and the TSS validation rate by DNaseI hypersensitivity sites (DHS). The criteria values were transformed into various scales (columns) and plotted against the TSS validation rate by DHS. A TSS is validated if it overlaps a DHS. Within each bin of values on the X-axis, the percentage of validated TSS was calculated (i.e. TSS validation rate). The same number of positions was randomly sampled from unannotated genomic regions (permutated TSS). Linear regression (red line, 99.99% confidence intervals) was performed on the actual TSS and r² was indicated. The plot with the highest r² within each TIEScore criterion was highlighted in pink and the corresponding linear regression functions (i.e. f_e(E), f_d(D) and f_c(C)) were used to transform the TIEScore criteria values. c, Weighting of TIEScore criteria values. TIEScore is defined as the weighted sum of the transformed TIEScore criteria values.

1.1 TIEScore Rationale: Transformation and Weighting of Criteria Values

Transcription Initiation Evidence Score (TIEScore) is a custom metric that evaluates the properties of a pair of CAGE cluster and transcript model to quantify the likelihood that the corresponding CAGE transcription start site (TSS) is genuine (Supplementary Fig. 1a). We reasoned that lowly abundant transcripts are likely to be insufficiently covered by RNA-seq reads, and thus their transcript models are more likely to be incomplete and truncated into shorter and less spliced models. This is supported by the observation that the product of exon number and total exon length (kb) of transcript models (i.e. Exon criterion, E) is highly correlated with the percentage of TSS validated by DHS (i.e. validation rate, Supplementary Fig. 1b). Also, lowly abundant CAGE clusters that are distant from transcript 5′ ends may represent degradation products of longer transcripts, previously referred to as exon painting 2. This is supported by the observation that the expression levels of CAGE clusters and their distances to the closest transcript model (i.e. CAGE criterion, C and Distance criterion, D) are correlated with the validation rate (Supplementary Fig. 1b-c). We therefore devised TIEScore (Supplementary Fig. 1a), a metric integrating these criteria to address inaccurately identified 5′ ends of transcripts and to build transcript models with well-supported transcription initiation evidence.

First, we explored the linearity of the validation rate across these criteria values using various scales based on linear regression (Supplementary Fig. 1b). For each criterion, the scale that yields the highest r² was chosen (Supplementary Fig. 1b) and its value was then transformed into a validation rate based on the corresponding linear regression functions (i.e. f_e(E), f_d(D) and f_c(C)) as in Supplementary Fig. 1b. TIEScore is defined as the weighted sum of these transformed values.
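The weighted sum described above can be sketched in a few lines of Python. This is only an illustration of the structure of the score, not the authors' implementation: the `Criterion` class, the `tie_score` function and every slope, intercept and weight below are hypothetical placeholders rather than the fitted values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One TIEScore criterion: a scale, a fitted line and a relative weight."""
    scale: str        # "linear", "sqrt" or "log2"
    slope: float      # hypothetical regression slope
    intercept: float  # hypothetical regression intercept
    weight: float     # hypothetical relative weight

    def transform(self, value: float) -> float:
        """Map a raw criterion value to an estimated validation rate (%)."""
        if self.scale == "log2":
            x = math.log2(value)   # requires value > 0
        elif self.scale == "sqrt":
            x = math.sqrt(value)   # requires value >= 0
        elif self.scale == "linear":
            x = value
        else:
            raise ValueError(f"unknown scale: {self.scale}")
        return self.slope * x + self.intercept


def tie_score(exon, distance, cage, f_e, f_d, f_c):
    """Weighted sum of the three transformed criteria (E, D and C)."""
    return (f_e.weight * f_e.transform(exon)
            + f_d.weight * f_d.transform(distance)
            + f_c.weight * f_c.transform(cage))


# Placeholder parameters for illustration only (not the published fits).
F_E = Criterion("log2", slope=5.0, intercept=40.0, weight=0.2)
F_D = Criterion("linear", slope=-0.1, intercept=90.0, weight=0.5)
F_C = Criterion("log2", slope=6.0, intercept=50.0, weight=0.3)

score = tie_score(exon=4.0, distance=100.0, cage=8.0, f_e=F_E, f_d=F_D, f_c=F_C)
print(round(score, 2))
```

Because each transformed value is itself an estimated validation rate and the placeholder weights sum to one, the resulting score stays on the same percentage-like scale as its inputs.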
To optimize the weights of these criteria, we evaluated the performance of TIEScore for discrete combinations of weights (n=86 combinations, i.e. three weights on a grid, with a constant sum, Supplementary Fig. 1c). The performance of TIEScore was measured in terms of the area under the receiver operating characteristic (ROC) curve (ROC-AUC) 3 in 70 matched CAGE and RNA-seq libraries (see details in Supplementary Note 2), and the combination of weights which yielded the highest ROC-AUC was chosen (i.e. w_e, w_d and w_c). TIEScore is thus calculated as: w_e f_e(E) + w_d f_d(D) + w_c f_c(C), which can be interpreted as an empirical estimation of the TSS validation rate by DHS.

1.2 Implementation of TIEScore in Meta-assembly of FANTOM CAT

TIEScore was evaluated for each pair of CAGE clusters and transcript models within kb. Each transcript model was assigned to the CAGE cluster yielding the highest TIEScore and its 5′ end was adjusted to the most prominent TSS of that CAGE cluster. Transcripts with no CAGE peaks within kb, and CAGE peaks without transcripts within kb, were therefore discarded. This associated each retained transcript with a CAGE cluster, and each retained CAGE cluster with one or more transcripts. A CAGE cluster is then assigned the highest TIEScore yielded from its associated transcripts.

TIEScore was first applied to each of the five transcript model collections separately (with ,897 CAGE libraries, Supplementary Table), and these were then merged into a non-redundant transcript set (referred to as the raw FANTOM CAGE associated transcriptome (CAT)). Specifically, transcript models from GENCODEv9 were used as the initial reference, onto which transcripts from the other four collections were sequentially overlaid, in the sequence of Human BodyMap 2., MiTranscriptome 5, ENCODE 6 and the FANTOM5 RNA-seq assembly (this study). In each overlay, query transcript models that share 1) 5′ ends within ±5bp, 2) exactly the same splicing junctions, and 3) 3′ ends within ±2,bp with reference transcript models were discarded as redundant. The non-redundant query transcript models were then added to form the new reference set for the next round of overlay. For each CAGE cluster, the maximum TIEScore after all four rounds of overlays was taken as its TIEScore. The TIEScore of raw FANTOM CAT was benchmarked against gold standards (with N=5, see details in Supplementary Note 2). The TIEScore cutoffs were chosen as described in the Supplementary Note on TIEScore cutoffs and TSS validation rates.
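The sequential overlay described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Transcript` class, the function names and the tolerance parameters `five_tol` and `three_tol` are hypothetical, and the tolerance values passed in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transcript:
    chrom: str
    strand: str
    five_prime: int                        # genomic position of the 5' end
    three_prime: int                       # genomic position of the 3' end
    junctions: Tuple[Tuple[int, int], ...]  # (donor, acceptor) pairs


def is_redundant(query, ref, five_tol, three_tol):
    """True if query matches ref on 5' end, splice junctions and 3' end."""
    return (query.chrom == ref.chrom
            and query.strand == ref.strand
            and abs(query.five_prime - ref.five_prime) <= five_tol
            and query.junctions == ref.junctions
            and abs(query.three_prime - ref.three_prime) <= three_tol)


def overlay(reference, query, five_tol, three_tol):
    """Add to the reference every query transcript not redundant with it."""
    kept = [q for q in query
            if not any(is_redundant(q, r, five_tol, three_tol)
                       for r in reference)]
    return list(reference) + kept


def meta_assemble(collections: List[List[Transcript]], five_tol, three_tol):
    """Overlay collections in order; the first one is the initial reference."""
    merged = list(collections[0])
    for collection in collections[1:]:
        merged = overlay(merged, collection, five_tol, three_tol)
    return merged


ref = Transcript("chr1", "+", 1000, 5000, ((1200, 1500),))
dup = Transcript("chr1", "+", 1003, 5100, ((1200, 1500),))
new = Transcript("chr1", "+", 1003, 5100, ((1200, 1600),))
merged = meta_assemble([[ref], [dup, new]], five_tol=5, three_tol=200)
print(len(merged))
```

In this sketch each query is compared only against the current reference set, so the order in which collections are supplied determines which model of a redundant pair is kept.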


2 Benchmarking Performance of TIEScore

[Supplementary Fig. 2 panels. a, Fraction of regions covered by DNase-seq versus relative position from peak of CAGE clusters (bp), for all CAGE clusters, gold standard non-TSS regions, and gold standard TSS regions with various degrees of ChromHMM state support (relaxed to strict). b, Precision versus recall for the three classifiers (TIEScore, CAGE read count, RNA-seq read count), with mean PR-AUC. c, F-measure versus classifier cutoff (TIEScore cutoff, log2 CAGE read count cutoff, log2 RNA-seq read count cutoff). d, False discovery rate versus classifier cutoff. e, TSS identified at false discovery rate of .5 by the three classifiers, including CAGE clusters below the CAGE read count cutoff that were rescued by TIEScore.]

Supplementary Fig. 2 Benchmarking Performance of TIEScore. Seventy samples with matched CAGE and RNA-seq libraries were used for TIEScore benchmarking (Supplementary Table). a, Coverage of gold standard TSS regions. All CAGE clusters (1st column) refer to all CAGE clusters in raw FANTOM CAT. Gold standard non-TSS regions (2nd column) and TSS regions (3rd column) were defined using chromatin states among 27 epigenome datasets from Roadmap Epigenomics Consortium 7 and CAGE clusters identified in FANTOM5 8. Gold standard TSS regions were defined at various degrees of chromatin state support (from relaxed to strict, 3rd to last column). Y-axis: fraction of corresponding regions covered by DNase-seq peaks. Line and shaded area: signal summarized per window of 2bp; the solid line and shaded area represent the median and quartiles of 27 epigenome datasets at the corresponding window. In b, c and d, benchmarking of the performance of TIEScore, CAGE read count and RNA-seq read count was repeated with sets of gold standard TSS regions defined at various levels of stringency as in a, with the same color scale. b, Precision and recall curve (PR curve). The area under the PR curve (PR-AUC) 3 is used as a measure of classifier performance in identifying genuine TSS. The PR-AUC 3 of TIEScore is significantly higher than that of the other 2 classifiers across various stringencies of gold standard TSS definition (P<.5, paired Student's t-test). In c and d, vertical dashed lines: classifier cutoffs with a false discovery rate (FDR) of .5 at gold standard TSS>5 chromatin state support. c, F-measure versus classifier cutoffs. F-measure 3 was calculated as the harmonic mean of precision and recall and is used as a measure of classifier accuracy in identifying genuine TSS. At FDR of .5 (vertical dashed lines), TIEScore outperforms (higher F-measure values) the other two classifiers. d, FDR versus classifier cutoffs. Intersections of the curve and horizontal dashed lines refer to classifier cutoffs for achieving FDR of .5 (vertical dashed lines). e, Overlap of TSS identified by the three classifiers at cutoffs for achieving FDR of .5.

WWW.NATURE.COM/NATURE

2.1 Definition of gold standard TSS and non-TSS regions

To assess how well TIEScore identifies genuine TSS compared to using CAGE or RNA-seq derived information alone, we first defined sets of gold standard TSS and non-TSS regions using epigenome datasets from Roadmap Epigenomics Consortium 7. Firstly, chromatin states TssA (Active TSS), Enh (Enhancers), TssBiv (Bivalent/Poised TSS) and EnhBiv (Bivalent Enhancer) were referred to as pro-TSS-states; and chromatin states Tx (Strong transcription), Het (Heterochromatin) and Quies (Quiescent/Low) were referred to as non-TSS-states. A CAGE cluster was defined as a gold standard TSS region when its prominent TSS overlaps a pro-TSS-state in N of 27 epigenomic datasets 7 and overlaps no non-TSS-states, where N reflects the stringency of the gold standard definition. Conversely, a CAGE cluster was defined as a gold standard non-TSS region when its prominent TSS overlaps a non-TSS-state in at least 2 of 27 epigenome datasets 7 and overlaps no pro-TSS-states. These gold standard regions were then used to assess TIEScore performance in distinguishing TSS from non-TSS regions (Supplementary Fig. 2a).

2.2 Performance of TIEScore versus CAGE or RNA-seq data alone

Using 70 samples with matched CAGE and RNA-seq libraries (Supplementary Table 2), the performance of TIEScore in identifying genuine TSS was compared against that of CAGE or RNA-seq data alone (Supplementary Fig. 2). First, TIEScore was applied to a subset of CAGE clusters with at least 3 reads (sum among 70 samples) and the FANTOM5 RNA-seq assembly. A set of universal TSS regions was then generated across the three sets of TSS (i.e. combined, and CAGE or RNA-seq alone) by merging transcript 5′ ends in the FANTOM5 RNA-seq assembly within ±nt into RNA-seq TSS regions, and then by merging these RNA-seq TSS regions with CAGE clusters within ±nt.
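As a rough illustration of the gold standard definition in section 2.1, the labelling rule can be written as a small function. The function name `label_region` and the thresholds `min_pro` and `min_non` are hypothetical, and treating the stringency N as an "at least" threshold is an assumption of this sketch rather than something the text states.

```python
PRO_TSS_STATES = {"TssA", "Enh", "TssBiv", "EnhBiv"}
NON_TSS_STATES = {"Tx", "Het", "Quies"}


def label_region(states, min_pro, min_non):
    """Label a region from the chromatin state observed at its prominent
    TSS in each epigenome (one state name per epigenome).

    Returns "TSS", "non-TSS" or None when neither definition applies.
    """
    n_pro = sum(state in PRO_TSS_STATES for state in states)
    n_non = sum(state in NON_TSS_STATES for state in states)
    # Assumption: pro-TSS support in at least min_pro epigenomes and
    # no non-TSS state in any epigenome.
    if n_pro >= min_pro and n_non == 0:
        return "TSS"
    # Conversely: non-TSS support in at least min_non epigenomes and
    # no pro-TSS state in any epigenome.
    if n_non >= min_non and n_pro == 0:
        return "non-TSS"
    return None


print(label_region(["TssA", "Enh", "TssAFlnk"], min_pro=2, min_non=2))
```

States outside both sets (such as the flanking state used in the example) count towards neither total, so a region can remain unlabelled and is then simply excluded from the benchmark.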
Gold standard TSS and non-TSS regions for these universal TSS regions were defined using the midpoint of each universal TSS region instead of the prominent TSS of a CAGE cluster. Ten sets of gold standard TSS regions were defined at various stringencies as described above, by stepping N over ten evenly spaced values. For each of the universal TSS regions, its support (i.e. score) from each of the three datasets was then calculated as: 1) combined: maximum TIEScore of all associated CAGE clusters; 2) CAGE only: maximum number of CAGE reads of all associated CAGE clusters; 3) RNA-seq only: maximum number of RNA-seq reads of all associated RNA-seq TSS regions. (Note: each RNA-seq TSS region is represented by the sum of RNA-seq reads of its associated transcripts as estimated in Sailfish 2.) The score performance for each of the three TSS datasets was assessed using R packages ROCR 3 and PRROC 3. Specifically, sensitivity, specificity, precision, recall, FDR and F-measure at various score cutoffs were calculated using ROCR 3, and ROC-AUC and PR-AUC were calculated using PRROC 3.

In terms of precision, recall and F-measure (Supplementary Fig. 2b-c), TIEScore substantially outperformed both the CAGE only and RNA-seq only based approaches. In Supplementary Fig. 2d, the cutoff values of the various classifiers at FDR of .5 are indicated. In Supplementary Fig. 2e, RNA-seq read count identified appreciably fewer TSS than the other two classifiers, due to the high read count cutoff needed to achieve FDR of .5. The numbers of TSS identified based on TIEScore and CAGE read count are comparable. TIEScore rescued 35,83 TSS discarded by the CAGE read count classifier cutoff, implying its ability to identify lowly expressed TSS with acceptable FDR by synergizing information from CAGE and RNA-seq data.
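The per-cutoff measures named above can be illustrated with a short, dependency-free sketch. It is a stand-in written for this note and not the ROCR or PRROC code used in the analysis; the function name `metrics_at_cutoff` and the toy scores and labels are hypothetical.

```python
def metrics_at_cutoff(scores, labels, cutoff):
    """Precision, recall, FDR and F-measure when calling score >= cutoff.

    labels[i] is True for a gold standard TSS region and False for a
    gold standard non-TSS region.
    """
    tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l)
    called = tp + fp
    precision = tp / called if called else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / called if called else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "fdr": fdr, "f_measure": f_measure}


toy_scores = [90, 80, 70, 60, 50, 40]
toy_labels = [True, True, False, True, False, False]
print(metrics_at_cutoff(toy_scores, toy_labels, cutoff=60))
```

The same function can be applied to any of the three scores (TIEScore, CAGE read count or RNA-seq read count), since each is treated simply as a ranking of the universal TSS regions.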

3 TIEScore Cutoffs and TSS Validation Rates

[Supplementary Fig. 3 panels. a, ROC curve (sensitivity versus specificity) marking the robust TIEScore cutoff. b, Accuracy and false discovery rate versus TIEScore, marking the permissive (35.3, FDR=.23), robust (5., FDR=.77) and stringent (6., FDR=.26) cutoffs. c, Cumulative percentage of TSS (%) versus TIEScore. d, TSS validation rate by DHS (%) versus TIEScore, for actual TSS and permutated TSS. e, TSS validation rate by RAMPAGE (%) versus TIEScore, validated by RAMPAGE within X bp.]

Supplementary Fig. 3 TIEScore Cutoffs and TSS Validation Rates. a, Robust cutoff of TIEScore. The robust TIEScore cutoff (TIEScore=5.) was based on the optimal cutoff determined from a ROC curve, with FDR=.77. b, Permissive and stringent cutoffs of TIEScore. c, Cumulative percentage of TSS. 100% refers to all TSS in raw FANTOM CAT. d, FANTOM CAT TSS validation rate by DHS. The percentages of TSS in each TIEScore bin that overlap with DHS were calculated. Permutated TSS refers to randomly sampled genomic positions as control. e, TSS validation by RAMPAGE. The percentages of TSS in each TIEScore bin that could be validated by RAMPAGE at various ranges were calculated and loess-smoothed curves were plotted. In c-e, vertical dashed lines represent the three TIEScore cutoffs as in b.

3.1 Definition of TIEScore cutoffs

Based on a ROC curve (Supplementary Fig. 3a), we derived an optimal TIEScore cutoff (TIEScore=5., referred to as the robust cutoff), with ROC-AUC=.89 and FDR=.77. We also defined two additional cutoffs (Supplementary Fig. 3b), permissive (TIEScore=35.3) and stringent (TIEScore=6.), based on FDR values chosen to be three times more relaxed (FDR=.23) and three times stricter (FDR=.26) than the robust FDR.

3.2 Properties at various TIEScore cutoffs

At the robust cutoff, 6.9% of all TSS were retained (Supplementary Fig. 3c). We observed high correlations between TIEScore and the TSS validation rate by DHS (Pearson's r=.9, Methods, Supplementary Fig. 3d) and by RAMPAGE (mean Pearson's r=.95, Methods, Supplementary Fig. 3e), suggesting that TIEScore quantitatively identifies genuine TSS.
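The relationship between the three cutoffs can be sketched as a search for the lowest score cutoff that meets a target FDR, with the permissive and stringent targets set to three times and one third of the robust FDR. This is an illustrative sketch only: the function names, the toy data and the `robust_fdr` value in the example are hypothetical, and the robust cutoff itself was derived from the ROC curve rather than by this search.

```python
def fdr_at_cutoff(scores, labels, cutoff):
    """FDR among regions called at score >= cutoff (None if none called)."""
    called = [l for s, l in zip(scores, labels) if s >= cutoff]
    if not called:
        return None
    return sum(1 for l in called if not l) / len(called)


def cutoff_for_fdr(scores, labels, target_fdr):
    """Lowest observed score usable as a cutoff with FDR <= target_fdr."""
    best = None
    for cutoff in sorted(set(scores), reverse=True):
        fdr = fdr_at_cutoff(scores, labels, cutoff)
        if fdr is not None and fdr <= target_fdr:
            best = cutoff  # keep descending; a lower cutoff may also pass
    return best


def relaxed_and_strict_targets(robust_fdr, factor=3.0):
    """FDR targets that are `factor` times more relaxed and more strict."""
    return robust_fdr * factor, robust_fdr / factor


toy_scores = [90, 80, 70, 60, 50, 40]
toy_labels = [True, True, False, True, False, False]
permissive_fdr, stringent_fdr = relaxed_and_strict_targets(robust_fdr=0.25)
print(cutoff_for_fdr(toy_scores, toy_labels, 0.25))
```

Because FDR is not guaranteed to fall monotonically as the cutoff rises, the search scans every observed score instead of stopping at the first failure.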

4 Definition of Genes, Coding Potential and Comparison with GENCODEv

[Supplementary Fig. 4 panels. a-c, FANTOM CAT genes at the stringent, robust and permissive cutoffs: percentage of genes (%) in each gene class by DHS support class (promoter DHS, enhancer DHS, dyadic DHS, no DHS), and number of genes (count) in each gene class. Gene classes: protein coding mRNAs; lncRNA, antisense; lncRNA, divergent; lncRNA, intergenic; lncRNA, sense intronic; pseudogenes; sense overlap RNAs; short ncRNAs; small RNAs; structural RNAs; uncertain coding. d, Number of GENCODEv genes (lncRNAs, mRNAs, pseudogenes) by level of support in FANTOM CAT: above stringent cutoff; between robust and stringent cutoffs; between permissive and robust cutoffs; below permissive cutoff. e, Number of novel FANTOM CAT genes by gene class (lncRNAs, mRNAs, others). f, GENCODEv transcripts with 5′ end revised by FANTOM CAT (lncRNA, protein coding, pseudogene): counts by extent of revision, and density of relative modified 5′ end position (bp).]

Supplementary Fig. 4 Definition of FANTOM CAT Genes and Comparison with GENCODEv. In a-c, Left column: percentage of genes within each gene class supported by different types of Roadmap regulatory regions 7, based on the overlap of their strongest TSS with Roadmap DHS 7. At all cutoffs, the majority of protein coding mRNAs are supported by promoter DHS. Right column: number of genes within each gene class. a, b and c refer to FANTOM CAT genes at the stringent, robust and permissive TIEScore cutoff, respectively. d, GENCODEv genes in FANTOM CAT. The number of GENCODEv lncRNA, mRNA and pseudogene genes supported at various levels of FANTOM CAT was plotted. e, FANTOM CAT genes novel to GENCODEv. A FANTOM CAT gene is novel to GENCODEv if none of its transcripts is compatible with a GENCODEv transcript. f, Revision of 5′ ends of GENCODEv transcripts. Upper: number of GENCODEv transcripts revised by various extents. Total refers to the total number of GENCODEv transcripts. X bp refers to the number of GENCODEv transcripts revised by at least X bp. Lower: distributions of the 5′ ends of GENCODEv lncRNA, mRNA and pseudogene transcripts relative to their compatible FANTOM CAT transcripts (i.e. relative modified 5′ end position) were plotted. The absolute mean within each class was indicated.

4.1 Definition of FANTOM CAT genes

FANTOM CAT genes were defined based on clustering of transcript models in raw FANTOM CAT (after reducing complexity) using a custom perl script. Specifically, non-GENCODEv9 transcripts with exon boundaries within ±5bp of those of a GENCODEv9 transcript were first assigned to the corresponding GENCODEv9 genes. Single exon non-GENCODEv9 transcripts within ±5bp of a GENCODEv9 exon were also assigned to the corresponding GENCODEv9 gene. The non-GENCODEv9 transcripts spanning multiple GENCODEv9 genes were defined as chimeric transcripts and removed (as likely assembly artifacts). The remaining unassigned non-GENCODEv9 transcripts with exon boundaries within ±5bp of each other were then recursively clustered, and all these resulting transcript clusters were referred to as novel genes outside GENCODEv9.
Applying various TIEScore cutoffs to raw FANTOM CAT, we defined 2,25 permissive, 59, robust and 3,52 stringent genes (Supplementary Fig. 4a-c). Gene classes of FANTOM CAT genes assigned to GENCODEv9 protein coding genes, pseudogenes and small RNA genes were directly inherited from their biotypes. Other gene classes in FANTOM CAT were defined as follows.

4.2 Definition of gene classes

Coding potential of all transcripts in raw FANTOM CAT was evaluated using Coding Potential Assessment Tool (CPAT) with default parameters on hg19. Transcripts with CPAT score <.36 and no open reading frame (ORF) ≥3nt (based on getorf 6) are defined as non-coding. A FANTOM CAT gene is defined as non-coding if all its transcripts are non-coding, or its GENCODEv9 biotype is annotated as non-coding 7. A gene is defined as coding when at least 5% of its transcripts are coding and there is at least one transcript with an ORF ≥3nt. Otherwise the gene is classified as coding uncertain. Non-coding genes generating at least one transcript with total exonic length ≥2nt are defined as lncRNA genes, and those <2nt are defined as short ncRNA. LncRNAs are then classified in ascending order as follows: divergent lncRNAs, sense intronic lncRNAs, antisense lncRNAs and intergenic lncRNAs. Divergent lncRNAs: genes with their strongest CAGE cluster within ±2kb on the opposite strand of any CAGE cluster of GENCODEv9 protein coding genes or pseudogenes. Sense intronic lncRNAs: lncRNA genes 1) initiating within an intron of another FANTOM CAT gene, 2) with at least 5% of their genic region overlapping with the genic region of any other gene, 3) with their strongest CAGE cluster not overlapping exons of other genes, and 4) containing CAGE reads; otherwise they are defined as other sense overlap RNA. Antisense lncRNAs: genes with 5% of their genic region overlapping with the genic region of GENCODEv9 protein coding genes or pseudogenes on the opposite strand.
Intergenic lncRNAs: remaining lncRNA genes that could not be assigned to any of the above categories.

4.3 Comparison between FANTOM CAT and GENCODEv

A GENCODEv gene is supported by FANTOM CAT if at least one of its transcripts is compatible with a FANTOM CAT transcript. A GENCODEv transcript is defined as compatible with a FANTOM CAT transcript if all of its exon boundaries are within ±5bp of those of the FANTOM CAT transcript, and vice versa. A FANTOM CAT gene is defined as novel if none of its transcripts is compatible with a GENCODEv transcript. About 33.86% (5,3 of ,8) of GENCODEv lncRNA genes (grey stack, Supplementary Fig. 4d) are below the permissive TIEScore cutoff. At the robust cutoff, FANTOM CAT covered 8,269 of 2,32 protein-coding genes in GENCODEv (Supplementary Fig. 4d). The robust FANTOM CAT covered 6,99 of ,8 lncRNA genes in GENCODEv (Supplementary Fig. 4d) and added 9,723 novel lncRNA genes outside of GENCODEv (Supplementary Fig. 4e). In addition, the 5′ ends of 6,9 lncRNA transcripts in GENCODEv were revised using FANTOM CAT models, which improved their accuracy by a mean of 35.9bp (Supplementary Fig. 4f).

4.4 Annotation of open reading frames in FANTOM CAT

As some previously annotated putative lncRNA transcripts have been shown to associate with ribosomes and thus may encode short peptides 8, we further annotated the coding potential of open reading frames; these annotations were not used for defining the coding status of a FANTOM CAT gene. Coordinates of ORFs on all FANTOM CAT transcripts (n=86,58) were extracted using getorf 6, with a minimum of 3nt and requiring both start and stop codons. Only the top 5 longest ORFs per transcript were retained, which yielded 3,6,858 non-redundant ORFs. Coordinates of ORFs were converted from transcript level to genomic level. A pre-computed 6-way whole genome alignment 9 of hg19 was downloaded from UCSC Genome Browser 2.
Only 27 species were retained in the alignment based on its intersection with the 58-mammal phylogeny used in PhyloCSF 2, including Alpaca, Armadillo, Baboon, Bushbaby, Cat, Chimp, Cow, Dog, Dolphin, Elephant, Gorilla, Guinea pig, Hedgehog, Horse, Human, Marmoset, Megabat, Microbat, Mouse, Orangutan, Pika, Rabbit, Rat, Rhesus, Shrew, Squirrel and Tenrec. Interspecies alignments of ORFs were extracted using mafsInRegion from UCSC Genome Browser 2. Coding potential of these ORFs was assessed using PhyloCSF 2 and RNAcode 22 with default settings, based on phylogenetic information from the interspecies alignments. We further identified potentially translated small ORFs (sORF) based on ribosome profiling data in sorfs.org 23. An ORF is defined as coding if it passes the score threshold in PhyloCSF 2 or the P-value threshold in RNAcode 22, or if it matches an sORF with a good FLOSS score as defined in sorfs.org 23. The annotations of all ORFs in FANTOM CAT are available online.

Note: Of the 27,919 lncRNA genes in robust FANTOM CAT, 7,722 were annotated as lncRNAs in GENCODEv9 2, and the remaining 20,197 were non-GENCODEv9 lncRNA genes in which all transcripts lacked coding potential (CPAT score <.36 and no ORF ≥3nt). Of these 20,197 non-GENCODEv9 lncRNA genes, only ,87 had putative ORFs with additional evidence of coding potential (in either PhyloCSF 2, RNAcode 22 or sorfs.org 23). We provide these annotations on the web resource, yet we still consider these genes as genuine lncRNAs for further analysis.
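The gene-level coding rule from section 4.2 can be summarised in a small function. This is a sketch under stated assumptions rather than the custom script used for FANTOM CAT: the function names are hypothetical, a transcript is treated as coding whenever it is not non-coding, and the threshold values in the example (`cpat_cutoff`, `min_orf_nt`, `min_coding_fraction`) are placeholders.

```python
def transcript_is_noncoding(cpat_score, longest_orf_nt, cpat_cutoff, min_orf_nt):
    """Non-coding: low CPAT score and no ORF of at least min_orf_nt."""
    return cpat_score < cpat_cutoff and longest_orf_nt < min_orf_nt


def gene_coding_status(transcripts, cpat_cutoff, min_orf_nt,
                       min_coding_fraction, annotated_noncoding=False):
    """Classify a gene from (cpat_score, longest_orf_nt) per transcript.

    annotated_noncoding marks genes whose reference biotype is already
    annotated as non-coding.
    """
    noncoding = [transcript_is_noncoding(score, orf, cpat_cutoff, min_orf_nt)
                 for score, orf in transcripts]
    if annotated_noncoding or all(noncoding):
        return "non-coding"
    # Assumption: a transcript that is not non-coding counts as coding.
    coding_fraction = noncoding.count(False) / len(noncoding)
    has_long_orf = any(orf >= min_orf_nt for _, orf in transcripts)
    if coding_fraction >= min_coding_fraction and has_long_orf:
        return "coding"
    return "uncertain coding"


# Placeholder thresholds for illustration only.
params = dict(cpat_cutoff=0.4, min_orf_nt=300, min_coding_fraction=0.5)
print(gene_coding_status([(0.9, 900), (0.1, 100)], **params))
```

Keeping the three outcomes explicit makes it clear that "uncertain coding" is a residual class: it collects genes with some coding evidence that fails either the fraction or the ORF-length condition.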

5 Directionality, Exosome Sensitivity and Transcript Properties

[Supplementary Fig. 5 panels. a, Schematic of directionality = (read_SS − read_AS) / (read_SS + read_AS), showing the sense CAGE TSS (read_SS), the antisense CAGE TSS (read_AS), the CAGE cluster in question, miscellaneous antisense TSS and the region for read counting, by distance from the sense TSS (bp). b-e, For mRNA, divergent p-lncRNA, intergenic p-lncRNA and e-lncRNA, plotted against directionality (strand-bias of CAGE signal): b, percentage of genes; c, exosome sensitivity; d, genomic span (log2 kb); e, splicing index.]

Supplementary Fig. 5 Directionality, Exosome Sensitivity and Transcript Properties. a, Definition of directionality. Miscellaneous refers to any antisense TSS (if any). Directionality of a CAGE cluster is defined as (read_SS − read_AS)/(read_SS + read_AS), where read_SS and read_AS are the numbers of CAGE read counts (across ,897 FANTOM5 samples) within 8 to +2nt of its prominent TSS on the sense and antisense strand, respectively. b, Percentages of genes within each category for each bin of directionality. c, Exosome sensitivity is measured as the relative fraction of CAGE signal observed after exosome knockdown in HeLa-S3 cells. d, Genomic span, defined as the length between the 5′-most and 3′-most base of a gene. e, Splicing index, calculated as the average of the sum of the ratio of junction spanning reads to spliced reads of each of its introns across 7 RNA-seq libraries. The box plots show the median, quartiles and Tukey whiskers of measurements for genes at each bin of directionality.

5.1 Rationale

Transcription initiation is intrinsically bidirectional 26 (Supplementary Fig. 5a). Functionally distinct RNA species were previously categorized by their transcriptional directionality (strand-bias of transcription initiation, referred to as directionality) and by exosome sensitivity (sensitivity to the ribonucleolytic RNA exosome complex). For each lncRNA category we examined the relationship between these features and the length and splicing of the observed transcripts (Supplementary Fig. 5b-e).

5.2 Calculation of directionality, splicing index, genomic span and exosome sensitivity

Directionality of a given CAGE cluster is measured as the bias of CAGE signal on the sense versus antisense strand (relative to the TSS), as previously described, with the following modifications. A zero value implies perfectly balanced CAGE signal on both strands, and values of +1 and −1 imply strongly biased CAGE signal towards the sense or antisense strand, respectively. Directionality of a CAGE cluster is defined as follows: (read_SS − read_AS)/(read_SS + read_AS), where read_SS and read_AS are the numbers of read counts from all FANTOM5 CAGE datasets (n=,897, Supplementary Table) within 8 to +2nt of its prominent TSS on the sense and antisense strand, respectively (Supplementary Fig. 5a). Directionality of a gene is defined by the directionality of its strongest CAGE cluster. Splicing index of a gene is defined as the average of the sum of the ratio of junction spanning reads to spliced reads of each of its introns across 7 RNA-seq libraries (libraries mentioned in Methods). Genomic span of a gene is defined as the length between its 5′-most and 3′-most genomic location. Exosome sensitivity of a CAGE cluster is measured as the relative fraction of CAGE signal observed after exosome knockdown in HeLa-S3 cells, as previously described.
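The directionality measure defined above is a normalized difference of sense and antisense read counts, which can be written directly as a function. The function name and the choice to return `None` when no reads are observed are assumptions of this sketch.

```python
def directionality(sense_reads, antisense_reads):
    """Strand bias of CAGE signal around a prominent TSS.

    Returns a value between -1 (all antisense) and +1 (all sense),
    or None when there are no reads on either strand.
    """
    total = sense_reads + antisense_reads
    if total == 0:
        return None  # undefined without any CAGE signal
    return (sense_reads - antisense_reads) / total


print(directionality(90, 10))
print(directionality(5, 5))
```

A gene-level value would then be taken from the strongest CAGE cluster of the gene, as stated in the text.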
5.2 Directionality p-lncRNAs
Consistent with previous findings, transcription initiation regions (TIRs) that are predominantly transcribed from the sense strand (directionality +, in all four categories, Supplementary Fig. 5b) produce RNAs that are less exosome sensitive (Supplementary Fig. 5c), generally longer (Supplementary Fig. 5d) and more spliced (Supplementary Fig. 5e). Therefore, the directionality of a TIR somewhat reflects the properties of the RNAs it produces. As expected, TIRs of mRNAs predominantly generate sense transcripts (directionality +, red) and are only mildly exosome sensitive, whereas p-lncRNAs generated from divergent antisense transcription from these mRNA promoters are PROMPT-like 27 (directionality, purple), largely exosome sensitive, short and rarely spliced. In contrast, intergenic p-lncRNA TIRs are mostly biased towards generating sense transcripts (directionality >, blue). In addition, ~% of intergenic p-lncRNAs predominantly generating sense transcripts (directionality +, blue), such as MALAT 28, are relatively resistant to exosome degradation.

5.3 Directionality e-lncRNAs
About 76% of e-lncRNAs arose from balanced bidirectional TIRs (directionality.8 and +.8, green), while about 22% were generated from unidirectional TIRs (directionality >+.8, green). These unidirectional TIR-derived e-lncRNAs are generally less exosome sensitive, longer and more spliced. This study therefore expands our previous catalog of bidirectionally transcribed enhancer regions 8 by including unidirectional e-lncRNAs. These include previously identified functional e-lncRNAs such as CCAT, which is unidirectionally transcribed from a super-enhancer, promotes long-range chromatin looping and regulates MYC transcription 29.
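The split between balanced bidirectional and unidirectional TIRs described above can be sketched as a simple binning of directionality values. The cutoff appears garbled in this text (".8"); reading it as 0.8 is an assumption, so it is exposed as a parameter. The label names and the handling of values below the negative cutoff are likewise illustrative only.

```python
def classify_tir(directionality: float, threshold: float = 0.8) -> str:
    """Bin a TIR by its directionality value.

    `threshold` defaults to 0.8, an assumed reading of the cutoff in the text.
    Values within [-threshold, +threshold] are treated as balanced.
    """
    if directionality > threshold:
        return "unidirectional_sense"
    if directionality < -threshold:
        return "unidirectional_antisense"
    return "balanced_bidirectional"
```

For example, under the assumed cutoff a TIR with directionality 0.95 would fall in the unidirectional sense bin, while one with directionality 0.1 would count as balanced.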


6 Sequence Features at LncRNA TSS Supported by Different Types

[Figure: Supplementary Fig. 6, panels a-c. Columns: lncRNA sense intronic, lncRNA intergenic, lncRNA divergent, lncRNA antisense and random controls (unannot. regions, whole genome), each split by type of support (dyadic, enhancer, promoter, no). Rows: DNase, H3K27ac, H3Kme, H3Kme3 (fraction); RS score (mean); CpG island, PAS, 5' SS (fraction); TATA box, INR motif (fraction). x-axis: relative position from TSS (bp).]

Supplementary Fig. 6 Sequence Features at LncRNA TSS Supported by Different Types. Black dashed lines: whole genome background. Solid lines: TSS of lncRNA genes and control regions randomly sampled from the whole genome (whole genome) or from the unannotated portion of the genome (unannot. regions). Colors of lines correspond to the type of support as indicated on top. a, Epigenomic features surrounding TSS. Distributions of Roadmap DNase-seq and ChIP-seq (Methods) were plotted. b, Genomic features surrounding TSS. Distributions of rejected substitution (RS) score 3, CpG island, polyadenylation site signal (PAS) and 5' splicing site (5' SS) were plotted. c, Core promoter motifs. Distributions of TATA box and initiator (INR) motif were plotted.

6.1 Genuineness of TSS of lncRNA gene classes supported by different types
Despite the relatively weaker selective constraints on the TIRs of e-lncRNAs and intergenic p-lncRNAs as measured using RS scores 3, we observed an enrichment of sequence features associated with regulated transcription initiation, including TATA-box, INR and CpG island (absent for e-lncRNAs), as well as signals conducive to generating long transcripts, including 5' SS enrichment and PAS depletion. These sequence features suggest that at least a subset of these TIRs have undergone selection for both transcription initiation and elongation. For the remaining lncRNAs lacking support (grey), we still observed noticeable enrichment of 5' SS, INR and TATA-box motifs and depletion of PAS, suggesting that a subset of these TSS is likely to be genuine despite the lack of support.

B Online Resources

All resources mentioned below are available at

1 Assembly, Expression Atlas and Other Resources

1.1 Assembly: Cutoffs and identifiers
We provide the FANTOM CAT assembly at cutoffs (as *.gtf):
  Raw: without TIEScore cutoff, FDR=.285, contains % of all TSS
  Permissive: TIEScore 35.3, FDR=.23, contains 88.6% of all TSS
  Robust: [optimal, recommended] TIEScore 5., FDR=.77, contains 6.9% of all TSS
  Stringent: TIEScore 6., FDR=.26, contains 9.% of all TSS

At each of the TIEScore cutoffs, the assembly consists of 3 types of basic items (as *.bed): 1) genes, 2) transcripts and 3) CAGE clusters. Each of these items has a unique identifier and associates with items of the other types as specified in the ID mapping tables (as *.txt):
  ONE transcript associates with ONLY ONE CAGE cluster;
  ONE transcript associates with ONLY ONE gene;
  ONE CAGE cluster can associate with multiple transcripts;
  ONE CAGE cluster can therefore associate with multiple genes;

Gene identifiers (mainly inherited from GENCODEv9):
  ENSG*.*: inherited from GENCODEv9, e.g. ENSG577.9;
  CATG*.*: novel genes from FANTOM CAT, e.g. CATG557.;

Transcript identifiers (mainly inherited from GENCODEv9):
  ENST*.*: inherited from GENCODEv9 7, e.g. ENST57292.;
  ENCT*.*: novel transcripts from ENCODE 6 transcript models, e.g. ENCT868.;
  HBMT*.*: novel transcripts from Human BodyMap v2. transcript models, e.g. HBMT386.;
  MICT*.*: novel transcripts from miTranscriptome 5 transcript models, e.g. MICT969.;
  FTMT*.*: novel transcripts from FANTOM5 RNA-seq transcript models, e.g. FTMT96.;

CAGE cluster identifiers (ALL inherited from FANTOM5 DPI CAGE clusters):
  chr*:*..*,*: as chromosome:start..end,strand, inherited from FANTOM5, e.g. chr: ,+;

1.2 Expression Atlas
At each of the TIEScore cutoffs, we provide the following expression tables (,897 FANTOM5 CAGE libraries):
  tag count per CAGE cluster: CAGE tag count within the flanking region (±5nt) of the CAGE cluster;
  tag count per gene: sum of CAGE tags of the CAGE clusters associated with the gene;
  rle cpm per CAGE cluster: rle normalized (by edgeR 3 ) cpm (counts per million) of the CAGE cluster;
  rle cpm per gene: sum of CAGE tags of the CAGE clusters associated with the gene;

1.3 Other Resources
We also provide the following resources for FANTOM CAT at the robust TIEScore cutoff:
  Differential expression: differential expression analysis (edgeR 3 ) of manually curated series;
  Sample ontology association: gene-based association with FANTOM5 sample ontology terms;
  Trait association: gene-based association with traits in GWASdb 32 and PICS 33 ;
  Expression specificity: gene-based expression levels and specificity in primary cell facets;
  Co-expression: co-expression of lncRNA-mRNA pairs linked by GTEx 3 -associated SNPs;
  Selective constraints: RS score and GERP elements 3 at TIR and exon;
  Orthologous TSS activities: orthologous TSS activities in mouse, rat, dog and chicken;
  ORF annotations: coding potential of all ORFs based on PhyloCSF 2, RNACode 22 and sorf.org 23 ;
  Transposable elements: TIR association with transposable elements;
  Directionality: transcriptional directionality of TIR based on,897 FANTOM5 CAGE libraries;
  Exosome sensitivity: exosome sensitivity based on exosome knockdown in HeLa cells;
  Splicing index: gene-based splicing index in 7 RNA-seq libraries;
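The identifier format and the gene-level tag count described above can be sketched as follows. This is a minimal illustration, assuming that cluster identifiers follow the stated chromosome:start..end,strand layout exactly. The function names, the regular expression and the dictionary-based mapping are hypothetical and are not taken from the FANTOM CAT files, and the identifier used in the usage note is made up.

```python
import re
from collections import defaultdict

# Assumed layout: chromosome:start..end,strand (e.g. a made-up "chr1:100..200,+").
_CLUSTER_ID = re.compile(r"^(chr[^:]+):(\d+)\.\.(\d+),([+-])$")


def parse_cluster_id(cluster_id: str):
    """Split a CAGE cluster identifier into (chromosome, start, end, strand)."""
    match = _CLUSTER_ID.match(cluster_id)
    if match is None:
        raise ValueError(f"unrecognized CAGE cluster identifier: {cluster_id!r}")
    chrom, start, end, strand = match.groups()
    return chrom, int(start), int(end), strand


def gene_tag_counts(cluster_counts, cluster_to_genes):
    """Sum per-cluster tag counts into per-gene counts.

    One CAGE cluster may map to several genes, so its count is added
    to every gene it is associated with.
    """
    totals = defaultdict(int)
    for cluster, count in cluster_counts.items():
        for gene in cluster_to_genes.get(cluster, ()):
            totals[gene] += count
    return dict(totals)
```

Calling `parse_cluster_id("chr1:100..200,+")` on that made-up identifier yields the chromosome, the two integer coordinates and the strand as separate fields.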

2 Overview of FANTOM CAT Browser

Supplementary Fig. 7 Landing Page of FANTOM CAT Browser. The FANTOM CAT Browser is hosted at http://fantom.gsc.riken.jp/cat/v/#/. This web application allows users to browse FANTOM CAT genes, view their genomic loci through ZENBU 35, filter by their annotations, intersect them with their associated sample ontologies or traits, and download the relevant data. Alternatively, users can browse by trait or sample ontologies and intersect with their associated genes. Click on the circles below and start browsing Genes, Traits and Ontologies.

[Figure panels: Gene List Page; Gene Individual Page]

Supplementary Fig. 8 Browsing Genes from Gene List and Individual Page. On the Gene List Page, users can select genes using the filters on the left of the list. Click on a GeneID in the list to browse an individual gene. The Gene Individual Page displays basic information on the gene and its associated sample ontologies and traits. Users can navigate between genes, sample ontologies and traits through the links. Alternatively, users can start by browsing the list of sample ontologies and traits and reach their associated genes.


3 Use Case 1: Download a List of Novel e-lncRNAs

[Figure panel: Gene List Page]

Supplementary Fig. 9 Download a List of Novel e-lncRNAs. Steps:
1. Go to the Gene List Page by clicking Genes in the Landing Page;
2. Check novel in the filter Annotations to filter for genes that are not annotated in GENCODEv9;
3. Check e-lncRNA in the filter Category, and this should return 7,232 genes;
4. Choose the data columns to be displayed (and thus downloaded) and click Export visible data as csv;
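The same selection could in principle be reproduced offline on a downloaded gene table. The sketch below is only an illustration: the column names `annotation` and `category`, their values, and the helper names are hypothetical and may not match the columns actually exported by the browser.

```python
import csv
import io


def filter_genes(rows, annotation="novel", category="e-lncRNA"):
    """Keep rows whose (hypothetical) annotation and category columns match."""
    return [
        row for row in rows
        if row["annotation"] == annotation and row["category"] == category
    ]


def to_csv(rows, columns):
    """Serialize the chosen columns of each row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Restricting the output to the chosen columns mirrors step 4, where only the displayed columns are exported.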


4 Use Case 2: Explore LncRNAs with Conserved Exons and Implicated in

[Figure panels: Gene Individual Page; Gene List Page]

Supplementary Fig. Explore LncRNAs with Conserved Exons and Implicated in. Steps:
1. Go to the Gene List Page by clicking Genes in the Landing Page;
2. Check yes in the filter Exon Conservation to filter for genes with conserved exons;
3. Check yes in the filter mRNA Coexpression, and this should return,673 genes;
4. Choose the data columns Exon conservation and mRNA Coexpression to be displayed;
5. Click on the header of the Exon conservation column to sort genes by the level of conservation at their exons;
6. Click on one of the genes on top (i.e. with a highly conserved exon), e.g. ENSG96.;
7. This should reach the individual page for gene ENSG96.;
8. Examine the summary for the gene, which notes that ENSG96. is coexpressed with -linked ENSG26822.;
9. Examine the conservation summary of the gene, which notes that the exon of ENSG96. is conserved;
10. Click on extended to view the ENSG96. locus;
11. Continue on the next page: the extended ZENBU view shows information at the locus;


[Figure panel: ZENBU extended view]

Supplementary Fig. Visualize an LncRNA Implicated in in ZENBU. Steps:
1. Continuing from the end of Supplementary Fig., you should see the ZENBU extended view for ENSG96.;
2. Click settings and, in feature highlight search, input RP-973N3. (gene name of ENSG96.) to highlight the locus of RP-973N3.;
3. Examine the red highlighted gene and transcript of RP-973N3.. To cancel the highlight, repeat step 2 and delete the search string in the setting dialogue box;
4. Click on the track [SNP] linked lncRNA-mRNA pairs, SIGNIFICANT ONLY, -to-gene, which shows connections between SNPs and the target coexpressed mRNA, highlight the regions flanking the SNPs (i.e. the left ends of the green lines) and click the magnifying glass icon to zoom in;
5. In the zoomed-in view, click on the track [SNP] GTEx SNP, v6p, Pe-5, coding genes, highlight the SNPs, and bring up the display panel (at bottom) showing the active tissues of the highlighted SNPs;
6. To zoom out to view the lncRNA (i.e. RP-973N3.) and the linked mRNA (i.e. PLEKHG3), click on one of the connections in the track [SNP] linked lncRNA-mRNA pairs, SIGNIFICANT ONLY, -to-gene, then click range to zoom out;
7. The view now contains the coexpressed lncRNA-mRNA pair linked by the SNP. You can rearrange the tracks by dragging the title bars as desired, e.g. move up the track [SNP] linked lncRNA-mRNA pairs, SIGNIFICANT ONLY, -to-gene;


5 Use Case 3: Explore LncRNAs Enriched in Classical Monocytes

[Figure panel: Ontology List Page]

Supplementary Fig. 2 Explore LncRNAs Enriched in Classical Monocytes. Steps:
1. Go to the Ontology List Page by clicking Ontologies in the Landing Page;
2. Type classical monocytes and press filter;
3. Click on CL:86 to enter the ontology individual page for classical monocytes;
4. Check lncRNA classes in the filter Gene Class to filter for lncRNA genes;
5. Click on the header of the Fold column to sort genes by the level of enrichment in classical monocytes;
6. Click on one of the genes on top (i.e. most enriched in classical monocytes), e.g. CATG78.;
7. Examine the summary for the gene, which notes that CATG78. is associated with classical monocytes;
8. Hover over the enrichment plot to examine the expression of CATG78. in individual samples of classical monocytes;
9. Hover over the box plot to examine the dynamic expression of CATG78. in classical monocytes stimulation;
10. Click on expression to view the CATG78. locus;
11. Continue on the next page: the expression ZENBU view shows expression information at the locus;


[Figure panel: ZENBU expression view]

Supplementary Fig. 3 ZENBU Expression View of an LncRNA Enriched in Classical Monocytes. Steps:
1. Continuing from the end of Supplementary Fig. 2, you should see the ZENBU expression view for CATG78.;
2. Click on CATG78. in the track [Sample Ontology Enrichment] Robust, gene (fold difference) and bring up the experiment panel (at bottom) showing the fold enrichment of CATG78. in classical monocytes;
3. Click on CATG78. in the track [Expression Profile] Robust, gene-based expression (n=829 samples) and bring up the experiment panel (at bottom) to show the expression of CATG78. in,829 FANTOM5 samples;
4. Click on the track [Epigenome] Roadmap, (n= samples) to show which Roadmap tissues are active within the displayed locus;


6 Use Case 4: Explore Cell Types Associated with Crohn's Disease

[Figure panel: Trait List Page]

Supplementary Fig. Explore Cell Types Associated with Crohn's Disease. Steps:
1. Go to the Trait List Page by clicking Traits in the Landing Page;
2. Type crohn and click filter;
3. Click on DOID:8778 to enter the trait individual page for Crohn's Disease GWAS;
4. You should now have reached the trait individual page for Crohn's Disease GWAS;
5. Browse the cell types that are associated with Crohn's Disease in the Ontology Enrichment table, and click on filter to filter for genes that are associated with Crohn's Disease and enriched in Leukocytes; this should return 23 genes;
6. Examine the composition of the filtered genes, and go to the individual gene page for gene-based information;

Supplementary Figures Supplementary Figure 1. Heatmap of GO terms for differentially expressed genes. The terms were hierarchically clustered using the GO term enrichment beta. Darker red, higher positive