Obstacles and challenges in the analysis of microRNA sequencing data (miRNA-seq)

David Humphreys, Genomics Core
Dr Victor Chang AC (1936-1991), Pioneering Cardiothoracic Surgeon and Humanitarian

The ABCs of miRNAs (Annotation, Biogenesis, Curation)
www.mirbase.org
- Mature FASTA file
- Stem-loop FASTA file
- GFF (genome coordinate file)

miRNA-seq applications
Discovery
- Novel miRNAs
- Isoforms
- Biogenesis: iii) non-canonical processing; iv) strand selection; v) length / non-template additions
Quantification
- Differentially expressed miRNAs
- Differential processing
Read length covers the entire mature transcript.

Experimental design
- Sample selection: species, replicates
- RNA extraction
- Library preparation

RNA extraction
                     Liquid   Column   Bead
Prep time            ++       ++++     +++
miRNA purification   +++      ++++     ++++
Recovery             ++++     +++      +++
Limitations/pitfalls:
- Low input miRNA bias. Most susceptible: low GC content, secondary structure.
  Kim et al., (2011) Molecular Cell 43, 1005-1014
  Kim et al., (2012) Molecular Cell 46, 893-895
- Early protocols: no miRNA???
- Small RNA precipitates with longer RNA
[Figure: Ratio 141/200c. Down-regulated miRNAs: 141, 29b, 21, 106b, 15a, 34a. NO change!! Cell number: (L) = 200,000, (H) = 800,000. Low confluence = 500,000 cells; high confluence = 800,000 cells.]

RNA quantification and integrity
seqanswers.com/forums/showthread.php?t=21280
NanoDrop
- Absorbance at 230, 260 and 280 nm
- Can detect salt and other contaminants!
- WARNING: accuracy is poor below 50 ng/ul; be careful with concentrations > 1 ug/ul
Qubit
- Assays specific for DNA/RNA!
- WARNING: known biases in quantifying ssRNA < 50 ng/ul!
Agilent
- Quantitates size
- WARNING: quantification is only accurate in the defined range (read the manual)

Library prep kit comparison
Sample prep (P-miRNA-OH)
- # Input amount
- # pH, buffers/salts/ATP
Adaptor ligation (sequential ligation)
- # Sequence
- # Temperature
- # Incubation times
- Steps: i) hybridisation, ii) ligation, iii) denaturation
RT (reverse transcription)
PCR
- # PCR cycles
# Hafner et al., (2011) RNA 17(9), 1-16

Summary
Sample selection
- Species, replicates
RNA extraction
- Use the same method for all preps
- Quantify (2 methods)
- Assess integrity
Library preparation
- Consistent input
- Consistent ligation conditions (time/temperature)
- Use the same kits

miRNA-seq Bioinformatics (Trim - Align - Report)

Anscombe's Quartet
Maths is a tool for analysis, but relying on summary statistics alone lets you blindly ignore biases and errors in data sets.
- The mean, standard deviation, variance and correlation of the four data sets are (almost exactly) the same!
Image from Wikipedia: https://en.wikipedia.org/wiki/anscombe%27s_quartet
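
The point can be checked numerically. The sketch below uses the published Anscombe (1973) values and only the Python standard library; the helper names `pearson` and `summarise` are illustrative, not part of any pipeline described in these slides.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: data sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ssx * ssy)


def summarise(xs, ys):
    """Summary statistics that hide the very different shapes of the data."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARY = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Plotting the four data sets shows what the summary statistics hide, which is why visual quality control of sequencing data matters.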

Challenges
The length of a sequence read covers the entire microRNA transcript.
Upstream bias will have impacts on analysis:
- Sample preparation
- Library preparation
- Clonal amplification
- Sequencing
- Bioinformatics
Bioinformatics challenges:
- Multi-mappers
- Mismatches
- Aligners
- Feature counting
- Normalisation
- Visualisation
- Differential expression
- Sharing data

Choice of reference?
Genome                                  miRBase stem-loop
Better discovery                        Limited discovery
Possible incorrect/loss of mappings     Forced (biased) mapping
Slower, computationally restrictive?    Faster, less complicated

Multi-mappers (1)
miRBase does NOT ACCURATELY report the number of times a read aligns to the genome.
Multi-loci miRBase entries provide some information.
[Figure: Human multi-mappers. Number of miRs (y axis) against number of mapped locations (x axis, 0 to > 100); mir-486 highlighted as an example.]
# Human miRBase entries mapped using the bowtie aligner, allowing all multi-mappers

Multi-mappers (2)
The multi-mapping rate increases as read length decreases.
What should the minimum miRNA read length be? The shortest length in miRBase is 17 nt!
mir-133 family:
mir-133a-1-3p  uuugguccccuucaaccagcug
mir-133a-2-3p  uuugguccccuucaaccagcug
mir-133b-1-3p  uuugguccccuucaaccagcua
Where do you assign multi-loci counts?
- Assign to each position?
- Assign a fraction to each position?
- Intelligently assign to a position?
- Ignore?
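
The assignment options can be compared on a toy example. This is a minimal sketch, not the method used by any particular tool: the function name `assign_counts`, the strategy labels and the read counts are all hypothetical, and "weighted" is only one possible reading of "intelligently assign" (multi-mapped reads are split in proportion to the uniquely mapped reads at each locus).

```python
from collections import defaultdict


def assign_counts(read_hits, strategy="fraction"):
    """Turn per-read lists of aligned loci into per-locus counts.

    read_hits: one list of locus names per read.
    strategy:  "each"     -> every locus gets a full count
               "fraction" -> each locus gets 1/n of the read
               "ignore"   -> multi-mapped reads are discarded
               "weighted" -> split by uniquely mapped evidence (assumption)
    """
    if strategy not in {"each", "fraction", "ignore", "weighted"}:
        raise ValueError(f"unknown strategy: {strategy}")

    # Unique evidence per locus, needed for the "weighted" strategy.
    unique = defaultdict(float)
    for loci in read_hits:
        if len(loci) == 1:
            unique[loci[0]] += 1.0

    counts = defaultdict(float)
    for loci in read_hits:
        if not loci:
            continue  # unaligned read
        if len(loci) == 1:
            counts[loci[0]] += 1.0
        elif strategy == "each":
            for locus in loci:
                counts[locus] += 1.0
        elif strategy == "fraction":
            for locus in loci:
                counts[locus] += 1.0 / len(loci)
        elif strategy == "weighted":
            total = sum(unique[locus] for locus in loci)
            for locus in loci:
                # Fall back to an equal split when no locus has unique reads.
                share = unique[locus] / total if total else 1.0 / len(loci)
                counts[locus] += share
        # strategy == "ignore": multi-mapped reads contribute nothing
    return dict(counts)


# Hypothetical reads: 4 map to both mir-133a loci, 1 is unique to
# mir-133a-1, and 3 are unique to mir-133b.
READS = ([["mir-133a-1", "mir-133a-2"]] * 4
         + [["mir-133a-1"]]
         + [["mir-133b"]] * 3)

for name in ("each", "fraction", "ignore", "weighted"):
    print(name, assign_counts(READS, name))
```

The four strategies give different totals for the same reads, so the choice should be stated alongside any reported counts.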

Mismatches
Sequencing / variants
i) Error in library prep
ii) Variants in reference genome
iii) Sequencer
RNA editing
Type         Enzyme   Comment
A to I (G)   ADAR     Predominantly on pre-miRs
C to T       Apobec   Not identified yet?
Ohanian et al., (2013) BMC Genetics, 14:18
Chawla et al., (2014) Nucleic Acids Research, 42 (8): 5245-5255
Tomaselli et al., (2013) Int. J. Mol. Sci. 14, 22796-22816

Aligners
(Too) many choices. Each aligner has a wide array of options with DIFFERENT default settings.
The bowtie aligner provides error rate and multi-mapping control:
    bowtie -p 4 -n 1 -l 21 --nomaqround -k 10 --best --strata --chunkmbs 256
- -n 1 -l 21: allow 1 mismatch in a length of 21 nt
- -k 10: report up to 10 multi-mappers
Fastq calibration dataset, available for ALL species present in miRBase. Features include:
i) Each header defines the miRBase mapping location
ii) Contains all miRBase entries with all single nucleotide mismatches
Example aligned read (header fields: miRNA ID, mapping location #1, mapping location #2):
hsa-let-7f-5p_m_chr9_94176353_94176374_+#chrx_53557246_53557267_- 0 chr9 94176353 255 22M * 0 0 TGAGGTAGTAGATTGTATAGTT
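
Because each calibration read carries its expected locations in the header, a pipeline can be scored automatically. The sketch below assumes the header layout seen in the single example above (miRNA ID, then `_m_`, then `#`-separated `chrom_start_end_strand` fields) and assumes the header start coordinate uses the same convention as the SAM POS field; the names `Locus`, `parse_header` and `alignment_is_expected` are hypothetical.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_header(header):
    """Split a calibration read name into (miRNA ID, expected loci)."""
    mirna_id, sep, loci_part = header.partition("_m_")
    if not sep:
        raise ValueError(f"no '_m_' separator in header: {header}")
    loci = []
    for field in loci_part.split("#"):
        # rsplit from the right so chromosome names may contain underscores.
        chrom, start, end, strand = field.rsplit("_", 3)
        loci.append(Locus(chrom, int(start), int(end), strand))
    return mirna_id, loci


def alignment_is_expected(sam_line):
    """True if the reported alignment matches one of the header's loci."""
    fields = sam_line.split()
    header, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    strand = "-" if flag & 16 else "+"  # SAM flag bit 0x10 = reverse strand
    _, loci = parse_header(header)
    return any(
        locus.chrom == chrom and locus.start == pos and locus.strand == strand
        for locus in loci
    )


# The example alignment from the slide.
SAM_LINE = ("hsa-let-7f-5p_m_chr9_94176353_94176374_+#chrx_53557246_53557267_- "
            "0 chr9 94176353 255 22M * 0 0 TGAGGTAGTAGATTGTATAGTT")

print(parse_header(SAM_LINE.split()[0]))
print(alignment_is_expected(SAM_LINE))
```

Running every aligned calibration read through a check like this gives the fraction of reads placed at an expected location for a given aligner and parameter set.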

Non-template additions (NTA)
i) Adenylation: <miRNA seq> + (A)n
ii) Uridylation: <miRNA seq> + (T)n
Aligners tend to soft-clip 3' mismatches!!
DETECTION METHODS (Koppers-Lalic et al., (2014), Cell Reports 8, 1649-1658):
- Remove adaptor
- Hard trim (18 nt)
- Extend alignment
- Look for mismatch clusters at the end of the read
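
The detection steps can be illustrated on adaptor-trimmed reads. This is a simplified sketch, not the published method: it assumes the read's 5' end is already placed on the template, anchors on the first 18 nt, extends while the read matches, and classifies whatever is left at the 3' end. The function name `classify_nta` and the downstream flank `GTGG` are hypothetical.

```python
def classify_nta(read, template, anchor=18):
    """Classify the 3' end of an adaptor-trimmed read.

    template: genomic sequence starting at the read's 5' position and
              extending past the annotated mature 3' end.
    Returns (label, tail) where tail is the non-templated 3' sequence.
    """
    read, template = read.upper(), template.upper()

    # Hard-trim step: the first `anchor` nt must match the template exactly.
    if len(read) < anchor or read[:anchor] != template[:anchor]:
        return ("unanchored", "")

    # Extension step: walk 3' while read and template agree.
    i = anchor
    while i < len(read) and i < len(template) and read[i] == template[i]:
        i += 1

    tail = read[i:]
    if not tail:
        return ("templated", "")
    if set(tail) == {"A"}:
        return ("adenylation", tail)
    if set(tail) == {"T"}:
        return ("uridylation", tail)
    return ("other", tail)  # mixed tail or an internal mismatch


MATURE = "TGAGGTAGTAGATTGTATAGTT"  # let-7f-5p sequence from the example read
FLANK = "GTGG"                     # hypothetical downstream genomic bases
TEMPLATE = MATURE + FLANK

for read in (MATURE, MATURE + "AA", MATURE + "T", MATURE + "G", MATURE + "AC"):
    print(read, classify_nta(read, TEMPLATE))
```

A tail whose first base happens to match the genome is counted as templated here, which is an inherent ambiguity of any template-based NTA call.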

Assigning miRNA counts
Mature miRNA analysis
i) 5' isomiRs
ii) 3' isomiRs
iii) Non-canonical
iv) Arm switching
v) Length
vi) Editing
Cistronic analysis
Humphreys et al., 2013, NAR
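
As an illustration of the first two categories, a read can be classified by comparing its aligned coordinates on the stem-loop with the annotated mature coordinates. This is a minimal sketch: the function name `classify_isomir` and all coordinates are hypothetical, and it deliberately ignores editing, arm switching and non-template additions.

```python
from collections import Counter


def classify_isomir(read_start, read_end, mature_start, mature_end):
    """Compare read and annotated mature coordinates (same coordinate system).

    Returns a dict with the class label, the 5' and 3' offsets
    (read minus annotation) and the read length.
    """
    offset_5p = read_start - mature_start
    offset_3p = read_end - mature_end
    if offset_5p == 0 and offset_3p == 0:
        label = "canonical"
    elif offset_5p != 0 and offset_3p != 0:
        label = "5' and 3' isomiR"
    elif offset_5p != 0:
        label = "5' isomiR"
    else:
        label = "3' isomiR"
    return {
        "class": label,
        "offset_5p": offset_5p,
        "offset_3p": offset_3p,
        "length": read_end - read_start + 1,
    }


# Hypothetical mature annotation at stem-loop positions 10-31 (22 nt),
# and hypothetical (start, end) coordinates of aligned reads.
MATURE_START, MATURE_END = 10, 31
READS = [(10, 31), (10, 31), (10, 32), (11, 31), (9, 30)]

TALLY = Counter(
    classify_isomir(start, end, MATURE_START, MATURE_END)["class"]
    for start, end in READS
)
print(TALLY)
```

5' isomiRs shift the seed sequence, so they are usually worth reporting separately from 3' length variants.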

miRspring
http://mirspring.victorchang.edu.au
Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013.
- Small (< 2 MB) HTML document that reproduces the aligned miRNA sequencing data.
- Needs NO internet connectivity.
- Provides visualization of sequence data + research tools == complete transparency.

Cumulative distribution of miRNA reads
[Figure: cumulative distribution of miRNA reads. Panels: AGO IP, Tissue Atlas, ENCODE. Samples: THP-1; Heart, Kidney, Liver, Lung, Ovary, Spleen, Testes, Thymus, Brain, Placenta; HeLa S3, A549, Ag04450, Bj, Gm1287, H1hesc, HepG2, Huvec, K562, MCF7, NheK, Sknshra.]
73 miRspring documents, 895 million sequence tags, < 55 megabytes of disk space.
In most cell lines and tissues the most abundant miRNA should comprise < 35% of all aligned miRNA sequences.
Sampling bias!
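
The 35% guideline is easy to turn into a quality control check. The sketch below is illustrative only: the function names `cumulative_fractions` and `top_mirna_check` and the count tables are hypothetical, and the 0.35 threshold is simply the figure quoted above.

```python
def cumulative_fractions(counts):
    """Return [(miRNA, cumulative fraction)] ordered from most to least abundant."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no aligned miRNA reads")
    running = 0
    result = []
    for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += n
        result.append((name, running / total))
    return result


def top_mirna_check(counts, threshold=0.35):
    """Return (top miRNA, its fraction of all reads, passes the threshold?)."""
    name, fraction = cumulative_fractions(counts)[0]
    return name, fraction, fraction < threshold


# Hypothetical count tables: one balanced library and one dominated by a
# single miRNA, as might happen with strong sampling bias.
BALANCED = {"miR-a": 300, "miR-b": 250, "miR-c": 250, "miR-d": 200}
SKEWED = {"miR-a": 900, "miR-b": 100}

print(cumulative_fractions(BALANCED))
print(top_mirna_check(BALANCED))
print(top_mirna_check(SKEWED))
```

A library that fails the check is not necessarily wrong, but it is a prompt to look at the upstream sample and library preparation before interpreting differential expression.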

Top 100 miRNAs typically:
- 22 nt long
- Good correlation with miRBase

Conclusions
Many challenges in miRNA-seq analysis:
- Multi-mappers
- Mismatches
Best practices: be methodical
- Know the question you wish to address
- Know your species (reference/miRBase)
- Know your aligner
- Test your pipeline!
- Know what you are missing
- Quality control metrics / visualisation

If you would like a miRBase test data set for any species/reference combination, please don't hesitate to contact me.
d.humphreys@victorchang.edu.au
mirspring.victorchang.edu.au
- Fastq synthetic data sets
- Intelligently assign multi-mappers
- R objects
Joshua Ho, Peter Szot, Catherine Suter, Diane Fatkin
St Vincent's Hospital
Chris Hayward, Kavitha, Andrew Jabbour, Thomas Priess