MSSimulator: Simulation of Mass Spectrometry Data. Chris Bielow, Stephan Aiche, Sandro Andreotti, Knut Reinert. Algorithmic Bioinformatics, Institute for Computer Science, FU Berlin, Germany.
Motivation: Digestion (trypsin). (Figure: Cañas et al., 2006)
Motivation: (gradient length, matrix type, ESI voltage) x (Ion Trap, Orbitrap, FT-ICR, TOF) x (SILAC, MS^E, ICAT, MeCAT, iTRAQ) = a myriad of different MS (or MS/MS) setups, BUT insufficient data for algorithm development. Two ways out: create a diverse database with manual annotation, or simulation.
Outline: Capabilities of MSSimulator; Realism of generated data; Algorithm Benchmarking.
The Big Picture. Inputs to MSSimulator: FASTA (>sp P02586 TNNC2... [ # intensity=120 # ] MTDQQAEARSYLSEEMAAFDMFDADGGGDISVKELGTVMRM); Models & Parameters (e.g., SVM, ); Contaminants ("Methanol,CH3OH,1622.6,..). Simulation stages, in order: Digestion, Separation, Ionization, MS, MS/MS. Outputs: raw data (mzML); feature data: position, charge (featureXML); relation data: labeling pairs, charge pairs (consensusXML).
Digestion: naïve digestion, parameterized by the enzyme and the number of missed cleavages (not site specific).
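A naïve digestion of this kind can be sketched in a few lines of Python. This is an illustration only, not MSSimulator's code: the function names are hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin rule, assumed here rather than taken from the slide.

```python
import re

def tryptic_sites(seq):
    """Return cut positions: after K or R, unless followed by P (assumed rule)."""
    return [m.end() for m in re.finditer(r"[KR](?!P)", seq)]

def digest(seq, max_missed=0):
    """Naive digestion: every peptide spanning at most `max_missed` skipped sites.

    Missed cleavages are not site specific: any site may be skipped.
    """
    # Fragment boundaries: sequence start, internal cut sites, sequence end.
    bounds = [0] + [s for s in tryptic_sites(seq) if s < len(seq)] + [len(seq)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 1 + max_missed, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(seq[bounds[i]:bounds[j]])
    return peptides

# Hypothetical toy sequence; the R before P is not cleaved.
print(digest("MKAARPGGKLL", max_missed=1))
```

With `max_missed=0` the toy sequence yields only the three fully cleaved peptides; each additional allowed missed cleavage adds the longer, overlapping peptides.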
Digestion: naïve digestion, or a trained model (Siepen et al., 2007).
Separation: Capillary Electrophoresis; HPLC via SVR (Pfeifer et al., 2007). (Figures: migration time MT [min] vs. m/z [Th] for CE; retention time RT [min] vs. m/z [Th] for HPLC)
Separation, RT dimension: exponential-Gaussian hybrid (EGH) peak shape (Lan and Jorgenson, 2001): $f(t) = \begin{cases} H \exp\left( \dfrac{-(t - t_R)^2}{2\sigma_g^2 + \tau (t - t_R)} \right), & 2\sigma_g^2 + \tau (t - t_R) > 0 \\ 0, & 2\sigma_g^2 + \tau (t - t_R) \le 0 \end{cases}$
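The EGH elution profile of this slide can be evaluated directly. The sketch below is a plain transcription of the published formula, with H the peak height, t_R the retention time of the apex, sigma_g the Gaussian standard deviation and tau the exponential time constant; the function name and the example parameter values are hypothetical.

```python
import math

def egh(t, height, t_r, sigma_g, tau):
    """Exponential-Gaussian hybrid peak (Lan and Jorgenson, 2001) at time t."""
    denom = 2.0 * sigma_g ** 2 + tau * (t - t_r)
    if denom <= 0.0:
        # Outside the support of the peak the intensity is defined as zero.
        return 0.0
    return height * math.exp(-((t - t_r) ** 2) / denom)

# Hypothetical peak: apex at 100 s, tailing to the right (tau > 0).
for t in (70, 90, 100, 110, 130):
    print(t, round(egh(t, height=1000.0, t_r=100.0, sigma_g=5.0, tau=2.0), 2))
```

A positive tau produces a tail after the apex, a negative tau produces fronting, and tau = 0 reduces the shape to a pure Gaussian.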
Detectability and Ionization. Detectability: thresholded SVC (Schulz-Trieglaff et al., 2008). Ionization: MALDI with fixed charge state probabilities, P(q = 1) = p_{q1}, P(q = 2) = p_{q2}; ESI with a binomial distribution B(n; p). (Figure: simulated vs. real)
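The binomial ESI model B(n; p) can be illustrated as follows. The slide does not define n and p; the sketch assumes that n is the number of chargeable sites of a peptide and p the probability that a single site carries a charge. The function name and the example values are hypothetical.

```python
from math import comb

def esi_charge_distribution(n_sites, p):
    """P(q) for q = 0..n_sites under a binomial model B(n_sites; p).

    Assumption: each of the n_sites chargeable sites is protonated
    independently with probability p.
    """
    return [comb(n_sites, q) * p ** q * (1.0 - p) ** (n_sites - q)
            for q in range(n_sites + 1)]

# Hypothetical peptide with 3 chargeable sites and p = 0.8.
for q, prob in enumerate(esi_charge_distribution(3, 0.8)):
    print(f"P(q = {q}) = {prob:.3f}")
```

Under this assumption, the charge state q = 0 corresponds to a peptide that is never observed in the spectrum.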
MS Signal Components: isotope distribution. (Figures: intensity vs. m/z for MW 500, MW 1000 and MW 2500; peaks labeled 1n, 2n)
MS Signal Components: isotope distribution convolved with a Lorentzian or Gaussian peak shape. (Figure: Lorentzian and Gaussian profiles over m/z)
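Convolving a stick-like isotope pattern with a peak shape amounts to summing one shifted, scaled copy of the shape per isotope peak. The sketch below shows this for both shapes, parameterized by the full width at half maximum (FWHM). All names and the example isotope pattern are hypothetical and only illustrate the principle.

```python
import math

def gaussian(x, center, fwhm):
    """Gaussian peak shape with unit height at `center`."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def lorentzian(x, center, fwhm):
    """Lorentzian peak shape with unit height at `center`."""
    gamma = fwhm / 2.0
    return gamma ** 2 / ((x - center) ** 2 + gamma ** 2)

def profile(sticks, mz_grid, fwhm, shape=gaussian):
    """Sum one peak shape per (m/z, intensity) stick on the sampling grid."""
    return [sum(inten * shape(mz, pos, fwhm) for pos, inten in sticks)
            for mz in mz_grid]

# Hypothetical isotope pattern of a doubly charged peptide (spacing 0.5 Th).
sticks = [(500.0, 100.0), (500.5, 55.0), (501.0, 18.0)]
grid = [499.8 + 0.05 * i for i in range(30)]
print([round(v, 1) for v in profile(sticks, grid, fwhm=0.1, shape=lorentzian)])
```

At equal FWHM the Lorentzian has much heavier tails than the Gaussian, so neighbouring isotope peaks overlap more strongly.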
MS Signal, Convolved: the raw signal is a convolution of the RT signal and the m/z signal. (Figure: signal over m/z)
Capabilities: MS/MS Signal. (Figure source: http://www.astbury.leeds.ac.uk/facil/mstut/mstutorial_files/qtof.jpg)
Capabilities: MS/MS Signal.
Capabilities: MS/MS Signal (Zhou et al., 2008). Input: peptide sequence; 35 features encoding for every b- and y-ion. SVM classification: prediction of fragment presence/absence. SVM regression: prediction of fragment peak intensity. Neutral losses and charge variants are added either with predefined intensity or with a probabilistic intensity model. Output: MS/MS spectrum.
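The positions of the b- and y-ion ladder that such a model assigns intensities to follow directly from the peptide sequence. The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses. It does not implement the SVM intensity prediction, neutral losses or higher charge states, and the function name is hypothetical.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565

def by_ladder(peptide):
    """Return (b_ions, y_ions): singly charged m/z values b1..b(n-1), y1..y(n-1)."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    # b_i: first i residues plus one proton.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: last i residues plus water plus one proton.
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions

b_ions, y_ions = by_ladder("PEPTIDE")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

A presence/absence classifier and an intensity regressor would then decide, per ladder position, whether a peak is emitted and how high it is.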
Realism of Data: Assume you make chocolate. There is the real chocolate, and your goal is to imitate the real chocolate.
Realism of Data: There are two ways to convince yourself of the quality of your new chocolate bar. Bottom-up (IMITATE THE BUILDING BLOCKS): each basic simulation step is ideally based on an accepted physical model and evaluated with an accepted measure against real data. Top-down (EAT THE CHOCOLATE): the result of the simulation is used in subsequent analysis steps (e.g. MS/MS identification) and deemed good if similar results are obtained.
Realism of Data: MSSimulator hence combines both approaches. Bottom-up: MSSimulator uses published models for digestion (missed cleavages), RT/MT prediction, detectability, and MS/MS ion ladder prediction. Top-down: let's have a look.
Top-down, Q-TOF: simulated vs. real. (Figure)
Top-down, FTMS: simulated vs. real. (Figure)
Algorithm Benchmarking: SILAC-labeled data simulated with a Q-TOF preset.
Use Case, SILAC Data: Imagine you want to compare two tools that can perform an expression analysis, ASAPRatio and XPRESS (both part of the TPP). ASAPRatio is newer and hence should perform better. You want to convince yourself.
Use Case, SILAC Data (ratios 1:1, 1:2, 1:4, 1:10). Lab: measure data (2h); identification (1h); run XPRESS/ASAPRatio (1h); performance comparison using manual annotation (2d). Simulation: simulate data (10min); conversion to pepXML; run XPRESS/ASAPRatio (1h); performance comparison using ground truth (2min).
Use Case, SILAC Data: These results indicate that the newer ASAPRatio performs better than XPRESS.
More Use Cases: Experimental setup optimization. Increasing the HPLC gradient time will increase the number of identified peptides! Increasing resolution will improve feature finding, even though acquisition time will go up!
More Use Cases: Experimental setup optimization. Parameters: protein count, dynamic range, resolution, gradient length, noise.
More Use Cases: feature finding performance by in-silico spike-in.
More Use Cases: feature finding performance by in-silico spike-in. Real signal: TP, FN (sensitivity). Bogus signal: FP, TN (specificity).
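With simulated ground truth, both measures reduce to counting matches. The sketch below assumes that spiked-in real signals and bogus signals are given as (RT, m/z) positions and that a reported feature counts as a hit if it lies within a tolerance window. The function names and the tolerance values are hypothetical.

```python
def count_hits(truth, found, rt_tol=5.0, mz_tol=0.01):
    """Number of truth (rt, mz) positions with a reported feature nearby."""
    return sum(
        any(abs(rt - f_rt) <= rt_tol and abs(mz - f_mz) <= mz_tol
            for f_rt, f_mz in found)
        for rt, mz in truth
    )

def sensitivity_specificity(real, bogus, found):
    """Real signal yields TP/FN, bogus signal yields FP/TN (both lists non-empty)."""
    tp = count_hits(real, found)
    fn = len(real) - tp
    fp = count_hits(bogus, found)
    tn = len(bogus) - fp
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical positions: 2 real features, 4 bogus signals, 2 reported features.
real = [(100.0, 500.25), (220.0, 650.80)]
bogus = [(50.0, 400.10), (150.0, 455.30), (300.0, 710.40), (410.0, 820.90)]
found = [(101.5, 500.251), (149.0, 455.295)]
print(sensitivity_specificity(real, bogus, found))
```

In the toy data one of two real features is recovered and one of four bogus signals is reported, which gives a sensitivity of 0.5 and a specificity of 0.75.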
Wrapping up: MSSimulator is part of OpenMS (www.openms.de). Input: FASTA and additional configuration files. Output (ground truth): raw data (mzML); peptide positions (RT/mz), charge state, labeling status and contaminant positions (featureXML); labeling pairs/groups and charge groups (consensusXML).
Acknowledgements: Alexandra Zerck, Christian Huber, Silke Ruzek, OpenMS team.
Literature
1. Siepen JA, Keevil E-J, Knight D, Hubbard SJ. Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics. Journal of Proteome Research. 2007;6(1):399-408.
2. Schulz-Trieglaff O, Pfeifer N, Gröpl C, Kohlbacher O, Reinert K. LC-MSsim - a simulation software for liquid chromatography mass spectrometry data. BMC Bioinformatics. 2008;9:423.
3. Pfeifer N, Leinenbach A, Huber CG, Kohlbacher O. Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics. BMC Bioinformatics. 2007;8:468.
4. Lan K, Jorgenson JW. A hybrid of exponential and gaussian functions as a simple model of asymmetric chromatographic peaks. Journal of Chromatography A. 2001;915(1-2):1-13.
5. Zhou C, Bowler LD, Feng J. A machine learning approach to explore the spectra intensity pattern of peptides using tandem mass spectrometry data. BMC Bioinformatics. 2008;9:325.
Thank you for your attention!