BMEI 2008: Comparison of ESI-MS Spectra in MassBank Database
Hisayuki Horai 1,2, Masanori Arita 1,2,3,4, Takaaki Nishioka 1,2
1 IAB, Keio Univ.; 2 JST-BIRD; 3 Univ. of Tokyo; 4 RIKEN PSC
1
Table of Contents
- Metabolomics, Mass Spectrometry & Spectral Database
- Spectral Search by Similarity: Vector Space Model
- Evaluation of Relevance: MS/MS Spectra Search of Metabolites
2
Metabolomics & Mass Spectrometry
Measurement of Metabolites: Identification & Quantification
[Figure: mass spectrum, intensity vs. m/z (mass-to-charge ratio), showing the precursor ion and fragment ions]
Identification by Similarity of Peak Pattern
3
MassBank http://www.massbank.jp/
Mass Spectral Database for Identification of Metabolites
- Comprehensive Collection: Metabolites, Drugs, Agrichemicals, ...; EI-MS, ESI-MS, MS/MS, XC/MS, ...; Various Experimental Conditions (Variation of Resolution, Sensitivity, Fragmentation, ...)
- Distributed Database on the Internet: Cloud Computing Environment for Users; Quality Control of Data at the Contributor's Site; Distributed Search
- Open to the Public: Free Access via the Internet; Software Provided as Freeware (Server, DB System, Search Engine, ...)
4
Collaboration in MassBank 2008/05/09 Copyright 2008, Hisayuki Horai, All Rights Reserved. 5
Spectral Search
The Most Important Function of a Spectral Database
6
Spectral Search by Similarity Based on Vector Space Model
- Search Based on the Vector Space Model: an Already Established Search Method in Information Retrieval (e.g. Google, PubMed, ...)
- Low-Resolution (Integer) m/z: Spectral Search for EI-MS
- Translate a Spectrum to a Vector: one Axis per m/z of a Peak; Element of the Vector = Intensity of the Peak
- Similarity of 2 Spectra = Cosine of the Angle between their Vectors
[Figure: Query spectrum s1 with peak intensities q1-q5 and Target spectrum s2 with peak intensities d1-d6, over m/z bins (1)-(6)]
Query q = (q1, 0, q3, q4, q5, 0), Target d = (d1, d2, 0, d4, 0, d5), Dimension = 6
Score(s1, s2) = cos θ = (q · d) / (|q| |d|), where
q · d = q1 d1 + q4 d4 (Inner Product),
|q| = sqrt(q1² + q3² + q4² + q5²), |d| = sqrt(d1² + d2² + d4² + d5²) (Length)
7
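The cosine score above can be sketched as follows; the function name and the example intensities are illustrative, not MassBank's actual code:

```python
import math

def cosine_score(q, d):
    """Cosine similarity of two spectral vectors of equal dimension.
    Unmatched axes hold intensity 0, so only peaks shared by both
    spectra contribute to the inner product."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Example: query has peaks on axes 1, 3, 4, 5; target on axes 1, 2, 4, 6;
# only axes 1 and 4 contribute to the numerator
q = [100.0, 0.0, 40.0, 80.0, 20.0, 0.0]
d = [90.0, 30.0, 0.0, 70.0, 0.0, 25.0]
print(cosine_score(q, d))
```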
Weighting & Normalization
Spectral Vector: (..., Vi, ...)
- Relative Intensity: "Intensity must be normalized by the Largest Peak." Vi = Intensity / max(Intensity)
Improvements for Better Search:
- Importance of Large Ions: "Large m/z may be specific for a compound." Vi = (Relative Intensity) × (m/z)^n [n > 1]
- Importance of Intensity: "Small peaks should not be ignored." Vi = (Relative Intensity)^m × (m/z)^n [n > 1, 0 < m < 1]
8
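A minimal sketch of this weighting scheme; the default exponents chosen here are illustrative placeholders, not MassBank's tuned values:

```python
def weight(mz, intensity, max_intensity, m=0.5, n=2.0):
    """Weighted vector element V_i = (relative intensity)^m * (m/z)^n.
    m < 1 keeps small peaks from being ignored; n > 1 emphasizes
    high-m/z peaks, which may be specific for a compound."""
    rel = intensity / max_intensity
    return (rel ** m) * (mz ** n)
```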
Spectral Search of Real-Number m/z
- Introduce a Tolerance of m/z to Match Peaks in Different Spectra
- Peaks within the Tolerance are Compiled into one Axis
- Two Complications: the "Different Peak Hit Problem" and the "Same Peak Hit Problem"
[Figure: Query q = (q1, 0, q3, q4, q5, 0) and Target d = (d1, d2, 0, d4, 0, d5) over m/z bins (1)-(6); q1-q5, d1-d6: Intensity; Dimension = 6]
9
Different Peak Hit Problem
- Two Target peaks d1 and d2 fall within the tolerance of one Query peak q1: what is the corresponding element of Vector d?
- Choice of Solutions: Largest, Smallest, Average, Total, Nearest m/z, ...
- MassBank Selects the Largest Peak
10
Same Peak Hit Problem
- Two Query peaks q1 and q2 fall within the tolerance of one Target peak d1: one Axis or two Axes? If 1 Axis, what is the element of Vector q? If 2 Axes, what are the 2 elements of Vector d?
- MassBank uses 2 Axes and Duplicates the Hit Peak in the Target: q = (..., q1, q2, ...), d = (..., d1, d1, ...)
11
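The matching rules above (tolerance matching, taking the largest target peak for the Different Peak Hit Problem, and duplicating a target peak for the Same Peak Hit Problem) can be sketched as follows; this illustrates the slides, not MassBank's actual implementation:

```python
def align(query, target, tol=0.3):
    """Build paired vectors from two peak lists [(mz, intensity), ...].
    Different Peak Hit: of several target peaks within tolerance of one
    query peak, take the largest (MassBank's choice).
    Same Peak Hit: each query peak gets its own axis, so a target peak
    hit by two query peaks is used twice (duplicated)."""
    q_vec, d_vec = [], []
    hit = set()
    for q_mz, q_int in query:
        cands = [(d_int, i) for i, (d_mz, d_int) in enumerate(target)
                 if abs(d_mz - q_mz) <= tol]
        if cands:
            d_int, i = max(cands)  # largest target peak within tolerance
            hit.add(i)
            q_vec.append(q_int)
            d_vec.append(d_int)
        else:                      # query-only peak: target element is 0
            q_vec.append(q_int)
            d_vec.append(0.0)
    for i, (d_mz, d_int) in enumerate(target):
        if i not in hit:           # target-only peak: query element is 0
            q_vec.append(0.0)
            d_vec.append(d_int)
    return q_vec, d_vec
```

The resulting vectors feed directly into a cosine-style score.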
Variety of Spectral Search
- Variety of Weighting & Normalization
- Variety of Real-Number m/z Search: Tolerance, Solution for the Different Peak Hit Problem, Solution for the Same Peak Hit Problem
- Variety of Practical Parameters
12
Spectral Search in MassBank - Default Setting -
- Weighting & Normalization: sqrt(Relative Intensity) × m/z / 10, where Relative Intensity = 1000 × Intensity / max(Intensity)
- Tolerance: 0.3
- Choose Effective Peaks (Ignore Noise Peaks):
  - Upper Bound of m/z: Ignore a Peak when m/z ≥ 1000
  - Lower Bound of Intensity: Ignore a Peak when Relative Intensity < 5
  - Lower Bound of the Number of Hit Peaks: if # Effective Peaks of the Query ≥ 3, Ignore a Target when # of Hit Peaks < 3; if # Effective Peaks of the Query < 3, Ignore a Target unless all Effective Peaks are Hit
13
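The default peak filtering and weighting can be sketched as follows; this is an illustration of the stated defaults, not MassBank's source code:

```python
import math

def effective_peaks(peaks):
    """Apply the default filters on this slide: drop peaks with
    m/z >= 1000 or relative intensity < 5, where relative intensity
    is scaled to 0-1000 by the largest peak."""
    max_int = max(i for _, i in peaks)
    out = []
    for mz, inten in peaks:
        rel = 1000.0 * inten / max_int
        if mz < 1000 and rel >= 5:
            out.append((mz, rel))
    return out

def default_weight(mz, rel_intensity):
    """Default vector element: sqrt(relative intensity) * m/z / 10."""
    return math.sqrt(rel_intensity) * mz / 10.0
```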
Spectral Search in MassBank
[Screenshot: search interface with selected queries, a ranked list of search results for the selected query, and a 3D view. Red: Exact Hit; Pink: Hit within Tolerance]
14
Variety of Spectral Search
- Variety of Weighting & Normalization
- Variety of Real-Number m/z Search: Tolerance, Solution for the Different Peak Hit Problem, Solution for the Same Peak Hit Problem
- Variety of Practical Parameters
Optimization of the Search Method Depends on the Target Set, Based on Systematic Evaluation with Real Data
15
Evaluation of Relevance
Identification of Metabolites
16
MS/MS Spectra
- MS/MS Fragmentation Depends on the Machine & the Collision Energy
- Difficulty of MS/MS Spectral Search: Different Machines, Different Collision Energies
- Need for a Comprehensive Collection
17
MS/MS Spectra in MassBank (Contributor: Keio Univ.)
- QqQ MS/MS Spectra (Low Resolution): 861 Metabolites, 4,205 Spectra (1 to 5 Spectra per Metabolite)
- QqTOF MS/MS Spectra (High Resolution): 898 Metabolites, 4,431 Spectra (2 to 5 Spectra per Metabolite)
18
Evaluation Method
- For each Machine, Leave-one-out Test: Test-1: QqQ Spectra; Test-2: QqTOF Spectra
- For both Machines ("Cross" Search): Test-3: Query = QqQ, Target = QqTOF; Test-4: Query = QqTOF, Target = QqQ
- Evaluation Indices: Precision & Recall; Best Ranking of True Positive
19
Precision & Recall
Correct Answer = a Spectrum of the Same Metabolite as the Query
- P (Positive): in the Results; N (Negative): not in the Results
- T (True): Success; F (False): Mistake
The Targets are Divided into 4 Groups (TP, FP, TN, FN) for each Query.
[Figure: Venn diagram of Targets, Results, and Correct Answers showing the TP, FP, FN, TN regions]
Precision = TP / (TP + FP), Recall = TP / (TP + FN)
20
Evaluation of a Score-Based Search Method
- Introduce a Threshold on the Score: Th; Select Results where Score ≥ Th
- The Set of Results Depends on Th
- Tradeoff between Recall and Precision: Raising Th Raises Precision but Lowers Recall; Lowering Th Raises Recall but Lowers Precision
- Plot the Precision-Recall Curve as Th is Shifted between 0 and 1.
[Figure: precision vs. recall curve]
21
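The threshold sweep can be sketched as follows (function and variable names are mine):

```python
def precision_recall_curve(scored, thresholds):
    """Sweep a score threshold Th and report (Th, precision, recall).
    scored: list of (score, is_correct) pairs for all query-target
    comparisons; results are the pairs with score >= Th."""
    total_pos = sum(1 for _, ok in scored if ok)
    curve = []
    for th in thresholds:
        selected = [ok for s, ok in scored if s >= th]
        tp = sum(selected)
        fp = len(selected) - tp
        precision = tp / (tp + fp) if selected else 1.0
        recall = tp / total_pos if total_pos else 0.0
        curve.append((th, precision, recall))
    return curve
```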
Precision-Recall Graph
[Figure: precision-recall curves]
22
Best Rank of True Positive
[Figure: best rank of true positive results]
23
Relevance of MS/MS Spectral Search
- Top 1 has High Relevance: Top 1 is a True Positive for 30% of all Queries; the Average Best Rank of True Positive [BRTP] is Less than 2!
- Top 2 has High Relevance, but BRTP = 2 is a Rare Case: if Top 2 is a True Positive, then Top 1 might also be a True Positive; if Top 1 & Top 2 are the Same Metabolite, then Relevance is Very High!
- Relation between Rank & Score: whether Top 1 is True or False Correlates with whether the Score is High or Low; hence the Importance of Rank
- Ignoring Precision, QqTOF Hits More than QqQ
- The Tolerance of m/z Affects Relevance: a Larger Tolerance Raises Recall but Lowers Precision (Especially for QqTOF)
24
Relevance of "Cross" Search
- QqQ → QqTOF: Query = QqQ, Target = QqTOF (Test-3)
- QqTOF → QqQ: Query = QqTOF, Target = QqQ (Test-4)
- QqQ → QqTOF is Better than QqTOF → QqQ: a High-Resolution Spectral Database is Useful for Low-Resolution MS Users, too!
- Both are Better than Test-1 and Test-2: the Difference of Machine is Less Important than the Difference of Collision Energy!
- Effectiveness of an MS/MS Database of Various Machines
25
Conclusions
- Search Method Based on the Vector Space Model: High-Resolution (Real-Number) m/z Spectra; Weighting & Normalizing Using m/z & Intensity
- Evaluation of an MS Database of Metabolites: Importance of High-Resolution Spectra; Effectiveness of Collecting Various Spectra from Different Machines under Different Experimental Conditions
26
Future Plan: Metabolome Integrated Database
MassBank is an Important Part of a Metabolome Integrated Database
[Diagram: MassBank linked with FlavonoidViewer, a Metabolome Pathway DB, LipidBank (Comprehensive Lipid DB), and KNApSAcK (Species-Metabolite Relationship DB)]
27
Acknowledgements
Special Thanks to the Following Collaborators:
- IAB, Keio Univ.: Y. Nihei, T. Ikeda, Y. Ojima, R. Matsuzawa, T. Soga, Y. Kakazu
- Grad. Sch. Frontier Sci., Univ. of Tokyo: K. Suwa, M. Yoshimoto
- Bioinfo. & Genomics, NAIST: S. Kanaya, Y. Shimbo
- RIKEN PSC: K. Saito, F. Matsuda, A. Oikawa, M. Kusano, A. Fukushima, T. Sakurai, K. Akiyama
- Grad. Sch. Med., Univ. of Tokyo: R. Taguchi
- Dept. Sci., Nara Women's Univ.: T. Takeuchi
- Nishiwaki Lab., JCL Bioassay Inc.: Z. Tozuka
- Kazusa DNA Lab.: T. Ara
- Leibniz Inst. Plant Biochem.: S. Neumann
This work is supported by BIRD-JST and a Grant-in-Aid for Scientific Research on Priority Areas "Systems Genomics" from MEXT of Japan.
28
We would appreciate it very much if you could contribute spectra of metabolites and natural products to MassBank.
URL: http://www.massbank.jp/
E-mail: massbank@iab.keio.ac.jp
(Supplemental) 30
Evaluation for a Single Correct Set
Divide the Correct Set into Target and Query:
- k-fold Cross Validation: Divide the Correct Set into k Subsets (S1, ..., Sk); for each i, Query = Si, Target = Union of the Other Subsets; Calculate the Average & Variance of the Evaluation Index (e.g. 2CV (k = 2), 10CV (k = 10))
- Leave-one-out: for each Element x of the Correct Set, Query = {x}, Target = Rest of the Correct Set; Equivalent to k-fold Cross Validation where k = Number of Elements; Useful when the Correct Set is Small
- Random Sampling: Query = k Elements Selected Randomly, Target = Rest of the Correct Set; Repeat the Evaluation Enough Times
31
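Leave-one-out splitting, for example, can be sketched as:

```python
def leave_one_out(correct_set):
    """Generate (query, target) splits for a leave-one-out test:
    each element in turn becomes the query, and the remaining
    elements form the target set."""
    for i, x in enumerate(correct_set):
        target = correct_set[:i] + correct_set[i + 1:]
        yield x, target
```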
Indices of Relevance
- Recall (Sensitivity) = TP / (TP + FN)
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Specificity = TN / (TN + FP); Fallout = FP / (FP + TN) = 1 - Specificity
- Generality = (TP + FN) / (TP + TN + FP + FN)
[Figure: Venn diagram of Targets, Results, and Correct Answers showing TP, FP, FN, TN]
32
Free from the Tradeoff between Recall & Precision
- Area Measurement: Area under the Precision-Recall Curve
- Break-Even Point: the Value where Precision = Recall
- Eleven-Point Average Precision: Average of the Precisions where Recall = 0.0, 0.1, 0.2, ..., 0.9, 1.0
- Maximum F-measure = max(2 R P / (R + P)), where the F-measure is the Harmonic Mean of Recall (R) and Precision (P)
[Figure: precision-recall curve]
33
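As a sketch, the maximum F-measure over a precision-recall curve:

```python
def max_f_measure(pr_points):
    """Maximum F-measure over (precision, recall) points of a curve:
    F = 2*R*P / (R + P), the harmonic mean of recall and precision."""
    best = 0.0
    for p, r in pr_points:
        if p + r > 0:
            best = max(best, 2 * r * p / (r + p))
    return best
```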
Evaluation Indices by Ranking
(Example ranking: 1:T 2:F 3:F 4:T 5:F 6:F)
- Average Precision: Average of the Precision at each True Positive. Ex.: (1/1 + 2/4) / 2 = 0.75
- Mean Reciprocal Rank (MRR): Average of the Inverse Rank of each True Positive. Ex.: (1/1 + 1/4) / 2 = 0.625
- Discounted Cumulative Gain (DCG): Sum of 1 / log2(r + 1) over each True Positive (r: rank). Ex.: 1/log2(2) + 1/log2(5) = 1.431
34
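The three ranking indices, computed for the slide's example ranking (a sketch; the function name is mine):

```python
import math

def rank_metrics(ranking):
    """Average precision, mean reciprocal rank, and DCG for a ranked
    list of booleans (True = correct hit), as defined on this slide."""
    hits = [r for r, ok in enumerate(ranking, start=1) if ok]
    ap = sum((i + 1) / r for i, r in enumerate(hits)) / len(hits)
    mrr = sum(1.0 / r for r in hits) / len(hits)
    dcg = sum(1.0 / math.log2(r + 1) for r in hits)
    return ap, mrr, dcg

# The slide's example: 1:T 2:F 3:F 4:T 5:F 6:F
ap, mrr, dcg = rank_metrics([True, False, False, True, False, False])
```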
Evaluation Indices for the Best True Positive
- Precision of the Top Ranker: Analogue of Average Precision
- Inverse of the Best Rank of a True Positive: Analogue of Mean Reciprocal Rank. e.g.: Top 1-4: False, Top 5: True → 1/5 = 0.200
35
Why is Cross Search Better in the High-Threshold Region?
The Nearest Spectrum at a Different Energy on the Same Machine is Far from the Nearest Spectrum on a Different Machine
[Figure: QqTOF and QqQ spectra]
A Limitation of the Leave-one-out Test
36