Comparison of ESI-MS Spectra in MassBank Database

BMEI 2008
Hisayuki Horai (1,2), Masanori Arita (1,2,3,4), Takaaki Nishioka (1,2)
(1) IAB, Keio Univ.; (2) JST-BIRD; (3) Univ. of Tokyo; (4) RIKEN PSC

Table of Contents
- Metabolomics, Mass Spectrometry & Spectral Database
- Spectral Search by Similarity
- Vector Space Model
- Evaluation of Relevance
- MS/MS Spectra Search of Metabolites

Metabolomics & Mass Spectrometry
- Measurement of metabolites: identification & quantification
- Identification by similarity of peak pattern
[Figure: mass spectrum; x-axis m/z (mass-to-charge(number) ratio), y-axis intensity; peaks labeled "Precursor Ion" and "Fragment Ion"]

MassBank (http://www.massbank.jp/)
Mass Spectral Database for Identification of Metabolites
- Comprehensive collection
  - Metabolites, drugs, agrichemicals, ...
  - EI-MS, ESI-MS, MS/MS, XC/MS, ...
  - Various experimental conditions: variation of resolution, sensitivity, fragmentation, ...
- Distributed database on the Internet
  - Cloud computing environment for users
  - Quality control of data at the contributor's site
  - Distributed search
- Open to public
  - Public free access via Internet
  - Software provided as freeware (server, DB system, search engine, ...)

Collaboration in MassBank

2008/05/09 Copyright 2008, Hisayuki Horai, All Rights Reserved.

Spectral Search
Most Important Function of Spectral Database

Spectral Search by Similarity Based on Vector Space Model
- Search based on the vector space model is an already established search method
  - Information retrieval (e.g. Google, PubMed, ...)
  - Low-resolution (integer) m/z spectral search for EI-MS
- Translate a spectrum into a vector
  - Axis: m/z of a peak
  - Element of the vector: intensity of the peak
- Similarity of 2 spectra = cosine of the angle θ between their vectors

[Figure: query spectrum s1 and target spectrum s2 drawn on axes (1) - (6) (m/z); q1 - q5 and d1 - d6 are peak intensities; θ is the angle between the query vector q and the target vector d]

Example (dimension = 6):

$$ q = (q_1, 0, q_3, q_4, q_5, 0), \qquad d = (d_1, d_2, 0, d_4, 0, d_6) $$

Inner product:

$$ q \cdot d = q_1 d_1 + q_4 d_4 $$

Length:

$$ |q| = \sqrt{q_1^2 + q_3^2 + q_4^2 + q_5^2}, \qquad |d| = \sqrt{d_1^2 + d_2^2 + d_4^2 + d_6^2} $$

Score:

$$ \mathrm{Score}(s_1, s_2) = \mathrm{Score}(q, d) = \cos\theta = \frac{q \cdot d}{|q|\,|d|} $$
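The cosine score on this slide can be sketched in a few lines of Python. This is only an illustration, not MassBank's code: the function name `cosine_score` and the intensity values are made up for the example, and both spectra are assumed to be already laid out on the same six m/z axes.

```python
import math


def cosine_score(q, d):
    """Cosine of the angle between two intensity vectors on shared m/z axes."""
    if len(q) != len(d):
        raise ValueError("both vectors must use the same m/z axes")
    inner = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty spectrum is treated as matching nothing
    return inner / (norm_q * norm_d)


# Hypothetical intensities; a zero means "no peak on this axis".
query = [3.0, 0.0, 1.0, 4.0, 2.0, 0.0]
target = [3.0, 2.0, 0.0, 4.0, 0.0, 1.0]
score = cosine_score(query, target)
print(round(score, 4))
```

Only axes (1) and (4) carry a peak in both spectra, so only those two products contribute to the inner product.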

Weighting & Normalization
Spectral vector: (..., V_i, ...)
- Relative intensity: "Intensity must be normalized by the largest peak."

$$ V_i = \frac{\text{Intensity}}{\max(\text{Intensity})} $$

Improvement for better search:
- Importance of large ions: "Large m/z may be specific for a compound."

$$ V_i = (\text{Relative Intensity}) \cdot (m/z)^n, \qquad n > 1 $$

- Importance of intensity: "Small peaks should not be ignored."

$$ V_i = (\text{Relative Intensity})^m \cdot (m/z)^n, \qquad n > 1,\ 0 < m < 1 $$
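As a rough illustration of the weighting formulas on this slide, the sketch below normalizes intensities by the largest peak and then applies the exponents m and n. The function name `weight_peaks`, the default exponents (m = 0.5, n = 2) and the peak values are assumptions chosen for the example; the slide only constrains the ranges of m and n.

```python
def weight_peaks(peaks, m=0.5, n=2.0):
    """Return (mz, weighted value) pairs for a list of (mz, intensity) peaks.

    Weighted value = (relative intensity) ** m * mz ** n, where the
    relative intensity is the intensity divided by the largest intensity.
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        raise ValueError("at least one peak must have positive intensity")
    return [
        (mz, (intensity / max_intensity) ** m * mz ** n)
        for mz, intensity in peaks
    ]


# Hypothetical spectrum: a strong peak at m/z 100 and a weak one at m/z 200.
peaks = [(100.0, 400.0), (200.0, 100.0)]
plain = weight_peaks(peaks, m=1.0, n=2.0)
damped = weight_peaks(peaks, m=0.5, n=2.0)
print(plain)
print(damped)
```

With m = 1 the two peaks end up with equal weight; with m = 0.5 the small peak at the larger m/z outweighs the base peak, which is the effect the two quoted remarks on the slide aim for.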

Spectral Search of Real Number m/z
- Introduce a tolerance of m/z to match peaks in different spectra
- Peaks within the tolerance are compiled into one axis
- This raises the "Different Peak Hit Problem" and the "Same Peak Hit Problem"

[Figure: query q = (q1, 0, q3, q4, q5, 0) and target d = (d1, d2, 0, d4, 0, d6) on axes (1) - (6) (m/z); q1 - q5 and d1 - d6 are intensities; dimension = 6]

Different Peak Hit Problem
- One query peak (intensity q1) hits two target peaks (intensities d1 and d2) within the tolerance
- Query q = (..., q1, ...), Target d = (..., ???, ...): what is the element of vector d?
- Choice of solution: largest, smallest, average, total, nearest m/z, ...
- MassBank selects the largest peak

Same Peak Hit Problem
- Two query peaks (intensities q1 and q2) hit one target peak (intensity d1) within the tolerance
- q1 and q2: 1 axis or 2 axes?
  - If 1 axis, what is the element of vector q?
  - If 2 axes, what are the 2 elements of vector d?
- MassBank uses 2 axes and duplicated use of the hit peak in the target:
  q = (..., q1, q2, ...), d = (..., d1, d1, ...)
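The two choices attributed to MassBank on these slides (take the largest target peak when several fall within the tolerance, and reuse one target peak for several query peaks) can be sketched as follows. This is a simplified illustration, not the actual search engine: `align_spectra` is a hypothetical name, and dropping the unselected target peaks inside a tolerance window is an assumption the slides do not spell out.

```python
def align_spectra(query, target, tolerance=0.3):
    """Build paired vectors from two peak lists of (mz, value).

    Every query peak gets its own axis. Target peaks hit by no query
    peak get an axis of their own with a zero query element.
    """
    q_vec, d_vec = [], []
    consumed = set()  # indices of target peaks absorbed into some axis
    for q_mz, q_val in query:
        hits = [i for i, (t_mz, _) in enumerate(target)
                if abs(t_mz - q_mz) <= tolerance]
        if hits:
            # Different peak hit: keep the largest target peak.
            best = max(hits, key=lambda i: target[i][1])
            d_val = target[best][1]
            # Assumption: the other peaks in the window are merged away.
            consumed.update(hits)
        else:
            d_val = 0.0
        # Same peak hit: a target peak may be reused on several axes.
        q_vec.append(q_val)
        d_vec.append(d_val)
    for i, (_, t_val) in enumerate(target):
        if i not in consumed:
            q_vec.append(0.0)
            d_vec.append(t_val)
    return q_vec, d_vec


# Hypothetical peak lists (m/z, weighted intensity).
query = [(100.0, 5.0), (100.4, 3.0), (150.0, 2.0)]
target = [(100.2, 4.0), (150.1, 1.0), (150.2, 6.0), (200.0, 7.0)]
q_vec, d_vec = align_spectra(query, target)
print(q_vec)
print(d_vec)
```

In the example, the target peak at m/z 100.2 is used twice (same peak hit), and the query peak at m/z 150.0 is paired with the larger of its two candidates (different peak hit).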

Variety of Spectral Search
- Variety of weighting & normalization
- Variety of real number m/z search
  - Tolerance
  - Solution for the Different Peak Hit Problem
  - Solution for the Same Peak Hit Problem
- Variety of practical parameters

Spectral Search in MassBank - Default Setting -
- Weighting & normalization: sqrt(Relative Intensity) × m/z / 10,
  where Relative Intensity = 1000 × Intensity / max(Intensity)
- Tolerance: 0.3
- Choose effective peaks (ignore noise peaks):
  - Upper bound of m/z: ignore a peak whose m/z is beyond 1000
  - Lower bound of intensity: ignore a peak when Relative Intensity < 5
- Lower bound of number of hit peaks:
  - # effective peaks of query ≥ 3: ignore a target when # of hit peaks < 3
  - # effective peaks of query < 3: ignore a target unless all effective peaks are hit
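The default settings listed on this slide could be applied along the following lines. This is a hedged sketch rather than MassBank's implementation: the function and constant names are invented, the weighting is read as sqrt(relative intensity) multiplied by m/z and divided by 10, the largest peak is taken over the whole spectrum before filtering, and treating a peak at exactly m/z 1000 as still effective is an assumption.

```python
import math

MZ_UPPER_BOUND = 1000.0        # peaks beyond this m/z are ignored
MIN_RELATIVE_INTENSITY = 5.0   # on a scale where the largest peak is 1000


def effective_peaks(peaks):
    """Filter noise peaks and return (mz, weight) for the remaining ones."""
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        return []
    kept = []
    for mz, intensity in peaks:
        relative = 1000.0 * intensity / max_intensity
        if mz > MZ_UPPER_BOUND:  # assumption: strict comparison
            continue
        if relative < MIN_RELATIVE_INTENSITY:
            continue
        kept.append((mz, math.sqrt(relative) * mz / 10.0))
    return kept


def target_is_eligible(n_query_peaks, n_hit_peaks):
    """Apply the lower bound on the number of hit peaks."""
    if n_query_peaks >= 3:
        return n_hit_peaks >= 3
    return n_hit_peaks == n_query_peaks


# Hypothetical spectrum: one noise peak and one peak beyond the m/z bound.
spectrum = [(100.0, 400.0), (200.0, 1000.0), (300.0, 4.0), (1200.0, 500.0)]
kept = effective_peaks(spectrum)
print(kept)
```

The eligibility check is kept separate from the peak filter because it depends on the outcome of matching a query against one particular target.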

Spectral Search in MassBank
[Screenshot: search interface with the panels "Select & Search Queries", "Selected Query", "Selected Results", "Ranked List of Search Results for Selected Query" and "3D View". Peak colors: red = exact hit, pink = hit within tolerance.]

Variety of Spectral Search
- Variety of weighting & normalization
- Variety of real number m/z search
  - Tolerance
  - Solution for the Different Peak Hit Problem
  - Solution for the Same Peak Hit Problem
- Variety of practical parameters
Optimization of the search method:
- Depends on the target set
- Based on systematic evaluation for real data

Evaluation of Relevance
Identification of Metabolites

MS/MS Spectra
- MS/MS: fragmentation depends on machine & collision energy
- Difficulty of MS/MS spectral search
- Need for a comprehensive collection
[Figure: example spectra labeled "Different Machine" and "Different Collision Energy"]

MS/MS Spectra in MassBank
Contributor: Keio Univ.
- QqQ MS/MS spectra (low resolution): 861 metabolites, 4,205 spectra (1 to 5 spectra per metabolite)
- QqTOF MS/MS spectra (high resolution): 898 metabolites, 4,431 spectra (2 to 5 spectra per metabolite)

Evaluation Method
- For each machine: leave-one-out test
  - Test-1: QqQ spectra
  - Test-2: QqTOF spectra
- For both machines: "cross" search
  - Test-3: Query = QqQ, Target = QqTOF
  - Test-4: Query = QqTOF, Target = QqQ
- Evaluation index
  - Precision & recall
  - Best ranking of true positive

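Precision & Recall
- Correct answer = spectrum of the same metabolite as the query
- P (Positive): in the results; N (Negative): not in the results
- T (True): success; F (False): mistake
- Targets are divided into 4 groups (FN, FP, TN, TP) for a query

[Figure: the set of targets with two overlapping subsets, "Results" and "Correct Answers"; TP = overlap, FP = results only, FN = correct answers only, TN = neither]

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} $$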
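A minimal sketch of these two definitions, treating the search results and the correct answers as sets of spectrum identifiers. The function name `precision_recall` and the identifiers are hypothetical; returning 0.0 for an empty result set or an empty correct set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(results, correct):
    """Return (precision, recall) for one query."""
    results, correct = set(results), set(correct)
    tp = len(results & correct)   # returned and correct
    fp = len(results - correct)   # returned but wrong
    fn = len(correct - results)   # correct but not returned
    precision = tp / (tp + fp) if results else 0.0
    recall = tp / (tp + fn) if correct else 0.0
    return precision, recall


# Hypothetical query: three spectra of the same metabolite exist,
# and the search returns four spectra, two of which are correct.
returned = {"A1", "A2", "B1", "C1"}
same_metabolite = {"A1", "A2", "A3"}
precision, recall = precision_recall(returned, same_metabolite)
print(precision, recall)
```

True negatives never enter either formula, which is why the size of the whole database does not affect precision or recall directly.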

Evaluation of Score Based Search Method
- Introduce a threshold of score: Th
- Select results where score ≥ Th; the set of results depends on Th
- Tradeoff between recall and precision:
  - Th ↑: positives ↓, recall ↓ & precision ↑
  - Th ↓: positives ↑, recall ↑ & precision ↓
- Plot the precision-recall curve when Th is shifted between 0 and 1
[Figure: precision-recall curve; x-axis recall, y-axis precision]
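The threshold sweep described on this slide might look like the sketch below, which computes one (precision, recall) point per threshold from a list of scored targets. The name `pr_curve`, the scores and the thresholds are illustrative assumptions, as is reporting precision as 1.0 when no target passes the threshold.

```python
def pr_curve(scored, thresholds):
    """Return (threshold, precision, recall) triples.

    scored: list of (score, is_correct) pairs for one query or a pooled set.
    """
    total_correct = sum(1 for _, ok in scored if ok)
    curve = []
    for th in thresholds:
        selected = [ok for score, ok in scored if score >= th]
        tp = sum(1 for ok in selected if ok)
        precision = tp / len(selected) if selected else 1.0  # convention
        recall = tp / total_correct if total_correct else 0.0
        curve.append((th, precision, recall))
    return curve


# Hypothetical scores in [0, 1] with their true/false labels.
scored = [(0.95, True), (0.80, False), (0.70, True), (0.40, False), (0.20, True)]
curve = pr_curve(scored, [0.9, 0.6, 0.1])
for th, p, r in curve:
    print(th, round(p, 3), round(r, 3))
```

Lowering the threshold from 0.9 to 0.1 in this toy data raises recall from 1/3 to 1 while precision falls from 1 to 3/5, which is the tradeoff the slide describes.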

Precision-Recall Graph

Best Rank of True Positive

Relevance of MS/MS Spectral Search
Importance of rank:
- Top 1 has high relevance
  - Top 1 is a true positive for 30% of all queries
  - The average of the best rank of true positive [BRTP] is less than 2!
- Top 2 has high relevance, but BRTP 2 is a rare case
  - If Top 2 is a true positive, then Top 1 might be a true positive!
  - If Top 1 & Top 2 are the same metabolite, then relevance is very high!
- Relationship between rank & score
  - Top 1 is true / false: score is high / low
- Ignoring precision, QqTOF hits more than QqQ
- Tolerance of m/z affects relevance
  - Tolerance affects recall and precision (especially for QqTOF)

Relevance of "Cross" Search
- Q → QTof: Query = QqQ, Target = QqTOF (test-3)
- QTof → Q: Query = QqTOF, Target = QqQ (test-4)
- Q → QTof is better than QTof → Q
  - A high-resolution spectral database is useful for low-resolution MS users, too!
- Cross search is better than test-1 and test-2
  - Difference of machine is less important than difference of collision energy!
- Effectiveness of an MS/MS database of various machines

Conclusions
- Search method based on the vector space model
  - High-resolution (real number) m/z spectra
  - Weighting & normalizing using m/z & intensity
- Evaluation of the MS database of metabolites
  - Importance of high-resolution spectra
  - Effectiveness of collecting various spectra by different machines under different experimental conditions

Future Plan
Metabolome Integrated Database: MassBank is an important part of the Metabolome Integrated Database
- FlavonoidViewer
- Metabolome Pathway DB
- MassBank
- LipidBank: comprehensive lipid DB
- KNApSAcK: species-metabolite relationship DB

Acknowledgements
Special thanks to the following collaborators:
- IAB, Keio Univ.: Y.Nihei, T.Ikeda, Y.Ojima, R.Matsuzawa, T.Soga, Y.Kakazu
- Grad.Sch.Frontier Sc., Univ. of Tokyo: K.Suwa, M.Yoshimoto
- Bioinfo.&Genomics, NAIST: S.Kanaya, Y.Shimbo
- RIKEN PSC: K.Saito, F.Matsuda, A.Oikawa, M.Kusano, A.Fukushima, T.Sakurai, K.Akiyama
- Grad.Sch.Med., Univ. of Tokyo: R.Taguchi
- Dept.Sci., Nara Women: T.Takeuchi
- Nishiwaki Lab., JCL Bioassay Inc.: Z.Tozuka
- Kazusa DNA Lab.: T.Ara
- Leibniz Inst. Plant Biochem.: S.Neumann
This work is supported by BIRD-JST and Grant-in-Aid for Scientific Research on Priority Areas "Systems Genomics" from MEXT of Japan.

We would appreciate it very much if you could contribute spectra of metabolites and natural products to MassBank.
URL: http://www.massbank.jp/
E-mail: massbank @ iab.keio.ac.jp

(Supplemental)

Evaluation for Single Correct Set
Divide the correct set into target and query.
- k-fold cross validation
  - Divide the correct set into k subsets (S_1, ..., S_k)
  - For all i, Query = S_i, Target = union of the other subsets
  - Calculate the average & variance of the evaluation index
  - e.g. 2CV (k = 2), 10CV (k = 10)
- Leave-one-out
  - For every element x of the correct set, Query = { x }, Target = rest of the correct set
  - Equivalent to k-fold cross validation where k is the number of elements
  - Useful when the correct set is small
- Random sampling
  - Query = k elements selected randomly, Target = rest of the correct set
  - Repeat the evaluation enough times
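The three ways of splitting a correct set into query and target can be sketched as below. The helper names and the round-robin assignment of elements to folds are assumptions made for the illustration; the slide does not say how the subsets are formed.

```python
import random


def k_fold_splits(items, k):
    """Yield (query, target) pairs: each fold in turn is the query set."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i, query in enumerate(folds):
        target = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield query, target


def leave_one_out_splits(items):
    """Leave-one-out is k-fold with k equal to the number of items."""
    return k_fold_splits(items, len(items))


def random_split(items, k, rng):
    """Pick k random query elements; the rest form the target set."""
    chosen = set(rng.sample(range(len(items)), k))
    query = [x for i, x in enumerate(items) if i in chosen]
    target = [x for i, x in enumerate(items) if i not in chosen]
    return query, target


# Hypothetical spectrum identifiers.
spectra = ["s1", "s2", "s3", "s4"]
two_fold = list(k_fold_splits(spectra, 2))
loo = list(leave_one_out_splits(spectra))
sample_query, sample_target = random_split(spectra, 2, random.Random(0))
print(two_fold)
print(loo[0])
```

An evaluation index would then be computed for each (query, target) pair and averaged, with random sampling repeated many times under different seeds.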

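Index of Relevance
- Recall (Sensitivity) = TP / (TP + FN)
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Specificity = TN / (TN + FP); Fallout = 1 - Specificity = FP / (TN + FP)
- Generality = (TP + FN) / (TP + TN + FP + FN)
[Figure: the set of targets with two overlapping subsets, "Results" and "Correct Answers", dividing it into FP, TP, FN and TN]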

Free from Tradeoff between Recall & Precision
- Area measurement
- Break-even point: value where Precision = Recall
- Eleven-point average precision: average of the precisions where recall is 0.0, 0.1, 0.2, ..., 0.9, 1.0
- Maximum F-measure = max(2 R P / (R + P))
  - F-measure = harmonic average of recall and precision
[Figure: precision-recall curve; x-axis recall, y-axis precision]
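As a small illustration of the maximum F-measure, the sketch below evaluates the harmonic average at a few points of a precision-recall curve and keeps the largest value. The function names and the three curve points are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic average of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def max_f_measure(points):
    """Largest F-measure over (precision, recall) points of a curve."""
    return max(f_measure(p, r) for p, r in points)


# Hypothetical points on a precision-recall curve.
points = [(1.0, 1.0 / 3.0), (2.0 / 3.0, 2.0 / 3.0), (0.6, 1.0)]
best_f = max_f_measure(points)
print(round(best_f, 3))
```

The middle point, where precision equals recall, is the break-even point of this toy curve; it is not the point with the highest F-measure here.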

Evaluation Index by Ranking
Example ranking: 1:T 2:F 3:F 4:T 5:F 6:F
- Average Precision: average of the precision at each true positive
  Ex.: (1/1 + 2/4) / 2 = 0.75
- Mean Reciprocal Rank (MRR): average of the inverse of the rank of each true positive
  Ex.: (1/1 + 1/4) / 2 = 0.625
- Discounted Cumulative Gain (DCG): cumulate 1 / log_2(r + 1) for each true positive (r: rank)
  Ex.: 1/log_2 2 + 1/log_2 5 = 1.431
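The three ranking indices can be checked against the example ranking on this slide (true positives at ranks 1 and 4). The sketch follows the definitions as stated on the slide; the function names are made up, and a ranking is represented as a list of booleans ordered by rank.

```python
import math


def average_precision(ranking):
    """Average of the precision measured at each true positive."""
    hits, total = 0, 0.0
    for rank, is_true in enumerate(ranking, start=1):
        if is_true:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_reciprocal_rank(ranking):
    """Average of 1/rank over the true positives, as defined on the slide."""
    ranks = [rank for rank, is_true in enumerate(ranking, start=1) if is_true]
    return sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0


def discounted_cumulative_gain(ranking):
    """Sum of 1 / log2(rank + 1) over the true positives."""
    return sum(1.0 / math.log2(rank + 1)
               for rank, is_true in enumerate(ranking, start=1) if is_true)


# The slide's example: 1:T 2:F 3:F 4:T 5:F 6:F
example = [True, False, False, True, False, False]
ap = average_precision(example)
mrr = mean_reciprocal_rank(example)
dcg = discounted_cumulative_gain(example)
print(ap, mrr, round(dcg, 3))
```

All three indices reward true positives that appear early in the ranking, but DCG is a sum rather than an average, so it also grows with the number of true positives found.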

Evaluation Index for Best True Positive
- Precision of top ranker: analogy of Average Precision
- Inverse of best ranking of true positive: analogy of Mean Reciprocal Rank
  e.g. Top 1 to 4: false, Top 5: true → 1/5 = 0.200
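A short sketch of the two best-true-positive indices, using the slide's example in which the first true positive appears at rank 5. The function names are hypothetical, and returning None or 0.0 when a ranking contains no true positive is a convention chosen for the example.

```python
def best_true_positive_rank(ranking):
    """Rank (1-based) of the first true positive, or None if there is none."""
    for rank, is_true in enumerate(ranking, start=1):
        if is_true:
            return rank
    return None


def inverse_best_rank(ranking):
    """1 / (best rank of true positive); 0.0 when nothing is correct."""
    rank = best_true_positive_rank(ranking)
    return 0.0 if rank is None else 1.0 / rank


def top_ranker_precision(ranking):
    """1.0 if the top-ranked result is a true positive, else 0.0."""
    return 1.0 if ranking and ranking[0] else 0.0


# The slide's example: Top 1 to 4 are false, Top 5 is true.
ranking = [False, False, False, False, True]
best_rank = best_true_positive_rank(ranking)
inverse = inverse_best_rank(ranking)
print(best_rank, inverse, top_ranker_precision(ranking))
```

Averaging `best_true_positive_rank` over all queries gives the BRTP figure discussed in the relevance slides.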

Why is Cross Search Better in the High Threshold Area?
- The nearest spectrum of a different energy in the same machine is farther from the query than the nearest spectrum in a different machine
- Limitation of the leave-one-out test
[Figure: QqTOF and QqQ spectra]
