A computational framework for discovery of glycoproteomic biomarkers

Similar documents
Nature Biotechnology: doi: /nbt Supplementary Figure 1

A systematic investigation of CID Q-TOF-MS/MS collision energies to allow N- and O-glycopeptide identification by LC-MS/MS

Protein Reports CPTAC Common Data Analysis Pipeline (CDAP)

Glycosylation analysis of blood plasma proteins

A comprehensive, open-source platform for mass spectrometry based glycoproteomics. data analysis. Sciences, Buffalo, NY, USA

Automating Mass Spectrometry-Based Quantitative Glycomics using Tandem Mass Tag (TMT) Reagents with SimGlycan

Quantification with Proteome Discoverer. Bernard Delanghe

Isomeric Separation of Permethylated Glycans by Porous Graphitic Carbon (PGC)-LC-MS/MS at High- Temperatures

Profiling the Distribution of N-Glycosylation in Therapeutic Antibodies using the QTRAP 6500 System

Moving from targeted towards non-targeted approaches

Linear Growth of Glycomics

Isomer Separation of Positively Labeled N-glycans by CE-ESI-MS

New Developments in LC-IMS-MS Proteomic Measurements and Informatic Analyses

Figure S6. A-J) Annotated UVPD mass spectra for top ten peptides found among the peptides identified by Byonic but not SEQUEST + Percolator.

Proteomics of body liquids as a source for potential methods for medical diagnostics Prof. Dr. Evgeny Nikolaev

Digitizing the Proteomes From Big Tissue Biobanks

Analysis of Glycopeptides Using Porous Graphite Chromatography and LTQ Orbitrap XL ETD Hybrid MS

Proteome Informatics. Brian C. Searle Creative Commons Attribution

Flow-Through Electron Capture Dissociation in a novel Branched RF Ion Trap

SimGlycan. A high-throughput glycan and glycopeptide data analysis tool for LC-, MALDI-, ESI- Mass Spectrometry workflows.

The Automation of Glycopeptide Discovery in High Throughput MS/MS Data

PTM Discovery Method for Automated Identification and Sequencing of Phosphopeptides Using the Q TRAP LC/MS/MS System

Increasing Molecular Coverage in Complex Biological and Environmental Samples by Using IMS-MS

LC/MS/MS SOLUTIONS FOR LIPIDOMICS. Biomarker and Omics Solutions FOR DISCOVERY AND TARGETED LIPIDOMICS

2. Ionization Sources 3. Mass Analyzers 4. Tandem Mass Spectrometry

MSnID Package for Handling MS/MS Identifications

New Instruments and Services

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data

Algorithms for Glycan Structure Identification with Tandem Mass Spectrometry

Introduction to Proteomics 1.0

Thank you for joining us! Our session will begin shortly Waters Corporation 1

Applying a Novel Glycan Tagging Reagent, RapiFluor-MS, and an Integrated UPLC-FLR/QTof MS System for Low Abundant N-Glycan Analysis

Targeted and untargeted metabolic profiling by incorporating scanning FAIMS into LC-MS. Kayleigh Arthur

SUPPLEMENTARY INFORMATION

Characterization of Disulfide Linkages in Proteins by 193 nm Ultraviolet Photodissociation (UVPD) Mass Spectrometry. Supporting Information

Extended Mass Range Triple Quadrupole for Routine Analysis of High Mass-to-charge Peptide Ions

Supporting Information for MassyTools-assisted data analysis of total serum N-glycome changes associated with pregnancy

SIEVE 2.1 Proteomics Example

[ APPLICATION NOTE ] High Sensitivity Intact Monoclonal Antibody (mab) HRMS Quantification APPLICATION BENEFITS INTRODUCTION WATERS SOLUTIONS KEYWORDS

Don t miss a thing on your peptide mapping journey How to get full coverage peptide maps using high resolution accurate mass spectrometry

LC MS/MS Quantitation of Esophagus Disease Blood Serum Glycoproteins by Enrichment with Hydrazide Chemistry and Lectin Affinity Chromatography

NON TARGETED SEARCHING FOR FOOD

Supporting Information Parsimonious Charge Deconvolution for Native Mass Spectrometry

Enhancing Sequence Coverage in Proteomics Studies by Using a Combination of Proteolytic Enzymes

Glycosylation analyses of recombinant proteins by LC-ESI mass spectrometry

Tool for Rapid Analysis of glycopeptide by Permethylation (TRAP) via one-pot site mapping and glycan analysis.

Glycoproteomics: Identifying the Glycosylation of Prostate Specific Antigen at Normal and High Isoelectric Points by LC MS/MS

on Non-Consensus Protein Motifs Analytical & Formulation Sciences, Amgen. Seattle, WA

Supplementary Materials for

Structural Elucidation of N-glycans Originating From Ovarian Cancer Cells Using High-Vacuum MALDI Mass Spectrometry

LC/QTOF Discovery of Previously Unreported Microcystins in Alberta Lake Waters

FPO. Label-Free Molecular Imaging. Innovation with Integrity. Discover, localize and quantify biochemical changes and molecular markers

Detecting Glycosylations in Complex Samples

Top 10 Tips for Successful Searching ASMS 2003

Supplementary Figure 1. PD-L1 is glycosylated in cancer cells. (a) Western blot analysis of PD-L1 in breast cancer cells. (b) Western blot analysis

New Mass Spectrometry Tools to Transform Metabolomics and Lipidomics

Latest Innovations in LC/MS/MS from Waters for Metabolism and Bioanalytical Applications

Application Note # ET-17 / MT-99 Characterization of the N-glycosylation Pattern of Antibodies by ESI - and MALDI mass spectrometry

Improve Protein Analysis with the New, Mass Spectrometry- Compatible ProteasMAX Surfactant

Application Note # LCMS-89 High quantification efficiency in plasma targeted proteomics with a full-capability discovery Q-TOF platform

NIH Public Access Author Manuscript J Proteome Res. Author manuscript; available in PMC 2014 July 05.

Barry Boyes 1,2, Shujuan Tao 2, and Ron Orlando 2

Characterization of an Unknown Compound Using the LTQ Orbitrap

Detailed Analysis of Polyclonal IgG with Emphasis on Sialylation and Sub-Classes After Fractionation on a Sialic-Acid Binding Lectin

Simple modification of a protein database for mass spectral identification of N-linked glycopeptides

Can we learn something new about peptide separations after 40 years of RP and HILIC chromatography? Martin Gilar April 12, MASSEP 2016

Current Glycoprotein Analysis. Glycan Characterization: Oligosaccharides. Glycan Analysis: Sample Preparation. Glycan Analysis: Chromatography

The distribution of log 2 ratio (H/L) for quantified peptides. cleavage sites in each bin of log 2 ratio of quantified. peptides

Advantages of Ion Mobility Q-TOF for Characterization of Diverse Biological Molecules

Chemical Analysis Business Operations Waters Corporation Milford MA

Advances in Hybrid Mass Spectrometry

Mass spectrometry-based N-glycoproteomics for cancer biomarker discovery

APPLICATION NOTE. Highly reproducible and Comprehensive Proteome Profiling of Formalin-Fixed Paraffin-Embedded (FFPE) Tissues Slices

Supplementary Figures and Notes for Digestion and depletion of abundant

Characterization of glycopeptides by combining collision-induced dissociation and electron-transfer dissociation mass spectrometry data

Analysis of N-Linked Glycans from Coagulation Factor IX, Recombinant and Plasma Derived, Using HILIC UPLC/FLR/QTof MS

Supporting Information: Protein Corona Analysis of Silver Nanoparticles Exposed to Fish Plasma

TECHNICAL BULLETIN. R 2 GlcNAcβ1 4GlcNAcβ1 Asn

SYNAPT G2-S High Definition MS (HDMS) System

Nature Methods: doi: /nmeth.3177

for the Identification of Phosphorylated Peptides

6726, 62 Temesvari krt, Szeged, Hungary. Genentech Hall, N472A, MC 2240, th Street, San Francisco, CA

REDOX PROTEOMICS. Roman Zubarev.

Learning Objectives. Overview of topics to be discussed 10/25/2013 HIGH RESOLUTION MASS SPECTROMETRY (HRMS) IN DISCOVERY PROTEOMICS

Bringing Glycan Analysis to a New Age of Enlightenment

INTRODUCTION CH 3 CH CH 3 3. C 37 H 48 N 6 O 5 S 2, molecular weight Figure 1. The Xevo QTof MS System.

FOURIER TRANSFORM MASS SPECTROMETRY

Supporting Information. Lysine Propionylation to Boost Proteome Sequence. Coverage and Enable a Silent SILAC Strategy for

MSSimulator. Simulation of Mass Spectrometry Data. Chris Bielow, Stephan Aiche, Sandro Andreotti, Knut Reinert FU Berlin, Germany

MS1 and MS2 crosstalk in label free quantitation of mass spectrometry data independent acquisitions

Ion Source. Mass Analyzer. Detector. intensity. mass/charge

A Fully Integrated Workflow for LC-MS/MS Analysis of Labeled and Native N-Linked Glycans Released From Proteins

1. Sample Introduction to MS Systems:

Metabolomics A New Tool in Molecular Toxicology

Introduction. Methods RESEARCH FUND FOR THE CONTROL OF INFECTIOUS DISEASES. TCW Poon *, HLY Chan, HWC Leung, A Lo, RHY Lau, AY Hui, JJY Sung

Chapter 1. Introduction

High Resolution Glycopeptide Mapping of EPO Using an Agilent AdvanceBio Peptide Mapping Column

GlycanPac AXR-1 Columns

TECHNICAL NOTE. Accurate and fast proteomics analysis of human plasma with PlasmaDive and SpectroDive

SUPPORTING INFORMATION. Lysine Carbonylation is a Previously Unrecognized Contributor. to Peroxidase Activation of Cytochrome c by Chloramine-T

Transcription:

A computational framework for discovery of glycoproteomic biomarkers Haixu Tang, Anoop Mayampurath, Chuan-Yih Yu Indiana University, Bloomington Yehia Mechref, Erwang Song Texas Tech University 1

Goal: to build computational tools that can identify (and quantify) intact glycopeptides in complex samples by using routine proteomic instruments and protocols Mining glycoforms in massive proteomic datasets available for different biological samples (cell lines, tissues, bloods, animal models, etc) Accessible by well established proteomics facilities for some applications (e.g., biomarker discovery) Expectation: instead of comprehensively characterizing all/most glycoforms, we target at the abundant glycopeptides (major glycoforms) 2

From proteomics to glycoproteomics High throughput data acquisition from complex samples (e.g., human blood samples) using conventional proteomics protocols Complex glycoprotein sample CID ETD HCD Proteolytic (tryptic) (glyco)peptides Seperation by HPLC Mass analyzer Peptide search engines (>20): Assessment of peptide-spectrum matchings (PSMs) Sequest, X!tandem cross-correlation Mascot, OMSSA probabilistic scoring InSpect spectra alignment scoring Each experimental spectrum is compared against all putative peptides in a database with the matched precursor mass; only the 1 st ranked PSM is considered to be correct. 3

PSMs Challenge: Assessment of PSMs 8.73 7.34 To determine the correct PSMs among all PSMs Each experimental spectrum is compared against all peptides in the database with the matched precursor mass; only the 1 st ranked PSM is considered to be correct. there is 1 correct PSM for each spectrum. Industrial standard: to report identified peptides by controlled false discovery rate (FDR) the target-decoy search strategy Search both the target and a decoy database (e.g. the reverse protein database) Use the peptide-spectrum matches (PSMs) in decoy database to estimate the false PSMs in the target database FDR = (# decoy PSMs) / (# target PSMs) Can be used for any search engine or scoring model 4

Toward the glycoproteomics for the identification of intact glycopeptides Main challenges Data acquired from complex samples using routine proteomics protocols A majority number of un-modified peptide ions and a considerable number of glycopeptide ions Glycopeptide ions often show lower abundances than ions of nonmodified peptide from the same protein due to microheterogenenity target database is big Various glycans may attach to the same glycosylation site: microheterogeneity N-glycans: >150 # putative glycopeptides = # peptides # glycans More search time & false negatives (missing identifications)! 5

Higher false discovery rate with larger target database # identified true peptides Type I error, Type II error Type I error, Type II error vary score cutoff Type II error # identified true peptides = # identified peptides (1 FDR) Type I error FDR 6

Higher false discovery rate with larger target database # identified true peptides Larger target database A Bayesian interpretation Score distribution of true PSMs Prior prob. for any PSM to be true P( t S) = P( S t)p( t) P( S t)p( t) + P( S not t)[1- P( t)] P-value (a null model) t: a PSM is true; S: the score of PSM. FDR 7

Higher false discovery rate with larger target database P( t S) = P( S t)p( t) P( S t)p( t) + P( S not t)[1- P( t)] P(t) ~ # true PSMs / # potential PSMs = (# peptides in the sample # spectra per peptide) / (# peptides in the database # experimental spectra) Therefore, the larger the database, and the more spectra, the higher FDR. Larger effective search space (i.e., larger target database and more MS/MS spectra, e.g. in complex samples) introduces higher FDR. 8

Goal: to reduce effective search space (fewer PSMs)! HCD CID ETD Glycoprotein database Ä Putative glycan mass HCD CID ETD glycopeptide-spectrum matching (GPMs) Strategies: 1) focus on only glycopeptides that are expected to be observed from the samples, e.g., from glycoproteins identified using un-modified peptide ions; using motifs of glycosylation sites; 2) Filter Ions that are likely derived from glycopeptides; 3) Orthogonal scoring of different fragmentation spectra, e.g. ETD for peptide fragmentation and CID for glycan fragmentation. 9

Overview of the computational framework HCD LC- MS/ MS CID ETD 2 1 HCD/CID scoring Peptide identification using ETD Filter putative glycopeptide ions 3 Glycan sequencing Putative Y1 ions Putative peptide sequences decoy search Glycan scores Peptide scores 4 Peptide/glycan scoring & FDR estimates Database of glycoproteins 10

Testing datasets & target glycoprotein databases Sample Description # datasets Fetuin Bovine Fetuin; 1 hr LC-MS/MS analysis 5 proteins mixture Mixture of 5 model glycoproteins (bovine fetuin, human AGP, bovine pancreatic RNase B, porcine thyroglobulin (PTG), and human fibronectin ; 1 hr LC-MS/MS analysis Serum from cancer patients Serum from control individuals 3 2 4 5 5 hrs LC-MS/MS analysis 6 105 5 hrs LC-MS/MS analysis 6 105 # glycoproteins in database For serum samples, a total (union) of 105 unique glycoproteins can be identified using Mascot against human IPI protein database in 12 cancer and control samples. 11

1 HCD/CID scoring for filtering putative glycopeptide spectra - count longest path from one peak to next whose inter-peak spacing corresponds to mono- or di-saccharide loss - scores given by the length of longest pat Mayampurath, et. al., 2011, RCM, 25 (14), 2007-2019 Ions m/z Description 1 138.05 Oxonium Ion (HexNAc2H 2 O) 2 163.06 Hex 3 204.09 HexNAc 4 274.09 NeuAc-H 2 O 5 292.08 NeuAc 6 366.14 Hex+HexNAc 7 657.23 NeuAc+Hex+HexNAc - Presence/absence of 7 characteristic peaks - P-value computed from binomial distribution 12

2 Peptide identification by matching ETD spectra with peptides in database Glycoprotein database Putative Ä glycan mass Matched precursor mass c4 c5 c3 b9++ c7 b20++ c6 z13++ c8 c9 c10 c11 z20++ c12 Theoretical fragments: c/z, b/y, neutral losses Scoring: # matched peaks This has been commonly used in peptide search engines. 13

How accurate is the ETD scoring? A target-decoy search for estimating FDR Glycoprotein database Reverse glycoprotein database (preserving the motifs) Target search Putative Ä glycan mass Peptide decoy Putative Ä glycan mass Glycoprotein database Glycan decoy Ä False glycan mass (e.g mass-40) Combined target-decoy search, FDR=(# top-decoy-hits) / (# toptarget-hits); Note: in both cases, the number of glycopeptides in the decoy database is equal to the number of glycopeptides in the target database. 14

ETD scoring is not sufficient to identifying intact N- linked glycopeptides in complex samples Glycan decoy search (FDR<0.05) Peptide decoy search (FDR<0.05) Spectra Glycopeptides Sites Proteins Spectra Glycopeptides Sites Proteins Fetuin 25 15 5 2 16 7 3 2 Cancer serum Healthy serum 28 20 13 12 23 16 10 9 25 16 10 9 9 4 2 2 Note: the quality of ETD spectra is not sufficient to distinguish the true from false glycopeptide-spectrum matching (GPMs); therefore, we consider a unified scoring for both peptide identification (using ETD) and glycan sequencing (using CID) for the characterization of glycopeptides. 15

3 Glycan sequencing by using CID spectra of glycopeptides CID spectra of glycopeptide contain intensive peaks resulted from glycan fragmentation; Glycan sequencing from a CID spectrum of a glycopeptide is equivalent to that from a glycan spectrum, if Y1 ion in the glycopeptide spectrum is given; Y1 ion can be predicted from the peptide identified by using ETD scoring. 16

3 Glycan sequencing by using CID spectra of glycopeptides Add pseudo Y-ions by subtracting corresponding b-ion mass from precursor mass. Score: # matched peaks in the CID spectrum; additive to ETD score of the same ion Purpose: 1) to assess how likely the predicted Y1 ion derive a N-glycan (and thus correct); 2) derive the most likely N-glycan structure. Tang, et. al., 2005, Yu, et. al., 2014. 17

4 FDR estimation based on a unified scoring in glycan/peptide sequencing Glycoprotein database Target search Ä Putative glycan mass Reverse glycoprotein database Peptide decoy Putative Ä glycan mass A glycopeptide-spectrum matching (GPMs) is defined as the triplet (E, C, P), where E and C represent the ETD and CID spectrum of a precursor ion, respectively, and P is a peptide. A GPM is scored as the total score: S(E,C,P)=S(E,P)+S(C,P). Because here we used the peak counts in both ETD/CID scoring, they are directly additive. In general, one can train a linear or non-linear function of these two scores. In a combined target-decoy search, the GPM score can be computed for sorting each target or decoy glycopeptide, and FDR can be estimated by FDR=(# top-decoy-hits) / (# top-target-hits). Note: in most cases, the top-hit of an ETD spectrum to decoy peptide will NOT be the reverse peptide of the true glycopeptide, indicating a false Y1-ion will be assigned in this case, which will often lead to a low S(C,P) in glycan sequencing. Therefore, the total score S(E,C,P) is lower, providing higher distinguishing power between true and false GPMs. 18

Re-assessing GSMs in LC-MS/MS data from human serum samples T: GSMs from target database; F: GSMs from decoy database; Mayampurath, et. al., 2013,,Anal. Chem., 25 (14), 2007-2019 19

Identification of glycopeptides using CID/ETD combined scoring GlycoMap Analysis # protein IDs # sites detected # intact glycopeptides # glycopeptides with glycan sequences completely sequenced Glycan class distribution for completely sequenced glycans # complex # high mannose # hybrid Fetuin 2 5 22 8 8 0 0 5 protein mixture 4 5 11 6 6 0 0 Cancer 32 50 101 93 83 4 6 serum Control serum 29 44 92 89 82 1 5 Serum (total) 33 53 103 94 84 4 6 *FDR < 0.05 Mayampurath, et. al., 2013,,Anal. Chem., 25 (14), 2007-2019 20

ceruloplasmin precursor 21

Haptoglobin 22

Implementation GlycoFragwork implemented in C# http://darwin.informatics.indiana.edu/col/glycofr agwork/ Source code: http://sourceforge.net/projects/glycofragwork/ Input: mzxml or.raw file; output: identified glycopeptides, cartoon of glycans, CID & ETD scores, FDR Glycan sequencing algorithm is implemented as an independent.dll 23

Glycopeptide quantification for biomarker discovery Asialylated glycopeptides Sialylated glycopeptides Integrated peak area of identified intact glycopeptides are indicative for their abundances 24

Site-specific protein glycosylations (in cancer) Haptoglobin : N-184, N-207, N-211, N-241 Lin et al. J. Proteome Res. 2011 ; 10:2602-2611 25

A linear ANOVA model for discovery of disease associated site-specific glycosylations y p r r f g b e i, j ( i ), k ( j ( i )), c i i, c c j ( i ) k ( j ( i )) q i, j ( i ), k ( j ( i )), c p i - Abundance of ith protein g k ( j(i)) th ab u n. o f k g lycan, c= 1 th ab u n o f j p ep tid e, c= 1 r - Class effect c r i,c P i, c {1, 2 } P i, c 1 b q f j(i) th ab u n. o f j p ep tid e, c= 1 th ab u n o f i p ro tein, c= 1 - Experimental effect e - Error i, j(i),k ( j(i)),c,q r 0 i, c f 0 g j( i) k ( j ( i )) c j( i) k ( j ( i )) 0 Mayampurath, et. al., J. Proteome Res, 2014. 26

Log-Likelihood ratio test H : r 0, H : r 0 0 i, c 1 a i, c 1 Note: None of these four glycoproteins shown as significant (p-value<0.01) when t-test was applied to the quantities of individual glycopeptides from these proteins. Mayampurath, et. al., J. Proteome Res, 2014. 27

Haptoglobin Complement 3 protein Hemopexin Vitronectin 28

Conclusions We developed a computational framework for discovery of glycoproteomic biomarkers using LC- MS/MS data The framework can be applied to clinical proteomics without specific sample preparation and analytical protocols Routine proteomic approaches can be used for data collection We expect glycopeptides can be identified and quantified from existing proteomic data by using the framework 29

Acknowledgements IU Anoop Mayampurath Chuan-Yih Yu Abhinav Mathur Jagadeshwar Balan Yin Wu Funding: NSF: DBI-0642897 Persistent Fellowship (AM) Texas Tech Prof. Yehia Mechref Ehwang Song Yunli Hu PNNL Brian LaMarche 30