LC-MS Pre-processing (xcms) W4M Core Team 29/05/2017 v 1.0.0

SECTION 1. Acquisition file upload and pre-processing with xcms: extraction, alignment and retention time drift correction.

LC-MS Data
What is provided by the mass spectrometers, and what we want for data analysis.

Extraction with XCMS
R-based software, free.
A lot of parameters to tune.
No graphical interface: an R script has to be written.
Documentation: https://bioconductor.org/packages/release/bioc/html/xcms.html
Forums: https://groups.google.com/forum/#!forum/xcms and http://metabolomics-forum.com

Extraction with XCMS
1. Extraction: ions are extracted in each sample independently.
2. Grouping (alignment): each ion is aligned across all samples.
3. Retention time correction (optional).
4. Fill peaks: missing data are replaced with a baseline value.
5. CAMERA annotation (optional): annotation of adducts, neutral losses and isotopes.
6. Statistics and visualisation (optional).

DATA

Data
Raw data: mzXML, mzML, mzData and netCDF.
samplemetadata: add some information for further steps.

samples        class  sampletype  subset  full  injectionorder  batch  age
HU_neg_011_b2  bio    sample      0       1     44              ne1    19
HU_neg_014_b2  bio    sample      0       1     57              ne2    22
Blank04        blank  blank       0       1     16              ne2    NA
Blank05        blank  blank       0       1     29              ne1    NA
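As an illustration of how such a samplemetadata table can be read outside Galaxy, here is a minimal Python sketch. It assumes the file is tab-separated with the column names shown on the slide; the `read_sample_metadata` helper is hypothetical and not part of xcms or W4M.

```python
import csv
import io

# Content copied from the slide; a real file would be read from disk instead.
SAMPLE_METADATA_TSV = (
    "samples\tclass\tsampletype\tsubset\tfull\tinjectionorder\tbatch\tage\n"
    "HU_neg_011_b2\tbio\tsample\t0\t1\t44\tne1\t19\n"
    "HU_neg_014_b2\tbio\tsample\t0\t1\t57\tne2\t22\n"
    "Blank04\tblank\tblank\t0\t1\t16\tne2\tNA\n"
    "Blank05\tblank\tblank\t0\t1\t29\tne1\tNA\n"
)


def read_sample_metadata(text):
    """Return one dict per sample, keyed by column name (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_sample_metadata(SAMPLE_METADATA_TSV)
# Map each sample to its class, i.e. the information used to set the groups.
classes = {row["samples"]: row["class"] for row in rows}
print(classes)
```

Values such as `NA` are kept as plain strings here; converting them to numbers or missing values is left to the downstream step.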

Two strategies

Two strategies
The "old" system: files are nested in folders for their groups within a zip file.
+ The folders set the group of the files for xcms.group
+ Only one import and one step
- xcmsset is limited to 6 CPUs
- The files are not integrated into the history and cannot be visualized, now or later

Two strategies
The "brand new" system: files are uploaded individually and processed in parallel.
- The xcmsset outputs have to be merged before using group
- A samplemetadata file must be used to set the group (but you need one for some further steps anyway)
+ One xcmsset job is launched for each input file, so the step is highly parallelizable
+ The files are completely integrated in Galaxy and can one day be visualized
+ Better transparency

Dataset Collection
A dataset collection groups N datasets into 1 wrapper (collection).
Depending on the tool, a dataset collection has its nested datasets processed either in one step (a single tool run) or in parallel (one xcmsset run per dataset).

Dataset collection: XCMSSET

RUN! XCMSSET

mzXML raw file
An mzXML file opened in a text editor, showing 1 scan.

mzXML raw file information
Real-life example: m/z 187, file HU_neg_091.mzXML

scan #  RetTime (sec)  basepeakmz        int       TIC       %TIC  delta ppm  delta dalton
724     374.013        187.006423950195  1.11E+07  2.66E+07  42%
725     374.511        187.006896972656  3.26E+07  5.25E+07  62%   2.5        0.000473
726     374.996        187.007186889648  5.14E+07  7.89E+07  65%   1.6        0.000290
727     375.478        187.007324218750  6.19E+07  9.28E+07  67%   0.7        0.000137
728     375.955        187.007125854492  7.13E+07  1.05E+08  68%   1.1        0.000198
729     376.432        187.006988525391  7.34E+07  1.08E+08  68%   0.7        0.000137
730     376.906        187.006942749023  7.62E+07  1.10E+08  69%   0.2        0.000046
731     377.380        187.006942749023  6.98E+07  1.05E+08  67%   0.0        0.000000
732     377.861        187.006942749023  5.94E+07  9.00E+07  66%   0.0        0.000000
733     378.330        187.006713867188  5.79E+07  8.89E+07  65%   1.2        0.000229
734     378.805        187.006942749023  5.06E+07  7.77E+07  65%   1.2        0.000229
735     379.283        187.006484985352  4.33E+07  6.89E+07  63%   2.4        0.000458
736     379.762        187.006622314453  3.87E+07  6.19E+07  62%   0.7        0.000137
737     380.241        187.006576538086  3.14E+07  5.36E+07  58%   0.2        0.000046
738     380.720        187.006347656250  2.49E+07  4.96E+07  50%   1.2        0.000229
739     381.204        187.006439208984  1.98E+07  5.02E+07  39%   0.5        0.000092
740     381.684        187.006515502930  1.25E+07  3.56E+07  35%   0.4        0.000076
741     382.179        187.006393432617  1.11E+07  3.86E+07  29%   0.7        0.000122

RT range: 8.166 s. m/z median: 187.006805419922.
Figure: scan-to-scan m/z deviation (ppm) for scans 723 to 742, m/z axis from 187.0060 to 187.0080.
Figure: intensity versus retention time (373 to 383 s); peak width = 8.2 s.
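The "delta ppm" and "delta dalton" columns can be reproduced from consecutive base peak m/z values. The sketch below assumes the deviation is taken as an absolute difference and expressed relative to the current scan's m/z; the slide does not state which of the two values is the reference, and at this precision the choice does not change the rounded result. The function name `mz_deviation` is illustrative.

```python
def mz_deviation(mz_previous, mz_current):
    """Return (delta_ppm, delta_dalton) between two consecutive scans."""
    delta_dalton = abs(mz_current - mz_previous)
    # Assumption: the current scan's m/z is used as the reference mass.
    delta_ppm = delta_dalton / mz_current * 1e6
    return delta_ppm, delta_dalton


# Base peak m/z of scans 724 to 727, copied from the table.
base_peak_mz = [
    187.006423950195,
    187.006896972656,
    187.007186889648,
    187.007324218750,
]

for previous, current in zip(base_peak_mz, base_peak_mz[1:]):
    ppm, dalton = mz_deviation(previous, current)
    print(f"{ppm:.1f} ppm  {dalton:.6f} Da")
```

These scan-to-scan deviations are the quantity that the centWave «ppm» parameter has to accommodate.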

xcms Extraction: centWave algorithm
The algorithm aims to detect «mass traces» or «regions of interest» (ROI), defined as regions where the m/z deviation between consecutive scans stays below a set limit. This limit is the «ppm» parameter; its value (in ppm) has to be set according to the accuracy of the mass spectrometer.
The ROI intensities are then used to define the chromatographic peak with a continuous wavelet transform algorithm. peakwidth (min, max) has to be set for this step, for example 20,50 for HPLC and 5,12 for UPLC.
Tautenhahn R., BMC Bioinformatics, 2008
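The ROI idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the xcms implementation: it assumes each scan is a list of (m/z, intensity) centroids, compares each centroid with the running mean m/z of the open ROIs, and closes an ROI as soon as one scan fails to extend it. The function `find_rois` and the `min_scans` argument are hypothetical names.

```python
def find_rois(scans, ppm, min_scans=3):
    """Group centroids from consecutive scans into regions of interest (ROI).

    scans: list of scans, each a list of (mz, intensity) centroids.
    An ROI is kept only if it spans at least min_scans consecutive scans.
    """
    open_rois, finished = [], []
    for scan_index, centroids in enumerate(scans):
        next_open = []
        used = set()  # indices of open ROIs already extended by this scan
        for mz, intensity in centroids:
            target = None
            for i, roi in enumerate(open_rois):
                if i in used:
                    continue
                mean_mz = sum(roi["mzs"]) / len(roi["mzs"])
                if abs(mz - mean_mz) / mean_mz * 1e6 <= ppm:
                    target = roi
                    used.add(i)
                    break
            if target is None:
                # No open ROI within tolerance: start a new one.
                target = {"first_scan": scan_index, "mzs": [], "intensities": []}
            target["mzs"].append(mz)
            target["intensities"].append(intensity)
            next_open.append(target)
        # ROIs that were not extended by this scan are closed.
        for i, roi in enumerate(open_rois):
            if i not in used and len(roi["mzs"]) >= min_scans:
                finished.append(roi)
        open_rois = next_open
    finished.extend(roi for roi in open_rois if len(roi["mzs"]) >= min_scans)
    return finished


# Scans 724 to 727 of the m/z 187 example, plus one isolated noise centroid.
scans = [
    [(187.006423950195, 1.11e7)],
    [(187.006896972656, 3.26e7), (250.1, 5e3)],
    [(187.007186889648, 5.14e7)],
    [(187.007324218750, 6.19e7)],
    [],
]
rois = find_rois(scans, ppm=5)
print(len(rois), rois[0]["first_scan"], len(rois[0]["mzs"]))
```

With a tolerance tighter than the instrument's scan-to-scan deviation (for instance `ppm=1` on this data), the trace is broken into pieces too short to be kept, which is why the value has to match the mass spectrometer accuracy.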

xcms Extraction algorithms
MatchedFilter is dedicated to centroid or profile low-resolution MS data.
centWave is dedicated to centroid high-resolution MS data.

xcms Extraction: centWave algorithm
centWave chromatographic peak detection. Tautenhahn R., BMC Bioinformatics, 2008

centWave basic parameters

centWave noise and coeluted peaks

xcms centWave parameters
xcms forum: How to choose peakwidth?
"The main purpose of the peakwidth parameter is to roughly estimate the peak width range, this parameter is not a threshold. The wavelets used for peak detection are calculated from this parameter. If you use HPLC and your peaks are normally 20-60 s wide (base peak width), just go with that, i.e. peakwidth=c(20,60). centWave will still detect peaks that are 15 s or 80 s wide!
Important: Do not choose the minimum peak width too small, it will not increase sensitivity, but cause peaks to be split."
Example: peak width ~ 45 s.
Using peakwidth = c(20,60), the peak will be split in three peaks, each detected as a ~10 s wide separate peak (since they are separated by a local minimum).
Using peakwidth = c(20,120) will keep the peak intact.

xcms extraction output
When a zip file is used, a samplemetadata.tsv file is created at this step. It must contain all the information needed for further analyses: batch correction and statistical analyses. This file must be downloaded, completed with this information, and then uploaded again.

extraction parameters summary
xcms step | xcms parameter | related to | description | examples
Extraction (xcmsset) | ppm | m/z | Fluctuation of the m/z value (ppm) from scan to scan. Depends on the mass spectrometer resolution. | 5
Extraction (xcmsset) | peakwidth | retention time | Range of chromatographic peak width (seconds). | UPLC 5,20; HPLC 10,40
Extraction (xcmsset) | mzdiff | m/z and retention time | Minimum difference of m/z for peaks with overlapping retention times (coeluting peaks). Must be negative to allow overlap. | -0.001 or 0.05
Extraction (xcmsset) | prefilter | intensity | A peak must be present in n scans with an intensity greater than k. | n=3, k=1000
Extraction (xcmsset) | snthresh | intensity | Signal/noise ratio threshold. | 3
Extraction (xcmsset) | noise | intensity | Each centroid must be greater than the "noise" value. |
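The prefilter rule from the table can be written as a one-line check. This is a sketch of the rule as worded on the slide ("present in n scans with an intensity greater than k"), not of the xcms code; the function name `passes_prefilter` is hypothetical, and the scans are not required to be consecutive here.

```python
def passes_prefilter(intensities, n=3, k=1000):
    """True if at least n scans of the mass trace have an intensity greater than k."""
    return sum(1 for value in intensities if value > k) >= n


strong_trace = [800, 1500, 4200, 3900, 900]   # three scans above k=1000
weak_trace = [800, 1500, 950, 700, 900]       # only one scan above k=1000
print(passes_prefilter(strong_trace), passes_prefilter(weak_trace))
```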

xcmsset

«Grouping» step

Independent peak lists (one per sample):
sample   mz        rt    int
pool1b1  196.0905  66.6  7810936
pool1b1  158.1180  67.4  71736
pool1b1  342.0308  67.6  202268
pool1b1  267.0581  65.5  282039
pool1b2  196.0910  66.7  11733921
pool1b2  342.0310  69.0  74594
pool1b2  267.0581  65.5  260877
pool1b2  283.0318  65.2  424631
pool1b3  196.0902  66.6  7933325
pool1b3  158.1173  67.4  82969
pool1b3  342.0308  21.3  2581
pool1b3  283.0320  65.3  357448

Ions are first grouped by m/z, then by retention time.

Resulting matrix ("-" = no peak detected in that sample):
mz        rt    pool1b1  pool1b2   pool1b3
196.0905  66.6  7810936  11733921  7933325
158.1176  67.4  71736    -         82969
342.0308  21.3  -        -         2581
342.0309  68.3  202268   74594     -
267.0581  65.5  282039   260877    -
283.0319  65.2  -        424631    357448
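The two-stage grouping of this example (by m/z, then by retention time) can be reproduced with a small Python sketch. It uses a simple gap rule rather than the density-based method of xcms: a new group starts whenever the gap to the previous peak exceeds a tolerance. The tolerances `mz_tol` and `rt_tol` and both function names are assumptions chosen for this illustration.

```python
from statistics import median

# (sample, mz, rt, intensity), copied from the independent peak lists.
PEAKS = [
    ("pool1b1", 196.0905, 66.6, 7810936), ("pool1b1", 158.1180, 67.4, 71736),
    ("pool1b1", 342.0308, 67.6, 202268), ("pool1b1", 267.0581, 65.5, 282039),
    ("pool1b2", 196.0910, 66.7, 11733921), ("pool1b2", 342.0310, 69.0, 74594),
    ("pool1b2", 267.0581, 65.5, 260877), ("pool1b2", 283.0318, 65.2, 424631),
    ("pool1b3", 196.0902, 66.6, 7933325), ("pool1b3", 158.1173, 67.4, 82969),
    ("pool1b3", 342.0308, 21.3, 2581), ("pool1b3", 283.0320, 65.3, 357448),
]


def split_by_gap(items, key, max_gap):
    """Sort items by key and start a new cluster whenever the gap exceeds max_gap."""
    clusters = []
    previous = None
    for item in sorted(items, key=key):
        if previous is None or key(item) - previous > max_gap:
            clusters.append([])
        clusters[-1].append(item)
        previous = key(item)
    return clusters


def group_peaks(peaks, mz_tol=0.01, rt_tol=5.0):
    """Group peaks by m/z, then by retention time; report medians per feature."""
    features = []
    for mz_cluster in split_by_gap(peaks, key=lambda p: p[1], max_gap=mz_tol):
        for group in split_by_gap(mz_cluster, key=lambda p: p[2], max_gap=rt_tol):
            features.append({
                "mz": median(p[1] for p in group),
                "rt": median(p[2] for p in group),
                "intensity": {p[0]: p[3] for p in group},
            })
    return features


features = group_peaks(PEAKS)
for feature in features:
    print(round(feature["mz"], 4), round(feature["rt"], 1), feature["intensity"])
```

The three m/z 342.03 peaks fall into the same m/z cluster but are split by retention time into one feature at 21.3 s and one at about 68.3 s, as in the resulting matrix.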

xcms alignment: group
First, the mass domain is binned; the bin size is defined by the mzwid parameter. Then, for each m/z bin, all ions of all samples are taken into account across all retention times. A kernel density estimator is used to detect retention time regions with a high density of ions.

xcms alignment: group
A Gaussian model groups together peaks with similar retention times. How inclusive a group is depends on the standard deviation (bandwidth) of the Gaussian model, which is set by the xcms bw parameter. This parameter can be interpreted as a retention time window.
Vertical dashed lines indicate that the feature is valid and will be retained in the data matrix. To be valid, a group must contain peaks from more than a given fraction of the total number of samples. This threshold is defined by the minfrac parameter.
Example shown: a problem case with bw = 30 sec.

xcms alignment: group
Problem with bw = 30 sec, solved with bw = 10 sec: decreasing bw separates these 2 groups.
The resulting m/z and retention time of a feature are the medians of the m/z and RT of all ions grouped together as that feature.
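The effect of bw can be illustrated with a plain Gaussian kernel density over the retention times found in one m/z bin. The sketch below is an assumption-laden toy, not the xcms code: the retention times are invented (two clusters around 60 s and 90 s), the density is evaluated on a 1 s grid, and each local maximum is counted as one group.

```python
import math


def rt_density(rts, bw, grid):
    """Sum of Gaussian kernels (standard deviation bw) centred on each retention time."""
    return [sum(math.exp(-0.5 * ((t - rt) / bw) ** 2) for rt in rts) for t in grid]


def count_density_maxima(rts, bw, grid):
    """Number of strict local maxima of the density on the grid."""
    density = rt_density(rts, bw, grid)
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i] > density[i - 1] and density[i] > density[i + 1]
    )


# Hypothetical retention times (s) of peaks falling in the same m/z bin.
rts = [58.0, 60.0, 62.0, 88.0, 90.0, 92.0]
grid = range(0, 151)

print(count_density_maxima(rts, bw=30, grid=grid))  # wide kernel: clusters merge
print(count_density_maxima(rts, bw=10, grid=grid))  # narrow kernel: clusters separate
```

With the wide kernel the two clusters blur into a single density maximum, while the narrow kernel keeps them apart, which mirrors the bw = 30 sec versus bw = 10 sec case on the slide.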

minfrac parameter for group
minfrac = minimum fraction of samples, in at least one class, in which a peak must be detected for the group to be kept.
Example: minfrac = 0.5 with 4 samples in each group.
Figures: m/z versus RT.
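The minfrac rule can be sketched as follows, assuming the threshold is inclusive (a group found in exactly minfrac of a class's samples is kept) and is evaluated class by class. The function name and the dictionary layout are hypothetical.

```python
def is_valid_group(detected_by_class, class_sizes, minfrac=0.5):
    """True if, in at least one class, the fraction of samples with a peak reaches minfrac."""
    return any(
        detected_by_class.get(name, 0) / size >= minfrac
        for name, size in class_sizes.items()
    )


class_sizes = {"bio": 4, "blank": 4}  # 4 samples in each group, as on the slide

# Found in 2 of 4 bio samples: 2/4 = 0.5, so the group is kept.
print(is_valid_group({"bio": 2, "blank": 0}, class_sizes))
# Found in 1 sample of each class: 1/4 = 0.25 in both, so the group is dropped.
print(is_valid_group({"bio": 1, "blank": 1}, class_sizes))
```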

group interface

xcms group output
mzwid defines the m/z intervals.
bw defines the width of the Gaussian curve.

xcms group output
Two distinct m/z are merged into one group: mzwid and bw are too large.

xcms group output
The two distinct m/z are separated by decreasing the bandwidth value.

grouping parameters summary
xcms step | xcms parameter | related to | definition | examples
Alignment (group) | mzwid | m/z | Size of the m/z slices (bins): range of m/z to be included in a group. Depends on mass spectrometer accuracy. |
Alignment (group) | bw | retention time | Standard deviation of the Gaussian metapeak that groups peaks together. |
Alignment (group) | minfrac | samples | To be valid, a group must be found in minfrac × the number of samples in a subfolder of data files. minfrac=0.5 corresponds to 50%. | n=10, minfrac=0.5: found in at least 5
Alignment (group) | max | number of ions | Maximum number of groups detected in a single m/z slice. | 10 or 50

xcms workflow: retcor

xcms retcor output
Because retcor improves the retention times, it must be followed by a second group step.

xcms retcor output
Modification of the degree of smoothing: span = 0.8

Parameters for retcor
Example with 4 samples in each group: missing = 1, extra = 1.
Figures: m/z versus RT.

xcms retcor obiwarp

retcor parameters summary
xcms step | xcms parameter | related to | description | examples
Retention time correction (retcor) | smooth method | retention time | Regression model used to model the time deviation among samples (linear or loess). | linear or loess
Retention time correction (retcor) | span | | Degree of smoothing of the loess model. | 0.2 to 1
Retention time correction (retcor) | extra | samples | Number of "extra" peaks used to define reference peaks (or well-behaved peaks) for modeling time deviation. Number of peaks > number of samples. | default=1
Retention time correction (retcor) | missing | samples | Number of samples without reference peaks. If blank samples are used, missing = number of blanks. | number of blank samples
Retention time correction (retcor) | plottype | retention time deviation | Defines the graphical visualisation of the effect of the model on retention time correction. |
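One plausible reading of the missing and extra parameters is sketched below: a peak group is used as a "well-behaved" reference if at most `missing` samples have no peak in it and if it holds at most `extra` peaks beyond one per contributing sample. This is an interpretation of the table, not code taken from xcms; the function name and input layout are hypothetical.

```python
def is_well_behaved(peaks_per_sample, missing=1, extra=1):
    """Decide whether a peak group can serve as a reference for retention time correction.

    peaks_per_sample: number of peaks each sample contributes to the group.
    """
    n_samples = len(peaks_per_sample)
    samples_with_peak = sum(1 for count in peaks_per_sample if count > 0)
    total_peaks = sum(peaks_per_sample)
    enough_samples = samples_with_peak >= n_samples - missing
    few_extra_peaks = total_peaks <= samples_with_peak + extra
    return enough_samples and few_extra_peaks


print(is_well_behaved([1, 1, 1, 0]))  # one sample missing: accepted with missing=1
print(is_well_behaved([1, 1, 0, 0]))  # two samples missing: rejected
print(is_well_behaved([2, 1, 1, 1]))  # one extra peak: accepted with extra=1
print(is_well_behaved([3, 1, 1, 1]))  # two extra peaks: rejected
```

Setting missing to the number of blank samples, as the table suggests, lets groups that are absent from the blanks still be used as references.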

Second grouping
As retcor reduces the retention time drift among samples, a new grouping is mandatory to take advantage of this correction. The bw parameter can thus be set to a smaller value than in the first group step.

xcms fillpeaks
Filling method: «chrom» for LC-MS, «MSW» for direct injection.

MS data processing Report creation and Annotations Yann GUITTON 29/05/2017 v 1.0.0
