Supporting Information
Identification of Amino Acids with Sensitive Nanoporous MoS2: Towards Machine Learning-Based Prediction


Amir Barati Farimani, Mohammad Heiranian, Narayana R. Aluru¹

Department of Mechanical Science and Engineering, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Department of Chemistry, Stanford University, Stanford, California 94305

¹ Corresponding author, e-mail: aluru@illinois.edu, web: https://www.illinois.edu/~aluru/

Table of contents
1. Translocation, forces and simulation sets
2. Conformations and blockade in the pore
3. Anion binding to thiol groups
4. Effect of mass of amino acids
5. Machine learning models

1. Translocation, forces and simulation sets

The translocation of Alanine, Aspartic acid and Tryptophan is plotted as a function of time for different pulling forces in Figure S.1. As shown in the figure, for the small amino acid Alanine, the residues begin to pass through the pore under a force of 0.7643 pN, which is also strong enough to push the larger amino acids Tryptophan and Aspartic acid through the pore.

Figure S.1. Translocation of (a) Alanine, (b) Aspartic acid and (c) Tryptophan chains as a function of time for different values of applied force. Each panel plots the number of translocated residues against the translocation time (ns). Applied forces shown: (a) 0.4169, 0.7643, 0.8337 and 0.9032 pN; (b) 0.7643, 0.8337, 0.9727 and 1.0422 pN; (c) 0.6948, 0.7643, 0.8337, 1.1117 and 1.1811 pN.

A total of 4,293 simulations were performed to obtain the residence times, ionic currents and I-V curve, and to study the pore sensitivity of the MoS2 nanopore. The details of all the simulations are tabulated below.

Table 1: Details of all the simulations performed.

Simulation set | # of simulations | Purpose                            | Comment
1              | 2000             | Residence times and ionic currents | Force per residue of 0.7643 pN
2              | 2000             | Ionic currents                     | Different forces per residue in the range of 0.4169 to 1.1811 pN
3              | 80               | I-V curve                          | No amino acids
4              | 100              | Pore sensitivity                   | Force per residue of 0.7643 pN
5              | 103              | None                               | Discarded (no translocation events)
6              | 10               | Studying real-life amino acids     | Force per residue of 0.4516 pN
1-6 (total)    | 4293             |                                    |

2. Conformations and blockade in the pore

Tryptophan has a large volume that occupies most of the pore (d = 1.85 nm). The dominant conformations of the amino acid in the pore are shown in Figure S.2.1. The amino acid blocks the pore, so the ionic current during its translocation is small. This leads to the lowest current and the longest residence time among all the amino acids. The dominant conformations of Alanine are shown in Figure S.2.2. This small molecule sticks to the edge of the pore during its translocation, leaving most of the pore volume accessible for ions to pass through.

Figure S.2.1. Dominant conformations and occupancy of Tryptophan in the pore with d = 1.85 nm and an applied force of 0.6948 pN. The yellow and blue colors represent S and Mo atoms in the MoS2 structure, and blue, red and cyan represent N, O and C atoms in the amino acids. The last three snapshots of the amino acids are shown with van der Waals radii to illustrate the volume of the pore blockade. Hydrogens are not shown.

Figure S.2.2. Dominant conformations and occupancy of Alanine in the pore with d = 1.85 nm and an applied force of 0.6948 pN. The yellow and blue colors represent S and Mo atoms in the MoS2 structure, and blue, red and cyan represent N, O and C atoms in the amino acids. The last three snapshots of the amino acids are shown with van der Waals radii to illustrate the volume of the pore blockade.

3. Anion binding to thiol groups

As mentioned in the manuscript, Methionine translocation results in a negative ionic current under a positive electric field. This is because anions bind to the thiol groups of the amino acids (Figure S.3) and are dragged down with the chain, resulting in a net negative ionic current.

Figure S.3. (a) Snapshots of Methionine translocation through the pore while anions (orange spheres), which are bound to the amino acids, are dragged down across the membrane. (b) Distance (Å) between a tagged Cl- and the hydrogen of the thiol group as a function of time (ps). The plot marks the binding time and the ion translocation through the pore.

4. Effect of mass of amino acids

A 3D scatter plot of the mass of each amino acid as a function of its residence time and ionic current is shown in Figure S.4.1. As the mass increases, the current decreases, since amino acids with higher masses have larger volumes that block the nanopore. The residence time also increases significantly with increasing mass and follows a power-law relation, as shown in Figure S.4.2.

Figure S.4.1. 3D plot of the mass (Da) of each amino acid versus its average residence time (ps) and ionic current (pA). Mass-current data are shown in red, mass-residence time data in green and current-residence time data in blue.
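As a rough illustration of how steep this power-law scaling is, the sketch below evaluates T_R = a M^b with the fitted parameters reported alongside Figure S.4.2 (a = 4.66081E-9, b = 5.33911). The helper name and the sample masses are illustrative assumptions, not values taken from the simulations.

```python
# Fitted parameters reported with Figure S.4.2 (model y = a * x^b).
A_FIT = 4.66081e-9   # prefactor; gives residence time in ps for mass in Da
B_FIT = 5.33911      # power-law exponent


def residence_time_ps(mass_da, a=A_FIT, b=B_FIT):
    """Evaluate the fitted power law T_R = a * M^b (hypothetical helper)."""
    return a * mass_da ** b


# Illustrative masses only: arbitrary values inside the plotted 40-200 Da range.
for mass in (60.0, 100.0, 140.0):
    print(f"M = {mass:5.1f} Da -> T_R ~ {residence_time_ps(mass):8.1f} ps")

# Doubling the mass multiplies the residence time by 2**b, independent of a.
doubling_factor = residence_time_ps(120.0) / residence_time_ps(60.0)
print(f"doubling factor = {doubling_factor:.2f}")
```

With an exponent of about 5.34, doubling the mass raises the predicted residence time by a factor of roughly 40, which is why the heaviest amino acids dominate the upper end of the residence-time range.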

Figure S.4.2. The relationship between the residence time (T_R, in ps) and mass (M, in daltons) of amino acids. The circles represent the data from the simulations. The curve is the fitted power-law relation between T_R and M:

T_R = 4.66 \times 10^{-9} \, M^{5.339}

Fit details (model Allometric1, y = a*x^b): a = 4.66081E-9 (standard error 1.69891E-8), b = 5.33911 (standard error 0.75035), reduced chi-square = 215.006, adjusted R-square = 0.99976.

5. Machine learning models

There are different methods to identify the class of an unseen data point based on known data. Here, we use the k-nearest neighbor (KNN), logistic regression and random forest algorithms².

In KNN, to classify a given point a, its k nearest data points (neighbors) are first identified. In the nanopore data classification problem, each data point is a pair of ionic current and residence time, a = (ionic current, residence time). The class of a is then predicted by majority vote among the nearest neighbors. There are different metrics for the distance between two points a and b. Here, we use the Euclidean distance d, which is given by

d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}    (S.5.1)

where i indexes the features (here, ionic current and residence time) and n is the total number of features (here, 2).

In KNN, k is the number of closest (neighboring) data points to the data point of interest. An appropriate choice of k is needed for high-accuracy prediction. Small values of k (lower limit, k = 1) are sensitive to local structure and noise in the data. Ideally, larger values of k lead to better classification and finer boundaries between class regions. However, because the number of data points is finite, the accuracy decreases as k approaches the number of sample data points. To find the optimal value of k, the accuracy of prediction is calculated for different values of k in Figure S.5.1. The values k = 3, 7 and 8 result in the highest accuracy; k = 3 is chosen since it is computationally less expensive.

Figure S.5.1. The accuracy of the prediction based on KNN as a function of k.

In logistic regression classification, any point x in the multi-feature vector space is mapped to a scalar value p between 0 and 1 using the logistic function

p = \frac{1}{1 + e^{-(a + b x)}}    (S.5.2)

where a and b are the parameters optimized on the training data points. The training data for the nanopore sequencing in this paper are (ionic current, residence time, amino acid label) triples, available as the Excel supplementary file (Aminoacid_IR.xlsx). The class of any point in space is determined by the value of p, and the thresholds in p define the boundaries between class regions.

In the random forest method, a collection of randomized trees (estimators) is constructed from the input data. Each tree can be thought of as a regression fit to the given data. Finally, the forest selects the class by majority vote over the decisions of all the trees. To find the optimal number of estimators (N), the accuracy of prediction is investigated for different values of N in Figure S.5.2. N = 9, which is used in our calculations, leads to the most accurate prediction.
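The KNN procedure described above (Euclidean distance, then a majority vote among the k nearest neighbors) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data points and the held-out test split are all hypothetical, and the numbers are made up rather than taken from Aminoacid_IR.xlsx.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors (Equation S.5.1)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Indices of the training points, sorted from nearest to farthest.
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, the label that was seen first (nearest among the tied labels) wins.
    return votes.most_common(1)[0][0]


def accuracy(train_points, train_labels, test_points, test_labels, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_points, train_labels, p, k) == t
               for p, t in zip(test_points, test_labels))
    return hits / len(test_points)


# Made-up (ionic current, residence time) pairs, NOT values from the paper.
train_points = [(2.0, 100.0), (2.1, 110.0), (1.9, 95.0),
                (0.5, 900.0), (0.6, 950.0), (0.4, 880.0)]
train_labels = ["Ala", "Ala", "Ala", "Trp", "Trp", "Trp"]
test_points = [(2.05, 105.0), (0.55, 920.0)]
test_labels = ["Ala", "Trp"]

# Scan k, in the spirit of Figure S.5.1.
accuracy_by_k = {k: accuracy(train_points, train_labels,
                             test_points, test_labels, k)
                 for k in (1, 3, 5)}
print(accuracy_by_k)
```

Note that with raw features the distance is dominated by whichever feature has the larger numerical range (here the residence time). Rescaling the features before computing distances is a common choice; whether that was done for the reported results is not stated in the text.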

Figure S.5.2. The accuracy of the prediction based on Random Forest as a function of the number of estimators.

We also tried to classify the data without any prior training, to see how well the data can be grouped into clusters and how the resulting accuracy compares with that obtained from the labeled data. In the classification of the raw data, all the data points are unlabeled (only the ionic current and residence time are known, not the respective amino acid labels) and the only given information is the number of clusters. All the clusters are predicted using the k-means clustering method, with an accuracy of 72.6% when the Tryptophan cluster is excluded (1900 data points clustered into 19 clusters), as shown in Figure S.5.3. With the Tryptophan data included (2000 data points and 20 clusters), the accuracy is 54%. This accuracy is defined as the number of correctly clustered data points divided by the total number of data points. A correctly identified cluster means that the cluster membership of each data point matches its label in the labeled (residence time, ionic current) data.
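To make the clustering accuracy concrete, the following sketch runs a basic k-means (Lloyd) iteration on unlabeled points and then scores the result against known labels. Several choices here are assumptions made for illustration only: the deterministic initialization, the rule that maps each cluster to its most common true label, the function names and the toy data. The text does not specify how clusters were matched to amino acid labels.

```python
import math
from collections import Counter


def kmeans(points, k, n_iter=50):
    """Basic Lloyd iteration; returns (cluster index per point, centroids)."""
    # Deterministic initialization (an assumption): k points spread evenly
    # through the list. Random or k-means++ seeding is more usual in practice.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assign, centroids


def clustering_accuracy(assign, true_labels):
    """Correctly clustered points / total points, where each cluster is
    mapped to its most common true label (an assumed matching rule)."""
    correct = 0
    for c in set(assign):
        members = [t for a, t in zip(assign, true_labels) if a == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)


# Made-up (ionic current, residence time) pairs, NOT values from the paper.
points = [(2.0, 100.0), (0.5, 900.0), (2.1, 110.0),
          (0.6, 950.0), (1.9, 95.0), (0.4, 880.0)]
true_labels = ["Ala", "Trp", "Ala", "Trp", "Ala", "Trp"]

assignments, centroids = kmeans(points, k=2)
acc = clustering_accuracy(assignments, true_labels)
print(assignments, acc)
```

On well-separated toy data like this the clusters coincide with the labels. With overlapping clusters, as in the 20-cluster case described above, a majority-label mapping can assign the same label to more than one cluster, so a one-to-one matching may be a stricter alternative.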

Figure S.5.3. The unlabeled data points (blue) are clustered using k-means clustering. The mean values of the clusters obtained from the learning and from the actual data are shown as black stars and red spheres, respectively.