Supporting Information

Identification of Amino Acids with Sensitive Nanoporous MoS2: Towards Machine Learning-Based Prediction

Amir Barati Farimani, Mohammad Heiranian, Narayana R. Aluru¹

Department of Mechanical Science and Engineering, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Department of Chemistry, Stanford University, Stanford, California 94305

Table of contents
1. Translocation, forces and simulation sets
2. Conformations and blockade in the pore
3. Anion binding to thiol groups
4. Effect of mass of amino acids
5. Machine learning models

¹ Corresponding author, e-mail: aluru@illinois.edu, web: https://www.illinois.edu/~aluru/

1. Translocation, forces and simulation sets

The translocation of Alanine, Aspartic acid and Tryptophan is plotted as a function of time for different pulling forces in Figure S.1. As shown in the figure, for the small amino acid Alanine, the residues begin to pass through the pore under a force of 0.7643 pN, which is also strong enough to push the larger amino acids Tryptophan and Aspartic acid through the pore.

[Figure S.1, panels a to c: number of translocated residues versus translocation time (ns). Forces shown: a Alanine, 0.4169, 0.7643, 0.8337 and 0.9032 pN; b Aspartic acid, 0.7643, 0.8337, 0.9727 and 1.0422 pN; c Tryptophan, 0.6948, 0.7643, 0.8337, 1.1117 and 1.1811 pN.]

Figure S.1. Translocation of a Alanine, b Aspartic acid and c Tryptophan chains as a function of time for different values of applied force.

A total of 4,293 simulations were performed to obtain the residence times, ionic currents and I-V curve, and to study the sensitivity of the MoS2 nanopore. Details of all the simulations are tabulated below.

Table 1: Details of all the simulations performed.

| Simulation set | # of simulations | Purpose | Comment |
| 1 | 2000 | Residence times and ionic currents | Force per residue of 0.7643 pN |
| 2 | 2000 | Ionic currents | Different forces per residue in the range of 0.4169 pN to 1.1811 pN |
| 3 | 80 | I-V curve | No amino acids |
| 4 | 100 | Pore sensitivity | Force per residue of 0.7643 pN |
| 5 | 103 | None | Discarded (no translocation events) |
| 6 | 10 | Studying real-life amino acids | Force per residue of 0.4516 pN |
| 1-6 (total) | 4293 | | |

2. Conformations and blockade in the pore

Tryptophan has a large volume that occupies most of the pore (d=1.85 nm). The most dominant conformations of this amino acid in the pore are shown in Figure S.2.1. The amino acid blocks the pore, so the ionic current during its translocation is small. This leads to the lowest current and the longest residence time among all the amino acids. The dominant conformations of Alanine are shown in Figure S.2.2. This small molecule sticks to the edge of the pore during its translocation, leaving most of the pore volume accessible for ions to pass through.

Figure S.2.1. Dominant conformations and occupancy of Tryptophan in the pore with d=1.85 nm and an applied force of 0.6948 pN. The yellow and blue colors represent S and Mo atoms in the MoS2 structure and blue, red and cyan represent N, O, and C atoms in amino acids. The last three snapshots of amino acids are shown with van der Waals (vdW) radii to demonstrate the volume of pore blockade. Hydrogens are not shown.

Figure S.2.2. Dominant conformations and occupancy of Alanine in the pore with d=1.85 nm and an applied force of 0.6948 pN. The yellow and blue colors represent S and Mo atoms in the MoS2 structure and blue, red and cyan represent N, O, and C atoms in amino acids. The last three snapshots of amino acids are shown with van der Waals (vdW) radii to demonstrate the volume of pore blockade.

3. Anion binding to thiol groups

As mentioned in the manuscript, Methionine translocation results in a negative ionic current under a positive electric field. This is because anions are bound to the thiol groups of the amino acids (Figure S.3) and are dragged down with the chain, resulting in a net negative ionic current.

[Figure S.3, panel b: Cl-thiol H distance (Å) versus time (ps), with the binding time and the ion translocation through the pore marked on the curve.]

Figure S.3. a Different snapshots of Methionine translocation through the pore while anions (orange spheres), which are bound to the amino acids, are dragged down across the membrane. b Distance between a tagged Cl- and the hydrogen of the thiol group as a function of time.

4. Effect of mass of amino acids

A 3D scatter plot of the mass of each amino acid as a function of its residence time and ionic current is shown in Figure S.4.1. As the mass increases, the current decreases, since amino acids with higher masses have larger volumes that block the nanopore. The residence time also increases significantly with increasing mass and follows a power-law relation, as shown in Figure S.4.2.

Figure S.4.1. 3D plot of mass (Da) of each amino acid versus its average residence time (ps) and ionic current (pA). Mass-current data are shown in red, mass-residence time data in green and current-residence time data in blue.
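The power-law relation between residence time and mass can be evaluated directly from the fit parameters reported with Figure S.4.2 (a = 4.66081E-9, b = 5.33911 for y = a*x^b). The snippet below is a minimal sketch, not the authors' fitting script: the function names and the sample masses are hypothetical and chosen only for illustration.

```python
import math

# Fit parameters as reported with Figure S.4.2 for the model y = a * x**b.
A_FIT = 4.66081e-9  # prefactor a
B_FIT = 5.33911     # exponent b


def residence_time_ps(mass_da, a=A_FIT, b=B_FIT):
    """Evaluate the fitted power law T_R = a * M**b (T_R in ps, M in Da)."""
    if mass_da <= 0:
        raise ValueError("mass must be positive")
    return a * mass_da ** b


def loglog_slope(m1, m2):
    """Slope of log(T_R) versus log(M) between two masses.

    For a pure power law this equals the exponent b for any pair of masses.
    """
    dy = math.log(residence_time_ps(m2)) - math.log(residence_time_ps(m1))
    dx = math.log(m2) - math.log(m1)
    return dy / dx


# Hypothetical masses (Da), used only to exercise the function.
for mass in (80.0, 120.0, 160.0):
    print(f"M = {mass:6.1f} Da -> T_R = {residence_time_ps(mass):10.1f} ps")
print("log-log slope:", loglog_slope(80.0, 160.0))
```

Because the relation is a pure power law, doubling the mass multiplies the predicted residence time by 2^b, roughly a factor of 40 with the reported exponent. The large standard error reported for a means that absolute values from this formula should be read with caution.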

[Figure S.4.2: residence time (ps) versus mass (daltons). Fit details shown in the plot: model Allometric1, equation y = a*x^b, reduced chi-square 215.006, adjusted R-square 0.99976, a = 4.66081E-9 (standard error 1.69891E-8), b = 5.33911 (standard error 0.75035). Fitted relation: $T_R = 4.66 \times 10^{-9} M^{5.339}$.]

Figure S.4.2. The relationship between the residence time (T_R) and mass (M) of amino acids. The circles represent the data from the simulations. The curve is the fitted power-law relation between T_R and M.

5. Machine learning models

There are different methods to identify the class of an unseen data point based on known data. Here, we use k-nearest neighbor (KNN), logistic regression and random forest algorithms². In KNN, to classify a given point a, the k nearest data points (neighbors) of that point are first identified. In the nanopore data classification problem, each data point is a pair of ionic current and residence time, a = (ionic current, residence time). The class of a is then predicted by a majority vote among the nearest neighbors. There are different metrics to calculate the distance between two points, a and b. Here, we use the Euclidean distance d, which is given by

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad \text{(S.5.1)} $$

where i denotes the type of the feature (here, ionic current and residence time) and n is the total number of features (here, 2). In KNN, k is the number of closest (neighboring) data points to the data point of interest. An appropriate choice of k is needed for high-accuracy prediction. Small values of k (smallest limit, k=1) are sensitive to local structure and noise in the data. Ideally, larger values of k lead to better classification and finer boundaries between class regions. However, because of the finite number of data points, the accuracy decreases as the value of k approaches the number of sample data points. To find the

optimal value of k, the accuracy of prediction is calculated for different values of k in Figure S.5.1. The values k=3, 7 and 8 result in the highest accuracy; k=3 is chosen since it is computationally less expensive.

Figure S.5.1. The accuracy of the prediction based on KNN as a function of k.

In logistic regression classification, any point x in the multi-feature vector space is mapped to a scalar value p between 0 and 1 using the logistic function as follows

$$ p = \frac{1}{1 + e^{-(a + b x)}} \qquad \text{(S.5.2)} $$

where a and b are the parameters optimized on the training data points. The training data for the nanopore sequencing in this paper consist of (ionic current, residence time, amino acid label). The training data are available as the Excel supplementary file (Aminoacid_IR.xlsx). The class of any point in space is defined by the value of p, and the thresholds in p indicate the boundaries between class regions.

In the random forest method, a collection of randomized trees (estimators) is constructed based on the input data. Each tree can be thought of as a regression fit to the given data. Finally, the forest selects the classification by a majority vote over the decisions of all the trees. To find the optimal number of estimators (N), the accuracy of prediction is investigated for different values of N in Figure S.5.2. N=9, which is used in our calculations, leads to the most accurate prediction.
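The KNN procedure described in this section (Euclidean distance of Equation S.5.1 followed by a majority vote among the k nearest neighbors) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the sample (ionic current, residence time) pairs and the tie-breaking rule are hypothetical, and no feature rescaling is applied because the text does not say whether one was used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors (Equation S.5.1)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors.

    Assumption: ties are resolved in favor of the label encountered first
    when walking the neighbors from nearest to farthest.
    """
    # Indices of the training points, sorted by distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (ionic current, residence time) pairs; not data from the paper.
train_points = [(2.0, 300.0), (2.1, 320.0), (1.9, 310.0),
                (0.5, 2500.0), (0.6, 2600.0), (0.4, 2400.0)]
train_labels = ["Ala", "Ala", "Ala", "Trp", "Trp", "Trp"]

print(knn_predict(train_points, train_labels, (1.8, 350.0), k=3))
```

With raw features, the residence time dominates the distance because its numerical range is much larger than that of the current; whether to rescale the features first is a design choice left open here.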

Figure S.5.2. The accuracy of the prediction based on Random Forest as a function of the number of estimators.

We also tried to classify the data without any prior training, to see how well the data can be grouped into clusters and how the accuracy compares with that obtained from the labeled data. In this classification of the raw data, all the data points are unlabeled (only the ionic current and residence time are known, not the respective amino acid labels) and the only given information is the number of clusters. All the clusters are predicted using the k-means clustering method, with an accuracy of 72.6% when the Tryptophan cluster is excluded (1900 data points clustered into 19 clusters), as shown in Figure S.5.3. With the Tryptophan data included (2000 data points and 20 clusters), the accuracy is 54%. This accuracy is defined as the number of correctly clustered data points divided by the total number of data points. A correctly identified cluster means that the cluster membership of each data point is the same as the labeled residence time and ionic current.
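The clustering accuracy defined above (correctly clustered points divided by all points) can be illustrated with a small k-means sketch. The code below is a hedged illustration rather than the authors' procedure: it uses plain Lloyd iterations with random initial centroids, and it assumes that each cluster is matched to the majority label of its members, since the text does not specify how clusters are paired with amino acid labels. All names and the sample data are hypothetical.

```python
import random
from collections import Counter


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def cluster_accuracy(assignment, labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    correct = 0
    for c in set(assignment):
        in_cluster = [lab for a, lab in zip(assignment, labels) if a == c]
        correct += Counter(in_cluster).most_common(1)[0][1]
    return correct / len(labels)


# Hypothetical (ionic current, residence time) pairs; not data from the paper.
points = [(2.0, 300.0), (2.1, 320.0), (1.9, 310.0),
          (0.5, 2500.0), (0.6, 2600.0), (0.4, 2400.0)]
labels = ["Ala", "Ala", "Ala", "Trp", "Trp", "Trp"]

assignment, centroids = kmeans(points, k=2)
print("accuracy:", cluster_accuracy(assignment, labels))
```

Majority-label matching can map two clusters to the same amino acid; a one-to-one matching between clusters and labels would be a stricter alternative and could give a lower accuracy on overlapping data.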

Figure S.5.3. The unlabeled data points (blue) are clustered using k-means clustering. The mean values of the clusters based on the learning and the actual data are presented as black stars and red spheres, respectively.