© 1999 Wiley-Liss, Inc. Cytometry 36:60-70 (1999)


Some Theoretical and Practical Considerations for Multivariate Statistical Cell Classification Useful in Autologous Stem Cell Transplantation and Tumor Cell Purging

James A. Hokanson,1 Judah I. Rosenblatt,1 and James F. Leary2*

1 Department of Preventive Medicine and Community Health, Division of Infectious Diseases, University of Texas Medical Branch, Galveston, Texas
2 Department of Internal Medicine, Division of Infectious Diseases, University of Texas Medical Branch, Galveston, Texas

Received 2 January 1998; Revision Received 26 November 1998; Accepted 19 January 1999

Background: As flow cytometric data becomes more complex, it becomes increasingly difficult to classify cells using conventional flow cytometry data techniques based on visual classification of the data by user-drawn regions. This paper shows some simple applications of multivariate statistical classification to classify flow cytometric data.

Methods: Discriminant Function Analysis (DFA) and Logistic Regression (LR) analysis techniques were evaluated with respect to their potential utility in the problem of detecting human breast cancer cells within normal bone marrow cells. Data sets having defined properties were employed to evaluate the potential utility of these statistical classification techniques whose performance was measured by ROC analysis.

Results: Two extreme but reasonable situations are presented: (1) data where the separation of cells was obvious by visual inspection and (2) data where major overlaps in the values of the individual FCM parameters made intuitive classification improbable. Both DFA and LR analysis were able to classify the cells of each type with acceptable accuracy and yield.
Conclusions: The excellent empirical performance of both DFA and LR techniques suggests that they offer promising approaches for classifying multiparameter FCM data using objective rules that may represent an improvement over commonly employed ad hoc approaches. Cytometry 36:60-70, Wiley-Liss, Inc.

Key terms: flow cytometry; cell sorting; multivariate statistics; discriminant functions; logistic regression; ROC analysis; misclassification penalty

Flow cytometry (FCM) may be thought of as a way to analyze and classify individual cells into one of a number of discrete cell types. Applying various criteria, FCM uses multiparameter signal characteristics detected from each cell to arrive at a classification decision for that cell. As yet, objective analysis techniques for using the signal characteristics to classify cells have not been routinely employed to develop decision rules. In today's multiparameter FCM environment, investigators typically resort to intuition-based or even arbitrarily drawn boundaries to classify cells. This can cause fundamental problems where the multiparameter data space cannot be easily visualized and conventional flow cytometry data analysis techniques offer few alternative approaches.

This paper proposes the use of several standard multivariate statistical techniques, Discriminant Function Analysis (DFA) and Logistic Regression (LR) augmented by Receiver Operating Characteristic (ROC) analysis, to determine classification boundaries having uncertainties that can be statistically described. The ultimate goal in this particular application, cited as an example, is to apply multivariate statistical classification procedures in real time to the problem of detecting and eliminating ("purging") contaminating human cancer cells in samples of normal bone marrow. This is a major problem in the autologous transplantation of mobilized peripheral stem cells in cancer patients undergoing high-dose chemotherapy.
Gene marking methods (4, 5) have recently shown that contaminating tumor cells re-infused into a patient during autologous transplantation can indeed lead to tumor recurrence (7).

Because the primary purpose of this paper is tutorial, the analysis methods were applied to simple data sets representative of both obvious and non-obvious situations. Similarly, for clarity of presentation only two-dimensional figures are displayed. However, both the proposed methods and the data are multidimensional. Given the sample sizes associated with the processing of a large number of cells typically encountered in FCM experiments, the number of FCM parameters that could be included in the analysis could theoretically extend to multiples of today's most complex multiparameter data. As demonstrated by the excellent empirical results obtained in our analysis, these statistically robust procedures could provide investigators with a more objective basis for selecting multidimensional data analysis boundaries. A future paper will demonstrate that these techniques can also be used for real-time sort decisions if they are implemented through high-speed look-up tables that permit computations at memory speeds.

Grant sponsor: NIH, U.S. Public Health Service; Grant numbers: GM38645 and CA
*Correspondence to: James F. Leary, Ph.D., Molecular Cytometry Unit, Route 0835, University of Texas Medical Branch, Galveston, TX. E-mail: james.leary@utmb.edu

FIG. 1. Flow cytometric data used in this paper. Left group: data from normal bone marrow cells. Right group: data from MCF-7 human breast cancer cells.

MATERIALS AND METHODS

Data File Creation and Statistical Software

The data used in this report were obtained from separate multiparameter FCM scans of two distinct cell types. The data files were constructed from samples of 6,689 normal human bone marrow cells (NBM) and 7,796 MCF-7 human breast cancer cells (MCF-7). Four flow cytometric parameters were measured in each sample (Fig. 1). These were Forward Scatter, Side Scatter, Green (FITC) fluorescence, and Orange (PE) fluorescence. The green (FITC) fluorescence measures the relative number of cell surface antigens labeled by a FITC-conjugated monoclonal antibody (monoclonal antibody clone 9187, Baxter Laboratories, McGaw Park, IL) against these antigens, which are found predominantly on breast cancer cells.
The red fluorescence was a measure of Red670 (a PE-Cy5 tandem conjugate) Streptavidin (Gibco/BRL, Grand Island, NY) labeling of biotinylated anti-CD45 antibody (Caltag, Inc., Burlingame, CA), a marker of mature mononuclear cells. Figure 1 illustrates the raw multiparameter FCM data used in this paper.

Random samples of cells from each raw data file were selected using computer algorithms to produce a composite data file. This composite file contained FCM signal parameter data from an experimenter-selected number of cells of each type plus an added variable that denoted the actual cell type of the selected cell. Data in channel 1,023, the highest channel number available on the FCM equipment used, were excluded, because any signal in this channel could actually represent a signal from any higher channel number.

A personal computer implementation of the statistical software package SAS (SAS Institute Inc., Cary, NC) (12) was used for all data manipulations and analyses described in this paper (in other related work we have used the statistical software package S-Plus, MathSoft, Inc., Cambridge, MA). For some calculations, the seed value used in the random number generator was fixed so that exactly the same cells could be repeatedly selected from each data file. Using this procedure, which provided repeated selections of the same cells from the total data available, the minimum ratio, or actual number, of rare cells required to obtain valid and stable discriminant function coefficients could be investigated, as could changes in the relative penalties of misclassification.

The stability of the DFA coefficients and cut-point scores was determined using a general Monte Carlo procedure. Repeated subsets with replacement, containing an experimenter-specified number of cells, were drawn at random and the corresponding discriminant functions were calculated. The means, standard deviations, and ranges of coefficients were calculated to judge the stability of the discriminant functions under different classification rules.

Discriminant Function Analysis

Discriminant Function Analysis (DFA) describes one standard statistical approach for classifying individual observations into membership in one of a number of groups (1, 2, 6, 8, 9). The basic assumptions of DFA relevant to FCM can be summarized as follows: (1) the number of groups must be known (or at least experimenter-specified), (2) there must be at least 2 observations per group, (3) the number of discriminating variables cannot be greater than N - 2, where N is the number of observations, (4) discriminating variables should not be collinear, i.e., no discriminating variable should be a linear combination of any of the other discriminating variables, (5) the covariance matrices for each of the groups should be equal, (6) each group should be from a population of cells having a multivariate Gaussian distribution of the discriminating variables, (7) in each group, the observations should be a random sample from the population of interest, and (8) the number of discriminant functions is the smaller of the number of groups or the number of discriminating variables. The DFA described in this study is linear DFA, where the discriminant function is a linear combination of the discriminating variables (1, 2, 8, 9). Discriminant functions can be adapted to avoid these assumptions, but standard statistical packages generally do not do this. As illustrated in Figure 1, the FCM data used in this analysis did not appear to conform to the multivariate Gaussian assumption of DFA. However, as demonstrated below, the empirical results suggest that despite this violation, DFA appears to be a useful tool in FCM analysis.
The basic strategy in DFA is to use observations whose membership status is known to calculate a function, a linear combination of the discriminating variables, that serves as a scoring system: the score obtained from a new observation can be used to estimate the probability that the new observation belongs to a particular group. Where the goal is to classify objects into only two groups (each group can contain more than one cell subpopulation), the discriminant function can be written in matrix notation (2) as

$L = X^T C^{-1}(m_1 - m_2) - \tfrac{1}{2}(m_1 + m_2)^T C^{-1}(m_1 - m_2)$.

In this notation, $X$ is the vector of observed discriminating variables for a particular cell, $^T$ denotes matrix transposition, $^{-1}$ denotes matrix inversion, $C$ is the common covariance matrix of all vector observations, and $m_1$, $m_2$ are the mean vectors of the discriminating variables for cell types 1 and 2, respectively. The goal is to develop a function that has an acceptable probability of classifying a new observation into correct group membership. When there are more than two cell types, DFA can still be used, although the calculations are more complex (2, 9, 13).

FIG. 2. An idealized description of how Discriminant Function Analysis (DFA) can be used to classify cells into groups. DFA maps values in multivariate Discriminant Variable Space into the values of a Discriminant Function where a user-specified value can be calculated and used to classify outcomes.

For each group (cell type), a set of expressions, $L_i = a_1 x_{i1} + a_2 x_{i2} + \ldots$, can be calculated (9). The $x_i$ (the means of the discriminating variables) are obtained from the multiparameter FCM signal characteristics. The $a_i$ are coefficients calculated from the data in a manner that optimally distinguishes the groups (9). This property of optimally distinguishing the groups is particularly useful for FCM-based cell type data analysis and sorting.
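The two-group discriminant function given above in matrix notation can be sketched in a few lines of NumPy. This is only an illustration, not the SAS procedure used in the paper. It assumes the standard form of the function (a minus sign between the two terms and a sum of the mean vectors in the second term), a pooled estimate of the common covariance matrix, and synthetic Gaussian data; the function names are hypothetical.

```python
import numpy as np


def fit_linear_discriminant(x1, x2):
    """Return (weights, offset) such that L(x) = x @ weights - offset.

    x1, x2: arrays of shape (cells, parameters) for groups 1 and 2.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # Pooled estimate of the common covariance matrix C (an assumption).
    c = ((n1 - 1) * np.cov(x1, rowvar=False)
         + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    weights = np.linalg.solve(c, m1 - m2)   # C^{-1} (m1 - m2)
    offset = 0.5 * (m1 + m2) @ weights      # 1/2 (m1 + m2)^T C^{-1} (m1 - m2)
    return weights, offset


def discriminant_score(x, weights, offset):
    """Score L for one cell (1-D array) or many cells (2-D array)."""
    return x @ weights - offset


# Synthetic two-parameter data; these values are NOT from the paper.
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(500, 2))
group2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(500, 2))

weights, offset = fit_linear_discriminant(group1, group2)
scores1 = discriminant_score(group1, weights, offset)
scores2 = discriminant_score(group2, weights, offset)
midpoint = 0.5 * (group1.mean(axis=0) + group2.mean(axis=0))
```

With this form of the function, a cell lying exactly midway between the two group means scores zero, so positive scores point toward group 1 and negative scores toward group 2.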
The discriminant function, $L$, using linear combinations of the $L_i$'s obtained for each group, can be used to divide the observations into groups (9). For a given observation, $L$ is calculated from the discriminant coefficients ($a_i$) and the values of the discriminating variables for that particular observation. If the $L$ score exceeds some experimenter-specified level, $c$, the observation is assigned membership in one group; if below, then membership in the other group is assigned (9). The approach is similar to the logic used in the traditional one-way analysis of variance situation where ideally the variation in the $L_i$'s within a cell type should be much less than the variation in the $L_i$'s between the cell types. The concept of discriminant function classifiers is shown in Figure 2. For ease of visualization, only the two parameters from a bivariate data space are projected into the single discriminant function. The results of different penalties of misclassification are shown by the lines c1, c2, and c3.

When the prior probabilities of membership are equal or unknown and there are only two groups, the traditional default for the threshold discriminant score (9) has been $c = (L_1 + L_2)/2$. If the prior probabilities of membership are equal or known, the posterior probability of membership of a selected cell in a group, $\Pr(L)$, can then be calculated using $\Pr(L) = 1/(1 + \exp(-L + c))$, where $1/(1 + \exp(-L + c))$ is the multivariate logistic function (1, 9). Many DFA situations assume that there is an equal probability of being in each group and a DFA score of zero is used as the cutoff point for discrimination. For FCM experiments when information about the relative cell type frequency is known, or can be estimated, assignment of a cell to a group based on its expected frequency of occurrence can modify the misclassification rate by changing the cutoff score from that obtained when equal relative cell type frequencies are assumed.

For the FCM experiments described in this paper, the ultimate goal is to acceptably classify cells existing within samples of mixed groups of normal bone marrow and cancer cells such that the cancer cells could be excluded (purged) from the bone marrow mixture in a manner suitable for subsequent autologous bone marrow transplantation. In this situation, it is most likely that many more normal cells than cancer cells will be encountered. Were we to use the relative frequency as the only criterion, the prior probability (before any processing of FCM data) of being classified as a normal cell would greatly exceed the prior probability of being classified as a cancer cell, since the number of normal cells greatly exceeds the number of cancer cells in virtually all cases. In the absence of any additional information, a reasonable rule for classifying a particular cell would be based on the relative frequency of each cell type in the mixture. Hence, in this paper results are presented with the cutoff scores modified to reflect the prior probabilities of classification based on relative cell type frequency. Also, in the context of cancer cell sorting using FCM, the penalty for incorrect classification may not be equal between cell types.
The classification of a cancer cell as a normal cell may have far more serious consequences than classification of a normal cell as cancerous. Misclassifying a normal cell as a tumor cell results only in a reduced yield of normal cells, still of acceptable purity. If the normal cell lost is not a stem cell of interest, even that loss is unimportant, since only the yield of stem cells matters. Misclassifying a cancer cell as normal, by contrast, may result in a false sense of security until a treatment-resistant recurrence or metastasis occurs. It has been demonstrated in at least two studies that cells transplanted back into the patient gave rise to subsequent relapse of breast cancer (4, 7).

One goal of this study is to establish procedures for choosing an acceptable cutoff score in the face of unequal probabilities of occurrence of cell types and where the acceptable penalties for probabilities of misclassification differ across the cell types. If we are trying to reduce the number of malignant cells in a sample, one reasonable approach is to specify the proportion of cancer cells that will be permitted to remain in the sample (required purity) and then calculate whether an unacceptable proportion of normal cells have also been eliminated (required yield).

If prior knowledge about the probability of membership in a group is available, or the cost of misclassification is unequal, the threshold discriminant score can be modified in SAS to $c = (L_1 + L_2)/2 + K$, where $K$ is $\ln[p_2\,c(1|2) / (p_1\,c(2|1))]$, $p_2$ is the prior probability of membership in class 2, $p_1$ is the prior probability of membership in class 1, $c(1|2)$ is the penalty for being classified as being a member of group 1 given that the true membership is group 2, and $c(2|1)$ is the penalty for being classified as being a member of group 2 given that the true membership is group 1 (9).
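As a rough illustration of how priors and penalties move the cutoff, the sketch below computes the adjustment $K = \ln[p_2\,c(1|2) / (p_1\,c(2|1))]$ and the cutoff $(L_1 + L_2)/2 + K$. Several things here are assumptions rather than statements from the paper: the plus sign in front of $K$ (signs were lost in the transcription), the labeling of normal cells as group 1 and malignant cells as group 2, the rule that a cell is assigned to group 1 when its score exceeds the cutoff, and the function names. The 5,000 to 1,000 mixture and the 10-to-1 penalty mirror values used later in the Results.

```python
import math


def cutoff_adjustment(p1, p2, cost_1_given_2=1.0, cost_2_given_1=1.0):
    """K = ln[(p2 * c(1|2)) / (p1 * c(2|1))]."""
    return math.log((p2 * cost_1_given_2) / (p1 * cost_2_given_1))


def adjusted_cutoff(l1, l2, k):
    """c = (L1 + L2) / 2 + K (plus sign assumed)."""
    return 0.5 * (l1 + l2) + k


# Equal priors and equal penalties: no adjustment.
k_equal = cutoff_adjustment(p1=0.5, p2=0.5)

# Priors from a 5,000 normal : 1,000 malignant mixture, equal penalties.
k_prior = cutoff_adjustment(p1=5000 / 6000, p2=1000 / 6000)

# Same priors, but calling a malignant cell (group 2) normal (group 1)
# is penalized 10 times more heavily than the reverse error.
k_cost = cutoff_adjustment(p1=5000 / 6000, p2=1000 / 6000,
                           cost_1_given_2=10.0)
```

Under these assumptions, priors alone lower the cutoff (more cells are called normal), while the 10-to-1 penalty raises it again so that fewer malignant cells slip through as normal.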
Note that when the probability of membership is equal ($p_1 = p_2$) and the cost of misclassification is the same across groups ($c(1|2) = c(2|1)$), then $K = 0$. Letting $D^2 = (m_1 - m_2)^T C^{-1}(m_1 - m_2)$, where $D^2$ is called the Mahalanobis $D^2$, it can be shown that the probability of misclassifying a vector of observations from a cell which actually is in group 1 as being from group 2 is $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$, where $\Phi$ is the standard Gaussian cumulative distribution (i.e., $\Phi(z_0) = \Pr(z \le z_0)$, where $z$ is from a standard Gaussian distribution with a mean of 0 and standard deviation of 1) (1, 2, 9). Correspondingly, the probability of misclassifying a vector of observations from a cell in group 2 as being from group 1 is $1 - \Phi([c + \tfrac{1}{2}D^2]/\sqrt{D^2})$.

As illustrated in Figure 1, the FCM data used in this study clearly do not appear to conform to multivariate Gaussian distributions. Also, the assumption of equal covariance matrices for the data from each cell type may not generally be the case for FCM data. These factors may cause a departure of the true probabilities of correct classification from the theoretically calculated ones. DFA is a robust technique whose use remains valid despite violations of many of these theoretical assumptions (1, 8, 9, 10, 11). Correct classification is the ultimate test and determines how violations of the assumptions influence the accuracy of the analysis. Also, Monte Carlo techniques were used to evaluate the accuracy of classification.

Logistic Regression and ROC Analysis

Logistic Regression is an alternative method of classification when the multivariate Gaussian distribution model is not justified (1). It is based on the concept of rewriting probabilities in terms of odds, where odds $= P_z/(1 - P_z)$. If $P = 1/(1 + \exp(-L + C))$, then $\log(P_z/(1 - P_z)) = -C + L = -C + a_1 x_1 + a_2 x_2 + a_3 x_3 + \ldots$
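The link between the logistic function and the log-odds can be checked numerically. The sketch below assumes the logistic function has the form $P = 1/(1 + \exp(-L + C))$ (the signs are an assumption, since they were lost in the transcription) and uses arbitrary illustrative values for the score and the constant; the function names are hypothetical.

```python
import math


def posterior_probability(score, c):
    """Logistic function P = 1 / (1 + exp(-L + C))."""
    return 1.0 / (1.0 + math.exp(-score + c))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Arbitrary illustrative values, not taken from the paper.
score, constant = 2.5, 1.0
p = posterior_probability(score, constant)
recovered = log_odds(p)  # should equal L - C up to rounding

# A score equal to the constant corresponds to even odds.
p_even = posterior_probability(1.0, 1.0)
```

Taking the log-odds of the logistic probability returns the linear score shifted by the constant, which is why logistic regression can be fit as a linear model on the log-odds scale.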

Logistic regression offers a somewhat different paradigm than DFA in solving the FCM cell classification problem. While DFA uses conditional probabilities of misclassification based on classical hypothesis testing theory and multivariate normality assumptions, logistic regression presumes a model relating $\Pr_i$, the probability that a cell with values $X_1 = x_1, \ldots, X_n = x_n$ of measurements $X_1, \ldots, X_n$ is in a population $i$, $i = 1, \ldots, k$, of the form

$\Pr_i = \dfrac{\exp\left(C_i + \sum_{j=1}^{n} a_{ij} x_j\right)}{\sum_{l=1}^{k} \exp\left(C_l + \sum_{j=1}^{n} a_{lj} x_j\right)}$

where $(a_{i1}, \ldots, a_{in})$ are unknown (logistic regression) coefficients characterizing population $i$. The $(a_{i1}, \ldots, a_{in})$ are often estimated using training sets of cells from each of these known populations. Such models will only be useful if the variables $X_1, \ldots, X_n$ are suitably chosen. This change in notation reflects the applicability of logistic regression referred to earlier. From estimates $(\hat{a}_{i1}, \ldots, \hat{a}_{in})$, $i = 1, \ldots, k$, obtained from training sets, we can then make measurements $x_1, \ldots, x_n$ in any given cell to obtain estimates $\hat{\Pr}_1, \ldots, \hat{\Pr}_k$ of the true values of $\Pr_1, \ldots, \Pr_k$ for this cell. Taking account of the estimates of the relative frequency of the various populations, one can then reasonably use the values $\hat{\Pr}_1, \ldots, \hat{\Pr}_k$ to classify the given cell in line with the goals of purging undesirable cells (e.g., tumor cells) while keeping as many desired cells (e.g., stem cells) as possible. Because Logistic Regression analysis uses the method of maximum likelihood in its calculations, solutions to most practical problems require substantial computational resources and use of statistical software packages such as SAS (1, 12).

TABLE 1. Descriptive Statistics From Monte Carlo Simulation of DFA Misclassification Rates*

Outcome | Mean | Std Dev | Minimum | Maximum
Mahalanobis D2
Equal probabilities
  Overall misclassification rate
  Malignant as normal misclassification rate
  Normal as malignant misclassification rate
Prior probabilities
  Overall misclassification rate
  Malignant as normal misclassification rate
  Normal as malignant misclassification rate
Unequal cost probabilities
  Overall misclassification rate
  Malignant as normal misclassification rate
  Normal as malignant misclassification rate
Logistic regression misclassification rate

*Descriptive statistics obtained from a Monte Carlo simulation of 50 replicate samples of the Discriminant Function Analysis misclassification rates using Equal Probabilities of Normal and Malignant cell frequency, Relative Frequency of Each Cell Type Occurrence (Prior Probabilities), and a 10-to-1 Penalty for Misclassification of a Malignant Cell as Normal (Unequal Cost Probabilities). While the Prior Probabilities Model produced the lowest overall misclassification rate, this model produced an extremely high misclassification rate of Malignant Cells classified as Normal. This situation was remedied by including an arbitrary cost of misclassification, which produced an overall misclassification rate nearly equivalent to that of the Prior Probability Model but greatly reduced the rate of misclassification of Malignant Cells as Normal. In addition, the descriptive statistics for the Mahalanobis D2 score and the Logistic Regression misclassification rates for these Monte Carlo simulations are shown.

ROC analysis is a graphic representation of the effects of applying different cutoff criteria in a decision situation where there are two classifications, and ROC curves can be obtained as part of LR analysis in most statistical packages. It uses a graph of sensitivity ("true positive rate") vs. (1-specificity) ("false positive rate") (1, 12). A common utilization of ROC curves in biomedical research has been in the evaluation of diagnostic tests where an individual is classified as either positive or negative for some disease condition (3). The curve is constructed by graphing the true positive rate (sensitivity) on the y-axis as a function of the false positive rate (1-specificity) on the x-axis. These sensitivities and false positive rates are calculated using all possible values of a test as the cutoff point between the two outcomes. A perfect classification test is one that would have a sensitivity of 1 for all possible false-positive values. This would be a line that rises from coordinates (0,0) to (0,1) and then goes to (1,1). The dashed diagonal line is a representation of a useless test (random classification) where the true positive and false positive rates rise at identical rates (see Fig. 7). The area between the ROC curve and the diagonal line is typically used as an indicator of the value of a test statistic. A perfect classification test has an area under the ROC curve of 1.0. The utility of the ROC approach is that it evaluates a test statistic's ability to discriminate between two populations and helps select the cutoff level that gives the best tradeoff between a high true positive rate and a low false positive rate.
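The construction of an ROC curve described above (every observed value used in turn as the cutoff) can be sketched as follows. This is a minimal illustration, not the SAS output used in the paper. It assumes that higher scores indicate the positive class, uses a trapezoid rule for the area, and runs on tiny made-up score lists; the function names are hypothetical.

```python
import numpy as np


def roc_curve_points(scores, labels):
    """Return (fpr, tpr) using every distinct score as a cutoff.

    labels: 1 = positive, 0 = negative. Higher score = more positive.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]  # strictest cutoff: nothing called positive
    for cutoff in np.sort(np.unique(scores))[::-1]:
        called_positive = scores >= cutoff
        tpr.append((called_positive & (labels == 1)).sum() / n_pos)
        fpr.append((called_positive & (labels == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def area_under_curve(fpr, tpr):
    """Trapezoid-rule area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Made-up scores: perfectly separated classes.
perfect_auc = area_under_curve(
    *roc_curve_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))

# Made-up scores: identical for every cell, i.e. an uninformative test.
chance_auc = area_under_curve(
    *roc_curve_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The perfectly separated example traces the path (0,0) to (0,1) to (1,1), while the uninformative example falls on the diagonal.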
RESULTS

The primary purpose of this paper was to demonstrate that statistical classification techniques may offer an objective approach to developing classification decision rules for FCM experiments. To do this it was necessary to apply these techniques to FCM data that represented the values typically encountered but having defined statistical properties. The first condition was necessary to determine whether these statistical techniques were applicable to actual FCM experiments; the second was necessary in order to evaluate the techniques under circumstances where the actual outcomes were known. Figure 1 illustrates the multiparameter histograms from two distinct FCM experiments using different cell types (bone marrow and MCF-7) that formed the basis of the analysis presented in this paper. Depending on the specific aspect of DFA or LR that was being tested, varying numbers of entries in each of these separate files were selected at random and used to construct a data set that was analyzed using these statistical techniques.

TABLE 2. Acceptable Probabilities of Misclassification of Each Cell Type Can Be User-Specified*

C (an experimenter-supplied number) | Probability of misclassifying x from population 1, $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$ | Probability of misclassifying x from population 2, $1 - \Phi([c + \tfrac{1}{2}D^2]/\sqrt{D^2})$

*If the desired relative probabilities of misclassification of each cell type can be specified by the user, the value of the parameter C can be modified to adjust the classification boundaries used in the DFA. For example, if a user specifies that the desired probability of classifying a malignant cell as normal is and the desired probability of classifying a normal cell as malignant is , then the value of C = 1 inserted into the DFA algorithm will produce the desired classification boundary.

FIG. 3. Discriminant Function Analysis applied to separating an equal number of bone marrow and breast cancer cells. The effects of using three different cutoff scores (that alter the probabilities of misclassification) are shown; see Table 2.

Table 1 illustrates the stability of the DFA and LR classification schemes with regard to successfully classifying cells. Using the two fluorescent parameters and Monte Carlo methods, 50 replicate data sets were created and analyzed from a random selection of 5,000 cells from the normal bone marrow data file and 1,000 cells from the MCF-7 data file.
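A Monte Carlo check of this kind can be sketched as below. It is only an outline under stated assumptions, not the SAS procedure used in the paper. The data pools are synthetic Gaussian clouds (only the pool sizes 6,689 and 7,796 and the draw sizes 5,000 and 1,000 echo the text), the cells within a replicate are drawn without replacement with a fresh draw for each replicate, the cutoff is the equal-probabilities cutoff of zero, and the error rate is the resubstitution rate on the drawn cells. All function names are hypothetical.

```python
import numpy as np


def fit_linear_discriminant(x1, x2):
    """Return (weights, offset) such that L(x) = x @ weights - offset."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    weights = np.linalg.solve(pooled, m1 - m2)
    return weights, 0.5 * (m1 + m2) @ weights


def monte_carlo_error_rates(pool1, pool2, n1, n2, replicates, seed):
    """Summarize overall misclassification rates over random replicates."""
    rng = np.random.default_rng(seed)  # fixed seed gives repeatable draws
    rates = []
    for _ in range(replicates):
        s1 = pool1[rng.choice(len(pool1), size=n1, replace=False)]
        s2 = pool2[rng.choice(len(pool2), size=n2, replace=False)]
        weights, offset = fit_linear_discriminant(s1, s2)
        wrong1 = np.sum(s1 @ weights - offset <= 0)  # group 1 called group 2
        wrong2 = np.sum(s2 @ weights - offset > 0)   # group 2 called group 1
        rates.append((wrong1 + wrong2) / (n1 + n2))
    rates = np.array(rates)
    return {"mean": float(rates.mean()), "std": float(rates.std(ddof=1)),
            "min": float(rates.min()), "max": float(rates.max())}


# Synthetic stand-ins for the two data files; values are NOT from the paper.
data_rng = np.random.default_rng(42)
pool_normal = data_rng.normal(loc=[0.0, 0.0], scale=1.0, size=(6689, 2))
pool_tumor = data_rng.normal(loc=[3.0, 3.0], scale=1.0, size=(7796, 2))

summary = monte_carlo_error_rates(pool_normal, pool_tumor,
                                  n1=5000, n2=1000, replicates=50, seed=1)
```

The returned mean, standard deviation, minimum, and maximum correspond to the kind of columns reported in Table 1. Fixing the seed reproduces the same selections, in the spirit of the fixed-seed procedure described in Materials and Methods.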
Table 1 shows the results under three different scenarios: (1) rates of misclassification under the assumption that membership in each of the two cell type categories is equally likely; (2) rates of misclassification when it is assumed that the probability of membership in each category is proportional to the known relative frequency of each cell type in the data set; and (3) rates of misclassification when it is assumed that the probability of membership in each category is based on a perceived, but arbitrary, differential value for misclassification. As an example of scenario 3, in a clinical bone marrow transplant setting, an investigator may be willing to discard 10, 100, or 1,000 normal cells to prevent one malignant cell from being reintroduced into the patient.

Table 1 presents results relevant to FCM experiments on mixtures containing an infrequently encountered cell type (rare malignant cells in a mixture of many normal cells), where failure to correctly identify the infrequent cell may have major consequences. Under the assumption of equal probabilities of occurrence, the overall error rate is highest, but the rate of misclassification of each cell type (normal as malignant and malignant as normal) is approximately equal. When a classification rule based on relative frequency in the data is used, the overall error rate is lowest but, because it was assumed that the vast majority of cells were normal, the misclassification rate of malignant cells as normal was quite high (mean 91%). When some form of differential value for misclassification was introduced, in this case a value of 10 to 1, the overall error rate was similar to that encountered when the relative frequency rule was employed. However, the rate of misclassification of malignant cells was markedly reduced (mean 3%). Other values for the relative value of misclassification obviously could have been used, but the ones employed clearly demonstrate the concept.
An alternative to using the intuitively appealing, but arbitrary and hard-to-evaluate, costs of misclassification approach is to specify acceptable probabilities of misclassification for each cell type and then modify the classification boundaries based on these values. Table 2 indicates how the theoretical probabilities of misclassification change as a function of $c$, an experimenter-chosen level, in the expression $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$.
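The dependence of the theoretical misclassification probabilities on the cutoff can be tabulated with a short script. The sketch assumes the two expressions read $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$ for population 1 and $1 - \Phi([c + \tfrac{1}{2}D^2]/\sqrt{D^2})$ for population 2 (the signs are an assumption, since they were lost in the transcription). The value of $D^2$ used is purely illustrative and is not taken from the paper, and the function names are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """Phi(z) = Pr(Z <= z) for a standard Gaussian variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def theoretical_misclassification(c, d_squared):
    """Return (P[population 1 misclassified], P[population 2 misclassified])."""
    d = math.sqrt(d_squared)
    p_pop1 = standard_normal_cdf((c - 0.5 * d_squared) / d)
    p_pop2 = 1.0 - standard_normal_cdf((c + 0.5 * d_squared) / d)
    return p_pop1, p_pop2


# Illustrative Mahalanobis D^2; NOT a value reported in the paper.
D_SQUARED = 4.0
table = {c: theoretical_misclassification(c, D_SQUARED)
         for c in (-1.0, 0.0, 1.0)}

for c, (p1, p2) in table.items():
    print(f"c = {c:+.1f}: population 1 {p1:.4f}, population 2 {p2:.4f}")
```

Raising the cutoff increases the chance of misclassifying a population 1 cell and lowers the chance of misclassifying a population 2 cell. At a cutoff of zero the two error rates are equal under these assumptions.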

For example, if $c = 1$ is used, then there is a probability of 0.03 of misclassifying a cell from population 1 and a probability of of misclassifying a cell from population 2.

For the data sets utilized, neither the multivariate Gaussian assumption nor the assumption of equal covariance matrices appeared to be valid. The violations of these assumptions, however, do not appear to reduce the utility of DFA to empirically classify cells correctly. However, these violations may cause a departure of the actual probabilities from the theoretically calculated ones. This was shown by use of the Monte Carlo methods used to generate the entries in Table 1.

FIG. 4. Evaluation of the minimal number of cells required to obtain a stable discriminant function analysis. As indicated, DFA estimates are stable even when there is a large variation in the relative proportion of each cell type.

FIG. 5. The stability of the Discriminant Function Analysis over four samples of cells randomly selected from the total cells available in this analysis. This indicates the repeatability of DFA when different random samplings of each cell type are used.

Because of the instructional value of showing two-dimensional plots and the lack of additional discriminating power when data from the non-fluorescent parameters were added to the data from the two fluorescent parameters, many results are presented (see Figs. 3-5) using data from just the two fluorescent parameters (FITC and PE). The goal here is to illustrate the results of DFA and LR when the results are intuitively obvious. Figure 2 shows the concept of discriminant function analysis. Different cutoff scores can be user-specified in tradeoffs between yield and purity of a given cell subpopulation. Figure 3 illustrates the separations obtained with all 6,689 human bone marrow cells (NBM) and 7,796 MCF-7 human breast cancer cells and different probabilities of misclassification.

To evaluate the minimum number of MCF-7 cells required to achieve acceptable discrimination levels, analyses were performed using 5,000 normal bone marrow and 10, 50, 200, and 1,000 MCF-7 cells (Fig. 4). In this case, the seed value of the random number generator was fixed so that the same cells were selected for analysis and the probabilities of misclassification were varied. To evaluate the stability of the discriminant function over random selection of cells, repeated random samples were drawn. Typical results are shown in Figure 5. For the results in each panel, the same cells were selected so that different probabilities of misclassification could be used.

Using data from either the two fluorescent or all four (two fluorescent and two scatter) parameters, the classification results of a Logistic Regression often produced complete separation (100% correct classification). Hence, evaluation of statistical estimates is not meaningful under these circumstances.

TABLE 3. Logistic Regression Classification of the Data*

Variable | Parameter estimate | Standard error | Wald confidence limits (Lower, Upper) | P
Intercept

 | Intercept only | Intercept and covariates | Chi-square
-2 Log likelihood | | | (2 df) (P 0.001)

Association of predicted probabilities and observed responses
Concordant 78.8%
Discordant 20.8%
Tied 0.3%
Area under ROC curve

*Results of Logistic Regression Analysis performed on a file of a randomly selected sample of 500 Normal Bone Marrow and 100 MCF-7 used to produce the ROC curve shown in Figure 7. As indicated in the text, only a relatively small number of cells were used to illustrate the concept.

FIG. 6. The DFA obtained from a sample of 100 bone marrow cells (round points) and 10 MCF-7 cells (square points) using the nonfluorescent parameters available in the data files used in this paper.
To illustrate interpretation of the results of an LR analysis where complete separation did not occur, the results of DFA and its corresponding LR are shown for the case in which just the non-fluorescent parameters were used. The goal here was to evaluate DFA and LR when the classification was no longer intuitively obvious. Figure 6 illustrates the DFA results, based on the light scattering parameters only, when 100 MCF-7 and 500 NBM cells were randomly selected. The corresponding Logistic Regression and ROC analysis produced Table 3 and Figure 7. Figure 7 shows the results of a ROC analysis of the data based on the two non-fluorescent parameters, which should have some, but not very good, classification power, since the forward and side light scattering of the tumor cells partially overlap, as shown in Figure 1. While the DFA and Logistic Regression did not produce as useful a separation tool as when the fluorescent parameters were included, the corresponding ROC analysis indicated that these techniques still produced credible classification capabilities.

DISCUSSION

Development of scientifically sound FCM sort boundary classification rules based on multivariate statistical techniques that are easy to implement could greatly enhance the value of many FCM applications. This paper proposes the use of mathematically rigorous statistical classification techniques that take advantage of the multiparameter signals obtained from current FCM devices. One technique, DFA, also provides an ability to adjust sort decisions based on: (1) the prior probability of encountering only a small number of a particular cell type, (2) the potential cost penalty associated with the misclassification of a cell, and (3) the need to specify particular probabilities of misclassification. Of some concern is that the FCM data did not conform to the DFA assumption of a multivariate Gaussian distribution.
However, based on our results that used correct classification as the objective outcome, the excellent empirical performance of DFA suggests that it should be considered as a technique to assist in the development of rules for determining FCM sort boundaries. The Logistic Regression (LR) techniques we evaluated also produced excellent empirical results. In addition, ROC analysis assisted in deciding cutoff points for LR-based data analysis boundaries. While LR does not require the same Gaussian assumptions about the underlying data structure, the inability of existing statistical packages to

adjust data analysis decisions based on the need to meet specific criteria, such as the relative cost of misclassification, implies that LR may not currently be as flexible as DFA in developing data analysis decisions.

FIG. 7. The use of ROC analysis to evaluate the effectiveness of parameters used in the classification of cell subpopulations. The ROC analysis obtained from this data indicates that while the discriminating power is greatly reduced when using non-fluorescent parameters as compared to using fluorescent parameters, the LR analysis still greatly improves the selection process when compared to not using objective criteria.

This paper demonstrates that, under very specific conditions, statistical classification systems can assist in determining data analysis boundaries in FCM experiments. The real challenge is in how to advance these findings into the world of real-time cell sorting, where the true identity of any given cell is unknown and the sort decision must be made within microseconds of encountering the FCM sensors. This will be discussed in a future paper.

Beyond the quantitative issues addressed in this paper, there are issues of biologic, instrumental, systemic, and as yet unknown variability that determine the utility of FCM in resolving cell classification problems. For example, within each major cell type, subpopulations may exist that can confound the relatively straightforward classifications proposed in this paper. DFA, as implemented in most current statistical packages, can only classify into two groups even though each group can contain more than one cell subpopulation. The ultimate goal of our efforts is directed at sorting a population of cells that are involved in the autologous bone marrow procedure. The mixture of cells to be processed contains tumor cells, normal but not stem-cell bone marrow cells, and stem-cell bone marrow.
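How a two-group discriminant cutoff shifts with prior probability and misclassification cost, and what that does to the yield and purity of the cells kept, can be sketched as follows. This is an illustration under stated assumptions rather than the analysis performed in this paper: the two-parameter data are synthetic Gaussian draws with a fixed seed, the cost values are arbitrary, and all names (`fit_discriminant`, `cutoff`, `evaluate`) are hypothetical. The rule is the standard equal-covariance linear discriminant, in which a cell is called tumor when its estimated log likelihood ratio exceeds the log of (cost of losing a normal cell × prior of normal) / (cost of missing a tumor cell × prior of tumor).

```python
import math
import random


def mean2(rows):
    """Mean of a list of (x, y) points."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def fit_discriminant(normal, tumor):
    """Return (w, mid) so that w.x - mid estimates the tumor-vs-normal log likelihood ratio."""
    mn, mt = mean2(normal), mean2(tumor)
    sxx = sxy = syy = 0.0
    for rows, m in ((normal, mn), (tumor, mt)):
        for x, y in rows:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(normal) + len(tumor) - 2          # pooled within-group covariance
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    dx, dy = mt[0] - mn[0], mt[1] - mn[1]
    # w = inverse(pooled covariance) * (tumor mean - normal mean), 2x2 case by hand
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    mid = w[0] * (mt[0] + mn[0]) / 2 + w[1] * (mt[1] + mn[1]) / 2
    return w, mid


def cutoff(prior_tumor, cost_miss_tumor, cost_lose_normal):
    """Score above which a cell is called tumor, given prior and misclassification costs."""
    return math.log((cost_lose_normal * (1 - prior_tumor)) / (cost_miss_tumor * prior_tumor))


def evaluate(normal, tumor, w, mid, cut):
    """Yield and purity of the kept (called-normal) cells, and tumor cells kept."""
    def score(p):
        return w[0] * p[0] + w[1] * p[1] - mid
    kept_normal = sum(score(p) <= cut for p in normal)
    kept_tumor = sum(score(p) <= cut for p in tumor)
    kept = kept_normal + kept_tumor
    purity = kept_normal / kept if kept else float("nan")
    return kept_normal / len(normal), purity, kept_tumor


# Assumption: synthetic two-parameter data; 5,000 normal and 50 tumor cells, fixed seed.
rng = random.Random(1)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
tumor = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(50)]

w, mid = fit_discriminant(normal, tumor)
prior_tumor = len(tumor) / (len(normal) + len(tumor))

results = []
for cost_miss in (1.0, 100.0, 10000.0):         # arbitrary illustrative cost ratios
    cut = cutoff(prior_tumor, cost_miss, 1.0)
    yld, purity, kept_tumor = evaluate(normal, tumor, w, mid, cut)
    results.append((cut, yld, purity, kept_tumor))
    print(f"cost={cost_miss:>8.0f} cutoff={cut:6.2f} yield={yld:.3f} "
          f"purity={purity:.4f} tumor cells kept={kept_tumor}")
```

Raising the penalty for a missed tumor cell lowers the cutoff, so fewer tumor cells are kept, at the price of discarding more normal cells. Here "yield" and "purity" are defined for the kept fraction only, which is one of several reasonable conventions.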
The purpose is to collect the maximum fraction of stem cells while minimizing the re-infusion of the contaminating tumor cells or the nuisance non-stem cell bone marrow cells. Without the satisfactory purging of tumor cells from the re-infused bone marrow cells, the possibility of a recurrence of the primary tumor is greatly increased. While numerous technical challenges remain, this paper assists in achieving this goal by demonstrating that there are statistically rigorous techniques available to assist in making sort decisions. As the instrumentation improves, further enhancements in the use of these statistical techniques will be employed to refine the development of sort boundary classification techniques. In addition to the methods described here, new developments should be explored. Cluster analysis and regression tree analysis techniques obviously merit further exploration. On the horizon are techniques designed to extract the maximum amounts of information from the data contained in FCM data sets. These so-called Knowledge Discovery in Data or Data Mining techniques may extend our efforts beyond statistical classification into

pattern recognition, artificial intelligence, and neural networks. The bottom-line message is that FCM investigators should explore the spectrum of quantitative techniques that are becoming available to them for developing more objective rules for the selection of sort boundaries.

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis DSC 4/5 Multivariate Statistical Methods Applications DSC 4/5 Multivariate Statistical Methods Discriminant Analysis Identify the group to which an object or case (e.g. person, firm, product) belongs:

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Detection Theory: Sensitivity and Response Bias

Detection Theory: Sensitivity and Response Bias Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System

More information

Score Tests of Normality in Bivariate Probit Models

Score Tests of Normality in Bivariate Probit Models Score Tests of Normality in Bivariate Probit Models Anthony Murphy Nuffield College, Oxford OX1 1NF, UK Abstract: A relatively simple and convenient score test of normality in the bivariate probit model

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Discriminant Analysis with Categorical Data

Discriminant Analysis with Categorical Data - AW)a Discriminant Analysis with Categorical Data John E. Overall and J. Arthur Woodward The University of Texas Medical Branch, Galveston A method for studying relationships among groups in terms of

More information

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Abstract One method for analyzing pediatric B cell leukemia is to categorize

More information

PubH 7405: REGRESSION ANALYSIS. Propensity Score

PubH 7405: REGRESSION ANALYSIS. Propensity Score PubH 7405: REGRESSION ANALYSIS Propensity Score INTRODUCTION: There is a growing interest in using observational (or nonrandomized) studies to estimate the effects of treatments on outcomes. In observational

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018 Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

Pitfalls in Linear Regression Analysis

Pitfalls in Linear Regression Analysis Pitfalls in Linear Regression Analysis Due to the widespread availability of spreadsheet and statistical software for disposal, many of us do not really have a good understanding of how to use regression

More information

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh Comparing the evidence for categorical versus dimensional representations of psychiatric disorders in the presence of noisy observations: a Monte Carlo study of the Bayesian Information Criterion and Akaike

More information

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA PART 1: Introduction to Factorial ANOVA ingle factor or One - Way Analysis of Variance can be used to test the null hypothesis that k or more treatment or group

More information

Small Group Presentations

Small Group Presentations Admin Assignment 1 due next Tuesday at 3pm in the Psychology course centre. Matrix Quiz during the first hour of next lecture. Assignment 2 due 13 May at 10am. I will upload and distribute these at the

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method Biost 590: Statistical Consulting Statistical Classification of Scientific Studies; Approach to Consulting Lecture Outline Statistical Classification of Scientific Studies Statistical Tasks Approach to

More information

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1 From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Contents Dedication... iii Acknowledgments... xi About This Book... xiii About the Author... xvii Chapter 1: Introduction...

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training. Supplementary Figure 1 Behavioral training. a, Mazes used for behavioral training. Asterisks indicate reward location. Only some example mazes are shown (for example, right choice and not left choice maze

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Selection and Combination of Markers for Prediction

Selection and Combination of Markers for Prediction Selection and Combination of Markers for Prediction NACC Data and Methods Meeting September, 2010 Baojiang Chen, PhD Sarah Monsell, MS Xiao-Hua Andrew Zhou, PhD Overview 1. Research motivation 2. Describe

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

Detection Theory: Sensory and Decision Processes

Detection Theory: Sensory and Decision Processes Detection Theory: Sensory and Decision Processes Lewis O. Harvey, Jr. Department of Psychology and Neuroscience University of Colorado Boulder The Brain (observed) Stimuli (observed) Responses (observed)

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

RAG Rating Indicator Values

RAG Rating Indicator Values Technical Guide RAG Rating Indicator Values Introduction This document sets out Public Health England s standard approach to the use of RAG ratings for indicator values in relation to comparator or benchmark

More information

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY Lingqi Tang 1, Thomas R. Belin 2, and Juwon Song 2 1 Center for Health Services Research,

More information

For general queries, contact

For general queries, contact Much of the work in Bayesian econometrics has focused on showing the value of Bayesian methods for parametric models (see, for example, Geweke (2005), Koop (2003), Li and Tobias (2011), and Rossi, Allenby,

More information

Multiple Bivariate Gaussian Plotting and Checking

Multiple Bivariate Gaussian Plotting and Checking Multiple Bivariate Gaussian Plotting and Checking Jared L. Deutsch and Clayton V. Deutsch The geostatistical modeling of continuous variables relies heavily on the multivariate Gaussian distribution. It

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

CHAPTER ONE CORRELATION

CHAPTER ONE CORRELATION CHAPTER ONE CORRELATION 1.0 Introduction The first chapter focuses on the nature of statistical data of correlation. The aim of the series of exercises is to ensure the students are able to use SPSS to

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections New: Bias-variance decomposition, biasvariance tradeoff, overfitting, regularization, and feature selection Yi

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

ROC Curves. I wrote, from SAS, the relevant data to a plain text file which I imported to SPSS. The ROC analysis was conducted this way:

ROC Curves. I wrote, from SAS, the relevant data to a plain text file which I imported to SPSS. The ROC analysis was conducted this way: ROC Curves We developed a method to make diagnoses of anxiety using criteria provided by Phillip. Would it also be possible to make such diagnoses based on a much more simple scheme, a simple cutoff point

More information

MODEL SELECTION STRATEGIES. Tony Panzarella

MODEL SELECTION STRATEGIES. Tony Panzarella MODEL SELECTION STRATEGIES Tony Panzarella Lab Course March 20, 2014 2 Preamble Although focus will be on time-to-event data the same principles apply to other outcome data Lab Course March 20, 2014 3

More information

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug? MMI 409 Spring 2009 Final Examination Gordon Bleil Table of Contents Research Scenario and General Assumptions Questions for Dataset (Questions are hyperlinked to detailed answers) 1. Is there a difference

More information

BIOSTATISTICAL METHODS

BIOSTATISTICAL METHODS BIOSTATISTICAL METHODS FOR TRANSLATIONAL & CLINICAL RESEARCH PROPENSITY SCORE Confounding Definition: A situation in which the effect or association between an exposure (a predictor or risk factor) and

More information

Biostatistics II

Biostatistics II Biostatistics II 514-5509 Course Description: Modern multivariable statistical analysis based on the concept of generalized linear models. Includes linear, logistic, and Poisson regression, survival analysis,

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - - Electrophysiological Measurements Psychophysical Measurements Three Approaches to Researching Audition physiology

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

Data Analysis Using Regression and Multilevel/Hierarchical Models

Data Analysis Using Regression and Multilevel/Hierarchical Models Data Analysis Using Regression and Multilevel/Hierarchical Models ANDREW GELMAN Columbia University JENNIFER HILL Columbia University CAMBRIDGE UNIVERSITY PRESS Contents List of examples V a 9 e xv " Preface

More information

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA Elizabeth Martin Fischer, University of North Carolina Introduction Researchers and social scientists frequently confront

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - Electrophysiological Measurements - Psychophysical Measurements 1 Three Approaches to Researching Audition physiology

More information

WELCOME! Lecture 11 Thommy Perlinger

WELCOME! Lecture 11 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger Regression based on violated assumptions If any of the assumptions are violated, potential inaccuracies may be present in the estimated regression

More information

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016 Tutorial 3: Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016 Step 1: Research design Adequacy of sample size Choice of dependent variables Choice of independent variables (treatment effects)

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Learning with Rare Cases and Small Disjuncts

Learning with Rare Cases and Small Disjuncts Appears in Proceedings of the 12 th International Conference on Machine Learning, Morgan Kaufmann, 1995, 558-565. Learning with Rare Cases and Small Disjuncts Gary M. Weiss Rutgers University/AT&T Bell

More information

Prediction of Malignant and Benign Tumor using Machine Learning

Prediction of Malignant and Benign Tumor using Machine Learning Prediction of Malignant and Benign Tumor using Machine Learning Ashish Shah Department of Computer Science and Engineering Manipal Institute of Technology, Manipal University, Manipal, Karnataka, India

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Quasicomplete Separation in Logistic Regression: A Medical Example
