© 1999 Wiley-Liss, Inc. Cytometry 36:60-70 (1999)


Some Theoretical and Practical Considerations for Multivariate Statistical Cell Classification Useful in Autologous Stem Cell Transplantation and Tumor Cell Purging

James A. Hokanson,1 Judah I. Rosenblatt,1 and James F. Leary2*

1 Department of Preventive Medicine and Community Health, Division of Infectious Diseases, University of Texas Medical Branch, Galveston, Texas
2 Department of Internal Medicine, Division of Infectious Diseases, University of Texas Medical Branch, Galveston, Texas

Received 2 January 1998; Revision Received 26 November 1998; Accepted 19 January 1999

Background: As flow cytometric data become more complex, it becomes increasingly difficult to classify cells using conventional flow cytometry data techniques based on visual classification of the data by user-drawn regions. This paper shows some simple applications of multivariate statistical classification to classify flow cytometric data.

Methods: Discriminant Function Analysis (DFA) and Logistic Regression (LR) analysis techniques were evaluated with respect to their potential utility in the problem of detecting human breast cancer cells within normal bone marrow cells. Data sets having defined properties were employed to evaluate the potential utility of these statistical classification techniques, whose performance was measured by ROC analysis.

Results: Two extreme but reasonable situations are presented: (1) data where the separation of cells was obvious by visual inspection and (2) data where major overlaps in the values of the individual FCM parameters made intuitive classification improbable. Both DFA and LR analysis were able to classify the cells of each type with acceptable accuracy and yield.
Conclusions: The excellent empirical performance of both DFA and LR techniques suggests that they offer promising approaches for classifying multiparameter FCM data using objective rules that may represent an improvement over commonly employed ad hoc approaches. Cytometry 36:60-70, Wiley-Liss, Inc.

Key terms: flow cytometry; cell sorting; multivariate statistics; discriminant functions; logistic regression; ROC analysis; misclassification penalty

Flow cytometry (FCM) may be thought of as a way to analyze and classify individual cells into one of a number of discrete cell types. Applying various criteria, FCM uses multiparameter signal characteristics detected from each cell to arrive at a classification decision for that cell. As yet, objective analysis techniques for using the signal characteristics to classify cells have not been routinely employed to develop decision rules. In today's multiparameter FCM environment, investigators typically resort to intuition-based or even arbitrarily drawn boundaries to classify cells. This can cause fundamental problems where the multiparameter data space cannot be easily visualized and conventional flow cytometry data analysis techniques offer few alternative approaches. This paper proposes the use of several standard multivariate statistical techniques, Discriminant Function Analysis (DFA) and Logistic Regression (LR) augmented by Receiver Operating Characteristic (ROC) analysis, to determine classification boundaries having uncertainties that can be statistically described. The ultimate goal in this particular application, cited as an example, is to be able to apply multivariate statistical classification procedures in real time to the problem of detecting and eliminating ("purging") contaminating human cancer cells in samples of normal bone marrow. This is a major problem in the autologous transplantation of mobilized peripheral stem cells in cancer patients undergoing high-dose chemotherapy.
Gene marking methods (4, 5) have recently shown that contaminating tumor cells re-infused into a patient during autologous transplantation can indeed lead to tumor recurrence (7). Because the primary purpose of this paper is tutorial, the analysis methods were applied to simple data sets

Grant sponsor: NIH, U.S. Public Health Service; Grant numbers: GM38645 and CA
*Correspondence to: James F. Leary, Ph.D., Molecular Cytometry Unit, Route 0835, University of Texas Medical Branch, Galveston, TX. E-mail: james.leary@utmb.edu

FIG. 1. Flow cytometric data used in this paper. Left group: data from normal bone marrow cells. Right group: data from MCF-7 human breast cancer cells.

representative of both obvious and non-obvious situations. Similarly, for clarity of presentation only two-dimensional figures are displayed. However, both the proposed methods and the data are multidimensional. Given the sample sizes associated with the processing of a large number of cells typically encountered in FCM experiments, the number of FCM parameters that could be included in the analysis could theoretically extend to multiples of today's most complex multiparameter data. As demonstrated by the excellent empirical results obtained in our analysis, these statistically robust procedures could provide investigators with a more objective basis for selecting multidimensional data analysis boundaries. A future paper will demonstrate that these techniques can also be used for real-time sort decisions if they are implemented through high-speed look-up tables that permit computations at memory speeds.

MATERIALS AND METHODS

Data File Creation and Statistical Software

The data used in this report were obtained from separate multiparameter FCM measurements of two distinct cell types. The data files were constructed from samples of 6,689 normal human bone marrow cells (NBM) and 7,796 MCF-7 human breast cancer cells (MCF-7). Four flow cytometric parameters were measured in each sample (Fig. 1). These were Forward Scatter, Side Scatter, Green (FITC) fluorescence, and Orange (PE) fluorescence. The green (FITC) fluorescence measures the relative number of cell surface antigens labeled by a FITC-conjugated monoclonal antibody (monoclonal antibody clone 9187, Baxter Laboratories, McGaw Park, IL) against these antigens, which are found predominantly on breast cancer cells.
The red fluorescence was a measure of Red670 (a PE-Cy5 tandem conjugate) Streptavidin (Gibco/BRL, Grand Island, NY) labeling of biotinylated anti-CD45 antibody (Caltag, Inc., Burlingame, CA), a marker of mature mononuclear cells. Figure 1 illustrates the raw multiparameter FCM data used in this paper. Random samples of cells from each raw data file were selected using computer algorithms to produce a composite data file. This composite file contained FCM signal parameter data from an experimenter-selected number of cells of each type plus an added variable that denoted the actual cell type of the selected cell. Data in channel 1,023, the highest channel number available on the FCM equipment used, were excluded, because a signal recorded in this channel could actually represent a signal from any higher channel number.

A personal computer implementation of the statistical software package SAS (SAS Institute Inc., Cary, NC) (12) was used for all data manipulations and analyses described in this paper (in other related work we have used the statistical software package S-Plus, MathSoft, Inc., Cambridge, MA). For some calculations, the seed value used in the random number generator was fixed so that exactly the same cells could be repeatedly selected from each data file. Using this procedure, which provided repeated selections of the same cells from the total data available, the minimum ratio, or actual number, of rare cells required to obtain valid and stable discriminant function coefficients could be investigated, as could changes in the relative penalties of misclassification.

The stability of the DFA coefficients and cut-point scores was determined using a general Monte Carlo procedure. Repeated subsets with replacement, containing an experimenter-specified number of cells, were drawn at random and the corresponding discriminant functions were calculated. The means, standard deviations, and ranges of coefficients were calculated to judge the stability of the discriminant functions under different classification rules.

Discriminant Function Analysis

Discriminant Function Analysis (DFA) describes one standard statistical approach for classifying individual observations into membership in one of a number of groups (1, 2, 6, 8, 9). The basic assumptions of DFA relevant to FCM can be summarized as follows: (1) the number of groups must be known (or at least experimenter-specified), (2) there must be at least 2 observations per group, (3) the number of discriminating variables cannot be greater than N-2, where N is the number of observations, (4) discriminating variables should not be collinear, i.e., no discriminating variable should be a linear combination of any of the other discriminating variables, (5) the covariance matrices for each of the groups should be equal, (6) each group should be from a population of cells having a multivariate Gaussian distribution of the discriminating variables, (7) in each group, the observations should be a random sample from the population of interest, and (8) the number of discriminant functions is the smaller of the number of groups or the number of discriminating variables. The DFA described in this study is linear DFA, where the discriminant function is a linear combination of the discriminating variables (1, 2, 8, 9). Discriminant functions can be adapted to avoid these assumptions, but standard statistical packages generally do not do this. As illustrated in Figure 1, the FCM data used in this analysis did not appear to conform to the multivariate Gaussian assumption of DFA. However, as demonstrated further, the empirical results suggest that even in spite of this violation, DFA appears to be a useful tool in FCM analysis.
The basic strategy in DFA is to calculate a function, based on a linear combination of discriminating values obtained from observations whose membership status is known, that serves as a scoring system: the score obtained from a new observation can be used to estimate the probability of membership in a particular group for that new observation. Where the goal is to classify objects into only two groups (each group can contain more than one cell subpopulation), the discriminant function can be written in matrix notation (2) as

$L = X^{T} C^{-1} (m_1 - m_2) - \tfrac{1}{2} (m_1 + m_2)^{T} C^{-1} (m_1 - m_2)$.

In this notation, $X$ is the vector of observed discriminating variables for a particular cell, the superscript $T$ denotes matrix transposition, the superscript $-1$ denotes matrix inversion, $C$ is the common covariance matrix of all vector observations, and $m_1$, $m_2$ are the mean vectors of the discriminating variables for cell types 1 and 2, respectively. The goal is to develop a function that has an acceptable probability of classifying a new observation into correct group membership. When there are more than two cell types, DFA can still be used, although the calculations are more complex (2, 9, 13).

FIG. 2. An idealized description of how Discriminant Function Analysis (DFA) can be used to classify cells into groups. DFA maps values in multivariate Discriminant Variable Space into the values of a Discriminant Function where a user-specified value can be calculated and used to classify outcomes.

For each group (cell type), a set of expressions, $L_i = a_1 x_{i1} + a_2 x_{i2} + \cdots$, can be calculated (9). The $x_{ij}$ (the means of the discriminating variables) are obtained from the multiparameter FCM signal characteristics. The $a_j$ are coefficients calculated from the data in a manner that optimally distinguishes the groups (9). This property of optimally distinguishing the groups is particularly useful for FCM-based cell type data analysis and sorting.
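The two-group discriminant function described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the SAS procedure used in the paper: the function names are hypothetical, the data are synthetic Gaussian clusters standing in for two cell types, and the common covariance matrix is estimated by pooling the two group covariance matrices.

```python
import numpy as np

def fit_linear_discriminant(x1, x2):
    """Return (a, b) so that the score is L(x) = x @ a - b.

    x1, x2: arrays of shape (n_cells, n_parameters), one per cell type.
    a = C^{-1} (m1 - m2) and b = 1/2 (m1 + m2)^T C^{-1} (m1 - m2).
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # Pooled estimate of the common covariance matrix C (an assumption here).
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(pooled, m1 - m2)
    b = 0.5 * (m1 + m2) @ a
    return a, b

def discriminant_score(x, a, b):
    """Score one cell (1-D array) or many cells (2-D array)."""
    return x @ a - b

# Synthetic two-parameter data; NOT the FCM data from the paper.
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(500, 2))
group2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(500, 2))

a, b = fit_linear_discriminant(group1, group2)
m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
score_at_m1 = discriminant_score(m1, a, b)
score_at_m2 = discriminant_score(m2, a, b)
fraction_group1_positive = np.mean(discriminant_score(group1, a, b) > 0)
print(score_at_m1, score_at_m2, fraction_group1_positive)
```

With this form of L, the scores of the two group means are equal in magnitude and opposite in sign, so a cutoff of zero sits midway between the groups.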
The discriminant function, $L$, using linear combinations of the $L_i$'s obtained for each group, can be used to divide the observations into groups (9). For a given observation, $L$ is calculated from the discriminant coefficients ($a_j$) and the values of the discriminating variables for that particular observation. If the $L$ score exceeds some experimenter-specified level, $c$, the observation is assigned membership in one group; if below, then membership in the other group is assigned (9). The approach is similar to the logic used in the traditional one-way analysis of variance situation, where ideally the variation in the $L_i$'s within a cell type should be much less than the variation in the $L_i$'s between the cell types. The concept of discriminant function classifiers is shown in Figure 2. For ease of visualization, only the two parameters from a bivariate data space are projected into the single discriminant function. The results of different penalties of misclassification are shown by the lines c1, c2, and c3.

When the prior probabilities of membership are equal or unknown and there are only two groups, the traditional default for the threshold discriminant score (9) has been $c = (L_1 + L_2)/2$. If the prior probabilities of membership are equal or known, the posterior probability of membership of a selected cell in a group, $\Pr(L)$, can then be calculated using $\Pr(L) = 1/(1 + \exp(-L + c))$, where $1/(1 + \exp(-L + c))$ is the multivariate logistic function (1, 9). Many DFA situations assume that there is an equal probability of being in each group, and a DFA score of zero is used as the cutoff point for discrimination. For FCM experiments in which the relative cell type frequency is known, or can be estimated, assigning a cell to a group based on its expected frequency of occurrence can modify the misclassification rate by changing the cutoff score from that obtained when equal relative cell type frequencies are assumed.

For the FCM experiments described in this paper, the ultimate goal is to acceptably classify cells existing within samples of mixed groups of normal bone marrow and cancer cells such that the cancer cells could be excluded (purged) from the bone marrow mixture in a manner suitable for subsequent autologous bone marrow transplantation. In this situation, it is most likely that many more normal cells than cancer cells will be encountered. Were we to use the relative frequency as the only criterion, the prior probability (before any processing of FCM data) of being classified as a normal cell would greatly exceed the prior probability of being classified as a cancer cell, since the number of normal cells greatly exceeds the number of cancer cells in virtually all cases. In the absence of any additional information, a reasonable rule for classifying a particular cell would be based on the relative frequency of each cell type in the mixture. Hence, in this paper results are presented with the cutoff scores modified to reflect the prior probabilities of classification based on relative cell type frequency. Also, in the context of cancer cell sorting using FCM, the penalty for incorrect classification may not be equal between cell types.
The classification of a cancer cell as a normal cell may have far more serious consequences than classification of a normal cell as cancerous. Misclassifying a normal cell as a tumor cell results only in a reduced yield of normal cells, still of acceptable purity. If the normal cell lost is not a stem cell of interest, even that loss does not matter, since only the yield of stem cells is important. Misclassifying a cancer cell as normal, by contrast, may result in a false sense of security until a treatment-resistant recurrence or metastasis occurs. It has been demonstrated in at least two studies that cells transplanted back into the patient gave rise to subsequent relapse of breast cancer (4, 7). One goal of this study is to establish procedures for choosing an acceptable cutoff score in the face of unequal probabilities of occurrence of cell types and where the acceptable penalties for probabilities of misclassification differ across the cell types. If we are trying to reduce the number of malignant cells in a sample, one reasonable approach is to specify the proportion of cancer cells that will be permitted to remain in the sample (required purity) and then calculate whether an unacceptable proportion of normal cells have also been eliminated (required yield).

If prior knowledge about the probability of membership in a group is known, or the cost of misclassification is unequal, the discriminant function score can be modified in SAS to

$c = (L_1 + L_2)/2 + K$, where $K$ is $\ln\left[\dfrac{p_2\, c(1|2)}{p_1\, c(2|1)}\right]$,

and $p_2$ is the prior probability of membership in class 2, $p_1$ is the prior probability of membership in class 1, $c(1|2)$ is the penalty for being classified as being a member of group 1 given that the true membership is group 2, and $c(2|1)$ is the penalty for being classified as being a member of group 2 given that the true membership is group 1 (9).
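The effect of priors and misclassification penalties on the cutoff can be illustrated with a short Python sketch. The function names are hypothetical, and the example values are assumptions chosen for illustration: group 1 is taken to be malignant cells and group 2 normal cells, mixed 1,000 to 5,000, with a 10-to-1 penalty for calling a malignant cell normal.

```python
import math

def cutoff_shift(p1, p2, cost_1_given_2, cost_2_given_1):
    """K = ln[ p2 * c(1|2) / (p1 * c(2|1)) ], added to the default cutoff."""
    return math.log((p2 * cost_1_given_2) / (p1 * cost_2_given_1))

def logistic_probability(score, cutoff):
    """Pr(L) = 1 / (1 + exp(-L + c)) for a discriminant score L and cutoff c."""
    return 1.0 / (1.0 + math.exp(-(score - cutoff)))

# Assumed example: group 1 = malignant (1,000 cells), group 2 = normal (5,000 cells).
p1, p2 = 1000 / 6000, 5000 / 6000

k_equal = cutoff_shift(0.5, 0.5, 1.0, 1.0)   # equal priors, equal penalties
k_priors = cutoff_shift(p1, p2, 1.0, 1.0)    # priors only
k_costs = cutoff_shift(p1, p2, 1.0, 10.0)    # priors plus 10-to-1 penalty

print(k_equal, k_priors, k_costs)
print(logistic_probability(score=0.0, cutoff=k_costs))
```

In this sketch the prior-only rule raises the cutoff (fewer cells are called malignant), while the 10-to-1 penalty more than offsets that shift and lowers the cutoff below the equal-probability default.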
Note that when the probability of membership is equal ($p_1 = p_2$) and the cost of misclassification is the same across groups ($c(1|2) = c(2|1)$), then $K = 0$. Letting $D^2 = (m_1 - m_2)^{T} C^{-1} (m_1 - m_2)$, where $D^2$ is called the Mahalanobis $D^2$, it can be shown that the probability of misclassifying a vector of observations from a cell which actually is in group 1 as being from group 2 is $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$, where $\Phi$ is the standard Gaussian cumulative distribution (i.e., $\Phi(z_0) = \Pr(z \le z_0)$) and $z$ is from a standard Gaussian distribution with a mean of 0 and standard deviation of 1 (1, 2, 9). Correspondingly, the probability of misclassifying a vector of observations from a cell in group 2 as being from group 1 is $1 - \Phi([c + \tfrac{1}{2}D^2]/\sqrt{D^2})$.

As illustrated in Figure 1, the FCM data used in this study clearly do not appear to conform to multivariate Gaussian distributions. Also, the assumption of equal covariance matrices for the data from each cell type may not generally hold for FCM data. These factors may cause a departure of the true probabilities of correct classification from the theoretically calculated ones. DFA is a robust technique whose use remains valid despite violations of many of these theoretical assumptions (1, 8, 9, 10, 11). Correct classification is the ultimate test and determines how violations of the assumptions influence the accuracy of the analysis. Also, Monte Carlo techniques were used to evaluate the accuracy of classification.

Logistic Regression and ROC Analysis

Logistic Regression is an alternative method of classification when the multivariate Gaussian distribution model is not justified (1). It is based on the concept of rewriting probabilities in terms of odds, where odds $= P_z/(1 - P_z)$. If $P_z = 1/(1 + \exp(-L + C))$, then $\log(P_z/(1 - P_z)) = -C + L = -C + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots$

Logistic regression offers a somewhat different paradigm than DFA in solving the FCM cell classification problem. While DFA uses conditional probabilities of misclassification based on classical hypothesis testing theory and multivariate normality assumptions, logistic regression presumes a model relating $\Pr_i$, the probability that a cell with values $X_1 = x_1, \ldots, X_n = x_n$ of measurements $X_1, \ldots, X_n$ is in a population $i$, $i = 1, \ldots, k$, of the form

$\Pr_i = \dfrac{\exp\left(c_i + \sum_{j=1}^{n} a_{ij} x_j\right)}{\sum_{l=1}^{k} \exp\left(c_l + \sum_{j=1}^{n} a_{lj} x_j\right)}$

where $(a_{i1}, \ldots, a_{in})$ are unknown (logistic regression) coefficients characterizing population $i$. The $(a_{i1}, \ldots, a_{in})$ are often estimated using training sets of cells from each of these known populations. Such models will only be useful if the variables $X_1, \ldots, X_n$ are suitably chosen. This change in notation reflects the applicability of logistic regression referred to earlier. From estimates $(\hat{a}_{i1}, \ldots, \hat{a}_{in})$, $i = 1, \ldots, k$, obtained from training sets, we can then make measurements $x_1, \ldots, x_n$ in any given cell to obtain estimates $\hat{\Pr}_1, \ldots, \hat{\Pr}_k$ of the true values of $\Pr_1, \ldots, \Pr_k$ for this cell. Taking account of the estimates of the relative frequency of the various populations, one can then reasonably use the values $\hat{\Pr}_1, \ldots, \hat{\Pr}_k$ to classify the given cell in line with the goals of purging undesirable cells (e.g., tumor cells) while keeping as many desired cells (e.g., stem cells) as possible. Because Logistic Regression analysis uses the method of maximum likelihood in its calculations, solutions to most practical problems require substantial computational resources and use of statistical software packages such as SAS (1, 12).

ROC analysis is a graphic representation of the effects of applying different cutoff criteria in a decision situation where there are two classifications, and ROC curves can be obtained as part of LR analysis in most statistical packages. It uses a graph of sensitivity ("true positive rate") vs.
(1-specificity) ("false positive rate") (1, 12). A common utilization of ROC curves in biomedical research has been in the evaluation of diagnostic tests where an individual is classified as either positive or negative for some disease condition (3). The curve is constructed by graphing the true positive rate (sensitivity) on the y-axis as a function of the false positive rate (1-specificity) on the x-axis. These sensitivities and false positive rates are calculated using all possible values of a test as the cutoff point between the two outcomes. A perfect classification test is one that would have a sensitivity of 1 for all possible false-positive values. This would be a line that rises from coordinates (0,0) to (0,1) and then goes to (1,1). The dashed diagonal line is a representation of a useless test (random classification) where the true positive and false positive rates rise at identical rates (see Fig. 7). The area between the ROC curve and the diagonal line is typically used as an indicator of the value of a test statistic. A perfect classification test has an area under the ROC curve of 1.0. The utility of the ROC approach is that it evaluates a test statistic's ability to discriminate between two populations and helps select the cutoff level that best trades a high true positive rate against a low false positive rate.

Table 1. Descriptive Statistics From Monte Carlo Simulation of DFA Misclassification Rates*
Columns: Outcome; Mean; Std Dev; Minimum; Maximum
Rows:
Mahalanobis D^2
Equal probabilities: Overall misclassification rate; Malignant as normal misclassification rate; Normal as malignant misclassification rate
Prior probabilities: Overall misclassification rate; Malignant as normal misclassification rate; Normal as malignant misclassification rate
Unequal cost probabilities: Overall misclassification rate; Malignant as normal misclassification rate; Normal as malignant misclassification rate
Logistic regression misclassification rate
*Descriptive statistics obtained from a Monte Carlo simulation of 50 replicate samples of the Discriminant Function Analysis misclassification rates using Equal Probabilities of Normal and Malignant cell frequency, Relative Frequency of Each Cell Type Occurrence (Prior Probabilities), and a 10-to-1 Penalty for Misclassification of a Malignant Cell as Normal. While the Prior Probabilities Model produced the lowest overall misclassification rate, this model produced an extremely high rate of Malignant Cells misclassified as Normal. This situation was remedied by including an arbitrary cost of misclassification, which produced an overall misclassification rate nearly equivalent to the Prior Probability Model but greatly reduced the rate of misclassification of Malignant Cells as Normal. In addition, the descriptive statistics for the Mahalanobis D^2 score and the Logistic Regression misclassification rates for these Monte Carlo simulations are shown.
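The construction of an ROC curve from classification scores can be sketched as follows. This is a minimal illustration with hypothetical function names and toy scores, not the SAS output discussed in the paper; every cell's score is treated as a candidate cutoff, and tied scores are not given any special handling.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Return (fpr, tpr) arrays; labels are 1 (positive) or 0 (negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sort by decreasing score: the first k cells are those called positive
    # when the cutoff is set just below the k-th highest score.
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels)
    false_pos = np.cumsum(1 - sorted_labels)
    tpr = np.concatenate(([0.0], true_pos / true_pos[-1]))
    fpr = np.concatenate(([0.0], false_pos / false_pos[-1]))
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy scores: a perfectly separating test and a perfectly wrong one.
perfect_fpr, perfect_tpr = roc_curve_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
reversed_fpr, reversed_tpr = roc_curve_points([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
auc_perfect = area_under_curve(perfect_fpr, perfect_tpr)
auc_reversed = area_under_curve(reversed_fpr, reversed_tpr)
print(auc_perfect, auc_reversed)
```

A test no better than random classification would trace the diagonal and give an area near 0.5 under this calculation.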
RESULTS

The primary purpose of this paper was to demonstrate that statistical classification techniques may offer an objective approach to developing classification decision rules for FCM experiments. To do this it was necessary to apply these techniques to FCM data that represented the values typically encountered but having defined statistical properties. The first condition was necessary to determine whether these statistical techniques were applicable to actual FCM experiments; the second was necessary in order to evaluate the techniques under circumstances where the actual outcomes were known. Figure 1 illustrates the multiparameter histograms from two distinct FCM experiments using different cell types (bone marrow and MCF-7) that formed the basis of the analysis presented

in this paper.

Table 2. Acceptable Probabilities of Misclassification of Each Cell Type Can Be User-Specified*
Columns: C (an experimenter-supplied number); Probability of misclassifying x from population 1, $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$; Probability of misclassifying x from population 2, $1 - \Phi([c + \tfrac{1}{2}D^2]/\sqrt{D^2})$
*If the desired relative probabilities of misclassification of each cell type can be specified by the user, the value of the parameter C can be modified to adjust the classification boundaries used in the DFA. For example, if a user specifies that the desired probability of classifying a malignant cell as normal is and the desired probability of classifying a normal cell as malignant is , then the value of C = 1 inserted into the DFA algorithm will produce the desired classification boundary.

FIG. 3. Discriminant Function Analysis applied to separating an equal number of bone marrow and breast cancer cells. The effects of using three different cutoff scores (that alter the probabilities of misclassification) are shown; see Table 2.

Depending on the specific aspect of DFA or LR that was being tested, varying numbers of entries in each of these separate files were selected at random and used to construct a data set that was analyzed using these statistical techniques. Table 1 illustrates the stability of the DFA and LR classification schemes with regard to successfully classifying cells. Using the two fluorescent parameters and Monte Carlo methods, 50 replicate data sets were created and analyzed from a random selection of 5,000 cells from the normal bone marrow data file and 1,000 cells from the MCF-7 data file.
Table 1 shows the results under three different scenarios: (1) rates of misclassification under the assumption that membership in each of the two cell type categories is equally likely; (2) rates of misclassification when it is assumed that the probability of membership in each category is proportional to the known relative frequency of each cell type in the data set; and (3) rates of misclassification when it is assumed that the probability of membership in each category is based on a perceived, but arbitrary, differential value for misclassification. As an example of scenario 3, in a clinical bone marrow transplant setting, an investigator may be willing to discard 10, 100, or 1,000 normal cells to prevent one malignant cell from being reintroduced into the patient. Table 1 presents results for FCM experiments in which a mixture contains an infrequently encountered cell type (rare malignant cells among many normal cells) and failure to correctly identify that infrequent cell may have major consequences. Under the assumption of equal probabilities of occurrence, the overall error rate is highest, but the rate of misclassification of each cell type (normal as malignant and malignant as normal) is approximately equal. When a classification rule based on relative frequency in the data is used, the overall error rate is lowest but, because it was assumed that the vast majority of cells were normal, the misclassification rate of malignant cells as normal was quite high (mean 91%). When some form of differential value for misclassification was introduced, in this case a value of 10 to 1, the overall error rate was similar to that encountered when the relative frequency rule was employed. However, the rate of misclassification of malignant cells was markedly reduced (mean 3%). Other values for the relative value of misclassification obviously could have been used, but the ones employed clearly demonstrate the concept.
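The Monte Carlo procedure behind results of this kind can be sketched in NumPy. Everything specific here is an assumption for illustration: the two "data files" are synthetic, well-separated Gaussian clusters rather than the paper's FCM data, group 1 is taken to be malignant cells, the cutoff is fixed at zero, and cells are drawn with replacement within each replicate.

```python
import numpy as np

rng = np.random.default_rng(seed=1999)  # fixed seed: identical draws on every run

# Hypothetical stand-ins for the two data files (two fluorescence parameters).
normal_file = rng.normal(loc=[200.0, 600.0], scale=60.0, size=(6689, 2))
malignant_file = rng.normal(loc=[600.0, 200.0], scale=60.0, size=(7796, 2))

def fit_lda(x1, x2):
    """Linear discriminant coefficients a and offset b, with L(x) = x @ a - b."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    pooled = ((len(x1) - 1) * np.cov(x1, rowvar=False)
              + (len(x2) - 1) * np.cov(x2, rowvar=False)) / (len(x1) + len(x2) - 2)
    a = np.linalg.solve(pooled, m1 - m2)
    return a, 0.5 * (m1 + m2) @ a

def one_replicate(n_normal=5000, n_malignant=1000, cutoff=0.0):
    """Draw one random sample, fit the discriminant, return coefficients and error rates."""
    normal = normal_file[rng.choice(len(normal_file), n_normal, replace=True)]
    malignant = malignant_file[rng.choice(len(malignant_file), n_malignant, replace=True)]
    a, b = fit_lda(malignant, normal)  # group 1 = malignant, group 2 = normal
    malignant_as_normal = np.mean(malignant @ a - b <= cutoff)
    normal_as_malignant = np.mean(normal @ a - b > cutoff)
    return a, malignant_as_normal, normal_as_malignant

results = [one_replicate() for _ in range(50)]
coefs = np.array([r[0] for r in results])
rates = np.array([r[1:] for r in results])

# Descriptive statistics over the 50 replicates, in the spirit of Table 1.
print("coefficient mean:", coefs.mean(axis=0), "std:", coefs.std(axis=0, ddof=1))
print("rate mean:", rates.mean(axis=0), "min:", rates.min(axis=0), "max:", rates.max(axis=0))
```

Small standard deviations of the coefficients across replicates would indicate a stable discriminant function; replacing the zero cutoff with a prior- or cost-adjusted value would reproduce the other two scenarios.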
An alternative to using the intuitively appealing, but arbitrary and hard-to-evaluate, costs of misclassification approach is to specify acceptable probabilities of misclassification for each cell type and then modify the classification boundaries based on these values. Table 2 indicates how the theoretical probabilities of misclassification change as a function of $c$, an experimenter-chosen level, in the expression $\Phi([c - \tfrac{1}{2}D^2]/\sqrt{D^2})$.
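How the theoretical misclassification probabilities move with the cutoff can be sketched with the standard library alone. The sketch assumes the usual equal-covariance Gaussian result, in which the group 2 probability uses $c + \tfrac{1}{2}D^2$; the function names and the value $D^2 = 16$ are hypothetical and are not taken from the paper's data.

```python
import math

def std_normal_cdf(z):
    """Phi(z) = Pr(Z <= z) for a standard Gaussian Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification_probabilities(c, d2):
    """Return (P[group 1 called 2], P[group 2 called 1]) for cutoff c and Mahalanobis D^2."""
    d = math.sqrt(d2)
    p_1_as_2 = std_normal_cdf((c - 0.5 * d2) / d)
    p_2_as_1 = 1.0 - std_normal_cdf((c + 0.5 * d2) / d)
    return p_1_as_2, p_2_as_1

D2 = 16.0  # hypothetical Mahalanobis D^2
table = {c: misclassification_probabilities(c, D2) for c in (-2.0, 0.0, 2.0)}
for c, (p12, p21) in table.items():
    print(f"c = {c:+.1f}: P(1 as 2) = {p12:.4f}, P(2 as 1) = {p21:.4f}")
```

Raising c makes it harder to be assigned to group 1, so the first probability rises and the second falls; at c = 0 the two are equal.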

FIG. 4. Evaluation of the minimal number of cells required to obtain a stable discriminant function analysis. As indicated, DFA estimates are stable even when there is a large variation in the relative proportion of each cell type.

For example, if c = 1 is used, then there is a probability of 0.03 of misclassifying a cell from population 1 and a probability of of misclassifying a cell from population 2. For the data sets utilized, neither the multivariate Gaussian assumption nor the assumption of equal covariance matrices appeared to be valid. The violations of these assumptions, however, do not appear to reduce the utility of DFA to empirically classify cells correctly. However, these violations may cause a departure of the actual probabilities from the theoretically

calculated ones. This was shown by use of the Monte Carlo methods used to generate the entries in Table 1.

FIG. 5. The stability of the Discriminant Function Analysis over four samples of cells randomly selected from the total cells available in this analysis. This indicates the repeatability of DFA when different random samplings of each cell type are used.

Because of the instructional value of showing two-dimensional plots and the lack of additional discriminating power when data from the non-fluorescent parameters were added to the data from the two fluorescent parameters, many results are presented (see Figs. 3-5) using data from just the two fluorescent parameters (FITC and PE). The goal here is to illustrate the results of DFA and LR when the results are intuitively obvious. Figure 2 shows the concept of discriminant function analysis. Different cutoff scores can be user-specified in tradeoffs between yield and purity of a given cell subpopulation. Figure 3 illustrates the separations obtained with all 6,689 human bone marrow cells (NBM) and 7,796 MCF-7

human breast cancer cells and different probabilities of misclassification.

Table 3. Logistic Regression Classification of the Data*
Columns: Variable; Parameter estimate; Standard error; Wald confidence limits (Lower, Upper); P
Rows: Intercept
Model fit: -2 Log likelihood for Intercept only and for Intercept and covariates; Chi-square (2 df) (P 0.001)
Association of predicted probabilities and observed responses: Concordant 78.8%; Discordant 20.8%; Tied 0.3%; Area under ROC curve
*Results of Logistic Regression Analysis performed on a file of a randomly selected sample of 500 Normal Bone Marrow and 100 MCF-7 cells used to produce the ROC curve shown in Figure 7. As indicated in the text, only a relatively small number of cells were used to illustrate the concept.

FIG. 6. The DFA obtained from a sample of 100 bone marrow cells (round points) and 10 MCF-7 cells (square points) using the nonfluorescent parameters available in the data files used in this paper.

To evaluate the minimum number of MCF-7 cells required to achieve acceptable discrimination levels, analyses were performed using 5,000 normal bone marrow and 10, 50, 200, and 1,000 MCF-7 cells (Fig. 4). In this case, the seed value of the random number generator was fixed so that the same cells were selected for analysis and the probabilities of misclassification were varied. To evaluate the stability of the discriminant function over random selection of cells, repeated random samples were drawn. Typical results are shown in Figure 5. For the results in each panel, the same cells were selected so that different probabilities of misclassification could be used. Using data from either the two fluorescent or all four (two fluorescent and two scatter) parameters, Logistic Regression often produced complete separation (100% correct classification). Hence, evaluation of statistical estimates is not meaningful under these circumstances.
To illustrate the interpretation of an LR analysis where complete separation did not occur, the results of DFA and the corresponding LR are shown for the case where only the non-fluorescent parameters were used. The goal here was to evaluate DFA and LR when the classification was no longer intuitively obvious. Figure 6 illustrates the DFA results, based on the light scattering parameters only, when 100 MCF-7 and 500 NBM cells were randomly selected. The corresponding Logistic Regression and ROC analysis produced Table 3 and Figure 7. Figure 7 shows the results of a ROC analysis of the data based on the two non-fluorescent parameters, which should have some, but limited, classification power because the forward and side light scattering of the tumor cells partially overlap those of the normal bone marrow cells, as shown in Figure 1. While the DFA and Logistic Regression did not produce as useful a separation tool as when the fluorescent parameters were included, the corresponding ROC analysis indicated that these techniques still produced credible classification capabilities.

DISCUSSION

Development of scientifically sound FCM sort boundary classification rules, based on multivariate statistical techniques that are easy to implement, could greatly enhance the value of many FCM applications. This paper proposes the use of mathematically rigorous statistical classification techniques that take advantage of the multiparameter signals obtained from current FCM devices. One technique, DFA, also provides the ability to adjust sort decisions based on: (1) the prior probability of encountering only a small number of a particular cell type, (2) the potential cost penalty associated with the misclassification of a cell, and (3) the need to specify particular probabilities of misclassification. Of some concern is that the FCM data did not conform to the DFA assumption of a multivariate Gaussian distribution.
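Items (1) and (2) above can be made concrete with the textbook two-class linear discriminant, sketched below under the equal-covariance Gaussian assumption just mentioned. This is not the statistical package procedure used in this paper. The function names, the synthetic data, the prior of 0.01, and the cost ratios are all assumptions chosen for illustration. A cell is called tumor when its discriminant score exceeds a cutoff of log[(cost of a false alarm × prior of normal) / (cost of a missed tumor cell × prior of tumor)], so a rare tumor population raises the cutoff and a high penalty for missed tumor cells lowers it.

```python
import numpy as np

def fit_linear_discriminant(normal, tumor):
    """Return (w, b) so that w @ x + b estimates log[f_tumor(x) / f_normal(x)]."""
    mu0, mu1 = normal.mean(axis=0), tumor.mean(axis=0)
    n0, n1 = len(normal), len(tumor)
    # Pooled within-class covariance (equal-covariance Gaussian assumption).
    pooled = ((n0 - 1) * np.cov(normal, rowvar=False)
              + (n1 - 1) * np.cov(tumor, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(pooled, mu1 - mu0)
    b = -0.5 * float(w @ (mu0 + mu1))
    return w, b

def score_cutoff(prior_tumor, cost_missed_tumor=1.0, cost_false_alarm=1.0):
    """Cutoff on the discriminant score that minimizes expected misclassification cost."""
    return float(np.log(cost_false_alarm * (1.0 - prior_tumor)
                        / (cost_missed_tumor * prior_tumor)))

def call_tumor(cells, w, b, cutoff):
    """Boolean mask: True where a cell is classified as tumor."""
    return cells @ w + b > cutoff

# Synthetic two-parameter data (illustration only, not the paper's measurements).
rng = np.random.default_rng(0)
normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))
tumor = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
w, b = fit_linear_discriminant(normal, tumor)

for cost in (1.0, 100.0):
    cutoff = score_cutoff(prior_tumor=0.01, cost_missed_tumor=cost)
    caught = call_tumor(tumor, w, b, cutoff).mean()
    lost = call_tumor(normal, w, b, cutoff).mean()
    print(f"cost ratio {cost:5.0f}: cutoff={cutoff:6.2f}, "
          f"tumor cells flagged={caught:.2f}, normal cells flagged={lost:.2f}")
```

Lowering the cutoff flags more tumor cells at the price of discarding more normal cells, which is the yield versus purity tradeoff described for the user-specified cutoff scores.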
Despite this departure from the Gaussian assumption, and based on our results that used correct classification as the objective outcome, the excellent empirical performance of DFA suggests that it should be considered as a technique to assist in the development of rules for determining FCM sort boundaries. The Logistic Regression (LR) techniques we evaluated also produced excellent empirical results. In addition, ROC analysis assisted in deciding cutoff points for LR-based data analysis boundaries. While LR does not require the same Gaussian assumptions about the underlying data structure, the inability of existing statistical packages to

adjust data analysis decisions to meet specific criteria, such as the relative cost of misclassification, implies that LR may not currently be as flexible as DFA in developing data analysis decisions.

FIG. 7. The use of ROC analysis to evaluate the effectiveness of parameters used in the classification of cell subpopulations. The ROC analysis obtained from these data indicates that while the discriminating power is greatly reduced when using non-fluorescent parameters as compared to using fluorescent parameters, the LR analysis still greatly improves the selection process when compared to not using objective criteria.

This paper demonstrates that, under very specific conditions, statistical classification systems can assist in determining data analysis boundaries in FCM experiments. The real challenge is how to advance these findings into the world of real-time cell sorting, where the true identity of any given cell is unknown and the sort decision must be made within microseconds of the cell encountering the FCM sensors. This will be discussed in a future paper.

Beyond the quantitative issues addressed in this paper, there are issues of biologic, instrumental, systemic, and as yet unknown variability that determine the utility of FCM in resolving cell classification problems. For example, within each major cell type, subpopulations may exist that can confound the relatively straightforward classifications proposed in this paper. DFA, as implemented in most current statistical packages, can only classify into two groups even though each group can contain more than one cell subpopulation. The ultimate goal of our efforts is to sort the population of cells involved in the autologous bone marrow transplantation procedure. The mixture of cells to be processed contains tumor cells, normal non-stem-cell bone marrow cells, and bone marrow stem cells.
The purpose is to collect the maximum fraction of stem cells while minimizing the re-infusion of the contaminating tumor cells or the nuisance non-stem-cell bone marrow cells. Without satisfactory purging of tumor cells from the re-infused bone marrow cells, the possibility of a recurrence of the primary tumor is greatly increased. While numerous technical challenges remain, this paper assists in achieving this goal by demonstrating that statistically rigorous techniques are available to assist in making sort decisions. As the instrumentation improves, further enhancements of these statistical techniques will be employed to refine the development of sort boundary classification rules. In addition to the methods described here, new developments should be explored. Cluster analysis and regression tree analysis techniques merit further exploration. On the horizon are techniques designed to extract the maximum amount of information from the data contained in FCM data sets. These so-called Knowledge Discovery in Data or Data Mining techniques may extend our efforts beyond statistical classification into

pattern recognition, artificial intelligence, and neural networks. The bottom-line message is that FCM investigators should explore the spectrum of quantitative techniques that are becoming available to them for developing more objective rules for the selection of sort boundaries.

LITERATURE CITED

1. Afifi AA, Clark V. Computer aided multivariate analysis. New York: Van Nostrand Reinhold; p
2. Anderson TW. An introduction to multivariate statistical analysis. New York: John Wiley and Sons, Inc.; p
3. Beck JR, Shultz EK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:
4. Brenner MK, Rill DR, Moen RC, Krance RA, Mirro J Jr, Anderson WF, Ihle JN. Gene-marking to trace origin of relapse after autologous bone-marrow transplantation. Lancet 1993;341:
5. Brenner MK. The contribution of marker gene studies to hematopoietic stem cell therapies. Stem Cells 1995;13:
6. Fisher RA. The use of multiple measurements in taxonomic problems. Annu Eugenics 1936;7:
7. Heslop HE, Rooney CM, Rill DR, Krance RA, Brenner MK. Use of gene marking in bone marrow transplantation. Cancer Detect Prev 1996;20:
8. Klecka WR. Discriminant analysis. Sage University Paper series. Quant Appl Social Sci 1980;
9. Kleinbaum DG, Kupper LL, Miller KE. Applied regression and other multivariate methods. Boston: PWS-Kent; p
10. Lachenbruch PA. Discriminant analysis. New York: Hafner Press; p
11. McLachlan GJ. Discriminant analysis and statistical pattern recognition. New York: John Wiley and Sons, Inc.; p
12. SAS/STAT user's guide, version 6, 4th ed, volume 1. Cary, NC: SAS Institute Inc.; p
13. Tatsuoka MM. Multivariate analysis: techniques for education and psychological research, 2nd ed. New York: John Wiley & Sons; p 18-62.
