The Analysis of 2 × K Contingency Tables with Different Statistical Approaches

Hassan Salah M.
Thebes Higher Institute for Management and Information Technology
drhassn_242@yahoo.com

Abstract

The main objective of this paper is to analyze 2 × K contingency tables with three statistical approaches: regression analysis, multinomial logistic regression analysis, and a linguistic fuzzy model. We compare these methods for evaluating the association between a risk factor and a disease. The methods measure the association between the numeric levels of a risk factor and a disease in different ways. They are applied to a data set on childhood cancer risk from prenatal x-ray exposure. Regression and multinomial logistic regression analyses show similar results for a data set of 16226 children, whereas the fuzzy analysis yields a different result.

Keywords: Contingency table, Multinomial logistic regression, Linguistic fuzzy model, Data of childhood cancer, X-ray exposure.

1. Introduction

The 2 × K contingency table is an important extension of the 2 × 2 table, which is a basic tool for epidemiological investigation. In a 2 × K contingency table, the presence or absence of a disease is recorded at K levels of a risk factor. The table can be viewed from the perspective of a K-level variable (the risk factor) or from the perspective of a binary variable (the disease) [4]. In this paper, we use three different statistical approaches for analyzing the 2 × K contingency table: regression analysis, multinomial logistic regression analysis, and a linguistic fuzzy model.

Data on malignancies in children under 10 years of age, together with information on the mother's exposure to x-rays, provide an example for the discussion and analysis of a 2 × K table [2], [3]. Table 1 shows the numbers of prenatal x-rays received by mothers of children with a malignant disease and by mothers of a series of controls (healthy children of the same age, sex, and similar areas of residence).

Table 1: Observed numbers of cases and controls by recorded number of maternal x-ray films during pregnancy

| Films | 0 | 1 | 2 | 3 | 4 | 5* | Total |
|---|---|---|---|---|---|---|---|
| Cases (Y = 0) | 7332 | 287 | 199 | 96 | 59 | 65 | 8038 |
| Controls (Y = 1) | 7673 | 239 | 154 | 65 | 28 | 29 | 8188 |
| Total | 15005 | 526 | 353 | 161 | 87 | 94 | 16226 |
| Proportion | .489 | .546 | .564 | .596 | .678 | .691 | |

* For simplicity, values greater than five were coded as 5.

2. Regression Analysis

A 2 × K contingency table can be viewed as a set of K pairs of values. An estimated probability is generated for each value of X, producing K pairs $(x_j, \hat{p}_j)$, where $\hat{p}_j$ is the estimated probability that Y = 0 at the level represented by $x_j$. To analyze the K pairs of values, a straight line summarizing the relationship between X and Y is estimated, and the slope of the estimated line is used as a summary of that relationship.

For a simple linear regression, three quantities are necessary to derive the basic statistical measures: the sum of squares for X ($S_{xx}$), the sum of squares for Y ($S_{yy}$), and the sum of cross-products for X and Y ($S_{xy}$). With $n_{ij}$ denoting the count in row i and column j, $n_{i.}$ and $n_{.j}$ the row and column totals, and n the grand total, these expressions calculated from a 2 × K contingency table are [7]:

$$S_{xx} = \sum_{j=1}^{K} n_{.j}\,(x_j - \bar{x})^2, \quad \text{where } \bar{x} = \sum_{j=1}^{K} n_{.j}\,x_j / n \tag{1}$$

$$S_{yy} = \frac{n_{1.}\,n_{2.}}{n} \tag{2}$$

$$S_{xy} = (\bar{x}_1 - \bar{x}_2)\,S_{yy}, \quad \text{where } \bar{x}_i = \sum_{j=1}^{K} n_{ij}\,x_j / n_{i.} \tag{3}$$

The regression coefficient can now be estimated as

$$\hat{b}_{y/x} = \frac{S_{xy}}{S_{xx}} \tag{4}$$

and the variance of the estimated regression coefficient can be estimated as

$$\widehat{\mathrm{var}}(\hat{b}_{y/x}) = \frac{S_{yy}}{(n - 1)\,S_{xx}} \tag{5}$$

In addition, a correlation coefficient measuring the degree of linear association between X and Y, calculated in the usual way, is

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \tag{6}$$
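The quantities in equations (1) to (6) can be sketched in a few lines of Python. This is an illustrative sketch only, not the author's computation: the function name `regression_summaries` and the dictionary keys are hypothetical, and equation (5) is read here with the divisor $(n - 1)S_{xx}$.

```python
import math

# Table 1 counts by number of x-ray films (values above 5 coded as 5).
FILMS = [0, 1, 2, 3, 4, 5]
CASES = [7332, 287, 199, 96, 59, 65]      # row 1, Y = 0
CONTROLS = [7673, 239, 154, 65, 28, 29]   # row 2, Y = 1


def regression_summaries(x, row1, row2):
    """Compute the summaries of equations (1)-(6) for a 2 x K table."""
    col_totals = [a + b for a, b in zip(row1, row2)]
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2

    # Overall mean of x, weighted by the column totals.
    x_bar = sum(t * xj for t, xj in zip(col_totals, x)) / n
    s_xx = sum(t * (xj - x_bar) ** 2 for t, xj in zip(col_totals, x))  # (1)
    s_yy = n1 * n2 / n                                                 # (2)

    # Row-specific means of x.
    x_bar1 = sum(c * xj for c, xj in zip(row1, x)) / n1
    x_bar2 = sum(c * xj for c, xj in zip(row2, x)) / n2
    s_xy = (x_bar1 - x_bar2) * s_yy                                    # (3)

    slope = s_xy / s_xx                                                # (4)
    var_slope = s_yy / ((n - 1) * s_xx)                                # (5)
    r = s_xy / math.sqrt(s_xx * s_yy)                                  # (6)
    return {"s_xx": s_xx, "s_yy": s_yy, "s_xy": s_xy,
            "slope": slope, "var_slope": var_slope, "r": r}


summary = regression_summaries(FILMS, CASES, CONTROLS)
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

With the Table 1 counts, `s_xx`, `s_yy` and `var_slope` come out close to the values quoted in the text (6733.581, 4056.155 and 0.000037). The cross-product, slope and correlation are printed so that they can be compared with the reported figures.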

For the data in Table 1, the sums of squares and cross-products for the malignant disease are $S_{xx} = 6733.581$, $S_{yy} = 4056.155$ and $S_{xy} = 328.548$. Using (4) and (5), the estimated regression coefficient and its variance are 0.049 and 0.000037 respectively. The correlation between case/control status and x-ray exposure is 0.063. A 95 % confidence interval of the association coefficient is (0.0577, 0.0683). Moreover, the expected numbers of cases and controls by recorded number of maternal x-ray films during pregnancy are estimated using the estimated linear response $\hat{p}_j = 0.489 + 0.049\,x_j$, as shown in Table 2.

Table 2: Expected numbers of cases and controls by recorded number of maternal x-ray films during pregnancy

| Films | 0 | 1 | 2 | 3 | 4 | 5* | Total |
|---|---|---|---|---|---|---|---|
| Cases (Y = 0) | 7433.14 | 260.57 | 174.87 | 79.76 | 43.10 | 46.57 | 8038 |
| Controls (Y = 1) | 7571.86 | 265.43 | 178.13 | 81.24 | 43.90 | 47.43 | 8188 |
| Total | 15005 | 526 | 353 | 161 | 87 | 94 | 16226 |
| Proportion | .495 | .531 | .573 | .615 | .657 | .699 | |

* For simplicity, values greater than five were coded as 5.

The observed and expected proportions of cases shown in Table 1 and Table 2 are plotted in Figure 1.

Figure 1: Proportion of cases of childhood cancer by exposure to maternal x-ray during pregnancy (observed and expected proportion P plotted against the number of x-ray films)
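The chi-square statistic for homogeneity used in this section to assess the dose-response pattern can be reproduced from the Table 1 counts. The sketch below is a plain Pearson chi-square computed by hand; the function name `homogeneity_chi_square` is hypothetical, and the code is not the author's.

```python
# Table 1 counts by number of x-ray films (values above 5 coded as 5).
CASES = [7332, 287, 199, 96, 59, 65]      # Y = 0
CONTROLS = [7673, 239, 154, 65, 28, 29]   # Y = 1


def homogeneity_chi_square(row1, row2):
    """Pearson chi-square for H0: same proportion of row 1 in every column.

    Returns the statistic, its degrees of freedom, and the expected
    (row 1, row 2) counts per column under H0.
    """
    n1, n2 = sum(row1), sum(row2)
    p = n1 / (n1 + n2)              # pooled proportion of cases
    chi2 = 0.0
    expected = []
    for a, b in zip(row1, row2):
        total = a + b
        e1, e2 = total * p, total * (1 - p)
        expected.append((e1, e2))
        chi2 += (a - e1) ** 2 / e1 + (b - e2) ** 2 / e2
    return chi2, len(row1) - 1, expected


chi2, dof, expected = homogeneity_chi_square(CASES, CONTROLS)
print(f"chi-square = {chi2:.3f} on {dof} degrees of freedom")
for films, (e_cases, e_controls) in enumerate(expected):
    print(f"films={films}: expected cases={e_cases:.2f}, controls={e_controls:.2f}")
```

The statistic agrees, up to rounding, with the value of 47.286 reported in the text, on K − 1 = 5 degrees of freedom.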

Figure 1 indicates that the estimated line fits the observed proportions of cases well. An additional assessment of the dose-response relationship is accomplished by partitioning the total chi-square value. The chi-square statistic that measures homogeneity (H0: the proportion of cases is the same regardless of the degree of maternal x-ray exposure) is χ² = 47.286. A chi-square value of this magnitude indicates the presence of some sort of nonhomogeneous pattern of response (p-value = 0.001) [7].

3. Multinomial Logistic Regression Analysis

Multinomial logistic regression analysis is useful in situations where we want to classify subjects based on the values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general. In regression analysis, we use the numeric levels of a risk factor (the number of x-ray exposures) as the independent variable and the corresponding proportion of cases as the dependent variable, whereas multinomial logistic regression needs a large number of records (frequencies) to establish an association between a risk factor and a disease [5].

To analyze the 2 × K contingency table using multinomial logistic regression, the data in Table 1 were processed using SPSSWIN, and the numeric results were similar to those obtained by regression analysis [1]. That is, the association coefficient between risk factor and disease is 0.053 with a standard error of 0.008. A 95 % confidence interval of the association coefficient is (0.0481, 0.0579).

4. Fuzzy Analysis

In bioscience there are several levels of uncertainty, vagueness and imprecision, particularly in the medical and epidemiological areas, where the best and most useful descriptions of disease entities often comprise linguistic terms that are inevitably vague.
The theory of fuzzy logic has been developed to deal with the concept of partial truth values, ranging from completely true to completely false. It has become a powerful tool for dealing with imprecision and uncertainty, aiming at tractability, robustness and low-cost solutions for real-world problems. These features, and the ability to deal with linguistic terms, could explain the increasing number of works applying fuzzy logic to biomedical problems. In fact, the theory of fuzzy sets has become an important mathematical approach in diagnostic systems, the treatment of medical images and, more recently, epidemiology and public health [5], [6]. For more on fuzzy logic theory, the book by Yen and Langari [8] is recommended.

A linguistic fuzzy model consists of a set of fuzzy rules and an inference method. The most common inference method is the Mamdani minimum method, whose output is a fuzzy set. The fuzzy linguistic model for evaluating childhood cancer risk from prenatal x-ray exposure has two antecedents: malignancies in children under 10 years of age and information on the mother's exposure to x-rays. The model defines five fuzzy sets for the number of x-ray films to which the mother was exposed (very low, low, medium, high and very high) and two fuzzy sets for the status of the child, that is, children with a malignant disease and a series of controls consisting of healthy children of the same age (cases and controls). The consequent of the model is the association between x-ray films and malignancies in children under 10 years of age. We considered three fuzzy sets for this linguistic variable: weak, medium and strong. The rule base consists of the following rules:

1. If x-ray is very low and case then association is weak.
2. If x-ray is low and case then association is weak.
3. If x-ray is medium and case then association is weak.
4. If x-ray is high and case then association is medium.
5. If x-ray is very high and case then association is strong.

The association between the children's malignancies and x-ray films is determined by inference over the fuzzy rule set and defuzzification of the fuzzy output. The system was implemented in C++. The fuzzy sets for the input variable (number of x-rays) and for the output variable (association between childhood malignancies and x-ray) are displayed in Figure 2 and Figure 3.

Figure 2: Fuzzy sets for the input variable, number of x-rays (membership functions VLOW, LOW, MEDIUM, HIGH and VHIGH)

Figure 3: Fuzzy sets for the output variable, association between childhood malignancies and x-ray (membership functions WEAK, MEDIUM and STRONG)

By combining all possible inputs it is possible to build 10 rules, but only 5 rules were considered because some situations cannot occur. For example, a child whose mother was not exposed to x-rays cannot have the disease as a result of exposure (if the child has the disease, it is for another reason). Although such combinations are mathematically possible, they were removed from the rule base, reducing the number of rules.

The fuzzy sets related to the linguistic variables are presented in Figure 2. The membership function represents the degree of compatibility of an input with each category. In fact, the membership degree represents the possibility that the input belongs to the set. Figure 3 shows the membership functions of the output. It is clear that the association increases monotonically as the number of x-ray films increases. It was 16 % for weak, 17 % for medium and 18 % for strong associations respectively. The weighted mean of the association between x-ray and the disease was 0.125, with a standard error of 0.0026. A 95 % confidence interval of the association coefficient is (0.1178, 0.1322).
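The inference step described in this section can be illustrated with a minimal Mamdani-style sketch in Python. Everything numeric here is an assumption: the breakpoints of the triangular membership functions, the 0 to 20 output scale and the centroid defuzzification are hypothetical choices made for illustration, and the code is not the author's C++ system. Only the five rules are taken from the text.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (a <= b <= c)."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical breakpoints on the 0-5 films scale (not from the paper).
XRAY_SETS = {
    "very_low": (0, 0, 1), "low": (0, 1, 2), "medium": (1, 2.5, 4),
    "high": (3, 4, 5), "very_high": (4, 5, 5),
}
# Hypothetical breakpoints on an assumed 0-20 association scale.
ASSOC_SETS = {"weak": (0, 0, 10), "medium": (5, 10, 15), "strong": (10, 20, 20)}
# The five rules from the text: x-ray level (and case) -> association.
RULES = {"very_low": "weak", "low": "weak", "medium": "weak",
         "high": "medium", "very_high": "strong"}


def mamdani_association(films, case_degree=1.0, steps=201):
    """Min inference, max aggregation, centroid defuzzification (assumed)."""
    grid = [20 * i / (steps - 1) for i in range(steps)]
    aggregated = [0.0] * steps
    for xray_set, assoc_set in RULES.items():
        # Firing strength: minimum over the two antecedents.
        strength = min(tri(films, *XRAY_SETS[xray_set]), case_degree)
        for i, z in enumerate(grid):
            clipped = min(strength, tri(z, *ASSOC_SETS[assoc_set]))
            aggregated[i] = max(aggregated[i], clipped)
    area = sum(aggregated)
    if area == 0:
        return None  # no rule fired
    return sum(z * m for z, m in zip(grid, aggregated)) / area


for films in range(6):
    print(films, round(mamdani_association(films), 2))
```

Under these assumed membership functions the defuzzified association does not decrease as the number of films goes from 0 to 5. The actual values depend entirely on the breakpoints chosen, so they should not be compared with the percentages reported above.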

Discussion

In regression analysis, we use the numeric levels of a risk factor (the number of x-ray exposures) as the independent variable and the corresponding proportion of cases as the dependent variable. Multinomial logistic regression, in turn, needs a considerable number of records (frequencies) to establish an association between a risk factor and a disease. A fuzzy linguistic model has no such requirement.

The point biserial correlation coefficient ($r_{xy}$) and the regression coefficient ($\hat{b}_{y/x}$) are interrelated when calculated from a 2 × K table. For example, each has an expected value of zero when the variables X and Y are unrelated. The two statistics measure the association between the numeric levels of a risk factor and a disease in different ways but, in terms of probability, lead to the same inference. A measure of association assesses the strength of a relationship, while a statistical test gives an idea of the likelihood that such an association occurs by chance. Whereas regression and multinomial logistic regression give similar results, the fuzzy model gives rather different results for the association between the risk factor and the disease (see Table 3).

Table 3: Comparison between the results of the three methods

| | Regression | Multinomial logistic regression | Fuzzy model |
|---|---|---|---|
| Association coefficient | 0.063 | 0.053 | 0.125 |
| Standard error | 0.0061 | 0.0082 | 0.0026 |
| 95 % CI | (.0577, .0683) | (.0481, .0579) | (.1178, .1322) |
| p-value | 0.001 | 0.000 | |

Table 3 shows that, for a data set of 16226 children, regression and multinomial logistic regression give similar results for the association between the risk factor and the disease, whereas the results from the fuzzy model are rather different.

References

[1] Ashour, S. K. and Salem, S. A. (2005). Statistical Presentation and Analysis using SPSSWIN, Part Two: Advanced Applied Statistics. Cairo University: ISSR.

[2] Bithell, J. F. and Steward, M. A. (1975). Prenatal Irradiation and Childhood Malignancy: A Review of British Data from the Oxford Study. Brit. J. of Cancer, 31: 271-87.
[3] Breslow, N. E. and Day, N. E. (1987). Statistical Methods in Cancer Research, Volume II. Oxford University Press, Oxford, UK.
[4] Hardeo Sahai and Anwer Khurshid (1996). Statistics in Epidemiology: Methods, Techniques and Applications. CRC Press, New York.
[5] Luiz Fernando C. Nascimento and Neli Regina Ortega (2002). Fuzzy Linguistic Model for Evaluating the Risk of Neonatal Death. Rev Saude Publica, 36 (6): 686-92.
[6] Schwarzer G., Nagata T., Mattern D., Schmelzeisen R. and Schumacher (2003). Comparison of Fuzzy Inference, Logistic Regression, and Classification Trees (CART). Methods Inf Med, 42: 572-7.
[7] Steve, S. (1996). Statistical Analysis of Epidemiologic Data, 2nd ed. Oxford University Press, Oxford.
[8] Yen J. and Langari R. (1999). Fuzzy Logic: Intelligence, Control, and Information. Upper Saddle River (NJ): Prentice-Hall.