The Analysis of 2 × K Contingency Tables with Different Statistical Approaches

Hassan Salah M.
Thebes Higher Institute for Management and Information Technology
drhassn_242@yahoo.com

Abstract

The main objective of this paper is to analyze 2 × K contingency tables with three statistical approaches: regression analysis, multinomial logistic regression analysis and a linguistic fuzzy model. We compare these methods for evaluating the association between a risk factor and a disease. The methods measure the association between the numeric levels of a risk factor and a disease in different ways. They are applied to a data set on childhood cancer risk from prenatal x-ray exposure. Regression and multinomial logistic regression analyses show similar results for a data set of 16226 children, whereas the fuzzy analysis yields a different result.

Keywords: Contingency table, Multinomial logistic regression, Linguistic fuzzy model, Data of childhood cancer, X-ray exposure.

1. Introduction

The 2 × K contingency table is an important extension of the 2 × 2 table, which is a basic tool for epidemiological investigation. In a 2 × K contingency table, the presence or absence of a disease is recorded at K levels of a risk factor. The 2 × K contingency table can be viewed from the perspective of a K-level variable (risk factor) or from the perspective of a binary variable (disease) [4].

In this paper, we use three different statistical approaches for analyzing the 2 × K contingency table: regression analysis, multinomial logistic regression analysis and a linguistic fuzzy model. Data on malignancies in children under 10 years of age and information on the mother's exposure to x-ray provide an example for the discussion and analysis of a 2 × K table [2] and [3]. Table 1 shows the numbers of prenatal x-rays received by mothers of children with a malignant disease, and by a series of controls (healthy children of the same age, sex, and similar areas of residence).

Table 1. Observed numbers of cases and controls by recorded number of maternal x-ray films during pregnancy

Films              0       1      2      3      4      5*     Total
Cases (Y = 0)      7332    287    199    96     59     65     8038
Controls (Y = 1)   7673    239    154    65     28     29     8188
Total              15005   526    353    161    87     94     16226
Proportion         .489    .546   .564   .596   .678   .691

* For simplicity, values greater than five were coded as 5.

2. Regression Analysis

A 2 × K contingency table can be viewed as a set of K pairs of values. An estimated probability is generated for each value of X, producing K pairs $(x_j, \hat{p}_j)$, where $\hat{p}_j$ is the estimated probability that Y = 0 associated with the level represented by $x_j$. In order to analyze the K pairs of values, a straight line which summarizes the relationship between X and Y is estimated, and the slope of the estimated line is used as a summary of the relationship between X and Y.

For a simple linear regression, three quantities are necessary to derive the basic statistical measures: the sum of squares for X ($S_{xx}$), the sum of squares for Y ($S_{yy}$), and the sum of cross-products for X and Y ($S_{xy}$). These expressions calculated from a 2 × K contingency table are [7]:

$$S_{xx} = \sum_{j=1}^{K} n_{.j}\,(x_j - \bar{x})^2, \quad \text{where } \bar{x} = \sum_{j=1}^{K} n_{.j}\,x_j / n \tag{1}$$

$$S_{yy} = n_{1.}\,n_{2.} / n \tag{2}$$

$$S_{xy} = (\bar{x}_1 - \bar{x}_2)\,S_{yy}, \quad \text{where } \bar{x}_i = \sum_{j=1}^{K} n_{ij}\,x_j / n_{i.} \tag{3}$$

Now, the regression coefficient can be estimated as

$$\hat{b}_{y/x} = S_{xy} / S_{xx} \tag{4}$$

and the variance of the estimated regression coefficient can be estimated as

$$\widehat{\operatorname{var}}(\hat{b}_{y/x}) = \frac{S_{yy}}{(n - 1)\,S_{xx}} \tag{5}$$

On the other hand, a correlation coefficient measuring the degree of linear association between X and Y, calculated in the usual way, is

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \tag{6}$$
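The quantities in equations (1) to (6) can be computed directly from the two rows of a 2 × K table. The following is a minimal Python sketch under stated assumptions, not the author's implementation: the function name `regression_summaries` and the dictionary keys are illustrative, and the denominator `(n - 1) * S_xx` in the variance follows this reading of equation (5). With the Table 1 counts, `s_xx` and `s_yy` agree with the values quoted for these data to about two decimal places; the remaining outputs are left for the reader to compare.

```python
from math import sqrt

# Table 1 counts (films coded 0..5; values above five coded as 5).
films = [0, 1, 2, 3, 4, 5]
cases = [7332, 287, 199, 96, 59, 65]
controls = [7673, 239, 154, 65, 28, 29]


def regression_summaries(x, row1, row2):
    """Return the summary quantities of equations (1)-(6) for a 2 x K table.

    Illustrative helper: row1 and row2 are the two rows of counts,
    x holds the numeric levels of the risk factor.
    """
    col_totals = [a + b for a, b in zip(row1, row2)]
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2

    # Overall mean level and the mean level within each row.
    x_bar = sum(nj * xj for nj, xj in zip(col_totals, x)) / n
    x_bar1 = sum(nij * xj for nij, xj in zip(row1, x)) / n1
    x_bar2 = sum(nij * xj for nij, xj in zip(row2, x)) / n2

    s_xx = sum(nj * (xj - x_bar) ** 2 for nj, xj in zip(col_totals, x))  # (1)
    s_yy = n1 * n2 / n                                                    # (2)
    s_xy = (x_bar1 - x_bar2) * s_yy                                       # (3)
    slope = s_xy / s_xx                                                   # (4)
    # Assumption: denominator read as (n - 1) * S_xx in equation (5).
    var_slope = s_yy / ((n - 1) * s_xx)                                   # (5)
    r = s_xy / sqrt(s_xx * s_yy)                                          # (6)
    return {"s_xx": s_xx, "s_yy": s_yy, "s_xy": s_xy,
            "slope": slope, "var_slope": var_slope, "r": r}


summary = regression_summaries(films, cases, controls)
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

Keeping the formulas in one function makes it easy to swap in a different coding of the exposure levels (for example, not collapsing values above five) and see how the summaries change.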

For the data in Table 1, these quantities for the malignant disease are: $S_{xx} = 6733.581$, $S_{yy} = 4056.155$, $S_{xy} = 328.548$. Using (4) and (5), the estimated regression coefficient and its variance are 0.049 and 0.000037, respectively. The correlation between the case/control status and the x-ray exposure is 0.063. A 95% confidence interval of the association coefficient is (0.0577, 0.0683). Moreover, the expected numbers of cases and controls by recorded number of maternal x-ray films during pregnancy are estimated using the estimated linear response $\hat{p}_i = 0.489 + 0.049\,x_i$, as shown in Table 2 below.

Table 2. Expected numbers of cases and controls by recorded number of maternal x-ray films during pregnancy

Films              0         1        2        3       4       5*      Total
Cases (Y = 0)      7433.14   260.57   174.87   79.76   43.10   46.57   8038
Controls (Y = 1)   7571.86   265.43   178.13   81.24   43.90   47.43   8188
Total              15005     526      353      161     87      94      16226
Proportion         .495      .531     .573     .615    .657    .699

* For simplicity, values greater than five were coded as 5.

The observed and expected proportions of cases shown in Table 1 and Table 2 are plotted in Figure 1 below.

Figure 1: Proportion of childhood cancer cases by exposure to maternal x-ray during pregnancy (observed P and expected P plotted against number of x-ray films).
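The expected counts printed in Table 2 coincide with each column total multiplied by the overall proportion of cases (8038/16226 ≈ 0.495) or of controls, which is the expectation when the proportion of cases does not depend on exposure. The Python sketch below assumes that construction and also forms the Pearson chi-square statistic from the observed and expected counts; the function name `homogeneity_chi_square` is illustrative and not taken from the paper.

```python
# Observed counts from Table 1 (films 0..5).
cases = [7332, 287, 199, 96, 59, 65]
controls = [7673, 239, 154, 65, 28, 29]


def homogeneity_chi_square(row1, row2):
    """Expected counts under homogeneity and the Pearson chi-square statistic.

    Illustrative helper: expected count = row total * column total / n.
    """
    col_totals = [a + b for a, b in zip(row1, row2)]
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2
    expected1 = [n1 * c / n for c in col_totals]
    expected2 = [n2 * c / n for c in col_totals]
    observed = row1 + row2
    expected = expected1 + expected2
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected1, expected2, chi2


expected_cases, expected_controls, chi2 = homogeneity_chi_square(cases, controls)
print([round(e, 2) for e in expected_cases])
print([round(e, 2) for e in expected_controls])
print(f"chi-square = {chi2:.3f} on {len(cases) - 1} degrees of freedom")
```

For the Table 1 counts this gives a statistic of about 47.29 on K − 1 = 5 degrees of freedom.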

Figure 1 indicates that the estimated line fits the distribution of the proportion of cases well. An additional assessment of the dose-response relationship is accomplished by partitioning the total chi-square value. The chi-square statistic that measures homogeneity ($H_0$: the proportion of cases is the same regardless of the degree of maternal x-ray exposure) is $\chi^2 = 47.286$. A chi-square value of this magnitude indicates the presence of some sort of nonhomogeneous pattern of response (p-value = 0.001) [7].

3. Multinomial Logistic Regression Analysis

Multinomial logistic regression analysis is useful for situations in which we want to classify subjects based on the values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general. In regression analysis, we use the numeric levels of a risk factor (the number of x-ray exposures) as the independent variable and the corresponding proportion of cases as the dependent variable, but in multinomial logistic regression a large number of records (frequencies) must be considered to establish an association between a risk factor and a disease [5].

In order to analyze a 2 × K contingency table using multinomial logistic regression analysis, the data in Table 1 were processed using SPSSWIN, and the numeric results were similar to those obtained by regression analysis [1]. That is, the association coefficient between risk factor and disease is 0.053 with a standard error of 0.008. A 95% confidence interval of the association coefficient is (0.0481, 0.0579).

4. Fuzzy analysis

In bioscience there are several levels of uncertainty, vagueness and imprecision, particularly in the medical and epidemiological areas, where the best and most useful descriptions of disease entities often comprise linguistic terms that are inevitably vague.
The theory of fuzzy logic has been developed to deal with the concept of partial truth values, ranging from completely true to completely false, and has become a powerful tool for dealing with imprecision and uncertainty, aiming at tractability, robustness and low-cost solutions for real-world problems. These features, and the ability to deal with linguistic terms, could explain the increasing number of works applying fuzzy logic to problems in biomedicine. In fact, the theory of fuzzy sets has become an important mathematical approach in diagnosis systems, treatment of medical images and, more recently, in epidemiology and public health [5] and [6]. For more on fuzzy logic theory, the book by Yen and Langari [8] is recommended.

A linguistic fuzzy model consists of a set of fuzzy rules and an inference method. The most common inference method is the Minimum of Mamdani, whose output is a fuzzy set. The fuzzy linguistic model to evaluate childhood cancer risk from prenatal x-ray exposure has two antecedents: malignancies in children under 10 years of age and information on the mother's exposure to x-ray. The model defines five fuzzy sets for the variable number of x-ray films to which the mothers were exposed (very low, low, medium, high and very high) and two fuzzy sets (cases and controls) for the variable describing children with a malignant disease and the series of controls (healthy children of the same age). The consequent of the model is the association between x-ray films and the malignancies in children under 10 years of age. We considered three fuzzy sets for this linguistic variable: weak, medium and strong. The rule base consists of the following rules:

1. If x-ray is very low and case then association is weak.
2. If x-ray is low and case then association is weak.
3. If x-ray is medium and case then association is weak.
4. If x-ray is high and case then association is medium.
5. If x-ray is very high and case then association is strong.

The association between the children's malignancies and x-ray films is determined by inference over the fuzzy rule set and defuzzification of the fuzzy output. The system was implemented in C++. The fuzzy sets for the input variable (number of x-rays) and for the output variable (association between childhood malignancies and x-ray) are displayed in Figure 2 and Figure 3 below.

Figure 2: Fuzzy sets for the input variable number of x-rays (membership functions VLOW, LOW, MEDIUM, HIGH and VHIGH plotted against X-ray).
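To make the inference step concrete, the five rules above can be evaluated with Mamdani minimum inference, maximum aggregation and a defuzzification step. The Python sketch below is only an illustration under several assumptions that the text does not specify: triangular membership functions, their breakpoints on the 0 to 5 x-ray axis, an output universe of 0 to 20, a crisp membership degree for "case", and centroid defuzzification. All names (`tri`, `INPUT_SETS`, `OUTPUT_SETS`, `infer`) are hypothetical, and the language differs from the C++ implementation mentioned in the text.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


# Assumed (hypothetical) fuzzy sets for the number of x-ray films, 0..5.
INPUT_SETS = {
    "very low": (0.0, 0.0, 1.25),
    "low": (0.0, 1.25, 2.5),
    "medium": (1.25, 2.5, 3.75),
    "high": (2.5, 3.75, 5.0),
    "very high": (3.75, 5.0, 5.0),
}

# Assumed (hypothetical) fuzzy sets for the association, on a 0..20 scale.
OUTPUT_SETS = {
    "weak": (0.0, 0.0, 10.0),
    "medium": (0.0, 10.0, 20.0),
    "strong": (10.0, 20.0, 20.0),
}

# The five rules of the rule base: (x-ray level, association), all for "case".
RULES = [
    ("very low", "weak"),
    ("low", "weak"),
    ("medium", "weak"),
    ("high", "medium"),
    ("very high", "strong"),
]


def infer(xray, case_degree=1.0, steps=200):
    """Mamdani min inference, max aggregation, centroid defuzzification."""
    # Firing strength of each rule: minimum over its two antecedents.
    strength = {}
    for level, association in RULES:
        w = min(tri(xray, *INPUT_SETS[level]), case_degree)
        strength[association] = max(strength.get(association, 0.0), w)

    # Clip each output set at its strength, aggregate with max, take centroid.
    numerator = denominator = 0.0
    for i in range(steps + 1):
        z = 20.0 * i / steps
        mu = max(min(w, tri(z, *OUTPUT_SETS[label]))
                 for label, w in strength.items())
        numerator += z * mu
        denominator += mu
    return numerator / denominator if denominator > 0 else None


for films in range(6):
    print(films, round(infer(films), 2))
```

Because rules 1 to 3 share the consequent "weak", their firing strengths are merged with a maximum before the output set is clipped; other aggregation or defuzzification choices would give different crisp values.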

Figure 3: Fuzzy sets for the output variable, association between childhood malignancies and x-ray (membership functions WEAK, MEDIUM and STRONG).

By combining all possible inputs it is possible to build 10 rules, but only 5 rules were considered because some situations cannot occur. For example, for mothers who were not exposed to x-ray, the children cannot have the disease because of x-ray exposure (if they do have it, it is for another reason). Although such a combination is mathematically possible, it was removed from the rule base, reducing the number of rules.

The fuzzy sets related to the linguistic input variable are presented in Figure 2. The membership function represents the degree of compatibility of an input with each category. In fact, the membership degree represents the possibility that the input belongs to the set. Figure 3 shows the membership functions of the output. The association increases monotonically as the number of x-ray films increases: it was 16% for weak, 17% for medium and 18% for strong associations, respectively. The weighted mean of the association between x-ray and the disease was 0.125 and the standard error was 0.0026. A 95% confidence interval of the association coefficient is (0.1178, 0.1322).

Discussion

In regression analysis, we use the numeric levels of a risk factor (the number of x-ray exposures) as the independent variable and the corresponding proportion of cases as the dependent variable. Furthermore, multinomial logistic regression needs a considerable number of records (frequencies) to establish an association between a risk factor and a disease. A fuzzy linguistic model has no such need.

The point biserial correlation coefficient ($r_{xy}$) and the regression coefficient ($\hat{b}_{y/x}$) are interrelated when calculated from a 2 × K table. For example, each has an expected value of zero when the variables X and Y are unrelated. The two statistics measure the association between the numeric levels of a risk factor and a disease in different ways but, in terms of probability, lead to the same inference. A measure of association assesses the strength of a relationship, while a statistical test gives an idea of the likelihood that such an association occurs by chance. Whereas regression and multinomial logistic regression give similar results, the fuzzy model gives rather different results for evaluating the association between the risk factor and the disease (see Table 3).

Table 3. Comparison between the results of the three methods

                          Regression       Multinomial logistic regression   Fuzzy model
Association coefficient   0.063            0.053                             0.125
Standard error            0.0061           0.0082                            0.0026
95% CI                    (.0577, .0683)   (.0481, .0579)                    (.1178, .1322)
p-value                   0.001            0.000

Table 3 shows that, for a data set of 16226 children, regression and multinomial logistic regression give similar results for the association between the risk factor and the disease, whereas the results from the fuzzy model are rather different.

References

[1] Ashour, S. K. and Salem, S. A. (2005). Statistical Presentation and Analysis using SPSSWIN, Part Two: Advanced Applied Statistics. Cairo University: ISSR.
[2] Bithell, J. F., and Steward, M. A. (1975). Prenatal Irradiation and Childhood Malignancy: A Review of British Data from the Oxford Study. Brit. J. of Cancer (31): 271-87.
[3] Breslow, N. E., and Day, N. E. (1987). Statistical Methods in Cancer Research, Volume II. Oxford University Press, Oxford, UK.
[4] Hardeo Sahai and Anwer Khurshid (1996). Statistics in Epidemiology: Methods, Techniques and Applications. CRC Press, New York.
[5] Luiz Fernando C. Nascimento and Neli Regina Ortega (2002). Fuzzy Linguistic Model for Evaluating the Risk of Neonatal Death. Rev Saude Publica, 36 (6): 686-92.
[6] Schwarzer G., Nagata T., Mattern D., Schmelzeisen R. and Schumacher (2003). Comparison of Fuzzy Inference, Logistic Regression, and Classification Trees (CART). Methods Inf Med; 42: 572-7.
[7] Steve, S. (1996). Statistical Analysis of Epidemiologic Data, 2nd ed. Oxford University Press, Oxford.
[8] Yen J. and Langari R. (1999). Fuzzy Logic: Intelligence, Control, and Information. Upper Saddle River (NJ), Prentice-Hall.