Basic Statistics and Data Analysis for Health Researchers from Foreign Countries


Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Volkert Siersma (siersma@sund.ku.dk)
The Research Unit for General Practice in Copenhagen

Content: quantifying association between continuous variables. In particular:
- Correlation
- (Simple) regression

Example: newly diagnosed Type 2 diabetes

A data set with 729 newly diagnosed Type 2 diabetes patients. Variables:
- pt: patient ID
- glucose: diagnostic plasma glucose (mmol/l)
- bmi: body mass index (kg/m2)
- sex: sex (1 = male, 0 = female)
- age: age (years)

First rows of the data set:

  pt  glucose      bmi  sex      age
   1     15.3 25.16070    0 53.02669
   2     12.1 22.96838    0 50.86653
   4     13.4 34.37500    0 87.73990
   5     14.0 26.16190    1 64.59411
   6     13.8 35.07805    0 62.10815
   7     13.8 26.71779    1 58.97604
   8     16.2 27.18233    1 82.46133
   9      8.5 33.70120    0 76.36687
  10     17.3 28.67547    1 72.63792
  11      8.6 26.21882    1 48.91170
  12     17.0 27.43951    0 53.40999
  13     15.4 32.67832    0 64.07392
  14      7.8 24.05693    1 63.86858
  15     16.4 25.12406    1 52.35318
  16      7.4 33.13134    0 42.77618
  17     11.6 30.12729    1 46.76797
  19     14.2 33.07857    0 63.45517
  20     14.4 29.24211    0 78.74333
  21     11.6 21.24225    1 66.66940

Research question: do fat people have a more severe diabetes when the diabetes is discovered? Or, in more statistical language: is diagnostic plasma glucose (positively) associated with body mass index at the time of diagnosis?

Scatter-plot

When investigating a potential association between only two variables (like diagnostic plasma glucose and BMI), a scatter-plot is an important part of the analysis:
- It gives insight into the nature of the association.
- It shows problems in the data, e.g. outliers and strange or impossible values.

There is no apparent tendency, and specifically not one that would support our research question; if we had to point out a tendency, it would be that high BMI is associated with lower diagnostic glucose (why is this not so strange if we think about how diabetes is diagnosed?).

There seem to be some very large values, especially for diagnostic plasma glucose. These are valid measurements. Maybe a log transformation of glucose would make associations more apparent?

R code for the scatter-plot:

plot(diabetes$bmi, diabetes$glucose, frame = TRUE, main = NULL,
     xlab = "BMI (kg/m2)", ylab = "Glucose (mmol/l)",
     col = "green", pch = 19)

[Figure slide: scatter-plot with log-transformed glucose.]

Measures of association

We want to capture the association between two variables in a single number: a correlation coefficient, a measure of association. Suppose that Y_i is the diagnostic plasma glucose of patient i and X_i the BMI of the same person. Then we want our measure of association to have the following characteristics:
- A positive association indicates that if X_i is large (relative to the rest of the sample) then Y_i is likely to be large as well.
- A negative association indicates that if X_i is large then Y_i is likely to be small.

Measures of association lie between -1 and 1:
- 0: no association
- 1: perfect positive association
- -1: perfect negative association

Measures of association for the diabetes data: r = -0.059, ρ = -0.050, τ = -0.034.

Measures of association for the diabetes data with log-transformed glucose: r = -0.053, ρ = -0.050, τ = -0.034. Only the first one changes!

Pearson's correlation coefficient

Pearson's correlation coefficient is computed from the data set (X_i, Y_i), i = 1, ..., N as

    r = Σ_{i=1}^{N} (X_i - X̄)(Y_i - Ȳ) / ((N - 1) · SD_x · SD_y)

where X̄ and Ȳ are the respective means and SD_x and SD_y the respective standard deviations.

Characteristics of Pearson's correlation coefficient:
- It measures the degree of linear association.
- It is invariant to a linear change of scale of the variables.
- It is not robust to outliers.
- Coefficient values that are comparable between different data sets, and moreover a valid confidence interval and p-value, require that both X_i and Y_i are Normally distributed.

R code:

> cor(diabetes$bmi, diabetes$glucose, use = "complete.obs")
[1] -0.05938123

This gives only the correlation coefficient.

> cor.test(diabetes$bmi, diabetes$glucose)

        Pearson's product-moment correlation

data:  diabetes$bmi and diabetes$glucose
t = -1.5995, df = 723, p-value = 0.1101
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.13162533  0.01349032
sample estimates:
        cor
-0.05938123

cor.test() also performs a statistical test of whether the coefficient differs from zero.
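As an illustrative sketch (in Python rather than the slides' R, and on made-up values, not the diabetes data set), the formula above can be computed directly, and the invariance to a linear change of scale checked numerically:

```python
import math

def pearson_r(x, y):
    """Pearson's r as on the slide:
    r = sum_i (x_i - xbar)(y_i - ybar) / ((n - 1) * sd_x * sd_y),
    with sd_x, sd_y the sample standard deviations (denominator n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    return cov / (sd_x * sd_y)

bmi = [25.2, 23.0, 34.4, 26.2, 35.1, 26.7, 27.2]      # made-up BMI-like values
glucose = [15.3, 12.1, 13.4, 14.0, 13.8, 13.8, 16.2]  # made-up glucose-like values
r = pearson_r(bmi, glucose)

# Invariance to a linear change of scale: rescaling BMI leaves r unchanged
assert abs(r - pearson_r([2.2 * b + 5 for b in bmi], glucose)) < 1e-12
```

Note that the invariance only holds for *linear* rescalings; a log transformation changes r, which is exactly why the slides report a different r (but the same ρ and τ) for log-transformed glucose.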

Normally distributed?

[Figure slides: histograms and Normal Q-Q plots of BMI, glucose and log(glucose), with a Normal distribution for comparison.]

R code for the histograms and Q-Q plots

A histogram of BMI:

hist(diabetes$bmi, main = "BMI", xlab = "BMI (kg/m2)", col = "green")

A Normal Q-Q plot of BMI:

qqnorm(diabetes$bmi, main = "BMI", col = "green")
qqline(diabetes$bmi, col = "red")

And how do we get all these works of art in some decent format?

jpeg(file = "D:\\mydirectory\\mypicture.jpg", width = 500, height = 500)
# put here the code that generates the picture
dev.off()

Rank correlation: Spearman's ρ

If the data do not appear to be Normally distributed, or when there are outliers, one may instead compute the correlation between the ranks of the X_i values and the ranks of the Y_i values. This gives a nonparametric correlation coefficient called Spearman's ρ:
- It measures monotone association.
- It is invariant to monotone transformations (like a log transformation).
- It is robust to outliers.
- It has an odd interpretation.

R code:

> cor.test(diabetes$bmi, diabetes$glucose, method = "spearman")

        Spearman's rank correlation rho

data:  diabetes$bmi and diabetes$glucose
S = 66678220, p-value = 0.1801
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho
-0.04983743

Warning message:
In cor.test.default(diabetes$bmi, diabetes$glucose, method = "spearman") :
  Cannot compute exact p-values with ties

Rank correlation: Kendall's τ

A measure of monotone association with a more intuitive interpretation than Spearman's ρ is Kendall's τ. The observations from a pair of subjects i, j are
- concordant if X_i < X_j and Y_i < Y_j, or X_i > X_j and Y_i > Y_j;
- discordant if X_i < X_j and Y_i > Y_j, or X_i > X_j and Y_i < Y_j.

Kendall's τ is the difference between the probability of a concordant pair and the probability of a discordant pair. There are various versions of Kendall's τ depending on how ties are treated.

Characteristics of Kendall's τ:
- It measures monotone association.
- It is invariant to monotone transformations (like a log transformation).
- It is robust to outliers.
- It has a more straightforward interpretation than Spearman's ρ.

R code:

> cor.test(diabetes$bmi, diabetes$glucose, method = "kendall")

        Kendall's rank correlation tau

data:  diabetes$bmi and diabetes$glucose
z = -1.3755, p-value = 0.169
alternative hypothesis: true tau is not equal to 0
sample estimates:
        tau
-0.03427314

Correlation in the diabetes data:
- untransformed: r = -0.059 (p = 0.110), ρ = -0.050 (p = 0.180), τ = -0.034 (p = 0.169)
- log-transformed glucose: r = -0.053 (p = 0.154), ρ = -0.050 (p = 0.180), τ = -0.034 (p = 0.169)
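Both rank correlations can be computed from first principles, which makes their definitions concrete. The following is a Python sketch (the slides use R; the data are made up): Spearman's ρ is the Pearson correlation of the ranks, and Kendall's τ (here the tau-a version, which ignores ties) counts concordant and discordant pairs.

```python
import math

def ranks(values):
    """Ranks of the values, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - xbar) ** 2 for v in x))
    sy = math.sqrt(sum((v - ybar) ** 2 for v in y))
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) // 2)

x = [25.2, 23.0, 34.4, 26.2, 35.1]  # made-up BMI-like values
y = [15.3, 12.1, 13.4, 14.0, 13.8]  # made-up glucose-like values
rho, tau = spearman_rho(x, y), kendall_tau(x, y)

# Both are invariant to a monotone (log) transformation of either variable,
# since ranks and pair orderings are unchanged
assert abs(rho - spearman_rho(x, [math.log(v) for v in y])) < 1e-12
assert abs(tau - kendall_tau(x, [math.log(v) for v in y])) < 1e-12
```

This invariance is exactly what the correlation tables above show: ρ and τ are identical before and after the log transformation of glucose, while r changes.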

Limitations of correlation coefficients

While it is (relatively) clear what a correlation coefficient of 0 means, and also 1 or -1, it is often unclear what a highly significant correlation of, say, 0.5 means. Correlation rarely answers the research question to a sufficient extent, because it is not easily interpretable. Moreover, correlation coefficients depend on the sample selection, so we cannot compare coefficient values found in different data sets.

Regression analysis

Regression is an (intuitively interpretable) way to describe a (linear) association between two continuous variables. We say: to regress Y on X, or: to regress glucose on BMI.

Regression model formulation

The model describes a response Y (the dependent variable, the endogenous variable, the output) as a function of a predictor X (the independent variable, the exogenous variable, the explanatory variable, the covariate) and a term representing random other influences (error, noise). Mathematically:

    Y_i = α + β X_i + ε_i

where the ε_i are independent Normally distributed noise terms with mean 0 and standard deviation σ.

Regression model

The mean of Y is modelled as a linear function of X: a line in the X-Y plane. For each X, Y is a random variable, Normally distributed around the modelled mean of Y with standard deviation σ.

Interpretation of the parameters

We have variation due to a systematic part, the explanatory variable, and a random part, the noise. The systematic part of the model is defined by the regression line:
- α, the intercept: the mean level of Y_i when X_i = 0.
- β, the slope: the mean increase in Y_i when X_i is increased by 1 unit.

Research question

Do fat people have a more severe diabetes when the diabetes is discovered? Or, in more statistical language: is diagnostic plasma glucose (positively) associated with body mass index at the time of diagnosis? In a (simple) linear regression analysis: is the slope β different from 0 (or, more pertinently, larger than 0)?

How does the model answer the research question?

Interest may focus on simple hypotheses about the two parameters:
- Null hypothesis: β = 0
- Null hypothesis: α = 0

The second hypothesis often has no (clinical) meaning.

Linear regression R code:

> mymodel <- lm(diabetes$glucose ~ diabetes$bmi)
> summary(mymodel)

Call:
lm(formula = diabetes$glucose ~ diabetes$bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6974 -3.5771 -0.8535  2.3008 49.1636

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   14.96096    1.08396   13.80   <2e-16 ***
diabetes$bmi  -0.05739    0.03588   -1.60    0.110
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.976 on 723 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.003526, Adjusted R-squared: 0.002148
F-statistic: 2.558 on 1 and 723 DF, p-value: 0.1101

The coefficients table holds the parameter estimates: -0.05739 is the estimate of the slope, and 0.110 is the p-value of the test of the null hypothesis β = 0.

Plot of the regression line: the line fitted by lm() can be added to the scatter-plot with abline():

> plot(diabetes$bmi, diabetes$glucose)
> mymodel <- lm(diabetes$glucose ~ diabetes$bmi)
> abline(mymodel)

Scatter-plot with regression line, log-transformed glucose:

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.627159   0.069354  37.881   <2e-16 ***
diabetes$bmi  -0.003277   0.002296  -1.428    0.154

How are the parameters estimated?

The estimated parameters of the linear model define the line (found among all possible lines) that minimizes the squared vertical distance between the data points and the line in the scatter-plot. The estimation method is called ordinary least squares; maximum likelihood gives the same answer.
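For simple linear regression the least-squares minimization has a closed-form solution, which the following Python sketch computes on made-up data (the slides do this with R's lm()):

```python
def ols_fit(x, y):
    """Ordinary least squares for y = a + b*x: minimizes the sum of squared
    vertical distances to the line. Closed form:
    b = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2),  a = ybar - b*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

# Data lying exactly on y = 2 + 3x is recovered exactly
a, b = ols_fit([0, 1, 2, 3], [2, 5, 8, 11])
assert abs(a - 2.0) < 1e-12 and abs(b - 3.0) < 1e-12
```

Note that the slope formula is the sample covariance divided by the sample variance of x, which also makes the link to Pearson's r explicit: b = r · SD_y / SD_x.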

Diagnostic plots

R produces some diagnostic plots (of varying usefulness):
- The residuals (the error, or noise) were assumed to be Normally distributed; this can be checked in the Q-Q plot (top right).
- More importantly, the residuals should have a single standard deviation, i.e. the variance should not increase with, for example, BMI. This can be checked in the residuals vs. fitted plot (top left).

> mymodel <- lm(diabetes$glucose ~ diabetes$bmi)
> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
> plot(mymodel)
> par(opar)

Data transformations

If the residuals are not Normal, or (and this is more serious, because the central limit theorem deals with much of the non-Normality issue) if the variance seems to increase with the level, it may be a good idea to transform one or both variables. This is the real reason to investigate log(glucose) instead of glucose.

[Figure slides: diagnostic plots after the log transform, and the influence of one outlier.]
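As a numeric companion to the diagnostic plots (a Python sketch on made-up data, not the diabetes set): OLS residuals with an intercept always sum to zero and are uncorrelated with the predictor, so the diagnostic plots are looking for structure beyond these built-in properties.

```python
def ols_fit(x, y):
    """Closed-form simple linear regression (intercept a, slope b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # made-up predictor values
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.3]      # made-up response values
a, b = ols_fit(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Built-in properties of OLS residuals (from the normal equations):
assert abs(sum(residuals)) < 1e-9                                # sum to zero
assert abs(sum(r * xi for r, xi in zip(residuals, x))) < 1e-9    # uncorrelated with x
```

What the plots add on top of this is a visual check of Normality (Q-Q plot) and of constant spread across fitted values (residuals vs. fitted).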

Simpson's paradox

Florida death-penalty verdicts for homicide, 1976-1987, by defendant's race:

                 Defendant white   Defendant black
Victim white     11% (53/414)      23% (11/37)
Victim black     0% (0/16)         3% (4/139)
All victims      11% (53/430)      8% (15/176)

For either victim's race, the probability that a black defendant gets the death penalty is about two times higher; aggregated over victims, it is the white defendants who appear more likely to get it. The explanation: blacks tend to murder blacks and whites tend to murder whites, and the murder of a white person has a higher probability of a death penalty.

Confounding

We are interested in the association between the defendant's race (the exposure) and the death penalty (the outcome), but the victim's race is correlated both with the defendant's race and with the outcome of the trial.

    Victim's race (confounder)
       /                  \
Defendant's race  -->  Death penalty
   (exposure)            (outcome)

A confounder influences both exposure and outcome. When confounding is present we cannot interpret the exposure-outcome association as causal.
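The reversal can be checked directly from the counts shown in the table; a short Python sketch:

```python
# Counts (death sentences, defendants) from the Florida table:
# keys are (victim's race, defendant's race)
table = {
    ("white", "white"): (53, 414),
    ("white", "black"): (11, 37),
    ("black", "white"): (0, 16),
    ("black", "black"): (4, 139),
}

def rate(deaths, total):
    return deaths / total

# Stratified by victim's race: black defendants fare worse in BOTH strata
for victim in ("white", "black"):
    assert rate(*table[(victim, "black")]) > rate(*table[(victim, "white")])

# Aggregated over victims, the direction reverses (Simpson's paradox):
# white defendants 53/430 vs. black defendants 15/176
assert rate(53 + 0, 414 + 16) > rate(11 + 4, 37 + 139)
```

Both within-stratum comparisons point one way and the pooled comparison points the other, which is exactly the confounding by victim's race described above.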

Randomization

Often there are many factors that may influence both exposure and outcome; some of them may not be observed, or are unknown. If the exposure is randomised, there is no confounding, and the exposure-outcome association can be interpreted as causal.

Two regressions

The blue points denote patients with SBP > 140 mmHg, and the blue line the corresponding regression line. The red points denote patients with SBP < 140 mmHg, and the red line the corresponding regression line. The black line is the overall regression line. The slopes from the stratified analyses are less steep than the slope of the overall line.

Multiple regression

> mymodel <- lm(log(diabetes$glucose) ~ diabetes$bmi + diabetes$sbp)
> summary(mymodel)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.639870   0.069389  38.045   <2e-16 ***
diabetes$bmi  -0.002625   0.002308  -1.137   0.2558
diabetes$sbp  -0.054447   0.024168  -2.253   0.0246 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adjusting a statistical analysis means including other predictor variables in the model formula. Intuitively, a slope for BMI is determined for each level of the SBP variable separately, and these slopes are then averaged. Including SBP in the analysis removes the confounding effect of SBP from the relationship between log(glucose) and BMI. The adjusted slope (association) of BMI is less pronounced than before: SBP is related to both glucose and BMI and is a confounder.
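How adjustment changes a slope can be demonstrated by simulation. The following Python sketch uses made-up data (the variable names are hypothetical, not the diabetes set) in which a confounder z drives both the exposure x and the outcome y, while y has no direct dependence on x; the crude slope of y on x is then spurious, and adjusting for z shrinks it toward zero.

```python
import random

random.seed(1)

def fit(X, y):
    """Least squares via the normal equations (X'X) beta = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    v = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p)]
                v[r] -= f * v[c]
    return [v[c] / A[c][c] for c in range(p)]

# Confounder z influences both "exposure" x and outcome y; no direct x -> y effect
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [2 * zi + random.gauss(0, 1) for zi in z]

b_crude = fit([[1, xi] for xi in x], y)[1]                  # y ~ x only
b_adj = fit([[1, xi, zi] for xi, zi in zip(x, z)], y)[1]    # y ~ x + z

# The crude slope is inflated by confounding; the adjusted slope is near zero
assert abs(b_adj) < abs(b_crude)
```

This mirrors the slide: including the confounder (there SBP, here z) in the model formula removes its confounding effect, and the adjusted association is less pronounced.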

Take-home message

Association between two continuous variables may be measured by correlation coefficients or by a (simple) linear regression analysis. The latter arguably provides the most interpretable results. Moreover, it is straightforwardly extended to deal with confounding, and more.