NORTH SOUTH UNIVERSITY | TUTORIAL 2
AHMED HOSSAIN, PhD
Data Management and Analysis
Correlation Analysis
INTRODUCTION
In correlation analysis we estimate a sample correlation coefficient, specifically the Pearson product-moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear relationship between two variables. The sign of r indicates the direction of the association; the magnitude of r indicates its strength. For example, r = 0.9 suggests a strong, positive association between two variables, whereas r = -0.2 suggests a weak, negative association. An r close to zero suggests no linear association between two continuous variables.
Limitation: two continuous variables may have a non-linear association, but the computation of r does not detect this.
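The slides compute correlations in R; as an illustration, the definition of r (covariance scaled by both standard deviations) can be sketched numerically in Python with made-up data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation: covariance of x and y
    divided by the product of their standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return float(np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])   # nearly linear in x (illustrative data)
print(round(pearson_r(x, y), 3))           # ~0.999, a strong positive association
```

A value this close to +1 reflects how tightly the points cluster around a rising straight line.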
Correlation Analysis
SCATTER DIAGRAM
We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable; thus Y = birth weight and X = gestational age. Note that the independent variable is plotted on the horizontal axis (X-axis) and the dependent variable on the vertical axis (Y-axis).
Correlation Analysis
SCATTER DIAGRAM
[Figure: scatter plot of birth weight (Y) against gestational age (X)]
Simple Linear Regression
INTRODUCTION
In simple linear regression we are concerned with the relationship between two variables, X and Y. There are two components to such a relationship:
1. The strength of the relationship.
2. The direction of the relationship.
We shall also be interested in making inferences about the relationship. We assume here that the relationship between X and Y is linear (or has been linearized through transformation).
Regression
INTRODUCTION
Regression is a technique used for the modeling and analysis of numerical data. It exploits the relationship between two or more variables so that we can gain information about one of them through knowing the values of the others. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.
Simple Linear Regression
ASSUMPTIONS
Suppose that we have a dataset (y_1, x_1), (y_2, x_2), ..., (y_n, x_n). Our interest is in using our model to predict values of Y for any given value of X = x. If we knew the values of β_0 and β_1, then the fitted value for observation y_i would be β_0 + β_1 x_i. The error in the fitted value can be measured by the vertical distance
ε_i = y_i − β_0 − β_1 x_i.
We would like to make these errors as small as possible.
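Minimizing the sum of the squared vertical errors gives the familiar closed-form least-squares estimates. A minimal Python sketch, using small hypothetical gestational-age/birth-weight numbers (not the slides' data):

```python
import numpy as np

def ols_fit(x, y):
    """Least-squares estimates b0, b1 minimizing sum((y - b0 - b1*x)^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([34.0, 36.0, 38.0, 40.0, 42.0])   # gestational age (weeks), hypothetical
y = np.array([1.9, 2.4, 2.9, 3.4, 3.8])        # birth weight (kg), hypothetical
b0, b1 = ols_fit(x, y)
resid = y - (b0 + b1 * x)                       # the vertical errors eps_i
```

With an intercept in the model, the residuals always sum to zero, which is one way to sanity-check a fit.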
Simple Linear Regression
EXAMPLE
[Figures: worked example of fitting a simple linear regression; slides not reproduced]
Multiple Linear Regression
INTRODUCTION
Extension of the simple linear regression model to two or more independent variables:
y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_n x_n + ε
For example: Expression = Baseline + Age + Tissue + Sex + Error.
Partial regression coefficients: β_i is the effect on the dependent variable of increasing the i-th independent variable by 1 unit, holding all other predictors constant.
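To make "holding all other predictors constant" concrete, here is a small simulation sketch (simulated data, not from the slides): we generate y from known coefficients and recover them by least squares on the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(30, 5, n)                  # e.g. age (simulated predictor)
x2 = rng.normal(0, 1, n)                   # a second simulated predictor
y = 1.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] estimates the partial effect of a 1-unit increase in x1 on y,
# holding x2 constant; beta[2] likewise for x2 holding x1 constant.
```

With a small error variance, the estimated partial coefficients land close to the true values 0.5 and -2.0.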
CATEGORICAL INDEPENDENT VARIABLES
[Slides: illustration of coding categorical predictors; figures not reproduced]
RESULTS FROM R

Call:
lm(formula = y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5021 -0.8847 -0.2502  0.5476  4.3438

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.694357   1.365469   3.438  0.00146 **
X1          -0.023186   0.023210  -0.999  0.32432
X2          -0.005716   0.007608  -0.751  0.45721
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.688 on 37 degrees of freedom
Multiple R-squared: 0.03497, Adjusted R-squared: -0.0172
F-statistic: 0.6703 on 2 and 37 DF, p-value: 0.5176
HYPOTHESIS TESTS: INDIVIDUAL REGRESSION COEFFICIENTS
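For each coefficient, the test of H0: β_i = 0 uses t = β̂_i / SE(β̂_i) with n − p degrees of freedom; these are the t values and Pr(>|t|) columns in the R output above. A numerical sketch on simulated data (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
x = rng.normal(0, 1, n)
y = 2.0 + 0.0 * x + rng.normal(0, 1, n)   # simulated data; true slope is 0

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = n - X.shape[1]                                  # residual degrees of freedom
sigma2 = resid @ resid / df                          # residual variance estimate
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t = beta / se                                        # t = estimate / standard error
p = 2 * stats.t.sf(np.abs(t), df)                    # two-sided p-values
```

These hand-computed t and p values match what lm() (or scipy.stats.linregress) would report for the same data.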
HYPOTHESIS TESTING: MODEL UTILITY TEST
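The model utility test asks whether all slope coefficients are simultaneously zero, using F = (SSR/k) / (SSE/(n − k − 1)); this is the F-statistic line in the R output. A sketch with simulated data in which one predictor genuinely matters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 40, 2                                   # n observations, k predictors
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 1.0 + 0.8 * X1 + rng.normal(size=n)        # simulated: X1 matters, X2 does not

X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta) ** 2)              # error sum of squares
sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
ssr = sst - sse                                # model sum of squares
F = (ssr / k) / (sse / (n - k - 1))            # model utility F statistic
p = stats.f.sf(F, k, n - k - 1)                # upper-tail p-value
```

Because X1 has a real effect here, the F test rejects the null that both slopes are zero.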
THE COEFFICIENT OF DETERMINATION
The total sum of squares (SST) measures the variability in y_1, ..., y_n without taking the covariate into account. The error sum of squares (SSE) is the amount of variability left after fitting a linear regression on the covariate. The model sum of squares (SSR) is the amount of variability explained by the model. The proportion of the variability explained by the model is
R² = SSR / SST = 1 − SSE / SST.
In simple regression, R² is the square of the sample correlation between x_1, ..., x_n and y_1, ..., y_n.
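The identity between R² and the squared sample correlation in simple regression can be checked numerically (illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])

# Fit the simple linear regression by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # error sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1 - sse / sst                       # coefficient of determination

r = np.corrcoef(x, y)[0, 1]              # sample correlation
print(np.isclose(r2, r ** 2))            # True: R^2 equals r squared
```

This identity holds exactly for any simple linear regression fitted with an intercept.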
BIRTH WEIGHT (CONTINUOUS OUTCOME) WITH CATEGORICAL INDEPENDENT VARIABLES
RESULTS
[Slide: regression output not reproduced]
INTERACTION
Interaction effects represent the combined effects of variables on the dependent measure. When an interaction effect is present, the impact of one variable depends on the level of the other variable.
EXAMPLE 1: Interaction between adding sugar to coffee and stirring the coffee. Neither variable alone has much effect on sweetness, but the combination of the two does.
EXAMPLE 2: Interaction between smoking and inhaling asbestos fibres. Both raise the risk of lung carcinoma, but asbestos exposure multiplies the cancer risk in smokers. Here, the joint effect of inhaling asbestos and smoking is higher than the sum of the two individual effects.
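In a regression model, an interaction is represented by a product term: y = β_0 + β_1 x_1 + β_2 x_2 + β_3 (x_1 x_2) + ε. A simulation sketch of the asbestos-and-smoking pattern, with made-up coefficients where the joint effect exceeds the sum of the individual effects:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
smoke = rng.integers(0, 2, n).astype(float)     # 0/1 smoking indicator (simulated)
asbestos = rng.integers(0, 2, n).astype(float)  # 0/1 asbestos indicator (simulated)
# True model: joint effect (2 + 3 + 4 = 9) exceeds the sum of the
# individual effects (2 + 3 = 5) because of the interaction term.
risk = 1 + 2 * smoke + 3 * asbestos + 4 * smoke * asbestos + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), smoke, asbestos, smoke * asbestos])
beta, *_ = np.linalg.lstsq(X, risk, rcond=None)
# beta[3] estimates the interaction: the extra effect of asbestos among
# smokers beyond the separate main effects.
```

A test of H0: β_3 = 0 (via its t statistic, as on the earlier slide) is the standard way to decide whether the interaction is needed.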
IDENTIFYING INTERACTION
CATEGORICAL PREDICTORS: Suppose the researcher is interested in whether the treatment is equally effective for females and males; that is, does the effect of treatment depend on gender group? This is a question of interaction. In an interaction plot, parallel lines indicate no interaction; lines that are not parallel (and may cross) indicate an interaction.
CONTINUOUS PREDICTORS: Use a simple slopes test, examining the slope of one predictor at fixed values of the other.