Examining Relationships: Least-squares regression (Section 2.3)
The regression line A regression line describes a one-way linear relationship between variables. An explanatory variable, x, explains variability in a response variable, y. Often one wants to predict y from a given x.
The least-squares regression line The least-squares regression line is ŷ = b₀ + b₁x, with slope b₁ = r(s_y / s_x) and intercept b₀ = ȳ − b₁x̄, where s_x and s_y are the standard deviations of x and y, and r is the correlation of x and y. A prediction, ŷ, is made by plugging a value of x into the equation.
Interpretations The least-squares line minimizes the sum of squared vertical prediction errors. Note: because the errors are measured vertically, interchanging x and y would change the fitted line.
Interpretations The slope, b₁, is the change in ŷ when x increases by one unit. The intercept, b₀, is the prediction, ŷ, at x = 0. Example: BAC data. Each beer increases predicted BAC by 0.0180. The predicted BAC after no beers is -0.0127 ≈ 0; the small nonzero value reflects random variability.
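These interpretations can be checked directly with the fitted BAC coefficients (a minimal sketch; the function name is mine):

```python
b0, b1 = -0.0127, 0.0180  # intercept and slope from the BAC example

def predict_bac(beers):
    """Predicted BAC, y-hat = b0 + b1 * x, for a given number of beers."""
    return b0 + b1 * beers

at_zero = predict_bac(0)                    # the intercept: -0.0127, essentially 0
per_beer = predict_bac(5) - predict_bac(4)  # one extra beer adds the slope, 0.0180
```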
Coefficient of determination, r² The coefficient of determination, r², measures the proportion of variability in y explained by the regression line: r² = (variance of ŷ) / (variance of y). Example: r² = 0.76 means the line explains 76% of the variability in y.
Residuals Analysis of residuals, y − ŷ, helps to assess the suitability of a linear relationship.
Residual plots The ideal plot of residuals (y − ŷ against x) would exhibit no systematic pattern.
Problem indicators Systematic patterns in the residuals suggest complications and cast doubt on the use of linear regression. Curved pattern: deviation from the linear form. Trends in spread: less prediction accuracy in some regions of x.
Influential observations An influential observation is an observation whose deletion would drastically change the regression line. [Figure: scatterplot with two fitted lines, one for the complete data and one with point #3 deleted.] An influential observation is often an outlier in x, but need not be an outlier in y.
Examining Relationships: Cautions about correlation and regression (Section 2.4)
Basic cautions Correlation is for two-way relationships; regression is for one-way relationships. Both are only relevant for linear relationships. Neither is resistant.
Extrapolation Extrapolation is making predictions outside the range of the data. The linear relationship may be untrustworthy there. Example: BAC data. Predicted BAC after x = 24 beers: ŷ = 0.4184; after x = 36 beers: ŷ = 0.6340. Such levels are associated with unconsciousness or death.
Lurking variables A lurking variable may influence the relationship between variables. An unobserved lurking variable may explain puzzling associations. Example: higher rates of red-wine drinking are associated with better levels of overall health. Possible lurking variables: income, other lifestyle tendencies, etc.
Association is not causation An observed association may reflect the influence of a causal lurking variable. An experiment that controls lurking variables is best for establishing causation. Example: BAC: control weight, gender, etc. Causation may be established in other ways, but with weaker evidence.
Examining Relationships: Relationships in categorical data (Section 2.5)
Two-way tables Relationships in categorical data may be explored by compiling the variables in a two-way table, with a row variable (here, education) and a column variable (age group).

(counts/1000)          Age group
Education            25–34    35–54     55+
< High school         4459     9174   14226
High school          11562    26455   20060
College 1–3 yrs.     10693    22647   11125
College 4+ yrs.      11071    23160   10597
Marginal distributions The marginal distributions are the individual distributions of the row and column variables. (They appear in the margins of the two-way table.)

(counts/1000)          Age group                      Row
Education            25–34    35–54     55+       totals
< High school         4459     9174   14226        27859
High school          11562    26455   20060        58077
College 1–3 yrs.     10693    22647   11125        44465
College 4+ yrs.      11071    23160   10597        44828
Column totals        37785    81436   56008       175229
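Using the counts from the table, the marginal distributions are just row and column sums; a sketch in Python:

```python
# Counts (in thousands) from the education-by-age two-way table
table = {
    "< High school":    [4459, 9174, 14226],
    "High school":      [11562, 26455, 20060],
    "College 1-3 yrs.": [10693, 22647, 11125],
    "College 4+ yrs.":  [11071, 23160, 10597],
}

row_totals = {edu: sum(counts) for edu, counts in table.items()}  # education margin
col_totals = [sum(col) for col in zip(*table.values())]           # age-group margin
grand_total = sum(row_totals.values())
```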
Conditional distributions A conditional distribution is calculated from the counts of one variable limited to a given category of the other variable. Example: the distribution of age among those with less than a high-school education is 4459/27859 ≈ 16%, 9174/27859 ≈ 33%, 14226/27859 ≈ 51%.

(counts/1000)          Age group                      Row
Education            25–34    35–54     55+       totals
< High school         4459     9174   14226        27859
High school          11562    26455   20060        58077
College 1–3 yrs.     10693    22647   11125        44465
College 4+ yrs.      11071    23160   10597        44828
Column totals        37785    81436   56008       175229
Visualizing relationships Describe relationships with conditional distributions.

(% of age group,
given education)       Age group              Row
Education            25–34   35–54    55+   totals
< High school         16%     33%     51%    100%
High school           20%     45%     35%    100%
College 1–3 yrs.      24%     51%     25%    100%
College 4+ yrs.       25%     52%     24%    100%

[Figure: four bar charts, one per education level, each showing the conditional distribution of age group (25–34, 35–54, 55+).]
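The conditional distributions of age within each education level come from dividing each row by its total; a sketch (rounding to whole percents, which may differ by a point from the slide's rounding):

```python
# Counts (in thousands) from the education-by-age two-way table
table = {
    "< High school":    [4459, 9174, 14226],
    "High school":      [11562, 26455, 20060],
    "College 1-3 yrs.": [10693, 22647, 11125],
    "College 4+ yrs.":  [11071, 23160, 10597],
}

# Conditional distribution of age group, given education level (percents)
conditional = {
    edu: [round(100 * c / sum(counts)) for c in counts]
    for edu, counts in table.items()
}
```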
Producing data: Introduction (Chapter 3)
Observational studies and experiments Central issue: the (undesirable) possibility of confounding between an explanatory variable and a lurking variable. In an observational study, individuals are observed, but no attempt is made to control the conditions of data production. Such studies are often plagued by confounding with lurking variables. In an experiment, the conditions of data production are controlled by applying treatments to individuals. A well-designed experiment guards against confounding.
Producing data: Designing samples (Section 3.1)
Key elements of a sampling study Population: the collection of individuals about which the conclusions of statistical inference are to be relevant. Sample: the subset of the population on which data are measured and analyzed. Sampling design: the method used to select the sample from the population.
Biased sampling designs Biased sampling favors some portions of the population over others. Examples: Voluntary sampling: individuals are self-selected by responding to an incentive. Convenience sampling: selection is determined by what is convenient for the sampler.
Probability sampling Probability sampling uses chance to select a sample, with known selection probabilities (e.g., drawing labels from a hat, computer simulation, a table of random numbers). Any bias is accounted for using knowledge of the selection probabilities. Examples: Simple random sampling: each fixed-size subset of the population has the same probability of selection (unbiased). Stratified sampling: simple random samples are drawn within distinct strata and aggregated.
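Both designs can be sketched with a hypothetical population of labeled individuals (the labels and strata below are mine, for illustration only):

```python
import random

population = [f"person_{i}" for i in range(100)]  # hypothetical labels

# Simple random sample: every subset of size 10 is equally likely
srs = random.sample(population, 10)

# Stratified sample: an SRS within each stratum, then aggregated
strata = {
    "urban": population[:60],  # hypothetical strata
    "rural": population[60:],
}
stratified = [p for members in strata.values() for p in random.sample(members, 5)]
```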
Other sources of bias Under-coverage in the population list. Non-response of sampled individuals. Inaccurate responses of the respondent (response bias). May be unintentionally encouraged by the interviewer. Poor questionnaire design and wording.
Producing data: Designing experiments (Section 3.2)
Terminology

General                            Experiments
Individuals                        Subjects
Explanatory variables              Factors
Value of an explanatory variable   Level of a factor

The level of a factor reflects the application of a treatment used to modify the experimental conditions in a specific way.
Principles of experimental design Use comparisons to cancel the effects of lurking variables. A control group (i.e., sham treatment, or placebo) may serve as a baseline comparison. Use randomization to allocate subjects among treatments. Use replication to reduce random variability. Patterns in the response are statistically significant if they are of such magnitude that they would rarely be observed by chance.
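Randomized allocation of subjects between a treatment and a placebo control group can be sketched as follows (the subject labels are hypothetical):

```python
import random

subjects = [f"subject_{i}" for i in range(20)]  # hypothetical subjects
shuffled = subjects.copy()
random.shuffle(shuffled)  # randomization: chance decides the allocation

# Comparison: half receive the treatment, half a placebo control
treatment_group = shuffled[:10]
control_group = shuffled[10:]
```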
Problem issues Several issues may potentially undermine the principles of experimental design: Unconscious bias of the experimenter or subject. Avoid by double-blind application of treatments. Lack of realism in the subjects, treatments, or the experimental setting.