
The Choice of Variables in Observational Studies
Author(s): D. R. Cox and E. J. Snell
Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 23, No. 1 (1974)
Published by: Wiley for the Royal Statistical Society
Stable URL: org/stable/2347053

Appl. Statist. (1974), 23, No. 1, p. 51

The Choice of Variables in Observational Studies†

By D. R. COX and E. J. SNELL

Imperial College, London

SUMMARY

A review is given of considerations affecting the choice of explanatory variables in observational studies. Aspects of both design and analysis are considered. In particular the choice of explanatory variables in multiple regression is discussed and some recommendations made.

Keywords: MULTIPLE REGRESSION; ANALYTICAL SURVEYS; MEDICAL APPLICATION; SELECTION OF VARIABLES; DESIGN OF INVESTIGATIONS; OBSERVATIONAL STUDIES

1. INTRODUCTION

This paper reviews some general aspects of the choice of variables in observational studies. To keep the paper concise only outline examples have been included and to be specific these are medical, although the ideas apply widely.

Observational studies, where they are not purely descriptive, have as their objective the explanation or prediction of some response in terms of explanatory or predictor variables. It is useful to have two examples in mind.

Example 1. Consider an investigation into the incidence of a respiratory disease among a certain group of workers. The response variable may be severity of the disease, with possible explanatory variables being the worker's age, physical status, working conditions, previous employment, etc. Some variables may be more important than others in explaining the severity of the disease.

Example 2. A different situation is one of trying to predict the time to death among patients known to be suffering from a progressive and fatal disease. Possible predictive variables are type of treatment, treatment variables such as dose, clinical and biochemical measurements made on diagnosis, etc.

Although careful discussion of the most appropriate way to measure response is always important, and often several different measures will be called for, nevertheless what response variables to consider is frequently fairly clear-cut. Thus in Example 1, severity may be assessed radiologically and graded according to standard levels. In Example 2, time to death is likely to be measured from time of diagnosis.

In this paper we concentrate on the explanatory variables: how many such variables should be measured and, if many are observed, how should the analysis be handled to find the most relevant ones? These are difficult issues. Many of the following points are rather trite when put in general terms and do not lend themselves very well to quantitative discussion. On the other hand, the decision as to what to do in any particular investigation can be hard.

† This paper is based on one prepared for the Division of Research in Epidemiology and Communications Science, World Health Organization.

2. SELECTION OF EXPLANATORY VARIABLES FOR MEASUREMENT

The following general aspects of a study will influence the nature and number of explanatory variables measured:

(a) Whether the study is intended to investigate some rather specific hypothesis about the phenomenon or whether it is designed to screen out the most important variables from rather a large number of possibly relevant variables, the important variables to be examined in detail subsequently, possibly by experiments rather than by observational studies. In the former case it is important to try to anticipate the main explanations competing with the hypothesis under test and to measure relevant variables.

(b) Whether the response variables are observed quite quickly, so that the later parts of the study can be modified, if necessary, in the light of the earlier results. If this is not possible, it is more likely to be necessary to measure many variables on each individual.

(c) Questions of economy of time, ease of setting up instruments, difficulty of contacting individuals, loss of accuracy arising from increased work-load, availability of "good" official statistics, etc. will often be crucial in deciding how many variables can be measured.

(d) Variables may be included primarily to establish comparisons with previous related studies.

In many studies binary explanatory variables will be adequate in analysis, except for the most important variables, provided that the split between the two categories is appropriately made. On the other hand, poorly defined binary variables may be virtually useless and for this reason it will often be essential to record on a more than two-point scale. Usually it will be sensible to arrange that binary explanatory variables are constructed to have roughly a 50-50 split, in order for the effect on response to be as clear-cut as possible; but if the response also is binary and its effect appreciable, or if very non-linear effects are involved, the position is more complicated (Cox, 1969).

Multiplicity in explanatory variables can arise in two rather different ways. We may measure a number of quite different properties, or we may have a number of ways of measuring what is essentially one property. For example, occurrence of particular symptoms may be elicited by a single question or probed by a battery of related questions.

Many possibilities exist for more sophisticated design, especially where the total number of explanatory variables is large and accuracy of measurement is likely to drop if all variables are measured on all subjects. One possibility is to measure only a subset of variables on each subject, the variables being chosen in a suitably balanced way. Another possibility, where some of the variables are arranged in batteries as indicated above, is not to measure each full battery on each subject, but to measure detailed variables only on subsets of individuals. Used with care these ideas should be fruitful, especially in very large-scale investigations. Of course simplicity in design remains a vitally important requirement.

3. BROAD PROBLEMS OF ANALYSIS

Two main kinds of response variable commonly encountered are binary (e.g. occurrence, non-occurrence) and measurements that, possibly after transformation, lead to approximately normally distributed data. An important point is that while the


precise techniques of analysis will be different for different kinds of response, the broad strategy to be adopted and the difficulties of interpretation likely to be encountered are the same. We shall concentrate in Section 4 largely on the techniques for normal theory multiple regression, simply because this is the most thoroughly investigated case.

The following general points have to be considered:

(a) There is a working distinction between producing (i) a fit to the data useful for future prediction in the absence of major changes in the system and (ii) an "explanation" which will link with other studies, e.g. fundamental laboratory work, and will predict under quite different circumstances. For (i), two quite different models, involving different explanatory variables, are equally acceptable if they fit the data equally well. If a choice has to be made between them, it may be done on the basis of simplicity, e.g. in terms of the number of explanatory variables necessary or the ease with which the relevant variables can be measured; for a quantitative decision theoretic analysis, see Lindley (1968). In (ii), however, it is usually of central importance to find which explanatory variables have important effects.

(b) Even in the first case of prediction in the narrow sense it will not normally be wise to include all predictor variables. This is both for reasons of simplicity and because typically the mean square error of prediction will be raised by including too many variables.

(c) The main difficulties in dealing with observational studies stem from two rather different sources, the omission of relevant variables from those measured and the presence of fairly high dependencies among the explanatory variables. As a simple illustration of the first situation consider the relation between time to death y and the level of some prescribed "dose" x.
If the dose level is determined by the severity of disease, the dependence of y upon x as given by a simple linear regression cannot be interpreted as predicting the change in y for a particular individual given a change in x. Only if the omitted variable, severity of disease, is included as a further explanatory variable is a "causal" interpretation at all feasible. The second situation would, for example, arise in measuring the percentage of substances A, B,... in a compound, where there will be an exact linear relationship between the percentages in a compound. Although this is an extreme example, close dependencies are often unavoidable if a large number of explanatory variables are measured. Interpretation is difficult because many apparently different models may fit the data almost equally well. A principal component analysis of the explanatory variables may be tried in these circumstances (Jeffers, 1967). In designed experiments, balance and randomization largely overcome these difficulties; the omission and hence randomization of an important variable leads to an increased error variance and to a seriously incomplete understanding, but not to a "biased" conclusion. It may sometimes be possible to take additional observations at values of the explanatory variables chosen so as to reduce non-orthogonalities in the data (Dykstra, 1966; Gaylor and Merrill, 1968; Silvey, 1969). In observational studies in which the objective is the comparison of, say, a treatment with a control, matching individuals to remove bias is likely to be useful (Cochran, 1965, 1972). Of course in a very large-scale study the amount of data collected may be so great that quite apart from the difficulties of principle alluded to above there may be limitations imposed on computational grounds, or because of human limitations in what can be absorbed.
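The dose and severity illustration in point (c) can be made concrete with a small simulation. The sketch below is not from the paper: the sample size, coefficients and noise levels are invented for illustration, and the helper `ols` is a hypothetical minimal least-squares routine. It assumes that dose is raised for more severe cases and that the true effect of dose on time to death is +1 unit.

```python
import random


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]


rng = random.Random(0)
n = 2000  # illustrative sample size (assumption)

# Assumed data-generating mechanism: sicker patients receive higher doses,
# severity shortens survival, and dose itself lengthens it by 1 unit.
severity = [rng.random() for _ in range(n)]
dose = [2.0 + 3.0 * s + rng.gauss(0.0, 0.3) for s in severity]
time_to_death = [10.0 - 6.0 * s + 1.0 * x + rng.gauss(0.0, 0.5)
                 for s, x in zip(severity, dose)]

naive = ols([dose], time_to_death)               # severity omitted
adjusted = ols([dose, severity], time_to_death)  # severity included

print("slope on dose, severity omitted: ", round(naive[1], 2))
print("slope on dose, severity included:", round(adjusted[1], 2))
```

Under these assumed settings the regression on dose alone gives a negative slope, while the regression that includes severity recovers a slope near the assumed value of +1. The simple regression describes the population as observed but does not predict what a change of dose would do for a given individual.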


In deciding what variables to include it is important to take account of additional information, for example as to which variables are likely to be alternatives to one another and which it is almost certain to be necessary to include. It is equally important that any such "prior" knowledge inserted into the analysis should be tested for consistency with the data.

There are two approaches to the inclusion of general classification variables like age, sex, etc. One is to make separate analyses for men and women, combining the conclusions at a later stage if they seem compatible. The other is to fit a composite model in which say the sex difference is represented by a single parameter; this assumes that in some sense there is no interaction between the main explanatory variables and sex, an assumption that can be tested at least informally. With large sets of data it will, however, frequently be sensible to analyse in a series of sections, merging the analyses in a second stage. In any case the examination of the consistency of conclusions from independent sets of data is an important and simple technique for assessing precision.

Note that there is a distinction implied here between genuine explanatory variables and classification variables that serve merely to define major subclasses of individuals. This is relevant when we are looking for proper "causal" relations. That is, it is not an "explanation" to say that a death rate for men is greater than that for women.

4. MORE DETAILED PROBLEMS OF ANALYSIS

We now consider in more detail situations where for each individual there is a continuous response variable $y$ and a number of explanatory variables $x_1, \ldots, x_p$. Suppose that we work provisionally with the assumption that the expected value of $y$ is a linear function of the explanatory variables. It is assumed that any preliminary transformation of the response variable, e.g. from response time to its logarithm, has been made, and also that the data have been edited to remove gross errors and to isolate suspect values. We shall not describe the large body of statistical methods and theory associated with the linear model; for an introductory account, see Draper and Smith (1966) and for general comments Cox (1968).

Formal significance tests are a useful guide to the importance of different explanatory variables, but should not be followed too rigidly. One reason for caution is that the tabulated significance levels of the F distribution refer to a single test carried out in isolation; in practice we are nearly always concerned with a chain of related tests and this makes the interpretation of the ordinary significance levels indirect (Draper et al., 1971; Pope and Webster, 1972; Spjøtvoll, 1972).

The difficulties of interpretation caused by non-orthogonality are less important if interest is purely in prediction over the range of explanatory variables covered by the data. Although several rather different looking equations will often have similar residual mean squares, it may be unimportant which equation is used. An equation with few explanatory variables will, however, give biased estimates if the omitted variables are at all relevant. The extent of bias, averaged over the observed distribution of the explanatory variables, in an equation with $k$ variables is indicated by the statistic suggested by C. L. Mallows (Gorman and Toman, 1966)

$$C_k = (\text{residual sum of squares})/\hat{\sigma}^2 - (n - 2k),$$

where $n$ denotes sample size and $\hat{\sigma}^2$ is a separate estimate of $\sigma^2$; in the absence of bias, $E(C_k) \simeq k$. Given several equations with similar residual mean squares, one

with small bias is likely to be preferred. This does not necessarily mean one with many explanatory variables; increasing the number of variables may reduce the bias but at the expense of increasing the total error of prediction. If predicting outside the observed region of the explanatory variables, different equations will give vastly different predictions. Methods for selecting single well-fitting equations from a large set will be reviewed at the end of this section.

In most applications, however, the particular variables affecting response and the directions of their effects are of intrinsic interest and then the selection of just one well-fitting equation from among many is unsatisfactory and possibly very misleading. In principle, the following procedure seems a sensibly cautious approach in such situations. All possible $2^p$ equations are fitted (Garside, 1965; Schatzoff et al., 1968; Morgan and Tatar, 1972) and those clearly inconsistent with the data rejected; that is, equations with a residual mean square significantly greater than the mean square residual from the full model are rejected. Typically if an equation involving a subset $\mathcal{S}$ of explanatory variables is consistent with the data, so is that based on a larger subset $\mathcal{S}'$, $\mathcal{S}' \supset \mathcal{S}$. (Any exceptions to this will be minor ones depending on the particular levels of significance used.) Such a subset we call primitive. A program to find the primitive models and associated information has been written at Imperial College by Mrs M. Ansell and is available for use on the CDC 6400 computer. If there is only one primitive model, the situation is fairly clear-cut; where there is more than one, a choice between them can be made only on the basis of additional information.
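The all-subsets procedure and the Mallows statistic described above can be sketched in a few lines. This is a hedged illustration, not the Imperial College program mentioned in the text. It makes several assumptions: the data are simulated, with only the first of three variables actually affecting the response; "consistent with the data" is judged by an extra-sum-of-squares F statistic against one crude threshold `f_crit` (proper critical values depend on the degrees of freedom); $k$ in $C_k$ counts fitted coefficients including the intercept; and "primitive" is read as a consistent subset none of whose proper subsets is consistent, an interpretation of the passage rather than a definition it states outright. The names `rss`, `all_subsets` and `primitive_subsets` are hypothetical.

```python
import random
from itertools import combinations


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Solve the normal equations [X'X | X'y] by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def all_subsets(columns, y, f_crit):
    """Fit all 2^p equations; return {subset: (RSS, C_k, consistent?)}."""
    n, p = len(y), len(columns)
    full_rss = rss(columns, y)
    s2 = full_rss / (n - p - 1)  # residual mean square of the full model
    table = {}
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            r = rss([columns[j] for j in subset], y)
            k = size + 1  # coefficients incl. intercept (assumed convention)
            c_k = r / s2 - (n - 2 * k)
            dropped = p - size
            f_stat = (r - full_rss) / dropped / s2 if dropped else 0.0
            table[frozenset(subset)] = (r, c_k, f_stat <= f_crit)
    return table


def primitive_subsets(table):
    """Consistent subsets with no consistent proper subset (assumed reading)."""
    consistent = [s for s, (_, _, ok) in table.items() if ok]
    return [s for s in consistent if not any(t < s for t in consistent)]


rng = random.Random(1)
n = 60
x = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(3)]
y = [2.0 + 3.0 * x[0][i] + rng.gauss(0.0, 0.5) for i in range(n)]

table = all_subsets(x, y, f_crit=4.0)  # 4.0 is an illustrative threshold
prim = primitive_subsets(table)
for s in sorted(table, key=lambda s: (len(s), sorted(s))):
    r, c_k, ok = table[s]
    print(sorted(s), round(r, 1), round(c_k, 1), "consistent" if ok else "rejected")
print("primitive:", [sorted(s) for s in prim])
```

With $\hat{\sigma}^2$ taken from the full model, $C_k$ for the full equation equals its number of coefficients exactly, which gives a quick check on the arithmetic. Every subset omitting the first variable is rejected in this simulated setting, so each primitive subset contains it.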
Unfortunately this procedure, even with sophisticated numerical analytic and programming techniques (Wampler, 1970; Mullet and Murray, 1971), does not seem feasible for more than explanatory variables. If more explanatory variables are available, as will often be the case for example in large epidemiological studies, it follows that some reduction will be essential before the above method can be used. The use of several alternative reductions will usually be desirable. The main methods for such reduction are as follows.

(a) We may examine sets of explanatory variables. If the data contain batteries of questions, some form of total score may be adequate. This can be tested for consistency with the data.

(b) Some variables may be specified for definite inclusion, for example if interest lies primarily in the supplementary effect of other variables.

(c) Classification variables (such as sex, age, etc.) may be used to split the data into sections for separate analysis.

(d) It may be thought on general grounds that a regression coefficient or regression coefficients, associated with a particular variable, even one not of primary importance, should be of a certain sign, e.g. should not be negative. Occasionally this may be helpful in clarifying the relationship. In any fitted model for which the regression coefficient is of the wrong sign, but not significantly different from zero, the coefficient is replaced by zero, i.e. the variable is in effect omitted. An estimate significantly different from zero and of the wrong sign implies that the prior assumption is wrong, or that the wrong form of relation is being fitted, or that an important variable has been omitted, or that by chance an extreme fluctuation has occurred.

(e) When special relationships can be postulated among the explanatory variables the methods of path analysis can be used; see, for example, Turner and Stevens (1959). These methods, originally developed in connection with genetics, have

more recently been examined by sociologists, for example, by Blalock and Blalock (1968). The general idea is partly that the postulation or discovery of a series of special relationships between the variables will clarify the whole problem and partly that such relationships will increase the precision of estimates and hence help to resolve ambiguities.

The above approaches involve injecting some further external information. The remaining devices are essentially general computational devices; see Draper and Smith (1969) for a review up to that date.

(f) The most commonly used procedures for progressive selection of variables are forward selection, backward elimination (Hamaker, 1962; Oosterhoff, 1963; Abt, 1967; Mantel, 1970), stepwise regression (Efroymson, 1960; Breaux, 1968; Goodman, 1971) and "optimum" regression. These will not be described in detail. "Optimum" regression finds that equation which for a specified number of explanatory variables has the minimum residual sum of squares. An algorithm by Beale et al. (1967) makes it unnecessary to evaluate all regressions; the procedure is claimed to be manageable provided the number of variables is not much in excess of 20 (Beale, 1970).

(g) Newton and Spurrell (1967a, b) have proposed a method called element analysis for assessing the information provided by all $2^p$ fits; $2^p - 1$ elements, to be used in conjunction with certain rules, are calculated from the sums of squares attributable to regression.

(h) A suggestion of Gorman and Toman (1966) is to calculate a fractional factorial of the $2^p$ possible regressions and to select variables by a subjective inspection of the values of the residual mean square or the statistic $C_k$ for the computed regressions. Further evidence is needed on the efficiency of this procedure; an example is given in Daniel and Wood (1971). Hocking and Leslie (1967) and La Motte and Hocking (1970) consider a technique to minimize $C_k$ for given $k$, calculating a subset of the regressions; see also Rothman (1968).

(i) A procedure based on estimates which differ from the usual least squares estimates is that of ridge regression (Hoerl and Kennard, 1970a, b; Marquardt, 1970; Lindley and Smith, 1972). The least squares equations are modified to give estimates which are stable and which, although biased, give smaller mean square error of prediction. This is helpful when the main emphasis is on prediction and especially appropriate when the regression coefficients (or a subset of them) are generated by a random mechanism. It is not clear how useful the method is in the isolation of important variables.

We consider that the procedures (f) and (h) should be used, if at all, only where the particular variables selected are not of intrinsic interest or as a preliminary device in the reduction of variables, so that the recommended techniques for up to 10 variables can be followed; several different reductions should then normally be examined.

The possibility of interactions between the effects of different explanatory variables has usually to be borne in mind. They can be detected in essentially three ways: by graphical analysis of residuals, by fitting an extended model, usually with fairly simple forms of interaction represented by cross products of primary explanatory variables, or by analysing the data in sections. Of course in a problem with many explanatory variables the number of possible interaction terms, even of the simplest kind, is large. Then attention will often have to be restricted to those interactions thought


particularly likely on general grounds and to interactions among variables with large "ordinary" effects.

With binary response variables essentially the same problems arise.

5. SOME MORE COMPLEX PROBLEMS

The difficulties discussed in Section 4 arise in the context even of the simplest multiple regression model. There are, of course, many other sources of difficulty of analysis. In addition to those arising from different kinds of response, e.g. binary, some further problems associated with normal theory regression that need caution are as follows:

(a) Missing values among the explanatory variables are a common source of difficulty (Buck, 1960; Afifi and Elashoff, 1966, 1967, 1969; Dagenais, 1971; Hartley and Hocking, 1971; Orchard and Woodbury, 1972). Current unpublished work by E. M. L. Beale and R. J. A. Little at Imperial College supports the method of Orchard and Woodbury.

(b) There may be a need for models non-linear in the parameters or variables.

(c) Major problems can arise when the individuals are arranged in groups. For example, the regressions between and within groups are likely to be different, and the errors of different individuals are unlikely to be mutually independent. The groups may be characterized by random variables and involve models with components of variance.

(d) There may be appreciable rounding or measurement errors in the explanatory variables (Swindel and Bower, 1972).

6. SOME RECOMMENDATIONS

It is difficult to give specific recommendations because of the widely differing situations that can arise in application. Some of the main points can be summarized as follows:

In design

(a) The nature of the study, and considerations of accuracy and economy, determine how many variables are sensible.

(b) Divide the variables into batteries, where relevant, and consider the possibility of a special design to omit some measurements.
In analysis, given that multiple regression techniques are applied,

(c) The distinction between predicting future observations and interpreting the data can influence the choice of variables.

(d) If interpretation is the objective, and $p$ is not greater than 10-15, compute all $2^p$ regressions and examine those consistent with the data. Larger values of $p$ should in some way be reduced to make the computations feasible.

(e) Automatic selection procedures, such as are commonly used in many generally available computer programs, should be used only as a preliminary device or if the particular variables selected are not of intrinsic interest.

(f) Use of supplementary information and assumptions may be crucial in clarifying relationships. Any such assumptions should, however, be tested for consistency with the data and the conclusions with and without the supplementary information should normally be compared.

(g) The possibility of interactions between the effects of different explanatory variables should be considered.
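Recommendation (g), together with the cross-product device mentioned in Section 4, can be illustrated as follows. This is a minimal sketch under invented assumptions: two simulated explanatory variables, a true interaction coefficient of 1.5, and a hypothetical helper `rss`. It compares the additive model with the extended model containing the cross product $x_1 x_2$ through the usual extra-sum-of-squares F statistic on 1 and $n - 4$ degrees of freedom.

```python
import random


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Solve the normal equations [X'X | X'y] by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


rng = random.Random(2)
n = 200  # illustrative sample size (assumption)
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed mechanism: additive effects plus an interaction of size 1.5.
y = [1.0 + 2.0 * a + 1.0 * b + 1.5 * a * b + rng.gauss(0.0, 1.0)
     for a, b in zip(x1, x2)]

cross = [a * b for a, b in zip(x1, x2)]  # cross-product term
rss_additive = rss([x1, x2], y)
rss_extended = rss([x1, x2, cross], y)

# One extra parameter, so the numerator has 1 degree of freedom.
f_stat = (rss_additive - rss_extended) / (rss_extended / (n - 4))
print("RSS additive:", round(rss_additive, 1))
print("RSS extended:", round(rss_extended, 1))
print("F for interaction:", round(f_stat, 1))
```

A large F statistic points to an interaction of this simple form. With many explanatory variables the number of such cross products grows quickly, which is why the text suggests restricting attention to interactions thought likely on general grounds.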


REFERENCES

ABT, K. (1967). On the identification of the significant independent variables in linear models. I, II. Metrika, 12, 1-15, 81-96.
AFIFI, A. A. and ELASHOFF, R. M. (1966, 1967, 1969). Missing observations in multivariate statistics. I-IV. J. Am. Statist. Assoc., 61; 62, 10-29; 64.
BEALE, E. M. L. (1970). Note on procedures for variable selection in multiple regression. Technometrics, 12.
BEALE, E. M. L., KENDALL, M. G. and MANN, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika, 54.
BLALOCK, H. M. and BLALOCK, A. (editors) (1968). Methodology in Social Research. New York: McGraw Hill.
BREAUX, H. J. (1968). A modification of Efroymson's technique for stepwise regression analysis. Comm. ACM, 11.
BUCK, S. F. (1960). A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J. R. Statist. Soc. B, 22.
COCHRAN, W. G. (1965). The planning of observational studies of human populations. J. R. Statist. Soc. A, 128.
COCHRAN, W. G. (1972). Observational studies. In Statistical Papers in Honor of George W. Snedecor (T. A. Bancroft, ed.). Iowa: Iowa State Press.
COX, D. R. (1968). Notes on some aspects of regression analysis. J. R. Statist. Soc. A, 131.
COX, D. R. (1969). Analysis of Binary Data. London: Methuen.
DAGENAIS, M. G. (1971). Further suggestions concerning the utilization of incomplete observations in regression analysis. J. Am. Statist. Assoc., 66.
DANIEL, C. and WOOD, F. S. (1971). Fitting Equations to Data. New York: Wiley-Interscience.
DRAPER, N. R., GUTTMAN, I. and KANEMASU, H. (1971). The distribution of certain regression statistics. Biometrika, 58.
DRAPER, N. and SMITH, H. (1966). Applied Regression Analysis. New York: Wiley.
DRAPER, N. and SMITH, H. (1969). Methods for selecting variables from a given set of variables for regression analysis. Bull. Inst. Int. Statist., 43.
DYKSTRA, O. (1966). The orthogonalization of undesigned experiments. Technometrics, 8.
EFROYMSON, M. A. (1960). Multiple regression analysis. In Mathematical Methods for Digital Computers (A. Ralston and H. S. Wilf, eds), Chapter 17. New York: Wiley.
GARSIDE, M. J. (1965). The best subset in multiple regression analysis. Appl. Statist., 14.
GAYLOR, D. W. and MERRILL, J. A. (1968). Augmenting existing data in multiple regression. Technometrics, 10.
GOODMAN, L. A. (1971). The analysis of multidimensional contingency tables: stepwise procedures and direct estimation methods for building models for multiple classification. Technometrics, 13.
GORMAN, J. W. and TOMAN, R. J. (1966). Selection of variables for fitting equations to data. Technometrics, 8.
HAMAKER, H. C. (1962). On multiple regression analysis. Statist. Neerlandica, 16.
HARTLEY, H. O. and HOCKING, R. R. (1971). The analysis of incomplete data. Biometrics, 27.
HOCKING, R. R. and LESLIE, R. W. (1967). Selection of the best subset in regression analysis. Technometrics, 9.
HOERL, A. E. and KENNARD, R. W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12.
HOERL, A. E. and KENNARD, R. W. (1970b). Ridge regression: applications to nonorthogonal problems. Technometrics, 12.
JEFFERS, J. N. R. (1967). Two case studies in the application of principal component analysis. Appl. Statist., 16.
LA MOTTE, L. R. and HOCKING, R. R. (1970). Computational efficiency in the selection of regression variables. Technometrics, 12.
LINDLEY, D. V. (1968). The choice of variables in multiple regression. J. R. Statist. Soc. B, 30.
LINDLEY, D. V. and SMITH, A. F. M. (1972). Bayes estimates for the linear model (with Discussion). J. R. Statist. Soc. B, 34, 1-41.
MANTEL, N. (1970). Why stepdown procedures in variable selection. Technometrics, 12.
MARQUARDT, D. W. (1970). Generalized inverses, ridge regression, biased linear estimation and non-linear estimation. Technometrics, 12.
MORGAN, J. A. and TATAR, J. F. (1972). Calculation of the residual sum of squares for all possible regressions. Technometrics, 14.
MULLET, G. M. and MURRAY, T. W. (1971). A new method for examining rounding error in least-squares regression computer programs. J. Am. Statist. Assoc., 66.
NEWTON, R. G. and SPURRELL, D. J. (1967a). A development of multiple regression for the analysis of routine data. Appl. Statist., 16.
NEWTON, R. G. and SPURRELL, D. J. (1967b). Examples of the use of elements for clarifying regression analysis. Appl. Statist., 16.
OOSTERHOFF, J. (1963). On the selection of independent variables in a regression equation. Report 319. Math. Centre, Amsterdam.
ORCHARD, T. and WOODBURY, M. A. (1972). A missing information principle, theory and applications. Proc. 6th Berkeley Symp., 1.
POPE, P. T. and WEBSTER, J. T. (1972). The use of an F-statistic in stepwise regression procedures. Technometrics, 14.
ROTHMAN, D. (1968). Comment on Hocking and Leslie's paper. Technometrics, 10, 432.
SCHATZOFF, M., TSAO, R. and FIENBERG, S. (1968). Efficient calculation of all possible regressions. Technometrics, 10.
SILVEY, S. D. (1969). On choosing additional values of explanatory variables to counter multicollinearity. Bull. Inst. Int. Statist., 43.
SPJØTVOLL, E. (1972). Multiple comparison of regression functions. Ann. Math. Statist., 43.
SWINDEL, B. F. and BOWER, D. R. (1972). Rounding errors in the independent variables in a general linear model. Technometrics, 14.
TURNER, M. E. and STEVENS, C. D. (1959). The regression analysis of causal paths. Biometrics, 15.
WAMPLER, R. H. (1970). A report on the accuracy of some widely used least squares computer programs. J. Am. Statist. Assoc., 65.
