Statistics for Biology and Health

Series Editors
Mitchell Gail
Jonathan M. Samet
Anastasios Tsiatis
Wing Wong

More information about this series at http://www.springer.com/series/2848

Daniel Zelterman

Applied Multivariate Statistics with R

Daniel Zelterman
School of Public Health
Yale University
New Haven, CT, USA

ISSN 1431-8776
ISSN 2197-5671 (electronic)
Statistics for Biology and Health
ISBN 978-3-319-14092-6
ISBN 978-3-319-14093-3 (ebook)
DOI 10.1007/978-3-319-14093-3
Library of Congress Control Number: 2015942244

Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Barry H Margolin
1943–2009
Teacher, mentor, and friend

Permissions

R is Copyright © 2014 The R Foundation for Statistical Computing. R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/.

We are grateful for the use of data sets, used with permission as follows:

Public domain data in Table 1.4 was obtained from the U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2007 Incidence and Mortality Web-based Report. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2010. Available online at www.cdc.gov/uscs.

The data of Table 1.5 is used with permission of S&P Dow Jones.

The data of Table 1.6 was collected from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/06/17/40-unhealthiest-burgers.html.

The data in Table 7.1 is used with permission of S&P Dow Jones Indices.

The data in Table 7.4 was gathered from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/10/18/halloween-candy.html.

The data in Table 7.5 was obtained from https://health.data.ny.gov/.

The data in Table 8.2 is used with permission of Dr. Deepak Nayaran, Yale University School of Medicine.

The data in Table 8.3 was reported by the National Vital Statistics System (NVSS): http://www.cdc.gov/nchs/deaths.htm.

The data in Table 9.3 was generated as part of the Medicare Current Beneficiary Survey: http://www.cms.gov/mcbs/ and available on the cdc.gov website.

The data in Table 9.4 was collected by the United States Department of Labor, Mine Safety and Health Administration. This data is available online at http://www.msha.gov/stats/centurystats/coalstats.asp.

The data of Table 9.6 is used with permission of Street Authority and David Sterman.

The data in Table 11.1 is adapted from Statistics Canada (2009) Health Care Professionals and Official-Language Minorities in Canada 2001 and 2006, Catalog no. 91-550-X, Text Table 1.1, appearing on page 13.

The data source in Table 11.5 is copyright 2010, Morningstar, Inc., Morningstar Bond Market Commentary September, 2010. All Rights Reserved. Used with permission.

The data in Table 11.6 is cited from US Department of Health and Human Services, Administration on Aging. The data is available online at http://www.aoa.gov/aoaroot/aging_statistics/index.aspx.

The website http://www.gastonsanchez.com provided some of the code appearing in Output 11.1.

Global climate data appearing in Table 12.2 is used with permission: Climate Charts & Graphs, courtesy of Kelly O'Day.

Tables 13.1 and 13.6 are reprinted from Stigler (1994) and used with permission of the Institute of Mathematical Statistics.

In Addition

Several data sets referenced and used as examples were obtained from the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (Bache and Lichman 2013).

Grant Support

The author acknowledges support from grants from the National Institute of Mental Health, National Cancer Institute, and the National Institute of Environmental Health Sciences. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

Preface

Multivariate statistics is a mature field with many different methods. Many of these are mathematical. Fortunately, these methods have been programmed so you should be able to run these on your computer without much difficulty.

This book is targeted to a graduate-level practitioner who may need to use these methods but does not necessarily know about the mathematical derivations. For example, we use the sample average of the multivariate distribution to estimate the population mean but do not need to prove optimal properties of such an estimator when sampled from a normal parent population. Readers may want to analyze their data, motivated by discipline-specific questions. They will discover ways to get at some important results without a degree in statistics. Similarly, those well trained in statistics will likely be familiar with many of the univariate topics covered here, but now can learn about new methods.

The reader should have taken at least one course in statistics previously and have some familiarity with such topics as the t-test, degrees of freedom (df), p-values, statistical significance, and the chi-squared test of independence in a 2 × 2 table. He or she should also know the basic rules of probability such as independence and conditional probability. The reader should have some basic computing skills including data editing. It is not necessary to have experience with R or with programming languages although these are good skills to develop.

We will assume that the reader has a rudimentary acquaintance with the univariate normal distribution. We begin a discussion of multivariate models with an introduction to the bivariate normal distribution in Chap. 6. These are used to make the leap from the scalar notation to the use of vectors and matrices used in Chap. 7 on the multivariate normal distribution. A brief review of linear algebra appears in Chap. 4, including the corresponding computations in R.
Other multivariate distributions include models for extremes, described in Sect. 13.3.

We frequently include the necessary software to run the programs in R because we need to be able to perform these methods with real data. In some cases we need to manipulate the data in order to get it to fit into the proper format. Readers may want to produce some of the graphical displays given in Chap. 3 for their own data. For these readers, the full programs to produce the figures are listed in that chapter.

The field of statistics has developed many useful methods for analyzing data, and many of these methods are already programmed for you and readily available in R. What's more, R is free, widely available, open source, flexible, and the current fashion in statistical computing. Authors of new statistical methods are regularly contributing to the many libraries in R so many new results are included as well.

As befitting the Springer series in Life Sciences, Medicine & Health, a large portion of the examples given here are health related or biologically oriented. There are also a large number of examples from other disciplines. There are several reasons for this, including the abundance of good examples that are available. Examples from other disciplines do a good job of illustrating the method without a great deal of background knowledge of the data. For example, Chap. 9 on multivariable linear regression methods begins with an example of data for different car models. Because the measurements on the cars are readily understood by the reader with little or no additional explanation, we can concentrate on the statistical methods rather than spending time on the example details. In contrast, the second example, presented in Sect. 9.3, is about a large health survey and requires a longer introduction for the reader to appreciate the data. But at that point in Chap. 9, we are already familiar with the statistical tools and can address issues raised by the survey data.

New Haven, CT, USA
Daniel Zelterman

Acknowledgments

Special thanks are due to Michael Kane and Forrest Crawford, who together taught me enough R to fill a book, and Rob Muirhead, who taught multivariate statistics out of TW Anderson's text. Long talks with Alan Izenman provided large doses of encouragement.
Steve Schwager first suggested writing this book and gave me the initial table of contents. Ben Kedem read and provided useful comments on Chap. 12. Chang Yu provided many comments on the technical material. Thanks to Beth Nichols whose careful reading and red pencil provided many editorial improvements on the manuscript. Thanks also to many teachers, students, and colleagues who taught me much. Many thanks to caring and supportive friends and family who encouraged and put up with me during the whole process.

The APL computer language and TW Anderson's book (Anderson 2003)¹ on multivariate statistics provided me with the foundation for writing this book as a graduate student in the 1970s. I watched Frank Anscombe² working on his book (Anscombe 1981) and was both inspired and awed by the amount of effort involved.

¹ Theodore Wilbur Anderson (1918– ). American mathematician and statistician.
² Francis John Anscombe (1918–2001). British statistician. Founded the Statistics Department at Yale.

Contents

Preface

1 Introduction
  1.1 Goals of Multivariate Statistical Techniques
  1.2 Data Reduction or Structural Simplification
  1.3 Grouping and Classifying Observations
  1.4 Examination of Dependence Among Variables
  1.5 Describing Relationships Between Groups of Variables
  1.6 Hypothesis Formulation and Testing
  1.7 Multivariate Graphics and Distributions
  1.8 Why R?
  1.9 Additional Readings

2 Elements of R
  2.1 Getting Started in R
    2.1.1 R as a Calculator
    2.1.2 Vectors in R
    2.1.3 Printing in R
  2.2 Simulation and Simple Statistics
  2.3 Handling Data Sets
  2.4 Basic Data Manipulation and Statistics
  2.5 Programming and Writing Functions in R
  2.6 A Larger Simulation
  2.7 Advanced Numerical Operations
  2.8 Housekeeping
  2.9 Exercises

3 Graphical Displays
  3.1 Graphics in R
  3.2 Displays for Univariate Data
  3.3 Displays for Bivariate Data
    3.3.1 Plot Options, Colors, and Characters
    3.3.2 More Graphics for Bivariate Data
  3.4 Displays for Three-Dimensional Data
  3.5 Displays for Higher Dimensional Data
    3.5.1 Pairs, Bagplot, and Coplot
    3.5.2 Glyphs: Stars and Faces
    3.5.3 Parallel Coordinates
  3.6 Additional Reading
  3.7 Exercises

4 Basic Linear Algebra
  4.1 Apples and Oranges
  4.2 Vectors
  4.3 Basic Matrix Arithmetic
  4.4 Matrix Operations in R
  4.5 Advanced Matrix Operations
    4.5.1 Determinants
    4.5.2 Matrix Inversion
    4.5.3 Eigenvalues and Eigenvectors
    4.5.4 Diagonalizable Matrices
    4.5.5 Generalized Inverses
    4.5.6 Matrix Square Root
  4.6 Exercises

5 The Univariate Normal Distribution
  5.1 The Normal Density and Distribution Functions
  5.2 Relationship to Other Distributions
  5.3 Transformations to Normality
  5.4 Tests for Normality
  5.5 Inference on Univariate Normal Means
  5.6 Inference on Variances
  5.7 Maximum Likelihood Estimation, Part I
  5.8 Exercises

6 Bivariate Normal Distribution
  6.1 The Bivariate Normal Density Function
  6.2 Properties of the Bivariate Normal Distribution
  6.3 Inference on Bivariate Normal Parameters
  6.4 Tests for Bivariate Normality
  6.5 Maximum Likelihood Estimation, Part II
  6.6 Exercises

7 Multivariate Normal Distribution
  7.1 Multivariate Normal Density and Its Properties
  7.2 Inference on Multivariate Normal Means
  7.3 Example: Home Price Index
  7.4 Maximum Likelihood, Part III: Models for Means
  7.5 Inference on Multivariate Normal Variances
  7.6 Fitting Patterned Covariance Matrices
  7.7 Tests for Multivariate Normality
  7.8 Exercises

8 Factor Methods
  8.1 Principal Components Analysis
  8.2 Example 1: Investment Allocations
  8.3 Example 2: Kuiper Belt Objects
  8.4 Example 3: Health Outcomes in US Hospitals
  8.5 Factor Analysis
  8.6 Exercises

9 Multivariable Linear Regression
  9.1 Univariate Regression
  9.2 Multivariable Regression in R
  9.3 A Large Health Survey
  9.4 Exercises

10 Discrimination and Classification
  10.1 An Introductory Example
  10.2 Multinomial Logistic Regression
  10.3 Linear Discriminant Analysis
  10.4 Support Vector Machine
  10.5 Regression Trees
  10.6 Exercises

11 Clustering
  11.1 Hierarchical Clustering
  11.2 K-Means Clustering
  11.3 Diagnostics, Validation, and Other Methods
  11.4 Exercises

12 Time Series Models
  12.1 Introductory Examples and Simple Analyses
  12.2 Autoregressive Models
  12.3 Spectral Decomposition
  12.4 Exercises

13 Other Useful Methods
  13.1 Ranking from Paired Comparisons
  13.2 Canonical Correlations
  13.3 Methods for Extreme Order Statistics
  13.4 Big Data and Wide Data
  13.5 Exercises

Appendix: Libraries Used
Selected Solutions and Hints
References
About the author
Index