Statistics for Biology and Health


Statistics for Biology and Health
Series Editors: Mitchell Gail, Jonathan M. Samet, Anastasios Tsiatis, Wing Wong
More information about this series at http://www.springer.com/series/2848

Daniel Zelterman
Applied Multivariate Statistics with R

Daniel Zelterman
School of Public Health
Yale University
New Haven, CT, USA

ISSN 1431-8776
ISSN 2197-5671 (electronic)
Statistics for Biology and Health
ISBN 978-3-319-14092-6
ISBN 978-3-319-14093-3 (ebook)
DOI 10.1007/978-3-319-14093-3
Library of Congress Control Number: 2015942244

Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Barry H Margolin
1943–2009
Teacher, mentor, and friend

Permissions

R is Copyright © 2014 The R Foundation for Statistical Computing. R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/.

We are grateful for the use of data sets, used with permission as follows:

Public domain data in Table 1.4 was obtained from the U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2007 Incidence and Mortality Web-based Report. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2010. Available online at www.cdc.gov/uscs.

The data of Table 1.5 is used with permission of S&P Dow Jones.

The data of Table 1.6 was collected from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/06/17/40-unhealthiest-burgers.html.

The data in Table 7.1 is used with permission of S&P Dow Jones Indices.

The data in Table 7.4 was gathered from www.fatsecret.com and used with permission. The idea for this table was suggested by the article originally appearing in The Daily Beast, available online at http://www.thedailybeast.com/galleries/2010/10/18/halloween-candy.html.

The data in Table 7.5 was obtained from https://health.data.ny.gov/.

The data in Table 8.2 is used with permission of Dr. Deepak Nayaran, Yale University School of Medicine.

The data in Table 8.3 was reported by the National Vital Statistics System (NVSS): http://www.cdc.gov/nchs/deaths.htm.

The data in Table 9.3 was generated as part of the Medicare Current Beneficiary Survey: http://www.cms.gov/mcbs/ and available on the cdc.gov website.

The data in Table 9.4 was collected by the United States Department of Labor, Mine Safety and Health Administration. This data is available online at http://www.msha.gov/stats/centurystats/coalstats.asp.

The data of Table 9.6 is used with permission of Street Authority and David Sterman.

The data in Table 11.1 is adapted from Statistics Canada (2009) Health Care Professionals and Official-Language Minorities in Canada 2001 and 2006, Catalog no. 91-550-X, Text Table 1.1, appearing on page 13.

The data source in Table 11.5 is copyright 2010, Morningstar, Inc., Morningstar Bond Market Commentary September, 2010. All Rights Reserved. Used with permission.

The data in Table 11.6 is cited from US Department of Health and Human Services, Administration on Aging. The data is available online at http://www.aoa.gov/aoaroot/aging_statistics/index.aspx.

The website http://www.gastonsanchez.com provided some of the code appearing in Output 11.1.

Global climate data appearing in Table 12.2 is used with permission: Climate Charts & Graphs, courtesy of Kelly O'Day.

Tables 13.1 and 13.6 are reprinted from Stigler (1994) and used with permission of the Institute of Mathematical Statistics.

In Addition

Several data sets referenced and used as examples were obtained from the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (Bache and Lichman 2013).

Grant Support

The author acknowledges support from grants from the National Institute of Mental Health, National Cancer Institute, and the National Institute of Environmental Health Sciences. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

Preface

Multivariate statistics is a mature field with many different methods. Many of these are mathematical. Fortunately, these methods have been programmed, so you should be able to run them on your computer without much difficulty.

This book is targeted to a graduate-level practitioner who may need to use these methods but does not necessarily know about the mathematical derivations. For example, we use the sample average of the multivariate distribution to estimate the population mean but do not need to prove optimal properties of such an estimator when sampled from a normal parent population. Readers may want to analyze their data, motivated by discipline-specific questions. They will discover ways to get at some important results without a degree in statistics. Similarly, those well trained in statistics will likely be familiar with many of the univariate topics covered here, but now can learn about new methods.

The reader should have taken at least one course in statistics previously and have some familiarity with such topics as the t-test, degrees of freedom (df), p-values, statistical significance, and the chi-squared test of independence in a 2 × 2 table. He or she should also know the basic rules of probability such as independence and conditional probability. The reader should have some basic computing skills including data editing. It is not necessary to have experience with R or with programming languages, although these are good skills to develop.

We will assume that the reader has a rudimentary acquaintance with the univariate normal distribution. We begin a discussion of multivariate models with an introduction to the bivariate normal distribution in Chap. 6. These are used to make the leap from the scalar notation to the use of vectors and matrices used in Chap. 7 on the multivariate normal distribution. A brief review of linear algebra appears in Chap. 4, including the corresponding computations in R. Other multivariate distributions include models for extremes, described in Sect. 13.3.

We frequently include the necessary software to run the programs in R because we need to be able to perform these methods with real data. In some cases we need to manipulate the data in order to get it to fit into the proper format. Readers may want to produce some of the graphical displays given in Chap. 3 for their own data. For these readers, the full programs to produce the figures are listed in that chapter.

The field of statistics has developed many useful methods for analyzing data, and many of these methods are already programmed for you and readily available in R. What's more, R is free, widely available, open source, flexible, and the current fashion in statistical computing. Authors of new statistical methods are regularly contributing to the many libraries in R, so many new results are included as well.

As befitting the Springer series in Life Sciences, Medicine & Health, a large portion of the examples given here are health related or biologically oriented. There are also a large number of examples from other disciplines. There are several reasons for this, including the abundance of good examples that are available. Examples from other disciplines do a good job of illustrating the method without a great deal of background knowledge of the data. For example, Chap. 9 on multivariable linear regression methods begins with an example of data for different car models. Because the measurements on the cars are readily understood by the reader with little or no additional explanation, we can concentrate on the statistical methods rather than spending time on the example details. In contrast, the second example, presented in Sect. 9.3, is about a large health survey and requires a longer introduction for the reader to appreciate the data. But at that point in Chap. 9, we are already familiar with the statistical tools and can address issues raised by the survey data.

New Haven, CT, USA
Daniel Zelterman

Acknowledgments

Special thanks are due to Michael Kane and Forrest Crawford, who together taught me enough R to fill a book, and Rob Muirhead, who taught multivariate statistics out of T.W. Anderson's text. Long talks with Alan Izenman provided large doses of encouragement. Steve Schwager first suggested writing this book and gave me the initial table of contents. Ben Kedem read and provided useful comments on Chap. 12. Chang Yu provided many comments on the technical material. Thanks to Beth Nichols, whose careful reading and red pencil provided many editorial improvements on the manuscript. Thanks also to many teachers, students, and colleagues who taught me much. Many thanks to caring and supportive friends and family who encouraged and put up with me during the whole process.

The APL computer language and T.W. Anderson's book (Anderson 2003)¹ on multivariate statistics provided me with the foundation for writing this book as a graduate student in the 1970s. I watched Frank Anscombe² working on his book (Anscombe 1981) and was both inspired and awed by the amount of effort involved.

¹ Theodore Wilbur Anderson (1918– ). American mathematician and statistician.
² Francis John Anscombe (1918–2001). British statistician. Founded the Statistics Department at Yale.

Contents

Preface

1 Introduction
  1.1 Goals of Multivariate Statistical Techniques
  1.2 Data Reduction or Structural Simplification
  1.3 Grouping and Classifying Observations
  1.4 Examination of Dependence Among Variables
  1.5 Describing Relationships Between Groups of Variables
  1.6 Hypothesis Formulation and Testing
  1.7 Multivariate Graphics and Distributions
  1.8 Why R?
  1.9 Additional Readings

2 Elements of R
  2.1 Getting Started in R
    2.1.1 R as a Calculator
    2.1.2 Vectors in R
    2.1.3 Printing in R
  2.2 Simulation and Simple Statistics
  2.3 Handling Data Sets
  2.4 Basic Data Manipulation and Statistics
  2.5 Programming and Writing Functions in R
  2.6 A Larger Simulation
  2.7 Advanced Numerical Operations
  2.8 Housekeeping
  2.9 Exercises

3 Graphical Displays
  3.1 Graphics in R
  3.2 Displays for Univariate Data
  3.3 Displays for Bivariate Data
    3.3.1 Plot Options, Colors, and Characters
    3.3.2 More Graphics for Bivariate Data
  3.4 Displays for Three-Dimensional Data
  3.5 Displays for Higher Dimensional Data
    3.5.1 Pairs, Bagplot, and Coplot
    3.5.2 Glyphs: Stars and Faces
    3.5.3 Parallel Coordinates
  3.6 Additional Reading
  3.7 Exercises

4 Basic Linear Algebra
  4.1 Apples and Oranges
  4.2 Vectors
  4.3 Basic Matrix Arithmetic
  4.4 Matrix Operations in R
  4.5 Advanced Matrix Operations
    4.5.1 Determinants
    4.5.2 Matrix Inversion
    4.5.3 Eigenvalues and Eigenvectors
    4.5.4 Diagonalizable Matrices
    4.5.5 Generalized Inverses
    4.5.6 Matrix Square Root
  4.6 Exercises

5 The Univariate Normal Distribution
  5.1 The Normal Density and Distribution Functions
  5.2 Relationship to Other Distributions
  5.3 Transformations to Normality
  5.4 Tests for Normality
  5.5 Inference on Univariate Normal Means
  5.6 Inference on Variances
  5.7 Maximum Likelihood Estimation, Part I
  5.8 Exercises

6 Bivariate Normal Distribution
  6.1 The Bivariate Normal Density Function
  6.2 Properties of the Bivariate Normal Distribution
  6.3 Inference on Bivariate Normal Parameters
  6.4 Tests for Bivariate Normality
  6.5 Maximum Likelihood Estimation, Part II
  6.6 Exercises

7 Multivariate Normal Distribution
  7.1 Multivariate Normal Density and Its Properties
  7.2 Inference on Multivariate Normal Means
  7.3 Example: Home Price Index
  7.4 Maximum Likelihood, Part III: Models for Means
  7.5 Inference on Multivariate Normal Variances
  7.6 Fitting Patterned Covariance Matrices
  7.7 Tests for Multivariate Normality
  7.8 Exercises

8 Factor Methods
  8.1 Principal Components Analysis
  8.2 Example 1: Investment Allocations
  8.3 Example 2: Kuiper Belt Objects
  8.4 Example 3: Health Outcomes in US Hospitals
  8.5 Factor Analysis
  8.6 Exercises

9 Multivariable Linear Regression
  9.1 Univariate Regression
  9.2 Multivariable Regression in R
  9.3 A Large Health Survey
  9.4 Exercises

10 Discrimination and Classification
  10.1 An Introductory Example
  10.2 Multinomial Logistic Regression
  10.3 Linear Discriminant Analysis
  10.4 Support Vector Machine
  10.5 Regression Trees
  10.6 Exercises

11 Clustering
  11.1 Hierarchical Clustering
  11.2 K-Means Clustering
  11.3 Diagnostics, Validation, and Other Methods
  11.4 Exercises

12 Time Series Models
  12.1 Introductory Examples and Simple Analyses
  12.2 Autoregressive Models
  12.3 Spectral Decomposition
  12.4 Exercises

13 Other Useful Methods
  13.1 Ranking from Paired Comparisons
  13.2 Canonical Correlations
  13.3 Methods for Extreme Order Statistics
  13.4 Big Data and Wide Data
  13.5 Exercises

Appendix: Libraries Used
Selected Solutions and Hints
References
About the author
Index