BIOSTATISTICS. Lecture 1 Data Presentation and Descriptive Statistics. dr. Petr Nazarov


Microarray Center
BIOSTATISTICS
Lecture 1: Data Presentation and Descriptive Statistics
dr. Petr Nazarov
22-02-2012
petr.nazarov@crp-sante.lu

COURSE OVERVIEW
Organization
Theoretical course (30h): theory, explanations to all, common work?
Practical course (30h): individual work, individual explanations
Final examination (questions)
3 intermediate tests scored 0-3 (9 points in total)
Final examination (tasks in Excel)
Goal: your FINAL knowledge and skills in biological data analysis, not marking of your work!
Data: http://edu.sablab.net/data/xls
Software: Microsoft Excel with Data Analysis Add-In installed
Materials: @ moodle, & http://edu.sablab.net/biostat/2013

COURSE OVERVIEW
Recommended Literature: presentation, methodology

COURSE OVERVIEW
Introduction
BIOSTATISTICS: why and where?
- Drug discovery
- Genomics and systems biology
- Public health
- Any biological study where numbers are measured or reported

OUTLINE
Lecture 1
Data and statistics
- elements, variables, and observations
- types of data (qualitative and quantitative) and scales (nominal, ordinal, interval, ratio)
Descriptive statistics: tabular and graphical presentation
- frequency distribution
- pie chart, bar chart, and histogram representation
- cumulative distributions
- crosstabulation and scatter diagram
Descriptive statistics: numerical measures
- measures of location: mean, mode, median, quantiles/quartiles/percentiles
- measures of variability: variance, standard deviation, MAD, coefficient of variation
- other measures: skewness of distribution
- z-score, Chebyshev's theorem, detection of outliers
Exploratory analysis
- 5-number summary
- box plot
Measure of association between two variables
- covariance and correlation coefficient
- interpretation of the correlation coefficient

DATA AND STATISTICS
Elements, variables, and observations, data scales and types

DATA AND STATISTICS
Data: Elements, Variables, and Observations
Data: The facts and figures collected, analyzed, and summarized for presentation and interpretation.
In the table below, each row (person) is an element, each column is a variable, and the set of values recorded for one element is an observation.
Person | Place | Gender | Net Worth ($BIL) | Age | Source | Internet Fame Score
William Gates III | 1 | M | 40 | 53 | Microsoft | 9.5
Warren Buffett | 2 | M | 37 | 79 | Berkshire Hathaway | 6.6
Carlos Slim Helu | 3 | M | 35 | 69 | telecom | 2.1
Lawrence Ellison | 4 | M | 22.5 | 64 | Oracle | 2.8
Ingvar Kamprad | 5 | M | 22 | 83 | IKEA | 2.4
Karl Albrecht | 6 | M | 21.5 | 89 | Aldi | 3.6
Mukesh Ambani | 7 | M | 19.5 | 51 | petrochemicals | 4.4
Lakshmi Mittal | 8 | M | 19.3 | 58 | steel | 5.4
Theo Albrecht | 9 | M | 18.8 | 87 | Aldi | 1.5
Amancio Ortega | 10 | M | 18.3 | 73 | Zara | 1.9
Jim Walton | 11 | M | 17.8 | 61 | Wal-Mart | 3.9
Alice Walton | 12 | F | 17.6 | 59 | Wal-Mart | 2.9
Can we consider the Place as an element?
IFS 3 log10 N 4.5

DATA AND STATISTICS
Data Scales and Types
Data scales:
Qualitative
- Nominal scale: data use labels or names to identify an attribute of an element. Ex.1: Male, Female. Ex.2: Rooms #: 101, 102, 103, ...
- Ordinal scale: data exhibit the properties of nominal data, and the order or rank of the data is meaningful. Ex.1: Winners: the 1st, 2nd, 3rd places. Ex.2: Marks: A, B, C, ...
Quantitative
- Interval scale: data demonstrate the properties of ordinal data, and the interval between values is expressed in terms of a fixed unit of measure. Ex.1: Examination score 0-100. Ex.2: Internet fame score.
- Ratio scale: data demonstrate all the properties of interval data, and the ratio of two values is meaningful. Ex.1: Weight. Ex.2: Price.

DATA AND STATISTICS
Task: Define the Scales
Using the same table of persons as above, define the scale of each variable: Person, Place, Gender, Net Worth ($BIL), Age, Source, Internet Fame Score.
IFS 3 log10 N 4.5 ?

TABULAR AND GRAPHICAL PRESENTATION
Frequency distribution, bar and pie charts, histogram, cumulative frequency distribution, scatter plot

TABULAR AND GRAPHICAL PRESENTATION
Frequency Distribution
Frequency distribution: A tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.
Marks: A B C B A B B A B C
Frequency distribution:
Mark | Frequency
A | 3
B | 5
C | 2
Total | 10
Relative frequency distribution:
Mark | Frequency
A | 0.3
B | 0.5
C | 0.2
Total | 1
Percent frequency distribution:
Mark | Frequency
A | 30%
B | 50%
C | 20%
Total | 100%
In MS Excel use the following functions:
=COUNTIF(data,element) to get the number of times the element is found in the data area
=SUM(data) to get the sum of the values in the data area
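The counting done by =COUNTIF and =SUM can also be sketched outside Excel. The snippet below is only an illustration in Python (which is not part of the course material; all variable names are chosen here). It rebuilds the frequency, relative frequency, and percent frequency distributions from the ten marks on the slide.

```python
from collections import Counter

# The ten marks listed on the slide
marks = ["A", "B", "C", "B", "A", "B", "B", "A", "B", "C"]

# Frequency distribution: number of items in each class
frequency = Counter(marks)
total = sum(frequency.values())

# Relative and percent frequency distributions
relative = {mark: count / total for mark, count in frequency.items()}
percent = {mark: 100 * share for mark, share in relative.items()}

for mark in sorted(frequency):
    print(mark, frequency[mark], relative[mark], f"{percent[mark]:.0f}%")
```

The relative frequencies always sum to 1, which is a quick check that no item was missed.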

TABULAR AND GRAPHICAL PRESENTATION
Example: Pancreatitis Study (pancreatitis.xls)
The role of smoking in the etiology of pancreatitis has been recognized for many years. To provide estimates of the quantitative significance of these factors, a hospital-based study was carried out in eastern Massachusetts and Rhode Island between 1975 and 1979. 53 patients who had a hospital discharge diagnosis of pancreatitis were included in this unmatched case-control study. The control group consisted of 217 patients admitted for diseases other than those of the pancreas and biliary tract. Risk factor information was obtained from a standardized interview with each subject, conducted by a trained interviewer.
(adapted from Chap T. Le, Introductory Biostatistics)
Pancreatitis patients:
Smokers Ex-smokers Ex-smokers Smokers Smokers Smokers Ex-smokers Smokers Smokers Smokers Smokers Smokers Ex-smokers Smokers Smokers Ex-smokers Smokers Smokers Ex-smokers Ex-smokers Smokers Ex-smokers Smokers Smokers Never Smokers Ex-smokers Ex-smokers Smokers Ex-smokers Smokers Smokers Ex-smokers Smokers Smokers Smokers Smokers Smokers Ex-smokers Smokers Smokers Smokers Smokers Smokers Smokers Smokers Smokers Smokers Smokers Never Smokers Smokers Smokers

FREQUENCY DISTRIBUTION
Relative Frequency Distribution (pancreatitis.xls)
Frequency distribution: A tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.
Relative frequency distribution: A tabular summary of data showing the fraction or proportion of data items in each of several nonoverlapping classes. The sum of all values should give 1.
Estimation of probability distribution: when the number of experiments $n \to \infty$, R.F.D. $\to$ P.D.
Frequency distribution:
Smoking | Cases | Controls
Never | 2 | 56
Ex-smokers | 13 | 80
Smokers | 38 | 81
Total | 53 | 217
Relative frequency distribution:
Smoking | Cases | Controls
Never | 0.038 | 0.258
Ex-smokers | 0.245 | 0.369
Smokers | 0.717 | 0.373
Total | 1 | 1
In Excel use the following functions:
=COUNTIF(data,element) to get the number of times the element is found in the data area
=SUM(data) to get the sum of the values in the data area

TABULAR AND GRAPHICAL PRESENTATION
Crosstabulation (pancreatitis.xls)
Crosstabulation of Smoking (rows) by Disease (columns):
Smoking | other | pancreatitis | Total
Ex-smoker | 80 | 13 | 93
Never | 56 | 2 | 58
Smoker | 81 | 38 | 119
Total | 217 | 53 | 270
In Excel use the following steps:
1. Insert → Pivot Table
2. Set the range, including the headers of the data
3. Select the output and set the layout by dragging and dropping the names into the table

TABULAR AND GRAPHICAL PRESENTATION
Bar and Pie Charts (pancreatitis.xls)
[Figure: bar chart "Smoking Influence on Pancreatitis": percentage (0 to 80) of Never, Ex-smoker, and Smoker patients in the "other" and "pancreatitis" groups; x-axis: Smoking. Pie charts of the same distributions for the "other" and "pancreatitis" groups.]
In MS Excel use the following steps:
Bar chart: Insert → Column, then set the data range (both columns of the percent frequency distribution)
Pie chart: Insert → Pie, then set the data range (one column of the percent frequency distribution)
Try to avoid using pie charts in scientific reports. For public/business presentations only!

TABULAR AND GRAPHICAL PRESENTATION
Example: Mice Data Series (mice.xls)
Tordoff MG, Bachmanov AA. Survey of calcium & sodium intake and metabolism with bone and body composition data.
Project symbol: Tordoff3. Accession number: MPD:103
790 mice from different strains, http://phenome.jax.org
Parameters: Starting age, Ending age, Starting weight, Ending weight, Weight change, Bleeding time, Ionized Ca in blood, Blood pH, Bone mineral density, Lean tissues weight, Fat weight

TABULAR AND GRAPHICAL PRESENTATION
Histogram (mice.xls)
The following are weights in grams for 790 mice:
20.5 23.2 24.6 23.5 26 25.9 23.9 22.8 19.9 20.8 22.4 26 23.8 26.5 26 22.8 22.9 20.9 19.8 22.7 31 22.7 26.3 27.1 18.4 21 18.8 21 21.4 25.7 19.7 27 26.2 21.8 22.2 19.2 21.9 22.6 23.7 26.2 26 27.5 25 20.9 20.6 22.1 20 21.1 24.1 28.8 30.2 20.1 24.2 25.8 21.3 21.8 23.7 23.5 28 27.6 21.6 21 21.3 20.1 20.8 24.5 23.8 29.5 21.4 21.5 24 21.1 18.9 19.5 32.3 28 27.1 28.2 22.9 19.9 20.4 21.3 20.6 22.8 25.8 24.1 23.5 24.2 22 20.3
Sorted weights show that the values are in the range from 10 to 49.6 grams. Let us divide the weights into bins:
Weight, g | Frequency
<=10 | 1
10-20 | 237
20-30 | 417
30-40 | 124
40-50 | 11
More | 0
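The binning step can be sketched in a few lines of Python. This is an illustration only: it uses just the first ten weights from the list above and assumes that each bin includes its upper limit (values above the last limit fall into "More"). The names `upper_limits` and `counts` are chosen here.

```python
from bisect import bisect_left

# First ten weights (g) from the list above
weights = [20.5, 23.2, 24.6, 23.5, 26, 25.9, 23.9, 22.8, 19.9, 20.8]

# Upper limits of the bins; one extra slot at the end plays the role of "More"
upper_limits = [10, 20, 30, 40, 50]
counts = [0] * (len(upper_limits) + 1)

for w in weights:
    # Index of the first bin whose upper limit is >= w
    counts[bisect_left(upper_limits, w)] += 1

labels = [f"<={u}" for u in upper_limits] + ["More"]
for label, count in zip(labels, counts):
    print(label, count)
# counts is [0, 1, 9, 0, 0, 0]: only 19.9 falls into the 10-20 bin
```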

TABULAR AND GRAPHICAL PRESENTATION
Histogram
Now, let us use bin size = 1 gram.
Bin | Frequency
8 | 0
9 | 1
10 | 10
11 | 11
... | ...
39 | 2
40 | 2
More | 0
[Figure: histogram; x-axis: Weight, g (8 to 40); y-axis: Frequency (0 to 70)]
In Excel use the following steps:
1. Specify the column of bin (interval) upper limits
2. Data → Data Analysis → Histogram: select the input data, bins, and output (Analysis ToolPak should be installed)
3. Use Chart Wizard → Columns to visualize the results

TABULAR AND GRAPHICAL PRESENTATION
Scatter Plot (mice.xls)
Let us look at the mutual dependency of the Starting and Ending weights.
[Figure: scatter plot; x-axis: Starting weight (0 to 50); y-axis: Ending weight (0 to 60)]
In Excel use the following steps:
1. Select the data region
2. Use Insert → XY (Scatter)

NUMERICAL MEASURES
Population and sample, measures of location, quantiles, quartiles and percentiles, measures of variability, z-score, detection of outliers, exploratory data analysis, box plot, covariation, correlation

NUMERICAL MEASURES
Population and Sample
Population parameter: A numerical value used as a summary measure for a population (e.g., the population mean $\mu$, variance $\sigma^2$, standard deviation $\sigma$).
Sample statistic: A numerical value used as a summary measure for a sample (e.g., the sample mean $m$, the sample variance $s^2$, and the sample standard deviation $s$).
POPULATION: $\mu$ mean, $\sigma^2$ variance, $N$ number of elements (usually $N = \infty$)
SAMPLE: $\bar{x}$ or $m$ mean, $s^2$ variance, $n$ number of elements
Population: all existing laboratory Mus musculus
Sample: mice.xls, 790 mice from different strains, http://phenome.jax.org
ID | Strain | Sex | Starting age | Ending age | Starting weight | Ending weight | Weight change | Bleeding time | Ionized Ca in blood | Blood pH | Bone mineral density | Lean tissues weight | Fat weight
1 | 129S1/SvImJ | f | 66 | 116 | 19.3 | 20.5 | 1.062 | 64 | 1.2 | 7.24 | 0.0605 | 14.5 | 4.4
2 | 129S1/SvImJ | f | 66 | 116 | 19.1 | 20.8 | 1.089 | 78 | 1.15 | 7.27 | 0.0553 | 13.9 | 4.4
3 | 129S1/SvImJ | f | 66 | 108 | 17.9 | 19.8 | 1.106 | 90 | 1.16 | 7.26 | 0.0546 | 13.8 | 2.9
368 | 129S1/SvImJ | f | 72 | 114 | 18.3 | 21 | 1.148 | 65 | 1.26 | 7.22 | 0.0599 | 15.4 | 4.2
369 | 129S1/SvImJ | f | 72 | 115 | 20.2 | 21.9 | 1.084 | 55 | 1.23 | 7.3 | 0.0623 | 15.6 | 4.3
370 | 129S1/SvImJ | f | 72 | 116 | 18.8 | 22.1 | 1.176 |  | 1.21 | 7.28 | 0.0626 | 16.4 | 4.3
371 | 129S1/SvImJ | f | 72 | 119 | 19.4 | 21.3 | 1.098 | 49 | 1.24 | 7.24 | 0.0632 | 16.6 | 5.4
372 | 129S1/SvImJ | f | 72 | 122 | 18.3 | 20.1 | 1.098 | 73 | 1.17 | 7.19 | 0.0592 | 16 | 4.1
4 | 129S1/SvImJ | f | 66 | 109 | 17.2 | 18.9 | 1.099 | 41 | 1.25 | 7.29 | 0.0513 | 14 | 3.2
5 | 129S1/SvImJ | f | 66 | 112 | 19.7 | 21.3 | 1.081 | 129 | 1.14 | 7.22 | 0.0501 | 16.3 | 5.2
10 | 129S1/SvImJ | m | 66 | 112 | 24.3 | 24.7 | 1.016 | 119 | 1.13 | 7.24 | 0.0533 | 17.6 | 6.8
364 | 129S1/SvImJ | m | 72 | 114 | 25.3 | 27.2 | 1.075 | 64 | 1.25 | 7.27 | 0.0596 | 19.3 | 5.8
365 | 129S1/SvImJ | m | 72 | 115 | 21.4 | 23.9 | 1.117 | 48 | 1.25 | 7.28 | 0.0563 | 17.4 | 5.7
366 | 129S1/SvImJ | m | 72 | 118 | 24.5 | 26.3 | 1.073 | 59 | 1.25 | 7.26 | 0.0609 | 17.8 | 7.1
367 | 129S1/SvImJ | m | 72 | 122 | 24 | 26 | 1.083 | 69 | 1.29 | 7.26 | 0.0584 | 19.2 | 4.6
6 | 129S1/SvImJ | m | 66 | 116 | 21.6 | 23.3 | 1.079 | 78 | 1.15 | 7.27 | 0.0497 | 17.2 | 5.7
7 | 129S1/SvImJ | m | 66 | 107 | 22.7 | 26.5 | 1.167 | 90 | 1.18 | 7.28 | 0.0493 | 18.7 | 7
8 | 129S1/SvImJ | m | 66 | 108 | 25.4 | 27.4 | 1.079 | 35 | 1.24 | 7.26 | 0.0538 | 18.9 | 7.1
9 | 129S1/SvImJ | m | 66 | 109 | 24.4 | 27.5 | 1.127 | 43 | 1.29 | 7.29 | 0.0539 | 19.5 | 7.1

NUMERICAL MEASURES
Measures of Location
Mean: A measure of central location computed by summing the data values and dividing by the number of observations.
Median: A measure of central location provided by the value in the middle when the data are arranged in ascending order.
Mode: A measure of location, defined as the value that occurs with greatest frequency.
Population mean: $\mu = \frac{\sum x_i}{N}$
Sample mean: $m = \bar{x} = \frac{\sum x_i}{n}$
Sample proportion: $p = \frac{n_{x=\text{true}}}{n}$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
Mode = 23, Median = 23.5, Mean = 31.7
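As a cross-check of the three values above, here is a minimal sketch using Python's standard `statistics` module. Python is used here only for illustration and is not part of the course material, which relies on Excel.

```python
import statistics

# The twelve weights from the slide
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

mean = statistics.mean(weights)      # sum of the values divided by their number
median = statistics.median(weights)  # for even n: average of the two middle values
mode = statistics.mode(weights)      # the most frequent value

print(round(mean, 1), median, mode)  # 31.7 23.5 23
```

The mean (31.7) lies well above the median (23.5) because the two largest weights, 63 and 68, pull it upward.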

NUMERICAL MEASURES
Measures of Location (mice.xls)
[Figure: histogram and p.d.f. approximation of weight, g (10 to 40), with median, mean, and mode marked; y-axis: Density]
Female proportion: $p_f = 0.501$
[Figure: density of bleeding time (0 to 200), N = 760, bandwidth = 5.347; median = 55, mean = 61, mode = 48]
In Excel use the following functions:
=AVERAGE(data)
=MEDIAN(data)
=MODE(data)

NUMERICAL MEASURES
Quantiles, Quartiles and Percentiles
Percentile: A value such that at least p% of the observations are less than or equal to this value, and at least (100-p)% of the observations are greater than or equal to this value. The 50th percentile is the median.
Quartiles: The 25th, 50th, and 75th percentiles, referred to as the first quartile, the second quartile (median), and the third quartile, respectively.
In Excel use the following function: =PERCENTILE(data,p)
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
$Q_1 = 21$, $Q_2 = 23.5$, $Q_3 = 39$

NUMERICAL MEASURES
Measures of Variability
Interquartile range (IQR): A measure of variability, defined to be the difference between the third and first quartiles.
$IQR = Q_3 - Q_1$
Variance: A measure of variability based on the squared deviations of the data values about the mean.
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - m)^2}{n - 1}$
Standard deviation: A measure of variability computed by taking the positive square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
IQR = 18, Variance = 320.2, St. dev. = 17.9
In Excel use the following functions: =VAR(data), =STDEV(data)
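The sample and population formulas differ only in the divisor. A small Python sketch, for illustration only (variable names are chosen here), reproduces the variance and standard deviation reported for the weight data.

```python
import statistics

# The twelve weights from the slide
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

sample_var = statistics.variance(weights)  # divides by n - 1
sample_sd = statistics.stdev(weights)      # square root of the sample variance
pop_var = statistics.pvariance(weights)    # divides by N

print(round(sample_var, 1), round(sample_sd, 1))  # 320.2 17.9
print(pop_var < sample_var)                       # True: dividing by N gives a smaller value
```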

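NUMERICAL MEASURES
Measures of Variability
Coefficient of variation: A measure of relative variability computed by dividing the standard deviation by the mean.
$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
CV = 57%
Median absolute deviation (MAD): MAD is a robust measure of the variability of a univariate sample of quantitative data.
$MAD = \text{median}\left(|x_i - \text{median}(x)|\right)$
Set 1: 23 12 22 12 21 18 22 20 12 19 14 13 17
Set 2: 23 12 22 12 21 81 22 20 12 19 14 13 17
Measure | Set 1 | Set 2
Mean | 17.3 | 22.2
Median | 18 | 19
St.dev. | 4.23 | 18.18
MAD | 5.93 | 5.93
The MAD values in the table include the scaling constant 1.4826 (1.4826 × 4 = 5.93), which makes MAD comparable to the standard deviation for normally distributed data. The single extreme value 81 in Set 2 inflates the standard deviation but leaves MAD unchanged.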
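The robustness of MAD can be checked numerically. The Python sketch below is an illustration only. It assumes the tabulated MAD is the raw median absolute deviation multiplied by 1.4826, which is what reproduces the value 5.93. The function name `scaled_mad` is chosen here.

```python
import statistics

def scaled_mad(values, constant=1.4826):
    """Median absolute deviation times a scaling constant (assumed: 1.4826)."""
    center = statistics.median(values)
    return constant * statistics.median(abs(v - center) for v in values)

set1 = [23, 12, 22, 12, 21, 18, 22, 20, 12, 19, 14, 13, 17]
set2 = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    print(name,
          round(statistics.stdev(data), 2),  # sensitive to the value 81
          round(scaled_mad(data), 2))        # 5.93 for both sets

# Coefficient of variation for the weight data, in percent
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]
cv = 100 * statistics.stdev(weights) / statistics.mean(weights)
print(round(cv))  # 57
```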

NUMERICAL MEASURES
Measures of Variability
Skewness: A measure of the shape of a data distribution. Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - m}{s}\right)^3$
(adapted from Anderson et al., Statistics for Business and Economics)
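The skewness formula translates directly into code. The following Python sketch is illustrative only, with the function name `sample_skewness` chosen here. It applies the formula to the weight data used earlier in the lecture, which has a long right tail, and to a symmetric data set.

```python
import statistics

def sample_skewness(values):
    """n / ((n-1)(n-2)) * sum(((x - m) / s)^3), with s the sample st. dev."""
    n = len(values)
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(weights) > 0)             # True: skewed to the right
print(abs(sample_skewness(symmetric)) < 1e-12)  # True: symmetric data, zero skewness
```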

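NUMERICAL MEASURES
Measure of Association between 2 Variables (mice.xls)
Covariance: A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
In Excel use the function =COVAR(data_x,data_y) (note that COVAR uses the population formula, dividing by N)
[Figure: scatter plot "Ending weight vs. Starting weight"; x-axis: Starting weight (0 to 50); y-axis: Ending weight (0 to 60)]
$s_{xy} = 39.8$: hard to interpret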

NUMERICAL MEASURES
Measure of Association between 2 Variables (mice.xls)
Correlation (Pearson product moment correlation coefficient): A measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$
In Excel use the function =CORREL(data_x,data_y)
[Figure: scatter plot of Ending weight (0 to 60) vs. Starting weight (0 to 50)]
$r_{xy} = 0.94$
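To see how covariance and correlation relate, here is a small Python sketch. It is an illustration only: it uses just five mice (IDs 1, 2, 3, 10, and 364 from the mice.xls excerpt shown earlier), so its results will not match the values 39.8 and 0.94 obtained from the full data set. Function names are chosen here.

```python
import math

# Starting and ending weights (g) of mice with IDs 1, 2, 3, 10, 364
start = [19.3, 19.1, 17.9, 24.3, 25.3]
end = [20.5, 20.8, 19.8, 24.7, 27.2]

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sx = math.sqrt(sample_covariance(x, x))  # covariance of x with itself = variance
    sy = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sx * sy)

print(round(sample_covariance(start, end), 3))  # depends on the units (g^2)
print(round(pearson_r(start, end), 3))          # unit-free, always between -1 and +1
```

Unlike the covariance, the correlation coefficient does not change if the weights are expressed in other units, which is why it is easier to interpret.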

NUMERICAL MEASURES
Correlation Coefficient
If we have only 2 data points in the x and y datasets, what values would you expect for the correlation between x and y?
(figure source: Wikipedia)

NUMERICAL MEASURES
z-score and Detection of Outliers
z-score: A value computed by dividing the deviation about the mean $(x_i - \bar{x})$ by the standard deviation $s$. A z-score is referred to as a standardized value and denotes the number of standard deviations $x_i$ is from the mean.
$z_i = \frac{x_i - m}{s}$
Chebyshev's theorem: For any data set, at least $(1 - 1/z^2)$ of the data values must be within z standard deviations from the mean, where z is any value greater than 1.
Weight | z-score
12 | -1.10
16 | -0.88
19 | -0.71
22 | -0.54
23 | -0.48
23 | -0.48
24 | -0.43
32 | 0.02
36 | 0.24
42 | 0.58
63 | 1.75
68 | 2.03
For ANY distribution:
- At least 75% of the values are within z = 2 standard deviations from the mean
- At least 89% of the values are within z = 3 standard deviations from the mean
- At least 94% of the values are within z = 4 standard deviations from the mean
- At least 96% of the values are within z = 5 standard deviations from the mean
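Both the z-score table and the Chebyshev bounds follow from one-line formulas. The Python sketch below is for illustration only, with function and variable names chosen here. It recomputes the z-scores of the weight data and the bound $1 - 1/z^2$ for z = 2 to 5.

```python
import statistics

# The twelve weights from the slide
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

m = statistics.mean(weights)
s = statistics.stdev(weights)  # sample standard deviation
z_scores = [(x - m) / s for x in weights]

def chebyshev_bound(z):
    """Minimum fraction of values within z standard deviations of the mean (z > 1)."""
    return 1 - 1 / z ** 2

for x, z in zip(weights, z_scores):
    print(x, round(z, 2))

for z in (2, 3, 4, 5):
    print(z, f"{100 * chebyshev_bound(z):.0f}%")
```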

NUMERICAL MEASURES
Detection of Outliers
For bell-shaped distributions (example: Gaussian distribution):
- Approximately 68% of the values are within 1 st.dev. from the mean
- Approximately 95% of the values are within 2 st.dev. from the mean
- Almost all data points are inside 3 st.dev. from the mean
Outlier: An unusually small or unusually large data value.
For bell-shaped distributions, data points with |z| > 3 can be considered as outliers.
Weight | z-score
23 | 0.04
12 | -0.53
22 | -0.01
12 | -0.53
21 | -0.06
81 | 3.10
22 | -0.01
20 | -0.11
12 | -0.53
19 | -0.17
14 | -0.43
13 | -0.48
17 | -0.27

NUMERICAL MEASURES
Task: Detection of Outliers (mice.xls)
Using Excel, try to identify outlier mice on the basis of the Weight change variable.
$z_i = \frac{x_i - m}{s}$
For bell-shaped distributions, data points with |z| > 3 can be considered as outliers.
In Excel use the following functions:
=AVERAGE(data) for the mean, m
=STDEV(data) for the standard deviation, s
=ABS(data) for the absolute value
Sort by |z| to identify outliers.

DETECTION OF OUTLIERS
Iglewicz-Hoaglin Method
Iglewicz-Hoaglin method: modified Z-score. These authors recommend that modified Z-scores with an absolute value greater than 3.5 be labeled as potential outliers.
$z_i = \frac{0.6745\,(x_i - \text{median}(x))}{MAD(x)}$, where $MAD(x) = \text{median}\left(|x_i - \text{median}(x)|\right)$
$|z_i| > 3.5$: outlier
Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor
More methods are at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
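A minimal Python sketch of the modified Z-score follows, for illustration only. It uses the raw MAD (without any scaling constant), as in the formula above, and applies it to the 13 weights used earlier in the lecture, which contain the value 81. The function name is chosen here, and the sketch assumes MAD is not zero.

```python
import statistics

def modified_z_scores(values):
    """0.6745 * (x - median) / MAD, with the raw (unscaled) MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)  # assumed to be non-zero
    return [0.6745 * (v - med) / mad for v in values]

data = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]
scores = modified_z_scores(data)

# Values whose modified Z-score exceeds 3.5 in absolute value
outliers = [v for v, z in zip(data, scores) if abs(z) > 3.5]
print(outliers)  # [81]
```

Because both the median and MAD are barely affected by the extreme value, the modified score of 81 is far above the threshold, while all other values stay well below it.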

NUMERICAL MEASURES
Exploratory Data Analysis
Five-number summary: An exploratory data analysis technique that uses five numbers to summarize the data: smallest value, first quartile, median, third quartile, and largest value.
children.xls: Min. = 12, Q1 = 25, Median = 32, Q3 = 46, Max. = 79
In Excel use: Tools → Data Analysis → Descriptive Statistics
Box plot: A graphical summary of data based on a five-number summary.
[Figure: box plot with Min, Q1, Q2, Q3, and Max marked; whisker length limited to 1.5 IQR]
In Excel use (indirect): Insert → Other charts → Open-high-low-close, with
open = Q3
high = Q3 + 1.5*IQR
low = Q1 - 1.5*IQR
close = Q1

NUMERICAL MEASURES
Example: Mice Weight (mice.xls)
Build a box plot for weights of male and female mice.
1. Build 5-number summaries for males and females:
Measure | Female | Male
Min | 10.0 | 12.0
Q1 | 17.2 | 23.8
Q2 | 20.7 | 27.1
Q3 | 23.3 | 31.2
Max | 41.5 | 49.6
2. Combine the numbers in the following order, keeping the whiskers within the observed range:
open = Q3
high = min(Q3 + 1.5*(Q3-Q1), Max)
low = max(Q1 - 1.5*(Q3-Q1), Min)
close = Q1
In Excel use: Insert → Other charts → Open-high-low-close. Put series in rows. Adjust colors, etc.
[Figure: box plot "Mouse weight" for Female and Male; y-axis: Weight, g (0 to 45)]
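The four values for the open-high-low-close chart can be computed as in the Python sketch below. It is illustrative only and assumes the usual whisker rule: the whiskers extend 1.5 IQR beyond the box but are capped at the observed minimum and maximum. The function name `box_plot_values` is chosen here.

```python
def box_plot_values(minimum, q1, q3, maximum):
    """Return (open, high, low, close) for an open-high-low-close chart."""
    iqr = q3 - q1
    high = min(q3 + 1.5 * iqr, maximum)  # upper whisker, capped at Max
    low = max(q1 - 1.5 * iqr, minimum)   # lower whisker, capped at Min
    return q3, high, low, q1

# Five-number summaries from the slide (Q2 is not needed for the whiskers)
female = box_plot_values(10.0, 17.2, 23.3, 41.5)
male = box_plot_values(12.0, 23.8, 31.2, 49.6)

print("Female:", [round(v, 2) for v in female])
print("Male:  ", [round(v, 2) for v in male])
```

For females the lower whisker is capped at the minimum (10.0), while the upper whisker stops at Q3 + 1.5 IQR, so the heaviest females lie beyond it.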

NUMERICAL MEASURES
Weighted Mean
Weighted mean: The mean obtained by assigning each observation a weight that reflects its importance.
$m_w = \frac{\sum w_i x_i}{\sum w_i}$
As an example of the need for a weighted mean, consider a sample of five purchases of a raw material over several months. Note that the cost per pound varies from $2.80 to $3.40, and the quantity purchased has varied from 500 to 2750. Suppose that a manager asked for information about the mean cost per pound of the raw material. If we used a simple mean of the cost per pound, we would overestimate the average cost!
(Anderson et al., Statistics for Business and Economics)
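The effect can be sketched with a few lines of Python. The purchase table itself is not reproduced above, so the five cost and quantity pairs below are hypothetical values chosen only to match the stated ranges ($2.80 to $3.40 per pound, 500 to 2750 pounds). They should not be read as the original data.

```python
# Hypothetical purchases: cost per pound ($) and quantity (pounds)
costs = [3.00, 3.40, 2.80, 2.90, 3.25]
quantities = [1200, 500, 2750, 1000, 800]

# Weighted mean: each cost is weighted by the quantity bought at that cost
weighted_mean = sum(w * x for w, x in zip(quantities, costs)) / sum(quantities)

# Simple mean: every purchase counts equally, regardless of its size
simple_mean = sum(costs) / len(costs)

print(round(weighted_mean, 2), round(simple_mean, 2))
```

With these numbers the simple mean is higher than the weighted mean because the largest purchase was made at the lowest price.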

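NUMERICAL MEASURES
Grouped Mean
Grouped data: Data available in class intervals as summarized by a frequency distribution. Individual values of the original data are not available.
children.xls
Bin | Frequency
20 | 5
30 | 21
40 | 8
50 | 14
60 | 3
70 | 4
80 | 2
More | 0
Mean for grouped data: $m = \frac{\sum_{i=1}^{k} f_i M_i}{n}$
Variance for grouped data: $s^2 = \frac{\sum_{i=1}^{k} f_i (M_i - m)^2}{n - 1}$
Here $f_i$ is the frequency of class $i$, $M_i$ is the midpoint of class $i$, $k$ is the number of classes, and $n = \sum f_i$.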
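The grouped formulas can be applied to the bin table as in the Python sketch below. It is illustrative only and rests on one assumption not stated on the slide: every bin is 10 units wide, so each class midpoint is taken as the upper limit minus 5.

```python
# Upper bin limits and frequencies from the table above
upper_limits = [20, 30, 40, 50, 60, 70, 80]
frequencies = [5, 21, 8, 14, 3, 4, 2]

# Assumption: bins are 10 units wide, so midpoint = upper limit - 5
midpoints = [u - 5 for u in upper_limits]

n = sum(frequencies)
grouped_mean = sum(f * M for f, M in zip(frequencies, midpoints)) / n
grouped_var = sum(
    f * (M - grouped_mean) ** 2 for f, M in zip(frequencies, midpoints)
) / (n - 1)

print(n, round(grouped_mean, 1), round(grouped_var, 1))
```

The result is an approximation: all values inside a bin are treated as if they were equal to the midpoint.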

QUESTIONS?
Thank you for your attention.
To be continued...

DETECTION OF OUTLIERS
Grubbs' Test
Grubbs' test is an iterative method to detect outliers in a data set assumed to come from a normally distributed population.
Grubbs' statistic at step k+1:
$G^{(k+1)} = \frac{\max_i |x_i - m^{(k)}|}{s^{(k)}} = \max_i |z_i^{(k)}|$
where (k) denotes iteration k, $m$ is the mean of the remaining data, and $s$ is the standard deviation of the remaining data.
The hypothesis of no outliers is rejected at significance level α if
$G > \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^2}{N - 2 + t^2}}$, where $t^2 = t^2_{\alpha/(2N),\,N-2}$
More methods are at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm

DETECTION OF OUTLIERS
Grubbs' Test
Let's perform Grubbs' test for "Weight change" of mice.xls.
Step 1. Generate the critical value:
N: =COUNTIF(A:A,">=0")
t^2: =TINV(0.05/(2*E1),E1-2)^2
G Crit: =(E1-1)/SQRT(E1)*SQRT(E2/(E1-2+E2))
$G_{Crit} = \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^2}{N - 2 + t^2}}$, where $t^2 = t^2_{\alpha/(2N),\,N-2}$
N | 790
t^2 | 17.51895
G.Crit. | 4.139802
Step 2. Build |z| = abs(x-m)/s and sort in descending order:
Weight change | abs(x-m)/s
0 | 9.847692462
2.109 | 8.91981
0.565 | 4.819888341
0.578 | 4.704204352
0.642 | 4.134683177
0.658 | 3.992302884
Step 3. If the first |z| value is > G Crit, remove it and go to Step 2; otherwise finish.
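The critical value in Step 1 can be checked with a short Python sketch. It is illustrative only: the value of t^2 is taken from the slide rather than recomputed (that would require a t-distribution quantile function), and only the first comparison of Step 3 is shown. The recomputation of m, s, and N after each removal is omitted. The function name is chosen here.

```python
import math

def grubbs_critical(n, t_squared):
    """(N - 1) / sqrt(N) * sqrt(t^2 / (N - 2 + t^2))."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_squared / (n - 2 + t_squared))

n = 790
t_squared = 17.51895  # TINV(0.05/(2*N), N-2)^2 as reported on the slide
g_crit = grubbs_critical(n, t_squared)
print(round(g_crit, 4))  # 4.1398

# Largest |z| from the sorted table on the slide
largest_z = 9.847692462
print(largest_z > g_crit)  # True: remove this value and repeat from Step 2
```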