BIOSTATISTICS. Lecture 1 Data Presentation and Descriptive Statistics. dr. Petr Nazarov


Genomics Research Unit. BIOSTATISTICS. Lecture 1: Data Presentation and Descriptive Statistics. dr. Petr Nazarov, 3-03-2017, petr.nazarov@lih.lu

COURSE OVERVIEW. Organization

60 h = 12 days
Theoretical course (30 h): theory, explanations to all, common work
Practical course (30 h): individual work, individual explanations
Final examination (questions); 3 intermediate tests scored 0-3 (9 points in total)
Final examination (tasks in Excel)
Goal: your FINAL knowledge and skills in biological data analysis, not marking of your work!
Data: http://edu.sablab.net/data/xls
Software: Microsoft Excel with the Data Analysis Add-In installed
Materials: Moodle and http://edu.sablab.net/biostat/2017

COURSE OVERVIEW. Recommended Literature: presentation, methodology

COURSE OVERVIEW. Introduction

BIOSTATISTICS: why and where?
Drug discovery
Genomics and systems biology
Public health
Any biological study where numbers are measured or reported

OUTLINE. Lecture 1

Data and statistics
- elements, variables and observations
- types of data (qualitative and quantitative) and scales (nominal, ordinal, interval, ratio)

Descriptive statistics: tabular and graphical presentation
- frequency distribution
- pie, bar chart and histogram representation
- cumulative distributions
- crosstabulation and scatter diagram

Descriptive statistics: numerical measures
- measures of location: mean, mode, median, quantiles/quartiles/percentiles
- measures of variability: variance, standard deviation, MAD, coefficient of variation
- other measures: skewness of distribution
- z-score, Chebyshev's theorem, detection of outliers

Exploratory analysis
- 5-number summary
- box plot

Measure of association between two variables
- covariance and correlation coefficient
- interpretation of the correlation coefficient

DATA AND STATISTICS. Elements, variables, and observations, data scales and types

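DATA AND STATISTICS. Data: Elements, Variables, and Observations

Data: the facts and figures collected, analyzed, and summarized for presentation and interpretation.

In the table below, each row (a person) is an element, each column is a variable, and the set of values recorded for one element is an observation.

| Person | Place | Gender | Net Worth ($BIL) | Age | Source | Internet Fame Score |
|---|---|---|---|---|---|---|
| William Gates III | 1 | M | 40 | 53 | Microsoft | 9.5 |
| Warren Buffett | 2 | M | 37 | 79 | Berkshire Hathaway | 6.6 |
| Carlos Slim Helu | 3 | M | 35 | 69 | telecom | 2.1 |
| Lawrence Ellison | 4 | M | 22.5 | 64 | Oracle | 2.8 |
| Ingvar Kamprad | 5 | M | 22 | 83 | IKEA | 2.4 |
| Karl Albrecht | 6 | M | 21.5 | 89 | Aldi | 3.6 |
| Mukesh Ambani | 7 | M | 19.5 | 51 | petrochemicals | 4.4 |
| Lakshmi Mittal | 8 | M | 19.3 | 58 | steel | 5.4 |
| Theo Albrecht | 9 | M | 18.8 | 87 | Aldi | 1.5 |
| Amancio Ortega | 10 | M | 18.3 | 73 | Zara | 1.9 |
| Jim Walton | 11 | M | 17.8 | 61 | Wal-Mart | 3.9 |
| Alice Walton | 12 | F | 17.6 | 59 | Wal-Mart | 2.9 |

Can we consider the Place as an element?

IFS 3 log10 N 4.5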

DATA AND STATISTICS. Data Scales and Types

Data scales:

Qualitative data

Nominal scale: data use labels or names to identify an attribute of an element.
Ex. 1: Male, Female
Ex. 2: Room numbers: 101, 102, 103, ...

Ordinal scale: data exhibit the properties of nominal data, and the order or rank of the data is meaningful.
Ex. 1: Winners: the 1st, 2nd, 3rd places
Ex. 2: Marks: A, B, C, ...

Quantitative data

Interval scale: data demonstrate the properties of ordinal data, and the interval between values is expressed in terms of a fixed unit of measure.
Ex. 1: Examination score 0-100
Ex. 2: Internet fame score

Ratio scale: data demonstrate all the properties of interval data, and the ratio of two values is meaningful.
Ex. 1: Weight
Ex. 2: Price

DATA AND STATISTICS. Task: Define the Scales

Define the scale of each variable in the billionaires table: Person, Place, Gender, Net Worth ($BIL), Age, Source, Internet Fame Score.

TABULAR AND GRAPHICAL PRESENTATION. Frequency distribution, bar and pie charts, histogram, cumulative frequency distribution, scatter plot

TABULAR AND GRAPHICAL PRESENTATION. Frequency Distribution

Frequency distribution: a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.

Marks: A B C B A B B A B C

| Mark | Frequency | Relative frequency | Percent frequency |
|---|---|---|---|
| A | 3 | 0.3 | 30% |
| B | 5 | 0.5 | 50% |
| C | 2 | 0.2 | 20% |
| Total | 10 | 1 | 100% |

In MS Excel use the following functions:
=COUNTIF(data,element) to get the number of elements found in the data area
=SUM(data) to get the sum of the values in the data area
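For readers working outside Excel, the same three distributions can be sketched in Python. This is only an illustration using the ten marks from the slide; `collections.Counter` plays the role of COUNTIF, and the variable names are chosen here for the example.

```python
from collections import Counter

# The ten marks listed on the slide.
marks = ["A", "B", "C", "B", "A", "B", "B", "A", "B", "C"]

freq = Counter(marks)                 # absolute frequency of each mark
n = sum(freq.values())                # total number of observations

# Relative frequencies sum to 1, percent frequencies sum to 100.
rel_freq = {mark: count / n for mark, count in freq.items()}
pct_freq = {mark: 100 * count / n for mark, count in freq.items()}

for mark in sorted(freq):
    print(mark, freq[mark], rel_freq[mark], f"{pct_freq[mark]:.0f}%")
```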

TABULAR AND GRAPHICAL PRESENTATION. Example: Pancreatitis Study (pancreatitis.xls)

The role of smoking in the etiology of pancreatitis has been recognized for many years. To provide estimates of the quantitative significance of these factors, a hospital-based study was carried out in eastern Massachusetts and Rhode Island between 1975 and 1979. 53 patients who had a hospital discharge diagnosis of pancreatitis were included in this unmatched case-control study. The control group consisted of 217 patients admitted for diseases other than those of the pancreas and biliary tract. Risk factor information was obtained from a standardized interview with each subject, conducted by a trained interviewer.

(adapted from Chap T. Le, Introductory Biostatistics)

Pancreatitis patients: Smokers, Ex-smokers, Ex-smokers, Smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Ex-smokers, Ex-smokers, Smokers, Ex-smokers, Smokers, Smokers, Never, Smokers, Ex-smokers, Ex-smokers, Smokers, Ex-smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Never, Smokers, Smokers, Smokers

FREQUENCY DISTRIBUTION. Relative Frequency Distribution (pancreatitis.xls)

Frequency distribution: a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.

Relative frequency distribution: a tabular summary of data showing the fraction or proportion of data items in each of several nonoverlapping classes. The sum of all values should give 1.

Estimation of the probability distribution: when the number of experiments $n \to \infty$, the relative frequency distribution (R.F.D.) tends to the probability distribution (P.D.).

Frequency distribution:

| Smoking | Cases | Controls |
|---|---|---|
| Never | 2 | 56 |
| Ex-smokers | 13 | 80 |
| Smokers | 38 | 81 |
| Total | 53 | 217 |

Relative frequency distribution:

| Smoking | Cases | Controls |
|---|---|---|
| Never | 0.038 | 0.258 |
| Ex-smokers | 0.245 | 0.369 |
| Smokers | 0.717 | 0.373 |
| Total | 1 | 1 |

In Excel use the following functions:
=COUNTIF(data,element) to get the number of elements found in the data area
=SUM(data) to get the sum of the values in the data area

TABULAR AND GRAPHICAL PRESENTATION. Crosstabulation (pancreatitis.xls)

| Smoking | Disease: other | Disease: pancreatitis | Total |
|---|---|---|---|
| Ex-smokers | 80 | 13 | 93 |
| Never | 56 | 2 | 58 |
| Smokers | 81 | 38 | 119 |
| Total | 217 | 53 | 270 |

In Excel use the following steps:
Insert → Pivot Table
Set the range, including the headers of the data
Select the output and set the layout by dragging and dropping the names into the table
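A crosstabulation is just a count of (row category, column category) pairs. The sketch below illustrates this in plain Python. Since the raw rows live in the Excel file, it rebuilds one record per patient from the counts on the slide; that reconstruction and the variable names are assumptions made only for the example.

```python
from collections import Counter

# Cell counts taken from the slide; each key is (smoking status, disease).
cell_counts = {
    ("Ex-smokers", "other"): 80, ("Ex-smokers", "pancreatitis"): 13,
    ("Never", "other"): 56,      ("Never", "pancreatitis"): 2,
    ("Smokers", "other"): 81,    ("Smokers", "pancreatitis"): 38,
}

# Rebuild one record per patient (stand-in for the rows of the spreadsheet).
records = [pair for pair, k in cell_counts.items() for _ in range(k)]

table = Counter(records)                                # the crosstabulation cells
row_totals = Counter(smoking for smoking, _ in records) # totals per smoking status
col_totals = Counter(disease for _, disease in records) # totals per disease group

for smoking in sorted(row_totals):
    print(smoking, table[(smoking, "other")],
          table[(smoking, "pancreatitis")], row_totals[smoking])
print("Total", col_totals["other"], col_totals["pancreatitis"], len(records))
```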

TABULAR AND GRAPHICAL PRESENTATION. Bar and Pie Charts (pancreatitis.xls)

[Figure: "Smoking Influence on Pancreatitis". Bar chart of Percentage (0-80) by Smoking category (Never, Ex-smoker, Smoker) for the "other" and "pancreatitis" groups, and one pie chart per group.]

In MS Excel use the following steps:
Bar chart: Insert → Column → set the data range (both columns of the percent frequency distribution)
Pie chart: Insert → Pie → set the data range (one column of the percent frequency distribution)

Try to avoid using pie charts in scientific reports. For public/business presentations only!

TABULAR AND GRAPHICAL PRESENTATION. Example: Mice Data Series (mice.xls)

Tordoff MG, Bachmanov AA. Survey of calcium & sodium intake and metabolism with bone and body composition data.
Project symbol: Tordoff3. Accession number: MPD:103
790 mice from different strains, http://phenome.jax.org

Parameters: starting age, ending age, starting weight, ending weight, weight change, bleeding time, ionized Ca in blood, blood pH, bone mineral density, lean tissues weight, fat weight

TABULAR AND GRAPHICAL PRESENTATION. Histogram (mice.xls)

The following are weights in grams for 790 mice (excerpt):

20.5 23.2 24.6 23.5 26 25.9 23.9 22.8 19.9 20.8 22.4 26 23.8 26.5 26 22.8 22.9 20.9 19.8 22.7 31 22.7 26.3 27.1 18.4 21 18.8 21 21.4 25.7 19.7 27 26.2 21.8 22.2 19.2 21.9 22.6 23.7 26.2 26 27.5 25 20.9 20.6 22.1 20 21.1 24.1 28.8 30.2 20.1 24.2 25.8 21.3 21.8 23.7 23.5 28 27.6 21.6 21 21.3 20.1 20.8 24.5 23.8 29.5 21.4 21.5 24 21.1 18.9 19.5 32.3 28 27.1 28.2 22.9 19.9 20.4 21.3 20.6 22.8 25.8 24.1 23.5 24.2 22 20.3

Sorted weights show that the values lie in the range 10 to 49.6 grams. Let us divide the weights into bins:

| Weight, g | Frequency |
|---|---|
| ≤10 | 1 |
| 10-20 | 237 |
| 20-30 | 417 |
| 30-40 | 124 |
| 40-50 | 11 |
| More | 0 |

TABULAR AND GRAPHICAL PRESENTATION. Histogram

Now, let us use bin size = 1 gram:

| Bin | Frequency |
|---|---|
| 8 | 0 |
| 9 | 1 |
| 10 | 10 |
| 11 | 11 |
| ... | ... |
| 39 | 2 |
| 40 | 2 |
| More | 0 |

[Figure: histogram; x-axis: Weight, g (8-40); y-axis: Frequency (0-70)]

In Excel use the following steps:
Specify the column of bin (interval) upper limits
Data → Data Analysis → Histogram: select the input data, bins, and output (Analysis ToolPak should be installed)
Use Chart Wizard → Columns to visualize the results

TABULAR AND GRAPHICAL PRESENTATION. Scatter Plot (mice.xls)

Let us look at the mutual dependency of the starting and ending weights.

[Figure: scatter plot; x-axis: Starting weight (0-50); y-axis: Ending weight (0-60)]

In Excel use the following steps:
Select the data region
Use Insert → XY (Scatter)

NUMERICAL MEASURES. Population and sample, measures of location, quantiles, quartiles and percentiles, measures of variability, z-score, detection of outliers, exploratory data analysis, box plot, covariation, correlation

NUMERICAL MEASURES. Population and Sample

Population parameter: a numerical value used as a summary measure for a population (e.g., the population mean $\mu$, variance $\sigma^2$, standard deviation $\sigma$).

Sample statistic: a numerical value used as a summary measure for a sample (e.g., the sample mean $m$, the sample variance $s^2$, and the sample standard deviation $s$).

POPULATION: $\mu$ mean, $\sigma^2$ variance, $N$ number of elements (usually $N = \infty$)
SAMPLE: $\bar{x} = m$ mean, $s^2$ variance, $n$ number of elements

Example: the population is all existing laboratory Mus musculus; the sample is mice.xls, 790 mice from different strains (http://phenome.jax.org).

| ID | Strain | Sex | Starting age | Ending age | Starting weight | Ending weight | Weight change | Bleeding time | Ionized Ca in blood | Blood pH | Bone mineral density | Lean tissues weight | Fat weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 129S1/SvImJ | f | 66 | 116 | 19.3 | 20.5 | 1.062 | 64 | 1.2 | 7.24 | 0.0605 | 14.5 | 4.4 |
| 2 | 129S1/SvImJ | f | 66 | 116 | 19.1 | 20.8 | 1.089 | 78 | 1.15 | 7.27 | 0.0553 | 13.9 | 4.4 |
| 3 | 129S1/SvImJ | f | 66 | 108 | 17.9 | 19.8 | 1.106 | 90 | 1.16 | 7.26 | 0.0546 | 13.8 | 2.9 |
| 368 | 129S1/SvImJ | f | 72 | 114 | 18.3 | 21 | 1.148 | 65 | 1.26 | 7.22 | 0.0599 | 15.4 | 4.2 |
| 369 | 129S1/SvImJ | f | 72 | 115 | 20.2 | 21.9 | 1.084 | 55 | 1.23 | 7.3 | 0.0623 | 15.6 | 4.3 |
| 370 | 129S1/SvImJ | f | 72 | 116 | 18.8 | 22.1 | 1.176 | | 1.21 | 7.28 | 0.0626 | 16.4 | 4.3 |
| 371 | 129S1/SvImJ | f | 72 | 119 | 19.4 | 21.3 | 1.098 | 49 | 1.24 | 7.24 | 0.0632 | 16.6 | 5.4 |
| 372 | 129S1/SvImJ | f | 72 | 122 | 18.3 | 20.1 | 1.098 | 73 | 1.17 | 7.19 | 0.0592 | 16 | 4.1 |
| 4 | 129S1/SvImJ | f | 66 | 109 | 17.2 | 18.9 | 1.099 | 41 | 1.25 | 7.29 | 0.0513 | 14 | 3.2 |
| 5 | 129S1/SvImJ | f | 66 | 112 | 19.7 | 21.3 | 1.081 | 129 | 1.14 | 7.22 | 0.0501 | 16.3 | 5.2 |
| 10 | 129S1/SvImJ | m | 66 | 112 | 24.3 | 24.7 | 1.016 | 119 | 1.13 | 7.24 | 0.0533 | 17.6 | 6.8 |
| 364 | 129S1/SvImJ | m | 72 | 114 | 25.3 | 27.2 | 1.075 | 64 | 1.25 | 7.27 | 0.0596 | 19.3 | 5.8 |
| 365 | 129S1/SvImJ | m | 72 | 115 | 21.4 | 23.9 | 1.117 | 48 | 1.25 | 7.28 | 0.0563 | 17.4 | 5.7 |
| 366 | 129S1/SvImJ | m | 72 | 118 | 24.5 | 26.3 | 1.073 | 59 | 1.25 | 7.26 | 0.0609 | 17.8 | 7.1 |
| 367 | 129S1/SvImJ | m | 72 | 122 | 24 | 26 | 1.083 | 69 | 1.29 | 7.26 | 0.0584 | 19.2 | 4.6 |
| 6 | 129S1/SvImJ | m | 66 | 116 | 21.6 | 23.3 | 1.079 | 78 | 1.15 | 7.27 | 0.0497 | 17.2 | 5.7 |
| 7 | 129S1/SvImJ | m | 66 | 107 | 22.7 | 26.5 | 1.167 | 90 | 1.18 | 7.28 | 0.0493 | 18.7 | 7 |
| 8 | 129S1/SvImJ | m | 66 | 108 | 25.4 | 27.4 | 1.079 | 35 | 1.24 | 7.26 | 0.0538 | 18.9 | 7.1 |
| 9 | 129S1/SvImJ | m | 66 | 109 | 24.4 | 27.5 | 1.127 | 43 | 1.29 | 7.29 | 0.0539 | 19.5 | 7.1 |

NUMERICAL MEASURES. Measures of Location

Mean: a measure of central location computed by summing the data values and dividing by the number of observations.

Population mean: $\mu = \frac{\sum x_i}{N}$
Sample mean: $m = \bar{x} = \frac{\sum x_i}{n}$
Proportion: $p = \frac{n_{x=\text{true}}}{n}$

Median: a measure of central location provided by the value in the middle when the data are arranged in ascending order.

Mode: a measure of location, defined as the value that occurs with the greatest frequency.

Weight: 12 16 19 22 23 23 24 32 36 42 63 68
Mode = 23, Median = 23.5, Mean = 31.7
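As a cross-check of the three values quoted for the Weight example, here is a short Python sketch using the standard `statistics` module. It is an illustration alongside the Excel functions, not part of the original course material.

```python
import statistics

# Weight example from the slide.
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

mean = statistics.mean(weights)      # sum of values / number of values
median = statistics.median(weights)  # average of the two middle values (n is even)
mode = statistics.mode(weights)      # most frequent value

print(f"mean = {mean:.1f}, median = {median}, mode = {mode}")
```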

NUMERICAL MEASURES. Measures of Location (mice.xls)

[Figure: histogram and p.d.f. approximation of weight, g (10-40), with the median, mean and mode marked; y-axis: Density]

Female proportion: $p_f = 0.501$

[Figure: density of bleeding time (0-200), N = 760, bandwidth = 5.347; median = 55, mean = 61, mode = 48]

In Excel use the following functions:
=AVERAGE(data)
=MEDIAN(data)
=MODE(data)

NUMERICAL MEASURES. Quantiles, Quartiles and Percentiles

Percentile: a value such that at least p% of the observations are less than or equal to this value, and at least (100-p)% of the observations are greater than or equal to this value. The 50th percentile is the median.

Quartiles: the 25th, 50th, and 75th percentiles, referred to as the first quartile, the second quartile (median), and the third quartile, respectively.

Weight: 12 16 19 22 23 23 24 32 36 42 63 68
$Q_1$ = 21, $Q_2$ = 23.5, $Q_3$ = 39

In Excel use the following function:
=PERCENTILE(data,p)
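Quartile conventions differ between tools, so computed values need not agree to the last digit. The sketch below assumes the "median of halves" convention (an assumption; the slide does not say which rule it uses). For the Weight example it gives $Q_3$ = 39 and $Q_1$ = 20.5, which the slide appears to report rounded as 21. Interpolating rules, such as the one behind Excel's PERCENTILE, can return slightly different numbers.

```python
import statistics

# Weight example from the slide, sorted in ascending order.
weights = sorted([12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68])

half = len(weights) // 2
lower_half = weights[:half]    # values below the median position
upper_half = weights[-half:]   # values above it (the middle value is skipped if n is odd)

q1 = statistics.median(lower_half)  # first quartile
q2 = statistics.median(weights)     # second quartile = median
q3 = statistics.median(upper_half)  # third quartile

print(q1, q2, q3)
```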

NUMERICAL MEASURES. Measures of Variability

Interquartile range (IQR): a measure of variability, defined as the difference between the third and first quartiles.

$\mathrm{IQR} = Q_3 - Q_1$

Variance: a measure of variability based on the squared deviations of the data values about the mean.

Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - m)^2}{n - 1}$

Standard deviation: a measure of variability computed by taking the positive square root of the variance.

Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Weight: 12 16 19 22 23 23 24 32 36 42 63 68
IQR = 18, Variance = 320.2, St. dev. = 17.9

In Excel use the following functions:
=VAR(data)
=STDEV(data)
=STDEV.S(data)
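The difference between the population and sample formulas is only the divisor. A minimal Python sketch for the Weight example, written out step by step rather than with a library call so the two divisors are visible (variable names are chosen for this illustration):

```python
import math

# Weight example from the slide.
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

n = len(weights)
m = sum(weights) / n                                 # sample mean
sum_sq_dev = sum((x - m) ** 2 for x in weights)      # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)      # s^2: divide by n - 1
population_var = sum_sq_dev / n        # sigma^2 formula: divide by N
sample_sd = math.sqrt(sample_var)      # s: positive square root of the variance

print(f"s^2 = {sample_var:.1f}, s = {sample_sd:.1f}")
```

The sample values reproduce the slide's Variance = 320.2 and St. dev. = 17.9.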

NUMERICAL MEASURES. Measures of Variability

Coefficient of variation: a measure of relative variability computed by dividing the standard deviation by the mean.

$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$

Weight: 12 16 19 22 23 23 24 32 36 42 63 68
CV = 57%

Median absolute deviation (MAD): a robust measure of the variability of a univariate sample of quantitative data.

$\mathrm{MAD} = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$

Set 1: 23 12 22 12 21 18 22 20 12 19 14 13 17
Set 2: 23 12 22 12 21 81 22 20 12 19 14 13 17

| | Set 1 | Set 2 |
|---|---|---|
| Mean | 17.3 | 22.2 |
| Median | 18 | 19 |
| St. dev. | 4.23 | 18.18 |
| MAD | 5.93 | 5.93 |
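The following Python sketch recomputes the summary table for Set 1 and Set 2 and the CV of the Weight example. One observation from the arithmetic: the raw median of absolute deviations is 4 for both sets, and the tabulated 5.93 equals 4 multiplied by the usual consistency constant 1.4826. The sketch therefore assumes that scaling; the function name and the `scale` parameter are illustrative.

```python
import statistics

set1 = [23, 12, 22, 12, 21, 18, 22, 20, 12, 19, 14, 13, 17]
set2 = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]  # 18 replaced by 81
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

def mad(xs, scale=1.4826):
    """Median absolute deviation; scale=1 gives the raw (unscaled) MAD."""
    med = statistics.median(xs)
    return scale * statistics.median(abs(x - med) for x in xs)

def cv_percent(xs):
    """Coefficient of variation: sample standard deviation / mean, in percent."""
    return 100 * statistics.stdev(xs) / statistics.mean(xs)

for name, xs in (("Set 1", set1), ("Set 2", set2)):
    print(name, round(statistics.mean(xs), 1), statistics.median(xs),
          round(statistics.stdev(xs), 2), round(mad(xs), 2))
print("CV of weights:", round(cv_percent(weights)), "%")
```

A single extreme value (81) inflates the standard deviation from 4.23 to 18.18 while the MAD stays the same, which is what "robust" means here.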

NUMERICAL MEASURES. Measures of Variability

Skewness: a measure of the shape of a data distribution. Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.

$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - m}{s}\right)^3$

(adapted from Anderson et al., Statistics for Business and Economics)
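To make the sign convention concrete, here is a small Python sketch of the sample skewness formula, $\frac{n}{(n-1)(n-2)} \sum ((x_i - m)/s)^3$. The test data (the Weight example and a symmetric toy set) are chosen for illustration only.

```python
import statistics

def skewness(xs):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - m) / s)^3); needs n >= 3."""
    n = len(xs)
    m = statistics.mean(xs)
    s = statistics.stdev(xs)   # sample standard deviation (divisor n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]
symmetric = [1, 2, 3, 4, 5]

print(skewness(weights))    # positive: long tail to the right (63, 68)
print(skewness(symmetric))  # approximately zero for symmetric data
```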

NUMERICAL MEASURES. Measure of Association between 2 Variables

Covariance: a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.

Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample: $s_{xy} = \frac{\sum (x_i - m_x)(y_i - m_y)}{n - 1}$

mice.xls, Ending weight vs. Starting weight:

[Figure: scatter plot; x-axis: Starting weight (0-50); y-axis: Ending weight (0-60)]

$s_{xy} = 39.8$: hard to interpret

In Excel use the function:
=COVAR(data_x, data_y)

NUMERICAL MEASURES. Measure of Association between 2 Variables

Correlation (Pearson product moment correlation coefficient): a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.

Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum (x_i - m_x)(y_i - m_y)}{(n - 1)\, s_x s_y}$

mice.xls, Ending weight vs. Starting weight:

[Figure: scatter plot; x-axis: Starting weight (0-50); y-axis: Ending weight (0-60)]

$r_{xy} = 0.94$

In Excel use the function:
=CORREL(data_x, data_y)
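The sample covariance and the Pearson correlation can be sketched in a few lines of Python. The data below are the starting and ending weights of only the first five mice in the table shown earlier, so the results will differ from the slide's $s_{xy} = 39.8$ and $r_{xy} = 0.94$, which refer to all 790 mice. Function names are illustrative.

```python
import statistics

# Starting and ending weights of the first five mice in the table (IDs 1, 2, 3, 368, 369).
start = [19.3, 19.1, 17.9, 18.3, 20.2]
end = [20.5, 20.8, 19.8, 21.0, 21.9]

def sample_covariance(xs, ys):
    """s_xy = sum((x - m_x) * (y - m_y)) / (n - 1)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y); unit-free and always between -1 and +1."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

s_xy = sample_covariance(start, end)
r_xy = correlation(start, end)
print(f"s_xy = {s_xy:.4f}, r_xy = {r_xy:.2f}")
```

Dividing by both standard deviations removes the units, which is why the correlation is easier to interpret than the covariance.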

NUMERICAL MEASURES. Correlation Coefficient

If we have only 2 data points in the x and y datasets, what values would you expect for the correlation between x and y?

[Figure source: Wikipedia]

NUMERICAL MEASURES. z-score and Detection of Outliers

z-score: a value computed by dividing the deviation about the mean $(x_i - m)$ by the standard deviation $s$. A z-score is referred to as a standardized value and denotes the number of standard deviations $x_i$ is from the mean.

$z_i = \frac{x_i - m}{s}$

| Weight | z-score |
|---|---|
| 12 | -1.10 |
| 16 | -0.88 |
| 19 | -0.71 |
| 22 | -0.54 |
| 23 | -0.48 |
| 23 | -0.48 |
| 24 | -0.43 |
| 32 | 0.02 |
| 36 | 0.24 |
| 42 | 0.58 |
| 63 | 1.75 |
| 68 | 2.03 |

Chebyshev's theorem: for any data set, at least $(1 - 1/z^2)$ of the data values must be within $z$ standard deviations from the mean, where $z$ is any value greater than 1.

For ANY distribution:
At least 75% of the values are within z = 2 standard deviations from the mean
At least 89% of the values are within z = 3 standard deviations from the mean
At least 94% of the values are within z = 4 standard deviations from the mean
At least 96% of the values are within z = 5 standard deviations from the mean
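The z-score table and the Chebyshev percentages can be reproduced with a short Python sketch. It uses the Weight example and the sample standard deviation; the helper name `chebyshev_bound` is chosen here for illustration.

```python
import statistics

# Weight example from the slide.
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

m = statistics.mean(weights)
s = statistics.stdev(weights)              # sample standard deviation
z_scores = [(x - m) / s for x in weights]  # z_i = (x_i - m) / s

def chebyshev_bound(z):
    """Minimum fraction of values within z standard deviations of the mean (z > 1)."""
    return 1 - 1 / z ** 2

for x, z in zip(weights, z_scores):
    print(x, f"{z:.2f}")
for z in (2, 3, 4, 5):
    print(f"z = {z}: at least {100 * chebyshev_bound(z):.1f}%")

# Observed fraction within 2 standard deviations; Chebyshev guarantees >= 0.75.
within_2 = sum(abs(z) <= 2 for z in z_scores) / len(z_scores)
```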

NUMERICAL MEASURES. Detection of Outliers

For bell-shaped distributions:
Approximately 68% of the values are within 1 st. dev. from the mean
Approximately 95% of the values are within 2 st. dev. from the mean
Almost all data points are inside 3 st. dev. from the mean

[Figure: example, Gaussian distribution]

Outlier: an unusually small or unusually large data value. For bell-shaped distributions, data points with $|z| > 3$ can be considered as outliers.

| Weight | z-score |
|---|---|
| 23 | 0.04 |
| 12 | -0.53 |
| 22 | -0.01 |
| 12 | -0.53 |
| 21 | -0.06 |
| 81 | 3.10 |
| 22 | -0.01 |
| 20 | -0.11 |
| 12 | -0.53 |
| 19 | -0.17 |
| 14 | -0.43 |
| 13 | -0.48 |
| 17 | -0.27 |

NUMERICAL MEASURES. Task: Detection of Outliers (mice.xls)

Using Excel, try to identify outlier mice on the basis of the Weight change variable.

$z_i = \frac{x_i - m}{s}$

For bell-shaped distributions, data points with $|z| > 3$ can be considered as outliers.

In Excel use the following functions:
=AVERAGE(data) for the mean, m
=STDEV(data) for the standard deviation, s
=ABS(data) for the absolute value
Sort by the z-score to identify outliers

DETECTION OF OUTLIERS. Iglewicz-Hoaglin Method

Iglewicz-Hoaglin method: modified Z-score. These authors recommend that modified Z-scores with an absolute value greater than 3.5 be labeled as potential outliers.

$z_i = \frac{0.6745\,\left(x_i - \mathrm{median}(x)\right)}{\mathrm{MAD}(x)}$, with $|z_i| > 3.5$ indicating an outlier

$\mathrm{MAD}(x) = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$

Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor

More methods are at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
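A minimal Python sketch of the modified Z-score rule, applied to the 13-value data set containing the value 81 that appears earlier in the lecture. The function name is illustrative, and the sketch assumes the unscaled MAD (the constant 0.6745 already provides the scaling) and a nonzero MAD.

```python
import statistics

def modified_z_scores(xs):
    """Modified Z-scores: 0.6745 * (x - median) / MAD, with the raw (unscaled) MAD."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    # Note: mad == 0 (more than half the values identical) would need special handling.
    return [0.6745 * (x - med) / mad for x in xs]

data = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]

scores = modified_z_scores(data)
outliers = [x for x, z in zip(data, scores) if abs(z) > 3.5]
print(outliers)
```

Because both the median and the MAD are barely affected by the extreme value, the score of 81 stays far above the 3.5 threshold instead of being masked by an inflated standard deviation.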

NUMERICAL MEASURES. Exploratory Data Analysis

Five-number summary: an exploratory data analysis technique that uses five numbers to summarize the data: smallest value, first quartile, median, third quartile, and largest value.

children.xls:
Min.: 12
$Q_1$: 25
Median: 32
$Q_3$: 46
Max.: 79

In Excel use: Data → Data Analysis → Descriptive Statistics

Box plot: a graphical summary of data based on a five-number summary.

[Figure: box plot marking Min, $Q_1$, $Q_2$, $Q_3$, Max, and the 1.5 IQR limit]

NUMERICAL MEASURES. Box Plot in Excel

Example: build a box plot for the ending weights of male and female mice (mice.xls).

| 5-number summary | FEMALE | MALE |
|---|---|---|
| Min | 10.0 | 12.0 |
| Q1 | 17.2 | 23.8 |
| Q2 (median) | 20.7 | 27.1 |
| Q3 | 23.3 | 31.2 |
| Max | 41.5 | 49.6 |

| Box sizes | FEMALE | MALE | Formula |
|---|---|---|---|
| box 1 (hidden) | 17.2 | 23.8 | = Q1 |
| box 2 (lower) | 3.6 | 3.3 | = Q2 - Q1 |
| box 3 (upper) | 2.6 | 4.1 | = Q3 - Q2 |

| Whiskers | FEMALE | MALE | Formula |
|---|---|---|---|
| whisker top | 18.2 | 18.4 | = MAX - Q3 |
| whisker bottom | 7.2 | 11.8 | = Q1 - MIN |

1. Build 5-number summaries for males and females
2. Calculate the sizes of the boxes and whiskers (here simplified whiskers are used)
3. Show Stacked Columns, switch rows/columns
4. Make the hidden box transparent
5. Add custom error bars as whiskers

See the Contextures Inc. tutorial: https://www.youtube.com/watch?v=ucwmfmxb1kk

NUMERICAL MEASURES. Weighted Mean

Weighted mean: the mean obtained by assigning each observation a weight that reflects its importance.

$m = \frac{\sum w_i x_i}{\sum w_i}$

As an example of the need for a weighted mean, consider a sample of five purchases of a raw material over several months (purchase table shown on the slide). Note that the cost per pound varies from $2.80 to $3.40, and the quantity purchased varies from 500 to 2750. Suppose that a manager asked for information about the mean cost per pound of the raw material. If we used a simple mean of the cost per pound, we would overestimate the average cost!

(Anderson et al., Statistics for Business and Economics)
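The effect can be sketched in Python. The purchase table itself is not reproduced in this transcript, so the five cost and quantity values below are illustrative assumptions, chosen only to be consistent with the stated ranges ($2.80 to $3.40 per pound, 500 to 2750 pounds); they should not be read as the slide's data.

```python
# Hypothetical purchases (illustrative values, not taken from the slide).
cost_per_pound = [3.00, 3.40, 2.80, 2.90, 3.25]   # x_i, in dollars
pounds = [1200, 500, 2750, 1000, 800]             # w_i, the weights

def weighted_mean(values, weights):
    """m = sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

simple = sum(cost_per_pound) / len(cost_per_pound)
weighted = weighted_mean(cost_per_pound, pounds)

print(f"simple mean   = {simple:.2f}")
print(f"weighted mean = {weighted:.2f}")
```

With these numbers the largest purchase was also the cheapest, so the simple mean overstates the cost actually paid per pound.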

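NUMERICAL MEASURES. Grouped Mean

Grouped data: data available in class intervals as summarized by a frequency distribution. Individual values of the original data are not available.

children.xls:

| Bin | Frequency |
|---|---|
| 20 | 5 |
| 30 | 21 |
| 40 | 8 |
| 50 | 14 |
| 60 | 3 |
| 70 | 4 |
| 80 | 2 |
| More | 0 |

Mean for grouped data: $m = \frac{\sum_{i=1}^{k} f_i M_i}{n}$

Variance for grouped data: $s^2 = \frac{\sum_{i=1}^{k} f_i (M_i - m)^2}{n - 1}$

Here $k$ is the number of classes, $f_i$ is the frequency of class $i$, $M_i$ is the midpoint of class $i$, and $n = \sum f_i$.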
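A short Python sketch of the grouped-data formulas, using the bin table above. The bins are given by their upper limits, so the midpoints are an assumption: each class is taken to be 10 units wide, giving midpoints 15, 25, ..., 75.

```python
# Bin upper limits and frequencies from the table (the empty "More" bin is omitted).
upper_limits = [20, 30, 40, 50, 60, 70, 80]
frequencies = [5, 21, 8, 14, 3, 4, 2]

# Assumption: every class is 10 units wide, so midpoint = upper limit - 5.
midpoints = [u - 5 for u in upper_limits]

n = sum(frequencies)                                            # total observations
m = sum(f * M for f, M in zip(frequencies, midpoints)) / n      # grouped mean
s2 = sum(f * (M - m) ** 2
         for f, M in zip(frequencies, midpoints)) / (n - 1)     # grouped variance

print(f"n = {n}, grouped mean = {m:.1f}, grouped variance = {s2:.1f}")
```

These are approximations: every observation in a class is treated as if it sat at the class midpoint.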

QUESTIONS?

Thank you for your attention. To be continued.