Genomics Research Unit
BIOSTATISTICS
Lecture 1: Data Presentation and Descriptive Statistics
dr. Petr Nazarov, 3-03-2017, petr.nazarov@lih.lu
COURSE OVERVIEW
Organization: 60h = 12 days
Theoretical course (30h): theory, explanations to all, common work
Practical course (30h): individual work, individual explanations
Final examination (questions)
3 intermediate tests scored 0-3 (9 points in total)
Final examination (tasks in Excel)
Goal: your FINAL knowledge and skills in biological data analysis, not the marking of your work!
Data: http://edu.sablab.net/data/xls
Software: Microsoft Excel with the Data Analysis Add-In installed
Materials: @ moodle, & http://edu.sablab.net/biostat/2017
COURSE OVERVIEW
Recommended Literature: presentation, methodology
COURSE OVERVIEW
Introduction
BIOSTATISTICS: why and where?
  Drug discovery
  Genomics and systems biology
  Public health
  Any biological study where numbers are measured or reported
OUTLINE
Lecture 1
Data and statistics
  elements, variables and observations
  types of data (qualitative and quantitative) and scales (nominal, ordinal, interval, ratio)
Descriptive statistics: tabular and graphical presentation
  frequency distribution
  pie, bar chart and histogram representation
  cumulative distributions
  crosstabulation and scatter diagram
Descriptive statistics: numerical measures
  measures of location: mean, mode, median, quantiles/quartiles/percentiles
  measures of variability: variance, standard deviation, MAD, coefficient of variation
  other measures: skewness of distribution
z-score. Chebyshev's theorem. Detection of outliers.
Exploratory analysis
  5-number summary
  box plot
Measure of association between two variables
  covariance and correlation coefficient
  interpretation of correlation coefficient
DATA AND STATISTICS
Elements, variables, and observations, data scales and types
DATA AND STATISTICS
Data: Elements, Variables, and Observations
Data: the facts and figures collected, analyzed, and summarized for presentation and interpretation.
In the table below, rows are elements, columns are variables, and the set of measurements for one element is an observation.
Person | Place | Gender | Net Worth ($BIL) | Age | Source | Internet Fame Score
William Gates III | 1 | M | 40 | 53 | Microsoft | 9.5
Warren Buffett | 2 | M | 37 | 79 | Berkshire Hathaway | 6.6
Carlos Slim Helu | 3 | M | 35 | 69 | telecom | 2.1
Lawrence Ellison | 4 | M | 22.5 | 64 | Oracle | 2.8
Ingvar Kamprad | 5 | M | 22 | 83 | IKEA | 2.4
Karl Albrecht | 6 | M | 21.5 | 89 | Aldi | 3.6
Mukesh Ambani | 7 | M | 19.5 | 51 | petrochemicals | 4.4
Lakshmi Mittal | 8 | M | 19.3 | 58 | steel | 5.4
Theo Albrecht | 9 | M | 18.8 | 87 | Aldi | 1.5
Amancio Ortega | 10 | M | 18.3 | 73 | Zara | 1.9
Jim Walton | 11 | M | 17.8 | 61 | Wal-Mart | 3.9
Alice Walton | 12 | F | 17.6 | 59 | Wal-Mart | 2.9
Can we consider the Place as an element?
IFS 3 log10 N 4.5
DATA AND STATISTICS
Data Scales and Types
Data scales:
Nominal scale (qualitative): data use labels or names to identify an attribute of an element.
  Ex.1: Male, Female
  Ex.2: Rooms #: 101, 102, 103, ...
Ordinal scale (qualitative): data exhibit the properties of nominal data, and the order or rank of the data is meaningful.
  Ex.1: Winners: the 1st, 2nd, 3rd places
  Ex.2: Marks: A, B, C, ...
Interval scale (quantitative): data demonstrate the properties of ordinal data, and the interval between values is expressed in terms of a fixed unit of measure.
  Ex.1: Examination score 0-100
  Ex.2: Internet fame score
Ratio scale (quantitative): data demonstrate all the properties of interval data, and the ratio of two values is meaningful.
  Ex.1: Weight
  Ex.2: Price
DATA AND STATISTICS
Task: Define the Scales
For the table from "Data: Elements, Variables, and Observations", define the scale of each variable: Person, Place, Gender, Net Worth ($BIL), Age, Source, Internet Fame Score.
IFS 3 log10 N 4.5?
TABULAR AND GRAPHICAL PRESENTATION
Frequency distribution, bar and pie charts, histogram, cumulative frequency distribution, scatter plot
TABULAR AND GRAPHICAL PRESENTATION
Frequency Distribution
Frequency distribution: a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.
Marks: A B C B A B B A B C
Frequency distribution:
Mark | Frequency
A | 3
B | 5
C | 2
Total | 10
Relative frequency distribution:
Mark | Relative frequency
A | 0.3
B | 0.5
C | 0.2
Total | 1
Percent frequency distribution:
Mark | Percent frequency
A | 30%
B | 50%
C | 20%
Total | 100%
In MS Excel use the following functions:
=COUNTIF(data,element) to get the number of occurrences of element in the data area
=SUM(data) to get the sum of the values in the data area
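The course itself uses Excel; as a cross-check, the three tables above can be reproduced with a minimal Python sketch. The variable names (`marks`, `frequency`, `relative`, `percent`) are illustrative and not part of the lecture material.

```python
from collections import Counter

# The ten marks from the slide.
marks = ["A", "B", "C", "B", "A", "B", "B", "A", "B", "C"]

frequency = Counter(marks)           # plays the role of =COUNTIF(data, element)
total = sum(frequency.values())      # plays the role of =SUM(data)
relative = {mark: count / total for mark, count in frequency.items()}
percent = {mark: 100 * share for mark, share in relative.items()}

for mark in sorted(frequency):
    print(mark, frequency[mark], relative[mark], f"{percent[mark]:.0f}%")
```

The relative frequencies sum to 1 and the percent frequencies to 100%, which is a quick sanity check on any frequency table.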
TABULAR AND GRAPHICAL PRESENTATION
Example: Pancreatitis Study (pancreatitis.xls)
The role of smoking in the etiology of pancreatitis has been recognized for many years. To provide estimates of the quantitative significance of these factors, a hospital-based study was carried out in eastern Massachusetts and Rhode Island between 1975 and 1979. 53 patients who had a hospital discharge diagnosis of pancreatitis were included in this unmatched case-control study. The control group consisted of 217 patients admitted for diseases other than those of the pancreas and biliary tract. Risk factor information was obtained from a standardized interview with each subject, conducted by a trained interviewer.
(adapted from Chap T. Le, Introductory Biostatistics)
Pancreatitis patients:
Smokers, Ex-smokers, Ex-smokers, Smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Smokers,
Smokers, Smokers, Ex-smokers, Smokers, Smokers, Ex-smokers, Smokers, Smokers, Ex-smokers, Ex-smokers,
Smokers, Ex-smokers, Smokers, Smokers, Never, Smokers, Ex-smokers, Ex-smokers, Smokers, Ex-smokers,
Smokers, Smokers, Ex-smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Ex-smokers, Smokers,
Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Smokers, Never,
Smokers, Smokers, Smokers
FREQUENCY DISTRIBUTION
Relative Frequency Distribution (pancreatitis.xls)
Frequency distribution: a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.
Relative frequency distribution: a tabular summary of data showing the fraction or proportion of data items in each of several nonoverlapping classes.
  The sum of all values should give 1.
  Estimation of probability distribution: when the number of experiments n → ∞, R.F.D. → P.D.
Frequency distribution:
Smoking | Cases | Controls
Never | 2 | 56
Ex-smokers | 13 | 80
Smokers | 38 | 81
Total | 53 | 217
Relative frequency distribution:
Smoking | Cases | Controls
Never | 0.038 | 0.258
Ex-smokers | 0.245 | 0.369
Smokers | 0.717 | 0.373
Total | 1 | 1
In Excel use the following functions:
=COUNTIF(data,element) to get the number of occurrences of element in the data area
=SUM(data) to get the sum of the values in the data area
TABULAR AND GRAPHICAL PRESENTATION
Crosstabulation (pancreatitis.xls)
Smoking | Disease: other | Disease: pancreatitis | Total
Ex-smokers | 80 | 13 | 93
Never | 56 | 2 | 58
Smokers | 81 | 38 | 119
Total | 217 | 53 | 270
In Excel use the following steps:
Insert → Pivot Table
Set the range, including the headers of the data
Select output and set layout by drag-and-dropping the names into the table
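What a pivot table does here can be sketched in plain Python: count each (smoking, disease) pair, then sum by row and by column. The per-patient records below are rebuilt from the counts in the table, because the raw rows of pancreatitis.xls are not reproduced here; all names are illustrative.

```python
from collections import Counter

# Cell counts taken from the crosstabulation on the slide.
cell_counts = {
    ("Ex-smokers", "other"): 80, ("Ex-smokers", "pancreatitis"): 13,
    ("Never", "other"): 56,      ("Never", "pancreatitis"): 2,
    ("Smokers", "other"): 81,    ("Smokers", "pancreatitis"): 38,
}

# One (smoking, disease) record per patient, as the raw data would look.
records = [pair for pair, n in cell_counts.items() for _ in range(n)]

table = Counter(records)                                  # cell frequencies
row_totals = Counter(smoking for smoking, _ in records)   # totals per smoking status
col_totals = Counter(disease for _, disease in records)   # totals per disease group

for smoking in sorted(row_totals):
    print(smoking, table[(smoking, "other")],
          table[(smoking, "pancreatitis")], row_totals[smoking])
print("Total", col_totals["other"], col_totals["pancreatitis"], len(records))
```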
TABULAR AND GRAPHICAL PRESENTATION
Bar and Pie Charts (pancreatitis.xls)
[Figure: bar chart "Smoking Influence on Pancreatitis", Percentage (0-80) vs. Smoking (Never, Ex-smoker, Smoker) for the groups "other" and "pancreatitis"; pie charts of the same categories for "other" and for "pancreatitis"]
In MS Excel use the following steps:
Bar chart: Insert → Column → Set data range (both columns of Percent freq. distribution)
Pie chart: Insert → Pie → Set data range (one column of Percent freq. distribution)
Pie charts: try to avoid using them in scientific reports. For public/business presentations only!
TABULAR AND GRAPHICAL PRESENTATION
Example: Mice Data Series (mice.xls)
Tordoff MG, Bachmanov AA. Survey of calcium & sodium intake and metabolism with bone and body composition data.
Project symbol: Tordoff3. Accession number: MPD:103
790 mice from different strains, http://phenome.jax.org
Parameters: Starting age, Ending age, Starting weight, Ending weight, Weight change, Bleeding time, Ionized Ca in blood, Blood pH, Bone mineral density, Lean tissues weight, Fat weight
TABULAR AND GRAPHICAL PRESENTATION
Histogram (mice.xls)
The following are weights in grams for 790 mice (excerpt):
20.5 23.2 24.6 23.5 26 25.9 23.9 22.8 19.9 20.8
22.4 26 23.8 26.5 26 22.8 22.9 20.9 19.8 22.7
31 22.7 26.3 27.1 18.4 21 18.8 21 21.4 25.7
19.7 27 26.2 21.8 22.2 19.2 21.9 22.6 23.7 26.2
26 27.5 25 20.9 20.6 22.1 20 21.1 24.1 28.8
30.2 20.1 24.2 25.8 21.3 21.8 23.7 23.5 28 27.6
21.6 21 21.3 20.1 20.8 24.5 23.8 29.5 21.4 21.5
24 21.1 18.9 19.5 32.3 28 27.1 28.2 22.9 19.9
20.4 21.3 20.6 22.8 25.8 24.1 23.5 24.2 22 20.3
Sorted weights show that the values are in the range 10 to 49.6 grams. Let us divide the weights into bins:
Weight, g | Frequency
≤10 | 1
10-20 | 237
20-30 | 417
30-40 | 124
40-50 | 11
More | 0
TABULAR AND GRAPHICAL PRESENTATION
Histogram
Now, let us use bin-size = 1 gram:
Bin | Frequency
8 | 0
9 | 1
10 | 10
11 | 11
... | ...
39 | 2
40 | 2
More | 0
[Figure: histogram, Frequency (0-70) vs. Weight, g (8-40)]
In Excel use the following steps:
Specify the column of bin (interval) upper limits
Data → Data Analysis → Histogram → select the input data, bins, and output (Analysis ToolPak should be installed)
Use Chart Wizard → Columns to visualize the results
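The binning rule behind the Excel Histogram tool (a value falls into the first bin whose upper limit is greater than or equal to it, and anything above the last limit goes to "More") can be sketched as follows. The function name `histogram` and the ten-value sample are illustrative; the sample is the first ten weights of the excerpt from the previous slide, not the full mice.xls column.

```python
from bisect import bisect_left

def histogram(values, upper_limits):
    """Count values per bin; bin i holds limits[i-1] < v <= limits[i].

    The last returned count is the "More" bin (values above the last limit).
    upper_limits must be sorted in ascending order.
    """
    counts = [0] * (len(upper_limits) + 1)
    for v in values:
        # Index of the first upper limit that is >= v.
        counts[bisect_left(upper_limits, v)] += 1
    return counts

sample = [20.5, 23.2, 24.6, 23.5, 26, 25.9, 23.9, 22.8, 19.9, 20.8]
limits = [10, 20, 30, 40, 50]

for label, count in zip(limits + ["More"], histogram(sample, limits)):
    print(label, count)
```

Changing `limits` to `list(range(8, 41))` gives the 1-gram bins used on this slide.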
TABULAR AND GRAPHICAL PRESENTATION
Scatter Plot (mice.xls)
Let us look at the mutual dependency of the Starting and Ending weights.
[Figure: scatter plot, Ending weight (0-60) vs. Starting weight (0-50)]
In Excel use the following steps:
Select the data region
Use Insert → XY (Scatter)
NUMERICAL MEASURES
Population and sample, measures of location, quantiles, quartiles and percentiles, measures of variability, z-score, detection of outliers, exploratory data analysis, box plot, covariance, correlation
NUMERICAL MEASURES
Population and Sample
Population parameter: a numerical value used as a summary measure for a population (e.g., the population mean $\mu$, variance $\sigma^2$, standard deviation $\sigma$).
Sample statistic: a numerical value used as a summary measure for a sample (e.g., the sample mean $m$, the sample variance $s^2$, and the sample standard deviation $s$).
POPULATION (all existing laboratory Mus musculus): $\mu$ mean, $\sigma^2$ variance, $N$ number of elements (usually $N = \infty$)
SAMPLE (mice.xls, 790 mice from different strains, http://phenome.jax.org): $\bar{x} \equiv m$ mean, $s^2$ variance, $n$ number of elements
ID | Strain | Sex | Starting age | Ending age | Starting weight | Ending weight | Weight change | Bleeding time | Ionized Ca in blood | Blood pH | Bone mineral density | Lean tissues weight | Fat weight
1 | 129S1/SvImJ | f | 66 | 116 | 19.3 | 20.5 | 1.062 | 64 | 1.2 | 7.24 | 0.0605 | 14.5 | 4.4
2 | 129S1/SvImJ | f | 66 | 116 | 19.1 | 20.8 | 1.089 | 78 | 1.15 | 7.27 | 0.0553 | 13.9 | 4.4
3 | 129S1/SvImJ | f | 66 | 108 | 17.9 | 19.8 | 1.106 | 90 | 1.16 | 7.26 | 0.0546 | 13.8 | 2.9
368 | 129S1/SvImJ | f | 72 | 114 | 18.3 | 21 | 1.148 | 65 | 1.26 | 7.22 | 0.0599 | 15.4 | 4.2
369 | 129S1/SvImJ | f | 72 | 115 | 20.2 | 21.9 | 1.084 | 55 | 1.23 | 7.3 | 0.0623 | 15.6 | 4.3
370 | 129S1/SvImJ | f | 72 | 116 | 18.8 | 22.1 | 1.176 | | 1.21 | 7.28 | 0.0626 | 16.4 | 4.3
371 | 129S1/SvImJ | f | 72 | 119 | 19.4 | 21.3 | 1.098 | 49 | 1.24 | 7.24 | 0.0632 | 16.6 | 5.4
372 | 129S1/SvImJ | f | 72 | 122 | 18.3 | 20.1 | 1.098 | 73 | 1.17 | 7.19 | 0.0592 | 16 | 4.1
4 | 129S1/SvImJ | f | 66 | 109 | 17.2 | 18.9 | 1.099 | 41 | 1.25 | 7.29 | 0.0513 | 14 | 3.2
5 | 129S1/SvImJ | f | 66 | 112 | 19.7 | 21.3 | 1.081 | 129 | 1.14 | 7.22 | 0.0501 | 16.3 | 5.2
10 | 129S1/SvImJ | m | 66 | 112 | 24.3 | 24.7 | 1.016 | 119 | 1.13 | 7.24 | 0.0533 | 17.6 | 6.8
364 | 129S1/SvImJ | m | 72 | 114 | 25.3 | 27.2 | 1.075 | 64 | 1.25 | 7.27 | 0.0596 | 19.3 | 5.8
365 | 129S1/SvImJ | m | 72 | 115 | 21.4 | 23.9 | 1.117 | 48 | 1.25 | 7.28 | 0.0563 | 17.4 | 5.7
366 | 129S1/SvImJ | m | 72 | 118 | 24.5 | 26.3 | 1.073 | 59 | 1.25 | 7.26 | 0.0609 | 17.8 | 7.1
367 | 129S1/SvImJ | m | 72 | 122 | 24 | 26 | 1.083 | 69 | 1.29 | 7.26 | 0.0584 | 19.2 | 4.6
6 | 129S1/SvImJ | m | 66 | 116 | 21.6 | 23.3 | 1.079 | 78 | 1.15 | 7.27 | 0.0497 | 17.2 | 5.7
7 | 129S1/SvImJ | m | 66 | 107 | 22.7 | 26.5 | 1.167 | 90 | 1.18 | 7.28 | 0.0493 | 18.7 | 7
8 | 129S1/SvImJ | m | 66 | 108 | 25.4 | 27.4 | 1.079 | 35 | 1.24 | 7.26 | 0.0538 | 18.9 | 7.1
9 | 129S1/SvImJ | m | 66 | 109 | 24.4 | 27.5 | 1.127 | 43 | 1.29 | 7.29 | 0.0539 | 19.5 | 7.1
NUMERICAL MEASURES
Measures of Location
Mean: a measure of central location computed by summing the data values and dividing by the number of observations.
Median: a measure of central location provided by the value in the middle when the data are arranged in ascending order.
Mode: a measure of location, defined as the value that occurs with greatest frequency.
Population mean: $\mu = \frac{\sum x_i}{N}$
Sample mean: $m = \bar{x} = \frac{\sum x_i}{n}$
Proportion: $p = \frac{n_{x=\mathrm{true}}}{n}$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
Mode = 23, Median = 23.5, Mean = 31.7
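The three values quoted for the Weight example can be checked with Python's standard `statistics` module. This is only a sketch alongside the Excel workflow of the course; the variable names are illustrative.

```python
import statistics

# The Weight example from the slide.
weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

mean = statistics.mean(weights)      # sum of values / number of observations
median = statistics.median(weights)  # middle value (average of the two middle ones for even n)
mode = statistics.mode(weights)      # most frequent value

print(f"mean = {mean:.1f}, median = {median}, mode = {mode}")
```

The mean (31.7) lies well above the median (23.5) because the two largest weights, 63 and 68, pull it upward; the median is not affected by how extreme they are.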
NUMERICAL MEASURES
Measures of Location (mice.xls)
[Figure: histogram and p.d.f. approximation of weight, g (10-40), with median, mean and mode marked]
Female proportion: p_f = 0.501
[Figure: density of bleeding time (N = 760, bandwidth = 5.347): median = 55, mean = 61, mode = 48]
In Excel use the following functions:
=AVERAGE(data)
=MEDIAN(data)
=MODE(data)
NUMERICAL MEASURES
Quantiles, Quartiles and Percentiles
Percentile: a value such that at least p% of the observations are less than or equal to this value, and at least (100-p)% of the observations are greater than or equal to this value. The 50th percentile is the median.
Quartiles: the 25th, 50th, and 75th percentiles, referred to as the first quartile, the second quartile (median), and the third quartile, respectively.
In Excel use the following function:
=PERCENTILE(data,p)
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
Q1 = 21, Q2 = 23.5, Q3 = 39
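Quartiles of a small sample depend on the interpolation convention, so software packages can disagree with each other and with hand calculation. The sketch below applies the two conventions offered by Python's `statistics.quantiles` to the Weight example; neither is claimed to be the convention behind the slide's Q1 = 21 and Q3 = 39, and the variable names are illustrative.

```python
import statistics

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

# "inclusive": positions spread over (n - 1) intervals, the data are treated
# as including the population minimum and maximum.
q_inclusive = statistics.quantiles(weights, n=4, method="inclusive")

# "exclusive": positions based on (n + 1), the data are treated as a sample
# that may not contain the population extremes.
q_exclusive = statistics.quantiles(weights, n=4, method="exclusive")

print("inclusive Q1, Q2, Q3:", q_inclusive)
print("exclusive Q1, Q2, Q3:", q_exclusive)
```

All conventions agree on the median (23.5); the first and third quartiles differ by a gram or two, which matters little for large data sets such as mice.xls but is visible with 12 values.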
NUMERICAL MEASURES
Measures of Variability
Interquartile range (IQR): a measure of variability, defined to be the difference between the third and first quartiles.
$\mathrm{IQR} = Q_3 - Q_1$
Variance: a measure of variability based on the squared deviations of the data values about the mean.
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - m)^2}{n - 1}$
Standard deviation: a measure of variability computed by taking the positive square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
IQR = 18, Variance = 320.2, St. dev. = 17.9
In Excel use the following functions:
=VAR(data)
=STDEV(data)
=STDEV.S(data)
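The difference between the sample formula (division by n - 1) and the population formula (division by N) can be seen on the Weight example. A minimal sketch with Python's `statistics` module; the variable names are illustrative.

```python
import statistics

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

sample_var = statistics.variance(weights)       # divides by n - 1
sample_sd = statistics.stdev(weights)           # square root of the sample variance
population_var = statistics.pvariance(weights)  # divides by N

print(f"sample variance     = {sample_var:.1f}")
print(f"sample st. dev.     = {sample_sd:.1f}")
print(f"population variance = {population_var:.1f}")
```

The sample values match the slide (320.2 and 17.9); treating the same 12 numbers as a complete population gives a smaller variance.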
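NUMERICAL MEASURES
Measures of Variability
Coefficient of variation: a measure of relative variability computed by dividing the standard deviation by the mean.
$\mathrm{CV} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$
Weight: 12 16 19 22 23 23 24 32 36 42 63 68
CV = 57%
Median absolute deviation (MAD): a robust measure of the variability of a univariate sample of quantitative data.
$\mathrm{MAD} = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$
Set 1: 23 12 22 12 21 18 22 20 12 19 14 13 17
Set 2: 23 12 22 12 21 81 22 20 12 19 14 13 17
Measure | Set 1 | Set 2
Mean | 17.3 | 22.2
Median | 18 | 19
St.dev. | 4.23 | 18.18
MAD | 5.93 | 5.93
The tabulated MAD is the median absolute deviation (4 for both sets) multiplied by the constant 1.4826, which makes it comparable to the standard deviation for normally distributed data.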
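The robustness of MAD can be checked numerically: a single wrong entry (81 instead of 18) inflates the standard deviation but leaves MAD unchanged. The sketch below assumes the scale constant 1.4826, which reproduces the 5.93 in the table; the function names `cv_percent` and `mad` are illustrative.

```python
import statistics

def cv_percent(x):
    """Coefficient of variation in percent: 100 * st. dev. / mean."""
    return 100 * statistics.stdev(x) / statistics.mean(x)

def mad(x, scale=1.4826):
    """Median absolute deviation; scale=1 gives the raw (unscaled) MAD."""
    med = statistics.median(x)
    return scale * statistics.median(abs(v - med) for v in x)

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]
set1 = [23, 12, 22, 12, 21, 18, 22, 20, 12, 19, 14, 13, 17]
set2 = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]

print(f"CV of weights = {cv_percent(weights):.0f}%")
for name, data in (("Set 1", set1), ("Set 2", set2)):
    print(name, f"st.dev. = {statistics.stdev(data):.2f}", f"MAD = {mad(data):.2f}")
```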
NUMERICAL MEASURES
Measures of Variability
Skewness: a measure of the shape of a data distribution. Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
$\mathrm{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - m}{s}\right)^3$
(adapted from Anderson et al., Statistics for Business and Economics)
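The skewness formula translates directly into code. The sketch below applies it to the Weight example used on the earlier slides (an illustrative choice, the slide gives no numerical example) and to a symmetric data set; the function name `skewness` is illustrative.

```python
import statistics

def skewness(x):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x_i - m) / s)^3)."""
    n = len(x)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    m = statistics.mean(x)
    s = statistics.stdev(x)   # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in x)

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]
symmetric = [1, 2, 3, 4, 5]

print("weights:  ", skewness(weights))    # positive: long tail to the right
print("symmetric:", skewness(symmetric))  # zero up to rounding
```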
NUMERICAL MEASURES
Measure of Association between 2 Variables
Covariance: a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
[Figure: mice.xls, Ending weight (0-60) vs. Starting weight (0-50)]
In Excel use the function:
=COVAR(data_x,data_y)
s_xy = 39.8: hard to interpret
NUMERICAL MEASURES
Measure of Association between 2 Variables
Correlation (Pearson product moment correlation coefficient): a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$
[Figure: mice.xls, Ending weight (0-60) vs. Starting weight (0-50)]
In Excel use the function:
=CORREL(data_x,data_y)
r_xy = 0.94
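The sample covariance and correlation formulas can be sketched in a few lines. The five (starting weight, ending weight) pairs below are taken from the mice table excerpt shown earlier (IDs 1, 2, 3, 10 and 364), so the result will not equal the 0.94 obtained on the full data set; the function names are illustrative.

```python
from math import sqrt

def sample_covariance(x, y):
    """s_xy = sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """r_xy = s_xy / (s_x * s_y); note that s_x^2 is the covariance of x with itself."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

starting = [19.3, 19.1, 17.9, 24.3, 25.3]
ending = [20.5, 20.8, 19.8, 24.7, 27.2]

print("s_xy =", round(sample_covariance(starting, ending), 2))
print("r_xy =", round(pearson_r(starting, ending), 2))
```

Unlike the covariance, r does not change when weights are expressed in other units, which is why it is easier to interpret.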
NUMERICAL MEASURES
Correlation Coefficient
[Figure: source Wikipedia]
If we have only 2 data points in the x and y datasets, what values would you expect for the correlation between x and y?
NUMERICAL MEASURES
z-score and Detection of Outliers
z-score: a value computed by dividing the deviation about the mean $(x_i - \bar{x})$ by the standard deviation $s$. A z-score is referred to as a standardized value and denotes the number of standard deviations $x_i$ is from the mean.
$z_i = \frac{x_i - m}{s}$
Chebyshev's theorem: for any data set, at least $(1 - 1/z^2)$ of the data values must be within $z$ standard deviations from the mean, where $z$ is any value > 1.
Weight | z-score
12 | -1.10
16 | -0.88
19 | -0.71
22 | -0.54
23 | -0.48
23 | -0.48
24 | -0.43
32 | 0.02
36 | 0.24
42 | 0.58
63 | 1.75
68 | 2.03
For ANY distribution:
At least 75% of the values are within z = 2 standard deviations from the mean
At least 89% of the values are within z = 3 standard deviations from the mean
At least 94% of the values are within z = 4 standard deviations from the mean
At least 96% of the values are within z = 5 standard deviations from the mean
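Both the z-score table and the Chebyshev percentages follow from one-line formulas. A minimal sketch; the function names `z_scores` and `chebyshev_bound` are illustrative.

```python
import statistics

def z_scores(x):
    """Standardize each value: z_i = (x_i - m) / s, with the sample st. dev. s."""
    m = statistics.mean(x)
    s = statistics.stdev(x)
    return [(v - m) / s for v in x]

def chebyshev_bound(z):
    """Minimum fraction of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem is stated for z > 1")
    return 1 - 1 / z ** 2

weights = [12, 16, 19, 22, 23, 23, 24, 32, 36, 42, 63, 68]

for w, z in zip(weights, z_scores(weights)):
    print(w, f"{z:.2f}")
for z in (2, 3, 4, 5):
    print(f"z = {z}: at least {100 * chebyshev_bound(z):.0f}%")
```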
NUMERICAL MEASURES
Detection of Outliers
For bell-shaped distributions:
Approximately 68% of the values are within 1 st.dev. from the mean
Approximately 95% of the values are within 2 st.dev. from the mean
Almost all data points are inside 3 st.dev. from the mean
[Figure: example, Gaussian distribution]
Outlier: an unusually small or unusually large data value.
For bell-shaped distributions, data points with |z| > 3 can be considered as outliers.
Weight | z-score
23 | 0.04
12 | -0.53
22 | -0.01
12 | -0.53
21 | -0.06
81 | 3.10
22 | -0.01
20 | -0.11
12 | -0.53
19 | -0.17
14 | -0.43
13 | -0.48
17 | -0.27
NUMERICAL MEASURES
Task: Detection of Outliers (mice.xls)
Using Excel, try to identify outlier mice on the basis of the Weight change variable.
$z_i = \frac{x_i - m}{s}$
For bell-shaped distributions, data points with |z| > 3 can be considered as outliers.
In Excel use the following functions:
=AVERAGE(data) - mean, m
=STDEV(data) - standard deviation, s
=ABS(data) - absolute value
Sort by z-score to identify outliers
DETECTION OF OUTLIERS
Iglewicz-Hoaglin Method
Iglewicz-Hoaglin method: modified Z-score. These authors recommend that modified Z-scores with an absolute value of greater than 3.5 be labeled as potential outliers.
$z_i = \frac{0.6745\,\left(x_i - \mathrm{median}(x)\right)}{\mathrm{MAD}(x)}$, with $\mathrm{MAD}(x) = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)$
$|z_i| > 3.5$: outlier
Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor
More methods are at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
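The modified Z-score can be sketched as below and tried on the 13 weights with the suspicious value 81 from the earlier outlier slide. MAD here is the raw median absolute deviation (no 1.4826 factor, since 0.6745 already plays that role); the function name `modified_z_scores` and the variable names are illustrative.

```python
import statistics

def modified_z_scores(x):
    """Iglewicz-Hoaglin modified Z-score: 0.6745 * (x_i - median) / MAD."""
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x)  # raw, unscaled MAD
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in x]

weights = [23, 12, 22, 12, 21, 81, 22, 20, 12, 19, 14, 13, 17]

scores = modified_z_scores(weights)
outliers = [w for w, z in zip(weights, scores) if abs(z) > 3.5]

for w, z in zip(weights, scores):
    print(w, f"{z:.2f}")
print("potential outliers:", outliers)
```

Because the median and MAD barely react to the extreme value, the score of 81 comes out far above the 3.5 threshold, whereas its ordinary z-score sits close to the cut-off of 3.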
NUMERICAL MEASURES
Exploratory Data Analysis
Five-number summary: an exploratory data analysis technique that uses five numbers to summarize the data: smallest value, first quartile, median, third quartile, and largest value.
children.xls:
Min.: 12
Q1: 25
Median: 32
Q3: 46
Max.: 79
In Excel use: Data → Data Analysis → Descriptive Statistics
Box plot: a graphical summary of data based on a five-number summary.
[Figure: box plot with Min, Q1, Q2, Q3 and Max marked; whisker limit 1.5 IQR]
NUMERICAL MEASURES
Box-Plot in Excel (mice.xls)
Example: build a box plot for ending weights of male and female mice.
5-num. sum. | FEMALE | MALE
Min | 10.0 | 12.0
Q1 | 17.2 | 23.8
Q2 (median) | 20.7 | 27.1
Q3 | 23.3 | 31.2
Max | 41.5 | 49.6
box sizes | FEMALE | MALE | formula
box 1 hidden | 17.2 | 23.8 | = Q1
box 2 lower | 3.6 | 3.3 | = Q2 - Q1
box 3 upper | 2.6 | 4.1 | = Q3 - Q2
whiskers | FEMALE | MALE | formula
whisker top | 18.2 | 18.4 | = MAX - Q3
whisker bot | 7.2 | 11.8 | = Q1 - MIN
1. Build 5-number summaries for males and females
2. Calculate the sizes of the boxes and whiskers (here simplified whiskers are used)
3. Show Stacked Columns, switch rows/columns
4. Make the hidden box transparent
5. Add custom error bars
See Contextures Inc. tutorial: https://www.youtube.com/watch?v=ucwmfmxb1kk
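Step 2 of the recipe is plain arithmetic on the five-number summary, sketched below. The inputs are the rounded summary values from the table, so a result can differ in the last digit from the slide, which was presumably computed from unrounded quartiles (for example 3.5 instead of 3.6 for the female lower box). The names `summary` and `box_sizes` are illustrative.

```python
# Five-number summaries of ending weight, as given in the table (rounded).
summary = {
    "FEMALE": {"min": 10.0, "q1": 17.2, "q2": 20.7, "q3": 23.3, "max": 41.5},
    "MALE":   {"min": 12.0, "q1": 23.8, "q2": 27.1, "q3": 31.2, "max": 49.6},
}

def box_sizes(s):
    """Sizes of the stacked columns and error bars for an Excel box plot."""
    return {
        "box 1 hidden": round(s["q1"], 1),            # transparent spacer up to Q1
        "box 2 lower": round(s["q2"] - s["q1"], 1),   # Q1 to median
        "box 3 upper": round(s["q3"] - s["q2"], 1),   # median to Q3
        "whisker top": round(s["max"] - s["q3"], 1),  # simplified whisker: up to Max
        "whisker bot": round(s["q1"] - s["min"], 1),  # simplified whisker: down to Min
    }

for sex, values in summary.items():
    print(sex, box_sizes(values))
```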
NUMERICAL MEASURES
Weighted Mean
Weighted mean: the mean obtained by assigning each observation a weight that reflects its importance.
$m_w = \frac{\sum w_i x_i}{\sum w_i}$
As an example of the need for a weighted mean, consider the following sample of five purchases of a raw material over several months.
[Table: cost per pound and quantity for five purchases]
Note that the cost per pound varies from $2.80 to $3.40, and the quantity purchased varies from 500 to 2750. Suppose that a manager asked for information about the mean cost per pound of the raw material. If we used a simple mean of the cost per pound, we would overestimate the average cost!
(Anderson et al., Statistics for Business and Economics)
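The effect described above can be illustrated with a short sketch. The five purchases below are illustrative values chosen to match the ranges quoted in the text (cost $2.80 to $3.40, quantity 500 to 2750); they are an assumption and are not taken from the slide's table, which is not reproduced here.

```python
# (cost per pound in $, quantity in pounds): assumed, illustrative purchases.
purchases = [(3.00, 1200), (3.40, 500), (2.80, 2750), (2.90, 1000), (3.25, 800)]

costs = [cost for cost, _ in purchases]
quantities = [qty for _, qty in purchases]

# Simple mean ignores how much was bought at each price.
simple_mean = sum(costs) / len(costs)

# Weighted mean: m_w = sum(w_i * x_i) / sum(w_i), with quantity as the weight.
weighted_mean = sum(q * c for c, q in purchases) / sum(quantities)

print(f"simple mean   = {simple_mean:.2f} $ per pound")
print(f"weighted mean = {weighted_mean:.2f} $ per pound")
```

With these numbers the largest purchase was made at the lowest price, so the simple mean is higher than the cost actually paid per pound.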
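NUMERICAL MEASURES
Grouped Mean
Grouped data: data available in class intervals as summarized by a frequency distribution. Individual values of the original data are not available.
children.xls:
Bin | Frequency
20 | 5
30 | 21
40 | 8
50 | 14
60 | 3
70 | 4
80 | 2
More | 0
Mean for grouped data: $m = \frac{\sum_{i=1}^{k} f_i M_i}{n}$
Variance for grouped data: $s^2 = \frac{\sum_{i=1}^{k} f_i (M_i - m)^2}{n - 1}$
where $f_i$ is the frequency and $M_i$ the midpoint of class $i$, $k$ is the number of classes, and $n = \sum f_i$.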
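The grouped-data formulas can be applied to the table above once class midpoints are fixed. The sketch assumes that each bin value is the upper limit of a class of width 10 (so the first class is 10 to 20 with midpoint 15); this reading of the table is an assumption, and the variable names are illustrative.

```python
# Bin upper limits and frequencies from the table (the empty "More" bin is dropped).
upper_limits = [20, 30, 40, 50, 60, 70, 80]
frequencies = [5, 21, 8, 14, 3, 4, 2]

CLASS_WIDTH = 10  # assumption: equal-width classes, bin value = upper limit
midpoints = [u - CLASS_WIDTH / 2 for u in upper_limits]

n = sum(frequencies)

# m = sum(f_i * M_i) / n
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# s^2 = sum(f_i * (M_i - m)^2) / (n - 1)
grouped_var = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(f"n = {n}, mean = {grouped_mean:.1f}, variance = {grouped_var:.1f}")
```

These are approximations: every value in a class is treated as if it sat at the class midpoint, so the result generally differs somewhat from the mean and variance of the raw data.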
QUESTIONS?
Thank you for your attention
To be continued