Chapter 5. Describing numerical data

Chapter 5 Describing numerical data 1

Topics Numerical descriptive measures Location Variability Other measurements Graphical methods Histogram Boxplot, Stem and leaf plot Scatter plot for bivariate data 2

Exploring numerical data The number of observations in the data set The center of the data Mean Median The variability of the data Variance or standard deviation Range Interquartile range Coefficient of Variation 3

Other measures Minimum and Maximum Exploring numerical data First quartile (25th percentile) and third quartile (75th percentile) Percentiles Skewness Kurtosis 4

Example 5.1 The percentage returns achieved by 76 average risk funds 21.88 59.82 33.28 20.23 15.78 13.59 32.98 11.31 21.46 14.45 25.43 10.67 12.11 17.69 56.87 4.43 24.95 20.13 8.75 12.45 19.89 37.71 49.97 19.67 55.4 27.16 39.44 19.45 15.42 55.9 28.25 31.03 31.15 41.86 49.76 24.89 11.37 18.19 32.5 7.57 29.18 31.7 67.69 10.09 61.88 24.08 7.97 24.22 18.18 24.21 31.96 45.39 56.63 14.36 10 45.37 27.16-2.7 11.38 12.67 12.41 91.15 8.4 30.52 20.26 4.08 26.09 12.21 67.45 43.31 19.21 18.49 21.31 25.18 26.85 18.89 5

Some preliminary observations What can we say about the returns of these funds? Almost all the funds have a positive return. The spread of the returns ranges from -2.7% to 91.15%. There appears to be one very high return, 91.15%. 6

Descriptive statistics: SAS SAS:Proc Univariate Program data ex5 1ar; infile F:\ST2137\lecdata\ex5 1ar.txt firstobs=2; input return; proc univariate data=ex5 1ar; title Simple Descriptive Statistics ; var return; run; 7

Output from Proc Univariate 8

Output from Proc Univariate 9

Output from Proc Univariate 10

Output from Proc Univariate 11

Output from Proc Univariate 12

SAS: Proc Mean proc means data=ex5 1ar n mean std min max stderr maxdec=4; title Procedure Means ; var return; run; 13

Output from Proc Mean 14

Descriptive statistics: R > ex5.1=read.table ( F:/ST2137/lecdata/ex5 1ar.txt,header=T) > ex5.1ar=ex5.1[,1] > summary(ex5.1ar) Min 1st Qu Median Mean 3rd Qu. Max -2.70 14.17 22.98 27.00 32.62 91.15 > mean(ex5.1ar) [1] 27.00079 > median(ex5.1ar) [1]22.98 > min(ex5.1ar) [1] -2.7 > max(ex5.1ar) [1] 91.15 15

Descriptive statistics: R > range(ex5.1ar) [1] -2.70 91.15 > quantile(ex5.1ar) 0% 25% 50% 75% 100% -2.7000 14.1675 22.9800 32.6200 91.1500 > var(ex5.1ar) [1] 312.0435 > sd(ex5.1ar) [1] 17.66475 > IQR(ex5.1ar) #interquartile range [1] 18.4525 16

Descriptive statistics: R > ex5.1ar[order(ex5.1ar)[1:5]] #the smallest 5 observations [1] -2.70 4.08 4.43 7.57 7.97 > nar=length(ex5.1ar) > ex5.1ar[order(ex5.1ar)[(nar-5):nar]] #the biggest 5 observations [1] 56.87 59.82 61.88 67.45 67.69 91.15 > cv=function(x)sd(x)/mean(x)*100 # Compute coefficient of variation (ex5.1ar) > cv(ex5.1ar) return 65.42 17

Calculating sample skewness: R Unbiased estimator of skewness is given by n(n 1) n 2 > skew=function(x){ n=length(x) m3=sum((x-mean(x)) 3)/n m2=sum((x-mean(x)) 2)/n m3/m2 (3/2) sqrt(n (n-1))/(n-2)} > skew(ex5.1ar) [1] 1.216081 ( m 3 ) m 3/2 2 18

Calculating sample kurtosis: R Unbiased estimator of kurtosis is given by (n 1) (n 2)(n 3) ((n + 1)m 4 m 2 2 3(n 1)) > kurt=function(x){ n=length(x) m4=sum((x-mean(x)) 4)/n m2=sum((x-mean(x)) 2)/n (n-1)/((n-2)*(n-3))*((n+1)*m4/m2 2-3*(n-1))} > kurt(ex5.1ar) [1] 1.545188 19

Descriptive Statistics: SPSS Import ex5.1ar.txt into the SPSS and call it ex5.1ar.sav Analyze Descriptive Statistics Descriptive... Move the variable return from the left panel to the right panel Click Option Choose the statistics that you want to compute Click Continue OK 20

Descriptive Statistics: SPSS 21

Output of descriptive statistics: SPSS Descriptive Statistics N Range Minimum Maximum Return Statistic Statistic Statistic Statistic 76 93.85-2.70 91.15 Mean Std Deviation Variance Statistic Std.Error Statistic Statistic 27.0008 2.02629 17.666475 312.043 Skewness Kurtosis Statistic Std.Error Statistic Std.Error 1.216.276 1.545.545 22

Graphical Methods: SAS Box Plot Stem and leaf plot Histogram QQ Plot 23

Plot in Proc Univariate: SAS Program /*Boxplot, stem and leaf plot and qq plot */ proc univariate data=ex5 1ar plot; var return; run; 24

Plot in Proc Univariate: SAS output 25

Plot in Proc Univariate: SAS output 26

Program /*Histogram and qqplot */ options reset=all ftext= Arial/bo cback=white colors=(black) gunit=pct htext=2 hpos=15; Histogram: SAS 27

Histogram: SAS Program proc univariate data=ex5 1ar; var return; histogram return/normal; /*Option normal creates a normal curve */ inset mean= Mean std= Standard Deviation /font= Arial pos=ne height=4; qqplot return; run; quit; 28

Output: Histogram in SAS 29

Output: Histogram in SAS 30

Histogram: R > hist(return,include.lowest=true,freq=true, main=paste( Histogram of return ), xlab= return, ylab= frequency, axes=true) > # Normal curve imposed on the histogram >xpt=seq(-10,100,0.1) >ypt=dnorm(seq(-10,100,0.1),mean(return),sd(return)) >ypt=ypt*length(return)*10 >lines(xpt,ypt) 31

QQplot: R > # qqplot of normal distribution > qqnorm(return) >qqline(return) 33

Stem and leaf plot: R > stem(return) 34

Boxplot: R >boxplot(return) 35

Plot: SPSS Open the SPSS data file ex5.1ar.sav Analyze Descriptive Statistics Explore.. Click plots Choose the plots you want Click Continue OK 36

Plot: SPSS 37

Output: Histogram in SPSS 38

Output: QQ plot in SPSS 39

Output: Boxplot in SPSS 40

Alternative way to get histogram in SPSS Open the SPSS data file ex5.1ar.sav Graphs Histogram... Move the variable return from the left panel to the right panel under Variable Choose Display normal curve, then click OK 41

Alternative way to get QQplot in SPSS Open the SPSS data file ex5.1ar.sav Analyze Descriptive Statistics QQ plot. Move the variable return from the left panel to the right panel under Variable Choose Normal in the Test distribution, then click OK 42

Alternative way to get Boxplot in SPSS Open the SPSS data file ex5.1ar.sav Graphs Legacy Dialogs Box plot... Choose Simple in the Summaries in separate variables, then click Define Move the variable return from the left panel to the right panel under Variable Click OK 43

Example 5.2 In addition to the percentage returns of the 76 average-risk funds, we have the following percentage returns achieved by 47 high-risk funds: 24.47 8.76 58.71 35.07-22.82 15.67 37.47 13.4 49.02-3.16 43.97 13.91 23.84-2.89 39.64 29.72 17.91-10.55 47.38 68.58 86.13-12.57-0.33-5.32 4 0.14 45.97 23.87 38.23 63.79 26.6 6.62 6.72 36.02 29.51 22.56 1.62 29.33 28.91 29.32 42.91 8.87 54.43 28.31 30.76 49.76 1.98 44

Preparing the dataset For a given fund, there are two variables: Return: Percentage return of the fund and Risk: 1 for average-risk fund and 2 for high-risk fund. Notice that the variable Risk is a classification (categorical) variable. 45

How to enter the data Two columns with the first column representing the return percentage while the second column identifies the type of funds There are 123 data in both columns. 46

Descriptive Statistics by groups: SAS proc format; value $risk 1 = Average Risk 2 = High Risk ; data ex5 2; infile F:\ST2137\lecdata\ex5 2.txt ; input return risk$; label return= Return Percentage ; format risk $risk.; 47

Descriptive Statistics by groups: SAS proc univariate data=ex5 2 plot; histogram return/midpoints=-30 to 90 by 15 normal; class risk; inset mean= Mean std= Standard Deviation /font= Arial pos=nw height=3; qqplot return; run; quit; 48

Output by groups: SAS 49

Output by groups: SAS 50

proc boxplot data=ex5 2; plot return*risk; run; Boxplot by groups: SAS m 51

Descriptive statistics by groups: R >ex5.2=read.table( G:/ST2137/lecdata/ex5 2.txt,header=T) >attach(ex5.2) >ex5.2ar=ex5.2[risk==1, Return ] >ex5.2hr=ex5.2[risk==2, Return ] >summary(ex5.2hr) Min 1st Qu Median Mean 3rd Qu. Max -22.82 6.67 26.60 24.81 38.94 86.13 52

>stem(ex5.2hr) Stem and leaf plot by groups: R 53

>boxplot(return risk) Boxplot by groups: R 20 0 20 40 60 80 1 2 54

Descriptive statistics by groups: SPSS Import ex5.2.txt into the SPSS and call it ex5.2.sav Analyze Descriptive Statistics Explore... Move Return from the left panel to the Dependent List panel Move Risk from left panel to the Factor List panel 55

Descriptive statistics by groups: SPSS Click Plots Choose the plots that you want Click Continue OK 56

Boxplot by groups:spss 57

Plot of bivariate data: SAS Refer to the data set htwt1 in Tutorial 1 data htwt1; infile F:\ST2137\tutorialdata\htwt1.txt ; input id gender$ height weight siblings; proc gplot data=htwt1; title Scatter Plot of WEIGHT by HEIGHT ; plot weight* height; symbol value=dot color=black; run; 58

Scatter plot of bivariate data: SAS 59

Plot of bivariate data: SAS proc gplot data=htwt1; title Using gender to generate the plotting symbols ; plot weight height=gender; symbol1 value=circle color=red; symbol2 value=square color=black; run; 60

Scatter plot of bivariate data for groups: SAS 61

Plot of bivariate data: SAS proc sort data=htwt1; by gender; proc gplot data=htwt1; title Separate Plots by GENDER ; by gender; plot weight*height; run; 62

Scatter plot of bivariate data for groups: SAS 63

Scatter plot of bivariate data for groups: SAS 64

Plot of bivariate data: R > htwt1=read.table( F:/ST2137/tutorialdata/htwt1.txt,header=TRUE) > names(htwt1)=c( id, gender, height, weight, siblings ) > attach(htwt1) > plot(height,weight,main= Plot of Weight by Height ) 65

Scatter plot of bivariate data for groups: R Plot of Weight by Height weight 40 50 60 70 80 150 160 170 180 height 66

Plot of bivariate data: R >plot(height[gender== M ],weight[gender== M ], +main= Use Gender to geneerate the plotting symbol, +ylab= Weight,xlab= Height, +xlim=c(150,190), ylim=c(40,80)) >par(new=t) >plot(height[gender== F ],weight[gender== F ], +main=,ylab=,xlab=,xlim=c(150,190), ylim=c(40,80), +axes=f,pch=0,col=2) 67

Scatter plot of bivariate data for groups: R Use Gender to geneerate the plotting symbol Weight 40 50 60 70 80 150 160 170 180 190 Height 68

Plot of bivariate data: R >par(mfrow=c(2,1)) >plot(height[gender== M ],weight[gender== M ], +main= Plot of Weight by Height for Male, +ylab= Weight,xlab= Height, +xlim=c(160,190), ylim=c(40,80)) >plot(height[gender== F ],weight[gender== F ], +main= Plot of Weight by Height for Female, +ylab= Weight,xlab= Height, +xlim=c(140,180), ylim=c(40,80)) 69

Output of separate plots by gender : R Plot of Weight by Height for Male Weight 40 50 60 70 80 160 165 170 175 180 185 190 Height Plot of Weight by Height for Female Weight 40 50 60 70 80 140 150 160 170 180 Height 70

Import the data set htwt1 Plot of bivariate data: SPSS Graph Legacy Dialogs Scatter/Dot... Choose Simple Scatter Define Move the variable height to the horizontal axis in the 2-D coordinate and weight to the vertical axis in the 2-D coordinate. 71

Scatter plot of bivariate data: SPSS 72

Plot of bivariate data for groups: SPSS Graph Legacy Dialogs Scatter/Dot... Move the variable height to the horizontal axis in the 2-D coordinate and weight to the vertical axis in the 2-D coordinate. Move the variable gender to the label cases by window. 73

Plot of bivariate data for groups: SPSS 74

Plot of bivariate data for groups: SPSS Click Options, choose Display chart with case labels 75

Output of plot by groups: SPSS 76

Separate plots by gender : SPSS Graph Legacy Dialogs Scatter/Dot... Move the variable height to the horizontal axis in the 2-D coordinate and weight to the vertical axis in the 2-D coordinate. Move the variable gender to the Panel Columns window. 77

Output of plot by gender :SPSS 78