Data Science and Statistics in Research: unlocking the power of your data

Similar documents
Medical Statistics 1. Basic Concepts Farhad Pishgar. Defining the data. Alive after 6 months?

2.4.1 STA-O Assessment 2

Observational studies; descriptive statistics

Chapter 1: Exploring Data

Population. Sample. AP Statistics Notes for Chapter 1 Section 1.0 Making Sense of Data. Statistics: Data Analysis:

Introduction to Statistical Data Analysis I

M 140 Test 1 A Name (1 point) SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 75

Instructions and Checklist

Averages and Variation

Section 6: Analysing Relationships Between Variables

NORTH SOUTH UNIVERSITY TUTORIAL 1

M 140 Test 1 A Name SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 60

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Here are the various choices. All of them are found in the Analyze menu in SPSS, under the sub-menu for Descriptive Statistics :

General Example: Gas Mileage (Stat 5044 Schabenberger & J.P.Morgen)

Probability and Statistics. Chapter 1

VU Biostatistics and Experimental Design PLA.216

Section I: Multiple Choice Select the best answer for each question.

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

(a) 50% of the shows have a rating greater than: impossible to tell

Undertaking statistical analysis of

Chapter 1. Picturing Distributions with Graphs

Statistics is a broad mathematical discipline dealing with

Data, frequencies, and distributions. Martin Bland. Types of data. Types of data. Clinical Biostatistics

Homework Exercises for PSYC 3330: Statistics for the Behavioral Sciences

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI

Outline. Practice. Confounding Variables. Discuss. Observational Studies vs Experiments. Observational Studies vs Experiments

Table of Contents. Plots. Essential Statistics for Nursing Research 1/12/2017

Business Statistics Probability

Unit 1 Exploring and Understanding Data

STOR 155 Section 2 Midterm Exam 1 (9/29/09)

Methodological skills

bivariate analysis: The statistical analysis of the relationship between two variables.

SAMPLE ASSESSMENT TASKS MATHEMATICS ESSENTIAL GENERAL YEAR 11

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test February 2016

V. Gathering and Exploring Data

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

4Stat Wk 10: Regression

Test 1C AP Statistics Name:

Statistical Methods Exam I Review

Unit 7 Comparisons and Relationships

STT 200 Test 1 Green Give your answer in the scantron provided. Each question is worth 2 points.

Biostatistics for Med Students. Lecture 1

1SCIENTIFIC METHOD PART A. THE SCIENTIFIC METHOD

Summarizing Data. (Ch 1.1, 1.3, , 2.4.3, 2.5)

Frequency distributions

Lesson 9 Presentation and Display of Quantitative Data

How to interpret scientific & statistical graphs

(a) 50% of the shows have a rating greater than: impossible to tell

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

Student name: SOCI 420 Advanced Methods of Social Research Fall 2017

Measuring the User Experience

Still important ideas

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Name AP Statistics UNIT 1 Summer Work Section II: Notes Analyzing Categorical Data

STP226 Brief Class Notes Instructor: Ela Jackiewicz

Chapter 1: Introduction to Statistics

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

An Introduction to Research Statistics

Math 2200 First Mid-Term Exam September 22, 2010

Still important ideas

Chapter 7: Descriptive Statistics

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

Lesson 1: Distributions and Their Shapes

Unit 1 Outline Science Practices. Part 1 - The Scientific Method. Screencasts found at: sciencepeek.com. 1. List the steps of the scientific method.

CHAPTER 3 Describing Relationships

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

C-1: Variables which are measured on a continuous scale are described in terms of three key characteristics central tendency, variability, and shape.

Before we get started:

Statistics 371, lecture 3

q2_2 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Lesson 1: Distributions and Their Shapes

Basic Biostatistics. Chapter 1. Content

Understandable Statistics

Analysis of Categorical Data from the Ashe Center Student Wellness Survey

QPM Lab 9: Contingency Tables and Bivariate Displays in R

LAB 1 The Scientific Method

STATISTICS 8 CHAPTERS 1 TO 6, SAMPLE MULTIPLE CHOICE QUESTIONS

Aim #1a: How do we analyze and interpret different types of data?

Chapter 1: Explaining Behavior

Distributions and Samples. Clicker Question. Review

Slide 1 - Introduction to Statistics Tutorial: An Overview Slide notes

ANOVA in SPSS (Practical)

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Section 1: Exploring Data

AP Statistics. Semester One Review Part 1 Chapters 1-5

Example The median earnings of the 28 male students is the average of the 14th and 15th, or 3+3

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistics and Epidemiology Practice Questions

DOWNLOAD PDF SUMMARIZING AND INTERPRETING DATA : USING STATISTICS

Chapter 2: The Organization and Graphic Presentation of Data Test Bank

Key: 18 5 = 1.85 cm. 5 a Stem Leaf. Key: 2 0 = 20 points. b Stem Leaf Key: 2 0 = 20 cm. 6 a Stem Leaf. c Stem Leaf

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

9 research designs likely for PSYC 2100

CCM6+7+ Unit 12 Data Collection and Analysis

BIOSTATS 540 Fall 2017 Exam 1 Page 1 of 12

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

What you should know before you collect data. BAE 815 (Fall 2017) Dr. Zifei Liu

Transcription:

Data Science and Statistics in Research: unlocking the power of your data Session 1.4: Data and variables 1/ 33

OUTLINE Types of data Types of variables Presentation of data Tables Summarising Data 2/ 33

Types of data 3/ 33

WHAT ARE DATA AND VARIABLES? Data are the results of sampling values of some variables associated with a population or a process. A variable takes on one of a set of allowed values each time it is observed. Variables can be either qualitative or quantitative. Multiple measurements of a variable form a sample. 4/ 33

EXAMPLES OF DATA Heart rates of a patients heart rates taken at various times of day heart rate is a variable each measurement is an observation of that variable. Car attributes collecting fuel consumption and 10 other aspects of car design and performance for 32 automobiles. fuel consumption and the 10 other aspects are variables each car tested is an observation of these variables. Charateristics of iris flowers measurements of petals for 50 iris flowers petal length and width are variables each iris measured is an observation of these variables. 5/ 33

Types of variables 6/ 33

QUALITATIVE VARIABLES These take on distinct values or classes. Categorical: for example, whether someone travels to work by car/bus/train/foot/bicycle/motor cycle. These are called nominal data. Ordered categorical: for example whether someone is a non-smoker/light/moderate/heavy smoker or has low/medium/high blood pressure. These are called ordinal data as there is an order to the classes. Binary: a special case of variables which take on one of two possible values, for example true/false, male/female, survived/died. 7/ 33

QUANTITATIVE VARIABLES These take on numeric values and can be of two classes. Discrete: for example, number of patients in a study, number of cases of disease. Continuous: for example, temperature, blood pressure. Continuous quantitative variables can have their values grouped into classes and presented as discrete or ordered categorical variables. 8/ 33

EXAMPLE: MOTOR TREND CAR ROAD TESTS In R, there is a dataset from the 1974 Motor Trend US magazine comprises fuel consumption and 10 aspects of design and performance for 32 cars (1973 74 models). Dataset has 32 rows (observations) and 11 columns (variables). Let s look at a few variables to understand the types mpg - Miles per Gallon cyl Number of cylinders hp Horsepower wt Weight (1000 lbs) am Automatic or manual gear Number of gears. 9/ 33

EXAMPLE: MOTOR TREND CAR ROAD TESTS head(mtcars) mpg cyl hp wt am gear Mazda RX4 21.0 6 110 2.620 1 4 Mazda RX4 Wag 21.0 6 110 2.875 1 4 Datsun 710 22.8 4 93 2.320 1 4 Hornet 4 Drive 21.4 6 110 3.215 0 3 Hornet Sportabout 18.7 8 175 3.440 0 3 Valiant 18.1 6 105 3.460 0 3 10/ 33

EXAMPLE: MOTOR TREND CAR ROAD TESTS mpg - Miles per Gallon. cyl Number of cylinders. hp Horsepower. wt Weight (1000 lbs). am Automatic or manual. gear Number of gears. 11/ 33

EXAMPLE: MOTOR TREND CAR ROAD TESTS mpg - Miles per Gallon (Continuous). cyl Number of cylinders (Categorical). hp Horsepower (Discrete). wt weight (1000 lbs) (Continuous). am Automatic or manual (Binary). gear Number of gears (Discrete). 12/ 33

EXAMPLE: MOTOR TREND CAR ROAD TESTS We can also group continuous quantitative variables into classes and present as discrete/ordered categorical variables. Lets categorise the horsepower of the cars into three groups: (i) 1-100, (ii) 101-200 and (iii) 201+. Horsepower Frequency Percentage 1-100 9 28.1% 101-200 16 50.0% 201+ 7 21.9% Total 32 100% 13/ 33

Presentation of data 14/ 33

PRESENTING DATA We may want to describe data using tabulation visualisation. The most appropriate type of presentation will depend on the variable type (qualitative/quantitative) number of variables being presented. 15/ 33

TABULATION Tables can be used to describe almost any type of data (provided there is not too much!). Table: Years of smoking and lung capacity (on a scale 0-100) for emphysema patients. Patient Years Smoked Lung Capacity 1 25 55 2 36 60 3 22 50 4 15 30 5 48 75 6 39 70 16/ 33

VISUALISATION You can visualise your data using pie charts, bar charts, histograms, scatter plots, box plots, line plots and maps. 0/150 2.5 2.0 1 10 factor(1) Species setosa versicolor virginica count 5 Petal Width 1.5 1.0 Species setosa versicolor virginica 100 50 0.5 count 0 4 6 8 Number of Cylinders 0.0 2 4 6 Petal Length 35 30 mpg 25 20 15 Number Unemployed (in 1000s) 12000 8000 Northing 2 1 0 1 2 10 4000 Easting 3 4 6 8 as.factor(cyl) 1970 1980 1990 2000 2010 Date 17/ 33

Tables 18/ 33

USING TABLES TO PRESENT DATA The way that you present data in tables are very important. Readers are often drawn towards tables and figures, because it is an efficient way of obtaining information, as compared to reading a written account of the same content. Tables and figures add value to an analysis, if they can portray the relevant information and are concise. Tables and figures can provide readers with a large amount of information in a short time span. 19/ 33

USING TABLES TO PRESENT DATA Ensure that tables are self-explanatory by using clear, informative captions and titles. Be careful consistent in the way that you display information remove repetition set amount of decimal places be careful of scientific notation. Make sure your table only contains information that adds value to your analysis. Always review a table as if you are a non-expert! 20/ 33

EXAMPLE: USING TABLES TO PRESENT DATA Consider an analysis that tests whether a new pesticide affects the growth of wheat plants. Half of the wheat plants are given the new pesticide (Treatment) with the other half not given any (Control). Some plants regardless of their treatment are given 12 hours of light per day and the rest 16 hours of light per day. The height of the wheat plants are measured after 5 and 10 days of treatment. For the initial data analysis, the means and variance of the wheat plants height are produced. The results are presented in a table. 21/ 33

EXAMPLE: USING TABLES TO PRESENT DATA Table: Height after treatment Group light 5 days 10 days control 12 70.3 (2) 90 (5) Control 16 75.7 (8) 100 (3) treatment 12 60.4 (1.5) 78 (7.9) Treatment 16 52.2 (2.01) 81 (6.7) Is this the clearest way of portraying this information? 22/ 33

EXAMPLE: USING TABLES TO PRESENT DATA Comments Labels are not consistent capitalised in some places but not others. There are too many borders in the table Many journals will not accept vertical borders. The way they are ordered suggests we should compare the affect of light on the height not the treatment. The number of decimal places are not consistent. We cannot see what type of descriptive statistics are being used. The amount of light is repeated. The caption does not give enough information to clearly understand the table without knowing the study information. 23/ 33

EXAMPLE: USING TABLES TO PRESENT DATA Comments Labels are not consistent capitalised in some places but not others. There are too many lines in the table. The way they are ordered suggests we should compare the affect of light on the height not the treatment. The number of decimal places are not consistent. We cannot see what type of descriptive statistics are being used. The amount of light is repeated. The caption does not give enough information to clearly understand the table without knowing the study information. 24/ 33

EXAMPLE: USING TABLES TO PRESENT DATA Table: Means and variances of the height (in centimetres) of wheat plants after 5 and 10 days; for control and treatment groups. 5 Days 10 Days Group Mean Variance Mean Variance 12 hours of light Control 70.3 2.2 90.2 5.0 Treatment 60.4 1.5 78.0 7.9 16 hours of light Control 75.7 7.6 99.9 2.9 Treatment 52.2 2.0 81.1 6.7 25/ 33

Summarising Data 26/ 33

DESCRIPTIVE STATISTICS A statistic is calculated from the values of variable(s) in a sample. Various statistics are routinely used to describe samples. The following data refer to the total cost of drugs (in Burundi francs) received by 84 adults aged 20-29 visiting five different health centres in the Myinga province of Burundi in 1991-2. 2.0 4.0 6.2 8.0 8.0 8.0 12.0 15.0 15.0 18.0 18.2 20.0 20.0 20.0 21.0 22.0 24.2 27.0 27.0 27.0 28.0 29.7 29.7 29.7 29.7 29.7 30.0 30.0 31.0 41.4 42.3 45.4 45.4 45.4 45.4 45.4 45.4 45.4 45.4 45.4 49.4 50.8 51.4 53.4 56.0 57.0 57.4 59.0 59.4 60.0 61.5 65.4 65.4 65.4 65.4 67.0 90.8 92.0 94.0 94.0 94.0 94.0 94.0 94.0 94.7 105.0 125.0 126.0 130.0 130.0 130.0 151.2 160.0 177.0 187.0 187.0 194.4 194.4 194.4 212.4 213.0 233.0 267.0 320.4 27/ 33

EXAMPLE: DRUG COSTS There are many statistics that could be calculated from these data. The values the more common ones discussed earlier are listed in the following table. Table: Sample statistics for the drug cost data. Figure: Density plot of the drug costs data. Statistic Value Sample Size 84 Mean 75.1 Median 51.1 Variance 4494.9 Standard Deviation 67.0 Minimum 2 Maximum 320.4 Range 318.4 Lower Quartile 28 Upper Quartile 99.4 Interquartile Range 71.4 28/ 33

MEDIAN AND QUARTILES The median is a measure of the central value of the distribution of data. It halves the distribution; 50% of the values are below and 50% of the values above. The median by itself is of limited use, so we also find the, minimum, lower quartile (Q u ), upper quartile (Q l ) and maximum which with the median (the middle quartile) split the data into four intervals. Statistic Quantile List Position R code Minimum min 0% 1 min(x) 1 Lower quartile Q l 25% (N + 1) quantile(x,probs=0.25) 4 1 Median median 50% (N + 1) median(x) 2 3 Upper quartile Q u 75% (N + 1) quantile(x,probs=0.75) 4 Maximum max 100% N max(x) 29/ 33

MEDIAN AND QUARTILES Where the list position is not a whole number, the values above and below should be averaged together to give the relevant value. An idea of the spread is given by calculating the inter-quartile range IQR = Q u Q l = IQR(x) or just by calculating the range Range = max min = max(x) min(x) Advantage extreme values do not affect the quartiles. Disadvantage can be difficult to find if you have big data! 30/ 33

MEAN The mean is the most commonly used measure of the central value of a distribution. It is the sum of the observations divided by N the number of observations mean = x = 1 N N x i = sum(x)/length(x) i=1 When the data is distributed symmetrically the mean will generally close to the mean. The median is far more robust to extreme values in your data. 31/ 33

VARIANCE When using the mean, the we categorise the spread of the data using the standard deviation, which is based on the difference of the observations from the mean. The variance is calculated by dividing the sum of squares of these deviations by N 1 1 N 1 N (x i x) 2 = sum((x mean(x)) 2 )/(length(x) 1) i=1 The standard deviation is equal to the square root of the variance. Advantage uses all the information available and is therefore extensively used). Disadvantage extreme values can affect the mean and standard deviation. 32/ 33

Any Questions? 33/ 33