Similar documents
Intro to SPSS. Using SPSS through WebFAS

Introduction to SPSS S0

RESULTS REPORTING MANUAL. Hospital Births Newborn Screening Program June 2016

Using SPSS for Correlation

SAMPLING ERROI~ IN THE INTEGRATED sysrem FOR SURVEY ANALYSIS (ISSA)

SmartVA-Analyze 2.0 Help

Psychology Research Methods Lab Session Week 10. Survey Design. Due at the Start of Lab: Lab Assignment 3. Rationale for Today s Lab Session

Spectrum. Quick Start Tutorial

External validation of a COPD prediction model using population-based primary care data: a nested case-control study

IPUMS CPS Extraction and Analysis

Problem Two (Data Analysis) Grading Criteria and Suggested Answers

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

An Introduction to Modern Econometrics Using Stata

One-Way Independent ANOVA

Analysis of Categorical Data from the Ashe Center Student Wellness Survey

Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn

Key Results Liberia Demographic and Health Survey

MNSCREEN TRAINING MANUAL Hospital Births Newborn Screening Program October 2015

Lesson: A Ten Minute Course in Epidemiology

Introduction to SPSS: Defining Variables and Data Entry

Using the NWD Integrative Screener as a Data Collection Tool Agency-Level Aggregate Workbook

Chapter Eight: Multivariate Analysis

Tutorial #7A: Latent Class Growth Model (# seizures)

Lab 4 (M13) Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

QUESTIONNAIRE USER'S GUIDE

MALARIA INDICATOR SURVEY MODEL BIOMARKER QUESTIONNAIRE IDENTIFICATION (1) CLUSTER NUMBER... HOUSEHOLD NUMBER... FIELDWORKER VISITS FIELDWORKER'S NAME

Data Analysis. A cross-tabulation allows the researcher to see the relationships between the values of two different variables

UN Handbook Ch. 7 'Managing sources of non-sampling error': recommendations on response rates

Chapter Eight: Multivariate Analysis

Reliability. Scale: Empathy

The Effectiveness of Captopril

Data mining with Ensembl Biomart. Stéphanie Le Gras

Documentation TwinLife Data: Filtering inconsistencies in discrimination items

Immunization Scheduler Quick Start Guide

User Guide for Classification of Diabetes: A search tool for identifying miscoded, misclassified or misdiagnosed patients

Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES

Routine Questionnaire (A1)

To open a CMA file > Download and Save file Start CMA Open file from within CMA

ANALYSIS OF SURVEYS WITH EPI INFO AND STATA

Modeling Natural Selection Activity

BlueBayCT - Warfarin User Guide

How to Conduct On-Farm Trials. Dr. Jim Walworth Dept. of Soil, Water & Environmental Sci. University of Arizona

ANOVA. Thomas Elliott. January 29, 2013

QuantiPhi for RL78 and MICON Racing RL78

FGM Data Collection NHS Digital CAP

ECDC HIV Modelling Tool User Manual version 1.0.0

v Feature Stamping SMS 13.0 Tutorial Prerequisites Requirements Map Module Mesh Module Scatter Module Time minutes

LSAT India Analytical Reasoning 2019

MS&E 226: Small Data

Rapid decline of female genital circumcision in Egypt: An exploration of pathways. Jenny X. Liu 1 RAND Corporation. Sepideh Modrek Stanford University

To open a CMA file > Download and Save file Start CMA Open file from within CMA

A SAS sy Study of ediary Data

ATLANTIS WebOrder. ATLANTIS ISUS User guide

Publishing WFS Services Tutorial

AQA (A) Research methods. Model exam answers

Using Lertap 5 in a Parallel-Forms Reliability Study

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

Comparing Proportions between Two Independent Populations. John McGready Johns Hopkins University

Data Analysis with SPSS

Simple Linear Regression One Categorical Independent Variable with Several Categories

I'm Squeaky Clean! Interactive Presentation Slide Notes. Slide 1. Notes: Personal & Consumer Health / Grades K-1

25. Two-way ANOVA. 25. Two-way ANOVA 371

VACCINE REMINDER SERVICE A GUIDE FOR SURGERIES

Note to the interviewer: Before starting the interview, ensure that a signed consent form is on file.

OUTLIER SUBJECTS PROTOCOL (art_groupoutlier)

Chapter 3 Introducing a Control Variable (Multivariate Analysis)

Tip Sheet: Investigating an Acute Hepatitis B Event using MAVEN Purpose of Surveillance and Case Investigation:

Pain Relief During Labor

Reshaping of Human Fertility Database data from long to wide format in Excel

Pilates Handbook. Rockwell Collins Recreation Center

GLOW FOODS. By the end of this module, students should be able to:

Automated process to create snapshot reports based on the 2016 Murray Community-based Groups Capacity Survey: User Guide Report No.

Psychology of Perception Psychology 4165, Spring 2003 Laboratory 1 Weight Discrimination

OECD QSAR Toolbox v.4.2. An example illustrating RAAF scenario 6 and related assessment elements

MedRx Video Otoscope Software

Technology Design 1. Masters of Arts in Learning and Technology. Technology Design Portfolio. Assessment Code: TDT1 Task 3. Mentor: Dr.

Gender Attitudes and Male Involvement in Maternal Health Care in Rwanda. Soumya Alva. ICF Macro

In Class Problem Discovery of Drug Side Effect Using Study Designer

Section F. Measures of Association: Risk Difference, Relative Risk, and the Odds Ratio

Bruker D8 Discover User Manual (Version: July 2017)

PSI RESEARCH TOOLKIT. Dashboard Analysis Series Five: Analysis Methodology for Complex Survey Data B UILDING R ESEARCH C APACITY

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

To provide you with necessary knowledge and skills to accurately perform 3 HIV rapid tests and to determine HIV status.

Cochrane Pregnancy and Childbirth Group Methodological Guidelines

Stat 13, Lab 11-12, Correlation and Regression Analysis

Digital Impression Scanning

Title:Modern contraceptive use among sexually active men in Uganda: Does discussion with a health worker matter?

DOWNLOAD PDF 101 DATABASE EXERCISES

Tutorial: RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures

ORIENTATION SAN FRANCISCO STOP SMOKING PROGRAM

Lesson 8 STD & Responsible Actions

USIIS User Documentation AFIX Assessment Reports

Pharmaceutical Applications

Psychology of Perception Psychology 4165, Fall 2001 Laboratory 1 Weight Discrimination

STAT 100 Exam 2 Solutions (75 points) Spring 2016

Spartan Dairy 3 Tutorial

HRS 2008 SECTION D: COGNITION PAGE 1 ******************************************************************

Transcription:

Stata: Merge and append Topics: Merging datasets, appending datasets - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1. Terms There are several situations when working with large population datasets that you need to append or merge datasets. Let us clarify a few terms first. Observations are the rows in the dataset. Observations are related to the unit of analysis in the dataset. For example, observations might be women, or households, or children. Variables are the columns in the dataset. Each variable contains a unique piece of information about each observation. The master dataset, is the main dataset. The master dataset usually is organized by the unit of analysis for example the master dataset would be women if we were answering a research question about pregnancies. The merging dataset contains the variables that we are adding to the master dataset. The appending dataset contains the observations that we are adding to the master dataset. www.populationsurveyanalysis.com Page 1 of 5

Here is one situation with DHS data that requires both merge and append: Let us say that we will analyze HIV infection status and several demographic and social covariates in both men and women. In the DHS data, women covariates, men covariates, and HIV test results (for men and women) are stored in three separate files. So we have to first append men and women demographic and social data, then merge HIV status. 2. Append Let us look at this scenario using 2010 Rwanda DHS data. 2a. Prepare data for append Before we append the women and men datasets, we open each original dataset, make some changes to facilitate the append, and then save modified versions of each file. With the men s individual recode file open, we make the following changes: First, we keep only those variables that we need for this analysis. Second, we ensure that we have the same variables in both datasets, and that the variables share a common variable name. Let us use the rename statement to change the variables names in the male recode file (which start with mv ) to start with v like the woman s file. Third, we create a unique numerical ID for each person. For DHS data, a person ID is derived from the cluster, household, and household member line number. In this example, these are variables v001, v002, and v003, respectively. Fourth, we generate a new variable called gender since we are about to combine data from men and women. www.populationsurveyanalysis.com Page 2 of 5

We perform these steps first in the men dataset and save it. Then we repeat these steps with the women s individual recode dataset creating variables for person ID and gender. Let us check the count of observations in each file and make a note, so we can check our work later. 2b. Append statement Now that both datasets have the same variable names, we are ready to append them. The append statement itself is quite easy to use. First, open the master dataset, in this case the woman file that we just created. With the master dataset open, type append using and the path and name of the men dataset in quotes. We check that the data appended correctly by tabulating the variable gender. We see that the file contains 6,329 men and 13,671 women as expected. Then we save a new permanent dataset in our project\data folder called menwomen.dta. 3. Merge 3a. Prepare data for merge Before merging HIV status to the combined menwomen dataset, we ensure that the HIV dataset has a unique person ID variable to link observations in the HIV dataset with observations in the combined menwomen dataset. We keep only the variables that we need, and save the file. 3b. Merge statement The syntax for merge is simple. Start by opening the master dataset, in this case the menwomen dataset that we just saved. After the merge statement, we must specify the type of merge (1:1, m:1, or 1:m), the variable name that is common in both datasets, and then type using and the path and name of the merging dataset in quotes. www.populationsurveyanalysis.com Page 3 of 5

3c. Merge types (1:1, m:1, 1:m) There are three types of merges that are generally performed with survey data: one-to-one, many-toone, and one-to-many. Our example of merging HIV status to person is an example of a one-to-one merge. One-to-one means that there is one observation in the master dataset for each observation in the merging dataset. It is possible that some observations will not match, but there are not duplicated IDs in either dataset. If we were merging mother data onto kid data, then we would have a many-to-one merge because there are many kids per woman. A one-to-many merge would be the opposite. 3d. Investigating unmatched data When we use the merge statement, Stata automatically generates a new variable in the dataset called _merge, and provides a summary of that variable in the output window. The _merge variable always has 3 categories: 1 is assigned to unmatched observations from the master dataset, 2 is assigned to unmatched observations in the using dataset, and 3 is assigned to matched observations. The output window tells us that 13,248 men and women had an HIV test result. And there were 6,752 men and women without an HIV result, and 274 HIV results for whom we have no other survey information. Before moving on, it is essential to understand why we have any unmatched IDs. In this case, I investigated these unmatched records using the browse statement, by comparing the merged data to the household roster dataset, and by reading the final DHS Report to check the response rates for these different datasets. Ultimately I determined that the 274 people tested for HIV with no matching survey main data refused or had incomplete responses. The 7,026 people who responded to the main survey but did not get tested for HIV were expected per the sample design because only every other household was selected for HIV testing. www.populationsurveyanalysis.com Page 4 of 5

3e. Keeping observations Since our research question is about HIV, and because we have a sampling weights for everyone tested for HIV (the HIV sample weight is hiv05), we decide to keep the 13,248 matching records, plus the 274 HIV test results that did not have matching observations in the main women s or men s surveys. Ultimately, these 274 records will fall out of the analysis due to missing data, but it is important to keep them in the dataset so that Stata can calculate variance estimates correctly based on the total observations per the sample design. When we are done with the _merge variable, we drop it to keep the dataset clean, then save the final merged dataset. If it is unclear why we kept _merge==2, check out the Survey Analysis in Stata video at populationsurveyanalysis.com. www.populationsurveyanalysis.com Page 5 of 5