Mendel Short Course @ IGES 2003
Data Preparation
Eric Sobel
Department of Human Genetics, UCLA School of Medicine
02 November 2003
Web Sites
- Mendel5: www.genetics.ucla.edu/software
- SimWalk3: www.genetics.ucla.edu/software
- FBAT: biosun1.harvard.edu/~fbat/default.html
- Mega2: watson.hgen.pitt.edu/mega2.html
Before Statistical Analysis: Data Preparation (Overview)
- Types of data communicated to the software
- Utilities that assist in creating the data files
  - Gregor
  - SimRun
  - Mega2
Before Statistical Analysis: Data Preparation (Overview)
- Analysis robustness to small perturbations in the data
  - Mistyping Analysis
- Making the data more useful
  - Allele consolidation
  - Locus consolidation
Data File Manipulations
- Data translations are tedious and error-prone (although less so when done by a program, e.g., Mega2)
- There are many statistical genetics programs but few methods!
- Settle on a few programs that you trust and know well, both how to use them and what they assume
Types of Data
- Control data: which analysis should the software perform?
- Locus data: which loci are the data from?
  - Qualitative genetic loci, e.g., traits and markers
  - Qualitative non-genetic factors, e.g., smoker
  - Quantitative variables, e.g., birth year, BMI, ACE
- Map data: what is the genomic layout of the genetic loci?
More Types of Data
- Pedigree data: each individual's data
  - Parents, sex, and twin status
  - Phenotypes at the loci, factors, and variables
- Penetrance data (for parametric analyses): how does genotype affect phenotype at the trait loci?
- SNP data (Mendel-specific option): which loci should be consolidated?
Control Data
Here one sets all the parameters necessary to specify the type of analysis to run. For example, the following is a Mendel Control.in File:
  OUTPUT_FILE = Mendel.out          !The name of the output file
  LOCUS_FILE = Locus0.in            !The name of the locus file
  MAP_FILE = Map0.in                !The name of the map file
  PEDIGREE_FILE = Ped0.in           !The name of the pedigree file
  VARIABLE_FILE = Variable0.in      !The name of the variable file
  ANALYSIS_OPTION = Mistyping       !Analysis option
  MODEL = 1                         !Sub-option for the analysis
  MALE = M                          !Symbol for male
  FEMALE = F                        !Symbol for female
  Allele_separator = -              !Symbol used within genotypes
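
To make the control-file layout concrete, here is a minimal Python sketch that reads the example above into a dictionary. It is not part of Mendel. It assumes that `!` starts a trailing comment and that keywords are case-insensitive (the example mixes `OUTPUT_FILE` and `Allele_separator`); both are inferences from the slide. The function name `parse_control` is hypothetical.

```python
def parse_control(text):
    """Parse 'KEYWORD = value !comment' lines into a dict."""
    settings = {}
    for raw in text.splitlines():
        # Assumption: everything after the first '!' is a comment.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'KEYWORD = value', got {raw!r}")
        # Assumption: keywords are case-insensitive, so normalise to upper case.
        settings[key.strip().upper()] = value.strip()
    return settings


CONTROL_TEXT = """\
OUTPUT_FILE = Mendel.out          !The name of the output file
LOCUS_FILE = Locus0.in            !The name of the locus file
MAP_FILE = Map0.in                !The name of the map file
PEDIGREE_FILE = Ped0.in           !The name of the pedigree file
VARIABLE_FILE = Variable0.in      !The name of the variable file
ANALYSIS_OPTION = Mistyping       !Analysis option
MODEL = 1                         !Sub-option for the analysis
MALE = M                          !Symbol for male
FEMALE = F                        !Symbol for female
Allele_separator = -              !Symbol used within genotypes
"""

control = parse_control(CONTROL_TEXT)
print(control["ANALYSIS_OPTION"], control["PEDIGREE_FILE"])
```

A value that itself contains `!` would be cut short by this sketch, so it is only suitable for quick checks of which files and options a run will use.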
Comments on Flexible, Mendel-Format Data Files
- Comma-delimited files are shown
- Column-specific files are also permitted
  - This is particularly useful for the pedigree data, since the software can be told how to read almost any consistently formatted data set
- Missing values are blanks
- All objects can be named using words (eight or fewer characters), not just integers
- LINKAGE format pedigree files are accepted
- Many more Mendel features are listed in the manual
Locus Data
Qualitative Genetic Loci
- Name of the locus
- Chromosomal region, if known; X-linked or Autosomal is allowed
- Number and names of alleles (optional)
- Allele frequencies (optional)
- Number and names of phenotypes and the genotypes with which they are compatible (only required if you use phenotypes at that locus)
Qualitative Genetic Locus Data File Example
  Egomania,1q,2,2,
  a,0.99,
  b,0.01,
  NORMAL,3,
  a-a,
  a-b,
  b-b,
  AFFECTED,3,
  a-a,
  a-b,
  b-b,
  Marker1,1q,2,0,
  213,0.445,
  217,0.555,
  Marker2,autosome,0,0,
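
One sanity check worth running on a locus file is that the listed allele frequencies at each locus sum to 1. The Python sketch below does this for the values in the example. The dictionaries are typed in by hand rather than parsed from the file, and the helper name `frequencies_sum_to_one` and the tolerance are assumptions made for illustration. Whether Mendel itself rescales or rejects frequencies that do not sum to 1 is not stated in these slides.

```python
import math


def frequencies_sum_to_one(freqs, tol=1e-6):
    """Return True when the allele frequencies of one locus sum to 1 within tol."""
    return math.isclose(sum(freqs.values()), 1.0, abs_tol=tol)


# Allele frequencies copied from the example locus file.
loci = {
    "Egomania": {"a": 0.99, "b": 0.01},
    "Marker1": {"213": 0.445, "217": 0.555},
    "Marker2": {},  # no alleles listed in the example (they are optional)
}

for name, freqs in loci.items():
    if not freqs:
        print(name, "no frequencies given, nothing to check")
    elif frequencies_sum_to_one(freqs):
        print(name, "ok")
    else:
        print(name, "frequencies sum to", sum(freqs.values()))
```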
More Locus Data
Qualitative Non-Genetic Factors
- Name of factor
- Number and names of categories
Quantitative Variables
- Name of variable
- Minimum and maximum values allowed
More Example Locus Files
Factors are listed at the end of the Mendel Locus File:
  HEALTH,FACTOR,2,0,
  Good,
  Poor,
  PROBAND,FACTOR,1,0,
  PROBAND,
Quantitative variables are listed in the Mendel Variable File, one per line:
  YearBorn,1900,2003,
Map Data
- Contains the relative positions of the qualitative genetic loci in the genome
- For Mendel (& SimWalk), only those loci in both the Locus and Map files will be analyzed!
- Sex-specific recombination fractions (and thus genetic distances) are allowed
- One can also specify the number of analysis points within each interval
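
The rule that only loci present in both the Locus and Map files are analyzed is easy to check before a run. The Python sketch below uses the locus names from this course's example files. The function name `loci_to_analyze` is hypothetical, and keeping the loci in map order is an assumption made here for readability.

```python
def loci_to_analyze(locus_names, map_names):
    """Return the loci named in both files, in the order they appear in the map."""
    in_locus_file = set(locus_names)
    return [name for name in map_names if name in in_locus_file]


# Names taken from the example Locus and Map files in this course.
locus_file_loci = ["Egomania", "Marker1", "Marker2"]
map_file_loci = ["Egomania", "Marker1", "Marker2", "Marker3"]

analyzed = loci_to_analyze(locus_file_loci, map_file_loci)
# Loci that appear in only one of the two files will not be analyzed.
skipped = sorted(set(locus_file_loci) ^ set(map_file_loci))

print("analyzed:", analyzed)
print("skipped:", skipped)
```

Listing the skipped loci explicitly helps catch a misspelled locus name, which would otherwise silently drop a marker from the analysis.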
Example Map Data File
For example, the following is a Mendel Map File:
  Egomania,
  0.10,0.05,,
  Marker1,
  0.01,,4,
  Marker2,,,,,
  Marker3,
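
Recombination fractions and genetic distances are linked by a map function. As a worked illustration only, the Python sketch below converts the three numbers 0.10, 0.05, and 0.01 from the example with Haldane's map function, d = -50 ln(1 - 2θ) centimorgans. Two assumptions are made here that the slides do not confirm: that these numbers are recombination fractions, and that Haldane's function is the one of interest. The slides do not say which map function Mendel applies.

```python
import math


def haldane_cm(theta):
    """Convert a recombination fraction to centimorgans with Haldane's map function."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= theta < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * theta)


# Values assumed to be recombination fractions from the example map file.
for theta in (0.10, 0.05, 0.01):
    print(f"theta = {theta:.2f} -> {haldane_cm(theta):.2f} cM")
```

For small recombination fractions the distance in centimorgans is close to 100θ, and the two diverge as θ grows.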
Pedigree Data
For each individual, one lists:
- Pedigree name
- Person name
- Parental names
  - Either both parents are in the pedigree or neither is (none implies a founder)
- Sex
- Name of twin set
- Phenotypes for each of the loci, factors, and quantitative variables in the Locus and Variable Files, and in the same order! (Blanks imply a missing value.)
Example Pedigree File
For example, the following is a Mendel Pedigree File:
  Bush, George,,,M,,AFFECTED,213-217,1946,
  Bush, Laura,,,F,,NORMAL, 213-213,1946,
  Bush, Barbara,George,Laura,F,,NORMAL, 213-213,1981,
  Bush, Jenna, George,Laura,F,,AFFECTED,,1981,
  Clinton,Bill,,,M,,AFFECTED,213-217,1946,
  Clinton,Hillary,,,F,,AFFECTED,213-217,1947,
  Clinton,Chelsea,Hillary,Bill,F,,NORMAL, 213-213,1980,
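
The pedigree rules (both parents listed or none, and listed parents present in the same pedigree) can be verified with a short script before running an analysis. The Python sketch below reads the example records with the standard `csv` module. It assumes the first six fields are pedigree, person, two parents, sex, and twin set, in the order given on the Pedigree Data slide, and that the trailing comma merely terminates the record. The names `read_pedigree` and `check_parents` are hypothetical. The two parent fields are not treated as father and mother because the example lists them in either order.

```python
import csv
import io

PEDIGREE_TEXT = """\
Bush, George,,,M,,AFFECTED,213-217,1946,
Bush, Laura,,,F,,NORMAL, 213-213,1946,
Bush, Barbara,George,Laura,F,,NORMAL, 213-213,1981,
Bush, Jenna, George,Laura,F,,AFFECTED,,1981,
Clinton,Bill,,,M,,AFFECTED,213-217,1946,
Clinton,Hillary,,,F,,AFFECTED,213-217,1947,
Clinton,Chelsea,Hillary,Bill,F,,NORMAL, 213-213,1980,
"""


def read_pedigree(text):
    """Read comma-delimited pedigree records into a list of dicts."""
    people = []
    for row in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in row]  # padding spaces are not significant here
        if not any(fields):
            continue
        if fields[-1] == "":
            fields.pop()  # assumption: the trailing comma only ends the record
        if len(fields) < 6:
            raise ValueError(f"too few fields: {row!r}")
        pedigree, person, parent_a, parent_b, sex, twin = fields[:6]
        people.append({
            "pedigree": pedigree, "person": person,
            "parent_a": parent_a, "parent_b": parent_b,
            "sex": sex, "twin": twin,
            "phenotypes": fields[6:],  # blank entries are missing values
        })
    return people


def check_parents(people):
    """Return a list of messages for records that break the parent rules."""
    known = {(p["pedigree"], p["person"]) for p in people}
    problems = []
    for p in people:
        parents = [name for name in (p["parent_a"], p["parent_b"]) if name]
        if len(parents) == 1:
            problems.append(f"{p['pedigree']}/{p['person']}: only one parent listed")
        for name in parents:
            if (p["pedigree"], name) not in known:
                problems.append(f"{p['pedigree']}/{p['person']}: parent {name} not in pedigree")
    return problems


people = read_pedigree(PEDIGREE_TEXT)
founders = [p["person"] for p in people if not p["parent_a"] and not p["parent_b"]]
print("founders:", founders)
print("problems:", check_parents(people))
```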
Penetrance Data
- For a few analyses only, e.g., Parametric Linkage and Genetic Counseling
- Contains the model specifying how genotype influences phenotype at a trait locus
- For each phenotype, set the values:
  - Pr( phenotype | 1/1 )
  - Pr( phenotype | 1/2 )
  - Pr( phenotype | 2/2 )
Example Penetrance File
For example, the following is a Mendel Penetrance File:
  Egomania,PROB,,,2,
  NORMAL, 0.90, 0.05, 0.05,
  AFFECTED,0.10, 0.95, 0.95,
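
Because NORMAL and AFFECTED are the only two phenotypes at this locus, the penetrances for any one genotype should sum to 1 across the two rows. The Python sketch below checks this for the example values. It assumes the three numeric columns follow the genotype order 1/1, 1/2, 2/2 given on the Penetrance Data slide, and the helper name `genotype_totals` is hypothetical.

```python
import math

# Assumed column order, following the Penetrance Data slide.
GENOTYPES = ("1/1", "1/2", "2/2")

# Values copied from the example penetrance file.
penetrance = {
    "NORMAL": (0.90, 0.05, 0.05),
    "AFFECTED": (0.10, 0.95, 0.95),
}


def genotype_totals(table):
    """Sum Pr(phenotype | genotype) over phenotypes, separately for each genotype."""
    return {g: sum(col) for g, col in zip(GENOTYPES, zip(*table.values()))}


totals = genotype_totals(penetrance)
for genotype, total in totals.items():
    status = "ok" if math.isclose(total, 1.0, abs_tol=1e-9) else "does not sum to 1"
    print(genotype, status)
```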
SNP File
- Only used for the Locus Consolidation Utility
- Can consolidate up to four loci into one super-locus
- Each locus can have up to nine alleles
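
The two limits above bound how large a super-locus can become. The Python sketch below checks a proposed consolidation against those limits and counts the allele combinations. It assumes that each distinct combination of one allele per locus becomes one allele of the super-locus, which the slides do not state explicitly. The function name `super_locus_size` is hypothetical.

```python
from math import prod

MAX_LOCI = 4     # from the slide: up to four loci per super-locus
MAX_ALLELES = 9  # from the slide: up to nine alleles per locus


def super_locus_size(allele_counts):
    """Return the number of allele combinations for the loci being consolidated."""
    if not 1 <= len(allele_counts) <= MAX_LOCI:
        raise ValueError(f"expected 1 to {MAX_LOCI} loci, got {len(allele_counts)}")
    for n in allele_counts:
        if not 1 <= n <= MAX_ALLELES:
            raise ValueError(f"expected 1 to {MAX_ALLELES} alleles per locus, got {n}")
    # Assumption: one super-locus allele per combination of single-locus alleles.
    return prod(allele_counts)


print(super_locus_size([2, 2]))     # two biallelic SNPs
print(super_locus_size([2, 2, 2]))  # three biallelic SNPs
print(super_locus_size([MAX_ALLELES] * MAX_LOCI))  # the largest case allowed
```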
Example SNP File
For example, the following is a Mendel SNP File:
  2,
  SNP1,SNP2,
  3,
  SNP3,SNP4,SNP5,
Constructing the Data Files
- The Gregor program eases construction of the Mendel Control.in File and running Mendel itself
- SimRun does the same for SimWalk and its control file, called BATCH3.DAT
- Many pedigree formats are supported by Mendel, and the other files are small and easily constructed!
- SimWalk will copy Mendel's file formatting flexibility by 2004.
Constructing the Data Files
- More and more databases will generate analysis input files directly
- Mega2 is a useful utility that converts LINKAGE format data and pedigree files to the input files for many other packages, including Mendel and SimWalk2
- The next major version of Mega2 may better support Mendel 5 and SimWalk 3