Data Integration: An Example Using GenePattern

Curtis Huttenhower, NESS 2010

In this short demonstration, we will use the GenePattern server {Reich, 2006} on a set of yeast data mapped to human orthologs to detect enriched gene sets using GSEA {Subramanian, 2003}. We will also obtain inter-species orthology mappings from InParanoid {Berglund, 2008} and gene identifier mappings from BioMart {Haider, 2009}.

1. First, obtain a set of sample yeast data from Brauer et al 2008, "Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast," by downloading it from the following URL:

http://growthrate.princeton.edu/data/dilution_rate_00_raw.cdt

This expression dataset was collected from yeast grown at six different constant growth rates in chemostats. For each growth rate, the yeast was limited on one of six nutrients: glucose (carbon), nitrate (nitrogen), phosphate (phosphorus), sulfate (sulfur), leucine, or uracil. The 36 resulting conditions are organized in six blocks, ranging from e.g. G0.05 (the slowest glucose-limited growth rate) to G0.3 (the fastest glucose-limited growth rate). In this exercise, we'll consider mainly the first block of six conditions, yeast limited for glucose (and thus with highly perturbed carbon metabolism).

2. Convert this CDT file (a format used by the Stanford Microarray Database, http://smd.stanford.edu/help/formats.shtml#cdt) to a GCT file as used by GenePattern (http://www.broadinstitute.org/cancer/software/genepattern/tutorial/gp_fileformats.html#_creating_input_files_cdt). Open the file in Excel and delete the "GID", "UID", and "GWEIGHT" columns. Add two rows to the top of the file; set the first cell of the first row to "#1.2", the first cell of the second row to "5537" (the number of gene rows), and the second cell of the second row to "36" (the number of condition columns). Save this file as "dilution_rate_00_raw.gct", making sure that Excel doesn't helpfully append an extra ".txt" to the filename!
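If you'd rather not do step 2 by hand, the same edit can be sketched in a few lines of Python. This is only a minimal sketch, not part of the original exercise: the function name cdt_to_gct is made up here, and it assumes that the first line of the CDT file is the header and that exactly two identifier columns remain once "GID", "UID", and "GWEIGHT" are dropped (a GCT file expects a name and a description column before the data). Any extra non-gene rows in a real CDT file would need to be removed separately.

```python
import csv
import io


def cdt_to_gct(cdt_text, drop=("GID", "UID", "GWEIGHT"), id_columns=2):
    """Convert tab-delimited CDT text to GCT 1.2 text.

    Assumption: the first line is the header, and after removing the
    `drop` columns the first `id_columns` columns identify the gene
    while all remaining columns are conditions.
    """
    rows = list(csv.reader(io.StringIO(cdt_text), delimiter="\t"))
    header = rows[0]
    # Indices of the columns we keep, in their original order.
    keep = [i for i, name in enumerate(header) if name not in drop]
    # Pad short rows with empty cells so every row has the same width.
    table = [[row[i] if i < len(row) else "" for i in keep]
             for row in rows[1:] if row]
    n_conditions = len(keep) - id_columns
    lines = ["#1.2",
             "%d\t%d" % (len(table), n_conditions),
             "\t".join(header[i] for i in keep)]
    lines += ["\t".join(row) for row in table]
    return "\n".join(lines) + "\n"
```

To use it, read "dilution_rate_00_raw.cdt" into a string, pass it to cdt_to_gct, and write the result to "dilution_rate_00_raw.gct". Check that the counts on the second line agree with what you expect (5537 genes and 36 conditions in this exercise).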

3. Navigate to the InParanoid database of cross-species orthologous proteins at http://inparanoid.sbc.su.se. Click "Download", then navigate to "current" and "orthoXML". Find the "InParanoid.H.sapiens-S.cerevisiae.orthoXML" file and right-click on it to download and save it.

4. Unfortunately, InParanoid stores human proteins as Ensembl identifiers, while GenePattern expects HGNC symbols as input. Let's download a mapping from BioMart, starting by navigating to http://www.biomart.org.

5. Click on "MARTVIEW", select "ENSEMBL GENES 57 (SANGER UK)" from the first menu, "Homo sapiens genes (GRCh37)" from the second, and you should see the following:

6. Click "Attributes" on the left and click the "+" to expand the "GENE" section. Uncheck "Ensembl Gene ID" and "Ensembl Transcript ID" and select "Ensembl Protein ID" instead.

7. Click the "+" to expand the "EXTERNAL" section and scroll down a bit. Check "HGNC symbol," which is under "External References."

8. Scroll back to the top and click "Results." Check the "Unique results only" box and, finally, click the "Go" button. Save the resulting "mart_export.txt" file when it asks you to.
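Before moving on, it can be worth checking what the export actually contains. The sketch below is an illustration only, not part of the original handout: the function name read_biomart_map is made up, and it assumes the file has a single header line followed by two tab-separated columns (Ensembl protein ID, then HGNC symbol), as selected in steps 6 and 7. Rows with a missing symbol are skipped.

```python
def read_biomart_map(lines, skip_header=True):
    """Map each Ensembl protein ID to the list of HGNC symbols given for it.

    Assumption: two tab-separated columns per line, protein ID first.
    Lines with a missing ID or symbol are ignored.
    """
    mapping = {}
    iterator = iter(lines)
    if skip_header:
        next(iterator, None)  # discard the column titles
    for line in iterator:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2 or not fields[0] or not fields[1]:
            continue
        mapping.setdefault(fields[0], []).append(fields[1])
    return mapping
```

For example, read_biomart_map(open("mart_export.txt")) returns a dictionary whose size tells you how many human proteins have at least one HGNC symbol. A protein can map to more than one symbol, which is why the values are lists.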

9. Now, we're going to use the following Python script to simultaneously map the yeast identifiers in our GCT file to human Ensembl proteins (using InParanoid) and from there to HGNC symbols (using the BioMart file). This will result in a new GCT file containing human gene identifiers that we can feed to GenePattern.

    #!/usr/bin/env python

    import re
    import sys

    if len( sys.argv ) < 2:
        raise Exception( "Usage: inparanoid_gct.py <inparanoid.orthoxml> [mart_export.txt] < <data.gct>" )
    strinparanoid = sys.argv[1]
    # The BioMart mapping file is optional, as the usage message indicates.
    strmap = sys.argv[2] if len( sys.argv ) > 2 else None

    # Ensembl protein ID -> list of HGNC symbols
    hashmap = {}
    if strmap:
        for strline in open( strmap ):
            astrline = strline.strip( ).split( "\t" )
            if len( astrline ) != 2:
                continue
            if astrline[0] in hashmap:
                hashmap[astrline[0]].append( astrline[1] )
            else:
                hashmap[astrline[0]] = [astrline[1]]

    astrgenes = []
    hashclusters = {}
    hashoutput = {}
    ahashclusters = []
    pregene = re.compile( r'gene id="(\d+)".*protid="([^"]+)"' )
    preend = re.compile( r'\/genes' )
    precluster = re.compile( r'cluster id="(\d+)"' )
    premember = re.compile( r'generef id="(\d+)"' )
    icluster = hashcluster = 0
    fend = False
    for strline in open( strinparanoid ):
        pmatch = pregene.search( strline )
        if pmatch:
            igene = int( pmatch.group( 1 ) )
            if len( astrgenes ) <= igene:
                astrgenes.extend( [None] * ( igene - len( astrgenes ) + 1 ) )
            astrgenes[igene] = pmatch.group( 2 )
            # Genes listed before the first closing "genes" tag belong to
            # the first species (human), which is the one we output.
            if not fend:
                hashoutput[pmatch.group( 2 )] = True
            continue
        pmatch = preend.search( strline )
        if pmatch:
            fend = True
            continue
        pmatch = precluster.search( strline )
        if pmatch:
            icluster = int( pmatch.group( 1 ) )
            while len( ahashclusters ) <= icluster:
                ahashclusters.append( {} )
            hashcluster = ahashclusters[icluster]
            continue
        pmatch = premember.search( strline )
        if pmatch:
            strgene = astrgenes[int( pmatch.group( 1 ) )]
            foutput = strgene in hashoutput
            astrcur = hashmap[strgene] if ( strgene in hashmap ) else [strgene]
            for strgene in astrcur:
                if foutput:
                    hashcluster[strgene] = True
                if strgene in hashclusters:
                    hashclusters[strgene].append( icluster )
                else:
                    hashclusters[strgene] = [icluster]

    # Read the GCT file from standard input and group its rows by cluster.
    iline = iconditions = strheaders = 0
    aaastrclusters = []
    for strline in sys.stdin:
        iline += 1
        strline = strline.strip( )
        astrline = strline.split( "\t" )
        if iline == 1:
            print( strline )
        elif iline == 2:
            iconditions = int( astrline[1] )
        elif iline == 3:
            strheaders = strline
        else:
            if len( astrline ) < ( iconditions + 2 ):
                astrline.extend( [0] * ( iconditions + 2 - len( astrline ) ) )
            if astrline[0] in hashclusters:
                aiclusters = hashclusters[astrline[0]]
                for icluster in aiclusters:
                    while len( aaastrclusters ) <= icluster:
                        aaastrclusters.append( [] )
                    aaastrclusters[icluster].append( astrline )

    # Average the rows in each cluster and emit one copy per human gene.
    igenes = 0
    astrvalues = [None] * len( aaastrclusters )
    aastrgenes = [None] * len( aaastrclusters )
    for icluster in range( len( astrvalues ) ):
        aastrcluster = aaastrclusters[icluster]
        if len( aastrcluster ) == 0:
            continue
        advalues = [0] * ( len( aastrcluster[0] ) - 2 )
        hashgenes = {}
        for astrcur in aastrcluster:
            for strgene in ahashclusters[icluster]:
                hashgenes[strgene] = True
            adcur = [float( strcur ) for strcur in astrcur[2:]]
            for i in range( len( adcur ) ):
                advalues[i] += adcur[i]
        aastrgenes[icluster] = hashgenes.keys( )
        igenes += len( aastrgenes[icluster] )
        astrvalues[icluster] = "\t".join( [aastrcluster[0][1]] +
            [str( d / len( aastrcluster ) ) for d in advalues] )

    print( "\t".join( (str( i ) for i in (igenes, iconditions)) ) )
    print( strheaders )
    for icluster in range( len( astrvalues ) ):
        if not astrvalues[icluster]:
            continue
        for strgene in aastrgenes[icluster]:
            print( strgene + "\t" + astrvalues[icluster] )

10. Whew, that's a mess! Save that file as "inparanoid_gct.py" and execute the following command to produce the "dilution_rate_01_human.gct" file:

    python inparanoid_gct.py InParanoid.H.sapiens-S.cerevisiae.orthoXML mart_export.txt < dilution_rate_00_raw.gct > dilution_rate_01_human.gct

11. Finally, create a new file named "dilution_rate_01_human.cls" in a text editor of choice containing the following three lines. All of the whitespace is tabs, and there are 36 columns corresponding to the 36 conditions in our GCT file:

    36 2 1
    # Glucose Other
    0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

12. Now let's upload this mess to GenePattern! Navigate to the web site at http://genepattern.broadinstitute.org; if you don't already have an account, click "Click to register" to create one and sign in.

13. Once you've signed in, select "GSEA" from the left-hand menu.
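Typing 36 tab-separated labels by hand (step 11) is easy to get wrong, so here is a minimal sketch that builds the same three-line class file in Python. The function name make_cls is made up for illustration; the layout simply follows the three lines shown in step 11 (counts, class names after a "#", then one label per condition), with tabs as the separator.

```python
def make_cls(labels, class_names):
    """Build the text of a three-line CLS file.

    Line 1: number of samples, number of classes, and the constant 1.
    Line 2: "#" followed by the class names.
    Line 3: one integer label per sample, in GCT column order.
    """
    first = "\t".join(str(x) for x in (len(labels), len(class_names), 1))
    second = "\t".join(["#"] + list(class_names))
    third = "\t".join(str(x) for x in labels)
    return "\n".join([first, second, third]) + "\n"


# The first six conditions are glucose-limited (class 0); the rest are "Other".
labels = [0] * 6 + [1] * 30
cls_text = make_cls(labels, ["Glucose", "Other"])
```

Writing cls_text to "dilution_rate_01_human.cls" should give a file equivalent to the one described in step 11.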

14. In the following excessively complicated form, change the following fields:

    a. For "expression dataset", click "Browse" and select "dilution_rate_01_human.gct".
    b. For "gene sets database", select "c2.all.v2.5.symbols.gmt [Curated]".
    c. For "phenotype labels", click "Browse" and select "dilution_rate_01_human.cls".
    d. For "collapse dataset", select "false".
    e. For "permutation type", select "tag".
    f. For "min gene set size", enter "5".

15. Ok, now finally scroll down and click "Run".

16. After waiting a little while, you should see the following screen. Click on "dilution_rate_01_human.zip" to download and save it.

17. Expand this archive and open "index.html" in your favorite web browser. This will provide the following overview of gene sets that are enriched for up- or down-regulation in the glucose-limited conditions of our heavily manipulated dataset.

18. Click on the "Snapshot" of enrichment results for the first phenotype, "Glucose", to see upregulated gene sets.

19. Let's click on the sixth gene set (the third one on the second row, rightmost, highlighted in red in the image above). This is a small gene set that's highly upregulated specifically in the glucose-limited conditions. The first screen GSEA provides is a results summary showing the name of the gene set ("HSA00020_CITRATE_CYCLE", the KEGG TCA cycle pathway), its raw and normalized enrichment scores (how strongly and how specifically its members were upregulated in our phenotype of interest), and the nominal, FDR-corrected, and FWER-corrected p-values of these scores (all zero in our case; that's significant!). Black bars in the graphical overview show where the set's genes fall in terms of the entire genome's enrichment in our phenotype.
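To get a feel for what an enrichment score measures, here is a deliberately simplified sketch. It is not GSEA's implementation: GSEA weights each step by how strongly the gene correlates with the phenotype, whereas this unweighted version (the function name enrichment_score is made up) steps up by a fixed amount for each set member and down by a fixed amount for every other gene while walking the ranked list, and reports the largest deviation from zero.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified illustration).

    ranked_genes: gene names ordered from most upregulated to most
    downregulated in the phenotype of interest.
    Returns the running-sum value with the largest absolute size.
    """
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hits   # step up at each set member
        else:
            running -= 1.0 / n_miss   # step down at every other gene
        if abs(running) > abs(best):
            best = running
    return best
```

A set whose members all sit at the top of the ranking scores close to +1, one whose members all sit at the bottom scores close to -1, and a set scattered evenly through the list stays near zero. The black bars in GSEA's graphical overview mark the positions where the "step up" happens.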

20. Scrolling down, the next results displayed are the individual genes in this set. In addition to the identifiers and metadata for each gene, its individual rank, score, and "core enrichment" are displayed, where core genes were individually enriched enough to seed the cluster's score. Note that in this case, all but one gene at the very bottom were active enough in the glucose phenotype to be core.

21. Scrolling some more, another graphical overview is provided of the gene set members' normalized expression values within our dataset. Yup, this looks upregulated in glucose, all right!

22. The last figure isn't terribly relevant in this case; it's a histogram of the null distribution of Enrichment Scores for a randomized gene set with the same properties as the current real one. Note that the x-axis doesn't even go beyond 0.5, and the real enrichment score was greater than 0.9. You can safely say that the TCA cycle pathway is highly upregulated in our glucose limitation conditions!

23. Returning to the results overview, let's look at the textual (rather than graphical) list of gene sets that were downregulated during glucose limitation. Click on the "enrichment results in html" link for the second phenotype, "Other".

24. Wow, a lot of these are really weird. Cholera? Flagellar assembly? Turns out a lot of these are due to large orthologous families in human that correspond to single yeast genes. Our Python script above will duplicate one yeast gene's expression out to the several members of its orthologous cluster; if that gene's downregulated during glucose limitation (as some of the vacuolar ATPases orthologous to secretion, photosynthesis, and flagellar locomotion proteins are), it'll look like the entire pathway's downregulated. Fret not! We'll ignore them and look at the sixth gene set again, DNA replication as annotated in Reactome. Click on its "Details..." link as highlighted below (clicking on the gene set name itself will take you to MSigDB's description of the set).

25. Well, this isn't quite as downregulated as the TCA cycle was upregulated, but it's still quite significant.

26. Scroll down a bit to see if this gene set passes the smell test. The heatmap doesn't contain a lot of repeated rows, which means this may be real and not due simply to duplicated orthologous genes. Looks like our yeast might be locking itself out of the cell cycle (and thus not replicating its DNA) more during glucose limitation than for other nutrients. The fact that the downregulation is more extreme during the lowest growth rates and relaxed in the higher ones (e.g. "G0.25" and "G0.3") is a good sign. GSEA helps us find higher-level biological descriptions in terms of pathway up- and down-regulation for what might be happening in particular expression condition phenotypes. Nifty!
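The "repeated rows" smell test can also be done without squinting at a heatmap. The sketch below is an illustration under stated assumptions, not part of the original exercise: the function name distinct_profile_fraction is made up, and rows is assumed to be a dictionary from gene symbol to its list of expression values, for example read from "dilution_rate_01_human.gct". It reports what fraction of a gene set's members have a distinct expression profile.

```python
def distinct_profile_fraction(rows, gene_set):
    """Fraction of a gene set's members with a unique expression profile.

    rows: dict mapping gene symbol -> sequence of expression values.
    Genes in gene_set that are absent from rows are ignored.
    Returns 0.0 if none of the set's genes are present.
    """
    profiles = [tuple(rows[gene]) for gene in gene_set if gene in rows]
    if not profiles:
        return 0.0
    return float(len(set(profiles))) / len(profiles)
```

A value near 1.0 means nearly every member has its own profile, as in the DNA replication set above. A value near zero means most rows are copies of a few yeast genes spread across a large human orthologous family, which is the pattern behind the suspicious hits in step 24.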