Data Integration: An Example Using GenePattern

Curtis Huttenhower, NESS 2010

In this short demonstration, we will use the GenePattern server {Reich, 2006} on a set of yeast data mapped to human orthologs to detect enriched gene sets using GSEA {Subramanian, 2005}. We will also obtain inter-species orthology mappings from InParanoid {Berglund, 2008} and gene identifier mappings from BioMart {Haider, 2009}.

1. First, obtain a set of sample yeast data from Brauer et al. 2008, "Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast," by downloading it from the following URL:

    http://growthrate.princeton.edu/data/dilution_rate_00_raw.cdt

This expression dataset was collected from yeast grown at six different constant growth rates in chemostats. For each growth rate, the yeast was limited on one of six nutrients: glucose (carbon), nitrate (nitrogen), phosphate (phosphorus), sulfate (sulfur), leucine, or uracil. The 36 resulting conditions are organized in six blocks, ranging from e.g. G0.05 (the slowest glucose-limited growth rate) to G0.3 (the fastest glucose-limited growth rate). In this exercise, we'll consider mainly the first block of six conditions, yeast limited for glucose (and thus with highly perturbed carbon metabolism).

2. Convert this CDT file (a format used by the Stanford Microarray Database, http://smd.stanford.edu/help/formats.shtml#cdt) to a GCT file as used by GenePattern (http://www.broadinstitute.org/cancer/software/genepattern/tutorial/gp_fileformats.html#_creating_input_files_cdt). Open the file in Excel and delete the "GID", "UID", and "GWEIGHT" columns. Add two rows to the top of the file; set the first cell of the first row to "#1.2", the first cell of the second row to "5537" (the number of gene rows), and the second cell of the second row to "36" (the number of condition columns). Save this file as "dilution_rate_00_raw.gct", making sure that Excel doesn't helpfully append an extra ".txt" to the filename!
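The manual Excel edit in step 2 can also be scripted. The following is a minimal sketch, not part of the original exercise: the function name cdt_to_gct and the toy column layout (two identifier columns remaining after the drop) are assumptions made for illustration, so check the real file's header before relying on it.

```python
import csv
import io

def cdt_to_gct(cdt_text, drop=("GID", "UID", "GWEIGHT")):
    """Drop the named CDT columns and prepend the two GCT header rows."""
    reader = csv.reader(io.StringIO(cdt_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    header, body = rows[0], rows[1:]
    keep = [i for i, name in enumerate(header) if name not in drop]
    # GCT: version line, then "<gene rows>\t<condition columns>", then the table.
    # The first two kept columns are assumed to be the name and description.
    out = ["#1.2",
           "%d\t%d" % (len(body), len(keep) - 2),
           "\t".join(header[i] for i in keep)]
    for row in body:
        # Pad short rows so every kept column has a (possibly empty) cell.
        row = row + [""] * (len(header) - len(row))
        out.append("\t".join(row[i] for i in keep))
    return "\n".join(out) + "\n"

# Hypothetical toy CDT; the real file's columns may differ.
toy_cdt = (
    "GID\tUID\tORF\tNAME\tGWEIGHT\tG0.05\tG0.1\n"
    "GENE0X\t1\tYAL001C\tTFC3\t1\t0.52\t-0.13\n"
    "GENE1X\t2\tYAL002W\tVPS8\t1\t-0.08\t0.27\n"
)
toy_gct = cdt_to_gct(toy_cdt)
print(toy_gct)
```

Counting the rows and columns in code avoids typing "5537" and "36" by hand, which matters if the downloaded file ever changes.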

3. Navigate to the InParanoid database of cross-species orthologous proteins at http://inparanoid.sbc.su.se. Click "Download", then navigate to "current" and "orthoXML". Find the "InParanoid.H.sapiens-S.cerevisiae.orthoXML" file and right-click on it to download and save it.

4. Unfortunately, InParanoid stores human proteins as Ensembl identifiers, while GenePattern expects HGNC symbols as input. Let's download a mapping from BioMart, starting by navigating to http://www.biomart.org.

5. Click on "MARTVIEW", select "ENSEMBL GENES 57 (SANGER UK)" from the first menu, "Homo sapiens genes (GRCh37)" from the second, and you should see the following:

6. Click "Attributes" on the left and click the "+" to expand the "GENE" section. Uncheck "Ensembl Gene ID" and "Ensembl Transcript ID" and select "Ensembl Protein ID" instead.

7. Click the "+" to expand the "EXTERNAL" section and scroll down a bit. Check "HGNC symbol," which is under "External References."

8. Scroll back to the top and click "Results." Check the "Unique results only" box and, finally, click the "Go" button. Save the resulting "mart_export.txt" file when it asks you to.
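Before moving on, it can help to confirm that the "mart_export.txt" file from step 8 really is a two-column, tab-delimited table. Here is a minimal sketch of loading it into a dictionary; the function name load_mart_map, the skip_header option, and the identifiers in the toy data are all made up for illustration.

```python
import io

def load_mart_map(handle, skip_header=True):
    """Map each Ensembl protein ID to the list of HGNC symbols exported with it."""
    mapping = {}
    for index, line in enumerate(handle):
        if skip_header and index == 0:
            continue
        fields = line.rstrip("\r\n").split("\t")
        # Skip proteins that have no HGNC symbol (empty second column).
        if len(fields) != 2 or not fields[0] or not fields[1]:
            continue
        symbols = mapping.setdefault(fields[0], [])
        if fields[1] not in symbols:
            symbols.append(fields[1])
    return mapping

# Hypothetical export contents; real IDs and symbols will differ.
toy_export = io.StringIO(
    "Ensembl Protein ID\tHGNC symbol\n"
    "ENSP00000000001\tGENEA\n"
    "ENSP00000000002\t\n"
    "ENSP00000000003\tGENEB\n"
)
toy_map = load_mart_map(toy_export)
print(toy_map)
```

For the real file, pass open("mart_export.txt") instead of the in-memory toy table.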

9. Now, we're going to use the following Python script to simultaneously map the yeast identifiers in our GCT file to human Ensembl proteins (using InParanoid) and from there to HGNC symbols (using the BioMart file). This will result in a new GCT file containing human gene identifiers that we can feed to GenePattern.

    #!/usr/bin/env python

    import re
    import sys

    if len( sys.argv ) < 2:
        raise Exception( "Usage: inparanoid_gct.py <inparanoid.orthoxml> [mart_export.txt] < <data.gct>" )
    strinparanoid = sys.argv[1]
    # The BioMart mapping is optional, as the usage message indicates
    strmap = sys.argv[2] if ( len( sys.argv ) > 2 ) else None

    # Ensembl protein ID -> list of HGNC symbols
    hashmap = {}
    if strmap:
        for strline in open( strmap ):
            astrline = strline.strip( ).split( "\t" )
            if len( astrline ) != 2:
                continue
            if astrline[0] in hashmap:
                hashmap[astrline[0]].append( astrline[1] )
            else:
                hashmap[astrline[0]] = [astrline[1]]

    astrgenes = []      # orthoXML gene index -> protein ID
    hashclusters = {}   # gene ID -> list of ortholog cluster indices
    hashoutput = {}     # protein IDs of the first species (the output species)
    ahashclusters = []  # cluster index -> set of output gene IDs
    pregene = re.compile( r'gene id="(\d+)".*protid="([^"]+)"' )
    preend = re.compile( r'/genes' )
    precluster = re.compile( r'cluster id="(\d+)"' )
    premember = re.compile( r'generef id="(\d+)"' )
    icluster = hashcluster = 0
    fend = False
    for strline in open( strinparanoid ):
        pmatch = pregene.search( strline )
        if pmatch:
            igene = int( pmatch.group( 1 ) )
            if len( astrgenes ) <= igene:
                astrgenes.extend( [None] * ( igene - len( astrgenes ) + 1 ) )
            astrgenes[igene] = pmatch.group( 2 )
            # Genes listed before the first </genes> belong to the output species
            if not fend:
                hashoutput[pmatch.group( 2 )] = True
            continue
        pmatch = preend.search( strline )
        if pmatch:
            fend = True
            continue
        pmatch = precluster.search( strline )
        if pmatch:
            icluster = int( pmatch.group( 1 ) )
            while len( ahashclusters ) <= icluster:
                ahashclusters.append( {} )
            hashcluster = ahashclusters[icluster]
            continue
        pmatch = premember.search( strline )
        if pmatch:
            strgene = astrgenes[int( pmatch.group( 1 ) )]
            foutput = strgene in hashoutput
            # Translate through the BioMart mapping when one is available
            astrcur = hashmap[strgene] if ( strgene in hashmap ) else [strgene]
            for strgene in astrcur:
                if foutput:
                    hashcluster[strgene] = True
                if strgene in hashclusters:
                    hashclusters[strgene].append( icluster )
                else:
                    hashclusters[strgene] = [icluster]

    # Read the input GCT from standard input and group its rows by cluster
    iline = iconditions = strheaders = 0
    aaastrclusters = []
    for strline in sys.stdin:
        iline += 1
        strline = strline.strip( )
        astrline = strline.split( "\t" )
        if iline == 1:
            print( strline )
        elif iline == 2:
            iconditions = int( astrline[1] )
        elif iline == 3:
            strheaders = strline
        else:
            # Pad rows that are missing trailing values with zeros
            if len( astrline ) < ( iconditions + 2 ):
                astrline.extend( [0] * ( iconditions + 2 - len( astrline ) ) )
            if astrline[0] in hashclusters:
                aiclusters = hashclusters[astrline[0]]
                for icluster in aiclusters:
                    while len( aaastrclusters ) <= icluster:
                        aaastrclusters.append( [] )
                    aaastrclusters[icluster].append( astrline )

    # Average the input rows in each cluster and assign the result to every
    # output gene in that cluster
    igenes = 0
    astrvalues = [None] * len( aaastrclusters )
    aastrgenes = [None] * len( aaastrclusters )
    for icluster in range( len( astrvalues ) ):
        aastrcluster = aaastrclusters[icluster]
        if len( aastrcluster ) == 0:
            continue
        advalues = [0] * ( len( aastrcluster[0] ) - 2 )
        hashgenes = {}
        for astrcur in aastrcluster:
            for strgene in ahashclusters[icluster]:
                hashgenes[strgene] = True
            adcur = [float( strcur ) for strcur in astrcur[2:]]
            for i in range( len( adcur ) ):
                advalues[i] += adcur[i]
        aastrgenes[icluster] = list( hashgenes )
        igenes += len( aastrgenes[icluster] )
        astrvalues[icluster] = "\t".join( [aastrcluster[0][1]] +
            [str( d / len( aastrcluster ) ) for d in advalues] )

    # Write the remaining GCT header lines and the mapped rows
    print( "\t".join( (str( i ) for i in (igenes, iconditions)) ) )
    print( strheaders )
    for icluster in range( len( astrvalues ) ):
        if not astrvalues[icluster]:
            continue
        for strgene in aastrgenes[icluster]:
            print( strgene + "\t" + astrvalues[icluster] )

10. Whew, that's a mess! Save that file as "inparanoid_gct.py" and execute the following command to produce the "dilution_rate_01_human.gct" file:

    python inparanoid_gct.py InParanoid.H.sapiens-S.cerevisiae.orthoXML mart_export.txt < dilution_rate_00_raw.gct > dilution_rate_01_human.gct

11. Finally, create a new file named "dilution_rate_01_human.cls" in a text editor of choice containing the following three lines. All of the whitespace is tabs, and there are 36 columns corresponding to the 36 conditions in our GCT file:

    36 2 1
    # Glucose Other
    0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

12. Now let's upload this mess to GenePattern! Navigate to the web site at http://genepattern.broadinstitute.org; if you don't already have an account, click "Click to register" to create one and sign in.

13. Once you've signed in, select "GSEA" from the left-hand menu.
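The class file from step 11 can also be generated rather than typed, which avoids miscounting the 36 columns. This is a minimal sketch; the function name make_cls is an assumption, and the three-line layout simply follows the example in step 11.

```python
def make_cls(labels, class_names):
    """Build the three tab-delimited lines of a class file."""
    if sorted(set(labels)) != list(range(len(class_names))):
        raise ValueError("labels must use every index 0..len(class_names)-1")
    lines = [
        # Line 1: number of samples, number of classes, and the constant 1.
        "\t".join(str(v) for v in (len(labels), len(class_names), 1)),
        # Line 2: "#" followed by the class names.
        "\t".join(["#"] + list(class_names)),
        # Line 3: one class index per condition, in GCT column order.
        "\t".join(str(label) for label in labels),
    ]
    return "\n".join(lines) + "\n"

# Six glucose-limited conditions first, then the 30 remaining conditions.
cls_text = make_cls([0] * 6 + [1] * 30, ["Glucose", "Other"])
print(cls_text)
```

Writing cls_text to "dilution_rate_01_human.cls" should give the same file as the manual step.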

14. In the following excessively complicated form, change the following fields:

    a. For "expression dataset", click "Browse" and select "dilution_rate_01_human.gct".
    b. For "gene sets database", select "c2.all.v2.5.symbols.gmt [Curated]".
    c. For "phenotype labels", click "Browse" and select "dilution_rate_01_human.cls".
    d. For "collapse dataset", select "false".
    e. For "permutation type", select "tag".
    f. For "min gene set size", enter "5".

15. Ok, now finally scroll down and click "Run".

16. After waiting a little while, you should see the following screen. Click on "dilution_rate_01_human.zip" to download and save it.

17. Expand this archive and open "index.html" in your favorite web browser. This will provide the following overview of gene sets that are enriched for up- or down-regulation in the glucose-limited conditions of our heavily manipulated dataset.

18. Click on the "Snapshot" of enrichment results for the first phenotype, "Glucose", to see upregulated gene sets.

19. Let's click on the sixth gene set (the third one on the second row, rightmost, highlighted in red in the image above). This is a small gene set that's highly upregulated specifically in the glucose-limited conditions. The first screen GSEA provides is a results summary showing the name of the gene set ("HSA00020_CITRATE_CYCLE", the KEGG TCA cycle pathway), its raw and normalized enrichment scores (how strongly and how specifically its members were upregulated in our phenotype of interest), and the nominal, FDR-corrected, and FWER-corrected p-values of these scores (all zero in our case; that's significant!). Black bars in the graphical overview show where the set's genes fall in the ranking of the entire genome's enrichment in our phenotype.
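To make the enrichment score in step 19 less of a black box, here is a minimal sketch of a weighted running-sum statistic of the kind GSEA reports: walk down the ranked gene list, step up at each member of the set (the black bars) and down at each non-member, and keep the largest deviation from zero. It is a simplified illustration, not GenePattern's implementation; the function name, the weight parameter p, and the toy genes are assumptions.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation of a weighted running sum over a ranked gene list.

    ranked_genes and scores must already be sorted by decreasing score.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Total weight of the set's members, used to normalize the upward steps.
    norm = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    if norm == 0:
        raise ValueError("all gene set members have a score of zero")
    running = 0.0
    best = 0.0
    for s, hit in zip(scores, in_set):
        if hit:
            running += abs(s) ** p / norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking: genes "a" and "b" sit at the top of the list.
toy_genes = ["a", "b", "c", "d"]
toy_scores = [4.0, 3.0, 2.0, 1.0]
es_top = enrichment_score(toy_genes, toy_scores, {"a", "b"})
es_bottom = enrichment_score(toy_genes, toy_scores, {"c", "d"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking scores close to +1 and one concentrated at the bottom close to -1, which is why the TCA cycle's score of over 0.9 is so striking.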

20. Scrolling down, the next results displayed are the individual genes in this set. In addition to the identifiers and metadata for each gene, its individual rank, score, and "core enrichment" are displayed, where core genes were individually enriched enough to seed the cluster's score. Note that in this case, all genes except one at the very bottom were active enough in the glucose phenotype to be core.

21. Scrolling some more, another graphical overview is provided of the gene set members' normalized expression values within our dataset. Yup, this looks upregulated in glucose, all right!

22. The last figure isn't terribly relevant in this case; it's a histogram of the null distribution of enrichment scores for a randomized gene set with the same properties as the current real one. Note that the x-axis doesn't even go beyond 0.5, and the real enrichment score was greater than 0.9. You can safely say that the TCA cycle pathway is highly upregulated in our glucose limitation conditions!

23. Returning to the results overview, let's look at the textual (rather than graphical) list of gene sets that were downregulated during glucose limitation. Click on the "enrichment results in html" link for the second phenotype, "Other".
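The null distribution described in step 22 can be approximated by scoring many random gene sets of the same size and asking how often they match the real score. The sketch below is a simplified illustration under stated assumptions: it uses a compact running-sum score, a one-sided empirical p-value for upregulated sets only, and made-up scores, so it will not reproduce GSEA's numbers exactly.

```python
import random

def running_sum_es(scores, in_set):
    """Weighted running-sum score; scores must be sorted in decreasing order."""
    n_miss = in_set.count(False)
    norm = sum(abs(s) for s, hit in zip(scores, in_set) if hit)
    running = best = 0.0
    for s, hit in zip(scores, in_set):
        running += abs(s) / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def empirical_p(scores, set_indices, n_perm=1000, seed=0):
    """Return the observed score, the null scores, and a one-sided p-value."""
    rng = random.Random(seed)
    n = len(scores)

    def es_for(indices):
        chosen = set(indices)
        return running_sum_es(scores, [i in chosen for i in range(n)])

    observed = es_for(set_indices)
    # Null: random gene sets with the same number of members.
    null = [es_for(rng.sample(range(n), len(set_indices))) for _ in range(n_perm)]
    extreme = sum(1 for es in null if es >= observed)
    return observed, null, extreme / n_perm

# Toy ranking of 100 genes with the gene set occupying the top five ranks.
toy_scores = [float(s) for s in range(100, 0, -1)]
observed, null, p_value = empirical_p(toy_scores, [0, 1, 2, 3, 4], n_perm=200)
print(observed, max(null), p_value)
```

For a downregulated set the comparison would flip to count null scores at or below the observed one.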

24. Wow, a lot of these are really weird. Cholera? Flagellar assembly? Turns out a lot of these are due to large orthologous families in human that correspond to single yeast genes. Our Python script above will duplicate one yeast gene's expression out to the several members of its orthologous cluster; if that gene's downregulated during glucose limitation (as some of the vacuolar ATPases orthologous to secretion, photosynthesis, and flagellar locomotion proteins are), it'll look like the entire pathway's downregulated. Fret not! We'll ignore them and look at the sixth gene set again, DNA replication as annotated in Reactome. Click on its "Details..." link as highlighted below (clicking on the gene set name itself will take you to MSigDB's description of the set).

25. Well, this isn't quite as downregulated as the TCA cycle was upregulated, but it's still quite significant.
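The duplication effect described in step 24 is easy to check for directly: human genes that inherited the same yeast measurement will have identical rows in the mapped GCT file. Here is a minimal sketch that groups genes by expression profile; the function name duplicate_profiles and the toy rows are made up for illustration.

```python
from collections import defaultdict

def duplicate_profiles(rows):
    """Group gene names that share exactly the same expression values."""
    groups = defaultdict(list)
    for name, values in rows:
        groups[tuple(values)].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical mapped rows: GENEA and GENEB came from the same yeast gene.
toy_rows = [
    ("GENEA", [0.5, -0.2, 0.1]),
    ("GENEB", [0.5, -0.2, 0.1]),
    ("GENEC", [1.0, 0.3, -0.4]),
]
dups = duplicate_profiles(toy_rows)
print(dups)
```

A gene set whose members mostly fall into one such group deserves more suspicion than one with many distinct profiles.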

26. Scroll down a bit to see if this gene set passes the smell test. The heatmap doesn't contain a lot of repeated rows, which means this may be real and not due simply to duplicated orthologous genes. Looks like our yeast might be locking itself out of the cell cycle (and thus not replicating its DNA) more during glucose limitation than for other nutrients. The fact that the downregulation is more extreme during the lowest growth rates and relaxed in the higher ones (e.g. "G0.25" and "G0.3") is a good sign. GSEA helps us find higher-level biological descriptions in terms of pathway up- and down-regulation for what might be happening in particular expression condition phenotypes. Nifty!