Quality assessment of TCGA Agilent gene expression data for ovary cancer


Nianxiang Zhang & Keith A. Baggerly
Dept of Bioinformatics and Computational Biology
MD Anderson Cancer Center
Oct 1, 2010

Contents

1 Executive Summary
  1.1 Introduction
  1.2 Methods
      Data
      Statistical methods
  1.3 Results and Conclusions
2 Data Analysis
  2.1 Data consistency across versions
  2.2 Level 1 to Level 2 data
  2.3 Level 2 to Level 3 data
      Load level 2/3 data
      Sample Labeling consistency of Level 2 and Level 3 data
  2.4 Batch Effects in Level 2 and Level 3 data
3 Appendix
  3.1 File Location
  3.2 SessionInfo

List of Figures

1 Correlation of level1 and level 2 data
2 Probes in G4502A 07 2, G4502A 07 3 platform and level 2 data
3 Consistency of level 2 and level 3 data by correlation. Level 2 data are summarized to gene level by taking the mean of probes that belong to the same gene. Then pairwise correlation between summarized level 2 and level 3 data are calculated. Red color represents correlation coefficient > 0.9.
4 Batch effects in Level 2 Agilent gene expression data. An average across all probes for each sample is calculated using level 2 data. The average gene expression level for samples is shown by batch.

5 Batch effects in Level 3 Agilent gene expression data. An average across all genes for each sample is calculated using level 3 data. The average gene expression level for samples is shown by batch.
6 The top probes with batch effects in level 2 data
7 The top probes with batch effects in level 3 data

AgiExpQC.Rnw

1 Executive Summary

1.1 Introduction

We are interested in assessing the quality of TCGA data, including the Agilent gene expression data. We would like to examine the consistency of the data across levels. We also want to assess batch effects in the Level 2 and Level 3 data.

1.2 Methods

Data

We use the MD Anderson local copy of the TCGA Agilent gene expression data at //gcgserv.mdanderson.org/tcga-PUBLIC/tcga/tumor/ov/cgcc/unc.edu for QC assessment of the Level 1 data. We use the consolidated level 2 and level 3 data from the TCGA data portal, located at mdadqsfs02/workspace/nzhangtcgadata/ovarian/expression-Genes.

Statistical methods

We use the limma package in R to assess the Level 1 data. To summarize Level 2 data to the gene level, we take the mean expression level of the probes located on the same gene. We calculate the Pearson correlation coefficient to assess the consistency of the Level 2 and Level 3 data.

1.3 Results and Conclusions

By checking random samples, we found that the data are consistent across versions for level 1 through level 3. We found that 67 probes in the level 2 data do not exist in the level 1 data. We do not know how the level 2 data were obtained. They are highly correlated with the LogRatio data processed by the Feature Extraction (FE) software, but they are not taken directly from the FE output. We did not identify any mislabeling problems. We do see batch effects in the data: the average expression levels across all genes differ between samples from different batches. The situation is similar for the level 2 and level 3 data.

2 Data Analysis

2.1 Data consistency across versions

In order to perform QC assessment of the data, we need to figure out the data storage structure and retrieve the proper files.
We define the directories for the two platforms.

> datapath2 <- "//gcgserv.mdanderson.org/tcga-public/tcga/tcga-stage/anonsite/tcga/tumor/ov/cgcc/unc.edu
> datapath3 <- "//gcgserv.mdanderson.org/tcga-public/tcga/tcga-stage/anonsite/tcga/tumor/ov/cgcc/unc.edu

We write 2 functions to get the directory names and the file names.

> getdir <- function(dp, ...) {
+     Dirs <- list.files(dp, ...)
+     Dirs <- Dirs[-grep("gz", Dirs)]
+     return(Dirs)
+ }
> extract.datafilename <- function(dpath, level = 1, ...) {
+     allfile <- list.files(dpath, ...)
+     data.file <- allfile[grep("us", allfile)]
+     data.file <- sort(data.file)
+     name23 <- grep("level", data.file)
+     level1 <- data.file[0 - name23]
+     level2 <- data.file[grep("level2", data.file)]
+     level3 <- data.file[grep("level3", data.file)]
+     switch(level, `1` = return(level1), `2` = return(level2),
+         `3` = return(level3))
+ }

We check the Level 1 data first. We examine the samples from different versions to make sure the data file names are the same across versions.

> temp.ver <- getdir(datapath2, full.names = T)
> identical(extract.datafilename(temp.ver[1]), extract.datafilename(temp.ver[2]))
> temp1 <- extract.datafilename(temp.ver[1], full.names = T)
> temp2 <- extract.datafilename(temp.ver[2], full.names = T)

We choose 3 random samples from the different versions and load the level 1 data, which come from the Feature Extraction Software. They are identical.

> library(limma)
> temp.ind <- sample(1:length(temp1), 3)
> chosensamp <- extract.datafilename(temp.ver[1])[temp.ind]
> RG1 <- read.maimages(files = temp1[temp.ind], source = "agilent",
+     other.columns = list("ControlType", "LogRatio", "gProcessedSignal",
+         "rProcessedSignal"))
> RG2 <- read.maimages(files = temp2[temp.ind], source = "agilent",
+     other.columns = list("ControlType", "LogRatio", "gProcessedSignal",
+         "rProcessedSignal"))
> colnames(RG1) <- colnames(RG2) <- NULL
> identical(RG1$R, RG2$R)
> identical(RG1$other, RG2$other)

We also check whether the Level 2 and Level 3 data are consistent across versions. The level 2 and 3 data that we checked for the two versions are identical.
> temp.ver <- getdir(datapath2, full.names = T)
> identical(extract.datafilename(temp.ver[1]), extract.datafilename(temp.ver[2]))
> for (level in 2:3) {
+     temp1 <- extract.datafilename(temp.ver[1], level = level,
+         full.names = T)
+     temp2 <- extract.datafilename(temp.ver[2], level = level,
+         full.names = T)
+     date1 <- date()
+     for (ii in 1:length(temp1)) {
+         templ1 <- read.table(file = temp1[ii], skip = 2, fill = T)
+         templ2 <- read.table(file = temp2[ii], skip = 2, fill = T)
+         colnames(templ1) <- colnames(templ2) <- NULL
+         if (!identical(templ1, templ2))
+             cat(paste(temp1[ii], "is different from \n", temp2[ii],
+                 "\n"))
+         else cat("ok \n")
+     }
+ }

2.2 Level 1 to Level 2 data

We do not know how the level 2 data were obtained from the Level 1 data. We compute the per-probe mean of the processed LogRatio data in level 1 and check its correlation with the level 2 data. The correlation coefficient of the LogRatio mean and level 2 data is (Figure 1).

> NC1 <- RG1$genes[RG1$other$ControlType[, 1] == 0, ]
> length(unique(NC1$ProbeName))
> table(table(NC1$ProbeName))
> temp.ver <- getdir(datapath2, full.names = T)
> temp.level1.file <- extract.datafilename(temp.ver[1], level = 1,
+     full.names = T)
> temp.rg <- read.maimages(files = temp.level1.file[1], source = "agilent",
+     other.columns = list("ControlType", "LogRatio", "gProcessedSignal",
+         "rProcessedSignal"))
> temp.level2.file <- extract.datafilename(temp.ver[1], level = 2,
+     full.names = T)
> chosensamp <- extract.datafilename(temp.ver[1])[1]
> temp <- substr(chosensamp, 1, 30)
> chosen.level2.file <- unlist(lapply(temp, function(x) temp.level2.file[grep(x,
+     temp.level2.file)]))
> temp <- read.table(chosen.level2.file[1], header = F, skip = 2,
+     fill = T)
> matchlevel1data <- temp.rg[match(temp[, 1], temp.rg$genes$ProbeName),
+     ]
> replevel1data <- temp.rg[duplicated(temp.rg$genes$ProbeName),
+     ]
> replevel1data <- temp.rg[match(temp[, 1], replevel1data$genes$ProbeName),
+     ]
> identical(matchlevel1data$other$LogRatio, temp[, 2])
> temp.lrmean <- tapply(temp.rg$other$LogRatio[, 1], INDEX = temp.rg$genes$ProbeName,
+     mean, na.rm = T)
> temp.lrmean.match <- temp.lrmean[match(as.vector(temp[, 1]),
+     names(table(temp.rg$genes$ProbeName)))]
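The level 1 to level 2 comparison can be illustrated on toy data. The sketch below is a minimal example, assuming a hypothetical level 1 table with replicated probes (`level1`) and a hypothetical level 2 table with one value per probe (`level2`); neither the names nor the values come from the TCGA files. It averages LogRatio over replicate spots with `tapply`, aligns the result to the level 2 probe order with `match`, and computes the Pearson correlation.

```r
# Toy illustration only: `level1` and `level2` are hypothetical stand-ins
# for the Feature Extraction output and the level 2 file.
level1 <- data.frame(
  ProbeName = c("p1", "p1", "p2", "p3", "p3", "p4"),
  LogRatio  = c(0.10, 0.30, -0.50, 1.00, 1.20, NA),
  stringsAsFactors = FALSE
)
level2 <- data.frame(
  ProbeName = c("p3", "p1", "p2", "p5"),   # p5 is absent from level 1
  value     = c(1.05, 0.22, -0.48, 0.70),
  stringsAsFactors = FALSE
)

# Mean LogRatio per probe across replicate spots.
lr.mean <- tapply(level1$LogRatio, INDEX = level1$ProbeName, mean, na.rm = TRUE)

# Align the per-probe means to the level 2 probe order; probes missing
# from level 1 become NA.
lr.mean.match <- as.numeric(lr.mean[match(level2$ProbeName, names(lr.mean))])

# Pearson correlation over the probes present in both tables.
r <- cor(lr.mean.match, level2$value, use = "pairwise.complete.obs")
```

Matching by probe name rather than by row position keeps the comparison valid even when the two files list probes in a different order, and the `NA` entries make probes that exist only in level 2 easy to count.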

Figure 1: Correlation of level1 and level 2 data. (Scatter plot; x-axis: Mean LogRatio Level1, y-axis: Level 2.)

> pdf("level1level2corr.pdf")
> plot(temp.lrmean.match, temp[, 2], xlab = "Mean LogRatio Level1",
+     ylab = "Level 2", pch = ".")
> invisible(dev.off())
> cor(temp.lrmean.match, temp[, 2], use = "pairwise.complete.obs")

We check the level 1 data for the other platform, AgilentG4502A_07_3.

> temp.ver <- getdir(datapath3, full.names = T)
> identical(extract.datafilename(temp.ver[2]), extract.datafilename(temp.ver[3]))
> temp3 <- extract.datafilename(temp.ver[2], full.names = T)
> temp4 <- extract.datafilename(temp.ver[3], full.names = T)

We choose a sample from the AgilentG4502A_07_3 platform and load the level 1 data, which come from the Feature Extraction Software. The level 1 data from the different versions are consistent.

> sampleid <- substr(extract.datafilename(temp.ver[2])[1], 1, 30)
> RG3 <- read.maimages(files = temp3[grep(sampleid, temp3)], source = "agilent",
+     other.columns = list("ControlType", "LogRatio", "gProcessedSignal",
+         "rProcessedSignal"))
> RG4 <- read.maimages(files = temp4[grep(sampleid, temp4)], source = "agilent",
+     other.columns = list("ControlType", "LogRatio", "gProcessedSignal",
+         "rProcessedSignal"))
> colnames(RG3) <- colnames(RG4) <- NULL
> identical(RG3$R, RG4$R)
> identical(RG3$other, RG4$other)

We remove the control probes and count the remaining probes.

> NC3 <- RG3$genes[RG3$other$ControlType[, 1] == 0, ]
> length(unique(NC3$ProbeName))

The results show that the level 1 data are consistent across versions for both platforms. Since the two platforms have different probe sets, we compare the probes in the level 1 data of the two platforms with the probes in the Level 2 data.

> identical(Level2data2[, 1], Level2data3[, 1])

The level 2 data for the two platforms have the same set of probes. However, 67 probes in the level 2 data are not in the level 1 data (Figure 2).

> allprobe <- unlist(unique(c(NC1$ProbeName, NC3$ProbeName, as.vector(Level2data2[,
+     1]))))
> temp.venn <- matrix(0, length(allprobe), 3)
> colnames(temp.venn) <- c("G4502A_07_2", "G4502A_07_3", "Level2_2")
> temp.venn[, 1] <- allprobe %in% NC1$ProbeName
> temp.venn[, 2] <- allprobe %in% NC3$ProbeName
> temp.venn[, 3] <- allprobe %in% Level2data2[, 1]
> pdf("probevenn.pdf")
> vennDiagram(temp.venn, circle.col = c("red", "blue", "green"),
+     lwd = 3)
> dev.off()

2.3 Level 2 to Level 3 data

Load level 2/3 data

Now we use the consolidated level 2 and level 3 data we just downloaded for further analysis. We convert the Agilent level 2/3 data into matrix form.
> datadir <- c("../../../Expression-Genes/UNC AgilentG4502A_07_2",
+     "../../../Expression-Genes/UNC AgilentG4502A_07_3")
> if (exists("Level2data")) rm(Level2data)
> Agifile <- paste(datadir, "/Level_2/", c("unc.edu AgilentG4502A_07_2 log2_lowess_normalized.txt",
+     "unc.edu AgilentG4502A_07_3 log2_lowess_normalized.txt"),
+     sep = "")
> temp.data <- NULL

Figure 2: Probes in G4502A 07 2, G4502A 07 3 platform and level 2 data. (Venn diagram; sets: G4502A_07_2, G4502A_07_3, Level2_2.)
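The probe comparison summarized in Figure 2 amounts to set membership. As a minimal sketch with made-up probe identifiers (the vectors `probes.a`, `probes.b`, and `probes.l2` are hypothetical and stand in for the two level 1 probe sets and the level 2 probe list), the code below builds the same kind of 0/1 membership matrix and counts the level 2 probes that are absent from both level 1 probe sets, using base R only.

```r
# Hypothetical probe identifiers, for illustration only.
probes.a  <- c("A_1", "A_2", "A_3", "A_4")   # level 1, first platform
probes.b  <- c("A_2", "A_3", "A_4", "A_5")   # level 1, second platform
probes.l2 <- c("A_2", "A_3", "A_6", "A_7")   # level 2

# 0/1 membership matrix: one row per probe, one column per set.
all.probes <- unique(c(probes.a, probes.b, probes.l2))
membership <- cbind(
  platform_a = all.probes %in% probes.a,
  platform_b = all.probes %in% probes.b,
  level2     = all.probes %in% probes.l2
) + 0
rownames(membership) <- all.probes

# Level 2 probes found in neither level 1 probe set.
l2.only <- setdiff(probes.l2, union(probes.a, probes.b))

# Number of probes in each membership pattern (e.g. "001" = level 2 only).
pattern.counts <- table(apply(membership, 1, paste, collapse = ""))
```

The pattern counts are the numbers a Venn diagram would display in each region, so they give a plot-free way to read off how many probes are unique to the level 2 data.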

> for (j in 1:length(Agifile)) {
+     s.name <- read.delim(file = Agifile[j], sep = "\t", header = F,
+         nrow = 1, stringsAsFactors = F, row.names = 1)
+     temp.raw <- read.delim(file = Agifile[j], sep = "\t", header = F,
+         skip = 2, stringsAsFactors = F, row.names = 1)
+     colnames(temp.raw) <- t(s.name)
+     if (!exists("Level2data"))
+         Level2data <- temp.raw
+     else {
+         stopifnot(identical(rownames(Level2data), rownames(temp.raw)))
+         Level2data <- cbind(Level2data, temp.raw)
+     }
+ }
> rm(Agifile)
> rm(list = ls(pattern = "temp"))
> for (i in 1:ncol(Level2data)) Level2data[, i] <- as.numeric(Level2data[,
+     i])
> save(Level2data, file = file.path("rdataobjects", "AgilentOVLevel2Data.Rda"))
> Agifile <- paste(datadir, "/Level_3/", c("unc.edu AgilentG4502A_07_2 gene_expression_analysis_1.txt",
+     "unc.edu AgilentG4502A_07_3 gene_expression_analysis_1.txt"),
+     sep = "")
> if (exists("Level3data")) rm(Level3data)
> for (j in 1:length(Agifile)) {
+     temp.raw <- read.delim(file = Agifile[j], sep = "\t", header = T,
+         stringsAsFactors = F)
+     temp <- matrix(as.numeric(temp.raw[, 3]), ncol = length(table(temp.raw[,
+         1])), nrow = length(table(temp.raw[, 2])), dimnames = list(unique(temp.raw[,
+         2]), unique(temp.raw[, 1])))
+     temp.raw[is.na(as.numeric(temp.raw[, 3])), ]
+     if (!exists("Level3data"))
+         Level3data <- temp
+     else {
+         stopifnot(identical(rownames(Level3data), rownames(temp)))
+         Level3data <- cbind(Level3data, temp)
+     }
+ }
> rm(Agifile)
> rm(list = ls(pattern = "temp"))
> save(Level3data, file = file.path("rdataobjects", "AgilentOVLevel3Data.Rda"))

We make sure the Level 2 and Level 3 data cover the same set of samples.

> all(colnames(Level2data) %in% colnames(Level3data))
> all(colnames(Level3data) %in% colnames(Level2data))

We reorder the columns of the Level 3 data so that the level 2 and level 3 data have the same sample order.

> Level3data <- Level3data[, colnames(Level2data)]

We get the inclusion/exclusion sample list and retain only the included samples.
We also remove one cell line sample without batch assignment.

> source("~/project/weinsteintcga062509/tcgafunctions.r")
> level.si <- getsi(Level2data, batchpath = "../../Effect")
> require(gdata)
> inex <- read.xls(xls = "/workspace/nzhangtcgadata/ovarian/analysis/tcga_ovarianuseandexcludelist.xls",
+     sheet = 1)
> temp.sample.in <- as.vector(inex$sample.id[inex$include.exclude ==
+     "Include"])
> keep.ind <- !is.na(level.si$batch) & paste("tcga", level.si$siteid,
+     level.si$patientid, sep = "-") %in% temp.sample.in
> final.si <- level.si[keep.ind, ]
> Level2data <- Level2data[, keep.ind]
> Level3data <- Level3data[, keep.ind]
> save(list = c("Level3data", "Level2data", "final.si"), file = file.path("rdataobjects",
+     "AgilentOVData.Rda"))

Sample Labeling consistency of Level 2 and Level 3 data

We do not have the annotation file for the customized array, so we use hgug4112a instead, which covers some of the probes. The code below counts the probes that can be mapped and the Level 3 genes that appear in this annotation. We keep only the level 2 probes that can be mapped to genes present in the Level 3 data.

> require(hgug4112a.db)
> symbol <- unlist(mget(as.vector(rownames(Level2data)), env = hgug4112aSYMBOL,
+     ifnotfound = NA))
> sum(!is.na(symbol))
> sum(rownames(Level3data) %in% symbol)
> mappedgene <- intersect(rownames(Level3data), symbol)
> Level2map <- Level2data[symbol %in% mappedgene, order(final.si$batch)]
> symbolmap <- symbol[symbol %in% mappedgene]
> Level3map <- Level3data[rownames(Level3data) %in% mappedgene,
+     order(final.si$batch)]
> save(list = c("Level3map", "Level2map", "final.si", "symbolmap"),
+     file = file.path("rdataobjects", "AgilentOVDataMapped.Rda"))

Now we take the mean of the probes for each gene to summarize the level 2 data.

> Level2Sum <- apply(as.matrix(Level2map), 2, function(x) tapply(x,
+     INDEX = symbolmap, mean, na.rm = T))
> Level2Sum <- Level2Sum[rownames(Level3map), ]

Now we calculate the correlation between the 2 data sets.
We expect the summarized level 2 data to be highly correlated with the level 3 data of the same sample. We set a threshold of 0.9 to display the pairwise correlations (Figure 3). No mislabeling is found, since all the high correlations appear on the diagonal.

> level23cor <- matrix(0, ncol(Level3map), ncol(Level3map))
> for (i in 1:ncol(Level3map)) {
+     for (j in 1:ncol(Level3map)) {
+         level23cor[i, j] <- cor(Level3map[, i], Level2Sum[, j],
+             use = "pairwise.complete.obs")
+     }
+ }
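The mislabeling check can also be made explicit without a plot: for every level 3 sample, the best-correlated summarized level 2 sample should carry the same label. The following is a minimal sketch on simulated data, not the TCGA matrices; `probe.data`, `gene.of.probe`, and `gene.data` are hypothetical names, and one pair of labels is swapped on purpose to show what a mislabeling would look like.

```r
set.seed(1)
n.gene <- 50; n.samp <- 6
genes   <- paste0("g", seq_len(n.gene))
samples <- paste0("s", seq_len(n.samp))

# Simulated gene-level signal (plays the role of level 3).
gene.data <- matrix(rnorm(n.gene * n.samp), n.gene, n.samp,
                    dimnames = list(genes, samples))

# Simulated probe-level data (plays the role of level 2): two probes
# per gene, each equal to the gene value plus small noise.
gene.of.probe <- rep(genes, each = 2)
probe.data <- gene.data[gene.of.probe, ] +
  rnorm(2 * n.gene * n.samp, sd = 0.1)
rownames(probe.data) <- paste0("p", seq_along(gene.of.probe))

# Deliberately swap two sample labels in the probe-level data.
colnames(probe.data)[c(2, 5)] <- colnames(probe.data)[c(5, 2)]

# Summarize probes to genes by the mean, then align the gene order.
probe.sum <- apply(probe.data, 2, function(x)
  tapply(x, INDEX = gene.of.probe, mean, na.rm = TRUE))
probe.sum <- probe.sum[rownames(gene.data), ]

# Rows: gene-level samples; columns: summarized probe-level samples.
cor.mat <- cor(gene.data, probe.sum, use = "pairwise.complete.obs")

# Label of the best-matching probe-level sample for each gene-level sample.
best.match <- colnames(cor.mat)[apply(cor.mat, 1, which.max)]
suspect <- rownames(cor.mat)[best.match != rownames(cor.mat)]
```

In this toy setup `suspect` lists the two samples whose labels were swapped; on real data an empty `suspect` would correspond to all high correlations lying on the diagonal.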

> pdf("corrlevel2and3.pdf")
> heatmap((level23cor > 0.9) + 0, Colv = NA, Rowv = NA, xlab = "Level2 Average",
+     ylab = "Level 3", col = c("grey", "red"))
> dev.off()

2.4 Batch Effects in Level 2 and Level 3 data

We assess the batch effects in the level 2 and level 3 data. We calculate the mean across all probes (level 2) or genes (level 3) for each sample. The plots of the cross-gene mean are shown in Figures 4 and 5.

> genemeanlevel2 <- apply(as.matrix(Level2map), 2, mean, na.rm = T)
> genemeanlevel3 <- apply(as.matrix(Level3map), 2, mean, na.rm = T)
> pdf("level2.pdf")
> temp <- boxplot(genemeanlevel2 ~ final.si$batch, xlab = "",
+     main = "Agilent Level2 data", cex = 0.7)
> points(y = genemeanlevel3, x = jitter(rep(1:13, temp$n)), cex = 0.7)
> abline(v = 0: , col = "brown")
> dev.off()
> pdf("level3.pdf")
> temp <- boxplot(genemeanlevel3 ~ final.si$batch, xlab = "",
+     main = "Agilent Level3 data", cex = 0.7)
> points(y = genemeanlevel3, x = jitter(rep(1:13, temp$n)), cex = 0.7)
> abline(v = 0: , col = "brown")
> dev.off()

We pick some extreme genes to see how strong the batch effects can be. The top probes differentially expressed by batch are shown in Figures 6 and 7.

> library(ClassComparison)
> res <- MultiLinearModel(Y ~ batch, clindata = final.si[, ], arraydata = Level2map)
> mad2 <- apply(Level2map, 1, mad, na.rm = T)
> top6 <- mad2[order(res@p.values, -mad2)][1:6]
> pdf("level2batcheffecttop.pdf", height = 8, pointsize = 9)
> par(mfrow = c(3, 2))
> for (i in 1:6) {
+     genedata <- t(Level2map[names(top6)[i], ])
+     temp <- boxplot(genedata ~ final.si$batch, xlab = "",
+         main = names(top6)[i], cex = 0.7)
+     points(y = genedata, x = jitter(rep(1:13, temp$n)), cex = 0.7)
+     abline(v = 0: , col = "brown")
+ }
> dev.off()
> res3 <- MultiLinearModel(Y ~ batch, clindata = final.si[, ],
+     arraydata = Level3map)
> mad3 <- apply(Level3map, 1, mad, na.rm = T)
> top6.3 <- mad3[order(res3@p.values, -mad3)][1:6]
> pdf("level3batcheffecttop.pdf", height = 8, pointsize = 9)
> par(mfrow = c(3, 2))

Figure 3: Consistency of level 2 and level 3 data by correlation. Level 2 data are summarized to gene level by taking the mean of probes that belong to the same gene. Then pairwise correlation between summarized level 2 and level 3 data are calculated. Red color represents correlation coefficient > 0.9. (Heatmap; x-axis: Level2 Average, y-axis: Level 3.)

Figure 4: Batch effects in Level 2 Agilent gene expression data. An average across all probes for each sample is calculated using level 2 data. The average gene expression level for samples is shown by batch. (Plot title: Agilent Level2 data.)

Figure 5: Batch effects in Level 3 Agilent gene expression data. An average across all genes for each sample is calculated using level 3 data. The average gene expression level for samples is shown by batch. (Plot title: Agilent Level3 data.)
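One simple way to quantify the batch differences shown in Figures 4 and 5 is a one-way ANOVA of the per-sample mean on batch, followed by a per-gene F test to rank genes. The sketch below uses base R on simulated data as a stand-in; it is not the `MultiLinearModel` call used in this report, and `expr` and `batch` are hypothetical names.

```r
set.seed(2)
n.gene <- 200; n.per.batch <- 8
batch <- factor(rep(c("b1", "b2", "b3"), each = n.per.batch))

# Simulated expression matrix (genes x samples) with a global shift in
# batch b3 and an extra, much stronger shift for the first five genes.
expr <- matrix(rnorm(n.gene * length(batch)), n.gene, length(batch),
               dimnames = list(paste0("g", seq_len(n.gene)), NULL))
expr[, batch == "b3"] <- expr[, batch == "b3"] + 0.5
expr[1:5, batch == "b3"] <- expr[1:5, batch == "b3"] + 5

# Per-sample mean across genes, tested for a batch difference.
sample.mean <- colMeans(expr, na.rm = TRUE)
overall.p <- anova(lm(sample.mean ~ batch))[["Pr(>F)"]][1]

# Per-gene one-way ANOVA p-values, ranked with the smallest p-value
# first and ties broken by the larger MAD.
gene.p <- apply(expr, 1, function(y) anova(lm(y ~ batch))[["Pr(>F)"]][1])
gene.mad <- apply(expr, 1, mad, na.rm = TRUE)
top6 <- names(gene.p)[order(gene.p, -gene.mad)][1:6]
```

A small `overall.p` indicates that the average expression level differs between batches, and `top6` plays the same role as the six probes or genes selected for the per-batch boxplots.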

> for (i in 1:6) {
+     genedata <- Level3map[names(top6.3)[i], ]
+     temp <- boxplot(genedata ~ factor(final.si$batch), xlab = "",
+         main = names(top6.3)[i], cex = 0.7)
+     points(y = genedata, x = jitter(rep(1:13, temp$n)), cex = 0.7)
+     abline(v = 0: , col = "brown")
+ }
> dev.off()

3 Appendix

3.1 File Location

> getwd()
[1] "/workspace/nzhangtcgadata/ovarian/analysis/baggerlyqc/agilentqc"

3.2 SessionInfo

> sessionInfo()
R version ( )
i686-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
 [4] LC_COLLATE=en_US     LC_MONETARY=C        LC_MESSAGES=en_US
 [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ClassComparison_ Biobase_2.6.1    PreProcess_
[4] oompaBase_       limma_3.2.3

Figure 6: The top probes with batch effects in level 2 data. (Six panels, one per probe, including A_23_P62741.)

Figure 7: The top probes with batch effects in level 3 data. (Six panels; panel titles: EHF, F13A, IGSF, FGL, SLC6A, HIGD1B.)
