


Computational Statistics and Data Analysis 53 (2009) 3987-3998

Partition clustering of high dimensional low sample size data based on p-values

George von Borries (a), Haiyan Wang (b)
(a) Departamento de Estatística, IE, Universidade de Brasília, DF, Brazil
(b) Department of Statistics, Kansas State University, 66506-0802, KS, USA

Article history: Received 7 February 2009. Received in revised form 26 April 2009. Accepted 19 June 2009. Available online 26 June 2009.

Abstract

Clustering techniques play an important role in analyzing high dimensional data that is common in high-throughput screening such as microarray and mass spectrometry data. Effective use of the high dimensionality and some replications can help to increase clustering accuracy and stability. In this article a new partitioning algorithm with a robust distance measure is introduced to cluster variables in high dimensional low sample size (HDLSS) data that contain a large number of independent variables with a small number of replications per variable. The proposed clustering algorithm, PPCLUST, considers data from a mixture distribution and uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity to separate the mixture components. PPCLUST is able to efficiently cluster a large number of variables in the presence of very few replications. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of data. Numerical studies and an application to microarray gene expression data for a colorectal cancer study are discussed.

Published by Elsevier B.V.

1. Introduction

Mining in high dimensional low sample size (HDLSS) data is an active research topic due to advances in data collection technologies that allow information to be obtained from a large number of variables (for example, genes, proteins) at the same time.
Contrary to the plentiful replications demanded by traditional methods, the number of replications for such data is often limited due to time or cost constraints. For example, a medium-sized microarray study often contains information from thousands of genes with no more than a hundred samples for each gene. An important task is to investigate and identify disease response genes using the post-genome data. This can provide targets for drug development in public health and give the focus for genetic alteration to yield disease resistant crops.

Statistical methods for such purposes fall mainly into three categories. One category analyzes individual genes and then applies false discovery rate (FDR) control (Benjamini and Hochberg, 1995; Efron, 2007) to adjust for multiple comparison issues. A large volume of work in the literature falls in this category. Even though FDR is meant to improve the identification of true positives, it still leads to conservative results in genomic applications (Storey and Tibshirani, 2003). This is especially true in the case of small sample sizes, since test statistics calculated from few replications lack power for nonparametric methods and are sensitive to deviations from assumptions for parametric methods. As a result, when only a small amount of useful information exists among a large amount of noise, the limitation of these methods prevails.

A second category of methods is referred to as gene set enrichment, which considers a set of genes selected based on biological knowledge from pathway information or literature mining to increase power (Subramanian et al., 2005; Efron and Tibshirani, 2007). Unfortunately, pathway or gene ontology information is not known for all genomes and so gene set enrichment methods may not be applicable.

Corresponding author. E-mail addresses: gborries@unb.br (G. von Borries), hwang@ksu.edu (H. Wang).

Table 1. High dimensional data layout, where $a \to \infty$ and $n_i \ge 2$.

Factor level   Distribution   Observations                              Sample size
1              $F_1(x)$       $X_{11}\ X_{12}\ \ldots\ X_{1 n_1}$       $n_1$
2              $F_2(x)$       $X_{21}\ X_{22}\ \ldots\ X_{2 n_2}$       $n_2$
...            ...            ...                                       ...
a              $F_a(x)$       $X_{a1}\ X_{a2}\ \ldots\ X_{a n_a}$       $n_a$

A third category is through clustering to identify groups of differentially expressed genes (Fraley, 1998; Alon et al., 1999; Notterman et al., 2001; Yeung and Ruzzo, 2001; Jiang et al., 2004; Huttenhower et al., 2007; Fu and Medico, 2007). Clustering based methods are more flexible. However, non-probabilistic distance measures and the corresponding clusters can be difficult to interpret. Further, most algorithms are sensitive to monotone transformations and produce different results when applied to different transformations of the data. In addition to the above mentioned problems, most available methods require a user to pre-specify the number of clusters. This is difficult and can produce misleading results when an incorrect number of clusters is specified. Mixture model based clustering (MCLUST) developed by Fraley and Raftery (2006) can automatically estimate the number of clusters using the Bayesian Information Criterion. However, this algorithm relies heavily on the normality assumption and may produce poor clustering accuracy when the data are heavily skewed. Further, as pointed out by the authors, MCLUST is not recommended for direct application to HDLSS data due to its dependence on covariance matrix estimation.

We propose to approach the problem by combining the clustering and gene set enrichment ideas without having to rely on known biological information. Specifically, we assume that at least two replications are available for each variable (gene) to start with. All the variables and their observations together can be viewed as originating from high dimensional mixtures of distributions, where each unique distribution defines a cluster. We then introduce a new partitional algorithm using a robust measure of similarity to cluster the large number of variables.
The robust similarity measure evolves from p-values obtained from the rank test of no nonparametric effect of groups (Wang and Akritas, 2004), specially developed for the HDLSS structure. The new algorithm can automatically determine the number of clusters and is invariant to monotone transformations of data. Numerical studies show that the proposed algorithm has high clustering accuracy and stability. Additionally, the algorithm is fast and does not show the memory allocation problems observed in some algorithms when the number of variables in the study is very high.

2. Review of the nonparametric test for homogeneous distribution

Suppose we have observations from a mixture of unknown distributions. Let a cluster be all the observations generated from the same distribution. Differences among clusters can be reflected in many ways, such as different mean values or different variances. In this article, the problem of clustering observations is treated as a problem of detecting a significant difference among the distributions that generated the observations.

Let $X_{ij}$ denote the $j$th observation from the $i$th variable (or factor), where $\{X_{ij}, 1 \le j \le n_i\}$ are independent observations from some unknown distribution $F_i(x)$, $i = 1, 2, \ldots, a$. The observed data can be viewed as a matrix with elements $X_{ij}$. Each row represents the level of a factor, and each column represents an observation (replication), as shown in Table 1. We first test whether these observations are from the same distribution, i.e., we test the hypothesis

$$H_0: F_1(x) = \cdots = F_a(x). \qquad (1)$$

The Kruskal-Wallis test can be used when the number of distributions is small. However, the test is not valid in a high dimensional setting, since its inference is based on a large sample size and a small number of distributions. We also do not recommend the traditional ANOVA F-test, as the error terms in the ANOVA model need to be i.i.d. Gaussian with a constant variance.
Akritas and Arnold (2000) showed that the ANOVA F-test is robust to departure from homoscedasticity when there are a large number of factors, but it is not asymptotically valid for unbalanced data with small sample sizes even under homoscedasticity. Later, Akritas and Papadatos (2004) considered test procedures for unbalanced and/or heteroscedastic situations when the number of factors tends to infinity. However, their tests are based on the original observations and are not invariant to monotone transformations of data. To overcome all these limitations, Wang and Akritas (2004) considered a nonparametric rank test of the null hypothesis of equality of distribution functions for each factor level when the number of factors is large and the number of replications is either small (referred to as HDLSS data here) or large. We use the p-value from testing the hypothesis in (1) using the test statistic in Wang and Akritas (2004) as the measure of similarity among groups.

Let $R_{ij}$ represent the (mid-)rank of observation $X_{ij}$ in the set of all $n_1 + n_2 + \cdots + n_a$ observations. Under $H_0$, all observations are i.i.d. realizations of a common distribution, so these (mid-)ranks are discrete uniformly distributed random numbers between 1 and $\sum_{i=1}^{a} n_i$ for continuous data. Let

$$\bar{R}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} R_{ij}$$

be the mean rank of observations for the $i$th factor level and

$$\bar{R}_{\cdot\cdot} = a^{-1} \sum_{i=1}^{a} \bar{R}_{i\cdot}$$

be the overall unweighted mean of ranks from all factor levels. Define the test statistic

$$F_R = \frac{MST_R}{MSE_R}, \qquad (2)$$

where $MST_R$ is the unweighted mean square error due to factor levels calculated over (mid-)ranks:

$$MST_R = \frac{1}{a-1} \sum_{i=1}^{a} \left(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot}\right)^2, \qquad (3)$$

and $MSE_R$ is the pooled estimate of the sample variance, also obtained over (mid-)ranks:

$$MSE_R = \frac{1}{a} \sum_{i=1}^{a} \frac{S_{R,i}^2}{n_i}, \qquad (4)$$

where $S_{R,i}^2$ is the sample variance calculated using the (mid-)ranks of observations from the $i$th factor level. The asymptotic distribution of $\sqrt{a}(F_R - 1)$ under $H_0$, as $a \to \infty$, is given in Wang and Akritas (2004). For convenience of further discussion, we restate the theorem below.

Theorem 1. Let $F_i(x)$ be arbitrary cumulative distribution functions and $H(x) = \left(\sum_{i=1}^{a} n_i\right)^{-1} \sum_{i=1}^{a} n_i F_i(x)$ be the average cumulative distribution function. Assume that the observations are independent. Define $\sigma_i^2 = \mathrm{Var}(H(X_{ij}))$ and

$$v^2 = \frac{1}{a} \sum_{i=1}^{a} \frac{\sigma_i^2}{n_i} > 0, \qquad \tau^2 = \frac{1}{a} \sum_{i=1}^{a} \frac{2\sigma_i^4}{n_i (n_i - 1)}. \qquad (5)$$

Then under $H_0: F_1(x) = \cdots = F_a(x)$, the limit of $\tau^2 / v^4$ exists as $a \to \infty$. Further,

$$\sqrt{a}\,(F_R - 1) \xrightarrow{d} N\!\left(0, \lim_{a \to \infty} \tau^2 / v^4\right), \qquad (6)$$

regardless of whether the $n_i$ stay bounded or go to $\infty$, provided that $\max_i\{n_i\} / \min_i\{n_i\} = O(1)$ for $n_i \ge 2$.

The statistic $\sqrt{a}(F_R - 1)$ compared to the normal critical values can be used to obtain an approximate p-value that gives sample evidence of the homogeneity of the distributions. A large p-value indicates that the given sample does not provide evidence to conclude that the factor levels being tested have different distributions. In such a case, we cluster these factor levels into the same group. In contrast, a small p-value gives evidence against $H_0$, indicating that at least two distributions are different.

Using the hypothesis testing results from (6) to obtain a similarity measure allows flexible modeling and robust clustering at the same time. With this general setup, the data collected can be balanced or unbalanced and the user does not have to worry about normality or skewness of the data. Heteroscedastic variances are naturally incorporated. This is important because gene regulation is very complicated and the variation of the expression data from different genes can be dramatically different.
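As an illustration of how the statistic in (2) to (4) and the normal approximation in (6) fit together, the following is a minimal sketch in Python using only the standard library. It is not the authors' implementation. Two choices are assumptions made here: the null variance is computed by noting that under $H_0$ all $\sigma_i^2$ are equal and cancel in $\tau^2 / v^4$ (giving $2n/(n-1)$ in the balanced case) rather than by a jackknife estimate of $\sigma_i^4$, and the p-value is taken from the upper tail. The function names `midranks` and `rank_test` are hypothetical.

```python
import math
from statistics import mean, variance


def midranks(values):
    """Mid-ranks of `values`: tied observations share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def rank_test(rows):
    """Return (F_R, z, p) for H0: all rows share one distribution.

    `rows` is a list of lists, one per variable, each with at least 2 observations.
    Assumptions of this sketch: null variance simplified under H0, upper-tail p-value.
    """
    a = len(rows)
    sizes = [len(r) for r in rows]
    ranks = midranks([x for r in rows for x in r])  # ranks over ALL observations

    # Split the pooled ranks back into one list per variable.
    ranked, start = [], 0
    for n in sizes:
        ranked.append(ranks[start:start + n])
        start += n

    row_means = [mean(r) for r in ranked]
    grand_mean = mean(row_means)                                   # unweighted
    mst = sum((m - grand_mean) ** 2 for m in row_means) / (a - 1)  # eq. (3)
    mse = sum(variance(r) / n for r, n in zip(ranked, sizes)) / a  # eq. (4)
    f_r = mst / mse                                                # eq. (2)

    # Under H0 the common sigma cancels in tau^2 / v^4 from eq. (5).
    tau2 = mean([2.0 / (n * (n - 1)) for n in sizes])
    v2 = mean([1.0 / n for n in sizes])
    z = math.sqrt(a) * (f_r - 1.0) / math.sqrt(tau2 / v2 ** 2)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return f_r, z, p
```

A call such as `rank_test(rows)` with a few hundred rows of four observations each returns a small p-value when a sizeable subset of rows is shifted, and the result is unchanged if every observation is passed through the same strictly increasing function, since only ranks enter the computation.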
In addition, the results hold for small or large sample sizes. In particular, allowing reliable inference with sample sizes as small as two can lead to a significant reduction in cost, considering the number of arrays required. Before we apply the results of (6) in clustering, we first evaluate its performance. The estimated type I error and power were not studied in Wang and Akritas (2004). We report our simulation results in the next section.

2.1. Type I error and power estimate when the number of variables is large

Table 2 reports the Type I error estimate using the asymptotic distribution of the test statistic in (6) at several significance levels, including 0.10 and 0.05. For the performance of other nonparametric tests in such a setting, see Akritas and Papadatos (2004). In the simulations the number of random variables, $a$, takes values 1000, 2000 and 4000, and the number of observations per variable is set to 4. The simulations are based on 2000 runs, and observations were generated from normal, lognormal, exponential, and Cauchy distributions. The jackknife bias corrected estimator (Pawitan, 2001) of $\sigma_i^4$ was used in the estimation of the asymptotic variances. The Type I error rates reported in Table 2 are close to the true $\alpha$ levels, indicating that the test statistic $\sqrt{a}(F_R - 1)$ performs well in testing the hypothesis in (1) regardless of whether the distribution is symmetric (normal), skewed (lognormal, exponential), or heavy tailed (Cauchy).

To study the power of the test described in Section 2, we generated data for 2000 random variables from mixture distributions with four observations per variable. Normal, lognormal, exponential, and Cauchy distributions are considered to evaluate the robustness of the test. For all cases except the exponential distribution, observations for 95% of the variables are generated from the distribution with location parameter 0 and scale parameter 1, and the remaining 5% of the random variables have location parameter $d$, which is increased from 0 upward. The achieved power at significance level $\alpha = 0.05$ is given in Fig. 1.
The test appears to be very powerful in detecting a small proportion of differences in all cases considered.

Table 2. Type I error estimate. The test has accurate size regardless of the distribution being symmetric, skewed or heavy-tailed. Columns: distribution (Normal(0,1), Lognormal(0,1), Exponential(1), Cauchy(0,1)), number of factor levels, nominal level, Type I error.

Fig. 1. Achieved power for HDLSS data with $\alpha = 0.05$, considering shifted differences in mean ($d$) in a group of 100 factor levels in a total of 2000 factor levels and data generated from four distributions: Normal(0, 1) (continuous line in blue), Lognormal(0, 1) (dashed line in black), Exponential(1) (dotted line in red) and Cauchy(0, 1) (dotted-dashed line in green). Axes: estimated power at level 0.05 against the difference in location parameter ($d$).

3. Partition clustering algorithm based on p-values

The p-values obtained from the test in Section 2 can serve as a similarity measure in a clustering algorithm with high dimensional data. In this section, we introduce a partition algorithm, PPCLUST (p-values based partitional clustering), that iteratively conducts nonparametric hypothesis testing and partitions the random variables into subgroups whenever the similarity is below a certain threshold. That is, a group of variables is partitioned into two smaller groups when the test of identical distribution in (1) is rejected, and the group remains intact if the test is not rejected. When $H_0$ is rejected, smaller groups are created for further testing. The algorithm continues until there are no groups with similarity measures below the threshold.

3.1. The algorithm

For $g \ge 1$, let $g - 1$ be the number of groups identified such that all the variables within each group have identical distribution. PPCLUST is described below in 9 steps. Throughout the algorithm, the subset of data to be tested is always stored as in Table 1, with each row representing a random sample from the same variable.

1. Let $D_1$ denote the matrix of observations from all variables as in Table 1. Each row contains observations from the same variable. The number of rows in $D_1$ is denoted as $n_f(D_1)$. Set $g = 1$.
2. Calculate the (mid-)rank of all the observations in $D_1$ and store them in $D_{1R}$ in the same format as in Table 1.
3. Calculate the median (mid-)rank for each variable (i.e. each row) in $D_{1R}$.
4. Sort the variables (i.e. rows) in $D_1$ according to the median ranks from Step 3.
5. Conduct the test to evaluate whether the variables (rows) in $D_1$ have identical distribution.
   5.1 If $H_0$ is not rejected, report all the variables in $D_1$ as a single group. Go to Step 9.
   5.2 If $H_0$ is rejected, continue to Step 6.
6. Take the first half of the number (rounded to an integer) of variables from consecutive rows of $D_1$ and denote the data in this subset, including the corresponding observations, as $D_2$. Let $n_f(D_2)$ be the number of variables in $D_2$.
7. Conduct the test to evaluate whether the variables (rows) in $D_2$ have identical distribution.
   7.1 If $H_0$ is not rejected:
       7.1.1 Assign the variables of $D_2$ and corresponding observations to group $g$.
       7.1.2 Assign $g + 1$ to $g$.
       7.1.3 Remove the variables in $D_2$ and corresponding observations from $D_1$.
       7.1.4 If $n_f(D_1) = 0$, then go to Step 9.
       7.1.5 If $n_f(D_1) \ge 1$, then do steps A and B below:
             A. Test whether each variable in $D_1$ belongs to the newly assigned group by testing the corresponding hypothesis that all involved random variables have the same distribution. Remove the variable and its observations from $D_1$ when $H_0$ is not rejected and put them into the newly assigned group.
             B. Let $D_2$ be the set that contains the remaining variables and their observations in $D_1$ and go to Step 8.
   7.2 If $H_0$ is rejected:
       7.2.1 Take the first half of the number (rounded to an integer) of variables from $D_2$ and denote the data from this subset, with corresponding observations, as $D_3$.
       7.2.2 Assign all the variables and corresponding observations that are not in $D_3$ to $D_1$.
       7.2.3 Let $D_2 = D_3$ and delete $D_3$.
       7.2.4 Go to Step 8.
8. If $n_f(D_2) = 1$, then perform Steps 8.1 to 8.5; otherwise, return to Step 7.
   8.1 Allocate the variable and corresponding observations in $D_2$ to group 0.
   8.2 Remove the variable in $D_2$ and corresponding observations from $D_1$.
   8.3 If $n_f(D_1) = 0$, then go to Step 9.
   8.4 If $n_f(D_1) = 1$, then let $D_2 = D_1$ and return to Step 8.1.
   8.5 If $n_f(D_1) > 1$, then let $D_2 = D_1$ and go to Step 7.
9. Stop the clustering and report the groups identified.

Remark. For Step 3, please note that each variable has multiple i.i.d. observations. The sorting is only applied to the variables, not to the observations. The observations from each variable remain unordered so that they are still independent and identically distributed. For the same set of variables to be tested with given i.i.d. observations from each variable, the test statistic defined in (2) and the asymptotic variance calculated in (5) using the jackknife bias corrected estimator of $\sigma_i^4$ remain unchanged whether or not we sort the variables.
Therefore, the sorting has no effect on the test. However, it provides a computational advantage for the clustering by putting similar variables in nearby groups without altering the basic requirement of Theorem 1. For Steps 6 and 7.2.1, an alternative way to partition the variables is to split between the two rows that have the largest gap in their median ranks. This can potentially increase the speed of clustering if the distributions underlying different clusters are well separated. However, the advantage is not significant if the underlying distributions have substantial overlap, as in the numerical study in Section 4. Step 7 repeatedly partitions and groups the variables until no further partition is possible. Step 8 puts the random variables that cannot be clustered into any of the identified groups into a group labeled 0. Therefore, the random variables with group label 0 are not necessarily similar (or dissimilar). Instead, they are judged to belong to none of the identified groups. In other words, the random variables in group 0 resulted in a rejection of $H_0$ when tested with the random variables of any other identified group.

By the end of the algorithm, $g - 1$ is the total number of different groups. A group labeled with a lower number in the output contains random variables with lower median observation values than groups labeled with higher numbers. For example, if the data are the ratios of gene expressions under a treatment and a control, a group labeled with a lower number may contain down-regulated genes and a group labeled with a higher number may contain up-regulated genes. Intermediate groups contain genes not differentially expressed. In addition to the up or down regulation, the genes from different groups are significantly different as a result of the hypothesis testing.

3.2. About the significance level to use

Note that to determine whether all the variables in a group have identical distribution, Theorem 1 only applies when the number of variables (rows) is large. As the partition proceeds, the number of variables in the group to be tested will decrease.
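The control flow of the partitioning steps can be sketched as follows. This is a simplified illustration rather than the authors' SAS macro, and several points are assumptions made here: the homogeneity test is passed in as a callable `is_homogeneous` (hypothetical name) so that any test, including the rank test of Section 2 at a chosen significance level, can be plugged in; rows are ordered by their median observation as a stand-in for the median mid-rank; "first half rounded to an integer" is read as rounding up; and in Step 7.1.5 A each remaining variable is tested together with the new group including the variables absorbed so far.

```python
from statistics import median


def ppclust_sketch(rows, is_homogeneous):
    """Label each row (variable) with a group number; 0 means 'fits no group'.

    `is_homogeneous(list_of_rows) -> bool` should return True when the
    hypothesis of identical distribution is NOT rejected. Assumes >= 2 rows.
    """
    def subset(indices):
        return [rows[i] for i in indices]

    # Steps 2-4 (simplified): order the variables by their median observation.
    remaining = sorted(range(len(rows)), key=lambda i: median(rows[i]))  # D1

    # Step 5: if all variables look homogeneous, report a single group.
    if is_homogeneous(subset(remaining)):
        return {i: 1 for i in remaining}

    labels, g = {}, 1
    candidate = remaining[: (len(remaining) + 1) // 2]  # Step 6: D2
    while remaining:
        if len(candidate) == 1:
            # Step 8: a lone variable that joined no group is labeled 0.
            labels[candidate[0]] = 0
            remaining = [i for i in remaining if i != candidate[0]]
            candidate = list(remaining)
        elif is_homogeneous(subset(candidate)):
            # Step 7.1: the candidate set becomes group g.
            members = list(candidate)
            remaining = [i for i in remaining if i not in members]
            # Step 7.1.5 A: absorb remaining variables that are not rejected
            # when tested together with the new group.
            for i in list(remaining):
                if is_homogeneous(subset(members + [i])):
                    members.append(i)
                    remaining.remove(i)
            for i in members:
                labels[i] = g
            g += 1
            candidate = list(remaining)  # Step 7.1.5 B
        else:
            # Step 7.2: rejected, so retry with the first half of the candidate.
            candidate = candidate[: (len(candidate) + 1) // 2]
    return labels
```

For a quick check of the flow, a toy predicate such as "the medians of the rows differ by less than 5" can replace the statistical test. With rows `[[1, 2, 3], [1, 2, 4], [10, 11, 12], [10, 12, 13], [30, 31, 32]]` the sketch places the first two rows in group 1, the next two in group 2, and the last row in group 0. Passing the test as a parameter keeps the partitioning logic separate from the choice of test and significance level.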
The left panel of Table 3 gives the type I error estimate when the number of variables is no more than 500 and each variable contains two replications (under four different distributions). This and Table 2 together indicate that the test in Theorem 1 is liberal when the number of variables is no more than 50. To remedy this, we suggest using a small significance level in determining whether to reject a test. We recommend taking the upper bound of all significance levels $\alpha$ such that smaller levels yield similar clustering results. If a significance level leads to too many small clusters, this indicates that the level is not small enough and the clustering results obtained are not reliable. This is because the test does not have acceptable type I error for a small number of variables with small sample sizes. In such a case, even smaller significance levels need to be considered.

We choose not to use the Kruskal-Wallis test because it requires large sample sizes and a small number of variables. Our numerical results show that this test is very conservative when the number of variables is large and the sample sizes are small (see the right panel of Table 3 for the type I error estimate). For example, in a simulation we generated 15 random variables with scale parameter 1 from normal, lognormal, and Cauchy distributions. Ten of them have location parameter 0.5 and the remaining 5 variables have location parameter 0. Two replications were generated for each variable. The estimated power at level 0.05 from the Kruskal-Wallis test is very low for these distributions (0.006 for both the normal and the lognormal case). So the Kruskal-Wallis test is not sensitive enough to detect heterogeneous distributions to partition the data.

Table 3. Type I error estimate at level $\alpha = 0.05$ for the test in Theorem 1 (left panel) and the Kruskal-Wallis test (right panel) under four distributions when the number of variables is below 500. Each variable has 2 replications. The test in Theorem 1 is liberal when the number of variables is no more than 50 and the Kruskal-Wallis test is conservative for all the cases considered. All distributions have location parameter 0 and scale parameter 1 (Unif is the uniform distribution on (0, 1)). Columns in each panel: $n_f$, Unif, Normal, Lognormal, Cauchy.

3.3. Advantage of PPCLUST compared to traditional clustering algorithms

The robust similarity measure and the clustering mechanism entail the following advantages of PPCLUST.

1. Invariance to monotone transformations: The use of overall ranks of the observations in the test statistic leads to a similarity measure that is invariant to monotone transformations of data, and PPCLUST inherits this property. Many clustering algorithms produce different results before and after monotone transformations of the data because such transformations change the similarity matrices used in clustering. PPCLUST does not have this drawback, so a user does not need to explore appropriate transformations of the data to satisfy some model assumptions. This is particularly useful since choosing appropriate transformations for HDLSS data is itself a difficult question.

2. Automatic specification of the number of groups: PPCLUST does not require the number of clusters to be specified in advance.
It determines the number of clusters automatically once a significance level is specified as the threshold to be compared with the p-values for testing the hypothesis of identical distribution. Estimating the number of mixture components is itself a popular research topic that is often computationally intensive. In the low dimensional case, choosing the number of clusters has been a nuisance and a difficulty for users even though the clustering results may be visualized. In the high dimensional case, effective visualization tools are not available to aid a user, so it is even harder to specify the number of clusters for a real dataset. PPCLUST produces this information directly. The specification of a significance level is not as intrusive as the specification of the number of groups, which is one of the objectives of clustering analysis. In fact, the significance level can be used as guidance in finding the number of groups in a real data set. For example, decreasing the significance level in PPCLUST will decrease the number of groups found because it decreases the Type I error committed by the test. Different significance levels can serve as a fine tuning parameter in revealing the total number of different groups $G$ where the algorithm tends to stabilize, i.e., finding the $G$ that is most common across different $\alpha$ levels. This can be used as an indication of the true number of groups in the data. We remark that lowering the significance level too much will also decrease the power of the test in finding new and small groups. The delicate balance can be achieved in the same way as the type I and type II errors are handled in regular hypothesis testing.

3. Less concern for multiple comparison problems in HDLSS data: Reducing false discoveries while striving to maintain the power to identify true discoveries is one of the challenges of HDLSS data analysis (Storey, 2002; Sabatti, 2006; Qiu and Yakovlev, 2006; Strimmer, 2008). This is less of a concern in PPCLUST since the test is applied to groups of variables instead of on a one-by-one basis.

4. PPCLUST favors HDLSS data for the asymptotic distribution of the test statistic, while other algorithms often need prior dimension reduction before being applied to high dimensional data. In high dimensional studies it is common to apply a dimension reduction technique such as principal components analysis (PCA) before clustering the data (Johnson, 1998). Some studies do not recommend the use of PCA before clustering except in very special situations (Yeung and Ruzzo, 2001). Simulations in Yeung and Ruzzo show that clustering principal components instead of the original data produces different results for many algorithms using different similarity metrics. PPCLUST does not require dimension reduction prior to the analysis. Instead, PPCLUST takes advantage of the high dimensionality to provide the power to produce a reliable similarity measure. This is especially appealing when only a very small number of replications is available.

5. Flexible to work with unbalanced data with small sample sizes: The algorithm works with both balanced and unbalanced data. The only requirement is that the number of replications per variable is at least 2. The variables do not need to have the same number of replications. Unbalanced data are common in studies of microarray gene expression data, and some algorithms require balanced data. Solutions such as eliminating factor levels with incomplete information from the study, or imputing data, can hide or seriously compromise the result of the study.

6. PPCLUST produces a fast solution for computationally costly problems, as the computational complexity is $O(\log_2(N))$. Note that traditional clustering algorithms need to do optimization at each stage to find the optimal partition of the data based on a criterion. As the number of variables increases, the optimization cost becomes a major concern for exhaustive search. Genetic algorithms are often used to speed up the search. Instead of searching for the optimal solution at each stage, PPCLUST relies on statistical evidence obtained from hypothesis testing to judge whether a group of variables is from the same distribution or not. As long as the null hypothesis is not rejected, the members are not significantly different and therefore a group is formed. In other words, PPCLUST only needs the similarity measure from hypothesis testing and eliminates the optimization process. With the similarity measure being obtained through a single test of hypothesis, the computational burden is dramatically reduced to $O(\log_2(N))$ as opposed to $O(N \log_2(N))$, the best time complexity case for hierarchical clustering. This is confirmed by our simulations (see Section 4), where it takes PPCLUST less than a minute to complete the clustering of a data set containing up to 7000 random variables with sample sizes ranging from 5 to 20 per variable, using a PC running Windows XP with an Intel Pentium M processor, 1.6 GHz, and 1 Gb of RAM.

4. Numerical comparison for clustering of HDLSS data

In this section, we compare PPCLUST with some benchmark algorithms on simulated data. To evaluate the similarity between two clustering partitions, Rand (1971) proposed the Rand index, which gives the fraction of all pairs that are correctly put in the same cluster or correctly put in separate clusters. However, the expected value of the Rand index of two random partitions does not take a constant value. Hubert and Arabie (1985) considered the adjusted Rand index (ARI), which is centered at zero and has a maximum value of 1, achieved when the two partitions are identical up to renumbering of the subsets. Milligan and Cooper (1986) compared multiple indices for measuring agreement between two partitions in clustering analysis with different numbers of clusters, and they recommended the ARI as the index of choice. We adopt the ARI to compare the clustering of each algorithm with the truth, which is known from the data generation.

Study I: Clustering for symmetric data

In the following simulations we generated high dimensional data from mixture distributions having mixture components similar to the gene expression data from a colorectal cancer study (Notterman et al., 2001), which contain several large groups whose overall distribution is a t-distribution with 15 degrees of freedom shifted by some location parameter $\mu$ and stretched by a scale parameter $\sigma$. Specifically, observations for 4000 random variables were generated according to the following scheme:

Group 1: 300 random variables from $0.25\,t_{15}$ plus a group-specific location shift.
Group 2: 200 random variables from $0.25\,t_{15}$ plus a group-specific location shift.
Group 3: 2500 random variables from $0.25\,t_{15}$.
Group 4: 800 random variables from $0.25\,t_{15}$ plus a group-specific location shift.
Group 5: 200 random variables from $0.25\,t_{15}$ plus a group-specific location shift.

The densities of these five distributions have substantial overlap. Five observations were generated for each random variable.
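Since the adjusted Rand index is the accuracy measure used throughout this section, a small sketch of its computation may help. The code below follows the standard pair-counting form of the Hubert and Arabie index and uses only the Python standard library. The function name `adjusted_rand_index` and the convention of returning 1.0 when both partitions are trivial are choices made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement here

    # Pairs of items placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial (one cluster or all singletons)
    return (together_both - expected) / (maximum - expected)
```

The index equals 1 for partitions that agree up to relabeling, for example `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])`, and can be negative when the agreement is worse than expected by chance.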
PPCLUST using significance level $\alpha = 10^{-8}$ and the following 10 benchmark clustering algorithms are applied to the generated data:

Partitional algorithms: K-means, partitioning around medoids (PAM), clustering large applications (CLARA) with Euclidean metric, Self-Organizing Maps (SOM) with dimension 5 × 1.
Hierarchical algorithms: hierarchical clustering (HCLUST) with Ward's agglomeration method, agglomerative nesting (AGNES), divisive analysis clustering (DIANA) with Euclidean metric, hierarchical clustering by minimum energy distance (with Euclidean norm ‖x − y‖).
Fuzzy algorithm: fuzzy clustering (FANNY).
Model-based algorithm: mixture model based clustering (MCLUST) with automatic choice of the best model through the Bayesian Information Criterion.

For details of each algorithm, see McQueen (1967), Kaufman and Rousseeuw (1990), Kohonen (1989), Székely and Rizzo (2005) and Fraley and Raftery (2006). In all algorithms that need pre-specification of the number of clusters, we set the number to 5, the true number of groups. It should be noted that this information is often not known in practice, which contributes additional uncertainty to their clustering performance. R software (version 2.4.1) with packages energy, mclust, cluster, and SOM was used in the simulation. PPCLUST was written in SAS macro language (version 9.3.1), and the ARI was calculated using both R and SAS. For each algorithm in R, we use the default setting except that we supply the true number of groups as the number of clusters. For example, by default, the algorithm of Hartigan and Wong (1979) is used for K-means. In addition, with the specified number of clusters in the K-means algorithm, a random set of (distinct) rows of the data is automatically chosen as the initial centers. The random selection for the centers and rows is the standard initialization method used in R. It has been confirmed empirically to have better performance than other initialization methods (Bradley and Fayyad, 1998; Pena et al., 1999).

To evaluate the stability of the clustering performance, we repeat the data generation 200 times and apply the above algorithms to these 200 data sets. In order to verify the performance of PPCLUST under different sample sizes, the complete simulation study was repeated considering also samples of sizes 10, 15, and 20. The average and standard deviation of the ARI reflect the clustering accuracy and stability, respectively. They are reported in Table 4 for all the algorithms applied to the 200 data sets with different sample sizes. The best two mean ARIs and standard deviations are highlighted.

Table 4. Mean and standard deviations (std) of adjusted Rand index for all algorithms over 200 simulated datasets. Different sample sizes are considered. The groups are generated from symmetric distributions (Study I). Columns: algorithm, then mean and std. of the adjusted Rand index for each sample size. Rows: PPCLUST, PAM, K-means, Energy, Mclust, Clara, Diana, HCLUST, Agnes, Fanny, SOM.

From Table 4, it can be seen that as the number of replications increases, the clustering accuracy increases for all algorithms. PPCLUST has the best clustering accuracy for all sample sizes considered. In addition, PPCLUST is also the most stable of the 11 algorithms at the smallest sample size, since its ARI has the smallest standard deviation for sample size 5. The standard deviation of the ARI for PPCLUST stays almost the same for sample sizes 5, 10 and 15.
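For readers who want to reproduce the accuracy measure, the Rand index and the adjusted Rand index used in this section can be computed from two label vectors as sketched below. This is a plain Python illustration of the standard pair-counting formulas, not the R or SAS code used in the study; the function names are chosen for this example only.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both partitions."""
    n = len(labels_a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            if same_a == same_b:
                agree += 1
    return agree / comb(n, 2)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from the contingency table of two partitions."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))   # joint cluster counts
    rows = Counter(labels_a)                   # cluster sizes in partition A
    cols = Counter(labels_b)                   # cluster sizes in partition B
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


truth = [1, 1, 1, 2, 2, 2]
relabeled = [2, 2, 2, 1, 1, 1]   # same partition with the cluster numbers swapped
ari_same = adjusted_rand_index(truth, relabeled)
```

The pairwise loop in `rand_index` is quadratic in the number of items and is kept only for clarity; the contingency-table form used in `adjusted_rand_index` scales better to thousands of variables.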
MCLUST has comparable average ARI to PPCLUST for sample sizes 15 and 20, but has significantly worse performance than PPCLUST for small sample sizes in both clustering accuracy and stability. SOM showed consistent stability but with very low clustering accuracy, as the average ARI for SOM is less than 0.5 for all sample sizes. The algorithms Energy, HCLUST, and Agnes are competitive with PPCLUST for samples of size 15 or higher, but those algorithms are not as stable as PPCLUST and MCLUST. Diana and Fanny showed the lowest stability among all algorithms and should not be used with HDLSS data. Fig. 2 gives a graphical summary of the performance of these algorithms through boxplots. Overall, PPCLUST has the best clustering performance in terms of both accuracy and stability. For larger samples, MCLUST is a good alternative to PPCLUST.

Study II: Clustering for skewed data

In a second study, the data generated for Study I are transformed using the function $e^{4(x+1)}$, where x is an observation generated in Study I. The resulting distribution of the data is close to a lognormal distribution but with more extreme points, since x was generated from a t-distribution instead of a normal distribution. The resulting distributions are heavily skewed. There is still a significant amount of overlap among the densities. Table 5 and Fig. 3 summarize the clustering performance of these algorithms on the transformed data. PPCLUST is considerably better than all other algorithms for all sample sizes. Clara has the worst results and PAM is the best of the other algorithms, but PAM never had an average ARI greater than […]. PPCLUST applied to the transformed data yields results identical to those before the transformation because it is invariant to monotone transformations.

The simulations and all calculations were performed using Windows XP with an Intel Pentium M processor, 1.6 GHz, and 1 GB of RAM. The processing time for PPCLUST is consistently less than 1 min for each run on data sets with 4000 random variables, and PAM is the only faster algorithm.
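To illustrate the invariance property mentioned in Study II: a strictly increasing transformation leaves the (mid-)ranks of the observations unchanged, so any statistic computed from ranks alone is unchanged as well. The following minimal Python sketch checks this for the map $e^{4(x+1)}$. The helper names `midranks` and `skew_transform` and the toy values are assumptions made for illustration; this is not the PPCLUST test statistic itself.

```python
import math


def midranks(values):
    """Return 1-based mid-ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def skew_transform(x):
    """Strictly increasing map exp(4(x + 1)) applied to one observation."""
    return math.exp(4 * (x + 1))


x = [0.3, -0.6, 0.0, 0.9, 0.0, -0.1]       # toy observations with one tie
y = [skew_transform(v) for v in x]         # heavily skewed version of x
ranks_unchanged = midranks(x) == midranks(y)
```

Because the ranks agree, a procedure that uses the data only through ranks would give the same answer on `x` and on `y`, whereas distance-based methods would see very different geometry after the transformation.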
MCLUST, the closest competitor to PPCLUST, showed processing times at least 3 times higher than PPCLUST.

Fig. 2. Boxplots of adjusted Rand index for PPCLUST and 10 other algorithms on symmetric data based on 200 simulated datasets with different sample sizes (Study I). Panels: K-means, PAM, Clara, Energy, Mclust, Diana, Hclust, Agnes, Fanny, SOM, PPclust; horizontal axis: sample sizes.

Table 5. Mean and standard deviations (std) of adjusted Rand index for all algorithms over 200 simulated datasets with skewed distribution. Different sample sizes are considered. The data are generated from heavily skewed distributions as described in Study II. Columns: algorithm, then mean and std. of the adjusted Rand index for each sample size. Rows: PPCLUST, PAM, K-means, Energy, Mclust, Clara, Diana, HCLUST, Agnes, Fanny, SOM.

5. Application

Clustering of genes using expression data can identify genes that are differentially expressed under different conditions. Such genes may be responsible for disease progression or responsive to treatment. Identification of such genes can aid in biomarker identification for drug development. Additionally, the differentially expressed genes can be used to classify patient disease status. For example, using all genes from the whole genome can lead to inefficiency in classifying tumor patients, as no inference can deal with high dimensional prediction without imposing strong assumptions. Instead, using only the genes found to be differentially expressed by the clustering algorithm can significantly reduce the complexity of the classification problem. That is, results from the clustering can serve as a dimension reduction tool for classification. Such studies could improve treatments by identifying targets for therapy in many diseases.

In this section, we apply PPCLUST to data from the Notterman et al. (2001) study of transcriptional gene expression profiles of colorectal cancer. Heatmaps are used to visualize the results of PPCLUST.

Clustering genes in colorectal cancer

Colon and rectal cancer have many features in common, and for this reason both are often referred to as colorectal cancer. This cancer begins in most cases as a growth of tissue, called a polyp, inside the wall of the colon or rectum. If the cells of a tumor (adenomas) acquire the ability to invade and spread into the intestine and other areas, a malignant tumor develops (carcinoma or adenocarcinoma).
Understanding how changes in DNA cause cells of the colon and rectum to become cancerous could guide scientists in the development of new drugs, treatments and actions during early stages of the disease.

In the Notterman et al. (2001) study, normal tissues were paired with the two types of tumors, adenoma and adenocarcinoma. The data¹ consist of mRNA expression patterns probed in 4 colon adenoma tissues, 18 adenocarcinoma and 22 paired normal colon samples. In their study, a two-way hierarchical clustering algorithm was used to show that genome-wide expression profiling may permit a molecular classification of the three different types of tissues. Here, instead of clustering the tissues, we apply PPCLUST to cluster genes.

¹ Available at microarray.princeton.edu/oncology.

Fig. 3. Boxplots of adjusted Rand index for PPCLUST and 10 other algorithms on heavily skewed data (Study II) based on 200 simulated datasets under different sample sizes. Panels: K-means, PAM, Clara, Energy, Mclust, Diana, Hclust, Agnes, Fanny, SOM, PPclust; horizontal axis: sample sizes.

Since some of the genes in the original data were observed more than once, the median of expression levels of duplicated genes in each database (adenoma and paired normal tissues database, and adenocarcinoma and paired normal tissues database) was calculated. Then transformations similar to those described in the Notterman et al. (2001) study were performed prior to the application of PPCLUST in the composite database, i.e., the following steps were applied to each dataset:

Deletion of expression levels ≤ 0;
Calculation of the logarithm of the expression level;
Deletion of genes having more than 25% of their values missing. In Notterman et al. (2001) the percentage cutoff was 15%, resulting in a smaller sample.

Two data sets are obtained, one with 4 adenoma and paired normal tissues for 4175 genes and the other one with 18 adenocarcinomas and paired normal tissues for 4234 genes. Only 1038 genes are common to both data sets.

The existence of paired data allows the application of PPCLUST to the difference in gene expression levels of cancer (adenoma or adenocarcinoma) and normal tissues. The idea is that genes not related to the disease should not have significant changes in expression levels between cancer and normal tissues. However, genes that have significant changes in expression level can be identified through a clustering algorithm. PPCLUST is applied to the data at several significance levels.
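The three preprocessing steps listed above (removing non-positive expression levels, taking logarithms, and dropping genes with too many missing values) can be sketched as follows. This is a minimal illustration rather than the authors' SAS code: the function name `preprocess`, the use of `None` for missing values, the natural logarithm, and the toy gene names are all assumptions of this example.

```python
import math


def preprocess(expr, max_missing_frac=0.25):
    """Clean a mapping from gene name to a list of expression levels.

    Missing values are represented by None (an assumption of this sketch).
    """
    cleaned = {}
    for gene, levels in expr.items():
        # Steps 1 and 2: treat non-positive levels as missing, log the rest.
        logged = [
            math.log(v) if v is not None and v > 0 else None
            for v in levels
        ]
        # Step 3: keep the gene unless more than 25% of its values are missing.
        missing = sum(v is None for v in logged)
        if missing / len(logged) <= max_missing_frac:
            cleaned[gene] = logged
    return cleaned


toy = {
    "gene_a": [120.0, 85.0, 0.0, 230.0],   # 1 of 4 values removed: kept
    "gene_b": [-5.0, 0.0, 40.0, 60.0],     # 2 of 4 values removed: dropped
    "gene_c": [10.0, 20.0, 30.0, 40.0],    # nothing removed: kept
}
result = preprocess(toy)
```

With paired samples, the cleaned log-expression values of tumor and normal tissue for the same gene would then be subtracted to obtain the differences on which the clustering is run.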
For significance levels greater than […], the clustering resulted in too many small clusters of genes; for significance levels much smaller than […], the main structure of the groups obtained stays stable. So we use […] as our significance level. Fig. 4 presents the heatmap of the original differences in expression levels of adenoma and normal tissues in paired samples and the heatmap with genes ordered by the groups to which they were allocated. There is a concentration of zero to positive expression levels for these data, with no clear evidence of any gene groups. After applying PPCLUST, genes were clustered into 6 groups with 38 (0.91%), 316 (7.57%), 9 (0.22%), 3573 (85.58%), 221 (5.29%), and 15 (0.36%) genes, respectively. The first three groups contain genes that are significantly down-regulated and the last two groups consist of genes that are significantly up-regulated for adenomas compared to normal tissues. There is also a set of 3 (0.07%) genes that cannot be clustered with any other gene. The largest group is formed mostly by genes that had no significant difference in their expression levels between adenomas and normal tissues.

We also applied PPCLUST to the difference in expression levels of adenocarcinoma and normal tissues. In this case, 7 groups are obtained, with only 4 (0.09%) genes not assigned to any group. The numbers of genes in the groups are 91 (2.15%), 774 (18.28%), 9 (0.21%), 2673 (63.13%), 5 (0.12%), 655 (15.47%), and 23 (0.54%). The heatmaps before and after clustering are given in Fig. 5.

Among the 1038 genes that are present in both data sets, the membership assignments for comparing adenoma versus normal and adenocarcinoma versus normal tissues are tabulated in Table 6. Among these genes, 558 had no significant change in expression for both adenoma and adenocarcinoma tissues. For the other genes that are not significantly differentially expressed in adenoma tissues, usually, there is no significant change of expression levels in carcinoma tissues.
Similarly, genes that are significantly down-regulated in adenoma tissues tend to be also down-regulated in carcinoma tissues. The same pattern is also observed for significantly up-regulated genes. Only 10 genes had opposite expression levels in the two types of tissues.

Fig. 4. Heatmaps for Adenoma Normal Tissues before and after grouping.

Fig. 5. Heatmaps for Adenocarcinoma Normal Tissues before and after clustering.

Table 6. Distribution of 1038 genes present in both adenoma and adenocarcinoma tissue types. Genes in group 0 are not grouped by PPCLUST, and genes in group 4 are those genes that are differentially expressed in neither tissue type. Rows: adenoma groups; columns: adenocarcinoma groups.

Adenoma group    0    1    2    3    4    5    6    7
0                0    0    1    0    1    0    0    0
1                0    2    4    0    3    0    0    0
2                0   10   41    0   38    0  […]

Comparing the heatmaps obtained before and after clustering in both tissue types reveals that in carcinoma tissues the clustering of genes is more evident than in adenoma tissues. This is due to the larger differences in expression levels of carcinoma related genes. Results from the clustering of the gene expression data in the colorectal cancer study suggest target genes to molecular biologists for further lab experiments.

6. Conclusion

In this article, we proposed a novel computational algorithm, PPCLUST, for effectively clustering a large number of random variables with a small number of replications per variable. The availability of replications allows us to use p-values from a (mid-)rank test of homogeneous distribution developed by Wang and Akritas (2004) as similarity measures to determine if a group needs to be partitioned. Since no optimization is necessary, the computational cost is dramatically reduced compared to commonly used algorithms applied to a large number of variables. In addition, PPCLUST has the advantage that it is invariant to monotone transformations of data and can automatically determine the number of clusters with a specified significance level. In our simulation studies, PPCLUST outperformed 10 other benchmark algorithms commonly used in the microarray literature when considering clustering accuracy, stability and speed. The superior performance of PPCLUST on high dimensional data with small sample sizes makes it a useful tool for such data, which arise in many disciplines.

Acknowledgements

We are grateful to the two referees and the Editor for their helpful comments that improved the presentation of this manuscript. We would also like to acknowledge SAS Institute Brazil for the use of SAS through academic agreement with University of Brasilia.

References

Akritas, M.G., Arnold, S., Asymptotics for analysis of variance when the number of levels is large. Journal of the American Statistical Association 95.
Akritas, M.G., Papadatos, N., Heteroscedastic one-way ANOVA and lack-of-fit tests. Journal of the American Statistical Association 99.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA 96.
Benjamini, Y., Hochberg, Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSSB 57.
Bradley, P.S., Fayyad, U.M., Refining initial points for K-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers, Inc., San Francisco, CA.
Efron, B., Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association 102.
Efron, B., Tibshirani, R., On testing the significance of sets of genes. Annals of Applied Statistics 1.
Fraley, C., Algorithms for model-based Gaussian hierarchical clustering. SIAM 20.
Fraley, C., Raftery, A.E., MCLUST version 3.0: An R package for normal mixture modeling and model-based clustering, Technical Report, University of Washington.
Fu, L., Medico, E., Flame, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics 8.
Hartigan, J.A., Wong, M.A., A K-means clustering algorithm. Applied Statistics 28.
Hubert, L., Arabie, P., Comparing partitions. Journal of Classification 2.
Huttenhower, C., Flamholz, A.I., Landis, J.N., Sahi, S., Myers, C.L., Olszewski, K.L., Hibbs, M.A., Siemens, N.O., Troyanskaya, O.G., Coller, H.A., Nearest neighbor networks: Clustering expression data based on gene neighborhoods. BMC Bioinformatics 8.
Jiang, D., Tang, C., Zhang, A., Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering 16.
Johnson, D.E., Applied Multivariate Methods for Data Analysis. Duxbury.
Kaufman, L., Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience.
Kohonen, T., Self-organization and Associative Memory. Springer.
McQueen, J.B., Some methods for classification and analysis of multivariate observations. In: Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Milligan, G.W., Cooper, M.C., A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21.
Notterman, D.A., Alon, U., Sierk, A.J., Levine, A.J., Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research 61.
Pawitan, Y., In All Likelihood: Statistical Modeling and Inference Using Likelihood, Oxford.
Pena, J.M., Lozano, J.A., Larranaga, P., An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters 20.
Qiu, X., Yakovlev, A., Some comments on instability of false discovery rate estimation. Journal of Bioinformatics and Computational Biology 4.
Rand, W.M., Objective criteria for the evaluation of clustering methods. JASA 36.
Sabatti, C., False discovery rate and multiple comparison procedures. In: DNA Microarrays and Related Genomics Techniques: Design, Analysis, and Interpretation of Experiments. Chapman & Hall/CRC.
Storey, J., A direct approach to false discovery rates. Journal of the Royal Statistical Society B 64 (3).
Storey, J.D., Tibshirani, R., Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences USA 16.
Strimmer, K., A unified approach to false discovery rate estimation. BMC Bioinformatics 9, 303.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences USA 43.
Székely, G.J., Rizzo, M.L., Hierarchical clustering via joint between-within distances: Extending Ward's minimum variance method. Journal of Classification 22.
Wang, H., Akritas, M.G., Rank tests for ANOVA with large number of factor levels. Journal of Nonparametric Statistics 16.
Yeung, K.Y., Ruzzo, W.L., Principal component analysis for clustering gene expression data. Bioinformatics 9.
