Using Past Queries for Resource Selection in Distributed Information Retrieval

Size: px

Start display at page:

Download "Using Past Queries for Resource Selection in Distributed Information Retrieval"

Alexandrina Nichols
6 years ago
Views:

1 Purdue Unversty Purdue e-pubs Department of Computer Scence Techncal Reports Department of Computer Scence 2011 Usng Past Queres for Resource Selecton n Dstrbuted Informaton Retreval Sulleyman Cetntas Purdue Unversty, scetnta@cs.purdue.edu Luo S Purdue Unversty, ls@cs.purdue.edu Hao Yuan Purdue Unversty, yuan3@cs.purdue.edu Report Number: Cetntas, Sulleyman; S, Luo; and Yuan, Hao, "Usng Past Queres for Resource Selecton n Dstrbuted Informaton Retreval" (2011). Department of Computer Scence Techncal Reports. Paper Ths document has been made avalable through Purdue e-pubs, a servce of the Purdue Unversty Lbrares. Please contact epubs@purdue.edu for addtonal nformaton.

2 Usng Past Queres for Resource Selecton n Dstrbuted Informaton Retreval Suleyman Cetntas 1 Department of Computer Scences Purdue Unversty West Lafayette, I, 47907, USA Luo S Departments of Computer Scences and Statstcs Purdue Unversty West Lafayette, I, 47907, USA Hao Yuan Department of Computer Scences Purdue Unversty West Lafayette, I, 47907, USA SCETINTA@CS.PURDUE.EDU LSI@CS.PURDUE.EDU YUAN3@ CS.PURDUE.EDU Abstract Federated text search provdes a unfed search nterface for multple search engnes of dstrbuted text nformaton sources. Resource selecton s an mportant component for federated text search, whch selects a small number of nformaton sources that contan the largest number of relevant documents for a user query. Most pror research of resource selecton focused on selectng nformaton sources by analyzng statc nformaton of avalable nformaton sources that s sampled n the offlne manner. On the other hand, most pror research gnored a large amount of valuable nformaton lke the results from past queres. Ths paper proposes a new resource selecton technque (whch s called qsm) that utlzes the search results of past queres for estmatng the utltes of avalable nformaton sources for a specfc user query. The new algorthm calculates the query smlartes between a specfc query and all past queres, and then estmates the utltes of avalable nformaton sources by the weghted combnaton of results of past queres wth respect to the query smlartes. The new resource selecton algorthm s practcal as t does not requre relevance judgment of past queres and t only utlzes regresson based results mergng method to rank the results of past queres. Furthermore, a combned resource selecton approach s proposed to ntegrate the two approaches of learnng from past queres and usng statc sampled nformaton. An extensve set of experments demonstrate the effectveness of the new resource selecton algorthm of learnng from past queres as well as the combned resource selecton algorthm n several confguratons. Keywords: Past Queres, Resource Selecton, Federated Search 1 Correspondng Author: Phone: (765) & Fax: (765)

3 1 Introducton The prolferaton of searchable text nformaton sources on local area networks and the Internet creates a problem of fndng nformaton that s dstrbuted among many text nformaton sources (federated text search) (Callan 2000; Craswell 2000; Meng et al. 2002). Federated text search, also known as dstrbuted nformaton retreval, ncludes three sub-problems. Frst, nformaton about the contents of each ndvdual nformaton source must be acqured. Ths task s called resource representaton (Balle et al. 2009; Callan 2000; Gravano et al. 1997; Nottelmann and Fuhr 2003; S and Callan 2003a). Second, a small number of text nformaton sources should be selected for search for a specfc user query (Cetntas et al. 2009; Craswell 2000; French et al. 1999; Fuhr 1999; Gravano et al. 1999; Hawkng and Thstlewate 1999; Iperots and Gravano 2002; Nottelmann and Fuhr 2003; Powell et al. 2000; Shokouh 2007; Shokouh and Zobel 2007; S and Callan 2003a; S and Callan 2004; Zha and Lafferty 2001). Ths task s called resource selecton or collecton selecton. Thrd, after results are returned from selected nformaton sources, the ndvdual ranked lsts should be merged nto a sngle fnal ranked lst. Ths task s called results mergng (Callan 2000; Cetntas and S 2007; Krsch 1997; Larson 2002; Le Calv and Savoy 2000; Lu et al. 2005; S and Callan 2003b; Xu and Callan 1998). Most pror research work of resource selecton focused on selectng nformaton sources by analyzng the statc nformaton whch s sampled from avalable nformaton sources n an offlne manner. For example, the ReDDE (Relevant Document Dstrbuton Estmaton (S and Callan 2003a)) algorthm frst tres to buld a centralzed sampled database based on query-basedsamplng, and then uses the nformaton of the centralzed sampled database to estmate the dstrbuton of relevant documents for resource selecton. However, most prevous research gnored a large amount of valuable nformaton lke the resource selecton results of past queres. For example, n a real world search engne, there are many smlar or even duplcated queres every day. The results from those smlar past queres are very valuable to gude the resource selecton decson of a current user query. One can magne that f two user queres are very smlar, ther correspondng resource selecton results should also be close. In ths paper, we propose a new resource selecton technque (.e., qsm) that utlzes the search results of past queres for estmatng the utltes of avalable nformaton sources for a specfc user query. The new algorthm calculates the query smlartes between a specfc query and all past queres, and then estmates the utltes of avalable nformaton sources by the weghted combnaton of search results of past queres wth respect to the query smlartes. The new resource selecton algorthm s practcal as t does not requre relevance judgment of past queres, and t only utlzes regresson based results mergng method to rank the results of past queres. Furthermore, a combned resource selecton approach s proposed to ntegrate the two types of approaches: learnng from past queres and usng statc sampled nformaton. It has ts advantages to ntegrate two types of evdence. In partcular, the combned approach generates the resource selecton results by ntegratng the results of the qsm selecton algorthm and the ReDDE selecton algorthm. 2

4 Emprcal studes are conducted on two testbeds wth several confguratons to show the advantage of the new resource selecton algorthms. In partcular, we compare the performance of the new algorthms wth a state-of-the-art resource selecton algorthm as ReDDE (S and Callan 2003a) (relevant document dstrbuton estmaton). In our experments n research envronments, we followed two dfferent approaches to generate the set of past queres: ) we sampled queres for each test/onlne query by extractng the ttles of some top rankng documents n the search results of that partcular test query; ) we generated smulated queres by randomly removng some terms from test queres, and used these smulated queres as past queres. The set of all sampled queres and smulated queres for each test query forms the set of smlar past queres n ether approach. Under the usual scenaro that there exst smlar queres among past queres and onlne queres, our new qsm algorthm performs better than the pror state-of-the-art ReDDE algorthm. The larger amount of past queres we have or the more smlar the past queres are to onlne queres, the better results we can get. Furthermore, a combned resource selecton approach can provde even better resource selecton results than both the qsm and the ReDDE algorthms. Therefore when the system receves a new rare test/onlne query, t wll utlze the combned approach and perform at least as good as the ReDDE algorthm or even better than that. The rest of the paper s arranged as follows: Secton 2 dscusses the related work. Secton 3 proposes the new resource selecton approaches,.e. ) the resource selecton approach by learnng from past queres and ) the combned resource selecton approach. Secton 4 dscusses the expermental methodology. Secton 5 presents the expermental results; and fnally, Secton 6 concludes ths work. 2 Related Work There has been consderable research on all of the three sub-tasks of federated search:.e. ) acqurng resource descrpton, ) resource selecton and ) results mergng. Snce ths paper focuses on the resource selecton task, we manly survey related work n resource selecton and brefly dscuss the works n resource descrpton and results mergng. The STARTS (Gravano et al. 1997) protocol s one of the solutons to acqure the resource descrptons. It requres explct cooperaton from each nformaton source. Although STARTS s a good soluton n envronments where cooperaton can be guaranteed, t doesn t work n mult-party envronments where some nformaton sources may not cooperate. Query Based Samplng (QBS) (Callan 2000; Callan and Connell 2001) s an alternatve approach for acqurng resource descrptons snce t doesn t requre explct cooperaton from avalable sources. QBS only uses the normal process of runnng queres and retrevng a lst of downloadable documents to construct resource descrptons. Ths soluton has been shown to acqure rather accurate resource descrptons usng a relatvely small number of randomly generated queres to retreve relatvely small number of documents. There s a large body of pror research on resource selecton (e.g., Cetntas et al. 2009; Craswell 2000; French et al. 1999; Fuhr 1999; Gravano et al. 1999; Hawkng and Thstlewate 1999; Iperots and Gravano 2002; Nottelmann and Fuhr 2003; Powell et al. 2000; Shokouh 2007; Shokouh and Zobel 2007; S and Callan 2003a; S and Callan 2004; Zha and Lafferty 2001). 3

5 Space lmtatons preclude dscussng t all, so we restrct our attenton to a few that have been studed often or recently n pror research. A large body of resource selecton algorthms follows the Bg Document approach by representng each nformaton source as a sngle bg document. The smlartes between the Bg Documents and a user query are calculated to rank avalable nformaton sources. Bg Document methods nclude CORI (Callan 2000; Callan et al. 1995), ggloss (Gravano et al. 1997; Gravano et al. 1999) and CVV (Yuwono and Lee 1997). Partcularly, the CORI resource selecton algorthm has been shown to be one of the most stable and effectve Bg Document resource selecton algorthms n pror studes (e.g, Craswell 2000; French et al. 1999). The CORI resource selecton algorthm (Callan 2000; Callan et al. 1995) represents each nformaton source by usng the words that t contans, ther frequences, and a small number of corpus statstcs. Resource rankng s generated by usng a Bayesan nference network and an adaptaton of the Okap term frequency normalzaton. Detals can be found n (Callan 2000; Callan et al. 1995). However, the Bg Document approach does not consder nformaton source szes, whch s a bg problem for searchng n the envronment of a mxture of small and very large nformaton sources (S and Callan 2003a). ReDDE (S and Callan 2003a) resource selecton algorthm was proposed to address the above ssue. It explctly estmates the dstrbuton of relevant documents among avalable nformaton sources for resource selecton. ReDDE utlzes database sze estmaton and a centralzed sample database (CSDb) that conssts of the documents obtaned by query based samplng as resource descrptons. The CSDb s a representatve subset of the centralzed complete database (CCDb) whch s the unon of all the documents n avalable nformaton sources. Snce the CCDb s not avalable n the federated search envronment, ReDDE uses the CSDb (whch has already been generated by QBS) to smulate the property of CCDb. In summary, ReDDE ssues a query to an ndex created over CSDb and scores each resource ) proportonal to the number of top-ranked documents orgnatng from that resource and ) consderng the estmated sze of that resource. More detaled nformaton about CCDb and CSDb databases can be seen on Fgure 1. In partcular, ReDDE utlzes the ranked lst of a user query on CSDb to smulate the ranked lst of ths user query on CCDb. Wth the ranked lst of a user query on CSDb, ReDDE estmates the number of relevant documents to query q n an nformaton source where C j ^ 1 Re l _ q( j) = P( rel d)* * d C j _ samp C j _ samp ^ C j C as follows: j N s the estmated database sze for the nformaton source C j (e.g., by Sample- Resample method [24]); C j_samp s the set of documents that are sampled from database C j ; C (.e. the sze of C j_samp ). The only unknown n and N s the number of documents n j_samp _samp C j Equaton 1 s P(rel d ),.e. the probablty of relevance gven a specfc document. P(rel d ), the probablty of relevance of a specfc document n Equaton 1, s defned as the probablty of relevance gven document rank when a CCDb s searched by an effectve retreval method (.e., Rank_central(d ) ) as follows: (1) 4

6 ( ) P rel d where C q s a query dependent constant, C f Rank _ central( d ) < rato ˆ = 0 otherwse q all N all (2) s the estmated total number of documents n the CCDb and rato s a parameter whch s set to be as n pror experments (S and Callan 2003a). Although ths approxmaton of relevance probablty s rather rough, smlar methods have been used by many automatc relevance feedback methods. Rank_central(d ), the rank of a document d n the CCDb, s calculated as follows: c( d j ) Rank _ central( d ) = (3) d j Rank _ Samp( d j ) < Rank _ Samp( d ) where Rank_Samp(d ) s the correspondng rank n CSDb. ^ c( d j ) _ samp ^ Rel_q(j) can be calculated by pluggng Equatons 3 and 2 nto Equaton 1, so that the dstrbuton of relevant documents n dfferent nformaton sources s acqured wthout a need for relevance judgment. The dstrbuton of relevant documents among nformaton sources s suffcent to rank the avalable nformaton sources for the resource selecton task. However, all the above pror research of resource selecton focused on selectng nformaton sources by analyzng statc nformaton of avalable nformaton sources that s sampled n the offlne manner. The pror research gnored a large amount of valuable nformaton lke the results from past queres or query logs. In most real world applcatons, there s often substantal smlarty between users queres; and even many of them are duplcates of each other. Ths fact not only motvates the research to make use of the results of past queres, but also makes t possble to use the combned approach of utlzng both the valuable results from past queres and the statc nformaton of avalable nformaton sources for the resource selecton decson of a current user query. Voorhees et al. s work consdered usng results of past queres (1995). However, the task they are dealng wth s not exactly resource selecton snce ther method retreves documents from all nformaton sources. For the decson of how many documents to retreve from each source, the work n (Voorhess et al. 1995) proposed two approaches. One method s to calculate the relevant document dstrbuton among nformaton sources for a query q by averagng the relevant document dstrbutons (among all sources) of k most smlar past queres wth q. The other method s to compute the weghts of each nformaton source s most smlar query cluster wth the query q and use these weghts. Both of these methods requre relevance judgment. Some recent research has been conducted for selectng vertcal search engnes (e.g., Arguello et al. 2009; Daz et al. 2009). In partcular, research work has been conducted to utlze query log and user feedback nformaton to mprove vertcal search engne selecton. However, the research of selectng vertcal engnes was dfferent from our proposed work n several perspectves. Frst, they only studed and evaluated selectng one vertcal engne for each user query whle there are often qute a few relevant resources for each user query n tradtonal dstrbuted nformaton retreval applcatons (e.g., dgtal lbrares). Second, they treated vertcal engne selecton as a 5

7 classfcaton approach, whle our work ranks resources by the number of relevant documents contaned n avalable resources. Thrd, they utlzed human relevance judgment, whle the proposed work only utlzes results from past queres that are automatcally generated. Fnally, ther approach s a pure resource selecton method, whle the proposed method s a more complete soluton that ncludes results mergng. Recently, Cetntas et al. showed that usng the search results of past queres mproves the effectveness of resource selecton; however, ther work was lmted. Partcularly, they dd not ntegrate the two approaches of learnng from past queres and usng statc sampled nformaton. Furthermore, they dd not explore the effect of ) usng larger amount of past queres, ) more smlar past queres, ) selectng more nformaton sources, v) usng dfferent query smlarty estmaton technques on resource selecton effectveness (2009). The last step of dstrbuted nformaton retreval s mergng ranked lsts produced by dfferent search engnes. It s usually treated as a problem of transformng database-specfc document scores nto database-ndependent document scores. The CORI results mergng formula (Callan 2000) s a heurstc lnear combnaton of the database score and the document score. The Sem- Supervsed Learnng (SSL) algorthm (S and Callan 2003b) uses the documents acqured by query-based samplng as tranng data and utlzes lnear regresson to learn mergng models. When t s necessary, the Sem-Supervsed Learnng method downloads a small number of documents on the fly to create tranng data for buldng the regresson models. Please note that SSL wll be used n ths word for estmatng the relevance of nformaton sources to past queres. The documents downloaded by SSL wll be an mportant data source that wll be explaned later. More detaled nformaton about the set of SSL-downloaded documents along wth CCDb, CSDb databases and how they are used n ths work can be seen on Fgure 1. 3 ew Resource Selecton Algorthms Snce there tends to be many smlar queres n a real world federated search system, the valuable nformaton of past queres can help us provde better resource selecton results. In ths secton, we propose a novel algorthm, whch s called qsm, to utlze the valuable nformaton to gude the decson of resource selecton. Furthermore, we propose a combned resource selecton approach that utlzes both the nformaton from past queres and the statc nformaton from query-based samplng for more accurate resource selecton. The frst subsecton presents the qsm algorthm for how to use search results from past queres to gude resource selecton for onlne user queres. The second subsecton descrbes the combned resource selecton approach. 3.1 Resource Selecton by Learnng the Results from Past Queres Assume that there exsts a set of past queres, whch s denoted by P = {p 1, p 2,, p M }, where p represents the th past query. For an onlne user query q, assume that we have a specfc method to measure the smlarty between q and p (detals of the query smlarty estmaton approaches used n ths work can be found n Secton 3.1.2), let us denote the smlarty by Sm(p q). Denote the 6

8 set of nformaton sources by S = {s 1, s 2,, s N }. For a query q, we use Rel(s j q) to represent how lkely t s for the nformaton source s j to be relevant to the query q, or what (estmated) percentage of the relevant documents for the query q s n s j. We are nterested n rankng the avalable nformaton sources by comparng ther values of the Rel functon. Please note that the absolute values of Sm(p q) and Rel(s j q) are not mportant. Only the relatve dfference between Sm(p 1 q) and Sm(p 2 q) s mportant, for any par of p 1 and p 2 whle q s fxed; whch s also the case for Rel(s j q). In qsm, we can estmate the value of Rel(s j p ) for all past queres. For an onlne query q, the task s to estmate Rel(s j q) based on the nformaton of Rel(s j p ) and Sm(p q). The algorthmc framework of qsm s as follows: 1. Estmate the relevance of an nformaton source s j to a past query p,.e. estmate Rel(s j p ). Ths can be done n an offlne manner for past queres. After the federated search system has processed an onlne query, ths query can also be added to the set of past queres for future reference. 2. Estmate the smlarty between a past query p and an onlne/test query q,.e. estmate Sm(p q). 3. Estmate the relevance of an nformaton source s j to an onlne/test query q,.e. estmate Rel(s j q), accordng to the estmated relevance nformaton of smlar past queres (to ths onlne query q). 4. Rank the nformaton sources accordng to Rel(s j q), whch wll gve the most relevant nformaton sources to the onlne query q Estmatng the Relevance of Informaton to Past Queres (.e. Rel(s j p ) ) Any resource selecton algorthm can be appled to estmate the value of Rel(s j p ). If a resource selecton algorthm can explctly gve scores for all nformaton sources, we can drectly map those scores to Rel(s j p ). For example, n the ReDDE algorthm, we can set Rel(s j p ) ~ Rel_p ^ (j), where ^ (j) Rel_p s the estmated number of relevant documents n s j wth respect to p by ReDDE. In ths work, a regresson-based results mergng approach (S and Callan 2003b) s used to get a much more precse Rel(s j p ) estmaton by utlzng the search results from past queres, no matter whether a pror resource selecton has been made for p or not. The key ntuton underlyng ths approach s that the more relevant documents from an nformaton source are n the merged ranked lst of past query p, the more relevant ths nformaton source to p s. The process of ths approach s as follows: Frst, assume that a set of sources (denoted by SEL_p ) are selected for p. For the case that no pror resource selecton has been mode, we can have SEL_p =S,.e. all nformaton sources. 7

9 QBS Queres Generated by Query Based Samplng SSL Past Queres p Resource Descrpton for Source #1, S 1 Source #1, S 1 Docs downloaded from Source #1 by SSL Resource Descrpton for Source #2, S 2 Source #2, S 2 Docs downloaded from Source #2 by SSL Resource Descrpton for Source #N, S N Centralzed Sampled Database (CSDb) The collecton of documents that are sampled by Query Based Samplng Source #N, S N Centralzed Complete Database (CCDb) The collecton of all documents n all nformaton sources Docs downloaded from Source #N by SSL Set of Documents downloaded by SSL The set of documents downloaded by the SSL results mergng algorthm ReDDE Rel ReDDE(s j q) Past Query p Test Query q Sm(p q) Rel(s p ) Rel qsm(s j q) Rel Combned(s j q) Fg. 1 of evdence for ReDDE and qsm resource selecton approaches. A federated search envronment conssts of several nformaton sources. The set of all documents n all nformaton sources consttute the Centralzed Complete Database (CCDd) (shown on left). Note that CCDb s a hypothetcal database, whch s not avalable n a federated search envronment. Therefore Query Based Samplng (QBS) samples documents from every nformaton source n order to acqure the resource descrptons. The set of all documents sampled by QBS form the centralzed sampled database (CSDb) whch s used by ReDDE resource selecton algorthm (Callan 2000; Callan and Connell 2001). In ths work, the combned resource selecton approach that uses the resource selecton decsons of ReDDE, ndrectly uses the CSDb (snce ReDDE uses t). For estmatng of the relevance of nformaton sources to past queres, SSL results mergng algorthm (S and Callan 2003b) s used. In order to buld the regresson models, SSL downloads a partcular number of documents from each source; and the set of all downloaded document consttute the SSL-downloaded documents database (shown on rght). Note that SSL-downloaded documents are used for the estmaton of relevance of nformaton sources to past queres (.e. Rel(s j p )) and for retreval-based query smlarty estmaton (.e. Sm(p q)). Some statstcs about the number of documents n each database can be found n the dscusson n Secton 4.3 Second, we use the regresson based Sem-Supervsed Learnng (SSL) method (S and Callan 2003b) to merge the search results from those sources n SEL_p nto a sngle ranked lst. SSL wll gve us the ranked lst of the most relevant documents from all nformaton sources n SEL_p to the past query p. 8

10 Thrd, the nformaton sources n SEL_p are assgned relevance scores proportonal to how many of ther documents are n the sngle ranked lst of relevant documents to p (as; f a source s more relevant to p, more documents from ths source wll appear n the ranked lst). Partcularly, f a source s j s not n SEL_p, then Rel(s j p ) = 0. If the source s j s n SEL_p, then Rel(s j p ) s estmated by: where T s a pre-defned number, and 1, f doc s j; Re l( s j doc ) = (5) 0, otherwse. The value T,.e. how many documents are consdered n the sngle ranked lst, can be very small lke 20. In our experments, t s set to 20. Ths helps to mnmze the number of nteractons wth nformaton sources. A typcal search engne (such as Google, Yahoo, etc.) only returns 10 or 20 search results or document ds n one page by default. Please note that the SSL algorthm needs to download some documents for each past query p as tranng data to buld the regresson models for mergng the search results from nformaton sources n SEL_p n order to generate the sngle ranked lst for that past query p. Ths s due to the fact although SSL can use the documents n the CSDb as the tranng documents to buld the regresson models, t s not allowed to do so n order to evaluate the resource selecton approach of learnng from past queres n a strct manner. The set of all SSL downloaded tranng documents for all past queres create the database that s used for ) for the estmaton of the relevance of nformaton sources to past queres, and ) for the estmaton of the smlartes of past queres and onlne queres (wll be explaned n the next secton). In ths work, 3 top documents are downloaded as the tranng data for each past query p to buld the regresson models for results mergng, n a smlar manner as the mnmum downloadng method of the SSL results mergng algorthm (S and Callan 2003b). Re l( s j doc) Re l( s j p ) T (4) top T docs n the merged ranked lst Estmatng the Smlarty Between Past Queres and Onlne Queres (.e. Sm(p q) ) There are many ways to measure the smlartes between queres (.e. between past queres and an onlne query), such as term-based approach (cosne measure, edt dstance, latent semantc analyss, etc.), selecton-based approach and retreval-based approach (by comparng the retreval lsts). In our experments, we use two approaches: ) a retreval-based approach and ) term-based approach (cosne smlarty) to calculate the smlartes. Although the retreval-based approach s used as the default approach to calculate the query smlartes, the results acqured wth termbased approach s also used to test the robustness of the proposed algorthms (.e. to make sure that the proposed technques work well regardless of the dfferent characterstcs of dfferent query smlarty estmaton approaches). 9

11 ) Retreval-Based Query Smlarty Estmaton Retreval-based query smlarty estmaton approach compares a past query p and an onlne query q by comparng the smlarty of the retreval results of those queres on a collecton. In ths work, the set of SSL-downloaded documents (explaned n Secton ) are used as the collecton that queres to be compared are searched on. Partcularly, let R_p (or R_q) denote the search results (.e. the ranked lsts) of p (or q) on the SSL-downloaded documents database. The smlarty of p wth respect to q s measured by 1 Sm p q Score doc R R (6a) ( ) (, _ p, _ q) R _ p doc R _ p R _ q where the score functon for a matchng document s gven by: Score( doc, R _ p, R _ q) doc rank n R _ p doc rank n R _ q) = 1 R _ p R _ q (6b) The value of Sm(p q) wll be hgher f the set of documents that are common to R_p and R_q rank smlarly n R_p and R_q. In practce, we cut each search results (R_p and R_q) by only focusng on the top 20% returned documents n ther rank lsts; and normalze the values of Sm(p q) n a smple approach as follows: MAXSIM _ q = max Sm( p q) SIMTHRES _ q= 0.8 MAXSIM _ q (7) 0, f Sm( p q) < SIMTHRES _ q; normalzed ( Sm( p q) - SIMTHRES _ q) (8) Sm( p ),. q otherwse ( MAXSIM _ q - SIMTHRES _ q) Retreval-based smlarty estmaton approach s used as the default query smlarty estmaton approach n ths work due to the fact that term-based smlarty estmaton approaches are not as successful n fndng the smlarty between past and onlne queres when queres have lmted number of terms. More detaled dscusson on the comparson of retreval-based smlarty approach and term-based smlarty approach can be seen on Fgure 2. ) Term-Based Query Smlarty Estmaton (Cosne Measure) As a second method of calculatng the smlartes between past queres and onlne queres, we use the common cosne smlarty (Baeza-Yates and Rbero-Neto 1999). If we denote the bag-ofwords vector representatons of past query p and onlne query q as and q, the cosne smlarty s gven by: p q Sm( p q) cos( p, q) = (9) p q where s the dot product of these vectors. We use the common Okap weghtng scheme to calculate smlartes between past queres and onlne queres; and do normalzaton to scale the smlarty scores between 0 and 1. p 10

12 We use term-based smlarty as the secondary method of query smlarty estmaton approach and used t only wth the retreval-based query smlarty approach to test the robustness of the proposed algorthms (.e. to make sure that the proposed technques work well regardless of the dfferent characterstcs of dfferent query smlarty estmaton approaches). More detaled dscusson on the comparson of retreval-based smlarty approach and term-based smlarty approach can be seen on Fgure Estmatng the Relevance of Informaton to Onlne Queres (.e. Rel(s j q)) For an onlne query q, we predct the relevance of nformaton sources (.e. Rel(s j q)) based on the estmated relevance nformaton from smlar past queres. It s calculated by: Rel( s q) Rel( s p ) Sm( p q) j j In practce, we only consder the K most smlar past queres when we calculate the above summaton. In our experments, K s set to Resource Selecton Accordng to Rel(s j q) (10) The last step s to smply rank the nformaton sources accordng to the values of Rel(s j q). A larger value of Rel(s j q) means that t s more lkely for the source s j to contan more relevant nformaton wth respect to q. As we mentoned before, we can add a query q nto the set of past queres after ths query has been processed Dfferent Levels of Past Search Results For a past query p, the amount of search results wll affect the qualty of the estmated Rel(s j p ). The amount of results s measured by the number of nformaton sources selected and searched for generatng search results. If more sources have been searched, the amount of search results for the past query s more comprehensve. We call the qsm algorthm by qsm-cut-x where the number of sources selected and searched for p s X, or n other words SEL_p, =X. In a real world federated search applcaton wth N=100 nformaton sources, the typcal number of sources that are chosen for the search task s about X = 5, 10 or at most Combned Approach for Resource Selecton In ths subsecton, we propose a smple way to combne qsm wth other resource selecton algorthms lke ReDDE to acheve better performance. In partcular, we focus on the combnaton of qsm and ReDDE. Assume that ReDDE generates the score of a partcular nformaton source s j as Rel_ ReDDE (s j q) for resource selecton. We denote the correspondng resource selecton score from qsm as Rel_ qsm (s j q), and then we can smply combne the values of Rel_ ReDDE (s j q) and Rel_ qsm (s j q) n a lnear way: 11

13 Rel _combned ( s j q) + λ Rel _qsm ( s j q) max jrel _qsm ( s j q) Rel _ReDDE ( s j q) (1- λ) max j Rel _ReDDE ( s j q ) (11) where λ s a real number wthn [0,1]. In our experments, λ s set to 1/3. Fnally, we generate the resource selecton results accordng to the values of Rel_ combned (s j q). 4 Expermental Methodology It s most desrable to evaluate federated search algorthms wth real world applcatons. However, real world applcatons are usually short of relevance judgment data and t s dffcult to obtan full control of dfferent components of the systems (e.g., varyng retreval algorthms of each nformaton source). These dffcultes prevent us from dong thorough study wth real world applcatons. Instead, an extensve set of experments was desgned n research envronments to smulate real world applcatons such as (Avraham et al. 2006). 4.1 Testbeds Experments were carred out on Trec123 and Trec4 testbeds. Detals about Trec123 and Trec4 datasets are gven n Table 1. Trec col-bysource (Trec123): 100 nformaton sources were created from TREC CDs 1, 2 and 3. They are organzed by source and publcaton date. Trec4-bysource (Trec4): 100 nformaton sources were created from TREC4 accordng to the sources of the documents n TREC4. For Trec123 testbed, 50 queres were created from the ttle felds of TREC topcs For Trec4, another 50 queres were created from the descrpton felds of TREC topcs These queres wll be used as the set of test/onlne queres and wll be referred to test, onlne or real queres throughout the paper. The detaled statstcs about the test queres can be found n Table 2. All the 100 nformaton sources were assgned one of three types of retreval algorthms as INQUERY (Callan et al. 1995), a ungram language model wth lnear smoothng (Lemur Toolkt; Zha and Lafferty 2001) and a TFIDF retreval algorthm wth the ltc weghng schema (S and Callan 2003b). All the algorthms were mplemented wth the lemur toolkt (Lemur Toolkt, Oglve and Callan 2001). 4.2 Sampled and Smulated Past Query Sets In a real world federated search system, there often exst many smlar queres. However n Trec123 and Trec4 testbeds, we don t have many smlar queres. In order to better smulate the characterstcs of a real world federated search system, we use two dfferent approaches to generate two dfferent sets of past queres usng the test queres that are avalable n Trec123 and Trec4 datasets. Then the set of real queres that are avalable from Trec123 and Trec4 datasets are used as the set of test/onlne queres and ether one of the two sets of generated past queres are 12

14 Table 1. Summary statstcs for Trec123 and Trec4 dstrbuted IR testbeds. Sze umber of Documents Sze (MB) Testbed (GB) (x1000) Mn Avg Max Mn Avg Max Trec Trec Table 2. TREC query set statstcs. Collectons TREC Topc Set TREC Topc Feld Average Length (Words) Trec Ttle 2.92 Trec Descrpton 8.82 used as the set of past queres. Specfcally the experments are conducted separately on the two sets of generated past queres to test the robustness of the proposed algorthms (.e. to make sure that the proposed technques work well regardless of the dfferent characterstcs of the sets of generated past queres). The frst approach of generatng the set of past queres samples queres for each test/onlne query by extractng the ttles of some top rankng documents n the search results of that partcular test query. The past queres generated by ths approach wll be referred as sampled past queres. The second approach of generatng the set of past queres creates the past queres from test queres by randomly removng some terms from the test queres. The past queres generated by ths approach wll be referred as smulated past queres. To reterate; the experments are usng ether the sampled past queres or the smulated past queres as the set of past queres. More detaled explanaton of these two approaches s gven n the followng two subsectons Sampled Past Queres Ths subsecton descrbes the frst approach to generate the sampled past queres. Two sets of sampled past queres are generated for Trec123, whch are called Trec123_T20 and Trec123_T10. The past queres n Trec123_T20 (or Trec123_T10) were generated n the followng way: A new Trec123 database called Sub_Trec123 s constructed, whch conssts of the documents that exst n the orgnal Trec123 database and that do not exst n the CSDB of the ReDDE algorthm. The documents n CSDB of ReDDE are partcularly excluded n order not to explot the advantage of the documents of CSDB of ReDDE that are used as resource descrptors. For each test query, TOPX most smlar documents that have a ttle (not all the documents have a ttle) are retreved from Sub_Trec123 database. The ttles of these TOPX documents construct the set of X sampled queres for ths test query (.e. each ttle becomes a new sampled past query). Please note that X s 10 for Trec123_T10 and 20 for Trec123_T20 query set. For 50 test queres, the above samplng process s repeated and a total of 1000 past queres are acqured for Trec123_T20 and a total of 500 past queres are acqured for Trec123_T10. In the 13

15 Table 3a. Statstcs of sampled past queres. Collectons Sampled Past Query Set # of TOP Docs (ttles extracted) Average Length (Words) Trec123 Trec123_T Trec123 Trec123_T Trec4 Trec4_T Trec4 Trec4_T Table 3b. Statstcs of smulated past queres. Collectons Smulated Past Query Set # of Words Removed Average Length (Words) Trec123 Trec123_R Trec123 Trec123_R Trec4 Trec4_R Trec4 Trec4_R same way descrbed above, two sets of sampled past queres are also generated for Trec4 testbed, whch are called Trec4_T20 and Trec4_T10. Trec4_T20 & Trec4_T10 have 1000 & 500 sampled queres respectvely. In the experments that used the sampled past queres as the set of past queres; the 50 real queres are treated as test/onlne user queres, and the 1000 sampled past queres (.e. the past queres n Trec123_T20 or Trec4_T20 sampled past query sets) or 500 sampled queres (.e. n Trec123_T10 or Trec4_T10 sampled past query sets) are treated as the past queres. Detals about sampled past queres can be seen n Table 3a Smulated Past Queres Ths subsecton descrbes the second approach to generate the smulated past queres. Two sets of smulated past queres are generated for Trec123, whch are called Trec123_R1 and Trec123_R2. Each set contans 50 smulated queres. The past queres n Trec123_R1 (or Trec123_R2) were generated n the followng way: For each test query, 1 (or 2) term(s) s (are) removed to generate a smulated past query; For each smulated past query at least 2 terms are kept to make sure the smulated past query s not too short to be meanngful. (e.g., n Trec123_R2, f a real query only contans 3 terms, only 1 term s removed nstead of 2 to make sure the smulated query contans at least 2 terms.) For Trec4 testbed, two levels of smulated past queres are also generated, whch are called Trec4_R2 and Trec4_R3. Each set contans 50 smulated queres. The past queres n Trec4_R2 (or Trec4_R3) were generated by the followng way: For each test query, 2 (or 3) terms are randomly removed to generate a smulated past query; For each smulated past query at least 3 terms are kept. 14

16 Normalzed Smlarty Score Normalzed Smlarty Score Past Query Rank Normalzed Smlarty Score Normalzed Smlarty Score Past Query Rank Past Query Rank a) retreval-based smlarty on Trec123_T20 past query set Past Query Rank b) term-based smlarty on Trec123_T20 past query set Normalzed Smlarty Score Normalzed Smlarty Score Past Query Rank Normalzed Smlarty Score Normalzed Smlarty Score Past Query Rank Past Query Rank c) retreval-based smlarty on Trec123_R1 past query set Past Query Rank d) term-based smlarty on Trec123_R1 past query set Fg. 2 Statstcs of Retreval-Based and Term-Based Smlarty measures over Trec123_T20 and Trec123_R1 past query sets. Each onlne/test query has a set of smlar past queres sorted wth respect to the normalzed smlarty scores whle determnng the set of most smlar past queres. The dstrbuton of the normalzed smlarty scores between an onlne query and the set of past queres gve the nformaton of the normalzed smlarty score of the most smlar past query at a partcular rank wth respect to an onlne query (.e. the normalzed smlarty score of the k th most smlar past query wth the onlne query q). The average of the dstrbuton of normalzed smlarty scores across all onlne queres (y-axes) for ranks 1-to-50 (x-axes) s calculated for Trec123_T20 and Trec123_R1 past query datasets wth retreval-based and termbased smlarty measures and s reported n each graph. Detaled vew of each graph for the top 10 most smlar past queres (note that n ths work we only choose the top 5 most smlar past queres for each test query) can be seen on the smaller graphs on the rght top of each graph. An mportant observaton s that for queres that have very lmted number of terms (such as Trec123_R1 and Trec123_R2) term-based smlarty measure cannot calculate the smlarty between the past and onlne queres as precse as the retreval based smlarty approach as shown on Graphs c & d. Ths s due to the fact that although there s some smlarty between a past query and an onlne query, term-based approach cannot detect t f there are no common terms. But when the queres have enough number of terms, term-based approach also performs as good as the retreval-based approach n fndng the smlarty between past and onlne queres as shown on Graphs a & b. In the experments that used the smulated past queres as the set of past queres; the 50 real queres are treated as test/onlne user queres, and the 50 smulated queres (.e. n Trec123_R1, Trec123_R2, Trec4_R2 or Trec4_R3 smulated past query sets) are treated as past queres. Generally speakng, smulated past queres n Trec123_R1 are more smlar than Trec123_R2 wth 15

17 respect to the Trec123 onlne queres and smulated past queres n Trec4_R2 are more smlar than Trec4_R3 wth respect to the Trec4 onlne queres. Detals about smulated past queres can be seen n Table 3b. Snce the processes of generatng the smulatng queres nclude randomness, each set of smulated queres (e.g., Trec123_R1) has been generated for 15 tmes. Accordngly, the resource selecton experments that use smulated past queres are run 15 tmes for each set of smulated queres and the average results of all of these runs are reported for evaluaton. 4.3 Baselne Method In the experments, we use ReDDE to do resource selecton for past queres (.e. to determne SEL_p ). Also, ReDDE s a baselne resource selecton algorthm to compare wth qsm. For qsm, there are ether 1000 or 500 past queres at most (.e. for sampled queres), and for each past query 3 top documents per source are downloaded by SSL n order to buld a regresson model to merge search results nto a sngle ranked lst. So n total, we need to download at most 3000 or 1500 documents per source. The number s actually much smaller than 3000 or 1500 as only a very small porton of sources are selected for searchng each past query: for Trec123(or Trec4)_T10 we download about 150 and for Trec123(or Trec4)_T20 we download about 300 documents per source on average. For ReDDE, we buld the CSDb, whch contans 150 documents per source by QBS. All other parameters n ReDDE are the same as that n [23]. Although the types of evdence that qsm and ReDDE use are dfferent and t wll not be perfectly far to make a comparson, they are stll comparable by the amount of documents they use. 4.4 Evaluaton Metrc The recall metrc R k has been commonly used to evaluate resource selecton algorthms [2,7,23]. Let B denote a desrable rankng of avalable nformaton sources by Relevance-Based Rankng (.e., rankng of sources by the actual number of relevant documents), and E a rankng provded by a resource selecton algorthm. Let B and E denote the number of relevant documents n the th ranked database of B or E. The metrc R k s defned as follows: Rk = k k E = 1 (11) B = 1 The metrc above measures what percentage the dfference s between the estmated rankng and the most desrable rankng. Therefore, at a fxed k, a larger R k value ndcates a better rankng. 5 Experment Results An extensve set of experments are conducted to address the followng sx questons: 16

18 Table 4a. Comparson between ReDDE wth qsm-cut-10 and qsm-cut-20 on Trec123 dataset wth Trec123_T20 sampled past query set. Please note that the results for qsm- Cut-10 on Trec123 dataset wth Trec123_T20 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec123 Trec123_T20 qsm-cut-20 on Trec123_T (39.66%) (62.02%) (17.73%) (38.30%) (26.15%) (28.86%) (17.31%) (29.17%) (17.26%) (23.67%) (09.30%) (16.70%) (07.84%) (15.08%) (08.66%) (16.44%) (07.83%) (18.45%) (05.83%) (17.94%) Overall Improvement 15.76% 26.67% Table 4b. Comparson between ReDDE wth qsm-cut-10 and qsm-cut-20 on Trec4 dataset wth Trec4_T20 sampled past query set. Please note that the results for qsm- Cut-10 on Trec4 dataset wth Trec4_T20 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec4 Trec4_T20 qsm-cut-20 on Trec4_T (17.65%) (-08.16%) (-02.56%) (08.89%) (08.73%) (20.17%) (12.85%) (14.51%) (16.01%) (16.51%) (11.60%) (15.39%) (09.44%) (15.08%) (02.29%) (10.42%) (-01.96%) (06.87%) (-03.52%) (04.41%) Overall Improvement 7.05% 10.41% 1. How good s the new resource selecton algorthm by learnng results from past queres? Experments are conducted to compare the new resource selecton algorthm wth the state-of-the-art ReDDE resource selecton that uses statc sampled nformaton for resource selecton. 2. How does the new resource selecton algorthm behave when dfferent amount of nformaton sources have been searched for past queres? Experments are conducted to show the performance of the new resource selecton algorthm when dfferent amount of nformaton sources have been searched for past queres. 3. How does the new resource selecton algorthm behave wth dfferent amount of past queres? Experments are conducted to show the performance of the new resource selecton algorthm wth more or less amount of past queres. 4. How does the new resource selecton algorthm behave wth dfferent characterstcs of past queres? Experments are conducted to show the performance of the new resource selecton algorthm wth more or less smlar past queres. 17

19 Table 5a. Comparson between ReDDE wth qsm-cut-10 and qsm-cut-20 on Trec123 dataset wth Trec123_R1 smulated past query set. Please note that the results for qsm- Cut-10 on Trec123 dataset wth Trec123_R1 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec123 Trec123_R1 qsm-cut-20 on Trec123_R (63.71%) (68.78%) (35.47%) (44.91%) (33.06%) (45.94%) (30.53%) (42.81%) (25.90%) (37.35%) (15.79%) (27.91%) (12.02%) (24.21%) (09.15%) (23.47%) (08.43%) (21.62%) (03.14%) (17.06%) Overall Improvement 23.72% 35.41% Table 5b. Comparson between ReDDE wth qsm-cut-10 and qsm-cut-20 on Trec4 dataset wth Trec4_R2 smulated past query set. Please note that the results for qsm- Cut-10 on Trec4 dataset wth Trec4_R2 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec4 Trec4_R2 qsm-cut-20 on Trec4_R (05.23%) (13.07%) (-09.20%) (-02.41%) (04.80%) (13.28%) (06.62%) (11.98%) (11.37%) (17.19%) (09.45%) (14.17%) (09.76%) (13.62%) (05.24%) (09.00%) (03.83%) (06.82%) (01.74%) (06.51%) Overall Improvement 4.88% 10.32% 5. How does the new resource selecton algorthm behave when dfferent query smlarty estmaton approaches are used to calculate the smlarty between a test query and the set of past queres? Experments are conducted to show the performance of the new resource selecton algorthm wth retreval-based query smlarty estmaton approach and term-based query smlarty estmaton approach. 6. Can the combned resource selecton approach further mprove the accuracy of resource selecton? Experments are conducted to compare the combned approach wth the two approaches of ether learnng from past queres or usng statc sampled nformaton. 5.1 Comparson between qsm and ReDDE Table 4a and Table 4b show the results of qsm and the Trec123 and Trec4 testbeds wth Trec123_T20 and Trec4_T20 sampled query sets. In the same way Table 5a and Table 5b show the correspondng results wth Trec123_R1 and Trec4_R2 smulated query sets. 18

20 Table 6a. Comparson between dfferent amounts of sampled past query sets on Trec123 dataset wth Trec123_T10 and Trec123_T20 sampled past query sets. Please note that the results for Trec123 dataset wth Trec123_T20 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec123 Trec123_T10 Trec123_T (10.54%) (39.66%) (10.37%) (17.73%) (10.70%) (26.15%) (10.49%) (17.31%) (11.36%) (17.26%) (04.58%) (09.30%) (04.47%) (07.84%) (03.97%) (08.66%) (02.08%) (07.83%) (00.22%) (05.83%) Overall Improvement 6.88% 15.76% Table 6b. Comparson between dfferent amounts of sampled past query sets on Trec4 dataset wth Trec4_T10 and Trec4_T20 sampled past query sets. Please note that the results for Trec4 dataset wth Trec4_T20 sampled past query set has been reported n our prevous work (Cetntas et al. 2009). Trec4 Trec4_T10 Trec4_T (21.56%) (17.65%) (04.52%) (-02.56%) (11.07%) (08.73%) (07.89%) (12.85%) (03.96%) (16.01%) (03.57%) (11.60%) (03.49%) (09.44%) (-04.20%) (20.29%) (-08.29%) (-01.96%) (-07.98%) (-03.52%) Overall Improvement 2.65% 7.05% The performance s evaluated by the Recall metrc R k defned n secton 4.4, where n the table s the k n R k. The percentages wthn the parentheses are the relatve mprovements of qsm over ReDDE. The overall mprovement s calculated by takng the average of the mprovement percentages when 1,2,,10 sources are selected. As shown from Table 4a, Table 4b, Table 5a and Table 5b, qsm-cut-10 always generates much better results than ReDDE. Thus, qsm can be consdered as a very effectve algorthm for resource selecton. 5.2 The Performance of qsm wth Dfferent Amounts of Search Results from Past Queres Table 4a & Table 4b (wth sampled past queres) and Table 5a & Table 5b (wth smulated past queres) also show that searchng more nformaton sources (.e. X = 20 n ths case) for past queres can help us do a better resource selecton for onlne queres (than searchng fewer nformaton sources (.e. X=10)). Ths s a reasonable result as the search results of past queres n qsm-cut-20 provde more nformaton than the search results of past queres n qsm-cut

Balanced Query Methods for Improving OCR-Based Retrieval

Balanced Query Methods for Improvng OCR-Based Retreval Kareem Darwsh Electrcal and Computer Engneerng Dept. Unversty of Maryland, College Park College Park, MD 20742 kareem@glue.umd.edu Douglas W. Oard