Word Sense Disambiguation Using Sense Signatures Automatically Acquired from a Second Language


Document Number: Working Paper 6.12h(b) (3rd Round)
Project ref.: IST
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: nlp/meaning/meaning.html
Availability: Public
Authors: Xinglong Wang (UoS) and John Carroll (UoS)
INFORMATION SOCIETY TECHNOLOGIES

Security (Distribution level): Public
Contractual date of delivery: February 2004
Actual date of delivery: February 2004
Type: Report
Status & version: Final
Number of pages: 14
WP contributing to the deliverable: WP6.12
WPTask responsible: Eneko Agirre
EC Project Officer: Evangelia Markidou
Keywords: Word Sense Disambiguation, Sense Signature Acquisition

Abstract: We present a novel near-unsupervised approach to the task of Word Sense Disambiguation (WSD). For each sense of a word with multiple meanings we produce a sense signature: a set of words or snippets of text that tend to co-occur with that sense. We build sense signatures automatically, using large quantities of Chinese text, and English-Chinese and Chinese-English bilingual dictionaries, taking advantage of the observation that mappings between words and meanings are often different in typologically distant languages. We train a classifier on the sense signatures and test it on two gold standard English WSD datasets, one for binary and the other for multi-way sense identification. Both evaluations give results that exceed previous state-of-the-art results for comparable systems. We also demonstrate that a little manual effort can improve sense signature quality, as measured by WSD accuracy.

Contents
1 Introduction
2 Acquisition of Sense Signatures
3 Experiments and Results
3.1 Building Sense Signatures
3.2 WSD Experiments on TWA dataset
3.3 WSD Experiments on Senseval-2 Lexical Sample dataset
4 Conclusions and Future Work

1 Introduction

We call a set of words or text snippets that tend to co-occur with a certain sense of a word a sense signature. More formally, a sense signature $SS_s$ of sense $s$ with length $n$ is defined as $SS_s = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ denotes a word form. Word forms in a sense signature can be ordered or unordered: when ordered, they form a text snippet; when unordered, they are simply a bag of words. In this paper, word forms in sense signatures are unordered. For example, a sense signature for the serving sense of waiter may be {restaurant, menu, waitress, dinner}, while {hospital, station, airport, boyfriend} may be one for the waiting sense of the word.

The term sense signature is derived from the term topic signature [1], which refers to a set of words that tend to be associated with a certain topic. Topic signatures were originally proposed by Lin and Hovy [2000] in the context of Text Summarisation. They developed a text summarisation system containing a topic identification module that was trained on topic signatures. Their topic signatures were extracted from manually topic-tagged text. Agirre et al. [2001] proposed an unsupervised method for building topic signatures. Their system queries an Internet search engine with monosemous synonyms of words that have multiple senses in WordNet [Miller et al., 1990], and then extracts topic signatures by processing the text snippets returned by the search engine. In this way, they can attach a set of topic signatures to each synset in WordNet. They evaluated their topic signatures on a WSD task using 7 words drawn from SemCor [Miller et al., 1993], but found that the topic signatures were not very useful for WSD.

[1] We use the term sense signature because it more precisely reflects our motivation of creating this resource for the purpose of Word Sense Disambiguation (WSD), though topic signature has been more often used in the literature.

Despite Agirre et al.'s disappointing result, a means of automatically and accurately producing sense (topic) signatures would be one way of addressing the important problem of the lack of training data for machine learning based approaches to WSD. Another approach to this problem is to automatically produce sense-tagged data using parallel bilingual corpora [Gale et al., 1992; Diab and Resnik, 2002], exploiting the assumption that mappings between words and meanings are different in different languages. One problem with relying on bilingual corpora for data collection is that bilingual corpora are rare, and aligned bilingual corpora are even rarer. Mining the Web for bilingual text [Resnik, 1999] is a solution, but it is still difficult to obtain sufficient quantities of high quality data. Another problem is that if two languages are closely related, useful data for some words cannot be collected, because different senses of polysemous words in one language often translate to the same word in the other.

In the second round of the MEANING project, we [Wang, 2004] used this cross-language paradigm to acquire sense signatures for English senses from large amounts of Chinese text and English-Chinese and Chinese-English bilingual dictionaries. We showed the signatures to be useful as training data for a binary disambiguation task. The WSD system described in this paper is based on our previous work. We automatically build sense signatures in a similar way and then train a Naïve Bayesian classifier on them. We have evaluated the system on the TWA Sense Tagged Dataset [Mihalcea, 2003], as well as the English Lexical Sample Dataset from Senseval-2. Our results show that such sense signatures can be used successfully in a full-scale WSD task.

The remainder of the paper is organised as follows. Section 2 briefly reviews the acquisition algorithm for sense signatures. Section 3 describes the details of building this resource and demonstrates our applications of sense signatures to WSD. We also present results and analysis in this section. Finally, we conclude in Section 4 and discuss future work.

2 Acquisition of Sense Signatures

Following the assumption that word senses are lexicalised differently between languages, we [Wang, 2004] proposed an unsupervised method to automatically acquire English sense signatures using large quantities of Chinese text and English-Chinese and Chinese-English dictionaries. Chinese was chosen because it is distant from English, and the more distant two languages are, the more likely it is that senses are lexicalised differently [Resnik and Yarowsky, 1999]. We also showed that the topic signatures are helpful for a small-scale WSD task. The underlying assumption of this approach is that each sense of an ambiguous English word corresponds to a distinct translation in Chinese.

[Figure 1 flow: English ambiguous word w; sense 1 and sense 2 of w; (English-Chinese Lexicon) Chinese translation of each sense; (Chinese Search Engine) Chinese text snippets for each translation; (Chinese Segmentation, Chinese-English Lexicon) English sense signatures for each sense of w.]

Figure 1: Process of automatic acquisition of sense signatures proposed by us. For simplicity, assume w has two senses.

As shown in Figure 1, our system first translates the senses of an English word into Chinese words, using an English-Chinese dictionary, and then retrieves text snippets from a large amount of Chinese text, with the Chinese translations as queries. Second, the Chinese text snippets are segmented and then translated back to English word by word, using a Chinese-English dictionary. For each sense, a set of sense signatures is produced.

As an example, suppose one wants to retrieve sense signatures for the financial sense of interest. One would first look up this sense in an English-Chinese dictionary and find the Chinese translation corresponding to this particular sense. The next stage is to automatically build a collection of Chinese text snippets by searching either a large Chinese corpus or the Web, using that translation as the query. As Chinese is written without spaces between words, one needs to use a segmentor to mark word boundaries before translating the snippets word by word back to English. The result is a collection of sense signatures for the financial sense of interest. For example, {interest rate, bank, annual, economy, ...} might be one of them. We trained a second-order co-occurrence vector model on the sense signatures and tested it on the TWA binary sense-tagged dataset, with reasonable results.

Since this method acquires training data for WSD systems from raw monolingual Chinese text, it avoids the problem of the shortage of English sense-tagged corpora, and also of the shortage of aligned bilingual corpora. Also, if existing corpora are not big enough, one can always harvest more text from the Web. However, like all methods based on the cross-language translation assumption mentioned above, there are potential problems. For example, a Chinese translation of an English sense may itself be ambiguous, so the retrieved text snippets may discuss a concept other than the one we want. Our previous experiment, on just 6 words each with only a binary sense distinction, was too small-scale to determine whether this is a problem in practice. In the following section, we present our experiments on WSD and describe our solutions to such problems.

3 Experiments and Results

We first describe in detail how we prepared sense signatures and then report our experiments on the TWA binary sense-tagged dataset, using various machine learning algorithms.
Using the algorithm that performed best, we carried out a larger scale evaluation on the English Senseval-2 Lexical Sample dataset [Kilgarriff, 2001] and obtained results that significantly exceed those published for comparable systems.

3.1 Building Sense Signatures

Following the approach described in Section 2, we built sense signatures for the 6 words to be binarily disambiguated in the TWA dataset and the 44 words in the Senseval-2 dataset [2]. These 50 words have 229 senses in total to disambiguate.

The first step was translating English senses to Chinese. We used the Yahoo! Student English-Chinese Online Dictionary [3], as well as a more comprehensive electronic dictionary. The second dictionary was needed because the Yahoo! dictionary is designed for English learners and its sense granularity is rather coarse. For the binary disambiguation experiments, it is good enough. However, the Senseval-2 Lexical Sample task uses WordNet 1.7 as its gold standard, which has very fine sense distinctions, and the translations in the Yahoo! dictionary do not match this standard.

[2] These 44 words cover all nouns and adjectives in the Senseval-2 dataset, but exclude all verbs. We discuss this point in Section 3.3.
[3] See:

PowerWord 2002 [4] was chosen as a supplementary dictionary because it integrates several comprehensive English-Chinese dictionaries in a single application. For each sense of an English word entry, both the Yahoo! and PowerWord 2002 dictionaries list not only Chinese translations but also English glosses, which provide a bridge between WordNet synsets and the Chinese translations in the dictionaries. Note that we only used PowerWord 2002 to translate words in the Senseval-2 dataset.

In detail, to automatically find a Chinese translation for sense s of an English word w, our system looks up w in both dictionaries and determines whether w has the same number of senses as in WordNet. If it does in one of the bilingual dictionaries, we locate in that dictionary the English gloss g that has the maximum number of overlapping words with the gloss for s in the WordNet synset. The Chinese translation associated with g is then selected. This method successfully identified Chinese translations for 29 out of the 50 words (58%). Translations for the senses of the remaining words could not be found automatically, because the sense distinctions in our bilingual dictionaries differ from those in WordNet. In fact, unless an English-Chinese bilingual WordNet becomes available, this problem is inevitable. For our experiments, we solved the problem by manually consulting the dictionaries and identifying translations. This task took one hour for an annotator who speaks both Chinese and English fluently.

It is possible that the Chinese translations are also ambiguous, which can make the topic of a collection of text snippets deviate from what is expected. For example, the oral sense of mouth can be translated by either of two Chinese words. The first is a single-character word and is highly ambiguous: when it combines with other characters, its meaning varies. For example, one such compound means an exit or to export. The second translation, on the other hand, is monosemous and should be used.
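The gloss-overlap lookup described above can be illustrated with a minimal sketch. It assumes glosses are compared as sets of lower-cased word tokens and that a small stop-word list is ignored when counting overlaps; the paper does not specify either detail. The names `content_words`, `pick_translation`, `STOP_WORDS` and the placeholder translation labels are all hypothetical.

```python
import re

# Hypothetical stop-word list; the paper does not say whether stop words
# were ignored when counting overlapping gloss words.
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "for", "on", "by", "with"}


def content_words(gloss):
    """Lower-case a gloss and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z]+", gloss.lower()) if t not in STOP_WORDS}


def pick_translation(wordnet_gloss, dictionary_senses):
    """Return the translation whose English gloss overlaps most with the WordNet gloss.

    dictionary_senses is a list of (english_gloss, translation) pairs for one word.
    Returns None when no dictionary gloss shares a word with the WordNet gloss.
    """
    target = content_words(wordnet_gloss)
    best_translation, best_overlap = None, 0
    for gloss, translation in dictionary_senses:
        overlap = len(target & content_words(gloss))
        if overlap > best_overlap:
            best_translation, best_overlap = translation, overlap
    return best_translation


# Toy entries with placeholder translation labels (not real dictionary data).
entries = [
    ("a charge paid for borrowing money from a bank", "ZH_FINANCE"),
    ("a feeling of curiosity or concern about something", "ZH_CURIOSITY"),
]
print(pick_translation("a fixed charge for borrowing money", entries))
```

Returning None when nothing overlaps mirrors the situation in the text where no translation can be identified automatically and the lookup falls back to a human annotator.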
To assess the influence of such ambiguous translations, we carried out experiments involving more human labour to verify the translations. The same annotator manually eliminated the highly ambiguous Chinese translations and replaced them with less ambiguous, or ideally monosemous, Chinese translations. This process changed roughly half of the translations and took about five hours. We compared the basic system with this manually improved one. The results are shown in Section 3.3.

Using the translations as queries, the sense signatures were automatically extracted from the Chinese Gigaword Corpus (CGC), distributed by the LDC [5], which contains 2.7GB of newswire text, of which 900MB are sourced from Xinhua News Agency of Beijing and 1.8GB are drawn from Central News from Taiwan. A small percentage of words have different meanings in these two Chinese dialects, and since the Chinese-English dictionary we use later (LDC Mandarin-English Translation Lexicon Version 3.0) is compiled with Mandarin usage in mind, we mainly retrieve data from Xinhua News. We set a threshold of 100: only when fewer than 100 snippets are retrieved from Xinhua News do we turn to Central News to collect more data. Specifically, for 51 out of the 229 (22%) Chinese queries, the system retrieved fewer than 100 instances from Xinhua News, and it then automatically retrieved more data from Central News. In theory, if the training data is still not enough, one can always turn to other text resources, such as the Web.

[4] A commercial electronic dictionary application. See:
[5] Available at:

To decide the optimal length of text snippets to retrieve, we carried out pilot experiments with two length settings, 250 Chinese characters (approximately 110 English words) and 400 Chinese characters (approximately 175 English words), and found that more context words helped improve WSD performance (results not shown). Therefore, we retrieve text snippets with a length of 400 characters.

We then segmented all text snippets, using the application ICTCLAS [6]. After the segmentor had marked all word boundaries, the system automatically translated the text snippets word by word using the electronic LDC Mandarin-English Translation Lexicon 3.0. As expected, the lexicon does not cover all Chinese words. We simply discarded those Chinese words that do not have an entry in this lexicon. We also discarded those Chinese words with multiword English translations. Since the discarded words can be informative, one direction for our future research is to find an up-to-date wide coverage dictionary and to see how much difference it makes. Finally, we filtered the sense signatures with a stop-word list, to ensure that only content words were included.

We ended up with 229 sets of sense signatures for all senses of the 50 words in both test datasets. Each sense signature contains a set of words that were translated from a Chinese text snippet, whose content should closely relate to the English word sense in question. Words in a sense signature are unordered, because in this work we only used bag-of-words information. Except for the very small amount of manual work needed to map WordNet glosses to those in the English-Chinese dictionaries, the whole process is automatic.

3.2 WSD Experiments on TWA dataset

The TWA dataset is a manually sense-tagged corpus [Mihalcea, 2003]. It contains binarily sense-tagged text instances, drawn from the British National Corpus, for 6 nouns.
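Looking back at the back-translation and filtering steps of Section 3.1, the following sketch shows one way they could fit together. It assumes the snippet has already been segmented, that the Chinese-English lexicon is a mapping from a Chinese word to a list of English translations, and that a Chinese word is dropped entirely if any of its translations is a multiword expression (one possible reading of the text). The function `build_signature` and the miniature lexicon and stop-word list are hypothetical, not the resources used in the paper.

```python
# Hypothetical miniature resources; a real run would load the LDC
# Mandarin-English lexicon and a full English stop-word list.
STOP_WORDS = {"the", "of", "a", "is", "and"}

TOY_LEXICON = {
    "银行": ["bank"],
    "的": ["of"],
    "年度": ["annual"],
    "中央银行": ["central bank"],  # multiword translation: word is discarded
    "经济": ["economy"],
}


def build_signature(segmented_snippet, zh_en_lexicon, stop_words=STOP_WORDS):
    """Translate one segmented snippet word by word into an unordered bag of English words."""
    signature = set()
    for zh_word in segmented_snippet:
        translations = zh_en_lexicon.get(zh_word)
        if translations is None:
            continue  # no lexicon entry: discard the Chinese word
        if any(" " in en for en in translations):
            continue  # multiword English translation: discard the Chinese word
        for en in translations:
            if en.lower() not in stop_words:
                signature.add(en.lower())  # keep content words only
    return signature


# "存款" has no entry in the toy lexicon, so it is silently dropped.
snippet = ["银行", "的", "年度", "中央银行", "经济", "存款"]
print(sorted(build_signature(snippet, TOY_LEXICON)))
```

Because many words are dropped at each filter, the resulting signature is much shorter than the snippet it came from, which matches the length difference reported for the real data.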
We trained various machine learning algorithms [7] on the sense signatures we built automatically, as described above, and then evaluated them on the dataset. Table 1 summarises the sizes of the sense signatures we computed for each sense of the 6 words, and also the number of test items in the TWA dataset. The average length of a sense signature is 35 words, which is much shorter than the length of the text snippets, which was set to 400 Chinese characters (approximately 175 English words). This is because function words and words that are not listed in the LDC Mandarin-English lexicon were eliminated. We did not apply any weighting to the features, because performance went down in our pilot experiments when we applied a TF.IDF weighting scheme (results not shown). We also limited the maximum number of training sense signatures to 6000, for efficiency.

[6] See: zhp/ictclas
[7] We used the implementation in the Weka machine learning package, apart from the Second-Order Co-occurrence Vector Model (OCV), which we implemented ourselves. Weka is available at: ml/weka.

Word     Sense 1      Sense 2
bass     fish         music
crane    bird         machine
motion   physical     legal
palm     hand         tree
plant    living       factory
tank     container    vehicle

[The Training Data and Test Data columns of Table 1 are not recoverable.]

Table 1: The 6 ambiguous words and their senses in the TWA dataset, sizes of training sense signatures used for each sense and word (3rd column), and numbers of test items.

[Table 2: columns Word, DT, NB, NBk, kNN, AB, OCV, SVM, MFB, M03; rows bass, crane, motion, palm, plant, tank, avg. Cell values are not recoverable.]

Table 2: Accuracy scores when using 7 machine learning algorithms: Decision Trees (DT), Naïve Bayes with normal estimation (NB), Naïve Bayes with kernel estimation (NBk), k-Nearest Neighbours (kNN), AdaBoost (AB), Second-Order Co-occurrence Vector Model (OCV) and Support Vector Machines (SVM). Underlined scores are the winners in their rows. For comparison, it also shows the most-frequent-sense baseline (MFB) and the performance of a state-of-the-art unsupervised system [Mihalcea, 2003] (M03), evaluated on the same dataset. Accuracies are expressed in percentages.

Table 2 summarises the evaluation results. It shows that 5 out of 7 algorithms beat the most-frequent-sense baseline. McCarthy et al. [2004] argue that this is a very tough baseline for an unsupervised WSD system to beat. Considering that the results were obtained completely automatically [8], sense signatures look very promising as an adjunct to, or replacement for, expensive hand-tagged training examples for WSD.

Note that the Naïve Bayes algorithms outperformed the other competitors. The Naïve Bayes algorithm with kernel estimation [John and Langley, 1995] was the best algorithm overall. The NBk algorithm, trained on our sense signatures, also outperformed the unsupervised WSD system of Mihalcea [2003] [9], which was trained on a collection of feature sets consisting of a number of context words in a window surrounding the monosemous equivalents of the senses of the 6 words.

One of the strengths of our approach is that training data come cheaply and relatively easily.
However, the sense signatures are acquired automatically and they inevitably contain a certain amount of noise, which may cause problems for the classifier. To assess the relationship between accuracy and the size of the training data, we carried out a series of experiments, feeding the classifier with different numbers of sense signatures. For each of the 6 words, we did the following: taking a word $w_i$, we randomly selected $n$ sense signatures for each of its senses $s_i$ from the full set of sense signatures built for $s_i$ (shown in the 3rd column of Table 1). The NBk algorithm was then trained on the $2n$ signatures and tested on $w_i$'s test instances. We recorded the accuracy, repeated this process 200 times, and calculated the mean and variance of the 200 accuracies. We then assigned another value to $n$ and iterated the above process until $n$ had taken all the predefined values. In our experiments, $n$ was taken from {50, 100, 150, 200, 400, 600, 800, 1000, 1200} for the words motion, plant and tank, and from {50, 100, 150, 200, 250, 300, 350} for bass, crane and palm, because fewer sense signatures were available for the latter three words. Finally, we used the t-test (p = 0.05) on pairwise sets of means and variances to see whether the improvements are statistically significant. The results are shown in Figure 2. [...] out of 39 t-scores are greater than the t-test critical values, so we are fairly confident that the more training sense signatures are used, the more accurate the NBk classifier becomes on this disambiguation task.

[8] Sense signatures for all senses of these 6 words are acquired completely automatically, without human intervention.
[9] According to Mihalcea's paper, the average accuracy of her system is 76.6%, as shown in Table 2. But our re-calculation, based on her reported results, is 74.4%. It might be an error on either side.

[Figure 2: six panels, one per word: bass, crane, motion, palm, plant, tank.]

Figure 2: Accuracy scores with size of training sense signatures going up. Each bar is 1 standard deviation.

Another factor that can influence WSD performance is the distribution of training data across senses. In our first series of experiments (Table 1), we used the natural distribution of the senses appearing in Chinese news text [10]. In our second series of experiments, i.e., the t-test experiments (Figure 2), we used equal amounts of sense signatures for each sense of the words. We predict that the system should yield its best performance when the distribution of the amounts of training data for the senses approximates the natural distribution of sense occurrence in English text. However, this is probably difficult to verify, since an unsupervised system has no a priori information about the natural distributions of senses.

3.3 WSD Experiments on Senseval-2 Lexical Sample dataset

Although our WSD system performed very well on binary sense disambiguation, it is not obvious that it should perform reasonably when attempting to make finer-grained sense distinctions. Therefore, we carried out a larger scale evaluation on the standard English Senseval-2 Lexical Sample Dataset. This dataset consists of manually sense-tagged training and test instances for nouns, adjectives and verbs. We only tested our system on nouns and adjectives, because verbs often have finer sense distinctions, which would mean that more manual work would be needed when mapping WordNet synsets to English-Chinese dictionary glosses. This would involve us in a rather different kind of enterprise, since we would have moved from a near-unsupervised to a more supervised setup.

To assess the influence of ambiguous Chinese translations, we prepared two sets of training data. As described in Section 3.1, sense signatures in the first set were prepared without taking ambiguity in Chinese text into consideration, while those in the second set were prepared with a little more human effort, aimed at reducing ambiguity by using less ambiguous translations. We call the system trained on the first set Sys A and the one trained on the second Sys B. To deal with multiwords, we implemented a very simple detection module, based on WordNet entries.
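The paper gives no further detail on the multiword detection module, so the following is only a sketch of one simple possibility: a greedy longest-match scan of the tokens against a set of multiword lexicon entries. The underscore-joined entry format, the `max_len` limit, the function name `detect_multiwords` and the toy entry set are all assumptions made for illustration.

```python
def detect_multiwords(tokens, multiword_entries, max_len=4):
    """Greedily join the longest token span that matches a multiword lexicon entry."""
    result = []
    i = 0
    while i < len(tokens):
        match_len = 1
        # Try the longest candidate span first, down to two tokens.
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(t.lower() for t in tokens[i:i + span])
            if candidate in multiword_entries:
                match_len = span
                break
        if match_len > 1:
            result.append("_".join(t.lower() for t in tokens[i:i + match_len]))
        else:
            result.append(tokens[i])  # single token, left unchanged
        i += match_len
    return result


# Hypothetical toy lexicon of multiword entries.
toy_entries = {"art_gallery", "bar_code", "post_office"}
sentence = "She works at the post office near the art gallery".split()
print(detect_multiwords(sentence, toy_entries))
```

Once a target word is found inside a detected multiword, the instance can be handled separately from occurrences of the word on its own, which is the kind of distinction the MW variants of the systems rely on.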
The results are shown in Table 3. Our Sys B system, with and without the multiword detection module, outperformed Sys A, which shows that sense signatures acquired with less ambiguous Chinese translations contain less noise and therefore boost WSD performance. For comparison, the table also shows various baseline performances and a system that participated in Senseval-2 [11]. Considering that the manual work involved in our approach is negligible compared with manual sense-tagging, we classify our systems as unsupervised, and so we should aim to beat the random baseline. All four of our systems do this easily. We also easily beat another unsupervised baseline, the Lesk [1986] baseline, which disambiguates words using WordNet definitions. The MFB baseline is actually a supervised baseline, since an unsupervised system does not have such prior knowledge; Sys B with multiword detection exceeded it. Sys B also exceeds the performance of UNED [Fernández-Amorós et al., 2001], which was one of the top-ranked [12] unsupervised systems in the Senseval-2 competition.

[10] This is not strictly true because we limited the training sense signatures to 6000.
[11] Accuracies for each word and on average were calculated by us, based on the information on the Senseval-2 website. See:
[12] One system performed better, but its answers were not on the official Senseval-2 website, so we could not do the comparison. Also, this system did not attempt to disambiguate as many words as UNED and our systems.

[Table 3: columns Word; Sys A (Basic, MW); Sys B (Basic, MW); Baselines & A Senseval-2 Entry (RB, MFB, Lesk(U), UNED). Rows: art-n(5), authority-n(7), bar-n(13), blind-a(3), bum-n(4), chair-n(4), channel-n(7), child-n(4), church-n(3), circuit-n(6), colourless-a(2), cool-a(6), day-n(9), detention-n(2), dyke-n(2), facility-n(5), faithful-a(3), fatigue-n(4), feeling-n(6), fine-a(9), fit-a(3), free-a(8), graceful-a(2), green-a(7), grip-n(7), hearth-n(3), holiday-n(2), lady-n(3), local-a(3), material-n(5), mouth-n(8), nation-n(3), natural-a(10), nature-n(5), oblique-a(2), post-n(8), restraint-n(6), sense-n(5), simple-a(7), solemn-a(2), spade-n(3), stress-n(5), vital-a(4), yew-n(2), Avg. Cell values are not recoverable.]

Table 3: WSD accuracy on words in the English Senseval-2 Lexical Sample dataset. The leftmost column shows the words, their POS tags and how many senses they have. Sys A and Sys B are our systems, and MW denotes that a multiword detection module was used in conjunction with the Basic system. For comparison, it also shows two baselines: RB is the random baseline and MFB is the most-frequent-sense baseline. UNED is one of the best unsupervised participants in the Senseval-2 competition and Lesk(U) is the highest unsupervised baseline set in the workshop. All accuracies are expressed in percentages.

There are a number of factors that can influence WSD performance. As mentioned above, the distribution of training data across senses is one. In our experiments, we used all the sense signatures that we built for a sense (up to an upper bound of 6000). However, the distribution of senses in English text often does not match the distribution of their corresponding Chinese translations in Chinese text. For example, suppose an English word w has two senses, $s_1$ and $s_2$, where $s_1$ rarely occurs in English text, whereas $s_2$ is used frequently. Also suppose that $s_1$'s Chinese translation is much more frequently used than $s_2$'s translation in Chinese text.
Thus, the distribution of the two senses in English is different from that of the translations in Chinese. As a result, the numbers of sense signatures we would acquire for the two senses would be distributed as they are in Chinese text. A classifier trained on this data would then tend to label unseen test instances according to the wrong distribution. The word nation, for example, has three senses, of which the country sense is used more frequently in English. In Chinese, however, the country sense and the people sense are almost equally distributed, which might be the reason why its WSD accuracy with our systems is lower than for most of the other words. A possible way to alleviate this problem is to select training sense signatures according to an estimated distribution in natural English text, which can be obtained by analysing available sense-tagged corpora with the help of smoothing techniques, or with the unsupervised approach of McCarthy et al. [2004].

Cultural differences can cause difficulty in retrieving sufficient training data. For example, translations of senses of church and hearth appear only infrequently in Chinese text. Thus, it is hard to build sense signatures for these words. Another problem, as mentioned above, is that translations of English senses can be ambiguous in Chinese. For example, the Chinese translations of the words vital, natural, local etc. are also ambiguous to some extent, and this might be a reason for their low performance. One way to solve this, as we described, is to manually check the translations. An automatic alternative is to segment or even parse the Chinese corpora before retrieving text snippets, which should reduce the level of ambiguity and lead to better sense signatures.

4 Conclusions and Future Work

We have presented WSD systems that use sense signatures as training data. Sense signatures are acquired automatically from large quantities of Chinese text, with the help of Chinese-English and English-Chinese dictionaries.
We have tested our WSD systems on the TWA binary sense-tagged dataset and the English Senseval-2 Lexical Sample dataset, and they outperformed comparable state-of-the-art unsupervised systems. We also found that increasing the number of training sense signatures significantly improved WSD performance. Since sense signatures can be obtained very cheaply from any large Chinese text collection, including the Web, our approach is a way to tackle the knowledge acquisition bottleneck.

There are a number of future directions we could investigate. First, instead of using a bilingual dictionary to translate Chinese text snippets back to English, we could use machine translation software. Second, we could try this approach on other language pairs, for example Japanese-English. This is also a possible solution to the problem that ambiguity may be preserved between Chinese and English: when a Chinese translation of an English sense is still ambiguous, we could try to collect sense signatures using a translation in a third language, for instance Japanese. Third, it would be interesting to tackle the problem of Chinese WSD using sense signatures built from English text, the reverse of the process described in this paper.
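To make the classification step concrete, here is a minimal sketch of training a bag-of-words classifier on sense signatures. Note the substitution: the best system in the paper used Weka's Naïve Bayes with kernel estimation, whereas this sketch uses a plain multinomial Naïve Bayes with add-one smoothing, written from scratch. The functions `train_nb` and `classify` and the toy signatures are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(signatures_by_sense):
    """Count words per sense; signatures_by_sense maps a sense label to a list of signatures."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for sense, signatures in signatures_by_sense.items():
        for signature in signatures:
            doc_counts[sense] += 1
            word_counts[sense].update(signature)
            vocab.update(signature)
    return word_counts, doc_counts, vocab


def classify(context_words, model):
    """Return the sense with the highest smoothed log-probability for the context."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in doc_counts:
        score = math.log(doc_counts[sense] / total_docs)  # prior from signature counts
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context_words:
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy signatures for two senses of "interest" (illustrative data only).
toy_training = {
    "financial": [{"bank", "rate", "annual", "economy"}, {"bank", "loan", "rate"}],
    "curiosity": [{"hobby", "music", "reading"}, {"curiosity", "hobby", "attention"}],
}
model = train_nb(toy_training)
print(classify(["the", "bank", "raised", "its", "rate"], model))
```

Because the prior is estimated from how many signatures each sense has, this sketch also exhibits the distribution effect discussed in Section 3.3: a sense with more signatures is favoured regardless of its frequency in English text.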

References

[Agirre et al., 2001] Eneko Agirre, Olatz Ansa, David Martinez, and Eduard Hovy. Enriching WordNet concepts with topic signatures. In Proceedings of the NAACL workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, USA, 2001.

[Diab and Resnik, 2002] Mona Diab and Philip Resnik. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, USA, 2002.

[Fernández-Amorós et al., 2001] David Fernández-Amorós, Julio Gonzalo, and Felisa Verdejo. The UNED systems at Senseval-2. In Proceedings of Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse, France, 2001.

[Gale et al., 1992] William A. Gale, Kenneth W. Church, and David Yarowsky. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, 1992.

[John and Langley, 1995] George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

[Kilgarriff, 2001] Adam Kilgarriff. English lexical sample task description. In Proceedings of Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse, France, 2001.

[Lesk, 1986] Michael E. Lesk. Automated sense disambiguation using machine-readable dictionaries: how to tell a pinecone from an ice cream cone. In Proceedings of the SIGDOC Conference, 1986.

[Lin and Hovy, 2000] Chin-Yew Lin and Eduard Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the COLING Conference, Strasbourg, France, 2000.

[McCarthy et al., 2004] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 2004.

[Mihalcea, 2003] Rada Mihalcea. The role of non-ambiguous words in natural language disambiguation. In Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2003, Borovetz, Bulgaria, 2003.

[Miller et al., 1990] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. Journal of Lexicography, 3(4), 1990.

[Miller et al., 1993] G. Miller, C. Leacock, R. Tengi, and T. Bunker. A semantic concordance. In Proceedings of ARPA Workshop on Human Language Technology, 1993.

[Resnik and Yarowsky, 1999] Philip Resnik and David Yarowsky. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2), 1999.

[Resnik, 1999] Philip Resnik. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.

[Wang, 2004] Xinglong Wang. Automatic acquisition of English topic signatures based on a second language. In Proceedings of the Student Research Workshop at ACL 2004, Barcelona, Spain, 2004.


More information

Algorithms in Nature. Pruning in neural networks

Algorithms in Nature. Pruning in neural networks Algorithms in Nature Pruning in neural networks Neural network development 1. Efficient signal propagation [e.g. information processing & integration] 2. Robust to noise and failures [e.g. cell or synapse

More information

Cross-cultural Deception Detection

Cross-cultural Deception Detection Cross-cultural Deception Detection Verónica Pérez-Rosas Computer Science and Engineering University of North Texas veronicaperezrosas@my.unt.edu Rada Mihalcea Computer Science and Engineering University

More information

Gender-Based Differential Item Performance in English Usage Items

Gender-Based Differential Item Performance in English Usage Items A C T Research Report Series 89-6 Gender-Based Differential Item Performance in English Usage Items Catherine J. Welch Allen E. Doolittle August 1989 For additional copies write: ACT Research Report Series

More information

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1 1. INTRODUCTION Sign language interpretation is one of the HCI applications where hand gesture plays important role for communication. This chapter discusses sign language interpretation system with present

More information

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing

More information

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems.

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems. Abstract RELYING ON TRUST TO FIND RELIABLE INFORMATION Alfarez Abdul-Rahman and Stephen Hailes Department of Computer Science, University College London Gower Street, London WC1E 6BT, United Kingdom E-mail:

More information

Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients

Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients Abstract Prognosis for stage IV (metastatic) breast cancer is difficult for clinicians to predict. This study examines the

More information

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier or How I learned to stop worrying

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information

Referring Expressions & Alternate Views of Summarization. Ling 573 Systems and Applications May 24, 2016

Referring Expressions & Alternate Views of Summarization. Ling 573 Systems and Applications May 24, 2016 Referring Expressions & Alternate Views of Summarization Ling 573 Systems and Applications May 24, 2016 Content realization: Referring expressions Roadmap Alternate views of summarization: Dimensions of

More information

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil CSE 5194.01 - Introduction to High-Perfomance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

More information

Evaluation of Linguistic Labels Used in Applications

Evaluation of Linguistic Labels Used in Applications Evaluation of Linguistic Labels Used in Applications Hanan Hibshi March 2016 CMU-ISR-16-104 Travis Breaux Institute for Software Research School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Language Volunteer Guide

Language Volunteer Guide Language Volunteer Guide Table of Contents Introduction How You Can Make an Impact Getting Started 3 4 4 Style Guidelines Captioning Translation Review 5 7 9 10 Getting Started with Dotsub Captioning Translation

More information