Incrementally Clustering Legislative Interpellation Documents


2012 45th Hawaii International Conference on System Sciences

Fu-Ren Lin, Institute of Service Science, National Tsing Hua University, Taiwan
Yu-tze Huang, Institute of Technology Management, National Tsing Hua University, Taiwan
Dachi Liao, Institute of Political Science, National Sun Yat-sen University, Taiwan

Abstract

The Parliamentary Library of Legislative Yuan website provides a fair and objective channel for the public to trace the daily activities of the Legislative Yuan and legislators' inquiries in Taiwan. However, the growing volume of content causes an information overload problem. To mitigate this problem, this study proposes an incremental clustering mechanism that renews the information regularly and presents it as a categorical structure, easing the effort of tracing issue development. This study first initiates a basic categorical structure with a two-stage clustering approach. Then, the incremental clustering method clusters the incoming documents on the same topic, and either assigns these clusters to existing categories or creates a new category. Experimental results show the effectiveness of the proposed incremental clustering method, which enables the management of the hierarchical structure of categories on legislative interpellation. This study contributes to e-government initiatives by helping the public trace legislative activities periodically.

1. Introduction

Taiwan is a country with great enthusiasm for politics, and political news floods both TV channels and newspapers. However, the news is mainly reported via third parties (i.e., reporters and anchors), and therefore inevitably carries certain political positions or personal opinions.
Nevertheless, the website of the Parliamentary Library of Legislative Yuan, the library of the parliament of Taiwan, provides a fair and objective channel to the public. Its content covers the daily activities of the Legislative Yuan (i.e., the congress of Taiwan), including written records and videos of interpellations, conference speeches, and legislative proposals. Thus, this website is the most direct and realistic channel for learning about the issues of concern to the Legislative Yuan as well as the performance of the legislators. However, the flood of content often causes an information overload problem. To solve this problem, some scholars in related fields have begun applying information technology to provide political information effectively. Many technologies have been proposed to address information overload, including search engines (information retrieval), information agency, and information customization [2]. Other methods, such as text mining techniques, have been used to discover interesting patterns in unstructured text. For example, classification techniques can distinguish relevant documents within document sets, and clustering mechanisms can group related documents on the same topic. Those who are interested in politics usually collect related information via mass media. However, some research indicates that one reason voters are uninterested in politics or do not vote is probably that they lack political information or the ability to process it [7]. Thus, we think that effective categorizing and clustering technology can help people understand the issues discussed in the Legislative Yuan and supervise legislators' behavior. Apart from that, the Parliamentary Library of Legislative Yuan website provides massive information.
In interpellation sessions, legislators interpellate the ministers or members of the cabinet about policies, which are important questions involving the public interest in general [9]. Interpellation is no less important than the negotiation of budgetary bills, legislation, or other significant national matters, such as consent to important personnel appointments of the Judicial Yuan and the Control Yuan. Especially in the current fierce political atmosphere, legislators exercise the right of interpellation to its extreme for the sake of their performance, which extends the scope of persons and matters subject to interpellation (Legislative Bureau of the Legislative Yuan, 2004).

This research aims to develop a hierarchical structure of categories to cluster the interpellation documents. This can transform the text provided on the Parliamentary Library of Legislative Yuan website into categorical information, in order to effectively mitigate the information overload faced by members of the general public who follow the development of each issue in the parliament and oversee the performance of legislators. Based on the periodic updates of the interpellation documents, this study aims to develop a clustering mechanism that can renew the categories of information regularly. Nevertheless, people get confused easily if the existing categorical structure changes greatly. To limit both cognitive confusion and computational load, this study adopts an incremental clustering mechanism to maintain clusters efficiently and effectively after new documents are added. In summary, this paper contributes to e-government initiatives mainly by mitigating the cognitive and computational loads of tracing issues raised in parliamentary interpellation sessions using an incremental clustering approach.

2. Incremental clustering

In this research, we propose an incremental clustering architecture to develop a hierarchical structure of categories for progressive issues of legislative interpellation.

2.1 Definition

The system is designed to update data periodically. First, a suitable time slot is determined for collecting data and specifying the period of incoming data. For example, for legislative interpellation, a period is defined as the duration of a legislative meeting session. We conduct categorical structure initialization for interpellation documents during period 1, and perform incremental clustering iteratively on periods 2, 3, 4, and so on. Through incremental clustering, we expect to generate a hierarchical structure of categories. In this study, we define three types of categories.

1. Sub-category contains a group of related documents. Every branch ends with a sub-category, and each sub-category can be mounted under only one super-category as its parent.
2. Super-category is a virtual category containing multiple sub- or super-categories.
3. Root is a virtual category positioned at the root of a categorical tree. A new category not related to any existing category is mounted under the root. It is worth noting that child nodes under the root may not be correlated.

Each category may have three types of relations: parent, child, and peer. Figure 1 illustrates an example of a hierarchical structure of categories. There is only one root, which is the parent of super-categories 1 and 2 and sub-categories 5 and 10. We define that each category except the root has only one parent. Super-category 1 has two child sub-categories (sub-categories 3 and 4) and three peers (super-category 2 and sub-categories 5 and 10). Sub-category 3 has one parent (super-category 1) and one peer (sub-category 4).

Figure 1. An example of hierarchical structure of categories

2.2 System Framework

The system framework consists of two major parts: categorical structure initialization and incremental clustering (Figure 2). Categorical structure initialization is performed only once, to create the categorical tree at the beginning. Incremental clustering repeats as interpellation progresses.

2.3 Pre-process

The pre-process stage aims to collect legislative interpellation documents and perform the data transformation task. First, interpellation documents are collected from the Parliamentary Library of Legislative Yuan. Then, we adopt CKIP (Chinese Knowledge and Information Processing) for word segmentation to annotate terms with part of speech (POS). We then use the n-gram method to assemble concatenated noun terms into noun phrases. According to the characteristics of news [6], named-entity terms identified as noun phrases (e.g., people, places, and organizations) appear consistently within the same issue. We assume that issues of legislative interpellation resemble news in this respect.
Hence, terms other than noun phrases are filtered out for the sake of efficiency and effectiveness. However, the number of noun phrases in a corpus is too large to take all of them into computation. Thus, we applied tfidf (term frequency-inverse document frequency) to select the terms with the top α percent of weights, and then converted these terms or term phrases into a vector space model. tfidf, often used for term weighting in information retrieval, was also adopted for feature selection. tfidf rests on a simple idea: a term with high frequency (tf) is important, but a term that appears in many documents has low discriminating power. Therefore, a term with high tfidf can be regarded as highly representative of the original stories [11]. The final step of the pre-process is to form a vector space model.

2.4 Categorical structure initialization

At the beginning, we need to generate a preliminary categorical structure so that incoming legislative interpellation documents of the following periods can be assigned to corresponding categories. It also provides a basic issue structure to help the public monitor legislative performance. Among clustering methods, this study chose a two-stage clustering approach to take advantage of both partitioning and hierarchical methods. In the first stage, hierarchical clustering is used to compute the inconsistency coefficient at each fusion and obtain the optimal number of clusters. In the second stage, k-means clustering is employed to physically partition the documents into the number of clusters determined in the first stage.

Figure 2. System framework

Stage 1 (hierarchical clustering). The website of the Legislative Yuan of Taiwan provides not only the purpose of each interpellation but also themes, keywords, and categories encoded by humans. In this process, we extract themes, keywords, and categories to represent each interpellation's vector. The cosine coefficient [11], also known as cosine similarity, is traditionally used to measure the similarity of two term vectors; that is, it calculates the cosine of the angle between two vectors, as shown below.
\[ \cos(n_i, m_i) = \frac{\sum_j w_{n_i j}\, w_{m_i j}}{\sqrt{\sum_j w_{n_i j}^2}\; \sqrt{\sum_j w_{m_i j}^2}} \]

where $w_{n_i j}$ is the weight of term $j$ in document $n_i$, and $w_{m_i j}$ is the weight of term $j$ in document $m_i$. We use the cosine coefficient to calculate the similarity between documents. Each legislator's interpellation is filed as a document and transformed into a vector $d_i$. We define the similarity of two clusters as the average similarity between the documents in the corresponding clusters, as shown below, where $d_i$ and $d_j$ are documents in cluster1 with $m$ documents and cluster2 with $n$ documents, respectively.

\[ \mathrm{similarity}(\mathrm{cluster1}, \mathrm{cluster2}) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{cosine}(d_i, d_j)}{m \times n} \]

The hierarchical algorithm operates in a greedy and local manner [5]. It iteratively merges the pair of clusters with the highest similarity at each stage, and stops when all clusters are merged into one cluster. The complexity of the binary structure generated by hierarchical clustering can be reduced by choosing a cutting threshold to determine the number of clusters, so that the physical partitioning can be performed with k-means clustering in the second stage [8]. Selecting the optimal number of clusters is one of the central problems in both nonhierarchical and hierarchical cluster analysis [3]. The inconsistency coefficient obtains shallower categorical trees than the silhouette coefficient [5]. We also conducted a pilot experiment using the maximum similarity difference between each merge and the last merge, and the result obtained by the inconsistency coefficient outperformed that of the silhouette coefficient. Therefore, we use the inconsistency coefficient to determine the number of clusters.

The inconsistent function is used to generate a list of the inconsistency coefficients for each link in the cluster tree. By default, the inconsistent function compares each link in the cluster hierarchy with adjacent links that are less than $z$ levels below it in the cluster hierarchy. This is called the depth of the comparison. The objects at the bottom of the cluster tree, called leaf nodes, have no further objects below them and have an inconsistency coefficient of zero. Clusters that join two leaves also have a zero inconsistency coefficient. The inconsistency coefficient $c_i$ for the $i$th ($i = 1, 2, \ldots, N-1$) fusion level $\alpha_i$ is

\[ c_i = \frac{\alpha_i - \bar{\alpha}_z}{\sigma_z} \]

where $\bar{\alpha}_z$ and $\sigma_z$ are the respective mean and standard deviation of the heights of level $\alpha_i$ and the $z$ highest fusion levels before it. Notice that the $z$ heights are taken from the sub-tree rooted at the node of the $i$th level. Let the sub-tree contain $l$ fusion levels. If $0 < l < z$, the $l$ levels are considered, and when $l = 0$ the inconsistency coefficient is zero [5].

The silhouette coefficient combines the ideas of both cohesion and separation for individual points as well as clusters. It is defined as follows.

\[ s(i) = \frac{b(i) - a(i)}{\max(b(i), a(i))} \]

where $a(i)$ denotes the average dissimilarity of $i$ to all other objects of its own cluster $A$. For any cluster $C$ different from $A$, let $d(i, C)$ be the average dissimilarity of $i$ to all objects of $C$. After computing $d(i, C)$ for all clusters $C \neq A$, the smallest value among them is selected and denoted as $b(i)$. The value of the silhouette coefficient can vary between -1 and 1. A negative value is undesirable. We want the silhouette coefficient to be positive ($a(i) < b(i)$), and $a(i)$ to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when $a(i) = 0$ [4].

Stage 2 (k-means clustering).
In general, the k-means clustering method performs much better than hierarchical clustering in most cases, but its performance depends on the number of clusters and the initial seeds. After obtaining the initial seeds and the number of clusters from the hierarchical clustering results in Stage 1, we use the k-means clustering method to physically partition the document set, starting with these initial seeds. We take the clusters obtained from k-means as the initial categories, which are mounted under the root. The system outputs the hierarchical structure of categories from top to bottom, layer by layer.

2.5 Incremental clustering

People tend to get used to a categorical structure, so a large structural change may confuse their existing understanding of the categories. Therefore, we explore an incremental clustering approach that modifies categories within a small range of the hierarchical structure instead of re-clustering the whole document set. The incremental structure maintenance approach greatly reduces people's cognitive load [8]. Documents in each period are treated as incoming documents and processed sequentially, period by period. The proposed incremental clustering approach first pre-processes the incoming documents to transform unstructured text into vector space. Then the two-stage clustering technique is performed to cluster these incoming documents into emerging issues. Next, these clusters are incrementally added into the existing categorical structure. Finally, each category is named with distinguishing terms.

Issue identification. In the political domain, the main objects or concepts of interest in documents are generally actors (such as states, parties, and politicians) and issues (such as employment, peace, and healthcare) [1]. A legislative interpellation is formed when legislators question specific issues in the Legislative Yuan.
New categories are created when new interpellations are added but no similar category exists. This implies that the new category is clearly distinguishable from the existing categories. Hence, a group of related interpellations is defined as an issue, which may be incrementally added into the categorical structure. We use the two-stage clustering technique to group the interpellations that occurred in the same period into issues.

Categorical structure maintenance. The major steps of the categorical structure maintenance approach are depicted in Figure 3. When a new issue is added into the categorical structure, the first step is to represent the issue with the vector space model [11]. Three tests (the classification test, the category inter-similarity test, and the silhouette coefficient (SC) test) are used to examine the need for five operations: creating a new category, integrating an issue into a category, category decomposition, adding a peer category, and super-category re-clustering. Categories generated in the categorical structure initialization stage are represented by the vector space model, where the centroid vector of a category is the average of the vectors of the documents in that category. Note that only sub-categories are represented by the vector space model; a super-category is a virtual category, to which we do not add any incoming issues. An incoming issue is represented by a vector whose values denote the average occurrence of terms appearing in the documents of the same cluster.

In the classification test, we classify a new issue by comparing it with the existing category centroids. We calculate the similarity between an incoming issue and a category with the cosine coefficient. If the similarity value is greater than γ, we assume that the issue is suitable for an existing category, and it is assigned to the corresponding category in the next tests. Otherwise, if the similarity value is smaller than γ, which means that no existing category is similar to the issue, the issue is transformed into a new category and mounted under the root of the categorical tree.

For a new issue i, once the previous process has determined that category C is the most similar existing category, we need to decide whether this issue is suitable to be integrated into the category. First, we define the inter-similarity function. Each category has its own centroid vector. We use the average cosine similarity [11] between the centroid and each object in the same category to represent inter-similarity as follows:

\[ \text{inter-similarity}(C) = \frac{\sum_{i=1}^{\mathrm{size}(C)} \mathrm{cosine}(o_i, C_{centroid})}{\mathrm{size}(C)} \]

With the inter-similarity function, we obtain the original inter-similarity of category C and the new inter-similarity of category C with the issue added. If new inter-similarity(i, C) is greater than original inter-similarity(C), adding issue i into category C increases the cohesion of category C, which is the best case. Thus, we integrate issue i into category C. If new inter-similarity(i, C) is smaller than original inter-similarity(C), we test the decreasing inter-similarity rate to evaluate the integration of issue i and category C. The decreasing inter-similarity rate is defined as follows.

\[ \text{decreasing inter-similarity rate}(i, C) = \frac{\text{original inter-similarity}(C) - \text{new inter-similarity}(i, C)}{\text{original inter-similarity}(C)} \]

If the decreasing inter-similarity rate does not exceed the threshold α, which means that the decrease in inter-similarity is within a reasonable range, we integrate issue i into category C. If the decreasing inter-similarity rate is greater than the threshold α, we first examine the original inter-similarity(C) to check the cohesion of category C. We set an inter-similarity threshold β as the average inter-similarity of all categories in the categorical tree. If the original inter-similarity(C) is smaller than threshold β, which means that the cohesion of category C is relatively low within the categorical tree, we conduct category decomposition. If the decreasing inter-similarity rate is greater than the threshold α but the original inter-similarity(C) is greater than threshold β, we add the issue as a peer of category C.

If the inter-similarity(C) is smaller than threshold β, the two-stage clustering is applied to decompose the category after issue i is added into category C. Meanwhile, category C is transformed from a sub-category into a super-category. Note that the number of clusters is determined by the inconsistency coefficient. We need to decide where these new clusters (C_new) should be placed, so we treat each C_new as an incoming issue and apply a similar method iteratively, as follows.
First, we use the classification test to find the category C' most similar to the new cluster C_new, and examine the criteria mentioned above: original inter-similarity(C'), new inter-similarity(C_new, C'), and the decreasing inter-similarity rate. If new inter-similarity(C_new, C') is greater than original inter-similarity(C'), or the decreasing inter-similarity rate is below threshold α, C_new is added into category C'. Otherwise, the new cluster C_new is mounted under the original super-category C.

If the inter-similarity(C) is greater than threshold β, but adding new issue i into category C would lower inter-similarity(C) dramatically, we transform issue i into a new category and place it as a peer of category C. This means that the new category is related to category C, and they have the same parent. Categories in the same family share a certain degree of correlation. It is worth noting that child nodes under the root do not have this kind of correlation. Therefore, if the parent of category C is the root, we create a new super-category and place it at the original position of category C. Category C then becomes a child of this new super-category, and issue i is transformed into a new category and placed as another child of it. In this way, category C and issue i remain in the same family and share a certain degree of correlation.

After inserting a new issue, we test the SCs of the super-categories in which the new issue resides. Comparing a super-category's original SC with its new SC, if the SC's decreasing rate is greater than a given threshold θ, the number of clusters of that super-category is no longer accurate after the update. Therefore, we perform the re-clustering function to re-structure the categorical structure under the super-category. If the SC's decreasing rate is equal to or less than θ, the cluster structure of the super-category is still acceptable, and no further re-clustering is needed.

Naming categories. The main idea of labeling categories is to help users differentiate categories.
Maximum Term Weight Labeling (MTWL) [10] is based on the idea of tfidf and incorporates hierarchical information through the specialized weighting functions $idf_{global}$ and $idf_{local}$. MTWL can be written as

\[ \mathrm{MTWL}_k = idf_{global} \cdot idf_{local} \cdot tf_k \]

where $tf_k$ is the term frequency of term $k$. The global inverse document frequency for term $k$ is calculated as

\[ idf_{global} = \log\left(\frac{|D|}{\#(t_k, D)} + 1\right) \]

where $\#(t_k, D)$ denotes the number of documents in the collection containing term $k$, and $|D|$ is the number of all documents. Global weighting penalizes terms that are over-represented in the whole collection. However, terms over-represented only in a particular sub-category are still likely to be selected. Hence, the term distribution among peers has to be taken into account to avoid siblings getting similar labels. We adopt the local inverse document frequency, which depends on the term distribution over documents in the sub-category. $idf_{local}$ for term $k$ in cluster $c_j$ is calculated as

\[ idf_{local} = \log\left(\frac{|D_{c_p}|}{\#(t_k, D_{c_p})} + 1\right) \]

where $c_p$ is defined as the parent cluster of cluster $c_j$, $|D_{c_p}|$ is the number of documents in $c_p$, and $\#(t_k, D_{c_p})$ is the number of documents in $c_p$ that contain term $k$. Muhr, Kern and Granitzer [10] suggested that MTWL can be extended by a hierarchical labeling method in order to take the parent-child relationship into account. The results from [10] also show that MTWL extended by hierarchical labeling reaches stable accuracy at different levels. Therefore, we apply MTWL to extract representative terms from sub-categories. For super-categories, we apply MTWL extended by hierarchical labeling to take path length into account. For a category C, every term appearing in the category becomes a candidate term. The system assigns a score to each candidate term. Finally, we extract the three candidate terms with the highest scores to represent the category. We conduct the same process for every sub- and super-category in the system except the root.

3. System implementation and results

3.1 Data Source

We collected 12,743 legislative interpellation records generated during the 6th term of legislators, from February 1, 2005 to January 31, 2008, and distributed by the Parliamentary Library of Legislative Yuan. According to the session plan of the Legislative Yuan, two sessions are held each year; thus, we treat the six sessions in total as six periods, as shown in Figure 4. We extract the legislator, category, theme, keyword, purpose, and date from the documents. The results of incremental clustering are evaluated by human evaluators. More than ten thousand documents are too many for human evaluators to examine, so we take ten percent of the documents in each period as the sample data to evaluate the proposed system.

Figure 3. The procedure of categorical structure maintenance

3.2 System implementation

The proposed mechanism was implemented in Java and followed the process described in Section 2. In the pre-process stage, we first extracted three types of terms (theme, category, and keyword), and then conducted feature selection on the remaining terms. A term is removed from the term list if its part of speech (POS) is not a noun and its tfidf rank falls in the bottom 80%. We tried different tfidf thresholds in the experiments and obtained the best results by extracting the top 20% of terms. Thus, in the implementation, we set the tfidf threshold α to 0.2. This threshold yields a different number of terms in each period. In the two-stage clustering process, the optimal number of clusters must be determined before k-means clustering is applied. Taking hierarchical clustering in periods 1 and 2 as an example, the maximum inconsistency coefficient sets the optimal number of clusters at 65 and 136 in periods 1 and 2, respectively.

The categorical structure maintenance procedure has three parameters: the decreasing inter-similarity rate α, the inter-similarity threshold β, and the SC's decreasing rate θ. We set the decreasing inter-similarity rate α to 0.2 to ensure that whenever a new issue is added into category C, the new inter-similarity cannot drop by more than 20% of the original inter-similarity. The inter-similarity threshold β determines whether a category is decomposed. We set it to the average inter-similarity of all categories in the proposed system. The SC's decreasing rate θ is set to 0.3 because we do not want to conduct re-clustering unless it is necessary.
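To make the role of these three parameters concrete, the following is a minimal Python sketch of the inter-similarity test and the SC test, not the authors' Java implementation. The function names, the choice to add an issue to a category as its individual document vectors, and the reading of the SC's decreasing rate as a relative drop are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise average of the document vectors in a category."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def inter_similarity(vectors):
    """Average cosine similarity between the centroid and each object."""
    c = centroid(vectors)
    return sum(cosine(v, c) for v in vectors) / len(vectors)

def maintenance_action(category_docs, issue_docs, beta, alpha=0.2):
    """Decide what to do with an issue already matched to a category.

    Assumption: the issue joins the category as its document vectors.
    Returns "integrate", "decompose", or "add_peer".
    """
    original = inter_similarity(category_docs)
    new = inter_similarity(category_docs + issue_docs)
    if new >= original or original == 0.0:
        return "integrate"          # cohesion does not decrease
    rate = (original - new) / original
    if rate <= alpha:
        return "integrate"          # drop stays within the allowed range
    # Drop is too large: a weakly cohesive category is decomposed,
    # a cohesive one keeps its content and gains the issue as a peer.
    return "decompose" if original < beta else "add_peer"

def needs_reclustering(original_sc, new_sc, theta=0.3):
    """SC test (assumed relative decrease; expects a positive original SC)."""
    if original_sc <= 0:
        return new_sc < original_sc
    return (original_sc - new_sc) / original_sc > theta
```

In this sketch `beta` is passed in by the caller, since the paper computes it as the average inter-similarity over all categories in the tree rather than fixing it in advance.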


Figure 4. The number of interpellation documents in each session

3.3 Results

We applied two-stage clustering to obtain the initial basic categorical structure of interpellation documents in period 1, and named each category. The basic categorical structure contains 65 sub-categories, as shown in Figure 5. We then conducted incremental clustering with interpellations from periods 2 to 6. As the periods progress, the number of newly created categories increases, which means that the categorical structure can cover most of the incoming issues.

4. Evaluation Design and Results

4.1 Evaluation criteria

The performance of the incremental clustering method is evaluated by the degree of modification that experts make to the results generated by the proposed system. We borrow the idea of precision, which is widely used as a relevancy measure in information retrieval [11]. Accuracy measures the percentage of relevant documents in relation to the number of documents retrieved. In this study, we view a query as a category designation, and adopt the accuracy measure to evaluate how closely the incremental clustering method agrees with domain experts in assigning documents to corresponding categories. M denotes the set of documents allocated to a category by a domain expert, and A denotes the set of documents assigned to a category by the incremental clustering method. $N_A$ denotes the number of documents in A, and $N_{A \cap M}$ denotes the number of documents in both A and M. For each category, the accuracy is defined as below.

\[ \text{Accuracy} = \frac{N_{A \cap M}}{N_A} \]

The accuracy for a category denotes the percentage of documents assigned by the incremental clustering method that match what the domain experts assign. The accuracy for a hierarchical structure of categories is calculated by averaging the accuracy values of all categories of the modified results.

Figure 5. The number of categories in each period
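The accuracy measure of Section 4.1 can be sketched in a few lines of Python. This is an illustration only: the function names, the dictionary layout (category name mapped to document identifiers), and the sample identifiers are hypothetical.

```python
def category_accuracy(assigned, expert):
    """Accuracy of one category: |A intersect M| / |A|.

    assigned: documents the clustering method put in the category (A)
    expert:   documents a domain expert allocates to the category (M)
    """
    a, m = set(assigned), set(expert)
    if not a:
        return 0.0
    return len(a & m) / len(a)

def structure_accuracy(system, expert):
    """Average the per-category accuracy over all system categories."""
    scores = [category_accuracy(docs, expert.get(name, ()))
              for name, docs in system.items()]
    return sum(scores) / len(scores)

# Hypothetical example with two categories.
system = {"c1": ["d1", "d2", "d3", "d4"], "c2": ["d5", "d6"]}
expert = {"c1": ["d1", "d2", "d3"], "c2": ["d5", "d6", "d7"]}
print(structure_accuracy(system, expert))
```

Because the denominator is the number of documents the system assigned, a document that an expert adds to a category (such as `d7` above) does not lower that category's accuracy; only documents the expert removes do.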

4.2 Experimental Design

We designed a Web interface for subjects to modify the categorical structure on desktop computers, and then invited three evaluators with a political domain background to evaluate the generated categorical structure. The evaluators click a category on the left side of the screen and examine the interpellation documents on the right side that the proposed incremental clustering system assigned to this category. After reading the keywords and interpellations on the right side, the evaluators decide whether each document is suitable for this category. This procedure was followed for each of periods 1 to 6. The difference in the categorical structure before and after the domain experts' modification is analyzed to evaluate the performance of the proposed incremental clustering method.

4.3 Evaluation results and discussions

We found that the accuracy in these experiments is very high (85%~93%); however, the experts' judgments differ greatly from one another. Modifying the accuracy measure to reflect the consensus of the experts would bring more insight. The revised accuracy for each category is defined as below.

\[ \text{Revised Accuracy} = \frac{N_{U_1 \cap U_2 \cap \ldots \cap U_n}}{N_A} \]

where A denotes the set of documents in a category designated by the incremental clustering method; $U_i$ denotes the set of documents in a category designated by both the incremental clustering method and domain expert $i$; and $N_A$ denotes the number of documents in A. The revised accuracy denotes the percentage of the interpellation documents in a category assigned by the incremental clustering method that are also jointly designated by all n domain experts. The revised accuracy of the hierarchical structure of categories is calculated by averaging the revised accuracy values of all categories of the modified results.
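A small Python sketch may help show why the revised accuracy can only be equal to or lower than each expert's individual accuracy. The function name and the sample document identifiers are hypothetical, and each expert's judgment is assumed to be given as the set of documents that expert allocates to the category.

```python
def revised_accuracy(assigned, expert_sets):
    """Share of system-assigned documents that every expert also accepts.

    assigned:    documents the clustering method put in the category (A)
    expert_sets: one collection of documents per domain expert (M_i)
    U_i = A intersect M_i, so the numerator is |U_1 ∩ ... ∩ U_n|.
    """
    a = set(assigned)
    if not a:
        return 0.0
    agreed = set(a)
    for expert in expert_sets:
        agreed &= set(expert)   # keep only documents this expert accepts
    return len(agreed) / len(a)

# Hypothetical category with four assigned documents and three experts.
assigned = ["d1", "d2", "d3", "d4"]
experts = [["d1", "d2", "d3"], ["d1", "d2", "d4"], ["d1", "d2", "d3", "d4"]]
print(revised_accuracy(assigned, experts))
```

In this example the individual accuracies are 0.75, 0.75, and 1.0, while only `d1` and `d2` are accepted by all three experts, so disagreement between experts lowers the revised value even when each expert alone rates the category highly.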
From the evaluation results, the revised accuracy is lower than the original accuracy. This implies that individual domain experts have difficulty reaching consensus on categorization. We then conducted a second evaluation: we presented the interpellation documents on which the experts gave inconsistent answers and asked the experts to discuss them in order to obtain consistent categories for these interpellations. To compute this accuracy, denoted the second revised accuracy, we used the accuracy measure with M as the set of interpellations allocated to a category as a result of the three domain experts' negotiation. The accuracy of the categorical structure in each period is listed in Table 1. The result shows that the proposed method can cluster these interpellations with acceptable accuracy. The domain experts mentioned that some categories may contain multiple issues; even though we conducted category naming to help people identify these issues, they still found it difficult to decide the main issue in a category.

Table 1. Accuracy of categorical structure in each period
Period | Accuracy: Expert 1 | Accuracy: Expert 2 | Accuracy: Expert 3 | Accuracy: Average | Revised accuracy | Second revised accuracy

5. Conclusions and future works

This research proposes an incremental clustering method to construct a hierarchical structure of categories, which helps the public identify the latest issues in the Legislative Yuan and monitor legislators' performance. In the categorical structure initialization stage, we constructed a basic categorical structure. Then, in the incremental clustering stage, the system assigned each incoming interpellation document to a corresponding existing category or created a new category. In this way, the initial categorical structure is transformed into a hierarchical structure, and people can keep tracking legislators' performance through the interpellations allocated to the corresponding categories.

We summarize the contributions of this study as follows:
(1) This study has adopted the two-stage clustering approach iteratively to generate a hierarchical structure of categories.
(2) The incremental clustering has increased the work efficiency of clustering, in particular for an ever-increasing volume of data.
(3) This study has linked information retrieval and text mining techniques to streamline the transformation from interpellation documents to a hierarchical structure of categories.
(4) Transforming text into the form of statistics will effectively mitigate the problems caused by information overloading.

The adoption of the incremental clustering method for incoming documents can be further tested in the following circumstances:
(1) Multiple cases can be studied and implemented with different parameter settings to find the best systematic parameter setting.
(2) In this paper, we only used ten percent of the interpellation documents generated by the 6th term of legislators for testing. This may be insufficient to assess the performance of the system in the real world. Thus, we should take the complete data set to prove the effectiveness of the method in real-world applications.
(3) The results of the proposed system may help people view the issues that arise in the Legislative Yuan objectively if we provide a visualization of the results, which may help people understand them at a glance.
(4) With domain experts' feedback, the system should have the ability to learn from human judgment.

References

[1] W. V. Atteveldt, J. Kleinnijenhuis, N. Ruigrok, and S. Schlobach, "Good News or Bad News? Conducting sentiment analysis on Dutch text to distinguish between positive and negative relations." Journal of Information Technology & Politics, 5(1), 2008, pp.
[2] H. Berghel, "Cyberspace 2000: Dealing with information overload." Communications of the ACM, 40(2), 1997, pp.
[3] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis (4th ed.). Arnold, London.
[4] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley's Series in Probability and Statistics. John Wiley and Sons, New York.
[5] T. Korenius, J. Laurikkala, M. Juhola, and K. Jarvelin, "Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments." Information Retrieval, 9(1), 2006, pp.
[6] L. Ku, A study on the multilingual topic detection of news articles. Master Dissertation, Department of Computer Science and Information Engineering, National Taiwan University.
[7] Y. Liao, "The Research of Voter Turnout: Case Study in Taiwan." The Journal of Chinese Public Administration, (3), 2006, pp.
[8] F.-r. Lin and C.-m. Hsueh, "Knowledge map creation and maintenance for virtual communities of practice." Information Processing & Management, 42(2), 2006, pp.
[9] J. J. Lin, "The Study of Interpellation System of Legislative Yuan in R.O.C." Journal of TOKO, 1(1).
[10] M. Muhr, R. Kern, and M. Granitzer, "Analysis of structural relationships for hierarchical cluster labeling." In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp., ACM.
[11] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval." Information Processing & Management, 24(5), 1988, pp.
