Computing and Using Reputations for Internet Ratings


Mao Chen
Department of Computer Science, Princeton University, Princeton, NJ 08544
(609)-258-[...]
maoch@cs.princeton.edu

Jaswinder Pal Singh
Department of Computer Science, Princeton University, Princeton, NJ 08544
(609)-258-[...]
jps@cs.princeton.edu

ABSTRACT
Ratings for products and services are increasingly important on the Internet, as they allow users to harvest the wisdom of the community in making decisions. However, the difficulty with ratings is that little is known about the people providing them. Interpreting ratings well requires that the reputations of raters be factored into the scores computed for rated objects, even though these reputations are not explicitly available. Taking advantage of the insight that reputation can be computed implicitly from ratings, this paper addresses the reputation problem for raters and its application to evaluating rated objects. We develop a general method to automatically compute reputations for raters based on the ratings they and others give to objects, and incorporate these reputations to generate value-added information about rated objects. We evaluate our mechanisms by performing experiments on data from major rating sites, and show that they have the desired properties of a good reputation system. In the process, we analyze some key characteristics of different types of Internet ratings. To our knowledge, this is the first investigation into automatically computing raters' reputations and applying these reputations to better evaluate rated objects.

Keywords
Rater, rated object, rating, reputation hierarchy, score.

1. INTRODUCTION

1.1 Problem Statement
The Internet facilitates the circulation of opinions. The subjects span a wide range, from products and services to celebrities' personalities. Anyone can give comments (in this paper, a comment consists of a numerical rating and the associated text review) based on his/her expertise or purely on personal experience [], and comments allow users to harvest the combined wisdom of the community in making decisions about products and service-providers.
Users fall into two categories: raters, who are information producers, and readers, who are information consumers. A user may be both a rater and a reader. Two key issues arise in this scenario.

The first concerns the quality of comments provided by reviewers. The openness of the Internet makes comments very rich and valuable, but it also causes the reliability of these comments to be a key problem. This issue is especially crucial in E-Commerce, where users use others' opinions as guidance before making transactions. At the same time, good raters would welcome reputation differentiation as a reward for their efforts. The reliability of raters can also be used by the site designers to better organize a site []. Thus, determining reputations for raters is a common concern for all parties.

The insight behind our approach is that whether or not reputations are explicitly provided (which they rarely are, and almost never in a globally meaningful way), they can be inferred or computed from the comments that the raters give to various objects. The basic idea is to compute a reputation for every rater based on the quality and the quantity of all comments the rater gives (broken down by category where relevant). People's expertise usually varies across knowledge domains, so a rater's reputation is structured as a hierarchy in our method.

The second issue concerns the evaluation and presentation of overall ratings for objects (the overall rating is often referred to as reputation in the literature; however, to distinguish it from the rater's reputation, it is called score in this paper).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EC'01, October 14-17, 2001, Tampa, Florida, USA. Copyright 2001 ACM 1-58113-387-1/01/0010 $5.00.
A score is a summary of all numerical ratings given to an object. A simple average of raw ratings does not discriminate between the reliability of ratings, and thus is of limited usefulness. Taking advantage of the computed reputations of raters, we use reputation-weighted ratings as the overall scores for objects. To capture other factors affecting the reliability of a score, such as the number of ratings given to the object or the overall reputation of all the raters evaluating the object, we compute a separate parameter to indicate the confidence level of each score.

Our goal is to build a rating aggregation system that provides the following value-added information to users:
1. The reputations of raters (organized as a hierarchy of categories);
2. The scores of objects, taking reputations of raters into account;
3. The confidence level of scores.

The quantitative experiments in this paper are carried out with two goals. One is to analyze the characteristics of two major types of online ratings: bi-directional ratings for online auctions and uni-directional ratings for various objects. Another goal is to analyze our mechanisms from two aspects. First, we demonstrate that our reputation framework and methods indeed satisfy a set of desired properties. Second, our methods for computing reputation

are compared to a neutral ratings aggregation site called Epinions [7]. Raters with high computed reputation by our methods correspond well to the good raters chosen by site managers and users on Epinions. However, our methods provide important, more granular information and thus are more discriminating than human judgments can be.

1.2 Contributions
This paper has three primary research contributions:
1. A general framework for evaluating raters and objects in rating-based communities;
2. A method to compute quantitative confidence in raters' qualification and objects' scores;
3. An analysis of the characteristics of various types of online ratings.

This work may facilitate rating-aided E-Commerce in two aspects:
1. Value-added information about raters and rated objects guides consumers in choosing products and services.
2. The reputation scheme benefits site managers in better organizing site content.

1.3 Organization
The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 proposes the mechanism to build a reputation hierarchy for raters. Section 4 discusses score computation and presentation. Section 5 presents and analyzes experimental results from real-world data. Section 6 draws conclusions and discusses potential directions for future work.

2. RELATED WORK

2.1 Use of Reputations on Rating Sites
Rating aggregation sites can be classified by the degree of differentiation in information sources. In one type of site (e.g. CNET.com), no differentiation is made in the quality of reviews. Consequently, there is no help for readers to find high-quality reviews. The second type of site supports readers' voting on the values of reviews, and tries to highlight good reviews by readers' consensus (e.g., the old Deja.com). However, no reputation about raters is aggregated on this type of site. Examples of the third type are online auction sites such as ebay.com. Here, both raters and rated objects are people participating in auctions. The reputation of a user can be estimated from ratings given to that user.
The problem within this rating scheme is that this reputation is not a suitable metric for that user's qualification as a rater. For all three types, every individual rating is given the same importance in computing the overall scores.

The fourth type of site is the most relevant to our work. On these sites, raters are discriminated based on their qualification, judged somehow. In calculating scores, ratings carry different weights. To our knowledge, the only site in this class is Epinions. Its basic idea is to have its members identify good reviewers (called advisors on Epinions), in one of two ways. Users may explicitly indicate trusted and distrusted users; they can also rate text reviews, and thus affect the scores of those reviews. In this scheme, a candidate advisor should be widely trusted by other users and have written many reviews with high scores. This discrimination determines the relative importance of ratings to reviews, and in turn affects a product's score via the weight given to each review.

The significant differences between our approach and that of Epinions are listed in Table 1. In Epinions' method to evaluate raters, subjectivity is introduced in two ways. First, direct evaluations of users' trustworthiness should be based on an overall understanding of the whole database of ratings, but in practice a user is very likely limited in the basis of their judgments. Second, the evaluation of a comment is only based on its text part, where bias can easily be introduced. Our method computes reputations, scores, and confidences implicitly from the entire database, without applying direct judgments by people. To weight a rating, Epinions requires that ratings be given explicitly to the corresponding review; our methods do not. Our method thus provides more granular information in a more objective way.

2.2 Related Research on Ratings
An important application of online ratings is collaborative recommendation [6,9,,]. Products rated highly by users are recommended to others with similar tastes.
Zacharia and coworkers [] propose a reputation system in E-communities where people rate each other. That work focuses on how reputation evolves with time based on direct ratings between people. Our method distinguishes raters from the rated, and thus is applicable both in uni-directional rating schemes for products and in bi-directional rating communities for people. Dellarocas [3] proposes mechanisms to combat two types of cheating behavior. Its basic idea is to detect and filter out exceptions in certain scenarios. Our objective is to build reputations for all raters; of course, one possible application of reputations computed by our methods could be in detecting fake ratings from unreliable raters.

Table 1. Differences between our mechanism and that of Epinions

Framework of reputation
  Epinions: Binary division between advisors and other users
  Our methods: Quantitative reputation for each rater
Evaluation on comment
  Epinions: Evaluate its text review only
  Our methods: Evaluate both numeric and text portions, intrinsically
Users can directly evaluate other users?
  Epinions: Yes
  Our methods: No
Weight of a rating
  Epinions: Score of its textual review
  Our methods: Reputation of the rater
Need readers' explicit ratings on reviews?
  Epinions: Yes
  Our methods: No
Confidence of score
  Epinions: (no entry)
  Our methods: Presented separately

3. BUILDING REPUTATION FOR RATERS
We first discuss how we compute reputations for raters. In the next section, we will discuss how these reputations are used to compute scores of rated objects. Two properties are associated with a rater's reputation in our work:
(1) It is a dynamic random variable, reflecting uncertainty.
(2) It is classified based on expertise category.

The uncertainty of reputation is partly due to variability of expertise with time and partly because of updates in direct or indirect knowledge about raters over time. The second property reveals the bottom-up feature intrinsic in our method. Figure 1 illustrates a reputation tree for a rater based on the category classification of objects. Rectangles denote comments, and circles refer to categories. The categories on the lowest level are called leaf categories. The reputation information in the non-leaf categories is aggregated from information in their children.

[Figure 1. Reputation hierarchy for one rater. The tree for rater Joe shows the global reputation (GM, GC, Reputation), leaf categories such as Movie and Automobile (GM, GC, Reputation), and comments (LM, LC).]

The procedure to build a reputation tree consists of three stages. First, every comment is evaluated using information regarding that comment or other comments on the same object. The local match (LM) denotes the quality of one comment. The local confidence (LC) is the confidence in LM. Second, the rater's reputation in the leaf categories is calculated based on LM and LC for all comments by that rater in that category. The overall quality of all comments by that rater in a category is computed and referred to as global match (GM), and the confidence in GM is computed and referred to as global confidence (GC). A rater's reputation in a category is built by combining the rater's GM and GC for that category. Finally, information regarding reputation in all non-leaf categories is generated. The ranges of all variables mentioned above are [0,1].

3.1 Evaluating a Comment
For the first step described above, evaluating an individual comment can be split into two parts: evaluation of its numeric rating, and of its text review. The quality of a comment as a whole is based on these two evaluations. The formulae in this section can be easily extended to any rating scale, though a rating scale of 5 is assumed in this discussion.

3.1.1 Evaluating the numeric rating locally
The only information we use to evaluate an individual numeric rating is the frequency distribution of all ratings given to the same object. Our method is related to the method of weight propagation or transfer of endorsement that is used to rank pages in some search engines [,8]. First, raters are grouped according to the ratings they give to the object. Every rater is deemed to give the highest endorsement to other raters in that rater's group. The endorsement of raters to other groups is measured by the discrepancy between the rating levels of the groups. Then, the quality of a rating at the rating level i, denoted as $Q_i$, is the sum of endorsements from all groups/levels, weighted by the importance of each endorsement. This importance is the Q of the endorsing group. To make the computation converge, the relative group size is used as the a priori importance. $Q_i$ is computed by solving Equation 1.

[Equation 1: defines $Q_i$ as a sum over rating levels $j$ of the endorsements $E_{j,i}$, weighted by $Q_j$ and involving the group sizes $N_j$; the exact form is not recoverable from this copy.]

$E_{j,i}$: the endorsement of level j to level i
$E_{i,i}$: a constant, the same for any i
$N_i$: the total number of raters giving a rating of i

The total endorsement of one group to all groups including itself is 1. A constant value $E_{i,i}$ for any i guarantees that the sum of $Q_i$ is 1.

The endorsement of one group to another group could simply be a function of the difference of two ratings. However, the semantics of rating levels provide hints to define endorsements in a more meaningful way. For example, the discrepancy between 5 (excellent) and 3 (neutral) is intrinsically smaller than that between 4 (good) and 2 (bad), though the numerical gaps between the pairs are the same. The following endorsement matrices (before normalization) are examples of the two approaches discussed above.
E1 is simply a linear function of the difference of two rating levels, while E2 factors in the meaning of rating levels. For example, according to E1 the endorsement of a rating group to any other group that is two rating levels away is 0.6. In E2 the endorsement of the rating group at 5 to the group at 3 is [...], but that of 4 to 2 is 0 because they are fundamentally opposite opinions.

$$E_1 = \begin{pmatrix} 1 & 0.8 & 0.6 & 0.4 & 0.2 \\ 0.8 & 1 & 0.8 & 0.6 & 0.4 \\ 0.6 & 0.8 & 1 & 0.8 & 0.6 \\ 0.4 & 0.6 & 0.8 & 1 & 0.8 \\ 0.2 & 0.4 & 0.6 & 0.8 & 1 \end{pmatrix}$$
E1: Endorsements based only on difference of ratings

[Matrix E2: Endorsements taking into account the semantics of rating levels. The entries are damaged in this copy; the surviving fragments read ".8..7.3 ..3.3..3.7..8".]

In practice, rating scarcity is a common problem for rating aggregation systems. Measuring the confidence of a result is important but is often ignored. The confidence of Q is mainly determined by the sample size (the total number of ratings given to an object). Based on the discussion about the confidence of index numbers in [], a piecewise function as shown in Figure 2 is designed to compute a confidence level (C) for Q.

[Figure 2. Piecewise function for C. Horizontal axis: N, the total number of ratings given to an item; vertical axis: C.]
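The endorsement propagation of Section 3.1.1 can be illustrated with a small fixed-point iteration. This is only a sketch under stated assumptions, not the paper's Equation 1, which is not fully legible here: it assumes each level's quality is proportional to the sum, over endorsing levels j, of group size times current quality times endorsement, renormalized so the qualities sum to 1. It uses the linear endorsement matrix (1 minus 0.2 per level of difference) with each row normalized so that a group's total endorsement is 1. The function names are hypothetical.

```python
def linear_endorsement(levels=5, step=0.2):
    """Row-normalized endorsement matrix E[j][i] for level j endorsing level i.

    Before normalization the entry is 1 - step * |i - j| (the linear variant);
    each row is then scaled so one group's total endorsement is 1.
    """
    raw = [[1.0 - step * abs(i - j) for i in range(levels)] for j in range(levels)]
    return [[value / sum(row) for value in row] for row in raw]


def rating_quality(counts, endorsement, max_iterations=500, tolerance=1e-12):
    """Iterate an ASSUMED update rule for the quality Q of each rating level.

    counts[j] is the number of raters who gave rating level j + 1.
    Assumed rule: Q_i is proportional to sum_j counts[j] * Q_j * E[j][i].
    The relative group sizes serve as the starting (a priori) importance.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one rating is required")
    levels = len(counts)
    quality = [count / total for count in counts]
    for _ in range(max_iterations):
        updated = [
            sum(counts[j] * quality[j] * endorsement[j][i] for j in range(levels))
            for i in range(levels)
        ]
        norm = sum(updated)
        updated = [value / norm for value in updated]  # keep sum(Q) == 1
        change = max(abs(a - b) for a, b in zip(updated, quality))
        quality = updated
        if change < tolerance:
            break
    return quality
```

Under this assumed rule, a rating level backed by a large group of raters ends up with the highest quality, and nearby levels score higher than distant ones, which matches the majority-rule intuition described in the text.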

The confidence function C grows slowly when N is small; it grows faster over a middle range of N; after that C still grows, but gradually; and beyond an upper threshold, increasing N has no effect on C.

The method discussed in this section is based on the assumption that the quality of objects can be evaluated somehow by majority rule. For objects beyond this assumption, the information consumers can decide to what degree they accept the raw ratings and the value-added information generated from raw ratings.

3.1.2 Evaluating text reviews
For sites that support readers' voting, the quality of a text review can be estimated from readers' explicit ratings of that review. In this computation, every reader's rating is weighted based on the current reputation of the reader. (Thus, a user can be an information producer and an information consumer for the same object.) The quality of a text review and the corresponding confidence are denoted as QT and CT.

$$QT = \frac{\sum_{i=1}^{N} r_i R_i}{M \sum_{i=1}^{N} R_i}$$ (Equation 2)

$N$: the total number of ratings to the review
$r_i$: the ith rating on the review
$R_i$: the reputation of the user who gives $r_i$
$M$: the rating scale used on text reviews (a fixed value on Epinions)

If $\sum_{i=1}^{N} R_i < C$ then $CT = \frac{\sum_{i=1}^{N} R_i}{C}$, else $CT = 1$ (Equation 3)

$R_i$: the reputation of the ith rater
$C$ (in Equation 3 only): the threshold for the number of reviewers with reputation 1 needed to gain full confidence in the overall rating on a text review. A fixed value is used in the experiment section.

3.1.3 Evaluating a comment overall
Equation 4 shows how we put the computed evaluations of the two parts of a comment together, where Q and C are the quality and confidence of the numeric rating. The resulting local confidence is based on discussions on the standard error of arithmetic mean []. Note that the evaluation of either portion of a comment can be used to estimate the overall quality of the comment if the other portion is unavailable.

$$LM = \frac{Q \cdot C + QT \cdot CT}{C + CT}, \qquad LC = \frac{C \cdot C + CT \cdot CT}{C + CT}$$ (Equation 4)

3.2 Build the Reputation Tree
As mentioned earlier, a given rater's reputation is organized as a tree based on the category structure of objects.
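The comment-level formulas of Section 3.1 can be sketched in a few lines of Python. The sketch assumes one reading of the partly garbled Equations 2 to 4: QT is the reputation-weighted mean reader rating divided by the scale M, CT grows linearly with total reader reputation up to a threshold and is then capped at 1, and LM and LC are confidence-weighted combinations of the numeric and text evaluations. Function and parameter names are hypothetical, and the handling of empty inputs is an added convention.

```python
def text_review_quality(ratings, reputations, scale):
    """Equation 2 (as read here): sum(r_i * R_i) / (M * sum(R_i))."""
    weight = sum(reputations)
    if weight == 0:
        return 0.0  # convention chosen for this sketch: no informed readers
    weighted = sum(r * rep for r, rep in zip(ratings, reputations))
    return weighted / (scale * weight)


def text_review_confidence(reputations, threshold):
    """Equation 3 (as read here): sum(R_i) / threshold, capped at 1."""
    return min(1.0, sum(reputations) / threshold)


def local_match(q, c, qt, ct):
    """Equation 4 (as read here): returns the pair (LM, LC).

    q, c   -- quality and confidence of the numeric rating
    qt, ct -- quality and confidence of the text review
    If one portion is unavailable (confidence 0), the other portion's
    quality and confidence are returned unchanged.
    """
    total = c + ct
    if total == 0:
        return 0.0, 0.0
    lm = (q * c + qt * ct) / total
    lc = (c * c + ct * ct) / total
    return lm, lc
```

For instance, calling `local_match(q, c, 0.0, 0.0)` reduces to `(q, c)`, which is the fallback behavior the text describes when a comment has no rated text review.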
Each category node in the tree is associated with three values: GM is the estimate of the rater's rating ability for that category; GC measures the confidence in GM; Reputation is a quantitative summary of the rater's qualification in that category based on GM and GC. GM and GC for a category are determined from the matches and the corresponding confidences of its children in a similar way to LM and LC.

$$GM = \frac{\sum_{i=1}^{N} M_i C_i}{\sum_{i=1}^{N} C_i}, \qquad GC = \frac{\sum_{i=1}^{N} C_i C_i}{\sum_{i=1}^{N} C_i}$$ (Equation 5)

$N$: the total number of items rated by the rater under the category
$M_i$: the match of the rater in the ith subcategory or item
$C_i$: the confidence in $M_i$

There are multiple options for computing the reputation of a rater in a category from GM and GC for that category. The basic requirements of a reputation function include:
1.) It increases with GM or GC when the other variable is fixed.
2.) It is 0 if either GM or GC is 0, which implies that either all comments given by the rater are very bad, or the relevant information is insufficient to generate any reputation.
3.) It is 1 only when both GM and GC are 1.
4.) It is exactly the same as GM if GC is 1.

Some other properties are also desirable:
1.) The bigger GC, the faster reputation grows with GM.
2.) The bigger GM, the faster reputation grows with GC.
3.) GM is dominant when GC is large but GM is small.
4.) GC is dominant when GM is large but GC is small.

A formula satisfying all of the above basic and supplementary requirements is given by Equation 6.

$$Reputation = GC \times GM$$ (Equation 6)

3.3 Deployments of Mathematical Modules
The modularity of our mathematical model allows modules to be loaded and combined in a flexible way to build reputation. For example, either part of a comment (numerical or textual) or both parts can be used in comment evaluation as desired; for reputation organization, choices can be made between a flat structure and a category-based hierarchical structure. In general, different deployments make different trade-offs between the computational complexity and the quality of results. The resulting four mechanisms to build a rater's reputation are discussed in this paper.
An experimental comparison of these models will be shown in Section 5.2.

Table 2. Four deployments of mathematical modules

                               Organization of reputation
Evaluate text reviews too?     Flat          Tree
No                             FLAT          TREE
Yes                            FLAT&TEXT     TREE&TEXT
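The bottom-up construction of a reputation tree described in Section 3.2 can be sketched as a short recursive pass. The sketch assumes the reading of Equations 5 and 6 in which GM is the confidence-weighted mean of the children's matches, GC is the confidence-weighted mean of the children's confidences, and reputation is the product of GC and GM. The `Category` class and the function names are hypothetical; the category names are taken from Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a rater's reputation tree (hypothetical structure)."""
    name: str
    comments: list = field(default_factory=list)  # (LM, LC) pairs, leaf categories only
    children: list = field(default_factory=list)  # sub-categories, non-leaf only


def combine(pairs):
    """Equation 5 (as read here) over (match, confidence) pairs: returns (GM, GC)."""
    total = sum(conf for _, conf in pairs)
    if total == 0:
        return 0.0, 0.0  # no usable information under this node
    gm = sum(match * conf for match, conf in pairs) / total
    gc = sum(conf * conf for _, conf in pairs) / total
    return gm, gc


def evaluate(node):
    """Return (GM, GC, reputation) for a node, aggregating from its children."""
    if node.children:
        pairs = [evaluate(child)[:2] for child in node.children]
    else:
        pairs = node.comments
    gm, gc = combine(pairs)
    return gm, gc, gc * gm  # Equation 6 as read here: reputation = GC * GM


# A flat deployment is the special case of a single node holding every comment.
joe = Category("Global", children=[
    Category("Movie", comments=[(0.8, 1.0), (0.4, 1.0)]),
    Category("Automobile", comments=[(1.0, 0.5)]),
])
global_match, global_confidence, global_reputation = evaluate(joe)
```

The match and confidence values in the usage lines are made up for illustration. Choosing between the FLAT and TREE deployments then amounts to choosing whether comments are attached to one node or distributed over leaf categories.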

4. SCORING OBJECTS

4.1 Absolute Score and Confidence of Score
For information consumers, the central concern is the evaluation of an object, denoted as its score. The score of an object is calculated as the average of all its ratings, with the weight of a rating being the rater's reputation. This is much more meaningful than a simple average of raw ratings. If only a few, not very reputable people have rated an object, one cannot be very confident in the resulting score. Inspired by the statistical method to estimate the standard error of arithmetic mean, the confidence of a score is determined by at least three factors as follows:
1.) Total number of ratings given to the object
2.) Reputations of raters who give these ratings
3.) Degree of consistency in ratings

Equation 7 is the formula we used to calculate the confidence of a score, denoted as $C_S$:

[Equation 7: a piecewise (if/else) formula for $C_S$ in terms of $N$, $r_j$, $R_j$, and $S$; the exact form is not recoverable from this copy.]

$N$: the number of ratings given to the rated object
$r_j$: the jth rating to the item
$R_j$: the reputation of the rater who gives $r_j$
$S$: the overall score of the item

4.2 Use of Reputation Hierarchy
When computing a score for an object, the weight of a rating may be chosen to be the rater's reputation for any category on the path from the root of that rater's reputation tree to the leaf category including the object under evaluation. Hence, every object can be given a set of scores based on raters' reputations at different category levels. It is desirable to apply reputations for a coarser-grained category when there is a lack of information about the raters at finer granularity.

5. EXPERIMENTAL RESULTS
Our experiments are divided into two sections. The first section is a set of observations on raw online ratings. The second section is an analysis of our methods to build reputations and scores.

5.1 Observations About Raw Online Ratings
There are several ways to classify online ratings. Ratings can be distinguished as uni-directional or bi-directional ratings, or be classified as ratings on people or ratings on products.
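Returning briefly to the score defined in Section 4.1, the reputation-weighted average and the category fallback of Section 4.2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the fallback rule (use the finest category level at which a reputation is known) is one simple way to realize the preference for coarser-grained reputations when finer ones are missing, and the confidence of Equation 7 is not modeled because its form is not legible here.

```python
def weighted_score(ratings, reputations):
    """Average of the ratings, each weighted by its rater's reputation.

    Returns None when no rater carries any reputation, since a weighted
    average is undefined in that case.
    """
    weight = sum(reputations)
    if weight == 0:
        return None
    return sum(r * rep for r, rep in zip(ratings, reputations)) / weight


def reputation_on_path(path):
    """Pick a rater's weight from reputations listed root-to-leaf.

    Entries are None where no reputation could be computed. The finest
    available level is used; 0.0 is returned if nothing is known at all.
    """
    for reputation in reversed(path):
        if reputation is not None:
            return reputation
    return 0.0


# One reputable rater gives 5, one barely reputable rater gives 1.
ratings = [5, 1]
weights = [
    reputation_on_path([0.6, 0.9]),   # known at the leaf category
    reputation_on_path([0.1, None]),  # falls back to the coarser category
]
score = weighted_score(ratings, weights)
```

Compared with the plain mean of the two ratings, the weighted score stays close to the rating given by the more reputable rater.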
For example, ratings on online auctions are usually bi-directional and the rated objects are people involved in transactions. Many shopping sites or neutral rating sites like Epinions collect uni-directional ratings for a wide range of objects.

Two data sets are used in this section. The first includes bi-directional ratings of people, collected from three auction sites: eBay, Amazon, and Yahoo. The second data set includes uni-directional ratings of different objects on Epinions. All the data were collected in the spring and summer of [...].

Table 3. Data sets
Columns: Epinions (Ep.) and the auction sites Amazon (Am.), Yahoo (Ya.), eBay (Eb.)
Row: # of objects with more than [...] ratings
(The cell values are damaged in this copy and read: 6,377 9,93,8,97)

5.1.1 Distribution of mean ratings
Figure 3 shows the distribution of the mean ratings for objects. Most users on auction sites, either sellers or buyers, have mean ratings at the highest rating level. Compared to auction ratings, the distribution of mean ratings on Epinions spreads over a wider range of the rating scale.

[Figure 3. Distribution of means of ratings for Amazon, Yahoo, eBay, and Epinions. Horizontal axis: mean rating; vertical axis: % rated objects. (eBay uses 3 rating scales, so the ratings from eBay are normalized to scales of 5 in the experiments of this paper.)]

Table 4 gives more details about the distributions in Figure 3.

Table 4. Summary on the distributions in Figure 3
Columns: auction sites Amazon (Am.), Yahoo (Ya.), eBay (Eb.), and Epinions (Ep.); entries are % of objects
(The thresholds and several cell values are damaged in this copy; surviving digits are shown as extracted.)
  Mean of ratings is above [...]: 98 9 99.36
  Mean of ratings is below [...]: 3..6.3

5.1.2 Correlation of ratings given by the two parties in bi-directional rating
There are several possible reasons why many people have high scores on auction sites:
1. People do well in online transactions.
2. People tend to give high ratings as long as the other party pays or sends the goods.
3. Some sellers rig the system to bolster their ratings.
4. Bi-directional rating schemes lead users to give high ratings to others in the hope of getting high ratings in return.

The first two reasons are beyond the scope of this paper. The cheating behavior in the third case is not enough to explain the ubiquity of high scores on auction sites. In this paper we focus on the fourth possibility.

Based on user IDs, ,6,368 transactions were extracted from ratings given to 9,93 users on Yahoo (both counts are damaged in this copy). The ratings given by the seller and buyer in a single transaction are paired together. Table 5 presents the summary of raw results.

Table 5. Correlation of ratings given by seller and buyer in a single transaction on Yahoo
(Entries are % of transactions; the percentages and rating thresholds are damaged in this copy, and surviving digits are shown as extracted.)
  The ratings that the two parties give to each other are
    exactly the same: 9
    both high ratings: 9
  When one party gives a low rating, the other party gives
    a low rating: 6
    a high rating: 9
    a neutral rating (=3): [...]

Ignoring the differences between 4 (good) and 5 (excellent), and 2 (bad) and 1 (very bad), the Pearson's correlation coefficient between ratings from the two parties is about .7, which shows a high positive correlation. Based on observations in Section 5.1.1, we believe this positive correlation is stronger on Amazon and eBay. These results support our hypothesis that a bi-directional rating scheme leads to high ratings to people in online auctions.

5.1.3 The relationship between type of rated object and distribution of ratings
For uni-directional ratings on Epinions, the rated objects span a wide range of products, services, and other topics such as travel destinations or celebrities. To investigate the relation between types of objects and ratings, the objects on Epinions are classified according to their nature in terms of the difficulty of being evaluated objectively. We defined three classes for objects:
Objective: the quality can be determined by fairly objective standards (e.g. electronic products);
Subjective: the evaluation is strongly based on personal taste (e.g. travel destinations);
Medium: the evaluation is based on a mixture of objective standards and personal expectations (e.g. services).

Table 6. Sample size of three classes for objects on Epinions
Columns: Objective (Obj.), Medium (Med.), Subjective (Sub.)
Row: # Objects with more than [...] ratings
(The cell values are damaged in this copy and read: 96 98 8)

The three classes were first compared by the experiments discussed in Sections 5.1.1 and 5.1.2. They show similar properties to the whole data set before classification. They are then compared in the shape of rating profiles for each single object.
We classified all possible rating profiles into groups, as described in Table 7, and counted the relative occurrence of each shape for each class. Some interesting observations are obtained from Table 7:
1) The distributions of possible rating profiles are similar for all three classes.
2) The possible rating profile for a single object covers all types of shapes.
3) Monotonically increasing and U-shaped distributions are the dominating cases; bell shapes are as popular as the uniform distribution.

Table 7. The percentages of occurrences of rating profiles
Profile | Obj.(%) Med.(%) Sub.(%)
Uniform Distribution | .3..3
Monotonically Decreasing | .7.8.
Monotonically Increasing | 33. 3. 39.
Bi-modal (U shaped) | .8.7.
Bell-shaped | 7.6.. 9. 9...9..
(W shaped) | ...
(M shaped) | 3. 3.6.9
A Single Rating Group | .3.6.8

Discussion

The results in the preceding section have two implications for designing our rating aggregation methods:
1) It is difficult to perform a meaningful or non-trivial evaluation of raters and rated objects from bi-directional ratings in online auctions, because of the ubiquity of high ratings. We therefore do not use the auction ratings in the next section.
2) A general method can be applied to evaluate raters and rated objects for uni-directional ratings, regardless of the type of rated object in terms of the expected subjectivity of ratings.

5. Experiments on Reputations

The kernel of our rating aggregation system involves building reputations for raters. The data used in this section are the ratings from Epinions. The structure of the reputation tree for a rater is based on the classification of categories on Epinions. The depth of a reputation tree is and all leaf categories are at the same level.

Table 8. Data set for experiments on reputations
Columns: # Categories at level, # Leaf categories, # Objects, # Raters
Values: ,7 9,986 8

As a better metric for comparing raters' qualifications, reputation rank is used instead of absolute reputation value in the experiments. Our rank generation approach is based on the Square-Error clustering method. The total number of rank clusters is.
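To make the rank-generation step concrete, the following is a minimal sketch of one common square-error clustering procedure (one-dimensional k-means) that turns reputation values into ordered ranks. It is an assumption that this resembles the method used here; the function name `rank_by_clustering`, the initialization rule, and the sample values are hypothetical, and the number of clusters `k` is left as a parameter.

```python
def rank_by_clustering(values, k, max_iter=100):
    """Assign each reputation value a rank in 1..k (1 = lowest).

    Sketch of 1-D k-means, which iteratively reduces the total squared
    error between values and their cluster centers.
    """
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")

    # Initial centers: k evenly spaced distinct values (hypothetical choice).
    m = len(distinct)
    centers = [distinct[(2 * i + 1) * m // (2 * k)] for i in range(k)]

    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated

    # Order clusters by center so that a larger reputation means a higher rank.
    order = sorted(range(k), key=lambda j: centers[j])
    rank_of = {j: r + 1 for r, j in enumerate(order)}
    return [rank_of[min(range(k), key=lambda j: abs(v - centers[j]))]
            for v in values]


# Hypothetical reputation values forming three visible groups.
ranks = rank_by_clustering([0.1, 0.12, 0.5, 0.52, 0.9, 0.95], k=3)
```

Using ranks rather than raw values makes raters comparable across categories whose reputation values are distributed differently, which is the motivation given above.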
In a rater's reputation tree, each reputation value is associated with a reputation rank in the corresponding category.

While there is no gold-standard metric to compare with for evaluating raters' reputations, we tested our methods along two dimensions. First, the raters' reputations as computed by our system are evaluated and discussed against some desirable properties of reputations. Second, for raters who are (independently) distinguished as being trusted or advisors on Epinions, the reputations computed by our mechanisms are presented and analyzed.

5.1 Properties of reputations

Some important, desired properties of a system for computing raters' reputations include:

1.) Reputation depends on both the quality and the quantity of all ratings given by a rater.
2.) Rating experience should not directly determine the reputation rank of a rater, though reputation generally grows with rating experience for an individual rater.
3.) Good raters are more consistent in their ratings for the same object than bad raters.
4.) Good raters have a greater ability to discriminate among objects in the same category.

Criterion 1 is built into our method for computing a rater's reputation. The experiments in this section test our system against the other three properties. The model used in this section is TREE (see Table ). Raters in each leaf category are classified into three groups: Good Raters, whose reputation rank is not lower than 7, Bad Raters (below ), and Others (between and 6, inclusive).

(1) Rater reputation as experience increases

The goal of this experiment is to investigate the evolution of a rater's reputation as their rating experience increases. The categories tested are the leaf categories, and a rater's rating experience in a category is measured as the number of ratings the rater gives in that category. In every leaf category, the reputation of each rater is recalculated and plotted on a curve after the rater gives new ratings in that category. Only raters who give at least ratings in the category are included in this experiment. The curves are aggregated into the three groups according to the raters' final reputation rank for that category. In Figure 4, the y-axis is the average reputation for raters in the same group at each time point.

[Figure 4. Temporal trend of rater's reputation. x-axis: Time, in fixed intervals of ratings; y-axis: Reputation; series: Good raters, Bad raters, Others.]

In general, a rater's reputation increases with rating experience. However, the three groups show different properties in how reputation evolves. There are two stages when the reputation of Good Raters increases at the highest speed: the warm-up stage (while giving the first 7 ratings) and the maturing stage (after giving ratings). For Bad Raters, the reputation grows in a step-wise manner.
Others gain their medium reputation ranks mainly from the reputation accumulated in the warm-up stage. For Bad Raters and Others, a low starting point and a low improving speed, respectively, cause the lower reputation compared with Good Raters.

The next experiment shows that reputation cannot be determined from rating experience directly. First, Table 9 shows the lack of functional dependency between the experience and the class a rater belongs to. The lack of correlation within a class is quite clear in Figure 5.

Table 9. Portion of raters with a certain rating experience belonging to a certain class
# Ratings | [,] [,] [,] [,]
Good Raters | .38.9.6.33
Others | ....33
Bad Raters | ..6.33

[Figure 5. Reputation rank vs. rating experience. x-axis: Rater's Experience (# ratings given by a rater); y-axis: Reputation Rank; series: Good raters, Others, Bad raters.]

(2) Consistency of Good Raters and Bad Raters

In this experiment, Good Raters and Bad Raters are compared in the consistency of the ratings that members of each group give to the same object. Only items rated by at least Good Raters and Bad Raters are sampled, and those raters sampled must have given more than ratings in the corresponding leaf category.

[Figure 6. Consistency of Good and Bad Raters. x-axis: Standard deviation of ratings by Good Raters; y-axis: Standard deviation of ratings by Bad Raters.]

In Figure 6, a point represents one object. The x-axis and y-axis give the standard deviation of ratings given by Good Raters and Bad Raters, respectively. For more than 8% of the 6 objects, the standard deviation of Good Raters' ratings is much smaller than that of Bad Raters' ratings. The average standard deviation of Good Raters' ratings is .83, while this value is .3 for Bad Raters. So Good Raters as computed by our methods are indeed more consistent in their ratings than Bad Raters.

(3) Ability to discriminate among items

Intuitively, good raters should be more discriminating, and are therefore more valuable as raters. We define the entropy for a category as the standard deviation of scores for all objects in that category.
The change in entropy after applying reputations to computing scores is investigated. Only categories including more than 3 objects are tested (all categories on the first level and 9 of 8 leaf categories).
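The entropy comparison can be sketched as follows. The definition of entropy as the standard deviation of object scores comes from the text; the reputation-weighted mean used for scoring, the use of the population standard deviation, and all names and sample data below are assumptions made only for illustration.

```python
from statistics import pstdev


def object_score(ratings, reputations=None):
    """Score of one object from a list of (rater_id, rating) pairs.

    Without reputations: plain mean. With reputations: mean weighted by each
    rater's reputation (an assumed weighting scheme).
    """
    if not ratings:
        raise ValueError("an object needs at least one rating")
    if reputations is None:
        return sum(value for _, value in ratings) / len(ratings)
    weights = [reputations[rater] for rater, _ in ratings]
    total = sum(weights)
    if total == 0:
        # No rater carries any weight: fall back to the plain mean.
        return sum(value for _, value in ratings) / len(ratings)
    return sum(w * value for w, (_, value) in zip(weights, ratings)) / total


def category_entropy(category, reputations=None):
    """Entropy of a category: standard deviation of its objects' scores.

    category maps object_id -> list of (rater_id, rating).
    """
    scores = [object_score(r, reputations) for r in category.values()]
    return pstdev(scores)


def relative_entropy_change(category, reputations):
    """Relative change in entropy after weighting ratings by reputation."""
    before = category_entropy(category)
    if before == 0:
        raise ValueError("relative change is undefined when all scores are equal")
    return (category_entropy(category, reputations) - before) / before


# Hypothetical category: rater "g" separates the two items, rater "b" does not.
category = {"item_a": [("g", 5), ("b", 3)], "item_b": [("g", 1), ("b", 3)]}
reputations = {"g": 0.9, "b": 0.1}
change = relative_entropy_change(category, reputations)
```

In this toy case the scores move from 4 and 2 to 4.8 and 1.2 once the discriminating rater is given more weight, so the entropy rises, which is the direction of change the experiment looks for.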

[Figure 7. Change in score's entropy after incorporating raters' reputations in computing scores. x-axis: % Relative Change in Entropy After Using Reputation To Compute Score; y-axis: % Categories; series: first-level categories, leaf-level categories.]

Figure 7 shows that the score's entropy in all sampled categories increases after using reputations in computing scores. The change results from giving more weight to ratings from good raters; hence good raters are, in general, more discriminating.

5.2 Comparison with reputations on Epinions

On Epinions, two special communities are qualitatively distinguished from other members. The first is called Most Trusted Users, who are widely trusted by other users based on the web of trust. The second kind of community is advisors (described earlier in the paper), defined for each category. To compare our reputation mechanism to that of Epinions, we simply isolate the computed reputation ranks of the users whom Epinions separates out into the two special communities.

(1) Most Trusted Users (MTU)

On Epinions, MTU is a global (category-independent) group. To match Epinions in our comparison, the reputation ranks in this experiment are global values at the root of the reputation tree in hierarchical models. Figure 8 presents the distributions of reputation ranks for MTU, computed with our four models (see Table ).

[Figure 8. Distribution of reputation rank for MTU. x-axis: Reputation Rank (low to high); y-axis: % MTU; series: TREE, FLAT, TREE&TEXT, FLAT&TEXT.]

Observations on Figure 8 are:
1.) Reputation ranks of MTU built with any of our models are located at the high end.
2.) Interestingly, analyzing the quality of text reviews moves the reputation ranks of MTU much closer to the high end.
3.) The reputation ranks of MTU computed using the flat structure are closer to the high end than those using the tree structure.

The second observation shows that people rely heavily on the quality of text reviews to decide the trustworthiness of a reviewer. The last observation indicates that people tend to treat all data as being at the same level when making overall subjective judgments.
This is natural when people need to deal with many data sources in a complicated information network. The hierarchical models presented in this paper are able to capture the information structure intrinsically.

The next experiment highlights another limitation of the MTU scheme due to human subjectivity and the inability to use the entire database of knowledge: many of the MTU are people who give a lot of ratings, but sometimes of poor quality. MTU are ranked (compared to all users) separately by their GM and GC (see section 3) in the root of the reputation tree. Figure 9 shows the results separately for MTU who qualify as Good Raters by our reputation calculations and those who do not, using the TREE model.

[Figure 9. Distribution of GM and GC of MTU. x-axis: Rank of GC; y-axis: Rank of GM; series: Good Raters (reputation rank >= 7), Non-good-raters (reputation rank < 7).]

Each point in Figure 9 represents the MTU with a particular pair of GM and GC ranks. As we can see, the subjectively determined MTU as a whole have relatively high GC due to their frequent rating activity on hot objects, but their ranks according to GM are sometimes near or below the middle rank level. Our method, on the other hand, gives lower credence to those popular raters whose rating quality is not high enough at a given confidence level.

(2) Category Advisors

Figure 10 shows the distribution of reputation ranks for Advisors as determined by Epinions. Using TREE, more than 6% of Advisors fall into the category of Good Raters according to our rating system. However, this proportion is 96% when the quality of text reviews is included. This result is not surprising, because Epinions relies on text reviews to evaluate a rater's qualification.

[Figure 10. Reputation ranks of Advisors. x-axis: Reputation Rank; y-axis: % Epinions' Advisors. TREE: 6% of "Advisors" have ranks not lower than 7; TREE&TEXT: 96% of "Advisors" have ranks not lower than 7.]

6. CONCLUSION

We have addressed the problem of building hierarchical reputations for raters based on their rating history and that of the entire rating community. Reputations help guide users toward high-quality opinions, and they also provide a standard for weighing ratings in computing overall scores for rated objects, leading to more reliable scores. Compared to current techniques used by rating aggregation sites, our automated methods are shown to be more objective, precise and discriminating.

The experimental results show that the reputation framework built by our statistical methods has the desirable properties with regard to temporal trend, correlation between rating experience and reputation rank, greater consistency among good raters than bad raters, and the ability of reputations (and high-reputation raters) to differentiate items more finely. In addition, special communities of raters defined explicitly on Epinions basically match the results of our automatic statistical methods.

Some interesting patterns in how humans judge trustworthiness are also discovered:
(1) People use text reviews substantially in evaluations.
(2) People tend to view different information sources as being at the same level when making an overall judgment.
(3) People tend to trust active raters even if their rating quality is not high.

In addition to aiding consumers, reputations also have some social implications:

(1) Information explosion vs. rating scarcity. The publication of reputations encourages a large amount of high-quality content. It provides a way to deal with both information explosion and rating scarcity.

(2) Privacy vs. reliability. Some people do not want others to know their opinions on certain objects. This can be handled by hiding their identities and pegging them anonymously by reputation instead. Without a reputation scheme, anonymity often degrades the reliability of people's opinions. Thus, raters' reputations help protect privacy without loss in reliability.

(3) Robustness of rating aggregation methods. Boosting a score with fake ratings is a typical cheating behavior in rating-based reputation schemes.
With our reputation framework, intruders need to gain sufficient reputation to make their attack work. They will need to attack the rating profiles of other items besides the product they are actually interested in, and understand how to gain high reputations based on our methods. As new ratings come in, the rating profiles change, so maintaining a reputation is a dynamic and ongoing affair.

Future work will be along several directions:

(1) Refining the methods to compute reputation by accounting for raters' temporal behavior. For example, let the importance of a rater's old ratings decay with time.

(2) Aggregating ratings from multiple sites. There are two directions in which to implement this aggregation. The first is to merge ratings given to the same item on multiple sites, and thus improve the confidence in value-added information generated from raw ratings. Our current system can be easily adjusted to this task. Another direction is to aggregate the reputation of one rater across multiple sites. Technically, only one layer needs to be inserted in the current reputation hierarchy. But in practice, the uniform identification of users required by the second aggregation is not generally supported on the Internet at present.

(3) Designing infrastructure to build and deploy a rating system.

7. ACKNOWLEDGMENTS

Thanks to Andrea LaPaugh, Dongming Jang and Stefanos Damanak for constructive discussions and valuable support in experiment utilities.

8. REFERENCES

[1] Arkin, H. and Colton, R. R. Statistical Methods, th edition, Barnes & Noble, Inc., 97.
[2] Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998).
[3] Dellarocas, C. Immunizing Online Reputation Reporting Systems Against Unfair Ratings and Discriminatory Behavior. Proceedings of EC'00: The 2nd ACM Conference on Electronic Commerce, 2000.
[4] Guernsey, L. Suddenly, Everybody's an Expert on Everything. The New York Times, Feb. 3,.
[5] Hafner, K. Web Sites Begin to Self Organize. The New York Times, Jan. 8,.
[6] Hill, W., Stead, L., Rosenstein, M. and Furnas, G.
Recommending and evaluating choices in a virtual community of use. Proceedings of the 1995 ACM Conference on Human Factors in Computing Systems (1995). ACM, New York, pp. 194-201.
[7] http://www.epnons.com
[8] Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (1999).
[9] Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., and Riedl, J. GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, Vol. 40, No. 3, 77-87, March 1997.
[10] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of CSCW 94, pp. 175-186, Chapel Hill, NC.
[11] Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. PHOAKS: A System for Sharing Recommendations. Communications of the ACM, Vol. 40, No. 3, 59-62, March 1997.
[12] Zacharia, G., Moukas, A. and Maes, P. Collaborative Reputation Mechanisms in Electronic Marketplaces. Proceedings of the 32nd Hawaii International Conference on System Science, January 5-8, 1999.