A Unified Approach to Ranking in Probabilistic Databases. Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA

Similar documents
NIH Public Access Author Manuscript Conf Proc IEEE Eng Med Biol Soc. Author manuscript; available in PMC 2013 February 01.

Probabilistic Graphical Models: Applications in Biomedicine

Leman Akoglu Hanghang Tong Jilles Vreeken. Polo Chau Nikolaj Tatti Christos Faloutsos SIAM International Conference on Data Mining (SDM 2013)

Continuous/Discrete Non Parametric Bayesian Belief Nets with UNICORN and UNINET

in Engineering Prof. Dr. Michael Havbro Faber ETH Zurich, Switzerland Swiss Federal Institute of Technology

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach

The Structure of Social Contact Graphs and their impact on Epidemics

Chapter 1. Introduction

Real-time Summarization Track

Neuro-Inspired Statistical. Rensselaer Polytechnic Institute National Science Foundation

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

BREAST CANCER EPIDEMIOLOGY MODEL:

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Bayesian Joint Modelling of Benefit and Risk in Drug Development

Chonhyon Park, Jia Pan, Ming Lin, Dinesh Manocha University of North Carolina at Chapel Hill

Computer based delineation and follow-up multisite abdominal tumors in longitudinal CT studies

Evaluating Classifiers for Disease Gene Discovery

IMRT FOR CRANIOSPINAL IRRADIATION: CHALLENGES AND RESULTS. A. Miller, L. Kasulaitytė Institute of Oncolygy, Vilnius University

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems.

Bayesian (Belief) Network Models,

Outline. What s inside this paper? My expectation. Software Defect Prediction. Traditional Method. What s inside this paper?

Expert System Profile

A Decision-Theoretic Approach to Evaluating Posterior Probabilities of Mental Models

Pythia WEB ENABLED TIMED INFLUENCE NET MODELING TOOL SAL. Lee W. Wagenhals Alexander H. Levis

The PlantFAdb website and database are based on the superb SOFA database (sofa.mri.bund.de).

Bayesians methods in system identification: equivalences, differences, and misunderstandings

An Edge-Device for Accurate Seizure Detection in the IoT

Binary Diagnostic Tests Paired Samples

Benchmark Dose Modeling Cancer Models. Allen Davis, MSPH Jeff Gift, Ph.D. Jay Zhao, Ph.D. National Center for Environmental Assessment, U.S.

Exploratory Quantitative Contrast Set Mining: A Discretization Approach

Multi Parametric Approach Using Fuzzification On Heart Disease Analysis Upasana Juneja #1, Deepti #2 *

More Examples and Applications on AVL Tree

Case Studies of Signed Networks

CS 4365: Artificial Intelligence Recap. Vibhav Gogate

Diversity of Preferences

Convolutional Coding: Fundamentals and Applications. L. H. Charles Lee. Artech House Boston London

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes

CISC453 Winter Probabilistic Reasoning Part B: AIMA3e Ch

Knowledge Based Systems

Towards Effective Structure Learning for Large Bayesian Networks

Outlier Analysis. Lijun Zhang

Active User Affect Recognition and Assistance

Dmitriy Fradkin. Ask.com

Gene-microRNA network module analysis for ovarian cancer

Automated Image Biometrics Speeds Ultrasound Workflow

Individual Differences in Attention During Category Learning

Decisions and Dependence in Influence Diagrams

Bottom-Up Model of Strategy Selection

STIN2103. Knowledge. engineering expert systems. Wan Hussain Wan Ishak. SOC 2079 Ext.: Url:

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Between-word regressions as part of rational reading

CHAPTER IV PREPROCESSING & FEATURE EXTRACTION IN ECG SIGNALS

A Belief-Based Account of Decision under Uncertainty. Craig R. Fox, Amos Tversky

PROTOCOL FOR INFLUENZA A VIRUS GLOBAL SWINE H1 CLADE CLASSIFICATION

Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework

Evidence synthesis with limited studies

EEG Signal Description with Spectral-Envelope- Based Speech Recognition Features for Detection of Neonatal Seizures

Bayesian Belief Network Based Fault Diagnosis in Automotive Electronic Systems

Dynamics of Color Category Formation and Boundaries

Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems

Human Activities: Handling Uncertainties Using Fuzzy Time Intervals

Spatial Cognition for Mobile Robots: A Hierarchical Probabilistic Concept- Oriented Representation of Space

Multi-Stage Stratified Sampling for the Design of Large Scale Biometric Systems

Artificial Intelligence Programming Probability

IE 5203 Decision Analysis Lab I Probabilistic Modeling, Inference and Decision Making with Netica

Event Classification and Relationship Labeling in Affiliation Networks

Article from. Forecasting and Futurism. Month Year July 2015 Issue Number 11

Selection and Combination of Markers for Prediction

CHAPTER 4 THE QUESTIONNAIRE DESIGN /SOLUTION DESIGN. This chapter contains explanations that become a basic knowledge to create a good

Impact of Response Variability on Pareto Front Optimization

FDA-iRISK 4.0 A Comparative Risk Assessment Tool J u l y 6,

Disease predictive, best drug: big data implementation of drug query with disease prediction, side effects & feedback analysis

A Tutorial on Bayesian Networks for System Health Management

Regression. Lelys Bravo de Guenni. April 24th, 2015

Supplementary Online Content

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

IAT 355 Visual Analytics. Encoding Information: Design. Lyn Bartram

Australian Journal of Basic and Applied Sciences

Evolutionary Programming

Affective Preference From Physiology in Videogames: a Lesson Learned from the TORCS Experiment

Detection of Atrial Fibrillation Using Model-based ECG Analysis

HZAU MULTIVARIATE HOMEWORK #2 MULTIPLE AND STEPWISE LINEAR REGRESSION

Chapter 1: Exploring Data

Question 1(25= )

A Language Modeling Approach for Temporal Information Needs

Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 2010

Statistical Analysis of Complete Social Networks

PET in Radiation Therapy. Outline. Tumor Segmentation in PET and in Multimodality Images for Radiation Therapy. 1. Tumor segmentation in PET

Predicting About-to-Eat Moments for Just-in-Time Eating Intervention

Prediction of Diabetes Using Bayesian Network

Homo heuristicus and the bias/variance dilemma

ORC: an Ontology Reasoning Component for Diabetes

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

Unit 1 Exploring and Understanding Data

Semi-Automatic Construction of Thyroid Cancer Intervention Corpus from Biomedical Abstracts

An Interval-Based Representation of Temporal Knowledge

Agent-Based Models. Maksudul Alam, Wei Wang

COAL COMBUSTION RESIDUALS RULE STATISTICAL METHODS CERTIFICATION SOUTHERN ILLINOIS POWER COOPERATIVE (SIPC)

An Extended Maritime Domain Awareness Probabilistic Ontology Derived from Human-aided Multi-Entity Bayesian Networks Learning

Transcription:

A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA

Probabilistic Databases Motivation: Increasing amounts of uncertain data Sensor Networks; Information Networks Data Integration and Information Extraction Probabilistic databases Annotate tuples with existence probabilities, and/or attribute values with probability distributions Interpretation according to the "possible worlds semantics"

Possible World Semantics ID Score pw1 t 1 200 t 2 150 w.p. 0.064 ID Score Prob t 3 100 t 1 200 0.2 t 2 150 0.8 t 3 100 0.4 pw2 ID Score t 1 200 t 2 150 w.p. 0.096 A probabilistic table (assume tuple-independence) pw3 ID Score t 2 150 w.p. 0.256 t 3 100 We focus on tuple uncertainty in the talk

Top-k Query Processing Score values are used to rank the tuples in every pw. ID Score pw1 t 1 200 t 2 150 w.p. 0.064 ID Score Prob t 3 100 t 1 200 0.2 t 2 150 0.8 t 3 100 0.4 pw2 ID Score t 1 200 t 2 150 w.p. 0.096 A probabilistic table (assume tuple-independence) pw3 ID Score t 2 150 w.p. 0.256 The top-1 answer for each possible world t 3 100

Top-k Queries: Many Prior Proposals Return k tuples t with the highest score(t)pr(t) [exp. score] Returns the most probable top k-answer [U-top-k] [Soliman et al. 07] At rank i, return tuple with max. prob. of being at rank i [U-rank-k] [Soliman et al. 07] Return k tuples t with the largest Pr (r(t) k) values [PT-k/GT-k] [Hua et al. 08] [Zhang et al. 08] Return k tuples t with smallest expected rank: pw Pr(pw) r pw (t) [Cormode et al. 09]

Top-k Queries Which one should we use??? Comparing different ranking functions Normalized Kendall Distance between two top-k answers: Penalizes #reversals and #mismatches Lies in [0,1], 0: Same answers; 1: Disjoint answers E-Score PT/GT U-Rank E-Rank U-Top E-Score ---- 0.124 0.302 0.799 0.276 PT/GT 0.124 ---- 0.332 0.929 0.367 U-Rank 0.302 0.332 ----- 0.929 0.204 E-Rank 0.799 0.929 0.929 ---- 0.945 U-Top 0.276 0.367 0.204 0.945 ---- Real Data Set: 100,000 tuples, Top-100

Top-k Queries Which one should we use??? Comparing different ranking functions Normalized Kendall Distance between two top-k answers: Penalizes #reversals and #mismatches Lies in [0,1], 0: Same answers; 1: Disjoint answers E-Score PT/GT U-Rank E-Rank U-Top E-Score ---- 0.864 0.890 0.004 0.925 PT/GT 0.864 ---- 0.395 0.864 0.579 U-Rank 0.890 0.395 ----- 0.890 0.316 E-Rank 0.004 0.864 0.890 ---- 0.926 U-Top 0.925 0.579 0.316 0.926 ---- Synthetic Dataset: 100,000 tuples, Top-100

Our Approach Define two parameterized ranking functions: PRF w ; PRF e.. that can simulate or approximate a variety of ranking functions PRF e much more efficient to evaluate (than PRF w ) User A specific ranking function Preference information: e.g., a ranking on a small dataset Represent as a PRF w Compute directly Learn PRF w parameters Use PRF e to approximate

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Parameterized Ranking Function Weight Function:! : (tuple, rank)! Parameterized Ranking Function (PRF) C Probability that t is ranked at position i across possible worlds Return k tuples with the highest values.

Parameterized Ranking Function!(t,i)= 1 : Rank the tuples by probabilities!(t,i)=score(t): E-Score PRF e ( ):!(i)= i complex number where can be a real or a PRF! (h): Generalizes PT/GT-k and U-Rank

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Computing PRF: Independent Tuples Computing Pr(r(t i ) = j), for a given tuple t i and a given rank j T i-1 ={t 1,t 2,,t i-1 } i.e., the set of tuples with scores higher than t i Generating Function Method The coefficient of x j : Pr(r(t i )=j)

Computing PRF: Independent Tuples Algorithm: For i=1 to n Expand Expand Can be improved from scratch to O(n) 2 ) O(n) time Return k tuples with largest values F i and F i-1 differ in two multiplicative terms O(n 32 ) overall

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Computing PRF e ( ): Independent Tuples Recall!(j)= j Generating Function Method Therefore: No need to expand the polynomial!! Morevoer: O(1) O(n) overall

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Computing PRF: x-tuples Capture mutual exclusivity correlations Xor nodes: V V V 0.5 0.3 0.3 0.2 0.2 0.7 (t 2,500) (t 1,950) (t 6,20) (t 5,30) (t 4,150) (t 3,200) Possible Worlds Pr t 4 0.02 t 3 0.08. t 1, t 4, t 6 0.03 t 1, t 3, t 6 0.018.

Computing PRF: x-tuples Construct generating function for t 4 r(i)=j if and only if (1) j-1 tuples with higher scores appear Pr(r(t 4 )=j) = coeff of x j-1 y (2) tuple i appears F(x,y)=(0.2+0.8x)(0.1+0.2y+0.7x) Xor nodes: V 0.2+0.8x V 1 V 0.1+0.2y+0.7x 0.5 0.3 0.3 0.2 0.2 0.7 (t 2,500) (t 1,950) (t 6,20) (t 5,30) (t 4,150) (t 3,200) O(n 2 ) overall x x 1 1 y x

Computing PRF e ( ): x-tuples We maintain only the numerical values of F i (, ) and F i (,0) at each node. E.g. =0.6. Now we want to compute F 5 (0.6,0.6) F 5 (0.6,0.6)=0.096*0.92*0.64 Xor nodes: V 0.2+0.8*0.6 V 0.92 V 0.1+0.2*0.6+0.7*0.6 0.5 0.3 0.3 0.2 0.2 0.7 O(1) for each new tuple (t 2,500) (t 1,950) (t 6,20) (t 5,30) (t 4,150) (t 3,200) 0.6 0.6 1 0.6 0.6 0.6 Overall O(n)

Summary of Other Results PRF computation for probabilistic and/xor trees Generalizes x-tuples by allowing both mutual exclusivity and coexistence PRF computation : O(n 3 ) PRF e computation : O(nlogn+nd) (d = height of the tree) PRF computation on graphical models A polynomial time algorithm when the junction tree has bounded treewidth A nontrivial dynamic program combined with the generating function method.

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Approximating Ranking Functions Approximating PRF! by a linear combination of PRF e Suppose Reduce to L individual PRF e computations Running time : O(nlogn+nL) (as opposed to O(n 2 )) We developed a scheme based on adapting the discrete Fourier transformation of! Works very well for monotonically non-increasing! E.g. the step function (PT/GT-k) Details in the paper

Outline Parameterized Ranking Functions Computing PRF Independent Tuples Computing PRF e ( ) X-Tuples Summery of Other Results Approximating Ranking Functions Experiments

Experiments: Approximation Ability Using a single PRF e A linear combination of PRF e s Approximating other functions using PRF e ( ) = 1-0.9 i No. of Terms vs Approximation Quality Real Icberg sighting dataset: 100,000 tuples

Experiments: Execution Time PRF e vs Others Approx vs Exact Real Icberg sighting dataset

Conclusions Proposed a unifying framework for ranking over probabilistic databases through: Parameterized ranking functions Incorporation of user feedback Designed highly efficient algorithms for computing PRF and PRF e Developed novel approximation techniques for approximating PRF w with PRF e