arxiv: v1 [cs.lg] 28 Nov 2017

Snorkel: Rapid Training Data Creation with Weak Supervision

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré
Stanford University, Stanford, CA, USA
{ajratner, bach, henryre, jfries, senwu,

ABSTRACT

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

PVLDB Reference Format: A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré. Snorkel: Rapid Training Data Creation with Weak Supervision.
PVLDB, 11 (3): xxxx-yyyy, DOI: /

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowment, Vol. 11, No. 3. Copyright 2017 VLDB Endowment /17/11... $ DOI: /

1. INTRODUCTION

In the last several years, there has been an explosion of interest in machine-learning-based systems across industry, government, and academia, with an estimated spend this year of $12.5 billion [1]. A central driver has been the advent of deep learning techniques, which can learn task-specific representations of input data, obviating what used to be the most time-consuming development task: feature engineering. These learned representations are particularly effective for tasks like natural language processing and image analysis, which have high-dimensional, high-variance input that is impossible to fully capture with simple rules or hand-engineered features [14, 17].

[Figure 1 diagram: Label Source 1 (accuracy 90%, 1k labels) and Label Source 2 (accuracy 60%, 100k labels) applied to unlabeled data.]

Figure 1: In Example 1.1, training data is labeled by sources of differing accuracy and coverage. Two key challenges arise in using this weak supervision effectively. First, we need a way to estimate the unknown source accuracies to resolve disagreements. Second, we need to pass on this critical lineage information to the end model being trained.
However, deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from (often tens of thousands to millions) to reach peak predictive performance [47]. Such training sets are enormously expensive to create, especially when domain expertise is required. For example, reading scientific papers, analyzing intelligence data, and interpreting medical images all require labeling by trained subject matter experts (SMEs). Moreover, we observe from our engagements with collaborators like research labs and major technology companies that modeling goals such as class definitions or granularity change as projects progress, necessitating re-labeling. Some big companies are able to absorb this cost, hiring large teams to label training data [12, 16, 31].

However, the bulk of practitioners are increasingly turning to weak supervision: cheaper sources of labels that are noisier or heuristic. The most popular form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels [4, 7, 32]. Other forms include crowdsourced labels [37, 50], rules and heuristics for labeling data [39, 52], and others [29, 30, 30, 46, 51]. While these sources are inexpensive, they often have limited accuracy and coverage.

Ideally, we would combine the labels from many weak supervision sources to increase the accuracy and coverage of our training set. However, two key challenges arise in doing so effectively. First, sources will overlap and conflict, and to resolve their conflicts we need to estimate their accuracies and correlation structure, without access to ground truth. Second, we need to pass on critical lineage information about label quality to the end model being trained.

Example 1.1. In Figure 1, we obtain labels from a high-accuracy, low-coverage Source 1, and from a low-accuracy, high-coverage Source 2, which overlap and disagree (split-color points). If we take an unweighted majority vote to resolve conflicts, we end up with null (tie-vote) labels. If we could correctly estimate the source accuracies, we would resolve conflicts in the direction of Source 1. We would still need to pass this information on to the end model being trained. Suppose that we took labels from Source 1 where available, and otherwise took labels from Source 2. Then, the expected training set accuracy would be 60.3%, only marginally better than the weaker source. Instead we should represent training label lineage in end model training, weighting labels generated by high-accuracy sources more.

In recent work, we developed data programming as a paradigm for addressing both of these challenges by modeling multiple label sources without access to ground truth, and generating probabilistic training labels representing the lineage of the individual labels. We prove that, surprisingly, we can recover source accuracy and correlation structure without hand-labeled training data [5, 38]. However, there are many practical aspects of implementing and applying this abstraction that have not been previously considered. We present Snorkel, the first end-to-end system for combining weak supervision sources to rapidly create training data.
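The 60.3% figure in Example 1.1 can be reproduced with a short calculation. The sketch below is only an illustration: it assumes that Source 1's 1k labeled points are a subset of the 100k points labeled by Source 2 and that each source's stated accuracy holds uniformly. The variable names are ours, not Snorkel's.

```python
# Counts and accuracies as read from Figure 1. Assumption: Source 1's
# points are a subset of the points covered by Source 2.
n_source1 = 1_000
n_source2 = 100_000
acc_source1 = 0.90
acc_source2 = 0.60

# Strategy from Example 1.1: use Source 1 where it voted, else Source 2.
n_fallback = n_source2 - n_source1
expected_correct = n_source1 * acc_source1 + n_fallback * acc_source2
expected_accuracy = expected_correct / n_source2

print(f"{expected_accuracy:.1%}")  # prints 60.3%
```

Because only 1% of the points are covered by the accurate source, the blended accuracy barely moves from the weaker source's 60%.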
We built Snorkel as a prototype to study how people could use data programming, a fundamentally new approach to building machine learning applications. Through weekly hackathons and office hours held at Stanford University over the past year, we have interacted with a growing user community around Snorkel's open source implementation. We have observed SMEs in industry, science, and government deploying Snorkel for knowledge base construction, image analysis, bioinformatics, fraud detection, and more. From this experience, we have distilled three principles that have shaped Snorkel's design:

1. Bring All Sources to Bear: The system should enable users to opportunistically use labels from all available weak supervision sources.
2. Training Data as the Interface to ML: The system should model label sources to produce a single, probabilistic label for each data point and train any of a wide range of classifiers to generalize beyond those sources.
3. Supervision as Interactive Programming: The system should provide rapid results in response to user supervision. We envision weak supervision as the REPL-like interface for machine learning.

Our work makes the following technical contributions:

A Flexible Interface for Sources: We observe that the heterogeneity of weak supervision strategies is a stumbling block for developers. Different types of weak supervision operate on different scopes of the input data. For example, distant supervision has to be mapped programmatically to specific spans of text. Crowd workers and weak classifiers often operate over entire documents or images. Heuristic rules are open ended; they can leverage information from multiple contexts simultaneously, such as combining information from a document's title, named entities in the text, and knowledge bases. This heterogeneity was cumbersome enough to completely block users of early versions of Snorkel. To address this challenge, we built an interface layer around the abstract concept of a labeling function (LF).
We developed a flexible language for expressing weak supervision strategies and supporting data structures. We observed accelerated user productivity with these tools, which we validated in a user study where SMEs build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling.

Tradeoffs in Modeling of Sources: Snorkel learns the accuracies of weak supervision sources without access to ground truth using a generative model [38]. Furthermore, it also learns correlations and other statistical dependencies among sources, correcting for dependencies in labeling functions that skew the estimated accuracies [5]. This paradigm gives rise to previously unexplored tradeoff spaces between predictive performance and speed. The natural first question is: when does modeling the accuracies of sources improve predictive performance? Further, how many dependencies, such as correlations, are worth modeling?

We study the tradeoffs between predictive performance and training time in generative models for weak supervision. While modeling source accuracies and correlations will not hurt predictive performance, we present a theoretical analysis of when a simple majority vote will work just as well. Based on our conclusions, we introduce an optimizer for deciding when to model accuracies of labeling functions, and when learning can be skipped in favor of a simple majority vote. Further, our optimizer automatically decides which correlations to model among labeling functions. This optimizer correctly predicts the advantage of generative modeling over majority vote to within 2.16 accuracy points on average on our evaluation tasks, and accelerates pipeline executions by up to 1.8×. It also enables us to gain 60% to 70% of the benefit of correlation learning while saving up to 61% of training time (34 minutes per execution).

First End-to-End System for Data Programming: Snorkel is the first system to implement our recent work on data programming [5, 38].
Previous ML systems that we and others developed [52] required extensive feature engineering and model specification, leading to confusion about where to inject relevant domain knowledge. While programming weak supervision seems superficially similar to feature engineering, we observe that users approach the two processes very differently. Our vision (weak supervision as the sole port of interaction for machine learning) implies radically different workflows, requiring a proof of concept. Snorkel demonstrates that this paradigm enables users to develop high-quality models for a wide range of tasks. We report on two deployments of Snorkel, in collaboration with the U.S. Department of Veterans Affairs and Stanford Hospital and Clinics, and the U.S. Food and Drug Administration, where Snorkel improves over heuristic baselines by an average 110%. We also report results on four open-source datasets that are representative of other Snorkel deployments, including bioinformatics, medical image analysis, and crowdsourcing; on which Snorkel beats heuristics by an average 153% and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

[Figure 2 diagram: weak supervision sources (external KBs, patterns and dictionaries, domain heuristics) and unlabeled data feed the labeling function interface, e.g. Ontology(ctd, [A, B, -C]), Pattern("{{0}}causes{{1}}"), CustomFn(x,y : heuristic(x,y)), defined over a context hierarchy (Document, Sentence, Span, Entity). Snorkel builds the label matrix Λ, applies the modeling optimizer and generative model, and outputs probabilistic training data Ỹ for the discriminative model.]

Figure 2: An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network.

2. SNORKEL ARCHITECTURE

Snorkel's workflow is designed around data programming [5, 38], a fundamentally new paradigm for training machine learning models using weak supervision, and proceeds in three main stages (Figure 2):

1. Writing Labeling Functions: Rather than hand-labeling training data, users of Snorkel write labeling functions, which allow them to express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more. This was the component most informed by early interactions (and mistakes) with users over the last year of deployment, and we present a flexible interface and supporting data model.
2. Modeling Accuracies and Correlations: Next, Snorkel automatically learns a generative model over the labeling functions, which allows it to estimate their accuracies and correlations. This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions. We observe that this step improves end predictive performance 5.81% over Snorkel with unweighted label combination, and anecdotally that it streamlines the user development experience by providing actionable feedback about labeling function quality.
3. Training a Discriminative Model: The output of Snorkel is a set of probabilistic labels that can be used to train a wide variety of state-of-the-art machine learning models, such as popular deep learning models. While the generative model is essentially a re-weighted combination of the user-provided labeling functions (which tend to be precise but low-coverage), modern discriminative models can retain this precision while learning to generalize beyond the labeling functions, increasing coverage and robustness on unseen data.

Next we set up the problem Snorkel addresses and describe its main components and design decisions.

Setup: Our goal is to learn a parameterized classification model $h_\theta$ that, given a data point $x \in \mathcal{X}$, predicts its label $y \in \mathcal{Y}$, where the set of possible labels $\mathcal{Y}$ is discrete. For simplicity, we focus on the binary setting $\mathcal{Y} = \{-1, 1\}$, though we include a multi-class application in our experiments. For example, x might be a medical image, and y a label indicating normal versus abnormal. In the relation extraction examples we look at, we often refer to x as a candidate. In a traditional supervised learning setup, we would learn $h_\theta$ by fitting it to a training set of labeled data points. However, in our setting, we assume that we only have access to unlabeled data for training. We do assume access to a small set of labeled data used during development, called the development set, and a blind, held-out labeled test set for evaluation.
These sets can be orders of magnitude smaller than a training set, making them economical to obtain.

The user of Snorkel aims to generate training labels by providing a set of labeling functions, which are black-box functions, $\lambda : \mathcal{X} \to \mathcal{Y} \cup \{\emptyset\}$, that take in a data point and output a label, where we use $\emptyset$ to denote that the labeling function abstains. Given m unlabeled data points and n labeling functions, Snorkel applies the labeling functions over the unlabeled data to produce a matrix of labeling function outputs $\Lambda \in (\mathcal{Y} \cup \{\emptyset\})^{m \times n}$. The goal of the remaining Snorkel pipeline is to synthesize this label matrix Λ, which may contain overlapping and conflicting labels for each data point, into a single vector of probabilistic training labels $\tilde{Y} = (\tilde{y}_1, \ldots, \tilde{y}_m)$, where $\tilde{y}_i \in [0, 1]$. These training labels can then be used to train a discriminative model.

Next, we introduce the running example of a text relation extraction task as a proxy for many real-world knowledge base construction and data analysis tasks:

Example 2.1. Consider the task of extracting mentions of adverse chemical-disease relations from the biomedical literature (see CDR task, Section 4.1). Given documents with mentions of chemicals and diseases tagged, we refer to each co-occurring (chemical, disease) mention pair as a candidate extraction, which we view as a data point to be classified as either true or false. For example, in Figure 2, we would have two candidates with true labels $y_1$ = True and $y_2$ = False:

    x_1 = Causes("magnesium", "quadriplegic")
    x_2 = Causes("magnesium", "preeclampsia")
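The construction of the label matrix Λ described in the setup can be sketched in a few lines. This is a minimal illustration, not Snorkel's actual API: the candidates are toy dictionaries standing in for Snorkel's Candidate objects, the two labeling functions and the helper `apply_lfs` are hypothetical, and abstention (∅) is encoded as 0.

```python
import numpy as np

ABSTAIN = 0  # assumption: the abstain symbol is encoded as 0


def lf_causes(c):
    """Vote +1/-1 if 'causes' appears between the mentions, else abstain."""
    if "causes" in c["between"]:
        return 1 if c["chemical_first"] else -1
    return ABSTAIN


def lf_treats(c):
    """Vote -1 if 'treats' appears between the mentions, else abstain."""
    return -1 if "treats" in c["between"] else ABSTAIN


def apply_lfs(lfs, candidates):
    """Return the m x n label matrix with entry (i, j) = lfs[j](candidates[i])."""
    return np.array([[lf(c) for lf in lfs] for c in candidates], dtype=int)


# Toy stand-ins for candidates: word order flag plus the words in between.
candidates = [
    {"chemical_first": True, "between": ["causes"]},
    {"chemical_first": True, "between": ["treats"]},
    {"chemical_first": False, "between": ["causes"]},
]
L = apply_lfs([lf_causes, lf_treats], candidates)
print(L)
```

Each row of `L` holds the votes for one candidate and each column holds the votes of one labeling function, which is the shape the rest of the pipeline consumes.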

[Figure 3 diagram: context hierarchy of Document, Sentence, Span, and Entity, with a Candidate(A, B) defined over two Spans.]

Figure 3: Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate.

Data Model: A design challenge is managing complex, unstructured data in a way that enables SMEs to write labeling functions over it. In Snorkel, input data is stored in a context hierarchy. It is made up of context types connected by parent/child relationships, which are stored in a relational database and made available via an object-relational mapping (ORM) layer built with SQLAlchemy. Each context type represents a conceptual component of data to be processed by the system or used when writing labeling functions; for example a document, an image, a paragraph, a sentence, or an embedded table. Candidates (i.e., data points x) are then defined as tuples of contexts (Figure 3).

Example 2.2. In our running CDR example, the input documents can be represented in Snorkel as a hierarchy consisting of Documents, each containing one or more Sentences, each containing one or more Spans of text. These Spans may also be tagged with metadata, such as Entity markers identifying them as chemical or disease mentions (Figure 3). A candidate is then a tuple of two Spans.

2.1 A Language for Weak Supervision

Snorkel uses the core abstraction of a labeling function to allow users to specify a wide range of weak supervision sources such as patterns, heuristics, external knowledge bases, crowdsourced labels, and more. This higher-level, less precise input is more efficient to provide (see Section 4.2), and can be automatically denoised and synthesized, as described in subsequent sections. In this section, we describe our design choices in building an interface for writing labeling functions, which we envision as a unifying programming language for weak supervision.
These choices were informed to a large degree by our interactions (primarily through weekly office hours) with Snorkel users in bioinformatics, defense, industry, and other areas over the past year. For example, while we initially intended to have a more complex structure for labeling functions, with manually specified types and correlation structure, we quickly found that simplicity in this respect was critical to usability (and not empirically detrimental to our ability to model their outputs). We also quickly discovered that users wanted either far more expressivity or far less of it, compared to our first library of function templates. We thus trade off expressivity and efficiency by allowing users to write labeling functions at two levels of abstraction: custom Python functions and declarative operators.

Hand-Defined Labeling Functions: In its most general form, a labeling function is just an arbitrary snippet of code, usually written in Python, which accepts as input a Candidate object and either outputs a label or abstains. Often these functions are similar to extract-transform-load scripts, expressing basic patterns or heuristics, but may use supporting code or resources and be arbitrarily complex. Writing labeling functions by hand is supported by the ORM layer, which maps the context hierarchy and associated metadata to an object-oriented syntax, allowing the user to easily traverse the structure of the input data.

Example 2.3. In our running example, we can write a labeling function that checks if the word "causes" appears between the chemical and disease mentions. If it does, it outputs True if the chemical mention is first and False if the disease mention is first. If "causes" does not appear, it outputs None, indicating abstention:

    def LF_causes(x):
        # Word index ranges (start, end) of the two mentions in the sentence.
        cs, ce = x.chemical.get_word_range()
        ds, de = x.disease.get_word_range()
        # Chemical first, "causes" in between: vote True.
        if ce < ds and "causes" in x.parent.words[ce + 1:ds]:
            return True
        # Disease first, "causes" in between: vote False.
        if de < cs and "causes" in x.parent.words[de + 1:cs]:
            return False
        # Otherwise abstain.
        return None

We could also write this with Snorkel's declarative interface:

    LF_causes = lf_search("{{1}}.*\Wcauses\W.*{{2}}", reverse_args=False)

Declarative Labeling Functions: Snorkel includes a library of declarative operators that encode the most common weak supervision function types, based on our experience with users over the last year. These functions capture a range of common forms of weak supervision, for example:

Pattern-based: Pattern-based heuristics embody the motivation of soliciting higher information density input from SMEs. For example, pattern-based heuristics encompass feature annotations [51] and pattern-bootstrapping approaches [18, 20] (Example 2.3).

Distant supervision: Distant supervision generates training labels by heuristically aligning data points with an external knowledge base, and is one of the most popular forms of weak supervision [4, 22, 32].

Weak classifiers: Classifiers that are insufficient for our task (e.g., limited coverage, noisy, biased, and/or trained on a different dataset) can be used as labeling functions.

Labeling function generators: One higher-level abstraction that we can build on top of labeling functions in Snorkel is labeling function generators, which generate multiple labeling functions from a single resource, such as crowdsourced labels and distant supervision from structured knowledge bases (Example 2.4).

Example 2.4. A challenge in traditional distant supervision is that different subsets of knowledge bases have different levels of accuracy and coverage. In our running example, we can use the Comparative Toxicogenomics Database (CTD) as distant supervision, separately modeling different subsets of it with separate labeling functions. For example,

we might write one labeling function to label a candidate True if it occurs in the "Causes" subset, and another to label it False if it occurs in the "Treats" subset. We can write this using a labeling function generator,

    LFs_CTD = Ontology(ctd, {"Causes": True, "Treats": False})

which creates two labeling functions. In this way, generators can be connected to large resources and create hundreds of labeling functions with a line of code.

2.2 Generative Model

The core operation of Snorkel is modeling and integrating the noisy signals provided by a set of labeling functions. Using the recently proposed approach of data programming [5, 38], we model the true class label for a data point as a latent variable in a probabilistic model. In the simplest case, we model each labeling function as a noisy "voter" which is independent, i.e., makes errors that are uncorrelated with the other labeling functions. This defines a generative model of the votes of the labeling functions as noisy signals about the true label.

We can also model statistical dependencies between the labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, we can include this dependency in the model and avoid a "double counting" problem. We observe that such pairwise correlations are the most common, so we focus on them in this paper (though handling higher order dependencies is straightforward). We use our structure learning method for generative models [5] to select a set C of labeling function pairs (j, k) to model as correlated (see Section 3.2).

Now we can construct the full generative model as a factor graph. We first apply all the labeling functions to the unlabeled data points, resulting in a label matrix Λ, where $\Lambda_{i,j} = \lambda_j(x_i)$.
We then encode the generative model $p_w(\Lambda, Y)$ using three factor types, representing the labeling propensity, accuracy, and pairwise correlations of labeling functions:

$$\phi^{Lab}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}$$
$$\phi^{Acc}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = y_i\}$$
$$\phi^{Corr}_{i,j,k}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \quad (j, k) \in C$$

For a given data point $x_i$, we define the concatenated vector of these factors for all the labeling functions j = 1, ..., n and potential correlations C as $\phi_i(\Lambda, Y)$, and the corresponding vector of parameters $w \in \mathbb{R}^{2n + |C|}$. This defines our model:

$$p_w(\Lambda, Y) = Z_w^{-1} \exp\left( \sum_{i=1}^{m} w^T \phi_i(\Lambda, y_i) \right),$$

where $Z_w$ is a normalizing constant. To learn this model without access to the true labels Y, we minimize the negative log marginal likelihood given the observed label matrix Λ:

$$\hat{w} = \arg\min_w \; -\log \sum_Y p_w(\Lambda, Y).$$

We optimize this objective by interleaving stochastic gradient descent steps with Gibbs sampling ones, similar to contrastive divergence [21]; for more details, see [5, 38]. We use the Numbskull library, a Python NUMBA-based Gibbs sampler. We then use the predictions, $\tilde{Y} = p_{\hat{w}}(Y \mid \Lambda)$, as probabilistic training labels.

2.3 Discriminative Model

The end goal in Snorkel is to train a model that generalizes beyond the information expressed in the labeling functions. We train a discriminative model $h_\theta$ on our probabilistic labels $\tilde{Y}$ by minimizing a noise-aware variant of the loss $l(h_\theta(x_i), y)$, i.e., the expected loss with respect to $\tilde{Y}$:

$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^{m} \mathbb{E}_{y \sim \tilde{Y}} \left[ l(h_\theta(x_i), y) \right].$$

A formal analysis shows that as we increase the amount of unlabeled data, the generalization error of discriminative models trained with Snorkel will decrease at the same asymptotic rate as traditional supervised learning models do with additional hand-labeled data [38], allowing us to increase predictive performance by adding more unlabeled data. Intuitively, this property holds because as more data is provided, the discriminative model sees more features that co-occur with the heuristics encoded in the labeling functions.

Example 2.5.
The CDR data contains the sentence, "Myasthenia gravis presenting as weakness after magnesium administration." None of the 33 labeling functions we developed vote on the corresponding Causes(magnesium, myasthenia gravis) candidate, i.e., they all abstain. However, a deep neural network trained on probabilistic training labels from Snorkel correctly identifies it as a true mention.

Snorkel provides connectors for popular machine learning libraries such as TensorFlow [2], allowing users to exploit commodity models like deep neural networks that do not require hand-engineering of features and have robust predictive performance across a wide range of tasks.

3. WEAK SUPERVISION TRADEOFFS

We study the fundamental question of when and at what level of complexity we should expect Snorkel's generative model to yield the greatest predictive performance gains. Understanding these performance regimes can help guide users, and introduces a tradeoff space between predictive performance and speed. We characterize this space in two parts: first, by analyzing when the generative model can be approximated by an unweighted majority vote, and second, by automatically selecting the complexity of the correlation structure to model. We then introduce a two-stage, rule-based optimizer to support fast development cycles.

3.1 Modeling Accuracies

The natural first question when studying systems for weak supervision is, "When does modeling the accuracies of sources improve end-to-end predictive performance?" We study that question in this subsection and propose a heuristic to identify settings in which this modeling step is most beneficial.

Tradeoff Space

We start by considering the label density $d_\Lambda$ of the label matrix Λ, defined as the mean number of non-abstention labels per data point. In the low-density setting, sparsity of labels will mean that there is limited room for even an optimal weighting of the labeling functions to diverge much from the majority vote. Conversely, as the label density

grows, known theory confirms that the majority vote will eventually be optimal [27]. It is the middle-density regime where we expect to most benefit from applying the generative model.

[Figure 4 plot: modeling advantage versus number of labeling functions, with low-density (choose MV), mid-density (choose GM), and high-density (choose MV) regions; curves for the low-density bound, the optimizer bound (Ã*), the optimal model (A*), and the generative model (A_w).]

Figure 4: A plot of the modeling advantage, i.e., the improvement in label accuracy from the generative model, as a function of the number of labeling functions (equivalently, the label density) on a synthetic dataset (footnote 7). We plot the advantage obtained by a learned generative model (GM), $A_w$; by an optimal model $A^*$; the upper bound $\tilde{A}^*$ used in our optimizer; and the low-density bound (Proposition 1).

We start by defining a measure of the benefit of weighting the labeling functions by their true accuracies (in other words, the predictions of a perfectly estimated generative model) versus an unweighted majority vote:

Definition 1. (Modeling Advantage) Let the weighted majority vote of n labeling functions on data point $x_i$ be denoted as $f_w(\Lambda_i) = \sum_{j=1}^{n} w_j \Lambda_{i,j}$, and the unweighted majority vote (MV) as $f_1(\Lambda_i) = \sum_{j=1}^{n} \Lambda_{i,j}$, where we consider the binary classification setting and represent an abstaining vote as 0. We define the modeling advantage $A_w$ as the improvement in accuracy of $f_w$ over $f_1$ for a dataset:

$$A_w(\Lambda, y) = \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{1}\{y_i f_w(\Lambda_i) > 0 \wedge y_i f_1(\Lambda_i) \le 0\} - \mathbb{1}\{y_i f_w(\Lambda_i) \le 0 \wedge y_i f_1(\Lambda_i) > 0\} \right)$$

In other words, $A_w$ is the number of times $f_w$ correctly disagrees with $f_1$ on a label, minus the number of times it incorrectly disagrees. Let the optimal advantage $A^* = A_{w^*}$ be the advantage using the optimal weights $w^*$ (WMV*).

To build intuition, we start by analyzing the optimal advantage for three regimes of label density (see Figure 6):

Low Label Density: In this sparse setting, very few data points have more than one non-abstaining label; only a small number have multiple conflicting labels.
We have observed this occurring, for example, in the early stages of application development. We see that with non-adversarial labeling functions ($w > 0$), even an optimal generative model (WMV*) can only disagree with MV when there are disagreeing labels, which will occur infrequently. We see that the expected optimal advantage will have an upper bound that falls quadratically with label density:

Proposition 1. (Low-Density Upper Bound) Assume that $P(\Lambda_{i,j} \neq 0) = p_l \;\forall i, j$, and $w^*_j > 0 \;\forall j$. Then, the expected label density is $\bar{d} = n p_l$, and

$$\mathbb{E}_{\Lambda, y, w^*}[A^*] = O(\bar{d}^2) \qquad (1)$$

Proof Sketch: We bound the advantage above by computing the expected number of pairwise disagreements.

7 We generate a class-balanced dataset of m = 1000 data points with binary labels, and n independent labeling functions with average accuracy 75% and a fixed 10% probability of voting.

Table 1: Modeling advantage $A_w$ attained using a generative model for several applications in Snorkel (Section 4.1), the upper bound $\tilde{A}^*$ used by our optimizer, the modeling strategy selected by the optimizer (either majority vote (MV) or generative model (GM)), and the empirical label density $d_\Lambda$.

    Dataset     A_w (%)   Ã* (%)   Modeling Strategy   d_Λ
    Radiology                       GM                  2.3
    CDR                             GM                  1.8
    Spouses                         GM                  1.4
    Chem                            MV                  1.2
    EHR                             GM                  1.2

High Label Density: In this setting, the majority of the data points have a large number of labels. For example, we might be working in an extremely high-volume crowdsourcing setting, or an application with many high-coverage knowledge bases as distant supervision. Under modest assumptions (namely, that the average labeling function accuracy $\bar{\alpha}^*$ is greater than 50%) it is known that the majority vote converges exponentially to an optimal solution as the average label density $\bar{d}$ increases, which serves as an upper bound for the expected optimal advantage as well:

Theorem 1. (High-Density Upper Bound [27]) Assume that $P(\Lambda_{i,j} \neq 0) = p_l \;\forall i, j$, and that $\bar{\alpha}^* = \frac{1}{n} \sum_{j=1}^{n} \alpha^*_j = \frac{1}{n} \sum_{j=1}^{n} 1/(1 + \exp(-w^*_j)) > \frac{1}{2}$.
Then:

\[ \mathbb{E}_{\Lambda, y, w^*}[A^*] \le e^{-2 p_l \left(\bar{\alpha} - \frac{1}{2}\right)^2 d} \tag{2} \]

Proof: This follows from the result in [27] for the symmetric Dawid-Skene model under constant probability sampling.

Medium Label Density: In this middle regime, we expect that modeling the accuracies of the labeling functions will deliver the greatest gains in predictive performance because we will have many data points with a small number of disagreeing labeling functions. For such points, the estimated labeling function accuracies can heavily affect the predicted labels. We indeed see gains in the empirical results using an independent generative model that only includes accuracy factors $\phi^{Acc}_{i,j}$ (Table 1). Furthermore, the guarantees in [38] establish that we can learn the optimal weights, and thus approach the optimal advantage.

Automatically Choosing a Modeling Strategy

The bounds in the previous subsection imply that there are settings in which we should be able to safely skip modeling the labeling function accuracies, simply taking the unweighted majority vote instead. However, in practice, the

overall label density $d_\Lambda$ is insufficiently precise to determine the transition points of interest, given a user time-cost tradeoff preference (characterized by the advantage tolerance parameter $\gamma$ in Algorithm 1). We show this in Table 1 using our application data sets from Section 4.1. For example, we see that the Chem and EHR label matrices have equivalent label densities; however, modeling the labeling function accuracies has a much greater effect for EHR than for Chem.

Instead of simply considering the average label density $d_\Lambda$, we develop a best-case heuristic based on looking at the ratio of positive to negative labels for each data point. This heuristic serves as an upper bound to the true expected advantage, and thus we can use it to determine when we can safely skip training the generative model (see Algorithm 1). Let $c_y(\Lambda_i) = \sum_{j=1}^{n} \mathbb{1}\{\Lambda_{i,j} = y\}$ be the counts of labels of class $y$ for $x_i$, and assume that the true labeling function weights lie within a fixed range, $w_j \in [w_{\min}, w_{\max}]$, and have a mean $\bar{w}$ (footnote 8). Then, define:

\[ \Phi(\Lambda_i, y) = \mathbb{1}\{ c_y(\Lambda_i)\, w_{\max} > c_{-y}(\Lambda_i)\, w_{\min} \} \]

\[ \tilde{A}^*(\Lambda) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y \in \pm 1} \mathbb{1}\{ y f_1(\Lambda_i) \le 0 \}\, \Phi(\Lambda_i, y)\, \sigma(2 f_{\bar{w}}(\Lambda_i)\, y) \]

where $\sigma(\cdot)$ is the sigmoid function, $f_{\bar{w}}$ is majority vote with all weights set to the mean $\bar{w}$, and $\tilde{A}^*(\Lambda)$ is the predicted modeling advantage used by our optimizer. Essentially, we are taking the expected counts of instances in which a weighted majority vote could possibly flip the incorrect predictions of unweighted majority vote under best-case conditions, which is an upper bound for the expected advantage:

Proposition 2. (Optimizer Upper Bound) Assume that the labeling functions have accuracy parameters (log-odds weights) $w_j \in [w_{\min}, w_{\max}]$, and have $\mathbb{E}[w] = \bar{w}$. Then:

\[ \mathbb{E}_{y, w^*}[A^* \mid \Lambda] \le \tilde{A}^*(\Lambda) \tag{3} \]

Proof Sketch: We upper-bound the modeling advantage by the expected number of instances in which WMV* is correct and MV is incorrect. We then upper-bound this by using the best-case probability of the weighted majority vote being correct given $(w_{\min}, w_{\max})$.
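The two quantities discussed above, the empirical modeling advantage of Definition 1 and the optimizer's predicted advantage, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Snorkel's implementation: the label matrix is assumed to be a list of rows with entries in {-1, 0, +1} (0 meaning abstain), the function names `modeling_advantage` and `predicted_advantage` are hypothetical, and the default weight range is the (0.5, 1.0, 1.5) setting the text reports.

```python
import math


def sigmoid(x):
    """Logistic function used in the predicted-advantage formula."""
    return 1.0 / (1.0 + math.exp(-x))


def modeling_advantage(L, y, w):
    """Empirical advantage of a weighted vote over an unweighted majority vote.

    L: list of rows, each a list of votes in {-1, 0, +1} (0 = abstain).
    y: list of true labels in {-1, +1} (needs ground truth).
    w: list of labeling function weights, one per column of L.
    """
    total = 0
    for row, y_i in zip(L, y):
        f_w = sum(w_j * vote for w_j, vote in zip(w, row))
        f_1 = sum(row)
        if y_i * f_w > 0 and y_i * f_1 <= 0:
            total += 1  # weighted vote is right where majority vote is not
        elif y_i * f_w <= 0 and y_i * f_1 > 0:
            total -= 1  # weighted vote is wrong where majority vote is right
    return total / len(L)


def predicted_advantage(L, w_min=0.5, w_mean=1.0, w_max=1.5):
    """Best-case predicted advantage; uses the label matrix only."""
    total = 0.0
    for row in L:
        f_1 = sum(row)
        for y in (-1, 1):
            c_y = sum(1 for vote in row if vote == y)
            c_not_y = sum(1 for vote in row if vote == -y)
            mv_wrong = y * f_1 <= 0  # majority vote would not predict y
            can_flip = c_y * w_max > c_not_y * w_min  # best-case weights favor y
            if mv_wrong and can_flip:
                # f with all weights set to the mean equals w_mean * f_1
                total += sigmoid(2 * w_mean * f_1 * y)
    return total / len(L)
```

A caller would compare `predicted_advantage(L)` against the tolerance $\gamma$ and fall back to majority vote when the predicted value is smaller. Note that only `modeling_advantage` requires true labels, which is why it can be reported after the fact but not used for the decision itself.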
We apply $\tilde{A}^*$ to a synthetic dataset and plot in Figure 6. Next, we compute $\tilde{A}^*$ for the labeling matrices from experiments in Section 4.1, and compare with the empirical advantage of the trained generative models (Table 1). We see that our approximate quantity $\tilde{A}^*$ serves as a correct guide in all cases for determining which modeling strategy to select, which for the mature applications reported on is indeed most often the generative model. However, we see that while EHR and Chem have equivalent label densities, our optimizer correctly predicts that Chem can be modeled with majority vote, speeding up each pipeline execution by 1.8×. We find in our applications that the optimizer can save execution time especially during the initial stages of iterative development (see full version).

Footnote 8: We fix these at defaults of $(w_{\min}, \bar{w}, w_{\max}) = (0.5, 1.0, 1.5)$, which corresponds to assuming labeling functions have accuracies between 62% and 82%, and an average accuracy of 73%.

3.2 Modeling Structure

In this subsection, we consider modeling additional statistical structure beyond the independent model. We study the tradeoff between predictive performance and computational cost, and describe how to automatically select a good point in this tradeoff space.

Structure Learning. We observe many Snorkel users writing labeling functions that are statistically dependent. Examples we have observed include:

- Functions that are variations of each other, such as checking for matches against similar regular expressions.
- Functions that operate on correlated inputs, such as raw tokens of text and their lemmatizations.
- Functions that use correlated sources of knowledge, such as distant supervision from overlapping knowledge bases.

Modeling such dependencies is important because they affect our estimates of the true labels. Consider the extreme case in which not accounting for dependencies is catastrophic:

Example 3.1.
Consider a set of 10 labeling functions, where 5 are perfectly correlated, i.e., they vote the same way on every data point, and 5 are conditionally independent given the true label. If the correlated labeling functions have accuracy $\alpha = 50\%$ and the uncorrelated ones have accuracy $\beta = 99\%$, then the maximum likelihood estimate of their accuracies according to the independent model is $\hat{\alpha} = 100\%$ and $\hat{\beta} = 50\%$.

Specifying a generative model to account for such dependencies by hand is impractical for three reasons. First, it is difficult for non-expert users to specify these dependencies. Second, as users iterate on their labeling functions, their dependency structure can change rapidly, like when a user relaxes a labeling function to label many more candidates. Third, the dependency structure can be dataset specific, making it impossible to specify a priori, such as when a corpus contains many strings that match multiple regular expressions used in different labeling functions. We observed users of earlier versions of Snorkel struggling for these reasons to construct accurate and efficient generative models with dependencies. We therefore seek a method that can quickly identify an appropriate dependency structure from the labeling function outputs $\Lambda$ alone.

Naively, we could include all dependencies of interest, such as all pairwise correlations, in the generative model and perform parameter estimation. However, this approach is impractical. For 100 labeling functions and 10,000 data points, estimating parameters with all possible correlations takes roughly 45 minutes. When multiplied over repeated runs of hyperparameter searching and development cycles, this cost greatly inhibits labeling function development. We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth [5]. It uses a pseudolikelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly.
It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points. However, this approach relies on a selection threshold hyperparameter ɛ which induces a tradeoff space between predictive performance and computational cost.
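Returning to the setting of Example 3.1, a short exact calculation can illustrate why ignoring dependencies is so costly. The sketch below does not reproduce the maximum likelihood estimate quoted in the example; it swaps in a simpler question, namely how accurate an unweighted majority vote is when the perfectly correlated block is counted as five separate voters. It assumes every function votes on every data point and that ties count as errors; the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability that exactly k of n independent voters are correct."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def majority_vote_accuracy(n_corr=5, acc_corr=0.5, n_ind=5, acc_ind=0.99):
    """Exact accuracy of an unweighted majority vote.

    n_corr functions always vote together (one shared vote, counted n_corr
    times); n_ind functions vote independently given the true label.
    A tie (zero margin) is counted as an error.
    """
    accuracy = 0.0
    for block_correct, p_block in ((True, acc_corr), (False, 1 - acc_corr)):
        block_vote = n_corr if block_correct else -n_corr
        for k in range(n_ind + 1):  # k independent functions are correct
            margin = block_vote + k - (n_ind - k)
            if margin > 0:
                accuracy += p_block * binom_pmf(k, n_ind, acc_ind)
    return accuracy


with_block = majority_vote_accuracy()              # all ten functions vote
without_block = majority_vote_accuracy(n_corr=0)   # only the independent five
```

Under these assumptions the ten-function vote is right only about half the time, because the correlated block can never be outvoted, while the five independent functions on their own are almost always right. A model that knows the block is really one source would avoid this failure, which is the motivation for learning the dependency structure.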

[Figure 5 plots: three panels titled "Simulated Labeling Functions", "Chemical-Disease Labeling Functions", and "All User Study Labeling Functions"; each shows predictive performance (F1) and number of correlations against the correlation threshold, with the elbow point marked.]

Figure 5: Predictive performance of the generative model and number of learned correlations versus the correlation threshold ɛ. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task.

Tradeoff Space

Such structure learning methods, whether pseudolikelihood or likelihood-based, crucially depend on a selection threshold ɛ for deciding which dependencies to add to the generative model. Fundamentally, the choice of ɛ determines the complexity of the generative model (footnote 9). We study the tradeoff between predictive performance and computational cost that this induces. We find that generally there is an elbow point beyond which the number of correlations selected, and thus the computational cost, explodes, and that this point is a safe tradeoff point between predictive performance and computation time.

Predictive Performance: At one extreme, a very large value of ɛ will not include any correlations in the generative model, making it identical to the independent model. As ɛ is decreased, correlations will be added. At first, when ɛ is still high, only the strongest correlations will be included. As these correlations are added, we observe that the generative model's predictive performance tends to improve. Figure 5, left, shows the result of varying ɛ in a simulation where more than half the labeling functions are correlated.
After adding a few key dependencies, the generative model resolves the discrepancies among the labeling functions. Figure 5, middle, shows the effect of varying ɛ for the CDR task. Predictive performance improves as ɛ decreases until the model overfits. Finally, we consider a large number of labeling functions that are likely to be correlated. In our user study (described in Section 4.2), participants wrote labeling functions for the Spouses task. We combined all 125 of their functions and studied the effect of varying ɛ. Here, we expect there to be many correlations since it is likely that users wrote redundant functions. We see in Figure 5, right, that structure learning surpasses the best-performing individual's generative model (50.0 F1).

Computational Cost: Computational cost is correlated with model complexity. Since learning in Snorkel is done with a Gibbs sampler, the overhead of modeling additional correlations is linear in the number of correlations. The dashed lines in Figure 5 show the number of correlations included in each model versus ɛ. For example, on the Spouses task, fitting the parameters of the generative model at ɛ = 0.5 takes 4 minutes, and fitting its parameters with ɛ = [...] takes 57 minutes. Further, parameter estimation is often run repeatedly during development for two reasons: (i) fitting generative model hyperparameters using a development set requires repeated runs, and (ii) as users iterate on their labeling functions, they must re-estimate the generative model to evaluate them.

Footnote 9: Specifically, ɛ is both the coefficient of the $\ell_1$ regularization term used to induce sparsity, and the minimum absolute weight in log scale that a dependency must have to be selected.

Automatically Choosing a Model

Based on our observations, we seek to automatically choose a value of ɛ that trades off between predictive performance and computational cost using the labeling functions' outputs $\Lambda$ alone.
Including ɛ as a hyperparameter in a grid search over a development set is generally not feasible because of its large effect on running time. We therefore want to choose ɛ before other hyperparameters, without performing any parameter estimation. We propose using the number of correlations selected at each value of ɛ as an inexpensive indicator. The dashed lines in Figure 5 show that as ɛ decreases, the number of selected correlations follows a pattern. Generally, the number of correlations grows slowly at first, then hits an elbow point beyond which the number explodes, which fits the assumption that the correlation structure is sparse. In all three cases, setting ɛ to this elbow point is a safe tradeoff between predictive performance and computational cost. In cases where performance grows consistently (left and right), the elbow point achieves most of the predictive performance gains at a small fraction of the computational cost. For example, on Spouses (right), choosing ɛ = 0.08 achieves a score of 56.6 F1, within one point of the best score, but only takes 8 minutes for parameter estimation. In cases where predictive performance eventually degrades (middle), the elbow point also selects a relatively small number of correlations, giving a 0.7 F1 point improvement and avoiding overfitting.

Performing structure learning for many settings of ɛ is inexpensive, especially since the search needs to be performed only once before tuning the other hyperparameters. On the large number of labeling functions in the Spouses task, structure learning for 25 values of ɛ takes 14 minutes. On CDR, with a smaller number of labeling functions, it takes 30 seconds. Further, if the search is started at a low value of ɛ and increased, it can often be terminated early, when the number of selected correlations reaches a low value. Selecting the elbow point itself is straightforward. We use the point with greatest absolute difference from its neighbors, but more sophisticated schemes can also be applied [43].
Our full optimization algorithm for choosing a modeling strategy and (if necessary) correlations is shown in Algorithm 1.
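The selection procedure described above can be sketched as follows. This is an illustrative reading, not Snorkel's code: "greatest absolute difference from its neighbors" is interpreted here as the largest absolute discrete second difference of the correlation counts, which is an assumption; `learn_structure` is a hypothetical caller-supplied function that returns the collection of correlations selected at a given threshold; and `predicted_adv` stands for the optimizer's predicted modeling advantage computed elsewhere.

```python
def select_elbow_point(structures):
    """Pick the threshold whose correlation count differs most from its neighbors.

    structures: list of (num_correlations, epsilon) pairs ordered by epsilon.
    The score of an interior point is |c[i-1] - 2*c[i] + c[i+1]| (assumed reading).
    """
    if len(structures) < 3:
        raise ValueError("need at least three thresholds to locate an elbow")
    best_eps, best_score = None, -1
    for i in range(1, len(structures) - 1):
        prev_c = structures[i - 1][0]
        cur_c = structures[i][0]
        next_c = structures[i + 1][0]
        score = abs(prev_c - 2 * cur_c + next_c)
        if score > best_score:
            best_eps, best_score = structures[i][1], score
    return best_eps


def choose_modeling_strategy(label_matrix, predicted_adv, gamma, eta, learn_structure):
    """Return ("MV", None) or ("GM", epsilon) following the optimizer's outline."""
    if predicted_adv < gamma:
        return ("MV", None)  # generative model cannot help enough; skip it
    structures = []
    steps = int(round(1 / (2 * eta)))  # thresholds eta, 2*eta, ..., about 0.5
    for i in range(1, steps + 1):
        eps = i * eta
        correlations = learn_structure(label_matrix, eps)
        structures.append((len(correlations), eps))
    return ("GM", select_elbow_point(structures))
```

Because only the counts of selected correlations are inspected, no parameter estimation is run during the search, which matches the text's point that the threshold can be chosen before the other hyperparameters are tuned.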

Algorithm 1 Modeling Strategy Optimizer
Input: Label matrix $\Lambda \in (\mathcal{Y} \cup \{\emptyset\})^{m \times n}$, advantage tolerance $\gamma$, structure search resolution $\eta$
Output: Modeling strategy
  if $\tilde{A}^*(\Lambda) < \gamma$ then
    return MV
  Structures ← [ ]
  for i from 1 to 1/(2η) do
    ɛ ← i · η
    C ← LearnStructure(Λ, ɛ)
    Structures.append(|C|, ɛ)
  ɛ ← SelectElbowPoint(Structures)
  return GM_ɛ

4. EVALUATION

We evaluate Snorkel by drawing on deployments developed in collaboration with users. We report on two real-world deployments and four tasks on open-source data sets representative of other deployments. Our evaluation is designed to support the following three main claims:

Snorkel outperforms distant supervision baselines. In distant supervision [32], one of the most popular forms of weak supervision used in practice, an external knowledge base is heuristically aligned with input data to serve as noisy training labels. By allowing users to easily incorporate a broader, more heterogeneous set of weak supervision sources, Snorkel exceeds models trained via distant supervision by an average of 132%.

Snorkel approaches hand supervision. We see that by writing tens of labeling functions, we were able to approach or match results using hand-labeled training data which took weeks or months to assemble, coming within 2.11% of the F1 score of hand supervision on relation extraction tasks and an average 5.08% accuracy or AUC on cross-modal tasks, for an average 3.60% across all tasks.

Snorkel enables a new interaction paradigm. We measure Snorkel's efficiency and ease-of-use by reporting on a user study of biomedical researchers from across the U.S. These participants learned to write labeling functions to extract relations from news articles as part of a two-day workshop on learning to use Snorkel, and matched or outperformed models trained on hand-labeled training data, showing the efficiency of Snorkel's process even for first-time users.

We now describe our results in detail. First, we describe the six applications that validate our claims.
We then show that Snorkel's generative modeling stage helps to improve the predictive performance of the discriminative model, demonstrating that it is 5.81% more accurate when trained on Snorkel's probabilistic labels versus labels produced by an unweighted average of labeling functions. We also validate that the ability to incorporate many different types of weak supervision incrementally improves results with an ablation study. Finally, we describe the protocol and results of our user study.

4.1 Applications

To evaluate the effectiveness of Snorkel, we consider several real-world deployments and tasks on open-source datasets that are representative of other deployments in information extraction, medical image classification, and crowdsourced sentiment analysis. Summary statistics of the tasks are provided in Table 2.

Table 2: Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task.

Task # LFs % Pos. # Docs # Candidates
Chem ,753 65,398
EHR , ,607
CDR ,272
Spouses ,073 22,195
Radiology ,851 3,851
Crowd

Discriminative Models: One of the key bets in Snorkel's design is that the trend of increasingly powerful, open-source machine learning tools (e.g., models, pre-trained word embeddings and initial layers, automatic tuners, etc.) will only continue to accelerate. To best take advantage of this, Snorkel creates probabilistic training labels for any discriminative model with a standard loss function. In the following experiments, we control for end model selection by using currently popular, standard choices across all settings. For text modalities, we choose a bidirectional long short-term memory (LSTM) sequence model [17], and for the medical image classification task we use a 50-layer ResNet [19] pre-trained on the ImageNet object classification dataset [14].
Both models are implemented in Tensorflow [2] and trained using the Adam optimizer [24], with hyperparameters selected via random grid search using a small labeled development set. Final scores are reported on a held-out labeled test set. See full version for details.

A key takeaway of the following results is that the discriminative model generalizes beyond the heuristics encoded in the labeling functions (as in Example 2.5). In Section 4.1.1, we see that on relation extraction applications the discriminative model improves performance over the generative model primarily by increasing recall by 43.15% on average. In Section 4.1.2, the discriminative model classifies entirely new modalities of data to which the labeling functions cannot be applied.

Relation Extraction from Text

We first focus on four relation extraction tasks on text data, as it is a challenging and common class of problems that is well studied and for which distant supervision is often considered. Predictive performance is summarized in Table 3. We briefly describe each task.

Scientific Articles (Chem): With modern online repositories of scientific literature, such as PubMed for biomedical articles, research results are more accessible than ever before. However, actually extracting fine-grained pieces of information in a structured format and using this data to answer specific questions at scale remains a significant open challenge for researchers. To address this challenge in the
