Extracting Events with Informal Temporal References in Personal Histories in Online Communities

Abstract

We present a distantly supervised system for extracting the dates of illness events from posting histories in an online medical support community. We annotate a corpus containing the complete posting histories of 300 users, with a total of 509 event dates. A temporal tagger is designed to retrieve candidate dates from non-standard language in social media. Our evaluation demonstrates that the temporal tagger outperforms a state-of-the-art baseline tagger. The event extraction system learns the likelihood of candidate dates for events from time-rich sentences and temporal constraints from event-related sentences. Our model achieves 89.7% of the maximum possible performance given the performance of the temporal expression retrieval step.

1 Introduction

In this paper we present a challenging new event date extraction task that requires extraction and resolution of temporal expression forms that are beyond the capabilities of the best state-of-the-art approach. Specifically, it targets extraction of personal illness event mentions in the posting histories of online community participants. In this context, we present a novel event date extraction approach that achieves high accuracy on this difficult new task. At its heart, the technique builds on a temporal expression (TE) retrieval approach that outperforms the most effective previously published approach (Chang and Manning, 2012) in the context of this novel task. Building on this foundation, the event date extraction technique achieves 89.7% of the maximum possible performance given the performance of the TE retrieval step.
Social media enables measurement of human behavior in a longitudinal manner, and thus is an informative tool in the study of how people experience and respond to important events (Brubaker et al., 2012; Choudhury et al., 2013). Extracted user-specific event dates can form a timeline that offers the opportunity to examine an individual's behavior as it changes over time in response to these events. This work has typically involved associating users with events and event temporal information manually (Park and Choi, 2012). Because of the tremendous time involved in such analysis, these investigations have typically adopted small case study methodologies (Wen et al., 2012). Recent work has explored alternative methodologies such as Amazon Mechanical Turk (Choudhury et al., 2013), which may suffer from low recall.

Despite the value of user-specific temporally scoped event information and the considerable research that has been done in temporal information extraction, the existing widely used resources are designed for learning temporally scoped facts about public figures and organizations from newswire or Wikipedia articles (Ji et al., 2011; McClosky and Manning, 2012; Garrido et al., 2012). Applying these approaches in a social media context poses a number of significant challenges that we address in this paper. We begin with the production of an annotated corpus of cancer event (CE) dates from user posts in Breastcancer.org. Next we describe a temporal expression extractor, which is a critical component in most of the existing temporal information extraction work. Our temporal extractor resolves nonstandard and user-specific date references in social media.

While prior work has leveraged predictable ordering constraints between events, illness events may not follow a predictable sequence. For example, in our corpus we find that cancer events such as chemotherapy, radiation, and surgery are recurring and can appear almost random in order.
Thus, it is less straightforward to utilize logical constraints on the temporal relations between different types of events, as has been valuable in previous work (McClosky and Manning, 2012). In our work, we leverage a new source of constraint. In particular, we demonstrate how to learn temporal constraints from the temporal relation that exists between the document creation time (DCT; in this paper, the time of a post) and the event mentioned in the post.

In what follows, we first review related work in Section 2. We then describe the task and the corpus annotation in Sections 3 and 4 respectively. Next, in Section 5, we demonstrate how to tackle the noisiness of social media language and reliably extract the event dates. We also learn time constraints to choose the event date from multiple sources, utilizing both the sentences containing date expressions and sentences in context that may be relevant to the event. The results are presented in Section 6.

2 Related Work

Previous work on TE extraction has focused mainly on newswire, Wikipedia and blogs (Strötgen and Gertz, 2010; Chang and Manning, 2012). It is not surprising that there would be difficulty in applying the same tagger to a different genre such as forum posts (Strötgen and Gertz, 2012; Kolomiyets et al., 2011). This paper presents a rule-based TE extractor that identifies and resolves a wider range of nonstandard TEs than earlier state-of-the-art temporal taggers. Here we review that prior work.

2.1 Temporally Anchored Relation Extraction

Our task is closest to the temporal slot filling track in the TAC-KBP 2011 shared task (Ji et al., 2011) and the timelining task described earlier (McClosky and Manning, 2012). The goal is to extract the temporal bounds of event-based relations. One key difference is that they used newswire, Wikipedia and blogs as data sources from which they extract temporal bounds of facts found in Wikipedia infoboxes or the Freebase database. Similar to McClosky et al. (2012), our system uses the classifier's marginal probabilities along with a constraint component. We also demonstrate that this method performs better than utilizing hard constraints as used in several Temporal KBP slot filling systems (Artiles et al., 2011; Garrido et al., 2011).

2.2 Learning temporal constraints

Temporal constraints have proven to be useful for production of a globally consistent timeline.
In most temporal relation bound extraction systems, the constraints are included as input rather than learned by the system (Talukdar et al., 2012; Wang et al., 2011). A notable exception is McClosky et al. (2012), who developed an approach to learning constraints such as that people cannot attend school if they have not been born yet. Constraints are somewhat less straightforward in our task, however. Since Cancer Events (CEs) can be recurring and appear to have almost a random order, there can be no universal logical constraints on the order of cancer events, such as that Chemotherapy should happen before Mastectomy. Instead, we utilize the temporal relations expressed between events and DCT to infer constraints on the corresponding event dates. Garrido et al. (2012) make use of DCT as well; however, they assume that DCT is within the time range of the event stated in the document. We do not make this assumption since it often does not hold for our data. Instead, we learn the relations as a pair-wise classification task. Similar to existing methods that extract the temporal relations between event and DCT (Chambers et al., 2007; Yoshikawa et al., 2009; UzZaman and Allen, 2010), we also design new features to remove noise.

2.3 Temporal reasoning with medical data

There has been much recent interest in building timelines of medical events from clinical narratives. Zhou and Hripcsak (2007) provide a comprehensive survey of temporal reasoning with clinical data. Extracting precise temporal features for highly accurate temporal ordering of medical events is difficult, as the temporal relationship between medical events is varied and complicated. Raghavan et al. (2012) demonstrate that the resources widely used for learning temporal relations in newswire text do not work well on clinical text.

3 Task

As the foundation for our novel task definition, we first describe the temporal KBP task (Ji et al., 2011).
In that task, one is first given a list of queries, a database of example relations with temporal bounds, and source documents. Queries are named entities with their gold relations but no temporal bounds. The database consists of training entities and their relations with temporal bounds. For example, an entity might be Bill Clinton, with the relation "is US president" and the temporal bound [1993, 2001]. Source documents consist of raw text from news, blogs and Wikipedia articles. For each query, systems are required to output their predicted temporal bounds.

Our task is similar to the Temporal KBP task, although it is more challenging. It is similar in the sense that we are identifying temporal bounds for events mentioned in a source document collection. In our case, the query is the user's userid paired with any one of our CEs. Source documents come from each user's posts and profile. A key difference is that in the KBP task, the set of relevant gold relations is provided as input, and each of those relations should correspond to exactly one event in the source document set, although it may be mentioned more than once. In our task, we provide a set of potential events. However, for each event, there may be zero or more occurrences within each user's history (e.g., a patient may have had chemotherapy more than once), and each event that has occurred may have been mentioned any number of times. Also, the event date may not be identifiable from the source documents, i.e. the user's posts, even if the event was mentioned. If there were zero occurrences of the event or the event date cannot be inferred from the source documents, the system should output unknown. Otherwise, the desired output is the user's extracted event date.

<11/15/2008> I only have a mast on my left side, but I have noticed some pulling recently and I won't start rads until March [Underspecified].
<11/20/2008> It is sloowwwly healing, so slowly, in fact, that she said she HOPES it will be healed by March [Underspecified]. when I am supposed to start rads.
<1/13/2009> I still have one last chemo to go on the 19th [Underspecified] and then start rads in 5 wks [Relative].
<1/31/2009> I go for my first meeting with the rad onc on 2/10 [Underspecified] (my 50th birthday [User-specific]!).
<2/23/2009> I had my first rad today [Relative].
<3/31/2009> Tomorrow [Relative] will be my last full rads, then I will start the boosts.
<4/2/2009> I started rads in Feb [Underspecified]. I just did #29 today [Relative].
<4/8/2009> The rad onc wants to see me again next week [Relative] for a skin check and I am on antibiotics since I am at risk of getting cellulitis since I have had it twice since August [Underspecified].
<6/21/2010> My friend Lisa had her port put in last week [Relative] and will begin 2 weeks [Relative] of radiation on Tuesday [Underspecified].

Figure 1: Excerpts from messages of a user that contain a Radiation event keyword. Cancer event keywords are in bold and date expressions are in italics.
Posts from a Breastcancer.org user may contain scattered mentions of CEs, including conditions, along with some information on when they occurred. Much of the temporal information in the messages is informal, imprecise or embedded in complex relative TEs. Some sample message excerpts are shown in Figure 1. This user begins to post about radiation before she starts it. She first reports planning to start radiation in March, but then reschedules for February. Many Relative and Underspecified TEs are used around the CE keyword instead of Absolute TEs. She also shares her friend's CE story. These all pose challenges for our task. In Sections 3.1 and 3.2, we explain how we define CEs and their temporal representation. In Section 3.3, we illustrate how the TEs in our corpus are different from those in newswire text.

3.1 Cancer event representation

The concept of a CE is different from the events defined in the widely used Timebank corpus for temporal information learning (Pustejovsky et al., 2003). In Timebank, the notion of an event primarily consists of verbs or phrases that denote a change in state. In contrast, in prior work on extraction of Medical Events (MEs) from clinical texts as well as CEs in Breastcancer.org, the target events are more often expressed as nouns than as verbs. For example, previous work on temporal reasoning in clinical texts has defined MEs as temporally associated concepts that describe a medical condition affecting the patient's health, or procedures performed on a patient, such as fever and vancomycin pain control (Zhou and Hripcsak, 2007). CEs are a subset of MEs. While there is prior work on reasoning about MEs that builds on a representation of extracted events (Raghavan et al., 2012), the MEs in this previous work are manually annotated as input to subsequent processes, as opposed to our work, where we extract CEs automatically.
As our ultimate goal is to study users' behavior changes around major events, we focus specifically on eight common breast cancer events that are known to be stressful experiences. The CEs fall into three groups shown below.

Status change event: breast cancer Diagnosis, Metastasis and Recurrence. These are state-changing events associated with the moment when the news is declared to the patient.
One-day event: Mastectomy, Lumpectomy, Reconstruction. These three events are all breast cancer-related surgeries, so they typically start and finish on the same day.
Event with a temporal bound: Chemotherapy and Radiation. Both of these events are common cancer treatments which last from several weeks to several months.

For each CE, we manually design an event keyword set. The keyword set includes the name of the event, abbreviations, slang, aliases and other related words. For example, the Reconstruction keyword set contains the common surgery type names, such as DIEP (deep inferior epigastric perforator flap breast reconstruction). By creating an analogous keyword set, our event date extraction method could be easily adapted to other datasets.

3.2 Temporal representation of cancer events

In Temporal KBP, the temporal representation allows for upper and lower bounds on both the event start and end: $\langle s_l, s_u, e_l, e_u \rangle$, where $s_l \le start \le s_u$ and $e_l \le end \le e_u$. However, it is difficult to obtain these bounds from the posts even for humans. Furthermore, in our case, there are no analogous resources like Freebase or Wikipedia infoboxes to obtain additional data from, as used in earlier KBP work. In contrast, McClosky et al. (2012) opted for a simpler representation of the form <start, end>. We adopted a similarly simple representation that is in the same spirit as the time duration based representation of MEs used earlier (Zhou and Hripcsak, 2007). The start and end dates are the same for a one-day event, so we only extract a single date in that case. For a status change event, the end date is infinitely far into the future, or at least until death, so we only extract the start date (this is called a semi-interval problem by Zhou and Hripcsak (2007)). For events with a time bound, we transform them into one-day events by treating the start and the end of the event as two separate events. For example, the start and end of chemotherapy are regarded as two separate events. Then for each event we extract a single date.

3.3 Unusual date expressions in our corpus

The distribution of TE types found in social media is different from that in newswire text. Table 1 presents a classification of TEs, which we classify according to the way we resolve the TE to a calendar date (year and month). An Absolute TE does not need resolution. For an Underspecified TE, we need to resolve the underspecified part. Relative TEs need to be resolved with a reference date. User-specific TEs can be resolved with user-specific information, such as birth date. Our TE retrieval system extracted 195,143 TEs from our entire corpus. According to Pustejovsky et al. (2003), in newswire and Wikipedia articles, the frequency of DATE TEs (which correspond to Absolute and Underspecified in Table 1) is 68.5%, and thus the simplest type of TE to resolve constitutes the bulk of temporal references.
However, they only account for 48% in our corpus. Relative or user-specific dates, which are more challenging to resolve, are slightly more prevalent than this set in our data (e.g., "last month" rather than "on July 3rd, 2007"). In particular, user-specific TEs such as the user's age or cancer anniversary account for 5% of the data. It is especially common that the user specifies her age at diagnosis. Many of the TEs are written in a casual way, even Absolute TEs. For example, "Feb 11", "Feb/11", "02/2011" and "(in) 2/11" all represent the same date, February 2011. When using Relative TEs, year, week and month can be abbreviated as yr, wk and mo(n). To the best of our knowledge, these TEs are not recognized by any temporal tagger available.

Type of TE       Example            Percentage
Absolute         in July
Underspecified   in July            24
Relative         two months ago     47
User-specific    at the age of 57   5

Table 1: Types of temporal expression (TE) in our corpus.

4 Corpus Annotation

Our research site, Breastcancer.org, is one of the most popular breast cancer discussion boards online. We have collected all the public posts, users, and their profiles on the discussion board platform from October 2001 to January. During this period there were a total of 90,242 unique users who posted 1,562,459 messages. From this set of posts we extracted and then annotated two separate corpora, one for temporal expression retrieval and the other for event date extraction (corpus link will be provided in full paper).

The temporal expression corpus consists of 1,000 posts. As each individual user tends to use a narrow range of date expression forms, we randomly draw 1,000 users and randomly pick one post from each user in order to cover a wider variety of temporal expressions. We manually extracted 601 date expressions from these posts and resolved them to a specific month and year, or just a year if the month is not mentioned.

Our corpus for event date extraction consists of 300 users that are randomly drawn from our dataset.
In total, 509 events are annotated with dates. Each user posted a mean of 54 messages in the forum. The granularity of resolved temporal references is the month. The annotators were provided with guidelines for how to infer the date of the event (coding manual link will be provided in full paper). The annotators were instructed to read every post of the user and extract the event dates from the most precise report. For example, when extracting a user's diagnosis date, suppose the user has only one post about her diagnosis, which says: <6/10/2010> I was diagnosed last year. Then the diagnosis date will be 2009. But suppose the user has another post which contains more precise information about her diagnosis date: <5/20/2009> I was diagnosed two weeks ago. Then the diagnosis date should be 5/2009.

To measure the annotation agreement, three annotators annotated the complete posting histories of 10 users (23 event dates were extracted in total). We evaluated the reliability of the hand extraction by first listing for each annotator the complete set of events extracted or not extracted for each user. We then computed Cohen's Kappa between these lists of binary indicators for each pair of annotators and averaged across pairs. The average inter-annotator agreement for this binary indicator was 0.94 Kappa. Then, for all events that each pair of annotators agreed were present for a user, we computed the amount of time between the extracted and resolved date and the post time of the user's first post. Similar to prior work evaluating agreement on temporal extraction, we then computed Cronbach's alpha between these continuous scales, with one assignment per event, across annotators. Again, the agreement was quite high, with an alpha of 0.99. Thus, we concluded that the temporal dependency annotation paradigm was reliable.

5 Method

An overview of our system is shown in Figure 2. Given an event and a user's post set, the system searches for all of the sentences that contain an event keyword (keyword sentences) and all the sentences that contain both a keyword and a temporal expression (date sentences). The TEs in the date sentences are resolved and then used as candidate dates for the event. This process is sometimes known as distant supervision (Mintz et al., 2009).

When users are chatting on the forums, they frequently use misspelled or mistyped words, such as "Feburuary" and "Febuary". Medical terms like metastasis and mastectomy are even more likely to be spelled wrong. Thus, as a pre-processing step, we use the spell checker in LanguageTool to correct spelling in each sentence.

Our model contains two main components. The classifier component learns from the date sentences and determines how likely each candidate TE is to be the correct event date. The constraint component learns temporal constraints over event dates. This is done by learning to predict the temporal relation between an event and a keyword sentence's DCT. The two components work together to find a globally optimal event date.

5.1 Non-standard temporal expression retrieval

We recognize four types of time expressions as illustrated in Table 1.
We can see that it is crucial to resolve the underspecified and relative TEs correctly since they occur so frequently.

5.1.1 Temporal expression identification

There are a number of open source temporal expression retrieval systems available, such as HeidelTime (Strötgen and Gertz, 2010) and SUTime (Chang and Manning, 2012). However, most of the systems are designed for newswire texts and cannot handle informal language in online forums well. We design a rule-based date expression tagger that is built using regular expression patterns. In Section 6.1, we compare its performance with SUTime, which has outperformed other state-of-the-art systems on the TempEval-2 corpus (Chang and Manning, 2012). SUTime identifies and resolves a wide range of non-standard TE types such as "Feb 07" (2/2007). The additional types of non-standard TE our approach handles are listed as follows.

Figure 2: System overview. Components shown: Posts, Date Sent., Keyword Sent., Classifier Component, Constraint Component, Aggregation, Event date.

User-specific TEs: A user's age, cancer anniversary and survivorship can provide temporal information about the user's CEs. We obtain the birth date of users from their personal profile to resolve age date expressions such as "at the age of 57". Anniversary and survivorship expressions usually indicate the number of years from diagnosis to DCT. For example, "Today is my third breast cancer anniversary" is equivalent to "I'm three years from my diagnosis".
Non-whole numbers: such as "a year and half", "two and a half years", "1/2 years", "1.5 months".
Abbreviations of time units: Informally, people use wk as the abbreviation of week, yr for year, and mo(n) for month.
Other non-standard TEs: including "Feb/97" (2/1997), "03/2005" (3/2005), "in 6/08" (6/2008), "on 6/13" (June of the DCT year), "in 06" (2006), "September 08" (9/2008) and "mid-June".

5.1.2 Temporal expression normalization

For underspecified or relative dates, DCT is used to resolve these dates to fully specified expressions if possible.
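As a rough illustration of the kind of regular expression patterns listed above, the following Python sketch recognizes a few of the non-standard forms. It is not the tagger evaluated in this paper: the pattern set, the function names (`tag_dates`, `expand_units`, `expand_year`) and the two-digit-year pivot of 30 are all assumptions made for the example.

```python
import re

MONTHS = {name: i + 1 for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

# Month name plus year: "Feb/97", "Feb 07", "September 08".
NAME_YEAR = re.compile(
    r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/ ](\d{4}|\d{2})\b",
    re.IGNORECASE)
# Numeric month plus four-digit year: "03/2005".
NUM_YEAR4 = re.compile(r"(?<![\d/])(0?[1-9]|1[0-2])/(\d{4})\b")
# "in 6/08": the preposition "in" is taken to signal month/year.
IN_NUM_YEAR2 = re.compile(r"\bin\s+(0?[1-9]|1[0-2])/(\d{2})\b", re.IGNORECASE)
# "on 6/13": the preposition "on" is taken to signal month/day,
# so the year is borrowed from the DCT.
ON_MONTH_DAY = re.compile(r"\bon\s+(0?[1-9]|1[0-2])/(\d{1,2})\b", re.IGNORECASE)
# Abbreviated time units: wk, yr, mo(n), optionally plural.
UNIT_ABBREV = re.compile(r"\b(wk|yr|mon|mo)(s?)\b", re.IGNORECASE)
UNIT_FULL = {"wk": "week", "yr": "year", "mon": "month", "mo": "month"}


def expand_year(digits):
    """Expand a 2- or 4-digit year string (pivot of 30 is an assumption)."""
    year = int(digits)
    if len(digits) == 4:
        return year
    return 2000 + year if year < 30 else 1900 + year


def tag_dates(text, dct_year):
    """Return (surface form, month, year) triples found in text."""
    found = []
    for m in NAME_YEAR.finditer(text):
        found.append((m.group(0), MONTHS[m.group(1).lower()],
                      expand_year(m.group(2))))
    for m in NUM_YEAR4.finditer(text):
        found.append((m.group(0), int(m.group(1)), int(m.group(2))))
    for m in IN_NUM_YEAR2.finditer(text):
        found.append((m.group(0), int(m.group(1)), expand_year(m.group(2))))
    for m in ON_MONTH_DAY.finditer(text):
        found.append((m.group(0), int(m.group(1)), dct_year))
    return found


def expand_units(text):
    """Rewrite wk/yr/mo(n) as full unit names so later rules can match them."""
    return UNIT_ABBREV.sub(
        lambda m: UNIT_FULL[m.group(1).lower()] + m.group(2), text)
```

A real tagger would need many more patterns and disambiguation rules (for instance, "Mon" can also abbreviate Monday); the sketch only shows how the preposition and the DCT year can steer the reading of an ambiguous form such as "6/08" versus "6/13".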
As underspecified or relative dates are more prevalent than absolute times in our data (e.g., "last month" rather than "on July 3rd, 2007"), our temporal expression retrieval method supports the following two types of resolution.

Fuzzy date: We have noticed that different temporal expression types are mentioned with different accuracies. Because users are often not precise when talking about an event date, an underspecified TE activates a range of dates instead of a single date. When we resolve a date expression, we also define a time window according to the date expression type that captures the relevant range of dates. For example, (1) <6/10/2010> I was diagnosed two years ago. (2) <6/10/2010> I was diagnosed last summer. In sentence (1), her diagnosis date is actually 7/2008, not 6/10/2008. So in this case, only the year information is accurate. The time window is [3/2010, 9/2010]. In sentence (2), the true date should be one of June, July or August. The time window is [6/2010, 8/2010].

Underspecified month: For underspecified month mentions, we resolve the year information according to the DCT month, the mentioned month and the verb tense. If there is no verb in the sentence, the default tense is past tense. For example, <1/16/2011> I was diagnosed in May. As the verb "was" is in past tense, we can infer that the user refers to May of 2010, since May of 2011 is a date in the future. However, when the mentioned month is earlier than the DCT month, the user is usually referring to a month in the same year as the DCT. For example, the resolved date is 5/2011 for the sentence: <7/16/2011> I was diagnosed in May.

5.2 Classifier component

In the classifier component, we predict how likely a retrieved TE is to be the event date. We create training examples by computing the metarelation between each TE and its gold event date. We allow for underspecified matches. Thus, the metarelation is true if the gold date falls into the window of the TE (see Section 5.1.2). Otherwise, the TE is assigned false. For example, suppose the resolved date is 9/15/2001 and the time window is [7/15/2001, 11/15/2001]. If the gold date is 8/2001, we assign the true metarelation. But if the gold date is 1/2001, the TE 9/15/2001 is assigned false. We train a MaxEnt classifier on each TE in a date sentence.
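The month resolution rule and the window-based metarelation described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the system's code: dates are simplified to (year, month) tuples, the function names are hypothetical, and both the future-tense branch and the handling of a mentioned month equal to the DCT month are assumptions, since the text only spells out the past-tense cases.

```python
def resolve_month(month, dct_year, dct_month, tense="past"):
    """Resolve an underspecified month mention to a (year, month) tuple.

    Past tense (also the default when the sentence has no verb): a month
    later than the DCT month must belong to the previous year.
    Future tense: mirrored rule (assumption, not stated in the text).
    """
    if tense == "past":
        year = dct_year if month <= dct_month else dct_year - 1
    else:
        year = dct_year if month >= dct_month else dct_year + 1
    return (year, month)


def metarelation(gold, window):
    """True if the gold (year, month) falls inside the TE's time window."""
    start, end = window
    return start <= gold <= end
```

For the examples in the text, `resolve_month(5, 2011, 1)` gives May 2010 and `resolve_month(5, 2011, 7)` gives May 2011; with the window [7/2001, 11/2001], a gold date of 8/2001 yields a true metarelation and 1/2001 yields false.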
Features for the classifier include many of those in (McClosky and Manning, 2012; Yoshikawa et al., 2009) (Table 2). We use the Stanford Parser to generate the syntactic features. The new features are the Event-Subject, Negative and Modality features. In online support groups, users not only tell stories about themselves, they also share other patients' stories (as shown in Figure 1). So we add subject features to remove this kind of noise; these include the governing subject of the event keyword and its POS tag. Modality features include the appearance of modals before the event keyword (e.g., may, might, etc.). Negative features include the presence or absence of negative words (e.g., no, not, never, etc.). These two features indicate a hypothetical or counter-factual expression of the event. These features are not useful in newswire texts, where there is little variation in either modality or polarity. As Chambers et al. (2007) have pointed out, the majority class baseline of modality and polarity in that context is 98%.

Let $DS_u$ be the set of the user's date sentences, and let $D_u$ be the set of dates resolved from each TE. We represent a MaxEnt classifier by $P_{relation}(R \mid t, ds)$ for a candidate date $t$ in date sentence $ds$ and possible relation $R \in \{true, false\}$. We map this distribution over relations to a distribution over dates by defining $P_{DateSentence}(t \mid DS_u)$:

$$P_{DateSentence}(t \mid DS_u) = \frac{1}{Z(D_u)} \sum_{t_j \in D_u} \delta_{t_j}(t) \, P_{relation}(true \mid t_j, ds_j)$$

$$\delta_{t_j}(t) = \begin{cases} 1 & \text{if } t \in t_j\text{'s window} \\ 0 & \text{otherwise} \end{cases}$$

We refer to this model as the Date Classifier model, as it only utilizes temporal information from date sentences.

5.3 Constraint component

Previous work only retrieves time-rich sentences. In other words, the sentence should include the query, slot fillers, as well as some TEs (Ji et al., 2011; McClosky and Manning, 2012; Garrido et al., 2012). However, keyword sentences can inform temporal constraints for events and therefore should not be ignored.
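To make the Date Classifier aggregation of Section 5.2 concrete, the sketch below computes $P_{DateSentence}$ from per-TE classifier outputs. It is a minimal illustration under stated assumptions: dates are (year, month) tuples, the classifier probabilities are supplied as plain numbers rather than produced by a trained MaxEnt model, and the function name is hypothetical.

```python
def date_sentence_distribution(candidates, true_probs, windows):
    """Distribution over candidate dates from date-sentence evidence.

    candidates[j] : resolved date t_j of the j-th TE, as (year, month)
    true_probs[j] : P_relation(true | t_j, ds_j) from the classifier
    windows[j]    : (start, end) time window of the j-th TE
    """
    scores = {}
    for t in set(candidates):
        # delta_{t_j}(t) = 1 exactly when t lies inside t_j's window.
        scores[t] = sum(p for p, (start, end) in zip(true_probs, windows)
                        if start <= t <= end)
    z = sum(scores.values())  # normalizer Z(D_u)
    if z == 0:
        return {}
    return {t: s / z for t, s in scores.items()}
```

Because every TE whose window covers a date contributes to that date's score, repeated and fuzzy mentions of the same month reinforce one another, which is the redundancy the model is meant to exploit.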
Keyword sentences that contain no TE can still be informative. For example, "Well, I'm officially a rads grad!" indicates that the user has finished radiation by the time of the post, while "Radiation is not a choice for me." indicates that the user probably never had radiation. Before chemotherapy starts, users tend to talk about choices of drug combinations. After chemotherapy, they tend to talk about side effects such as increased weight or hair loss. This section departs from the above Date Classifier and instead predicts whether each keyword sentence is posted before or after the user's event date. The goal is to automatically learn time constraints for the event. We create training examples by computing the temporal relation between the DCT and the user's gold event date. If the user is not identified with an event date, the sentences are labeled as unknown. We train a MaxEnt classifier on each pair of event mention and its corresponding DCT. Features are listed in Table 2. All the features used in the classifier component that are not related to the TEs are included.
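The training labels for the constraint classifier follow directly from this description. A minimal sketch, assuming dates are compared at month granularity as (year, month) tuples and that a post made in the same month as the event counts as "after"; the function name is hypothetical:

```python
def label_keyword_sentence(dct, gold_date):
    """Training label for one keyword sentence.

    dct       : (year, month) of the post containing the sentence
    gold_date : (year, month) of the user's gold event date, or None
                when the user has no annotated date for this event
    """
    if gold_date is None:
        return "unknown"
    return "before" if dct < gold_date else "after"
```

With a gold Radiation start date of 2/2009, a post from 1/2009 would be labeled "before" and a post from 4/2009 would be labeled "after".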

Let $KS_u$ be the set of the user's keyword sentences, and let $D_u$ be the set of dates resolved from each date sentence. We represent a MaxEnt classifier by $P_{relation}(R \mid ks)$ for a keyword sentence $ks$ and possible relation $R \in \{before, after, unknown\}$. DCT is the post time of the keyword sentence $ks$. The $rel(dct, t)$ function simply determines whether the DCT is before or after the candidate date $t$. We map this distribution over relations to a distribution over dates by defining $P_{KeywordSentence}(t \mid KS_u)$:

$$P_{KeywordSentence}(t \mid KS_u) = \frac{1}{Z(D_u)} \sum_{ks_j \in KS_u} P_{relation}(rel(dct_j, t) \mid ks_j)$$

$$rel(dct, t) = \begin{cases} before & \text{if } dct < t \\ after & \text{if } dct \ge t \end{cases}$$

Feature                          Classifier Component   Constraint Component
Event-keyword                    X                      X
Event-gov-verb                   X                      X
  (tense)                        X                      X
Event-Subject                    X                      X
  (POS)                          X                      X
Date-type                        X
Date-gov-verb                    X
  (tense)                        X
Date-gov-prep                    X
Unigram (POS)                    X                      X
Bigram (POS)                     X                      X
Dependency path (Date-Keyword)   X
  Path length                    X
  POS-path                       X
  Word-path                      X
Negative                         X                      X
Modality                         X                      X

Table 2: Features. Event-gov-verb denotes the verb that dominates the event keyword. Date-gov-prep denotes the preposition that dominates the temporal expression.

5.4 Date Aggregation

Given the classifier component of Section 5.2 and the constraint classifiers of Section 5.3, we create a joint model combining the two with the following linear interpolation:

$$P(t \mid posts_u) = \lambda P_{DateSentence}(t \mid DS_u) + (1 - \lambda) P_{KeywordSentence}(t \mid KS_u)$$

where $t$ is a candidate event date. The system outputs the $t$ that maximizes $P(t \mid posts_u)$, and unknown if $DS_u$ is empty.

6 Evaluation metric and results

6.1 Date expression retrieval

We first compare our temporal tagger's performance with SUTime (Chang and Manning, 2012) on the 601 manually extracted TEs. Here we exclude the User-specific TEs. The evaluation consists of two parts: identifying the extents of a TE and then providing the correct resolved date for each recognized expression. Table 3 gives the experimental results. Our tagger has both higher precision and recall in both tasks.
The improvement in recall is especially high because we handle the non-standard TEs described in Section 3.3.

Table 3: Date expression retrieval results (P, R and F1 for SUTime and our tagger, on Extents and on Normalization).

6.2 Event dates

Our corpus consists of 250 users for training and 50 for test. Each user has approximately 1.7 extracted event dates on average. λ was set to 0.7 by maximizing accuracy in five-fold cross-validation on the training set.

6.2.1 Evaluation metric

The extracted date is only considered correct if it completely matches the gold date. In several cases, we have multiple dates for the same event (e.g., a user had a mastectomy twice). Similar to the evaluation metric in a previous study (Ji et al., 2011), in these cases we give the system the benefit of the doubt, and the extracted date is considered correct if it matches one of the gold dates. In previous work (McClosky and Manning, 2012; Ji et al., 2011), the evaluation metric score is defined as $1/(1 + |d|)$, where $d$ is the difference between the values in years. We choose a much stricter evaluation metric because we need a precise event date to study user behavior changes.

6.2.2 Baselines and oracle

We provide two baselines to describe heuristic methods of aggregating the hard decisions from the classifier learned in Section 5.3. The first baseline, Baseline1, is to pick the date with the highest classifier prediction confidence. The second baseline, Baseline2, is along the same lines as the Combined Classifier used in (McClosky and Manning, 2012). For example, if the candidate date is 6/2009 and we have retrieved two temporal expressions that are resolved to 6/2009 and 4/2008, then

$$P(\text{6/2009}) = P_{relation}(true \mid \text{6/2009}) \cdot P_{relation}(false \mid \text{4/2008})$$

To determine the best possible performance given our temporal expression retrieval system, we calculate the oracle score by considering an extraction correct if the gold date is one of the retrieved candidate dates.
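The difference between the strict metric adopted here, the year-based score of previous work, and the oracle check can be seen in a small sketch. The function names are hypothetical and dates are again simplified to (year, month) tuples.

```python
def strict_correct(predicted, gold_dates):
    """Strict metric: the prediction must equal one of the gold dates."""
    return predicted in gold_dates


def year_distance_score(predicted_year, gold_year):
    """Score 1 / (1 + |d|) from previous work, with d in years."""
    return 1.0 / (1 + abs(predicted_year - gold_year))


def oracle_correct(candidate_dates, gold_dates):
    """Oracle: correct whenever any retrieved candidate is a gold date."""
    return any(c in gold_dates for c in candidate_dates)
```

Under the year-based score, predicting 2008 for a 2009 event still earns 0.5, whereas the strict metric gives no credit for a date that is off by even one month.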
The oracle score can differ from a perfect score since we can only use candidate temporal expressions if (a) the relation is known, (b) mentions of the event are retrievable (i.e., contain an event keyword), (c) the temporal expression appears in the same sentence, and (d) our temporal tagger is able to recognize and resolve it correctly. Nevertheless, it is a reasonable upper bound in our setting.

6.2.3 Results

We present the performance of our models, baselines and the oracle in Table 5. We evaluate on all the events (ALL) and only on the events with known dates (GOLD). When the gold relations are provided, precision and recall are equal, so we report only the F1 measure. The Joint Model performs best among the non-oracle systems. Both the Date Classifier and the Joint Model outperform the basic aggregation models. All improvements are statistically significant (McNemar's test, 2-tailed). This shows the value of our approach of leveraging the redundancy of event date mentions instead of incorporating all marginal probabilities. Incorporating time constraints further improves the F1. The Joint Model achieves 89.7% of the oracle result.

Table 5: Performance of systems on the test set (P, R and F1 on ALL and F1 on GOLD for Baseline1, Baseline2, the Date Classifier, the Joint Model and the Oracle).

Table 4 shows the performance of our systems and baselines on individual event types. The Joint Model derives most of its improvement from performance on the Chemotherapy and Radiation start dates. This is mainly because Chemotherapy and Radiation last for a period of time, so there are more event-related discussions containing the event keyword. None of our systems improves on cancer Metastasis and Recurrence. This is likely due to the sparsity of these events.

Table 4: Event-level five-fold cross-validation performance of models and baselines on training data (event count, then P, R and F1 for Baseline1, Baseline2, the Date Classifier and the Joint Model, and F1 for the Oracle, over the event types Diagnosis, Metastasis, Recurrence, Chemo-start, Chemo-end, Rad-start, Rad-end, Mastectomy, Lumpectomy and Reconstruction).
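The Joint Model evaluated above combines a date-sentence distribution with a keyword-sentence distribution (Sections 5.3 and 5.4). The following is a minimal sketch of that aggregation step, not the authors' implementation. It assumes that dates and DCTs are `(year, month)` tuples, that the classifier outputs are already available as plain dictionaries, and that the keyword-sentence term is a normalized product over sentences. The function names and the uniform fallback when all products are zero are illustrative assumptions.

```python
from math import prod  # Python 3.8+


def rel(dct, t):
    """Return 'before' if the post time precedes candidate date t, else 'after'."""
    return "before" if dct < t else "after"


def keyword_sentence_dist(candidates, keyword_sentences):
    """Sketch of P_KeywordSentence(t | KS_u).

    keyword_sentences is a list of (dct, p_relation) pairs, where p_relation
    maps 'before' / 'after' / 'unknown' to a probability (assumed to come
    from a relation classifier that is not implemented here).
    """
    scores = {
        t: prod(p_rel[rel(dct, t)] for dct, p_rel in keyword_sentences)
        for t in candidates
    }
    z = sum(scores.values())  # normalizer over the candidate dates
    if z == 0:
        # Assumption: fall back to a uniform distribution if every product is zero.
        return {t: 1.0 / len(scores) for t in scores}
    return {t: s / z for t, s in scores.items()}


def joint_date(p_date_sentence, keyword_sentences, lam=0.7):
    """Linear interpolation of the two components; 'unknown' if there are no candidates."""
    if not p_date_sentence:
        return "unknown"
    p_kw = keyword_sentence_dist(list(p_date_sentence), keyword_sentences)
    joint = {
        t: lam * p_date_sentence[t] + (1 - lam) * p_kw[t]
        for t in p_date_sentence
    }
    return max(joint, key=joint.get)


if __name__ == "__main__":
    # Two candidate dates that the date-sentence component cannot separate.
    ds = {(2009, 6): 0.5, (2008, 4): 0.5}
    # One keyword sentence posted in 9/2008 that most likely follows the event.
    ks = [((2008, 9), {"before": 0.1, "after": 0.8, "unknown": 0.1})]
    print(joint_date(ds, ks))  # (2008, 4)
```

In this toy case the date-sentence component is undecided, and the temporal constraint from the keyword sentence tips the decision toward the candidate that precedes the post time.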
7 Error Analysis and Discussion

Distant supervision assumption: In our distant supervision framework we assume that a DE must appear in the same sentence as a cancer keyword in order to become a candidate date. This assumption often does not hold. For example, users can use a metaphor to refer to the event: "My roller coaster ride began October 9th". From the context, we can infer that this user was diagnosed on October 9th.

Deep temporal reasoning: Under-specified and Relative TEs often require deep temporal reasoning in order to be resolved. For example: "<5/20/2010> Dx with 7.5 cm tumor same side, stage IIb in Sept 08, mastectomy in Oct, and started AC in Nov, and Taxol in Jan 09." The TE "in Oct" should be resolved as 10/2008. But if we resolve it based only on the DCT (5/20/2010), the resolved October falls in a later year.

8 Conclusion

In this paper we presented a challenging new event date extraction task that requires extraction and resolution of non-standard date expression forms, namely extraction of personal illness event dates in the posting histories of online community participants. We have constructed an evaluation corpus and designed a temporal tagger for non-standard date expressions in social media. We introduced a joint framework to learn and apply temporal constraints in the date aggregation process. Both of our models (Date and Joint) obtain substantial error reductions over simpler consistency-based heuristic approaches. By creating an analogous keyword set, the overall framework can easily be applied to other information extraction tasks.

A key aspect of future work will be to extend our model to other social media such as Facebook posts. Another line of study is to use the extracted timelines for user behavior analysis where qualitative analysis at a large scale is difficult or impossible.
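The deep temporal reasoning case from the error analysis ("Sept 08 ... in Oct ... in Nov") can be made concrete with a small sketch. It contrasts two hypothetical resolution rules for a month-only expression: one that anchors on the DCT and picks the most recent past occurrence of the month, and one that anchors on the previously resolved date in the same sentence. Neither rule is taken from the paper's tagger. Both, along with the function names and the `(year, month)` representation, are assumptions made purely for illustration.

```python
def resolve_by_dct(month, dct):
    """Hypothetical rule: most recent occurrence of `month` on or before the DCT."""
    dct_year, dct_month = dct
    year = dct_year if month <= dct_month else dct_year - 1
    return (year, month)


def resolve_by_anchor(month, anchor):
    """Hypothetical rule: first occurrence of `month` on or after a preceding date."""
    anchor_year, anchor_month = anchor
    year = anchor_year if month >= anchor_month else anchor_year + 1
    return (year, month)


def resolve_chain(first, months):
    """Resolve a narrative sequence of month-only mentions, each anchored on the last."""
    resolved = [first]
    for month in months:
        resolved.append(resolve_by_anchor(month, resolved[-1]))
    return resolved


if __name__ == "__main__":
    dct = (2010, 5)      # post written on 5/20/2010
    sept_08 = (2008, 9)  # fully specified mention "Sept 08"
    # "in Oct" and "in Nov" anchored on the preceding mention in the sentence
    print(resolve_chain(sept_08, [10, 11]))  # [(2008, 9), (2008, 10), (2008, 11)]
    # "in Oct" anchored on the DCT alone lands in the wrong year
    print(resolve_by_dct(10, dct))  # (2009, 10)
```

Anchoring each under-specified mention on the previous resolved date keeps the narrative order intact, which is the kind of reasoning a DCT-only rule cannot provide.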

References

Javier Artiles, Qi Li, Taylor Cassidy, Suzanne Tamang, and Heng Ji. CUNY BLENDER TAC-KBP2011 Temporal Slot Filling System Description. In Proceedings of Text Analysis Conference (TAC).

J. R. Brubaker, F. Kivran-Swaine, L. Taber, and G. R. Hayes. Grief-Stricken in a Crowd: The language of bereavement and distress in social media. In Proc. ICWSM.

Nathanael Chambers, Shan Wang, and Dan Jurafsky. Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Angel X. Chang and Christopher D. Manning. SUTIME: A library for recognizing and normalizing time expressions. In 8th International Conference on Language Resources and Evaluation (LREC 2012).

M. De Choudhury, S. Counts, and E. Horvitz. Major Life Changes and Behavioral Markers in Social Media: Case of Childbirth. In Proc. CSCW.

K. P. Davison, J. W. Pennebaker, and S. S. Dickerson. Who talks? The social psychology of illness support groups. American Psychologist, 55(2).

Guillermo Garrido, Anselmo Penas, Bernardo Cabaleiro, and Alvaro Rodrigo. Temporally Anchored Relation Extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

Heng Ji, Ralph Grishman, and Hoa Trang Dang. Overview of the TAC 2011 Knowledge Base Population track. In Proceedings of Text Analysis Conference (TAC).

Hyuckchul Jung, James Allen, Nate Blaylock, Will de Beaumont, Lucian Galescu, and Mary Swift. Building timelines from narrative clinical records: initial results based-on deep natural language understanding. In Proceedings of BioNLP.

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. Model-Portability Experiments for Textual Temporal Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. Extracting narrative timelines as temporal dependency structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

David McClosky and Christopher D. Manning. Learning Constraints for Consistent Timeline Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP 2012).

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics.

Heekyong Park and Jinwook Choi. V-model: a new innovative model to chronologically visualize narrative clinical texts. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems. ACM.

Catherine Plaisant, Brett Milash, Anne Rose, Seth Widoff, and Ben Shneiderman. LifeLines: visualizing personal histories. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

James Pustejovsky, José M. Castaño, Robert Ingria, Roser Sauri, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. TimeML: Robust specification of event and temporal expressions in text. In New Directions in Question Answering 03.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, et al. The Timebank corpus. In Corpus Linguistics.

Preethi Raghavan, Eric Fosler-Lussier, and Albert M. Lai. Learning to Temporally Order Medical Events in Clinical Text. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

Jannik Strotgen and Michael Gertz. HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. In SemEval 10.

Jannik Strotgen and Michael Gertz. Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In LREC 2012.

Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell. Coupled temporal scoping of relational facts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. ACM.

Naushad UzZaman and James F. Allen. TRIPS and TRIOS system for TempEval-2: Extracting temporal information from text. In Proceedings of the 5th International Workshop on Semantic Evaluation.

Yafang Wang, Bing Yang, Lizhen Qu, Marc Spaniol, and Gerhard Weikum. Harvesting facts from textual web sources by constrained label propagation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management.

K.-Y. Wen, F. McTavish, G. Kreps, M. Wise, and D. Gustafson. From diagnosis to death: A case study of coping with breast cancer as seen through online discussion group messages. Journal of Computer-Mediated Communication, 16.

Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. Jointly identifying temporal relations with Markov logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.

Li Zhou and George Hripcsak. Temporal reasoning with medical data: a review with emphasis on medical natural language processing. Journal of Biomedical Informatics 40.2 (2007): 183.

