University Depression Rankings Using Twitter Data


Dept. of CIS - Senior Design

Ashwin Baweja (ashwinb@seas.upenn.edu), Jason Kong (jkon@seas.upenn.edu), Yaou Wang (yaouwang@seas.upenn.edu), Tommy Pan Fang (tommyp@seas.upenn.edu)
Univ. of Pennsylvania, Philadelphia, PA

Advisors: H. Andrew Schwartz (hansens@seas.upenn.edu) and Chris Callison-Burch (ccb@cis.upenn.edu)

ABSTRACT

With the rise of social media, university rankings are playing an increasingly influential role in the selection process for prospective university students. Simultaneously, mental health has risen to the forefront of discussions among universities nationwide, in light of calls for increased mental illness awareness. Previous attempts at formulating rankings of schools' happiness and mental illness have centered on paper or electronic surveys taken by only a small fraction of the student body. Our work posits a new methodology for constructing college depression rankings through analysis of the language used by students on social media platforms. Using a corpus of 78 million Tweets generated from September 2014 to March 2015 and leveraging existing research into depression language analysis, we produce a set of meaningful rankings comparing depression among schools.

1. INTRODUCTION

In this paper, we propose a novel approach to ranking universities along the dimension of depression. Such rankings not only influence the prestige of a university but also the decisions of high school students when determining where to spend the next four years of their lives. However, current methodologies for computing depression rankings are neither robust nor scalable. Our approach leverages social media data and existing depression models to produce rankings that offer significant improvements in robustness and scalability. The resulting rankings provide key insights into correlations between depression at universities and characteristics of those universities, such as the prevalence of a pre-professional culture. Finally, although our rankings are specific to depression, our approach can be generalized and applied to other areas as well. For example, our method could be used to rank universities along the dimension of health. Alternatively, the model could be refined to make the rankings more fine-grained (e.g., at the student-group level) or more coarse (e.g., at the region level).

2. BACKGROUND

For universities, rankings play an important role in influencing prestige, endowments, administrative decisions, and, perhaps most importantly, the college choices of prospective students in high school. One set of rankings rising in importance is happiness rankings, listings of universities based on the purported happiness of their students. With mental health awareness rising to the forefront of national attention through frequent and high-profile suicides, students and administrators are pushing for more effort in monitoring mental health and depression levels at their respective universities. Accompanying the additional importance placed on college rankings has been an increased number of published rankings by today's media. Joining established and recognized publications such as The U.S. News and World Report [8] and The Princeton Review in creating rankings are up-and-coming viral media websites such as BuzzFeed and The Huffington Post.
Rather surprisingly, the methodology used by these publishers to construct such happiness rankings has not kept up with the swell in technology that has led to their greater prominence. According to writers at The Princeton Review, their methodology for constructing their annual set of rankings consists of distributing an 83-question questionnaire to university students through a physical booth and through email [10]. The questions are all multiple choice, with answers on a 1-5 scale, where 1 is "strongly disagree" and 5 is "strongly agree." On average, fewer than four hundred students at each university take the survey, and official surveying is completed only once every three years. The lack of granularity in the data is troubling when considering the amount of emphasis placed on the findings. For example, The Princeton Review's recent "Happiest Colleges" ranking appeared in headlines on numerous high-traffic websites such as The Huffington Post, College Atlas, and The University Herald. Upon deeper investigation into the methodology, it was found that the rankings were calculated solely by averaging the answers of students at each university to the question "How happy are you?" [10]

At the same time, the conclusions of these emotional health ranking reports have great influence on their readers. High school students and families refer to college rankings as an important source of information during the college decision process. College students find rankings a helpful tool to understand outward perception of their university. Administrators look to rankings to evaluate their performance with regard to student mental health, and they formulate policy decisions to manage their reputation. Given the importance placed on these rankings, a mismatch exists between the accuracy of current rankings and the decisions made by consulting them.

3. RELATED WORK

There is existing academic research centered on using natural language models to predict depression in the general population. However, only a small subset of these studies focuses on college students. Below, we highlight previous work relevant to our study.

As early as 2004, Rude et al. [11] conducted a study of the language used by depressed and depression-prone college students. The paper took a linguistic approach, analyzing the diction used in essays written by depressed college students. This study was the first to establish that there is a significant difference in language use between depressed and non-depressed college students. In 2006, Stephenson et al. [14] published a study examining predictors of suicidal ideation among college students, focusing primarily on contrasting indicators of suicide between male and female college students. Unlike our work, Stephenson's study focuses on suicidal ideation rather than depression; however, many of the indicators of suicide it proposes indicate depression as well, making it relevant to our work.

In later years, several papers used the same linguistic approach taken by Rude et al. to analyze depression and its symptoms. However, most of these studies lacked a demographic focus, choosing instead to analyze depression patients across the entire population. A paper by Neuman et al. [7] used data from the Internet as well as the expertise of linguistic scholars to construct a predictive model that identifies depression from a piece of writing; the model achieved an 84% classification rate. A more recent study by Howes, Purver, and McCabe [5] used linguistic indicators to track depression patients through an online text-based therapy, finding that linguistic models could predict important measures of depression with a high degree of accuracy.

In recent years, research has applied linguistic models to social media data to predict depression. A study by De Choudhury et al. [3] used Twitter data to predict which patients were depressed before they even received a formal diagnosis. The study used the Twitter data of diagnosed depression patients from the year prior to their diagnosis to test whether a diagnosis of depression could be gauged at the individual level; the model achieved a predictive accuracy of 72%. A more recent study by Schwartz et al. [12] refined this kind of predictive model by predicting a depression score on a continuous scale rather than simply producing a classification of depressed versus not depressed.

Several recent studies have used social media data to analyze depression for only a subset of the population. One study, conducted by Thompson, Poulin, and Bryan [15], used social media posts to predict suicide risk in military personnel and veterans. Another study, conducted by Moreno et al., examined disclosures of depression on Facebook by already-diagnosed college students. However, this study did not focus on building a model of depression for college students.
Instead, it studied how often students who are diagnosed with depression display negative emotions on Facebook.

Additionally, our work considers using PERMA scores, a positive psychology methodology, as an alternative means of validating the depression approach. PERMA is a scheme developed by Dr. Martin Seligman as an attempt to capture emotional well-being along five dimensions [13]: Positive Emotion, Engagement, Relationships, Meaning, and Accomplishment. For each dimension, a corpus of keywords (along with their relative weights) was developed at the University of Pennsylvania. PERMA scores are generated by aggregating the normalized frequencies of these keywords for a given input text. Specifically, each dimension has two directions (positive and negative), and the output contains a score for each direction in each dimension, for a total of 10 scores. The model for calculating these scores has been set and established [13].

4. SYSTEM MODEL

Figure 1: Block Diagram of Full Model

Our approach, outlined in Figure 1, leverages language features from Tweets and the World Well-Being Project's (WWBP) existing depression model to construct university depression rankings. (The World Well-Being Project is a collaboration among computer scientists and psychologists at the University of Pennsylvania, aimed at studying psychology through modern machine learning techniques; for more details, please refer to the WWBP website.) However, this approach can be generalized to a broader set of applications. In particular, these rankings can be produced over any set of groups, not just universities. Additionally, alternative language models can be substituted for WWBP's model to provide more flexibility in how depression is measured.

This more generalized approach has two main components: a language model and a data set of user-level social media messages. These two components are combined to produce user-level scores per the language model, and these scores are finally aggregated to produce the final rankings. We now describe each of these components/stages in more detail, particularly for our attribute of interest, depression.

4.1 Prerequisite: A Depression Scale
The overall model first requires a numerical scale for depression. This scale is used to label users in the training data set with their levels of depression, and it is also the scale on which the depression model outputs depression scores for each user.

4.2 Depression Model
The depression rankings produced in this work use WWBP's depression model, which produces depression scores given social media data. However, our approach is not restricted to this model; other existing individual-level depression models can be substituted for it. The overall requirements for such a model are rather loose: the model must take as training input a set of users, where each user has some text and a depression score, and it must then be able to take text from a test set of users and produce a depression score for each of them. This output score should be on the same scale as the depression scores in the training data.

4.3 Data Set Construction
Our work aims to leverage social media language to generate depression rankings. That said, the data set is not required to consist strictly of social media messages: any written text or combination of written texts will do. For example, for each user, one may choose to use a combination of Facebook status updates and college application essays. The only requirement is that the text in the training and ranking data sets (described below) be similar. As with any statistical model, the quality of the results depends on whether the data set is large enough to produce statistically significant results; the size required to produce meaningful rankings will depend on the depression model chosen in the system implementation. Thus, this system model leaves it to the user to ensure that the data set is sufficiently large for the chosen depression model.

The required data set can be broken down into two parts. First, the model requires a training data set that will be used to train the depression model. This set must contain a sufficient number of users and, for each user, sufficient text as well as a depression score on the scale described above. Second, we need a ranking data set that will be used to produce the final rankings. For each group to be ranked, this set should contain a sufficient number of users labeled as belonging to that group and, for each user, a sufficient amount of text to produce a depression score for that user.

4.4 User-Level Scores
In the third stage, we compute a depression score for each individual user. This is done by training the chosen depression model on the training data described previously and then, for each user in the ranking data set, running that user's text through the depression model to compute a depression score.
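To make the interface between these stages concrete, the following minimal sketch shows one possible shape for a pluggable depression model and the user-level scoring stage. The type and function names here are our own illustrative assumptions, not part of any established library.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class LabeledUser:
    text: str                # all text collected for this user
    depression_score: float  # label on the chosen depression scale (Section 4.1)


class DepressionModel(Protocol):
    """Any model with this shape can be plugged into the pipeline (Section 4.2)."""
    def fit(self, training_users: List[LabeledUser]) -> None: ...
    def predict(self, texts: List[str]) -> List[float]: ...


def score_users(model: DepressionModel,
                training_users: List[LabeledUser],
                ranking_texts: List[str]) -> List[float]:
    """Stage three (Section 4.4): train the plug-in model, then score each ranking-set user."""
    model.fit(training_users)
    return model.predict(ranking_texts)
```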
4.5 Depression Ranking Output
In the final stage, we compute rankings for our groups using the user-level depression scores. This requires an aggregation function that takes as input the depression score of each user in a group and returns a single depression score for that group. The simplest such function takes the average of all of the depression scores, but a more sophisticated approach that weights users based on the total number of words in their text, or on other qualities of the users, may also be used. The final output is a depression score for each group, and the groups can then be sorted by this score to produce the final rankings.

5. SYSTEM IMPLEMENTATION

Below, we discuss how each stage of the model described above is implemented.

5.1 Depression Model
The method used in the baseline depression model developed by WWBP is called differential language analysis. Differential language analysis involves first working with a set of labeled training data, choosing a set of features that best predict the labels (e.g., n-grams, topics), and then fitting a corresponding model (e.g., Naive Bayes, regression, SVM) on those features. The resulting weightings can then be applied to a novel data set through three steps: sanitizing and converting the data set into a message table, extracting the desired features from this message table, and performing correlation analysis and visualization. This is our overarching approach to constructing a predictive model based on social media language.

WWBP Library
We base our work on the WWBP library, the low-level implementation needed to run and tune our WWBP model. It is a machine learning library with an interface through which a range of tasks can be accomplished, including model creation (with a range of different regression and classification models), feature extraction, and data visualization. The library uses a MySQL database to store the necessary data and Python code to carry out the machine learning tasks. The parameters for each model are stored in memory as local variables in the Python methods, so a .pickle file must be created for each model that needs to be accessed later. This library provides a convenient interface with a relatively fast run time.

WWBP Depression Model
This work uses the model developed by Schwartz et al. [12] as part of the World Well-Being Project as the baseline model to predict individual levels of depression. The model is trained and tested on data from 28,749 Facebook users who opted into a study in which they completed a personality questionnaire and provided access to their status updates between June 2009 and March. The personality questionnaire measures levels of depression along seven different facets, based on a methodology developed by Cambridge University [12]. The survey computes an average of all seven facets to output a depression score, termed degree of depression, that ranges from 0 to 12. We use this degree of depression score as the depression metric for this work.

Schwartz et al.'s [12] model uses the following features to produce the aggregate depression score:
1. 1-grams to 3-grams: the relative frequency of n-grams, restricted to those used by at least 5% of all users.
2. Topics: 2,000 topics derived via latent Dirichlet allocation (LDA) on the Facebook data, in addition to 64 Linguistic Inquiry and Word Count (LIWC) categories [9].
3. Number of words: the total number of words a user has posted.

The model first applies Principal Component Analysis to reduce its feature space, and then uses an L2-penalized regression model to predict the depression score. The model provides a more nuanced prediction of depression (having the outcome on a scale rather than just a binary output) while still maintaining decent accuracy: it achieved a Pearson r value of 0.39 and a mean squared error of 0.78 on its out-of-sample test set, significantly outperforming the sentiment-analysis baseline.

Figures 2 and 3 are visualizations of the data set used to construct the regression model. The word clouds are generated by computing the correlation of each feature with the labeled depression scores, and then emitting the unigrams and bigrams with the highest absolute values. Interestingly, the words in the data set that correlate most strongly with depression tend to be in the first person (e.g., I, myself), whereas those most negatively correlated with depression tend to be activity-related or to refer to a collective entity (e.g., our, team, game).

Figure 2: Words Most Negatively Correlated with Depression in the WWBP Model
Figure 3: Words Most Positively Correlated with Depression in the WWBP Model

Message and User Table Conversion
Assuming a data set of social media messages, the first step in the model training implementation is to convert the raw messages into a well-formatted table. This requires sanitizing the data for unsupported languages as well as labeling features such as links and re-tweets. Ultimately, the well-formatted table contains the text of the messages as well as supporting metadata, such as the user id, the timestamp of the message, and the geographical location of the post, if available. We also create a second table that contains user information (e.g., user id, bio) as well as a label for the school to which the user was mapped.

Feature Extraction
To extract features from our messages, we tokenize the text. The tokenizer has been modified to recognize emoticons common in social media text (e.g., <3 or :-) ). From the tokenized text, we then create n-grams (sequences of one, two, or three words), which provide greater context than a simple bag-of-words model. We also use lexical and topical features to find language characterizing depression. For lexica, we use LIWC (Linguistic Inquiry and Word Count) lists; each LIWC list of words is associated with a semantic or syntactic category, such as engagement or leisure. For topics, we use clusters of lexico-semantically related words derived from latent Dirichlet allocation (LDA) [12]. In addition, we refine our features. We use a point-wise mutual information criterion, which compares the rate at which two words actually occur together to the rate expected by chance; 2-grams and 3-grams not meeting the criterion are discarded. We also limit our words and phrases to those used by at least 5% of the sample. While longer phrases could be considered, computation becomes increasingly challenging because the number of combinations grows exponentially with n-gram size. Words and phrases are normalized by the total number of words written by the user and are transformed using the Anscombe transformation to stabilize variance.
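The WWBP library performs these steps over its MySQL message tables; purely as an illustration, the following sketch approximates the n-gram feature construction (relative frequencies with the Anscombe transform) and the PCA-plus-ridge regression described above using scikit-learn. The function names, parameter values, and the choice of scikit-learn are our own assumptions rather than the library's actual implementation, and the LIWC and LDA topic features are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline


def anscombe(rel_freq):
    """Variance-stabilizing Anscombe transform of relative frequencies."""
    return 2.0 * np.sqrt(np.asarray(rel_freq.todense()) + 3.0 / 8.0)


def featurize(texts, vectorizer, fit=False):
    """One document per user (all of that user's messages concatenated, at least one token each)."""
    counts = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    rel = counts.multiply(1.0 / counts.sum(axis=1))   # normalize by each user's word count
    return anscombe(rel)


def train_depression_model(train_texts, train_scores, n_components=100):
    """Fit PCA followed by an L2-penalized (ridge) regression on 1- to 3-gram features."""
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.05)  # n-grams used by >=5% of users
    X = featurize(train_texts, vectorizer, fit=True)
    model = make_pipeline(PCA(n_components=n_components), Ridge(alpha=1.0))
    model.fit(X, train_scores)        # train_scores: degree-of-depression labels (0-12)
    return vectorizer, model


def predict_depression(texts, vectorizer, model):
    return model.predict(featurize(texts, vectorizer))
```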

Correlation Analysis
After feature extraction, we run a correlation analysis between our features and depression scores on the training set. We use an ordinary least squares linear regression over standardized variables, producing a linear function and a Pearson r value, as well as a set of weightings for each feature. These weightings can then be applied to the features extracted from the message table, aggregated on a per-user basis, to obtain a degree of depression score for each user. This degree of depression score is then aggregated at the university level to generate the rankings, as discussed below.

5.2 Data Set Construction
Next, we construct our data set. For each university we wish to rank, our data set contains a set of Tweets that were Tweeted by Twitter users from that university.

Approach Overview
Unlike Facebook profiles, which track many details about users such as age, university affiliation, and work history, Twitter profiles are very simple. The sign-up process consists of merely entering one's name and email, and users can later upload a profile picture and write a very short (160-character maximum) bio about themselves. Without any explicit age or university labels on Twitter accounts, drawing conclusions about which users are from a given university is difficult. For our approach, we aim to construct a data set with high precision: for the Tweets we find for each school, we want a very high percentage to actually have been Tweeted by Twitter users at that school. To construct the data set, we make use of two observations. First, most colleges across America have a Twitter account, and many students who attend the university and have Twitter accounts follow the college account. Second, although Twitter does not directly store university affiliation for each user, many student users choose to list their university affiliation in their bio.

Approach Details
As per the data set construction approach outlined in the System Model section, we take a four-step approach to constructing our data set. In the first step, we manually (through a Google search) find the main Twitter account for each university in our data set. While there are many Twitter accounts containing a university's name, most of which are controlled by third parties not affiliated with the university, we track only verified accounts. Verification of such accounts is done by Twitter and "establish[es] authenticity of key individuals and brands on Twitter" [2]. Because such accounts are verified, they are actually affiliated with the university, and thus are more easily found by users in searches and have a larger number of followers.

Second, for each university Twitter account found in the previous step, we use the Twitter API [2] to find all Twitter users with public accounts who follow the university account. The reasoning behind this step follows from the observation made above that students at universities usually follow their school's Twitter account. Then, for each of these Twitter accounts, again using the Twitter API [2], we pull the account's bio information (if it exists). A sketch of these two steps is given below.
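As an illustration only, the first two steps might look roughly like the following, assuming the tweepy 3.x client and valid API credentials. The handle and variable names are placeholders, and the exact calls may differ from the scripts actually used in this work.

```python
import tweepy


def collect_follower_bios(api, university_handle):
    """Steps 1-2: given a verified university account, pull the bios of its public followers."""
    bios = {}
    for page in tweepy.Cursor(api.followers_ids, screen_name=university_handle).pages():
        for user_id in page:
            try:
                user = api.get_user(user_id=user_id)
            except tweepy.TweepError:        # deleted, suspended, or otherwise unavailable accounts
                continue
            if not user.protected and user.description:
                bios[user_id] = user.description
    return bios


auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
follower_bios = collect_follower_bios(api, "Penn")   # "Penn" stands in for a verified university handle
```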
Now, although we have information for all Twitter users who follow the school Twitter account, it is very unlikely that all of these users are actually students at the university. In fact, many of these users may be prospective students, fans of the school's sports teams, or faculty at the school. Thus, in the third step, we use the bios gathered to filter the Twitter users down to only those who attend the university. To do so, we use a regular expression that searches for two components in the Twitter bio. First, we look for some affiliation with the university, matching either the full school name (e.g., University of Michigan) or a well-known abbreviation for that school (e.g., umich); all such searches are case insensitive. However, looking for just a university affiliation is not sufficient: alumni, parents of students, faculty, and even sports fans may list the university name in their profile. To filter these out, we also look for a typical graduation year for 4-year students at the university (i.e., a year between 2015 and 2018) or the keyword "student." An example of a matching Twitter bio would be "UMich, Class of ...". Finally, for the Twitter users found in the step above, we pull all Tweets that were made during the school year through the Twitter API, where we classify Tweets made during the school year as those dated after August 31st, 2014.

5.3 Depression Ranking Output
From the first two stages of the model, we are left with a degree of depression score for each user in our data set, the number of words the user has Tweeted (since August 31st, 2014), and a label for the university the user attends. In order to generate our desired set of comparative rankings for universities, we need a methodology for aggregating user scores to the university level. We choose to aggregate user degree of depression scores using a weighted average based on the number of words Tweeted: users who have Tweeted more words should be given greater weight, since their degree of depression scores are less volatile given the amount of data used in generating them. When deciding between weighting schemes, we considered a logarithmic scale, a square root scale, and a linear scale for the number of words. We select a linear scale because the model is considerably more volatile for users with a small amount of Tweet data, so we want to be conservative when weighting users with little data backing their scores. Furthermore, we choose a ceiling of 500 words, at which the weighting stops increasing, because previous research by Schwartz et al. [12] indicates that above this point degree of depression scores are relatively stable, and we did not want to overweight users who Tweet excessively. We decided on this weighting scheme, rather than only considering Twitter users who Tweeted above a certain threshold, because we believe that even users who seldom Tweet provide an indication of overall school well-being, and we do not want to arbitrarily filter down our data. Thus, the formula used for aggregating user scores to the university level is:

\mathrm{DepScore} = \frac{\sum_{u \in \mathrm{university}} W_u \cdot \mathrm{score}_u}{\sum_{u \in \mathrm{university}} W_u}, \quad \text{where } W_u = \min\!\left(1, \frac{\mathrm{count}(\mathrm{words}_u)}{500}\right)

From here, we rank universities according to this aggregate depression score to generate our results.
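A minimal sketch of the bio filter and the weighted aggregation just described follows. The school-name patterns and the tuple layout are illustrative assumptions, not the exact expressions used in this work.

```python
import re
from collections import defaultdict

# Illustrative patterns: full school name or a well-known abbreviation (case insensitive).
SCHOOL_PATTERNS = {
    "University of Michigan": r"university of michigan|umich",
    "University of Pennsylvania": r"university of pennsylvania|upenn",
}
# A typical 4-year graduation year at the time of the study, or the keyword "student".
STUDENT_RE = re.compile(r"\b(2015|2016|2017|2018|student)\b", re.IGNORECASE)


def bio_matches_school(bio, school):
    """True if the bio shows both a school affiliation and a current-student signal."""
    name_re = re.compile(SCHOOL_PATTERNS[school], re.IGNORECASE)
    return bool(name_re.search(bio)) and bool(STUDENT_RE.search(bio))


def university_scores(users):
    """Weighted average of degree-of-depression scores per school.
    `users` is an iterable of (school, dep_score, word_count) tuples."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for school, dep_score, word_count in users:
        w = min(1.0, word_count / 500.0)          # weight stops increasing at 500 words
        weighted_sum[school] += w * dep_score
        weight_total[school] += w
    return {s: weighted_sum[s] / weight_total[s] for s in weighted_sum if weight_total[s] > 0}
```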

6. RESULTS

The primary result is a depression ranking of the top 25 academic universities as chosen by the U.S. News and World Report in 2014 [8]. The exact ranking is shown in Figure 4. The universities have 409 students on average in the data set; Rice University has the fewest students (88) and the University of Southern California the most (827). We note that the California Institute of Technology (CalTech) is removed from the ranking because the data set construction process did not capture enough students from CalTech (only 26) for its score to be meaningful. Duke has the lowest score, making it the least depressed school according to our rankings, while Penn has the highest score and tops our depression rankings.

7. ANALYSIS OF RESULTS

Interestingly, we see some trends and surprising results emerging from our rankings. At the top of the rankings are schools that appear to place a greater emphasis on pre-professional development. The University of Pennsylvania, University of California Los Angeles, Carnegie Mellon, Emory University, Johns Hopkins University, and the University of Virginia share a focus on undergraduate pre-professional programs: all but UCLA have undergraduate business programs, and all but Emory University have undergraduate engineering programs. By comparison, Duke and Stanford, two prestigious schools at the low end of the depression rankings, offer only a school in the humanities as well as one in engineering, and over 80% of their students are enrolled in their respective schools of arts and sciences, per the schools' websites. Furthermore, schools at the lower end of the rankings tend to have strong athletic programs and a sense of school spirit; for example, Duke University has an almost religious basketball following, Notre Dame excels in basketball and football, and Stanford in football, among other sports. In short, schools with a heavier emphasis on pre-professional development (e.g., University of Pennsylvania, University of California Los Angeles, Johns Hopkins University) tend to have higher depression scores in our ranking, whereas schools with strong athletic programs tend to rank much lower (e.g., Duke University, Stanford University, University of Notre Dame). We note that our Tweets were gathered only up until the beginning of March, prior to the start of the 2015 NCAA March Madness Tournament; thus Duke's NCAA Men's Basketball Championship win, as a one-time event, did not deflate their depression scores, although their performance during the regular season may have played a factor.

Additionally, Cornell appears surprisingly low in our rankings (16th overall, 6th among Ivy League schools), as the media often portrays either Cornell or Yale (ranked 7th, second among Ivy League schools) as the most depressed Ivy League university. We believe that Cornell's lower-than-expected ranking relative to public perception may be a result of poor publicity relating to the campus: public perception may be negative due to sensationalized reporting of Cornell suicides, which occur through jumping off a bridge into the gorges. The Huffington Post supports our finding that Cornell does not have an above-average suicide rate compared to other universities [4].

In addition to the previous ranking, we use the same methodology to generate a set of depression rankings for the largest U.S. universities in terms of student enrollment, as per the Department of Education [16].
Additionally, we generate PERMA scores as well as PERMA rankings for the top 25 academic universities as a parallel ranking in order to validate our depression rankings, as discussed below. Please refer to the appendix for these outputs.

Figure 4: Depression Ranking for Top 25 Academic Universities

8. EVALUATION OF RESULTS

There currently exists no established set of university depression rankings that is widely accepted in the research community. As a result, we are unable to provide a benchmark against which to evaluate the results of this work. Consequently, we rely primarily on human evaluation to assess the two main components of our system: the depression model and the data set mapping. Furthermore, we perform a correlation analysis between the depression rankings produced by our model and happiness rankings backed by existing work in psychology.

Depression Model
To evaluate the depression model, we create a web application that displays two Twitter users from our data set and asks testers to identify which user appears more depressed. For each pair of Twitter users, the web application ensures that one user scores high on our depression model (score > 2.7), and thus exhibits traits of depression according to the model, while the other user scores low on our model (score < 2.3).

The tester then examines the Tweets of each of the two users and evaluates which user appears more depressed based on the Tweets displayed. We compare the human judgments against the outputs of the depression model, where the depression model chooses the user with the higher depression score as more depressed. When the depression model and the human agree on the more depressed user, we have a concordant pair; in the opposite case, we have a discordant pair. From these human evaluations, we use the concordant and discordant pairs to compute a Kendall's tau coefficient for our model using the equation:

\tau = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)}

where n_c is the number of concordant pairs, n_d is the number of discordant pairs, and n is the total number of pairs in the test set. This statistic is commonly used to measure the association between two measured quantities, and it ranges over -1 <= \tau <= 1. Our model yields a \tau coefficient of 0.651, demonstrating a strongly positive correlation between human evaluation and our model's outputs. Finally, the results can be used to calculate a p-value for our model via the normal approximation to Kendall's tau, using the z-statistic

z = \frac{3 (n_c - n_d)}{\sqrt{n (n - 1)(2n + 5)/2}}

(a short computational sketch appears at the end of this section).

Validation with PERMA
The PERMA model is provided to us by WWBP, and we run it against our Twitter data set for the positive emotion element. After running the data, we calculate the correlations between depression, positive emotion (Pos P), negative emotion (Neg P), and a standardized metric for happiness, which we compute as the Z-score for Pos P in the sample minus the Z-score for Neg P in the sample (Pos P - Neg P Z). Professor Martin Seligman, the father of positive psychology, previously identified in his work a correlation between depression and happiness [12]. Our own correlation between the depression rankings and standardized happiness is consistent with this relationship. Furthermore, our depression rankings show a very low negative correlation with positive emotion and a moderate correlation with negative emotion, a result that is supported by Seligman's previous work [12]; this lends further confidence to the methodology in this work.

Table 1: PERMA Correlation with Depression Rankings (correlations of the depression, Pos P, and Neg P rankings with Pos P, Neg P, and the Pos P - Neg P Z happiness metric)

Validation with Other Metrics
Additionally, we compute the correlation between university depression scores (as well as the PERMA scores) and some simple, easy-to-find metrics commonly used in ranking universities by academic prestige [8] [10]. The values (shown in Table 2) match common intuition. We see that retention rate, defined as the percentage of freshmen who enroll as sophomores at the same university, is negatively correlated with the model's depression score. This is expected, as a higher retention rate indicates that more students are returning to school after spending a year at the university. Interestingly, the acceptance rate of the university correlates positively with the depression score, which seems to indicate that students at exclusive universities are less depressed. This is supported further by the correlation between depression score and US News ranking, which can serve as a proxy for a university's prestige. In addition, we note that university enrollment is correlated with depression: the average depression score of the top 25 academic universities is lower than the score for the 40 largest schools, which supports the correlations in Table 2.

Table 2: Other Factors' Correlation with Depression Rankings (tuition and fees, total enrollment, acceptance rate, retention rate, and US News ranking against the depression, Pos P, and Neg P scores)
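For reference, the Kendall's tau coefficient and its normal-approximation p-value from the human evaluation above can be computed directly from the pair counts. A minimal sketch, with function names of our own choosing:

```python
import math


def kendall_tau(n_c, n_d, n):
    """tau = (n_c - n_d) / (n(n-1)/2), with n_c, n_d, n as defined in the evaluation above."""
    return (n_c - n_d) / (0.5 * n * (n - 1))


def normal_approx_p(n_c, n_d, n):
    """Two-sided p-value from the normal approximation to Kendall's tau under the null of no association."""
    z = 3.0 * (n_c - n_d) / math.sqrt(n * (n - 1) * (2 * n + 5) / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))
```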
Data Set Mapping
In constructing our data set, we use the approach and implementation outlined in the previous sections to find Tweets for the 40 largest universities in the United States, as determined by the total enrollment reported by the U.S. Department of Education [16], as well as for the top 25 academic universities as ranked by U.S. News and World Report [8]. We focus on evaluating the data set for the top 25 academic universities, as this is the selection from which our primary ranking is generated. The data set consists of, on average, 409 Twitter users per school and, on average, roughly 145,000 Tweets per school. More detailed statistics (at the school level) are shown in Table 3 below.

Table 3: Results of Data Set Construction (minimum, average, and maximum number of Twitter users and Tweets per school)

Recall
As previously mentioned, our data set construction approach aims for high precision. However, as expected, there is a trade-off between precision and recall. We can roughly estimate our recall with the following calculation. Using the U.S. Department of Education's statistics, we find that, on average, the top 25 academic schools have about 17,601 students per school. Additionally, a study by Digiday [6] reports that, as of November 2013, approximately 43.7% of college students are on Twitter. Using this, we calculate that for the top 25 academic universities there are, on average, approximately 7,692 Twitter users per university. Because our data set has only 409 users per university, we obtain a recall of approximately 5.3%.
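As a quick check of this estimate, the arithmetic can be reproduced directly from the figures quoted above:

```python
avg_enrollment = 17_601      # average students per top-25 school (U.S. Dept. of Education)
twitter_rate = 0.437         # share of college students on Twitter (Digiday, Nov. 2013)
users_found = 409            # average mapped Twitter users per school in our data set

est_twitter_users = avg_enrollment * twitter_rate    # ~7,692 Twitter users per school
recall = users_found / est_twitter_users             # ~0.053, i.e. roughly 5.3%
print(round(est_twitter_users), f"{recall:.1%}")
```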

Although our data set has a very low recall and captures only a small fraction of Tweets for each university, it is of sufficient size for our model. Our model requires a minimum of 10,000 Tweets per school to produce meaningful results [12], and all 25 schools in our ranking have at least 10,000 Tweets, with the average being much higher (over 100,000 Tweets).

Precision
To evaluate the quality of the data set mapping phase, we construct a web application that provides an interface for testers, a selected group of colleagues at Penn, to review a sample of our data and verify the accuracy of the mapping between Twitter bios and universities. The webpage displays the Twitter biography of a randomly chosen user from our data set along with the university to which that user was mapped. Testers then use this information to determine whether the biography identifies the user as a current student at the listed university. Based on this validation, our university data mapping yields an accuracy of 86.9% on a sample of 390 Twitter user bios.

Drawbacks
The ideal data set for a university consists either of all Tweets that were Tweeted by users at that university or of a random subset of those Tweets. However, because of our method of finding Twitter users at each university, there is a systematic bias in which Tweets are captured by our data set: the bias is towards Twitter users who list their school and graduation year in their Twitter bio and who follow the school Twitter account. One may argue that those who are more likely to list their school affiliation in their Twitter bio are less likely to be depressed, or one may argue the opposite. Regardless of such arguments, we make the underlying assumption that any such biases introduced into our data have an equal effect on the data for all universities in our data set; therefore, these biases do not impact our results.

9. FUTURE WORK

There are several useful extensions of our work that may be explored further. First, our novel message mapping approach is a useful way to label Twitter profiles with metadata about university affiliation. Using it, we were able to build a data set of Twitter users for each university; no such data set currently exists, so it may be useful to explore further applications of this data set. Additionally, our rankings looked at select groups of schools, such as academically prestigious undergraduate institutions and the largest schools in the United States; for a complete set of rankings, we would need to incorporate other universities into our data set. Furthermore, a limiting constraint in our work is the number of users mapped to each university. For most universities, the mapping technique captures a sufficient number of Twitter profiles to perform the analysis detailed in this work. However, in our sampling of the top academic universities, there is one outlier, CalTech, which is mapped to only a few dozen users; this is because CalTech has a very small student body, with an undergraduate enrollment of less than 1,000 students in 2012 [1]. To include such outliers in the rankings and analysis, more sophisticated methods for university mapping, which improve recall without a significant trade-off in precision, should be developed.
The framework developed in our work may also be extended for depression analysis at institutions other than universities. For example, the system could be used as a Human Resources tool to evaluate worker morale based on the language used in emails or on enterprise social media platforms such as Yammer. This would allow companies not only to increase employee satisfaction but also to improve in areas such as worker retention.

10. ETHICS

Although the depression model is validated with some degree of confidence, it cannot be used as a tool to diagnose individuals with depression. As discussed in prior sections of this work, the amount that an individual writes on social media affects their depression score. Furthermore, the language that a single individual uses may not be enough to indicate their mental well-being. The depression model has not been verified medically and cannot supplant the opinion of professional services.

Another concern is that the data is collected from a publicly available source, Twitter, and therefore is not anonymized. It is possible to identify a user from the data we have collected, such as their user id, biography, or Tweets, and the model connects a user with sensitive information about their mental well-being. As a result, our data set must be anonymized, and insights drawn from the data, especially about individuals, must be filtered to avoid defamation and to ensure user confidentiality.

On a final note, while the framework developed in this work may be applied to studying depression beyond the university level, there are privacy and security concerns around collecting user data. For example, if a depression model were applied to employees' emails, there would likely be public concern about using this data to draw conclusions about the mental state of employees.

11. CONCLUSION

In this work, we create a set of university depression rankings using Twitter data. We developed a novel approach of mapping student Twitter accounts to universities to construct a data set of college student Tweets. Then, using differential language analysis and machine learning, we generate individual depression scores and aggregate them to the university level to create our depression rankings. Our data set of college student Tweets and our depression model are useful tools for understanding and ranking depression among students at universities. We find, on average, 409 students for each of the top 25 academic universities, with a mapping accuracy of 86.9%, and our model's outputs agree strongly with human evaluation (Kendall's tau of 0.651). While there is still room for improvement in our system, we have built a strong foundation for understanding depression across universities and for conducting other rankings and analyses at the university level. Student well-being is a serious and relevant topic on college campuses, and we hope that our model and insights about depression provide value and help for students, faculty, and administrators.

12. REFERENCES

[1] CalTech Undergraduate Admissions: Facts and Stats.
[2] Twitter.
[3] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. Predicting depression via social media. In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, ICWSM. The AAAI Press, 2013.
[4] Rob Fishman. Cornell suicides: Do Ithaca's gorges invite jumpers? The Huffington Post.
[5] Christine Howes, Matthew Purver, and Rose McCabe. Linguistic indicators of severity and progress in online text-based therapy for depression. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 7-16, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
[6] John McDermott. Facebook losing its edge among college-aged adults. Digiday.
[7] Yair Neuman, Yohai Cohen, Dan Assaf, and Gabi Kedma. Proactive screening for depression through metaphorical and automatic text analysis. Artificial Intelligence in Medicine, 56(1):19-25, 2012.
[8] US News. US News and World Report's Annual College Rankings. Web. Accessed 19 Oct. 2014.
[9] James W. Pennebaker, C. K. Chung, M. Ireland, A. Gonzales, and R. J. Booth. The development and psychometric properties of LIWC2007. Austin, TX: LIWC.net.
[10] Princeton Review. Surveying Students: How It Works. Web. Accessed 28 Apr. 2015.
[11] Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. Language use of depressed and depression-vulnerable college students, 2004.
[12] H. Andrew Schwartz, Johannes Eichstaedt, Margaret L. Kern, Gregory Park, Maarten Sap, David Stillwell, Michal Kosinski, and Lyle Ungar. Towards assessing changes in degree of depression through Facebook. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
[13] Martin E. P. Seligman. Flourish: A Visionary New Understanding of Happiness and Well-being. Atria Books, reprint edition.
[14] Hugh Stephenson, Judith Pena-Shaff, and Priscilla Quirk. Predictors of college student suicidal ideation: Gender differences, 2006.
[15] Paul Thompson, Craig Bryan, and Chris Poulin. Predicting military and veteran suicide risk: Cultural aspects. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 1-6, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
[16] U.S. Department of Education, National Center for Education Statistics. Selected statistics for degree-granting postsecondary institutions enrolling more than 15,000 students in 2012, by selected institution and student characteristics.

APPENDIX
A. ADDITIONAL FIGURES
Figure 5: Depression Rankings for Top 25 Academic Schools

Figure 6: Depression Rankings for Top 40 Largest Schools

Figure 7: Depression and PERMA Rankings for Top 25 Academic Schools


More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

IRIT at e-risk. 1 Introduction

IRIT at e-risk. 1 Introduction IRIT at e-risk Idriss Abdou Malam 1 Mohamed Arziki 1 Mohammed Nezar Bellazrak 1 Farah Benamara 2 Assafa El Kaidi 1 Bouchra Es-Saghir 1 Zhaolong He 2 Mouad Housni 1 Véronique Moriceau 3 Josiane Mothe 2

More information

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2 UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2 DATE: 3 May 2017 MARKS: 75 ASSESSOR: Prof PJ Blignaut MODERATOR: Prof C de Villiers (UP) TIME: 2 hours

More information

A Comparison of Three Measures of the Association Between a Feature and a Concept

A Comparison of Three Measures of the Association Between a Feature and a Concept A Comparison of Three Measures of the Association Between a Feature and a Concept Matthew D. Zeigenfuse (mzeigenf@msu.edu) Department of Psychology, Michigan State University East Lansing, MI 48823 USA

More information

Blind Manuscript Submission to Reduce Rejection Bias?

Blind Manuscript Submission to Reduce Rejection Bias? Sci Eng Ethics DOI 10.1007/s11948-014-9547-7 OPINION Blind Manuscript Submission to Reduce Rejection Bias? Khaled Moustafa Received: 10 March 2014 / Accepted: 3 April 2014 Ó Springer Science+Business Media

More information

Australian Political Studies Association Survey Report prepared for the APSA Executive

Australian Political Studies Association Survey Report prepared for the APSA Executive Australian Political Studies Association Survey 2017 Report prepared for the APSA Executive September 2017 1 Survey of the Australian Political Studies Association Membership 2017 Table of Contents 1.

More information

Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin.

Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin. Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin. Why Preferences? Learning Consensus from Preferences

More information

Introduction...2 A Note About Data Privacy...3

Introduction...2 A Note About Data Privacy...3 Introduction........2 A Note About Data Privacy......3 Indexes Overall Satisfaction......4 Recommend Employment at the University...5 Accept Position at the University Again....6 Satisfaction with Work......7

More information

Methodology for Non-Randomized Clinical Trials: Propensity Score Analysis Dan Conroy, Ph.D., inventiv Health, Burlington, MA

Methodology for Non-Randomized Clinical Trials: Propensity Score Analysis Dan Conroy, Ph.D., inventiv Health, Burlington, MA PharmaSUG 2014 - Paper SP08 Methodology for Non-Randomized Clinical Trials: Propensity Score Analysis Dan Conroy, Ph.D., inventiv Health, Burlington, MA ABSTRACT Randomized clinical trials serve as the

More information

A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value

A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value SPORTSCIENCE Perspectives / Research Resources A Spreadsheet for Deriving a Confidence Interval, Mechanistic Inference and Clinical Inference from a P Value Will G Hopkins sportsci.org Sportscience 11,

More information

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX The Impact of Relative Standards on the Propensity to Disclose Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX 2 Web Appendix A: Panel data estimation approach As noted in the main

More information

Experimental evidence of massive-scale emotional contagion through social networks

Experimental evidence of massive-scale emotional contagion through social networks Experimental evidence of massive-scale emotional contagion through social networks September 26, 2016 Goal of Experiment Problems to be solved in Experiment Details of Experiment Conclusions Experiment

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

The ALL ABOUT Website Portfolio

The ALL ABOUT Website Portfolio The ALL ABOUT Website Portfolio Introducing A Unique Online Marketing Opportunity By David Webb Owner of The All About Website Portfolio June 2010 The All About Website Portfolio consists of three pre-eminent

More information

Citation Characteristics of Research Published in Emergency Medicine Versus Other Scientific Journals

Citation Characteristics of Research Published in Emergency Medicine Versus Other Scientific Journals ORIGINAL CONTRIBUTION Citation Characteristics of Research Published in Emergency Medicine Versus Other Scientific From the Division of Emergency Medicine, University of California, San Francisco, CA *

More information

Is Leisure Theory Needed For Leisure Studies?

Is Leisure Theory Needed For Leisure Studies? Journal of Leisure Research Copyright 2000 2000, Vol. 32, No. 1, pp. 138-142 National Recreation and Park Association Is Leisure Theory Needed For Leisure Studies? KEYWORDS: Mark S. Searle College of Human

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

A Web Tool for Building Parallel Corpora of Spoken and Sign Languages

A Web Tool for Building Parallel Corpora of Spoken and Sign Languages A Web Tool for Building Parallel Corpora of Spoken and Sign Languages ALEX MALMANN BECKER FÁBIO NATANAEL KEPLER SARA CANDEIAS July 19,2016 Authors Software Engineer Master's degree by UFSCar CEO at Porthal

More information

Chapter 5: Field experimental designs in agriculture

Chapter 5: Field experimental designs in agriculture Chapter 5: Field experimental designs in agriculture Jose Crossa Biometrics and Statistics Unit Crop Research Informatics Lab (CRIL) CIMMYT. Int. Apdo. Postal 6-641, 06600 Mexico, DF, Mexico Introduction

More information

Chapter 1 Introduction to I/O Psychology

Chapter 1 Introduction to I/O Psychology Chapter 1 Introduction to I/O Psychology 1. I/O Psychology is a branch of psychology that in the workplace. a. treats psychological disorders b. applies the principles of psychology c. provides therapy

More information

Introduction to Sentiment Analysis

Introduction to Sentiment Analysis Introduction to Sentiment Analysis Machine Learning and Modelling for Social Networks Lloyd Sanders, Olivia Woolley, Iza Moize, Nino Antulov-Fantulin D-GESS: Computational Social Science Overview What

More information

Overview of Non-Parametric Statistics

Overview of Non-Parametric Statistics Overview of Non-Parametric Statistics LISA Short Course Series Mark Seiss, Dept. of Statistics April 7, 2009 Presentation Outline 1. Homework 2. Review of Parametric Statistics 3. Overview Non-Parametric

More information

The Yin and Yang of OD

The Yin and Yang of OD University of Pennsylvania ScholarlyCommons Building ODC as an Discipline (2006) Conferences 1-1-2006 The Yin and Yang of OD Matt Minahan Organization Development Network Originally published in OD Practitioner,

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

Controlled Experiments

Controlled Experiments CHARM Choosing Human-Computer Interaction (HCI) Appropriate Research Methods Controlled Experiments Liz Atwater Department of Psychology Human Factors/Applied Cognition George Mason University lizatwater@hotmail.com

More information

Research Proposal on Emotion Recognition

Research Proposal on Emotion Recognition Research Proposal on Emotion Recognition Colin Grubb June 3, 2012 Abstract In this paper I will introduce my thesis question: To what extent can emotion recognition be improved by combining audio and visual

More information

Mapping A Pathway For Embedding A Strengths-Based Approach In Public Health. By Resiliency Initiatives and Ontario Public Health

Mapping A Pathway For Embedding A Strengths-Based Approach In Public Health. By Resiliency Initiatives and Ontario Public Health + Mapping A Pathway For Embedding A Strengths-Based Approach In Public Health By Resiliency Initiatives and Ontario Public Health + Presentation Outline Introduction The Need for a Paradigm Shift Literature

More information

Psychology Department Assessment

Psychology Department Assessment Psychology Department Assessment 2008 2009 The 2008-2009 Psychology assessment included an evaluation of graduating psychology seniors regarding their experience in the program, an analysis of introductory

More information

Paths to Power. Lauren Rose Carrasco, Amanda Sisley, Monica Zeitlin

Paths to Power. Lauren Rose Carrasco, Amanda Sisley, Monica Zeitlin Paths to Power April 6, 2009 Lauren Rose Carrasco, Amanda Sisley, Monica Zeitlin PROBLEM STATEMENT Women still lag behind men in holding senior management and executive positions. RESEARCH PROJECT We focused

More information

Desirability Bias: Do Desires Influence Expectations? It Depends on How You Ask.

Desirability Bias: Do Desires Influence Expectations? It Depends on How You Ask. University of Iowa Honors Theses University of Iowa Honors Program Spring 2018 Desirability Bias: Do Desires Influence Expectations? It Depends on How You Ask. Mark Biangmano Follow this and additional

More information

How are Journal Impact, Prestige and Article Influence Related? An Application to Neuroscience*

How are Journal Impact, Prestige and Article Influence Related? An Application to Neuroscience* How are Journal Impact, Prestige and Article Influence Related? An Application to Neuroscience* Chia-Lin Chang Department of Applied Economics and Department of Finance National Chung Hsing University

More information

A Bayesian Network Model of Knowledge-Based Authentication

A Bayesian Network Model of Knowledge-Based Authentication Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2007 Proceedings Americas Conference on Information Systems (AMCIS) December 2007 A Bayesian Network Model of Knowledge-Based Authentication

More information

Ruffalo Noel-Levitz Student Satisfaction Inventory Results: All Students Gallaudet University Spring 2018 Report

Ruffalo Noel-Levitz Student Satisfaction Inventory Results: All Students Gallaudet University Spring 2018 Report Ruffalo Noel-Levitz Student Satisfaction Inventory Results: All Students Gallaudet University Spring 2018 Report Student Success and Academic Quality Office of Institutional Research August 03, 2018 Gallaudet

More information

Impact of Personal Attitudes on Propensity to Use Autonomous Vehicles for Intercity Travel

Impact of Personal Attitudes on Propensity to Use Autonomous Vehicles for Intercity Travel Impact of Personal Attitudes on Propensity to Use Autonomous Vehicles for Intercity Travel FINAL RESEARCH REPORT Megan Ryerson (PI), Ivan Tereshchenko Contract No. DTRT12GUTG11 DISCLAIMER The contents

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

Gamma Phi Beta Fraternity/Sorority Annual Evaluation Process Gettysburg College

Gamma Phi Beta Fraternity/Sorority Annual Evaluation Process Gettysburg College Gamma Phi Beta Fraternity/Sorority Annual Evaluation Process Gettysburg College 2016 Academic Achievement and Intellectual Engagement Criteria 5 pts 10 pts 15 pts Bonus Points (1-5) Academic Support Plan

More information

Not All Moods are Created Equal! Exploring Human Emotional States in Social Media

Not All Moods are Created Equal! Exploring Human Emotional States in Social Media Not All Moods are Created Equal! Exploring Human Emotional States in Social Media Munmun De Choudhury Scott Counts Michael Gamon Microsoft Research, Redmond {munmund, counts, mgamon}@microsoft.com [Ekman,

More information

Athletic Identity and Life Roles of Division I and Division III Collegiate Athletes

Athletic Identity and Life Roles of Division I and Division III Collegiate Athletes ATHLETIC IDENTITY AND LIFE ROLES OF DIVISION I AND DIVISION III COLLEGIATE ATHLETES 225 Athletic Identity and Life Roles of Division I and Division III Collegiate Athletes Katie A. Griffith and Kristine

More information

Exploiting Ordinality in Predicting Star Reviews

Exploiting Ordinality in Predicting Star Reviews Exploiting Ordinality in Predicting Star Reviews Alim Virani UBC - Computer Science alim.virani@gmail.com Chris Cameron UBC - Computer Science cchris13@cs.ubc.ca Abstract Automatically evaluating the sentiment

More information

The psychology publication situation in Cyprus

The psychology publication situation in Cyprus Psychology Science Quarterly, Volume 51, 2009 (Supplement 1), pp. 135-140 The psychology publication situation in Cyprus MARIA KAREKLA 1 Abstract The aim of the present paper was to review the psychological

More information

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk

Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk Michael Denkowski and Alon Lavie Language Technologies Institute School of

More information

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS 1 Manimegalai. C and 2 Prakash Narayanan. C manimegalaic153@gmail.com and cprakashmca@gmail.com 1PG Student and 2 Assistant Professor, Department

More information

Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn

Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn The Stata Journal (2004) 4, Number 1, pp. 89 92 Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn Laurent Audigé AO Foundation laurent.audige@aofoundation.org Abstract. The new book

More information

LEAF Marque Assurance Programme

LEAF Marque Assurance Programme Invisible ISEAL Code It is important that the integrity of the LEAF Marque Standard is upheld therefore the LEAF Marque Standards System has an Assurance Programme to ensure this. This document outlines

More information

Title: What 'outliers' tell us about missed opportunities for TB control: a cross-sectional study of patients in Mumbai, India

Title: What 'outliers' tell us about missed opportunities for TB control: a cross-sectional study of patients in Mumbai, India Author's response to reviews Title: What 'outliers' tell us about missed opportunities for TB control: a cross-sectional study of patients in Authors: Anagha Pradhan (anp1002004@yahoo.com) Karina Kielmann

More information

Saville Consulting Wave Professional Styles Handbook

Saville Consulting Wave Professional Styles Handbook Saville Consulting Wave Professional Styles Handbook PART 4: TECHNICAL Chapter 19: Reliability This manual has been generated electronically. Saville Consulting do not guarantee that it has not been changed

More information

CHAPTER 5: PRODUCING DATA

CHAPTER 5: PRODUCING DATA CHAPTER 5: PRODUCING DATA 5.1: Designing Samples Exploratory data analysis seeks to what data say by using: These conclusions apply only to the we examine. To answer questions about some of individuals

More information

Please take time to read this document carefully. It forms part of the agreement between you and your counsellor and Insight Counselling.

Please take time to read this document carefully. It forms part of the agreement between you and your counsellor and Insight Counselling. Informed Consent Please take time to read this document carefully. It forms part of the agreement between you and your counsellor and Insight Counselling. AGREEMENT FOR COUNSELLING SERVICES CONDUCTED BY

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

Introductory: Coding

Introductory: Coding Introductory: Coding Sandra Jo Wilson Editor, Education Coordinating Group Associate Director, Peabody Research Institute Research Assistant Professor, Dept. of Special Education Vanderbilt University,

More information

Exploring the Role of Time Alone in Modern Culture

Exploring the Role of Time Alone in Modern Culture Article 56 Exploring the Role of Time Alone in Modern Culture Paper based on a program presented at the 2013 American Counseling Association Conference, March 20-24, Cincinnati, OH. William Z. Nance and

More information

DMA will take your dental practice to the next level

DMA will take your dental practice to the next level DMA will take your dental practice to the next level A membership payment plan created by dentists for dentists and patients Traditionally dentists have only been able to grow their practices by a mix

More information

Uptake and outcome of manuscripts in Nature journals by review model and author characteristics

Uptake and outcome of manuscripts in Nature journals by review model and author characteristics McGillivray and De Ranieri Research Integrity and Peer Review (2018) 3:5 https://doi.org/10.1186/s41073-018-0049-z Research Integrity and Peer Review RESEARCH Open Access Uptake and outcome of manuscripts

More information

Perceived Emotional Aptitude of Clinical Laboratory Sciences Students Compared to Students in Other Healthcare Profession Majors

Perceived Emotional Aptitude of Clinical Laboratory Sciences Students Compared to Students in Other Healthcare Profession Majors Perceived Emotional Aptitude of Clinical Laboratory Sciences Students Compared to Students in Other Healthcare Profession Majors AUSTIN ADAMS, KRISTIN MCCABE, CASSANDRA ZUNDEL, TRAVIS PRICE, COREY DAHL

More information

Real-time Summarization Track

Real-time Summarization Track Track Jaime Arguello jarguell@email.unc.edu February 6, 2017 Goal: developing systems that can monitor a data stream (e.g., tweets) and push content that is relevant, novel (with respect to previous pushes),

More information

Scientific evaluation of Charles Dickens

Scientific evaluation of Charles Dickens Scientific evaluation of Charles Dickens Mikhail Simkin Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594 email: simkin@ee.ucla.edu cell phone: 415-407-6542 About

More information

How do we identify a good healthcare provider? - Patient Characteristics - Clinical Expertise - Current best research evidence

How do we identify a good healthcare provider? - Patient Characteristics - Clinical Expertise - Current best research evidence BSC206: INTRODUCTION TO RESEARCH METHODOLOGY AND EVIDENCE BASED PRACTICE LECTURE 1: INTRODUCTION TO EVIDENCE- BASED MEDICINE List 5 critical thinking skills. - Reasoning - Evaluating - Problem solving

More information

Online Journal for Weightlifters & Coaches

Online Journal for Weightlifters & Coaches Online Journal for Weightlifters & Coaches WWW.WL-LOG.COM Genadi Hiskia Dublin, October 7, 2016 WL-LOG.com Professional Online Service for Weightlifters & Coaches Professional Online Service for Weightlifters

More information