Administrative notes
  March 14: Midterm 2. It covers all lectures, labs and readings between Tue Jan 31 and Thu Mar 9 inclusive.
  Practice Midterm 2 is on the Exercises webpage: http://www.ugrad.cs.ubc.ca/~cs100/2016w2/exercises.html#exams
  March 17: In the News call #3
  March 30: Project deliverables and individual report due
Administrative notes
  Check Project Rubric on the Connect grade centre to learn which rubric we will use to grade your project.
  Find your rubric at http://www.ugrad.cs.ubc.ca/~cs100/2016w2/project-grading.html#projectmarkingscheme
  If you have questions, please email your project TA (also listed on Connect).
  We will email you which projects you should review. Please ensure that email forwarding for your CS email (CS_ID@ugrad.cs.ubc.ca) works (you should have set this up in Lab 0).
Data Mining 4 Mining by Association: Apriori algorithm wrap-up
Recall: How to predict the future? Association rules
  An association rule X → Y suggests that people who buy the items in set X are also likely to want the items in set Y.
  Valid association rules are mined from training data, e.g. store purchases.
  Association rules are useful to stores, and also in areas such as medical diagnosis, protein sequence composition, health insurance claim analysis and census data.
When is an association rule valid?
  We are given two thresholds:
    Support threshold
    Confidence threshold
  A rule X → Y is valid with respect to these thresholds if
    the support of X ∪ Y is at least the support threshold, and
    the confidence of X → Y is at least the confidence threshold.
Support: The degree to which items appear together
  The support of a set of items is the fraction of transactions that contain all items in the set.
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
  Here, the set {Chicken, Ramen, Milk} has support 3/7.
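The definition of support can be checked mechanically. The following is a minimal sketch, assuming each transaction is stored as a Python set; the function name `support` is chosen here for illustration and is not from the lecture.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide, each stored as a set of items.
transactions = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # <= is the subset test
    return Fraction(hits, len(transactions))

print(support({"Chicken", "Ramen", "Milk"}, transactions))  # 3/7
```

Using `Fraction` keeps the result exact, so it can be compared directly with a threshold such as 3/7.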
Confidence: Cause → Effect
  The confidence of rule X → Y is the fraction of transactions containing all items in X that also contain all items in Y.
  The following rules both have confidence 3/3 = 1:
    Ramen → {Milk, Chicken}
    {Ramen, Chicken} → Milk
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
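Confidence can be computed the same way as support: count the transactions containing X, then see how many of those also contain Y. This is a small sketch under the same assumption that transactions are Python sets; the name `confidence` is illustrative.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
transactions = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]

def confidence(x, y, transactions):
    """Of the transactions containing all of x, the fraction that also contain all of y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        raise ValueError("no transaction contains x, so confidence is undefined")
    with_both = [t for t in with_x if y <= t]
    return Fraction(len(with_both), len(with_x))

print(confidence({"Ramen"}, {"Milk", "Chicken"}, transactions))   # 1
print(confidence({"Ramen", "Chicken"}, {"Milk"}, transactions))   # 1
```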
Exercise: Which rules X → Y are valid?
  Thresholds: support is 3/7, confidence is 1
  Is the support of X ∪ Y at least 3/7? (support: fraction of transactions that contain X ∪ Y)
  Is the confidence of X → Y at least 1? (confidence: fraction of transactions containing X that also contain Y)
    A. Chicken → Milk
    B. Ramen → Milk
    C. Both
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
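To check an answer to this exercise, the two tests can be combined into one function. This is a sketch only; `is_valid` and its parameter names are made up for illustration.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
transactions = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X); raises ZeroDivisionError if X never occurs."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def is_valid(x, y, transactions, min_support, min_confidence):
    """A rule X → Y is valid if it meets both thresholds."""
    return (support(set(x) | set(y), transactions) >= min_support
            and confidence(x, y, transactions) >= min_confidence)

# Try both rules from the exercise with the thresholds given on the slide.
for x, y in [({"Chicken"}, {"Milk"}), ({"Ramen"}, {"Milk"})]:
    print(sorted(x), "->", sorted(y), is_valid(x, y, transactions, Fraction(3, 7), 1))
```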
The association rule data mining problem Input: A table of transactions, a support threshold and a confidence threshold Output: all of the valid association rules
The Apriori algorithm for finding valid association rules
  The Apriori algorithm has two main tasks:
    Find all frequent itemsets, i.e., those with support at least the given support threshold
    Find all rules X → Y with confidence at least the given confidence threshold
  Calculating association rules on terabytes of data can be sloooowww.
  The slowest part is finding the frequent itemsets. Let's get back to these.
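The second task is not worked through in these slides. One common way to do it, sketched here as an assumption rather than as the lecture's method, is to split each frequent itemset Z into a left side X and a right side Z minus X, and keep the splits whose confidence meets the threshold. The function name `rules_from_itemset` is hypothetical, and the example uses the sushi/chicken transactions from earlier in the lecture.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def rules_from_itemset(itemset, transactions, min_confidence):
    """List (X, Y, confidence) for every split of itemset into X → Y that is confident enough.

    Assumes itemset occurs in at least one transaction, so no support is zero.
    """
    itemset = frozenset(itemset)
    whole = support(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):           # X is a non-empty proper subset
        for x in combinations(sorted(itemset), size):
            x = frozenset(x)
            conf = whole / support(x, transactions)
            if conf >= min_confidence:
                rules.append((x, itemset - x, conf))
    return rules

for x, y, conf in rules_from_itemset({"Chicken", "Milk", "Ramen"}, transactions, 1):
    print(sorted(x), "->", sorted(y), conf)
```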
A frequent itemset: a set whose support is at least some specified threshold
  Example: Let the support threshold be 3/7
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
  {Chicken, Milk, Ramen} is a frequent itemset
The Apriori algorithm key idea
  The Apriori algorithm speeds up the task of finding frequent itemsets, based on the observation that each subset of a frequent itemset must also be a frequent itemset.
  Let's see how this is done.
A frequent itemset: a set whose support is at least some specified threshold
  Support threshold: 3/7
  Claim: Each subset of a frequent itemset is also a frequent itemset
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
  {Chicken, Milk, Ramen} is a frequent itemset, and so {Chicken, Milk}, {Chicken, Ramen}, {Milk, Ramen} must also be frequent itemsets.
A frequent itemset: a set whose support is at least some specified threshold
  Support threshold: 3/7
  Claim: Each subset of a frequent itemset is also a frequent itemset
    T1: Sushi, Chicken, Milk
    T2: Sushi, Bread
    T3: Bread, Vegetables
    T4: Sushi, Chicken, Bread
    T5: Sushi, Chicken, Ramen, Bread, Milk
    T6: Chicken, Ramen, Milk
    T7: Chicken, Milk, Ramen
  Conversely, {Vegetables} is not a frequent itemset. So any set containing Vegetables cannot be a frequent itemset. For example, {Sushi, Vegetables} is not frequent.
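Both directions of the claim can be checked on this data. The sketch below is illustrative only: it tests every non-empty subset of {Chicken, Milk, Ramen}, and every pair that includes Vegetables. The helper names are made up.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]
threshold = Fraction(3, 7)

def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def all_subsets_frequent(itemset, transactions, threshold):
    """True if every non-empty subset of itemset meets the support threshold."""
    items = sorted(itemset)
    return all(support(sub, transactions) >= threshold
               for size in range(1, len(items) + 1)
               for sub in combinations(items, size))

print(all_subsets_frequent({"Chicken", "Milk", "Ramen"}, transactions, threshold))

# Any set containing an infrequent item is itself infrequent:
# adding items can only shrink the set of matching transactions.
other_items = set().union(*transactions) - {"Vegetables"}
pairs_with_vegetables = [{"Vegetables", item} for item in other_items]
print(any(support(p, transactions) >= threshold for p in pairs_with_vegetables))
```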
The Apriori algorithm: Finding frequent itemsets
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  We'll work through the algorithm to determine the frequent itemsets for this input.
Apriori round 1: Find all frequent itemsets of size 1
  List candidate itemsets of size 1:
    {apple} {corn} {dates} {rice} {tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 1: Find all frequent itemsets of size 1
  Calculate the support of each candidate itemset
  Support: {apple} = 2/4  {corn}  {dates}  {rice}  {tuna}
  What is the support for corn?
    a. 1/4
    b. 2/4
    c. 3/4
    d. 4/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 1: Find all frequent itemsets of size 1
  Calculate the support of each candidate itemset
  Support: {apple} = 2/4  {corn} = 4/4  {dates} = 3/4  {rice} = 1/4  {tuna} = 3/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 1: Find all frequent itemsets of size 1
  Calculate the support of each candidate itemset
  Support: {apple} = 2/4  {corn} = 4/4  {dates} = 3/4  {rice} = 1/4  {tuna} = 3/4
  Can any itemset containing rice ever be a frequent itemset, when the support threshold is 50%?
    A. Yes
    B. No
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 1: Find all frequent itemsets of size 1
  Set F1 to be the list of frequent itemsets of size 1:
    {apple} = 2/4  {corn} = 4/4  {dates} = 3/4  {tuna} = 3/4
    ({rice} = 1/4 is below the 50% threshold, so it is not in F1)
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 2: Find all frequent itemsets of size 2
  List candidate itemsets of size 2:
    {apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}
  Because {rice} is not frequent, any set that includes rice is not frequent, so we ignore itemsets that include rice.
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 2: Find all frequent itemsets of size 2
  Calculate the support of each candidate itemset:
    {apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  Group exercise: count support for these itemsets.
Apriori round 2: Find all frequent itemsets of size 2
  Calculate the support of each candidate itemset:
    {apple, corn} = 2/4
    {apple, dates} = 2/4
    {apple, tuna} = 1/4
    {corn, dates} = 3/4
    {corn, tuna} = 3/4
    {dates, tuna} = 2/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 2: Find all frequent itemsets of size 2
  Set F2 to be the list of frequent itemsets of size 2. The candidates and their supports:
    {apple, corn} = 2/4
    {apple, dates} = 2/4
    {apple, tuna} = 1/4
    {corn, dates} = 3/4
    {corn, tuna} = 3/4
    {dates, tuna} = 2/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  Group exercise: what are the frequent itemsets of size 2?
Apriori round 2: Find all frequent itemsets of size 2
  Set F2 to be the list of frequent itemsets of size 2:
    {apple, corn} = 2/4
    {apple, dates} = 2/4
    {corn, dates} = 3/4
    {corn, tuna} = 3/4
    {dates, tuna} = 2/4
    ({apple, tuna} = 1/4 is below the 50% threshold, so it is not in F2)
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 3: Find all frequent itemsets of size 3
  Given the frequent itemsets of size 2:
    {apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  Without counting support, what are the candidate frequent itemsets of size 3?
  (Key: all subsets of a candidate itemset should be frequent itemsets! For example, {apple, corn, rice} is not a candidate itemset because {apple, rice} is not a frequent itemset.)
Apriori round 3: Find all frequent itemsets of size 3
  Given the frequent itemsets of size 2:
    {apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  Without counting support, what are the candidate frequent itemsets of size 3?
    A. {apple, corn, dates}
    B. {apple, corn, dates}, {apple, corn, tuna}, {corn, dates, tuna}
    C. {apple, corn, tuna}, {corn, dates, tuna}
    D. None of the above
Apriori round 3: Find all frequent itemsets of size 3
  Great! We now have a list of candidate itemsets of size 3:
    {apple, corn, dates}
    {corn, dates, tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
  Group exercise: calculate the support for these candidate itemsets.
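The candidate step uses no transaction data at all, only the frequent itemsets from the previous round. A minimal sketch of that step, for rounds k of 2 or more, might look like this; `generate_candidates` is a name invented here, and real implementations usually build candidates more cleverly than by trying every combination.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Size-k itemsets all of whose size-(k-1) subsets are in prev_frequent (k >= 2)."""
    prev = {frozenset(s) for s in prev_frequent}
    items = sorted(set().union(*prev))  # only items that still appear in a frequent set
    candidates = []
    for combo in combinations(items, k):
        if all(frozenset(sub) in prev for sub in combinations(combo, k - 1)):
            candidates.append(frozenset(combo))
    return candidates

# F2 from the worked example.
f2 = [
    {"apple", "corn"}, {"apple", "dates"}, {"corn", "dates"},
    {"corn", "tuna"}, {"dates", "tuna"},
]
for c in generate_candidates(f2, 3):
    print(sorted(c))
# ['apple', 'corn', 'dates']
# ['corn', 'dates', 'tuna']
```

{apple, corn, tuna} is rejected because its subset {apple, tuna} is not in F2.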
Apriori round 3: Find all frequent itemsets of size 3
  Calculate the support of each candidate itemset:
    {apple, corn, dates} = 2/4
    {corn, dates, tuna} = 2/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 3: Find all frequent itemsets of size 3
  Set F3 to be the list of frequent itemsets of size 3:
    {apple, corn, dates} = 2/4
    {corn, dates, tuna} = 2/4
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori round 4: Find all frequent itemsets of size 4
  Given the frequent itemsets of size 3:
    {apple, corn, dates} {corn, dates, tuna}
  Without counting support, what are the candidate frequent itemsets of size 4?
    A. Nothing
    B. {apple, corn, dates, tuna}
    C. {apple, corn, dates, tuna}, {apple, corn, dates, rice}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori example: done!
  The whole list of frequent itemsets for this example is:
    {apple} {corn} {dates} {tuna}
    {apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
    {apple, corn, dates} {corn, dates, tuna}
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold 50%
Apriori example: done!
  Frequent itemsets:
    {apple} {corn} {dates} {tuna}
    {apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
    {apple, corn, dates} {corn, dates, tuna}
  Itemsets we counted support for:
    {apple} {corn} {dates} {rice} {tuna}
    {apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}
    {apple, corn, dates} {corn, dates, tuna}
  All possible itemsets:
    {apple} {corn} {dates} {rice} {tuna}
    {apple, corn} {apple, dates} {apple, rice} {apple, tuna} {corn, dates} {corn, rice} {corn, tuna} {dates, rice} {dates, tuna} {rice, tuna}
    {apple, corn, dates} {apple, corn, rice} {apple, corn, tuna} {apple, dates, rice} {apple, dates, tuna} {apple, rice, tuna} {corn, dates, rice} {corn, dates, tuna} {corn, rice, tuna} {dates, rice, tuna}
    {apple, corn, dates, rice} {apple, corn, dates, tuna} {apple, corn, rice, tuna} {apple, dates, rice, tuna} {corn, dates, rice, tuna}
    {apple, corn, dates, rice, tuna}
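The saving from pruning can be put in numbers. With five distinct items there are 2^5 - 1 = 31 non-empty itemsets, while the worked example only counted support for 5 + 6 + 2 = 13 of them. A quick check of that arithmetic:

```python
from math import comb

n_items = 5  # apple, corn, dates, rice, tuna

# Number of non-empty itemsets: choose k items out of 5, for k = 1..5.
possible = sum(comb(n_items, k) for k in range(1, n_items + 1))

# Candidates whose support was counted in rounds 1, 2 and 3 of the worked example.
counted = 5 + 6 + 2

print(possible, counted)  # 31 13
```

The gap grows quickly with more items, since the number of possible itemsets doubles with each item added.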
That's how the algorithm works
  Let's see it written down, and see how it works on one more example.
Apriori algorithm
  1. Set k to 0 [k keeps track of what round we're on]
  2. Repeat
       a. Add 1 to k
       b. Set C_k to be the list of candidate itemsets of size k (those whose subsets of size k-1 are all frequent; in round 1, every single item is a candidate)
       c. Calculate the support of the itemsets in C_k
       d. Set F_k to be the list of frequent itemsets in C_k (those with support at least the threshold)
     Until F_k is empty
  3. Output the union of all F_k
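The steps above can be sketched in Python as follows. This is one straightforward reading of the pseudocode, not an optimized implementation. It assumes transactions are given as sets and that "frequent" means support at least the threshold, as in the definition earlier in the lecture. The function name `apriori` and the dictionary it returns are choices made here.

```python
from fractions import Fraction
from itertools import combinations

def apriori(transactions, threshold):
    """Return a dict mapping each frequent itemset (a frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    frequent = {}   # union of all F_k found so far
    prev = set()    # F_{k-1}
    k = 0                                               # step 1
    while True:                                         # step 2
        k += 1                                          # step 2a
        if k == 1:                                      # step 2b
            items = sorted(set().union(*transactions))
            candidates = [frozenset([i]) for i in items]
        else:
            items = sorted(set().union(*prev))
            candidates = [
                frozenset(c) for c in combinations(items, k)
                if all(frozenset(s) in prev for s in combinations(c, k - 1))
            ]
        f_k = {}
        for c in candidates:                            # step 2c
            s = Fraction(sum(1 for t in transactions if c <= t), n)
            if s >= threshold:                          # step 2d
                f_k[c] = s
        if not f_k:                                     # until F_k is empty
            break
        frequent.update(f_k)
        prev = set(f_k)
    return frequent                                     # step 3

transactions = [
    {"apple", "dates", "rice", "corn"},
    {"corn", "dates", "tuna"},
    {"apple", "corn", "dates", "tuna"},
    {"corn", "tuna"},
]
result = apriori(transactions, Fraction(1, 2))
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

With a threshold of 50% this prints the 11 frequent itemsets from the worked example. Passing `Fraction(3, 4)` instead gives the 75% run.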
Apriori algorithm: repeat loop, round 1 (k = 1 at step a)
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold = 75%
    Step 2b: C1    Step 2c: Support    Step 2d: F1
    {apple}        2/4
    {dates}        3/4                 {dates}
    {rice}         1/4
    {corn}         4/4                 {corn}
    {tuna}         3/4                 {tuna}
  F1: {dates}, {corn}, {tuna}
Apriori algorithm: repeat loop, round 2 (k = 2 at step a)
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold = 75%
    Step 2b: C2      Step 2c: Support    Step 2d: F2
    {corn, dates}    3/4                 {corn, dates}
    {corn, tuna}     3/4                 {corn, tuna}
    {dates, tuna}    2/4
  F1: {dates}, {corn}, {tuna}
  F2: {corn, dates}, {corn, tuna}
Apriori algorithm: repeat loop, round 3 (k = 3 at step a)
    T1: apple, dates, rice, corn
    T2: corn, dates, tuna
    T3: apple, corn, dates, tuna
    T4: corn, tuna
  Support threshold = 75%
    Step 2b: C3      Step 2c: Support    Step 2d: F3
  F1: {dates}, {corn}, {tuna}
  F2: {corn, dates}, {corn, tuna}
  Clicker question: What are the candidate sets in C3?
    A. nothing
    B. {corn, dates, tuna}
Great! Your turn!
  In a group: use the Apriori algorithm to find frequent itemsets with a support threshold of 3/7. Write down what sets you have at each step!
    T1: cake, jam, rolls, tea
    T2: cake, jam, tea
    T3: cake, jam
    T4: jam, rolls, tea
    T5: jam, rolls
    T6: rolls, tea
    T7: jam, tea
  Support threshold = 3/7
Apriori algorithm: clicker question
  Which of the following are in F3?
    A. {cake, jam, rolls}
    B. {cake, jam, tea}
    C. {jam, rolls, tea}
    D. All are in F3
    E. None are in F3
    T1: cake, jam, rolls, tea
    T2: cake, jam, tea
    T3: cake, jam
    T4: jam, rolls, tea
    T5: jam, rolls
    T6: rolls, tea
    T7: jam, tea
  Support threshold = 3/7
Let's walk through the example
  Support for candidate sets of size 1:
    {cake} = 3/7
    {jam} = 6/7
    {rolls} = 4/7
    {tea} = 5/7
  F1: {cake}, {jam}, {rolls}, {tea}
    T1: cake, jam, rolls, tea
    T2: cake, jam, tea
    T3: cake, jam
    T4: jam, rolls, tea
    T5: jam, rolls
    T6: rolls, tea
    T7: jam, tea
  Support threshold = 3/7
Let's walk through the example
  Support for candidate sets of size 2:
    {cake, jam} = 3/7
    {cake, rolls} = 1/7
    {cake, tea} = 2/7
    {jam, rolls} = 3/7
    {jam, tea} = 4/7
    {rolls, tea} = 3/7
  F2: {cake, jam}, {jam, rolls}, {jam, tea}, {rolls, tea}
  Support for candidate sets of size 3:
    {jam, rolls, tea} = 2/7
  F3 is empty
    T1: cake, jam, rolls, tea
    T2: cake, jam, tea
    T3: cake, jam
    T4: jam, rolls, tea
    T5: jam, rolls
    T6: rolls, tea
    T7: jam, tea
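One way to double-check a hand-worked answer is brute force: compute the support of every possible itemset and keep the frequent ones. Note that this is deliberately not Apriori, since it does no pruning, and it is only practical because the example has four items. The function name `frequent_by_size` is made up for this sketch.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"cake", "jam", "rolls", "tea"},
    {"cake", "jam", "tea"},
    {"cake", "jam"},
    {"jam", "rolls", "tea"},
    {"jam", "rolls"},
    {"rolls", "tea"},
    {"jam", "tea"},
]

def frequent_by_size(transactions, threshold):
    """Brute force: map each size k to the list of itemsets of that size meeting the threshold."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        result[k] = [
            set(c) for c in combinations(items, k)
            if Fraction(sum(1 for t in transactions if set(c) <= t), n) >= threshold
        ]
    return result

by_size = frequent_by_size(transactions, Fraction(3, 7))
for k, itemsets in by_size.items():
    print(k, [sorted(s) for s in itemsets])
```

The lists for sizes 1 and 2 should match F1 and F2 from the walkthrough, and the lists for sizes 3 and 4 should be empty.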
The Apriori algorithm shook up the research world
  It has over 20,000 citations! Why?
    It's something people really needed
    It scales really well
    It's easy to understand
    Lots to extend
Coming full circle: back to privacy issues
  Massachusetts released anonymized medical records for state employees. They removed all identifiers but left birthdate (including year), gender, and zip code.
  Group discussion: what percentage of people in the US could likely be uniquely identified by this information? (Note: there are ~7,500 people per zip code)
    A. 0-19%
    B. 20-39%
    C. 40-59%
    D. 60-79%
    E. 80-100%
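A back-of-the-envelope calculation can help with the discussion. The sketch below compares the ~7,500 people in a zip code with the number of distinct (birthdate, gender) combinations. The 80-year range of birth years and the two recorded gender values are assumptions made here, not figures from the lecture, and real populations are not spread evenly across birthdates.

```python
people_per_zip = 7500   # from the slide
birth_years = 80        # assumption: birth years spread over roughly 80 years
days_per_year = 365
genders = 2             # assumption: two values recorded in the data

# Number of distinct (birthdate, gender) combinations within one zip code.
combinations_per_zip = birth_years * days_per_year * genders
print(combinations_per_zip)  # 58400

# Average number of people per combination, if people were spread evenly.
print(people_per_zip / combinations_per_zip)
```

When there are far more combinations than people, many people end up being the only one with their combination.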
Group exercise
  Is it a problem that we can tell that in one database one individual (we don't know the name, but we know the age, gender, and zip code) has a set of medical conditions?
Well...
  Okay, so we can uniquely determine that there exists some person with some medical visits. We still don't know who they are.
  But there are other data sources, too. Publicly available voting records include name, zip code, birthdate and gender of voters.
  So if you put the two together, you now have names and health records together.
  Security researcher (and graduate student) Latanya Sweeney sent the Governor's full health records to his office.
  http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/
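Putting the two data sources together amounts to matching records on the fields they share. The sketch below shows the idea on made-up data: every record, name and field name here is hypothetical and invented for illustration, and `link_records` is not a real library function.

```python
# All records below are fabricated for illustration only.
medical_records = [
    {"zip": "00001", "birthdate": "1950-01-01", "gender": "M", "diagnosis": "condition A"},
    {"zip": "00001", "birthdate": "1972-06-15", "gender": "F", "diagnosis": "condition B"},
]
voter_records = [
    {"name": "Person One", "zip": "00001", "birthdate": "1950-01-01", "gender": "M"},
    {"name": "Person Two", "zip": "00001", "birthdate": "1980-03-09", "gender": "F"},
]

def link_records(medical, voters):
    """Pair a medical record with a name when exactly one voter shares zip, birthdate and gender."""
    def key(record):
        return (record["zip"], record["birthdate"], record["gender"])

    names_by_key = {}
    for v in voters:
        names_by_key.setdefault(key(v), []).append(v["name"])

    matches = []
    for m in medical:
        names = names_by_key.get(key(m), [])
        if len(names) == 1:  # a unique match re-identifies the "anonymous" record
            matches.append((names[0], m["diagnosis"]))
    return matches

print(link_records(medical_records, voter_records))
# [('Person One', 'condition A')]
```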
Learning goals revisited
  [CT Building Block] Students will be able to demonstrate that they understand the Apriori algorithm by describing what the output would be for a small input.
  [CT Building Block] Students will be able to create English language descriptions of algorithms to analyze data and show how their algorithms would work on an input data set.