Learning to Cook: An Exploration of Recipe Data


Travis Arffa (tarffa), Rachel Lim (rachelim), Jake Rachleff (jakerach)

Abstract: Using recipe data scraped from the internet, this project successfully applied both unsupervised and supervised machine learning techniques to understand the nature and composition of online recipes, and to predict recipe ratings based solely on the content of their postings. This paper is divided into several sections. The INTRODUCTION explains the goals of the project, and the DATASET AND FEATURES section details the dataset used. Since the objective is two-fold and multiple models were tried, the methods, results, and discussion are split up according to the question addressed and the model used.

I. INTRODUCTION

This project's aim was to explore recipe data and use machine learning techniques to understand the structures and patterns behind good cooking, given that food is such an essential part of life. Using data acquired from Epicurious.com, this project aims to answer two main questions: (1) Assuming no prior knowledge of food and recipes, what can be learned about the natural groupings and structure of recipes? (2) How well can the quality of a recipe be predicted? Practically speaking, the input was a collection of recipes scraped from Epicurious.com. To address the first question, finding ideal clusterings of recipes based on the ingredients used in each recipe, the unsupervised learning technique of K-means clustering was used. This technique, using sparse ingredient feature vectors, succeeded in isolating specific types of food and granting insight into the prevalence of different ingredients in specific types of cuisine. It also revealed interesting commonalities between cuisines, which are detailed in the discussion portion of the LEARNING CUISINE TYPES section.
To address the second question, supervised learning strategies including linear regression, Naive Bayes, and Random Forest were used to predict a recipe's score from input features such as recipe length, number of ingredients, type of ingredients, and nutritional value. These techniques illuminate which aspects of an online recipe posting serve as the best predictors of ratings. The less complex regression models failed to account for much of the variance in the rating data, while the Random Forest technique yielded more accurate predictions on the test set. Naive Bayes and Random Forest both avoided overfitting, as well as over-emphasis on ingredients that happened to be more prevalent in the training set. The various methods, models, and results are discussed in the RATINGS PREDICTION section of this paper. Approaching the subject of food and recipes from multiple angles allowed more utility to be gleaned from the data acquired. Furthermore, this project was able to explore and answer interesting questions about the occurrence of recipe ingredients, how recipes are correlated with each other, and what makes recipes differ in quality across a wide spectrum of food and drink options.

II. RELATED WORK

In the past, attempts have been made to classify dishes within a given cuisine type [1]. That approach proved useful for the subset of Indian cuisine; this project aims to generalize to all types of cuisine. In addition, previous CS229 reports sought to classify recipes based on known labels [2]. By performing unsupervised learning instead of supervised learning, this project aims to discover what cuisine types should be, rather than how they are already classified; it therefore builds on the work of past CS229 students in discovering structure in recipes. Furthermore, previous CS229 projects sought to find the optimal amounts of specific ingredients for the best recipes [3], and to replace ingredients in recipes without losing taste [4].
Again, this project takes a much wider view of recipe prediction, looking to predict the quality of a recipe as a whole instead of the quality of different amounts of ingredients used.

III. DATASET AND FEATURES

A set of 29,662 recipes was scraped from Epicurious.com, an online food resource for home cooks, each with some number of ratings and reviews given by users of the site. For each recipe, its name, list of ingredients, preparation steps, nutritional information (if available), tags, and average user "would make it again" rating (ranging from 0-100) were collected. Further processing was done on the raw scraped data. From the lists of ingredients, a dictionary of the 355 most common ingredients (each occurring in at least 120 recipes) was hand-curated; each recipe's ingredient list was then filtered with these words, so that entries such as "1/2 teaspoon ground cardamom" were reduced to simple ingredient features such as "cardamom". In the end, the ingredients for each recipe were represented as a binary vector in R^355, where the element at index i is 1 if ingredient i is present in the recipe and 0 if absent. Quantities of the ingredients used were not taken into account. This constituted what we term the ingredient features. Each recipe also had a list of

tags, which were classifications given to the recipe by the initial chefs. In the same manner as with the simple ingredient features, tags occurring at least 15 times were converted into an R^400 binary vector for each recipe. Additionally, several real-valued features were extracted for each recipe for use in the ratings prediction segment of the project. These include the length of the recipe name, the number of steps, the number of ingredients, and nutritional information (per-serving fiber, polyunsaturated fat, sodium, carbohydrates, monounsaturated fat, calories, fat, saturated fat, cholesterol, protein). For the first part of the project, clustering recipes, all ingredient and tag binary vectors for the 29,662 recipes were used. For the second part, ratings prediction, only recipes with more than 15 ratings were used, to minimize noise in the ratings. This data was divided into a training set of size 8,352 (80% of the data) and a test set of size 2,088 (20% of the data), for a total of 10,440 examples. The specific features used for each model are discussed in the METHODS parts of the following sections.

IV. LEARNING CUISINE TYPES

Since the model needed to be resistant to the bias of previous conceptions about food groups, unsupervised learning was a natural choice for the goal of learning natural groupings of food. Specifically, K-means clustering was performed on the binary ingredient vectors representing each of the roughly 30,000 recipes scraped from Epicurious.com. In K-means, the following objective is minimized, where there are m training examples and k clusters, each centered around a point \mu_c, 1 \le c \le k:

    J(c, \mu) = \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2    (1)

In order to find the optimal cluster size and learn the best grouping of food, K-means was run with values of k between 2 and 30.
In addition to the absolute error for each value of k, the top ingredients by volume and by percentage, and the top tags by volume and by percentage, were examined for each cluster. By inspecting the makeup of each cluster, the composition of the dishes in each naturally occurring cluster could be determined. For K-means clustering, total error monotonically decreases with the number of clusters, and a kink in the plot of total error vs. k represents the most natural clustering of the data. Therefore, the best possible clustering of ingredients into cuisine types is expected to occur at the kink. The total clustering error of K-means vs. cluster size is plotted in Fig. 1.

[Fig. 1. Total clustering error of K-means vs. number of clusters]

Though total error starts to decrease at a decreasing rate between k = 3 and k = 5, there is no distinct kink. Instead, the absolute error decreases relatively smoothly, implying that there is no single ideal clustering of recipes. Though no single ideal grouping of cuisine types was found, the clustering algorithm did present a peculiar qualitative insight into the nature of cuisines. Manually examining the user-given tags corresponding to the recipes of each cluster for different values of k, it is observed that each time k is increased, a new cuisine type is discovered. To see the significance of this discovery, examine the two graphs of clusters of binary ingredient vectors run through principal component analysis and projected into two dimensions.

[Fig. 2. Three clusters identified as Cooked Meals (Yellow), Desserts (Purple), and Drinks (Green)]

For example, for k = 3 (see Fig. 2), the most frequent tags by percentage of the groupings above were split into (italian american, mediterranean, african), (cake, dessert), and (punch, cocktail, alcoholic). This corresponded to a breakdown of categories, respectively, as Cooked Food, Desserts, and Drinks.
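The pipeline described above, binary ingredient vectors fed to K-means for a range of k with the total error inspected for a kink, can be sketched roughly as follows. The miniature ingredient vocabulary and toy recipes are hypothetical stand-ins for the 355-ingredient dictionary and 29,662 scraped recipes, and scikit-learn is assumed:

```python
# Sketch of the clustering pipeline: binary ingredient vectors -> K-means
# over a range of k, recording the total within-cluster error J(c, mu).
import numpy as np
from sklearn.cluster import KMeans

# Hand-curated ingredient dictionary (hypothetical miniature version).
vocab = ["flour", "sugar", "butter", "soy sauce", "rice", "rum", "lime"]
index = {w: i for i, w in enumerate(vocab)}

recipes = [
    ["flour", "sugar", "butter"],   # dessert-like
    ["soy sauce", "rice"],          # meal-like
    ["rum", "lime", "sugar"],       # drink-like
    ["flour", "butter", "rice"],
]

# Binary ingredient features: X[r, i] = 1 iff ingredient i is in recipe r.
X = np.zeros((len(recipes), len(vocab)))
for r, ings in enumerate(recipes):
    for ing in ings:
        if ing in index:
            X[r, index[ing]] = 1.0

# Run K-means for a range of k and record the objective (sklearn's
# inertia_ is the sum of squared distances to the assigned centroids).
errors = {}
for k in range(2, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors[k] = km.inertia_
```

Total error is non-increasing in k, so the interesting signal is a kink in `errors`, not its absolute level; in the real data one would also inspect the top tags per cluster for each k.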

[Fig. 3. Four clusters identified as European Based Meals (Green), Asian Based Meals (Teal), Desserts (Purple), and Drinks (Yellow)]

When k is increased from 3 to 4 (see Fig. 3), notice that instead of an entirely new grouping being found, the formerly yellow cluster splits into two, producing groups with top tags (stir-fry, wok, chinese, vietnamese, korean) and (mediterranean, italian american, greek). This change is representative of the trend as k increases. For each increase in the number of clusters past k = 3, one of the three major groups gets rearranged, but the general structure of Meals, Desserts, and Drinks groups stays the same. This intuitively makes sense, since those three groups are extremely different in ingredient makeup; within those groups, however, there are myriad ways to distinguish between meal types. This phenomenon explains the lack of a strong kink in the graph of error as a function of k. As evidence for the above arguments, it is observed that when k increases from 4 to 5, four clusters have relatively similar top tags, while the fifth now has top tags (middle eastern, african), rearranging only the Meals group. Further rearrangements include Desserts splitting into Baked Goods and Frozen Desserts from k = 6 to 7, and Drinks splitting into Fruity Drinks and Non-Fruity Drinks from k = 10 to 11. Even though K-means is a non-hierarchical algorithm, its results displayed hierarchical tendencies. Below the main hierarchy of Meals/Desserts/Drinks, by looking at when one of those clusters splits into different food groups, the new clusters' inherent differences can be assessed. Since Asian food and European food split immediately within Meals, they are quite different and distinct. However, since heavy meat-based American food and German/Scandinavian food do not split into separate clusters until k = 15, it can be inferred that those cuisines are similar.

V. RATINGS PREDICTION

In this section, the project aimed to create a model that utilized information scraped and extracted from recipe postings, and could determine the user score for an online recipe with reasonable accuracy. Prediction methods included linear and locally-weighted regression, Naive Bayes, and Random Forest. Mean absolute error was used as the primary metric of test error, since it is less sensitive to outliers and better handles the noisiness of the averaged ratings and the sparsity of the features.

(I) REGRESSION

The first attempt at ratings prediction used the numeric features provided by the scraper, which included nutritional values and recipe length, to determine which recipes internet users preferred. Both linear regression and locally-weighted linear regression, with exponential weights and a bandwidth parameter of 5, were run. Overall, these methods did not accurately predict user ratings. The mean absolute test error and mean squared error for linear regression were 6.94 and 92.84, respectively. Using locally-weighted regression, errors worsened to a mean absolute error of 7.76. Locally-weighted linear regression is susceptible to outliers and overfitting, which may account for its poor performance compared to standard linear regression. Given the variance in the number of ratings per recipe, as well as individual user biases, the number of outliers is expected to be significant. Furthermore, recipe ratings are likely not linearly related to the features chosen, and the small set of real-valued features used means there is high bias in this model.

(II) NAIVE BAYES

The next attempt at ratings prediction used Naive Bayes, a simple probabilistic model that applies Bayes' theorem with strong independence assumptions between the features. Specifically, it assumes that features are conditionally independent given the class variable. While this assumption does not hold in the case of recipes, since ingredient features are in reality not independent (some ingredients, after all, tend to go together), it serves as a good baseline model for prediction. For the Naive Bayes model, ratings were discretized into buckets, each capturing the same fraction of training examples. Then multiclass Naive Bayes classification was performed, where each class corresponds to a range of predicted ratings. Suppose there exists a new recipe r for which a prediction will be run. Let k be the number of buckets, N be the number of possible ingredients, and n_r be the number of ingredients in recipe r. Let x be the binary ingredient features of the new recipe (so, for the collected

simple ingredient features, x \in R^355). The bucket the recipe belongs to is chosen by:

    b = \arg\max_b \, p(y = b \mid x)
      = \arg\max_b \, \frac{p(x \mid y = b) \, p(y = b)}{p(x)}
      = \arg\max_b \, \frac{\big( \prod_i p(x_i \mid y = b) \big) \, p(y = b)}{\sum_{b'=1}^{k} \big( \prod_i p(x_i \mid y = b') \big) \, p(y = b')}

Thus, whichever class has the highest posterior probability is picked. This model estimates p(x_i \mid y) with a binomial distribution, based on the counts of ingredients in the recipes seen so far, with Laplace smoothing. Supposing there are m recipes of training data,

    p(x_q = 1 \mid y = b) = \phi_{q \mid y = b} = \frac{\sum_{i=1}^{m} 1\{ x_q^{(i)} = 1 \wedge y^{(i)} = b \} + 1}{\sum_{i=1}^{m} 1\{ y^{(i)} = b \} + 2}

Additionally, due to the naivete of the conditional independence assumption for recipes, an expanded ingredient feature set was experimented with, where pairwise combinations of ingredients were used instead of simple ingredients. This was expected to better model the dependencies between ingredients. Hence, when running Naive Bayes with the expanded feature set, each feature vector has one binary feature per ingredient pair. For both sets of experiments, the feature vectors are expected to be sparse, since each scraped recipe uses, on average, only a small fraction of the possible ingredients. By experimenting with the number of buckets (k = 10, 20, 30), it was found that regardless of the number of buckets the ratings are discretized into, the expanded ingredient feature set performs significantly better than the basic ingredient feature set, as can be seen from Fig. 4.

[Fig. 4. Mean absolute error for Naive Bayes with basic vs. expanded ingredient features, for varying numbers of rating buckets]

Discretizing the labels into 10 buckets, the expanded ingredient features reduced the mean absolute error by about 20% relative to the basic ingredient features. Overall, however, Naive Bayes was a poor predictor of recipe ratings, performing worse than the other prediction models used in terms of both mean squared error and mean absolute error.
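A minimal sketch of this bucketed classifier, including the pairwise feature expansion, might look as follows. The vocabulary, recipes, ratings, and bucket count are hypothetical toy values; scikit-learn's BernoulliNB with alpha=1.0 is assumed, which matches the (+1 / +2) Laplace smoothing above:

```python
# Sketch of bucketed Naive Bayes rating prediction with pairwise
# ingredient features. Toy data; assumes numpy and scikit-learn.
from itertools import combinations
import numpy as np
from sklearn.naive_bayes import BernoulliNB

vocab = ["sugar", "salt", "flour", "soy sauce", "rice"]
pairs = list(combinations(range(len(vocab)), 2))  # expanded feature set

def featurize(ingredient_sets):
    """Binary simple features plus binary pairwise co-occurrence features."""
    X = np.zeros((len(ingredient_sets), len(vocab) + len(pairs)))
    for r, ings in enumerate(ingredient_sets):
        present = {i for i, w in enumerate(vocab) if w in ings}
        for i in present:
            X[r, i] = 1.0
        for p, (i, j) in enumerate(pairs):
            if i in present and j in present:
                X[r, len(vocab) + p] = 1.0
    return X

recipes = [{"sugar", "flour"}, {"soy sauce", "rice"},
           {"salt", "flour"}, {"sugar", "salt", "flour"}]
ratings = np.array([91.0, 78.0, 85.0, 95.0])

# Discretize ratings into k equal-frequency buckets (here k = 2),
# using quantiles as the bucket edges.
k = 2
edges = np.quantile(ratings, np.linspace(0, 1, k + 1)[1:-1])
y = np.digitize(ratings, edges)  # bucket label per recipe

# alpha=1.0 gives the Laplace-smoothed Bernoulli estimates above.
clf = BernoulliNB(alpha=1.0).fit(featurize(recipes), y)
pred_bucket = clf.predict(featurize([{"sugar", "flour", "salt"}]))[0]
```

Equal-frequency bucketing via quantiles mirrors the report's choice of buckets that each capture the same fraction of training examples.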
This is likely because the conditional independence assumption the model is built on does not hold in the case of recipes. However, using the expanded pairwise feature set yielded a marked improvement over the simple ingredient feature set, since pairwise ingredient features better capture the dependencies between ingredients. This agrees with one's intuition that the pairing of ingredients, rather than just the ingredients independently, has a stronger impact on how good the food tastes, and thus on how well rated the recipe is.

(III) RANDOM FOREST

The Random Forest technique uses randomized samples of the data to fit a multitude of smaller regression trees instead of one large tree, and outputs a prediction that is the average output of the smaller trees. The randomization serves to reduce correlation between trees, while the overall method reduces the overfitting to training data that is common in single large trees. This model was chosen due to the large number of ingredients, as well as the potential for overfitting to specific ingredients that happen to be more prevalent in the training data. Both a full model and a reduced model were built, with details shown in Table I. A minimum leaf size of 50 was also chosen based on the graph below (Fig. 5).

TABLE I: COMPARISON OF FULL AND REDUCED RANDOM FOREST MODELS

    Parameter                   Full    Reduced
    Minimum Leaf Size           50      5
    Number of Trees
    Number of Predictors                15
    Subsample Size
    Mean Absolute Test Error

The test error for each model was nearly identical, indicating that the additional features and tree complexity in the full model did not produce more accurate predictions. The fifteen predictors for the reduced model were chosen based on the Out-of-Bag Variable Importance parameter, which is roughly the average absolute difference between the outputs of trees that included the feature and those that did not. Further, the subsample size was kept constant (and relatively high) for both the full and reduced models to account for the sparsity of the feature vectors.
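The full-vs-reduced workflow above, fitting a forest, ranking features by importance, and refitting on the top fifteen, can be sketched as follows. The data here is synthetic, and scikit-learn's impurity-based feature_importances_ stands in for the out-of-bag variable importance used in the project:

```python
# Sketch of the full vs. reduced Random Forest comparison. Synthetic
# data; assumes numpy and scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_recipes, n_features = 200, 40
X = (rng.random((n_recipes, n_features)) < 0.1).astype(float)  # sparse binary
# Hypothetical ratings: a few influential ingredients plus noise, mean ~87.
ratings = 87 + 5 * X[:, 0] - 4 * X[:, 1] + rng.normal(0, 2, n_recipes)

X_tr, X_te = X[:160], X[160:]
y_tr, y_te = ratings[:160], ratings[160:]

# Full model: all features, minimum leaf size 50 as in the report.
full = RandomForestRegressor(n_estimators=100, min_samples_leaf=50,
                             random_state=0).fit(X_tr, y_tr)
full_mae = np.abs(full.predict(X_te) - y_te).mean()

# Reduced model: keep the 15 most important predictors, smaller leaves.
top15 = np.argsort(full.feature_importances_)[-15:]
reduced = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                                random_state=0).fit(X_tr[:, top15], y_tr)
reduced_mae = np.abs(reduced.predict(X_te[:, top15]) - y_te).mean()
```

Averaging over many shallow, decorrelated trees is what tames the overfitting described above; the large minimum leaf size is what keeps each individual tree shallow.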

[Fig. 5. Error based on minimum leaf sizes of 5, 10, 20, 50, and 100]

The Random Forest procedure produced reasonable predictions, with a mean absolute error of around 6.08, and outperformed all other regression techniques attempted. One fault of the Random Forest predictions was that, for each test recipe, the predicted rating tended to remain close to the mean of the training ratings, which was roughly 87. This was particularly true when the number of trees or the tree width was large, because the weights for each feature (ingredient) remained small. This was counteracted by using more features and examples for both the full and reduced model trees, as well as by using the fillproximities function to account for outliers and clustering of the data. Through use of the Variable Importance parameter, the fifteen most influential ingredients on ratings could also be determined. These included sugar, salt, chocolate, and broccoli, with the latter suggesting lower ratings in most of its corresponding regression trees.

OVERALL: COMPARISON OF RECIPE RATINGS PREDICTION TECHNIQUES

TABLE II: COMPARISON OF TEST ERROR FOR EACH METHOD

    Method Used                                    Mean Absolute Error    Mean Squared Error
    Linear Regression
    Locally Weighted Linear Regression
    Naive Bayes (k = 10)
    Naive Bayes with Expanded Features (k = 10)
    Random Forest Full
    Random Forest Reduced

Overall, Random Forest performed the best of all the ratings prediction techniques, and Naive Bayes the worst. The latter makes overly simplistic assumptions regarding the conditional independence of ingredient features, whereas the former is able to elucidate most accurately the features that set a recipe apart from others. The conditional independence assumptions are mitigated in the pairwise Naive Bayes implementation, which explains its drop in error. Furthermore, the comparative success of Linear Regression against Naive Bayes suggests that predictive power could lie within the nutritional data, a set of features that was not taken into account for Naive Bayes.

VI. CONCLUSIONS AND FUTURE WORK

Both arms of this project provided interesting results. The first goal, learning about cuisine structure, provided the more promising results. The K-means clustering technique shed light on an underlying hierarchical structure of cuisines based on ingredients. Though the expressed goal of finding the ideal clustering of cuisine types was not reached, this project uncovered a much different, and more useful, explanation of how cuisine types interact. In terms of prediction methods, the more naive attempts at regression and classification faltered due to their inability to counteract overfitting, massive differences in the sparsity of ingredients, and flawed model assumptions. The reduced-form Random Forest model was best able to address the challenges in the data by injecting randomness into the training data and making fewer assumptions about linearity and conditional independence. Future research into the network of dependences among ingredients is warranted: with a better understanding of the interactions between ingredients, more apt models can be chosen to predict ratings.

REFERENCES

[1] A. Jain, R. NK, and G. Bagler, "Spices form the basis of food pairing in Indian cuisine," Indian Institute of Technology Jodhpur, February.
[2] J. Naik and V. Polamreddi, "Cuisine Classification and Recipe Generation," CS 229, November.
[3] D. Safreno and Y. Deng, "The Recipe Learner," CS 229, November.
[4] A. Agarwal, D. Jachowski, and S. D. Wu, "RoboChef: Automatic Recipe Generation," CS 229, November.


More information

Prediction of Malignant and Benign Tumor using Machine Learning

Prediction of Malignant and Benign Tumor using Machine Learning Prediction of Malignant and Benign Tumor using Machine Learning Ashish Shah Department of Computer Science and Engineering Manipal Institute of Technology, Manipal University, Manipal, Karnataka, India

More information

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang Classification Methods Course: Gene Expression Data Analysis -Day Five Rainer Spang Ms. Smith DNA Chip of Ms. Smith Expression profile of Ms. Smith Ms. Smith 30.000 properties of Ms. Smith The expression

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

4. Which of the following is not likely to contain cholesterol? (a) eggs (b) vegetable shortening (c) fish (d) veal

4. Which of the following is not likely to contain cholesterol? (a) eggs (b) vegetable shortening (c) fish (d) veal Sample Test Questions Chapter 6: Nutrition Multiple Choice 1. The calorie is a measure of (a) the fat content of foods. (b) the starch content of foods. (c) the energy value of foods. (d) the ratio of

More information

Lecture #4: Overabundance Analysis and Class Discovery

Lecture #4: Overabundance Analysis and Class Discovery 236632 Topics in Microarray Data nalysis Winter 2004-5 November 15, 2004 Lecture #4: Overabundance nalysis and Class Discovery Lecturer: Doron Lipson Scribes: Itai Sharon & Tomer Shiran 1 Differentially

More information

Sound Texture Classification Using Statistics from an Auditory Model

Sound Texture Classification Using Statistics from an Auditory Model Sound Texture Classification Using Statistics from an Auditory Model Gabriele Carotti-Sha Evan Penn Daniel Villamizar Electrical Engineering Email: gcarotti@stanford.edu Mangement Science & Engineering

More information

AUTOMATING NEUROLOGICAL DISEASE DIAGNOSIS USING STRUCTURAL MR BRAIN SCAN FEATURES

AUTOMATING NEUROLOGICAL DISEASE DIAGNOSIS USING STRUCTURAL MR BRAIN SCAN FEATURES AUTOMATING NEUROLOGICAL DISEASE DIAGNOSIS USING STRUCTURAL MR BRAIN SCAN FEATURES ALLAN RAVENTÓS AND MOOSA ZAIDI Stanford University I. INTRODUCTION Nine percent of those aged 65 or older and about one

More information

The Regression-Discontinuity Design

The Regression-Discontinuity Design Page 1 of 10 Home» Design» Quasi-Experimental Design» The Regression-Discontinuity Design The regression-discontinuity design. What a terrible name! In everyday language both parts of the term have connotations

More information

Session 4 or 2: Be a Fat Detective.

Session 4 or 2: Be a Fat Detective. Session 4 or 2: Be a Fat Detective. We ll begin today to keep track of your weight. Your starting weight was Your weight goal is pounds. pounds. To keep track of your weight: At every session, mark it

More information

April-May, Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions

April-May, Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions April-May, 2015 Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions Diabetes - the Medical Perspective Do you know what your fasting blood sugar level is? It s an important

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

INTRODUCTION TO MACHINE LEARNING. Decision tree learning INTRODUCTION TO MACHINE LEARNING Decision tree learning Task of classification Automatically assign class to observations with features Observation: vector of features, with a class Automatically assign

More information

HEALTH TRANS OMEGA-3 OILS BALANCE GOOD FAT PROTEIN OBESITY USAGE HABITS

HEALTH TRANS OMEGA-3 OILS BALANCE GOOD FAT PROTEIN OBESITY USAGE HABITS HEALTH TRANS OMEGA-3 OILS BALANCE GOOD FAT PROTEIN OBESITY USAGE HABITS think 15TH ANNUAL consumer attitudes about nutrition Insights into Nutrition, Health & Soyfoods eat Consumer Attitudes about Nutrition

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

FOOD LABELS.! Taking a closer look at the label! List of Ingredients! Serving Size! % Daily values! Recommended Amounts

FOOD LABELS.! Taking a closer look at the label! List of Ingredients! Serving Size! % Daily values! Recommended Amounts FOOD LABELS! Taking a closer look at the label! List of Ingredients! Serving Size! % Daily values! Recommended Amounts ! Calories! Total Fat Label Contents! Saturated Fat! Cholesterol! Sodium! Total Carbohydrate!

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

An Improved Algorithm To Predict Recurrence Of Breast Cancer

An Improved Algorithm To Predict Recurrence Of Breast Cancer An Improved Algorithm To Predict Recurrence Of Breast Cancer Umang Agrawal 1, Ass. Prof. Ishan K Rajani 2 1 M.E Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India. 2 Assistant

More information

Top 10 Protein Sources for Vegetarians

Top 10 Protein Sources for Vegetarians Top 10 Protein Sources for Vegetarians Proteins are the building blocks of life. They are one of the building blocks of body tissue, and even work as a fuel source for proper development of the body. When

More information

Outline. What s inside this paper? My expectation. Software Defect Prediction. Traditional Method. What s inside this paper?

Outline. What s inside this paper? My expectation. Software Defect Prediction. Traditional Method. What s inside this paper? Outline A Critique of Software Defect Prediction Models Norman E. Fenton Dongfeng Zhu What s inside this paper? What kind of new technique was developed in this paper? Research area of this technique?

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION OPT-i An International Conference on Engineering and Applied Sciences Optimization M. Papadrakakis, M.G. Karlaftis, N.D. Lagaros (eds.) Kos Island, Greece, 4-6 June 2014 A NOVEL VARIABLE SELECTION METHOD

More information

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang Abstract: Unlike most cancers, thyroid cancer has an everincreasing incidence rate

More information

EXTRACT THE BREAST CANCER IN MAMMOGRAM IMAGES

EXTRACT THE BREAST CANCER IN MAMMOGRAM IMAGES International Journal of Civil Engineering and Technology (IJCIET) Volume 10, Issue 02, February 2019, pp. 96-105, Article ID: IJCIET_10_02_012 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=10&itype=02

More information

SUPPLEMENTARY INFORMATION In format provided by Javier DeFelipe et al. (MARCH 2013)

SUPPLEMENTARY INFORMATION In format provided by Javier DeFelipe et al. (MARCH 2013) Supplementary Online Information S2 Analysis of raw data Forty-two out of the 48 experts finished the experiment, and only data from these 42 experts are considered in the remainder of the analysis. We

More information

Healthy Foods for my School

Healthy Foods for my School , y Healthy Foods for my School Schools are an ideal place for children and youth to observe and learn about healthy eating. Children learn about nutrition at school and they often eat at school or buy

More information

Predicting Kidney Cancer Survival from Genomic Data

Predicting Kidney Cancer Survival from Genomic Data Predicting Kidney Cancer Survival from Genomic Data Christopher Sauer, Rishi Bedi, Duc Nguyen, Benedikt Bünz Abstract Cancers are on par with heart disease as the leading cause for mortality in the United

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

BayesRandomForest: An R

BayesRandomForest: An R BayesRandomForest: An R implementation of Bayesian Random Forest for Regression Analysis of High-dimensional Data Oyebayo Ridwan Olaniran (rid4stat@yahoo.com) Universiti Tun Hussein Onn Malaysia Mohd Asrul

More information

Machine Gaydar : Using Facebook Profiles to Predict Sexual Orientation

Machine Gaydar : Using Facebook Profiles to Predict Sexual Orientation Machine Gaydar : Using Facebook Profiles to Predict Sexual Orientation Nikhil Bhattasali 1, Esha Maiti 2 Mentored by Sam Corbett-Davies Stanford University, Stanford, California 94305, USA ABSTRACT The

More information

2010 Dietary Guidelines for Americans

2010 Dietary Guidelines for Americans 2010 Dietary Guidelines for Americans Mary M. McGrane, PhD Center for Nutrition Policy and Promotion February 25, 2015 Agenda for Commodity Supplemental Food Program (CSFP) Brief history and description

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

The studies are also probing into the association of diets high in sugar and fats to this most common diabetes.

The studies are also probing into the association of diets high in sugar and fats to this most common diabetes. MEDICAL researchers announced on March 15, 2012 that they discovered a troubling link between higher consumption of white rice and Type 2 diabetes mellitus, which is of epidemic proportion in Asia and

More information

Classification of Honest and Deceitful Memory in an fmri Paradigm CS 229 Final Project Tyler Boyd Meredith

Classification of Honest and Deceitful Memory in an fmri Paradigm CS 229 Final Project Tyler Boyd Meredith 12/14/12 Classification of Honest and Deceitful Memory in an fmri Paradigm CS 229 Final Project Tyler Boyd Meredith Introduction Background and Motivation In the past decade, it has become popular to use

More information

Inter-session reproducibility measures for high-throughput data sources

Inter-session reproducibility measures for high-throughput data sources Inter-session reproducibility measures for high-throughput data sources Milos Hauskrecht, PhD, Richard Pelikan, MSc Computer Science Department, Intelligent Systems Program, Department of Biomedical Informatics,

More information

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Plous Chapters 17 & 18 Chapter 17: Social Influences Chapter 18: Group Judgments and Decisions

More information

STAT445 Midterm Project1

STAT445 Midterm Project1 STAT445 Midterm Project1 Executive Summary This report works on the dataset of Part of This Nutritious Breakfast! In this dataset, 77 different breakfast cereals were collected. The dataset also explores

More information

UNLOCKING VALUE WITH DATA SCIENCE BAYES APPROACH: MAKING DATA WORK HARDER

UNLOCKING VALUE WITH DATA SCIENCE BAYES APPROACH: MAKING DATA WORK HARDER UNLOCKING VALUE WITH DATA SCIENCE BAYES APPROACH: MAKING DATA WORK HARDER 2016 DELIVERING VALUE WITH DATA SCIENCE BAYES APPROACH - MAKING DATA WORK HARDER The Ipsos MORI Data Science team increasingly

More information

A Bayesian Network Analysis of Eyewitness Reliability: Part 1

A Bayesian Network Analysis of Eyewitness Reliability: Part 1 A Bayesian Network Analysis of Eyewitness Reliability: Part 1 Jack K. Horner PO Box 266 Los Alamos NM 87544 jhorner@cybermesa.com ICAI 2014 Abstract In practice, many things can affect the verdict in a

More information

August-September, Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions

August-September, Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions August-September, 2015 Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions Diabetes - the Medical Perspective Carbohydrates are an essential part of a healthy diet despite

More information

Hypothesis-Driven Research

Hypothesis-Driven Research Hypothesis-Driven Research Research types Descriptive science: observe, describe and categorize the facts Discovery science: measure variables to decide general patterns based on inductive reasoning Hypothesis-driven

More information

Computational Cognitive Neuroscience

Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience *Computer vision, *Pattern recognition, *Classification, *Picking the relevant information

More information

Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation

Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation U. Baid 1, S. Talbar 2 and S. Talbar 1 1 Department of E&TC Engineering, Shri Guru Gobind Singhji

More information

Following Dietary Guidelines

Following Dietary Guidelines LESSON 26 Following Dietary Guidelines Before You Read List some things you know and would like to know about recommended diet choices. What You ll Learn the different food groups in MyPyramid the Dietary

More information

A Survey on Brain Tumor Detection Technique

A Survey on Brain Tumor Detection Technique (International Journal of Computer Science & Management Studies) Vol. 15, Issue 06 A Survey on Brain Tumor Detection Technique Manju Kadian 1 and Tamanna 2 1 M.Tech. Scholar, CSE Department, SPGOI, Rohtak

More information

than 7%) can help protect your heart, kidneys, blood vessels, feet and eyes from the damage high blood glucose levels. October November 2014

than 7%) can help protect your heart, kidneys, blood vessels, feet and eyes from the damage high blood glucose levels. October November 2014 October November 2014 Diabetes - the Medical Perspective Diabetes and Food Recipes to Try Menu Suggestions Diabetes - the Medical Perspective Be Heart Smart: Know Your ABCs of Diabetes There is a strong

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

Exercises: Differential Methylation

Exercises: Differential Methylation Exercises: Differential Methylation Version 2018-04 Exercises: Differential Methylation 2 Licence This manual is 2014-18, Simon Andrews. This manual is distributed under the creative commons Attribution-Non-Commercial-Share

More information

Excel Solver. Table of Contents. Introduction to Excel Solver slides 3-4. Example 1: Diet Problem, Set-Up slides 5-11

Excel Solver. Table of Contents. Introduction to Excel Solver slides 3-4. Example 1: Diet Problem, Set-Up slides 5-11 15.053 Excel Solver 1 Table of Contents Introduction to Excel Solver slides 3- : Diet Problem, Set-Up slides 5-11 : Diet Problem, Dialog Box slides 12-17 Example 2: Food Start-Up Problem slides 18-19 Note

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

EXECUTIVE SUMMARY DATA AND PROBLEM

EXECUTIVE SUMMARY DATA AND PROBLEM EXECUTIVE SUMMARY Every morning, almost half of Americans start the day with a bowl of cereal, but choosing the right healthy breakfast is not always easy. Consumer Reports is therefore calculated by an

More information

Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin.

Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin. Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin. Why Preferences? Learning Consensus from Preferences

More information

UNIT 3, Reading Food Labels Scenario 2 READING FOOD LABELS

UNIT 3, Reading Food Labels Scenario 2 READING FOOD LABELS READING FOOD LABELS Anna's new family has learned a lot about nutrition. Bill and his children now eat differently. Now they eat more healthful foods. Bill eats more than just meat, potatoes, and desserts.

More information

Virtual reproduction of the migration flows generated by AIESEC

Virtual reproduction of the migration flows generated by AIESEC Virtual reproduction of the migration flows generated by AIESEC P.G. Battaglia, F.Gorrieri, D.Scorpiniti Introduction AIESEC is an international non- profit organization that provides services for university

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Abstract One method for analyzing pediatric B cell leukemia is to categorize

More information

Youth4Health Project. Student Food Knowledge Survey

Youth4Health Project. Student Food Knowledge Survey Youth4Health Project Student Food Knowledge Survey Student ID Date Instructions: Please mark your response. 1. Are you a boy or girl? Boy Girl 2. What is your race? Caucasian (White) African American Hispanic

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Fig 1. Comparison of sub-samples on the first two principal components of genetic variation. TheBritishsampleisplottedwithredpoints.The sub-samples of the diverse sample

More information

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection Esma Nur Cinicioglu * and Gülseren Büyükuğur Istanbul University, School of Business, Quantitative Methods

More information