Semantic Pattern Transformation

IKNOW 2013
Peter Teufl, Herbert Leitold, Reinhard Posch
peter.teufl@iaik.tugraz.at

Our Background
Topics:
  Mobile device security
  Cloud security
  Security consulting for public institutions (Austria)
  IT security research
  IT security lectures
  A-SIT e-government

Why does he talk about Knowledge Discovery?
How does IT security relate to knowledge discovery?
  E-government and e-participation: document analysis, Twitter etc.
  Intrusion detection systems (network traffic analysis)
  Malware detection (network traffic, mobile phones)
  Mobile application analysis (metadata, market descriptions)
  Mobile application security (hot topic, BYOD, etc.)

What to expect?
  Motivation for the Semantic Pattern Transformation
  Basic concepts, techniques
  How does it work?
  Evaluation?
  Applications, results, current topics!

Environment
  Arbitrary features
  No a priori knowledge
  Heterogeneous domains
Diagram labels: supervised learning, anomaly detection, text analysis, Android market descriptions, semantic search, clustering, terms, flexible histograms, new numbers, visualization, deployment domains, extracting knowledge

Process...
Knowledge discovery process (Fayyad et al.): different processing steps, from defining the goals to extracting the desired knowledge
Machine learning algorithms are often used within KDD
However, the complete machine learning process is quite similar to KDD
KDT steps: domain-specific data set, knowledge discovery goals, target data set, preprocessing, data extraction, data mining method, data mining algorithm, data mining
ML-KDT steps: domain-specific data set, machine learning goals, instance extraction, feature selection and construction, instance selection, machine learning algorithm, preprocessing, algorithm application
Both end with: knowledge extraction, knowledge processing, interpretation

Machine Learning: ADAPTATION COMPLEXITY?
Steps: domain-specific data set, machine learning goals, instance extraction, feature selection and construction, instance selection, algorithm selection, preprocessing, algorithm application, interpretation
Each step is rated by its dependence on domain data and goals: High, Medium, Low
Assuming an arbitrary data set (e-participation, Android Market applications)
Further assuming a knowledge discovery goal, e.g., unsupervised clustering
Then we need to adapt the steps listed above
And we need to adapt this setup when the data changes, even when the knowledge discovery goals remain the same!
Android Market applications vs. text documents vs. network traffic vs. malware detection?

TOWARDS A SEMANTIC REPRESENTATION
Finding a new representation...
The new representation is called Semantic Patterns
Key properties:
  Still a vector representation (compatible with the old representation)
  Not the feature values themselves, but their semantic relations are represented
  All values have the same meaning and feature type (activation)
Transformation from raw data into Semantic Patterns: Semantic Pattern Transformation

SEMANTIC PATTERN TRANSFORMATION
The Semantic Pattern Transformation is arranged in five layers:
  Layer 1 - Feature extraction (data set, relations, instances)
  Layer 2 - Associative network: node generation
  Layer 3 - Associative network: link generation
  Layer 4 - Spreading activation (SA)
  Layer 5 - Analysis (machine learning, semantic search etc.)
Layer 5 analysis covers: semantic relations, semantic development over time, unsupervised clustering, feature value relevance, anomaly detection, pattern similarity, supervised learning

SPT: Layer 1 - Feature extraction

Country  Exports               Unemployment rate  Fertility rate
C1       coffee                20%                5
C2       cacao                 20%                5
C3       coffee, cacao         20%                5
C4       machinery             5%                 2
C5       chemicals             5%                 2
C6       chemicals, machinery  5%                 2
C7       chemicals, cacao      20%                missing data
C8       missing data          20%                5
C9       coffee, cacao         missing data       missing data

Extract features, their values and determine the type (categorical, distance-based)
  Categorical: Exports
  Distance-based: Unemployment rate, fertility rate
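The feature extraction step can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the dictionary layout, the name `extract_feature_values` and the use of `None` for missing data are assumptions made here. The data and the feature types are taken from the slide.

```python
# Toy data set transcribed from the slide; None marks "missing data".
COUNTRIES = {
    "C1": {"exports": ["coffee"], "unemployment": 20, "fertility": 5},
    "C2": {"exports": ["cacao"], "unemployment": 20, "fertility": 5},
    "C3": {"exports": ["coffee", "cacao"], "unemployment": 20, "fertility": 5},
    "C4": {"exports": ["machinery"], "unemployment": 5, "fertility": 2},
    "C5": {"exports": ["chemicals"], "unemployment": 5, "fertility": 2},
    "C6": {"exports": ["chemicals", "machinery"], "unemployment": 5, "fertility": 2},
    "C7": {"exports": ["chemicals", "cacao"], "unemployment": 20, "fertility": None},
    "C8": {"exports": None, "unemployment": 20, "fertility": 5},
    "C9": {"exports": ["coffee", "cacao"], "unemployment": None, "fertility": None},
}

# Feature types as stated on the slide.
FEATURE_TYPES = {
    "exports": "categorical",
    "unemployment": "distance",
    "fertility": "distance",
}


def extract_feature_values(instance):
    """Return (feature, value, type) triples for one instance, skipping missing data."""
    triples = []
    for feature, value in instance.items():
        if value is None:
            continue  # missing data contributes no feature value
        values = value if isinstance(value, list) else [value]
        for single in values:
            triples.append((feature, single, FEATURE_TYPES[feature]))
    return triples
```

An instance with missing data simply yields fewer triples, so C9 contributes only its two export values.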

SPT: Layer 2 - Node generation
Distance-based feature values: map value ranges to single nodes
Categorical feature values: one node for each value
Associative network nodes for the country data set: 5%, 20%, 5, 2, coffee, cacao, machinery, chemicals
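A minimal sketch of node generation follows. The slide does not say how value ranges are chosen for distance-based features, so the fixed-width binning below is purely an assumption for illustration; the function names and node label format are hypothetical as well.

```python
def nodes_for_categorical(feature, values):
    """One node per distinct categorical value."""
    return {value: f"{feature}:{value}" for value in sorted(set(values))}


def nodes_for_distance(feature, values, width):
    """Map each numeric value to a node that stands for a value range.

    Assumption: ranges are fixed-width bins [low, low + width).
    """
    mapping = {}
    for value in sorted(set(values)):
        low = (value // width) * width
        mapping[value] = f"{feature}:[{low},{low + width})"
    return mapping


export_nodes = nodes_for_categorical("exports", ["coffee", "cacao", "coffee"])
rate_nodes = nodes_for_distance("unemployment", [20, 5, 21], width=5)
```

With a bin width of 5, the values 20 and 21 share one node while 5 gets its own, which mirrors the idea of mapping value ranges to single nodes.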

SPT: Layer 3 - Link generation
Figure: associative network with weighted links between the nodes 5%, 20%, 5, 2, coffee, cacao, machinery and chemicals; highlighted instances: "coffee, 20%, 5" and "chemicals, cacao, 20%"; link weight legend: 0.25, 0.5, 0.75, 1.00
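The slide does not state how link weights are computed. One simple scheme, assumed here only for illustration, counts how often two nodes occur in the same instance and divides by the largest count. On the country data set this happens to produce exactly the four weight levels of the legend (0.25, 0.5, 0.75, 1.00), but that does not confirm it is the authors' method.

```python
from collections import Counter
from itertools import combinations

# Node labels per instance (C1..C9); missing values are simply left out.
INSTANCES = [
    ["coffee", "20%", "5"],
    ["cacao", "20%", "5"],
    ["coffee", "cacao", "20%", "5"],
    ["machinery", "5%", "2"],
    ["chemicals", "5%", "2"],
    ["chemicals", "machinery", "5%", "2"],
    ["chemicals", "cacao", "20%"],
    ["20%", "5"],
    ["coffee", "cacao"],
]


def build_links(instances):
    """Return {(node_a, node_b): weight} with weights normalised to (0, 1]."""
    counts = Counter()
    for nodes in instances:
        # Sorting makes the pair key independent of the order within an instance.
        for a, b in combinations(sorted(set(nodes)), 2):
            counts[(a, b)] += 1
    top = max(counts.values())
    return {pair: count / top for pair, count in counts.items()}


links = build_links(INSTANCES)
```

Nodes that never appear in the same instance, such as coffee and machinery, get no link at all.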

SPT: Layer 4 - Spreading activation
Creating a Semantic Pattern: in this case for coffee and cacao
  Set the activation value of the two nodes to 1.0
  Spread this activation value to neighboring nodes via the weighted links
Figure: associative network with the coffee and cacao nodes activated at 1.0

SPT: Layer 4 - Spreading activation
Typically, one would create Semantic Patterns for all instances within the data set
E.g. a pattern for C1 by activating coffee, 20% and 5
However, we can also create patterns for feature values: e.g. coffee

SPT: Layer 4 - Spreading activation
After SA: each node in the network has an activation value
By representing the nodes and their activation values as a vector, we gain a Semantic Pattern

Node        coffee  cacao  machinery  chemicals  20%   5%    5     2
Activation  1.15    1.15   0.00       0.08       0.38  0.00  0.30  0.00
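Spreading activation can be sketched as follows. The slides do not give the exact update rule, so this version makes several assumptions: seeds start at 1.0, each step passes `value * weight * decay` to every neighbor, and the number of steps is fixed. The link weights in `LINKS` are made up for the example, and the sketch does not try to reproduce the activation values shown on the slide.

```python
def spread_activation(links, seeds, decay=0.5, steps=1):
    """Spread activation from the seed nodes over undirected weighted links."""
    # Build a symmetric neighbor table from the undirected links.
    neighbours = {}
    for (a, b), weight in links.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight

    activation = {node: 0.0 for node in neighbours}
    frontier = {}
    for seed in seeds:
        activation[seed] = 1.0
        frontier[seed] = 1.0

    for _ in range(steps):
        incoming = {}
        for node, value in frontier.items():
            for other, weight in neighbours.get(node, {}).items():
                incoming[other] = incoming.get(other, 0.0) + value * weight * decay
        for node, value in incoming.items():
            activation[node] += value
        frontier = incoming  # only newly received activation spreads further
    return activation


# Hypothetical link weights, not taken from the slides.
LINKS = {
    ("coffee", "cacao"): 0.5,
    ("coffee", "20%"): 0.5,
    ("cacao", "20%"): 0.75,
    ("cacao", "chemicals"): 0.25,
    ("machinery", "5%"): 0.5,
}
NODE_ORDER = ["coffee", "cacao", "machinery", "chemicals", "20%", "5%"]

activation = spread_activation(LINKS, ["coffee", "cacao"])
pattern = [activation[node] for node in NODE_ORDER]  # the Semantic Pattern vector
```

Reading the activations in a fixed node order is what turns the network state into a vector: nodes close to the seeds receive some activation, while unrelated nodes such as machinery stay at zero.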

Each feature value is represented by a semantic fingerprint
Allows for an instant analysis of semantic relations to other feature values
Sort, mean, variance, adding, subtracting
Figures: unsorted Semantic Patterns for "Export: Cacao", "Export: Coffee" and "Fertility: 2" (x-axis: coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2; y-axis: activation from 0 to 0.50)

SPT: Layer 5 - Analysis
Calculating the distance between two patterns (Euclidean distance, cosine similarity)
For unsupervised clustering, semantic-aware search algorithms

Keyword search for coffee:
  C1  coffee                20%           5
  C3  coffee, cacao         20%           5
  C9  coffee, cacao         missing data  missing data

Semantic-aware search for coffee:
  C9  coffee, cacao         missing data  missing data
  C1  coffee                20%           5
  C3  coffee, cacao         20%           5
  C2  cacao                 20%           5
  C8  missing data          20%           5
  C7  chemicals, cacao      20%           missing data
  C5  chemicals             5%            2
  C6  chemicals, machinery  5%            2
  C4  machinery             5%            2
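A semantic-aware ranking can be sketched with cosine similarity over pattern vectors. The pattern values below are copied from the Semantic Patterns table on the "Comparing the two models" slide. Using the pattern of C9 as the query is an assumption made for illustration, since the deck gives no numeric pattern for the feature value coffee; the function names are hypothetical.

```python
import math

# Node order: coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2
PATTERNS = {
    "C1": [1.30, 0.53, 0.00, 0.08, 1.45, 0.00, 1.45, 0.00],
    "C2": [0.45, 1.38, 0.00, 0.15, 1.53, 0.00, 1.45, 0.00],
    "C3": [1.45, 1.53, 0.00, 0.15, 1.68, 0.00, 1.60, 0.00],
    "C4": [0.00, 0.00, 1.30, 0.38, 0.00, 1.38, 0.00, 1.38],
    "C5": [0.00, 0.08, 0.38, 1.30, 0.08, 1.38, 0.00, 1.38],
    "C6": [0.00, 0.08, 1.37, 1.37, 0.08, 1.53, 0.00, 1.53],
    "C7": [0.30, 1.30, 0.08, 1.15, 1.30, 0.15, 0.45, 0.15],
    "C8": [0.30, 0.38, 0.00, 0.08, 1.30, 0.00, 1.30, 0.00],
    "C9": [1.15, 1.15, 0.00, 0.08, 0.38, 0.00, 0.30, 0.00],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, patterns):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in patterns.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


others = {name: vec for name, vec in PATTERNS.items() if name != "C9"}
ranking = rank_by_similarity(PATTERNS["C9"], others)
```

Even though C9 has no unemployment or fertility values, the coffee- and cacao-exporting countries rank above the machinery and chemicals exporters, which a keyword match on the raw values could not express.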

SPT: Layer 5 - Analysis
Machine learning: apply any machine learning algorithm to the Semantic Patterns
  Unsupervised clustering
  Supervised learning
  Semantic-aware search
Knowledge discovery: semantic relations, arbitrary procedures: mean, variance etc.
  Anomaly detection, feature relevance, simple operations (variance, mean, etc.)
Visualization

Machine Learning: Benefits?
Application in heterogeneous domains regardless of the nature of the data
Except for Layer 1, we do not need any manual setup for the layers
Regardless of the analyzed data, the Semantic Patterns always use the same model
This means: regardless of the deployed knowledge discovery method, we can always use the same methods for knowledge extraction!
Figure: the machine learning steps (domain-specific data set, machine learning goals, instance extraction, feature selection and construction, instance selection, algorithm selection, preprocessing, algorithm application, interpretation) shown twice, rated by dependence on domain data and goals (High, Medium, Low)

Comparing the two models

Semantic Patterns
Country  Coffee  Cacao  Machinery  Chemicals  20%   5%    5     2
C1       1.30    0.53   0.00       0.08       1.45  0.00  1.45  0.00
C2       0.45    1.38   0.00       0.15       1.53  0.00  1.45  0.00
C3       1.45    1.53   0.00       0.15       1.68  0.00  1.60  0.00
C4       0.00    0.00   1.30       0.38       0.00  1.38  0.00  1.38
C5       0.00    0.08   0.38       1.30       0.08  1.38  0.00  1.38
C6       0.00    0.08   1.37       1.37       0.08  1.53  0.00  1.53
C7       0.30    1.30   0.08       1.15       1.30  0.15  0.45  0.15
C8       0.30    0.38   0.00       0.08       1.30  0.00  1.30  0.00
C9       1.15    1.15   0.00       0.08       0.38  0.00  0.30  0.00

Value-centric feature vectors
Country  Coffee  Cacao  Machinery  Chemicals  Unemployment rate  Fertility rate
C1       1       0      0          0          20%                5
C2       0       1      0          0          20%                5
C3       1       1      0          0          20%                5
C4       0       0      1          0          5%                 2
C5       0       0      0          1          5%                 2
C6       0       0      1          1          5%                 2
C7       0       1      0          1          20%                missing data
C8       missing data (all export columns)    20%                5
C9       1       1      0          0          missing data       missing data

Figures: mean patterns for C1, C2, C3 and for C4, C5, C6 (unsorted Semantic Patterns over coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2)
Same model: Android application, a country or a document... the activation values always have the same meaning
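Because every component of a Semantic Pattern is an activation value, simple operations such as the mean apply directly. The sketch below computes the two mean patterns named in the figures from the pattern values in the table; the helper name `mean_pattern` is hypothetical.

```python
# Node order: coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2
PATTERNS = {
    "C1": [1.30, 0.53, 0.00, 0.08, 1.45, 0.00, 1.45, 0.00],
    "C2": [0.45, 1.38, 0.00, 0.15, 1.53, 0.00, 1.45, 0.00],
    "C3": [1.45, 1.53, 0.00, 0.15, 1.68, 0.00, 1.60, 0.00],
    "C4": [0.00, 0.00, 1.30, 0.38, 0.00, 1.38, 0.00, 1.38],
    "C5": [0.00, 0.08, 0.38, 1.30, 0.08, 1.38, 0.00, 1.38],
    "C6": [0.00, 0.08, 1.37, 1.37, 0.08, 1.53, 0.00, 1.53],
}


def mean_pattern(patterns):
    """Component-wise mean of equally long pattern vectors."""
    count = len(patterns)
    return [sum(column) / count for column in zip(*patterns)]


mean_c1_c3 = mean_pattern([PATTERNS[c] for c in ("C1", "C2", "C3")])
mean_c4_c6 = mean_pattern([PATTERNS[c] for c in ("C4", "C5", "C6")])
```

The same operation on the value-centric vectors would first need an encoding for percentages and a policy for missing data, which is the contrast the slide draws.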

Evaluation
26 data sets from the UCI machine learning repository
Supervised: SVM
Unsupervised: EM and k-means
Application to raw data and to Semantic Patterns
SP parameters: D=0.5, Comb=E, Norm=L, MDL=1.5, σ = 0.2

Data set       Label  Inst  DF  SF  Classes  SVM (N)  SVM (NN)  SVM (P)  KM (N)  KM (NN)  KM (P)  EM (NN)  EM (P)
Categorical
Breast Cancer  BC     286   -   9   2        0.03     0.04      0.04     0.01    0.01     0.06    0.00     0.08
Dermatology    DE     366   1   33  6        0.93     0.92      0.95     0.58    0.09     0.86    0.87     0.87
KR vs. KP      KR     3196  -   36  2        0.75     0.75      0.72     0.00    0.01     0.00    0.04     0.00
Lymph          LY     148   -   18  4        0.53     0.51      0.48     0.13    0.18     0.25    0.26     0.27
Mushroom       MU     8124  -   22  2        1.00     1.00      1.00     0.48    0.47     0.45    0.61     0.59
Soybean        SO     683   -   35  19       0.92     0.92      0.93     0.59    0.62     0.73    0.79     0.79
Splice         SP     3190  -   60  3        0.71     0.72      0.80     0.03    0.03     0.44    0.41     0.31
Vote           VO     435   -   16  2        0.76     0.74      0.67     0.47    0.48     0.47    0.49     0.45
Zoo            ZO     101   -   17  7        0.94     0.94      0.97     0.78    0.78     0.82    0.82     0.85
Total                                        0.73     0.73      0.73     0.34    0.30     0.45    0.48     0.47
Mixed
Anneal         AN     898   6   32  6        0.86     0.86      0.92     0.23    0.03     0.30    0.31     0.32
Colic          CO     368   7   15  2        0.31     0.32      0.31     0.13    0.03     0.05    0.10     0.12
Credit-A       CA     689   6   9   2        0.41     0.41      0.39     0.16    0.02     0.25    0.17     0.21
Credit-G       CG     1000  7   13  2        0.11     0.10      0.12     0.01    0.01     0.00    0.01     0.02
Heart-C        HC     303   6   7   5        0.36     0.36      0.29     0.24    0.01     0.36    0.31     0.28
Heart-H        HH     294   6   7   5        0.32     0.31      0.33     0.27    0.01     0.32    0.28     0.25
Hepatitis      HE     155   5   14  2        0.25     0.28      0.21     0.13    0.00     0.21    0.22     0.24
Total                                        0.37     0.38      0.37     0.17    0.02     0.21    0.20     0.20
Numerical
Breast-w       BW     699   9   -   2        0.78     0.78      0.77     0.73    0.74     0.82    0.72     0.58
Diabetes       DI     768   8   -   2        0.18     0.18      0.15     0.05    0.03     0.10    0.10     0.08
Glass          GL     214   9   -   7        0.30     0.30      0.50     0.34    0.39     0.33    0.37     0.36
Heart-Statlog  HS     270   13  -   2        0.36     0.36      0.37     0.25    0.02     0.39    0.29     0.27
Ionosphere     IO     351   34  -   2        0.48     0.48      0.50     0.12    0.12     0.16    0.25     0.25
Iris           IR     150   4   -   3        0.87     0.87      0.87     0.71    0.71     0.75    0.81     0.78
Segment        SE     2310  19  -   7        0.88     0.88      0.90     0.61    0.53     0.59    0.62     0.60
Sonar          SO     208   60  -   2        0.23     0.23      0.23     0.01    0.01     0.02    0.01     0.01
Vehicle        VE     846   18  -   4        0.51     0.51      0.48     0.11    0.19     0.19    0.10     0.19
Vowel          VO     990   10  3   11       0.63     0.63      0.76     0.06    0.34     0.23    0.19     0.25
Total                                        0.52     0.52      0.55     0.30    0.31     0.36    0.35     0.34

DOES IT WORK?
Applications described in several publications:
  E-participation analysis (Egyptian revolution, Fukushima, Mitmachen): text documents
  Intrusion detection: event correlation
  RDF data analysis (semantic web)
  WiFi privacy (analyzing captured emails)
  Android Market application analysis

Current Project
Android application security
Container applications for BYOD (require encryption, secure communication, key derivation functions, root checks etc.)
Manual analysis is cumbersome
Semantic Patterns:
  Extract Dalvik VM code, features (opcodes, methods, local variables etc.)
  Apply the Semantic Patterns technique
  Clustering, supervised learning, anomaly detection etc.

Current Project
Also works directly on the phone...
Detecting SMS catchers/sniffers
More fine-grained detection: asymmetric cryptography, symmetric cryptography

Outlook
Publish the Java API... basically a converter from arbitrary feature vectors to Semantic Patterns (e.g. in/out in ARFF format)
Deep learning...
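To make the ARFF output idea concrete, here is a small sketch that writes Semantic Patterns as an ARFF document with one numeric attribute per node. It is not the announced Java API, whose interface the slides do not describe; the function name and the two-decimal formatting are assumptions. Attribute names are single-quoted so that labels such as 20% stay valid.

```python
def patterns_to_arff(relation, node_names, patterns):
    """Serialise pattern vectors as ARFF text with one numeric attribute per node."""
    lines = [f"@relation {relation}", ""]
    for name in node_names:
        # Quote names so that characters such as '%' are not misread.
        lines.append(f"@attribute '{name}' numeric")
    lines += ["", "@data"]
    for values in patterns:
        lines.append(",".join(f"{value:.2f}" for value in values))
    return "\n".join(lines) + "\n"


NODES = ["coffee", "cacao", "machinery", "chemicals", "20%", "5%", "5", "2"]
C9_PATTERN = [1.15, 1.15, 0.00, 0.08, 0.38, 0.00, 0.30, 0.00]

arff_text = patterns_to_arff("countries", NODES, [C9_PATTERN])
```

A relation name containing spaces would need quoting as well; the sketch assumes a single-word name.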

Thx!

Parameter study, categorical data sets

Par     K-Means                                                      |  EM
        Total BC    DE    KR    LY    MU    SO    SP    VO    ZO     |  Total BC    DE    KR    LY    MU    SO    SP    VO    ZO
Raw Data
N       0.341 0.012 0.584 0.004 0.131 0.475 0.587 0.031 0.467 0.782  |  Not available
NN      0.296 0.007 0.094 0.010 0.176 0.472 0.616 0.030 0.476 0.783  |  0.477 0.002 0.871 0.036 0.258 0.610 0.789 0.410 0.494 0.822
Semantic Patterns
D 0.0   0.443 0.025 0.849 0.003 0.199 0.413 0.728 0.465 0.493 0.814  |  0.449 0.004 0.767 0.001 0.222 0.590 0.740 0.423 0.489 0.801
Comb=E Norm=L
D 0.1   0.442 0.029 0.811 0.004 0.245 0.545 0.726 0.387 0.476 0.759  |  0.441 0.074 0.885 0.000 0.271 0.615 0.786 0.004 0.505 0.826
D 0.3   0.447 0.068 0.846 0.004 0.241 0.482 0.724 0.424 0.476 0.758  |  0.460 0.079 0.875 0.001 0.258 0.592 0.788 0.250 0.449 0.846
D 0.5   0.452 0.061 0.856 0.000 0.245 0.448 0.733 0.437 0.467 0.820  |  0.468 0.079 0.874 0.001 0.265 0.592 0.789 0.306 0.452 0.850
D 0.7   0.422 0.069 0.826 0.000 0.209 0.275 0.728 0.419 0.463 0.804  |  0.465 0.079 0.874 0.001 0.252 0.579 0.799 0.312 0.445 0.847
Comb=S Norm=L
D 0.1   0.441 0.056 0.853 0.000 0.244 0.453 0.733 0.399 0.476 0.759  |  0.433 0.079 0.872 0.001 0.270 0.572 0.794 0.001 0.476 0.829
D 0.3   0.434 0.075 0.820 0.000 0.228 0.411 0.718 0.431 0.472 0.750  |  0.466 0.079 0.881 0.001 0.280 0.592 0.802 0.298 0.437 0.828
D 0.5   0.439 0.060 0.792 0.000 0.235 0.416 0.741 0.405 0.463 0.836  |  0.466 0.079 0.871 0.001 0.251 0.581 0.805 0.310 0.445 0.848
D 0.7   0.422 0.067 0.798 0.000 0.224 0.364 0.726 0.376 0.462 0.782  |  0.462 0.087 0.875 0.001 0.254 0.580 0.776 0.292 0.445 0.845
Comb=E Norm=S
D 0.1   0.418 0.029 0.790 0.006 0.236 0.311 0.705 0.449 0.496 0.742  |  0.472 0.002 0.893 0.000 0.263 0.571 0.767 0.432 0.495 0.820
D 0.3   0.452 0.030 0.860 0.001 0.231 0.470 0.715 0.475 0.491 0.799  |  0.476 0.002 0.914 0.000 0.261 0.586 0.775 0.427 0.495 0.823
D 0.5   0.448 0.048 0.799 0.009 0.215 0.539 0.725 0.450 0.493 0.758  |  0.472 0.002 0.897 0.000 0.267 0.584 0.758 0.427 0.484 0.829
D 0.7   0.448 0.033 0.850 0.000 0.230 0.495 0.712 0.435 0.493 0.787  |  0.473 0.002 0.903 0.000 0.250 0.586 0.773 0.427 0.484 0.829
Comb=S Norm=S
D 0.1   0.439 0.029 0.806 0.009 0.250 0.435 0.727 0.439 0.494 0.760  |  0.475 0.002 0.903 0.000 0.254 0.576 0.764 0.429 0.495 0.852
D 0.3   0.420 0.015 0.775 0.004 0.210 0.436 0.717 0.409 0.443 0.774  |  0.474 0.002 0.901 0.000 0.271 0.584 0.763 0.427 0.484 0.837
D 0.5   0.429 0.030 0.789 0.009 0.226 0.410 0.716 0.448 0.485 0.749  |  0.476 0.002 0.904 0.000 0.255 0.586 0.767 0.427 0.484 0.854
D 0.7   0.438 0.040 0.839 0.006 0.246 0.418 0.726 0.409 0.480 0.775  |  0.480 0.002 0.910 0.000 0.269 0.615 0.771 0.431 0.494 0.825

Parameter study, mixed data sets

Par     K-Means                                          |  EM
        Total AN    CO    CA    CG    HC    HH    HE     |  Total AN    CO    CA    CG    HC    HH    HE
Raw Data
N       0.165 0.226 0.129 0.155 0.009 0.237 0.269 0.131  |  Not available
NN      0.017 0.028 0.030 0.016 0.012 0.014 0.012 0.004  |  0.201 0.312 0.103 0.171 0.013 0.309 0.278 0.223
Semantic Patterns
D=0.0 MDL=2.0  D=0.0 MDL=1.0
σ 0.0   0.193 0.253 0.135 0.113 0.007 0.356 0.293 0.195  |  0.190 0.291 0.098 0.227 0.003 0.228 0.258 0.227
σ 0.2   0.198 0.271 0.147 0.116 0.007 0.356 0.301 0.189  |  0.182 0.280 0.098 0.162 0.003 0.244 0.258 0.231
σ 0.4   0.204 0.240 0.157 0.145 0.009 0.356 0.327 0.194  |  0.184 0.226 0.099 0.229 0.004 0.245 0.258 0.227
σ 0.6   0.194 0.221 0.154 0.145 0.008 0.359 0.275 0.196  |  0.194 0.291 0.097 0.240 0.003 0.217 0.281 0.229
σ 0.8   0.200 0.258 0.152 0.098 0.007 0.358 0.327 0.197  |  0.192 0.293 0.097 0.232 0.004 0.228 0.258 0.230
D=0.5 MDL=1.0  D=0.7 MDL=1.0
σ 0.0   0.211 0.320 0.042 0.262 0.001 0.325 0.311 0.215  |  0.210 0.327 0.127 0.218 0.021 0.237 0.311 0.229
σ 0.2   0.201 0.257 0.032 0.262 0.001 0.323 0.311 0.222  |  0.210 0.322 0.126 0.218 0.021 0.237 0.320 0.229
σ 0.4   0.208 0.299 0.035 0.261 0.001 0.326 0.311 0.220  |  0.211 0.322 0.127 0.218 0.021 0.237 0.320 0.229
σ 0.6   0.204 0.281 0.029 0.262 0.001 0.325 0.311 0.220  |  0.211 0.321 0.128 0.218 0.021 0.237 0.320 0.229
σ 0.8   0.207 0.292 0.041 0.263 0.001 0.326 0.311 0.216  |  0.209 0.310 0.127 0.218 0.021 0.237 0.320 0.229
D=0.5 MDL=1.5  D=0.7 MDL=1.5
σ 0.0   0.216 0.317 0.065 0.249 0.001 0.357 0.320 0.203  |  0.204 0.322 0.123 0.212 0.016 0.275 0.247 0.233
σ 0.2   0.211 0.295 0.052 0.247 0.000 0.355 0.320 0.209  |  0.204 0.322 0.123 0.212 0.016 0.275 0.247 0.236
σ 0.4   0.216 0.314 0.074 0.248 0.001 0.357 0.320 0.198  |  0.205 0.323 0.123 0.206 0.016 0.275 0.252 0.237
σ 0.6   0.212 0.308 0.046 0.249 0.001 0.356 0.320 0.209  |  0.204 0.320 0.125 0.208 0.016 0.275 0.246 0.236
σ 0.8   0.211 0.293 0.063 0.248 0.000 0.354 0.320 0.201  |  0.204 0.323 0.125 0.208 0.016 0.275 0.249 0.232
D=0.5 MDL=2.0  D=0.7 MDL=2.0
σ 0.0   0.217 0.304 0.048 0.244 0.000 0.390 0.311 0.219  |  0.206 0.319 0.117 0.229 0.010 0.255 0.277 0.233
σ 0.2   0.218 0.313 0.062 0.244 0.000 0.388 0.311 0.208  |  0.207 0.317 0.126 0.239 0.010 0.255 0.268 0.233
σ 0.4   0.221 0.309 0.084 0.243 0.000 0.389 0.311 0.209  |  0.205 0.319 0.127 0.224 0.010 0.255 0.268 0.233
σ 0.6   0.213 0.285 0.057 0.243 0.000 0.387 0.311 0.210  |  0.206 0.307 0.127 0.240 0.010 0.255 0.268 0.233
σ 0.8   0.211 0.295 0.036 0.244 0.000 0.387 0.311 0.205  |  0.204 0.305 0.127 0.240 0.010 0.255 0.259 0.233
D=0.5 MDL=3.0  D=0.7 MDL=3.0
σ 0.0   0.203 0.294 0.030 0.248 0.000 0.335 0.315 0.196  |  0.192 0.323 0.108 0.248 0.009 0.201 0.250 0.205
σ 0.2   0.208 0.306 0.059 0.248 0.000 0.334 0.315 0.193  |  0.190 0.321 0.107 0.237 0.009 0.201 0.251 0.205
σ 0.4   0.205 0.310 0.050 0.248 0.000 0.334 0.315 0.178  |  0.193 0.322 0.122 0.243 0.009 0.201 0.249 0.205
σ 0.6   0.207 0.300 0.063 0.248 0.001 0.333 0.313 0.192  |  0.192 0.321 0.122 0.243 0.010 0.201 0.245 0.205
σ 0.8   0.210 0.330 0.050 0.246 0.001 0.336 0.315 0.191  |  0.192 0.323 0.122 0.243 0.009 0.201 0.240 0.205

Parameter study, numerical data sets

Par     K-Means                                                            |  EM
        Total BW    DI    GL    HS    IO    IR    SE    SO    VE    VO     |  Total BW    DI    GL    HS    IO    IR    SE    SO    VE    VO
Raw Data
N       0.299 0.734 0.052 0.335 0.254 0.121 0.708 0.608 0.006 0.113 0.057  |  Not available
NN      0.307 0.735 0.030 0.388 0.019 0.123 0.705 0.529 0.008 0.188 0.342  |  0.346 0.718 0.103 0.370 0.289 0.254 0.806 0.621 0.005 0.103 0.194
Semantic Patterns
D=0.0 MDL=1.5  D=0.0 MDL=1.0
σ 0.0   0.315 0.724 0.039 0.329 0.309 0.045 0.717 0.582 0.026 0.198 0.183  |  0.317 0.777 0.006 0.312 0.239 0.218 0.651 0.592 0.016 0.174 0.186
σ 0.2   0.323 0.724 0.025 0.334 0.344 0.071 0.730 0.590 0.012 0.198 0.196  |  0.327 0.752 0.001 0.318 0.240 0.218 0.766 0.598 0.016 0.167 0.197
σ 0.4   0.318 0.719 0.026 0.285 0.316 0.051 0.769 0.600 0.008 0.199 0.203  |  0.323 0.727 0.011 0.287 0.229 0.217 0.749 0.600 0.018 0.176 0.218
σ 0.6   0.317 0.722 0.025 0.298 0.357 0.040 0.712 0.602 0.013 0.199 0.201  |  0.317 0.732 0.009 0.316 0.232 0.221 0.637 0.606 0.025 0.175 0.214
σ 0.8   0.299 0.646 0.015 0.294 0.328 0.026 0.686 0.581 0.014 0.198 0.200  |  0.325 0.703 0.006 0.305 0.233 0.216 0.796 0.594 0.019 0.181 0.195
D=0.5 MDL=1.0  D=0.7 MDL=1.0
σ 0.0   0.333 0.817 0.072 0.293 0.338 0.181 0.611 0.614 0.009 0.164 0.234  |  0.302 0.579 0.082 0.332 0.285 0.184 0.633 0.634 0.006 0.099 0.183
σ 0.2   0.333 0.817 0.076 0.278 0.340 0.181 0.621 0.621 0.009 0.151 0.237  |  0.300 0.579 0.082 0.307 0.285 0.184 0.636 0.632 0.006 0.117 0.176
σ 0.4   0.326 0.817 0.068 0.286 0.335 0.181 0.587 0.604 0.009 0.149 0.228  |  0.301 0.579 0.086 0.310 0.285 0.184 0.639 0.643 0.006 0.095 0.183
σ 0.6   0.327 0.817 0.072 0.269 0.337 0.181 0.604 0.580 0.009 0.166 0.232  |  0.301 0.579 0.076 0.319 0.285 0.184 0.639 0.632 0.006 0.109 0.185
σ 0.8   0.334 0.817 0.071 0.303 0.336 0.181 0.610 0.605 0.011 0.163 0.244  |  0.300 0.579 0.079 0.311 0.285 0.184 0.633 0.633 0.006 0.109 0.183
D=0.5 MDL=1.5  D=0.7 MDL=1.5
σ 0.0   0.352 0.817 0.099 0.298 0.382 0.143 0.751 0.601 0.018 0.193 0.218  |  0.339 0.579 0.086 0.348 0.324 0.242 0.761 0.596 0.013 0.187 0.252
σ 0.2   0.358 0.817 0.100 0.330 0.385 0.163 0.751 0.588 0.015 0.194 0.232  |  0.339 0.579 0.086 0.356 0.324 0.242 0.761 0.595 0.012 0.192 0.239
σ 0.4   0.352 0.817 0.096 0.315 0.387 0.143 0.738 0.576 0.019 0.193 0.231  |  0.340 0.579 0.092 0.348 0.324 0.242 0.761 0.603 0.012 0.194 0.241
σ 0.6   0.348 0.817 0.103 0.288 0.383 0.158 0.716 0.579 0.015 0.194 0.226  |  0.339 0.579 0.094 0.355 0.324 0.242 0.761 0.602 0.012 0.181 0.240
σ 0.8   0.356 0.817 0.098 0.296 0.378 0.166 0.776 0.604 0.012 0.190 0.225  |  0.338 0.579 0.107 0.355 0.324 0.242 0.752 0.597 0.012 0.177 0.236
D=0.5 MDL=2.0  D=0.7 MDL=2.0
σ 0.0   0.329 0.817 0.054 0.339 0.330 0.064 0.752 0.563 0.017 0.151 0.199  |  0.323 0.579 0.105 0.347 0.266 0.228 0.784 0.585 0.015 0.092 0.227
σ 0.2   0.328 0.817 0.052 0.320 0.330 0.064 0.753 0.585 0.017 0.144 0.196  |  0.325 0.579 0.098 0.359 0.266 0.228 0.784 0.584 0.015 0.098 0.238
σ 0.4   0.331 0.817 0.055 0.313 0.330 0.109 0.767 0.562 0.012 0.149 0.194  |  0.323 0.579 0.105 0.358 0.266 0.228 0.784 0.576 0.015 0.090 0.230
σ 0.6   0.330 0.817 0.059 0.335 0.328 0.073 0.765 0.560 0.019 0.148 0.199  |  0.326 0.579 0.099 0.351 0.266 0.228 0.798 0.595 0.015 0.091 0.235
σ 0.8   0.333 0.817 0.064 0.321 0.330 0.068 0.764 0.593 0.013 0.158 0.200  |  0.326 0.579 0.104 0.361 0.266 0.228 0.798 0.585 0.015 0.090 0.237
D=0.5 MDL=3.0  D=0.7 MDL=3.0
σ 0.0   0.322 0.817 0.026 0.326 0.333 0.099 0.739 0.567 0.022 0.136 0.153  |  0.304 0.579 0.001 0.362 0.200 0.228 0.728 0.574 0.032 0.114 0.224
σ 0.2   0.322 0.817 0.029 0.326 0.320 0.127 0.702 0.583 0.017 0.150 0.150  |  0.307 0.579 0.000 0.364 0.208 0.228 0.735 0.573 0.029 0.113 0.236
σ 0.4   0.317 0.817 0.035 0.318 0.320 0.099 0.705 0.556 0.024 0.140 0.154  |  0.306 0.579 0.001 0.355 0.211 0.228 0.726 0.572 0.035 0.113 0.237
σ 0.6   0.328 0.817 0.026 0.342 0.328 0.118 0.759 0.563 0.020 0.150 0.153  |  0.307 0.579 0.001 0.363 0.219 0.228 0.729 0.575 0.029 0.113 0.233
σ 0.8   0.323 0.817 0.029 0.330 0.322 0.099 0.731 0.563 0.023 0.151 0.161  |  0.304 0.579 0.001 0.356 0.204 0.224 0.713 0.589 0.030 0.119 0.226

Missing data: results by distance (Euc, Cos), data (Raw, Semantic Patterns) and share of missing data (0%, 10%, 50%, 90%)

       Euc, Raw                |  Euc, Semantic Patterns  |  Cos, Raw                |  Cos, Semantic Patterns
Set    0%    10%   50%   90%   |  0%    10%   50%   90%   |  0%    10%   50%   90%   |  0%    10%   50%   90%
Categorical
BC     0.52  0.52  0.52  0.52  |  0.54  0.54  0.53  0.50  |  0.53  0.53  0.53  0.51  |  0.54  0.54  0.53  0.51
DE     0.68  0.66  0.55  0.32  |  0.81  0.80  0.38  0.22  |  0.66  0.66  0.67  0.36  |  0.81  0.80  0.74  0.46
KR     0.54  0.54  0.53  0.52  |  0.52  0.52  0.51  0.50  |  0.54  0.54  0.53  0.51  |  0.52  0.52  0.52  0.51
LY     0.63  0.68  0.63  0.30  |  0.63  0.59  0.64  0.48  |  0.59  0.53  0.51  0.32  |  0.61  0.58  0.56  0.35
MU     0.64  0.64  0.62  0.57  |  0.68  0.67  0.62  0.53  |  0.57  0.57  0.56  0.54  |  0.67  0.67  0.67  0.62
SO     0.65  0.63  0.53  0.22  |  0.75  0.70  0.09  0.08  |  0.58  0.56  0.50  0.18  |  0.73  0.72  0.63  0.28
SP     0.48  0.47  0.44  0.38  |  0.62  0.46  0.39  0.39  |  0.44  0.44  0.41  0.37  |  0.57  0.57  0.54  0.45
VO     0.80  0.79  0.76  0.67  |  0.78  0.78  0.68  0.51  |  0.62  0.63  0.67  0.62  |  0.79  0.79  0.78  0.72
ZO     0.83  0.81  0.72  0.31  |  0.86  0.85  0.64  0.24  |  0.80  0.79  0.71  0.31  |  0.86  0.84  0.76  0.41
Total  0.64  0.64  0.59  0.42  |  0.69  0.66  0.50  0.38  |  0.59  0.58  0.57  0.41  |  0.68  0.67  0.64  0.48
Mixed
AN     0.64  0.63  0.55  0.38  |  0.66  0.67  0.51  0.38  |  0.44  0.46  0.50  0.38  |  0.66  0.66  0.61  0.42
CO     0.59  0.59  0.56  0.51  |  0.59  0.58  0.52  0.50  |  0.50  0.50  0.51  0.51  |  0.62  0.62  0.60  0.57
CA     0.62  0.61  0.59  0.54  |  0.65  0.65  0.60  0.52  |  0.55  0.55  0.54  0.51  |  0.65  0.64  0.63  0.57
CG     0.52  0.52  0.52  0.50  |  0.52  0.53  0.54  0.53  |  0.51  0.51  0.52  0.51  |  0.52  0.52  0.52  0.52
HC     0.86  0.86  0.85  0.81  |  0.87  0.87  0.85  0.81  |  0.81  0.81  0.82  0.81  |  0.87  0.87  0.86  0.84
HH     0.87  0.86  0.85  0.82  |  0.87  0.87  0.83  0.80  |  0.84  0.84  0.83  0.81  |  0.88  0.88  0.87  0.83
HE     0.59  0.58  0.56  0.50  |  0.64  0.64  0.58  0.55  |  0.52  0.51  0.55  0.52  |  0.65  0.65  0.64  0.57
Total  0.67  0.67  0.64  0.58  |  0.69  0.69  0.63  0.58  |  0.60  0.60  0.61  0.58  |  0.69  0.69  0.68  0.62
Numerical
BW     0.86  0.86  0.76  0.68  |  0.91  0.91  0.84  0.69  |  0.62  0.61  0.59  0.50  |  0.90  0.89  0.88  0.84
DI     0.55  0.54  0.53  0.53  |  0.56  0.55  0.54  0.50  |  0.53  0.53  0.52  0.50  |  0.56  0.55  0.55  0.53
GL     0.49  0.45  0.31  0.30  |  0.53  0.52  0.42  0.31  |  0.51  0.51  0.48  0.29  |  0.53  0.52  0.48  0.34
HS     0.64  0.63  0.59  0.52  |  0.69  0.69  0.61  0.53  |  0.54  0.54  0.55  0.51  |  0.69  0.69  0.65  0.60
IO     0.51  0.52  0.55  0.54  |  0.61  0.61  0.56  0.46  |  0.46  0.46  0.47  0.51  |  0.61  0.61  0.60  0.57
IR     0.81  0.60  0.47  0.33  |  0.83  0.81  0.75  0.67  |  0.87  0.84  0.77  0.34  |  0.84  0.81  0.76  0.75
SE     0.61  0.53  0.21  0.15  |  0.57  0.57  0.43  0.17  |  0.39  0.40  0.44  0.27  |  0.57  0.57  0.55  0.41
SO     0.54  0.53  0.51  0.50  |  0.54  0.54  0.51  0.50  |  0.52  0.52  0.52  0.52  |  0.54  0.54  0.54  0.53
VE     0.35  0.33  0.29  0.26  |  0.37  0.37  0.35  0.28  |  0.36  0.36  0.36  0.31  |  0.37  0.37  0.36  0.33
VO     0.15  0.15  0.12  0.09  |  0.22  0.21  0.16  0.10  |  0.20  0.20  0.17  0.10  |  0.21  0.21  0.20  0.13
Total  0.55  0.51  0.43  0.39  |  0.58  0.58  0.52  0.42  |  0.50  0.50  0.49  0.38  |  0.58  0.58  0.56  0.50

Distance measures: RAW, Baseline and Semantic Patterns

       RAW                           |  Baseline            |  Semantic Patterns
Set    EUC (N)   EUC (NN)  COS (NN)  |  EUC (NN)  COS (NN)  |  EUC (NN)  COS (NN)
Categorical
BC     0.52      0.53      0.53      |  0.52      0.53      |  0.54      0.54
DE     0.68      0.68      0.66      |  0.67      0.67      |  0.81      0.81
KR     0.54      0.54      0.54      |  0.54      0.54      |  0.52      0.52
LY     0.63      0.63      0.59      |  0.60      0.57      |  0.63      0.61
MU     0.64      0.64      0.57      |  0.64      0.64      |  0.68      0.67
SO     0.65      0.65      0.58      |  0.69      0.70      |  0.75      0.73
SP     0.48      0.48      0.44      |  0.48      0.48      |  0.62      0.57
VO     0.80      0.80      0.62      |  0.80      0.80      |  0.78      0.79
ZO     0.84      0.83      0.80      |  0.85      0.84      |  0.86      0.86
Total  0.64      0.64      0.59      |  0.64      0.64      |  0.69      0.68
Mixed
AN     0.64      0.64      0.44      |  0.64      0.65      |  0.65      0.66
CO     0.59      0.59      0.50      |  0.59      0.60      |  0.58      0.62
CA     0.62      0.62      0.55      |  0.61      0.61      |  0.61      0.65
CG     0.52      0.52      0.51      |  0.52      0.52      |  0.52      0.52
HC     0.86      0.86      0.81      |  0.85      0.85      |  0.86      0.87
HH     0.87      0.87      0.84      |  0.86      0.86      |  0.86      0.88
HE     0.59      0.59      0.52      |  0.61      0.60      |  0.63      0.65
Total  0.67      0.67      0.60      |  0.67      0.67      |  0.67      0.69
Numerical
BW     0.86      0.86      0.62      |  0.74      0.74      |  0.89      0.90
DI     0.55      0.55      0.53      |  0.54      0.54      |  0.55      0.56
GL     0.49      0.49      0.51      |  0.51      0.51      |  0.53      0.53
HS     0.64      0.64      0.54      |  0.63      0.63      |  0.66      0.69
IO     0.51      0.51      0.46      |  0.55      0.55      |  0.63      0.61
IR     0.81      0.81      0.87      |  0.73      0.73      |  0.81      0.83
SE     0.61      0.61      0.39      |  0.54      0.54      |  0.57      0.57
SO     0.54      0.54      0.52      |  0.54      0.54      |  0.54      0.54
VE     0.35      0.35      0.36      |  0.37      0.37      |  0.36      0.37
VO     0.15      0.15      0.20      |  0.21      0.21      |  0.22      0.21
Total  0.55      0.55      0.50      |  0.54      0.54      |  0.58      0.58