Semantic Pattern Transformation

IKNOW 2013
Peter Teufl, Herbert Leitold, Reinhard Posch
peter.teufl@iaik.tugraz.at

Our Background
Topics:
- Mobile device security
- Cloud security
- Security consulting for public institutions (Austria)
- IT security research
- IT security lectures
- A-SIT, e-government

Why does he talk about Knowledge Discovery?
How does IT security relate to knowledge discovery?
- E-government, e-participation: document analysis, Twitter etc.
- Intrusion detection systems (network traffic analysis)
- Malware detection (network traffic, mobile phones)
- Mobile application analysis (metadata, market descriptions)
- Mobile application security (hot topic, BYOD, etc.)

What to expect?
- Motivation for the Semantic Pattern Transformation
- Basic concepts, techniques
- How does it work?
- Evaluation?
- Applications, results, current topics!

Environment
- Arbitrary features
- No a priori knowledge
- Heterogeneous domains
Extracting knowledge: supervised learning, anomaly detection, text analysis, Android market descriptions, semantic search, clustering, visualization
[Figure: overview diagram; further labels: terms, flexible histograms, new numbers, deployment domains]

Process... (Fayyad et al.)
- Different processing steps, from defining the goals to extracting the desired knowledge
- Machine learning algorithms are often used within KDD
- However, the complete machine learning process is quite similar to KDD
KDT: Domain-specific data set, Knowledge discovery goals, Target data set, Preprocessing, Data extraction, Data mining method, Data mining algorithm, Data mining
ML-KDT: Domain-specific data set, Machine learning goals, Instance extraction, Feature selection and construction, Instance selection, Machine learning algorithm, Preprocessing, Algorithm application
Phases: Knowledge extraction, Knowledge processing, Interpretation

Machine Learning: Adaptation Complexity?
Steps (each rated by its dependence on domain data and goals: high, medium, low): Domain-specific data set, Machine learning goals, Instance extraction, Feature selection and construction, Instance selection, Algorithm selection, Preprocessing, Algorithm application, Interpretation
- Assuming an arbitrary data set (e-participation, Android Market applications)
- Further assuming a knowledge discovery goal, e.g. unsupervised clustering
- Then: we need to adapt the steps listed above
- And: we need to adapt this setup when the data changes, even when the knowledge discovery goals remain the same!
- Android Market applications vs. text documents vs. network traffic vs. malware detection?

Towards a Semantic Representation
Finding a new representation...
- The new representation is called Semantic Patterns
- Key properties:
  - Still a vector representation (compatible with the old representation)
  - Not the feature values themselves, but their semantic relations are represented
  - All values have the same meaning and feature type (activation)
- Transformation from raw data into Semantic Patterns: Semantic Pattern Transformation

Semantic Pattern Transformation
The Semantic Pattern Transformation is arranged in five layers:
- Layer 1 - Feature extraction
- Layer 2 - Associative network - Node generation
- Layer 3 - Associative network - Link generation
- Layer 4 - Spreading activation (SA)
- Layer 5 - Analysis (machine learning, semantic search etc.)
Layer 5 analysis tasks: semantic relations, semantic development over time, unsupervised clustering, feature value relevance, anomaly detection, pattern similarity, supervised learning
[Figure: layer diagram, from a data set of relation instances (FROM, TO, TIME) via features (SF 1, SF 2, DF 1, DF 2) and node maps (SV, MV) to the patterns P 1 to P 4]

SPT: Layer 1 - Feature extraction

Country  Exports               Unemployment rate  Fertility rate
C1       coffee                20%                5
C2       cacao                 20%                5
C3       coffee, cacao         20%                5
C4       machinery             5%                 2
C5       chemicals             5%                 2
C6       chemicals, machinery  5%                 2
C7       chemicals, cacao      20%                missing data
C8       missing data          20%                5
C9       coffee, cacao         missing data       missing data

Extract the features and their values, and determine the feature type (categorical, distance-based):
- Categorical: Exports
- Distance-based: Unemployment rate, fertility rate
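A minimal sketch of what Layer 1 could produce for this toy table: each instance is flattened into (feature, value) pairs and missing data is simply skipped. The names DATA, FEATURE_TYPES and feature_values are hypothetical and chosen only for illustration; they are not part of the authors' implementation.

```python
# Toy data set from the slide. None stands for "missing data".
DATA = {
    "C1": {"exports": ["coffee"], "unemployment": 20, "fertility": 5},
    "C2": {"exports": ["cacao"], "unemployment": 20, "fertility": 5},
    "C3": {"exports": ["coffee", "cacao"], "unemployment": 20, "fertility": 5},
    "C4": {"exports": ["machinery"], "unemployment": 5, "fertility": 2},
    "C5": {"exports": ["chemicals"], "unemployment": 5, "fertility": 2},
    "C6": {"exports": ["chemicals", "machinery"], "unemployment": 5, "fertility": 2},
    "C7": {"exports": ["chemicals", "cacao"], "unemployment": 20, "fertility": None},
    "C8": {"exports": None, "unemployment": 20, "fertility": 5},
    "C9": {"exports": ["coffee", "cacao"], "unemployment": None, "fertility": None},
}

# Feature types as stated on the slide.
FEATURE_TYPES = {
    "exports": "categorical",
    "unemployment": "distance-based",
    "fertility": "distance-based",
}


def feature_values(instance):
    """Flatten one instance into (feature, value) pairs, skipping missing data."""
    pairs = []
    for feature, value in instance.items():
        if value is None:
            continue  # missing data contributes no feature value
        if FEATURE_TYPES[feature] == "categorical":
            pairs.extend((feature, v) for v in value)  # multi-valued feature
        else:
            pairs.append((feature, value))
    return pairs


print(feature_values(DATA["C7"]))
```

For C7 this yields the two export values and the unemployment rate; the missing fertility rate does not appear at all.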

SPT: Layer 2 - Node generation
- Distance-based feature values: map value ranges to single nodes
- Categorical feature values: one node for each value
Nodes of the associative network for the example data: coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2
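The two mapping rules can be sketched as follows. How the value ranges for distance-based features are determined is not stated on the slide, so the fixed bin edges below are purely an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right


def categorical_node(feature, value):
    """Categorical features: one node per distinct value."""
    return f"{feature}={value}"


def range_node(feature, value, edges):
    """Distance-based features: map a value to the node of its range.

    `edges` are ascending range boundaries; range i covers [edges[i], edges[i+1]).
    """
    i = bisect_right(edges, value) - 1
    i = max(0, min(i, len(edges) - 2))  # out-of-range values go to the first/last range
    return f"{feature}=[{edges[i]}, {edges[i + 1]})"


# Hypothetical bin edges, chosen only so that 5% and 20% land in different nodes.
UNEMPLOYMENT_EDGES = [0, 10, 100]

print(categorical_node("exports", "coffee"))
print(range_node("unemployment", 5, UNEMPLOYMENT_EDGES))
print(range_node("unemployment", 20, UNEMPLOYMENT_EDGES))
```

In the toy example each distance-based value (20%, 5%, 5, 2) ends up in its own node, which is why the network has one node per value.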

SPT: Layer 3 - Link generation
Example instances: C1 (coffee, 20%, 5) and C7 (chemicals, cacao, 20%)
[Figure: associative network with weighted links between the nodes coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2; link weight legend: 0.25, 0.5, 0.75, 1.00]
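One way to obtain link weights like those in the legend is to count how often two nodes occur in the same instance and to divide by the largest count. This weighting rule is an assumption (the slide does not spell it out), but for the toy data it happens to give exactly the four legend values 0.25, 0.5, 0.75 and 1.00. All names in the sketch are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Nodes per instance, taken from the toy table; missing data is left out.
INSTANCES = {
    "C1": ["coffee", "20%", "5"],
    "C2": ["cacao", "20%", "5"],
    "C3": ["coffee", "cacao", "20%", "5"],
    "C4": ["machinery", "5%", "2"],
    "C5": ["chemicals", "5%", "2"],
    "C6": ["chemicals", "machinery", "5%", "2"],
    "C7": ["chemicals", "cacao", "20%"],
    "C8": ["20%", "5"],
    "C9": ["coffee", "cacao"],
}


def build_links(instances):
    """Return {(node_a, node_b): weight}, weights scaled to the range (0, 1]."""
    counts = Counter()
    for nodes in instances.values():
        # Every pair of nodes that occurs in the same instance gets one count.
        for pair in combinations(sorted(set(nodes)), 2):
            counts[pair] += 1
    strongest = max(counts.values())
    return {pair: n / strongest for pair, n in counts.items()}


def weight(links, a, b):
    """Undirected lookup of a link weight (0.0 if the nodes never co-occur)."""
    return links.get(tuple(sorted((a, b))), 0.0)


links = build_links(INSTANCES)
print(weight(links, "20%", "5"))            # co-occur in C1, C2, C3, C8 -> 1.0
print(weight(links, "coffee", "20%"))       # co-occur in C1, C3 -> 0.5
print(weight(links, "chemicals", "cacao"))  # co-occur only in C7 -> 0.25
```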

SPT: Layer 4 - Spreading activation
Creating a Semantic Pattern, in this case for coffee and cacao:
- Set the activation value of the two nodes to 1.0
- Spread this activation value to neighboring nodes via the weighted links
[Figure: associative network with the nodes coffee and cacao activated at 1.0]

SPT: Layer 4 - Spreading activation
- Typically, one would create Semantic Patterns for all instances within the data set, e.g. a pattern for C1 by activating coffee, 20% and 5
- However, we can also create patterns for feature values, e.g. coffee

SPT: Layer 4 - Spreading activation
- After SA, each node in the network has an activation value
- By representing the nodes and their activation values as a vector, we gain a Semantic Pattern

Semantic Pattern for coffee and cacao:
Node        coffee  cacao  machinery  chemicals  20%   5%    5     2
Activation  1.15    1.15   0.00       0.08       0.38  0.00  0.30  0.00
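The pattern for coffee and cacao can be approximated with a very small sketch. It rests on three assumptions that are not stated on the slides: link weights are co-occurrence counts divided by the largest count, activation is spread for a single step only, and each active node passes on 0.3 times the link weight. With these choices the sketch agrees with the slide values up to rounding (for example 0.375 vs. 0.38 for the 20% node); the authors' actual rule and the role of parameters such as D may differ.

```python
from collections import Counter
from itertools import combinations

NODES = ["coffee", "cacao", "machinery", "chemicals", "20%", "5%", "5", "2"]

# Nodes per instance C1..C9 from the toy table; missing data is left out.
INSTANCES = [
    ["coffee", "20%", "5"],
    ["cacao", "20%", "5"],
    ["coffee", "cacao", "20%", "5"],
    ["machinery", "5%", "2"],
    ["chemicals", "5%", "2"],
    ["chemicals", "machinery", "5%", "2"],
    ["chemicals", "cacao", "20%"],
    ["20%", "5"],
    ["coffee", "cacao"],
]

SPREAD_FACTOR = 0.3  # assumption: fitted to the slide values, not taken from the source


def link_weights(instances):
    """Assumed weighting: co-occurrence count divided by the largest count."""
    counts = Counter()
    for nodes in instances:
        for pair in combinations(sorted(set(nodes)), 2):
            counts[pair] += 1
    strongest = max(counts.values())
    return {pair: n / strongest for pair, n in counts.items()}


WEIGHTS = link_weights(INSTANCES)


def semantic_pattern(active, weights=WEIGHTS, factor=SPREAD_FACTOR):
    """Activate the given nodes with 1.0 and spread one step along the links."""
    pattern = {node: (1.0 if node in active else 0.0) for node in NODES}
    for (a, b), w in weights.items():
        if a in active:
            pattern[b] += factor * w  # a passes activation on to its neighbor b
        if b in active:
            pattern[a] += factor * w  # links are undirected
    return pattern


pattern = semantic_pattern({"coffee", "cacao"})
for node in NODES:
    print(f"{node:10s} {pattern[node]:.3f}")
```

The same function gives an instance pattern when all feature values of that instance are activated, e.g. semantic_pattern({"coffee", "20%", "5"}) for C1.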

[Figures: unsorted Semantic Patterns for the feature values Export: Cacao, Export: Coffee and Fertility: 2, each plotted over the nodes coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2]
- Each feature value is represented by a semantic fingerprint
- Allows for an instant analysis of semantic relations to other feature values
- Sort, mean, variance, adding, subtracting

SPT: Layer 5 - Analysis
- Calculating the distance between two patterns (Euclidean distance, cosine similarity)
- For unsupervised clustering, semantic-aware search algorithms

Keyword search for coffee:
C1  coffee                20%           5
C3  coffee, cacao         20%           5
C9  coffee, cacao         missing data  missing data

Semantic-aware search for coffee:
C9  coffee, cacao         missing data  missing data
C1  coffee                20%           5
C3  coffee, cacao         20%           5
C2  cacao                 20%           5
C8  missing data          20%           5
C7  chemicals, cacao      20%           missing data
C5  chemicals             5%            2
C6  chemicals, machinery  5%            2
C4  machinery             5%            2
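A semantic-aware search can be sketched as a ranking of instance patterns by their cosine similarity to a query pattern. The instance patterns below are copied from the "Comparing the two models" slide. The query pattern for coffee is an assumption: it is what a single spreading step with factor 0.3 over co-occurrence weights would give, and it is not a value taken from the slides.

```python
import math

# Vector positions: coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2
PATTERNS = {
    "C1": [1.30, 0.53, 0.00, 0.08, 1.45, 0.00, 1.45, 0.00],
    "C2": [0.45, 1.38, 0.00, 0.15, 1.53, 0.00, 1.45, 0.00],
    "C3": [1.45, 1.53, 0.00, 0.15, 1.68, 0.00, 1.60, 0.00],
    "C4": [0.00, 0.00, 1.30, 0.38, 0.00, 1.38, 0.00, 1.38],
    "C5": [0.00, 0.08, 0.38, 1.30, 0.08, 1.38, 0.00, 1.38],
    "C6": [0.00, 0.08, 1.37, 1.37, 0.08, 1.53, 0.00, 1.53],
    "C7": [0.30, 1.30, 0.08, 1.15, 1.30, 0.15, 0.45, 0.15],
    "C8": [0.30, 0.38, 0.00, 0.08, 1.30, 0.00, 1.30, 0.00],
    "C9": [1.15, 1.15, 0.00, 0.08, 0.38, 0.00, 0.30, 0.00],
}

# Hypothetical query pattern for the feature value "coffee" (assumption, see lead-in).
QUERY_COFFEE = [1.00, 0.15, 0.00, 0.00, 0.15, 0.00, 0.15, 0.00]


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, patterns):
    """Instance names ordered from most to least similar to the query."""
    return sorted(patterns, key=lambda name: cosine(query, patterns[name]), reverse=True)


print(rank(QUERY_COFFEE, PATTERNS))
# ['C9', 'C1', 'C3', 'C2', 'C8', 'C7', 'C5', 'C6', 'C4']
```

Under these assumptions the ranking matches the order shown for the semantic-aware search: instances without the keyword coffee (C2, C8, C7) still rank above the unrelated ones because they share related feature values.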

SPT: Layer 5 - Analysis
- Machine learning: apply any machine learning algorithm to the Semantic Patterns
  - Unsupervised clustering
  - Supervised learning
  - Semantic-aware search
- Knowledge discovery: semantic relations, arbitrary procedures (mean, variance etc.)
  - Anomaly detection, feature relevance, simple operations (variance, mean, etc.)
- Visualization

Machine Learning: Benefits?
Steps (each rated by its dependence on domain data and goals: high, medium, low): Domain-specific data set, Machine learning goals, Instance extraction, Feature selection and construction, Instance selection, Algorithm selection, Preprocessing, Algorithm application, Interpretation
- Application in heterogeneous domains regardless of the nature of the data
- Except for Layer 1, we do not need any manual setup for the layers
- Regardless of the analyzed data, the Semantic Patterns always use the same model
- This means: regardless of the deployed knowledge discovery method, we can always use the same methods for knowledge extraction!

Comparing the two models

Semantic Patterns:
Country  Coffee  Cacao  Machinery  Chemicals  20%   5%    5     2
C1       1.30    0.53   0.00       0.08       1.45  0.00  1.45  0.00
C2       0.45    1.38   0.00       0.15       1.53  0.00  1.45  0.00
C3       1.45    1.53   0.00       0.15       1.68  0.00  1.60  0.00
C4       0.00    0.00   1.30       0.38       0.00  1.38  0.00  1.38
C5       0.00    0.08   0.38       1.30       0.08  1.38  0.00  1.38
C6       0.00    0.08   1.37       1.37       0.08  1.53  0.00  1.53
C7       0.30    1.30   0.08       1.15       1.30  0.15  0.45  0.15
C8       0.30    0.38   0.00       0.08       1.30  0.00  1.30  0.00
C9       1.15    1.15   0.00       0.08       0.38  0.00  0.30  0.00

Value-centric feature vectors:
Country  Coffee  Cacao  Machinery  Chemicals  Unemployment rate  Fertility rate
C1       1       0      0          0          20%                5
C2       0       1      0          0          20%                5
C3       1       1      0          0          20%                5
C4       0       0      1          0          5%                 2
C5       0       0      0          1          5%                 2
C6       0       0      1          1          5%                 2
C7       0       1      0          1          20%                missing data
C8       missing data (all export columns)    20%                5
C9       1       1      0          0          missing data       missing data

[Figures: unsorted Semantic Patterns "Mean pattern: C1, C2, C3" and "Mean pattern: C4, C5, C6", plotted over the nodes coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2]

Same model: an Android application, a country or a document... the activation values always have the same meaning
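Because all activation values have the same meaning, simple operations such as the mean can be applied directly to the patterns. The sketch below computes the two mean patterns named in the figures from the Semantic Pattern values on this slide; the function and variable names are hypothetical.

```python
from statistics import mean

NODES = ["coffee", "cacao", "machinery", "chemicals", "20%", "5%", "5", "2"]

# Semantic Patterns for C1..C6, copied from the slide.
PATTERNS = {
    "C1": [1.30, 0.53, 0.00, 0.08, 1.45, 0.00, 1.45, 0.00],
    "C2": [0.45, 1.38, 0.00, 0.15, 1.53, 0.00, 1.45, 0.00],
    "C3": [1.45, 1.53, 0.00, 0.15, 1.68, 0.00, 1.60, 0.00],
    "C4": [0.00, 0.00, 1.30, 0.38, 0.00, 1.38, 0.00, 1.38],
    "C5": [0.00, 0.08, 0.38, 1.30, 0.08, 1.38, 0.00, 1.38],
    "C6": [0.00, 0.08, 1.37, 1.37, 0.08, 1.53, 0.00, 1.53],
}


def mean_pattern(names):
    """Element-wise mean of the patterns of the given instances."""
    return [mean(PATTERNS[name][i] for name in names) for i in range(len(NODES))]


for group in (["C1", "C2", "C3"], ["C4", "C5", "C6"]):
    values = mean_pattern(group)
    print(group, {node: round(v, 2) for node, v in zip(NODES, values)})
```

The first mean pattern is dominated by coffee, cacao, 20% and 5, the second by machinery, chemicals, 5% and 2, which is the separation the two groups of countries suggest.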

Evaluation
- 26 data sets from the UCI machine learning repository
- Supervised: SVM
- Unsupervised: EM and k-means
- Application to raw data and to Semantic Patterns
SP-Parameters: D=0.5, Comb=E, Norm=L, MDL=1.5, σ = 0.2

Data set       Label  Inst  DF  SF  Classes  SVM (N)  SVM (NN)  SVM (P)  KM (N)  KM (NN)  KM (P)  EM (NN)  EM (P)
Categorical
Breast Cancer  BC     286   -   9   2        0.03     0.04      0.04     0.01    0.01     0.06    0.00     0.08
Dermatology    DE     366   1   33  6        0.93     0.92      0.95     0.58    0.09     0.86    0.87     0.87
KR vs. KP      KR     3196  -   36  2        0.75     0.75      0.72     0.00    0.01     0.00    0.04     0.00
Lymph          LY     148   -   18  4        0.53     0.51      0.48     0.13    0.18     0.25    0.26     0.27
Mushroom       MU     8124  -   22  2        1.00     1.00      1.00     0.48    0.47     0.45    0.61     0.59
Soybean        SO     683   -   35  19       0.92     0.92      0.93     0.59    0.62     0.73    0.79     0.79
Splice         SP     3190  -   60  3        0.71     0.72      0.80     0.03    0.03     0.44    0.41     0.31
Vote           VO     435   -   16  2        0.76     0.74      0.67     0.47    0.48     0.47    0.49     0.45
Zoo            ZO     101   -   17  7        0.94     0.94      0.97     0.78    0.78     0.82    0.82     0.85
Total                                        0.73     0.73      0.73     0.34    0.30     0.45    0.48     0.47
Mixed
Anneal         AN     898   6   32  6        0.86     0.86      0.92     0.23    0.03     0.30    0.31     0.32
Colic          CO     368   7   15  2        0.31     0.32      0.31     0.13    0.03     0.05    0.10     0.12
Credit-A       CA     689   6   9   2        0.41     0.41      0.39     0.16    0.02     0.25    0.17     0.21
Credit-G       CG     1000  7   13  2        0.11     0.10      0.12     0.01    0.01     0.00    0.01     0.02
Heart-C        HC     303   6   7   5        0.36     0.36      0.29     0.24    0.01     0.36    0.31     0.28
Heart-H        HH     294   6   7   5        0.32     0.31      0.33     0.27    0.01     0.32    0.28     0.25
Hepatitis      HE     155   5   14  2        0.25     0.28      0.21     0.13    0.00     0.21    0.22     0.24
Total                                        0.37     0.38      0.37     0.17    0.02     0.21    0.20     0.20
Numerical
Breast-w       BW     699   9   -   2        0.78     0.78      0.77     0.73    0.74     0.82    0.72     0.58
Diabetes       DI     768   8   -   2        0.18     0.18      0.15     0.05    0.03     0.10    0.10     0.08
Glass          GL     214   9   -   7        0.30     0.30      0.50     0.34    0.39     0.33    0.37     0.36
Heart-Statlog  HS     270   13  -   2        0.36     0.36      0.37     0.25    0.02     0.39    0.29     0.27
Ionosphere     IO     351   34  -   2        0.48     0.48      0.50     0.12    0.12     0.16    0.25     0.25
Iris           IR     150   4   -   3        0.87     0.87      0.87     0.71    0.71     0.75    0.81     0.78
Segment        SE     2310  19  -   7        0.88     0.88      0.90     0.61    0.53     0.59    0.62     0.60
Sonar          SO     208   60  -   2        0.23     0.23      0.23     0.01    0.01     0.02    0.01     0.01
Vehicle        VE     846   18  -   4        0.51     0.51      0.48     0.11    0.19     0.19    0.10     0.19
Vowel          VO     990   10  3   11       0.63     0.63      0.76     0.06    0.34     0.23    0.19     0.25
Total                                        0.52     0.52      0.55     0.30    0.31     0.36    0.35     0.34

Does it work?
Applications described in several publications:
- E-participation (Egyptian revolution, Fukushima, Mitmachen): text documents
- Intrusion detection: event correlation
- RDF data analysis (semantic web)
- WiFi privacy (analyzing captured emails)
- Android Market application analysis

Current Project
Android application security:
- Container applications for BYOD (require encryption, secure communication, key derivation functions, root checks etc.)
- Manual analysis is cumbersome
Semantic Patterns:
- Extract Dalvik VM code, features (opcodes, methods, local variables etc.)
- Apply the Semantic Patterns technique
- Clustering, supervised learning, anomaly detection etc.

Current Project
- Also works directly on the phone...
- Detecting SMS catchers/sniffers
- More fine-grained detection: asymmetric cryptography, symmetric cryptography

Outlook
- Publish the Java API... basically a converter from arbitrary feature vectors to Semantic Patterns (e.g. in/out in ARFF format)
- Deep learning...

Thx!

Backup: K-Means and EM on the categorical data sets, raw data vs. Semantic Patterns

K-Means
Par                      Total  BC     DE     KR     LY     MU     SO     SP     VO     ZO
Raw Data N               0.341  0.012  0.584  0.004  0.131  0.475  0.587  0.031  0.467  0.782
Raw Data NN              0.296  0.007  0.094  0.010  0.176  0.472  0.616  0.030  0.476  0.783
Semantic Patterns D 0.0  0.443  0.025  0.849  0.003  0.199  0.413  0.728  0.465  0.493  0.814
Comb=E Norm=L D 0.1      0.442  0.029  0.811  0.004  0.245  0.545  0.726  0.387  0.476  0.759
Comb=E Norm=L D 0.3      0.447  0.068  0.846  0.004  0.241  0.482  0.724  0.424  0.476  0.758
Comb=E Norm=L D 0.5      0.452  0.061  0.856  0.000  0.245  0.448  0.733  0.437  0.467  0.820
Comb=E Norm=L D 0.7      0.422  0.069  0.826  0.000  0.209  0.275  0.728  0.419  0.463  0.804
Comb=S Norm=L D 0.1      0.441  0.056  0.853  0.000  0.244  0.453  0.733  0.399  0.476  0.759
Comb=S Norm=L D 0.3      0.434  0.075  0.820  0.000  0.228  0.411  0.718  0.431  0.472  0.750
Comb=S Norm=L D 0.5      0.439  0.060  0.792  0.000  0.235  0.416  0.741  0.405  0.463  0.836
Comb=S Norm=L D 0.7      0.422  0.067  0.798  0.000  0.224  0.364  0.726  0.376  0.462  0.782
Comb=E Norm=S D 0.1      0.418  0.029  0.790  0.006  0.236  0.311  0.705  0.449  0.496  0.742
Comb=E Norm=S D 0.3      0.452  0.030  0.860  0.001  0.231  0.470  0.715  0.475  0.491  0.799
Comb=E Norm=S D 0.5      0.448  0.048  0.799  0.009  0.215  0.539  0.725  0.450  0.493  0.758
Comb=E Norm=S D 0.7      0.448  0.033  0.850  0.000  0.230  0.495  0.712  0.435  0.493  0.787
Comb=S Norm=S D 0.1      0.439  0.029  0.806  0.009  0.250  0.435  0.727  0.439  0.494  0.760
Comb=S Norm=S D 0.3      0.420  0.015  0.775  0.004  0.210  0.436  0.717  0.409  0.443  0.774
Comb=S Norm=S D 0.5      0.429  0.030  0.789  0.009  0.226  0.410  0.716  0.448  0.485  0.749
Comb=S Norm=S D 0.7      0.438  0.040  0.839  0.006  0.246  0.418  0.726  0.409  0.480  0.775

EM
Par                      Total  BC     DE     KR     LY     MU     SO     SP     VO     ZO
Raw Data N               Not available
Raw Data NN              0.477  0.002  0.871  0.036  0.258  0.610  0.789  0.410  0.494  0.822
Semantic Patterns D 0.0  0.449  0.004  0.767  0.001  0.222  0.590  0.740  0.423  0.489  0.801
Comb=E Norm=L D 0.1      0.441  0.074  0.885  0.000  0.271  0.615  0.786  0.004  0.505  0.826
Comb=E Norm=L D 0.3      0.460  0.079  0.875  0.001  0.258  0.592  0.788  0.250  0.449  0.846
Comb=E Norm=L D 0.5      0.468  0.079  0.874  0.001  0.265  0.592  0.789  0.306  0.452  0.850
Comb=E Norm=L D 0.7      0.465  0.079  0.874  0.001  0.252  0.579  0.799  0.312  0.445  0.847
Comb=S Norm=L D 0.1      0.433  0.079  0.872  0.001  0.270  0.572  0.794  0.001  0.476  0.829
Comb=S Norm=L D 0.3      0.466  0.079  0.881  0.001  0.280  0.592  0.802  0.298  0.437  0.828
Comb=S Norm=L D 0.5      0.466  0.079  0.871  0.001  0.251  0.581  0.805  0.310  0.445  0.848
Comb=S Norm=L D 0.7      0.462  0.087  0.875  0.001  0.254  0.580  0.776  0.292  0.445  0.845
Comb=E Norm=S D 0.1      0.472  0.002  0.893  0.000  0.263  0.571  0.767  0.432  0.495  0.820
Comb=E Norm=S D 0.3      0.476  0.002  0.914  0.000  0.261  0.586  0.775  0.427  0.495  0.823
Comb=E Norm=S D 0.5      0.472  0.002  0.897  0.000  0.267  0.584  0.758  0.427  0.484  0.829
Comb=E Norm=S D 0.7      0.473  0.002  0.903  0.000  0.250  0.586  0.773  0.427  0.484  0.829
Comb=S Norm=S D 0.1      0.475  0.002  0.903  0.000  0.254  0.576  0.764  0.429  0.495  0.852
Comb=S Norm=S D 0.3      0.474  0.002  0.901  0.000  0.271  0.584  0.763  0.427  0.484  0.837
Comb=S Norm=S D 0.5      0.476  0.002  0.904  0.000  0.255  0.586  0.767  0.427  0.484  0.854
Comb=S Norm=S D 0.7      0.480  0.002  0.910  0.000  0.269  0.615  0.771  0.431  0.494  0.825

Backup: K-Means and EM on the mixed data sets, raw data vs. Semantic Patterns
Each Semantic Patterns block keeps its pair of parameter labels as given on the slide; every σ setting has one K-Means row and one EM row.

Par     Alg      Total  AN     CO     CA     CG     HC     HH     HE
Raw Data
N       K-Means  0.165  0.226  0.129  0.155  0.009  0.237  0.269  0.131
N       EM       Not available
NN      K-Means  0.017  0.028  0.030  0.016  0.012  0.014  0.012  0.004
NN      EM       0.201  0.312  0.103  0.171  0.013  0.309  0.278  0.223
Semantic Patterns: D=0.0 MDL=2.0 / D=0.0 MDL=1.0
σ 0.0   K-Means  0.193  0.253  0.135  0.113  0.007  0.356  0.293  0.195
σ 0.0   EM       0.190  0.291  0.098  0.227  0.003  0.228  0.258  0.227
σ 0.2   K-Means  0.198  0.271  0.147  0.116  0.007  0.356  0.301  0.189
σ 0.2   EM       0.182  0.280  0.098  0.162  0.003  0.244  0.258  0.231
σ 0.4   K-Means  0.204  0.240  0.157  0.145  0.009  0.356  0.327  0.194
σ 0.4   EM       0.184  0.226  0.099  0.229  0.004  0.245  0.258  0.227
σ 0.6   K-Means  0.194  0.221  0.154  0.145  0.008  0.359  0.275  0.196
σ 0.6   EM       0.194  0.291  0.097  0.240  0.003  0.217  0.281  0.229
σ 0.8   K-Means  0.200  0.258  0.152  0.098  0.007  0.358  0.327  0.197
σ 0.8   EM       0.192  0.293  0.097  0.232  0.004  0.228  0.258  0.230
Semantic Patterns: D=0.5 MDL=1.0 / D=0.7 MDL=1.0
σ 0.0   K-Means  0.211  0.320  0.042  0.262  0.001  0.325  0.311  0.215
σ 0.0   EM       0.210  0.327  0.127  0.218  0.021  0.237  0.311  0.229
σ 0.2   K-Means  0.201  0.257  0.032  0.262  0.001  0.323  0.311  0.222
σ 0.2   EM       0.210  0.322  0.126  0.218  0.021  0.237  0.320  0.229
σ 0.4   K-Means  0.208  0.299  0.035  0.261  0.001  0.326  0.311  0.220
σ 0.4   EM       0.211  0.322  0.127  0.218  0.021  0.237  0.320  0.229
σ 0.6   K-Means  0.204  0.281  0.029  0.262  0.001  0.325  0.311  0.220
σ 0.6   EM       0.211  0.321  0.128  0.218  0.021  0.237  0.320  0.229
σ 0.8   K-Means  0.207  0.292  0.041  0.263  0.001  0.326  0.311  0.216
σ 0.8   EM       0.209  0.310  0.127  0.218  0.021  0.237  0.320  0.229
Semantic Patterns: D=0.5 MDL=1.5 / D=0.7 MDL=1.5
σ 0.0   K-Means  0.216  0.317  0.065  0.249  0.001  0.357  0.320  0.203
σ 0.0   EM       0.204  0.322  0.123  0.212  0.016  0.275  0.247  0.233
σ 0.2   K-Means  0.211  0.295  0.052  0.247  0.000  0.355  0.320  0.209
σ 0.2   EM       0.204  0.322  0.123  0.212  0.016  0.275  0.247  0.236
σ 0.4   K-Means  0.216  0.314  0.074  0.248  0.001  0.357  0.320  0.198
σ 0.4   EM       0.205  0.323  0.123  0.206  0.016  0.275  0.252  0.237
σ 0.6   K-Means  0.212  0.308  0.046  0.249  0.001  0.356  0.320  0.209
σ 0.6   EM       0.204  0.320  0.125  0.208  0.016  0.275  0.246  0.236
σ 0.8   K-Means  0.211  0.293  0.063  0.248  0.000  0.354  0.320  0.201
σ 0.8   EM       0.204  0.323  0.125  0.208  0.016  0.275  0.249  0.232
Semantic Patterns: D=0.5 MDL=2.0 / D=0.7 MDL=2.0
σ 0.0   K-Means  0.217  0.304  0.048  0.244  0.000  0.390  0.311  0.219
σ 0.0   EM       0.206  0.319  0.117  0.229  0.010  0.255  0.277  0.233
σ 0.2   K-Means  0.218  0.313  0.062  0.244  0.000  0.388  0.311  0.208
σ 0.2   EM       0.207  0.317  0.126  0.239  0.010  0.255  0.268  0.233
σ 0.4   K-Means  0.221  0.309  0.084  0.243  0.000  0.389  0.311  0.209
σ 0.4   EM       0.205  0.319  0.127  0.224  0.010  0.255  0.268  0.233
σ 0.6   K-Means  0.213  0.285  0.057  0.243  0.000  0.387  0.311  0.210
σ 0.6   EM       0.206  0.307  0.127  0.240  0.010  0.255  0.268  0.233
σ 0.8   K-Means  0.211  0.295  0.036  0.244  0.000  0.387  0.311  0.205
σ 0.8   EM       0.204  0.305  0.127  0.240  0.010  0.255  0.259  0.233
Semantic Patterns: D=0.5 MDL=3.0 / D=0.7 MDL=3.0
σ 0.0   K-Means  0.203  0.294  0.030  0.248  0.000  0.335  0.315  0.196
σ 0.0   EM       0.192  0.323  0.108  0.248  0.009  0.201  0.250  0.205
σ 0.2   K-Means  0.208  0.306  0.059  0.248  0.000  0.334  0.315  0.193
σ 0.2   EM       0.190  0.321  0.107  0.237  0.009  0.201  0.251  0.205
σ 0.4   K-Means  0.205  0.310  0.050  0.248  0.000  0.334  0.315  0.178
σ 0.4   EM       0.193  0.322  0.122  0.243  0.009  0.201  0.249  0.205
σ 0.6   K-Means  0.207  0.300  0.063  0.248  0.001  0.333  0.313  0.192
σ 0.6   EM       0.192  0.321  0.122  0.243  0.010  0.201  0.245  0.205
σ 0.8   K-Means  0.210  0.330  0.050  0.246  0.001  0.336  0.315  0.191
σ 0.8   EM       0.192  0.323  0.122  0.243  0.009  0.201  0.240  0.205

Backup: K-Means and EM on the numerical data sets, raw data vs. Semantic Patterns
Each Semantic Patterns block keeps its pair of parameter labels as given on the slide; every σ setting has one K-Means row and one EM row.

Par     Alg      Total  BW     DI     GL     HS     IO     IR     SE     SO     VE     VO
Raw Data
N       K-Means  0.299  0.734  0.052  0.335  0.254  0.121  0.708  0.608  0.006  0.113  0.057
N       EM       Not available
NN      K-Means  0.307  0.735  0.030  0.388  0.019  0.123  0.705  0.529  0.008  0.188  0.342
NN      EM       0.346  0.718  0.103  0.370  0.289  0.254  0.806  0.621  0.005  0.103  0.194
Semantic Patterns: D=0.0 MDL=1.5 / D=0.0 MDL=1.0
σ 0.0   K-Means  0.315  0.724  0.039  0.329  0.309  0.045  0.717  0.582  0.026  0.198  0.183
σ 0.0   EM       0.317  0.777  0.006  0.312  0.239  0.218  0.651  0.592  0.016  0.174  0.186
σ 0.2   K-Means  0.323  0.724  0.025  0.334  0.344  0.071  0.730  0.590  0.012  0.198  0.196
σ 0.2   EM       0.327  0.752  0.001  0.318  0.240  0.218  0.766  0.598  0.016  0.167  0.197
σ 0.4   K-Means  0.318  0.719  0.026  0.285  0.316  0.051  0.769  0.600  0.008  0.199  0.203
σ 0.4   EM       0.323  0.727  0.011  0.287  0.229  0.217  0.749  0.600  0.018  0.176  0.218
σ 0.6   K-Means  0.317  0.722  0.025  0.298  0.357  0.040  0.712  0.602  0.013  0.199  0.201
σ 0.6   EM       0.317  0.732  0.009  0.316  0.232  0.221  0.637  0.606  0.025  0.175  0.214
σ 0.8   K-Means  0.299  0.646  0.015  0.294  0.328  0.026  0.686  0.581  0.014  0.198  0.200
σ 0.8   EM       0.325  0.703  0.006  0.305  0.233  0.216  0.796  0.594  0.019  0.181  0.195
Semantic Patterns: D=0.5 MDL=1.0 / D=0.7 MDL=1.0
σ 0.0   K-Means  0.333  0.817  0.072  0.293  0.338  0.181  0.611  0.614  0.009  0.164  0.234
σ 0.0   EM       0.302  0.579  0.082  0.332  0.285  0.184  0.633  0.634  0.006  0.099  0.183
σ 0.2   K-Means  0.333  0.817  0.076  0.278  0.340  0.181  0.621  0.621  0.009  0.151  0.237
σ 0.2   EM       0.300  0.579  0.082  0.307  0.285  0.184  0.636  0.632  0.006  0.117  0.176
σ 0.4   K-Means  0.326  0.817  0.068  0.286  0.335  0.181  0.587  0.604  0.009  0.149  0.228
σ 0.4   EM       0.301  0.579  0.086  0.310  0.285  0.184  0.639  0.643  0.006  0.095  0.183
σ 0.6   K-Means  0.327  0.817  0.072  0.269  0.337  0.181  0.604  0.580  0.009  0.166  0.232
σ 0.6   EM       0.301  0.579  0.076  0.319  0.285  0.184  0.639  0.632  0.006  0.109  0.185
σ 0.8   K-Means  0.334  0.817  0.071  0.303  0.336  0.181  0.610  0.605  0.011  0.163  0.244
σ 0.8   EM       0.300  0.579  0.079  0.311  0.285  0.184  0.633  0.633  0.006  0.109  0.183
Semantic Patterns: D=0.5 MDL=1.5 / D=0.7 MDL=1.5
σ 0.0   K-Means  0.352  0.817  0.099  0.298  0.382  0.143  0.751  0.601  0.018  0.193  0.218
σ 0.0   EM       0.339  0.579  0.086  0.348  0.324  0.242  0.761  0.596  0.013  0.187  0.252
σ 0.2   K-Means  0.358  0.817  0.100  0.330  0.385  0.163  0.751  0.588  0.015  0.194  0.232
σ 0.2   EM       0.339  0.579  0.086  0.356  0.324  0.242  0.761  0.595  0.012  0.192  0.239
σ 0.4   K-Means  0.352  0.817  0.096  0.315  0.387  0.143  0.738  0.576  0.019  0.193  0.231
σ 0.4   EM       0.340  0.579  0.092  0.348  0.324  0.242  0.761  0.603  0.012  0.194  0.241
σ 0.6   K-Means  0.348  0.817  0.103  0.288  0.383  0.158  0.716  0.579  0.015  0.194  0.226
σ 0.6   EM       0.339  0.579  0.094  0.355  0.324  0.242  0.761  0.602  0.012  0.181  0.240
σ 0.8   K-Means  0.356  0.817  0.098  0.296  0.378  0.166  0.776  0.604  0.012  0.190  0.225
σ 0.8   EM       0.338  0.579  0.107  0.355  0.324  0.242  0.752  0.597  0.012  0.177  0.236
Semantic Patterns: D=0.5 MDL=2.0 / D=0.7 MDL=2.0
σ 0.0   K-Means  0.329  0.817  0.054  0.339  0.330  0.064  0.752  0.563  0.017  0.151  0.199
σ 0.0   EM       0.323  0.579  0.105  0.347  0.266  0.228  0.784  0.585  0.015  0.092  0.227
σ 0.2   K-Means  0.328  0.817  0.052  0.320  0.330  0.064  0.753  0.585  0.017  0.144  0.196
σ 0.2   EM       0.325  0.579  0.098  0.359  0.266  0.228  0.784  0.584  0.015  0.098  0.238
σ 0.4   K-Means  0.331  0.817  0.055  0.313  0.330  0.109  0.767  0.562  0.012  0.149  0.194
σ 0.4   EM       0.323  0.579  0.105  0.358  0.266  0.228  0.784  0.576  0.015  0.090  0.230
σ 0.6   K-Means  0.330  0.817  0.059  0.335  0.328  0.073  0.765  0.560  0.019  0.148  0.199
σ 0.6   EM       0.326  0.579  0.099  0.351  0.266  0.228  0.798  0.595  0.015  0.091  0.235
σ 0.8   K-Means  0.333  0.817  0.064  0.321  0.330  0.068  0.764  0.593  0.013  0.158  0.200
σ 0.8   EM       0.326  0.579  0.104  0.361  0.266  0.228  0.798  0.585  0.015  0.090  0.237
Semantic Patterns: D=0.5 MDL=3.0 / D=0.7 MDL=3.0
σ 0.0   K-Means  0.322  0.817  0.026  0.326  0.333  0.099  0.739  0.567  0.022  0.136  0.153
σ 0.0   EM       0.304  0.579  0.001  0.362  0.200  0.228  0.728  0.574  0.032  0.114  0.224
σ 0.2   K-Means  0.322  0.817  0.029  0.326  0.320  0.127  0.702  0.583  0.017  0.150  0.150
σ 0.2   EM       0.307  0.579  0.000  0.364  0.208  0.228  0.735  0.573  0.029  0.113  0.236
σ 0.4   K-Means  0.317  0.817  0.035  0.318  0.320  0.099  0.705  0.556  0.024  0.140  0.154
σ 0.4   EM       0.306  0.579  0.001  0.355  0.211  0.228  0.726  0.572  0.035  0.113  0.237
σ 0.6   K-Means  0.328  0.817  0.026  0.342  0.328  0.118  0.759  0.563  0.020  0.150  0.153
σ 0.6   EM       0.307  0.579  0.001  0.363  0.219  0.228  0.729  0.575  0.029  0.113  0.233
σ 0.8   K-Means  0.323  0.817  0.029  0.330  0.322  0.099  0.731  0.563  0.023  0.151  0.161
σ 0.8   EM       0.304  0.579  0.001  0.356  0.204  0.224  0.713  0.589  0.030  0.119  0.226

Backup: Euclidean (Euc) and cosine (Cos) distance on raw data and on Semantic Patterns (SP), with 0%, 10%, 50% and 90% missing values

       Euc Raw                 Euc SP                  Cos Raw                 Cos SP
Data   0%    10%   50%   90%   0%    10%   50%   90%   0%    10%   50%   90%   0%    10%   50%   90%
Categorical
BC     0.52  0.52  0.52  0.52  0.54  0.54  0.53  0.50  0.53  0.53  0.53  0.51  0.54  0.54  0.53  0.51
DE     0.68  0.66  0.55  0.32  0.81  0.80  0.38  0.22  0.66  0.66  0.67  0.36  0.81  0.80  0.74  0.46
KR     0.54  0.54  0.53  0.52  0.52  0.52  0.51  0.50  0.54  0.54  0.53  0.51  0.52  0.52  0.52  0.51
LY     0.63  0.68  0.63  0.30  0.63  0.59  0.64  0.48  0.59  0.53  0.51  0.32  0.61  0.58  0.56  0.35
MU     0.64  0.64  0.62  0.57  0.68  0.67  0.62  0.53  0.57  0.57  0.56  0.54  0.67  0.67  0.67  0.62
SO     0.65  0.63  0.53  0.22  0.75  0.70  0.09  0.08  0.58  0.56  0.50  0.18  0.73  0.72  0.63  0.28
SP     0.48  0.47  0.44  0.38  0.62  0.46  0.39  0.39  0.44  0.44  0.41  0.37  0.57  0.57  0.54  0.45
VO     0.80  0.79  0.76  0.67  0.78  0.78  0.68  0.51  0.62  0.63  0.67  0.62  0.79  0.79  0.78  0.72
ZO     0.83  0.81  0.72  0.31  0.86  0.85  0.64  0.24  0.80  0.79  0.71  0.31  0.86  0.84  0.76  0.41
Total  0.64  0.64  0.59  0.42  0.69  0.66  0.50  0.38  0.59  0.58  0.57  0.41  0.68  0.67  0.64  0.48
Mixed
AN     0.64  0.63  0.55  0.38  0.66  0.67  0.51  0.38  0.44  0.46  0.50  0.38  0.66  0.66  0.61  0.42
CO     0.59  0.59  0.56  0.51  0.59  0.58  0.52  0.50  0.50  0.50  0.51  0.51  0.62  0.62  0.60  0.57
CA     0.62  0.61  0.59  0.54  0.65  0.65  0.60  0.52  0.55  0.55  0.54  0.51  0.65  0.64  0.63  0.57
CG     0.52  0.52  0.52  0.50  0.52  0.53  0.54  0.53  0.51  0.51  0.52  0.51  0.52  0.52  0.52  0.52
HC     0.86  0.86  0.85  0.81  0.87  0.87  0.85  0.81  0.81  0.81  0.82  0.81  0.87  0.87  0.86  0.84
HH     0.87  0.86  0.85  0.82  0.87  0.87  0.83  0.80  0.84  0.84  0.83  0.81  0.88  0.88  0.87  0.83
HE     0.59  0.58  0.56  0.50  0.64  0.64  0.58  0.55  0.52  0.51  0.55  0.52  0.65  0.65  0.64  0.57
Total  0.67  0.67  0.64  0.58  0.69  0.69  0.63  0.58  0.60  0.60  0.61  0.58  0.69  0.69  0.68  0.62
Numerical
BW     0.86  0.86  0.76  0.68  0.91  0.91  0.84  0.69  0.62  0.61  0.59  0.50  0.90  0.89  0.88  0.84
DI     0.55  0.54  0.53  0.53  0.56  0.55  0.54  0.50  0.53  0.53  0.52  0.50  0.56  0.55  0.55  0.53
GL     0.49  0.45  0.31  0.30  0.53  0.52  0.42  0.31  0.51  0.51  0.48  0.29  0.53  0.52  0.48  0.34
HS     0.64  0.63  0.59  0.52  0.69  0.69  0.61  0.53  0.54  0.54  0.55  0.51  0.69  0.69  0.65  0.60
IO     0.51  0.52  0.55  0.54  0.61  0.61  0.56  0.46  0.46  0.46  0.47  0.51  0.61  0.61  0.60  0.57
IR     0.81  0.60  0.47  0.33  0.83  0.81  0.75  0.67  0.87  0.84  0.77  0.34  0.84  0.81  0.76  0.75
SE     0.61  0.53  0.21  0.15  0.57  0.57  0.43  0.17  0.39  0.40  0.44  0.27  0.57  0.57  0.55  0.41
SO     0.54  0.53  0.51  0.50  0.54  0.54  0.51  0.50  0.52  0.52  0.52  0.52  0.54  0.54  0.54  0.53
VE     0.35  0.33  0.29  0.26  0.37  0.37  0.35  0.28  0.36  0.36  0.36  0.31  0.37  0.37  0.36  0.33
VO     0.15  0.15  0.12  0.09  0.22  0.21  0.16  0.10  0.20  0.20  0.17  0.10  0.21  0.21  0.20  0.13
Total  0.55  0.51  0.43  0.39  0.58  0.58  0.52  0.42  0.50  0.50  0.49  0.38  0.58  0.58  0.56  0.50

Backup: Euclidean (EUC) and cosine (COS) distance on raw data, baseline and Semantic Patterns

          RAW                           Baseline             Semantic Patterns
Data set  EUC (N)  EUC (NN)  COS (NN)   EUC (NN)  COS (NN)   EUC (NN)  COS (NN)
Categorical
BC        0.52     0.53      0.53       0.52      0.53       0.54      0.54
DE        0.68     0.68      0.66       0.67      0.67       0.81      0.81
KR        0.54     0.54      0.54       0.54      0.54       0.52      0.52
LY        0.63     0.63      0.59       0.60      0.57       0.63      0.61
MU        0.64     0.64      0.57       0.64      0.64       0.68      0.67
SO        0.65     0.65      0.58       0.69      0.70       0.75      0.73
SP        0.48     0.48      0.44       0.48      0.48       0.62      0.57
VO        0.80     0.80      0.62       0.80      0.80       0.78      0.79
ZO        0.84     0.83      0.80       0.85      0.84       0.86      0.86
Total     0.64     0.64      0.59       0.64      0.64       0.69      0.68
Mixed
AN        0.64     0.64      0.44       0.64      0.65       0.65      0.66
CO        0.59     0.59      0.50       0.59      0.60       0.58      0.62
CA        0.62     0.62      0.55       0.61      0.61       0.61      0.65
CG        0.52     0.52      0.51       0.52      0.52       0.52      0.52
HC        0.86     0.86      0.81       0.85      0.85       0.86      0.87
HH        0.87     0.87      0.84       0.86      0.86       0.86      0.88
HE        0.59     0.59      0.52       0.61      0.60       0.63      0.65
Total     0.67     0.67      0.60       0.67      0.67       0.67      0.69
Numerical
BW        0.86     0.86      0.62       0.74      0.74       0.89      0.90
DI        0.55     0.55      0.53       0.54      0.54       0.55      0.56
GL        0.49     0.49      0.51       0.51      0.51       0.53      0.53
HS        0.64     0.64      0.54       0.63      0.63       0.66      0.69
IO        0.51     0.51      0.46       0.55      0.55       0.63      0.61
IR        0.81     0.81      0.87       0.73      0.73       0.81      0.83
SE        0.61     0.61      0.39       0.54      0.54       0.57      0.57
SO        0.54     0.54      0.52       0.54      0.54       0.54      0.54
VE        0.35     0.35      0.36       0.37      0.37       0.36      0.37
VO        0.15     0.15      0.20       0.21      0.21       0.22      0.21
Total     0.55     0.55      0.50       0.54      0.54       0.58      0.58