Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms


DOCTORAL THESIS

submitted for the degree of Doctor at Universiteit Leiden, on the authority of the Rector Magnificus Prof. mr. P. F. van der Heijden, in accordance with the decision of the Doctorate Board, to be defended on Wednesday 4 March 2009 at 13:45

by Fabrice Pierre Robert Colas, born in Laval, France, in 1981.

Doctoral committee:
Prof. dr. J.N. Kok, LIACS, Universiteit Leiden (Promotor)
Dr. Ingrid Meulenbelt, LUMC
Prof. dr. F. Famili, NRC-IIT, Ottawa, Canada
Prof. dr. T.H.W. Bäck, LIACS, Universiteit Leiden
Prof. dr. G. Rozenberg, LIACS, Universiteit Leiden
Prof. dr. B.R. Katzy, LIACS, Universiteit Leiden

This work was carried out under a grant from the Netherlands BioInformatics Center (NBIC).

Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms
Fabrice Pierre Robert Colas
Thesis Universiteit Leiden
ISBN 978-90-9023888-3

Copyright © 2008 by Fabrice Pierre Robert Colas
All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

Printed in France
Email: fabrice.colas@grano-salis.net

To my parents

Contents

Introduction
  Data mining scenarios
  Part I: Subtype Discovery by Cluster Analysis
  Part II: Automatic Text Classification
  Publications
    Part I: Subtype Discovery by Cluster Analysis
    Part II: Automatic Text Classification

I Subtype Discovery by Cluster Analysis

1 Application Domains
  1.1 Introduction
  1.2 Osteoarthritis
  1.3 Parkinson's disease
  1.4 Drug discovery
  1.5 Concluding remarks

2 A Scenario for Subtype Discovery by Cluster Analysis
  2.1 Introduction
  2.2 Data preparation and clustering
    2.2.1 Data preparation
    2.2.2 The reliability and validity of a cluster result
    2.2.3 Clustering by a mixture of Gaussians
  2.3 Model selection
    2.3.1 A score to compare models
    2.3.2 Valid model selection
  2.4 Characterizing, comparing and evaluating cluster results
    2.4.1 Visualizing subtypes
    2.4.2 Statistical characterization and comparison of subtypes
    2.4.3 Statistical evaluation of subtypes
  2.5 Concluding remarks

3 Reliability of Cluster Results for Different Time Adjustments
  3.1 Introduction
  3.2 Methods
  3.3 Experimental results
  3.4 Why does optimizing the r² not boost the cluster reliability?
  3.5 Concluding remarks

4 Subtyping in Osteoarthritis, Parkinson's disease and drug discovery
  4.1 Introduction
  4.2 Subtyping in Osteoarthritis
    4.2.1 Outline of the analysis
    4.2.2 Model selection
    4.2.3 Subtype characteristics and evaluation
  4.3 Subtyping in Parkinson's disease
    4.3.1 Outline of the analysis
    4.3.2 Model selection
    4.3.3 Subtype characteristics
    4.3.4 Outline of the post hoc-analysis
  4.4 Subtyping in drug discovery
    4.4.1 Outline of the analysis
    4.4.2 Model selection
    4.4.3 Subtype characteristics
  4.5 Concluding remarks

5 Scenario Implementation as the R SubtypeDiscovery Package
  5.1 Introduction
  5.2 Design of the scenario implementation
    5.2.1 Methods for data preparation and data specific settings
    5.2.2 The dataset class (cdata) and its generic methods
    5.2.3 The cluster result class (cresult) and its generic methods
    5.2.4 Statistical methods to characterize, compare and evaluate subtypes
    5.2.5 Other methods
  5.3 Sample analyses
    5.3.1 Analysis on the original scores
    5.3.2 Analysis on the principal components
  5.4 Concluding remarks

II Automatic Text Classification

6 A Scenario for the Comparison of Algorithms in Text Classification
  6.1 Introduction
  6.2 Conducting fair classifier comparisons
  6.3 Classification algorithms
    6.3.1 k Nearest Neighbors
    6.3.2 Naive Bayes
    6.3.3 Support Vector Machines
    6.3.4 Implementation of the algorithms
  6.4 Definition of the scenario
    6.4.1 Evaluation methodology and measures
    6.4.2 Dimensions of experimentation
  6.5 Experimental data
    6.5.1 To study the behaviors of the classifiers
    6.5.2 To study the scale-up of SVM in large bag of words feature spaces
  6.6 Concluding remarks

7 Comparison of Classifiers
  7.1 Introduction
  7.2 Experimental data
  7.3 Parameter optimization
    7.3.1 Support Vector Machines
    7.3.2 k Nearest Neighbors
  7.4 Comparisons for increasing document and feature sizes
  7.5 Related work
  7.6 Concluding remarks

8 Does SVM Scale up to Large Bag of Words Feature Spaces?
  8.1 Introduction
  8.2 Experimental data
  8.3 Best performing SVM
  8.4 Nature of SVM solutions
  8.5 A performance drop for SVM
  8.6 Relating the performance drop to outliers in the data
  8.7 Related work
  8.8 Concluding remarks

Conclusions
  Subtype Discovery by Cluster Analysis
  Automatic Text Classification

Appendices
  A Two Dimensional Molecular Descriptors
  B Additional Results in Text Classification

Bibliography
Samenvatting (Dutch summary)
Curriculum Vitae
Acknowledgements