Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms



DOCTORAL THESIS (PROEFSCHRIFT)
submitted for the degree of Doctor at the Universiteit Leiden, on the authority of the Rector Magnificus Prof. mr. P. F. van der Heijden, in accordance with the decision of the College voor Promoties, to be defended on Wednesday 4 March 2009 at 13:45
by Fabrice Pierre Robert Colas, born in Laval, France, in 1981.

Doctoral committee (Promotiecommissie):
Prof. dr. J.N. Kok, LIACS, Universiteit Leiden
Dr. Ingrid Meulenbelt, LUMC
Prof. dr. F. Famili, NRC-IIT, Ottawa, Canada
Prof. dr. T.H.W. Bäck, LIACS, Universiteit Leiden
Prof. dr. G. Rozenberg, LIACS, Universiteit Leiden
Prof. dr. B.R. Katzy, LIACS, Universiteit Leiden
Promotor

This work was carried out under a grant from the Netherlands BioInformatics Center (NBIC).

Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms
Fabrice Pierre Robert Colas
Thesis Universiteit Leiden
ISBN 978-90-9023888-3

Copyright © 2008 by Fabrice Pierre Robert Colas. All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

Printed in France
Email: fabrice.colas@grano-salis.net

To my parents

Contents

Introduction
    Data mining scenarios
    Part I: Subtype Discovery by Cluster Analysis
    Part II: Automatic Text Classification
    Publications
        Part I: Subtype Discovery by Cluster Analysis
        Part II: Automatic Text Classification

I Subtype Discovery by Cluster Analysis

1 Application Domains
    1.1 Introduction
    1.2 Osteoarthritis
    1.3 Parkinson's disease
    1.4 Drug discovery
    1.5 Concluding remarks

2 A Scenario for Subtype Discovery by Cluster Analysis
    2.1 Introduction
    2.2 Data preparation and clustering
        2.2.1 Data preparation
        2.2.2 The reliability and validity of a cluster result
        2.2.3 Clustering by a mixture of Gaussians
    2.3 Model selection
        2.3.1 A score to compare models
        2.3.2 Valid model selection
    2.4 Characterizing, comparing and evaluating cluster results
        2.4.1 Visualizing subtypes
        2.4.2 Statistical characterization and comparison of subtypes
        2.4.3 Statistical evaluation of subtypes
    2.5 Concluding remarks

3 Reliability of Cluster Results for Different Time Adjustments
    3.1 Introduction
    3.2 Methods
    3.3 Experimental results
    3.4 Why does optimizing the r² not boost the cluster reliability?
    3.5 Concluding remarks

4 Subtyping in Osteoarthritis, Parkinson's disease and drug discovery
    4.1 Introduction
    4.2 Subtyping in Osteoarthritis
        4.2.1 Outline of the analysis
        4.2.2 Model selection
        4.2.3 Subtype characteristics and evaluation
    4.3 Subtyping in Parkinson's disease
        4.3.1 Outline of the analysis
        4.3.2 Model selection
        4.3.3 Subtype characteristics
        4.3.4 Outline of the post hoc-analysis
    4.4 Subtyping in drug discovery
        4.4.1 Outline of the analysis
        4.4.2 Model selection
        4.4.3 Subtype characteristics
    4.5 Concluding remarks

5 Scenario Implementation as the R SubtypeDiscovery Package
    5.1 Introduction
    5.2 Design of the scenario implementation
        5.2.1 Methods for data preparation and data specific settings
        5.2.2 The dataset class (cdata) and its generic methods
        5.2.3 The cluster result class (cresult) and its generic methods
        5.2.4 Statistical methods to characterize, compare and evaluate subtypes
        5.2.5 Other methods
    5.3 Sample analyses
        5.3.1 Analysis on the original scores
        5.3.2 Analysis on the principal components
    5.4 Concluding remarks

II Automatic Text Classification

6 A Scenario for the Comparison of Algorithms in Text Classification
    6.1 Introduction
    6.2 Conducting fair classifier comparisons
    6.3 Classification algorithms
        6.3.1 k Nearest Neighbors
        6.3.2 Naive Bayes
        6.3.3 Support Vector Machines
        6.3.4 Implementation of the algorithms
    6.4 Definition of the scenario
        6.4.1 Evaluation methodology and measures
        6.4.2 Dimensions of experimentation
    6.5 Experimental data
        6.5.1 To study the behaviors of the classifiers
        6.5.2 To study the scale-up of SVM in large bag of words feature spaces
    6.6 Concluding remarks

7 Comparison of Classifiers
    7.1 Introduction
    7.2 Experimental data
    7.3 Parameter optimization
        7.3.1 Support Vector Machines
        7.3.2 k Nearest Neighbors
    7.4 Comparisons for increasing document and feature sizes
    7.5 Related work
    7.6 Concluding remarks

8 Does SVM Scale up to Large Bag of Words Feature Spaces?
    8.1 Introduction
    8.2 Experimental data
    8.3 Best performing SVM
    8.4 Nature of SVM solutions
    8.5 A performance drop for SVM
    8.6 Relating the performance drop to outliers in the data
    8.7 Related work
    8.8 Concluding remarks

Conclusions
    Subtype Discovery by Cluster Analysis
    Automatic Text Classification

Appendices
    A Two Dimensional Molecular Descriptors
    B Additional Results in Text Classification

Bibliography
Samenvatting (Summary in Dutch)
Curriculum Vitae
Acknowledgements