SESSION 2: MASTER COURSE
Data and Text-Mining the Electronic Medical Record for epidemiological purposes
Dr Marie-Hélène Metzger, Associate Professor
marie-helene.metzger@aphp.fr
1 Assistance Publique Hôpitaux de Paris, Hôpital Avicenne, Bobigny
2 Paris 13 University, LEPS
3 INSERM U 1018, CESP
Definitions: Data-Mining
- Computational process of discovering patterns in large data sets, involving methods of artificial intelligence, machine learning, statistics, and database systems
- A step in the Knowledge Discovery in Databases (KDD) process, commonly defined with the following stages:
  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data Mining
  5. Interpretation/Evaluation
Ref: U. Fayyad et al. 1996, American Association for Artificial Intelligence: 37-54
Definitions: Text-Mining
- Linguistic technologies that convert text (full text) into a numerical vector (presence-absence or frequency of terms)
- It is then possible to apply the same algorithms as those used in Data-Mining (e.g. Principal Component Analysis, Naive Bayes classifier)
General applications:
- Information extraction
- Information retrieval
- Keyword-based association analysis
- Document classification
- Text clustering analysis
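The conversion of text into a presence-absence or frequency vector can be sketched as follows. This is a minimal bag-of-words illustration, not the method of any particular tool: the function names (`tokenize`, `build_vocabulary`, `vectorize`) and the two toy sentences are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_vocabulary(documents):
    """Sorted list of every distinct token seen in the corpus."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def vectorize(text, vocabulary, binary=False):
    """Map a text to one number per vocabulary term.

    binary=False gives term frequencies; binary=True gives presence-absence.
    """
    counts = Counter(tokenize(text))
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

# Hypothetical toy corpus, not taken from any real record.
documents = [
    "fever and purulent drainage from the scar",
    "no fever, scar is clean",
]
vocabulary = build_vocabulary(documents)
frequency_vectors = [vectorize(d, vocabulary) for d in documents]
presence_vectors = [vectorize(d, vocabulary, binary=True) for d in documents]
```

Once every document is a fixed-length vector over the same vocabulary, the rows can be passed to a standard data-mining algorithm such as a classifier or a clustering method.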
From text to epidemiology
Data sources:
- Electronic medical records
- E-mails
- Internet queries
- Blogs
- Social networks
- Scientific literature
- Newspapers
- Medical consultation (voice recognition transcription)
Epidemiological use:
- Prevalence, incidence of a disease
- Patient phenotyping
- Surveillance, alert
- Exploratory analysis of risk factors
- Indicators of healthcare quality
From medical text to epidemiological knowledge
Figure: Text-Mining, Data-Mining, Epidemiology
Ref: U. Fayyad et al. 1996, American Association for Artificial Intelligence: 37-54
Use case: quality of care
The SYNODOS project: SYstem for the Normalization and Organization of textual medical Data for Observation in Healthcare
http://www.synodos.fr
Detection of nosocomial infections, or automated description of the patient's care pathway in colon cancer
Data Selection
Figure: SYNODOS architecture. Components shown:
- Hospital Information System: metadata (dates, type of document); structured data (ADICAP, ...); textual data (discharge summaries, operative reports, ...)
- Network: LAN, DMZ, Internet (HTTPS protocol)
- Multi-terminology extractor
- «PUSH» import, XML format; SOAP request
- SYNODOS Mediator
- Storage directory; indexing
- Semantic Analyzer (source, de-identified, terminology and semantics labeling)
- Fact Base: certain facts, inferred facts
- Rule Base
- User Interface
Multi-terminology Indexer
Table columns: Terminology Code | Terminology Wording | Terminology Source | UMLS semantic type
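The four columns above suggest one record per concept found in a text. The sketch below shows one possible shape for such a record and a naive dictionary lookup. It is an illustration only: the `IndexEntry` class, the `LEXICON` contents, the `DEMO-` codes and source names are all hypothetical placeholders, and the substring match stands in for whatever matching the real indexer performs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    """One row of indexer output, mirroring the four columns above."""
    terminology_code: str
    terminology_wording: str
    terminology_source: str
    umls_semantic_type: str

# Hypothetical lexicon: codes and source names are placeholders,
# not entries from any real terminology.
LEXICON = {
    "paracetamol": IndexEntry(
        "DEMO-0001", "paracetamol", "DEMO-DRUGS", "Pharmacologic Substance"
    ),
    "surgical site infection": IndexEntry(
        "DEMO-0002", "surgical site infection", "DEMO-DISEASES", "Disease or Syndrome"
    ),
}

def index_text(text):
    """Return the entries whose wording occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return [entry for wording, entry in LEXICON.items() if wording in lowered]
```

For example, `index_text("Patient took Paracetamol this morning")` returns the single placeholder entry for paracetamol.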
Text-Mining Processing
Part of Speech tags: 17 universal tags. Dependency parses (triplets): 40 universal relations.
Token | POS tag | POS description | Dependency | Dependency description
The | DT | Determiner | | Determiner
patient | NN | Common noun | nsubj | Nominal subject
claims | VB | Verb: base form | null |
not | RB | Adverb | neg | Negation modifier
to | TO | To | aux |
have | VB | Verb: base form | aux | Auxiliary verb
consumed | VBN | Verb: past participle | xcomp | Open clausal complement
this | DT | Determiner | det |
morning | NN | Common noun | tmod | Temporal modifier
6 | CD | Cardinal number | num | Numeric modifier
es | NNS | Common noun | dobj | Direct object
of | IN | Preposition | prep |
paracetamol | NNP | Proper noun | pobj | Object of preposition
+ | SYM | Symbol | dep | Dependency
other | JJ | Adjective | amod | Adjectival modifier
drugs | NN | | dep | Dependency
. | PUNCT | Punctuation | |
Populating the database of facts
Patient's care pathway:
- Surgical report: T0
- Discharge summary: T0 + 7 days
- Consultation letter: T0 + 3 months
Components:
- Relational database
- BRMS (Business Rules Management System)
- Expert rules
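The care-pathway timeline above can be stored as dated facts in a relational table. The following is a minimal sketch using an in-memory SQLite database. The table layout, the function names, the `scar healed` label and the approximation of 3 months as 90 days are assumptions made for the example, not the SYNODOS schema.

```python
import sqlite3

def temporal_tag(days_since_t0):
    """Tag a fact relative to the surgery date T0."""
    if days_since_t0 == 0:
        return "T0"
    return "> T0" if days_since_t0 > 0 else "< T0"

def build_fact_base(facts):
    """Create an in-memory fact table and load (patient, document, days, label) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact ("
        "patient_id TEXT, document_type TEXT, "
        "days_since_t0 INTEGER, temporal_tag TEXT, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
        [(p, doc, days, temporal_tag(days), label) for p, doc, days, label in facts],
    )
    conn.commit()
    return conn

# Hypothetical facts for one invented patient; 3 months is approximated as 90 days.
EXAMPLE_FACTS = [
    ("P1", "surgical report", 0, "surgery"),
    ("P1", "discharge summary", 7, "purulent drainage from the scar"),
    ("P1", "consultation letter", 90, "scar healed"),
]

conn = build_fact_base(EXAMPLE_FACTS)
rows = conn.execute(
    "SELECT label, temporal_tag FROM fact ORDER BY days_since_t0"
).fetchall()
```

Storing the temporal tag next to each fact lets expert rules be written against `T0` and `> T0` without recomputing dates.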
Expert rules: linguistic
Same token / Part of Speech / dependency parse table as under Text-Mining Processing.
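One kind of linguistic rule that such a parse supports is negation detection: a `neg` relation marks its governing verb, and therefore that verb's object, as negated. The sketch below assumes a shortened version of the example sentence. The head indices are not given in the source and are assumptions, as is attaching "paracetamol" directly to "consumed" as `dobj`; the `Token` class and `negated_events` function are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple

class Token(NamedTuple):
    text: str
    pos: str
    dep: Optional[str]    # dependency label; None for the root
    head: Optional[int]   # index of the governing token (ASSUMED, not in the source)

# Shortened example sentence: "The patient claims not to have consumed paracetamol"
PARSE = [
    Token("The", "DT", "det", 1),
    Token("patient", "NN", "nsubj", 2),
    Token("claims", "VB", None, None),
    Token("not", "RB", "neg", 6),
    Token("to", "TO", "aux", 6),
    Token("have", "VB", "aux", 6),
    Token("consumed", "VBN", "xcomp", 2),
    Token("paracetamol", "NNP", "dobj", 6),  # simplified attachment (assumption)
]

def negated_events(parse: List[Token]) -> List[Tuple[str, List[str]]]:
    """Return (verb, direct objects) pairs for every verb governed by a 'neg' relation."""
    events = []
    for token in parse:
        if token.dep == "neg" and token.head is not None:
            verb = parse[token.head]
            objects = [
                t.text for t in parse
                if t.dep == "dobj" and t.head == token.head
            ]
            events.append((verb.text, objects))
    return events
```

Here `negated_events(PARSE)` flags "consumed" with object "paracetamol" as negated, so a downstream rule would avoid recording paracetamol intake as an asserted fact.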
Expert rules: medical
Detection of nosocomial infection
Example of a medical expert rule:
If "surgery" at T0 and "purulent drainage from the scar" tagged "> T0", then "surgical site infection" = "YES"
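The example rule can be written as a small predicate over temporally tagged facts. This is a sketch in plain Python rather than in a BRMS rule language. The function name and the representation of facts as `(label, temporal_tag)` pairs are assumptions, and the rule as stated only defines the "YES" outcome, so the function returns `None` when it does not fire.

```python
from typing import Iterable, Optional, Tuple

Fact = Tuple[str, str]  # (label, temporal tag), e.g. ("surgery", "T0")

def surgical_site_infection(facts: Iterable[Fact]) -> Optional[str]:
    """Apply the example rule: surgery at T0 plus purulent drainage after T0."""
    fact_set = set(facts)
    surgery_at_t0 = ("surgery", "T0") in fact_set
    drainage_after_t0 = ("purulent drainage from the scar", "> T0") in fact_set
    if surgery_at_t0 and drainage_after_t0:
        return "YES"
    return None  # the rule does not fire; no conclusion is drawn

# Hypothetical fact lists for two invented patients.
patient_a = [("surgery", "T0"), ("purulent drainage from the scar", "> T0")]
patient_b = [("surgery", "T0"), ("scar healed", "> T0")]
```

With these inputs the rule fires for `patient_a` and stays silent for `patient_b`.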
Detection performance
Use case: psychiatry
Ref: Int. J. Methods Psychiatr. Res. (2016)
Discussion: critical steps
Linguistic issues:
- Different terminologies are used depending on whether the corpus under study was written by patients or by doctors.
- Corpora of medical reports are difficult to analyze because they contain many unconventional abbreviations and typographical errors, and often lack punctuation.
Discussion: critical steps
Availability of clinical text for secondary use
Ethical challenges:
- Barriers to accessing clinical text are linked to perceptions or misperceptions of risks to patient privacy.
- In fact, adoption of NLP methods and automatic information extraction would reduce threats to patient privacy, because less human intervention is required.
Technical challenges:
- Quality of the text de-identification process
- Standardisation of metadata related to the type of documents available in the Health Information System