jsci2016 Semi-Automatic Construction of Thyroid Cancer Intervention Corpus from Biomedical Wutthipong Kongburan, Praisan Padungweang, Worarat Krathu, Jonathan H. Chan School of Information Technology King Mongkut's University of Technology Thonburi
Agenda INTRODUCTION PROPOSED METHOD RESULT AND DISCUSSION CONCLUSION AND FUTURE WORK
Agenda INTRODUCTION Overview of text mining and NER Thyroid Cancer and Medical Intervention Rationale and Motivation Research questions PROPOSED METHOD RESULT AND DISCUSSION CONCLUSION AND FUTURE WORK
Thyroid cancer The cells of the thyroid gland become abnormal, grow uncontrollably, and form a mass of cells called a tumor. Thyroid neoplasms represent almost 95% of all endocrine tumors Women are three times more likely than men In Bangkok, Thailand, the number of new cases of thyroid cancer is 6 per 100,000 men and women per year It is now ranked as the 8th most common cancer in women in the United States and 4% increase annually
Medical Intervention Any objects or acts of treatment and preventive care of intervening, interfering or interceding with the intent of modifying the outcome Such typical treatment - the side effects. Medical interventions alternative treatment increasing patient satisfaction lead to better health outcomes
Thyroid cancer publications in PubMed 3500 3000 #Publications 2500 2000 1500 1000 1995 2000 2005 2010 2015 Years
Overview of Biomedical Text mining
Named Entity Recognition (NER) Classify tokens in to predefined categories such as person name, organizations, diseases, proteins.
Rationale and Motivation Training data ML-based Text mining Classifier/ Model
Research questions How can we construct a thyroid cancer intervention corpus in reasonable time? What are the criteria to measure the performance of corpus?
Agenda INTRODUCTION PROPOSED METHOD A construction workflow of corpus Basic definitions of entities Evaluation of corpora 10-fold cross validation Leave-one(intervention)-out cross validation RESULT AND DISCUSSION CONCLUSION AND FUTURE WORK
A construction workflow of thyroid cancer intervention corpus PubMed query: (((((((((((thyroid neoplasms) OR thyroid cancer) OR thyroid carcinoma) OR thyroid malignancies) OR papillary) OR follicular) OR medullary) OR anaplastic) AND treatment) AND hasabstract[text] AND Humans[Mesh] AND English[lang])) The query limits the results to abstracts of human studies in English. Sorted by relevance Online database Information Retrieval Relevant documents Title Text PubMed ID
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval The initial number of 50 abstracts is arbitrary, subject to the limitation of human annotator that can be done in reasonable time. To determine entity boundaries in a text by sentence splitting and define the data in token form. Relevant documents Carefully select 50 Preprocessing 50
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval Relevant documents Carefully select 50 Each token or word is tagged with three classes of biomedical entities: INTERVENTION, DISEASE, and O. The annotation process is carried out following the annotation guideline. Preprocessing 50 Manual annotation 50 manually annotated abstracts Annotation guideline
Basic definitions of entities INTERVENTION means a treatment and preventive care object or action that performed by doctor or other clinician targeted at a thyroid cancer patient. Non-drug therapy: e.g., surgery, radioactive iodine treatment, external beam radiation therapy, and Drug therapy: e.g., anti-angiogenesis drugs, target inhibitors, chemotherapy, and thyroid hormone replacement therapy. DISEASE means a disorder in which the cells of the thyroid gland become abnormal, grow uncontrollably, and form a mass of cells called a tumor. Including Common name and tumors name of thyroid cancer: e.g., thyroid cancer, thyroid malignancies, thyroid carcinoma, thyroid tumors. Group names of thyroid cancer: e.g., differentiated thyroid carcinoma, undifferentiated thyroid cancer. Sub types or other forms of thyroid cancer: e.g., follicular cancer, papillary thyroid cancer, Hurthle cell carcinoma, medullary carcinoma, anaplastic carcinoma, insular thyroid carcinoma, thyroid lymphoma, and thyroid sarcoma. Other terms which not refer to INTERVENTION and DISEASE will be tagged as O. http://dlab.sit.kmutt.ac.th/wp-content/uploads/2015/10/tcit-annotation-guideline.pdf
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval Relevant documents Carefully select 50 Preprocessing 50 Manual annotation Annotation guideline 50 manually annotated abstracts NER classifier Train NER classifier Stanford NER provides a general implementation of linear chain Conditional Random Fields (CRFs) sequence models.
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval Relevant documents The number of n equals to 30 for each experiment round which is the arbitrarily selected in our current experiment Select n first abstracts 30 Carefully select 50 Preprocessing Preprocessing 30 50 Manual annotation Annotation guideline 50 manually annotated abstracts NER classifier Train NER classifier
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval 30 automatically annotated abstracts Relevant documents Select n first abstracts Carefully select 30 50 Preprocessing Preprocessing 30 50 Automatic annotation Manual annotation Annotation guideline 50 manually annotated abstracts NER classifier Train NER classifier
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval 30 automatically annotated abstracts Relevant documents Select n first abstracts Carefully select Neglect No Contain new intervention 30 50 To filter only abstracts that contain new INTERVENTION out of the manually annotated datasets. Yes 30-neglect Preprocessing 30 Preprocessing 50 Automatic annotation Manual annotation Annotation guideline 50 manually annotated abstracts NER classifier Train NER classifier
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval 30 automatically annotated abstracts Relevant documents Select n first abstracts Carefully select Neglect No Contain new intervention 30 50 Yes Preprocessing Preprocessing 30-neglect 30 50 The automatic annotated abstracts will be judged by annotators together with the annotation guideline. Manual correction Correct automatically annotated abstracts Annotation guideline Automatic annotation Manual annotation 50 manually annotated abstracts Annotation guideline This step can reduce the effort of manually annotated. NER classifier Train NER classifier
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval 30 automatically annotated abstracts Relevant documents Select n first abstracts Carefully select Neglect No Contain new intervention 30 50 Yes Preprocessing Preprocessing 30-neglect 30 50 Manual correction Annotation guideline Automatic annotation Manual annotation Annotation guideline Correct automatically annotated abstracts 50 manually annotated abstracts The result of each round of incremental model is a corpus. It is a combination of manually annotated datasets and automatic tagging with manual correction datasets. Construction a corpus Corpus NER classifier Train NER classifier
A construction workflow of thyroid cancer intervention corpus Online database Information Retrieval 30 automatically annotated abstracts Relevant documents Select n first abstracts Carefully select Neglect No Contain new intervention 30 50 Yes Preprocessing Preprocessing 30-neglect 30 50 Manual correction Annotation guideline Automatic annotation Manual annotation Annotation guideline Correct automatically annotated abstracts 50 manually annotated Construction a corpus NER classifier Train NER classifier The process will be repeated again with next 30-first abstracts, until the performance of a new corpus and current corpus have no significant difference. Corpus
Incremental construction Round Current corpus size #Tokens unique interventions New n-first abstracts # contain new intervention # after manual correction 1 50 312 1-30 11 9 2 59 339 31-60 9 8 3 67 356 61-90 6 6 4 73 379 91-120 12 8 5 81 400 121-150 10 8 6 89 413 151-180 10 10 7 99 460 181-210 14 13 8 112 501 211-240 13 13 9 125 525 241-270 11 10 10 135 546 271-300 10 8 11 143 564 301-330 12 10 12 153 594 331-360 11 9 13 162 619 361-390 10 5 14 167 632 391-420 11 7 15 174 648 421-450 9 7 16 181 655 451-480 8 7 17 188 675 - - -
Research questions How can we construct a thyroid cancer intervention corpus in reasonable time? What are the criteria to measure the performance of corpus?
Agenda INTRODUCTION PROPOSED METHOD A construction workflow of corpus Basic definitions of entities Evaluation of corpora 10-fold cross validation Leave-one(intervention)-out cross validation RESULT AND DISCUSSION CONCLUSION AND FUTURE WORK
Performance Evaluation 10-fold cross validation Predicted class Predicted class Actual class INTERVENTION DISEASE or O INTERVENTION DISEASE or O True Positive(TP) False Positive(FP) False Negative(FN) True Negative(TN) Actual class DISEASE INTERVENTION or O DISEASE True Positive(TP) False Positive(FP) INTERVENTION or O False Negative(FN) True Negative(TN) Predicted class Actual class O INTERVENTION or DISEASE O True Positive(TP) False Positive(FP) INTERVENTION or DISEASE False Negative(FN) True Negative(TN) Precision (P) =, Recall (R) =, F1-score (F) = 2
10-fold cross validation Sequentially select Randomly select
Agenda INTRODUCTION PROPOSED METHOD RESULT AND DISCUSSION How many abstracts are suitable for constructing the corpus? CONCLUSION AND FUTURE WORK
The average performance of three entity classes by using 10-fold cross validation
The change rate of average F1-score of three entity classes
The average performance of 143 model Class Precision Recall F1-score INTERVENTION 0.923 0.666 0.768 DISEASE 0.982 0.915 0.946 O 0.974 0.996 0.984 Average 0.959 0.859 0.906 in a case of 143 abstracts there are 41,657 tokens consist of 2,473 INTERVENTION (564 unique intervention tokens), 1,996 DISEASE, (94 unique tokens of disease), and 37,188 O.
Leave-one(intervention)-out cross validation All leave out keywords are the words that describe common interventions applied for thyroid cancer e.g. chemotherapy, radioiodine, radioactive, radiotherapy, drug, vandetanib, cabozantinib, sorafenib, inhibitors, surgery, ablation, resection, dissection, and thyroidectomy.
The accuracy of discovery new interventions by using leave-one-out cross validation
Result and discussion Based on the experimental results, the corpus with 143 abstracts is a suitable size of corpus for Acceptable model in performance metrics. Discovery new interventions (overfitting avoidance)
Confusion matrix of radioactive keyword in leave-one-out cross validation using 143 model Predicted class Actual class O INTERVENTION DISEASE O 8668 56 4 INTERVENTION 191 465 4 DISEASE 19 1 396 False negatives and false positive mainly fall into mislabeled (e.g., total lobectomy is annotated as O), lacking in entity s boundaries (e.g., central neck node dissection only dissection annotated as INTERVENTION), and case sensitive (e.g., Surgery is annotated as O instead INTERVENTION).
Agenda INTRODUCTION PROPOSED METHOD RESULT AND DISCUSSION CONCLUSION AND FUTURE WORK
Result and discussion The production of a gold standard corpus requires a significant amount of manual curation work Contribution a semi-automatic semantic annotation approach to construct a thyroid cancer intervention corpus together with a criteria of identifying reasonable performance Future work Improve performance of the corpus for example, diversify the keywords search, remove sentences without annotations, and use different number of abstracts in the experiment Apply our approach to extract relationships between general biomedical entities
38