Data Mining Research Project Report: Generating Texts in Estonian Language
Author: Robert Roosalu
Supervisor: Tambet Matiisen
Tartu University, Institute of Computer Science
January 2016

Introduction

The aim of this project is to replicate the results of Andrej Karpathy's blog post [1]. In it he demonstrates training a recurrent neural network to generate text character by character. The post shows examples of English text, Wikipedia markup, LaTeX markup and even C++ source code. An implementation of the described model can be found in the deep learning library Keras's example set. The goal of this project is to use it on an Estonian dataset and to explore the model's hyperparameter space. This project should provide insight into:
* how different hyperparameters affect the performance of the model
* comparing the quality of the generated text
* differences when training on Estonian versus English datasets

Background

The concept of using RNNs for character-by-character prediction was first demonstrated by Ilya Sutskever et al. in 2011 [2]. Besides the fun application of generating text, they describe how such models could be used in text compression. They compare the RNN solution to two previous state-of-the-art methods, while achieving competitive results:
* Memoizer, a hierarchical Bayesian model
* PAQ, a data compression method that uses probabilistic sampling. The sampling is done by a mixture of context models (n-gram, whole-word n-gram, etc.). The mixing proportions (when to prefer which model) take the current context into consideration.

Dataset

I began by concatenating 160 books from Estonian literature (published since 1990). They were part of the balanced corpus provided by the Tartu University computational linguistics group [3]. The books were in the TEI format (which uses XML syntax), requiring some parsing to get the plain text format I was after. The resulting 35 MB dataset took too long to process. I estimated that a reasonable size would be around 1 MB, as that produced interpretable results within about half a day of training. I cut the concatenated dataset at the 1 MB mark and ended up with Rein Põder's Hiliskevad and a quarter of Ene Mihkelson's Ahasveeruse Uni.
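The report does not reproduce the parsing code itself. As a rough illustration, a minimal sketch of how TEI (XML) files could be reduced to plain text with only the Python standard library might look like the following; the function name and the books/ directory are purely illustrative, not taken from the project.

    import xml.etree.ElementTree as ET
    from pathlib import Path

    def tei_to_plain_text(path):
        """Extract the running text from one TEI (XML) encoded book."""
        root = ET.parse(path).getroot()
        # TEI wraps the actual book content in <text> elements; the tags are
        # usually namespaced, so match only the local part of the tag name.
        bodies = [el for el in root.iter() if el.tag.rsplit('}', 1)[-1] == 'text']
        return '\n'.join(''.join(el.itertext()) for el in bodies)

    # Concatenate all books into a single training corpus file.
    corpus = '\n'.join(tei_to_plain_text(p) for p in sorted(Path('books').glob('*.xml')))
    Path('corpus.txt').write_text(corpus, encoding='utf-8')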

Method

The architecture presented in both Karpathy's post and the Keras example set:
* 2-layer LSTM, 512 neurons
* Dropout 0.2
* Softmax activation
* Categorical cross-entropy loss
* RMSProp optimizer

For hyperparameter tuning, this is the set I tweaked, with their baseline values:
* neurons, 512
* dropout, 0.2
* corpus_size, 1 MB
* window_size, 20
* batch_size, 256
* epoch_number, 1

window_size is the substring length of the training examples. epoch_number is a parameter of the fitting function. Training in the original code is done in iterations, after each of which an example sample is produced. At first glance the epoch number seemed to have a different effect than the iterations, but as expected, they turned out to be the same thing.

When conducting these experiments, it did not seem sensible to me to use a validation set, as this problem seemed a bit different from regular machine learning tasks. However, as described by Sutskever et al. in [2], it would make sense. Instead, I used a custom metric to accompany the training loss, described by eq. 1:

accuracy = (number of correct words in the generated sample) / (total number of words in the sample)    (1)

We can affect the sampling of the text with a temperature parameter. A lower temperature means picking the characters that are most probable according to the model. The idea is that a model which still produces correct words at high temperatures should be a good model. For the experiment I tried various hyperparameter values and recorded the loss and the custom accuracy after each iteration.
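The report does not include the training script itself. Below is a minimal sketch of the baseline setup described above, written against the Keras Sequential API; the sample() helper follows the temperature sampling used in the Keras character-level example, while word_accuracy() is only one possible reading of eq. (1), since the report does not spell out the exact definition.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense

    def build_model(window_size=20, n_chars=100, neurons=512, dropout=0.2):
        """Baseline architecture: 2-layer LSTM with dropout and a softmax output."""
        model = Sequential([
            LSTM(neurons, return_sequences=True, input_shape=(window_size, n_chars)),
            Dropout(dropout),
            LSTM(neurons),
            Dropout(dropout),
            Dense(n_chars, activation='softmax'),
        ])
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
        return model

    def sample(preds, temperature=1.0):
        """Sample a character index from the predicted distribution.

        Low temperature sharpens the distribution (mostly the most probable
        characters get picked); high temperature flattens it.
        """
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds + 1e-8) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        return int(np.argmax(np.random.multinomial(1, preds, 1)))

    def word_accuracy(generated_text, vocabulary):
        """Eq. (1), read as: fraction of generated words found in the corpus vocabulary."""
        words = generated_text.split()
        if not words:
            return 0.0
        return sum(w in vocabulary for w in words) / len(words)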

Results

I obtained results from 14 different models. The models were trained for varying lengths of time, between a day and two days. All of the trained models' cost functions are presented in figure 1.

Figure 1. Cost functions of all the experiments. The longest, light blue line is the baseline, with default hyperparameter values.

Figure 2. Loss functions for modified batch size, corpus size, window size and dropout.

Figure 2 displays the cost functions for varying batch size, corpus size, window size and dropout. Larger batch sizes hindered training, and it seemed that a smaller batch size might improve it. A larger corpus did not improve the loss after a day of training. It could be that the network is too small, or that with more input data it simply takes longer to reach the same loss value. A larger window seems to have no effect; the network appears to care about a character history of only up to 20. Larger dropout values worsen the loss. Unfortunately, the measurements for a smaller run with dropout 0.1 were lost. Varying the fitting function's epoch number seemed to have a positive effect, until I realised it is essentially the same thing as running the fitting iteratively (see the short sketch after figure 4).

Figure 3. Fitting with a higher epoch number on the left, normalised to iterations on the right.

The strangest loss function was produced by one larger network, as presented in figure 4.

Figure 4. Modifying the network layer size.
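The equivalence between the two schemes can be made concrete with a short sketch (X and y stand for the one-hot encoded training windows and targets, which the training script builds but the report does not reproduce):

    # Original scheme from the Keras example: one epoch per iteration, with a
    # text sample generated after each call to fit().
    for iteration in range(60):
        model.fit(X, y, batch_size=256, epochs=1)
        # ... generate and inspect a text sample here ...

    # "Higher epoch_number": the same total number of passes over the data,
    # only with fewer sampling pauses in between. Once normalised to the
    # number of epochs, the loss curves line up, as in figure 3.
    for iteration in range(12):
        model.fit(X, y, batch_size=256, epochs=5)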

To see how the larger network compares to the baseline, I plot them both against the accuracy metric described in the previous section.

Figure 5. Comparison of networks sized 512 and 1024 with the accuracy metric.

The low temperature sample goes to nearly 100% accuracy in the beginning, due to generating only short, correct words. It is interesting how the accuracies of the low and high temperature samples converge; perhaps convergence would indicate that training is done. The bigger network displays quicker convergence and also a slightly higher accuracy on the high temperature samples. This would indicate a better model. However, reading through the generated texts at around iteration 60, the smaller network seemed to make more sense.

Some examples of generated text:

512 neurons, low temperature: mees oli ta vastanud, et ta oli ka enam mingit neist temasse viinalt parjada sellele peale mõneda kohale ja arusaatust taastasse saati. ja see kasvab veel tema pea sees. nii et ma teadsin, et mulle tundus mulle ka enam midagi ja ette vastama. ma ei ole kui tema keegi meele.

512 neurons, high temperature: kinni üle metsavendi tema silmis peaaegu palamamaatlus. ainult ette vaaduks tõgima!, viga minu omida... esimesel tegul. meil oli ruttanud, päris kilduv paesa vananud ja just naguti olemuspoolt ühne sedasama vajaks.

1024 neurons, low temperature: kuid midagi ses maastikus, mis on see võimalik. ja millal ma seda kõrvale ja kuulus see tahtis mind ema taastaks, kui mu kümmeks säärasega jooksid rõhkumise haamatust küünalda letk

1024 neurons, high temperature: ma olin karini armendama näonud paidagi elanud. selgus, et veel siis naiselik. tegelikult oli hetke teeline piri, oli ma auendaid kordi alla peen, seda tuli ta rihumad loodus.
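For reference, samples like the ones above are produced character by character with a sliding input window. A minimal sketch of such a generation loop, continuing from the model and sample() helpers sketched in the Method section (the char_to_idx and idx_to_char mappings are assumed), could look like this:

    import numpy as np

    def generate(model, seed, char_to_idx, idx_to_char, length=400,
                 temperature=0.5, window_size=20):
        """Generate `length` characters, starting from a seed string."""
        generated = seed
        for _ in range(length):
            # One-hot encode the last window_size characters of the text so far.
            window = generated[-window_size:]
            x = np.zeros((1, window_size, len(char_to_idx)))
            for t, ch in enumerate(window):
                x[0, t, char_to_idx[ch]] = 1.0
            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, temperature)
            generated += idx_to_char[next_index]
        return generated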

Discussion

Tuning the parameters mostly did not lead to a notable improvement in either the training loss, the measured accuracy, or the perceived text quality. The dataset I ended up using was in the same size range as the one used in the original example, whose author had probably already done a good job of tuning the parameters. The original text was in English, and the samples from it made much more sense than the ones generated on Estonian. This indicates that the Estonian language has a more complicated structure for such a model to capture. The main takeaway for me was learning how to set up such an experiment. This time the process was a bit error prone, resulting in some lost measurements. Logging the results should be as automatic as possible, with a minimal amount of manual work, since manual work makes the process tedious and also greatly increases the chance of mistakes.

Future work

There are some loose ends in this work, such as measuring smaller batch sizes and seeing how the bigger network behaves after longer training. However, the experiments showed that the dataset is too small to obtain reasonable results, so the first step would be to greatly increase the dataset and rerun the hyperparameter tuning process. Following experiments should include the validation loss as well. Furthermore, it would be interesting to try different accuracy metrics, for example measuring the number of distinct words in the samples, or checking how much of the sampled text is present in the original data. This would help gauge the model's originality.

Conclusion

This data mining research project set out to try the recurrent neural network model presented by Andrej Karpathy [1], and originally by Ilya Sutskever et al. [2], for generating Estonian text. An implementation included in Keras's example set worked quite well, so I set out to explore its hyperparameters. The available timeframe set a limit on the dataset size, which coincided with the size of the example's dataset. The hyperparameter search revealed that the original hyperparameter values were already good enough. A metric for measuring the model's performance was used, which counted the correct words in the generated sample; this helps to automatically estimate how good a model is. Comparing with the results from the example, which used English text as input, it can be said that the Estonian language has a more complex structure. Future work should build on the fact that 1 MB of data is too small for this kind of sampling.

References

[1] A. Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[2] I. Sutskever, J. Martens, G. Hinton. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[3] Tartu University computational linguistics group, balanced corpus. http://www.cl.ut.ee/korpused/grammatikakorpus/