MACHINE LEARNING ANALYSIS OF PERIPHERAL PHYSIOLOGY FOR EMOTION DETECTION


MACHINE LEARNING ANALYSIS OF PERIPHERAL PHYSIOLOGY FOR EMOTION DETECTION

A Thesis Presented by Sarah M. Brown to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering in the field of Communications, Signal Processing, and Control.

Northeastern University, Boston, Massachusetts, January

© Copyright by Sarah M. Brown. All Rights Reserved.

The author hereby grants to Northeastern University and The Charles Stark Draper Laboratory, Inc. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Abstract

Peripheral physiological signals have shown promise as a measure of a person's emotional state. There are many applications where a more quantitative evaluation of an individual's mental state would be beneficial. For example, in PTSD or depression diagnosis, a quantitative measure to assist and complement the qualitative assessments conducted by clinicians could reduce the time involved in treatment planning. A better understanding of the underlying mechanism is necessary for building systems that use these signals to assist in critical decision making. Previous work in emotion research has relied upon averaging in time and then applying standard significance tests and simple classifiers to analyze psychophysiological data. In this work, we extend analysis beyond traditional hypothesis testing and simple classifiers to better understand previous results and design an appropriate computational model. Under the hypothesis that modeling dynamics is important, we design and apply an Input-Output Hidden Markov Model (IOHMM). Through exploration of the learned IOHMM model parameters, we demonstrate the promise that more descriptive, generative machine learning models provide over the more task-specific discriminative models and traditional statistical hypothesis testing. Incorporating time provides an improvement over simple static classifiers in single trial (without averaging in time) prediction accuracy, but does not provide significant improvement over the time-averaged results found in

the literature. To address this, we employ exploratory data analysis methods and examine properties of the algorithms applied to better understand the results and consider improvements. Mutual information computation and clustering provide insight into the challenges in modeling this data. By applying concepts from learning theory, we show that these seemingly weaker results are actually consistent with previous results. We conclude with insights as to how an alternative approach could elicit more positive results from this dataset and key theoretical contributions to machine learning that are of value for applying these techniques in scientific research.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Motivation and Context
    1.1.1 Psychology background
    1.1.2 Computational Models for Emotion
  1.2 Technical Challenges and Approach
    1.2.1 Technical Barriers
    1.2.2 Technical Approach
  1.3 Objectives
    1.3.1 Key Contribution
    1.3.2 Outline

2 Machine Learning in an Engineering Framework
  2.1 Tools in Machine Learning
    2.1.1 Probability Background
    2.1.2 Decision Theory
    2.1.3 Probabilistic Graphical Models
  2.2 Designing a Machine Learning Solution
    2.2.1 Components of a Learning Problem
    2.2.2 Generative vs. Discriminative
    2.2.3 Parametric vs. Nonparametric
    2.2.4 Bayesian vs. Frequentist
  2.3 Model Specifications

3 Data Preparation and Exploratory Analysis
  3.1 Data Collection
    3.1.1 Experimental Setup
    3.1.2 Feature Extraction
  3.2 Exploratory Analysis
    3.2.1 Traditional Statistical Analysis
    3.2.2 Visualization
    3.2.3 Clustering
    3.2.4 Mutual Information Computations
  3.3 Feature Selection
    3.3.1 Feature Selection as a Learning Problem
    3.3.2 Mutual Information-Based Feature Selection
    3.3.3 Results

4 Model Based Analysis
  4.1 Static Modeling
    4.1.1 Discriminant Analysis
    4.1.2 Support Vector Machines
    4.1.3 Regression
  4.2 Dynamic Modeling
    4.2.1 Dynamic Modeling Review
    4.2.2 Input-Output Hidden Markov Model
    4.2.3 Linear Gaussian Dynamic System

5 Interpretations
  5.1 Contextual Parameter Analysis
  5.2 Classifier Performance and Sample size
  5.3 Stability Analysis
    5.3.1 Stability in IOHMM Parameters

6 Conclusions
  6.1 Key findings
  6.2 Future Work
    6.2.1 Time Series Analysis
    6.2.2 Analysis of Machine Learning Algorithms

A Extended Feature Visualizations
B Extended Mutual Information Results
C Extended Feature Selection Results
D Extended Clustering Results

Acronyms

List of Figures

PGM notation
Experiment Timeline
ANOVA Box Plots
PCA of all measurements
Sample Feature distribution: FPRT sound
K-Means Clustering Solution-sound
GMM Clustering Solution-sound
Image feature selection top scores
Image feature selection top
Sound feature selection top scores
Sound feature selection top
Final feature selection results, all stimuli
PGMs for static models
PGMs for Generative Classifiers
PGM for discrete models
PGM for Linear Gaussian
Best performing Learned Physiology Model
Learned transition model
Cross Validation Instability in Physiology means
Subset of cross validated learned Physiology means
Individual subject GSR
A. PCA of all measurements
A. Sample Feature distribution: FPRT image
A. Sample Feature distribution: FPRT image
A. Sample Feature distribution: FPRT image
A.5 Sample Feature distribution: FPRT image
A.6 Sample Feature distribution: FPRT image
C. Sound feature selection top (ten figures)
D. K-Means Clustering Solution-image
D. GMM Clustering Solution-image
D. K-Means Clustering Solution-image
D. K-Means Clustering Solution-sound
D.5 K-Means Clustering Solution-image
D.6 GMM Clustering Solution-sound
D.7 K-Means Clustering Solution-image
D.8 K-Means Clustering Solution-sound
D.9 GMM Clustering Solution-image
D. GMM Clustering Solution-sound

List of Tables

ML Problem Summary
Summary of Data
Feature List
Discrete MI Variables
MI among experimental variables
MI between features and experimental variables
Sound LDA Results
Sound QDA Results
Sound SVM Results, Linear
Sound SVM Results, RBF
Regressors for Sound dataset with Valence
IOHMM Prediction Accuracy
B. MI among experimental variables
B. MI between features and experimental variables

Chapter 1: Introduction

Emotion is a core component of the human experience; it is something we all feel and recognize, yet we have limited understanding and few complete models for it. An understanding of how it works and how it can be measured through peripheral physiology has tremendous power to impact clinical care. In this chapter, we introduce the necessary psychological background to interpret the remainder of this thesis, state the objectives of the work, and provide a summary of its key contributions. We review prior work in computational models of emotion, outline the technical barriers faced, and give an overview of the approach taken throughout the rest of the thesis.

1.1 Motivation and Context

Troops are currently returning from war in the Middle East with a higher prevalence of post-traumatic brain injury than ever before. Currently, these individuals face a qualitative, iterative diagnostic process before doctors converge upon an effective treatment plan. Depression has a similarly long treatment planning process, and progress is difficult to manage. Affect is a core component of the human experience; it impacts our lives constantly in a

variety of ways, including as an underlying factor in mental health diagnoses. Our very limited quantitative scientific understanding of the underlying mechanisms of affect is a prohibiting factor in diagnosing and treating these conditions in an effective manner. A better understanding of how emotion works and how it manifests in our brains, especially through simple measurements, could revolutionize diagnosis and treatment planning for these conditions, which up to half of all Americans may face at some point in their life Kessler et al. [5]. Experimental psychologists aim to map abstract constructs of the mind onto the biological understanding of the brain through controlled experiments using a variety of physiological and behavioral measures. In these efforts, the phenomena of interest, mental processes, are inherently unmeasurable quantities. The objective is to recover them through statistical analysis of a set of indirect, but measurable, physiological measurements; in this case we use peripheral (distant from the brain) physiology measures. Statistical correlations between properties of the experimental stimuli and these measurements allow for inference about the underlying mental activities. A reliance on well-developed and generic statistical analyses is currently a barrier to advancing the scientific understanding of how physiological brain function relates to abstract psychological constructs, because these methods don't allow researchers to directly ask complex questions of the data. Instead, researchers are forced to construct a question, experiment, and preprocessing methods that can be analyzed with simple statistical models, typically classification or regression. This makes the results harder to interpret and reproduce and leaves more room to question their validity. This research is a collaborative effort with psychologists to develop better suited computational models to translate their qualitative ideas into quantitative understanding of the brain. Prior discoveries from more detailed, but

also more involved, studies such as functional Magnetic Resonance Imaging (fMRI) have found that the brain regions known to control the autonomic nervous system are also strongly involved with emotion. This effort will utilize peripheral physiological measurements and standard psychological paradigms to inform the models. This design choice will facilitate a prompt technology transfer process to develop deployable systems that can aid the diagnosis, treatment planning, and monitoring of psychopathologies in a doctor's office, without the expense or discomfort of fMRI. Beyond providing algorithmic tools that can advance psychological research, this approach is motivated by its applicability to a variety of scenarios and outcomes. A robust, quantitative model for emotion is scalable to a variety of real-world measurements. A choice of accessible measurements and interpretable models can allow for integration into clinical decision making systems. For example, an improved understanding of emotion and emotion regulation is believed to assist in the understanding of a variety of psychopathologies Dillon et al. [], particularly Post-traumatic Stress Disorder (PTSD). Decades of research have demonstrated that there are significant differences in physiological responses among those with and without PTSD Orr et al. [] that can be exploited with classifiers. Incorporating such models into our understanding of psychopathologies could help lead to better diagnostic and therapeutic monitoring techniques. In order to study psychological constructs of emotion, researchers recreate emotional experiences in the laboratory by presenting subjects with stimuli. Stimuli are generally an image or a sound. Many experimental designs utilize presentation procedures in which stimuli are widely spaced in time to allow physiological responses to return to baseline following presentation of the stimulus. This allows for an assessment of physiological responses that are relatively unaffected by responses from preceding stimuli. This approach is motivated by a data-analysis-related desire for independent measurements,

and then data analysis is conducted under this independence assumption. A conditional independence assumption is reasonable in the physiological measurements, given the controlled environment, but independence in the underlying processes of interest is less reasonable. Although this approach is appropriate for gaining an understanding of stimulus-response, its applicability to real-world situations in which stimulation occurs continuously is limited. Many behavioral scientists are well-versed in regression models and different types of classifiers, but few have a deep understanding of the flexible and powerful models available with machine learning techniques.

1.1.1 Psychology background

Psychological constructs are difficult to study. Studying them requires the researcher to bridge from the philosophical constructs of the mind to measurable phenomena of the brain in order to explain complex behaviors and responses. As a core component of the human experience, emotion is something that is often discussed, but rarely formally defined. A myriad of terms are used to describe it in everyday life. Emotion has been studied in humans in various capacities, but most psychology research still relies on a modest set of well-developed statistical mechanisms. Attempts to use new tools have been proposed, but are not widespread Kolodyazhniy et al. [], Janoos et al. [b]. The quantity of interest, emotion, is latent, or unobservable, and only measured indirectly through a physiological or behavioral response to a stimulus. Different measurements that are believed to capture the physical changes reflective of the mental processes of interest are selected in each experimental paradigm. Emotion is frequently studied and is a part of daily experience, but it is poorly defined in comparison to the rigor required for defining quality research questions and appropriate measures Bloch et al. []. Three general categories of definitions and theoretical models exist: dimensional, discrete,

and appraisal. Appraisal theory posits that emotion is an emergent process based on the individual's interpretation of surroundings and events; it is a cognitive process Scherer [9]. Dimensional models provide a set of parameters which combine to describe emotion, such as valence (negative to positive) and arousal (level of response). Discrete emotion theories propose a set of categorical emotions that describe behavior and emotional experience Ekman et al. [98]. We will focus on a theory of discrete emotion, but we will also explore relationships with dimensional emotion theory. Although brain measurements are seemingly more direct, they can be challenging to obtain due to cost, constraints on participant behavior, and the fidelity of information collected by a particular technology. The modalities most well suited for capturing activity in deep-brain structures, such as fMRI, have poor temporal resolution. Modalities like Electroencephalogram (EEG) and Magnetoencephalogram (MEG) have better temporal resolution but are not able to resolve activity in deeper brain structures reliably. Peripheral physiological measurements, like Electrocardiogram (ECG) and pupil diameter, are relatively easy to obtain. Many mental processes of interest, including emotion, have been linked to the same deep brain structures that control autonomic nervous system activity. Previous work has shown that many of these peripheral physiology signals correlate with emotion stimuli Ekman et al. [98], Webb et al. []. Conceptually, authors describe the relationship between autonomic response and emotion at multiple levels, which can be grouped into psychological-level descriptions, brain-behavioral descriptions, and peripheral physiological descriptions Kreibig []. Hierarchical relationships among emotions have been proposed in the psychology literature, but they lack the mathematical concreteness that machine learning research can provide Ekman et al. [98]. The weaknesses of simple classification

have been noted by domain experts, and attempts have been made to evaluate the effects of various feature selection, classification, and cross-validation strategies, as those are all readily accessible Kolodyazhniy et al. []. The objective of the current study is to develop a better understanding of emotion as a time-based process through analysis of several physiological signals collected in response to emotionally evocative stimuli.

In this work we demonstrate the ability of more powerful tools from the signal processing and machine learning communities to discover new results in psychophysiological experimental data. This application also motivates specific novel contributions to the technical approach and more in-depth analysis of the performance of these methods.

1.1.2 Computational Models for Emotion

Much of psychology research relies upon computational methods, typically statistical testing. Here we differentiate between applying a statistical test and developing a computational model. Applying a statistical test to validate a theoretical model only requires the details of the theory to be fully described in natural language Marsella et al. []. Then a hypothesis is formed, an experiment designed, and if the experiment works the null hypothesis is rejected through statistical hypothesis testing. A computational model, however, requires describing all aspects of the model mathematically, not only a null hypothesis. Prior work in incorporating emotion models into artificial agents has demonstrated that building a computational model can help solidify and expand the reach of the psychological model Dias and Paiva [5]. Prior work in developing computational models for emotion has served three purposes: to advance emotion research, to advance artificial intelligence, and to support Human Computer Interaction (HCI) Marsella et al. []. This effort is focused on the first of these. Inserting more concrete computational models into the process of psychological discovery prevents ambiguity that can linger when descriptions of ideas are exclusively in natural

language. The linguistic descriptions of the concepts are important, however, and by enforcing a requirement that they relate to a mathematical description, a comprehensive model is possible.

1.2 Technical Challenges and Approach

Technically, this work must address three key challenges: information extraction, temporal modeling, and model validation. Measuring physiological signals is easy; sensors are available and can record a lot of information, but what part of the signal contains the information of interest? How can we integrate physiological knowledge and psychological models and still allow the data to tell its story? It is clear that these constructs should form a temporally dynamic process, in which the state now impacts the next state even if a stimulus is presented, but current experiments are unable to capture this factor. Finally, once we develop a model, how can we tell that it is working? We are modeling abstract phenomena, for which no true, or baseline, measurement exists. Thorough analysis of the models' performance and parameters will be important for psychological interpretation in order to validate the models' consistency with the existing body of knowledge in psychophysiology. Ultimately, we aim to build computational models that serve as a platform to enable rapid technology transfer of discoveries made in emotion research. If successful, we will have developed a platform for joint engineering and psychological research into interpreting the human mind, in a way that minimizes the time from the lab to the doctor's office. We propose a standard machine learning approach to this problem. We will fit our model to the data by learning a set of parameters and validate it by inferring other quantities. Combining unsupervised methods with graphical modeling techniques will allow for the discovery of novel, abstract

states and relationships without loss of physiologically meaningful and psychologically interpretable model structure. The learned model will include parameters that probabilistically relate emotional states, stimuli, and measured physiology.

1.2.1 Technical Barriers

The key technical barriers to solving this problem are choosing a set of features from the measured signals, modeling the data, and then interpreting the results. From physiological time series a large number of features can be extracted, but which will be the most informative? An additional constraint, in the context of producing transferable results, is that these features should ideally be interpretable. In this work we limit ourselves to a set of previously developed features, and thus this barrier can be overcome with feature selection. Feature selection will reduce the dimensionality of the problem, which gives computational gains, but more importantly it reduces the noise in the dataset and provides room for interpretation on its own. Second, we need to choose a way to model the data. Classifiers have been employed previously, but these ignore the temporal nature of the experiment. We explore two methods to incorporate time into the analysis. To interpret the results, a simple accuracy measure does not sufficiently describe the data or how the model fits, nor does it provide insight as to how to improve the model. We will analyze the results in the context of the full set of assumptions each model makes, as well as properties of the algorithms and how these properties impact the interpretation of the results. We demonstrate that known properties of the algorithms employed can be used to assess the quality of fit of the model to the data in more insightful and conclusive ways than prediction accuracy alone. Many machine learning algorithms are applied as black-box tools in other fields for data analysis, but not always with good interpretations of how the

underlying assumptions apply to the new context. In psychology research, as in all human subjects research, the number of subjects is chosen to provide enough power to see the desired effect under a prescribed statistical test. These computations, however, do not necessarily provide enough data to apply more complex models, as we find in this effort. In many machine learning problems data is becoming readily available, at least in unlabeled forms, so the bounds on the number of samples needed in each stage of model development are not well developed guidelines either. Thus we perform a post-hoc analysis of these issues to accurately gauge the strength of the results obtained, as well as to better inform future research.

1.2.2 Technical Approach

We aim to move toward studying emotion as a dynamic problem by utilizing techniques from tracking, state space modeling, and probabilistic dynamic models. Each of these bodies of work addresses essentially the same problem with slightly differing views; we draw on all three in developing proposed solutions in this area. Mapping the problem of emotion modeling onto this framework allows for numerous novel outcomes in psychology and motivates deeper exploration of several active areas of machine learning and signal processing. This modeling paradigm introduces computational tools that allow for the removal of two key restrictive assumptions common in psychological research: that repeated trials of an experiment are independent and that there is a one-to-one correspondence from stimulus to emotional state to physical measurement. Considering that an individual's emotional state is a continuous process is a natural extension of current work. Natural phenomena are not inherently discrete and independent from one time to the next, nor are they deterministic; thus, exploring Dynamic Bayesian Networks (DBNs) is a good choice for a next step in advancing this area of study. DBNs capture

time dependencies in a probabilistic manner, with probabilistic dependencies among the variables within each time slice as well. Working within an engineering design framework allows for further learning. Constructionist and component-based models have recently gained popularity in psychology research Barrett [9]. These lend themselves naturally to engineering problem-solving strategies: break the problem down and design modular components that solve various aspects or address various stages of the problem. The modular framework also allows for deeper exploration and comparison of models. Capturing the temporal relationships is a primary objective. Many tools exist for this, as tracking states that are only indirectly measurable is a well-developed field in engineering. This problem has taken numerous forms over the years, beginning with communications and control attempting to model and predict random signals. Kalman filtering and state space models have been broadly used in these areas. In machine learning, these are approached through Probabilistic Graphical Models (PGMs). A Hidden Markov Model (HMM) describes, in a discrete probabilistic manner, the relationship in time of a latent variable and the relationship between that variable and an observation variable. These latent, or hidden, variables model unmeasurable quantities and can be constructed to create conditional independencies, which are computationally attractive. By fitting model parameters to define a function directly from stimulus labels or properties to measurements, traditional analyses assume that there is an isomorphic relationship between these properties and mental state. If the subject experiences a different mental state than prescribed by the researcher, it can only be accommodated as noise. We aim to remove this assumption by introducing an abstract intermediate state that relates the stimulus to the measurements and is the only part of the model that has dynamic dependencies. Independent of modeling temporal relationships, machine learning has

developed numerous methods for clustering data: unsupervised learning. This area extends concepts from traditional classification to detecting a pattern without labeled examples and learning the associated statistical model. Clustering methods have been applied to psychological data in fMRI studies, for learning various levels of connectivity Gonzalez-Castillo et al. [], Majeed et al. [], and to features learned in dynamic models Janoos et al. [a].

1.3 Objectives

The fundamental objective of this thesis is to determine how machine learning can advance the study of emotion through peripheral physiology. We specifically focus on the ability of more sophisticated machine learning techniques to provide greater insight into the meaning of data in a single study than traditional statistical analyses. We begin from prior work, which shows promise, but a weak result. We hypothesize that modeling time is important. Initial attempts at improving previous results were unsuccessful, but through deeper analysis we gain a clearer understanding of where the model breaks down and how we can proceed. By opening up the applied statistical methods for deeper analysis and applying additional methods, we demonstrate that though we are unable to confirm the proposed hypothesis, machine learning can provide more conclusive results and guidance as to how to improve the analysis and experiment to produce better results. We frame this into two central claims:

1. Machine Learning and Signal Processing can assist in emotion research and facilitate better understanding.

2. Working on real applications can motivate novel theoretical advances

in machine learning research.

This work serves as an exploratory analysis and assesses the feasibility of applying machine learning tools to researching human emotion with respect to these two claims. We approach this by taking an example of modeling emotion through peripheral physiology. We address each of the key technical challenges by applying existing technical methods with appropriate modification and supplement traditional performance measures with additional analysis to determine their suitability for the problem at hand. We begin with a common framework for addressing several machine learning problems and present all stages of analysis with this notation to enable clear comparisons and relationships among the various stages of analysis. We aim to engineer an end-to-end machine learning framework for the problem that can serve as a foundation for further work in the area by designing more appropriate solutions for each of the key steps. The objective is to adopt an engineering-design-based approach to build a platform for answering questions about the temporal dynamics of emotion as expressed through peripheral physiology. We address three key questions:

1. Which commonly used physiological features are best suited for this problem?

2. How well do existing latent variable models capture the underlying emotional process of interest, and what modifications can be made to these models to better explain experimental data?

3. How can we evaluate models in a setting where data is necessarily incomplete; where even under experimental conditions, the variable of interest remains hidden?

Throughout, we present analysis of the methods used in order to best gauge their limitations and advantages. We provide analysis of how to address

feature selection, discrete and continuous representations for the underlying latent variable, and a comparison of strategies for measuring performance.

1.3.1 Key Contribution

Many machine learning algorithms have the power to advance basic science research, but few are characterized well enough to be applied to new problems, and even fewer are well developed enough to be properly applied by users who are not machine learning experts. In this thesis, we present such a problem area that could be well served by a variety of machine learning techniques, demonstrate in the form of a case study how some of these techniques are applied to the problem, and explore some of the behaviors of these techniques that assist in advancing the understanding of the problem at hand. The key contribution of this thesis is a more conclusive result where ambiguous results had previously only generated new hypotheses in psychophysiological research. This work addresses the modeling of a new type of data with mostly pre-existing machine learning methods. We also provide new insights into the usage of these methods through careful analysis and meta-analysis of the joint methods. The experimental design combined a variety of sensors in a paradigm that is traditionally conducted with a single sensor. The multitude of sensors added new, interesting questions about the interactions among signals. We find that, indeed, the combination of features that is most discriminatory is derived from multiple sensors. Throughout, we present machine learning techniques with a focus on the underlying statistical models to be able to relate and compare results from various machine learning procedures.

1.3.2 Outline

This thesis is organized thematically in order to allow the reader to build up background content first, so that the relationships among the applied methods arise naturally. First, a common framework for machine learning is presented in chapter 2. The format of the data, preprocessing, and exploratory techniques are explained in chapter 3. Some of the exploratory techniques were conducted in a truly exploratory capacity and others in a diagnostic capacity, but we present them all early because several can be viewed as simpler cases of the desired class of models. After presenting the static models as a baseline for comparison and a more detailed illustration of the motivation for dynamic modeling, we explore the dynamic models of interest in chapter 4. Finally, in chapter 5 we evaluate these models by incorporating results from statistical learning theory as context for good performance. We conclude with key findings and future work. Illustrative examples of analyses are presented throughout. Most analyses produced large sets of results that were repetitive; these are included as appendices. Acronyms are defined at their first use in each chapter and linked to a listing at the end of the thesis.

Chapter 2: Machine Learning in an Engineering Framework

Machine learning is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!) Murphy []. In this chapter we introduce general principles and notation conventions of machine learning, including the basics of graphical models and what generative vs. discriminative modeling involves. We also include relevant concepts from control theory and signal processing. This chapter introduces the common components necessary for both the static and dynamic models as well as for the exploratory data analysis methods. This thesis is part of a scientific endeavor to answer questions about how emotion is expressed through peripheral physiology. This chapter positions the work as an engineering problem by providing the technical background, context, and structure for addressing this scientific question.

2.1 Tools in Machine Learning

Uncertainty is an essential part of machine learning; it is what makes finding patterns and making predictions non-trivial. Probability theory provides a natural way to represent uncertainty and thus serves as the foundation for a broad class of machine learning techniques. In this section, we review fundamental tools that will be employed in the modeling and analysis throughout this thesis. We begin with core concepts in probability theory to introduce the notation and formality necessary to build the more complex models used to approach the problem of emotion modeling in this thesis. Modeling the data is only the first step; we also want to be able to act on the patterns discovered, and to accomplish this objective we employ decision theory. We review the basic components of how to make decisions under uncertainty to develop the terminology necessary for the remainder of the thesis. Finally, we introduce a common tool for describing models in machine learning, Probabilistic Graphical Models (PGMs).

2.1.1 Probability Background

In order to do machine learning and statistical analysis of data, we model the quantities involved as random variables. This choice may be viewed as a consequence of uncertainty caused by fundamental limitations or of noisy measurements, but either way the mathematical representation is the same. For continuous quantities, we specify a probability density function (PDF) to define the distribution, and for discrete quantities we specify a probability mass function (PMF). Many algorithms rely upon maximum likelihood estimation (MLE) to estimate values of the parameters of the statistical model that fit the observed data. The likelihood function for a random variable is a function of the parameters of a statistical model given a set of observations, as in Equation 2.1, which takes on the value of the

probability of observing the given data for a value of the parameters. MLE maximizes this function to find an estimate of the parameters.

L(θ | X) = Pr(X; θ)    (2.1)

Alternatively, we may consider the parameters to be random variables, as in Bayesian statistics; then the objective is to maximize the probability of the parameters given the observed data. This technique is called maximum a posteriori (MAP) estimation and involves maximizing the posterior as defined in Equation 2.2. In this work we don't apply MAP, but we will consider some models that use the posterior distribution.

Pr(θ | X) = Pr(θ) Pr(X | θ) / Pr(X)    (2.2)

The distribution of the parameter, θ, is called the prior. This introduces two key concepts for which formal definitions will be useful throughout. A statistical model describes the properties of a random variable, and an estimator is a function that generates an estimate.

Definition 2.1. A statistical model, M, is a collection of probability distributions or density functions over a set of random variables D_obs. In a parametric model, each item of the collection can be indexed by a unique finite-dimensional parameter vector, θ; in a non-parametric model the parameter vector is infinite dimensional Stark and Woods [].

Definition 2.2. An estimator is a function of a sample of observations of a random variable that estimates the parameters of the distribution, but is not a function of the parameters Stark and Woods [].
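To make the maximum likelihood idea concrete, the following is a minimal Python sketch (not from the thesis) that estimates the mean and variance of a Gaussian both in closed form and by numerically maximizing the likelihood of Equation 2.1; the use of numpy and scipy and the specific parameter values are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)   # observed sample

# Closed-form MLE for a Gaussian: sample mean and (biased) sample variance.
mu_mle = X.mean()
var_mle = X.var()

# Equivalent numerical route: minimize the negative log-likelihood -log L(theta | X).
def neg_log_likelihood(theta):
    mu, log_sigma = theta               # parameterize sigma on the log scale to keep it positive
    return -norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"closed form: mu={mu_mle:.3f}, var={var_mle:.3f}")
print(f"numerical  : mu={mu_hat:.3f}, var={sigma_hat**2:.3f}")

Both routes should agree up to optimizer tolerance, which is the sense in which MLE can be read either as a formula or as an optimization problem.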

2.1.2 Decision Theory

Probability theory is used to represent the uncertainty in the data and to compute and update the belief state of the model. In order to perform actions, which is the goal, we need decision theory. There are two forms of decision theory, as there are two general forms of statistics: Bayesian and frequentist. In Bayesian methods we assume the parameters of a model are random variables as well and describe them with prior probabilities. We integrate over the prior to obtain solutions. In frequentist statistics, we assume that the parameters have some true but unknown value and we are trying to uncover that value from a finite sample of data. In both cases decision theory involves a loss function. However, without a prior, as in frequentist methods, we can only compute this empirically. In either case we make a decision by minimizing the loss. This step may involve inferring the value of a variable under multiple conditions and then comparing the loss, or the method may construct a loss function another way. This makes numerical optimization an important part of machine learning. In this work we don't explore deeply the effects of different optimization techniques, but they too can change the performance of a model. Through a prior or regularization, it is sometimes of value to create a sparse solution. Adding an l1 or l2 penalty term on the parameter over which we're optimizing creates this effect, which can also be interpreted as a specific type of prior.
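As a hedged illustration of that last point (not an analysis from the thesis), the short scikit-learn sketch below fits the same synthetic regression problem with an l1 penalty (Lasso) and an l2 penalty (Ridge); the data and penalty strengths are arbitrary choices, but the l1 fit typically drives most coefficients exactly to zero while the l2 fit only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]            # only 3 of 20 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)        # l1 penalty -> sparse coefficients
ridge = Ridge(alpha=0.1).fit(X, y)        # l2 penalty -> small but dense coefficients

print("nonzero coefficients, l1:", np.sum(np.abs(lasso.coef_) > 1e-6))
print("nonzero coefficients, l2:", np.sum(np.abs(ridge.coef_) > 1e-6))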

2.1.3 Probabilistic Graphical Models

PGMs are compact visual representations of the joint distribution among a group of random variables. By connecting a set of nodes with edges, we encode conditional dependencies among random variables in an easy-to-interpret manner. Nodes are represented by shapes in the diagram and edges are lines or arrows connecting them. In chapter 4, we'll use PGMs for each proposed model. Using a PGM allows the modeler to easily relate the conditional dependencies and see common structure with solved problems. Learning and inference in graphical models are well studied problems and numerous efficient algorithmic solutions exist; therefore, by using a PGM to describe a proposed model, the modeler gains easy access to existing methods where appropriate Murphy []. In this thesis we follow a common PGM notation to distinguish among various types of random variables by using different node types, as shown in Figure 2.1. Discrete random variables are depicted with squares and continuous random variables are modeled with circles. A shaded node indicates an observed variable and, conversely, an open node represents a latent variable. Arrows between nodes do not indicate causation, just the direction of the defined probabilistic relationship. A set of probabilistic equations always accompanies a PGM to specify what the distributions are; the PGM only visualizes which relationships are specified. For models with repeating structure, we will use what is called a 2-DBN. Assuming the index of the repetition is time, we'll show two time slices with both the dependencies within and between time steps denoted, though the model may include many more time slices. For Markov models, where the time dependence is only one time step, this is sufficient to describe the whole graph. We will only consider Markov models, but this isn't very restrictive; it just requires careful definition of the state to include sufficient information. In models that include fixed parameters or hyperparameters, these constants are denoted by a dot for a node with an arrow to the variable that uses that parameter. To create a compact form, plates are used to group together variables that repeat together, with the variable that indicates the number of repeated values in the bottom right corner.
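A two-node graph like those in Figure 2.1 below can be read generatively: sample the latent A from p(A), then sample the observation B from p(B | A). The following minimal sketch (not from the thesis) performs that ancestral sampling for a discrete A and continuous B; the specific distributions and parameter values are invented for illustration.

import numpy as np

rng = np.random.default_rng(2)

# p(A): a latent discrete state with three values (e.g., an abstract hidden state)
p_A = np.array([0.5, 0.3, 0.2])

# p(B | A): a Gaussian observation whose mean and spread depend on the latent state
means = np.array([-1.0, 0.0, 2.0])
stds = np.array([0.5, 1.0, 0.8])

def sample_joint(n):
    """Draw n samples from p(A, B) = p(B | A) p(A) by ancestral sampling."""
    A = rng.choice(len(p_A), size=n, p=p_A)          # latent variable (open node)
    B = rng.normal(loc=means[A], scale=stds[A])      # observed variable (shaded node)
    return A, B

A, B = sample_joint(1000)
for a in range(len(p_A)):
    print(f"state {a}: frequency {np.mean(A == a):.2f}, mean of B {B[A == a].mean():+.2f}")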

(a) Continuous   (b) Discrete
Figure 2.1: Examples of notation for PGMs. In (a) and (b), the observed random variable, B, is conditioned on the latent random variable, A. The joint distribution for both of these graphs would be p(A, B) = p(B | A)p(A).

2.2 Designing a Machine Learning Solution

In applying machine learning to solve a problem, the modeler must make several modeling decisions. In this section we review some of the key decisions to be made and provide a common framework for the machine learning problems we will address. We begin by adopting an engineering view of machine learning. Engineering takes a process- and problem-oriented view to solving real problems using a well-established engineering design process.

2.2.1 Components of a Learning Problem

As the field of machine learning was defined at the start of this chapter, there are two main subproblems: detect a pattern and then decide or predict. Both must be solved under uncertainty. We capture this uncertainty through a statistical model, and thus these two steps can often be reduced to estimation and inference. We introduce the following notation for the structure of a machine learning problem and related algorithms. A machine learning problem consists of three components: data, a statistical model, and a hypothesis space. We fit the statistical model to the data through estimation, and we perform inference and apply decision theory

in the hypothesis space to reach our objective. In this thesis, we address our scientific question of interest by decomposing it into several machine learning problems.

Table 2.1: Structure of Machine Learning problems explored in this work. X_f is the feature values and X_l is the labels; θ is used to represent the parameters of the statistical model in each case. X_o and X_õ represent observed and unobserved data, respectively. I(A, B) represents mutual information between A and B as per one of the definitions in chapter 3.

  Supervised:
    Data: (X_f, X_l)
    Model: p(X_f | X_l; θ) or p(X_l | X_f; θ)
    Hypothesis: X_f -> X_l
    Estimation: θ of the statistical model
    Task: X_l for new X_f
    Examples: Classification, Regression

  Dimensionality Reduction:
    Data: (X_f, X_l)
    Model: I(X_f θ, X_l) = I(X_f, X_l)
    Hypothesis: X -> Xθ
    Estimation: θ, a projection matrix to a lower dimension
    Task: exploration, preprocessing
    Examples: PCA, Feature Selection

  Unsupervised:
    Data: X
    Model: p(X_o, X_õ; θ)
    Hypothesis: X_o -> X_õ
    Estimation: (θ, X_õ)
    Task: explanation, exploration, predict X_õ
    Examples: Clustering, Dynamic Modeling

For many machine learning problems, we partition the complete data provided into one portion that is referred to as the data, or features, X_f, and one that is the labels, X_l. We'll refer to the combination of the data and the labels as the complete dataset. In supervised learning we will work under the premise that there is some functional, though noisy, relationship between the data and the labels. We prescribe a statistical model for the conditional distribution between them. In feature selection we assume models that say

that only some portion of the data is related to the measurements, and we want to uncover that subset. In unsupervised learning, we assume that some of the data is missing; that the probabilistic description of the observed data, X_o, requires another variable that is unobserved, or latent, X_õ. The data that we're provided need not be partitioned in this case. Ultimately, at any point in this work, our goal is to use data to complete some task, as summarized in Table 2.1. We accomplish this task by employing a statistical model that represents either known or hypothesized structure in the data. Generally, some aspects of the model or hypothesis are unknown, or too complex to be specified from knowledge, so we employ a learning algorithm to complete that description. We can describe the task by a functional mapping; this is a very flexible representation. For example, a classifier maps from data to labels. In supervised learning the hypothesis space is a mapping over the data-space (from samples to labels); in unsupervised learning, the mapping covers the full model, since we learn everything. These concepts will be made more formal in the following chapters, but this general framework for a machine learning problem will provide a consistent notation and terminology to be used throughout. This description also enables direct comparisons among results from feature selection, classification, and clustering, since they can all be described in a single notation. With the complete model we can apply some task algorithm to make a decision, provide insight, or complete the dataset. The subsequent chapters will discuss a variety of machine learning models. A machine learning model is a complete toolbox for solving a problem as defined at the beginning of the chapter, and we will use the name for the whole solution. In calling something a machine learning model, or model, we refer jointly to the underlying statistical model and an associated algorithm. This includes the specified underlying family of statistical models as well as the relationships among the various random variables. A machine learning

model may also refer to the algorithm used to solve the problem. Many models rely on the same underlying statistical model, but make different additional assumptions and have different names as machine learning models, as we will see. These assumptions change the way in which we solve the problem, but the shared statistical model allows for comparisons of the models and procedures. If referencing the statistical model specifically, that will be stated. Many learning algorithms, including the ones we will apply, can be viewed as estimators, though we call each a learning algorithm because it may require some inference steps in order to solve its estimation task. The view as an estimator is important for thinking about how to analyze performance and the quality of the result. The core function of the task algorithm is to make probabilistic inferences about the data provided to it, and generally to make some decisions based on comparisons among these inferences. Frequently, the problem is solved with a task in mind, so the complete underlying statistical model is not important, only the output of the task; portions of the task algorithm may therefore be combined with the learning algorithm to compute only the necessary quantities. This can make the task algorithm more computationally efficient. Finally, we measure the performance through a cost function. This cost function may have also been used in the training, but we evaluate on a different set of data this time. Evaluating the performance of an algorithm can be difficult in and of itself. Most commonly, the performance measure is accuracy for a prediction task. For a supervised learning task, where the algorithm is provided with labeled examples for training, we can evaluate performance on a portion of the data not used in training. For unsupervised learning, in algorithm design we often have some labeled examples for testing and evaluation, even if the true data set will not include that information. In many machine learning problems we have three stages in which we need to use the data: learning

the parameters, tuning the model, and evaluating the performance of the algorithm or the fit of the statistical model. A model, or the algorithm used to solve it, may have hyperparameters, such as a model order or a convergence tolerance, that need to be tuned. To accomplish this, we further partition the data into three sets: training, validation, and test. We use the validation set to compute a score for each setting of the hyperparameter that needs to be considered, then we report the final performance based on the test set. Often we do not have enough data to obtain reliable estimators or scores on three-way partitions of the data. In this case we use cross validation; the most popular form is k-fold. This involves partitioning the data into k sections, training on k - 1 of them and testing on the remaining section. This is repeated for each of the k folds and the score is averaged. Leave-one-out (LOO) cross validation is the extreme case where k = N, with N being the number of samples. This process is represented algorithmically in Experiment 2.1. For example, if classification is the objective, we may adopt a model in which the data comes from two Gaussian distributions with different means and the same covariance. Then, the learning algorithm would estimate this covariance and the two means. The inference or decision algorithm would compute the likelihood of the test sample belonging to each class, compare them, and assign the class with the higher likelihood. If the parameter values are not important in a given application, we can directly compute a threshold hyperplane in the data space and compare the test sample against it directly.
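To make that classification example concrete, here is a minimal numpy sketch (not from the thesis) of the two-Gaussian, shared-covariance classifier just described: the learning step estimates the two means and a pooled covariance, and the decision step assigns a test sample to the class with the higher likelihood. The synthetic data and helper names are invented for illustration.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
cov_true = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([0, 0], cov_true, size=50)   # class 0 training data
X1 = rng.multivariate_normal([2, 1], cov_true, size=50)   # class 1 training data

# Learning: class means and a pooled (shared) covariance estimate.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
pooled_cov = (np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)) / (len(X0) + len(X1) - 2)

def classify(x):
    """Decision step: compare class-conditional likelihoods under the shared covariance."""
    l0 = multivariate_normal.pdf(x, mean=mu0, cov=pooled_cov)
    l1 = multivariate_normal.pdf(x, mean=mu1, cov=pooled_cov)
    return int(l1 > l0)

print(classify([0.1, -0.2]), classify([2.2, 1.1]))   # expected: 0, then 1

With equal class priors, comparing these two likelihoods is equivalent to a linear decision boundary, which is the threshold-hyperplane shortcut mentioned above.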

2.2.2 Generative vs. Discriminative

In machine learning, models and their associated learning algorithms can be classified as generative or discriminative. For a given family of statistical models, there can be both a generative and a discriminative machine learning model. The main difference is in how the relationship among the variables is defined. Taking a classifier as an example, a generative model defines the joint probability, while a discriminative model defines only the posterior. A classifier always uses the posterior to decide the class label assignment, but in a generative classifier the posterior is computed as a function of the class conditional, p(y | z = i) for data y and labels z, instead of directly specifying and fitting the posterior p(z = i | y) Murphy []. The main advantage of a generative model is that by specifying the full joint probability distribution, and thus a hypothesis for the generative process of the data, once the model is learned, synthetic data can be generated. Generative models are also usually easy to fit. For example, given a joint statistical assumption of a mixture of Gaussians, the generative model, Linear Discriminant Analysis (LDA) (discussed in chapter 4), is fit with simple counting and averaging, while the discriminative counterpart, logistic regression, requires solving a convex optimization problem Murphy []. A major advantage of discriminative models is that they allow preprocessing the data in arbitrary ways Murphy []; however, in this application, this preprocessing freedom only makes it harder to understand the model. Discriminative models are often shown to have better predictive performance and even lower asymptotic error. However, a generative model may converge to its asymptotic error with fewer samples, which, even for a higher error, means that with a small sample size it will outperform its discriminative counterpart Jordan and Ng []. In this work we want to understand the data generation process. The objective is to build computational models for emotion that are interpretable and help us understand how emotion works, so a generative model is better suited. This allows us to specify a model that describes how the variables interact and represents a hypothesis about the underlying mechanisms. The model fit then allows for inference about the quality of that hypothesis. A discriminative model does not fully describe the underlying process, only the a posteriori relationships between measured data and a set of labels, as described by the posterior, so this type of model is not appropriate.
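As a hedged, minimal illustration of the sample-size point (not an experiment from the thesis), the sketch below fits scikit-learn's LDA (generative) and logistic regression (discriminative) on the same synthetic dataset at several training-set sizes; the data match the shared-covariance Gaussian assumption, so the generative model often reaches reasonable accuracy with fewer samples, consistent with the Jordan and Ng argument cited above.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def make_data(n):
    # Two Gaussian classes with shared covariance, matching the LDA assumption.
    X0 = rng.normal(loc=[0, 0], scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=[1.5, 1.0], scale=1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_test, y_test = make_data(1000)
for n_train in [5, 20, 200]:
    X_tr, y_tr = make_data(n_train)
    lda_acc = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_test, y_test)
    lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_test, y_test)
    print(f"n per class = {n_train:3d}: LDA {lda_acc:.3f}, logistic regression {lr_acc:.3f}")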

2.2.3 Parametric vs. Nonparametric

As defined in Definition 2.1, a nonparametric model is one that is indexed by an infinite-dimensional parameter. In this work we use non-parametric methods in data exploration to be more flexible and then apply parametric models in the design process. The clustering addressed in chapter 3 is a nonparametric statistical model in the sense that the number of parameters varies with data size; this is not the traditional way in which nonparametric clustering is defined in machine learning. The name non-parametric can lead to some confusion. Non-parametric models do, in fact, have parameters, but there can be infinitely many. A more concrete explanation for the concept of a potentially infinite number of parameters is a number of parameters that grows with the amount of data. This includes models with a latent, or unobserved, quantity. Non-parametric models frequently have hyperparameters. For example, a Kernel Density Estimate (KDE) approximates a distribution from points by the sum of kernel distributions placed at locations prescribed by the samples. This is a nonparametric model because it specifies the shape of the whole distribution as a function of the samples directly, not of statistics of the samples, causing the number of parameters to grow with the number of samples. Additionally, a KDE has hyperparameters determined by the parameters of the kernel distribution. In machine learning the most common nonparametric methods are also Bayesian; they use a prior probability for each of the parameters. This creates a very flexible model, but requires a choice of a good prior.
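A minimal KDE sketch (not from the thesis) using scipy follows; it illustrates how the estimate is built directly from the samples, with the kernel bandwidth acting as the hyperparameter. The bimodal data and the bandwidth value are arbitrary illustrative choices.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
# Samples from a bimodal distribution the KDE should recover.
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

# One kernel location per sample; the bandwidth is the hyperparameter.
kde_default = gaussian_kde(samples)                     # bandwidth via Scott's rule
kde_narrow = gaussian_kde(samples, bw_method=0.1)       # manually chosen, narrower kernels

grid = np.linspace(-4, 4, 9)
print("default bandwidth:", np.round(kde_default(grid), 3))
print("narrow bandwidth :", np.round(kde_narrow(grid), 3))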

2.2.4 Bayesian vs. Frequentist

Bayesian modeling involves assuming that all quantities, including parameters, are random variables, through the application of priors. This can be useful for many reasons. In the absence of knowledge to set a good prior, or of the choice to use a conjugate prior to make the analytical solution work out nicely, one can adopt a uniform prior. In the case of a uniform prior, the MAP solution is equivalent to the maximum likelihood solution. In many cases we apply maximum likelihood solutions, but in subsequent analysis we acknowledge that the solutions and parameters are also random variables because we're working with finite samples of data. We present derivations of the methods applied built on a unifying probabilistic foundation. Many of these derivations are traditionally presented in the context of Bayesian methods; however, we select a noninformative prior and solve for point estimates using maximum likelihood. In some cases, such as the classifiers applied in chapter 4, our class prior is designed to be uniform; we have equal samples of each class. Statistical methods applied in psychology research, and thus our baseline and context for comparing and interpreting results, are based on frequentist statistics. We will explain and relate some of these methods throughout. Some of the pitfalls we faced in the analysis can be attributed to the typically presented flaws of frequentist statistics, providing greater context for the consistency of these findings.
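As a small, hedged illustration of the uniform-prior remark (not from the thesis), consider estimating the probability of heads from coin flips: with a uniform Beta(1, 1) prior the posterior mode (the MAP estimate) coincides with the maximum likelihood estimate k/n, while an informative prior pulls the estimate away from it.

import numpy as np

rng = np.random.default_rng(6)
flips = rng.random(40) < 0.7          # 40 Bernoulli trials with true p = 0.7
k, n = flips.sum(), flips.size

# MLE for a Bernoulli parameter.
p_mle = k / n

# Beta(a, b) prior gives a Beta(k + a, n - k + b) posterior with mode (k + a - 1) / (n + a + b - 2).
a, b = 1.0, 1.0                        # uniform prior: mode reduces to k / n
p_map_uniform = (k + a - 1) / (n + a + b - 2)

a, b = 10.0, 10.0                      # informative prior centered at 0.5
p_map_informative = (k + a - 1) / (n + a + b - 2)

print(f"MLE {p_mle:.3f}, MAP (uniform) {p_map_uniform:.3f}, MAP (Beta(10,10)) {p_map_informative:.3f}")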

The ultimate objective of this work is to develop a model for the physiological response to emotional stimuli. The goal is to gain understanding, hence generative models are preferred over discriminative approaches. There are inherently latent variables, thus we require latent variable modeling techniques. The desired model should have psychologically and/or physiologically meaningful parameters. The work at this stage is primarily exploratory and will include comparing alternative hypotheses.

Hypothesis 2. A latent state mediates between the stimulus and the measured physiological response. The stimulus generates a response that is not in an isomorphic relationship with the stimulus.

Prior analyses showed that various stimuli generate statistically significantly different mean responses. However, classification efforts in these same works show a weak relationship. There are alternative ways to cluster the physiological responses which lead to greater separability, but these do not have a one-to-one relationship with the category labels.

Hypothesis 3. Time is important to accurately model physiological correlates of emotion. Even with time between stimulus presentations, trials are not statistically independent, and a model that accommodates time will be more successful in describing the data.

Given generative models, we can compare different parameter sets, or different solutions, of the same model structure using the likelihood. We can also apply various model selection criteria to choose the best among these and evaluate the performance of those criteria. Since we ultimately want to consider generative approaches, we do not need the flexibility of nonparametric models for our primary modeling task; manually optimizing over the number of latent states will suffice. A generative, parametric approach will allow for comparison and testing of multiple working

hypotheses. We want a model that is robust, and generalization is especially important, so the model should be stable to changes in the data.

Experiment . Training, Validation, Test process for data D and a cost function C
for all hyper-parameter settings i do
    Learn model parameters θ_i from training data D_train
    Compute the score on the validation data C(θ_i, D_val)
end for
Choose the optimal parameters θ_opt = arg min_i C(θ_i, D_val)
return the score C(θ_opt, D_test)
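A minimal sketch of the procedure in Experiment . is given below, assuming Python; fit_model and cost are hypothetical stand-ins for a model-specific training routine and the cost function C, not routines defined in this thesis, so any model and cost pair could be substituted.

import numpy as np

def select_and_score(hyperparams, fit_model, cost, D_train, D_val, D_test):
    """Fit one model per hyper-parameter setting, pick the best on the
    validation set, and report its cost on the held-out test set."""
    fitted = [fit_model(h, D_train) for h in hyperparams]   # learn theta_i
    val_costs = [cost(theta, D_val) for theta in fitted]    # C(theta_i, D_val)
    best = int(np.argmin(val_costs))                        # arg min_i
    return cost(fitted[best], D_test)                       # C(theta_opt, D_test)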

Chapter

Data Preparation and Exploratory Analysis

In this chapter we describe the data and the techniques used to understand its structure. This chapter uses more nonparametric, flexible methods than our primary objective would imply, but these methods provide an understanding of the data that assists in modeling and in evaluating the performance of the models and algorithms applied later. The preprocessing described here includes both methods that are standard in psychology research and feature selection, which is necessary here because we have measured more signals than most work in this area. The preliminary methods applied here build upon the background information in chapter .

. Data Collection

In this thesis we take as an example a standard psychological paradigm and present new analysis methodologies. In this section we describe the experimental procedure and the portion of the preprocessing that is shared across

all of the methods presented throughout the rest of the thesis. The novelty in the experimental design lies in the number of measurements collected, all time-synchronized. Most psychophysiological experiments use a small subset of these measurements Cacioppo et al. [], Kreibig []. The hope here is that by using a larger number of features, we will be able to exploit more of the response from the signals. From these signals we extract a set of features typically used in emotion research.

.. Experimental Setup

In the current experiment, participants (ages 9-55, 5% male and 7% female) were each presented with two sequences of emotionally evocative stimuli: one of sounds Bradley et al. [7] and one of images Lang et al. [999]. The same sequence of stimuli was used for each subject. After each stimulus presentation, the subject indicated the degree to which they experienced each of 5 discrete emotions (fear, disgust, amusement, sadness, and anger) on a scale of 1 to 7. Stimuli were chosen to represent these 5 discrete emotions, as well as neutral stimuli, from standard databases used for emotion research. During the stimulus presentations, the following physiological signals were continuously recorded: Electrocardiogram (ECG), Pupil Diameter (via eye-tracker), Skin Conductance, Respiration, Finger Pulse, and Activity (gross body movement).

Figure . shows the timeline of the experiment. An orienting response, the slight shock of adjusting to a new environment, is expected during the presentation of the first stimulus, so the first stimulus was dropped from all analyses. Due to a problem during data collection, data for the last stimulus were truncated and some features could not be computed; therefore the last stimulus period was also dropped from all analyses. All analyses were conducted with the remaining stimulus responses per subject. After removing participants with incomplete data, 7 subjects had complete data sets for the image

Figure .: Timeline of sensor data and stimuli. The top plot shows the time each sensor was recording; the bottom shows the validity of each sensor. For the subject shown, there were no sensor failures. Sound and image stimuli were selected from standard databases. Initial analyses showed weak results for the mental imagery portion of the experiment, therefore that portion of the data was not utilized in any of the presented analyses.

stimuli and 6 subjects had complete data sets for the sound stimuli.

Data Name | Quantity | Data range
Measurements | value per feature, per subject, per time instance | Continuous, varies per sensor
Self report | value per category, per subject, per time | 1-7 Likert
Categorical stimulus labels | value per time, shared across subjects | Happy, Sad, Anger, Fear, Disgust, Neutral
Valence Rating | value per time, shared across subjects | Continuous
Arousal Rating | value per time, shared across subjects | Continuous

Table .: Summary of available data used throughout the study

.. Feature Extraction

Features were extracted from the physiological waveforms using CPSLAB (Scientific Assessment Technologies, Salt Lake City, UT). The extracted feature set included the features defined in Table .. All of the ECG features were derived from the inter-beat interval (IBI), the time between consecutive peaks. The ECG and Finger Pulse features were extracted from fixed-length windows, in seconds, beginning at stimulus onset.

The feature values used for modeling and comparisons were within-subject z-scores. These were computed to allow for meaningful comparisons across features and subjects. This also provides a normalized dataset

Table .: Feature List

Sensor | FeatName | Feature Description
EyeTracker | LPAFR | left pupil area to full recovery
EyeTracker | LPAMP | left pupil peak amplitude diameter
Activity | AcLEN | activity length
Abdominal respiration | AOLEN | abdominal respiration length
ECG | IIAFR | IBI area to full recovery
ECG | IIAMP | IBI peak amplitude
ECG | IIFRT | IBI full recovery time
ECG | IIHRT | IBI half recovery time
ECG | IILEN | IBI interval length
ECG | IILEV | IBI interval level
ECG | IISTD | IBI standard deviation
Finger Pulse | FPAFR | area to full recovery
Finger Pulse | FPAMP | peak amplitude
Finger Pulse | FPLEN | line length
Finger Pulse | FPRT | rise time to first low point recovery
Finger Pulse | FPSTD | standard deviation
Skin Conductance | SCAFR | area to full recovery
Skin Conductance | SCAMP | peak amplitude
Skin Conductance | SCLEV | level
Skin Conductance | SCRR | rise rate from first low point
Skin Conductance | SCRTO | rise time from response onset
Skin Conductance | SCSTD | standard deviation

across features, which is important to ensure proper numerical behavior when estimating Gaussian parameters Dy and Brodley []. For each subject s = 1, ..., S and each feature f = 1, ..., F, a sample mean and sample standard deviation are computed, µ_{s,f} and σ_{s,f}. The raw feature values are then converted to z-scores to normalize them within each subject as in Equation .:

y_{s,t,f} = ( y^raw_{s,t,f} − µ_{s,f} ) / σ_{s,f}    (.)

These preprocessing steps are consistent with the psychology literature.

. Exploratory Analysis

Exploratory analysis is an important part of solving a machine learning problem. In approaching this project, initial exploratory analysis was conducted as part of a prior study where analysis was restricted to traditional statistical testing and classifiers applied after averaging in time. In this section we recap the traditional statistical test used and then show some of the techniques that were used to understand the results obtained by the proposed model. These techniques aim to help understand what types of structure we should design a model to discover. These results explain away poorer than desired performance in the proposed dynamic modeling and serve as guidance as to what other types of models may be good candidates for future analyses.

.. Traditional Statistical Analysis

This deeper exploration of the experimental data was motivated by a preliminary statistical analysis of the data that showed the categorical stimulus labels have a significant effect on the measured signals. This analysis was

conducted using the Analysis of Variance (ANOVA) test, which allows for statements about the means of classes under the assumption that all of the classes are normally distributed. The results show that the means are significantly different, but it is important to note that this test does not measure the separability of the classes. The appropriate conclusion is that these stimuli change the physiological response in a meaningful way; this does not imply the converse, that from the physiology we can predict the stimulus class.

ANOVA is a commonly used statistical test, but in order to understand it, some context on the general procedure for designing and evaluating a statistical test is useful. Statistical hypothesis testing is a form of inference in which one tests the probabilistic validity of a statement about a population parameter of the data. Specifically, these statements are made in the form of membership of the parameter in a subset of the parameter space or not, i.e. θ ∈ Θ_0 or θ ∉ Θ_0, with the former called the null hypothesis, H_0. A hypothesis test is a rule that specifies for which sample values H_0 can be rejected. Tests are specified with respect to a test statistic that is designed around the assumptions of the null hypothesis. Tests can be designed and evaluated in a variety of ways, including likelihood ratios, Bayesian methods, and the union-intersection method. A detailed review of all of these methods can be found in Casella and Berger [99]. We will look only at the likelihood ratio method of designing a test statistic. This method uses the ratio of the supremum of the likelihood over the Θ_0 region to that over the whole parameter space. This is equivalent to the ratio of the maximum likelihood estimate (MLE) over the null hypothesis subset of the parameter space, θ̂_0, and the MLE over the whole parameter space, θ̂. The test rejects the null hypothesis if this ratio is too low.

T(x) = L(θ̂_0 | x) / L(θ̂ | x)    (.)

Once a test statistic is defined, the rejection rule must be defined. For likelihood ratio tests we reject when the ratio is below a threshold value, which we must choose. A typical way to do this is to choose the threshold by modeling the distribution of the test statistic and selecting the threshold to limit the probability of the data falling in the rejection region when the parameter is in Θ_0.

The statistical model assumed for ANOVA is that each independent and identically distributed (i.i.d.) sample y_{i,j}, the j-th sample in the i-th class, is drawn from a class-specific normal distribution. These distributions have unique means but a shared covariance. So, in designing the test, the parameter of interest θ is the population mean µ. For K classes, this is as shown in (.a). The null hypothesis, (.b), is that all the class means are equal.

y_{i,j} ∼ N(µ_i, σ²)    (.a)
H_0: µ_1 = · · · = µ_K    (.b)

In terms of experimental design, the model is usually described as each measurement being the sum of a population mean (µ), a treatment effect due to the experimental condition (τ_i), and an error or noise component (η). This is equivalent to the above, but is parametrized as follows:

y_{i,j} = µ + τ_i + η    (.)
H_0: τ_1 = · · · = τ_K = 0    (.5)

We use a likelihood ratio test as our test statistic. We compare the likelihood of the samples being drawn from a single distribution (Equation .6a)

and the likelihood of them being drawn from K unique means (Equation .6b). The data must be standardized, so that the shared covariance matrix is the identity matrix, Σ = I, prior to computation. Taking the log ratio of the two likelihoods, the resulting quantity is a monotone function of the ratio of the between-class to the within-class scatter, so the test can equivalently be based on that ratio (.6c); because log is an increasing function, the outcome of the test is unchanged. Assuming N total samples:

L(µ | y) = ∏_{j=1}^{N} N(y_j | µ, Σ)    (.6a)

L([µ_i] | y) = ∏_{i=1}^{K} ∏_{j=1}^{N_i} N(y_{i,j} | µ_i, Σ)    (.6b)

[ ∑_{i=1}^{K} (µ̂_i − µ̂)^T Σ^{−1} (µ̂_i − µ̂) ] / [ ∑_{i=1}^{K} ∑_{j=1}^{N_i} (y_{i,j} − µ̂_i)^T Σ^{−1} (y_{i,j} − µ̂_i) ]    (.6c)

We need to know how this statistic is distributed in order to probabilistically specify the rejection region and complete the test. The components of the ratio, the two scatter terms, are sums of squares. Since the samples are assumed to be normally distributed, removing their respective means still results in normal distributions:

y_{i,j} − µ_i ∼ N(0, Σ)    (.7a)

µ̂_i − µ ∼ N(0, Σ)    (.7b)

S_B = ∑_{i=1}^{K} (µ̂_i − µ)^T (µ̂_i − µ) / σ² ∼ χ²(K − 1)    (.7c)

S_W = ∑_{i=1}^{K} ∑_{j=1}^{N_i} (y_{i,j} − µ_i)^T (y_{i,j} − µ_i) / σ² ∼ χ²(N − K)    (.7d)

The latter two follow from the definition of the χ²(d) distribution, that is, that it is the distribution of a sum of squares of zero-mean normally distributed variables Johnson and Wichern [998]. The F(d_1, d_2) distribution (Equation .8) is defined as the ratio of two χ²(d)-distributed

random variables normalized (and parametrized) by their respective degrees of freedom, d_1 and d_2, which matches our likelihood ratio, as desired.

F_H = ( S_B / (K − 1) ) / ( S_W / (N − K) )    (.8)

We can now use the CDF of F(K − 1, N − K) to compute the confidence that the means are not equivalent. As the goal is to reject the null hypothesis, we desire a very low probability of observing a value as large as the computed F_H. To set a rejection rule, we must find c so that Pr(F_H ≥ c) ≤ α. As an alternative to a binary accept-or-reject test, we may instead examine the probability Pr(F ≥ F_H) Casella and Berger [99]. We can then decide to accept or reject by comparing this value, called the p-value, to a desired α, but we get a more informative score than just accept or reject. The p-value of a statistical test is the probability, under the null hypothesis, of a value of the statistic at least as extreme as the one observed. A low p-value indicates strong evidence against the null hypothesis.

In this data many individual features have low p-values for a univariate ANOVA test, indicating that these features are very likely capturing a portion of the physiological response to the stimuli. This is shown in Figure . using a box plot. Box plots are frequently used in exploratory data analysis and for visualizing the results of an ANOVA test. They indicate order statistics: the median at the horizontal bar, the box marking the 25th and 75th percentiles, error bars marking the 5th and 95th percentiles, and plus signs indicating outliers. The notch in the box indicates the confidence interval for the class mean; if the notches of any pair of classes do not overlap, those classes are significantly different at the p = .05 level.
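For illustration, a univariate one-way ANOVA like the one described above can be reproduced with standard scientific Python tooling; the sketch below uses synthetic per-class samples in place of the z-scored features, so the printed values are not results from this dataset.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
class_means = {"anger": 0.2, "fear": 0.0, "neutral": -0.1}      # hypothetical means
groups = [rng.normal(mu, 1.0, size=30) for mu in class_means.values()]

# F is the ratio of between- to within-class variability; p is Pr(F >= F_H) under H_0.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")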

Figure .: Box plots of a single feature (FPRT) with a significant p-value in the ANOVA test, by class (Anger, Disgust, Fear, Happiness, Neutral, Sadness). Note that some pairs are significantly different while others are not. This also shows clearly that these classes will not be separable along the direction of this feature, though the classes do have a significant effect on this measure.

Figure .: The data in all features projected onto the first two principal components for the sound stimuli.

. Visualization

Visualization is the most intuitive type of exploratory data analysis. Viewing the data can guide the modeler in choosing the right statistical model and set expectations for how informative the data will be. Since our problem involves high-dimensional data, it is useful to view both individual features and various combinations of features. A commonly used view for high-dimensional data is to plot the data projected onto the first two or three principal components, as in Figure . Jollife [5]. These directions are found using Principal Component Analysis (PCA), which identifies the directions of the data that display the largest variance.

We can also look at individual features. In this effort some features are

Figure .: The sample distribution of a single feature (FPRT): time to first return to baseline of the finger pulse for sound stimuli. The main section shows a KDE of the various class distributions; the insets show histograms for the various classes.

shown to have more differentiation among classes than others. It is most intuitive to look at histograms per discrete emotion label; we will refer to these as the classes. To assist in comparing classes we plot the trace of a Kernel Density Estimate (KDE) of the distribution of each class. The KDE approximates the distribution nonparametrically as a sum of Gaussian distributions, one for each sample, thus providing a smoother result than a histogram approximation and allowing for interpolation to approximate the probability density function (PDF) at values not contained in the current sample Parzen [96], Rosenblatt [956].
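A minimal sketch of these two views, assuming Python with scikit-learn and SciPy; X and labels are hypothetical placeholders standing in for the z-scored feature matrix (trials by features) and the stimulus labels, not the thesis data.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 22))                     # placeholder feature matrix
labels = rng.choice(["anger", "fear", "neutral"], size=120)

# Projection onto the two directions of largest variance, as in the scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# Class-conditional KDE traces of a single feature, as in the per-class density plot.
grid = np.linspace(-3, 3, 200)
for c in np.unique(labels):
    kde = gaussian_kde(X[labels == c, 0])
    print(c, grid[np.argmax(kde(grid))])           # mode of each class's estimate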

.. Clustering

Unsupervised learning is most commonly employed for exploratory purposes Jain and Lansing []. We apply two common clustering algorithms that employ a similar Gaussian model. The underlying assumption of all clustering is that the statistical model for the data relies upon some unknown set of labels for each point. For each point y, the parameters of the probability distribution it comes from, θ, depend on the cluster label z, a discrete random variable that takes one of K values. We look at two models using Gaussian distributions, θ = [µ, Σ]:

K-means:  p(y | z = i) = N(µ_i, σ²I)    (.9)
GMM:      p(y | z = i) = N(µ_i, Σ_i)    (.)

The difference between the two statistical models is in how the covariances are modeled. In the first model, only the means are unique to the latent variable z; the clusters all share a spherical covariance, and this model is solved using the k-means algorithm. The second model allows each cluster to also have a unique covariance Σ_i; this is referred to as a Gaussian mixture model (GMM), since the distribution of y without conditioning on z is a weighted sum of Gaussian distributions. Solving for these parameters is harder than writing a simple estimator because the values of z, which partition the data and determine which points to use for estimating each θ_i, are unknown. Both models require iterative algorithms with inference and estimation steps, both derived using maximum likelihood for the estimation step. The inference step in clustering is often referred to as an assignment step, since the inferred quantity is a cluster assignment. This is a second difference between the two methods: k-means makes a hard assignment, while the GMM makes a soft assignment.

For Equation .9 we initialize the algorithm by choosing some points to be class means. Then each point is scored by its distance to each mean, and

assigned to the nearest mean. The means are then recomputed based on these assignments. Each subsequent iteration repeats these two steps, assignment and mean computation.

To solve the GMM we use an algorithm for maximum likelihood estimation with incomplete data called Expectation Maximization (EM) Dempster et al. [977]. As noted before, this is a two step algorithm: one step of assignment, completed by taking the expectation over all possible assignments, and one step of estimation, setting parameters by maximizing the likelihood. In the general case EM is described for a given joint distribution Pr(Y, Z | θ) with Y observed and Z latent, described by parameters θ Bishop [6]. The general formulation and updates are in Algorithm ..

Algorithm . General EM Algorithm Bishop [6]
Initialize parameters θ_old
while not converged do
    E step: Evaluate Pr(Z | Y, θ_old)
    M step: θ_new = arg max_θ ∑_Z Pr(Z | Y, θ_old) ln Pr(Y, Z | θ)
    Check convergence; if not converged, θ_old ← θ_new
end while

For any specific latent variable problem, we then only need to write out the E and M steps and choose whether to look for convergence in the parameters or in the likelihood. For the GMM we look for likelihood convergence, by setting a threshold and stopping the algorithm when the change in likelihood between consecutive iterations is less than the threshold. The E step involves computing what is called the responsibility, or the soft assignment: the probability of each sample y_n belonging to each cluster k = 1, ..., K.

γ_{kn} = Pr(z_n = k | y_n) = π_k N(y_n | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(y_n | µ_j, Σ_j)    (.)

In the M step we compute the new parameters θ_new = [µ_k^new, Σ_k^new, π_k^new] using these responsibilities:

N_k = ∑_{n=1}^{N} γ_{nk}

µ_k^new = (1 / N_k) ∑_{n=1}^{N} γ_{nk} y_n    (.a)

Σ_k^new = (1 / N_k) ∑_{n=1}^{N} γ_{nk} (y_n − µ_k^new)(y_n − µ_k^new)^T    (.b)

π_k^new = N_k / N    (.c)

In Figure .5 and Figure .6 we show sample results for k = 6 as an illustrative example; additional results are shown in the appendices. In clustering, the number of clusters k is unknown, so we perform clustering for a variety of values of k. As the number of clusters increased in the GMM, numerical issues arose due to the small sample size and the estimation of a full covariance matrix. We will also see which number of clusters makes sense in subsection .., where we evaluate the mutual information between the learned clusters and other experimental variables. The clustering results show that there may be more meaningful ways to group the physiological responses into discrete groups than the stimulus category labels. These clusters are more separable than the class labels, but still do not have clear margins of separation.

.. Mutual Information Computations

Mutual information provides insight into nonlinear relationships in a dataset through the use of entropy. Entropy is a classical quantity in information theory. It was defined to meet desirable properties of a measure of information in a random process Shannon [98]. It is defined in the context of discrete random variables with known distributions.

Figure .5: K-means clustering results for 6 clusters using features measured with sound stimuli. The top two panels show scatter plots in two pairs of dimensions and the bottom four show box plots of the learned clusters in each feature (AOLEN, FPSTD, FPRT, IIAMP). The clusters are labeled only with numbers because they are learned, not labeled or interpreted.

Figure .6: GMM clustering results for 6 clusters using features measured with sound stimuli. The top two panels show scatter plots in two pairs of dimensions and the bottom four show box plots of the learned clusters in each feature (AOLEN, FPSTD, FPRT, IIAMP). The clusters are labeled only with numbers because they are learned, not labeled or interpreted.
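Clustering results like those in Figures .5 and .6 could be produced with off-the-shelf implementations; the sketch below, assuming Python with scikit-learn and a placeholder feature matrix rather than the thesis data, fits the two models compared above (hard-assignment k-means with shared spherical covariance, and an EM-fit Gaussian mixture with a full covariance per cluster) for k = 6.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))                              # placeholder features

hard_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0).fit(X)
soft_assign = gmm.predict_proba(X)                         # responsibilities gamma_{nk}

print(hard_labels[:10])
print(soft_assign[0].round(2))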

Definition .. Entropy measures the uncertainty in a random variable. For a discrete random variable A that can take values a ∈ A with probability mass function (PMF) p(a), it is defined:

H(A) = − ∑_{a ∈ A} p(a) log p(a)    (.)

When one random variable is conditioned on another, the result is a new random variable. As such, entropy can also be conditioned, where its definition follows from the above by standard probability theory.

Definition .. Conditional entropy measures the uncertainty in a random variable after another random variable is observed. For two random variables A and B:

H(A | B) = − ∑_{b ∈ B} ∑_{a ∈ A} p(a, b) log p(a | b)    (.)

The base of the logarithm only determines the units of the measure: nats for base e and bits for base 2. We use nats in the following computations and scores, but this decision is arbitrary; each value of the measure is only ever compared to other values of the measure.

Definition .. Mutual information describes the amount that knowledge of one random variable tells about another random variable. It is a symmetric quantity, defined as the difference between the entropy of one variable and the entropy of that variable after conditioning on the other Shannon [98]. This gives the interpretation that it is the uncertainty in a random variable removed by observing another variable. For two discrete random variables A and B:

I(A; B) = H(A) − H(A | B)    (.5a)
        = ∑_{a ∈ A} ∑_{b ∈ B} p(a, b) log( p(a, b) / (p(a) p(b)) )    (.5b)
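A minimal sketch of the plug-in estimate used in the computations that follow, assuming Python with NumPy: a continuous feature is discretized into equal-width bins, the histogram counts serve as the PMF estimates, and the number of bins is the single hyperparameter discussed in the next paragraphs. The variable names and the synthetic demonstration data are illustrative only.

import numpy as np

def mutual_information(x, z, n_bins=8):
    """Estimate I(X; Z) in nats; x is continuous, z is discrete."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    classes = np.unique(z)
    joint = np.zeros((n_bins, len(classes)))
    for i, zi in enumerate(classes):
        joint[:, i] = np.bincount(x_binned[z == zi], minlength=n_bins)
    p_xz = joint / joint.sum()                       # joint PMF estimate
    p_x = p_xz.sum(axis=1, keepdims=True)            # marginals
    p_z = p_xz.sum(axis=0, keepdims=True)
    nz = p_xz > 0
    return float(np.sum(p_xz[nz] * np.log(p_xz[nz] / (p_x @ p_z)[nz])))

rng = np.random.default_rng(0)
z = rng.choice(6, size=300)                          # e.g. a discrete label
x = 0.3 * z + rng.normal(size=300)                   # a feature related to it
print(mutual_information(x, z))                      # higher than for an unrelated feature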

For this data, the true underlying distributions are unknown and only some of the quantities are discrete. To compute these quantities for continuous random variables, sums may be replaced by integrals, but a density estimate is still necessary to proceed. Instead, we use histogram counts to estimate the densities here, following the software package provided with Brown et al. []. This estimation technique is advantageous because we can then still use sums; the normalized histogram counts serve as the PMFs. The technique requires setting two hyperparameters: the bin sizes and the bin centers. We assume equal bin sizes, uniformly spaced over the range of the sample data, and re-parametrize to a single parameter: the number of bins.

Varying the bin size also enables asking different questions of the data. Choosing a small number of large bins corresponds to asking whether coarse changes in the features, for example increase, decrease, or no change, carry sufficient information. At the other extreme, we may choose the finest discretization that is sufficiently reliable, judged by examining the behavior of the estimator, to look for information in the high-frequency variations in the data. In between these, we could also try a variety of different bin sizes and choose one based on some indication of the performance of each.

We can choose the number of bins empirically or visually, by what gives a sufficiently smooth histogram or by inspecting a kernel density estimate. In this analysis we compute a variety of mutual information scores between different pairs of experimental variables. This gives insight into how some of the variables are related and what might be good additional questions to ask. It also sets a level of expectation for how well we can expect future analyses to find relationships in the data.

The procedure used for computing and tabulating the mutual information scores is outlined in Experiment .. We compute a score for each combination of variables. We only vary the bin size in the density estimate if one or both of the variables is a feature (as listed in Table .) because

Variable | Values
6 learned clusters | 1, ..., 6
5 learned clusters | 1, ..., 5
learned clusters | 1, ...,
learned clusters | 1, ...,
Self report happiness level | 1, ..., 7
Self report anger level | 1, ..., 7
Self report disgust level | 1, ..., 7
Self report fear level | 1, ..., 7
Self report sadness level | 1, ..., 7
Highest self report component | fear, anger, disgust, sadness, happiness
Discrete stimulus labels | fear, anger, disgust, sadness, happiness, neutral
Stimulus ID | 1, ...,
Subject ID | 1, ..., 6

Table .: Discrete MI Variables

all of the experimental variables in Table . are truly discrete random variables, so the histogram density estimate does not require binning. By using the clustering assignments learned in subsection .., the mutual information scores give us a better interpretation of those results.

The top mutual information score results are displayed in Table . and Table .5 and the remainder are shown in Appendix B. In general, mutual information scores between experimental variables and the measurements are low, indicating that the measurements do not strongly predict these conditions. Though all of the scores are low, the learned clusters have more mutual information with the individual stimulus than with any categorical characterization of the stimulus, which is in agreement with Hypothesis 3, that time in addition to the stimulus label is important. The features are shown to share more information with other features extracted from the same signal than with the experimental variables. This result suggests that improved feature extraction may be useful; however, in this work we focus on reducing this shared, or redundant, information with feature selection as described in section .. Many features also have higher mutual information with the subject ID than with the experimental conditions. This means that these features would be more successful at predicting the identity of the subject than our target predictions. This suggests that individual differences are important and a subject-specific model may be necessary for these features to be applicable. Individual feature values have more information with the specific stimulus ID than with any of the categorical labels for the stimuli (aggregate or self report).

A | B | Score
discrete stimulus labels | stimulus ID |
Self report highest | stimulus ID |
discrete stimulus labels | Self report highest | .6565
Disgust | stimulus ID |
Sadness | stimulus ID |
Happiness | stimulus ID | .6896
Fear | stimulus ID |
Happiness | discrete stimulus labels |
Disgust | discrete stimulus labels |
Anger | stimulus ID | .5765
Fear | Self report highest | .9687
Fear | discrete stimulus labels | .7779
Sadness | discrete stimulus labels |
Sadness | Subject ID |
Anger | discrete stimulus labels | .87
Fear | Subject ID |
Anger | Disgust | .989
Anger | Subject ID | .7758
Fear | Anger | .9757
Disgust | Subject ID | .796
Sadness | Anger | .866
Happiness | Subject ID |
Happiness | Disgust | .558
Sadness | Fear | .86

Table .: Select results for experimental variable mutual information scores for sound data. Rows comparing mutual information between different clustering solutions and between the highest self report component and the individual components of the self report are removed, and the table is sorted by score.

A | B | Score
FPAMP | FPAFR |
SCAMP | SCSTD | .9856
SCAMP | SCAFR |
IIAMP | IISTD |
FPLEN | FPSTD |
SCAFR | SCSTD |
IIAMP | IIAFR |
Subject ID | FPRT | .5977
IIHRT | IIFRT | .5977
LPAMP | LPAFR |
FPAMP | FPSTD | .586
IIAFR | IISTD | .68
FPAMP | FPLEN | .96
Subject ID | IIFRT | .5569
Subject ID | FPAFR | .966
FPAMP | FPRT | .7958
FPAFR | FPRT | .6998
SCRTO | SCRR | .975
Subject ID | IIHRT | .8587
Subject ID | FPAMP | .7658
stimulus ID | AOLEN | .6766
FPAFR | FPSTD | .5579
Subject ID | AcLEN |
stimulus ID | SCAFR | .656
Subject ID | SCRTO |
Subject ID | SCAFR | .5
Subject ID | FPSTD | .85

Table .5: Select results for feature mutual information scores for sound data with eight bins in the density estimate. Rows redundant with Table . are removed and the table is sorted by score.

. Feature Selection

To gain additional insight and to develop a more parsimonious model, we applied feature selection. Though the number of features here is not large in comparison to many machine learning problems Guyon [], this is an important step in assisting with interpreting the results. The features are extracted from only 6 signals, driven by the autonomic nervous system, which is comprised of two subsystems, the sympathetic and parasympathetic nervous systems, that work both independently and in coordination to create various responses. The feature space is thus fundamentally determined by a lower dimensional subspace, and we use feature selection to find the information in the measured signals in a manner that combines expert knowledge and information theoretic strategies. Domain experts are interested in understanding which features of these signals are most discriminatory. To understand how the given features interact with each other and the labels, feature selection is more appropriate than projection or transformation based dimensionality reduction techniques such as PCA, which save computational complexity at test time but still use information from all of the measurements.

There are three main types of feature selection algorithms that differ in the way they interact with the learning algorithm and thus how they choose the best subset of features: filter, wrapper, and embedded Guyon []. Embedded feature selection algorithms integrate the dimensionality reduction into the main learning algorithm. Wrapper feature selection uses the learning task as the feature selection criterion and performs a search, as a wrapper, scoring various subsets of features. Filter feature selection performs a search in a similar manner to the wrapper method, but it uses a separate, generally less computationally expensive, criterion. This relies upon the assumption that the feature selection can be done independently of the classifier Brown et al. [].

.. Feature Selection as a Learning Problem

Feature selection is an extreme case of dimensionality reduction. The assumption is that only a subset of the measurements are dependent on the labels and thus only a subset are necessary for prediction. Using the components of a machine learning problem described in subsection .., feature selection is specified as follows. The data are the set of measurement-stimulus label pairs. The parameter θ is a binary vector, with θ_i = 1 indicating that the i-th feature is to be selected. The hypothesis space is the set of possible projections onto subsets of measurements. Since the parameter space is discrete, the set of all possible values a binary d-vector can take, feature selection is an NP-hard problem. This view of the problem aligns with the work of Brown et al. [].

Filter based feature selection has a two part objective: to maximize the quality of the feature subset and to minimize the number of features. These are often combined into a single optimization problem through a regularizing term or constraints. As we aim to apply feature selection as an exploratory analysis method and in order to compare a variety of models, a filter method is adopted going forward, because it is computationally efficient and unlikely to cause over-fitting.

Since the result is a combinatorial optimization, feature selection is generally solved as a search problem. The most straightforward approach is sequential forward search Whitney [97], which consists of choosing the best feature to add at each step. If the scores for candidate features do not depend on the previously selected features, this reduces to simply selecting the top d features.

.. Mutual Information-Based Feature Selection

We aim to discover feature subsets that contain the most information about various characteristics of the stimuli and experiments. In subsection ..

we looked at a variety of different mutual information computations and compared and visualized them. In this section, we use several mutual information based criteria to conduct feature selection. A variety of mutual information based criteria have recently been related under a common theoretical framework and connected to maximizing a conditional probability Brown et al. []. With mutual information defined as in subsection .., we review other feature selection criteria related to mutual information. For all of these criteria we apply a sequential forward search method to construct a list of features to keep. The filter method involves computing a score J for each feature and adding the one with the highest score at each iteration, until a maximum is reached or a predetermined maximum number of features is selected.

The most basic criterion applied is Mutual Information Maximization (MIM) (Equation .9a); it simply ranks the features by mutual information and keeps the top ones Lewis [99]. The other methods are iterative and compute the score at each iteration, considering a more complex score than the feature's individual mutual information with the labels. The next method is Maximum Relevance - Minimum Redundancy (MRMR) (Equation .9d), which scores each candidate feature as its mutual information with the labels less its mutual information with the previously selected candidates Peng et al. [5]. The mutual information with previously selected features is called the redundancy, and the mutual information with the labels is the relevancy; thus, by adding the feature with the highest score by this criterion at each step, we maximize the relevancy and minimize the redundancy of the selected features. As a probabilistic quantity, mutual information can be conditioned, which allows the information added by candidate features to be considered in the light of the previously selected features.

Definition .. The conditional mutual information between two variables A and B, conditioned on C, is the amount of uncertainty in A removed by observing B after the third variable C has been observed. The formulation follows directly from mutual information:

I(A; B | C) = H(A | C) − H(A | B, C)    (.6)

With this in place we can introduce two more criteria: Conditional Mutual Information (CMI) and Conditional Mutual Information Maximization (CMIM). CMI (Equation .9b) scores a candidate feature as the amount of mutual information it has with the labels, conditioned on all of the previously selected features, and is shown to be the ideal case Brown et al. []. CMIM (Equation .9e) computes the mutual information a candidate feature has with the labels conditioned on each previously selected feature and assigns the lowest one as the score for that candidate Fleuret []. The candidate with the highest score is added at each iteration, so it takes a best worst-case approach. We can also evaluate the mutual information between a joint random variable consisting of two features and the labels.

Definition .5. Joint mutual information is the mutual information between a joint random variable and another random variable Yang and Moody [999]. For a set of random variables A_1, A_2, ..., A_n and a random variable B:

I(A_1, ..., A_n; B) = H(A_1, ..., A_n) − H(A_1, ..., A_n | B)    (.7)
                    = ∑_{a_i ∈ A_i} ∑_{b ∈ B} p(a_1, ..., a_n, b) log( p(a_1, ..., a_n, b) / ( p(a_1, ..., a_n) p(b) ) )    (.8)

This gives another criterion directly, Joint Mutual Information (JMI), which scores features as the sum of the joint mutual information of the candidate with each of the previously selected features and the labels, as shown in Equation .9c. This measure also allows us to consider Double Input Symmetric

Relevance (DISR), which considers a normalized version of JMI, shown in Equation .9f Meyer and Bontempi [6]. In the feature selection experiments, we only use mutual information between the measurements, Y, and the discrete emotion labels for the stimuli, U, so all criteria are defined accordingly. A subscript will denote a single feature Y_i, and S = Y_θ will denote the set of features currently included in the selected group, where θ is a binary vector indicating which features to keep.

Mutual information based filter feature selection can be derived, rather than chosen heuristically, as maximizing the conditional likelihood of the labels given the parameters Brown et al. []. Within this framework, the authors establish key relationships among numerous previously demonstrated criteria that had frequently been chosen heuristically. They also compare the criteria with respect to accuracy in a variety of settings and stability in the sense of returning a consistent subset under small changes in the data. In conclusion they recommend a subset of the criteria for practitioners. Based on the objective of criteria that achieve both accurate classification and consistent subset selection, they identify three criteria: CMIM, DISR, and JMI, with JMI as the strongest recommendation. For higher stability, MIM is recommended. We also include CMI, which is what results directly from maximizing the conditional likelihood. In addition, we include MRMR; it does not include the conditional redundancy term that is identified as important, and it relies on an assumption that the features are pairwise class-conditionally independent. We find that it achieves similar scores but selects different features from the other criteria, though with a similar mix of sensors. Overall, in our results we find that CMI works best and JMI has similar results, which is consistent with their findings Brown et al. [].

J_mim(Y_i) = I(Y_i; U)    (.9a)

J_cmi(Y_i) = I(Y_i; U | S)    (.9b)

J_jmi(Y_i) = ∑_{Y_j ∈ S} I(Y_i, Y_j; U)    (.9c)

J_mrmr(Y_i) = I(Y_i; U) − (1 / |S|) ∑_{Y_j ∈ S} I(Y_i; Y_j)    (.9d)

J_cmim(Y_i) = min_{Y_j ∈ S} I(Y_i; U | Y_j)    (.9e)

J_disr(Y_i) = ∑_{Y_j ∈ S} I(Y_i, Y_j; U) / H(Y_i, Y_j, U)    (.9f)

.. Results

Questions to be answered from the feature selection process include questions for each data set and some for the group:

1. How many features is best for classification?
2. Which features are most descriptive of the categorical labels?
3. Are the same features descriptive for the two different types of stimuli (sounds and images)?
4. At what level of granularity is the information encoded in the data? Is there a maximum? Is there a good leveling-off point?

To address these questions we conduct feature selection as described in Experiment .. We vary the number of bins, and for each criterion we perform feature selection to obtain the best feature set of each size. For each bin size, the feature subsets that obtained the maximum MI score for each criterion are displayed in the bottom of Figure .7 and

Experiment . Mutual Information Computation Experiments
for all datasets i do
    for all pairs of random variables A, B from Table . and Table . do
        if A, B ∈ Table . then
            Compute Mutual Information per Definition .
        else
            for all numbers of bins k = ,..., 8 do
                Compute Mutual Information per Definition .
            end for
        end if
    end for
end for

Experiment . Feature Selection Experiment Structure
for all datasets i do
    for all numbers of bins k = ,..., 8 do
        for all maximum subset sizes n = ,..., do
            for all mutual information criteria J per Equation .9 do
                Perform feature selection
                Compute MI score
            end for
        end for
    end for
end for
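A minimal sketch of the forward search shared by all of the criteria in Experiment ., assuming Python; score is a hypothetical callable implementing one of the criteria in Equation .9 (for MIM it would simply ignore the already-selected set S), and the data arguments X and y stand in for the feature matrix and labels.

def forward_select(n_features, d_max, score, X, y):
    """Greedily add the highest-scoring feature until d_max are selected."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < d_max:
        best = max(remaining, key=lambda i: score(i, selected, X, y))
        selected.append(best)
        remaining.remove(best)
    return selected

Swapping JMI, CMI, or MRMR in for the score argument changes only the scoring function; the greedy search loop itself is unchanged.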

Figure .9 for the image and sound results respectively. In the top of these two figures, we see that as finer detail in each feature is used, via more bins, fewer features are required to obtain the maximum MI score. In each case, after a certain number of features were selected, additional features no longer added information. We also note that JMI and CMI require the fewest features to obtain the maximum score, and after about 8 bins the results become more stable. The smallest number of features selected that still obtains a maximum MI score is 5. This is good because, with a small sample size, the covariance estimate is sensitive to the dimensionality as well. The results for all criteria and bin sizes with 5 features selected are shown in Figure .8 and Figure .9. As the number of bins increases, the MI score for the 5 features selected by each criterion increases. However, for very small bin sizes we may be over-fitting and the increases in information diminish, so we choose the results from the smallest number of bins that retains the peak mutual information (MI) score.

After observing the general trends we selected candidate subsets for future testing, as shown in Figure .. As expected, we find that the most informative feature subsets differ when the subjects are shown sound stimuli versus image stimuli. There are some similarities: FPSTD is selected in almost every case, as is the pupil amplitude. For the features derived from Galvanic Skin Conductance (GSR) and IBI, the selections vary across criteria, number of features selected, and bin size. The skin conductance amplitude is more important for sound stimuli, but the skin conductance level dominates for image stimuli. For IBI derived features, in the sound stimuli data set the amplitude is selected in every case, but it only ranks seventh for image stimuli, where the length is more important. The selected subsets include features from multiple signals and rarely include more than one feature from a single channel for small subsets. This supports the experimental design consideration that multiple physiological signals should

Figure .7: Feature selection results for the maximum mutual information score at each bin size in the image data. Each color represents a different criterion; each group of columns is a different number of bins (and thus bin size) used in the histogram density estimation. The top plot shows the number of features selected and the bottom plot shows which features were selected by each criterion in the color indicated in the legend. The maximum score is the same for all tests.

Figure .8: Feature selection results for a maximum of 5 features for image stimuli. The top plot shows the score for the best set of 5 features for each bin size and the bottom figures show the features selected for each.

Figure .9: Feature selection results for the maximum mutual information score at each bin size in the sound data. Each color represents a different criterion; each group of columns is a different number of bins (and thus bin size) used in the histogram density estimation. The top plot shows the number of features selected and the bottom plot shows which features were selected. The maximum score is the same for all tests.

Figure .: Feature selection results for a maximum of 5 features for sound stimuli. The top plot shows the score for the best set of 5 features for each bin size, and the bottom figures show the features selected for each.

be used to capture information about an individual's mental state.

Figure .: Final feature selection results. The first seven columns are sound results and the latter seven are image results. For each data set, we choose the selections from the coarsest and finest density estimations with CMI, one other criterion result that performed best, and the CMI 8-bin results for subsets of up to 7 features. These are the feature subsets that are tested in all models in the following chapters. It is important to note that some of the selected features vary between the sound and image datasets.

Chapter

Model Based Analysis

In this chapter we look at two classes of techniques that fit the data to structured models representing hypotheses about the underlying generating processes. First, we look at models that are consistent with most literature in the field: static models, which assume the experimental trials are independent in time. Second, we explore dynamic models, which are able to represent and test for the presence of time dependencies in the data. These methods again build upon the background content described in chapter . We now work to design models that meet the objectives defined in section . and leverage what was learned about the data in section . about which features to use. All subsequent tests were conducted only on the feature subsets identified in Figure ..

The techniques described thus far have used statistical models that said little about the underlying structure or generative processes. Some assumptions were made, but they were flexible; many of the techniques were nonparametric. Here, we change our focus to finding the best structure to explain the data. Now our task is to predict a known value that is held out in training, and we will measure performance by predictive accuracy.

Figure .: Graphical models for the classes of static models we consider in this work: (a) a generative classifier, (b) a discriminative classifier, and (c) a regression model. The two classifiers ((a) and (b)) share that the labels Z are discrete. Regression is also a discriminative model, so both (b) and (c) have arrows indicating that the joint should be decomposed into p(z | y)p(y) instead of p(y | z)p(z) as in the generative classifier in (a). Since predictions are in the form of p(z | y), discriminative models only learn this quantity, not the full model. All models treat the T repeated trials and S subjects as independent, so there are T × S samples of each.

. Static Modeling

In this section we continue to explore the data under the assumption that the stimulus trials are independent in time. We hypothesize that this is not true and present these results for comparison with the models of interest in section .. First, we consider mapping the measurements to discrete, categorical (not ordered) responses, in classification. The measurements are the recorded physiology, and the latent category can be the stimulus class labels or the highest rated class from the self report. We are most interested in generative models, but for comparison we also discuss a discriminative classifier, the support vector machine. Finally, we fit the measured response to the labeled valence and/or arousal levels of the stimuli using a linear

Figure .: Expanded graphical models for generative classifiers: LDA in (a) and QDA in (b). These graphical models expand upon what was shown in Figure .a by showing the parameters, to highlight the difference between the two models.

regression model. All of these techniques assume that the stimulus presentations are independent in time and that there is a one-to-one correspondence between the input and the measurement. The graphical models for these three classes of models are shown side by side in Figure . for comparison. These results present a stronger test of the underlying theoretical assumptions than is typically applied in psychology research. Normally, multiple trials for each subject are averaged together first. We do not average over trials of the same stimulus class; we are testing the degree to which a single response carries the desired information, rather than whether it has an effect on average as in subsection ...

.. Discriminant Analysis

Within the learning problem structure defined in subsection .., the data D in classification consists of measurement-label pairs, here (Y, Z). The

measurements Y are continuous random variables and the labels Z are discrete. In discriminant analysis, the hypothesis space H is a set of mappings from the space of Y to Z. A set of discriminant functions g_i(y) is one of the most common ways to represent a classifier: a sample y is assigned to class z = i if g_i(y) > g_j(y) for all j ≠ i Duda et al. []. We assume normally distributed classes, so M will be the family of normal distributions with θ = (µ, Σ), indexed by the labels Z. The result is that the likelihood is the discriminant function: g_i(y) = N(y | µ_i, Σ_i).

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) makes the assumption that the data Y are generated from separate Multivariate Normal (MVN) distributions, dependent on the classes Z that correspond to the stimuli in training. This choice of notation differs from the usual convention (data X, labels Y), but it is chosen so that in section . we can use a consistent notation, which is also consistent with the dynamic modeling literature in machine learning. We assume that each class i has its own mean µ_i and that the classes have a shared covariance Σ, as in Equation .:

p(y | z = i) = N(µ_i, Σ)    (.)

In classification the task is prediction, hence the emphasis is on the decision boundary that optimally separates the classes rather than on the actual estimated parameters. When we see a new sample y_observed, we would like to assign it to the most likely class, so in terms of probability the rule is as in Equation .. This is still a generative technique; the decision boundary is defined in terms of the model parameters. It assigns the label ẑ that maximizes the likelihood of the observed data under the statistical model for each class, L(Y | θ).

ẑ = arg max_i p(y_observed | z = i)    (.)

For two classes (1, 2) the decision boundary is where the two likelihoods are equal. We insert the expression for the Gaussian distribution and take the log on both sides to get the following.

p(y | z = 1) = p(y | z = 2)    (.)

µ_1^T Σ^{−1} y − (1/2) µ_1^T Σ^{−1} µ_1 = µ_2^T Σ^{−1} y − (1/2) µ_2^T Σ^{−1} µ_2

y^T Σ^{−1} (µ_1 − µ_2) = (1/2) ( µ_1^T Σ^{−1} µ_1 − µ_2^T Σ^{−1} µ_2 )    (.)

This rule generalizes to multiple classes by drawing multiple hyperplanes in the space as boundaries among the different regions. However, in the training data the parameters in Equation ., θ = [µ_1, ..., µ_K, Σ], are unknown, so we cannot compute the boundary in Equation .. First we must estimate the parameters; then we can apply the rule in either its initial form or its decision boundary form. In practice, the decision boundaries are computed in training and used at test time. In the results presented below we apply cross validation to compute the Leave-one-out (LOO) and ten-fold prediction accuracies. We also present the training accuracy, because if this score is similar to the test accuracy it indicates more similarity between the two subsets of data.

Quadratic Discriminant Analysis

We also consider the case when the covariances are not shared among classes.

p(y | z = i) = N(µ_i, Σ_i)    (.5)

Test Name | Training Accuracy | LOO | 10-fold
cmi bins, 7 features | .5 | .7 | .
cmi 8 bins, 5 features | | |
mim 8 bins, 5 features | .6 | .6 | .
cmi 8 bins, features | | |
cmi 8 bins, 5 features | .5 | .8 | .
cmi 8 bins, 6 features | | |
cmi 8 bins, 7 features | .5 | .8 | .

Table .: Results for LDA in sound data

The classification rule follows as above up through Equation .. When we plug in the probability density function (PDF) for the MVN distribution with different covariances, the quadratic terms no longer cancel. In this case we get a quadratic decision boundary, and the result is Quadratic Discriminant Analysis (QDA). This model has a much greater number of parameters and can be much more flexible, especially in higher dimensions. As such it is more likely to be subject to overfitting, as seen by comparing the first line of Table . with that of Table .. The training accuracy is higher than the cross-validated accuracy in both cases, but the difference is more pronounced for QDA, where the training accuracy is much higher than the cross-validated one. We present both LOO and ten-fold cross validation but find that they produce similar results.

(y − µ_1)^T Σ_1^{−1} (y − µ_1) = (y − µ_2)^T Σ_2^{−1} (y − µ_2)    (.6)
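The LDA/QDA comparison above can be reproduced with off-the-shelf implementations; the sketch below, assuming Python with scikit-learn and placeholder data standing in for a selected feature subset and the six stimulus labels, reports training, leave-one-out, and 10-fold cross-validated accuracy for both classifiers.

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(4)
X = rng.normal(size=(180, 5))                   # placeholder feature subset
z = rng.choice(6, size=180)                     # six stimulus classes

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    train_acc = clf.fit(X, z).score(X, z)
    loo_acc = cross_val_score(clf, X, z, cv=LeaveOneOut()).mean()
    tenfold_acc = cross_val_score(clf, X, z, cv=10).mean()
    print(type(clf).__name__, round(train_acc, 2), round(loo_acc, 2),
          round(tenfold_acc, 2))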

Test Name | Training Accuracy | LOO | 10-fold
cmi bins, 7 features | .5 | . | .
cmi 8 bins, 5 features | | |
mim 8 bins, 5 features | | |
cmi 8 bins, features | | |
cmi 8 bins, 5 features | | |
cmi 8 bins, 6 features | . | . | .98
cmi 8 bins, 7 features | | |

Table .: Results for QDA in sound data

.. Support Vector Machines

Though a generative model is preferred, we also try a discriminative model for comparison; discriminative techniques more easily allow transformations of the features prior to classification. Support Vector Machines (SVMs) have been shown to have good performance in a variety of cases due to their ability to capture arbitrary nonlinear decision boundaries between classes Cortes and Vapnik [995]. This is accomplished by projecting the data into a higher dimensional space through a kernel function prior to choosing a hyperplane as the decision rule in the projected space. Under the more common max-margin interpretation there is no proper probabilistic interpretation of the SVM, so it is not easy to compare it to the other models we apply Murphy []. Alternatively, SVM can be derived as a modification of logistic regression. Starting from a probabilistic discriminative model, using the same statistical model as the classifiers discussed in subsection .., enables comparison with the other methods. In the previous models we have only looked at the likelihood derivation; here we will first describe a probabilistic view of a discriminative classifier, logistic regression, and then describe the adjustments necessary for SVM.

As a discriminative technique, the objective is to assign the label Z that is most likely for each point, instead of choosing the Z that makes the observation most likely as we did in subsection ... To accomplish this, we model the distribution of Z conditioned on the observed data, p(z | Y), in contrast to the methods above that model p(y | Z). SVMs are specifically defined for the two-class case; in order to apply them to multiple classes, we must use additional extensions. Since it is a two-class problem we can say z \in \{0, 1\} without loss of generality. Then we can model the distribution of the labels given the data as Bernoulli:

p(z \mid y) = \mathrm{Ber}(z; f(y)) = f(y)^{z} (1 - f(y))^{1-z} (.7)

Here we have conditioned the labels on the data through some function of the data, f(y). Now we need to specify that function: we need a function that maps a measurement onto [0, 1]. We opt first for a linear projection into one dimension, with weight vector w. To limit the range to [0, 1], we pass the linear projection through a sigmoid function, as shown in Equation .8. The sigmoid function has a domain of (-\infty, \infty), a range of (0, 1), and a smooth curve through (0, 0.5).

f(y) = \mathrm{sigm}(w^T y) = \frac{1}{1 + \exp(-w^T y)} (.8)
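The sigmoid link and the Bernoulli likelihood above translate directly into a few lines of code. The sketch below, with assumed placeholder data, evaluates the negative log likelihood that maximum likelihood training of logistic regression would minimize; the SVM modifications described next swap this loss for the hinge loss.

    # Minimal sketch of Equations .7-.8: a sigmoid-linked Bernoulli likelihood.
    # Y and z are assumed placeholders for measurements and binary labels.
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def neg_log_likelihood(w, Y, z):
        f = sigmoid(Y @ w)                 # f(y) = Pr(z = 1 | y)
        return -np.sum(z * np.log(f) + (1 - z) * np.log(1 - f))

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(50, 5))           # placeholder measurements
    z = rng.integers(0, 2, size=50)        # placeholder binary labels in {0, 1}
    print(neg_log_likelihood(np.zeros(5), Y, z))   # equals 50 * log(2) at w = 0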

In the SVM, we apply an alternative loss function, called the hinge loss, and obtain a solution that is sparse in the samples: the samples whose corresponding weights do not go to zero are called the support vectors. This reduces to an optimization problem that can be solved numerically. To handle the case where the classes are not completely separable, slack variables are added as weights on the points, giving a soft margin that allows some error. The slack variables provide a continuous spectrum and a smooth criterion for optimization.

\min_{w, w_0, \zeta} \; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \zeta_i \quad \text{s.t.} \quad \zeta_i \geq 0, \; z_i (y_i^T w + w_0) \geq 1 - \zeta_i, \; i = 1, ..., N (.9)

Here C is a hyperparameter to be set in the training phase, and w_0 is the offset of the separating hyperplane. We need to extend this to multiple classes, since we have a six-class problem, but the SVM does not naturally apply to multiple classes. There are many options, which are described in detail in most machine learning texts, for example Bishop [2006], Murphy []. We apply the Error Correcting Output Code (ECOC) method, which uses multiple binary classifiers together with designed codes, so that each class is represented by a pattern of binary results, one from each classifier Dietterich and Bakiri [1995]. The classifiers are all trained on the training data, then the test sample is run through each binary classifier. The resulting code, or pattern of binary classification results, is compared to the codes designed for each class, and the sample is assigned to the class with the smallest Hamming distance, a metric used for comparing binary vectors in communication systems. The results of an SVM with ECOC for the sound stimuli are shown in Table .. The results for image stimuli are similar. Note that not all of the feature subsets that were tested in the generative models above were tested with the SVM: a model including more than 8 features would not converge in the SVM optimization process.
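As an illustration of the multiclass construction (not the exact implementation used here), the sketch below wraps a linear-kernel SVM in scikit-learn's error-correcting output code meta-classifier. Note that this wrapper draws a random code book rather than the designed codes of Dietterich and Bakiri [1995], and the feature and label arrays are placeholders.

    # Minimal sketch, assuming placeholder data: six-class SVM via ECOC, scored
    # with ten-fold cross validation as in the tables below.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    features = rng.normal(size=(120, 5))           # placeholder feature subset
    labels = rng.integers(0, 6, size=120)          # placeholder six stimulus classes

    ecoc_svm = OutputCodeClassifier(SVC(kernel="linear"), code_size=2.0,
                                    random_state=0)
    train = ecoc_svm.fit(features, labels).score(features, labels)
    tenfold = cross_val_score(ecoc_svm, features, labels,
                              cv=KFold(n_splits=10)).mean()
    print(round(train, 2), round(tenfold, 2))

Swapping kernel="linear" for kernel="rbf" gives the Gaussian-kernel variant discussed below.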

Table .: Results for SVM with a linear kernel in sound data. Columns give the linear-kernel training accuracy and ten-fold accuracy for each tested feature subset; the numeric entries were lost in transcription.

The main advantage of the SVM is that, as a discriminative method, we can alter the features before applying the classifier. This allows a linear separation boundary in the projected space that corresponds to a more complex, nonlinear boundary in the original space. We apply a kernel by computing a feature vector for each sample, modifying the function f(y) above to be f(\phi(y)). We apply a Gaussian, or Radial Basis Function (RBF), kernel to generate the results in Table .. The kernel trick allows us to embed the kernel as an operator in the optimization, rather than computing it explicitly for each point Murphy []. Note that the training accuracy is much higher than the ten-fold cross-validated accuracy. This is a result of overfitting due to the small sample size and the flexibility of the model induced by the RBF kernel.

Regression

We can also use a linear regression model with the valence and arousal descriptions of the stimuli instead of the discrete categorical labels. Under our machine learning problem framework, regression has the same specification of data, hypothesis, and task; however, both Y and Z are continuous random variables, which changes the underlying model and the computations.

Table .: Results for SVM with an RBF kernel in sound data. Columns give the RBF-kernel training accuracy and ten-fold accuracy for each tested feature subset; results shown are accuracy rates for a six-class problem, and the numeric entries were lost in transcription.

The model for linear regression can be written in two equivalent ways, each emphasizing a different aspect of the model. We can model the conditional distribution of the labels, z, given the measurements, y, as in Equation ., or model z as a noisy linear function of y, as in Equation .:

\Pr(z \mid y) = \mathcal{N}(\beta y, \Sigma) (.)
z = \beta y + \eta, \quad \eta \sim \mathcal{N}(0, \Sigma) (.)

These two representations of the model are mathematically equivalent, but provide two ways of thinking about and illustrating the model. We use a dimensional definition of emotion for this effort; the label z is a two-dimensional vector holding the valence and arousal descriptions of the stimuli. We find that there are no strong relationships, as shown in Table .5. Fitting a regression model, we obtain results with a near-zero \beta vector. A typical way of evaluating a regression model is to compute residuals, the differences between the values predicted by the model and the observed values.
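The fitting and diagnostic steps just described look roughly like the sketch below, where the measurement matrix and the two-column valence/arousal labels are assumed placeholders rather than the thesis data.

    # Minimal sketch, assuming placeholder data: fit the linear map z = beta * y,
    # then inspect the coefficient matrix and the residuals.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(120, 5))             # placeholder physiological measurements
    Z = rng.normal(size=(120, 2))             # placeholder valence and arousal labels

    reg = LinearRegression().fit(Y, Z)
    beta = reg.coef_                          # shape (2, 5): one row per label dimension
    residuals = Z - reg.predict(Y)
    print(np.abs(beta).max(), np.abs(residuals).mean())

A near-zero beta together with residuals comparable in magnitude to the labels themselves is the pattern reported in the next paragraph.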

With this data, the predicted values are almost all zeros and the resulting residuals are of approximately the same magnitude as the data itself.

Dynamic Modeling

In this section we address the target class of models: dynamic models with a latent variable. These allow us to model non-isotropic relationships between the stimulus and the measurement, and correlations in the measurements, through an intermediate, time-dependent value. With the introduction of a latent variable we move from a supervised learning task to more of an estimation-focused effort, since the values of the latent variables are not available under these experimental conditions. We will evaluate the quality of fit of proposed models by using a subset of the observed variables to predict another set in testing; specifically, we use the measurements to infer the stimulus sequence. Our new primary interest is in discovering the latent state sequence and the model parameters. These new statistical models include time as a parameter and index the samples in groups based on time, not only on the samples.

Dynamic Modeling Review

Modeling dynamical systems has a long history in signal processing, and how to model random processes forms the basis of modern control theory Kalman [1959]. The state space model representation of the problem introduced by Kalman defines relationships among an input u, state z, and measurement y as in Equation ., using a system model f and a measurement model g Kalman [1960]. The state variable is a core concept in a large area of signal processing and machine learning: a state variable is a quantity sufficient to predict its own value at the next time step, and thus also the measurement.

Table .5: Regressors for the sound dataset with valence. The columns are a selection of the feature subsets described in Figure . (cmi, cmi 8, mim 8, cmi 8), the result of using all features, and the result of using each feature individually. The rows correspond to the features described in Table . (LPAMP, LPAFR, AcLEN, AOLEN, IIAMP, IIHRT, IIFRT, IILEN, IILEV, IIAFR, IISTD, FPAMP, FPAFR, FPLEN, FPRT, FPSTD, SCAMP, SCAFR, SCSTD, SCRTO, SCRR, SCLEV). The numeric coefficient values were lost in transcription.

z_t = f(z_{t-1}, u_t, \eta_t; \theta_{\mathrm{transition}}), \quad \eta_t \sim \Pr(0, \Sigma_{\mathrm{transition}}) (.a)
y_t = g(z_t, \omega_t; \theta_{\mathrm{observation}}), \quad \omega_t \sim \Pr(0, \Sigma_{\mathrm{observation}}) (.b)

Though f and g are deterministic functions, the system becomes a random process because process noise, \eta, and measurement noise, \omega, are assumed to corrupt the system. Alternatively, we can model the same relationships in a completely probabilistic manner. This allows for more flexible representations and is normally the view taken in machine learning. In machine learning, we specify two probability distributions, one for the transition model, Equation .a, and one for the observation model, Equation .b. With appropriate choices of the probability distributions, these two models are the same.

\Pr(z_t \mid z_{t-1}, u_t; \theta_{\mathrm{transition}}) (.a)
\Pr(y_t \mid z_t; \theta_{\mathrm{observation}}) (.b)

For many applications the system can be modeled with discrete states and without input; this is the well-known Hidden Markov Model (HMM) Rabiner [1989]. Inference in the HMM is completed using the forward-backward algorithm, and HMM model parameters can be learned using Expectation Maximization (EM) Dempster et al. [1977], the same algorithm used for the Gaussian mixture model (GMM) in subsection ... In the E step, inference is performed with the forward-backward algorithm Rabiner [1989]. Several variations of the HMM were introduced, each posing a slightly different form, adding inputs, or changing the representation of the state. Roweis and Ghahramani provide a unifying framework for all of the linear Gaussian models Roweis and Ghahramani [1999], and Murphy [] proposes an overarching framework, Dynamic Bayesian Networks (DBNs), that unifies the field.

These applications were linked through the development of the literature and mathematics of probabilistic graphical models, and specifically DBNs. DBNs encompass a broad set of models for time series and are generally represented using Probabilistic Graphical Models (PGMs) as described in subsection .. Murphy []. For this application, we need a model that has a latent variable as well. Given the nature of our problem, in which we aim to learn both the latent state variables and the unknown model parameters, we employ the expectation-maximization algorithm. This algorithm iteratively solves a maximum likelihood problem for the observable data given the model. At its completion we have estimated values describing the relationship between the inputs, states, and measurements at each time, the relationship between states at consecutive time samples, and an estimate of what the state sequence was.

Input-Output Hidden Markov Model

One objective is to arrive at a model that represents the preferred conceptual model for the experiment, one that includes a latent construct in which the subject processes the stimulus, instead of assuming that the measurement is a direct function of the stimulus as is implied by classifier models. Since we also aim to explore temporal dynamics, we utilize an HMM with input, as shown in Figure .b. For comparison, a graphical model for a classifier is shown in Figure .a. This model is a case of the Input-Output Hidden Markov Model (IOHMM) described by Bengio and Frasconi [1996]; however, we model the input as affecting the output only through the latent state: there is no direct influence of the stimulus on the measured response. We apply this model as a generative model: the learned parameters will describe the hypothesized data generation mechanism. In the IOHMM, we use the generic model from Equation . with the PGM in Figure .b.

We specify the transition model (Equation .b) with a multinomial distribution and the observation model (Equation .a) as Gaussian. This gives us the following model for one time slice:

P(y_t \mid z_t = j) = \mathcal{N}(y_t; \mu_j, \Sigma_j) (.a)
P(z_t \mid u_t = j, z_{t-1} = k) = \mathrm{Mult}(t_{:,j,k}) (.b)

where t_{:,j,k} is a slice of the transition tensor T, which describes the state transitions conditioned on the stimulus class as:

t_{i,j,k} = \Pr(z_{t+1} = i \mid z_t = j, u_{t+1} = k), \quad t = 1, ..., T, \quad i, j, k = 1, ..., K (.5a)
\Pr(z_{t+1} \mid z_t = j, u_{t+1} = k) = \prod_{i=1}^{K} p_i^{[z_{t+1} = i]} (.5b)

A key model feature here is that we relate the stimulus to the physiology probabilistically through an abstract mental-physical state. An a posteriori analysis in conjunction with the corresponding physiological descriptions could assign descriptive labels to the states, but they are not fixed to have any specific meaning or to correspond to known underlying mental states. For each of these abstract states, we assume there is a corresponding physiological profile, which is the output model for the IOHMM. For this model, we use EM to learn the parameters, with junction tree inference implemented with the BNT toolbox for MATLAB Murphy []. Learning this model is similar to learning a Gaussian mixture model, but with a dependency-inducing prior over the latent variable determined by the observed input and the neighboring latent state(s). Thus, this model is related to the clustering results shown in subsection ..; however, clustering only finds the groupings, while the IOHMM additionally models how the clusters in the physiology relate to the stimulus labels.
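The structure of this model can be made concrete with a short simulation of its generative process. The sketch below is a minimal illustration under assumed placeholder parameters (it does not reproduce the BNT/MATLAB implementation or any learned values): the next state is drawn from the multinomial indexed by the current state and the stimulus, and each state emits from its own Gaussian physiological profile.

    # Minimal sketch of the IOHMM generative process in Equations (.) and (.5).
    # All parameters below are illustrative placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    K, n_stimuli, d, T = 3, 2, 4, 20

    # transition tensor: T_tensor[i, j, k] = Pr(z_{t+1} = i | z_t = j, u_{t+1} = k)
    T_tensor = rng.dirichlet(np.ones(K), size=(K, n_stimuli)).transpose(2, 0, 1)
    means = rng.normal(size=(K, d))               # physiological profile per state
    cov = np.eye(d)

    u = rng.integers(0, n_stimuli, size=T)        # observed stimulus sequence
    z = np.zeros(T, dtype=int)                    # latent state sequence
    y = np.zeros((T, d))                          # simulated measurements
    for t in range(T):
        probs = np.ones(K) / K if t == 0 else T_tensor[:, z[t - 1], u[t]]
        z[t] = rng.choice(K, p=probs)
        y[t] = rng.multivariate_normal(means[z[t]], cov)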

Figure .: Graphical models for the two discrete models we consider: (a) a graphical model for a classifier; (b) the Input-Output Hidden Markov Model.

After initialization, the E step infers the latent state sequence from the measurements and the stimuli. The M step uses the ML solution for the parameters: the transition matrix and the parameters of the Gaussian distribution for each latent state value. With the learned parameters from the training set of subjects, we can infer the state sequence and the stimulus class jointly for the test subject(s). This allows us to report an accuracy as a measure of model fit, by comparing the inferred stimulus sequence to the true stimulus sequence. The procedure used to test the IOHMM is shown in Experiment .. We have one hyperparameter to set for this model, the model order, or number of latent states; as Experiment . shows, we vary this from three to nine. The training set size is determined by the type of cross validation. Here we present leave-one-subject-out cross validation, so there is one training set for each possible left-out subject. Following set notation, \bar{S} is the set complement of S, i.e., the held-out subjects. We compute the likelihood score L of each iteration to choose among the various random restart solutions. An illustrative sampling of the results is shown in Table .6; the complete results are shown in

Experiment .: Input-Output HMM Experiment Template

    for all datasets i do
        for all model orders k = 3, ..., 9 do
            for all training sets S do
                for all random initializations j = 1, ..., 5 do
                    Learn model parameters \theta_{i,k} from subjects in S
                    Infer \Pr(z, u) for the held-out subjects \bar{S}
                    Compute L
                end for
            end for
        end for
    end for
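The inner loop of this template amounts to keeping, for each training set and model order, the random restart whose likelihood score L is highest. A minimal sketch of that selection step is shown below; run_em is a hypothetical stand-in for EM training of the IOHMM (in this work performed with the BNT toolbox), and toy_run_em is a placeholder so the loop can be exercised.

    # Minimal sketch of choosing among random restarts by likelihood score.
    # `run_em` is a hypothetical callable returning (parameters, log_likelihood).
    import numpy as np

    def best_restart(train_data, n_states, run_em, n_restarts=5):
        best_params, best_loglik = None, -np.inf
        for seed in range(n_restarts):
            params, loglik = run_em(train_data, n_states, seed=seed)
            if loglik > best_loglik:
                best_params, best_loglik = params, loglik
        return best_params, best_loglik

    def toy_run_em(train_data, n_states, seed=0):
        # placeholder for real EM training; returns dummy parameters and score
        rng = np.random.default_rng(seed)
        return {"means": rng.normal(size=(n_states, 2))}, float(rng.normal())

    params, loglik = best_restart(train_data=None, n_states=6, run_em=toy_run_em)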

Table .6: Results for the IOHMM: the top-performing feature subset and number-of-latent-states (K) combinations for sound data. The columns are the stimulus set, criterion, number of bins, number of features, K, and the mean number of correct stimulus predictions; the score is the stimulus prediction accuracy from inferring the state and stimulus jointly. All listed rows use the sound stimulus set with the cmi criterion; the numeric entries were lost in transcription.

Figure .: Graphical model for the Linear Gaussian Model.

Linear Gaussian Dynamic System

We also consider the case with continuous random variables for the latent state and the inputs. Using the same DBN model, Equation ., we specify both the transition and observation models as Gaussian; in this case the model is called a Linear Dynamic System (LDS). For this model, we again use EM to train and then use the Kalman filter for inference Kalman [1960]. The stimulus sequence, {u_t}, is the control signal, z is the state, and y are the measurements, with all of the variables normally distributed. Here the state is assumed to be in the same space as the input signal, valence and arousal, hence we employ a two-dimensional vector. We can set the stimulus and state variables to be in the valence/arousal plane. The PGM is shown in Figure . and the system equations are:

z_{t+1} = M z_t + S u_{t+1} + \eta (.6a)
y_{t+1} \sim \mathcal{N}(\mu z_{t+1}, \Sigma) (.6b)

The solution to the inference, or state estimation, problem in the LDS is called Kalman filtering Kalman [1960]. This is also an iterative two-step algorithm: here we alternate updating the state estimate based on the model parameters and on the measurements.

Kalman filtering is generally posed as an online task, in which the state is estimated as new measurements arrive, but the same architecture can be applied in a batch mode, called Kalman smoothing Murphy []. To do inference, however, we need to know the model parameters. EM-type solutions can be used, or subspace methods can be applied as well. Subspace methods are a linear algebra technique and thus are more computationally efficient and are not subject to local minima as EM is. The LDS solution shows results that agree with the regression analysis. The elements of the learned S matrix are close to zero. The residuals between the predicted and measured values are large. This result is less informative than the IOHMM result. In Equation .8 we show an illustrative solution to the LDS model. While M is not zero, its entries are less than one, and with S nearly zero the state will approach zero no matter what the initial condition is; with the state nearly zero, y will be as well, even though \mu is not, and thus the residuals are large.

Equations .7 and .8 displayed the illustrative learned matrices: M with entries below one (e.g., 0.67), S nearly zero, and a nonzero \mu; the remaining numeric entries were lost in transcription.
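For completeness, the filtering recursion referred to above can be sketched in a few lines. The version below is a minimal textbook Kalman filter for the LDS of Equation .6 with assumed placeholder matrices; it is not the EM-trained model from this experiment.

    # Minimal Kalman filter sketch for z_{t+1} = M z_t + S u_{t+1} + eta,
    # y_t ~ N(C z_t, R). All matrices are illustrative placeholders.
    import numpy as np

    def kalman_filter(y, u, M, S, C, Q, R, z0, P0):
        z, P = z0, P0
        filtered = []
        for t in range(y.shape[0]):
            # predict step: propagate state mean and covariance through the model
            z = M @ z + S @ u[t]
            P = M @ P @ M.T + Q
            # update step: correct with the new measurement via the Kalman gain
            K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
            z = z + K @ (y[t] - C @ z)
            P = (np.eye(P.shape[0]) - K @ C) @ P
            filtered.append(z.copy())
        return np.array(filtered)

    # toy usage in a two-dimensional, valence/arousal-sized state space
    T, d = 10, 2
    M, S, C = 0.9 * np.eye(d), 0.1 * np.eye(d), np.eye(d)
    Q = R = 0.05 * np.eye(d)
    u = np.ones((T, d))
    y = np.cumsum(np.full((T, d), 0.1), axis=0)
    states = kalman_filter(y, u, M, S, C, Q, R, z0=np.zeros(d), P0=np.eye(d))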

Chapter 5

Interpretations

Understanding the strengths and weaknesses of a design is an essential part of the engineering design process. Especially in the case of less than favorable outcomes, it is important to understand why the current design did not produce the desired outcomes in order to develop a better design. In this chapter we provide some contextual interpretation of the results obtained. This demonstrates the potential impact of this general class of models, generative latent variable models, over relying completely on traditional statistical testing to answer questions in emotion research. We examine the model parameters in terms of the application and show what implications can be drawn from the parameter values. We relate the output parameters back to the initial hypotheses, to show what can be inferred from the learned transition matrices. In the previous chapter we presented the models, the experimental form used to test them, and results with the traditional performance metric, prediction accuracy. Here we refer back to the initial problem objectives from section . and apply ideas from statistical learning theory to produce more descriptive metrics.

5. Contextual Parameter Analysis

The primary advantage of adopting a generative model is that the full distribution of the data is modeled. Since the hypothesis is that this model should represent at least some meaningful portion of the true underlying process, we can gain valuable insight by examining the model parameters. In the proposed model, from subsection .., the learned parameters are the latent state sequence Z, the physiology model (\mu_i), and the transition model, T. Additional exploration and visualization of the parameters can also assist in drawing conclusions about the initial hypothesis. We can say something about the value of the hypothesis that time is important by comparing the outcomes of the static models (section .) to the outcomes of the dynamic models (section .), but there are additional differences between these two sets of models. We can make stronger statements about the value of time, and add greater detail, by looking at the parameters that describe time. In the learned physiological parameters, we see that these states are more separable than the classes defined by the discrete stimulus labels. This is consistent with the outcomes of the cluster analysis (subsection ..) and the mutual information computations (subsection ..), where we found that these clusters have more mutual information with the stimulus index than the unique labeling does. To display this improvement we compare it with the class separation using the stimulus labels in Figure 5., using box plots again as in subsection ... From the learned transition model we can obtain another view of how important time is in the modeling of emotional state. We learn a transition matrix for each stimulus label. If the previous state did not impact the next state and the stimulus dominated that distinction, we would expect these matrices to each have one column of high probabilities. This corresponds to having all of the marginals, p(z_t | z_{t-1} = i, u_t = j), for a fixed stimulus u_t = j be approximately equal across current states, meaning that only the stimulus impacts the next state. However, if time is important, these marginals will differ.

Figure 5.: Distribution of features based on stimulus categories (top) and learned states (bottom). The inferred states are labeled with letters because they are abstract and do not have a specific meaning. This is the model for the best performing cross-validation set.
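A comparison of this kind can be generated with a few lines of plotting code. The sketch below, using assumed placeholder values for the feature, the stimulus labels, and the inferred states, draws the two panels of grouped box plots; it only illustrates the visualization, not the actual learned model.

    # Minimal sketch: box plots of one feature grouped by stimulus category (top)
    # and by inferred latent state (bottom). All arrays are placeholders.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n = 300
    feature = rng.normal(size=n)                  # one physiological feature
    stimulus = rng.integers(0, 6, size=n)         # six stimulus categories
    state = rng.integers(0, 6, size=n)            # six inferred abstract states

    fig, (top, bottom) = plt.subplots(2, 1, figsize=(8, 6))
    top.boxplot([feature[stimulus == c] for c in range(6)])
    top.set_title("feature by stimulus category")
    bottom.boxplot([feature[state == s] for s in range(6)])
    bottom.set_title("feature by learned state (abstract states A-F)")
    plt.tight_layout()
    plt.show()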

Figure 5.: Transition matrices for k = 6, for the sound data on the cross-validation loop with the highest accuracy. Each row of each matrix is the marginal distribution of the next state conditioned on the current state, p(z_t | z_{t-1} = i, u_t = j). The cells are color coded to indicate the probability value of that entry according to the legend on the right. This is the model for the best performing cross-validation set.

We see in Figure 5. that the latter is the case: the marginals differ. The panels present each transition matrix in the form of a heat map, with low-probability transitions in red and high-probability transitions in green; there is one matrix for each stimulus class. We see that the most probable next state (darkest green) for each current state (row) varies, since the darkest green is not concentrated in a single column.
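This visual check can also be sketched programmatically: for each stimulus class, render the transition matrix as a heat map and test whether every row shares the same most probable next state. The transition tensor below is an assumed placeholder, not the learned model.

    # Minimal sketch: per-stimulus transition-matrix heat maps and a check for a
    # single dominant column (stimulus-only dependence). T_tensor is a placeholder
    # with T_tensor[i, j, k] = Pr(z_{t+1} = i | z_t = j, u_{t+1} = k).
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    K, n_stimuli = 6, 6
    T_tensor = rng.dirichlet(np.ones(K), size=(K, n_stimuli)).transpose(2, 0, 1)

    fig, axes = plt.subplots(1, n_stimuli, figsize=(3 * n_stimuli, 3))
    for k, ax in enumerate(axes):
        A = T_tensor[:, :, k].T                   # rows: current state, cols: next state
        ax.imshow(A, cmap="RdYlGn", vmin=0, vmax=1)
        ax.set_title(f"stimulus {k}")
        # if every row's argmax falls in the same column, only the stimulus matters
        print(k, np.unique(A.argmax(axis=1)).size == 1)
    plt.show()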
