A Unified Probabilistic Framework For Measuring The Intensity of Spontaneous Facial Action Units


Yongqiang Li, S. Mohammad Mavadati, Mohammad H. Mahoor, and Qiang Ji

Abstract: Automatic facial expression analysis has received great attention in both academia and industry in the past two decades. The Facial Action Coding System, which describes all possible facial expressions in terms of a set of anatomical facial muscle movements called Action Units (AUs), is the most widely used descriptive approach for analyzing facial expressions. The majority of existing studies on facial expression recognition focus on facial action unit detection or basic facial expression recognition, and very few works investigate measuring the intensity of spontaneous facial actions. In addition, these works measure the intensity of facial actions statically and individually, ignoring the dependencies among AUs as well as the temporal information, both of which are crucial for analyzing spontaneous expressions. To overcome this problem, this paper proposes a framework based on a Dynamic Bayesian Network (DBN) to systematically model such relationships among spontaneous AUs for measuring their intensities. Our experimental results show improvement over image-driven methods alone in AU intensity measurement.

I. INTRODUCTION

Facial expression is one of the most common nonverbal communication channels that humans use in their daily social interactions. In the past two decades, many researchers in computer vision and pattern recognition have developed computer-based techniques to automatically recognize facial expressions in visual data, driven by applications such as developmental psychology, emotive social robots, and intelligent Human-Computer Interaction (HCI) design [1], [2]. In order to describe and analyze facial expressions, several coding systems have been developed by psychologists. The Facial Action Coding System (FACS), originally developed by Ekman in the 1970s, is one of the most comprehensive coding systems in the behavioral sciences [3]. FACS describes all possible facial expressions in terms of a set of anatomical facial muscle movements called Action Units (AUs). For instance, AU12 (lip corner puller) codes contractions of the zygomaticus major muscle and AU6 (cheek raiser) codes contractions of the orbicularis oculi muscle [3].

Yongqiang Li and S. Mohammad Mavadati contributed equally to this work. Yongqiang Li and Qiang Ji are with Rensselaer Polytechnic Institute, Department of Electrical, Computer, and Systems Engineering, {liy23,jiq}@rpi.edu. S. Mohammad Mavadati and Mohammad H. Mahoor are with the University of Denver, Department of Electrical and Computer Engineering, {smavadat,mmahoor}@du.edu.

Fig. 1. Relation between the scale of evidence and intensity scores for facial action units [3].

Traditionally, for facial expression analysis and AU intensity measurement, expert FACS coders manually codify images or video frames. However, this is a very labor-intensive and time-consuming task. The literature shows that computer algorithms can help scientists recognize and measure facial expressions automatically [4], [5]. Although automatic facial expression measurement has been used to distinguish between posed and spontaneously occurring smiles [6] and to categorize pain-related facial expressions [7], many areas still lack comprehensive studies.
In real face-to-face communication, we deal with spontaneous facial expressions. Posed facial expressions and action units are those created by asking subjects to deliberately make specific facial actions or expressions. Spontaneous facial expressions and action units, on the other hand, are representative of facial expressions in daily life. They typically occur in uncontrolled conditions and are combined with head pose variation, head movement, and often more complex facial action units. Most of the developed systems for facial expression and action unit classification are evaluated using posed expression data [23]. One reason is that the majority of available databases focus on posed facial expressions, and very few databases are available for studying spontaneous facial expressions [7], [8], [9], [30]. For automatic facial expression recognition, there are some valuable databases that contain either the six basic facial expressions (i.e., anger, surprise, fear, sadness, joy, and disgust) or combinations of AUs, among which the Cohn-Kanade database [10], the MMI database [11], and the Bosphorus database [12] are AU-coded face databases publicly available for research. Recently a new database, called the Denver Intensity of Spontaneous Facial Action (DISFA) database, has been published [9], [30]; it contains the intensity of 12 action units. To measure the intensity of action units, as defined in the FACS manual [3], there are five ordinal scales (i.e., scales A through E, which respectively indicate barely visible to maximum intensity of each AU). The general relationship between the scale of evidence and the A-B-C-D-E intensity scoring is illustrated in Fig. 1. Generally, the A level refers to a trace of the action; B, slight evidence; C, marked or pronounced; D, severe or extreme; and E, maximum evidence. For example, we use AU12B to indicate AU12 with a B intensity level.

In this study we utilized the DISFA database for measuring the intensity of spontaneous facial action units.

Analyzing spontaneous facial expressions is a challenging task, and currently there are very few studies in this area. Bartlett et al. [13] attempted to measure the intensity of action units in posed and spontaneous facial expressions using Gabor wavelets and support vector machines. They reported average correlation values of 0.3 and 0.63 between a human coder and the predicted intensity of AUs for spontaneous and posed expressions, respectively. These results demonstrate that measuring the intensity of spontaneous expressions is more challenging than measuring the intensity of posed expressions. In another study on spontaneous facial expression measurement [5], the authors used AAM features in conjunction with SVM classifiers to automatically measure the intensity of AU6 and AU12 in videos captured from infant-mother interactions. In [14], histogram of oriented gradients and Gabor features were utilized for detecting spontaneous action units using K-nearest neighbor and SVM classifiers.

In the majority of studies in the area of facial expression, the focus is mostly on facial action unit detection or basic facial expression recognition, and very few works measure the intensity of facial actions [5], [15]. To the best of the authors' knowledge, most current studies, including [5], [15], measure the intensity of facial actions statically and individually, ignoring the dependencies among multilevel AU intensities, as well as the temporal information, both of which are crucial for analyzing spontaneous expressions. Tong et al. [23] employed a Dynamic Bayesian Network (DBN) to model the dependencies among AUs and achieved improvement over image-driven methods alone, especially for recognizing AUs that are difficult to detect but have strong relationships with other AUs. However, [23] focuses on AU detection in posed expressions. Following the idea in [23], in this paper we introduce a framework based on a DBN to systematically model the relationships among different intensity levels of AUs, in order to measure the intensity of spontaneous facial actions. The proposed probabilistic framework is capable of recognizing multilevel AU intensities in spontaneous facial expressions.

II. OVERVIEW OF OUR APPROACH

The focus of this paper is to develop a framework to measure the intensity of spontaneous facial action units, from the absence of an AU to its maximum intensity level. Fig. 2 gives the flowchart of our proposed system, which consists of an offline training phase (Fig. 2(a)) and an online testing phase (Fig. 2(b)). The training phase includes training multi-class Support Vector Machines (SVMs) and learning the DBN to capture the semantic and dynamic relationships among AUs. Advanced learning techniques are applied to learn both the structure and the parameters of the DBN based on both training data and domain knowledge.

Fig. 2. The flowchart of the proposed system. (a) Training process. (b) Testing process.

The online AU intensity recognition phase consists of two independent
but collaborative components: AU observation extraction by SVM classification, and DBN inference. For observation extraction, we employ HOG and Gabor features, which describe local appearance changes of the face, followed by SVM classifiers. Given the AU observations, we estimate the intensity of facial action units through probabilistic inference with the DBN model. In this way, we can further incorporate the dependencies among multilevel AU intensities, as well as the temporal information.

The remainder of the paper is organized as follows. Sec. III describes the AU observation extraction method. In Sec. IV, we build the DBN model for AU intensity recognition, including BN structure learning (Sec. IV-A), DBN parameter learning (Sec. IV-C), and DBN inference (Sec. IV-D). Sec. V presents our experimental results and discussion, and Sec. VI concludes the paper.

III. AU INTENSITY OBSERVATION EXTRACTION

This section describes our proposed AU intensity observation extraction method, which consists of several components: face registration (Sec. III-A), facial image representation (Sec. III-B), dimensionality reduction (Sec. III-C), and classification (Sec. III-D). The flowchart of our AU intensity observation extraction method is shown in Fig. 3.

A. Face Registration

Image registration is a systematic way of aligning two images of the same object (i.e., the reference and the sensed image) that are taken at different times, from different viewpoints, or with different sensors. In order to register two images efficiently, a set of points, called control points or landmark points, is often utilized to represent the object in both images. In our study, we used the 66 landmark points of the DISFA database (i.e., points representing the mouth boundary, corners of the eyes, tip of the nose, face boundary, etc.) to represent the locations of important facial components [9]. The reference landmark points were obtained by averaging the 66 landmark points over the whole training set. A 2D similarity transformation was calculated between the reference points and the target points. Afterwards, we utilized the corresponding points, the calculated transformation function, and bilinear interpolation to transform each new image into the reference coordinate system.
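As a rough illustration of this registration step (not the authors' exact implementation), the sketch below estimates a 2D similarity transform between the reference landmarks and a frame's landmarks and warps the frame into the reference coordinate system with bilinear interpolation; scikit-image is our choice of toolkit and the names are illustrative:

```python
import numpy as np
from skimage.transform import SimilarityTransform, warp

def register_face(image, landmarks, reference_landmarks):
    """Warp a face image into the mean-landmark reference frame.

    image: grayscale face image; landmarks / reference_landmarks: (66, 2)
    arrays of (x, y) points (names are illustrative)."""
    tform = SimilarityTransform()
    # Estimate the similarity transform mapping reference coordinates to the
    # sensed image's coordinates.
    tform.estimate(reference_landmarks, landmarks)
    # warp() uses the given transform as the map from output (reference)
    # coordinates back to input (sensed) coordinates; order=1 is bilinear.
    return warp(image, tform, order=1)
```

Applying such a function to every frame before feature extraction places all faces in the common reference coordinate frame described above.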

Fig. 3. The flowchart of the AU intensity observation extraction method.

B. Facial Image Representation

After registering the facial images, we utilized two well-known feature extraction techniques that are capable of representing the appearance information. These features are the Histogram of Oriented Gradients (HOG) and localized Gabor features, which are described below.

1) Histogram of Oriented Gradients: The histogram of oriented gradients was first introduced by Dalal and Triggs for human detection [16]. HOG is a descriptor that counts the occurrences of gradient orientations in localized portions of an image, and it can efficiently describe the local shape and appearance of an object. To represent the spatial information of an object, images are divided into small cells, and for each cell the histogram of gradients is calculated (for more information about the gradient filters and the number of histogram bins, interested readers are referred to [16]). In our experiment, every image was divided into fixed-size cells such that 48 cells in total were constructed from each image. We applied the horizontal gradient filter [-1 0 1] with 59 orientation bins. To construct the HOG feature vector, the HOG representations of all the cells were stacked together, and finally a HOG feature vector of size 2832 (48 x 59) was obtained.

2) Localized Gabor Features: The Gabor wavelet is another well-known technique for representing the texture information of an object. A Gabor filter is defined as a Gaussian kernel modulated by a sinusoidal plane wave. Gabor features have a powerful capability for representing facial textures and have been used in different applications, including facial expression recognition [17]. In our experiment, to efficiently extract both the texture and shape information of facial images, 40 Gabor filters (i.e., 5 scales and 8 orientations) were applied to regions defined around each of the 66 landmark points, and as a result 2640 Gabor features were extracted.
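For concreteness, the following is a minimal sketch of the per-cell orientation histograms described above. The 6 x 8 cell grid is our assumption (the text fixes only 48 cells and 59 bins), the [-1 0 1] gradient filter matches the one mentioned, and no block normalization is applied:

```python
import numpy as np

def hog_cells(gray, n_cells=(6, 8), n_bins=59):
    """Per-cell histograms of gradient orientations (no block normalization).

    Loosely follows the setup above: [-1 0 1] gradient filter, 48 cells,
    59 orientation bins; the 6 x 8 cell grid is an assumption."""
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]      # horizontal [-1 0 1] filter
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]      # vertical counterpart
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation in [0, pi)
    rows = np.array_split(np.arange(gray.shape[0]), n_cells[0])
    cols = np.array_split(np.arange(gray.shape[1]), n_cells[1])
    feats = []
    for r in rows:
        for c in cols:
            hist, _ = np.histogram(ang[np.ix_(r, c)], bins=n_bins,
                                   range=(0, np.pi),
                                   weights=mag[np.ix_(r, c)])
            feats.append(hist)
    return np.concatenate(feats)                  # 48 * 59 = 2832 dimensions
```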
C. Dimensionality Reduction: Manifold Learning

In many real-world applications in machine learning and pattern classification, high-dimensional features make analyzing the samples more complicated. In this regard, a number of algorithms, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and manifold learning, have been proposed to reduce the dimensionality of features [18]. Among these, manifold learning is a nonlinear technique that assumes the data points are sampled from a low-dimensional manifold embedded in a high-dimensional space. Mathematically speaking, given a set of points x_1, ..., x_n in R^D, we seek a set of points y_1, ..., y_n in R^d (d << D) such that y_i represents x_i efficiently. There are several different manifold learning techniques (e.g., ISOMAP, Locally Linear Embedding (LLE), Laplacian Eigenmap, etc.), which have the following three steps in common: 1) build a nearest-neighbor graph over the entire set of sample points; 2) linearly approximate the local manifold geometry within each neighborhood; 3) define and minimize a cost function so as to obtain the best low-dimensional representation. The key assumption in every manifold learning algorithm is that the manifold is smooth in the neighborhood of each sample point [19].

Several studies [20], [5] show that manifold learning techniques outperform linear techniques (e.g., PCA) in reducing the dimensionality of data such as facial expressions and human actions. In this paper we utilized the Laplacian Eigenmap technique to extract low-dimensional features of facial images. The Laplacian Eigenmap algorithm [21] was originally introduced by Belkin and Niyogi in 2003. In this algorithm, after finding the K nearest neighbors of each sample point, nodes i and j are connected if x_i is among the k nearest neighbors of x_j; otherwise they are disconnected. To calculate the weights of connected neighboring samples, one approach is the heat kernel W_ij = exp(-||x_i - x_j||^2 / t). The cost function for the Laplacian Eigenmap is Sum_ij W_ij ||y_i - y_j||^2 = tr(Y^T L Y), which aims to map points that are close in the high-dimensional space to points that are close in the low-dimensional one. The embedding is obtained by solving the generalized eigenvector problem L f_l = lambda_l D f_l, where D is the diagonal weight matrix with D_ii = Sum_j W_ij and L = D - W is the symmetric, positive semidefinite Laplacian matrix. Assume f_1, ..., f_d are the eigenvectors corresponding to the d smallest nonzero eigenvalues (0 < lambda_1 <= lambda_2 <= ... <= lambda_d); then for embedding into a d-dimensional Euclidean space we apply the map x_i -> (f_1(i), ..., f_d(i)). Readers can find more details in [21].

Similar to [5], we utilized the Spectral Regression (SR) algorithm to find a projection function that maps the high-dimensional data, such as the HOG and Gabor features, into the low-dimensional space.

D. Classification

Given the reduced feature vectors, we extract the AU intensity observations through SVM classification. The SVM is a classifier that has gained popularity for pattern recognition over the last decade; it aims to find a separating hyperplane with the maximum margin. Several parameters, such as the kernel type (e.g., linear, polynomial, or Radial Basis Function (RBF) kernels), can affect the efficiency of an SVM classifier. For more detailed information on SVMs we refer readers to [22]. For AU intensity observation extraction, we utilized multiple SVM classifiers with a one-against-one strategy and examined three different kernels (linear, polynomial, and Gaussian RBF), of which the Gaussian RBF outperformed the other two.

Although we can extract AU intensities with some accuracy in this way, this image-appearance-based approach treats each AU and each frame individually and relies largely on the accuracy of face region alignment. In order to model the dynamics of AUs, as well as their semantic relationships, and to deal with image uncertainty, we utilize a DBN for AU inference. Consequently, the output of the SVM classifier is used as the evidence for the subsequent AU inference via the DBN.
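Putting the dimensionality reduction and classification stages of this section together, a minimal sketch of the observation-extraction pipeline (with toy stand-in data) could look as follows. scikit-learn's SpectralEmbedding is a Laplacian-Eigenmap-style embedding without the explicit Spectral Regression projection used above, so this is only an approximation, and all hyperparameters and names are assumptions:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import SVC

# Toy stand-ins for the real inputs: X_hog would be the (n_frames, 2832) HOG
# matrix and y_au12 the per-frame AU12 intensity labels (0-5).
rng = np.random.default_rng(0)
X_hog = rng.normal(size=(200, 2832))
y_au12 = rng.integers(0, 6, size=200)
train_idx, test_idx = np.arange(150), np.arange(150, 200)

# Laplacian-Eigenmap-style embedding of the high-dimensional appearance features.
# Unlike Spectral Regression, SpectralEmbedding has no explicit projection for
# unseen frames, so here the embedding is fit on all frames at once.
embed = SpectralEmbedding(n_components=30, n_neighbors=10)
X_low = embed.fit_transform(X_hog)

# Multi-class SVM with a one-against-one strategy and Gaussian RBF kernel.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
svm.fit(X_low[train_idx], y_au12[train_idx])
au12_observation = svm.predict(X_low[test_idx])   # per-frame evidence for the DBN
```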
IV. DBN MODEL FOR FACIAL ACTION UNIT INTENSITY RECOGNITION

A. AU Dependencies Learning

Measuring the intensity of each AU statically and individually is difficult due to the variety, ambiguity, and dynamic nature of facial actions. This is especially the case for spontaneous facial expressions. Moreover, when AUs occur in combination, they may be non-additive: the appearance of an AU in a combination differs from its stand-alone appearance. Fig. 4 demonstrates an example of the non-additive effect: when AU12 (lip corner puller) appears alone, the lip corners are pulled up toward the cheekbones; however, if AU15 (lip corner depressor) also becomes active, the lip corners are somewhat angled down due to the presence of AU15. The non-additive effect increases the difficulty of recognizing AUs individually.

Fig. 4. Non-additive effect in an AU combination. (a) AU12 occurs alone. (b) AU15 occurs alone. (c) AU12 and AU15 appear together. (Adapted from [3].)

Fortunately, there are some inherent relationships among AUs, as described in the FACS manual [3], namely co-occurrence relationships and mutual-exclusion relationships. The co-occurrence relationships characterize groups of AUs that usually appear together to show meaningful facial emotions, e.g., AU1+AU2+AU5+AU26+AU27 for surprise and AU6+AU12+AU25 for happiness. On the other hand, based on the alternative rules provided in the FACS manual, some AUs are mutually exclusive, either because it is anatomically impossible to display the AU combination simultaneously or because the logic of FACS precludes scoring both AUs [3]. For instance, one cannot perform AU25 (lips part) simultaneously with AU23 (lip tightener) or AU24 (lip pressor). Furthermore, there are also restrictions on AU intensities beyond the co-occurrence and mutual-exclusion relationships. For instance, when AU6 (cheek raiser) and AU12 (lip corner puller) are present together, a high/low intensity of one AU indicates a high probability of a high/low intensity of the other. At the same time, for the combination AU10 (upper lip raiser) + AU12 (lip corner puller), one cannot score AU10 as D or E if AU12 is scored D or E, since such strong actions of AU12 counteract the influence of AU10 on the shape of the upper lip; with such a strong AU12, one is only able to score AU10 as A, B, or C, whereas with AU12 at C or less one may be able to score AU10 as E [3].

Tong et al. [23] employed a Bayesian network to model the co-occurrence and mutual-exclusion relationships among AUs. However, [23] focuses on AU detection, which only recognizes an AU's absence or presence. In addition, [23] detects AUs in posed expressions, which are created by asking subjects to deliberately make specific facial actions or expressions. Spontaneous expressions, on the other hand, typically occur in uncontrolled conditions and are more challenging to measure [13]. In this work, following the idea in [23], we adopt a Bayesian Network (BN) to capture the semantic relationships among AUs, as well as the correlations of the AU intensities, for measuring the intensity of spontaneous facial actions.

A BN is a Directed Acyclic Graph (DAG) that represents a joint probability distribution over a set of variables. In this work, we employ 12 hidden nodes representing the 12 AUs of the DISFA database (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26), each of which has six discrete states indicating the intensity of the AU. In a BN, the structure captures the dependencies among variables and is crucial for accurately modeling their joint probabilities. In this work, we learn the BN structure directly from the training data. The learning algorithm finds a structure G that maximizes a score function. We employ the Bayesian Information Criterion (BIC) score function [24], defined as follows:

s_D(G) = max_theta log P(D | G, theta) - (log M / 2) Dim_G    (1)

where the first term evaluates how well the network fits the data D; the second term is a penalty relating to the complexity of the network; log P(D | G, theta) is the log-likelihood function of the parameters theta with respect to the data D and the structure G; M is the number of training samples; and Dim_G is the number of parameters.
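The BIC score decomposes over the nodes of the candidate structure. As a hedged illustration (not the search algorithm of [25]), one node's contribution can be computed from the counts of its states under each parent configuration; function and variable names here are ours:

```python
import numpy as np

def bic_node_score(counts, M):
    """BIC contribution of one node (cf. Eq. (1)).

    counts: (q, r) array with counts[j, k] = number of training samples in which
    the node takes state k and its parents take configuration j.
    M: total number of training samples."""
    n_j = counts.sum(axis=1, keepdims=True)
    theta = np.divide(counts, n_j, out=np.zeros_like(counts, dtype=float),
                      where=n_j > 0)                      # ML parameter estimates
    log_theta = np.log(theta, out=np.zeros_like(theta), where=theta > 0)
    loglik = np.sum(counts * log_theta)                   # max_theta log P(D | G, theta)
    dim = counts.shape[0] * (counts.shape[1] - 1)         # free parameters of this node
    return loglik - 0.5 * np.log(M) * dim
```

Summing this quantity over all nodes gives s_D(G) for a candidate structure G, which the structure search then maximizes.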

de Campos and Ji [25] developed a Bayesian network structure learning algorithm that does not depend on the initial structure and guarantees global optimality with respect to the BIC score. In this work, we employ the structure learning method of [25] to learn the dependencies among AUs, as well as the correlations of the AU intensities. To simplify the model, we use the constraint that each node has at most two parents. The learned structure is shown in Fig. 5.

Fig. 5. The learned BN structure from the training data.

B. Dynamic Dependencies Analysis

The above BN structure can only capture static dependencies. In this section, we extend it to a dynamic Bayesian network by adding dynamic links. In general, a DBN is made up of interconnected time slices of static BNs, and the relationships between two neighboring time slices are modeled by an HMM, such that variables at time t are influenced by other variables at time t, as well as by the corresponding random variables at time t-1 only. In the proposed framework, we consider two types of conditional dependencies between variables at two adjacent time slices. The first type, i.e., an arc from the AU_i node at time t-1 to the same node at time t, depicts how a single variable develops over time. For instance, since spontaneous facial expressions change smoothly, the intensity of an AU is highly likely to change in order, either ascending or descending. Such dynamic restrictions are modeled by the first type of dynamic links, and we include such a link for every single AU. The second type, i.e., an arc from AU_i at time t-1 to AU_j (j != i) at time t, depicts how AU_i at the previous time step affects AU_j at the current time step. This dynamic dependence is also important for understanding spontaneous expressions. For example, Schmidt and Cohn [26] found that certain action units usually closely follow the appearance of AU12 in smile expressions: for 88% of the smile data they collected, the appearance of AU12 was either simultaneous with or closely followed by one or more associated action units, and for these smiles with multiple action units, AU6 was the first action unit to follow AU12 in 47% of cases. Messinger et al. [27] also show that AU6 may follow AU12 (smile) or AU20 (cry) to act as an enhancer to enhance the emotion. This means that a certain AU at the next time step may be affected by other AUs at the current time step. Analysis of other expressions leads to a similar conclusion. Based on this understanding and on the analysis of the database, as well as the temporal characteristics of the AUs we intend to recognize, in this work we link the AU2 node and the AU12 node at time t-1 to the AU5 node and the AU6 node at time t, respectively, to capture the second type of dynamics.
Fig. 6 gives the whole picture of the dynamic BN, including the shaded visual observation nodes. For presentation clarity, we use self-arrows to indicate the first type of temporal links described above.

Fig. 6. The complete DBN model for AU intensity recognition. The shaded node indicates the observation for the connected hidden node. The self-arrow at a hidden node represents its temporal evolution from the previous time slice to the current time slice. The link from AU_i at time t-1 to AU_j (j != i) at time t indicates the dynamic dependence between different AUs.

C. DBN Parameter Learning

Given the DBN structure, we now focus on learning the parameters from training data in order to infer the hidden nodes. A DBN can be viewed as a pair of BNs (an initial network and a transition network), and parameter learning for a DBN and a BN is the same in implementation. Learning the parameters of a BN means finding the most probable values theta* of theta that best explain the training data. Let theta_ijk denote a probability parameter,

theta_ijk = p(x_i^k | pa_j(X_i))    (2)

where i ranges over all the variables (nodes in the BN), j ranges over all possible parent instantiations for variable X_i, and k ranges over all instantiations of X_i itself (the intensity levels of the AUs). Thus, x_i^k represents the kth state of variable X_i. In this work, the fit between the parameters theta and the training data D is quantified by the log-likelihood function log p(D | theta), denoted L_D(theta). Assuming the training samples are independent, and based on the conditional independence assumptions in the BN, we have the log-likelihood function

L_D(theta) = log Prod_{i=1}^{n} Prod_{j=1}^{q_i} Prod_{k=1}^{r_i} theta_ijk^(n_ijk)    (3)

where n_ijk is the count of cases in which node X_i takes state k with state configuration j for its parent nodes.
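To make Eqs. (2)-(3) concrete, the counts n_ijk can be accumulated directly from the complete intensity labels; the sketch below does this for a single node, assuming all AUs have the same six states (the parent set comes from the learned structure, and all names are illustrative):

```python
import numpy as np

def count_n_ijk(labels, i, parents, r=6):
    """Accumulate the counts n_ijk of Eq. (3) for node X_i.

    labels: (M, 12) array of complete AU intensity labels (values 0-5);
    parents: list of parent column indices of X_i from the learned structure."""
    q = r ** len(parents) if parents else 1        # number of parent configurations
    counts = np.zeros((q, r))
    for row in labels:
        j = 0
        for p in parents:                          # encode the parent configuration as j
            j = j * r + int(row[p])
        counts[j, int(row[i])] += 1
    return counts

def log_likelihood(counts, theta):
    """L_D(theta) of Eq. (3) for one node, treating 0 * log(0) as 0."""
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(theta[mask])))
```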

Since we have complete training data, i.e., for each frame we have the intensity labels for all 12 AUs, Maximum Likelihood (ML) estimation can be described as a constrained optimization problem, i.e., maximize Eq. (4) subject to n equality constraints:

max_theta L_D(theta)
s.t. g_ij(theta) = Sum_{k=1}^{r_i} theta_ijk - 1 = 0    (4)

where g_ij imposes the constraint that the parameters of each node sum to 1 over all the states of that node, 1 <= i <= n and 1 <= j <= q_i. Solving the above equations, we get theta*_ijk = n_ijk / Sum_k n_ijk.

D. DBN Inference

Given the complete DBN model and the AU observations, we can estimate the true states of the hidden nodes by maximizing their posterior probability. Let AU_{1:N}^t represent the nodes for the N target AUs at time t. Given the available evidence up to time t, O_{1:N}^{1:t}, the probability p(AU_{1:N}^t | O_{1:N}^{1:t}) can be factorized and computed via the facial activity model by performing the DBN updating process described in [28]. Because of the recursive nature of the inference process, as well as the simple network topology, the inference can be implemented rather efficiently.

V. EXPERIMENTAL RESULTS

In the following section, we utilize the DISFA database to evaluate the performance of automatic measurement of the intensity of spontaneous action units. First we introduce the contents of DISFA, and then we report the results of the proposed system for measuring the intensity of the 12 AUs of this database.

A. DISFA Database Description

The Denver Intensity of Spontaneous Facial Action (DISFA) database [9], [30] contains videos of the spontaneous facial expressions of 27 adult subjects of different ethnicities (i.e., Asian, Caucasian, Hispanic, and African American). The facial images were video recorded by a high-resolution camera at 20 fps while every subject watched a 4-minute emotive audio-video stimulus clip. The intensities of 12 AUs (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26) were coded by a FACS coder, and the six levels of AU intensity were reported on an ordinal scale (0-5, where 0 represents the absence of an AU and 1-5 represent intensities from trace through maximum, respectively) [3]. The database also contains a set of 66 landmark points that represent the coordinates of important components of the human face, such as the corners of the eyes and the boundary of the lips [9]. In this study, we utilized all the video frames of DISFA (about 125,000 frames) for measuring the intensity of the 12 AUs.

B. Results Analysis

We evaluate our system based on leave-one-subject-out cross validation and report the average recognition results over all 27 subjects.

TABLE I. AU intensity recognition results (ICC) using different features on the DISFA database, comparing SVM alone with SVM followed by DBN inference, for the HOG and Gabor features, per AU and on average.

In order to compare the predicted and manually coded intensities of action units, we calculate the Intra-Class Correlation (ICC). The ICC ranges from 0 to 1 and is a measure of correlation or conformity for a data set with multiple targets [29]. In other words, the ICC measures reliability in studies where n targets are rated by k judges (in this paper, k = 2 and n = 6). The ICC is similar to the Pearson correlation and is preferred when computing consistency between judges or measurement devices. The ICC is defined as

ICC = (BMS - EMS) / (BMS + (k - 1) EMS)    (5)

where BMS is the between-targets mean square and EMS is the residual mean square defined by Analysis Of Variance (ANOVA). That is, the ICC indicates the proportion of total variance due to differences between targets. See [29] for additional details.
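As a worked example of Eq. (5), the ICC can be computed from the two-way ANOVA mean squares of an n x k rating matrix (here, predicted versus manually coded intensities); the sketch follows the consistency form of [29] and is ours, not the authors' evaluation code:

```python
import numpy as np

def icc_consistency(ratings):
    """ICC of Eq. (5) for an (n_targets, k_judges) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    bss = k * np.sum((ratings.mean(axis=1) - grand) ** 2)   # between-targets SS
    jss = n * np.sum((ratings.mean(axis=0) - grand) ** 2)   # between-judges SS
    ess = np.sum((ratings - grand) ** 2) - bss - jss        # residual SS
    bms = bss / (n - 1)                                     # between-targets mean square
    ems = ess / ((n - 1) * (k - 1))                         # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)

# Example: perfect agreement between the two "judges" gives ICC = 1.
print(icc_consistency(np.array([[0, 0], [2, 2], [5, 5], [3, 3]])))
```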
For AU observation extraction, we employed two types of features, i.e., the HOG feature and the Gabor feature, each followed by SVM classification. Given the image observations, we estimate the intensity of each AU through the same DBN model, and the results are given in Table I. From Table I we can see that, for both types of features, employing the DBN model yields an improvement over using the image-driven method alone, and when the image observations are not very accurate, i.e., for the HOG feature observations, the improvement is significant. This is because the enhancement of the framework mainly comes from combining the DBN model with the image-driven methods, so erroneous image observations can be compensated for through the dynamic and semantic relationships encoded in the DBN. For instance, AU20 (lip stretcher) is not well recognized using either the HOG or the Gabor features, because the activation of AU20 produces only subtle facial appearance changes. However, AU20 (lip stretcher) is strongly exclusive with AU25 (lips apart), which is recognized with high accuracy with both kinds of features. By encoding such relationships in the DBN model, the ICC of AU20 is increased from 0.49 to 0.54 for the HOG feature observation, and from 0.53 to 0.55 for

the Gabor feature observation. Similarly, by modeling the co-occurrence relationship between AU15 and AU17, for the HOG features the ICC of AU15 is increased from 0.58 to 0.61, and that of AU17 is also increased from 0.53. Hence, we can conclude that considering the semantic relationships among AUs, as well as the temporal information, does help in analyzing spontaneous facial actions.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a unified probabilistic framework for measuring the intensity of spontaneous facial action units from image sequences. Our framework consists of two independent but collaborative components, i.e., observation extraction and DBN inference. The enhancement of our framework mainly comes from combining the DBN model with image-driven methods. For instance, the overall ICC value increased from 0.67 to 0.70 for the HOG features and from 0.76 to 0.77 for the Gabor features, which demonstrates that the unified probabilistic framework can improve the performance of an AU intensity measurement system. In this study, we focused on facial images from the frontal view. In order to deal with a more comprehensive problem in measuring the intensity of spontaneous action units, as future work we will expand our framework by introducing another layer of hidden nodes to model head movements.

REFERENCES

[1] C. Breazeal, Sociable Machines: Expressive Social Exchange Between Humans and Robots, Sc.D. dissertation, Department of Electrical Engineering and Computer Science, MIT, 2000.
[2] F. Dornaika and B. Raducanu, Facial Expression Recognition for HCI Applications, Prentice Hall Computer Applications in Electrical Engineering Series, 2009.
[3] P. Ekman, W. V. Friesen, and J. C. Hager, Facial Action Coding System, Salt Lake City, UT: A Human Face.
[4] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR 05), 2005.
[5] M. H. Mahoor, S. Cadavid, D. S. Messinger, and J. F. Cohn, A Framework for Automated Measurement of the Intensity of Non-Posed Facial Action Units, 2nd IEEE Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB), Miami Beach, June 25, 2009.
[6] K. L. Schmidt, Z. Ambadar, J. F. Cohn, and L. I. Reed, Movement Differences Between Deliberate and Spontaneous Facial Expressions: Zygomaticus Major Action in Smiling, Journal of Nonverbal Behavior, vol. 30(1), 2006.
[7] P. Lucey, J. F. Cohn, K. M. Prkachin, P. Solomon, and I. Matthews, Painful Data: The UNBC-McMaster Shoulder Pain Expression Archive Database, IEEE International Conference on Automatic Face and Gesture Recognition (FG2011), Santa Barbara, CA, March 2011.
[8] S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, and X. Wang, A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference, IEEE Transactions on Multimedia, vol. 12, no. 7, Nov. 2010.
[9] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, DISFA: A Spontaneous Facial Action Intensity Database, IEEE Transactions on Affective Computing, revised and resubmitted.
[10] T. Kanade, J. Cohn, and Y. Tian, Comprehensive Database for Facial Expression Analysis, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 46-53, 2000.
[11] M. Pantic, M. Valstar, R. Rademaker, and L.
Maat, Web-Based Database for Facial Expression Analysis, IEEE International Conference on Multimedia and Expo (ICME), 6-8 July 2005.
[12] N. Alyüz, B. Gökberk, H. Dibeklioğlu, A. Savran, A. A. Salah, L. Akarun, and B. Sankur, 3D Face Recognition Benchmarks on the Bosphorus Database with Focus on Facial Expressions, The First COST 2101 Workshop on Biometrics and Identity Management (BIOID 2008), Roskilde University, Denmark, May 2008.
[13] M. S. Bartlett, G. C. Littlewort, C. Lainscsek, I. Fasel, M. G. Frank, and J. R. Movellan, Fully Automatic Facial Action Recognition in Spontaneous Behavior, 7th International Conference on Automatic Face and Gesture Recognition, 2006.
[14] S. M. Mavadati, M. H. Mahoor, K. Bartlett, and P. Trinh, Automatic Detection of Non-posed Facial Action Units, in Proceedings of the IEEE International Conference on Image Processing (ICIP), Sep.-Oct. 2012.
[15] R. Sprengelmeyer and I. Jentzsch, Event Related Potentials and the Perception of Intensity in Facial Expressions, Neuropsychologia, vol. 44, 2006.
[16] N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, June 2005.
[17] Y. Tian, T. Kanade, and J. F. Cohn, Evaluation of Gabor-Wavelet-Based Facial Action Unit Recognition in Image Sequences of Increasing Complexity, International Conference on Automatic Face and Gesture Recognition, p. 229, 2002.
[18] I. K. Fodor, A Survey of Dimension Reduction Techniques.
[19] L. Cayton, Algorithms for Manifold Learning, University of California, San Diego, Tech. Rep., 2005.
[20] C. Lee and A. Elgammal, Nonlinear Shape and Appearance Models for Facial Expression Analysis and Synthesis, IEEE Conference on Computer Vision and Pattern Recognition, I:313-320.
[21] M. Belkin and P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, vol. 15, 2003.
[22] V. N. Vapnik, An Overview of Statistical Learning Theory, IEEE Transactions on Neural Networks, vol. 10, no. 5, Sep. 1999.
[23] Y. Tong, W. Liao, and Q. Ji, Facial Action Unit Recognition by Exploiting Their Dynamic and Semantic Relationships, IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, 2007.
[24] G. Schwarz, Estimating the Dimension of a Model, The Annals of Statistics, vol. 6, 1978.
[25] C. P. de Campos and Q. Ji, Efficient Structure Learning of Bayesian Networks Using Constraints, Journal of Machine Learning Research, 2011.
[26] K. Schmidt and J. Cohn, Dynamics of Facial Expression: Normative Characteristics and Individual Differences, in Proc. IEEE Int'l Conf. Multimedia and Expo.
[27] D. S. Messinger, W. I. Mattson, M. H. Mahoor, and J. F. Cohn, The Eyes Have It: Making Positive Expressions More Positive and Negative Expressions More Negative, Emotion, vol. 12, 2012.
[28] K. B. Korb and A. E. Nicholson, Bayesian Artificial Intelligence, Chapman and Hall/CRC.
[29] P. E. Shrout and J. L. Fleiss, Intraclass Correlations: Uses in Assessing Rater Reliability, Psychological Bulletin, vol. 86(2), Mar. 1979.
[30]


More information

FACIAL EXPRESSION RECOGNITION FROM IMAGE SEQUENCES USING SELF-ORGANIZING MAPS

FACIAL EXPRESSION RECOGNITION FROM IMAGE SEQUENCES USING SELF-ORGANIZING MAPS International Archives of Photogrammetry and Remote Sensing. Vol. XXXII, Part 5. Hakodate 1998 FACIAL EXPRESSION RECOGNITION FROM IMAGE SEQUENCES USING SELF-ORGANIZING MAPS Ayako KATOH*, Yasuhiro FUKUI**

More information

Hybrid HMM and HCRF model for sequence classification

Hybrid HMM and HCRF model for sequence classification Hybrid HMM and HCRF model for sequence classification Y. Soullard and T. Artières University Pierre and Marie Curie - LIP6 4 place Jussieu 75005 Paris - France Abstract. We propose a hybrid model combining

More information

Human Activities: Handling Uncertainties Using Fuzzy Time Intervals

Human Activities: Handling Uncertainties Using Fuzzy Time Intervals The 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, 2009 Human Activities: Handling Uncertainties Using Fuzzy Time Intervals M. S. Ryoo 1,2 and J. K. Aggarwal 1 1 Computer & Vision

More information

Shu Kong. Department of Computer Science, UC Irvine

Shu Kong. Department of Computer Science, UC Irvine Ubiquitous Fine-Grained Computer Vision Shu Kong Department of Computer Science, UC Irvine Outline 1. Problem definition 2. Instantiation 3. Challenge and philosophy 4. Fine-grained classification with

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Affective Game Engines: Motivation & Requirements

Affective Game Engines: Motivation & Requirements Affective Game Engines: Motivation & Requirements Eva Hudlicka Psychometrix Associates Blacksburg, VA hudlicka@ieee.org psychometrixassociates.com DigiPen Institute of Technology February 20, 2009 1 Outline

More information

EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS

EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS 1 KRISHNA MOHAN KUDIRI, 2 ABAS MD SAID AND 3 M YUNUS NAYAN 1 Computer and Information Sciences, Universiti Teknologi PETRONAS, Malaysia 2 Assoc.

More information

VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING

VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING Yuming Fang, Zhou Wang 2, Weisi Lin School of Computer Engineering, Nanyang Technological University, Singapore 2 Department of

More information

Effect of Sensor Fusion for Recognition of Emotional States Using Voice, Face Image and Thermal Image of Face

Effect of Sensor Fusion for Recognition of Emotional States Using Voice, Face Image and Thermal Image of Face Effect of Sensor Fusion for Recognition of Emotional States Using Voice, Face Image and Thermal Image of Face Yasunari Yoshitomi 1, Sung-Ill Kim 2, Takako Kawano 3 and Tetsuro Kitazoe 1 1:Department of

More information

Cancer Cells Detection using OTSU Threshold Algorithm

Cancer Cells Detection using OTSU Threshold Algorithm Cancer Cells Detection using OTSU Threshold Algorithm Nalluri Sunny 1 Velagapudi Ramakrishna Siddhartha Engineering College Mithinti Srikanth 2 Velagapudi Ramakrishna Siddhartha Engineering College Kodali

More information

Task oriented facial behavior recognition with selective sensing

Task oriented facial behavior recognition with selective sensing Computer Vision and Image Understanding 100 (2005) 385 415 www.elsevier.com/locate/cviu Task oriented facial behavior recognition with selective sensing Haisong Gu a, Yongmian Zhang a, Qiang Ji b, * a

More information

Formulating Emotion Perception as a Probabilistic Model with Application to Categorical Emotion Classification

Formulating Emotion Perception as a Probabilistic Model with Application to Categorical Emotion Classification Formulating Emotion Perception as a Probabilistic Model with Application to Categorical Emotion Classification Reza Lotfian and Carlos Busso Multimodal Signal Processing (MSP) lab The University of Texas

More information

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian The 29th Fuzzy System Symposium (Osaka, September 9-, 3) A Fuzzy Inference Method Based on Saliency Map for Prediction Mao Wang, Yoichiro Maeda 2, Yasutake Takahashi Graduate School of Engineering, University

More information

Hierarchical Age Estimation from Unconstrained Facial Images

Hierarchical Age Estimation from Unconstrained Facial Images Hierarchical Age Estimation from Unconstrained Facial Images STIC-AmSud Jhony Kaesemodel Pontes Department of Electrical Engineering Federal University of Paraná - Supervisor: Alessandro L. Koerich (/PUCPR

More information

FERA Second Facial Expression Recognition and Analysis Challenge

FERA Second Facial Expression Recognition and Analysis Challenge FERA 2015 - Second Facial Expression Recognition and Analysis Challenge Michel F. Valstar 1, Timur Almaev 1, Jeffrey M. Girard 2, Gary McKeown 3, Marc Mehu 4, Lijun Yin 5, Maja Pantic 6,7 and Jeffrey F.

More information

Emotion Detection Through Facial Feature Recognition

Emotion Detection Through Facial Feature Recognition Emotion Detection Through Facial Feature Recognition James Pao jpao@stanford.edu Abstract Humans share a universal and fundamental set of emotions which are exhibited through consistent facial expressions.

More information

Automated Tessellated Fundus Detection in Color Fundus Images

Automated Tessellated Fundus Detection in Color Fundus Images University of Iowa Iowa Research Online Proceedings of the Ophthalmic Medical Image Analysis International Workshop 2016 Proceedings Oct 21st, 2016 Automated Tessellated Fundus Detection in Color Fundus

More information

Beyond R-CNN detection: Learning to Merge Contextual Attribute

Beyond R-CNN detection: Learning to Merge Contextual Attribute Brain Unleashing Series - Beyond R-CNN detection: Learning to Merge Contextual Attribute Shu Kong CS, ICS, UCI 2015-1-29 Outline 1. RCNN is essentially doing classification, without considering contextual

More information

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012 Intelligent Edge Detector Based on Multiple Edge Maps M. Qasim, W.L. Woon, Z. Aung Technical Report DNA #2012-10 May 2012 Data & Network Analytics Research Group (DNA) Computing and Information Science

More information

Automatic Facial Expression Recognition Using Boosted Discriminatory Classifiers

Automatic Facial Expression Recognition Using Boosted Discriminatory Classifiers Automatic Facial Expression Recognition Using Boosted Discriminatory Classifiers Stephen Moore and Richard Bowden Centre for Vision Speech and Signal Processing University of Surrey, Guildford, GU2 7JW,

More information

User Affective State Assessment for HCI Systems

User Affective State Assessment for HCI Systems Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) 12-31-2004 Xiangyang Li University of Michigan-Dearborn Qiang

More information

Object recognition and hierarchical computation

Object recognition and hierarchical computation Object recognition and hierarchical computation Challenges in object recognition. Fukushima s Neocognitron View-based representations of objects Poggio s HMAX Forward and Feedback in visual hierarchy Hierarchical

More information

IMPLEMENTATION OF AN AUTOMATED SMART HOME CONTROL FOR DETECTING HUMAN EMOTIONS VIA FACIAL DETECTION

IMPLEMENTATION OF AN AUTOMATED SMART HOME CONTROL FOR DETECTING HUMAN EMOTIONS VIA FACIAL DETECTION IMPLEMENTATION OF AN AUTOMATED SMART HOME CONTROL FOR DETECTING HUMAN EMOTIONS VIA FACIAL DETECTION Lim Teck Boon 1, Mohd Heikal Husin 2, Zarul Fitri Zaaba 3 and Mohd Azam Osman 4 1 Universiti Sains Malaysia,

More information

THE data used in this project is provided. SEIZURE forecasting systems hold promise. Seizure Prediction from Intracranial EEG Recordings

THE data used in this project is provided. SEIZURE forecasting systems hold promise. Seizure Prediction from Intracranial EEG Recordings 1 Seizure Prediction from Intracranial EEG Recordings Alex Fu, Spencer Gibbs, and Yuqi Liu 1 INTRODUCTION SEIZURE forecasting systems hold promise for improving the quality of life for patients with epilepsy.

More information

Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine

Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine Valfredo Pilla Jr, André Zanellato, Cristian Bortolini, Humberto R. Gamba and Gustavo Benvenutti Borba Graduate

More information

Recognition of facial expressions using Gabor wavelets and learning vector quantization

Recognition of facial expressions using Gabor wavelets and learning vector quantization Engineering Applications of Artificial Intelligence 21 (2008) 1056 1064 www.elsevier.com/locate/engappai Recognition of facial expressions using Gabor wavelets and learning vector quantization Shishir

More information

Recognizing Scenes by Simulating Implied Social Interaction Networks

Recognizing Scenes by Simulating Implied Social Interaction Networks Recognizing Scenes by Simulating Implied Social Interaction Networks MaryAnne Fields and Craig Lennon Army Research Laboratory, Aberdeen, MD, USA Christian Lebiere and Michael Martin Carnegie Mellon University,

More information

Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space

Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space 2010 International Conference on Pattern Recognition Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space Mihalis A. Nicolaou, Hatice Gunes and Maja Pantic, Department

More information

For Micro-expression Recognition: Database and Suggestions

For Micro-expression Recognition: Database and Suggestions For Micro-expression Recognition: Database and Suggestions Wen-Jing Yan a,b, Su-Jing Wang a, Yong-Jin Liu c, Qi Wu d, Xiaolan Fu a,1 a State Key Laboratory of Brain and Cognitive Science, Institute of

More information

Relational Learning based Happiness Intensity Analysis in a Group

Relational Learning based Happiness Intensity Analysis in a Group 2016 IEEE International Symposium on Multimedia Relational Learning based Happiness Intensity Analysis in a Group Tuoerhongjiang Yusufu, Naifan Zhuang, Kai Li, Kien A. Hua Department of Computer Science

More information

Personalized Facial Attractiveness Prediction

Personalized Facial Attractiveness Prediction Personalized Facial Attractiveness Prediction Jacob Whitehill and Javier R. Movellan Machine Perception Laboratory University of California, San Diego La Jolla, CA 92093, USA {jake,movellan}@mplab.ucsd.edu

More information

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE SAKTHI NEELA.P.K Department of M.E (Medical electronics) Sengunthar College of engineering Namakkal, Tamilnadu,

More information

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,

More information

A Study on Automatic Age Estimation using a Large Database

A Study on Automatic Age Estimation using a Large Database A Study on Automatic Age Estimation using a Large Database Guodong Guo WVU Guowang Mu NCCU Yun Fu BBN Technologies Charles Dyer UW-Madison Thomas Huang UIUC Abstract In this paper we study some problems

More information

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network International Journal of Electronics Engineering, 3 (1), 2011, pp. 55 58 ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network Amitabh Sharma 1, and Tanushree Sharma 2

More information

EigenBody: Analysis of body shape for gender from noisy images

EigenBody: Analysis of body shape for gender from noisy images EigenBody: Analysis of body shape for gender from noisy images Matthew Collins, Jianguo Zhang, Paul Miller, Hongbin Wang and Huiyu Zhou Institute of Electronics Communications and Information Technology

More information

Age Estimation based on Multi-Region Convolutional Neural Network

Age Estimation based on Multi-Region Convolutional Neural Network Age Estimation based on Multi-Region Convolutional Neural Network Ting Liu, Jun Wan, Tingzhao Yu, Zhen Lei, and Stan Z. Li 1 Center for Biometrics and Security Research & National Laboratory of Pattern

More information

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018 Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this

More information

COMPARISON BETWEEN GMM-SVM SEQUENCE KERNEL AND GMM: APPLICATION TO SPEECH EMOTION RECOGNITION

COMPARISON BETWEEN GMM-SVM SEQUENCE KERNEL AND GMM: APPLICATION TO SPEECH EMOTION RECOGNITION Journal of Engineering Science and Technology Vol. 11, No. 9 (2016) 1221-1233 School of Engineering, Taylor s University COMPARISON BETWEEN GMM-SVM SEQUENCE KERNEL AND GMM: APPLICATION TO SPEECH EMOTION

More information