Fusion of visible and thermal images for facial expression recognition


Front. Comput. Sci., 2014, 8(2): 232-242

Fusion of visible and thermal images for facial expression recognition

Shangfei WANG 1,2, Shan HE 1,2, Yue WU 3, Menghua HE 1,2, Qiang JI 3

1 School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
2 Key Lab of Computing and Communicating Software of Anhui Province, Hefei, China
3 Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2014

Received July 30, 2013; accepted October 14, 2013
E-mail: sfwang@ustc.edu.cn

Abstract  Most present research into facial expression recognition focuses on the visible spectrum, which is sensitive to illumination change. In this paper, we focus on integrating thermal infrared data with visible spectrum images for spontaneous facial expression recognition. First, active appearance model (AAM) parameters and three defined head motion features are extracted from visible spectrum images, and several thermal statistical features are extracted from infrared (IR) images. Second, feature selection is performed using the F-test statistic. Third, Bayesian networks (BNs) and support vector machines (SVMs) are proposed for both decision-level and feature-level fusion. Experiments on the natural visible and infrared facial expression (NVIE) spontaneous database show the effectiveness of the proposed methods, and demonstrate the supplementary role of thermal IR images in visible facial expression recognition.

Keywords  facial expression recognition, feature-level fusion, decision-level fusion, support vector machine, Bayesian network, thermal infrared images, visible spectrum images

1 Introduction

Facial expression, a primary method by which humans display their emotions, has drawn growing attention in many research areas related to human-computer interaction and psychology. In recent years, considerable progress has been made in the field of facial expression recognition using visible spectrum images and videos [1-5]. However, most existing methods are not robust enough for deployment in uncontrolled environments. Illumination change is the most important factor, because it can significantly influence the appearance of visible images. In contrast, thermal infrared (IR) images, which record temperature distributions, are not sensitive to illumination conditions. Thus, IR-based facial expression recognition algorithms can improve recognition performance in uncontrolled environments. While IR images are robust to illumination change, they also have several drawbacks that visible images do not: IR images are sensitive to the surrounding temperature, and glass is opaque to IR radiation [6]. Combining the complementary strengths of visible and IR images for facial expression recognition is therefore promising. However, to the best of our knowledge, little attention has been paid to facial expression recognition that fuses visible and IR images [7-9].

In this paper, we focus on integrating IR with visible images for spontaneous facial expression recognition. Two Bayesian networks (BNs) and two support vector machines (SVMs) are presented to perform decision-level and feature-level fusion, respectively, of visible spectrum and IR images for facial expression recognition. First, features are extracted from visible spectrum and infrared images. Several studies have suggested that using both geometric and appearance features may achieve good performance [1]. Thus, the active

appearance model (AAM) parameters, which capture both appearance and shape information [10], are extracted from apex expression images. As head motion has been shown to be useful for facial expression recognition, we also define three head motion features [11]. For IR images, considering changes in the environmental temperature and the temperature drift of IR cameras, the relative temperature, which is the difference between the temperature matrices of the apex and onset images, is adopted instead of the absolute temperature. Several statistical features are extracted from these temperature matrices. After feature extraction, we perform feature selection using the F-test statistic. Then, the structures of the fusion models are defined, and we parameterize the fusion models. Using the trained model, the true state of the expression is inferred through probabilistic inference with BNs or classified by SVMs. Finally, we evaluate the proposed methods on the natural visible and infrared facial expression (NVIE) spontaneous database [12] to show their effectiveness.

A review of related work is presented in Section 2. The fusion methods for facial expression recognition are explained in Section 3. Experiments are given in Section 4, followed by conclusions in Section 5.

2 Related work

Recently, a number of researchers have turned their attention to spontaneous facial expression recognition. A review on this topic can be found in [1]. As mentioned in [1], only a few studies have investigated fusing information from face and head motions to improve recognition performance. The study of Littlewort et al. [13] shows that head pose and head motion contain substantial information about emotion. Cohn et al. [14] proposed a method to recognize facial action units (AUs) from spontaneous expression videos using head motions and facial features. Their study shows a high correlation between brow and head motions. Tong et al. [15] proposed a unified dynamic Bayesian network (DBN) based facial action model to discover and learn spatiotemporal relationships between rigid head motions and non-rigid facial motions. They then performed robust and reliable spontaneous facial action recognition by combining these relationships with image observations. Their study suggests that the model's built-in spatiotemporal relationships between rigid head motions and non-rigid facial motions can compensate for erroneous AU measurements. Valstar et al. [16] combined the face modality with information from head and shoulder motions to automatically distinguish between posed and spontaneous smiles. Their study indicates that humans also reliably reveal their intentions through body language, such as head movements. Gunes and Pantic [17] proposed using head motions in spontaneous conversations as an extra dimension of information for predicting emotions. All the above studies suggest that head motions are useful for improving the performance of spontaneous facial expression recognition.

Facial expression recognition in the IR spectrum has received relatively little attention compared to facial expression recognition in the visible spectrum [18-23]. Hernández et al. [19] defined regional descriptors of IR images using a gray level co-occurrence matrix (GLCM). These descriptors were used to distinguish the expressions of surprise, happiness, and anger.
Yoshitomi [21] extracted features by performing a two-dimensional discrete cosine transform (2D-DCT) to convert the grayscale values of each facial block into their frequency components, and then used kNN (k-nearest neighbors) as a classifier. Khan et al. [20] used thermal variations along the major facial muscles as features to discern between posed and evoked expressions. Jarlier et al. [18] investigated the thermal patterns associated with the performance of specific AUs. Shen et al. [23] proposed a spontaneous facial expression recognition method using thermal videos. Shen et al.'s method uses AdaBoost with weak kNN classifiers to classify facial expressions along the arousal and valence dimensions. All these studies show that the information provided by infrared images is useful for recognizing facial expressions.

To the best of our knowledge, little attention has been paid to facial expression recognition by fusing images from the visible and infrared spectra. Yoshitomi et al. [7] described a method for recognizing affective states using decision-level fusion of voices, visible facial images, and infrared facial images. First, visible spectrum and IR features are extracted by DCT. Then, two neural network (NN) classifiers are trained for the visible and IR features respectively, and hidden Markov models (HMMs) are adopted for emotion detection from voices. Finally, the results of the three classifiers are integrated using simple weighted voting. Yoshitomi et al. evaluate their approach on a posed expression database, and they do not report the relationships between the selected features and facial expressions. Wang et al. [8] and Wang and He [9] proposed two feature-level fusion methods for fusing visible and IR images for facial expression recognition. The F-statistic [8] and multiple genetic algorithms (GAs) [9] are adopted to select visible and IR features, and then a kNN classifier is used to recognize happiness, fear, and disgust.

Although researchers have focused for several years on face recognition (rather than facial expression recognition) by fusing visible and IR images, and have already proven the effectiveness of fusion for face recognition, the role of IR images in expression recognition has not been well explored. We cannot take for granted that the role of IR images in expression recognition is the same as that in face recognition, since face recognition and expression recognition are different tasks. A good face recognition system should be invariant to facial expression, while a good facial expression recognition system should be independent of identity. Furthermore, the underlying mechanisms that enable IR images to help face recognition and expression recognition are different, although both tasks take advantage of the robustness of IR images to illumination changes. Specifically, IR images record the temperature distributions formed by facial vein branches. The facial vascular network is highly characteristic of each individual, much as a fingerprint is; thus, IR images can be used for face recognition [24]. In contrast, emotional changes increase adrenaline levels and regulate blood flow. Redistribution of blood flow in superficial blood vessels causes abrupt changes in local skin temperature that can be recorded by IR images. The muscle movements caused by facial expressions also cause movement of the facial vein branches. These changes make IR images reflect changes of emotion and expression [25]. Because fusion of visible and IR images for facial expression recognition is relatively new, we believe that many promising techniques have not yet been investigated.

Our work makes the following three main contributions: 1) We are among the first to introduce IR to expression recognition, and we introduce several new models to effectively integrate IR with visible images, which is novel for the affective computing community. 2) We are the first to propose both decision-level and feature-level fusion methods that fuse the visible and infrared spectra for facial expression recognition. 3) We report the relative significance of temperature variations of different facial regions for thermal facial expression recognition.

3 Proposed methods

Our proposed method for facial expression recognition, combining IR and visible spectrum images, consists of three parts, as shown in Fig. 1. After obtaining the visible and IR images from the database, visible and IR features are extracted. Feature selection is then performed using the F-test statistic. Given the expression labels and extracted features of the training set, the parameters of the BNs and SVMs are learned. During testing, the extracted features of a test sample are used as the input of the learned BN and SVM models for expression recognition.

Fig. 1  Block diagram of the proposed method

3.1 Visible feature extraction

3.1.1 Head motion

Three features, namely the velocities of head motion along the x- and y-axes and the head's rotational speed, are calculated as head motion features. First, the subject's eyes are located using an AdaBoost and Haar feature eye location method [11]. Then, head motion features are calculated from the coordinates of the pupils using Eqs. (1)-(3).
$$\mathrm{Velocity}_x = \frac{C_x^{\mathrm{apex}} - C_x^{\mathrm{onset}}}{\mathrm{frame}^{\mathrm{apex}} - \mathrm{frame}^{\mathrm{onset}}}, \qquad (1)$$

$$\mathrm{Velocity}_y = \frac{C_y^{\mathrm{apex}} - C_y^{\mathrm{onset}}}{\mathrm{frame}^{\mathrm{apex}} - \mathrm{frame}^{\mathrm{onset}}}, \qquad (2)$$

$$\mathrm{Velocity}_r = \frac{\arctan\!\left(\frac{R_y^{\mathrm{apex}} - L_y^{\mathrm{apex}}}{R_x^{\mathrm{apex}} - L_x^{\mathrm{apex}}}\right) - \arctan\!\left(\frac{R_y^{\mathrm{onset}} - L_y^{\mathrm{onset}}}{R_x^{\mathrm{onset}} - L_x^{\mathrm{onset}}}\right)}{\mathrm{frame}^{\mathrm{apex}} - \mathrm{frame}^{\mathrm{onset}}}, \qquad (3)$$

where $(L_x, L_y)$ represents the coordinates of the left pupil, $(R_x, R_y)$ the right pupil, and $(C_x, C_y)$ the center of the two pupils. The apex frame refers to the frame where the expression is most exaggerated, and the onset frame refers to the frame where the expression just begins.
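For illustration, the following Python sketch computes the three head motion features of Eqs. (1)-(3). It is not the authors' code; the function name and argument layout are hypothetical, and math.atan2 is used in place of the arctangent of the slope for numerical robustness (the two agree for a roughly upright head).

```python
# Minimal sketch (not the authors' code) of the head motion features in Eqs. (1)-(3),
# assuming pupil coordinates have already been located in the onset and apex frames.
import math

def head_motion_features(left_onset, right_onset, left_apex, right_apex,
                         frame_onset, frame_apex):
    """Return (Velocity_x, Velocity_y, Velocity_r); each pupil argument is an (x, y) tuple."""
    dt = frame_apex - frame_onset  # elapsed frames between onset and apex

    # Center of the two pupils in each frame (C_x, C_y in the paper).
    cx_onset = (left_onset[0] + right_onset[0]) / 2.0
    cy_onset = (left_onset[1] + right_onset[1]) / 2.0
    cx_apex = (left_apex[0] + right_apex[0]) / 2.0
    cy_apex = (left_apex[1] + right_apex[1]) / 2.0

    # Eqs. (1) and (2): translation velocity of the pupil center.
    velocity_x = (cx_apex - cx_onset) / dt
    velocity_y = (cy_apex - cy_onset) / dt

    # Eq. (3): change of the inter-pupil line angle per frame (rotation speed).
    angle_apex = math.atan2(right_apex[1] - left_apex[1], right_apex[0] - left_apex[0])
    angle_onset = math.atan2(right_onset[1] - left_onset[1], right_onset[0] - left_onset[0])
    velocity_r = (angle_apex - angle_onset) / dt

    return velocity_x, velocity_y, velocity_r
```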

3.1.2 AAM

Since the AAM captures information about both appearance and shape, we use it to extract visible features from the apex expression images, i.e., the images with the most exaggerated expression. We use the am_tools software from [26] to extract the AAM features.

3.2 IR feature extraction

To eliminate the influence of ambient temperature and the temperature drift of IR cameras, IR features are extracted from the differential temperature matrix between the apex and onset IR images. First, four points (the centers of the eyes, the tip of the nose, and the tip of the jaw) are automatically located on the apex and onset IR images using the method devised by Tong et al. [27]. Using these points, the apex and onset IR images are rotated, resized, and cropped into images of size H × W. Subsequently, the differential temperature matrix is obtained by subtracting the temperature matrix of the normalized onset IR image from that of the normalized apex IR image, as shown in Fig. 2(b).

Fig. 2  (a) Visible image used for constructing the AAM; (b) diagram of thermal feature extraction

Owing to the opaqueness of glass to IR light, the subarea of the eyes is not taken into consideration (since the subject may be wearing glasses). We divide the remaining facial region into several grids of the same size w × w. For each grid, several statistical parameters, including the mean, standard deviation, minimum, and maximum, are calculated as IR features.
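As a concrete illustration of this step, the sketch below computes per-grid statistics from a differential temperature matrix with NumPy. It is a minimal sketch under stated assumptions (a pre-normalized H × W matrix, a grid size w that divides the region, and an optional boolean mask marking the excluded eye subarea); the function and loader names are hypothetical.

```python
# Minimal sketch (not the authors' code): per-grid statistics of a differential
# temperature matrix that has already been rotated, resized, and cropped to H x W.
import numpy as np

def grid_thermal_features(diff_temp, w, eye_mask=None):
    """Return mean/std/min/max for every w x w grid cell outside the eye area."""
    H, W = diff_temp.shape
    features = []
    for top in range(0, H - H % w, w):
        for left in range(0, W - W % w, w):
            if eye_mask is not None and eye_mask[top:top + w, left:left + w].any():
                continue  # skip grids overlapping the eye region (glasses block IR)
            cell = diff_temp[top:top + w, left:left + w]
            features.extend([cell.mean(), cell.std(), cell.min(), cell.max()])
    return np.array(features)

# Example usage with a hypothetical loader:
# apex_temp, onset_temp = load_normalized_ir(sample)
# ir_features = grid_thermal_features(apex_temp - onset_temp, w=4)
```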
3.3 Feature selection

In order to select distinctive features for classification, the F-test statistic [28] is used. The significance of all features can be ranked by sorting their corresponding F-test statistics in descending order. This assumes that each feature follows a normal distribution. The F-test statistic of feature $x$ is calculated according to Eq. (4):

$$F\text{-statistic} = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_{c=1}^{N} n_c (\bar{x}_c - \bar{x})^2 \,/\, (N-1)}{\sum_{c=1}^{N} (n_c - 1)\,\sigma_c^2 \,/\, (n-N)}, \qquad (4)$$

where $N$ is the number of classes, $n_c$ is the number of samples of class $c$, $n$ is the total number of samples, $\bar{x}_c$ is the average of feature $x$ within class $c$, $\bar{x}$ is the global mean of feature $x$, and $\sigma_c^2$ is the variance of feature $x$ within class $c$. Features are selected from high to low according to their calculated F-statistics.
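The following NumPy sketch (illustrative, not the authors' code) ranks features by the F-statistic of Eq. (4); scikit-learn's f_classif computes an equivalent one-way ANOVA F-value and could be used instead.

```python
# Minimal sketch of F-test feature ranking per Eq. (4); X is (samples, features),
# y holds integer class labels.
import numpy as np

def f_statistics(X, y):
    classes = np.unique(y)
    N, n = len(classes), X.shape[0]
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - grand_mean) ** 2
        within += (n_c - 1) * Xc.var(axis=0, ddof=1)
    return (between / (N - 1)) / (within / (n - N))

def rank_features(X, y):
    """Indices of features sorted from the highest F-statistic to the lowest."""
    return np.argsort(f_statistics(X, y))[::-1]
```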

3.4 Expression recognition

3.4.1 BN models of facial expression recognition

Decision-level fusion using BN  We design our BN model as shown in Fig. 3(a). The model consists of three layers: the detected expression layer, the intermediate expression layer, and the feature layer. All nodes in the detected expression layer and the intermediate expression layer are discrete nodes with N states, and each state corresponds to an expression to be recognized. The relations between the detected expression and the visible or IR expressions are established through links, which capture the uncertainty of the visible and IR results. The lowest layer in the model is the feature layer, containing the head motion and AAM features extracted from visible images, and the IR features. All variables in this layer are observable. The relations between the second layer and the lowest layer capture the dependencies between an expression and its visual and thermal features.

Fig. 3  (a) Decision-level fusion model using BN; (b) feature-level fusion model using BN

Given the model structure, the conditional probability distribution (CPD) associated with each node in the model needs to be learned from the training set. Because the database provides only the true expression label of each image, while the intermediate layer represents the expression labels obtained from the different kinds of features, the values of the intermediate expression layer are not available in the training set. Therefore, we divide parameter learning into two phases. In the first phase, the expression labels are taken as the values of the nodes in the intermediate expression layer in order to learn the CPDs of the nodes in the feature layer. The CPDs of the feature layer are parameterized by multivariate Gaussian distributions. Specifically, for the visible features node, let $F_v$ denote the visible features and $E_v$ denote the value of the visible expression node, and assume that the CPD of the visible features $p(F_v \mid E_v = k)$ is a multivariate Gaussian with mean vector $\mu_k$ and covariance matrix $\Sigma_k$:

$$p(F_v \mid E_v = k) = \frac{\exp\{-\tfrac{1}{2}(F_v - \mu_k)^{\mathrm{T}} \Sigma_k^{-1} (F_v - \mu_k)\}}{(2\pi)^{n/2}\, |\Sigma_k|^{1/2}}. \qquad (5)$$

The parameters $\mu_k$ and $\Sigma_k$ for expression $E_v = k$ are learned by maximum likelihood estimation, as shown in Eqs. (6) and (7):

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} x_{ik}, \qquad (6)$$

$$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (x_{ik} - \hat{\mu}_k)(x_{ik} - \hat{\mu}_k)^{\mathrm{T}}, \qquad (7)$$

where $N_k$ is the number of samples of expression $k$, and $x_{ik}$ is the feature vector of the $i$th sample of expression $k$. Subsequently, we use the features in the training set as evidence to obtain intermediate results according to the CPDs of the feature nodes. In the second phase, the conditional probability table (CPT) of each node in the intermediate expression layer is calculated from the true expressions and the previous intermediate results. Specifically, for the visible expression node:

$$p(E_v = i \mid E = j) = \frac{N_{ji}}{N_j}, \qquad (8)$$

where $N_j$ is the number of samples of expression $j$ in the training set, and $N_{ji}$ denotes the number of samples of expression $j$ whose visible intermediate result is $i$.

Given the BN model and a query image, we find the expression of the image according to Eq. (9):

$$E^* = \arg\max_E \; p(E)\, p(F_v, F_{ir} \mid E) = \arg\max_E \; p(E) \Big[\sum_k p(E_v = k \mid E)\, p(F_v \mid E_v = k)\Big] \Big[\sum_j p(E_{ir} = j \mid E)\, p(F_{ir} \mid E_{ir} = j)\Big], \qquad (9)$$

where $E_v$ and $E_{ir}$ denote the states of the visible and IR expression nodes respectively, and $F_v$ and $F_{ir}$ are the feature vectors of the visible and IR images. Therefore, the true state of an expression can be derived by probabilistic inference.

Feature-level fusion using BN  For feature-level fusion using BN, we construct a simplified graphical model, a naive Bayes classifier, to combine visible and IR features at the feature level, as shown in Fig. 3(b). The model consists of two nodes: the expression node and the feature node. The expression node is a discrete node with N states, where each state corresponds to an expression to be recognized. The feature node is a continuous node representing the combined vector of visible and IR features. The CPD associated with the feature node is parameterized as a multivariate Gaussian distribution, and its parameters are estimated by maximum likelihood. In the testing phase, a new sample is assigned the expression with the maximum posterior probability, i.e.,

$$E^* = \arg\max_E \; p(E)\, p(F \mid E). \qquad (10)$$

3.4.2 SVM models of facial expression recognition

Decision-level fusion using SVM  For SVM-based decision-level fusion, two linear SVM classifiers are first adopted to estimate the probabilities of expressions from visible spectrum images and IR images. Pairwise coupling is used for multi-class probability estimation with the SVM [29]. Then, the recognition results of the two modalities are combined by another linear SVM to obtain the final expression.

Feature-level fusion using SVM  For SVM-based feature-level fusion, we directly concatenate the visible and infrared features into a higher-dimensional feature vector and feed it into a linear SVM for classification [30].
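To make the BN-based decision-level fusion of Section 3.4.1 concrete, here is a minimal NumPy/SciPy sketch of the two-phase learning (Eqs. (5)-(8)) and the inference of Eq. (9). It is our illustrative reading of the model, not the authors' implementation; a small regularization of the covariance estimates and a smoothing term in the CPT are added for numerical stability, and the class prior p(E) is taken to be the empirical label frequency.

```python
# Minimal sketch (not the authors' code) of BN-based decision-level fusion.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_cpds(X, y, n_classes, reg=1e-3):
    """Eqs. (6)-(7): per-class mean and covariance of a feature node (MLE)."""
    cpds = []
    for k in range(n_classes):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        cov = np.cov(Xk, rowvar=False, bias=True) + reg * np.eye(X.shape[1])
        cpds.append((mu, cov))
    return cpds

def intermediate_labels(X, cpds):
    """Phase-1 intermediate result: the class whose Gaussian CPD is most likely."""
    ll = np.column_stack([multivariate_normal.logpdf(X, m, c) for m, c in cpds])
    return ll.argmax(axis=1)

def fit_cpt(y_true, y_intermediate, n_classes):
    """Eq. (8): p(E_v = i | E = j) from counts, with a small smoothing term."""
    cpt = np.full((n_classes, n_classes), 1e-6)
    for j, i in zip(y_true, y_intermediate):
        cpt[j, i] += 1.0
    return cpt / cpt.sum(axis=1, keepdims=True)

def predict(Fv, Fir, prior, cpds_v, cpds_ir, cpt_v, cpt_ir):
    """Eq. (9): MAP estimate of the expression given both feature vectors."""
    pv = np.array([multivariate_normal.pdf(Fv, m, c) for m, c in cpds_v])
    pir = np.array([multivariate_normal.pdf(Fir, m, c) for m, c in cpds_ir])
    score = prior * (cpt_v @ pv) * (cpt_ir @ pir)  # sums over the intermediate states
    return score.argmax()
```

In a full pipeline, fit_gaussian_cpds, intermediate_labels, and fit_cpt would be run separately on the visible and IR features of the training folds, and predict applied to each test sample.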
4 Experiments and analysis

4.1 Experimental conditions

To evaluate the effectiveness of the proposed methods, experiments are performed on data chosen from the NVIE spontaneous database [12]. The NVIE spontaneous database contains both visible and IR images of more than 100 subjects. All visible expression sequences have been labeled by five students on the intensity of the six basic facial expressions (happiness, sadness, surprise, fear, anger, and disgust) on a

three-point scale (0, 1, or 2). The expression with the highest average intensity is used as the expression label for the visible and IR image sequences. In this paper, we choose experimental samples under the following criteria. First, the average intensity associated with a sample's label must be larger than 1. Second, as three expressions (happiness, fear, and disgust) were successfully induced in most cases when the NVIE spontaneous database was constructed, the sample label should be one of happiness, fear, and disgust. Third, the sample should consist of both visible and IR image sequences. Following these criteria, we obtain a total of 535 samples from 123 subjects under different illuminations.

The located facial region of each IR image is normalized to a fixed size, and the grid size is set to 4 × 4; the statistics of all grids form the extracted IR feature set. Before extracting AAM features, the experimental samples are divided equally into two groups according to their illumination. Each time, we take one group as the training set to train the AAM. Sixty-one landmark points are manually marked on each sample of the training set, as shown in Fig. 2(a). The samples in the other group are then marked using the learned AAM. We extract a total of 33 visible features, including three head motion features and 30 AAM features.

Our experimental results are obtained by applying ten-fold cross validation over all samples, with folds split by subject. Owing to the high dimension of the IR features, we perform IR feature selection: another ten-fold cross validation is applied within the training set to choose the number of IR features that achieves the highest accuracy on the validation set. We compare the proposed methods with using visible or IR features alone.

4.2 Feature analyses

The F-statistic of a feature reflects that feature's importance to facial expression recognition, and we select features in descending order of their calculated F-statistics. We analyze the top 50 IR features according to the F-statistics calculated from the expression labels. As shown in Fig. 4, most of the top 50 selected IR features are extracted from the mouth and cheek regions. This seems to indicate that the temperature variations of the mouth and cheek regions are the most reliable sources for facial expression recognition compared to other facial regions. We also sort the visible features according to their F-statistics and find that the head motion features rank high among the visible features, which indicates that head motion information is highly related to facial expression.

Fig. 4  Counts of selected features from different facial areas

4.3 Experimental results of Bayesian networks

Due to the poor performance of IR features used alone, we consider IR features as a supplement to visible facial expression recognition. Figures 5(a) and 5(b) show the recognition accuracy curves of decision-level and feature-level fusion using BNs with different numbers of IR features on the validation set. Bold lines show the average accuracy curves. For decision-level fusion, the average accuracy curve reaches its highest accuracy when 15 IR features are selected. For feature-level fusion, the highest accuracies of the ten folds are reached when fewer IR features are selected.
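The selection of the number of IR features follows the protocol of Section 4.1: an outer subject-independent ten-fold cross validation, with an inner ten-fold cross validation on each training set to pick how many top-ranked IR features to keep. The sketch below outlines this protocol; it is illustrative only, with hypothetical helper names (rank_features from the earlier F-test sketch, and a stand-in train_and_score that fits one of the fusion models and returns its accuracy).

```python
# Minimal sketch (not the authors' experimental code) of the evaluation protocol.
import numpy as np
from sklearn.model_selection import GroupKFold

def evaluate(X_vis, X_ir, y, subjects, candidate_counts, train_and_score):
    outer = GroupKFold(n_splits=10)
    outer_acc = []
    for tr, te in outer.split(X_vis, y, groups=subjects):
        ranked = rank_features(X_ir[tr], y[tr])  # F-test ranking on training data only

        # Inner CV: pick the IR feature count with the best validation accuracy.
        inner = GroupKFold(n_splits=10)
        inner_acc = {m: [] for m in candidate_counts}
        for itr, iva in inner.split(X_vis[tr], y[tr], groups=subjects[tr]):
            for m in candidate_counts:
                keep = ranked[:m]
                acc = train_and_score(X_vis[tr][itr], X_ir[tr][itr][:, keep], y[tr][itr],
                                      X_vis[tr][iva], X_ir[tr][iva][:, keep], y[tr][iva])
                inner_acc[m].append(acc)
        best_m = max(candidate_counts, key=lambda m: np.mean(inner_acc[m]))

        # Retrain on the full training fold with the chosen count, test on held-out subjects.
        keep = ranked[:best_m]
        outer_acc.append(train_and_score(X_vis[tr], X_ir[tr][:, keep], y[tr],
                                         X_vis[te], X_ir[te][:, keep], y[te]))
    return np.mean(outer_acc)
```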
The confusion matrices of the different methods using BNs are shown in Table 1. Each cell represents the number of times the row emotion (actual) is classified as the column emotion (classification result). The overall recognition accuracies of decision-level fusion and feature-level fusion are 72.15% and 73.83%, respectively. The recognition accuracies are 71.59% and 46.54% for visible spectrum images and IR images, respectively, when used alone. When only IR features are used, the overall recognition accuracy is poor, especially for distinguishing fear from disgust. When only visible features are used, the performance is much better, but it also has certain limitations; for example, it is much more likely than IR to misclassify disgust as happiness. Using decision-level and feature-level fusion, the overall recognition accuracy is improved, especially for fear. IR features thus play a supplementary role in visible facial expression recognition, and feature-level fusion appears more effective than decision-level fusion.

To further assess the effectiveness of the fusion models, we calculate Cohen's kappa coefficient, a statistical measure of inter-rater agreement for categorical items (a smaller kappa indicates greater bias), for the four methods, considering both correctly and incorrectly classified samples. The Cohen's kappa coefficients for the decision-level fusion method, the feature-level fusion method, visible features alone, and IR features alone are 0.572, 0.597, 0.562, and 0.197, respectively.
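For reference, both statistics can be computed directly from a confusion matrix as sketched below. This is our reading of the paper's description (overall kappa from the confusion matrix, and the "discrepancy" reported next as the standard deviation of per-class, one-vs-rest kappa scores), not the authors' code; sklearn.metrics.cohen_kappa_score offers an equivalent overall kappa computed from label vectors.

```python
# Minimal sketch of Cohen's kappa and the per-class discrepancy, computed from a
# confusion matrix C where C[i, j] counts actual class i predicted as class j.
import numpy as np

def cohen_kappa(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    po = np.trace(C) / n                                  # observed agreement
    pe = (C.sum(axis=1) * C.sum(axis=0)).sum() / n ** 2   # chance agreement
    return (po - pe) / (1.0 - pe)

def discrepancy(C):
    """Standard deviation of one-vs-rest kappa scores over the classes."""
    C = np.asarray(C, dtype=float)
    kappas = []
    for k in range(C.shape[0]):
        tp = C[k, k]
        fn = C[k].sum() - tp
        fp = C[:, k].sum() - tp
        tn = C.sum() - tp - fn - fp
        kappas.append(cohen_kappa(np.array([[tp, fn], [fp, tn]])))
    return float(np.std(kappas))
```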

Fig. 5  Expression recognition accuracy with different numbers of selected IR features using (a) BN-based decision-level fusion; (b) BN-based feature-level fusion; (c) SVM-based decision-level fusion; (d) SVM-based feature-level fusion

Table 1  Confusion matrices of the methods using Bayesian networks (rows: actual emotion, disgust/fear/happiness; column groups: decision-level fusion, feature-level fusion, visible features alone, and IR features alone; the final rows give each method's accuracy /%, kappa, and discrepancy)

We conclude that the classification accuracy is improved by adding IR features. Another useful statistic is the discrepancy in the kappa scores across the three expression types: a highly discrepant model might be excellent at classifying certain expressions but poor for others. Here we calculate the standard deviation of the kappa scores of the three expressions. The discrepancies for the decision-level fusion method, the feature-level fusion method, visible features alone, and IR features alone are 0.097, 0.102, 0.106, and 0.117, respectively; the lower standard deviations again indicate the effectiveness of the fusion methods. From these experiments, we can see that the proposed methods successfully capture the supplementary information in IR images to help improve visible facial expression recognition.

4.4 Experimental results of SVMs

Figures 5(c) and 5(d) show the accuracy curves of decision-level and feature-level fusion using SVMs with different numbers of IR features on the validation set. Compared with the BN-based methods, the fusion accuracies of the SVM-based methods on the validation set change more slowly as the number of selected IR features increases. For decision-level fusion, the average accuracy curve reaches its highest accuracy when about 55 IR features are selected. For

feature-level fusion, the highest accuracies of the ten folds are reached when fewer IR features are selected, which is similar to the BN case.

Table 2  Confusion matrices of the methods using SVMs (rows: actual emotion, disgust/fear/happiness; column groups: decision-level fusion, feature-level fusion, visible features alone, and IR features alone; the final rows give each method's accuracy /%, kappa, and discrepancy)

The recognition results of the methods using SVMs are shown in Table 2. The overall recognition accuracies of decision-level and feature-level fusion are 76.45% and 76.82%, respectively. The overall recognition accuracies for using visible and IR features separately are 75.33% and 52.90%, respectively. The overall recognition accuracy is improved by using decision-level and feature-level fusion, especially for recognizing fear. We also calculate the kappa coefficients and discrepancies for the different methods, as shown in Table 2. In terms of both kappa coefficient and discrepancy, the performance of facial expression recognition is improved by both decision-level and feature-level fusion compared to each single modality. Furthermore, the performance of feature-level fusion is better than that of decision-level fusion. The above analysis demonstrates the supplementary role of IR images in improving facial expression recognition.

Comparing the SVM-based methods with the BN-based methods, we find that SVM, a typical discriminative learning method, achieves better performance than BN, a generative model. Setting aside computational issues and the handling of missing data, the literature suggests that discriminative classifiers are almost always superior to generative ones [31, 32]. For BN-based fusion, the accuracy is improved from 71.59% to 72.15% and 73.83% by decision-level and feature-level fusion respectively, an average increase of 1.40%. For SVM-based fusion, the accuracy is improved from 75.33% to 76.45% and 76.82%, an average increase of 1.30%. The average improvement for BN-based fusion is slightly higher than that for SVM. This may indicate that, as a probabilistic model, a BN can effectively capture uncertainties in the data and allows data from different modalities to be systematically represented and integrated.

4.5 Comparison with related works

In this section, we compare the performance of our proposed methods with related works. Currently, there are only three reported studies of facial expression recognition based on the fusion of IR and visible images [7-9]. The database used in [7] is not publicly available, and the method in [7] also uses an audio signal alongside IR and visible images; therefore, it is difficult to perform a fair comparison with [7], since our method uses only IR and visible images. The other two studies [8, 9] perform experiments on the NVIE database, so we are able to compare results directly. Table 3 compares our results with those in [8, 9]. Our methods outperform theirs in both accuracy and kappa coefficient. In [8] and [9], by introducing infrared features, the accuracies are increased by 1.90% and 1.10%, respectively.
For the proposed BN-based feature-level fusion method, the accuracy is increased by 2.20%; this demonstrates that our proposed methods can better capture the supplementary information in IR images.

Table 3  Comparison with other fusion methods (accuracy /% and kappa for BN decision-level fusion, BN feature-level fusion, SVM decision-level fusion, SVM feature-level fusion, F-statistic & kNN [8], and multiple GAs & kNN [9])

5 Conclusions

In this paper, we have proposed the fusion of visible and IR features for facial expression recognition. Specifically, two BN-based and two SVM-based methods are proposed to perform decision-level fusion and feature-level fusion, respectively. Our experiments on the NVIE spontaneous database show that our methods yield an average improvement of 1.35% in facial expression recognition performance com-

pared with using only visible features, especially for fear. We also found that feature-level fusion is more effective than decision-level fusion. The results of IR feature selection indicate that the temperature variations of the mouth and cheek regions are the most reliable sources for facial expression recognition compared to other facial regions. We also performed a comparative study that demonstrates the superiority of our methods over existing techniques.

Current research into expression recognition from thermal infrared images has adopted either temperature statistical parameters extracted from facial regions of interest or hand-crafted features that are commonly used in the visible spectrum [23, 33]. So far, no features have been designed specifically for thermal infrared images. The immaturity of IR features may lessen the ability of IR images to improve visible facial expression recognition. In the future, we intend to investigate specific thermal features; a potential solution is to learn features using deep learning methods. Furthermore, a larger-scale expression database consisting of both visible and thermal images is needed. Besides studying the fusion of different image modalities, we also plan to investigate fusion of different types of features, such as fusion of scale-invariant feature transform (SIFT) and Gabor features, and of visible images recorded under different illumination conditions.

Acknowledgements  This paper is supported by the National Natural Science Foundation of China, the Special Innovation Project on Speech of Anhui Province, a project from the Anhui Science and Technology Agency (1106c), and the Fundamental Research Funds for the Central Universities. We also acknowledge partial support from the US National Science Foundation.

References

1. Zeng Z, Pantic M, Roisman G, Huang T. A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(1)
2. Wang Z, Liu F, Wang L. Survey of facial expression recognition based on computer vision. Computer Engineering, 2006, 32(11)
3. Liu X M, Tan H C, Zhang Y J. New research advances in facial expression recognition. Journal of Image and Graphics, 2006, 11(10)
4. Xue Y L, Mao X, Guo Y, Lv S W. The research advance of facial expression recognition in human computer interaction. Journal of Image and Graphics, 2009, 14(5)
5. Bettadapura V. Face expression recognition and analysis: the state of the art. CoRR, 2012
6. Wesley A, Buddharaju P, Pienta R, Pavlidis I. A comparative analysis of thermal and visual modalities for automated facial expression recognition. Advances in Visual Computing, 2012
7. Yoshitomi Y, Kim S, Kawano T, Kitazoe T. Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face. In: Proceedings of the 9th IEEE International Workshop on Robot and Human Interactive Communication. 2000
8. Wang Z, Wang S. Spontaneous facial expression recognition by using feature-level fusion of visible and thermal infrared images. In: Proceedings of the 2011 IEEE International Workshop on Machine Learning for Signal Processing. 2011
9. Wang S, He S. Spontaneous facial expression recognition by fusing thermal infrared and visible images. Intelligent Autonomous Systems, 2013, 194
10. Cootes T, Edwards G, Taylor C. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6)
11. Lv Y, Wang S.
A spontaneous facial expression recognition method using head motion and AAM features. In: Proceedings of the 2nd World Congress on Nature and Biologically Inspired Computing. 2010
12. Wang S, Liu Z, Lv S, Lv Y, Wu G, Peng P, Chen F, Wang X. A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Transactions on Multimedia, 2010, 12(7)
13. Littlewort G, Whitehill J, Wu T, Butko N, Ruvolo P, Movellan J, Bartlett M. The motion in emotion: a CERT-based approach to the FERA emotion challenge. In: Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition and Workshops. 2011
14. Cohn J, Reed L, Ambadar Z, Xiao J, Moriyama T. Automatic analysis and recognition of brow actions and head motion in spontaneous facial behavior. In: Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics. 2004
15. Tong Y, Chen J, Ji Q. A unified probabilistic framework for spontaneous facial action modeling and understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(2)
16. Valstar M, Gunes H, Pantic M. How to distinguish posed from spontaneous smiles using geometric features. In: Proceedings of the 9th International Conference on Multimodal Interfaces. 2007
17. Gunes H, Pantic M. Dimensional emotion prediction from spontaneous head gestures for interaction with sensitive artificial listeners. In: Proceedings of the 10th International Conference on Intelligent Virtual Agents. 2010, 6356
18. Jarlier S, Grandjean D, Delplanque S, N'Diaye K, Cayeux I, Velazco M, Sander D, Vuilleumier P, Scherer K. Thermal analysis of facial muscles contractions. IEEE Transactions on Affective Computing, 2011, 2(1)
19. Hernández B, Olague G, Hammoud R, Trujillo L, Romero E. Visual learning of texture descriptors for facial expression recognition in thermal imagery. Computer Vision and Image Understanding, 2007, 106(2)
20. Khan M, Ward R, Ingleby M. Classifying pretended and evoked facial expressions of positive and negative affective states using infrared measurement of skin temperature. ACM Transactions on Applied Perception, 2009, 6(1)
21. Yoshitomi Y. Facial expression recognition for speaker using thermal image processing and speech recognition system. In: Proceedings of the 10th World Scientific and Engineering Academy and Society Inter-

national Conference on Applied Computer Science. 2010
22. Puri C, Olson L, Pavlidis I, Levine J, Starren J. StressCam: non-contact measurement of users' emotional states through thermal imaging. In: Proceedings of the 2005 Conference on Human Factors in Computing Systems. 2005
23. Shen P, Wang S, Liu Z. Facial expression recognition from infrared thermal videos. Intelligent Autonomous Systems, 2013, 194
24. Buddharaju P, Pavlidis I, Manohar C. Face recognition beyond the visible spectrum. In: Advances in Biometrics. Springer
25. Pavlidis I, Levine J, Baukol P. Thermal image analysis for anxiety detection. In: Proceedings of the 2001 International Conference on Image Processing. 2001
26. Cootes T. am_tools. bim/software/am_tools_doc/
27. Tong Y, Wang Y, Zhu Z, Ji Q. Robust facial feature tracking under varying face pose and facial expression. Pattern Recognition, 2007, 40(11)
28. Ding C. Analysis of gene expression profiles: class discovery and leaf ordering. In: Proceedings of the 6th Annual International Conference on Computational Biology. 2002
29. Wu T F, Lin C J, Weng R C. Probability estimates for multi-class classification by pairwise coupling. The Journal of Machine Learning Research, 2004, 5
30. Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2: 27:1-27:27. Software available at tw/cjlin/libsvm
31. Long P M, Servedio R A. Discriminative learning can succeed where generative learning fails. In: Proceedings of the 19th Annual Conference on Learning Theory. 2006, 4005
32. Ng A Y, Jordan M I. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 2002, 14
33. Hermosilla G, Ruiz-del-Solar J, Verschae R, Correa M. A comparative study of thermal face recognition methods in unconstrained environments. Pattern Recognition, 2012, 45(7)

Shangfei Wang received her MS in circuits and systems and her PhD in signal and information processing from the University of Science and Technology of China (USTC), China, in 1999 and 2002, respectively. From 2004 to 2005, she was a postdoctoral research fellow at Kyushu University, Japan. She is currently an associate professor in the School of Computer Science and Technology, USTC. Dr. Wang is an IEEE member. Her research interests cover computer intelligence, affective computing, multimedia computing, information retrieval, and artificial environment design. She has authored or co-authored over 50 publications.

Shan He received his BS in Computer Science from Anhui Agriculture University, China, and his MS in Computer Science from the University of Science and Technology of China. His research interest is affective computing.

Yue Wu is a PhD candidate in the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, USA. Her research interest is computer vision.

Menghua He received her BS in Information and Computation Science from Anhui University, China. She is currently pursuing her MS in Computer Science at the University of Science and Technology of China. Her research interest is affective computing.

Qiang Ji received his PhD in Electrical Engineering from the University of Washington, USA. He is currently a professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI), USA.
He recently served as a program director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada at Reno, and the US Air Force Research Laboratory. Prof. Ji currently serves as the director of the Intelligent Systems Laboratory (ISL) at RPI. Prof. Ji's research interests are in computer vision, probabilis-

tic graphical models, information fusion, and their applications in various fields. He has published over 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies, including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies, including Honda and Boeing. Prof. Ji is an editor of several related IEEE and international journals, and he has served as a general chair, program chair, technical area chair, and program committee member of numerous international conferences and workshops. Prof. Ji is a fellow of the IAPR.


User Affective State Assessment for HCI Systems

User Affective State Assessment for HCI Systems Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) 12-31-2004 Xiangyang Li University of Michigan-Dearborn Qiang

More information

ANALYSIS OF FACIAL FEATURES OF DRIVERS UNDER COGNITIVE AND VISUAL DISTRACTIONS

ANALYSIS OF FACIAL FEATURES OF DRIVERS UNDER COGNITIVE AND VISUAL DISTRACTIONS ANALYSIS OF FACIAL FEATURES OF DRIVERS UNDER COGNITIVE AND VISUAL DISTRACTIONS Nanxiang Li and Carlos Busso Multimodal Signal Processing (MSP) Laboratory Department of Electrical Engineering, The University

More information

Hierarchical Age Estimation from Unconstrained Facial Images

Hierarchical Age Estimation from Unconstrained Facial Images Hierarchical Age Estimation from Unconstrained Facial Images STIC-AmSud Jhony Kaesemodel Pontes Department of Electrical Engineering Federal University of Paraná - Supervisor: Alessandro L. Koerich (/PUCPR

More information

Affective pictures and emotion analysis of facial expressions with local binary pattern operator: Preliminary results

Affective pictures and emotion analysis of facial expressions with local binary pattern operator: Preliminary results Affective pictures and emotion analysis of facial expressions with local binary pattern operator: Preliminary results Seppo J. Laukka 1, Antti Rantanen 1, Guoying Zhao 2, Matti Taini 2, Janne Heikkilä

More information

On Shape And the Computability of Emotions X. Lu, et al.

On Shape And the Computability of Emotions X. Lu, et al. On Shape And the Computability of Emotions X. Lu, et al. MICC Reading group 10.07.2013 1 On Shape and the Computability of Emotion X. Lu, P. Suryanarayan, R. B. Adams Jr., J. Li, M. G. Newman, J. Z. Wang

More information

Improved Intelligent Classification Technique Based On Support Vector Machines

Improved Intelligent Classification Technique Based On Support Vector Machines Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth

More information

Online Speaker Adaptation of an Acoustic Model using Face Recognition

Online Speaker Adaptation of an Acoustic Model using Face Recognition Online Speaker Adaptation of an Acoustic Model using Face Recognition Pavel Campr 1, Aleš Pražák 2, Josef V. Psutka 2, and Josef Psutka 2 1 Center for Machine Perception, Department of Cybernetics, Faculty

More information

Task oriented facial behavior recognition with selective sensing

Task oriented facial behavior recognition with selective sensing Computer Vision and Image Understanding 100 (2005) 385 415 www.elsevier.com/locate/cviu Task oriented facial behavior recognition with selective sensing Haisong Gu a, Yongmian Zhang a, Qiang Ji b, * a

More information

Skin color detection for face localization in humanmachine

Skin color detection for face localization in humanmachine Research Online ECU Publications Pre. 2011 2001 Skin color detection for face localization in humanmachine communications Douglas Chai Son Lam Phung Abdesselam Bouzerdoum 10.1109/ISSPA.2001.949848 This

More information

Emotion Affective Color Transfer Using Feature Based Facial Expression Recognition

Emotion Affective Color Transfer Using Feature Based Facial Expression Recognition , pp.131-135 http://dx.doi.org/10.14257/astl.2013.39.24 Emotion Affective Color Transfer Using Feature Based Facial Expression Recognition SeungTaek Ryoo and Jae-Khun Chang School of Computer Engineering

More information

MRI Image Processing Operations for Brain Tumor Detection

MRI Image Processing Operations for Brain Tumor Detection MRI Image Processing Operations for Brain Tumor Detection Prof. M.M. Bulhe 1, Shubhashini Pathak 2, Karan Parekh 3, Abhishek Jha 4 1Assistant Professor, Dept. of Electronics and Telecommunications Engineering,

More information

Performance of Gaussian Mixture Models as a Classifier for Pathological Voice

Performance of Gaussian Mixture Models as a Classifier for Pathological Voice PAGE 65 Performance of Gaussian Mixture Models as a Classifier for Pathological Voice Jianglin Wang, Cheolwoo Jo SASPL, School of Mechatronics Changwon ational University Changwon, Gyeongnam 64-773, Republic

More information

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1 1. INTRODUCTION Sign language interpretation is one of the HCI applications where hand gesture plays important role for communication. This chapter discusses sign language interpretation system with present

More information

Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE

Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE mkwahla@gmail.com Astt. Prof. Prabhjit Singh Assistant Professor, Department

More information

Gray level cooccurrence histograms via learning vector quantization

Gray level cooccurrence histograms via learning vector quantization Gray level cooccurrence histograms via learning vector quantization Timo Ojala, Matti Pietikäinen and Juha Kyllönen Machine Vision and Media Processing Group, Infotech Oulu and Department of Electrical

More information

EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS

EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS EMOTION DETECTION THROUGH SPEECH AND FACIAL EXPRESSIONS 1 KRISHNA MOHAN KUDIRI, 2 ABAS MD SAID AND 3 M YUNUS NAYAN 1 Computer and Information Sciences, Universiti Teknologi PETRONAS, Malaysia 2 Assoc.

More information

Enhanced Facial Expressions Recognition using Modular Equable 2DPCA and Equable 2DPC

Enhanced Facial Expressions Recognition using Modular Equable 2DPCA and Equable 2DPC Enhanced Facial Expressions Recognition using Modular Equable 2DPCA and Equable 2DPC Sushma Choudhar 1, Sachin Puntambekar 2 1 Research Scholar-Digital Communication Medicaps Institute of Technology &

More information

VIDEO SURVEILLANCE AND BIOMEDICAL IMAGING Research Activities and Technology Transfer at PAVIS

VIDEO SURVEILLANCE AND BIOMEDICAL IMAGING Research Activities and Technology Transfer at PAVIS VIDEO SURVEILLANCE AND BIOMEDICAL IMAGING Research Activities and Technology Transfer at PAVIS Samuele Martelli, Alessio Del Bue, Diego Sona, Vittorio Murino Istituto Italiano di Tecnologia (IIT), Genova

More information

A Multilevel Fusion Approach for Audiovisual Emotion Recognition

A Multilevel Fusion Approach for Audiovisual Emotion Recognition A Multilevel Fusion Approach for Audiovisual Emotion Recognition Girija Chetty & Michael Wagner National Centre for Biometric Studies Faculty of Information Sciences and Engineering University of Canberra,

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011 1057 A Framework for Automatic Human Emotion Classification Using Emotion Profiles Emily Mower, Student Member, IEEE,

More information

A Study on Automatic Age Estimation using a Large Database

A Study on Automatic Age Estimation using a Large Database A Study on Automatic Age Estimation using a Large Database Guodong Guo WVU Guowang Mu NCCU Yun Fu BBN Technologies Charles Dyer UW-Madison Thomas Huang UIUC Abstract In this paper we study some problems

More information

A Real-Time Human Stress Monitoring System Using Dynamic Bayesian Network

A Real-Time Human Stress Monitoring System Using Dynamic Bayesian Network A Real-Time Human Stress Monitoring System Using Dynamic Bayesian Network Wenhui Liao, Weihong Zhang, Zhiwei Zhu and Qiang Ji {liaow, zhangw9, zhuz, jiq}@rpi.edu Department of Electrical, Computer and

More information

IDENTIFICATION OF REAL TIME HAND GESTURE USING SCALE INVARIANT FEATURE TRANSFORM

IDENTIFICATION OF REAL TIME HAND GESTURE USING SCALE INVARIANT FEATURE TRANSFORM Research Article Impact Factor: 0.621 ISSN: 2319507X INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK IDENTIFICATION OF REAL TIME

More information

Recognition of facial expressions using Gabor wavelets and learning vector quantization

Recognition of facial expressions using Gabor wavelets and learning vector quantization Engineering Applications of Artificial Intelligence 21 (2008) 1056 1064 www.elsevier.com/locate/engappai Recognition of facial expressions using Gabor wavelets and learning vector quantization Shishir

More information

Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics

Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'18 85 Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics Bing Liu 1*, Xuan Guo 2, and Jing Zhang 1** 1 Department

More information

Research Proposal on Emotion Recognition

Research Proposal on Emotion Recognition Research Proposal on Emotion Recognition Colin Grubb June 3, 2012 Abstract In this paper I will introduce my thesis question: To what extent can emotion recognition be improved by combining audio and visual

More information

N RISCE 2K18 ISSN International Journal of Advance Research and Innovation

N RISCE 2K18 ISSN International Journal of Advance Research and Innovation The Computer Assistance Hand Gesture Recognition system For Physically Impairment Peoples V.Veeramanikandan(manikandan.veera97@gmail.com) UG student,department of ECE,Gnanamani College of Technology. R.Anandharaj(anandhrak1@gmail.com)

More information

Facial Feature Model for Emotion Recognition Using Fuzzy Reasoning

Facial Feature Model for Emotion Recognition Using Fuzzy Reasoning Facial Feature Model for Emotion Recognition Using Fuzzy Reasoning Renan Contreras, Oleg Starostenko, Vicente Alarcon-Aquino, and Leticia Flores-Pulido CENTIA, Department of Computing, Electronics and

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

Introduction to affect computing and its applications

Introduction to affect computing and its applications Introduction to affect computing and its applications Overview What is emotion? What is affective computing + examples? Why is affective computing useful? How do we do affect computing? Some interesting

More information

Decision tree SVM model with Fisher feature selection for speech emotion recognition

Decision tree SVM model with Fisher feature selection for speech emotion recognition Sun et al. EURASIP Journal on Audio, Speech, and Music Processing (2019) 2019:2 https://doi.org/10.1186/s13636-018-0145-5 RESEARCH Decision tree SVM model with Fisher feature selection for speech emotion

More information

A Semi-supervised Approach to Perceived Age Prediction from Face Images

A Semi-supervised Approach to Perceived Age Prediction from Face Images IEICE Transactions on Information and Systems, vol.e93-d, no.10, pp.2875 2878, 2010. 1 A Semi-supervised Approach to Perceived Age Prediction from Face Images Kazuya Ueki NEC Soft, Ltd., Japan Masashi

More information

FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES

FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES Riku Matsumoto, Hiroki Yoshimura, Masashi Nishiyama, and Yoshio Iwai Department of Information and Electronics,

More information

Facial Expression Analysis for Estimating Pain in Clinical Settings

Facial Expression Analysis for Estimating Pain in Clinical Settings Facial Expression Analysis for Estimating Pain in Clinical Settings Karan Sikka University of California San Diego 9450 Gilman Drive, La Jolla, California, USA ksikka@ucsd.edu ABSTRACT Pain assessment

More information

Emotional Design. D. Norman (Emotional Design, 2004) Model with three levels. Visceral (lowest level) Behavioral (middle level) Reflective (top level)

Emotional Design. D. Norman (Emotional Design, 2004) Model with three levels. Visceral (lowest level) Behavioral (middle level) Reflective (top level) Emotional Design D. Norman (Emotional Design, 2004) Model with three levels Visceral (lowest level) Behavioral (middle level) Reflective (top level) Emotional Intelligence (EI) IQ is not the only indicator

More information

EXTRACTION OF RETINAL BLOOD VESSELS USING IMAGE PROCESSING TECHNIQUES

EXTRACTION OF RETINAL BLOOD VESSELS USING IMAGE PROCESSING TECHNIQUES EXTRACTION OF RETINAL BLOOD VESSELS USING IMAGE PROCESSING TECHNIQUES T.HARI BABU 1, Y.RATNA KUMAR 2 1 (PG Scholar, Dept. of Electronics and Communication Engineering, College of Engineering(A), Andhra

More information

DISCRETE WAVELET PACKET TRANSFORM FOR ELECTROENCEPHALOGRAM- BASED EMOTION RECOGNITION IN THE VALENCE-AROUSAL SPACE

DISCRETE WAVELET PACKET TRANSFORM FOR ELECTROENCEPHALOGRAM- BASED EMOTION RECOGNITION IN THE VALENCE-AROUSAL SPACE DISCRETE WAVELET PACKET TRANSFORM FOR ELECTROENCEPHALOGRAM- BASED EMOTION RECOGNITION IN THE VALENCE-AROUSAL SPACE Farzana Kabir Ahmad*and Oyenuga Wasiu Olakunle Computational Intelligence Research Cluster,

More information

CPSC81 Final Paper: Facial Expression Recognition Using CNNs

CPSC81 Final Paper: Facial Expression Recognition Using CNNs CPSC81 Final Paper: Facial Expression Recognition Using CNNs Luis Ceballos Swarthmore College, 500 College Ave., Swarthmore, PA 19081 USA Sarah Wallace Swarthmore College, 500 College Ave., Swarthmore,

More information

Analysis of Speech Recognition Techniques for use in a Non-Speech Sound Recognition System

Analysis of Speech Recognition Techniques for use in a Non-Speech Sound Recognition System Analysis of Recognition Techniques for use in a Sound Recognition System Michael Cowling, Member, IEEE and Renate Sitte, Member, IEEE Griffith University Faculty of Engineering & Information Technology

More information

DEEP convolutional neural networks have gained much

DEEP convolutional neural networks have gained much Real-time emotion recognition for gaming using deep convolutional network features Sébastien Ouellet arxiv:8.37v [cs.cv] Aug 2 Abstract The goal of the present study is to explore the application of deep

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION 2018 IJSRSET Volume 4 Issue 2 Print ISSN: 2395-1990 Online ISSN : 2394-4099 National Conference on Advanced Research Trends in Information and Computing Technologies (NCARTICT-2018), Department of IT,

More information

CAN A SMILE REVEAL YOUR GENDER?

CAN A SMILE REVEAL YOUR GENDER? CAN A SMILE REVEAL YOUR GENDER? Antitza Dantcheva, Piotr Bilinski, Francois Bremond INRIA Sophia Antipolis, France JOURNEE de la BIOMETRIE 2017, Caen, 07/07/17 OUTLINE 1. Why Gender Estimation? 2. Related

More information

AUDIO-VISUAL EMOTION RECOGNITION USING AN EMOTION SPACE CONCEPT

AUDIO-VISUAL EMOTION RECOGNITION USING AN EMOTION SPACE CONCEPT 16th European Signal Processing Conference (EUSIPCO 28), Lausanne, Switzerland, August 25-29, 28, copyright by EURASIP AUDIO-VISUAL EMOTION RECOGNITION USING AN EMOTION SPACE CONCEPT Ittipan Kanluan, Michael

More information

Integrating User Affective State Assessment in Enhancing HCI: Review and Proposition

Integrating User Affective State Assessment in Enhancing HCI: Review and Proposition 192 The Open Cybernetics and Systemics Journal, 2008, 2, 192-205 Open Access Integrating User Affective State Assessment in Enhancing HCI: Review and Proposition Xiangyang Li * Department of Industrial

More information

A Unified Probabilistic Framework For Measuring The Intensity of Spontaneous Facial Action Units

A Unified Probabilistic Framework For Measuring The Intensity of Spontaneous Facial Action Units A Unified Probabilistic Framework For Measuring The Intensity of Spontaneous Facial Action Units Yongqiang Li 1, S. Mohammad Mavadati 2, Mohammad H. Mahoor and Qiang Ji Abstract Automatic facial expression

More information

Error Detection based on neural signals

Error Detection based on neural signals Error Detection based on neural signals Nir Even- Chen and Igor Berman, Electrical Engineering, Stanford Introduction Brain computer interface (BCI) is a direct communication pathway between the brain

More information

Using Bayesian Networks for Daily Activity Prediction

Using Bayesian Networks for Daily Activity Prediction Plan, Activity, and Intent Recognition: Papers from the AAAI 2013 Workshop Using Bayesian Networks for Daily Activity Prediction Ehsan Nazerfard School of Electrical Eng. and Computer Science Washington

More information

Affect Intensity Estimation using Multiple Modalities

Affect Intensity Estimation using Multiple Modalities Affect Intensity Estimation using Multiple Modalities Amol S. Patwardhan, and Gerald M. Knapp Department of Mechanical and Industrial Engineering Louisiana State University apatwa3@lsu.edu Abstract One

More information

Potential applications of affective computing in the surveillance work of CCTV operators

Potential applications of affective computing in the surveillance work of CCTV operators Loughborough University Institutional Repository Potential applications of affective computing in the surveillance work of CCTV operators This item was submitted to Loughborough University's Institutional

More information

A Deep Learning Approach for Subject Independent Emotion Recognition from Facial Expressions

A Deep Learning Approach for Subject Independent Emotion Recognition from Facial Expressions A Deep Learning Approach for Subject Independent Emotion Recognition from Facial Expressions VICTOR-EMIL NEAGOE *, ANDREI-PETRU BĂRAR *, NICU SEBE **, PAUL ROBITU * * Faculty of Electronics, Telecommunications

More information

Noise-Robust Speech Recognition Technologies in Mobile Environments

Noise-Robust Speech Recognition Technologies in Mobile Environments Noise-Robust Speech Recognition echnologies in Mobile Environments Mobile environments are highly influenced by ambient noise, which may cause a significant deterioration of speech recognition performance.

More information

Automated Image Biometrics Speeds Ultrasound Workflow

Automated Image Biometrics Speeds Ultrasound Workflow Whitepaper Automated Image Biometrics Speeds Ultrasound Workflow ACUSON SC2000 Volume Imaging Ultrasound System S. Kevin Zhou, Ph.D. Siemens Corporate Research Princeton, New Jersey USA Answers for life.

More information

Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine

Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine Facial Expression Classification Using Convolutional Neural Network and Support Vector Machine Valfredo Pilla Jr, André Zanellato, Cristian Bortolini, Humberto R. Gamba and Gustavo Benvenutti Borba Graduate

More information

VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING

VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING VIDEO SALIENCY INCORPORATING SPATIOTEMPORAL CUES AND UNCERTAINTY WEIGHTING Yuming Fang, Zhou Wang 2, Weisi Lin School of Computer Engineering, Nanyang Technological University, Singapore 2 Department of

More information