Chapter 1. Fusion of Manual and Non-Manual Information in American Sign Language Recognition


Sudeep Sarkar (Computer Science and Engineering, University of South Florida, Tampa, Florida; sarkar@cse.usf.edu), Barbara Loeding (Special Education, University of South Florida, Lakeland, Florida; bloeding@poly.usf.edu), and Ayush S. Parashar (Computer Science and Engineering, University of South Florida, Tampa, Florida)

We present a bottom-up approach to continuous American Sign Language (ASL) recognition that requires no wearable aids: simple low-level processes operate on images and build realistic representations that are fed into intermediate level processes to form sign hypotheses. At the intermediate level, we construct representations for both manual and non-manual aspects, such as hand movements, facial expressions, and head nods. The manual aspects are represented using relational distributions, which capture the statistical distribution of the relationships among the low-level primitives from the body parts. These relational distributions, which can be constructed without part-level tracking, are efficiently represented as points in the Space of Probability Functions (SoPF); manual dynamics are thus represented as tracks in this space. The dynamics of the facial expression accompanying a sign are also represented as tracks, but in an expression subspace constructed using principal component analysis (PCA). Head motions are represented as 2D image tracks. The integration of manual with non-manual information is sequential, with non-manual information refining the hypothesis set generated from manual information. We show that with image-based manual information alone, the correct detection rate is around 88%; with the addition of facial information, accuracy increases to 92%. Thus the face contributes valuable information towards ASL recognition. Negation in sentences is correctly detected in 90% of the cases using just 2D head motion information.

Introduction

While speech recognition has made rapid advances, sign language recognition is lagging behind. With the gradual shift to speech-based I/O devices, there is a great danger that persons who rely solely on sign languages for communication will be deprived of access to state-of-the-art technology unless there are significant advances in automated recognition of sign languages. Reviews of prior work in sign language recognition appear in Refs. 1 and 2. From these reviews, we can see that work in sign language recognition initially focused on the recognition of static gestures (e.g., Refs. 3-5) and isolated signs (e.g., Ref. 6).

Starner and Pentland 7 were the first to seriously consider continuous sign recognition. Using HMM-based representations, they achieved near-perfect recognition with sentences of fixed structure, i.e., containing a personal pronoun, verb, noun, adjective, and personal pronoun in that order. Vogler and Metaxas 8-10 were instrumental in significantly pushing the state of the art in automated ASL recognition using HMMs. In terms of the basic HMM formalism, they explored many variations, such as context-dependent HMMs, HMMs coupled with partially segmented sign streams, and parallel HMMs. The wide use of HMMs is also seen in recognizers for sign languages other than ASL. 1 While HMM-based methods perform very well with limited vocabularies, they suffer from scalability issues and require large amounts of training data.

Much of the work in continuous sign language recognition has avoided the very basic problem of segmentation and tracking of the hands by using wearable devices, such as colored gloves, data gloves, or magnetic markers, to obtain location features directly. For example, Vogler and Metaxas 8-10 used a 3D magnetic tracking system, Starner and Pentland 7 used colored gloves, and Ma et al. 11,12 used Cybergloves. However, since this is unnatural for signers, in our research we restrict ourselves to plain color images, without the use of any augmenting wearable devices.

Non-manual information, which refers to information from facial expressions, head motion, or torso movement, conveys linguistic information in ASL. 10,13 Much of the work in sign language recognition has concentrated on hand motion alone, i.e., manual information, although some work on the automated understanding of sign language facial expressions is under way. Non-manual information can provide vital cues. For example, head motion can be used to detect whether an ASL sentence contains Negation. For instance, the sentence I don't understand is manually signed exactly the same as I understand, except that there is a distinct head shake indicating Negation in the sentence I don't understand. There has been some work on detecting head shakes and nods, 15,18,19 but no results have been reported for continuous sign language recognition. In this paper, we use non-manual information to decrease insertion and deletion errors, and to determine whether there is Negation in a sentence using the motion trajectories of the head.

We concentrate on the problem of recognition of continuous sign language, i.e., signs in sentences, and not isolated signs or finger-spelled signs. We adopt a bottom-up approach with simple low-level processes operating on images to build realistic representations that are fed into intermediate level processes integrating manual and non-manual information.

Note that sign language is different from Signed English; the latter is an artificial construct that employs signs but uses English grammatical structure. We use the following ASL conventions in this paper. Text in italics indicates a sentence in English, for example, I can lipread. Text in capital letters indicates an ASL gloss, for example, LIPREAD CAN I; the ASL gloss for the sign lipread is LIPREAD. Negation in a sentence signed using non-manual markers is indicated by ^NOT or Negation. A multiword gloss for a single sign in ASL is indicated by a hyphen; for example, DONT-KNOW is a multiword gloss for a single sign.

Drawing on the emerging wisdom in computer vision that simple methods are usually the most robust when tested on large data sets, we use simple components. The low-level processes are fairly simple, involving the detection of skin and motion pixels and face detection by correlation with an eye template. This works for signing against simple backgrounds, which is the most commonly considered scenario. For more complex backgrounds, alternative strategies, such as those described in the literature, could be considered. The intermediate level consists of modeling the hand motion using relational distributions, which are efficiently represented as points in the Space of Probability Functions (SoPF). This captures the placement of the hands with respect to the body, but does not capture hand shape accurately. Many signs can be recognized based on just this global information. For signs that are strongly hand-shape dependent, alternative methods, such as that described in Ref. 23, can be used. The expression subspace, derived using PCA, is used to represent the dynamics of facial expression. This level also integrates non-manual with manual information to reduce the deletion and insertion errors. The third or topmost level, which we do not explore in this paper, would use context and grammatical information from ASL phonology to constrain and prune the hypothesis set generated by the intermediate level processes.

The primary contribution of this work is the demonstration that, even with fairly simple 2D representations, the use of non-manual information can improve ASL recognition. The integration of facial expression with manual information is confounded by the fact that expression events may not coincide exactly with the manual events. This work also constrains itself to pure image-based inputs and does not require external wearable aids, such as gloves, to enhance the low-level primitives.

Data Set

Fig. 1.1. Sample images in the dataset: (a) an image of the face taken during the sign PHONE, as captured by camera A; (b) a synchronous image of the upper body as captured by camera B.

One of the central issues in ASL recognition research is the data set used in the study. We constrained the domain to sentences that would be used while communicating with deaf people at airports. An ASL interpreter signed for the collection and helped with creating the ground truth. Two digital video cameras were used for collecting data; one captured images of the upper body, and the other synchronously captured face images. Fig. 1.1 shows a sample pair of views.

Some statistics of the data set are as follows. The dataset includes 5 instances of 25 distinct sentences, for a total of 125 sentence samples spanning 325 instances of ASL signs. There are 39 distinct ASL signs. Each sentence has 1 to 5 signs, with an average of 2.7 signs per sentence. The number of frames in a sentence varies from 51 to 163: the longest sentence, AIRPLANE POSTPONE AGAIN, MAD I, comprises 163 frames, and the shortest, YES, comprises 51 frames. The average number of frames per sentence is about 90. Sign length varies from 4 frames for the sign CANNOT to 71 frames for the sign LUGGAGE-HEAVY; on average a sign spans 18 frames, or about 0.6 second.

There are significant variations among the 5 instances of some of the sentences. For example, the sentence If the plane is delayed, I'll be mad was signed both as AIRPLANE POSTPONE AGAIN, MAD I and as AIRPLANE AGAIN POSTPONE, MAD I. Also, in one of the instances of the sentence I packed my suitcase, the ASL sign I was not present; this was also true for some other sentences. The reason, as given by the signer, was that signs like I are implicit while conversing in ASL and hence can be excluded. For some sentences, Negation is conveyed only through head shakes. For example, for the sentences I understand and I don't understand, the ASL gloss is the same (I UNDERSTAND); the only difference is that in the sentence I don't understand there is a head shake, i.e., a non-manual expression conveying the presence of Negation.

Low Level Processing

In much previous work on continuous ASL recognition, the detection and tracking of the hands has been simplified using colored gloves 7 or magnetic markers. 8 Recognizers for other sign languages have likewise used colored gloves or data gloves. There have been recent efforts to extract information and track directly from color images, without special devices, 6,20-22 but at added computational complexity. Our intermediate level representation, as we shall see later, does not require tracking of the hands; we just need to segment the hands and the face in each frame. Since segmentation is not the focus of this work, we use fairly simple ideas based on skin color to detect the hands and face. We cluster the skin pixels using the Expectation-Maximization (EM) algorithm with a Gaussian model for the clusters in the Lab color space. We use the 2-class version of the EM algorithm twice: first to separate the background, and a second time to separate the signer's clothing from the skin pixels.
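As a concrete illustration, the following is a minimal sketch of this two-stage EM segmentation. The library choices (scikit-image for the Lab conversion, scikit-learn for the Gaussian mixtures, SciPy for connected components) and the cluster-selection heuristics are ours, not the chapter's; only the two-stage 2-class EM structure and the 200-pixel blob threshold follow the text.

```python
# Minimal sketch of the two-stage EM skin segmentation described above.
# Cluster-selection heuristics (brightest cluster = foreground, reddest
# cluster = skin) are our assumptions, not the authors' method.
import numpy as np
from skimage import color
from sklearn.mixture import GaussianMixture
from scipy import ndimage

def skin_blobs(rgb_image, min_blob_size=200):
    """rgb_image: (H, W, 3) RGB image. Returns a boolean mask of skin blobs."""
    lab = color.rgb2lab(rgb_image)                 # work in Lab color space
    pixels = lab.reshape(-1, 3)

    # Stage 1: 2-class EM separates the background from the signer.
    gm1 = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
    labels1 = gm1.fit_predict(pixels)
    fg_class = np.argmax(gm1.means_[:, 0])         # assume brighter L = signer
    fg_mask = labels1 == fg_class

    # Stage 2: 2-class EM on the signer's pixels separates skin from clothing.
    gm2 = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
    labels2 = gm2.fit_predict(pixels[fg_mask])
    skin_class = np.argmax(gm2.means_[:, 1])       # assume redder a* = skin
    skin = np.zeros(pixels.shape[0], dtype=bool)
    skin[np.flatnonzero(fg_mask)] = labels2 == skin_class
    skin = skin.reshape(rgb_image.shape[:2])

    # Keep only connected blobs larger than min_blob_size pixels.
    blobs, n = ndimage.label(skin)
    sizes = ndimage.sum(skin, blobs, index=np.arange(1, n + 1))
    return np.isin(blobs, np.flatnonzero(sizes >= min_blob_size) + 1)
```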

Fig. 1.2(b) shows the segmentation after EM clustering of (a), where the skin and clothing pixels are separated from the background. Fig. 1.2(c) shows the output after the second application of EM to separate skin color from clothing. Blobs larger than 200 pixels are kept; this removes pixels that are close to skin color but do not form blobs big enough to be part of a hand or the face. Fig. 1.2(d) shows an example of the final blobs. These blobs, along with the color values and the edge pixels detected within them, comprise the low-level primitives.

Fig. 1.2. Segmentation of skin pixels using EM: (a) original color frame, (b) pixels obtained after the first application of EM, (c) skin pixels obtained after the second application of EM, and (d) final blobs corresponding to the hands and face.

Intermediate Level Representations

We have separate representations for manual movement (or hold), facial expression, and head motion. We would like these representations, in particular the manual motion representation, to be somewhat robust with respect to low-level errors. The manual motion (including no movement) representation does not require tracking of hands or fingers and emphasizes the 2D spatial relationships of the hands and face. The facial expression representation is a 2D view-based one, and the head motion model takes into account only projected 2D motion. Even without the use of 3D information, we demonstrate that robust ASL recognition is possible.

Manual Movement

Grounded on the observation that the organization, structure, and relationships among low-level primitives are more important than the primitives themselves, we focus on the statistical distribution of the relational attributes observed in the image, which we refer to as relational distributions. Such a statistical representation also alleviates the need for primitive-level correspondence, tracking, or registration across frames. Representations of this kind have been successfully used for modeling periodic motion in the context of identifying a person from gait. 24 Here, we use them to model the aperiodic motion in ASL signs. Primitive-level statistical distributions, such as orientation histograms, have been used for gesture recognition. 25 However, the only use of relational histograms that we are aware of is by Huet and Hancock, 26 who used them to model line distributions in the context of image database indexing. The novelty of relational distributions lies in that they offer a strategy for incorporating dynamic aspects. We refer the reader to Ref. 24 for details of the representation; here we just sketch the essentials.

Let F = {f_1, ..., f_N} represent the set of N primitives in an image. For us, these are the Canny edge pixels inside the low-level skin blobs described earlier. Let F_k represent a random k-tuple of primitives, and let the relationship among these k-tuple primitives be denoted by R_k. Let the relationships R_k be characterized by a set of M attributes A_k = {A_k1, ..., A_kM}. For ASL, we use the distances between two edge pixels in the vertical and horizontal directions, (dx, dy), as the attributes. We normalize the distance between the pixels by a distance D, which is inversely related to the distance from the camera. The shape of the pattern can then be represented by the joint probability function P(A_k = a_k), also denoted P(a_k1, ..., a_kM) or P(a_k), where a_ki is the (in practice, discretized) value taken by the relational attribute A_ki. We term these probabilities the relational distributions. One interpretation of these distributions is: given an image, if you randomly pick a k-tuple of primitives, what is the probability that it will exhibit the relational attributes a_k, i.e., what is P(A_k = a_k)?

Given that these relational distributions exhibit complicated shapes that do not readily afford modeling by a combination of simply shaped distributions, we adopt a non-parametric, histogram-based representation. However, to reduce the size associated with a histogram-based representation, we use the Space of Probability Functions (SoPF), described below. As the hands of the signer move, the relational distributions change: the motion of the hands introduces non-stationarity in the relational distributions. Figure 1.3 shows examples of the 2-ary relational distributions for the sign CAN. Notice the change in the distributions as the hands come down. The change in the vertical direction of the relational distributions is clearly visible as the hands come down, while there is comparatively little change in the other direction.
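To make the 2-ary case concrete, here is a minimal sketch of computing one relational distribution as a normalized 2D histogram of (dx, dy) offsets between pairs of edge pixels. Sampling random pairs rather than enumerating all pairs, the bin count, and the histogram range are our simplifications; the (dx, dy) attributes and the normalization by D follow the text.

```python
# Minimal sketch of a 2-ary relational distribution: a normalized 2D histogram
# of (dx, dy) offsets between pairs of edge pixels inside the skin blobs.
import numpy as np

def relational_distribution(edge_xy, D, bins=30, n_pairs=20000, rng=None):
    """edge_xy: (N, 2) array of (x, y) edge-pixel coordinates.
    D: normalizing distance (inversely related to the distance from the camera)."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(0, len(edge_xy), size=n_pairs)
    j = rng.integers(0, len(edge_xy), size=n_pairs)
    dx = (edge_xy[j, 0] - edge_xy[i, 0]) / D       # normalized horizontal offset
    dy = (edge_xy[j, 1] - edge_xy[i, 1]) / D       # normalized vertical offset
    hist, _, _ = np.histogram2d(dx, dy, bins=bins, range=[[-1, 1], [-1, 1]])
    return hist / max(hist.sum(), 1)               # joint probability P(dx, dy)
```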

Fig. 1.3. Variations in relational distributions with motion. The left column shows image frames from the sign CAN, the middle column shows the edge pixels in the skin blobs, and the right column shows the relational distributions.

Let P(a_k, t) represent the relational distribution at time t. We express the square root of each relational distribution as a linear combination of orthogonal basis functions:

    √P(a_k, t) = Σ_{i=1}^{n} c_i(t) Φ_i(a_k) + μ(a_k) + η(a_k).    (1.1)

Here the Φ_i(a_k) are orthonormal functions, μ(a_k) is a mean function defined over the attribute space, and η(a_k) captures small random noise variations with zero mean and small variance. We refer to this space as the Space of Probability Functions (SoPF). We use the square root so that we arrive at a space where distances are not arbitrary but are related to the Bhattacharyya distance between the relational distributions, which is an appropriate distance measure for probability distributions. More details about the derivation of this property can be found in Ref. 24.

Given a set of relational distributions {P(a_k, t_i) | i = 1, ..., T}, the SoPF can be obtained by principal component analysis (PCA). In practice, we consider the subspace spanned by a few (N << n) dominant vectors associated with the largest eigenvalues. Thus, a relational distribution can be represented using these N coordinates (the c_i(t)'s), which is more compact than a normalized histogram-based representation. ASL sentences form traces in this Space of Probability Functions.

The eigenvectors of the SoPF associated with the largest eigenvalues are shown in Figure 1.4. The space was trained with the 39 distinct signs.

Fig. 1.4. Dominant dimensions of the learned SoPF, modeling the manual motion.

In the images, the vertical axes plot the distance attribute dy, the horizontal axes plot the distance attribute dx, and brightness is proportional to the component magnitude. The first eigenvector shows three modes. The bright spot in the second eigenvector emphasizes differences in the attribute dx between the two features. The third eigenvector is radially symmetric, emphasizing differences in both attributes. Most of the energy of the variation is captured by the 15 largest eigenvalues.
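A minimal sketch of this construction follows: PCA on the square roots of the training relational distributions, then projection of each frame to obtain the c_i(t) coordinates. The use of SVD and the helper names are our choices; the square-root transform, mean subtraction, and truncation to the dominant components follow Eq. (1.1).

```python
# Minimal sketch of building the Space of Probability Functions (SoPF):
# PCA on the square roots of the training relational distributions, so that
# Euclidean distances in the subspace relate to the Bhattacharyya distance.
import numpy as np

def build_sopf(train_dists, n_components=15):
    """train_dists: list of flattened relational distributions (histograms)."""
    X = np.sqrt(np.asarray(train_dists))           # square-root transform
    mu = X.mean(axis=0)                            # mean function mu(a_k)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    basis = Vt[:n_components]                      # dominant orthonormal functions
    return mu, basis

def project(dist, mu, basis):
    """Coordinates c_i(t) of one relational distribution in the SoPF."""
    return basis @ (np.sqrt(dist).ravel() - mu)

# A sentence becomes a trace: one projected point per frame, e.g.
# trace = np.stack([project(d, mu, basis) for d in frame_dists])
```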

Non-manual: Facial Expression

The first step is the localization of the face in each frame. There are various sophisticated approaches to detecting faces. 27-29 Here we adopt a very simple approach that relies on eye localization using eye template matching. The eye template, a rectangular region enclosing the two eyes, is the average image from four persons different from the ASL signer, imaged with a geometry similar to that used for the ASL signs. The correlation is computed over the whole image only for the first frame of a sentence. For subsequent frames, the center of the rectangular box bounding the eyes is found by correlating within a neighborhood of the center found in the previous frame; a window 10 pixels in width and height is used for this neighborhood search. After the eyes are detected, we demarcate the face with an elliptical structure. We use the golden ratio for the face 30 to mask the face with two elliptical structures, one for the top part and the other for the bottom. Fig. 1.5 shows example outputs of the eye detection and facial demarcation steps.

Fig. 1.5. (a) Output of eye detection. (b) Extracted elliptical facial region.

We adopt a view-based representation of expression, modeled as traces in an expression subspace, which is computed using principal component analysis (PCA) of four expression examples for each of the 39 signs. We have found that the 20 largest eigenvalues capture most of the energy in the expression variations, at least in this dataset. It is interesting to note that various aspects of facial expression that are important to ASL, such as eyebrow motion, cheek puffing, lip movement, and nose wrinkles, 13 are captured by the dominant eigenvectors, some of which are shown in Figure 1.6. Lip movements are emphasized by most of the eigenvectors because the interpreter was also mouthing the English equivalents of the signs; this might not be true for native signers.
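Returning to the eye localization step above, it can be sketched as follows, using OpenCV template matching as a stand-in for the correlation described in the text. The choice of OpenCV and of the normalized correlation coefficient score are ours; the full-image search on the first frame and the 10-pixel neighborhood window for subsequent frames follow the text.

```python
# Minimal sketch of eye-template tracking: full-image correlation on the first
# frame, then a search confined to a small window around the previous detection.
import cv2
import numpy as np

def track_eyes(frames, eye_template, win=10):
    """frames: list of grayscale images; eye_template: grayscale eye template.
    Returns the (x, y) center of the eye region in each frame."""
    th, tw = eye_template.shape
    centers, prev = [], None
    for frame in frames:
        if prev is None:
            search, (ox, oy) = frame, (0, 0)        # first frame: whole image
        else:
            x0 = max(prev[0] - tw // 2 - win, 0)    # restrict to neighborhood
            y0 = max(prev[1] - th // 2 - win, 0)
            search = frame[y0:y0 + th + 2 * win, x0:x0 + tw + 2 * win]
            ox, oy = x0, y0
        res = cv2.matchTemplate(search, eye_template, cv2.TM_CCOEFF_NORMED)
        _, _, _, (bx, by) = cv2.minMaxLoc(res)      # best correlation peak
        prev = (ox + bx + tw // 2, oy + by + th // 2)
        centers.append(prev)
    return centers
```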

Non-manual: Head Motion

Head motion is represented by the sequence of the averages of the two eye locations, which are detected during face localization. The 2D trajectories are defined with respect to the location in the starting frame. Fig. 1.7 shows some example head trajectories. Figs. 1.7(a), (b), and (c) clearly show the presence of Negation in the sentences; Fig. 1.7(d) shows the vertical motion of the face indicating a head nod; and Figs. 1.7(e) and (f) show the motion trajectories for sentences that convey neither a positive nor a negative meaning.

Combination of Manual and Non-Manual

We use facial expression information to reduce the deletion and insertion errors, while head motion information is used to determine whether the sentence contains Negation. The combination of the non-manual with the manual information is not trivial because (i) the non-manuals are not time-synchronized with the manuals. This is not a video synchronization issue: the facial event might lag or lead the manual event. Also, (ii) the presence of a strong non-manual indicating Assertion or Negation in the sentence makes it hard to extract facial information for some frames. Fig. 1.8 shows the information flow architecture. We process manual information, facial expressions, and head motion as independent channels, which are then combined as follows.

(1) Find the n signs with the least distances to the sentence using manual information.
(2) Find the distances for the same n signs found in Step 1 using non-manual information.
(3) Sort these signs in ascending order of the distances obtained from non-manual information.
(4) Discard the α signs having the maximum distances from the sorted list obtained in Step 3.
(5) Keep the remaining n − α signs from Step 1.

Finally, head motion is used to detect whether the sentence contains Negation. The selection of n and α is a function of the number of signs in a sentence and the computational costs of the high-level processes.

Distances between SoPF traces quantify the motion involved in the manual information, while distances in the face space quantify changes in facial expression. In this work, we adopt a simple distance measure between two traces to find a sign in a sentence using manual and non-manual information. For the manual information, we cross-correlate the SoPF trace of the training sign with the trace of the given sentence and pick the shift that results in the minimum (Euclidean) distance. For the non-manual information, we correlate the trace of the trained sign near the time neighborhood where the smallest distance for the manual information has been found. A sketch of this matching and pruning procedure is given below.
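The following is a minimal sketch of that procedure, assuming traces are arrays of per-frame subspace coordinates. The function names, the per-frame distance normalization, and the five-frame non-manual search neighborhood are our assumptions; the top-n selection, the non-manual re-scoring, and the discarding of the α worst signs follow Steps 1-5 above.

```python
# Minimal sketch of sign-to-sentence matching and manual/non-manual fusion.
import numpy as np

def best_match(sign_trace, sentence_trace):
    """Slide the sign trace along the sentence trace; return (min distance, shift)."""
    L, T = len(sign_trace), len(sentence_trace)
    best = (np.inf, 0)
    for shift in range(T - L + 1):
        seg = sentence_trace[shift:shift + L]
        d = np.linalg.norm(seg - sign_trace) / L    # per-frame Euclidean distance
        best = min(best, (d, shift))
    return best

def hypothesize_signs(manual_traces, facial_traces, sent_manual, sent_facial,
                      n=8, alpha=2):
    """manual_traces / facial_traces: dict sign -> training trace.
    Returns the n - alpha hypothesized signs for one sentence."""
    # Step 1: top-n signs by manual (SoPF) distance.
    manual = {s: best_match(tr, sent_manual) for s, tr in manual_traces.items()}
    top_n = sorted(manual, key=lambda s: manual[s][0])[:n]
    # Step 2: non-manual distance near the manually detected time neighborhood
    # (the +/- 5 frame margin is our assumption).
    facial = {}
    for s in top_n:
        _, shift = manual[s]
        lo, hi = max(shift - 5, 0), shift + len(facial_traces[s]) + 5
        facial[s], _ = best_match(facial_traces[s], sent_facial[lo:hi])
    # Steps 3-5: discard the alpha signs with the largest non-manual distances.
    return sorted(top_n, key=lambda s: facial[s])[:n - alpha]
```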

Let us look at some examples of the correlation of signs with sentences. Fig. 1.9(a) plots the correlation of the manual information of LIPREAD with the sentence LIPREAD CAN I. Lower distance values indicate matches, which in this case occur around frame 12. Similarly, Fig. 1.9(b) and Fig. 1.10(c) show the correlations of the manual information of the signs CAN and I with the same sentence. The actual positions of the signs can be seen in Fig. 1.10(d). The epenthesis movements, 31 indicated by E, can also be clearly seen between the signs.

Experiments

In this section, we present results demonstrating the efficacy of the proposed approach using the data described earlier. Given that we have 5 instances of 25 distinct sentences, we use 5-fold cross validation to evaluate the effectiveness of (a) the manual motion modeling, (b) the integration of facial expression with manual information, and (c) the detection of Negation in sentences. Four instances of each sentence are used for training and one is used for testing. Note that some signs occur multiple times in the training data in different arrangements with other signs. There are 65 sign instances, making up the 25 test sentences, to be recognized. Before presenting results, a few words about the performance measures are in order.

Quantifying Performance

Since we are considering the output of an intermediate level process that would typically be refined further using grammar constraints, the performance measures should reflect the tentative nature of the output. We sort the signs on the basis of the minimum distance of each sign to the sentence and choose the n signs with the least distances. If a sign is part of the sentence but not present among these n signs, a deletion error has occurred. The number of deletion errors depends on n: as n increases, errors go down, but the cost of high-level processing increases, since more possibilities must be considered. Since the maximum number of signs in a sentence is 5, we report results with n = 6, which is also about 10% of the number of possible sign instances in the test set. The correct detection rate, or accuracy, is 100% minus the deletion rate. A sign that is not part of the sentence but has a distance less than that of the last correctly detected sign is declared to be wrongly inserted in the sentence; this is an insertion error. Insertion errors can be reduced using context knowledge or by grouping signs that are very similar; here we reduce them using the facial expression information in the sentence. It is harder to recover from deletion errors.
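For clarity, here is a minimal sketch of how these two error counts can be computed for a single sentence; the function name and data layout are ours, while the definitions of deletion and insertion errors follow the paragraph above.

```python
# Minimal sketch of per-sentence deletion and insertion error counts.
# 'ranked' is a list of (sign, distance) pairs sorted by increasing distance;
# 'truth' is the set of signs actually present in the sentence.
def sentence_errors(ranked, truth, n=6):
    top_n = [s for s, _ in ranked[:n]]
    deletions = sum(1 for s in truth if s not in top_n)
    # Rank of the last correctly detected sign among the top n.
    correct_ranks = [i for i, s in enumerate(top_n) if s in truth]
    last_correct = max(correct_ranks) if correct_ranks else -1
    # Non-member signs ranked above the last correctly detected sign.
    insertions = sum(1 for s in top_n[:last_correct] if s not in truth)
    return deletions, insertions
```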

Table 1.1. Deletion and insertion error rates ([range over the 5 folds], average).

    Error      Manual             Manual + Facial
    Deletion   [9% to 15%], 12%   [4% to 11%], 8%
    Insertion  [12% to 19%], 15%  [7% to 15%], 11%

Use of Non-Manual Information

To study the effect of using non-manual information, we start with the top 8 signs, as determined by manual information, and then prune out 2 signs (α = 2) based on facial information, so as to finally arrive at 6 hypothesized signs per sentence. We compare the final insertion and deletion error rates with those obtained by hypothesizing 6 possible signs per sentence based on manual information alone. The deletion error rate for the top n = 6 matches based on manual information alone, as captured by the SoPF traces, ranges over the 5 folds from 9% to 15% with an average of about 12%. Thus, the average correct detection rate from manual information alone is 88%. Table 1.1 shows the improvement to about 92% correct detection when face information is added; we also see a corresponding reduction in the insertion error rates.

The percentage of sentences that are perfectly recognized, i.e., for which all the correct signs are among the topmost ranks (zero deletion and insertion errors), is around 46%. This sentence-level measure is a very strict one: a sentence can be misunderstood even if only one sign is not correctly recognized. We contend that it is important to also report this number, even though it is low. Note that with the use of grammatical constraints, this performance can be improved further.

Head Motion to Find Negation

Negation in a sentence is indicated by a head shake. We use the aspect ratio of the aggregate 2D track over the whole sentence, i.e., the ratio of the width to the height of the entire trajectory, as the feature for recognizing the presence of Negation. We consider sentences whose motion trajectories have an aspect ratio greater than 1.25, width greater than 40 pixels, and height less than 50 pixels to contain Negation. Using this rule, 27 of the 30 sentences in the database that contain Negation were correctly recognized, while there were 18 false alarms among the remaining 95 sentences. The false alarms were mainly due to sentences like GATE WHERE, which also involve horizontal motion of the face. The missed detections occurred in sentences that contain the sign ME (for example, You don't understand me): in the sign ME there is a natural downward movement of the face, which increases the height of the motion trajectory and thus lowers the aspect ratio. In the future, this performance could be improved by looking only at the portions of the sentence where the negation occurs, rather than using the entire trajectory over the whole sentence.
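A minimal sketch of this rule follows, assuming the head track is the sequence of eye-center positions relative to the first frame; the thresholds are the ones stated above, and the small epsilon guarding against division by zero is ours.

```python
# Minimal sketch of the Negation rule: classify a sentence from the bounding
# box of its head-motion trajectory (thresholds follow the text).
import numpy as np

def has_negation(head_track, min_aspect=1.25, min_width=40, max_height=50):
    """head_track: (frames, 2) array of (x, y) eye-center positions,
    expressed relative to the first frame."""
    width = head_track[:, 0].max() - head_track[:, 0].min()
    height = head_track[:, 1].max() - head_track[:, 1].min()
    aspect = width / max(height, 1e-6)             # avoid division by zero
    return aspect > min_aspect and width > min_width and height < max_height
```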

Conclusion

We presented a framework for continuous sign language recognition that combines non-manual information from the face with manual information to decrease both deletion and insertion errors. Unlike most previous approaches, which are top-down and HMM based, ours is a bottom-up approach that relies on simple low-level processes feeding into intermediate level processes that hypothesize the signs present in a sentence. The approach does not bypass the segmentation problem, but relies on simple yet robust low-level representations. The manual dynamics were modeled by capturing the statistics of the relationships among the low-level features via relational distributions embedded in the Space of Probability Functions. Facial dynamics were captured using an expression subspace computed with PCA. Even with fairly simple vision processes, embedded in a bottom-up approach, we were able to achieve good performance from purely image-based inputs. Using 5-fold cross-validation on a data set of 125 sentences containing 325 sign instances, we showed that the accuracy of individual sign recognition was about 88% with manual information alone. The use of non-manual information increased the accuracy from 88% to 92%. We were also able to correctly detect negation in sentences 90% of the time.

Acknowledgment

This work was supported in part by the National Science Foundation under grant IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

1. B. Loeding, S. Sarkar, A. Parashar, and A. Karshmer, Progress in automated computer recognition of sign language, Lecture Notes in Computer Science, vol. 3118 (2004).
2. C. Sylvie and S. Ranganath, Automatic sign language analysis: A survey and the future beyond lexical meaning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6) (June 2005).
3. Y. Cui and J. Weng, Appearance-based hand sign recognition from intensity image sequences, Computer Vision and Image Understanding, 78(2) (May 2000).
4. M. Zhao and F. K. H. Quek, RIEVL: Recursive induction learning in hand gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11) (1998).
5. J. Triesch and C. von der Malsburg, Robust classification of hand postures against complex backgrounds, in International Conference on Automatic Face and Gesture Recognition (1996).

6. M. H. Yang, N. Ahuja, and M. Tabb, Extraction of 2D motion trajectories and its application to hand gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (Aug. 2002).
7. T. Starner and A. Pentland, Real-time American Sign Language recognition from video using hidden Markov models, in Symposium on Computer Vision (1995).
8. C. Vogler and D. Metaxas, ASL recognition based on a coupling between HMMs and 3D motion analysis, in International Conference on Computer Vision (1998).
9. C. Vogler and D. Metaxas, Parallel hidden Markov models for American Sign Language recognition, in International Conference on Computer Vision (1999).
10. C. Vogler and D. Metaxas, A framework for recognizing the simultaneous aspects of American Sign Language, Computer Vision and Image Understanding, 81 (2001).
11. J. Ma, W. Gao, C. Wang, and J. Wu, A continuous Chinese sign language recognition system, in International Conference on Automatic Face and Gesture Recognition (2000).
12. C. Wang, W. Gao, and S. Shan, An approach based on phonemes to large vocabulary Chinese sign language recognition, in International Conference on Automatic Face and Gesture Recognition (2002).
13. B. Bahan and C. Neidle, Non-manual realization of agreement in American Sign Language, Master's thesis, Boston University (1996).
14. R. Wilbur and A. Martinez, Physical correlates of prosodic structure in American Sign Language, Meeting of the Chicago Linguistics Society (April 2002).
15. U. M. Erdem and S. Sclaroff, Automatic detection of relevant head gestures in American Sign Language communication, in International Conference on Pattern Recognition (2002).
16. C. Vogler and S. Goldenstein, Facial movement analysis in ASL, Universal Access in the Information Society, 6(4) (2008).
17. U. Canzler and T. Dziurzyk, Extraction of non-manual features for video-based sign language recognition, in IAPR Workshop on Machine Vision Applications (MVA2002) (2002).
18. M. L. Cascia, S. Sclaroff, and V. Athitsos, Fast, reliable head tracking under varying illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6) (June 1999).
19. A. Kapoor and R. W. Picard, A real-time head nod and shake detector, in Workshop on Perceptive User Interfaces (Nov. 2001).
20. R. Yang, S. Sarkar, and B. Loeding, Enhanced level building algorithm for the movement epenthesis problem in sign language recognition, in Computer Vision and Pattern Recognition (2007).
21. R. Yang and S. Sarkar, Coupled grouping and matching for sign and gesture recognition, Computer Vision and Image Understanding (2008).
22. J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, Simultaneous localization and recognition of dynamic hand gestures, in IEEE Workshop on Motion and Video Computing, vol. 2 (2005).
23. L. Ding and A. Martinez, Modelling and recognition of the linguistic components in American Sign Language, Image and Vision Computing (2009).
24. I. Robledo and S. Sarkar, Representation of the evolution of feature relationship statistics: Human gait-based recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (Feb. 2003).

25. W. Freeman and M. Roth, Orientation histograms for hand and gesture recognition, in International Workshop on Face and Gesture Recognition (1995).
26. A. Huet and E. Hancock, Line pattern retrieval using relational histograms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(13) (1999).
27. H. A. Rowley, S. Baluja, and T. Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 23-38 (1998).
28. A. Colmenarez and T. Huang, Face detection with information based maximum discrimination, in Computer Vision and Pattern Recognition (1997).
29. K. K. Sung and T. Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 39-51 (1998).
30. L. G. Farkas and I. R. Munro, Anthropometric Facial Proportions in Medicine (Charles C. Thomas, Springfield, IL, 1987).
31. S. K. Liddell and R. E. Johnson, American Sign Language: The phonological base, Sign Language Studies, 64 (1989).

Fig. 1.6. Dominant dimensions of the learned facial expressions over the 39 signs.

Fig. 1.7. Head motion trajectories for (a) DONT-KNOW I, (b) I NOT HAVE KEY, (c) NO, (d) YES, (e) YOU UNDERSTAND ME, and (f) SUITCASE I PACK FINISH. (a), (b), and (c) show the motion trajectories for sentences with negation; (d) shows the motion trajectory for the sign YES; (e) and (f) show motion trajectories for sentences that do not convey negative meaning.

Fig. 1.8. A bottom-up architecture for fusing information from facial expressions and head motion with manual information to prune the set of possible ASL sign hypotheses.

Fig. 1.9. Cross-correlation of signs with sentences: distance plotted against the frames of the sentence LIPREAD CAN I. (a) and (b) show the correlations of the signs LIPREAD and CAN with the sentence LIPREAD CAN I. Lower values indicate closer matches.

Fig. 1.10. Cross-correlation of signs with sentences (contd.). (c) shows the correlation of the sign I with the sentence LIPREAD CAN I; lower values indicate closer matches. (d) shows the ground-truth positions of the signs in the sentence; E indicates the epenthesis movements present between signs.


Sign Language Recognition using Kinect Sign Language Recognition using Kinect Edon Mustafa 1, Konstantinos Dimopoulos 2 1 South-East European Research Centre, University of Sheffield, Thessaloniki, Greece 2 CITY College- International Faculty

More information

Improved Intelligent Classification Technique Based On Support Vector Machines

Improved Intelligent Classification Technique Based On Support Vector Machines Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth

More information

A Survey on Hand Gesture Recognition for Indian Sign Language

A Survey on Hand Gesture Recognition for Indian Sign Language A Survey on Hand Gesture Recognition for Indian Sign Language Miss. Juhi Ekbote 1, Mrs. Mahasweta Joshi 2 1 Final Year Student of M.E. (Computer Engineering), B.V.M Engineering College, Vallabh Vidyanagar,

More information

Implementation of Automatic Retina Exudates Segmentation Algorithm for Early Detection with Low Computational Time

Implementation of Automatic Retina Exudates Segmentation Algorithm for Early Detection with Low Computational Time www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issue 10 Oct. 2016, Page No. 18584-18588 Implementation of Automatic Retina Exudates Segmentation Algorithm

More information

Centroid-Based Exemplar Selection of ASL Non-Manual Expressions using Multidimensional Dynamic Time Warping and MPEG4 Features

Centroid-Based Exemplar Selection of ASL Non-Manual Expressions using Multidimensional Dynamic Time Warping and MPEG4 Features Centroid-Based Exemplar Selection of ASL Non-Manual Expressions using Multidimensional Dynamic Time Warping and MPEG4 Features Hernisa Kacorri 1, Ali Raza Syed 1, Matt Huenerfauth 2, Carol Neidle 3 1 The

More information

Toward Scalability in ASL Recognition: Breaking Down Signs into Phonemes

Toward Scalability in ASL Recognition: Breaking Down Signs into Phonemes Accepted at the Gesture Workshop 99, March 17 19, 1999, Gif-sur-Yvette, France. See Appendix B for some comments that did not make it into the paper. Toward Scalability in ASL Recognition: Breaking Down

More information

PCA Enhanced Kalman Filter for ECG Denoising

PCA Enhanced Kalman Filter for ECG Denoising IOSR Journal of Electronics & Communication Engineering (IOSR-JECE) ISSN(e) : 2278-1684 ISSN(p) : 2320-334X, PP 06-13 www.iosrjournals.org PCA Enhanced Kalman Filter for ECG Denoising Febina Ikbal 1, Prof.M.Mathurakani

More information

A Framework for Motion Recognition with Applications to American Sign Language and Gait Recognition

A Framework for Motion Recognition with Applications to American Sign Language and Gait Recognition University of Pennsylvania ScholarlyCommons Center for Human Modeling and Simulation Department of Computer & Information Science 12-7-2000 A Framework for Motion Recognition with Applications to American

More information

Characterization of 3D Gestural Data on Sign Language by Extraction of Joint Kinematics

Characterization of 3D Gestural Data on Sign Language by Extraction of Joint Kinematics Human Journals Research Article October 2017 Vol.:7, Issue:4 All rights are reserved by Newman Lau Characterization of 3D Gestural Data on Sign Language by Extraction of Joint Kinematics Keywords: hand

More information

Sign Language Number Recognition

Sign Language Number Recognition Sign Language Number Recognition Iwan Njoto Sandjaja Informatics Engineering Department Petra Christian University Surabaya, Indonesia iwanns@petra.ac.id Nelson Marcos, PhD Software Technology Department

More information

Information Processing During Transient Responses in the Crayfish Visual System

Information Processing During Transient Responses in the Crayfish Visual System Information Processing During Transient Responses in the Crayfish Visual System Christopher J. Rozell, Don. H. Johnson and Raymon M. Glantz Department of Electrical & Computer Engineering Department of

More information

Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE

Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE Quality Assessment of Human Hand Posture Recognition System Er. ManjinderKaur M.Tech Scholar GIMET Amritsar, Department of CSE mkwahla@gmail.com Astt. Prof. Prabhjit Singh Assistant Professor, Department

More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

MRI Image Processing Operations for Brain Tumor Detection

MRI Image Processing Operations for Brain Tumor Detection MRI Image Processing Operations for Brain Tumor Detection Prof. M.M. Bulhe 1, Shubhashini Pathak 2, Karan Parekh 3, Abhishek Jha 4 1Assistant Professor, Dept. of Electronics and Telecommunications Engineering,

More information

Statistical and Neural Methods for Vision-based Analysis of Facial Expressions and Gender

Statistical and Neural Methods for Vision-based Analysis of Facial Expressions and Gender Proc. IEEE Int. Conf. on Systems, Man and Cybernetics (SMC 2004), Den Haag, pp. 2203-2208, IEEE omnipress 2004 Statistical and Neural Methods for Vision-based Analysis of Facial Expressions and Gender

More information

Sign Language to Number by Neural Network

Sign Language to Number by Neural Network Sign Language to Number by Neural Network Shekhar Singh Assistant Professor CSE, Department PIET, samalkha, Panipat, India Pradeep Bharti Assistant Professor CSE, Department PIET, samalkha, Panipat, India

More information

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection Y.-H. Wen, A. Bainbridge-Smith, A. B. Morris, Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection, Proceedings of Image and Vision Computing New Zealand 2007, pp. 132

More information

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE SAKTHI NEELA.P.K Department of M.E (Medical electronics) Sengunthar College of engineering Namakkal, Tamilnadu,

More information

Language Volunteer Guide

Language Volunteer Guide Language Volunteer Guide Table of Contents Introduction How You Can Make an Impact Getting Started 3 4 4 Style Guidelines Captioning Translation Review 5 7 9 10 Getting Started with Dotsub Captioning Translation

More information

Filipino Sign Language Recognition using Manifold Learning

Filipino Sign Language Recognition using Manifold Learning Filipino Sign Language Recognition using Manifold Learning Ed Peter Cabalfin Computer Vision & Machine Intelligence Group Department of Computer Science College of Engineering University of the Philippines

More information

Hand Sign to Bangla Speech: A Deep Learning in Vision based system for Recognizing Hand Sign Digits and Generating Bangla Speech

Hand Sign to Bangla Speech: A Deep Learning in Vision based system for Recognizing Hand Sign Digits and Generating Bangla Speech Hand Sign to Bangla Speech: A Deep Learning in Vision based system for Recognizing Hand Sign Digits and Generating Bangla Speech arxiv:1901.05613v1 [cs.cv] 17 Jan 2019 Shahjalal Ahmed, Md. Rafiqul Islam,

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION 2018 IJSRSET Volume 4 Issue 2 Print ISSN: 2395-1990 Online ISSN : 2394-4099 National Conference on Advanced Research Trends in Information and Computing Technologies (NCARTICT-2018), Department of IT,

More information

Recognition of Hand Gestures by ASL

Recognition of Hand Gestures by ASL Recognition of Hand Gestures by ASL A. A. Bamanikar Madhuri P. Borawake Swati Bhadkumbhe Abstract - Hand Gesture Recognition System project will design and build a man-machine interface using a video camera

More information

ANALYSIS AND DETECTION OF BRAIN TUMOUR USING IMAGE PROCESSING TECHNIQUES

ANALYSIS AND DETECTION OF BRAIN TUMOUR USING IMAGE PROCESSING TECHNIQUES ANALYSIS AND DETECTION OF BRAIN TUMOUR USING IMAGE PROCESSING TECHNIQUES P.V.Rohini 1, Dr.M.Pushparani 2 1 M.Phil Scholar, Department of Computer Science, Mother Teresa women s university, (India) 2 Professor

More information

N RISCE 2K18 ISSN International Journal of Advance Research and Innovation

N RISCE 2K18 ISSN International Journal of Advance Research and Innovation The Computer Assistance Hand Gesture Recognition system For Physically Impairment Peoples V.Veeramanikandan(manikandan.veera97@gmail.com) UG student,department of ECE,Gnanamani College of Technology. R.Anandharaj(anandhrak1@gmail.com)

More information

Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III

Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III Page 1 of 19 Revised: 8/1/2017 114.36. American Sign Language, Level III (One Credit), Adopted 2014. (a) General

More information

1 Introduction. Abstract: Accurate optic disc (OD) segmentation and fovea. Keywords: optic disc segmentation, fovea detection.

1 Introduction. Abstract: Accurate optic disc (OD) segmentation and fovea. Keywords: optic disc segmentation, fovea detection. Current Directions in Biomedical Engineering 2017; 3(2): 533 537 Caterina Rust*, Stephanie Häger, Nadine Traulsen and Jan Modersitzki A robust algorithm for optic disc segmentation and fovea detection

More information

Smart Gloves for Hand Gesture Recognition and Translation into Text and Audio

Smart Gloves for Hand Gesture Recognition and Translation into Text and Audio Smart Gloves for Hand Gesture Recognition and Translation into Text and Audio Anshula Kumari 1, Rutuja Benke 1, Yasheseve Bhat 1, Amina Qazi 2 1Project Student, Department of Electronics and Telecommunication,

More information

AUTOMATIC DIABETIC RETINOPATHY DETECTION USING GABOR FILTER WITH LOCAL ENTROPY THRESHOLDING

AUTOMATIC DIABETIC RETINOPATHY DETECTION USING GABOR FILTER WITH LOCAL ENTROPY THRESHOLDING AUTOMATIC DIABETIC RETINOPATHY DETECTION USING GABOR FILTER WITH LOCAL ENTROPY THRESHOLDING MAHABOOB.SHAIK, Research scholar, Dept of ECE, JJT University, Jhunjhunu, Rajasthan, India Abstract: The major

More information

HandTalker II: A Chinese Sign language Recognition and Synthesis System

HandTalker II: A Chinese Sign language Recognition and Synthesis System HandTalker II: A Chinese Sign language Recognition and Synthesis System Wen Gao [1][2], Yiqiang Chen [1], Gaolin Fang [2], Changshui Yang [1], Dalong Jiang [1], Chunbao Ge [3], Chunli Wang [1] [1] Institute

More information

NAILFOLD CAPILLAROSCOPY USING USB DIGITAL MICROSCOPE IN THE ASSESSMENT OF MICROCIRCULATION IN DIABETES MELLITUS

NAILFOLD CAPILLAROSCOPY USING USB DIGITAL MICROSCOPE IN THE ASSESSMENT OF MICROCIRCULATION IN DIABETES MELLITUS NAILFOLD CAPILLAROSCOPY USING USB DIGITAL MICROSCOPE IN THE ASSESSMENT OF MICROCIRCULATION IN DIABETES MELLITUS PROJECT REFERENCE NO. : 37S0841 COLLEGE BRANCH GUIDE : DR.AMBEDKAR INSTITUTE OF TECHNOLOGY,

More information

Video-Based Recognition of Fingerspelling in Real-Time. Kirsti Grobel and Hermann Hienz

Video-Based Recognition of Fingerspelling in Real-Time. Kirsti Grobel and Hermann Hienz Video-Based Recognition of Fingerspelling in Real-Time Kirsti Grobel and Hermann Hienz Lehrstuhl für Technische Informatik, RWTH Aachen Ahornstraße 55, D - 52074 Aachen, Germany e-mail: grobel@techinfo.rwth-aachen.de

More information

EECS 433 Statistical Pattern Recognition

EECS 433 Statistical Pattern Recognition EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern

More information

FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES

FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES Riku Matsumoto, Hiroki Yoshimura, Masashi Nishiyama, and Yoshio Iwai Department of Information and Electronics,

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information