Modeling the Use of Space for Pointing in American Sign Language Animation

Jigar Gohel, Sedeeq Al-khazraji, Matt Huenerfauth
Rochester Institute of Technology, Golisano College of Computing and Information Sciences
jg2252@rit.edu; sha6709@rit.edu; matt.huenerfauth@rit.edu

Abstract
Based on motion recordings of humans, we model grammatically appropriate locations in space for a virtual human to point during American Sign Language animations.

Keywords
Deaf and Hard of Hearing, Emerging Assistive Technologies, Research & Development
Introduction

American Sign Language (ASL) is a primary means of communication for over 500,000 people in the United States (Mitchell et al., 2006). Standardized English reading-skills testing has revealed that many deaf adults in the U.S. have lower English literacy than their hearing peers (Traxler 2000), which can lead to accessibility barriers when these users encounter English text online that is written at too high a reading difficulty level. Given the challenges of re-recording videos of human signers (to update the information content of ASL on websites), providing animations on websites is a more maintainable way to provide information in ASL.

In ASL, signers associate items under discussion with locations around their body (McBurney 2002; Meier 1990), which the signer may point to later in the discourse to refer to these items again. For instance, if a signer were discussing a favorite book, she might mention the title of the book once and then point at a location in space around her body. For the remainder of the conversation, she would not mention the title of the book again; instead, she would point to this location in space to refer to it. In this work, we model and predict the most natural locations for these spatial reference points (SRPs), based on recordings of human signers' movements. We evaluated ASL animations generated from the model in a user-based study.

Problem Statement

We would like to automate the creation of ASL animations as much as possible (to make it easier to provide ASL content online), yet prior sign language animation systems do not automatically predict where signers should place SRPs in space so that the animation appears natural.
Linguists debate whether ASL signers constrain their selection of SRP locations to a semicircular arc region around their torso or use a wider variety of locations (McBurney 2002; Meier 1990), but in any case, there is no finite set of points in space where SRPs may be
established. Given this complexity, rather than defining a simplistic rule for where an animated human should establish SRPs, we investigate a data-driven prediction model. Specifically, in this work, we have extracted and modeled the locations of all SRPs established by human signers who were recorded in our pre-existing ASL motion-capture corpus (Lu and Huenerfauth 2012b).

Literature Review

To justify why designers of ASL animations must accurately model the use of SRPs, Huenerfauth and Lu (2012c) conducted studies in which native ASL signers evaluated ASL animations with and without the establishment of SRPs. The authors found that adding these pointing-to-locations-in-space phenomena to ASL animations led to a significant improvement in users' comprehension of the animations and to more positive subjective ratings of the animations' quality. While early work on sign language animation utilized rule-based approaches, as more sign language corpora have become available, recent work has turned to data-driven modeling (Lu and Huenerfauth 2010; Morrissey and Way 2005; Segouat and Braffort 2009), e.g. using ASL corpus data to predict the movements of a signer's hands during inflecting verbs (Lu and Huenerfauth 2012a). However, no previous research has predicted SRP locations using data-driven modeling.

Modeling Methodology

From 2009 to 2013, our lab recorded 246 unscripted multi-sentence single-signer ASL performances (Lu and Huenerfauth 2010) from 8 human signers during 11 recording sessions, and the first 98 of these recordings were released with linguistic annotation of the timing of individual words and SRPs (Lu and Huenerfauth 2012b). In this work, using these 98 annotated recordings, we have extracted the right-hand index-fingertip location and the 3D vector along which the
Modeling the Use of Space for Pointing in ASL Animation 97 finger is pointing, during the time-periods in the corpus when the human signer was pointing to establish an SRP location in the 3D space around the body. We reduced the dimensionality of the modeling task as follows: We imagined a 2D plane in front of the human signer (at a distance of twice the human s arm-length), and we identified where the 3D finger-pointing vector would intersect with this 2D plane thereby producing a 2D coordinate on the plane that represents where the human was pointing. (See the side-view image in Figure 1.) Fig. 1. Clusters Projected onto 3D Virtual Human Space (see Figure 2 for Detail) To combine data from multiple human signers of different body size, we normalized based on: the human s shoulder width (in the left-right axis), the human s shoulder height (in the up-down axis), and the human s arm-length (to determine the position of the 2D intersection plane in the forward-backward axis). To model this 2D data representing where the human pointed when they set up individual SRPs, we use a Gaussian Mixture Model (GMM),
Modeling the Use of Space for Pointing in ASL Animation 98 implemented in R, to identify clusters in the signing space where signers tended to place items. A GMM with three clusters was the best fit to the data. Fig. 2. Visualization of the Shape of GMM Clusters (see Figure 1 for scale) While Figure 1 visualized the clusters overlaid on a virtual human character, to convey scale, Figure 2 provides a more detailed view of the clusters, to better convey their shape. In Figure 2, the color of data points (and redundant dotted-line boundaries) indicate cluster membership. The centroid of each cluster is indicated in a lighter yellow-color dot within each. A slightly smaller proportion of points were in cluster 2 (red). Evaluation Study A user-based study was conducted to evaluate animations produced using this model, with participation from fluent ASL signers who self-identified as Deaf/deaf, recruited at Rochester Institute of Technology. Out of the 30 participants who were recruited for the study, 20 were male and 10 were female. Their ages ranged from 19-30 (average: 24). Out of the 30 participants, 18 had started using ASL since birth. Eleven participants grew up in a household
Modeling the Use of Space for Pointing in ASL Animation 99 with parents who knew ASL, 6 participants have parents or family members who are deaf, 28 participants use ASL at college, and all participants reported Deaf community connections (e.g. friends, significant other, social groups). During the study, each participant viewed animations of ASL with a virtual human character performing 16 short (5-10 sentence) stories, originally created as stimuli in (Lu Huenerfauth 2012a). Each story was produced in the following three versions: ARTIST: The first version was designed by an ASL-expert 3D artist using a commercial animation tool, as described in (Lu Huenerfauth 2012a). These were intended as an upper baseline (topline), i.e. high-quality animations carefully produced by a skilled human. NO-POINTING: In this version, all pointing to SRP locations (to refer to previously mentioned people or things) was replaced with the virtual human simply re-stating the name of the person or thing throughout the story. These were intended as a lower baseline, i.e. as produced by an animation system that could not select SRP locations. GMM: The ARTIST version of the story was edited so that all of the SRP pointing in the story (originally specified by the artist) was instead replaced by pointing to locations sampled from our GMM model. Each participant saw all 16 stories, but they saw a third of the stories in each of the 3 conditions, counterbalanced across the study. After viewing each animation, the participant answered 4 comprehension questions about the information in that story (total 64 questions), and the participant was asked three 1-to-10 scalar-response subjective questions about three aspects of the animation s quality: its grammatical correctness, how easy it was to understand, and how natural the movement of the animation appeared.
Modeling the Use of Space for Pointing in ASL Animation 100 We hypothesized that: (H1) the subjective ratings and the comprehension question scores for the GMM condition would be higher than for our NO-POINTING baseline animations, and (H2) the ARTIST and GMM would have statistically equivalent scores. To evaluate H1, we conducted a Mann-Whitney Test for the subjective scores and a t-test for comprehension question scores, but we did not observe any statistical difference (p>0.05) in the scores between our GMM model and the NO-POINTING baseline condition. H1 was not supported. To evaluate H2, a Two One-Sided Test (TOST) revealed equivalence between the ARTIST and GMM conditions for both the comprehension question scores (comparison margin = 10% response accuracy) and for the subjective questions (comparison margin = 1.0 units on the 1-to-10 scale). H2 was supported. This result indicates that animations with pointing based on the GMM model were perceived as similar in quality to the ARTIST animations and led to similar comprehension-question scores. Conclusion and Future Work We have presented a method for modeling where human signers tend to point in 3D space when performing spatial reference in ASL, and we demonstrated how the model can predict pointing locations when generating ASL animations. Fluent ASL signers rated these animations as being similar in quality to animations in which the pointing that had been selected by an ASL-expert animation artist. This result contributes to our lab s overall goal of automating the process of synthesizing animations of ASL, which could lead to greater availability of ASL content on websites, to improve information accessibility for ASL users. In future work, we plan on modeling additional phenomena from our motion-capture recordings to synthesize ASL animations.
Works Cited

Huenerfauth, Matt, and Pengfei Lu. Effect of Spatial Reference and Verb Inflection on the Usability of American Sign Language Animations. Universal Access in the Information Society, vol. 11, no. 2, June 2012, pp. 169-184. (2012c)

Lu, Pengfei, and Matt Huenerfauth. Collecting a Motion-Capture Corpus of American Sign Language for Data-Driven Generation Research. Proceedings of the 1st SLPAT Workshop, at HLT-NAACL 2010, Los Angeles, CA, 2010.

Lu, Pengfei, and Matt Huenerfauth. CUNY American Sign Language Motion-Capture Corpus: First Release. Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, at the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 2012. (2012b)

Lu, Pengfei, and Matt Huenerfauth. Learning a Vector-Based Model of American Sign Language Inflecting Verbs from Motion-Capture Data. Proceedings of the 3rd SLPAT Workshop, at NAACL-HLT 2012, Montreal, Quebec, Canada, 2012. (2012a)

McBurney, Susan Lloyd. Pronominal reference in signed and spoken language: Are grammatical categories modality-dependent? Modality and Structure in Signed and Spoken Languages, 2002, pp. 329-369.

Meier, Richard P. Person deixis in American Sign Language. Theoretical Issues in Sign Language Research, vol. 1, 1990, pp. 175-190.

Mitchell, Ross E., et al. How many people use ASL in the United States? Why estimates need updating. Sign Language Studies, vol. 6, no. 3, 2006, pp. 306-335.
Morrissey, Sara, and Andy Way. An example-based approach to translating sign language. Proceedings of the Workshop on Example-Based Machine Translation, 2005, pp. 109-116.

Segouat, Jérémie, and Annelies Braffort. Toward the study of sign language coarticulation: methodology proposal. Proceedings of the Second International Conferences on Advances in Computer-Human Interactions (ACHI'09), IEEE, 2009.

Traxler, Carol Bloomquist. The Stanford Achievement Test: National norming and performance standards for deaf and hard-of-hearing students. Journal of Deaf Studies and Deaf Education, vol. 5, no. 4, 2000, pp. 337-348.