DeepASL: Enabling Ubiquitous and Non-Intrusive Word and Sentence-Level Sign Language Translation

Biyi Fang, Michigan State University
ACM SenSys '17, Nov 6th, 2017
Biyi Fang (MSU), Jillian Co (MSU), Mi Zhang (MSU)

Deep Learning is Changing our Lives Now
Self-driving, face recognition, speech recognition, playing Go

Background
American Sign Language (ASL) is the primary language used by deaf people to communicate with others. Unfortunately, very few people with normal hearing understand sign language. Existing communication approaches have key limitations in cost, availability, or convenience:
- Sign language interpreter
- Writing on paper
- Typing on a phone

Sign Language Translation Technology
Characteristics of signs: hand shape, hand movement, relative location of two hands
Components: sensors, computational models

Limitations of Existing Sign Language Translation Systems
- EMG + Motion [Wu et al. 2015]: intrusive
- RGB Camera [Zafrulla et al. 2010]: constrained by lighting condition and privacy intrusive
- Kinect [Chai et al. 2013]: lack of resolution

Our Solution: DeepASL
A deep learning-based sign language translation framework that enables ubiquitous and non-intrusive ASL translation at both word and sentence levels.

Design Choice: Leap Motion (Infrared Sensing)
3D skeleton joint data
[Figure: hand skeleton with labels: skeleton joint, bone, extended bone, elbow]

Comparison with Existing Sign Language Translation Systems
[Table: rows EMG + Motion, RGB Camera, Kinect, DeepASL; columns Non-Intrusive, Lighting Condition, Privacy Preserving, High Resolution]

System Architecture of DeepASL
[Figure: three stages: Sentence-Level Translation, Word-Level Translation, ASL Characteristics Extraction]

ASL Characteristics Extraction
Hand shape + relative location of two hands; hand movement
[Figure: extracted characteristics: right hand shape, right hand movement, left hand shape, left hand movement; origin marked at [0, 0, 0]]

ASL Characteristics Organization
[Figure: hierarchical model for the example sign "Thank"]
- Low-level ASL characteristics: right hand shape, right hand movement, left hand shape, left hand movement
- Mid-level right/left hand representation
- High-level single-sign representation
- Fully connected layer and softmax: probability distribution over vocabulary
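
The last stage on this slide (fully connected layer, then softmax over the vocabulary) can be sketched as follows. This is a minimal illustration, assuming the high-level single-sign representation is already available as a plain feature vector. The function names, the toy three-word vocabulary, and all weights are hypothetical and not taken from DeepASL.

```python
import math

def fully_connected(features, weights, biases):
    """One dense layer: produces one score per vocabulary word."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: 2-dimensional sign representation, 3-word vocabulary.
vocabulary = ["thank", "want", "what"]
weights = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0]
sign_representation = [1.0, 0.2]

probabilities = softmax(fully_connected(sign_representation, weights, biases))
predicted_word = vocabulary[probabilities.index(max(probabilities))]
```

The predicted word is simply the vocabulary entry with the highest probability.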

Similar ASL Differentiation
Some signs share very similar characteristics at the beginning of their trajectories.
Example: "want" and "what"

Similar ASL Differentiation
A bidirectional recurrent neural network (B-RNN) model is incorporated to capture both forward and backward representations of a sign.
[Figure: B-RNN unrolled over time steps t-1, t, t+1, with an input layer (x), a forward layer (h), a backward layer (h), and an output layer (y)]
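
Why the backward pass helps can be shown with a toy example. The sketch below is an assumption-laden illustration, not the DeepASL model: it uses a scalar tanh RNN with hand-picked weights (`w_in`, `w_rec`) and shares them between both directions for brevity, whereas a real B-RNN learns separate weights per direction. The two input sequences are hypothetical stand-ins for signs that start the same and end differently.

```python
import math

def rnn_pass(xs, w_in, w_rec):
    """Run a simple tanh RNN with scalar inputs and a scalar hidden state."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(xs, w_in=0.5, w_rec=0.8):
    """Pair each time step's forward state with its backward state."""
    forward = rnn_pass(xs, w_in, w_rec)
    # The backward layer reads the sequence from the end, then is re-aligned.
    backward = rnn_pass(xs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))

# Two hypothetical trajectories with an identical beginning.
sign_a = [1.0, 1.0, 0.2]
sign_b = [1.0, 1.0, -0.9]

states_a = bidirectional_states(sign_a)
states_b = bidirectional_states(sign_b)
```

At the first time step the forward states of the two sequences are identical, because the forward layer has only seen the shared beginning. The backward states already differ, because they summarize the rest of the trajectory.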

Sentence-Level ASL Translation
DeepASL adopts a probabilistic framework based on Connectionist Temporal Classification (CTC) [Graves et al. 2006] for sentence-level ASL translation.
- @ Training: insert blank symbols ("How are you" -> "How_are_you")
- @ Inference: remove blank symbols ("How_are_you" -> "How are you")
It eliminates the restriction of pre-segmenting the whole sentence into individual words, enabling end-to-end whole-sentence translation.
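
The "remove blank symbols" step at inference can be sketched with the standard CTC collapse rule: merge consecutive repeated labels, then drop blanks. This is a minimal sketch assuming a greedy, per-frame labeling as input; the slide does not say which decoder DeepASL uses, and the name `ctc_collapse` and the `"_"` blank token are illustrative choices.

```python
from itertools import groupby

BLANK = "_"  # assumed blank token, mirroring the "How_are_you" notation

def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into a word sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove the blank symbols.
    return [label for label in merged if label != blank]

frames = ["How", "How", "_", "are", "_", "_", "you", "you"]
sentence = " ".join(ctc_collapse(frames))
```

The blank also lets the same word appear twice in a row: `["go", "_", "go"]` collapses to two words, while `["go", "go"]` collapses to one.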

Performance on Word-Level ASL Translation
ASL Word Dataset
- 56 ASL words
- 11 participants
- 6,440 samples in total
Performance
- Average 95% accuracy
- Worst case 91%, on participant #11

Necessity of Model Components
Model: translation accuracy; increase; note
- Baseline 1: 89.4 ± 3.1%; 5.1%; no hand shape information
- Baseline 2: 89.5 ± 2.4%; 5.0%; no hand movement information
- Baseline 3: 91.1 ± 3.4%; 3.4%; no hierarchical structure
- Baseline 4: 93.7 ± 1.7%; 0.8%; no bidirectional structure
- DeepASL: 94.5 ± 2.4%
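
The "increase" figures are consistent with reading them as the gap, in percentage points, between DeepASL's mean accuracy and each baseline's mean accuracy. A quick check under that reading (the variable names are illustrative):

```python
deepasl_accuracy = 94.5  # mean accuracy of the full model, in percent

baseline_accuracy = {
    "Baseline 1": 89.4,  # no hand shape information
    "Baseline 2": 89.5,  # no hand movement information
    "Baseline 3": 91.1,  # no hierarchical structure
    "Baseline 4": 93.7,  # no bidirectional structure
}

# Gap to the full model, rounded to one decimal as on the slide.
increase = {name: round(deepasl_accuracy - accuracy, 1)
            for name, accuracy in baseline_accuracy.items()}
```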

Performance on Sentence-Level ASL Translation
ASL Sentence Dataset
- 4-word sentences from 16 ASL words
- 100 sentences
- 866 samples in total
Performance
- Average 16% Top-1 word error rate (WER)
- Average 4% Top-5 WER
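
For reference, word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) between the reference and the translated sentence, divided by the number of reference words. The sketch below implements that conventional definition; whether DeepASL computes its Top-1 and Top-5 WER in exactly this way is not stated on the slide, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One wrong word in a 4-word sentence gives a WER of 0.25.
wer = word_error_rate("how are you today", "how is you today")
```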

Application #1: ASL Tutor
ASL Tutor helps hearing parents of deaf children learn ASL.
[Screenshot: MyASLTutor, showing the looked-up word and explanation, and the ASL visualization]

Application #2: ASL Interpreter
ASL Interpreter enables two-way communication between deaf people and the hearing majority.
[Figure: deaf person; first-person point of view of the deaf person using a Microsoft HoloLens AR headset]

Video: https://www.youtube.com/watch?v=0pmjnnnn77c

Conclusions
- DeepASL represents the first deep learning-based sign language translation framework that enables ubiquitous and non-intrusive ASL translation at both word and sentence levels.
- DeepASL achieves an average 94.5% translation accuracy over 56 commonly used ASL words, and an average 16.1% word error rate on translating 100 sentences.
- We are taking the initiative on crowdsourcing ASL sign data. We believe that, with crowdsourced efforts, ASL translation technology can be significantly advanced.

Thank You
Biyi Fang, Michigan State University
fangbiyi@msu.edu
Web: fangbiyi.com