Project for production of closed-caption TV programs for the hearing impaired

Takahiro Wakao
Telecommunications Advancement Organization of Japan (TAO)
Uehara, Shibuya-ku, Tokyo 151-64, Japan
wakao@shibuya.tao.or.jp

Eiji Sawamura, TAO
Terumasa Ehara, NHK Science and Technical Research Laboratories / TAO
Ichiro Maruyama, TAO
Katsuhiko Shirai, Waseda University, Department of Information and Computer Science / TAO

Abstract

We describe an on-going project whose primary aim is to establish the technology for producing closed captions for TV news programs efficiently, using natural language processing and speech recognition techniques, for the benefit of the hearing impaired in Japan. The project is supported by the Telecommunications Advancement Organisation of Japan with the help of the Ministry of Posts and Telecommunications. We propose that natural language and speech processing techniques be used for efficient production of closed captions for TV programs. They enable us to summarise TV news texts into captions automatically and to synchronise the news texts with the speech and video automatically; the captions are then superimposed on the screen. For the summarisation we propose a combination of shallow methods: an importance measure is computed for every sentence in the original text, based on key words in the text, to determine which sentences are important, and parts of sentences judged unimportant are shortened or deleted. We also propose a keyword pair model for the synchronisation between text and speech.

Introduction

Closed captions for TV programs are not widely available in Japan. Only 10 percent of TV programs are shown with captions, in contrast to 70% in the United States and more than 30% in Britain. The availability is low firstly because the characters used in the Japanese language are complex and numerous, and secondly because closed captions are at present produced manually, which is a time-consuming and costly task. We therefore think that natural language and speech processing technology will be useful for the efficient production of TV programs with closed captions. The Telecommunications Advancement Organisation of Japan, with the support of the Ministry of Posts and Telecommunications, has initiated a project in which the electronically available text of a TV news program is summarised and synchronised with the speech and video automatically, and then superimposed on the original program. It is a five-year project which started in 1996, and its annual budget is about 2 million yen. In the following chapters we describe the main research issues in detail and the project schedule, and present the results of our preliminary research on the main research topics.

Figure 1: System outline (the original program's video and audio, with time codes, and the electronic news script pass through automatic summarisation, speech recognition and synchronisation phases, producing time-coded summary captions that are superimposed on the original program)

1 Research Issues

The main research issues in the project are as follows:

- automatic text summarisation
- automatic synchronisation of text and speech
- building an efficient closed-caption production system

The outline of the system is shown in Figure 1. Although all types of TV programs are to be handled by the project system, first priority is given to TV news programs, since most hearing impaired people say they want to watch closed-captioned TV news programs. The research issues are explained briefly next.

1.1 Text Summarisation

For most TV news programs today, the scripts (written text) are available electronically before they are read out by newscasters. Japanese news texts are read at a speed of between 350 and 400 characters per minute, and if all the characters in the texts were shown on the TV screen, there would be too many of them to be understood well (Komine et al. 1996). We therefore need to summarise the news texts to some extent before showing them on the screen. The aim of the research on automatic text summarisation is to summarise the text, fully or partially automatically, to a size suitable for closed captions. The current target is 70% summarisation in the number of characters.

1.2 Synchronisation of Text and Speech

We need to synchronise the text with the sound, or speech, of the program. This is done by hand at present, and we would like to employ speech recognition technology to assist the synchronisation. First, synchronising points between the original text and the speech are determined automatically (the recognition phase in Figure 1). Then the captions are synchronised with the speech and video (the synchronisation phase in Figure 1).

1.3 Efficient Closed Caption Production System

We will build a system by integrating the summarisation and synchronisation techniques with techniques for superimposing characters onto the screen. We have also conducted research on how to present the captions on the screen for handicapped people.

2 Project Schedule

The project has two stages: the first three years and the remaining two years. In the first stage we work on the issues above and build a prototype system. The prototype system will be used to produce closed captions, and the capability and functions of the system will be evaluated. We will focus on improvement and evaluation of the system in the second stage.
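As a rough illustration of the character budget implied by Section 1.1, the following sketch computes how many caption characters a speech segment of a given length can carry, using the reading speed of 350-400 characters per minute and the 70% summarisation target mentioned above; the function itself and its default values are illustrative assumptions, not part of the project system.

```python
def caption_char_budget(segment_seconds: float,
                        chars_per_minute: float = 350.0,
                        summarisation_ratio: float = 0.7) -> int:
    """Approximate number of caption characters for one speech segment.

    chars_per_minute: reading speed of the original news text (350-400 in the paper).
    summarisation_ratio: target size of the captions relative to the original (70%).
    """
    original_chars = chars_per_minute * segment_seconds / 60.0
    return int(original_chars * summarisation_ratio)

# Example: a 20-second news segment read at 350 characters per minute
# corresponds to roughly 116 original characters, i.e. about 81 caption characters.
print(caption_char_budget(20))  # -> 81
```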

3 Preliminary Research Results

We describe the results of our research on automatic summarisation and on automatic synchronisation of text and speech. A study on how to present captions on the TV screen to hearing impaired viewers is then briefly mentioned.

3.1 Automatic Text Summarisation

We use a combination of shallow processing methods for automatic text summarisation. The first is to compute key words in a text and an importance measure for each sentence, and then select important sentences from the text. The second is to shorten or delete unimportant parts of a sentence using Japanese language-specific rules.

3.1.1 Sentence Extraction

Ehara found that, compared with newspaper text, TV news texts have longer sentences and each text has a smaller number of sentences (Ehara et al. 1997). If we summarise a TV news text simply by selecting sentences from the original text, the summarisation is rather 'rough'. On the other hand, if we divide long sentences into smaller units, and thus increase the number of sentences in the text, we may obtain a finer and better summarisation (Kim & Ehara 1994). Therefore, if a sentence in a given text is too long, the system partitions it into smaller units, with minimal changes to the original sentence.

To compute an importance measure for each sentence, we first need to find the key words of the text. We tested a high-frequency key word method (Luhn 1957, Edmundson 1969) and a TF-IDF-based (Term Frequency, Inverse Document Frequency) method. We evaluated the two methods on ten thousand TV news texts, and found that the high-frequency key word method showed slightly better results than the method based on TF-IDF scores (Wakao et al. 1997).

3.1.2 Rules for Shortening Text

Another way of reducing the number of characters in a Japanese text, and thus summarising it, is to shorten or delete parts of the sentences. For example, if a sentence ends with a sahen verb followed by its inflection, or by helping verbs or particles expressing politeness, the meaning does not change much even if we keep only the verb stem (the sahen noun) and delete the rest. This is one of the ways in which captions shorten or delete unimportant parts of sentences. We analysed the texts and captions of a TV news program which is broadcast fully captioned for the hearing impaired in Japan, and compiled 16 rules, divided into 5 groups. We describe them one by one below.

1) Shortening and deletion of sentence ends

Some phrases which come at the end of a sentence can be shortened or deleted. If a sahen verb is used as the main verb, we can change it to its sahen noun. For example:

  ... keikakushiteimasu -> ... keikaku
  (keikakusuru = to plan, a sahen verb)

If the sentence ends in a reporting style, we may delete the verb part:

  ... bekida to nobemashita -> bekida
  (bekida = should; nobemashita = have said)

2) Keeping parts of the sentence

Important noun phrases are kept in the caption, and the rest of the sentence is deleted:

  taihosaretano ha Matumoto shachou -> taiho Matumoto shachou
  (taiho = arrest; shachou = company president; Matumoto = a person's name)

3) Replacing with a shorter phrase

Some nouns are replaced with a simpler and shorter phrase:

  souridaijin -> shushou
  (both mean "prime minister")

4) Omission of connecting phrases

Connecting phrases at the beginning of a sentence may be omitted:

  shikashi (= however), ippou (= on the other hand)

5) Deletion of time expressions

Relative time expressions such as today (kyou) and yesterday (kinou) can be deleted. Absolute time expressions such as May 1998, however, stay unchanged in the summarisation.

When we apply these rules to the selected important sentences, we can reduce the size of the text by a further 1 to 2 percent; a simplified sketch of the combined selection and shortening flow is given below.
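The following is a minimal sketch of how the two shallow steps of Section 3.1 could fit together: sentences are scored by the frequency of the key words they contain (the high-frequency key word idea of Section 3.1.1), selected greedily up to a character budget, and then passed through a shortening step standing in for the rules of Section 3.1.2. The sentence splitter, the whitespace tokenisation and the single illustrative shortening rule are assumptions made for the sketch, not the project's actual implementation.

```python
import re
from collections import Counter
from typing import List

def split_sentences(text: str) -> List[str]:
    # Naive splitter on the Japanese full stop; a stand-in for the sentence
    # partitioning discussed in Section 3.1.1.
    return [s for s in re.split(r"(?<=[。])", text) if s.strip()]

def score_sentences(sentences: List[str]) -> List[float]:
    # High-frequency key word idea: words occurring often in the text count as
    # key words, and a sentence's importance is the average frequency of its words.
    # Whitespace tokenisation is an assumption; real Japanese text would need a
    # morphological analyser for word segmentation.
    freqs = Counter(w for s in sentences for w in s.split())
    return [sum(freqs[w] for w in s.split()) / (len(s.split()) or 1) for s in sentences]

def shorten(sentence: str) -> str:
    # Placeholder for the 16 shortening rules of Section 3.1.2; only one
    # illustrative pattern (dropping a polite sentence ending) is applied here.
    return re.sub(r"(して)?います。$", "。", sentence)

def summarise(text: str, char_ratio: float = 0.7) -> str:
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    budget = char_ratio * len(text)
    chosen, used = set(), 0
    # Greedily take the most important sentences until about 70% of the characters are used.
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        if used + len(sentences[i]) <= budget:
            chosen.add(i)
            used += len(sentences[i])
    # Keep the original sentence order and apply the shortening step.
    return "".join(shorten(sentences[i]) for i in sorted(chosen))
```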

3.2 Automatic Synchronisation of Text and Speech

We next synchronise the text and the speech. First, the written TV news text is converted into a stream of phonetic transcriptions. Second, we try to detect the time points in the text and their corresponding speech sections. For this synchronisation we have developed a 'keyword pair model', shown in Figure 2.

Figure 2: Keyword pair model (keyword set 1 and keyword set 2 on either side of the synchronisation point B, with a garbage arc for non-synchronising input)

The model consists of two sets of words (keyword set 1 and keyword set 2) before and after the synchronisation point (point B). Each set contains one or two key words, which are represented by a sequence of phonetic HMMs (Hidden Markov Models). Each HMM is a three-loop, eight-mixture-distribution HMM, and we use 39 phonetic HMMs to represent all Japanese phonemes. When speech is fed to the model, non-synchronising input travels through the garbage arc, while synchronising input goes through the two keyword sets, which makes the likelihood at point B increase. Therefore, if the likelihood observed at point B becomes greater than a certain threshold, we decide that this is the synchronisation point for the input data.

Thirty-four (34) keyword pairs were taken from data not used in the training and selected for the evaluation of the model. We used the speech of four people for the evaluation. The evaluation results, the accuracy (detection rate) and the false alarm rate for the case in which each keyword set contains two key words, are shown in Table 1. The threshold is the logarithm of the likelihood; since the likelihood is between zero and one, the threshold is below zero.

  Threshold   Detection rate (%)   False alarm rate (FA/KW/hour)
  -10          34.56
  -20          44.12
  -30          54.41
  -40          60.29
  -50          64.71               0.06
  -60          69.12               0.06
  -70          69.85               0.06
  -80          71.32               0.12
  -90          78.68               0.18
  -100         82.35               0.18
  -150         91.18               0.54
  -200         94.85               1.21
  -250         95.59               1.81
  -300         99.26               2.41

Table 1: Synchronisation detection

As the threshold decreases, the detection rate increases, while the false alarm rate increases only a little (Maruyama et al. 1998).

3.3 Speech Database

We have been collecting TV and radio news speech. In 1996 we collected speech data by simulating news programs: TV news texts were read and recorded sentence by sentence in a studio. This set contains seven and a half hours of recordings of twenty people (both male and female). In 1997 we continued to record TV news speech by simulation, and also recorded speech data from actual radio and TV programs; we now have ten hours of actual radio recordings and ten hours of actual TV programs. We will continue to record speech data and increase the size of the database in 1998.
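Before turning to caption presentation, the following sketch illustrates the threshold decision described in Section 3.2: given a stream of log-likelihood values observed at point B of the keyword pair model, frames whose value exceeds a threshold and which form a local peak are reported as candidate synchronisation points. The frame-level likelihood input and the simple local-maximum rule are assumptions made for the sketch; in the actual system these scores come from the phonetic HMM keyword spotter.

```python
from typing import List, Tuple

def detect_sync_points(log_likelihoods: List[float],
                       threshold: float = -100.0) -> List[Tuple[int, float]]:
    """Return (frame index, score) for frames where the log likelihood observed
    at point B rises above the threshold and is a local maximum, i.e. candidate
    synchronisation points between the news text and the speech."""
    points = []
    for i, score in enumerate(log_likelihoods):
        if score <= threshold:
            continue
        prev_score = log_likelihoods[i - 1] if i > 0 else float("-inf")
        next_score = log_likelihoods[i + 1] if i + 1 < len(log_likelihoods) else float("-inf")
        if score >= prev_score and score >= next_score:  # keep only local peaks
            points.append((i, score))
    return points

# Example with made-up frame scores; with the threshold at -100 only the peak
# frame (index 3) is reported as a synchronisation point.
scores = [-300.0, -250.0, -90.0, -60.0, -80.0, -220.0, -300.0]
print(detect_sync_points(scores, threshold=-100.0))  # -> [(3, -60.0)]
```

Lowering the threshold admits more candidate points, which is exactly the detection-rate versus false-alarm trade-off summarised in Table 1.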

3.4 Caption Presentation

We have conducted a small-scale study on how to present captions on the TV screen to hearing impaired viewers. We superimposed captions by hand on several kinds of TV programs, and they were evaluated by handicapped viewers (hard-of-hearing persons) in terms of the following points:

- characters: size, font, colour
- number of lines
- timing
- location
- method of scrolling
- captions inside or outside the picture (see the two examples below)

Figure 3: Captions inside the picture
Figure 4: Captions outside the picture

Most of the subjects preferred two-line captions placed outside the picture, without scrolling (Tanahashi 1998). This was still a preliminary study, and we plan to conduct a similar evaluation with hearing impaired viewers on a larger scale.

Conclusion

We have described a national project, its research issues and schedule, and our preliminary research results. The aim of the project is to establish language and speech processing technology with which TV news program text is summarised and turned into captions, synchronised with the speech, and superimposed on the original program for the benefit of the hearing impaired. We will continue to conduct research, build a prototype TV caption production system, and try to put it to practical use in the near future.

Acknowledgements

We would like to thank Nippon Television Network Corporation for letting us use the pictures (Figures 3 and 4) of their news program for the purpose of our research.

References

Edmundson, H.P. (1969) New Methods in Automatic Extracting. Journal of the ACM, 16(2), pp. 264-285.

Ehara, T., Wakao, T., Sawamura, E., Maruyama, I., Abe, Y., Shirai, K. (1997) Application of natural language processing and speech processing technology to production of closed-caption TV programs for the hearing impaired. NLPRS 1997.

Kim, Y.B., Ehara, T. (1994) A method of partitioning of long Japanese sentences with subject resolution in J/E machine translation. Proc. of the 1994 International Conference on Computer Processing of Oriental Languages, pp. 467-473.

Komine, K., Hoshino, H., Isono, H., Uchida, T., Iwahana, Y. (1996) Cognitive Experiments of News Captioning for Hearing Impaired Persons. Technical Report of IEICE (The Institute of Electronics, Information and Communication Engineers), HCS96-23, in Japanese, pp. 7-12.

Luhn, H.P. (1957) A statistical approach to the mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), pp. 309-317.

Maruyama, I., Abe, Y., Ehara, T., Shirai, K. (1998) A Study on Keyword Spotting using Keyword Pair Models for Synchronization of Text and Speech. Acoustical Society of Japan, Spring Meeting, 2-Q-13, in Japanese.

Tanahashi, D. (1998) Study on Caption Presentation for TV News Programs for the Hearing Impaired. Master's thesis, Waseda University, Department of Information and Computer Science, in Japanese.

Wakao, T., Ehara, T., Sawamura, E., Abe, Y., Shirai, K. (1997) Application of NLP technology to production of closed-caption TV programs in Japanese for the hearing impaired. ACL '97 Workshop on Natural Language Processing for Communication Aids, pp. 55-58.