A METHOD TO ENHANCE INFORMATIZED CAPTION FROM IBM WATSON API USING SPEAKER PRONUNCIATION TIME-DB
Yong-Sik Choi, YunSik Son and Jin-Woo Jung
Department of Computer Science and Engineering, Dongguk University, Seoul, Korea

ABSTRACT

IBM Watson API is a speech recognition API system which can automatically generate not only recognized words from a voice signal but also the speaker ID and the timing information of each word, including the starting time and the ending time. The performance of IBM Watson API is very good on well-recorded voice signals from clearly speaking trained speakers, but it is not good enough when there are some noises in the recorded voice signal. This situation is easily found with movie sounds that include not only a speaking voice signal but also background music or special sound effects. This paper deals with a novel method to enhance the informatized caption from IBM Watson API, based on a speaker pronunciation time-DB, to resolve this noisy signal problem. To do this, the proposed method uses the original caption information as an additional input. By comparing the original caption with the output of IBM Watson API, the error words can be automatically detected and correctly modified. And by using the speaker pronunciation time-DB, which contains the average pronunciation time of each word for each speaker, the timing information of each error word can be estimated. In this way, more precisely enhanced informatized captions can be generated based on the IBM Watson API. The usefulness of the proposed method is verified with two case studies with noisy voice signals.

KEYWORDS

Informatized Caption, Speaker Pronunciation Time, IBM Watson API, Speech to Text Translation

1. INTRODUCTION

By the waves of the 4th industrial revolution, artificial intelligence has become one of the most promising technologies nowadays. There are many research areas and research results in artificial intelligence. One of them is natural language processing by speech recognition.
Typical speech recognition technologies include speech-to-text conversion. Among captions in which speech is converted into characters, captions including timing information and speaker ID information are referred to as informatized captions [1, 2]. Such an informatized caption can be generated by using IBM Watson API [3]. However, the IBM Watson API is susceptible to recognition errors when there are some noises in the voice signal, and this situation is easily found with movie sounds that include not only a speaking voice signal but also background music or special sound effects. In order to solve this noisy voice problem, there has been a method of predicting the timing information of an informatized caption based on a linear estimation formula proportional to the number of letters in each word [2]. But this linear estimation method based on the number of letters is not good enough when there are some silent syllables. Therefore, a novel method to enhance the informatized caption from IBM Watson API based on a speaker pronunciation time-DB is proposed in this paper. DOI: 10.5121/ijnlc
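The letter-count estimation of the earlier method [2] can be sketched as follows (an illustrative Python sketch based on the paper's description, not the code of [2]; function and variable names are assumptions):

```python
# Sketch of the prior linear estimation idea [2] (illustrative names):
# a segment of speech is split across its words in proportion to each
# word's letter count, which over-allocates time to silent syllables.

def allocate_by_letters(words, seg_start, seg_end):
    """Split [seg_start, seg_end] among words by letter count."""
    total_letters = sum(len(w) for w in words)
    times, t = [], seg_start
    for w in words:
        dur = (seg_end - seg_start) * len(w) / total_letters
        times.append((w, round(t, 3), round(t + dur, 3)))
        t += dur
    return times

print(allocate_by_letters(["do", "you", "have", "a", "box"], 0.0, 1.5))
```

Because the split depends only on spelling, a word with many silent letters receives too much time, which is the weakness the proposed SPT-DB method addresses.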
2. SPEAKER PRONUNCIATION TIME-DB (SPT-DB)

2.1. STRUCTURE

Figure 1. Structure of SPT-DB

SPT-DB consists of a node for each speaker (S_p), as shown in Fig. 1. Each speaker node is linked to nodes holding the average pronunciation time (D_pk) of each word (W_pk). The word nodes of a speaker are arranged in ascending order of average pronunciation time and are connected to each other, with a null value at the end. When SPT-DB searches for a word spoken by a speaker, it searches based on the pronunciation time.

2.2. ASSUMPTION

Before proceeding with the study, the following assumption is made on SPT-DB.

[Assumption] SPT-DB is already configured for each speaker.

3. PROPOSED ALGORITHM

3.1. ALGORITHM FOR MODIFYING INCORRECTLY RECOGNIZED WORDS BASED ON SPT-DB

Figure 2. Original caption T(X) and informatized caption T_s^+(X)

Basically, the original caption, T(X), and the informatized caption from the speech recognition result, T_s^+(X), are input together. Here, S_X and E_X mean the start time and the end time of pronunciation for the word X, respectively.
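The SPT-DB structure described above can be sketched as follows (a minimal illustrative sketch, not the authors' implementation; class and method names are assumptions):

```python
# Minimal sketch of the SPT-DB structure (illustrative names): one node
# per speaker S_p holding the average pronunciation time D_pk of every
# word W_pk, kept in ascending order of time so lookups can scan by
# pronunciation time, as described in Section 2.1.
import bisect

class SPTDB:
    def __init__(self):
        self.speakers = {}  # speaker id -> list of (avg_time, word), ascending

    def add(self, speaker, word, avg_time):
        entries = self.speakers.setdefault(speaker, [])
        bisect.insort(entries, (avg_time, word))   # keep ascending order

    def spt(self, speaker, word):
        """Average pronunciation time SPT(s, w), or None if unknown."""
        for t, w in self.speakers.get(speaker, []):
            if w == word:
                return t
        return None

db = SPTDB()
db.add(0, "box", 0.42)
db.add(0, "grandma", 0.61)
print(db.spt(0, "box"))   # -> 0.42
```

A linked list with a null terminator, as in Fig. 1, would behave the same way; a sorted Python list is used here only for brevity.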
[Step 1] Judge whether there is an incorrectly recognized word by comparing T(X) with T_s^+(X). If there is no incorrectly recognized word, terminate. If there is an incorrectly recognized word, go to the next step.

[Step 2] Judge how many consecutive incorrectly recognized words there are in the sequence and pass the parameters to the corresponding case.

[Step 3] Modify the words based on the start and end points returned by the case, using the SPT-DB.

[Step 4] If there is an incorrectly recognized word among the following words, repeat Steps 1 to 3, and terminate if there is no incorrectly recognized word.

CASE 1: THERE IS ONLY ONE INCORRECTLY RECOGNIZED WORD

Figure 3. One incorrectly recognized word

[Step 1] Find the point at which a signal of a specific volume (dB) T or more starts between E_a and S_c, and determine S_b.

[Step 2] If there is a minimum time t' in S_b to S_c at which the signal intensity falls below the volume T and then remains below T until S_c, E_b = t' is determined. If there is no t' satisfying the above condition, E_b = S_c.

[Step 3] Return the start time and the end time.

CASE 2: THERE ARE TWO INCORRECTLY RECOGNIZED WORDS

Figure 4. Two incorrectly recognized words

[Step 1] Find the point at which a signal of a specific volume (dB) T or more starts between E_a and S_c, and determine S_b1.

[Step 2] If there is a minimum time t' in S_b1 to S_c at which the signal intensity falls below the volume T and then remains below T until S_c, E_b2 = t' is determined. If there is no t' satisfying the above condition, E_b2 = S_c.

[Step 3] The ending point of the first word is obtained by adding to its start time the length of the detected interval multiplied by the ratio of the average pronunciation time of the first word to the sum of the average pronunciation times of the two words. This is summarized as follows.
E_b1 = S_b1 + (E_b2 - S_b1) x SPT(s, b1) / (SPT(s, b1) + SPT(s, b2))

where SPT(s, w) denotes the average pronunciation time of word w by speaker s stored in SPT-DB.

[Step 4] Return the start time and the end time.

CASE 3: THERE ARE THREE OR MORE INCORRECTLY RECOGNIZED WORDS

Figure 5. Three or more incorrectly recognized words

[Step 1] Find the point at which a signal of a specific volume (dB) T or more starts between E_a and S_c, and determine S_w1.

[Step 2] If there is a minimum time t' in S_w1 to S_c at which the signal intensity falls below the volume T and then remains below T until S_c, E_wn = t' is determined. If there is no t' satisfying the above condition, E_wn = S_c.

[Step 3] The ending point of each word is obtained by adding to its start time the length of the detected interval multiplied by the ratio of the average pronunciation time of the current word to the sum of the average pronunciation times of the incorrectly recognized words. This is summarized as follows.

E_wi = S_wi + (E_wn - S_w1) x SPT(s, wi) / (SPT(s, w1) + ... + SPT(s, wn))

[Step 4] Return the start time and the end time.

4. CASE STUDY I

The case was tested on English listening assessment data. Fig. 6 shows a problem from the English listening evaluation for the university entrance examination. Fig. 7 and Fig. 8 show the results of speech recognition using the IBM Watson API. Table 1 and Table 2 list the time information of the captions, expressed as [start time, end time]. Using the IBM Watson API, speech recognition in an environment with no constraints results in high accuracy, as shown in Fig. 7. Table 1 shows that the incorrectly recognized word is (B, 10) and the accuracy of speech recognition is 97.56%. However, in a noisy environment like Fig. 8, the accuracy dropped significantly. Table 2 shows that the incorrectly recognized words are (A, ), (A, 7), (A, 7), (A, 3), (B, 2), (B, 3), (B, 4), (B, 4), (B, 9), (B, 0) and (C, 3), and the accuracy of speech recognition is 73.17%. For reference, the original voice source was synthesized with raining sound using Adobe Audition CC 2017 to create a noisy environment. If we apply the proposed algorithm to the noisy signal, we can obtain the result shown in Fig. 9 and Table 3.
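The accuracy figures reported above follow from a simple word-level count. As a sketch (the 41-word total for Case Study I is inferred from the reported percentages, not stated explicitly in the paper):

```python
# Word accuracy as (correct words) / (total words). With 41 words in the
# Case Study I caption (inferred from the percentages), 1 error gives
# 97.56% and 11 errors give 73.17%.

def word_accuracy(total_words, wrong_words):
    return round(100.0 * (total_words - wrong_words) / total_words, 2)

print(word_accuracy(41, 1))    # -> 97.56  (clean audio)
print(word_accuracy(41, 11))   # -> 73.17  (rain noise)
```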
The accuracy of speech recognition is 100% with the help of the original caption, and each word includes its own start time and end time.
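The boundary search and SPT-proportional split of Section 3 can be sketched together as follows (a simplified illustrative sketch assuming a per-frame dB envelope; the function names and frame representation are assumptions, not the authors' implementation):

```python
# Illustrative sketch of Section 3 (assumed names, simplified logic):
# Steps 1-2 locate the error region [S_b, E_b] with a dB threshold T,
# then (Cases 2 and 3) the region is divided among the n error words in
# proportion to their average pronunciation times SPT(s, w) from SPT-DB.

def find_region(env, e_a, s_c, T):
    """Start of speech at/above T after E_a; end where it stays below T."""
    s_b = next((i for i in range(e_a, s_c + 1) if env[i] >= T), s_c)
    t = s_c
    while t - 1 > s_b and env[t - 1] < T:   # trailing run below T
        t -= 1
    return s_b, t

def split_by_spt(start, end, spt_times):
    """Divide [start, end] among words in proportion to SPT values."""
    total = sum(spt_times)
    bounds, t = [], float(start)
    for d in spt_times:
        e = t + (end - start) * d / total
        bounds.append((round(t, 3), round(e, 3)))
        t = e
    return bounds

env = [10, 10, 40, 45, 42, 12, 11, 10, 50]        # volume per frame (dB)
s_b, e_b = find_region(env, e_a=0, s_c=7, T=30)   # -> (2, 5)
print(split_by_spt(s_b, e_b, [0.3, 0.6]))         # -> [(2.0, 3.0), (3.0, 5.0)]
```

With one error word the split step is skipped and the detected region is used directly, matching Case 1.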
Figure 6. Original caption of Case Study I

Figure 7. Recognition of original voice without noise by IBM Watson system

Table 1. Informatized caption from original voice without noise by IBM Watson system

A (Speaker): Dad I want to send this book to grandma do you have a box
B (Speaker): Yeah I've got this one to put photo albums and but it's a bit small
C (Speaker): The box looks big enough for the book Can I use it
Figure 8. Recognition of mixed voice with rain noise by IBM Watson system

Table 2. Informatized caption from mixed voice with rain noise by IBM Watson system

A (Speaker): Yeah I want to send this perfect grandma do you have a plot
B (Speaker 0): Yeah I found someone to put photo albums and bought it's a bit small
C (Speaker): The box looked big enough for the book
D (Speaker): Can I use it

Figure 9. Speech recognition result modified by proposed algorithm
Table 3. Informatized caption modified by the proposed algorithm

A (Speaker): Dad I want to send this book to Grandma Do you have a box
B (Speaker): Yeah I've got this one to put photo albums in but it's a bit small
C (Speaker): The box looks big enough for the book Can I use it

5. CASE STUDY II

The case was tested on English listening assessment data. Fig. 10 shows a problem from the English listening evaluation for the university entrance examination. Fig. 11 and Fig. 12 show the results of speech recognition using the IBM Watson API. Table 4 and Table 5 list the time information of the captions, expressed as [start time, end time]. Using the IBM Watson API, speech recognition in an environment with no constraints results in high accuracy, as shown in Fig. 11. Table 4 shows that the incorrectly recognized word is (D, 1) and the accuracy of speech recognition is 97.29%. However, in a noisy environment like Fig. 12, the accuracy dropped significantly. Table 5 shows that there are three incorrectly recognized words, all in sentence A, and the accuracy of speech recognition is 91.89%. For reference, the original voice source was synthesized with raining sound using Adobe Audition CC 2017 to create a noisy environment. If we apply the proposed algorithm to the noisy signal, we can obtain the result shown in Fig. 13 and Table 6. The accuracy of speech recognition is 100% with the help of the original caption, and each word includes its own start time and end time.

Figure 10. Original caption of Case Study II
8 Figure. Recognition of original voice without noise b IBM Watson sstem Table 4. Informatized caption from original voice without noise b IBM Watson sstem Sentenc e Hone I heard the Smit famil move out to the countrsi I reall env the A Speak er 0 B Speak er C Speak er 0 D Speak er h d de m Reall wh is that I think we can sta health if we live in the countr Who Can ou be more specifi m c
Figure 12. Recognition of mixed voice with rain noise by IBM Watson system

Table 5. Informatized caption from mixed voice with rain noise by IBM Watson system

A (Speaker 0): I heard the Smith family moved out to the countryside envy them
B (Speaker): Really why is that
C (Speaker 0): I think we can stay healthy if we live in the country
D (Speaker): Can you be more specific

Figure 13. Speech recognition result modified by proposed algorithm
Table 6. Informatized caption modified by the proposed algorithm

A (Speaker 0): Honey I heard the Smith family moved out to the countryside I really envy them
B (Speaker): Really why is that
C (Speaker 0): I think we can stay healthy if we live in the country
D (Speaker): Hmm Can you be more specific

6. CONCLUSIONS

In this paper, a novel method to enhance the informatized caption from IBM Watson API based on a speaker pronunciation time-DB is addressed, to find and modify incorrectly recognized words from IBM Watson API. SPT-DB contains the average pronunciation time information of each word for each speaker and is used to correct the errors in the informatized caption obtained through the IBM Watson API. The usefulness of the proposed method is verified with two case studies with noisy voice signals. However, the proposed algorithm also has some limitations, such as that SPT-DB should be created in advance, because it is assumed that the information of the corresponding words already exists in SPT-DB. Further study will be conducted to modify incorrectly recognized words while performing speech recognition and simultaneously to update the SPT-DB in real time.

ACKNOWLEDGEMENTS

The following are results of a study on the "Leaders in INdustry-university Cooperation+" Project, supported by the Ministry of Education and National Research Foundation of Korea, partially supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (205RDAA ), and also partially supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korean government (MOTIE: Ministry of Trade, Industry & Energy, HRD Program for Embedded Software R&D) (No. N00884).
REFERENCES

[1] CheonSun Kim, "Introduction to IBM Watson with case studies," Broadcasting and Media Magazine, Vol. 22, No. 1, pp.
[2] Yong-Sik Choi, Hyun-Min Park, Yun-Sik Son and Jin-Woo Jung, "Informatized Caption Enhancement based on IBM Watson API," Proceedings of KIIS Autumn Conference 2017, Vol. 27, No. 2, pp.
[3] IBM Watson Developer's Page,

AUTHORS

Yong-Sik Choi has been an M.S. candidate at Dongguk University, Korea, since 2017. His current research interests include machine learning and intelligent human-robot interaction.

YunSik Son received the B.S. degree from the Dept. of Computer Science and Engineering, Dongguk University, Seoul, Korea, in 2004, and the M.S. and Ph.D. degrees from the Dept. of Computer Science and Engineering, Dongguk University, Seoul, Korea, in 2006 and 2009, respectively. He was a research professor at the Dept. of Brain and Cognitive Engineering, Korea University, Seoul, Korea. Currently, he is an assistant professor at the Dept. of Computer Science and Engineering, Dongguk University, Seoul, Korea. His research areas include secure software, programming languages, compiler construction, and mobile/embedded systems.

Jin-Woo Jung received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 1997 and 1999, respectively, and received the Ph.D. degree in electrical engineering and computer science from KAIST, Korea. Since 2006, he has been with the Department of Computer Science and Engineering at Dongguk University, Korea, where he is currently a Professor. During 2001~2002, he worked as a visiting researcher at the Department of Mechano-Informatics, University of Tokyo, Japan. During 2004~2006, he worked as a researcher in the Human-friendly Welfare Robot System Research Center at KAIST, Korea. During 2014, he worked as a visiting scholar at the Department of Computer and Information Technology, Purdue University, USA.
His current research interests include human behaviour recognition, multiple robot cooperation and intelligent human-robot interaction.
More informationHow the LIVELab helped us evaluate the benefits of SpeechPro with Spatial Awareness
How the LIVELab helped us evaluate the benefits of SpeechPro with Author Donald Hayes, Ph.D. Director, Clinical Research Favorite sound: Blues guitar A Sonova brand Figure The LIVELab at McMaster University
More informationMicrocontroller and Sensors Based Gesture Vocalizer
Microcontroller and Sensors Based Gesture Vocalizer ATA-UR-REHMAN SALMAN AFGHANI MUHAMMAD AKMAL RAHEEL YOUSAF Electrical Engineering Lecturer Rachna College of Engg. and tech. Gujranwala Computer Engineering
More informationSource and Description Category of Practice Level of CI User How to Use Additional Information. Intermediate- Advanced. Beginner- Advanced
Source and Description Category of Practice Level of CI User How to Use Additional Information Randall s ESL Lab: http://www.esllab.com/ Provide practice in listening and comprehending dialogue. Comprehension
More informationHoughton Mifflin Harcourt Avancemos Spanish correlated to the. NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid, and Novice High
Houghton Mifflin Harcourt Avancemos Spanish 1 2018 correlated to the NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid, and Novice High Novice Low Interpersonal Communication I can communicate
More informationINTELLIGENT LIP READING SYSTEM FOR HEARING AND VOCAL IMPAIRMENT
INTELLIGENT LIP READING SYSTEM FOR HEARING AND VOCAL IMPAIRMENT R.Nishitha 1, Dr K.Srinivasan 2, Dr V.Rukkumani 3 1 Student, 2 Professor and Head, 3 Associate Professor, Electronics and Instrumentation
More informationITU-T. FG AVA TR Version 1.0 (10/2013) Part 3: Using audiovisual media A taxonomy of participation
International Telecommunication Union ITU-T TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU FG AVA TR Version 1.0 (10/2013) Focus Group on Audiovisual Media Accessibility Technical Report Part 3: Using
More informationHoughton Mifflin Harcourt Avancemos Spanish 1b correlated to the. NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid and Novice High
Houghton Mifflin Harcourt Avancemos Spanish 1b 2018 correlated to the NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid and Novice High Novice Low Interpersonal Communication I can communicate
More informationVolume 2. Lecture Notes in Electrical Engineering 215. Kim Chung Eds. IT Convergence and Security IT Convergence and Security 2012.
Lecture Notes in Electrical Engineering 215 Kuinam J. Kim Kyung-Yong Chung Editors IT Convergence and Security 2012 Volume 2 The proceedings approaches the subject matter with problems in technical convergence
More informationHEARING SCREENING Your baby passed the hearing screening. Universal Newborn
Parents are important partners. They have the greatest impact upon their young child and their active participation is crucial. Mark Ross (1975) Universal Newborn HEARING SCREENING Your baby passed the
More informationCaptioning Your Video Using YouTube Online Accessibility Series
Captioning Your Video Using YouTube This document will show you how to use YouTube to add captions to a video, making it accessible to individuals who are deaf or hard of hearing. In order to post videos
More informationTASC CONFERENCES & TRAINING EVENTS
TASC is sponsored by the Administration on Developmental Disabilities (ADD), the Center for Mental Health Services (CMHS), the Rehabilitation Services Administration (RSA), the Social Security Administration
More informationHoughton Mifflin Harcourt Avancemos Spanish 1a correlated to the. NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid and Novice High
Houghton Mifflin Harcourt Avancemos Spanish 1a 2018 correlated to the NCSSFL ACTFL Can-Do Statements (2015), Novice Low, Novice Mid and Novice High Novice Low Interpersonal Communication I can communicate
More informationA Communication tool, Mobile Application Arabic & American Sign Languages (ARSL) Sign Language (ASL) as part of Teaching and Learning
A Communication tool, Mobile Application Arabic & American Sign Languages (ARSL) Sign Language (ASL) as part of Teaching and Learning Fatima Al Dhaen Ahlia University Information Technology Dep. P.O. Box
More information한국어 Hearing in Noise Test(HINT) 문장의개발
KISEP Otology Korean J Otolaryngol 2005;48:724-8 한국어 Hearing in Noise Test(HINT) 문장의개발 문성균 1 문형아 1 정현경 1 Sigfrid D. Soli 2 이준호 1 박기현 1 Development of Sentences for Korean Hearing in Noise Test(KHINT) Sung-Kyun
More informationDiagnosis via Model Based Reasoning
Diagnosis via Model Based Reasoning 1 Introduction: Artificial Intelligence and Model Based Diagnosis 1.1 Diagnosis via model based reasoning 1.2 The diagnosis Task: different approaches Davies example
More informationA Smart Texting System For Android Mobile Users
A Smart Texting System For Android Mobile Users Pawan D. Mishra Harshwardhan N. Deshpande Navneet A. Agrawal Final year I.T Final year I.T J.D.I.E.T Yavatmal. J.D.I.E.T Yavatmal. Final year I.T J.D.I.E.T
More informationChapter 17. Speech Recognition and Understanding Systems
Chapter 17. Speech Recognition and Understanding Systems The Quest for Artificial Intelligence, Nilsson, N. J., 2009. Lecture Notes on Artificial Intelligence, Spring 2016 Summarized by Jang, Ha-Young
More informationSummary. About Pilot Waverly Labs is... Core Team Pilot Translation Kit FAQ Pilot Speech Translation App FAQ Testimonials Contact Useful Links. p.
Press kit Summary About Pilot Waverly Labs is... Core Team Pilot Translation Kit FAQ Pilot Speech Translation App FAQ Testimonials Contact Useful Links p. 3 p.4 p. 5 p. 6 p. 8 p. 10 p. 12 p. 14 p. 16 p.
More informationThe Effect of Sensor Errors in Situated Human-Computer Dialogue
The Effect of Sensor Errors in Situated Human-Computer Dialogue Niels Schuette Dublin Institute of Technology niels.schutte @student.dit.ie John Kelleher Dublin Institute of Technology john.d.kelleher
More informationHand Gestures Recognition System for Deaf, Dumb and Blind People
Hand Gestures Recognition System for Deaf, Dumb and Blind People Channaiah Chandana K 1, Nikhita K 2, Nikitha P 3, Bhavani N K 4, Sudeep J 5 B.E. Student, Dept. of Information Science & Engineering, NIE-IT,
More informationUsing Source Models in Speech Separation
Using Source Models in Speech Separation Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/
More informationSpeech Technology at Work IFS
Speech Technology at Work IFS SPEECH TECHNOLOGY AT WORK Jack Hollingum Graham Cassford Springer-Verlag Berlin Heidelberg GmbH Jack Hollingum IFS (Publications) Ud 35--39 High Street Kempston Bedford MK42
More informationThe power to connect us ALL.
Provided by Hamilton Relay www.ca-relay.com The power to connect us ALL. www.ddtp.org 17E Table of Contents What Is California Relay Service?...1 How Does a Relay Call Work?.... 2 Making the Most of Your
More informationA Study on a Automatic Audiometry Aid by PSM
International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 11, Number 8 (2018), pp. 1263-1272 International Research Publication House http://www.irphouse.com A Study on a Automatic
More informationThis American Life Transcript. Prologue. Broadcast June 25, Episode #411: First Contact. So, Scott, you were born without hearing, right?
Scott Krepel Interview from TAL #411 1 This American Life Transcript Prologue Broadcast June 25, 2010 Episode #411: First Contact Is that Marc? Yes, that s Marc speaking for Scott. So, Scott, you were
More informationProsody Rule for Time Structure of Finger Braille
Prosody Rule for Time Structure of Finger Braille Manabi Miyagi 1-33 Yayoi-cho, Inage-ku, +81-43-251-1111 (ext. 3307) miyagi@graduate.chiba-u.jp Yasuo Horiuchi 1-33 Yayoi-cho, Inage-ku +81-43-290-3300
More informationHuman cogition. Human Cognition. Optical Illusions. Human cognition. Optical Illusions. Optical Illusions
Human Cognition Fang Chen Chalmers University of Technology Human cogition Perception and recognition Attention, emotion Learning Reading, speaking, and listening Problem solving, planning, reasoning,
More informationComparing Two ROC Curves Independent Groups Design
Chapter 548 Comparing Two ROC Curves Independent Groups Design Introduction This procedure is used to compare two ROC curves generated from data from two independent groups. In addition to producing a
More informationA Wearable Hand Gloves Gesture Detection based on Flex Sensors for disabled People
A Wearable Hand Gloves Gesture Detection based on Flex Sensors for disabled People Kunal Purohit 1, Prof. Kailash Patidar 2, Mr. Rishi Singh Kushwah 3 1 M.Tech Scholar, 2 Head, Computer Science & Engineering,
More information1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1
1. INTRODUCTION Sign language interpretation is one of the HCI applications where hand gesture plays important role for communication. This chapter discusses sign language interpretation system with present
More informationCOMPUTER PLAY IN EDUCATIONAL THERAPY FOR CHILDREN WITH STUTTERING PROBLEM: HARDWARE SETUP AND INTERVENTION
034 - Proceeding of the Global Summit on Education (GSE2013) COMPUTER PLAY IN EDUCATIONAL THERAPY FOR CHILDREN WITH STUTTERING PROBLEM: HARDWARE SETUP AND INTERVENTION ABSTRACT Nur Azah Hamzaid, Ammar
More informationA Study on the Degree of Pronunciation Improvement by a Denture Attachment Using an Weighted-α Formant
A Study on the Degree of Pronunciation Improvement by a Denture Attachment Using an Weighted-α Formant Seong-Geon Bae 1 1 School of Software Application, Kangnam University, Gyunggido, Korea. 1 Orcid Id:
More informationThe Research of Early Child Scientific Activity according to the Strength Intelligence
, pp.162-166 http://dx.doi.org/10.14257/astl.2016. The Research of Early Child Scientific Activity according to the Strength Intelligence Eun-gyung Yun 1, Sang-hee Park 2 1 Dept. of Early Child Education,
More informationA Study on Edge Detection Techniques in Retinex Based Adaptive Filter
A Stud on Edge Detection Techniques in Retine Based Adaptive Filter P. Swarnalatha and Dr. B. K. Tripath Abstract Processing the images to obtain the resultant images with challenging clarit and appealing
More information