Broadband Wireless Access and Applications Center (BWAC)
CUA Site Planning Workshop
Lin-Ching Chang
Department of Electrical Engineering and Computer Science, School of Engineering
Work Experience
- 09/12-present: Associate Professor, EECS, CUA
- 09/07-08/12: Assistant Professor, EECS, CUA
- 09/03-08/07: IRTA Postdoctoral Fellow, NIH
- 03/03-08/03: Senior Software Programmer and Medical Image Analyst, NIH
- 03/99-02/03: Senior Software Engineer, 3Com Corporation
Research Experience Overview
- Pattern recognition
- Image processing
- Big-data analysis
- Medical informatics
- Parallel processing
- Telecommunication
Medical Image Processing and Analysis
- Diffusion Tensor MRI
- [Figure: processing flowchart with stages labeled Spectral Image Stack, Generate raw images, Source Images, ICA Unmix, ICA Results, Compute XCNR & Decision maps, Decision map, Estimate Noise, Noise standard deviations, ROI Masks, Denoised Images]
Microscopic Image Processing & Analysis
- Two-Photon Microscopy Imaging
- GPU Hardware Acceleration
Solar Image Processing & Analysis
- Coronal Mass Ejections
Adapted HMM for Robust Speech Recognition
The Benefits of Effective Speech Recognition
- Benefits vary by industry
- Work processes become more efficient
- Saves a great deal of labor
- Saves a great deal of time
- Hands-free computing: voice dictation from digital dictation devices
- Speech recognition is fun: nothing is more fascinating than the quick transformation of spoken words into readable text
- However, speech recognition can also increase frustration for users and customers
LVCSR: Large Vocabulary Continuous Speech Recognition
- ~20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
Word error rates
Ballpark numbers; exact numbers depend very much on the specific corpus.
Task                       Vocabulary   Error Rate (%)
Digits                     11           0.5
WSJ read speech (clean)    ~5,000+      3
WSJ read speech (clean)    ~20,000+     3
Broadcast news             ~64,000+     10
Conversational Telephone   ~64,000+     20
*WSJ: Wall Street Journal
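The error rates in this table are word error rates. As a minimal sketch of how such a figure is usually computed (word-level edit distance divided by the number of reference words), the following may help; the function name word_error_rate is hypothetical, and the example sentences are illustrative rather than taken from any of the corpora above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent.

    WER = (substitutions + deletions + insertions) / number of reference words,
    computed here with a standard Levenshtein distance over words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("i like apple juice", "i like tomato juice"))
```

Because insertions count as errors, a word error rate computed this way can exceed 100% when the hypothesis is much longer than the reference.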
HSR versus ASR
Error Rate (%)
Task                Vocab   ASR   Human SR
Continuous digits   11      0.5   0.009
WSJ clean           5K      3     0.9
WSJ w/noise         5K      9     1.1
SWBD                65K     20    4
*SWBD: Switchboard database of human-to-human telephone conversations
Conclusions:
- Machines are about 5 times worse than humans
- The gap increases with noisy speech
- These numbers are rough; take them with a grain of salt
ASR Today
http://voice-recognition-software-review.toptenreviews.com/
Accuracy ranged from 60% to 95%
Challenges in the Design of an SR System
SR systems have to deal with a large number of challenges:
- The speaker's voice is often accompanied by surrounding noise, which makes accurate recognition difficult.
- A speaker may say many different words, and all of them have to be recognized accurately.
- Accents vary from person to person, which is a major challenge.
- A speaker may talk very quickly, and each spoken word still has to be recognized individually and accurately.
Types of SR Systems
- Speaker-dependent SR systems: work by learning the unique characteristics of a single person's voice and depend on the speaker for training.
- Speaker-independent SR systems: designed to recognize anyone's voice, so no training by the speaker is involved.
Siri and Google Now
- Siri is an intelligent personal assistant developed by Apple.
- Google Now is an intelligent personal assistant developed by Google.
- Both use a combination of speaker-dependent and speaker-independent speech recognition systems.
Applications
- Health care
  - Medical documentation
  - Therapeutic use
- In-car systems
- Military
  - High-performance aircraft
  - Air traffic control systems
- Telephony
  - Smartphones
  - Customer helpline services
- Education
- People with disabilities
- Daily life
Speech Recognition for Healthcare
- Speech recognition drives efficiencies and cost savings in clinical documentation by automatically turning clinician dictations into formatted documents.
- Front-end speech recognition allows clinicians to dictate, self-edit, and sign transcription-free, completed reports in one sitting, directly into a PACS system or EHR.
- Background speech recognition converts clinician dictation into speech-recognized first drafts that medical language specialists (MLS) edit later.
Speech Recognition for Healthcare
Benefits
- Reduce document turnaround times
- Save significantly on transcription costs
- Enhance patient care through increased clinical record accuracy, inclusiveness, and access
- Dictate directly into the EHR with front-end speech recognition
- Accelerate navigation within the EHR, saving physicians time
- Increase clinician satisfaction and EHR adoption
- Employ multiple dictation options, including phone, dictation devices, and workstations
Several studies show that speech recognition leads to imaging report errors.
Basma S, Lord B, Jacks LM, Rizk M, Scaranelo AM. Error rates in breast imaging reports: comparison of automatic speech recognition and dictation transcription. AJR Am J Roentgenol. 2011 Oct;197(4):923-7.
Common Error Types
- Word omission
- Word substitution
- Nonsense phrase
- Wrong word
- Punctuation error
- Incorrect measurement (mm/cm)
- Missing or added "no"
- Added word
- Verb tense
- Plural
- Spelling mistake
- Incomplete phrase
Conclusion of their study
- Complex breast imaging reports generated with ASR were associated with higher error rates (3 to 8 times higher) than reports generated with conventional dictation transcription.
Basma S, Lord B, Jacks LM, Rizk M, Scaranelo AM. Error rates in breast imaging reports: comparison of automatic speech recognition and dictation transcription. AJR Am J Roentgenol. 2011 Oct;197(4):923-7.
Hidden Markov Model (HMM)
- Markov models are an excellent way of abstracting simple concepts into a form that is relatively easy to compute.
- They are used in applications ranging from data compression to sound recognition.
From this graph we can create sequences such as:
N1 N2 N3
N1 N2 N2 N2 N3 N3 N3 N3 N3
N1 N1 N2 N2 N3
Hidden Markov Model (HMM)
N1 N2 N3 = 0.4 x 0.8 x 0.5 = 0.16
N1 N2 N2 N2 N3 N3 N3 N3 N3 = 0.4 x 0.2 x 0.2 x 0.8 x 0.5 x 0.5 x 0.5 x 0.5 = 0.0008
N1 N1 N2 N2 N3 = 0.6 x 0.4 x 0.2 x 0.8 x 0.5 = 0.0192
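The products on this slide multiply one transition probability per step along a state sequence. A minimal sketch of that calculation follows. The transition values are inferred from the factors shown on the slide, since the graph itself is not reproduced in the text, and the "END" exit state and the function name path_probability are assumptions made for illustration (the final 0.5 factor is read here as the probability of leaving N3).

```python
# Transition probabilities inferred from the factors on the slide.
# Assumption: the last 0.5 in each product is the exit from N3, modeled
# here as a transition to a hypothetical "END" state.
TRANSITIONS = {
    ("N1", "N1"): 0.6,
    ("N1", "N2"): 0.4,
    ("N2", "N2"): 0.2,
    ("N2", "N3"): 0.8,
    ("N3", "N3"): 0.5,
    ("N3", "END"): 0.5,
}


def path_probability(states):
    """Multiply the transition probabilities along a state sequence.

    The sequence is assumed to start in its first state with probability 1
    and to finish by exiting to "END". Transitions that are not listed in
    TRANSITIONS are treated as impossible (probability 0).
    """
    path = list(states) + ["END"]
    probability = 1.0
    for current, following in zip(path, path[1:]):
        probability *= TRANSITIONS.get((current, following), 0.0)
    return probability


if __name__ == "__main__":
    print(path_probability(["N1", "N2", "N3"]))              # 0.4 x 0.8 x 0.5
    print(path_probability(["N1", "N1", "N2", "N2", "N3"]))  # 0.6 x 0.4 x 0.2 x 0.8 x 0.5
```

In a full HMM each state would also contribute an emission probability for the observed frame; only the transition part is shown here, matching the slide.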
Hidden Markov Model (HMM)
There are approximately 44 phonemes in English.
Phoneme example: tomato
The model accommodates pronunciations such as:
- t ow m aa t ow: British English
- t ah m ey t ow: American English
- t ah m ey t a: possible pronunciation when speaking quickly
Hidden Markov Model (HMM)
Language model example, with sentences such as:
- I like apple juice: very probable
- I like tomato juice: very improbable!
- I hate apple juice: relatively improbable
- I hate tomato juice: relatively probable
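One common way to assign such sentence probabilities is a bigram language model, which scores each word given the previous one. The sketch below is illustrative only: the BIGRAM table, its probability values, the floor parameter, and the function name sentence_log_prob are all hypothetical, chosen so that the four sentences rank in the order described above.

```python
import math

# Hypothetical bigram probabilities P(next word | previous word).
# The values are made up for illustration; a real model would estimate
# them from a large text corpus.
BIGRAM = {
    ("<s>", "i"): 1.0,
    ("i", "like"): 0.5,
    ("i", "hate"): 0.5,
    ("like", "apple"): 0.9,
    ("like", "tomato"): 0.1,
    ("hate", "apple"): 0.3,
    ("hate", "tomato"): 0.7,
    ("apple", "juice"): 1.0,
    ("tomato", "juice"): 1.0,
    ("juice", "</s>"): 1.0,
}


def sentence_log_prob(sentence, floor=1e-6):
    """Sum the log bigram probabilities over a sentence.

    Unseen word pairs receive a small floor probability instead of zero,
    a crude stand-in for proper smoothing.
    """
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log(BIGRAM.get(pair, floor)) for pair in zip(words, words[1:]))


if __name__ == "__main__":
    for text in ("I like apple juice", "I hate tomato juice",
                 "I hate apple juice", "I like tomato juice"):
        print(f"{text}: {math.exp(sentence_log_prob(text)):.2f}")
```

During decoding, a recognizer combines a score like this with the acoustic score from the HMM, so that acoustically similar word sequences are separated by how plausible they are as language.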
Robust Speech Recognition
- The study of building speech recognition systems that handle mismatch conditions.
- What is a mismatch condition? The difference between the training environment and the operating (testing) environment.
- Mismatch exists in practice. A simple example: a sudden door slam while dictating a letter.
- In a wireless environment, the speaker's background can change.
Mismatch Conditions
Why are mismatch conditions hard to deal with? They have many causes:
- Additive noise (e.g. background noise such as air conditioning)
- Channel noise (e.g. differences between the microphones used in training and testing conditions)
- Others: the Lombard effect, reflections off buildings
In general, noise can have:
- Random amplitude
- Random duration
- Random occurrence
- Random spectral characteristics
Previous Work
Parallel Model Combination (PMC) (Gales 1995)
- First collect some samples of noise in the operating environment
- Update the acoustic model using the noise statistics
- Works satisfactorily for stationary noise
- General time-varying noise cannot be handled
Dealing with Short Time Noise (Chan 2002)
- HMM-based
- Skip poor frames
Modified Viterbi Algorithm dealing with Impulsive Noise (Siu 2005)
- Joint decoding and detection during the Viterbi search
- Lost frames are replaced by interpolated neighboring frames
Proposed Work
HMM-based approach
- Find the state sequence with the best robust likelihood
- Conventional approach: for every state sequence, consider all possible patterns of corruption of K frames among T frames
- Our approach: incorporate prior information to find the possible K
- Replace the dynamic programming approach with a branch-and-bound approach
Developing outlier detection algorithms
- Leverage my research experience in outlier detection in medical images
- Define the characteristics of outliers in a wireless environment
- Use classification or ICA to separate speech from noise/outliers
- Skipping frames or replacing frames? Different strategies should be used to deal with different types of noise/outliers (mismatch conditions)
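To make the idea of a robust likelihood concrete, here is a minimal dynamic-programming sketch of a Viterbi search that may drop the emission term of up to K frames. It is one plausible reading of the conventional approach, not the proposed branch-and-bound method and not the exact algorithm of any cited paper. The function name robust_viterbi, its parameters, and the choice to score a skipped frame as 0.0 in the log domain are all assumptions.

```python
NEG_INF = float("-inf")


def robust_viterbi(log_init, log_trans, log_emis, max_skips):
    """Best log score over state sequences with at most max_skips frames skipped.

    log_init[s]      log probability of starting in state s
    log_trans[p][s]  log probability of moving from state p to state s
    log_emis[t][s]   log likelihood of frame t under state s
    A skipped frame still advances the state sequence, but its emission
    term is dropped (assumption: it contributes 0.0 to the log score).
    Only the score is returned; back-pointers would be needed to recover
    the state sequence itself.
    """
    n_states = len(log_init)
    # score[k][s]: best score ending in state s with exactly k frames skipped.
    score = [[NEG_INF] * n_states for _ in range(max_skips + 1)]
    for s in range(n_states):
        score[0][s] = log_init[s] + log_emis[0][s]  # keep the first frame
        if max_skips >= 1:
            score[1][s] = log_init[s]               # skip the first frame

    for t in range(1, len(log_emis)):
        new_score = [[NEG_INF] * n_states for _ in range(max_skips + 1)]
        for k in range(max_skips + 1):
            for s in range(n_states):
                # Keep frame t: the skip count is unchanged.
                keep = max(score[k][p] + log_trans[p][s]
                           for p in range(n_states)) + log_emis[t][s]
                # Skip frame t: one unit of the skip budget is consumed.
                skip = NEG_INF
                if k >= 1:
                    skip = max(score[k - 1][p] + log_trans[p][s]
                               for p in range(n_states))
                new_score[k][s] = max(keep, skip)
        score = new_score

    return max(max(row) for row in score)


if __name__ == "__main__":
    # Single-state toy model in which the middle frame is badly corrupted.
    frames = [[-1.0], [-10.0], [-1.0]]
    print(robust_viterbi([0.0], [[0.0]], frames, max_skips=0))
    print(robust_viterbi([0.0], [[0.0]], frames, max_skips=1))
```

The table grows by a factor of K + 1 compared with standard Viterbi, which is one reason prior information that narrows the plausible K could be attractive.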
CONCLUSION
- Speech recognition systems are an indispensable part of the ever-advancing field of human-computer interaction.
- More research is needed to tackle the various challenges.
Thank You! Questions?