Google Voice Search: Faster and More Accurate


Google Voice Search: Faster and More Accurate http://googleresearch.blogspot.tw/2015/09/google-voice-search-faster-and-more.html Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk, Google Speech Team. Published at INTERSPEECH 2015. Presented by Ming-Han Yang.

Today (1/2) Today, we’re happy to announce that we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) [1] and sequence discriminative training techniques [2]. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast! We recently showed that RNNs for LVCSR trained with CTC can be improved with the sMBR sequence training criterion and approach the state of the art [3]. Notes: As early as 2012, Google Voice Search was already using DNNs as its core acoustic-modeling technology, replacing the GMMs that had dominated for 30 years. DNNs estimate the user's pronunciation at each time step better and greatly increase recognition accuracy. Now they use even better neural-network acoustic models, trained with CTC and sequence discriminative training. These models are a special kind of RNN with higher accuracy (especially in noisy environments), and they are very fast. Background on CTC: when the alignment between input and output is unknown, CTC can be used (it is a method for labeling sequences with RNNs). In practice, CTC adds an extra "blank" label to the softmax, used to estimate the probability that no label is emitted at the current time step. CTC differs from the conventional approach in two ways: (1) when it is unclear which label the current frame carries, the blank label relieves the model of having to commit to a prediction; (2) training optimizes the log-probability of the state sequence rather than the log-likelihood of the input. [1] Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML, 2006. [2] Brian Kingsbury (IBM), "Lattice-based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," ICASSP, 2009. [3] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," INTERSPEECH, 2015.
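The CTC collapsing rule described in the notes (merge repeated frame labels, then drop the blank) can be sketched in a few lines of plain Python. This is a minimal illustration, not Google's implementation; the phone labels for "museum" are taken from the slides and the blank is written as "-".

```python
def ctc_collapse(frame_labels, blank="-"):
    """Collapse a per-frame CTC label path into an output sequence:
    merge consecutive repeats, then drop blank labels."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # new non-blank label: emit it
            out.append(lab)
        prev = lab
    return out

# "museum" as a frame-level path; blanks let the model defer its decision
path = ["-", "m", "m", "-", "j", "u", "u", "-", "z", "i", "-", "@", "m"]
print(ctc_collapse(path))  # ['m', 'j', 'u', 'z', 'i', '@', 'm']
```

Note how many different frame-level paths collapse to the same output; CTC training sums over all of them, which is why the model need not commit to an alignment.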

Traditional Speech Recognizer “museum” => /m j u z i @ m/ waveform → small consecutive slices or “frames” (10 ms). Each frame (frequency content) → the resulting feature vector → acoustic model → outputs a probability distribution over all the phonemes. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. Pronunciation model: links sequences of sounds to valid words in the target language. Language model: expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. Notes: In a traditional speech recognizer, the user's speech is cut into small consecutive slices, or "frames," every 10 milliseconds. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model (such as a DNN), which outputs a probability distribution over all phonemes (sounds). An HMM helps impose some temporal structure on this sequence of probability distributions. Additional knowledge is then brought in: a pronunciation model (linking sound sequences to valid words in the target language) and a language model (how likely a given word sequence is to be something a speaker of that language would actually say). The recognizer then reconciles all this information to determine the sentence the user spoke. For example, if the user says "museum," it can be written as a phone-level representation, but in that representation it is hard to tell when /j/ ends and /u/ begins; the recognizer, however, does not really care when one phone transitions into another, only what word was spoken as a whole.
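The "reconciling" step at the end of the slide is, at its core, a sum of log-probabilities from the different models. The sketch below shows that idea on two invented candidate transcriptions with made-up scores; the acoustically similar but ungrammatical hypothesis loses once the language model weighs in.

```python
# Hypothetical toy scores: per-candidate acoustic and language-model
# log-probabilities (all numbers invented for illustration).
acoustic_logp = {"the museum": -12.0, "them you see um": -11.5}
lm_logp = {"the museum": -3.0, "them you see um": -9.0}

def best_hypothesis(acoustic, lm, lm_weight=1.0):
    """Reconcile the models: the total score is the sum of log-probs
    (i.e., a weighted product of probabilities), and the recognizer
    returns the highest-scoring sentence."""
    return max(acoustic, key=lambda h: acoustic[h] + lm_weight * lm[h])

print(best_hypothesis(acoustic_logp, lm_logp))  # the museum
```

Real recognizers search over a lattice of hypotheses rather than a fixed list, but the scoring principle is the same.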

Our Improved Acoustic Models “museum” => /m j u z i @ m/ They rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies (e.g. /u/, /j/, /m/). Try saying it out loud: "museum" flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly. Notes: Our improved acoustic models are based on RNNs. RNNs have feedback loops in their structure, which lets them model temporal dependencies (for example: the /u/ is preceded by /j/, which is in turn preceded by /m/). As another example, if you say "museum" out loud, the whole word flows very naturally in one breath, and an RNN can capture that. The type we use is the LSTM, which, through memory cells and a sophisticated gating mechanism, remembers information better than a plain RNN; using LSTMs significantly improved our recognizer's accuracy.

Train The Models (1/2) The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform. They can do this in any way, as long as the sequence is correct. The tricky part, though, was how to make this happen in real time. Notes: The next step was to train the model to recognize the phonemes in an utterance as a whole (rather than predicting which phone occurs at every time instant). With CTC, the model is trained to output a sequence of "spikes" that represents the sound sequence of the waveform. The trickiest part was making this happen in real time.
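At decode time, the spike picture corresponds to simple greedy CTC decoding: take the most probable label at each frame, then collapse repeats and blanks. The sketch below runs this on a toy posterior matrix (all probabilities invented for illustration); the frames where a non-blank label wins are the "spikes".

```python
def greedy_ctc_decode(posteriors, labels, blank="-"):
    """Greedy CTC decode: per-frame argmax, then collapse repeats and blanks."""
    path = [labels[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
    out, prev = [], None
    for lab in path:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Toy per-frame posteriors over (blank, /m/, /j/, /u/).
labels = ["-", "m", "j", "u"]
posteriors = [
    [0.70, 0.20, 0.05, 0.05],  # blank dominates: no commitment yet
    [0.10, 0.80, 0.05, 0.05],  # spike for /m/
    [0.60, 0.20, 0.10, 0.10],  # blank again
    [0.10, 0.10, 0.70, 0.10],  # spike for /j/
    [0.20, 0.10, 0.10, 0.60],  # spike for /u/
]
print(greedy_ctc_decode(posteriors, labels))  # ['m', 'j', 'u']
```

Because only the order of the spikes matters, the model is free to place them wherever the evidence is clearest, which is what "they can do this in any way as long as the sequence is correct" means.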

Train The Models (2/2) After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise. You can watch a model learning a sentence here. Notes: After many iterations, we managed to train a streaming, unidirectional model that consumes the incoming audio in larger chunks than conventional models but performs computations less often. With this, we built a recognizer that computes far less and runs much faster. We also added artificial noise to the training data, making the recognizer more robust to ambient noise. (The YouTube example sentence is "How cold is it outside.")
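One common way to "consume audio in larger chunks but compute less often" is frame stacking with downsampling: concatenate several consecutive feature frames into one network input and advance by the same amount, so the network runs a fraction as often over the same audio. The sketch below is a generic illustration of that idea, not Google's exact pipeline.

```python
def stack_frames(frames, stack=3, stride=3):
    """Concatenate `stack` consecutive feature frames into one input
    and advance by `stride` frames, so with stack == stride == 3 the
    network runs a third as often on the same audio."""
    chunks = []
    for start in range(0, len(frames) - stack + 1, stride):
        merged = []
        for f in frames[start:start + stack]:
            merged.extend(f)  # concatenate the feature vectors
        chunks.append(merged)
    return chunks

frames = [[i] for i in range(9)]  # nine 1-dimensional feature frames
print(stack_frames(frames))       # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

A unidirectional (rather than bidirectional) LSTM over these stacked frames never needs future audio, which is what makes the model streamable in real time.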

Result We are happy to announce that our new acoustic models are now used for voice searches and commands in the Google app (on Android and iOS), and for dictation on Android devices - so give it a try, and happy (voice) searching! Notes (on the spike plot): the X axis runs over the input time of the speech signal, from start to end (phones); the Y axis is the posterior probability predicted by the neural network; the dashed line is the model's prediction for the non-phone (blank) label.

THANK YOU!