Google Voice Search: Faster and More Accurate
http://googleresearch.blogspot.tw/2015/09/google-voice-search-faster-and-more.html
Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk, Google Speech Team
Published by the Google Speech Team at INTERSPEECH 2015
Presenter: Ming-Han Yang
Today (1/2)
Today, we're happy to announce we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) [1] and sequence discriminative training techniques [2]. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast! We recently showed that RNNs for LVCSR trained with CTC can be improved with the sMBR sequence training criterion and approach the state of the art [3].
As early as 2012, Google Voice Search already used DNNs as its core acoustic modeling technology, replacing the GMMs that had been standard for 30 years. DNNs assess much better which sound the user is producing at each point in time, which greatly increased speech recognition accuracy.
Now they use better neural network acoustic models, trained with CTC and sequence discriminative training. These models are a special kind of RNN that is more accurate (especially in noisy environments) and very fast.
// Note: CTC can be used when the alignment between input and output is unknown; it is a technique for labeling sequences with RNNs.
In practice, CTC adds an extra "blank" label to the softmax output, which estimates the probability that no label is emitted at the current time step.
CTC differs from the conventional approach in two ways: (1) when it is unclear which label the current frame belongs to, the blank label relieves the model from being forced to make a prediction; (2) training optimizes the log probability of the label (state) sequence rather than the frame-level log likelihood of the input.
[1] Alex Graves, Santiago Fernandez, Faustino Gomez, Jürgen Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML, 2006.
[2] Brian Kingsbury (IBM), "Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," ICASSP, 2009.
[3] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," INTERSPEECH, 2015.
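To make the blank-label idea concrete, here is a minimal sketch of one CTC training step, assuming PyTorch's nn.CTCLoss. The phoneme count, sequence lengths, and random tensors are hypothetical placeholders, not the configuration used in the paper or in production.

    # Minimal CTC training sketch (PyTorch). Class 0 is the extra "blank" label.
    import torch
    import torch.nn as nn

    num_phonemes = 41                  # hypothetical phoneme inventory
    num_classes = num_phonemes + 1     # +1 for the CTC blank label

    ctc_loss = nn.CTCLoss(blank=0)     # blank models "no label at this frame"

    T, N, S = 100, 8, 20               # frames, batch size, target sequence length
    log_probs = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(dim=2)
    targets = torch.randint(1, num_classes, (N, S))        # phoneme label sequences
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    # CTC sums over all alignments, so the model never has to commit to a
    # label at a particular frame; it only has to get the sequence right.
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()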
Traditional Speech Recognizer
"museum" => /m j u z i @ m/
In a traditional recognizer, the user's speech waveform is cut into small consecutive slices or "frames" of about 10 ms. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model (such as a DNN) that outputs a probability distribution over all the phonemes. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions.
Additional knowledge is then brought in: a Pronunciation Model that links sequences of sounds to valid words in the target language, and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking.
For example, if the user says the word "museum", it can be written out at the phone level as above, but it is hard to tell exactly when /j/ ends and /u/ begins. The recognizer, however, does not really care when one phone turns into the next; it only cares which word was spoken overall.
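Below is a rough sketch of the front end described above: slicing the waveform into 10 ms frames and computing a frequency-domain feature vector per frame. The non-overlapping 10 ms slicing and log-spectrum features are simplifications for illustration; real front ends typically use overlapping windows and more elaborate features.

    # Illustrative frame/feature extraction (NumPy); parameters are assumptions.
    import numpy as np

    def frame_features(waveform, sample_rate=16000, frame_ms=10):
        frame_len = int(sample_rate * frame_ms / 1000)
        window = np.hanning(frame_len)
        features = []
        for start in range(0, len(waveform) - frame_len + 1, frame_len):
            frame = waveform[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(frame))       # frequency content of the frame
            features.append(np.log(spectrum + 1e-8))    # log-spectral feature vector
        return np.stack(features)                       # shape: (num_frames, feature_dim)

    # Each row would be fed to the acoustic model, which outputs a
    # probability distribution over phonemes for that frame.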
Our Improved Acoustic Models
"museum" => /m j u z i @ m/
Our improved acoustic models rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies (for example, the /u/ is preceded by a /j/, which is preceded by an /m/). Try saying it out loud - "museum" - it flows very naturally in one breath, and RNNs can capture that.
The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
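As an illustration only, here is what a small LSTM acoustic model could look like in PyTorch; the layer sizes and label count are hypothetical and do not reflect the production configuration.

    # Sketch of an LSTM acoustic model producing per-frame label log-probabilities.
    import torch
    import torch.nn as nn

    class LSTMAcousticModel(nn.Module):
        def __init__(self, feature_dim=40, hidden_dim=512, num_layers=2, num_labels=42):
            super().__init__()
            # Unidirectional LSTM: the feedback loop carries context forward in time.
            self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers, batch_first=True)
            self.output = nn.Linear(hidden_dim, num_labels)   # phonemes + CTC blank

        def forward(self, frames):                  # frames: (batch, time, feature_dim)
            hidden, _ = self.lstm(frames)
            return self.output(hidden).log_softmax(dim=-1)

    model = LSTMAcousticModel()
    frames = torch.randn(8, 100, 40)    # hypothetical batch of 100-frame utterances
    log_probs = model(frames)           # (8, 100, 42) per-frame label log-probabilities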
Train The Models (1/2)
The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of "spikes" that reveals the sequence of sounds in the waveform. They can do this in any way as long as the sequence is correct.
The tricky part, though, was how to make this happen in real time.
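The "spikes" can be turned into a phoneme sequence with a simple greedy decode: the per-frame argmax is mostly the blank label, with short spikes for each phoneme, and collapsing repeats then dropping blanks recovers the sequence. This is a minimal sketch of that idea, not the beam-search decoder a full recognizer would use.

    # Greedy CTC decoding sketch: collapse repeats, then remove blanks.
    def ctc_greedy_decode(frame_label_ids, blank_id=0):
        decoded, previous = [], blank_id
        for label in frame_label_ids:
            if label != blank_id and label != previous:
                decoded.append(label)
            previous = label
        return decoded

    # e.g. [0, 0, 3, 0, 0, 0, 7, 7, 0, 2] -> [3, 7, 2]
    print(ctc_greedy_decode([0, 0, 3, 0, 0, 0, 7, 7, 0, 2]))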
Train The Models (2/2)
After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster.
We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise.
You can watch a model learning a sentence here.
// YouTube example: "How cold is it outside"
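To illustrate the noise-augmentation idea, the sketch below mixes a noise recording into a training utterance at a chosen signal-to-noise ratio. The SNR range and the stand-in arrays (clean_utterance, cafe_noise) are hypothetical; the actual augmentation recipe (and the reverberation part) is not described in detail in the post.

    # Mix noise into a training utterance at a given SNR (NumPy sketch).
    import numpy as np

    def add_noise(speech, noise, snr_db):
        noise = np.resize(noise, len(speech))              # loop/trim noise to match length
        speech_power = np.mean(speech ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise

    # clean_utterance and cafe_noise stand in for real recordings here.
    clean_utterance = np.random.randn(16000)
    cafe_noise = np.random.randn(16000)
    noisy = add_noise(clean_utterance, cafe_noise, snr_db=np.random.uniform(5, 20))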
Result
We are happy to announce that our new acoustic models are now used for voice searches and commands in the Google app (on Android and iOS), and for dictation on Android devices - so give it a try, and happy (voice) searching!
Figure (from the example above): X axis = time from the start to the end of the speech signal (phones); Y axis = posterior probability predicted by the neural network; dashed line = the model's prediction of the "blank" (no phone) label.
THANK YOU!