1
Joint Training Of Convolutional And Non-Convolutional Neural Networks
Hagen Soltau, George Saon, and Tara N. Sainath, "Joint Training of Convolutional and Non-Convolutional Neural Networks," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IBM Watson Research. Presented by Ming-Han Yang.
2
Outline
ABSTRACT
INTRODUCTION
RELATED WORK
MODEL - Joint training of MLPs and CNNs - Multi-GPU usage - Multi-task learning
EXPERIMENTS ON SWB - GMM system - MLP, CNN, and jointly trained MLP/CNN - Joint training with I-Vectors - Comparison with System Combination
EXPERIMENTS ON RATS - Acoustic Models for Keyword Search - Speech Activity Detection
CONCLUSION
3
ABSTRACT A simple modification of neural networks
that extends the commonly used linear layer sequence to an arbitrary graph structure, letting us combine the strengths of conventional NNs and CNNs. The joint model has only a small increase in parameter size; training and decoding time are virtually unaffected. It gives significant improvements over very strong baselines (GMM and CNN systems) on two LVCSR tasks and one speech activity detection task.
4
INTRODUCTION While until recently most LVCSR systems were based on GMMs,
in the last two years many research groups have seen substantial improvements when switching to neural network acoustic models. The renewed interest in neural networks was sparked by the work of Microsoft, where a context-dependent neural network outperformed a good GMM baseline on the SWB (Switchboard) task. Other types of neural nets have become popular as well, such as convolutional neural networks (CNNs). The idea of a CNN is to obtain shift invariance, making the model more robust against small changes in the input space.
Notes: Until recently most LVCSR systems were based on GMMs, but over the past two years many groups have seen large gains from neural network acoustic models. The revival was sparked by Microsoft's work (in 2011, Dong Yu's team used a context-dependent deep neural network for Switchboard speech-to-text, cutting the word error rate to 18.5% and beating the GMM's 27.4%). CNNs have since become popular as well; the idea is to find features that stay invariant under small changes of the input, making the model more robust. (A classic illustration is the T-C problem from Hinton's 1986 connectionist work: a network with local receptive fields, looking at one part of the input at a time, learns to recognize T- and C-shaped patterns regardless of their position.)
5
INTRODUCTION For speech recognition problems, invariance against small changes in the temporal domain is important. The TDNN applied weight sharing and shift invariance in the temporal domain to a phoneme classification task and performed better than an HMM baseline. Weight sharing and shift invariance in the feature domain was first explored in [2012 H. Jiang, G. Penn], which used log-mel features as input on a small-scale speech task. The work in [2013 T. Sainath] demonstrated that these ideas also work for larger tasks (Broadcast News and Switchboard).
Notes: For speech recognition, invariance to small changes in the temporal domain matters. The late-1980s time delay neural network (TDNN) of Waibel et al. (with Hinton as co-author) applied this idea to phone classification and beat an HMM baseline; a TDNN takes frames as input, feeding a window of consecutive frames to each hidden-layer unit. The 2012 paper by Jiang and Penn applied CNNs to speech on TIMIT, scanning the signal along the frequency axis while leaving temporal variation to the HMM. IBM's 2013 paper showed that deep convolutional neural networks also work on LVCSR tasks.
6
INTRODUCTION If CNNs are configured to obtain shift invariance in the feature domain, this places certain constraints on the type of features that can be used. Applying a maximum or average operator to the outputs of localized windows is meaningful only if the features are topographical, for example log-mel features. Meanwhile, considerable progress has been made with more elaborate feature processing for GMM systems; those features can be used directly in regular MLPs and are known to improve results. We want to combine the benefit of a CNN (shift invariance) with the benefit of a conventional MLP that can use more advanced features.
Notes: If a CNN is to obtain shift invariance in the feature domain, the features it is fed are constrained: taking the max or mean over a local window only makes sense when the features are topographical, as log-mel features are. On the other hand, research on feeding more elaborate features to GMM systems has made substantial progress, and those features can be given directly to a regular MLP and improve results. This motivates combining the CNN's shift invariance with the conventional MLP's ability to use more advanced features. Paper organization: our model, a simple extension of a regular neural network, is described in Section 3; Section 4 reports LVCSR experiments on Switchboard; Section 5 covers RATS (Robust Automatic Transcription of Speech), where we report on two subtasks, namely keyword search and speech activity detection.
7
RELATED WORK [2013 T. Sainath]: a combination of MLPs and CNNs.
They reported small improvements (18.x% to 18.6% WER) while almost doubling the parameter size. Differences from our work: our model is configured so that most layers are shared between the MLP and the CNN, and we do not restrict the model to the same input features for both MLP and CNN. Indeed, in some preliminary experiments we found that log-mel features are substantially worse than FMLLR features for MLPs, so placing the same restriction on the MLP features that is needed for the CNN features would lead to suboptimal results.
Notes: Tara Sainath (the third author of this paper) also combined MLPs and CNNs in 2013; the WER improvement was small while the parameter count nearly doubled. In the model here, by contrast, the MLP and CNN share most of their layers, and the two branches are not required to use the same input features. In preliminary experiments, feeding log-mel features to the MLP was substantially worse than feeding FMLLR features, and likewise restricting the CNN branch to the MLP's features is not optimal.
8
RELATED WORK Combining the benefits of different models or different features can also be seen as a form of system combination. While our work focuses on neural nets, combining different feature streams was already explored for GMM systems. For example, [2005 N. Morgan] combined MLP-based posterior features with traditional MFCC features for a GMM-based LVCSR system. More recently, [2013 H. Ney] showed improvements from combining different NN-derived features with a GMM system. In [2013 G. Saon]*, we experimented with combining different features (FMLLR, fMMI, log-mel) for MLP-based acoustic models.
Notes: Combining the strengths of different models or features is a form of system combination; feature combination for GMM systems is already well explored, so we focus on neural networks. Morgan's 2005 work combined MLP posterior features with the MFCCs used by the GMM in an LVCSR system. Ney's 2013 work combined different NN and GMM systems to improve the RWTH speech recognition system for transLectures, a task of transcribing and translating online lecture video, reducing WER from 59.2 to 43.4. The authors' own 2013 paper combined FMLLR, fMMI, and log-mel features for acoustic models; gains were seen on 50 hours of Broadcast News but disappeared with larger training data and models. In contrast to these feature-combination approaches, we propose a model that lets these different types of neural networks be trained together.
*[2013 G. Saon]: Hagen Soltau, Hong-Kwang Kuo, Lidia Mangu, George Saon, and Tomas Beran, "Neural Network Acoustic Models for the DARPA RATS Program," in Proc. Interspeech, 2013.
9
MODEL While a regular MLP (or CNN) is normally a linear sequence of layers, our model consists of a graph structure of layers. Each layer can receive inputs from multiple layers, and each layer can send its output to multiple layers. Differences from a conventional MLP: in the forward pass, the outputs of all input layers have to be combined (joined); in the backward pass, the gradients from all output layers have to be summed before back-propagating. The graph structure is not restricted to the hidden layers; the model also allows multiple input features and outputs.
Notes: An MLP or CNN is usually a linear stack of layers, whereas our model uses a graph structure: each layer may take input from several layers and send its output to several layers. The change relative to a conventional MLP is small; see the sketch below. As an example, the architecture in Fig. 1 consists of two CNN layers (right) and one non-CNN layer (left), which share four hidden layers and one output layer. The following slides list the benefits of the graph-structured neural network.
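A minimal numpy sketch of the two graph rules stated above, under our own assumption that "joining" means feature concatenation (the paper does not fix the join operation on this slide):

```python
import numpy as np

def join_forward(inputs):
    # forward pass: combine the outputs of all parent layers into one matrix
    # inputs: list of (batch, dim_i) activations
    return np.concatenate(inputs, axis=1)

def join_backward(grad_out, dims):
    # split the joined gradient back into one slice per parent layer
    splits = np.cumsum(dims)[:-1]
    return np.split(grad_out, splits, axis=1)

def fan_out_backward(grads):
    # backward pass: a layer feeding several children sums their gradients
    return sum(grads)

# toy check: MLP branch (batch x 2048) joined with a CNN branch (batch x 1536)
mlp_out, cnn_out = np.ones((4, 2048)), np.ones((4, 1536))
joined = join_forward([mlp_out, cnn_out])                    # (4, 3584)
g_mlp, g_cnn = join_backward(np.ones_like(joined), [2048, 1536])
```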
10
Joint training of MLPs and CNNs: MLP features
For a conventional MLP, it is easy to use multiple input features: simply combine the feature matrices into one single input matrix. This is not the case for CNNs.
Notes: Before discussing how the MLP and CNN are trained jointly, consider which features suit each model. A conventional MLP can take several input features at once by concatenating their matrices, as in the sketch below; a CNN cannot.
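A tiny illustration of that concatenation, with hypothetical shapes (40-dim FMLLR and 40-dim fMMI frames, 11-frame context, one 256-frame minibatch):

```python
import numpy as np

fmllr = np.random.randn(256, 40 * 11)   # stacked FMLLR context windows
fmmi  = np.random.randn(256, 40 * 11)   # a second, non-topographic stream
mlp_input = np.hstack([fmllr, fmmi])    # one combined input matrix, (256, 880)
```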
11
Joint training of MLPs and CNNs: CNN features
CNNs achieve shift invariance by applying a pooling operation to the output values. To achieve shift invariance in the feature domain, the features have to be topographical, such as log-mel features. On the other hand, state-of-the-art GMM- and MLP-based systems use more advanced speaker-adaptive features such as FMLLR or fMMI. These features are not topographical and therefore cannot be used with a CNN.
Notes: A CNN gains shift invariance from its pooling operation, but this requires topographical features such as log-mel (see the sketch below). Mature GMM- and MLP-based systems instead use more advanced speaker-adaptive features (FMLLR, fMMI), which are not topographical and so cannot feed a CNN.
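A sketch of why pooling presupposes topographic input: a max over windows of three adjacent mel bands changes far less under a one-band frequency shift than the raw vector does. On FMLLR or fMMI, whose dimensions have no spatial ordering, the same window max would be meaningless. (Shapes and data here are illustrative only.)

```python
import numpy as np

logmel = np.random.randn(40)                       # one frame, 40 mel bands
shifted = np.roll(logmel, 1)                       # small shift in frequency

def pool3(x):                                      # non-overlapping max, size 3
    n = len(x) // 3 * 3
    return x[:n].reshape(-1, 3).max(axis=1)

print((pool3(logmel) == pool3(shifted)).mean())    # many pooled values survive
print((logmel == shifted).mean())                  # raw values: essentially none
```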
12
Joint training of MLPs and CNNs
The neural network graph structure allows us to use both kinds of features by having CNN and MLP layers in parallel, as shown in Figure 1. Since most layers are shared, the joint MLP/CNN configuration in Figure 1 has only about 10% more parameters than the corresponding CNN.
Notes: The graph structure lets the two feature types be used in parallel: in Fig. 1, a regular MLP branch is on the left and a CNN branch on the right, joined into shared layers (a rough sketch follows; more details later). Compared with the corresponding CNN alone, this adds only about 10% more parameters.
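A rough PyTorch sketch of the Figure 1 topology as described in the notes: one unshared MLP layer on the left, two CNN layers on the right, four shared hidden layers plus the output layer. Layer sizes follow the slides; the sigmoid nonlinearity and the exact wiring are our assumptions, not confirmed by the paper.

```python
import torch
import torch.nn as nn

class JointMLPCNN(nn.Module):
    def __init__(self, n_states=8260):
        super().__init__()
        # MLP branch: 40-dim FMLLR x 11-frame context
        self.mlp = nn.Sequential(nn.Linear(40 * 11, 2048), nn.Sigmoid())
        # CNN branch: log-mel + deltas as 3 input channels, 40 bands x 11 frames
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 512, (9, 9)),             # -> 512 x 32 x 3
            nn.MaxPool2d((3, 1), ceil_mode=True),  # -> 512 x 11 x 3
            nn.Sigmoid(),
            nn.Conv2d(512, 512, (4, 3)),           # -> 512 x 8 x 1
            nn.Sigmoid(),
        )
        layers = [nn.Linear(2048 + 512 * 8, 2048), nn.Sigmoid()]
        for _ in range(3):                         # four shared hidden layers
            layers += [nn.Linear(2048, 2048), nn.Sigmoid()]
        self.shared = nn.Sequential(*layers, nn.Linear(2048, n_states))

    def forward(self, fmllr, logmel):
        # join the two branches, then run the shared stack
        h = torch.cat([self.mlp(fmllr), self.cnn(logmel).flatten(1)], dim=1)
        return self.shared(h)

net = JointMLPCNN()
out = net(torch.randn(2, 440), torch.randn(2, 3, 40, 11))  # (2, 8260)
```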
13
Benefits of the graph structure: Multi-GPU usage
We can split one layer into n parallel parts. For example, a 2048x7000 output layer becomes two 2048x3500 layers (in parallel). Each matrix multiplication can then run in parallel on separate devices. Combining the outputs of each part is simply a cudaMemcpy2D call, where the pitch parameter specifies the target position within the combined matrix. The nice part is that cudaMemcpy2D can copy memory directly between devices without any extra device management.
Notes: Another benefit of the graph structure is that a layer can be split into parts: a 2048x7000 output layer, for instance, becomes two 2048x3500 layers whose matrix multiplications run on different devices (sketched below). Each partial output is gathered with one cudaMemcpy2D call, using the pitch argument to place it at the right column offset of the combined matrix; cudaMemcpy2D copies between devices directly with no additional device management.
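A numpy sketch of the column split itself, as a stand-in for the GPU version; the device placement in the comments is what the slide describes, and the actual gather in the paper's implementation is a cudaMemcpy2D call with the pitch argument rather than the array slicing shown here.

```python
import numpy as np

hidden = np.random.randn(256, 2048)            # minibatch x input dim
W1 = np.random.randn(2048, 3500)               # left half of the 2048x7000 layer
W2 = np.random.randn(2048, 3500)               # right half

out = np.empty((256, 7000))                    # combined output matrix
out[:, :3500] = hidden @ W1                    # would run on GPU 0
out[:, 3500:] = hidden @ W2                    # would run on GPU 1, gathered
                                               # via cudaMemcpy2D in practice
```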
14
Benefits of the graph structure: Multi-task learning
While we have not explored multi-task learning in this work, the graph structure allows us to specify multiple targets. One example of multiple targets is the use of different decision trees [2013 M. Seltzer]. Multiple target output layers for neural networks are a form of multi-task learning, used to improve generalization; another example is speech enhancement features [2004 S. Sehgal]. The graph structure would look like a regular neural net with two parallel output layers: one for the state posteriors (with softmax and cross-entropy) and one for clean target features (with sigmoid and MSE), as sketched below.
Notes: Although this paper does not study multi-task learning, the graph structure supports multiple targets. Seltzer's 2013 paper used different decision trees as multiple targets; multi-target output layers are a form of multi-task learning that improves generalization. Sehgal's work trained a multi-task recurrent net for robustness, extracting from noisy speech the information needed for several targets (class, enhancement, gender). Here the graph would be a regular NN with two parallel output layers, one for state posteriors (softmax, cross-entropy) and one for clean target features (sigmoid, MSE).
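A PyTorch sketch of that two-head graph; all sizes are hypothetical, and the random tensors merely stand in for real targets. The key point is that the shared trunk receives the sum of the gradients from both heads, exactly the backward-pass rule from the MODEL slide.

```python
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(440, 2048), nn.Sigmoid(),
                      nn.Linear(2048, 2048), nn.Sigmoid())
state_head = nn.Linear(2048, 7000)              # CrossEntropyLoss adds softmax
enhance_head = nn.Sequential(nn.Linear(2048, 40), nn.Sigmoid())

x = torch.randn(256, 440)
h = trunk(x)
loss = (nn.CrossEntropyLoss()(state_head(h), torch.randint(7000, (256,)))
        + nn.MSELoss()(enhance_head(h), torch.rand(256, 40)))
loss.backward()                                 # trunk gets the summed gradients
```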
15
EXPERIMENTS ON SWB: GMM system
Training set: 300h SWB1 corpus. Test set: Hub5-2000. The model uses FMLLR, MLLR, LDA, and STC features, plus feature- and model-space discriminative training, with Gaussian mixtures over 8260 states.
Notes: The following experiments are on the Switchboard task: the training set is the 300-hour SWB1 corpus and the test set is Hub5-2000. The GMM baseline was built with the Attila training recipe of the IBM Attila speech recognition toolkit. Its performance is shown in Table 1 (SA = speaker-adapted features, SI = speaker-independent features, i.e., with or without speaker information). Table 1: GMM baseline error rate on Hub5-2000.
16
EXPERIMENTS ON SWB: MLP, CNN, and jointly trained MLP/CNN
All neural nets were trained with layer-wise back-propagation, with randomly initialized weights. The initial step size is 5×10⁻³ and is halved every time the performance on a held-out set does not improve (see the sketch below). The training data are randomized at the frame level within each 30h chunk, with a mini-batch size of 256 frames.
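A sketch of that step-size schedule (a "newbob"-style rule). The helpers `sgd_epoch` and `frame_error` are hypothetical placeholders for one training pass and the held-out evaluation; only the halving logic follows the slide.

```python
def train(model, train_chunks, heldout, max_epochs=20):
    lr, best = 5e-3, float("inf")
    for epoch in range(max_epochs):
        for chunk in train_chunks:              # 30h chunks, frames shuffled
            sgd_epoch(model, chunk, lr=lr, batch_size=256)
        score = frame_error(model, heldout)
        if score >= best:                       # no improvement on held-out set
            lr *= 0.5                           # halve the step size
        best = min(best, score)
```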
17
EXPERIMENTS ON SWB: MLP, CNN, and jointly trained MLP/CNN
MLP: 6 layers; each conventional hidden layer has 2048 nodes. The MLP uses 40-dim FMLLR features with a temporal context of 11 frames. CNN: 2 convolution + pooling stages; the 2 convolutional layers use 512 nodes. The CNN uses 40-dim warped log-mel features together with their Δ and ΔΔ features, also with an 11-frame temporal context.
Notes: The MLP has 6 layers (Fig. 1, including the left branch); the CNN adds two layers on the right. Each conventional hidden layer has 2048 nodes and the two CNN layers have 512 nodes each. The CNN layer settings are given in Table 2. The first layer scans the log-mel input with a 9x9 window at shift 1; with 40 mel bands and an 11-frame context this yields 32x3 windows, and max pooling with a 3x1 window (pooling size 3, no overlap) reduces this to 11x3. The second CNN layer applies an (overlapping) 4x3 convolution to the 11x3 map, producing an 8x1 output that is fed to the shared hidden layers together with the MLP branch. The per-window input dimension is 3x9x9 = 243 for the first layer and 512x4x3 = 6144 for the second (the previous layer's output count times the window size). The arithmetic is worked through in the sketch below.
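Working through the CNN geometry from the notes in plain Python (ceiling division for the pooling because the edge windows are kept, which is what reproduces the 11 bands stated above):

```python
freq, time, chans = 40, 11, 3                  # mel bands, frames, static+Δ+ΔΔ
c1_f, c1_t = freq - 9 + 1, time - 9 + 1        # 9x9 conv, shift 1 -> 32 x 3
p1_f = -(-c1_f // 3)                           # pool size 3, no overlap -> 11
c2_f, c2_t = p1_f - 4 + 1, c1_t - 3 + 1        # 4x3 conv -> 8 x 1
print(c1_f, c1_t, p1_f, c2_f, c2_t)            # 32 3 11 8 1
print(chans * 9 * 9, 512 * 4 * 3)              # per-window input dims: 243 6144
```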
18
EXPERIMENTS ON SWB: MLP, CNN, and jointly trained MLP/CNN
Table 3 summarizes the performance of the various neural net models. The models were first trained with standard back-propagation and cross-entropy as the objective function, then retrained with Hessian-free sequence training.
Notes: Table 3 lists results both after cross-entropy training and after Hessian-free sequence training. The MLP is already better than the best GMM (12.3% vs 14.5%); the CNN improves on the MLP by about 0.5% (11.8%); joint training improves further on the CNN (11.2%).
19
EXPERIMENTS ON SWB: Joint training with I-Vectors
In the next experiment, we add i-vectors to the jointly trained model. I-vectors were recently proposed in [2013 G. Saon]* as a form of speaker adaptation for neural nets: regular FMLLR features are augmented with i-vectors to feed extra speaker information into training. In [2013 G. Saon]*, conventional MLPs were used, and a 5-6% relative improvement was seen on top of MLPs with speaker-adaptive features on an SWB task similar to the one here. Since i-vector-derived features are not topographical, CNNs were not used in [2013 G. Saon]*.
Notes: Following the 2013 work of George Saon (this paper's second author) on adding i-vectors to NN acoustic models, the FMLLR features are augmented with i-vector speaker information during training; on a similar SWB task this gave a 5-6% relative improvement over MLPs with speaker-adaptive features. Because i-vectors are not topographical, that work did not use CNNs.
*[2013 G. Saon]: George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors," in Proc. ASRU, 2013.
20
EXPERIMENTS ON SWB: Joint training with I-Vectors
On the other hand, the CNN outperformed the MLP by 0.5%, and the jointly trained model is 1.1% better than the MLP, so it is natural to look for ways to add i-vectors to CNNs. The graph structure of our jointly trained model makes this easy. The i-vectors were generated exactly as described in [2013 G. Saon]*: a text-independent GMM with 2048 components is trained on the same training corpus that is used to train the neural nets.
Notes: Since the CNN beats the MLP by 0.5% and the joint model by about 1.1%, we want the CNN side to benefit from i-vectors too, and the graph structure makes that straightforward. The i-vector extraction follows the authors' 2013 paper; the text-independent 2048-component GMM is trained on the same corpus as the neural nets.
*[2013 G. Saon]: George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors," in Proc. ASRU, 2013.
21
EXPERIMENTS ON SWB: Joint training with I-Vectors
We extract 100-dimensional i-vectors for every speaker and append them to the input stream of the MLP. This brings the total input for the MLP to 11 × 40 + 100 = 540 features (assembled as in the sketch below). The experiments in [2013 G. Saon]* were done with an MLP on FMLLR features on the same task as here, with an error-rate reduction from 12.5% to 11.9% after sequence training. The results for the joint CNN/MLP are shown in Table 4.
Notes: The front-end GMM uses 40-dim FMLLR features; each speaker's 100-dim i-vector is appended to the MLP input, giving 11 × 40 + 100 = 540 inputs. (In the 2013 work, sequence training gave an 11.9% error rate versus 13.2% for cross-entropy.) As Table 4 shows, after sequence training the error rate improves from 11.2% to 10.4%, showing the cumulative gain from the MLP to the joint MLP/CNN model.
*[2013 G. Saon]: George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors," in Proc. ASRU, 2013.
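A sketch of the input assembly for one hypothetical speaker: splice 11 frames of 40-dim FMLLR and append the speaker's 100-dim i-vector to every frame, for 11 × 40 + 100 = 540 inputs. The edge padding is our assumption for the utterance boundaries.

```python
import numpy as np

frames = np.random.randn(1000, 40)             # one speaker's FMLLR frames
ivec = np.random.randn(100)                    # that speaker's i-vector

def splice(x, ctx=5):                          # +/-5 frames -> 11-frame window
    pad = np.pad(x, ((ctx, ctx), (0, 0)), mode="edge")
    return np.hstack([pad[i:i + len(x)] for i in range(2 * ctx + 1)])

mlp_input = np.hstack([splice(frames),
                       np.tile(ivec, (len(frames), 1))])   # (1000, 540)
```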
22
EXPERIMENTS ON SWB: Comparison with System Combination
The jointly trained MLP/CNN can be seen as a form of system combination: the outputs of the first MLP hidden layer are combined with the outputs of the second CNN layer. As a contrast experiment, we wanted to see how much we would gain from combining separately trained MLP and CNN models. Since ROVER does not work well when combining only two systems, we used score fusion, averaging the acoustic scores of both models (sketched below). The error rate for the system combination of the separate MLP (12.3%) and CNN (11.8%) is 11.2%, the same as for the jointly trained model.
Notes: Joint MLP/CNN training is itself a kind of system combination, so as a contrast we combined separately trained MLP and CNN models. Because ROVER (recognizer output voting error reduction) performs poorly with only two systems, we averaged the two models' acoustic scores instead. The separate models' error rates are 12.3% (MLP) and 11.8% (CNN); their fusion gives 11.2%, matching the jointly trained model. Repeating the i-vector experiment, the separate systems give 11.9% (MLP + i-vectors) and 11.2% (CNN + i-vectors), fusing to 10.5%, slightly worse than the joint model's 10.4%. The joint model thus achieves system combination within a single model: there is no need to train two models and merge them, decoding computes the acoustic scores only once (separate systems need twice the computation), and the parameter count is only about 10% above a pure CNN and below that of separately trained MLP and CNN models.
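A sketch of the score-fusion contrast system: average the two models' per-frame acoustic log-scores over the states before handing them to the decoder. The arrays here are random stand-ins; decoding itself is omitted.

```python
import numpy as np

mlp_logp = np.random.randn(1000, 8260)         # frames x HMM states, model 1
cnn_logp = np.random.randn(1000, 8260)         # frames x HMM states, model 2
fused = 0.5 * (mlp_logp + cnn_logp)            # fed to the decoder in place of
                                               # a single model's scores
```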
23
EXPERIMENTS ON RATS The data collection consists of clean data retransmitted over noisy channels. The clean audio has CallHome-type characteristics (telephone conversations), while the noisy data was obtained by transmitting the original audio through a pair of senders and receivers. In total, 8 different transmissions were performed using different sender and receiver combinations.
Notes: RATS (Robust Automatic Transcription of Speech) is a DARPA (Defense Advanced Research Projects Agency) program on processing speech from noisy, degraded channels; our experiments cover two of its subtasks, speech activity detection (SAD) and keyword search (KWS). The clean audio is CallHome-like telephone conversation; the noisy data come from retransmitting it through eight different sender/receiver combinations.
24
EXPERIMENTS ON RATS: Acoustic Models for Keyword Search
The goal of keyword search (also known as spoken term detection) is to locate a spoken keyword in audio documents. For this task, 300 hours of acoustic training data are available. The target languages for KWS are Levantine Arabic and Farsi. The baseline acoustic models are neural nets, described in detail in [2013 G. Saon]*; the only difference is the number of output units, where we use 7000 HMM states for our RATS Levantine system.
Notes: Keyword search (spoken term detection) locates spoken keywords in audio; from our perspective, KWS is essentially LVCSR plus some form of post-processing of the lattices to produce a searchable index. The baseline follows the authors' 2013 RATS paper, in which the system runs in three passes after feature extraction: (1) the features feed three channel GMMs, merged into a channel detector; (2) the features plus the output of (1) feed a neural net; (3) the outputs of (1) and (2) feed a combination of two NNs and a CNN. The CNN and joint MLP/CNN configurations match the SWB setup above except for the output size, 7000 HMM states for the Levantine system. Table 5 on the next slide shows the joint model under different noise conditions.
*[2013 G. Saon]: Hagen Soltau, Hong-Kwang Kuo, Lidia Mangu, George Saon, and Tomas Beran, "Neural Network Acoustic Models for the DARPA RATS Program," in Proc. Interspeech, 2013.
25
EXPERIMENTS ON RATS: Acoustic Models for Keyword Search
The results, shown in Table 5 for different noise conditions, demonstrate the effectiveness of the joint training approach.
26
EXPERIMENTS ON RATS: Speech Activity Detection
The goal of the speech activity detection (SAD) task is to determine whether a signal contains speech or consists only of background noise or music. Performance is measured by:
probability of miss: PMiss = (duration of missed speech) / (duration of total speech)
probability of false accept: PFA = (duration of false-accept speech) / (duration of total non-speech)
Notes: PMiss is the time during which speech was missed divided by the total speech time; PFA is the time falsely accepted as speech (speech detected where there is none) divided by the total non-speech time. Both are computed in the sketch below.
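A tiny sketch of the two metrics on frame-level labels (1 = speech); with fixed-length frames, duration ratios reduce to frame-count ratios. The toy arrays are illustrative only.

```python
import numpy as np

ref = np.array([1, 1, 1, 0, 0, 1, 0, 0])       # reference labels
hyp = np.array([1, 0, 1, 0, 1, 1, 0, 0])       # system decisions

p_miss = ((ref == 1) & (hyp == 0)).sum() / (ref == 1).sum()   # 1/4
p_fa   = ((ref == 0) & (hyp == 1)).sum() / (ref == 0).sum()   # 1/4
```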
27
EXPERIMENTS ON RATS: Speech Activity Detection
The first system uses a score-level fusion of two neural nets: an MLP trained on a combination of PLP, voicing, and FDLP features, and a CNN trained on log-mel spectral features. The second system is a joint MLP/CNN: the MLP part has the same inputs as the separately trained MLP (PLP, FDLP, and voicing), and the CNN part is given log-mel spectra (the same as for the CNN trained in isolation).
Notes: Figure 2 compares the two systems with ROC curves on the DEV1 test set (11 hours of audio). More details on the models, features, and normalizations are in the authors' 2013 paper [2013 G. Saon]. The pipeline runs per channel (A-H): the audio first passes eight channel-dependent Gaussian mixture models trained with maximum likelihood on fused PLP and voicing features, then neural net layers of 1024 units each, and finally a stage of two NNs plus a CNN configured as in this paper.
28
EXPERIMENTS ON RATS: Speech Activity Detection
In Figure 2 we compare the ROC curves of the two systems on the DEV1 test set, which contains 11 hours of audio. As can be seen, the joint model yields a 20% relative improvement in equal error rate over the separate models with score fusion.
Notes: The figure shows that, compared with separately trained models combined by score fusion, the joint model reduces the equal error rate by 20% relative.
29
CONCLUSIONS We described a simple extension of neural networks that changes the typical linear sequence of layers into a graph structure. The benefit of the graph structure is that it allows convolutional and regular neural networks to be trained jointly. While i-vectors are not topographical, the joint training approach lets convolutional neural networks leverage them as well. Starting from our baseline CNN at an 11.8% error rate, we reduced the error rate to 10.4%, a 10% relative improvement over a very strong baseline.
30
CONCLUSIONS We also demonstrated that our model works across different tasks, such as speech activity detection and LVCSR for RATS keyword search. Furthermore, the proposed neural graph structure allows other features to be implemented in an elegant way, such as multi-task learning or the parallel use of multiple GPU devices.
31
Thank you!