Introduction. Prof. Lin-Shan Lee. TA: Chung-Ming Chien
Outline: Project Introduction, Linux and Bash Introduction, Feature Extraction, Acoustic Modeling, Homework
Project Introduction
Phase 1 Project. Goal: build a basic large-vocabulary speech recognition system so that students gain a concrete understanding of speech recognition, and use it as the foundation for further study of more advanced techniques. Input Speech -> Speech Recognition System -> Output Sentence (今天). We are going to look inside the black box of the speech recognition system.
Speech Recognition System. Conventional ASR (Automatic Speech Recognition) system (block diagram): Input Speech -> Front-end Signal Processing -> Feature Vectors -> Linguistic Decoding and Search Algorithm -> Output Sentence (今天). The decoder uses an Acoustic Model (obtained by Acoustic Model Training on Speech Corpora), a Language Model (obtained by Language Model Construction on Text Corpora), and a Lexicon. A deep-learning-based ASR system will also be introduced.
Speech Recognition System. Conventional ASR system: widely used in commercial systems. Deep-learning-based ASR system: widely studied in recent years. Both will be implemented in this project with the Kaldi toolkit; Kaldi is the most widely used ASR toolkit.
Schedule (Week / Progress / Report Group):
Week 1: Introduction + Linux intro + Feature extraction; Acoustic model training: monophone & triphone
Week 2: InterSpeech 2019
Week 3: Language model training + Decoding (Group A)
Week 4: Live demo (Group B)
Week 5: Deep Neural Network
Week 6: Progress Report
Week 7 and later: ... (the weeks above form Phase 1; Phase 2 follows)
Speech Recognition System: the same conventional ASR block diagram (Input Speech -> Front-end Signal Processing -> Feature Vectors -> Linguistic Decoding and Search Algorithm -> Output Sentence; Acoustic Model, Language Model, Lexicon), annotated with when each part is covered: Week 1 (feature extraction and acoustic model training), Week 3 (language model construction and decoding), Week 4 (live demo), Week 5 (deep-learning-based ASR system).
How to do recognition? How do we map speech O to a word sequence W? Find the most likely word sequence: $W^* = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W)\,P(W)$. P(O|W): acoustic model; P(W): language model.
Language model P(W). For W = w1, w2, w3, ..., wn, a trigram language model gives $P(W) = P(w_1)\,P(w_2 \mid w_1)\,\prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})$.
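For example, with a hypothetical four-word sentence W = w1 w2 w3 w4 this decomposition becomes
$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\,P(w_4 \mid w_2, w_3)$.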
Language model examples. "log Prob" means the probability is given in log scale.
Acoustic Model P(O|W). Model of a phone: Markov Model plus Gaussian Mixture Model.
Lexicon
Linux and Bash Introduction
Vim. To create a file: vim hello.txt. Once inside, press "i" to enter insert mode and type whatever you want. Press ESC to return to normal mode, where you can: type "/word" to search for "word", type ":w" to save, or type ":wq" to save and quit.
Screen. To avoid losing a half-finished job when your connection drops, use screen:
1. After logging in, type "screen" to start a screen session; usage inside is the same as a normal shell.
2. To close the screen, type "exit" as usual.
3. If a program is still running and you only want to detach, press "Ctrl + a" then "d" to leave screen mode (the program keeps running even after you log out and shut down your own machine).
4. The next time you log in, type "screen -r" to jump back into the screen you left open.
5. If "screen -r" lists several detached screens, enter the id of the one you want (a larger id is newer).
This way your job keeps running even after you turn off your computer!!! You can also use tmux, which is like screen with more features.
Linux Shell Script Basics
echo "Hello" (print "Hello" on the screen)
a=ABC (assign ABC to a)
echo $a (print ABC on the screen; $ dereferences a variable)
b=$a.log (assign ABC.log to b)
cat $b > testfile (redirect the contents of the file ABC.log into testfile)
command -h (print the command's help information)
Bash Example. Double parentheses ((xxx)) use C-style syntax (arithmetic and C-like loops).
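A minimal illustrative sketch (the values are made up, not from the original slide) of the C-style (( )) syntax:
a=3
b=4
(( c = a * b + 1 ))   # C-style arithmetic; no $ is needed inside (( ))
echo $c               # prints 13
for (( i = 0; i < 3; i++ )); do
  echo "iteration $i"
done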
Bash script. [ condition ] uses 'test' to check a condition, e.g. test -e ~/tmp; echo $?
File tests, [ -e filename ]:
-e does the file exist?
-f does it exist and is it a regular file?
-d does it exist and is it a directory?
Number tests, [ n1 -eq n2 ]:
-eq equal (n1 == n2)
-ne not equal (n1 != n2)
-gt greater than (n1 > n2)
-lt less than (n1 < n2)
-ge greater or equal (n1 >= n2)
-le less than or equal (n1 <= n2)
SPACE COUNTS!!!!
$?: the exit status of the previous command.
Bash script. Logic: -a and, -o or, ! negation.
[ "$yn" == "Y" -o "$yn" == "y" ]
[ "$yn" == "Y" ] || [ "$yn" == "y" ]
Don't forget the spaces and the double quotes!!!!
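A small sketch (the prompt and file name are made up) combining the file tests and logic operators above:
read -p "Overwrite log.txt? (Y/N) " yn
if [ "$yn" == "Y" ] || [ "$yn" == "y" ]; then
  if [ -f log.txt ]; then   # -f: true if log.txt exists and is a regular file
    rm log.txt
  fi
  touch log.txt
fi
echo $?   # exit status of the last command (0 on success)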
Bash script.
` (backtick) operation: echo `ls`; my_date=`date`; echo $my_date
&& || ; operations:
echo hello || echo no~
echo hello && echo no~
[ -f tmp ] && cat tmp || echo "file not found"
[ -f tmp ] ; cat tmp ; echo "file not found"
Some useful commands: grep, sed, touch, awk, ln (see the one-liners below).
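A few illustrative one-liners for these commands (all file names here are placeholders):
grep "WER" log/decode.log            # print the lines that contain "WER"
sed 's/foo/bar/g' in.txt > out.txt   # replace every foo with bar
awk '{ print $1 }' train.wav.scp     # print only the first column (e.g. utterance ids)
touch empty.txt                      # create an empty file / update its timestamp
ln -s /share/data mydata             # make a symbolic link named mydata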
Bash script. Pipeline: program1 | program2 | program3, e.g. echo "hello" | tee log. More information about pipelines: http://www.gnu.org/software/bash/manual/html_node/Pipelines.html
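Another illustrative pipeline (text.txt is a placeholder file):
cat text.txt | tr ' ' '\n' | sort | uniq | wc -l   # split into one word per line, then count the distinct words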
Bash script. Input / output for bash:
cmd > logfile            # send stdout to logfile; stderr still appears on the screen
cmd > logfile 2>&1       # send both stdout and stderr to logfile
cmd < inputfile 2> errorfile | grep stdoutfile
More information about bash input/output: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_08_02.html
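A few concrete lines (ls and the file names are only placeholders) showing where each stream ends up:
ls exist.txt missing.txt > out.log            # stdout -> out.log, the error message still shows on screen
ls exist.txt missing.txt > out.log 2>&1       # stdout and stderr both go to out.log
ls exist.txt missing.txt 2> err.log | wc -l   # stderr -> err.log, stdout is piped into wc -l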
Feature Extraction (02.extract.feat.sh): the Front-end Signal Processing block of the ASR diagram, turning the Input Speech (今天) into the Feature Vectors used by the decoder.
Feature Extraction - MFCC
MFCC (Mel-frequency cepstral coefficients): a 13-dimensional vector per frame (see Chapter 2 of the 數位語音 (DSP) course).
Extract Feature (02.extract.feat.sh): run on the Training Set, Development Set, and Testing Set; for each set, note its input, its output archive, and its directory.
Kaldi rspecifier & wspecifier format:
ark:<ark file>: an archive holding many small objects, e.g. wav data, MFCC features, or statistics.
scp:<scp file>: a list of file locations; each entry can point to an individual file (like our material/train.wav.scp) or to a position inside an ark file.
ark,t:<ark file>: write the ark as a text file; when reading, ',t' has no effect. Without ',t', the default output is binary.
ark,scp:<ark file>,<scp file>: write an ark file and its scp index at the same time.
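An illustrative sketch of these specifiers with a generic Kaldi table tool (copy-feats; the archive paths are placeholders):
copy-feats ark:feat/train.13.ark ark,t:-                               # read a binary ark, print it as text to stdout
copy-feats ark:feat/train.13.ark ark,scp:feat/copy.ark,feat/copy.scp   # write an ark plus an scp index into it
copy-feats scp:feat/copy.scp ark,t:- | head                            # read the features back through the scp index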
Extract Feature (02.extract.feat.sh): compute-mfcc-feats -> add-deltas -> compute-cmvn-stats -> apply-cmvn
MFCC – Add delta. add-deltas: Deltas and Delta-Deltas. Append the Δ and ΔΔ of the MFCCs (roughly the first and second derivatives) to the features, so the total dimension becomes 39. Usage: add-deltas [options] <feature-rspecifier> <feature-wspecifier>
MFCC – CMVN. CMVN: Cepstral Mean and Variance Normalization
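As a reminder (standard definition, not taken from the slides): for each feature dimension d, CMVN estimates the mean and variance over the utterance (or speaker) and normalizes every frame,
$\hat{o}_t[d] = \frac{o_t[d] - \mu[d]}{\sigma[d]}, \qquad \mu[d] = \frac{1}{T}\sum_{t=1}^{T} o_t[d], \qquad \sigma^2[d] = \frac{1}{T}\sum_{t=1}^{T} \bigl(o_t[d] - \mu[d]\bigr)^2$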
MFCC – CMVN.
compute-cmvn-stats. Usage: compute-cmvn-stats [options] <feats-rspecifier> (<stats-wspecifier>|<stats-wxfilename>)
apply-cmvn. Usage: apply-cmvn [options] (<cmvn-stats-rspecifier>|<cmvn-stats-rxfilename>) <feats-rspecifier> <feats-wspecifier>
Hint (Important!!)
compute-mfcc-feats: its output is ark:$path/$target.13.ark
add-deltas [input] [output]: [input] = ark:$path/$target.13.ark, [output] = x (an intermediate file you choose)
compute-cmvn-stats [input] [comput_result]: [input] = x
apply-cmvn [comput_result] [input] [output]: [output] MUST BE ark,t,scp:$path/$target.39.cmvn.ark,$path/$target.39.cmvn.scp
rm -f [output] [comput_result]
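A minimal sketch of how these four commands might be chained inside 02.extract.feat.sh. It assumes the script already defines $path, $target and $log, and that the wav list lives at material/$target.wav.scp; the intermediate file names are made up, and only the final ark,t,scp output follows the hint above:
# 13-dim MFCCs (the script is assumed to do this step already)
compute-mfcc-feats scp:material/$target.wav.scp ark:$path/$target.13.ark 2> $log
# append delta and delta-delta features: 13 -> 39 dimensions (intermediate file)
add-deltas ark:$path/$target.13.ark ark:$path/$target.39.delta.ark 2>> $log
# accumulate per-utterance CMVN statistics
compute-cmvn-stats ark:$path/$target.39.delta.ark ark:$path/$target.cmvn.stats 2>> $log
# apply CMVN; write the text ark together with its scp index
apply-cmvn ark:$path/$target.cmvn.stats ark:$path/$target.39.delta.ark \
  ark,t,scp:$path/$target.39.cmvn.ark,$path/$target.39.cmvn.scp 2>> $log
# clean up the intermediate files
rm -f $path/$target.39.delta.ark $path/$target.cmvn.stats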
Acoustic Modeling (03.mono.train.sh, 05.tree.build.sh, 06.tri.train.sh): the Acoustic Model Training block of the ASR diagram, which trains the Acoustic Model from the Speech Corpora.
Hidden Markov Model (HMM). Given: a sequence of observations (balls) and a hidden Markov model (the transition probabilities between baskets and the observation probability of each basket). Expected: the sequence of states (baskets).
Hidden Markov Model (HMM). Elements of an HMM {S, A, B, π}:
S: a set of N states
A: the N x N matrix of state transition probabilities
B: a set of N probability functions, each describing the observation probability of a state
π: the vector of initial state probabilities
(Figure: a 3-state example with observation probabilities s1 {R:.3, G:.2, B:.5}, s2 {R:.7, G:.1, B:.2}, s3 {R:.3, G:.6, B:.1} and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1, 0.5.)
Gaussian Mixture Model (GMM). Observations may be continuous (e.g., MFCC vectors); a GMM is used to model the continuous probability density function.
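The standard form of such a mixture density for state j (standard notation, not taken from the slides):
$b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\right), \qquad \sum_{m=1}^{M} c_{jm} = 1$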
Acoustic Model: P(O|λ). Model of a phone: Markov Model plus Gaussian Mixture Model. A general HMM need not be left-to-right, but the HMMs used as acoustic models are all left-to-right (unidirectional).
Acoustic model: Best State Seq.
Acoustic model: Training. (Figure: ten observations O1 to O10 aligned to states s1, s2, s3 over time steps 1 to 10, with two observation symbols v1, v2.) Re-estimated observation probabilities: b1(v1)=3/4, b1(v2)=1/4; b2(v1)=1/3, b2(v2)=2/3; b3(v1)=2/3, b3(v2)=1/3
Acoustic model: Training. Initialization matters: a bad initialization tends to trap training in a poor local optimum, while a good one reaches a local optimum with higher likelihood. Model initialization: Segmental K-means. Model re-estimation: Baum-Welch.
Acoustic model: Training. Example: suppose four people each produce the sound 「ㄅ」.
Acoustic Model: P(O|W). One acoustic model per phoneme? The pronunciation of a phoneme may be affected by its neighbors! (e.g., ㄐ一ㄣ ㄊ一ㄢ)
Monophone vs. Triphone.
Monophone: each model considers only one phone, e.g. ㄧ, ㄨ, ㄩ.
Triphone: each model also considers the left and right neighboring phones, e.g. ㄇ+ㄧ+ㄠ; with about 60 phones this gives 60^3 = 216,000 possible triphones.
Triphone. Too many (216,000) models to train? Share!
Generalized Triphone: sharing at the model level.
Shared Distribution Model (SDM): sharing at the state level.
(Compare: the concept of OOV?)
Triphone. A decision tree decides which triphones should be combined. Example questions (designed with human knowledge):
12: Is the left context a vowel?
24: Is the left context a back-vowel?
30: Is the left context a low-vowel?
32: Is the left context a rounded-vowel?
(Figure: a yes/no decision tree over these questions that clusters triphones such as sil-b+u, a-b+u, o-b+u, y-b+u, Y-b+u, U-b+u, u-b+u, i-b+u, e-b+u, r-b+u, N-b+u, M-b+u, E-b+u into shared leaves.)
Acoustic Model: Training Steps.
1. Get features (previous section)
2. Train the monophone model
3. Use the previous model to build a decision tree for triphones
4. Train the triphone model
Acoustic Model: Training Steps (2. Train the monophone model; a sketch of the loop follows this list):
a. gmm-init-mono: initialize the monophone model
b. compile-train-graphs: build the training graphs
c. align-equal-compiled: decode & align (use gmm-align-compiled instead when looping)
d. gmm-acc-stats-ali: EM training, E step
e. gmm-est: EM training, M step
f. numgauss = numgauss + incgauss
g. Go to step c; train for several iterations
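A rough sketch of what such a monophone training loop could look like, assembled from the command usages on the following slides. Every path, file name, and numeric setting here (exp/mono, train.tra, cur.ali, the initial numgauss, etc.) is a placeholder for illustration, not the contents of the actual script/03.mono.train.sh:
# placeholder settings; the real script defines its own values and paths
dir=exp/mono; lang=lang; feat="ark,s,cs:feat/train.39.cmvn.ark"
num_iters=40; numgauss=300; incgauss=10; beam=10
realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38"

# a. initialize a monophone model and tree from the HMM topology and feature dimension (39)
gmm-init-mono $lang/topo 39 $dir/mono.mdl $dir/tree
# b. compile one training graph per utterance from the tree, model, lexicon FST and transcripts
compile-train-graphs $dir/tree $dir/mono.mdl $lang/L.fst ark:$dir/train.tra ark:$dir/train.graph
# c. start from an equally spaced alignment
align-equal-compiled ark:$dir/train.graph "$feat" ark:$dir/cur.ali

for x in $(seq 1 $num_iters); do
  # d. E step: accumulate GMM statistics given the current alignment
  gmm-acc-stats-ali $dir/mono.mdl "$feat" ark:$dir/cur.ali $dir/mono.acc
  # e. M step: re-estimate the model, growing toward $numgauss Gaussians
  gmm-est --write-occs=$dir/mono.occs --mix-up=$numgauss $dir/mono.mdl $dir/mono.acc $dir/mono.new.mdl
  mv $dir/mono.new.mdl $dir/mono.mdl
  # f. raise the Gaussian budget for the next iteration
  numgauss=$(( numgauss + incgauss ))
  # g. re-align with the updated model on the selected iterations only
  if echo "$realign_iters" | grep -qw "$x"; then
    gmm-align-compiled --beam=$beam --retry-beam=$(( beam * 4 )) $dir/mono.mdl \
      ark:$dir/train.graph "$feat" ark:$dir/cur.ali
  fi
done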
Acoustic Model: Training Steps (4. Train the triphone model):
a. gmm-init-model: initialize the GMM (from the decision tree)
b. gmm-mixup: Gaussian merging (increase #Gaussians)
c. convert-ali: convert alignments (old model <-> new decision tree)
d. compile-train-graphs: build the training graphs
e. gmm-align-compiled: decode & align
f. gmm-acc-stats-ali: EM training, E step
g. gmm-est: EM training, M step
h. numgauss = numgauss + incgauss
i. Go to step e; train for several iterations
align-equal-compiled: write an equally spaced alignment (for getting training started).
Usage: align-equal-compiled <graphs-rspecifier> <features-rspecifier> <alignments-wspecifier>
e.g. align-equal-compiled ark:1.fsts scp:train.scp ark:equal.ali
gmm-align-compiled: perform re-alignment.
Usage: gmm-align-compiled [options] <model-in> <graphs-rspecifier> <feature-rspecifier> <alignments-wspecifier>
e.g. gmm-align-compiled 1.mdl ark:graphs.fsts scp:train.scp ark:1.ali
gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$[$beam*4] <hmm-model*> ark:$dir/train.graph ark,s,cs:$feat ark:<alignment*>
For the first iteration (in monophone training) the beam width is 6, otherwise 10.
Only realign on selected iterations. mono: $realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38"; tri: $realign_iters="10 20 30"
gmm-acc-stats-ali: accumulate statistics for GMM training (E step).
Usage: gmm-acc-stats-ali [options] <model-in> <feature-rspecifier> <alignments-rspecifier> <stats-out>
e.g. gmm-acc-stats-ali 1.mdl scp:train.scp ark:1.ali 1.acc
gmm-acc-stats-ali --binary=false <hmm-model*> ark,s,cs:$feat ark,s,cs:<alignment*> <stats>
gmm-est: do Maximum Likelihood re-estimation of the GMM-based acoustic model (M step).
Usage: gmm-est [options] <model-in> <stats-in> <model-out>
e.g. gmm-est 1.mdl 1.acc 2.mdl
gmm-est --binary=false --write-occs=<*.occs> --mix-up=$numgauss <hmm-model-in> <stats> <hmm-model-out>
--write-occs: file to write pdf occupation counts to. $numgauss increases every iteration.
Homework: Linux and background knowledge; 01.format.sh, 02.extract.feat.sh; 03.mono.train.sh, 05.tree.build.sh, 06.tri.train.sh
Homework. If you have no experience with the Linux command line, please study basic Linux commands in advance, e.g. from 鳥哥的Linux私房菜 (Vbird's Linux guide):
Chapter 7, Linux file and directory management: http://linux.vbird.org/linux_basic/0220filemanager.php
Chapter 10, the vim editor: http://linux.vbird.org/linux_basic/0310vi.php
Homework (optional). Reading: Chapter 3 of the thesis 「使用加權有限狀態轉換器的基於混合詞與次詞以文字及語音指令偵測口語詞彙」 (WFST-based spoken term detection with hybrid word/subword units and text or spoken queries): https://www.dropbox.com/s/dsaqh6xa9dp3dzw/wfst_thesis.pdf
Kaldi documentation: http://kaldi-asr.org/doc/tools.html
Login Workstation.
With pietty/putty/Xshell: ssh to 140.112.21.80, port 22.
From a terminal: ssh -p 22 username@140.112.21.80
Data. Copy the archive into your home directory: cp /share/proj1.ASTMIC.subset.tar.gz ~/. Then extract it: tar -zxvf proj1.ASTMIC.subset.tar.gz
To Do: Feature Extraction.
Step 1: Execute the following commands:
script/01.format.sh | tee log/01.format.log
script/02.extract.feat.sh | tee log/02.extract.feat.sh.log
Step 2: Add deltas and apply CMVN. Observe the output and report.
Hint (Important!!)
compute-mfcc-feats: its output is ark:$path/$target.13.ark
add-deltas [input] [output]: [input] = ark:$path/$target.13.ark, [output] = x (an intermediate file you choose)
compute-cmvn-stats [input] [comput_result]: [input] = x
apply-cmvn [comput_result] [input] [output]: [output] MUST BE ark,t,scp:$path/$target.39.cmvn.ark,$path/$target.39.cmvn.scp
rm -f [output] [comput_result]
To Do: Acoustic Modeling.
Step 1: Execute the following commands:
script/03.mono.train.sh | tee log/03.mono.train.log
script/05.tree.build.sh | tee log/05.tree.build.log
script/06.tri.train.sh | tee log/06.tri.train.log
Step 2: Finish the code in the TODO sections of script/03.mono.train.sh and script/06.tri.train.sh.
Step 3: Observe the output and results.
Step 4: (optional) Tune the number of Gaussians and the number of iterations.
Hint (important!!). Use the variables already defined. Use the command formulas given above. Redirect stderr into the log file, e.g. compute-mfcc-feats … 2> $log
Workstation notes. Please do not repeatedly and aggressively crawl external sites or download data from your programs; if the computer center detects this, the IP will be banned and nobody will be able to connect to the workstation. If the corpus you need to train on requires more than 50 GB of space, please email me so we can manage the workstation's disk usage. Because computing resources are limited, please do not use the workstation for personal assignments; keep the resources for the project work. The Week 3 & Week 5 experiments need a large amount of computation and time, so please start early; if everything piles up in the last day or two, the limited resources will leave everyone's jobs stuck and nobody will be able to run their experiments.
Workstation notes. Several versions of the CUDA library are installed; if some of your jobs need a specific CUDA version, add it to your path yourself, and if the version you need is missing, email me and I will install it. For the phase-2 project you are encouraged to use a virtual environment (e.g. virtualenv, conda), or use pip --user to keep the packages you need local. Please do not run interactive programs such as ipython on the workstation, since they hold on to resources; run python scripts directly instead.
Other notes.
Problems about the project: Facebook Group 數位語音專題; DSP Website: http://speech.ee.ntu.edu.tw/courses.html; Week 1 TA: 簡仲明 r08922080@ntu.edu.tw
Problems about the workstation: Workstation TA: 簡仲明 r08922080@ntu.edu.tw
Other notes. You are encouraged to post your problems in the Facebook group; your problem may be someone else's too. Always specify your problem clearly, with the error message or screenshots of your code/results/error, or nobody will be able to help you. Start early!!!
Other notes. Please fill in your personal information at https://reurl.cc/dr0mMD
All announcements will be posted to the Facebook group and also emailed to everyone.
Please upload your report to https://reurl.cc/k5daY9 before 23:59:59 on the night before your presentation; everyone presents using the classroom computer.