1 Yow-Bang Wang, Lin-shan Lee
Supervised Detection and Unsupervised Discovery of Pronunciation Error Patterns for Computer-Assisted Language Learning
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 3, MARCH 2015
Presented by Ming-Han Yang, 2015/05/19

2 The era of computer-assisted language learning
Speaking 2 or more languages is necessary
Using speech processing techniques to learn a second language: computer-assisted language learning (CALL)
Examples: Virtual Language Tutor, HAFSS, SpeechRater, Rainbow Rummy game, CALLJ, CHELSEA
CAPT analyzes the produced utterance to offer feedback to the language learner in the form of quantitative or qualitative evaluations of pronunciation proficiency
Pronunciation assessment -> ASR posterior probability (GOP)
How to give feedback to the learner based on the mispronounced sounds? (based on Error Patterns)
Error Patterns (EPs) are patterns of erroneous pronunciations frequently produced by language learners
Mispronunciations usually arise because the learner's mother tongue does not use the articulators the way the target language does
Different L1/L2 pairs yield different EPs; with many languages involved, the combinations grow complex
CAPT can be divided into two aspects:
EP derivation/discovery (for a specific L1-L2 pair, or for an L2 with non-specific L1) -> building an EP dictionary
EP detection: deciding whether each voice segment of the learner is correct, or belongs to a specific EP in the EP dictionary
Notes: the Virtual Language Tutor is an interactive system teaching learners to speak Swedish; a self-assessment program teaches English speakers to speak Japanese; HAFSS teaches Arabic as an L2; there are also systems teaching children to read and write; a web-based role-playing environment teaches conversation; SpeechRater supports online English learning; the Rainbow Rummy game teaches vocabulary; CALLJ teaches Japanese (L2) by having learners describe concepts in sentences; CHELSEA helps English speakers learn Chinese

3 Research on EP derivation falls into two broad categories
Based on L1-L2 pairs -> an EP dictionary (table lookup)
Compare the free-phone ASR output with the corpus phone labels
Drawback = high cost (labeling is time-consuming, requires linguistic expertise, and the ASR output is not always reliable)
Unsupervised acoustic pattern discovery
Spoken term detection, OOV word modeling
No longer needs human-annotated data for acoustic model training
Goal = automatically discover the acoustic patterns in a data set based on the signal characteristics
Motivated by the high cost of labeled training data in conventional HMM training; labeling EPs requires even more expert effort, is harder to obtain, and costs even more
Major difficulty of EP detection: EPs are intrinsically similar to their corresponding canonical pronunciation, and EPs corresponding to the same canonical pronunciation are also intrinsically similar to each other; distinguishing EPs from their canonical pronunciations is therefore difficult
Some studies use the log-likelihood ratio or the posterior probability to tackle this kind of problem
In this paper: supervised EP detection & unsupervised EP discovery, tested on a corpus from learners of Mandarin Chinese
For supervised detection, we propose heterogeneous model initialization and cascaded model adaptation techniques to produce an acoustic model for each EP, plus an additional EP classifier (MLP) to combine the decisions of these EP models
For unsupervised EP discovery, we propose a new framework that uses hierarchical agglomerative clustering (HAC) to build segment-level features from frame-level features
In both tasks we use the UPP (universal phoneme posteriorgram)

4 Main points of this paper
1. Properly exploit larger multi-speaker or mixed-language corpora, rather than being restricted to the learner corpus
2. EPs often differ only subtly from the canonical pronunciations; we make every effort to improve the recognition accuracy and discriminating power of our system

5 Data Collection
Chinese language teachers from the International Chinese Language Program (ICLP) of NTU
Corpus: 278 ICLP learners from 36 different countries
Balanced gender
30 sentences / learner
6~24 characters / sentence
The recording text prompts were chosen so as to cover as many Chinese syllables and tone patterns as possible
The proportion of mispronunciations in the corpus is not high (about 10%), because the learners had already received basic Mandarin pronunciation training before the recordings
Our goal thus becomes finding the EPs in a corpus of roughly 90% correct and 10% erroneous pronunciations
There are 152 EPs in total, an average of 3.9 EPs per phoneme (152/39 ≈ 3.9)
Most EPs can be described with Mandarin or English phonemes; a few require Min-nan or Hakka phonemes [Table 3]
Note that j, q, zh, z, ch, c have more EPs, as do f, s, sh, x, h; these sounds are relatively difficult for foreign learners of Chinese
Two annotators labeled each segment of each sentence in the corpus as correct or as a specific EP; the first annotator's labels serve as the reference for the experiments, and the second annotator's labels are used to estimate inter-annotator agreement

6 Universal phoneme posteriorgram (UPP)
Used as the fundamental frame-level features for the two tasks
Posterior probability has been widely used in CAPT and in unsupervised acoustic pattern discovery
Much work on pattern discovery has adopted posteriorgrams as the features for further processing: some derives the posteriorgram with a GMM trained on the target corpus, and some with an MLP trained on separate large corpora
We train an MLP on large multi-speaker corpora of mixed languages
Posteriorgram feature extractor:
Output layer = softmax
Output = the set of all phoneme units of the mixed languages
Input = the MFCC feature vectors of all signal frames of the mixed-language corpora
The resulting posterior vector is the UPP feature vector
GOP-based work comes in many varieties, such as finding mispronounced segments with predefined thresholds, combining an EP network with a GOP-based mispronunciation detector, or feeding log-likelihood or posterior probability vectors as input features to discriminative classifiers such as SVMs
The MLP's output layer is a softmax, its outputs are all phoneme units of the mixed languages, and its input is the MFCC features of all frames of the mixed-language corpora; the MLP thus converts MFCCs into a vector of posterior probabilities over all phonemes => the UPP feature (see the sketch below)
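A minimal sketch of UPP extraction, assuming an MLP already trained on the mixed Mandarin/English corpora; the weight matrices, layer sizes, and activation are illustrative placeholders, not the paper's exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def extract_upp(mfcc_frames, W1, b1, W2, b2):
    """Map 39-dim MFCC frames to 73-dim phoneme posteriorgrams (UPP)."""
    hidden = np.tanh(mfcc_frames @ W1 + b1)  # hidden layer
    return softmax(hidden @ W2 + b2)         # posteriors over 73 phonemes

# Example with random placeholder weights: 100 frames of 39-dim MFCCs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(39, 256)), np.zeros(256)
W2, b2 = rng.normal(size=(256, 73)), np.zeros(73)
upp = extract_upp(rng.normal(size=(100, 39)), W1, b1, W2, b2)
assert np.allclose(upp.sum(axis=1), 1.0)     # each frame is a distribution
```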

7 Universal phoneme posteriorgram (UPP)
We reduce speaker variation while preserving the traits of pronunciation variation, which is the key for both supervised EP detection and unsupervised EP discovery
The motivation for training the MLP on multi-speaker, mixed-language data is to extract posteriorgram features through the MLP; this approach was originally developed for ASR of languages with little or almost no data, and we adapt it here to the EP tasks
[Figure] A and B are pronunciations from different speakers; in the acoustic space, speaker variation and pronunciation variation are entangled, but after mapping into the posterior space, the pronunciations can be easily distinguished even across different speakers

8 Supervised Detection of Pronunciation Error Patterns
We consider the EPs as the pronunciation variations of each corresponding canonical pronunciation
Phoneme-level forced alignment can be performed with the learners' recordings
We expand the phoneme sequence of the orthographic transcription of the utterance into a network of canonical pronunciations and EPs
The surface pronunciations with maximum likelihood are then automatically chosen during forced alignment
This framework requires an AM for each EP
The proportion of mispronunciations in our corpus is too low, and the learners' pronunciations are not limited to Mandarin phonemes
Based on the experts' descriptions of the EPs, we find the corresponding phones (L1) in other corpora to train the initial EP models
Cascaded adaptation, 3 stages = global MLLR, class-based MLLR, maximum a posteriori (MAP)
We treat each EP as a variation of its canonical pronunciation, obtain phoneme-level forced alignments from the learners' recordings, and expand the phoneme sequence of each utterance into the network
A short pause is inserted between syllables to absorb the learner's hesitation
Since EPs are not only substitutions but also insertions and deletions, this graph can handle all of them (a sketch of the expansion follows below)
Because our corpus contains only L2 pronunciations and the mispronunciation ratio is quite low, we follow the teachers' description of each EP and look up the corresponding phones in other Mandarin and English corpora (L1) to serve as the initial EP models; after initialization, these copied Mandarin/English phones become the EP models
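A minimal sketch of the network expansion, assuming a hypothetical EP dictionary; the labels b_010, b_020 and the optional short-pause convention are placeholders for illustration.

```python
ep_dict = {            # phoneme -> [canonical e_0^p, EPs e_1^p, e_2^p, ...]
    "b": ["b", "b_010", "b_020"],
    "a": ["a", "a_010"],
}

def expand_network(phoneme_seq, ep_dict, optional_sp=True):
    """Return, per position, the alternative units among which forced
    alignment chooses the maximum-likelihood surface pronunciation."""
    network = []
    for i, p in enumerate(phoneme_seq):
        network.append(ep_dict.get(p, [p]))   # canonical plus its EPs
        if optional_sp and i < len(phoneme_seq) - 1:
            network.append(["sp", ""])        # optional short pause
    return network

print(expand_network(["b", "a"], ep_dict))
# [['b', 'b_010', 'b_020'], ['sp', ''], ['a', 'a_010']]
```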

9 Model initialization & adaptation procedures
AM adaptation has been widely adopted to alleviate the speaker or environment mismatch between the learner's voice and the training or reference set, for fairer comparison or evaluation
Main purpose of model adaptation: to create AMs that better capture the characteristics of the EPs
After the EP models are successfully built, we use them to construct the pronunciation network as in Fig. 4 for maximum-likelihood alignment, and the surface pronunciations of the learners' recordings can thus be determined
This is the baseline system in the experiments
As in Fig. 1, the EPs may scatter around the canonical pronunciation, and it is not easy to define clearly whether an EP comes from the Mandarin /l/ or the English /l/; even so, our procedure still reduces the mismatch between the initial models and the actual EPs (a sketch of the MAP stage follows below)
AM adaptation is widely used to alleviate speaker or environment mismatch; the goal here is to produce the acoustic models that best express the EPs
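A minimal sketch of the final MAP stage of the cascaded adaptation (global MLLR -> class-based MLLR -> MAP); only the standard MAP mean update of one Gaussian is shown, with tau as the usual prior-weight hyperparameter. The paper's exact adaptation formulas are not reproduced here.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of a Gaussian mean from adaptation frames.
    posteriors[t] = occupation probability of this Gaussian at frame t."""
    n = posteriors.sum()                              # soft frame count
    ml_mean = (posteriors[:, None] * frames).sum(0) / max(n, 1e-8)
    return (n * ml_mean + tau * prior_mean) / (n + tau)

# With few adaptation frames (small n) the mean stays near the copied L1
# phone model; with more data it moves toward the learners' actual EPs.
```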

10 EP Detection Framework Based on the Hybrid Approach
We propose the following framework for EP detection
1st pass of Viterbi decoding: on the learners' utterances, using the EP AM set and the pronunciation network, to obtain more precise time boundaries
2nd pass of Viterbi decoding: performed given the estimated segment boundaries, taking into account the scores from both the EP AMs and the MLP-based EP classifiers
The EP AMs use 39-dimensional MFCCs, c0 to c12 plus derivatives and accelerations, as the frame-level feature vector
The EP classifiers use MFCCs, the UPPs proposed here, or different variants of UPPs as the frame-level feature vector (assuming UPPs are complementary to MFCCs)
Score computation: one major line of research, the hybrid approach, studies how to add extra features or classifiers to the conventional GMM/HMM; this direction has attracted much attention and made good progress, and the recent DNN work shares these origins, rising because it can better exploit big data
Based on this hybrid approach, we propose a framework that performs 2-pass Viterbi decoding through the network of Fig. 4 (see the scoring sketch below)
Notation: $e_i^p$ is the $i$-th EP of phoneme $p$, $i = 0, 1, 2, \ldots, N_p$; $e_0^p$ is the canonical pronunciation; $N_p$ is the total number of EPs for phoneme $p$; $x_t$ is the frame-level MFCC feature vector; $y_t$ can be MFCCs, UPPs, or UPP variants
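A minimal sketch of the second-pass scoring: within the fixed boundaries from the first pass, each candidate $e_i^p$ is scored by combining the EP acoustic-model score $S_g(x_t \mid e_i^p)$ with the EP-classifier score $S_s(e_i^p \mid y_t)$; the per-phoneme weights w1_p, w2_p and the two score functions are placeholders to be supplied by the trained models.

```python
def second_pass_choice(seg_x, seg_y, candidates, am_score, clf_score,
                       w1_p=1.0, w2_p=1.0):
    """Pick the candidate pronunciation maximizing the fused segment score."""
    scores = {}
    for e in candidates:
        scores[e] = sum(w1_p * am_score(x, e) + w2_p * clf_score(e, y)
                        for x, y in zip(seg_x, seg_y))  # sum over frames
    return max(scores, key=scores.get), scores
```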

11 EP Classifiers & Confidence estimation
In order to take more context information into account, the frame at time t uses the 4 preceding and 4 following frames as the input $y_t$
A distinct set of MLPs is trained for each phoneme; different phonemes have different weights $w_{1,p}$ and $w_{2,p}$
After the 2nd pass of Viterbi decoding, we further estimate the confidence of the outputs of the EP-MLP
For each acoustic segment $Y$ of phoneme $p$ with frames $y_1, y_2, \ldots, y_T$ in the learner's utterance, we accumulate the frame-wise entropy of the EP diagnostic results, and then take its negative as the EP diagnosis confidence of the segment (see the sketch below)
Many strong classifiers such as SVMs have already been used in CAPT; we adopt a 2-level hierarchical MLP as the classifier because the MLP's outputs are well-calibrated probability estimates, well suited to be combined with the GMM/HMM scores
Since the values of $w_{1,p}$ and $w_{2,p}$ can be greater or less than 1, the relative weighting of $S_g(x_t \mid e_i^p)$ and $S_s(e_i^p \mid y_t)$ is implicitly contained in $w_{1,p}$ and $w_{2,p}$
We also tried a monolithic MLP to score the canonical pronunciation of every phoneme, but its performance was poor
If the confidence score is higher than the threshold, the EP diagnosis is returned to the learner; otherwise only a binary correct/incorrect result is returned
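A minimal sketch of this confidence measure: low frame-wise entropy (peaked EP-MLP posteriors) yields high diagnosis confidence.

```python
import numpy as np

def ep_confidence(posteriors):
    """posteriors: (T, N_p+1) EP-MLP outputs for frames y_1..y_T."""
    p = np.clip(posteriors, 1e-12, 1.0)              # avoid log(0)
    frame_entropy = -(p * np.log(p)).sum(axis=1)     # entropy per frame
    return -frame_entropy.sum()                      # negated accumulation

# The diagnosis is shown to the learner only when this confidence exceeds
# a per-phoneme threshold; otherwise only correct/incorrect is returned.
```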

12 Evaluation metrics
False rejection rate: $FRR = \frac{FR}{TA + FR}$
False accept rate: $FAR = \frac{FA}{TR + FA}$
Diagnostic error rate: $DER = \frac{DE}{CD + DE}$
F1 score $= \frac{2 \cdot precision \cdot recall}{precision + recall}$, where $precision = \frac{TA}{TA + FA}$ and $recall = \frac{TA}{TA + FR}$
Many different methods have been used to evaluate CAPT
The correlation coefficient is the most commonly used when the teachers' scores are continuous; when the teachers' scores are discrete or categorical, Cohen's kappa is mostly used
The false accept rate / false rejection rate, or recall and precision, are also often used (computed below)
There are also methods that directly evaluate the effectiveness of the whole CAPT system
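A minimal sketch computing these metrics from the raw counts (TA = true accepts, FR = false rejections, FA = false accepts, TR = true rejects, CD = correct diagnoses, DE = diagnostic errors); the example counts are made up.

```python
def metrics(TA, FR, FA, TR, CD, DE):
    precision = TA / (TA + FA)
    recall = TA / (TA + FR)
    return dict(
        FRR=FR / (TA + FR),
        FAR=FA / (TR + FA),
        DER=DE / (CD + DE),
        F1=2 * precision * recall / (precision + recall),
    )

print(metrics(TA=900, FR=100, FA=30, TR=70, CD=80, DE=20))
# {'FRR': 0.1, 'FAR': 0.3, 'DER': 0.2, 'F1': 0.932...}
```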

13 Unsupervised Discovery of Pronunciation Error Patterns
The task here is to unsupervisedly discover the EPs for each phoneme given a corpus of learner voice data
We can thus focus on one phoneme at a time: each time we are given a set of acoustic segments corresponding to a specific phoneme, and the goal is to divide this set into several clusters, each of which corresponds to an EP
The goal is to find each phoneme's EPs from the learners' audio in an unsupervised way
We assume the transcriptions of the sentences in the corpus are available; with these transcriptions, the learners' recordings can be cut into segments by forced alignment and mapped to phonemes
Because the proportion of mispronunciations in our corpus is low, only the mispronounced segments are fed into the clustering
[Figure] the mispronunciations of phone p are automatically grouped in an unsupervised manner

14 $z_t$ are the frame-level features of the segments corresponding to each phoneme (UPP features can also be used, transforming the rather speaker-dependent MFCC space into a less speaker-dependent space)
First, HAC merges adjacent frame-level feature vectors; after HAC, each segment is divided into $M_p$ sub-segments
Each sub-segment consists of acoustically similar frames; we take the mean of each sub-segment and concatenate these mean vectors, then apply K-means and GMM-MDL for unsupervised clustering
Because the EPs are composed of signal segments, we must define segment-level features for them, and since EP discovery is performed separately for each phoneme, these feature vectors must have the same length
Segments contain different numbers of frames, so the frame-level features $z_t$ cannot be directly concatenated into a segment-level feature; nor can the frame-level features simply be averaged over the whole segment, because the differences between EPs and canonical pronunciations are very subtle, and averaging would likely destroy these fine traits

15 Hierarchical Agglomerative Clustering (HAC) Algorithm
The number of sub-segments $M_p$ is the same for all segments corresponding to a specific phoneme, but can be different for different phonemes
The segment-level feature vectors of all speech segments corresponding to a phoneme are then clustered into different EPs by an unsupervised algorithm
Let $B = (t_0, \ldots, t_{M_p})$ be the set of boundaries dividing a given segment of $L$ frames into $M_p$ sub-segments, with $0 = t_0 < t_1 < \ldots < t_{M_p} = L$; the $m$-th sub-segment $(z_{t_{m-1}+1}, \ldots, z_{t_m})$ ends at $t_m$
The sum of squared error (SSE) of representing each sub-segment by the mean of its frames serves as the merging criterion
Parameters to tune: $\lambda$ (threshold), $M_p$ (number of sub-segments)
Clustering algorithms:
K-means needs to know the number of clusters, assumed equal to the number of EPs
GMM-MDL: in a more ideal scenario, the number of clusters (or EPs) should be learned from the data
For all segments corresponding to the same phoneme, the number of sub-segments $M_p$ is the same; different phonemes may have different $M_p$
We take the mean of each sub-segment, $\bar{z}_1, \ldots, \bar{z}_{M_p}$, and concatenate them into $o$ as the segment-level feature vector; the segment-level feature vectors of all segments of a phoneme are then clustered into different EPs by unsupervised algorithms, using K-means and GMM-MDL respectively
The HAC algorithm clusters similar neighboring frames together and gradually merges similar neighboring clusters upward, automatically organizing the frames of a speech segment into a tree-structured hierarchy (a sketch follows below)
GMM-MDL: a GMM is trained for each phoneme, and each instance is classified by maximum-likelihood classification; the advantage of MDL is in estimating the optimal number of mixtures of the GMM (here, the number of EPs), by maximizing the objective function where $\theta$ is the GMM's parameter set for phoneme $p$ and $O_p$ is the set of segment-level feature vectors of phoneme $p$; the first term of the objective is the log-likelihood and the second is the model complexity (Eq. (17) appears to assume a prior probability of 1 for each Gaussian)
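A minimal sketch of the HAC sub-segmentation and the segment-level feature, assuming a greedy variant: start with every frame as its own sub-segment, repeatedly remove the interior boundary whose removal increases the total SSE the least, stop at $M_p$ sub-segments, then concatenate the sub-segment means.

```python
import numpy as np

def sse(frames):
    """SSE of representing the frames by their mean."""
    return ((frames - frames.mean(0)) ** 2).sum()

def hac_feature(Z, M_p):
    """Z: (L, D) frame-level features; returns the (M_p*D,) feature o."""
    bounds = list(range(len(Z) + 1))            # t_0 = 0 < ... < t_L = L
    while len(bounds) - 1 > M_p:
        # SSE increase from removing each interior boundary (a merge)
        costs = [sse(Z[bounds[i - 1]:bounds[i + 1]])
                 - sse(Z[bounds[i - 1]:bounds[i]])
                 - sse(Z[bounds[i]:bounds[i + 1]])
                 for i in range(1, len(bounds) - 1)]
        bounds.pop(int(np.argmin(costs)) + 1)   # cheapest merge first
    means = [Z[bounds[m]:bounds[m + 1]].mean(0) for m in range(M_p)]
    return np.concatenate(means)                # segment-level feature o

o = hac_feature(np.random.default_rng(1).normal(size=(40, 73)), M_p=3)
print(o.shape)   # (219,) = 3 sub-segment means * 73 dims
```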

16 Evaluation metrics
Here we adopt the Rand Index for its balance between the similarity within clusters and the dissimilarity among different clusters
Rand index: $RI = \frac{TA' + TR'}{TA' + TR' + FA' + FR'}$
Since the mispronounced segments of each phoneme are individually clustered into EPs, we report the Average Rand Index (ARI) over all phonemes $p$ in the phoneme set $P$: $ARI = \frac{1}{|P|} \sum_{p \in P} RI(p)$
There are many evaluation methods, such as cluster purity; the advantage of RI is its balance between similarity within clusters and dissimilarity across clusters (a computation sketch follows below)
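A minimal sketch of the Rand Index between a hypothesized clustering and the reference EP labels: over all pairs of segments, count pairs grouped together in both (TA'), apart in both (TR'), together only in the hypothesis (FA'), and together only in the reference (FR').

```python
from itertools import combinations

def rand_index(hyp, ref):
    ta = tr = fa = fr = 0
    for i, j in combinations(range(len(hyp)), 2):
        same_h, same_r = hyp[i] == hyp[j], ref[i] == ref[j]
        if same_h and same_r:
            ta += 1
        elif not same_h and not same_r:
            tr += 1
        elif same_h:
            fa += 1
        else:
            fr += 1
    return (ta + tr) / (ta + tr + fa + fr)

def average_rand_index(per_phoneme):   # ARI over the phoneme set P
    return sum(rand_index(h, r) for h, r in per_phoneme) / len(per_phoneme)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))   # 0.8333...
```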

17 Experiment setup
We chose the monophone as the phoneme model unit for both Chinese and English
Chinese phoneme model: ASTMIC Mandarin corpus (95 males, 95 females, 200 utterances, 24.6 hours)
English phoneme model: TIMIT corpus (462 speakers, 3.9 hours); its speakers come from 8 regions of the USA
Input features:
MFCC: 39 parameters, c0 to c12 plus first and second derivatives
UPP: 73 posteriors for the 73 Mandarin/English monophones
Logarithm of UPP (log-UPP)
Principal component analysis (PCA) transformed log-UPP (PCA-log-UPP); for PCA we retained 95% of the total variance
One problem arose when training the binary-MLPs: there were far more correctly-pronounced than mispronounced instances, so during training the correctly-pronounced instances were down-sampled (per phoneme) to match the number of mispronounced ones (see the sketch below)
$C_p = \mathrm{Count}_p\{w_{1,p} > 0,\ w_{2,p} > 0\}$
UPP: trained on ASTMIC + TIMIT; the MLP's training targets are all Mandarin and English monophones, 73 phonemes in total (35 + 38 monophones)
Hierarchical MLPs in the EP classifier: the weights of the binary-MLP and the EP-MLP ($w_{1,p}$ and $w_{2,p}$) are set per phoneme; both share the same training set, and both take the 4 preceding and 4 following frames as input features
The number of MLP hidden nodes is chosen on the dev set by minimizing the frame-level FAR + FRR (for the binary MLP) and the frame-level DER (for the EP-MLP); after training, the two are cascaded as in Fig. 7 and combined with the graph of Fig. 6; the two weights are tuned on the dev set by minimizing the segment-level AER, and they are allowed to be set to 0 when an MLP is insufficiently trained; finally, the threshold for each phoneme $p$ is also tuned on the dev set
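A minimal sketch of the class balancing for the per-phoneme binary MLPs: randomly down-sample the correctly-pronounced instances to match the number of mispronounced ones.

```python
import numpy as np

def downsample_correct(X, y, rng=np.random.default_rng(0)):
    """y == 1 for correctly pronounced, 0 for mispronounced frames."""
    wrong = np.flatnonzero(y == 0)              # keep all mispronounced
    correct = np.flatnonzero(y == 1)
    keep = rng.choice(correct, size=len(wrong), replace=False)
    idx = np.concatenate([wrong, keep])
    return X[idx], y[idx]                       # balanced training set
```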

18 To understand how well the EP classifiers were trained and selected, we further counted the number $C_p$ of phonemes whose EP classifiers were actually activated, and evaluated the average number of EPs, $N^*_p$, of those phonemes with activated EP classifiers
[Left] Of the 39 phonemes, only 12 had their EP classifiers activated; among these activated classifiers, the number of EPs is higher than the 3.9 average, which suggests the other 27 EP classifiers were too poorly trained, leaving the 12 activated classifiers to capture most of the confusable EP groups
[Right table] b, a, o, i have higher FAR and lower FRR; a possible reason is that these sounds occur very frequently and exist in many languages, so their mispronunciations are comparatively rare and hard to distinguish
Moreover, the DER of b, a, o is 0%, because our system indeed produced no erroneous diagnoses for them: their EPs did not pass the confidence verification threshold, so no diagnostic feedback was given; this shows that our confidence verification can protect learners from receiving erroneous diagnostic results
Looking next at the relative improvements, the phones zh, s, iu improve substantially and their DER decreases significantly, which shows the EP classifiers are very effective

19 Experiment setup: K-means
Features compared:
1) MFCC (39 parameters, c0 to c12 plus first and second derivatives);
2) UPP (73 posteriors for 73 Mandarin/English mono-phones);
3) Logarithm of UPP (log-UPP);
4) Principal component analysis (PCA) transformed log-UPP (PCA-log-UPP); for PCA we retained 95% of the total variance
Three choices for the number of sub-segments $M_p$: 1, $M_{opt}$, $M_{max}$
K-means clustering results (see the sketch below); units: Rand index
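A minimal sketch of the K-means setting: cluster one phoneme's segment-level features with the number of clusters set to that phoneme's known number of EPs, $N_p$; the feature matrix here is a random placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
O_p = rng.normal(size=(120, 219))   # placeholder segment-level features
N_p = 4                             # assumed number of EPs for phoneme p
labels = KMeans(n_clusters=N_p, n_init=10, random_state=0).fit_predict(O_p)
# `labels` assigns each mispronounced segment to one discovered EP cluster.
```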

20 Experiment: GMM-MDL
The table shows the ARI results using GMM-MDL with an automatically estimated number of EPs for each phoneme
The results are worse than K-means, possibly because of the lack of expert knowledge about the number of EPs
In other words, with UPP or its variants the machine is able to perform slightly finer clustering, splitting some patterns with only subtle differences that human experts may consider the same
In contrast, MFCC resulted in a lower number of clusters; this further shows the superior discriminating power of UPP in discovering EPs (a model-selection sketch follows below)
Units: Rand index
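A minimal sketch of selecting the number of EPs per phoneme by a description-length criterion; BIC is used here as a stand-in closely related to the MDL objective (log-likelihood traded against a model-complexity penalty), not the paper's exact Eq. (17).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_eps(O_p, max_k=8):
    """Fit GMMs with 1..max_k components; keep the lowest-BIC model."""
    best_k, best_gmm, best_bic = None, None, np.inf
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(O_p)
        bic = gmm.bic(O_p)          # lower BIC = better fit/complexity trade-off
        if bic < best_bic:
            best_k, best_gmm, best_bic = k, gmm, bic
    return best_k, best_gmm

k, gmm = select_num_eps(np.random.default_rng(3).normal(size=(150, 10)))
```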

21 Automatically Discovered EPs & analysis
We try to analyze a typical set of examples of automatically discovered EPs in the log-UPP space
Next, we calculate the displacement in each dimension of log-UPP for each Chinese or English phoneme p
Note the displacements are evaluated in each dimension of the log-UPP space, while each dimension of log-UPP represents the log-posterior probability of the input frame with respect to a certain Chinese or English phoneme
The per-dimension displacement between two log-UPP centroids is therefore the log of the ratio of posteriors between the correct and the erroneous pronunciation (see the sketch below)
The equation at the bottom assumes a prior probability of 1 for each Gaussian
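A minimal sketch of the displacement analysis: the per-dimension difference between the EP-cluster centroid and the canonical centroid in log-UPP space equals the log posterior ratio with respect to each of the 73 Chinese/English phonemes, showing which phonemes the EP drifts toward; the b_010 interpretation below mirrors the figure discussion.

```python
import numpy as np

def centroid_displacement(logupp_ep, logupp_canonical):
    """Each input: (N, 73) log-UPP frames; returns a (73,) displacement."""
    return logupp_ep.mean(0) - logupp_canonical.mean(0)

# Large positive entries mark phonemes whose posterior grows in the EP,
# e.g. a shift toward Mandarin /p/ for the EP labeled b_010.
```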

22 [Figure] Three automatically discovered EPs of the Mandarin phoneme /b/
Leftmost: the teachers' label is b_010, meaning it was pronounced like Mandarin /p/; this cluster's purity is very high
Middle: the proportion of blue decreases while the other two rise, corresponding to b_020 (the precise meaning of this label is unclear); the rightmost cluster is closely related to b_020
Then, considering the displacement between the log-UPP centroids of the canonical pronunciation and the EP, we visualize the displacements across all dimensions; each bar is one dimension, and darker means a higher value

23 Conclusion
In this paper we consider both supervised detection and unsupervised discovery of pronunciation EPs in computer-assisted language learning
We propose new frameworks for both supervised detection and unsupervised discovery of pronunciation EPs, with empirical analysis over different approaches
Supervised EP detection:
We integrate the scores from both HMM-based EP models and MLP-based EP classifiers with a two-pass Viterbi decoding architecture
We use EP classifiers in Viterbi decoding to encompass different aspects of EP detection, while maintaining flexibility for fine tuning
Unsupervised EP discovery:
We use the hierarchical agglomerative clustering (HAC) algorithm to divide the speech segments corresponding to a phoneme into sub-segments
In both tasks:
We use the universal phoneme posteriorgram (UPP), derived from a multi-layer perceptron (MLP) trained on corpora of mixed languages, as a set of very useful features to reduce speaker variation while maintaining pronunciation variation across speech frames

