Introduction to Digital Speech Processing (數位語音處理概論)
4.0 More about Hidden Markov Models
References for 4.0: 1. Sections 6.1-6.6, Rabiner and Juang; 2. Section 4.4.1 of Huang
Instructor: Prof. Lin-shan Lee, Department of Electrical Engineering, National Taiwan University
Unless otherwise noted, this work is released under the Creative Commons "Attribution-NonCommercial-ShareAlike" 3.0 Taiwan license.
Markov Model
Markov Model (Markov Chain): a first-order Markov chain of N states is a triplet (S, A, π)
- S is a set of N states
- A is the N×N matrix of state transition probabilities, a_ij = P(q_t = j | q_{t-1} = i), with the first-order property P(q_t = j | q_{t-1} = i, q_{t-2} = k, ……) = P(q_t = j | q_{t-1} = i)
- π is the vector of initial state probabilities, π_j = P(q_0 = j)
The output for any given state is an observable event (deterministic), so the output of the process is a sequence of observable events.
Figure: a Markov chain with 5 states (labeled S1 to S5) with state transitions.
Markov Model
An example: a 3-state Markov chain λ (figure: states s1, s2, s3 emitting A, B, C, with transition probabilities including 0.6, 0.7, 0.3, 0.2, 0.1)
- State 1 generates symbol A only, State 2 generates symbol B only, and State 3 generates symbol C only.
- Given a sequence of observed symbols O = {CABBCABC}, the only corresponding state sequence is {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is
P(O|λ) = P(q_0 = S3) P(S1|S3) P(S2|S1) P(S2|S2) P(S3|S2) P(S1|S3) P(S2|S1) P(S3|S2)
       = 0.1 × 0.3 × 0.3 × 0.7 × 0.2 × 0.3 × 0.3 × 0.2 = 0.00002268
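A minimal sketch of the computation above, assuming the transition probabilities quoted in the example; the remaining entries of π and A, and the function name markov_sequence_prob, are illustrative assumptions only:

```python
import numpy as np

# Hypothetical encoding of the 3-state Markov chain (state k deterministically emits symbol k).
# Only P(q0=S3)=0.1 and the transitions used in the example are given; other entries are assumed.
pi = np.array([0.4, 0.5, 0.1])            # P(q0=S1), P(q0=S2), P(q0=S3)
A = np.array([[0.6, 0.3, 0.1],            # row i: transition probabilities out of state Si
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]])

def markov_sequence_prob(states, pi, A):
    """P(q0, q1, ..., qT | lambda) for a fully observed state sequence (0-indexed states)."""
    p = pi[states[0]]
    for prev, curr in zip(states[:-1], states[1:]):
        p *= A[prev, curr]
    return p

# O = {CABBCABC}  ->  q = {S3 S1 S2 S2 S3 S1 S2 S3}
q = [2, 0, 1, 1, 2, 0, 1, 2]
print(markov_sequence_prob(q, pi, A))     # 0.1*0.3*0.3*0.7*0.2*0.3*0.3*0.2 = 2.268e-05
```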
Hidden Markov Model
HMM: an extended version of the Markov model
- The observation is a probabilistic function (discrete or continuous) of a state, instead of a one-to-one correspondence with a state.
- The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden).
What is hidden? The state sequence: given the observation sequence, we never know which state sequence generated it.
Elements of an HMM {S, A, B, π}
- S is a set of N states
- A is the N×N matrix of state transition probabilities
- B is a set of N probability functions, each describing the observation probability with respect to a state
- π is the vector of initial state probabilities
Simplified HMM (figure): 3 states, each emitting observable symbols probabilistically, producing an observation sequence such as RGBGGBBGRRR……
Hidden Markov Model
Two types of HMMs according to the observation functions:
- Discrete and finite observations: the observations that all distinct states generate are finite in number, V = {v_1, v_2, v_3, ……, v_M}, v_k ∈ R^D. The set of observation probability distributions B = {b_j(v_k)} is defined as b_j(v_k) = P(o_t = v_k | q_t = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N (o_t: observation at time t, q_t: state at time t). For state j, b_j(v_k) consists of only M probability values.
- Continuous and infinite observations: the observations that all distinct states generate are infinite and continuous, V = {v | v ∈ R^D}. The set of observation probability distributions B = {b_j(v)} is defined as b_j(v) = P(o_t = v | q_t = j), 1 ≤ j ≤ N. b_j(v) is a continuous probability density function and is often assumed to be a mixture of Gaussian distributions (see the sketch below).
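A minimal sketch of how such a Gaussian-mixture b_j(v) could be evaluated; the function name, the 2-mixture parameters, and the test vector are hypothetical:

```python
import numpy as np

def gaussian_mixture_pdf(o, weights, means, covs):
    """b_j(o) for one state j: a mixture of D-dimensional Gaussians.

    weights: (M,)      mixture weights c_jk, summing to 1
    means:   (M, D)    mean vectors mu_jk
    covs:    (M, D, D) covariance matrices Sigma_jk
    """
    o = np.asarray(o, dtype=float)
    D = o.shape[0]
    total = 0.0
    for c, mu, cov in zip(weights, means, covs):
        diff = o - mu
        norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))
        expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
        total += c * norm * np.exp(expo)
    return total

# Hypothetical 2-mixture, 2-dimensional example
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])
print(gaussian_mixture_pdf([1.0, 0.5], weights, means, covs))
```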
Hidden Markov Model
An example: a 3-state discrete HMM λ (figure: states s1, s2, s3 with transition probabilities including 0.6, 0.7, 0.3, 0.2, 0.1 and observation probabilities {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1})
Given a sequence of observations O = {ABC}, there are 27 possible corresponding state sequences, and therefore the corresponding probability is the sum over all of them: P(O|λ) = Σ_{all q} P(O, q|λ).
Hidden Markov Model
Three basic problems for HMMs. Given an observation sequence O = (o_1, o_2, ….., o_T) and an HMM λ = (A, B, π):
- Problem 1: How to efficiently compute P(O|λ)? — the Evaluation problem
- Problem 2: How to choose an optimal state sequence q = (q_1, q_2, ……, q_T)? — the Decoding problem
- Problem 3: Given some observations O for the HMM λ, how to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? — the Learning/Training problem
Basic Problem 1 for HMM
λ = (A, B, π); O = o_1 o_2 …… o_t …… o_T: observation sequence; q = q_1 q_2 …… q_t …… q_T: state sequence
- Problem 1: Given λ and O, find P(O|λ) = Prob[observing O given λ]
- Direct evaluation: considering all possible state sequences q,
P(O|λ) = Σ_{all q} P(O|q, λ) P(q|λ) = Σ_{all q} [b_{q_1}(o_1) b_{q_2}(o_2) …… b_{q_T}(o_T)] [π_{q_1} a_{q_1 q_2} a_{q_2 q_3} …… a_{q_{T-1} q_T}]
Total number of different q: N^T — huge computation requirements.
Basic Problem 1
Figure: trellis diagram — states i = 1, 2, 3, …, N on the vertical axis versus time t = 1, 2, 3, …, T on the horizontal axis, with the observations o_1, o_2, o_3, …, o_t, …, o_T along the time axis.
Basic Problem 1
Define the forward variable α_t(i) = P(o_1 o_2 …… o_t, q_t = i | λ): the probability of having observed o_1 … o_t and being in state i at time t.
Basic Problem 1
Forward algorithm (induction step; a minimal sketch follows):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}),  1 ≤ j ≤ N,  1 ≤ t ≤ T−1
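A minimal sketch of the forward algorithm for a discrete HMM, assuming π, A, B are given as NumPy arrays; the function name and array layout are illustrative choices:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm for a discrete HMM.

    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities a_ij
    B:   (N, M) observation probabilities b_j(v_k)
    obs: length-T list of observation indices
    Returns alpha of shape (T, N), the forward variables.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization
    for t in range(T - 1):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

# P(O | lambda) = sum_i alpha_T(i):  forward(pi, A, B, obs)[-1].sum()
```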
Basic Problem 2
Define the backward variable β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ): the probability of the remaining observations o_{t+1} … o_T given state i at time t.
Basic Problem 2
Backward algorithm (induction step; a matching sketch follows):
β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T−1, T−2, …, 2, 1,  1 ≤ i ≤ N
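A matching sketch of the backward recursion, under the same assumed array layout as the forward() sketch above:

```python
import numpy as np

def backward(A, B, obs):
    """Backward algorithm: beta[t, i] is the backward variable for state i at time t (0-indexed)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```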
Basic Problem 2
P(O, q_t = i | λ) = Prob[observing o_1, o_2, …, o_t, …, o_T, q_t = i | λ] = α_t(i) β_t(i)
Derivation: let A: (o_1 o_2 ⋯ o_t), B: (o_{t+1}, o_{t+2}, ⋯, o_T), C: (q_t = i). Then P(A, B, C) = P(A, C) P(B | A, C); since B is independent of A given C, P(B | A, C) = P(B | C). Here P(A, C | λ) = α_t(i) and P(B | C, λ) = β_t(i), so P(O, q_t = i | λ) = α_t(i) β_t(i).
Basic Problem 2
P(O|λ) = Σ_{i=1}^{N} P(O, q_t = i | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)  for any t; in particular, P(O|λ) = Σ_{i=1}^{N} α_T(i).
Basic Problem 2 for HMM
- Approach 1: choosing the state q_t* individually as the most likely state at time t
- Define a new variable γ_t(i) = P(q_t = i | O, λ) = P(O, q_t = i | λ) / P(O|λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)
- Solution: q_t* = arg max_{1 ≤ i ≤ N} [γ_t(i)], 1 ≤ t ≤ T; in fact this maximizes P(O, q_t = i | λ) = α_t(i) β_t(i)
- Problem: maximizing the probability at each time t individually, q* = q_1* q_2* … q_T* may not be a valid state sequence (e.g. a_{q_t* q_{t+1}*} = 0)
Viterbi Algorithm
Define δ_t(i) = max_{q_1, q_2, …, q_{t-1}} P[q_1, q_2, …, q_{t-1}, q_t = i, o_1, o_2, …, o_t | λ]
Induction: δ_{t+1}(j) = max_i [δ_t(i) a_ij] b_j(o_{t+1})
Viterbi Algorithm
Path backtracking: for each state j and time t, keep the predecessor state that maximized δ_t(j); the best state sequence is recovered by tracing these back-pointers from the best final state (a sketch follows).
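A minimal sketch of the Viterbi recursion with back-pointers, again assuming the discrete-HMM array layout used above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns the single best state sequence and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                # back-pointers for path backtracking
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # delta_t(j) = max_i [delta_{t-1}(i) a_ij] * b_j(o_t)
        scores = delta[t - 1, :, None] * A           # (N, N): score of transition i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[T - 1].max()
```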
Basic Problem 2 for HMM
- Application example of the Viterbi algorithm: isolated word recognition. For each word model λ_i, 1 ≤ i ≤ n, compute the probability of the observation sequence O and choose the word whose model gives the highest score (a usage sketch follows):
  - Basic Problem 1 — Forward algorithm (for all paths)
  - Basic Problem 2 — Viterbi algorithm (for a single best path)
- The model with the highest probability for the most probable path usually also has the highest probability over all possible paths.
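A hedged usage sketch of such an isolated word recognizer, reusing the forward() (or viterbi()) sketch above; the dictionary layout of models is an assumption:

```python
def recognize(models, obs):
    """Pick the word whose HMM gives the highest score for the observation sequence.

    models: dict mapping word -> (pi, A, B) trained HMM parameters (assumed available)
    obs:    sequence of observation indices
    Uses forward() (all paths); viterbi(...)[1] (single best path) usually ranks the same.
    """
    scores = {word: forward(pi, A, B, obs)[-1].sum() for word, (pi, A, B) in models.items()}
    return max(scores, key=scores.get), scores
```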
Basic Problem 3
Figure: one transition in the trellis from state i at time t to state j at time t+1, contributing the term α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j).
Basic Problem 3
γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} [α_t(i) β_t(i)] = P(O, q_t = i | λ) / P(O|λ) = P(q_t = i | O, λ)
ε_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) = P(O, q_t = i, q_{t+1} = j | λ) / P(O|λ) = P(q_t = i, q_{t+1} = j | O, λ)
Basic Problem 3
Re-estimation of the transition probabilities:
â_ij = [Σ_{t=1}^{T−1} ε_t(i, j) / (T−1)] / [Σ_{t=1}^{T−1} γ_t(i) / (T−1)] = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
Numerical example from the figure: with T−1 = 69 frames, â_ij = (1.95/69) / (10.59/69) ≈ 0.18.
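A minimal sketch of one Baum-Welch re-estimation iteration for a discrete HMM, computing γ and ε from the forward()/backward() sketches above; the function name and the single-utterance setting are simplifying assumptions:

```python
import numpy as np

def reestimate(pi, A, B, obs):
    """One Baum-Welch iteration for a discrete HMM on a single observation sequence."""
    T, N = len(obs), len(pi)
    M = B.shape[1]
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    prob = alpha[-1].sum()                                     # P(O | lambda)

    gamma = alpha * beta / prob                                # gamma[t, i] = P(q_t = i | O, lambda)
    # eps[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda), for t = 1 .. T-1
    eps = (alpha[:-1, :, None] * A[None, :, :] *
           B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / prob

    new_pi = gamma[0]
    new_A = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # expected i->j transitions / from i
    new_B = np.zeros((N, M))
    for k in range(M):
        mask = (np.asarray(obs) == k)                          # frames where o_t = v_k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```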
Basic Problem 3
For continuous observations, recall for a probability density function f_X(x):
∫_{−∞}^{∞} x f_X(x) dx = x̄ (mean),  ∫_{−∞}^{∞} (x − x̄)² f_X(x) dx = σ_x² (variance)
For mixture k of state j, the corresponding vector quantities are the mean vector μ_jk and the covariance matrix built from the outer product
(o_t − μ_jk)(o_t − μ_jk)^T, where (o_t − μ_jk) = [o_t(1) − μ_jk(1), o_t(2) − μ_jk(2), ⋯, o_t(D) − μ_jk(D)]^T,
accumulated over time with weights given by the probability of being in state j with mixture k at time t.
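A minimal sketch of that weighted mean/covariance update for a single mixture component, assuming the per-frame weights γ_t(j,k) have already been computed; the function name and the random sanity check are illustrative:

```python
import numpy as np

def update_gaussian(obs_vectors, gamma_jk):
    """Weighted re-estimation of one mixture component (mean vector and covariance matrix).

    obs_vectors: (T, D) observation vectors o_t
    gamma_jk:    (T,)   weights gamma_t(j, k) for this state j and mixture k (assumed given)
    """
    w = gamma_jk / gamma_jk.sum()
    mu = w @ obs_vectors                        # mu_jk = sum_t gamma_t(j,k) o_t / sum_t gamma_t(j,k)
    diff = obs_vectors - mu                     # (T, D)
    sigma = (w[:, None] * diff).T @ diff        # Sigma_jk = weighted sum of (o_t-mu)(o_t-mu)^T
    return mu, sigma

# Sanity check with random data: uniform weights reduce to the sample mean and covariance.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
print(update_gaussian(X, np.ones(50)))
```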
Basic Problem 3 for HMM
- No closed-form solution; the solution is approximated iteratively
- An initial model is needed — model initialization
- The iteration may converge to a local optimum rather than the global optimum, depending heavily on the initialization
Model training:
- Model initialization: Segmental K-means
- Model re-estimation: Baum-Welch
Basic Problem 3
Figure: P(O|λ) as a function of the model parameters, showing a global optimum and local optima.
Vector Quantization (VQ)
- An efficient approach for data compression: replacing a set of real numbers by a finite number of bits
- An efficient approach for clustering a large number of sample vectors: grouping sample vectors into clusters, each represented by a single vector (codeword)
Scalar quantization: replacing a single real number by an R-bit pattern, a mapping relation
- S = [−A, A] is divided into L intervals J_k with boundaries −A = m_0 < m_1 < ⋯ < m_L = A, each with a representative value v_k ∈ J_k; V = {v_1, v_2, …, v_L}
- Q: S → V, Q(x[n]) = v_k if x[n] ∈ J_k; L = 2^R, each v_k represented by an R-bit pattern
- Quantization characteristics (codebook): {J_1, J_2, …, J_L} and {v_1, v_2, …, v_L}, designed considering at least the error sensitivity and the probability distribution of x[n]
Vector Quantization
Figure: scalar quantization (Pulse Code Modulation, PCM) — each interval J_k is mapped to a 3-bit pattern (000, 001, …, 111) for its representative value v_k, introducing quantization error.
Vector Quantization
Figure: probability distribution P_x(x) of the signal samples.
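A minimal sketch of a scalar quantizer built from interval boundaries and representative values; the uniform 3-bit example codebook is an assumption for illustration:

```python
import numpy as np

def scalar_quantize(x, boundaries, codewords):
    """Map each sample to the codeword of the interval J_k it falls in.

    boundaries: m_0 = -A < m_1 < ... < m_L = A (length L+1)
    codewords:  v_1 ... v_L (length L), each representable by R = log2(L) bits
    """
    idx = np.clip(np.searchsorted(boundaries, x, side="right") - 1, 0, len(codewords) - 1)
    return codewords[idx], idx                     # quantized values and R-bit code indices

# Hypothetical uniform 3-bit (L = 8) quantizer on [-1, 1]
A, L = 1.0, 8
boundaries = np.linspace(-A, A, L + 1)
codewords = (boundaries[:-1] + boundaries[1:]) / 2 # interval midpoints as representatives
x = np.array([-0.95, -0.1, 0.33, 0.8])
print(scalar_quantize(x, boundaries, codewords))
```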
Vector Quantization (VQ)
2-dim vector quantization example: x_n = (x[n], x[n+1])
- S = {x_n = (x[n], x[n+1]); |x[n]| < A, |x[n+1]| < A}
- VQ: S is divided into L 2-dim regions {J_1, J_2, …, J_k, …, J_L}, each with a representative vector v_k ∈ J_k; V = {v_1, v_2, …, v_L}
- Q: S → V, Q(x_n) = v_k if x_n ∈ J_k; L = 2^R, each v_k represented by an R-bit pattern
Considerations:
- error sensitivity may depend on x[n], x[n+1] jointly
- the distribution of x[n], x[n+1] may be statistically correlated
- more flexible choice of J_k
Quantization characteristics (codebook): {J_1, J_2, …, J_L} and {v_1, v_2, …, v_L}
Vector Quantization
Figure: the 2-dim space {|x[n]| < A, |x[n+1]| < A} partitioned into regions J_k.
Vector Quantization
Figure: quantizing the two samples separately with 8 bits each gives (256)² = (2⁸)² = 2¹⁶ possible pairs, while a 2-dim VQ codebook of 1024 = 2¹⁰ codewords v_k (regions J_k) represents each pair with only 10 bits.
Vector Quantization (VQ)
N-dim vector quantization: x = (x_1, x_2, …, x_N)
- S = {x = (x_1, x_2, …, x_N), |x_k| < A, k = 1, 2, …, N}
- V = {v_1, v_2, …, v_L}, Q: S → V, Q(x) = v_k if x ∈ J_k; L = 2^R, each v_k represented by an R-bit pattern
- Codebook trained with a large training set
- Define a distance measure between two vectors x, y: d(x, y): S×S → R⁺ (non-negative real numbers)
- Desired properties:
  d(x, y) ≥ 0, d(x, x) = 0
  d(x, y) = d(y, x)
  d(x, y) + d(y, z) ≥ d(x, z)
- Examples:
  d(x, y) = Σ_i (x_i − y_i)²
  d(x, y) = Σ_i |x_i − y_i|
  d(x, y) = (x − y)^T Σ⁻¹ (x − y)  (Mahalanobis distance, Σ: covariance matrix)
Distance Measures
d(x, y) = Σ_i |x_i − y_i|  — city block distance
d(x, y) = (x − y)^T Σ⁻¹ (x − y)  — Mahalanobis distance
- Σ = I (identity matrix): d(x, y) = Σ_i (x_i − y_i)², the (squared) Euclidean distance
- Σ = diag(σ_1², σ_2², ⋯, σ_n²): d(x, y) = Σ_i (x_i − y_i)² / σ_i², the variance-weighted Euclidean distance
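The three distance measures above, written out as a small sketch (the function names are illustrative):

```python
import numpy as np

def euclidean(x, y):
    return np.sum((x - y) ** 2)                # sum_i (x_i - y_i)^2

def city_block(x, y):
    return np.sum(np.abs(x - y))               # sum_i |x_i - y_i|

def mahalanobis(x, y, cov):
    diff = x - y
    return diff @ np.linalg.inv(cov) @ diff    # (x - y)^T Sigma^{-1} (x - y)

# With Sigma = I the Mahalanobis distance reduces to the squared Euclidean distance;
# with Sigma = diag(sigma_1^2, ..., sigma_n^2) it becomes the variance-weighted Euclidean distance.
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(euclidean(x, y), city_block(x, y), mahalanobis(x, y, np.diag([4.0, 1.0])))
```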
Vector Quantization (VQ)
K-means algorithm / Lloyd-Max algorithm: an iterative procedure to obtain the codebook from a large training set by alternating two steps (a sketch follows) —
(1) with {v_1, v_2, …, v_L} fixed, find the best set of regions {J_1, J_2, …, J_L}: J_k = {x | d(x, v_k) < d(x, v_j), ∀ j ≠ k}, so that D = Σ_x d(x, Q(x)) is minimized — nearest-neighbor condition
(2) with {J_1, J_2, …, J_L} fixed, find the best {v_1, v_2, …, v_L}: for each k, choose v_k so that D_k = Σ_{x ∈ J_k} d(x, v_k) is minimized — centroid condition
(3) convergence condition: D = Σ_{k=1}^{L} D_k; after each iteration D is reduced, but D ≥ 0, so iterate until |D(m+1) − D(m)| < ε, where m is the iteration index
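A minimal sketch of the K-means / Lloyd-Max iteration with the squared Euclidean distance; the function name, the fixed iteration cap, and the convergence threshold are illustrative choices:

```python
import numpy as np

def train_codebook(data, codebook, n_iter=20, eps=1e-6):
    """K-means / Lloyd-Max iteration: alternate nearest-neighbor and centroid conditions."""
    prev_D = np.inf
    for _ in range(n_iter):
        # (1) nearest-neighbor condition: assign each x to its closest codeword
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # (n, L) squared distances
        labels = d2.argmin(axis=1)
        D = d2[np.arange(len(data)), labels].sum()                         # total distortion
        # (2) centroid condition: each codeword becomes the mean of its region J_k
        for k in range(len(codebook)):
            if np.any(labels == k):
                codebook[k] = data[labels == k].mean(axis=0)
        # (3) convergence check on the distortion decrease
        if abs(prev_D - D) < eps:
            break
        prev_D = D
    return codebook, labels
```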
Vector Quantization (VQ)
The K-means algorithm may converge to local optimal solutions, depending on the initial conditions; the result is not unique in general.
Training the VQ codebook in stages — the LBG algorithm:
- step 1: Initialization. L = 1, train a 1-vector VQ codebook
- step 2: Splitting. Split the L codewords into 2L codewords, L ← 2L (example 1: perturb each codeword slightly; example 2: split toward the training vector farthest apart)
- step 3: K-means algorithm, to obtain the L-vector codebook
- step 4: Termination. Stop when L reaches the desired codebook size; otherwise go to step 2
Usually converges to a better codebook.
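A minimal sketch of the LBG splitting procedure, reusing the train_codebook() sketch above; the perturbation factor is an assumed value, and the target codebook size is assumed to be a power of two:

```python
import numpy as np

def lbg(data, target_L, perturb=1e-3):
    """LBG: start from a 1-vector codebook and split until the target size is reached."""
    codebook = data.mean(axis=0, keepdims=True)       # step 1: L = 1, centroid of all data
    while len(codebook) < target_L:
        # step 2: split each codeword into two slightly perturbed copies (L -> 2L)
        codebook = np.concatenate([codebook * (1 + perturb), codebook * (1 - perturb)])
        # step 3: refine with K-means (train_codebook() from the sketch above)
        codebook, _ = train_codebook(data, codebook)
    return codebook                                   # step 4: terminate when L = target_L
```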
Initialization in HMM Training
An often used approach — Segmental K-means:
- Assume an initial estimate of all model parameters (e.g. estimated by segmenting the training utterances into states of equal length), for a discrete density HMM or a continuous density HMM (M Gaussian mixtures per state)
- Step 1: re-segment the training observation sequences into states based on the current model, using the Viterbi algorithm
- Step 2: re-estimate the model parameters (in the same way as the initial estimation)
- Step 3: evaluate the model score P(O̅|λ); if the difference between the previous and current model scores exceeds a threshold, go back to Step 1; otherwise stop — the initial model is obtained (a sketch of the equal-length segmentation follows)
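A minimal sketch of the equal-length initial segmentation step, assuming each training utterance is a (T, D) feature matrix; the subsequent Viterbi re-segmentation loop is only indicated in comments:

```python
import numpy as np

def equal_length_segmentation(utterances, n_states):
    """Initial estimate: cut every training utterance into n_states equal-length pieces
    and pool the frames assigned to each state."""
    state_frames = [[] for _ in range(n_states)]
    for utt in utterances:                            # utt: (T, D) feature matrix
        bounds = np.linspace(0, len(utt), n_states + 1).astype(int)
        for j in range(n_states):
            state_frames[j].append(utt[bounds[j]:bounds[j + 1]])
    return [np.concatenate(frames) for frames in state_frames]

# The pooled frames of each state are then clustered (e.g. with the lbg() sketch above) into
# M Gaussian mixtures; afterwards the utterances are re-segmented by Viterbi and the per-state
# statistics re-estimated, repeating until the model score stops improving.
```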
Initialization in HMM Training
An example for a continuous HMM with 3 states and 4 Gaussian mixtures per state (figure): the training observations O_1, O_2, …, O_N are segmented into states s_1, s_2, s_3; for each state, the pooled observations are split from the global mean into cluster means by LBG and refined by K-means, giving the four mixture parameter sets {μ_j1, Σ_j1, c_j1}, {μ_j2, Σ_j2, c_j2}, {μ_j3, Σ_j3, c_j3}, {μ_j4, Σ_j4, c_j4} (mean, covariance, weight) for each state j.
Initialization in HMM Training
An example for a discrete HMM with 3 states and 2 codewords (figure): the 10 training observations O_1, …, O_10 are segmented into states s_1, s_2, s_3, and each observation is mapped to codeword v_1 or v_2; counting codeword occurrences within each state gives
b_1(v_1) = 3/4, b_1(v_2) = 1/4
b_2(v_1) = 1/3, b_2(v_2) = 2/3
b_3(v_1) = 2/3, b_3(v_2) = 1/3
(a counting sketch follows)
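A small counting sketch that reproduces the probabilities above; the per-frame state and codeword labels are hypothetical but chosen to be consistent with the figure's numbers:

```python
import numpy as np

# Hypothetical labels: 10 observations segmented into 3 states, each observation already
# mapped to codeword v1 or v2 (indices 0 and 1).
states    = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]      # s1 s1 s1 s1 s2 s2 s2 s3 s3 s3
codewords = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]      # v1 v1 v1 v2 v1 v2 v2 v1 v1 v2

counts = np.zeros((3, 2))
for s, v in zip(states, codewords):
    counts[s, v] += 1
B = counts / counts.sum(axis=1, keepdims=True)  # b_j(v_k) = (# of v_k in state j) / (# frames in state j)
print(B)   # rows: [3/4, 1/4], [1/3, 2/3], [2/3, 1/3], matching the example
```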
Copyright Notice
- Figures adapted from: Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Chap. 6, Sec. 6.2 Discrete-Time Markov Processes, p. 323, Prentice-Hall International, Inc.
- All other figures and text: Prof. Lin-shan Lee, Department of Electrical Engineering, National Taiwan University; licensed under the Creative Commons "Attribution-NonCommercial-ShareAlike 3.0 Taiwan" license.