Tree Pattern Matching to Subset Matching in Linear Time R. Cole and R. Hariharan
Tree Pattern Matching Input: An ordered binary tree T, |T| = n. An ordered binary tree P, |P| = m. Output: All nodes in T where P matches. p t
Subset Matching Input: A set-string T and a set-string P . Output: All occurrences of P in T. a b c a c b c a c c e f b f b T = a c c b P =
History Hoffman and O’Donell, 1982, O(nm). Kosaraju, 1989, O(nm0.75logm). Dubiner, Galil, and Magen, 1994,O(nm0.5logm). Cole and Hariharan, 1997, randomized O(nlog3m). Indyk, 1998, randomized O(nlogn). Cole, Hariharan, and Indyk, 1999, O(nlog3m). Cole and Hariharan, 2002, O(nlog2m).
Period Def : The period of a string s is the smallest number j such that s[i]=s[i+j]. S = 0 0 1 0 0 1 0 0 1 0 0 1 j = 3
非正式用語 (1) 後面的投影片如果說週期為 θ,意思是以 θ `` 開頭 ”,並且週期為 | θ |。 | θ | 有時會省略為 θ 。 Let θ = 0 0 1, |θ| = 3. S = 0 0 1 0 0 1 0 0 1 0 0 1 Yes θ S = 0 1 0 0 1 0 0 1 0 0 1 0 No
Classical Lemma (1) s: a string with period θ s = 把s切兩半,如果切的地方距離開頭不是 θ的整數倍, 則後面那一半的開頭不會是θ。 Ex: s = 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1
Period in linear time Exercise 1: Design an algorithm to compute the period in linear time.
θ-Path Def: A path p is a θ-path if its string representation has period θ. Let θ = 0 0 1. p s = 0 0 1 0 0 1 p is a θ-path.
Maximalθ-Path Def: A θ-path p is maximal if it can not be extended. θ = 0 0 1. not not
Maximalθ-Paths in linear time Exercise 2: Design an linear time algorithm to find all maximalθ-Paths in a tree.
非正式用語 (2) 一個node的大小 = 以這個node為root的subtree的大小。 7 4 2 1 2 1 1
Centroid of a tree m = 19 m/2 = 9
Spine of a Tree Spine = centroid 加強版
Spine of a Tree 1 1 1 Link node = Centroid 的最後一個node 1 1 1 Link node = Centroid 的最後一個node = Spine上最後一個 ≥ m/2的點 < m/2 ≥ m/2
A Special Case of Tree Pattern Matching Input: An ordered binary tree T, |T| = n. An ordered binary tree P, |P| = m. Output: All nodes in T where P matches Additional constraint: T has only one maximal θ-path, where θ is the period of the spine of P.
A Special Case of Tree Pattern Matching
Reduce to Subset Matching (1) P 1 1 B2 B6
Reduce to Subset Matching (2) f a b c e 1 1 1 a b d e f
Reduce to Subset Matching (3) 肋2 1 1 1 肋5 肋6
Reduce to Subset Matching (4) 肋2 a a b e b e f c f c d a b c e f 1 1 1 Time: O( min{ m, 肋2} )
Reduce to Subset Matching (5) Total Time: ∑i O( min{m, 肋i} ) = O(n)
How about the general case? 如果找出來的 maxima θ-paths不只一條該怎麼 reduce呢? 暴力法: 對每一條maximal θ-paths都用剛才的方法reduce成subset matching problem。 Time?
Where is the intuition come from? Truncation lemma: If the first |θ| edges of each maximal θ-paths are removed, then those truncated paths are disjoint。
Truncation Lemma (1) 重疊開始的地方和u 的距離不可能是θ的整數倍。 u p p’ θ θ θ θ θ
Truncation Lemma (2) By Classical Lemma: p p’ < |θ|
Warm up is over! 接下來要做的事: Part 1: 證明 truncated maximal θ-paths 可以在linear time reduce 成 subset matching。 Part 2: 考慮被砍掉的部分該如何解決。
Step 1: Find all maximal θ-paths Link node
Step 2: Filtering(1) 把不符合以下三個property的maximal θ-pahts 過濾掉。 Property 1:
Step 2: Filtering (2) Propety 2: ∵ P ≥ m/2 ≥ m/2
Step 2: Filtering (3) Propety 3: ∵ P ≥θ ≥θ ≥ m/2 ≥ m/2
Step 3: Truncation 將過濾後每一條maximal θ-paths開頭的θ條edges去掉。
Step 4: Filtering again 把 truncated maximal θ-paths 再過濾一遍,剩下的這些paths在之後將簡稱為truncated paths.
Step 5: 一條一條reduce成 subset matching Time: ∑truncated paths ∑i O( min{m, 肋i} )
Analysis of Step 5 (1) ∑truncated paths ∑i O( min{m, 肋i} )
Analysis of Step 5 (2) ∑O( min{m, 肋} ) (大肋 = 大於或等於 m 的肋骨,小肋 = 小於m的肋骨) =∑O( min{m, 大肋} ) + ∑O( min{m, 小肋} ) = O(m * (#大肋)) + ∑O(小肋)
Analysis of Step 5 (3) O(m * (#大肋)) + ∑O(小肋) 剩下只需證明 Part 1. (#大肋) = O( n/m ) Part 2. ∑O(小肋) = n
Analysis of Step 5 (4) Part 2: ∑O(小肋) = n < m ∵小肋骨 are disjoint 大 小
Marked nodes Def: A node in t is marked if its left and right subtrees both contain ≥ m nodes.
# marked nodes is O(n/m)
# marked nodes is O(n/m) ∵(# external nodes) * m ≤ n ∴ # external nodes ≤ n/m ⇒ # marked nodes = # internal nodes ≤ n/m - 1
Analysis of Step 5 (5) Part 1. (#大肋) = O( n/m ) 一條truncated path上如果有k > 1根大肋骨, 則有k-1 個maked nodes。 大 大 大 大 大
Analysis of Step 5 (6) 擁有 k > 1根大肋骨的truncated paths上的大肋骨全部加起來是O(n/m)。 剩下的問題: 有多少條擁有 k = 1根大肋骨的truncated paths?
Analysis of Step 5 (7) O(n/m) 條 小 小 ≥ m/2 小 小 大
An observation 擁有 k > 1根大肋骨的truncated paths只有O(n/m)條。
Disjoint Lemma Let C be a set of disjoint θ-paths and these θ-paths satisfy property 1~3. Then there are O(n/m) θ- paths in C. Pf: 擁有 k > 1根大肋骨的θ-paths只有O(n/m)條。 擁有 k = 1根大肋骨的θ-paths只有O(n/m)條。 擁有 k = 0根大肋骨的θ-paths只有O(n/m)條。
Review Step 1: Finding all maximal θ-paths Step 2: Filtering Step 3: Truncation Step 4: Filtering again Step 5: Reduce to subset mathching
How about the removed parts? θ θ θ θ Time: O(m)
The Last Job Step 1: Finding all maximal θ-paths Step 2: Filtering only O(n/m) paths left. Step 3: Truncation Step 4: Filtering again Step 5: Reduce to subset mathching
Tail Lemma path的尾巴不會被其他path碰到。
Chains . . .
Chain Lemma 一條chain上 (1) 編號 1, 3, 5 , 7 , … 的paths會是disjoint。 1 1 2 2 3 3
Truncated-Chains Lemma(1) Truncated-Chains lemma: If the first two paths of each chains are removed, then those truncated chains are disjoint。
Truncated-Chains Lemma(2) ρ ρ’
Truncated-Chains Lemma(3) ρ ρ’ Case 1: ≥ θ
Truncated-Chains Lemma(4) ρ ρ’ Case 2: ≥ θ
Truncated-Chains Lemma(5) ρ ρ’ Case 3: ≥ θ
Truncated-Chains Lemma(6) ρ ρ’ Case 4: ≥ θ
Almost Over (1) Chain lemma: 在一條chain上 (1) 編號 1, 3, 5, 7 , … 的paths會是disjoint。 (2) 編號 0, 2, 4, 6 , … 的paths會是 disjoint。 Truncated-Chains lemma 去掉編號 0, 1的 paths 後的chains會是disjoint ⇒(1) 編號 3, 5, 7, … 的paths會是disjoint (2)編號 2, 4, 6 , … 的paths會是disjoint
Almost Over (2) (1)編號 3, 5, 7, … 的paths會是disjoint By Disjoint Lemma: (1)編號 3, 5, 7, … 的paths共 O(n/m)條 (2)編號 2, 4, 6, … 的paths共 O(n/m)條
Almost Over (3) If #chains = O(n/m), then 編號0與1的 paths共 O(2n/m) = O(n/m)條。
Over (1) Maximal Connected chains Maximal Connected Maximal chains
Over (2) maximal conneted chains中如果有 k > 1條 chains則有 k – 1 個 marked nodes。
Over (3) 擁有 k > 1條chains的maximal connected chains 的chains 全部加起來是O(n/m)。
Over(4) By Disjont Lemma: 擁有 k = 1條chains的maximal connected chains 共O(n/m)個。
Unproved Lemma Maximal θ-paths in linear time. (Lemma 2.2 in M. Dubiner, Z. Galil, and E. Magen. Fast Tree Pattern Matching, J. ACM, 1994. ) Chain Lemma. (Lemma 5.5 in R. Cole and R. Hariharan. Tree Pattern Matching to Subset Matching in Linear Time, SIAM J. Computing, 2003.)