Self-Attention
huitr, 2019.03.16
Motivation
Plain CNNs that simply stack convolution layers cannot capture long-range dependencies well.
The non-local operation is proposed: for every position in a feature map, take a weighted sum of the transformed features of all positions, normalize it, and use the result as the new feature of that position.
Self-attention: enhance the current position using the other positions of the same image.
Formulation
i: index of an output position; X: input feature; j: index that enumerates all possible positions.
f: computes the pairwise relationship between i and j; g: computes a representation of X at position j; C(X): normalization factor.
The non-local operation: y_i = (1 / C(X)) * Σ_j f(x_i, x_j) g(x_j).
Instantiation (choice of f): the embedded dot-product version, f(x_i, x_j) = θ(x_i)^T φ(x_j), with θ(x_i) = W_θ x_i, φ(x_j) = W_φ x_j, and C(X) = N, the number of positions.
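For concreteness, a minimal PyTorch sketch of a non-local block with the embedded dot-product relation. The module and parameter names are my own; following the paper, the embeddings and the output transform are 1x1 convolutions, the relation is normalized by the number of positions N, and a residual connection is added.

```python
import torch
import torch.nn as nn

class NonLocalDotProduct(nn.Module):
    """Embedded dot-product non-local block for 2D feature maps (sketch)."""
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2
        self.theta = nn.Conv2d(in_channels, inter_channels, 1)   # query embedding
        self.phi   = nn.Conv2d(in_channels, inter_channels, 1)   # key embedding
        self.g     = nn.Conv2d(in_channels, inter_channels, 1)   # value embedding
        self.w_z   = nn.Conv2d(inter_channels, in_channels, 1)   # output transform

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w                                          # number of positions N
        theta = self.theta(x).flatten(2).transpose(1, 2)   # b x n x c'
        phi   = self.phi(x).flatten(2)                     # b x c' x n
        g     = self.g(x).flatten(2).transpose(1, 2)       # b x n x c'
        f = theta @ phi                                    # b x n x n pairwise relations
        f = f / n                                          # normalization C(X) = N
        y = f @ g                                          # b x n x c'
        y = y.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.w_z(y)                             # residual connection
```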
Core idea: first gather key features from the entire space into a compact set, then distribute them to each location adaptively.
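A rough PyTorch sketch of this gather-then-distribute idea. The layer names, the use of 1x1 convolutions, and the exact placement of the softmaxes are my assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class DoubleAttentionSketch(nn.Module):
    """Gather-distribute ("double attention") sketch with c_n global descriptors of dim c_m."""
    def __init__(self, in_channels, c_m, c_n):
        super().__init__()
        self.conv_a   = nn.Conv2d(in_channels, c_m, 1)  # features to be gathered
        self.conv_b   = nn.Conv2d(in_channels, c_n, 1)  # gathering attention maps
        self.conv_v   = nn.Conv2d(in_channels, c_n, 1)  # distribution attention maps
        self.conv_out = nn.Conv2d(c_m, in_channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        a = self.conv_a(x).flatten(2)                        # b x c_m x s
        att_gather = self.conv_b(x).flatten(2).softmax(-1)   # b x c_n x s, sums to 1 over space
        att_dist   = self.conv_v(x).flatten(2).softmax(1)    # b x c_n x s, sums to 1 over descriptors
        # Step 1: gather c_n global descriptors, each of dimension c_m.
        global_desc = a @ att_gather.transpose(1, 2)         # b x c_m x c_n
        # Step 2: distribute the descriptors back to every location.
        z = global_desc @ att_dist                           # b x c_m x s
        z = z.reshape(b, -1, h, w)
        return x + self.conv_out(z)
```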
Method
Computational graph
Comparison (Chen Y., Rohrbach M., Yan Z., et al. Graph-Based Global Reasoning Networks, 2018)
Experiments: 5 extra A²-blocks at Res3 and Res4; 6.5 GFLOPs and 33.0 M parameters.
Experiments (Chen Y., Rohrbach M., Yan Z., et al. Graph-Based Global Reasoning Networks, 2018)
Method
A Generic Formulation of Self-Attention
X: feature maps flattened into an s × c matrix (s spatial positions, c channels).
K(X), Q(X), V(X): key, query, and value functions, implemented as linear layers; K and Q are c × b matrices, V is a c × c matrix.
S = (XK)(XQ)^T(XV).
Left associativity: S = [(XK)(XQ)^T](XV).
(XK)(XQ)^T: (s × b)(b × s) = s × s, which can be read as the similarity between every pair of spatial locations, i.e. the Non-local view.
[(XK)(XQ)^T](XV): (s × s)(s × c) = s × c.
Complexity: s·b·s + s·s·c = s²(b + c).
Right associativity: S = (XK)[(XQ)^T(XV)].
(XQ)^T(XV): (b × s)(s × c) = b × c, which can be read as b global descriptors of dimension c, i.e. the Double Attention view.
(XK)[(XQ)^T(XV)]: (s × b)(b × c) = s × c.
Complexity: b·s·c + s·b·c = 2sbc.
Since the number of spatial positions s is usually much larger than b and c, the right-associative order is far cheaper (see the numerical check below).
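A quick numerical check of this equivalence; the shapes are arbitrary, and the operation counts below are only the leading-order matrix-multiply costs.

```python
import torch

s, c, b = 1024, 64, 8   # s spatial positions, c channels, b bottleneck width (arbitrary)
X = torch.randn(s, c, dtype=torch.float64)
K = torch.randn(c, b, dtype=torch.float64)
Q = torch.randn(c, b, dtype=torch.float64)
V = torch.randn(c, c, dtype=torch.float64)

XK, XQ, XV = X @ K, X @ Q, X @ V

S_left  = (XK @ XQ.T) @ XV      # Non-local order: s x s intermediate
S_right = XK @ (XQ.T @ XV)      # Double-Attention order: b x c intermediate

print(torch.allclose(S_left, S_right))          # True: identical result
print("left  mults:", s * b * s + s * s * c)    # s^2 (b + c) = 75,497,472
print("right mults:", b * s * c + s * b * c)    # 2 s b c     =  1,048,576
```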
Framework
Experiments
Comparison with Non-local
Criss-cross attention module (figure): the input feature map H × W × C1 is reduced by 1 × 1 convolutions to query and key maps of size H × W × C2, plus a value map of size H × W × C1. For each position u: the query Q_u is a C2-dimensional vector; Ω_u, the keys on u's criss-cross path (its row and column), is (H + W - 1) × C2; Φ_u, the corresponding values, is (H + W - 1) × C1. The attention map therefore has size (H + W - 1) × H × W, and the aggregated output is again H × W × C1.
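A naive per-position sketch of the criss-cross aggregation, written only to make the shapes above concrete. The actual module computes the full (H + W - 1) × H × W attention map in one vectorized pass and adds a residual connection; the function and variable names here are mine.

```python
import torch

def criss_cross_attention(Q, K, V):
    """Per-position criss-cross attention (sketch).

    Q, K: (C2, H, W) query/key maps; V: (C1, H, W) value map.
    Returns a (C1, H, W) map aggregated over each position's row and column.
    """
    C2, H, W = Q.shape
    C1 = V.shape[0]
    out = torch.zeros(C1, H, W)
    for i in range(H):
        for j in range(W):
            q_u = Q[:, i, j]                                      # C2
            # Keys/values on the criss-cross path of u: its column and its row.
            omega_u = torch.cat([K[:, :, j], K[:, i, :]], dim=1)  # C2 x (H + W)
            phi_u   = torch.cat([V[:, :, j], V[:, i, :]], dim=1)  # C1 x (H + W)
            keep = torch.ones(H + W, dtype=torch.bool)
            keep[H + j] = False                                   # u appears twice; drop one copy
            omega_u, phi_u = omega_u[:, keep], phi_u[:, keep]     # ... x (H + W - 1)
            att = (omega_u.T @ q_u).softmax(dim=0)                # (H + W - 1) attention weights
            out[:, i, j] = phi_u @ att                            # C1 aggregated feature
    return out
```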
Why 2 loops: a single criss-cross pass lets each position aggregate information only from its own row and column; applying the module a second time lets every position receive information from every other position, via an intermediate pixel in the shared row or column (see the toy check below).
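A toy check of this argument, replacing the learned attention with uniform (unnormalized) weights just to trace which positions can be reached.

```python
import torch

H, W = 5, 7
x = torch.zeros(H, W)
x[2, 3] = 1.0                    # a single activated pixel

def cc_pass(x):
    """One criss-cross aggregation with uniform weights (toy version)."""
    row = x.sum(dim=1, keepdim=True).expand_as(x)   # information from each row
    col = x.sum(dim=0, keepdim=True).expand_as(x)   # information from each column
    return row + col

after_one = cc_pass(x)
after_two = cc_pass(after_one)
print((after_one > 0).float())   # only row 2 and column 3 are reached after one pass
print((after_two > 0).all())     # tensor(True): every position is reached after two passes
```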
Experiments
Assessment
Directly using Non-local (or a thin wrapper around it) is not especially novel, but it can still be enough to push accuracy up.
Analyzing and reducing the complexity of Non-local, and re-reading the computation graph from the other association order, is the more insightful contribution.