The previous posts covered distillation for classification, which is simple; going from classification to segmentation follows the same idea. The next paper covers distillation for detection.
Motivation Semantic segmentation is a structured prediction problem. Pixel-level distillation is straightforward, but structured distillation schemes should be introduced as well to transfer that structure.
Contributions We study the knowledge distillation strategy for training accurate compact semantic segmentation networks. We present two structured knowledge distillation schemes, pair-wise distillation and holistic distillation, enforcing pair-wise and high-order consistency between the outputs of the compact and cumbersome segmentation networks. We demonstrate the effectiveness of our approach by improving recently-developed state-of-the-art compact segmentation networks, ESPNet, MobileNetV2-Plus and ResNet18, on three benchmark datasets: Cityscapes, CamVid and ADE20K.
Approach
Pixel-wise distillation Every spatial location is a C-dimensional vector of class scores, so the usual classification distillation is applied per pixel.
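A minimal sketch of this per-pixel term, assuming (N, C, H, W) score maps and a softened KL divergence; the temperature argument `T` is my addition, not necessarily the paper's setting:

```python
import torch.nn.functional as F

def pixel_wise_distillation(student_logits, teacher_logits, T=1.0):
    # (N, C, H, W) score maps; softmax over the C classes at each pixel.
    s = F.log_softmax(student_logits / T, dim=1)
    t = F.softmax(teacher_logits / T, dim=1)
    # Per-pixel KL divergence, averaged over batch and spatial positions.
    return F.kl_div(s, t, reduction="none").sum(dim=1).mean()
```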
Pair-wise distillation Motivation: spatial labeling contiguity; the similarities between pairs of feature locations carry structure, so the student's pairwise similarity map is pushed toward the teacher's (sketched below).
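A sketch of the pair-wise term under my own assumptions (cosine similarity between every pair of locations, squared-error match, no subsampling of pairs, which one would want for large feature maps):

```python
import torch
import torch.nn.functional as F

def pair_wise_distillation(student_feat, teacher_feat):
    # Both features assumed (N, C, H, W) with matching spatial size.
    def similarity(feat):
        f = feat.flatten(2)                     # (N, C, H*W)
        f = F.normalize(f, p=2, dim=1)          # unit-norm per location
        return torch.bmm(f.transpose(1, 2), f)  # (N, H*W, H*W) cosine sims

    a_s, a_t = similarity(student_feat), similarity(teacher_feat)
    # Match the teacher's pairwise similarity structure.
    return (a_s - a_t).pow(2).mean()
```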
Holistic distillation Conditional WGAN.
- Real: score map produced by the teacher network
- Fake: score map produced by the student network
- Condition: the input image
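The adversarial objective then takes the usual Wasserstein form; a minimal sketch, assuming an embedding network `D` has already scored (score map, image) pairs, with the Lipschitz constraint (e.g. a gradient penalty) omitted:

```python
def holistic_losses(d_real, d_fake):
    # d_real: D scores for (teacher score map, image) pairs.
    # d_fake: D scores for (student score map, image) pairs.
    d_loss = d_fake.mean() - d_real.mean()  # D separates teacher from student
    g_loss = -d_fake.mean()                 # student tries to look "real"
    return d_loss, g_loss
```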
Experiment Cityscapes dataset.
Motivation Detectors care more about local regions near objects. The discrepancy of feature responses at the near-object anchor locations reveals important information about how the teacher model tends to generalize. Unlike classification, distillation by full-feature imitation in detection yields only limited gains for the student network (this is debatable; the paper does not spell out which feature layers the "full feature" covers). This is likely because the noise introduced by the large number of useless background anchors drowns out the supervision signal from the teacher net. The paper argues that a detector attends to object regions and their surroundings, and that the discrepancy among the different positive anchors on an object region is exactly what expresses how the teacher net generalizes on detected objects.
Framework
Imitation region estimation (steps 1-4 are sketched in the code below):
1) Compute the IOU between each GT box and the W×H×K anchors on the feature layer, giving an IOU map m.
2) Find the maximum M = max(m) and multiply it by the factor ψ to obtain the anchor-filtering threshold F = ψ ∗ M.
3) Combine the anchor locations whose IOU exceeds F with an OR operation, giving a W×H feature-map mask.
4) Iterate over all GT boxes and OR the per-box masks together to obtain the final combined mask.
5) Append a feature adaptation layer after the student net feature map to be imitated, so that its size matches the teacher net's feature map.
6) With the mask applied, the discrepancy between the student net and the teacher net responses at the kept anchor locations serves as the imitation loss, which is added to the distillation training loss.
When ψ = 0, the generated mask includes all locations on the feature map, while no locations are kept when ψ = 1. Varying ψ gives varied imitation masks. In all experiments, a constant ψ = 0.5 is used.
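A sketch of steps 1-4; the shapes and names here are assumptions, not the authors' exact implementation:

```python
import torch

def imitation_mask(ious, psi=0.5):
    # ious: (num_gt, W, H, K) IOU map m between each GT box and the
    # W x H x K anchors of one feature level. Returns a (W, H) bool mask.
    mask = torch.zeros(ious.shape[1:3], dtype=torch.bool)
    for m in ious:                       # one GT box at a time
        thresh = psi * m.max()           # F = psi * M
        keep = (m > thresh).any(dim=-1)  # OR over the K anchors -> (W, H)
        mask |= keep                     # OR across GT boxes
    return mask
```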
Fine-grained feature imitation
1) The student feature's channel number may not be compatible with the teacher model's. The added layer aligns the former with the latter for calculating the distance metric.
2) We find that even when student and teacher have compatible features, forcing the student to approximate the teacher feature directly leads to minor gains compared to the adapted counterpart.

```python
# Adaptation layer appended to the student feature map.
self.stu_feature_adap = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1),
    nn.ReLU())
```

Here I is the imitation mask; the imitation loss is the masked L2 distance between the adapted student feature and the teacher feature, normalized by the number of kept locations.
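Putting the mask and the adapted feature together, the imitation loss can be sketched as the masked L2 term below; variable names and shapes are my assumptions:

```python
def imitation_loss(stu_feat_adap, tea_feat, mask):
    # stu_feat_adap: adapted student feature, (N, C, W, H)
    # tea_feat:      teacher feature of the same shape
    # mask:          imitation mask I, (N, W, H), 1 at kept locations
    mask = mask.unsqueeze(1).float()      # broadcast over channels
    n_pos = mask.sum().clamp(min=1.0)     # N_p: number of kept locations
    diff = (stu_feat_adap - tea_feat).pow(2) * mask
    return diff.sum() / (2.0 * n_pos)     # 1/(2 N_p) * masked squared error
```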
Experiment
- Full-feature imitation (hint learning, where the teacher net and student net feature maps differ in size) (F) is even less accurate than training the student net directly, showing that the full feature covers too many background anchors whose supervision introduces too much noise.
- Using the gt box as the supervision region (G) shows the noise of full-feature learning can be reduced substantially, but it is worse than including the positive anchors around the box (I), which also shows the information around the gt box is an important basis the teacher net uses for localization.
- Pure distillation loss (with an adaptation layer first aligning the student net and teacher net output sizes) (D) brings only a small gain (mAP +0.9%), showing that directly transplanting the classification-style distillation to detection is unsuitable.
- Using both the distillation loss and the imitation loss (ID) is worse than the imitation loss alone, suggesting high-level feature imitation and distillation focus on inconsistent things.
Visualization of imitation mask
Supplementary materials As shown in the figure above, W_r exists because a teacher network layer's output usually differs in size from the small network's, so a mapping is needed to match them, and this mapping itself must be learned. The paper mentions that adding an extra conv layer is the parameter-saving way to build this mapping (which also makes logical sense); this conv layer uses no padding and no striding (stride 1). A formula expresses it below:
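The referenced formula matches the FitNets-style hint loss, L_HT(W_Guided, W_r) = 1/2 * || u_h(x; W_Hint) - r(v_g(x; W_Guided); W_r) ||^2. A minimal sketch, with the conv regressor standing in for W_r (kernel size 1 is my assumption; the kernel is really sized so the spatial dimensions match):

```python
import torch.nn as nn

class HintLoss(nn.Module):
    def __init__(self, stu_channels, tea_channels):
        super().__init__()
        # The learned mapping W_r: a conv with no padding, stride 1.
        self.regressor = nn.Conv2d(stu_channels, tea_channels,
                                   kernel_size=1, padding=0, stride=1)

    def forward(self, stu_hidden, tea_hidden):
        # L_HT = 1/2 * || teacher hint - r(student guided layer) ||^2
        return 0.5 * (tea_hidden - self.regressor(stu_hidden)).pow(2).sum()
```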