Ten Years of Pedestrian Detection, What Have We Learned?

Rodrigo Benenson, Mohamed Omran, Jan Hosang, Bernt Schiele. Max Planck Institut for Informatics, Saarbrücken, Germany. 2014, Computer Vision for Road Scene Understanding and Autonomous Driving (CVRSUAD, ECCV workshop).

Agenda
Abstract, Introduction, Datasets, Main approaches to improve pedestrian detection, Experiments, Conclusion

Abstract
We discuss the main ideas explored in the 40+ detectors currently present in the Caltech pedestrian detection benchmark. We observe that there exist three families of approaches, all currently reaching similar detection quality. Based on our analysis, we study the complementarity of the most promising ideas by combining multiple published strategies. The resulting new decision forest detector achieves the current best known performance on the challenging Caltech-USA dataset.

Introduction
Pedestrian detection has direct applications in car safety, surveillance, and robotics, and it has attracted much attention in the last years. It is a well defined problem with established benchmarks and evaluation metrics, and it has served as a playground to explore different ideas for object detection. The main paradigms for object detection (“Viola & Jones variants”, HOG+SVM rigid templates, deformable part detectors (DPM), and convolutional neural networks (ConvNets)) have all been explored for this task. The aim of this paper is to review progress over the last decade of pedestrian detection (40+ methods), identify the main ideas explored, and try to quantify which ideas had the most impact on final detection quality. We do not aim to introduce a novel technique; by putting together existing methods we report the best known detection results on the challenging Caltech-USA dataset.
Notes: pedestrian detection has a very wide range of applications, including intelligent driver assistance, intelligent surveillance, pedestrian analysis, and intelligent robots.

Introduction The last decade has shown tremendous progress on pedestrian detection. What have we learned out of the 40+ proposed methods?

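Datasets
Multiple public pedestrian datasets have been collected over the years; INRIA, ETH, TUD-Brussels, Daimler, Caltech-USA, and KITTI are the most commonly used ones.
Notes: the figure shows results using the SquaresChnFtrs features (the input image is converted into a set of feature maps, and the final feature vector is obtained by sum-pooling over a large set of rectangular regions).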
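
The sum-pooling behind SquaresChnFtrs-style features can be sketched with an integral image, which makes the sum over any square region a four-lookup operation. This is a minimal illustration, not the authors' implementation: the function names, the toy channels, and the chosen squares are all assumptions made for the example.

```python
def integral_image(channel):
    """Return a (H+1) x (W+1) summed-area table for a 2-D list of numbers."""
    height, width = len(channel), len(channel[0])
    table = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += channel[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, top, left, size):
    """Sum of the size x size square whose top-left corner is (top, left)."""
    bottom, right = top + size, left + size
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def square_features(channels, squares):
    """Sum-pool every channel over every (top, left, size) square."""
    features = []
    for channel in channels:
        table = integral_image(channel)
        features.extend(box_sum(table, t, l, s) for (t, l, s) in squares)
    return features


# Toy example: two 4x4 channels and three hypothetical square regions.
channels = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
squares = [(0, 0, 2), (1, 1, 2), (0, 0, 4)]
feature_vector = square_features(channels, squares)
print(feature_vector)  # [14.0, 34.0, 136.0, 4.0, 4.0, 16.0]
```

In a real detector the channels would be the feature maps of a candidate window and the set of squares would be much larger; a decision forest then selects which pooled sums to split on.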

Datasets
INRIA: the oldest dataset, and as such it has comparatively few images. It benefits, however, from high quality annotations of pedestrians in diverse settings (city, beach, mountains), which is why it is still widely used.
ETH and TUD-Brussels: mid-sized video datasets.
Daimler: not considered by all methods because it lacks colour channels.
Daimler stereo, ETH, and KITTI provide stereo information. All datasets but INRIA are obtained from video, and thus enable the use of optical flow as an additional cue.
Notes: optical flow can be pictured as looking out of a train window. Trees, the ground, and buildings all appear to move backwards, and this apparent motion is the optical flow.

Datasets
Caltech-USA and KITTI are the predominant benchmarks for pedestrian detection. Both are comparatively large and challenging.
Caltech-USA: stands out for the large number of methods that have been evaluated side-by-side.
KITTI: stands out because its test set is slightly more diverse, but it is not yet used as frequently.
Notes: Caltech-USA covers a large number of methods and also provides a matching Matlab toolbox, which makes comparisons convenient. This paper mainly uses the Caltech dataset as the standard, with INRIA and KITTI as secondary benchmarks.

Datasets
In this paper we primarily use Caltech-USA for comparing methods, and INRIA and KITTI secondarily. Caltech-USA and INRIA results are measured in log-average miss-rate (MR, lower is better), while KITTI uses area under the precision-recall curve (AUC, higher is better).
Notes: the log-average miss rate is obtained by averaging the miss rates in the log domain.
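
The log-average miss rate can be sketched as follows. This assumes the usual Caltech-style protocol: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0, and the samples are averaged in the log domain (a geometric mean). The function name, the handling of curves that start above a reference point, and the example curve are assumptions for illustration, not the benchmark's reference code.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI values.

    fppi and miss_rate describe one detector curve, with fppi sorted in
    increasing order.
    """
    # Reference points evenly spaced in log space over [1e-2, 1e0].
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        # Assumption: if the curve has no point at or below the reference,
        # count every pedestrian as missed at that reference.
        sampled = 1.0
        for x, y in zip(fppi, miss_rate):
            if x <= ref:
                sampled = y  # keep the last curve point not exceeding ref
        # Clamp to avoid log(0) for a perfect detector.
        log_sum += math.log(max(sampled, 1e-10))
    return math.exp(log_sum / num_points)


# Hypothetical curve, for illustration only (not a result from the paper).
fppi = [0.001, 0.01, 0.1, 1.0]
miss = [0.80, 0.60, 0.40, 0.20]
mr = log_average_miss_rate(fppi, miss)
```

Averaging in the log domain keeps one very low or very high sample from dominating the summary, which is why a single MR number can stand in for the whole curve when ranking methods.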

Main approaches to improve pedestrian detection
Table: listing of methods considered on Caltech-USA, sorted by log-average miss-rate (lower is better). Consult sections 3.1 to 3.9 for details of each column. See also matching figure 3. “HOG” indicates HOG-like [1]. Ticks indicate salient aspects of each method.
“Training” data column: I → INRIA, C → Caltech, I+/C+ → INRIA/Caltech and additional data, P → Pascal, T → TUD-Motion, I & C → both INRIA and Caltech.


Training data
Methods trained on Caltech-USA systematically perform better than methods that generalize from INRIA. High performing methods with “other training” use extended versions of Caltech-USA.

Solution families
We discern three families: 1) DPM variants (DPM), 2) deep networks (DN), and 3) decision forests (DF). Based on raw numbers alone, boosted decision trees (DF) seem particularly suited for pedestrian detection.
Notes: the best-performing methods in the listing above are all DF.
Better classifiers
Since the original proposal of HOG+SVM, linear and non-linear kernels have been considered. The distinction between features and classifiers is not clear-cut anymore. There is no empirical evidence on whether non-linear kernels provide meaningful gains over linear kernels, nor on whether one particular type of classifier (e.g. SVM or decision forests) is better suited for pedestrian detection than another.

Additional data
The core problem of pedestrian detection focuses on individual monocular colour image frames. Some methods explore leveraging additional information at training and test time to improve detections, for example stereo images [45], optical flow (using previous frames, e.g. MultiFtr+Motion [22] and ACF+SDt [42]), tracking [46], or data from other sensors.
Notes: with current techniques, methods based only on monocular colour images are already on par with methods enhanced with additional information.

Using additional data provides meaningful improvements.
Exploiting context
Shown here: the performance improvement for methods incorporating context. For AFS+Geo, the evaluation metric changed from per-window (FPPW) to per-image (FPPI). Context provides consistent improvements for pedestrian detection, although the scale of improvement is lower compared to additional test data.
Notes: refining detections with information from the surrounding scene yields gains that are less pronounced than those from more training samples or deep architectures.

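Deformable parts
The DPM detector [19] was originally motivated by pedestrian detection. For pedestrian detection the results are competitive, but not salient (LatSvm [50,12], MultiResC [33], MT-DPM [39]). There is still no clear evidence for the necessity of components and parts, beyond the case of occlusion handling.
Multi-scale models
Multi-scale models provide a simple and generic extension to existing detectors. They improve performance by 1 ∼ 2 MR percent points. Despite consistent improvements, their contribution to the final quality is rather minor.
Notes: the person model trained in the HOG paper is a single model. It detects upright people seen from the front or back very well and was a major breakthrough at the time. HOG has also been the best feature so far (recently surpassed by the CVPR 2013 paper "Histograms of Sparse Codes for Object Detection"). But what about side views? The natural idea is to use multiple models. DPM used 2 models, and the latest Version 5 release on its homepage uses 12 models. Multiple models can handle changes in viewpoint. However, the gains from deformable parts have been systematically surpassed by single-component detectors, so there is currently no clear evidence that this approach is necessary. For multi-resolution models, the resolution used before feature extraction affects the result.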

Deep architectures
ConvNet (see: ImageNet Classification with Deep CNN) uses a mix of unsupervised and supervised training to create a convolutional neural network trained on INRIA. Another line of work focuses on using deep architectures to jointly model parts and occlusions. These works use edge and colour features [40,34,28], or initialise network weights to edge-sensitive filters. Despite the common narrative, there is still no clear evidence that deep networks are good at learning features for pedestrian detection. Most successful methods use such architectures to model higher level aspects of parts, occlusions, and context.
Notes: with growing amounts of data and computing power, deep networks (especially CNNs) have become popular in computer vision, including pedestrian detection. The ConvNet architecture mixes supervised and unsupervised training to build a convolutional neural network, while other architectures (DBN, JointDeep, SDN) fold the part model and occlusion handling into the deep architecture, and use edge and colour features or initialise the network weights as edge detectors.

Better features
The most popular approach for improving pedestrian detection (about 30 % of the considered methods) is to increase or diversify the features computed over the input image. A large set of feature types have been explored: edge information [1,26,58,41], colour information [26,22], texture information [17], local shape information [38], covariance features [24], amongst others. More and more diverse features have been shown to systematically improve performance. Some papers have considered up to an order of magnitude more channels [16,58,24,30,38]. Despite the improvements by adding many channels, top performance is still reached with only 10 channels. The next scientific step will be to develop a more profound understanding of what makes good features good, and how to design even better ones.
Notes: many decision forest methods use 10 feature channels. Better feature representations improve performance.

Experiment
Three aspects seem to be the most promising in terms of impact on detection quality: better features (§3.9), additional data (§3.4), and context information (§3.5). We thus conduct experiments on the complementarity of these aspects. We choose the Integral Channels Features framework [26] (a decision forest) for conducting our experiments. Methods from this family have shown good performance, train in minutes∼hours, and lend themselves to the analyses we aim for.

Experiment Reviewing the effect of features
In this section, we evaluate the impact of increasing feature complexity. Example: we expand the 10 HOG+LUV channels into 40 channels by convolving each channel with three DCT (discrete cosine transform) basis functions (of 7 × 7 pixels), and storing the absolute value of the filter responses as additional feature channels. We name this variant SquaresChnFtrs+DCT. Figure labels: INRIA training set; Caltech-USA reasonable test set. Simple tweaks to these well known features can still yield noticeable improvements (e.g. adding the DCT channels to SquaresChnFtrs).
Notes: since VJ, most of the performance gains can be attributed to better features, such as gradient orientation and colour information.
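
The channel expansion described above (10 channels to 40, via three 7 × 7 DCT basis filters and an absolute value) can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' code: the text does not say which three basis functions were used, so the three lowest non-constant frequencies are picked here, borders are zero-padded, and the toy 8 × 8 channels merely stand in for real HOG+LUV maps.

```python
import math

FILTER_SIZE = 7  # 7 x 7 pixels, as stated in the text


def dct_basis(u, v, size=FILTER_SIZE):
    """2-D DCT-II basis function with vertical frequency u, horizontal v."""
    return [[math.cos(math.pi * (y + 0.5) * u / size)
             * math.cos(math.pi * (x + 0.5) * v / size)
             for x in range(size)] for y in range(size)]


def filter_abs(channel, kernel):
    """Zero-padded, same-size filtering followed by an absolute value.

    This is a cross-correlation; for these symmetric or antisymmetric
    kernels the absolute response equals that of a true convolution.
    """
    height, width = len(channel), len(channel[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky, kernel_row in enumerate(kernel):
                yy = y + ky - half
                if not 0 <= yy < height:
                    continue
                for kx, weight in enumerate(kernel_row):
                    xx = x + kx - half
                    if 0 <= xx < width:
                        acc += weight * channel[yy][xx]
            out[y][x] = abs(acc)
    return out


def expand_channels(channels, kernels):
    """Keep each input channel and append one filtered channel per kernel."""
    expanded = list(channels)
    for channel in channels:
        expanded.extend(filter_abs(channel, k) for k in kernels)
    return expanded


# Assumption: the three lowest non-constant frequencies.
kernels = [dct_basis(0, 1), dct_basis(1, 0), dct_basis(1, 1)]

# Ten toy 8x8 channels standing in for the HOG+LUV channels.
toy_channels = [[[float((x + y + c) % 5) for x in range(8)] for y in range(8)]
                for c in range(10)]
all_channels = expand_channels(toy_channels, kernels)
print(len(all_channels))  # 40
```

Because the new channels are just more feature maps, the rest of the pipeline (sum-pooling over squares, boosted trees) is unchanged, which is what makes this a "simple tweak".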

Experiment Complementarity of approaches
We consider the complementarity of better features (HOG+LUV+DCT), additional data (via optical flow), and context (via person-to-person interactions). We encode the optical flow using the same SDt features from ACF+SDt. The context information is injected using the +2Ped re-weighting strategy. The combination SquaresChnFtrs+DCT+SDt+2Ped is called Katamari-v1. Our experiments show that adding extra features, flow, and context information are largely complementary (12 % gain, instead of [...] %), even when starting from a strong detector.
Notes: as shown in figure 7, Katamari-v1 reaches the best result on Caltech; figure 7 also shows the best results obtained by other methods.

Experiment How much model capacity is needed?
We consider a necessary condition for high quality detection: is the learned model performing well on the training set? (Figure: Caltech-USA training set performance.) None of these methods performs perfectly on the training set, and we do not yet observe symptoms of over-fitting. Our results indicate that research on increasing the discriminative power of detectors is likely to further improve detection quality. More discriminative power can originate from more and better features or more complex classifiers.

Experiment Generalisation across datasets
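For real world applications beyond a specific benchmark, the generalization capability of a model is key. The table shows the performance of SquaresChnFtrs over Caltech-USA when using different training sets (MR for INRIA/Caltech/ETH, AUC for KITTI). While detectors learned on one dataset may not necessarily transfer well to others, their ranking is stable across datasets, suggesting that insights can be learned from well-performing methods regardless of the benchmark.
Notes: generalization error is a function describing the gap between a student machine, after it has learned from sample data, and the teacher machine. The name reflects that the function indicates the machine's ability to infer, that is, how well the rules derived from sample data apply to new data. Models are trained on different training sets (INRIA, Caltech, KITTI) and then tested on different test sets (INRIA, Caltech, KITTI, ETH). The model trained on INRIA performs well on every test set, ranking first on two (INRIA, ETH) and second on two (Caltech, KITTI). A good method performs consistently across benchmarks.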
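
One way to make "ranking is stable across datasets" concrete is a rank correlation between the orderings of the same methods on two benchmarks. The sketch below uses Spearman's coefficient, which is a choice made here for illustration and not something the slides prescribe. The method names and scores are invented placeholders, not results from the paper. Note that MR is lower-is-better while AUC is higher-is-better, so each metric is ranked in its own direction.

```python
def ranks(scores, higher_is_better):
    """Rank methods from 1 (best) to n (worst); assumes no tied scores."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(order)}


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same methods."""
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical numbers for four placeholder methods (not results from the paper).
miss_rate = {"A": 0.50, "B": 0.35, "C": 0.30, "D": 0.22}  # lower is better
auc = {"A": 0.40, "B": 0.52, "C": 0.50, "D": 0.61}        # higher is better

rho = spearman(ranks(miss_rate, higher_is_better=False),
               ranks(auc, higher_is_better=True))
print(round(rho, 3))  # 0.8
```

A value near 1 means the two benchmarks order the methods almost identically, even if the absolute scores differ.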

Conclusion
Most of the progress in pedestrian detection can be attributed to the improvement in features alone, and most of these features were hand-crafted through trial and error. Our experiment combining the detector ingredients that our retrospective analysis found to work well (better features, optical flow, and context) shows that these ingredients are mostly complementary. The main challenge ahead seems to be developing a deeper understanding of what makes good features good, so as to enable the design of even better ones.
