Video Caption Technique Based on Joint Image-Audio Deep Learning
ICCE-Berlin 2019
Authors: Chien-Yao Wang, Pei-Sin Liaw, Kai-Wen Liang, Jai-Ching Wang, Pao-Chi Chang
Presenter: Dr. Pao-Chi Chang, National Central University, Taiwan
September 8, 2019
Good afternoon! In this talk, I will present … The focus will be on the balance between audio and video features. Namely, we try to perform feature normalization to maximize the effects of both the audio and image features. This work was done by Prof. Wang … and my group. I am …
Outline
Introduction to Video Caption Technique
Proposed Joint Image-Audio Based Video Caption System
Simulation Environment and Dataset
Experiments
Conclusion
Introduction to Video Caption Technique (1/2)
Goal: Analyzing videos to understand video content and using natural language to describe it.
Classification: Girl
Caption: A girl is waking up in bed.
Video captioning extends deep-learning-based image recognition, video classification, and action recognition; the aim is to let machines learn and understand video content and describe it in natural language. The goal … is … For example, based on image classification alone, this image could be classified as "Girl". But by viewing the whole video, the caption "A girl is waking up in bed." might be more appropriate.
Introduction to Video Caption Technique (2/2)
The first technical report to address the video caption problem [1]: CNN (AlexNet, fc7) + RNN (two-layer LSTM), using "one-hot" vectors to represent words.
The first technical report on video captioning was published in 2014 and used this architecture: a CNN (AlexNet with fully connected layers) extracts features from all frames of the video; after mean pooling, the features are fed into an LSTM decoder to generate the text.
[1] S. Venugopalan, H. Xu, J. Donahue, "Translating videos to natural language using deep recurrent neural networks," arXiv preprint arXiv:1412.4729, 2014.
Architecture of Proposed Method
SCN-RNN
SCN: Semantic Compositional Network
Image features processing
Using RGB frames of the video as input, at 2 frames per second.
Using a 2D CNN (ResNet-152 [4]) and a 3D CNN (C3D [5]) to extract features.
Using mean pooling [6] to obtain 2048-way 2D CNN features and 512-way 3D CNN features.
Concatenating the 2D CNN and 3D CNN features yields a 2560-way image feature; this is the representation of each video.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
[5] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," ICCV, 2015.
[6] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," NAACL, 2015.
2-D CNN vs. 3-D CNN
Architecture: ResNet-152 [4] | C3D network [5]
Pre-trained on: ImageNet dataset | Sports-1M dataset
Input size: 224x224x3 | 112x112x3
Output: 2048 ways / frame from the conv-5 layer | 4096 ways / 8 frames from the fc-7 layer
Highlights: Winner of the 2015 ILSVRC; residual learning | Video clips of 16 frames with an overlap of 8 frames
2-D: ResNet-152 won the 2015 ImageNet Large Scale Visual Recognition Competition (ILSVRC) with an error rate of 3.5%. The input size needs scaling. Residual learning: instead of passing activations strictly layer by layer, the network allows skip connections; if the input to a block is x and the desired output is H(x), passing x directly to the output changes the learning target to F(x) = H(x) - x.
3-D: Video is 3D. The Sports-1M dataset contains 1,133,158 videos annotated with 487 sports labels. Figure: a 3D kernel scans the whole video clip.
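A rough illustration of the pooling-and-concatenation step described above (function and variable names, shapes, and the random placeholder inputs are mine; the 512-way C3D vector follows the slide rather than the raw fc-7 size):

```python
# Minimal sketch: mean-pool per-frame 2D CNN features and per-clip 3D CNN features,
# then concatenate them into one video-level vector.
import numpy as np

def build_image_feature(res5_feats, c3d_feats):
    """res5_feats: (num_frames, 2048) ResNet-152 features, one row per sampled frame.
    c3d_feats:  (num_clips, 512) C3D features, one row per 16-frame clip.
    Returns a single 2560-way video representation."""
    v2d = res5_feats.mean(axis=0)      # mean pooling over frames -> (2048,)
    v3d = c3d_feats.mean(axis=0)       # mean pooling over clips  -> (512,)
    return np.concatenate([v2d, v3d])  # (2560,)

# Random placeholders standing in for real CNN outputs
video_feat = build_image_feature(np.random.rand(20, 2048), np.random.rand(5, 512))
print(video_feat.shape)  # (2560,)
```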
Audio features processing
Audio signals go through a Short-Time Fourier Transform (STFT) and a Mel filter bank to obtain a log Mel-scale spectrogram of size 40 mel × frames.
Using the Asymmetrical Kernel Convolutional Neural Network (AKCNN) [6] to extract features.
Concatenating acoustic scene classification (ASC) features and sound event detection (SED) features to form the audio features.
The raw signal first goes through the STFT and Mel-scale mapping: the Hamming window spans 40 ms per frame and advances by its own length, and the Mel-scale mapping uses 40 triangular filters, yielding a log Mel-scale spectrogram of size 40 mel × frames, where frames = (44100 × seconds / 822).
AKCNN: one kernel dimension is temporal, the other is frequency. Separate datasets are used to train the ASC and SED networks.
(a) audio pre-processing
[6] Y. C. Wu, P. C. Chang, C. Y. Wang, J. C. Wang, "Asymmetrical Kernel Convolutional Neural Network for acoustic scenes classification," IEEE International Symposium on Consumer Electronics (ISCE), May 2018.
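A rough sketch of this pre-processing step, assuming librosa is available; the FFT size and hop length here are illustrative guesses, not the paper's exact framing:

```python
# Minimal sketch: compute a 40-band log Mel-scale spectrogram from a wav file.
import librosa

def log_mel_spectrogram(wav_path, sr=44100, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=2048,        # STFT window length (assumed)
        hop_length=882,    # ~20 ms hop at 44.1 kHz (assumed)
        n_mels=n_mels,     # 40 Mel bands, matching the 40mel x frames spectrogram
        window="hamming")
    return librosa.power_to_db(mel)  # log Mel-scale spectrogram, shape (40, frames)
```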
Asymmetrical Kernel Convolutional Neural Network [6]
1st convolutional layer: kernel size 7x5, activation function: ReLU
2nd convolutional layer:
Max-pooling: window size 5x5, without overlap
Fully connected layer: activation function: softmax
Loss function: cross entropy
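A minimal sketch of an AKCNN-style classifier built from the layers listed above, using tf.keras; channel counts, the second convolution's kernel, and the input frame count are placeholders of mine, since only the 7x5 kernel, the 5x5 non-overlapping pooling, the softmax output, and the cross-entropy loss are stated here:

```python
# Sketch of an asymmetric-kernel CNN over a 40 x frames log-Mel spectrogram.
import tensorflow as tf

def build_akcnn(n_classes, input_shape=(40, 500, 1)):   # 40 Mel bands x frames x 1 channel
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel_size=(7, 5), activation="relu",
                               padding="same", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(pool_size=(5, 5)),          # 5x5 window, no overlap
        tf.keras.layers.Conv2D(64, kernel_size=(7, 5), activation="relu", padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),  # fully connected output
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```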
Semantic features processing
Using the 300 most common words in the training captions to determine the vocabulary of tags (the most frequent nouns, verbs, and adjectives).
Treating this problem as a multi-label classification task.
Implemented as a multilayer perceptron (MLP) with the logistic sigmoid function.
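Since the tagger is described as an MLP with logistic sigmoid outputs over 300 tag words, a minimal multi-label sketch could look like the following; the hidden-layer size and the 2560-way input are my own assumptions:

```python
# Sketch of the semantic tagger: predicts which of the 300 tag words appear in the video.
import tensorflow as tf

tagger = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(2560,)),  # video feature in
    tf.keras.layers.Dense(300, activation="sigmoid"),   # one independent probability per tag
])
# Binary cross-entropy treats every tag as its own yes/no decision (multi-label setup).
tagger.compile(optimizer="adam", loss="binary_crossentropy")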
Word Embedding
Dictionary: <eos> 0, a 1, chipmunk 2, is 3, eating 4
Text input: a chipmunk is eating <eos>  →  [1, 2, 3, 4, 0]
One-hot encoder:
v_<eos> = [1 0 0 0 0]
v_a = [0 1 0 0 0]
v_chipmunk = [0 0 1 0 0]
v_is = [0 0 0 1 0]
v_eating = [0 0 0 0 1]
Multilayer perceptron output:
v_<eos> = [0.98 0.53 0.41]
v_a = [0.83 0.42 0.61]
v_chipmunk = [0.32 0.64 0.12]
v_is = [0.03 0.56 0.72]
v_eating = [0.38 0.54 0.26]
Q: output / ? (word, sentence)
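A small sketch of this step with illustrative values: one-hot word indices over the toy dictionary above are mapped to dense vectors by a learned projection (an Embedding layer stands in here for the multilayer perceptron):

```python
# Sketch: map word indices to 3-dimensional embedding vectors.
import numpy as np
import tensorflow as tf

vocab = {"<eos>": 0, "a": 1, "chipmunk": 2, "is": 3, "eating": 4}
sentence = ["a", "chipmunk", "is", "eating", "<eos>"]
ids = np.array([[vocab[w] for w in sentence]])           # [[1, 2, 3, 4, 0]]

embed = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=3)
vectors = embed(ids)                                      # shape (1, 5, 3): one 3-way vector per word
print(vectors.shape)
```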
Video description model - SCN-LSTM
The concatenation of image and audio features is fed into the LSTM to initialize the first step, which is expected to provide an overview of the video content.
The LSTM parameters are combined with the semantic feature to generate the caption.
a: audio feature, i: image feature, s: semantic feature
(a) basic LSTM  (b) SCN-LSTM
SCN-RNN: Semantic Compositional Network.
The previously obtained audio and image features are merged to initialize the LSTM, providing information about the video content; the LSTM is then combined with the semantic features to generate the description.
After the weight matrix is generated using the SCN, the first output word of a sentence is obtained; this output and the weight matrix are combined as the input for the next step.
The Adam algorithm is an adaptive moment estimation method (Adaptive Moment Estimation).
Q: relationship with word embedding? Dimension reduction from 10k to 300.
[7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, L. Deng, "Semantic Compositional Networks for Visual Captioning," CVPR, 2017.
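The full SCN-LSTM factorizes the LSTM weights with the semantic feature [7]; the sketch below illustrates only the initialization idea, feeding the concatenated image-audio vector as the decoder's first time step. All dimensions and layer choices (including the 128-way audio feature) are assumptions of mine, not the authors' implementation:

```python
# Sketch: video feature projected and prepended as time step 0 of an LSTM word decoder.
import tensorflow as tf

feat_dim, embed_dim, hidden, vocab_size = 2560 + 128, 512, 512, 10000

video_feat = tf.keras.Input(shape=(feat_dim,))           # concatenated image+audio features
words = tf.keras.Input(shape=(None, embed_dim))          # embedded words generated so far

first_step = tf.keras.layers.Dense(embed_dim)(video_feat)            # project the video feature
first_step = tf.keras.layers.Reshape((1, embed_dim))(first_step)     # treat it as time step 0
seq = tf.keras.layers.Concatenate(axis=1)([first_step, words])       # video overview, then words

h = tf.keras.layers.LSTM(hidden, return_sequences=True)(seq)
logits = tf.keras.layers.Dense(vocab_size, activation="softmax")(h)  # next-word distribution
decoder = tf.keras.Model([video_feat, words], logits)
```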
Outline
Introduction to Video Caption Technique
Proposed Joint Image-Audio Based Video Caption System
Simulation Environment and Dataset
  Simulation Environment
  Youtube2text Dataset
  Acoustic Scene Classification pre-train dataset: TUT Acoustic Scenes 2016
Experiments
Conclusion
Simulation Environment
Audio feature extraction machine: CPU Intel Xeon E5-2630v3 @2.4GHz x2; GPU GeForce GTX TITAN X; RAM 256GB DDR4-2133 MHz; Software: Python 2.7; Neural network tool: TensorFlow.
Video description machine: CPU Intel Core i7-6700 @3.40GHz x8; GPU GeForce GTX 1070; RAM 16GB DDR4-2133 MHz; Software: Python 2.7, Matlab 2016a; Neural network tool: Theano.
OS: Ubuntu 16.04-x64.
GPU! The two environments differ because this work continues earlier research in our lab.
Youtube2text Dataset
Microsoft Research, 1970 YouTube video clips.
Each video clip is annotated with about 40 English sentences.
Length: 10-25 s.
Example captions: A man is mixing a batter. / A man is stirring batter in a metal bowl. / A person is stirring a flour mixture. / A girl is drinking from a cup. / A child is drinking from a cup. / The little girl drank from her cup in the bathroom.
8 different sound events:
Event: distinct sound events, e.g., gunshots, banging.
Background: continuous sounds, e.g., wind, engine noise.
Playing: instrument playing, e.g., piano, guitar, flute.
Performance: human voice with music, e.g., dancing, singing.
Narration: the original sound of the recording plus an added voice-over.
Dialogue: two or more people clearly conversing in the video.
Teaching: live voice captured while filming.
Kitchen: kitchen sounds, e.g., chopping, boiling water.
Number of video clips (train / validation / test / total):
Original: 1027 / 83 / 549 / 1659
Background music deleted: 835 / 69 / 450 / 1354
Acoustic Scene Classification pre-train dataset: TUT Acoustic Scenes 2016 [7]
For each acoustic scene there are 78 segments; 1170 segments in total.
Length: 30 seconds. Recording sampling rate: 44100 Hz.
15 different acoustic scenes:
Lakeside beach (outdoor), bus (riding a bus in the city; vehicle), cafe/restaurant (small cafe or restaurant; indoor), car (riding as a passenger in the city; vehicle), city center (outdoor), forest path (outdoor), grocery store (medium-sized; indoor), home (indoor), library (indoor), metro station (indoor), office (typical workday; indoor), urban park (outdoor), residential area (outdoor), train (traveling; vehicle), tram (traveling; vehicle).
[7] A. Mesaros, T. Heittola, and T. Virtanen, "TUT Database for Acoustic Scene Classification and Sound Event Detection," 24th European Signal Processing Conference (EUSIPCO), pp. 1128-1132, Aug. 2016.
Experiment – Audio kernel size selection
Figure: per-scene classification accuracy (%) for the compared kernel sizes (e.g., beach 80.77-96.15, bus 100.00, cafe/restaurant 46.15-61.54).
Overall accuracy across the compared kernel sizes: 82.31%, 84.36%, 86.67%.
Metrics
BLEU: computes a modified precision using n-grams, n = 1, 2, 3, 4.
METEOR: matches synonyms along with standard exact word matching.
ROUGE-L: based on statistics of the longest common subsequence (LCS).
CIDEr-D: removes stemming and introduces a Gaussian penalty.
These are algorithms for evaluating the quality of machine-translated text; the core idea behind them is "the closer a machine translation is to a professional human translation, the better it is."
BLEU: longer matching n-grams indicate fluency, i.e., how well the output reads as good English.
METEOR: synonym matching in addition to exact word matching.
ROUGE-L: the longest common subsequence takes sentence-level structural similarity into account.
CIDEr-D: removing stemming ensures the correct word form is used; the basic CIDEr metric can give high scores when long sentences repeat high-confidence words, so a Gaussian penalty function is introduced.
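For reference, a BLEU-style score can be computed with NLTK as below; this only illustrates the n-gram precision idea and is not necessarily the evaluation toolkit used in the paper:

```python
# Sketch: BLEU-4 for one caption against one reference (uniform 1- to 4-gram weights).
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "man", "is", "playing", "with", "his", "dog"]]
candidate = ["a", "man", "is", "playing", "with", "a", "dog"]

score = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(round(score, 4))
```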
Experiment – Videos with background music deleted; acoustic scene and sound event features added

Metrics                           B_1     B_2     B_3     B_4     Meteor  Rouge-L  Cider-D
Image only (Base)                 0.8224  0.7154  0.6260  0.5310  0.3490  0.7148   0.8160
Image+audio                       0.8278  0.7227  0.6316  0.5365  0.3425  0.7144   0.7962
  Improvement                     0.54%   0.73%   0.56%   0.55%   -0.65%  -0.04%   -1.98%
Image+audio normalization [-1~1]  0.8334  0.7282  0.6401  0.5475  0.3513  0.7221   0.8387
  Improvement                     1.10%   1.29%   1.41%   1.66%   0.23%   0.73%    2.27%
Image+audio normalization [0~1]   0.8283  0.7244  0.6352  0.5398  0.3518  0.7218   0.8175
  Improvement                     0.59%   0.90%   0.92%   0.88%   0.27%   0.70%    0.15%

Improvement = ((image with audio) - (image only)) x 100%
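The normalization compared in the table can be sketched as a simple min-max rescaling of the audio feature before concatenation with the image feature; variable names and the 128-way audio dimension below are illustrative assumptions:

```python
# Sketch: rescale the audio feature to [0, 1] or [-1, 1] so neither modality dominates.
import numpy as np

def rescale(x, lo, hi):
    x_min, x_max = x.min(), x.max()
    unit = (x - x_min) / (x_max - x_min + 1e-8)   # map to [0, 1]
    return lo + unit * (hi - lo)                   # then to [lo, hi]

audio_feat = np.random.randn(128)
image_feat = np.random.rand(2560)
joint_feat = np.concatenate([image_feat, rescale(audio_feat, -1.0, 1.0)])  # best setting in the table
```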
Caption examples
Ground truth: a man is playing with his dog
Image only: a man is playing with a toy
Image+audio: a monkey is playing
Image+audio [-1~1]: a man is playing with a dog
Image+audio [0~1]: a man and two girl are running on beach
Other example captions from the slide: two men are dancing / a man is riding a boat / two men are racing / the man is playing basketball / a boy is playing / a boy is playing basketball / a man is playing a basketball / a boy is playing football
Conclusion
We have proposed a video caption technique based on joint image-audio-semantic processing.
Adding audio features achieves significant improvement.
Setting the audio normalization to [-1~1] with an image-audio ratio of 1:1 gives the best performance.
Combining sound events and acoustic scenes improves video captioning: the BLEU scores increase by at least 1%, the Cider-D score increases by 2.27%, and the Meteor and Rouge-L scores also increase by about 0.2% and 0.7%, respectively.
Through the automatic caption scoring metrics, we find that adding audio features helps the whole semantic compositional network, and that normalization helps as well.
Thank you for listening.