Video Caption Technique Based on Joint Image-Audio Deep Learning

Video Caption Technique Based on Joint Image-Audio Deep Learning
ICCE-Berlin 2019
Authors: Chien-Yao Wang, Pei-Sin Liaw, Kai-Wen Liang, Jia-Ching Wang, Pao-Chi Chang
Presenter: Dr. Pao-Chi Chang, National Central University, Taiwan
September 8, 2019
Speaker notes: Good afternoon! In this talk, I will present … The focus will be on the balance between audio and video features; namely, we perform feature normalization to maximize the contributions of both the audio and the image features. This work was done by Prof. Wang … and my group. I am …

Outline
- Introduction to Video Caption Technique
- Proposed Joint Image-Audio Based Video Caption System
- Simulation Environment and Dataset
- Experiments
- Conclusion

Introduction to Video Caption Technique (1/2)
Goal: analyze videos to understand their content, and use natural language to describe it.
Classification: "Girl"
Caption: "A girl is waking up in bed."
Speaker notes: Video captioning extends deep-learning-based techniques such as image recognition, video classification, and action recognition; the aim is for the machine to learn and understand the video content and describe it in natural language. For example, image classification alone might label this image "Girl", but after viewing the whole video, the caption "A girl is waking up in bed." is more appropriate.

Introduction to Video Caption Technique (2/2)
The first technical report to address the video caption problem [1]:
- CNN (AlexNet, fc7) + RNN (two-layer LSTM)
- "One-hot" vectors are used to represent words
Speaker notes: The first technical report on video captioning was published in 2014 and used this architecture: a CNN (AlexNet with a fully connected layer) extracts features from all frames of the video; after mean pooling, the features are fed into an LSTM decoder, which generates the text.
[1] S. Venugopalan, H. Xu, J. Donahue, "Translating videos to natural language using deep recurrent neural networks," arXiv preprint arXiv:1412.4729, 2014.

Architecture of Proposed Method
The proposed system is built around a Semantic Compositional Network (SCN), realized as an SCN-RNN.

Image Features Processing
- RGB frames of the videos are used as input, sampled at 2 frames per second.
- A 2D CNN (ResNet-152 [4]) and a 3D CNN (C3D [5]) extract features.
- Mean pooling [6] yields 2048-way 2D CNN features and 512-way 3D CNN features.
- Concatenating the 2D CNN and 3D CNN features produces a 2560-way image feature for each video; a minimal sketch of this step follows.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
[5] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," ICCV, 2015.
[6] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," NAACL, 2015.
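The pooling-and-concatenation step can be illustrated with a short NumPy sketch. The CNN extractors themselves are assumed to have already produced per-frame and per-clip features; the frame and clip counts below are hypothetical.

```python
import numpy as np

n_frames, n_clips = 20, 4                   # hypothetical video length
feats_2d = np.random.rand(n_frames, 2048)   # ResNet-152 features, one row per sampled frame
feats_3d = np.random.rand(n_clips, 512)     # C3D features, one row per 16-frame clip

video_2d = feats_2d.mean(axis=0)            # mean pooling over frames -> (2048,)
video_3d = feats_3d.mean(axis=0)            # mean pooling over clips  -> (512,)
image_feature = np.concatenate([video_2d, video_3d])
print(image_feature.shape)                  # (2560,) per-video image feature
```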

2-D CNN vs. 3-D CNN

                 2-D CNN: ResNet-152 [4]            3-D CNN: C3D network [5]
Pre-trained on   ImageNet dataset                   Sports-1M dataset
Input size       224x224x3                          112x112x3
Output           2048 ways / frame (conv-5 layer)   4096 ways / 8 frames (fc-7 layer)
Highlights       Winner of the 2015 ILSVRC;         Video clips of 16 frames,
                 residual learning                  with an overlap of 8 frames

Speaker notes: 2-D: ResNet-152 won the 2015 ImageNet Large Scale Visual Recognition Competition (ILSVRC) with an error rate of 3.5%; the input images need scaling. Residual learning means the network does not have to pass signals strictly layer by layer but can skip across layers: if a unit's input is x and the desired output is H(x), passing x directly to the output changes the learning target to F(x) = H(x) - x. 3-D: Video is three-dimensional; the Sports-1M dataset contains 1,133,158 videos annotated with 487 sports labels. The figure shows a 3D kernel scanning the whole video clip.

Audio Features Processing
- Audio signals pass through a Short-Time Fourier Transform (STFT) and a Mel filter bank to obtain a log Mel-scale spectrogram of size 40 mels x frames.
- An Asymmetrical Kernel Convolutional Neural Network (AKCNN) [6] extracts the features; a minimal sketch of the front end follows.
- Acoustic scene classification (ASC) features and sound event detection (SED) features are concatenated to form the audio features.
Speaker notes: The raw signal first undergoes the STFT and Mel-scale mapping. The STFT uses a Hamming window of 40 ms that advances by its own length each step; the Mel mapping uses 40 triangular windows, producing a log Mel-scale spectrogram of size 40 mels x frames, where frames = 44100 x seconds / 822. In the AKCNN, one kernel dimension is temporal and the other is frequency. Separate datasets are used to train the ASC and the SED networks. (Figure (a): audio pre-processing.)
[6] Y. C. Wu, P. C. Chang, C. Y. Wang, J. C. Wang, "Asymmetrical Kernel Convolutional Neural Network for acoustic scenes classification," IEEE International Symposium on Consumer Electronics (ISCE), May 2018.
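A minimal sketch of this log Mel front end, using librosa as a stand-in for the authors' tooling: the input file name is hypothetical and the FFT size is an assumption, while the 822-sample hop and 40 Mel bands follow the slide.

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=44100)   # hypothetical input clip
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,        # assumed FFT size
    hop_length=822,    # matches frames = 44100 * seconds / 822
    n_mels=40)         # 40 Mel bands, as on the slide
log_mel = np.log(mel + 1e-10)                # log scale; epsilon avoids log(0)
print(log_mel.shape)                         # (40, frames)
```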

Asymmetrical Kernel Convolutional Neural Network [6]
- 1st convolutional layer: kernel size 7x5, ReLU activation
- 2nd convolutional layer
- Max-pooling: 5x5 window, without overlap
- Fully connected layer, softmax activation
- Loss function: cross entropy
A sketch of this architecture follows.
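The slide leaves several hyperparameters unspecified, so the following PyTorch sketch fills them with assumptions: the channel counts, the second convolution's kernel, and the fixed input length are invented, while the 7x5 first kernel, ReLU, non-overlapping 5x5 max-pooling, and softmax/cross-entropy objective come from the slide.

```python
import torch
import torch.nn as nn

class AKCNNSketch(nn.Module):
    def __init__(self, n_classes=15, n_mels=40, n_frames=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(7, 5)),   # asymmetric kernel: 7 (frequency) x 5 (time)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(7, 5)),  # second conv; kernel size assumed
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=5, stride=5),  # 5x5 window, stride 5 -> no overlap
        )
        with torch.no_grad():                       # infer the flattened feature size
            n_flat = self.features(torch.zeros(1, 1, n_mels, n_frames)).numel()
        self.classifier = nn.Linear(n_flat, n_classes)

    def forward(self, x):                           # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))

model = AKCNNSketch()
logits = model(torch.randn(2, 1, 40, 500))          # two fixed-length log-Mel spectrograms
# CrossEntropyLoss applies softmax internally (softmax output + cross-entropy loss)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))
```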

Semantic Features Processing
- The 300 most common words in the training captions (mostly nouns, verbs, and adjectives) determine the vocabulary of tags.
- The problem is treated as a multi-label classification task.
- It is implemented as a multilayer perceptron (MLP) with the logistic sigmoid function; a minimal sketch follows.
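A minimal sketch of such a tag detector: the hidden size and the use of the 2560-way video feature as input are assumptions; the 300-tag output with an independent sigmoid per tag is taken from the slide.

```python
import torch
import torch.nn as nn

tagger = nn.Sequential(
    nn.Linear(2560, 512),   # video feature -> hidden layer (size assumed)
    nn.ReLU(),
    nn.Linear(512, 300),    # one logit per tag word
)
video_feat = torch.randn(8, 2560)              # a batch of hypothetical video features
tag_probs = torch.sigmoid(tagger(video_feat))  # independent probability per tag
targets = torch.randint(0, 2, (8, 300)).float()  # multi-hot ground-truth tag vectors
loss = nn.BCELoss()(tag_probs, targets)        # multi-label classification objective
```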

Word Embedding
Dictionary: <eos> -> 0, a -> 1, chipmunk -> 2, is -> 3, eating -> 4
Text input: "a chipmunk is eating <eos>" -> [1, 2, 3, 4, 0]
One-hot encoding:
  v_<eos>    = [1 0 0 0 0]
  v_a        = [0 1 0 0 0]
  v_chipmunk = [0 0 1 0 0]
  v_is       = [0 0 0 1 0]
  v_eating   = [0 0 0 0 1]
Multilayer perceptron output (dense embeddings):
  v_<eos>    = [0.98 0.53 0.41]
  v_a        = [0.83 0.42 0.61]
  v_chipmunk = [0.32 0.64 0.12]
  v_is       = [0.03 0.56 0.72]
  v_eating   = [0.38 0.54 0.26]
Q: Is the output per word or per sentence? (See the sketch below.)
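The slide's example can be reproduced with a short NumPy sketch; the embedding matrix below is a random stand-in for the learned MLP weights, and the 3-dimensional output is shrunk for readability. Each word gets its own dense vector.

```python
import numpy as np

vocab = {"<eos>": 0, "a": 1, "chipmunk": 2, "is": 3, "eating": 4}
sentence = ["a", "chipmunk", "is", "eating", "<eos>"]
ids = np.array([vocab[w] for w in sentence])   # [1, 2, 3, 4, 0]

one_hot = np.eye(len(vocab))[ids]      # (5, 5): one one-hot row per word
W_embed = np.random.rand(len(vocab), 3)  # learned 5 -> 3 projection (random stand-in)
embedded = one_hot @ W_embed           # (5, 3): one dense vector per word
# Equivalent and cheaper: a direct row lookup, embedded = W_embed[ids]
```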

Video Description Model: SCN-LSTM
- The concatenation of the image and audio features is fed into the LSTM to initialize the first step, which is expected to provide an overview of the video content.
- The LSTM parameters are composed with the semantic features to generate the caption; a sketch of this composition follows.
- a: audio feature, i: image feature, s: semantic feature
Speaker notes: Figure (a) shows the basic LSTM; figure (b) shows the SCN-LSTM of the Semantic Compositional Network (SCN-RNN). After the weight matrix is generated, the first output of the sentence is obtained from the SCN; this output and the weight matrix are combined as the input for the next step. The Adam optimizer (Adaptive Moment Estimation) is used. Q: What is the relationship with the word embedding? It reduces the dimensionality from about 10k to 300.
[7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, L. Deng, "Semantic Compositional Networks for Visual Captioning," CVPR, 2017.
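The core compositional idea of SCN-LSTM [7] can be sketched in NumPy: each LSTM weight matrix becomes a mixture of per-tag matrices, weighted by the tag probabilities s. The dimensions below are shrunk for illustration, and the full weight tensor is materialized only for clarity; the paper factorizes it to keep the parameter count manageable.

```python
import numpy as np

n_tags, d_in, d_out = 300, 50, 64          # illustrative sizes, not the paper's
s = np.random.rand(n_tags)                 # semantic feature: tag probabilities
W_k = np.random.randn(n_tags, d_out, d_in) # one candidate weight matrix per tag
W_s = np.tensordot(s, W_k, axes=1)         # W(s) = sum_k s_k * W_k -> (d_out, d_in)

x = np.random.randn(d_in)                  # embedded word at the current step
pre_activation = W_s @ x                   # semantics-dependent LSTM input transform
```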

Outline
- Introduction to Video Caption Technique
- Proposed Joint Image-Audio Based Video Caption System
- Simulation Environment and Dataset
  - Simulation Environment
  - Youtube2text Dataset
  - Acoustic scene classification pre-training dataset: TUT Acoustic Scenes 2016
- Experiments
- Conclusion

Simulation Environment

                     Audio feature extraction             Video description
CPU                  Intel Xeon E5-2630 v3 @ 2.4 GHz x 2  Intel Core i7-6700 @ 3.40 GHz x 8
GPU                  GeForce GTX TITAN X                  GeForce GTX 1070
RAM                  256 GB DDR4-2133                     16 GB DDR4-2133
OS                   Ubuntu 16.04 x64                     Ubuntu 16.04 x64
Software             Python 2.7                           Python 2.7, Matlab 2016a
Neural network tool  TensorFlow                           Theano

Speaker notes: The two environments differ because this work continues earlier studies in our laboratory.

Youtube2text Dataset
- From Microsoft Research: 1,970 YouTube video clips.
- Each video clip is annotated with about 40 English sentences.
- Length: 10-25 s.
- Example captions: "A man is mixing a batter." / "A man is stirring batter in a metal bowl." / "A person is stirring a flour mixture." and "A girl is drinking from a cup." / "A child is drinking from a cup." / "The little girl drank from her cup in the bathroom."
- 8 different sound-event categories:
  - Event: distinct sound signatures, e.g., gunshots or objects being banged
  - Background: continuous sounds, e.g., wind or engine noise
  - Playing: musical instruments, e.g., piano, guitar, flute
  - Performance: human voice with music, e.g., dancing, singing
  - Narration: the original recorded sound plus an added voice-over
  - Dialogue: two or more people clearly conversing in the video
  - Teaching: live voice captured while filming
  - Kitchen: kitchen sounds, e.g., chopping or boiling water
- Number of video clips (train / validation / test / total):
  Original:                  1027 / 83 / 549 / 1659
  Background music deleted:   835 / 69 / 450 / 1354

Acoustic Scene Classification Pre-training Dataset: TUT Acoustic Scenes 2016 [7]
- 78 segments per acoustic scene; 1,170 segments in total.
- Length: 30 seconds; recorded at a sampling rate of 44,100 Hz.
- 15 different acoustic scenes: lakeside beach, bus, cafe/restaurant, car, city center, forest path, grocery store, home, library, metro station, office, urban park, residential area, train, tram.
Speaker notes: The scenes span indoor environments (cafe/restaurant, grocery store, home, library, metro station, office), outdoor environments (lakeside beach, city center, forest path, urban park, residential area), and vehicles (bus, car, train, tram).
[7] A. Mesaros, T. Heittola, and T. Virtanen, "TUT Database for Acoustic Scene Classification and Sound Event Detection," 24th European Signal Processing Conference (EUSIPCO), pp. 1128-1132, Aug. 2016.

Experiment: Audio Kernel Size Selection
[Table: per-scene classification accuracy (%) on the 15 TUT acoustic scenes for the compared kernel sizes; not fully recoverable from the transcript. Recoverable values include beach 80.77 / 88.46 / 96.15 / 92.31, bus 100.00, cafe/restaurant 61.54 / 46.15 / 53.85 / 57.69, city center 84.62, library 50.00, park 69.23, and residential area 73.08. Overall accuracies of the compared configurations: 82.31, 84.36, and 86.67.]

Metrics
- BLEU: computes the modified n-gram precision for n = 1, 2, 3, 4.
- METEOR: matches synonyms along with standard exact word matching.
- ROUGE-L: based on longest common subsequence (LCS) statistics.
- CIDEr-D: removes stemming and introduces a Gaussian penalty.
Speaker notes: These are algorithms for evaluating the quality of text machine-translated from one natural language to another; the core idea is that the closer a machine translation is to a professional human translation, the better it is. For BLEU, longer matching n-grams indicate fluency, i.e., how well the output reads as "good English." METEOR adds synonym matching to exact word matching. ROUGE-L counts the longest common subsequence, which captures structural similarity between sentences. For CIDEr-D, removing stemming ensures that the correct word forms are used; the basic CIDEr metric can yield inflated scores when long sentences repeat high-confidence words, so a Gaussian penalty function is introduced. A small BLEU example follows.
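A small example of the BLEU-n scoring, using NLTK as a stand-in for the evaluation toolkit (the slides do not specify the authors' exact scorer); smoothing is added so that short sentences with missing higher-order n-grams still get a nonzero score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "is", "playing", "with", "his", "dog"]]  # ground truth
candidate = ["a", "man", "is", "playing", "with", "a", "dog"]      # system output

smooth = SmoothingFunction().method1
for n in range(1, 5):                            # BLEU-1 .. BLEU-4
    weights = tuple(1.0 / n for _ in range(n))   # uniform n-gram weights
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print("BLEU-%d: %.4f" % (n, score))
```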

Experiment: Image-Audio Feature Fusion
(Videos with background music deleted; acoustic scene and sound event features added.)

Metrics                         B-1     B-2     B-3     B-4     METEOR  ROUGE-L CIDEr-D
Image only (base)               0.8224  0.7154  0.6260  0.5310  0.3490  0.7148  0.8160
Image+audio                     0.8278  0.7227  0.6316  0.5365  0.3425  0.7144  0.7962
  Improvement                   0.54%   0.73%   0.56%   0.55%   -0.65%  -0.04%  -1.98%
Image+audio, normalized [-1~1]  0.8334  0.7282  0.6401  0.5475  0.3513  0.7221  0.8387
  Improvement                   1.10%   1.29%   1.41%   1.66%   0.23%   0.73%   2.27%
Image+audio, normalized [0~1]   0.8283  0.7244  0.6352  0.5398  0.3518  0.7218  0.8175
  Improvement                   0.59%   0.90%   0.92%   0.88%   0.27%   0.70%   0.15%

Improvement = ((image with audio) - (image only)) x 100%. A sketch of the two normalization schemes follows.
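A minimal sketch of the two normalization schemes compared above, applied before concatenating the image and audio features so that neither modality dominates by scale; the per-vector normalization axis and the feature dimensions are assumptions.

```python
import numpy as np

def norm01(x):
    """Min-max normalize a feature vector to [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-10)

def norm11(x):
    """Min-max normalize a feature vector to [-1, 1]."""
    return 2.0 * norm01(x) - 1.0

image_feat = np.random.rand(2560) * 8.0   # hypothetical, differently scaled features
audio_feat = np.random.rand(1024) * 0.2   # hypothetical audio feature
joint = np.concatenate([norm11(image_feat), norm11(audio_feat)])  # balanced scales
```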

Example Captions
Ground truth: a man is playing with his dog
Image only: a man is playing with a toy
Image+audio: a monkey is playing
Image+audio [-1~1]: a man is playing with a dog
Image+audio [0~1]: a man and two girl are running on beach
Further example captions shown on the slide: two men are dancing / a man is riding a boat / two men are racing / the man is playing basketball / a boy is playing / a boy is playing basketball / a man is playing a basketball / a boy is playing football.

Conclusion
- We have proposed a video caption technique based on joint image-audio-semantic processing.
- Adding audio features achieves a significant improvement; the automatic caption-scoring metrics confirm that the audio features help the whole semantic compositional network, and that normalization helps.
- Normalizing the audio features to [-1~1] with an image-to-audio ratio of 1:1 gives the best performance.
- Combining sound event and acoustic scene features improves video captioning: the BLEU scores increase by at least 1%, the CIDEr-D score by 2.27%, and the METEOR and ROUGE-L scores by about 0.2% and 0.7%, respectively.

Thank you for listening.