arXiv:1804.08274v1 [cs.CV] 23 Apr 2018
{yehaoli.sysu, panyw.ustc}@gmail.com, {tiyao, tmei}@microsoft.com, isschhy@mail.sysu.edu.cn
Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei

School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
Microsoft Research, Beijing, China
University of Science and Technology of China, Hefei, China
Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education

Abstract

Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem nevertheless is not trivial, especially when a video contains multiple events worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model differs from existing dense video captioning methods since we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on the ActivityNet Captions dataset and our framework shows clear improvements when compared to the state-of-the-art techniques. More remarkably, we obtain a new record: METEOR of 12.96% on the ActivityNet Captions official test set.

1. Introduction

The recent advances in 2D and 3D Convolutional Neural Networks (CNNs) have successfully pushed the limits and improved the state-of-the-art of video understanding.
For instance, the first-ranked performance achieves 8.8% in terms of top-1 error in the untrimmed video classification task of ActivityNet Challenge 2017 [9]. As such, it has become possible to recognize a video with a pre-defined set of labels or categories. In a further step to describe a video with a complete and natural sentence, video captioning [19, 32, 34] has expanded the understanding from individual labels to a sequence of words to express richer semantics and relationships in the video. Nevertheless, considering that videos in real life are usually long and contain multiple events, the conventional video captioning methods, which generate only one caption for a video, will in general fail to recapitulate all the events in the video. Take the video in Figure 1 as an example: the output sentence generated by a popular video captioning method [32] is unable to describe the procedure of playing frisbee with a dog in detail. As a result, the task of dense video captioning was recently introduced in [16], and the ultimate goal is to generate a sentence for each event occurring in the video.

Figure 1. Examples of video captioning and dense video captioning (upper row: input video; middle row: the sentence generated by a video captioning method; bottom row: temporally localized sentences generated by a dense video captioning approach).

This work was performed at Microsoft Research Asia.
The difficulty of dense video captioning originates from two aspects: 1) how to accurately localize each event in time? 2) how to design a powerful sentence generation model? In the literature, several techniques have been proposed for each individual aspect, including temporal action proposal [2, 3, 6, 17] and image/video captioning [32, 34, 37]. However, simply solving the problem of dense video captioning in a two-stage way, i.e., first temporal event proposal and then sentence generation, may destroy the interaction between localizing and describing events, resulting in a sub-optimal solution.

This paper proposes a novel deep architecture to unify the accurate localization of temporal events with the descriptive principle of sentence generation for dense video captioning. Technically, we devise a new descriptiveness regression component and integrate it into a single shot detection framework as a bridge, on one hand to measure the complexity of each event being described in sentence generation, and on the other, to adjust the event proposals. More specifically, the descriptiveness regression guides the learning of temporal event proposals together with event/background classification and temporal coordinate regression. In between, event/background classification is to predict event proposals and temporal coordinate regression is to refine the temporal boundaries of each proposal. Furthermore, the inference of descriptiveness regression is employed as an attention to weight video clips in each proposal locally. The proposal-level representation is then averaged over all the clip-level representations in the proposal, weighted by the holistic attention score of each clip, and finally fed into an attribute-augmented captioning architecture for sentence generation. As such, the task of dense video captioning could be jointly learnt and globally optimized in an end-to-end manner.

The main contribution of this work is the proposal of a new architecture to unify the temporal localization of event proposals and sentence generation for dense video captioning. The solution also leads to the elegant view of what kind of interaction should be built across the two sub-problems and how to model and integrate the interaction in a deep learning framework, which are problems not yet fully understood in the literature.

2. Related Work

Temporal Action Proposal. [5] is one of the early works that detect temporal segments containing the action of interest in a sliding window fashion. Next, a few subsequent works [2, 3, 6, 8, 27] tackle temporal action proposal by leveraging action classifiers on a smaller number of temporal windows.
In particular, Sparse-prop [3] utilizes dictionary learning to encode representations of trimmed action instances and then retrieves the most representative segments from testing videos, which are treated as the class-independent proposals. S-CNN [27] trains a 3D CNN to classify a video segment as background or being-action and employs varied-length temporal windows for multi-scale action proposal generation. Later in [6], DAP utilizes Long Short-Term Memory (LSTM) to encode video streams and enables multi-scale proposal generation inside the streams with a single pass through the video, obviating the need for deploying sliding windows on multiple scales. Furthermore, Buch et al. develop SST based on DAP by constructing non-overlapping sliding windows over the input video and encoding each window sequentially with a Gated Recurrent Unit (GRU) in [2]. Most recently, Gao et al. [8] design temporal coordinate regression for temporal action proposal generation.

Video Captioning. The research in this direction has proceeded along two different dimensions: template-based language methods [11, 15, 26] and sequence learning approaches (e.g., RNNs) [16, 19, 32, 33, 34, 38]. Template-based language methods directly generate the sentence with detected keywords in predefined language templates. Sequence learning approaches utilize CNN plus RNN architectures to generate novel sentences with more flexible syntactical structures. In [33], Venugopalan et al. present an LSTM-based model to generate video descriptions with the mean pooling representation over all frames. The framework is then extended by inputting both frames and optical flow images into an encoder-decoder in [32]. Compared to mean pooling, Yao et al. propose to utilize the temporal attention mechanism to exploit temporal structure for video captioning [34]. Later in [38], a hierarchical RNN is devised to further capture the inter-sentence dependency, targeting the description of a long video with a paragraph consisting of multiple sentences.
Different from video paragraph captioning with non-overlapping and annotated temporal intervals, a more challenging task, named dense video captioning, was recently introduced in [16]; it involves both detecting and describing multiple events in a video. A two-stage dense-captioning system is thus designed by leveraging DAP [6] to localize temporal event proposals and an LSTM-based sequence learning module to describe each event proposal. Most recently, [35] additionally incorporates a KNN-based retrieval module into the LSTM-based sequence learning module to boost video captioning.

Summary. Our work aims to detect and describe events in videos, i.e., dense video captioning. Different from the aforementioned method [16], our approach contributes by studying not only detecting the events with the simple objective of binary classification (i.e., event or background) and modeling sentence generation with LSTM, but also enhancing the temporal event proposal by utilizing both temporal boundary regression to correct the start and end time of events and descriptiveness regression to infer whether the event can be well described from the language perspective. Moreover, the sentence generation module is further boosted by leveraging semantic attributes and reinforcement learning to optimize LSTM with non-differentiable metrics.

Figure 2. An overview of our Dense Video Captioning framework, mainly including Temporal Event Proposal (TEP) and Sentence Generation (SG) (better viewed in color). The input video is first encoded into a series of clip-level features via a 3-D CNN, which are fed into the TEP module to produce candidate proposals. The TEP module integrates event/background classification to predict event proposals, temporal coordinate regression to refine the temporal boundaries of each proposal, and descriptiveness regression to infer the descriptive complexity of each event, into a single shot detection architecture. After ranking the candidate proposals with regard to both eventness and descriptiveness scores, the top proposals are in turn injected into the SG module for sentence generation. The SG module leverages both attributes and reinforcement learning based optimization to enhance captioning.

3. Dense Video Captioning

The basic idea of this work is to automatically describe multiple events in videos by temporally localizing event proposals and generating language sentences for each event proposal. The temporal event proposal (TEP) is performed by encapsulating the event classification to recognize video segments of events from backgrounds, proposal generation to temporally localize the events, and a descriptiveness inference procedure to infer the descriptive complexity of each event, in one single network. Such a design enables straightforward temporal event proposal in a single shot manner to ease the training consumption. The sentence generation module (SG) leverages an attribute-augmented LSTM-based model for generating descriptions. Moreover, policy-gradient based reinforcement learning is adopted to optimize LSTM with evaluation metric based reward, harmonizing the module with respect to the testing inference.
Please note that the descriptiveness inference procedure in the TEP module is not only leveraged to additionally refine the localized event proposals from the language perspective through descriptiveness regression, but also integrated into the SG module, where the descriptiveness scores serve as one kind of temporal attention over clips for fusing them, in a weighted manner, into the input proposal-level representation of LSTM. As such, our system including both TEP and SG modules can be jointly trained through the global optimization of detection and captioning in an end-to-end manner. An overview of our dense video captioning system is illustrated in Figure 2.

3.1. Problem Formulation

Suppose we have a video $V = \{v_t\}_{t=1}^{T_v}$ with $T_v$ frames/clips, where $v_t$ denotes the $t$-th frame/clip in temporal order. The ultimate target of our dense video captioning system is to generate a set of temporally localized descriptions $\Phi_v = \{\phi^i = (t^i_{start}, t^i_{end}, S_i)\}_{i=1}^{M_v}$ for the input video $V$, where $M_v$ is the number of sentences, $t^i_{start}$ and $t^i_{end}$ represent the starting time and ending time for each sentence $S_i$, and $S_i = \{w_1, w_2, \ldots, w_N\}$ consists of $N$ words.

Hence the TEP module in our system is firstly utilized to produce a set of candidate proposals for the input video $V$:

$$\Phi_p = \{\phi^i_p = (t^i_{start}, t^i_{end}, p^i_{event}, p^i_{des})\}_{i=1}^{N_p}, \tag{1}$$

where $p^i_{event}$ is the probability of recognizing the candidate as an event (i.e., eventness score), $p^i_{des}$ denotes the descriptiveness score measuring how well the candidate can be described from the language perspective, and $N_p$ is the total number of candidate proposals. By consolidating the idea of selecting proposals from both vision and language perspectives, all the candidates are ranked according to the fused score $p^i_{conf} = p^i_{event} + \lambda_0 p^i_{des}$, and only the candidates with a $p^i_{conf}$ higher than a threshold are injected into the SG module for captioning, denoted as $\hat{\Phi}_p$.
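The proposal selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Proposal` container, the function name `select_proposals`, and the threshold value of 0.5 are hypothetical and not taken from the paper, while the default weight of 0.2 for the descriptiveness score is the value reported in the experimental settings.

```python
from typing import List, NamedTuple


class Proposal(NamedTuple):
    """Hypothetical container for one candidate proposal of Eq. (1)."""
    t_start: float
    t_end: float
    p_event: float  # eventness score
    p_des: float    # descriptiveness score


def select_proposals(proposals: List[Proposal],
                     lambda0: float = 0.2,
                     threshold: float = 0.5) -> List[Proposal]:
    """Rank by the fused score p_event + lambda0 * p_des and keep those above threshold.

    The threshold default is an assumption made for this sketch.
    """
    scored = [(p.p_event + lambda0 * p.p_des, p) for p in proposals]
    kept = [(score, p) for score, p in scored if score > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept]


candidates = [
    Proposal(0.0, 10.0, 0.9, 0.5),  # fused score about 1.0
    Proposal(5.0, 8.0, 0.3, 0.2),   # fused score about 0.34, filtered out
    Proposal(2.0, 6.0, 0.6, 1.0),   # fused score about 0.8
]
selected = select_proposals(candidates)
```

Only the selected proposals would then be passed on to the captioning module.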
Inspired by the successes of sequence learning models in machine translation [28] and attributes utilized in image/video captioning [7, 20, 36], we formulate our SG module as an attribute-augmented LSTM-based model which encodes the input event proposal representation ($F$) and its detected attributes/categories ($A$) into a fixed dimensional vector and then decodes it to the output target sentence. As such, the sentence generation problem we exploit here can be formulated by minimizing the negative log probability of the correct textual sentence ($-\log \Pr(S \mid F, A)$). The negative log probability is typically measured with cross entropy loss, resulting in a discrepancy of evaluation between training and inference. Hence, to further boost our SG module by amending such discrepancy, we take inspiration from the reinforcement learning [23] leveraged in sequence learning and directly optimize LSTM by minimizing the following expected sentence-level reward loss:

$$L_{SG} = -\mathbb{E}_{S \sim p_\theta}[r(S)], \tag{2}$$

where $\theta$ denotes the parameters of LSTM that schedule a policy $p_\theta$ for generating sentences, and $r(S)$ is the reward measured by comparing the generated sentence $S$ to ground-truth sentences over non-differentiable evaluation metrics.

3.2. Temporal Event Proposal

Existing solutions for temporal action/event proposal mainly focus on detecting events with binary classifiers (i.e., event or background) in a sliding window fashion. However, the temporal sliding windows are typically too dense and even designed with multiple scales, resulting in heavy computation cost. Inspired by spatial boundary regression in object localization [24], which simultaneously predicts objectness score and object bounds, we integrate event classification with temporal coordinate regression for correcting the temporal bounds of events, pursuing both low-cost and high-quality event proposals. Moreover, for the dense video captioning task, event identification undoubtedly plays the major role in temporal event proposal, while proposals containing rich describable objects or scenes are also preferred by human beings in description generation. As such, a novel descriptiveness regression is especially devised to infer the descriptive complexity of each proposal, and further refine the event proposals from the language perspective.
Similar to single shot object detection in [18], all three components (i.e., event classification, temporal coordinate regression and descriptiveness regression) are elegantly integrated into one feed-forward CNN, aiming to simultaneously produce a fixed-size set of proposals, scores for the presence of events in the proposals (i.e., eventness scores), and descriptiveness scores of the proposals.

Input. Technically, given the input video $V = \{v_t\}_{t=1}^{T_v}$, a 3-D CNN is utilized to encode the frame sequence into a series of clip-level features $\{f_t = F(v_t : v_{t+\delta})\}_{t=1}^{T_f}$, where $\delta$ is the temporal resolution of each feature $f_t$. In our experiments, C3D [30] is adopted as the 3-D CNN encoder $F$ with $\delta = 16$ frames and the temporal interval for encoding is set as 8 frames, resulting in the initial output feature map with the size of $T_f \times D_0$. Note that $D_0$ is the dimension of clip-level features and $T_f = T_v/8$ discretizes the video frames.

Network Architecture. The initial feature map of size $T_f \times D_0$ is fed into the 1-D CNN architecture in the TEP module, which consists of convolutional layers in three groups: base layers, anchor layers, and prediction layers, as shown in Figure 2. In particular, two base layers (conv1 and conv2) are firstly devised to reduce the temporal dimension of the feature map and increase the size of temporal receptive fields, producing the output feature map of size $T_f/$[...]. Then, we stack nine anchor layers (conv3 to conv11) on top of the base layer conv2, each of which is designed with the same configuration (kernel size: 3, stride size: 2, and filter number: 512). These anchor layers decrease the temporal dimension of the feature map progressively, enabling temporal event proposals at multiple temporal scales. For each anchor layer, its output feature map is injected into prediction layers to produce a fixed set of predictions in a one shot manner.
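To make the temporal bookkeeping concrete, the following rough sketch enumerates the 16-frame clips taken every 8 frames and traces how the temporal length shrinks through nine stride-2 anchor layers. The function names are hypothetical, and the padding scheme is an assumption (each stride-2 layer is taken to map a length T to ceil(T / 2)); the text specifies only the kernel size, stride and filter number. With a plain sliding window the clip count comes out slightly below $T_v/8$ at the video boundary.

```python
import math
from typing import List, Tuple


def clip_windows(num_frames: int, clip_len: int = 16, stride: int = 8) -> List[Tuple[int, int]]:
    """(start, end) frame indices, end exclusive, of each clip fed to the 3-D CNN."""
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


def anchor_layer_lengths(base_len: int, num_layers: int = 9, stride: int = 2) -> List[int]:
    """Temporal length after each anchor layer.

    Assumes padding such that a stride-2 convolution maps T to ceil(T / 2);
    this padding choice is an assumption of the sketch.
    """
    lengths = []
    t = base_len
    for _ in range(num_layers):
        t = max(1, math.ceil(t / stride))
        lengths.append(t)
    return lengths


windows = clip_windows(64)             # 7 clips: (0, 16), (8, 24), ..., (48, 64)
lengths = anchor_layer_lengths(256)    # 128, 64, ..., halving until length 1
```

Shallow anchor layers keep many short cells and therefore cover short events, while the deepest layers cover events spanning most of the video.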
Concretely, given an output feature map $f_j$ with the size of $T_{f_j} \times D_j$, the basic element (anchor) for predicting the parameters of a candidate proposal is a $1 \times D_j$ feature map cell that produces a prediction score vector $p_{pred} = (p_{cls}, \Delta c, \Delta w, p_{des})$ via fully connected layers. $p_{cls} = [p_{event}, p_{bk}]$ denotes the two-dimensional classification scores for event/background and $p_{des}$ is the descriptiveness score to infer the confidence of this proposal to be well described. $\Delta c$ and $\Delta w$ are two temporal offsets relative to the default center location $\mu_c$ and width $\mu_w$ of this anchor, which are leveraged to adjust its temporal coordinates as

$$\varphi_c = \mu_c + \alpha_1 \mu_w \Delta c, \quad \varphi_w = \mu_w \exp(\alpha_2 \Delta w), \quad t_{start} = \varphi_c - \tfrac{1}{2}\varphi_w, \quad t_{end} = \varphi_c + \tfrac{1}{2}\varphi_w, \tag{3}$$

where $\varphi_c$ and $\varphi_w$ are the refined center location and width of the anchor. $\alpha_1$ and $\alpha_2$ are utilized to control the impact of the temporal offsets, both of which are set as 0.1. $t_{start}$ and $t_{end}$ represent the adjusted starting and ending time of the anchor.

In addition, derived from the anchor boxes in [18, 24], we associate a set of default temporal boundaries with each feature map cell. The different temporal scale ratios for these default temporal boundaries are denoted as $R = \{r_d\}_{d=1}^{D} = \{1, 1.25, 1.5\}$. For each temporal scale ratio $r_d$, we can thus obtain one default center location ($\mu_{c_d} = \frac{t + 0.5}{T_{f_j}}$) and width ($\mu_{w_d} = \frac{r_d}{T_{f_j}}$) of the $t$-th feature map cell, resulting in a total of $T_{f_j} \cdot D$ anchors. Accordingly, for the feature map $f_j$, the set of all the produced proposals is defined as $\Phi_{f_j} = \{\phi_{f_j} = (t^u_{start}, t^u_{end}, p^u_{cls}, p^u_{des})\}_{u=1}^{N_U}$, where $N_U = T_{f_j} \cdot D$. By accumulating all the produced proposals of the output feature maps of the nine anchor layers, the final predicted proposal set is $\Phi_p = \{\Phi_{f_j}\} = \{\phi^i_p\}_{i=1}^{N_p}$.

Training. During training, a positive/negative label should be firstly assigned to each predicted proposal conditioned on the ground truth proposal set $\Phi_v$. Specifically, for each $\phi^i_p \in \Phi_p$, we measure its temporal Intersection over Union (tIoU) with each ground truth proposal and obtain the highest tIoU.
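The default anchors, the coordinate adjustment of Eq. (3), and the temporal IoU measure can be sketched as below. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, times are assumed to be normalized to [0, 1] as implied by the default center and width definitions, and `dc` and `dw` stand for the two predicted temporal offsets.

```python
import math
from typing import List, Tuple


def default_anchors(num_cells: int,
                    ratios: Tuple[float, ...] = (1.0, 1.25, 1.5)) -> List[Tuple[float, float]]:
    """Default (center, width) pairs for every cell of a feature map with num_cells cells."""
    return [((t + 0.5) / num_cells, r / num_cells)
            for t in range(num_cells) for r in ratios]


def decode(mu_c: float, mu_w: float, dc: float, dw: float,
           alpha1: float = 0.1, alpha2: float = 0.1) -> Tuple[float, float]:
    """Apply Eq. (3): refine center and width, then convert to (t_start, t_end)."""
    phi_c = mu_c + alpha1 * mu_w * dc
    phi_w = mu_w * math.exp(alpha2 * dw)
    return phi_c - 0.5 * phi_w, phi_c + 0.5 * phi_w


def tiou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal Intersection over Union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


anchors = default_anchors(4)                # 4 cells x 3 ratios = 12 anchors
segment = decode(0.5, 0.25, 0.0, 0.0)       # zero offsets leave the anchor unchanged
overlap = tiou((0.0, 1.0), (0.5, 1.0))
```

With zero offsets the decoded segment coincides with the default anchor, which is why the offsets can be read as small corrections scaled by the two impact factors.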
If the highest tIoU is larger than 0.7, $\phi^i_p$ is treated as a positive sample with regard to the corresponding ground truth proposal $\phi_g$; otherwise $\phi^i_p$ is a negative sample.

The training objective in our TEP module is formulated as a multi-task loss by integrating the event/background classification loss ($L_{event}$) for distinguishing events from background, the temporal coordinate regression loss ($L_{tcr}$) for adjusting the temporal coordinates of event proposals, and the descriptiveness regression loss ($L_{des}$) for inferring the descriptive complexity of each proposal, which is defined as

$$L_{TEP} = L_{event} + \alpha L_{tcr} + \beta L_{des}, \tag{4}$$

where $\alpha$ and $\beta$ are tradeoff parameters. The event/background classification loss $L_{event}$ is measured as the standard softmax loss over the whole predicted proposal set $\Phi_p$. The temporal coordinate regression loss $L_{tcr}$ is devised as the Smooth L1 loss [10] ($S_{L1}$) between the positive predicted proposals and the corresponding ground truth proposals. Similar to [24], both the temporal offset for the center location ($\varphi_c$) of the predicted proposal and that for its width ($\varphi_w$) are regressed as

$$L_{tcr} = \frac{1}{N_{pos}} \sum_{i=1}^{N_p} I_i \left( S_{L1}(\varphi^i_c - g^i_c) + S_{L1}(\varphi^i_w - g^i_w) \right), \tag{5}$$

where $I_i$ denotes the assigned label for predicted proposal $\phi^i_p$ (1 for a positive sample and 0 for a negative sample), $N_{pos}$ is the number of positive samples, and $g^i_c$ and $g^i_w$ represent the center location and width of the ground truth proposal. The descriptiveness regression loss $L_{des}$ is calculated as the Euclidean distance between the inferred descriptiveness score $p^i_{des}$ of the predicted proposal and its sentence-level reward $r(S_i)$ in SG with regard to ground truth sentences:

$$L_{des} = \frac{1}{N_p} \sum_{i=1}^{N_p} \left\| p^i_{des} - r(S_i) \right\|_2^2. \tag{6}$$

In particular, for each positive predicted proposal $\phi^i_p$, we obtain its sentence-level reward $r(S_i)$ by directly feeding this proposal into the SG module and comparing the generated sentence $S_i$ with the corresponding ground-truth sentence over evaluation metrics (e.g., METEOR). For each negative sample, its sentence-level reward is naturally fixed as 0.
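The multi-task objective of Eqs. (4)-(6) can be written out for a small batch of proposals as follows. This is a sketch under assumptions rather than the training code: the function names and the tuple layout are hypothetical, the event probability is taken as already softmax-normalized, and the classification loss is averaged over proposals (the text does not state the normalization). The defaults of 0.5 and 10 for the two tradeoff parameters are the values reported in the experimental settings.

```python
import math
from typing import List, Optional, Tuple


def smooth_l1(x: float) -> float:
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def tep_loss(preds: List[Tuple[float, float, float, float]],
             labels: List[int],
             gts: List[Optional[Tuple[float, float]]],
             rewards: List[float],
             alpha: float = 0.5,
             beta: float = 10.0) -> float:
    """L_TEP = L_event + alpha * L_tcr + beta * L_des.

    preds:   (p_event, phi_c, phi_w, p_des) per proposal
    labels:  1 for positive, 0 for negative proposals
    gts:     (g_c, g_w) of the matched ground truth, None for negatives
    rewards: sentence-level reward r(S_i); 0 for negatives
    """
    n = len(preds)
    n_pos = max(1, sum(labels))  # guard against a batch without positives
    l_event = l_tcr = l_des = 0.0
    for (p_event, phi_c, phi_w, p_des), y, gt, r in zip(preds, labels, gts, rewards):
        p_true = p_event if y == 1 else 1.0 - p_event
        l_event += -math.log(max(p_true, 1e-12))
        if y == 1 and gt is not None:
            g_c, g_w = gt
            l_tcr += smooth_l1(phi_c - g_c) + smooth_l1(phi_w - g_w)
        l_des += (p_des - r) ** 2
    return l_event / n + alpha * l_tcr / n_pos + beta * l_des / n


perfect = tep_loss([(1.0, 0.5, 0.2, 0.3)], [1], [(0.5, 0.2)], [0.3])
uncertain_negative = tep_loss([(0.5, 0.0, 0.0, 0.0)], [0], [None], [0.0])
```

A perfectly predicted positive proposal yields zero loss, whereas an undecided negative contributes only the classification term.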
Accordingly, by minimizing the descriptiveness regression loss, our prediction layer is additionally endowed with the ability to directly infer the approximate descriptiveness score (i.e., the sentence-level reward for captioning) of an event proposal without referring to ground-truth sentences.

3.3. Sentence Generation

Given the set of selected predicted event proposals $\hat{\Phi}_p \subset \Phi_p$ from the TEP module, each proposal $\hat{\phi}_p \in \hat{\Phi}_p$ is injected into the attribute-augmented LSTM-based model in the SG module for description generation. Specifically, the attribute representation $A$ of the predicted proposal $\hat{\phi}_p$ is firstly transformed into LSTM to inform the whole LSTM about the high-level attributes, followed by the proposal representation $F$, which is encoded into LSTM at the second time step. Then, LSTM decodes each output word based on the previous word and the previous step's hidden state.

Descriptiveness-driven Temporal Attention. One natural way to obtain the proposal representation $F$ is performing a mean pooling process over all the clips within this proposal. However, in many cases, the generated descriptions only relate to some key clips with low descriptive complexity. As a result, to pinpoint the local clips containing rich describable objects or scenes and further incorporate the contributions of different clips into producing the proposal representation, a descriptiveness-driven temporal attention mechanism is employed on the predicted event proposals. Given the input proposal $\hat{\phi}_p$ containing $N_{\hat{p}}$ clips $\{v_i\}_{i=1}^{N_{\hat{p}}}$, the clip-level descriptiveness score $p^{v_i}_{des}$ of each clip $v_i$ is firstly obtained by holistically taking the average of all the descriptiveness scores of predicted proposals containing this clip $v_i$. Here we treat the clip-level descriptiveness score of each clip as one kind of temporal attention over all the clips within this proposal.
Based on the attention distribution, we calculate the weighted sum of the clip features and obtain the aggregated proposal feature $F$ weighted by the holistic attention score of each clip:

$$\alpha_{v_i} = \frac{p^{v_i}_{des}}{\sum_{j=1}^{N_{\hat{p}}} p^{v_j}_{des}}, \qquad F = \sum_{i=1}^{N_{\hat{p}}} \alpha_{v_i} f_{v_i}, \tag{7}$$

where $\alpha_{v_i}$ is the normalized attention score and $f_{v_i}$ is the clip representation of $v_i$. The aggregated proposal feature could be regarded as a more informative proposal feature since the most descriptive clips for sentence generation have been distilled with higher attention weights.

Training. The training objective in our SG module is formulated as the expected sentence-level reward loss in Eq. (2). Inspired by Self-critical Sequence Training (SCST) [25], the gradient of this objective is given by

$$\nabla_\theta L_{SG} \approx -\left(r(S^s) - r(\hat{S})\right) \nabla_\theta \log p_\theta(S^s), \tag{8}$$

where $S^s$ is a sampled sentence and $r(\hat{S})$ denotes the reward of the baseline obtained by greedy decoding at inference.

3.4. Joint Detection and Captioning

The overall objective of our dense video captioning is comprised of the training objective of the TEP module in Eq. (4) and the reward loss of the SG module in Eq. (2):

$$L = \lambda_1 L_{TEP} + \lambda_2 L_{SG}, \tag{9}$$

where $\lambda_1$ and $\lambda_2$ are tradeoff parameters for TEP and SG, respectively. Note that descriptiveness inference could be regarded as a bridge which is not only leveraged in TEP for adjusting the event proposals from the language perspective, but also integrated into SG for measuring the descriptiveness-driven temporal attention to boost sentence generation, enabling the interaction between TEP and SG. As a result, the overall objective function of our system can be solved through the joint and global optimization of detection and captioning in an end-to-end manner.
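As an illustration of the descriptiveness-driven temporal attention in Eq. (7), the sketch below first derives a clip-level score by averaging the descriptiveness scores of all proposals that contain the clip, then pools clip features with the normalized scores. The function names, the containment test on (start, end) spans, and the fallback to uniform weights when all scores are zero are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def clip_descriptiveness(clip: Tuple[float, float],
                         proposals: Sequence[Tuple[float, float, float]]) -> float:
    """Average p_des over the proposals (t_start, t_end, p_des) that contain the clip."""
    scores = [p_des for (t_s, t_e, p_des) in proposals
              if t_s <= clip[0] and clip[1] <= t_e]
    return sum(scores) / len(scores) if scores else 0.0


def attention_pool(clip_feats: List[List[float]],
                   clip_scores: List[float]) -> Tuple[List[float], List[float]]:
    """Return the pooled proposal feature F and the attention weights of Eq. (7)."""
    total = sum(clip_scores)
    if total > 0:
        weights = [s / total for s in clip_scores]
    else:
        # Assumed fallback: plain mean pooling when no clip has a positive score.
        weights = [1.0 / len(clip_scores)] * len(clip_scores)
    dim = len(clip_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, clip_feats)) for d in range(dim)]
    return pooled, weights


proposals = [(0.0, 10.0, 0.8), (4.0, 10.0, 0.4)]
score_early = clip_descriptiveness((0.0, 2.0), proposals)   # only the first proposal contains it
score_late = clip_descriptiveness((4.0, 6.0), proposals)    # both proposals contain it
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```

Compared with mean pooling, clips covered by highly descriptive proposals contribute more to the feature handed to the sentence generator.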

4. Experiments

We conduct our experiments on the ActivityNet Captions dataset [16] and evaluate our proposed system on both dense video captioning and temporal event proposal tasks.

4.1. Dataset

The dataset, ActivityNet Captions, is a recently collected large-scale dense video captioning benchmark, which contains 20,000 videos covering a wide range of complex human activities. Each video is aligned with a series of temporally annotated sentences. On average, there are 3.65 temporally localized sentences for each video, resulting in a total of 100,000 sentences. In our experiments, we follow the settings in [16], and take 10,024 videos for training, 4,926 for validation and 5,044 for testing.

4.2. Dense Video Captioning Task

We firstly investigate our system on the dense video captioning task. The task is to detect individual events and then describe each event with natural language.

Compared Approaches. To empirically verify the merit of our proposed dense video captioning system, we compare the following video captioning baselines:

(1) Long Short-Term Memory (LSTM) [33]: LSTM utilizes a CNN plus RNN framework to directly translate from video pixels to natural language descriptions. The frame features are mean pooled to generate the video features.

(2) Sequence to Sequence - Video to Text (S2VT) [32]: S2VT incorporates both RGB and optical flow inputs, and the encoding and decoding of the inputs and word representations are learnt jointly in a parallel manner.

(3) Temporal Attention (TA) [34]: TA combines the frame representation from GoogleNet [29] and the video clip representation from a 3D CNN trained on hand-crafted descriptors. Furthermore, a soft attention mechanism is employed to dynamically attend to specific temporal regions of the video while generating sentences.

(4) Hierarchical Recurrent Neural Networks (H-RNN) [38]: H-RNN generates paragraphs by using one RNN to generate individual sentences and a second to capture the inter-sentence dependencies. Moreover, both spatial and temporal attention mechanisms are leveraged in H-RNN.
(5) Dense-Captioning Events (DCE) [16]: DCE leverages a multi-scale variant of DAP [6] to localize temporal event proposals and S2VT [32] as the base captioning module to describe each event. An attention module is further incorporated to exploit temporal context for dense captioning.

(6) Dense Video Captioning (DVC) is our complete system in this paper. Two slightly different settings of DVC are named DVC-D and DVC-D-A. The former only incorporates the descriptiveness-driven temporal attention mechanism into LSTM in the SG module and is trained without attributes and reinforcement learning based optimization, while the latter is more similar to DVC and only replaces the expected sentence-level reward loss in DVC with the traditional cross entropy loss.

Note that DCE is the only existing work on the dense video captioning task, and most previous video captioning works (e.g., LSTM, S2VT, TA, and H-RNN) focus on describing entire videos without detecting a series of events. Hence we compare the five video captioning baselines on the dense video captioning task by feeding them with the fixed ground truth proposals or the learnt ones from our TEP module.

Settings. For video clip representation, we utilize the publicly available 500-way C3D features in [16], whose dimension is reduced by PCA from the original 4,096-way output of fc7 of C3D pre-trained on the Sports-1M video dataset [13]. For the representation of attributes/categories, we treat all 200 categories of the ActivityNet dataset [4] as the high-level semantic attributes and train the attribute detectors with cross entropy loss, resulting in the final 200-way vector of probabilities. Each word in the sentence is represented as a one-hot vector (binary index vector in a vocabulary). The dimensions of the input and hidden layers in LSTM are both set to 1,024. The tradeoff parameter $\lambda_0$ leveraging the event probability and descriptiveness score for proposal selection is empirically set to 0.2. The tradeoff parameters $\alpha$ and $\beta$ in Eq. (4) are set as 0.5 and 10. For the tradeoff parameters $\lambda_1$ and $\lambda_2$ in Eq. (9), we set them as 1 and 20, respectively.
We mainly implement our DVC based on Caffe [12], which is one of the widely adopted deep learning frameworks. The whole system is trained by the Adam [14] optimizer. The initial learning rate is set as [...] and the mini-batch size is set as 1. Note that the SG module in our DVC is pre-trained with ground-truth proposal-sentence pairs. The sentence-level reward in SG is measured with METEOR.

Evaluation Metrics. For the evaluation of our proposed models, we follow the metrics in [16] to measure the ability to jointly localize and describe dense events. This metric computes the mean average precision (mAP) across tIoU thresholds of 0.3, 0.5, 0.7, and 0.9 when captioning the top 1,000 proposals. The precision of captions is measured by three evaluation metrics: BLEU@N [21], METEOR [1], and CIDEr-D [31]. All the metrics are computed by using the code released by the ActivityNet Evaluation Server.

Performance Comparison. Table 1 shows the performances of different models on the ActivityNet Captions validation set. Overall, the results across six evaluation metrics with ground truth proposals and learnt proposals consistently indicate that our proposed DVC achieves superior performance against other state-of-the-art video captioning techniques, including non-attention models (LSTM, S2VT) and attention-based approaches (TA, H-RNN, DCE). In particular, the METEOR score of our DVC can achieve 10.33%, which is considered significant progress on this benchmark. As expected, the METEOR score drops to 6[...]

Table 1. METEOR (M) and CIDEr-D (C) scores of our DVC and other state-of-the-art video captioning methods for the dense video captioning task on the ActivityNet Captions validation set (left: performances with ground truth (GT) proposals; right: performances with the learnt proposals from our TEP module). All values are reported as percentage (%). Entries shown as "…" are missing from this copy.

               with GT proposals                        with learnt proposals
Method         B@1    B@2   B@3   B@4   M      C        B@1    B@2   B@3   B@4   M     C
LSTM [33]      …      …     …     …     …      …        …      …     …     …     …     …
S2VT [32]      …      …     …     …     8.74   24.05    11.10  4.68  1.83  0.65  5.56  12.16
TA [34]        18.19  8.62  3.98  1.56  8.75   24.14    11.06  4.66  1.78  0.65  5.62  12.19
H-RNN [38]     18.41  8.80  4.08  1.59  8.81   24.17    11.21  4.79  1.90  0.70  5.68  …
DCE [16]       …      …     …     …     …      …        …      …     …     …     …     …
DVC-D          …      …     …     …     …      …        …      …     …     …     …     …
DVC-D-A        …      …     …     …     …      …        …      …     …     0.74  6.14  13.21
DVC            19.57  9.90  4.55  1.62  10.33  25.24    12.22  5.72  2.27  0.73  6.93  …

Figure 3. Dense video captioning results on the ActivityNet Captions validation set. The output temporally localized sentences are generated by 1) Ground Truth, 2) LSTM, and 3) our DVC. We show the results with the highest overlap with ground truth captions.
with ground truth propoal, making the relative improvement over the bet competitor DCE by 16.33%, which i conidered a a ignificant progre on thi benchmark. A expected, the METEOR core i dropped down to 6.93% when provided the predicted propoal from our TEP module intead of ground truth propoal. Moreover, DVC- D by additionally leveraging decriptivene-driven temporal attention for dene video captioning, lead to a performance boot againt. The reult baically indicate the advantage of weighting each local clip with holitic attention core in producing propoal repreentation for entence generation, intead of repreenting each propoal by directly performing mean pooling over it clip in. In addition, DVC-D-A by additionally augmenting with high-level emantic attribute, conitently improve DVC-D over all the metric, but the METEOR core are till lower than DVC. Thi confirm the effectivene of utilizing reinforcement learning technique for directly optimizing with METEOR-baed reward lo, which harmonize SG module with repect to it teting inference. Figure 3 howcae a few dene video captioning reult generated by different method and human-annotated ground truth entence. From thee exemplar reult, it i Table 2. Uer tudy on two criteria: M1 - percentage of et of caption generated by different method that are evaluated a better/equal to human caption; M2 - percentage of et of caption that pa Turing Tet. Human DVC H-RNN [38] TA [34] S2VT [32] [33] M M eay to ee that all of thee automatic method can generate omewhat relevant entence, while our propoed DVC can generate more relevant and decriptive entence by jointly exploiting decriptivene-driven temporal attention mechanim and high-level attribute for booting dene video captioning. For example, compared to phrae bruhing the dog hair in the entence generated by, cutting the cat claw in our DVC i more precie to decribe the event propoal in the lat propoal of the firt video. Human Evaluation. 
To better understand how satisfactory the localized temporal event proposals and the corresponding generated sentences of different methods are, we also conducted a human study to compare our DVC against four baselines, i.e., H-RNN, TA, S2VT, and the model of [33]. A total of 12 evaluators from different education backgrounds are invited, and a subset of 1K videos is randomly selected from the validation set for the subjective evaluation. The evaluation process is as follows. All the evaluators are organized into two groups. We show the first group all five sets of temporally localized sentences generated by each approach plus a series of temporally human-annotated sentences and ask them the question: Do the systems produce the sets of temporally localized sentences resembling human-generated sentences? In contrast, we show the second group only one set of temporally localized sentences at a time, generated by one of the approaches or by human annotation, and they are asked: Can you determine whether the given set of sentences has been generated by a system or by a human being? From the evaluators' responses, we calculate two metrics: 1) M1: percentage of sets of captions that are evaluated as better than or equal to human captions; 2) M2: percentage of sets of captions that pass the Turing Test. Table 2 lists the results of the user study. Overall, our DVC is clearly the winner on both criteria. In particular, the percentage achieves 35.1% and 38.7% in terms of M1 and M2, respectively, an absolute improvement over the best competitor H-RNN of 2.2% and 2.4%.
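The improvements quoted in this section are of two kinds: a relative gain (METEOR with ground truth proposals, 16.33% over DCE) and absolute gains (user study, 2.2 and 2.4 points over H-RNN). As a small worked example, the sketch below backs out the baseline values these statements imply. The function names are hypothetical, and the outputs are derived from the quoted numbers rather than read from the tables.

```python
def baseline_from_relative_gain(score: float, relative_gain_pct: float) -> float:
    """Return baseline b such that score = b * (1 + relative_gain_pct / 100)."""
    return score / (1.0 + relative_gain_pct / 100.0)


def baseline_from_absolute_gain(score: float, absolute_gain: float) -> float:
    """Return baseline b such that score = b + absolute_gain (percentage points)."""
    return score - absolute_gain


# METEOR with ground truth proposals: DVC = 10.33, relative gain over DCE = 16.33%.
implied_dce_meteor = round(baseline_from_relative_gain(10.33, 16.33), 2)

# User study: DVC reaches 35.1 (M1) and 38.7 (M2), with absolute gains
# of 2.2 and 2.4 points over H-RNN.
implied_hrnn_m1 = round(baseline_from_absolute_gain(35.1, 2.2), 1)
implied_hrnn_m2 = round(baseline_from_absolute_gain(38.7, 2.4), 1)

print(implied_dce_meteor, implied_hrnn_m1, implied_hrnn_m2)
```

Under these assumptions the implied values are roughly 8.88 for DCE's METEOR and 32.9 and 36.3 for H-RNN's M1 and M2. Keeping the two kinds of gain separate matters because a 16.33% relative gain on a score near 9 is only about 1.45 points in absolute terms.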

Table 3. Leaderboard of the published state-of-the-art dense video captioning models on the online ActivityNet evaluation server.
[Columns: DVC (P3D), DVC (C3D), TAC [9], DCE [16]; row: METEOR. The numeric entries are lost in this copy.]

Performance on ActivityNet Evaluation Server. We also submitted our best run in terms of METEOR score, i.e., DVC, to the online ActivityNet evaluation server and evaluated the performance on the official testing set. Table 3 shows the performance Leaderboard on the official testing set. Please note that here we design two submission runs for our DVC, i.e., DVC (C3D) and DVC (P3D). The input clip features of the two runs are 500-way C3D features and the 2048-way outputs of the pool5 layer from P3D ResNet [22], respectively. Compared to the top performing methods, our proposed DVC (C3D) achieves the best METEOR score. In addition, when leveraging the clip features from P3D ResNet, our METEOR score on the testing set is further boosted up to 12.96%, ranking first on the Leaderboard.

Temporal Event Proposal Task

The second experiment is conducted on the temporal event proposal task, which evaluates our TEP module's capability to adequately localize all events for a given video.

Compared Approaches. We compare our DVC with three state-of-the-art temporal action proposal methods: (1) Temporal Actionness Grouping (TAG) [39]. TAG utilizes an actionness classifier to generate actionness curves, followed by the watershed algorithm to produce basins. The proposals are finally generated by grouping the basins. (2) Dense-Captioning Events (DCE) [16]. DCE leverages a multi-scale variant of the […]-based action proposal model in [6] to localize temporal event proposals. (3) Temporal Unit Regression Network (TURN) [8]. TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression.

Evaluation Metrics. For the temporal action proposal task, we adopt the Area-Under-the-Curve (AUC) score for the Average Recall vs. Average Number of Proposals per Video (AR-AN) curve in [9] as the evaluation metric.
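To make the AR-AN metric concrete, here is a minimal sketch for a single video. It assumes proposals are (start, end) segments already sorted by confidence, and takes AR as the mean recall over tIoU thresholds from 0.5 to 0.95 in steps of 0.05. All function names are hypothetical. Normalising the area by the spanned range of proposal counts is an assumption of this sketch, and the official ActivityNet evaluation code (which aggregates over all videos) may differ in its details.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) of a temporal segment, in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals: Sequence[Segment], ground_truth: Sequence[Segment],
              threshold: float) -> float:
    """Fraction of ground-truth events covered by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for gt in ground_truth
               if any(tiou(p, gt) >= threshold for p in proposals))
    return hits / len(ground_truth)


def average_recall(proposals: Sequence[Segment],
                   ground_truth: Sequence[Segment]) -> float:
    """Mean recall over tIoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(recall_at(proposals, ground_truth, t) for t in thresholds) / len(thresholds)


def ar_an_auc(ranked_proposals: Sequence[Segment],
              ground_truth: Sequence[Segment]) -> float:
    """Trapezoidal area under AR vs. number of kept proposals, divided by the
    spanned range of proposal counts (an assumption of this sketch)."""
    counts: List[int] = list(range(1, len(ranked_proposals) + 1))
    recalls = [average_recall(ranked_proposals[:k], ground_truth) for k in counts]
    if len(counts) < 2:
        return recalls[0] if recalls else 0.0
    area = sum((recalls[i] + recalls[i + 1]) / 2.0 * (counts[i + 1] - counts[i])
               for i in range(len(counts) - 1))
    return area / (counts[-1] - counts[0])


ground_truth = [(0.0, 10.0), (20.0, 30.0)]
ranked = [(0.0, 10.0), (21.0, 30.0), (50.0, 60.0)]  # highest confidence first
print(average_recall(ranked, ground_truth), ar_an_auc(ranked, ground_truth))
```

In this toy case the second proposal overlaps its event with a tIoU of 0.9, so it counts at every threshold except 0.95. Averaging over strict thresholds in this way penalises loosely localised proposals.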
AR is defined as the mean of all recall values using tIoU thresholds between 0.5 and 0.95 (step size: 0.05), and AN denotes the total number of proposals divided by the number of videos.

Performance Comparison. Figure 4 shows the AR-AN curves of four runs on the ActivityNet Captions validation set for the temporal event proposal task. Overall, the quantitative results with regard to AUC score indicate that our DVC outperforms other methods. In particular, by leveraging temporal coordinate regression for adjusting the temporal boundary of detected proposals, DCE and TURN lead to a large performance boost against TAG. Moreover, DVC, by additionally incorporating descriptiveness regression into the TEP module, further improves on DCE and TURN. The results indicate the advantage of joint detection and captioning, which refines the event proposals from the language perspective.

Figure 4. The AR-AN curves of different approaches on the ActivityNet Captions validation set for the temporal event proposal task.
[Plot omitted. Axes: Average Number of Proposals per Video (x) vs. Average Recall (y). Legend: DVC, AUC=59.91; TURN, AUC=57.85; DCE, AUC=54.54; TAG, AUC=[…].]

Table 4. Effect of utilizing multiple anchor layers.
[Headers recovered: conv3-8, conv9, conv10, conv11, conv12, AUC, Parameter Number (in M). The table body is lost in this copy.]

Effect of Multiple Anchor Layers. In order to show the relationship between performance and the number of anchor layers with different temporal resolutions, we progressively stack anchor layers with decreasing temporal resolutions and compare the performances. The results shown in Table 4 indicate that increasing the number of anchor layers with different temporal resolutions can generally lead to performance improvements. Meanwhile, the number of parameters in all adopted anchor layers increases. Thus, we finally adopt conv3-11 as anchor layers, as that has a better tradeoff between performance and model complexity.

5. Conclusion

We have presented a novel deep architecture which unifies the temporal localization of event proposals and sentence generation for dense video captioning.
Particularly, we study the problem of how to build the interaction across the two sub-challenges (i.e., temporal event proposal and sentence generation) and how to integrate such interaction into a deep learning framework for enhancing dense video captioning. To verify our claim, we have devised a descriptiveness regression component and incorporated it into a single shot detection structure, on one hand to adjust the event proposals from the language perspective in the TEP module, and on the other, to measure the descriptive complexity of each event in the SG module. Experiments conducted on the ActivityNet Captions dataset validate our model and analysis. More remarkably, we achieve superior results over state-of-the-art methods when evaluating our framework on both dense video captioning and temporal event proposal tasks.

Acknowledgments. This work was supported in part by NSFC under Grants […] and U[…], and the Guangzhou Science and Technology Program, China, under Grant […].

References

[1] S. Banerjee and A. Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL workshop.
[2] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals. In CVPR.
[3] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR.
[4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.
[5] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV.
[6] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. DAPs: Deep action proposals for action understanding. In ECCV.
[7] H. Fang, S. Gupta, et al. From captions to visual concepts and back. In CVPR.
[8] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In ICCV.
[9] B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, R. Khrisna, V. Escorcia, K. Hata, and S. Buch. ActivityNet challenge 2017 summary. arXiv preprint arXiv:[…].
[10] R. Girshick. Fast R-CNN. In ICCV.
[11] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In MM.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR.
[15] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV.
[16] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In ICCV.
[17] T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In MM.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV.
[19] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR.
[20] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In CVPR.
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL.
[22] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
[23] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR.
[24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
[25] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR.
[26] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV.
[27] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR.
[28] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS.
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR.
[30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV.
[31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR.
[32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV.
[33] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT.
[34] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV.
[35] T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei. MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop.
[36] T. Yao, Y. Pan, Y. Li, and T. Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR.
[37] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In ICCV.
[38] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR.
[39] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.
