arXiv: v1 [cs.CV] 23 Apr 2018


Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei

School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
Microsoft Research, Beijing, China
University of Science and Technology of China, Hefei, China
Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education

{yehaoli.yu, {tiyao,

Abstract

Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem nevertheless is not trivial especially when a video contains multiple events to be worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model differs from existing dense video captioning methods since we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on ActivityNet Captions dataset and our framework shows clear improvements when compared to the state-of-the-art techniques. More remarkably, we obtain a new record: METEOR of 12.96% on ActivityNet Captions official test set.

1. Introduction

The recent advances in 2D and 3D Convolutional Neural Networks (CNNs) have successfully pushed the limits and improved the state-of-the-art of video understanding.
For instance, the first rank performance achieves 8.8% in terms of top-1 error in untrimmed video classification task of ActivityNet Challenge 2017 [9]. As such, it has become possible to recognize a video with a pre-defined set of labels or categories. In a further step to describe a video with a complete and natural sentence, video captioning [19, 32, 34] has expanded the understanding from individual labels to a sequence of words to express richer semantics and relationships in the video. Nevertheless, considering that videos in real life are usually long and contain multiple events, the conventional video captioning methods generating only one caption for a video in general will fail to recapitulate all the events in the video. Taking the video in Figure 1 as an example, the output sentence generated by a popular video captioning method [32] is unable to describe the procedure of playing frisbee with a dog in detail. As a result, the task of dense video captioning is introduced recently in [16] and the ultimate goal is to generate a sentence for each event occurring in the video.

(Footnote: This work was performed at Microsoft Research Asia.)

Figure 1. Examples of video captioning and dense video captioning (upper row: input video; middle row: the sentence generated by video captioning method; bottom row: temporally localized sentences generated by dense video captioning approach).
Video captioning: "A man is playing frisbee with a dog."
Dense video captioning (each sentence is localized by a start time and an end time): "A man and a dog are outdoors and waiting for their turn to play on a fenced in green field. The man and the dog run onto the field and he throws the frisbee a far distance and the dog runs and fetches it, then returns it back to the man and they repeat the process 6 times. The whole time there are people on the sidelines watching them and taking pictures." "A man and a dog walk onto a field." "A man throws a frisbee and the dog chases after it." "When they are done, another man runs to them and hands the man a leash and he leashes his dog." "The dog brings the frisbee back to the man."
The difficulty of dense video captioning originates from two aspects: 1) how to accurately localize each event in time? 2) how to design a powerful sentence generation model? In the literature, there have been several techniques, including temporal action proposal [2, 3, 6, 17] and image/video captioning [32, 34, 37], being proposed for each individual aspect. However, simply solving the problem of dense video captioning in a two-stage way, i.e., first temporal event proposal and then sentence generation, may destroy the interaction between localizing and describing events, resulting in a sub-optimal solution.

This paper proposes a novel deep architecture to unify the accurate localization of temporal events with the descriptive principle of sentence generation for dense video captioning. Technically, we devise a new descriptiveness regression component and integrate it into a single shot detection framework as a bridge, on one hand to measure the complexity of each event being described in sentence generation, and on the other, to adjust the event proposal. More specifically, the descriptiveness regression guides the learning of temporal event proposal together with event/background classification and temporal coordinate regression. In between, event/background classification is to predict event proposals and temporal coordinate regression is to refine the temporal boundaries of each proposal. Furthermore, the inference of descriptiveness regression is employed as an attention to weight video clips in each proposal locally. The proposal-level representation is then averaged over all the clip-level representations in the proposal weighted by the holistic attention score of each clip and finally fed into an attribute-augmented captioning architecture for sentence generation. As such, the task of dense video captioning could be jointly learnt and globally optimized in an end-to-end manner.

The main contribution of this work is the proposal of a new architecture to unify the temporal localization of event proposals and sentence generation for dense video captioning. The solution also leads to the elegant views of what kind of interaction should be built across the two sub problems and how to model and integrate the interaction in a deep learning framework, which are problems not yet fully understood in the literature.

2. Related Work

Temporal Action Proposal. [5] is one of the early works that detects temporal segments containing the action of interest in a sliding window fashion. Next, a few subsequent works [2, 3, 6, 8, 27] tackle temporal action proposal by leveraging action classifiers on a smaller number of temporal windows.
In particular, Sparse-prop [3] utilizes dictionary learning to encode representations of trimmed action instances and then retrieves the most representative segments from testing video, which are treated as the class-independent proposals. S-CNN [27] trains a 3D CNN to classify a video segment as background or being-action and employs varied length temporal windows for multi-scale action proposal generation. Later in [6], DAP utilizes Long Short-Term Memory (LSTM) to encode video streams and enable multi-scale proposal generation inside the streams with a single pass through the video, obviating the need for deploying sliding windows on multiple scales. Furthermore, Buch et al. develop SST based on DAP by constructing non-overlapping sliding windows over the input video and encoding each window sequentially with a Gated Recurrent Unit (GRU) in [2]. Most recently, Gao et al. [8] design temporal coordinate regression for temporal action proposal generation.

Video Captioning. The research in this direction has proceeded along two different dimensions: template-based language methods [11, 15, 26] and sequence learning approaches (e.g., RNNs) [16, 19, 32, 33, 34, 38]. Template-based language methods directly generate the sentence with detected keywords in predefined language templates. Sequence learning approaches utilize CNN plus RNN architecture to generate novel sentences with more flexible syntactical structures. In [33], Venugopalan et al. present an LSTM-based model to generate video descriptions with the mean pooling representation over all frames. The framework is then extended by inputting both frames and optical flow images into an encoder-decoder in [32]. Compared to mean pooling, Yao et al. propose to utilize the temporal attention mechanism to exploit temporal structure for video captioning [34]. Later in [38], a hierarchical RNN is devised to further capture the inter-sentence dependency, targeting the description of a long video with a paragraph consisting of multiple sentences.
Different from the video paragraph captioning with non-overlapping and annotated temporal intervals, a more challenging task, named dense video captioning, is recently introduced in [16] which involves both detecting and describing multiple events in a video. A two-stage dense-captioning system is thus designed by leveraging DAP [6] to localize temporal event proposals and an LSTM-based sequence learning module to describe each event proposal. Most recently, [35] additionally incorporates a KNN-based retrieval module into the LSTM-based sequence learning module to boost video captioning.

Summary. Our work aims to detect and describe events in video, i.e., dense video captioning. Different from the aforementioned method [16], our approach contributes by studying not only detecting the events with the simple objective of binary classification (i.e., event or background) and modeling sentence generation with LSTM, but also enhancing the temporal event proposal by utilizing both temporal boundary regression to correct start and end time of events and descriptiveness regression to infer whether the event can be well described from language perspective. Moreover, the sentence generation module is further boosted by leveraging semantic attributes and reinforcement learning to optimize LSTM with non-differentiable metrics.

Figure 2. An overview of our Dense Video Captioning framework mainly including Temporal Event Proposal (TEP) and Sentence Generation (SG) (better viewed in color). The input video is first encoded into a series of clip-level features via a 3-D CNN, which are fed into TEP module to produce candidate proposals. The TEP module is employed by integrating the event/background classification to predict event proposals, temporal coordinate regression to refine the temporal boundaries of each proposal, and descriptiveness regression to infer the descriptive complexity of each event, into a single shot detection architecture. After ranking the candidate proposals with regard to both eventness and descriptiveness scores, the top proposals are in turn injected into SG module for sentence generation. The SG module leverages both attributes and reinforcement learning based optimization to enhance captioning.
(Diagram labels: 3D CNN; Base Layers, Anchor Layers, Prediction Layers (FC); feature dimensions 500D, 2048D, 1024D, 512D, 512D, 512D; Temporal Coordinate Regression, Descriptiveness Regression, Event/Background Classification; proposal ranking list; Sentence Generation with Attributes and METEOR Reward; example output: "The dog is seen in several clips chasing after a frisbee."; axes: start time, end time.)

3. Dense Video Captioning

The basic idea of this work is to automatically describe multiple events in videos by temporally localizing event proposals and generating language sentences for each event proposal. The temporal event proposal (TEP) is performed by encapsulating the event classification to recognize video segments of events from background, proposal generation to temporally localize the events, and descriptiveness inference procedure to infer the descriptive complexity of this event, in one single network. Such design enables straightforward temporal event proposal in a single shot manner to ease the training consumption. The sentence generation module (SG) leverages an attribute-augmented LSTM-based model for generating descriptions. Moreover, the policy-gradient based reinforcement learning is adopted to optimize LSTM with evaluation metrics based reward, harmonizing the module with respect to the testing inference.
Please note that the descriptiveness inference procedure in TEP module is not only leveraged to additionally refine the localized event proposals from language perspective through descriptiveness regression, but also integrated into SG module to consider the descriptiveness score as one kind of temporal attention over clips for weighted fusing them as the input proposal-level representation of LSTM. As such, our system including both TEP and SG modules can be jointly trained through the global optimization of detection and captioning in an end-to-end manner. An overview of our dense video captioning system is illustrated in Figure 2.

3.1. Problem Formulation

Suppose we have a video $V = \{v_t\}_{t=1}^{T_v}$ with $T_v$ frames/clips and $v_t$ denotes the $t$-th frame/clip in temporal order. The ultimate target of our dense video captioning system is to generate a set of temporal localized descriptions $\Phi_v = \{\phi^i = (t^i_{start}, t^i_{end}, S_i)\}_{i=1}^{M_v}$ for the input video $V$, where $M_v$ is the number of sentences, $t^i_{start}$ and $t^i_{end}$ represent the starting time and ending time for each sentence $S_i$, and $S_i = \{w_1, w_2, \ldots, w_N\}$ consists of $N$ words. Hence the TEP module in our system is firstly utilized to produce a set of candidate proposals for the input video $V$:

$$\Phi_p = \{\phi^i_p = (t^i_{start}, t^i_{end}, p^i_{event}, p^i_{des})\}_{i=1}^{N_p}, \tag{1}$$

where $p^i_{event}$ is the probability of recognizing the candidate as an event (i.e., eventness score), $p^i_{des}$ denotes the descriptiveness score measuring how well the candidate can be described from language perspective and $N_p$ is the total number of candidate proposals. By consolidating the idea of selecting proposals from both vision and language perspectives, all the candidates are ranked according to the fused score $p^i_{conf} = p^i_{event} + \lambda_0 p^i_{des}$ and only the candidates with a $p^i_{conf}$ higher than a threshold are injected into SG module for captioning, denoted as $\Phi_{\hat{p}}$.
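The proposal selection step just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Proposal` class, the function names, and the threshold value passed by the caller are hypothetical, while the default `lambda_0 = 0.2` follows the value reported in the experimental settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """One candidate from Eq. (1); the field names are illustrative."""
    t_start: float  # starting time
    t_end: float    # ending time
    p_event: float  # eventness score
    p_des: float    # descriptiveness score


def fused_score(p: Proposal, lambda_0: float = 0.2) -> float:
    """p_conf = p_event + lambda_0 * p_des."""
    return p.p_event + lambda_0 * p.p_des


def select_proposals(candidates: List[Proposal], threshold: float,
                     lambda_0: float = 0.2) -> List[Proposal]:
    """Rank candidates by fused score (descending) and keep those above the threshold."""
    ranked = sorted(candidates, key=lambda p: fused_score(p, lambda_0), reverse=True)
    return [p for p in ranked if fused_score(p, lambda_0) > threshold]
```

A candidate with a modest eventness score can therefore still be selected when its descriptiveness score is high, which is the intended effect of fusing the vision and language perspectives.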
Inspired by the successes of sequence learning models in machine translation [28] and attributes utilized in image/video captioning [7, 20, 36], we formulate our SG module as an attribute-augmented LSTM-based model which encodes the input event proposal representation ($F$) and its detected attributes/categories ($A$) into a fixed dimensional vector and then decodes it to the output target sentence. As such, the sentence generation problem we exploit here can be formulated by minimizing the negative log probability of the correct textual sentence ($-\log \Pr(S \mid F, A)$). The negative log probability is typically measured with cross entropy loss, resulting in the discrepancy of evaluation between training and inference. Hence, to further boost our SG module by amending such discrepancy, we take the inspiration from the reinforcement learning [23] leveraged in sequence learning and directly optimize LSTM by minimizing the following expected sentence-level reward loss as

$$L_{SG} = -\mathbb{E}_{S \sim p_\theta}[r(S)], \tag{2}$$

where $\theta$ denotes the parameters of LSTM that schedules a policy $p_\theta$ for generating sentences. $r(S)$ is the reward measured by comparing the generated sentence $S$ to ground-truth sentences over a non-differentiable evaluation metric.

3.2. Temporal Event Proposal

Existing solutions for temporal action/event proposal mainly focus on detecting events with binary classifier (i.e., event or background) in a sliding window fashion. However, the temporal sliding windows are typically too dense and even designed with multiple scales, resulting in the heavy computation cost. Inspired from spatial boundary regression in object localization [24] which simultaneously predicts objectness score and object bounds, we integrate the event classification with temporal coordinate regression for correcting event temporal bounds, pursuing both low-cost and high-quality event proposals. Moreover, for the dense video captioning task, the event identification undoubtedly plays the major role in temporal event proposal, while the proposals containing rich describable objects or scenes are also preferred by human beings in description generation. As such, a novel descriptiveness regression is especially devised to infer the descriptive complexity of each proposal, and further refine the event proposals from language perspective.
Similar to single shot object detection in [18], all the three components (i.e., event classification, temporal coordinate regression and descriptiveness regression) are elegantly integrated into one feed-forward CNN, aiming to simultaneously produce a fixed-size set of proposals, scores for the presence of events in the proposals (i.e., eventness score), and descriptiveness scores of proposals.

Input. Technically, given input video $V = \{v_t\}_{t=1}^{T_v}$, a 3-D CNN is utilized to encode the frame sequence into a series of clip-level features $\{f_t = F(v_t : v_{t+\delta})\}_{t=1}^{T_f}$ where $\delta$ is the temporal resolution of each feature $f_t$. In our experiments, C3D [30] is adopted as 3-D CNN encoder $F$ with $\delta = 16$ frames and the temporal interval for encoding is set as 8 frames, resulting in the initial output feature map with the size of $T_f \times D_0$. Note that $D_0$ is the dimension of clip-level features and $T_f = T_v/8$ discretizes the video frames.

Network Architecture. The initial feature map of size $T_f \times D_0$ is fed into 1-D CNN architecture in TEP module, which consists of convolutional layers in three groups: base layers, anchor layers, and prediction layers, as shown in Figure 2. In particular, two base layers (conv1 and conv2) are firstly devised to reduce the temporal dimension of feature map and increase the size of temporal receptive fields, producing the output feature map of size $T_f/\ldots$ Then, we stack nine anchor layers (conv3 to conv11) on the top of base layer conv2, each of which is designed with the same configuration (kernel size: 3, stride size: 2, and filter number: 512). These anchor layers decrease in temporal dimension of feature map progressively, enabling the temporal event proposals at multiple temporal scales. For each anchor layer, its output feature map is injected into prediction layers to produce a fixed set of predictions in one shot manner.
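To make the temporal bookkeeping above concrete, the following Python sketch enumerates the 16-frame clips sampled every 8 frames and the temporal length after each stride-2 anchor layer. It rests on two assumptions that the text does not state: only full clips are kept at the end of the video, and each stride-2 layer is padded so that it outputs ceil(length / 2) cells. The function names and the base length used in the note below are hypothetical.

```python
import math
from typing import List, Tuple


def clip_windows(num_frames: int, clip_len: int = 16, stride: int = 8) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of the clips encoded by the 3-D CNN.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


def anchor_layer_lengths(base_len: int, num_layers: int = 9, stride: int = 2) -> List[int]:
    """Temporal length of the feature map after each anchor layer.

    Assumption: padding is chosen so each layer yields ceil(length / stride) cells.
    """
    lengths = []
    cur = base_len
    for _ in range(num_layers):
        cur = math.ceil(cur / stride)
        lengths.append(cur)
    return lengths
```

Under these assumptions, a hypothetical base feature map of 512 cells shrinks to 256, 128, and so on down to a single cell, so the coarsest anchor layer covers the whole video while the finest one localizes short events.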
Concretely, given an output feature map $f_j$ with the size of $T_{f_j} \times D_j$, the basic element (anchor) for predicting parameters of a candidate proposal is a $1 \times D_j$ feature map cell that produces a prediction score vector $p_{pred} = (p_{cls}, \Delta c, \Delta w, p_{des})$ via fully connected layers. $p_{cls} = [p_{event}, p_{bk}]$ denotes the two dimensional classification scores for event/background and $p_{des}$ is the descriptiveness score to infer the confidence of this proposal to be well described. $\Delta c$ and $\Delta w$ are two temporal offsets relative to the default center location $\mu_c$ and width $\mu_w$ of this anchor, which are leveraged to adjust its temporal coordinate as

$$\varphi_c = \mu_c + \alpha_1 \mu_w \Delta c, \quad \varphi_w = \mu_w \exp(\alpha_2 \Delta w), \quad t_{start} = \varphi_c - \tfrac{1}{2}\varphi_w, \quad t_{end} = \varphi_c + \tfrac{1}{2}\varphi_w, \tag{3}$$

where $\varphi_c$ and $\varphi_w$ are refined center location and width of the anchor. $\alpha_1$ and $\alpha_2$ are utilized to control the impact of temporal offsets, both of which are set as 0.1. $t_{start}$ and $t_{end}$ represent the adjusted starting and ending time of the anchor. In addition, derived from the anchor boxes in [18, 24], we associate a set of default temporal boundaries with each feature map cell. The different temporal scale ratios for these default temporal boundaries are denoted as $R = \{r_d\}_{d=1}^{D} = \{1, 1.25, 1.5\}$. For each temporal scale ratio $r_d$, we can thus achieve one default center location ($\mu_{c_d} = \frac{t+0.5}{T_{f_j}}$) and width ($\mu_{w_d} = \frac{r_d}{T_{f_j}}$) of the $t$-th feature map cell, resulting in a total of $T_{f_j} D$ anchors. Accordingly, for the feature map $f_j$, the set of all the produced proposals is defined as $\Phi_{f_j} = \{\phi_{f_j} = (t^u_{start}, t^u_{end}, p^u_{cls}, p^u_{des})\}_{u=1}^{N_U}$, where $N_U = T_{f_j} D$. By accumulating all the produced proposals of the output feature maps of nine anchor layers, the final predicted proposal set is $\Phi_p = \{\Phi_{f_j}\} = \{\phi^i_p\}_{i=1}^{N_p}$.

Training. During training, a positive/negative label should be firstly assigned to each predicted proposal conditioned on the ground truth proposal set $\Phi_v$. Specifically, for each $\phi^i_p \in \Phi_p$, we measure its temporal Intersection over Union (tIoU) with each ground truth proposal and obtain the highest tIoU.
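For reference, the anchor decoding of Eq. (3) and the tIoU measure used in this label assignment can be sketched as follows. This is a minimal sketch rather than the authors' code: the function names are hypothetical, and the default center and width are read as fractions normalized by the feature map length, which is our interpretation of the flattened formula in the text.

```python
import math
from typing import List, Sequence, Tuple

SCALE_RATIOS = (1.0, 1.25, 1.5)  # R = {1, 1.25, 1.5} from the text


def default_anchors(num_cells: int,
                    ratios: Sequence[float] = SCALE_RATIOS) -> List[Tuple[float, float]]:
    """Default (center, width) pairs for one feature map.

    Assumption: coordinates are normalized to [0, 1] by the number of cells.
    """
    return [((t + 0.5) / num_cells, r / num_cells)
            for t in range(num_cells) for r in ratios]


def decode_anchor(mu_c: float, mu_w: float, delta_c: float, delta_w: float,
                  alpha_1: float = 0.1, alpha_2: float = 0.1) -> Tuple[float, float]:
    """Apply Eq. (3): refine center and width, then return (t_start, t_end)."""
    phi_c = mu_c + alpha_1 * mu_w * delta_c
    phi_w = mu_w * math.exp(alpha_2 * delta_w)
    return phi_c - 0.5 * phi_w, phi_c + 0.5 * phi_w


def tiou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal Intersection over Union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

With zero offsets the decoded segment is exactly the default anchor, so the regression only has to learn small corrections scaled by the two 0.1 factors.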
If the highest tIoU is larger than 0.7, $\phi^i_p$ is treated as positive sample with regard to the corresponding ground truth proposal $\phi_g$, otherwise $\phi^i_p$ is a negative sample. The training objective in our TEP module is formulated as a multi-task loss by integrating the event/background classification loss ($L_{event}$) for distinguishing events from background, temporal coordinate regression loss ($L_{tcr}$) for adjusting temporal coordinates of event proposal, and descriptiveness regression loss ($L_{des}$) for inferring the descriptive complexity of each proposal, which is defined as

$$L_{TEP} = L_{event} + \alpha L_{tcr} + \beta L_{des}, \tag{4}$$

where $\alpha$ and $\beta$ are tradeoff parameters. The event/background classification loss $L_{event}$ is measured as the standard softmax loss over the whole predicted proposal set $\Phi_p$. The temporal coordinate regression loss $L_{tcr}$ is devised as Smooth L1 loss [10] ($S_{L1}$) between the positive predicted proposals and the corresponding ground truth proposals. Similar to [24], both of the temporal offsets for center location ($\varphi_c$) of the predicted proposal and for its width ($\varphi_w$) are regressed as

$$L_{tcr} = \frac{1}{N_{pos}} \sum_{i=1}^{N_p} I_i \left( S_{L1}(\varphi^i_c - g^i_c) + S_{L1}(\varphi^i_w - g^i_w) \right), \tag{5}$$

where $I_i$ denotes the assigned label for predicted proposal $\phi^i_p$ (1 for positive sample and 0 for negative sample), $N_{pos}$ is the number of positive samples, $g^i_c$ and $g^i_w$ represent the center location and width of ground truth proposal. The descriptiveness regression loss $L_{des}$ is calculated as the Euclidean distance between the inferred descriptiveness score $p^i_{des}$ of the predicted proposal and its sentence-level reward $r(S_i)$ in SG with regard to ground truth sentence:

$$L_{des} = \frac{1}{N_p} \sum_{i=1}^{N_p} \left\| p^i_{des} - r(S_i) \right\|_2^2. \tag{6}$$

In particular, for each positive predicted proposal $\phi^i_p$, we achieve its sentence-level reward $r(S_i)$ by directly feeding this proposal into SG module and comparing the generated sentence $S_i$ with the corresponding ground-truth sentence over evaluation metric (e.g., METEOR). For each negative sample, its sentence-level reward is naturally fixed as 0.
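The multi-task objective of Eqs. (4) to (6) can be written out on plain Python lists as below. This is a hedged sketch under stated assumptions: the softmax loss is averaged over all proposals (the text does not give its normalization), class index 0 is taken to mean "event" and index 1 "background", and the function names are hypothetical. The defaults for `alpha` and `beta` are the values reported in the experimental settings.

```python
import math
from typing import Sequence, Tuple


def smooth_l1(x: float) -> float:
    """Smooth L1: 0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_loss(scores: Sequence[float], target: int) -> float:
    """Cross entropy of a softmax over the [event, background] scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def tep_loss(cls_scores, centers, widths, p_des, labels, gt_centers, gt_widths,
             rewards, alpha: float = 0.5, beta: float = 10.0) -> Tuple[float, float, float, float]:
    """Return (L_TEP, L_event, L_tcr, L_des) for one batch of predicted proposals.

    labels[i] is 1 for a positive proposal and 0 for a negative one;
    rewards[i] is r(S_i) for positives and 0 for negatives.
    """
    n = len(labels)
    n_pos = sum(labels)
    # Assumption: index 0 = event, index 1 = background, loss averaged over proposals.
    l_event = sum(softmax_loss(s, 0 if y == 1 else 1)
                  for s, y in zip(cls_scores, labels)) / n
    l_tcr = 0.0
    if n_pos > 0:
        l_tcr = sum(y * (smooth_l1(c - gc) + smooth_l1(w - gw))
                    for y, c, gc, w, gw
                    in zip(labels, centers, gt_centers, widths, gt_widths)) / n_pos
    l_des = sum((p - r) ** 2 for p, r in zip(p_des, rewards)) / n
    return l_event + alpha * l_tcr + beta * l_des, l_event, l_tcr, l_des
```

Only positive proposals contribute to the coordinate regression term, whereas every proposal contributes to the descriptiveness term, with negatives pulled toward a reward of 0.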
Accordingly, by minimizing the descriptiveness regression loss, our prediction layer is additionally endowed with the ability to directly infer the approximate descriptiveness score (i.e., the sentence-level reward for captioning) of an event proposal without referring to ground-truth sentences.

3.3. Sentence Generation

Given the set of selected predicted event proposals $\Phi_{\hat{p}} \subseteq \Phi_p$ from TEP module, each proposal $\phi_{\hat{p}} \in \Phi_{\hat{p}}$ is injected into attribute-augmented LSTM-based model in SG module for description generation. Specifically, the attribute representation $A$ of predicted proposal $\phi_{\hat{p}}$ is firstly transformed into LSTM to inform the whole LSTM about the high-level attributes, followed by the proposal representation $F$ which is encoded into LSTM at the second time step. Then, LSTM decodes each output word based on previous word and previous step's LSTM hidden state.

Descriptiveness-driven Temporal Attention. One natural way to achieve the proposal representation $F$ is performing mean pooling process over all the clips within this proposal. However, in many cases, the generated descriptions only relate to some key clips with low descriptive complexity. As a result, to pinpoint the local clips containing rich describable objects or scenes and further incorporate the contributions of different clips into producing proposal representation, a descriptiveness-driven temporal attention mechanism is employed on the predicted event proposal. Given the input proposal $\phi_{\hat{p}}$ containing $N_{\hat{p}_i}$ clips $\{v_i\}_{i=1}^{N_{\hat{p}_i}}$, the clip-level descriptiveness score $p^{v_i}_{des}$ of each clip $v_i$ is firstly achieved by holistically taking the average of all the descriptiveness scores of predicted proposals containing this clip $v_i$. Here we treat the clip-level descriptiveness score of each clip as one kind of temporal attention over all the clips within this proposal.
Based on the attention distribution, we calculate the weighted sum of the clip features and obtain the aggregated proposal feature $F$ weighted by holistic attention score of each clip:

$$\alpha_{v_i} = p^{v_i}_{des} \Big/ \sum_{j=1}^{N_{\hat{p}_i}} p^{v_j}_{des}, \qquad F = \sum_{i=1}^{N_{\hat{p}_i}} \alpha_{v_i} f_{v_i}, \tag{7}$$

where $\alpha_{v_i}$ is the normalized attention score and $f_{v_i}$ is the clip representation of $v_i$. The aggregated proposal feature could be regarded as a more informative proposal feature since the most descriptive clips for sentence generation have been distilled with higher attention weights.

Training. The training objective in our SG module is formulated as the expected sentence-level reward loss in Eq.(2). Inspired from Self-critical Sequence Training (SCST) [25], the gradient of this objective is given by

$$\nabla_\theta L_{SG} \approx -\left(r(S^s) - r(\hat{S})\right) \nabla_\theta \log p_\theta(S^s), \tag{8}$$

where $S^s$ is a sampled sentence and $r(\hat{S})$ denotes the reward of baseline achieved by greedily decoding inference.

3.4. Joint Detection and Captioning

The overall objective of our dense video captioning is comprised of the training objective of TEP module in Eq.(4) and the reward loss of SG module in Eq.(2):

$$L = \lambda_1 L_{TEP} + \lambda_2 L_{SG}, \tag{9}$$

where $\lambda_1$ and $\lambda_2$ are tradeoff parameters for TEP and SG, respectively. Note that descriptiveness inference could be regarded as a bridge which is not only leveraged in TEP for adjusting the event proposals from language perspective, but also integrated into SG for measuring the descriptiveness-driven temporal attention to boost sentence generation, enabling the interaction between TEP and SG. As a result, the overall objective function of our system can be solved through the joint and global optimization of detection and captioning in an end-to-end manner.
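The descriptiveness-driven attention of Eq. (7) and the self-critical weighting of Eq. (8) can be illustrated with the short Python sketch below. It is an illustration under assumptions, not the authors' implementation: proposals are represented as hypothetical `(first_clip, last_clip, p_des)` triples with inclusive clip indices, clip features are plain lists of floats, and the fallback to mean pooling when all scores are zero is our own choice.

```python
from typing import List, Sequence, Tuple


def clip_descriptiveness(num_clips: int,
                         proposals: Sequence[Tuple[int, int, float]]) -> List[float]:
    """Average p_des over all predicted proposals that contain each clip."""
    scores = []
    for c in range(num_clips):
        covering = [p for (first, last, p) in proposals if first <= c <= last]
        scores.append(sum(covering) / len(covering) if covering else 0.0)
    return scores


def attention_pool(clip_feats: Sequence[Sequence[float]],
                   clip_scores: Sequence[float]) -> List[float]:
    """Eq. (7): F = sum_i alpha_i * f_i with alpha_i = score_i / sum_j score_j."""
    total = sum(clip_scores)
    if total > 0:
        weights = [s / total for s in clip_scores]
    else:
        # Assumption: fall back to mean pooling when every score is zero.
        weights = [1.0 / len(clip_feats)] * len(clip_feats)
    dim = len(clip_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, clip_feats)) for d in range(dim)]


def scst_weight(reward_sampled: float, reward_greedy: float) -> float:
    """Coefficient multiplying grad log p(S^s) in Eq. (8)."""
    return -(reward_sampled - reward_greedy)
```

A sampled sentence that scores above the greedy baseline gets a negative coefficient, so minimizing the loss raises its log probability; a sample below the baseline is pushed down.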

4. Experiments

We conduct our experiments on the ActivityNet Captions dataset [16] and evaluate our proposed system on both dense video captioning and temporal event proposal tasks.

4.1. Dataset

The dataset, ActivityNet Captions, is a recently collected large-scale dense video captioning benchmark, which contains 20,000 videos covering a wide range of complex human activities. Each video is aligned with a series of temporally annotated sentences. On average, there are 3.65 temporally localized sentences for each video, resulting in a total of 100,000 sentences. In our experiments, we follow the settings in [16], and take 10,024 videos for training, 4,926 for validation and 5,044 for testing.

4.2. Dense Video Captioning Task

We firstly investigate our system on dense video captioning task. The task is to detect individual events and then describe each event with natural language.

Compared Approaches. To empirically verify the merit of our proposed dense video captioning system, we compare the following video captioning baselines:
(1) Long Short-Term Memory (LSTM) [33]: LSTM utilizes a CNN plus RNN framework to directly translate from video pixels to natural language descriptions. The frame features are mean pooled to generate the video features.
(2) Sequence to Sequence - Video to Text (S2VT) [32]: S2VT incorporates both RGB and optical flow inputs, and the encoding and decoding of the inputs and word representations are learnt jointly in a parallel manner.
(3) Temporal Attention (TA) [34]: TA combines the frame representation from GoogleNet [29] and video clip representation from 3D CNN trained on hand-crafted descriptors. Furthermore, a soft attention mechanism is employed to dynamically attend to specific temporal regions of the video while generating sentences.
(4) Hierarchical Recurrent Neural Networks (H-RNN) [38]: H-RNN generates paragraphs by using one RNN to generate individual sentences and the second to capture the inter-sentence dependencies. Moreover, both spatial and temporal attention mechanisms are leveraged in H-RNN.
(5) Dense-Captioning Events (DCE) [16]: DCE leverages a multi-scale variant of DAP [6] to localize temporal event proposals and S2VT [32] as base captioning module to describe each event. An attention module is further incorporated to exploit temporal context for dense captioning.
(6) Dense Video Captioning (DVC) is our complete system in this paper. Two slightly different settings of DVC are named as DVC-D and DVC-D-A. The former only incorporates the descriptiveness-driven temporal attention mechanism into LSTM in SG module and is trained without attributes and reinforcement learning based optimization, while the latter is more similar to DVC that only replaces the expected sentence-level reward loss in DVC with the traditional cross entropy loss.

Note that DCE is the only existing work on dense video captioning task and most previous video captioning works (e.g., LSTM, S2VT, TA, and H-RNN) focus on describing entire videos without detecting a series of events. Hence we compare the five video captioning baselines on dense video captioning task by feeding them with the fixed ground truth proposals or the learnt ones from our TEP module.

Settings. For video clip representation, we utilize the publicly available 500-way C3D in [16], whose dimension is reduced by PCA from the original 4,096-way output of fc7 of C3D pre-trained on Sports-1M video dataset [13]. For representation of attributes/categories, we treat all the 200 categories on ActivityNet dataset [4] as the high-level semantic attributes and train the attribute detectors with cross entropy loss, resulting in the final 200-way vector of probabilities. Each word in the sentence is represented as "one-hot" vector (binary index vector in a vocabulary). The dimension of the input and hidden layers in LSTM are both set to 1,024. The tradeoff parameter $\lambda_0$ leveraging the event probability and descriptiveness score for proposal selection is empirically set to 0.2. The tradeoff parameters $\alpha$ and $\beta$ in Eq.(4) are set as 0.5 and 10. For the tradeoff parameters $\lambda_1$ and $\lambda_2$ in Eq.(9), we set them as 1 and 20, respectively.
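As a worked substitution, plugging these trade-off values into the fused proposal score and into Eq.(4) and Eq.(9) gives the concrete forms below. Nothing new is introduced here; the block only instantiates the formulas with the reported settings.

```latex
\begin{align*}
p^i_{conf} &= p^i_{event} + 0.2\, p^i_{des}, \\
L_{TEP}    &= L_{event} + 0.5\, L_{tcr} + 10\, L_{des}, \\
L          &= L_{TEP} + 20\, L_{SG}.
\end{align*}
```

Read this way, the descriptiveness score acts as a secondary signal when ranking proposals, while the descriptiveness regression and reward terms receive the largest weights during training.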
We mainly implement our DVC baed on Caffe [12], which i one of widely adopted deep learning framework. The whole ytem i trained by Adam [14] optimizer. The initial learning rate i et a and the mini-batch ize i et a 1. Note that SG module in our DVC i pre-trained with ground-truth propoal-entence pair. The entencelevel reward in SG i meaured with METEOR. Evaluation Metric. For the evaluation of our propoed model, we follow the metric in [16] to meaure the ability to jointly localize and decribe dene event. Thi metric compute the mean average preciion (map) acro tiou threhold of 0.3, 0.5, 0.7, and 0.9 when captioning the top 1,000 propoal. The preciion of caption i meaured by three evaluation metric: BLEU@N [21], METEOR [1], and CIDEr-D [31]. All the metric are computed by uing the code 1 releaed by ActivityNet Evaluation Server. Performance Comparion. Table 1 how the performance of different model on ActivityNet Caption validation et. Overall, the reult acro ix evaluation metric with ground truth propoal and learnt propoal conitently indicate that our propoed DVC achieve uperior performance againt other tate-of-the-art video captioning technique including non-attention model (, S2VT) and attention-baed approache (TA, H-RNN, DCE). In particular, the METEOR core of our DVC can achieve 10.33% 1

with ground truth proposals, making the relative improvement over the best competitor DCE by 16.33%, which is considered significant progress on this benchmark. As expected, the METEOR score drops to 6.93% when the predicted proposals from our TEP module are provided instead of ground truth proposals. Moreover, DVC-D, by additionally leveraging descriptiveness-driven temporal attention for dense video captioning, leads to a performance boost against [33]. The results basically indicate the advantage of weighting each local clip with holistic attention scores in producing the proposal representation for sentence generation, instead of representing each proposal by directly performing mean pooling over its clips as in [33]. In addition, DVC-D-A, by additionally augmenting with high-level semantic attributes, consistently improves over DVC-D on all the metrics, but its METEOR scores are still lower than those of DVC. This confirms the effectiveness of utilizing reinforcement learning techniques for directly optimizing with a METEOR-based reward loss, which harmonizes the SG module with respect to its testing inference.

Table 1. METEOR (M) and CIDEr-D (C) scores of our DVC and other state-of-the-art video captioning methods for the dense video captioning task on the ActivityNet Captions validation set (left: performance with ground truth (GT) proposals; right: performance with the learnt proposals from our TEP module). All values are reported as percentage (%).
Columns, for each of the two settings (with GT proposals, with learnt proposals): B@1, B@2, B@3, B@4, M, C.
Rows: [33], S2VT [32], TA [34], H-RNN [38], DCE [16], DVC-D, DVC-D-A, DVC.

Figure 3 showcases a few dense video captioning results generated by different methods and human-annotated ground truth sentences. From these exemplar results, it is easy to see that all of these automatic methods can generate somewhat relevant sentences, while our proposed DVC can generate more relevant and descriptive sentences by jointly exploiting the descriptiveness-driven temporal attention mechanism and high-level attributes for boosting dense video captioning. For example, compared to the phrase "brushing the dog's hair" in the sentence generated by [33], "cutting the cat's claws" in our DVC describes the last event proposal of the first video more precisely.

Figure 3. Dense video captioning results on the ActivityNet Captions validation set. The output temporally localized sentences are generated by 1) Ground Truth, 2) [33], and 3) our DVC. We show the results with the highest overlap with ground truth captions.

Table 2. User study on two criteria: M1 - percentage of sets of captions generated by different methods that are evaluated as better than or equal to human captions; M2 - percentage of sets of captions that pass the Turing Test.
Columns: Human, DVC, H-RNN [38], TA [34], S2VT [32], [33].
Rows: M1, M2.

Human Evaluation.
To better understand how satisfactory the localized temporal event proposals and the corresponding generated sentences of different methods are, we also conducted a human study to compare our DVC against four baselines, i.e., H-RNN, TA, S2VT, and [33]. A total of 12 evaluators from different education backgrounds were invited, and a subset of 1K videos was randomly selected from the validation set for the subjective evaluation. The evaluation process is as follows. All the evaluators are organized into two groups. We show the first group all five sets of temporally localized sentences generated by each approach plus a series of temporally localized human-annotated sentences and ask them the question: Do the systems produce sets of temporally localized sentences resembling human-generated sentences? In contrast, we show the second group only one set of temporally localized sentences at a time, generated either by one of the approaches or by human annotation, and they are asked: Can you determine whether the given set of sentences has been generated by a system or by a human being? From the evaluators' responses, we calculate two metrics: 1) M1: percentage of sets of captions that are evaluated as better than or equal to human captions; 2) M2: percentage of sets of captions that pass the Turing Test. Table 2 lists the results of the user study. Overall, our DVC is clearly the winner on both criteria. In particular, the percentage achieves 35.1% and 38.7% in terms of M1 and M2, respectively, making the absolute improvement over the best competitor H-RNN by 2.2% and 2.4%.
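As a rough illustration of how M1 and M2 can be tallied from evaluator responses, the following Python sketch computes each metric as a percentage of caption sets. The vote lists and the helper name `percentage` are hypothetical and not taken from the study, and how responses from several evaluators were aggregated per caption set is not described here, so one boolean judgment per set is an assumption.

```python
from typing import Sequence


def percentage(flags: Sequence[bool]) -> float:
    """Share of True entries, expressed as a percentage."""
    if not flags:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical judgments for five caption sets produced by one method.
# Group 1: was the set judged better than or equal to the human captions?
better_or_equal = [True, False, False, True, False]
# Group 2: was the set judged to be written by a human (passes the Turing Test)?
judged_human = [True, True, False, True, False]

m1 = percentage(better_or_equal)  # M1 for this method
m2 = percentage(judged_human)     # M2 for this method
```

The same helper would be applied once per method (and once to the human-written sets) to fill the two rows of a table such as Table 2.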

Performance on ActivityNet Evaluation Server. We also submitted our best run in terms of METEOR score, i.e., DVC, to the online ActivityNet evaluation server and evaluated the performance on the official testing set. Table 3 shows the performance Leaderboard on the official testing set. Please note that here we design two submission runs for our DVC, i.e., DVC (C3D) and DVC (P3D). The input clip features of the two runs are 500-way C3D features and the 2048-way output of the pool5 layer from P3D ResNet [22], respectively. Compared to the top performing methods, our proposed DVC (C3D) achieves the best METEOR score. In addition, when leveraging the clip features from P3D ResNet, our METEOR score on the testing set is further boosted up to 12.96%, ranking first on the Leaderboard.

Table 3. Leaderboard of the published state-of-the-art dense video captioning models on the online ActivityNet evaluation server.
Columns: DVC (P3D), DVC (C3D), TAC [9], DCE [16].
Row: METEOR.

Temporal Event Proposal Task

The second experiment is conducted on the temporal event proposal task, which evaluates the capability of our TEP module to adequately localize all events for a given video.

Compared Approaches. We compare our DVC with three state-of-the-art temporal action proposal methods: (1) Temporal Actionness Grouping (TAG) [39]. TAG utilizes an actionness classifier to generate an actionness curve, followed by the watershed algorithm to produce basins. The proposals are finally generated by grouping the basins. (2) Dense-Captioning Events (DCE) [16]. DCE leverages a multi-scale variant of the action proposal model in [6] to localize temporal event proposals. (3) Temporal Unit Regression Network (TURN) [8]. TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression.

Evaluation Metrics. For the temporal action proposal task, we adopt the Area-Under-the-Curve (AUC) score for the Average Recall vs. Average Number of Proposals per Video (AR-AN) curve in [9] as the evaluation metric.
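To make the AR-AN metric more concrete, here is a minimal Python sketch of its ingredients: temporal IoU (tIoU) between two segments, recall at one tIoU threshold, average recall over thresholds 0.50 to 0.95 in steps of 0.05, and a trapezoidal area under an AR-AN curve. All function names and the toy segments are hypothetical, and normalising the area by the AN range is an assumption; the official ActivityNet evaluation code may differ in its details.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall(gt: Sequence[Segment], props: Sequence[Segment], thr: float) -> float:
    """Fraction of ground-truth events matched by some proposal at tIoU >= thr."""
    hits = sum(1 for g in gt if any(tiou(g, p) >= thr for p in props))
    return hits / len(gt)


def average_recall(gt: Sequence[Segment], props: Sequence[Segment]) -> float:
    """Mean recall over tIoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return sum(recall(gt, props, t) for t in thresholds) / len(thresholds)


def auc(an: Sequence[float], ar: Sequence[float]) -> float:
    """Trapezoidal area under the AR-AN curve, normalised by the AN range."""
    area = sum(
        (an[i + 1] - an[i]) * (ar[i] + ar[i + 1]) / 2 for i in range(len(an) - 1)
    )
    return area / (an[-1] - an[0])


# Toy example: one video, two ground-truth events, two proposals.
gt_events = [(0.0, 10.0), (20.0, 30.0)]
proposals = [(0.0, 10.0), (20.0, 37.0)]
ar = average_recall(gt_events, proposals)
```

In the toy example the first event is matched exactly, while the second proposal overlaps its event with a tIoU of 10/17, so it only counts at the two lowest thresholds. A full evaluation would repeat this over all videos while varying the number of retained proposals per video to trace the curve.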
AR is defined as the mean of all recall values using tIoU thresholds between 0.5 and 0.95 (step size: 0.05), and AN denotes the total number of proposals divided by the number of videos.

Performance Comparison. Figure 4 shows the AR-AN curves of four runs on the ActivityNet Captions validation set for the temporal event proposal task. Overall, the quantitative results with regard to AUC score indicate that our DVC outperforms the other methods. In particular, by leveraging temporal coordinate regression for adjusting the temporal boundaries of detected proposals, DCE and TURN lead to a large performance boost against TAG. Moreover, DVC, by additionally incorporating descriptiveness regression into the TEP module, further improves over DCE and TURN. The results indicate the advantage of joint detection and captioning, which refines the event proposals from the language perspective.

Figure 4. The AR-AN curves of different approaches on the ActivityNet Captions validation set for the temporal event proposal task. Axes: Average Number of Proposals per Video (horizontal), Average Recall (vertical). Legend: DVC, AUC=59.91; TURN, AUC=57.85; DCE, AUC=54.54; TAG, AUC=

Table 4. Effect of utilizing multiple anchor layers.
Columns: conv3 to conv8, conv9, conv10, conv11, conv12, AUC, Parameter Number (M).

Effect of Multiple Anchor Layers. In order to show the relationship between performance and the number of anchor layers with different temporal resolutions, we progressively stack anchor layers with decreasing temporal resolutions and compare the performance. The results shown in Table 4 indicate that increasing the number of anchor layers with different temporal resolutions can generally lead to performance improvements. Meanwhile, the number of parameters in all adopted anchor layers increases. Thus, we finally adopt conv3 to conv11 as anchor layers, as that gives a better tradeoff between performance and model complexity.

5. Conclusions

We have presented a novel deep architecture which unifies the temporal localization of event proposals and sentence generation for dense video captioning.
Particularly, we study the problem of how to build the interaction across the two sub-challenges (i.e., temporal event proposal and sentence generation) and how to integrate such interaction into a deep learning framework for enhancing dense video captioning. To verify our claim, we have devised a descriptiveness regression component and incorporated it into a single shot detection structure, on one hand to adjust the event proposals from the language perspective in the TEP module, and on the other, to measure the descriptive complexity of each event in the SG module. Experiments conducted on the ActivityNet Captions dataset validate our model and analysis. More remarkably, we achieve superior results over state-of-the-art methods when evaluating our framework on both dense video captioning and temporal event proposal tasks.

Acknowledgments. This work was supported in part by NSFC under Grant , U , and the Guangzhou Science and Technology Program, China, under Grant

References

[1] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL workshop.
[2] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals. In CVPR.
[3] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR.
[4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.
[5] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV.
[6] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. DAPs: Deep action proposals for action understanding. In ECCV.
[7] H. Fang, S. Gupta, et al. From captions to visual concepts and back. In CVPR.
[8] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In ICCV.
[9] B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, R. Khrisna, V. Escorcia, K. Hata, and S. Buch. ActivityNet challenge 2017 summary. arXiv preprint.
[10] R. Girshick. Fast R-CNN. In ICCV.
[11] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In MM.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR.
[15] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV.
[16] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In ICCV.
[17] T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In MM.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV.
[19] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR.
[20] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In CVPR.
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL.
[22] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
[23] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR.
[24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
[25] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR.
[26] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV.
[27] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR.
[28] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS.
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR.
[30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV.
[31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR.
[32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV.
[33] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT.
[34] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV.
[35] T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei. MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop.
[36] T. Yao, Y. Pan, Y. Li, and T. Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR.
[37] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In ICCV.
[38] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR.
[39] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.

EXPLORING COGNITIVE STRATEGIES FOR INTEGRATING MULTIPLE-VIEW VISUALIZATIONS

EXPLORING COGNITIVE STRATEGIES FOR INTEGRATING MULTIPLE-VIEW VISUALIZATIONS EXPLORING COGNITIVE STRATEGIES FOR INTEGRATING MULTIPLE-VIEW VISUALIZATIONS Young Sam Ryu 1, Beth Yot 2, Gregorio Convertino 2, Jian Chen 2, and Chri North 2 Grado Department of Indutrial and Sytem Engineering

More information

Image Captioning using Reinforcement Learning. Presentation by: Samarth Gupta

Image Captioning using Reinforcement Learning. Presentation by: Samarth Gupta Image Captioning using Reinforcement Learning Presentation by: Samarth Gupta 1 Introduction Summary Supervised Models Image captioning as RL problem Actor Critic Architecture Policy Gradient architecture

More information

arxiv: v2 [cs.cv] 3 Apr 2018

arxiv: v2 [cs.cv] 3 Apr 2018 Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning Jingwen Wang Wenhao Jiang Lin Ma Wei Liu Yong Xu South China University of Technology Tencent AI Lab {jaywongjaywong, cswhjiang,

More information

Seismic Response Control of Structures using Liquid Column Vibration Absorber Considering Real Earthquake Ground Motions

Seismic Response Control of Structures using Liquid Column Vibration Absorber Considering Real Earthquake Ground Motions Seimic Repone Control of Structure uing iquid Column Vibration Aborber Conidering Real Ground Motion Debai Panda M. Tech Scholar National Intitute of Technology Agartala Agartala, India Dr. Rama Debbarma

More information

An economic analysis of a methionine source comparison response model

An economic analysis of a methionine source comparison response model An economic analyi of a methionine ource comparion repone model D. Vedenov and G. M. Peti 1 Department of Agricultural Economic, Texa A&M Univerity, 2124 TAMU, College Station 77843-2124; and Department

More information

arxiv: v1 [cs.cv] 12 Dec 2016

arxiv: v1 [cs.cv] 12 Dec 2016 Text-guided Attention Model for Image Captioning Jonghwan Mun, Minsu Cho, Bohyung Han Department of Computer Science and Engineering, POSTECH, Korea {choco1916, mscho, bhhan}@postech.ac.kr arxiv:1612.03557v1

More information

CCXCIII. VITAMIN A DETERMINATION: RELA- AND PHYSICAL METHODS OF TEST. TION BETWEEN THE BIOLOGICAL, CHEMICAL

CCXCIII. VITAMIN A DETERMINATION: RELA- AND PHYSICAL METHODS OF TEST. TION BETWEEN THE BIOLOGICAL, CHEMICAL CCXCIII. VITAMIN A DETERMINATION: RELA- TION BETWEEN THE BIOLOGICAL, CHEMICAL AND PHYSICAL METHODS OF TEST. BY KATHLEEN CULHANE LATHBURY. From the Phyiological Laboratorie, The Britih Drug Houe, Ltd. (Received

More information

Computerized testing of cognitive functions

Computerized testing of cognitive functions Computerized teting of cognitive function PhDr. Jiri Kloe Head of the Central edical Pychology Department Central ilitary Hopital Prague Prague, Czech Republic Doc. PhDr. ilan Brichcin, CSc. Senior Reearch

More information

Evaluation of a Program to Enhance Young Drivers Safety in Israel

Evaluation of a Program to Enhance Young Drivers Safety in Israel Evaluation of a Program to Enhance Young Driver Safety in Irael Tomer Toledo* Technion Irael Intitute of Technology, Haifa, Irael Tippy Lotan Or Yarok, Hod Haharon, Irael Orit Taubman - Ben-Ari Bar-Ilan

More information

Evaluating the effectiveness of rating instruments for a communication skills assessment of medical residents

Evaluating the effectiveness of rating instruments for a communication skills assessment of medical residents Adv in Health Sci Educ (2009) 14:575 594 DOI 10.1007/10459-008-9142-2 ORIGINAL PAPER Evaluating the effectivene of rating intrument for a communication kill aement of medical reident Cherdak Iramaneerat

More information

THE TYCHE AND SAFE MODELS: COMPARING TWO MILITARY FORCE STRUCTURE ANALYSIS SIMULATIONS

THE TYCHE AND SAFE MODELS: COMPARING TWO MILITARY FORCE STRUCTURE ANALYSIS SIMULATIONS THE TYCHE AND SAFE MODELS: COMPARING TWO MILITARY FORCE STRUCTURE ANALYSIS SIMULATIONS Cheryl Eiler and Slawomir Weolkowki Daniel T. Wojtazek Centre for Operational Reearch and Analyi Atomic Energy of

More information

Study of Fixed Assets Investment s Effect on the Employment of Three Industries

Study of Fixed Assets Investment s Effect on the Employment of Three Industries Proceeding of the 7th International Conference on Innovation & Management 951 Study of Fixed Aet Invetment Effect on the Employment of Three Indutrie Fan Fan, Li Jing Intitute of Economic, Yangtze Univerity,

More information

THE INVESTIGATION OF THE EFFECT OF THE REINFORCEMENT S KIND ON THE TENSILE STRENGTH IN THE FIBER REINFORCED COMPOSITE MATERIALS

THE INVESTIGATION OF THE EFFECT OF THE REINFORCEMENT S KIND ON THE TENSILE STRENGTH IN THE FIBER REINFORCED COMPOSITE MATERIALS Trakia Journal of Science, Vol. 7, Suppl. 2, pp 15-19, 2009 Copyright 2009 Trakia Univerity Available online at: http://www.uni-z.bg ISSN 1313-7050 (print) ISSN 1313-3551 (online) Original Contribution

More information

arxiv: v1 [cs.cv] 3 Apr 2018

arxiv: v1 [cs.cv] 3 Apr 2018 End-to-End Dense Video Captioning with Masked Transformer arxiv:1804.00819v1 [cs.cv] 3 Apr 2018 Luowei Zhou University of Michigan luozhou@umich.edu Abstract Richard Socher Salesforce Research richard@socher.org

More information

Noise Maps for Quantitative and Clinical Severity Towards Long-Term ECG Monitoring

Noise Maps for Quantitative and Clinical Severity Towards Long-Term ECG Monitoring enor Article Noie Map for Quantitative and Clinical Severity Toward Long-Term ECG Monitoring Etrella Ever-Villalba ID, Francico Manuel Melgarejo-Meeguer, Manuel Blanco-Velaco ID, Francico Javier Gimeno-Blane

More information

Contour Integration in Anisometropic Amblyopia

Contour Integration in Anisometropic Amblyopia Pergamon PII: 0042-6989(97)00233-2 Viion Re., Vol. 38, No. 6, pp. 889-894, 1998 1998 Elevier cience Ltd. All right reerved Printed in Great Britain 0042-6989/98 $19.00 + 0.00 Contour Integration in Aniometropic

More information

REVIEW for Exam 2. Chapters 9 13 (& chi-square in ch8)

REVIEW for Exam 2. Chapters 9 13 (& chi-square in ch8) REVIEW for Exam Chapter 9 3 & chi-quare in ch8 True or Fale. Etimated tandard error of the mean in a paired-ample t-tet i baed on the variance of the difference core.. W/in S deign i particularly ueful

More information

Classifying Knee Pathologies using Instantaneous Screws of the Six Degrees-of-Freedom Knee Motion

Classifying Knee Pathologies using Instantaneous Screws of the Six Degrees-of-Freedom Knee Motion Proceeding of the 26 IEEE International Conference on Robotic and Automation Orlando, Florida - May 26 Claifying Knee Pathologie uing Intantaneou Screw of the Six Degree-of-Freedom Knee Motion Alon Wolf

More information

Cortical representations of confidence in a visual perceptual decision

Cortical representations of confidence in a visual perceptual decision Received Feb Accepted Apr Publihed 5 Jun DOI:.8/ncomm9 Cortical repreentation of confidence in a viual perceptual deciion Leopold Zizlperger,, *, Thoma Sauvigny, *, Barbara Händel & Thoma Haarmeier,5 To

More information

Crossmodal temporal discrimination: Assessing the predictions of a general pacemaker counter model

Crossmodal temporal discrimination: Assessing the predictions of a general pacemaker counter model Journal Perception & Pychophyic 6,?? 68 (?), (7),???-??? 114-115 Cromodal temporal dicrimination: Aeing the prediction of a general pacemaker counter model ROLF ULRICH and JUDITH NITSCHKE Univerity of

More information

Appendix Am A Comparison of the National Cancer Institute s and the International Agency for Research on Cancer s Evaluation of Bioassay Results

Appendix Am A Comparison of the National Cancer Institute s and the International Agency for Research on Cancer s Evaluation of Bioassay Results Appendixe.. Content Page Appendix A: A Comparion of the National Cancer ntitute and the nternational Agency for Reearch on Cancer Evaluation of Bioaay Reult...........................................211

More information

Sequential Predictions Recurrent Neural Networks

Sequential Predictions Recurrent Neural Networks CS 2770: Computer Vision Sequential Predictions Recurrent Neural Networks Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 One Motivation: Descriptive Text for Images It was an arresting

More information

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS SEOUL Oct.7, 2016 DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS Gunhee Kim Computer Science and Engineering Seoul National University October

More information

Research Article Numerical Treatment of the Model for HIV Infection of CD4 T Cells by Using Multistep Laplace Adomian Decomposition Method

Research Article Numerical Treatment of the Model for HIV Infection of CD4 T Cells by Using Multistep Laplace Adomian Decomposition Method Dicrete Dynamic in Nature and Society Volume 2012, Article ID 976352, 11 page doi:10.1155/2012/976352 Reearch Article Numerical Treatment of the Model for HIV Infection of CD4 T Cell by Uing Multitep Laplace

More information

Recurrent Neural Networks

Recurrent Neural Networks CS 2750: Machine Learning Recurrent Neural Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2017 One Motivation: Descriptive Text for Images It was an arresting face, pointed of chin,

More information

Cognitive Modeling & Processing for Speech Recognition Ears and Beyond

Cognitive Modeling & Processing for Speech Recognition Ears and Beyond Introduction Cognitive Modeling & Proceing for Speech Recognition Ear and Beyond B.H. Juang & Woojay Jeon Georgia Intitute of Technology February, 5 ASR ytem till perform far wore than human litener under

More information

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Translating Videos to Natural Language Using Deep Recurrent Neural Networks Translating Videos to Natural Language Using Deep Recurrent Neural Networks Subhashini Venugopalan UT Austin Huijuan Xu UMass. Lowell Jeff Donahue UC Berkeley Marcus Rohrbach UC Berkeley Subhashini Venugopalan

More information

Point defects in silicon after zinc diffusion a deep level transient spectroscopy and spreading-resistance profiling study

Point defects in silicon after zinc diffusion a deep level transient spectroscopy and spreading-resistance profiling study Semicond. Sci. Technol. 14 (1999) 435 440. Printed in the UK PII: S0268-1242(99)93053-5 Point defect in ilicon after zinc diffuion a deep level tranient pectrocopy and preading-reitance profiling tudy

More information

Data for MBI Workshop Statistics of Time Warpings and Phase Variations. Three-dimensional vascular geometry dataset

Data for MBI Workshop Statistics of Time Warpings and Phase Variations. Three-dimensional vascular geometry dataset Data for MBI Workhop Statitic of Time Warping and Phae Variation Mathematical Biocience Intitute, November 13-16, 2012 Three-dimenional vacular geometry dataet May 30, 2012 1 Data background Thee data

More information

arxiv: v1 [cs.cv] 2 May 2017

arxiv: v1 [cs.cv] 2 May 2017 Dense-Captioning Events in Videos Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles Stanford University {ranjaykrishna, kenjihata, fren, feifeili, jniebles}@cs.stanford.edu arxiv:1705.00754v1

More information

Lipid Segregation on Cylindrically and Spherically Curved Membranes

Lipid Segregation on Cylindrically and Spherically Curved Membranes Lipid Segregation on Cylindrically and Spherically Curved Membrane where n and u are unit vector repreenting, repectively, lipid orientation and membrane normal, repreent the two-dimenional derivative

More information

Numerical Simulation of Scour Depth Variation Around Vertical Wall Abutments

Numerical Simulation of Scour Depth Variation Around Vertical Wall Abutments World Journal of Reearch and Review (WJRR) ISSN:2455-3956, Volume-5, Iue-6, December 2017 Page 25-30 Numerical Simulation of Scour Depth Variation Around Vertical Wall Abutment Evangelia Farirotou, Nikolao

More information

Predicting Peptides That Bind to MHC Molecules Using Supervised Learning of Hidden Markov Models

Predicting Peptides That Bind to MHC Molecules Using Supervised Learning of Hidden Markov Models PROTEINS: Structure, Function, and Genetic 33:460 474 (1998) RESEARCH ARTICLES Predicting Peptide That Bind to MHC Molecule Uing Supervied Learning of Hidden Markov Model Hirohi Mamituka* C&C Media Reearch

More information

Vector Learning for Cross Domain Representations

Vector Learning for Cross Domain Representations Vector Learning for Cross Domain Representations Shagan Sah, Chi Zhang, Thang Nguyen, Dheeraj Kumar Peri, Ameya Shringi, Raymond Ptucha Rochester Institute of Technology, Rochester, NY 14623, USA arxiv:1809.10312v1

More information

Attention Correctness in Neural Image Captioning

Attention Correctness in Neural Image Captioning Attention Correctness in Neural Image Captioning Chenxi Liu 1 Junhua Mao 2 Fei Sha 2,3 Alan Yuille 1,2 Johns Hopkins University 1 University of California, Los Angeles 2 University of Southern California

More information

Spinal Flexibility and Individual Factors That Influence It

Spinal Flexibility and Individual Factors That Influence It Spinal lexibility and Individual actor That Influence It ICHELE C. BATTI'E, STANLEY J. BIGOS, ANN SHEEHY, and ARK D. WORTLEY We conducted an invetigation to examine the pinal flexibility of a large, adult

More information

Hierarchical Convolutional Features for Visual Tracking

Hierarchical Convolutional Features for Visual Tracking Hierarchical Convolutional Features for Visual Tracking Chao Ma Jia-Bin Huang Xiaokang Yang Ming-Husan Yang SJTU UIUC SJTU UC Merced ICCV 2015 Background Given the initial state (position and scale), estimate

More information

Streamlined Dense Video Captioning

Streamlined Dense Video Captioning 1 3 Topic: caesar salad recipe : a caesar salad is ready and is served in a bowl ee 2 : croutons are in a bowl and chopped ingredients are separated ee 3 : the man mix all the ingredients in a bowl to

More information

arxiv: v3 [cs.cv] 23 Jul 2018

arxiv: v3 [cs.cv] 23 Jul 2018 Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data Xihui Liu 1, Hongsheng Li 1, Jing Shao 2, Dapeng Chen 1, and Xiaogang Wang 1 arxiv:1803.08314v3 [cs.cv] 23 Jul

More information

Joint Dictionary Learning-based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery

Joint Dictionary Learning-based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery > RPLAC HS LN WH YOUR PAPR DNFCAON NUMBR (DOUBL-CLCK HR O D) < oint Dictionary Learning-baed Non-Negative Matrix Factorization for Voice Converion to mprove Speech ntelligibility After Oral Surgery Szu-Wei

More information

arxiv: v1 [cs.cv] 7 Dec 2018

arxiv: v1 [cs.cv] 7 Dec 2018 An Attempt towards Interpretable Audio-Visual Video Captioning Yapeng Tian 1, Chenxiao Guan 1, Justin Goodman 2, Marc Moore 3, and Chenliang Xu 1 arxiv:1812.02872v1 [cs.cv] 7 Dec 2018 1 Department of Computer

More information

Inferring Clinical Correlations from EEG Reports with Deep Neural Learning

Inferring Clinical Correlations from EEG Reports with Deep Neural Learning Inferring Clinical Correlations from EEG Reports with Deep Neural Learning Methods for Identification, Classification, and Association using EHR Data S23 Travis R. Goodwin (Presenter) & Sanda M. Harabagiu

More information

QUANTITATIVE STUDIES ON THE CILIATE GLAUCOMA

QUANTITATIVE STUDIES ON THE CILIATE GLAUCOMA 422 QUANTITATIVE STUDIES ON THE CILIATE GLAUCOMA I. THE REGULATION OF THE SIZE AND THE FISSION RATE BY THE BACTERIAL FOOD SUPPLY BY J. P. HARDING, PH.D. Zoological Laboratory, Cambridge (Received 2 February

More information

A Novel Pulse Compression Scheme Based on Minimum Mean-Square Error Reiteration 1

A Novel Pulse Compression Scheme Based on Minimum Mean-Square Error Reiteration 1 A Novel Pule Compreion Scheme Baed on Minimum Mean-Square Error Reiteration Shannon D. Blunt Karl Gerlach Radar Diviion, Naval Reearch Laboratory 4555 Overlook Ave. S.W. Wahington DC 375 Abtract Thi paper

More information

Firms cash holdings and uncertainty

Firms cash holdings and uncertainty Firm cah holding and uncertainty Ruoran Gao Yaniv Grintein Thi verion: 5/13/2013 Preliminary Pleae do not quote without permiion Abtract We examine the dynamic relation between uncertainty and cah holding

More information

Comparison of set-shifting ability in patients with chronic schizophrenia and frontal lobe damage

Comparison of set-shifting ability in patients with chronic schizophrenia and frontal lobe damage Schizophrenia Reearch 37 (1999) 251 270 Commentary Comparion of et-hifting ability in patient with chronic chizophrenia and frontal lobe damage Chrito Panteli a,*, Fiona Z. Barber a, Thoma R.E. Barne b,

More information

Cost utility analysis of chemotherapy in symptomatic advanced nonsmall cell lung cancer

Cost utility analysis of chemotherapy in symptomatic advanced nonsmall cell lung cancer Eur Repir J 2006; 27: 895 901 DOI: 10.1183/09031936.06.00102705 CopyrightßERS Journal Ltd 2006 Cot utility analyi of chemotherapy in ymptomatic advanced nonmall cell lung cancer C.A. Doom*, Y.N. Lieven

More information

K-Complex Detection Based on Synchrosqueezing Transform

K-Complex Detection Based on Synchrosqueezing Transform AU Journal of Electrical Engineering AU J. Elec. Eng., 49()(07)5- DOI: 0.060/eej.07.577.5096 K-Complex Detection Baed on Synchroqueezing ranform Z. Ghanbari and M. H. Moradi * Faculty of Biomedical Engineering,

More information

Motivation: Attention: Focusing on specific parts of the input. Inspired by neuroscience.

Motivation: Attention: Focusing on specific parts of the input. Inspired by neuroscience. Outline: Motivation. What s the attention mechanism? Soft attention vs. Hard attention. Attention in Machine translation. Attention in Image captioning. State-of-the-art. 1 Motivation: Attention: Focusing

More information

A Framework for Auto-segmentation of Left Ventricle from Magnetic Resonance Images

A Framework for Auto-segmentation of Left Ventricle from Magnetic Resonance Images APCOM & ISCM 11-14 th Deceber, 2013, Singapore A Fraewor for Auto-egentation of Left Ventricle fro Magnetic Reonance Iage *Xulei Yang 1, Si Yong Yeo 1, Calvin Li 1, Yi Su 1, Min Wan 2, Liang Zhong 2, and

More information

Computational modeling of visual attention and saliency in the Smart Playroom

Computational modeling of visual attention and saliency in the Smart Playroom Andrew Jones Department of Computer Science, Brown University Abstract The two canonical modes of human visual attention bottom-up

Learning to Disambiguate by Asking Discriminative Questions Supplementary Material

Learning to Disambiguate by Asking Discriminative Questions Supplementary Material Yining Li 1 Chen Huang 2 Xiaoou Tang 1 Chen Change Loy 1 1 Department of Information Engineering, The Chinese University

LSTD: A Low-Shot Transfer Detector for Object Detection

LSTD: A Low-Shot Transfer Detector for Object Detection Hao Chen 1,2, Yali Wang 1, Guoyou Wang 2, Yu Qiao 1,3 1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China 2 Huazhong

by Pace et al. (1985). These investigators developed

JOURNAL OF APPLIED BEHAVIOR ANALYSIS 1992, 25, 491-498 NUMBER 2 (SUMMER 1992) A COMPARISON OF TWO APPROACHES FOR IDENTIFYING REINFORCERS FOR PERSONS WITH SEVERE AND PROFOUND DISABILITIES WAYNE FISHER, CATHLEEN

arxiv: v1 [cs.cv] 18 May 2016

BEYOND CAPTION TO NARRATIVE: VIDEO CAPTIONING WITH MULTIPLE SENTENCES Andrew Shin, Katsunori Ohnishi, and Tatsuya Harada Grad. School of Information Science and Technology, The University of Tokyo, Japan

Protein Structure Prediction using 2D HP Lattice Model Based on Integer Programming Approach

2012 International Congress on Informatics, Environment, Energy and Applications-IEEA 2012 IPCSIT vol.38 (2012) (2012) IACSIT Press, Singapore Protein Structure Prediction using 2D HP Lattice Model Based on Integer

Cervical cytology intelligent diagnosis based on object detection technology

Cervical cytology intelligent diagnosis based on object detection technology Meiquan Xu xumeiquan@126.com Weixiu Zeng Semptian Co., Ltd. Machine Learning Lab. zengweixiu@gmail.com Hunhui Wu 736886978@qq.com

arxiv: v1 [stat.ml] 23 Jan 2017

Learning what to look in chest X-rays with a recurrent visual attention model arXiv:1701.06452v1 [stat.ML] 23 Jan 2017 Petros-Pavlos Ypsilantis Department of Biomedical Engineering King's College London

Tandem acoustic modeling: Neural nets for mainstream ASR?

Tandem acoustic modeling: Neural nets for mainstream ASR? Dan Ellis International Computer Science Institute Berkeley CA dpwe@icsi.berkeley.edu Outline 2 3 Tandem acoustic modeling Inside Tandem systems: What's going on? Future

arxiv: v2 [cs.cv] 10 Aug 2017

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering arXiv:1707.07998v2 [cs.CV] 10 Aug 2017 Peter Anderson 1, Xiaodong He 2, Chris Buehler 2, Damien Teney 3 Mark Johnson

Efficient Deep Model Selection

Efficient Deep Model Selection Jose Alvarez Researcher Data61, CSIRO, Australia GTC, May 9th 2017 www.josemalvarez.net conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 softmax prediction???????? Num Classes

CSE Introduction to High-Performance Deep Learning ImageNet & VGG. Jihyung Kil

CSE 5194.01 - Introduction to High-Performance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

Investigative and mechanistic toxicology Mechanism-based problem solving

Investigative and mechanistic toxicology Mechanism-based problem solving the investigative & exploratory toxicology company www.cxrbiosciences.com Investigative and mechanistic toxicology CXR Biosciences: Our approach

AAPS PharmSci 2002; 4 (4) article 42 (

Pharmacodynamic Modeling of Chemotherapeutic Effects: Application of a Transit Compartment Model to Characterize Methotrexate Effects in Vitro Submitted: May 15, 2002; Accepted: September 22, 2002; Published October

Differential Attention for Visual Question Answering

Differential Attention for Visual Question Answering Badri Patro and Vinay P. Namboodiri IIT Kanpur { badri,vinaypn }@iitk.ac.in Abstract In this paper we aim to answer questions based on images when provided

(12) United States Patent (10) Patent No.: US 9,011,147 B2

US009011147B2 (12) United States Patent (10) Patent No.: US 9,011,147 B2 Jacquemyn (45) Date of Patent: Apr. 21, 20 (54) DENTIST TOOL 4,144,645 A * 3/1979 Marshall ... 433/223 4,504,230 A 3/1985 Patch (75)

Noise removal and Ischemia detection from ECG Signal

Noise removal and Ischemia detection from ECG Signal Vijay Kumar, Ruby Gupta Abstract Heart is most important organ of human body. Electrocardiogram (ECG) is used for electrical recording of heart signal. In

Action Recognition. Computer Vision Jia-Bin Huang, Virginia Tech. Many slides from D. Hoiem

Action Recognition Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem This section: advanced topics Convolutional neural networks in vision Action recognition Vision and Language 3D

arxiv: v2 [cs.cv] 19 Dec 2017

An Ensemble of Deep Convolutional Neural Networks for Alzheimer's Disease Detection and Classification arXiv:1712.01675v2 [cs.CV] 19 Dec 2017 Jyoti Islam Department of Computer Science Georgia State University

arxiv: v2 [cs.cv] 6 Nov 2017

Published as a conference paper in International Conference of Computer Vision (ICCV) 2017 Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training Rakshith Shetty 1 Marcus

arxiv: v2 [cs.cv] 8 Sep 2016

arXiv:1608.07068v2 [cs.CV] 8 Sep 2016 Title Generation for User Generated Videos Kuo-Hao Zeng 1, Tseng-Hung Chen 1, Juan Carlos Niebles 2, Min Sun 1 1 Department of Electrical Engineering, National Tsing

Multi-attention Guided Activation Propagation in CNNs

Multi-attention Guided Activation Propagation in CNNs Xiangteng He and Yuxin Peng (B) Institute of Computer Science and Technology, Peking University, Beijing, China pengyuxin@pku.edu.cn Abstract. CNNs

Training for Diversity in Image Paragraph Captioning

Training for Diversity in Image Paragraph Captioning Luke Melas-Kyriazi George Han Alexander M. Rush School of Engineering and Applied Sciences Harvard University {lmelaskyriazi@college, hanz@college,

A Chemical Dynamics Approach to Understand Role of Vitamin D in Human Immunity

2014 1st International Congress on Environmental, Biotechnology, and Chemistry Engineering IPCBEE vol.64 (2014) (2014) IACSIT Press, Singapore DOI: 10.7763/IPCBEE.2014.V64.13 A Chemical Dynamics Approach to

Karnofsky Performance Scale (KPS) or Physical Performance Test (PPT)? That is the question

Critical Reviews in Oncology/Hematology 77 (2011) 142-147 Karnofsky Performance Scale (KPS) or Physical Performance Test (PPT)? That is the question Catherine Terret a,b,c, Gilles Albrand b,c,d, Géraldine Moncenix

arxiv: v2 [cs.lg] 1 Jun 2018

Shagun Sodhani 1 * Vardaan Pahuja 1 * arXiv:1805.11016v2 [cs.LG] 1 Jun 2018 Abstract Self-play (Sukhbaatar et al., 2017) is an unsupervised training procedure which enables the reinforcement learning agents

Learning to Evaluate Image Captioning

Learning to Evaluate Image Captioning Yin Cui 1,2 Guandao Yang 1 Andreas Veit 1,2 Xun Huang 1,2 Serge Belongie 1,2 1 Department of Computer Science, Cornell University 2 Cornell Tech Abstract Evaluation

Social Image Captioning: Exploring Visual Attention and User Attention

Sensors Article Social Image Captioning: Exploring Visual Attention and User Attention Leiquan Wang 1, Xiaoliang Chu 1, Weishan Zhang 1, Yiwei Wei 1, Weichen Sun 2,3 and Chunlei Wu 1, * 1 College of Computer & Communication Engineering,

Micellar Solutions of Ionic Surfactants and Their Mixtures with Nonionic Surfactants: Theoretical Modeling vs. Experiment 1

ISSN 1061-933X, Colloid Journal, 2014, Vol. 76, No. 3, pp. 5570. Pleiades Publishing, Ltd., 2014. Micellar Solutions of Ionic Surfactants and Their Mixtures with Nonionic Surfactants: Theoretical Modeling vs. Experiment

A NOVEL FUZZY NEURAL NETWORK ESTIMATOR FOR PREDICTING HYPOGLYCAEMIA IN INSULIN-INDUCED SUBJECTS

Proceedings 23rd Annual Conference IEEE/EMBS Oct.25-28, 2001, Istanbul, TURKEY A NOVEL FUZZY NEURAL NETWORK ESTIMATOR FOR PREDICTING HYPOGLYCAEMIA IN INSULIN-INDUCED SUBJECTS N. Ghevondian 1, H. T. Nguyen 1, S. Colagiuri

Towards image captioning and evaluation. Vikash Sehwag, Qasim Nadeem

Towards image captioning and evaluation Vikash Sehwag, Qasim Nadeem Overview Why automatic image captioning? Overview of two caption evaluation metrics Paper: Captioning with Nearest-Neighbor approaches

arxiv: v1 [cs.cv] 2 Jun 2017

INTEGRATED DEEP AND SHALLOW NETWORKS FOR SALIENT OBJECT DETECTION Jing Zhang 1,2, Bo Li 1, Yuchao Dai 2, Fatih Porikli 2 and Mingyi He 1 1 School of Electronics and Information, Northwestern Polytechnical

arxiv: v2 [cs.cv] 3 Jun 2018

S4ND: Single-Shot Single-Scale Lung Nodule Detection Naji Khosravan and Ulas Bagci Center for Research in Computer Vision (CRCV), School of Computer Science, University of Central Florida, Orlando, FL.

Effects of a Supervised Home Exercise Program on Patients with Severe Chronic Obstructive Pulmonary Disease

Effects of a Supervised Home Exercise Program on Patients with Severe Chronic Obstructive Pulmonary Disease ANGELA J. BUSCH and JAMES D. McCLEMENTS The purpose of this study was to analyze the effect of a home exercise

The Use of Virtual Reality Technologies during Physiotherapy of the Paretic Upper Limb in Patients after Ischemic Stroke

iMedPub Journals http://www.imedpub.com JOURNAL OF NEUROLOGY AND NEUROSCIENCE ISSN The Use of Virtual Reality Technologies during Physiotherapy of the Paretic Upper Limb in Patients after Ischemic Stroke Stryla

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering SUPPLEMENTARY MATERIALS 1. Implementation Details 1.1. Bottom-Up Attention Model Our bottom-up attention Faster R-CNN

Superior pre-attentive auditory processing in musicians

Cognitive Neuroscience 10, 1309-1313 (1999) THE present study focuses on influence of long-term experience on auditory processing, providing the first evidence for pre-attentively superior auditory processing in musicians.

Discriminability objective for training descriptive image captions

Discriminability objective for training descriptive image captions Ruotian Luo TTI-Chicago Joint work with Brian Price Scott Cohen Greg Shakhnarovich (Adobe) (Adobe) (TTIC) Discriminability objective for

MULTIPLE SCLEROSIS. National clinical guideline for diagnosis and management in primary and secondary care

The National Collaborating Centre for Chronic Conditions Funded to produce guidelines for the NHS by NICE MULTIPLE SCLEROSIS National clinical guideline for diagnosis and management in primary and secondary

An FSIS Update on the Prevention and Control of Foodborne Salmonella

1 Outline An FSIS Update on the Prevention and Control of Foodborne Salmonella Karen Becker, DVM, MPH, DACVPM Director, Applied USDA/ USAHA Salmonella Committee Providence, RI Oct 27, 2015 2 FSIS Mission

LETTER. PML-RARA mutations confer varying arsenic trioxide resistance. Protein & Cell. Protein Cell 2017, 8(4): DOI /s

Protein Cell 2017, 8(4):296-301 DOI 10.1007/s13238-016-0356-4 PML-RARA mutations confer varying arsenic trioxide resistance Dear Editor, Acute promyelocytic leukemia (APL) caused by the malignant proliferation of bone

IMA Preprint Series # 2377

A MATHEMATICAL MODEL FOR THE CTL EFFECT ON THE DRUG RESISTANCE DURING ANTIRETROVIRAL TREATMENT OF HIV INFECTION By Nicoleta Tarfulea IMA Preprint Series # 2377 (August 2011) INSTITUTE FOR MATHEMATICS AND

Multi-Scale Salient Object Detection with Pyramid Spatial Pooling

Multi-Scale Salient Object Detection with Pyramid Spatial Pooling Jing Zhang, Yuchao Dai, Fatih Porikli and Mingyi He School of Electronics and Information, Northwestern Polytechnical University, China.

Rumor Detection on Twitter with Tree-structured Recursive Neural Networks

1 Rumor Detection on Twitter with Tree-structured Recursive Neural Networks Jing Ma 1, Wei Gao 2, Kam-Fai Wong 1,3 1 The Chinese University of Hong Kong 2 Victoria University of Wellington, New Zealand

B657: Final Project Report Holistically-Nested Edge Detection

B657: Final Project Report Holistically-Nested Edge Detection Mingze Xu & Hanfei Mei May 4, 2016 Abstract Holistically-Nested Edge Detection (HED), which is a novel edge detection method based on fully

DeepDiary: Automatically Captioning Lifelogging Image Streams

DeepDiary: Automatically Captioning Lifelogging Image Streams Chenyou Fan David J. Crandall School of Informatics and Computing Indiana University Bloomington, Indiana USA {fan6,djcran}@indiana.edu Abstract.

Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling

AAAI-13 July 16, 2013 Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling Sheng-hua ZHONG 1, Yan LIU 1, Feifei REN 1,2, Jinghuan ZHANG 2, Tongwei REN 3 1 Department of

Progressive Attention Guided Recurrent Network for Salient Object Detection

Progressive Attention Guided Recurrent Network for Salient Object Detection Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, Gang Wang Dalian University of Technology, China Alibaba AILabs, China

Recognizing American Sign Language Gestures from within Continuous Videos

Recognizing American Sign Language Gestures from within Continuous Videos Yuancheng Ye 1, Yingli Tian 1,2, Matt Huenerfauth 3, and Jingya Liu 2 1 The Graduate Center, City University of New York, NY,