Captioning Videos Using Large-Scale Image Corpus

Du X Y, Yang Y, Yang L et al. Captioning videos using large-scale image corpus. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 32(3): 480-493 May 2017. DOI 10.1007/s…

Captioning Videos Using Large-Scale Image Corpus

Xiao-Yu Du 1,2, Member, CCF, Yang Yang 3,4, Member, CCF, ACM, IEEE, Liu Yang 1,5, Fu-Min Shen 3,4, Member, CCF, ACM, IEEE, Zhi-Guang Qin 1, Senior Member, CCF, Member, ACM, and Jin-Hui Tang 1,6,*, Senior Member, CCF, IEEE, Member, ACM

1 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
2 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
3 Center for Future Media, University of Electronic Science and Technology of China, Chengdu 611731, China
4 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
5 Sichuan University West China Hospital of Stomatology, Chengdu 610041, China
6 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

E-mail: duxiaoyu@cuit.edu.cn; dlyyang@gmail.com; yangliu988322@163.com; fumin.shen@gmail.com
E-mail: qinzg@uestc.edu.cn; jinhuitang@njust.edu.cn

Received December 26, 2016; revised April 7, 2017.

Abstract   Video captioning is the task of assigning complex high-level semantic descriptions (e.g., sentences or paragraphs) to video data. Different from previous video analysis techniques such as video annotation, video event detection and action recognition, video captioning is much closer to human cognition, with a smaller semantic gap. However, the scarcity of captioned video data severely limits the development of video captioning. In this paper, we propose a novel video captioning approach that describes videos by leveraging freely-available image corpora with abundant literal knowledge. There are two key aspects of our approach: 1) an effective integration strategy bridging videos and images, and 2) high efficiency in handling ever-increasing training data. To achieve these goals, we adopt sophisticated visual hashing techniques to efficiently index and search large-scale image collections for relevant captions, which is highly extensible to evolving data and the corresponding semantics. Extensive experimental results on various real-world visual datasets show the effectiveness of our approach with different hashing techniques, e.g., LSH (locality-sensitive hashing), PCA-ITQ (principal component analysis iterative quantization) and supervised discrete hashing, as compared with the state-of-the-art methods. It is worth noting that the empirical computational cost of our approach is much lower than that of an existing method, i.e., it takes 1/256 of the memory requirement and 1/64 of the time cost of the method of Devlin et al.

Keywords   video captioning, hashing, image captioning

1 Introduction

In recent years, driven by the rapid development of mobile devices, the Internet and social sharing platforms, a tremendous amount of user-generated video clips have been emerging on the Web. As an important alternative to words and images, such UGC videos have significantly changed the way people live and communicate. Every minute, … hours of video are uploaded to YouTube and 39 million videos are uploaded to Vine.

Regular Paper. Special Section of CVM 2017.
This work was partially supported by the National Basic Research 973 Program of China under Grant No. 2014CB347600, the National Natural Science Foundation of China under Grant Nos. …, …, …, and 6208, the National Ten-Thousand Talents Program of China (Young Top-Notch Talent), the National Thousand Young Talents Program of China, the Fundamental Research Funds for the Central Universities of China under Grant Nos. ZYGX2014Z007 and ZYGX2015J055, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK….
* Corresponding Author
©2017 Springer Science + Business Media, LLC & Science Press, China

Besides, on average 1.5 million videos are uploaded to Miaopai every day. Such a large amount of videos mostly lack descriptions. How to semantically index these video data and make them easier to see and comprehend is a significant problem [1].

As shown in Fig.1, video annotation [2] focuses on deriving simple semantic concepts from low-level features, i.e., describing videos in natural language words rather than numeric features. To comprehend more semantic information from video data, event detection [3], action recognition [1], refined frame tags [4], deduced tags [5] and many other semantic descriptions [6-7] were designed for learning more complex semantics [8]. Nevertheless, these approaches are based on classification, their results are confined to certain pre-defined concept/event sets, and the descriptive power of such results is very limited. To provide users with comprehensive descriptions, video captioning, which describes videos with sentences or paragraphs, is in urgent need.

Fig.1. Describing video clips using video annotation, action recognition, event detection and video captioning.

In existing video captioning approaches, convolutional neural networks (CNN) and long short-term memory (LSTM) are widely used. The conventional way is to first adapt a CNN model to extract video features and then utilize LSTM to generate video captions. Nonetheless, there are two disadvantages. 1) Extremely Complex Neural Networks for Video Data. The complex structure slows down execution, which limits the extensibility to variable large-scale datasets. 2) Scarcity Problem in Existing Video Corpora. For instance, one of the most popular video captioning datasets, YouTube2Text (Y2T), only contains 1970 video clips with captions. Such a scale of training data can hardly guarantee reliable models that generalize to new video data.

In order to address the above problems, we turn to seeking auxiliary captioned data. Image datasets always contain various corpora (e.g., imSitu provides situation annotations [9]), and they are often used as auxiliary information in video processing [10-11]. Moreover, we notice that there are many image captioning datasets with sufficiently relevant captions. Integrating image captioning corpora may be a good way to augment the scale of corpora for video captioning.

There are two major image captioning directions. The first is to extract image features (e.g., CNN features, and the supervised and unsupervised features [12]) and then generate captions by LSTM. The other is to generate captions from the descriptions of the images similar to the query image. As mentioned before, the first direction is less extensible and does not perform well under data scarcity and data updating. Therefore, in this paper, we choose to follow the second direction and refer to it as the retrieval approach in the subsequent content. With frequently-updated training data, the retrieval approach only needs to process the new data; in contrast, the deep learning approach may have to retrain the complex model, which can hardly be afforded. Meanwhile, even if there are only a very small number of images similar to a query image, the retrieval approach can still perform well by accurately discovering the most relevant samples to guarantee the performance.

It is intuitive that when the size of the training data is large enough, the retrieval approach is able to achieve good performance. However, the computational cost of k-nearest neighbor (kNN) search is extremely high, as the dimensionality of visual features is normally large, e.g., the fc7 layer (4096-D) from AlexNet or VGG. Moreover, the original visual features may contain redundancy and/or noise, which degrades the performance.

In order to compensate for the drawbacks of traditional kNN search, we propose to employ hashing techniques to index visual samples. With its excellent ability to reduce storage and computational cost, hashing [13] has been extensively studied and applied to multimedia retrieval in large-scale databases. The main idea is to utilize a small set of binary bits to represent a sample instead of a large number of real-valued features [14]. The similarity between samples can then be efficiently computed with the Hamming distance between the corresponding hash codes. Locality-sensitive hashing (LSH) is known as one of the most popular data-independent hashing methods. In recent years, data-dependent or learning-based hashing methods have become more popular since they perform much better in accuracy and compression rate. Iterative quantization (ITQ) is a typical unsupervised optimization algorithm. The popular unsupervised hashing method PCA-ITQ [15] compresses features by principal component analysis (PCA) and then optimizes the features by ITQ. The recently proposed supervised discrete hashing (SDH) [16] is a well-performed supervised hashing method.

In this paper, we propose a novel video captioning approach using hashing techniques. First of all, we utilize visual hashing methods to preprocess the visual elements, including images and videos; we instantiate our method with SDH in this paper. Then we retrieve the images similar to the query video and collect the corresponding captions as candidate captions. At last, we re-rank the candidate captions and select the consensus one as our captioning result.

The main contributions of this paper are summarized as follows.

We propose a novel video captioning approach by leveraging well-captioned image data. The sufficient semantic descriptions in large-scale image datasets help to overcome the deficiency of captioned videos.

We devise a simple yet effective approach to align image and video data to alleviate the influence of domain difference, which bridges images and videos by extracting representative video frames.

By utilizing hashing techniques, compared with the approach proposed in [17], the proposed retrieval captioning approach achieves a 64x speedup with only 1/256 of the memory overhead. Hash codes are used for effective and efficient feature compression, which makes the retrieval approach easily extensible to emerging data.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 elaborates the details of our proposed approach. Section 4 demonstrates the experimental results, followed by the conclusions in Section 5.

2 Related Work

The rapid development of deep neural networks (DNN) motivates the progress of multimedia processing. The convolutional neural network (CNN) [18] is a kind of DNN used to process image information [19]. The recurrent neural network (RNN) [20] is another kind of DNN, widely used in natural language generation [21]. To reduce the complexity of DNN implementation, various DNN tools were developed. Caffe [22] is a famous one, which can construct, train, test and improve networks in a simple way.
Therefore many classical CNN models are trained on Caffe, including AlexNet [18], GoogLeNet [23] and VGG [24]. Since CNN models always perform well across domains [25], the trained models are frequently selected to extract semantic features. The fully connected layers named fc7 in AlexNet and VGG16 are often treated as semantic features of a given image; their performance in image classification and similarity judgment is better than that of traditional features.

Once features are extracted, captioning approaches generate sentences word by word. Some existing works utilized probability models such as Markov chain [26], expectation maximization [27] and maximum entropy [28] models. Some directly utilized neural network models, RNN [29] and LSTM [30], and some mixed CNN and RNN [31-32] to train an integrative model. In the above approaches, the descriptions are generated mechanically, and thus we call them generation approaches.

Another kind of captioning approach takes existing human captions as the captioning result; such methods are called retrieval approaches. One of the retrieval approaches performs well in the MSCOCO challenge [33]. It is a simple and even rude solution. Given a query image, several similar images are collected as candidate images. Then the candidate captions, which refer to the candidate images, are rearranged, and the best caption is treated as the consensus caption. The retrieval approach was first

proposed in [34]. Ordonez et al. [34] constructed a corpus of one million captioned images and designed captioning experiments. Their experiments demonstrated that the more images are appended to the dataset, the more effective the results are. To retrieve similar images, Devlin et al. [17] used the k-nearest neighbor (kNN) method.

Hashing is another famous retrieval method, especially for large-scale datasets. The main idea is to effectively compress features into a set of binary codes (hash codes) so as to efficiently calculate the similarity using Hamming distance [35]. Some hashing methods have been accelerated [36] and some can extend the identifications [37]. To generalize the process, some hashing frameworks were proposed [38]. There are three kinds of hashing: 1) data-independent methods such as locality-sensitive hashing (LSH) [39], 2) unsupervised methods such as RDSH [40], IMH [41] and PCA-ITQ [15], and 3) supervised methods such as LDA-HASH [42], CCA-ITQ [15], RDCM [43] and SDH [16]. PCA-ITQ and CCA-ITQ are dimensionality reduction methods with iterative quantization, and they perform well as presented in [15]. Shen et al. [16] demonstrated that SDH performs much better than other supervised hashing methods. Generally the supervised methods perform better than the unsupervised ones, and the unsupervised ones perform better than the data-independent ones. It is notable that the number of hash bits also significantly affects the performance; for example, LSH empirically needs at least 512 bits to retrieve fairly similar images.

Many image captioning approaches have been applied to video captioning. Improving the image generation approaches has been a widely-used way. Ballas et al. [44] arranged several CNNs in a CNN matrix to process the adjacent frames synchronously. Mazloom et al. [45] took bag-of-words with SIFT features [46], MFCC [47] and attention points as video features. Shetty and Laaksonen [48] incorporated frame features extracted by CNN, classification from CNN and video features as the input of LSTM. Yao et al. [49] formed a 3D-CNN encoder-decoder framework, with which spatio-temporal vectors are divided into chunks as the input of LSTM. Pan et al. [50] proposed a hierarchical recurrent neural network to extract video features. Sener et al. [51] put information from videos and subtitles into bag-of-words and took it as the input of LSTM. These studies demonstrate that ideas from image captioning approaches benefit videos. Therefore, the retrieval approaches, so far rarely used in video captioning, are another captioning choice.

3 Proposed Approach

In this section, we describe the proposed video captioning approach. As shown in Fig.2, we first integrate the video corpus and image corpus by mapping them into a common visual semantic space, and then compress them into binary codes. Given a query video, we retrieve the similar samples from the mapped corpus and collect the captions related to the similar elements as our candidate captions. Finally, the candidate captions are rearranged, and the best caption, also called the consensus caption, is selected as the caption for the query video.

3.1 Data Integration

Note that each video may consist of a sequence of frames, which makes it difficult to directly compare videos and images. In this part, we integrate videos and images in the same space where we can calculate their similarity, and we design a simple yet effective approach to achieve this goal. Based on the fact that current UGC videos are very short (normally a few seconds), it is reasonable to assume that the semantics of a video is focused.
In fact, we have analyzed the distribution of video lengths in the Y2T dataset, and the average length is 9 s. As illustrated in Fig.3, the representative videos are semantically focused. Meanwhile, if we used multiple frames to represent videos, the common part of these frames (e.g., background) might be over-emphasized. Therefore, it is possible for us to use only one frame to represent a video, which also provides a straightforward way to connect image representation and video representation. In this work, we adopt the fc7 layer of VGG16 as the visual feature for both images and videos.

3.2 Visual Hashing

The visual features usually lie in a high-dimensional space (4096-D for AlexNet CNN features). We compress the features into a set of binary codes with hashing methods. There are various hashing methods; we select several typical methods of different kinds as our benchmark.

LSH. Locality-sensitive hashing (LSH) [39] is a fundamental hashing method and it is data-independent. Codes are generated using random projection. Randomly constructing a weight matrix W ∈ R^{d×b}, where d is the number of source feature dimensions and b is the number of hash bits, the codes can be calculated by C = (FW > 0). Here F ∈ R^{n×d} indicates n samples with d-dimensional features.
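To make the projection concrete, the following is a minimal sketch of this step (an illustration rather than the authors' code; the fc7 features are assumed to be precomputed, with one feature vector per image or per representative video frame):

```python
import numpy as np

def lsh_codes(features, n_bits=512, seed=0):
    """Data-independent LSH: binary codes C = (F W > 0) via random projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((features.shape[1], n_bits))  # W in R^{d x b}
    return (features @ W > 0).astype(np.uint8)            # n x b code matrix

# Toy usage: five 4096-D "fc7" vectors -> 512-bit codes.
F = np.random.rand(5, 4096).astype(np.float32)
print(lsh_codes(F).shape)  # (5, 512)
```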

Fig.2. Video clip captioning flow.

Fig.3. Frames extracted from video clips. The main ideas to be expressed are clear in the frames.

PCA-ITQ. PCA-ITQ [15] is one of the most popular unsupervised hashing methods. Different from LSH, it is optimized with the data features to get better performance. PCA-ITQ contains two steps: 1) utilizing principal component analysis (PCA) to reduce the feature dimensions, and 2) utilizing ITQ to optimize the features by shifting and rotating. As shown in [15], we would get a transform matrix T for the PCA reduction and a transform matrix R for the ITQ operation. Therefore, the weight matrix of PCA-ITQ is W = TR, and the hash codes can be calculated by C = (FW > 0) as well.
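A compact sketch of these two steps is given below (illustrative only; the rotation update is the orthogonal-Procrustes step of ITQ as described in [15], and the input is again assumed to be a precomputed feature matrix):

```python
import numpy as np

def pca_itq(F, n_bits=64, n_iter=50, seed=0):
    """PCA-ITQ sketch: PCA projection T, then an ITQ rotation R; returns W = T R."""
    rng = np.random.default_rng(seed)
    Fc = F - F.mean(axis=0)                       # center the training features
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    T = Vt[:n_bits].T                             # d x b: top principal directions
    V = Fc @ T                                    # PCA-reduced data
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))  # random orthogonal init
    for _ in range(n_iter):                       # alternate codes and rotation
        B = np.sign(V @ R)                        # fix R, update binary codes
        U, _, Pt = np.linalg.svd(V.T @ B)         # fix B: orthogonal Procrustes update
        R = U @ Pt
    return T @ R                                  # codes are then (F_centered @ W > 0)
```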

SDH. The recently proposed supervised discrete hashing (SDH) [16] is a well-performed supervised hashing method, and we utilize SDH as our video hashing module in this paper. SDH learns the binary codes and hash functions through the following problem:

min_{B,W,F} \sum_{i=1}^{n} \|y_i - W^T b_i\|^2 + \lambda\|W\|^2 + \nu \sum_{i=1}^{n} \|b_i - H(x_i)\|^2,  s.t. b_i \in \{-1, 1\}^L.

That is,

min_{B,W,F} \|Y - W^T B\|^2 + \lambda\|W\|^2 + \nu\|B - H(X)\|^2,  s.t. B \in \{-1, 1\}^{L \times n}.

Here H(·) is the hash function to be learned, λ and ν are penalty parameters, X = {x_i}_{i=1}^{n} is the data matrix, Y = {y_i}_{i=1}^{n} ∈ {0,1}^{C×n} is the ground-truth label matrix, W = {w_k}_{k=1}^{C} ∈ R^{L×C} is the classification matrix, and B = {b_i}_{i=1}^{n} ∈ {−1,1}^{L×n} holds the learned binary codes, where the i-th column b_i is the binary code for x_i. The key of SDH is that the binary code optimization is solved not by resorting to continuous relaxations but directly by discrete optimization. In SDH, the optimization proceeds with three alternating steps: the W-step, the F-step and the B-step [16]. With the obtained hash function H(·), our hash codes can be generated as C = (H(X) > 0).

3.3 Video Captioning

In this part, we describe how to generate captions for videos using hashing and auxiliary images. The overall flowchart is illustrated in Fig.2, and it is comprised of three major components. Preliminarily, we transform all the training images and videos into hash codes using the learned hash functions; all the hash codes can be directly stored in main memory. Given a target video v, we go through the following steps.

Step 1. We first export the representative frame and extract the fc7 layer of VGG as the visual feature. Then, we convert the visual feature to hash codes, denoted as q, using the learned hash functions.

Step 2. Given q, we search the whole database of binary codes to obtain similar samples. Note that the similarity between two samples is calculated according to the Hamming distance. Given the code of a retrieved sample p ∈ {0,1}^L, the Hamming distance of q and p is defined as

hd(q, p) = \sum_{l=1}^{L} XOR(q_l, p_l),

where q_l and p_l are the l-th elements of q and p, respectively. The k elements with the smallest Hamming distances are taken as the similar samples.

Step 3. Finally, we collect the captions of all the retrieved samples to form a candidate caption set, denoted as C. To generate the desired caption for the target, we first find a subset of size m, denoted as M ⊆ C, and then take the most representative caption c, i.e., the one most similar to the elements in M. Following [17], we take the BLEU-4 [52] score as our caption similarity, and our consensus caption is formally generated by

c* = \arg\max_{c \in C} \max_{M \subseteq C} \sum_{c' \in M} Sim(c, c'),

where Sim(·, ·) computes the similarity of two captions.
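Steps 2 and 3 can be summarized in a short sketch (hypothetical helper names; the codes are assumed to be stored bit-packed as uint8 arrays, and `sim` stands for the BLEU-4 caption similarity, for which any off-the-shelf implementation could be substituted):

```python
import numpy as np

def hamming_topk(q, codes, k=50):
    """Step 2: indices of the k nearest codes to query q under Hamming distance.
    q: (n_bytes,) uint8; codes: (N, n_bytes) uint8, bit-packed."""
    dists = np.unpackbits(np.bitwise_xor(codes, q), axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

def consensus_caption(candidates, sim, m=10):
    """Step 3: return the caption most similar to its m closest peer captions."""
    best, best_score = None, float("-inf")
    for i, c in enumerate(candidates):
        sims = sorted((sim(c, c2) for j, c2 in enumerate(candidates) if j != i),
                      reverse=True)
        score = sum(sims[:m])        # the best size-m subset M for this caption
        if score > best_score:
            best, best_score = c, score
    return best

# Sketched usage, assuming corpus_codes, captions and a bleu4() helper exist:
# idx = hamming_topk(query_code, corpus_codes, k=50)
# result = consensus_caption([c for i in idx for c in captions[i]], sim=bleu4)
```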
4 Experiments

In this section, we discuss the performance of our proposed approach applied to video captioning. The experiments contain two steps. In the first step, we improve the image retrieval captioning approach and compare the effects of different kinds of hashing methods and the kNN method. In the second step, we apply our video captioning approach and illustrate the interesting results. We score the approaches in the experiments by BLEU [52], METEOR [53], CIDEr [54] and ROUGE [55], which are calculated with the MSCOCO evaluation server.

4.1 Datasets

Our experiments are designed on images and videos. Therefore, the image and video datasets shown below are selected to illustrate the multiple aspects of our approach.

MSCOCO. It is an image captioning dataset provided by Microsoft Corporation. Each image has at least five captions. The training set, validation set and test set have been separated, and an evaluation system is provided for evaluating results uniformly. There are in total 82,783 images in the training set (one of them is damaged), 40,504 images in the validation set and 40,775 in the test set. The test set is confidential for us; therefore we take the validation set as our test data.

YouTube2Text (Y2T). It is a user-generated video dataset with various clean and noisy captions. There are in total 1970 videos and … captions, of which … are English captions and … are cleaned English captions. Statistically, the average length of the videos is 9 s, and most of them can be described using one sentence. Following [44, 56-57], we take 1300 videos as the training samples and the other 670 videos as the test data.

In our experiments, the official evaluation system of MSCOCO is used to score our results. The MSCOCO dataset is used to evaluate our image captioning approach, and the Y2T dataset is used to evaluate our video captioning approach; MSCOCO is also used as an extended corpus for video captioning. The numbers of n-grams in the mentioned corpora are shown in Table 1. Obviously, the n-gram counts in the image corpus are much larger than those in the video corpus; using the image corpus may be a way to extend the video corpus.

Table 1. Statistics of n-grams in Y2T and MSCOCO

Dataset              1-Gram  2-Gram  3-Gram  4-Gram  Total
Y2T                  …       …       …       …       …
MSCOCO Training      …       …       …       …       …
MSCOCO Validation    …       …       …       …       …
Total (unique)       …       …       …       …       …
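For illustration, per-order counts like those in Table 1 can be gathered with a few lines (a toy sketch on whitespace-tokenized captions; the paper does not specify its tokenization):

```python
from collections import Counter

def ngram_counts(captions, n):
    """Count n-grams over whitespace-tokenized, lower-cased captions."""
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

corpus = ["a cat is sitting on the table", "a dog plays with a ball"]
for n in range(1, 5):
    print(n, len(ngram_counts(corpus, n)))  # number of unique n-grams per order
```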

4.2 Corpus Codes

In the first step of our experiments, the visual elements are converted into binary codes using VGG16 and hashing methods. We explore the performance of typical hashing methods including the data-independent hashing method LSH, the unsupervised hashing method PCA-ITQ, and the supervised hashing method SDH. To evaluate the performance on a large-scale corpus, we report the results using 512-bit hash codes and 64-bit hash codes. Using 512-bit hash codes costs only 64 bytes of memory per image, so storing all the images costs less than 10 MB of memory, and using 64-bit hash codes costs much less. In contrast, taking the 4096-dimensional fc7 layer of VGG16 as an image feature costs 16 KB per image and nearly 2 GB in total, which is about 256 times more than using 512-bit codes. The saved storage grows even more pronounced on larger-scale datasets. Table 2 demonstrates the memory cost on our test datasets: 4096-D indicates the features using 4096 floats, such as the original feature exported from AlexNet, while 512-bit and 64-bit indicate the features using 512 bits and 64 bits, respectively. Obviously, the compressed features are much smaller than the original features.

Table 2. Memory Requirements Using Different Sizes of Features for Visual Elements in Y2T and MSCOCO

Dataset    4096-D     512-Bit    64-Bit
Y2T        31 MB      124.0 KB   16 KB
MSCOCO     1894 MB    7.4 MB     1 MB
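These storage figures follow from simple arithmetic; the snippet below is a quick check (a sketch; the exact image count behind Table 2 is an assumption based on the MSCOCO totals given elsewhere in the paper):

```python
# Back-of-the-envelope check of the storage claims above.
n_images = 123287                  # assumed: MSCOCO training + validation images
fc7     = n_images * 4096 * 4      # 4096 floats, 4 bytes each
bits512 = n_images * 512 // 8      # 512 bits = 64 bytes per image
bits64  = n_images * 64 // 8       # 64 bits = 8 bytes per image

for name, size in [("4096-D", fc7), ("512-bit", bits512), ("64-bit", bits64)]:
    print(f"{name:8s}{size / 2**20:10.1f} MB")
# 4096-D ~1926 MB, 512-bit ~7.5 MB, 64-bit ~0.9 MB: the ~256x gap cited above.
```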
4.3 Similar Image Retrieval

Fig.4 demonstrates the similar-image retrieval results using kNN, LSH, PCA-ITQ and SDH. From the five most similar images, we can hardly judge which method is better: the hashing methods get candidate images as effective as kNN's. Therefore, hashing methods work well in retrieving similar images, while costing much less execution time, as shown in Fig.5. With an increasing amount of data, kNN costs more and more time compared with the hashing methods. When there are about 10^6 images, using kNN costs about 10 minutes; in contrast, the hashing methods still cost only microseconds. Obviously, after our improvements, the retrieval step is far more appropriate for large-scale datasets.

4.4 Image Captioning Result

To explore the effects of image captioning using kNN, LSH, PCA-ITQ and SDH, we design several tests on the MSCOCO dataset. Following [17], we take half of the validation set (20,252 images) as tuning data and the remainder as test data, and we score the results using the MSCOCO evaluation system. There are two variables, k and m, for image captioning; we utilize the tuning data to explore the best k and m for our approaches. Fig.6 shows the effect of the k-m parameters. Generally, the variances of the different approaches are similar. Table 3 shows that the retrieval approaches using hashing methods perform as well as those using the kNN method. The scores are based on BLEU [52], METEOR [53], CIDEr [54] and ROUGE [55]. This demonstrates that our proposed image captioning approach is effective and efficient; because of the lower memory and time cost, our approach can be used on much larger datasets. The results also illustrate that although our approaches using different hashing methods have similar performance, there are still fine distinctions: the supervised method performs better than the unsupervised method, and the unsupervised method is better than the data-independent method, as shown by the descending order of scores. Besides, we compare the 512-bit codes with the 64-bit codes. As shown in Table 3, using 64-bit codes, SDH still performs well, but LSH performs much worse. When resources are limited, using 64-bit SDH must be a better choice.

4.5 Video Captioning Result

Considering that some frame in a video can denote the main contents of the video, we select one of the frames

Fig.4. Similar images retrieved by SDH, kNN, LSH, and PCA-ITQ.

as the video's indicator. Here we take the first frames and the middle frames of the videos for comparison. We take the test set of Y2T as the ground truth, and the entire MSCOCO dataset and the Y2T training set as our labeled corpora, respectively. Table 4 presents the results using our SDH retrieval captioning approach. This simple and even rude method performs as well as the approach proposed in [57]. The results also demonstrate that the first frames and the middle frames perform similarly, and both contain the main video contents.

Fig.5. Time cost for corpora with X images (curves: the 4096-D CNN fc7 feature vs. the 512-bit hashing code).

For more detail, we list some captioned videos in Fig.7 to present the captioning results. Obviously, the generated captions are related to the query videos, and the inaccuracies are mainly caused by the deficiency of the related corpus. The captions for the videos in Fig.7(a) use "puppy" and "dog" to describe the same animal; even though both of them are correct, the outer corpus obtains a lower score. Most situations similar to the videos in Fig.7(b) stand for a party; thus the caption based on MSCOCO describes some details of a party, which causes a low score. In the videos of Fig.7(c), the focuses of the two corpora are quite different: Y2T describes the process while MSCOCO describes the status. Fig.7(d) demonstrates that only when there are enough phrases in MSCOCO may it perform better than Y2T.

Generally, the Y2T training dataset is a better training dataset than the MSCOCO dataset for captioning the Y2T test data, as demonstrated in Table 4, even though there are fewer descriptions in the Y2T corpus. The BLEU-1 scores present that the results on MSCOCO contain similar information, but the much lower BLEU-4 scores present that the phrases and sentence structures in MSCOCO are quite different. With an increasing corpus size, the phrases and sentences would cover most of the grammar, and this problem would diminish. Moreover, MSCOCO is not large enough, so some image types are deficient for our application. Fig.8 shows two unmatched samples. The first one is a girl putting some sticker on her face: there are 464 captions in MSCOCO related to "sticker", but only three captions also relate to "woman", "women" or "girl". The second sample presents a similar problem: the captions including the words "cat" and "turtle" are listed there. Besides, the results are limited by the extracted features. Although VGG16 provides almost the best

Fig.6. Captioning scores while selecting the consensus caption from subsets of the captions for the similar images. (a) BLEU-4. (b) ROUGE_L. (c) METEOR. (d) CIDEr.

Table 3. Scores of Our Approaches with Different Hashing Methods on MSCOCO Validation

Approach                 BLEU-1  BLEU-2  BLEU-3  BLEU-4  CIDEr  METEOR  ROUGE
MS kNN [17]              …       …       …       …       …      …       …
512-bit  Ours-LSH        …       …       …       …       …      …       …
512-bit  Ours-PCA-ITQ    …       …       …       …       …      …       …
512-bit  Ours-SDH        …       …       …       …       …      …       …
64-bit   Ours-LSH        …       …       …       …       …      …       …
64-bit   Ours-PCA-ITQ    …       …       …       …       …      …       …
64-bit   Ours-SDH        …       …       …       …       …      …       …

Table 4. Results of the Video Retrieval Captioning Approach on Y2T Using the First Frame and the Middle Frame

Frame   Corpus   BLEU-1  BLEU-2  BLEU-3  BLEU-4  CIDEr  METEOR  ROUGE
First   Y2T      …       …       …       …       …      …       …
First   MSCOCO   …       …       …       …       …      …       …
Middle  Y2T      …       …       …       …       …      …       …
Middle  MSCOCO   …       …       …       …       …      …       …
Thomason et al. [57]     …       …       …       …       …      …       …

Fig.7. Captions for four sample videos generated by our approaches with the Y2T and MSCOCO corpora. (a) Y2T: a puppy plays with a ball. MSCOCO: a dog laying on the floor in a room. (b) Y2T: a man is playing a guitar onstage. MSCOCO: a group of people standing around and dancing. (c) Y2T: someone is slicing bread into slices. MSCOCO: a plate of food on a table. (d) Y2T: a man is riding a horse bareback. MSCOCO: a man riding on the back of an elephant.

features for image classification, it lacks salient points. When we look for a panda appearing in a clip, there is only one panda image among the 10 most similar images; most of the retrieval results focus on the background, and the rest are related to the color. To eliminate the effect of the hashing methods, we also retrieve the k most similar images using kNN, and the results shown in Fig.9 confirm our discovery.

Fig.8. No appropriate captions found for the query clip. The blue caption is a user-generated caption for the clip. The red ones are all the captions related to the keywords.

Fig.9. Images similar to a panda image.

Finally, we explore the relation between the corpus size and the captioning results. We integrate the training set and the validation set of MSCOCO and get 123,287 images in total. Then we design a 13-step experiment (s1-s13): at the first step s1, we take 3,287 images with their captions as our corpus; at each of the following steps we append 10,000 more images to the corpus; at the last step s13 we use all 123,287 images. We then score the results at each step. Fig.10 presents that as the corpus size increases from s1 to s13, our approach gets a higher and higher score. Obviously, a large-scale corpus does help in video captioning.

5 Conclusions

In this paper, we proposed a novel video captioning approach to describe videos by leveraging freely-available image corpora with abundant literal knowledge. Hashing techniques are utilized to significantly improve both computation and storage efficiency. The results demonstrated that our approach is both effective and efficient, and our simple captioning approach adapts to large-scale datasets. The volume of multimedia data on the Internet is explosively increasing; since noisy data disturbs our approach, how to clean large-scale data, integrate it, and use other auxiliary features may be the next topic.

Fig.10. Scores of video captioning using different scales of the MSCOCO corpus, for the first frame and the middle frame. (a) BLEU-1. (b) BLEU-4. (c) METEOR. (d) CIDEr.

References

[1] Song Y, Tang J H, Liu F, Yan S C. Body surface context: A new robust feature for action recognition from depth videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6).

[2] Qi G J, Hua X S, Rui Y, Tang J H, Mei T, Zhang H J. Correlative multi-label video annotation. In Proc. the 15th

12 Xiao-Yu Du et al.: Captioning Videos Using Large-Scale Iage Corpus 49 ACM International Conference on Multiedia, Sept. 2007, pp [3] Chen J W, Cui Y, Ye G N, Liu D, Chang S F. Event-driven seantic concept discovery by exploiting wealy tagged internet iages. In Proc. International Conference on Multiedia Retrieval, Apr [4] Tang J H, Shu X B, Qi G J, Li Z C, Wang M, Yan S C, Jain R. Tri-clustered tensor copletion for social-aware iage tag refineent. IEEE Transactions on Pattern Analysis and Machine Intelligence, 206, doi: 0.09/TPAMI [5] Tang J H, Shu X B, Li Z C, Qi G J, Wang J D. Generalized deep transfer networs for nowledge propagation in heterogeneous doains. ACM Transactions on Multiedia Coputing, Counications, and Applications, 206, 2(4s): Article No. 68. [6] Li Z C, Tang J H. Wealy supervised deep atrix factorization for social iage understanding. IEEE Transactions on Iage Processing, 207, 26(): [7] Li Z C, Liu J, Tang J H, Lu H Q. Robust structured subspace learning for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 205, 37(0): [8] Yang Y, Zhang H W, Zhang M X, Shen F M, Li X L. Visual coding in a seantic hierarchy. In Proc. the 23rd ACM International Conference on Multiedia, Oct. 205, pp [9] Yatsar M, Zettleoyer L, Farhadi A. Situation recognition: Visual seantic role labeling for iage understanding. In Proc. IEEE Conference on Coputer Vision and Pattern Recognition, Jun pp [0] Yang Y, Zha Z J, Gao Y, Zhu X F, Chua T S. Exploiting web iages for seantic video indexing via robust saple-specific loss. IEEE Transactions on Multiedia, 204, 6(6): [] Yang Y, Yang Y, Shen H T. Effective transfer tagging fro iage to video. ACM Transactions on Multiedia Coputing, Counications, and Applications, 203, 9(2): Article No. 4. [2] Li Z C, Liu J, Yang Y, Zhou X F, Lu H Q. Clusteringguided sparse structural learning for unsupervised feature selection. IEEE Transactions on Knowledge and Data Engineering, 204, 26(9): [3] Wang J D, Shen H T, Song J K, Ji J Q. Hashing for siilarity search: A survey. arxiv: , Apr [4] Tang J H, Li Z C, Wang M, Zhao R Z. Neighborhood discriinant hashing for large-scale iage retrieval. IEEE Transactions on Iage Processing, 205, 24(9): [5] Gong Y C, Lazebni S. Iterative quantization: A procrustean approach to learning binary codes. In Proc. IEEE Conference on Coputer Vision and Pattern Recognition, Jun. 20 pp [6] Shen F M, Shen C H, Liu W, Shen H T. Supervised discrete hashing. In Proc. IEEE Conference on Coputer Vision and Pattern Recognition, Jun. 205, pp [7] Devlin J, Gupta S, Girshic R, Mitchell M, Zitnic C L. Exploring nearest neighbor approaches for iage captioning. arxiv preprint arxiv: , Apr [8] Krizhevsy A, Sutsever I, Hinton G E. Iagenet classification with deep convolutional neural networs. In Proc. the 25th International Conference on Neural Inforation Processing Systes, Dec. 202, pp [9] Zhu Z, Liang D, Zhang S H, Huang X L, Li B, Hu S M. Traffic-sign detection and classification in the wild. In Proc. IEEE Conference on Coputer Vision and Pattern Recognition, Jun. 206, pp [20] Miolov T, Karafiát M, Burget L, Černocý J, Khudanpur S. Recurrent neural networ based language odel. In Proc. the th Annual Conference of the International Speech Counication Association, Sep. 200, pp [2] Song J, Tang S L, Xiao J, Wu F, Zhang Z F. LSTM-in- LSTM for generating long descriptions of iages. Coputational Visual Media, 206, 2(4): [22] Jia Y Q, Shelhaer E, Donahue J, Karayev S, Long J, Girshic R, Guadarraa S, Darrell T. 
Caffe: Convolutional architecture for fast feature embedding. In Proc. the 22nd ACM International Conference on Multimedia, Nov. 2014.

[23] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp.1-9.

[24] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, Apr. 2017.

[25] Razavian A S, Azizpour H, Sullivan J, Carlsson S. CNN features off-the-shelf: An astounding baseline for recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014.

[26] Norris J R. Markov Chains. Cambridge University Press, 1998.

[27] Moon T K. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 1996, 13(6).

[28] Berger A L, Pietra V J D, Pietra S A D. A maximum entropy approach to natural language processing. Computational Linguistics, 1996, 22(1).

[29] Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models. In Proc. the 31st International Conference on Machine Learning, Jun. 2014.

[30] Wu Q, Shen C H, van den Hengel A, Liu L Q, Dick A. Image captioning with an intermediate attributes layer. arXiv:…, Apr. 2017.

[31] Gao H Y, Mao J H, Zhou J, Huang Z H, Wang L, Xu W. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv:1505.05612, Apr. 2017.

[32] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, Apr. 2017.

[33] Chen X L, Fang H, Lin T Y, Vedantam R, Gupta S, Dollár P, Zitnick C L. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, Apr. 2017.

[34] Ordonez V, Kulkarni G, Berg T L. Im2Text: Describing images using 1 million captioned photographs. In Proc. Neural Information Processing Systems, Dec. 2011, pp.1143-1151.

[35] Charikar M S. Similarity estimation techniques from rounding algorithms. In Proc. the 34th Annual ACM Symposium on Theory of Computing, May 2002.

[36] Shen F M, Zhou X, Yang Y, Song J K, Shen H T, Tao D C. A fast optimization method for general binary code learning. IEEE Transactions on Image Processing, 2016, 25(12).

[37] Yang Y, Luo Y S, Chen W L, Shen F M, Shao J, Shen H T. Zero-shot hashing via transferring supervised knowledge. In Proc. ACM Conference on Multimedia, Oct. 2016.

[38] Shen F M, Shen C H, Shi Q F, van den Hengel A, Tang Z M, Shen H T. Hashing on nonlinear manifolds. IEEE Transactions on Image Processing, 2015, 24(6).

[39] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. In Proc. the 25th International Conference on Very Large Data Bases, Sept. 1999.

[40] Yang Y, Shen F M, Shen H T, Li H X, Li X L. Robust discrete spectral hashing for large-scale image semantic indexing. IEEE Transactions on Big Data, 2015, 1(4).

[41] Song J K, Yang Y, Yang Y, Huang Z, Shen H T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proc. ACM SIGMOD International Conference on Management of Data, Jun. 2013.

[42] Strecha C, Bronstein A, Bronstein M, Fua P. LDAHash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(1).

[43] Luo Y D, Yang Y, Shen F M, Huang Z, Zhou P, Shen H T. Robust discrete code modeling for supervised hashing. Pattern Recognition, 2017, doi: 10.1016/j.patcog….

[44] Ballas N, Yao L, Pal C, Courville A. Delving deeper into convolutional networks for learning video representations. arXiv:1511.06432, Apr. 2017.

[45] Mazloom M, Li X R, Snoek C G M. TagBook: A semantic video representation without supervision for event detection. arXiv:1510.02899v2, Apr. 2017.

[46] Lowe D G. Object recognition from local scale-invariant features. In Proc. the 7th IEEE International Conference on Computer Vision, Sep. 1999, pp.1150-1157.

[47] Xu M, Duan L Y, Cai J F, Chia L T, Xu C S, Tian Q. HMM-based audio keyword generation. In Proc. Pacific-Rim Conference on Multimedia, Dec. 2004.

[48] Shetty R, Laaksonen J. Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv:1512.02949, Apr. 2017.

[49] Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proc. IEEE International Conference on Computer Vision, Dec. 2015.

[50] Pan P B, Xu Z W, Yang Y, Wu F, Zhuang Y T. Hierarchical recurrent neural encoder for video representation with application to captioning. arXiv:1511.03476, Apr. 2017.

[51] Sener O, Zamir A R, Savarese S, Saxena A. Unsupervised semantic parsing of video collections. In Proc. IEEE International Conference on Computer Vision, Dec. 2015.

[52] Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In Proc. the 40th Annual Meeting of the Association for Computational Linguistics, Jul. 2002.

[53] Denkowski M, Lavie A. Meteor Universal: Language specific translation evaluation for any target language. In Proc. the 9th Workshop on Statistical Machine Translation, Apr. 2014.

[54] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, Apr. 2017.

[55] Lin C Y. ROUGE: A package for automatic evaluation of summaries. In Proc. the ACL-04 Workshop on Text Summarization Branches Out, Jul. 2004.

[56] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K.
YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proc. IEEE International Conference on Computer Vision, Dec. 2013.

[57] Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R J. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proc. the 25th International Conference on Computational Linguistics, Aug. 2014.

Xiao-Yu Du is currently a lecturer in the School of Software Engineering of Chengdu University of Information Technology, Chengdu, and a Ph.D. candidate at the University of Electronic Science and Technology of China, Chengdu. He received his M.E. degree in computer software and theory in 2011 and his B.S. degree in computer science and technology in 2008, both from Beijing Normal University, Beijing. His research interests include multimedia analysis and retrieval, computer vision, and machine learning.

Yang Yang is currently with the University of Electronic Science and Technology of China, Chengdu. He joined the National University of Singapore (NUS) as a research fellow, working with Prof. Tat-Seng Chua. He was conferred his Ph.D. degree in information technology from the University of Queensland (UQ) in 2012. He obtained his Master's degree in computer science in 2009 from Peking University, Beijing, and his Bachelor's degree in computer science in 2006 from Jilin University, Changchun. His research interests mainly focus on multimedia search, social media analysis, and machine learning.

Liu Yang is a staff member of Sichuan University West China Hospital of Stomatology, Chengdu. He is pursuing his Master's degree in the School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu. His research interests include pattern recognition and computer vision.

Fu-Min Shen is currently a lecturer in the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu. He received his Ph.D. degree from the School of Computer Science, Nanjing University of Science and Technology, Nanjing, in 2014. With the support of the China Scholarship Council from Apr. 2011 to Nov. 2012, he visited ACVT (Australian Centre for Visual Technology) and the School of Computer Science, University of Adelaide, Australia. He received his Bachelor's degree in applied mathematics from Shandong University, Jinan.

Zhi-Guang Qin received his Ph.D. degree from the School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, in 1996. He is currently a full professor of the School of Information and Software Engineering at UESTC, Chengdu. His research interests include computer networking, information security, cryptography, and pattern analysis.

Jin-Hui Tang is a professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing. He received his B.E. degree in communication engineering and Ph.D. degree in signal and information processing from the University of Science and Technology of China, Hefei, in 2003 and 2008 respectively. His current research interests include multimedia search and computer vision. He has authored over 60 journal and conference papers in these areas. He is a co-recipient of the Best Student Paper Award in MMM 2016, the Best Paper Finalist in ACM MM 2015, and Best Paper Awards in ACM MM 2007, PCM 2011 and ICIMCS 2011. He is a senior member of CCF and IEEE, and a member of ACM.


More information

Toll Pricing. Computational Tests for Capturing Heterogeneity of User Preferences. Lan Jiang and Hani S. Mahmassani

Toll Pricing. Computational Tests for Capturing Heterogeneity of User Preferences. Lan Jiang and Hani S. Mahmassani Toll Pricing Coputational Tests for Capturing Heterogeneity of User Preferences Lan Jiang and Hani S. Mahassani Because of the increasing interest in ipleentation and exploration of a wider range of pricing

More information

Keywords: meta-epidemiology; randomised trials; heterogeneity; Bayesian methods; Cochrane

Keywords: meta-epidemiology; randomised trials; heterogeneity; Bayesian methods; Cochrane Label-invariant odels for the analysis of eta-epideiological data KM Rhodes 1, D Mawdsley, RM Turner 1,3, HE Jones 4, J Savović 4,5, JPT Higgins 4 1 MRC Biostatistics Unit, School of Clinical Medicine,

More information

Comparison of Two Approaches for Direct Food Calorie Estimation

Comparison of Two Approaches for Direct Food Calorie Estimation Comparison of Two Approaches for Direct Food Calorie Estimation Takumi Ege and Keiji Yanai Department of Informatics, The University of Electro-Communications, Tokyo 1-5-1 Chofugaoka, Chofu-shi, Tokyo

More information

Identification of Consumer Adverse Drug Reaction Messages on Social Media

Identification of Consumer Adverse Drug Reaction Messages on Social Media Association for Inforation Systes AIS Electronic Library (AISeL) PACIS 2013 Proceedings Pacific Asia Conference on Inforation Systes (PACIS) 6-18-2013 Identification of Consuer Adverse Drug Reaction Messages

More information

Direct in situ measurement of specific capacitance, monolayer tension, and bilayer tension in a droplet interface bilayer

Direct in situ measurement of specific capacitance, monolayer tension, and bilayer tension in a droplet interface bilayer Electronic Suppleentary Material (ESI) for Soft Matter. This journal is The Royal Society of Cheistry 2015 Taylor et al. Electronic Supporting Inforation Direct in situ easureent of specific capacitance,

More information

Application of Factor Analysis on Academic Performance of Pupils

Application of Factor Analysis on Academic Performance of Pupils Aerican Journal of Applied Matheatics and Statistics, 07, Vol. 5, No. 5, 64-68 Available online at http://pubs.sciepub.co/aas/5/5/ Science and Education Publishing DOI:0.69/aas-5-5- Application of Factor

More information

Development of new muscle contraction sensor to replace semg for using in muscles analysis fields

Development of new muscle contraction sensor to replace semg for using in muscles analysis fields Loughborough University Institutional Repository Developent of new uscle contraction sensor to replace semg for using in uscles analysis fields This ite was subitted to Loughborough University's Institutional

More information

THYROID SEGMENTATION IN ULTRASOUND IMAGES USING SUPPORT VECTOR MACHINE

THYROID SEGMENTATION IN ULTRASOUND IMAGES USING SUPPORT VECTOR MACHINE International Journal of Neural Networks and Applications, 4(1), 2011, pp. 7-12 THYROID SEGMENTATION IN ULTRASOUND IMAGES USING SUPPORT VECTOR MACHINE D. Selvathi 1 and V. S. Sharnitha 2 Mepco Schlenk

More information

Dendritic Inhibition Enhances Neural Coding Properties

Dendritic Inhibition Enhances Neural Coding Properties Dendritic Inhibition Enhances Neural Coding Properties M.W. Spratling and M.H. Johnson Centre for Brain and Cognitive Developent, Birkbeck College, London, UK The presence of a large nuber of inhibitory

More information

Design of Palm Acupuncture Points Indicator

Design of Palm Acupuncture Points Indicator Design of Palm Acupuncture Points Indicator Wen-Yuan Chen, Shih-Yen Huang and Jian-Shie Lin Abstract The acupuncture points are given acupuncture or acupressure so to stimulate the meridians on each corresponding

More information

Liu Jing and Liu Jing Diagnosis System in Classical TCM Discussions of Six Divisions or Six Confirmations Diagnosis System in Classical TCM Texts

Liu Jing and Liu Jing Diagnosis System in Classical TCM Discussions of Six Divisions or Six Confirmations Diagnosis System in Classical TCM Texts Liu Jing and Liu Jing Diagnosis System in Classical TCM Discussions of Six Divisions or Six Confirmations Diagnosis System in Classical TCM Texts Liu Jing Bian Zheng system had developed about 1800 years

More information

Acoustic Signal Processing Based on Deep Neural Networks

Acoustic Signal Processing Based on Deep Neural Networks Acoustic Signal Processing Based on Deep Neural Networks Chin-Hui Lee School of ECE, Georgia Tech chl@ece.gatech.edu Joint work with Yong Xu, Yanhui Tu, Qing Wang, Tian Gao, Jun Du, LiRong Dai Outline

More information

The Level of Participation in Deliberative Arenas

The Level of Participation in Deliberative Arenas The Level of Participation in Deliberative Arenas Autoria: Leonardo Secchi, Fabrizio Plebani Abstract This paper ais to contribute to the discussion on levels of participation in collective decision-aking

More information

Multi-attention Guided Activation Propagation in CNNs

Multi-attention Guided Activation Propagation in CNNs Multi-attention Guided Activation Propagation in CNNs Xiangteng He and Yuxin Peng (B) Institute of Computer Science and Technology, Peking University, Beijing, China pengyuxin@pku.edu.cn Abstract. CNNs

More information

Medical Knowledge Attention Enhanced Neural Model. for Named Entity Recognition in Chinese EMR

Medical Knowledge Attention Enhanced Neural Model. for Named Entity Recognition in Chinese EMR Medical Knowledge Attention Enhanced Neural Model for Named Entity Recognition in Chinese EMR Zhichang Zhang, Yu Zhang, Tong Zhou College of Computer Science and Engineering, Northwest Normal University,

More information

Physical Activity Training for

Physical Activity Training for Physical Activity Training for Functional Mobility in Older Persons To Hickey Fredric M. Wolf Lynne S. Robins University of Michigan Marilyn B. Wagner Cleveland State University Wafa Harik Case Western

More information

Strain-rate Dependent Stiffness of Articular Cartilage in Unconfined Compression

Strain-rate Dependent Stiffness of Articular Cartilage in Unconfined Compression L. P. Li* Biosyntech Inc., 475 Arand-Frappier Blvd., Park of Science and High Technology Laval, Quebec, Canada H7V 4B3 M. D. Buschann Departent of Cheical Engineering and Institute of Bioedical Engineering,

More information

OBESITY EPIDEMICS MODELLING BY USING INTELLIGENT AGENTS

OBESITY EPIDEMICS MODELLING BY USING INTELLIGENT AGENTS OBESITY EPIDEMICS MODELLING BY USING INTELLIGENT AGENTS Agostino G. Bruzzone, MISS DIPTEM University of Genoa Eail agostino@iti.unige.it - URL www.iti.unige.i t Vera Novak BIDMC, Harvard Medical School

More information

A TWO-DIMENSIONAL THERMODYNAMIC MODEL TO PREDICT HEART THERMAL RESPONSE DURING OPEN CHEST PROCEDURES

A TWO-DIMENSIONAL THERMODYNAMIC MODEL TO PREDICT HEART THERMAL RESPONSE DURING OPEN CHEST PROCEDURES A TWO-DIMENSIONAL THERMODYNAMIC MODEL TO PREDICT HEART THERMAL RESPONSE DURING OPEN CHEST PROCEDURES F. G. Dias, J. V. C. Vargas, and M. L. Brioschi Universidade Federal do Paraná Departaento de Engenharia

More information

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering SUPPLEMENTARY MATERIALS 1. Implementation Details 1.1. Bottom-Up Attention Model Our bottom-up attention Faster R-CNN

More information

Deep Learning based Information Extraction Framework on Chinese Electronic Health Records

Deep Learning based Information Extraction Framework on Chinese Electronic Health Records Deep Learning based Information Extraction Framework on Chinese Electronic Health Records Bing Tian Yong Zhang Kaixin Liu Chunxiao Xing RIIT, Beijing National Research Center for Information Science and

More information

Policy Trap and Optimal Subsidization Policy under Limited Supply of Vaccines

Policy Trap and Optimal Subsidization Policy under Limited Supply of Vaccines olicy Trap and Optial Subsidization olicy under Liited Supply of Vaccines Ming Yi 1,2, Achla Marathe 1,3 * 1 Networ Dynaics and Siulation Science Laboratory, VBI, Virginia Tech, Blacsburg, Virginia, United

More information

Genome-wide association study of esophageal squamous cell carcinoma in Chinese subjects identifies susceptibility loci at PLCE1 and C20orf54

Genome-wide association study of esophageal squamous cell carcinoma in Chinese subjects identifies susceptibility loci at PLCE1 and C20orf54 CORRECTION NOTICE Nat. Genet. 42, 759 763 (2010); published online 22 August 2010; corrected online 27 August 2014 Genome-wide association study of esophageal squamous cell carcinoma in Chinese subjects

More information

PathGAN: Visual Scanpath Prediction with Generative Adversarial Networks

PathGAN: Visual Scanpath Prediction with Generative Adversarial Networks PathGAN: Visual Scanpath Prediction with Generative Adversarial Networks Marc Assens 1, Kevin McGuinness 1, Xavier Giro-i-Nieto 2, and Noel E. O Connor 1 1 Insight Centre for Data Analytic, Dublin City

More information

Deep Compression and EIE: Efficient Inference Engine on Compressed Deep Neural Network

Deep Compression and EIE: Efficient Inference Engine on Compressed Deep Neural Network Deep Compression and EIE: Efficient Inference Engine on Deep Neural Network Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally Stanford University Our Prior Work: Deep

More information

CRITICAL REVIEW OF EPILEPTIC PREDICTION MODEL USING EEG

CRITICAL REVIEW OF EPILEPTIC PREDICTION MODEL USING EEG CRITICAL REVIEW OF EPILEPTIC PREDICTION MODEL USING EEG Anju Shaikh 1 and Dr.Mukta Dhopeshwarkar 2 1 Assistant Professor,Departent Of Coputer Science & IT,Deogiri College, Aurangabad, India. 2 Assistant

More information

Outcome measures in palliative care for advanced cancer patients: a review

Outcome measures in palliative care for advanced cancer patients: a review Journal of Public Health Medicine Vol. 19, No. 2, pp. 193-199 Printed in Great Britain Outcoe easures in for advanced cancer s: a review Julie Hearn and Irene J. Higginson Suary Inforation generated using

More information

Transient Evoked Otoacoustic Emissions and Pseudohypacusis

Transient Evoked Otoacoustic Emissions and Pseudohypacusis A Acad Audiol 6 : 293-31 (1995) Transient Evoked Otoacoustic Eissions and Pseudohypacusis Frank E. Musiek* Steven P. Bornsteint Willia F. Rintelann$ Abstract The audiologic diagnosis of pseudohypacusis

More information

The Roles of Beliefs, Information, and Convenience. in the American Diet

The Roles of Beliefs, Information, and Convenience. in the American Diet The Roles of Beliefs, Inforation, and Convenience in the Aerican Diet Selected Paper Presented at the AAEA Annual Meeting 2002 Long Beach, July 28 th -31 st Lisa Mancino PhD Candidate University of Minnesota

More information

Learning to Disambiguate by Asking Discriminative Questions Supplementary Material

Learning to Disambiguate by Asking Discriminative Questions Supplementary Material Learning to Disambiguate by Asking Discriminative Questions Supplementary Material Yining Li 1 Chen Huang 2 Xiaoou Tang 1 Chen Change Loy 1 1 Department of Information Engineering, The Chinese University

More information

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS 1 Manimegalai. C and 2 Prakash Narayanan. C manimegalaic153@gmail.com and cprakashmca@gmail.com 1PG Student and 2 Assistant Professor, Department

More information

LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES. Felix Sun, David Harwath, and James Glass

LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES. Felix Sun, David Harwath, and James Glass LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES Felix Sun, David Harwath, and James Glass MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA {felixsun,

More information

Reader s Emotion Prediction Based on Partitioned Latent Dirichlet Allocation Model

Reader s Emotion Prediction Based on Partitioned Latent Dirichlet Allocation Model Reader s Emotion Prediction Based on Partitioned Latent Dirichlet Allocation Model Ruifeng Xu, Chengtian Zou, Jun Xu Key Laboratory of Network Oriented Intelligent Computation, Shenzhen Graduate School,

More information

Vector Learning for Cross Domain Representations

Vector Learning for Cross Domain Representations Vector Learning for Cross Domain Representations Shagan Sah, Chi Zhang, Thang Nguyen, Dheeraj Kumar Peri, Ameya Shringi, Raymond Ptucha Rochester Institute of Technology, Rochester, NY 14623, USA arxiv:1809.10312v1

More information

An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns

An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns 1. Introduction Vasily Morzhakov, Alexey Redozubov morzhakovva@gmail.com, galdrd@gmail.com Abstract Cortical

More information

Social Image Captioning: Exploring Visual Attention and User Attention

Social Image Captioning: Exploring Visual Attention and User Attention sensors Article Social Image Captioning: Exploring and User Leiquan Wang 1 ID, Xiaoliang Chu 1, Weishan Zhang 1, Yiwei Wei 1, Weichen Sun 2,3 and Chunlei Wu 1, * 1 College of Computer & Communication Engineering,

More information

Chapter 4: Nutrition. Teacher s Guide

Chapter 4: Nutrition. Teacher s Guide Chapter 4: Nutrition Teacher s Guide 55 Chapter 4: Nutrition Teacher s Guide Learning Objectives Students will explain two ways that nutrition affects health Students will describe the function of 5 iportant

More information

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil CSE 5194.01 - Introduction to High-Perfomance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

More information

Deep Networks and Beyond. Alan Yuille Bloomberg Distinguished Professor Depts. Cognitive Science and Computer Science Johns Hopkins University

Deep Networks and Beyond. Alan Yuille Bloomberg Distinguished Professor Depts. Cognitive Science and Computer Science Johns Hopkins University Deep Networks and Beyond Alan Yuille Bloomberg Distinguished Professor Depts. Cognitive Science and Computer Science Johns Hopkins University Artificial Intelligence versus Human Intelligence Understanding

More information

AUC Optimization vs. Error Rate Minimization

AUC Optimization vs. Error Rate Minimization AUC Optiization vs. Error Rate Miniization Corinna Cortes and Mehryar Mohri AT&T Labs Research 180 Park Avenue, Florha Park, NJ 0793, USA {corinna, ohri}@research.att.co Abstract The area under an ROC

More information

DIET QUALITY AND CALORIES CONSUMED: THE IMPACT OF BEING HUNGRIER, BUSIER AND EATING OUT

DIET QUALITY AND CALORIES CONSUMED: THE IMPACT OF BEING HUNGRIER, BUSIER AND EATING OUT Working Paper 04-02 The Food Industry Center University of Minnesota Printed Copy $25.50 DIET QUALITY AND CALORIES CONSUMED: THE IMPACT OF BEING HUNGRIER, BUSIER AND EATING OUT Lisa Mancino and Jean Kinsey

More information

Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN) Convolutional Neural Networks (CNN) Algorithm and Some Applications in Computer Vision Luo Hengliang Institute of Automation June 10, 2014 Luo Hengliang (Institute of Automation) Convolutional Neural Networks

More information

An original approach to the diagnosis of scolineinduced

An original approach to the diagnosis of scolineinduced J. clin Path., 1972, 25, 422-426 An original approach to the diagnosis of scolineinduced apnoea A. FSHTAL, R. T. EVANS, AND C. N. CHAPMAN Fro the Departent ofpathology, Southead General Hospital, Bristol

More information

Chair for Computer Aided Medical Procedures (CAMP) Seminar on Deep Learning for Medical Applications. Shadi Albarqouni Christoph Baur

Chair for Computer Aided Medical Procedures (CAMP) Seminar on Deep Learning for Medical Applications. Shadi Albarqouni Christoph Baur Chair for (CAMP) Seminar on Deep Learning for Medical Applications Shadi Albarqouni Christoph Baur Results of matching system obtained via matching.in.tum.de 108 Applicants 9 % 10 % 9 % 14 % 30 % Rank

More information

arxiv: v3 [stat.ml] 27 Mar 2018

arxiv: v3 [stat.ml] 27 Mar 2018 ATTACKING THE MADRY DEFENSE MODEL WITH L 1 -BASED ADVERSARIAL EXAMPLES Yash Sharma 1 and Pin-Yu Chen 2 1 The Cooper Union, New York, NY 10003, USA 2 IBM Research, Yorktown Heights, NY 10598, USA sharma2@cooper.edu,

More information

AIDS Epidemiology. Min Shim Math 164: Scientific Computing. April 30, 2004

AIDS Epidemiology. Min Shim Math 164: Scientific Computing. April 30, 2004 AIDS Epideiology Min Shi Math 64: Scientiic Coputing April 30, 004 Abstract Thopson s AIDS epideic odel, which is orulated in his article AIDS: The Misanageent o an Epideic, published in Great Britain,

More information

ASBESTOSIS AND PRIMARY INTRATHORACIC NEOPLASMS

ASBESTOSIS AND PRIMARY INTRATHORACIC NEOPLASMS ASBESTOSIS AND PRIMARY INTRATHORACIC NEOPLASMS Willia D. Buchanan Ministry of Labour, London, Gwat Britain The purpose of this paper is to describe one aspect of a study, which has been continued over

More information

Video Saliency Detection via Dynamic Consistent Spatio- Temporal Attention Modelling

Video Saliency Detection via Dynamic Consistent Spatio- Temporal Attention Modelling AAAI -13 July 16, 2013 Video Saliency Detection via Dynamic Consistent Spatio- Temporal Attention Modelling Sheng-hua ZHONG 1, Yan LIU 1, Feifei REN 1,2, Jinghuan ZHANG 2, Tongwei REN 3 1 Department of

More information

Parameter Identification using δ Decisions for Hybrid Systems in Biology

Parameter Identification using δ Decisions for Hybrid Systems in Biology CMACS/AVACS Worshop Paraeter Identification using δ Decisions for Hbrid Sstes in Biolog Bing Liu Joint wor with Soonho Kong, Sean Gao, Ed Clare Biological Sste Bolins et al. SIGGRAPH, 26 2 Coputational

More information

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS SEOUL Oct.7, 2016 DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS Gunhee Kim Computer Science and Engineering Seoul National University October

More information

Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language

Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language Muhammad Faisal Information Technology University Lahore m.faisal@itu.edu.pk Abstract Human lip-reading is a challenging task.

More information

arxiv: v2 [cs.cv] 19 Dec 2017

arxiv: v2 [cs.cv] 19 Dec 2017 An Ensemble of Deep Convolutional Neural Networks for Alzheimer s Disease Detection and Classification arxiv:1712.01675v2 [cs.cv] 19 Dec 2017 Jyoti Islam Department of Computer Science Georgia State University

More information

Recurrent Neural Networks

Recurrent Neural Networks CS 2750: Machine Learning Recurrent Neural Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2017 One Motivation: Descriptive Text for Images It was an arresting face, pointed of chin,

More information

Age Estimation based on Multi-Region Convolutional Neural Network

Age Estimation based on Multi-Region Convolutional Neural Network Age Estimation based on Multi-Region Convolutional Neural Network Ting Liu, Jun Wan, Tingzhao Yu, Zhen Lei, and Stan Z. Li 1 Center for Biometrics and Security Research & National Laboratory of Pattern

More information

Research on Digital Testing System of Evaluating Characteristics for Ultrasonic Transducer

Research on Digital Testing System of Evaluating Characteristics for Ultrasonic Transducer Sensors & Transducers 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com Research on Digital Testing System of Evaluating Characteristics for Ultrasonic Transducer Qin Yin, * Liang Heng, Peng-Fei

More information

A Comparison of Poisson Model and Modified Poisson Model in Modelling Relative Risk of Childhood Diabetes in Kenya

A Comparison of Poisson Model and Modified Poisson Model in Modelling Relative Risk of Childhood Diabetes in Kenya Aerican Journal of Theoretical and Applied Statistics 2018; 7(5): 193-199 http://www.sciencepublishinggroup.co/j/ajtas doi: 10.11648/j.ajtas.20180705.15 ISSN: 2326-8999 (Print); ISSN: 2326-9006 (Online)

More information

Investigation of Binaural Interference in Normal-Hearing and Hearing-Impaired Adults

Investigation of Binaural Interference in Normal-Hearing and Hearing-Impaired Adults J A Acad Audiol 11 : 494-500 (2000) Investigation of Binaural Interference in Noral-Hearing and Hearing-Ipaired Adults Rose L. Allen* Brady M. Schwab* Jerry L. Cranford* Michael D. Carpenter* Abstract

More information

Perceived similarity and visual descriptions in content-based image retrieval

Perceived similarity and visual descriptions in content-based image retrieval University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2007 Perceived similarity and visual descriptions in content-based image

More information