Visual-Verbal Consistency of Image Saliency

2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff Center, Banff, Canada, October 5-8, 2017

Visual-Verbal Consistency of Image Saliency

Haoran Liang, Ming Jiang, Ronghua Liang and Qi Zhao
College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
Department of Computer Science and Engineering, University of Minnesota, USA
(This work was done when Haoran Liang was a visiting student in the Zhao Lab.)

Abstract: When looking at an image, humans shift their attention towards interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem, we look into eye fixations and image captions, two types of subjective annotations that are relatively task-free and natural. From the annotations, we extract visual and verbal saliency ranks to compare against each other. We then propose a number of low-level and semantic-level features relevant to the visual-verbal consistency. Integrated into a computational model, the proposed features effectively predict the consistency between the two modalities on a large dataset with both types of annotations, namely SALICON [1].

Keywords: visual saliency, image caption, visual-verbal consistency, correlation

I. INTRODUCTION

Generating properly formed image captions based on the understanding of image contents is an important and challenging task in computer vision and natural language processing. Progress in this task has great impacts that can benefit various real-life applications such as traffic navigation [2], robotics [3] and education [4]. To better describe an image, particularly in a cluttered scene, it is of great importance to capture the key elements in the image instead of describing everything. Previous studies reveal that visual attention is a helpful proxy for perceiving importance in images. Visual attention is a bottleneck mechanism that allows only a small portion of the visual input to reach higher-level processing units. It breaks down scene understanding into a sequence of localized visual analysis problems. We hypothesize that patterns in image captions strongly rely on what people regard as important. On the other hand, annotations of visual attention offer a natural ranking from humans while free-viewing the scenes. Thus, it is of interest to understand the consistency between how people view images and how they describe them. Further, we observe that the level of visual-verbal consistency can vary despite the overall high correlation between the two modalities (see the examples in Fig. 1). Natural questions to follow are then: what image features make the two modalities more or less consistent, and whether the consistency is predictable from image features.

Fig. 1. Examples of high-consistency (blue box) and low-consistency (orange box) patterns between how people look at and describe images. For each case, the gray-scale map on the left indicates visual saliency, and the one on the right indicates verbal saliency.

In this work, we study the relationship between the words used in image captions and the regions where people look in natural images. First, we explicitly define verbal saliency as the importance of the words in a sentence and report the correlation between verbal saliency and visual saliency. Second, we propose a number of features based on image and object information and quantify their effects on the consistency between visual saliency and verbal saliency. Finally, we show the effectiveness of the features in predicting the visual-verbal consistency using a support vector regression (SVR).

II. RELATED WORK

A. Image Saliency

The computational model of visual attention was first introduced by Itti et al. [5]. Specifically, they proposed to use a set of feature maps from three complementary channels: intensity, color, and orientation. The normalized feature maps from each channel were then linearly combined to generate the overall saliency map. Based on this, many other researchers have since suggested various models that can be categorized by the algorithms used, e.g., Bayesian models [6], graphical models [7], spectral analysis models [8] and pattern classification models [9]. In addition to the early features, high-level image semantics have been shown to be much more relevant to image saliency [10]. Therefore, deep neural networks have been proposed to reduce the semantic gap [11], [12], by naturally integrating hierarchical features for saliency prediction.
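To make the linear-combination scheme of [5] concrete, the following minimal sketch normalizes a few hand-crafted channel maps and averages them into a single saliency map. It is an illustration only, not the original implementation: the channels are reduced to intensity plus two simplified opponent-color maps, the orientation channel and multi-scale center-surround pyramids are omitted, and equal channel weights are assumed.

```python
import numpy as np

def normalize(feature_map):
    """Rescale a feature map to [0, 1]; constant maps become all zeros."""
    fmin, fmax = feature_map.min(), feature_map.max()
    if fmax - fmin < 1e-12:
        return np.zeros_like(feature_map)
    return (feature_map - fmin) / (fmax - fmin)

def itti_style_saliency(image):
    """Crude stand-in for an Itti-style model: per-channel maps are
    normalized and linearly combined with equal weights.
    `image` is an HxWx3 float array in [0, 1]."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    intensity = (r + g + b) / 3.0
    red_green = r - g                  # simplified opponent-color channels
    blue_yellow = b - (r + g) / 2.0
    channels = [intensity, red_green, blue_yellow]
    return np.mean([normalize(c) for c in channels], axis=0)
```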

Fig. 2. Overview of the method to generate visual saliency maps and verbal saliency maps. See Sec. III for details.

B. Verbal Saliency and Image Captioning

Few works have focused on image importance [13] and specificity [14] so far. Supported by deep neural network models, methods for image-sentence retrieval [15] and image captioning [16] have been developing rapidly. Many works [17], [18] using a multi-modal recurrent neural network (RNN) achieve state-of-the-art performance for both image-sentence retrieval and image captioning. Recent models [19], [20], [21] suggest that attention-related information improves image captioning performance.

The proposed work is the first that explicitly examines the correlation and consistency of visual saliency and image description on a large dataset. It also distinguishes itself from earlier models in that it directly explores features that contribute to the consistency between how people view and describe an image. Our work further verifies the possibility of predicting visual-verbal consistency with computational algorithms.

III. OVERVIEW

Studying the consistency between visual saliency and verbal saliency requires an image dataset with captions, object annotations, and saliency annotations. SALICON [1] is selected in this work as it is currently the largest public dataset that meets this requirement. We could also use algorithms such as R-CNN [22] to detect objects in future studies. Given an image with object annotations, to measure its visual-verbal consistency, we obtain saliency values from fixations and captions, assign them to the objects, and compare them using Spearman's rank correlation ρ. That is, for image I with N objects, we denote each object as o_i, where i ∈ {1, 2, ..., N}. Two types of saliency values, V(o_i) and W(o_i), are computed for o_i from attentional data and image captions, respectively. The visual-verbal consistency is computed as

ρ = Spearman(V(I), W(I)),   (1)

where V(I) = {V(o_1), V(o_2), ..., V(o_N)} and W(I) = {W(o_1), W(o_2), ..., W(o_N)}. Fig. 2 demonstrates the approaches of computing visual saliency from human fixations and verbal saliency from image captions.

A. Visual Saliency

The saliency map obtained from attentional data can be directly applied to measure the importance of objects in an image. In particular, given an input image and its fixation map, the saliency of each object in the scene is computed by taking the maximal value of the fixation map within the object's segmentation mask, i.e., V(o_i) = max(P_i), where P_i denotes the set of saliency values (obtained using mouse-tracking in [1]) inside the ith object mask.
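The per-object visual saliency V(o_i) and the consistency measure of Eq. (1) can be sketched as follows. This is a simplified illustration under the assumption that object masks are available as boolean arrays; scipy.stats.spearmanr stands in for the Spearman rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def object_visual_saliency(fixation_map, masks):
    """V(o_i): maximum of the fixation/saliency map inside each object mask."""
    return np.array([fixation_map[mask].max() for mask in masks])

def visual_verbal_consistency(visual_vals, verbal_vals):
    """Eq. (1): Spearman's rank correlation between V(I) and W(I)."""
    rho, _ = spearmanr(visual_vals, verbal_vals)
    return rho
```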
B. Verbal Saliency

In order to know which objects are described and where they are, we propose a mapping method between words in a caption and objects in an image. In particular, we use the Natural Language Toolkit (NLTK) [23] to parse all the captions into a vocabulary set of nouns, with the help of the Stanford Log-linear Part-of-Speech Tagger [24]. We use a simple WordNet-based measure of semantic distance [25] to find which of the 80 object categories these words belong to. We initialize the saliency value for each image from its corresponding 5 captions with two strategies, namely cardinality-based and sequence-based approaches.

The cardinality-based approach measures the importance of each word based on its frequency of occurrence. In order to weight the count of each word into floating-point values, the importance (also denoted as the initial saliency value W_init(o_i)) is computed using term frequency-inverse document frequency (TF-IDF) with the scikit-learn software package [26]. Words that are rare in the corpus but occur frequently in a sentence contribute more to the importance. Let J_i be the number of times that the word for object o_i occurs in the k sentences; the term frequency of the object is calculated as W_init(o_i) = J_i / Σ_{i=1}^{N} J_i.

The sequence-based approach captures and highlights the order of each word in a sentence. We denote the number of categories in an image as K, and the order of a word (object o_i) in a sentence as q_i; e.g., q_i = 1 means that the word comes first in the sentence. Its weight is then K − q_i, which is higher if a word comes at the beginning of a sentence and lower otherwise. Therefore, for object o_i in the k sentences, the initial saliency value is W_init(o_i) = Σ (K − q_i), summed over the k sentences.

These two approaches provide reasonable initializations of the verbal saliency values. Yet ambiguities still exist. For example, with multiple people in a scene, we cannot tell whom "the man" in the captions refers to. Further, the mappings from words to objects are not one-to-one. For example, "the vehicle" may refer to either a car or a truck when both co-exist in an image. To approach this problem, we propose a method based on psychophysical studies. In particular, we borrow the observations from [13] to adjust the verbal saliency map by considering compositional features relating to objects, i.e., size and location.
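The two initialization schemes can be sketched roughly as below. This is a simplified illustration: plain whitespace tokenization stands in for the NLTK/POS-tagging pipeline, and the cardinality variant uses the raw term-frequency form above rather than the full scikit-learn TF-IDF weighting.

```python
from collections import Counter

def cardinality_init(object_words, captions):
    """Cardinality-based W_init(o_i) = J_i / sum_j J_j, with J_i the number
    of times the word for object o_i occurs across the image's captions."""
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        for word in object_words:
            counts[word] += tokens.count(word)
    total = sum(counts.values()) or 1
    return {word: counts[word] / total for word in object_words}

def sequence_init(object_words, captions, K):
    """Sequence-based W_init(o_i): sum of (K - q_i) over the captions, where
    q_i is the order in which the object's word appears among the object
    words of a caption (q_i = 1 for the first-mentioned object)."""
    weights = {word: 0.0 for word in object_words}
    for caption in captions:
        tokens = caption.lower().split()
        mentioned = sorted((w for w in object_words if w in tokens),
                           key=tokens.index)
        for q, word in enumerate(mentioned, start=1):
            weights[word] += max(K - q, 0.0)
    return weights
```

In both functions a word that never appears simply keeps a zero weight, which matches the intent that undescribed objects receive no verbal saliency before the size/location adjustment described next.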

Fig. 3. (a, b) The effects of location and size on both visual saliency and verbal saliency. (c) The distribution of Spearman correlation scores across the 10,000 SALICON training images.

Visual attention is well known to have a center bias and a preference for dominant objects, and image captions show a similar tendency. Fig. 3 (a) and (b) display the effects of location and size on description probability for the training set of SALICON. With these observations, we adjust the saliency values of objects with an additional term T(·) = S(·) · (1 − L(·)), where S(·) is the normalized size and L(·) is the normalized distance to the image center (see details in Sec. IV-B). The final value is calculated as W(o_i) = W_init(o_i) · T(o_i), which is used as the ground truth for verbal saliency.

In order to investigate the inter-subject consistency in image captioning, for each image in the training set we use one of the five captions and the remaining four to generate two verbal saliency maps and calculate their Spearman correlation, resulting in a human consistency of ρ = .865. By comparing the ground truth between visual saliency and verbal saliency, we obtain ρ = .435 and ρ = .439 (ρ = .361 and ρ = .368 without adjusting for size and location) for the cardinality-based and sequence-based schemes respectively; the distribution over all the images is shown in Fig. 3 (c).

IV. FEATURES AND STATISTICAL ANALYSIS

In this section, we report the statistics of the dataset (see Sec. IV-A) and propose a number of potential features for estimating the consistency between visual saliency and verbal saliency, including low-level features (i.e., size, location, density) and semantic-level features (i.e., categories). After that, we quantitatively evaluate how each feature impacts the visual-verbal consistency (see Sec. IV-B), and show the inclinations people have towards different categories (see Sec. IV-C).

A. Dataset Composition

We first count the number of each part-of-speech (POS) in the image descriptions for the 10,000 training images. Each word is converted to its basic form before the count to make sure each element in our vocabulary list is unique. Nouns (431) stand out as the dominant part of the descriptions, followed by verbs (1927) and adjectives (164). Numerals (55) and prepositions (54) occur quite frequently, but they are less relevant to the image content. Verbs and adjectives apparently relate to the perception of importance and seem worth consideration. However, they are used to describe attributes that eventually lead us back to the particular objects they derive from. Hence, we only consider nouns in the following analysis.

B. Low-Level Features

The average correlation scores of ρ = .435 and .439 for the cardinality and sequence methods show a significant correlation between visual saliency and verbal saliency. To investigate this, we propose three low-level and two semantic-level features extracted from all the objects, based on the object annotations of the image. In particular, given an image I with N objects, let o_i be the ith object. We define the low-level features, namely size, location and density, as follows:

Size: the size S(o_i) of an object is measured as the number of pixels in its object mask, normalized by the image resolution: S(o_i) = (size of o_i) / (image size). Thus, each object has a rounded size value ranging from .1 to 1., resulting in a 10-dimensional vector that encodes the numbers of objects of different sizes.

Location: the center coordinate of an object's bounding box is used as its location. In our feature setting, we use the relative location L(o_i) of an object, defined as its distance to the image center: L(o_i) = 2 · dist(o_i, image center) / image diagonal. Similarly, with all the distances ranging from .1 to 1., the descriptor is a 10-dimensional vector that encodes the numbers of objects at different locations.

Density: object density captures the objects' distances to each other. From the segmentation masks of all the objects, the object density is computed as D(o_i) = (Σ_j dist(o_i, o_j) / image diagonal) / N, where dist(o_i, o_j) denotes the Euclidean distance between the ith and jth objects. After normalization, we use a 20-dimensional vector to encode the numbers of objects with different densities from .05 to 1. with a step of .05.

Next, we independently investigate the contribution of each low-level feature to the visual-verbal consistency. Across all training images, we compute the Pearson's linear correlation coefficients between the feature values and the visual-verbal consistency scores, for both the cardinality-based and sequence-based approaches. In this analysis, a positive correlation suggests that the corresponding feature channel contributes positively to the visual-verbal consistency, and vice versa. The results are shown in Fig. 4.

First, as observed from the results, images with medium-sized objects (s-.2 to s-.5) have particularly more consistent visual saliency and verbal saliency. These feature channels contribute significantly in both the cardinality and sequence cases. The downward trend of feature contribution from s-.2 to s-.9 demonstrates that visual saliency and verbal saliency become less similar when large objects are present. Most likely, large objects are in the background and visually less salient, but are important for describing the context of a scene.
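The three descriptors can be computed per image roughly as in the sketch below, which bins the per-object values into the histogram channels shown in Fig. 4. The bin layout (10 size bins, 10 location bins, 20 density bins) and the use of boolean masks with pixel-coordinate object centers are assumptions of this illustration.

```python
import numpy as np

def low_level_descriptors(masks, centers, image_shape):
    """Binned size (10-d), location (10-d) and density (20-d) descriptors
    for one image. `masks` are boolean HxW arrays, `centers` are (x, y)
    object centers in pixels, `image_shape` is (H, W)."""
    H, W = image_shape
    diag = np.hypot(H, W)
    img_center = np.array([W / 2.0, H / 2.0])
    centers = np.asarray(centers, dtype=float)

    sizes = np.array([m.sum() / float(H * W) for m in masks])          # S(o_i)
    locs = 2.0 * np.linalg.norm(centers - img_center, axis=1) / diag   # L(o_i)
    # D(o_i): mean normalized distance to the other objects in the image
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    dens = dists.sum(axis=1) / (diag * max(len(masks), 1))

    size_hist, _ = np.histogram(sizes, bins=10, range=(0.0, 1.0))
    loc_hist, _ = np.histogram(locs, bins=10, range=(0.0, 1.0))
    dens_hist, _ = np.histogram(dens, bins=20, range=(0.0, 1.0))
    return size_hist, loc_hist, dens_hist
```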

Fig. 4. Correlations of the size (s-), location (l-) and density (d-) features for both cardinality-based and sequence-based verbal saliency. The asterisks on top of the bars denote significant correlations with p < .01. Feature channels with all-zero values are skipped.

Second, images with objects close to the center (l-.1) are significantly more consistent between visual saliency and verbal saliency. As objects get farther from the image center (l-.2 to l-.5), they contribute negatively to the visual-verbal consistency. However, when images contain more objects near the edges (l-.7 to l-.9), probably because a large dominant object occupies the central region, the saliency ranks become more consistent.

Finally, images with low object density (d-.35 to d-.6) are more consistent between visual saliency and verbal saliency, while images with high object density (d-.15 to d-.25) are less consistent. In other words, sparse image contents receive more similar descriptions, while cluttered ones do not.

C. Semantic-Level Features

There is psychophysical evidence that human observers tend to fixate on semantic categories such as people and animals. With this group of features, we examine whether the presence of different categories affects the way people describe an image. We sort the noun list by how many times each noun appears in the image captions, in descending order. We notice that the top nouns are quite common in daily life, e.g., man, dog. For each noun, we look for all the images that contain the described object along with the corresponding captions. Given a noun mapped to an object, we compute its probability of being mentioned by dividing the number of captions that contain this noun or its synonym by the total number of sentences. Note that summative words are not considered in this case; e.g., "a lot of dishes on the table" does not mean an egg (if it exists) is mentioned, since we focus on sibling terms such as "ship" for "boat". For the probability of fixation, we simply count the number of fixations inside the object mask and perform a cut-off at a threshold θ to exclude noise. In our experiments, θ is set to 1 to 3 depending on the object size.

We randomly choose the nouns listed in Table I from the most frequent entries of the vocabulary set to investigate these features, from which several observations can be drawn. Fixations cover more content than image captions, as almost all the probabilities from fixations are higher. Although such probabilities have been studied in previous work [13], what we find new is that humans have quite similar tendencies towards living things, both in visual attention and in image captions.

TABLE I. Probabilities of object categories being mentioned in image captions and fixated during free viewing (columns: category, mentioned, fixated). The categories include cow, cat, dog, zebra, hat, potato, furniture, foot, chip, car, baseball, hand, number, soup, chair, steel, stop, controller, suitcase, lamp, bowl, egg, shower, bottle, computer, truck, pizza, frisbee, and broccoli.
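The mention and fixation statistics reported in Table I can be estimated roughly as sketched below. This is a simplified illustration: the synonym set is assumed to be supplied by the caller (e.g., from WordNet), and the fixation side returns a per-image indicator that would still need to be aggregated over images and observers to yield the probabilities in the table.

```python
def mention_probability(noun, synonyms, captions):
    """Fraction of an image's captions that mention the noun or a synonym."""
    terms = {noun} | set(synonyms)
    hits = sum(any(term in caption.lower().split() for term in terms)
               for caption in captions)
    return hits / float(len(captions))

def is_fixated(fixations, mask, theta):
    """1.0 if the number of fixations (x, y) inside the object mask exceeds
    the noise threshold theta, else 0.0."""
    count = sum(1 for (x, y) in fixations if mask[int(y), int(x)])
    return 1.0 if count > theta else 0.0
```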
In contrast, we observe different patterns for inanimate objects. A considerable number of inanimate objects that attract a certain level of attention are less likely to be described, such as containers (bottle, bowl) and salient object parts (hand, foot). We further observe that image captions become less specific when more categories are present. In particular, if only one or two objects are present, people tend to mention the exact name. As the number of objects increases, people are more likely to summarize the scene, in which case super-categories appear frequently. For example, "food on a table" is rarely used to describe an image showing only a plate of carrots, but often for a table with many dishes.

V. PREDICTING THE VISUAL-VERBAL CONSISTENCY

The impact of the features on visual-verbal consistency motivates us to investigate whether the consistency is predictable and how the features contribute to the prediction. In this section, we describe the experimental setup and provide the details of the feature settings along with the model (see Sec. V-A). We show the experimental results on the SALICON dataset (see Sec. V-B), with intuitions gained to better understand the patterns of image viewing and description.

A. Computational Modeling

Given an image I, our goal is to automatically predict the consistency between visual saliency and verbal saliency. Based on the observations above, we take into consideration the size, location, density, and category in our model. The image feature F(I) is represented by concatenating the different feature vectors as follows:

F(I) = {C, C_s, S, L, D} ∈ R^l,   (2)

where C is an 80-dimensional vector representing object categories; C_s similarly denotes the 12-dimensional vector for super-categories; each element in C and C_s denotes the

TABLE II. Performance (mean squared error) of features at predicting the correlation, for both cardinality-based and sequence-based ground truth (MSE and STD reported for each). Feature types: Size; Location; Density; Size + Location + Density; Category; Super-category; Category + super-category; All combined; AlexNet [29].

Fig. 5. Example images with low consistency (left), medium consistency (middle), and high consistency (right). Numbers indicate the cardinality-based visual-verbal consistency our model predicts.

number of a certain object category in the image; and S, L and D are 10-, 10- and 20-dimensional vectors encoding the size, distance to center and density of the objects, respectively. The above features form an l-dimensional (l = 132) feature vector for each image. We employ a support vector regressor (SVR, [27]) as our predictive model.

B. Experimental Results and Discussions

In order to validate the effectiveness of the features used in our experiment, we conduct the same process as described in Sec. III on the validation set of 5,000 images, and use the Spearman correlation scores as their ground-truth labels. We train SVRs [28] with Gaussian kernels to predict the visual-verbal consistency, using a grid search to select the cost and kernel hyper-parameters. We report the results for both standalone and combined features, using the two different sources (cardinality and sequence) as ground truth. We also include a deep convolutional neural network as an additional model for prediction. For this, we employ AlexNet [29] and extract features from the layer just before the final classification layer (often referred to as fc7), resulting in a feature dimension of 4,096. The performances of all models are shown in Table II.

It can be seen that the combined features perform considerably better (MSE = .157 and .1568) than the standalone features. The two schemes of verbal saliency generation show similar performance, with the sequence-based method performing slightly better. Moreover, we observe that the object categories, followed by the super-categories, play the most significant roles in the prediction of consistency, whereas the low-level features are less useful. This suggests that verbal description patterns are more related to the semantics of the image than to early vision.

What affects the visual-verbal consistency? We first look at the image content to unearth consistent patterns that contribute to visual-verbal consistency. Example images predicted by the SVR using combined features are shown in Fig. 5, ranked by predicted consistency score in ascending order. For the most consistent cases in Fig. 5 (right), our model finds relatively clean images with few objects and usual activities. The content of these images is quite clear at first sight and can be easily summarized. The objects in these images are highly likely to be common in daily life and are consistently mentioned in the image captions, e.g., dogs, cars, and people. Also, these objects are quite likely to occur with a dominant size (.2 < s < .5) near the center of the image (l < .2), attracting gaze at an early stage.

Images in Fig. 5 (middle) are more complicated scenes with 5 or more objects. In these cases, while people still have consistent preferences in visual saliency, the image captions vary in sequence. A typical case can be found in the middle of Fig. 6. As the number of objects increases, there are a variety of choices to describe the scene, so the weight of each being mentioned drops. As a result, different things are chosen at the beginning of each sentence.

As observed in Table I, in most cases humans and animals are both visually and verbally salient. However, there exist several cases in which inanimate things are more attractive at first sight. Since animate things mostly occur at the very beginning of captions, the visual saliency in these cases correlates negatively with the verbal saliency. Fig. 5 (left) demonstrates these rare cases, most of which contain objects more salient than the humans and animals present.

Finally, we investigate whether the length of the image captions leads to more variability and hence less visual-verbal consistency. The low correlation between the average caption length (measured as the number of words in the sentence) and the consistency (ρ = .06) shows that caption length has no obvious impact on the visual-verbal consistency. We then check the effect of variation in caption length and find that images with high consistency scores (ρ < −.4 or ρ > .4) usually have less variation in caption length (see Fig. 6).

Fig. 6. Different patterns in what people look at and describe. The three images are selected from each set in Fig. 5.
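A minimal sketch of the prediction pipeline is given below: it concatenates the descriptors into F(I) and fits an RBF-kernel SVR with a small grid search over the cost and kernel hyper-parameters. It uses scikit-learn's SVR rather than LIBSVM [28], and the grid values and cross-validation setup are placeholder assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def build_feature_vector(cat_hist, supercat_hist, size_hist, loc_hist, dens_hist):
    """F(I) = {C, C_s, S, L, D}: concatenation of the category, super-category,
    size, location and density descriptors (132-d in total here)."""
    return np.concatenate([cat_hist, supercat_hist, size_hist, loc_hist, dens_hist])

def train_consistency_regressor(X, y):
    """Fit an RBF-kernel SVR to predict the visual-verbal consistency score
    (the Spearman rho used as ground truth), selecting the cost C and the
    kernel width gamma by grid search."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1.0]}
    search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    return search.best_estimator_
```

Given a trained model, the consistency of a held-out image would then be estimated with model.predict(F.reshape(1, -1)).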

Overall, visual-verbal consistency is correlated with image content to a considerable extent. The proposed features and regressor are able to predict the consistency effectively. Yet there are a number of failure cases caused by missing annotations, suggesting further updates of the dataset.

VI. CONCLUSIONS

In this work, we characterize aspects of images and captions relating to the visual-verbal consistency of how people explore and describe an image. We investigate this question on a large-scale dataset, namely SALICON. We extract saliency values of objects from eye-tracking data and captions, based on which we propose several low-level and semantic-level features such as size, location, density, and category. Further, we analyze their relative contributions to the visual-verbal consistency. Finally, we propose a computational model using a support vector regressor (SVR) to predict the consistency between visual saliency and verbal saliency. We envision that understanding the visual-verbal consistency of image saliency could inspire exciting and far-reaching applications in computer vision and answer related questions in human vision.

REFERENCES

[1] M. Jiang, S. Huang, J. Duan, and Q. Zhao, "SALICON: Saliency in context," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[2] H. Tae-Hyun, J. In-Hak, and C. Seong-Ik, "Detection of traffic lights for vision-based car navigation system," in Advances in Image and Video Technology, 2006.
[3] C. Thorpe, M. Hebert, T. Kanade, and S. Shafer, Vision and Navigation for the Carnegie Mellon Navlab. Springer.
[4] W. Harwin, A. Ginige, and R. Jackson, "A potential application in early education and a possible role for a vision system in a workstation based robotic aid for the physically disabled," in Interactive Robotic Aids - One Option for Independent Living: An International Perspective, Monograph, vol. 37.
[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.
[6] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, "SUN: A Bayesian framework for saliency using natural statistics," Journal of Vision, vol. 8, no. 7, p. 32, 2008.
[7] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems, 2006.
[8] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[9] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the IEEE International Conference on Computer Vision, 2009.
[10] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao, "Predicting human gaze beyond pixels," Journal of Vision, vol. 14, no. 1, 2014.
[11] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, "Predicting eye fixations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[12] X. Huang, C. Shen, X. Boix, and Q. Zhao, "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[13] A. C. Berg, T. L. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos et al., "Understanding and predicting importance in images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[14] M. Jas and D. Parikh, "Image specificity," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," Transactions of the Association for Computational Linguistics, vol. 2, 2014.
[16] A. Karpathy, A. Joulin, and F. Li, "Deep fragment embeddings for bidirectional image sentence mapping," in Advances in Neural Information Processing Systems, 2014.
[17] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[18] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," arXiv preprint, 2014.
[19] H. Xu and K. Saenko, "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering," in European Conference on Computer Vision. Springer, 2016.
[20] C. Liu, J. Mao, F. Sha, and A. Yuille, "Attention correctness in neural image captioning," arXiv preprint, 2016.
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, 2015.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[23] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
[24] K. Toutanova and C. D. Manning, "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger," in Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, 2000.
[25] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[27] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.


More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES Varinthira Duangudom and David V Anderson School of Electrical and Computer Engineering, Georgia Institute of Technology Atlanta, GA 30332

More information

Development of novel algorithm by combining Wavelet based Enhanced Canny edge Detection and Adaptive Filtering Method for Human Emotion Recognition

Development of novel algorithm by combining Wavelet based Enhanced Canny edge Detection and Adaptive Filtering Method for Human Emotion Recognition International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 12, Issue 9 (September 2016), PP.67-72 Development of novel algorithm by combining

More information

Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis

Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis Chi Zhang 1, Weiqiang Wang 1, 2, and Xiaoqian Liu 1 1 School of Computer and Control Engineering, University of

More information

Categories. Represent/store visual objects in terms of categories. What are categories? Why do we need categories?

Categories. Represent/store visual objects in terms of categories. What are categories? Why do we need categories? Represen'ng Objects Categories Represent/store visual objects in terms of categories. What are categories? Why do we need categories? Grouping of objects into sets where sets are called categories! Categories

More information

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection Y.-H. Wen, A. Bainbridge-Smith, A. B. Morris, Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection, Proceedings of Image and Vision Computing New Zealand 2007, pp. 132

More information

Hierarchical Convolutional Features for Visual Tracking

Hierarchical Convolutional Features for Visual Tracking Hierarchical Convolutional Features for Visual Tracking Chao Ma Jia-Bin Huang Xiaokang Yang Ming-Husan Yang SJTU UIUC SJTU UC Merced ICCV 2015 Background Given the initial state (position and scale), estimate

More information

AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS

AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS Qi Kong 1, Shaoshan Wang 2, Jiushan Yang 2,Ruiqi Zou 3, Yan Huang 1, Yilong Yin 1, Jingliang Peng 1 1 School of Computer Science and

More information

. Semi-automatic WordNet Linking using Word Embeddings. Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani

. Semi-automatic WordNet Linking using Word Embeddings. Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani Semi-automatic WordNet Linking using Word Embeddings Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani January 11, 2018 Kevin Patel WordNet Linking via Embeddings 1/22

More information

arxiv: v2 [cs.cv] 19 Dec 2017

arxiv: v2 [cs.cv] 19 Dec 2017 An Ensemble of Deep Convolutional Neural Networks for Alzheimer s Disease Detection and Classification arxiv:1712.01675v2 [cs.cv] 19 Dec 2017 Jyoti Islam Department of Computer Science Georgia State University

More information

Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios

Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios 2012 Ninth Conference on Computer and Robot Vision Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios Neil D. B. Bruce*, Xun Shi*, and John K. Tsotsos Department of Computer

More information

Task-driven Webpage Saliency

Task-driven Webpage Saliency Task-driven Webpage Saliency Quanlong Zheng 1[0000 0001 5059 0078], Jianbo Jiao 1,2[0000 0003 0833 5115], Ying Cao 1[0000 0002 9288 3167], and Rynson W.H. Lau 1[0000 0002 8957 8129] 1 Department of Computer

More information

Vector Learning for Cross Domain Representations

Vector Learning for Cross Domain Representations Vector Learning for Cross Domain Representations Shagan Sah, Chi Zhang, Thang Nguyen, Dheeraj Kumar Peri, Ameya Shringi, Raymond Ptucha Rochester Institute of Technology, Rochester, NY 14623, USA arxiv:1809.10312v1

More information

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS 1 Manimegalai. C and 2 Prakash Narayanan. C manimegalaic153@gmail.com and cprakashmca@gmail.com 1PG Student and 2 Assistant Professor, Department

More information

Outline. Teager Energy and Modulation Features for Speech Applications. Dept. of ECE Technical Univ. of Crete

Outline. Teager Energy and Modulation Features for Speech Applications. Dept. of ECE Technical Univ. of Crete Teager Energy and Modulation Features for Speech Applications Alexandros Summariza(on Potamianos and Emo(on Tracking in Movies Dept. of ECE Technical Univ. of Crete Alexandros Potamianos, NatIONAL Tech.

More information

arxiv: v3 [cs.cv] 1 Jul 2018

arxiv: v3 [cs.cv] 1 Jul 2018 1 Computer Vision and Image Understanding journal homepage: www.elsevier.com SalGAN: visual saliency prediction with adversarial networks arxiv:1701.01081v3 [cs.cv] 1 Jul 2018 Junting Pan a, Cristian Canton-Ferrer

More information

The Ordinal Nature of Emotions. Georgios N. Yannakakis, Roddy Cowie and Carlos Busso

The Ordinal Nature of Emotions. Georgios N. Yannakakis, Roddy Cowie and Carlos Busso The Ordinal Nature of Emotions Georgios N. Yannakakis, Roddy Cowie and Carlos Busso The story It seems that a rank-based FeelTrace yields higher inter-rater agreement Indeed, FeelTrace should actually

More information

VENUS: A System for Novelty Detection in Video Streams with Learning

VENUS: A System for Novelty Detection in Video Streams with Learning VENUS: A System for Novelty Detection in Video Streams with Learning Roger S. Gaborski, Vishal S. Vaingankar, Vineet S. Chaoji, Ankur M. Teredesai Laboratory for Applied Computing, Rochester Institute

More information

TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING

TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING 134 TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING H.F.S.M.Fonseka 1, J.T.Jonathan 2, P.Sabeshan 3 and M.B.Dissanayaka 4 1 Department of Electrical And Electronic Engineering, Faculty

More information

Neuromorphic convolutional recurrent neural network for road safety or safety near the road

Neuromorphic convolutional recurrent neural network for road safety or safety near the road Neuromorphic convolutional recurrent neural network for road safety or safety near the road WOO-SUP HAN 1, IL SONG HAN 2 1 ODIGA, London, U.K. 2 Korea Advanced Institute of Science and Technology, Daejeon,

More information

Framework for Comparative Research on Relational Information Displays

Framework for Comparative Research on Relational Information Displays Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of

More information

Feature Engineering for Depression Detection in Social Media

Feature Engineering for Depression Detection in Social Media Maxim Stankevich, Vadim Isakov, Dmitry Devyatkin and Ivan Smirnov Institute for Systems Analysis, Federal Research Center Computer Science and Control of RAS, Moscow, Russian Federation Keywords: Abstract:

More information

GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES

GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES Jie Wu and Liqing Zhang MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems Dept. of CSE, Shanghai Jiao Tong

More information

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1 1. INTRODUCTION Sign language interpretation is one of the HCI applications where hand gesture plays important role for communication. This chapter discusses sign language interpretation system with present

More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information