Visual-Verbal Consistency of Image Saliency

2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff Center, Banff, Canada, October 5-8, 2017

Visual-Verbal Consistency of Image Saliency

Haoran Liang, Ming Jiang, Ronghua Liang and Qi Zhao
College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
Department of Computer Science and Engineering, University of Minnesota, USA
(This work was done when Haoran Liang was a visiting student in the Zhao Lab.)

Abstract: When looking at an image, humans shift their attention towards interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem, we look into eye fixations and image captions, two types of subjective annotations that are relatively task-free and natural. From the annotations, we extract visual and verbal saliency ranks to compare against each other. We then propose a number of low-level and semantic-level features relevant to the visual-verbal consistency. Integrated into a computational model, the proposed features effectively predict the consistency between the two modalities on a large dataset with both types of annotations, namely SALICON [1].

Keywords: visual saliency, image caption, visual-verbal consistency, correlation

I. INTRODUCTION

Generating properly formed image captions based on the understanding of image contents is an important and challenging task in computer vision and natural language processing. Progress in this task has great impacts that can benefit various real-life applications such as traffic navigation [2], robotics [3] and education [4]. To better describe an image, particularly in a cluttered scene, it is of great importance to capture the key elements in the image instead of describing everything. Previous studies reveal that visual attention is a helpful proxy for perceiving importance in images. Visual attention is a bottleneck mechanism that allows only a small portion of the visual input to reach higher-level processing units. It breaks down scene understanding into a sequence of localized visual analysis problems. We hypothesize that patterns in image captions strongly rely on what people regard as important. On the other hand, annotations of visual attention offer a natural ranking from humans while free-viewing the scenes. Thus, it is of interest to understand the consistency between how people view images and how they describe them. Further, we observe that the level of visual-verbal consistency can vary despite the overall high correlation between the two modalities (see the examples in Fig. 1). Natural questions to follow are then: what image features make the two modalities more or less consistent, and whether the consistency is predictable from image features.

Fig. 1. Examples of high-consistency (blue box) and low-consistency (orange box) patterns between how people look at and describe images. For each case, the gray-scale map on the left indicates visual saliency, and the one on the right indicates verbal saliency.

In this work, we study the relationship between the words used in image captions and the regions where people look in natural images. First, we explicitly define verbal saliency as the importance of the words in a sentence and report the correlation between verbal saliency and visual saliency. Second, we propose a number of features based on image and object information and quantify their effects on the consistency between visual saliency and verbal saliency. Finally, we show the effectiveness of the features in predicting the visual-verbal consistency using a support vector regression (SVR).

II. RELATED WORK

A. Image Saliency

The computational model of visual attention was first introduced by Itti et al. [5]. Specifically, they proposed to use a set of feature maps from three complementary channels: intensity, color, and orientation. The normalized feature maps from each channel were then linearly combined to generate the overall saliency map. Based on this, many other researchers have since suggested various models that can be categorized by the algorithms used, e.g., Bayesian models [6], graphical models [7], spectral analysis models [8] and pattern classification models [9]. In addition to the early features, high-level image semantics have been shown to be much more relevant to image saliency [10]. Therefore, deep neural networks have been proposed to reduce the semantic gap [11], [12], by naturally integrating hierarchical features for saliency prediction.
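To make the linear-combination scheme of [5] concrete, the following minimal sketch normalizes a few hand-crafted channel maps and averages them into a single saliency map. It is an illustration only, not the original implementation: the channels are reduced to intensity plus two simplified opponent-color maps, the orientation channel and multi-scale center-surround pyramids are omitted, and equal channel weights are assumed.

```python
import numpy as np

def normalize(feature_map):
    """Rescale a feature map to [0, 1]; constant maps become all zeros."""
    fmin, fmax = feature_map.min(), feature_map.max()
    if fmax - fmin < 1e-12:
        return np.zeros_like(feature_map)
    return (feature_map - fmin) / (fmax - fmin)

def itti_style_saliency(image):
    """Crude stand-in for an Itti-style model: per-channel maps are
    normalized and linearly combined with equal weights.
    `image` is an HxWx3 float array in [0, 1]."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    intensity = (r + g + b) / 3.0
    red_green = r - g                  # simplified opponent-color channels
    blue_yellow = b - (r + g) / 2.0
    channels = [intensity, red_green, blue_yellow]
    return np.mean([normalize(c) for c in channels], axis=0)
```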

Fig. 2. Overview of the method to generate visual saliency maps and verbal saliency maps. See Sec. III for details.

B. Verbal Saliency and Image Captioning

Few works have focused on image importance [13] and specificity [14] so far. Supported by deep neural network models, methods for image-sentence retrieval [15] and image captioning [16] have been developing rapidly. Many works [17], [18] using a multi-modal recurrent neural network (RNN) achieve state-of-the-art performance for both image-sentence retrieval and image captioning. Recent models [19], [20], [21] suggest that attention-related information improves image captioning performance.

The proposed work is the first that explicitly examines the correlation and consistency of visual saliency and image description on a large dataset. It also distinguishes itself from earlier models in that it directly explores features that contribute to the consistency between how people view and describe an image. Our work further verifies the possibility of predicting visual-verbal consistency with computational algorithms.

III. OVERVIEW

Studying the consistency between visual saliency and verbal saliency requires an image dataset with captions, object annotations, and saliency annotations. SALICON [1] is selected in this work as it is currently the largest public dataset that meets this requirement. We could also use algorithms such as R-CNN [22] to detect objects in future studies. Given an image with object annotations, to measure its visual-verbal consistency, we obtain saliency values from fixations and captions, assign them to the objects, and compare them using Spearman's rank correlation ρ. That is, for image I with N objects, we denote each object as o_i, where i ∈ {1, 2, ..., N}. Two types of saliency values, V(o_i) and W(o_i), are computed for o_i from attentional data and image captions, respectively. The visual-verbal consistency is computed as

ρ = Spearman(V(I), W(I)),   (1)

where V(I) = {V(o_1), V(o_2), ..., V(o_N)} and W(I) = {W(o_1), W(o_2), ..., W(o_N)}. Fig. 2 demonstrates the approaches of computing visual saliency from human fixations and verbal saliency from image captions.

A. Visual Saliency

The saliency map obtained from attentional data can be directly applied to measure the importance of objects in an image. In particular, given an input image and its fixation map, the saliency of each object in the scene is computed by taking the maximal value of the fixation map within the object's segmentation mask, i.e., V(o_i) = max(P_i), where P_i denotes the set of saliency values (obtained using mouse-tracking in [1]) inside the ith object mask.
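The per-object visual saliency V(o_i) and the consistency measure of Eq. (1) can be sketched as follows. This is a simplified illustration under the assumption that object masks are available as boolean arrays; scipy.stats.spearmanr stands in for the Spearman rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def object_visual_saliency(fixation_map, masks):
    """V(o_i): maximum of the fixation/saliency map inside each object mask."""
    return np.array([fixation_map[mask].max() for mask in masks])

def visual_verbal_consistency(visual_vals, verbal_vals):
    """Eq. (1): Spearman's rank correlation between V(I) and W(I)."""
    rho, _ = spearmanr(visual_vals, verbal_vals)
    return rho
```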
B. Verbal Saliency

In order to know which objects are described and where they are, we propose a mapping method between words in a caption and objects in an image. In particular, we use the Natural Language Toolkit (NLTK) [23] to parse all the captions into a vocabulary set of nouns, with the help of the Stanford Log-linear Part-of-Speech Tagger [24]. We use a simple WordNet-based measure of semantic distance [25] to find which of the 80 object categories these words belong to. We initialize the saliency value for each image from its corresponding 5 captions with two strategies, namely cardinality-based and sequence-based approaches.

The cardinality-based approach measures the importance of each word based on its frequency of occurrence. In order to weight the count of each word into floating-point values, the importance (also denoted as the initial saliency value W_init(o_i)) is computed using term frequency-inverse document frequency (TF-IDF) with the scikit-learn software package [26]. Words that are rare in the corpus but occur frequently in a sentence contribute more to the importance. Let J_i be the number of times that the word for object o_i occurs in the k sentences; the term frequency of the object is calculated as W_init(o_i) = J_i / Σ_{i=1}^{N} J_i.

The sequence-based approach captures and highlights the order of each word in a sentence. We denote the number of categories in an image as K, and the order of a word (object o_i) in a sentence as q_i; e.g., q_i = 1 means that the word comes first in the sentence. Its weight is then K − q_i, which is higher if a word comes at the beginning of a sentence and lower otherwise. Therefore, for object o_i in the k sentences, the initial saliency value is W_init(o_i) = Σ (K − q_i), summed over the k sentences.

These two approaches provide reasonable initializations of the verbal saliency values. Yet ambiguities still exist. For example, with multiple people in a scene, we cannot tell whom "the man" in the captions refers to. Further, the mappings from words to objects are not one-to-one. For example, "the vehicle" may refer to either a car or a truck when both co-exist in an image. To approach this problem, we propose a method based on psychophysical studies. In particular, we borrow the observations from [13] to adjust the verbal saliency map by considering compositional features relating to objects, i.e., size and location.
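The two initialization schemes can be sketched roughly as below. This is a simplified illustration: plain whitespace tokenization stands in for the NLTK/POS-tagging pipeline, and the cardinality variant uses the raw term-frequency form above rather than the full scikit-learn TF-IDF weighting.

```python
from collections import Counter

def cardinality_init(object_words, captions):
    """Cardinality-based W_init(o_i) = J_i / sum_j J_j, with J_i the number
    of times the word for object o_i occurs across the image's captions."""
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        for word in object_words:
            counts[word] += tokens.count(word)
    total = sum(counts.values()) or 1
    return {word: counts[word] / total for word in object_words}

def sequence_init(object_words, captions, K):
    """Sequence-based W_init(o_i): sum of (K - q_i) over the captions, where
    q_i is the order in which the object's word appears among the object
    words of a caption (q_i = 1 for the first-mentioned object)."""
    weights = {word: 0.0 for word in object_words}
    for caption in captions:
        tokens = caption.lower().split()
        mentioned = sorted((w for w in object_words if w in tokens),
                           key=tokens.index)
        for q, word in enumerate(mentioned, start=1):
            weights[word] += max(K - q, 0.0)
    return weights
```

In both functions a word that never appears simply keeps a zero weight, which matches the intent that undescribed objects receive no verbal saliency before the size/location adjustment described next.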

Fig. 3. (a, b) The effects of location and size on both visual saliency and verbal saliency. (c) The distribution of Spearman correlation scores across the 10,000 SALICON training images.

Visual attention is well known to have a center bias and a preference for dominant objects, and image captions show a similar tendency. Fig. 3 (a) and (b) display the effects of location and size on description probability for the training set of SALICON. With these observations, we adjust the saliency values of objects with an additional term T(·) = S(·) · (1 − L(·)), where S(·) is the normalized size and L(·) is the normalized distance to the image center (see details in Sec. IV-B). The final value is calculated as W(o_i) = W_init(o_i) · T(o_i), which is used as the ground truth for verbal saliency.

In order to investigate the inter-subject consistency in image captioning, for each image in the training set we use one of the five captions and the remaining four to generate two verbal saliency maps and calculate their Spearman correlation, resulting in a human consistency of ρ = .865. By comparing the ground truth between visual saliency and verbal saliency, we obtain ρ = .435 and ρ = .439 (ρ = .361 and ρ = .368 without adjusting for size and location) for the cardinality-based and sequence-based schemes respectively; the distribution over all the images is shown in Fig. 3 (c).

IV. FEATURES AND STATISTICAL ANALYSIS

In this section, we report the statistics of the dataset (see Sec. IV-A) and propose a number of potential features for estimating the consistency between visual saliency and verbal saliency, including low-level features (i.e., size, location, density) and semantic-level features (i.e., categories). After that, we quantitatively evaluate how each feature impacts the visual-verbal consistency (see Sec. IV-B), and show the inclinations people have towards different categories (see Sec. IV-C).

A. Dataset Composition

We first count the number of each part-of-speech (POS) in the image descriptions for the 10,000 training images. Each word is converted to its basic form before the count to make sure each element in our vocabulary list is unique. Nouns (431) stand out as the dominant part of the descriptions, followed by verbs (1927) and adjectives (164). Numerals (55) and prepositions (54) occur quite frequently, but they are less relevant to the image content. Verbs and adjectives apparently relate to the perception of importance and seem worth consideration. However, they are used to describe attributes that eventually lead us back to the particular objects they derive from. Hence, we only consider nouns in the following analysis.

B. Low-Level Features

The average correlation scores of ρ = .435 and .439 for the cardinality and sequence methods show a significant correlation between visual saliency and verbal saliency. To investigate this, we propose three low-level and two semantic-level features extracted from all the objects, based on the object annotations of the image. In particular, given an image I with N objects, let o_i be the ith object. We define the low-level features, namely size, location and density, as follows:

Size: the size S(o_i) of an object is measured as the number of pixels in its object mask, normalized by the image resolution: S(o_i) = (size of o_i) / (image size). Thus, each object has a rounded size value ranging from .1 to 1., resulting in a 10-dimensional vector that encodes the numbers of objects of different sizes.

Location: the center coordinate of an object's bounding box is used as its location. In our feature setting, we use the relative location L(o_i) of an object, defined as its distance to the image center: L(o_i) = 2 · dist(o_i, image center) / image diagonal. Similarly, with all the distances ranging from .1 to 1., the descriptor is a 10-dimensional vector that encodes the numbers of objects at different locations.

Density: object density captures the objects' distances to each other. From the segmentation masks of all the objects, the object density is computed as D(o_i) = (Σ_j dist(o_i, o_j) / image diagonal) / N, where dist(o_i, o_j) denotes the Euclidean distance between the ith and jth objects. After normalization, we use a 20-dimensional vector to encode the numbers of objects with different densities from .05 to 1. with a step of .05.

Next, we independently investigate the contribution of each low-level feature to the visual-verbal consistency. Across all training images, we compute the Pearson's linear correlation coefficients between the feature values and the visual-verbal consistency scores, for both the cardinality-based and sequence-based approaches. In this analysis, a positive correlation suggests that the corresponding feature channel contributes positively to the visual-verbal consistency, and vice versa. The results are shown in Fig. 4.

First, as observed from the results, images with medium-sized objects (s-.2 to s-.5) have particularly more consistent visual saliency and verbal saliency. These feature channels contribute significantly in both the cardinality and sequence cases. The downward trend of feature contribution from s-.2 to s-.9 demonstrates that visual saliency and verbal saliency become less similar when large objects are present. Most likely, large objects are in the background and visually less salient, but are important for describing the context of a scene.
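The three descriptors can be computed per image roughly as in the sketch below, which bins the per-object values into the histogram channels shown in Fig. 4. The bin layout (10 size bins, 10 location bins, 20 density bins) and the use of boolean masks with pixel-coordinate object centers are assumptions of this illustration.

```python
import numpy as np

def low_level_descriptors(masks, centers, image_shape):
    """Binned size (10-d), location (10-d) and density (20-d) descriptors
    for one image. `masks` are boolean HxW arrays, `centers` are (x, y)
    object centers in pixels, `image_shape` is (H, W)."""
    H, W = image_shape
    diag = np.hypot(H, W)
    img_center = np.array([W / 2.0, H / 2.0])
    centers = np.asarray(centers, dtype=float)

    sizes = np.array([m.sum() / float(H * W) for m in masks])          # S(o_i)
    locs = 2.0 * np.linalg.norm(centers - img_center, axis=1) / diag   # L(o_i)
    # D(o_i): mean normalized distance to the other objects in the image
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    dens = dists.sum(axis=1) / (diag * max(len(masks), 1))

    size_hist, _ = np.histogram(sizes, bins=10, range=(0.0, 1.0))
    loc_hist, _ = np.histogram(locs, bins=10, range=(0.0, 1.0))
    dens_hist, _ = np.histogram(dens, bins=20, range=(0.0, 1.0))
    return size_hist, loc_hist, dens_hist
```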

Fig. 4. Correlations of the size (s-), location (l-) and density (d-) features for both cardinality-based and sequence-based verbal saliency. The asterisks on top of the bars denote significant correlations with p < .01. Feature channels with all-zero values are skipped.

Second, images with objects close to the center (l-.1) are significantly more consistent between visual saliency and verbal saliency. As objects get farther from the image center (l-.2 to l-.5), they contribute negatively to the visual-verbal consistency. However, when images contain more objects near the edges (l-.7 to l-.9), probably because a large dominant object occupies the central region, the saliency ranks become more consistent.

Finally, images with low object density (d-.35 to d-.6) are more consistent between visual saliency and verbal saliency, while images with high object density (d-.15 to d-.25) are less consistent. In other words, sparse image contents receive more similar descriptions, while cluttered ones do not.

C. Semantic-Level Features

There is psychophysical evidence that human observers tend to fixate on semantic categories such as people and animals. With this group of features, we examine whether the presence of different categories affects the way people describe an image. We sort the noun list by how many times each noun appears in the image captions, in descending order. We notice that the top nouns are quite common in daily life, e.g., man, dog. For each noun, we look for all the images that contain the described object along with the corresponding captions. Given a noun mapped to an object, we compute its probability of being mentioned by dividing the number of captions that contain this noun or its synonym by the total number of sentences. Note that summative words are not considered in this case; e.g., "a lot of dishes on the table" does not mean an egg (if it exists) is mentioned, since we focus on sibling terms such as "ship" for "boat". For the probability of fixation, we simply count the number of fixations inside the object mask and perform a cut-off at a threshold θ to exclude noise. In our experiments, θ is set to 1 to 3 depending on the object size.

We randomly choose the nouns listed in Table I from the most frequent entries of the vocabulary set to investigate these features, from which several observations can be drawn. Fixations cover more content than image captions, as almost all the probabilities from fixations are higher. Although such probabilities have been studied in previous work [13], what we find new is that humans have quite similar tendencies towards living things, both in visual attention and in image captions.

TABLE I. Probabilities of object categories being mentioned in image captions and fixated during free viewing (columns: category, mentioned, fixated). The categories include cow, cat, dog, zebra, hat, potato, furniture, foot, chip, car, baseball, hand, number, soup, chair, steel, stop, controller, suitcase, lamp, bowl, egg, shower, bottle, computer, truck, pizza, frisbee, and broccoli.
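The mention and fixation statistics reported in Table I can be estimated roughly as sketched below. This is a simplified illustration: the synonym set is assumed to be supplied by the caller (e.g., from WordNet), and the fixation side returns a per-image indicator that would still need to be aggregated over images and observers to yield the probabilities in the table.

```python
def mention_probability(noun, synonyms, captions):
    """Fraction of an image's captions that mention the noun or a synonym."""
    terms = {noun} | set(synonyms)
    hits = sum(any(term in caption.lower().split() for term in terms)
               for caption in captions)
    return hits / float(len(captions))

def is_fixated(fixations, mask, theta):
    """1.0 if the number of fixations (x, y) inside the object mask exceeds
    the noise threshold theta, else 0.0."""
    count = sum(1 for (x, y) in fixations if mask[int(y), int(x)])
    return 1.0 if count > theta else 0.0
```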
In contrast, we observe different patterns for inanimate objects. A considerable number of inanimate objects that attract a certain level of attention are less likely to be described, such as containers (bottle, bowl) and salient object parts (hand, foot). We further observe that image captions become less specific when more categories are present. In particular, if only one or two objects are present, people tend to mention the exact name. As the number of objects increases, people are more likely to summarize the scene, in which case super-categories appear frequently. For example, "food on a table" is rarely used to describe an image showing only a plate of carrots, but often for a table with many dishes.

V. PREDICTING THE VISUAL-VERBAL CONSISTENCY

The impact of the features on visual-verbal consistency motivates us to investigate whether the consistency is predictable and how the features contribute to the prediction. In this section, we describe the experimental setup and provide the details of the feature settings along with the model (see Sec. V-A). We show the experimental results on the SALICON dataset (see Sec. V-B), with intuitions gained to better understand the patterns of image viewing and description.

A. Computational Modeling

Given an image I, our goal is to automatically predict the consistency between visual saliency and verbal saliency. Based on the observations above, we take into consideration the size, location, density, and category in our model. The image feature F(I) is represented by concatenating the different feature vectors as follows:

F(I) = {C, C_s, S, L, D} ∈ R^l,   (2)

where C is an 80-dimensional vector representing object categories; C_s similarly denotes the 12-dimensional vector for super-categories; each element in C and C_s denotes the

TABLE II. Performance (mean squared error) of features at predicting the correlation, for both cardinality-based and sequence-based ground truth (MSE and STD reported for each). Feature types: Size; Location; Density; Size + Location + Density; Category; Super-category; Category + super-category; All combined; AlexNet [29].

Fig. 5. Example images with low consistency (left), medium consistency (middle), and high consistency (right). Numbers indicate the cardinality-based visual-verbal consistency our model predicts.

number of a certain object category in the image; and S, L and D are 10-, 10- and 20-dimensional vectors encoding the size, distance to center and density of the objects, respectively. The above features form an l-dimensional (l = 132) feature vector for each image. We employ a support vector regressor (SVR, [27]) as our predictive model.

B. Experimental Results and Discussions

In order to validate the effectiveness of the features used in our experiment, we conduct the same process as described in Sec. III on the validation set of 5,000 images, and use the Spearman correlation scores as their ground-truth labels. We train SVRs [28] with Gaussian kernels to predict the visual-verbal consistency, using a grid search to select the cost and kernel hyper-parameters. We report the results for both standalone and combined features, using the two different sources (cardinality and sequence) as ground truth. We also include a deep convolutional neural network as an additional model for prediction. For this, we employ AlexNet [29] and extract features from the layer just before the final classification layer (often referred to as fc7), resulting in a feature dimension of 4,096. The performances of all models are shown in Table II.

It can be seen that the combined features perform considerably better (MSE = .157 and .1568) than the standalone features. The two schemes of verbal saliency generation show similar performance, with the sequence-based method performing slightly better. Moreover, we observe that the object categories, followed by the super-categories, play the most significant roles in the prediction of consistency, whereas the low-level features are less useful. This suggests that verbal description patterns are more related to the semantics of the image than to early vision.

What affects the visual-verbal consistency? We first look at the image content to unearth consistent patterns that contribute to visual-verbal consistency. Example images predicted by the SVR using combined features are shown in Fig. 5, ranked by predicted consistency score in ascending order. For the most consistent cases in Fig. 5 (right), our model finds relatively clean images with few objects and usual activities. The content of these images is quite clear at first sight and can be easily summarized. The objects in these images are highly likely to be common in daily life and are consistently mentioned in the image captions, e.g., dogs, cars, and people. Also, these objects are quite likely to occur with a dominant size (.2 < s < .5) near the center of the image (l < .2), attracting gaze at an early stage.

Images in Fig. 5 (middle) are more complicated scenes with 5 or more objects. In these cases, while people still have consistent preferences in visual saliency, the image captions vary in sequence. A typical case can be found in the middle of Fig. 6. As the number of objects increases, there are a variety of choices to describe the scene, so the weight of each being mentioned drops. As a result, different things are chosen at the beginning of each sentence.

As observed in Table I, in most cases humans and animals are both visually and verbally salient. However, there exist several cases in which inanimate things are more attractive at first sight. Since animate things mostly occur at the very beginning of captions, the visual saliency in these cases correlates negatively with the verbal saliency. Fig. 5 (left) demonstrates these rare cases, most of which contain objects more salient than the humans and animals present.

Finally, we investigate whether the length of the image captions leads to more variability and hence less visual-verbal consistency. The low correlation between the average caption length (measured as the number of words in the sentence) and the consistency (ρ = .06) shows that caption length has no obvious impact on the visual-verbal consistency. We then check the effect of variation in caption length and find that images with high consistency scores (ρ < −.4 or ρ > .4) usually have less variation in caption length (see Fig. 6).

Fig. 6. Different patterns in what people look at and describe. The three images are selected from each set in Fig. 5.
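A minimal sketch of the prediction pipeline is given below: it concatenates the descriptors into F(I) and fits an RBF-kernel SVR with a small grid search over the cost and kernel hyper-parameters. It uses scikit-learn's SVR rather than LIBSVM [28], and the grid values and cross-validation setup are placeholder assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def build_feature_vector(cat_hist, supercat_hist, size_hist, loc_hist, dens_hist):
    """F(I) = {C, C_s, S, L, D}: concatenation of the category, super-category,
    size, location and density descriptors (132-d in total here)."""
    return np.concatenate([cat_hist, supercat_hist, size_hist, loc_hist, dens_hist])

def train_consistency_regressor(X, y):
    """Fit an RBF-kernel SVR to predict the visual-verbal consistency score
    (the Spearman rho used as ground truth), selecting the cost C and the
    kernel width gamma by grid search."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1.0]}
    search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    return search.best_estimator_
```

Given a trained model, the consistency of a held-out image would then be estimated with model.predict(F.reshape(1, -1)).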

Overall, visual-verbal consistency is correlated with image content to a considerable extent. The proposed features and regressor are able to predict the consistency effectively. Yet there are a number of failure cases caused by missing annotations, suggesting further updates of the dataset.

VI. CONCLUSIONS

In this work, we characterize aspects of images and captions relating to the visual-verbal consistency of how people explore and describe an image. We investigate this question on a large-scale dataset, namely SALICON. We extract saliency values of objects from eye-tracking data and captions, based on which we propose several low-level and semantic-level features such as size, location, density, and category. Further, we analyze their relative contributions to the visual-verbal consistency. Finally, we propose a computational model using a support vector regressor (SVR) to predict the consistency between visual saliency and verbal saliency. We envision that understanding the visual-verbal consistency of image saliency could inspire exciting and far-reaching applications in computer vision and answer related questions in human vision.

REFERENCES

[1] M. Jiang, S. Huang, J. Duan, and Q. Zhao, "SALICON: Saliency in context," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[2] H. Tae-Hyun, J. In-Hak, and C. Seong-Ik, "Detection of traffic lights for vision-based car navigation system," in Advances in Image and Video Technology, 2006.
[3] C. Thorpe, M. Hebert, T. Kanade, and S. Shafer, Vision and Navigation for the Carnegie Mellon Navlab. Springer.
[4] W. Harwin, A. Ginige, and R. Jackson, "A potential application in early education and a possible role for a vision system in a workstation based robotic aid for the physically disabled," in Interactive Robotic Aids - One Option for Independent Living: An International Perspective, Monograph, vol. 37.
[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.
[6] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, "SUN: A Bayesian framework for saliency using natural statistics," Journal of Vision, vol. 8, no. 7, p. 32, 2008.
[7] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems, 2006.
[8] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[9] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the IEEE International Conference on Computer Vision, 2009.
[10] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao, "Predicting human gaze beyond pixels," Journal of Vision, vol. 14, no. 1, 2014.
[11] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, "Predicting eye fixations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[12] X. Huang, C. Shen, X. Boix, and Q. Zhao, "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[13] A. C. Berg, T. L. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos et al., "Understanding and predicting importance in images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[14] M. Jas and D. Parikh, "Image specificity," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," Transactions of the Association for Computational Linguistics, vol. 2, 2014.
[16] A. Karpathy, A. Joulin, and F. Li, "Deep fragment embeddings for bidirectional image sentence mapping," in Advances in Neural Information Processing Systems, 2014.
[17] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[18] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," arXiv preprint, 2014.
[19] H. Xu and K. Saenko, "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering," in European Conference on Computer Vision. Springer, 2016.
[20] C. Liu, J. Mao, F. Sha, and A. Yuille, "Attention correctness in neural image captioning," arXiv preprint, 2016.
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, 2015.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[23] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
[24] K. Toutanova and C. D. Manning, "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger," in Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, 2000.
[25] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[27] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.


More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES Varinthira Duangudom and David V Anderson School of Electrical and Computer Engineering, Georgia Institute of Technology Atlanta, GA 30332

More information

Development of novel algorithm by combining Wavelet based Enhanced Canny edge Detection and Adaptive Filtering Method for Human Emotion Recognition

Development of novel algorithm by combining Wavelet based Enhanced Canny edge Detection and Adaptive Filtering Method for Human Emotion Recognition International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 12, Issue 9 (September 2016), PP.67-72 Development of novel algorithm by combining

More information

Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis

Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis Chi Zhang 1, Weiqiang Wang 1, 2, and Xiaoqian Liu 1 1 School of Computer and Control Engineering, University of

More information

Categories. Represent/store visual objects in terms of categories. What are categories? Why do we need categories?

Categories. Represent/store visual objects in terms of categories. What are categories? Why do we need categories? Represen'ng Objects Categories Represent/store visual objects in terms of categories. What are categories? Why do we need categories? Grouping of objects into sets where sets are called categories! Categories

More information

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection

Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection Y.-H. Wen, A. Bainbridge-Smith, A. B. Morris, Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection, Proceedings of Image and Vision Computing New Zealand 2007, pp. 132

More information

Hierarchical Convolutional Features for Visual Tracking

Hierarchical Convolutional Features for Visual Tracking Hierarchical Convolutional Features for Visual Tracking Chao Ma Jia-Bin Huang Xiaokang Yang Ming-Husan Yang SJTU UIUC SJTU UC Merced ICCV 2015 Background Given the initial state (position and scale), estimate

More information

AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS

AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS AUTOMATIC MEASUREMENT ON CT IMAGES FOR PATELLA DISLOCATION DIAGNOSIS Qi Kong 1, Shaoshan Wang 2, Jiushan Yang 2,Ruiqi Zou 3, Yan Huang 1, Yilong Yin 1, Jingliang Peng 1 1 School of Computer Science and

More information

. Semi-automatic WordNet Linking using Word Embeddings. Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani

. Semi-automatic WordNet Linking using Word Embeddings. Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani Semi-automatic WordNet Linking using Word Embeddings Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani January 11, 2018 Kevin Patel WordNet Linking via Embeddings 1/22

More information

arxiv: v2 [cs.cv] 19 Dec 2017

arxiv: v2 [cs.cv] 19 Dec 2017 An Ensemble of Deep Convolutional Neural Networks for Alzheimer s Disease Detection and Classification arxiv:1712.01675v2 [cs.cv] 19 Dec 2017 Jyoti Islam Department of Computer Science Georgia State University

More information

Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios

Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios 2012 Ninth Conference on Computer and Robot Vision Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios Neil D. B. Bruce*, Xun Shi*, and John K. Tsotsos Department of Computer

More information

Task-driven Webpage Saliency

Task-driven Webpage Saliency Task-driven Webpage Saliency Quanlong Zheng 1[0000 0001 5059 0078], Jianbo Jiao 1,2[0000 0003 0833 5115], Ying Cao 1[0000 0002 9288 3167], and Rynson W.H. Lau 1[0000 0002 8957 8129] 1 Department of Computer

More information

Vector Learning for Cross Domain Representations

Vector Learning for Cross Domain Representations Vector Learning for Cross Domain Representations Shagan Sah, Chi Zhang, Thang Nguyen, Dheeraj Kumar Peri, Ameya Shringi, Raymond Ptucha Rochester Institute of Technology, Rochester, NY 14623, USA arxiv:1809.10312v1

More information

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS 1 Manimegalai. C and 2 Prakash Narayanan. C manimegalaic153@gmail.com and cprakashmca@gmail.com 1PG Student and 2 Assistant Professor, Department

More information

Outline. Teager Energy and Modulation Features for Speech Applications. Dept. of ECE Technical Univ. of Crete

Outline. Teager Energy and Modulation Features for Speech Applications. Dept. of ECE Technical Univ. of Crete Teager Energy and Modulation Features for Speech Applications Alexandros Summariza(on Potamianos and Emo(on Tracking in Movies Dept. of ECE Technical Univ. of Crete Alexandros Potamianos, NatIONAL Tech.

More information

arxiv: v3 [cs.cv] 1 Jul 2018

arxiv: v3 [cs.cv] 1 Jul 2018 1 Computer Vision and Image Understanding journal homepage: www.elsevier.com SalGAN: visual saliency prediction with adversarial networks arxiv:1701.01081v3 [cs.cv] 1 Jul 2018 Junting Pan a, Cristian Canton-Ferrer

More information

The Ordinal Nature of Emotions. Georgios N. Yannakakis, Roddy Cowie and Carlos Busso

The Ordinal Nature of Emotions. Georgios N. Yannakakis, Roddy Cowie and Carlos Busso The Ordinal Nature of Emotions Georgios N. Yannakakis, Roddy Cowie and Carlos Busso The story It seems that a rank-based FeelTrace yields higher inter-rater agreement Indeed, FeelTrace should actually

More information

VENUS: A System for Novelty Detection in Video Streams with Learning

VENUS: A System for Novelty Detection in Video Streams with Learning VENUS: A System for Novelty Detection in Video Streams with Learning Roger S. Gaborski, Vishal S. Vaingankar, Vineet S. Chaoji, Ankur M. Teredesai Laboratory for Applied Computing, Rochester Institute

More information

TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING

TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING 134 TWO HANDED SIGN LANGUAGE RECOGNITION SYSTEM USING IMAGE PROCESSING H.F.S.M.Fonseka 1, J.T.Jonathan 2, P.Sabeshan 3 and M.B.Dissanayaka 4 1 Department of Electrical And Electronic Engineering, Faculty

More information

Neuromorphic convolutional recurrent neural network for road safety or safety near the road

Neuromorphic convolutional recurrent neural network for road safety or safety near the road Neuromorphic convolutional recurrent neural network for road safety or safety near the road WOO-SUP HAN 1, IL SONG HAN 2 1 ODIGA, London, U.K. 2 Korea Advanced Institute of Science and Technology, Daejeon,

More information

Framework for Comparative Research on Relational Information Displays

Framework for Comparative Research on Relational Information Displays Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of

More information

Feature Engineering for Depression Detection in Social Media

Feature Engineering for Depression Detection in Social Media Maxim Stankevich, Vadim Isakov, Dmitry Devyatkin and Ivan Smirnov Institute for Systems Analysis, Federal Research Center Computer Science and Control of RAS, Moscow, Russian Federation Keywords: Abstract:

More information

GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES

GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES GESTALT SALIENCY: SALIENT REGION DETECTION BASED ON GESTALT PRINCIPLES Jie Wu and Liqing Zhang MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems Dept. of CSE, Shanghai Jiao Tong

More information

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1 1. INTRODUCTION Sign language interpretation is one of the HCI applications where hand gesture plays important role for communication. This chapter discusses sign language interpretation system with present

More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information