arxiv: v1 [cs.cv] 28 Dec 2018



Salient Object Detection via High-to-Low Hierarchical Context Aggregation

Yun Liu 1, Yu Qiu 1, Le Zhang 2, JiaWang Bian 1, Guang-Yu Nie 3, Ming-Ming Cheng 1
1 Nankai University, 2 A*STAR, 3 Beijing Institute of Technology

Abstract

Recent progress on salient object detection mainly aims at exploiting how to effectively integrate convolutional side-output features in convolutional neural networks (CNN). Based on this, most existing state-of-the-art saliency detectors design complex network structures to fuse the side-output features of the backbone feature extraction networks. However, should the fusion strategies be more and more complex for accurate salient object detection? In this paper, we observe that the contexts of a natural image can be well expressed by a high-to-low self-learning of side-output convolutional features. As we know, the contexts of an image usually refer to the global structures, and the top layers of a CNN usually learn to convey global information. On the other hand, it is difficult for the intermediate side-output features to express contextual information. Here, we design an hourglass network with intermediate supervision to learn contextual features in a high-to-low manner. The learned hierarchical contexts are aggregated to generate the hybrid contextual expression for an input image. Finally, the hybrid contextual features are used for accurate saliency estimation. We extensively evaluate our method on six challenging saliency datasets, and our simple method achieves state-of-the-art performance under various evaluation metrics. Code will be released upon paper acceptance.

1. Introduction

Salient object detection, also known as saliency detection, aims at simulating the human vision system to detect the most conspicuous and eye-attracting objects or regions in a natural image [1, 7].
The progress in saliency detection has benefited a wide range of vision applications, including image retrieval [11], visual tracking [33], scene classification [36], content-aware video compression [61], and weakly supervised learning [46, 47]. Although numerous valuable models have been presented [25, 4, 57, 29, 17, 53, 15] and significant progress has been made, it remains an open problem to accurately detect salient objects in static images, especially in complicated scenarios.

Footnote: M.M. Cheng (cmm@nankai.edu.cn) is the corresponding author.

Figure 1. Visualization of our learned contexts at different sides of the neural network. Panels: (a) Image, (b) GT, (c) Side 6, (d) Side 5, (e) Side 4, (f) Side 3, (g) Side 2, (h) Side 1, (i) Aggregated Contexts. The contexts at lower sides are learned under the guidance of top global contexts to only emphasize the details of salient objects.

Conventional saliency detection methods [7, 19, 39] usually design hand-crafted low-level features and heuristic priors, which struggle to describe semantic objects and scenes. Recent progress on saliency detection has mainly benefited from convolutional neural networks (CNN) [32, 26, 57, 45, 54, 21, 22]. The backbone of a CNN usually consists of several blocks of stacked convolutional and pooling layers, in which the blocks near the network input are called bottom sides and the others top sides. It is well accepted that the top sides of a CNN contain semantically meaningful information while the bottom sides contain complementary spatial details [48, 30, 16]. Therefore, current state-of-the-art saliency detectors [4, 51, 45, 54, 29, 44, 55, 43, 16] mainly aim at designing complex network structures to fuse the features or results from various side-outputs. For example, Hou et al. [16] carefully selected several combination sets of various side-output results and fused the combination results for accurate saliency segmentation. Wang et al. [44] proposed a recurrent module to filter out noisy information for side-output features. Although significant progress has been made in this direction [16, 55, 44], the side-output fusion strategies have become more and more complex. Do we have to continue in this direction to further improve saliency detection?

To answer this question, we notice that some recent studies [58, 52] find that a CNN can learn global contextual information for input images at top convolution layers by enlarging receptive fields. This is not directly applicable to saliency detection, because saliency detection requires not only global contextual information but also local spatial details. Instead of fusing side-output features in a complicated way as in [4, 57, 51], we consider constructing hierarchical contextual features. Specifically, we let the global contextual information obtained at top sides flow into bottom sides. The top contextual information learns to guide the bottom sides to construct contextual features at fine spatial scales that only emphasize salient objects. Hence the obtained contexts are different from side-output features, or combinations of them, which only contain or at least emphasize local representations for an image. A visualization of contexts learned by our model can be found in Figure 1.

Intuitively, the hierarchical contexts should be learned in a high-to-low manner, which means the top sides should learn contexts first, and then the bottom sides can learn contexts at larger spatial resolutions using the information flowing from the top sides. Hence we build an hourglass network and add intermediate supervision after the context module at each side. In the training process, we find that the top sides are automatically optimized first, which is consistent with our hypothesis. This will be demonstrated in Section 4. Finally, we simply aggregate the hierarchical contexts for accurate salient object detection.
The experimental results demonstrate that our simple idea can favorably outperform recent state-of-the-art methods that use heavily engineered networks. Our contributions are threefold:

- We build an hourglass network with intermediate supervision to learn hierarchical contexts, which are generated with the guidance of global contextual information and thus only emphasize salient objects at different scales. We propose a hierarchical context aggregation module to ensure the network is optimized from the top sides to the bottom sides.
- We aggregate the learned hierarchical contexts at different scales to perform accurate salient object detection, unlike previous studies [16, 55, 43] that fuse side-output features or some complex combinations of side-outputs.
- We extensively compare our method with recent state-of-the-art methods on six popular datasets. Our simple method favorably outperforms these competitors under various metrics.

2. Related Work

Salient object detection is a very active research field due to its wide applications and challenging scenarios. Here, we briefly divide the related work into four parts to review the development of saliency detection and context learning.

Heuristic saliency detection methods usually extract hand-crafted low-level features and apply machine learning models to classify these features. Some heuristic saliency priors are utilized to ensure accuracy, such as color contrast [1, 7], center prior [20, 19] and background prior [50, 60]. DRFI [19] is a comprehensive representative of this kind of method, integrating various features and priors. However, it is difficult for low-level features to describe semantic information, and the saliency priors are not robust enough for complicated scenarios. Hence deep learning based methods have dominated this field due to their powerful representation capability.

Region-based saliency detection appeared in the early era of deep learning based saliency.
These approaches view each image patch as a basic processing unit to perform saliency detection. Lee et al. [21] utilized both low-level hand-crafted features and high-level deep features to classify candidate regions as salient or not. The low-level features are compared with other parts of an image to form a distance map that is then encoded by the CNN. Wang et al. [40] presented a two-stage training strategy to sort the segmented object proposals, in which the first stage extracts features and the second stage predicts the saliency score for each region. Li et al. [23] extracted multi-scale deep features which are used to infer the saliency scores for image segments.

CNN-based image-to-image saliency detection models [4, 57, 51, 45, 54, 29, 44, 17, 55, 5, 43, 27, 28, 16, 24, 32, 26] take saliency detection as a pixel-wise binary classification task and perform image-to-image predictions. For example, Chen et al. [5] proposed a two-stream network which consists of a fixation stream and a semantic stream. Zhang et al. [57] introduced an attention guided network that progressively integrates multiple layer-wise attention for saliency detection. Islam et al. [17] introduced a new deep learning solution with a hierarchical representation of relative saliency and stage-wise refinement. How to effectively fuse multi-level CNN features is the main research direction for CNN-based saliency detection methods [4, 51, 45, 54, 29, 44, 55, 43, 16, 24, 32, 26]. There are too many studies to list here, but the general trend is that recent designs are becoming more and more complicated. We will provide a detailed discussion of these methods in Section 4. Compared with them, we focus on a simple yet effective design in this paper.

Figure 2. Overall framework of our proposed method. Legend: DeConv, Crop, Hierarchical Context Aggregation, 2 Conv, Element-wise Sum. Our effort starts from the VGG16 network [38]. We add an additional convolution block at the end of the convolution layers of VGG16, resulting in six convolution blocks in total. The contexts at each convolution block are learned in a high-to-low manner to ensure that each block is guided by all higher layers to generate scale-aware contexts. The Hierarchical Context Aggregation (HCA) module guarantees that the optimization order is high-to-low and aggregates the generated hierarchical contexts to predict the final saliency maps.

Context learning was recently explored in semantic segmentation [58, 52]. Zhao et al. [58] added a pyramid pooling module for global context construction upon the final layer of the deep network, by which they significantly improved the performance of semantic segmentation. Zhang et al. [52] built a context encoding module using the encoding layer [9] on top of the neural network to conduct accurate semantic segmentation. In saliency detection, Wang et al. [43] followed [58] in using the pyramid pooling module to extract contextual information. Zhao et al. [59] proposed a global context module and a local context module to extract the global and local contexts. The global context module is fed with a superpixel-centered large window including the full image, while the local context module takes a superpixel-centered small window with a small image patch. Hence the goal of extracting multiple contexts in [59] is achieved by multi-scale inputs.

A full literature review of salient object detection is beyond the scope of this paper. Please refer to [2, 8, 12] for a more comprehensive survey. In this paper, we focus on context learning rather than the previous multi-level feature fusion for the improvement of saliency detection.
Different from [43], which uses multiple networks, each with a pyramid pooling module [58] at the top, we propose an elegant single network. Different from [59], which uses multi-scale inputs, we use single-scale inputs to extract multi-level contexts. The resulting model is simple yet effective.

3. Approach

In this section, we elaborate our proposed framework for salient object detection. We first introduce our base network in Section 3.1. Then, we present a Mirror-linked Hourglass Network (MLHN) in Section 3.2. A detailed description of the Hierarchical Context Aggregation (HCA) module is finally provided in Section 3.3. We show the overall network architecture in Figure 2.

3.1. Base Network

To tackle salient object detection, we follow recent studies [5, 43, 16] and use fully convolutional networks. Specifically, we use the well-known VGG16 network [38] as our backbone net, whose final fully connected layers are removed to serve for image-to-image translation. Salient object detection usually requires global information to judge which objects are salient [7], so enlarging the receptive field of the network is helpful. To this end, we retain the final pooling layer as in [16] and follow [3] to transform the last two fully connected layers into convolution layers, one of which has a kernel size of 3×3 with 1024 channels and the other a kernel size of 1×1 with 1024 channels as well. Therefore, there are five pooling layers in the backbone net. They divide the convolution layers into six convolution blocks, which are denoted as $\{S_1, S_2, S_3, S_4, S_5, S_6\}$ from bottom to top, respectively. We consider $S_6$ as the top valve that controls the overall contextual information flow in the network. The resolution of the feature maps in each convolution block is half that of the preceding one. Following [16, 48], the side-output of each convolution block means the connection from the last layer of this block.

3.2. Mirror-linked Hourglass Network

Based on the backbone net, we build a Mirror-linked Hourglass Network (MLHN). An overview of MLHN is displayed in Figure 2. More concretely, we upsample the convolution block $S_6$ by a factor of two and connect a 1×1 convolution layer (without non-linearity) after $S_5$. The resulting two feature maps are fused using element-wise summation. For the upsampling, the side-output of $S_6$ is first connected to a 1×1 convolution layer (without non-linearity), which is followed by a deconvolution layer. This deconvolution upsamples a feature map by a factor of two using bilinear interpolation. A crop operation is performed to ensure the upsampled feature map of $S_6$ has the same size as the feature map of $S_5$. To convert the fused feature map into contextual information, two sequential convolution layers are then connected to obtain the contextual features $\tilde{S}_5$. These two convolution layers play the role of a transform function, which uses the contextual information of $S_6$ to guide the features of $S_5$ to generate the contexts $\tilde{S}_5$. The contextual features $\{\tilde{S}_4, \tilde{S}_3, \tilde{S}_2, \tilde{S}_1\}$ can be obtained in a similar way. For a clear presentation, this can be formulated as

$$\tilde{S}_i = \phi\big(\varphi_1(S_i) + \varphi_2(\tilde{S}_{i+1})\big), \quad i \in \{1, 2, 3, 4, 5\}, \tag{1}$$

$$\varphi_1(\cdot) = \mathrm{Conv}(\cdot), \quad \varphi_2(\cdot) = \mathrm{Crop}(\mathrm{Upsample}(\mathrm{Conv}(\cdot))), \quad \phi(\cdot) = \mathrm{ReLU}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\cdot)))),$$

where the top-side input to $\varphi_2$ at $i = 5$ is the side-output of $S_6$, as described above. A standard encoder-decoder network can be formulated as

$$\tilde{S}_i = \phi(\tilde{S}_{i+1}), \quad i \in \{1, 2, 3, 4, 5\}. \tag{2}$$

In this way, the proposed MLHN gradually flows top contextual information into lower sides, so the lower sides are expected to only emphasize the details of salient regions in an image. The two sequential convolution layers (orange box in Figure 2) have kernel size 5×5 for $\{\tilde{S}_5, \tilde{S}_4, \tilde{S}_3\}$ and kernel size 3×3 for $\{\tilde{S}_2, \tilde{S}_1\}$. The numbers of output channels are 512, 256, 256, 128 and 128 from $\tilde{S}_5$ to $\tilde{S}_1$, respectively.
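To make the role of the crop operation in Eq. (1) concrete, the following minimal sketch tracks only feature-map sizes, not feature values. It assumes that every pooling layer halves the resolution with ceiling rounding and that the deconvolution produces exactly twice the input size; both are simplifying assumptions, and the helper names `side_sizes` and `crop_margins` are hypothetical, not taken from the authors' code.

```python
import math

# Output channels of the decoder sides S~_5 ... S~_1, as listed in the text.
DECODER_CHANNELS = {5: 512, 4: 256, 3: 256, 2: 128, 1: 128}


def side_sizes(height, width, num_sides=6):
    """Spatial size of S_1..S_6 (index 0 is S_1).

    Assumption: each pooling halves the map and rounds up.
    """
    sizes = [(height, width)]
    for _ in range(num_sides - 1):
        h, w = sizes[-1]
        sizes.append((math.ceil(h / 2), math.ceil(w / 2)))
    return sizes


def crop_margins(sizes):
    """Rows/columns to crop after 2x upsampling side i+1 so it matches side i.

    Keys are the 1-based index i of the lower side, visited from top to bottom
    (i = 5, 4, ..., 1), mirroring the high-to-low order of Eq. (1).
    """
    margins = {}
    for i in range(len(sizes) - 1, 0, -1):
        up_h, up_w = 2 * sizes[i][0], 2 * sizes[i][1]  # idealized 2x upsampling
        h, w = sizes[i - 1]                            # target size of side i
        margins[i] = (up_h - h, up_w - w)
    return margins


if __name__ == "__main__":
    sizes = side_sizes(321, 481)
    for i, margin in crop_margins(sizes).items():
        print(f"side {i}: size {sizes[i - 1]}, crop {margin}, "
              f"channels {DECODER_CHANNELS[i]}")
```

Under these assumptions, an input whose sides are not divisible by 32 leaves a small positive margin at some sides, which is why a crop is needed before the element-wise sum.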
On the one hand, the encoded features in the base network are connected to the decoder part in a mirror-linked way. On the other hand, the proposed network is symmetric with $S_6$ as its center, just like an hourglass. Hence we call our network the Mirror-linked Hourglass Network (MLHN).

3.3. Hierarchical Context Aggregation

Intuitively, the proposed MLHN should be optimized from the top sides to the bottom sides, because the global contextual information is contained in the top sides and flows to the bottom sides gradually. Therefore, unlike previous encoder-decoder networks [37, 31] that impose supervision at the final layer of the decoder, we adopt supervision at all context learning stages, i.e. $\{\tilde{S}_6, \tilde{S}_5, \tilde{S}_4, \tilde{S}_3, \tilde{S}_2, \tilde{S}_1\}$, through a Hierarchical Context Aggregation (HCA) module. The HCA module is shown in Figure 3.

Figure 3. Hierarchical Context Aggregation (HCA) module used in our proposed network. Legend: Side 1 to Side 6, Contexts, 2 Conv, 7×7 Conv, 3×3 Conv, 1×1 Conv, DeConv. All sides of the backbone have intermediate supervision to ensure that the optimization is performed from high sides to lower sides, so that every side can learn the contextual information. The hierarchical contexts from all sides are concatenated for final saliency map prediction.

The side-output of each decoder side is first connected to two convolution layers, which have a kernel size of 7×7 for $\tilde{S}_6$, 5×5 for $\{\tilde{S}_5, \tilde{S}_4, \tilde{S}_3\}$ and 3×3 for $\{\tilde{S}_2, \tilde{S}_1\}$. Their numbers of channels are 512, 512, 256, 256, 128 and 128, respectively. Then, we add a 3×3 convolution layer without non-linearity to decrease the number of channels to 25 for all sides. The 25-channel map is the context map at each side. A deconvolution layer with a fixed bilinear kernel is employed to upsample the context map to the size of the original image.
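The per-side layer configuration described above can be written down as a small lookup, which also makes the size of the concatenated context map explicit. This is only a bookkeeping sketch: the kernel sizes and channel counts are taken from the text, while the input channel count of side 6 (1024, the width of the top backbone block), the presence of bias terms, and the helper names are assumptions made for illustration.

```python
# Channels entering the HCA module at each side: decoder outputs for sides 1-5
# (from the text) and, as an assumption, the 1024-channel top block for side 6.
SIDE_INPUT_CHANNELS = {6: 1024, 5: 512, 4: 256, 3: 256, 2: 128, 1: 128}
# Kernel size and width of the two convolution layers at each side (from the text).
HCA_KERNEL = {6: 7, 5: 5, 4: 5, 3: 5, 2: 3, 1: 3}
HCA_CHANNELS = {6: 512, 5: 512, 4: 256, 3: 256, 2: 128, 1: 128}
CONTEXT_CHANNELS = 25  # every side is reduced to a 25-channel context map


def hca_side_layers(side):
    """Layers at one side as (kernel, in_channels, out_channels, has_relu)."""
    k, c = HCA_KERNEL[side], HCA_CHANNELS[side]
    return [
        (k, SIDE_INPUT_CHANNELS[side], c, True),   # first conv + ReLU
        (k, c, c, True),                           # second conv + ReLU
        (3, c, CONTEXT_CHANNELS, False),           # linear 3x3 conv to 25 channels
    ]


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one convolution layer (bias terms are assumed)."""
    return kernel * kernel * c_in * c_out + c_out


def aggregated_channels():
    """Channels after concatenating the upsampled context maps of all sides."""
    return CONTEXT_CHANNELS * len(HCA_KERNEL)


if __name__ == "__main__":
    for side in sorted(HCA_KERNEL, reverse=True):
        total = sum(conv_params(k, i, o) for k, i, o, _ in hca_side_layers(side))
        print(f"side {side}: {total} parameters")
    print("concatenated context channels:", aggregated_channels())
```

With six sides of 25 channels each, the concatenated context map has 150 channels at full image resolution, which is consistent with the paper's remark that large kernels on the aggregated map are costly.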
To better understand this process, we formulate it as

$$C_i = \mathrm{Crop}(\mathrm{Upsample}(\omega(\psi(\tilde{S}_i)))), \quad i \in \{1, 2, 3, 4, 5, 6\}, \tag{3}$$

$$\omega(\cdot) = \mathrm{Conv}(\cdot), \quad \psi(\cdot) = \mathrm{ReLU}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\cdot)))),$$

in which $\omega(\cdot)$ is a linear transformation for channel reorganization and $\psi(\cdot)$ transforms the fused features at each stage into contexts at various scales. The saliency prediction map can be obtained by simply adding a 1×1 convolution (without non-linearity). We put the intermediate supervision here for each side to help the top sides be optimized first. The upsampled context maps ($C_i$, $i = 1, 2, \ldots, 6$) for all sides are aggregated using a standard concatenation. A 7×7 convolution and a 3×3 convolution follow to further fuse the hierarchical contexts for the final high-quality prediction of saliency maps. We empirically find that large kernel sizes are slightly helpful here, but they also slow the network down because the aggregated context map has the size of the original image. Therefore, we do not use two 7×7 convolutions or larger kernel sizes.

Figure 4. Illustration of different multi-scale deep learning architectures: (a) hyper feature learning; (b) FCN style; (c) HED style; (d) DSS style; (e) encoder-decoder networks; (f) our HCA network. Legend: Hidden Layer, Loss Layer, Connection. The connections in the above figures can be any network configuration, e.g. any type of CNN layer or combinations of them.

The essential function of HCA lies in three aspects. Firstly, the intermediate supervision of HCA helps MLHN be optimized from top to bottom, so that the global contextual information at top sides flows to bottom sides gradually. Secondly, the added convolution layers encourage each side to generate contexts at the corresponding scale. Thirdly, the hierarchical contexts at all sides are aggregated for final saliency map prediction, unlike previous methods [16, 48, 30] that compute final results by fusing the results of various side-outputs.

4. Architectural Analyses

Due to the multi-scale and multi-level nature of learning in deep neural networks, a large number of architectures have emerged that are designed to utilize hierarchical deep features. For example, multi-scale learning can use skip-layer connections [13, 31], which are widely accepted owing to their strong capability to fuse hierarchical deep features inside the networks. On the other hand, multi-scale learning can use encoder-decoder networks that progressively decode the hierarchical deep representation learned in the encoder backbone net. We have seen these two structures applied in various vision tasks.
We continue our discussion by briefly categorizing multi-scale deep learning inside networks into five classes: hyper feature learning, FCN style, HED style, DSS style and encoder-decoder networks. An overall illustration of them is given in Figure 4. The following discussion will clearly show the differences between our proposed HCA network and previous efforts on multi-scale learning.

Figure 5. Side loss at the first 2000 training iterations (x-axis: #Iteration; y-axis: Loss; curves: loss-side1 to loss-side6 and loss-fuse). At the beginning, the loss of the top sides drops quickly, but the bottom sides manage to reach a smaller loss in the end.

Hyper feature learning: Hyper feature learning [13] is the most intuitive way to pursue multi-scale information, as illustrated in Figure 4(a). Examples of this structure for saliency include [24, 51, 5, 43, 27]. These models concatenate or sum multi-scale deep features from multiple levels of backbone nets [24, 51] or branches of multi-stream nets [5, 43, 27]. The fused hyper features are then used for final predictions.

FCN style: Since the top sides of neural networks usually contain more reliable semantic information, a reasonable revision of hyper feature learning is to progressively fuse deep features from upper layers to lower layers [31, 37], as shown in Figure 4(b). The top semantic features are combined with bottom low-level features to capture fine-grained details. The feature fusion can be a simple element-wise summation [31], a simple feature map concatenation (U-Net) [37], or more complex designs based on them. Most recent saliency models fall into this category [57, 45, 54, 29, 44, 17, 55]. They differ from each other by applying different fusion strategies. One notable similarity of these models is that the final prediction is produced using the fused feature maps at the largest scale. Hence the final fused features are expected to learn both global semantic information and local low-level details. To better achieve this goal, recent state-of-the-art models have designed very complex fusion strategies [29, 44, 4].

HED style: HED-like networks [48, 30] add deep supervision at the intermediate sides to perform predictions, and the final result is a combination of the predictions at all sides (shown in Figure 4(c)). Unlike multi-scale feature fusion, HED performs multi-scale prediction fusion. Chen et al. [4] followed this style to perform saliency detection.

DSS style: The DSS network [16] is an extension of the HED architecture. The side-output of each network side is fused with side-outputs from some of the upper sides. For each side, the upper sides to fuse with are carefully selected by experiments. The difference between HED and DSS can be clearly seen in Figure 4(d).

Encoder-decoder networks: To benefit from the powerful representation capability of deep networks, one can also decode the high-level representation at the top layers [35], as displayed in Figure 4(e). The decoder gradually enlarges its resolution to decode local information from upper layers.

HCA network: We show a streamlined diagram of our proposed HCA network in Figure 4(f). Its left part looks a bit like an FCN (Figure 4(b)) or an encoder-decoder network (Figure 4(e)) with parallel connections. Unlike the FCN and encoder-decoder nets, which perform predictions using the final fused hybrid features, our HCA network aggregates hierarchical contexts to perform predictions. The contexts are learned in a high-to-low manner through the proposed HCA module, so that the top sides, which are optimized first, can generate global contextual information to guide lower layers to produce scale-specific contexts. We demonstrate this high-to-low optimization in Figure 5, which includes the loss curves of all sides during training. We can clearly see that $C_6$ is optimized first, and then $C_5$, $C_4$, $C_3$, $C_2$ and $C_1$ follow sequentially.
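The loss curves in Figure 5 (loss-side1 to loss-side6 plus loss-fuse) suggest a training objective that sums one loss per side with a loss on the fused prediction. The sketch below illustrates such intermediate supervision. The paper does not state the loss function or any loss weights, so the per-pixel sigmoid cross-entropy and the equal weighting used here are assumptions, and the function names are hypothetical.

```python
import math


def sigmoid_cross_entropy(logits, labels):
    """Mean binary cross-entropy between per-pixel logits and 0/1 labels.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, labels):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def deeply_supervised_loss(side_logits, fused_logits, labels):
    """Sum of the side losses and the fused loss (equal weights are assumed).

    side_logits: one flat list of logits per side, ordered side 1 .. side 6.
    Returns (total, per-side losses, fused loss).
    """
    side_losses = [sigmoid_cross_entropy(l, labels) for l in side_logits]
    fuse_loss = sigmoid_cross_entropy(fused_logits, labels)
    return sum(side_losses) + fuse_loss, side_losses, fuse_loss


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    sides = [[0.0, 0.0, 0.0, 0.0] for _ in range(6)]  # untrained: all logits zero
    total, per_side, fuse = deeply_supervised_loss(sides, [2.0, -2.0, 2.0, -2.0], labels)
    print(total, per_side, fuse)
```

Logging the per-side terms separately, as the second return value allows, is what makes a plot like Figure 5 possible.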
Without carefully designed feature fusion strategies [29, 55, 44, 4], the simple HCA can learn high-quality contexts for accurate salient object detection.

5. Experiments

5.1. Experimental Setup

Implementation Details. We implement the proposed network using the well-known Caffe [18] framework. The convolution layers contained in the original VGG16 [38] are initialized using the publicly available pretrained ImageNet model [10]. The weights of the other layers are initialized from a zero-mean Gaussian distribution with standard deviation [value missing]. The upsampling operations are implemented by deconvolution layers with bilinear interpolation kernels, which are frozen during training. The network is optimized using SGD with the poly learning rate policy, in which the current learning rate equals the base one multiplied by $(1 - \mathrm{curr\_iter}/\mathrm{max\_iter})^{\mathrm{power}}$. The hyperparameters power and max_iter are set to 0.9 and 20000, respectively, so that training runs for max_iter iterations in total. The initial learning rate is set to 1e-7. The momentum and weight decay are set to 0.9 and [value missing], respectively. All the experiments in this paper are performed on a TITAN Xp GPU.

Datasets. We extensively evaluate our method on six popular datasets, including DUTS [41], ECSSD [49], SOD [34], HKU-IS [23], THUR15K [6] and DUT-OMRON [50]. These six datasets consist of 15572, 1000, 300, 4447, 6232 and 5168 natural complex images, respectively, with corresponding pixel-wise ground truth labeling. Among them, the DUTS dataset [41] is a recently released challenging dataset consisting of 10553 training images and 5019 test images in very complex scenarios. For fair comparison, we follow recent studies [44, 29, 43, 51] and use the DUTS training set for training, and test on the DUTS test set and the other five datasets.

Evaluation Criteria. We utilize two evaluation metrics to evaluate our method as well as other state-of-the-art salient object detectors: the max F-measure score and the mean absolute error (MAE).
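As a rough illustration of these two metrics, whose formal definitions are given in Eqs. (4) and (5), the following pure-Python sketch computes the max F-measure from dataset-averaged precision/recall pairs and the MAE of a single map. The 256 evenly spaced thresholds, the convention of returning zero when a denominator vanishes, and all function names are assumptions made for this sketch; it is not the evaluation code used in the paper.

```python
def precision_recall(saliency, gt, threshold):
    """Precision and recall of one binarized saliency map (2-D lists in [0, 1])."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, gt):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            if predicted and g:
                tp += 1
            elif predicted and not g:
                fp += 1
            elif g:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """F-measure with beta^2 = 0.3, which weights precision more than recall."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def max_f_measure(pairs, num_thresholds=256, beta_sq=0.3):
    """Max F over thresholds, using precision/recall averaged over all images."""
    best = 0.0
    for t in range(num_thresholds):
        threshold = t / (num_thresholds - 1)
        prs = [precision_recall(s, g, threshold) for s, g in pairs]
        mean_p = sum(p for p, _ in prs) / len(prs)
        mean_r = sum(r for _, r in prs) / len(prs)
        best = max(best, f_measure(mean_p, mean_r, beta_sq))
    return best


def mae(saliency, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    h, w = len(saliency), len(saliency[0])
    total = sum(abs(s - g) for s_row, g_row in zip(saliency, gt)
                for s, g in zip(s_row, g_row))
    return total / (h * w)


if __name__ == "__main__":
    pred = [[0.9, 0.1], [0.8, 0.2]]
    truth = [[1, 0], [1, 0]]
    print(max_f_measure([(pred, truth)]), mae(pred, truth))
```

Averaging precision and recall over the dataset before taking the maximum follows the order of operations described in the text; averaging per-image F-scores instead would generally give a different number.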
Given a predicted saliency map with continuous probability values, we can convert it into binary maps with arbitrary thresholds and compute the corresponding precision/recall values. Taking the average of the precision/recall values over all images in a dataset, we get many mean precision/recall pairs. The F-measure score is an overall performance indicator:

$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}}, \tag{4}$$

in which $\beta^2$ is usually set to 0.3 to put more emphasis on precision. We follow recent studies [32, 16, 55, 56, 29, 25, 4] and report the max $F_\beta$ across different thresholds. Given a saliency map $S$ and the corresponding ground truth $G$, both normalized to [0, 1], MAE can be calculated as

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| S(i, j) - G(i, j) \right|, \tag{5}$$

where $H$ and $W$ represent the height and width, respectively. $S(i, j)$ denotes the saliency score at location $(i, j)$, and similarly for $G(i, j)$.

5.2. Performance Comparison

We compare our proposed salient object detector with 16 recent state-of-the-art saliency models, including DRFI [19], MDF [23], LEGS [40], DCL [24], DHS [26], ELD [21], RFCN [42], NLDF [32], DSS [16], SRM [43], Amulet [55], UCF [56], BRN [44], PiCA [29], C2S [25] and RAS [4]. Among them, DRFI [19] is the state-of-the-art non-deep-learning based method, and the other 15 models are all based on deep learning. We do not report MDF [23] results on the HKU-IS [23] dataset because MDF uses a part of HKU-IS for training. For the same reason, we do not report DHS [26] results on DUT-OMRON [50]. For fair comparison, all these models are tested using the publicly available code and pretrained models released by the authors with default settings. We also report the results of the ResNet-101 [14] version of our proposed HCA. Since ResNet is deep enough to capture global contexts, we exclude the sixth side ($\tilde{S}_6$) in HCA.

Table 1. Comparison of the proposed HCA and 16 competitors in terms of the metrics $F_\beta$ and MAE on six datasets (DUTS-test, ECSSD, SOD, HKU-IS, DUT-OMRON, THUR15K). We report results on both the VGG16 [38] backbone and the ResNet [14] backbone. The top three models in each column are highlighted in red, green and blue, respectively. For ResNet based methods, we only highlight the top performance.
Rows, non-deep learning: DRFI [19].
Rows, VGG16 [38] backbone: MDF [23], LEGS [40], DCL [24], DHS [26], ELD [21], RFCN [42], NLDF [32], DSS [16], Amulet [55], UCF [56], PiCA [29], C2S [25], RAS [4], HCA (ours).
Rows, ResNet [14] backbone: SRM [43], BRN [44], PiCA [29], HCA (ours).
[Numeric entries of Table 1 are missing from this copy. Surviving fragments: DRFI [19] 0.649 0.154 0.777; HCA (ours) 0.707 0.652.]

Figure 6. Qualitative comparison of HCA and 13 state-of-the-art methods. Columns: Image, DRFI, MDF, DCL, DHS, RFCN, DSS, SRM, Amulet, UCF, BRN, PiCA, C2S, RAS, Ours, GT.

Table 2. Experimental settings of ablation studies. Each entry is (channels, kernel size) × number of convolution layers. * means the default settings used in this paper. The Module column indicates which module is changed, while the other module keeps its default settings.

No. | Module | Side 1 | Side 2 | Side 3 | Side 4 | Side 5 | Side 6
1 | MLHN | (128, 3×3) ×1 | (128, 3×3) ×1 | (256, 5×5) ×1 | (256, 5×5) ×1 | (512, 5×5) ×1 | -
2 | HCA | (128, 3×3) ×1 | (128, 3×3) ×1 | (256, 5×5) ×1 | (256, 5×5) ×1 | (512, 5×5) ×1 | (512, 7×7) ×1
3 | MLHN | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 3×3) ×2 | (256, 3×3) ×2 | (512, 3×3) ×2 | -
4 | MLHN | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 3×3) ×2 | (256, 3×3) ×2 | (512, 3×3) ×2 | (512, 7×7) ×2
5 | HCA | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 5×5) ×2 | (256, 5×5) ×2 | (512, 5×5) ×2 | (512, 5×5) ×2
6 | MLHN | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 5×5) ×2 | (512, 5×5) ×2 | (1024, 5×5) ×2 | -
7 | HCA | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 5×5) ×2 | (512, 5×5) ×2 | (1024, 5×5) ×2 | (1024, 7×7) ×2
* | MLHN | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 5×5) ×2 | (256, 5×5) ×2 | (512, 5×5) ×2 | -
* | HCA | (128, 3×3) ×2 | (128, 3×3) ×2 | (256, 5×5) ×2 | (256, 5×5) ×2 | (512, 5×5) ×2 | (512, 7×7) ×2

Table 3. Evaluation results of ablation studies on DUTS-test, ECSSD, SOD, HKU-IS, DUT-OMRON and THUR15K. See Table 2 for the experimental settings with corresponding numbers. [Numeric entries of Table 3 are missing from this copy.]

Table 1 summarizes the numeric comparison in terms of $F_\beta$ and MAE on six datasets. HCA significantly outperforms the other competitors in most cases, which demonstrates its effectiveness. With the VGG16 [38] backbone, the $F_\beta$ values of HCA are 2.1%, 1.0%, 0.9%, 1.1%, 0.6% and 0.5% higher than those of the second best method on the DUTS, ECSSD, SOD, HKU-IS, DUT-OMRON and THUR15K datasets, respectively. On the SOD dataset, in terms of the MAE metric, HCA performs slightly worse than the best result. PiCA [29] appears to take second place overall. With the ResNet backbone, the performance gap between the proposed HCA and the other ResNet based competitors is much larger than with the VGG16 backbone net.
Specifically, the $F_\beta$ values of HCA are 2.2%, 1.3%, 1.3%, 1.7%, 3.0% and 0.8% higher than those of the second best method on the six datasets, respectively. We also provide a qualitative comparison in Figure 6. For objects with various shapes and scales, HCA segments the entire objects well, with fine details (rows 1-2). HCA is also robust to complicated backgrounds (rows 3-5), multiple objects (rows 6-7) and confusing stuff (row 8).

5.3. Ablation Studies

To evaluate the influence of various design choices of MLHN and HCA (the 2 Conv blocks in Figure 2 and Figure 3), we perform seven ablation studies with the VGG16 backbone. The detailed experimental settings and the corresponding evaluation results are shown in Table 2 and Table 3, respectively. We observe that our proposed method is not sensitive to different parameter settings, and the default design achieves slightly better results. These ablation studies also reveal some interesting phenomena. For example, experiment #5 suggests that a larger convolution kernel at the sixth side helps to obtain accurate global contexts. Experiments #6 and #7 demonstrate that introducing more convolution channels does not improve performance. Interestingly, we observe that the default convolution parameter settings are similar to those of DSS [16], although we have a different network architecture (see Section 4). Perhaps this is due to the intrinsic properties of the backbone nets.

6. Conclusion

Salient object detection is highly related to global contextual information, which can be used to judge which parts of an image are salient. Motivated by this, we propose a simple yet effective method in this paper. Our method starts from the top sides of neural networks and gradually flows the top global contexts into lower sides to obtain hierarchical contexts. These hierarchical contexts are aggregated for the final salient object detection. Our method reaches a new state of the art on six datasets when compared with 16 recent saliency models.
In the future, we plan to apply the proposed network architecture to other vision tasks that need global information.
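The high-to-low flow summarized in this conclusion can be illustrated with a toy sketch. It is only an illustration under strong assumptions, not the HCA network: there are no convolutions, no learning and no intermediate supervision, and the helper names (`upsample2x`, `high_to_low_contexts`, `aggregate`), the element-wise addition, and the final averaging are hypothetical stand-ins chosen to show the direction of information flow. Each side is assumed to have half the resolution of the side below it.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def high_to_low_contexts(sides):
    """Propagate context from the top side down to the lowest side.

    sides[0] is the lowest (finest) side, sides[-1] the top (coarsest).
    """
    contexts = [None] * len(sides)
    contexts[-1] = sides[-1]                       # start from the top side
    for i in range(len(sides) - 2, -1, -1):        # flow context downwards
        contexts[i] = add(sides[i], upsample2x(contexts[i + 1]))
    return contexts


def aggregate(contexts):
    """Upsample every context to the finest resolution and average them."""
    target = len(contexts[0])
    total = None
    for ctx in contexts:
        while len(ctx) < target:
            ctx = upsample2x(ctx)
        total = ctx if total is None else add(total, ctx)
    return [[v / len(contexts) for v in row] for row in total]


# Three hypothetical sides: 4x4, 2x2 and 1x1 (top).
sides = [
    [[0.0] * 4 for _ in range(4)],
    [[1.0, 0.0], [0.0, 0.0]],
    [[2.0]],
]
contexts = high_to_low_contexts(sides)
hybrid = aggregate(contexts)
```

Here the single value at the top side reaches every position of the lower sides, while the local response at the middle side only affects the top-left region of the aggregated map, which mirrors the idea that global context is injected from the top and refined on the way down.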

References

[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In IEEE CVPR.
[2] A. Borji, M.-M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE TIP, 12(24).
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 40(4).
[4] S. Chen, X. Tan, B. Wang, and X. Hu. Reverse attention for salient object detection. In ECCV.
[5] X. Chen, A. Zheng, J. Li, and F. Lu. Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs. In IEEE ICCV.
[6] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu. Salientshape: Group saliency in image collections. The Visual Computer, 30(4).
[7] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3).
[8] R. Cong, J. Lei, H. Fu, M.-M. Cheng, W. Lin, and Q. Huang. Review of visual saliency detection with comprehensive information. IEEE TCSVT.
[9] H. Zhang, J. Xue, and K. Dana. Deep TEN: Texture encoding network. In IEEE CVPR.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR.
[11] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu. Visual-textual joint relevance learning for tag-based social image search. IEEE TIP, 22(1).
[12] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu. Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1):84-100.
[13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE CVPR.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR.
[15] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau. Delving into salient object subitizing and detection. In IEEE ICCV.
[16] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. In IEEE CVPR.
[17] M. A. Islam, M. Kalash, and N. D. Bruce. Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects. In IEEE CVPR.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM.
[19] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In IEEE CVPR.
[20] Z. Jiang and L. S. Davis. Submodular salient region detection. In IEEE CVPR.
[21] G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In IEEE CVPR.
[22] G. Li, Y. Xie, L. Lin, and Y. Yu. Instance-level salient object segmentation. In IEEE CVPR.
[23] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In IEEE CVPR.
[24] G. Li and Y. Yu. Deep contrast learning for salient object detection. In IEEE CVPR.
[25] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen. Contour knowledge transfer for salient object detection. In ECCV.
[26] N. Liu and J. Han. DHSNet: Deep hierarchical saliency network for salient object detection. In IEEE CVPR.
[27] N. Liu and J. Han. A deep spatial contextual long-term recurrent convolutional network for saliency detection. IEEE TIP, 27(7).
[28] N. Liu, J. Han, T. Liu, and X. Li. Learning to predict eye fixations via multiresolution convolutional neural networks. IEEE TNNLS, 29(2).
[29] N. Liu, J. Han, and M.-H. Yang. PiCANet: Learning pixelwise contextual attention for saliency detection. In IEEE CVPR.
[30] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. In IEEE CVPR.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE CVPR.
[32] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P.-M. Jodoin. Non-local deep features for salient object detection. In IEEE CVPR.
[33] V. Mahadevan and N. Vasconcelos. Saliency-based discriminant tracking. In IEEE CVPR.
[34] V. Movahedi and J. H. Elder. Design and perceptual validation of performance measures for salient object segmentation. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 49-56.
[35] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In IEEE ICCV.
[36] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang. Region-based saliency detection and its application in object recognition. IEEE TCSVT, 24(5).
[37] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR.
[39] N. Tong, H. Lu, X. Ruan, and M.-H. Yang. Salient object detection via bootstrap learning. In IEEE CVPR.
[40] L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In IEEE CVPR.
[41] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan. Learning to detect salient objects with image-level supervision. In IEEE CVPR.
[42] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV.
[43] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu. A stagewise refinement model for detecting salient objects in images. In IEEE ICCV.
[44] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji. Detect globally, refine locally: A novel approach to saliency detection. In IEEE CVPR.
[45] W. Wang, J. Shen, X. Dong, and A. Borji. Salient object detection driven by fixation prediction. In IEEE CVPR.
[46] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR.
[47] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation. In IEEE CVPR.
[48] S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE ICCV.
[49] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In IEEE CVPR.
[50] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In IEEE CVPR.
[51] Y. Zeng, H. Lu, L. Zhang, M. Feng, and A. Borji. Learning to promote saliency detectors. In IEEE CVPR.
[52] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In IEEE CVPR.
[53] J. Zhang, T. Zhang, Y. Dai, M. Harandi, and R. Hartley. Deep unsupervised saliency detection: A multiple noisy labeling perspective. In IEEE CVPR.
[54] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang. A bi-directional message passing model for salient object detection. In IEEE CVPR.
[55] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In IEEE ICCV.
[56] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin. Learning uncertain convolutional features for accurate saliency detection. In IEEE ICCV.
[57] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang. Progressive attention guided recurrent network for salient object detection. In IEEE CVPR.
[58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE CVPR.
[59] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In IEEE CVPR.
[60] W. Zhu, S. Liang, Y. Wei, and J. Sun. Saliency optimization from robust background detection. In IEEE CVPR.
[61] F. Zund, Y. Pritch, A. Sorkine-Hornung, S. Mangold, and T. Gross. Content-aware compression using saliency-driven image retargeting. In ICIP.
