Saliency Inspired Modeling of Packet-loss Visibility in Decoded Videos
Tao Liu*, Xin Feng**, Amy Reibman***, and Yao Wang*
*Polytechnic Institute of New York University, Brooklyn, NY, U.S.
**Chongqing University, Chongqing, P.R. China
***AT&T Labs-Research, Florham Park, NJ, U.S.

Abstract

The visibility of packet loss in decoded video depends on various factors and their complicated interactions, such as the severity and duration of the loss and the characteristics of the background signal. Visual attention, or saliency, may play an important role as well. In this work, we investigate how to improve visibility prediction by incorporating saliency information. Based on earlier findings about how saliency affects the perceptual quality of video with packet losses, we propose several saliency-based factors and incorporate them into a Generalized Linear Model (GLM) to predict loss visibility. Test results with 1080 MPEG-2 packet losses indicate that saliency information can improve prediction accuracy by about 12% over a non-saliency-based model, and that saliency-weighted mean squared error and the variation of the saliency map are promising metrics.

Index Terms: Packet loss visibility, perceptual video quality, saliency information, GLM

I. INTRODUCTION

In video transmission, the decoder may not receive all of the encoded video data because of losses occurring in packet networks. Different packet losses generally do not cause equal quality degradation, and some may not be visible to human viewers at all. Assessing the impact of packet loss is nevertheless crucial for network providers, who must monitor and control their networks to guarantee satisfactory video quality for end users. Because directly assessing the perceptual quality of video affected by packet loss is very difficult, investigating the visibility of packet losses is a natural and appropriate starting point.

In the literature, several researchers have reported findings on the visibility of packet loss in video transmission. The work in [1][7] accurately modeled the visibility of a single packet loss with a Generalized Linear Model (GLM) using various factors, such as signal error, video motion, and loss position. The authors of [8] extended this approach to predict the visibility of multiple losses. The work in [3] introduced the intuitive concept of Mean Time Between Failures (MTBF), by which video quality can be interpreted as the probability of visible artifacts, and estimated it with several existing quality metrics. In [9], the visibility of a single packet loss in compressed video is evaluated by the loss severity (the PSNR drop) caused by the lost frame and the duration of its propagated error.

In this work, we explore whether saliency information can improve the prediction of packet-loss visibility. Saliency, an important component of the Human Visual System (HVS), describes features of an image (for example, color, intensity, and orientation) that cause one region to stand out relative to other regions [6]. Because of both physiological and psychological evidence that humans are highly selective about the visual information they attend to [6], there has been recent interest in designing image and video quality metrics that weight signal errors by their visual sensitivities [4][5][10]. Some evidence suggests that incorporating saliency during spatial pooling may not always improve quality metrics [5]. However, in this paper, we demonstrate that adding saliency information to the problem of predicting the visibility of impairments due to packet losses can dramatically improve prediction performance. We focus here on individual packet losses in high-quality video that has few other visual artifacts.

Specifically, we still use the GLM method to predict loss visibility, but introduce new factors that incorporate saliency information. Motivated by an earlier finding [10] that weighting pixel-wise errors by the visual saliency map predicts perceptual quality better than using unweighted errors, we propose to use the saliency-weighted error as a factor. We also investigate factors summarizing the saliency map itself, including how the saliency map changes when there is a packet loss and the temporal variation of the saliency map.

We explore packet-loss visibility using the subjective data in [1]. We first design a GLM with previously identified non-saliency factors as a benchmark, and then incorporate the aforementioned saliency-based factors into the GLM. We compute the saliency map using the computational visual attention model developed by Itti et al. [6]. Our tests show that both the saliency-weighted error and changes in the saliency map are good predictors of loss visibility, and that they can significantly improve the prediction accuracy of the GLM over one using only non-saliency factors.

The remainder of this paper is organized as follows. All factors we consider, both with and without saliency, are discussed in Section II. The subjective data is described briefly in Section III. In Section IV, we design two GLMs, with and without saliency factors, and analyze and compare their performance. We conclude in Section V.
II. FACTORS AFFECTING VISIBILITY

A. Non-saliency factors

In [1][7], a total of 20 non-saliency factors (and their variations) were proposed, covering both the characteristics of the video and of the packet-loss impairment; scene-level interactions between them were considered as well. Here we only briefly discuss each category; refer to [1][7] for more detailed descriptions.

(a) Error characteristics: Mean Squared Error (MSE) and the Structural Similarity Index (SSIM) are two widely accepted quality metrics. In the loss-visibility scenario, for ease of calculation, two simplified variations of each metric are used to predict the visibility of a packet loss: the measurements on the initial frame of the loss-affected segment, IMSE and ISSIM, and their extreme values at the macroblock level, MaxIMSEmb and MinISSIMmb. When a single slice is lost (as opposed to an entire frame), the impact on video quality of the discontinuities caused by lost slices can be measured by the Slice-Boundary Mismatch (SBM), first proposed in [13] and modified in minor details in [7]. Only the SBM on the initial frame of the loss-affected segment, ISBM, is considered. Additionally, some important content-independent measures, such as the spatial extent SXTNT (the number of slices lost in one frame), HGT (the average height of the lost slices), and Duration (the duration of the loss-affected segment), are also considered.

(b) Video characteristics: Motion is one of the most important characteristics of video. Therefore, the mean and variance of the motion-vector magnitudes across all macroblocks initially affected by a loss, MotMean and MotVar, can also be used to predict the visibility of a loss. SigMean and SigVar, the mean and variance of the intensity values of the initial frame of the loss-affected segment, and ResidEng, the residual energy after motion compensation of that frame, are also effective.
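Several of the factors in categories (a) and (b) are simple first-order statistics of the decoded frames. The following is a minimal sketch (our own illustration, not the authors' implementation; exact normalizations may differ) of IMSE, MaxIMSEmb, SigMean, and SigVar:

```python
# Illustrative sketch of a few per-frame factors: IMSE and MaxIMSEmb on
# the initial loss-affected frame, plus SigMean and SigVar.
# Frames are 2-D lists of luma values; macroblocks are 16x16.

def mse(a, b):
    """Mean squared error between two equally sized 2-D blocks."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / n

def block(frame, top, left, size=16):
    """Extract one macroblock from a frame."""
    return [row[left:left + size] for row in frame[top:top + size]]

def error_factors(orig, decoded, mb=16):
    """IMSE (frame-level MSE) and MaxIMSEmb (worst macroblock MSE)."""
    h, w = len(orig), len(orig[0])
    imse = mse(orig, decoded)
    max_imse_mb = max(mse(block(orig, y, x, mb), block(decoded, y, x, mb))
                      for y in range(0, h, mb) for x in range(0, w, mb))
    return imse, max_imse_mb

def signal_factors(frame):
    """SigMean and SigVar: mean and variance of the frame intensities."""
    vals = [v for row in frame for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var
```

For example, a frame in which one 16x16 macroblock is off by 4 in every pixel yields a small IMSE but a MaxIMSEmb of 16, which is why the macroblock-level extremes carry information the frame average washes out.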
(c) Scene-level characteristics: In addition to the factors in (a) and (b), some high-level characteristics are also considered. The authors of [7] show that the relative position between a scene change and a packet-loss impairment strongly influences its visibility. Therefore, D2R (the distance between the current frame, with packet loss, and the reference frame used for concealment; FarConceal is set when D2R is 3 or greater), DistFromCut (the distance in time between the first frame affected by the packet loss and the nearest scene cut, either before or after), and its thresholded versions AtScene, BeforeScene, and AfterScene are considered in this work.

Since these non-saliency factors have proven capable of predicting loss visibility, we use them as candidate factors when designing a GLM.

B. Saliency-based factors

The factors in the previous subsection cover many attributes of a packet-loss impairment, but they are only heuristically linked to properties of the HVS. In this subsection we therefore investigate factors associated with saliency. We adopt the widely accepted saliency-based visual attention model (SVAM) proposed by Itti et al. [6] to calculate the saliency information of our test videos. Itti's SVAM decomposes an input image into a set of multiscale feature maps of color, intensity, and orientation. Using a center-surround mechanism, SVAM simulates which locations in the image will automatically and unconsciously attract visual attention. Fig. 1 shows the saliency maps of original and loss-impaired video frames (for demonstration purposes only, not from the actual test video).

Figure 1. Saliency maps of original and loss-affected video frames.

We consider two basic attributes of the computed saliency information for their ability to predict loss visibility.
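The two attributes just mentioned reduce to simple statistics over the saliency maps. Below is an illustrative sketch (our own code, not the authors'; saliency maps are assumed to be nested lists indexed [t][y][x]) mirroring the SMSE and STV definitions of Eqs. (1)-(3):

```python
# Sketch of the two saliency-map attributes: SMSE, the mean over time of
# the per-frame MSE between original and loss-impaired saliency maps,
# and STV, the temporal standard deviation of the mean saliency SM2(t)
# of the loss-impaired frames. Maps are nested lists indexed [t][y][x].

def frame_mean(frame):
    """2-D mean operator E_(x,y) over one frame."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))

def smse(s1, s2):
    """Mean over time of the per-frame saliency-map MSE."""
    per_frame = [
        frame_mean([[(a - b) ** 2 for a, b in zip(r1, r2)]
                    for r1, r2 in zip(f1, f2)])
        for f1, f2 in zip(s1, s2)]
    return sum(per_frame) / len(per_frame)

def stv(s2):
    """Temporal standard deviation of the mean saliency SM2(t)."""
    sm2 = [frame_mean(f) for f in s2]
    mu = sum(sm2) / len(sm2)
    return (sum((v - mu) ** 2 for v in sm2) / len(sm2)) ** 0.5
```

Here s1 and s2 stand for the saliency-map sequences of the original and loss-impaired segments; in practice the maps would come from a saliency model such as Itti's SVAM.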
First, inspired by the result in our previous work [9] that weighting the pixel-wise errors by the saliency values correlates well with the perceptual quality of loss-impaired videos, we propose to supplement the IMSE factor with a saliency-weighted IMSE, denoted IMSE_Sal. We also consider the saliency-weighted MSE computed over all loss-affected frames, yielding MSE_Sal.

Second, in addition to the impact of saliency on the pixel-wise error, we investigate properties of the saliency information itself. Examining the saliency maps of the original and loss-impaired video frames in Fig. 1, we find that packet losses not only distort the video frames but also alter the distribution of salient regions across the affected frames, both spatially and temporally. We also observe that packet losses are more visible in videos where the saliency map changes rapidly in time. These two observations lead us to propose two additional factors: SMSE, which measures the change (in terms of MSE) between the saliency maps of the original and loss-impaired frames (only at the position where the loss occurs), and STV, which measures the temporal variation of the saliency map of the loss-impaired frames. They are defined as

  SMSE = E_t[ E_(x,y)( (S_1(x,y,t) - S_2(x,y,t))^2 ) ]    (1)
  STV = STD_t( SM_2(t) )    (2)
  SM_2(t) = E_(x,y)( S_2(x,y,t) )    (3)

where E_(x,y)(.) is the 2-D mean operator averaging over all pixels in frame t, E_t(.) is the mean operator averaging over
time in the segment of the loss-impaired sequence, and STD_t(.) is the standard deviation operator over time for that segment. S_i(x,y,t) denotes the saliency value at position (x,y) in frame t, and i = 1 or 2 refers to the original or distorted video sequence, respectively.

Note that, for saliency computation, we tested two methods: one using color, orientation, and intensity information only, as in the original Itti model [6]; the other additionally using motion information, with the motion features computed following [11]. In our previous study, the second method produced saliency maps that were more consistent with our visual inspection, although it requires extra computation for saliency detection. Therefore, we focus on the latter in this work. Table I summarizes all the aforementioned factors.

Table I. LIST OF ALL NON-SALIENCY AND SALIENCY FACTORS

 1 IMSE           14 D2R
 2 MaxIMSEmb      15 SigVar
 3 ISSIM          16 DistFromCut
 4 MinSSIMmb      17 AtScene
 5 ISBM           18 BeforeScene
 6 SXTNT          19 AfterScene
 7 HGT            20 FarConceal
 8 Duration       21 IMSE_Sal (no motion)
 9 ResidEng       22 IMSE_Sal
10 CameraMotion   23 MSE_Sal
11 SigMean        24 S_MSE
12 MotMean        25 STV
13 MotVar

III. SUBJECTIVE TEST

The subjective data used in this work was first presented in [1]; to be self-contained, we describe it briefly here. The subjective test was designed not to assess the quality of video at a given packet-loss rate, but to learn what affects the visibility of impairments caused by individual packet losses. The test videos were compressed with MPEG-2 at 720x480 resolution and 30 fps, with various scene contents and camera motions, using 13-frame GOPs with 2 B-frames before every P-frame, at a bitrate of around 4 Mbps. One isolated packet loss was randomly inserted into the video in every 4-second window. The 1080 packet losses affected either one slice, two slices, or an entire frame.
The decoder applied zero-motion error concealment (copying macroblocks from the closest reference frame) when losses occurred. Each packet loss was viewed by 12 viewers, whose task was to indicate each time they saw an artifact while watching a 6-minute continuous video clip. The ground-truth visibility of each packet loss was defined as the percentage of viewers who indicated they saw the loss. For further details of this subjective test, please refer to [1].

IV. GENERALIZED LINEAR MODELS

A. GLM fitting method

As in [1][7], we model the probability of visibility using a GLM [2], which extends linear models to accommodate both non-normal response distributions and transformations to linearity in a straightforward way. We use logistic regression to fit our model, with logit() as the link function. Using the statistical software R [12], the model is fit with an iteratively re-weighted least-squares method to obtain a maximum-likelihood estimate. The GLM fitting suggested in [2] can be summarized in the following three steps:

Step 1: Before designing the GLM with multiple variables, each individual factor is analyzed first. By examining the Lowess-smoothed scatter plot between each factor and the logit of the subjective visibility, the best form (in terms of prediction error) of each factor, e.g. log(), can be determined.

Step 2: Since some of the factors are correlated, it is possible to overfit the model by using all of them together. To select factors into the model, we take a stepwise approach, adding one factor at a time. We begin with the null model, and at each step the factor that yields the maximum reduction of cross-validated prediction error is added, so that the efficiency of each inclusion is maximized. When the prediction error can no longer be reduced by adding a single factor, the model is preliminarily established.
Step 3: We check whether any interactions (products) between two selected factors can further improve the prediction of the model. If there are multiple such pairs, their inclusion is decided in the same fashion as in Step 2.

Note that we perform 10-fold cross-validation while building up the model. Specifically, we randomly divide the entire data set of 1080 losses into 10 groups of equal size and use 9 of the 10 groups as the training set; the remaining group is used for testing. We repeat this process 10 times, each time holding out a different group for testing. The average prediction error is used as the performance measure.

B. GLMs of two models

To test the impact of saliency information on modeling loss visibility, we fit the same subjective data with two different sets of factors: one containing only non-saliency-based factors (Model 1), the other containing all the aforementioned factors (Model 2). We note that our model differs from that in [7] because it uses just one subjective dataset. The factors and their coefficients for both models are summarized in Table II. To test the significance of each factor in the model, including the interaction terms added in the third step, we re-perform Step 2 with all the selected factors and interaction terms to update their inclusion order. The new order provides a ranking of the significance of each factor and interaction term, and allows the two models to be compared when each is limited to the same number of predictive factors. Bar plots of the factors and model prediction errors of this stepwise procedure, in the order of their inclusion, are shown in Fig. 2(a) and 2(b).

C. Comparison of Model 1 vs. Model 2

To compare the prediction performances of Model 1 and Model 2, we show the relationship between the number of
Figure 2. Factor inclusions of (a) Model 1 and (b) Model 2.
Table II. COEFFICIENTS OF MODEL 1 AND MODEL 2

factors used in the model and the prediction-error reduction ratio (Model 2 relative to Model 1) in Fig. 3. The overall prediction error of Model 2 is about 12% less than that of Model 1. When both models are limited to 15 factors, Model 2 still outperforms Model 1 by about 9%. Moreover, regardless of how many factors are used to fit the models, the model with saliency-based factors always outperforms the one without, except in the case of a single factor, since that factor (MinSSIMmb) is the same for both models. We therefore conclude that saliency information significantly boosts visibility-prediction performance.

To gain a clearer picture of the contribution of each individual saliency-based factor, we examine the inclusion orders (or significance ranks) of the saliency-based factors in Model 2 in Fig. 2(b). Four saliency factors appear in the first half of the 23 factors in the model, among them FarConceal*(IMSE_Sal)^(1/4) and interaction terms involving log(ResidEng) and StillCamera. If we expand our focus to the first 15 factors (the number used in Model 1), SigVar*FarConceal*(IMSE_Sal)^(1/4) is also present.
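As a concrete illustration of the fitting machinery of Section IV-A, a one-factor logistic GLM can be sketched as follows (a toy fit by gradient ascent on the log-likelihood rather than the iteratively re-weighted least squares used by R's glm(); the data and names are invented for illustration):

```python
# Toy logistic GLM with logit link: P(visible) = 1/(1 + exp(-(b0 + b1*x))).
# Fit by simple gradient ascent on the Bernoulli log-likelihood; R's glm()
# uses IRLS, but the fitted model form is the same.
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """xs: factor values; ys: observed visibility labels (0 or 1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p        # gradient of log-likelihood w.r.t. b0
            g1 += (y - p) * x  # gradient of log-likelihood w.r.t. b1
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1
```

With losses at larger factor values labeled visible more often, the fit yields b1 > 0, so the predicted visibility probability increases with the factor, which is the behavior the stepwise selection of Section IV-A rewards.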
Note that the saliency-weighted pixel-wise error (IMSE_Sal) and the difference between the saliency maps of the original and distorted video caused by a packet loss (S_MSE) are two very helpful factors in modeling packet-loss visibility. Admittedly, this improvement comes at the expense of the complex computation of the saliency-detection system, unless saliency or foveal information is already available, as for some particular types of video content, e.g. news broadcasts or soccer games, where the salient regions are obvious.

Figure 3. Performance comparison between Model 1 and Model 2. (The number of factors used in Model 1 is limited to 15.)

V. CONCLUSIONS

In this paper, we proposed several saliency-based factors that improve the prediction of the visibility of packet losses. Together with a variety of non-saliency-based factors, we fit GLMs with and without our proposed factors using existing subjective data; the results show that saliency-based factors significantly improve the performance of loss-visibility modeling. Our prior work on perceptual quality prediction and the current work on predicting packet-loss visibility have shown that saliency information is helpful for both. One interesting question is whether we can predict the perceptual quality from the packet-loss visibility; this is one of our current research directions.

REFERENCES
[1] S. Kanumuri, et al., "Modeling Packet-loss Visibility in MPEG-2 Video," IEEE Trans. Multimedia, vol. 8, Apr. 2006.
[2] P. McCullagh, et al., Generalized Linear Models, 2nd ed., London, U.K.: Chapman & Hall.
[3] N. Suresh, et al., "Mean Time Between Failures: A Subjectively Meaningful Video Quality Metric," ICASSP, 2006.
[4] Z. Wang, et al., "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Trans. Image Proc., vol. 13, Apr. 2004.
[5] A. Ninassi, et al., "Does Where You Gaze on An Image Affect Your Perception of Quality? Applying Visual Attention To Image Quality Metric," Intl. Conf. Image Proc., 2007.
[6] L. Itti, et al., "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, Nov. 1998.
[7] A. Reibman and D. Poole, "Predicting packet-loss visibility using scene characteristics," Packet Video, 2007.
[8] S. Kanumuri, et al., "Predicting H.264 Packet Loss Visibility using a Generalized Linear Model," ICIP, 2006.
[9] T. Liu, et al., "Subjective Quality Evaluation of Decoded Video in The Presence of Packet Losses," ICASSP, 2007.
[10] X. Feng, et al., "Saliency Based Objective Quality Assessment of Decoded Video Affected by Packet Losses," ICIP.
[11] D. Walther, "Interactions of Visual Attention and Object Recognition: Computational Modeling, Algorithms, and Psychophysics," PhD thesis, California Institute of Technology, Pasadena, CA.
[12] R statistical software.
[13] H. Rui, et al., "Evaluation of packet loss impairment on streaming video," J. of Zhejiang University SCIENCE, vol. 7, April 2006.
More informationIdentification of Tissue Independent Cancer Driver Genes
Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important
More informationViewpoint Dependence in Human Spatial Memory
From: AAAI Technical Report SS-96-03. Compilation copyright 1996, AAAI (www.aaai.org). All rights reserved. Viewpoint Dependence in Human Spatial Memory Timothy P. McNamara Vaibhav A. Diwadkar Department
More informationRecurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios
2012 Ninth Conference on Computer and Robot Vision Recurrent Refinement for Visual Saliency Estimation in Surveillance Scenarios Neil D. B. Bruce*, Xun Shi*, and John K. Tsotsos Department of Computer
More informationMULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES
24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter
More informationReveal Relationships in Categorical Data
SPSS Categories 15.0 Specifications Reveal Relationships in Categorical Data Unleash the full potential of your data through perceptual mapping, optimal scaling, preference scaling, and dimension reduction
More informationOverview of the visual cortex. Ventral pathway. Overview of the visual cortex
Overview of the visual cortex Two streams: Ventral What : V1,V2, V4, IT, form recognition and object representation Dorsal Where : V1,V2, MT, MST, LIP, VIP, 7a: motion, location, control of eyes and arms
More informationBootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers
Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Kai-Ming Jiang 1,2, Bao-Liang Lu 1,2, and Lei Xu 1,2,3(&) 1 Department of Computer Science and Engineering,
More informationCHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL
127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field
More informationG5)H/C8-)72)78)2I-,8/52& ()*+,-./,-0))12-345)6/3/782 9:-8;<;4.= J-3/ J-3/ "#&' "#% "#"% "#%$
# G5)H/C8-)72)78)2I-,8/52& #% #$ # # &# G5)H/C8-)72)78)2I-,8/52' @5/AB/7CD J-3/ /,?8-6/2@5/AB/7CD #&' #% #$ # # '#E ()*+,-./,-0))12-345)6/3/782 9:-8;;4. @5/AB/7CD J-3/ #' /,?8-6/2@5/AB/7CD #&F #&' #% #$
More informationMeasuring Focused Attention Using Fixation Inner-Density
Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute
More informationPerformance and Saliency Analysis of Data from the Anomaly Detection Task Study
Performance and Saliency Analysis of Data from the Anomaly Detection Task Study Adrienne Raglin 1 and Andre Harrison 2 1 U.S. Army Research Laboratory, Adelphi, MD. 20783, USA {adrienne.j.raglin.civ, andre.v.harrison2.civ}@mail.mil
More informationTwo Themes. MobileASL: Making Cell Phones Accessible to the Deaf Community. Our goal: Challenges: Current Technology for Deaf People (text) ASL
Two Themes MobileASL: Making Cell Phones Accessible to the Deaf Community MobileASL AccessComputing Alliance Advancing Deaf and Hard of Hearing in Computing Richard Ladner University of Washington ASL
More informationComparison of Two Approaches for Direct Food Calorie Estimation
Comparison of Two Approaches for Direct Food Calorie Estimation Takumi Ege and Keiji Yanai Department of Informatics, The University of Electro-Communications, Tokyo 1-5-1 Chofugaoka, Chofu-shi, Tokyo
More informationGray Scale Image Edge Detection and Reconstruction Using Stationary Wavelet Transform In High Density Noise Values
Gray Scale Image Edge Detection and Reconstruction Using Stationary Wavelet Transform In High Density Noise Values N.Naveen Kumar 1, J.Kishore Kumar 2, A.Mallikarjuna 3, S.Ramakrishna 4 123 Research Scholar,
More informationSalient Object Detection in Videos Based on SPATIO-Temporal Saliency Maps and Colour Features
Salient Object Detection in Videos Based on SPATIO-Temporal Saliency Maps and Colour Features U.Swamy Kumar PG Scholar Department of ECE, K.S.R.M College of Engineering (Autonomous), Kadapa. ABSTRACT Salient
More informationFEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES
FEATURE EXTRACTION USING GAZE OF PARTICIPANTS FOR CLASSIFYING GENDER OF PEDESTRIANS IN IMAGES Riku Matsumoto, Hiroki Yoshimura, Masashi Nishiyama, and Yoshio Iwai Department of Information and Electronics,
More informationChapter 1. Introduction
Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a
More informationThe Attraction of Visual Attention to Texts in Real-World Scenes
The Attraction of Visual Attention to Texts in Real-World Scenes Hsueh-Cheng Wang (hchengwang@gmail.com) Marc Pomplun (marc@cs.umb.edu) Department of Computer Science, University of Massachusetts at Boston,
More informationSelection and Combination of Markers for Prediction
Selection and Combination of Markers for Prediction NACC Data and Methods Meeting September, 2010 Baojiang Chen, PhD Sarah Monsell, MS Xiao-Hua Andrew Zhou, PhD Overview 1. Research motivation 2. Describe
More informationSaliency aggregation: Does unity make strength?
Saliency aggregation: Does unity make strength? Olivier Le Meur a and Zhi Liu a,b a IRISA, University of Rennes 1, FRANCE b School of Communication and Information Engineering, Shanghai University, CHINA
More informationGroup-Wise FMRI Activation Detection on Corresponding Cortical Landmarks
Group-Wise FMRI Activation Detection on Corresponding Cortical Landmarks Jinglei Lv 1,2, Dajiang Zhu 2, Xintao Hu 1, Xin Zhang 1,2, Tuo Zhang 1,2, Junwei Han 1, Lei Guo 1,2, and Tianming Liu 2 1 School
More informationBIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
BIOSTATISTICAL METHODS AND RESEARCH DESIGNS Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA Keywords: Case-control study, Cohort study, Cross-Sectional Study, Generalized
More informationGene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering
Gene expression analysis Roadmap Microarray technology: how it work Applications: what can we do with it Preprocessing: Image processing Data normalization Classification Clustering Biclustering 1 Gene
More informationValence-arousal evaluation using physiological signals in an emotion recall paradigm. CHANEL, Guillaume, ANSARI ASL, Karim, PUN, Thierry.
Proceedings Chapter Valence-arousal evaluation using physiological signals in an emotion recall paradigm CHANEL, Guillaume, ANSARI ASL, Karim, PUN, Thierry Abstract The work presented in this paper aims
More informationCancer Cells Detection using OTSU Threshold Algorithm
Cancer Cells Detection using OTSU Threshold Algorithm Nalluri Sunny 1 Velagapudi Ramakrishna Siddhartha Engineering College Mithinti Srikanth 2 Velagapudi Ramakrishna Siddhartha Engineering College Kodali
More informationNeuromorphic convolutional recurrent neural network for road safety or safety near the road
Neuromorphic convolutional recurrent neural network for road safety or safety near the road WOO-SUP HAN 1, IL SONG HAN 2 1 ODIGA, London, U.K. 2 Korea Advanced Institute of Science and Technology, Daejeon,
More informationApplied Machine Learning in Biomedicine. Enrico Grisan
Applied Machine Learning in Biomedicine Enrico Grisan enrico.grisan@dei.unipd.it Algorithm s objective cost Formal objective for algorithms: - minimize a cost function - maximize an objective function
More informationIntelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012
Intelligent Edge Detector Based on Multiple Edge Maps M. Qasim, W.L. Woon, Z. Aung Technical Report DNA #2012-10 May 2012 Data & Network Analytics Research Group (DNA) Computing and Information Science
More informationAssigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science
Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Abstract One method for analyzing pediatric B cell leukemia is to categorize
More informationLocal Image Structures and Optic Flow Estimation
Local Image Structures and Optic Flow Estimation Sinan KALKAN 1, Dirk Calow 2, Florentin Wörgötter 1, Markus Lappe 2 and Norbert Krüger 3 1 Computational Neuroscience, Uni. of Stirling, Scotland; {sinan,worgott}@cn.stir.ac.uk
More informationBangor University Laboratory Exercise 1, June 2008
Laboratory Exercise, June 2008 Classroom Exercise A forest land owner measures the outside bark diameters at.30 m above ground (called diameter at breast height or dbh) and total tree height from ground
More informationQuantifying the Effect of Disruptions to Temporal Coherence on the Intelligibility of Compressed American Sign Language Video
Quantifying the Effect of Disruptions to Temporal Coherence on the ntelligibility of Compressed American Sign Language Video Frank M. Ciaramello and Sheila S. Hemami Visual Communication Laboratory School
More informationAutomated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection
Y.-H. Wen, A. Bainbridge-Smith, A. B. Morris, Automated Assessment of Diabetic Retinal Image Quality Based on Blood Vessel Detection, Proceedings of Image and Vision Computing New Zealand 2007, pp. 132
More informationJitter-aware time-frequency resource allocation and packing algorithm
Jitter-aware time-frequency resource allocation and packing algorithm The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As
More informationIntroduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018
Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this
More informationComputational Models of Visual Attention: Bottom-Up and Top-Down. By: Soheil Borhani
Computational Models of Visual Attention: Bottom-Up and Top-Down By: Soheil Borhani Neural Mechanisms for Visual Attention 1. Visual information enter the primary visual cortex via lateral geniculate nucleus
More informationExperiences on Attention Direction through Manipulation of Salient Features
Experiences on Attention Direction through Manipulation of Salient Features Erick Mendez Graz University of Technology Dieter Schmalstieg Graz University of Technology Steven Feiner Columbia University
More informationLearning Spatiotemporal Gaps between Where We Look and What We Focus on
Express Paper Learning Spatiotemporal Gaps between Where We Look and What We Focus on Ryo Yonetani 1,a) Hiroaki Kawashima 1,b) Takashi Matsuyama 1,c) Received: March 11, 2013, Accepted: April 24, 2013,
More informationInternational Journal of Computational Science, Mathematics and Engineering Volume2, Issue6, June 2015 ISSN(online): Copyright-IJCSME
Various Edge Detection Methods In Image Processing Using Matlab K. Narayana Reddy 1, G. Nagalakshmi 2 12 Department of Computer Science and Engineering 1 M.Tech Student, SISTK, Puttur 2 HOD of CSE Department,
More informationDiscovering Meaningful Cut-points to Predict High HbA1c Variation
Proceedings of the 7th INFORMS Workshop on Data Mining and Health Informatics (DM-HI 202) H. Yang, D. Zeng, O. E. Kundakcioglu, eds. Discovering Meaningful Cut-points to Predict High HbAc Variation Si-Chi
More informationEstimating Multiple Evoked Emotions from Videos
Estimating Multiple Evoked Emotions from Videos Wonhee Choe (wonheechoe@gmail.com) Cognitive Science Program, Seoul National University, Seoul 151-744, Republic of Korea Digital Media & Communication (DMC)
More informationPerceptual-Based Objective Picture Quality Measurements
Perceptual-Based Objective Picture Quality Measurements Introduction In video systems, a wide range of video processing devices can affect overall picture quality. Encoders and decoders compress and decompress
More informationNotes for laboratory session 2
Notes for laboratory session 2 Preliminaries Consider the ordinary least-squares (OLS) regression of alcohol (alcohol) and plasma retinol (retplasm). We do this with STATA as follows:. reg retplasm alcohol
More informationIAT 355 Visual Analytics. Encoding Information: Design. Lyn Bartram
IAT 355 Visual Analytics Encoding Information: Design Lyn Bartram 4 stages of visualization design 2 Recall: Data Abstraction Tables Data item (row) with attributes (columns) : row=key, cells = values
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 Exam policy: This exam allows one one-page, two-sided cheat sheet; No other materials. Time: 80 minutes. Be sure to write your name and
More informationFusing Generic Objectness and Visual Saliency for Salient Object Detection
Fusing Generic Objectness and Visual Saliency for Salient Object Detection Yasin KAVAK 06/12/2012 Citation 1: Salient Object Detection: A Benchmark Fusing for Salient Object Detection INDEX (Related Work)
More informationSparse Coding in Sparse Winner Networks
Sparse Coding in Sparse Winner Networks Janusz A. Starzyk 1, Yinyin Liu 1, David Vogel 2 1 School of Electrical Engineering & Computer Science Ohio University, Athens, OH 45701 {starzyk, yliu}@bobcat.ent.ohiou.edu
More informationANALYSIS OF FACIAL FEATURES OF DRIVERS UNDER COGNITIVE AND VISUAL DISTRACTIONS
ANALYSIS OF FACIAL FEATURES OF DRIVERS UNDER COGNITIVE AND VISUAL DISTRACTIONS Nanxiang Li and Carlos Busso Multimodal Signal Processing (MSP) Laboratory Department of Electrical Engineering, The University
More informationEmpirical Analysis of Object-Oriented Design Metrics for Predicting High and Low Severity Faults
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006 771 Empirical Analysis of Object-Oriented Design Metrics for Predicting High and Low Severity Faults Yuming Zhou and Hareton Leung,
More informationQuantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion
Quantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion Qiang Ji Department of Computer Science University of Nevada Robert M. Haralick Department of Electrical Engineering
More informationOn Shape And the Computability of Emotions X. Lu, et al.
On Shape And the Computability of Emotions X. Lu, et al. MICC Reading group 10.07.2013 1 On Shape and the Computability of Emotion X. Lu, P. Suryanarayan, R. B. Adams Jr., J. Li, M. G. Newman, J. Z. Wang
More informationObject-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis
Object-Level Saliency Detection Combining the Contrast and Spatial Compactness Hypothesis Chi Zhang 1, Weiqiang Wang 1, 2, and Xiaoqian Liu 1 1 School of Computer and Control Engineering, University of
More informationDevelopment of goal-directed gaze shift based on predictive learning
4th International Conference on Development and Learning and on Epigenetic Robotics October 13-16, 2014. Palazzo Ducale, Genoa, Italy WePP.1 Development of goal-directed gaze shift based on predictive
More informationObject-based Saliency as a Predictor of Attention in Visual Tasks
Object-based Saliency as a Predictor of Attention in Visual Tasks Michal Dziemianko (m.dziemianko@sms.ed.ac.uk) Alasdair Clarke (a.clarke@ed.ac.uk) Frank Keller (keller@inf.ed.ac.uk) Institute for Language,
More informationUsing Dynamic Time Warping for Intuitive Handwriting Recognition
Using Dynamic Time Warping for Intuitive Handwriting Recognition Ralph Niels and Louis Vuurpijl Nijmegen Institute for Cognition and Information (Radboud University Nijmegen) P.O. Box 9104, 6500 HE Nijmegen,
More informationFirefighter safety: How visible is your protective clothing?
News Article ID: 4028 Fire international 23 April 2003 Firefighter safety: How visible is your protective clothing? TNO Human Factors Research Institute has developed a method to determine visual conspicuity
More informationAge (continuous) Gender (0=Male, 1=Female) SES (1=Low, 2=Medium, 3=High) Prior Victimization (0= Not Victimized, 1=Victimized)
Criminal Justice Doctoral Comprehensive Exam Statistics August 2016 There are two questions on this exam. Be sure to answer both questions in the 3 and half hours to complete this exam. Read the instructions
More information