In this report, we simulate the rate-distortion behavior of the three methods above. The unconstrained Lagrangian cost function is then introduced to find the optimum detection parameters for each method. Finally, we demonstrate that, among the three methods, the proposed noise-adaptive threshold method generates the best image quality for the same channel bandwidth.
This report is organized as follows: in Section II, we describe the context-based arithmetic encoding (CAE) scheme used to encode the change detection mask; in Section III, we present rate-distortion results for both the subsampling and the threshold-adjusting change detection methods; in Section IV, we describe the noise characteristics of a typical digital imaging system and propose an adaptive change detection algorithm.
The performance of these three algorithms is compared in Section IV. Context-based arithmetic encoding is one of the binary bitmap-based shape coding schemes used in the object-based MPEG-4 standard. Here we consider only the intraframe compression mode. The binary change detection mask is first divided into 16x16 macroblocks.
Within each macroblock, the coder exploits the spatial redundancy of the binary shape information to be coded. Pixels are coded row by row in scan-line order.
Three types of macroblock are defined: "black" blocks, in which no pixel has changed (all pixels are 0); "white" blocks, whose pixels have all changed and need to be replenished (all pixels are 1); and boundary macroblocks, which contain both changed and unchanged pixels. For black and white macroblocks, the encoder only needs to signal the macroblock type, so the number of bits used is negligible. For boundary macroblocks, a template of 10 pixels defines the causal context used to predict the binary value of the current pixel S0.
The template is shown in Figure 3.
Figure 3 Template for defining the context of the pixel to be coded.
The template extends up to two pixels to the left, to the right, and above the pixel S0 to be coded. For pixels in the top two rows and the left two columns of a macroblock, parts of the template are defined by the shape information of the already transmitted macroblocks above and to the left of the current macroblock.
For the two right-most columns, each undefined pixel of the context is set to the value of its closest neighbor inside the macroblock. A context-based arithmetic encoder is then used for the actual encoding. For this project, however, the arithmetic encoder is not implemented. Instead, the theoretical conditional entropy based on this template is calculated, and this entropy is taken as an approximation of the bit-rate required for encoding the change mask.
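As an illustration, the 10-pixel causal context can be packed into a single integer index. The sketch below assumes the standard MPEG-4 intra template layout (three pixels two rows up, five pixels one row up, two pixels to the left); `context_index` is a hypothetical helper, not part of the report's implementation.

```python
import numpy as np

# Offsets (dy, dx) of the 10 causal template pixels relative to the
# current pixel S0, assuming the standard MPEG-4 intra CAE template.
TEMPLATE = [(-2, -1), (-2, 0), (-2, 1),
            (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-1, 2),
            (0, -2), (0, -1)]

def context_index(mask, y, x):
    """Pack the 10 causal neighbours of (y, x) into a context number
    in [0, 1024). Pixels outside the mask are treated as 0 (unchanged)."""
    h, w = mask.shape
    ctx = 0
    for k, (dy, dx) in enumerate(TEMPLATE):
        yy, xx = y + dy, x + dx
        bit = mask[yy, xx] if 0 <= yy < h and 0 <= xx < w else 0
        ctx |= int(bit) << k
    return ctx

mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:8, 4:8] = 1
ctx = context_index(mask, 5, 5)  # context of an interior changed pixel
```

In a full CAE implementation, `ctx` would select one of 1024 probability models driving the arithmetic coder.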
In order to calculate the conditional entropy, a conditional probability table over the 2^{10} = 1024 contexts is first derived from a larger training sequence. The conditional entropy is given by

H(S_0 \mid C) = -\sum_{c} p(c) \sum_{s \in \{0,1\}} p(s \mid c) \log_2 p(s \mid c),

where C denotes the 10-pixel context. As with predictive coding, context-based conditional entropy coding greatly reduces the statistical dependencies between adjacent pixels, since

H(S_0 \mid C) \le H(S_0).

A standard change detector is shown in the following figure:
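The conditional entropy computation can be sketched as follows. This is a minimal illustration with a toy two-context table rather than the full 1024-context table; `conditional_entropy` is a hypothetical helper name.

```python
import numpy as np

def conditional_entropy(counts):
    """counts: array of shape (n_contexts, 2) holding occurrence counts
    of pixel values 0/1 under each context. Returns H(S0 | context) in
    bits per pixel."""
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()             # p(context, s)
    p_ctx = p_joint.sum(axis=1, keepdims=True)  # p(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_cond = np.where(p_ctx > 0, p_joint / p_ctx, 0)  # p(s | context)
        logs = np.where(p_cond > 0, np.log2(p_cond), 0)
    return float(-(p_joint * logs).sum())

# Toy table with 2 contexts instead of 2**10 = 1024:
counts = np.array([[90, 10],   # context 0: mostly unchanged pixels
                   [ 5, 95]])  # context 1: mostly changed pixels
H = conditional_entropy(counts)
```

Because each context is strongly predictive, H comes out well below 1 bit per pixel, which is exactly why the context model reduces the mask bit-rate.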
Figure 4 Change detector block diagram.
Suppose the previous frame is represented by matrix A_1 and the current frame by matrix A_2. After change detection, the change mask is given by the binary matrix C, in which 0 represents unchanged pixels and 1 represents changed pixels. The mean-square distortion between the current frame and its reconstruction \hat{A}_2, in which only the detected changed pixels are replenished, is defined as

D = \frac{1}{N} \sum_{i,j} \left( A_2(i,j) - \hat{A}_2(i,j) \right)^2,

where N is the number of pixels. The total bit-rate R is

R = R_1 + R_2,

where R_1 is the rate for coding the change mask and R_2 is the rate for coding the replenished pixel values. Assuming that R_2 behaves similarly across the methods compared here, we can study the rate-distortion trade-off between D and R_1 instead.
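A minimal sketch of the change detector and the distortion measure, assuming a simple thresholded frame difference (the function names are hypothetical):

```python
import numpy as np

def change_mask(prev, curr, threshold):
    """Binary change mask C: 1 where |A2 - A1| exceeds the detection
    threshold, 0 otherwise."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    return (diff > threshold).astype(np.uint8)

def mse_distortion(prev, curr, mask):
    """Reconstruct A2 by replenishing only the detected changed pixels;
    unchanged pixels are copied from the previous frame."""
    recon = np.where(mask == 1, curr, prev)
    return float(np.mean((curr.astype(float) - recon) ** 2))

prev = np.zeros((8, 8), dtype=np.uint8)
curr = prev.copy()
curr[2:6, 2:6] = 100   # a real moving object
curr[0, 0] = 3         # a small noise-like change
mask = change_mask(prev, curr, threshold=10)
D = mse_distortion(prev, curr, mask)
```

With a threshold of 10, the object block is detected while the small noise-like change is not, so the only distortion comes from the undetected pixel.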
The Lagrangian cost function for a given Lagrange multiplier \lambda is defined as

J(\lambda) = D + \lambda R.

The test video sequence used in this study was captured by a stationary high-speed digital camera, with a person moving across the screen. Two consecutive sample frames and an example of the detected change mask are shown below.
Figure 5 Example of two consecutive frames and the change mask.
There are two direct ways to reduce the number of detected pixels in change detection: subsampling and threshold adjusting.
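Selecting the optimum operating point under the Lagrangian cost can be sketched as follows. The R-D points below are hypothetical numbers, not the report's measurements; `best_operating_point` is an illustrative helper.

```python
def best_operating_point(points, lam):
    """points: iterable of (rate, distortion) operating points for one
    method. Returns the point minimising J = D + lam * R."""
    return min(points, key=lambda rd: rd[1] + lam * rd[0])

# Hypothetical R-D points, e.g. one per subsampling ratio
# (rates in bits/pixel, distortions in MSE):
points = [(1.00, 2.0), (0.50, 4.0), (0.25, 9.0), (0.12, 20.0)]
low_lambda = best_operating_point(points, 1.0)    # rate is cheap
high_lambda = best_operating_point(points, 40.0)  # rate is expensive
```

A small \lambda favors low distortion at high rate; a large \lambda pushes the choice toward low-rate, high-distortion points, tracing out the operational R-D curve.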
To reduce the bit-rate and allow lossy compression, the macroblock can be subsampled by a factor of 2, 4, or 8, resulting in a subblock of size 8x8, 4x4, or 2x2, respectively. The subblock is encoded using the CAE encoder described above. The signal in the changed area is subsampled by the same ratio. The encoder transmits the subsampling factor to the decoder, which decodes the change mask and the signal and then upsamples them.
The upsampling filter used in this study is a simple pixel replication filter combined with a 3x3 median filter. The following figure shows the change masks generated at different subsampling ratios.
Figure 6 Change mask generated at different subsampling ratios.
The figures below show that as the subsampling ratio increases, the bit-rate decreases while the distortion increases. The rate-distortion curve and the Lagrangian cost function for different Lagrange multipliers \lambda are plotted as follows:
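The subsampling and upsampling steps can be sketched as follows, assuming the pixel-replication-plus-median design described above (a naive zero-padded median loop is used here for clarity; a real implementation would use an optimized filter):

```python
import numpy as np

def subsample_mask(mask, factor):
    """Keep every factor-th pixel of the macroblock mask in each dimension."""
    return mask[::factor, ::factor]

def upsample_mask(sub, factor):
    """Pixel replication followed by a 3x3 median filter (zero padding)."""
    up = np.repeat(np.repeat(sub, factor, axis=0), factor, axis=1)
    padded = np.pad(up, 1)
    out = np.empty_like(up)
    for y in range(up.shape[0]):
        for x in range(up.shape[1]):
            out[y, x] = np.median(padded[y:y + 3, x:x + 3])
    return out

mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:12, 4:12] = 1
sub = subsample_mask(mask, 4)   # 4x4 subblock sent to the CAE encoder
recon = upsample_mask(sub, 4)   # mask reconstructed at the decoder
```

Since the mask is binary, the 3x3 median acts as a majority vote that smooths the blocky edges introduced by replication.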
Another way to control the bit-rate is to adjust the change detector threshold. As the threshold increases, fewer pixels are detected. The following figure shows the change detection mask under different threshold values, and the subsequent figures show that as the detection threshold increases, the bit-rate decreases while the distortion increases.
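The threshold-adjusting trade-off can be demonstrated on synthetic frames (hypothetical data, not the report's test sequence), using the fraction of detected pixels as a crude stand-in for the mask bit-rate:

```python
import numpy as np

rng = np.random.default_rng(0)
prev = rng.integers(0, 200, size=(64, 64))
noise = rng.integers(-5, 6, size=(64, 64))
curr = prev + noise          # small noise-only changes everywhere
curr[20:40, 20:40] += 80     # plus a real moving object

rates, dists = [], []
for t in (1, 4, 100):
    mask = np.abs(curr - prev) > t
    recon = np.where(mask, curr, prev)   # replenish detected pixels only
    rates.append(mask.mean())            # fraction of replenished pixels
    dists.append(np.mean((curr - recon) ** 2.0))
```

Raising the threshold shrinks the mask (lower rate) but leaves more genuine differences uncorrected (higher distortion), reproducing the monotone trend seen in the figures.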
A fundamental problem in designing an optimum change detector is how to separate pixels whose change is due to noise from pixels whose change is due to a real change in the input signal.

Finally, we show that our method is very effective in joint compression of multiple modalities, which exist in videos from depth, stereo, or multi-view cameras.
The main contributions of this paper are: (i) we present a simple yet effective and theoretically grounded method for video compression that can serve as a basis for future work. The rest of the paper is organized as follows. In the next section, we discuss related work on learned image and video compression. Then, in Section 3, we discuss the theoretical framework of learned compression using rate-distortion autoencoders, as well as its relation to variational autoencoders. In Section 4, we discuss our methodology in detail, including data preprocessing and the autoencoder and prior architectures.
We present experimental results in Section 5, comparing our method to classical and learned video codecs and evaluating semantic compression, adaptive compression, and multimodal compression. Section 6 concludes the paper.
Learned Video Compression
Video compression shares many similarities with image compression, but the large size of video data and its very high degree of redundancy create new challenges [15, 30, 33, 40].
While powerful and flexible, this model scales rather poorly to larger videos and can only be used for lossless compression. Hence, we employ this method for lossless compression of the latent codes, which are much smaller than the video itself. An extension of this method was proposed in [11], where blocks of pixels are modeled in an autoregressive fashion and the latent space is binarized as in [36].
The applicability of this approach is rather limited, since it is still not very scalable and introduces artifacts at the boundaries between blocks, especially at low bit-rates. The method described in [40] compresses videos by first encoding key frames and then interpolating between them in a hierarchical manner. However, this method requires additional components to handle the context of the predicted frame. In our approach, we instead aim to learn these interactions through 3D convolutions.
In [15], a stochastic variational compression method for video was presented. The model contains a separate latent variable for each frame and for the inter-frame dependencies, and uses the prior proposed in [6].
By contrast, we use a simpler model with a single latent space, and a deterministic instead of a stochastic encoder. Very recently, the video compression problem was addressed by considering flow compression and residual compression [27, 33]. Nevertheless, we believe that these ideas are promising and may further improve the results presented in this paper. Our general approach to lossy compression is to learn a latent variable model in which the latent variables capture the important information that is to be transmitted, and from which the original input can be approximately reconstructed.
We begin by defining a joint model of data x and discrete latent variables z,

p(x, z) = p(z) \, p(x \mid z).

In the VAE [25, 31], one uses neural networks to parameterize both q(z \mid x) and p(x \mid z), which can thus be thought of as the encoder and decoder parts of an autoencoder; training minimizes the negative evidence lower bound

\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[-\log p(x \mid z)\right] + \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right).

The VAE is commonly interpreted as a regularized autoencoder, where the first term of the loss measures the reconstruction error and the KL term acts as a regularizer [25].
Under this interpretation, the first term of the rhs of the bound, averaged over q, yields the first (reconstruction) term of the VAE loss. We note that in lossy compression, we do not actually encode x using p(x \mid z), which would allow lossless reconstruction. Instead, we only send z, and hence refer to the first loss term as the distortion. The second term of the bound (the KL) is related to the cost of coding the latents z coming from the encoder q(z \mid x) using an optimal code derived from the prior p(z).
Averaging over the encoder q(z \mid x), we find that the average coding cost is equal to the cross-entropy between q and p:

\mathbb{E}_{q(z \mid x)}\left[-\log p(z)\right] = \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right) + H[q].

So the KL measures the coding cost, except that there is a discount worth H[q] bits: randomness coming from the encoder is free. Since we cannot use bits-back coding for lossy compression, the cross-entropy provides a more suitable loss than the KL. Moreover, when using discrete latents, the entropy H[q] is always non-negative, so we can add it to the rhs of the bound without invalidating it. We thus obtain the rate-distortion loss. Since the cross-entropy loss does not include a discount for the encoder entropy, there is pressure to make the encoder more deterministic.
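Combining the two terms, the rate-distortion objective for a single datapoint x takes the standard form (a reconstruction of the usual expression; the paper's own equation numbering is not preserved here):

```latex
\mathcal{L}_{\mathrm{RD}}(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[-\log p(x \mid z)\right]}_{\text{distortion}}
  + \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[-\log p(z)\right]}_{\text{rate (cross-entropy)}}
  = \mathbb{E}_{q(z \mid x)}\!\left[-\log p(x \mid z)\right]
  + \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right) + H[q].
```

For a deterministic encoder, H[q] = 0, so the rate term reduces to the code length of the single latent code under the prior.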
For this reason, we only consider deterministic encoders in this work. With a deterministic encoder, H[q] = 0, so the cross-entropy and the KL coincide in the rate-distortion loss. Finally, we note that limiting ourselves to deterministic encoders does not lower the best achievable likelihood, assuming a sufficiently flexible class of priors and likelihoods.

In the previous section, we outlined the general compression framework using rate-distortion autoencoders.
Here we will describe the specific models we use for encoder, code model, and decoder, as well as the data format, preprocessing, and loss functions.