The paper proposes a probabilistic video model, the Video Pixel Network (VPN), to estimate the discrete joint distribution of the raw pixel values in a video. The novelty of this model is that it builds on PixelRNN, which makes it possible to learn not only the temporal structure but also the spatial and colour dependencies. As a result, the time, space and colour structure of video tensors is learned and encoded as a four-dimensional dependency chain.
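The four-dimensional dependency chain can be written out with the chain rule. What follows is only a sketch, assuming raster-scan order over pixel positions (i, j) within a frame and R, G, B order over colour channels. The symbols T, N, x_{t,i,j,c} and x_< are notation chosen for this note and may not match the paper's Eq. (1) exactly.

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} \prod_{i=1}^{N} \prod_{j=1}^{N}
  p(x_{t,i,j,B} \mid \mathbf{x}_{<}, x_{t,i,j,R}, x_{t,i,j,G}) \,
  p(x_{t,i,j,G} \mid \mathbf{x}_{<}, x_{t,i,j,R}) \,
  p(x_{t,i,j,R} \mid \mathbf{x}_{<})
```

Here x_< stands for all pixels of the frames before t together with the pixels of frame t that precede position (i, j) in the assumed scan order.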
1. The architecture of the VPN consists of two parts:
(1) a resolution-preserving CNN encoder
(2) a PixelCNN decoder
2. Eq. (1) models the joint distribution as a product of conditional distributions (the chain rule), so no independence assumptions are needed.
For example, as shown in the figure, the green colour channel value of pixel x in frame F_t depends on:
(1) all pixel values in every channel of the frames from time 1 to time t-1
(2) the already generated red colour value of pixel x
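The dependency structure in the example can be sketched as an ordering over (frame, row, column, channel) positions. This is a minimal illustration, not the paper's implementation. It assumes 0-based indices, raster-scan order inside a frame and R, G, B channel order, so the context also includes pixels generated earlier in the same frame. The names generation_order, depends_on and context_of are hypothetical helpers.

```python
from itertools import product

CHANNELS = ("R", "G", "B")  # assumed channel order


def generation_order(num_frames, height, width):
    """Yield (t, i, j, c) positions in the assumed generation order."""
    yield from product(
        range(num_frames), range(height), range(width), range(len(CHANNELS))
    )


def depends_on(target, source):
    """True if the value at `target` is conditioned on the value at `source`.

    Positions are (t, i, j, c) tuples, so lexicographic tuple comparison
    gives exactly the time -> row -> column -> channel ordering.
    """
    return source < target


def context_of(target, num_frames, height, width):
    """List every position that `target` is conditioned on."""
    return [
        pos
        for pos in generation_order(num_frames, height, width)
        if depends_on(target, pos)
    ]


# Green channel (c = 1) of pixel (0, 1) in the third frame (t = 2)
# of a toy 3-frame, 2x2 video.
target = (2, 0, 1, 1)
context = context_of(target, num_frames=3, height=2, width=2)
```

In this toy case the context holds every channel of every pixel in the two earlier frames, the three channels of the pixel generated before it in the same frame, and the red value of the pixel itself.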
1. The reason for using a resolution-preserving CNN encoder is that it allows the model to condition each pixel that needs to be generated without loss of representational capacity.
2. A convolutional LSTM is used. (Read the paper by Shi et al., 2015! This variant of LSTM has been used frequently in recent work.)
3. The novelty of this paper is that the decoder is conditioned on more information. It seems that several papers have been trying to do similar things (GIVING MORE INFORMATION TO THE GENERATIVE MODEL). What else can be conditioned on?
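On point 1 above, "resolution-preserving" can be checked with the standard convolution output-size formula. The sketch below assumes stride-1 convolutions with an odd kernel and padding matched to the dilation. The 64-pixel input, the 3-wide kernel and the dilation rates are illustrative values, not the paper's configuration.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output size along one spatial axis (standard convolution formula)."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def same_padding(kernel, dilation=1):
    """Padding that keeps the size unchanged for stride 1 and an odd kernel."""
    return dilation * (kernel - 1) // 2


# A stack of stride-1 layers with matched padding keeps the resolution.
size = 64
for dilation in (1, 2, 4):
    pad = same_padding(3, dilation)
    size = conv_output_size(size, kernel=3, padding=pad, dilation=dilation)

# For contrast, a single stride-2 layer halves the resolution.
downsampled = conv_output_size(64, kernel=3, stride=2, padding=1)
```

With every layer keeping the input size, the encoder output has one feature vector per pixel position, which is what lets the decoder condition each generated pixel on its own location.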
Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.