The paper describes a spatio-temporal video autoencoder, based on a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is a differentiable visual memory composed of convolutional LSTM cells that integrate changes over time.
At each time step, the system receives a video frame as input, predicts the optical flow based on the current frame and the LSTM content as a dense transformation map, and applies it to the current frame to predict the next frame.
1. Spatial autoencoder: a classic convolutional encoder-decoder architecture.
2. Temporal autoencoder: it contains three parts: (1) a memory module (convolutional LSTM) with optical flow prediction under a Huber penalty, (2) a grid generator, and (3) an image sampler.
3. The loss function is the reconstruction error between the predicted next frame and the ground-truth next frame, with the Huber penalty gradient on the optical flow map injected during backpropagation.
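The warping step can be sketched in PyTorch: predict a dense flow, displace an identity sampling grid, and bilinearly sample the current frame. Tensor shapes and the exact form of the Huber smoothness penalty here are my assumptions for illustration, not the paper's implementation.

```python
# Sketch of flow-based frame warping with a Huber smoothness penalty.
# Shapes and the penalty form are assumptions, not the paper's exact code.
import torch
import torch.nn.functional as F

def warp_next_frame(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixel units."""
    B, _, H, W = frame.shape
    # Base identity grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    # Convert flow from pixels to normalized coordinates and displace the grid.
    norm_flow = torch.stack(
        (flow[:, 0] * 2 / (W - 1), flow[:, 1] * 2 / (H - 1)), dim=-1
    )
    grid = base + norm_flow
    return F.grid_sample(frame, grid, align_corners=True)

def huber_smoothness(flow, delta=1.0):
    """Huber penalty on spatial gradients of the flow map (smoothness prior)."""
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    h = lambda d: torch.where(
        d.abs() <= delta, 0.5 * d ** 2, delta * (d.abs() - 0.5 * delta)
    )
    return h(dx).mean() + h(dy).mean()
```

With zero flow the displaced grid is the identity, so the warped output reproduces the input frame; the Huber term penalizes abrupt changes in the flow field.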
1. In this kind of video prediction work, the model usually just outputs the next frame. Is that even enough? Is generating a single frame sufficient to conclude that the method captures enough temporal information in the video?
2. It would be really cool if the output could actually be some movement or even an action, instead of one frame that is really similar to the previous frame.
The paper proposes a probabilistic video model, the Video Pixel Network (VPN), to estimate the discrete joint distribution of the raw pixel values in a video. The novelty of this model is that it leverages the power of PixelRNN, which makes it possible to learn not only the temporal structure but also the spatial and colour dependencies. As a result, the time, space and colour structure of video tensors is learned and encoded as a four-dimensional dependency chain.
1. The architecture of the VPN consists of two parts:
(1) a resolution-preserving CNN encoder
(2) a PixelCNN decoder
2. Eq. (1) models the joint distribution in a conditionally dependent fashion, so no independence assumptions are needed.
For example, as shown in the figure, the green colour channel value of pixel x in frame F_t depends on:
(1) all the pixel values in every channel from time 1 to time t-1
(2) the already generated red colour value of pixel x
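This dependency chain can be written as a chain-rule factorization (my paraphrase of the paper's Eq. (1); x_{t,i,c} denotes the value of channel c of pixel i in frame t):

```latex
p(\mathbf{x}) \;=\; \prod_{t=1}^{T}\,\prod_{i=1}^{N}\,\prod_{c \in \{R,G,B\}}
p\bigl(x_{t,i,c} \,\big|\, \mathbf{x}_{<t},\; x_{t,<i},\; x_{t,i,<c}\bigr)
```

Each factor conditions on all previous frames, all previously generated pixels of the current frame, and the previously generated colour channels of the current pixel.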
1. The reason to use a resolution-preserving CNN encoder is that it allows the model to condition each pixel that needs to be generated without loss of representational capacity.
2. Convolutional LSTM is used. (Read the ConvLSTM paper cited below! This version of LSTM has been used frequently recently.)
3. The novelty of this paper is that the decoder is conditioned on more information. It seems like several papers have been trying to do similar things (giving more information to the generative model). What more can be conditioned on?
 Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural network architecture for image generation.
The DRAW network combines a novel spatial attention mechanism with a sequential VAE framework, allowing the iterative construction of complex images.
1. As shown, the architecture is a pair of recurrent neural networks: (1) an encoder network that compresses the images during training, and (2) a decoder that reconstitutes images after receiving codes. The whole system is trained end to end by gradient descent.
2. Three major differences:
(1) the encoder is privy to the decoder's previous outputs, allowing it to tailor the codes it sends according to the decoder's behaviour so far.
(2) The decoder's outputs are successively added to the distribution that will ultimately generate the data.
(3) A dynamically updated attention mechanism is used to focus on specific regions.
3. Selective attention model: an array of 2D Gaussian filters is applied to the image, yielding an image 'patch' of smoothly varying location and zoom, similar to Graves (2013) and the Neural Turing Machines. The attention model makes it possible for the network to focus only on a certain region of interest. It is similar to the recurrent attention model of Mnih et al. (2014), but differentiable.
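The Gaussian filterbank read can be sketched in NumPy: an N x N grid of 1D Gaussians along each axis extracts a smooth patch whose location, zoom and blur are controlled by the grid centre, stride and filter variance. The parameter names follow the paper's description; the exact normalization is my assumption.

```python
# Sketch of DRAW-style selective attention: a grid of N x N Gaussian filters
# reads a patch of smoothly varying location and zoom from an image.
# Normalization details are assumptions for illustration.
import numpy as np

def gaussian_filterbank(grid_size, img_size, center, stride, sigma):
    """1D filterbank: (grid_size, img_size) matrix of row-normalized Gaussians."""
    # Filter centres are spaced `stride` apart, centred on `center`.
    mu = center + (np.arange(grid_size) - grid_size / 2 + 0.5) * stride
    a = np.arange(img_size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)

def read_patch(image, N, gx, gy, delta, sigma):
    """Extract an N x N patch from `image` (H x W) at centre (gx, gy)."""
    H, W = image.shape
    Fy = gaussian_filterbank(N, H, gy, delta, sigma)
    Fx = gaussian_filterbank(N, W, gx, delta, sigma)
    return Fy @ image @ Fx.T  # (N, N) patch
```

Because the patch is a smooth linear function of the filter parameters, gradients flow through the attention window, which is exactly what makes this mechanism trainable end to end.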
1. The use of LSTMs/RNNs for images is interesting; it actually mimics the drawing process. So how do humans practically make a video or film? First shoot every scene (short clips of video), then link these clips together to form a film. Is it possible to do something similar with this paper's insight? It seems like the network-in-network (RNN in RNN) structure from the Feedback Network paper combined with this paper could work.
2. The differentiable attention model is worth paying attention to: it could be useful in a lot of applications.
3. I have seen several papers about VAEs. The difference is sometimes just the input. In this paper, for example, the input of the VAE is a combination of several pieces of information. How should each input be used, or is simple concatenation enough?
 Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212, 2014.
The paper introduces a stochastic variational inference and learning algorithm that scales to large datasets. The contribution is twofold.
(1) A reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.
(2) For datasets with continuous latent variables per data point, posterior inference can be made especially efficient by fitting an approximate inference model to the intractable posterior using the proposed lower bound estimator.
1. A lower bound estimator, a stochastic objective function, can be derived for a variety of directed graphical models with continuous latent variables using the VAE framework.
2. Important equations to reach the VAE:
The marginal likelihood can be written as the sum of two terms: (1) the KL divergence between the approximate and the true posterior, and (2) the variational lower bound, which itself consists of a KL regularization term and the expected negative reconstruction error.
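Written out in Kingma and Welling's notation, the two equations are:

```latex
\log p_\theta(x^{(i)})
  = D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\bigr)
  + \mathcal{L}(\theta, \phi; x^{(i)})
```

```latex
\mathcal{L}(\theta, \phi; x^{(i)})
  = -D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\bigr)
  + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\bigl[\log p_\theta(x^{(i)} \mid z)\bigr]
```

Since the first KL term is non-negative, maximizing the lower bound L pushes up the marginal likelihood while pulling the approximate posterior toward the true one.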
3. Reparameterization trick: let z be a continuous random variable with some distribution conditioned on x. It can be expressed as the output of a deterministic function g_phi(eps, x) of an auxiliary noise variable eps.
4. The steps in the VAE are (1) encode, (2) reparameterize and (3) decode. The reconstruction loss (together with the KL term) is then used to optimize.
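The three steps can be sketched as a minimal PyTorch model; the layer sizes and the Bernoulli (binary cross-entropy) likelihood are arbitrary assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of encode -> reparameterize -> decode with a negative-ELBO
# loss. Layer sizes and the Bernoulli likelihood are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=128):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))                 # (1) encode
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                  # (2) reparameterize:
        z = mu + torch.exp(0.5 * logvar) * eps      #     z = mu + sigma * eps
        x_hat = self.dec(z)                         # (3) decode
        recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
        return recon + kl                           # negative ELBO
```

The key point is that the sampling noise eps is drawn outside the computation graph, so gradients flow through mu and logvar into the encoder.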
Fundamental paper for VAEs. Need to derive the equations by hand to understand them better.
This paper studies the problem of synthesizing a number of likely future frames from a single input image. The proposed approach is to model future frames in a probabilistic manner. The proposed network structure, called Cross Convolutional Network, encodes image and motion information as feature maps and convolutional kernels, respectively.
1. In training: the network observes a set of consecutive image pairs in videos and automatically infers the relationship between the images in each pair without any supervision. In testing: the network predicts the conditional distribution P(J|I) of future RGB images J given an RGB input image I that is not in the training set. Using this distribution, the model can synthesize multiple different image samples corresponding to possible future frames of the input image.
2. Conditional variational autoencoder is used to model the complex conditional distribution of future frames.
3. Instead of finding an intrinsic representation of the image itself, the proposed network finds an intrinsic representation of intensity changes between two images, known as difference image or Eulerian motion.
4. Motion is modelled using a set of image-dependent convolution kernels operating over an image pyramid.
5. Network structure: (a) a motion encoder, a variational autoencoder that learns a compact representation z of possible motions; (b) a kernel decoder, which learns motion kernels from z; (c) an image encoder, which consists of convolutional layers extracting feature maps from the input image I; (d) a cross convolutional layer, which takes the outputs of the image encoder and the kernel decoder and convolves the feature maps with the motion kernels; (e) a motion decoder, which regresses the difference image from the combined feature maps.
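The cross convolutional layer can be sketched with a grouped convolution, so that each sample's feature maps are convolved with that sample's own predicted kernels. Using `groups` to fold the batch into channels is my implementation choice, not necessarily the paper's.

```python
# Sketch of a cross convolutional layer: per-sample, per-channel convolution
# of feature maps with image-dependent motion kernels. The grouped-convolution
# trick is an implementation assumption.
import torch
import torch.nn.functional as F

def cross_convolve(feature_maps, kernels):
    """feature_maps: (B, C, H, W); kernels: (B, C, k, k), one per channel."""
    B, C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    # Fold batch into channels so each (sample, channel) pair is its own group.
    x = feature_maps.reshape(1, B * C, H, W)
    w = kernels.reshape(B * C, 1, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)
```

Unlike an ordinary convolution, whose weights are shared across the whole dataset, here the kernels are themselves network outputs that vary with the input image, which is how the motion representation gets applied to the image features.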
6. The above figure shows three ways to generate the next frame.
(1) Deterministic model: minimizes the reconstruction error. It cannot capture the multiple possible motions that a shape can have.
(2) Motion prior: this model contains a latent representation z, which encodes the intrinsic dimensionality of the motion field. But it does not see the input image during inference, so it can only learn a joint distribution of motion fields across all classes.
(3) Probabilistic frame predictor: the decoder now takes two inputs, the intrinsic representation z and the image I. Therefore, instead of modelling a joint distribution of motion v, it learns a conditional distribution of motion given the input image.
1. I still have some questions about inference time. The author claims that the model can learn class-specific motions. However, at inference time, the network samples from the prior p(z), which is obtained from the entire training set. So how does the model know which motion to use?
2. The key point of the paper is the cross convolution: the cross convolutional layer applies the learned kernels to the feature maps. How does this cross convolution apply the motion to the image?
3. How is this difference image different from an optical flow image?
4. The motion encoder takes two temporally adjacent frames as input (or the difference image? what is the influence here?).