This paper studies the problem of synthesizing multiple likely future frames from a single input image. The proposed approach is to model future frames in a probabilistic manner. The proposed network structure, called the Cross Convolutional Network, encodes image and motion information as feature maps and convolutional kernels, respectively.
1. In training, the network observes a set of consecutive image pairs from videos and automatically infers the relationship between the images in each pair without any supervision. In testing, the network predicts the conditional distribution P(J|I) of future RGB frames J given an RGB input image I that is not in the training set. From this distribution, the model can synthesize multiple different image samples corresponding to possible future frames of the input image.
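A minimal sketch of this test-time sampling, assuming a trained decoder (the sizes and the single linear layer are placeholders, not the paper's architecture):

```python
import torch

# Toy stand-in for the trained network: at test time z is sampled from the
# prior and decoded, conditioned on I, into a difference image v; multiple
# draws of z yield multiple plausible futures J = I + v.
z_dim, img_dim = 32, 64 * 64
decoder = torch.nn.Linear(z_dim + img_dim, img_dim)   # hypothetical decoder

I = torch.rand(1, img_dim)                            # flattened input frame
futures = []
for _ in range(5):
    z = torch.randn(1, z_dim)                         # z ~ p(z) = N(0, I)
    v = decoder(torch.cat([z, I], dim=1))             # predicted Eulerian motion
    futures.append(I + v)                             # one sampled future frame
```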
2. A conditional variational autoencoder (CVAE) is used to model the complex conditional distribution of future frames.
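In standard CVAE form (my notation, consistent with the paper's P(J|I); v = J - I is the difference image introduced in point 3 below, and q is the recognition model, i.e. the motion encoder), training maximizes the variational lower bound:

```latex
\log p(v \mid I) \;\ge\;
  \mathbb{E}_{q(z \mid v, I)}\!\left[ \log p(v \mid z, I) \right]
  \;-\; D_{\mathrm{KL}}\!\left( q(z \mid v, I) \,\|\, p(z) \right)
```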
3. Instead of finding an intrinsic representation of the image itself, the proposed network finds an intrinsic representation of the intensity changes between two images, known as the difference image or Eulerian motion.
4. Motion is modelled using a set of image-dependent convolution kernels operating over an image pyramid.
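A toy pyramid construction, assuming simple 2x average-pool downsampling and four levels (both are assumptions; this summary does not specify the paper's exact pyramid):

```python
import torch
import torch.nn.functional as F

def image_pyramid(img, levels=4):
    """Multi-scale pyramid via repeated 2x downsampling.
    img: (B, C, H, W) tensor; returns a list from finest to coarsest.
    Each level would get its own feature maps and motion kernels."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(F.avg_pool2d(pyr[-1], kernel_size=2))
    return pyr
```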
5. Network structure:
(a) motion encoder: a variational autoencoder that learns a compact representation z of possible motions;
(b) kernel decoder: learns motion kernels from z;
(c) image encoder: convolutional layers that extract feature maps from the input image I;
(d) cross convolutional layer: takes the outputs of the image encoder and the kernel decoder, and convolves the feature maps with the motion kernels;
(e) motion decoder: regresses the difference image from the combined feature maps.
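A single-scale PyTorch sketch of these five components. Everything here is hedged: the layer sizes, kernel size k, and shallow sub-networks are illustrative assumptions, and the real model is deeper and operates over the pyramid from point 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossConvNet(nn.Module):
    """Minimal single-scale sketch; all sizes are illustrative assumptions."""

    def __init__(self, z_dim=16, c_feat=8, k=5):
        super().__init__()
        self.c_feat, self.k = c_feat, k
        # (a) motion encoder: frame pair (or their difference) -> q(z | v, I)
        self.motion_enc = nn.Sequential(
            nn.Conv2d(2 * 3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2 * z_dim))
        # (b) kernel decoder: z -> one k x k motion kernel per feature channel
        self.kernel_dec = nn.Linear(z_dim, c_feat * k * k)
        # (c) image encoder: input frame -> feature maps
        self.image_enc = nn.Conv2d(3, c_feat, 5, padding=2)
        # (e) motion decoder: cross-convolved features -> difference image
        self.motion_dec = nn.Conv2d(c_feat, 3, 5, padding=2)

    def forward(self, frame1, frame2):
        B = frame1.size(0)
        # (a) encode the motion between the frames as a Gaussian over z
        stats = self.motion_enc(torch.cat([frame1, frame2], dim=1))
        mu, logvar = stats.chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # (b) decode image-dependent kernels from z
        kernels = self.kernel_dec(z).view(B, self.c_feat, self.k, self.k)
        # (c) extract feature maps from the input frame
        feats = self.image_enc(frame1)
        # (d) cross convolution: each sample's kernels convolve its own
        # feature maps (batched as a single grouped convolution)
        x = feats.reshape(1, B * self.c_feat, *feats.shape[2:])
        w = kernels.reshape(B * self.c_feat, 1, self.k, self.k)
        out = F.conv2d(x, w, padding=self.k // 2, groups=B * self.c_feat)
        out = out.reshape(B, self.c_feat, *feats.shape[2:])
        # (e) regress the difference image; the future frame is I + v
        v = self.motion_dec(out)
        return frame1 + v, mu, logvar

# Toy usage: two random 32x32 RGB "frames"
net = CrossConvNet()
J_pred, mu, logvar = net(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
```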
6. The paper compares three ways to generate the next frame:
(1) Deterministic model: trained to minimize the reconstruction error. It cannot capture the multiple possible motions a shape can have; an L2 loss makes it regress to the mean of the possible futures.
(2) Motion prior: this model contains a latent representation z, which encodes the intrinsic dimensionality of the motion fields. But it does not see the input image during inference, so it can only learn a joint distribution of motion fields over all classes.
(3) Probabilistic frame predictor: the decoder now takes two inputs, the intrinsic representation z and the image I. Therefore, instead of modelling a joint distribution of motion v, it learns the conditional distribution of motion given the input image.
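A minimal training loss for variant (3), instantiating the lower bound from point 2 (the L2 reconstruction term assumes a Gaussian likelihood; beta is an assumed KL weight, not a value from the paper):

```python
import torch

def cvae_loss(v_pred, v_true, mu, logvar, beta=1.0):
    """Negative ELBO for the probabilistic frame predictor.
    mu, logvar parameterize q(z | v, I); the KL term is the closed
    form for a diagonal Gaussian q against the prior p(z) = N(0, I)."""
    recon = torch.mean((v_pred - v_true) ** 2)   # stands in for -log p(v|z,I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```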
1. I still have a question about inference. The authors claim that the model can learn class-specific motions. However, at inference time the network samples from the distribution p(z), which is obtained from the entire training set. So how does the model know which motion to use?
2. The key point of the paper is the cross convolution: the cross convolutional layer applies the learned kernels to the feature maps. How does this cross convolution apply the motion to the image?
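One plausible reading (my assumption, not spelled out in this summary): convolution with a shifted delta kernel translates a feature map, so a predicted kernel whose mass is off-center encodes a local displacement, and each feature channel gets its own kernel (batched via the grouped convolution in the sketch above). A tiny demonstration:

```python
import torch
import torch.nn.functional as F

# A feature map with a single bright pixel at (2, 2).
fmap = torch.zeros(1, 1, 5, 5)
fmap[0, 0, 2, 2] = 1.0

# A delta kernel with its mass one step left of center. Since F.conv2d
# is cross-correlation, this shifts the content one pixel to the right.
kernel = torch.zeros(1, 1, 3, 3)
kernel[0, 0, 1, 0] = 1.0

out = F.conv2d(fmap, kernel, padding=1)
print(out[0, 0])   # the bright pixel now sits at (2, 3): it has "moved"
```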
3. How is this difference image different from an optical flow image?
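In the standard terminology (an outside fact, not a claim about the paper's exact formulation): the difference image records the intensity change at each fixed pixel location, while optical flow records where each pixel moves:

```latex
% Eulerian motion (what the network regresses): change at a fixed pixel
v(x, y) = J(x, y) - I(x, y)
% Lagrangian motion (optical flow): a displacement field (u, w) moving pixels
J\big(x + u(x, y),\; y + w(x, y)\big) \approx I(x, y)
```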
4. The motion encoder takes two temporally adjacent frames as input (or is it the difference image? What is the influence of this choice?)