This paper describes a new spatio-temporal video autoencoder, built from a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is a differentiable visual memory composed of convolutional LSTM cells that integrate changes over time.
At each time step, the system receives a video frame as input, predicts the optical flow from the current frame and the LSTM content as a dense transformation map, and applies it to the current frame to predict the next frame.
1. Spatial autoencoder: a classic convolutional encoder-decoder architecture.
2. Temporal autoencoder: it contains three parts: a memory module (convolutional LSTM) with optical flow prediction under a Huber penalty, a grid generator, and an image sampler.
3. The loss function is the reconstruction error between the predicted next frame and the ground-truth next frame, with the Huber penalty on the optical flow map injected as an extra gradient during backpropagation.
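To make the warp-and-predict step concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code): the predicted dense flow offsets an identity sampling grid, and the current frame is bilinearly resampled to form the next-frame prediction. The exact placement of the Huber penalty is simplified here, and the function names and weighting factor lam are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) predicted x/y displacements in pixels."""
    B, _, H, W = frame.shape
    # Identity grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=frame.device),
        torch.linspace(-1, 1, W, device=frame.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    # Convert pixel displacements to normalized coordinates and add to the grid.
    norm_flow = torch.stack(
        (flow[:, 0] * 2.0 / max(W - 1, 1), flow[:, 1] * 2.0 / max(H - 1, 1)), dim=-1
    )
    return F.grid_sample(frame, grid + norm_flow, mode="bilinear", align_corners=True)

def training_loss(pred_next, next_frame, flow, delta=1.0, lam=1e-2):
    # Reconstruction error plus a Huber penalty on the flow map
    # (simplified: the penalty is applied to the flow values directly here).
    rec = F.mse_loss(pred_next, next_frame)
    huber = F.huber_loss(flow, torch.zeros_like(flow), delta=delta)
    return rec + lam * huber
```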
1. In this kind of video prediction work, the model usually outputs only the next frame. Is that enough? Is generating a single frame sufficient to conclude that the method captures the temporal information in the video?
2. It would be really interesting if the output could be an actual movement or even an action, instead of a single frame that is very similar to the previous one.
This paper proposes a probabilistic video model, the Video Pixel Network (VPN), to estimate the discrete joint distribution of the raw pixel values in a video. The novelty of this model is that it leverages the power of PixelRNN, which makes it possible to learn not only the temporal structure but also the spatial and colour dependencies. As a result, the time, space, and colour structure of video tensors is learned and encoded as a four-dimensional dependency chain.
1. The architecture of the VPN consists of two parts:
(1) a resolution-preserving CNN encoder
(2) a PixelCNN decoder
2. Eq. (1) models the joint distribution as a chain of conditionals, so no independence assumptions are needed.
For example, as shown in the figure, the green colour channel value of pixel x in frame F_t depends on:
(1) all the pixel values in every channel from frame 1 to frame t-1,
(2) the already generated red colour value of the same pixel x.
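As a paraphrase of that four-dimensional dependency chain (the notation below is mine, not copied from the paper), Eq. (1) follows the chain-rule factorization over time, space, and colour:

$$
p(\mathbf{x}) = \prod_{t=1}^{T} \prod_{i=1}^{N} \prod_{c=1}^{3}
p\left(x_{t,i,c} \mid \mathbf{x}_{<t},\; x_{t,<i},\; x_{t,i,<c}\right)
$$

where $\mathbf{x}_{<t}$ denotes all previous frames, $x_{t,<i}$ the already generated pixels of the current frame, and $x_{t,i,<c}$ the already generated colour channels of the current pixel.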
1. The reason for using a resolution-preserving CNN encoder is that it allows the model to condition on each pixel that needs to be generated without loss of representational capacity.
2. A convolutional LSTM is used (read the ConvLSTM paper referenced below; this variant of the LSTM has been used frequently in recent work). A minimal ConvLSTM cell sketch is given after this list.
3. The novelty of this paper is that the decoder is conditioned on more information. Several papers seem to be trying to do similar things (GIVING MORE INFORMATION TO THE GENERATIVE MODEL). What else could be conditioned on?
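Since the convolutional LSTM comes up repeatedly in these notes, a minimal cell sketch in PyTorch may help (an assumed framework and my own variable names; the paper referenced below gives the actual formulation). The only change from a standard LSTM is that the gate pre-activations are computed with convolutions, so the state keeps its spatial layout.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (B, hidden_channels, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell state
        h = o * torch.tanh(c)           # update hidden state
        return h, (h, c)
```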
 Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
This paper proposes a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Since the region selection stage is non-differentiable, reinforcement learning is used (a hard-attention model).
1. The model is an RNN that processes inputs sequentially, attending to different locations within the image (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment. (Information needs to be summarized and passed along over time, so an RNN is a solid structural choice.)
2. One advantage of this approach (of all hard-attention approaches) is that the computation is independent of the size of the image (usually, computation must be done at every pixel of the image); only a few local regions are used to model each image.
3. At each time step of the RNN, a glimpse sensor extracts a representation around a given coordinate (shown as A in the figure). This representation is combined with the coordinate information by a glimpse network to produce the glimpse representation (shown as B). A small sketch of this glimpse extraction follows after this list.
4. Information is passed along over time through the internal state of the RNN, and the RNN's outputs are used to (1) predict the next location and (2) produce the action. The action can take various forms; in a recognition task, for example, it could be a softmax layer.
5. Optimization is done by maximizing the total reward the agent collects in the long run, which is carried out with REINFORCE.
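As a concrete (and simplified) illustration of the glimpse sensor in point 3, here is a hedged sketch: progressively larger square windows are resampled around a normalized fixation location and resized to a common resolution, giving a foveated, multi-scale patch stack. All names and the scale schedule are my assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def glimpse(image, loc, base_size=8, n_scales=3):
    """image: (B, C, H, W); loc: (B, 2) with (x, y) in [-1, 1]."""
    B, C, H, W = image.shape
    patches = []
    for s in range(n_scales):
        size = base_size * (2 ** s)
        # Sampling grid of size x size points centred at loc, covering a
        # window of roughly `size` pixels that doubles at each scale.
        lin = torch.linspace(-1, 1, size, device=image.device) * (size / max(H, W))
        gy, gx = torch.meshgrid(lin, lin, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0) + loc.view(B, 1, 1, 2)
        patch = F.grid_sample(image, grid, align_corners=True)
        # Downsample every scale back to the base resolution before stacking.
        patches.append(F.interpolate(patch, size=base_size, mode="bilinear",
                                     align_corners=True))
    return torch.cat(patches, dim=1)  # (B, C * n_scales, base_size, base_size)
```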
1. The insight of this paper is quite clear and has proven effective in many applications.
2. Attention models can be divided into two categories: (1) soft-attention and (2) hard-attention models. A soft-attention model assigns a weight to each pixel or region, but the computation has to be carried out over the entire image. A hard-attention model, on the other hand, makes the computation independent of the image size, so it is faster, but it loses some information (it only depends on the selected regions).
3. Take note of how REINFORCE is used in this task.
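For reference, a minimal sketch of the REINFORCE surrogate loss for the non-differentiable location choice (the variable names, the Gaussian location policy with a fixed sigma, and the baseline handling are assumptions; only the idea follows the paper):

```python
import torch

def reinforce_loss(loc_means, sampled_locs, rewards, baseline, sigma=0.1):
    """loc_means, sampled_locs: (T, B, 2); rewards, baseline: (B,)."""
    dist = torch.distributions.Normal(loc_means, sigma)
    log_probs = dist.log_prob(sampled_locs).sum(dim=-1)   # (T, B)
    advantage = (rewards - baseline).detach()             # baseline reduces variance
    # Gradient ascent on expected reward == descent on this negative surrogate.
    return -(log_probs * advantage).mean()
```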
This paper proposes a feedback-based approach in which the representation is formed iteratively, based on feedback received from the previous iteration's output. The feedback network naturally enables making early predictions at query time, and its output conforms to a hierarchical structure in the label space.
1. The fundamental idea of a feedback network is that the output of the network is routed back into the system as part of an iterative cause-and-effect process.
2. The overall process: the image repeatedly undergoes a shared convolutional operation and a prediction is made at each iteration; the recurrent convolutional operations are trained to produce the best output at each iteration, given a hidden state that carries a notion of the thus-far output.
3. Skip connections inspired by ResNet are added to regulate the flow of the signal through the network.
4. Episodic curriculum learning can be achieved with a time-varying loss function, which encourages the network to recognize objects in a coarse-to-fine manner.
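A loose sketch of the iterative feedback loop described in points 2-4 (my own simplification in PyTorch; the shared recurrent block, read-out head, and loss weighting are assumptions rather than the paper's exact architecture):

```python
import torch
import torch.nn as nn

class FeedbackNet(nn.Module):
    def __init__(self, recurrent_block, classifier, n_iterations=4):
        super().__init__()
        self.block = recurrent_block      # shared ConvLSTM-style module
        self.classifier = classifier      # read-out head, shared across iterations
        self.n_iterations = n_iterations

    def forward(self, image, state):
        predictions = []
        for _ in range(self.n_iterations):
            h, state = self.block(image, state)   # same input routed back with new state
            predictions.append(self.classifier(h))
        return predictions                        # one (increasingly refined) output per iteration

def episodic_loss(predictions, targets_per_iter, weights,
                  criterion=nn.CrossEntropyLoss()):
    # Time-varying weights let early iterations target coarse labels
    # and later ones the fine labels (the curriculum in point 4).
    return sum(w * criterion(p, t)
               for w, p, t in zip(weights, predictions, targets_per_iter))
```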
1. How is it possible that, even when trained without a taxonomy, the network still learns a hierarchical representation of the input?
2. The insight of this paper seems to be to decrease the physical depth of the network while increasing its depth in the temporal domain.
3. In terms of implementation, it seems to be essentially an LSTM applied to a single image with some skip connections. (How do you tell the story when the implementation is this similar to existing work?)
4. How can we extend this insight to other areas? (The network-in-network idea could work.)
This paper proposes an approach that takes into account both the local and global temporal structure of videos to produce descriptions. It uses a spatio-temporal 3-D convolutional neural network to represent short-range temporal dynamics, and a temporal attention mechanism to automatically select the most relevant temporal segments.
This paper tries to handle an important unsolved question in video analysis: how to turn frame-level or short-clip-level features into video-level features that effectively represent the video content. Usually there are two ways to do this: (1) temporal pooling over frame-level features, and (2) a recurrent network structure (RNN, LSTM, etc.). The problem with both is that they treat every single frame equally, yet in a video many frames are simply redundant. Therefore, using an attention model to choose which frames to attend to (a weighted version of pooling) makes a lot of sense.
(1) Local temporal structure is captured with a 3-D CNN. Somewhat surprisingly, the authors do not use raw frames as input, but rather a low-level video representation built from HoG, HoF, and MBH descriptors; these representations form a cuboid that serves as the input to the 3-D CNN.
(2) The (soft) temporal attention mechanism is implemented as a weighted sum of the temporal feature vectors. The attention weight at frame i indicates the relevance of the i-th temporal feature in the input video given all the previously generated results (see the sketch after this list).
For more, refer to the Bahdanau et al. attention paper referenced below.
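A minimal sketch of such soft temporal attention (the scoring MLP and variable names are assumptions, not the paper's exact formulation): the decoder state summarizing the previously generated results scores each temporal feature, the scores are softmax-normalized, and the context vector is their weighted sum.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, state_dim, hidden_dim=128):
        super().__init__()
        # Small MLP that scores each temporal feature against the decoder state.
        self.score = nn.Sequential(
            nn.Linear(feat_dim + state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features, decoder_state):
        """features: (B, T, feat_dim); decoder_state: (B, state_dim)."""
        B, T, _ = features.shape
        state = decoder_state.unsqueeze(1).expand(B, T, -1)
        scores = self.score(torch.cat([features, state], dim=-1)).squeeze(-1)  # (B, T)
        weights = torch.softmax(scores, dim=-1)             # relevance of each frame/clip
        context = torch.bmm(weights.unsqueeze(1), features).squeeze(1)  # (B, feat_dim)
        return context, weights
```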
(1) Is the low-level representation useful? Since a CNN can act as a feature extractor, is it really necessary to compute these hand-crafted features first?
(2) I really like the temporal attention idea; I thought about this for some time. I used to think this kind of mechanism had to be learned with reinforcement learning, but apparently not.
(3) Is the intuition behind the temporal attention model something like this: if one frame is too similar to previous frames, its features add little new information, so the attention weight for that frame should be lowered?
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.