The paper describes a new spatio-temporal video autoencoder, built from a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is a differentiable visual memory composed of convolutional LSTM cells that integrate changes over time.
At each time step, the system receives a video frame as input, predicts the optical flow from the current frame and the LSTM memory content as a dense transformation map, and applies it to the current frame to predict the next frame.
1. Spatial autoencoder: a classic convolutional encoder-decoder architecture.
2. Temporal autoencoder: it contains three parts: a memory module (convolutional LSTM) that predicts the optical flow map under a Huber penalty, a grid generator, and an image sampler.
3. The loss function is the reconstruction error between the predicted next frame and the ground-truth next frame, with the Huber penalty gradient on the optical flow map injected during backpropagation (a minimal sketch of the warp-and-reconstruct step follows below).
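To make the warp-and-reconstruct step concrete, here is a minimal PyTorch-style sketch, not the authors' code: the ConvLSTM memory and the flow-prediction network are assumed to exist elsewhere, and only the grid generation, image sampling, and the Huber-penalised loss are spelled out.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp `frame` (B,C,H,W) with a dense flow map `flow` (B,2,H,W), in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    # Absolute sampling positions = base grid + predicted displacement.
    x = xs.unsqueeze(0) + flow[:, 0]
    y = ys.unsqueeze(0) + flow[:, 1]
    # Normalise to [-1, 1] as required by grid_sample.
    x = 2.0 * x / (w - 1) - 1.0
    y = 2.0 * y / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def step_loss(frame_t, frame_t1, flow, delta=1.0, lam=0.01):
    """Huber reconstruction loss on the warped prediction + Huber smoothness on the flow."""
    pred_t1 = warp(frame_t, flow)
    recon = F.smooth_l1_loss(pred_t1, frame_t1, beta=delta)
    # Huber penalty on spatial gradients of the flow encourages smooth motion fields.
    dx = flow[..., :, 1:] - flow[..., :, :-1]
    dy = flow[..., 1:, :] - flow[..., :-1, :]
    smooth = F.smooth_l1_loss(dx, torch.zeros_like(dx), beta=delta) + \
             F.smooth_l1_loss(dy, torch.zeros_like(dy), beta=delta)
    return recon + lam * smooth
```

The weighting `lam` and the exact form of the flow penalty are assumptions for illustration; the point is that both the sampling grid and the Huber terms are differentiable, so the whole pipeline trains end-to-end from reconstruction error alone.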
1. In this kind of video prediction work, the output is usually just the next frame. Is that enough? Does generating a single frame suffice to conclude that the method captures enough temporal information in the video?
2. It would be really cool if the output could actually be some movement, or even an action, instead of one frame that is very similar to the previous frame.
This paper presents a new way to capture video-wide temporal information for action recognition. The proposed framework uses a ranking machine to learn the evolution of appearance over time, returning a ranking function. The parameters of that ranking function are then used as the feature vector for action recognition.
Intuitively, videos with similar content should have similar evolution behaviour, so the parameters learned to represent that evolution should be similar. In this way, they can be used to learn classification boundaries.
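A minimal rank-pooling sketch, assuming pre-extracted per-frame features and the commonly used SVR approximation (not the authors' exact pipeline): fit a linear function whose scores increase with time over the smoothed frame features, and use its weight vector as the video descriptor.

```python
import numpy as np
from sklearn.svm import LinearSVR

def rank_pool(frame_features, C=1.0):
    """frame_features: (T, D) array of per-frame features, in temporal order."""
    T = frame_features.shape[0]
    # Time-varying mean over the frames seen so far ("smoothing" before ranking).
    cummean = np.cumsum(frame_features, axis=0) / np.arange(1, T + 1)[:, None]
    # Regress the frame index from the smoothed features; the learned weights
    # parameterise how appearance evolves over time in this video.
    svr = LinearSVR(C=C, fit_intercept=False, max_iter=10000)
    svr.fit(cummean, np.arange(1, T + 1))
    return svr.coef_  # (D,) video-level descriptor fed to an action classifier
```

The returned weight vector is the "parameters of the ranking function" mentioned above; stacking these vectors over a dataset gives the training matrix for a standard classifier (e.g., a linear SVM).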
This is a novel idea. Similarly, there is a paper (Wang et al., CVPR 2016) that models an action as a transformation that takes the surroundings from state A (the pre-action state) to state B (the post-action state); it can therefore be modelled with a Siamese network.
The good thing about these two papers is that we do not even need to care about how good the performance is (of course it is still a very important factor); the idea itself is novel enough. However, since the idea is completely new, and does not follow an already-successful recipe, it is not guaranteed to work.
Wang, X., Farhadi, A., Gupta, A.: Actions ~ Transformations. In: CVPR 2016.
This paper proposes an approach that takes into account both the local and global temporal structure of videos to produce descriptors. It uses a spatio-temporal 3-D convolutional neural network to represent short-range temporal dynamics, and a temporal attention mechanism to automatically select the most relevant temporal segments.
This paper tries to handle an important unsolved question in the video analysis area: how to use frame-level or short-clip-level features to generate video-level features that effectively represent the video content. Usually there are two ways to do this: (1) temporal pooling over frame-level features, and (2) a recurrent network (RNN, LSTM, etc.). The problem with both is that they treat every frame equally, whereas in a video many frames are simply redundant. Therefore, using an attention model to choose which frames to attend to (a weighted version of pooling) makes complete sense.
(1) Local temporal structure is captured with a C3D network. Strangely, though, the authors do not feed raw frames as input, but rather a so-called low-level video representation built from HoG, HoF, and MBH descriptors. These representations form a cube that serves as the input to the C3D model.
(2) The (soft) temporal attention mechanism can be implemented as a weighted sum of the temporal feature vectors. The attention weight at frame i indicates the relevance of the i-th temporal feature in the input video given all the previously generated outputs (a minimal sketch follows below).
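A minimal PyTorch-style sketch of soft temporal attention (assumed layer names and dimensions, not the authors' implementation): score each temporal feature against the decoder's previous hidden state, normalise with a softmax over time, and take the weighted sum as the context vector.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=128):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, prev_hidden):
        """features: (B, T, feat_dim); prev_hidden: (B, hidden_dim)."""
        # Relevance of each temporal segment given what has been generated so far.
        e = self.score(torch.tanh(
            self.w_feat(features) + self.w_hidden(prev_hidden).unsqueeze(1)
        )).squeeze(-1)                      # (B, T) unnormalised scores
        alpha = torch.softmax(e, dim=1)     # attention weights sum to 1 over time
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # (B, feat_dim)
        return context, alpha
```

Because the weights are produced by a softmax, this is exactly the "weighted version of pooling" described above, and it remains fully differentiable.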
For more, refer to Bahdanau et al. (ICLR 2015).
(1) Is the low-level representation useful? Since a CNN can be treated as a feature extractor, is it really necessary to compute these hand-crafted features first?
(2) I really like the temporal attention idea; I have thought about it for some time. I used to think this kind of mechanism had to be learned with reinforcement learning, but apparently not: soft attention is differentiable and can be trained with standard backpropagation.
(3) Is the intuition behind the temporal attention model that, if one frame is too similar to the previous frames, its features would be nearly redundant, and therefore its attention weight should be lowered?
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR 2015.