The paper describes a new spatio-temporal video autoencoder, built from a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is realized as a differentiable visual memory composed of convolutional LSTM cells that integrate changes over time.
At each time step, the system receives a video frame as input, predicts the optical flow from the current frame and the LSTM memory content as a dense transformation map, and applies this map to the current frame to predict the next frame.
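To make the per-step operation concrete, here is a minimal sketch (not the authors' code; the tensor shapes, names, and normalization are my assumptions) of how a predicted dense flow map can be applied to the current frame with a differentiable bilinear sampler, in the spirit of the grid generator and sampler summarized below:

```python
# Sketch only: warp the current frame with a predicted dense flow field
# via differentiable bilinear sampling (spatial-transformer style).
import torch
import torch.nn.functional as F

def warp_frame(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixel offsets (assumed layout)."""
    B, _, H, W = frame.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=frame.device),
        torch.linspace(-1, 1, W, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm_flow = torch.stack(
        (flow[:, 0] * 2 / max(W - 1, 1), flow[:, 1] * 2 / max(H - 1, 1)), dim=-1
    )
    grid = base + norm_flow
    # Bilinear sampling of the current frame at the displaced grid
    # gives the predicted next frame.
    return F.grid_sample(frame, grid, align_corners=True)
```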
1. Spatial autoencoder: a classic convolutional encoder-decoder architecture.
2. Temporal autoencoder: composed of three parts: a memory module (convolutional LSTM) with optical flow prediction under a Huber penalty, a grid generator, and an image sampler.
3. The loss function is the reconstruction error between the predicted next frame and the ground-truth next frame, with the gradient of a Huber penalty on the optical flow map injected during backpropagation (see the sketch after this list).
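As I understand the objective, a rough sketch would be the following (the exact reconstruction term and the weighting are assumptions on my part): the reconstruction error on the predicted next frame plus a Huber penalty on the spatial gradients of the flow map, which encourages locally smooth flow.

```python
# Sketch only: reconstruction loss plus a Huber (smooth L1) smoothness
# penalty on the spatial gradients of the predicted flow map.
import torch
import torch.nn.functional as F

def total_loss(pred_next, gt_next, flow, smooth_weight=0.01):
    """pred_next, gt_next: (B, C, H, W); flow: (B, 2, H, W). Weight is illustrative."""
    recon = F.mse_loss(pred_next, gt_next)
    # Finite-difference spatial gradients of the flow field.
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    # Huber penalty on the gradients encourages locally smooth flow.
    smooth = F.smooth_l1_loss(dx, torch.zeros_like(dx)) + \
             F.smooth_l1_loss(dy, torch.zeros_like(dy))
    return recon + smooth_weight * smooth
```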
1. In this kind of video prediction work, the model usually outputs only the next frame. Is that sufficient? Is generating a single frame enough to conclude that the method captures the temporal information in the video?
2. It would be really cool if the output were actual movement, or even an action, rather than a single frame that is very similar to the previous one.