This paper proposes an approach that takes both the local and global temporal structure of videos into account when producing descriptions. It uses a spatio-temporal 3-D convolutional neural network to capture short-range temporal dynamics, and a temporal attention mechanism to automatically select the most relevant temporal segments.
This paper tackles an important open question in video analysis: how to aggregate frame-level or clip-level features into a video-level representation that effectively captures the video's content. There are usually two ways to do this: (1) temporal pooling over frame-level features, or (2) a recurrent structure (RNN, LSTM, etc.). The problem with both is that they treat every frame equally, yet in a video many frames are simply redundant. Using an attention model to choose which frames to attend to (a weighted version of pooling) therefore makes a lot of sense.
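The contrast between the two pooling schemes can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the features and scores are random stand-ins for learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level features: T frames, each a D-dim vector.
T, D = 8, 4
frame_feats = rng.normal(size=(T, D))

# (1) Plain temporal pooling: every frame weighted equally.
mean_pooled = frame_feats.mean(axis=0)

# (2) Attention pooling: a weighted sum, where the weights come from a
# per-frame relevance score (random here, learned in the paper).
scores = rng.normal(size=T)
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over time
attended = (weights[:, None] * frame_feats).sum(axis=0)

# Both yield one D-dim video-level descriptor; mean pooling is just the
# special case where every attention weight equals 1/T.
uniform = np.full(T, 1.0 / T)
assert np.allclose((uniform[:, None] * frame_feats).sum(axis=0), mean_pooled)
```

This makes the reviewer's point concrete: attention does not change the aggregation operator, it only replaces the uniform weights with content-dependent ones.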
(1) The local temporal structure is captured by a C3D network. Somewhat surprisingly, the authors do not feed raw frames into it; instead they use a so-called low-level video representation built from hand-crafted descriptors (HoG, HoF, and MBH). These descriptor maps are stacked into a cube that serves as the input to the C3D model.
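A rough sketch of that input construction, under assumed grid sizes (the actual resolutions and histogram binning in the paper differ): the three descriptor maps become channels of one spatio-temporal cube, and a 3-D kernel then slides over time as well as space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spatio-temporal grid over a short clip.
T, H, W = 6, 8, 8
hog = rng.random((T, H, W))   # stand-ins for local HoG / HoF / MBH
hof = rng.random((T, H, W))   # histogram responses on the grid
mbh = rng.random((T, H, W))

# Stack the three descriptor maps as channels -> one input cube of
# shape (channels, time, height, width) for a 3-D ConvNet.
cube = np.stack([hog, hof, mbh], axis=0)       # (3, T, H, W)

# A naive single 3x3x3 convolution with valid padding, just to show
# how one 3-D kernel mixes the descriptor channels and the local
# temporal neighborhood at once.
kT = kH = kW = 3
kernel = rng.normal(size=(3, kT, kH, kW))
out = np.zeros((T - kT + 1, H - kH + 1, W - kW + 1))
for t in range(out.shape[0]):
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            out[t, y, x] = np.sum(kernel * cube[:, t:t+kT, y:y+kH, x:x+kW])
```

The point of the cube layout is that convolving over the time axis gives the "local temporal structure" the review mentions, rather than treating each time step independently.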
(2) The (soft) temporal attention mechanism can be implemented as a weighted sum of the temporal feature vectors. The attention weight for segment i indicates the relevance of the i-th temporal feature in the input video, given all previously generated outputs.
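One step of this soft attention can be sketched as follows. This assumes an additive (Bahdanau-style) scoring function; the parameter names `Wf`, `Wh`, `v` are hypothetical placeholders for learned weights, and the decoder state stands in for the "previously generated outputs".

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_attention(feats, h_prev, Wf, Wh, v):
    """One step of soft temporal attention.

    feats : (T, D) temporal feature vectors, one per segment
    h_prev: (H,)   previous decoder hidden state
    Wf, Wh, v:     hypothetical learned scorer parameters
    """
    # Relevance score e_i for each temporal feature, conditioned on
    # the decoder state (additive scoring).
    e = np.tanh(feats @ Wf + h_prev @ Wh) @ v          # (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                               # softmax weights
    # Context vector: weighted sum of the temporal features.
    context = alpha @ feats                            # (D,)
    return context, alpha

T, D, H, A = 5, 4, 3, 6
feats = rng.normal(size=(T, D))
h_prev = rng.normal(size=H)
Wf = rng.normal(size=(D, A))
Wh = rng.normal(size=(H, A))
v = rng.normal(size=A)

context, alpha = soft_attention(feats, h_prev, Wf, Wh, v)
```

Because the weights `alpha` depend on `h_prev`, a different subset of segments can dominate at each decoding step, which is exactly what allows the model to down-weight redundant frames.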
For more on the attention mechanism, refer to Bahdanau et al.
(1) Is the low-level representation really useful? Since a CNN can itself act as a feature extractor, is it necessary to compute these hand-crafted features first?
(2) I really like the temporal attention idea; I had thought about it for some time and assumed this kind of frame selection had to be learned via reinforcement learning. Apparently not: soft attention is differentiable, so it can be trained with standard backpropagation.
(3) Is the intuition behind the temporal attention model something like this: if a frame is too similar to previous frames, its features are largely redundant, so its attention weight should be lowered?
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.