This paper presents a new way to capture video-wide temporal information for action recognition. The proposed framework feeds the per-frame appearance features of a video into a ranking machine, which learns how the appearance evolves over time and returns a ranking function. The parameters of that ranking function are then used as the feature vector for action recognition.
Intuitively, videos with similar content should exhibit similar evolution behaviour, so the parameters learned to represent that evolution should also be similar. In this way, the parameters themselves can serve as features for learning classification boundaries.
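To make the idea concrete, here is a toy sketch of my own (not the authors' implementation): the paper fits a ranking machine, whereas this sketch approximates it with a least-squares fit of the frame index against smoothed per-frame features, and returns the fitted parameter vector as the video-level descriptor. Feature dimensions and the smoothing choice are assumptions for illustration.

```python
import numpy as np

def rank_pool(frames):
    """Toy rank pooling: frames is a (T, D) array of per-frame features.

    Fit a linear function w such that w @ v_t increases with time t
    (approximated here by least squares on the frame index), and return
    w as the fixed-length descriptor of the whole video.
    """
    T, _ = frames.shape
    # Running mean over time smooths per-frame noise before fitting.
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    w, *_ = np.linalg.lstsq(V, t, rcond=None)
    return w

# Usage: a 20-frame clip with 2-D features gives a 2-D descriptor.
frames = np.arange(1, 21, dtype=float)[:, None] * np.array([1.0, 2.0])
w = rank_pool(frames)
```

Two clips whose appearance evolves the same way would yield similar `w`, which is exactly why the parameters can be compared by a downstream classifier.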
This is a novel idea. In a similar spirit, another paper (Wang et al., CVPR 2016) models an action as a transformation that maps the surroundings from state A (the pre-action state) to state B (the post-action state); this transformation can then be learned with a Siamese network.
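The transformation view can be caricatured in a few lines. This is a deliberately simplified sketch, not the paper's architecture: the actual work uses deep Siamese CNN towers and learns per-action transformations, while here the shared encoder is a single random-weight layer and the action is a plain matrix acting in embedding space.

```python
import numpy as np

def encode(x, W):
    """Shared (Siamese) encoder: the same weights W embed both the
    pre-action and the post-action observation."""
    return np.tanh(W @ x)

def action_score(pre, post, W, T):
    """Score how well action T (a linear map in embedding space)
    explains the change from `pre` to `post`; higher is better."""
    pred = T @ encode(pre, W)          # predicted post-action embedding
    return -float(np.linalg.norm(pred - encode(post, W)))

# Usage: with the identity "action", an unchanged scene scores best.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)
y = rng.normal(size=6)
```

At recognition time one would pick the action whose transformation best maps the precondition embedding onto the effect embedding.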
The appealing thing about these two papers is that the performance almost does not matter (although it is of course still an important factor): the ideas themselves are novel enough. However, precisely because the ideas are completely new and do not follow an already-successful recipe, they are not guaranteed to work.
Wang, X., Farhadi, A., Gupta, A.: Actions ~ Transformations. In: CVPR 2016.