This paper proposes a novel recurrent neural network model that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. Because the region-selection stage is non-differentiable, the model is trained with reinforcement learning, which makes it a hard-attention model.
1. The model is an RNN that processes inputs sequentially, attending to different locations within the image (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment. (Information needs to be summarized and passed along through time, so an RNN is a solid structural choice.)
2. One advantage of this approach (and of all hard-attention approaches) is that the amount of computation is independent of the size of the image, whereas the computation usually has to be done at every pixel. Only a few local regions are used to model each image.
3. At each time step of the RNN, a glimpse sensor extracts a representation of the image around a given coordinate (shown as A). The Glimpse Network (shown as B) combines this representation with the coordinate information to produce the glimpse representation.
4. Information is passed along through time via the internal representation of the RNN. The outputs of the RNN are used to predict (1) the next location and (2) the action. The action can take various forms. In a recognition task, for example, it could be the output of a softmax layer.
5. The model is optimized by maximizing the total reward the agent can obtain in the long run. This is carried out with REINFORCE.
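To make points 2 and 3 concrete, here is a minimal NumPy sketch of a glimpse sensor. It is an illustration under stated assumptions, not the authors' implementation: the function name `extract_glimpse`, the parameters `patch_size` and `num_scales`, the average-pooling downsampling, and the convention that locations lie in [-1, 1] with (0, 0) at the image center are all choices made for this sketch. The point it shows is that the output shape, and hence the downstream computation, does not depend on the image size.

```python
import numpy as np

def extract_glimpse(image, loc, patch_size=8, num_scales=3):
    """Hypothetical glimpse sensor: crops num_scales square patches centred
    at loc, each twice as wide as the previous one, and average-pools every
    patch down to patch_size x patch_size."""
    h, w = image.shape
    # Assumed convention: loc = (y, x) in [-1, 1], (0, 0) is the image centre.
    cy = int(round((loc[0] + 1.0) / 2.0 * (h - 1)))
    cx = int(round((loc[1] + 1.0) / 2.0 * (w - 1)))
    # Zero-pad so that the largest patch never leaves the array.
    pad = (patch_size * 2 ** (num_scales - 1)) // 2
    padded = np.pad(image, pad, mode="constant")
    patches = []
    for k in range(num_scales):
        factor = 2 ** k
        size = patch_size * factor
        top = cy + pad - size // 2
        left = cx + pad - size // 2
        crop = padded[top:top + size, left:left + size]
        # Average-pool by `factor` so every scale has the same resolution.
        pooled = crop.reshape(patch_size, factor, patch_size, factor).mean(axis=(1, 3))
        patches.append(pooled)
    return np.stack(patches)  # shape: (num_scales, patch_size, patch_size)

small = extract_glimpse(np.ones((28, 28)), loc=(0.0, 0.0))
large = extract_glimpse(np.ones((100, 100)), loc=(0.0, 0.0))
print(small.shape, large.shape)
```

Both calls return an array of the same shape, so whatever network consumes the glimpse does the same amount of work for a 28x28 input as for a 100x100 one.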
1. The insight of this paper is clear, and it has proven effective in many applications.
2. Attention models can be divided into two categories: (1) soft-attention models and (2) hard-attention models. A soft-attention model assigns a weight to each pixel or region, but the computation has to be carried out over the entire image. A hard-attention model, on the other hand, makes the computation independent of the size of the image, so it is faster. However, it loses some information, because its output depends only on the selected regions.
3. Take note of how REINFORCE is used in this task.
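As a reminder of what REINFORCE does here, the following is a toy sketch rather than the paper's training loop. It assumes a Gaussian location policy with a fixed standard deviation whose mean is the only trainable parameter, a single-step episode, a binary reward that stands in for "the classification was correct", and a batch-mean baseline for variance reduction. The function name `reinforce_location_demo` and all hyperparameters are hypothetical. The update is the standard score-function estimator: the gradient of the log-probability of the sampled location, weighted by reward minus baseline.

```python
import numpy as np

def reinforce_location_demo(target=(0.3, -0.3), sigma=0.2, radius=0.3,
                            lr=0.01, batch=64, updates=300, seed=0):
    """Toy REINFORCE: learn the mean of a Gaussian location policy so that
    sampled locations land near `target` (all names are illustrative)."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    mu = np.zeros(2)  # policy mean, the only trainable parameter here
    for _ in range(updates):
        # Sampling the location is the non-differentiable step.
        locs = mu + sigma * rng.standard_normal((batch, 2))
        # Binary reward: 1 if the sampled location is close to the target.
        rewards = (np.linalg.norm(locs - target, axis=1) < radius).astype(float)
        # Batch-mean baseline (a simplification chosen for this sketch).
        baseline = rewards.mean()
        # Gradient of log N(loc; mu, sigma^2 I) with respect to mu.
        grad_log_prob = (locs - mu) / sigma ** 2
        # Score-function estimate of the gradient of the expected reward.
        grad = ((rewards - baseline)[:, None] * grad_log_prob).mean(axis=0)
        mu = mu + lr * grad  # gradient ascent on expected reward
    return mu

print(reinforce_location_demo())
```

No gradient flows through the sampling itself. The reward only rescales the log-probability gradient, which is why this estimator can train the location policy even though region selection is non-differentiable.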