This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural network architecture for image generation.
The DRAW network combines a novel spatial attention mechanism with a sequential variational auto-encoding framework, allowing for the iterative construction of complex images.
1. The architecture is a pair of recurrent neural networks: (1) an encoder network that compresses the images presented during training, and (2) a decoder that reconstitutes images after receiving codes. The whole system is trained end-to-end by gradient descent.
2. Three major differences from a conventional VAE:
(1) the encoder is privy to the decoder's previous outputs, allowing it to tailor the codes it sends according to the decoder's behaviour so far.
(2) The decoder's outputs are successively accumulated into the distribution that will ultimately generate the data (see the equations after this list).
(3) A dynamically updated attention mechanism is used to focus on specific regions.
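Concretely, difference (2) can be written as a cumulative canvas, my reading of the DRAW formulation (notation approximate): the decoder output is added to a canvas matrix at every time step, and only the final canvas parameterizes the output distribution, e.g. Bernoulli means for binary images.

```latex
c_t = c_{t-1} + \mathrm{write}(h_t^{\mathrm{dec}}), \quad t = 1, \dots, T
\qquad\Rightarrow\qquad
p(x \mid z_{1:T}) = \prod_i \mathcal{B}\big(x_i \mid \sigma(c_{T,i})\big)
```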
3. Selective attention model: an array of 2D Gaussian filters is applied to the image, yielding an image 'patch' of smoothly varying location and zoom, similar to Graves (2013) and Graves et al. (2014). The attention model makes it possible for the network to focus only on a certain region of interest. It resembles Mnih et al. (2014), but is differentiable.
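A minimal numpy sketch of such a Gaussian filterbank read operation (my own illustrative names and hyper-parameters, following the paper's filterbank equations up to indexing convention):

```python
import numpy as np

def filterbank(N, A, center, stride, sigma2):
    """N 1-D Gaussian filters over an axis of length A, with grid
    centers spaced by `stride` around `center` (DRAW-style)."""
    # 0-based grid index; matches the paper's mu_i up to indexing
    mu = center + (np.arange(N) - N / 2 + 0.5) * stride
    a = np.arange(A)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)  # row-normalize

def read(image, gx, gy, stride, sigma2, N=12):
    """Apply the 2-D filter grid: an N x N patch whose location
    (gx, gy) and zoom (stride, sigma2) vary smoothly."""
    H, W = image.shape
    Fy = filterbank(N, H, gy, stride, sigma2)   # (N, H)
    Fx = filterbank(N, W, gx, stride, sigma2)   # (N, W)
    return Fy @ image @ Fx.T                    # (N, N) patch

patch = read(np.random.rand(28, 28), gx=14.0, gy=14.0, stride=2.0, sigma2=1.0)
print(patch.shape)  # (12, 12)
```

Because every operation is a smooth function of (gx, gy, stride, sigma2), gradients flow through the attention parameters; this is exactly what distinguishes it from the sampled glimpses of Mnih et al. (2014).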
1. The use of an LSTM/RNN on images is interesting; it actually mimics the drawing process. How do humans practically make a video or film? First shoot every scene (short clips of video), then link these clips together to form a film. Is it possible to do something similar with this paper's insight? It seems like the network-in-network (RNN-in-RNN) structure from the Feedback Network paper combined with this paper could work.
2. The differentiable attention model is worth paying attention to; it could be useful in a lot of applications.
3. I have seen several papers about VAEs; the difference is sometimes just the input. In this paper, the input of the VAE is a combination of several pieces of information. How should each input be used, or is simple concatenation enough?
 Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212, 2014.
The paper introduces a stochastic variational inference and learning algorithm that scales to large datasets. The contribution is twofold.
(1) A reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.
(2) For datasets with continuous latent variables per data point, posterior inference can be made especially efficient by fitting an approximate inference model to the intractable posterior using the proposed lower bound estimator.
1. The lower bound estimator, a stochastic objective function, can be derived for a variety of directed graphical models with continuous latent variables.
2. Key equations leading to the VAE objective:
The marginal likelihood can be written as the sum of two terms: (1) the KL divergence between the approximate and the true posterior, and (2) the variational lower bound; the lower bound itself splits into a KL regularization term and the expected reconstruction log-likelihood.
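In the paper's notation, the two decompositions are:

```latex
\log p_\theta(x^{(i)}) =
D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\big)
+ \mathcal{L}(\theta, \phi; x^{(i)})
```

```latex
\mathcal{L}(\theta, \phi; x^{(i)}) =
-D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)
+ \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]
```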
3. Reparameterization trick: let z be a continuous random variable with some distribution conditioned on x. It can often be expressed as the output of a deterministic function z = g_φ(ε, x) of an auxiliary noise variable ε with a fixed distribution.
4. The steps in the VAE are (1) encode, (2) reparameterize, and (3) decode. The reconstruction loss plus the KL term is then optimized (a sketch follows below).
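A minimal single-sample numpy sketch of these three steps plus the loss (toy one-layer "networks" with made-up sizes, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                          # one flattened MNIST-like input

# Hypothetical toy weights; a real model learns these by SGD.
W_enc = rng.normal(0, 0.01, (2 * 20, 784))   # 20 latent dims -> mu, log_var
W_dec = rng.normal(0, 0.01, (784, 20))

# (1) encode: q(z|x) = N(mu, diag(sigma^2))
h = W_enc @ x
mu, log_var = h[:20], h[20:]

# (2) reparameterize: z = mu + sigma * eps, eps ~ N(0, I),
# which keeps the sampling step differentiable w.r.t. mu and sigma
eps = rng.standard_normal(20)
z = mu + np.exp(0.5 * log_var) * eps

# (3) decode: Bernoulli means for the reconstruction
x_hat = 1.0 / (1.0 + np.exp(-(W_dec @ z)))

# Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I))
recon = -np.sum(x * np.log(x_hat + 1e-8) + (1 - x) * np.log(1 - x_hat + 1e-8))
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
loss = recon + kl
```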
A foundational paper for VAEs; deriving the equations by hand helps one understand them better.
This paper studies the problem of synthesizing a number of likely future frames from a single input image. The proposed approach is to model future frames in a probabilistic manner. The proposed network structure, called Cross Convolutional Network, encodes image and motion information as feature maps and convolutional kernels, respectively.
1. In training: the network observes a set of consecutive image pairs in videos and automatically infers the relationship between the images in each pair without any supervision. In testing: the network predicts the conditional distribution P(J|I) of future RGB images J given an RGB input image I that is not in the training set. Using this distribution, the model can synthesize multiple different image samples corresponding to possible future frames of the input image.
2. Conditional variational autoencoder is used to model the complex conditional distribution of future frames.
3. Instead of finding an intrinsic representation of the image itself, the proposed network finds an intrinsic representation of intensity changes between two images, known as difference image or Eulerian motion.
4. Motion is modelled using a set of image-dependent convolution kernels operating over an image pyramid.
5. Network structure:
(a) a motion encoder, a variational autoencoder component that learns a compact representation z of possible motions;
(b) a kernel decoder, which learns motion kernels from z;
(c) an image encoder, which consists of convolutional layers extracting feature maps from the input image I;
(d) a cross convolutional layer, which takes the outputs of the image encoder and the kernel decoder and convolves the feature maps with the motion kernels;
(e) a motion decoder, which regresses the difference image from the combined feature maps.
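A minimal numpy/scipy sketch of step (d), the cross convolutional layer: each feature map is convolved with its own image-dependent kernel (shapes and names are illustrative, not the paper's code):

```python
import numpy as np
from scipy.signal import convolve2d

def cross_convolve(feature_maps, kernels):
    """Cross convolution sketch: one image-dependent kernel per channel.
    feature_maps: (C, H, W) from the image encoder
    kernels:      (C, k, k) from the kernel decoder
    """
    return np.stack([
        convolve2d(fm, k, mode="same")   # per-channel convolution
        for fm, k in zip(feature_maps, kernels)
    ])

feats = np.random.rand(4, 32, 32)   # toy feature maps
kerns = np.random.rand(4, 5, 5)     # toy motion kernels
out = cross_convolve(feats, kerns)
print(out.shape)  # (4, 32, 32)
```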
6. The paper's figure contrasts three ways to generate the next frame.
(1) Deterministic model: minimizes the reconstruction error. It cannot capture the multiple possible motions that a shape can have.
(2) Motion prior: This model contains a latent representation z, which encodes the intrinsic dimensionality of the motion field. But it does not see the input image during inference. So it can only learn a joint distribution of motion fields for all classes.
(3) Probabilistic frame predictor: the decoder now takes two inputs, the intrinsic representation z and the image I. Therefore, instead of modelling a joint distribution of motion v, it learns a conditional distribution of motion given the input image.
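My shorthand for the three variants, with v = J − I the difference image (notation approximate):

```latex
\text{(1) deterministic:}\; \hat{J} = f(I)
\qquad
\text{(2) motion prior:}\; p(v) = \int p_\theta(v \mid z)\, p(z)\, dz
\qquad
\text{(3) probabilistic:}\; p(v \mid I) = \int p_\theta(v \mid z, I)\, p(z)\, dz
```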
1. I still have some questions about inference. The authors claim that the model can learn class-specific motions. However, at inference time the network samples from the prior p(z), which is fit on the entire training set, so how does the model know which motion to use?
2. The key point of the paper is the cross convolution: the cross convolutional layer applies the learned kernels to the feature maps. How exactly does this operation apply the motion to the image?
3. How does this difference image differ from an optical-flow image?
4. The motion encoder takes two temporally adjacent frames as input (or is it the difference image? What is the influence here?)
This paper proposes a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. Since the region selection stage is non-differentiable, reinforcement learning is used (a hard-attention model).
1. The model is an RNN structure that processes inputs sequentially, attending to different locations within the image (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment. (Information needs to be summarized and passed along time, so an RNN is a solid structural choice.)
2. One advantage of this approach (of all hard-attention approaches) is that the computation is independent of the size of the image (usually, computation has to be done at every pixel); only several local regions are used to model each image.
3. In each time-step of the RNN, a glimpse sensor is used to extract a representation for a given coordinate (part A of the paper's figure). This representation is combined with the coordinate information to generate the glimpse representation via a glimpse network (part B).
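A rough numpy sketch of one glimpse step (single-resolution crop with made-up layer sizes; the paper's sensor extracts several resolutions, so this is an approximation, not the authors' exact network):

```python
import numpy as np

rng = np.random.default_rng(0)

def glimpse_sensor(image, loc, size=8):
    """Crop a size x size patch centred at loc = (row, col),
    clamped to the image bounds."""
    H, W = image.shape
    r = int(np.clip(loc[0] - size // 2, 0, H - size))
    c = int(np.clip(loc[1] - size // 2, 0, W - size))
    return image[r:r + size, c:c + size]

# Hypothetical toy weights for the two-pathway glimpse network.
W_what = rng.normal(0, 0.01, (128, 64))   # "what": patch pathway
W_where = rng.normal(0, 0.01, (128, 2))   # "where": location pathway

def glimpse_network(image, loc):
    patch = glimpse_sensor(image, loc).reshape(-1)
    h_what = np.maximum(W_what @ patch, 0)                    # ReLU
    h_where = np.maximum(W_where @ np.asarray(loc, float), 0)
    return np.maximum(h_what + h_where, 0)   # combined glimpse feature

g = glimpse_network(rng.random((28, 28)), (14, 14))
print(g.shape)  # (128,)
```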
4. The information passed along time is carried in the RNN's internal state. The outputs of the RNN are used to (1) predict the next location and (2) produce the action. The action can take various forms; in a recognition task, for example, it could be a softmax classification.
5. Optimization is done by maximizing the total reward the agent receives in the long run. This is carried out with REINFORCE (see below).
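The REINFORCE gradient estimator used here, as I understand it (M sampled episodes, with a baseline b_t to reduce variance):

```latex
\nabla_\theta J \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T}
\nabla_\theta \log \pi\big(u_t^i \mid s_{1:t}^i; \theta\big)\,\big(R^i - b_t\big)
```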
1. The insight of this paper is pretty clear, and it has proved effective in a lot of applications.
2. Attention models can be divided into two categories: (1) soft-attention models and (2) hard-attention models. A soft-attention model assigns a weight to each pixel or region, but the computation has to be carried out over the entire image. A hard-attention model, on the other hand, makes the computation independent of the image size, so it is quicker, but it loses some information (it depends only on the selected regions).
3. Take note how REINFORCE is used in this task.
This paper presents a new way to capture the video-wide temporal information for action recognition. The proposed framework uses a ranking machine to learn the evolution of the appearance over time and returns a ranking function. The parameters of the ranking function are used as the feature vectors to perform the action recognition.
Intuitively, videos with similar content should have similar evolution behaviour, so the parameters learned to represent the evolution should be similar; in this way they can be used to learn classification boundaries.
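A minimal numpy sketch of rank pooling: fit a linear function whose scores increase with time and use its parameters as the video descriptor. The paper uses a ranking machine (RankSVM) on smoothed features; plain least squares below is a common approximation, not the authors' exact method.

```python
import numpy as np

def rank_pool(frames):
    """frames: (T, D) array of per-frame features.
    Returns a (D,) descriptor encoding the temporal evolution."""
    T, _ = frames.shape
    # time-varying mean smoothing of the frame features
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    # least-squares fit of u such that V @ u increases with time
    u, *_ = np.linalg.lstsq(V, t, rcond=None)
    return u

descriptor = rank_pool(np.random.rand(50, 128))
print(descriptor.shape)  # (128,)
```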
This is a novel idea. Similarly, there is a paper (Wang et al., CVPR 2016) that models an action as a transformation function that transforms the surroundings from state A (the pre-action state) to state B (the post-action state); it can therefore be modelled using a Siamese network.
The good thing about these two papers is that we do not even care how good the performance is (of course it is still a very important factor); the idea itself is novel enough. However, since the idea is completely new, at least not following an already-successful fashion, it is not guaranteed to work.
Wang, Xiaolong, Farhadi, Ali, and Gupta, Abhinav. Actions ~ Transformations. In CVPR, 2016.
This paper proposes a new framework to detect visual relationships in an image. Previous works treat a full relationship triplet (object 1, predicate, object 2) as a single class, but in practice most relationship triplets do not have many training instances. This paper therefore treats objects and predicates separately, which is its key insight: objects and predicates individually occur much more frequently than their combinations. The two parts are then combined by leveraging language priors from semantic word embeddings.
1. In the visual module, R-CNN is used to generate region proposals (actually not proposals, but 2-D detections).
2. In the visual module, a second CNN is trained to classify each predicate using the union of the bounding boxes of the two participating objects in the relationship (see the sketch below).
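A tiny sketch of that union operation on (x1, y1, x2, y2) boxes (illustrative coordinates, not from the paper):

```python
def union_box(box1, box2):
    """Union of two (x1, y1, x2, y2) detections: the tightest box
    enclosing both, used as the input region for the predicate CNN."""
    return (min(box1[0], box2[0]), min(box1[1], box2[1]),
            max(box1[2], box2[2]), max(box1[3], box2[3]))

# e.g. a person box + a horse box -> region covering "person rides horse"
print(union_box((10, 20, 60, 120), (40, 80, 200, 180)))
# (10, 20, 200, 180)
```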
3. The language module projects relationships into an embedding space, where similar relationships are optimized to lie close together.
1. Why still use R-CNN in 2016? Faster R-CNN has proved much more effective.
2. With the union operation, the input of the predicate module covers all the objects in the relationship. (If there were a way to include the overlap, which focuses only on the specific region where the relationship happens, it could probably provide some benefit.)
3. The insight of this paper is simple: feature sharing. (I really like this idea recently; embedding spaces, sub-actions, etc. are all based on this simple yet effective insight.)
4. What should be the appropriate evaluation metric for this task? As the authors state, mAP is a pessimistic evaluation metric because it is not realistic to annotate all possible relationships in an image.
5. Visual Relationship Dataset (dataset and code available)
This paper proposes a feedback-based approach in which the representation is formed iteratively, based on feedback received from the previous iteration's output. The feedback network naturally enables early predictions at query time, and its output conforms to a hierarchical structure in the label space.
1. The fundamental idea of feedback network is that the output of the network is routed back into the system as part of an iterative cause-and-effect process.
2. The overall process: the image repeatedly undergoes a shared convolutional operation, and a prediction is made at each iteration; the recurrent convolutional operations are trained to produce the best output at each iteration given a hidden state that carries a direct notion of the thus-far output (a sketch of the recurrence follows below).
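Schematically (my notation, not the paper's): the same input x is re-processed by a shared recurrent module F (a convolutional LSTM in the paper), and a prediction is read out at every iteration.

```latex
h_t = \mathcal{F}(x, h_{t-1}), \qquad \hat{y}_t = \mathcal{G}(h_t), \qquad t = 1, \dots, T
```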
3. Skip connections inspired by ResNet are added to regulate the flow of the signal through the network.
4. Episodic curriculum learning is achieved with a time-varying loss function that encourages the network to recognize objects in a coarse-to-fine manner.
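The episodic curriculum can then be expressed as a per-iteration weighted loss, with early iterations supervised by coarse labels and later ones by fine labels (my paraphrase; the iteration weights w_t are annealed over time):

```latex
\mathcal{L} = \sum_{t=1}^{T} w_t \, \ell\big(\hat{y}_t, y_t\big)
```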
1. How is it possible that, even when trained without a taxonomy, the network still learns a hierarchical representation of the input?
2. The insight of this paper seems to be to decrease the physical depth of the network and increase its depth in the temporal domain.
3. In terms of implementation, it seems to be just an LSTM over a single image with some skip connections. (How to tell the story when the implementation is this similar?)
4. How can we extend this insight to other areas? (The network-in-network idea could work.)
This paper proposes an approach that takes into account both the local and the global temporal structure of videos to produce descriptions. It uses a spatio-temporal 3-D convolutional neural network to represent short temporal dynamics, and a temporal attention mechanism to automatically select the most relevant temporal segments.
This paper tries to handle an important unsolved question in the video analysis area: how to utilize frame-level or short-clip-level features to generate video-level features that effectively represent the video content. Usually there are two ways to do this: (1) temporal pooling over frame-level features, and (2) a recurrent network structure (RNN, LSTM, etc.). The problem with both is that they treat every single frame equally; however, in a video, a lot of the frames are simply redundant. Therefore, using an attention model to choose which frames to attend to (a weighted version of pooling) makes total sense.
(1) Local temporal structure is captured with a C3D network. But strangely, the authors do not use raw frames as input; rather, they use a so-called low-level video representation that contains HoG, HoF, and MbH features. These representations form a cube that serves as the input to the C3D model.
(2) The (soft) temporal attention mechanism can be implemented as a weighted sum of the temporal feature vectors. The attention weight for frame i indicates the relevance of the i-th temporal feature in the input video given all the previously generated words.
For more, refer to (Bahdanau et al., 2015).
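The soft attention computation, in the style of Bahdanau et al. (notation approximate): relevance scores are computed from the decoder state h_{t-1} and each temporal feature v_i, normalized with a softmax, and then used in a weighted sum.

```latex
e_i^{(t)} = w^\top \tanh\big(W_a h_{t-1} + U_a v_i + b_a\big), \qquad
\alpha_i^{(t)} = \frac{\exp\big(e_i^{(t)}\big)}{\sum_j \exp\big(e_j^{(t)}\big)}, \qquad
\varphi_t = \sum_i \alpha_i^{(t)} v_i
```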
(1) Is the low-level representation useful? Since a CNN can serve as a feature extractor, is it really necessary to use these hand-crafted features first?
(2) I really like the temporal attention idea; I have thought about it for some time. I used to think this kind of method had to be learned by reinforcement learning; apparently not.
(3) Is the intuition behind the temporal attention model as follows: if one frame is too similar to previous frames, their features and outputs will be very similar, so the attention weight for that frame should be lowered?
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.