How many time steps does this kind of method rely on? In theory, it should be the entire history, but in practice I am not sure. If only a few frames are used, it would be hard to predict complicated structured outputs for the future; the output would merely be very similar to the inputs from previous time steps.

At each time step, the system receives a video frame as input, predicts the optical flow from the current frame and the LSTM content as a dense transformation map, and applies it to the current frame to predict the next frame.

2. Temporal autoencoder: it contains three parts: (1) a memory module (convolutional LSTM) with optical flow prediction under a Huber penalty, (2) a grid generator, and (3) an image sampler.

3. The loss function is the reconstruction error between the predicted next frame and the ground-truth next frame, with the Huber penalty gradient on the optical flow map injected during backpropagation.
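The warp step described above can be sketched with a simple bilinear sampler. This is a numpy sketch of the idea, not the paper's exact differentiable grid generator; `warp_with_flow` and its conventions are my own:

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Warp an (H, W) frame by a dense flow field of shape (H, W, 2)
    using bilinear sampling, in the spirit of a differentiable
    grid generator + image sampler."""
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Each output pixel samples the input at (x + dx, y + dy).
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation of the four neighbouring pixels.
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bot = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bot
```

Training would backpropagate the reconstruction error through this sampler into the flow prediction; the Huber penalty on the flow map then acts as a smoothness regularizer.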

2. It would be really cool if the output could actually be some movement, or even an action, instead of one frame that is very similar to the previous frame.

The paper proposes a probabilistic video model, the Video Pixel Network (VPN).

1. The architecture of the VPN consists of two parts:

(1) a resolution-preserving CNN encoder

(2) a PixelCNN decoder

2. Eq. (1) models the joint distribution in a conditionally dependent fashion, so no independence assumptions are needed.

For example, as shown in the figure, the green colour channel value of pixel x in frame Ft depends on:

(1) all the pixel values in every channel from time 1 to time (t-1)

(2) the already generated red colour value of the pixel x
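In one plausible notation (the symbols are mine, following the raster-scan convention described above), this factorization reads:

```latex
p(\mathbf{x}) \;=\; \prod_{t=1}^{T} \prod_{i=1}^{N} \prod_{c \in \{R,G,B\}}
  p\!\left(x_{t,i,c} \,\middle|\, \mathbf{x}_{<t},\; x_{t,<i},\; x_{t,i,<c}\right)
```

where $\mathbf{x}_{<t}$ is all previous frames, $x_{t,<i}$ is the already-generated pixels of the current frame in raster order, and $x_{t,i,<c}$ is the already-generated colour channels of the current pixel (so green conditions on red, and blue on red and green).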

**Discussion:**

1. The reason to use a resolution-preserving CNN encoder is that *it allows the model to condition each pixel that needs to be generated without loss of representational capacity.*

2. **Convolutional LSTM** is used. (Read paper [1]! This version of LSTM has been used frequently in recent work.)

3. The novelty of this paper is that **the decoder is conditioned on more information**. Several papers seem to be trying to do similar things (giving more information to the generative model). *What more can be conditioned on?*

**Reference:**

[1] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.

The DRAW network combines a novel spatial attention mechanism with a sequential VAE framework that allows for the iterative construction of complex images.

2. Key components:

(2) The decoder's outputs are successively added to the distribution that will ultimately generate the data.

(3) A dynamically updated attention mechanism is used to focus on specific regions.

3. Selective attention model: an array of 2-D Gaussian filters is applied to the image, yielding an image 'patch' of smoothly varying location and zoom, similar to [1] and [2]. The attention model makes it possible for the network to focus only on a certain region of interest. It is similar to [3], but fully differentiable.
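A minimal numpy sketch of the Gaussian-filterbank read operation. Function names and the normalization constant are my own, and DRAW additionally learns the centre, stride, and variance from the decoder's hidden state:

```python
import numpy as np

def gaussian_filterbank(center, stride, sigma, n, size):
    """Build an n x size filterbank: row i is a normalized Gaussian
    centred at center + (i - n/2 + 0.5) * stride with std sigma."""
    means = center + (np.arange(n) - n / 2 + 0.5) * stride
    grid = np.arange(size)
    F = np.exp(-((grid[None, :] - means[:, None]) ** 2) / (2 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)

def read_patch(image, gx, gy, stride, sigma, n):
    """Extract an n x n smoothly varying patch: patch = F_y @ image @ F_x.T."""
    H, W = image.shape
    Fx = gaussian_filterbank(gx, stride, sigma, n, W)
    Fy = gaussian_filterbank(gy, stride, sigma, n, H)
    return Fy @ image @ Fx.T
```

Because the patch is a product of matrices built from smooth functions of the attention parameters, gradients flow back into the location and zoom, which is what makes the attention differentiable.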

2. The differentiable attention model is worth paying attention to; it could be useful in a lot of applications.

3. I have seen several papers about VAEs. The difference is sometimes just the input; in this paper, the input of the VAE is a combination of several sources of information.

[2] Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

[3] Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212, 2014.

(1) A reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.

(2) For datasets with continuous latent variables per data point, posterior inference can be made especially efficient by fitting an approximate inference model to the intractable posterior using the proposed lower bound estimator.

2. Important equations to reach the VAE:

The marginal likelihood can be written as the sum of two terms: (1) the KL divergence between the approximate and true posteriors, and (2) the variational lower bound on the marginal likelihood; the lower bound itself splits into a KL regularization term and the expected negative reconstruction error.
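In Kingma and Welling's notation, the two decompositions above are:

```latex
\log p_\theta(\mathbf{x}^{(i)})
  = D_{KL}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}\mid\mathbf{x}^{(i)})\right)
  + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})

\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})
  = -\,D_{KL}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z})\right)
  + \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\!\left[\log p_\theta(\mathbf{x}^{(i)}\mid\mathbf{z})\right]
```

Since the first KL term is non-negative, $\mathcal{L}$ lower-bounds the log marginal likelihood; maximizing $\mathcal{L}$ trades off the KL regularizer against the expected reconstruction term.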

3. Reparameterization trick: let **z** be a continuous random variable with some distribution conditioned on **x**. It can be expressed as the output of a deterministic function **g()** with some parameters.

4. The steps in the VAE are (1) encode, (2) reparameterize, and (3) decode. The reconstruction loss is then used to optimize.
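A toy numpy sketch of the encode / reparameterize / decode pipeline. The linear "networks" and all weight names are placeholders of mine, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # Toy linear "encoder": produces mean and log-variance of q(z|x).
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); the noise is moved outside
    # the deterministic path so gradients can flow through mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec):
    # Toy linear "decoder": reconstructs x from z.
    return z @ W_dec

x = rng.standard_normal((4, 8))          # batch of 4, input dim 8
W_mu, W_logvar = rng.standard_normal((8, 2)), rng.standard_normal((8, 2))
W_dec = rng.standard_normal((2, 8))
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
x_hat = decode(z, W_dec)
recon_loss = np.mean((x - x_hat) ** 2)                     # reconstruction term
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))   # KL to N(0, I)
```

The total objective is the reconstruction term plus the KL regularizer; a real implementation would use neural networks for the encoder and decoder and optimize both with stochastic gradients.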

Discussion:

Fundamental paper for VAE. Need to derive the equation by hand to understand better.

2. Conditional variational autoencoder is used to model the complex conditional distribution of future frames.

3. Instead of finding an intrinsic representation of the image itself, the proposed network finds an intrinsic representation of the intensity changes between two images, known as the difference image.

5. Network structure: (a) motion encoder: a variational autoencoder that learns a compact representation of the motion between frames.

6. The above figure shows three ways to generate the next frame.

(1) Deterministic model: minimize the reconstruction error. It cannot capture the multiple possible motions that a shape can have.

(2) Motion prior: this model contains a latent representation *z*, which encodes the intrinsic dimensionality of the motion field. But it does not see the input image during inference, so it can only learn a joint distribution of motion fields over all classes.

(3) Probabilistic frame predictor: the decoder now takes two inputs, the intrinsic representation *z* and the image *I*. Therefore, instead of modelling a joint distribution of motion *v*, it learns a conditional distribution of motion given the input image.

**Discussion:**

1. I still have some questions about inference time. The authors claim that the model can learn class-specific motions.

However, at inference time, the network uses the distribution *p(z)*, which is obtained from the entire training set. So how does the model know which motion to use?

2. The key point of the paper is the cross convolution. The cross convolutional layer applies the kernels learned to the feature maps. How does this cross convolution apply the motion to the image?

3. This *difference image*: how is it different from the *optical flow* image?

4. The motion encoder: takes two adjacent frames in time as input (or difference image? what is the influence here?)
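My reading of the cross convolutional layer, as a hedged numpy sketch: each feature-map channel is convolved with its own image-dependent kernel. The shapes, the 'same'-padding choice, and the per-channel pairing are assumptions, not the paper's exact specification:

```python
import numpy as np

def cross_convolve(feature_maps, kernels):
    """Apply a per-channel kernel to each feature map ('cross convolution'):
    channel c of the output is feature_maps[c] cross-correlated with
    kernels[c]. Shapes: (C, H, W) and (C, k, k), 'same' padding."""
    C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    pad = k // 2
    out = np.zeros_like(feature_maps)
    padded = np.pad(feature_maps, ((0, 0), (pad, pad), (pad, pad)))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + k, j:j + k] * kernels[c])
    return out
```

On this reading, the "motion" lives in the kernels: shifting a kernel's mass off-centre translates the corresponding feature map, so applying image-dependent kernels amounts to applying image-dependent local displacements.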

This paper proposes a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Since the region-selection stage is non-differentiable, reinforcement learning is used (a hard-attention model).

1. The model is an RNN structure that processes inputs sequentially, attending to different locations within the image (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment. (Information needs to be summarized and passed along time, so an RNN is a solid structural choice.)

2. One advantage of this approach (of all hard-attention approaches) is that the computation is independent of the size of the image (usually, computation needs to be done at every pixel of the image). Only a few local regions are used to model each image.

3. In each time-step of the RNN, a glimpse sensor is used to extract a representation for a given coordinate (shown as A). This representation is combined with the coordinates information to generate the glimpse representation with a Glimpse Network (shown as B).

4. The information passed along time goes through the internal representation of the RNN. The outputs of the RNN are used to (1) predict the next location to attend to and (2) produce the action (e.g., the classification decision).

5. The optimization is done by maximizing the total reward the agent can collect in the long run. This is carried out with the REINFORCE policy-gradient algorithm.
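A numpy sketch of what a glimpse sensor might compute: multi-scale square crops around a fixation point, all resized to a fixed retina size. Block-average downsampling here stands in for whatever resizing the paper uses, and the function name is my own:

```python
import numpy as np

def glimpse(image, cy, cx, size, n_scales=3):
    """Multi-resolution glimpse: crop n_scales square patches centred at
    (cy, cx), each twice as large as the previous, and resize all of
    them to size x size by simple block averaging."""
    H, W = image.shape
    patches = []
    for s in range(n_scales):
        half = (size * 2 ** s) // 2
        y0, y1 = max(0, cy - half), min(H, cy + half)
        x0, x1 = max(0, cx - half), min(W, cx + half)
        crop = image[y0:y1, x0:x1]
        f = 2 ** s
        # Pad if the crop was clipped at a border, then average f x f blocks.
        ph, pw = size * f - crop.shape[0], size * f - crop.shape[1]
        crop = np.pad(crop, ((0, ph), (0, pw)))
        patches.append(crop.reshape(size, f, size, f).mean(axis=(1, 3)))
    return np.stack(patches)  # (n_scales, size, size)
```

The cost of one glimpse depends only on `size` and `n_scales`, not on the image resolution, which is the computational advantage point 2 refers to.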

2. Attention models can be divided into two categories: (1) soft-attention models and (2) hard-attention models. A soft-attention model assigns a weight to each pixel or region, but the computation has to be carried out on the entire image. A hard-attention model, on the other hand, makes the computation independent of the size of the image, so it is quicker; but it loses some information (it depends only on the selected regions).

3. Take note how

Intuitively, videos of similar contents should have similar evolution behaviours, thus the parameters learned to represent the evolution should be similar. In this way, they can be used to learn classification boundaries.

The good thing about these two papers is that we don't even care how good the performance is (of course it is still a very important factor); the idea itself is novel enough. However, since the idea is completely new, at least not following an already-successful fashion, it is not guaranteed to work.

This paper proposes a new framework to detect visual relationships in an image. Previous works treat a full relationship (object 1, predicate, object 2) as a single unit. But in practice, most relationships do not have many training instances. Therefore, this paper treats objects and predicates separately, which is its key insight: objects and predicates independently occur much more frequently. The two parts are then combined by leveraging language priors from semantic word embeddings.

1. In the visual module, the R-CNN is used to generate region proposals (actually not proposals, but 2-D detections)

2. In the visual module, a second CNN is trained to classify each of the predicates

3. Language module projects relationships into an embedding space where similar relationships are optimized to be close together.
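The combination of the two modules can be sketched as ranking candidate predicates by the product of a visual score and a language prior. The scores below are made up for illustration, and this multiplicative combination is my simplification of the paper's scoring function:

```python
import numpy as np

def rank_relationships(visual_scores, language_priors):
    """Combine per-predicate visual scores with language priors (both
    dicts mapping predicate -> probability) and rank the candidates."""
    combined = {p: visual_scores[p] * language_priors.get(p, 1e-6)
                for p in visual_scores}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores for the pair (person, ?, horse):
visual = {"ride": 0.4, "wear": 0.35, "next to": 0.25}
prior = {"ride": 0.6, "wear": 0.01, "next to": 0.3}
ranking = rank_relationships(visual, prior)
```

The prior rescues visually ambiguous cases: "wear" is almost as likely as "ride" visually, but the language prior makes (person, ride, horse) win decisively.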

2. Using

3. The insight of this paper is simple: feature sharing. ( I really like this idea recently, embedding space, sub-action, etc. are all based on this simple yet effective insight.)

4. What should be the appropriate evaluation metric for this task? As the authors state, mAP is a pessimistic evaluation metric because it is not realistic to annotate all possible relationships in an image.

5. Visual Relationship Dataset (dataset and code available)

This paper proposes a feedback-based approach in which predictions are made iteratively, with each iteration refining the output of the previous one.

1. The fundamental idea is feedback: the network's thus-far output is fed back to inform the next iteration.

2. The overall process: the image undergoes a shared convolutional operation repeatedly, and a prediction is made each time; the recurrent convolutional operations are trained to produce the best output at each iteration, given a hidden state that carries a direct notion of the thus-far output.

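The iterative process in point 2 can be sketched as a tiny recurrent loop with shared weights. This is a toy dense version of my own (the paper uses convolutional operations), but it shows the structure: one shared transformation, applied repeatedly, with an early prediction emitted at every step:

```python
import numpy as np

def feedback_predict(x, W_in, W_rec, W_out, n_iters=4):
    """Iterative prediction with shared weights: the same transformation
    is applied at every iteration, the hidden state carries the
    thus-far computation, and a prediction is emitted at each step."""
    h = np.zeros(W_rec.shape[0])
    predictions = []
    for _ in range(n_iters):
        h = np.tanh(W_in @ x + W_rec @ h)  # shared ("recurrent") weights
        predictions.append(W_out @ h)       # early prediction each step
    return predictions
```

Training every per-iteration prediction against the target is what lets the network produce usable early outputs and refine them over time, which is where the "depth in the temporal domain" comes from.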

1. How is it possible that, even trained without a taxonomy, the network still learns a hierarchical structured representation of the input?

2. The insight of this paper seems to be to decrease the depth of the network and increase the depth in the temporal domain.

3. For implementation, it seems to be just an LSTM over a single image with some skip connections. (How to tell the story when the implementation is this similar?)

4. How can we extend this insight to other areas? (The network-in-network idea could work.)