This paper proposes a feedback-based approach in which the representation is formed iteratively, conditioned on feedback from the previous iteration's output. The feedback network naturally enables early predictions at query time, and its output conforms to a hierarchical structure in the label space.
1. The fundamental idea of the feedback network is that the network's output is routed back into the system as part of an iterative cause-and-effect process.
2. The overall process: the image undergoes a shared convolutional operation repeatedly, and a prediction is made at each step; the recurrent convolutional operations are trained to produce the best possible output at every iteration, given a hidden state that summarizes the output thus far.
3. A skip connection inspired by ResNet is added to regulate the flow of signal through the network.
4. Episodic curriculum learning is achieved with a time-varying loss function, which encourages the network to recognize objects in a coarse-to-fine manner.
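The loop described in points 1, 2, and 4 above can be sketched minimally. This is a toy illustration only: it uses dense weights instead of the paper's convolutions, and all names, shapes, and the loss-weight schedule are my own assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, T = 8, 3, 4                              # feature dim, classes, iterations

W_shared = rng.standard_normal((D, D)) * 0.1   # one SHARED weight, reused each step
W_out = rng.standard_normal((C, D)) * 0.1      # readout to class scores

def feedback_forward(x, steps=T):
    """Apply the same transformation repeatedly; emit a prediction per step."""
    h = np.zeros(D)                            # hidden state: the thus-far output
    preds = []
    for _ in range(steps):
        h = np.tanh(W_shared @ (x + h))        # feedback: output re-enters the loop
        preds.append(W_out @ h)                # early prediction at this iteration
    return preds

def weighted_loss(preds, target, gammas):
    """Time-varying loss: each iteration's prediction carries its own weight,
    so training can emphasize coarse (early) or fine (late) iterations."""
    return sum(g * np.sum((p - target) ** 2) for g, p in zip(gammas, preds))

preds = feedback_forward(rng.standard_normal(D))
loss = weighted_loss(preds, np.zeros(C), np.ones(T) / T)
```

Because a prediction exists after every iteration, inference can stop early at any step, which is what makes query-time early prediction fall out of the architecture for free.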
1. How is it possible that, even when trained without a taxonomy, the network still learns a hierarchical representation of the input?
2. The key insight of this paper seems to be trading physical depth for temporal depth: make the network shallower, but iterate it more times.
3. In terms of implementation, it appears to be essentially an LSTM applied to a single image, with some skip connections. (How do you tell the story when the implementation is this close to existing work?)
4. How can we extend this insight to other areas? (The network-in-network idea could work.)
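The implementation view in point 3 can be made concrete with a minimal sketch: a standard LSTM cell fed the same image features at every step, with a ResNet-style identity skip around the recurrent update. Gate equations are the usual LSTM ones; the dense weights and shapes are illustrative assumptions, not the paper's convolutional cell.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate, acting on the concatenation [input; hidden]
Wi, Wf, Wo, Wg = (rng.standard_normal((D, 2 * D)) * 0.1 for _ in range(4))

def lstm_step(x, h, c):
    """Standard LSTM gates, plus a skip connection around the cell output."""
    z = np.concatenate([x, h])
    i, f, o = sigmoid(Wi @ z), sigmoid(Wf @ z), sigmoid(Wo @ z)
    g = np.tanh(Wg @ z)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new + x, c_new                    # identity skip regulates signal flow

x = rng.standard_normal(D)                     # fixed features of the single image
h, c = np.zeros(D), np.zeros(D)
for _ in range(4):                             # unrolled in time, not in depth
    h, c = lstm_step(x, h, c)
```

Unlike a sequence model, the input here never changes across steps; only the hidden state evolves, which is exactly the "depth in the temporal domain" reading in point 2.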