This paper proposes a new framework to detect visual relationships in an image. Previous works treat a full relationship (object 1, predicate, object 2) as a single class, but in practice most relationship classes have very few training instances. Therefore, this paper treats objects and predicates separately, which is its key insight: objects and predicates individually occur much more frequently than their combinations. The two parts are then combined by leveraging language priors from semantic word embeddings.
1. In the visual module, R-CNN is used to localize objects (strictly speaking these are scored 2-D detections rather than raw region proposals).
2. In the visual module, a second CNN is trained to classify the predicate using the union of the bounding boxes of the two participating objects in that relationship (see the sketch after this list).
3. The language module projects relationships into an embedding space where similar relationships are encouraged to be close together.
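To make the two-module structure concrete, here is a minimal Python/NumPy sketch of the union-box computation and of fusing the predicate CNN's output with a word-embedding language prior. The function names, the concatenate-and-dot-product form of the prior, and the multiplicative fusion are simplifying assumptions of mine, not the paper's exact formulation.

```python
import numpy as np

def union_box(box_a, box_b):
    """Smallest box covering both object boxes; boxes are (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def relationship_scores(obj1_score, obj2_score, predicate_probs,
                        subj_vec, obj_vec, predicate_embeddings):
    """Combine visual evidence with a language prior (simplified).

    predicate_probs:      softmax output of the predicate CNN on the union box.
    subj_vec, obj_vec:    word vectors (e.g. word2vec) of the two object classes.
    predicate_embeddings: one projection vector per predicate; the prior here is a
                          dot product with the concatenated object vectors, a stand-in
                          for the paper's learned projection function.
    """
    pair_vec = np.concatenate([subj_vec, obj_vec])
    prior = predicate_embeddings @ pair_vec          # one score per predicate
    prior = np.exp(prior - prior.max())
    prior /= prior.sum()                             # normalize to a distribution
    # Final score per predicate: detection confidences times visual and language evidence.
    return obj1_score * obj2_score * predicate_probs * prior
```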
1. Why use only R-CNN in 2016? Faster R-CNN has been shown to be much more effective.
2. With the union operation, the input to the predicate module covers both objects in the relationship. (Also including the overlap region, which focuses on the specific area where the relationship happens, could probably provide some benefit; see the first sketch after this list.)
3. The insight of this paper is simple: feature sharing. (I have really liked this idea recently; embedding spaces, sub-actions, etc. are all based on this simple yet effective insight.)
4. What is the appropriate evaluation metric for this task? As the authors state, mAP is a pessimistic metric because it is not realistic to annotate all possible relationships in an image; the paper therefore reports Recall@K instead (a minimal sketch appears after this list).
5. Visual Relationship Dataset (dataset and code are available).
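Regarding point 2 above, the overlap region could be computed analogously to the union box. A minimal sketch (hypothetical, not from the paper):

```python
def overlap_box(box_a, box_b):
    """Intersection of two (x1, y1, x2, y2) boxes, or None if they do not overlap."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x1 >= x2 or y1 >= y2:
        return None  # spatially separated objects have no overlap region
    return (x1, y1, x2, y2)
```

One caveat: many relationships involve spatially separated objects with an empty overlap, so such a crop could only complement the union crop, not replace it.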
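On point 4, a minimal sketch of Recall@K over relationship triplets (variable names are mine; the full metric also requires both predicted boxes to localize the ground-truth objects, e.g. IoU ≥ 0.5, which is omitted here):

```python
def recall_at_k(predicted, ground_truth, k=50):
    """Fraction of ground-truth relationships recovered in the top-k predictions.

    predicted:    list of (subject, predicate, object) triplets sorted by score.
    ground_truth: set of annotated (subject, predicate, object) triplets.
    """
    top_k = set(predicted[:k])
    hits = sum(1 for gt in ground_truth if gt in top_k)
    return hits / max(len(ground_truth), 1)
```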