Rethinking referring relationships from a perspective of mask-level relational reasoning. (January 2023)
- Record Type:
- Journal Article
- Title:
- Rethinking referring relationships from a perspective of mask-level relational reasoning. (January 2023)
- Main Title:
- Rethinking referring relationships from a perspective of mask-level relational reasoning
- Authors:
- Li, Chengyang
Zhu, Liping
Tian, Gangyi
Hou, Yi
Zhou, Heng
- Abstract:
- Highlights: We rethink RR task from the perspective of Mask-level Relational Reasoning. It makes the proposed method more explanatory and extensible. We design two modules: Mask Generate and Mask Transfer. They jointly help the model learn more language priors and multimodal information. We introduce an image-to-text relational reasoning module, which is unsupervised. It improves the generalization ability of the multimodal model. Our method achieves state-of-the-art accuracy on two challenging datasets, VRD and Visual Genome.
Abstract: Referring relationship aims at localizing subject and object entities in an image, according to a triple text <subject, predicate, object>. Previous methods use iterative attention to shift between image regions for modeling predicate. However, predicate sometimes is implicit and difficult to be represented in the image domain. Convolution modeling method to express predicate is simple and inappropriate. Besides, relational reasoning information in the text itself is not fully utilized. To this end, we rethink referring relationship from a mask-level relational reasoning perspective to improve model interpretability. For text-to-image reasoning, we design Mask Generate and Mask Transfer modules, so as to fully integrate the text priors into the reasoning and prediction of masks. For image-to-text reasoning, we propose an unsupervised triple reconstruction method to guide text-to-image reasoning and improve multimodal generalization. By bi-directional reasoning between image and text, the proposed method MRR fully conforms to the multimodal relational reasoning process. Experiments show that MRR achieves state-of-the-art performance on two datasets of referring relationships, VRD and Visual Genome.
- Is Part Of:
- Pattern recognition. Volume 133(2023)
- Journal:
- Pattern recognition
- Issue:
- Volume 133(2023)
- Issue Display:
- Volume 133, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 133
- Issue:
- 2023
- Issue Sort Value:
- 2023-0133-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-01
- Subjects:
- Referring relationship -- Multimodal learning -- Image and text -- Visual grounding -- Deep learning
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2022.109044
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 24024.xml