Semi-supervised cross-modal retrieval with graph-based semantic alignment network. (September 2022)
- Record Type:
- Journal Article
- Title:
- Semi-supervised cross-modal retrieval with graph-based semantic alignment network. (September 2022)
- Main Title:
- Semi-supervised cross-modal retrieval with graph-based semantic alignment network
- Authors:
- Zhang, Lei
Chen, Leiting
Ou, Weihua
Zhou, Chuan
- Abstract:
- Semi-supervised cross-modal retrieval is an eclectic paradigm that learns common representations by exploiting underlying semantic information from both labeled and unlabeled data. Most existing methods ignore the rich semantic information of text data and cannot fully utilize the text data in common representation learning. Moreover, they consider only the correlation between data with the same semantic label and ignore the correlation between data with different semantic labels. In this paper, we propose a novel semi-supervised cross-modal retrieval method, called Graph-based Semantic Alignment Network (GSAN), which learns common representations by aligning the features of different modalities with the semantic embeddings of text data. First, we design a Deep Supervised Semantic Encoding (DSSE) module to train the semantic projector and label predictor, which can exploit the semantic embeddings and the predicted labels from unlabeled data of the text modality. Then, a GAN-based Bidirectional Fusion (GBF) module is designed to learn the mapping networks of the two modalities (image and text). To make the mapping networks generate semantically discriminative and modality-invariant representations, we use the underlying semantic information exploited by DSSE to construct a Graph-based Triplet Constraint (GTC), which enforces feature embeddings from semantically matched (image and text) pairs to be more similar and pushes mismatched ones apart. By making full use of semantic information, our approach needs fewer labeled data while achieving the performance of state-of-the-art methods. In addition, since we use only the mapping networks trained in the GBF module to generate common representations at the inference stage, our approach is efficient and time-saving in real-world applications. Extensive experiments on four widely used datasets show the effectiveness of GSAN. Highlights: We specialize in exploring semantic information covered by corresponding text data. The semantics of text data serve as a supervision signal in unsupervised learning. The graph-based triplet constraint enforces the common representations to cluster well. The triplet loss is used to constrain the adversarial learning. Experiments in both supervised and semi-supervised settings show the effectiveness.
- Is Part Of:
- Computers & electrical engineering. Volume 102(2022)
- Journal:
- Computers & electrical engineering
- Issue:
- Volume 102(2022)
- Issue Display:
- Volume 102, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 102
- Issue:
- 2022
- Issue Sort Value:
- 2022-0102-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-09
- Subjects:
- 41A05 -- 41A10 -- 65D05 -- 65D17
Cross-modal retrieval -- Semi-supervised learning -- Deep neural network -- Generative adversarial network
Computer engineering -- Periodicals
Electrical engineering -- Periodicals
Electrical engineering -- Data processing -- Periodicals
Ordinateurs -- Conception et construction -- Périodiques
Électrotechnique -- Périodiques
Électrotechnique -- Informatique -- Périodiques
Computer engineering
Electrical engineering
Electrical engineering -- Data processing
Periodicals
Electronic journals
621.302854
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00457906/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compeleceng.2022.108218
- Languages:
- English
- ISSNs:
- 0045-7906
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.680000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23282.xml