Rethinking Local and Global Feature Representation for Dense Prediction. (March 2023)
- Record Type:
- Journal Article
- Title:
- Rethinking Local and Global Feature Representation for Dense Prediction. (March 2023)
- Main Title:
- Rethinking Local and Global Feature Representation for Dense Prediction
- Authors:
- Chen, Mohan
Zhang, Li
Feng, Rui
Xue, Xiangyang
Feng, Jianfeng
- Abstract:
- Highlights: We propose a dense prediction method based on a dual-stream encoder combining convolution and Transformer, which extracts local spatial information and global context information simultaneously. Different decoders are designed to effectively fuse local features and global features from the two streams. We conduct extensive experiments on the Cityscapes, ADE20K, KITTI and COCO datasets, demonstrating that our method can achieve high performance on dense prediction tasks.
Abstract: Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, they are inherently limited in capturing long-range structured relationships with layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representation, they can degrade dense prediction results by over-smoothing regions containing fine details (e.g., boundaries and small objects). To this end, we aim to provide an alternative perspective by rethinking local and global feature representation for the dense prediction task. Specifically, we deploy a Dual-Stream Convolution-Transformer architecture, called DSCT, which takes advantage of both convolution and Transformer to learn a rich feature representation and is combined with a task decoder to provide a powerful dense prediction model. DSCT extracts high-resolution local feature representation from convolution layers and global feature representation from Transformer layers. With the local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform the task of semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT can achieve superior performance on the three tasks above. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most of the alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most of the convolution or Transformer counterparts. For object detection, without using FPN, we achieve 44.5% AP^b on the COCO dataset when using Faster R-CNN, which is higher than Conformer.
- Is Part Of:
- Pattern recognition. Volume 135(2023)
- Journal:
- Pattern recognition
- Issue:
- Volume 135(2023)
- Issue Display:
- Volume 135, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 135
- Issue:
- 2023
- Issue Sort Value:
- 2023-0135-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-03
- Subjects:
- Dense prediction -- Vision transformer -- Semantic segmentation -- Depth estimation -- Object detection
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2022.109168
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 24456.xml