ConViT: improving vision transformers with soft convolutional inductive biases (1st November 2022)
* This article is an updated version of: D'Ascoli S, Touvron H, Leavitt M L, Morcos A S, Biroli G and Sagun L 2021 ConViT: improving vision transformers with soft convolutional inductive biases Proc. 38th Int. Conf. Machine Learning vol 139 ed M Meila and T Zhang pp 2286–96.

Record Type:: Journal Article
Title:: ConViT: improving vision transformers with soft convolutional inductive biases (1st November 2022)
Main Title:: ConViT: improving vision transformers with soft convolutional inductive biases
Authors:: d'Ascoli, Stéphane
Touvron, Hugo
Leavitt, Matthew L
Morcos, Ari S
Biroli, Giulio
Sagun, Levent
Abstract:: Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a 'soft' convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al 2020 arXiv:2012.12877) on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT.
Is Part Of:: Journal of statistical mechanics. (2022:Nov.)
Journal:: Journal of statistical mechanics
Issue:: (2022:Nov.)
Issue Display:: Volume 1000095 (2022)
Year:: 2022
Volume:: 1000095
Issue Sort Value:: 2022-1000095-0000-0000
Page Start:
Page End:
Publication Date:: 2022-11-01
Subjects:: deep learning -- machine learning
Statistical mechanics -- Periodicals
Mechanics -- Statistical methods -- Periodicals
530.1305
Journal URLs:: http://ioppublishing.org/
DOI:: 10.1088/1742-5468/ac9830
Languages:: English
ISSNs:: 1742-5468
Deposit Type:: Legaldeposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 24479.xml