Abstract
Transformer architectures have driven notable advances in deep learning. Inspired by these accomplishments, a growing body of recent research applies Transformer-based frameworks to computer vision (CV). These models have proved effective in three fundamental vision tasks: image classification, object detection, and segmentation, across different sensory data streams. Visual transformers have demonstrated strong performance on various benchmarks relative to state-of-the-art convolutional neural networks. In this survey, we comprehensively review recently published works, organized by these three central CV tasks. We assess and compare the reviewed transformers using diverse metrics. Additionally, we discuss open issues and challenges, along with some unexplored directions for strengthening visual transformer architectures.