Cross-CvT: An Encoder-Decoder Multi-Level Cross-Attentional Architecture for Semantic Segmentation
Syed Muhammad Ammar Shah
100 p.; soft copy; 30 cm
Convolutional Neural Network (CNN)-based algorithms have been widely used in the encoder-decoder framework for semantic segmentation because they extract local information efficiently, but they lack the receptive field to handle long-range dependencies, especially in shallow layers. Transformer-based algorithms can extract global features through their inherent attention mechanism, but they require large amounts of data and computational power to reach their full potential. Hybrid CNN-Transformer algorithms are therefore being explored to combine the strengths of both approaches. This work introduces one such algorithm, Cross-CvT, inspired by the Convolutional Vision Transformer (CvT) paradigm. The encoder adopts the standard CvT design, employing convolutional patch embeddings and convolutional transformer blocks, with each MLP feed-forward layer replaced by an inverted residual block to introduce local context. The decoder mirrors this design but replaces the convolutional patch embeddings with transposed convolutions for learned upsampling. Skip connections link corresponding encoder and decoder stages and are augmented by cross-attention modules that allow decoder feature queries to attend to encoder outputs, enabling rich multi-scale feature fusion. The proposed architecture preserves the transformer’s global context while reintroducing CNN-like inductive biases for detailed high-resolution segmentation. Evaluated on the Cityscapes benchmark, Cross-CvT achieves a mean Intersection over Union of 52.3%, which is competitive with state-of-the-art approaches and highlights the effectiveness of the Cross-CvT design for semantic segmentation.
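The cross-attention skip connection described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (the class name `CrossAttentionFusion` and all hyperparameters are assumptions, not the thesis's actual implementation): flattened decoder features act as queries, the matching encoder stage's features act as keys and values, and the attended result is fused back into the decoder path through a residual addition.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a cross-attention skip connection:
    decoder feature queries attend to same-stage encoder outputs."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)   # normalize decoder queries
        self.norm_kv = nn.LayerNorm(dim)  # normalize encoder keys/values
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat, enc_feat: (B, C, H, W) feature maps from matching stages
        B, C, H, W = dec_feat.shape
        q = self.norm_q(dec_feat.flatten(2).transpose(1, 2))    # (B, H*W, C)
        kv = self.norm_kv(enc_feat.flatten(2).transpose(1, 2))  # (B, H*W, C)
        fused, _ = self.attn(q, kv, kv)  # decoder queries attend to encoder
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        return dec_feat + fused  # residual fusion into the decoder path

dec = torch.randn(2, 64, 16, 16)
enc = torch.randn(2, 64, 16, 16)
out = CrossAttentionFusion(64)(dec, enc)
print(out.shape)  # torch.Size([2, 64, 16, 16])
```

The residual form keeps the skip connection's identity path intact, so the module can fall back to a plain skip connection if the attention contributes little at a given stage.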