Shah, Syed Muhammad Ammar

Cross-CvT: An Encoder-Decoder Multi-Level Cross-Attentional Architecture for Semantic Segmentation / Syed Muhammad Ammar Shah - 100p. Soft Copy 30cm

Convolutional Neural Network (CNN) based algorithms have been widely used in
encoder-decoder frameworks for semantic segmentation because they extract local
information efficiently, but they lack the receptive field to capture long-range
dependencies, especially in shallow layers. Transformer-based algorithms can extract
global features through their inherent attention mechanism but require large amounts
of data and computational power to reach their full potential. Hybrid CNN-Transformer
algorithms are being explored to combine the strengths of both approaches.
This work introduces one such algorithm, Cross-CvT, inspired by the Convolutional
Vision Transformer (CvT) paradigm. The encoder adopts the standard CvT
design, employing convolutional patch embeddings and convolutional transformer blocks,
where each MLP feed-forward layer is replaced by an inverted residual block to introduce
local context. The decoder mirrors this design but replaces the convolutional patch
embeddings with transposed convolutions to perform learned upsampling. Skip connections
link corresponding encoder and decoder stages, augmented by cross-attention modules that
allow decoder feature queries to attend to encoder outputs, enabling rich multi-scale feature
fusion. The proposed architecture preserves the transformer’s global context while
reintroducing CNN-like inductive biases for detailed high-resolution segmentation. We
evaluate Cross-CvT on the Cityscapes benchmark, where it achieves a mean Intersection
over Union (mIoU) score of 52.3%, competitive with state-of-the-art approaches and
highlighting the effectiveness of the Cross-CvT design for semantic segmentation.
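The cross-attention fusion described above, where decoder feature queries attend to encoder outputs across a skip connection, can be sketched as follows. This is a minimal single-head illustration in NumPy; the shapes, the single-head simplification, and the projection-matrix names (Wq, Wk, Wv) are assumptions for exposition, not the thesis's exact multi-head, convolutional-projection implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_feats, encoder_feats, Wq, Wk, Wv):
    """Single-head cross-attention sketch (assumed simplification).

    decoder_feats: (Nd, C) tokens supplying the queries.
    encoder_feats: (Ne, C) tokens supplying the keys and values.
    """
    Q = decoder_feats @ Wq                    # (Nd, d) queries
    K = encoder_feats @ Wk                    # (Ne, d) keys
    V = encoder_feats @ Wv                    # (Ne, d) values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (Nd, Ne) scaled similarities
    attn = softmax(scores, axis=-1)           # each decoder token weights encoder tokens
    return attn @ V                           # (Nd, d) encoder features fused per query

# Toy shapes: 4 decoder tokens query 16 encoder tokens.
rng = np.random.default_rng(0)
C, d, Nd, Ne = 8, 8, 4, 16
dec = rng.standard_normal((Nd, C))
enc = rng.standard_normal((Ne, C))
Wq, Wk, Wv = (rng.standard_normal((C, d)) for _ in range(3))
out = cross_attention(dec, enc, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

In the architecture described above, each decoder stage would apply such a module to the spatially flattened feature map of its corresponding encoder stage, letting upsampled features selectively retrieve high-resolution encoder detail rather than merely concatenating it.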


MS Robotics and Intelligent Machine Engineering

629.8