Cross-CvT: An Encoder-Decoder Multi-Level Cross-Attentional Architecture for Semantic Segmentation / Syed Muhammad Ammar Shah

By: Shah, Syed Muhammad Ammar
Contributor(s): Supervisor: Dr. Zaib Ali
Material type: Text
Publisher: Islamabad: SMME-NUST; 2025
Description: 100 p.; soft copy; 30 cm
Subject(s): MS Robotics and Intelligent Machine Engineering
DDC classification: 629.8
Item type: Thesis
Home library: School of Mechanical & Manufacturing Engineering (SMME)
Shelving location: E-Books
Call number: 629.8
Status: Available
Barcode: SMME-TH-1167

Convolutional Neural Network (CNN) based semantic segmentation algorithms have been
widely used in encoder-decoder frameworks due to their ability to extract local
information efficiently, but they lack the receptive field to handle long-range
dependencies, especially in shallow layers. Transformer-based algorithms can extract
global features through their inherent attention mechanism, but they require large
amounts of data and computational power to perform at their full potential. Hybrid
CNN-Transformer algorithms are being explored to combine the strengths of both
approaches. This work introduces one such algorithm, Cross-CvT, inspired by the
Convolutional Vision Transformer (CvT) paradigm. The encoder adopts the standard CvT
design, employing convolutional patch embeddings and convolutional transformer blocks,
where each MLP feed-forward layer is replaced by an inverted residual block to introduce
local context. The decoder mirrors this design but replaces the convolutional patch
embeddings with transposed convolutions for learned upsampling. Skip connections link
corresponding encoder and decoder stages, augmented by cross-attention modules that
allow decoder feature queries to attend to encoder outputs, enabling rich multi-scale
feature fusion. The proposed architecture preserves the transformer's global context
while reintroducing CNN-like inductive biases for detailed high-resolution segmentation.
We evaluate Cross-CvT on the Cityscapes benchmark, achieving a mean Intersection over
Union of 52.3%, competitive with state-of-the-art approaches and highlighting the
effectiveness of the Cross-CvT design for semantic segmentation.
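The cross-attention fusion described in the abstract — decoder features forming queries that attend to the corresponding encoder stage's outputs — can be sketched as plain scaled dot-product attention. This is a minimal single-head NumPy illustration with randomly initialized projection matrices, not the thesis implementation (Cross-CvT's actual modules use multi-head attention with convolutional projections inside its transformer blocks):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_feats, enc_feats, Wq, Wk, Wv):
    """Single-head cross-attention (illustrative only).

    dec_feats: (N_dec, d) decoder-stage tokens -> queries
    enc_feats: (N_enc, d) encoder-stage tokens -> keys and values
    Returns:   (N_dec, d) encoder context gathered per decoder token
    """
    Q = dec_feats @ Wq                         # (N_dec, d)
    K = enc_feats @ Wk                         # (N_enc, d)
    V = enc_feats @ Wv                         # (N_enc, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N_dec, N_enc) similarity
    attn = softmax(scores, axis=-1)            # each decoder token's weights over encoder tokens
    return attn @ V                            # weighted sum of encoder values

# toy usage: 4 decoder tokens attend over 16 encoder tokens, embedding dim 8
rng = np.random.default_rng(0)
d = 8
dec = rng.standard_normal((4, d))
enc = rng.standard_normal((16, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_attention(dec, enc, Wq, Wk, Wv)  # shape (4, 8)
```

In a skip connection, `fused` would then be combined with the decoder features (e.g. by addition or concatenation) before the next decoder stage, which is what lets the upsampling path recover fine spatial detail from the encoder.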
