Cross-CvT: An Encoder-Decoder Multi-Level Cross-Attentional Architecture for Semantic Segmentation / Syed Muhammad Ammar Shah

By: Shah, Syed Muhammad Ammar
Contributor(s): Supervisor: Dr. Zaib Ali
Material type: Text
Publisher: Islamabad: SMME-NUST; 2025
Description: 100 p.; soft copy; 30 cm
Subject(s): MS Robotics and Intelligent Machine Engineering
DDC classification: 629.8
Item type: Thesis
Home library: School of Mechanical & Manufacturing Engineering (SMME)
Shelving location: E-Books
Call number: 629.8
Status: Available
Barcode: SMME-TH-1167

Convolutional Neural Network (CNN) based semantic segmentation algorithms have been
widely used in encoder-decoder frameworks due to their ability to extract local
information efficiently, but they lack the receptive field to handle long-range
dependencies, especially in shallow layers. Transformer-based algorithms can extract
global features through their inherent attention mechanism, but they require large
amounts of data and computational power to perform at their full potential. Hybrid
CNN-Transformer algorithms are being explored to combine the strengths of both
approaches. This work introduces one such algorithm, Cross-CvT, inspired by the
Convolutional Vision Transformer (CvT) paradigm. The encoder adopts the standard CvT
design, employing convolutional patch embeddings and convolutional transformer blocks,
where each MLP feed-forward layer is replaced by an inverted residual block to introduce
local context. The decoder mirrors this design but replaces the convolutional patch
embeddings with transposed convolutions for learned upsampling. Skip connections link
corresponding encoder and decoder stages, augmented by cross-attention modules that
allow decoder feature queries to attend to encoder outputs, enabling rich multi-scale
feature fusion. The proposed architecture preserves the transformer's global context
while reintroducing CNN-like inductive biases for detailed high-resolution segmentation.
We evaluate Cross-CvT on the Cityscapes benchmark, achieving a mean Intersection over
Union of 52.3%, competitive with state-of-the-art approaches and highlighting the
effectiveness of the Cross-CvT design for semantic segmentation.
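The cross-attention fusion described in the abstract — decoder features forming queries that attend to the corresponding encoder stage's outputs — can be sketched as plain scaled dot-product attention. This is a minimal single-head NumPy illustration with randomly initialized projection matrices, not the thesis implementation (Cross-CvT's actual modules use multi-head attention with convolutional projections inside its transformer blocks):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_feats, enc_feats, Wq, Wk, Wv):
    """Single-head cross-attention (illustrative only).

    dec_feats: (N_dec, d) decoder-stage tokens -> queries
    enc_feats: (N_enc, d) encoder-stage tokens -> keys and values
    Returns:   (N_dec, d) encoder context gathered per decoder token
    """
    Q = dec_feats @ Wq                         # (N_dec, d)
    K = enc_feats @ Wk                         # (N_enc, d)
    V = enc_feats @ Wv                         # (N_enc, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N_dec, N_enc) similarity
    attn = softmax(scores, axis=-1)            # each decoder token's weights over encoder tokens
    return attn @ V                            # weighted sum of encoder values

# toy usage: 4 decoder tokens attend over 16 encoder tokens, embedding dim 8
rng = np.random.default_rng(0)
d = 8
dec = rng.standard_normal((4, d))
enc = rng.standard_normal((16, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_attention(dec, enc, Wq, Wk, Wv)  # shape (4, 8)
```

In a skip connection, `fused` would then be combined with the decoder features (e.g. by addition or concatenation) before the next decoder stage, which is what lets the upsampling path recover fine spatial detail from the encoder.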
