Design of a Novel Spectral Learnable Dynamic Feature Map Semantic Segmentation Model /
Mugheera Saleem
- 131p. Soft Copy 30cm
In computer vision models, downsampling compresses contextual information spatially while improving model flexibility by adding depth to the layer outputs. Traditionally, downsampling ratios in segmentation models have been dictated by the model architecture and kept fixed, treated as a hyperparameter. Although some research has introduced learnable downsampling in image classification, similar strategies have not been adopted in segmentation because of the difficulty of managing dynamic feature maps during upsampling. This thesis presents AdaUNet, an efficient semantic segmentation model with a novel encoder-decoder design. The encoder incorporates a differentiable stride learning mechanism and spectral attention to adaptively determine downsampling rates, reducing redundant spatial information and computational cost. The decoder uses a hypernetwork-based super-resolution model called Continuous Upsampling Filters (CUF) to smoothly recover high-resolution outputs. This design makes AdaUNet the first segmentation model that optimizes the size of its intermediate feature maps, shrinking them by up to 50 times compared to traditional fixed-pooling methods and drastically cutting FLOPs and activation memory. On Cityscapes at half image resolution, AdaUNet achieves 61.8% mean IoU using just 7.4M parameters and 29.36 GFLOPs, outperforming models such as SegFormer and HRNet-V2. On the CamVid (256×256) dataset, the model scores 72.33% mean IoU with only 3.14 GFLOPs. Furthermore, a Cityscapes-pretrained AdaUNet surpasses an ImageNet-1k pretrained (ResNet-101) DeepLabv3 model by 5% on CamVid while requiring around 20 times fewer FLOPs and 8 times fewer parameters. The proposed model is well suited to resource-constrained environments where high accuracy and low computational cost are both critical.
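The abstract does not spell out how the encoder resizes feature maps in the spectral domain, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes plain spectral pooling: a 2-D feature map is transformed with an FFT, only the central low-frequency block is kept, and the result is transformed back at the smaller size. The function name `spectral_downsample` and its arguments are hypothetical. In a learnable-stride setting the hard crop would presumably be replaced by a smooth, differentiable mask whose width is a trained parameter; that part is not shown here.

```python
import numpy as np

def spectral_downsample(x, out_h, out_w):
    """Downsample a 2-D array by cropping its centered Fourier spectrum.

    Illustrative only: a fixed (non-learnable) crop, assumed for this sketch.
    """
    h, w = x.shape
    if not (0 < out_h <= h and 0 < out_w <= w):
        raise ValueError("output size must be positive and no larger than the input")

    # Move the zero-frequency (DC) term to index (h // 2, w // 2).
    spec = np.fft.fftshift(np.fft.fft2(x))

    # Choose the crop so the DC term lands at (out_h // 2, out_w // 2),
    # which is where ifftshift expects it for the smaller grid.
    top = h // 2 - out_h // 2
    left = w // 2 - out_w // 2
    cropped = spec[top:top + out_h, left:left + out_w]

    # Back to the spatial domain at the reduced size.
    out = np.fft.ifft2(np.fft.ifftshift(cropped))

    # Rescale so the mean intensity is preserved; drop the small imaginary
    # residue that the truncated spectrum can leave behind.
    return out.real * (out_h * out_w) / (h * w)

if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(64, 64))
    small = spectral_downsample(feature_map, 24, 40)
    print(small.shape)
```

Because the output size is just the shape of the retained frequency block, it is not tied to integer strides, which is what makes a spectral formulation a natural fit for downsampling rates that are tuned rather than fixed by the architecture.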