| 000 | 03928nam a22001817a 4500 | ||
|---|---|---|---|
| 003 | NUST | ||
| 005 | 20260127084907.0 | ||
| 082 | _a005.1,ZIA | ||
| 100 | _aZia, Usman _9124510 | ||
| 245 | _aImage Description using Deep Learning / _cUsman Zia | ||
| 260 | _aRawalpindi, _bMCS (NUST), _c2022 | ||
| 300 | _axiii, 114 p. | ||
| 505 | _aInternet technologies are generating enormous amounts of data that merge textual and visual content: tagged images, descriptions in newspapers, videos with captions, and social media feeds. Such interaction with technology and devices has become part of everyday life, for example explaining an image in the context of news, following instructions by interpreting a diagram or a map, or understanding presentations while listening to a lecture. Traditionally, content providers manually added captions to make such content more accessible. These captions are used by text-to-speech systems to generate natural-language descriptions of images and videos. Recent years have seen an upsurge of interest in problems that combine language and visual content, driving the development of methods for automatically generating image descriptions. Owing to potential applications in computer vision, information retrieval, autonomous vehicles, and natural language processing (NLP), the automatic generation of a sequence of words, known as a caption, for an image has attracted enormous attention in the past decade. Various techniques have been proposed for the automatic generation of image descriptions using the most suitable annotations in the training set. These training annotations are sometimes rearranged or augmented by NLP algorithms. Despite significant achievements in generating sentences for images, existing models struggle to capture human-like semantics in the generated descriptions. In this thesis, three novel image description techniques are proposed to generate semantically superior captions for a target image. The first proposed technique incorporates topic-sensitive word embeddings for the generation of image descriptions. Topic models associate documents with different topics based on probability distributions over words. The proposed approach uses topic modeling to align the semantic meaning of words with image features and to generate descriptions that are more relevant to the context (topic) of the target image regions. Compared to traditional models, the proposed approach utilizes high-level word semantics to represent diversity in the training corpus. The convolutional layers of the visual encoder used in traditional models generate feature maps to extract hierarchical information from the visual content. These convolutional layers do not exploit the dependencies between feature maps, which can result in the loss of essential information needed to guide the language model during description generation. The second proposed model incorporates scene information, capturing the overall setting reflected in the visual content along with object-level features via a squeeze-and-excitation module and spatial details, to boost the accuracy of caption generation. Visual features are coupled with location information and topic modeling to capture semantic word relationships, feeding the sequence-to-sequence word generation task. The third proposed approach addresses the challenges of remote sensing image description arising from the large variance in the visual appearance of objects. A multi-scale visual feature encoder is proposed to extract detailed information from remote sensing images, and an adaptive attention decoder dynamically assigns weights to the multi-scale features and textual cues to strengthen the language model in generating novel topic-sensitive descriptions. | ||
| 650 | _aPhD Computer Software Engineering Thesis _9132801 | ||
| 651 | _aPhD CSE Thesis _9132802 | ||
| 700 | _aSupervised by Dr. Abdul Ghafoor _9132894 | ||
| 942 | _2ddc _cTHE | ||
| 999 | _c615947 _d615947 | ||
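The squeeze-and-excitation module mentioned in the abstract (field 505) reweights convolutional feature maps by learned per-channel importance: a "squeeze" (global average pooling per channel), an "excite" (a small bottleneck of two dense layers ending in a sigmoid), and a rescaling of each channel map. As a rough illustration only, not the thesis's implementation, a NumPy sketch (function name, weight shapes, and the reduction ratio are assumptions):

```python
import numpy as np

def squeeze_excite(feature_maps, w1, w2):
    """Channel attention over feature_maps of shape (C, H, W).

    Squeeze: global average pooling per channel -> (C,).
    Excite:  w1 reduces C -> C//r with ReLU, w2 restores -> (C,), sigmoid gates.
    Scale:   multiply each channel map by its gate in (0, 1).
    """
    c = feature_maps.shape[0]
    squeezed = feature_maps.reshape(c, -1).mean(axis=1)   # (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)               # ReLU bottleneck, (C//r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))          # sigmoid, (C,)
    return feature_maps * gates[:, None, None]

# Toy usage: C=4 channels, reduction ratio r=2, random (untrained) weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4))   # squeeze 4 -> 2
w2 = rng.standard_normal((4, 2))   # restore 2 -> 4
y = squeeze_excite(x, w1, w2)
```

Because the sigmoid gates lie strictly between 0 and 1, the output preserves the input's shape while attenuating less informative channels; in a trained encoder, `w1` and `w2` would be learned end-to-end alongside the captioning model.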