Deep learning techniques for image captioning

Md. Zakir Hossain

Generating a description of an image is called image captioning. Image captioning is a challenging task because it requires understanding the main objects in an image, their attributes, and their relationships. It also requires generating syntactically and semantically meaningful descriptions of the image in natural language. A typical image captioning pipeline comprises an image encoder and a language decoder. Convolutional Neural Networks (CNNs) are widely used as the encoder, while Long Short-Term Memory (LSTM) networks are used as the decoder. Variants of LSTMs and CNNs, often combined with attention mechanisms, are used to generate meaningful and accurate captions. Traditional image captioning techniques are limited in their ability to generate semantically meaningful, high-quality captions. In this research, we focus on advanced image captioning techniques that are able to generate semantically more meaningful and higher-quality captions. To this end, we make four contributions in this thesis.
First, we investigate an attention-based LSTM on image features extracted by DenseNet, which is a newer type of CNN. We integrate DenseNet features with an attention mechanism and show that this combination can generate more relevant image captions than features from other CNNs.
Second, we use bi-directional self-attention as a language decoder. A bi-directional decoder can capture context in both the forward and backward directions, i.e., past context as well as future context, during caption generation. Consequently, the generated captions are more meaningful than those generated by typical LSTM- and CNN-based models.
Third, we extend this work by using an additional CNN layer to incorporate structured local context together with the past and future contexts captured by the bi-directional LSTM. A pooling scheme, namely Attention Pooling, is also used to enhance the information extraction capability of the pooling layer. Consequently, the model is able to generate contextually superior captions.
Fourth, existing image captioning techniques use human-annotated real images for training and testing, and producing these annotations is an expensive and time-consuming process. Moreover, a large share of images today are synthetic or machine-generated, and there is also a need to generate captions for such images. We investigate the use of synthetic images for training and testing image captioning models. We show that such images can help improve the captions of real images and that they can be used effectively for caption generation on synthetic images.
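
As an illustration of the attention mechanism referred to above, the following is a minimal sketch of soft attention over image region features, written in plain Python. It is not the thesis implementation: the dot-product scoring function, the names `softmax` and `attend`, and the toy feature values are all assumptions made for this example. In an encoder-decoder captioner, `region_features` would come from the CNN encoder and `decoder_state` from the language decoder at the current time step.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, decoder_state):
    """Soft attention with dot-product scoring (an assumption of this sketch).

    region_features: list of feature vectors, one per image region
    decoder_state:   decoder hidden state, same dimension as each feature vector
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each region by its similarity to the decoder state.
    scores = [
        sum(f * h for f, h in zip(region, decoder_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    # The context vector is the attention-weighted average of the regions.
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three regions with two-dimensional features (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
print(weights)
print(context)
```

With these toy values, the first and third regions score equally against the decoder state and therefore receive equal weight, while the second region receives less. The decoder would consume `context` alongside the previous word when predicting the next word of the caption.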
