Image Captioning based on Encoder-Decoder Deep Network and Attention on Attention

Document Type: Research Paper


Department of Computer Engineering, Bu-Ali Sina University


Image captioning is an interdisciplinary research field spanning machine vision and natural language processing. Most proposed methods for generating image captions follow an encoder-decoder framework, in which each word is generated from the image features and the previously generated words. Recently, the attention mechanism, which typically creates a spatial map highlighting the image regions associated with each word, has been widely adopted. In this paper, we propose a new method that integrates the encoder-decoder framework with the Attention on Attention (AoA) mechanism. The encoder uses ResNet to extract global features of the image, and the decoder consists of three main components: an Attention-LSTM, a Language-LSTM, and an Attention-on-Attention layer. The attention mechanism uses local evidence to enhance the representation of the features and the reasoning involved in generating image descriptions. The proposed method improves caption generation and the METEOR and ROUGE evaluation metrics, and it generates better captions than state-of-the-art methods on the Flickr8k dataset.
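The Attention-on-Attention layer mentioned above can be illustrated with a minimal sketch. Following the standard AoA formulation, the attended feature vector and the attention query are concatenated, then projected into an "information" vector and a sigmoid "gate", whose elementwise product is the final output. The weight names and dimensions below are illustrative assumptions, not the paper's actual parameters:

```python
import numpy as np

def attention_on_attention(v, q, W_i, b_i, W_g, b_g):
    """Attention-on-Attention gating (illustrative sketch).

    v : attended feature vector from the attention module, shape (d,)
    q : attention query, e.g. the Attention-LSTM hidden state, shape (d,)
    Returns the gated information vector i * g, shape (d,).
    """
    x = np.concatenate([v, q])                   # [v; q], shape (2d,)
    i = W_i @ x + b_i                            # information vector
    g = 1.0 / (1.0 + np.exp(-(W_g @ x + b_g)))  # sigmoid gate in (0, 1)
    return i * g                                 # elementwise gating

# Toy usage with random weights (dimensions are assumptions for illustration)
rng = np.random.default_rng(0)
d = 8
v = rng.standard_normal(d)
q = rng.standard_normal(d)
W_i = rng.standard_normal((d, 2 * d)); b_i = np.zeros(d)
W_g = rng.standard_normal((d, 2 * d)); b_g = np.zeros(d)
out = attention_on_attention(v, q, W_i, b_i, W_g, b_g)
print(out.shape)
```

Because the gate lies strictly between 0 and 1, the output never exceeds the magnitude of the information vector; this is the mechanism's way of filtering irrelevant attention results before the Language-LSTM consumes them.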