Automatic image captioning using capsule neural network and ELMo embedding technique

Document Type : Research Paper

Authors

1 Phd. Student, Computer Science, Yazd University

2 Department of Computer Engineering , Yazd University, Yazd, Iran

3 Department of Electrical Engineering , Yazd University, Yazd, Iran

Abstract

Automatic image captioning is a challenging task in computer vision and aims to generate computer-understandable descriptions for images. Employing convolutional neural networks (CNN) has a key role in image caption generation. However, during the process of generating descriptions for an image, there are two major challenges for CNN, such as: they do not consider the relationships and spatial hierarchical structures between the objects in the image, and the lack of resistance against rotational changes of the images. In order to solve these challenges, this paper presents an improved capsule network to describe image content using natural language processing by considering the relations between the objects . A capsule contains a set of neurons that consider the parameters of the state of objects in the image, such as size, direction, scale, and relationships of objects to each other. These capsules have a special focus on extracting meaningful features for use in the process of generating relevant descriptions for a given set of images. Qualitative tests on the MS-COCO dataset using the capsule network and ELMo embedding technique have resulted in 2-5% improvement in the evaluated metrics compared to existing image captioning models.

Keywords