TextRS: Deep bidirectional triplet network for matching text to remote sensing images

Taghreed Abdullah and Yakoub Bazi and Mohamad M. Al Rahhal and Mohamed L. Mekhalfi and Lalitha, R. and Mansour Zuair (2020) TextRS: Deep bidirectional triplet network for matching text to remote sensing images. Remote Sensing, 12 (3). pp. 1-19. ISSN 2072-4292

Official URL: https://doi.org/10.3390/rs12030405

Abstract

Exploring the relevance between images and their respective natural language descriptions is, owing to its paramount importance, regarded as the next frontier in the general computer vision literature. Several recent works have therefore attempted to map visual attributes onto their corresponding textual descriptions with some success. However, this line of research has not been widespread in the remote sensing community. On this point, our contribution is three-pronged. First, we construct a new dataset for the text-image matching task, termed TextRS, by collecting images from four well-known scene datasets, namely AID, Merced, PatternNet, and NWPU. Each image is annotated with five different sentences, each written by a different person to ensure diversity. Second, we put forth a novel Deep Bidirectional Triplet Network (DBTN) for text-to-image matching. Unlike traditional remote sensing image-to-image retrieval, our paradigm carries out the retrieval by matching text to image representations. To achieve this, we propose to learn a bidirectional triplet network composed of a Long Short-Term Memory (LSTM) network and pre-trained Convolutional Neural Networks (CNNs) based on EfficientNet-B2, ResNet-50, Inception-v3, and VGG16. Third, we top the proposed architecture with an average fusion strategy that fuses the features of the five sentences describing each image, which enables the learning of more robust embeddings. The performance of the method, expressed in terms of Recall@K (the presence of the relevant image among the top K retrieved images for the query text), is promising, yielding 17.20%, 51.39%, and 73.02% for K = 1, 5, and 10, respectively.
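
For illustration, the following is a minimal PyTorch sketch (not the authors' released code) of the ingredients the abstract names: an LSTM text encoder, a pre-trained CNN image encoder (ResNet-50 is used here as one of the listed backbones), a bidirectional triplet loss with in-batch negatives, average fusion of the five sentence embeddings of an image, and Recall@K evaluation. All dimensions, margins, and helper names below are assumptions made for the sketch.

    # Sketch under stated assumptions; hyper-parameters and names are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class TextEncoder(nn.Module):
        """Encodes a tokenised sentence into a joint-embedding vector via an LSTM."""
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, out_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, out_dim)

        def forward(self, tokens):                      # tokens: (batch, seq_len)
            _, (h, _) = self.lstm(self.embed(tokens))   # final hidden state
            return F.normalize(self.fc(h[-1]), dim=-1)  # L2-normalised embedding

    class ImageEncoder(nn.Module):
        """Projects pre-trained CNN features (ResNet-50 here) into the same space."""
        def __init__(self, out_dim=256):
            super().__init__()
            backbone = models.resnet50(weights="IMAGENET1K_V1")
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
            self.fc = nn.Linear(2048, out_dim)

        def forward(self, images):                      # images: (batch, 3, H, W)
            feats = self.cnn(images).flatten(1)
            return F.normalize(self.fc(feats), dim=-1)

    def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
        """Hinge triplet loss in both directions (image-to-text and text-to-image),
        treating every other sample in the batch as a negative."""
        sim = img_emb @ txt_emb.t()                     # cosine similarities
        pos = sim.diag().unsqueeze(1)                   # similarities of matching pairs
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        loss_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0).mean()
        loss_t2i = (margin + sim.t() - pos).clamp(min=0).masked_fill(mask, 0).mean()
        return loss_i2t + loss_t2i

    def fuse_sentence_embeddings(sentence_embs):
        """Average fusion of the five sentence embeddings describing one image."""
        return F.normalize(sentence_embs.mean(dim=0), dim=-1)

    def recall_at_k(txt_emb, img_emb, k):
        """Recall@K for text-to-image retrieval: fraction of query sentences whose
        ground-truth image (assumed to share the same index) appears in the top K."""
        ranks = (txt_emb @ img_emb.t()).argsort(dim=1, descending=True)
        gt = torch.arange(txt_emb.size(0)).unsqueeze(1)
        return (ranks[:, :k] == gt).any(dim=1).float().mean().item()

In this sketch both encoders map into a shared, L2-normalised space so that cosine similarity can rank images against a query sentence, which is what the reported Recall@1/5/10 figures measure.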

Item Type: Article
Uncontrolled Keywords: remote sensing; text image matching; triplet networks; EfficientNets; LSTM network
Subjects: D Physical Science > Computer Science
Divisions: Department of > Computer Science
Depositing User: Mr Umendra uom
Date Deposited: 04 Feb 2021 07:22
Last Modified: 17 Jun 2022 10:32
URI: http://eprints.uni-mysore.ac.in/id/eprint/15643
