Taghreed Abdullah and Yakoub Bazi and Mohamad M. Al Rahhal and Mohamed L. Mekhalfi and Lalitha, R. and Mansour Zuair (2020) TextRS: Deep bidirectional triplet network for matching text to remote sensing images. Remote Sensing, 12 (3). pp. 1-19. ISSN 2072-4292
Abstract
Exploring the relevance between images and their respective natural language descriptions is, owing to its paramount importance, regarded as the next frontier in the general computer vision literature. Accordingly, several recent works have attempted to map visual attributes onto their corresponding textual content with some success. However, this line of research has not been widespread in the remote sensing community. On this point, our contribution is three-pronged. First, we construct a new dataset for text-image matching tasks, termed TextRS, by collecting images from four well-known scene datasets, namely AID, Merced, PatternNet, and NWPU. Each image is annotated with five different sentences, each written by a different person to ensure diversity. Second, we put forth a novel Deep Bidirectional Triplet Network (DBTN) for text-to-image matching. Unlike traditional remote sensing image-to-image retrieval, our paradigm carries out retrieval by matching text representations to image representations. To achieve this, we propose to learn a bidirectional triplet network composed of a Long Short-Term Memory (LSTM) network and pre-trained Convolutional Neural Networks (CNNs) based on EfficientNet-B2, ResNet-50, Inception-v3, and VGG16. Third, we top the proposed architecture with an average fusion strategy to fuse the features pertaining to the five image sentences, which enables learning of more robust embeddings. The performance of the method, expressed in terms of Recall@K (the presence of the relevant image among the top K retrieved images for a query text), is promising: it yields 17.20%, 51.39%, and 73.02% for K = 1, 5, and 10, respectively.
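The bidirectional triplet objective and the Recall@K metric described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the margin value, function names, and the use of squared Euclidean distance are assumptions for the sketch.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the positive embedding toward the
    anchor and push the negative at least `margin` farther away.
    (margin=0.2 is an illustrative choice, not taken from the paper.)"""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

def bidirectional_triplet_loss(text_emb, img_emb, neg_text, neg_img, margin=0.2):
    """Bidirectional variant: sum the text->image and image->text directions."""
    t2i = triplet_loss(text_emb, img_emb, neg_img, margin)
    i2t = triplet_loss(img_emb, text_emb, neg_text, margin)
    return np.mean(t2i + i2t)

def fuse_sentence_features(sent_embs):
    """Average fusion of the five per-image sentence embeddings,
    shape (5, dim) -> (dim,)."""
    return np.mean(sent_embs, axis=0)

def recall_at_k(sim, k):
    """Fraction of text queries whose matching image (same row/column
    index) appears among the top-k most similar images.
    sim: (n_queries, n_images) similarity matrix."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in top_k[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))
```

For example, on a 3x3 similarity matrix where two of the three queries rank their matching image first, `recall_at_k(sim, 1)` returns 2/3, mirroring how the reported 17.20%, 51.39%, and 73.02% figures are computed for K = 1, 5, and 10.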
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | remote sensing; text image matching; triplet networks; EfficientNets; LSTM network |
| Subjects: | D Physical Science > Computer Science |
| Divisions: | Department of > Computer Science |
| Depositing User: | Mr Umendra uom |
| Date Deposited: | 04 Feb 2021 07:22 |
| Last Modified: | 17 Jun 2022 10:32 |
| URI: | http://eprints.uni-mysore.ac.in/id/eprint/15643 |