Multilingual Referring Expression Comprehension
Abstract
Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions, yet research remains predominantly English-centric despite increasing global deployment demands. This work addresses multilingual REC through two principal contributions that enable cross-lingual visual grounding at scale. First, we construct a unified multilingual dataset spanning 10 languages (English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian) by systematically expanding 12 existing English REC benchmarks through machine translation and visual context-based translation enhancement. Our dataset comprises 8 million multilingual referring expressions across 70,000 images with 350,000 annotated instances. Second, we introduce an attention-anchored neural architecture that leverages frozen multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation compared to 91.3% English-only performance. Multilingual evaluation shows consistent capabilities across language families, with Romance languages maintaining 2-4 percentage point gaps relative to English, establishing practical feasibility for multilingual visual grounding systems.
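To make the anchor-then-refine idea and the IoU@50 criterion concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the rule for turning an attention grid into a coarse box (attention-weighted mean plus or minus one weighted standard deviation), the corner-offset residual format, and all function names are assumptions made for illustration. Only the general scheme (coarse anchor from attention, refined by a residual, scored by IoU at a 0.5 threshold) comes from the abstract.

```python
from typing import List, Sequence, Tuple

# (x1, y1, x2, y2) in normalized [0, 1] image coordinates (assumed convention).
Box = Tuple[float, float, float, float]


def _clamp(v: float) -> float:
    return min(1.0, max(0.0, v))


def attention_anchor(attn: Sequence[Sequence[float]]) -> Box:
    """Coarse box from a 2D attention grid (hypothetical rule, not from the paper)."""
    rows, cols = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    xs = [(c + 0.5) / cols for c in range(cols)]  # cell centers along x
    ys = [(r + 0.5) / rows for r in range(rows)]  # cell centers along y
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    cx = sum(attn[r][c] * xs[c] for r, c in cells) / total
    cy = sum(attn[r][c] * ys[r] for r, c in cells) / total
    var_x = sum(attn[r][c] * (xs[c] - cx) ** 2 for r, c in cells) / total
    var_y = sum(attn[r][c] * (ys[r] - cy) ** 2 for r, c in cells) / total
    # Never shrink below half a grid cell, so a sharply peaked map still yields a box.
    half_w = max(var_x ** 0.5, 0.5 / cols)
    half_h = max(var_y ** 0.5, 0.5 / rows)
    return (_clamp(cx - half_w), _clamp(cy - half_h),
            _clamp(cx + half_w), _clamp(cy + half_h))


def refine(anchor: Box, residual: Box) -> Box:
    """Add a residual (assumed to be per-corner offsets) to the anchor."""
    x1, y1, x2, y2 = (_clamp(a + d) for a, d in zip(anchor, residual))
    return (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold.

    Uses >= here; some evaluation scripts use a strict inequality instead.
    """
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(preds) if preds else 0.0


# Usage: a 4x4 attention map with all mass in one cell, then a residual that
# enlarges the coarse anchor toward a larger ground-truth box.
attn_map = [[0.0] * 4 for _ in range(4)]
attn_map[1][1] = 1.0
anchor_box = attention_anchor(attn_map)
refined_box = refine(anchor_box, (0.0, 0.0, 0.25, 0.25))
gt_box = (0.25, 0.25, 0.75, 0.75)
score = accuracy_at_iou([anchor_box, refined_box], [gt_box, gt_box])
```

In this toy case the coarse anchor covers only part of the target, and the residual is what moves the prediction over the 0.5 threshold. In a trained model the residual would come from a learned regression head rather than being supplied by hand.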
Multilingual Dataset Construction
We expanded 12 established English REC benchmarks into a unified multilingual corpus spanning 10 languages.
Language Coverage
The corpus covers English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, and Russian.
Source Datasets
Our dataset builds upon 12 established English referring expression benchmarks.
Citation
If you find this work useful, please cite our paper:
@article{nogueira2025multilingualrec,
title={Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs},
author={Nogueira, Francisco Reis},
journal={TBD},
year={2025}
}