Multilingual Referring Expression Comprehension

Francisco Reis Nogueira
Instituto Superior Técnico, Lisboa, Portugal

Abstract

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions, yet research remains predominantly English-centric despite increasing global deployment demands. This work addresses multilingual REC through two principal contributions that enable cross-lingual visual grounding at scale. First, we construct a unified multilingual dataset spanning 10 languages (English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian) by systematically expanding 12 existing English REC benchmarks through machine translation and visual context-based translation enhancement. Our dataset comprises 8 million multilingual referring expressions across 70,000 images with 350,000 annotated instances. Second, we introduce an attention-anchored neural architecture that leverages frozen multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation compared to 91.3% English-only performance. Multilingual evaluation shows consistent capabilities across language families, with Romance languages maintaining 2-4 percentage point gaps relative to English, establishing practical feasibility for multilingual visual grounding systems.
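The accuracy figures above are reported at IoU@50, meaning a prediction counts as correct when its box overlaps the ground-truth box with an intersection-over-union of at least 0.5. A minimal sketch of that metric follows. It assumes boxes in (x1, y1, x2, y2) pixel format and an inclusive threshold; the function names are illustrative and are not taken from the paper's code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold.

    The inclusive comparison (>=) is an assumption; some evaluation
    scripts use a strict comparison instead.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Per-language scores, such as the English-only versus aggregate multilingual comparison, could be obtained by calling `accuracy_at_iou` on the subset of expressions in each language.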

10 languages | 8M expressions | 86.9% accuracy

Multilingual Dataset Construction

We expanded 12 established English REC benchmarks into a unified multilingual corpus spanning 10 languages.

8M expressions | 10 languages | 70K images | 346K objects | 12 datasets | <8% multilingual gap

Language Coverage

English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian

Source Datasets

Our dataset builds upon 12 established English referring expression benchmarks.

Citation

If you find this work useful, please cite our paper:

@article{nogueira2025multilingualrec,
  title={Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs},
  author={Nogueira, Francisco Reis},
  journal={TBD},
  year={2025}
}