Automatic building of bilingual dictionaries

written by Jorge Gracia del Río on 2022-03-01

The world we live in is highly multilingual, as is its digital counterpart. Despite the predominance of English across the Web, more and more content and digital resources are becoming available in a growing number of languages, better reflecting linguistic diversity. One such type of language resource is the bilingual (or multilingual) electronic dictionary. These dictionaries serve as the basis for many Natural Language Processing (NLP) tasks, from Machine Translation to Cross-lingual Knowledge Transfer.

However, it is unrealistic to expect high-quality, human-crafted bilingual dictionaries to be available for every possible language pair (consider that over 7,000 languages are spoken in the world!). The automatic inference of new bilingual (and multilingual) dictionaries from existing ones is therefore an active research field, aimed at extending the coverage of the bilingual dictionaries available on the Web.

A number of different approaches have been developed over recent decades. They can be grouped into three categories [1] (the techniques cited here are illustrative; the list is not exhaustive):

Pivot-based methods. These methods start from two bilingual dictionaries, one containing translations from language L1 to language L2, and another from L2 to L3. Language L2 acts as a pivot language to infer a new set of translations from L1 to L3. The most basic approach of this type is direct transitivity: if a word in L1 translates to a word in L2, and that word translates to a word in L3, the L1 word is taken to translate to the L3 word. More elaborate approaches are possible, such as those based on the One Time Inverse Consultation technique [2].
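As a rough illustration of pivoting, the sketch below applies direct transitivity to toy dictionaries and then ranks the candidates with a Dice-style overlap of shared pivot words. The function names, the toy data and the scoring formula are assumptions made for this example; the score is only in the spirit of inverse consultation and is not the exact formulation given in [2].

```python
from collections import defaultdict


def pivot_translations(l1_to_l2, l2_to_l3):
    """Infer L1 -> L3 candidates by direct transitivity through the pivot L2."""
    inferred = defaultdict(set)
    for source, pivots in l1_to_l2.items():
        for pivot in pivots:
            # Every L3 translation of the pivot becomes a candidate for the source.
            inferred[source].update(l2_to_l3.get(pivot, ()))
    return dict(inferred)


def pivot_overlap(source, target, l1_to_l2, l3_to_l2):
    """Dice-style overlap between the pivot words of `source` and of `target`.

    Hypothetical scoring used only for illustration: the more pivot words the
    two entries share, the more likely they are translations of each other.
    """
    source_pivots = l1_to_l2.get(source, set())
    target_pivots = l3_to_l2.get(target, set())
    if not source_pivots or not target_pivots:
        return 0.0
    shared = len(source_pivots & target_pivots)
    return 2 * shared / (len(source_pivots) + len(target_pivots))


# Toy dictionaries, made up for illustration (Spanish -> English -> French).
es_en = {"banco": {"bank", "bench"}}
en_fr = {"bank": {"banque", "rive"}, "bench": {"banc"}}
fr_en = {
    "banque": {"bank"},
    "rive": {"bank", "shore", "riverside"},
    "banc": {"bench"},
}

candidates = pivot_translations(es_en, en_fr)
scores = {t: pivot_overlap("banco", t, es_en, fr_en) for t in candidates["banco"]}
```

With this toy data, the polysemous pivot "bank" brings in "rive" as a candidate for "banco", and the overlap score ranks it below "banque" and "banc". This is the kind of noise that plain transitivity produces and that more elaborate pivot methods try to filter out.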

Graph-based methods. When the dictionaries are connected in a richer structure, such as a translation graph, techniques based on graph exploration can be used, for example those based on circuits [3] or cycles [1,4].
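The intuition behind cycle-based exploration can be sketched as follows: two words in different languages that are not directly linked, but lie on a common cycle of known translations, are proposed as a candidate pair. The graph representation, function names and toy data are assumptions made for this example; the cited methods add further filtering and scoring of the cycles, which is not reproduced here.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected translation graph; nodes are (language, word) tuples."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def cycles_through(graph, start, max_len):
    """Yield simple cycles (as node lists) that pass through `start`.

    The search is depth-first and bounded to `max_len` nodes per cycle.
    Each cycle is found once per direction, which is harmless here.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph[node]:
            if nxt == start and len(path) >= 3:
                yield path
            elif nxt not in path and len(path) < max_len:
                stack.append((nxt, path + [nxt]))


def cycle_candidates(graph, start, max_len=6):
    """Propose words on a cycle with `start` that are not yet linked to it."""
    found = set()
    for cycle in cycles_through(graph, start, max_len):
        for node in cycle:
            same_language = node[0] == start[0]
            if node != start and node not in graph[start] and not same_language:
                found.add(node)
    return found


# Toy graph, made up for illustration: EN and FR are not directly connected.
edges = [
    (("en", "forest"), ("es", "bosque")),
    (("es", "bosque"), ("fr", "forêt")),
    (("fr", "forêt"), ("ca", "bosc")),
    (("ca", "bosc"), ("en", "forest")),
    (("es", "bosque"), ("pt", "bosque")),  # dangling edge, not part of any cycle
]
graph = build_graph(edges)
proposed = cycle_candidates(graph, ("en", "forest"))
```

Here the French word is proposed for the English one because both sit on the same four-node cycle, while the Portuguese word, reachable only through a dangling edge, is not.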

Distributional semantics-based methods. These methods can be based on vector space models [5], on leveraging statistical similarities between two languages [6], or on the inference and use of cross-lingual word embeddings [7].
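One simple way to use cross-lingual word embeddings for dictionary induction is nearest-neighbour retrieval: for each source word, pick the target word whose vector is closest by cosine similarity. The sketch below assumes the vectors of both languages already live in a shared space; learning that alignment is the hard part and is not shown. The vectors and function names are made up for illustration, and this is not the method of [7], which goes beyond plain nearest-neighbour retrieval.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_translation(word, source_vecs, target_vecs):
    """Return the target word whose vector is most similar to that of `word`."""
    query = source_vecs[word]
    return max(target_vecs, key=lambda t: cosine(query, target_vecs[t]))


# Toy 3-dimensional embeddings, assumed to be already aligned across languages.
en_vecs = {"dog": [0.9, 0.1, 0.0], "house": [0.1, 0.9, 0.2]}
fr_vecs = {"chien": [0.8, 0.2, 0.1], "maison": [0.0, 1.0, 0.3]}

induced = {w: nearest_translation(w, en_vecs, fr_vecs) for w in en_vecs}
```

In practice the vocabularies contain many thousands of words, so the retrieval step is usually vectorised, and a similarity threshold is often needed to avoid proposing a translation for every word.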

Historically, systems have been evaluated on different data, in different evaluation frameworks, with different metrics and in different application contexts, which makes systematic comparison difficult. To alleviate this issue and to stimulate further research in the field, the Translation Inference Across Dictionaries (TIAD) initiative started in 2017 as a shared task. It explores methods and techniques for automatically generating new bilingual (and multilingual) dictionaries from existing ones, within a coherent experimental framework that enables reliable validation of results and solid comparison of the processes used. The shared task has run periodic evaluation campaigns and organised workshops to communicate and discuss the results. These workshops have been co-located with well-known conferences: the Language, Data and Knowledge (LDK) conference in 2017, 2019 and 2021, and the Language Resources and Evaluation Conference (LREC) in 2020, with the next edition to be held at LREC in June 2022.

Since its second edition, the evaluation data has been based on the Apertium RDF graph. In particular, the participating systems are asked to automatically generate new translations among three languages (English, French and Portuguese) based on known translations contained in the Apertium RDF graph. As these languages (EN, FR, PT) are not directly connected in this graph, the participants apply their methodologies to derive translations for these pairs from the other translations already available in the graph. The evaluation of the results is blind and is carried out by the organisers against manually compiled pairs from the Lexicala dictionaries (https://lexicala.com/resources#dictionaries).
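Comparing system output against a manually compiled gold standard can be sketched with set-based precision, recall and F1 over translation pairs. The choice of these three metrics is an assumption made for this example, as are the function name and the toy data; the actual evaluation protocol is the one defined by the shared task organisers.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy EN -> FR pairs, made up for illustration.
gold_pairs = {("dog", "chien"), ("house", "maison"), ("cat", "chat")}
system_pairs = {
    ("dog", "chien"),
    ("house", "maison"),
    ("cat", "chatte"),
    ("dog", "chienne"),
}

precision, recall, f1 = evaluate(system_pairs, gold_pairs)
```

In this toy case two of the four predicted pairs appear in the gold standard, giving a precision of 0.5 and a recall of 2/3. A pair that is a valid translation but absent from the gold standard counts as an error here, which is one reason why figures obtained against different reference dictionaries are hard to compare.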

Since 2019, a total of 34 systems have been evaluated. Results have improved over the years, although there is still room for further improvement.

More information on the next edition of TIAD can be found at https://tiad2022.unizar.es/. The TIAD’22 shared task is supported by Prêt-à-LLOD along with other projects and initiatives.


References

[1] Goel, S., Gracia, J., & Forcada, M. L. (2021). Bilingual dictionary generation and enrichment via graph exploration. Semantic Web Journal [IN PRESS]. http://www.semantic-web-journal.net/content/bilingual-dictionary-generation-and-enrichment-graph-exploration-0

[2] Tanaka, K. and Umemura, K. 1994. Construction of a Bilingual Dictionary Intermediated by a Third Language. In Proceedings of the 15th Conference on Computational Linguistics, Volume 1, 297–303. ACL. http://dl.acm.org/citation.cfm?id=991937

[3] Mausam, S. Soderland, O. Etzioni, D. Weld, M. Skinner and J. Bilmes, Compiling a Massive, Multilingual Dictionary via Probabilistic Inference, in: Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Association for Computational Linguistics, Suntec, Singapore, 2009, pp. 262–270. https://www.aclweb.org/anthology/P09-1030.

[4] Villegas, M., Melero, M., Gracia, J., & Bel, N. (2016). Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries. In Proc. of 10th Language Resources and Evaluation Conference (LREC’16) Portorož (Slovenia) (pp. 868–876). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2016/pdf/613_Paper.pdf

[5] R. Rapp, Identifying word translations in non-parallel texts, in: Proc. of the 33rd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 1995, p. 320. doi:10.3115/981658.981709. http://portal.acm.org/citation.cfm?doid=981658.981709.

[6] A. Irvine and C. Callison-Burch, Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals, in: Proc. of NAACL-HLT 2013, Association for Computational Linguistics, 2013, pp. 9–14. https://www.aclweb.org/anthology/C98-1066/.

[7] M. Artetxe, G. Labaka and E. Agirre, Bilingual lexicon induction through unsupervised machine translation, in: Proc. of 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Association for Computational Linguistics (ACL), 2019, pp. 5002–5007. ISBN 9781950737482. doi:10.18653/v1/p19-1494.