Increasingly, multilingual language resources are available as Linguistic Linked Open Data (LLOD) [1] which model relations between resources and include rich metadata with standardized, non-proprietary technologies – a trend which promises to lead to improved multilingual NLP systems. However, how to effectively utilize these resources in practical applications is not self-evident, in particular for specialized technical domains.
One example of such a domain are posts from online health communities, i.e., web fora and similar systems focusing on health topics used by patients, caregivers and/or professionals in a wide range of languages. Online health communities are a relevant data source for a range of emerging application areas, such as public health monitoring or evidence generation for regulatory drug approval [2], which entail analysing patients’ experiences beyond clinical trials. A central aspect of these so-called patient-reported outcomes is health-related quality of life (HRQoL) [3].
In a recent publication to be presented at the upcoming SEMANTiCS 2021 conference, Prêt-à-LLOD pilot partner Semalytix reports on a machine learning approach to classify online health community posts into categories derived from facets of HRQoL as described in the World Health Organization’s quality of life surveys (WHOQOL) [4], e.g., pain and discomfort, work capacity, financial resources. Semalytix addresses the problem of predicting HRQoL facets across languages via a multitude of individual binary classifiers trained using a cross-lingual transfer learning framework based on bilingual lexica available as multilingual LLOD. The combination of LLOD and transfer learning is motivated by the flexibility required to predict a large number of HRQoL facets (a total of 19 facets is considered) in a multilingual setting: Transfer learning allows to train classifiers for different languages based on training data in a single source language, without the need of additional annotated data for each target language.
The Semalytix approach is based on word embeddings and crosslingual supervision via token-level lexica (supervised bilingual word embeddings). Thus, the training procedure and resulting models are considerably less complex than state-of-the-art cross-lingual zero-shot models, which are based on contextualized representations learnt via pre-training transformer-based language models on massive multilingual corpora. Evaluation results presented show that the models developed by Semalytix, when being combined with a baseline approach that integrates machine translation and rule-based extraction algorithms, are strong contestants to cross-lingual transformers, thus emphasizing the prospects of resources and technologies being developed in Prêt-à-LLOD for rea-world multilingual text analytics applications.
References
[1] Declerck T, McCrae JP, Hartung M, Gracia J, Chiarcos C, Montiel-Ponsoda E, et al. Recent Developments for the Linguistic Linked Open Data Infrastructure. In: Proc. of LREC; 2020. p. 5660-7. [2] McDonald L, Malcolm B, Ramagopalan S, Syrad H. Real-world Data and the Patient Perspective: the PROmise of Social Media? BMC Medicine. 2019;17. [3] Bullinger M, Quitmann J. Quality of life as patient-reported outcomes: principles of assessment. Dialogues in Clinical Neuroscience. 2014 Jun;16(2):137-45. [4] World Health Organization. WHOQOL: Measuring Quality of Life; 1997. Available here