Drilling for Information - The Pret-a-LLOD Language Resource Transformation Software

written by Christian Chiarcos, GU Frankfurt on 2022-03-06

If data is the new oil, as famously claimed at the beginning of the current decade, then methods that help us access, interpret and process data in a meaningful way are the drilling technology of the 21st century. However, data comes in various forms, and in language technology, the complexities of formats and annotations cannot easily be reduced or simplified, as a certain level of complexity is inherent to human language. Most academic research on, and most industrial applications of, language technology are thus confronted with a universe of formats and formalisms, and with the task of getting tool X to work with data Y or having its output processed further by tool Z.

Resource heterogeneity is, indeed, one of the most time-consuming challenges when working with linguistic data, given the sheer wealth of different data formats and annotation schemes, mixed with an ever-growing set of specialized tools that accept and produce very specific input and output. Increasingly powerful software relying on Deep Learning requires ever larger sets of training data and combinations of corpora, lexical data, ontologies etc., which are only available in heterogeneously annotated resources.

Linked Data is a means to address the problem of combining decentrally organized resources: it introduces a common data model, the Resource Description Framework (RDF), and supports ontological modeling with standardized data models for specific types of resources, such as OntoLex-Lemon for lexical data, or Web Annotation, the NLP Interchange Format, CoNLL-RDF and POWLA for linguistic annotations. With respect to annotation schemes, too, standardization efforts have been made in the form of the GOLD ontology, ISOcat or the Ontologies of Linguistic Annotation. Yet neither are all resources available as linked data, nor do all of them adhere to the same data models and annotation schemes. There can also be variation in how data models are adapted to specific use cases.

We designed FINTAN, the Flexible INtegrated Transformation and Annotation eNgineering platform [1], to tackle many of these issues:

First and foremost, FINTAN provides universal transformation capabilities. That is, it allows users to transform any source format to any target format. Instead of building a fully integrated converter suite, which would either be limited to very specific source and target formats or could only produce fairly generic output, we decided to develop a framework that allows users to easily integrate existing converters, ontologies and knowledge graphs and to operate on their own data models.
FINTAN is based on RDF technology: RDF is a generic formalism for knowledge representation and information integration based on labeled directed graphs. Technically, every linguistic annotation can be encoded as a graph, every dictionary can be represented as a feature structure, and, thus, every format for either purpose can be transformed to RDF. The requirement for transforming data from a source to a target format is that the data models of both can be expressed in RDF, and that RDF converters for the source and target formats are provided.
FINTAN is scalable: We do not load a data set as a whole, but read it from a stream, split it into segments and process each segment (and, optionally, its preceding and following context) separately and in parallel. This keeps FINTAN fast and lean, with a smaller memory footprint and increased performance when applied to bulk data.
FINTAN is powerful: Beyond mere transformation from one format to another, FINTAN makes it possible to restructure and enrich its input(s), also by consulting external data sources available on the web. The key is that users define transformation operations as SPARQL updates and can use the full potential of SPARQL to run federated queries, to consult external (web) services at runtime, and to perform updates on RDF graphs. We implemented a number of demonstrators that use FINTAN technology for annotation engineering, with complex transformation workflows that take one or more pieces of manually annotated source data and integrate them into a different, richer representation that can then be used to train machine learning tools. Previous use cases include entity linking, dictionary-based annotation, syntactic parsing and the joint processing of syntactic and semantic annotations.
FINTAN is modular and portable: A transformation workflow in FINTAN typically consists of a load operation (from format X), the application of one or multiple updates, and a write operation (to format Y). Updates are written in SPARQL, typically perform one small operation at a time and, where needed, can be re-used in other transformation workflows. In particular, an existing workflow can easily be ported from one source format (X) to another (Z) if an update operation is provided that performs the transformation from (the RDF model for) X to (the RDF model for) Z.
FINTAN has a graphical workflow editor: RDF technology is known for being generic and universally applicable, but also for coming with a certain entry barrier. For creating and managing complex transformation workflows, FINTAN provides visualizations that help users configure and manage their workflows.
FINTAN can be easily extended: Aside from native support for the technologies it builds upon, SPARQL and Java, FINTAN allows users to integrate any type of user-provided module that conforms to our OpenAPI specifications. In particular, this includes Docker containers.
Finally, FINTAN workflows can be easily deployed: From the workflow editor, we can generate a Docker container that allows users to execute the workflow without having to replicate its working environment locally.
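
To make the load, update and write pattern and the segment-wise processing described above more tangible, here is a minimal sketch in plain Python. It is not FINTAN's API: FINTAN itself builds on Java and SPARQL, and all names used here (read_segments, lowercase_lemma, mark_sentence_end, run_workflow) are hypothetical. The sketch assumes tab-separated, CoNLL-like input with blank lines between sentences, and it uses ordinary Python functions as stand-ins for SPARQL updates.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List

Segment = List[List[str]]              # one sentence: rows of tab-separated columns
Update = Callable[[Segment], Segment]  # stand-in for one small SPARQL update


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Load step: split a CoNLL-like stream into segments at blank lines."""
    segment: Segment = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            segment.append(line.split("\t"))
        elif segment:
            yield segment
            segment = []
    if segment:
        yield segment


def lowercase_lemma(segment: Segment) -> Segment:
    """Hypothetical update: append a column holding the lowercased word form."""
    return [row + [row[1].lower()] for row in segment]


def mark_sentence_end(segment: Segment) -> Segment:
    """Hypothetical update: append a column that flags the last token of the segment."""
    last = len(segment) - 1
    return [row + ["EOS" if i == last else "_"] for i, row in enumerate(segment)]


def apply_updates(segment: Segment, updates: List[Update]) -> Segment:
    """Apply the small updates to one segment, one after another."""
    for update in updates:
        segment = update(segment)
    return segment


def run_workflow(lines: Iterable[str], updates: List[Update]) -> Iterator[str]:
    """Process segments in parallel and write them back in their input order.

    Note: Executor.map submits all segments up front, so a real streaming
    setup would additionally bound the number of pending segments.
    """
    with ThreadPoolExecutor() as pool:
        processed = pool.map(lambda s: apply_updates(s, updates), read_segments(lines))
        for segment in processed:
            # Write step: serialize the segment back to tab-separated text.
            yield "\n".join("\t".join(row) for row in segment) + "\n"


source = ["1\tFINTAN\n", "2\tTransforms\n", "\n", "1\tData\n"]
result = list(run_workflow(source, [lowercase_lemma, mark_sentence_end]))
```

Because each update does one small thing, the same functions can be recombined in other workflows, which is the kind of re-use that the SPARQL updates are meant to enable. Swapping read_segments for a reader of another format would leave the updates untouched, as long as the segments keep the same shape.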

Overall, the FINTAN platform allows a user to combine existing converters, to implement powerful transformations, to apply them to data streams, and to manage transformation and annotation engineering workflows in a user-friendly environment, as summarized in the following picture:

Curious?

Fork us on GitHub

Footnotes

[1] FINTAN is an acronym, but a meaningful one. Like the Pret-a-LLOD workflow management system Teanga, it adopts an Irish name. In Irish mythology, Fintan was an ancient sage, the personification of knowledge and wisdom, but at the same time a shapeshifter, and thus also referred to as the "Salmon of Knowledge". Transformation and knowledge are the core features of the FINTAN platform as well.