Jean-Baptiste Camps and Frédéric Duval
Linguistic annotation of scholarly editions is critical to a renewed and deepened reading of texts. Lemmatization and POS or syntactic tagging open up new possibilities, with features such as on-demand paradigm visualisation, and increase the potential for computational analysis. Annotated corpora are also essential for training lemmatizers with deep learning algorithms. Today, we are still hindered by the scarcity of available data and by small corpus sizes, particularly as regards ancient languages and their specific difficulties.
In the course of the PSL project LAKME (Linguistically Annotated Corpora using Machine Learning Techniques), École des chartes and Lattice (ÉNS) have started to collaborate on the production of annotated corpora for Old French and Medieval Occitan, both low-resource languages from an NLP perspective. Thanks to the work of annotators and the involvement of researchers (Jean-Baptiste Camps, Frédéric Duval), a 50,000-word corpus, annotated with lemmas, POS and morphology, has been produced for Old French, along with an equivalent corpus for Old Occitan (in collaboration with the CORLIG project, Univ. Paris-Sorbonne). Lemmatization models have been trained with encouraging results (accuracy > 94%), thanks to the contribution of the ÉNC (JBC, Thibault Clérice) to the development of Pandora (https://github.com/hipster-philology/pandora/), in partnership with the University of Antwerp (Mike Kestemont).
However, to gain in efficiency, scaling up is imperative. For this, in addition to further improving the lemmatizers, it is necessary to reduce as much as possible the time spent on close-reading correction of annotated data, while maintaining (or improving) the quality and integrity of the data, and allowing for collaborative work or crowd-sourcing. To that end, the development of a language-independent post-correction tool has started at the ÉNC (pandora-postcorrect-app, https://github.com/hipster-philology/pandora-postcorrect-app).
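As an illustration of how systematic correction can save close-reading time, the minimal sketch below propagates a single human correction to every token that shares the same form and the same erroneous predicted lemma. It is only a sketch under stated assumptions: the `Token` structure, the `propagate_correction` function and the toy tokens are hypothetical and do not describe the actual data model or interface of pandora-postcorrect-app.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical annotated token: surface form, predicted lemma, POS tag."""
    form: str
    lemma: str
    pos: str


def propagate_correction(
    tokens: List[Token], form: str, wrong_lemma: str, right_lemma: str
) -> Tuple[List[Token], int]:
    """Apply one correction to all tokens with the same form and wrong lemma.

    Returns a new token list and the number of tokens changed, so that the
    corrector can review the scope of the batch change. The input list is
    left untouched, which keeps the uncorrected data available.
    """
    corrected: List[Token] = []
    changed = 0
    for tok in tokens:
        if tok.form == form and tok.lemma == wrong_lemma:
            corrected.append(replace(tok, lemma=right_lemma))
            changed += 1
        else:
            corrected.append(tok)
    return corrected, changed


# Toy data (illustrative only): the lemmatizer output "roit" stands for a
# recurring error on the form "rois".
predicted = [
    Token("li", "le", "DET"),
    Token("rois", "roit", "NOM"),
    Token("rois", "roit", "NOM"),
    Token("rois", "roi", "NOM"),
]
fixed, n_changed = propagate_correction(predicted, "rois", "roit", "roi")
```

In this sketch a single decision by the corrector updates two tokens at once, while tokens that were already correct are left unchanged; a real tool would presumably also let the corrector inspect each affected context before validating the change.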
To make this information available to readers, through interactive features for reading or querying enriched texts, further development will be needed, using the CMS Nemo and, if possible, in collaboration with the Alpheios project (http://alpheios.net/), thus adding Old French to the languages covered by that project (Greek and Latin; forthcoming, Syriac and Hebrew).
In addition, the present project will allow us to produce:
- open-source, language-independent tools, to facilitate the systematic correction of annotated data and to allow the production of innovative, linguistically enriched editions;
- gold-standard annotated data for the ancient periods of the Gallo-Romance languages, Old French and Occitan, to be published as open data, at the service of the community.