Digital component



The digital component of Scripta is developing digital tools for the study of written documents. Given the unique concentration in PSL of experts in some of the most diverse forms of writing, particular emphasis will be placed on qualitative and quantitative approaches in paleography as well as on texts and their editions and linguistic analysis.

On the one hand, an ergonomic interface will be created to support the whole process, from manual transcription and data entry right up to the publication of a digital (and/or printed) edition of texts, whether a transcribed witness, a text with critical apparatus, a translation, a commentary and/or a linguistic analysis, all according to international standards including TEI, IIIF, and MAF. A central brick will be Archetype, developed by Peter Stokes (EPHE, PSL) and a team at King’s College London, which allows deep annotation and extensive paleographic study of writing. On the other hand, we will combine more ‘manual’ digital approaches with new computational possibilities. A second fundamental brick will therefore be kraken, developed by Benjamin Kiessling (PSL research engineer): it comprises a module for handwritten text recognition (HTR) which is being extended to allow the automatic analysis not only of printed matter but also of manuscripts. In this module, the automatic analysis of images of written documents will use convolutional neural networks to isolate objects from their background and distinguish between main writing, decoration (illuminations, drop capitals, etc.), and interlineal or marginal annotations.

A third brick will be a deep transcription interface working in conjunction with Archetype on the one hand and kraken on the other. Transcriptions that are directly connected to the images will facilitate the publication of facsimile digital editions, but they will also enable the preparation of training material for automatic transcription, as well as the automatic alignment of existing transcriptions with the image. This direct link between text and image will allow new forms of visual queries, such as retrieving cut-out images of every occurrence of a specific word, finding all words that look like a given example image (wordspotting), and so on. The combination of the quantitative approach (kraken) with the qualitative approach (Archetype) will enable the clustering of all the letters in a manuscript on the basis of deep manual annotation and thus, for example, distinguishing between allographs (variants) of a given letter and analyzing their distribution across a manuscript or even across a corpus. Coupled with databases of manuscripts that are dated, localised or written by an identified scribe, a further module could automatically find other manuscripts that are similar in this respect.
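To make the idea of clustering letter forms more concrete, the following is a minimal sketch and not the method used by kraken or Archetype. It assumes that each annotated instance of a letter has already been reduced to a short numeric feature vector (the two features and their values here are purely hypothetical) and groups the instances with a very small k-means, so that each resulting group can be read as a candidate allograph.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means: return one cluster index per input point.

    Illustrative only. The first k points are used as starting
    centroids so that the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical feature vectors for four instances of the letter "a"
# (for example, two shape measurements taken from manual annotations).
letter_a_instances = [(0.90, 0.10), (0.10, 0.80), (0.85, 0.15), (0.15, 0.90)]
allograph_labels = kmeans(letter_a_instances, k=2)
print(allograph_labels)  # [0, 1, 0, 1]
```

In this toy data the first and third instances fall into one group and the second and fourth into another. Counting such labels per folio or per manuscript would give the kind of distribution mentioned above.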

A further step will comprise the development of modules for linguistic annotation, the analysis of textual variants, stemmatology, and intertextual analysis, all of which will be available to researchers. As with the component for automatic transcription, here also we plan a dialogue between manual annotation of training data for subsequent automatic annotation, and an ergonomic module for manual correction of the automatic results after processing using the Pyrrha system already being developed at the ENC.

In connection with various projects in progress, we also foresee interfaces for import and export that will allow the integration of existing textual, visual or linguistic data, or the export to digital or printed publications using different standards including IIIF (the International Image Interoperability Framework) and DTS (Distributed Text Services). Services such as DTS make it possible to precisely identify any part of a given digital text (for instance a paragraph, a line, a sentence or a word), for example in order to quote it in another digital publication. In the same way, IIIF makes it possible to precisely identify any part of a digital image, for example in order to create links between texts and images and/or to automatically import images from a range of libraries, archives and other repositories.
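As an illustration of how IIIF addresses part of an image, the sketch below builds a URL following the IIIF Image API pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format}, where the region is given in pixels as x,y,w,h. The server address, the image identifier and the helper name iiif_region_url are hypothetical, and the default size keyword "max" assumes version 3 of the Image API (version 2 servers use "full").

```python
from urllib.parse import quote

def iiif_region_url(base_url, identifier, x, y, width, height,
                    size="max", rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL for a rectangular pixel region.

    Hypothetical helper for illustration; it only assembles the URL
    and does not contact any server.
    """
    if width <= 0 or height <= 0:
        raise ValueError("region width and height must be positive")
    region = f"{x},{y},{width},{height}"
    # The identifier is percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")

# Hypothetical server and identifier: one line of text on a folio.
line_url = iiif_region_url("https://iiif.example.org/images",
                           "ms-123/f12r", 100, 200, 300, 40)
print(line_url)
# https://iiif.example.org/images/ms-123%2Ff12r/100,200,300,40/max/0/default.jpg
```

A transcription that stores such coordinates for each line or word can link every piece of text to the exact image region it was read from.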