ENS-PSL
45 rue d'Ulm
75005 Paris
France
DHAI seminar session, "Texts as images. Collaborating and training models for manuscript studies from layout segmentation to text recognition and to historical analysis", with Mélodie Boillet (LITIS, Teklia) and Dominique Stutzmann (IRHT-CNRS)
In recent years, the analysis of medieval written sources has been transformed by the application of digital methods, especially image processing based on artificial intelligence and machine learning. Access to texts has been fundamentally expanded by methods such as full-text indexing, automatic text recognition (Handwritten Text Recognition) and text-mining tools such as Named Entity Recognition, stylometry or the detection of textual borrowings (Text Reuse Detection). At the interface between computer vision and digital images of text sources, other fields of research have also developed, such as the analysis of layout and mise-en-page or the exploration of writing as an image, from scribe identification to script classification.
In this presentation, we will discuss the methodology and the results of two interdisciplinary research projects funded by national and European institutions: HORAE - Hours: Recognition, Analysis, Edition (ANR-17-CE38-0008) and HOME - History of Medieval Europe (JPI-CH Cultural Heritage). HORAE investigates the production and circulation of books of hours, a complex type of prayer book, in the late Middle Ages, and HOME addressed the constitution of registers and cartularies containing copies of legal documents.
We will present how different actors worked together on the study of manuscripts, aligning historical research questions with technical ones, with partners from different disciplines (Humanities, NLP, computer vision and machine learning) and from the public and private sectors. Then, we will present specific models for document object detection (Doc-UFCN) and text recognition (PyLaia) and discuss the next steps in the historical analysis and in further research projects (Socface).
Mélodie Boillet studied data science, specializing in machine learning and computer vision. She completed her PhD in January 2023 in a collaboration between the University of Rouen-Normandy (LITIS) and the company Teklia. Her work focused on deep learning methods for detecting objects in document images. These methods were developed within the HORAE and HOME projects, among others, and have been applied to many other automatic document processing projects.
Dominique Stutzmann studied history at the University Paris 1 - Sorbonne (PhD 2009), the Ecole nationale des Chartes and the EPHE-PSL (habilitation 2021). He is a research professor (directeur de recherche) with the CNRS and an honorary professor at the Humboldt University in Berlin. He was the PI of the HOME and HORAE projects. His work focuses on scribal cultures in religious communities of the Middle Ages and the implementation of computer vision for the study of medieval handwriting and texts.
Contact scott.trigg@obspm.fr to receive the Zoom link.