BookNLP, French part of the Multilingual BookNLP project

Thierry POIBEAU (Researcher, CNRS), updated on 17 July 2021

Multilingual BookNLP is a UC Berkeley project that aims to produce NLP pipelines adapted to literary texts (in particular novels: analysis of characters, places, etc.). LATTICE will produce the NLP pipeline for French.

The French BookNLP project aims to develop an NLP pipeline for the analysis of large French literary corpora, in connection with David Bamman's Multilingual BookNLP project. Multilingual BookNLP aims to develop NLP pipelines for several languages, but French is not included. It is this gap that we aim to fill.

A team of researchers around David Bamman has developed the BookNLP suite, which enables the large-scale annotation of novels and thereby supports qualitative and quantitative studies of this type of corpus (structure of novels, character networks, etc.). The annotation mainly concerns references to characters and some other entities (places, some artefacts), as well as the related co-reference chains.

Multilingual BookNLP is an ongoing project at Berkeley to redevelop the initial NLP pipeline and extend it to five other languages. We are developing the corresponding resources for French, in coordination with the Berkeley project.

We will reuse existing tools as much as possible. Since NLP tools rely heavily on machine learning techniques, a large part of the work consists in annotating the corpora needed for training. For French, we intend to start from the Democrat corpus, developed within the framework of the ANR project of the same name (see here and here for the resources).


In order to understand referential expressions and reference chains, the approach followed in Democrat combined methods from linguistics and NLP. The corpus used here is a selection of novels from the 19th and early 20th centuries (Democrat included texts from other periods and genres, but just a subpart of Democrat will be reused in BookNLP). The annotations of the Democrat project will be "recycled" to correspond to the Multilingual BookNLP scheme. The first experiments showed that the two schemes (Democrat and BookNLP) are largely compatible, even if the Democrat annotation will have to be completed. In particular, the markables will have to be "typed", and other specific additions are also to be expected (such as the annotation of dialogue sequences, a task which could probably be partly automated).

The Democrat corpus is freely available under a Creative Commons license. The French BookNLP corpus and related tools will also be made available and freely reusable.


  • Thierry Poibeau, Lattice (CNRS & ENS/PSL & Université Sorbonne nouvelle) : director
  • Frédérique Mélanie-Becquet (engineer, Lattice) : coordination
  • Claude Grunspan (intern 2021) : annotation
  • Jean Barré (intern 2021) : annotation
  • Olga Seminck (postdoctoral researcher, Lattice) : annotation
  • Clément Plancq (engineer, Lattice) : annotation, NLP tools
  • Laurette Chardon (research engineer, Crisco, Univ. de Caen) : annotation, NLP tools
  • Ioana Galleron (university professor, Lattice, Université Sorbonne nouvelle) : annotation
  • Frédéric Landragin : distinguished scientific advisor

See the GitHub and the seminar