Idiolecte, modelling the evolution of a personal writing style

By Thierry Poibeau (Researcher, CNRS) and Olga Seminck (Postdoctoral student, CNRS), updated on 17 July 2021

The Idiolecte project aims to model the evolution of a personal writing style.

The way individuals express themselves is unique, but it changes over their lifetime. Quantitative studies of idiolectal evolution are nevertheless rare, mainly because large corpora are lacking. This project addresses that gap. We first collected a corpus, the Corpus for Idiolectal Research (CIDRE), which contains dated French novels by eleven prolific 19th-century writers (37 million words in total).

Second, we analyze the data quantitatively to answer the following question: how can the diachronic evolution of the idiolect be characterized? Using hierarchical clustering and simple linear regression, we show that the evolution of the idiolect is monotonic in the mathematical sense. This property in turn lets us define a machine learning task: predicting the year in which a work was written. For the majority of the authors in our corpus, prediction accuracy is very high.
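To make the idea concrete, here is a minimal sketch of dating a work with simple linear regression. It is not the project's actual pipeline: the data are invented toy values, the single stylistic feature is hypothetical, and the helper names (`fit_line`, `is_monotonic`, `predict_year`) are assumptions introduced only for illustration.

```python
# Toy data, invented for illustration: (publication year, relative frequency
# of one hypothetical stylistic feature) for the works of a single author.
works = [(1830, 0.012), (1835, 0.014), (1841, 0.017), (1848, 0.019), (1856, 0.023)]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


years = [year for year, _ in works]
freqs = [freq for _, freq in works]

# Regress the year on the feature frequency, so the line can date a new work.
intercept, slope = fit_line(freqs, years)


def predict_year(freq):
    """Estimate the writing year from the feature frequency (hypothetical helper)."""
    return intercept + slope * freq


# Frequencies in chronological order: a monotonic trend is what makes dating possible.
chronological_freqs = [freq for _, freq in sorted(works)]
print(is_monotonic(chronological_freqs))
print(round(predict_year(0.020)))
```

If a feature drifts in one direction over an author's career, its value in an undated work points to a position on the timeline. A realistic setup would combine many features and evaluate on held-out works.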

After applying a feature selection algorithm, we can examine the most important features: the linguistic structures that most strongly influence the diachronic evolution of the idiolect. We find that some of these features are stylistic and have previously been noted in qualitative literary studies. In a future series of experiments, we would like to investigate how much personal language change is affected by collective language change.
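As an illustration of what feature selection can look like, the sketch below ranks features by the absolute Pearson correlation between their frequency and the year of writing, then keeps the top k. The text above does not say which algorithm the project uses, so this simple filter method is a stand-in; the feature names and values are invented, and `pearson` and `select_features` are hypothetical helpers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration: one frequency per work, in year order.
years = [1830, 1835, 1841, 1848, 1856]
features = {
    "feature_a": [0.012, 0.014, 0.017, 0.019, 0.023],  # rises steadily
    "feature_b": [0.030, 0.026, 0.025, 0.021, 0.018],  # falls steadily
    "feature_c": [0.010, 0.015, 0.009, 0.016, 0.011],  # no clear trend
}


def select_features(features, years, k):
    """Keep the k features whose frequency tracks the year most closely."""
    scores = {name: abs(pearson(values, years)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


top = select_features(features, years, k=2)
print(top)
```

Taking the absolute value matters here: a feature that an author steadily abandons is as informative about chronology as one the author steadily adopts.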

Team

  • Thierry Poibeau: Lattice, CNRS (project lead)
  • Dominique Legallois: Lattice, Université Sorbonne Nouvelle (project lead)
  • Olga Seminck: Lattice, CNRS
  • Philippe Gambette: Université Gustave Eiffel

Publications

  • CIDRE corpus: GitHub and Zenodo
  • Seminck, Olga, Philippe Gambette, Dominique Legallois, and Thierry Poibeau. The Corpus for Idiolectal Research (CIDRE). Accepted at the Journal of Open Humanities Data.
  • Seminck, Olga, Philippe Gambette, Dominique Legallois, and Thierry Poibeau. The Corpus for Idiolectal Research (CIDRE). Accepted at the 2021 Conference of the European Association for Digital Humanities (EADH).
  • Gambette, Philippe, Olga Seminck, Dominique Legallois, and Thierry Poibeau. Using and Evaluating Hierarchical Clustering Methods for Corpora with Chronological Order. Accepted at the 2021 Conference of the European Association for Digital Humanities (EADH).

Seminar

L’évolution de l’idiolecte (The evolution of the idiolect), Lattice Digital Humanities reading group, 2021