Idiolecte, modelling the evolution of a personal writing style

Olga Seminck (postdoctoral researcher, CNRS)

Published on 13 July 2021, updated on 17 July 2021

The Idiolecte project aims to model how a personal writing style evolves over time.

The way individuals express themselves is unique, but it changes over their lifetime. Quantitative studies of idiolectal evolution are nevertheless rare, mainly because large corpora are lacking. This project addresses that gap. First, we collected a corpus: the Corpus for Idiolectal Research (CIDRE). It contains dated French novels by eleven prolific 19th-century writers, for a total of 37 million words.

Second, we analyze the data quantitatively to answer the following question: how can the diachronic evolution of the idiolect be characterized? By means of hierarchical clustering and simple linear regression, we show that the evolution of the idiolect is monotonic in the mathematical sense. This property in turn enables us to propose a machine learning task: predicting the year in which a work was written. For the majority of the authors in our corpus, the accuracy is very high.
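As a rough illustration of the year-prediction task, the sketch below fits a simple linear regression that maps the frequency of one linguistic pattern to a year of writing. It is not the project's actual pipeline: the function names `fit_line` and `predict_year`, the single-feature setup, and the toy numbers are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_year(feature_value: float, slope: float, intercept: float) -> float:
    """Predict the year of writing from a single feature value."""
    return slope * feature_value + intercept


# Hypothetical toy data: relative frequency of one pattern (per 1,000 words)
# in five dated novels by the same author. These values are invented.
frequencies = [2.0, 2.5, 3.0, 3.5, 4.0]
years = [1830, 1840, 1850, 1860, 1870]

slope, intercept = fit_line(frequencies, years)
predicted = predict_year(3.2, slope, intercept)
print(f"Predicted year for an undated work: {predicted:.0f}")
```

A monotonic trend is what makes such a regression usable: if a feature rose and then fell over an author's career, one frequency value would correspond to several possible years.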

After applying a feature selection algorithm, we can examine the most important features: the language structures that have the greatest influence on the diachronic evolution of the idiolect. We find that some of these features are stylistic and have previously been noted in qualitative literary studies. In a future series of experiments, we would like to address the question of how much personal language change is affected by collective language change.
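The feature selection algorithm is not specified here. As a simple stand-in, the sketch below ranks features by the absolute Pearson correlation between their frequency and the year of writing. The function names, the feature names `pattern_a` to `pattern_c`, and the values are hypothetical and serve only to show the idea.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(features: Dict[str, Sequence[float]],
                  years: Sequence[float]) -> List[str]:
    """Order feature names from most to least correlated with time."""
    scores = {name: abs(pearson(values, years))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: frequencies of three patterns in five dated novels.
years = [1830, 1840, 1850, 1860, 1870]
features = {
    "pattern_a": [2.0, 2.5, 3.0, 3.5, 4.0],  # rises steadily over time
    "pattern_b": [5.0, 4.0, 4.5, 3.0, 3.5],  # declines, with some noise
    "pattern_c": [1.0, 3.0, 1.0, 3.0, 1.0],  # no trend over time
}

ranking = rank_features(features, years)
print(ranking)
```

Taking the absolute value treats patterns that decline over time as being as informative as patterns that rise.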

Team
Thierry Poibeau: Lattice, CNRS (project lead)
Dominique Legallois: Lattice, Université Sorbonne Nouvelle (project lead)
Olga Seminck: Lattice, CNRS
Philippe Gambette: Université Gustave Eiffel

Publications

CIDRE corpus: GitHub and Zenodo
Seminck, Olga, Philippe Gambette, Dominique Legallois, Thierry Poibeau. Accepted for publication in the Journal of Open Humanities Data. The Corpus for Idiolectal Research (CIDRE)
Seminck, Olga, Philippe Gambette, Dominique Legallois, Thierry Poibeau. Accepted at the 2021 Conference of the European Association for Digital Humanities (EADH). The Corpus for Idiolectal Research (CIDRE)
Gambette, Philippe, Olga Seminck, Dominique Legallois, Thierry Poibeau. Accepted at the 2021 Conference of the European Association for Digital Humanities (EADH). Using and Evaluating Hierarchical Clustering Methods for Corpora with Chronological Order