The Idiolecte project aims to model the evolution of personal writing style.
The way individuals express themselves is unique, yet it changes over their lifetime. However, quantitative studies of idiolectal evolution are rare, mainly because large corpora are lacking. This project addresses that gap. We first collected a corpus, the Corpus for Idiolectal Research (CIDRE). It contains dated French novels by eleven prolific 19th-century writers (37 million words in total).
Second, we analyze the data quantitatively to answer the following question: how can the diachronic evolution of the idiolect be characterized? By means of hierarchical clustering and simple linear regression, we show that the evolution of the idiolect is monotonic in the mathematical sense. This property in turn enables us to propose a machine learning task: predicting the year in which a work was written. For the majority of the authors in our corpus, the accuracy is very high.
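As a rough illustration of how a monotonic trend makes year prediction possible, the sketch below fits a simple linear regression of publication year on the relative frequency of a single pattern, then uses the fitted line to date an unseen work. The function names, the single-feature setup, and all numbers are assumptions made for this example. They are not taken from CIDRE and are not the project's actual pipeline.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented toy values, NOT taken from CIDRE: relative frequency of one
# hypothetical pattern in six dated works by a single author.
pattern_freq = [0.010, 0.012, 0.015, 0.016, 0.019, 0.021]
years = [1830, 1835, 1841, 1846, 1852, 1858]

intercept, slope = fit_line(pattern_freq, years)
fit_quality = r_squared(pattern_freq, years, intercept, slope)

# Estimate the year of an undated work from its pattern frequency.
predicted_year = intercept + slope * 0.017
print(round(predicted_year), round(fit_quality, 3))
```

Because the toy frequencies rise steadily with time, the fitted line maps a frequency back to a plausible year. A non-monotonic feature would give an ambiguous mapping, which is why the monotonicity result matters for this task.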
After applying a feature selection algorithm, we can examine the most important features: the linguistic structures that most strongly influence the diachronic evolution of the idiolect. We find that some of these features are stylistic and have previously been noted in qualitative literary studies. In a future series of experiments, we would like to address the question of how much personal language change is affected by collective language change.
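The abstract does not name the feature selection algorithm, so the following is only a stand-in: a simple filter method that ranks features by the absolute Pearson correlation between their frequency and the year of writing. The feature labels, the values, and the helper names `pearson` and `rank_features` are hypothetical and serve only to show the general idea of keeping the features that track time most closely.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_features(feature_table, years, top_k):
    """Return the top_k feature names, ordered by |correlation with year|."""
    scores = {name: abs(pearson(values, years))
              for name, values in feature_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Invented toy data, NOT from CIDRE: relative frequencies of three
# hypothetical patterns across six dated works by one author.
years = [1830, 1835, 1841, 1846, 1852, 1858]
feature_table = {
    "DET NOUN ADJ": [0.010, 0.012, 0.015, 0.016, 0.019, 0.021],  # rising
    "et puis": [0.009, 0.008, 0.008, 0.006, 0.005, 0.004],       # falling
    "PRON VERB": [0.020, 0.018, 0.021, 0.019, 0.020, 0.019],     # flat
}

selected = rank_features(feature_table, years, top_k=2)
print(selected)
```

Taking the absolute value keeps both rising and falling patterns, since either direction signals change over time. A wrapper or embedded method could be swapped in without changing the overall workflow.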
L’évolution de l’idiolecte, Lattice Groupe de lecture Humanités Numériques, 2021