Today I finished a short three-week research stay at the Polish Academy of Sciences in Krakow as part of the COST Action “Distant Reading for European Literary History”. My supervisor was Maciej Eder, one of the big names in stylometry, particularly famous for creating the R package “Stylo”. The researchers working at his institute are brilliant, hardworking and welcoming all at once. It was a very positive experience: I worked a lot, ate many pierogi and enjoyed great craft beer.
My aim for those three weeks was to assess how normalization and lemmatization of Middle High German sources influence stylometric analysis. I think this kind of study is needed for old languages if we want to rely on stylometry. Some time ago I used to hear people complain that computational analysis was almost impossible on sources with such a high degree of variation. Lately, I’ve heard many claims that variation and even errors in the sources do not have much of an effect on stylometric analysis. My experiments seem to suggest that it depends on what you are trying to do and on the level of variation you are considering. With paleographic transcriptions the results were pretty awful, but they improved considerably with very basic normalization. The task matters a lot too. Distinguishing poetry from prose, for example, worked quite well even on barely regularized data. But genre and authorship classification are more difficult, and they improve with better preprocessing of the sources.
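To give a rough idea of what “very basic normalization” can mean in practice, here is a minimal Python sketch. It is not the pipeline used in the experiments: the character table, the function names and the sample words are all assumptions made up for illustration. It lowercases the text, replaces a couple of historical letter forms, strips diacritics and superscript marks, and then computes the relative word frequencies that stylometric methods typically start from.

```python
import unicodedata
from collections import Counter

# Hypothetical replacement table: these mappings are illustrative assumptions,
# not the rules used in the experiments described above.
CHAR_MAP = str.maketrans({"ſ": "s", "ʒ": "z"})


def normalize(text: str) -> str:
    """Lowercase, map historical letter forms, and drop combining marks."""
    text = text.lower().translate(CHAR_MAP)
    # NFD splits a letter such as "ô" into "o" plus a combining circumflex,
    # so the combining mark can be filtered out afterwards.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def relative_frequencies(text: str, vocabulary: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


# Two spellings of the same words collapse to one form after normalization.
paleographic = normalize("Dô ſprach er")
regularized = normalize("do sprach er")
```

After this step both spellings yield the same tokens, so they contribute to the same frequency counts instead of being treated as different words. Lemmatization would go a step further and needs a language-specific tool, which this sketch does not attempt.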
In any case, I now have 30 days to compile the results of the experiments and submit a report on my work. I also hope to turn it into an article.