CORPUS LINGUISTICS — 2006

M. Kren SYN2000 and SYN2005 are both 100-million-word representative, balanced corpora of contemporary written Czech, covering two consecutive time periods. Despite this similarity, the two corpora differ in many respects: mainly in the notion of representativeness, but also in improved tokenization and segmentation of the text, as well as in its lemmatization and morphological tagging. The frequencies of individual words in the two corpora therefore cannot be compared directly without taking these differences into account. The paper describes how this problem has been solved by providing all users with specialized word lists for both corpora. In addition to regular frequencies, the word lists also contain normalized frequencies for each word, both for the corpus as a whole and for each of the main text types separately. The normalized frequencies minimize the influence of the differences mentioned above and are directly comparable, which makes it possible to study, for example, usage trends in the Czech lexicon or to compare the distribution of occurrences across text types. The lists are publicly available for download, so they can be analyzed not only manually, by inspecting individual items, but also by automatic processing.
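The abstract does not say how the normalized frequencies are computed. As a rough illustration only, the sketch below assumes one common convention: instances per million tokens, calculated separately for each text type. The function names, the text-type labels, and all counts and sizes are hypothetical and are not taken from SYN2000 or SYN2005.

```python
from typing import Dict


def per_million(count: int, total_tokens: int) -> float:
    """Relative frequency expressed as instances per million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * 1_000_000 / total_tokens


def normalize_by_text_type(
    counts: Dict[str, int], sizes: Dict[str, int]
) -> Dict[str, float]:
    """Normalize one word's raw counts against the size of each text type.

    A text type in which the word never occurs gets a frequency of 0.0.
    """
    return {
        text_type: per_million(counts.get(text_type, 0), size)
        for text_type, size in sizes.items()
    }


# Invented sizes (in tokens) of two text types in two hypothetical corpora.
# The corpora deliberately differ in composition.
sizes_a = {"fiction": 15_000_000, "journalism": 60_000_000}
sizes_b = {"fiction": 40_000_000, "journalism": 33_000_000}

# Invented raw counts of a single word in each text type.
counts_a = {"fiction": 300, "journalism": 3_000}
counts_b = {"fiction": 1_200, "journalism": 1_650}

norm_a = normalize_by_text_type(counts_a, sizes_a)
norm_b = normalize_by_text_type(counts_b, sizes_b)

for text_type in sizes_a:
    change = norm_b[text_type] - norm_a[text_type]
    print(f"{text_type}: {norm_a[text_type]:.1f} -> {norm_b[text_type]:.1f} "
          f"per million (change {change:+.1f})")
```

In this toy example the raw counts for the fiction text type rise fourfold, but the per-million figures show a much smaller increase because that text type is also larger in the second corpus. This is the kind of distortion that comparing normalized rather than raw frequencies is meant to avoid.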