CORPUS LINGUISTICS — 2006

R. Meyer

Historical linguistics can rely only on corpus data, not on other sources of evidence, and diachronic corpora have therefore been a focus of attention since the beginnings of modern corpus linguistics (cf., e.g., the Penn-Helsinki Corpus of Middle English). But corpora are usually built with a particular purpose in mind. Historical and diachronic corpora, for the most part, serve the aim of preserving and publishing rare and valuable sources. They pose problems such as the proper encoding of the graphical detail of a manuscript, the appropriate rendition of rare character forms, and the systematic annotation of important textological information, e.g. the redaction history. In the case of the Regensburg diachronic corpus of Russian, our focus is somewhat different: we are building a corpus specifically for corpus linguistic research into the historical development of the language. The queries we wish to pose are thus similar to typical queries on synchronic linguistic corpora: statistical distributions of word forms and/or lemmas across texts, (sequences of) part-of-speech tags, collocations, etc. To this end, we try to reduce the burden of a document preservation approach by relying on standard editions of the most important texts, selecting text samples according to an explicit design scheme, and investing most of our work in the linguistic annotation (standardisation, lemmatisation, part-of-speech tagging). From a diachronic linguistic point of view, it is of primary importance to develop a base of comparison for texts from different periods, for example by introducing so-called "hyperlemmas", which remain constant throughout lexical changes, or diachronically adequate part-of-speech tagsets. Other aspects of annotation, including graphical detail and philological apparatus, may receive less attention.
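The role of hyperlemmas as a base of comparison can be illustrated with a minimal sketch. It is not taken from the Regensburg corpus: the record layout, the hyperlemma label `GOROD`, the text identifiers and the sample tokens are all invented for illustration. The sketch counts tokens per century, first by lemma and then by hyperlemma.

```python
from collections import Counter, defaultdict

# Invented sample records (an assumption, not corpus data):
# (text_id, century, word form, lemma, hyperlemma, part of speech)
TOKENS = [
    ("text_a", 12, "градъ", "градъ", "GOROD", "N"),
    ("text_a", 12, "городъ", "городъ", "GOROD", "N"),
    ("text_b", 17, "город", "город", "GOROD", "N"),
    ("text_b", 17, "города", "город", "GOROD", "N"),
]

# Position of each annotation layer inside a record (hypothetical layout).
FIELDS = {"form": 2, "lemma": 3, "hyperlemma": 4, "pos": 5}


def counts_by_century(tokens, field):
    """Return {century: Counter} of token counts for one annotation layer."""
    idx = FIELDS[field]
    table = defaultdict(Counter)
    for tok in tokens:
        century = tok[1]
        table[century][tok[idx]] += 1
    return table


by_lemma = counts_by_century(TOKENS, "lemma")
by_hyperlemma = counts_by_century(TOKENS, "hyperlemma")
```

In this toy data the lemma counts for the two centuries share no keys, so they cannot be compared directly, whereas the hyperlemma counts use the same key in both periods.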
On the technical side, we have explored several approaches (native XML databases, the Stuttgart Corpus Workbench, Bonito) and finally selected the ACT tool (Ribarov et al. 2004), a software suite designed specifically for historical manuscripts, with a comfortable corpus editor working on a relational database and an XML annotation scheme close to TEI standards. We query the database via a web interface which we have adapted from the original ACT sources to suit our needs. These include, e.g., KWIC concordancing, the precise description of the location of hits, and the selection and download of relevant results. Although work on the corpus is ongoing (it currently contains only about 400 000 forms), it may already serve to illustrate several interesting points: the implications of a specifically diachronic linguistic approach for the design of the corpus and the annotation scheme; the merits and possible improvements of the ACT software tool when applied to a corpus of longer texts (several hundred pages each); and the pros and cons of using a relational database for corpus preservation, rather than native XML-based or proprietary alternatives (eXist, Xaira). We illustrate the current state of the corpus with sample searches concerning the development of diathesis, i.e., in the domain of historical syntax.
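As a rough illustration of KWIC concordancing over a relational database, the following sketch uses Python's built-in `sqlite3` module. It does not reproduce the ACT schema or web interface: the `token` table, its columns, the text identifier `demo` and the sample sentence are hypothetical and serve only to show the principle of retrieving hits together with their location and context.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one invented sample sentence."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token ("
        "text_id TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
    )
    rows = [  # hypothetical data, not taken from the corpus
        ("demo", 1, "князь", "князь", "N"),
        ("demo", 2, "взял", "взять", "V"),
        ("demo", 3, "город", "город", "N"),
        ("demo", 4, "и", "и", "C"),
        ("demo", 5, "город", "город", "N"),
        ("demo", 6, "был", "быть", "V"),
        ("demo", 7, "взят", "взять", "V"),
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def kwic(conn, lemma, window=2):
    """Return (text_id, position, left context, keyword, right context)."""
    hits = conn.execute(
        "SELECT text_id, position, form FROM token "
        "WHERE lemma = ? ORDER BY text_id, position",
        (lemma,),
    ).fetchall()
    results = []
    for text_id, position, form in hits:
        # Fetch the neighbouring tokens of the same text within the window.
        context = conn.execute(
            "SELECT position, form FROM token "
            "WHERE text_id = ? AND position BETWEEN ? AND ? "
            "ORDER BY position",
            (text_id, position - window, position + window),
        ).fetchall()
        left = " ".join(f for p, f in context if p < position)
        right = " ".join(f for p, f in context if p > position)
        results.append((text_id, position, left, form, right))
    return results


concordance = kwic(build_demo_db(), "город")
```

Each result carries the text identifier and token position, so a hit can be located precisely; storing tokens as rows is one possible design, chosen here only because it makes such context queries straightforward in SQL.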