CORPUS LINGUISTICS — 2006

M. Waclawicova, M. Koprivova

We will talk about spoken corpora, the question of their representativeness, and the methods of obtaining source recordings together with the difficulties involved, illustrating everything with the example of building new spoken corpora of the Czech language. Many parameters (sociological, linguistic, technical, etc.) can be used to make a corpus appropriately representative. Which of them are the most important, and what do they affect? And in practice, which parameters are we able to keep well balanced while building our new corpora?

Since 2001, a series of recordings has been made for a corpus of spoken Czech. This time the aim is to acquire recordings from all parts of Bohemia. The main content is informal language. The dialogues take place mainly in private, in unofficial and informal situations, and their topics are not set beforehand. The speakers are characterized by their sex, age and education. In addition, further information is entered into a special database, such as the topics of the recordings, the linguistic origin of the speakers and the relationships between them. The recordings are transcribed using a special transcription based on ethnic transcription, which was modified for the purposes of computer processing in line with the conventions of the Czech National Corpus. Among other things, the transcription marks the cases in which common speech differs from standard pronunciation.