CORPUS LINGUISTICS — 2006

V. Bobichev, T. Zidrashko

It is widely accepted that lemmatization is an important preprocessing step in many applications dealing with text, especially for highly inflected languages.
The paper presents a memory-based method of lemmatization that was evaluated on Romanian and Bulgarian texts. In proposing an additional lemmatization method, our aim was to obtain a word's lemma without any POS tag, dictionary, or other grammatical or morphological information about the word. The method is particularly useful for languages that lack large electronic dictionaries and morphological or syntactic tools. The proposed method uses memory-based learning, in which the choice of features influences the final result considerably. A certain number of final letters of the word were used as features, because a word form usually changes through its ending. Several types of feature sets were tested: one containing separate letters and another containing fragments of word endings. In the last experiment, the ending of the previous word was added to the feature set. The experiments were carried out on three sources: an electronic dictionary containing about 90,000 word forms; the MULTEXT-EAST corpus; and a Corpus of Supreme Court Decisions containing 410 morphologically annotated legal documents. The best accuracy, about 92%, was obtained using separate letters and the ending of the previous word as features. To demonstrate the method's applicability to other languages, we also tried it on Bulgarian texts and obtained about 86% correctly formed lemmas on a relatively small corpus.
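
The idea of lemmatizing from word endings alone can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1-nearest-neighbour classifier with a simple feature-overlap similarity (a common form of memory-based learning), encodes the class as a hypothetical "strip n letters, append a suffix" rule, and uses assumed values for the number of final letters (4) and the length of the previous word's ending (2). The English training triples are toy data chosen for readability, not taken from the paper's corpora.

```python
from collections import Counter

N_LETTERS = 4  # assumed number of final letters; the abstract does not state it


def letter_features(word, prev_word=""):
    """Separate final letters of the word plus the ending of the previous word."""
    tail = ("_" * N_LETTERS + word)[-N_LETTERS:]  # left-pad short words with "_"
    prev_end = ("__" + prev_word)[-2:]            # assumed 2-letter previous ending
    return tuple(tail) + (prev_end,)


def make_rule(form, lemma):
    """Class label (assumed encoding): (letters to strip, suffix to append)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])


def apply_rule(form, rule):
    strip, suffix = rule
    return form[:max(0, len(form) - strip)] + suffix


def train(examples):
    """Memory-based learning: store every training instance unchanged."""
    return [(letter_features(form, prev), make_rule(form, lemma))
            for prev, form, lemma in examples]


def lemmatize(memory, word, prev_word="", k=1):
    """Apply the rule voted for by the k stored instances most similar to the word."""
    x = letter_features(word, prev_word)

    def overlap(item):
        # Similarity = number of feature positions with identical values.
        return sum(a == b for a, b in zip(item[0], x))

    nearest = sorted(memory, key=overlap, reverse=True)[:k]
    votes = Counter(rule for _, rule in nearest)
    return apply_rule(word, votes.most_common(1)[0][0])


# Toy training data: (previous word, word form, lemma).
TOY_EXAMPLES = [("", "walked", "walk"), ("", "cats", "cat"), ("", "studies", "study")]
memory = train(TOY_EXAMPLES)
print(lemmatize(memory, "talked"), lemmatize(memory, "parties"))
```

Because the class is a transformation rule rather than a lemma string, the stored instances can generalize to word forms that never occurred in training, which is what makes the approach usable without a dictionary.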