CORPUS LINGUISTICS — 2006

 
An Na
Idiomaticity is a common phenomenon in natural language. The meaning of an idiom cannot be derived from its structure alone, nor by applying standard lexical and syntactic rules. Chinese idioms, which cannot be given a compositional analysis, also differ from English idioms. As a result, idiom processing has become one of the difficulties of natural language processing. This paper tries to solve the problem that idioms pose for corpus annotation, based on a study of English and Chinese idioms.
Terminology in this field has always been problematic, and the problem has been discussed at length by many linguists (e.g. Nunberg et al. 1994). There is no generally agreed description of the idiom. Different terms are sometimes used to describe identical or very similar kinds of unit; at the same time, a single term may be used to denote very different phenomena. It is therefore essential to clarify the kinds of unit and phenomenon which I will be discussing.
Considering the structure and meaning of English idioms, we divide them into three kinds. The first kind is the phraseme, for example "because of", "according to" and "look into". These units consist of several words and play the role of simple words in sentences. The second kind is the pure idiom. These units usually consist of two or more words and express both a literal and a deeper meaning. The third kind comprises proverbs, sayings and similes, which are always independent sentences.
Chinese researchers agree on terminology: they all use the term «idiom». But they do not agree on the function and scope of the idiom. Wu Zhankun (1986) regarded idioms as the material of sentences, with a function equivalent to that of words. Liu Shuxin (1995) held that idioms should be divided into two groups: lexical units, whose function is equivalent to that of words, and non-lexical units, which usually appear as independent sentences.
As for scope, it is widely accepted that the category should include idioms, customary usages, two-part allegorical sayings, proverbs, aphorisms and so on. Only Zhou Jian (1994) held that it should also include proper nouns and special nouns. From this overview we can see that neither in China nor abroad is there a unified view of the idiom. Yet corpus annotation cannot avoid idioms. For the purposes of research on corpus annotation, therefore, "idiom" in this paper refers mainly to phrasemes and pure idioms in English and, in Chinese, to idioms, customary usages, two-part allegorical sayings, terms, and abbreviations that share the characteristics of idioms and customary usages.
In both Chinese and English, idioms form gradually over a long period of use; they are the distillate of a language. Because idioms are established by the whole community that uses them, they are also specific to a nation. English and Chinese idioms therefore show both similarities and differences.
Corpus annotation cannot avoid idiom tagging. Foreign corpora give idioms both part-of-speech (POS) tagging and semantic tagging. POS annotation of idioms in foreign corpora is complete, and it makes the grammatical function of English idioms easier to identify. But a POS analysis alone is not enough for idioms. To capture idiomatic meaning better, the British National Corpus also gives idioms semantic annotation. The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur 1981).
For the tagging of Chinese idioms, we mainly studied the Contemporary Chinese Corpus at Peking University and the corpus-processing standard issued by the Institute of Applied Linguistics of the Chinese Ministry of Education. These two POS tagging standards are the most widely used in corpus annotation in China, including idiom tagging. Under these standards, an idiom receives only a single annotation in which the lexical category and the POS annotation are combined, for example 'in' and 'ip'.
The standards of Peking University and of the Institute of Applied Linguistics of the Ministry of Education differ. The Peking University standard is more detailed: its tagset includes more POS annotations than that of the Institute of Applied Linguistics, for example 'n' (noun), 'v' (verb), 'a' (adjective), 'd' (adverb) and so on. The first Peking University Specification gave an idiom only a lexical tag: 'i' (idioms), 'l' (customary usages) or 'j' (abbreviations). The new Specification improved on the first: on top of the lexical category tag, it gives the idiom a POS tag. These combined tags, for example 'in', 'jv' and 'la', can be processed in the course of syntactic processing. But this method increases the complexity of the rules needed to describe idioms, and corpus processing is slower when this annotation method is applied.
To process idioms better in the Chinese Broadcast Media Language Corpora (CBMLC), I began with a dedicated study of Chinese idioms after tagging them in our corpus, focusing on their syntactic distributions. A database of the grammatical functions of idioms was then constructed. With this database it is easy to analyze and compare the usage of different idioms. I adopted a data-driven research paradigm, taken from the descriptive methodology that most earlier linguists followed. Since my goal is to set up a tagging manual for idioms so that their proper syntactic functions are reflected, this paradigm ensures a faithful report of the diverse usage of idioms. The outcome of my study is a detailed manual on idiom tagging and an annotated corpus of idioms with both syntactic functions and lexical information.
In the process of tagging idioms, a method of two annotations was applied. First, an idiom POS annotation was assigned automatically, such as 'v', 'a', 'n' or 'd'.
Second, an idiom lexical category annotation was assigned automatically, such as 'i' (idioms), 'j' (abbreviations), 'l' (customary usages) or 'gy' (two-part allegorical sayings).
Computational linguistics relies on annotated corpora as a knowledge source for natural language processing. An annotated corpus proves its value when researchers can extract information from it. Part-of-speech (POS) tagging is the foundational stage of corpus annotation. At present, Chinese corpora carry only POS tagging and syntactic tagging. To capture the meaning of Chinese idioms better, we can draw on the semantic tagging methods used in foreign corpora to annotate idiomatic meaning in Chinese.
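The relation between the combined tags ('in', 'jv', 'la') and the two separate annotations (a lexical category plus a POS) can be illustrated with a minimal Python sketch. It is not the tagging tool used for CBMLC. It assumes that a combined tag is simply a lexical category code followed by a POS code. The names `IdiomAnnotation` and `split_combined_tag` are hypothetical, and the combined form 'gyv' used in the last example is an assumption, since the text gives no combined tag for 'gy'.

```python
from dataclasses import dataclass

# Lexical category codes as listed in the text.
LEXICAL_CATEGORIES = {
    "i": "idiom",
    "j": "abbreviation",
    "l": "customary usage",
    "gy": "two-part allegorical saying",
}

# POS codes as listed in the text (only the four given as examples).
POS_TAGS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "d": "adverb",
}


@dataclass(frozen=True)
class IdiomAnnotation:
    """Hypothetical record holding the two annotations separately."""
    lexical_category: str
    pos: str


def split_combined_tag(tag: str) -> IdiomAnnotation:
    """Split a combined tag such as 'in' into category 'i' and POS 'n'.

    Assumption: a combined tag is a category code followed by a POS code.
    """
    # Try longer category codes first so that 'gy' is matched as a whole.
    for code in sorted(LEXICAL_CATEGORIES, key=len, reverse=True):
        if tag.startswith(code):
            pos = tag[len(code):]
            if pos in POS_TAGS:
                return IdiomAnnotation(lexical_category=code, pos=pos)
    raise ValueError(f"unrecognized combined tag: {tag!r}")


if __name__ == "__main__":
    for tag in ("in", "jv", "la", "gyv"):  # 'gyv' is a hypothetical form
        ann = split_combined_tag(tag)
        print(tag, "->", LEXICAL_CATEGORIES[ann.lexical_category],
              "/", POS_TAGS[ann.pos])
```

Keeping the two annotations in separate fields means a syntactic rule can test the POS alone, without enumerating every category and POS combination.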