<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2002"> <Title>Dynamic Lexical Acquisition in Chinese Sentence Analysis</Title> <Section position="2" start_page="0" end_page="3" type="metho"> <SectionTitle> 1 Proposing words and attributes </SectionTitle> <Paragraph position="0"> Two major types of lexical information are acquired dynamically in our current Chinese system: new words and new grammatical attributes such as parts of speech (POS) and sub-categorization frames. The acquisition assumes the availability of an existing dictionary which is relatively mature though still incomplete in many ways. In our case, we have a lexicon of 88,000 entries, most of which carry grammatical attributes. Our assumption is that, once a dictionary has reached this scale, we should have enough information to predict the missing information in the context of sentence analysis.</Paragraph> <Paragraph position="1"> We can then stop hand-editing the static dictionary and let dynamic lexical acquisition take over.</Paragraph> <Paragraph position="2"> In most cases, the grammatical properties of a word define the syntactic context in which the word may appear. Therefore, it is often possible to detect the grammatical properties of a word by looking at its surrounding context in a sentence. In fact, this is one of the main criteria used by lexicographers, who often apply a conscious or subconscious contextual &quot;template&quot; for each grammatical property they assign. We have coded those templates in our system so that a computer can make similar judgments.</Paragraph> <Paragraph position="3"> When a word is found to fit into a template for a given property but we do not yet have that property in the dictionary, we can make a guess and propose to add it. Our current Chinese system has 29 such templates, 14 for detecting new words and 15 for detecting new grammatical attributes for new or existing words.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 1.1 Proposing new words </SectionTitle> <Paragraph position="0"> Two types of unlisted words exist in Chinese: (1) single-character bound morphemes used as words; (2) new combinations of characters used as words. An example of Type (1) is Kan. This is a bound morpheme in our dictionary, appearing only as part of words like Kan Da Shan (have a good chat). However, like many other bound morphemes in Chinese, it can occasionally be used as an independent word, as in the following sentence:

Ta Zai Wo Jia Kan Liao Liang Ge Xiao Shi
he at I home chat LE two CL hour
'He chatted for two hours at my house.'</Paragraph> <Paragraph position="1"> The usual response to this problem is to treat it as a lexical gap and edit the entry of Kan to make it a verb in the dictionary. This is undesirable for at least two reasons. First, many bound morphemes in Chinese can occasionally be used as words, and making all of them independent words would introduce a lot of noise into sentence analysis. Second, it would be a difficult task for lexicographers, not just because it takes time, but because lexicographers are often unable to make the decision unless they see sentences where a given bound morpheme is used as a word.</Paragraph> <Paragraph position="2"> In our system, we leave the existing dictionary untouched. Instead, we &quot;promote&quot; a bound morpheme to be a word dynamically when it appears in certain contextual templates. (Currently these templates are hand-coded heuristics based on linguists' intuition. We are planning to use machine learning techniques to acquire those templates automatically.)</Paragraph> <Paragraph position="3"> The template that promotes Kan to be a verb may include conditions such as:

* not subsumed by a longer word, such as Kan Kan Er Tan;
* being part of an existing multiple-character verb, such as Kan in Kan Kan Er Tan;
* followed by an aspect marker, such as Liao;
* etc.</Paragraph> <Paragraph position="4"> Currently we have 4 such templates, promoting morphemes to nouns, verbs, adjectives and adverbs respectively.</Paragraph>
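<Paragraph> To make the shape of such templates concrete, the following is a minimal sketch in Python of how the verb-promotion template above might be expressed. All names here (Token, the lexicon interface, the aspect-marker inventory) are hypothetical illustrations, not our actual implementation:

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical inventory of aspect markers (romanized), e.g. Liao (LE).
    ASPECT_MARKERS = {"Liao", "Zhe", "Guo"}

    @dataclass
    class Token:
        char: str                          # the single character, romanized here
        subsumed: bool                     # covered by a longer dictionary word?
        next_token: Optional["Token"] = None

    def promote_to_verb(token: Token, lexicon) -> bool:
        """Return True if this bound morpheme may be promoted to a verb."""
        # Condition 1: not subsumed by a longer word in this sentence.
        if token.subsumed:
            return False
        # Condition 2: the morpheme is part of an existing multiple-character
        # verb in the dictionary, evidence that it has verbal character.
        if not lexicon.occurs_in_multichar_verb(token.char):
            return False
        # Condition 3: followed by an aspect marker such as Liao.
        nxt = token.next_token
        return nxt is not None and nxt.char in ASPECT_MARKERS
</Paragraph>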
<Paragraph position="5"> Examples of Type (2) are found all the time, and adding them all to the existing dictionary would be a never-ending job. Here is an example:

Wu Xu Zhong Xin Qi Dong Jiu Ke Yi Jie Bo Huo Jie Bo
not need again start then can dock or undock
Bian Xi Dian Nao
easy-to-carry computer
'You can dock and undock your laptop without restarting.'</Paragraph> <Paragraph position="6"> Jie Bo (dock), Jie Bo (undock) and Bian Xi (easy-to-carry) are not entries in our dictionary. Instead of adding them to the dictionary, we use templates to recognize them online. The template that combines two individual characters to form a verb may include conditions such as:

* none of the characters is subsumed by a longer word;
* the joint probability of the characters being independent words in text is low;
* the internal structure of the new word conforms to the word formation rules of Chinese;
* the component characters have similar behavior in existing words;
* etc.</Paragraph> <Paragraph position="7"> The details can be found in Wu & Jiang (2000). Currently we have 10 such templates, which are capable of identifying nouns, verbs, adjectives and adverbs of various lengths.</Paragraph>
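<Paragraph> As a rough illustration of how such a combination template might be organized, here is a sketch in Python. The statistics interface, the helper predicates and the threshold are all invented for the illustration; the actual conditions are those described in Wu & Jiang (2000):

    import math

    def propose_two_char_verb(c1: str, c2: str, lexicon, stats,
                              threshold: float = -8.0):
        """Sketch: combine two adjacent free characters into a new verb.

        c1, c2 are single characters, neither subsumed by a longer
        dictionary word in the current sentence; stats is a hypothetical
        source of unigram word probabilities; threshold is arbitrary.
        """
        # The joint probability of the two characters being independent
        # words must be low, or the two-word reading is preferred.
        joint = math.log(stats.p_word(c1)) + math.log(stats.p_word(c2))
        if joint > threshold:
            return None
        # The internal structure must conform to Chinese word formation
        # rules (hypothetical helper, e.g. checking compound patterns).
        if not lexicon.valid_verb_formation(c1, c2):
            return None
        # The component characters should behave similarly in existing words.
        if not lexicon.similar_behavior(c1, c2):
            return None
        return c1 + c2   # the proposed verb, to be added to the chart
</Paragraph>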
</Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 1.2 Proposing grammatical attributes </SectionTitle> <Paragraph position="0"> POS and sub-categorization information is crucial for the success of sentence analysis. However, there is no guarantee that every word in the existing dictionary has the correct POS and sub-categorization information. Besides, words can behave differently in different domains or develop new properties over time. Take the Chinese word Tong Bu (synchronize) for example. It is an intransitive verb in our dictionary, but it is now often used as a transitive verb, especially in the computer domain. For instance:

MADC Ke Fang Bian Di Tong Bu Exchange Zhang Hu
MADC can easily DE synchronize Exchange account
'MADC (Microsoft Active Directory Connector) can easily synchronize Exchange accounts.'</Paragraph> <Paragraph position="1"> We may want to change the existing dictionary to make words like Tong Bu transitive verbs, but that may not be appropriate lexicographically, at least in the general domain, not to mention the human labor involved in such an undertaking. However, the sentence above cannot get a spanning parse unless Tong Bu is a transitive verb. To overcome this difficulty, our system can dynamically create a transitive verb in certain contexts. An obvious context would be &quot;followed by an NP&quot;, for example. This way we are able to parse the sentence without changing the dictionary.</Paragraph> <Paragraph position="2"> A similar approach is taken in cases where a word is used in a part of speech other than the one(s) specified in the dictionary. In the following sentence, for example, the noun Qun Ji (cluster) is used as a verb instead:

Ni Ke Yi Qun Ji 32 Tai Fu Wu Qi
you can cluster 32 CL server
'You can cluster 32 servers.'</Paragraph> <Paragraph position="3"> Rather than edit the dictionary to permanently add the verb POS to nouns like Qun Ji, we turn them into verbs dynamically during sentence analysis if they fit into the verb template. The conditions in the verb template are similar in spirit to those described above (e.g. being followed by an NP). Such templates are in effect very similar to POS taggers, though we use them exclusively to create new POS rather than choosing among existing POS.</Paragraph> </Section> </Section> <Section position="3" start_page="1" end_page="3" type="metho"> <SectionTitle> 2 Harvesting new words and attributes </SectionTitle> <Paragraph position="0"> The proposing of new words and attributes described in the previous section yields only intelligent guesses, which can sometimes be wrong. For example, although transitive verbs tend to be followed by NPs, not all verbs that precede NPs are transitive verbs. To make sure that (1) the wrong guesses do not introduce too much noise into the analysis and (2) only the correct guesses are accepted as true lexical information, we take the following steps to filter out the errors that result from over-guessing.</Paragraph> <Section position="1" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 2.1 Set up the competition </SectionTitle> <Paragraph position="0"> The proposed words and attributes are assigned lower probability in our system. This is straightforward for new words: we simply assign them low scores when we add them (as new terminal nodes) to the parsing chart. (See Jensen et al. (1993) and Heidorn (2000) for a general description of how chart parsing works in our system. A Chinese-specific description of the system can be found in Wu & Jiang (1998).) For new attributes on existing words, we make a new node which is a copy of the original node and assign the new attributes and a lower probability to this node. As a result, the chart will contain two nodes for the same word, one with the new attributes and one without. The overall effect is that the newly proposed nodes will compete with other nodes to get into a parse, though at a disadvantage. The sub-trees built with the new nodes will have lower scores and will not be in the preferred analysis unless there is no other way to get a spanning parse. Therefore, if the guesses are wrong and the sentence can be successfully parsed without the additional nodes, the best parse (the parse with the highest score) will not contain those nodes and the guesses are effectively ignored. On the other hand, if the guesses are right and we cannot get any successful parse without them, then they will end up in the top parse in spite of their low probability.</Paragraph> <Paragraph position="1"> Our system can produce more than one parse for a given sentence; the top parse is the one with the highest score.</Paragraph>
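<Paragraph> A minimal sketch of this competition setup in Python, assuming a hypothetical chart API and an invented penalty constant (the actual probability model is not spelled out here):

    import copy

    GUESS_PENALTY = 0.1   # invented constant: guessed nodes start handicapped

    def add_guessed_node(chart, node, new_attrs):
        """Add a competing copy of `node` that carries guessed attributes.

        The copy gets a lowered score, so sub-trees built from it lose to
        sub-trees built from ordinary nodes unless the guess is the only
        way to reach a spanning parse.
        """
        guess = copy.copy(node)
        guess.attrs = {**node.attrs, **new_attrs}   # e.g. {"transitive": True}
        guess.score = node.score * GUESS_PENALTY
        guess.is_guess = True                       # remembered for harvesting
        chart.add(guess)   # the original node stays in the chart and competes
</Paragraph>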
</Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.2 Keep the winners </SectionTitle> <Paragraph position="0"> For each sentence, we pick the top parse and check it to see whether it contains any terminal nodes that are new words or nodes carrying new attributes. If so, we know that these nodes are necessary at least to make the current sentence analyzable. The fact that they are able to beat their competitors despite their disadvantage suggests that they probably represent lexical information that is missing from the existing dictionary. We therefore collect such information and store it away in a separate lexicon. This auxiliary lexicon contains entries for the new words and the new attributes of existing words. Each entry in this lexicon carries a frequency count which records the number of times a given new word or new attribute has appeared in good parses during the processing of certain texts. The content of this lexicon depends on the corpora, of course, and different lexicons can be built for different domains. When processing future sentences, the entries in those lexicons can be dynamically merged with the entries in the main lexicon, so that we do not have to make the same guesses again.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.3 Use the fittest </SectionTitle> <Paragraph position="0"> The information lexicalized in those auxiliary lexicons, though good in general, is not guaranteed to be correct. Being necessary for a successful parse is strong evidence for the validity of such information, but it is not a sufficient condition for its correctness. Consequently, there can be some noise in those lexicons.</Paragraph> <Paragraph position="1"> However, a real linguistic property is likely to be found consistently, whereas mistakes tend to be random. To prevent the use of wrongly lexicalized entries, we may require a frequency threshold during the merging process: only those entries that have been encountered more than n times in the corpora are allowed to be merged with the main lexicon and used in future analysis.</Paragraph> <Paragraph position="2"> If a given new word or linguistic property is found to occur repeatedly across different domains, we may even consider physically merging it into the main dictionary, as it may be a piece of information that is worth adding permanently.</Paragraph>
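<Paragraph> The bookkeeping in Sections 2.2 and 2.3 amounts to counting survivors and merging the frequent ones. A sketch in Python, with invented interfaces (top_parse.terminals, main_lexicon.add) and an arbitrary threshold value:

    from collections import Counter

    aux_lexicon = Counter()   # (word, attribute) -> count of good parses

    def harvest(top_parse):
        """Collect guessed words/attributes that survived into the top parse."""
        for leaf in top_parse.terminals():
            if getattr(leaf, "is_guess", False):
                aux_lexicon[(leaf.word, leaf.guessed_attr)] += 1

    def merge_into_main(main_lexicon, n=3):
        """Merge entries seen more than n times (n=3 is an invented value;
        the paper leaves the threshold open)."""
        for (word, attr), count in aux_lexicon.items():
            if count > n:
                main_lexicon.add(word, attr)
</Paragraph>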
</Section> </Section> <Section position="4" start_page="3" end_page="5" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> The system described above has been evaluated in terms of the contribution it makes to parsing.</Paragraph> <Paragraph position="1"> The corpus parsed in the evaluation consists of 121,863 sentences from Microsoft technical manuals. The choice is based on the consideration that this is a typical domain-specific text with many unlisted words and many novel usages of words. (The novel usages are mainly due to the fact that the text is translated from English.)</Paragraph> <Paragraph position="2"> To tease apart the effects of online guessing and lexicalization, we did two separate tests, one with online guessing only and one with lexicalization as well. When lexicalization is switched on, the new words and attributes that are stored in the auxiliary lexicon are used in subsequent processing. Once a new word or attribute has been recognized in n sentences, it acts as if it were an entry in the main dictionary and can be used in the analysis of any other sentence with normal probability.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.1 Online guessing only </SectionTitle> <Paragraph position="0"> In this test, we parsed the corpus twice, once with guessing and once without. Then we picked out all the sentences that had different analyses in the two passes and compared their parses to see whether they became better when lexical guessing was on.</Paragraph> <Paragraph position="1"> Since comparing the parses requires human inspection and is therefore very time-consuming, we randomly selected 10,000 sentences out of the 121,863 and used only those sentences in the test.</Paragraph> <Paragraph position="2"> It turns out that 1,459 of those 10,000 sentences got different parses when lexical guessing was switched on. Human comparison of those differences shows that, of the 1,459, the guessing made 1,153 better, made 82 worse, and left 224 the same (different parses but equally good or bad). The net gain is 1,071. In other words, 10.71% of the sentences became better when lexical guessing is used.</Paragraph> <Paragraph position="3"> More detailed analysis shows that 48% of the improvements are due to the recognition of new words and 52% to the addition of new grammatical attributes. Of the 82 sentences that became worse, 6 failed because of a lack of storage during processing, caused by the additional resources required by the guessing algorithm. The rest are due to over-guessing, or more precisely, the failure to rule out the over-guesses in sentence analysis. The guessing component is designed to over-guess, since the goal there is recall rather than precision. Precision is achieved by the filtering effect of the parser.</Paragraph>
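<Paragraph> The arithmetic behind these figures, spelled out as a small Python check (the numbers are those reported above):

    differing = 1459                 # sentences parsed differently, of 10,000
    better, worse, same = 1153, 82, 224
    assert better + worse + same == differing

    net_gain = better - worse        # 1,071
    print(net_gain / 10000)          # 0.1071, i.e. 10.71% of sentences improved
</Paragraph>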
</Section> <Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 3.2 Additional gain with lexicalization </SectionTitle> <Paragraph position="0"> In this second test, we evaluated the effect of lexicalization on new word recognition. (We would like to look at the effect on grammatical attributes as well, but the evaluation there is not as straightforward and is much more time-consuming.) We parsed all 121,863 sentences twice, once with lexicalization and once without. The number of unique new words recognized in this corpus is 922. (The total number of unique words used in this corpus is 17,110, so at least 5% of the words are missing from the original dictionary.) Notice that this number does not change between the two passes: using the lexicon created by dynamic lexicalization increases the number of instances of those words being recognized, but does not change the number of unique words, since the entries in the auxiliary lexicon can also be recognized online. However, the numbers of instances are different in the two cases. When lexicalization is turned off, we are able to get 5,963 instances of those 922 new words in 5,239 sentences. When lexicalization is on, however, we are able to get 6,464 instances in 5,608 sentences. In other words, we increase the recognition rate by 8.4% and potentially save 369 additional sentences in parsing. The reason for this improvement is that, without lexicalization, we may fail to identify the new words in certain sentences because there are not enough good contexts in those sentences for the identification. Once those words are lexicalized, we no longer have to depend on context-based guessing, and those sentences can benefit from what we have learned from other sentences. Here is a concrete example for illustration:

Ta Zhang Wo Liao Jie Bo Bian Xi Dian Nao De Ji Zhu
he master LE undock laptop DE technology
'He mastered the technology of undocking a laptop.'</Paragraph> <Paragraph position="1"> In this sentence, we do not have enough context to identify the new word Jie Bo, because Liao Jie is a word in Chinese (remember that there are no spaces between words in Chinese!). This violates the condition that none of the characters in the new word should be subsumed by a longer word. However, if Jie Bo has been recognized in some other sentence, such as the one we saw in Section 1.1, and has been lexicalized, we can simply look up this word in the dictionary and use it right away. In short, lexicalization enables what is learned locally to be available globally.</Paragraph> </Section> </Section> <Section position="5" start_page="5" end_page="6" type="metho"> <SectionTitle> Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we have demonstrated a mechanism for dynamic dictionary update. This method reduces the human effort involved in dictionary maintenance and facilitates domain switching in sentence analysis. Evaluation shows that this mechanism makes a significant contribution to parsing, especially the parsing of large, domain-specific corpora.</Paragraph> </Section> </Paper>