<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1207"> <Title>Statistically-Enhanced New Word Identification in a Rule-Based Chinese System</Title> <Section position="5" start_page="49" end_page="50" type="concl"> <SectionTitle> 4. Results and Discussion </SectionTitle> <Paragraph position="0"> 4.1. Increase in Parser Coverage The new word identification mechanism discussed above has been part of our system for about 10 months. To find out how much contribution it makes to our parser coverage, we took 176,863 sentences that had been parsed successfully with the new word mechanism turned on and parsed them again with the new word mechanism turned off. When we did this test at the beginning of these 10 months, 37640 of those sentences failed to get a parse when the mechanism was turned off. In other words, 21.3% of the sentences were &quot;saved&quot; by this mechanism. At the end of the 10 months, however, only 7749 of those sentences failed because of the removal of the mechanism. At first sight, this seems to indicate that the new word mechanism is doing a much less satisfactory job than before. What actually happened is that many of the words that were identified by the mechanism 10 months ago, especially those that occur frequently, have been added to our dictionary. In the past 10 months, we have been using this mechanism both as a component of robust parsing and as a method of lexical acquisition whereby new enwies are discovered from text corpora. This discovery procedure has helped us find many words that are found in none of the existing word lists we have access to.</Paragraph> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 4.2. Precision of Identification </SectionTitle> <Paragraph position="0"> Apart from its contribution to parser coverage, we can also evaluate the new word identification mechanism by looking at its precision. In our evaluation, we measured precision in two different ways.</Paragraph> <Paragraph position="1"> In the first measurement, we compared the number of new words that are proposed by the guessing mechanism and the number of words that end up in successful parses. If we use NWA to stand for the number of new words that are added to the word lattice and NWU for the number of new words that appear in a parse tree, the precision rate will be NWU / NWA. Actual testing shows that this rate is about 56%. This means that the word guessing mechanism has over-guessed and added about twice as many words as we need. This is not a real problem in our system, however, because the final decision is made in the parsing process. The lexical component is only responsible for providing a word lattice of which one of the paths is correct. In the second measurement, we had a native speaker of Chinese go over all the new words that end up in successful parses and see how many of them sound like real words to her. This is a fairly subjective test but nonetheless meaningful one.</Paragraph> <Paragraph position="2"> It turns out that about 85% of the new words that &quot;survived&quot; the parsing process are real words. We would also like to run a large-scale recall test on the mechanism, but found it to be impossible. To run such a test, we have to know how many unlisted new words actually exist in a corpus of texts. Since there is no automatic way of knowing it, we would have to let a human manually check the texts. 
</Section>
<Section position="2" start_page="49" end_page="50" type="sub_section">
<SectionTitle>4.3. Contributions of Other Components</SectionTitle>
<Paragraph position="0">While the results shown above give us some idea of how much the new word identification mechanism contributes to our system, it is actually very difficult to say precisely how much credit goes to this mechanism and how much to other components of the system. As we can see, the performance of this mechanism also depends on the following two factors: (1) The word segmentation processes prior to the application of this mechanism. These include dictionary lookup, derivational morphology, proper name identification and the assembly of other items such as times, dates, monetary units, addresses, phone numbers, etc. These processes also group characters into words. Any improvement in those components will also improve the performance of the new word mechanism.</Paragraph>
<Paragraph position="1">If every word that &quot;should&quot; be found by those processes has already been identified, the single-character sequences that remain after those processes will have a better chance of being real words.</Paragraph>
<Paragraph position="2">(2) The parsing process that follows. As mentioned earlier, the lexical component of our system does not make a final decision on &quot;wordhood&quot;. It provides a word lattice from which the syntactic parser is supposed to pick the correct path (see the sketch at the end of this section). In the case of new word identification, the word lattice will contain both the new words that are identified and all the words/characters that are subsumed by those new words. A new word proposed in the word lattice receives its official wordhood only when it becomes part of a successful parse. To recognize a new word correctly, the parser has to be smart enough to accept the good guesses and reject the bad ones. This ability will improve as the parser improves in general, and a better parser will yield better final results in new word identification.</Paragraph>
<Paragraph position="3">Generally speaking, the mechanisms using IWP and P(Cat, Pos, Len) provide the internal criteria for wordhood while word segmentation and parsing provide the external criteria. The internal criteria are statistically based whereas the external criteria are rule-based. Neither can do a good job without the other. The approach we take here is not to be considered statistical natural language processing, but it does show that a rule-based system can be enhanced by some statistics. The statistics we need can be extracted from a very small corpus and a dictionary, and they are not domain-dependent.</Paragraph>
<Paragraph position="4">We have benefited from the mechanism in the analysis of many different kinds of texts.</Paragraph>
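<Paragraph position="5">As an illustration of the word lattice described in (2) above, the following minimal Python sketch (hypothetical names and toy data, not our actual implementation) shows a lattice in which a proposed new word coexists with the characters it subsumes, and a simple stand-in for the parser selects one path:
# A word lattice over the string "ABCDE" (letters stand in for hanzi).
# Each edge maps a start offset to (end offset, word). The guesser has
# proposed "BCD" as a new word; the characters it subsumes remain as
# alternative edges, so the parser is free to reject the guess.
lattice = {
    0: [(1, "A"), (2, "AB")],
    1: [(2, "B"), (4, "BCD")],   # "BCD" is the proposed new word
    2: [(3, "C")],
    3: [(4, "D")],
    4: [(5, "E")],
}

def all_paths(lattice, start, end):
    # Enumerate every segmentation licensed by the lattice.
    if start == end:
        return [[]]
    paths = []
    for nxt, word in lattice.get(start, []):
        for rest in all_paths(lattice, nxt, end):
            paths.append([word] + rest)
    return paths

# Stand-in for the parser: the real system accepts whichever path yields
# a successful parse; here we simply prefer the segmentation with the
# fewest tokens, which in this example selects the proposed new word.
best = min(all_paths(lattice, 0, 5), key=len)
print(best)   # ['A', 'BCD', 'E']
</Paragraph>
</Section>
</Section>
</Paper>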