<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1814"> <Title>Extracting Pronunciation-translated Names from Chinese Texts using Bootstrapping Approach</Title> <Section position="4" start_page="11" end_page="11" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We devise several tests to evaluate our extraction scheme with bootstrapping. We use the MET-2 test corpus for two of the tests, and PKUC as the basic language resource to support the process. We use PKUC to extract a common-word dictionary, which consists of about 37,000 words. We also use PKUC to extract and evaluate typical context cue words around person, location and organization names. Our experiments start from a &quot;seed&quot; = {&quot;g19475&quot;, &quot;g4584&quot;, &quot;g5064&quot;, &quot;g7043&quot;, &quot;g3534&quot;} and a set of 60 context cue words.</Paragraph> <Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.1 Obtaining P-Names from the Internet </SectionTitle> <Paragraph position="0"> We perform the bootstrapping process as discussed in Section 2 to extract P-Names from the Internet, and stop after about 650 iterations. We manually count the number of correct P-Names obtained at the end of every 65 iterations. We also use the first 100,000 P-Names found at the end of the bootstrapping process as the ground truth to compute the accuracy of P-Name identification.</Paragraph> <Paragraph position="1"> Figure 3 presents the results of the P-Name extraction process. From the figure, we can see that as the number of iterations increases, the number of P-Names obtained increases proportionally.</Paragraph> <Paragraph position="2"> This demonstrates that our bootstrapping process is consistent. 
We also observe that the system maintains an accuracy of over 85% even as the number of P-Names found approaches 100,000.</Paragraph> <Paragraph position="3"> This demonstrates that our method is effective.</Paragraph> </Section> <Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.2 Extracting P-Names from MET-2 set </SectionTitle> <Paragraph position="0"> We use the MET-2 test corpus to test the effectiveness of our approach in identifying P-Names in new texts, as discussed in Section 3. We consider a P-Name correctly extracted only when every one of its characters is correctly identified. The results show a recall of over 95% and a precision of close to 90%. These results are encouraging, as we did not use the training resource of the MET-2 corpus to train the system; using it would be expected to lead to higher accuracy.</Paragraph> <Paragraph position="1"> As a by-product of the PN-Finder, we obtained a large set of context words. We found that these context words can be used to correctly classify about 25% of the extracted P-Names in the MET-2 test set as person names or as parts of location or organization names, using the method described in Section 2.2. The main purpose of using context words to classify P-Names is to confirm more P-Names and P-Name characters.</Paragraph> </Section> <Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Contributions of PN-Finder to a Generic NE Recognition Module </SectionTitle> <Paragraph position="0"> The most important contribution of PN-Finder is that it can be used to improve the performance of a generic Chinese named entity recognizer, as discussed in Chua and Liu (2002). Here, we conducted several trials using the PN-Finder to extract different numbers of P-Names. 
We use the first 100,000 P-Names found by the PN-Finder, together with the pattern rules in the general named entity recognizer, to conduct a baseline test.</Paragraph> <Paragraph position="1"> This test merely performs direct table look-up to locate all possible P-Names. Table 2 lists the performance of the general NE recognition system when using an increasing number of P-Names found by the PN-Finder, together with the confidence statistics, the context words obtained from the current sets of P-Names, and the pattern rules. The results indicate that as the number of P-Names found by the PN-Finder increases, the performance of the general NE recognition system improves steadily until it reaches over 92% in average F-measure.</Paragraph> </Section> </Section> </Paper>