<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1142"> <Title>Learning Transliteration Lexicons from the Web</Title>
<Section position="7" start_page="1133" end_page="1135" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> We first construct a development corpus by crawling webpages. This corpus, called SET1 (Kuo et al., 2005), consists of about 500 MB of webpages. Out of 80,094 qualified sentences, 8,898 DQTPs are manually extracted from SET1 and serve as the gold standard in testing. To establish a baseline system, we first train a PSM on all 8,898 DQTPs in a supervised manner and conduct a closed test on SET1, as reported in Table 1. We further implement three PSM learning strategies and conduct a systematic series of experiments.</Paragraph>
<Section position="1" start_page="1133" end_page="1133" type="sub_section"> <SectionTitle> 5.1 Unsupervised Learning </SectionTitle>
<Paragraph position="0"> We follow the formulation described in Section 4.2. First, we derive an initial PSM from 100 randomly selected seed DQTPs and simulate the Web-based learning process on SET1: (i) select high F-rank and high C-rank E-C pairs using the PSM, (ii) add the selected E-C pairs to the DQTP pool as if they were true DQTPs, and (iii) re-estimate the PSM using the updated DQTP pool.</Paragraph>
<Paragraph position="1"> In Figure 2, we report the F-measure over iterations. The U_HF curve reflects the learning progress when using E-C pairs that occur more than once in the SET1 corpus (high F-rank). The U_HF_HR curve reflects the learning progress when using the subset of U_HF pairs that also have high posterior odds as defined in Eq. (6).</Paragraph>
<Paragraph position="2"> Both selection strategies aim to select E-C pairs that are as genuine as possible.</Paragraph>
<Paragraph position="3"> (Figure 2: unsupervised learning on SET1.)</Paragraph>
<Paragraph position="4"> We find that U_HF and U_HF_HR give similar results in terms of F-measure. Not surprisingly, more iterations do not always lead to better performance, because unsupervised learning does not aim to acquire new knowledge over iterations. Nevertheless, unsupervised learning substantially improves the initial PSM in the first iteration, so it can serve as an effective PSM adaptation method.</Paragraph>
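<Paragraph position="5"> To make the learning cycle concrete, the following is a minimal Python sketch of the select / augment / re-estimate loop described above (the U_HF_HR variant). It is an illustration only, not the implementation used in the experiments: train_psm, extract_candidates and posterior_odds are hypothetical placeholders for PSM estimation, E-C candidate extraction and the posterior odds of Eq. (6), and the thresholds are illustrative rather than the settings used here.

    from collections import Counter

    def learn_psm_unsupervised(seed_dqtps, sentences, train_psm,
                               extract_candidates, posterior_odds,
                               iterations=5, min_freq=2, odds_threshold=1.0):
        """Iterative PSM learning; E-C pairs are (English, Chinese) tuples and
        the three helper callables are supplied by the caller."""
        dqtp_pool = list(seed_dqtps)
        psm = train_psm(dqtp_pool)  # initial PSM trained on the seed DQTPs

        for _ in range(iterations):
            # (i) extract E-C candidate pairs and rank them with the current PSM
            candidates = extract_candidates(sentences, psm)
            freq = Counter(candidates)
            selected = [pair for pair, n in freq.items()
                        if n >= min_freq  # high F-rank: occurs more than once
                        and posterior_odds(pair, psm) >= odds_threshold]  # high C-rank

            # (ii) add the selected pairs to the DQTP pool as if they were true DQTPs
            dqtp_pool.extend(selected)

            # (iii) re-estimate the PSM on the updated pool
            psm = train_psm(dqtp_pool)

        return psm

The active learning loop of Section 5.2 shares this skeleton but replaces step (i) with the selection of low F-rank, low C-rank or PSM-disagreement candidates and inserts a manual labeling step before the pool is updated.</Paragraph>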
</Section>
<Section position="2" start_page="1133" end_page="1134" type="sub_section"> <SectionTitle> 5.2 Active Learning </SectionTitle>
<Paragraph position="0"> The objective of active learning is to minimize human supervision by automatically selecting the most informative samples to be labeled; in effect, it maximizes the performance improvement obtained from a minimum of annotation effort. As in unsupervised learning, we start with the same 100 seed DQTPs and an initial PSM and carry out experiments on SET1: (i) select E-C pairs that have low F-rank and low C-rank, or on which GSA-PSM and PSA-PSM disagree; (ii) label the selected pairs by removing the non-E-C pairs and add the labeled E-C pairs to the DQTP pool; and (iii) re-estimate the PSM using the updated DQTP pool.</Paragraph>
<Paragraph position="1"> To select the samples, we employ three different strategies: A_LF_LR, where we select only low F-rank and low C-rank candidates for labeling; A_DIFF, where we select only the candidates on which GSA-PSM and PSA-PSM disagree; and A_DIFF_LF_LR, the union of the A_LF_LR and A_DIFF selections. As shown in Figure 3, the F-measures of A_DIFF (0.729) and A_DIFF_LF_LR (0.731) approach that of supervised learning (0.735) after four iterations. (Figure 3: active learning on SET1.)</Paragraph>
<Paragraph position="2"> While achieving performance almost identical to that of supervised learning, the active learning approach greatly reduces the number of samples that need manual labeling, as reported in Table 2. Among the three strategies, A_DIFF is the most effective at reaching the performance of supervised learning: it reduces the labeling effort by 89.0% relative to the 80,094 samples over the six iterations shown in Figure 3.</Paragraph>
</Section>
<Section position="3" start_page="1134" end_page="1134" type="sub_section"> <SectionTitle> 5.3 Active Unsupervised Learning </SectionTitle>
<Paragraph position="0"> It is also interesting to study the combination of unsupervised and active learning. The experiment is similar to that of active learning except that, in step (iii), we also take the unlabeled high-confidence candidates (high F-rank and high C-rank, as in U_HF_HR of Section 5.1) as true labeled samples and add them to the DQTP pool. The result is shown in Figure 4. Although active unsupervised learning has been reported to give promising results in some NLP tasks (Riccardi and Hakkani-Tur, 2003), it is not as effective as active learning alone in this experiment, probably because the unlabeled high-confidence candidates are still too noisy to be informative.</Paragraph>
</Section>
<Section position="4" start_page="1134" end_page="1135" type="sub_section"> <SectionTitle> 5.4 Learning Transliteration Lexicons </SectionTitle>
<Paragraph position="0"> The ultimate objective of building a PSM is to extract a transliteration lexicon from the Web by iteratively submitting queries and harvesting new transliteration pairs from the returned results until no more new pairs are found. For example, by submitting &quot;Robert&quot; to search engines, we may get &quot;Robert-Luo Bo Te&quot;, &quot;Richard-Li Cha&quot; and &quot;Charles-Cha Er Si&quot; in return. In this way, new queries can be generated iteratively and new pairs discovered. We pick the best-performing SET1-derived PSM, trained with the A_DIFF_LF_LR active learning strategy, and test it on a new database, SET2, which is obtained in the same way as SET1.</Paragraph>
<Paragraph position="2"> SET2 contains 67,944 Web pages amounting to 3.17 GB, from which we extract 2,122,026 qualified sentences. Using the PSM, we extract 137,711 distinct E-C pairs. As the gold standard for SET2 is unavailable, we randomly select 1,000 pairs for manual checking and find a precision of 0.777; at this rate, about 107,001 DQTPs can be expected. We further carry out one iteration of unsupervised learning with U_HF_HR to adapt the SET1-derived PSM towards SET2. The results before and after adaptation are reported in Table 3. As in the experiment of Section 5.1, unsupervised learning significantly improves the precision of the PSM.</Paragraph>
</Section> </Section> </Paper>