<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1142">
  <Title>Learning Transliteration Lexicons from the Web</Title>
  <Section position="4" start_page="1129" end_page="1130" type="metho">
    <SectionTitle>
2
</SectionTitle>
    <Paragraph position="0"> A bilingual snippet refers to a Chinese-predominant text with embedded English appositives.</Paragraph>
    <Paragraph position="1">  example above, &amp;quot;Content to Community&amp;quot; is not a transliteration of C2C, but rather an acronym expansion, while &amp;quot;Ku Luo /Ku-Luo/&amp;quot;, as underlined, presents a transliteration for &amp;quot;Kuro&amp;quot;. What is important is that the E-C pairs are always closely collocated. Inspired by this observation, we propose an algorithm that searches over the close context of an English word in a bilingual snippet for the word's transliteration candidates.</Paragraph>
    <Paragraph position="2"> The contributions of this paper include: (i) an approach to harvesting real life E-C transliteration pairs from the Web; (ii) a phonetic similarity model that evaluates the confidence of so extracted E-C pair candidates; (iii) a comparative study of several machine learning strategies.</Paragraph>
  </Section>
  <Section position="5" start_page="1130" end_page="1130" type="metho">
    <SectionTitle>
3 Phonetic Similarity Model
</SectionTitle>
    <Paragraph position="0"> English and Chinese have different syllable structures. Chinese is a syllabic language where each Chinese character is a syllable in either consonant-vowel (CV) or consonant-vowel-nasal (CVN) structure. A Chinese word consists of a sequence of characters, phonetically a sequence of syllables. Thus, in first E-C transliteration, it is a natural choice to syllabify an English word by converting its phoneme sequence into a sequence of Chinese-like syllables, and then convert it into a sequence of Chinese characters.</Paragraph>
    <Paragraph position="1"> There have been several effective algorithms for the syllabification of English words for transliteration. Typical syllabification algorithms first convert English graphemes to phonemes, referred to as the letter-to-sound transformation, then syllabify the phoneme sequence into a syllable sequence. For this method, a letter-to-sound conversion is needed (Pagel, 1998; Jurafsky, 2000). The phoneme-based syllabification algorithm is referred to as PSA.</Paragraph>
    <Paragraph position="2"> Another syllabification technique attempts to map the grapheme of an English word to syllables directly (Kuo and Yang, 2004). The grapheme-based syllabification algorithm is referred to as GSA. In general, the size of a phoneme inventory is smaller than that of a grapheme inventory. The PSA therefore requires less training data for statistical modeling (Knight, 1998); on the other hand, the grapheme-based method gets rid of the letter-to-sound conversion, which is one of the main causes of transliteration errors (Li et al, 2004).</Paragraph>
    <Paragraph position="3"> Assuming that Chinese transliterations always co-occur in proximity to their original English words, we propose a phonetic similarity modeling (PSM) that measures the phonetic similarity between candidate transliteration pairs. In a bilingual snippet, when an English word EW is spotted, the method searches for the word's possible Chinese transliteration CW in its neighborhood. EW can be a single word or a phrase of multiple English words. Next, we formulate the PSM and the estimation of its parameters.</Paragraph>
    <Section position="1" start_page="1130" end_page="1130" type="sub_section">
      <SectionTitle>
3.1 Generative Model
</SectionTitle>
      <Paragraph position="0"> Let 1{,...,...}mMESeseses= be a sequence of English syllables derived from EW, using the PSA or GSA approach, and 1{,.. ,...}nNCScscscs= be the sequence of Chinese syllables derived from CW, represented by a Chinese character string 1,...,...,nNCWccc . EW and CW is a transliteration pair. The E-C transliteration can be considered a generative process formulated by the noisy channel model, with EW as the input and CW as the output. (/)PEWCW is estimated to characterize the noisy channel, known as the transliteration probability. ()PCW is a language model to characterize the source language.</Paragraph>
      <Paragraph position="1"> Applying Bayes' rule, we have</Paragraph>
      <Paragraph position="3"> Following the translation-by-sound principle, the transliteration probability (/)PEWCW can be approximated by the phonetic confusion probability (/)PESCS , which is given as</Paragraph>
      <Paragraph position="5"> where F is the set of all possible alignment paths between ES and CS. It is not trivial to find the best alignment path D . One can resort to a dynamic programming algorithm. Assuming conditional independence of syllables in ES and CS, we have 1(/)(/)M mmmPESCSpescs== in a special case where MN= . Note that, typically, we have NM due to syllable elision. We introduce a null syllable j and a dynamic warping strategy to evaluate (/)PESCS when MN (Kuo et al, 2005). With the phonetic approximation, Eq.(1) can be rewritten as</Paragraph>
      <Paragraph position="7"> The language model in Eq.(3) can be represented by Chinese characters n-gram statistics.</Paragraph>
      <Paragraph position="9"> context of EW usually has a number of competing Chinese transliteration candidates in a set, denoted as W . We rank the candidates by Eq.(1) to find the most likely CW for a given EW.</Paragraph>
      <Paragraph position="10"> In this process, ()PEW can be ignored because it is the same for all CW candidates. The CW candidate that gives the highest posterior probability is considered the most probable  However, the most probable CW isn't necessarily the desired transliteration. The next step is to examine if CW and EW indeed form a genuine E-C pair. We define the confidence of the E-C pair as the posterior odds similar to that in a hypothesis test under the Bayesian interpretation. We have 0H , which hypothesizes that CW and EW form an E-C pair, and 1H , which hypothesizes otherwise. The posterior odds is given as follows,</Paragraph>
      <Paragraph position="12"> where 'CS is the syllable sequence of CW , 1(/)pHEW is approximated by the probability mass of the competing candidates of CW ,</Paragraph>
      <Paragraph position="14"> . The higher the s is, the more probable that hypothesis</Paragraph>
      <Paragraph position="16"> seen as an extension to prior work (Brill et al, 2001) in transliteration modeling. We introduce the posterior odds s as the confidence score so that E-C pairs that are extracted from different contexts can be directly compared. In practice, we set a threshold for s to decide a cutoff point for E-C pairs short-listing.</Paragraph>
    </Section>
    <Section position="2" start_page="1130" end_page="1130" type="sub_section">
      <SectionTitle>
3.2 PSM Estimation
</SectionTitle>
      <Paragraph position="0"> The PSM parameters are estimated from the statistics of a given transliteration lexicon, which is a collection of manually selected E-C pairs in supervised learning, or a collection of high confidence E-C pairs in unsupervised learning.</Paragraph>
      <Paragraph position="1"> An initial PSM is bootstrapped using prior knowledge such as rule-based syllable mapping.</Paragraph>
      <Paragraph position="2"> Then we align the E-C pairs with the PSM and derive syllable mapping statistics for PSA and GSA syllabifications. A final PSM is a linear combination of the PSA-based PSM (PSA-PSM) and the GSA-based PSM (GSA-PSM). The PSM parameter (/)mnpescs can be estimated by an Expectation-Maximization (EM) process (Dempster, 1977). In the Expectation step, we compute the counts of events such as #,mnescs&lt;&gt; and # ncs&lt;&gt; by force-aligning the E-C pairs in the training lexicon Y . In the Maximization step, we estimate the PSM parameters (/)mnpescs by (/)#,/#mnmnnpescsescscs=&lt;&gt;&lt;&gt;. (7) As the EM process guarantees non-decreasing likelihood probability (/)PESCS&amp;quot;Y , we let the EM process iterate until (/)PESCS&amp;quot;Y converges. The EM process can be thought of as a refining process to obtain the best alignment between the E-C syllables and at the same time a re-estimating process for PSM parameters. It is summarized as follows.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1130" end_page="1133" type="metho">
    <SectionTitle>
4 Adaptive Learning Framework
</SectionTitle>
    <Paragraph position="0"> We propose an adaptive learning framework under which we learn PSM and harvest E-C pairs from the Web at the same time. Conceptually, the adaptive learning is carried out as follows.</Paragraph>
    <Paragraph position="1"> We obtain bilingual snippets from the Web by iteratively submitting queries to the Web search engines (Brin and Page, 1998). For each batch of querying, the query results are all normalized to plain text, from which we further extract qualified sentences. A qualified sentence has at least one English word. Under this criterion, a collection of qualified sentences can be extracted automatically. To label the E-C pairs, each qualified sentence is manually checked based on the following transliteration criteria: (i) if an EW is partly translated phonetically and partly translated semantically, only the phonetic transliteration constituent is extracted to form a  transliteration pair; (ii) elision of English sound is accepted; (iii) multiple E-C pairs can appear in one sentence; (iv) an EW can have multiple valid Chinese transliterations and vice versa. The validation process results in a collection of qualified E-C pairs, also referred to as Distinct Qualified Transliteration Pairs (DQTPs).</Paragraph>
    <Paragraph position="2"> As formulated in Section 3, the PSM is trained using a training lexicon in a data driven manner. It is therefore very important to ensure that in the learning process we have prepared a quality training lexicon. We establish a baseline system using supervised learning. In this approach, we use human labeled data to train a model. The advantage is that it is able to establish a model quickly as long as labeled data are available.</Paragraph>
    <Paragraph position="3"> However, this method also suffers from some practical issues. First, the derived model can only be as good as the data that it sees. An adaptive mechanism is therefore needed for the model to acquire new knowledge from the dynamically growing Web. Second, a massive annotation of database is labor intensive, if not entirely impossible.</Paragraph>
    <Paragraph position="4"> To reduce the annotation needed, we discuss three adaptive strategies cast in the machine learning framework, namely active learning, unsupervised learning and active-unsupervised learning. The learning strategies can be depicted in Figure 1 with their difference being discussed next. We also train a baseline system using supervised learning approach as a reference point for benchmarking purpose.</Paragraph>
    <Section position="1" start_page="1132" end_page="1132" type="sub_section">
      <SectionTitle>
4.1 Active Learning
</SectionTitle>
      <Paragraph position="0"> Active learning is based on the assumption that a small number of labeled samples, which are DQTPs here, and a large number of unlabeled  samples are available. This assumption is valid in most NLP tasks. In contrast to supervised learning, where the entire corpus is labeled manually, active learning selects the most useful samples for labeling and adds the labeled examples to the training set to retrain the model. This procedure is repeated until the model achieves a certain level of performance.</Paragraph>
      <Paragraph position="1"> Practically, a batch of samples is selected each time. This is called batch-based sample selection (Lewis and Catlett, 1994), as shown in the search and ranking block in Figure 1.</Paragraph>
      <Paragraph position="2"> For an active learning to be effective, we propose using three measures to select candidates for human labeling. First, we would like to select the most uncertain samples that are potentially highly informative for the PSM model. The informativeness of a sample can be quantified by its confidence score s as in the PSM formulation. Ranking the E-C pairs by s is referred to as C-rank. The samples of low C-rank are the interesting samples to be labeled. Second, we would like to select candidates that are of low frequency. Ranking by frequency is called Frank. During Web crawling, most of the search engines use various strategies to prevent spamming and one of fundamental tasks is to remove the duplicated Web pages. Therefore, we assume that the bilingual snippets are all unique. Intuitively, E-C pairs of low frequency indicate uncommon events which are of higher interest to the model. Third, we would like to select samples upon which the PSA-PSM and GSA-PSM disagree the most. The disagreed upon samples represent new knowledge to the PSM. In short, we select low C-rank, low F-rank and PSM-disagreed samples for labeling because the high C-rank, high F-rank and PSM-agreed samples are already well known to the model.</Paragraph>
    </Section>
    <Section position="2" start_page="1132" end_page="1133" type="sub_section">
      <SectionTitle>
4.2 Unsupervised Learning
</SectionTitle>
      <Paragraph position="0"> Unsupervised learning skips the human labeling step. It minimizes human supervision by automatically labeling the data. This can be effective if prior knowledge about a task is available, for example, if an initial PSM can be built based on human crafted phonetic mapping rules. This is entirely possible. Kuo et al (2005) proposed using a cross-lingual phonetic confusion matrix resulting from automatic speech recognition to bootstrap an initial PSM model. The task of labeling samples is basically to distinguish the qualified transliteration pairs from the rest. Unlike the sample selection method in active learning, here we would like to  select the samples that are of high C-rank and high F-rank because they are more likely to be the desired transliteration pairs.</Paragraph>
      <Paragraph position="1"> The difference between the active learning and the unsupervised learning strategies lies in that the former selects samples for human labeling, such as in the select &amp; labeling block in Figure 1 before passing on for PSM learning, while the latter selects the samples automatically and assumes they are all correct DQTPs. The disadvantage of unsupervised learning is that it tends to reinforce its existing knowledge rather than to discover new events.</Paragraph>
    </Section>
    <Section position="3" start_page="1133" end_page="1133" type="sub_section">
      <SectionTitle>
4.3 Active-Unsupervised Learning
</SectionTitle>
      <Paragraph position="0"> The active learning and the unsupervised learning strategies can be complementary. Active learning minimizes the labeling effort by intelligently short-listing informative and representative samples for labeling. It makes sure that the PSM learns new and informative knowledge over iterations. Unsupervised learning effectively exploits the unlabelled data. It reinforces the knowledge that PSM has acquired and allows PSM to adapt to changes at no cost. However, we do not expect unsupervised learning to acquire new knowledge like active learning does. Intuitively, a better solution is to integrate the two strategies into one, referred to as the active-unsupervised learning strategy. In this strategy, we use active learning to select a small amount of informative and representative samples for labeling. At the same time, we select samples of high confidence score from the rest and consider them correct E-C pairs. We then merge the labeled set with the high-confidence set in the PSM re-training.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>