<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2011">
  <Title>A High-Accurate Chinese-English NE Backward Translation System Combining Both Lexical Information and Web Statistics</Title>
  <Section position="5" start_page="82" end_page="86" type="metho">
    <SectionTitle>
3 Chinese-to-English NE Translation
</SectionTitle>
    <Paragraph position="0"> As we have mentioned in the last section, we could find most English translations in Chinese web page snippets. We thus base our system on web search engine: retrieving candidates from returned snippets, combining both linguistic and statistical information to find the correct translation. Our system can be split into three steps: candidate retrieving, candidate evaluating, and  In the first step, the NE to be translated, GN, is sent to Google to retrieve traditional Chinese web pages, and a simple English NE recognition method and several preprocessing procedures are applied to obtain possible candidates from returned snippets. In the second step, four features (i.e., phonetic values, word senses, recurrences, and relative positions) are exploited to give these candidates a score. In the last step, the candidates with higher scores are sent to Google again. Recurrence information and relative positions concerning with the candidate to be verified of GN in returned snippets are counted along with the scores to decide the final ranking of candidates. These three steps will be detailed in the following subsections.</Paragraph>
    <Section position="1" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
3.1 Retrieving Candidates
</SectionTitle>
      <Paragraph position="0"> Before we can identify possible candidates, we must retrieve them first. In the returned traditional Chinese snippets by Google, there are still many English fragments. Therefore, the first task our system would do is to separate these English fragments into NEs and non-NEs. We propose a simple method to recognize possible NEs. All fragments conforming to the following properties would be recognized as NEs: * The first and the last word of the fragment are numerals or capitalized.</Paragraph>
      <Paragraph position="1"> * There are no three or more consequent lowercase words in the fragment.</Paragraph>
      <Paragraph position="2"> * The whole fragment is within one sentence.</Paragraph>
      <Paragraph position="3"> After retrieving possible NEs in returned snippets, there are still some works to do to make a</Paragraph>
    </Section>
    <Section position="2" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
Parts and by Meaning
</SectionTitle>
      <Paragraph position="0"> for the Others The entire NE is supposed to be translated by its meaning and the name parts are transliterated.</Paragraph>
      <Paragraph position="1">  The NE is translated by its semantic or the content of the entity it refers to. &amp;quot;The Mask&amp;quot; and &amp;quot;Mo Deng (modern)Da (great)Sheng (saint)&amp;quot; Parallel Names NE is initially denominated as more than one name or in more than one language. &amp;quot;Sun Zhong Shan (Sun Zhong-Shan)&amp;quot; and &amp;quot;Sun Yat-Sen&amp;quot;  finer candidate list for verification. First, there might be many different forms for a same NE. For example, &amp;quot;Mr. &amp; Mrs. Smith&amp;quot; may also appear in the form of &amp;quot;Mr. and Mrs. Smith&amp;quot;, &amp;quot;Mr. And Mrs. Smith&amp;quot;, and so on. To deal with these aliasing forms, we transform all different forms into a standard form for the later ranking and identification. The standard form follows the following rules:  would be transformed into &amp;quot;MR. AND MRS. SMITH&amp;quot;.</Paragraph>
      <Paragraph position="2"> The second work we should complete before ranking is filtering useless substrings. An NE may comprise many single words. These component words may all be capitalized and thus all substrings of this NE would be fetched as candidates of our translation work. Therefore, sub-strings which always appear with a same preceding and following word are discarded here, since they would have a zero recurrence score in the next step, which would be detailed in the next subsection.</Paragraph>
    </Section>
    <Section position="3" start_page="83" end_page="83" type="sub_section">
      <SectionTitle>
3.2 Evaluating Candidates
</SectionTitle>
      <Paragraph position="0"> After candidate retrieving, we would obtain a sequence of m candidates, C1, C2, ..., Cm. An integrated evaluating model is introduced to exploit four features (phonetic values, word senses, recurrences, and relative positions) to score these m candidates, as the following equation suggests:</Paragraph>
      <Paragraph position="2"> LScore(Ci,GN) combines phonetic values and word senses to evaluate the lexical similarity between Ci and GN. SScore(Ci,GN) concerns both recurrences information and relative positions to evaluate the statistical relationship between Ci and GN. These two scores are then combined to obtain Score(Ci,GN). How to estimate LScore(Cn, GN) and SScore(Cn, GN) would be discussed in detail in the following subsections. null  The lexical similarity concerns both phonetic values and word senses. An NE may consist of many single words. These component words may be translated either by phonetic values or by word senses. Given a translation pair, we could split them into fragments which could be bipartite matched according to their translation relationships, as Figure 4 shows.</Paragraph>
    </Section>
    <Section position="4" start_page="83" end_page="85" type="sub_section">
      <SectionTitle>
[Figure 4 residue: fragment matching example, "... Shu Shu De Xiao Wu"]
</SectionTitle>
      <Paragraph position="0"> To identify the lexical similarity between two NEs, we could estimate the similarity scores between the matched fragment pairs first, and then sum them up as a total score. We postulate that the matching with the highest score is the correct matching. Therefore the problem becomes a weighted bipartite matching problem, i.e., given the similarity scores between any fragment pairs, to find the bipartite matching with the highest score. In this way, our next problem is how to estimate the similarity scores between fragments.</Paragraph>
      <Paragraph position="1"> We treat an English single word as a fragment unit, i.e., each English single word corresponds to one fragment. An English candidate Ci consisting of n single words would be split into n fragment units, Ci1, Ci2, ..., Cin. We define a Chinese fragment unit that it could comprise one to four characters and may overlap each other. A fragment unit of GN can be written as GNab, which denotes the ath to bth characters of GN, and b - a &lt; 4. The linguistic similarity score between two fragments is:</Paragraph>
      <Paragraph position="3"> Where PVSim() estimates the similarity in phonetic values while WSSim() estimate it in word senses.</Paragraph>
      <Paragraph position="4">  In this paper, we adopt a simple but novel method to estimate the similarity in phonetic values. Unlike many approaches, we don't introduce an intermediate phonetic alphabet system for comparison. We first transform the Chinese fragments into possible English strings, and then estimate the similarity between transformed strings and English candidates in surface strings, as Figure 5 shows. However, similar pronunciations does not equal to similar surface strings. Two quite dissimilar strings may have very similar pronunciations. Therefore, we take this strat- null Edit distances are usually used to estimate the surface similarity between strings. However, the typical edit distance does not completely satisfy the requirement in the context of translation identification. In translation, vowels are an unreliable feature. There are many variations in pronunciation of vowels, and the combinations of vowels are numerous. Different combinations of vowels may have a same phonetic value, however, same combinations may pronounce totally differently. The worst of all, human often arbitrarily determine the pronunciation of unfamiliar vowel combinations in translation. For these reasons, we adopt the strategy that vowels can be ignored in transformation. That is to say when it is hard to determine which vowel combination should be generated from given Chinese fragments, we can only transform the more certain part of consonants. Thus during the calculation of edit distances, the insertion of vowels would not be calculated into edit distances. Finally, the modified edit distance between two strings A and B is defined as follow:  Len() denotes the length of the string. In the above equation, the similarity scores are ranged from 0 to 1.</Paragraph>
      <Paragraph position="5"> We build the fixed transformation table manually. All possible transformations from Chinese transliterating characters to corresponding English strings are built. If we cannot precisely indicate which vowel combination should be transformed, or there are too many possible combinations, we ignores vowels. Then we use a training set of 3,000 transliteration names to examine possible omissions due to human ignorance.</Paragraph>
      <Paragraph position="6"> square6 Word Senses More or less similar to the estimation of phonetic similarity, we do not use an intermediate representation of meanings to estimate word sense similarity. We treat the English translations in the C-E bilingual dictionary (reference removed for blind review) directly as the word senses of their corresponding Chinese word entries. We adopt a simple 0-or-1 estimation of word sense similarity between two strings A and B, as the following equation suggests: = dictionary in the ofon translatia is if ,1 dictionary in the</Paragraph>
      <Paragraph position="8"> All the Chinese foreign names appearing in test data is removed from the dictionary.</Paragraph>
      <Paragraph position="9"> From the above equations we could derive that LSim() of fragment pairs is also ranged from 0 to 1. Candidates to be evaluated may comprise different number of component words, and this would result the different scoring base of the weighted bipartite matching. We should normalize the result scores of bipartite matching. As a result, the following equation is applied:  Two pieces of information are concerned together to estimate the statistical similarity: recurrences and relative positions. A candidate Ci might appear l times in the returned snippets, as Ci,1, Ci,2, ..., Ci,l. For each Ci,k, we find the dis- null tance between it and the nearest GN in the returned snippets, and then compute the relative position scores as the following equation:  In other words, if the candidate is adjacent to the given NE, it would have a relative position score of 1. Relative position scores of all Ci,k would be summed up to obtain the primitive statistical score: PSS(Ci, GN) = [?] k RP(Cn,k, GN) As we mentioned before, since the imprecision of NE recognition, most substrings of NEs would also be recognized as candidates. This would result a problem. There are often typos in the information provided on the Internet. If some component word of an NE is misspelled, the substrings constituted by the rest words would have a higher statistical score than the correct NE. To prevent such kind of situations, we introduce entropy of the context of the candidate. If a candidate has a more varied context, it is more possible to be an independent term instead of a substring of other terms. Entropy provides such a property: if the possible cases are more varied, there is higher entropy, and vice versa. Entropy function here concerns the possible cases of the most adjacent word at both ends of the candidate, as the following equation suggests:  Where NCTr and NCi denote the appearing times of the rth context CTr and the candidate Ci in the returned snippets respectively, and NPTi denotes the total number of different cases of the context of Ci. Since we want to normalize the entropy to 0~1, we take NPTi as the base of the logarithm function.</Paragraph>
      <Paragraph position="10"> While concerning context combinations, only capitalized English word is discriminated. All other words would be viewed as one sort &amp;quot;OTHER&amp;quot;. For example, assuming the context of &amp;quot;David&amp;quot; comprises three times of (Craig, OTHER), three times of (OTHER, Stern), and six times of (OTHER, OTHER), then:</Paragraph>
    </Section>
    <Section position="5" start_page="85" end_page="86" type="sub_section">
      <SectionTitle>
3.3 Verifying Candidates
</SectionTitle>
      <Paragraph position="0"> In evaluating candidate, we concern only the appearing frequencies of candidates when the NE to be translated is presented. In the other direction, we should also concern the appearing frequencies of the NE to be translated when the candidate is presented to prevent common words getting an improper high score in evaluation. We perform the inverse search approach for this sake. Like the evaluation of statistical scores in the last step, candidates are sent to Google to retrieve Traditional Chinese snippets, and the same equation of SScore() is computed concerning the candidate. However, since there are too many candidates, we cannot perform this process on all candidates. Therefore, an elimination mechanism is adopted to select candidates for verification. The elimination mechanism works  as follows: 1. Send the Top-3 candidates into Google for verification.</Paragraph>
      <Paragraph position="1"> 2. Count SScore(GN, Ci). (Notice that the order of the parameter is reversed.) Re-weight Score(Ci, GN) by multiplying SScore(GN, Ci) 3. Re-rank candidates 4. After re-ranking, if new candidates become  the Top-3 ones, redo the first step. Otherwise end this process.</Paragraph>
      <Paragraph position="2"> The candidates have been verified would be recorded to prevent duplicate re-weighting and unnecessary verification.</Paragraph>
      <Paragraph position="3"> There is one problem in verification we should concern. Since we only consider recurrence information in both directions, but not co-occurrence information, this would result some problem when dealing rarely used translations. For example, &amp;quot;Peter Pan&amp;quot; can be translated into &amp;quot;Bi De Pan &amp;quot; or &amp;quot;Bi De Pan &amp;quot; (both pronounced as Bi-De-Pan) in Chinese, but most people would use the former translation. Thus if we send &amp;quot;Peter Pan&amp;quot; to verification when translating &amp;quot;Bi De Pan &amp;quot;, we would get a very low score.</Paragraph>
      <Paragraph position="4"> To deal with this situation, we adopt the strategy of disbelieving verification in some situa- null tions. If all candidates have scores lower than the threshold, we presume that the given NE is a rarely used translation. In this situation, we use only Score(Cn, GN) estimated by the evaluation step to rank its candidates, without multiplying SScore(GN, Ci) of the inverse search. The threshold is set to 1.5 by heuristic, since we consider that a commonly used translation is supposed to have their SScore() larger than 1 in both directions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>