<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1030">
  <Title>Example-Based Machine Translation in the Pangloss System</Title>
  <Section position="3" start_page="0" end_page="169" type="metho">
    <SectionTitle>
2 Parallel Bilingual Corpus
</SectionTitle>
    <Paragraph position="0"> The corpus used by PanEBMT consists of a set of source/target sentence pairs, and is fully indexed on the source-language sentences. The corpus is not aligned at any granularity finer than the sentence pair; subsentential alignment is performed at run-time based on the sentence fragments selected and the other knowledge sources.</Paragraph>
    <Paragraph position="1"> The corpus index lists all occurrences of every word and punctuation mark in the source-language sentences contained in the corpus. The index has been designed to permit incremental updates, allowing new sentence pairs to be added to the corpus as they become available (for example, to implement a translation memory with the system's own output). The text is tokenized prior to indexing, so that words in any of the equivalence classes defined in the EBMT configuration file (such as month names, countries, or measuring units), as well as the predefined equivalence class &lt;number&gt;, are indexed under the equivalence class rather than their own names. For each distinct token, the index contains a list of the token's occurrences, consisting of a sentence identifier and the word number within the sentence. At translation time, PanEBMT back-substitutes the appropriate target-language word into any translation which involves any tokenized words.</Paragraph>
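The index just described can be sketched as an inverted index from tokens to (sentence, word-position) pairs. This is a minimal illustration, not PanEBMT's actual data layout; the names and the single `&lt;number&gt;` equivalence class are assumptions for the example.

```cpp
// Sketch of the corpus index: every token maps to a list of
// (sentence id, word position) occurrences, and tokens matching an
// equivalence class (here only <number>) are indexed under the class
// name instead of their own spelling. Illustrative names throughout.
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

struct Occurrence {
    int sentence;  // sentence identifier
    int word;      // word number within the sentence
};

using CorpusIndex = std::map<std::string, std::vector<Occurrence>>;

// Replace tokens in an equivalence class by the class name before indexing.
static std::string tokenize(const std::string& word) {
    bool numeric = !word.empty();
    for (char c : word) numeric = numeric && std::isdigit((unsigned char)c);
    return numeric ? "<number>" : word;
}

// Incremental update: new sentence pairs can be indexed as they arrive.
static void indexSentence(CorpusIndex& index, int id,
                          const std::vector<std::string>& words) {
    for (int pos = 0; pos < (int)words.size(); ++pos)
        index[tokenize(words[pos])].push_back({id, pos});
}
```

Because updates only append to posting lists, adding a freshly translated sentence pair (the translation-memory use mentioned above) never requires rebuilding the index.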
    <Paragraph position="2"> The bilingual corpus used for the results reported here consists of 726,406 Spanish-English sentence pairs drawn primarily from the UN Multilingual Corpus available from the Linguistic Data Consortium (Graff and Finch, 1992) (Figure 1), with a small admixture of texts from the Pan American Health Organization (PAHO). Figure 1 shows sample sentence pairs: "Las fuentes de esos comentarios y recomendaciones son las siguientes :" / "The sources of these comments and recommendations are :" and "El informe de la Junta de Auditores a la Asamblea General que incluye las observaciones del Director Ejecutivo del UNICEF sobre los comentarios y recomendaciones de la Junta de Auditores ;" / "The report of the Board of Auditors to the General Assembly which incorporates the observations of the Executive Director of UNICEF on the comments and recommendations of the Board of Auditors ;"</Paragraph>
  </Section>
  <Section position="6" start_page="169" end_page="169" type="metho">
    <SectionTitle>
(ACADMICOS ACADEMICS ACADEMICAL
TITLES DEGREES)
(ACAECIDO HAPPEN)
(ACAECIDOS HAPPEN)
(ACANTONADAS CANTON QUARTER TROOPS)
(ACANTONAMIENTO CANTONMENT)
(ACARREA CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREABA CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREARON CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREAR TRANSPORT HAUL CART CARRY
LUG ALONG BRING DOWN CAUSE OCCASION
ITS TRAIN RESULT GIVE RISE)
</SectionTitle>
    <Paragraph position="0"> Together, the bilingual dictionary and target-language list of roots and synonyms (extracted from WordNet when translating into English) provide the necessary information to find associations between source-language and target-language words in the selected sentence pairs.</Paragraph>
    <Paragraph position="1"> These associations are used in performing subsentential alignment. A source word is considered to be associated with a target-language word whenever either the target word itself or any of the words in its root/synonym list appear in the list of possible translations for the source word given by the dictionary.</Paragraph>
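The association test just described can be sketched as follows. This is a hedged illustration under assumed data structures; PanEBMT's actual dictionary and root/synonym representations are not specified in the text.

```cpp
// Sketch of the association test: a source word is associated with a
// target word when the target word itself, or any entry in its
// root/synonym list, appears among the dictionary's translations of
// the source word. Type names are illustrative.
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Dict = std::map<std::string, std::set<std::string>>;        // source -> translations
using Synonyms = std::map<std::string, std::vector<std::string>>; // target -> roots/synonyms

static bool associated(const Dict& dict, const Synonyms& syn,
                       const std::string& src, const std::string& tgt) {
    auto it = dict.find(src);
    if (it == dict.end()) return false;
    const std::set<std::string>& trans = it->second;
    if (trans.count(tgt)) return true;  // the target word itself matches
    auto s = syn.find(tgt);             // otherwise try its roots/synonyms
    if (s == syn.end()) return false;
    return std::any_of(s->second.begin(), s->second.end(),
                       [&](const std::string& w) { return trans.count(w) != 0; });
}
```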
    <Paragraph position="2"> Not all words will be associated one-to-one; however, the current implementation requires that at least one such unique association be found in order to provide an anchor for the alignment process.</Paragraph>
  </Section>
  <Section position="7" start_page="169" end_page="169" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> PanEBMT is implemented in C++, using the FramepaC library (Brown, 1996) for accessing Lisp data structures stored in files or sent from the main Pangloss module via Unix pipes. PanEBMT consists of approximately 13,300 lines of code, including the code for a glossary mode which will not be described here.</Paragraph>
    <Paragraph position="1"> PanEBMT uses a re-processed version of the bilingual dictionary used by Pangloss's dictionary translation engine (Figure 2). The re-processing consists of removing various high-frequency words and splitting all multi-word definitions into a list of single words, needed to find one-to-one associations. 210,250 sentence pairs stem from the PAHO corpus and 552 pairs from evaluations.</Paragraph>
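The re-processing step above amounts to splitting each definition on whitespace and dropping a stop list of high-frequency words. A minimal sketch, with an assumed stop list:

```cpp
// Minimal sketch of the dictionary re-processing: multi-word definitions
// are split into single words, after dropping high-frequency words, so
// that one-to-one associations can be found. The stop list is illustrative.
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> splitDefinition(
    const std::string& definition, const std::set<std::string>& stopwords) {
    std::vector<std::string> words;
    std::istringstream in(definition);
    std::string w;
    while (in >> w)
        if (!stopwords.count(w)) words.push_back(w);  // drop frequent words
    return words;
}
```

Applied to an entry such as ACARREAR above, a definition like "give rise" would be indexed as the two single words GIVE and RISE rather than one phrase.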
  </Section>
  <Section position="8" start_page="169" end_page="169" type="metho">
    <SectionTitle>
4 EBMT's Place in Pangloss
</SectionTitle>
    <Paragraph position="0"> PanEBMT is merely one of the translation engines used by Pangloss; the others are transfer engines (dictionaries and glossaries) and a knowledge-based machine translation engine (Figure 3). Each of these produces a set of candidate translations for various segments of the input, which are then combined into a chart (Figure 3). The chart is passed through a statistical language model to determine the best path through the chart, which is then output as the translation of the original input sentence.</Paragraph>
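The chart-combination step can be sketched as a dynamic program over candidate edges. In the real system a statistical language model scores paths; here a per-edge score stands in for that model, and all names are illustrative.

```cpp
// Hedged sketch of finding the best path through the chart: each engine
// contributes edges (candidate translations of input spans), and a
// dynamic program selects the highest-scoring sequence of edges that
// covers the whole input. The per-edge score is a stand-in for the
// statistical language model used by Pangloss.
#include <cassert>
#include <limits>
#include <string>
#include <vector>

struct Edge {
    int start, end;           // input word span [start, end)
    std::string translation;  // candidate translation of that span
    double score;             // stand-in for the language-model score
};

// Returns the best total score of any path of edges covering [0, length).
static double bestPath(const std::vector<Edge>& chart, int length) {
    const double NEG = -std::numeric_limits<double>::infinity();
    std::vector<double> best(length + 1, NEG);
    best[0] = 0.0;
    for (int pos = 0; pos < length; ++pos) {
        if (best[pos] == NEG) continue;  // position unreachable
        for (const Edge& e : chart)
            if (e.start == pos && best[pos] + e.score > best[e.end])
                best[e.end] = best[pos] + e.score;
    }
    return best[length];
}
```

Recording a back-pointer per position would recover the winning edges, i.e. the output translation.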
  </Section>
  <Section position="9" start_page="169" end_page="170" type="metho">
    <SectionTitle>
5 EBMT Operation
</SectionTitle>
    <Paragraph position="0"> The EBMT engine produces translations in two phases: 1. find chunks by searching the corpus index for occurrences of consecutive words from the input text; 2. perform subsentential alignment on each sentence pair found in the first phase to determine the translation of the chunk. In contrast with other work on example-based translation, such as (Maruyama and Watanabe, 1992) or early Pangloss EBMT experiments (Nirenburg et al., 1993), PanEBMT does not find an optimal partitioning of the input. Instead, it attempts to produce translations of every word sequence in the input sentence which appears in its corpus. The final selection of the "correct" cover for the input is left for the statistical language model, as is the case for all of the other translation engines in Pangloss. An advantage of this approach is that it avoids discarding possible chunks merely because they are not part of the "optimal" cover for the input, instead selecting the input coverage by how well the translations fit together to form a complete translation.</Paragraph>
    <Paragraph position="1"> To find chunks, the engine sequentially looks up each word of the input in the index. The occurrence list for each word is compared against the occurrence list for the prior word and against the list of chunks extending to the prior word. For each occurrence which is adjacent to an occurrence of the prior word, a new chunk is created or an existing chunk is extended as appropriate. After processing all input words in this manner, the engine has determined all possible substrings of the input containing at least two words which are present in the corpus. Since the more frequent word sequences can occur hundreds of times in the corpus, the list of chunks is culled to eliminate all but the last five (by default) occurrences of any distinct word sequence. By selecting the last occurrences of each word sequence, one effectively gives the most recent additions to the corpus the highest weight, precisely what is needed for a translation memory.</Paragraph>
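The adjacency-based chunk extension can be sketched as follows. This is a simplification: it keeps each growing match and its two-word-or-longer prefixes, and omits the culling to the last five occurrences; the data layout is illustrative, not PanEBMT's actual one.

```cpp
// Sketch of the chunk-finding pass: for each input word, its occurrence
// list is compared against the chunks ending at the previous word; a
// chunk is extended wherever an occurrence is adjacent to one already
// matched, and a new one-word chunk is started otherwise.
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Occ { int sentence, word; };

struct Chunk {
    int inputStart, inputLen;   // span of the input covered
    int sentence, corpusStart;  // matching span in the corpus
};

static std::vector<Chunk> findChunks(
    const std::map<std::string, std::vector<Occ>>& index,
    const std::vector<std::string>& input) {
    std::vector<Chunk> done, active;
    for (int i = 0; i < (int)input.size(); ++i) {
        std::vector<Chunk> extended;
        auto it = index.find(input[i]);
        if (it != index.end()) {
            for (const Occ& o : it->second) {
                bool grew = false;
                for (const Chunk& c : active)  // extend chunks ending just before o
                    if (c.sentence == o.sentence &&
                        c.corpusStart + c.inputLen == o.word) {
                        extended.push_back({c.inputStart, c.inputLen + 1,
                                            c.sentence, c.corpusStart});
                        grew = true;
                    }
                if (!grew) extended.push_back({i, 1, o.sentence, o.word});
            }
        }
        for (const Chunk& c : active)   // keep matches of two or more words
            if (c.inputLen >= 2) done.push_back(c);
        active = extended;
    }
    for (const Chunk& c : active)
        if (c.inputLen >= 2) done.push_back(c);
    return done;
}
```

Each returned chunk records both its input span and the corpus sentence it matched, which is exactly what the alignment phase below needs.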
    <Paragraph position="2"> Next, the sentence pairs containing the chunks found in the first phase are read from disk, and alignment is performed on each in order to determine the translation of the chunk, unless the match is against the entire corpus entry, in which case the entire target-language sentence is taken as the translation. Alignment currently uses a rather simplistic brute-force approach very similar to that of (Nirenburg et al., 1994) which identifies the minimum and maximum possible segments of the target-language sentence which could possibly correspond to the chunk, and then applies a scoring function to every possible substring of the maximum segment containing at least the minimum segment. The substring with the best score is then selected as the aligned match for the chunk.</Paragraph>
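The brute-force enumeration can be sketched as two nested loops over candidate boundaries. The scoring function here is a trivial stand-in (length difference from the source chunk); the real system uses a weighted sum of simple tests.

```cpp
// Sketch of the brute-force alignment: every substring of the maximum
// target segment [maxLo, maxHi) that contains the minimum segment
// [minLo, minHi) is scored, and the lowest-scoring (best) one is kept.
// scoreCandidate is an illustrative stand-in, not the real function.
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in scoring function: lower is better, as in the paper's output.
static double scoreCandidate(const std::vector<std::string>& chunk,
                             const std::vector<std::string>& candidate) {
    int d = (int)chunk.size() - (int)candidate.size();
    return d < 0 ? -d : d;
}

static std::pair<int, int> bestAlignment(
    const std::vector<std::string>& chunk,
    const std::vector<std::string>& target,
    int maxLo, int maxHi, int minLo, int minHi) {
    double best = 1e9;
    std::pair<int, int> span{minLo, minHi};
    for (int lo = maxLo; lo <= minLo; ++lo)       // candidate left boundary
        for (int hi = minHi; hi <= maxHi; ++hi) { // candidate right boundary
            std::vector<std::string> cand(target.begin() + lo,
                                          target.begin() + hi);
            double s = scoreCandidate(chunk, cand);
            if (s < best) { best = s; span = {lo, hi}; }
        }
    return span;
}
```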
    <Paragraph position="3"> The alignment scoring function is computed from the weighted sum of a number of extremely simple test functions. The weights can be changed for differing lengths of the source chunk in order to adapt to varying impacts of the tests with varying numbers of words in the chunk, as well as varying impacts as some or all of the raw test scores change. The test functions include (in approximate order of importance) such measures as a) the number of source words without correspondences in the target, b) the number of target words without correspondences in the source, c) matching words in source/target without correspondences, d) number of words with correspondence in the full target but not the candidate chunk, e) common sentence boundaries, f) elidable source words, g) insertable target words, and h) the difference in length between source and target chunks.</Paragraph>
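A weighted sum of tests a), b), and h) can be sketched as below; the remaining tests and all weight values are omitted, and the names are illustrative.

```cpp
// Minimal sketch of the weighted-sum scoring function. srcMatched[i]
// (tgtMatched[j]) says whether source word i (target word j) found a
// correspondent on the other side. Higher scores are worse.
#include <cassert>
#include <vector>

// Tests a) and b): count words lacking correspondences.
static int uncovered(const std::vector<bool>& matched) {
    int n = 0;
    for (bool m : matched) if (!m) ++n;
    return n;
}

// Weights would in practice be chosen per source-chunk length.
static double alignmentScore(const std::vector<bool>& srcMatched,
                             const std::vector<bool>& tgtMatched,
                             double wSrc, double wTgt, double wLen) {
    double lenDiff = (double)srcMatched.size() - (double)tgtMatched.size();
    if (lenDiff < 0) lenDiff = -lenDiff;   // test h): length difference
    return wSrc * uncovered(srcMatched)    // test a)
         + wTgt * uncovered(tgtMatched)    // test b)
         + wLen * lenDiff;
}
```

Because the per-length weight tables are just data, re-tuning the tradeoff between the tests requires no code changes.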
    <Paragraph position="4"> There is one exception to the above procedure for retrieving and aligning chunks. If any of the chunks covers the entire input string and the entire source-language half of a corpus sentence pair, then all other chunks are discarded and the target-language half of the pair is produced as the translation. This speeds up the system when operating in translation memory mode, as would be the case in a system used to translate revisions of previous texts. Unlike a pure translation memory, however, PanEBMT does not require an exact match with a memorized translation.</Paragraph>
    <Paragraph position="5"> Figure 4 shows the set of translations generated from one sentence. The output is shown in the format used for standalone testing, which generates only the best translation for each distinct chunk; when integrated with the rest of Pangloss, PanEBMT also includes information indicating which portion of the input sentence and which pair from the corpus were used, and can produce multiple translations for each chunk. The number next to the source-language chunk in the output indicates the value of the scoring function, where higher values are worse. Very poor alignments (scores greater than five times the source chunk length) have already been omitted from the output.</Paragraph>
  </Section>
  <Section position="10" start_page="170" end_page="171" type="metho">
    <SectionTitle>
6 Recent Enhancements
</SectionTitle>
    <Paragraph position="0"> The EBMT engine described here is a completely new implementation in C++ replacing an earlier Lisp version. The previous version had performed very poorly (to the point where its results were essentially ignored when combining the outputs of the various translation engines), for two main reasons: inadequate corpus size and incomplete indexing.</Paragraph>
    <Paragraph position="1"> The earlier incarnation had used a corpus of considerably less than 40 megabytes of text, compared to the 270 megabytes used for the results described herein. The seven-fold increase in corpus size produces a proportional increase in matches. Not only was the corpus fairly small, the text which was used was not fully indexed. To limit the size of the index file, a long list of the most frequent words was omitted from the index, as were punctuation marks. Although allowances were made for the words on the stop list, the missing punctuation marks always forced a break in chunks, frequently limiting the size of chunks which could be found. Further, allowance was made for the un-indexed frequent words by permitting any sequence of frequent words between two indexed words, producing many erroneous matches.</Paragraph>
    <Paragraph position="2"> The newer implementation fully indexes the corpus, and thus examines only exact matches with the input, ensuring that only good matches are actually processed. Further, PanEBMT can index certain word pairs to, in effect, precompute some two-word chunks. When applied to the five to ten most frequent words, this pairing can reduce processing time during translation by dramatically reducing the amount of data which must be read from the index file (for example, there might be 10,000 occurrences of a word pair instead of 1,000,000 occurrences of one of the words and 100,000 of the other word), and thus the number of adjacency comparisons which must be made.</Paragraph>
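The word-pair precomputation can be sketched as building an auxiliary index keyed by adjacent pairs whose first word is on the frequent-word list, so the engine reads one short posting list instead of intersecting two huge ones. Names are illustrative.

```cpp
// Sketch of the word-pair indexing enhancement: for a handful of very
// frequent words, adjacent-pair occurrences are precomputed at indexing
// time, replacing an expensive adjacency intersection at translation time.
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct PairOcc { int sentence, word; };

static std::map<std::string, std::vector<PairOcc>> buildPairIndex(
    const std::vector<std::vector<std::string>>& sentences,
    const std::set<std::string>& frequent) {
    std::map<std::string, std::vector<PairOcc>> pairs;
    for (int s = 0; s < (int)sentences.size(); ++s)
        for (int w = 0; w + 1 < (int)sentences[s].size(); ++w)
            if (frequent.count(sentences[s][w]))   // only frequent first words
                pairs[sentences[s][w] + " " + sentences[s][w + 1]]
                    .push_back({s, w});
    return pairs;
}
```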
  </Section>
</Paper>