<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2212">
  <Title>Automatically Creating Bilingual Lexicons for Machine Translation from Bilingual Text</Title>
  <Section position="4" start_page="0" end_page="1299" type="metho">
    <SectionTitle>
2 Theoretical framework
</SectionTitle>
    <Paragraph position="0"> The basic requirement that an MT system should meet for the present purpose is to be bidirectional. Bidirectionality is required in order to ensure that both source and target grammars can be used for parsing and that transfer can be done in both directions. More precisely, what is relevant is that the input and output to transfer be the same kind of structure.</Paragraph>
    <Paragraph position="1"> Moreover, the proposed method is most productive with a lexicalist MT system (Whitelock, 1994). The proposed application is concerned with producing bilingual lexical knowledge and this sort of knowledge is the only type of bilingual knowledge required by lexicalist systems. Nevertheless, it is also conceivable that the present approach can be used with a nonlexicalist transfer system, as long as the system is bidirectional. In this case, only the lexical portion of the bilingual knowledge can be automatically produced, assuming that the structural transfer portion is already in place. In the rest of this paper, a lexicalist MT system will be assumed and referred to. For the specific implementation described here and all the examples, we will refer to an existing lexicalist English-Spanish MT system (Popowich et al., 1997).</Paragraph>
    <Paragraph position="2"> The main feature of a lexicalist MT system is that it performs no structural transfer. Transfer is a mapping between a bag of lexical items used in parsing (the source bag) and a corresponding bag of target lexical items (the target bag), to be used in generation. The source bag actually contains more information than the corresponding bag of lexical items before parsing. Its elements get enriched with additional information instantiated during the parsing process. Information of fundamental importance included therein is a system of indices that express dependencies among lexical items. Such dependencies are transferred to the target bag and used to constrain generation. The task of generation is to find an order in which the lexical items can be successfully parsed.</Paragraph>
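    <Paragraph> The bag-to-bag view of transfer can be illustrated with a minimal Python sketch. It is not the system's implementation: the Item record, the TOY_LEXICON table and the index name "A" are assumptions made for illustration, and a Python list stands in for an unordered bag. The sketch only shows that the indices instantiated by parsing are carried over to the target items, where they would constrain generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One lexical item of a bag: a word, a category label and its indices."""
    word: str
    category: str
    indices: tuple


# Hypothetical word-for-word lexicon: source word -> (target word, target category).
TOY_LEXICON = {
    "the": ("el", "det"),
    "black": ("negro", "adj"),
    "dog": ("perro", "n"),
}


def transfer(source_bag):
    """Map each source item to a target item, carrying its indices over unchanged."""
    target_bag = []
    for item in source_bag:
        target_word, target_category = TOY_LEXICON[item.word]
        target_bag.append(Item(target_word, target_category, item.indices))
    return target_bag


# The shared index "A" records that the three items belong to the same phrase.
source_bag = [
    Item("the", "det", ("A",)),
    Item("black", "adj", ("A",)),
    Item("dog", "n", ("A",)),
]
target_bag = transfer(source_bag)
```

No word order is decided at this stage. Finding an order in which the target items parse is left to generation.</Paragraph>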
  </Section>
  <Section position="5" start_page="1299" end_page="1299" type="metho">
    <SectionTitle>
3 Bilingual templates
</SectionTitle>
    <Paragraph position="0"> A bilingual template is a bilingual entry in which words are left unspecified. E.g.:</Paragraph>
    <Paragraph position="2"> Here, a '&amp;quot; :' operator connects a word (a variable, in a template) to a description, %-~' connects the left and right sides of the entry, '\V introduces a transfer macro, which takes two descriptions as arguments and performs some additional transfer (Turcato et al., 1997). Descriptions are mainly expressed by macros, introduced by a '(c)' operator. The macro arguments are indices, as used in lexicalist transfer. Templates have been widely used in MT (Buschbeck-Wolf and Dorna, 1997), particularly in the Example-Based Machine Translation (EBMT) framework (Kaji et al. (i992), Giivenir and Tun~ (1996)). However, in EBMT, templates are most often used to model sentence-level correspondences, rather then lexical equivalences. Consequently, in EBMT the relation between lexical equivalences and templates is the reverse of what is being proposed here. In EBMT, lexical equivalences are assumed and (sentential) templates are inferred from them. In the present framework, sentential correspondences (in the form of possible combinations of lexical templates) are assumed and lexical equivalences are inferred from them.</Paragraph>
    <Paragraph position="3"> In a lexicalist approach, the notion of bilingual lexical entry, and thus that of bilingual template, must be understood broadly. Multiword entries can exist. They can express dependencies among lexical items, which makes them suitable for expressing phrasal equivalences. In brief, bilingual lexical entries can exhaustively cover all the bilingual information needed in transfer.</Paragraph>
    <Paragraph position="4"> In a lexicalist MT system, transfer is accomplished by finding a bag of bilingual entries partitioning the source bag. The source side of each entry (in the rest of this paper: the left hand side) corresponds to a cell of the partition. The union of the target sides of the entries constitutes the target bag. E.g.: (2) a.</Paragraph>
    <Paragraph position="5"> b.</Paragraph>
    <Paragraph position="6"> c.</Paragraph>
    <Paragraph position="7"> where each Swi::Sdi and Twi::Tdi are, respectively, a source and target &lt;Word, Description&gt; pair. In addition, the bilingual entries must satisfy the constraints expressed by indices in the source and target bags. The same information can be used to find (2b), given (2a) and (2c).</Paragraph>
    <Paragraph position="8"> Any bilingual lexicon is partitioned by a set of templates. The entries in each equivalence class only differ by their words. A bilingual lexical entry can thus be viewed as a triple &lt;Sw, Tw, T&gt;, where Sw is a list of source words, Tw a list of target words, and T a template. A set of such bilingual templates can be intuitively regarded as a 'transfer grammar'. A grammar defines all the possible sequences of pre-terminal symbols, i.e. all the possible types of sentences. Analogously, a set of bilingual templates defines all the possible translational equivalences between bags of pre-terminal symbols, i.e. all the possible equivalences between types of sentences. Using this intuition, the possibility is explored of analyzing a pair of such bags by means of a database of bilingual templates, to find a bag of templates that correctly accounts for the translational equivalence of the two bags, without resorting to any information about words. In the example (2), the following bag of templates would be the requested solution:</Paragraph>
    <Paragraph position="10"> Equivalences between (bags of) words are automatically obtained as a result of the process, whereas in translating they are assumed and used to select the appropriate bilingual entries.</Paragraph>
    <Paragraph position="11">  The whole idea is based on the assumption that a lexical item's description and the constraints on its indices are sufficient in most cases to uniquely identify a lexical item in a parse output bag. Although exceptions could be found (most notably, two modifiers of the same category modifying the same head), the idea is viable enough to be worth exploring.</Paragraph>
    <Paragraph position="12"> The impression might arise that it is difficult and impractical to have a set of templates available in advance. However, there is empirical evidence to the contrary. A count on the MT system used here showed that a restricted number of templates covers a large portion of a bilingual lexicon. Table 1 shows the incremental coverage. Although completeness is hard to obtain, a satisfactory coverage can be achieved with a relatively small number of templates.</Paragraph>
    <Paragraph position="13"> In the implementation described here, a set of templates was extracted from the MT bilingual lexicon and used to bootstrap further lexical development. The whole lexical development can be seen as an interactive process involving a bilingual lexicon and a template database. Templates are initially derived from the lexicon, and new entries are then created using the templates. Iteratively, new entries can be manually coded when the automatic procedure lacks appropriate templates, and new templates extracted from the manually coded entries can be added to the template database.</Paragraph>
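    <Paragraph> The view of a lexicon as a set of &lt;Sw, Tw, T&gt; triples partitioned by templates, and the notion of incremental coverage, can be sketched in Python as follows. This is an illustration under stated assumptions, not the count reported for the MT system: the toy lexicon contains only the five equivalences of the sample run in Section 4, with their template identifiers, and the function name incremental_coverage is hypothetical.

```python
from collections import Counter

# Toy lexicon of (source words, target words, template identifier) triples.
# Assumption: five entries only; a real bilingual lexicon is far larger.
LEXICON = [
    (("fat",), ("gordo",), "adj/adj"),
    (("black",), ("negro",), "adj/adj"),
    (("man",), ("hombre",), "cn/n"),
    (("dog",), ("perro",), "cn/n"),
    (("kick", "out"), ("echar",), "tv+adv/tv"),
]


def incremental_coverage(lexicon):
    """Return (template, cumulative fraction of entries covered) pairs.

    Templates are taken in order of decreasing frequency, so each row tells
    how much of the lexicon is covered by the templates seen so far.
    """
    counts = Counter(template for _, _, template in lexicon)
    total = len(lexicon)
    covered = 0
    rows = []
    for template, count in counts.most_common():
        covered += count
        rows.append((template, covered / total))
    return rows


coverage = incremental_coverage(LEXICON)
```

Entries that share a template fall in the same equivalence class and differ only in their words, which is why a small number of frequent templates can account for a large share of a lexicon.</Paragraph>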
  </Section>
  <Section position="6" start_page="1299" end_page="1302" type="metho">
    <SectionTitle>
4 The algorithm
</SectionTitle>
    <Paragraph position="0"> In this section the algorithm for creating bilingual lexical entries is described, along with a sample run. The procedure was implemented in Prolog, as was the MT system at hand. Basically, a set of lexical entries is obtained from a pair of sentences by first parsing the source and target sentences. The source bag is then transferred using templates as transfer rules (plus entries for closed-class words and possibly a pre-existing bilingual lexicon). The transfer output bag is then unified with the target sentence parse output bag. If the unification succeeds, the relevant information (bilingual templates and associated words) is retrieved to build up the new bilingual entries. Otherwise, the system backtracks into new parses and transfers.</Paragraph>
    <Paragraph position="1"> The main predicate make_entries/3 matches a source and a target sentence to produce a set of bilingual entries:</Paragraph>
    <Paragraph position="3"> be_info_to_entries(Be,Entries).</Paragraph>
    <Paragraph position="4"> Each Derivn variable points to a buffer where all the information about a specific derivation (parse or transfer) is stored and each Bagn variable refers to a bag of lexical items. Each step will be discussed in detail in the rest of the section. A sample run will be shown for the following English-Spanish pair of sentences: (4) a. the fat man kicked out the black dog.</Paragraph>
    <Paragraph position="5"> b. el hombre gordo echó el perro negro.</Paragraph>
    <Paragraph position="6"> In the sample session no bilingual lexicon was used for content words. Only a bilingual lexicon for closed class words and a set of bilingual templates were used. Therefore, new bilingual entries were obtained for all the content words (or phrases) in the sentences.</Paragraph>
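    <Paragraph> The control flow of the procedure (parse both sentences, transfer the source analysis, match the bags, and backtrack on failure) can be sketched in Python as below. The sketch is not a transcription of the Prolog predicate: passing the five steps in as functions, the toy stand-ins and the representation of a bag as a list of (word, category) pairs are all assumptions made to keep the example self-contained. Prolog backtracking is imitated by iterating over the alternative parses and transfers.

```python
from itertools import product


def make_entries(source, target, parse_source, parse_target,
                 transfer, match_bags, build_entries):
    """Try each combination of parses and transfers until the bags match."""
    for source_deriv, target_deriv in product(parse_source(source),
                                              parse_target(target)):
        for transfer_deriv in transfer(source_deriv):
            matched = match_bags(target_deriv, transfer_deriv)
            if matched is not None:
                return build_entries(source_deriv, matched, transfer_deriv)
    return None  # no consistent pair of analyses was found


# Toy stand-ins (hypothetical) for the five steps.
def toy_parse_source(sentence):
    # Two competing analyses of the same word.
    return [[(sentence, "v")], [(sentence, "n")]]


def toy_parse_target(sentence):
    return [[(sentence, "n")]]


def toy_transfer(source_bag):
    # One candidate target bag: category kept, target word still unknown.
    return [[(None, category) for _, category in source_bag]]


def toy_match(target_bag, transfer_bag):
    same = (sorted(c for _, c in target_bag)
            == sorted(c for _, c in transfer_bag))
    return target_bag if same else None


def toy_build(source_bag, matched_bag, transfer_bag):
    return [(s, t, c) for (s, c), (t, _) in zip(source_bag, matched_bag)]


entries = make_entries("dog", "perro", toy_parse_source, toy_parse_target,
                       toy_transfer, toy_match, toy_build)
```

In this toy run the first source analysis (a verb) fails to match the target bag, so the loop moves on to the second analysis (a noun), which succeeds.</Paragraph>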
    <Section position="1" start_page="1299" end_page="1299" type="sub_section">
      <SectionTitle>
4.1 Source sentence parse
</SectionTitle>
      <Paragraph position="0"> The parse of the source sentence is performed by parse_source/2. The parse tree is shown in Fig. 1. For the present purposes, only pre-terminal nodes in the tree are labeled.</Paragraph>
      <Paragraph position="1"> Fig. 2 shows the relevant information from the source bag, i.e. the bag resulting from parsing the source sentence. All the syntactic and semantic information has been omitted and replaced by a category label. What is relevant here is the way the indices are set, as a result of parsing. The words {the,fat,man} are tied together and so are {kick,out} and {the,black,dog}. Moreover, the indices of 'kick' show that its second index is tied to its subject, {the,fat,man}, and its third index is tied to its object, {the,black,dog}.</Paragraph>
    </Section>
    <Section position="2" start_page="1299" end_page="1299" type="sub_section">
      <SectionTitle>
4.2 Target sentence parse
</SectionTitle>
      <Paragraph position="0"> The parse of the target sentence is performed by parse_target/2. Fig. 3 and 4 show, respectively, the resulting tree and bag. In an analogous manner to what is seen in the source sentence, {el,hombre,gordo} and {el,perro,negro} are, respectively, the subject and the object of 'echó'.</Paragraph>
    </Section>
    <Section position="3" start_page="1299" end_page="1302" type="sub_section">
      <SectionTitle>
4.3 Transfer
</SectionTitle>
      <Paragraph position="0"> The result of parsing the source sentence is used by transfer/2 to create a translationally equivalent target bag. Fig. 5 shows the result. Transfer is performed by consulting a bilingual lexicon, which, in the present case, contained entries for closed class words (e.g. an entry mapping 'the' to 'el') and templates for content words. The templates relevant to our example are the following:</Paragraph>
      <Paragraph position="2"> Bilingual templates are simply bilingual entries with words replaced by variables. Actually, on the target side, words are replaced by labels of the form word(Ti,Position), where Ti is a template identifier and Position identifies the position of the item in the right hand side of the template. Thus, a label word(adj/adj, 1) identifies the first word on the right hand side of the template that maps an adjective to an adjective.</Paragraph>
      <Paragraph position="3"> Such labels are just implementational technicalities that facilitate the retrieval of the relevant information when a lexical entry is built up from a template, but they have no role in the matching procedure. For the present purposes they can be regarded simply as anonymous variables that can unify with anything, exactly like their source counterparts.</Paragraph>
      <Paragraph position="4"> After transfer, the instances of the templates used in the process are coindexed in some way, by virtue of their unification with the source bag items. This is analogous to what happens with bilingual entries in the translation process.</Paragraph>
    </Section>
    <Section position="4" start_page="1302" end_page="1302" type="sub_section">
      <SectionTitle>
4.4 Target bag matching
</SectionTitle>
      <Paragraph position="0"> The predicate get_bag/2 retrieves a bag of lexical items associated with a derivation. Therefore, Bag2 and Bag3 will contain the bags of lexical items resulting, respectively, from parsing the target sentence and from transfer.</Paragraph>
      <Paragraph position="1"> The crucial step is the matching between the transfer output bag and the target sentence parse output bag. The predicate match_bags/3 tries to unify the two bags (returning the result in Bag4). A successful unification entails that the parse and transfer of the source sentence are consistent with the parse of the target sentence. In other words, the bilingual rules used in transfer correctly map source lexical items into target lexical items. Therefore, the lexical equivalences newly established through this process can be asserted as new bilingual entries.</Paragraph>
      <Paragraph position="2"> In the matching process, the order in which the elements are listed in the figures is irrelevant, since the objects at hand are bags, i.e.</Paragraph>
      <Paragraph position="3"> unordered collections. A successful match only requires the existence of a one-to-one mapping between the two bags, such that: (i) the respective descriptions, here represented by category labels, are unifiable; (ii) a further one-to-one mapping between the indices in the two bags is induced.</Paragraph>
      <Paragraph position="4"> The following mapping between the transfer output bag (Fig. 5) and the target sentence parse output bag (Fig. 4) will therefore succeed: {&lt;2-1,1&gt;,&lt;3-2,3&gt;,&lt;4-3,2&gt;,&lt;1-4,4&gt;, &lt;5-6,5&gt;,&lt;6-7,7&gt;,&lt;7-8,6&gt;}. In fact, in addition to correctly unifying the descriptions, it induces the following one-to-one mapping between the two sets of indices: {&lt;A,0&gt;,&lt;B,1&gt;,&lt;I,13&gt;}</Paragraph>
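      <Paragraph> The two matching conditions can be made concrete with a small backtracking search in Python. This is a sketch under simplifying assumptions, not the Prolog predicate: each item is reduced to a (category label, indices) pair, unification of descriptions is approximated by equality of category labels, and the index names are hypothetical. The function returns pairs of positions, counted from 0, or None when no one-to-one match exists.

```python
def match_bags(transfer_bag, target_bag):
    """Pair the items of two bags one-to-one, or return None.

    A pairing is accepted only if the category labels agree and the paired
    indices form a one-to-one mapping across the whole bag.
    """
    if len(transfer_bag) != len(target_bag):
        return None

    def extend(position, used, forward, backward):
        if position == len(transfer_bag):
            return []
        category, indices = transfer_bag[position]
        for k, (other_category, other_indices) in enumerate(target_bag):
            if (k in used or category != other_category
                    or len(indices) != len(other_indices)):
                continue
            new_forward, new_backward = dict(forward), dict(backward)
            consistent = True
            for a, b in zip(indices, other_indices):
                # setdefault records an unseen pair and returns the stored partner.
                if (new_forward.setdefault(a, b) != b
                        or new_backward.setdefault(b, a) != a):
                    consistent = False
                    break
            if consistent:
                rest = extend(position + 1, used | {k},
                              new_forward, new_backward)
                if rest is not None:
                    return [(position, k)] + rest
        return None  # dead end: the caller tries its next candidate

    return extend(0, frozenset(), {}, {})


# Transfer output for "the fat man kick+out the black dog" (hypothetical indices).
transfer_bag = [("det", ("A",)), ("adj", ("A",)), ("n", ("A",)),
                ("tv", ("E", "A", "B")),
                ("det", ("B",)), ("adj", ("B",)), ("n", ("B",))]
# Parse output for "el hombre gordo echó el perro negro" (hypothetical indices).
target_bag = [("det", (0,)), ("n", (0,)), ("adj", (0,)),
              ("tv", (13, 0, 1)),
              ("det", (1,)), ("n", (1,)), ("adj", (1,))]

pairing = match_bags(transfer_bag, target_bag)
```

Because the index mapping is threaded through the search, an adjective of the subject cannot be paired with an adjective of the object, even though their category labels agree.</Paragraph>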
    </Section>
    <Section position="5" start_page="1302" end_page="1302" type="sub_section">
      <SectionTitle>
4.5 Bilingual entries creation
</SectionTitle>
      <Paragraph position="0"> The rest of the procedure builds up lexical entries for the newly discovered equivalences and is implementation-dependent. First, the source bag is retrieved in Bag1. Then, make_be_info/4 links together information from the source bag, the target bag (actually, its unification with the target sentence parse bag) and the transfer derivation, to construct a list of terms (the variable Be) containing the information to create an entry. Each such term has the form be(Sw,Tw,Ti), where Sw is a list of source words, Tw is a list of target words and Ti is a template identifier. In our example, the following be/3 terms are created:  (6) a. be([fat],[gordo],adj/adj) b. be([man],[hombre],cn/n) c. be([kick,out],[echar],tv+adv/tv) d. be([black],[negro],adj/adj) e. be([dog],[perro],cn/n)  \\trans_noun(M,K).</Paragraph>
      <Paragraph position="1"> If a pre-existing bilingual lexicon is in use, bilingual entries are prioritized over bilingual templates. Consequently, only new entries are created, the others being retrieved from the existing bilingual lexicon. Incidentally, it should be noted that a new entry is one that differs from every existing entry on at least one side. Therefore, different entries are created for different senses of the same word, as long as the different senses have different translations.</Paragraph>
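      <Paragraph> The criterion for a new entry can be sketched in Python as a filter over be/3-style triples. This is an illustration only: in the system described here, existing entries take priority during transfer itself, whereas the hypothetical function new_entries below applies the criterion after the fact. The contents of EXISTING, including the word 'grueso', are invented for the example.

```python
def new_entries(be_terms, existing_lexicon):
    """Keep the triples whose source and target sides are not both already known.

    An entry counts as known only if some existing entry has the same source
    side and the same target side.
    """
    known = {(tuple(sw), tuple(tw)) for sw, tw, _ in existing_lexicon}
    fresh = []
    for sw, tw, template in be_terms:
        key = (tuple(sw), tuple(tw))
        if key not in known:
            known.add(key)  # avoid creating the same entry twice
            fresh.append((sw, tw, template))
    return fresh


# The five be/3 terms of the sample run, as (Sw, Tw, Ti) triples.
BE_TERMS = [
    (["fat"], ["gordo"], "adj/adj"),
    (["man"], ["hombre"], "cn/n"),
    (["kick", "out"], ["echar"], "tv+adv/tv"),
    (["black"], ["negro"], "adj/adj"),
    (["dog"], ["perro"], "cn/n"),
]
# Hypothetical pre-existing lexicon: one entry identical to a be/3 term and
# one that shares only its source side with a be/3 term.
EXISTING = [
    (["dog"], ["perro"], "cn/n"),
    (["fat"], ["grueso"], "adj/adj"),
]

created = new_entries(BE_TERMS, EXISTING)
```

The pair 'fat' and 'gordo' is kept although 'fat' already has an entry, because the target side differs. The pair 'dog' and 'perro' is dropped because both sides are already in the lexicon.</Paragraph>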
    </Section>
  </Section>
</Paper>