<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1601">
  <Title>A SYNCHRONIZATION STRUCTURE OF SSTC AND ITS APPLICATIONS IN MACHINE TRANSLATION</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
A SYNCHRONIZATION STRUCTURE OF SSTC AND ITS
APPLICATIONS IN MACHINE TRANSLATION
MOSLEH H. AL-ADHAILEH
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TANG ENYA KONG
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ZAHARIN YUSOFF
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> In this paper, a flexible annotation schema called Structured String-Tree Correspondence (SSTC) is introduced. In order to describe the correspondence between different languages, we propose a variant of SSTC called synchronous SSTC (S-SSTC). We also describe how S-SSTC provides the flexibility to treat some of the non-standard cases that are problematic for other synchronous formalisms. The proposed S-SSTC schema is well suited to describing the correspondence between different languages, in particular relating a language to its translation in another language (i.e. in Machine Translation). It can also be used as an annotation for translation systems that automatically extract transfer mappings (rules or examples) from bilingual corpora. The S-SSTC is very well suited for the construction of a Bilingual Knowledge Bank (BKB), where the examples are kept in the form of S-SSTCs.</Paragraph>
    <Paragraph position="1"> KEYWORDS: parallel text, Structured String-Tree Correspondence (SSTC), Synchronous SSTC, Bilingual Knowledge Bank (BKB), Tree Bank Annotation Schema.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> There is now a consensus about the fact that natural language should be described as correspondences between different levels of representation. Much of theoretical linguistics can be formulated in a very natural manner as stating correspondences (translations) between layers of representation structures (Rambow &amp; Satta, 1996).</Paragraph>
    <Paragraph position="1"> In this paper, a flexible annotation schema called Structured String-Tree Correspondence (SSTC) (Boitet &amp; Zaharin, 1988) is introduced to capture a natural language text, its corresponding abstract linguistic representation and the mapping (correspondence) between these two. The correspondence between the string and its associated representation tree structure is defined in terms of the sub-correspondence between parts of the string (substrings) and parts of the tree structure (subtrees), which can be interpreted for both analysis and generation. Such correspondence is defined in a way that is able to handle some non-standard cases (e.g. non-projective correspondence).</Paragraph>
    <Paragraph position="3"> As synchronous systems become more and more popular, there is a great need for formal models that relate different levels of representation structures. Existing synchronous systems have difficulty handling, in a computationally attractive way, some non-standard phenomena that exist between natural languages (NLs). Therefore there is a need for a flexible annotation schema that provides additional power and flexibility in expressing the desired structural correspondences between languages (representation structures).</Paragraph>
    <Paragraph position="4"> Many problems in Machine Translation (MT), in particular transfer-rules extraction, EBMT, etc., can be expressed via correspondences. We will define a variant of SSTC called synchronous SSTC (S-SSTC).</Paragraph>
    <Paragraph position="5"> S-SSTC consists of two SSTCs that are related by a synchronization relation. The use of S-SSTC is motivated by the desire to describe not only the correspondence between the text and its representation structure for each language (i.e. SSTC) but also the correspondence between two languages (synchronous correspondence), for instance between a language and its translation in another language in the case of MT. The S-SSTC will be used to relate expressions of a natural language to their associated translations in another language. The interface between the two languages is made precise via a synchronization relation between two SSTCs, which is totally non-directional.</Paragraph>
    <Paragraph position="6"> In this paper, we present the proposed S-SSTC, a schema well suited to describing the correspondence between two languages. The synchronous SSTC is flexible and able to handle the non-standard correspondence cases that exist between different languages. It can also be used to facilitate automatic extraction of transfer mappings (rules or examples) from bilingual corpora.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2. STRUCTURED STRING-TREE
CORRESPONDENCE (SSTC)
</SectionTitle>
    <Paragraph position="0"> From the Meaning-Text Theory (MTT) point of view, Natural Language (NL) is considered as a correspondence between meanings and texts (Kahane, 2001). The MTT point of view, even if it has been introduced in different formulations, is more or less accepted by the whole linguistic community.</Paragraph>
    <Paragraph position="1"> In this section, we stress the fact that in order to describe Natural Language (NL) in a natural manner, three distinct components need to be expressed by the linguistic formalisms; namely, the text, its corresponding abstract linguistic representation and the mapping (correspondence) between these two.</Paragraph>
    <Paragraph position="2"> Actually, NL is not only a correspondence between different representation levels, as stressed by MTT postulates, but also a sub-correspondence between them. For instance, between the string in a language and its representation tree structure, it is important to specify the sub-correspondences between parts of the string (substrings) and parts of the tree structure (subtrees), which can be interpreted for both analysis and generation in NLP. It is well known that many linguistic constructions are not projective (e.g. scrambling, cross serial dependencies, etc.). Hence, it is highly desirable to define the correspondence in a way that can handle the non-standard cases (e.g. non-projective correspondence), see Figure 1.</Paragraph>
    <Paragraph position="4"> Towards this aim, a flexible annotation structure called Structured String-Tree Correspondence (SSTC) was introduced in Boitet &amp; Zaharin (1988) to record the string of terms, its associated representation structure and the mapping between the two, which is expressed by the sub-correspondences recorded as part of an SSTC.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
2.1 The SSTC Annotation Structure
</SectionTitle>
      <Paragraph position="0"> The SSTC is a general structure that can associate an arbitrary tree structure with a string in a language, as desired by the annotator, to be the interpretation structure of the string. More importantly, it provides the facility to specify the correspondence between the string and the associated tree, which can be non-projective (Boitet &amp; Zaharin, 1988). These features are very much desired in the design of an annotation scheme, in particular for the treatment of non-standard linguistic phenomena, e.g. crossed dependencies. Definitions: - An SSTC is a general structure consisting of a string in a language, an arbitrary tree structure associated with it (i.e. its interpretation structure), and the correspondence between the string and its associated tree, which can be non-projective; i.e. an SSTC is a triple (st, tr, co), where st is a string in one language, tr is its associated representation tree structure and co is the correspondence between st and tr.</Paragraph>
      <Paragraph position="1"> - The correspondence co between a string and its representation tree is made of two interrelated correspondences: a) Between nodes and substrings (possibly discontinuous).</Paragraph>
      <Paragraph position="2"> b) Between (possibly incomplete) subtrees and (possibly discontinuous) substrings.</Paragraph>
      <Paragraph position="3"> - The correspondence can be encoded on the tree by attaching to each node N in the representation tree two sequences of INTERVALS called SNODE(N) and STREE(N).</Paragraph>
      <Paragraph position="4"> - SNODE(N): An interval of the substring in the string that corresponds to the node N in the tree.</Paragraph>
      <Paragraph position="5"> STREE(N): An interval of the substring in the string that corresponds to the subtree having the node N as root.</Paragraph>
      <Paragraph position="6"> Figure 2 illustrates the sentence &amp;quot;John picks the box up&amp;quot; with its corresponding SSTC. It contains a non-projective correspondence. An interval is assigned to each word in the sentence, i.e. (0-1) for &amp;quot;John&amp;quot;, (1-2) for &amp;quot;picks&amp;quot;, (2-3) for &amp;quot;the&amp;quot;, (3-4) for &amp;quot;box&amp;quot; and (4-5) for &amp;quot;up&amp;quot;. A substring in the sentence that corresponds to a node in the representation tree is denoted by assigning the interval of the substring to the SNODE of the node, e.g. the node &amp;quot;picks up&amp;quot; with SNODE intervals (1-2+4-5) corresponds to the words &amp;quot;picks&amp;quot; and &amp;quot;up&amp;quot; in the string with the same intervals.</Paragraph>
      <Paragraph position="7"> The correspondence between subtrees and substrings is denoted by the interval assigned to the STREE of each node, e.g. the subtree rooted at node &amp;quot;picks up&amp;quot; with STREE interval (0-5) corresponds to the whole sentence &amp;quot;John picks the box up&amp;quot;. The definitions in this section are based on the discussion in Tang (1994) and Boitet &amp; Zaharin (1988).</Paragraph>
      <Paragraph position="8"> The case depicted in Figure 2 shows how the SSTC structure treats some non-standard linguistic phenomena. The particle &amp;quot;up&amp;quot; is featurised into the verb &amp;quot;pick&amp;quot; in a discontinuous manner (e.g. &amp;quot;up&amp;quot; (4-5) in &amp;quot;pick-up&amp;quot; (1-2+4-5)) in the sentence &amp;quot;John picks the box up&amp;quot;. For more details on the properties of SSTC, see Boitet &amp; Zaharin (1988).</Paragraph>
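      <Paragraph> The encoding described in this section can be sketched in code. The following Python fragment is only a minimal illustration under stated assumptions: the names Node, snode, stree and words are hypothetical, and the tree shape and the intervals of the nodes other than "picks up" are assumed for the example; only the SNODE (1-2+4-5) and STREE (0-5) of "picks up" are given in the text.

```python
from dataclasses import dataclass, field

# An interval (i, j) covers the words between positions i and j,
# e.g. (1, 2) is the second word. A list of intervals models a
# possibly discontinuous substring such as (1-2+4-5).
Intervals = list[tuple[int, int]]


@dataclass
class Node:
    label: str
    snode: Intervals   # substring that corresponds to the node itself
    stree: Intervals   # substring that corresponds to the subtree rooted here
    children: list["Node"] = field(default_factory=list)


def words(sentence: list[str], intervals: Intervals) -> list[str]:
    """Return the words of `sentence` covered by `intervals`."""
    out = []
    for start, end in intervals:
        out.extend(sentence[start:end])
    return out


sentence = "John picks the box up".split()

# Assumed dependency-style tree; only the intervals of "picks up"
# are stated in the text, the rest are illustrative.
tree = Node("picks up", [(1, 2), (4, 5)], [(0, 5)], [
    Node("John", [(0, 1)], [(0, 1)]),
    Node("box", [(3, 4)], [(2, 4)], [
        Node("the", [(2, 3)], [(2, 3)]),
    ]),
])

print(words(sentence, tree.snode))   # ['picks', 'up']
print(words(sentence, tree.stree))   # ['John', 'picks', 'the', 'box', 'up']
```

Storing SNODE and STREE as lists of intervals, rather than as a single span, is what allows the discontinuous node "picks up" to be recorded directly.</Paragraph>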
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3. SYNCHRONOUS SSTC STRUCTURE
</SectionTitle>
    <Paragraph position="0"> Much of theoretical linguistics can be formulated in a very natural manner as stating correspondences (translations) between layers of representation structures (Rambow &amp; Satta, 1996), such as the relation between syntax and semantics. An analogous problem is to express the correspondence between a language and its translations in other languages. The synchronization of two adequate linguistic formalisms therefore seems to be an appropriate representation for this purpose.</Paragraph>
    <Paragraph position="1"> The idea of parallelized formalisms is a widely used one, and one which has been applied in many different ways. The use of synchronous formalisms is motivated by the desire to describe two languages that are closely related to each other but that do not have the same structures. For example, synchronous Tree Adjoining Grammar (S-TAG) can be used to relate TAGs for two different languages, e.g. for the purpose of immediate structural translation in machine translation (Abeille et al., 1990), (Harbusch &amp; Poller, 1996), or to relate a syntactic TAG and a semantic one for the same language (Shieber &amp; Schabes, 1990). S-TAG is a variant of Tree Adjoining Grammar (TAG) introduced by Shieber &amp; Schabes (1990) to characterize correspondences between tree adjoining languages. Considering the original definition of S-TAGs, one can see that it does not restrict the structures that can be produced in the source and target languages. It allows the construction of a non-tree-adjoining language (non-TAL) (Shieber, 1994), (Harbusch &amp; Poller, 2000). As a result, Shieber (1994) proposes a restricted definition of S-TAG, namely the IS-TAG (isomorphic S-TAG). In this case only a TAL can be formed in each component. This isomorphism requirement is formally attractive, but somewhat too strict for practical applications. Moreover, there are well-known contrastive translation phenomena between languages that cannot be expressed by IS-TAG; Figure 3 illustrates some examples (Shieber, 1994).</Paragraph>
    <Paragraph position="2"> Similar limitations also appear in synchronous CFGs (Harbusch &amp; Poller, 1994).</Paragraph>
    <Paragraph position="3"> Due to these limitations, instead of investigating into the synchronization of two grammars, we propose a flexible annotation schema (i.e. Synchronous</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Structured String-Tree Correspondence (S-SSTC)) to
</SectionTitle>
      <Paragraph position="0"> realize additional power and flexibility in expressing structural correspondences at the level of language sentence pairs. For example, such a schema can serve as a means to represent translation examples, or to find structural correspondences for the purpose of transfer grammar learning (Menezes &amp; Richardson, 2001), (Aramaki et al., 2001), (Watanabe et al., 2000), (Meyers et al., 2000), (Matsumoto et al., 1993), (Kaji et al., 1992), and example-based machine translation</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="2" end_page="3" type="metho">
    <SectionTitle>
EBMT
</SectionTitle>
    <Paragraph position="0"> (Sato &amp; Nagao, 1990), (Sato, 1991), (Richardson et al., 2001), (Al-Adhaileh &amp; Tang, 1999).</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 The Synchronous SSTC
</SectionTitle>
      <Paragraph position="0"> In this section, we discuss the definition and the formal properties of S-SSTC. An S-SSTC consists of a pair of SSTCs with an additional synchronization relation between them. The use of S-SSTC is motivated by the desire to describe not only the correspondence between the text and its representation structure in one language (i.e. SSTC) but also the correspondence between two languages (synchronous correspondence).</Paragraph>
      <Paragraph position="1"> Definitions: - Let each of S and T be an SSTC, which consists of a triple (st, tr, co), where st is a string in one language, tr is its associated representation tree structure and co is the correspondence between st and tr, as defined in Section 2.1.</Paragraph>
      <Paragraph position="2"> - The synchronization relation between S and T is a set of links defining the synchronization correspondence between S and T at different internal levels of the two SSTC structures. Each link defines a synchronous correspondence between nodes of tr in S and nodes of tr in T, and is of one of two types, sn or st.</Paragraph>
      <Paragraph position="4"> - An sn link records a synchronous correspondence at the level of nodes in S and T (i.e. a lexical correspondence between specified nodes), and is normally a pair of SNODE interval/s, one from S and one from T.</Paragraph>
      <Paragraph position="6"> - An st link records a synchronous correspondence at the level of subtrees in S and T (i.e. a structural correspondence between subtrees), and is normally a pair of STREE interval/s, one from S and one from T. On either side, the STREE interval/s may be given as a difference of STREE intervals, which corresponds to an incomplete subtree.</Paragraph>
      <Paragraph position="8"> For a comprehensive overview of EBMT, see Somers (1999).</Paragraph>
      <Paragraph position="9"> - The synchronous correspondence between terminal</Paragraph>
      <Paragraph position="11"> Note: The synchronous correspondences can be between SSTCs that contain non-standard phenomena, i.e. featurisation and discontinuity (crossed dependency). In these cases the synchronous correspondence is straightforward (following the above definitions); e.g. see Figure 4 and Figure 6.</Paragraph>
      <Paragraph position="13"> The S-SSTC will be used to relate expressions of a natural language to their associated translations in another language. For convenience, we will call the two languages source and target languages, although S-SSTC is non-directional. S-SSTC is defined to make such a relation explicit. Figure 4 depicts an S-SSTC for the English source sentence &amp;quot;John picks the heavy box up&amp;quot; and its translation in the Malay target sentence &amp;quot;John kutip kotak berat itu&amp;quot;. The gray arrows indicate the correspondence between the string and its representation tree within each of the SSTCs, and the dotted gray arrows indicate the relations of synchronization (i.e. synchronous correspondence) between linguistic units of the source SSTC and the target SSTC.</Paragraph>
      <Paragraph position="14"> Based on the notation used in S-SSTC, Figure 4 illustrates the S-SSTC for the English sentence &amp;quot;John picks the heavy box up&amp;quot; and its translation in the Malay language &amp;quot;John kutip kotak berat itu&amp;quot;, with the synchronous correspondence between them. The synchronous correspondence is denoted in terms of SNODE pairs for sn, each pairing SNODE interval/s from the source SSTC with SNODE interval/s from the target SSTC. As for st, it is denoted in terms of STREE pairs, each pairing STREE interval/s from the source SSTC with STREE interval/s from the target SSTC. For instance, as depicted in Figure 5, the fact that &amp;quot;picks up&amp;quot; in the source corresponds to &amp;quot;kutip&amp;quot; in the target is expressed by a pair in the sn synchronous correspondence, whereas the fact that &amp;quot;John picks the heavy box up&amp;quot; corresponds to &amp;quot;John kutip kotak berat itu&amp;quot; is expressed by a pair in the st synchronous correspondence. Similarly, the fact that &amp;quot;box&amp;quot; in the source corresponds to &amp;quot;kotak&amp;quot; in the target is expressed by the pair (4-5, 2-3) in the sn synchronous correspondence, whereas the fact that the phrase &amp;quot;the heavy box&amp;quot; corresponds to the phrase &amp;quot;kotak berat itu&amp;quot; in the target is expressed by a pair in the st synchronous correspondence.</Paragraph>
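      <Paragraph> As a rough illustration of how such synchronous correspondences might be stored, the following Python sketch records sn and st links as pairs of interval sequences. The names Link, span, en and ms are hypothetical. Only the pair (4-5, 2-3) for "box" and "kotak" is stated in the text; the other intervals are derived here from the word positions of the two sentences and should be read as assumptions.

```python
from dataclasses import dataclass

# A sequence of (start, end) word intervals, possibly discontinuous.
Intervals = tuple[tuple[int, int], ...]


@dataclass(frozen=True)
class Link:
    kind: str           # "sn" (node level) or "st" (subtree level)
    source: Intervals   # SNODE or STREE interval/s in the source SSTC
    target: Intervals   # SNODE or STREE interval/s in the target SSTC


def span(words, intervals):
    """Join the words covered by the given intervals."""
    return " ".join(w for i, j in intervals for w in words[i:j])


en = "John picks the heavy box up".split()
ms = "John kutip kotak berat itu".split()

links = [
    Link("sn", ((1, 2), (5, 6)), ((1, 2),)),   # picks up / kutip (many-to-one)
    Link("sn", ((4, 5),), ((2, 3),)),          # box / kotak, as given in the text
    Link("st", ((2, 5),), ((2, 5),)),          # the heavy box / kotak berat itu
    Link("st", ((0, 6),), ((0, 5),)),          # the two whole sentences
]

for link in links:
    print(link.kind, "|", span(en, link.source), "|", span(ms, link.target))
```

Because each link is just a pair, it can be read in either direction, which matches the non-directional nature of the synchronization relation.</Paragraph>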
    </Section>
  </Section>
  <Section position="9" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4. HANDLING NON-STANDARD CASES
WITH S-SSTC
</SectionTitle>
    <Paragraph position="0"> As mentioned earlier, some non-standard phenomena exist between different languages that pose challenges for synchronized formalisms. In this section, we describe some example cases, which are drawn from the problem of using synchronous formalisms to define translations between languages (e.g. the cases in Shieber (1994)). Due to lack of space, we only briefly describe some of these non-standard cases without going into the details.</Paragraph>
    <Paragraph position="1"> Figure 4 illustrates a case where the English sentence has non-standard cases of featurisation, crossed dependency and a many-to-one synchronous correspondence in &amp;quot;picks up&amp;quot;. Another case is the reordering of words in phrases, which is clear in the phrase &amp;quot;the heavy box&amp;quot; and its translation &amp;quot;kotak berat itu&amp;quot;. Two further cases arise between English and French. The first is a many-to-one correspondence, where a word (single node) in one language corresponds to a phrase (subtree) in the other; namely, the adverbial &amp;quot;hopefully&amp;quot; is translated into the French phrase &amp;quot;On espère que&amp;quot;. The second is a case of argument swap (reordering of subtrees) in the English &amp;quot;Kim misses Dale&amp;quot; and its corresponding translation &amp;quot;Dale manque à Kim&amp;quot; in French.</Paragraph>
    <Paragraph position="2"> Figure 6 describes the cases of clitic climbing in French and the non-projective correspondence (i.e. crossed dependency). It shows the flexibility of SSTC and the proposed S-SSTC in handling such popular  same, but they exhibit different structures. Nodes participating in the domination relationship in one SSTC may be mapped to nodes neither of which dominates the other (i.e. elimination of dominance). Another even more extreme relationship between the synchronized pair involving inverted correspondences is exemplified in Figure 8.</Paragraph>
    <Paragraph position="4">  the first SSTC has/have a synchronous correspondence with partial subtree/s in the second SSTC. The German word &amp;quot;beschenkte&amp;quot; corresponds to the English phrase &amp;quot;give present&amp;quot;, which is a partial subtree of the tree rooted at the word &amp;quot;give&amp;quot; in the English SSTC. This synchronous correspondence is recorded under the st synchronous correspondence, where the minus operation (-) is used to calculate the Y:STREE interval/s for the partial subtree/s.</Paragraph>
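    <Paragraph> The minus operation on STREE intervals can be pictured as a set difference over word positions. The sketch below is one possible reading, not the definition used by the authors: the function name minus is hypothetical, and the intervals in the example are invented purely for illustration.

```python
def minus(whole, removed):
    """Subtract interval lists: keep positions of `whole` not in `removed`."""
    keep = sorted(
        {p for s, e in whole for p in range(s, e)}
        - {p for s, e in removed for p in range(s, e)}
    )
    result = []
    for p in keep:
        if result and result[-1][1] == p:
            result[-1] = (result[-1][0], p + 1)   # extend the current run
        else:
            result.append((p, p + 1))             # start a new run
    return result


# Hypothetical intervals: a subtree spanning (0-5) with the subtrees
# at (0-1) and (2-3) left out, which yields a partial subtree.
print(minus([(0, 5)], [(0, 1), (2, 3)]))   # [(1, 2), (3, 5)]
```

The result is again a sequence of intervals, so a partial subtree can be recorded in an st pair in the same way as a complete one.</Paragraph>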
  </Section>
  <Section position="10" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5. SYNCHRONOUS CORRESPONDENCE
CONSTRAINTS BETWEEN
NATURAL LANGUAGES (NLs)
</SectionTitle>
    <Paragraph position="0"> As mentioned in Section 2, in the SSTC the correspondences between the surface text and the associated representation tree structure are ensured by means of intervals, i.e. (X:SNODE, Y:STREE). This explicitly indicates which word/s of the text correspond/s to which node in the tree. For describing an NL using SSTC, a set of constraints was defined to govern such correspondences (Lepage, 1994): - X:SNODE and Y:STREE intervals are governed by the following constraints: i) Global correspondence: an entire tree corresponds to an entire sentence.</Paragraph>
    <Paragraph position="1"> ii) Inclusion: a subtree which is part of another subtree T, must correspond to a substring in the substring corresponding to T.</Paragraph>
    <Paragraph position="2"> iii) Membership: a node in a subtree T must correspond to a word that is a member of the substring corresponding to T.</Paragraph>
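    <Paragraph> The three constraints above lend themselves to a mechanical check. The following Python sketch is a simplified illustration, not the procedure of Lepage (1994): the function names positions, check and leaf and the dict-based tree layout are hypothetical, membership is tested only against the subtree rooted at the node itself (inclusion then extends it to enclosing subtrees), and the example tree for "John picks the box up" uses an assumed shape.

```python
def positions(intervals):
    """Expand intervals such as [(1, 2), (4, 5)] into word positions {1, 4}."""
    return {p for start, end in intervals for p in range(start, end)}


def check(node, sentence_length, is_root=True):
    """Return a list of constraint violations for the subtree at `node`.

    `node` is a dict with the keys "label", "snode", "stree", "children".
    """
    errors = []
    stree = positions(node["stree"])
    # i) Global correspondence: the entire tree covers the entire sentence.
    if is_root and stree != set(range(sentence_length)):
        errors.append(f"global: {node['label']}")
    # iii) Membership: the words of the node lie inside its subtree's substring.
    if not positions(node["snode"]).issubset(stree):
        errors.append(f"membership: {node['label']}")
    for child in node["children"]:
        # ii) Inclusion: a child subtree's substring lies inside the parent's.
        if not positions(child["stree"]).issubset(stree):
            errors.append(f"inclusion: {child['label']}")
        errors.extend(check(child, sentence_length, is_root=False))
    return errors


def leaf(label, i):
    """Build a one-word node covering the interval (i, i + 1)."""
    return {"label": label, "snode": [(i, i + 1)],
            "stree": [(i, i + 1)], "children": []}


# Assumed tree shape for "John picks the box up".
tree = {
    "label": "picks up", "snode": [(1, 2), (4, 5)], "stree": [(0, 5)],
    "children": [
        leaf("John", 0),
        {"label": "box", "snode": [(3, 4)], "stree": [(2, 4)],
         "children": [leaf("the", 2)]},
    ],
}

print(check(tree, 5))   # []
```

Working with sets of word positions keeps the check valid for discontinuous intervals such as (1-2+4-5).</Paragraph>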
    <Paragraph position="3"> In a similar manner, in order to describe the synchronous correspondences between NLs using S-SSTC, we define a set of constraints to govern the synchronous correspondences between the different NLs. These constraints are used to make the synchronous correspondences explicit in a natural manner.</Paragraph>
    <Paragraph position="5"> INT(String) in the second SSTC. This means that the whole tree in the first SSTC corresponds to the whole tree in the second SSTC, and the whole string in the first SSTC corresponds to the whole string in the second SSTC).</Paragraph>
    <Paragraph position="6"> Note that these constraints can be used to license only the linguistically meaningful synchronous correspondences between the two SSTCs of the S-SSTC (i.e. between the two languages). This is useful, for instance, when building translation units in EBMT approaches (Richardson et al., 2001), (Aramaki et al., 2001), (Al-Adhaileh &amp; Tang, 1999), (Sato &amp; Nagao, 1990), (Sato, 1991), (Sadler &amp; Vendelmans, 1990), etc., where S-SSTC can be used to represent the entries of the BKB, or when S-SSTC is used as an annotation schema to find the translation correspondences (lexical and structural correspondences) for transfer-rule extraction from a parallel parsed corpus (Menezes &amp; Richardson, 2001), (Aramaki et al., 2001), (Watanabe et al., 2000), (Meyers et al., 2000), (Matsumoto et al., 1993) and (Kaji et al., 1992). Note that the grammar alignment rules used in (Menezes &amp; Richardson, 2001) can be reformulated using these constraints to construct the transfer mappings from a synchronous source-target example.</Paragraph>
    <Paragraph position="7"> Figure 10 shows an example from Menezes and Richardson (2001), the logical form for the Spanish-English pair: (&amp;quot;En Informacion del hipervinculo, haga clic en la direccion del hipervinculo&amp;quot;, &amp;quot;Under Hyperlink Information, click the hyperlink address&amp;quot;). The development of machine translation systems now requires a substantial amount of translation knowledge, typically embodied in bilingual corpora. An instance is the development of translation systems based on transfer mappings (rules or examples) that are automatically extracted from these bilingual corpora. All these systems typically first obtain tree structures (normally a predicate-argument or a dependency structure) for both the source and target sentences. From the resulting structures, lexical and structural correspondences between the two structures are extracted, which are then presented as a set of examples in a bilingual knowledge bank (BKB) or as transfer rules for the translation process.</Paragraph>
    <Paragraph position="8"> However, what has so far been lacking is a schema or a framework to annotate and express such extracted lexical and structural correspondences in a flexible and powerful manner. The proposed S-SSTC annotation schema can fulfill this need, and it is flexible enough to handle the different types of relations that may hold between the structures of different languages. S-SSTC is very well suited for the construction of a BKB, which is needed for EBMT applications. Al-Adhaileh and Tang (2001) presented an approach for constructing a BKB based on the S-SSTC.</Paragraph>
    <Paragraph position="9"> In S-SSTC, the synchronous correspondence is defined in a way that ensures a flexible representation for both lexical and structural correspondences: i) Node-to-node correspondence (lexical correspondence), which is recorded in terms of a pair of intervals (Xs, Xt), where Xs and Xt are the SNODE interval/s for the source and the target SSTC respectively. ii) Subtree-to-subtree correspondence (structural correspondence), which is very much needed for relating the two different languages at a level higher than the lexical level, i.e. the level of phrases. It is recorded in terms of a pair of intervals (Ys, Yt), where Ys and Yt are the STREE interval/s for the source and the target SSTC respectively.</Paragraph>
    <Paragraph position="14"> Furthermore, the SSTC structure can easily be extended to keep multiple levels of linguistic information, if they are considered important to enhance the performance of the machine translation system (i.e. feature transfer). For instance, each node representing a word in the annotated tree structure can be tagged with its part of speech (POS), semantic features and morphological features.</Paragraph>
  </Section>
</Paper>