<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1094">
  <Title>Translating into Free Word Order Languages</Title>
  <Section position="4" start_page="0" end_page="556" type="metho">
    <SectionTitle>
2 Information Structure
</SectionTitle>
    <Paragraph position="0"> \]n the Information Structure (IS) that I use for Turkish, a sentence is first divided into a topic and a comment. The topic is the maiu element that the sentence is about, and the comment is the information conveyed about this toI)ic. Within the comment, we tind the focus, the most information-bearing const.itnent in the senten(:e, and the ground, the rest of the sentence. The fo cus is the new or important information in the sentence and receives prosodic prominence in speech.</Paragraph>
    <Paragraph position="1"> In Turkish, the pragmatic fimction of topic is assigned to the sentence-initial position mM the focus to the immediately preverbM position, following (Erguvanh, 11984). The rest of the sentence forms the ground.</Paragraph>
    <Paragraph position="2"> In (Iloffman, 1995; Iloffman, 1995b), I show that the information structure components of topic and focus can be suecessfiflly used in generating the context-appropriate answer to database queries. Determining the topic and focns is fairly easy in the context of a simple question, however it is much more complica.ted in a text. In the fol-</Paragraph>
    <Paragraph position="4"> lowing sections, I will describe the characteristics of topic, focus, and ground components of the 1S in naturally occurring texts analyzed in (lloffman, 1995b) and allude to possible algorithms for determining them. The algorithms will then be spelled out in section 3.</Paragraph>
    <Paragraph position="5"> An example text from the cortms 1 is shown below. The noncanonical OSV word order in (1)b is contextually appropriate because the object t)ronoun is a discourse-old topic that links the se.ntence to the previous context, and the sul)jeet, &amp;quot;your father&amp;quot;, is a discourse-new focus that is being contrasted with other relatives. Discourse-old entities are those that were previously mentioned in the discourse while discourse-new entil, ics are those that were not (Prince, 1992).</Paragraph>
    <Paragraph position="6"> O) a.</Paragraph>
    <Paragraph position="7"> b.</Paragraph>
    <Paragraph position="8"> Bu defteri de gok say(lira ben.</Paragraph>
    <Paragraph position="9"> This notebk-acc too much like-l)st-lS I.</Paragraph>
    <Paragraph position="10"> 'As for this notebook, I like it very much.' Bunu da baban ml verdi? (OSV) This-Ace too father-2S Quest give-Past? 'Did your FATHER, give this to you?'</Paragraph>
  </Section>
  <Section position="5" start_page="556" end_page="558" type="metho">
    <SectionTitle>
(CHILDES lba.cha)
</SectionTitle>
    <Paragraph position="0"> Many people have suggested that &amp;quot;free&amp;quot; word order languages order information from old to new information. However, the Old-to-New ordering prim:iple is a generalization to which exceptions can be found. 1 believe that the order in which speakers place old vs. new items in a sentence reflects the information structures that are awdlable to the speakers. The ordering is actually tile 'Ibpic followed by the Focus. Tile qbpic tends to be discourse-old inlbrmation and the focus disconrsenew. However, it is possible to have a disconrse-NEW topic and a discourse-OLD focus, as wc will see in the following sections, which explains the exceptions to the Old-To-New ordering principle.</Paragraph>
    <Paragraph position="1">  sations, contemporary novels, and adult speedl from the CHILDES corpus.</Paragraph>
    <Section position="1" start_page="556" end_page="557" type="sub_section">
      <SectionTitle>
2.1 Topic
</SectionTitle>
      <Paragraph position="0"> Although humans can intuitively determine whal, the tol)ic of a sentence is, the traditional delinition (what tim sentence is about) is too vague to be implemented in a COmlml, ational system, l propose heuristics based on familiarit,y and salience to determine discourse-old seal;ante topics, ~tt~C/l heu ris~ ties based on grammatical reb~tions Ibr discou rse- null new t.opics. Speakers can shill; Loa new topic at the start, of a new discourse sag/ileal., ;ts iH (2)a. Or they can continue ta.lking about Lh(~ sam(, (liscours(&gt;o\[(I tot)it , as iu (2)1).</Paragraph>
      <Paragraph position="1"> (2) a. \[Mary\]m went to lhe I,ookstore.</Paragraph>
      <Paragraph position="2"> b. \[She\]./. I)ought a new book on linguistics.</Paragraph>
      <Paragraph position="3">  A discourse-old topic often serves 1.o liuk the sentence to the previous context l)y evoking a familiar and sMient discourse entity. (~enteriug Theory ((~rosz/etal, 1{)95) provides a measure of saliency based on the obserwrtions t;hat salient discourse entities are often mentioned rel)ea.1;edly within a discourse segment and are oft.an r(mlized as pronouns. (rl~lran, 1995) provides a. (:OUlprehensive study of null and overt subjects in Turkish using Centering Theory, and \[ inw~stigate the interaction between word order and (',catering in Turkish in (Iloffman, 1996).</Paragraph>
      <Paragraph position="4"> In the Centering Algoritl.n, each nt,l, era.nce in a discom:se is associated with a ranked list of discourse entities called the forward-lookiug eent.ers (Cf list;) that contains every (lis(:ours(~ entity that is reMized in thai; utteraltce. The Cf list is usually ranked according to a hierarchy of granmmtica\] relal, ions, e.g. subjects are aSSllllled to \])e lllore salient than objects. The backward looking center (Cb) is the most salient member of t,he Cf list that links the era'rent utterance to the iwevious utterance. The Cb of an utterance is delined as the highest ranke(l element of the previous u tterance's Cf list that also occurs iu the curren(, utterance.</Paragraph>
      <Paragraph position="5"> If there is a pronoun in the sentence, it ia likely to be the (Jb. As we. will see, the (~,b has much in common with a sentence- tol)ic.</Paragraph>
      <Paragraph position="7"> The Cb analyses of the canonical SOV and the noncanonical OSV word orders in 251rkish are summarized in Figure 1 (forthcoming study in (Hoffman, 1996)). As expected, the subject is often the Cb in the SOV sentences. However, in the OSV sentences, the object, not the subject, is most often the Cb of the utterance. A comparison of the 20 discourses in the first two rows 2 of the tables in Figure 1 using the chi-square test shows that the association between sentence-position and Cb is statistically significant (X 2 = 10.10, p &lt; 0.001). a Thus, the Cb, when it is not dropped, is often placed in the sentence initial topic position in Turkish regardless of whether it is the subject or the object of the sentence. The intditive reason for this is that speakers want to form a coherent discourse by immediately linking each sentence to the previous ones by placing the Cb and discourse-old topic in the sentence-initial position.</Paragraph>
      <Paragraph position="8"> There are also situations where no Cb or discourse-old topic can be found. Then, a discourse-new topic can be placed in the sentence-initial position to start a new discourse segment. Discourse-new topics are often subjects or situation-setting adverbs (e.g. yesterday, in the morning, in the garden) in 3Mrkish.</Paragraph>
    </Section>
    <Section position="2" start_page="557" end_page="558" type="sub_section">
      <SectionTitle>
2.2 Focus
</SectionTitle>
      <Paragraph position="0"> The term focus has been used with many different meanings. Focusing is often associated with new information, but it is well-known that old information, for example pronouns, can be focused as well. I think part of the confusion lies in the distinction between contrastive and presentational 2The centering analysis is inconclusive in some cases because the subject and the object in the sentence are realized with the same referential form (e.g. both as overt pronouns or as full NPs).</Paragraph>
      <Paragraph position="1"> ZAlternatively, using the canonical SOV sentences as the expected frequencies, the observed frequencies for the noncanonical OSV sentences significantly diverge from the expected frequencies (X 2 = 8.8, p &lt; 0.005).</Paragraph>
      <Paragraph position="2"> focus. Focusing discourse-new information is often called presentational or informational focus as shown in (3)a. Broad/wide focus (focus projection) is also possible where the rightmost element in the phrase is accented, but the whole phrase is in focus. However, we can also use focusing in or- null der to contrast one item with another, and in this case the focus can be discourse-old or discoursenew, e.g. (3)b.</Paragraph>
      <Paragraph position="3"> (3) a. What did Mary do this summer? She \[wandered around TURKEY\]F.</Paragraph>
      <Paragraph position="4"> b. It wasn't \[ME\],., - It was \[HF, R\]e.</Paragraph>
      <Paragraph position="5">  (VMlduvf, 1992) defines fbcns as the most information-bearing constituent, and this definition encompasses both contrastive and presentational focusing. I use this definition of focus as well. However, as will see, we still need two different algorithms in order to determine which items are in focus in the target sentence in MT. We must check to see if they are discourse-new information as well as checking if they are being contrasted with another item in the discourse model.</Paragraph>
      <Paragraph position="6"> In Turkish, items that are presentationally or contrastively focused are placed in the immediately preverbM (IPV) position and receive the primary accent of the phrase. 4 As seen in Figure 2, brand-new discourse entities are found in the,,IPV position, but never in other positions in the sentence in my Turkish corpus. The distribution of brand-new (the starred line of the table) versus discourse-old information (the rest of the table 5) is statistically significant, (X 2 = 10.847, p &lt; .001). This supports the association of discourse-new \[bcus with the IPV position.</Paragraph>
      <Paragraph position="7"> 4Some languages such as Greek and Russian treat presentational and contrastive focus differently in word order.</Paragraph>
      <Paragraph position="8"> 5 lnferrables refer to entities that the hearer can easily accmnmodate based on entities already in the dis-. course model or the situation. Hearer-old entities are well-known to the speaker and hearer but not necessarily mentioned in the prior discourse (Prince, 1992). They both behave like discourse-oM entities.</Paragraph>
      <Paragraph position="9">  However, as can be seen in Figure 2, most of the focused subjects in the OSV sentences in my corpus were actually discourse-old information. Discourse-old entities that occur in the IPV position are contrastively focused. In (Rooth, 1985)'s alternative-set theory, a contrastively focused item is interpreted by constructing a set of alternatives from which the focused item must be distinguished. Generalizing from his work, we can determine whether an entity should be contrastively focused by seeing if we can construct an alternative set from the discourse model.</Paragraph>
    </Section>
    <Section position="3" start_page="558" end_page="558" type="sub_section">
      <SectionTitle>
2.3 Ground
</SectionTitle>
      <Paragraph position="0"> Those items that do not play a role in IS of the sentence as the topic or the focus form the ground of the sentence. In Turkish, discourse-old information that is not the topic or focus can be (4) a. dropped, b. postposed to the right of the verb, c. or placed unstressed between the topic and the focus.</Paragraph>
      <Paragraph position="1"> Postposing plays a backgrounding fnnction in Turkish, and it is very common. Often, speakers will drop only those items that are very salient (e.g. mentioned just in the previous sentence) and postpose the rest of the discourse-old items, lIowever, the conditions for dropping arguments can be very complex. (Turan, 1995) shows that there are semantic considerations; for instance, generic objects are often dropped, but specific objects are often realized as overt pronouns and fronted. Thus, the conditions governing dropping and postposing are areas that require more research.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="558" end_page="560" type="metho">
    <SectionTitle>
3 The Implementation
</SectionTitle>
    <Paragraph position="0"> In order to simplify the MT implementation, I concentrate on translating short and simple English texts into Turkish, using an interlingua representation where concepts in the semantic representation map onto at most one word in the English or Turkish lexicons. The translation proceeds sentence by sentence (leaving aside questions of aggregation, etc.), but contextual information is used during the incremental generation of the target text. These simplifications allow me to test out the algorithms for determining the topic and the focus presented in this section.</Paragraph>
    <Paragraph position="1"> In the implementation, first, an English sentence is parsed with a Combinatory Categorial Grammar, CCG, (Steedman, 1985). The semantic representation is then sent to the sentence planner for Turkish. The Sentence Planner uses the algorithms in the following subsections to determine the topic, focus, and ground from the given semantic representation ~md the discourse model.</Paragraph>
    <Paragraph position="2"> Then, the sentence planner sends the semantic representation and the information strncture it has determined to the sentence realization component for Turkish. This component consists of a head-driven bottom up generation algorithm that uses the semantic as well as the information strncture features given by the planner to choose an appropriate head in the lexicon. The grammar used for the generation of 3hlrkish is a lexicalist formalism called Multiset-CCG (Hoffman, 1995; Iloffman, 1995b), an extension of CCGs. Multiset-CCG was developed in order to capture formal and descriptive properties of &amp;quot;free&amp;quot; and restricted word order in simple and complex sentences (with discontinuous constituents and long distance dependencies). Mnltiset-CCG captures the context-dependent meaning of word order in 'Fnrkish by compositionally deriving the predicate-argument structure and the information strnctm'e of a sentence in parallel.</Paragraph>
    <Paragraph position="3"> The following sections describe the algorithms used by the sentence plauner to determine the IS of the 'lSlrkish sentence, given the semantic representation of a parsed English sentence.</Paragraph>
    <Section position="1" start_page="558" end_page="559" type="sub_section">
      <SectionTitle>
3.1 The Topic Algorithm
</SectionTitle>
      <Paragraph position="0"> As each sentence is translated, we update the discourse model, and keep track of the forward looking centers list (Cflist) of the last processed sentence. This is simply a list of all the discourse enities realized in that sentence ranked according to the theta-role hierarchy found in the semantic representation. Thus, the Cf list for the reI)resentation give(Pat, Chris, book) is the ranked list \[Pat,Chris,book\], where the subject is assmned to be more salient than the objects.</Paragraph>
      <Paragraph position="1"> Given the semantic representation for the sentence, the discourse model of the text processe(l so far, and the ranked C\[ lists of the current and previous sentences in the discourse, the following algorithm determines the topic of (;he sentence. First, the algorithm tries to choose the most salient discourse-old entity as the sentence topicf If there is no discourse-old entity realized in the sentence, then a situation-setting adverb o, the subject is chosen as the discourse-new topic.</Paragraph>
      <Paragraph position="2"> l. Compare the current Cf list with the previous sentence's Cf list; and choose the firs( item that is a member of both of the ranked lists (the Cb).</Paragraph>
      <Paragraph position="3"> 6(Stys/Zemke, 1995) use the saliency ranking to order the whole sentence in Polish. tIowever, \[ I)clieve that there is a distinct notion of topic and fo(:as in Turkish.</Paragraph>
      <Paragraph position="4">  2. If 1 fails: Choose the first item in the current sentence's Cf list that is discourse-old (i.e. is already in the discourse model).</Paragraph>
      <Paragraph position="5"> 3. If 2 fails: If there is a situation-setting adverb in the semantic representation (i.e. a predicate modifying the main event, in representation), choose it as the discourse-new topic.</Paragraph>
      <Paragraph position="6"> 4. If 3 fails: choose the first item in the Cf list  (i.e. the subject) as the discourse-new topic.</Paragraph>
      <Paragraph position="7"> Note that the determination of the sentence topic is distinct from the question of how to realize the salient Cb/topic (e.g. as a dropped or overt pronoun or full NP). In the MT domain, this can be determined by the referential form in the source text. This trick can also be used for accommodating inferrable or hearer-old entities that behave as if they are discourse-old even though they are literally discourse-new. If an item that is not; in the discourse model is nonetheless realized as a definite NP in the source text, the speaker is treating the entity as discourse-old. This is very similar to (Stys/Zemke, 1995)'s MT system which uses the referential form in the source text to predict the topicality of a phrase in the target text.</Paragraph>
    </Section>
    <Section position="2" start_page="559" end_page="559" type="sub_section">
      <SectionTitle>
3.2 The Focus Algorithm
</SectionTitle>
      <Paragraph position="0"> Given the rest of the semantic representation for the sentence and the discourse model of the text processed so far, the following algorithm determines the focus of the sentence. The first step is to determine presentational focusing of discourse-new information. Note that the focus, unlike the topic, can contain more than one element; this allows broad focus as well as narrow focusing. If there is no discourse-new information, the second step in the algorithm allows contrastive focusing of discourse-old information. In order to construct the alternative sets, a small knowledge base is used to determine the semantic type (agent, object, or  event) of the entities in the discourse model.</Paragraph>
      <Paragraph position="1"> 1. If there are any discourse-new entities (i.e.</Paragraph>
      <Paragraph position="2"> not in the discourse model) in the sentence, put their semantic representations into focus, 2. Else for each discourse entity realized in the sentence, (a) Look up its semantic type in the KB and construct an alternative set that consists of all objects of that type in the discourse model, (b) If the constructed alternative set is not  empty, put the discourse entity's semantic representation into the focus.</Paragraph>
      <Paragraph position="3"> Once the topic and focus are determined, the remainder of the semantic representation is assigned as the gronnd. For now, items in the ground are either generated in between the topic and the focus or post-posed behind the verb as backgrounded information. Further research is needed to disa.mbiguate the use of the two possible word orders. Further research is also needed on the exact role of verbs in the IS. Verbs can be in the focus or the ground in Turkish; this cannot be seen in the word order, but it is distinguished by sentential stress for narrow focus readings. The algorithm above works for verbs since I place events that are realized as verbs in the sentence into the discourse model as well. ltowever, verbs are usually not in focus unless they are surprising or contrastive or in a discourse-initiM context. Thus, the algorithm needs to be extended to a(:comnaodate discourse-new verbs that are nonetheless expected in some way into the ground component. In addition, verbs often participate in broad focus readings, and fllrther research is needed to account for the observation that broad focus readings are only available in canonical word orders.</Paragraph>
    </Section>
    <Section position="3" start_page="559" end_page="560" type="sub_section">
      <SectionTitle>
3.3 Examples
</SectionTitle>
      <Paragraph position="0"> The English text in (5) is translated using the word orders in (6) following the Mgorithrns given above. In (6), the numbers following T and F indicate the step in the respective algorithm which determined the topic or focus for that sentence. Note that the inappropriate word orders (indicated by  #) cannot be generated by the algorithm.</Paragraph>
      <Paragraph position="1"> (5) a. Pat will meet Chris today.</Paragraph>
      <Paragraph position="2"> b. There is a tMk at four.</Paragraph>
      <Paragraph position="3"> c. Chris is giving the talk.</Paragraph>
      <Paragraph position="4"> d. Pat cannot come.</Paragraph>
      <Paragraph position="5"> (6) a.</Paragraph>
      <Paragraph position="6"> b.</Paragraph>
      <Paragraph position="7">  The algorithms can also utilize long distance scrambling in 3~rkish, i.e. constructions where an element of an embedded clause has been ex- null tracted and scrambled into the matrix clause in order to play a role in the IS of the matrix clause. For example the b sentence in the following text is translated using long distance scrambling because &amp;quot;the talk&amp;quot; is the Cb of the utterance and therefore the best sentence topic, even though it is the argument of an embedded clause.</Paragraph>
      <Paragraph position="8">  (7) a. There is a talk at four.</Paragraph>
      <Paragraph position="9"> b. Pat thinks that Chris will give the talk. (8) a. D6rtde bir konu~ma var. (AdvSV)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>