<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2039"> <Title>Parsing Aligned Parallel Corpus by Projecting Syntactic Relations from Annotated Source Corpus</Title> <Section position="4" start_page="301" end_page="301" type="metho"> <SectionTitle> 2 Link Grammar and Phrases </SectionTitle> <Paragraph position="0"> Link grammar (LG) is a theory of syntax which builds simple relations between pairs of words, rather than constructing constituents in a tree-like hierarchy. For example, in an SVO language like English, the verb forms a subject link (S-) to some word on its left, and an object link (O+) with some word on its right. Nouns make a subject link (S+) to some word (a verb) on their right, or an object link (O-) to some word on their left.</Paragraph> <Paragraph position="1"> The English Link Grammar Parser (Sleator and Temperley, 1991) is a syntactic parser of English based on LG. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words.</Paragraph> <Paragraph position="2"> The parser also produces a &quot;constituent&quot; representation of a sentence (showing noun phrases, verb phrases, etc.). It is a dictionary-based system in which each word in the dictionary is associated with a set of links. Most of the links have associated suffixes that describe properties of the underlying word (e.g., gender (m/f), number (s/p)). The English link parser lists a total of 107 links. 
Table 1 gives a list of some important links of English LG along with the information about the words on their left/right and some suffixes.</Paragraph> <Paragraph position="3"> As an example, consider the syntactic structure and constituent representation of the sentence given below.</Paragraph> <Paragraph position="5"> the teacher of the boys is good</Paragraph> <Paragraph position="7"> It may be noted that in the phrase structure of the above sentence, the verb phrase obtained from the phrase parser has been modified to some extent. The algorithm discussed in this work treats a verb phrase as the main verb together with all its auxiliary verbs.</Paragraph> <Paragraph position="8"> For ease of presentation and understanding, we classify phrase relations as Inter-phrase and Intra-phrase relations. Since the phrases are often embedded, different levels of phrase relations are obtained. From the outermost level to the innermost, we call them &quot;first-level&quot; relations, &quot;second-level&quot; relations and so on. One should note that an ith-level Intra-phrase relation may become an Inter-phrase relation at a higher (more deeply nested) level.</Paragraph> <Paragraph position="9"> As an example, consider the parsing and phrase structure of the English sentence given above.</Paragraph> <Paragraph position="10"> In the first level the Inter-phrase relations (corresponding to the phrases &quot;the teacher of the boys&quot;, &quot;is&quot; and &quot;good&quot;) are Ss and Pa, and the remaining links are Intra-phrase relations.</Paragraph> <Paragraph position="11"> In the second level the only Inter-phrase relation is Mp (connecting &quot;the teacher&quot; and &quot;the boys&quot;), and the Intra-phrase relations are Ds, Jp and Dmc. 
In the third and last level, Jp is the Inter-phrase relation and Dmc is the Intra-phrase relation (corresponding to &quot;of&quot; and &quot;the boys&quot;).</Paragraph> <Paragraph position="12"> The algorithm proposed in Section 4 uses pDCA to first establish the relations of the target language corresponding to the first-level Inter-phrase relations of the source language sentence. It then recursively assigns the relations corresponding to the inner levels.</Paragraph> </Section> <Section position="5" start_page="301" end_page="303" type="metho"> <SectionTitle> 3 DCA vis-à-vis pDCA </SectionTitle> <Paragraph position="0"> The Direct Correspondence Assumption (DCA) states that the relations between words in a source language sentence can be projected as the relations between the corresponding words in its (literal) translation in the target language. The Direct Projection Algorithm (DPA), which is based on DCA, is a straightforward projection procedure in which the dependencies in an English sentence are projected to the sentence's translation, using the word-level alignments as a bridge. DPA also uses some monolingual knowledge specific to the projected-to language. This knowledge is applied in the form of Post-Projection transformation.</Paragraph> <Paragraph position="1"> However, for many language pairs the syntactic relationships between words cannot always be carried over to project a parse structure from the source language to the target language. For illustration consider the sentence given in Figure 1. We try to project the links from English to Hindi in Figure 1(a) and Hindi to Bangla in Figure 1(b).</Paragraph> <Paragraph position="2"> For Hindi sentence, links are given as discussed by We observe that in the parse structure of the target language sentences, neither are all relations correct nor is the parse tree complete. Thus DPA leads to a parse structure that is, if not wrong, very shallow. 
Further, Figure 1(b) suggests that DCA fails not only for languages belonging to different families (English-Hindi), but also for languages belonging to the same family (Hindi-Bangla).</Paragraph> <Paragraph position="3"> Hence the parsing algorithm must be able to differentiate between the links which can be projected directly and the links which cannot. Further, it needs to identify the chunks of the target language sentence that cannot be linked even after projecting the links from the source language sentence. Thus we propose the pseudo Direct Correspondence Assumption (pDCA), under which not all relations can be projected directly. The projection algorithm needs to take care of the following three categories of links: Category 1: The relationship between two chunks in the source language can be projected to the target language with minor or no changes (for example, subject-verb, object-verb relationships in the above illustration). It may be noted that, except for some suffix differences (due to morphological variations), the relation is the same in the source and the target language.</Paragraph> <Paragraph position="4"> Category 2: The relationship between two chunks in the source language can be projected to the target language with major changes. For example, in the English sentence given in Figure 2(a), the relationship between the girl and in the white dress is Mp, i.e. &quot;nominal modifier (preposition phrase)&quot;. In the corresponding Hindi phrases ladkii and safed kapde waalii, although the relationship is the same, i.e., &quot;nominal modifier&quot;, the type of nominal modifier changes to waalaa/waale/waalii-adjective. 
If the distinction between the types of nominal modifiers is not maintained, the parsing will be very shallow.</Paragraph> <Paragraph position="5"> Hence the modification in the link is necessary.</Paragraph> <Paragraph position="6"> Category 3: The relationship between two chunks in the target language is either entirely different or cannot be captured from the relationship between the corresponding chunk(s) in the source language. For example, the relationship between the main verb and the auxiliary verb of the Hindi sentence in Figure 2(a) cannot be defined using the English parsing. Such phrases should be parsed independently.</Paragraph> <Paragraph position="7"> The proposed algorithm is based on the above-described concept of pDCA, and gives the parse structure of the sentences given in Fig. 2.</Paragraph> <Paragraph position="8"> While working with Indian languages, we found that outermost Inter-phrase relations usually belong to Category 1, and the remaining relations belong to Category 2. Generally an innermost Intra-phrase relation (as in a verb phrase) belongs to Category 3. Thus, outermost Inter-phrase relations can usually be projected to the target language directly; innermost Intra-phrase relations of the target language which are independent of the source language should be decided on the basis of language-specific study; and the remaining relations should be modified before projection from the source to the target language.</Paragraph> </Section> <Section position="6" start_page="303" end_page="306" type="metho"> <SectionTitle> 4 The Proposed Algorithm </SectionTitle> <Paragraph position="0"> DPA (Hwa et al., 2005) discusses a projection procedure for five different cases of word alignment between source and target language: one-to-one, one-to-none, one-to-many, many-to-one and many-to-many. As discussed earlier, DPA is not sufficient for many cases. 
For example, in the case of one-to-many alignment, the proposed solution is to first create a new empty word that is set as the head of all multiply-aligned words in the target language sentence, and then to project the relation accordingly. But in such cases, relations between these multiply-aligned words cannot be given, and thus the resulting parse becomes shallow. The proposed algorithm (pDPA) overcomes these shortcomings as well.</Paragraph> <Paragraph position="1"> The pDPA works in the following way. It recursively identifies the phrases of the target language sentence, and assigns the links between two phrases/words of the target language sentence by using the links between the corresponding phrases/words in the source language sentence. It may be noted that a link between phrases means a link between the head words of the corresponding phrases. Assignment of links starts from the outermost level phrases. Syntactic relations between the constituents of the target language phrase(s) whose syntactic structure does not correspond with that of the corresponding phrase(s) in the source language are given independently. A list of link rules is maintained which keeps the information about modification(s) required in a link while projecting from the source language to the target language. These rules are limited to closed category words, to parts of speech projected from the source language, or to easily enumerated lexical categories.</Paragraph> <Paragraph position="2"> Figure 3 describes the algorithm. The algorithm takes an input sentence (T) and the parse and constituent structure of its parallel sentence (S).</Paragraph> <Paragraph position="3"> Further, S and T are assumed to be word-aligned.</Paragraph> <Paragraph position="4"> Initially, S and T are passed to the module ProjectFrom(), which identifies the constituent phrases of S and the relations between them. Then each set of phrases and relations is passed to the module ParseFrom(). 
The ParseFrom() module takes as input two source phrases/words, the relation between them, and the corresponding target phrases. It projects the corresponding relations in the target language sentence T. The ParseFromSpecial() module is required if the relation between phrases of the source language cannot be projected directly to the target language. Module Parse() assigns links between the constituent words of the target language phrases ∈ P. Notations used in the algorithm are as follows: * By T′ [?] S′ we mean that T′ is aligned with S′, T′ and S′ being some text in the target and source language, respectively.</Paragraph> <Paragraph position="5"> * Given a language, the head of a phrase is usually defined as the keyword of the phrase. For example, for a verb phrase, the head word is the main verb.</Paragraph> <Paragraph position="6"> * P is the exhaustive set of target language phrases for which Intra-phrase relations are independent of the corresponding source language phrases.</Paragraph> <Paragraph position="7"> * Rule list R is the list of source-target language specific rules which specify the modifications needed for the source language relations to be projected appropriately in the target language. * Given the parse and constituent structure of a text S, Psij = &lt;Si, Sj, L&gt;, where L is the relation between the constituent phrases/words Si and Sj of S. Ps′ij = &lt;Ti, Tj&gt;, Ti [?] Si and Tj [?] Sj. Further, Phij = &lt;Psij, Ps′ij&gt;.</Paragraph> <Paragraph position="9"> whose occurrence in parse of some S′ may lead to different structure of T′, where T′ [?] S′.</Paragraph> <Paragraph position="10"> In the following sections we discuss in detail the scheme for parsing Hindi sentences using parse structure of the corresponding English sentence. 
Along with the parse structure of the input, the phrase structure is also obtained.</Paragraph> <Paragraph position="11"> 5 Case study: English to Hindi. Prior requirements for developing a parsing scheme for the target language using the proposed algorithm are: development of target language links, a word alignment technique, a phrase identification procedure, creation of the rule set R, morphological analysis, and development of the ParseFromSpecial() module. In this section we discuss these details for adapting a parser for Hindi using the English LG-based parser.</Paragraph> <Paragraph position="12"> Hindi Links. Goyal and Chatterjee (2005a; 2005b) have developed links for Hindi Link Grammar along with their suffixes. Some of the Hindi links are briefly discussed in Table 2. It may be noted that due to the free word order of Hindi, direction cannot be specified for some links, i.e., for such links &quot;Word in Left&quot; and &quot;Word in Right&quot; (second and third column of Table 2) shall be read as &quot;Word on one side&quot; and &quot;Word on the other side&quot;, respectively.</Paragraph> <Paragraph position="13"> (Aswani and Gaizauskas, 2005). However, for the current implementation alignment has been done manually with the help of an online English-Hindi dictionary.</Paragraph> <Paragraph position="14"> Identification of Phrases and Head Words.</Paragraph> <Paragraph position="15"> Verb Phrases. Corresponding to any main verb vi present in the Hindi sentence, a verb phrase is formed by considering all the auxiliary verbs following it. A list of Hindi auxiliary verbs, along with their linkage requirements, is maintained. This list is used to identify and link verb phrases. The main verb of the verb phrase is considered to be the head word.</Paragraph> <Paragraph position="16"> Noun and Postposition Phrases. An English NP is translated in Hindi as either an NP or a PP. Also, an English PP can be translated as either an NP or a PP. 
If the Hindi noun is followed by any postposition, then that postposition is attached to the noun to get a PP (postposition phrase). In this case the postposition is considered as the head. The Hindi NP corresponding to some English NP is the maximal span of the words (in the Hindi sentence) aligned with the words in the corresponding English NP. The Hindi noun whose English translation is involved in establishing the Inter-phrase link is the head word. Note that if the last word (noun) in this Hindi NP is followed by any postposition (resulting in some PP), then that postposition is also included in the NP concerned. In this case the postposition is the head of the NP. The system maintains a list of Hindi postpositions to identify Hindi PPs.</Paragraph> <Paragraph position="17"> For example, consider the translation pair the lady in the room had cooked the food [?] kamre (room) mein (in) baiThii huii (-) aurat (lady) ne (-) khaanaa (food) banaayaa (cooked) thaa (-).</Paragraph> <Paragraph position="18"> The phrase structure of the English sentence is (NP1 (NP2 the lady) (PP1 in (NP3 the room))) (VP1 had cooked) (NP4 the food).</Paragraph> <Paragraph position="19"> Here, some of the Hindi phrases are as follows: kamre mein and aurat ne are identified as Hindi PPs corresponding to English PP1 and NP2. The words mein and ne are considered as their head words, respectively. Since the maximal span of</Paragraph> <Paragraph position="20"> translation of words of English NP1 is kamre mein baiThii huii aurat which is followed by postposition ne, the Hindi phrase corresponding to NP1 is kamre mein baiThii huii aurat ne with ne as the head word. As huii and thaa, which follow the verbs baiThii and banaayaa respectively, are present in the auxiliary verb list, Hindi VPs are obtained as baiThii huii and banaayaa thaa (corresponding to VP1).</Paragraph> <Paragraph position="21"> Phrase Set P. 
Hindi verb phrases and postposition phrases are linked independently of the corresponding phrases in the English sentence. Thus,</Paragraph> <Paragraph position="23"> Rule List R. Below we list some of the rules defined for parsing Hindi sentences using the English links (E-links) of the parallel English sentences. Note that these rules are dependent on the target language.</Paragraph> <Paragraph position="24"> Corresponding to E-link S: If the Hindi subject is followed by ne, then the subject makes a Jn link with ne, and ne makes an SN link with the verb.</Paragraph> <Paragraph position="25"> Corresponding to E-link O: If the Hindi object is followed by ko, then the object makes a Jk link with ko, and ko makes an OK link with the verb.</Paragraph> <Paragraph position="26"> Corresponding to E-links M, MX: English NPs may have a preposition phrase, present participle, past participle or adjective as postnominal modifiers, which are translated as prenominal modifiers or as a relative clause in Hindi. The structure of the postnominal modifier, however, may not be preserved in the Hindi sentence. If the sentence is not complex, then the corresponding Hindi link may be one of MA (adjective), MP (postposition phrase), MT (present participle), ME (past participle), or MW (waalaa/waale/waalii-adjective). An appropriate link is to be assigned in the Hindi sentence after identification of the structure of the nominal modifier. These cases are handled in the module ParseFromSpecial(). The segment of the module that handles the English Mp link is given in Figure 4.</Paragraph> <Paragraph position="27"> Further, since morphological information of Hindi words cannot always be extracted using the corresponding English sentence, a morphological analyzer is required to extract the information. 
For the current implementation, morphological information is being extracted using some rules in simpler cases, and manually for more complex cases.</Paragraph> <Section position="1" start_page="306" end_page="306" type="sub_section"> <SectionTitle> 5.1 Illustration with an Example </SectionTitle> <Paragraph position="0"> Consider the English sentence (S) the girl in the room drew a picture, with its parse and constituent structure as given in Figure 5. The corresponding Hindi sentence (T) and the word alignment are also given.</Paragraph> <Paragraph position="1"> The step-by-step parsing of the sentence as per the pDPA is given below.</Paragraph> <Paragraph position="2"> ProjectFrom(S, T): S = {S1,S2,S3}, where S1,S2,S3 are the phrases the girl in the room, drew and a picture, respectively. From the definition of Hindi phrases, the corresponding Ti's are identified as &quot;kamre mein baithii laDkii ne&quot;, &quot;banaayaa&quot; and &quot;ek chitr&quot;. From the parse structure of S, Ph's are obtained as Ph12 = &lt;&lt;S1,S2,Ss&gt;,&lt;T1,T2&gt;&gt; and</Paragraph> <Paragraph position="4"> pushed in the stack S and further processing is done one-by-one for each of them. We show the further process for the Ph12.</Paragraph> <Paragraph position="5"> Since Ss ∉ L, ParseFrom(Ph12) is executed.</Paragraph> <Paragraph position="6"> ParseFrom(Ph12): The algorithm identifies t1 = ne, t2 = banaayaa. The Hindi link corresponding to Ss will be SN. The module ProjectFrom(S1, T1) is then called.</Paragraph> <Paragraph position="8"> girl and in the room, respectively. Corresponding T11 and T12 are ladkii ne and kamre mein. Thus, Ph = &lt;&lt;S11,S12,Mp&gt;,&lt;T11,T12&gt;&gt;.</Paragraph> <Paragraph position="9"> Since L = Mp ∈ L, ParseFromSpecial(Ph) is called.</Paragraph> <Paragraph position="10"> ParseFromSpecial(Ph): (Refer to Figure 4) Since T2 is followed by an unaligned verb baithii, the algorithm finds T3 as baithii, and t1 as ne. It assigns an ME link between baithii and ne. 
Further, an MVp link is assigned between mein and baithii. Then ProjectFrom(S11,T11) and ProjectFrom(S12,T12) are called. Since both T11 and T12 [?] S, J and Jn links are assigned between constituent words of T11 and T12, respectively, using Hindi-specific rules.</Paragraph> <Paragraph position="11"> Similarly, Ph23 is parsed.</Paragraph> <Paragraph position="12"> The final parse and phrase structure of the sen-</Paragraph> </Section> </Section> </Paper>