<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1205">
  <Title>Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora</Title>
  <Section position="4" start_page="0" end_page="30" type="metho">
    <SectionTitle>
2 Complex Predicates
</SectionTitle>
    <Paragraph position="0"> CPs are characterized by a predicate or host, typically a noun (N), adjective (A), verb (V), or adverb (Adv), followed by a light verb (LV), a grammaticalized version of a main verb that contributes little telic significance to the composite predicate. As an example, the English verb "describe" may be rendered in Hindi as the Noun-Verb complex varNan kar, "description + do". An analysis based on a non-CP lexicon might assign kar (do) as the verbal head, whereas functional aspects such as the argument structure are determined by the noun host varNan "description". An example of a V-V CP is kar de, "do+give", where the light verb de "give" imposes a completive aspect on the action kar "do".</Paragraph>
    <Paragraph position="1"> Identifying such constructs is a significant hurdle for NLP tasks ranging from phrasal parsing (Ray et al., 2003; Shrivastava et al., 2005) and translation (where each complex may be treated as a lexical unit in the target language) to predicate-argument analysis and semantic delineation. Beyond these computational aspects, a mere listing of all CPs occurring in the corpus would provide an important resource for tasks such as constructing WordNets (Narayan et al., 2002) and linguistic analysis of CPs (Butt and Geuder, 2003).</Paragraph>
    <Paragraph position="2"> Rule-based approaches to identifying CPs are not very effective, since there does not seem to be any clear set of rules that can distinguish CPs from non-CP constructs (contrast, for example, the composite CP anumati de, "permission+give", with the non-composite N-V structure kitaab de, "give the book"). Even where such rules do exist, they depend on semantic properties, such as the fact that a book is a physical object which can be given in the physical sense (Raina and Mukerjee, 2005).</Paragraph>
    <Paragraph position="3"> However, in the translated form, the former may show up as a verb, whereas the latter will invariably be an N+V, so the tag projection would rule out the latter as a CP.</Paragraph>
    <Paragraph position="4"> Here we adopt a parallel corpus-based approach to creating a database of complex predicates in Hindi. The procedure can potentially be replicated for most Indo-Aryan languages. The motivation is that a CP may be translated as a direct verb in other languages, so that POS projection across parallel corpora then projects a tag of Verb onto the corresponding expression in Hindi. Additional linguistic constraints are used to determine whether the multi-word cluster qualifies as a CP. These include a check list of LVs that can occur with the A, N, V and Adv constituents of a multi-word predicate.</Paragraph>
    <Paragraph position="5"> Let us consider some examples from the CP lexicon constructed from the EMILLE parallel corpus (McEnery et al., 2000) of 200,000 words, collected from leaflets prepared by the UK government for immigrants. Examples of the different complexes are:
      (1) N+V: varNan kar, "description + do"
          paikej yaa prastut ishtehaar mein jaise varNan kiyaa gayaa ho ThIk vaisaa hii hogaa
          package or present advertisement in as description do-past go-past be-pres exact same emph be-fut
          "It will be exactly as described on the package or the display advertisement."
      (2) A+V: upalabdh hai, "available + be"
          Sahaytaa samiip hii upalabdh hai
          Help near emph available be-pres
          "Help is available nearby."
      (3) V+V: soch le, "think+take"
          Pahle har pehluu ke baare-mein achchhi tarah soch liijiye
          First every aspect-poss about good way think take-imp-hon
          "Think it through first."
      (4) Adv+V: vaapas paa, "return+obtain"
          Aap saamaan badalne mein apne puure paise vaapas paane kaa adhikar kho dete hai
          You goods exchange-nom in your all money return obtain-nom of right lose give be-pres
          "You lose your right to get your full money back in exchanging the goods."
      Of the four classes cited above, the NV and AV classes are the most productive. The AdvV class is highly restricted, confined to a few adverbs. The VV class is highly selective for its constituents, apparently driven by semantic considerations.</Paragraph>
    <Paragraph position="6"> Identifying CPs in text is crucial to processing, since a CP serves as a clausal head, and other elements in the phrase are licensed by the complex as a whole and not by the verbal head. The semantic import of the host-verb complex varies along a composability continuum: at one end we have purely idiomatic CPs, while at the other end the meaning of the CP may be recoverable from its constituents. For example, vyavhaar kar, "behave+do", has a sense of "use, treat" in English, clearly reflecting an idiomatic usage.</Paragraph>
    <Paragraph position="7"> Detecting CPs is made difficult by the differing degrees of productivity for different classes of open-class host, which reflects the applicability of unrestricted rules. Also, verbs participating in CPs are very selective; e.g., in NV and AV CPs the verb is typically restricted to ho, kar and the like, whereas in VV constructs ho reflects auxiliary usage and a different set of verbs appears. The open-class word (host) tends to be uninflected, and only the light verb (LV) carries tense, agreement and aspect markers. Even the host V participating in a VV CP is always uninflected. As an instance of the difficulty in detecting CPs, consider the so-called permissive CP (Hook, 1993; Butt and Geuder, 2003), as in the karne+de, "do-nom + give", example here, where the host verb appears to be inflected:
      (5) Raam ne sitaa ko kaam karne diyaa
          Ram-erg sita-acc work do-nom give-past
          "Ram let Sita do the work"
      However, this does not actually reflect CP usage, and is better parsed as: [...]
      Another challenge for CP identification is that the constituents may be separated, sometimes quite widely.</Paragraph>
    <Paragraph position="8"> 3 CPs from Parallel Projection
      Identifying MWEs from corpora is clearly an area of increasing research emphasis. For resource-rich languages, one may use a parse tree and look for mutual information statistics in head-complement collocations, and also compare them with other "similar" collocations to determine if something is unusual about a given construct (Lin, 1999). As of now, however, even POS tagging remains a challenge for languages such as Hindi, making it necessary to seek alternate methods. Parallel-corpus-based approaches to inducing monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for French, Chinese and other languages have shown quite promising results (Yarowsky et al., 2001). These approaches use minimal linguistic input and have become increasingly effective as large parallel corpora have become more widely available. The algorithm essentially word-aligns the target language sentences with the source language sentences and then uses a probabilistic model to project the linguistic information from the source language. Since these are statistical algorithms, the accuracy of the results depends on the size of the corpus used.</Paragraph>
    <Paragraph position="9"> In our approach, we first use a similar method to word-align an English-Hindi parallel corpus. The English sentences are tagged and the tags are projected onto the Hindi sentences. We observe that words which are tagged as verbs by projection, have a POS tag of N, A, Adv or V in the Hindi lexicon, and are followed by an LV are usually CPs.</Paragraph>
    <Paragraph position="10"> Clearly, CP detection is limited to those instances where a CP in the target language is translated as a single verb in English. For example, a phrase such as jawaab de, "answer give", may be rendered in English either as the verb "answer" or as the English CP "give answer". In the latter case (an example appearing quite frequently in this corpus), the correct POS projection would label jawaab as [N answer], thus failing to detect the CP. While this may not be significant in certain tasks (e.g. translation), it may be relevant in others (e.g. semantic processing).</Paragraph>
    <Paragraph position="11">  Furthermore, the POS tagging process is inherently biased towards projecting tags for frequently encountered constituents first, and this may lead to some constituents in certain CPs being flagged with their normal POS tags, resulting in missed CPs. However, this does not result in false positives, since non-CP constructs often fail on other criteria (e.g. list of LVs).</Paragraph>
    <Paragraph position="12"> For the reasons discussed above, many CPs are not identifiable through parallel corpus methods. Some examples include adhikaar hote, paidaa karne and haani hotii. Our database is therefore correspondingly thin for these types of CPs.</Paragraph>
    <Paragraph position="13"> With VV CPs, it is difficult to distinguish between CPs and other related structures such as the passive construct or serial verbs. These are illustrated below.</Paragraph>
  </Section>
  <Section position="5" start_page="30" end_page="30" type="metho">
    <SectionTitle>
(7) Passive
</SectionTitle>
    <Paragraph position="0"> Aisa bhii ho saktaa hai ki credit note siraf kuch hii dino tak kaam me laaya jaa sakta ho
          It emph be can aux that credit note only few emph days for use in bring go can be
          "It is quite possible that the credit note can be put to use only for a few days."
      (8) Serial verb
          voh laDkaa mujhe apni kitaab de gayaa
          That boy me own book give go-past
          "That boy gave me his book and went away."
      It appears that the passive can be reliably ruled out using the root verb criterion for VVs, since the main verb in the passive is always in an inflected form. No comparable formal criterion exists for the serial verb, where the POS tagger will likewise identify both constituents as verbs. However, these verbs are relatively rare compared to CPs.</Paragraph>
  </Section>
  <Section position="6" start_page="30" end_page="33" type="metho">
    <SectionTitle>
4 Hindi-English POS Projection
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
4.1 Data Resources and Preprocessing
</SectionTitle>
      <Paragraph position="0"> We used the EMILLE Hindi-English parallel corpus, with approximately 200,000 words of non-sentence-aligned translations in 16-bit Unicode format (McEnery et al., 2000). The texts consist of different types of information leaflets originally in English, along with translations in Hindi, Bangla, Gujarati and a number of other South Asian languages. Closer analysis reveals that the corpus is not completely sentence aligned and that the translations are inaccurate in many cases. The Hindi versions of the manuals tend to be more verbose than their English counterparts. The word alignment algorithm requires a sentence-aligned corpus, but owing to the small size of the parallel corpus, standard sentence alignment systems did not give very high accuracy. Therefore, the whole data set was manually sentence aligned, producing the sentence-aligned parallel corpus of about nine thousand sentences and 140 thousand words that is used in this work.</Paragraph>
    </Section>
    <Section position="2" start_page="30" end_page="31" type="sub_section">
      <SectionTitle>
4.2 Word alignment
</SectionTitle>
      <Paragraph position="0"> We used the IBM models (Brown et al., 1993) to word-align the parallel corpus. The IBM models have been widely used in statistical machine translation. Given a Hindi sentence h, we seek the English sentence e that maximizes P(e | h), i.e., the "most likely" translation.</Paragraph>
      <Paragraph position="2"> By Bayes' rule, maximizing P(e | h) is equivalent to maximizing P(e) P(h | e), so for word alignment we are interested in P(h | e). We used the GIZA++ toolkit (Och and Ney, 2000), based on the Expectation Maximization (EM) algorithm, to calculate these probability measures. At the end of this step, we have a word-to-word mapping between the English and Hindi sentences. A "NULL" word is used in the English sentences to account for the unaligned Hindi words in the corresponding Hindi sentence.</Paragraph>
      <Paragraph position="4"> [...] from the English "complain" and is tagged as V+V. Since shikaayat is an N in the Hindi lexicon, this phrase is identified as a CP of N+V type.</Paragraph>
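      <Paragraph> The EM estimation of P(h | e) can be illustrated with a minimal sketch of IBM Model 1 alone. This is not the GIZA++ implementation used in this work, which trains further models; the function names, the toy data format and the iteration count are assumptions made for illustration only.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(h | e) with EM over (english_tokens, hindi_tokens) pairs."""
    # A NULL token on the English side absorbs Hindi words with no counterpart.
    pairs = [(["NULL"] + list(e), list(h)) for e, h in pairs]
    hindi_vocab = {w for _, h in pairs for w in h}
    uniform = 1.0 / len(hindi_vocab)
    t = defaultdict(lambda: uniform)  # t[(hindi_word, english_word)]
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for e_sent, h_sent in pairs:
            for h_word in h_sent:
                # E-step: share each Hindi word among its English candidates.
                norm = sum(t[(h_word, e_word)] for e_word in e_sent)
                for e_word in e_sent:
                    frac = t[(h_word, e_word)] / norm
                    count[(h_word, e_word)] += frac
                    total[e_word] += frac
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(
            float, {(h, e): c / total[e] for (h, e), c in count.items()}
        )
    return t


def align(e_sent, h_sent, t):
    """Link every Hindi word to its most probable English word (or NULL)."""
    candidates = ["NULL"] + list(e_sent)
    return [
        (h_word, max(candidates, key=lambda e_word: t[(h_word, e_word)]))
        for h_word in h_sent
    ]
```

      On a toy corpus in which kitaab co-occurs with "book" and de with "give", align links each Hindi word to its English counterpart; real corpora need far more data, as noted above.</Paragraph>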
    </Section>
    <Section position="3" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.3 Tagging English Sentences
</SectionTitle>
      <Paragraph position="0"> The English sentences are POS-tagged using the Brill Tagger (Brill, 1994), a rule-based tagger which uses more or less the same tags as the Penn Treebank project (Marcus, 1994).</Paragraph>
      <Paragraph position="1"> Since we did not need a very detailed subcategorization of the tag set for Hindi, the English tag set was reduced by merging the subcategorization tags of a few categories. Thus all noun distinctions in the Penn Treebank tagset based on number, person etc. were merged in our treatment of the Noun class. Similarly, in the case of verbs, we merged distinctions based on tense, person, aspect, participles etc. Subclasses of adverbs and case forms of pronouns were also merged. The rest of the POS categories were retained. The "NULL" word in the English sentences, used for unaligned Hindi words in the parallel corpus, was given a "NULL" tag.</Paragraph>
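      <Paragraph> The tag reduction can be sketched as a prefix-based mapping over Penn Treebank tags. The coarse labels (N, V, Adv, Pron) and the exact prefix table are assumptions; the text above names the merged categories but not the mapping itself.

```python
# Hypothetical prefix table: the merged categories come from the text,
# the coarse labels and prefixes are illustrative assumptions.
PREFIX_TO_COARSE = (
    ("NN", "N"),      # NN, NNS, NNP, NNPS
    ("VB", "V"),      # VB, VBD, VBG, VBN, VBP, VBZ
    ("RB", "Adv"),    # RB, RBR, RBS
    ("PRP", "Pron"),  # PRP, PRP$
)


def reduce_tag(penn_tag):
    """Collapse a Penn Treebank tag to a coarse class; keep other tags as-is."""
    if penn_tag == "NULL":
        return "NULL"
    for prefix, coarse in PREFIX_TO_COARSE:
        if penn_tag.startswith(prefix):
            return coarse
    return penn_tag
```

      Tags outside the four merged categories, such as JJ or IN, pass through unchanged, matching the statement that the remaining categories were retained.</Paragraph>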
    </Section>
    <Section position="4" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.4 Projection of Tags to Hindi
</SectionTitle>
      <Paragraph position="0"> The reduced English tags were projected to Hindi words based on the word alignments obtained earlier. A sample alignment and tagged projection is shown in Figure 1. As the figure shows, postpositional markers, which are relatively more frequent in Hindi, are mapped to the "NULL" word in the English sentence.</Paragraph>
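      <Paragraph> A minimal sketch of the projection step follows, assuming each Hindi word carries the index of the English word it is aligned to, or None when it is aligned to "NULL". The data layout and the function name are illustrative assumptions, not the format produced by the alignment tool.

```python
def project_tags(english_tagged, alignment):
    """Copy each English word's reduced tag onto the Hindi words aligned to it.

    english_tagged: list of (english_word, reduced_tag) pairs.
    alignment: list of (hindi_word, english_index) pairs, where
               english_index is None for Hindi words aligned to NULL.
    """
    projected = []
    for hindi_word, english_index in alignment:
        if english_index is None:
            projected.append((hindi_word, "NULL"))
        else:
            projected.append((hindi_word, english_tagged[english_index][1]))
    return projected
```

      For a hypothetical pair such as "Ram complained" and raam ne shikaayat kii, both shikaayat and kii would receive the tag V, while the postposition ne would receive NULL.</Paragraph>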
      <Paragraph position="1"> Since the amount of training data is very small, the statistical word alignment algorithm is not accurate enough to align all words correctly. To overcome this weakness, we apply some filtering conditions to remove alignment errors, especially in shorter sentences. This filtering is based on two parameters: a) the fertility count (r_f), defined as the number of Hindi words an English word maps to, and b) the acceptance level (k), defined as the number of words acceptable in a sentence with fertility count greater than or equal to r_f. These two parameters are selected to minimize errors in the ground-truth sample set, and the resulting filtering heuristics are presented in Table 1.</Paragraph>
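      <Paragraph> One possible reading of the filter is sketched below: a sentence alignment is kept only if at most k English words have a fertility of r_f or more. The thresholds of Table 1 are not reproduced here, so r_f and k are left as parameters, and excluding the "NULL" word from the fertility counts is an assumption.

```python
from collections import Counter


def passes_filter(alignment, r_f, k):
    """Return True if at most k English words have fertility of r_f or more.

    alignment: list of (hindi_word, english_index) pairs; None marks NULL,
    which is ignored when counting fertility (an assumption of this sketch).
    """
    fertility = Counter(idx for _, idx in alignment if idx is not None)
    high_fertility_words = sum(1 for n in fertility.values() if n >= r_f)
    return k >= high_fertility_words
```

      A sentence in which one English word maps to three Hindi words is rejected with r_f = 3 and k = 0, and accepted once k is raised to 1.</Paragraph>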
    </Section>
    <Section position="5" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.5 Identification of CPs
</SectionTitle>
      <Paragraph position="0"> After the filtering is done, we observe that CPs are usually translated as a direct verb in English. So if the projected tag of a Hindi word is Verb, the normal POS tag of the word in the Hindi dictionary is N, A, V or Adv, and the word is followed by one of the members of the LV set, then we classify the multi-word expression as an N+V, A+V, V+V or Adv+V CP respectively.</Paragraph>
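      <Paragraph> The rule above can be sketched as a scan over adjacent word pairs. The light-verb set below is an illustrative subset drawn from the examples in this paper, not the full list used, and the sketch assumes that each Hindi word has already been reduced to its root form.

```python
LIGHT_VERBS = {"kar", "ho", "de", "le"}  # illustrative subset only
HOST_CLASSES = {"N", "A", "V", "Adv"}


def find_cps(tokens, light_verbs=LIGHT_VERBS):
    """Return (host, light_verb, cp_type) for adjacent host + LV pairs.

    tokens: list of (hindi_root, projected_tag, lexicon_tag) triples.
    """
    cps = []
    for (host, projected, lexicon), (following, _, _) in zip(tokens, tokens[1:]):
        is_host = projected == "V" and lexicon in HOST_CLASSES
        if is_host and following in light_verbs:
            cps.append((host, following, lexicon + "+V"))
    return cps
```

      With shikaayat projected as V but listed as N in the lexicon, shikaayat kar is reported as an N+V CP, whereas kitaab de is skipped because kitaab keeps its projected tag N.</Paragraph>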
    </Section>
    <Section position="6" start_page="31" end_page="32" type="sub_section">
      <SectionTitle>
4.6 Fragments of the CP Lexicon
</SectionTitle>
      <Paragraph position="0"> A sample fragment of the CP lexicon is shown in Figure 2. The whole corpus is available online (the lexicon is available at http://www.cse.iitk.ac.in/users/language/CPdatabase.htm). Since we do not have a very comprehensive Hindi dictionary, we are not able to assign many of the identified CPs to their respective classes. On a test with 4400 sentences we identified a total of 1439 CPs, with the following distribution: N+V: 788, A+V: 107, Adv+V: 18 and V+V: 526.</Paragraph>
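      <Paragraph> As a quick arithmetic check, the class counts reported above can be summed and converted to percentage shares. The counts are taken from the text; the variable names are illustrative.

```python
# Counts as reported for the 4400-sentence test.
cp_counts = {"N+V": 788, "A+V": 107, "Adv+V": 18, "V+V": 526}

total = sum(cp_counts.values())
shares = {
    cp_type: round(100 * n / total, 1) for cp_type, n in cp_counts.items()
}
```

      The total agrees with the 1439 CPs reported, and the N+V class accounts for a little over half of them.</Paragraph>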
    </Section>
    <Section position="7" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
4.7 Errors in CP identification
</SectionTitle>
      <Paragraph position="0"> CP identification in the test data set involved certain ground-truth decisions, such as excluding verbal composites with the regular auxiliary verb hai, corresponding to the English finite verb 'be', and the progressive raha, '-ing (progressive)'. CPs with idiomatic usage were included, and so were CPs with a passive verb, although the latter were not counted in the computed scores. The testing was done on a small set of about 120 ground-truth sentences in which the CPs were carefully identified manually. We obtain a precision of about 82.5% and a recall of 40% with our CP-finding algorithm. If the idiomatic CPs are not considered, the recall goes up to 46%.</Paragraph>
      <Paragraph position="1"> Several types of errors are observed in the corpus-derived results. A False Negative (missed CP) error arising from an English complex predicate is shown in Figure 3. A number of False Positives arise from inadequacies in the Hindi dictionary: the online dictionary of Hindi we used was missing many lexemes. A further problem is homography; e.g., the word kii appears both as a possessive marker and as the past-tense form of the verb kara (do), occurring frequently (with jaa, go) in adjectival clause constructions. This has been mis-tagged in about one in ten instances (approx. 0.2% of cases), with hosts such as shikaayat (complaint), baat (talk), dekhvaal (looking-after), madad (help) etc. Similarly, the word un can appear as a noun (wool) or a pronoun (he). Furthermore, while considerable care was taken to manually sentence align the parallel corpus, a number of typos and other problems remain, some of which show up as false positives.</Paragraph>
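      <Paragraph> The two scores can be computed from the set of CPs found by the algorithm and the set identified manually, as in the sketch below. The function name and the set-based representation are assumptions; the ground-truth data itself is not reproduced here.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted CPs against manually identified CPs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted.intersection(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

      High precision with low recall, as reported above, corresponds to a predicted set that is mostly correct but covers well under half of the manually identified CPs.</Paragraph>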
    </Section>
    <Section position="8" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
4.8 Discontinuous CP identification
</SectionTitle>
      <Paragraph position="0"> In the results above, we have made no attempt to identify discontinuous CPs, i.e., instances where other phonological material intervenes between the constituents of a CP. As an example, consider:
        (9) jaanch ho, "inspection-be"
            agar kaar kii jaanch pahale hii ho chuki hai to report mangiye
            if car poss inspection earlier emph happen comp. be-present then report ask-imp-hon
            "If the car has already been inspected please ask to see the report."
        These separated multi-word expressions constitute some of the most difficult problems for any language; for example, one may compare them with English phrasal verbs like "give up", which can sometimes occur discontinuously. However, owing to the relatively free word order in Hindi, the discontinuous CPs in Hindi are separated by a variety of structures, ranging from simple emphatic or focal particles and negation markers to clausal constituents. How these structures are to be encoded in a computational lexicon is a complex matter that takes us beyond CP identification (Villavicencio et al., 2004). But while rule-based identification of such constructs is problematic, we feel that POS-tag projection holds considerable promise in this direction.</Paragraph>
      <Paragraph position="1"> In the algorithm above we have only considered the target language (Hindi) tags after the parallel tagging is completed. If, in addition, we also consider the source language tag and its radiation, the CP probabilities may be redefined in a manner that helps capture some discontinuous CPs as well. Thus, if English "complain" radiates to shikaayat and kara, the inherent CP can be detected even in the presence of an intermediate phrase. An example from the POS-tagged data exhibiting discontinuous CP detection is presented in [...] although they are separated by the phrase pahale-bhi (already). Thus, using source and target languages together, the parallel projection method may have the potential for discovering discontinuous CPs as well.</Paragraph>
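      <Paragraph> The radiation idea can be sketched as follows: group the Hindi words by the English word they are aligned to, and report a host and a light verb that share the same English verb but are not adjacent. This is a hedged illustration of the proposal rather than an implemented part of the system; the data layout, the function name and the light-verb set passed in are assumptions.

```python
from collections import defaultdict

HOST_CLASSES = {"N", "A", "V", "Adv"}


def find_discontinuous_cps(alignment, english_tags, lexicon_tags, light_verbs):
    """Pair a host and a light verb that radiate from the same English verb.

    alignment: list of (hindi_root, english_index) pairs in Hindi word order;
               english_index is None for words aligned to NULL.
    english_tags: reduced tags, indexed like the English sentence.
    lexicon_tags: dict from Hindi root to its lexicon POS tag.
    """
    radiation = defaultdict(list)  # english_index to [(hindi_position, root)]
    for position, (root, english_index) in enumerate(alignment):
        if english_index is not None:
            radiation[english_index].append((position, root))
    cps = []
    for english_index, targets in radiation.items():
        if english_tags[english_index] != "V":
            continue
        for i, (host_pos, host) in enumerate(targets):
            if lexicon_tags.get(host) not in HOST_CLASSES:
                continue
            for lv_pos, lv in targets[i + 1:]:
                gap = lv_pos - host_pos - 1  # number of intervening words
                if lv in light_verbs and gap > 0:
                    cps.append((host, lv, gap))
    return cps
```

      With a hypothetical alignment of example (9) in which both jaanch and ho radiate from "inspected", the pair jaanch ... ho is reported with two intervening words.</Paragraph>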
    </Section>
  </Section>
</Paper>