<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0207">
  <Title>Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation</Title>
  <Section position="3" start_page="69" end_page="69" type="metho">
    <SectionTitle>
1. \[\[CAT CONN\] \[ROOT oysa\]\]
</SectionTitle>
    <Paragraph position="0"> (on the other hand)</Paragraph>
  </Section>
  <Section position="4" start_page="69" end_page="69" type="metho">
    <SectionTitle>
2. \[\[CAT NOUN\] \[ROOT oy\] \[AGR 3SG\]
\[POSS NONE\] \[CASE NOM\]
\[CONV VERB NONE\]
\[TAM1 COND\] \[AGR 3SG\]\]
</SectionTitle>
    <Paragraph position="0"> (if it is a vote)</Paragraph>
  </Section>
  <Section position="5" start_page="69" end_page="69" type="metho">
    <SectionTitle>
3. \[\[CAT PRONOUN\] \[ROOT o\] \[TYPE DEMONS\]
</SectionTitle>
    <Paragraph position="0"> On the other hand, the form oya gives rise to the following parses:</Paragraph>
  </Section>
  <Section position="6" start_page="69" end_page="70" type="metho">
    <SectionTitle>
1. \[\[CAT NOUN\] \[ROOT oya\] \[AGR 3SG\]
\[POSS NONE\] \[CASE NOM\]\] (lace)
2. \[\[CAT NOUN\] \[ROOT oy\] \[AGR 3SG\]
\[POSS NONE\] \[CASE DAT\]\] (to the vote)
3. \[\[CAT VERB\] \[ROOT oy\] \[SENSE POS\]
</SectionTitle>
    <Paragraph position="0"> \[TAM1 OPT\] \[AGR 3SG\]\] (let him carve) and the form oyun gives rise to the following parses:
1Output of the morphological analyzer is edited for clarity, and English glosses have been given.</Paragraph>
    <Paragraph position="1"> 2Glosses are given as linear feature value sequences corresponding to the morphemes (which are not shown).</Paragraph>
    <Paragraph position="2"> The feature names are as follows: CAT-major category, TYPE-minor category, ROOT-main root form, AGR-number and person agreement, POSS-possessive agreement, CASE-surface case, CONV-conversion to the category following, with a certain suffix indicated by the argument after that, TAM1-tense, aspect, mood marker 1, SENSE-verbal polarity, DES-desire mood, IMP-imperative mood, OPT-optative mood, COND-conditional mood.</Paragraph>
  </Section>
  <Section position="7" start_page="70" end_page="70" type="metho">
    <SectionTitle>
1. \[\[CAT NOUN\] \[ROOT oyun\] \[AGR 3SG\]
\[POSS NONE\] \[CASE NOM\]\] (game)
2. \[\[CAT NOUN\] \[ROOT oy\] \[AGR 3SG\]
\[POSS NONE\] \[CASE GEN\]\] (of the vote)
3. \[\[CAT NOUN\] \[ROOT oy\] \[AGR 3SG\]
\[POSS 2SG\] \[CASE NOM\]\] (your vote)
4. \[\[CAT VERB\] \[ROOT oy\] \[SENSE POS\]
\[TAM1 IMP\] \[AGR 2PL\]\] (carve it!)
</SectionTitle>
    <Paragraph position="0"> On the other hand, the local syntactic context may help reduce some of the ambiguity above, as in: 3</Paragraph>
  </Section>
  <Section position="8" start_page="70" end_page="70" type="metho">
    <SectionTitle>
(NOUN NOUN-POSS form)
</SectionTitle>
    <Paragraph position="0"> using some very basic noun phrase agreement constraints in Turkish. Obviously in other similar cases, it may be possible to resolve the ambiguity completely. There are also numerous other examples of word forms where productive derivational processes come into play:4 geldiGimdeki (at the time I came)</Paragraph>
  </Section>
  <Section position="9" start_page="70" end_page="70" type="metho">
    <SectionTitle>
\[CONV ADJ REL\]\]
</SectionTitle>
    <Paragraph position="0"> (final adjectivalization by the relative (ki) suffix) Here, the original root is verbal but the final part-of-speech is adjectival. In general, the ambiguities of the forms that come before such a form in text can be resolved with respect to its original (or intermediate) parts-of-speech (and inflectional features), while the ambiguities of the forms that follow can be resolved based on its final part-of-speech.</Paragraph>
    <Paragraph position="1"> The main intent of our system is to achieve a morphological ambiguity reduction in the text by choosing, for a given ambiguous token, a subset of its
3With a slightly different but nevertheless common glossing convention.</Paragraph>
    <Paragraph position="2"> 4Upper case letters in the morphological output indicate one of the non-ASCII special Turkish characters: e.g., G denotes ğ, U denotes ü, etc.</Paragraph>
    <Paragraph position="3"> parses which are not disallowed by the syntactic context it appears in. It is certainly possible that a given token may have multiple correct parses, usually with the same inflectional features or with inflectional features not ruled out by the syntactic context. These can usually be disambiguated only on semantic or discourse constraint grounds.5 We consider a token fully disambiguated if it has only one morphological parse remaining after automatic disambiguation. We consider a token as correctly disambiguated if one of the parses remaining for that token is the correct intended parse.6 We evaluate the resulting disambiguated text by a number of metrics defined following Voutilainen. In the ideal case where each token is uniquely and correctly disambiguated with the correct parse, both recall and precision will be 1.0. On the other hand, for a text where each token is annotated with all possible parses,7 the recall will be 1.0 but the precision will be low. The goal is to have both recall and precision as high as possible.</Paragraph>
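    <Paragraph> A hedged reading of these metrics, consistent with the behaviour described in this paragraph (an assumption rather than a quotation of the published formulas), is:
$$ \text{ambiguity} = \frac{\#\,\text{parses remaining}}{\#\,\text{tokens}}, \qquad \text{recall} = \frac{\#\,\text{tokens whose remaining parses include the correct parse}}{\#\,\text{tokens}}, \qquad \text{precision} = \frac{\#\,\text{tokens whose remaining parses include the correct parse}}{\#\,\text{parses remaining}} $$
Under this reading, one correct parse per token gives recall = precision = 1.0, while keeping every parse leaves recall at 1.0 but pushes precision down toward the reciprocal of the average ambiguity.</Paragraph>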
  </Section>
  <Section position="10" start_page="70" end_page="71" type="metho">
    <SectionTitle>
3 Constraint-based Morphological Disambiguation
</SectionTitle>
    <Paragraph position="0"> This section outlines our approach to constraint-based morphological disambiguation incorporating an unsupervised learning component. Our system, with the structure presented in Figure 1, has three main components:
1. the preprocessor,
2. the learning module, and
3. the morphological disambiguation module.
Preprocessing is common to both the learning and the morphological disambiguation modules. The module takes raw Turkish text as input to the system and preprocesses it in a manner to be described shortly.</Paragraph>
    <Paragraph position="1"> If the text is to be used for training, the learning module then
1. applies the hand-crafted rules to the training corpus, and
2. uses an unsupervised learning procedure to induce some additional (and possibly corpus-dependent) rules to choose and delete some parses.
Morphological disambiguation of previously unseen text proceeds as follows:
1. The hand-crafted rules are applied first.
2. Certain parses are deleted using context statistics on the corpus to be tagged.
3. Rules learned to choose and delete parses are then applied.</Paragraph>
    <Section position="1" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
3.1 The Preprocessor
</SectionTitle>
      <Paragraph position="0"> The preprocessing module takes as input a Turkish text, segments it into sentences using various heuristics about punctuation, tokenizes and runs it through a wide-coverage high-performance morphological analyzer developed using two-level morphology tools by Xerox (Karttunen, 1993). This module also performs a number of additional functions: * it groups lexicalized collocations such as idiomatic forms, semantically coalesced forms such as proper noun groups, certain numeric forms, etc.</Paragraph>
      <Paragraph position="1"> * it groups any compound verb formations which are formed by a lexically adjacent, direct or oblique object, and a verb, and which, for the purposes of syntactic analysis, may be considered as a single lexical item: e.g., saygı durmak (to pay respect), kafayı yemek (literally to eat the head - to get mentally deranged), etc.</Paragraph>
      <Paragraph position="2"> * it groups non-lexicalized collocations: Turkish abounds with various non-lexicalized collocations where the sentential role of the collocation has (almost) nothing to do with the parts-of-speech of the individual forms involved. Almost all of these collocations involve duplications, and have forms like w + x w + y where w is the duplicated string comprising the root and certain sequence of suffixes and x and y are possibly different (or empty) sequences of other suffixes.</Paragraph>
      <Paragraph position="3"> The following is a list of multi-word constructs for Turkish that we handle in our preprocessor. This list is not meant to be comprehensive, and new construct specifications can easily be added. It is conceivable that such a functionality can be used in almost any language. (See Oflazer and Kuruöz (1994) and Kuruöz (1994) for details of all other forms for Turkish.)
1. duplicated optative and 3SG verbal forms functioning as manner adverbs. An example is koşa koşa, where each lexical item has the morphological parse \[\[CAT VERB\] \[ROOT koS\] \[SENSE POS\] \[TAM1 OPT\] \[AGR 3SG\]\]. The preprocessor recognizes this and generates the feature sequence:</Paragraph>
      <Paragraph position="4"> \[\[CAT VERB\] \[ROOT koS\] \[SENSE POS\] \[TAM1 OPT\] \[AGR 3SG\] \[CONV ADVERB DUP\] \[TYPE MANNER\]\]
2. aorist verbal forms with root duplication and sense negation, functioning as temporal adverbs. For instance, for the non-lexicalized collocation yapar yapmaz, where the items have the parses \[\[CAT VERB\] \[ROOT yap\] \[SENSE POS\] \[TAM1 AORIST\] \[AGR 3SG\]\] and \[\[CAT VERB\] \[ROOT yap\] \[SENSE NEG\] \[TAM1 AORIST\] \[AGR 3SG\]\] respectively, the preprocessor generates the feature sequence \[\[CAT VERB\] \[ROOT yap\] \[SENSE POS\] \[TAM1 AORIST\] \[AGR 3SG\] \[CONV ADVERB DUP-AOR\] \[TYPE TEMP\]\]
3. duplicated verbal and derived adverbial forms with the same verbal root acting as temporal adverbs, e.g., gitti gideli,
4. emphatic adjectival forms involving duplication and the question clitic, e.g., güzel mi güzel (beautiful question-clitic beautiful - very beautiful),
5. adjective or noun duplications that act as manner adverbs, e.g., hızlı hızlı, ev ev.
This module recognizes all such forms and coalesces them into new feature structures reflecting the final structure along with any inflectional information.</Paragraph>
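      <Paragraph> As a concrete illustration of this coalescing step for construct 1, the following is a minimal sketch; the data structures and function names are assumptions made for the example (flat feature dictionaries), not the authors' implementation.
def is_opt_3sg_verb(parse):
    # True for parses like [[CAT VERB] ... [TAM1 OPT] [AGR 3SG]]
    return (parse.get("CAT") == "VERB"
            and parse.get("TAM1") == "OPT"
            and parse.get("AGR") == "3SG")

def coalesce_duplicated_optative(tokens):
    # tokens: list of (surface, parses); adjacent identical optative 3SG forms
    # are merged into a single manner-adverb token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i][0] == tokens[i + 1][0]:
            dup = [p for p in tokens[i][1] if is_opt_3sg_verb(p)]
            if dup and any(is_opt_3sg_verb(p) for p in tokens[i + 1][1]):
                merged = dict(dup[0])                       # keep the verbal features
                merged.update({"CONV": ("ADVERB", "DUP"), "TYPE": "MANNER"})
                out.append((tokens[i][0] + " " + tokens[i + 1][0], [merged]))
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

# e.g. for koSa koSa:
kosa = {"CAT": "VERB", "ROOT": "koS", "SENSE": "POS", "TAM1": "OPT", "AGR": "3SG"}
coalesce_duplicated_optative([("koSa", [kosa]), ("koSa", [kosa])])
</Paragraph>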
      <Paragraph position="5"> * The preprocessor then converts each parse into a hierarchical feature structure so that the inflectional features of the form with the last category conversion (if any) are at the top level. Thus in the example above for geldiGimdeki, the following feature structure is generated:</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="71" end_page="72" type="metho">
    <SectionTitle>
SUFFIX REL
</SectionTitle>
    <Paragraph position="0"> * Finally, each such feature structure is then projected on a subset of its features. The features selected are - inflectional and certain derivational markers, and stems for open class of words,</Paragraph>
  </Section>
  <Section position="12" start_page="72" end_page="76" type="metho">
    <SectionTitle>
(Preprocessing pipeline stages: tokenization, morphology, non-lexicalized collocation grouping, unknown word processing, format conversion)
</SectionTitle>
    <Paragraph position="0"> -- roots and certain relevant features such as subcategorization requirements for closed classes of words such as connectives, postpositions, etc.</Paragraph>
    <Paragraph position="1"> The set of features selected for each part-of-speech category is determined by a template and hence is controllable, permitting experimentation with differing levels of information. The information selected for stems is determined by the category of the stem itself, recursively. Under certain circumstances where a token has two or more parses that agree in the selected features, those parses will be represented by a single projected parse; hence the number of parses in the (projected) training corpus may be smaller than the number of parses in the original corpus. For example, the feature structure above is projected onto a smaller feature structure retaining only the selected features.</Paragraph>
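    <Paragraph> A minimal sketch of this template-driven projection (the template contents and function names below are illustrative assumptions, not the actual templates used):
TEMPLATES = {                         # which features each category keeps (illustrative)
    "NOUN": ["CAT", "AGR", "POSS", "CASE"],
    "VERB": ["CAT", "SENSE", "TAM1", "AGR"],
    "POSTP": ["CAT", "ROOT", "SUBCAT"],   # closed class: keep the root as well
}

def project(parse):
    # Keep only the template-selected features; a full implementation would
    # also recurse into stems according to the stem's own category.
    keep = TEMPLATES.get(parse["CAT"], ["CAT"])
    return tuple(sorted((f, v) for f, v in parse.items() if f in keep))

def project_token(parses):
    # Identical projected parses collapse into one, which is why the projected
    # training corpus can contain fewer parses than the original corpus.
    return [dict(p) for p in {project(parse) for parse in parses}]
</Paragraph>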
    <Section position="1" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
3.2 Unknown Words
</SectionTitle>
      <Paragraph position="0"> Although the coverage of our morphological analyzer for Turkish (Oflazer, 1993), with about 30,000 root words and about 35,000 proper names, is very satisfactory, it is inevitable that there will be forms in the corpora being processed that are not recognized by the morphological analyzer. These are almost always foreign proper names, words adapted into the language and not in the lexicon, or very obscure technical words. These are nevertheless inflected (using Turkish word formation paradigms) with inflectional features demanded by the syntactic context and sometimes even go through derivational processes. For improved disambiguation, one has to at least recover any morphological features even if the root word is unknown. To deal with this, we have made the assumption that all unknown words have nominal roots, and built a second morphological analyzer whose (nominal) root lexicon recognizes S+, where S is the Turkish surface alphabet (in the two-level morphology sense), but which then tries to interpret an arbitrary postfix of the unknown word as a sequence of Turkish suffixes subject to all morphographemic constraints. For instance, when a form such as talkshowumun is entered, this second analyzer hypothesizes the following analyses:
1. \[\[CAT NOUN\] \[ROOT talkshowumun\] \[AGR 3SG\] \[POSS NONE\] \[CASE NOM\]\]
2. \[\[CAT NOUN\] \[ROOT talkshowumu\] \[AGR 3SG\] \[POSS 2SG\] \[CASE NOM\]\]
3. \[\[CAT NOUN\] \[ROOT talkshowum\] \[AGR 3SG\] \[POSS NONE\] \[CASE GEN\]\]
4. \[\[CAT NOUN\] \[ROOT talkshowum\] \[AGR 3SG\] \[POSS 2SG\] \[CASE NOM\]\]
5. \[\[CAT NOUN\] \[ROOT talkshowu\] \[AGR 3SG\] \[POSS 1SG\] \[CASE GEN\]\]
6. \[\[CAT NOUN\] \[ROOT talkshow\] \[AGR 3SG\] \[POSS 1SG\] \[CASE GEN\]\]
which are then processed just like any other parses during disambiguation.8 This however is not a sufficient solution for some very obscure situations where the foreign word is written using, say, its English orthography, while suffixation goes on according to its English pronunciation, which may make some constraints like vowel
8Incidentally, the correct analysis is the 6th, meaning of my talk show. The 5th one has the same morphological features except for the root.</Paragraph>
      <Paragraph position="1">  harmony inapplicable on the graphemic representation, though harmony is in effect in the pronunciation. For instance one sees the form Carter'a where the last vowel in Carter is pronounced so that it harmonizes with a in Turkish, while the e in the surface form does not harmonize with a. We are nevertheless rather satisfied with our solution as in our experiments we have noted that well below 1% of the forms remain as unknown and these are usually item markers in formatted or itemized lists, or obscure foreign acronyms.</Paragraph>
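      <Paragraph> A rough illustration of the unknown-word analysis (an assumption about its behaviour, not the actual two-level implementation): every non-empty prefix is hypothesized as a nominal root and the remaining postfix is matched against a suffix inventory; real morphographemic constraints such as vowel harmony are ignored here for brevity.
SUFFIX_READINGS = {                 # tiny, illustrative inventory of postfixes
    "":     [{}],
    "un":   [{"CASE": "GEN"}, {"POSS": "2SG"}],
    "umun": [{"POSS": "1SG", "CASE": "GEN"}],
}

def analyze_unknown(word):
    analyses = []
    for cut in range(1, len(word) + 1):
        root, rest = word[:cut], word[cut:]
        for feats in SUFFIX_READINGS.get(rest, []):
            parse = {"CAT": "NOUN", "ROOT": root, "AGR": "3SG",
                     "POSS": "NONE", "CASE": "NOM"}
            parse.update(feats)
            analyses.append(parse)
    return analyses

# analyze_unknown("talkshowumun") yields a subset of the analyses listed above.
</Paragraph>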
    </Section>
    <Section position="2" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.3 Constraint Rules
</SectionTitle>
      <Paragraph position="0"> The system uses rules of the sort if LC and RC then choose PARSE or if LC and RC then delete PARSE, where LC and RC are feature constraints on the unambiguous left and right contexts of a given token, and PARSE is a feature constraint on the parse(s) that is (are) chosen (or deleted) in that context if they are subsumed by that constraint. Currently the left and right contexts can be at most 2 tokens, hence we look at a window of at most 5 tokens of which one is ambiguous. We refer to the unambiguous tokens in the context as llc (left-left context), lc (left context), rc (right context) and rrc (right-right context). Depending on the number of unambiguous tokens in a context, our rules can have one of the following context structures, listed in order of decreasing specificity:
1. llc, lc .... rc, rrc
2. llc, lc ....
3. .... rc, rrc
4. lc .... rc</Paragraph>
      <Paragraph position="1"> To illustrate the flavor of our rules we can give the following examples. The first example chooses parses with the case feature ablative, preceding an unambiguous postposition which subcategorizes for an ablative nominal form.</Paragraph>
      <Paragraph position="2"> \[llc:\[\], lc:\[\], choose:\[case:abl\], rc:\[\[cat:postp, subcat:abl\]\], rrc:\[\]\]
A second example rule is
\[llc:\[\[cat:adj, type:determiner\]\], lc:\[\[cat:adj, stem:\[cat:noun\]\]\], choose:\[cat:adj\], rc:\[\[cat:noun, poss:'NONE'\]\], rrc:\[\]\]
which selects an adjective parse following a determiner, adjective sequence, and before a noun without a possessive marker.</Paragraph>
      <Paragraph position="3"> Another sample rule is:
\[llc:\[\], lc:\[\[agr:'2SG', case:gen\]\], choose:\[cat:noun, poss:'2SG'\], rc:\[\], rrc:\[\]\]
which chooses a nominal form with a 2SG possessive marker following a pronoun with 2SG agreement and genitive case, enforcing the simplest form of genitive-possessive noun phrase constraints. Our system uses two hand-crafted sets of rules, in combination with the rules that are learned by unsupervised learning:
1. We use an initial set of hand-crafted choose rules to speed up the learning process by creating disambiguated contexts over which statistics can be collected. These rules (examples of which are given above) are independent of the corpus that is to be tagged, and are linguistically motivated. They enforce some very common feature patterns, especially where word order is rather strict as in NPs or PPs.9 The motivation behind these rules is that they should improve precision without sacrificing recall. These are rules which impose very tight constraints so as not to make any recall errors. Our experience is that after processing with these rules, the recall is above 99% while precision improves by about 20 percentage points. Another important feature of these rules is that they are applied even if the contexts are also ambiguous, as the constraints are tight. That is, if each token in a sequence of, say, three ambiguous tokens has a parse matching one of the context constraints (in the proper order), then all of them are simultaneously disambiguated. In hand crafting these rules, we have used our experience from an earlier tagger (Oflazer and Kuruöz, 1994). Currently we use 288 hand-crafted choose rules.</Paragraph>
      <Paragraph position="4"> 2. We also use a set of hand-crafted heuristic delete rules to get rid of any very low probability parses. For instance, in Turkish, postpositions have rather strict contextual constraints and if there are tokens remaining with multiple parses one of which is a postposition reading, we delete that reading. Our experience is that these rules improve precision by about 10 to 12 additional percentage points with negligible impact on recall. Currently we use 43 hand-crafted delete rules.</Paragraph>
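      <Paragraph> To make the rule format of this subsection concrete, the following is a minimal sketch of one way such rules could be represented and applied (an illustrative representation, not the authors' implementation). A constraint is a feature dictionary; it subsumes a parse if every feature it mentions has the same value in that parse. For simplicity the sketch only fires when the neighbouring tokens are already unambiguous, whereas the hand-crafted rules described above may also fire simultaneously in ambiguous contexts.
def subsumes(constraint, parse):
    return all(parse.get(f) == v for f, v in constraint.items())

def context_matches(pattern, tokens, pos):
    # pattern maps offsets (-2, -1, 1, 2) to constraints on the unambiguous
    # neighbours of the ambiguous token at index pos.
    for off, constraint in pattern.items():
        j = pos + off
        if j < 0 or j >= len(tokens) or len(tokens[j]) != 1:
            return False
        if not subsumes(constraint, tokens[j][0]):
            return False
    return True

def apply_choose_rule(rule, tokens):
    # rule = (context_pattern, keep_constraint); tokens is a list of parse lists.
    pattern, keep = rule
    for pos, parses in enumerate(tokens):
        if len(parses) > 1 and context_matches(pattern, tokens, pos):
            chosen = [p for p in parses if subsumes(keep, p)]
            if chosen:                          # never remove every parse
                tokens[pos] = chosen
    return tokens

# The ablative/postposition rule shown earlier, in this representation:
ablative_rule = ({1: {"cat": "postp", "subcat": "abl"}}, {"case": "abl"})
</Paragraph>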
    </Section>
    <Section position="3" start_page="73" end_page="75" type="sub_section">
      <SectionTitle>
3.4 Learning Choose Rules
</SectionTitle>
      <Paragraph position="0"> Given a training corpus, with tokens annotated with possible parses (projected over selected features), we first apply the hand-crafted rules. Learning then goes on as a number of iterations over the training corpus. We proceed with the following schema which is an adaptation of Brill's formulation (Brill, 1995b):  1. We generate a table, called incontext, of all possible unambiguous contexts which contain a token with an unambiguous (projected) parse, along with a count of how many times this parse occurs unambiguously in exactly the same context in the corpus. We refer to an entry in table with a context C and parse P as incontext(C, P).</Paragraph>
      <Paragraph position="1"> 2. We also generate a table, called count, of all unambiguous parses in the corpus along with a count of how many times this parse occurs in the corpus. We refer to an entry in this table with a given parse P, as count(P).</Paragraph>
      <Paragraph position="2">  3. We then start going over the corpus token by token generating contexts as we go.</Paragraph>
      <Paragraph position="3"> 4. For each unambiguous context encountered, C = (LC, RC),10 around an ambiguous token w with parses P1, ..., Pk, and for each parse Pi, we generate a candidate rule of the sort: if LC and RC then choose Pi. 5. Every such candidate rule is then scored in the following fashion: (a) We compute</Paragraph>
      <Paragraph position="5"> (b) The score of the candidate rule is then computed from these quantities (a hedged reconstruction of this scoring is sketched after this list).</Paragraph>
      <Paragraph position="7"> 6. We order all candidate rules generated during one pass over the corpus, along two dimensions: (a) we group candidate rules by context specificity (given by the order in Section 3.3), (b) in each group, we order rules by descending  score.</Paragraph>
      <Paragraph position="8"> We maintain score thresholds associated with each context specificity group: the threshold of a less specific group being higher than that of a more specific group. We then choose the top scoring rule from any group whose score equals or exceeds the threshold associated with that group. The reasoning is that we prefer more specific and/or high scoring rules: high scoring rules are applicable, in general, in more places, while more specific rules have stricter constraints and more accurate morphological parse selections. We have noted that choosing the highest scoring rule at every step may sometimes make premature commitments which cannot be undone later.</Paragraph>
      <Paragraph position="9"> 7. The selected rule is then applied to all matching contexts and ambiguity in those contexts is reduced. During this application the following are also performed: (a) if the application results in an unambiguous parse in the context of the applied rule, we increment the count associated with this parse in the count table. We also update the incontext table for the same context, and for other contexts which contain the disambiguated parse.</Paragraph>
      <Paragraph position="10"> (b) we also generate any new unambiguous contexts that this newly disambiguated token may give rise to, and add them to the incontext table, each with a count of 1.</Paragraph>
      <Paragraph position="11"> Note that for efficiency reasons, rule candidates are not generated repeatedly during each pass over the corpus, but rather once at the beginning, and then only for the very specific portions of the corpus affected when selected rules are applied.</Paragraph>
      <Paragraph position="12">  8. If there are no rules in any group that exceed its threshold, group thresholds are reduced by multiplying by a damping constant d (0 &lt; d &lt; 1) and iterations are continued.</Paragraph>
      <Paragraph position="13"> 9. If the threshold for the most specific context falls below a given lower limit, the learning process is terminated.</Paragraph>
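      <Paragraph> Steps 5(a) and 5(b) compute a Brill-style score for each candidate rule. Assuming the adaptation stays close to Brill's (1995b) unsupervised formulation (an assumption; the exact equations are not quoted here), for a candidate rule "if C then choose Pi" over an ambiguous token with parses P1, ..., Pk the score would take roughly the form
$$ R = \operatorname*{arg\,max}_{P_j,\; j \neq i} \; \frac{count(P_i)}{count(P_j)} \, incontext(C, P_j), \qquad score = incontext(C, P_i) \;-\; \frac{count(P_i)}{count(R)} \, incontext(C, R). $$
The first term counts how often Pi itself occurs unambiguously in exactly the context C, and the subtracted term estimates how often the strongest competing parse would be expected there, scaled by the relative corpus frequencies of the two parses.</Paragraph>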
      <Paragraph position="14"> Some of the rules that have been generated by this learning process are given below:
1. Disambiguate around a coordinating conjunction:
\[llc:\[\], lc:\[\], choose:\[cat:noun, agr:3SG, case:nom\], rc:\[\[cat:conn, root:ve\]\], rrc:\[\[cat:noun, agr:3SG, poss:NONE\]\]\]
2. Choose a participle form adjectival reading over a nominal reading:
\[llc:\[\], lc:\[\], choose:\[cat:adj, suffix:yan\], rc:\[\[cat:noun, agr:3SG, poss:NONE\]\], rrc:\[\[cat:noun, agr:3SG, poss:3SG\]\]\]
3. Choose a nominal reading (over an adjectival) if a three token compound noun agreement can be established with the next two tokens:
\[llc:\[\], lc:\[\],</Paragraph>
      <Paragraph position="16"> The procedure outlined in the previous section has to be modified slightly when the unambiguous token in the rc position is a morphologically derived form. For such cases one has to take into consideration additional pieces of information. We will motivate this using a simple example from Turkish. Consider the example fragment: ... bir masa+dır.</Paragraph>
      <Paragraph position="17"> ... a table+is ... is a table Here the first token, bir, is ambiguous between (among others) a determiner and an adverbial reading; in the correct analysis the determiner is attached to the noun and the whole phrase is then taken as a VP, although the verbal marker is on the second lexical item. If, in this case, the token bir is considered to neighbor a token whose top level inflectional features indicate it is a verb, it is likely that bir will be chosen as an adverb as it precedes a verb, whereas the correct parse is the determiner reading.</Paragraph>
      <Paragraph position="18"> In such a case, where the right context of an ambiguous token is a derived form, one has to consider as the right context both the top level features of the final form and the stem from which it was derived. During the set-up of the incontext table, such a context is entered twice: once with the top level feature constraints of the immediate unambiguous right-context, and once with the feature constraints of the stem (a minimal sketch of this bookkeeping is given at the end of this subsection). The unambiguous token in the right context is also entered into the count table once with its top level feature structure and once with the feature structure of the stem.</Paragraph>
      <Paragraph position="19"> When generating candidate choose or delete rules, for contexts where rc is a derived form and rrc is empty, we actually generate two candidate rules for each ambiguous token in that context:  1. if llc, lc and rc then choose/delete Pi.</Paragraph>
      <Paragraph position="20"> 2. if llc, lc and stem(rc) then choose/delete Pi.</Paragraph>
      <Paragraph position="21"> These candidate rules are then evaluated as described above. In general all derivations in a lexical form have to be considered, though we have noted that considering one level gives satisfactory results. Some morphological features are meaningful or relevant for disambiguation only when they appear to the left or to the right of the token to be disambiguated. For instance, in the case of Turkish, the CASE feature of a nominal form is only useful in the immediate left context, while the POSS (the possessive agreement marker) is useful only in the right context. If these features along with their possible values are included in context positions where they are not relevant, they &quot;split&quot; scores and hence cause the selection of some other irrelevant rule. Using the maxim that union gives strength, we create contexts so that features not relevant to a context position are not included, thereby treating contexts that differ only in these features as the same.11
11Obviously these features are specific to a language.</Paragraph>
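      <Paragraph> A hedged sketch of the bookkeeping referred to above for derived right contexts (illustrative data structures; a derived parse is assumed to carry its pre-derivation feature structure under a stem feature, as in the rule notation of Section 3.3):
from collections import defaultdict

incontext = defaultdict(int)   # (lc, parse, rc) -> frequency
count = defaultdict(int)       # parse -> frequency

def freeze(fs):
    # Make a flat feature structure usable as a dictionary key.
    return tuple(sorted((f, v) for f, v in fs.items() if not isinstance(v, dict)))

def record_context(lc, parse, rc):
    # Enter the context once with rc's top-level features and, if rc is a
    # derived form, once more with the features of its stem; the rc token is
    # counted the same way in the count table.
    variants = [rc]
    if isinstance(rc.get("stem"), dict):
        variants.append(rc["stem"])
    for r in variants:
        incontext[(freeze(lc), freeze(parse), freeze(r))] += 1
        count[freeze(r)] += 1
</Paragraph>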
    </Section>
    <Section position="4" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
3.5 Learning Delete Rules
</SectionTitle>
      <Paragraph position="0"> For choosing delete rules we have experimented with two approaches. One obvious approach is to use the formulation described above for learning choose rules, but instead of generating choose rules, pick the parses that score (significantly) worse than the best scoring parse and generate delete rules for such parses. We have implemented this approach and found that it is not very desirable for two reasons:  1. it generates far too many delete rules, and 2. it impacts recall seriously without a corresponding increase in precision.</Paragraph>
      <Paragraph position="1"> The second approach that we have used is considerably simpler. We first reprocess the training corpus, but this time use a second set of projection templates, and apply the initial rules, the learned choose rules and the heuristic delete rules. Then for every unambiguous context C = (LC, RC), with either an immediate left component, or an immediate right component, or both (so the contexts used here are the last 3 in Section 3.3), a score incontext(C, Pi)/count(Pi) for each parse Pi of the (still) ambiguous token is computed. Then, delete rules of the sort if LC and RC then delete Pi are generated for all parses with a score below a certain fraction (0.2 in our experiments) of the highest scoring parse. In this process, our main goal is to remove any seriously improbable parses which may somehow survive all the previous choose and delete constraints applied so far. Using a second set of templates which are more specific than the templates used during the learning of the choose rules, we introduce features that were originally projected out. Our experience has been that less strict contexts (e.g., just a lc or rc) generate very useful delete rules, which basically weed out what can (almost) never happen, as it is certainly not very feasible to formulate hand-crafted rules that specify what sequences of features are not possible.</Paragraph>
      <Paragraph position="2"> Some of the interesting delete rules learned here are:
1. Delete the first of two consecutive verb parses:
\[llc:\[\], lc:\[\], delete:\[cat:verb\], rc:\[\[cat:verb\]\], rrc:\[\]\]
2. Delete an accusative case marked noun parse before a postposition that subcategorizes for a nominative noun:
\[llc:\[\], lc:\[\], delete:\[cat:noun, agr:3SG, poss:NONE, case:acc\], rc:\[\[cat:postp, subcat:nom\]\], rrc:\[\]\].
3. Delete the accusative case marked parse without any possessive marking, if the previous form has genitive case marking (signaling a genitive-possessive NP construction):
\[llc:\[\], lc:\[\[cat:noun, agr:3SG, poss:NONE, case:gen\]\], delete:\[cat:noun, agr:3SG, poss:NONE, case:acc\], rc:\[\], rrc:\[\]\].</Paragraph>
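      <Paragraph> A minimal sketch of this simpler delete-rule scheme (data structures and the exact thresholding details are assumptions): for a still-ambiguous token in an unambiguous context, every parse scoring well below the best parse in that context yields an "if context then delete parse" rule.
FRACTION = 0.2   # fraction of the highest score, as in the experiments above

def delete_rules_for(context, parses, incontext, count):
    # context: a hashable (LC, RC) key; parses: hashable projected parses.
    scores = {p: incontext.get((context, p), 0) / max(count.get(p, 1), 1)
              for p in parses}
    best = max(scores.values(), default=0.0)
    return [(context, p) for p, s in scores.items()
            if best > 0 and s < FRACTION * best]
</Paragraph>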
    </Section>
    <Section position="5" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.6 Using context statistics to delete parses
</SectionTitle>
      <Paragraph position="0"> After applying the hand-crafted rules to a text to be disambiguated we arrive at a state where ambiguity is about 1.10 to 1.15 parses per token (down from 1.70 to 1.80 parses per token) without any serious loss in recall. This state allows statistics to be collected over unambiguous contexts. To remove additional parses which never appear in any unambiguous context we use the scoring described above for choosing delete rules, to discard parses in the current text based on context statistics.12 We make three passes over the current text, scoring parses in unambiguous contexts of the form used in generating delete rules, and discarding parses whose score is below a certain fraction of the maximum scoring parse, on the fly. The only difference with the scoring used for delete rules is that the score of a parse Pi here is a weighted sum of the quantity incontext(C, Pi)/count(Pi) evaluated for three contexts in the case both the lc and rc are unambiguous.
12Please note that delete rules learned may be applied to future texts to be disambiguated, while this step is applied to the current text on which disambiguation is performed.</Paragraph>
    </Section>
    <Section position="6" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.7 Steps in Disambiguating a Text
</SectionTitle>
      <Paragraph position="0"> Given a new text annotated with all morphological parses (this time the parses are not projected), we proceed with the following steps for disambiguation: 4. The choose rules that have been learned earlier are then repeatedly applied to unambiguous contexts, until no more ambiguity reduction is possible. During the application of these rules, if the immediate right context of a token is a derived form, then the stem of the right context is also checked against the constraint imposed by the rule. So if the rule's right context constraint subsumes either the top level feature structure or the stem feature structure, then the rule succeeds and is applied if all other constraints are also satisfied.</Paragraph>
      <Paragraph position="1"> 5. Finally, the delete rules that have been learned are applied repeatedly to unambiguous contexts, until no more ambiguity reduction is possible.</Paragraph>
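      <Paragraph> Steps 4 and 5 both apply a rule set repeatedly until a pass yields no further ambiguity reduction; a minimal sketch of that fixpoint loop (apply_rule stands for any per-rule application function, such as the choose-rule application sketched in Section 3.3):
def apply_until_fixpoint(rules, tokens, apply_rule):
    # tokens is a list of parse lists; stop when a full pass removes nothing.
    changed = True
    while changed:
        before = sum(len(parses) for parses in tokens)
        for rule in rules:
            apply_rule(rule, tokens)
        changed = sum(len(parses) for parses in tokens) < before
    return tokens
</Paragraph>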
    </Section>
  </Section>
  <Section position="13" start_page="76" end_page="79" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We have applied our learning system to two Turkish texts. Some statistics on these texts are given in Table 1. The first text, labeled ARK, is a short text on near eastern archaeology. The second text, from which the fragments whose labels start with C are derived, is a book on the early 20th century history of the Turkish Republic.</Paragraph>
    <Paragraph position="1"> In Table 1, the tokens considered are those that are generated after morphological analysis, unknown word processing and any lexical coalescing are done.</Paragraph>
    <Paragraph position="2"> The words that are unknown are those that could not even be processed by the unknown noun processor. Whenever an unknown word had more than one parse it was counted under the appropriate group. We learned rules from ARK itself, and on the first 500, 1000, and 2000 sentence portions of C2400. C270, which was drawn from the remaining 400 sentences of C2400, was set aside for testing. Gold standard disambiguated versions of ARK and C270 were prepared manually to evaluate the automatically tagged versions. Our results are summarized in the following set of tables. Tables 2 and 3 give the ambiguity, recall and precision initially, after hand-crafted rules are applied, and after the contextual statistics are used to remove parses - all applications being cumulative. The rows labeled BASE give the initial state of the text to be tagged. The rows labeled INITIAL CHOOSE give the state after hand-crafted choose rules are applied, while the rows labeled INITIAL DELETE give the state after the hand-crafted choose and delete rules are applied. The rows labeled CONTEXT STATISTICS give the state after the rules are applied and context statistics are used (as described earlier) to remove additional parses. Tables 5 and 6 present the results of further disambiguation of ARK and C270 using rules learned from the training texts C500, C1000, C2000 and ARK.</Paragraph>
    <Paragraph position="3"> These rules are applied after the last stage in the tables above.13 The number of rules learned is also given. At the sentence level, for each of the test texts, the columns labeled UA/C and A/C give the number and percentage of the sentences that are correctly disambiguated with one parse per token, and with more than one parse for at least one token, respectively. The columns labeled 1, 2, 3, and &gt;3 denote the number and percentage of sentences that have 1, 2, 3, and &gt;3 tokens with all remaining parses incorrect. It can be seen that well over 60% of the sentences are correctly morphologically disambiguated with a very small number of ambiguous parses remaining.
13Please note that for ARK, in the first two rows, the training and the test texts are the same.</Paragraph>
    <Paragraph position="4"> 14Learning iterations have been stopped when the maximum rule score fell below 7.</Paragraph>
    <Section position="1" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
4.1 Discussion of Results
</SectionTitle>
      <Paragraph position="0"> We can make a number of observations from our experience: Hand-crafted rules go a long way in improving precision substantially, but in a language like Turkish, one has to code rules that allow no, or only carefully controlled, derivations; otherwise lots of things go massively wrong. Thus we have used very tight and conservative rules in hand-crafting.</Paragraph>
      <Paragraph position="1"> Although the additional impact of the choose and delete rules that are induced by the unsupervised learning is not substantial, this is to be expected, as the stage at which they are used is when all the &quot;easy&quot; work has been done and the more notorious cases remain. An important class of rules we have explicitly avoided hand crafting are rules for disambiguating around coordinating conjunctions. We have noted that while learning choose rules, the system zeroes in rather quickly on these contexts and comes up with rather successful rules for conjunctions. Similarly, the delete rules find some interesting situations which would be virtually impossible to enumerate.</Paragraph>
      <Paragraph position="2"> Although it is easy to formulate what things can go together in a context, it is rather impossible to formulate what things can not go together.</Paragraph>
      <Paragraph position="3"> We have also attempted to learn rules directly without applying any hand-crafted rules, but this has resulted in a failure with the learning process getting stuck fairly early. This is mainly due to the lack of sufficient unambiguous contexts to bootstrap the whole disambiguation process.</Paragraph>
      <Paragraph position="4"> From analysis of our results we have noted that trying to choose one correct parse for every token is rather ambitious (at least for Turkish). There are a number of reasons for this: There are genuine ambiguities. The word o is either a personal or a demonstrative pronoun (in addition to being a determiner). One simply can not choose among the first two using any amount of contextual information.</Paragraph>
      <Paragraph position="5"> A given word may be interpreted in more than one way but with the same inflectional features, or with features not inconsistent with the syntactic context. This usually happens when the root of one of the forms is a proper prefix of the root of the other one. One would need serious amounts of semantic information, or statistical root word and word form preference information, for resolving these. For instance, both noun phrase readings may be syntactically possible, though the second one is obviously nonsense.</Paragraph>
      <Paragraph position="6"> It is not clear how one would disambiguate this using just contextual or syntactic information.</Paragraph>
      <Paragraph position="7"> Another similar example is:
kurmaya yardım etti
kur+ma+ya yardım et+ti
construct+INF+DAT help make+PAST
helped construct (something)
kurmay+a yardım et+ti
military-officer+DAT help make+PAST
helped the military-officer
where again we have a similar problem. It may be possible to resolve this one using subcategorization constraints on the object of the verb kur, assuming it is in the very near preceding context, but this may be very unlikely as Turkish allows arbitrary adjuncts between the object and the verb.</Paragraph>
      <Paragraph position="8"> Turkish allows sentences to consist of a number of sentences separated by commas. Hence locating a verb in the middle of a sentence is rather difficult, as certain verbal forms also have an adjectival reading, and punctuation is not very helpful as commas have many other uses.</Paragraph>
      <Paragraph position="9"> The distance between two constituents (of, say, a noun phrase) that have to agree in various morphosyntactic features may be arbitrarily long, and this causes occasional mismatches, especially if the right nominal constituent has a surface plural marker which causes a 4-way ambiguity, as in masaları.</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="79" end_page="79" type="metho">
    <SectionTitle>
4. \[\[CAT NOUN\] \[ROOT masa\] \[AGR 3SG\]
\[POSS 3PL\] \[CASE NOM\]\]
</SectionTitle>
    <Paragraph position="0"> (their table) Choosing among the last three is rather problematic if the corresponding genitive form to force agreement with is outside the context. Among these problems, the most crucial is the second one which we believe can be solved to a great extent by using root word preference statistics and word form preference statistics. We are currently working on obtaining such statistics.</Paragraph>
  </Section>
</Paper>