<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1024">
  <Title>Tagging and Morphological Disambiguation of Turkish Text</Title>
  <Section position="3" start_page="0" end_page="145" type="metho">
    <SectionTitle>
2 Tagging Text
</SectionTitle>
    <Paragraph position="0"> Automatic text tagging is an important step in discovering the linguistic structure of large text corpora. Basic tagging involves annotating the words in a given text with various pieces of information, such as part-of-speech and other lexical features.</Paragraph>
    <Paragraph position="1"> Part-of-speech tagging facilitates higher-level analysis, such as parsing, essentially by performing a certain amount of ambiguity resolution using relatively cheap methods.</Paragraph>
    <Paragraph position="2"> The most important function of a tagger is to resolve the structure and parts-of-speech of the lexical items in the text. This, however, is not a trivial task, since many words are in general ambiguous in their part-of-speech for various reasons.</Paragraph>
    <Paragraph position="3"> In English, for example, a word such as make can be a verb or a noun. In Turkish, even though there are ambiguities of this sort, the agglutinative nature of the language usually helps resolve them through morphotactic restrictions. On the other hand, this very nature introduces another kind of ambiguity, where a lexical form can be morphologically interpreted in many ways. For example, the word evin can be broken down as: [1]
        evin                 POS   English
        1. N(ev)+2SG-POSS    N     (your) house
        2. N(ev)+GEN         N     of the house
        3. N(evin)           N     wheat germ
    If, however, the local context is considered, it may be possible to resolve the ambiguity as in:
    [1] Output of the morphological analyzer is edited for clarity.</Paragraph>
    <Paragraph position="4">  (V) (s/he) was offended
    It is in general rather hard to select one of these interpretations without doing substantial analysis of the local context, and even then one cannot fully resolve such (usually semantic) ambiguities.</Paragraph>
    <Paragraph position="5"> An additional problem that can be off-loaded to the tagger is the recognition of multi-word or idiomatic constructs. In Turkish, which abounds with such forms, such a recognizer can recognize these very productive multi-word constructs, like</Paragraph>
    <Paragraph position="7"> where both components are verbal but the compound construct is a manner or temporal adverb.</Paragraph>
    <Paragraph position="8"> This relieves the parser of dealing with them at the syntactic level. Furthermore, it is also possible to recognize various proper nouns with this functionality. Such help from a tagging functionality would simplify the development of parsers for Turkish (Demir, 1993; Güngördü, 1993).</Paragraph>
    <Paragraph position="9"> Researchers have used a number of different approaches for building text taggers. Karlsson (Karlsson, 1990) has used a rule-based approach where the central idea is to maximize the use of morphological information. Local constraints expressed as rules discard many alternative parses whenever possible. Brill (Brill, 1992) has designed a rule-based tagger for English. The tagger works by automatically recognizing rules and remedying its weaknesses, thereby incrementally improving its performance.
    [2] With very minor differences, adjectives have the same morphotactics as nouns.</Paragraph>
    <Paragraph position="10"> More recently, there has been a rule-based approach implemented with finite-state machines (Koskenniemi et al., 1992; Voutilainen and Tapanainen, 1993).</Paragraph>
    <Paragraph position="11"> A completely different approach to tagging uses statistical methods (e.g., Church, 1988; Cutting et al., 1993). These systems essentially train a statistical model on a previously hand-tagged corpus and resolve ambiguity by choosing the most likely interpretation. The models that have been widely used assume that the part-of-speech of a word depends on the categories of the two preceding words. However, the applicability of such approaches to free word-order languages remains to be seen.</Paragraph>
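    <Paragraph> As an illustration only (the notation is an assumption of this sketch and is not taken from the paper), a model in which each tag depends on the two preceding tags is commonly written as a trigram model, where w_i is the i-th word, t_i its tag, and n the sentence length:

```latex
\hat{t}_{1:n} = \arg\max_{t_1,\dots,t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-2}, t_{i-1})
```

    Both probability tables would be estimated from the hand-tagged corpus. The cited systems may differ in detail from this generic form.</Paragraph>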
    <Section position="1" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
2.1 An example
</SectionTitle>
      <Paragraph position="0"> We can describe the process of tagging by showing the analysis for the sentence:
        İşten döner dönmez evimizin yakınında bulunan derin gölde yüzerek gevşemek en büyük zevkimdi.
        (Relaxing by swimming in the deep lake near our house as soon as I returned from work was my greatest pleasure.)
      which we assume has been processed by the morphological analyzer with the following output:
      Although there are a number of choices for tags for the lexical items in the sentence, almost all except one set of choices give rise to ungrammatical or implausible sentence structures. [3] There are a number of points that are of interest here:
      * The construct döner dönmez, formed by two tensed verbs, is actually a temporal adverb meaning "... as soon as ... return(s)"; hence these two lexical items can be coalesced into a single lexical item and tagged as a temporal adverb.</Paragraph>
      <Paragraph position="1"> * The second person singular possessive interpretation of yakınında is not possible since this word forms a simple compound noun phrase with the previous lexical item and the third person singular possessive morpheme functions as the compound marker, agreeing with the agreement of the previous genitive case-marked form.
      * The word derin (deep) is the modifier of a simple compound noun derin göl (deep lake), hence the second choice can safely be selected. The verbal root in the third interpretation is very unlikely to be used in text, let alone in second person imperative form. The fourth and the fifth interpretations are not very plausible either. The first interpretation (meaning your skin) may be a possible choice but can be discarded in the middle of a longer compound noun phrase.</Paragraph>
      <Paragraph position="2"> * The word en preceding an adjective indicates a superlative construction and hence the noun reading can be discarded.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="145" end_page="147" type="metho">
    <SectionTitle>
3 The Tagging Tool
</SectionTitle>
    <Paragraph position="0"> The tagging tool that we have developed integrates the following functionality with a user interface, as shown in Figure 1, implemented under X-windows.</Paragraph>
    <Paragraph position="1"> It can be used interactively, though user interaction is very rare and (optionally) occurs only when the disambiguation cannot be done by the tagger.</Paragraph>
    <Paragraph position="2">
      1. Morphological analysis with error logging,
      2. Multi-word and idiomatic construct recognition,
      3. Morphological disambiguation by using constraints, heuristics and certain statistics,
      4. Root and lexical form statistics compilation.
    The second and the third functionalities are implemented by a rule-based subsystem which allows one to write rules of the following form:</Paragraph>
    <Paragraph position="4"> where each Ci is a set of constraints on a lexical form, and the corresponding Ai is an action to be executed on the set of parses associated with that lexical form, only when all the conditions are satisfied.</Paragraph>
    <Paragraph position="5">  The conditions refer to any available morphological or positional feature associated with a lexical form such as:  tions.</Paragraph>
    <Paragraph position="6"> Conditions may refer to absolute feature values or variables (as in Prolog, denoted by the prefix _ in the following examples) which are then used to link conditions. All occurrences of a variable have to unify for the match to be considered successful. This feature is powerful and lets us specify, in a rather general way, (possibly long-distance) feature constraints in complex NPs, PPs and VPs. This is a part of our approach that distinguishes it from other constraint-based approaches.</Paragraph>
    <Paragraph position="7"> The actions are of the following types:  * Null action: Nothing is done on the matching parse.</Paragraph>
    <Paragraph position="8"> * Delete: Removes the matching parse if more than one parse for the lexical form are still in the set associated with the lexical form.</Paragraph>
    <Paragraph position="9"> * Output: Removes all but the matching parse from the set effectively tagging the lexical form with the matching parse.</Paragraph>
    <Paragraph position="10"> * Compose: Composes a new parse from various matching parses, for multi-word constructs.
    The rules are ordered and applied in the given order, and the actions licensed by any matching rule are carried out. One rule formalism is used to encode both multi-word constructs and constraints.</Paragraph>
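    <Paragraph> The Delete and Output actions can be sketched as operations on the list of parses attached to one lexical form. This is a minimal illustration, not the tool's implementation: the Parse class, its field names and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parse:
    """One morphological analysis of a lexical form (field names are illustrative)."""
    root: str
    cat: str
    features: tuple = ()


def apply_delete(parses, matching):
    """Delete: drop the matching parse, but only if another parse would remain."""
    if matching in parses and len(parses) > 1:
        return [p for p in parses if p != matching]
    return list(parses)


def apply_output(parses, matching):
    """Output: keep only the matching parse, which tags the lexical form with it."""
    return [matching] if matching in parses else list(parses)


# The three analyses of "evin" listed in Section 2.
evin = [
    Parse("ev", "N", ("2SG-POSS",)),
    Parse("ev", "N", ("GEN",)),
    Parse("evin", "N"),
]

after_delete = apply_delete(evin, evin[2])        # discard the "wheat germ" reading
after_output = apply_output(evin, evin[1])        # commit to the genitive reading
last_one_kept = apply_delete([evin[0]], evin[0])  # a sole remaining parse is never deleted
```

    The guard in apply_delete mirrors the condition stated for Delete above: a parse is removed only while more than one parse is still in the set, so a lexical form is never left without an analysis.</Paragraph>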
    <Section position="1" start_page="145" end_page="146" type="sub_section">
      <SectionTitle>
3.1 The Multi-word Construct Processor
</SectionTitle>
      <Paragraph position="0"> As mentioned before, tagging text on a lexical-item basis may generate spurious or incorrect results when multiple lexical items act as a single syntactic or semantic entity. For example, in the sentence Şirin mi şirin bir köpek koşa koşa geldi (A very cute dog came running), the fragment şirin mi şirin constitutes a duplicated emphatic adjective in which there is an embedded question suffix mi (written separately in Turkish), [4] and the fragment koşa koşa is a duplicated verbal construction, which has the grammatical role of manner adverb in the sentence, though both of the constituent forms are verbal constructions.
      [4] If, however, the adjective şirin were not repeated, then we would have a question formation.</Paragraph>
      <Paragraph position="1"> The purpose of the multi-word construct processor is to detect and tag such productive constructs in addition to various other semantically coalesced forms such as proper nouns, etc.</Paragraph>
      <Paragraph position="2"> The following is a set of multi-word constructs for Turkish that we handle in our tagger. This list is not meant to be comprehensive, and new construct specifications can easily be added. It is conceivable that such a functionality can be used in almost any language.</Paragraph>
      <Paragraph position="3"> 1. duplicated optative and 3SG verbal forms functioning as manner adverbs, e.g., koşa koşa; aorist verbal forms with root duplication and sense negation functioning as temporal adverbs, e.g., yapar yapmaz; and duplicated verbal and derived adverbial forms with the same verbal root acting as temporal adverbs, e.g., gitti gideli,
      2. duplicated compound nominal form constructions that act as adjectives, e.g., güzeller güzeli, and emphatic adjectival forms involving the question suffix, e.g., güzel mi güzel,
      a lexically adjacent, direct or oblique object and a verb, which for the purposes of syntactic analysis, may be considered as a single lexical item.</Paragraph>
      <Paragraph position="4"> We can give the following example for specifying a multi-word construct: [5]
      This rule would match any two adjacent verbal lexical forms with the same root, both with the aorist aspect and 3SG agreement. The first verb has to be positive and the second one negated. When such a pair is found, a composite lexical form with a temporal adverb part-of-speech is generated. The original verbal root may be recovered from the root of the composed form for any subcategorization checks at the syntactic level.</Paragraph>
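      <Paragraph> The matching conditions and the Compose action described above can be sketched as follows. This is an illustrative reading of the prose, not the rule syntax of the tool: the Parse fields, the feature values (AOR, 3SG, POS, NEG) and the ADV-TEMP label are assumed names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Parse:
    """One analysis of a lexical form; all field names are illustrative."""
    form: str
    root: str
    cat: str
    aspect: str = ""
    agr: str = ""
    sense: str = ""


def compose_temporal_adverb(first: Parse, second: Parse) -> Optional[Parse]:
    """Compose two adjacent aorist 3SG verbs (positive, then negated) into one adverb."""
    same_root = first.root == second.root  # plays the role of a shared rule variable
    both_aorist_3sg_verbs = all(
        p.cat == "V" and p.aspect == "AOR" and p.agr == "3SG" for p in (first, second)
    )
    polarity_ok = first.sense == "POS" and second.sense == "NEG"
    if same_root and both_aorist_3sg_verbs and polarity_ok:
        # The verbal root is kept so that later stages can still recover it.
        return Parse(form=first.form + " " + second.form, root=first.root, cat="ADV-TEMP")
    return None


doner = Parse("döner", "dön", "V", "AOR", "3SG", "POS")
donmez = Parse("dönmez", "dön", "V", "AOR", "3SG", "NEG")
composed = compose_temporal_adverb(doner, donmez)
```

      Keeping the verbal root on the composed form reflects the remark that the root may be needed for subcategorization checks at the syntactic level.</Paragraph>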
    </Section>
    <Section position="2" start_page="146" end_page="147" type="sub_section">
      <SectionTitle>
3.2 Using constraints for morphological
</SectionTitle>
      <Paragraph position="0"> of a lexical form has several distinct analyses, it is not possible to disambiguate such cases except maybe by using root usage frequencies. For disambiguation one may have to use information provided by sentential position and the local morphosyntactic context. Voutilainen and Heikkila (Voutilainen et al., 1992) have proposed a constraint grammar approach where one specifies constraints on the local context of a word to disambiguate among multiple readings of a word. Their approach has, however, been applied to English where morphological information has rather little use in such resolution.</Paragraph>
      <Paragraph position="1"> In our tagger, constraints are applied to each word and check whether the forms within a specified neighborhood of the word satisfy certain morphosyntactic or positional restrictions and/or agreements. Our constraint pattern specification is very similar to the multi-word construct specification. The use of variables, operators and actions is the same, except that the compose action does not make sense here. The following is an example constraint that is used to select the postpositional reading of a certain word when it is preceded by a yet unresolved nominal form with a certain case. The only requirement is that the case of the nominal form agrees with the case subcategorization requirement of the following postposition. (LP = 0 refers to the current word, LP = 1 refers to the next word.)</Paragraph>
      <Paragraph position="3"> When a match is found, the matching parses from both words are selected and the others are discarded.</Paragraph>
      <Paragraph position="4"> This one constraint disambiguates almost all of the postpositions and their arguments, the exceptions being nominal words which semantically convey the information provided by the case (such as words indicating direction, which may be used as if they have a dative case).</Paragraph>
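      <Paragraph> A rough sketch of how such a constraint might behave is given below. It is an assumption-laden illustration: parses are plain dictionaries, the keys cat, case and subcat are invented for the example, and the equality test stands in for unifying a shared variable across the two words. The input for "eve doğru" is ours and is not taken from the paper.

```python
def select_postposition(current, following):
    """Keep nominal/postposition parse pairs whose case values agree.

    current and following are lists of parse dicts for the words at
    LP = 0 and LP = 1; keys such as "case" and "subcat" are illustrative.
    """
    pairs = [
        (noun, postp)
        for noun in current
        for postp in following
        if noun.get("cat") == "N"
        and postp.get("cat") == "POSTP"
        and noun.get("case") is not None
        and noun.get("case") == postp.get("subcat")  # the shared case variable
    ]
    if not pairs:
        return current, following  # no match: leave both parse sets untouched
    kept_current = [n for n in current if any(n is pair[0] for pair in pairs)]
    kept_following = [p for p in following if any(p is pair[1] for pair in pairs)]
    return kept_current, kept_following


# Illustrative input for "eve doğru" (towards the house).
eve = [{"root": "ev", "cat": "N", "case": "DAT"}]
dogru = [
    {"root": "doğru", "cat": "ADJ"},
    {"root": "doğru", "cat": "POSTP", "subcat": "DAT"},
]
new_eve, new_dogru = select_postposition(eve, dogru)

# A nominal in a different case does not license the postpositional reading.
evi = [{"root": "ev", "cat": "N", "case": "ACC"}]
unchanged = select_postposition(evi, dogru)
```

      Selecting from both words at once follows the description above: the matching parses of the nominal and of the postposition are kept, and the others are discarded.</Paragraph>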
      <Paragraph position="5"> Finally, the following example constraint deletes the sentence-final adjectival readings derived from verbs, effectively preferring the verbal reading (as Turkish is an SOV language):
        Cat = V, Finalcat = ADJ, SP = END : Delete.</Paragraph>
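      <Paragraph> One possible reading of this constraint is sketched below, under the assumption that Cat is the category of the root, Finalcat the category after derivation, and SP = END a sentence-final position test. The dictionary keys and the example word (akar in "Su akar.", Water flows) are illustrative and not taken from the paper.

```python
def delete_final_adjectival(parses, is_sentence_final):
    """Apply Cat = V, Finalcat = ADJ, SP = END : Delete to one word's parses."""
    if not is_sentence_final:
        return list(parses)
    remaining = list(parses)
    for parse in parses:
        is_match = parse.get("cat") == "V" and parse.get("finalcat") == "ADJ"
        # Delete never removes the last parse left for a lexical form.
        if is_match and len(remaining) > 1:
            remaining.remove(parse)
    return remaining


# Two readings of a sentence-final aorist form: finite verb or derived adjective.
akar = [
    {"root": "ak", "cat": "V", "finalcat": "V"},
    {"root": "ak", "cat": "V", "finalcat": "ADJ"},
]
at_end = delete_final_adjectival(akar, is_sentence_final=True)
elsewhere = delete_final_adjectival(akar, is_sentence_final=False)
only_adj = delete_final_adjectival([akar[1]], is_sentence_final=True)
```

      The adjectival reading survives when it is the only parse left, in line with the definition of the Delete action in Section 3.</Paragraph>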
    </Section>
  </Section>
  <Section position="5" start_page="147" end_page="147" type="metho">
    <SectionTitle>
4 Performance of the Tagger
</SectionTitle>
    <Paragraph position="0"> We have performed some preliminary experiments to assess the effectiveness of our tagger. We have used about 250 constraints for Turkish. Some of these constraints are very general, such as the postposition rule above, while some are geared towards recognition of NPs of various sorts, and a small number apply certain syntactic heuristics. In this section, we summarize our preliminary results. Table 1 presents some preliminary results of our tagging experiments.
    Although the texts that we have experimented with are rather small, the results indicate that our approach is effective in disambiguating morphological structures, and hence POS, with minimal user intervention. Currently, the speed of the tagger is limited essentially by that of the morphological analyzer, but we have ported the morphological analyzer to the XEROX TWOL system developed by Karttunen and Beesley (Karttunen and Beesley, 1992).</Paragraph>
    <Paragraph position="1"> This system can analyze Turkish word forms at about 1000 forms/sec on SparcStation 10s. We intend to integrate this into our tagger soon, improving its speed considerably.</Paragraph>
    <Paragraph position="2"> We have tested the impact of morphological disambiguation on the performance of an LFG parser developed for Turkish (Güngördü, 1993; Güngördü and Oflazer, 1994). The input to the parser was disambiguated using the tool developed, and the results were compared to the case when the parser had to consider all possible morphological ambiguities itself. For a set of 80 sentences considered, it can be seen (Table 2) that morphological disambiguation enables almost a factor of two reduction in the average number of parses generated and over a factor of two speed-up in time.</Paragraph>
  </Section>
</Paper>