<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2130"> <Title>Learning Part-of-Speech Guessing Rules from Lexicon: Extension to Non-Concatenative Operations*</Title> <Section position="3" start_page="0" end_page="770" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Part-of-speech (POS) taggers are programs which assign a single POS-tag to a word-token, provided that it is known what parts-of-speech this word can take on in principle. To do that, taggers are supplied with a lexicon that lists possible POS-tags for words which were seen at the training phase. Naturally, when tagging real-world texts, one can expect to encounter words which were not seen at the training phase and hence are not included in the lexicon. This is where word-POS guessers come in: they employ the analysis of word features, e.g. a word's leading and trailing characters, to figure out its possible POS categories.</Paragraph> <Paragraph position="1"> Currently, most taggers are supplied with a word-guessing component for dealing with unknown words. The most popular guessing strategy is so-called &quot;ending guessing&quot;, where a possible set of POS-tags for a word is guessed solely on the basis of its trailing characters. An example of such a guesser is the guesser supplied with the Xerox tagger (Kupiec, 1992). * Some of the research reported here was funded as part of EPSRC project IED4/1/5808 &quot;Integrated Language Database&quot;.</Paragraph> <Paragraph position="2"> A similar approach was taken in (Weischedel et al., 1993), where an unknown word was guessed given the probabilities for an unknown word to be of a particular POS, its capitalisation feature and its ending. In (Brill, 1995) a system of rules which uses both ending-guessing and more morphologically motivated rules is described. The best of these methods were reported to achieve 82-85% tagging accuracy on unknown words, e.g. 
(Brill, 1995; Weischedel et al., 1993).</Paragraph> <Paragraph position="3"> In (Mikheev, 1996) a cascading word-POS guesser is described. It first applies morphological prefix and suffix guessing rules and then ending-guessing rules. This guesser is reported to achieve higher guessing accuracy than the figures quoted above: on average about 8-9% better than that of the Xerox guesser and about 6-7% better than that of Brill's guesser, reaching 87-92% tagging accuracy on unknown words.</Paragraph> <Paragraph position="4"> There are two kinds of word-guessing rules employed by the cascading guesser: morphological rules and ending-guessing rules. Morphological word-guessing rules describe how one word can be guessed given that another word is known. In English, as in many other languages, morphological word formation is realised by affixation (prefixation and suffixation), so there are two kinds of morphological rules: suffix rules (A^s) - rules which are applied to the tail of a word, and prefix rules (A^p) - rules which are applied to the beginning of a word. For example, the prefix rule: A^p : [un (VBD VBN) (JJ)] says that if segmenting the prefix &quot;un&quot; from an unknown word results in a word which is found in the lexicon as a past-tense verb and participle (VBD VBN), we conclude that the unknown word is an adjective (JJ). This rule works, for instance, for the word pair [developed → undeveloped]. An example of a suffix rule is: A^s : [ed (NN VB) (JJ VBD VBN)]. This rule says that if by stripping the suffix &quot;ed&quot; from an unknown word we produce a word with the POS-class noun/verb (NN VB), the unknown word is of the class adjective/past-verb/participle (JJ VBD VBN). This rule works, for instance, for the word pairs [book → booked], [water → watered], etc. Unlike morphological guessing rules, ending-guessing rules do not require the main form of an unknown word to be listed in the lexicon. 
These rules guess a POS-class for a word just on the basis of its ending characters and without looking up its stem in the lexicon. For example, an ending-guessing rule A^e : [ing --- (JJ NN VBG)] says that if a word ends with &quot;ing&quot; it can be an adjective, a noun or a gerund. Unlike a morphological rule, this rule does not ask to check whether the substring preceding the &quot;ing&quot; ending is a word with a particular POS-tag. Not surprisingly, morphological guessing rules are more accurate than ending-guessing rules but their lexical coverage is more restricted, i.e. they are able to cover fewer unknown words. Since they are more accurate, in the cascading guesser they were applied before the ending-guessing rules and improved the precision of the guessings by about 5%. This, in turn, resulted in about 2% higher accuracy of tagging on unknown words.</Paragraph> <Paragraph position="5"> Although in general the performance of the cascading guesser was found to be only 6% worse than a general-language lexicon lookup, one of the over-simplifications assumed at the extraction of the morphological rules was that they obey only simple concatenative regularities: book → book+ed; take → take+n; play → play+ing. 
No attempts were made to model non-concatenative cases, which are quite common in English, for instance: try → tries; reduce → reducing; advise → advisable.</Paragraph> <Paragraph position="6"> So we thought that the incorporation of a set of guessing rules which can capture morphological word dependencies with letter alterations should extend the lexical coverage of the morphological rules and hence might contribute to the overall guessing accuracy.</Paragraph> <Paragraph position="7"> In the rest of the paper, we will first briefly outline the unsupervised statistical learning technique proposed in (Mikheev, 1996), then propose a modification which will allow for the incorporation of the learning of non-concatenative morphological rules, and finally evaluate and assess the contribution of the non-concatenative suffix morphological rules to the overall tagging accuracy on unknown words using the cascading guesser.</Paragraph> </Section> </Paper>