<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2128"> <Title>Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming</Title> <Section position="4" start_page="775" end_page="775" type="metho"> <SectionTitle> 2 Previous work </SectionTitle> <Paragraph position="0"> Two previous studies on the induction of rules for part of speech tagging are presented in this section.</Paragraph> <Paragraph position="1"> Samuelsson et al. (1996) describe experiments on inducing English CG rules, intended more as an aid for the grammarian than as an attempt to induce a full-scale CG. The training corpus consisted of some 55 000 words of English text, morphologically and syntactically tagged according to the EngCG tagset. Constraints of the form presented in Section 1.1 were induced based on bigram statistics. Lexical rules, which discard unlikely readings for certain word forms, were also induced. In addition to these, 'barrier' rules were learnt. While the induced 'remove' rules were based on bigrams, the barrier rules utilized longer contexts. When tested on a 10 000 word test corpus, the recall of the induced grammar was 98.2% with a precision of 87.3%, which means that some of the ambiguities were left pending after tagging (1.12 readings per word).</Paragraph> <Paragraph position="2"> Cussens (1997) describes a project in which CG-inspired rules for tagging English text were induced using the Progol machine-learning system. As an aid, the Progol system had access to a small hand-crafted syntactic grammar. The grammar was used only as background knowledge for the Progol system, and was not used for producing any syntactic structure in the final output. The examples consisted of the tags of all the words on each side of the word to be disambiguated (the target word). Given no unknown words and a tag set of 43 different tags, the system tagged 96.4% of the words correctly.</Paragraph> </Section> <Section position="5" start_page="775" end_page="777" type="metho"> <SectionTitle> 3 Present work </SectionTitle> <Paragraph position="0"> The current work was inspired by Cussens (1997) as well as Samuelsson et al. (1996), but departs from both in several respects. It also follows up an initial experiment conducted by the current authors (Eineborg and Lindberg, 1998).</Paragraph> <Paragraph position="1"> Following Samuelsson et al. (1996), local-context and lexical rules were induced. In the present work, no barrier rules were induced. In contrast to their study, a TWOL lexicon and an annotated training text using the same tagset were not available. Instead, a lexicon was created from the training corpus.</Paragraph> <Paragraph position="2"> Just as in Cussens's work, Progol was used to induce tag elimination rules from an annotated corpus. In contrast to his study, no grammatical background knowledge is given to the learner, and the training data contains word tokens, not only part of speech tags.</Paragraph> <Paragraph position="3"> In order to induce the new rules, the context has been limited to a window of maximally five words, with the target word to disambiguate in the middle. A motivation for using a rather small window size can be found in Karlsson et al. (1995, page 59), where it is pointed out that sensible constraints referring to a position relative to the target word utilize close context, typically 1-3 words.</Paragraph> <Paragraph position="4"> Some further restrictions on how the learning system may use the information in the window have been applied in order to reduce the complexity of the problem. This is described in Section 3.2.</Paragraph> <Paragraph position="5"> A pre-release of the Stockholm-Umeå Corpus was used. Some 10% of the corpus was put aside to be used as test data, and the rest of the corpus made up the training data. The test data files were evenly distributed over the different text genres.</Paragraph> <Section position="1" start_page="776" end_page="776" type="sub_section"> <SectionTitle> 3.1 Preprocessing </SectionTitle> <Paragraph position="0"> Before the learning of constraints started, the training data was preprocessed in several ways. Following Cussens (1997), a lexicon was produced from the training corpus. Each word form in the corpus was represented in the lexicon by one look-up word and an ambiguity class: the set of different tags which occurred in the corpus for that word form. The lexicon ended up with just over 86 000 entries.</Paragraph>
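As a minimal sketch of this step, assuming the training corpus is available as corpus_token(Word, Tag) facts (a hypothetical representation; the paper does not specify the storage format):

    % Hypothetical corpus representation: one fact per corpus token,
    % e.g. for the ambiguous Swedish word form "man" (pronoun or noun):
    %   corpus_token(man, pn).  corpus_token(man, nn).
    :- dynamic corpus_token/2.
    :- dynamic lexicon/2.

    % lexicon(Word, AmbiguityClass): map each word form to the set of
    % tags it occurs with anywhere in the training corpus.
    build_lexicon :-
        retractall(lexicon(_, _)),
        forall(setof(Tag, corpus_token(Word, Tag), Tags),
               assertz(lexicon(Word, Tags))).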
<Paragraph position="1"> Similar to Karlsson et al. (1995), the first step of the tagging process was to identify 'idioms', although the term is used somewhat differently in this study: bi- and trigrams which were always tagged with one specific tag sequence (i.e., unambiguously tagged) were extracted from the training text. Example 'idioms' are given in Table 1. 1 530 such bi- and trigrams were used.</Paragraph> <Paragraph position="2"> Following Samuelsson et al. (1996), a list of very unlikely readings for certain words was produced ('lexical rules'). For a word form plus tag to qualify as a lexical rule, the word form had to have a frequency of at least 100 occurrences in the training data, and it had to occur with the tag to be discarded in no more than 1% of the cases. 355 lexical rules were produced this way. The role of the lexical rules and 'idioms' is to remove the simple cases of ambiguity, making it possible for the induced rules to fire, since the induced rules are all 'careful', meaning that they may refer to unambiguous contexts only (if they refer to tag features and not only word forms).</Paragraph> </Section> <Section position="2" start_page="776" end_page="777" type="sub_section"> <SectionTitle> 3.2 Rule induction </SectionTitle> <Paragraph position="0"> Rules were induced for all part of speech categories. Allowing the rules to refer to specific morphological features (and not necessarily a complete specification) has increased the expressive power of the rules compared to the initial experiments (Eineborg and Lindberg, 1998). The rules can look at word form, part of speech, morphological features, and whether a word has an upper or lower case initial character. Although a window of size five was used, a rule can look at maximally four positions within the window at the same time. A further restriction has been put on which combinations of features the system may select from a context word: the closer a context word is to the target, the more features it may use. This is done in order to reduce the search space. Each context word is represented as a Prolog term with arguments for word form, upper/lower case initial character, and part of speech tag, along with a set of morphological features (if any).</Paragraph>
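For illustration, a sketch of what such an encoding might look like, assuming a hypothetical argument order word(Form, Case, PosTag, Features); the paper does not spell out the exact term structure, and the SUC-style tags and features below are only illustrative:

    % word(Form, Case, PosTag, Features): one token, where Case records
    % whether the word's initial character is upper or lower case.
    % A five-word window with the target in the middle, for the fragment
    % "Han lovade att sjunga högt" ("He promised to sing loudly"):
    example_window([word(han,    upper, pn, [utr, sin, def, sub]),
                    word(lovade, lower, vb, [prt, akt]),
                    word(att,    lower, ie, []),            % target word
                    word(sjunga, lower, vb, [inf, akt]),
                    word(högt,   lower, ab, [pos])]).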
<Paragraph position="1"> A different set of training data was produced for each of the 24 part of speech categories. The training data was pre-processed by applying the bi- and trigrams and the lexical rules described above (Section 3.1). This step was taken in order to reduce the amount of training data: rules should not be learnt for ambiguities which would be taken care of anyway.</Paragraph> <Paragraph position="2"> Progol is able to induce a hypothesis using only positive examples, or using both positive and negative examples. Since we are inducing tag-eliminating rules, an example is considered positive when a word is incorrectly tagged and the reading should be discarded. A negative example is a correctly tagged word where the reading should be retained. The training data for each part of speech tag consisted of between 4000 and 6000 positive examples with an equal number of negative examples. The examples for each part of speech category were randomly drawn from all examples available in the training data.</Paragraph> <Paragraph position="3"> A noise level of 1% was tolerated to make sure that Progol could find important rules despite the fact that some examples could be incorrect.</Paragraph> </Section> <Section position="3" start_page="777" end_page="777" type="sub_section"> <SectionTitle> 3.3 Rule format </SectionTitle> <Paragraph position="0"> The induced rules encode two types of information. Firstly, a rule states the number and positions of the context words relative to the target word (the word to disambiguate). Secondly, for each context word referred to by the rule, and possibly also for the target word, the rule states under what conditions it is applicable. These conditions can be the word form, morphological features, or whether a word is spelt with an initial capital letter or not, as well as combinations of these. Three examples of induced rules can be described as follows: the first rule eliminates all verbal (vb) readings of a word immediately preceded by a word tagged as determiner (dt). The second rule deletes the infinitive marker (ie) reading of a word followed by any word which has the feature 'definite' (def), followed by a verb (vb). The third rule deletes verb readings which have the features 'imperative' (imp) and 'active voice' (akt) if the preceding word is att (word(att)).
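Rendered as Prolog clauses, these three rules might look roughly as follows. This is a hypothetical reconstruction: the paper does not show the actual Progol output, and the remove/3 head and word/4 encoding follow the sketch given in Section 3.2.

    % remove(TargetReading, Left, Right) succeeds when TargetReading
    % (a tag plus its morphological features) should be discarded for
    % the target word. Left and Right list the context words, with the
    % word nearest the target first.

    % Rule 1: eliminate all verb (vb) readings of a word immediately
    % preceded by a word tagged as determiner (dt).
    remove(tag(vb, _), [word(_, _, dt, _) | _], _).

    % Rule 2: delete the infinitive marker (ie) reading of a word
    % followed by a word with the feature definite (def), followed
    % by a verb (vb).
    remove(tag(ie, _), _, [word(_, _, _, F), word(_, _, vb, _) | _]) :-
        member(def, F).

    % Rule 3: delete verb readings with the features imperative (imp)
    % and active voice (akt) if the preceding word is att.
    remove(tag(vb, F), [word(att, _, _, _) | _], _) :-
        member(imp, F),
        member(akt, F).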
As already mentioned, the scope of the rules has been limited to a window of five words, the target word included. In an earlier attempt, the window was seven words, but those rules were less expressive in other respects (Eineborg and Lindberg, 1998).</Paragraph> </Section> </Section> <Section position="6" start_page="777" end_page="777" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> Just under 7 000 rules were induced. The tagger was tested on a subset of the unseen data. Only sentences in which all words were in the lexicon were allowed. Sentences including words tagged as U0 were discarded. The U0 tag is a peculiarity of the SUC tagset and conveys no grammatical information; it stands for 'foreign word' and is used e.g. for the words in passages quoting text which is not in Swedish.</Paragraph> <Paragraph position="1"> The test data consisted of 42 925 words, including punctuation marks. After lexicon look-up the words were assigned 93 810 readings, i.e., on average 2.19 readings per word. 41 926 words retained the correct reading after disambiguation, which means that the correct tag survived for 97.7% of the words. After tagging, 48 691 readings were left, i.e., 1.13 readings per word.</Paragraph> <Paragraph position="2"> As a comparison, in a preliminary test the Brill tagger, also trained on the Stockholm-Umeå Corpus, tagged 96.9% of the words correctly, and Oliver Mason's QTag got 96.3% on the same data (Ridings, 1998).</Paragraph> <Paragraph position="3"> Neither of these two taggers leaves ambiguities pending, and both handle unknown words, which makes a direct comparison of the figures given above hard.</Paragraph> <Paragraph position="4"> The processing times were quite long for most of the rule sets -- few of them were actually allowed to continue until all examples were exhausted.</Paragraph> </Section>
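As a quick sanity check, the per-word figures reported above can be recomputed from the raw counts; only the four counts themselves come from the paper:

    % Recompute the reported averages from the raw counts in Section 4.
    check_figures :-
        Words = 42925,           % test words, incl. punctuation
        Initial = 93810,         % readings after lexicon look-up
        Correct = 41926,         % words whose correct reading survived
        Remaining = 48691,       % readings left after disambiguation
        Before is Initial / Words,          % ~2.19 readings per word
        Recall is 100 * Correct / Words,    % ~97.7% of the words
        After is Remaining / Words,         % ~1.13 readings per word
        format("before: ~2f  recall: ~1f%  after: ~2f~n",
               [Before, Recall, After]).
</Paper>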