<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0712"> <Title>Learning Computational Grammars</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"> This section starts with a description of the three tasks that we have worked on in the framework of this project. After this we will describe the machine learning algorithms applied to this data and conclude with some notes about combining different system results.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Task descriptions </SectionTitle> <Paragraph position="0"> In the framework of this project, we have worked on the following three tasks: 1. base phrase (chunk) identification 2. base noun phrase recognition 3. finding arbitrary noun phrases Text chunks are non-overlapping phrases which contain syntactically related words. For example, the sentence: contains eight chunks, four NP chunks, two VP chunks and two PP chunks. The latter only contain prepositions rather than prepositions plus the noun phrase material because that has already been included in NP chunks. The process of finding these phrases is called CHUNKING. The project provided a data set for this task at the CoNLL-2000 workshop (Tjong Kim Sang and Buchholz, 2000)1. It consists of sections 15-18 of the Wall Street Journal part of the Penn Treebank II (Marcus et al., 1993) as training data (211727 tokens) and section 20 as test data (47377 tokens). A specialised version of the chunking task is NP CHUNKING or baseNP identification in which the goal is to identify the base noun phrases. The first work on this topic was done back in the eighties (Church, 1988). The data set that has become standard for evaluation machine learning approaches is the one first used by Ramshaw and Marcus (1995). It consists of the same training and test data segments of the Penn Treebank as the chunking task (respectively sections 15-18 and section 20). However, since the data sets have been generated with different software, the NP boundaries in the NP chunking data sets are slightly different from the NP boundaries in the general chunking data.</Paragraph> <Paragraph position="1"> Noun phrases are not restricted to the base levels of parse trees. For example, in the sentence In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce ., the noun phrase a14a15a24a16 $ 366.50 an ounce a18 contains two embedded noun phrases a14a15a24a16 $ 366.50 a18 and a14a15a17a16 an ounce a18 . In the NP BRACKETING task, the goal is to find all noun phrases in a sentence. Data sets for this task were defined for CoNLL-992. The data consist of the same segments of the Penn Treebank as the previous two tasks (sections 15-18) as training material and section 20 as test material. This material was extracted directly from the Treebank and therefore the NP boundaries at base levels are different from those in the previous two tasks. In the evaluation of all three tasks, the accuracy of the learners is measured with three rates. We compare the constituents postulated by the learners with those marked as correct by experts (gold standard). First, the percentage of detected constituents that are correct (precision). Second, the percentage of correct constituents that are detected (recall). 
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Machine Learning Techniques </SectionTitle> <Paragraph position="0"> This section introduces the ten learning methods that have been applied by the project members to the three tasks: LSCGs, ALLiS, LSOMMBL, Maximum Entropy, Aleph, MDL-based DCG learners, Finite State Transducers, IB1IG, IGTREE and C5.0.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Local Structural Context Grammars </SectionTitle> <Paragraph position="0"> (LSCGs) (Belz, 2001) are situated between conventional probabilistic context-free production rule grammars and DOP-Grammars (e.g., Bod and Scha (1997)). LSCGs outperform the former because they do not share their inherent independence assumptions, and are more computationally efficient than the latter because they incorporate only subsets of the context included in DOP-Grammars. Local Structural Context (LSC) is (partial) information about the immediate neighbourhood of a phrase in a parse.</Paragraph> <Paragraph position="1"> By conditioning bracketing probabilities on LSC, more fine-grained probability distributions can be achieved and parsing performance increased.</Paragraph> <Paragraph position="2"> Given corpora of parsed text such as the WSJ, LSCGs are used in automatic grammar construction as follows. An LSCG is derived from the corpus by extracting production rules from bracketings and annotating the rules with the type(s) of LSC to be incorporated in the LSCG (e.g. parent category information, depth of embedding, etc.).</Paragraph> <Paragraph position="3"> Rule probabilities are derived from rule frequencies (currently by Maximum Likelihood Estimation). In a separate optimisation step, the resulting LSCGs are optimised in terms of size and parsing performance for a given parsing task by an automatic method (currently a version of beam search) that searches the space of partitions of a grammar's set of nonterminals.</Paragraph> <Paragraph position="4"> The LSCG research efforts differ from other approaches reported in this paper in two respects.</Paragraph> <Paragraph position="5"> Firstly, no lexical information is used at any point, as the aim is to investigate the upper limit of parsing performance without lexicalisation. Secondly, grammars are optimised for parsing performance and size, the aim being to improve performance but not at the price of arbitrary increases in grammar complexity (and hence the cost of parsing). The automatic optimisation of corpus-derived LSCGs is the subject of ongoing research and the results reported here for this method are therefore preliminary.</Paragraph>
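As an illustration of the grammar-construction step just described, here is a small, hypothetical sketch of how production rules annotated with one type of LSC (the parent category) might be collected from bracketed trees and assigned Maximum Likelihood probabilities from their frequencies. The tree encoding and the toy tree are invented for illustration; this is not the actual LSCG software.

```python
from collections import Counter, defaultdict

def extract_rules(tree, parent="TOP", counts=None):
    """Collect productions annotated with their parent category (one kind
    of local structural context) from a bracketed tree.

    A tree is (label, [child, child, ...]); leaves are plain strings.
    """
    if counts is None:
        counts = Counter()
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[((label, parent), rhs)] += 1  # the LHS carries the parent category
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, parent=label, counts=counts)
    return counts

def mle_probabilities(counts):
    """Relative-frequency (Maximum Likelihood) estimates per annotated LHS."""
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

# Toy bracketing, invented for illustration.
tree = ("S", [("NP", ["He"]),
              ("VP", ["reckons", ("NP", [("NP", ["the", "deficit"])])])])
print(mle_probabilities(extract_rules(tree)))
```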
<Paragraph position="6"> Theory Refinement (ALLiS). ALLiS ((Déjean, 2000b), (Déjean, 2000c)) is an inductive rule-based system using a traditional general-to-specific approach (Mitchell, 1997). After generating a default classification rule (equivalent to the n-gram model), ALLiS tries to refine it, since the accuracy of these rules is usually not high enough. Refinement is done by adding more premises (contextual elements).</Paragraph> <Paragraph position="7"> ALLiS uses data encoded in XML, and also learns rules in XML. From the perspective of the XML formalism, the initial rule can be viewed as a tree with only one leaf, and refinement is done by adding adjacent leaves until the accuracy of the rule is high enough (a tuning threshold is used). These additional leaves correspond to more precise contextual elements. Using the hierarchical structure of an XML document, refinement begins at the highest available hierarchical level and moves down the hierarchy (for example, starting at the chunk level and then moving to the word level). Adding new low-level elements makes the rules more specific, increasing their accuracy but decreasing their coverage. After learning is completed, the set of rules is transformed into the formalism used by a given parser.</Paragraph> <Paragraph position="8"> Labelled SOM and Memory Based Learning (LSOMMBL) is a neurally inspired technique which incorporates a modified self-organising map (SOM, also known as a 'Kohonen Map') in memory-based learning to select a subset of the training data for comparison with novel items.</Paragraph> <Paragraph position="9"> The SOM is trained with labelled inputs. During training, each unit in the map acquires a label. When an input is presented, the node in the map with the highest activation (the 'winner') is identified. If the winner is unlabelled, then it acquires the label from its input. Labelled units only respond to similarly labelled inputs. Otherwise training proceeds as with the normal SOM.</Paragraph> <Paragraph position="10"> When training ends, all inputs are presented to the SOM and the winning units for the inputs are noted. Any unused units are then discarded.</Paragraph> <Paragraph position="11"> Thus each remaining unit in the SOM is associated with the set of training inputs that are closest to it. This is used in MBL as follows. The labelled SOM is trained with inputs labelled with the output categories. When a novel item is presented, the winning unit for each category is found, the training items associated with the winning units are searched for the item closest to the novel item, and the most frequent classification of that item is used as the classification for the novel item.</Paragraph> <Paragraph position="12"> Maximum Entropy. When building a classifier, one must gather evidence for predicting the correct class of an item from its context. The Maximum Entropy (MaxEnt) framework is especially suited for integrating evidence from various information sources. Frequencies of evidence/class combinations (called features) are extracted from a sample corpus and considered to be properties of the classification process. Attention is constrained to models with these properties.</Paragraph> <Paragraph position="13"> The MaxEnt principle demands that, among all the probability distributions that obey these constraints, the most uniform one is chosen. During training, features are assigned weights in such a way that, given the MaxEnt principle, the training data is matched as well as possible. During evaluation it is determined which features are active (a feature is active when the context meets the requirements given by the feature). For every class the weights of the active features are combined and the best scoring class is chosen (Berger et al., 1996).</Paragraph>
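The combination of active feature weights can be pictured with the following minimal sketch, which assumes a standard log-linear (exponential) form of the model; the feature names, weights and classes are invented for illustration and are not taken from the classifier described below.

```python
import math

# Hypothetical weights for (feature, class) pairs, as if learned in training.
weights = {
    ("prev_tag=DT", "I-NP"): 1.2,
    ("word=the",    "B-NP"): 0.8,
    ("pos=NN",      "I-NP"): 0.5,
    ("pos=NN",      "O"):   -0.3,
}
classes = ["B-NP", "I-NP", "O"]

def classify(active_features):
    """Score each class by the summed weights of its active features
    (exponentiated and normalised) and return the best scoring class."""
    scores = {c: math.exp(sum(weights.get((f, c), 0.0) for f in active_features))
              for c in classes}
    z = sum(scores.values())
    probabilities = {c: s / z for c, s in scores.items()}
    return max(probabilities, key=probabilities.get), probabilities

print(classify({"prev_tag=DT", "pos=NN"}))  # 'I-NP' wins on these toy weights
```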
<Paragraph position="14"> For the classifier built here we use as evidence the surrounding words, their POS tags and the baseNP tags predicted for the previous words. A mixture of simple features (consisting of one of the mentioned information sources) and complex features (combinations thereof) was used.</Paragraph> <Paragraph position="15"> The left context never exceeded 3 words and the right context was at most 2 words. The model was calculated using existing software (Dehaspe, 1997).</Paragraph> <Paragraph position="16"> Inductive Logic Programming (ILP). Aleph is an ILP machine learning system that searches for a hypothesis, given positive (and, if available, negative) data in the form of ground Prolog terms and background knowledge (prior knowledge made available to the learning algorithm) in the form of Prolog predicates. The system then constructs a set of hypothesis clauses that fit the data and background as well as possible.</Paragraph> <Paragraph position="17"> In order to approach the problem of NP chunking in this context of single-predicate learning, it was reformulated as a tagging task in which each word was tagged as being 'inside' or 'outside' a baseNP (consecutive NPs were treated appropriately). The target theory is then a Prolog program that correctly predicts a word's tag given its context. The context consisted of PoS-tagged words and syntactically tagged words to the left and PoS-tagged words to the right, so that the resulting tagger can be applied in a left-to-right pass over PoS-tagged text.</Paragraph> <Paragraph position="18"> Minimum Description Length (MDL). Estimation using the minimum description length principle involves finding a model which not only 'explains' the training material well, but is also compact. The basic idea is to balance the generality of a model (roughly speaking, the more compact the model, the more general it is) with its specialisation to the training material. We have applied MDL to the task of learning broad-coverage definite-clause grammars from either raw text or parsed corpora (Osborne, 1999a). Preliminary results have shown that learning from just raw text is worse than learning from parsed corpora, and that learning using both parsed corpora and a compression-based prior is better than learning using parsed corpora and a uniform prior. Furthermore, we have noted that our instantiation of MDL does not capture dependencies which exist either in the grammar or in preferred parses. Ongoing work has focused on applying random field technology (maximum entropy) to MDL-based grammar learning (see Osborne (2000a) for some of the issues involved).</Paragraph> <Paragraph position="19"> Finite State Transducers are built by interpreting probabilistic automata as transducers. We use a probabilistic grammatical inference algorithm, the DDSM algorithm (Thollard, 2001), for learning automata that provide the probability of an item given the previous ones. The items are described by bigrams of the format feature:class. In the resulting automata we consider a transition labelled feature:class as the transducer transition that takes as input the first part (feature) of the bigram and outputs the second part (class). By applying the Viterbi algorithm to such a model, we can find the most probable sequence of class values given an input sequence of feature values. As the DDSM algorithm has a tuning parameter, it can provide many different automata. To obtain the results reported in this paper, we apply a majority vote over the predictions of the automata/transducers obtained in this way.</Paragraph>
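A minimal sketch of the Viterbi decoding step just described might look as follows; the toy automaton, its states and its transition probabilities are invented for illustration, and the automata learned by DDSM are of course far larger.

```python
from math import log

# Hypothetical transducer: (state, feature) -> [(next_state, class, probability)]
transitions = {
    (0, "DT"): [(1, "B-NP", 0.9), (1, "O", 0.1)],
    (1, "NN"): [(1, "I-NP", 0.8), (0, "O", 0.2)],
    (1, "VB"): [(0, "O", 0.7), (0, "B-VP", 0.3)],
    (0, "VB"): [(0, "B-VP", 0.9), (0, "O", 0.1)],
}

def viterbi(features, start_state=0):
    """Most probable class sequence for a feature sequence, following
    transitions labelled feature:class in the probabilistic automaton."""
    # Each beam entry: state -> (log probability, class sequence leading there)
    beam = {start_state: (0.0, [])}
    for feature in features:
        new_beam = {}
        for state, (logp, classes) in beam.items():
            for next_state, cls, p in transitions.get((state, feature), []):
                candidate = (logp + log(p), classes + [cls])
                if next_state not in new_beam or candidate[0] > new_beam[next_state][0]:
                    new_beam[next_state] = candidate
        beam = new_beam
    return max(beam.values())[1] if beam else []

print(viterbi(["DT", "NN", "VB"]))  # ['B-NP', 'I-NP', 'O'] on this toy automaton
```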
<Paragraph position="20"> Memory-based learning methods store all training data and classify test data items by giving them the classification of the most similar training data items. We have used three different algorithms: the nearest neighbour algorithm IB1IG, which is part of the Timbl software package (Daelemans et al., 1999), the decision tree learner IGTREE, also from Timbl, and C5.0, a commercial version of the decision tree learner C4.5 (Quinlan, 1993). They are classifiers, which means that they assign phrase classes such as I (inside a phrase), B (at the beginning of a phrase) and O (outside a phrase) to words. In order to improve the classification process we provide the systems with extra information about the words, such as the previous n words, the next n words, their part-of-speech tags and chunk tags estimated by an earlier classification process. We use the default settings of the software except for the number of examined nearest neighbourhood regions for IB1IG (k, default 1), which we set to 3.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Combination techniques </SectionTitle> <Paragraph position="0"> When different systems are applied to the same problem, a clever combination of their results can outperform all of the individual results (Dietterich, 1997). The reason for this is that the systems often make different errors and some of these errors can be eliminated by examining the classifications of the others. The simplest combination method is MAJORITY VOTING. It examines the classifications of each test data item and chooses the most frequently predicted classification. Despite its simplicity, majority voting has been found to be quite useful for boosting performance on the tasks that we are interested in.</Paragraph> <Paragraph position="1"> We have applied majority voting and nine other combination methods to the output of the learning systems that were applied to the three tasks. Nine combination methods were originally suggested by Van Halteren et al. (1998). Five of them, including majority voting, are so-called voting methods. Apart from majority voting, all assign weights to the predictions of the different systems based on their performance on unused training data, the tuning data. TOTPRECISION uses classifier weights based on their accuracy. TAGPRECISION applies classification weights based on the accuracy of the classifier for that classification. PRECISION-RECALL uses classification weights that combine the precision of the classification with the recall of the competitors. And finally, TAGPAIR uses classification pair weights based on the probability of a classification for some predicted classification pair (van Halteren et al., 1998).</Paragraph> <Paragraph position="2"> The remaining four combination methods are so-called STACKED CLASSIFIERS. The idea is to make a classifier process the output of the individual systems. We used the two memory-based learners IB1IG and IGTREE as stacked classifiers.</Paragraph> <Paragraph position="3"> Like Van Halteren et al. (1998), we evaluated two feature combinations. The first consisted of the predictions of the individual systems and the second of the predictions plus one feature that described the data item. We used the feature that, according to the memory-based learning metrics, was most relevant to the tasks: the part-of-speech tag of the data item.</Paragraph>
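To make the voting step concrete, here is a minimal sketch of per-token majority voting, the simplest of the combination methods described above; the tag sequences are invented for illustration, and the weighted voting methods and the BEST-N variant discussed next refine exactly this step.

```python
from collections import Counter

def majority_vote(system_outputs):
    """Combine the classifications of several systems token by token.

    system_outputs is a list of equally long tag sequences, one per system;
    the most frequently predicted tag wins at each position.
    """
    combined = []
    for predictions in zip(*system_outputs):
        tag, _count = Counter(predictions).most_common(1)[0]
        combined.append(tag)
    return combined

outputs = [
    ["B-NP", "I-NP", "O",    "B-VP"],  # system 1
    ["B-NP", "I-NP", "B-NP", "B-VP"],  # system 2
    ["B-NP", "O",    "O",    "B-VP"],  # system 3
]
print(majority_vote(outputs))  # ['B-NP', 'I-NP', 'O', 'B-VP']
```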
<Paragraph position="4"> In the course of this project we have evaluated another combination method: BEST-N MAJORITY VOTING (Tjong Kim Sang et al., 2000). This is similar to majority voting except that, instead of using the predictions of all systems, it uses only the predictions of some of the systems for determining the most probable classifications.</Paragraph> <Paragraph position="5"> We have found that, for various reasons, some systems perform worse than others, and that including their results in the majority vote decreases the combined performance. Therefore it is a good idea to evaluate majority voting on subsets of all systems rather than only on the combination of all systems.</Paragraph> <Paragraph position="6"> Apart from standard majority voting, all combination methods require extra data on which their performance can be measured in order to determine their weights: the tuning data. This data can be extracted from the training data, or the training data can be processed in an n-fold cross-validation process, after which the performance on the complete training data can be measured. Although some work with individual systems in the project has been done with the goal of combining the results with other systems, tuning data is not always available for all results. Therefore it will not always be possible to apply all ten combination methods to the results. In some cases we have to restrict ourselves to evaluating majority voting only.</Paragraph> </Section> </Section> </Paper>