<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1117"> <Title>A Maximum-Entropy Partial Parser for Unrestricted Text</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Maximum Entropy Modelling </SectionTitle> <Paragraph position="0"> The expressiveness and modelling power of the maximum entropy approach arise from its ability to combine information coming from different knowledge sources. Given a set X of possible histories and a set Y of futures, we can characterise events from the joint event space X × Y by defining a number of features, i.e., equivalence relations over X × Y. By defining these features, we express our insights about the information relevant to modelling.</Paragraph> <Paragraph position="1"> In such a formalisation, the maximum entropy technique consists in finding a model that (a) fits the empirical expectations of the pre-defined features, and (b) does not assume anything specific about events that are not subject to constraints imposed by the features. In other words, we search for the maximum entropy probability distribution p*:</Paragraph> <Paragraph position="3"> p* = arg max_{p ∈ P} H(p), where P = {p : p meets the empirical feature expectations} and H(p) denotes the entropy of p.</Paragraph> <Paragraph position="4"> For parameter estimation, we can use the Improved Iterative Scaling (IIS) algorithm (Berger et al., 1996), which assumes p to have the form p(x, y) = (1/Z) exp(Σi λi fi(x, y)), where fi : X × Y → {0, 1} is the indicator function of the i-th feature, λi the weight assigned to this feature, and Z a normalisation constant. IIS iteratively adjusts the weights λi of the features; the model converges to the maximum entropy distribution.</Paragraph> <Paragraph position="5"> One of the most attractive properties of the maximum entropy approach is its ability to cope with feature decomposition and overlapping features. In the following sections, we will show how these advantages can be exploited for partial parsing, i.e., the recognition of syntactic structures of limited depth.</Paragraph> </Section>
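To make the model form concrete, here is a minimal sketch in Python of the log-linear distribution p(x, y) = (1/Z) exp(Σi λi fi(x, y)) over a toy event space. The histories, futures, indicator features and weights are invented illustrative values, not the ones used in the paper, and the IIS procedure that would estimate the weights is not shown.

```python
import math
from itertools import product

# Toy event space: histories X and futures Y (illustrative values only).
histories = ["h1", "h2"]
futures = ["NP", "other"]

# Binary indicator features f_i over X x Y and their weights lambda_i
# (as would be estimated by IIS; here simply fixed by hand).
features = [
    lambda x, y: 1 if y == "NP" else 0,
    lambda x, y: 1 if (x == "h1" and y == "NP") else 0,
]
weights = [0.7, 1.2]

def unnormalised(x, y):
    return math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))

# The normalisation constant Z makes p sum to 1 over the joint event space.
Z = sum(unnormalised(x, y) for x, y in product(histories, futures))

def p(x, y):
    return unnormalised(x, y) / Z

assert abs(sum(p(x, y) for x, y in product(histories, futures)) - 1.0) < 1e-9
print(p("h1", "NP"), p("h2", "other"))
```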
<Section position="4" start_page="0" end_page="143" type="metho"> <SectionTitle> 3 Context Information for Parsing </SectionTitle> <Paragraph position="0"> An interesting feature of many partial parsers is that they recognise phrase boundaries mainly on the basis of cues provided by strictly local contexts. Regardless of whether or not abstractions such as phrases occur in the model, most of the relevant information is contained directly in the sequence of words and part-of-speech tags to be processed.</Paragraph> <Paragraph position="1"> An archetypal representative of this approach is the method described by Church (1988), who used corpus frequencies to determine the boundaries of simple non-recursive NPs. For each pair of part-of-speech tags ti, tj, the probability of an NP boundary ('[' or ']') occurring between ti and tj is computed. On the basis of these context probabilities, the program inserts the symbols '[' and ']' into sequences of part-of-speech tags. Information about lexical contexts also significantly improves the performance of deep parsers. For instance, Joshi and Srinivas (1994) encode partial structures in the Tree Adjoining Grammar framework and use tagging techniques to restrict a potentially very large number of alternative structures. Here, the context incorporates information about both the terminal yield and the syntactic structure built so far.</Paragraph> <Paragraph position="2"> Local configurations of words and parts of speech are a particularly important knowledge source for lexicalised grammars. In the Link Grammar framework (Lafferty et al., 1992; Della Pietra et al., 1994), strictly local contexts are naturally combined with long-distance information coming from long-range trigrams.</Paragraph> <Paragraph position="3"> Since modelling syntactic context is a very knowledge-intensive problem, the maximum entropy framework seems particularly appropriate. Ratnaparkhi (1997) introduces several contextual predicates which provide rich information about the syntactic context of nodes in a tree (basically, the structure and category of nodes dominated by or dominating the current phrase). These predicates are used to guide the actions of a parser.</Paragraph> <Paragraph position="4"> The use of a rich set of contextual features is also the basic idea of the approach taken by Hermjakob and Mooney (1997), who employ predicates capturing syntactic and semantic context in their parsing and machine translation system.</Paragraph> </Section> <Section position="5" start_page="143" end_page="145" type="metho"> <SectionTitle> 4 A Partial Parser for German </SectionTitle> <Paragraph position="0"> The basic idea underlying our approach to partial parsing can be characterised as follows: * An appropriate encoding format makes it possible to express all relevant lexical, categorial and structural information in a finite alphabet of structural tags assigned to words (section 4.1).</Paragraph> <Paragraph position="1"> * Given a sequence of words tagged with part-of-speech labels, a Markov model is used to determine the most probable sequence of structural tags (section 4.2).</Paragraph> <Paragraph position="2"> * Parameter estimation is based on the maximum entropy technique, which takes full advantage of the multi-dimensional character of the structural tags (section 4.3). The details of the method employed are explained in the remainder of this section.</Paragraph> <Section position="1" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 4.1 Relevant Contextual Information </SectionTitle> <Paragraph position="0"> Three pieces of information associated with a word wi are considered relevant to the parser: * the part-of-speech tag ti assigned to wi, * the structural relation ri between wi and its predecessor wi-1, * the syntactic category ci of parent(wi). On the basis of these three dimensions, structural tags are defined as triples of the form Si = (ti, ri, ci). For better readability, we will sometimes use attribute-value matrices to denote such tags.</Paragraph> <Paragraph position="1"> Since we consider structures of limited depth, only seven values of the REL attribute are distinguished.</Paragraph> <Paragraph position="3"> If more than one of the conditions above is met, the first of the corresponding tags in the list is assigned. Figure 1 exemplifies the encoding format.</Paragraph> <Paragraph position="5"> These seven values of the ri attribute are mostly sufficient to represent the structure of even fairly complex NPs, PPs and APs, involving PP and genitive NP attachment as well as complex prenominal modifiers. The only NP components that are not treated here are relative clauses and infinitival complements. A German prepositional phrase and its encoding are shown in figure 2.</Paragraph> </Section>
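The following is a minimal sketch in Python of the structural-tag encoding Si = (ti, ri, ci). The part-of-speech, relation and category labels below are hypothetical placeholders chosen for illustration; in particular, the relation values do not reproduce the paper's seven REL values, which are not preserved in this text.

```python
from typing import NamedTuple

# Structural tag S_i = (t_i, r_i, c_i): POS tag of the word, relation to its
# predecessor, and category of its parent node. Label values are illustrative.
class StructuralTag(NamedTuple):
    tag: str  # part-of-speech tag t_i of word w_i
    rel: str  # structural relation r_i between w_i and w_{i-1}
    cat: str  # syntactic category c_i of parent(w_i)

# One possible encoding of a short phrase such as "acht Millionen Tonnen"
# ("eight million tons"); the REL values here are invented placeholders.
words = ["acht", "Millionen", "Tonnen"]
tags = [
    StructuralTag(tag="CARD", rel="starts_new_phrase", cat="NM"),
    StructuralTag(tag="CARD", rel="same_parent", cat="NM"),
    StructuralTag(tag="NN", rel="attaches_higher", cat="NP"),
]

for w, s in zip(words, tags):
    print(f"{w:10s} -> TAG={s.tag:5s} REL={s.rel:18s} CAT={s.cat}")
```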
<Section position="2" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 4.2 A Markovian Parser </SectionTitle> <Paragraph position="0"> The task of the parser is to determine the best sequence of triples (ti, ri, ci) for a given sequence of part-of-speech tags (t0, t1, ..., tn). Since the attributes TAG, REL and CAT can take only a finite number of values, the number of such triples is also finite, and they can be used to construct a second-order Markov model. The triples Si = (ti, ri, ci) are the states of the model, which emits POS tags (ti) as signals.</Paragraph> <Paragraph position="1"> In this respect, our approach does not differ much from standard part-of-speech tagging techniques. We simply assign the most probable sequence of structural tags S = (S0, S1, ..., Sn) to a sequence of part-of-speech tags T = (t0, t1, ..., tn). Assuming the Markov property, arg max_S P(S | T) = arg max_S Πi P(ti | Si) · P(Si | Si-2, Si-1).</Paragraph> <Paragraph position="3"> The part-of-speech tags are encoded in the structural tag (the ti dimension), so S uniquely determines T. Therefore, we have P(ti | Si) = 1 if Si = (ti, ri, ci) and 0 otherwise, which simplifies calculations.</Paragraph> </Section>
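The decoding step can be pictured as a standard second-order Viterbi search over structural tags, where the deterministic emission P(ti | Si) simply restricts the candidate states at each position to those whose TAG dimension matches the observed part-of-speech tag. The sketch below is a hedged illustration in Python: the interface p_context(s_prev2, s_prev1, s) for the contextual probability p(Si | Si-2, Si-1), the candidate lists and the boundary state are assumptions, not the authors' implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[str, str, str]  # structural tag (t_i, r_i, c_i)
BOS: State = ("<s>", "<s>", "<s>")  # hypothetical sentence-boundary state

def viterbi(pos_tags: List[str],
            candidates: Dict[str, List[State]],
            p_context: Callable[[State, State, State], float]) -> List[State]:
    """Second-order Viterbi over structural tags; emission is deterministic, so
    only states whose TAG dimension equals the observed POS tag are considered."""
    # chart maps a pair (S_{i-1}, S_i) to (best log score, backpointer S_{i-2})
    chart: Dict[Tuple[State, State], Tuple[float, State]] = {(BOS, BOS): (0.0, BOS)}
    history = [chart]
    for t in pos_tags:
        new_chart: Dict[Tuple[State, State], Tuple[float, State]] = {}
        for (s2, s1), (score, _) in chart.items():
            for s in candidates[t]:  # P(t | s) = 1 only for these states
                cand = score + math.log(p_context(s2, s1, s))
                if (s1, s) not in new_chart or cand > new_chart[(s1, s)][0]:
                    new_chart[(s1, s)] = (cand, s2)
        history.append(new_chart)
        chart = new_chart
    # Backtrace from the best final state pair.
    pair = max(chart, key=lambda k: chart[k][0])
    seq = [pair[1]]
    for i in range(len(pos_tags), 1, -1):
        back = history[i][pair][1]
        seq.append(pair[0])
        pair = (back, pair[0])
    return list(reversed(seq))

# Toy usage with invented candidate states and a uniform context model.
cands = {
    "APPR": [("APPR", "r1", "PP")],
    "NN": [("NN", "r2", "NP"), ("NN", "r3", "PP")],
}
print(viterbi(["APPR", "NN"], cands, lambda s2, s1, s: 0.5))
```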
<Section position="3" start_page="144" end_page="145" type="sub_section"> <SectionTitle> 4.3 Parameter Estimation </SectionTitle> <Paragraph position="0"> The more interesting aspect of our parser is the estimation of contextual probabilities, i.e., calculating the probability of a structural tag Si (the &quot;future&quot;) conditional on its immediate predecessors Si-1 and Si-2 (the &quot;history&quot;).</Paragraph> <Paragraph position="1"> In the following two subsections, we contrast the traditional HMM estimation method and the maximum entropy approach.</Paragraph> <Paragraph position="2"> One possible way of estimating the parameters is to use standard HMM techniques while treating the triples Si = (ti, ri, ci) as atoms. Trigram probabilities are estimated from an annotated corpus by using relative frequencies f: p̂(Si | Si-2, Si-1) = f(Si-2, Si-1, Si) / f(Si-2, Si-1).</Paragraph> <Paragraph position="4"> A standard method of handling sparse data is to use a linear combination of the unigram, bigram and trigram estimates p̂: p(Si | Si-2, Si-1) = λ1 p̂(Si) + λ2 p̂(Si | Si-1) + λ3 p̂(Si | Si-2, Si-1).</Paragraph> <Paragraph position="6"> The λi denote weights for the different context sizes and sum to 1. They are commonly estimated by deleted interpolation (Brown et al., 1992).</Paragraph> <Paragraph position="7"> A disadvantage of the traditional method is that it considers only full n-grams Si-n+1, ..., Si and ignores a lot of contextual information, such as the regular behaviour of the single attributes TAG, REL and CAT. The maximum entropy approach offers an attractive alternative in this respect, since we are now free to define features accessing different constellations of the attributes. For instance, we can abstract over one or more dimensions, as in the context description in figure 1.</Paragraph> <Paragraph position="8"> Such &quot;partial n-grams&quot; permit a better exploitation of the information coming from contexts observed in the training data. We say that a feature fk defined by a triple (Mi-2, Mi-1, Mi) of attribute-value matrices is active on a trigram context (S'i-2, S'i-1, S'i), i.e., fk(S'i-2, S'i-1, S'i) = 1, iff Mj unifies with the attribute-value matrix M'j encoding the information contained in S'j, for j = i-2, i-1, i. A novel context would on average activate more features than in the standard HMM approach, which treats the (ti, ri, ci) triples as atoms. The actual features are extracted from the training corpus in the following way: we first define a number of feature patterns that specify which attributes of a trigram context are relevant. All feature pattern instantiations that occur in the training corpus are stored; this procedure yields several thousand features for each pattern.</Paragraph> <Paragraph position="9"> After computing the weights λk of the features occurring in the training sample, we can calculate the contextual probability of a multi-dimensional structural tag Si following the two tags Si-2 and Si-1: p(Si | Si-2, Si-1) = (1/Z) exp(Σk λk fk(Si-2, Si-1, Si)). We achieved the best results with 22 empirically determined feature patterns comprising full and partial n-grams, n ≤ 3. These patterns are listed in Appendix A.</Paragraph> </Section> </Section>
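Here is a minimal Python sketch of the &quot;partial n-gram&quot; idea: a feature pattern names the attributes of each trigram position that are relevant, instantiating it on a training trigram yields a feature, and the feature is active on a new context iff its recorded attribute-value pairs are consistent with the corresponding structural tags. The pattern and label values below are illustrative assumptions; the 22 patterns actually used are those listed in the paper's Appendix A, which is not reproduced here.

```python
from typing import Dict, Tuple

Tag = Dict[str, str]                   # attribute-value matrix {"TAG": ..., "REL": ..., "CAT": ...}
Pattern = Tuple[Tuple[str, ...], ...]  # relevant attributes for each of the three positions
Feature = Tuple[Tuple[Tuple[str, str], ...], ...]

def instantiate(pattern: Pattern, trigram: Tuple[Tag, Tag, Tag]) -> Feature:
    """Project a training trigram onto the attributes named by the pattern."""
    return tuple(
        tuple(sorted((attr, tag[attr]) for attr in attrs))
        for attrs, tag in zip(pattern, trigram)
    )

def active(feature: Feature, trigram: Tuple[Tag, Tag, Tag]) -> bool:
    """The feature fires iff every recorded attribute-value pair matches the context."""
    return all(
        all(tag.get(attr) == value for attr, value in constraints)
        for constraints, tag in zip(feature, trigram)
    )

# A hypothetical pattern: look only at CAT of S_{i-2}, ignore S_{i-1} entirely,
# and look at TAG and REL of S_i. Label values below are placeholders.
pattern: Pattern = (("CAT",), (), ("TAG", "REL"))
training_trigram = (
    {"TAG": "APPR", "REL": "r1", "CAT": "PP"},
    {"TAG": "CARD", "REL": "r2", "CAT": "NM"},
    {"TAG": "NN", "REL": "r3", "CAT": "NP"},
)
f_k = instantiate(pattern, training_trigram)
assert active(f_k, training_trigram)
print(f_k)
```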
<Section position="6" start_page="145" end_page="146" type="metho"> <SectionTitle> 5 Applications </SectionTitle> <Paragraph position="0"> Below, we discuss two applications of our maximum entropy parser: treebank annotation and chunk parsing of unrestricted text. For precise results, see section 6.</Paragraph> <Section position="1" start_page="145" end_page="146" type="sub_section"> <SectionTitle> 5.1 Treebank Annotation </SectionTitle> <Paragraph position="0"> The partial parser described here is used for corpus annotation in a treebank project, cf. (Skut et al., 1997). The annotation process is more interactive than in the Penn Treebank approach (Marcus et al., 1994), where a sentence is first preprocessed by a partial parser and then edited by a human annotator. In our method, manual and automatic annotation steps are closely interleaved. Figure 3 exemplifies the human-computer interaction during annotation.</Paragraph> <Paragraph position="1"> The annotations encode four kinds of linguistic information: 1) parts of speech and inflection, 2) structure, 3) phrasal categories (node labels), 4) grammatical functions (edge labels).</Paragraph> <Paragraph position="2"> Part-of-speech tags are assigned in a preprocessing step. The automatic instantiation of labels is integrated into the assignment of structures. The annotator marks the words and phrases to be grouped into a new substructure, and the node and edge labels are inserted by the program, cf. (Brants et al., 1997).</Paragraph> <Paragraph position="3"> Grammatical function labels: NK nominal kernel component, AC adposition, NMC number component, MO modifier.</Paragraph> <Paragraph position="4"> Initially, such annotation increments were just local trees of depth one. In this mode, the annotation of the PP bei etwa acht Millionen Tonnen ([at] around eight million tons) involves three annotation steps (first the number phrase acht Millionen, then the AP, and finally the PP). Each time, the annotator highlights the immediate constituents of the phrase being constructed.</Paragraph> <Paragraph position="5"> The use of the partial parser described in this paper makes it possible to construct the whole PP in a single step: the annotator marks the words dominated by the PP node, and the internal structure of the new phrase is assigned automatically. This significantly reduces the amount of manual annotation work. The method yields reliable results for phrases that exhibit a fairly rigid internal structure. More than 88% of all NPs, PPs and APs are assigned the correct structure, including PP attachment and complex prenominal modifiers.</Paragraph> <Paragraph position="6"> Further examples of structures recognised by the parser are shown in figure 4. A more detailed description of the annotation mode can be found in (Brants and Skut, 1998).</Paragraph> </Section> <Section position="2" start_page="146" end_page="146" type="sub_section"> <SectionTitle> 5.2 NP Chunker </SectionTitle> <Paragraph position="0"> Apart from treebank annotation, our partial parser can be used to chunk part-of-speech tagged text into major phrases. Unlike in the previous application, the tool now has to determine not only the internal structure but also the external boundaries of phrases. This makes the task more difficult, especially with respect to PP attachment.</Paragraph> <Paragraph position="1"> However, if we restrict the coverage of the parser to the prenominal part of the NP/PP, it performs quite well, correctly assigning almost 95% of all structural tags, which corresponds to a bracketing precision of ca. 87%.</Paragraph> </Section> </Section> </Paper>