File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2105_metho.xml
Size: 14,116 bytes
Last Modified: 2025-10-06 14:07:15
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2105"> <Title>Robust German Noun Chunking With a Probabilistic Context-Free Grammar</Title> <Section position="3" start_page="0" end_page="727" type="metho"> <SectionTitle> 2 The Grammar </SectionTitle> <Paragraph position="0"> The German grammar is a head-lexicalised probabilistic context-free grammar. Section 2.1 defines probabilistic context-free grammars and their head-lexicalised refinement. Section 2.2 introduces our grammar architecture, focusing on noun chunks.</Paragraph> <Paragraph position="1"> The robustness rules for the chunker are described in section 2.3.</Paragraph> <Section position="1" start_page="0" end_page="726" type="sub_section"> <SectionTitle> 2.1 (Head-Lexicalised) Probabilistic Context-Free Grammars </SectionTitle> <Paragraph position="0"> A probabilistic context-free grammar (PCFG) is a context-free grammar which assigns a probability P to each context-free grammar rule in the rule set R. The probability of a parse tree T is defined as \prod_{r \in R} P(r)^{f_T(r)}, where f_T(r) is the number of times rule r was applied to build T. The parameters of PCFGs can be learned from unparsed corpora using the Inside-Outside algorithm (Lari and Young, 1990).</Paragraph> <Paragraph position="1"> Head-lexicalised probabilistic context-free grammars (H-L PCFGs) (Carroll and Rooth, 1998) extend the PCFG approach by incorporating information about the lexical head of constituents into the probabilistic model. Each node in a parse of a H-L PCFG is labelled with a category and the lexical head of the category. A H-L PCFG rule looks like a PCFG rule in which one of the daughters has been marked as the head. The rule probabilities P_rule(C -> α | C) are replaced by lexicalised rule probabilities P_rule(C -> α | C, h), where h is the lexical head of the mother constituent C. The probability of a rule therefore depends not only on the category of the mother node, but also on its lexical head. Assume that the grammar has two rules VP -> V NP and VP -> V. Then the transitive verb buy should have a higher probability for the former rule, whereas the latter rule should be more likely for intransitive verbs like sleep. H-L PCFGs incorporate another type of parameter called lexical choice probabilities. The lexical choice probability P_choice(h_d | C_d, C_m, h_m) represents the probability that a node of category C_d with a mother node of category C_m and lexical head h_m bears the lexical head h_d. The probability of a parse tree is obtained by multiplying lexicalised rule probabilities and lexical choice probabilities for all nodes. Since it is possible to transform H-L PCFGs into PCFGs, the PCFG algorithms are applicable to H-L PCFGs.</Paragraph>
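<Paragraph> To make these definitions concrete, the following minimal Python sketch scores a parse tree first under a plain PCFG and then under a H-L PCFG. All categories, rules and probability values in it are invented toy figures for illustration only, not parameters of the grammar described in this paper.

# A tree node is (category, lexical head, list of daughters); leaves have no daughters.
TREE = ("VP", "buy", [("V", "buy", []), ("NP", "shares", [("N", "shares", [])])])

# PCFG: P_rule(rule | mother category)
P_RULE = {("VP", ("V", "NP")): 0.4, ("NP", ("N",)): 0.7}

# H-L PCFG: P_rule(rule | mother category, mother head)
P_RULE_LEX = {("VP", "buy", ("V", "NP")): 0.7, ("NP", "shares", ("N",)): 0.7}

# H-L PCFG: P_choice(daughter head | daughter category, mother category, mother head)
P_CHOICE = {("shares", "NP", "VP", "buy"): 0.01}

def score_pcfg(node):
    cat, head, daughters = node
    if not daughters:
        return 1.0
    p = P_RULE[(cat, tuple(d[0] for d in daughters))]   # rule probability
    for d in daughters:
        p *= score_pcfg(d)
    return p

def score_hl(node):
    cat, head, daughters = node
    if not daughters:
        return 1.0
    p = P_RULE_LEX[(cat, head, tuple(d[0] for d in daughters))]
    for dcat, dhead, _ in daughters:
        if dhead != head:                               # non-head daughter
            p *= P_CHOICE[(dhead, dcat, cat, head)]     # lexical choice probability
    for d in daughters:
        p *= score_hl(d)
    return p

print(score_pcfg(TREE))  # 0.4 * 0.7 = 0.28
print(score_hl(TREE))    # 0.7 * 0.01 * 0.7 = 0.0049
</Paragraph>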
</Section> <Section position="2" start_page="726" end_page="727" type="sub_section"> <SectionTitle> 2.2 Noun Chunks in the German Grammar </SectionTitle> <Paragraph position="0"> Currently, the German grammar contains 4,619 rules and covers 92% of our 15 million words of verb final and relative clauses. The structural noun chunk concept in the grammar is defined according to Abney's chunk style (Abney, 1991), who describes chunks as syntactic units which correspond in some way to prosodic patterns, containing a content word surrounded by some function word(s): all words from the beginning of the noun phrase to the head noun are included. The different kinds of noun chunks covered by our grammar are listed below and illustrated with examples:</Paragraph> <Paragraph position="1"> * a combination of a non-obligatory determiner, optional adjectives or cardinals, and the head noun itself: (1) eine gute Idee (a good idea) (2) vielen Menschen ((for) many people) (3) deren künstliche Stimme (whose artificial voice) (4) elf Ladungen (eleven cargos) (5) Wasser (water); and prepositional phrases where the definite article of the embedded noun chunk is morphologically combined with a preposition, so that the pure noun chunk cannot be separated: (6) zum Schluss (at the end) * personal pronouns: ich (I), mir (me) * reflexive pronouns: mich (myself), sich (himself/herself/itself) * possessive pronouns: (7) Meins ist sauber. (Mine is clean.) * demonstrative pronouns: (8) Jener fährt viel schneller. (That one goes much faster.) * indefinite pronouns: (9) Einige sind durchgefallen. (Some failed.) * relative pronouns: (10) Ich mag Menschen, die ehrlich sind. (I like people who are honest.) * nominalised adjectives: Wichtigem (important things) * proper names: Christoph, Kolumbus * a noun chunk refined by a proper name: (11) der Eroberer Christoph Kolumbus (the conqueror Christoph Kolumbus) * cardinals indicating a year: (12) Ich begann 1996. (I started in 1996.)</Paragraph> <Paragraph position="2"> The chunks may be recursive in case they appear as complement of an adjectival phrase, as in (der (im Regen) wartende Sohn) (the son who was waiting in the rain).</Paragraph> <Paragraph position="3"> Noun chunks have features for case, without further agreement features for nouns and verbs. The case is constrained by the function of the noun chunk: as verbal or adjectival complement with nominative, accusative, dative or genitive case, as modifier with genitive case, or as part of a prepositional phrase (also in the special case representing a prepositional phrase itself) with accusative or dative case. Both structure and case of noun phrases may be ambiguous and have to be disambiguated: * ambiguity concerning structure: diesen (this), disregarding the context, is a demonstrative pronoun ambiguous between representing a standalone noun chunk (cf. example (8)) and a determiner within a noun chunk (cf. example (2)) * ambiguity concerning case: die Beiträge (the contributions) is, disregarding the context, ambiguous between nominative and accusative case</Paragraph> <Paragraph position="4"> The disambiguation is learned during grammar training, since the lexicalised rule probabilities as well as the lexical choice probabilities tend to enforce the correct structure and case information. Considering the above examples, the trained grammar should be able to parse diesen Krieg (this war) as one noun chunk instead of two (with diesen representing a standalone noun chunk), because of (i) the preferred use of demonstrative pronouns as determiners (-> lexicalised rule probabilities), and (ii) the lexical coherence between the two words (-> lexical choice probabilities). In a sentence like er zahlte die Beiträge (he paid the contributions), the accusative case of the latter noun chunk should be identified because of the lexical coherence between the verb zahlen (pay) and the lexical head of the subcategorised noun phrase, Beitrag (contribution), as related direct object head (-> lexical choice probabilities). A toy sketch of the first comparison is given below.</Paragraph>
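<Paragraph> The following Python fragment illustrates the structural disambiguation of diesen Krieg. The categories and all probability values are invented for the sake of the example; in practice they would come from the trained grammar.

# Lexicalised rule probabilities P_rule(rule | mother category, head); toy values.
P_RULE_LEX = {
    ("NC", "Krieg", ("DEM", "N")): 0.4,   # demonstrative used as determiner
    ("NC", "diesen", ("DEM",)):    0.05,  # standalone demonstrative chunk
    ("NC", "Krieg", ("N",)):       0.2,   # bare noun chunk
}
# Lexical choice probability P_choice(diesen | DEM, NC, Krieg); toy value.
P_CHOICE_DIESEN_KRIEG = 0.1               # lexical coherence of the two words

# Analysis 1: one chunk (NC diesen Krieg) with head Krieg
one_chunk = P_RULE_LEX[("NC", "Krieg", ("DEM", "N"))] * P_CHOICE_DIESEN_KRIEG

# Analysis 2: two chunks (NC diesen) (NC Krieg)
two_chunks = P_RULE_LEX[("NC", "diesen", ("DEM",))] * P_RULE_LEX[("NC", "Krieg", ("N",))]

print(one_chunk, two_chunks)  # about 0.04 vs 0.01: the one-chunk analysis wins
</Paragraph>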
</Section> <Section position="3" start_page="727" end_page="727" type="sub_section"> <SectionTitle> 2.3 Robustness Rules </SectionTitle> <Paragraph position="0"> The German grammar covers over 90% of the clauses of our verb final and relative clause corpora. This is sufficient for the extraction of lexical information, e.g. the subcategorisation of verbs (see (Beil et al., 1999)). For chunking, however, it is usually necessary to analyse all sentences. Therefore, the grammar was augmented with a set of robustness rules.</Paragraph> <Paragraph position="1"> Three types of robustness rules have been considered, namely unigram rules, bigram rules and trigram rules.</Paragraph> <Paragraph position="2"> Unigram rules are rules of the form X -> YP X, where YP is a grammatical category and X is a new category. If such a rule is added for each grammar category (plus two rules which start and terminate the &quot;X chain&quot;: we used the rules TOP -> START X and X -> END, where START and END expand to SGML tags which mark the beginning and the end of a sentence, respectively), the coverage is 100% because the grammar is then able to generate any sequence of category labels. In practice, some of the rules can be omitted while still retaining full coverage: e.g. the rule X -> ADV X is not necessary if the grammar already contains the rules ADVP -> ADV and X -> ADVP X. Unigram rules are insensitive to their context, so that all permutations of the categories which are generated by the X chain have the same probability.</Paragraph> <Paragraph position="3"> The second type of robustness rules, called trigram rules (Carroll and Rooth, 1998), is more context sensitive. Trigram rules have the form X:Y -> Y Y:Z, where X, Y, Z are categories and X:Y and Y:Z are new categories. Trigram rules choose the next category on the basis of the two preceding categories. Therefore the number of rules grows as the third power of the number of categories. For example, 125,000 trigram rules are needed to generate 50 different categories in arbitrary order.</Paragraph> <Paragraph position="4"> Since unigram rules are context insensitive and trigram rules are too numerous, a third type of robustness rules, called bigram rules, was developed. A bigram rule actually consists of two rules: a rule of the form :Y -> Y Y: which generates the constituent Y deterministically, and a rule Y: -> :Z which selects the next constituent Z based on the current one. Given n categories, we obtain n rules of the first form and n^2 rules of the second form.</Paragraph>
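<Paragraph> These rule counts can be checked with a small Python sketch that generates the bigram robustness rules for a set of categories (the category names here are placeholders, not the grammar's actual label set):

categories = ["NC", "PPX", "ADVP", "VPA"]   # n = 4

# :Y -> Y Y:  (generates Y deterministically)
starters = [(":" + y, [y, y + ":"]) for y in categories]
# Y: -> :Z    (selects the next constituent based on the current one)
selectors = [(y + ":", [":" + z]) for y in categories for z in categories]

print(len(starters), len(selectors))        # n = 4 and n^2 = 16 rules
print(len(categories) ** 3)                 # trigram rules would need n^3 = 64
for lhs, rhs in starters[:1] + selectors[:2]:
    print(lhs, "->", " ".join(rhs))
</Paragraph>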
<Paragraph position="5"> Even when categories which directly project to some other category were omitted in the generation of the bigram rules for our German grammar, the number of rules was still fairly large. Hence we generalised some of the grammatical categories by adding additional chain rules. For example, the prepositional phrase categories PP.Akk:an, PP.Akk:auf, PP.Akk:gegen etc. were generalised to PPX by adding the rules PPX -> PP.Akk:an etc. Instead of n + 1 bigram rules for each of the 23 prepositional categories, we now obtained only n + 2 rules with the new category PPX. Altogether, 3,332 robustness rules were added.</Paragraph> </Section> </Section> <Section position="4" start_page="727" end_page="728" type="metho"> <SectionTitle> 3 Chunking </SectionTitle> <Paragraph position="0"> A head-lexicalised probabilistic context-free parser, called LoPar (Schmid, 1999), was used for parsing. The functionality of LoPar encompasses purely symbolic parsing as well as Viterbi parsing, inside-outside computation, POS tagging, chunking and training with PCFGs as well as H-L PCFGs. Because of the large number of parameters, in particular of H-L PCFGs, the parser smoothes the probability distributions in order to avoid zero probabilities. The absolute discounting method (Ney et al., 1994) was adapted to fractional counts for this purpose. LoPar also supports lemmatisation of the lexical heads of a H-L PCFG. The input to the parser consists of ambiguously tagged words. The tags are provided by a German morphological analyser (Schiller and Stöckert, 1995).</Paragraph> <Paragraph position="1"> The best chunk set of a sentence is defined as the set of chunks (with category, start and end position) for which the sum of the probabilities of all parses which contain exactly that chunk set is maximal. The chunk set of the most likely parse (i.e. the Viterbi parse) is not necessarily the best chunk set according to this definition, as the following PCFG shows: S -> A [0.6], S -> B [0.4], A -> C [0.5], A -> D [0.5], B -> x [1.0], C -> x [1.0], D -> x [1.0].</Paragraph> <Paragraph position="2"> This grammar generates the three parse trees (S (A (C x))), (S (A (D x))), and (S (B x)). The parse tree probabilities are 0.3, 0.3 and 0.4, respectively. The last parse is therefore the Viterbi parse of x. Now assume that {A, B} is the set of chunk categories. The most likely chunk set is then {(A, 0, 1)} because the sum of the probabilities of all parses which contain A is 0.6, whereas the sum of the probabilities of all parses containing B is only 0.4.</Paragraph> <Paragraph position="3"> computeChunks is a slightly simplified pseudocode version of the actual chunking algorithm:

computeChunks(G, P_rule)
  initialize float array p[G_V]
  initialize chunk set array chunks[G_V]
  for each vertex v in G_V in bottom-up order do
    if v is an or-node then
      initialize float array prob[chunks[d(v)]] to 0
      for each daughter v' in d(v) do
        prob[chunks[v']] ← prob[chunks[v']] + p[v']
      chunks[v] ← the chunk set cs maximising prob[cs]
      p[v] ← prob[chunks[v]]
    else
      p[v] ← P_rule(rule(v)) * the product of p[v'] for all v' in d(v)
      chunks[v] ← the union of chunks[v'] for all v' in d(v)
    if v is labelled with a chunk category C then
      chunks[v] ← chunks[v] ∪ {(C, start(v), end(v))}
  return chunks[root(G)]
</Paragraph> <Paragraph position="4"> computeChunks takes two arguments. The first argument is a parse forest G which is represented as an and-or-graph; G_V is the set of vertices. The second argument is the rule probability vector P_rule. d is a function which returns the daughters of a vertex. The algorithm computes the best chunk set chunks[v] and the corresponding probability p[v] for all vertices v in bottom-up order. chunks[d(v)] returns the set of chunk sets of the daughter nodes of vertex v. rule(v) returns the rule which created v and is only defined for and-nodes. start(v) and end(v) return the start and end position of the constituent represented by v.</Paragraph>
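<Paragraph> The pseudocode translates directly into Python. The sketch below is our own compact rendering, not LoPar's implementation; the and-or-graph for the toy grammar above is built by hand, and running it confirms that the best chunk set is {(A, 0, 1)} with probability 0.6 rather than the Viterbi chunk set {(B, 0, 1)}.

CHUNK_CATS = {"A", "B"}   # the chunk categories of the toy grammar

class Vertex:
    # kind is "or" or "and"; or-nodes carry the category label,
    # and-nodes carry the probability of the rule that created them.
    def __init__(self, kind, cat=None, daughters=(), ruleprob=1.0, span=(0, 1)):
        self.kind, self.cat = kind, cat
        self.daughters, self.ruleprob, self.span = daughters, ruleprob, span

def compute_chunks(root):
    p, chunks = {}, {}
    def visit(v):             # recursion visits daughters first, i.e. bottom-up
        if v in p:
            return
        for d in v.daughters:
            visit(d)
        if v.kind == "or":
            # sum the probabilities of daughters with identical chunk sets,
            # then keep the chunk set whose summed probability is maximal
            prob = {}
            for d in v.daughters:
                prob[chunks[d]] = prob.get(chunks[d], 0.0) + p[d]
            best = max(prob, key=prob.get)
            p[v], chunks[v] = prob[best], best
        else:
            # and-node: rule probability times daughter probabilities;
            # the chunk set is the union of the daughters' chunk sets
            pr, cs = v.ruleprob, frozenset()
            for d in v.daughters:
                pr, cs = pr * p[d], cs | chunks[d]
            p[v], chunks[v] = pr, cs
        if v.cat in CHUNK_CATS:
            chunks[v] = chunks[v] | {(v.cat,) + v.span}
    visit(root)
    return chunks[root], p[root]

# Hand-built and-or-graph for the toy grammar and the sentence "x":
x_leaf = Vertex("and")                               # terminal x
c = Vertex("or", "C", (Vertex("and", daughters=(x_leaf,)),))
d = Vertex("or", "D", (Vertex("and", daughters=(x_leaf,)),))
b = Vertex("or", "B", (Vertex("and", daughters=(x_leaf,)),))
a = Vertex("or", "A", (Vertex("and", daughters=(c,), ruleprob=0.5),
                       Vertex("and", daughters=(d,), ruleprob=0.5)))
s = Vertex("or", "S", (Vertex("and", daughters=(a,), ruleprob=0.6),
                       Vertex("and", daughters=(b,), ruleprob=0.4)))

print(compute_chunks(s))   # (frozenset({('A', 0, 1)}), 0.6)
</Paragraph>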
<Paragraph position="5"> The chunking algorithm was experimentally compared with chunk extraction from Viterbi parses. In 35 out of 41 evaluation runs with different parameter settings (the runs differed with respect to training strategy and number of iterations; see section 4 for details), the f-score of the chunking algorithm was better than that of the Viterbi algorithm. The average f-score of the chunking algorithm was 84.7% compared to 84.0% for the Viterbi algorithm.</Paragraph> </Section> </Paper>