<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2149"> <Title>XTAG System - A Wide Coverage Grammar for English</Title> <Section position="3" start_page="0" end_page="925" type="metho"> <SectionTitle> 2 SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> Figure 1 shows the overall flow of the system when parsing a sentence. The input sentence is submitted to the Morphological Analyzer and the Tagger.</Paragraph> <Paragraph position="1"> The morphological analyzer retrieves the morphological information for each individual word from the morphological database. This output is filtered in the P.O.S Blender using the output of the trigram tagger to reduce the part-of-speech ambiguity of the words. *currently at BBN, Cambridge, MA, USA</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Input Sentence </SectionTitle> <Paragraph position="0"> [Figure 1 diagram: Morph Analyzer, Tagger, P.O.S Blender, Tree Selection]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Derivation Structure </SectionTitle> <Paragraph position="0"> Figure 1: Overview of XTAG system. The sentence, now annotated with part-of-speech tags and morphological information for each word, is input to the Parser, which consults the syntactic database and tree database to retrieve the appropriate tree structures for each lexical item. A variety of heuristics are used to reduce the number of trees selected. 
The parser then composes the structures to obtain the parse(s) of the sentence.</Paragraph> </Section> <Section position="3" start_page="0" end_page="922" type="sub_section"> <SectionTitle> 2.1 Morphological Analyzer </SectionTitle> <Paragraph position="0"> The morphology database [Karp et al., 1992] was originally extracted from the 1979 edition of the Collins English Dictionary and the Oxford Advanced Learner's Dictionary of Current English, and then cleaned up and augmented by hand. It consists of approximately 317,000 inflected items, along with their root forms and inflectional information (such as case, num- Noun/Verb Contraction. Nouns and Verbs are the largest categories, with approximately 213,000 and 46,500 inflected forms, respectively. The access time for a given inflected entry is 0.6 msec.</Paragraph> </Section> <Section position="4" start_page="922" end_page="922" type="sub_section"> <SectionTitle> 2.2 Part-of-Speech Tagger </SectionTitle> <Paragraph position="0"> A trigram part-of-speech tagger [Church, 1988], trained on the Wall Street Journal Corpus, is incorporated in XTAG. The trigram tagger has been extended to output the N-best part-of-speech sequences [Soong and Huang, 1990]. XTAG uses this information to reduce the number of specious parses by filtering the possible parts-of-speech provided by the morphological analyzer for each word. The tagger decreases the time to parse a sentence by an average of 93%.</Paragraph> </Section> <Section position="5" start_page="922" end_page="924" type="sub_section"> <SectionTitle> 2.3 Parser </SectionTitle> <Paragraph position="0"> The system uses an Earley-style parser which has been extended to handle feature structures associated with trees [Schabes, 1990]. The parser uses a general two-pass parsing strategy for 'lexicalized' grammars [Schabes, 1988]. 
In the tree-selection pass, the parser uses the syntactic database entry for each lexical item in the sentence to select a set of elementary structures from the tree database. The tree-grafting pass composes the selected trees using substitution and adjunction operations to obtain the parse of the sentence.</Paragraph> <Paragraph position="1"> The output of the parser for the sentence 'I had a map yesterday' is illustrated in Figure 2. The parse tree represents the surface constituent structure, while the derivation tree represents the derivation history of the parse. (Each node of the parse tree has a feature structure, not shown here, associated with it.) The nodes of the derivation tree are the tree names anchored by the lexical items. The composition operation is indicated by the nature of the arcs; a dashed line is used for substitution and a bold line for adjunction. The number beside each tree name is the address of the node at which the operation took place.</Paragraph> <Paragraph position="2"> The derivation tree can also be interpreted as a dependency graph with unlabeled arcs between words of the sentence.</Paragraph> <Paragraph position="3"> Heuristics that take advantage of LTAGs have been implemented to improve the performance of the parser. For instance, the span of the tree and the position of the anchor in the tree are used to weed out unsuitable trees in the first pass of the parser. Statistical information about the usage frequency of the trees has been acquired by parsing corpora.</Paragraph> <Paragraph position="5"> This information has been compiled into a statistical database (the Lex Prob DB) that is used by the parser. These methods speed the runtime by approximately 87%.</Paragraph> <Paragraph position="6"> 2.3.1 Heuristics for Ranking the Parses. The parser generates the parses in a rank order. This ranking is determined using a combination of heuristics, which are expressed as structural preferences for derivation, e.g. 
attachment sites of adjuncts, right- vs. left-branching structures, topicalized sentences, etc.</Paragraph> <Paragraph position="7"> Similar heuristics have been used for other parsers.</Paragraph> <Paragraph position="8"> See recent work by [Hobbs and Bear, 1994], [McCord, 1993], and [Nagao, 1994].</Paragraph> <Paragraph position="9"> A partial list of the heuristics used in XTAG follows: 1. Prefer argument positions to adjunct positions (here, this amounts to preferring fewer adjunction operations).</Paragraph> <Paragraph position="10"> 2. For PPs other than 'of', attach to the nearest site that is not a proper noun.</Paragraph> <Paragraph position="11"> 3. Prefer right-branching structure for sequences of adjectives, adverbs and PPs.</Paragraph> <Paragraph position="12"> 4. Prefer left-branching structure for sequences of nouns.</Paragraph> <Paragraph position="13"> 5. Prefer high attachment (wide-scope) for a modifier and a sequence of modifiees of the same type (i.e. a PP following or preceding a coordinate NP, an adjective or determiner preceding a coordinate NP or sequence of Ns, an N preceding coordinate Ns).</Paragraph> <Paragraph position="14"> These rankings are used to control the number of sentences passed on to further levels of processing. In applications emphasizing speed, only the highest ranked parse will be considered; in applications emphasizing accuracy, the top N parses can be considered. The syntactic database associates lexical items with the appropriate trees and tree families based on selectional information. The syntactic database entries were originally extracted from the Oxford Advanced Learner's Dictionary and the Oxford Dictionary for Contemporary Idiomatic English, and then modified and augmented by hand. There are more than 37,000 syntactic database entries. Selected entries from this database are shown in Table 1. 
Each syntactic entry consists of an INDEX field, the uninflected form under which the entry is compiled in the database, an ENTRY field, which contains all of the lexical items that will anchor the associated tree(s), a POS field, which gives the part-of-speech for the lexical item(s) in the ENTRY field, and then either (but not both) a TREES or FAM field. The TREES field indicates a list of individual trees to be associated with the entry, while the FAM field indicates a list of tree families. A tree family, which corresponds to a subcategorization frame (see section 2.3.3), may contain a number of trees. A syntactic entry may also contain a list of feature templates (Fs) which expand out to feature equations to be placed in the specified tree(s). Any number of EX fields may be provided for example sentences. Note that lexical items may have more than one entry and may select the same tree more than once, using different features to capture lexical idiosyncrasies (e.g. have).</Paragraph> <Paragraph position="15"> Trees in the English LTAG framework fall into two conceptual classes. The smaller class consists of individual trees such as trees (a), (d), and (e) in Figure 3. These trees are generally anchored by non-verbal lexical items. The larger class consists of trees that are grouped into tree families. These tree families represent subcategorization frames; the trees in a tree family would be related to each other transformationally in a movement-based approach. Trees 3(b) and 3(c) are members of two distinct tree families. As illustrated by trees 3(d) and 3(e), each node of a tree is annotated with a set of features whose values may be specified within the tree or may be derived from the syntactic database. 
There are 385 trees that compose 40 tree families, along with 62 individually selected trees in the tree database.</Paragraph> <Paragraph position="16"> The statistics database contains tree unigram frequencies which have been collected by parsing the Wall Street Journal, the IBM manual, and the ATIS corpus using the XTAG English grammar. The parser, augmented with the statistics database [Joshi and Srinivas, 1994], assigns each word of the input sentence the top three most frequently used trees given the part-of-speech of the word. On failure the parser retries using all the trees suggested by the syntactic database for each word. The augmented parser has been observed to have a success rate of 50% without retries.</Paragraph> </Section> <Section position="6" start_page="924" end_page="925" type="sub_section"> <SectionTitle> 2.4 X-Interface </SectionTitle> <Paragraph position="0"> XTAG provides a graphical interface for manipulating TAGs. The interface offers the following: Menu-based facility for creating and modifying tree files and loading grammar files.</Paragraph> <Paragraph position="1"> User controlled parser parameters, including the parsing of categories (S, embedded S, NP, DetP), and the use of the tagger (on/off/retry on failure). Storage/retrieval facilities for elementary and parsed trees as text files.</Paragraph> <Paragraph position="2"> The production of postscript files corresponding to elementary and parsed trees.</Paragraph> <Paragraph position="3"> Graphical displays of tree and feature data structures, including a scroll 'web' for large tree structures.</Paragraph> <Paragraph position="4"> Mouse-based tree editor for creating and modifying trees and feature structures.</Paragraph> <Paragraph position="5"> Hand combination of trees by adjunction or substitution for use in diagnosing grammar problems. 
Figure 4 shows the X window interface after a number of sentences have been parsed.</Paragraph> <Paragraph position="6"> Although XTAG is being extended to handle sentence fragments, they are not included at present, and are thereby not reflected in the data in Table 2. Statistical information from the parsed corpora described in Section 2.3.4 is presently used only for speeding the parser but not to tune the grammar to parse any specific corpus. Note, then, that the data below does not involve any corpus training.</Paragraph> </Section> </Section> <Section position="4" start_page="925" end_page="925" type="metho"> <SectionTitle> 3 ENGLISH GRAMMAR </SectionTitle> <Paragraph position="0"> The morphology, syntactic, and tree databases together comprise the English grammar. Lexical items not in the databases are handled by default mechanisms. The range of syntactic phenomena that can be handled is large and includes auxiliaries (including inversion), copula, raising and small clause constructions, topicalization, relative clauses, infinitives, gerunds, passives, adjuncts, it-clefts, wh-clefts, PRO constructions, noun-noun modifications, extraposition, determiner phrases, genitives, negation, noun-verb contractions and imperatives. Analyses for sentential adjuncts and time NP adverbials are currently being implemented. The combination of large scale lexicons and wide phenomena coverage results in a robust system.</Paragraph> </Section> <Section position="5" start_page="925" end_page="926" type="metho"> <SectionTitle> 4 CORPUS PARSING AND EVALUATION </SectionTitle> <Paragraph position="0"> XTAG has recently been used to parse the Wall Street Journal (sentences of length &lt;= 15 words), IBM manual, and ATIS corpora as a means of evaluating the coverage and correctness of XTAG parses. For this evaluation, a sentence is considered to have parsed correctly if XTAG produces parse trees. 
Verifying the presence of the correct parse among the parses generated is done manually at present. Table 2 shows the preliminary results. We will present more complete and rigorous results by the time of the conference and compare them with other parsers in the same class as XTAG.</Paragraph> <Section position="1" start_page="925" end_page="926" type="sub_section"> <SectionTitle> 4.1 Comparison with IBM Parser </SectionTitle> <Paragraph position="0"> A more detailed experiment to measure the crossing bracket accuracy of the XTAG-parsed IBM-manual sentences has been performed. Of the 1600 IBM sentences that have been parsed (those available from the Penn Treebank [Marcus et al., 1993]), only 67 overlapped with the IBM-manual treebank that was bracketed by University of Lancaster. The XTAG parses for these 67 sentences were compared with the Lancaster IBM-manual treebank.</Paragraph> <Paragraph position="1"> Table 3 shows the results obtained in this experiment. It also shows the crossing bracket accuracy of the latest IBM statistical parser [Jelinek et al., 1994] on the same genre of sentences. Recall is a measure of the number of bracketed constituents the system got right divided by the number of constituents in the corresponding Treebank sentences. Precision is the number of bracketed constituents the system got right divided by the number of bracketed constituents in the system's parse.</Paragraph> <Paragraph position="2"> Based on the present data, we believe our results will be consistent for the complete XTAG-parsed IBM corpus; we plan to evaluate the XTAG parses against the Penn Treebank. In addition, we are parsing the Lancaster sentences, and adding those to the XTAG IBM corpus.</Paragraph> <Paragraph position="3"> While the crossing-brackets measure is useful for comparing the output of different parsers, we believe that it is a somewhat inadequate method for evaluating a parser like XTAG for two main reasons. 
First, the parse generated by the XTAG system is much richer in its representation of the internal structure of certain phrases than those present in manually created treebanks. Even though the Lancaster treebank is more detailed in terms of bracketing than the Penn Treebank, it is not complete in its bracketing of the internal structure of noun phrases. As a result of comparing the XTAG parse with a skeletal representation, the precision score is misleadingly low for the XTAG system.</Paragraph> <Paragraph position="4"> A second reason that the crossing bracket measure is inadequate for evaluating XTAG is that the primary structure in XTAG is the derivation tree from which the bracketed tree is derived. Two identical bracketings for a sentence can have completely different derivation trees. A more direct measure of the performance of the XTAG parser would evaluate the derivation structure, which captures the dependencies between words.</Paragraph> </Section> <Section position="2" start_page="926" end_page="926" type="sub_section"> <SectionTitle> 4.2 Comparison with Alvey </SectionTitle> <Paragraph position="0"> We also compared the XTAG parser to the Alvey Natural Language Tools (ANLT) parser, and found that the two performed comparably. We parsed the set of LDOCE Noun Phrases presented in Appendix B of the technical report [Carroll, 1993], using the XTAG parser. The technical report presents the ranking of the correct parse and also gives the total number of derivations for each noun phrase. In this experiment, we have compared the total number of derivations obtained from XTAG with that obtained from the ANLT parser.</Paragraph> <Paragraph position="1"> Table 4 summarizes the results of this experiment.</Paragraph> <Paragraph position="2"> A total of 143 noun phrases were parsed. 
The NPs which did not have a correct parse in the top three derivations for the ANLT parser were considered as failures for ANLT. The maximum and average number of derivations columns show the highest and the average number of derivations produced for the NPs that have a correct derivation in the top three derivations. For the XTAG system, performance results with and without the POS tagger are shown. It would be interesting to see if the two systems performed similarly on a wider range of data. In [Carroll, 1993], only the LDOCE NPs are annotated with the number of derivations; we are interested in getting more data annotated with this information, in order to make further comparisons. Footnote 5: Because the NPs are, on average, shorter than the sentences on which it was trained, the performance of the POS tagger is</Paragraph> </Section> </Section> </Paper>