<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1069"> <Title>EXPLORING THE ROLE OF PUNCTUATION IN PARSING NATURAL TEXT</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> PUNCTUATION </SectionTitle> <Paragraph position="0"> Punctuation, as we consider it, can be defined as the central part of the range of non-lexical orthography.</Paragraph> <Paragraph position="1"> Although arguments could be made for including the sub-lexical marks (e.g. hyphens, apostrophes) and structural marks (e.g. bullets in itemisations), they are excluded since they tend to be lexicalised or rather difficult to represent, respectively. Indeed, it is difficult to imagine the representation of structural punctuation, other than through the use of some special structural description language such as SGML.</Paragraph> <Paragraph position="2"> Within our definition of punctuation then, we find broadly three types of mark: delimiting, separating and disambiguating, as described by Nunberg [1990].</Paragraph> <Paragraph position="3"> Some marks, the comma especially, fall into multiple categories since they can have different roles, and the categories each perform distinct linguistic functions.</Paragraph> <Paragraph position="4"> Delimiters (e.g. comma, dash, parenthesis) occur to either side of a particular lexical expression to remove that expression from the immediate syntactic context of the surrounding sentence (1). The delimited phrase acts as a modifier to the adjacent phrase instead.</Paragraph> <Paragraph position="5"> (1) John, my friend, fell over and died.</Paragraph> <Paragraph position="6"> Separating marks come between similar grammatical items and indicate that the items form a list (2). 
They are therefore similar to conjunctions in their behaviour, and can sometimes replace conjunctions in a list.</Paragraph> <Paragraph position="7"> (2) I came, I saw, I conquered.</Paragraph> <Paragraph position="8"> I want butter, eggs and flour.</Paragraph> <Paragraph position="9"> Disambiguating marks, usually commas, occur where an unintentional ambiguity could result if the marks were not there (3), and so perhaps illustrate best why the use of punctuation within NL systems could be beneficial.</Paragraph> <Paragraph position="10"> (3) Earlier, work was halted.</Paragraph> <Paragraph position="11"> In addition to the nature of different punctuation marks, there are several phenomena described by Nunberg [1990] which it is useful to consider before implementing any treatment of punctuation: Point absorption: strong point symbols (comma, dash, semicolon, etc.) absorb weaker adjacent ones (4). Commas are least powerful, and periods¹ most powerful; (4) It was John, my friend.</Paragraph> <Paragraph position="12"> Bracket absorption: commas and dashes are removed if they occur directly before an end quote or parenthesis (5); (5) ... (my brother, the teacher)...</Paragraph> <Paragraph position="13"> Quote transposition: punctuation directly to the right of an end quote is moved to the left of that character (6). This phenomenon occurs chiefly in American English, but can occur generally; (6) He said, &quot;I love you.&quot; Graphic absorption: orthographically, but not linguistically, similar coincident symbols are absorbed (7). Thus the dot marking an abbreviation will absorb an adjacent period whereas it would not absorb an adjacent comma.</Paragraph> <Paragraph position="14"> (7) I work for the C.I.A., not the F.B.I. In addition to the phenomena associated with the interaction of punctuation, there are also distinct phenomena observable in the interaction of punctuation and lexical expressions. 
Thus delimited phrases cannot immediately contain delimited phrases of the same type (the sole exception may be with parentheticals, though many people object to nested parentheses) and adjuncts such as the colon-expansion cannot contain further similar adjuncts. Therefore, in the context of colon and semicolon scoping, (8) is ambiguous, but (9) is not. [Footnote 1: Throughout this paper I shall refer to sentence-final dots as periods rather than full-stops, to avoid confusion.]</Paragraph> <Paragraph position="15"> (8) words : words ; words .</Paragraph> <Paragraph position="16"> (9) words : words ; words : words .</Paragraph> </Section> <Section position="4" start_page="0" end_page="422" type="metho"> <SectionTitle> THE GRAMMAR </SectionTitle> <Paragraph position="0"> Recognition of punctuational phenomena does not imply that they can be successfully encoded into an NL grammar, or that the use of such a punctuated grammar will result in any analytical advantages.</Paragraph> <Paragraph position="1"> Nunberg [1990] advocates two separate grammars, operating at different levels. A lexical grammar is proposed for the lexical expressions occurring between punctuation marks, and a text grammar is proposed for the structure of the punctuation, and the relation of those marks to the lexical expressions they separate. 
The text grammar has within it distinct levels, such as phrasal and clausal, at which distinct punctuational phenomena can occur.</Paragraph> <Paragraph position="2"> This should, in theory, make for a very neat system: the lexical syntactic processes being kept separate from those that handle punctuation.</Paragraph> <Paragraph position="3"> However, in practice, this system seems unlikely to succeed since, in order to work, the lexical expressions that occur between punctuation marks must carry additional information about the syntactic categories occurring at their edges so that the text grammar can constrain the function of the punctuation marks.</Paragraph> <Paragraph position="4"> For example, if a sentence includes an itemised noun phrase (10), the lexical expression before the comma must be marked as ending with a noun phrase, and the lexical expression after the comma must be marked as starting with a noun phrase.</Paragraph> <Paragraph position="5"> A rule in the text grammar could then process the separating comma as it clearly comes between two similar syntactic elements.</Paragraph> <Paragraph position="6"> (10) He likes Willy, Ian and Tom.</Paragraph> <Paragraph position="7"> [end: np] [start: np] However, as (11) shows, the separating comma concept could require information about the categories at arbitrarily deep levels occurring at the ends of lexical expressions surrounding punctuation marks. (11) I like to walk, skip, and run.</Paragraph> <Paragraph position="8"> I like to walk, to skip, and to run.</Paragraph> <Paragraph position="9"> I like to walk, like to skip, but hate to run.</Paragraph> <Paragraph position="10"> Even with the above edge-category information, the parsing process is not necessarily made any easier (since often the full partial parses of all the separate expressions have to be held and joined). 
Therefore we seem to be at no advantage if we use this approach.</Paragraph> <Paragraph position="11"> In addition, it is difficult to imagine what linguistic or psychological motivation such a separation of punctuation from lexical text could hold, since it seems rather unlikely that people process punctuation at a separate level to the text it surrounds.</Paragraph> <Paragraph position="12"> Hence it seems more sensible to use an integrated grammar, which handles both words and punctuation. This lets us describe the interaction of punctuation and lexical expressions far more logically and concisely than if the two were separated. Good examples of this are disambiguating commas: in a unified grammar we can simply write rules with an optional comma among the daughters (12).</Paragraph> <Paragraph position="14"> A feature-based tag grammar was written for this investigation (based loosely on one written by Briscoe and Waegner [1992]), and used in conjunction with the parser included in the Alvey Tools' Grammar Development Environment (GDE) [Carroll et al., 1991], which allows for rapid prototyping and easy analysis of parses. It should be stressed that this grammar is solely one of tags, and so is not very detailed syntactically.</Paragraph> <Paragraph position="15"> In order to handle the additional complications of punctuation, the notion of stoppedness of a category has been introduced. Thus every category in the grammar has a stop feature which describes the punctuational character following it (13), and defaults to [st -] (unstopped) if there is no such character.</Paragraph> <Paragraph position="16"> (13) the man, = [st c] with the flowers. = [st f] Since the rules of the grammar further dictate that the mother category inherits the stop value of its rightmost daughter, only rules to specifically add punctuation for categories which could be lexicalised are necessary. 
Thus a rule for the addition of a punctuation mark after a lexicalised noun would be as in (14). (The calligraphic letters represent unification variables.) (14) n0[st S] → n0[st -] [punc S] We can then specify that top level categories must be [st f] (period), that items in a list should be [st c] (comma), etc. In rules where we want to force a particular punctuation mark to the right of a category, that mark can be included in the rule, with the preceding category unstopped: (15) illustrates the addition of a comma-delimited noun phrase to a noun phrase. Specifically mentioning the punctuation mark prevents the delimited phrase from being unstopped, resulting in an unstopped mother category. Note that the phenomenon of point absorption has been captured by unifying the value of the st feature of the mother and the identity of the final punctuation mark. Thus the possible values of st are all the possible values of punc in addition to [st -]. (15) np[st S] → np[st c] np[st -] [punc S].</Paragraph> <Paragraph position="17"> Thus the stop feature seems sufficient to cope with the punctuational phenomena introduced above. In order to incorporate the phenomena of interaction between punctuation and lexical expressions (e.g.</Paragraph> <Paragraph position="18"> preventing immediate nesting of similar delimited phrases), we need to introduce a small number of additional features into the grammar. If, for example, we make a comma-delimited noun phrase [cm +], we can then stipulate that any noun phrase that includes a comma-delimited phrase has the feature [cm -], so that the two cannot unify (16). 
Note that the unification of mother and right-most daughter stop values is omitted for clarity of presentation.</Paragraph> <Paragraph position="19"> (16) np[cm -] → np[st c] np[cm +, st -] We can incorporate the relative scoping of colons and semicolons, as discussed previously, into the grammar very easily too. The semicolon rule (17) accepts any value of co in its arguments, but the colon rule (18) only accepts [co -]. The mother category of the colon rule bears the feature [co +] to prevent inclusion into further colon-bearing sentences. Note that there are more versions of the colon rule, which deal with different constituents to either side of the colon, and also that, since the GDE does not permit the disjunction of feature values, the semicolon rule is merely an abbreviation of the multiple rules required in the grammar. Stop unification is again omitted.</Paragraph> <Paragraph position="20"> (17) s[co (A ∨ B)] → s[co A, st sc] s[co B].</Paragraph> <Paragraph position="21"> (18) s[co +] → s[co -, st co] s[co -]. Hence the inclusion of a few simple extra features in a normal grammar has achieved an acceptable treatment of punctuational phenomena. Since this work only represents the initial steps of providing a full and proper account of the role of punctuation, no claims are made for the theoretical validity or completeness of this approach!</Paragraph> </Section> <Section position="5" start_page="422" end_page="424" type="metho"> <SectionTitle> THE CORPUS </SectionTitle> <Paragraph position="0"> For the current investigation it was necessary to use a corpus sufficiently rich in punctuation to illustrate the possible advantages or disadvantages of utilising punctuation within the parsing process. Obviously a sentence which includes no punctuation will be equally difficult to parse with both punctuated and unpunctuated grammars. 
Similarly, for sentences including only one or two marks of punctuation, the use of punctuation is likely to be rather procedural, and hence not necessarily very revealing.</Paragraph> <Paragraph position="1"> Therefore the tagged Spoken English Corpus was chosen [Taylor &amp; Knowles, 1988]. This features some very long sentences, and includes rich and varied punctuation. Since the corpus has been punctuated manually, by several different people, some idiosyncrasy occurs in the punctuational style, but there is little punctuation which would be deemed inappropriate to the position it occurs in.</Paragraph> <Paragraph position="2"> A subset of 50 sentences was chosen from the whole corpus. Between them these sentences include material taken from news broadcasts, poetry readings, weather forecasts and programme reviews, so a wide variety of language is covered.</Paragraph> <Paragraph position="3"> The lengths of the sentences varied from 3 words to 63 words, the average being 31 words; and the punctuational complexity of the sentences varied from one mark (just a period) to 16 marks, the average being 4 punctuation marks. A sample tagged sentence is shown in (19), where fs denotes a period.</Paragraph> <Paragraph position="4"> (19) Their_APP$ meeting_NN1 involves_VVZ a_AT1 kind_NN1 of_IO life_NN1 swap_NN1 fs_FS The punctuated grammar, developed with this subset of the corpus, was used to parse the corpus subset, and then an unpunctuated version of the same grammar was used to parse the same subset.</Paragraph> <Paragraph position="5"> The reason that testing was performed on the training corpus was that, in the absence of a complete treatment of punctuation, the punctuational phenomena in the training corpus were the only ones the grammar could work with, and although they included almost all of the core phenomena mentioned, slightly different instances of the same phenomena could cause a parse failure. 
For reference, a small set of novel sentences was also parsed with the grammars, to determine their coverage outside the closed test.</Paragraph> <Paragraph position="6"> The unpunctuated version of the grammar was prepared by removing all the features relating to specifically punctuational phenomena, and also removing explicit mention of punctuation marks from the rules. This, of course, left behind certain rules that were functionally identical, and so duplicate rules were removed from the grammar. The same was done for rules which performed the same function at different levels in the grammar (e.g. attachment of prepositions to the end of a sentence with a comma was also catered for by rules allowing prepositions to be attached to noun and verb phrases without a comma).</Paragraph> <Paragraph position="7"> RESULTS Results of parsing with the punctuated grammar were very good, yielding, on average, a surprisingly small number of parses. The number of parses ranged from 1 to 520, with an average of 38. This average is unrepresentatively high, however, since only 4 sentences had over 50 parses. These were, in general, those with high numbers of punctuation marks, all containing at least 5, as in (20). Ignoring the four smallest and four largest results then, the average number of parses is reduced to just 15. Example (21) is more representative of parsing. On examination, a great number of the ambiguities seem to be due to inaccuracies or over-generality in the lexical tags assigned to words in the corpus. The word more, for example, is triply ambiguous as determiner, adjective and noun, irrespective of where it occurs in a sentence. (20) (The sunlit weeks between were full of maids: Sarah, with orange wig and horsy teeth, was so bad-tempered that she scarcely spoke; Maud was my hateful nurse who smelled of soap, and forced me to eat chewy bits of fish, thrusting me back to babyhood with threats of nappies, dummies, and the feeding bottle.) 
520 punctuated parses (21) (More news about the reverend Sun Myung Moon, founder of the Unification Church, who's currently in jail for tax evasion: he was awarded an honorary degree last week by the Roman Catholic University of la Plata in Buenos Aires, Argentina.) 18 punctuated parses Besides the ambiguity of corpus tags, a problem arose with words that had been completely mistagged.</Paragraph> <Paragraph position="8"> If these caused the parse to fail completely, the tag was changed in the development phase of the grammar, but even so, the number of complete mistags was rather small in the sub-corpus used: around 10 words in the 50 sentences used.</Paragraph> <Paragraph position="9"> Initial attempts at parsing the corpus subset using the unpunctuated version of the grammar were unsuccessful on even the most powerful machine available. This was due to the failure of the machine to represent all the parses simultaneously when unpacking the parse forest produced by the chart parser. A special section of code written for the GDE (grateful thanks are due to John Carroll for supplying this piece of code) to estimate the number of individual parses represented by the packed parse-forest showed that for all but the most basically punctuated sentences, the number of parses was ridiculously huge. The figure for the sentence in (21) was in excess of 6.3 x 10^12 parses! Even though this estimate is an upper bound, since effects of feature value percolation during unpacking are ignored, it has been fairly accurate with most grammars in the past and still indicates that rather too many parses are being produced! Not all sentences produced such a massive number of parses: the sentence in (22) yielded only 192 parses with the unpunctuated grammar, which was by far the smallest number of unpunctuated parses. 
Most sentences that managed to pass the estimation process produced between 10^6 and 10^9 parses.</Paragraph> <Paragraph position="10"> (22) (Protestants, however, are a tiny minority in Argentina, and the delegation won't be including a Roman Catholic.) 9 punctuated parses On examination of the grammar and the corpus, it is possible to understand why this has happened.</Paragraph> <Paragraph position="11"> The punctuated grammar had to allow for sentences including comma-delimited noun phrases adjacent to undelimited noun phrases, as illustrated by the rules (15) and (16). These are relatively easy to mark and recognise when the punctuation is available. However, without punctuational clues, and with the underspecific tagging system, any compound noun could appear as a set of delimited noun phrases with the unpunctuated grammar.</Paragraph> <Paragraph position="12"> Therefore the unpunctuated grammar was further trimmed, to such an extent that parses no longer accurately reflected the linguistic structure of the sentences, since, for example, comma-delimited noun phrases and compound nouns became indistinguishable. Some manual preparation of the sentences was also carried out to prevent the reoccurrence of simple, but costly, misparses.</Paragraph> <Paragraph position="13"> The results of the parse now became much more tractable. For basic sentences, as predicted, there was little difference in the performance of punctuated and unpunctuated grammars. Results were within an order of magnitude, showing that no significant advantage was gained through the use of punctuation. 
The sentences in (23) and (24) received 1 and 11 parses respectively with the unpunctuated grammar.</Paragraph> <Paragraph position="14"> (23) (Well, just recently, a day conference on miracles was convened by the research scientists, Christian Fellowship.)</Paragraph> </Section> <Section position="6" start_page="424" end_page="424" type="metho"> <SectionTitle> 4 punctuated parses </SectionTitle> <Paragraph position="0"> (24) (The assembly will also be discussing the UK immigration laws, Hong Kong, teenagers in the church, and of course, church unity schemes.)</Paragraph> </Section> <Section position="7" start_page="424" end_page="424" type="metho"> <SectionTitle> 2 punctuated parses </SectionTitle> <Paragraph position="0"> (25) (They want to know whether, for instance, in a scientific age, Christians can really believe in the story of the feeding of the five thousand as described, or was the miracle that those in the crowd with food shared it with those who had none?) 24 punctuated parses For the most complex sentences, however, the number of parses with the unpunctuated grammar was typically more than two orders of magnitude higher than with the punctuated grammar. The sentence in (25) had 12,096 unpunctuated parses.</Paragraph> <Paragraph position="1"> Parsing a set of ten previously unseen punctuationally complex sentences with the punctuated grammar resulted in seven of the ten being unparsable. The other three parsed successfully, with the number of parses falling within the range of the results of the first part of the investigation. The parse failures, on examination, were due to novel punctuational constructions occurring in the sentences which the grammar had not been designed to handle. 
</Paragraph> <Paragraph position="2"> Parsing the unseen sentences with the unpunctuated grammar resulted in one parse failure, with the results for the other 9 sentences reflecting the previous results for complex sentences.</Paragraph> </Section> </Paper>