<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1104"> <Title>SYNTACTIC ANALYSIS OF NATURAL LANGUAGE USING LINGUISTIC RULES AND CORPUS-BASED PATTERNS</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 KNOWLEDGE ACQUISITION </SectionTitle>
<Paragraph position="0"> We have used two schemes to extract knowledge from corpora. Both produce readable patterns that can be verified by a linguist. In the first scheme, sentences are handled as units and information about the structure of the sentence is extracted. Only the main constituents (like subjects and objects) of the sentence are treated at this stage. The second scheme works with local context and looks only a few words to the right and to the left. It is used to resolve the modifier-head dependencies in the phrases.</Paragraph>
<Paragraph position="1"> First, we form an axis of the sentence using some given set of syntactic tags. We collect several layers of patterns that may be partly redundant with each other. For instance, simplifying a little, we can say that a sentence can be of the form subject -- main verb, and there may be other words before and after the subject and main verb. We may also say that a sentence can be of the form subject -- main verb -- object. The latter is totally covered by the former, because the former statement does not prohibit the appearance of an object but does not require it either.</Paragraph>
<Paragraph position="2"> The redundant patterns are collected on purpose. During parsing we try to find the strictest frame for the sentence. If we cannot apply some pattern because it conflicts with the sentence, we may use another, possibly more general, pattern. For instance, an axis that describes all accepted combinations of subjects, objects and main verbs in the sentence is stricter than an axis that describes all accepted combinations of subjects and main verbs.</Paragraph>
<Paragraph position="3"> After applying the axes, the parser's output is usually still ambiguous, because not all syntactic tags are taken into account yet (we do not handle, for instance, determiners and adjective premodifiers here). The remaining ambiguity is resolved using local information derived from a corpus. The second phase has a more probabilistic flavour, although no actual probabilities are computed. We represent the information in a readable form, where all contexts that are common enough are listed for each syntactic tag. The length of the contexts may vary: the common contexts are longer than the rare ones. In parsing, we try to find a match for each word in a maximally long context.</Paragraph>
<Paragraph position="4"> Briefly, the relation between the axes and the joints is the following: the axes force sentences to comply with the established frames, and if more than one possibility is found, the joints are used to rank them.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The sentence axis </SectionTitle>
<Paragraph position="0"> In this section we present a new method to collect information from a tagged corpus. We define a new concept, the sentence axis. The sentence axis is a pattern that describes the sentence structure at an appropriate level. We use it to select a group of possible analyses for the sentence. In our implementation, we form a group of sentence axes, and the parser selects, using the axes, those analyses of the sentence that match all, or as many as possible, of the sentence axes.</Paragraph>
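<Paragraph> As a concrete illustration of this selection step, consider the minimal sketch below. It is ours rather than the authors' implementation: it assumes each axis is encoded as a regular expression over a space-separated tag sequence, with regex gaps standing in for intervening words, and it keeps the candidate analyses that satisfy as many axes as possible, so that a stricter frame wins whenever it applies.</Paragraph>

```python
import re

# Two redundant layers of patterns, as in the running example: the second
# axis is stricter because it also requires an object. (Illustrative only.)
AXES = [
    r"(.+ )?SUBJ (.+ )?\+FAUXV( .+)?",            # subject -- finite auxiliary
    r"(.+ )?SUBJ (.+ )?\+FAUXV (.+ )?OBJ( .+)?",  # ... additionally an object
]

def axis_count(analysis, axes=AXES):
    """Number of axes matched by one candidate analysis (a list of tags)."""
    tags = " ".join(analysis)
    return sum(1 for axis in axes if re.fullmatch(axis, tags))

def select_analyses(candidates, axes=AXES):
    """Keep the candidate analyses that match as many axes as possible."""
    best = max(axis_count(c, axes) for c in candidates)
    return [c for c in candidates if axis_count(c, axes) == best]
```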
<Paragraph position="1"> We define the sentence axis in the following way. Let S be a set of sentences and T a set of syntactic tags. The sentence axis of S according to the tags T shows the order of appearance of any tag in T for every sentence in S.</Paragraph>
<Paragraph position="2"> Here, we will demonstrate the usage of a sentence axis with one sentence. In our real application we, of course, use more text to build up a database of sentence axes. Consider the following sentence 3</Paragraph>
<Paragraph position="3"> I_SUBJ would_+FAUXV also_ADVL increase_-FMAINV child_NN> benefit_OBJ , give_-FMAINV some_QN> help_OBJ to_ADVL the_DN> car_NN> industry_<P and_CC relax_-FMAINV rules_OBJ governing_<NOM-FMAINV local_AN> authority_NN> capital_AN> receipts_OBJ , allowing_-FMAINV councils_SUBJ to_INFMARK> spend_-FMAINV more_ADVL .</Paragraph>
<Paragraph position="4"> The axis according to the manually defined set T = {SUBJ, +FAUXV, +FMAINV} is</Paragraph>
<Paragraph position="5"> SUBJ +FAUXV ... SUBJ ...</Paragraph>
<Paragraph position="6"> which shows in what order the elements of the set T appear in the sentence above, and where the three dots mean that there may be something between the words, e.g. +FAUXV is not followed (in this case) immediately by SUBJ. When we have more than one sentence, the axis contains more than one possible order for the elements of the set T.</Paragraph>
<Paragraph position="7"> The axis we have extracted is quite general. It defines the order in which the finite verbs and subjects of the sentence may occur, but it does not say anything about the nonfinite verbs in the sentence. Notice that the second subject is not actually the subject of the finite clause, but the subject of the nonfinite construction councils to spend more. This is inconvenient, and the question arises whether there should be a specific tag to mark the subjects of nonfinite clauses. Voutilainen and Tapanainen [1993] argued that a richer set of tags could make parsing more accurate in a rule-based system. It may be true here as well.</Paragraph>
<Paragraph position="8"> We can also specify an axis for the verbs of the sentence. Thus the axis according to the set T = {+FAUXV, -FMAINV, INFMARK>} is</Paragraph>
<Paragraph position="9"> +FAUXV ... -FMAINV ... -FMAINV ... -FMAINV ... -FMAINV ... INFMARK> -FMAINV ...</Paragraph>
<Paragraph position="10"> The nonfinite verbs occur in this axis four times one after another. We do not want just to list how many times a nonfinite verb may occur (or occurs in a corpus) in this kind of position, so we clearly need some generalisations.</Paragraph>
<Paragraph position="11"> The fundamental rule of generalisation that we used is the following: anything that is repeated may be repeated any number of times. We mark this using brackets and a plus sign. The generalised axis for the above axis is</Paragraph>
<Paragraph position="12"> +FAUXV ... (-FMAINV ...)+ INFMARK> -FMAINV ...</Paragraph>
<Paragraph position="13"> 3 The tag set is adapted from the Constraint Grammar of English as it is. It is more extensive than is commonly used in tagged corpora projects (see Appendix A).</Paragraph>
<Paragraph position="14"> Note that we silently added an extra dot between one -FMAINV and OBJ in order not to make a distinction between -FMAINV OBJ and -FMAINV ... OBJ here.</Paragraph>
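<Paragraph> The extraction and generalisation steps can be summarised in a short sketch. Again, this is our reconstruction, not the paper's published algorithm; in particular, collapsing the shortest repeated unit at each point is only one way to apply the repetition rule.</Paragraph>

```python
def sentence_axis(tagged_sentence, T):
    """Project a tagged sentence onto the tag set T; '...' marks stretches of
    words whose tags are not in T. tagged_sentence: list of (word, tag).
    E.g. with T = {'SUBJ', '+FAUXV'} the example sentence above yields
    ['SUBJ', '+FAUXV', '...', 'SUBJ', '...']."""
    axis, gap = [], False
    for _, tag in tagged_sentence:
        if tag in T:
            if gap and axis:
                axis.append("...")
            axis.append(tag)
            gap = False
        else:
            gap = True
    if gap and axis:
        axis.append("...")
    return axis

def generalise(axis):
    """Repetition rule: anything that is repeated may be repeated any number
    of times. Collapses the shortest immediately repeated unit, e.g. four
    '-FMAINV ...' in a row become '(-FMAINV ...)+'."""
    out, i = [], 0
    while i < len(axis):
        collapsed = False
        if axis[i] != "...":  # let repeated units start at a tag, not a gap
            for size in range(1, len(axis) // 2 + 1):
                unit = axis[i:i + size]
                n = 1
                while axis[i + n * size:i + (n + 1) * size] == unit:
                    n += 1
                if n > 1:
                    out.append("(" + " ".join(unit) + ")+")
                    i += n * size
                    collapsed = True
                    break
        if not collapsed:
            out.append(axis[i])
            i += 1
    return out
```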
<Paragraph position="15"> Another generalisation can be made using equivalence classes. We can assign several syntactic tags to the same equivalence class (for instance -FMAINV, <NOM-FMAINV and <P-FMAINV), and then generate the axes as above. The result would be</Paragraph>
<Paragraph position="16"> +FAUXV ... (nonfinv ...)+ INFMARK> nonfinv ...</Paragraph>
<Paragraph position="17"> where nonfinv denotes both -FMAINV and <NOM-FMAINV (and also <P-FMAINV).</Paragraph>
<Paragraph position="18"> The equivalence classes are essential in the present tag set, because the syntactic arguments of finite verbs are not distinguished from the arguments of nonfinite verbs. Using equivalence classes for the finite and nonfinite verbs, we may build a generalisation that applies to both types of clauses. Another way to solve the problem is to add new tags for the arguments of the nonfinite clauses, and to make several axes for them.</Paragraph>
</Section>
<Section position="2" start_page="630" end_page="631" type="sub_section"> <SectionTitle> 2.2 Local patterns </SectionTitle>
<Paragraph position="0"> In the second phase of the pattern parsing scheme we apply the local patterns, the joints. They contain information about what kinds of modifiers have what kinds of heads, and vice versa.</Paragraph>
<Paragraph position="1"> For instance, in the following sentence 4 the words fair and crack are both three ways ambiguous before the axes are applied.</Paragraph>
<Paragraph position="2"> ... the_DN> World_<P/NN> Cup_<P/OBJ .</Paragraph>
<Paragraph position="3"> After the axes have been applied, the noun phrase a fair crack has the analyses a_DN> fair_AN>/NN> crack_OBJ. The word fair is still left partly ambiguous. We resolve this ambiguity using the joints.</Paragraph>
<Paragraph position="4"> 4 This analysis is comparable to the output of ENGCG. The ambiguity is marked here using a slash. The morphological information is not printed.</Paragraph>
<Paragraph position="5"> In an ideal case we have only one head in each phrase, although it may not be in its exact location yet. The following sentence fragment demonstrates this:</Paragraph>
<Paragraph position="6"> They_SUBJ have_+FAUXV been_-FMAINV much_AD-A> less_PCOMPL-S/AD-A> attentive_<NOM/PCOMPL-S to_<NOM/ADVL the_DN> ...</Paragraph>
<Paragraph position="7"> In this analysis, the head of the phrase much less attentive may be either less or attentive. If it is less, the word attentive is a postmodifier, and if the head is attentive, then less is a premodifier. The sentence is represented internally in the parser in such a way that if the axes make this distinction, i.e. force there to be exactly one subject complement, there are only two possible paths which the joints can select from: less_AD-A> attentive_PCOMPL-S and less_PCOMPL-S attentive_<NOM.</Paragraph>
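<Paragraph> This internal representation can be pictured as a lattice of per-word tag alternatives, where every path through the lattice is one candidate analysis. The following sketch (ours, with illustrative data) enumerates the paths for the fragment above and applies the one-subject-complement restriction:</Paragraph>

```python
from itertools import product

# An ambiguous fragment as per-word tag alternatives; every path through
# the alternatives is one candidate analysis of the fragment.
fragment = [
    ("much", ["AD-A>"]),
    ("less", ["PCOMPL-S", "AD-A>"]),
    ("attentive", ["<NOM", "PCOMPL-S"]),
]

def candidate_paths(fragment):
    """Enumerate every combination of the per-word tag alternatives."""
    words = [w for w, _ in fragment]
    for tags in product(*(alts for _, alts in fragment)):
        yield list(zip(words, tags))

# If the axes force exactly one subject complement, only two of the four
# paths survive: less_AD-A> attentive_PCOMPL-S and
# less_PCOMPL-S attentive_<NOM; the joints then choose between them.
paths = [p for p in candidate_paths(fragment)
         if sum(tag == "PCOMPL-S" for _, tag in p) == 1]
```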
<Paragraph position="8"> Generating the joints is quite straightforward. We produce different alternative variants for each syntactic tag and select some of them. We use a couple of parameters to validate the possible joint candidates.</Paragraph>
<Paragraph position="9"> * The error margin provides the probability for checking whether a context is relevant, i.e. whether there is enough evidence for it among the existing contexts of the tag. This probability may be used in two ways:</Paragraph>
<Paragraph position="10"> -- For a syntactic tag, generate all contexts (of length n) that appear in the corpora. Select all those contexts that are frequent enough. Do this for all values of n: 1, 2, ...</Paragraph>
<Paragraph position="11"> -- First generate all contexts of length 1. Select those contexts that are frequent enough among the generated contexts. Next, lengthen all the contexts selected in the previous step by one word. Select those contexts that are frequent enough among the newly generated contexts. Repeat this sufficiently many times.</Paragraph>
<Paragraph position="12"> Both algorithms produce a set of contexts of different lengths. Characteristic of both algorithms is that if they have generated a context of length n that matches a syntactic function in a sentence, there is also a context of length n - 1 that matches.</Paragraph>
<Paragraph position="13"> * The absolute margin is the number of cases that is needed as evidence for a generated context. If there is less evidence, the context is not taken into account and a shorter context is generated. This is used to prevent strange behaviour with syntactic tags that are not very common, or with a corpus that is not big enough.</Paragraph>
<Paragraph position="14"> * The maximum length of the context to be generated.</Paragraph>
<Paragraph position="15"> During parsing, longer contexts are preferred to shorter ones. The parsing problem is thus a kind of pattern matching problem: we have to match a pattern (context) around each tag and find a sequence of syntactic tags (an analysis of the sentence) that has the best score. The scoring function depends on the lengths of the matched patterns.</Paragraph>
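<Paragraph> The iterative variant, together with the longest-match preference used in parsing, can be sketched as follows. This is our reconstruction under stated assumptions, not the authors' code: joints are simplified to one-sided (left-hand) contexts, and rel_margin, abs_margin and max_len stand in for the error margin, the absolute margin and the maximum context length.</Paragraph>

```python
from collections import Counter

def generate_joints(corpus, tag, rel_margin=0.01, abs_margin=5, max_len=3):
    """Iterative joint generation (sketch): generate the length-1 contexts,
    keep the frequent ones, lengthen the survivors by one word, filter
    again, and repeat. corpus: list of sentences, each a list of tags.
    Returns left-hand contexts (tuples of tags) attested for `tag`."""
    lefts = [tuple(sent[:i]) for sent in corpus
             for i, t in enumerate(sent) if t == tag]
    total = len(lefts)
    joints, survivors = set(), {()}      # () = the empty context of length 0
    for n in range(1, max_len + 1):
        # a length-n context is kept only if it extends a surviving
        # length-(n-1) context, so a match of length n implies one of n-1
        counts = Counter(left[-n:] for left in lefts
                         if len(left) >= n and left[-n:][1:] in survivors)
        survivors = {ctx for ctx, c in counts.items()
                     if c >= abs_margin and c / total >= rel_margin}
        joints |= survivors
        if not survivors:
            break
    return joints

def score(analysis, joints_by_tag):
    """Longer matched contexts are preferred: score a candidate tag sequence
    by the length of the longest joint matching around each tag (sketch)."""
    total = 0
    for i, tag in enumerate(analysis):
        contexts = joints_by_tag.get(tag, ())
        total += max((len(c) for c in contexts
                      if len(c) <= i and tuple(analysis[i - len(c):i]) == c),
                     default=0)
    return total
```

<Paragraph> With joints_by_tag = {t: generate_joints(corpus, t) for t in tagset}, the candidate paths enumerated in the previous sketch can be ranked by score.</Paragraph>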
</Section> </Section>
<Section position="6" start_page="631" end_page="632" type="metho"> <SectionTitle> 3 EXPERIMENTS WITH REAL CORPORA </SectionTitle>
<Paragraph position="0"> Information concerning the axes was acquired from a manually checked and fully disambiguated corpus 5 of about 30,000 words and 1,300 sentences. Local context information was derived from corpora that were analysed by ENGCG. We generated three different parsers using three different corpora 6. Each corpus contains about 10 million words.</Paragraph>
<Paragraph position="1"> For evaluation we used four test samples (in Figure 1). Three of them were taken from the corpora that we used to generate the parsers, and one is an additional sample. The samples named bb1, today and wsj belong to the corpora from which the three joint parsers, called BB1, TODAY and WSJ respectively, were generated. Sample bb2 is the additional sample that was not used during the development of the parsers.</Paragraph>
<Paragraph position="2"> The ambiguity rate tells us how much ambiguity is left after the ENGCG analysis, i.e. how many words still have one or more alternative syntactic tags. The error rate shows us how many syntactic errors ENGCG has made while analysing the texts. Note that the ambiguity denotes the amount of work to be done, and the error rate denotes the number of errors that already exist in the input of our parser.</Paragraph>
<Paragraph position="3"> All the samples were analysed with each generated parser (in Figure 2). The idea is to find out the effects of different text types on the generation of the parsers. The present method is applied to reduce the syntactic ambiguity to zero. Success rates vary from 88.5 % to 94.3 % in the different samples. There is at most a 0.5 percentage point difference in the success rate between the parsers when applied to the same data. Applying a parser to a sample from the same corpus from which it was generated does not generally show better results.</Paragraph>
<Paragraph position="4"> Some of the distinctions left open by ENGCG may not be structurally resolvable (see [Karlsson et al., 1994]). A case in point is the prepositional attachment ambiguity, which alone represents about 20 % of the ambiguity in the ENGCG output. The proper way to deal with it in the CG framework is probably to use lexical information.</Paragraph>
<Paragraph position="5"> Therefore, as long as there is still structurally unresolvable ambiguity in the ENGCG output, a certain amount of processing before the present system analysed the samples might improve the results considerably, e.g. converting structurally unresolvable syntactic tags to a single underspecified tag. For instance, resolving the prepositional attachment ambiguity by other means would improve the success rate of the current system to 90.5 % - 95.5 %. In the wsj sample the improvement would be as much as 2.0 percentage points.</Paragraph>
<Paragraph position="6"> 5 Consisting of 15 individual texts from the Bank of English project [Järvinen, 1994]. The texts were chosen to cover a variety of text types, but due to its small size and intuitive sampling it cannot be truly representative.</Paragraph>
<Paragraph position="7"> The differences between the success rates in the different samples are partly explained by the error types that are characteristic of the samples. For example, in the Wall Street Journal adverbials of time are easily parsed erroneously. This may cause an accumulation effect, as happens in the following sentence: MAN AG Tuesday said fiscal 1989 net income rose 25% and said it will raise its dividend for the year ended June 30 by about the same percentage.</Paragraph>
<Paragraph position="8"> The phrase the year ended June 30 gets the analysis the_DN> year_NN> ended_AN> June_NN> 30_<P while the correct (or wanted) result is the_DN> year_<P ended_<NOM-FMAINV June_ADVL 30_<NOM.</Paragraph>
<Paragraph position="9"> A different kind of error appears in text bb2, which contains incomplete sentences. The parser prefers complete sentences and produces errors in passages like: There was Provence in mid-autumn. Gold tints. Air so serene you could look out over the sea for tens of miles. Rehabilitation walks with him along the woodland paths.</Paragraph>
<Paragraph position="10"> The errors are: gold tints is parsed as subject - main verb, as is rehabilitation walks, and air is analysed as a main verb. The other words have the appropriate analyses.</Paragraph>
<Paragraph position="11"> The strict sequentiality of morphological and syntactic analysis in ENGCG does not allow the use of syntactic information in morphological disambiguation. The present method makes it possible to prune the remaining morphological ambiguities, i.e. to do some part-of-speech tagging. A morphological ambiguity remains unresolved if the chosen syntactic tag is present in two or more morphological readings of the same word. Morphological ambiguity 7 is reduced close to zero (about 0.3 % in all the samples together), and the overall success rate of ENGCG + our pattern parser is 98.7 %.</Paragraph>
<Paragraph position="12"> 7 After ENGCG the amount of morphological ambiguity in the test data was 2.9 %, with an error rate of 0.4 %.</Paragraph>
</Section> </Paper>