<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1052">
  <Title>Automatic Extraction of Subcategorization from Corpora</Title>
  <Section position="4" start_page="356" end_page="358" type="intro">
    <SectionTitle>
2 Description of the System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="356" end_page="357" type="sub_section">
      <SectionTitle>
2.1 Overview
</SectionTitle>
      <Paragraph position="0"> The system consists of the following six components which are applied in sequence to sentences containing a specific predicate in order to retrieve a set of subcategorization classes for that predicate:  1. A tagger, a first-order HMM part-of-speech (PoS) and punctuation tag disambiguator, is used to assign and rank tags for each word and punctuation token in sequences of sentences (Elworthy, 1994).</Paragraph>
      <Paragraph position="1"> 2. A lemmatizer is used to replace word-tag  pairs with lemma-tag pairs, where a lemma is the morphological base or dictionary headword form appropriate for the word, given the PoS assignment made by the tagger. We use an enhanced version of the GATE project stemmer (Cunningham et al., 1995).</Paragraph>
      <Paragraph position="2"> 3. A probabilistic LR parser, trained on a treebank, returns ranked analyses (Briscoe &amp; Carroll, 1993; Carroll, 1993, 1994), using a grammar written in a feature-based unification grammar formalism which assigns 'shallow' phrase structure analyses to tag networks (or 'lattices') returned by the tagger (Briscoe &amp; Carroll, 1994, 1995; Carroll &amp; Briscoe, 1996).</Paragraph>
      <Paragraph position="3"> 4. A patternset extractor which extracts subcategorization patterns, including the syntactic categories and head lemmas of constituents, from sentence subanalyses which begin/end at the boundaries of (specified) predicates.</Paragraph>
      <Paragraph position="4"> 5. A pattern classifier which assigns patterns in patternsets to subcategorization classes or rejects patterns as unclassifiable on the basis of the feature values of syntactic categories and the head lemmas in each pattern.</Paragraph>
      <Paragraph position="5"> 6. A patternsets evaluator which evaluates sets of patternsets gathered for a (single) predicate, constructing putative subcategorization entries and filtering the latter on the basis of their reliability and likelihood.</Paragraph>
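The six-stage pipeline above can be sketched as follows. This is a minimal, hypothetical illustration in Python: every function here is a placeholder standing in for the corresponding component, not the system's actual code.

```python
# Hypothetical sketch of the six-stage architecture; all stage functions
# are illustrative placeholders, not the system's real components.

def tag(sentence):          # 1. HMM PoS/punctuation tagger (placeholder)
    return [(w, "NN") for w in sentence.split()]

def lemmatize(tagged):      # 2. word-tag -> lemma-tag pairs (placeholder)
    return [(w.lower(), t) for w, t in tagged]

def parse(lemmas):          # 3. ranked analyses from the parser (placeholder)
    return [lemmas]

def extract_patterns(analyses, predicate):  # 4. patternset extractor
    return [a for a in analyses if any(w == predicate for w, _ in a)]

def classify(pattern):      # 5. pattern classifier: a class, or None = rejected
    return "NP" if pattern else None

def evaluate(patternsets):  # 6. evaluator: count observations of each class
    counts = {}
    for ps in patternsets:
        for cls in ps:
            if cls is not None:
                counts[cls] = counts.get(cls, 0) + 1
    return counts

def acquire(sentences, predicate):
    """Apply components 1-5 per sentence, then evaluate the patternsets."""
    patternsets = []
    for s in sentences:
        analyses = parse(lemmatize(tag(s)))
        patternsets.append([classify(p) for p in extract_patterns(analyses, predicate)])
    return evaluate(patternsets)
```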
      <Paragraph position="6"> For example, building entries for attribute, and given that one of the sentences in our data was (1a), the tagger and lemmatizer return (1b).</Paragraph>
      <Paragraph position="7">  (1) a He attributed his failure, he said, to no one buying his books.</Paragraph>
      <Paragraph position="9"> (1b) is parsed successfully by the probabilistic LR parser, and the ranked analyses are returned. Then the patternset extractor locates the subanalyses containing attribute and constructs a patternset. The highest ranked analysis and pattern for this example are shown in Figure 1. Patterns encode the value of the VSUBCAT feature from the VP rule and the head lemma(s) of each argument. In the case of PP (P2) arguments, the pattern also encodes the value of PSUBCAT from the PP rule and the head lemma(s) of its complement(s). In the next stage of processing, patterns are classified, in this case giving the subcategorization class corresponding to transitive plus PP with non-finite clausal complement.</Paragraph>
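A pattern of the kind just described might be represented as follows. This is an illustrative encoding only: the field names and the VSUBCAT/PSUBCAT values shown are assumptions, not the system's actual feature inventory.

```python
from dataclasses import dataclass

@dataclass
class PPArg:
    psubcat: str      # PSUBCAT value from the PP rule
    prep: str         # preposition head lemma
    comp_heads: tuple # head lemma(s) of the PP's complement(s)

@dataclass
class Pattern:
    predicate: str
    vsubcat: str          # VSUBCAT value from the VP rule
    arg_heads: tuple = () # head lemma(s) of non-PP arguments
    pp_args: tuple = ()   # PP arguments carry their own structure

# Roughly what the highest ranked pattern for example (1) might look like;
# the value names ("NP_PP", "SING") are invented for illustration.
p = Pattern(predicate="attribute",
            vsubcat="NP_PP",          # transitive plus PP
            arg_heads=("failure",),
            pp_args=(PPArg("SING", "to", ("buy",)),))
```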
      <Paragraph position="10"> The system could be applied to corpus data by first sorting sentences into groups containing instances of a specified predicate, but we use a different strategy since it is more efficient to tag, lemmatize and parse a corpus just once, extracting patternsets for all predicates in each sentence; then to classify the patterns in all patternsets; and finally, to sort and recombine patternsets into sets of patternsets, one set for each distinct predicate containing patternsets of just the patterns relevant to that predicate. The tagger, lemmatizer, grammar and parser have been described elsewhere (see previous references), so we provide only brief relevant details here, concentrating on the description of the components of the system that are new: the extractor, classifier and evaluator.</Paragraph>
      <Paragraph position="12"> (Footnote 2) The analysis shows only category aliases rather than sets of feature-value pairs. Ta represents a text adjunct delimited by commas (Nunberg, 1990; Briscoe &amp; Carroll, 1994). Tokens in the patternset are indexed by sequential position in the sentence so that two or more tokens of the same type can be kept distinct in patterns.</Paragraph>
      <Paragraph position="13"> The grammar consists of 455 phrase structure rule schemata in the format accepted by the parser (a syntactic variant of a Definite Clause Grammar with iterative (Kleene) operators). It is 'shallow' in that no attempt is made to fully analyse unbounded dependencies. However, the distinction between arguments and adjuncts is expressed, following X-bar theory (e.g. Jackendoff, 1977), by Chomsky-adjunction of adjuncts to maximal projections (XP -> XP Adjunct) as opposed to 'government' of arguments (i.e. arguments are sisters within</Paragraph>
      <Paragraph position="15"> Furthermore, all analyses are rooted (in S) so the grammar assigns global, shallow and often 'spurious' analyses to many sentences. There are 29 distinct values for VSUBCAT and 10 for PSUBCAT; these are analysed in patterns along with specific closed-class head lemmas of arguments, such as it (dummy subjects), whether (wh-complements), and so forth, to classify patterns as evidence for one of the 160 subcategorization classes. Each of these classes can be parameterized for specific predicates by, for example, different prepositions or particles. Currently, the coverage of this grammar--the proportion of sentences for which at least one analysis is found--is 79% when applied to the Susanne corpus (Sampson, 1995), a 138K word treebanked and balanced subset of the Brown corpus. Wide coverage is important since information is acquired only from successful parses. The combined throughput of the parsing components on a Sun UltraSparc 1/140 is around 50 words per CPU second.</Paragraph>
    </Section>
    <Section position="2" start_page="357" end_page="358" type="sub_section">
      <SectionTitle>
2.2 The Extractor, Classifier and Evaluator
</SectionTitle>
      <Paragraph position="0"> The extractor takes as input the ranked analyses from the probabilistic parser. It locates the subanalyses around the predicate, finding the constituents identified as complements inside each subanalysis, and the subject clause preceding it. Instances of passive constructions are recognized and treated specially. The extractor returns the predicate, the VSUBCAT value, and just the heads of the complements (except in the case of PPs, where it returns the PSUBCAT value, the preposition head, and the heads of the PP's complements).</Paragraph>
      <Paragraph position="1"> The subcategorization classes recognized by the classifier were obtained by manually merging the classes exemplified in the COMLEX Syntax and ANLT dictionaries and adding around 30 classes found by manual inspection of unclassifiable patterns for corpus examples during development of the system. These consisted of some extra patterns for phrasal verbs with complex complementation and with flexible ordering of the preposition/particle, some for non-passivizable patterns with a surface direct object, and some for rarer combinations of governed preposition and complementizer. The classifier filters out as unclassifiable around 15% of patterns found by the extractor when run on all the patternsets extracted from the Susanne corpus. This demonstrates the value of the classifier as a filter of spurious analyses, as well as providing both a translation between extracted patterns and two existing subcategorization dictionaries and a definition of the target subcategorization dictionary.</Paragraph>
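The classifier's mapping from feature values to subcategorization classes can be sketched as a lookup with rejection. The table entries below are invented examples for illustration, not part of the real inventory of 160 classes.

```python
# Hypothetical classifier sketch: match a pattern's VSUBCAT value and its
# governed prepositions against a hand-built table; anything unmatched is
# rejected as unclassifiable (returned as None).

CLASS_TABLE = {
    ("NP", ()): "transitive",
    ("NP_PP", ("to",)): "transitive+PP",
    ("SCOMP", ("whether",)): "wh-complement",
}

def classify(vsubcat, preps):
    """Return a class name, or None when the pattern is unclassifiable."""
    return CLASS_TABLE.get((vsubcat, tuple(preps)))
```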
      <Paragraph position="2"> The evaluator builds entries by taking the patterns for a given predicate built from successful parses and recording the number of observations of each subcategorization class. Patterns provide several types of information which can be used to rank or select between patterns in the patternset for a given sentence exemplifying an instance of a predicate, such as the ranking of the parse from which a pattern was extracted or the proportion of subanalyses supporting a specific pattern. Currently, we simply select the pattern supported by the highest ranked parse. However, we are experimenting with alternative approaches. The resulting set of putative classes for a predicate is filtered, following Brent (1993), by hypothesis testing on binomial frequency data.</Paragraph>
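The current selection strategy (take only the pattern supported by the highest ranked parse, then count observations of each class) can be sketched as follows; the data layout assumed here is illustrative.

```python
from collections import Counter

def count_classes(patternsets):
    """Count class observations for one predicate.

    patternsets: one list per sentence of (parse_rank, class) pairs,
    where rank 1 is the highest ranked parse (an assumed layout).
    """
    counts = Counter()
    for ps in patternsets:
        if not ps:
            continue
        _, cls = min(ps, key=lambda rc: rc[0])  # pattern from the best parse
        counts[cls] += 1
    return counts
```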
      <Paragraph position="3"> Evaluating putative entries on binomial frequency data requires that we record the total number of patternsets n for a given predicate, and the number of these patternsets m containing a pattern supporting an entry for a given class. These figures are straightforwardly computed from the output of the classifier; however, we also require an estimate of the probability that a pattern for class i will occur with a verb which is not a member of subcategorization class i. Brent proposes estimating these probabilities experimentally on the basis of the behaviour of the extractor. We estimate this probability more directly by first extracting the number of verbs which are members of each class in the ANLT dictionary (with intuitive estimates for the membership of the novel classes) and converting this to a probability of class membership by dividing by the total number of verbs in the dictionary; and secondly, by multiplying the complement of these probabilities by the probability of a pattern for class i, defined as the number of patterns for i extracted from the Susanne corpus divided by the total number of patterns. So p(v -i), the probability that a verb v which is not of class i occurs with a pattern for class i, is:</Paragraph>
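The two-step estimate just described can be written out directly as (1 - P(membership of class i)) multiplied by P(a pattern is for class i):

```python
def error_prob(members_i, total_verbs, patterns_i, total_patterns):
    """Estimate p(v -i): the probability that a verb not of class i
    nevertheless occurs with a pattern for class i."""
    p_member = members_i / total_verbs       # class-membership probability (ANLT)
    p_pattern = patterns_i / total_patterns  # probability of a pattern for i (Susanne)
    return (1 - p_member) * p_pattern
```

The counts fed in would come from the ANLT dictionary and the Susanne patternsets respectively; the figures in any worked example are illustrative.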
      <Paragraph position="5"> The binomial distribution gives the probability of an event with probability p happening exactly m times out of n attempts: P(m, n, p) = (n! / (m! (n - m)!)) p^m (1 - p)^(n - m). The probability of the event happening m or more times is:</Paragraph>
      <Paragraph position="7"> Thus P(m, n, p(v -i)) is the probability that m or more occurrences of patterns for i will occur with a verb which is not a member of i, given n occurrences of that verb. Setting a threshold of less than or equal to 0.05 yields a 95% or better confidence that a high enough proportion of patterns for i has been observed for the verb to be in class i.</Paragraph>
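Under these definitions, the binomial tail probability and the 0.05 threshold test can be computed by exact summation, as in the following sketch (`math.comb` requires Python 3.8+):

```python
from math import comb

def binom_tail(m, n, p):
    """P(X >= m) for X ~ Binomial(n, p): m or more successes in n trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def accept(m, n, p, threshold=0.05):
    """Accept class i for a verb when seeing m or more patterns for i in n
    patternsets is sufficiently unlikely under the error probability p."""
    return threshold >= binom_tail(m, n, p)
```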
    </Section>
    <Section position="3" start_page="358" end_page="358" type="sub_section">
      <SectionTitle>
2.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Our approach to acquiring subcategorization classes is predicated on the following assumptions: * the system will extract patterns for certain classes more often than others; and * even a highest ranked pattern for i is only a probabilistic cue for membership of i, so membership should only be inferred if there are enough occurrences of patterns for i in the data to outweigh the error probability for i.</Paragraph>
      <Paragraph position="1"> This simple automated, hybrid linguistic/statistical approach contrasts with the manual linguistic analysis of the COMLEX Syntax lexicographers (Meyers et al., 1994), who propose five criteria and five heuristics for argument-hood and six criteria and two heuristics for adjunct-hood, culled mostly from the linguistics literature. Many of these are not exploitable automatically because they rest on semantic judgements which cannot (yet) be made automatically: for example, optional arguments are often 'understood' or implied if missing. Others are syntactic tests involving diathesis alternation possibilities (e.g. passive, dative movement, Levin (1993)) which require recognition that the 'same' argument, defined usually by semantic class / thematic role, is occurring across argument positions. We hope to exploit this information where possible at a later stage in the development of our approach. However, recognizing same/similar arguments requires considerable quantities of lexical data or the ability to back-off to lexical semantic classes. At the moment, we exploit linguistic information about the syntactic type, obligatoriness and position of arguments, as well as the set of possible subcategorization classes, and combine this with statistical inference based on the probability of class membership and the frequency and reliability of patterns for classes.</Paragraph>
    </Section>
  </Section>
</Paper>