<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1312">
  <Title>United Kingdom</Title>
  <Section position="4" start_page="82" end_page="85" type="metho">
    <SectionTitle>
* Lexical reiteration
</SectionTitle>
    <Paragraph position="0"> Lexically reiterated items are likely candidates for antecedent (scores 2 if the NP is repeated within the same paragraph twice or more, 1 if repeated once and 0 if not). Lexically reiterated items include repeated synonymous noun phrases which may often be preceded by definite articles or demonstratives. null * Section heading preference If a noun phrase occurs in the heading of the section, part of which is the current sentence, then consider it as the preferred candidate for the antecedent (1, 0).</Paragraph>
    <Paragraph position="1"> * Collocation pattern preference This preference is given to candidates which have an identical collocation pattern with a pronoun. The collocation implemented here is restricted to the pattern &amp;quot;noun/pronoun, verb&amp;quot; or &amp;quot;verb, noun/pronoun&amp;quot; (2, 0). (owing to lack of syntactic information, this preference is somewhat weaker than the collocation preference described in \[Dagan &amp; Itai 90\] and suggested subsequently in our procedure for semi-automatic annotation) Press the key i down and turn the volume up... Press it i again.</Paragraph>
    <Paragraph position="2"> * Referential distance In complex sentences, noun phrases in the previous clause I are the best candidate for the antecedent of an anaphor in the subsequent clause, followed by noun phrases in the previous sentence, then by nouns situated 2 sentences further back and finally nouns 3 sentences further back (2, I, 0, -1). For anaphors in simple sentences, noun phrases in the previous sentence are the best candidate for antecedent, followed by noun phrases situated 2 sentences further back and finally nouns 3 sentences further back</Paragraph>
    <Paragraph position="4"> A &amp;quot;pure&amp;quot;, &amp;quot;non-prepositional&amp;quot; noun phrase is given a higher preference than a noun phrase which is part of a prepositional phrase (0, - 1) Insert the cassette i into the VCR making sure it i is suitable for the length of recording.</Paragraph>
    <Paragraph position="5">  Each of the antecedent indicators is assigned a score with a value ~ {-1, 0, 1, 2 }. These scores have been determined experimentally on an empirical basis and are constantly being updated. Top symptoms like &amp;quot;lexical reiteration&amp;quot; assign score &amp;quot;2&amp;quot; whereas non-candidates are given a negative score of &amp;quot;-1&amp;quot;. We should point out that the antecedent indicators are preferences and not absolute factors. There are cases where an antecedent indicator does not &amp;quot;point&amp;quot; to the correct antecedent. For instance, in the sentence &amp;quot;Insert the cassette into the VCR i making sure it i is turned on&amp;quot;, the indicator &amp;quot;non-prepositional noun phrases&amp;quot; would give a &amp;quot;wrong&amp;quot; contribution. Within the framework of all preferences (antecedent indicators), however, the right antecedent is still very likely to be tracked down - in the above example, the &amp;quot;non-prepositional noun phrases&amp;quot; heuristics would be overturned by the &amp;quot;collocational preference&amp;quot; one.</Paragraph>
    <Section position="1" start_page="83" end_page="83" type="sub_section">
      <SectionTitle>
2.2 Informal description of the algorithm
</SectionTitle>
      <Paragraph position="0"> The algorithm for pronoun disambiguation can be described informally as follows:  1. Examine the current sentence and the two preceding sentences (if available). Look for noun phrases 2 only to the left Of the anaphor 3 2. Select from the noun phrases identified only those which agree in gender and number 4 with the pronominal anaphor and group them as a set of potential candidates 3. Apply the antecedent indicators to each po null tential candidate and assign scores; the candidate with the highest score is proposed as antecedent.</Paragraph>
      <Paragraph position="1"> For an illustration as to how the approach operates see (\[Mitkov 97\]).</Paragraph>
    </Section>
    <Section position="2" start_page="83" end_page="83" type="sub_section">
      <SectionTitle>
2.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> 2A sentence splitter would already have segmented the text into sentences, a POS tagger would already have determined the parts of speech and a simple phrasal grammar would already have detected the noun phrases 31n this project we do not t~eat cataphora; non-anaphofic &amp;quot;it&amp;quot; occumng in constructions such as &amp;quot;It is important&amp;quot;, &amp;quot;It is necessary&amp;quot; is eliminated by a &amp;quot;referential filter&amp;quot; 4Note that this restriction may not always apply in languages other than English (e.g. German); on the other hand there are certain collective nouns in English which do not agree in number with their antecedents (e.g. &amp;quot;government&amp;quot;, &amp;quot;team&amp;quot;, &amp;quot;parliament&amp;quot; etc. can be referred to &amp;quot;they&amp;quot;; equally some plural nouns (e.g. &amp;quot;data&amp;quot;) can be referred to by &amp;quot;it&amp;quot;) and are exempted from the agreement test For practical reasons, the approach presented does not incorporate syntactic and semantic information (other than a list of domain terms) and it is not realistic to expect its performance to be as good as an approach which makes use of syntactic and semantic knowledge in terms of constraints and preferences. The lack of syntactic information, for instance, means giving up subject preference (or on other occasions object preference, see \[Mitkov 94a\]) which could be used in center tracking. Syntactic parallelism, useful in discriminating between identical pronouns on the basis of their syntactic function, also has to be forgone. Lack of semantic knowledge rules out the use of verb semantics and semantic parallelism. The preliminary evaluation, however, shows that less is lost than might be feared.</Paragraph>
      <Paragraph position="1"> Several documents (user's guides), with an overall length of 40 000 words, served as an initial evaluation corpus. The average success rate was 86%. While the test corpus contained the pronouns &amp;quot;he&amp;quot;, &amp;quot;she&amp;quot; and &amp;quot;they&amp;quot;, most pronouns were &amp;quot;it&amp;quot; (about 92%). The approach had a very high success rate with sentences which contained one pronoun only (above 90%) but failed in a few paragraphs which contained an abundance of &amp;quot;it&amp;quot;s, 2 (or more) in a sentence, with &amp;quot;it&amp;quot; referring in turn to different antecedents. In these examples, however, a frequent shift of center was observed and, in our view, they were not written in a natural style.</Paragraph>
      <Paragraph position="2"> A recent test with Computer Science textbook inputs showed a preliminary accuracy rate of  above 80%.</Paragraph>
      <Paragraph position="3"> 3. Other knowledge-poor approaches  The approaches proposed by Nasukawa (\[Nasukawa 94\]), Dagan &amp; Itai (\[Dagan &amp; Itai 90\]) and Kennedy &amp; Boguraev (\[Kennedy &amp; Boguraev 96\]) address anaphor resolution in a &amp;quot;knowledge-poor&amp;quot; way: the first approach takes into consideration heuristic preferences, the second uses frequency of collocational patterns and the third operates without a parser resorting to salience factors.</Paragraph>
    </Section>
    <Section position="3" start_page="83" end_page="84" type="sub_section">
      <SectionTitle>
3.1 Nasukawa's knowledge-independent approach
</SectionTitle>
      <Paragraph position="0"> approach T. Nasukawa (\[Nasukawa 94\]) describes a simple approach which uses intersentential information extracted from a source text in order to improve the accuracy of pronoun resolution. He suggests that collocation patterns (modifiermodifee relationships) can be used to determine whether a candidate for antecedent can modify  the modifee of a pronoun. Nasukawa also finds (similarly to (\[Mitkov 93\])) that the frequency of preceding noun phrases with the same lemma as the candidate noun phrase may be an indication for preference. Moreover, he suggests a heuristic rule favouring subjects over objects (compare \[Mitkov 93\] where this preference is sublanguage-based). null Each of the collocational, frequency or syntactic preferences gives its &amp;quot;preference value&amp;quot;; these values are eventually summed up. The candidate with the highest value is picked up as the antecedent.</Paragraph>
      <Paragraph position="1"> As an evaluation corpus Nasukawa uses 1904 consecutive sentences (containing altogether 112 third-person pronouns) from eight chapters of two different computer manuals. His algorithm handles the pronoun &amp;quot;it&amp;quot; and has been reported to select a correct antecedent in 93.8% of cases.</Paragraph>
    </Section>
    <Section position="4" start_page="84" end_page="84" type="sub_section">
      <SectionTitle>
3.2 Dagan &amp; Itai's corpus-based approach
</SectionTitle>
      <Paragraph position="0"> I. Dagan and A. Itai (\[Dagan &amp; Itai 90\]) report on a statistical approach for disambiguating pronouns; this is an alternative solution to the expensive implementation of full-scale selectional constraints knowledge. They perform an experiment to resolve references of the pronoun &amp;quot;it&amp;quot; in sentences randomly selected from the corpus. null In their statistical model, co-occurrence patterns observed in the corpus were used as selectional patterns. Candidates for antecedent were substituted for the anaphor and only those candidates appearing in frequent co-occurrence patterns were approved of.</Paragraph>
      <Paragraph position="1"> Dagan andi Itai report an accuracy rate of 87% for the sentences with genuine &amp;quot;it&amp;quot; anaphors (sentences in which &amp;quot;it&amp;quot; is not an anaphor have been manually eliminated). It should be pointed out that the success of this experiment depends on the parser used (in this case K. Jennsen's PEG parser).</Paragraph>
    </Section>
    <Section position="5" start_page="84" end_page="85" type="sub_section">
      <SectionTitle>
3.3 Kennedy &amp; Boguraev's approach without a parser
</SectionTitle>
      <Paragraph position="0"> a parser In a recent paper, Kennedy and Boguraev (\[Kennedy &amp; Boguraev\]) describe an anaphor resolution approach which is a modified and extended version of that developed by Lappin and Leass (\[Lappin &amp; Leass 94\]). Their system does not require &amp;quot;in-depth, full&amp;quot; syntactic parsing but works from the output of a part of speech tagger, enhanced only by annotations of grammatical function of lexical items in the input text stream.</Paragraph>
      <Paragraph position="1"> The basic logic of their algorithm parallels that of Lappin and Leass's algorithm. The determination of disjoint reference, however, represents a significant point of divergence the two. Lappin and Leass's relies on syntactic configurational information, whereas Kennedy and Boguraev's, in the absence of such information, relies on inferences from grammatical function and precedence.</Paragraph>
      <Paragraph position="2"> After the morphological and syntactic filters have been applied, the set of discourse referents that remain is subjected to a final salience evaluation. The candidate with highest salience weighting is determined to be the actual antecedent; in the event of a tie, the closest candidate is chosen. The approach works for both lexical anaphors (reflexives and reciprocals) and pronouns. null Evaluation reports 75% accuracy but it should be pointed out that the results were obtained from a wide range of texts/genres: the evaluation was based on a random selection of genres, including press releases, product announcement, news stories, magazine articles, and other World Wide Web documents.</Paragraph>
      <Paragraph position="3"> 4. Limitations and advantages of the practical approach We must admit that the practical approach has been tested mainly on a specific genre: computer and hi-fi manuals. It also appears that some of the rules are more genre-specific than others (e.g. &amp;quot;verb preference&amp;quot; and &amp;quot;noun preference&amp;quot;). Therefore, we cannot claim that an equally high level of accuracy would be guaranteed in other genres.</Paragraph>
      <Paragraph position="4"> In addition, even though our preliminary resuits seem to be better than Kennedy and Boguraev's (75%), there is no ground for any real comparison since (i) our evaluation tests are not extensive enough and are of a preliminary nature and (ii) their evaluation is based on a random selection of genres, whereas our method has been applied to a single text genre.</Paragraph>
      <Paragraph position="5"> The practical approach presented has been developed recently and is subject to further research and improvements. In particular, we plan to enhance the accuracy of the initial score of each symptom by collecting more empirical evidence and to integrate all the antecedent indicators into a uniform and comprehensive probabilistic model.</Paragraph>
      <Paragraph position="6"> On the other hand, the main advantage of the practical approach lies in its independence of syntactic, semantic, domain and real-world knowledge, which makes it not only cheaper to implement but also appropriate for applications  in corpora. Thus we see the pronoun resolution approach as one of the components of a more general methodology aiming to offer a way forward in the automatic annotation of anaphoric links in corpora.</Paragraph>
      <Paragraph position="7"> 5. Proposed methodology for semi-automatic annotation Further to our comments in section 4, we would like to propose the development of a semi-automatic procedure for annotating pronominal anaphora in corpora. Such a procedure would speed up the manual marking of pronoun-antecedent pairs. The semi-automatic annotation editor would be practically based on our pronoun resolution approach made more &amp;quot;robust&amp;quot; by a &amp;quot;super&amp;quot; POS tagger and by corpus-based collocation patterns. The process of annotation will consist of the following stages: a) sentence splitting The first stage will be to segment the input into sentences by identifying their boundaries.</Paragraph>
      <Paragraph position="8"> b) &amp;quot;super&amp;quot;part-of-speech tagging We plan to use the so-called super part of speech taggers which (i) determine automatically lexical categories, (ii) provide further lexical information (e.g. gender, number) and (iii) identify the syntactic function of each part-of-speech unit (e.g. subject, object etc.) (\[Voutilainen et al. 1992\], \[Karlsson et al. 95\]).</Paragraph>
      <Paragraph position="9"> c) gender and number agreement Once the noun phrases (see footnote 2) in a sentence have been identified, agreement constraints for filtering NP candidates in current and preceding sentences will be activated. Certain (e.g. collective) nouns 5 will not be subject to such constraints (see footnote 4).</Paragraph>
      <Paragraph position="10"> d) corpus-based collocation patterns Possible antecedents will be substituted for the anaphors and the frequency of the new constructions will be calculated in corpora. Higher weightings will beassigned to NPs which occur more frequently in the same syntactic function as the anaphor (e.g. in combination with a certain verb or subject/object).</Paragraph>
      <Paragraph position="11"> e) antecedent indicators The antecedent indicators (as described above) will be used for the final weighting of the candidates and for proposing the antecedent. The candidate with the highest overall score after stages d) and e) will be picked up as the most likely antecedent.</Paragraph>
      <Paragraph position="12"> We are aware of the fact that this robust pronoun resolver is unlikely to produce 100% accuracy.</Paragraph>
      <Paragraph position="13"> Therefore, we envisage the development of a post-editing environment. Anaphors and allocated antecedents will be highlighted with the user accepting or correcting them.</Paragraph>
      <Paragraph position="14"> If future comprehensive evaluation suggests that the success of the new approach is restricted to certain genres only, it would be worthwhile to consider using other knowledge-poor approaches (e.g. \[Kennedy &amp; Boguraev 96\]), which have proved their efficiency, within the framework suggested.</Paragraph>
      <Paragraph position="15"> Last but not least, another promising option would be to enhance the framework proposed with a robust parser or even better to select an existing robust platform for pre-processing the input morphologically and syntactically (one such platform which we are currently looking at, is GATE (\[Cunningham et al. 96\]).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>