File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/c94-2119_intro.xml

Size: 4,037 bytes

Last Modified: 2025-10-06 14:05:41

<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2119">
  <Title>Generalizing Automatically Generated Selectional Patterns</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Selectional constraints specify what combinations of words are acceptable or meaningful in particular syntactic relations, such as subject-verb-object or head-modifier relations. Such constraints are necessary for the accurate analysis of natural language text. Accordingly, the acquisition of these constraints is an essential yet time-consuming part of porting a natural language system to a new domain. Several research groups have attempted to automate this process by collecting co-occurrence patterns (e.g., subject-verb-object patterns) from a large training corpus. These patterns are then used as the source of selectional constraints in analyzing new text.</Paragraph>
    <Paragraph position="1"> The initial successes of this approach raise the question of how large a training corpus is required. Any answer to this question must of course be relative to the degree of coverage required; the set of selectional patterns will never be 100% complete, so a larger corpus will always provide greater coverage. We attempt to shed some light on this question by processing a large corpus of text from a broad domain (business news) and observing how selectional coverage increases with corpus size.</Paragraph>
    <Paragraph position="2"> In many cases, there are limits on the amount of training text available. We therefore also consider how coverage can be increased using a fixed amount of text.</Paragraph>
    <Paragraph position="3"> The most straightforward acquisition procedures build selectional patterns containing only the specific word combinations found in the training corpus. Greater coverage can be obtained by generalizing from the patterns collected so that patterns with semantically related words will also be considered acceptable. In most cases this has been done using manually-created word classes, generalizing from specific words to their classes [12,1,10]. If a pre-existing set of classes is used (as in [10]), there is a risk that the classes available may not match the needs of the task. If classes are created specifically to capture selectional constraints, there may be a substantial manual burden in moving to a new domain, since at least some of the semantic word classes will be domain-specific.</Paragraph>
    <Paragraph position="4"> We wish to avoid this manual component by automatically identifying semantically related words. This can be done using the co-occurrence data, i.e., by identifying words which occur in the same contexts (for example, verbs which occur with the same subjects and objects). From the co-occurrence data one can compute a similarity relation between words [8,7]. This similarity information can then be used in several ways.</Paragraph>
    <Paragraph position="5"> One approach is to form word clusters based on this similarity relation [8]. This approach was taken by Sekine et al. at UMIST, who then used these clusters to generalize the semantic patterns [11]. Pereira et al. [9] used a variant of this approach, &quot;soft clusters&quot;, in which words can be members of different clusters to different degrees.</Paragraph>
    <Paragraph position="6"> An alternative approach is to use the word similarity information directly, to infer information about the likelihood of a co-occurrence pattern from information about patterns involving similar words. This is the approach we have adopted for our current experiments [6], and which has also been employed by Dagan et al. [2]. We compute from the co-occurrence data a &quot;confusion matrix&quot;, which measures the interchangeability of words in particular contexts. We then use the confusion matrix directly to generalize the semantic patterns.</Paragraph>
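As an illustration only, the idea of measuring interchangeability from co-occurrence data can be sketched in Python. The formulation used here, conf[w][v] = sum over contexts c of P(c | w) * P(v | c), is an assumption made for this sketch and is not defined in this section; the function name confusion_matrix, the context labels, and the toy observations are all hypothetical.

```python
from collections import Counter, defaultdict


def confusion_matrix(observations):
    """Estimate how interchangeable words are across shared contexts.

    observations: iterable of (word, context) pairs, for example a verb
    and one of its argument slots. Returns nested dicts where
    conf[w][v] = sum over contexts c of P(c | w) * P(v | c).
    This formula is an assumption of the sketch, not the paper's definition.
    """
    pair_counts = Counter(observations)
    word_totals = Counter()
    context_totals = Counter()
    words_in_context = defaultdict(dict)
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
        words_in_context[context][word] = n

    conf = defaultdict(dict)
    for (word, context), n in pair_counts.items():
        p_context_given_word = n / word_totals[word]
        for other, m in words_in_context[context].items():
            p_other_given_context = m / context_totals[context]
            previous = conf[word].get(other, 0.0)
            conf[word][other] = previous + p_context_given_word * p_other_given_context
    return {word: dict(row) for word, row in conf.items()}


# Hypothetical toy data: verbs paired with the objects they were seen with.
observations = [
    ("buy", "object=stock"),
    ("sell", "object=stock"),
    ("buy", "object=share"),
    ("sell", "object=share"),
    ("eat", "object=lunch"),
]
conf = confusion_matrix(observations)
print(conf["buy"])
```

In this toy data, buy and sell occur with exactly the same objects, so each of their rows splits its weight evenly between the two verbs, while eat shares no context with either and receives no weight in their rows. A pattern seen only with buy could then be extended to sell in proportion to that weight.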
  </Section>
</Paper>