<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0204">
  <Title>Improving Semi-Supervised Acquisition of Relation Extraction Patterns</Title>
  <Section position="4" start_page="29" end_page="30" type="intro">
    <SectionTitle>
2 Semi-Supervised Learning of
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
Extraction Patterns
</SectionTitle>
      <Paragraph position="0"> We begin by outlining the general process of learning extraction patterns using a semi-supervised algorithm, similar to one presented by Yangarber (2003).</Paragraph>
      <Paragraph position="1"> 1. For a given IE scenario we assume the existence of a set of documents against which the system can be trained. The documents are unannotated and may be either relevant (contain the description of an event relevant to the scenario) or irrelevant.</Paragraph>
      <Paragraph position="2"> 2. This corpus is pre-processed to generate the set of all patterns which could be used to represent sentences contained in the corpus; call this set P. The aim of the learning process is to identify the subset of P representing patterns which are relevant to the IE scenario.</Paragraph>
      <Paragraph position="3"> 3. The user provides a small set of seed patterns, Pseed, which are relevant to the scenario. These patterns are used to form the set of currently accepted patterns, Pacc, so Pacc ← Pseed. The remaining patterns are treated as candidates for inclusion in the accepted set; these form the set Pcand (= P − Pacc).</Paragraph>
      <Paragraph position="4"> 4. A function, f, is used to assign a score to each pattern in Pcand based on those which are currently in Pacc. This function assigns a real number to each candidate pattern, so ∀c ∈ Pcand, f(c, Pacc) ↦ ℝ. A set of high-scoring patterns (chosen by absolute score, or by rank once the patterns have been ordered by score) is selected as suitable for inclusion in the set of accepted patterns.</Paragraph>
      <Paragraph position="5"> These form the set Plearn.</Paragraph>
      <Paragraph position="6">  5. (Optional) The patterns in Plearn may be reviewed by a user who may remove any they do not believe to be useful for the scenario.</Paragraph>
      <Paragraph position="7"> 6. The patterns in Plearn are added to Pacc and removed from Pcand, so Pacc ← Pacc ∪ Plearn and Pcand ← Pcand − Plearn. 7. Stop if an acceptable set of patterns has been learned; otherwise go to step 4. Previous algorithms which use this approach include those described by Yangarber et al. (2000) and Stevenson and Greenwood (2005). A key choice in the development of an algorithm using this approach is the process of ranking candidate patterns (step 4), since this determines the patterns which will be learned at each iteration. Yangarber et al. (2000) chose an approach motivated by the assumption that documents containing a large number of patterns already identified as relevant to a particular IE scenario are likely to contain further relevant patterns. This approach operates by associating confidence scores with patterns and relevance scores with documents. Initially, seed patterns are given the maximum confidence score of 1 and all others a score of 0. Each document is given a relevance score based on the patterns which occur within it. Candidate patterns are ranked according to the proportion of relevant and irrelevant documents in which they occur; those found in relevant documents far more often than in irrelevant ones are ranked highly. After new patterns have been accepted, all patterns' confidence scores are updated based on the documents in which they occur, and documents' relevance scores are updated according to the accepted patterns they contain.</Paragraph>
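      <Paragraph> The iterative procedure in steps 3 to 7 can be sketched as a short loop with a pluggable scoring function. This is a minimal illustration under stated assumptions, not the implementation used by any of the cited systems: patterns are reduced to plain strings, the optional user review (step 5) is omitted, the stopping rule (stop when no candidate exceeds a threshold) is one possible reading of "an acceptable set of patterns", and the names bootstrap, toy_score, top_k and threshold are hypothetical.
<![CDATA[
```python
from typing import Callable, FrozenSet, Set

Pattern = str  # assumption: a real system would use structured patterns


def bootstrap(
    all_patterns: Set[Pattern],
    seeds: Set[Pattern],
    score: Callable[[Pattern, FrozenSet[Pattern]], float],
    top_k: int = 2,
    threshold: float = 0.0,
    max_iterations: int = 10,
) -> Set[Pattern]:
    """Generic semi-supervised pattern learning loop (steps 3 to 7)."""
    accepted = set(seeds)                      # step 3: Pacc starts as Pseed
    candidates = set(all_patterns) - accepted  # step 3: Pcand = P minus Pacc
    for _ in range(max_iterations):
        # Step 4: score every candidate against the accepted set, then
        # keep the top_k candidates whose score exceeds the threshold.
        frozen = frozenset(accepted)
        ranked = sorted(((score(c, frozen), c) for c in candidates), reverse=True)
        learned = {c for s, c in ranked[:top_k] if s > threshold}
        # Step 7 (assumed stopping rule): nothing scored highly enough.
        if not learned:
            break
        # Step 6: move the learned patterns from Pcand to Pacc.
        accepted |= learned
        candidates -= learned
    return accepted


def toy_score(candidate: Pattern, accepted: FrozenSet[Pattern]) -> float:
    """Hypothetical scorer: best word overlap (Jaccard) with any accepted pattern."""
    words = set(candidate.split())
    best = 0.0
    for other in accepted:
        other_words = set(other.split())
        best = max(best, len(words & other_words) / len(words | other_words))
    return best


corpus_patterns = {
    "company hire person",
    "company appoint person",
    "company fire person",
    "person resign",
    "weather be sunny",
}
result = bootstrap(corpus_patterns, {"company hire person"}, toy_score, threshold=0.3)
print(sorted(result))
```
]]>
 Swapping toy_score for a document-centric or similarity-based function changes which patterns are learned at each iteration while leaving the loop itself untouched, which is why the choice of ranking function in step 4 matters so much.</Paragraph>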
      <Paragraph position="8"> Stevenson and Greenwood (2005) suggested an alternative method for ranking the candidate patterns. Their approach relied on the assumption that useful patterns will have similar meanings to the patterns which have already been accepted.</Paragraph>
      <Paragraph position="9"> They chose to represent each pattern as a vector consisting of the lexical items which formed the pattern and used a version of the cosine metric to determine the similarity between pairs of patterns; consequently, this approach is referred to as &quot;cosine similarity&quot;. The metric used by this approach incorporated information from WordNet and assigned high similarity scores to patterns with similar meanings expressed in different ways.</Paragraph>
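      <Paragraph> To illustrate how lexical similarity can be folded into a cosine-style measure, the sketch below scores two pattern vectors a and b as (a W b) / (|a| |b|), where W holds a similarity value for every pair of lexical items. This is an assumed formulation for illustration only: the vocabulary, the function name soft_cosine and the value 0.8 for the pair hire/appoint are hypothetical and hand-picked, not derived from WordNet.
<![CDATA[
```python
import math
from typing import List, Sequence


def soft_cosine(a: Sequence[float], b: Sequence[float],
                w: Sequence[Sequence[float]]) -> float:
    """Cosine-style similarity a.W.b / (|a| |b|) for equal-length vectors."""
    n = len(a)
    numerator = sum(a[i] * w[i][j] * b[j] for i in range(n) for j in range(n))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Hypothetical vocabulary; each pattern is a binary vector over it.
vocabulary = ["company", "hire", "appoint", "person"]
hire_pattern = [1, 1, 0, 1]     # company + hire + person
appoint_pattern = [1, 0, 1, 1]  # company + appoint + person

size = len(vocabulary)
# With the identity matrix the measure is the ordinary cosine.
identity: List[List[float]] = [
    [1.0 if i == j else 0.0 for j in range(size)] for i in range(size)
]
# Assumed lexical similarity: "hire" and "appoint" are treated as close.
lexical = [row[:] for row in identity]
lexical[1][2] = lexical[2][1] = 0.8

plain = soft_cosine(hire_pattern, appoint_pattern, identity)
informed = soft_cosine(hire_pattern, appoint_pattern, lexical)
print(round(plain, 3), round(informed, 3))
```
]]>
 With the identity matrix the two patterns share only two of their three items, so the score is 2/3. Once the matrix records that the two verbs are similar, the score rises to roughly 0.93, which is the behaviour described above: patterns with similar meanings expressed in different words receive high similarity.</Paragraph>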
    </Section>
  </Section>
</Paper>