<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0503">
  <Title>Max-Planck-Institute for Computer Science</Title>
  <Section position="4" start_page="0" end_page="19" type="relat">
    <SectionTitle>
1.2 Related Work
</SectionTitle>
    <Paragraph position="0"> There are numerous Information Extraction (IE) approaches, which differ in various features: * Arity of the target relation: Some systems are designed to extract unary relations, i.e. sets of entities (Finn and Kushmerick, 2004; Califf and Mooney, 1997). In this paper we focus on the more general binary relations.</Paragraph>
    <Paragraph position="1"> * Type of the target relation: Some systems are restricted to learning a single relation, mostly the instanceOf-relation (Cimiano and Völker, 2005b; Buitelaar et al., 2004). In this paper, we are interested in extracting arbitrary relations (including instanceOf).</Paragraph>
    <Paragraph position="2"> Other systems are designed to discover new binary relations (Maedche and Staab, 2000).</Paragraph>
    <Paragraph position="3"> However, in our scenario, the target relation is given in advance.</Paragraph>
    <Paragraph position="4"> * Human interaction: There are systems that require human intervention during the IE process (Riloff, 1996). Our work aims at a completely automated system.</Paragraph>
    <Paragraph position="5"> * Type of corpora: There exist systems that can extract information efficiently from formatted data, such as HTML-tables or structured text (Graupmann, 2004; Freitag and Kushmerick, 2000). However, since a large part of the Web consists of natural language text, we consider in this paper only systems that also accept unstructured corpora.</Paragraph>
    <Paragraph position="6"> * Initialization: As initial input, some systems require a hand-tagged corpus (Iria, 2005; Soderland et al., 1995), other systems require text patterns (Yangarber et al., 2000) or templates (Xu and Krieger, 2003), and still others require seed tuples (Agichtein and Gravano, 2000; Ruiz-Casado et al., 2005; Mann and Yarowsky, 2005) or tables of target concepts (Cimiano and Völker, 2005a). Since hand-labeled data and manual text patterns require huge human effort, we consider only systems that use seed pairs or tables of concepts.</Paragraph>
    <Paragraph position="7"> Furthermore, there exist systems that use the whole Web as a corpus (Etzioni et al., 2004) or that validate their output by the Web (Cimiano et al., 2005). In order to study different extraction techniques in a controlled environment, however, we restrict ourselves to systems that work on a closed corpus for this paper.</Paragraph>
    <Paragraph position="8"> One school of extraction techniques concentrates on detecting the boundary of interesting entities in the text (Califf and Mooney, 1997; Finn and Kushmerick, 2004; Yangarber et al., 2002).</Paragraph>
    <Paragraph position="9"> This usually goes along with the restriction to unary target relations. Other approaches make use of the context in which an entity appears (Cimiano and Völker, 2005a; Buitelaar and Ramaka, 2005). This school is mostly restricted to the instanceOf-relation. The only group that can learn arbitrary binary relations is the group of pattern matching systems (Etzioni et al., 2004; Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002; Brin, 1999; Soderland, 1999; Xu et al., 2002; Ruiz-Casado et al., 2005; Mann and Yarowsky, 2005). Surprisingly, none of these systems uses a deep linguistic analysis of the corpus. Consequently, most of them are extremely sensitive to small variations in the patterns. For example, the simple subordinate clause in the following sentence (taken from Ravichandran and Hovy, 2002) can already prevent a surface pattern matcher from discovering a relation between &quot;London&quot; and the &quot;river Thames&quot;: &quot;London, which has one of the busiest airports in the world, lies on the banks of the river Thames.&quot;</Paragraph>
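The sensitivity of surface patterns can be illustrated with a minimal sketch. The regular expression below is a hypothetical surface pattern written for this illustration only; it is not taken from any of the cited systems, and the assumption that both entities are single capitalized words is ours.

```python
import re

# Hypothetical surface pattern: "X lies on the banks of the river Y".
# Assumption: X and Y are single capitalized words.
SURFACE_PATTERN = re.compile(
    r"\b([A-Z]\w+) lies on the banks of the river ([A-Z]\w+)"
)

def extract(sentence):
    """Return the (X, Y) pair if the surface pattern matches, else None."""
    match = SURFACE_PATTERN.search(sentence)
    return match.groups() if match else None

plain = "London lies on the banks of the river Thames."
with_clause = ("London, which has one of the busiest airports in the world, "
               "lies on the banks of the river Thames.")

print(extract(plain))        # the pattern fires on the plain sentence
print(extract(with_clause))  # the subordinate clause separates "London" from "lies"
```

The inserted clause puts other words between the subject and the verb, so a pattern anchored on adjacent surface tokens no longer applies.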
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
1.3 Contribution
</SectionTitle>
      <Paragraph position="0"> This paper presents LEILA (Learning to Extract Information by Linguistic Analysis), a system that can extract instances of an arbitrary given binary relation from natural language Web documents without human intervention. LEILA uses a deep analysis for natural-language sentences as well as other advanced NLP methods like anaphora resolution, and combines them with machine learning techniques for robust and high-yield information extraction. Our experimental studies on a variety of corpora demonstrate that LEILA achieves very good results in terms of precision and recall and outperforms the prior state-of-the-art methods.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
1.4 Link Grammars
</SectionTitle>
      <Paragraph position="0"> There exist different approaches for parsing natural language sentences. They range from simple part-of-speech tagging to context-free grammars and more advanced techniques such as Lexical Functional Grammars, Head-Driven Phrase Structure Grammars or stochastic approaches. For our implementation, we chose the Link Grammar Parser (Sleator and Temperley, 1993). It is based on a context-free grammar and hence it is simpler to handle than the advanced parsing techniques.</Paragraph>
      <Paragraph position="1"> At the same time, it provides a much deeper semantic structure than the standard context-free parsers. Figure 1 shows a simplified example of a linguistic structure produced by the link parser (a linkage).</Paragraph>
      <Paragraph position="2"> A linkage is a connected planar undirected graph, the nodes of which are the words of the sentence. The edges are called links. They are labeled with connectors. For example, the connector subj in Figure 1 marks the link between the subject and the verb of the sentence. The linkage must fulfill certain linguistic constraints, which are given by a link grammar. The link grammar specifies which word may be linked by which connector to preceding and following words. Furthermore, the parser assigns part-of-speech tags, i.e. symbols identifying the grammatical function of a word in a sentence. In the example in Figure 1, the letter &quot;n&quot; following the word &quot;composers&quot; identifies &quot;composers&quot; as a noun.</Paragraph>
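A linkage as described above can be sketched as a small labeled graph. The example sentence and every connector name except subj are assumptions made for this illustration (Figure 1 itself is not reproduced here), and planarity is read in the usual link-grammar sense that links drawn above the sentence do not cross.

```python
from collections import deque
from itertools import combinations

# Hypothetical linkage; only the connector "subj" and the tag "n" for
# "composers" come from the text, all other labels are made up.
WORDS = ["Chopin", "was", "great", "among", "the", "composers"]
TAGS = {5: "n"}  # part-of-speech tag of "composers"
LINKS = [
    (0, 1, "subj"),   # Chopin -- was
    (1, 2, "compl"),  # was -- great
    (2, 3, "mod"),    # great -- among
    (3, 5, "pobj"),   # among -- composers
    (4, 5, "det"),    # the -- composers
]

def is_connected(words, links):
    """Check that every word is reachable from the first word."""
    neighbors = {i: set() for i in range(len(words))}
    for a, b, _ in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(words)

def is_planar(links):
    """Check that no two links cross when drawn above the sentence."""
    for (a, b, _), (c, d, _) in combinations(links, 2):
        strictly_inside = range(min(a, b) + 1, max(a, b))
        # Two links cross if they share no word and exactly one end of
        # the second link lies strictly between the ends of the first.
        if len({a, b, c, d}) == 4 and (c in strictly_inside) != (d in strictly_inside):
            return False
    return True

print(is_connected(WORDS, LINKS), is_planar(LINKS))
```

A real link grammar additionally restricts which connectors each word may carry to its left and right; that constraint is not modeled in this sketch.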
      <Paragraph position="3"> more complex example. The relationship between the subject &quot;London&quot; and the verb &quot;lies&quot; is not disrupted by the subordinate clause: London, which has one of the busiest airports, lies on the banks of the river Thames. We say that a linkage expresses a relation r if the underlying sentence implies that a pair of entities is in r. Note that the deep grammatical analysis of the sentence would allow us to define the meaning of the sentence in a theoretically well-founded way (Montague, 1974). For this paper, however, we limit ourselves to an intuitive understanding of the notion of meaning.</Paragraph>
      <Paragraph position="4"> We define a pattern as a linkage in which two words have been replaced by placeholders. Figure 3 shows a pattern derived from the linkage in Figure 1 by replacing &quot;Chopin&quot; and &quot;composers&quot; by the placeholders &quot;X&quot; and &quot;Y&quot;.</Paragraph>
      <Paragraph position="5"> We call the (unique) shortest path from one placeholder to the other the bridge, marked in bold in the figure. The bridge does not include the placeholders. Two bridges are regarded as equivalent if they have the same sequence of nodes and edges, although nouns and adjectives are allowed to differ. For example, the bridge in Figure 3 and the bridge in Figure 4 (in bold) are regarded as equivalent, because they are identical except for a substitution of &quot;great&quot; by &quot;mediocre&quot;. A pattern matches a linkage if an equivalent bridge occurs in the linkage. For example, the pattern in Figure 3 matches the linkage in Figure 4. If a pattern matches a linkage, we say that the pattern produces the pair of words that the linkage contains in the position of the placeholders. In Figure 4, the pair &quot;Mozart&quot; / &quot;composers&quot; is produced by the pattern in Figure 3.</Paragraph>
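The definitions of bridge, equivalence, matching and producing can be sketched as follows. This is a minimal illustration under stated assumptions, not the LEILA implementation: the sentences, the connector names other than subj, the tag names other than "n", and the helper names (shortest_path, bridge, produce) are all hypothetical.

```python
from collections import deque
from itertools import permutations

OPEN_TAGS = {"n", "a"}  # nouns and adjectives may differ between equivalent bridges

def shortest_path(links, start, goal):
    """Breadth-first search; returns (connector, node) steps from start to goal."""
    adjacency = {}
    for a, b, label in links:
        adjacency.setdefault(a, []).append((b, label))
        adjacency.setdefault(b, []).append((a, label))
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, label in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = (node, label)
                queue.append(neighbor)
    if goal not in previous:
        return None
    steps, node = [], goal
    while previous[node] is not None:
        parent, label = previous[node]
        steps.append((label, node))
        node = parent
    return steps[::-1]

def bridge(linkage, x, y):
    """Connectors and inner words on the path from x to y, end points excluded."""
    steps = shortest_path(linkage["links"], x, y)
    if steps is None:
        return None
    signature = []
    for label, node in steps:
        signature.append(label)
        if node != y:
            tag = linkage["tags"][node]
            # Nouns and adjectives are reduced to their tag so they may differ.
            signature.append("*" + tag if tag in OPEN_TAGS else linkage["words"][node])
    return tuple(signature)

def produce(pattern, linkage):
    """Return all word pairs that the pattern produces from the linkage."""
    pattern_linkage, x, y = pattern
    wanted = bridge(pattern_linkage, x, y)
    words = linkage["words"]
    return [(words[i], words[j])
            for i, j in permutations(range(len(words)), 2)
            if bridge(linkage, i, j) == wanted]

# Hypothetical link structure shared by both example sentences.
LINKS = [(0, 1, "subj"), (1, 2, "compl"), (2, 3, "mod"), (3, 5, "pobj"), (4, 5, "det")]
TAGS = ["n", "v", "a", "prep", "det", "n"]

# Pattern: "X was great among the Y", with placeholders at positions 0 and 5.
PATTERN = ({"words": ["X", "was", "great", "among", "the", "Y"],
            "tags": TAGS, "links": LINKS}, 0, 5)
MOZART = {"words": ["Mozart", "was", "mediocre", "among", "the", "composers"],
          "tags": TAGS, "links": LINKS}

print(produce(PATTERN, MOZART))
```

Reducing nouns and adjectives on the bridge to their tag is one simple way to let "great" and "mediocre" count as the same bridge element; trying every ordered word pair is a brute-force choice that keeps the sketch short.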
    </Section>
  </Section>
</Paper>