<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1092"> <Title>Automatic extraction of paraphrastic phrases from medium size corpora</Title> <Section position="3" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> This section presents related work on the acquisition of extraction patterns and paraphrases from texts.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.1 IE and resource acquisition </SectionTitle> <Paragraph position="0"> Information extraction (IE) has established a now widely accepted linguistic architecture based on cascading automata and domain-specific knowledge (Appelt et al., 1993). However, several studies have pointed out the cost of defining the necessary resources. For example, E. Riloff (1995) reports that about 1500 hours are necessary to define the resources for a text classification system on terrorism. Most of these resources are variants of extraction patterns, which have to be established manually.</Paragraph> <Paragraph position="1"> We estimate that developing resources for IE takes at least as long as for text classification. To address this problem of portability, recent research has focused on using machine learning throughout the IE process (Muslea, 1999). A first trend was to apply machine learning methods directly to replace IE components. For example, statistical methods have been successfully applied to the named-entity task. Among others, Bikel et al. (1997) learn names by using a variant of hidden Markov models.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Extraction pattern learning </SectionTitle> <Paragraph position="0"> Another research area that tries to avoid the time-consuming task of elaborating IE resources is concerned with the generalization of extraction patterns from examples.
Muslea (1999) gives an extensive description of the different approaches to that problem. AutoSlog (Riloff, 1993) was one of the very first systems to use a simple form of learning to build a dictionary of extraction patterns. Ciravegna (2001) demonstrates the value of acquiring the left and right boundaries of extraction patterns independently during the learning phase. In general, the left part of a pattern is easier to acquire than the right part, and some heuristics can be applied to infer the right boundary from the left one. The same method can be applied to argument acquisition: each argument can be acquired independently of the others, since the argument structure of a predicate in context is rarely complete.</Paragraph> <Paragraph position="1"> Collins and Singer (1999) demonstrate how two classifiers operating on disjoint feature sets recognize named entities with very little supervision. The method is interesting in that the analyst only needs to provide some seed examples for the system to learn relevant information. However, these classifiers must be made interactive so that they do not diverge from the expected result, since each error is transmitted and amplified by subsequent processing stages. Contrary to this approach, partially reproduced by Duclaye et al. (2003) for paraphrase learning, we prefer a lightly supervised method with clear interaction steps with the analyst during the acquisition process, to ensure that the solution converges.</Paragraph> </Section> </Section> </Paper>