<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1113">
  <Title>Generating extraction patterns from a large semantic network and an untagged corpus Thierry POIBEAU Thales and LIPN Domaine de Corbeville</Title>
  <Section position="6" start_page="3" end_page="4" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> The evaluation concerned the extraction of information about company acquisitions from a French financial corpus. The corpus consists of 300 texts (200 for the training corpus, 100 for the test corpus).</Paragraph>
    <Paragraph position="1"> A system was first developed manually and evaluated. We then tried to perform the same task with automatically developed resources, so that the two could be compared. At the outset, the end-user must provide a set of relevant patterns to the acquisition system. We have developed a filtering tool to help the end-user focus on relevant portions of text. For lack of space, we do not describe this filtering tool, which is very close in its design to the EXDISCO system developed by R. Yangarber at NYU.</Paragraph>
    <Paragraph position="3"> First of all, the corpus is normalized. For example, all company names are replaced by the variable *c-company* using the named entity recognizer. In the semantic network, *c-company* is introduced as a synonym of company, so that all sequences containing a proper name that denotes a company can be extracted.</Paragraph>
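The normalization step can be illustrated with a minimal Python sketch. It assumes a small hypothetical gazetteer (COMPANY_NAMES, with invented names) in place of the named entity recognizer, which is not described here; only the variable *c-company* comes from the text.

```python
import re

# Hypothetical gazetteer standing in for the named entity recognizer.
# The names are invented for illustration.
COMPANY_NAMES = ["Acme", "Globex"]

def normalize(text, company_names=COMPANY_NAMES):
    """Replace every known company name with the variable *c-company*."""
    # Try longer names first so that a long name wins over a shorter
    # name that happens to be its prefix.
    ordered = sorted(company_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub("*c-company*", text)

print(normalize("rachat de Globex par Acme"))
```

With these assumed names, the call prints rachat de *c-company* par *c-company*, which is the kind of normalized sequence the patterns are matched against.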
    <Paragraph position="4"> For the slot corresponding to the company that is being bought, 6 seed patterns were given to the semantic expansion module. This module acquired 25 new validated patterns from the corpus, so each seed pattern generated 4.16 new patterns on average. For example, from the pattern rachat de *c-company* we obtain the following list: reprise de *c-company*; achat de *c-company*; acquerir *c-company*; racheter *c-company*; cession de *c-company*. This set of patterns includes nominal phrases (reprise de *c-company*) and verbal phrases (racheter *c-company*). The acquisition process concerns the head and the expansion at the same time. This technique is very close to the co-training algorithm proposed for this kind of task by E. Riloff and R. Jones (Riloff and Jones, 1999; Jones et al., 1999).</Paragraph>
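The expansion of a seed pattern can be sketched as follows. This is only an illustration under stated assumptions: the toy table RELATED stands in for the semantic network, the functions expand and attested are hypothetical names, and filtering candidates by their presence in the corpus is an assumed simplification of the acquisition process. The French patterns are the ones listed above.

```python
# Toy stand-in for the semantic network (an assumption): each head word
# maps to related words, tagged as noun ("N") or verb ("V").
RELATED = {
    "rachat": [
        ("reprise", "N"),
        ("achat", "N"),
        ("acquerir", "V"),
        ("racheter", "V"),
        ("cession", "N"),
    ],
}

def expand(seed_head, slot="*c-company*", related=RELATED):
    """Propose candidate patterns for a seed of the form 'head de slot'."""
    candidates = []
    for word, category in related.get(seed_head, []):
        if category == "N":
            # Nominal phrase: the slot is introduced by "de".
            candidates.append(f"{word} de {slot}")
        else:
            # Verbal phrase: the slot is the direct object.
            candidates.append(f"{word} {slot}")
    return candidates

def attested(candidates, corpus_sentences):
    """Keep only the candidates that occur in the normalized corpus."""
    return [c for c in candidates if any(c in s for s in corpus_sentences)]

corpus = ["le groupe annonce la reprise de *c-company*"]
print(attested(expand("rachat"), corpus))
```

The candidates that survive this step would still have to be validated by the end-user before being added to the system.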
    <Paragraph position="5"> The proposed patterns must be filtered and validated by the end-user. We estimate that, in general, 25% of the acquired patterns should be rejected. However, this validation process is very quick: only a few minutes were necessary to check the 31 proposed patterns and retain 25 of them.</Paragraph>
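The figures reported for this slot can be checked with a few lines of arithmetic. The counts (6 seeds, 31 proposed patterns, 25 retained) come from the text; the variable names are illustrative only. Note that 25/6 rounds to 4.17, so the 4.16 quoted in the text appears to be truncated rather than rounded.

```python
seeds = 6        # seed patterns given to the expansion module
proposed = 31    # patterns proposed to the end-user
retained = 25    # patterns validated by the end-user

rejected = proposed - retained
rejection_rate = rejected / proposed
yield_per_seed = retained / seeds

print(f"rejected: {rejected} of {proposed} ({rejection_rate:.1%})")
print(f"validated patterns per seed: {yield_per_seed:.2f}")
```

For this slot the rejection rate is therefore a little under 20%, somewhat below the general estimate of 25%.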
    <Paragraph position="6"> We then compared these results with the ones obtained with the manually built system. The evaluation concerned the two slots that require a syntactic and semantic analysis: the company that is buying another one (slot 1) and the company that is being bought (slot 2). These slots are filled by nominal phrases, which can be complex, and a functional analysis is necessary most of the time (is the nominal phrase the subject or the direct object of the sentence?). An overview of the results is given below (P stands for precision, R for recall, and P&amp;R for their combination). The system running with automatically defined resources is about 10% less efficient than the one with manually defined resources. The decrease in performance varies with the slot (it is smaller for slot 1 than for slot 2). Two kinds of errors are observed. First, certain sequences are not found because a relation between words is missing in the semantic net. This is the case for some idiomatic expressions that were not registered in the network, like tomber dans l'escarcelle de, which means to acquire.</Paragraph>
    <Paragraph position="7"> Second, some sequences are extracted by the semantic analysis but do not correspond to a transformation registered in the syntactic variation management module. For example, the sequence *c-company* renforce son activite communication ethnique en prenant une participation dans *c-company* (*c-company* reinforces its activity in ethnic communication by taking some interest in *c-company*) is not completely recognized. The pattern (prendre &lt;DET&gt;) participation dans *c-company* correctly identifies the company that is being bought, but the pattern *c-company* (prendre &lt;DET&gt;) participation cannot apply because the subject is too far from the verb.</Paragraph>
    <Paragraph position="8"> Lastly, some patterns that were not found manually are identified by the automatic procedure. The gain in development time is very significant: 50 h were necessary to define the resources manually, against only 10 h with the semi-automatic process.</Paragraph>
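The precision (P) and recall (R) scores used in the comparison above can be computed as in the sketch below. The definition of the combined P&amp;R score is not recoverable from this section, so the balanced F-measure is used here purely as an assumption. The function names and the counts in the usage line are hypothetical and are not results from the evaluation.

```python
def precision_recall(correct, proposed, expected):
    """Return (precision, recall) for one slot.

    correct:  number of slot fillers extracted and judged correct
    proposed: number of slot fillers extracted by the system
    expected: number of slot fillers in the reference annotation
    """
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall

def combined_score(precision, recall):
    """Balanced F-measure, assumed here as the combined P and R score."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical counts, for illustration only.
p, r = precision_recall(correct=8, proposed=10, expected=16)
print(p, r, round(combined_score(p, r), 3))
```

Computing the scores separately for each slot makes it possible to see where the automatically acquired resources lose most, as in the comparison between slot 1 and slot 2.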
    <Paragraph position="9"> Even if the decrease in performance is significant (10%), it can be reduced by using more linguistic knowledge. For example, we know that nominalizations are not yet handled correctly by the system. More information from the semantic network (which also includes morphological and syntactic information) could be used to enhance the performance of the overall system.</Paragraph>
    <Paragraph position="10"> Experiments have been carried out on different corpora and on different MUC-like tasks. They have all confirmed the effectiveness of the strategy described in this paper. Moreover, the system can be tuned for better precision or better recall, depending on user needs (Poibeau, 2001). For example, people working on large genomic textual databases face a huge amount of redundant information and generally want very precise information to be extracted. On the other hand, human operators monitoring critical situations generally want access to all the available information. Our system is versatile and could easily be adapted to these different contexts.</Paragraph>
  </Section>
</Paper>