<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1126">
<Title>Information Extraction from Single and Multiple Sentences</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Information Extraction (IE) is the process of identifying specific pieces of information in text, for example, the movements of company executives or the victims of terrorist attacks. IE is a complex task and the description of an event may be spread across several sentences or paragraphs of a text. For example, Figure 1 shows two sentences from a text describing management succession events (i.e. changes in corporate executive management personnel). It can be seen that the fact that the executives are leaving and the name of the organisation are listed in the first sentence. However, the names of the executives and their posts are listed in the second sentence although it does not mention the fact that the executives are leaving these posts. The succession events can only be fully understood from a combination of the information contained in both sentences. [Figure 1 (text truncated in the source): Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of &quot;self-dealing&quot; and &quot;undisclosed financial relationships.&quot; The executives are Don H. Pace, cofounder, president and chief]</Paragraph>
<Paragraph position="1"> Combining the required information across sentences is not a simple task since it is necessary to identify phrases which refer to the same entities, &quot;two top executives&quot; and &quot;the executives&quot; in the above example. Additional difficulties occur because the same entity may be referred to by a different linguistic unit. For example, &quot;International Business Machines Ltd.&quot; may be referred to by an abbreviation (&quot;IBM&quot;), nickname (&quot;Big Blue&quot;) or anaphoric expression such as &quot;it&quot; or &quot;the company&quot;.
These complications make it difficult to identify the correspondences between different portions of the text describing an event.</Paragraph>
<Paragraph position="2"> Traditionally IE systems have consisted of several components, with some being responsible for carrying out the analysis of individual sentences and other modules which combine the events they discover. These systems were often designed for a specific extraction task and could only be modified by experts. In an effort to overcome this brittleness, machine learning methods have been applied to port systems to new domains and extraction tasks with minimal manual intervention. However, some IE systems using machine learning techniques only extract events which are described within a single sentence; examples include (Soderland, 1999; Chieu and Ng, 2002; Zelenko et al., 2003).</Paragraph>
<Paragraph position="3"> Presumably an assumption behind these approaches is that many of the events described in the text are expressed within a single sentence and there is little to be gained from the extra processing required to combine event descriptions. Systems which only attempt to extract events described within a single sentence only report results across those events. But the proportion of events described within a single sentence is not known and this has made it difficult to compare the performance of those systems against ones which extract all events from text. This question is addressed here by comparing two versions of the same IE data set, the evaluation corpus used in the Sixth Message Understanding Conference (MUC-6) (MUC, 1995). The corpus produced for this exercise was annotated with all events in the corpus, including those described across multiple sentences. An independent annotation of the same texts was carried out by Soderland (1999), although he only identified events which were expressed within a single sentence.
Directly comparing these data sets allows us to determine what proportion of all the events in the corpus are described within a single sentence.</Paragraph>
<Paragraph position="4"> The remainder of this paper is organised as follows. Section 2 describes the formats for representing events used in the MUC and Soderland data sets. Section 3 introduces a common representation scheme which allows events to be compared, a method for classifying types of event matches and a procedure for comparing the two data sets. The results and implications of this experiment are presented in Section 4.</Paragraph>
<Paragraph position="5"> Some related work is discussed in Section 5.</Paragraph>
</Section>
</Paper>