File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1126_metho.xml
Size: 10,551 bytes
Last Modified: 2025-10-06 14:08:48
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1126"> <Title>Information Extraction from Single and Multiple Sentences</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Event Scope and Representation </SectionTitle> <Paragraph position="0"> The topic of the sixth MUC (MUC-6) was management succession events (Grishman and Sundheim, 1996). The MUC-6 data has been commonly used to evaluate IE systems. The test corpus consists of 100 Wall Street Journal documents from the period January 1993 to June 1994, 54 of which contained management succession events (Sundheim, 1995). The format used to represent events in the MUC-6 corpus is now described.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 MUC Representation </SectionTitle> <Paragraph position="0"> Events in the MUC-6 evaluation data are recorded in a nested template structure. This format is useful for representing complex events which have more than one participant, for example, when one executive leaves a post to be replaced by another. Figure 2 is a simplified event from the MUC-6 evaluation similar to one described by Grishman and Sundheim (1996).</Paragraph> <Paragraph position="1"> This template describes an event in which &quot;John J. Dooner Jr.&quot; becomes chairman of the company &quot;McCann-Erickson&quot;. The MUC templates are too complex to be described fully here but some relevant features can be discussed.</Paragraph> <Paragraph position="2"> format the POST, organisation (SUCCESSION ORG) and references to at least one IN AND OUT sub-template, each of which records an event in which a person starts or leaves a job. The IN AND OUT sub-template contains details of the PERSON and the NEW STATUS field which records whether the person is starting a new job or leaving an old one. 
Several of the fields, including POST, PERSON and ORGANIZATION, may contain aliases, which are alternative descriptions of the field filler and are listed when the relevant entity was described in different ways in the text. For example, the organisation in the above template has two descriptions: &quot;McCann-Erickson&quot; and &quot;McCann&quot;. It should be noted that the MUC template structure does not link the field fillers to particular instances in the texts. Consequently, if the same entity description is used more than once then there is no simple way of identifying which instance corresponds to the event description.</Paragraph> <Paragraph position="3"> The MUC templates were manually filled by annotators who read the texts and identified the management succession events they contained.</Paragraph> <Paragraph position="4"> The MUC organisers provided strict guidelines about what constituted a succession event and how the templates should be filled, which the annotators sometimes found difficult to interpret (Sundheim, 1995). Interannotator agreement was measured on 30 texts which were examined by two annotators. It was found to be 83% when one annotator's templates were assumed to be correct and compared with the other's.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Soderland's Representation </SectionTitle> <Paragraph position="0"> Soderland (1999) describes a supervised learning system called WHISK which learned IE rules from text with associated templates.</Paragraph> <Paragraph position="1"> WHISK was evaluated on the same texts from the MUC-6 data but the nested template structure proved too complex for the system to learn. Consequently Soderland produced his own simpler structure to represent events which he described as &quot;case frames&quot;. 
This representation could only be used to annotate events described within a single sentence and this reduced the complexity of the IE rules which had to be learned.</Paragraph> <Paragraph position="2"> The succession event from the sentence &quot;Daniel Glass was named president and chief executive officer of EMI Records Group, a unit of London's Thorn EMI PLC.&quot; would be represented as follows:</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> @@TAGS Succession </SectionTitle> <Paragraph position="0"> {PersonIn DANIEL GLASS} {Post PRESIDENT AND CHIEF EXECUTIVE OFFICER} {Org EMI RECORDS GROUP} Events in this format consist of up to four components: PersonIn, PersonOut, Post and Org. An event may contain all four components although none are compulsory. The minimal sets of components which can form an event are (1) PersonIn, (2) PersonOut or (3) both Post and Org. Therefore a sentence must contain a certain amount of information to be listed as an event in this data set: the name of an organisation and post participating in a management succession event or the name of a person changing position and the direction of that change.</Paragraph> <Paragraph position="1"> Soderland created this data from the MUC-6 evaluation texts without using any of the existing annotations. The texts were first pre-processed using the University of Massachusetts BADGER syntactic analyser (Fisher et al., 1995) to identify syntactic clauses and the named entities relevant to the management succession task: people, posts and organisations. Each sentence containing relevant entities was examined and succession events manually identified.</Paragraph> <Paragraph position="2"> This format is more practical for machine learning research since the entities which participate in the event are marked directly in the text. 
The learning task is simplified by the fact that the information which describes the event is contained within a single sentence and so the feature space used by a learning algorithm can be safely limited to items within that context.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Event Comparison </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Common Representation and Transformation </SectionTitle> <Paragraph position="0"> There are advantages and disadvantages to the event representation schemes used by MUC and Soderland. The MUC templates encode more information about the events than Soderland's representation but the nested template structure can make them difficult to interpret manually. In order to allow comparison between events each data set was transformed into a common format which contains the information stored in both representations. In this format each event is represented as a single database record with four fields: type, person, post and organisation. The type field can take the values person in, person out or, when the direction of the succession event is not known, person move. The remaining fields take the person, position and organisation names from the text. These fields may contain alternative values which are separated by a vertical bar (&quot;|&quot;).</Paragraph> <Paragraph position="1"> MUC events can be translated into this format in a straightforward way since each IN AND OUT sub-template corresponds to a single event in the common representation. The MUC representation is more detailed than the one used by Soderland and so some information is discarded from the MUC templates. For example, the VACANCY REASON field, which lists the reason for the management succession event, is not transferred to the common format. 
The event listed in In order to carry out this transformation an event has to be generated for each PersonIn and PersonOut mentioned in the Soderland event.</Paragraph> <Paragraph position="2"> Soderland's format also lists conjunctions of post names as a single slot filler (&quot;president and chief executive officer&quot; in this example). These are treated as separate events in the MUC format. Consequently they are split into the separate post names and an event is generated for each in the common representation.</Paragraph> <Paragraph position="3"> It is possible for a Soderland event to consist of only a Post and Org slot (i.e. there is neither a PersonIn nor a PersonOut slot). In these cases an underspecified type, person move, is used and no person field is listed. Unlike MUC templates, Soderland's format does not contain alternative names for field fillers and so these never occur when an event in Soderland's format is translated into the common format.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Matching </SectionTitle> <Paragraph position="0"> The MUC and Soderland data sets can be compared to determine how many of the events in the former are also contained in the latter.</Paragraph> <Paragraph position="1"> This provides an indication of the proportion of events in the MUC-6 domain which are expressible within a single sentence. Matches between Soderland and MUC events can be classified as full, partial or nomatch. Each of these possibilities may be described as follows: Full: A pair of events can only be fully matching if they contain the same set of fields. In addition there must be a common filler for each field. The following pair of events are an example of two which fully match.</Paragraph> <Paragraph position="2"> contains a proper subset of the fields of another event. 
Each field shared by the two events must also share at least one filler.</Paragraph> <Paragraph position="3"> The following event would partially match either of the above events; the org field is absent therefore the matches would not be met. This can occur if corresponding fields do not share a filler or if the sets of fields in the two events are neither equivalent nor one a subset of the other.</Paragraph> <Paragraph position="4"> Matching between the two sets of events is carried out by going through each MUC event and comparing it with each Soderland event for the same document. The MUC event is first compared with each of the Soderland events to check whether there are any equal matches. If one is found a note is made and the matching process moves on to the next event in the MUC set. If an equal match is not found the MUC event is again compared with the same set of Soderland events to see whether there are any partial matches. We allow more than one Soderland event to partially match a MUC event so when one is found the matching process continues through the remainder of the Soderland events to check for further partial matches.</Paragraph> </Section> </Section> </Paper>
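<!--
The matching rules of Section 3.2 can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not the authors' implementation:
the names Event, make_event, classify and match_document are hypothetical, the
type values are written with underscores (person_in, person_out), fillers are
compared case-insensitively, and the type field is treated like any other field.
The example events are illustrative and are not taken from either data set.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical record type: maps a field name (type, person, post,
# organisation) to its set of alternative fillers. Absent fields are omitted.
Event = Dict[str, FrozenSet[str]]


def make_event(**fields: str) -> Event:
    """Build an event, splitting alternative fillers on the vertical bar."""
    return {
        name: frozenset(part.strip().lower() for part in value.split("|"))
        for name, value in fields.items()
    }


def classify(a: Event, b: Event) -> str:
    """Return 'full', 'partial' or 'nomatch' for a pair of events."""
    # Every field present in both events must share at least one filler.
    if not all(a[f] & b[f] for f in a.keys() & b.keys()):
        return "nomatch"
    fields_a, fields_b = set(a), set(b)
    if fields_a == fields_b:
        return "full"
    # One field set must be a proper subset of the other.
    if fields_a < fields_b or fields_b < fields_a:
        return "partial"
    return "nomatch"


def match_document(
    muc_events: List[Event], soderland_events: List[Event]
) -> List[Tuple[str, List[int]]]:
    """For each MUC event, report the match type and matching indices."""
    results = []
    for muc in muc_events:
        labels = [classify(muc, sod) for sod in soderland_events]
        if "full" in labels:
            # A full match ends the search for this MUC event.
            results.append(("full", [labels.index("full")]))
            continue
        # Otherwise collect every Soderland event that partially matches.
        partial = [i for i, label in enumerate(labels) if label == "partial"]
        results.append(("partial", partial) if partial else ("nomatch", []))
    return results


# Illustrative events only (assumed values).
muc = make_event(type="person_in", person="John J. Dooner Jr.",
                 post="chairman", organisation="McCann-Erickson|McCann")
sod_full = make_event(type="person_in", person="John J. Dooner Jr.",
                      post="chairman", organisation="McCann")
sod_partial = make_event(type="person_in", person="John J. Dooner Jr.",
                         post="chairman")
```

Here sod_full fully matches muc because both events have the same four fields
and each pair of fields shares a filler, while sod_partial lacks the
organisation field and therefore matches only partially. How the underspecified
type person move should compare with person in or person out is not stated in
the text, so this sketch does not treat it specially.
-->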