<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1021"> <Title>Whither Written Language Evaluation?</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. THE MENU </SectionTitle> <Paragraph position="0"> The menu of evaluations will include rather different types of tasks in order to meet the range of objectives cited above. On the one hand, we want to continue evaluation on tasks -- such as &quot;information extraction&quot;-- which can he seen as prototypes for real applications, and so wiU continue to draw interest from outside the natural language processing community. We would like to make these tasks as simple as possible, consistent with a semblance of reality, so that evaluationper se does not become a major time drain.</Paragraph> <Paragraph position="1"> On the other hand, we are interested in exploring &quot;glass box&quot; evaluations -- evaluations of the ability of systems to identify crucial linguistic relationships which we believe are relevant to a high level of performance on a wide variety of language understanding tasks.</Paragraph> <Paragraph position="2"> Of course, some people will believe that we have chosen the wrong relationships, or at least that natural language systems need not make these relationships explicit in the process of performing a natural language analysis task, and so will decline to participate in some or all of the glass box evaluations. We respect these disagreements, and have organized the menu of evaluations to take them into account.</Paragraph> <Paragraph position="3"> Any particular choice of internal evaluations necessarily represents some bet on the path of technical development. However, we believe that the relationships we have selected are sufficiently basic to understanding that the bet is worth taking, and that by encouraging work on these tasks we will push research on natural language understanding in ways which would not be possible with a limited application task such as information extraction.</Paragraph> <Paragraph position="4"> The menu we came up with includes one task (named entity recognition) which, is sufficiently basic to be characterized as both an internal and an application task; four internal evaluations; and two application-oriented evaluations: 1. Named Entity Recognition: Identify company names, or. ganization names, personal names, location names, product names, dates, times, and money.</Paragraph> <Paragraph position="5"> 2. Parseval: Bracket the syntactic constituents of the sentence.</Paragraph> <Paragraph position="6"> 3. Predicate-Argument Structure: Identify the relationship between lexical elements in terms of relations such as logicalsubject, logical-object, etc.</Paragraph> <Paragraph position="7"> 4. WordSense Disambiguation: Identify the word sense of each noun, verb, adjective, and adverb in the text, using the inventory of word senses from WordNet 5. Coreference Resolution: Identify identity of reference, superset, and subset relations among text elements, as well as situations where a text element is an implicit argument of another (e.g., a subject or object of a nominalization which appears elsewhere in the text).</Paragraph> <Paragraph position="8"> 6. Mini-MUC: Identify instances of a particular class of event in the text, and fill a template with the crucial information about each instance.</Paragraph> <Paragraph position="9"> 7. 
<Paragraph position="10"> Evaluations 3, 4, and 5 are collectively known as Semeval. Each of the seven evaluations can be done independently, but there is potential for using the results of the annotation for one task in performing another; these relationships are shown in Figure 1. Presumably, most participants will generate predicate-argument structure from parser output, so for them good Parseval performance would be a prerequisite for good performance on the predicate-argument metric. Recognition of named entities is essential for good performance on both Semeval and Mini-MUC. Some people will want to use the Semeval processing/output for the Mini-MUC, and some people won't; it is an interesting scientific question whether it helps. Cross-Document Coreference requires the output of Mini-MUC.</Paragraph>
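To make these dependencies concrete, the sketch below encodes one reading of Figure 1 as a prerequisite graph and derives an admissible staggered ordering of the evaluations; the task labels, the helper code, and the inclusion of the optional Semeval-to-Mini-MUC link are assumptions of this illustration, not part of the evaluation design.

# Illustrative only: prerequisite relationships as described in the text.
# graphlib is in the Python standard library from version 3.9 on.
from graphlib import TopologicalSorter

prerequisites = {
    "predicate-argument": {"parseval", "named entity"},
    "word sense": {"named entity"},
    "coreference": {"named entity"},
    # The Semeval -> Mini-MUC link is optional in the text; it is included
    # here only so the schedule places the Semeval tasks before Mini-MUC.
    "mini-MUC": {"named entity", "predicate-argument", "word sense", "coreference"},
    "cross-document coreference": {"mini-MUC"},
}

# One admissible "staggered" schedule: each component evaluation precedes
# the evaluations that may consume its output.
print(list(TopologicalSorter(prerequisites).static_order()))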
<Paragraph position="11"> It will be possible for a site to investigate only one of these links, if it wishes, rather than starting from the raw text input. This would allow people to build on others' work on named entity recognition, or to assess, assuming perfect or typical results on Semeval, how well one could do on Mini-MUC. Moreover, sites may be required not only to do a run using the (perfectly correct) key as the input to their component, but also to do one using the (imperfect) actual results of some site participating in the full evaluation, which would be publicly available. (This might be arranged by staggering the evaluations, with the component evaluations scheduled before the Mini-MUC evaluation.) These experiments would be analogous to the written-language-only part of the SLS evaluations.</Paragraph> </Section> <Section position="4" start_page="0" end_page="124" type="metho"> <SectionTitle> 3. THE EVALUATIONS </SectionTitle> <Paragraph position="0"> In this section we briefly describe each of the seven evaluation tasks.</Paragraph> <Paragraph position="1"> For each task we shall need to prepare a sample of text annotated with the information we wish the systems under evaluation to extract. To make the annotations more manageable and inspectable, we have combined the annotations for named entity recognition, coreference, and word sense identification. They are all encoded using an SGML tagging of the text, with separate attributes to record each type of information. Merging the annotations does not mean that the corresponding evaluations will be combined. We still expect that these three evaluations will be scored separately, and that text can be separately annotated for the three evaluations. To illustrate some of the annotations, members of the MUC-6 committee have annotated one of the &quot;joint venture&quot; news articles from the MUC-5 evaluation. The first two sentences of this article are: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million New Taiwan Dollars, will start production in January 1990 with production of 20,000 iron and &quot;metal wood&quot; clubs a month.</Paragraph> <Paragraph position="2"> The named-entity / coreference / word sense annotation is shown in Figures 2 and 3; the predicate-argument annotation is shown in Figure 4.</Paragraph> <Paragraph position="3"> 3.1. Named Entity Recognition The experience with MUC-5 indicated that recognition of company, organization, people, and location names is an essential ingredient in understanding business news articles, and is to a considerable degree separable from the other problems of language interpretation. In addition, such recognition can be of practical value by itself in tracking people and organizations in large volumes of text. As a result, this evaluation may appeal to firms focussed on this limited task, who are not involved in more general language understanding.</Paragraph> <Paragraph position="4"> In Figures 2 and 3, the named entity recognition is reflected in all the SGML elements besides wd: entity for companies and other organizations, loc and complex-loc for locations, num for numbers (including percentages), date, and money. Additional element types would be provided for other constructs involving specialized lexical patterns, such as times and people's names. For most of these elements, one of the attributes gives a normalized form: the decimal value of a number, a 6-digit number for dates, and a standardized form for company names (following MUC-5 rules for company names).</Paragraph> [Figures 2 and 3 (named-entity / coreference / word sense annotation of the sample sentences) appear here; only a fragment of the markup survived extraction: <S n=1> <entity id=t1 type=company name='Bridgestone Sports Co'> <wd lemma=say sense=[verb.communication.0]> <date value='241189'> <wd id=t2 identical=t1> ...] <Section position="1" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 3.2. Parseval </SectionTitle> <Paragraph position="0"> Parseval is a measure of the ability of a system to bracket the syntactic constituents in a sentence. This metric has now been in use for several years, and has been described elsewhere [1]. Parseval may eventually be supplanted in large part by the &quot;deeper&quot; and more detailed predicate-argument evaluation. However, for the present Parseval is being retained in order to accommodate participants focussed on surface grammar and participants reluctant to commit to predicate-argument evaluation until its design is stabilized and proven.</Paragraph> </Section> <Section position="2" start_page="121" end_page="122" type="sub_section"> <SectionTitle> 3.3. Predicate-argument structure </SectionTitle> <Paragraph position="0"> A very tentative predicate-argument structure for our two sentences is shown in Figure 4. As much as possible, we have tried to use the same structures which have been adopted by the Spoken Language Coordinating Committee for their predicate-argument evaluation. We summarize here, with some simplifications, only the most essential aspects of this representation.</Paragraph> <Paragraph position="1"> For each event or state in the text, we introduce a Davidsonian event variable i, and treat the type and each argument of the event as a separate predication. So, for example, &quot;Fred fed Francis on Friday&quot; would be represented as (ev-type 1 feed) (l-subj 1 Fred) (l-obj 1 Francis) (on 1 Friday). (This assumes that on is a primitive predicate, which is not expanded using an event variable; otherwise we would have the predications (ev-type 2 on), (l-subj 2 1), and (l-obj 2 Friday).)</Paragraph> <Paragraph position="3"> Each elementary predication can be numbered by preceding it with a number and colon. Roughly speaking, a system would be scored on the number of such elementary predications it gets correct.</Paragraph>
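To make that scoring idea concrete, here is a minimal sketch; the tuple encoding, the function name, and the assumption that event variables are already aligned between key and response are simplifications introduced for this illustration and are not part of the evaluation specification.

# Illustrative sketch: score a response by the number of elementary
# predications it shares with the hand-prepared key.
def predication_score(key, response):
    """key, response: sets of predications, e.g. ('l-subj', 1, 'Fred')."""
    correct = len(key.intersection(response))
    return correct, len(key), len(response)

# Key for "Fred fed Francis on Friday", in the notation described above.
key = {
    ("ev-type", 1, "feed"),
    ("l-subj", 1, "Fred"),
    ("l-obj", 1, "Francis"),
    ("on", 1, "Friday"),
}
# A hypothetical response that misses the temporal adjunct.
response = {
    ("ev-type", 1, "feed"),
    ("l-subj", 1, "Fred"),
    ("l-obj", 1, "Francis"),
}
print(predication_score(key, response))   # (3, 4, 3)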
<Paragraph position="4"> Because this notation is none too readable, however, we also allow the abbreviated form (feed [event 1] [l-subj Fred] [l-obj Francis] [on Friday]), where [event 1] could be omitted if there were no other references to the event variable. An entity arising from a noun phrase with determiner det will be represented by</Paragraph> <Paragraph position="5"> e: (det <restr1 restr2 ...>) Each restr_i is a constraint on the entity, stated as a predication on index e. Thus &quot;the brown cow which licked Fred&quot; would be represented by 1: (the <(brown [l-subj 1]) (cow [l-subj 1]) (lick [l-subj 1] [l-obj Fred])>) The notation &quot;?i&quot; means that i is optional; the notation i / j means that either i or j is allowed.</Paragraph> <Paragraph position="6"> The written language group, however, is not taking the same approach to the selection of predicates and role names as the spoken language group. The spoken language group aspires to a truly semantic representation, independent of the particular syntactic form in which it was expressed. This seems feasible in the highly circumscribed domain of air traffic information. It does not seem a feasible near-term goal for all of language, or even for all of &quot;business news&quot;, which is a very broad domain. Instead we will initially be using a form of grammatical functional structure, with lexical items as heads (predicate types) and role names such as logical subject and logical object. The representation will be normalized with respect to only a limited number of syntactic alternations, such as passive, dative with &quot;for&quot;, and dative with &quot;to&quot;. I expect that the representation will gradually evolve to normalize a larger number of paraphrastic alternations.</Paragraph> </Section> <Section position="3" start_page="122" end_page="124" type="sub_section"> <SectionTitle> 3.4. Coreference </SectionTitle> <Paragraph position="0"> Coreference can be annotated either at the level of the word sequence or at the level of predicate-argument structure. By recording coreference at the word level, we lose some distinctions that can be captured at the predicate-argument level. On the other hand, annotating at the word level allows for evaluation of coreference without generating predicate-argument structure. So -- in order to keep the menu items as independent as possible -- our current plan is to annotate coreference at the word level, with the head word of the anaphor pointing to the head word of the antecedent.</Paragraph> <Paragraph position="1"> Coreference is recorded through attributes in the SGML annotation (Figures 2 and 3). For purposes of reference, elements are annotated with an ident attribute. Identity of reference is indicated by an attribute identical pointing to the antecedent. A superset/subset relation is indicated by a sub-of attribute. Finally, if a predication has implicit arguments which are coreferential with prior text elements, they are annotated as args = &quot;[role antecedent]&quot;.</Paragraph> <Paragraph position="2"> 3.5. Word sense identification The third element of the Semeval triad is sense identification. As a sense inventory, we have used WordNet, which is widely and freely available and is broad in coverage [4]. The notation used to refer to a particular WordNet sense was described in [5].</Paragraph>
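As a minimal illustration of the kind of sense inventory involved, the lines below list the candidate noun senses of "club" (as in "golf clubs") using the NLTK interface to WordNet; NLTK and its WordNet data are assumptions of this illustration and are not mandated by the evaluation.

# Illustrative only: enumerate the WordNet senses a system must choose among.
# Requires the nltk package and a one-time nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for synset in wn.synsets("club", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())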
<Paragraph position="3"> 3.6. Mini-MUC This component is the direct descendant of the information extraction tasks in the previous MUCs [2,3]. In response to criticism that the evaluation task had gotten too complex, we have endeavored to make the new information extraction task as simple as possible. The template will have a hierarchical structure, as in MUC-5, but probably with only two levels of &quot;objects&quot;. The objects at the lower level will represent common business news entities such as people and companies. A small inventory of such objects will be defined in advance. The upper-level object will then be a simple structure with perhaps four or five slots, to capture the information about a particular type of event.</Paragraph> <Paragraph position="4"> The following were suggested as typical of such templates:</Paragraph> </Section> </Section> <Section position="5" start_page="124" end_page="124" type="metho"> <SectionTitle> 4. PLANS </SectionTitle> <Paragraph position="0"> The menu of evaluations which has been developed for MUC-6 is certainly ambitious; perhaps it is too ambitious and will need to be scaled back. While the cost of participating in a single one of these evaluations should be much less than the effort required for MUC-5, the effort to prepare all these evaluations will be considerable. Detailed specifications will need to be developed for each of the evaluations, and substantial annotated corpora will have to be developed, both as the &quot;case law&quot; for subsequent evaluations and as a training corpus for trainable analyzers. If this is all successful, however, it holds the promise of fostering advances in several aspects of natural language understanding.</Paragraph> <Paragraph position="1"> A description of the menu of evaluations was disseminated electronically at the end of December 1993. Further details, including a sample annotated message, were distributed at the end of February 1994. After a period of public electronic comment, we shall be recruiting volunteer sites to begin annotating texts -- slowly over the course of the spring, as the specifications are ironed out, and more rapidly over the summer, once the specifications are more stable.</Paragraph> <Paragraph position="2"> A dry run evaluation, possibly including only a subset of the menu items, will be conducted in late fall of 1994; MUC-6 is tentatively scheduled for May of 1995.</Paragraph> </Section> </Paper>