<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1022"> <Title>SEMANTIC EVALUATION FOR SPOKEN-LANGUAGE SYSTEMS</Title> <Section position="5" start_page="0" end_page="126" type="metho"> <SectionTitle> 2. WHY DO SEMEVAL? </SectionTitle> <Paragraph position="0"> A question that has received, and continues to receive, extensive discussion is &quot;Why do we want to do SemEval at all?&quot; George Doddington [1] has addressed this from ARPA's perspective as follows: Why is SemEval a good idea? Well, first, ARPA's goal in the HLT Program is to make strategic advances in core human language technology.</Paragraph> <Paragraph position="1"> This goal is a technology goal (to produce useful functionality). It is not a science goal (to produce scientific understanding), nor is it an application goal (to produce useful applications embodying human language technology). So, why do SemEval instead of only doing task-level evaluations (such as ATIS CAS)? There are three reasons: 1. SemEval offers the possibility of providing a much more direct and objective evaluation of underlying technical issues. It thus promises greater diagnostic leverage, which would yield more rapid and efficient development of core</Paragraph> <Paragraph position="2"> technology.</Paragraph> <Paragraph position="3"> 2. SemEval, by measuring performance at a technical level rather than an application level, eliminates much overhead and research inefficiency by obviating the need to support application effort and other back-end issues.</Paragraph> <Paragraph position="4"> This makes research much more efficient by focusing a greater fraction of effort on research and technical issues of direct interest. It also makes it more attractive and much easier for a new player to enter the game.</Paragraph> <Paragraph position="5"> 3. 
SemEval, by virtue of measuring performance below the application level, offers the opportunity to compare performance across different applications and to support formal evaluation among a much larger research community. Thus the potential benefit and evaluation support go far beyond the relatively few ARPA HLT sites.</Paragraph> <Paragraph position="6"> The risk is that SemEval may end up measuring technical aspects of systems that are not directly relevant or do not represent the important issues in HLT. We need to try hard to avoid this. And we won't immediately abandon other task-level evaluations such as the ATIS CAS. But a successful SemEval has the potential to create a larger research challenge and to accelerate HLT progress very significantly. The potential payoff is high, and we need to give SemEval the best shot we can.</Paragraph> </Section> <Section position="6" start_page="126" end_page="126" type="metho"> <SectionTitle> 3. METHODOLOGICAL PRINCIPLES </SectionTitle> <Paragraph position="0"> Perhaps the main difficulty in defining SemEval is that there is no single generally accepted notation for representing the meaning of natural-language utterances. Instead, there are numerous notations and theories, from which we have to synthesize a notation that is compatible with as wide a variety of points of view as possible. 
Two methodological principles have been proposed to guide us in this task.</Paragraph> <Paragraph position="1"> The first is proposed as a strategy for helping us define annotations without bogging down in theoretical disputes: take as our mantra &quot;It's just a notation, not a theory.&quot; That is, the overriding consideration should be whether a proposed notation is a convenient way of marking a distinction that we agree we should mark; whether it meets other, theory-driven conditions (e.g., supporting a truth-conditional semantics, assigning a type-theoretic interpretation to every subexpression, being compositional, assigning one predicate per morpheme) is beside the point.</Paragraph> <Paragraph position="2"> The second methodological principle is a default rule for deciding when to make distinctions. There are many cases where it is not immediately clear whether to mark a distinction or not. The proposed default is not to mark distinctions. That is, marking a distinction should require some positive argument that, if the distinction is not marked, two utterances that we agree should be assigned different structures will be assigned the same structure. (Often, the most compelling such arguments are truth-conditional; that is, we can describe a situation where one utterance is clearly true and the other is clearly false.) The reason for defaulting to not making distinctions is that systems that make more distinctions than necessary should be able to collapse them for purposes of translation to the canonical representation fairly easily, but a system that doesn't make a distinction at all will be severely penalized if it is scored incorrect for not making it. Hence, we need evidence that the system is wrong not to make the distinction. Note that not marking distinctions does not mean that the notation will represent a superficial analysis of the utterance. 
Quite the contrary, it will often require giving a common representation to many expressions that are quite different syntactically and, hence, push toward deeper representations.</Paragraph> </Section> <Section position="7" start_page="126" end_page="128" type="metho"> <SectionTitle> 4. PREDICATE-ARGUMENT STRUCTURE ISSUES </SectionTitle> <Paragraph position="0"> So far, most of the detailed proposals that have been discussed pertain to predicate-argument structure issues.</Paragraph> <Section position="1" start_page="126" end_page="127" type="sub_section"> <SectionTitle> 4.1. Syntax for Predicate-Argument Structure </SectionTitle> <Paragraph position="0"> In discussions of a possible syntax for predicate-argument structure, it has been evident that there are considerable variations in preferences for the amount of syntactic sugar to be used. To accommodate these differences in preferences, three intertranslatable levels of notation have been proposed. The first is simply LISP-style nested functor-argument notation, with two notational additions. First, we use angle brackets <...> to indicate implicit conjunction, arising, for example, from iterated modifiers in the utterance. Second, we use numerical indices followed by a colon to label expressions that may fill more than one argument position in the predicate-argument structure. For example, if we make the assumption that tall, block, and blue are one-place predicates and we ignore tense, Every blue block is tall might be represented: (decl (tall 1:(every <(block 1)(blue 1)>))) (The recursion implicit in the use of the index 1 will be explained in the discussion of quantification below.) 
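To make the nested notation concrete, the example can be encoded and mechanically flattened into the atomic (funct ...)/(arg1 ...) predications described next. The following Python sketch is purely illustrative: the tuple encoding, the function names, and the convention of starting fresh indices at 2 are assumptions, not part of the proposal.

```python
# Encoding of the level-1 notation for "Every blue block is tall":
#   (decl (tall 1:(every <(block 1)(blue 1)>)))
# A tuple is an application, ("label", n, expr) is an indexed
# expression n:(...), ("conj", [...]) is angle-bracket conjunction,
# and a bare int reuses an already-assigned index.

def flatten(expr, facts, counter):
    """Return the index standing for expr, appending
    ("funct", i, functor) and ("argN", i, j) facts along the way."""
    if isinstance(expr, int):          # reuse of an existing index
        return expr
    if expr[0] == "label":             # n:(...) fixes the index to n
        _, n, sub = expr
        emit(sub, n, facts, counter)
        return n
    i = counter[0]                     # otherwise allocate a fresh index
    counter[0] += 1
    emit(expr, i, facts, counter)
    return i

def emit(expr, i, facts, counter):
    functor, *args = expr
    facts.append(("funct", i, functor))
    for pos, arg in enumerate(args, start=1):
        is_conj = isinstance(arg, tuple) and arg[0] == "conj"
        conjuncts = arg[1] if is_conj else [arg]
        for c in conjuncts:            # <...> yields one fact per conjunct
            facts.append((f"arg{pos}", i, flatten(c, facts, counter)))

example = ("decl",
           ("tall",
            ("label", 1,
             ("every", ("conj", [("block", 1), ("blue", 1)])))))
facts = []
counter = [2]                          # index 1 is taken by the label
flatten(example, facts, counter)
```

Because the result is a flat collection of small facts, two annotations can be compared fact by fact under set comparison, which is what makes per-predication partial credit possible.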
A second level of representation is obtained by assigning every expression an index, and breaking the structure down into a list of atomic predications, interrelated by the indices.</Paragraph> <Paragraph position="2"> Finally, we may wish to break this notation down to an even more atomic level for the purpose of counting errors for scoring: (funct 2 decl) (arg1 2 3) (funct 3 tall) (arg1 3 1) (funct 1 every) (arg1 1 4) (arg1 1 5) (funct 4 block) (arg1 4 1) (funct 5 blue) (arg1 5 1)</Paragraph> <Paragraph position="3"> 4.2. Scope of Quantified Noun Phrases There has been general agreement that quantified noun phrases should be represented &quot;in place,&quot; without their exact scope being represented. To be more precise, for a generalized quantifier Q(X, Y) corresponding to a noun phrase determiner like some or every, since the restriction X is essentially the content of the noun phrase, that would be indicated in the notation, but the body Y, not being structurally determined by the syntax, would be left vague. This means we will give different representations to Every tall block is blue. Every blue block is tall. since the difference between these is structurally determined and unambiguous, but we will not give different representations to the two scopings of Some girl likes every boy.</Paragraph> <Paragraph position="5"> There is some girl such that she likes every boy. For every boy, there is some girl who likes him.</Paragraph> <Paragraph position="6"> since that distinction is not structurally determined (although structure has some influence), and it is often difficult to judge.</Paragraph> <Paragraph position="7"> An additional constraint that has been agreed on is that the notation should not have what has come to be called the &quot;linchpin problem&quot;; that is, the notation should not be such that, if one key piece is missed, the whole thing falls apart and credit is given for virtually nothing. In particular, it should be possible to miss which pieces belong inside or outside the restriction of the quantifier, and still get credit for recognizing all the predications of the &quot;quantified variable.&quot; The solution that has been developed is illustrated by the predicate-argument structure given above for the phrase every blue block: 1:(every <(block 1)(blue 1)>) The expression as a whole is given an index that is also used in the argument positions inside the restriction of the quantifier that correspond to &quot;quantified variable&quot; positions. That way, the notation can clearly indicate which predications are part of the quantifier restriction, but every argument position that would be filled by the quantified variable in a standard logical representation is filled with the same index, whether the predication is inside or outside of the quantifier restriction.</Paragraph> <Paragraph position="8"> Quantified noun phrases are not the only constructs where the issue of scope arises. Others include modal verbs, negation, and propositional attitude verbs (e.g., want, know). It has generally been agreed that items whose scope is largely determined by linguistic structure would have their scope indicated in predicate-argument structure (usually by treating them as having propositions as arguments), and that items whose scope is not determined by linguistic structure (such as only) would have their structure left underspecified in predicate-argument structure.</Paragraph> <Paragraph position="9"> 4.3. What are the Predicates and Arguments? There are two widely used schemes for mapping linguistic heads (e.g., main verbs) and complements (e.g., subjects and objects) into predicate-argument structures. In one approach, the head is treated as a multi-argument predicate and the complements are distinguished by the position they fill in the argument list. In the other approach, a &quot;Davidsonian&quot; [2] notation is used, in which an &quot;event&quot; described by the head is introduced, and the complements are treated as fillers of role relations for the event. In the notation we have adopted, this comes out looking something like <(ev-type 1 P) (R1 1 X) (R2 1 Y) (R3 1 Z)> We have chosen the Davidsonian-style notation because of its flexibility in leaving open exactly what complements (and adjuncts) a head has.</Paragraph> <Paragraph position="10"> After the decision to use a Davidsonian representation, the question came up of how widely to apply it. Davidson's original proposal was intended to apply only to verbs (in particular, only to action verbs), but examples arise with adjectives, adverbs, nouns, and even prepositions that seem to require a similar treatment. We have tentatively decided, therefore, to apply it to all of these types of expressions, but to provide syntactic sugar to hide some of the complexity in simple cases.</Paragraph> </Section> <Section position="2" start_page="127" end_page="128" type="sub_section"> <SectionTitle> 4.4. Collapsing Lexical and Syntactic Distinctions </SectionTitle> <Paragraph position="0"> It has been tentatively agreed that a number of syntactic distinctions should be collapsed in predicate-argument structure: Active vs. passive: Mary kissed John, vs. John was kissed by Mary.</Paragraph> <Paragraph position="1"> Dative movement: John gave Mary a book, vs. 
John gave a book to Mary.</Paragraph> <Paragraph position="2"> Raising verbs: It seems that John is here, vs. John seems to be here.</Paragraph> <Paragraph position="3"> It has also been agreed that verbs and their event nominalizations should be given the same underlying predicates, for example, arrive and arrival.</Paragraph> <Paragraph position="4"> It also appears that there are cases where multiple subcategorization patterns for the same verb can be handled by a single underlying predicate with different roles expressed. For example, in John baked Mary a cake and John baked a cake, bake would be taken to express the same predicate, but with who the cake was for expressed by a role relation in only the first case. Other examples falling under this heading include &quot;control&quot; verbs, where if a certain role is not expressed, it is constrained to be filled by the same item as one of the other roles--for example, John expected Mary to win vs. John expected to win (= John expected himself to win).</Paragraph> <Paragraph position="5"> Another category of verbs that first seemed to fall into this class are those for which both transitive and intransitive forms exist and it appears that the object of the transitive form may fill the same role as the subject of the intransitive form--for example, John melted the butter, vs. The butter melted. On closer examination, however, it seems there may be good reasons for treating these as distinct predicates, so this issue remains open.</Paragraph> </Section> <Section position="3" start_page="128" end_page="128" type="sub_section"> <SectionTitle> 4.5. Complex Predicates </SectionTitle> <Paragraph position="0"> So far we have represented all predicates as atomic, but we might in some cases want to have predicates that are themselves structurally complex. One case is complex determiner phrases. 
We have treated determiners like some and every as a sort of predicate, but sometimes determiners are complex phrases like no more than seven. A second potential example of complex predicates is families of related prepositions, like at, before, and after. It has been suggested that these might be profitably treated as utilizing a single underlying predicate, whose interpretation varies from domain to domain, together with a set of numerical comparison operations defined in terms of that predicate but fixed across all domains.</Paragraph> </Section> </Section> <Section position="8" start_page="128" end_page="128" type="metho"> <SectionTitle> 5. WORD SENSE AND ROLE IDENTIFICATION ISSUES </SectionTitle> <Paragraph position="0"> Much of the discussion of word-sense identification issues has revolved around whether WordNet [3] would be a suitable lexical resource to use as a source for word senses. The general impression seems to be that it probably is, but the details remain to be worked out.</Paragraph> <Paragraph position="1"> The choice of the Davidsonian representation for head-complement relations raises an important issue closely related to that of word-sense identification--namely, identification of the role relations that hold between the events and the complements (R1, R2, and so forth in the discussion above).</Paragraph> <Paragraph position="2"> One possibility is to use fairly superficial identifiers (such as abbreviations for &quot;logical subject&quot; and &quot;logical object&quot;) or surface prepositions for role names, to keep the annotations domain-independent. An objection to this is that it is too syntactically oriented and does not represent a deep enough level of understanding.</Paragraph> <Paragraph position="3"> The approach currently being explored is to attempt to define a set of semantic classes relevant to the domain and construct role names from those classes. 
If there is only one relation between a pair of classes salient enough to be expressed by a grammatical role or preposition, then a simple concatenation of the class names would be used. For example, for &quot;a flight on an airline&quot;, there is really only one salient relation between flights and airlines, so flight_airline might as well be used to name that relation. If there is more than one salient relation between two semantic classes, a grammatical relation name or preposition can be interpolated between the semantic class names. So, since flight_airport would be ambiguous between origin and destination, we would have flight_from_airport and flight_to_airport instead.</Paragraph> <Paragraph position="4"> Since theory-laden terms are not used to name the roles, this approach should avoid arguments such as whether a particular role is really an agent or is merely an experiencer. It also offers the possibility of expressing deeper regularities than grammatical roles or surface prepositions, since it allows us to say that ticket's price, price of a ticket, price for a ticket, and price on a ticket all involve the same relation between a ticket and a price.</Paragraph> <Paragraph position="5"> A number of key issues raised by this approach remain to be resolved, including the following: Roles may be expressed (at least) by grammatical relations, prepositions, possessives, and the verb have. One question is whether we ever want to treat any of these constructions as having an autonomous sense, rather than expressing a role of some predicate. For example, one might want to say that in a book on a table, on simply expresses a relation between the book and the table that depends only on an autonomous sense of on independent of the predicate book, while in a price on a ticket, on expresses the role of the predicate price that is filled by a ticket.</Paragraph> <Paragraph position="6"> Under this proposal, it is necessary to know what the semantic class of the head of a phrase is in order to know what to call the roles that are expressed. 
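The naming scheme just described is mechanical enough to sketch in code. The following Python fragment is illustrative only; the function name and argument names are assumptions, and the class names come from the flight/airline/airport examples above.

```python
def role_name(head_class, filler_class, marker=None):
    """Construct a role name from domain semantic classes.

    When there is a single salient relation between the two classes,
    simply concatenate the class names. When more than one relation
    is salient, interpolate a preposition or grammatical-relation
    name between them to disambiguate.
    """
    if marker is None:
        return f"{head_class}_{filler_class}"
    return f"{head_class}_{marker}_{filler_class}"

# One salient relation between flights and airlines:
one_relation = role_name("flight", "airline")        # flight_airline
# Two salient relations between flights and airports:
origin = role_name("flight", "airport", "from")      # flight_from_airport
destination = role_name("flight", "airport", "to")   # flight_to_airport
```

Since the names are built only from domain classes and surface markers, no theory-laden labels like "agent" or "experiencer" ever need to be chosen.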
Some phrases have null heads or contentless heads that pose a problem for this approach--for example, Show me the ones on United.</Paragraph> <Paragraph position="7"> In this case, we need to know what the ones refers to in order to know what relation on expresses. Conversely, conjunction can create situations where a phrase provides a role filler for two different heads: Show me the flights and fares to Boston.</Paragraph> <Paragraph position="8"> In a case like this, the notation needs to allow to Boston to supply a role filler for both flights and fares.</Paragraph> </Section> <Section position="9" start_page="128" end_page="129" type="metho"> <SectionTitle> 6. COREFERENCE DETERMINATION ISSUES </SectionTitle> <Paragraph position="0"> We take the term &quot;coreference&quot; very broadly to include a variety of types of constraints from context. Most of the cases that have been considered so far can be classified into three categories: 1. Strict coreference, where one expression denotes exactly the same entity as some other expression: Show the flights from Boston to Dallas and the times they arrive.</Paragraph> <Paragraph position="1"> they = the flights from Boston to Dallas 2. Relational coreference, where one expression denotes something bearing a specific relation to an entity denoted by some other expression: Show flights from Boston to Dallas and discount fares.</Paragraph> <Paragraph position="2"> discount fares = discount fares for flights from Boston to Dallas 3. General constraints from context: I need to go from Boston to Dallas. Show me all the morning flights.</Paragraph> <Paragraph position="3"> the morning flights = the morning flights from Boston to Dallas The current proposal for annotating these relations is to use a combination of co-indexing and expressing contextual constraints by constructing additional pieces of predicate-argument structure (which might or might not be copies of pieces of predicate-argument structure in the context). 
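For strict coreference, the co-indexing part of the proposal can be illustrated with a minimal sketch. The encoding of mentions as (phrase, index) pairs is an assumption made for illustration, not part of the proposal; relational coreference and general contextual constraints would instead attach additional predicate-argument facts.

```python
def coreference_classes(mentions):
    """Group mentions by annotation index: under co-indexing,
    strict coreference is simply the sharing of an index, so each
    index collects the mentions that denote the same entity."""
    classes = {}
    for phrase, index in mentions:
        classes.setdefault(index, []).append(phrase)
    return classes

# The strict-coreference example above: "they" reuses index 1,
# the index of its antecedent, rather than receiving a fresh one.
mentions = [("the flights from Boston to Dallas", 1),
            ("the times", 2),
            ("they", 1)]
classes = coreference_classes(mentions)
```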
The feasibility of specifying these additional pieces of predicate-argument structure in a sufficiently constrained way to yield a canonical representation is currently being assessed.</Paragraph> </Section> <Section position="10" start_page="129" end_page="129" type="metho"> <SectionTitle> 7. ANNOTATION AND TEST ISSUES </SectionTitle> <Paragraph position="0"> One suggestion is an iterative process, whereby a subset of the data would be annotated using a partial lexicon, with the annotators having the option of choosing &quot;none of the above&quot; for a word sense or role relation. A concordance would be produced for the &quot;none of the above&quot; occurrences of each lexical item and role marker, and new word senses and roles would be added to the lexicon based on an analysis of the concordance. It has also been suggested that a threshold be set in terms of frequency of occurrence; until some word sense or role relation exceeded the threshold, it could be left in the none-of-the-above bucket, and none-of-the-above would be deemed the correct answer if that word sense or role relation turned up in test data. (None-of-the-above would also be deemed the correct answer if a completely new word sense or role relation turned up in test data.) We assume that SLS SemEval will work as much as possible like the ATIS CAS evaluations in terms of how we organize the collection and annotation of data and administration of the evaluation. In the general case, the expected process is:</Paragraph> </Section> <Section position="11" start_page="129" end_page="129" type="metho"> <SectionTitle> 8. ANNOTATION AND TEST SOFTWARE </SectionTitle> <Paragraph position="0"> A number of pieces of software will be needed to support the overall process: 1. At multiple sites, data will be collected, transcribed, and shipped to NIST.</Paragraph> <Paragraph position="1"> 2. 
NIST will partition data into training and test sets and ship data to third-party annotators.</Paragraph> <Paragraph position="2"> 3. Annotators will perform classification and annotate word-sense, predicate-argument structure, and coreference, and ship annotations back to NIST. 4. NIST will distribute training data with classifications and annotations to system developers.</Paragraph> <Paragraph position="3"> 5. A committee will resolve issues about how to classify and annotate data and maintain documentation on the same.</Paragraph> <Paragraph position="4"> 6. A mechanism will be established for reporting training data bugs to NIST, having the bugs corrected, and distributing the fixes.</Paragraph> <Paragraph position="5"> 7. NIST will release test data shortly before the evaluation, and participants will submit annotations produced by their systems to NIST for comparison with reference annotations.</Paragraph> <Paragraph position="6"> 8. An adjudication process will be set up to resolve disputes about the transcription, classification, and annotation of test data.</Paragraph> <Paragraph position="7"> Note that for the initial SLS SemEval the first two steps have already been completed, because we will use some of the same data for training and test that has already been collected for ATIS CAS.</Paragraph> </Section> <Section position="12" start_page="129" end_page="129" type="metho"> <Paragraph position="0"> Obtaining consistent annotation of the data is an important requirement. It is clear that a detailed annotation manual and good annotation tools need to be developed and that tight feedback between the annotators and an analog of the ATIS CAS Principles of Interpretation committee will be required.</Paragraph> </Section> <Section position="13" start_page="129" end_page="129" type="metho"> <Paragraph position="1"> 1. 
Annotator aids -- Annotators cannot be expected to create complex annotations for utterances completely by hand. For ATIS CAS, NLParse was used, but for SemEval, NLParse is not suitable. One possibility would be to use one or more participants' systems to produce a first-pass annotation, which the annotators would then correct. There is some concern that this would produce annotations that are biased in favor of the system used to produce the initial structures. This might be partly alleviated by using multiple systems to produce a first pass, perhaps presenting to the annotators only the parts of the annotation that multiple systems agree on. The annotators will also need specialized editing tools tailored to creating and correcting SemEval structures. Such tools have been created by the Penn Treebank project [4] for producing syntactic bracketings of utterances, and it may be possible to adapt these for SemEval.</Paragraph> <Paragraph position="2"> 2. Annotation checker -- The annotations themselves will have a quite complex syntax and semantics. Software to check the resulting annotations will no doubt catch many annotation errors. It might be possible to build this functionality directly into the editing tools.</Paragraph> <Paragraph position="3"> 3. Annotation translators -- We have defined several levels of notation for SemEval. The highest, most syntactically sugared level seems likely to be used by the annotators, and the lowest level seems likely to be that to which the comparator is applied. However many levels are used, software to translate between them will be needed.</Paragraph> </Section> <Section position="14" start_page="129" end_page="129" type="metho"> <SectionTitle> 4. Comparator </SectionTitle> <Paragraph position="0"> It will be necessary to build a comparator for hypotheses and reference answers. 
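As an illustration of what such a comparator might do at the most atomic level of the notation, here is a sketch in Python (the text leaves the implementation language open; the encoding of annotations as sets of facts is an assumption for illustration):

```python
def compare(hypothesis, reference):
    """Compare two annotations encoded as sets of atomic facts,
    e.g. ("funct", 4, "block") or ("arg1", 4, 1). Counting per-fact
    matches gives partial credit, so missing one piece does not
    forfeit credit for the rest (the "linchpin" concern)."""
    hyp, ref = set(hypothesis), set(reference)
    return {"correct": len(hyp & ref),
            "missing": len(ref - hyp),
            "spurious": len(hyp - ref)}

reference = {("funct", 4, "block"), ("arg1", 4, 1),
             ("funct", 5, "blue"), ("arg1", 5, 1)}
hypothesis = {("funct", 4, "block"), ("arg1", 4, 1),
              ("funct", 5, "blue")}    # one fact missing
result = compare(hypothesis, reference)
```

A real comparator would also have to match hypothesis indices to reference indices up to renaming before comparing facts, which this sketch deliberately ignores.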
This needs to be implemented in a way that permits all sites to use it, which would probably make C the implementation language of choice.</Paragraph> <Paragraph position="1"> Another important question is whether a detailed lexicon with patterns illustrating word senses and roles for the majority of the vocabulary will need to be constructed before annotation, or whether this can be developed by the annotators as they proceed.</Paragraph> </Section> </Paper>