<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1017">
  <Title>Session 3: Human Language Evaluation</Title>
  <Section position="2" start_page="0" end_page="99" type="metho">
    <SectionTitle>
2. WHAT TO EVALUATE?
</SectionTitle>
    <Paragraph position="0"> Once we decide to evaluate, the first question is what to evaluate? Where do we put probes to inspect the input and output, in order to perform an evaluation? This issue is discussed in the Sparck Jones paper\[ 1 \]. In some cases, we can evaluate the language technology in isolation from any front-end or back-end application, as shown in Figure 1, where probes are inserted on either side of the language interface itself. This gives us the kind of evaluation used for word error rate in speech (speech in, transcription out) or for machine translation, as proposed in the Brew/Thompson paper (source text in, target text out)\[2\]. This kind of evaluation computes output as a simple function of input to the language system.</Paragraph>
    <Paragraph position="1"> Unfortunately, it is not always possible to measure a meaningful output- for example, researchers have struggled long and hard with measurements for understanding - how can a system demonstrate that it has understood? If we had a general semantic representation, then we could insert a probe on the output side of the semantic component, independent of any specific application. The last three papers (\[3, 4, 5\]) take various approaches to the issue of predicate-argument  of a correct parse, the Treebank annotation enabled researchers to take the next step in agreeing to use the parse annotations (bracketings) as a &amp;quot;gold standard&amp;quot; against which to compare system-derived bracketings\[9\]. This evaluation, in turn, has enabled interesting automated teaming approaches to parsing.</Paragraph>
    <Paragraph position="2">  structure in an attempt to define a more semantically-based and application-independent measure.</Paragraph>
    <Paragraph position="3"> Right now, we can only measure understanding by evaluating an interface coupled to an application - Figure 2 shows the application back-end included inside the evaluation. This allows us to evaluate understanding in terms of getting the right answer for a specific task, as is done in the Air Travel In formation (ATIS) system, which evaluates language input/database answer output pairs. However, this means that to evaluate spoken language understanding, it is necessary to build an entire air travel information system.</Paragraph>
    <Paragraph position="4"> E EEE! 7!.:-:::::.-.. ::-:-.-..:\[ !E\[EiE~ ! !\[\[ ~ i{77\[i\[i7i7 iii~iil~,.~ i E EEEE 77\[ o UTPUT L-C~&gt;'~i&amp;quot;I:~:;:~:,:~:~:~!~:~:~:,:,:~:,:,~ i ..................................................... ~:7:iiil&amp;quot;l -~&amp;quot;  Finally, for certain kinds of applications, particularly interactive applications, it is appropriate to enlarge the scope of evaluation still further to include the users. For interactive systems, this is particularly important because the user response determines what the system does next, so that it is not possible to use pre-recorded data. 2 Increasingly complex human-computer interfaces, as well as complex collaborative tools, demand that a system be evaluated in its overall context of use (see Figure 3).</Paragraph>
    <Paragraph position="6"/>
  </Section>
  <Section position="3" start_page="99" end_page="99" type="metho">
    <SectionTitle>
3. HOW TO EVALUATE
</SectionTitle>
    <Paragraph position="0"> We must not only decide what inputs and outputs to use for evaluation; we must decide how to evaluate these input/output 2pre-recorded data allows the same dam to be used by all participating sites, effectively removing human variability as a factor in the evaluation. pairs as well. Evaluation seems relatively easy when there is an intuitive pairing between input and output, for example, between speech signal and transcription at the word or sentence level. The task is much more complex when there is either no representation for the output (how to represent understanding?) or in situations where the result is not unique: what is the correct translation of a particular text? What is the best response to a particular query? For such cases, it is often expedient to rely on human judgements, provided that these judgements (or relative judgements) are reproducible, given a sufficient number of judges. Evaluation of machine translation systems\[lO\] has used human judges to evaluate systems with differing degrees of interactivity and across different language pairs. The Brew and Thompson paper\[2\] also describes reliability of human judges in evaluating machine translation systems. Human judges have also been used in end-to-end evaluation of spoken language interfaces\[11\].</Paragraph>
  </Section>
  <Section position="4" start_page="99" end_page="99" type="metho">
    <SectionTitle>
4. WHERE TO GO FROM HERE?
</SectionTitle>
    <Paragraph position="0"> Because evaluation plays such an important role in driving research, we must weigh carefully what and how we evaluate. Evaluation should be theory neutral, to avoid bias against novel approaches; it should also push the frontiers of what we know how to do; and finally, it should support a broad range of research interests because evaluation is expensive. It requires significant community investment in infrastructure, not to mention time devoted to running evaluations and participating in them. For example, we estimate that the ATIS evaluation required several person-years to prepare annotated data, a staffof two to three people at NIST over several months to run the evaluation, time spent agreeing on standards, and months of staff effort at participating sites. Altogether, the annual cost of an evaluation certainly exceeds five person-years, or conservatively at least $500,000 per evaluation. Given this level of investment, it is critical to co-ordinate effort and obtain maximum leverage.</Paragraph>
    <Paragraph position="1"> The last three papers\[3, 4, 5\] all reflect a concern to develop better evaluation methods for semantics, with a shared focus on predicate-argument evaluation. The Treebank annotation paper\[3\] discusses the new predicate-argument annotation work under Treebank. The paper by Grishman discusses a range of new evaluation efforts for MUC, which are aimed at providing finer grained component evaluations. The last paper, by Moore, describes a similar, but distinct, effort towards developing more semantic evaluation methods for the spoken language community.</Paragraph>
  </Section>
class="xml-element"></Paper>