<?xml version="1.0" standalone="yes"?> <Paper uid="J84-1002"> <Title>A Formal Basis for Performance Evaluation of Natural Language Understanding Systems</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 16 Computational Linguistics, Volume 10, Number 1, January-March 1984 </SectionTitle> <Paragraph position="0"> Giovanni Guida and Giancarlo Mauri A Formal Basis for Performance Evaluation of NLUS ing may be attached to the same expression; that is, expressions are not required to be univocal.</Paragraph> <Paragraph position="1"> Let S be the set of all possible meanings that can be attached to expressions of E.</Paragraph> <Paragraph position="2"> We do not face here the problems of what S actually contains or of how S could be represented explicitly (which mostly pertain to cognitive psychology); let us assume S merely as that basic datum, shared by all humans speaking a given language, which allows effective interpersonal communication.</Paragraph> <Paragraph position="3"> We call the semantics of a natural language the total function f: E → 2^S (into 2^S), which associates to each expression of E the set of all its possible meanings.
Clearly the function f can be computed by any person who understands perfectly the natural language to which the expressions of E belong (theoretical problems concerning subjective interpretation and disagreement between different people are not considered here).</Paragraph> <Paragraph position="4"> Moreover, f(e) = ∅ denotes that no meaning is associated to the expression e, and hence e ∈ L iff f(e) ≠ ∅.</Paragraph> <Paragraph position="5"> Each expression e ∈ E such that |f(e)| ≤ 1 is called a univocal expression.</Paragraph> <Paragraph position="6"> Let now D be a nonempty subset of S that contains meanings all related to a unique subject (&quot;what we are speaking of&quot;, &quot;the topic of the discourse&quot;, &quot;the conceptual competence of a natural language understanding system&quot;); we call D a domain.</Paragraph> <Paragraph position="7"> Let f_D be the restriction of f to D, defined as: f_D(e) = f(e) ∩ D, for any e ∈ E.</Paragraph> <Paragraph position="8"> Let L_D = E − f_D⁻¹(∅) be the restriction of L to D.</Paragraph> <Paragraph position="9"> It is obvious that L_D ⊆ L ⊆ E.</Paragraph> <Paragraph position="10"> Let us now try to formalize the concept of natural language understanding system.</Paragraph> <Paragraph position="11"> The main problem is that of giving a formal representation to the informally defined domain D. To this purpose, we take a finite set of symbols B, called the alphabet, and then we construct a set R of sequences of arbitrary finite length over B (that is, R ⊆ B*), in such a way that to every element d ∈ D an element of R, r = h_D(d), is associated by a bi-univocal function h_D. The sequence r = h_D(d) is called the representation of d, while the set R is called a representation language for D.</Paragraph> <Paragraph position="12"> Obviously, the map h_D⁻¹ is a total function h_D⁻¹: R → D, which associates to every sequence of R its informal meaning in D. Both h_D and h_D⁻¹ are known to man, in the sense that he is able to compute them.
We are now able to formalize the naive notion of natural language understanding system in the following way.</Paragraph> <Paragraph position="13"> Let D ⊆ S be a domain and R a representation language for D. A natural language understanding system U_R/D in R on D is an algorithm that computes a total function g^U_R/D: E → 2^R ∪ {⊥} (into 2^R ∪ {⊥}), where ⊥ is called the undefined symbol; g^U_R/D(e) = ⊥ denotes that U is unable to assign a meaning to the expression e, that is, that it fails in computing g^U_R/D(e) (not that e has no meaning in the domain D!).</Paragraph> <Paragraph position="14"> Note that in the above definition we have assumed that a system U_R/D should accept as input not only expressions of L_D but, generally, all expressions of E. The reason for this choice is that a basic feature of natural language understanding is also to recognize that some expressions are meaningless (they belong to E − L) or are in no way related to a given domain D (they are in L − L_D). Clearly, this feature is often less important than the capability of correctly understanding expressions of L_D, but this can be appropriately taken into account when defining a measure of performance. Measuring the performance of a natural language understanding system U_R/D may now be defined as evaluating how well U_R/D is capable of explicitly representing in R the meaning of expressions of E.</Paragraph> <Paragraph position="15"> To define such a notion in quantitative terms we can first extend the bi-univocal function h_D: D → R to the function (bi-univocal if ⊥ is not considered) ĥ_D: 2^D → 2^R, defined as ĥ_D(x) = {h_D(d) | d ∈ x},</Paragraph> <Paragraph position="17"> for x ∈ 2^D.</Paragraph> <Paragraph position="18"> Figure 1 illustrates the definitions of the functions f, f_D, h_D, ĥ_D, and g^U_R/D presented above.
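The formal objects just defined can be made concrete in a small sketch (Python is used here purely for illustration; the names `UNDEF` and `make_system` are ours, not the paper's):

```python
# A toy rendering of g: E -> 2^R ∪ {⊥}: expressions are strings, and
# representations in R are strings of a hypothetical representation language.
UNDEF = object()  # stands for the undefined symbol: the system fails on this input

def make_system(lexicon):
    """Build a toy understanding system from a lookup table.

    `lexicon` maps an expression to the set of representations the system
    assigns to it; expressions absent from the table make the system fail,
    while an empty set means "no meaning in the domain D".
    """
    def g(expression):
        if expression not in lexicon:
            return UNDEF  # the system cannot process e at all
        return frozenset(lexicon[expression])
    return g

g = make_system({"turn on the light": {"cmd(light,on)"},
                 "colourless green ideas": set()})
```

The distinction between `UNDEF` (failure) and the empty set (recognized as meaningless in D) mirrors the remark in the definition above.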
Considering now the three functions f_D, ĥ_D, and g^U_R/D defined above, if we denote ĥ_D ∘ f_D = ĝ_D, the performance of U_R/D can then be expressed as the degree of precision to which g^U_R/D approaches ĝ_D over E.</Paragraph> <Paragraph position="19"> This task raises, however, some difficult problems. Two basic questions are: (i) how to define the &quot;difference&quot; between g^U_R/D and ĝ_D over E in such a way as to match the intuitive notion of performance; (ii) how to measure such a &quot;difference&quot; in practice, that is, through an effective experimental procedure. Both of these problems are discussed in the following sections (the former in sections 3 and 4, and the latter in section 5).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. A Theoretical Framework </SectionTitle> <Paragraph position="0"> Before tackling the core topic of this section in a formal way, let us examine from an intuitive point of view the basic requirements for a measure of performance π to be reasonably acceptable. The primary goal is that it should allow consistent comparison among different systems, in the sense that if π(U_1) = π(U_2) the behaviour of the two systems U_1 and U_2 should be sufficiently similar, and that if π(U_1) > π(U_2), U_1 should perform better than U_2.</Paragraph> <Paragraph position="1"> Furthermore, this comparison should be as fine and precise as possible, in such a way as to capture all the essential features of the behaviour of a system U in a given domain.</Paragraph> <Paragraph position="2"> Finally, comparison might be between two different systems, between two versions of the same system, between a system and a given set of issues, or between a system and an independent scale (Tennant 1980).</Paragraph> <Paragraph position="3"> To capture the intuitive notion of performance according to the above requirements, at least two points of view seem worth considering.
First, a measure of performance should give a numerical value for the &quot;distance&quot; between the two functions g^U_R/D and ĝ_D; that is, the measure should allow us to formalize how nearly g^U_R/D(e) approaches ĝ_D(e) for any e ∈ E, or, more explicitly, how well each expression e ∈ E is understood by the system U. Second, it should weight this notion of &quot;distance&quot; in such a way as to take into account the fact that, generally, it is not equally important to understand well every expression in E; for example, it could be reasonable to suppose that correct understanding of expressions in L_D is far more relevant than in E − L_D, or that correct understanding is more important for frequently used expressions than for unusual and rare ones.</Paragraph> <Paragraph position="4"> According to the above remarks, an appropriate notion of performance π will depend on two basic parameters: (i) the shifting μ between g^U_R/D(e) and ĝ_D(e) for any e ∈ E; (ii) the importance ρ for any expression e ∈ E to be correctly understood. Different choices of μ and ρ clearly provide different notions of performance, π[μ,ρ], that fit different needs for capturing particular classes of features in a natural language understanding system.</Paragraph> <Paragraph position="5"> Let us now go further in defining an appropriate formal framework embedding the above ideas. In the following, we shall omit in f_D, g^U_R/D, and ĝ_D the superscript U and the subscripts R/D and D, whenever this will not cause ambiguities.</Paragraph> <Paragraph position="6"> Let R be a representation language for a domain D ⊆ S. A shifting function μ on R is a function μ: (2^R ∪ {⊥}) × 2^R → [0,1], such that: - for each pair (r,r'), μ(r,r') = 0 iff r = r'; - there exists a pair (r,r') such that μ(r,r') = 1.
From an intuitive point of view, μ(g(e),ĝ(e)) represents the &quot;difference&quot; between the (set of) meaning(s) of e computed by a natural language understanding system U, which is expressed by g(e), and its correct (set of) meaning(s) ĝ(e). Hence, the value μ(g(e),ĝ(e)) = 0 denotes perfect understanding of e, while μ(g(e),ĝ(e)) = 1 denotes the worst case of misunderstanding of e.</Paragraph> <Paragraph position="7"> Given the set E of all expressions of a natural language of length less than or equal to an appropriately fixed integer n, an importance function ρ on E is a function ρ: E → [0,1].</Paragraph> <Paragraph position="8"> Intuitively, ρ(e) represents the importance that the meaning of e is correctly understood by the system U. The value ρ(e) = 0 denotes that it is not at all important whether e is understood correctly or incorrectly; values of ρ(e) greater than 0 denote greater importance for e to be understood correctly. Given a shifting function μ on R and an importance function ρ on E, a performance measure π for natural language understanding systems U_R/D is the function</Paragraph> <Paragraph position="9"> π[μ,ρ](U_R/D) = (Σ_{e∈E} ρ(e) · μ(g(e),ĝ(e))) / (Σ_{e∈E} ρ(e)).</Paragraph> <Paragraph position="10"> Clearly, π ranges from the value 0, in the case where all expressions of E are correctly understood, to the value 1, in the case where all expressions are completely (that is, in the worst manner) misunderstood, independently of the choice of ρ (of course, ρ ≡ 0 is not allowed, being meaningless).</Paragraph> <Paragraph position="11"> π[μ,ρ] provides a very synthetic representation of the performance of U that can be useful in several cases of evaluation and comparison. A richer and more informative picture of the performance of a system U, fully coherent with the above definitions, can be obtained in the following way for the cases where the ranges of μ and ρ are finite.
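For a finite, enumerable E the performance measure can be computed directly; the following sketch is illustrative (the helper name `performance` and the toy data are ours):

```python
def performance(expressions, g, g_hat, mu, rho):
    """Importance-weighted average shifting over E.

    expressions: the finite set E; g: the system's understanding function;
    g_hat: the correct meanings; mu: shifting function; rho: importance.
    """
    den = sum(rho(e) for e in expressions)
    if den == 0:
        raise ValueError("rho identically 0 is not allowed")
    num = sum(rho(e) * mu(g(e), g_hat(e)) for e in expressions)
    return num / den

# Toy check: two expressions, one understood correctly, uniform importance.
E = ["a", "b"]
g = {"a": "x", "b": "y"}.get       # system output
g_hat = {"a": "x", "b": "z"}.get   # correct meanings
mu1 = lambda r1, r2: 0.0 if r1 == r2 else 1.0
pi = performance(E, g, g_hat, mu1, lambda e: 1.0)
```

With the boolean shifting above, `pi` is simply the weighted fraction of misunderstood expressions.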
For given shifting μ and importance ρ, let range(μ) = {δ_1, ..., δ_n} and range(ρ) = {ω_1, ..., ω_m}. Then we pose: E_{i,j} = {e | μ(g(e),ĝ(e)) = δ_i and ρ(e) = ω_j}, for i ∈ {1, ..., n} and j ∈ {1, ..., m}.</Paragraph> <Paragraph position="12"> Clearly, ∪ E_{i,j} = E and all E_{i,j} are pairwise disjoint.</Paragraph> <Paragraph position="13"> Therefore, {E_{i,j}} is a partitioning of E.</Paragraph> <Paragraph position="14"> p_{i,j} = |E_{i,j}| / |E|,</Paragraph> <Paragraph position="15"> for i ∈ {1, ..., n} and j ∈ {1, ..., m}. (We recall that E has been assumed to be finite, and hence so is E_{i,j} ⊆ E.)</Paragraph> <Paragraph position="16"> The n × m matrix [p_{i,j}] is called the μ-ρ-profile of U and provides a far more informative representation of the performance of U than the value π[μ,ρ]. In fact, [p_{i,j}] allows one to discover and analyse the specific features of the system, going beyond the global value π[μ,ρ].</Paragraph> <Paragraph position="17"> The relation between [p_{i,j}] and π[μ,ρ] is straightforward:</Paragraph> <Paragraph position="18"> π[μ,ρ] = (Σ_{i,j} p_{i,j} · δ_i · ω_j) / (Σ_{i,j} p_{i,j} · ω_j).</Paragraph> <Paragraph position="19"> Note that [p_{i,j}] depends on μ and ρ only through the partitioning {E_{i,j}} they induce on E, but it is independent of the actual values of δ_i and ω_j.</Paragraph> <Paragraph position="20"> Different choices of μ and ρ clearly provide different measures of performance that can be compared, in general, only on a qualitative and intuitive basis.</Paragraph> <Paragraph position="21"> Therefore, evaluating the performance of a system U requires first the definition of μ and ρ, and then the computation of π[μ,ρ]. Clearly, the most critical of these two steps is, from a conceptual point of view, the first, as it completely determines the &quot;goodness&quot; of the measure and its actual matching with desired intuitive requirements.
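When the ranges of μ and ρ are finite, the μ-ρ-profile can be tabulated as a sparse matrix keyed by (δ_i, ω_j) pairs; the sketch below (our names, toy data) builds p_{i,j} = |E_{i,j}| / |E|:

```python
from collections import Counter

def mu_rho_profile(expressions, g, g_hat, mu, rho):
    """Map each cell (delta_i, omega_j) to the fraction of expressions of E
    falling in the class E_ij, i.e. the entry p_ij of the profile matrix."""
    cells = Counter((mu(g(e), g_hat(e)), rho(e)) for e in expressions)
    return {cell: n / len(expressions) for cell, n in cells.items()}

# Toy example with a boolean shifting and uniform importance.
E = ["a", "b", "c", "d"]
g = {"a": "x", "b": "y", "c": "y", "d": "w"}.get
g_hat = {"a": "x", "b": "z", "c": "y", "d": "v"}.get
mu1 = lambda r1, r2: 0.0 if r1 == r2 else 1.0
profile = mu_rho_profile(E, g, g_hat, mu1, lambda e: 1.0)
```

The entries of the returned dictionary sum to 1, matching the fact that {E_{i,j}} is a partitioning of E.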
The second is only difficult from the computational point of view, since E is usually very large and, hence, it is not possible to evaluate the sum in the definition of π[μ,ρ] in a direct, exhaustive way.</Paragraph> <Paragraph position="22"> In the next section we discuss in detail the problem of appropriately defining μ and ρ, while section 5 is devoted to the topic of actually computing π[μ,ρ].</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Importance Parameters </SectionTitle> <Paragraph position="0"> Having discussed in the previous section an abstract theory of performance evaluation, we now deal with some implementations of it that may be of practical interest. Clearly, an implementation is obtained by assigning actual functions as values for the (functional) parameters μ and ρ in the definition of π. Different choices of μ and ρ will yield different models for performance evaluation and will allow one to analyse different features of the systems to be evaluated. Since μ and ρ are fully independent parameters, we shall deal with each separately.</Paragraph> <Paragraph position="1"> Let us begin with the shifting function μ; in order that only the effect of μ be relevant to π, we shall suppose throughout the following discussion that ρ has the constant value ρ(e) = 1 for any e ∈ E.</Paragraph> <Paragraph position="2"> The simplest case is that where μ may assume only two (boolean) values, 0 and 1, denoting a correct and a wrong understanding, respectively.
Such a boolean shifting function is denoted by μ_1 and formally defined as:</Paragraph> <Paragraph position="3"> μ_1(r',r'') = 0 if r' = r'', and 1 otherwise,</Paragraph> <Paragraph position="4"> for any pair (r',r'') ∈ (2^R ∪ {⊥}) × 2^R.</Paragraph> <Paragraph position="5"> The intuitive meaning of μ_1, when used to evaluate a natural language understanding system U, is straightforward: π[μ_1,1](U) = x denotes the percentage of expressions of E that U is unable to understand correctly (clearly, 1 − x is the percentage of expressions correctly understood by U).</Paragraph> <Paragraph position="6"> The above definition of μ is very crude; in fact, systems with the same π[μ_1,1] can show very different behaviour, and, furthermore, π[μ_1,1](U_1) &lt; π[μ_1,1](U_2) does not generally ensure that U_1 performs better than U_2.</Paragraph> <Paragraph position="7"> A slight improvement can be obtained by splitting the case r' ≠ r'' into two subcases that cover, when evaluating U, the following situations: (i) U is unable to assign a meaning to an expression e (that is, it fails); hence, g(e) = r' = ⊥ ≠ r'' = ĝ(e); (ii) U assigns to an expression e a meaning that is not the correct one; hence, g(e) = r' ≠ r'' = ĝ(e). It seems quite reasonable that generally case (i) is less serious than case (ii), so that we can propose a new definition of shifting μ_2: μ_2(r',r'') = 0 if r' = r''; δ if r' = ⊥; 1 if r' ≠ ⊥ and r' ≠ r'', where δ ∈ (0,1).</Paragraph> <Paragraph position="8"> Clearly, the choice of δ strongly affects the values of π[μ_2,1](U) and will depend on how much we want to distinguish between cases (i) and (ii) mentioned above.</Paragraph> <Paragraph position="9"> Going further to propose more fitting definitions of μ, we may want to analyze in more detail the case r' ≠ ⊥ and r' ≠ r''.
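The two shifting functions μ_1 and μ_2 translate directly into code; the sketch below uses `None` for the undefined symbol and treats δ as a constructor parameter (names are ours):

```python
UNDEF = None  # stands for the undefined symbol in this sketch

def mu1(r1, r2):
    """Boolean shifting: 0 iff the system's output equals the correct one."""
    return 0.0 if r1 == r2 else 1.0

def make_mu2(delta):
    """mu2: failure (undefined output) costs delta, a wrong answer costs 1."""
    assert delta > 0.0 and 1.0 > delta   # delta must lie strictly in (0,1)
    def mu2(r1, r2):
        if r1 == r2:
            return 0.0
        return delta if r1 is UNDEF else 1.0
    return mu2
```

Choosing, say, `make_mu2(0.4)` expresses the judgment that failing outright is less serious than answering wrongly.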
Recalling that r' and r'' are sets of strings in R, we can distinguish the following cases: (i) U assigns to an expression e the value ∅ (that is, no meaning), while it has a well-defined meaning; (ii) U assigns to an expression e a proper nonempty subset of its meanings; (iii) U assigns to an expression e all its correct meanings and, in addition, other incorrect ones; (iv) U assigns to an expression e a proper nonempty subset of its meanings and, in addition, other incorrect ones; (v) U assigns to an expression e a nonempty set of meanings that is fully different from the correct one.</Paragraph> <Paragraph position="10"> Formally, we can define the shifting μ_3 that covers all such situations by:</Paragraph> <Paragraph position="11"> μ_3(r',r'') = 0 if r' = r''; δ_1 if r' = ∅ ≠ r''; δ_2 if ∅ ≠ r' ⊂ r''; δ_3 if r' ⊃ r''; δ_4 if r' ∩ r'' ≠ ∅ with r' ⊄ r'' and r' ⊅ r''; δ_5 if r' ≠ ∅ and r' ∩ r'' = ∅,</Paragraph> <Paragraph position="12"> where δ_i ∈ (0,1), for i = 1, 2, 3, 4, 5.</Paragraph> <Paragraph position="13"> It could reasonably be assumed that δ_1 &lt; δ_2 &lt; δ_3 &lt; δ_4 &lt; δ_5, since the situations to which they are attached are generally considered as denoting increasing degrees of misunderstanding (note that μ_3 deals in great detail with the case of ambiguous understanding, where at least one of r' or r'' is not a singleton).</Paragraph> <Paragraph position="14"> Along the line of reasoning shown in the above definitions, several other improvements are possible.</Paragraph> <Paragraph position="15"> For example, we can further refine the above case (v), r' ≠ ∅ and r' ∩ r'' = ∅, by taking into account the actual structure of the elements of r' and r''. R being a well-defined formal language, we can first define an appropriate notion of &quot;distance&quot; between elements of R, and then extend it to nonempty disjoint elements of 2^R.</Paragraph> <Paragraph position="16"> This kind of refinement is particularly significant when both r' and r'' are singletons, that is, when understanding is not ambiguous, as is often the case.
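The five-way case analysis of μ_3 can be sketched over frozensets; the penalties `d1..d5` are assumed values in (0,1), and folding the failure symbol into case (i) is our simplification, not something the text prescribes:

```python
def make_mu3(d1, d2, d3, d4, d5):
    """Five-way shifting over sets of representations (frozensets here),
    one penalty per case (i)-(v), with increasing degrees of misunderstanding."""
    assert 0 < d1 < d2 < d3 < d4 < d5 < 1
    def mu3(r1, r2):
        if r1 == r2:
            return 0.0
        if r1 is None or not r1:      # (i) no meaning assigned at all
            return d1
        if r1.issubset(r2):           # (ii) proper nonempty subset
            return d2
        if r1.issuperset(r2):         # (iii) all correct plus spurious ones
            return d3
        if r1 & r2:                   # (iv) partial overlap plus spurious ones
            return d4
        return d5                     # (v) disjoint from the correct meanings
    return mu3
```

Since equality is tested first, `issubset`/`issuperset` here always detect proper containment.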
Also, it generally allows far more meaningful definitions of shifting, thus further approaching the intuitive notion of &quot;distance&quot; as &quot;degree of understanding&quot;. Let us turn our attention now to the importance function ρ.</Paragraph> <Paragraph position="17"> Also for this function, a first simple proposal can be a boolean definition: no importance at all is assigned to expressions in E − L_D and the same (non-null) importance to every expression in L_D. So we can define ρ_1 as:</Paragraph> <Paragraph position="18"> ρ_1(e) = 1 if e ∈ L_D, and 0 otherwise.</Paragraph> <Paragraph position="19"> A refinement of ρ_1 can be obtained by analyzing the case e ∈ L_D and taking into account the frequency of use of expressions in L_D. This will give more importance to the correct understanding of more frequently used expressions and less importance to that of rare or unusual ones. From the human point of view, it is obvious that texts with a greater frequency are used, and hence understood, by a larger number of people.</Paragraph> <Paragraph position="20"> Therefore, it seems meaningful to consider a system that can understand quite well the relatively small number of the most common texts and fails on the most unusual ones, to be better than a system that understands a lot of very rare texts but often fails in understanding the most common ones.</Paragraph> <Paragraph position="21"> Formally, we can define the frequency of expressions of E as a map z: E → [0,1], with the constraint</Paragraph> <Paragraph position="22"> Σ_{e∈E} z(e) = 1, and pose ρ_2(e) = z(e), for any e ∈ E.</Paragraph> <Paragraph position="23"> The frequency function z(e) can be effectively determined by collecting, through an appropriate experimental activity, a meaningful bag of texts T, in which each e ∈ E appears with a given integer multiplicity m(e), and then by computing</Paragraph> <Paragraph position="24"> z(e) = m(e) / Σ_{e'∈E} m(e').</Paragraph> <Paragraph position="25"> A totally different criterion that could be used to refine the definition of importance
functions is the structural complexity of the expressions of E (or of L_D).</Paragraph> <Paragraph position="26"> A very crude notion of structural complexity is simply given by the length of an expression e. In this case, given a chain 0 = ℓ_0 &lt; ℓ_1 &lt; ... &lt; ℓ_{m−1} of m non-negative integers, we can partition E into m classes:</Paragraph> <Paragraph position="27"> E_i = {e | ℓ_{i−1} &lt; |e| ≤ ℓ_i}, for i = 1, ..., m−1, and E_m = {e | |e| &gt; ℓ_{m−1}}.</Paragraph> <Paragraph position="28"> Then, a new importance function ρ_3 is defined by:</Paragraph> <Paragraph position="29"> ρ_3(e) = ω_i if e ∈ E_i,</Paragraph> <Paragraph position="30"> where ω_i ∈ [0,1], for i = 1, ..., m.</Paragraph> <Paragraph position="31"> It is worth noting that the length of a text is not independent of its frequency of use; we feel that in several application domains (such as, for example, man-machine interaction) short texts are much more frequent than long ones and that texts exceeding a given length are not used at all.</Paragraph> <Paragraph position="32"> A more refined notion of structural complexity of an expression may be given by taking into account its syntactic structure, defined on the basis of an appropriate set of characteristic features - see, for example, the classification proposed in Tennant (1980). E can be partitioned into different and disjoint classes E_i, according to the set of syntactical features they match, and an importance function ρ_4 can be defined as above:</Paragraph> <Paragraph position="33"> ρ_4(e) = ω_i if e ∈ E_i,</Paragraph> <Paragraph position="34"> where ω_i ∈ [0,1], for i = 1, ..., m.</Paragraph> <Paragraph position="35"> Let us note that, contrary to the above illustrated relation between the length of a text and its frequency, it seems reasonable to consider syntactical complexity as fully independent of frequency; in fact, quite complex syntactical features (such as ellipsis, anaphora, broken text, etc.)
are frequently found in several application domains.</Paragraph> <Paragraph position="36"> Finally, a couple of other possible choices for assigning the importance function ρ are worth mentioning: one based on the notions of &quot;information content&quot; or &quot;structural complexity&quot; according to Kolmogorov (1965, 1968), and the other based on the concept of &quot;semantic complexity&quot; of an expression, which could be formally defined, for example, in the representation language R. However, some more theoretical work on these notions is necessary before we can use them for our needs; hence we will not develop them further here.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Measuring Performance in Practice </SectionTitle> <Paragraph position="0"> In the preceding sections, some theoretical tools for measuring the performance of a natural language understanding system have been illustrated. At this point we have to put them to work: that is, we must discuss how the performance of a system can actually be evaluated and how the comparison between two different systems can be carried out.</Paragraph> <Paragraph position="1"> We distinguish two steps in the process of performance evaluation: (i) to assign the functions μ and ρ; (ii) to compute π[μ,ρ].</Paragraph> <Paragraph position="2"> Let us examine each of the two points in detail.</Paragraph> <Paragraph position="3"> The choice of the shifting function μ depends only on the degree to which we want to refine the notion of error in understanding and on the varying importance we want to assign to each type of error. Hence choosing appropriate values for μ in order to analyse particular features of the system to be evaluated is often only a matter of subjective judgment.
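The frequency-based and length-based importance functions of section 4 (ρ_2 and ρ_3) can be realized concretely; in this sketch the helper names and the class weights are ours, chosen only for illustration:

```python
import bisect
from collections import Counter

def frequency_from_bag(bag):
    """Estimate z(e) = m(e) / sum of multiplicities from a collected bag of
    texts T, where m(e) is the multiplicity of e in T (as in section 4)."""
    m = Counter(bag)
    total = sum(m.values())
    return {e: count / total for e, count in m.items()}

def make_rho2(z, in_domain):
    """rho2: frequency-weighted importance, zero outside L_D."""
    return lambda e: z.get(e, 0.0) if in_domain(e) else 0.0

def make_rho3(cutpoints, weights):
    """rho3: importance by length class. `cutpoints` is an ascending chain
    of length bounds; `weights` gives one omega_i in [0,1] per class."""
    assert len(weights) == len(cutpoints) + 1
    def rho3(expression):
        # First class whose upper bound covers the expression's length;
        # lengths beyond the last cutpoint fall into the final class.
        return weights[bisect.bisect_left(cutpoints, len(expression))]
    return rho3
```

Both constructors return ordinary functions E → [0,1], so either can be plugged directly into a performance computation as the parameter ρ.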
Also, the definition of μ is strongly dependent on the representation language R for the domain D: the richer and more structured R is, the more refined and subtle are the possible definitions of μ.</Paragraph> <Paragraph position="4"> On the contrary, the choice of the importance function ρ can generally be based on more objective arguments, once an appropriate ranking among the desired understanding capabilities of the system to be evaluated has been defined. For example, in the case where the frequency of texts is taken into account, an appropriate experimental activity can provide reliable statistical estimations of the frequency z(e) of each expression e ∈ E, thus allowing the effective computation of ρ(e). (Problems connected with the choice of a meaningful sample to estimate z(e) - which could easily involve many millions of expressions - are not dealt with here, since they are more related to statistics than to computational linguistics.) Clearly, the choice of μ and ρ fully determines the numerical value of π[μ,ρ] (or of the matrix [p_{i,j}]) in correspondence to a given system U. How a change in μ or ρ can affect π[μ,ρ] is generally impossible to</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> predict, since this strongly depends on the particular features of U. Therefore, evaluating a system with different choices of μ or ρ can indeed provide a clearer image of its performance.
Although the comparison between different values of π obtained with different pairs (μ,ρ) is often only a matter of intuitive reasoning, an interesting particular case that can be conveniently dealt with formally is briefly sketched below.</Paragraph> <Paragraph position="1"> A shifting function μ' is a refinement of a shifting function μ (μ' ⊒ μ) iff: - range(μ) = {δ_1, ..., δ_n}, with δ_1 &lt; δ_2 &lt; ... &lt; δ_n; - range(μ') = {δ'_1, ..., δ'_n'}, with δ'_1 &lt; δ'_2 &lt; ... &lt; δ'_n' and n' &gt; n; - the partitioning {E'_i} of E induced by μ' is a refinement of the partitioning {E_i} of E induced by μ; - each class E_i is the union of classes E'_i', where</Paragraph> <Paragraph position="2"/> <Paragraph position="3"> In an analogous way we can define the refinement ρ' of an importance function ρ (ρ' ⊒ ρ).</Paragraph> <Paragraph position="4"> A pair (μ',ρ') is a refinement of a pair (μ,ρ) (we write (μ',ρ') ⊒ (μ,ρ)) iff μ' ⊒ μ and ρ' ⊒ ρ.</Paragraph> <Paragraph position="5"> It is straightforward to prove that: For any system U and any two pairs (μ,ρ) and (μ',ρ'), (μ',ρ') ⊒ (μ,ρ) implies π[μ',ρ'](U) ≤ π[μ,ρ](U).</Paragraph> <Paragraph position="6"> For example, the shifting function μ_3 in section 4 refines μ_2, which in turn refines μ_1; that is, μ_1 ⊑ μ_2 ⊑ μ_3. For the importance functions, on the other hand, none of ρ_1, ρ_2, ρ_3, ρ_4 in section 4 is a refinement of any other.</Paragraph> <Paragraph position="7"> It is worth noting that, when defining appropriate pairs (μ,ρ) to evaluate a system, there are basically two ways of reasoning for comparing different choices: the first is to start from a basic proposal and to proceed through successive refinements until the desired degree of precision and detail is reached; the second consists in proposing functions corresponding to several different points of view and then integrating them together in a well-balanced synthesis.
Generally, the first approach is appropriate for the definition of μ, while the second can be utilized for the choice of ρ.</Paragraph> <Paragraph position="8"> Let us turn now to the problem of computing π[μ,ρ], once μ and ρ have been assigned.</Paragraph> <Paragraph position="9"> Obviously, it is unrealistic to compute the exact value of π by considering the behaviour of the system with respect to every expression e ∈ E. Hence, a sequence of test cases has to be considered (Gold 1967). Figure 2 shows a model for experimental performance evaluation. A GENERATOR provides at each time instant i (i = 1, 2, ...) an expression e_i ∈ E. Then, the system U to be evaluated computes the meaning g(e_i), which is compared by μ with the correct meaning ĝ(e_i) supplied by an EVALUATOR (a person supposed to be able to compute ĝ, that is, both f and ĥ_D). Finally, the value ρ(e_i) is computed, and the current value of</Paragraph> <Paragraph position="10"> π_i = (Σ_{k≤i} ρ(e_k) · μ(g(e_k),ĝ(e_k))) / (Σ_{k≤i} ρ(e_k))</Paragraph> <Paragraph position="11"> is determined.</Paragraph> <Paragraph position="12"> The major problem with the computation of π is the design of the GENERATOR, that is, the choice of the sample of E to be used for the evaluation of the system U.</Paragraph> <Paragraph position="13"> The mathematically simplest case is the one where a subset B ⊆ E is randomly generated on the basis of a given probability distribution on E (for example, equiprobability); then,</Paragraph> <Paragraph position="14"> π_B = (Σ_{e∈B} ρ(e) · μ(g(e),ĝ(e))) / (Σ_{e∈B} ρ(e))</Paragraph> <Paragraph position="15"> is a random variable such that E(π_B) = π for reasonable distributions, where E(π_B) denotes the expectation of π_B. The value of E(π_B) may be estimated by means of statistical techniques such as, for example, maximum likelihood estimation. Here, we will not give a detailed account of such techniques.
They can easily be found in classical works on statistics and sampling theory (Kobayashi 1978; Cox, Hinkley 1977; Mood, Graybill 1980), when needed.</Paragraph> <Paragraph position="16"> A different technique would be that of fixing a confidence interval, and then establishing the number n of tests to be generated in order to obtain the value of π[μ,ρ] within the given confidence level, by means, for example, of χ² techniques.</Paragraph> <Paragraph position="17"> In addition to these elementary statistical methods, more sophisticated sampling techniques can be used.</Paragraph> <Paragraph position="18"> This requires us first to choose a partitioning of E into meaningful classes, and then to define a sample stratified according to the considered partitioning. In this</Paragraph> <Paragraph position="20"/> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> case, the GENERATOR might not work on a purely random basis.</Paragraph> <Paragraph position="1"> All the above-mentioned techniques are independent of the choice of ρ, and do not take into account specific goals that could be assigned to performance evaluation (for example, syntactic capabilities, linguistic or conceptual competence, etc.). Such general-purpose methods can sometimes provide an evaluation that is too global and not meaningful enough. Moreover, the sample to be used for the computation of π is generally very large and hard to collect.</Paragraph> <Paragraph position="2"> Special-purpose evaluation, centered on the analysis of some specific features of U, can often be more interesting and easier to implement. In this case, the specific goal of the measurement should be carefully taken into account in the definition of ρ, and both the goal and ρ should direct the choice of the appropriate sample of E to be used for the experimental computation of π.
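The GENERATOR/EVALUATOR loop with a uniformly random GENERATOR can be sketched as follows (illustrative Python; in a real session ĝ is supplied by a human evaluator, here it is a lookup):

```python
import random

def estimate_pi(universe, g, g_hat, mu, rho, sample_size=1000, seed=0):
    """Compute pi_B on a random sample B drawn uniformly (with replacement)
    from the universe of expressions, accumulating the running estimate."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(sample_size):
        e = rng.choice(universe)             # GENERATOR step
        num += rho(e) * mu(g(e), g_hat(e))   # EVALUATOR supplies g_hat(e)
        den += rho(e)
    return num / den if den else 0.0
```

Stratified sampling would replace `rng.choice` with per-class draws over a chosen partitioning of E.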
More precisely, an experimental (special-purpose) evaluation session could be organized as follows: 1. precisely identifying the system U, the domain D, and the representation language R; 2. defining the goals of the evaluation; 3. deciding which samples of E to collect and how to collect them; 4. defining μ; 5. defining ρ (and how to compute it for the chosen samples); 6. computing π (and/or [p_{i,j}]). Note that several μ and ρ could generally be considered for a careful experimentation. Moreover, steps 3, 4, and 5 might require, in critical cases, specific pre-experimentation and some refinement loops for appropriate tuning.</Paragraph> <Paragraph position="3"> In the appendix, a limited case study experimentation is briefly discussed.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 6. Discussion and Future Research Directions </SectionTitle> <Paragraph position="0"> In this paper we have presented a model for performance evaluation of natural language understanding systems. The main task of this model is that of providing a basis for a quantitative measure of how well a system can understand natural language, thus allowing an objective and experimental comparison of the performance of different systems.</Paragraph> <Paragraph position="1"> Before discussing some open problems and illustrating the main lines of future research, let us briefly discuss some further features of our approach by comparing it to the classical work by Tennant (1979, 1980) and by Finin, Goodman, and Tennant (1979).</Paragraph> <Paragraph position="2"> Tennant's proposal is based on the three main concepts of habitability, completeness, and abstract analysis.
This last point is not considered here, as explained in section 1 (see further in this section for its possible relevance to future work); we therefore focus on the first two.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> From a naive point of view, habitability is used to test whether or not the system does what it was designed to do; completeness is introduced to test whether or not the system meets users' requirements. More precisely, Tennant introduces the two notions of coverage and completeness to denote, respectively, the capabilities (both conceptual and linguistic) that the designer has put within a system, and (similarly to Woods, Kaplan, Nash-Webber 1972, though differing from Woods 1977) the degree to which the capabilities expected by a set of users can actually be found in the system coverage. Furthermore, habitability denotes (quite differently from Watt 1968) the degree to which a system can actually exhibit the capabilities that it was designed to have.</Paragraph> <Paragraph position="1"> Our approach is based on a slightly different model and provides in some sense a refinement of the above concepts.</Paragraph> <Paragraph position="2"> We denote by the term competence the capabilities that a system is actually able to show, while by the term coverage we refer, following Tennant, to the theoretical capabilities that a system should have as a consequence of its design specifications.</Paragraph> <Paragraph position="3"> More precisely, the conceptual coverage of a system U_R/D is formalized in our model by the domain D, which represents, in fact, the range of concepts that are within the domain of discourse of a given application.
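To make these formal objects concrete, here is a small toy instantiation (ours, not the paper's) of the model introduced earlier: a finite set of expressions E, a set of meanings S, a semantics f: E → 2^S, a domain D ⊆ S with its restriction f_D, and the induced languages L and L_D. All the example strings and meanings are invented for illustration.

```python
# Toy instantiation of the formal model: E, S, f, D, f_D, L, L_D.

E = {"open the valve", "close the valve", "sing me a song", "glorp blik"}
S = {"OPEN(valve)", "CLOSE(valve)", "PERFORM(song)"}
D = {"OPEN(valve)", "CLOSE(valve)"}          # domain: valve control only

f = {                                        # semantics f: E -> 2^S
    "open the valve": {"OPEN(valve)"},
    "close the valve": {"CLOSE(valve)"},
    "sing me a song": {"PERFORM(song)"},
    "glorp blik": set(),                     # f(e) empty: e is not in L
}

def f_D(e):
    """Restriction of f to the domain D: f_D(e) = f(e) intersected with D."""
    return f[e] & D

L = {e for e in E if f[e]}                   # L   = E - f^{-1}(empty set)
L_D = {e for e in E if f_D(e)}               # L_D = E - f_D^{-1}(empty set)

assert L_D <= L <= E                         # L_D is a subset of L, L of E
```

Note that "sing me a song" is meaningful in general (it belongs to L) but falls outside L_D, since its only meaning lies outside the valve-control domain.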
The linguistic coverage clearly includes L_D but, generally, is not limited to L_D, since understanding a language in a given domain also implies the capability of recognizing that some expressions are not meaningful in that domain.</Paragraph> <Paragraph position="4"> In general, for a given importance function ρ, we can assume that the linguistic coverage is defined by: LW_D = {e | e∈E and ρ(e)>λ}, where λ (0<λ<1) is a fixed bound.</Paragraph> <Paragraph position="5"> The linguistic competence can then be defined as: L′_D = {e | e∈LW_D and g(e) = f_D(e)}, and the conceptual competence as:</Paragraph> <Paragraph position="7"> Our definition of performance π[μ,ρ] tries to give a global idea of how well the competence of a system (without distinction between conceptual and linguistic aspects) approaches its coverage. This measure is quite similar to, and provides a refinement of, the concept of habitability, involving also to some extent the notion of completeness. In fact, both the choice of D as an adequate domain and the definition of ρ as a suitable importance function (and, therefore, of LW_D) implicitly refer to a set of users and hence to completeness.</Paragraph> <Paragraph position="8"> It is apparent that the proposal introduced in this paper demands further work, both theoretical and experimental, in order to obtain fully adequate tools for performance evaluation.</Paragraph> <Paragraph position="9"> First of all, some of the concepts presented here have to be further discussed and expanded. For example, in the definition of π, we have normalized it with respect to ρ in such a way that μ and ρ are given the same importance (in this case the value π=1 would be reached only when all expressions of E are fully misunderstood, that is, μ≡1, and when it is important at the highest degree that each of them is correctly understood, that is, ρ≡1). While we have preferred here the first definition, arguments could be given in favour of the second.
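The coverage and competence sets just defined can be sketched on a toy example. Everything below is an illustrative assumption of ours: the bound value 0.1, the particular system U and importance function ρ, and the normalization of π as a ρ-weighted average of a misunderstanding measure μ in [0,1] (one candidate form consistent with the property that π=1 only when μ≡1 and ρ≡1; the paper's exact formula is given earlier in the text).

```python
LAM = 0.1                                    # assumed fixed bound, 0 < lam < 1

def coverage(E, rho, lam=LAM):
    """Linguistic coverage LW_D = {e in E : rho(e) > lam}."""
    return {e for e in E if rho(e) > lam}

def competence(E, rho, system, f_D, lam=LAM):
    """Expressions of LW_D that the system U interprets exactly as f_D does."""
    return {e for e in coverage(E, rho, lam) if system(e) == f_D(e)}

def pi(E, mu, rho):
    """Assumed normalization: pi = 1 exactly when mu(e) = 1 and rho(e) = 1
    for every expression e of E."""
    return sum(mu(e) * rho(e) for e in E) / len(E)

# hypothetical data: four expressions, their importance, true and computed meanings
E = {"a", "b", "c", "d"}
rho = lambda e: {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.05}[e]
f_D = lambda e: {"a": {"A"}, "b": {"B"}, "c": {"C"}, "d": set()}[e]
system = lambda e: {"a": {"A"}, "b": {"B"}, "c": set(), "d": set()}[e]  # U fails on "c"

LW_D = coverage(E, rho)                      # {"a", "b", "c"}: "d" is below the bound
comp = competence(E, rho, system, f_D)       # {"a", "b"}: U misunderstands "c"
```

Here the competence set is a proper subset of the coverage set, which is exactly the gap that π[μ,ρ] is meant to quantify.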
A second critical point is the definition of the μ-ρ-profile [p_i,j]. This could be further extended so as to provide a picture along several dimensions (features such as frequency, syntactic complexity, information content, etc.). Third, it is worthwhile considering and improving the notion of refinement: in fact, the present definition is not stable with respect to the choice of μ and ρ. That is, it could be that, given two systems U and U′: π[μ,ρ](U) < π[μ,ρ](U′) and, for some refinement (μ′,ρ′) of (μ,ρ): π[μ′,ρ′](U) > π[μ′,ρ′](U′), so that the refinement of the evaluation criteria may give an inversion of the first evaluation. The above concepts are summarized in Figure 3.</Paragraph> <Paragraph position="10"> A formal development of the three points mentioned above will</Paragraph> </Section> </Paper>