<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1035">
  <Title>Toward a Task-based Gold Standard for Evaluation of NP Chunks and Technical Terms</Title>
  <Section position="3" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 Experimental design
</SectionTitle>
    <Paragraph position="0"> Our experiment assesses the index terms vis a vis their usefulness in a strictly controlled information access task. Subjects responded to a set of questions whose answers were contained in a 350 page collegelevel text (Rice, Ronald E., McCreadie, Maureen and Chang, Shan-ju L. (2001) Accessing and Browsing Information and Communication. Cambridge, MA: MIT Press.) Subjects used the Experimental Searching and Browsing Interface (ESBI) which forces them to access text via the index terms; direct text searching was prohibited. 25 subjects participated in the experiment; they were undergraduate and graduate students at Rutgers University. The experiments were conducted by graduate students at the Rutgers</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 ESBI (Experimental Searching and Browsing Interface)
</SectionTitle>
      <Paragraph position="0"> Browsing Interface (ESBI) to find the answers to the questions. After an initial training session, ESBI presents the user with a Search/Browse screen (not shown); the question appears at the top of the screen.</Paragraph>
      <Paragraph position="1"> The subject may enter a string to search for in the index, or click on the &amp;quot;Browse&amp;quot; button for access to the whole index. At this point, &amp;quot;search&amp;quot; and &amp;quot;browse&amp;quot; apply only to the list of index terms, not to the text.</Paragraph>
      <Paragraph position="2"> The user may either browse the entire list of index terms or may enter a search term and specify criteria to select the subset of terms that will be returned.</Paragraph>
      <Paragraph position="3"> Most people begin with the latter option because the complete list of index terms is too long to be easily browsed. The user may select (click on) an index term to view a list of the contexts in which the term appears. If the context appears useful, the user may choose to view the term in its full context; if not, the user may either do additional browsing or start the process over again.</Paragraph>
      <Paragraph position="4"> Figure 1 shows a screen shot of ESBI after the searcher has entered the string democracy in the search box. This view shows the demo question and the workspace for entering answers. The string was (previously) entered in the search box and all index terms that include the word democracy are displayed.</Paragraph>
      <Paragraph position="5"> Although it is not illustrated here, ESBI also permits substring searching and the option to specify case sensitivity.</Paragraph>
      <Paragraph position="6"> Regardless of the technique by which the term was identified, terms are organized by grammatical head of the phrase. Preliminary analysis of our results has shown that most subjects like this analysis, which resembles standard organization of back-of-the-book indexes.</Paragraph>
      <Paragraph position="7"> Readers may notice that the word participation appears at the left-most margin, where it represents the set of terms whose head is participation. The indented occurrence represents the individual term.</Paragraph>
      <Paragraph position="8"> Selecting the left-most occurrence brings up contexts for all phrases for which participation is a head. Selecting on the indented occurrence brings up contexts for the noun participation only when it is not part of a larger phrase. This is explained to subjects during the pre-experimental training and an experimenter is present to remind subjects of this distinction if a question arises during the experiment.</Paragraph>
      <Paragraph position="9"> Readers may also notice that in Figure 1, one of the terms, participation require, is ungrammatical.</Paragraph>
      <Paragraph position="10"> This particular error was caused by a faulty part-of-speech tag. But since automatically identified index terms typically include some nonsensical terms, we have left these terms in - these terms are one of the problems that information seekers have to cope with in a realistic task-based evaluation.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 Questions
</SectionTitle>
      <Paragraph position="0"> After conducting initial testing to find out what types of questions subjects founder hard or easy, we spent considerable effort to design a set of 26 questions of varying degrees of difficulty. To obtain an initial assessment of difficulty, one of the experimenters used ESBI to answer all of the questions and rate each question with regard to how difficult it was to answer using the ESBI system. For example, the question What are the characteristics of Marchionini's model of browsing? was rated very easy because searching on the string marchionini reveals an index term Marchionini's which is linked to the text sentence: Marchionini's model of browsing considers five interactions among the information-seeking factors of &amp;quot;task, domain, setting, user characteristics and experience, and system content and interface&amp;quot; (p.107). The question What factors determine when users decide to stop browsing? was rated very difficult because searching on stop (or synonyms such as halt, cease, end, terminate, finish, etc.) reveals no helpful index terms, while searching on factors or browsing yields an avalanche of over 500 terms, none with any obvious relevance.</Paragraph>
      <Paragraph position="1"> After subjects finished answering each question, they were asked to rate the question in terms of its difficulty. A positive correlation between judgments of the experimenters and the experimental subjects (Sharp et al., under submission) confirmed that we had successfully devised questions with a range of difficulty. In general, questions that included terms actually used in the index were judged easier; questions where the user had to devise the index terms were judged harder.</Paragraph>
      <Paragraph position="2"> To avoid effects of user learning, questions were presented to subjects in random order; in the one hour experiment, subjects answered an average of about 9 questions.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Terms
</SectionTitle>
      <Paragraph position="0"> Although the primary goal of this research is to point the way to improved techniques for automatic creation of index terms, we used human created terms to create a baseline. For the human index terms, we used the pre-existing back-of-the-book index, which we believe to be of high quality.</Paragraph>
      <Paragraph position="1">  The two techniques for automatic identification were the technical terms algorithm of Justeson and Katz (1995) and the head sorting method (Dagan and Church (1994); Wacholder (1998). In the implementation of the Justeson and Katz' algorithm, technical terms are multi-word NPs repeated above some threshold in a corpus; in the head sorting method, technical terms are identified by grouping noun phrases with a common head (e.g., health-care workers and asbestos workers), and selecting as terms those NPs whose heads appear in two or more phrases. Definitionally, technical terms are a proper subset of terms identified by Head Sorting. Differences in the implementations, especially the pre-processing module, result in there being some terms identified by Termer that were not identified by Head Sorting.</Paragraph>
      <Paragraph position="2"> Table 2 shows the number of terms identified by each method. (*Because some terms are identified by more than one technique, the percentage adds up to more than 100%.) The fewest terms (673) were identified by the human method; in part this reflects the judgment of the indexer and in part it is a result of restrictions on index length in a printed text. The largest number of terms (7980) was identified by the head sorting method. This is because it applies looser criteria for determining a term than does the Justeson and Katz algorithm which imposes a very strict standard--no single word can be considered a term, and an NP must be repeated in full to be considered a term.</Paragraph>
      <Paragraph position="3">  identification Wacholder et al. (2000) showed that when experimental subjects were asked to assess the usefulness of terms for an information access task without actually using the terms for information access showed that the terms identified by the technical term algorithm, which are considerably fewer than the terms identified by head sorting, were overall of higher quality than the terms identified by the head sorting method. However, the fact that subjects assigned a high rank to many of the terms identified by Head Sorting suggested that the technical term algorithm was failing to pick up many potentially useful index terms.</Paragraph>
      <Paragraph position="4"> In preparation for the experiment, all index terms were merged into a single list and duplicates were removed, resulting in a list of nearly 10,000 index terms.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 Tracking results
</SectionTitle>
      <Paragraph position="0"> In the experiment, we logged the terms that subjects searched for (i.e., entered in a search box) and selected. In this paper, we report only on the terms that the subjects selected (i.e., clicked on). This is because if a subject entered a single word, or a sub-part of a word in the search box, ESBI returned to them a list of index terms; the subject then selected a term to view the context in which it appears in the text. This term might have been the same term originally searched for or it might have been a superstring. The terms that subjects selected for searching are interesting in their own right, but are not analyzed here.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="3" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> At the outset of this experiment, we did not know whether it would be possible to discover differences in human preferences for terms in the information access task reported on in this paper. We therefore started our research with the null hypothesis that all index terms are created equal. If users selected index terms in roughly the same proportion as the terms occur in the text, the null hypothesis would be proven.</Paragraph>
    <Paragraph position="1"> The results strongly discredit the null hypothesis.</Paragraph>
    <Paragraph position="2"> Table 3 shows that when measured by percentage of terms selected, subjects selected on over 13.2% of the available human terms, but only 1.73% and 1.43% respectively of the automatically selected terms. Table 3 also shows that although the human index terms formed only 6% of the total number of index terms, 40% of the terms which were selected by subjects in order to view the context were identified by human indexing. Although 80% of the index terms were identified by head sorting, only 51% of the terms subjects chose to select had been identified by this method. (*Because of overlap of terms selected by different techniques, total is greater than 100%)  method.</Paragraph>
    <Paragraph position="3"> To determine whether the numbers represent statistically significant evidence that the null hypothesis is wrong, we represent the null hypothesis (H</Paragraph>
    <Paragraph position="5"> and the falsification of the null hypothesis (H</Paragraph>
    <Paragraph position="7"> is the expected percentage of the selected terms that are type i in all the selected terms; u</Paragraph>
    <Paragraph position="9"> pected percentage if there is no user preference, i.e.</Paragraph>
    <Paragraph position="10"> the proportion of this term type i in all the terms. We rewrite the above as (3).</Paragraph>
    <Paragraph position="12"> Assuming that X is normally distributed, we can use a one-sample t test on X to decide whether to accept the hypothesis (1). The two-tailed t test (df =222) produces a p-value of less than .01% for the comparison of the expected and selected proportions of a) human terms and head sorted terms and b) human terms and technical terms. In contrast, the p-value for the comparison of head-sorted and technical terms was 33.7%, so we draw no conclusions about relative preferences for head sorted and technical terms.</Paragraph>
    <Paragraph position="13"> We also considered the possibility that our formulation of questions biased the terms that the subjects selected, perhaps because the words of the questions overlapped more with the terms selected by one of the methods.</Paragraph>
    <Paragraph position="14">  We took the following steps: 1) For each search word, calculate the number of terms overlapping with it from each source.</Paragraph>
    <Paragraph position="15"> 2) Based on these numbers, determine the proportion of terms provided by each method.</Paragraph>
    <Paragraph position="16"> 3) Sum the proportions of all the search words.</Paragraph>
    <Paragraph position="17"> As measured by the terms the subjects saw during browsing, 22% were human terms, 62% were head sorted terms and 16% were technical terms. Using the same reasoning about the null hypothesis as above, the p-value for the comparison of the ratios of human and head sorted terms was less than 0.01%, as was the comparison of the ratios of the human and technical terms. This supports the validity of the results of the initial test. In contrast, the p-value for the comparison of the two automatic techniques was 77.3%. Why did the subjects demonstrate such a strong preference for the human terms? Table 4 illustrates some important differences between the human terms and the automatically identified terms. The terms selected on are longer, as measured in number of words, and more complex, as measured by number of prepositions per index terms and by number of content-bearing words. As shown in Table 5, the difference of these complexity measures between human terms and automatically identified terms are statistically significant.</Paragraph>
    <Paragraph position="18"> Since longer terms are more specific than shorter terms (for example, participation in a democracy is longer and more specific than democracy), the results suggest that subjects prefer the more specific terms. If this result is upheld in future research, it has practical implications for the design of automatic term identification systems.</Paragraph>
    <Paragraph position="19">  tailed t-test on index term complexity. The numbers in the cells are p-value of the test.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Relationship between Term Source and
Search Effectiveness
</SectionTitle>
      <Paragraph position="0"> In this paper, our primary focus is on the question of what makes index terms 'better', as measured by user preferences in a question-answering task. Also of interest, of course, is what makes index terms 'better' in terms of how accurate the resulting users' answers are. The problem is that any facile judgment of free-text answer accuracy is bound to be arbitrary and potentially unreliable; we discuss this in detail in [26]. Nevertheless, we address the issue in a preliminary way in the current paper. We used an ad hoc set of canonical answers to score subjects' answers on a scale of 1 to 3, where 1 stands for 'very accurate', 2 stands for 'partly accurate' and 3 represents 'not at all accurate'. Using general loglinear regression (Poisson model) under the hypothesis that these two variables are independent of each other, our analysis showed that there is a systematic relationship (significance probability is 0.0504) between source of selected terms and answer accuracy. Specifically, in cases where subjects used more index terms identified by the human indexer, the answers were more accurate.</Paragraph>
      <Paragraph position="1"> On the basis of our initial accuracy judgments, we can therefore draw the preliminary conclusion that terms that were better in that they were preferred by the experimental subjects were also better in that they were associated with better answers. We plan to conduct a more in-depth analysis of answer accuracy and will report on it in future work.</Paragraph>
      <Paragraph position="2"> But the primary question addressed in this paper is how to reliably assess NP chunks and technical terms. These results constitute experimental evidence that the index terms identified by the human indexer constitute a gold standard, at least for the text used in the experiment. Any set of index terms, regardless of the technique by which they were created or the criteria by they were selected, can be compared vis a vis their usefulness in the information access task.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> The contribution of this paper is the description of a task-based gold-standard method for evaluating the usefulness and therefore the quality of NP chunks and technical terms. In this section, we address a number of questions about this method.</Paragraph>
    <Paragraph position="1">  1) What properties of terms can this technique be used to study? * One word or many. There are two parts to  the process of identifying NP terms: NP chunks that are candidate terms must be identified and candidate terms must be filtered in order to select a subset appropriate for use in the intended application. Justeson and Katz (1995) is an example of an algorithm where the process used for identifying NP chunks is also the filtering process. A byproduct of this technique is that single-word terms are excluded. In part, this is because it is much harder to determine in context which single words actually qualify as terms. But dictionaries of technical terminology have many one-word terms.</Paragraph>
    <Paragraph position="2"> * Simplex or complex NPs (e.g., Church 1988; Hindle and Rooth 1991; Wacholder 1998) identify simplex or base NPs - NPs which do not have any component NPs -- at least in part because this bypasses the need to solve the quite difficult attachment problem, i.e., to determine which simpler NPs should be combined to output a more complex NP. But if people find complex NPs more useful than simpler ones, it is important to focus on improvement of techniques to reliably identify more complex terms.</Paragraph>
    <Paragraph position="3"> * Semantic and syntactic terms variants.</Paragraph>
    <Paragraph position="4"> Daille et al. (1996), Jacquemin (2001) and others address the question of how to identify semantic (synonymous) and syntactic variants. But independent of the question of how to recognize variants is the question of which variants are to be preferred for different kinds of uses.</Paragraph>
    <Paragraph position="5"> * Impact of errors. Real-world NLP systems have a measurable error rate. By conducting experiments in which terms with errors are include in the set of test terms, the impact of these errors can be measured. The usefulness of a set of terms presumably is at least in part a function of the impact of the errors, whether the errors are a by-product of the algorithm or the implementation of the algorithm. null 2) Could the set of human index terms be used as a gold standard without conducting the human subject experiments? This of course could be done, but then the terms are being evaluated by a fixed standard - by definition, no set of terms can do better than the gold standard. This experimental method leaves open the possibility that there is a set of terms that is better than the gold standard. In this case, of course, the gold standard would no longer be a gold standard -- perhaps we would have to call it a platinum standard.</Paragraph>
    <Paragraph position="6"> 3) How reproducible is the experiment? The experiment can be re-run with any set of terms deemed to be representative of the content of the Rice text. The preparation of the materials for additional texts is admittedly time-consuming. But over time a sizable corpus of experimental materials in different domains could be built up. These materials could be used for training as well as for testing.</Paragraph>
    <Paragraph position="7"> 4) How extensible is the gold standard? The experimental protocol will be validated only if equally useful index terms can be created for other texts. We anticipate that they can.</Paragraph>
    <Paragraph position="8"> 5) How can this research help in the design of real world NLP systems? This technique can help in assessing the relative usefulness of existing techniques for identifying terms. It is possible, for example, there already exist techniques for identifying terms that are superior to the two tested here. If we can find such systems, their algorithms should be preferred. If not, there remains a need for development of algorithms to identify single word terms and complex phrases. 6) Do the benefits of this evaluation technique outweigh the costs? Given the fundamental difficulty of evaluating NP chunks and technical terms, task-based evaluation is a promising supplement to evaluation by precision and recall. These relatively time-consuming human subject experiments surely will not be undertaken by most system developers; ideally, they should be performed by neutral parties who do not have a stake in the outcome.</Paragraph>
    <Paragraph position="9"> 7) Should automated indexes try to imitate human indexers? Automated indexes should contain terms that are most easily processed by users. If the properties of such terms can be reliably discovered, developers of systems that identify terms intended to be processed by people surely should pay attention.</Paragraph>
  </Section>
class="xml-element"></Paper>