<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1042"> <Title>Evaluation of Automatically Identified Index Terms for Browsing Electronic Documents I</Title> <Section position="4" start_page="305" end_page="307" type="evalu"> <SectionTitle> 5. Results </SectionTitle> <Paragraph position="0"> Our results for the three types of terms, by document, are shown in Figure 2. Although we asked subjects to rate three articles, some volunteers rated only two. All results were in-</Paragraph> <Section position="1" start_page="305" end_page="305" type="sub_section"> <SectionTitle> 5.1 Quality </SectionTitle> <Paragraph position="0"> For the three lists of index terms, TTs received the highest ratings for all three documents--an average of 1.79 on the scale of 1 to 5, with 1 being the best rating. HS came in second, with an average of 2.89, and KW came in last with an average of 3.27. It should be noted that averaging the average conceals the fact that the number of TTs is much lower than the other two types of terms, as shown in Figure 1.</Paragraph> <Paragraph position="1"> Figure 3 (included before Appendix A) shows cumulative rankings of terms by method.</Paragraph> <Paragraph position="2"> The X axis represents ratings awarded by subjects. The Y axis reflects the percentage of terms receiving a given rank or better. All data series must reach 100% since every term has been assigned a rating by the evaluators. At any given data point, a larger value indicates that a larger percentage of that series' data has that particular rating or better. For example, 100% of the TTs have a rating of 3 or better; while only about 30% of the terms of the lowest-scoring KW document received a score of 3 or better. In two out of the three documents, HS terms fall between TTs and KWs.</Paragraph> </Section> <Section position="2" start_page="305" end_page="306" type="sub_section"> <SectionTitle> 5.2 Coverage </SectionTitle> <Paragraph position="0"> The graph in Figure 3 shows results for quality, not coverage. In contrast, Figure 4, which shows the total number of terms rated at or below specified rankings, allows us to measure quality and coverage. (1 is the highest rating; 5 is the lowest.) This figure shows that the HS method identifies more high quality terms or below a specified rank TT clearly identifies the highest quality terms: 100% of TTs receive a rating of 2 or better.</Paragraph> <Paragraph position="1"> However, only 8 TTs received a rating of 2 or better (38% of the total), while 41 HSs re- null ceived a rating of 2 or better (26% of the total). This indicates that the TT method misses many high quality terms. KW, the least discriminating method in terms of quality, also provides better coverage than does TT.</Paragraph> <Paragraph position="2"> This result is consistent with our observation that TT identifies the highest quality terms, but there are very few of them: an average of 7 per 500 words compared to over 50 for HS and KW. Therefore there is a need for additional high quality terms. The list of HSs received a higher average rating than did the list of KWs, as shown in Figure 2. 
The list of HSs received a higher average rating than did the list of KWs, as shown in Figure 2. This is consistent with our expectation that phrases containing more content-bearing modifiers would be perceived as more useful index terms than would single word phrases consisting only of heads.</Paragraph>
</Section>
<Section position="3" start_page="306" end_page="307" type="sub_section">
<SectionTitle> 5.3 Ranking variability </SectionTitle>
<Paragraph position="0"> The difference in the average ratings for the list of KWs and the list of head-sorted SNPs was smaller than expected. This small difference can be explained, at least in part, by two factors: 1) differences between professionals and students in inter-subject agreement and reliability; and 2) a discrepancy in the rating of single word terms across term types.</Paragraph>
<Paragraph position="1"> Twenty-two students and seven professionals participated in the study. Figure 5 shows differences between the ratings of professionals and those of students.</Paragraph>
<Paragraph position="2"> When variation in the scores for terms was measured using standard deviation, the standard deviation for the professionals was 0.78, while for the students it was 1.02. Because of the relatively low number of professionals, the standard deviation was calculated only over terms that were rated by more than one professional. A review of the students' results showed that they appeared not to be as careful as the professionals. For example, the phrase 'Wall Street Journal' was included on the HS list only because it is specified as the document source. However, four of the eight students assigned this term a high rating (1 or 2); this is puzzling because the document is about asbestos-related disease. The other four students assigned a 4 or 5 to 'Wall Street Journal', as we expected.</Paragraph>
<Paragraph position="3"> But the average score for this term was 3, due to the anomalous ratings. We therefore have more confidence in the reliability of the professional ratings, even though there are relatively few of them.</Paragraph>
<Paragraph position="4"> We also examined differences in ratings across term types. Single word index terms are rated more highly by professionals when they appear in the context of other single word index terms, but are downrated in the context of phrasal expansions that make the meaning of the one-word term more specific. The KW list and the HS list overlap when the SNP consists only of a single word (the head) or only of a head modified by determiners. When the same word appears in both lists in identical form, the token in the KW list tends to receive a better rating than it does in the HS list, where it is often followed by expansions of the head. For example, the word 'exposure' received an average rating of 2.2 when it appeared on the KW list, but a rating of only 2.75 on the HS list. However, the more specific phrase 'racial quotas', which immediately followed 'quota' on the HS list, received a rating of 1.</Paragraph>
<Paragraph position="5"> To better understand these differences, we selected 40 multi-word phrases, examined the average score that each phrase received in the TT and HS lists, and compared it to the average ratings that the individual words received in the KW list.
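As a concrete illustration of this comparison, here is a minimal sketch of how a phrase's rating could be set against the average rating of its constituent words and assigned to one of the four outcome categories reported next. The thresholds and the sample inputs are illustrative assumptions; the paper does not specify numeric cutoffs.

```python
# Minimal sketch of the phrase-versus-constituent-word comparison described
# above. Ratings are on a 1-5 scale (1 = best); thresholds and inputs are
# illustrative assumptions, not values taken from the paper.

def categorize(phrase_rating, word_ratings, similar=0.5, good=2.5, poor=3.5):
    """Assign a phrase to one of the four outcome categories in the text."""
    word_avg = sum(word_ratings) / len(word_ratings)
    if abs(phrase_rating - word_avg) <= similar:
        return "phrase and words received similar scores (Example 1)"
    if phrase_rating <= good and word_avg >= poor:
        return "phrase scored well, words scored poorly (Example 3)"
    if phrase_rating <= good:
        return "phrase scored well, words ranged from good to poor (Example 2)"
    if word_avg <= good:
        return "phrase scored poorly, words scored well (Example 4)"
    return "no clear pattern"

# Hypothetical case: a phrase rated 1.5 whose constituent words were
# rated 2.0 and 4.5 on the KW list.
print(categorize(1.5, [2.0, 4.5]))  # phrase scored well, words ranged from good to poor
```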
We found that in about half of the cases (21 of 40), the phrase as a whole and the individual words in the phrase received similar scores, as in Example 1 in Figure 6. In just over one-fourth of the cases (12 of 40), the phrase scored well, but the individual words received ratings ranging from good to poor, as in Example 2. In about one-eighth of the cases (6 of 40), the phrase scored well, but the individual words scored poorly, as in Example 3. Finally, in only one case, shown in Example 4 of Figure 6, the phrase scored poorly but the individual words scored well.</Paragraph>
<Paragraph position="6"> This shows that a single word in isolation is judged differently from the same word presented in the context of a larger phrase.</Paragraph>
<Paragraph position="7"> These results have important implications for the design of indexing tools.</Paragraph>
</Section>
</Section>
</Paper>