<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1106"> <Title>Language &amp; Information Engineering</Title> <Section position="5" start_page="846" end_page="848" type="evalu"> <SectionTitle> 4 Results and Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="846" end_page="848" type="sub_section"> <SectionTitle> 4.1 Precision/Recall for Terminology Extraction </SectionTitle> <Paragraph position="0"> For each of the different candidate sets, we incrementally examined portions of the ranked output lists returned by each of the three measures we considered. The precision values for the various portions were computed such that, at each percentage point of the list, the number of true positives found (i.e., the number of terms) was scaled against the overall number of candidate items returned. This yields the (descending) precision curves in Figures 1, 2 and 3 and some associated values in Table 4.</Paragraph> <Paragraph position="1"> [Table 4 caption fragment: … selected portions of the ranked list] First, we observe that, for the various n-gram candidate sets examined, all measures outperform the baselines by far, and, thus, all are potentially useful measures for grading termhood. Still, the P-Mod criterion substantially outperforms all other measures at almost all points for all n-grams examined. Considering 1% of the bigram list (i.e., the first 673 candidates), precision for P-Mod is 20 points higher than for the t-test and the C-value. At 1% of the trigram list (i.e., the first 310 candidates), P-Mod's lead is 7 points. Considering 1% of the quadgrams (i.e., the first 108 candidates), the t-test actually leads by 7 points. At 10% of the quadgram list, however, the P-Mod precision score has overtaken the other ones. 
With increasing portions of all ranked lists considered, the precision curves start to converge toward the baseline, but P-Mod maintains a steady advantage.</Paragraph> <Paragraph position="2"> The (ascending) recall curves in Figures 1, 2 and 3 and their corresponding values in Table 5 indicate which proportion of all true positives (i.e., the proportion of all terms in a candidate set) is identified by a particular measure at a certain point of the ranked list. For term extraction, recall is an even better indicator of a particular measure's performance because finding a bigger proportion of the true terms at an early stage is simply more economical.</Paragraph> <Paragraph position="3"> [Table 5 caption fragment: … recall scores for biomedical term extraction] Again, our linguistically motivated terminology extraction algorithm outperforms its competitors, and with respect to tri- and quadgrams, its gain is even more pronounced than for precision. In order to get a 0.5 recall for bigram terms, P-Mod only needs to winnow 29% of the ranked list, whereas the t-test and C-value need to winnow 35% and 37%, respectively. For trigrams and quadgrams, P-Mod only needs to examine 19% and 20% of the list, respectively, whereas the other two measures have to scan almost 10 additional percentage points. In order to obtain a 0.6, 0.7, 0.8 and 0.9 recall, the differences between the measures narrow for bigram terms, but they widen substantially for tri- and quadgram terms. To obtain a 0.6 recall for trigram terms, P-Mod only needs to winnow 27% of its output list, while the t-test and C-value must consider 38% and 40%, respectively.</Paragraph> <Paragraph position="4"> For a level of 0.7 recall, P-Mod only needs to analyze 36%, while the t-test already searches 50% of the ranked list. For 0.8 recall, this relation is 50% (P-Mod) to 63% (t-test), and at recall point 0.9, 68% (P-Mod) to 77% (t-test). 
For quadgram term identification, the results for P-Mod are equally superior to those for the other measures, and at recall points 0.8 and 0.9 even more pronounced than for trigram terms.</Paragraph> <Paragraph position="5"> We also tested the significance of differences for these results, both comparing P-Mod vs. t-test and P-Mod vs. C-value. Because in all cases the ranked lists were taken from the same set of candidates (viz. the set of bigram, trigram, and quadgram candidate types), and hence constitute dependent samples, we applied the McNemar test (Sachs, 1984) for statistical testing. We selected 100 measure points in the ranked lists, one after each increment of one percent, and then used the two-tailed test at a confidence level of 95%. Table 6 lists the number of significant differences for these measure points at intervals of 10 for the bi-, tri-, and quadgram results. For the bigram differences between P-Mod and C-value, all of them are significant, and between P-Mod and t-test, all are significantly different up to measure point 70 (see footnote 12). Looking at the tri- and quadgrams, although the number of significant differences is lower than for bigrams, the vast majority of measure points are still significantly different, which underlines the superior performance of the P-Mod measure.</Paragraph> <Paragraph position="6"> [Table 6 caption fragment: … quadgrams using the two-tailed McNemar test at 95% confidence interval] Footnote 12: As can be seen in Figures 1, 2 and 3, the curves start to merge at the higher measure points and, thus, the number of significant differences decreases.</Paragraph> </Section> <Section position="2" start_page="848" end_page="848" type="sub_section"> <SectionTitle> 4.2 Domain Independence and Corpus Size </SectionTitle> <Paragraph position="0"> One might suspect that the results reported above could be attributed to the corpus size. Indeed, the text collection we employed in this study is rather large (104 million words). 
Other text genres and domains (e.g., clinical narratives, various engineering domains) or even more specialized biological sub-domains (e.g., plant biology) do not offer such a plethora of free-text material as the molecular biology domain. To test the effect a drastically shrunken corpus size might have, we assessed the terminology extraction methods for trigrams on a much smaller subset of our original corpus, viz. on 10 million words. These results are depicted in Figure 4.</Paragraph> <Paragraph position="1"> [Figure 4 caption fragment, truncated: … extraction on the 10-million-word corpus (cutoff c ≥ 4, with 6,760 term candidate types)] The P-Mod extraction criterion still clearly outperforms the other ones on that 10-million-word corpus, both in terms of precision and recall. We also examined whether the differences were statistically significant and applied the two-tailed McNemar test on 100 selected measure points. Comparing P-Mod with t-test, most significant differences could be observed between measure points 20 and 80, with almost 80% to 90% of the points being significantly different. These significant differences were even more pronounced when comparing the results between P-Mod and C-value.</Paragraph> </Section> </Section> </Paper>
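<!--
Editorial sketch (not part of the original paper). The evaluation in Section 4.1 reads precision and recall off a ranked candidate list at each percentage point and compares two measures with a two-tailed McNemar test. The Python below is one possible way to reproduce that bookkeeping, under stated assumptions: all function names are hypothetical, the cutoff is rounded down to a whole number of candidates, a candidate counts as "classified as a term" when it falls inside the examined portion, and the exact binomial form of the McNemar test is used because the source does not say which variant was applied.

```python
from math import comb


def precision_recall_at_percent(ranked_is_term, percent):
    """Precision and recall over the top `percent` of a ranked candidate list.

    ranked_is_term holds one boolean per candidate, best-ranked first;
    True marks a candidate that the gold standard accepts as a term.
    """
    total_terms = sum(ranked_is_term)
    # Size of the examined portion, rounded down (assumption), at least 1.
    k = max(1, len(ranked_is_term) * percent // 100)
    hits = sum(ranked_is_term[:k])
    precision = hits / k
    recall = hits / total_terms if total_terms else 0.0
    return precision, recall


def percent_needed_for_recall(ranked_is_term, target_recall):
    """Smallest whole-percent portion whose recall reaches the target."""
    for percent in range(1, 101):
        _, recall = precision_recall_at_percent(ranked_is_term, percent)
        if recall >= target_recall:
            return percent
    return None


def discordant_counts(ranking_a, ranking_b, gold_terms, percent):
    """Count candidates that only one of two rankings classifies correctly.

    Both rankings must order the same candidate set (dependent samples).
    """
    k = max(1, len(ranking_a) * percent // 100)
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    only_a = only_b = 0
    for cand in ranking_a:
        is_term = cand in gold_terms
        correct_a = (cand in top_a) == is_term
        correct_b = (cand in top_b) == is_term
        if correct_a and not correct_b:
            only_a += 1
        elif correct_b and not correct_a:
            only_b += 1
    return only_a, only_b


def mcnemar_exact_p(only_a, only_b):
    """Two-sided exact (binomial) McNemar p-value from discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data: 10 candidates, 4 of which are gold-standard terms.
    toy = [True, True, False, True, False, False, True, False, False, False]
    print(precision_recall_at_percent(toy, 20))  # (1.0, 0.5)
    print(percent_needed_for_recall(toy, 0.5))   # 20
```

Calling percent_needed_for_recall for targets 0.5 through 0.9 would give figures of the kind quoted in the text (for example, the portion of the list a measure must winnow to reach 0.5 recall). A difference at a measure point would be treated as significant at the 95% level when mcnemar_exact_p returns a value below 0.05.
-->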