<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1099"> <Title>You Can't Beat Frequency (Unless You Use Linguistic Knowledge) - A Qualitative Evaluation of Association Measures for Collocation and Term Extraction</Title> <Section position="5" start_page="785" end_page="787" type="metho"> <SectionTitle> 3 Methods and Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="785" end_page="786" type="sub_section"> <SectionTitle> 3.1 Qualitative Criteria </SectionTitle> <Paragraph position="0"> Because the various metrics assign a score to each candidate indicating to what degree it qualifies as a collocation or term (or not), these candidates should ideally be ranked in such a way that the following two conditions are met: * true collocations or terms (i.e., the true positives) are ranked in the upper portion of the output list.</Paragraph> <Paragraph position="1"> * non-collocations or non-terms (i.e., the true negatives) are ranked in the lower portion of the output list.1 [Footnote 1: Obviously, this goal is similar to ranking documents according to their relevance for information retrieval.] While a trivial solution to the problem might be to simply count the number of occurrences of candidates in the data, employing more sophisticated statistics-based, information-theoretic, or even linguistically motivated algorithms for grading term and collocation candidates is guided by the assumption that this additional level of sophistication yields more adequate rankings relative to these two conditions.</Paragraph> <Paragraph position="2"> Several studies (e.g., Evert and Krenn (2001), Krenn and Evert (2001), Frantzi et al. (2000), Wermter and Hahn (2004)), however, have already observed that ranking the candidates merely by their frequency of occurrence fares quite well compared with various more sophisticated association measures (AMs, such as the t-test, log-likelihood, etc.).
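To make this comparison concrete, the following minimal sketch ranks the same candidate set once by raw co-occurrence frequency and once by the textbook t-test association score (in the form popularized by Manning and Schütze (1999)). All counts are hypothetical toy values, not figures from the paper's corpora:

```python
import math

# Hypothetical counts for illustration only (not from the paper's corpora):
# each candidate maps to (c1, c2, c12) = unigram counts and co-occurrence count.
N = 1_000_000  # assumed corpus size in tokens
candidates = {
    "strong_collocation": (160, 3000, 40),
    "frequent_pair":      (30000, 60000, 2500),
    "rare_pair":          (50, 40, 12),
}

def t_score(c1, c2, c12, N):
    """Textbook t-test association score:
    t = (x_bar - mu) / sqrt(x_bar / N), where x_bar = c12/N is the observed
    relative frequency and mu = (c1/N) * (c2/N) is the expected relative
    frequency under the independence null hypothesis."""
    x_bar = c12 / N
    mu = (c1 / N) * (c2 / N)
    return (x_bar - mu) / math.sqrt(x_bar / N)

# Frequency ranks by c12 alone; the t-test reranks by its score.
by_freq = sorted(candidates, key=lambda w: candidates[w][2], reverse=True)
by_t = sorted(candidates, key=lambda w: t_score(*candidates[w], N), reverse=True)

# With these toy counts the two rankings happen to coincide, echoing the
# observation that Frequency is a surprisingly strong baseline.
print(by_freq)
print(by_t)
```

The example is only meant to show what such an AM computes on top of the raw count; whether the reranking actually improves on Frequency is exactly the question the qualitative criteria below are designed to answer.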
In particular, the precision/recall comparisons between the various AMs in Evert and Krenn (2001) and Krenn and Evert (2001) yield a rather inconclusive picture as to whether sophisticated statistical AMs are actually more viable than frequency counting.</Paragraph> <Paragraph position="3"> Commonly used statistical significance testing (e.g., the McNemar or the Wilcoxon signed-rank test; see (Sachs, 1984)) does not seem to provide an appropriate evaluation ground either. Although Evert and Krenn (2001) and Wermter and Hahn (2004) provide significance tests of some AMs against mere frequency counting for collocation extraction, they do not differentiate whether observed differences are due to differences in the ranking of true positives, of true negatives, or a combination thereof.2 As for studies on ATR (e.g., Wermter and Hahn (2005) or Nenadić et al. (2004)), no statistical comparison of the term extraction algorithms with mere frequency counting was performed.</Paragraph> <Paragraph position="4"> More fundamentally, these kinds of commonly used statistical significance tests may not provide the right machinery in the first place. By design, they are rather limited (or focused) in scope in that they merely check whether a null hypothesis can be rejected or not. They thus do not reveal, e.g., the magnitude of the differences involved and do not offer the means to devise qualitative criteria for testing whether an AM is superior to co-occurrence frequency counting.</Paragraph> <Paragraph position="5"> The purpose of this study is therefore to postulate a set of criteria for the qualitative testing of differences among the various CE and ATR metrics. We do this by taking up the two conditions above, which state that a good CE or ATR algorithm should rank most of the true positives in a candidate set in the upper portion and most of the true negatives in the lower portion of the output.
Thus, compared to co-occurrence frequency counting, a superior CE/ATR algorithm should achieve the following four objectives:2 [Footnote 2: In particular, Evert and Krenn (2001) use the chi-square test, which assumes independent samples and is thus not really suitable for testing the significance of differences between two or more measures that are typically run on the same set of candidates (i.e., a dependent sample). Wermter and Hahn (2004) use the McNemar test for dependent samples, which only examines the differences in which two metrics do not coincide.]</Paragraph> <Paragraph position="6"> 1. keep the true positives in the upper portion; 2. keep the true negatives in the lower portion; 3. demote true negatives from the upper portion; 4. promote true positives from the lower portion.</Paragraph> <Paragraph position="7"> We take these to be four qualitative criteria by which the merit of a certain AM against mere occurrence frequency counting can be determined.</Paragraph> </Section> <Section position="2" start_page="786" end_page="786" type="sub_section"> <SectionTitle> 3.2 Data Sets </SectionTitle> <Paragraph position="0"> For collocation extraction (CE), we used the data set provided by Wermter and Hahn (2004), which is based on a 114-million-word German newspaper corpus. After shallow syntactic analysis, the authors extracted Preposition-Noun-Verb (PNV) combinations occurring at least ten times and had them classified by human judges as to whether or not they constitute a valid collocation, resulting in 8,644 PNV combinations with 13.7% true positives. For domain-specific automatic term recognition (ATR), we used the biomedical term candidate set put forth by Wermter and Hahn (2005), who, after shallow syntactic analysis, extracted 31,017 trigram term candidates occurring at least eight times in a 104-million-word MEDLINE corpus. Checking these term candidates against the 2004 edition of the UMLS Metathesaurus (UMLS, 2004)3 resulted in 11.6% true positives.
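For orientation, the reported true-positive percentages translate into roughly the following absolute counts; this is a derived back-of-the-envelope computation, not a set of figures quoted from the paper:

```python
# Candidate-set sizes and true-positive rates as reported; absolute
# true-positive counts are derived here by simple multiplication.
datasets = {
    "PNV collocations (Wermter and Hahn, 2004)": (8644, 0.137),
    "MEDLINE trigram terms (Wermter and Hahn, 2005)": (31017, 0.116),
}
for name, (n, rate) in datasets.items():
    print(name, "-", round(n * rate), "true positives of", n, "candidates")
```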
This information is summarized in Table 1.</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="786" end_page="786" type="sub_section"> <SectionTitle> 3.3 The Association Measures </SectionTitle> <Paragraph position="0"> We examined both standard statistics-based and more recent linguistically rooted association measures against mere frequency-of-occurrence counting (henceforth referred to as Frequency). As the standard statistical AM, we selected the t-test (see also Manning and Schütze (1999) for a description of its use in CE and ATR) because it has been shown to be the best-performing statistics-only measure for CE (cf. Evert and Krenn (2001) and Krenn and Evert (2001)) and also for ATR (see Wermter and Hahn (2005)).</Paragraph> <Paragraph position="1"> Concerning the more recent linguistically grounded AMs, we looked at limited syntagmatic modifiability (LSM) for CE (Wermter and Hahn, 2004) and limited paradigmatic modifiability (LPM) for ATR (Wermter and Hahn, 2005). LSM exploits the well-known linguistic property that collocations are much less modifiable with additional lexical material (supplements) than non-collocations.</Paragraph> <Paragraph position="2"> For each collocation candidate, LSM determines the lexical supplement with the highest probability, which results in a higher collocativity score for those candidates with a particularly characteristic lexical supplement. LPM assumes that domain-specific terms are linguistically more fixed and show less distributional variation than common noun phrases. Taking n-gram term candidates, it determines the likelihood of precluding the appearance of alternative tokens in various token slot combinations, which results in higher scores for more constrained candidates.
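Since every measure produces a ranked list, the four criteria of Section 3.1 can be operationalized by splitting each list at its middle rank and counting how true positives and true negatives move between the two halves relative to the Frequency ranking. The following sketch shows one way to do this; the function names and toy candidate labels are ours, not the paper's:

```python
def half_split(ranking):
    """Split a ranked candidate list at the middle rank into an
    upper portion (better-ranked half) and a lower portion."""
    mid = len(ranking) // 2
    return set(ranking[:mid]), set(ranking[mid:])

def criteria_counts(freq_ranking, am_ranking, true_positives):
    """Count, relative to the Frequency ranking, how many true positives
    (TPs) an AM keeps in the upper portion or promotes out of the lower
    one, and how many true negatives (TNs) it keeps in the lower portion
    or demotes out of the upper one."""
    f_up, f_low = half_split(freq_ranking)
    a_up, a_low = half_split(am_ranking)
    tps = set(true_positives)
    tns = set(freq_ranking) - tps
    return {
        "TPs kept upper": len(tps.intersection(f_up, a_up)),
        "TNs kept lower": len(tns.intersection(f_low, a_low)),
        "TNs demoted": len(tns.intersection(f_up, a_low)),
        "TPs promoted": len(tps.intersection(f_low, a_up)),
    }

# Toy example: c1..c6 are candidates; c1 and c5 are the gold true positives.
freq = ["c1", "c2", "c3", "c4", "c5", "c6"]  # ranked by raw frequency
am = ["c1", "c5", "c2", "c3", "c4", "c6"]    # reranked by some AM
counts = criteria_counts(freq, am, {"c1", "c5"})
print(counts)
```

In this toy case the AM both keeps the true positive c1 in the upper portion and promotes c5 out of the lower one, so it would satisfy all four criteria better than Frequency does on its own ranking.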
All measures assign a score to the candidates and thus produce a ranked output list.</Paragraph> </Section> <Section position="4" start_page="786" end_page="787" type="sub_section"> <SectionTitle> 3.4 Experimental Setup </SectionTitle> <Paragraph position="0"> In order to determine any potential merit of the above measures, we applied the four criteria described in Section 3.1 and qualitatively compared the different rankings assigned to true positives and true negatives by an AM and by Frequency. For this purpose, we chose the middle rank as the mark dividing a ranked output list into an upper and a lower portion. We then looked at the true positives (TPs) and true negatives (TNs) that Frequency assigns to these portions and quantified, according to the criteria postulated in Section 3.1, to what degree the other AMs change these rankings (or not). In order to better quantify the degrees of movement, we partitioned both the upper and the lower portion into three further subportions.</Paragraph> </Section> </Section> </Paper>