<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1075">
  <Title>A Nonparametric Method for Extraction of Candidate Phrasal Terms</Title>
  <Section position="2" start_page="0" end_page="605" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The ordinary vocabulary of a language like English contains thousands of phrasal terms -- multiword lexical units including compound nouns, technical terms, idioms, and fixed collocations. The exact number of phrasal terms is difficult to determine, as new ones are coined regularly, and it is sometimes difficult to decide whether a phrase is a fixed term or a regular, compositional expression. Accurate identification of phrasal terms is important in a variety of contexts, including natural language parsing, question answering, and information retrieval, among others.</Paragraph>
    <Paragraph position="1"> Insofar as phrasal terms function as lexical units, their component words tend to cooccur more often, to resist substitution or paraphrase, to follow fixed syntactic patterns, and to display some degree of semantic noncompositionality (Manning, 1999:183-186). However, none of these characteristics are amenable to a simple algorithmic interpretation. It is true that various term extraction systems have been developed, such as Xtract (Smadja 1993), Termight (Dagan &amp; Church 1994), and TERMS (Justeson &amp; Katz 1995), among others (cf. Daille 1996, Jacquemin &amp; Tzoukermann 1994, Jacquemin, Klavans, &amp; Tzoukermann 1997, Boguraev &amp; Kennedy 1999, Lin 2001). Such systems typically rely on a combination of linguistic knowledge and statistical association measures. Grammatical patterns, such as adjective-noun or noun-noun sequences, are first selected and then ranked statistically, and the resulting ranked list is either used directly or submitted for manual filtering.</Paragraph>
    <Paragraph position="2"> The linguistic filters used in typical term extraction systems have no obvious connection with the criteria that linguists would argue define a phrasal term (noncompositionality, fixed order, nonsubstitutability, etc.). They function, instead, to reduce the number of a priori improbable terms and thus improve precision. The association measure does the actual work of distinguishing between terms and plausible nonterms. A variety of methods have been applied, ranging from simple frequency (Justeson &amp; Katz 1995) and modified frequency measures such as c-values (Frantzi, Ananiadou &amp; Mima 2000, Maynard &amp; Ananiadou 2000) to standard statistical significance tests such as the t-test, the chi-squared test, and log-likelihood (Church and Hanks 1990, Dunning 1993), and information-based methods, e.g. pointwise mutual information (Church &amp; Hanks 1990).</Paragraph>
    <Paragraph position="4"> Several studies of the performance of lexical association metrics suggest significant room for improvement, but also variability among tasks.</Paragraph>
    <Paragraph position="5"> One series of studies (Krenn 1998, 2000; Evert &amp; Krenn 2001; Krenn &amp; Evert 2001; also see Evert 2004) focused on the use of association metrics to identify the best candidates in particular grammatical constructions, such as adjective-noun pairs or verb plus prepositional phrase constructions, and compared the performance of simple frequency to several common measures (the log-likelihood, the t-test, the chi-squared test, the dice coefficient, relative entropy, and mutual information). In Krenn &amp; Evert 2001, frequency outperformed mutual information though not the t-test, while in Evert &amp; Krenn 2001, log-likelihood and the t-test gave the best results, and mutual information again performed worse than frequency. However, in all these studies performance was generally low, with precision falling rapidly after the very highest ranked phrases in the list.</Paragraph>
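Such disagreements between measures are easy to reproduce on toy counts. The sketch below (illustrative only; the bigrams and counts are hypothetical) shows how the collocation t-score and PMI can rank the same two candidates in opposite orders:

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """Collocation t-score: the difference between the observed bigram
    count and the count expected under independence, divided by an
    approximation of the standard deviation (sqrt of the observed count)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information, for comparison."""
    return math.log2((f_xy * n) / (f_x * f_y))

n = 1_000_000
# Hypothetical counts: a very frequent function-word pair versus a
# rarer but more tightly associated content-word pair.
frequent_pair = (10_000, 30_000, 40_000, n)   # e.g. "of the"
tight_pair = (30, 1_500, 800, n)              # e.g. "strong tea"

# The t-score (like raw frequency) ranks the frequent pair first,
# while PMI ranks the tightly associated pair first.
```

Since the n-best lists produced by different measures can diverge this sharply, it is unsurprising that no single measure dominated across the tasks in these studies.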
    <Paragraph position="6"> By contrast, Schone and Jurafsky (2001) evaluated the identification of phrasal terms without grammatical filtering on a 6.7-million-word extract from the TREC databases, applying both WordNet and online dictionaries as gold standards. Once again, the general level of performance was low, with precision falling off rapidly as larger portions of the n-best list were included, but they reported better performance with statistical and information-theoretic measures (including mutual information) than with frequency. The overall pattern appears to be one in which lexical association measures in general have very low precision and recall on unfiltered data, but perform far better when combined with other features that select linguistic patterns likely to function as phrasal terms.</Paragraph>
    <Paragraph position="7"> The relatively low precision of lexical association measures on unfiltered data no doubt has multiple explanations, but a likely candidate is the failure or inapplicability of the underlying statistical assumptions. For instance, many of the tests assume a normal distribution, despite the highly skewed nature of natural language frequency distributions, though this is not the most important consideration except at very low n (cf. Moore 2004, Evert 2004, ch. 4). More importantly, statistical and information-based metrics such as the log-likelihood and mutual information measure significance or informativeness relative to the assumption that the selection of component terms is statistically independent. But of course the possibilities for combining words are anything but random and independent. Use of linguistic filters such as "attributive adjective followed by noun" or "verb plus modifying prepositional phrase" arguably has the effect of selecting a subset of the language for which the standard null hypothesis -- that any word may freely be combined with any other word -- may be much more accurate. Additionally, many of the association measures are defined only for bigrams, and do not generalize well to phrasal terms of varying length.</Paragraph>
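The independence null hypothesis in question is what the expected counts in a log-likelihood computation encode. As an illustrative sketch (the standard contingency-table form of Dunning's 1993 statistic, not code from this paper; the counts are hypothetical):

```python
import math

def log_likelihood_ratio(f_xy, f_x, f_y, n):
    """Dunning's (1993) log-likelihood statistic for a bigram (x, y).
    The expected counts e encode the null hypothesis that words combine
    independently; large values indicate that the observed table
    departs strongly from that hypothesis."""
    # Observed 2x2 contingency table over bigram positions.
    o11 = f_xy                    # x followed by y
    o12 = f_x - f_xy              # x followed by a word other than y
    o21 = f_y - f_xy              # a word other than x, followed by y
    o22 = n - f_x - f_y + f_xy    # neither
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    g2 = 0.0
    for o, r, c in ((o11, rows[0], cols[0]), (o12, rows[0], cols[1]),
                    (o21, rows[1], cols[0]), (o22, rows[1], cols[1])):
        e = r * c / n             # expected count under independence
        if o > 0:
            g2 += o * math.log(o / e)
    return 2.0 * g2
```

If the observed counts match the independence expectation exactly, the statistic is zero; grammatical filtering, as the text argues, restricts the sample to contexts where that expectation is closer to the truth.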
    <Paragraph position="8"> The purpose of this paper is to explore whether the identification of candidate phrasal terms can be improved by adopting a heuristic that takes some of these statistical issues into account. The method presented here, the mutual rank ratio, is a nonparametric, rank-based approach that appears to perform significantly better than the standard association metrics.</Paragraph>
    <Paragraph position="9"> The body of the paper is organized as follows: Section 2 will introduce the statistical considerations which provide a rationale for the mutual rank ratio heuristic and outline how it is calculated. Section 3 will present the data sources and evaluation methodologies applied in the rest of the paper. Section 4 will evaluate the mutual rank ratio statistic and several other lexical association measures on a larger corpus than has been used in previous evaluations. As will be shown below, the mutual rank ratio statistic recognizes phrasal terms more effectively than standard statistical measures.</Paragraph>
  </Section>
</Paper>