<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1106">
  <Title>Language &amp; Information Engineering</Title>
  <Section position="4" start_page="843" end_page="846" type="metho">
    <SectionTitle>
3 Methods and Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="843" end_page="844" type="sub_section">
      <SectionTitle>
3.1 Text Corpus
</SectionTitle>
      <Paragraph position="0"> We collected a biomedical training corpus of approximately 513,000 MEDLINE abstracts using the following query composed of MESH terms from the biomedical domain: transcription factors, blood cells and human.3 We then annotated the resulting 104-million-word corpus with the GENIA part-of-speech tagger4 and identified noun phrases (NPs) with the YAMCHA chunker (Kudo and Matsumoto, 2001). We restrict our study to NP recognition (i.e., determining the extension of a noun phrase but refraining from assigning any internal constituent structure to that phrase), because the vast majority of technical or scientific terms surface as noun phrases (Justeson and Katz, 1995). We filtered out a number of stop words (determiners, pronouns, measure symbols, etc.) and also ignored noun phrases with [...]. In order to obtain the term candidate sets (see Table 1), we counted the frequency of occurrence of noun phrases in our training corpus and categorized them according to their length. For this study, we restricted ourselves to noun phrases of length 2 (word bigrams), length 3 (word trigrams) and length 4 (word quadgrams). Morphological normalization of term candidates has been shown to be beneficial for ATR (Nenadić et al., 2004). We thus normalized the nominal head of each noun phrase (typically the rightmost noun in English) via the full-form UMLS SPECIALIST LEXICON (UMLS, 2004), a large repository of both general-language and domain-specific (medical) vocabulary. To eliminate noisy low-frequency data (cf. also Evert and Krenn (2001)), we defined different frequency cut-off thresholds, c, for the bigram, trigram and quadgram candidate sets and only considered candidates above these thresholds. [Table 1 caption fragment: [...]date tokens and types for the MEDLINE text corpus] [Footnote fragment: [...] structures (e.g., B and T cell). However, analyzing their inherent ambiguity is a complex syntactic operation, with a comparatively marginal benefit for ATR (Nenadić et al., 2004).]</Paragraph>
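      <Paragraph> The construction of the candidate sets can be sketched as follows. This is a minimal illustration, not the original implementation: the function name candidate_sets, the toy noun phrases and, in particular, the cut-off values in CUTOFFS are assumptions, since the actual thresholds c are not stated in this section. Noun phrases are assumed to arrive already chunked, filtered and normalized as token lists.

```python
from collections import Counter

# Hypothetical cut-off thresholds c per n-gram length; the section does not
# state the values that were actually used.
CUTOFFS = {2: 2, 3: 2, 4: 2}

def candidate_sets(noun_phrases, cutoffs=CUTOFFS):
    """Count NP frequencies and group candidates above the cut-off by length."""
    counts = Counter(tuple(np) for np in noun_phrases)
    by_length = {n: {} for n in cutoffs}
    for ngram, freq in counts.items():
        n = len(ngram)
        # Keep only bigrams, trigrams and quadgrams above their threshold.
        if n in cutoffs and freq > cutoffs[n]:
            by_length[n][ngram] = freq
    return by_length

# Toy input: each noun phrase is a list of tokens.
nps = [
    ["long", "terminal", "repeat"],
    ["long", "terminal", "repeat"],
    ["long", "terminal", "repeat"],
    ["t", "cell", "response"],
    ["bone", "marrow"],
]
sets = candidate_sets(nps)
```

With these toy counts only long terminal repeat survives, because it is the only candidate whose frequency lies above the assumed threshold.</Paragraph>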
    </Section>
    <Section position="2" start_page="844" end_page="844" type="sub_section">
      <SectionTitle>
3.2 Evaluating Term Extraction Quality
</SectionTitle>
      <Paragraph position="0"> Typically, terminology extraction studies evaluate the goodness of their algorithms by having their ranked output examined by domain experts who identify the true positives among the ranked candidates. There are several problems with such an approach. First, very often only one such expert is consulted and, hence, inter-annotator agreement cannot be determined (as, e.g., in the studies of Frantzi et al. (2000) or Collier et al. (2002)). Furthermore, what constitutes a relevant term for a particular domain may be rather difficult to decide even for domain experts when judges are just exposed to a list of candidates without any further context information. Thus, rather than relying on ad hoc human judgment in identifying true positives in a candidate set, as an alternative we may take already existing terminological resources into account.</Paragraph>
      <Paragraph position="1"> They have evolved over many years and usually reflect community-wide consensus achieved by expert committees. With these considerations in mind, the biomedical domain is an ideal test bed for evaluating the goodness of ATR algorithms because it hosts one of the most extensive and most carefully curated terminological resources, viz. the UMLS METATHESAURUS (UMLS, 2004). We will then take the mere existence of a term in the UMLS as the decision criterion whether or not a candidate term is also recognized as a biomedical term.</Paragraph>
      <Paragraph position="2"> Accordingly, for the purpose of evaluating the quality of different measures in recognizing multi-word terms from the biomedical literature, we assign every word bigram, trigram, and quadgram in our candidate sets (see Table 1) the status of being a term (i.e., a true positive), if it is found in the 2004 edition of the UMLS METATHESAURUS.6 For example, the word trigram long terminal repeat is listed as a term in one of the UMLS vocabularies, viz. MESH (2004), whereas t cell response is not. Thus, among the 67,308 word bigram candidate types, 14,650 (21.8%) were identified as true terms; among the 31,017 word trigram candidate types, their number amounts to 3,590 (11.6%), while among the 10,838 word quadgram types, 873 (8.1%) were identified as true terms.7</Paragraph>
      <Paragraph position="3"> 6We exclude UMLS vocabularies not relevant for molecular biology, such as nursing and health care billing codes.</Paragraph>
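      <Paragraph> The lookup-based labeling and the resulting proportions can be illustrated with a short sketch. The function names and the toy reference set are assumptions made for illustration; a plain Python set stands in for the UMLS lookup. The counts in reported are the ones given in this subsection.

```python
def label_candidates(candidates, reference_terms):
    """Mark each candidate as a true positive iff it appears in the reference."""
    return {c: (c in reference_terms) for c in candidates}

def proportion(true_terms, total):
    """Share of true terms among candidate types, in percent."""
    return 100.0 * true_terms / total

# Toy reference set standing in for the UMLS lookup (hypothetical structure).
reference = {"long terminal repeat"}
labels = label_candidates(["long terminal repeat", "t cell response"], reference)

# Counts reported in this subsection: (true terms, candidate types).
reported = {
    "bigram": (14650, 67308),
    "trigram": (3590, 31017),
    "quadgram": (873, 10838),
}
baselines = {name: round(proportion(t, n), 1) for name, (t, n) in reported.items()}
```

Rounded to one decimal place, the computed shares reproduce the percentages quoted above (21.8, 11.6 and 8.1).</Paragraph>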
    </Section>
    <Section position="3" start_page="844" end_page="846" type="sub_section">
      <SectionTitle>
3.3 Paradigmatic Modifiability of Terms
</SectionTitle>
      <Paragraph position="0"> For most standard association measures utilized for terminology extraction, the frequency of occurrence of the term candidates either plays a major role (e.g., C-value), or has at least a significant impact on the assignment of the degree of termhood (e.g., t-test). However, frequency of occurrence in a training corpus may be misleading regarding the decision whether or not a multi-word expression is a term. For example, taking the two trigram multi-word expressions from the previous subsection, the non-term t cell response appears 2410 times in our 104-million-word MEDLINE corpus, whereas the term long terminal repeat (long repeating sequences of DNA) only appears 434 times (see also Tables 2 and 3 below).</Paragraph>
      <Paragraph position="1"> The linguistic property around which we built our measure of termhood is the limited paradigmatic modifiability of multi-word terminological units. A multi-word expression such as long terminal repeat contains three token slots in which slot 1 is filled by long, slot 2 by terminal and slot 3 by repeat. The limited paradigmatic modifiability of such a trigram is now defined by the probability with which one or more such slots cannot be filled by other tokens. We estimate the likelihood of precluding the appearance of alternative tokens in particular slot positions by employing the standard combinatory formula without repetitions. For an n-gram (of size n) to select k slots (i.e., in an unordered selection) we thus define: $$C(n, k) = \frac{n!}{k!\,(n-k)!} \qquad (1)$$</Paragraph>
      <Paragraph position="3"> 7 [...] types drop with increasing n-gram length but also the proportion of true terms. In fact, their proportion drops more sharply than can actually be seen from the above data because the various cut-off thresholds have a leveling effect.</Paragraph>
      <Paragraph position="4"> For example, for n = 3 (word trigram) and k = 1 and k = 2 slots, there are three possible selections for each k for long terminal repeat and for t cell response (see Tables 2 and 3). k is actually a place-holder for any possible token (and its frequency) which fills this position in the training corpus.</Paragraph>
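      <Paragraph> The slot selections for a given k can be enumerated mechanically. The sketch below is an illustration under our own naming: slot_selections is a hypothetical helper, and the labels k1, k2, k3 simply mark open slots. It lists the patterns for the trigram long terminal repeat.

```python
from itertools import combinations
from math import comb

def slot_selections(ngram, k):
    """Return every pattern obtained by opening k of the n slots as place-holders."""
    n = len(ngram)
    patterns = []
    for open_slots in combinations(range(n), k):
        patterns.append(tuple(
            f"k{i + 1}" if i in open_slots else tok
            for i, tok in enumerate(ngram)
        ))
    return patterns

trigram = ("long", "terminal", "repeat")
one_slot = slot_selections(trigram, 1)   # one open slot per pattern
two_slots = slot_selections(trigram, 2)  # two open slots per pattern
n_one, n_two = comb(3, 1), comb(3, 2)    # number of unordered selections
```

For n = 3, both k = 1 and k = 2 yield three patterns each, in line with the number of unordered selections of k slots out of n.</Paragraph>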
      <Paragraph position="5"> [Table 3 caption fragment: [...] for the trigram non-term t cell response] Now, for a particular k (1 ≤ k ≤ n; n = length of n-gram), the frequency of each possible selection, sel, is determined. The paradigmatic modifiability for a particular selection sel is then defined by the n-gram's frequency scaled against the frequency of sel. As can be seen in Tables 2 and 3, a lower frequency of sel induces a more limited paradigmatic modifiability for that sel (which is, of course, expressed as a higher probability value; see the column labeled mod_sel in both tables). Thus, with s being the number of distinct possible selections for a particular k, the k-modifiability, mod_k, of an n-gram can be defined as follows (f stands for frequency):</Paragraph>
      <Paragraph position="7"> The paradigmatic modifiability, P-Mod, of an n-gram is the product of all its k-modifiabilities:8 $$P\text{-}Mod(n\text{-gram}) = \prod_{k=1}^{n} mod_k(n\text{-gram}) \qquad (3)$$</Paragraph>
      <Paragraph position="9"> Comparing the trigram P-Mod values for k = 1, 2 in Tables 2 and 3, it can be seen that the term long terminal repeat gets a much higher weight than the non-term t cell response, although their mere frequency values suggest the opposite. This is also reflected in the respective list rank (see Subsection 4.1 for details) assigned to both trigrams by the t-test and by our P-Mod measure. While t cell response has rank 24 on the t-test output list (which directly reflects its high frequency), P-Mod assigns rank 1249 to it. Conversely, long terminal repeat is ranked on position 242 by the t-test, whereas it occupies rank 24 for P-Mod. In fact, even lower-frequency multi-word units gain a prominent ranking, if they exhibit limited paradigmatic modifiability. For example, the trigram term porphyria cutanea tarda is ranked on position 28 by P-Mod, although its frequency is only 48 (which results in rank 3291 on the t-test output list). Despite its lower frequency, this term is judged as being relevant for the molecular biology domain.9 It should be noted that the termhood values (and the corresponding list ranks) computed by P-Mod also include k = 3 and, hence, take into account a reasonable amount of frequency load. As can be seen from the previous ranking examples, this factor still does not override the paradigmatic modifiability factors of the lower ks.</Paragraph>
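      <Paragraph> The measure as described in this subsection can be sketched in a few lines. Two points are assumptions rather than statements from the text: the k-modifiability is taken to be the product of the selection-level values mod_sel over all selections for that k, and the frequency of a selection is taken to be the summed frequency of all equal-length candidates that agree with the n-gram on its fixed slots, the n-gram itself included. The function name p_mod and the toy counts are invented for illustration and are not corpus figures.

```python
from collections import Counter
from itertools import combinations

def p_mod(ngram, freqs):
    """P-Mod of `ngram`, given a Counter of equal-length candidate frequencies."""
    n = len(ngram)
    score = 1.0
    for k in range(1, n + 1):
        for open_slots in combinations(range(n), k):
            fixed = [i for i in range(n) if i not in open_slots]
            # f(sel): total frequency of candidates agreeing on the fixed slots.
            f_sel = sum(f for cand, f in freqs.items()
                        if all(cand[i] == ngram[i] for i in fixed))
            # mod_sel: the n-gram's frequency scaled against f(sel).
            score *= freqs[ngram] / f_sel
    return score

# Toy frequencies (invented): the non-term shares tokens with other trigrams.
freqs = Counter({
    ("long", "terminal", "repeat"): 4,
    ("t", "cell", "response"): 6,
    ("b", "cell", "response"): 6,
    ("t", "cell", "line"): 4,
})
term_score = p_mod(("long", "terminal", "repeat"), freqs)
nonterm_score = p_mod(("t", "cell", "response"), freqs)
```

In this toy setting the less frequent long terminal repeat receives the higher score, because none of its slots is filled by alternative tokens, whereas t cell response is penalized for every selection it shares with other trigrams. For k = n all slots are open, so the denominator is the summed frequency of all candidates, which is how frequency enters the score.</Paragraph>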
      <Paragraph position="10"> On the other hand, P-Mod will also demote true terms in their ranking, if their paradigmatic modifiability is less limited. This is particularly the case if one or more of the tokens of a particular term often occur in the same slot of other equal-length n-grams.</Paragraph>
      <Paragraph position="11"> For example, the trigram term bone marrow cell occurs 1757 times in our corpus and is thus ranked quite high (position 31) by the t-test. P-Mod, however, ranks this term on position 550 because the token cell also occurs in many other trigrams and thus leads to a less limited paradigmatic modifiability. Still, the underlying assumption of our approach is that such a case is more an exception than the rule and that terms are linguistically more 'frozen' than non-terms, which is exactly the intuition behind our measure of limited paradigmatic modifiability.</Paragraph>
      <Paragraph position="12"> 8 [...] actually has the pleasant side effect of including frequency in our modifiability measure. In this case, the only possible selection k1k2k3 as the denominator of Formula (2) is equivalent to summing up the frequencies of all trigram term candidates. 9It denotes a group of related disorders, all of which arise from a deficient activity of the heme synthetic enzyme uroporphyrinogen decarboxylase (URO-D) in the liver.</Paragraph>
    </Section>
    <Section position="4" start_page="846" end_page="846" type="sub_section">
      <SectionTitle>
3.4 Methods of Evaluation
</SectionTitle>
      <Paragraph position="0"> As already described in Subsection 3.2, standard procedures for evaluating the quality of termhood measures usually involve identifying the true positives among a (usually) arbitrarily set number of the m highest ranked candidates returned by a particular measure, a procedure usually carried out by a domain expert. Because this is labor-intensive (besides being unreliable), m is usually small, ranging from 50 to several hundreds.10 By contrast, we choose a large and already consensual terminology to identify the true terms in our candidate sets. Thus, we are able to dynamically examine various m-highest ranked samples, which, in turn, allows for the plotting of standard precision and recall graphs for the entire candidate set. We thus provide a more reliable evaluation setting for ATR measures than what is common practice in the literature.</Paragraph>
      <Paragraph position="1"> We compare our P-Mod algorithm against the t-test measure,11 which, of all standard measures, yields the best results in general-language collocation extraction studies (Evert and Krenn, 2001), and also against the widely used C-value, which aims at enhancing the common frequency of occurrence measure by making it sensitive to nested terms (Frantzi et al., 2000). Our baseline is defined by the proportion of true positives (i.e., the proportion of terms) in our bi-, tri- and quadgram candidate sets.</Paragraph>
      <Paragraph position="2"> This is equivalent to the likelihood of finding a true positive by blindly picking from one of the different sets (see Subsection 3.2).</Paragraph>
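      <Paragraph> The evaluation over varying m can be sketched as follows. The function name precision_recall_at and the ranking order are assumptions made for illustration; the term status of the four example trigrams follows Subsection 3.3. The sketch assumes m is at least 1 and that the candidate set contains at least one true term.

```python
def precision_recall_at(ranked, true_terms, m):
    """Precision and recall for the m highest-ranked candidates."""
    top = ranked[:m]
    hits = sum(1 for cand in top if cand in true_terms)
    total_true = sum(1 for cand in ranked if cand in true_terms)
    return hits / len(top), hits / total_true

# Toy ranking (invented order) over four trigram candidates.
ranked = [
    "long terminal repeat",
    "t cell response",
    "porphyria cutanea tarda",
    "bone marrow cell",
]
true_terms = {"long terminal repeat", "porphyria cutanea tarda", "bone marrow cell"}

# One (precision, recall) point per cut-off m, as used for plotting the curves.
curve = [precision_recall_at(ranked, true_terms, m) for m in range(1, len(ranked) + 1)]

# Baseline: proportion of true terms in the whole candidate set.
baseline = len(true_terms) / len(ranked)
```

When m covers the entire candidate set, precision coincides with the baseline, which is why the baseline corresponds to blindly picking from the set.</Paragraph>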
      <Paragraph position="3"> 10Studies on collocation extraction (e.g., by Evert and Krenn (2001)) also point out the inadequacy of such evaluation methods. In essence, they usually lead to very superficial judgments about the measures under scrutiny.</Paragraph>
      <Paragraph position="4"> 11Manning and Schütze (1999) describe how this measure can be used for the extraction of multi-word expressions.</Paragraph>
    </Section>
  </Section>
</Paper>