<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1106">
  <Title>Language &amp; Information Engineering</Title>
  <Section position="3" start_page="0" end_page="843" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> The automatic extraction of complex multi-word terms from domain-specific corpora is already an active field of research (cf., e.g., for the biomedical domain, Rindflesch et al. (1999), Collier et al. (2002), Bodenreider et al. (2002), or Nenadić et al. (2003)). Typically, in all of these approaches, term candidates are collected from texts by various forms of linguistic filtering (part-of-speech tagging, phrase chunking, etc.), through which candidates obeying various linguistic patterns are identified (e.g., noun-noun, adjective-noun-noun combinations). These candidates are then submitted to frequency- or statistically-based evidence measures (such as the C-value (Frantzi et al., 2000)), which compute scores indicating to what degree a candidate qualifies as a term. Term mining, as a whole, is a complex process involving several other components (orthographic and morphological normalization, acronym detection, conflation of term variants, term context, term clustering; cf. Nenadić et al. (2003)). Still, the measure which assigns a termhood value to a term candidate is the essential building block of any term identification system.</Paragraph>
    <Paragraph position="1"> For multi-word automatic term recognition (ATR), the C-value approach (Frantzi et al., 2000; Nenadić et al., 2004), which aims at improving the extraction of nested terms, has been one of the most widely used techniques in recent years. Other potential association measures are mutual information (Damerau, 1993) and the whole battery of statistical and information-theoretic measures (t-test, log-likelihood, entropy) which are typically employed for the extraction of general-language collocations (Manning and Schütze, 1999; Evert and Krenn, 2001). While these measures have their statistical merits in terminology identification, it is interesting to note that they make only little use of linguistic properties inherent to complex terms.2 More linguistically oriented work on ATR by Daille (1996) or on term variation by Jacquemin (1999) builds on the deep syntactic analysis of term candidates. This includes morphological and head-modifier dependency analysis and thus presupposes accurate, high-quality parsing which, for sublanguages at least, can only be achieved by a highly domain-dependent type of grammar. As sublanguages from different domains usually reveal a high degree of syntactic variability among each other (e.g., in terms of POS distribution, syntactic patterns), this property makes it difficult to port grammatical specifications to different domains.</Paragraph>
    <Paragraph position="2"> Therefore, one may wonder whether there are cross-domain linguistic properties which might be beneficial to ATR and still could be accounted for by only shallow syntactic analysis. In this paper, we propose the limited paradigmatic modifiability of terms as a criterion which meets these requirements and will elaborate on it in detail in Subsection 3.3.</Paragraph>
    <Paragraph position="3"> 2 A notable exception is the C-value method, which incorporates a term's likelihood of being nested in other multi-word units.</Paragraph>
  </Section>
</Paper>