<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2022">
  <Title>Multilingual Term Extraction from Domain-specific Corpora Using Morphological Structure</Title>
  <Section position="4" start_page="172" end_page="173" type="evalu">
    <SectionTitle>
3 Experiments and results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
3.1 Corpora
</SectionTitle>
      <Paragraph position="0"> The system was evaluated on four corpora covering the domains of volcanology (V) and breast cancer (BC), in English (en) and in French (fr). The corpora were built automatically from the web, using the methodology described in (Baroni and Bernardini, 2004), via the Yahoo! Search Web Services (http://developer.yahoo.net/search/).</Paragraph>
      <Paragraph position="1"> The sizes of the corpora obtained are given in Table 1. This table also gives the number of key words, i.e., single-word terms extracted by comparing the frequency of occurrence of words in both corpora for each language (Rayson and Garside, 2000). Only terms with a log-likelihood of 3.8 or higher (p&lt;0.05) have been kept in the key words list. Table 2 gives a numerical overview of the results obtained by our method.</Paragraph>
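      <Paragraph> The key word selection described above can be sketched as follows. This is a minimal illustration in Python, assuming the standard two-corpus log-likelihood statistic of Rayson and Garside (2000). The function names, the toy counts and the restriction to words that are relatively more frequent in the target corpus are assumptions of this sketch, not details given in the paper.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood of one word (Rayson and Garside, 2000).

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of word tokens in each corpus.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


def key_words(counts_a, size_a, counts_b, size_b, threshold=3.8):
    """Words of corpus A kept as key words (threshold taken from the text)."""
    kept = []
    for word, freq_a in counts_a.items():
        freq_b = counts_b.get(word, 0)
        # Assumption: only keep words relatively more frequent in corpus A.
        overused = freq_a * size_b > freq_b * size_a
        if overused and log_likelihood(freq_a, size_a, freq_b, size_b) >= threshold:
            kept.append(word)
    return sorted(kept)


# Toy counts, invented for illustration only.
volcanology = {"lava": 50, "the": 500}
breast_cancer = {"the": 510, "tumor": 40}
print(key_words(volcanology, 10000, breast_cancer, 10000))  # prints ['lava']
```

      In this toy run, &quot;the&quot; has nearly the same relative frequency in both corpora and is rejected, while &quot;lava&quot; occurs only in the first corpus and is kept.</Paragraph>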
    </Section>
    <Section position="2" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
3.2 Prefixes and initial combining forms
</SectionTitle>
      <Paragraph position="0"> As shown in Table 2, the number of prefixes and initial combining forms identified is proportionally lower for the volcanology corpora, both in English and in French. Medical corpora seem to be better suited to the method, since the number of terms extracted is higher. The prefixes and combining forms identified are also highly dependent on the corpus domain. For instance, amongst the most frequent combining forms extracted for the BC corpora, we find &quot;radio&quot; and &quot;chemo&quot; (&quot;chimio&quot; in French) and, for the V corpora, &quot;strato&quot; and &quot;volcano&quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
3.3 Terms
</SectionTitle>
      <Paragraph position="0"> The overlap between the list of terms and the list of key words ranges from 38.65% (V fr) to 56.92% (V en) of the total number of terms extracted. If we compare both the list of key words and the list of terms extracted for the BC en corpus with the Unified Medical Language System Metathesaurus (http://www.nlm.nih.gov/research/umls/), we notice that some highly specific terms like &quot;disease&quot;, &quot;blood&quot; or &quot;x-ray&quot; are not identified by our method, while they occur in the key words list. These are usually morphologically simple terms, also used in everyday language. Conversely, terms with low frequency like &quot;adenoacanthoma&quot;, &quot;chondroma&quot; or &quot;mammotomy&quot; are correctly identified by the pattern-based approach but are missing from the key words list. The two methods are therefore complementary.</Paragraph>
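      <Paragraph> The overlap figure and the complementarity of the two lists can be illustrated with plain set operations. The following Python sketch assumes that the overlap is the share of extracted terms that also appear in the key words list. The function name is hypothetical, and the two small lists are toy data built from the examples above plus one invented shared entry (&quot;chemotherapy&quot;).

```python
def overlap_percentage(terms, key_words):
    """Share of extracted terms (in percent) also found in the key words list."""
    terms, key_words = set(terms), set(key_words)
    if not terms:
        return 0.0
    return 100.0 * len(terms.intersection(key_words)) / len(terms)


# Toy lists for illustration only; "chemotherapy" is an invented shared entry.
terms = {"adenoacanthoma", "chondroma", "mammotomy", "chemotherapy"}
key_words = {"disease", "blood", "x-ray", "chemotherapy"}

print(overlap_percentage(terms, key_words))  # prints 25.0
print(sorted(terms - key_words))             # found only by the pattern-based method
print(sorted(key_words - terms))             # found only by the key word method
```
      </Paragraph>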
      <Paragraph position="1"> In some cases, stop-words are extracted. This is a side effect of the pattern used to retrieve terms. Recall that terms are words which coalesce with combining forms, possibly with hyphenation. In English, hyphens are sometimes mistakenly used instead of dashes to mark comment clauses. Consider for instance the following sentence: &quot;As this magma-which drives one of the world's largest volcanic systems-rises, it pushes up the Earth's crust beneath the Yellowstone Plateau.&quot;. Here &quot;magma&quot; is identified as a combining form since it ends with 'a' and is directly followed by a hyphen. Consequently, &quot;which&quot; is wrongly identified as a term.</Paragraph>
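      <Paragraph> The side effect described above can be reproduced with a small regular expression. This Python sketch is only an approximation of the hyphenated case: it assumes that a candidate combining form is a word ending in 'a' or 'o' (as in &quot;magma&quot;, &quot;radio&quot; or &quot;strato&quot;) directly followed by a hyphen, and that the word after the hyphen is taken as the candidate term. The pattern and the function name are assumptions, not the pattern actually used by the system.

```python
import re

# Assumption: combining form = word ending in 'a' or 'o', directly followed
# by a hyphen; the word after the hyphen is the candidate term.
PATTERN = re.compile(r"\b([A-Za-z]+[ao])-([A-Za-z]+)")


def hyphenated_candidates(text):
    """Return (combining form, candidate term) pairs found in the text."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in PATTERN.finditer(text)]


sentence = ("As this magma-which drives one of the world's largest volcanic "
            "systems-rises, it pushes up the Earth's crust beneath the "
            "Yellowstone Plateau.")

print(hyphenated_candidates("strato-volcano"))  # prints [('strato', 'volcano')]
print(hyphenated_candidates(sentence))          # prints [('magma', 'which')]
```

      The second call shows the false positive: the hyphen stands in for a dash, yet the pair is matched exactly like a genuine compound. &quot;systems-rises&quot; is not matched because &quot;systems&quot; does not end in one of the assumed vowels.</Paragraph>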
    </Section>
    <Section position="4" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
3.4 Term families
</SectionTitle>
      <Paragraph position="0"> Several types of term variants are grouped by the term conflation algorithm: (a) graphical and orthographical variants like &quot;tumour&quot; (British variant) and &quot;tumor&quot; (American variant); (b) inflectional variants like &quot;tumor&quot; and &quot;tumors&quot;; (c) derivational variants like &quot;tumor&quot; and &quot;tumoral&quot;. Two types of conflation errors may however occur: over-conflation, i.e., the conflation of terms which do not belong to the same morphological family, and under-conflation, i.e., the absence of conflation for morphologically related terms.</Paragraph>
      <Paragraph position="1"> Some cases of over-conflation are obvious, such as the grouping of &quot;significant&quot; with &quot;cant&quot;. In some other cases it is more difficult to tell. This especially applies to the conflation of terms composed of word-final combining forms like &quot;-gram&quot; or &quot;-graph&quot;. Under-conflation occurs when no combining form is shared between terms belonging to families represented by graphically similar terms. For instance, the following term families are extracted from the French volcanology corpus (V fr): F1 = [basalte, métabasalte, méta-basalte], F2 = [basaltes, ferro-basaltes, paléobasaltes] and F3 = [basaltique, andésitico-basaltique]. These families are not conflated, even though they obviously belong to the same morphological family.</Paragraph>
    </Section>
  </Section>
</Paper>