<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2022">
  <Title>Multilingual Term Extraction from Domain-specific Corpora Using Morphological Structure</Title>
  <Section position="3" start_page="0" end_page="172" type="metho">
    <SectionTitle>
2 Description of the method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="171" type="sub_section">
      <SectionTitle>
2.1 Extraction of words
</SectionTitle>
      <Paragraph position="0"> The system takes as input a corpus of texts. Paragraphs written in a language other than the target language are filtered out. Texts are then tokenised and words are converted to lowercase. Words containing digits or other non-word characters are also eliminated. However, hyphenated words are kept, since hyphens mark morpheme boundaries. This preliminary step produces a word frequency list for the corpus.</Paragraph>
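      <Paragraph position="1"> The preprocessing step can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function name word_frequencies, the whitespace tokenisation and the punctuation stripping are assumptions, and language filtering is not shown. <![CDATA[
```python
import re
from collections import Counter

# Assumption: a valid word is made of letters only, optionally joined by
# single internal hyphens (digits and other characters disqualify it).
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def word_frequencies(paragraphs):
    """Build a word frequency list from an iterable of paragraph strings.

    Hypothetical helper: the source does not specify its tokeniser, so
    tokens are split on whitespace and stripped of common punctuation.
    """
    counts = Counter()
    for paragraph in paragraphs:
        for raw in paragraph.lower().split():
            token = raw.strip(".,;:!?()[]\"'")
            # Keep plain and hyphenated words, drop everything else.
            if WORD.fullmatch(token):
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = ["Chemo-radiotherapy and chemotherapy.",
              "Phase 2 trials of chemotherapy, e.g. H2O."]
    for word, freq in word_frequencies(sample).most_common():
        print(word, freq)
```
]]></Paragraph>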
    </Section>
    <Section position="2" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
2.2 Acquisition of combining forms
</SectionTitle>
      <Paragraph position="0"> Prefixes and initial combining forms are automatically acquired using the following regular expression: ([aio]-)?(\w{3,}[aio])-. This regular expression represents character strings whose length is greater than or equal to 4, ending with a, i or o and immediately followed by a hyphen.</Paragraph>
      <Paragraph position="1"> The first part of the regular expression accounts for words where several prefixes or combining forms follow one another (as, for instance, in the French word &quot;hépato-gastro-entérologues&quot;). This regular expression applies to English and also to other languages such as French or German: see for instance &quot;chimio-radiothérapie&quot; in French, &quot;chemo-radiotherapy&quot; in English or &quot;Chemoradiotherapie&quot; in German.</Paragraph>
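      <Paragraph position="2"> As an illustration, the regular expression quoted above can be applied to a word list as in the sketch below. The pattern is copied as given; the function name combining_forms and the choice to collect the second capture group are assumptions. In Python 3, \w also matches accented letters in text strings, so French forms are covered. <![CDATA[
```python
import re

# Regular expression quoted in the text, copied unchanged.
COMBINING_FORM = re.compile(r"([aio]-)?(\w{3,}[aio])-")


def combining_forms(words):
    """Collect candidate prefixes and initial combining forms.

    Hypothetical helper: every non-overlapping match in every word
    contributes its second capture group (the form without the hyphen).
    """
    forms = set()
    for word in words:
        for match in COMBINING_FORM.finditer(word):
            forms.add(match.group(2))
    return forms


if __name__ == "__main__":
    corpus_words = ["chemo-radiotherapy",
                    "hépato-gastro-entérologues",
                    "radiotherapy"]
    print(sorted(combining_forms(corpus_words)))
```
]]> Words without a hyphen, such as &quot;radiotherapy&quot;, yield no form at this stage, because the pattern requires a hyphen after the form.</Paragraph>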
    </Section>
    <Section position="3" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
2.3 Identification of terms
</SectionTitle>
      <Paragraph position="0"> Terms are identified using the following pattern describing their morphological structure: E+W, where E is a prefix or combining form and W is a word whose length is greater than 3; the '+' character represents the possible succession of several E elements at the beginning of a term. Prefixes and combining forms may be separated by a hyphen.</Paragraph>
      <Paragraph position="1"> When this pattern applies to one of the words in the corpus, two terms are recognised, one with an E+W structure and the other with a W structure.</Paragraph>
      <Paragraph position="2"> For instance, given the word &quot;ferrobasalts&quot;, the system identifies the terms &quot;ferrobasalts&quot; (E+W) and &quot;basalts&quot; (W).</Paragraph>
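      <Paragraph position="3"> One possible way to apply the E+W pattern is sketched below. It assumes that the set of combining forms is already available, and it strips forms greedily from the left, trying longer forms first; both the greedy strategy and the names split_term and identify_terms are assumptions, not details given in the text. <![CDATA[
```python
def split_term(word, forms):
    """Return (list of E elements, W) if word matches E+W, else None.

    Assumption: forms are stripped greedily from the left, longest first,
    and an optional hyphen after each form is skipped.
    """
    prefixes = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for form in sorted(forms, key=len, reverse=True):
            if rest.startswith(form):
                tail = rest[len(form):]
                if tail.startswith("-"):
                    tail = tail[1:]
                # W must be longer than 3 characters.
                if len(tail) > 3:
                    prefixes.append(form)
                    rest = tail
                    stripped = True
                    break
    return (prefixes, rest) if prefixes else None


def identify_terms(word, forms):
    """Return the two recognised terms (E+W and W), or an empty list."""
    parts = split_term(word, forms)
    return [word, parts[1]] if parts else []


if __name__ == "__main__":
    print(identify_terms("ferrobasalts", {"ferro"}))
    print(split_term("hépato-gastro-entérologues", {"hépato", "gastro"}))
```
]]></Paragraph>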
    </Section>
    <Section position="4" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
2.4 Conflation of terms
</SectionTitle>
      <Paragraph position="0"> Term variants are grouped in order to ease the analysis of results. The term conflation method consists of two stages:
        1. Terms containing the same word W belong to the same family, represented by the word W. For instance, both &quot;chemotherapy&quot; and &quot;radiotherapy&quot; contain the word &quot;therapy&quot;: they belong to the same family of terms, represented by the word &quot;therapy&quot;.
        2. Two families are merged if they are represented by words sharing the same initial substring (with a minimum initial substring length of 4) and if the same prefix or combining form occurs in one term of each family. Consider for instance the families F1 = [oncology, psycho-oncology, radiooncology, neuro-oncology, psychooncology, neurooncology] and F2 = [oncologist, neurooncologist]. The words representing F1 (&quot;oncology&quot;) and F2 (&quot;oncologist&quot;) share an initial substring of length 7. Moreover, the terms &quot;neuro-oncology&quot; from F1 and &quot;neurooncologist&quot; from F2 contain the combining form &quot;neuro&quot;. Families F1 and F2 are therefore united.</Paragraph>
      <Paragraph position="1"> Once terms have been conflated, the most frequent term is selected as each family's representative.</Paragraph>
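      <Paragraph position="2"> The two conflation stages can be sketched as follows, under stated assumptions: each term is assumed to arrive already analysed as a (term, combining forms, W) triple, and families are merged transitively with a small union-find structure, which the text does not specify. The names conflate and common_prefix_len are hypothetical. <![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def common_prefix_len(a, b):
    """Length of the longest initial substring shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def conflate(analysed_terms, min_shared=4):
    """Group terms into families.

    analysed_terms: iterable of (term, forms, w) triples, where forms is
    the list of combining forms found before the word w (empty for bare w).
    """
    # Stage 1: one family per word W.
    members = defaultdict(set)   # W -> terms of the family
    forms_of = defaultdict(set)  # W -> combining forms seen in the family
    for term, forms, w in analysed_terms:
        members[w].add(term)
        forms_of[w].update(forms)

    # Stage 2: merge families (assumption: merging is transitive).
    parent = {w: w for w in members}

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for a, b in combinations(sorted(members), 2):
        shares_start = common_prefix_len(a, b) >= min_shared
        if shares_start and forms_of[a] & forms_of[b]:
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for w, terms in members.items():
        merged[find(w)] |= terms
    return list(merged.values())


if __name__ == "__main__":
    analysed = [
        ("oncology", [], "oncology"),
        ("psycho-oncology", ["psycho"], "oncology"),
        ("neuro-oncology", ["neuro"], "oncology"),
        ("oncologist", [], "oncologist"),
        ("neurooncologist", ["neuro"], "oncologist"),
        ("therapy", [], "therapy"),
        ("chemotherapy", ["chemo"], "therapy"),
    ]
    for family in conflate(analysed):
        print(sorted(family))
```
]]></Paragraph>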
    </Section>
    <Section position="5" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
2.5 Data visualisation
</SectionTitle>
      <Paragraph position="0"> The results obtained are displayed as a weighted list in HTML format. Such lists, also named &quot;heat maps&quot; or &quot;tag clouds&quot; when they describe tags, usually represent the terms and topics which appear most frequently on websites or RSS feeds (Wikipedia, 2006). They can also be used to represent any kind of word list (Véronis, 2005). Different colours and font sizes are used depending on the word's frequency of occurrence. We have adapted this method to visualise the list of extracted terms. Since several hundred terms may be extracted, only the terms representing a family are displayed on the weighted list. Weight is given by the cumulative frequency of all the terms belonging to the family (see Figure 1).</Paragraph>
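      <Paragraph position="1"> A weighted list of this kind could be generated along the following lines. This is only a sketch: the function name weighted_list_html, the span markup, the linear font-size scale and the point sizes are assumptions, and colours are left out. The representative of each family is its most frequent term, and the weight is the summed frequency of the family's terms. <![CDATA[
```python
from html import escape


def weighted_list_html(families, freqs, min_pt=10, max_pt=32):
    """Render one HTML span per family, sized by cumulative frequency.

    families: iterable of sets of terms; freqs: mapping term -> frequency.
    Assumption: font size grows linearly with the family weight.
    """
    entries = []
    for family in families:
        # Sorting first makes the choice deterministic when counts tie.
        representative = max(sorted(family), key=lambda t: freqs.get(t, 0))
        weight = sum(freqs.get(t, 0) for t in family)
        entries.append((representative, weight))

    top = max((w for _, w in entries), default=1) or 1
    spans = []
    for term, weight in sorted(entries):
        size = min_pt + (max_pt - min_pt) * weight / top
        spans.append(
            '<span style="font-size:%dpt" title="%d">%s</span>'
            % (round(size), weight, escape(term))
        )
    return " ".join(spans)


if __name__ == "__main__":
    families = [{"oncology", "neuro-oncology"}, {"therapy", "chemotherapy"}]
    freqs = {"oncology": 6, "neuro-oncology": 4,
             "therapy": 3, "chemotherapy": 2}
    print(weighted_list_html(families, freqs))
```
]]></Paragraph>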
    </Section>
  </Section>
</Paper>