<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0104">
  <Title>Study and Implementation of Combined Techniques for Automatic Extraction of Terminology Béatrice Daille TALANA</Title>
  <Section position="4" start_page="30" end_page="30" type="metho">
    <SectionTitle>
2. Morphosyntactic variants
</SectionTitle>
    <Paragraph position="0"> Morphosyntactic variants concern the presence or absence of an article before the N2 in the N1 PREP N2 structure: ligne d'abonné, lignes de l'abonné (subscriber lines); the optional character of the preposition: tension hélice, tension d'hélice (helix voltage); and the synonymy relation between two base-terms of different structures, for example N ADJ and N1 à N2: réseau commuté, réseau à commutation (switched network).</Paragraph>
  </Section>
  <Section position="5" start_page="30" end_page="33" type="metho">
    <SectionTitle>
3. Elliptical variants
</SectionTitle>
    <Paragraph position="0"> A base-term of length 2 can be referred to by an elliptical form: for example, débit, which is used instead of débit binaire (bit rate).</Paragraph>
    <Paragraph position="1"> After this linguistic investigation, we decided to concentrate on terms of length 2 (base-terms), which seem by far the most frequent ones. Moreover, the majority of terms whose length is greater than 2 are built from base-terms. A statistical approach requires a good sampling, which base-terms provide. To filter base-terms from the corpus, we use their morphosyntactic structures. For this task, we need a tagged corpus where each item comes with its part-of-speech and its lemma.</Paragraph>
    <Paragraph position="2"> The part-of-speech is used to filter and the lemma to obtain an optimal sampling. We have used the stochastic tagger and the lemmatizer of the Scientific Center of IBM-France developed by the speech recognition team ([Dérouault, 1985] and [El-Bèze, 1993]).</Paragraph>
    <Paragraph position="3"> Linguistic filters. We now face a choice: we can either isolate collocations using statistics and then apply linguistic filters, or apply linguistic filters and then statistics. We adopted the latter strategy. The former requires a window of arbitrary size: with a small window, you will miss many occurrences, mainly morphosyntactic variants, base-terms modified by an inserted modifier (very frequent in French), and coordinated base-terms; with a longer one, you will obtain occurrences that do not refer to the same conceptual entity, many ill-formed sequences which do not characterize terms, and, moreover, wrong frequency counts, as several short sequences are masked by a single long sequence. Applying linguistic filters based on part-of-speech tags first appears to be the best solution. Moreover, as the patterns that characterize base-terms can be described by regular expressions, finite automata seem a natural way to extract and count the occurrences of the candidate base-terms.</Paragraph>
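    <Paragraph> As an illustration of filtering by part-of-speech patterns, the following minimal sketch encodes each tag as one letter and lets a regular expression play the role of the finite automaton. The tag names, the one-letter codes and the function name candidate_pairs are assumptions made for this example; they are not the tagset or the program used in this work. The pattern covers N1 N2, N1 PREP (DET) N2 and N1 ADJ PREP (DET) N2 only.

```python
import re

# Hypothetical one-letter codes for the tags of interest (assumed tagset);
# any other tag is encoded as "x" and can never be part of a match.
CODES = {"NOUN": "N", "ADJ": "A", "PREP": "P", "DET": "D"}

# N1 N2, N1 PREP (DET) N2, or N1 ADJ PREP (DET) N2.
# The lookahead lets a noun end one match and start the next one.
PATTERN = re.compile(r"(?=(N(?:A?PD?)?N))")


def candidate_pairs(tagged):
    """tagged: list of (lemma, tag) tuples for one sentence.

    Returns one (first_lemma, last_lemma) pair per pattern match.
    """
    tags = "".join(CODES.get(tag, "x") for _, tag in tagged)
    pairs = []
    for match in PATTERN.finditer(tags):
        first = match.start(1)      # index of N1 in the sentence
        last = match.end(1) - 1     # index of N2 in the sentence
        pairs.append((tagged[first][0], tagged[last][0]))
    return pairs
```

Because one letter stands for one item, a character position in the tag string is also the item position in the sentence, which is what makes the lemma lookup direct.</Paragraph>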
    <Paragraph position="4"> The frequency counts of the occurrences of the candidate terms are crucial, as they are the parameters of the statistical scores. A wrong frequency count implies wrong or irrelevant values of the statistical scores. The objective is to maximize the count of base-term occurrences and to minimize the count of incorrect occurrences. Graphical, orthographic and morphosyntactic variants of base-terms (except synonymic variants) are taken into account, as well as some syntactic variations that affect the base-term structure: coordination and insertion of modifiers. Coordination of two base-terms rarely leads to the creation of a new term of length greater than 2, so it is reasonable to think that the sequence équipements de modulation et de démodulation (modulation and demodulation equipments) is equivalent to the sequence équipement de modulation et équipement de démodulation (modulation equipment and demodulation equipment). Insertion of modifiers inside a base-term structure does not raise problems, except when the modifier is an adjective inserted inside a N1 PREP N2 structure. Consider the sequence antenne parabolique de réception (parabolic receiving antenna): it could be a term of length 3 (obtained either by over-composition or by modification) or a modified base-term, namely antenne de réception modified by the inserted adjective parabolique. On the one hand, we do not want to extract terms of length greater than 2; on the other hand, it is not possible to ignore adjective insertion. So we have chosen to accept the insertion of an adjective inside the N1 PREP N2 structure.</Paragraph>
    <Paragraph position="5"> This choice implies the extraction of terms of length 3 with the N1 ADJ PREP N2 structure, which are then treated as terms of length 2. However, such cases are rare, and the majority of N1 ADJ PREP N2 sequences refer to a N1 PREP N2 base-term modified by an adjective.</Paragraph>
    <Paragraph position="6"> Each occurrence of a base-term is counted equally; we consider that there is equiprobability of the term appearance in the corpus. The occurrences of morphological sequences which characterize base-terms are classified under pairs: a pair is composed of two main items in a fixed order and collects all the sequences where the two lemmas of the pair appear in one of the allowed morphosyntactic patterns; for example, the sequences ligne d'abonné, lignes de l'abonné (subscriber lines) and ligne numérique d'abonné (digital subscriber line) are each one occurrence of the pair (ligne, abonné).</Paragraph>
    <Paragraph position="7"> If we have the coordinated sequence lignes et services d'abonné (subscriber lines and services), we count one occurrence for the pair (ligne, abonné) and one occurrence for the pair (service, abonné). Our program scans the corpus, then counts and extracts the collocations whose syntax characterizes base-terms. Under each pair, we find all the different occurrences found, with their frequencies and their location in the corpus (file, sentence, item). This program runs fast: for example, it took 2 minutes to extract 8 000 pairs from our corpus SCH (200 000 words) for the structure N1 de (DET) N2 on a Sparc station ELC (SS1) under Sun-Os Release 4.1.3.</Paragraph>
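    <Paragraph> The classification of occurrences under pairs can be pictured with the following minimal sketch. It assumes that an earlier extraction step has already produced, for each occurrence, its pair of lemmas, its surface form and its location (file, sentence, item); the function names index_occurrences and pair_frequency are hypothetical and do not describe the actual program.

```python
from collections import defaultdict


def index_occurrences(occurrences):
    """occurrences: iterable of (pair, surface, location) triples, where
    pair is a (lemma1, lemma2) tuple and location is (file, sentence, item).

    Returns a mapping: pair, then surface form, then list of locations.
    """
    index = defaultdict(lambda: defaultdict(list))
    for pair, surface, location in occurrences:
        index[pair][surface].append(location)
    return index


def pair_frequency(index, pair):
    """Number of occurrences recorded under a pair; every variant
    (article, inserted adjective, coordination) counts equally."""
    return sum(len(locations) for locations in index.get(pair, {}).values())
```

A coordinated sequence is simply recorded twice by the caller, once under each of the two pairs it stands for, so that each pair receives one occurrence.</Paragraph>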
    <Paragraph position="8"> Now that we have obtained a set of pairs, each pair representing a candidate term, we apply statistical scores in order to distinguish terms from non-terms among the candidates.</Paragraph>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
Lexical Statistics
</SectionTitle>
      <Paragraph position="0"> The problem to solve now is to discover which statistical score best isolates terms among our list of candidates. So we compute several measures: frequencies, association criteria, Shannon diversity and distance scores. These measures do not all serve the same purpose: frequencies are the parameters of the association criteria, association criteria propose a conceptual sort of the pairs, and Shannon diversity and distance measures are not discriminatory scores but provide other types of information.</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="33" type="sub_section">
      <SectionTitle>
Frequencies and Association criteria
</SectionTitle>
      <Paragraph position="0"> From a statistical point of view, the two lemmas of a pair could be considered as two qualitative variables whose link has to be tested. A contingency table is defined for each pair (Li, Lj):</Paragraph>
      <Paragraph position="2"> The table has two rows and two columns: the row for Li holds a (column Lj) and b (column Lj', with j' ≠ j); the row for Li', with i' ≠ i, holds c (column Lj) and d (column Lj'), where: a stands for the frequency of pairs involving both Li and Lj, b stands for the frequency of pairs involving Li and Lj', c stands for the frequency of pairs involving Li' and Lj, d stands for the frequency of pairs involving Li' and Lj'. The statistical literature proposes many scores which can be used to test the strength of the bond between the two variables of a contingency table. Some are well-known, such as the association ratio, close to the concept of mutual information, introduced by [Church and Hanks, 1990]:</Paragraph>
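      <Paragraph> The formulas of the scores are not reproduced in this text. The following sketch therefore relies on assumed textbook definitions, which may differ in detail from formulas 1 to 3 of this work: the association ratio as the base-2 logarithm of aN / ((a+b)(a+c)) with N = a+b+c+d, the Φ² coefficient as (ad - bc)² / ((a+b)(a+c)(b+d)(c+d)), and the log-likelihood coefficient as a sum of x log x terms. The convention 0 log 0 = 0 is also an assumption of the sketch; all function names are hypothetical.

```python
import math


def contingency(pair_counts, li, lj):
    """pair_counts: dict mapping (lemma1, lemma2) to its frequency.
    Returns the cells (a, b, c, d) of the table for the pair (li, lj)."""
    a = pair_counts.get((li, lj), 0)
    row = sum(n for (x, _), n in pair_counts.items() if x == li)
    col = sum(n for (_, y), n in pair_counts.items() if y == lj)
    total = sum(pair_counts.values())
    return a, row - a, col - a, total - row - col + a


def association_ratio(a, b, c, d):
    # Assumed definition: log2 of a*N / ((a+b)*(a+c)); requires a > 0.
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


def phi2(a, b, c, d):
    # Assumed definition; undefined when a row or column total is zero.
    return (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def _xlogx(x):
    # Convention assumed here: 0 log 0 = 0.
    return x * math.log(x) if x > 0 else 0.0


def loglike(a, b, c, d):
    # Assumed definition: cell terms minus marginal terms plus total term.
    cells = _xlogx(a) + _xlogx(b) + _xlogx(c) + _xlogx(d)
    margins = _xlogx(a + b) + _xlogx(a + c) + _xlogx(b + d) + _xlogx(c + d)
    return cells - margins + _xlogx(a + b + c + d)
```

With these definitions, a table whose cells are exactly what independence predicts receives a log-likelihood value of zero, and the value grows as the pair departs from independence.</Paragraph>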
      <Paragraph position="4"> A property of these scores is that their values increase with the strength of the bond between the lemmas. We have tried out several scores (more than ten), including IM, Φ² and Loglike, and we have sorted the pairs by score value. Each score proposes a conceptual sort of the pairs. This sort, however, could put at the top of the list compounds that belong to general language rather than to the telecommunication domain.</Paragraph>
      <Paragraph position="6"> As we want to obtain a list of telecommunication terms, it is essential to evaluate the correlation between the score values and the pairs and to find out which scores are the best to extract terminology. Therefore, we compare the values obtained for each score to a reference list of the domain. We have used the terminology data bank of the EEC, telecommunication section, which has been elaborated by experts. This evaluation has been done for 2 200 French pairs of N1 de (DET) N2 structure extracted from our corpus SCH (200 000 words); only pairs which appear at least twice in the corpus have been retained. Each score provides as a result a list where the candidates are sorted by score value. We have defined equivalence classes which generally collect 50 successive pairs of the list. The results of a score are represented graphically by a histogram in which the x-axis represents the pairs sorted according to the score value, and the y-axis the ratio of the number of pairs belonging to the reference list to the number of pairs per equivalence class, i.e. generally 50 pairs. If all the pairs of an equivalence class belong to the reference list, we obtain the maximum ratio of 1; if none of the pairs appear in the reference list, the minimum ratio of 0 is reached. The ideal score should assign its high values (resp. low) to good (resp. bad) pairs, i.e. candidates which belong (resp. which do not belong) to the reference list.</Paragraph>
      <Paragraph position="7"> In other words, the histogram of the ideal score should assign to equivalence classes containing the high values (resp. low values) of the score a ratio close to 1 (resp. 0). We are not going to present here all the histograms obtained (see [Daille, 1994]).</Paragraph>
      <Paragraph position="8"> All of them show a general growing trend, which confirms that the score values increase with the strength of the bond between the lemmas. However, the growth is more or less clear, with more or less sharp variations. The most beautiful histogram is that of the simple frequency of the pair (see Figure 1). This histogram shows that the more frequent the pair is, the more likely the pair is a term. Frequency is the most significant score to detect terms of a technical domain. This result contradicts numerous results on lexical resources, which claim that association criteria are more significant than frequency: for example, all the most frequent pairs whose terminological status is undoubted share low values of the association ratio (formula 1), as for example réseau à satellites (satellite network) IM=2.57, liaison par satellite (satellite link) IM=2.72, circuit téléphonique (telephone circuit) IM=3.32, station spatiale (space station) IM=1.17, etc. The remaining problem with the sort proposed by frequency is that it very quickly integrates bad candidates, i.e. pairs which are not terms. So we have preferred to elect the Loglike coefficient (formula 3) as the best score. Indeed, the Loglike coefficient, which is a real statistical test, takes into account the pair frequency but accepts very little noise for high values. To give an element of comparison, the first bad candidate with frequency for the general pattern N1 (PREP (DET)) N2 is the pair (cas, transmission), which appears in 56th place; with Loglike, this pair is also the first bad candidate but appears only in 176th place. We give in Figure 2 the topmost 11 French pairs sorted by the Loglike coefficient (Logl) (Nbc is the number of occurrences of the pair and IM the value of the association ratio).</Paragraph>
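      <Paragraph> The evaluation by equivalence classes described above can be sketched as follows. The sketch assumes that the pairs are sorted by increasing score, so that a good score yields ratios growing towards 1; the function name class_ratios and its parameters are hypothetical.

```python
def class_ratios(scored_pairs, reference, class_size=50):
    """scored_pairs: list of (pair, score) tuples.
    reference: set of pairs found in the reference list.

    Returns, for each class of successive pairs in increasing score
    order, the share of its pairs that belong to the reference list.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    ratios = []
    for start in range(0, len(ranked), class_size):
        group = ranked[start:start + class_size]
        hits = sum(1 for pair, _ in group if pair in reference)
        ratios.append(hits / len(group))
    return ratios
```

The last class may hold fewer pairs than class_size, which is why the ratio is divided by the actual size of each class.</Paragraph>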
    </Section>
    <Section position="3" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
Diversity
</SectionTitle>
      <Paragraph position="0"> Diversity was introduced by [Shannon, 1948] and characterizes the marginal distribution of a lemma of a pair across the set of pairs. Its computation uses a contingency table of length n; we give below as an example the contingency table which is associated with the pairs of N ADJ structure: [Table: contingency table for the N ADJ structure; legible labels: progressif, porteur, Total.] The line counts nbi., which are found in the last column, represent the distribution of the adjectives with regard to a given noun. The column counts nb.j, which are found on the last line, represent the distribution of the nouns with regard to a given adjective. These distributions are called "marginal distributions" of the nouns and the adjectives for the N ADJ structure. Diversity is computed for each lemma appearing in a pair, using the formula:</Paragraph>
      <Paragraph position="2"> For example, using the contingency table of the N ADJ structure above, the diversity of the noun onde is equal to:</Paragraph>
      <Paragraph position="4"> We denote by H1 the diversity of the first lemma of a pair and by H2 the diversity of the second lemma. We take into account the diversity normalized by the number of occurrences of the pairs:</Paragraph>
      <Paragraph position="6"> The normalized diversities h1 and h2 are defined from H1 and H2.</Paragraph>
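      <Paragraph> The diversity formulas are not reproduced in this text. The following sketch assumes the unnormalized Shannon form H = N log N minus the sum of n log n over the pairs in which the lemma appears, where N is the marginal count of the lemma; this form gives a zero diversity to a lemma that appears in a single pair, whatever the frequency of that pair. The normalization, H divided by the number of occurrences of the pair, is likewise an assumption, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def _shannon(counts):
    # N log N - sum(n log n); counts are positive pair frequencies.
    total = sum(counts)
    return total * math.log(total) - sum(n * math.log(n) for n in counts)


def diversities(pair_counts):
    """pair_counts: dict mapping (lemma1, lemma2) to its frequency.
    Returns (H1, H2): diversity of each first lemma and each second lemma."""
    rows = defaultdict(list)
    cols = defaultdict(list)
    for (l1, l2), n in pair_counts.items():
        rows[l1].append(n)
        cols[l2].append(n)
    h1 = {lemma: _shannon(counts) for lemma, counts in rows.items()}
    h2 = {lemma: _shannon(counts) for lemma, counts in cols.items()}
    return h1, h2


def normalized_diversities(pair_counts, pair):
    """Returns (h1, h2) for one pair, assuming division by its frequency."""
    big_h1, big_h2 = diversities(pair_counts)
    n = pair_counts[pair]
    return big_h1[pair[0]] / n, big_h2[pair[1]] / n
```

Under this assumed form, a lemma spread evenly over several pairs obtains the highest value for a given marginal count.</Paragraph>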
      <Paragraph position="7"> The normalized diversity provides interesting information about the distribution of the pair lemmas in the set of pairs. A lemma with a high diversity appears in several pairs in equal proportion; conversely, a lemma which appears in only one pair has a zero diversity (minimal value), whatever the frequency of the pair. High values of h1 applied to the pairs of N ADJ structure characterize nouns that could be seen as key-words of the domain: réseau (network), signal, antenne (antenna), satellite. Conversely, high values of h2 applied to the pairs of N ADJ structure characterize adjectives which do not take part in base-MWUs, such as nécessaire (necessary), suivant (following), important, différent (various), tel (such), etc. The pairs with a zero diversity on one of their lemmas receive high values of the association ratio and of other association criteria, and a non-definite value of the Loglike coefficient. However, the diversity is more precise because it indicates whether the two lemmas appear only together, as for (océan, indien) (indian ocean) (H1=h1=H2=h2=0), or, if not, which of the two lemmas appears only with the other, as for (réseau, maillé) (mesh network) (H2=h2=0), where the adjective maillé appears only with réseau, or for (codeur, idéal) (ideal coder) (H1=h1=0), where the noun codeur appears only with the adjective idéal. Other examples are: (île, salomon) (solomon island), (hélium, gazeux) (helium gas), (suppresseur, écho) (echo suppressor). These pairs include many frozen compounds and collocations of the current language. In future work, we will investigate how to incorporate the nice results provided by diversity into an automatic extraction algorithm. Distance Measures. French base-terms often accept modifications of their internal structure, as demonstrated previously.
Each time an occurrence of a pair is extracted and counted, two distances are computed: the number of items Dist and the number of main items MDist which occur between the two lemmas. Then, for each pair, the mean and the variance of the number of items and of main items are computed. The variance formula is:</Paragraph>
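      <Paragraph> The variance formula is not reproduced in this text. The following sketch assumes the population variance (division by the number of occurrences) and assumes that main items are those tagged as noun, adjective, adverb or verb; both choices, and all the names used, are hypothetical.

```python
# Assumed set of tags that count as main items.
MAIN_TAGS = {"NOUN", "ADJ", "ADV", "VERB"}


def distances(tags, first, last):
    """tags: list of part-of-speech tags of a sentence; first and last
    are the positions of the two lemmas of the pair.
    Returns (Dist, MDist) for this occurrence."""
    between = tags[first + 1:last]
    return len(between), sum(1 for tag in between if tag in MAIN_TAGS)


def mean_and_variance(values):
    """Mean and population variance of a non-empty list of distances."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance
```

A pair whose occurrences all share the same distance has a variance of zero, and its mean then indicates how many items separate the two lemmas.</Paragraph>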
      <Paragraph position="9"> The distance measures bring interesting information concerning the morphosyntactic variations of the base-terms, but they do not allow a decision on the term or non-term status of a candidate. A pair which has no distance variation, whatever the distance, may or may not be a term; here are some examples of pairs which have no distance variation and which are not terms: paire de signal (a pair of signals), type d'antenne (a type of antenna), organigramme de la figure (diagram of the figure), etc. We illustrate below how the distance measures make it possible to attribute to a pair its elementary type automatically, for example, either N1 N2, N1 PREP N2, N1 PREP DET N2, or N1 ADJ PREP (DET) N2 for the general N1 (PREP (DET)) N2 structure.</Paragraph>
      <Paragraph position="10">  We have presented a combined approach for automatic term extraction. Starting from a first selection of lemma pairs representing candidate terms from a morphosyntactic point of view, we have applied and evaluated several statistical scores. Results were surprising: most association criteria (for example, mutual association) did not give good results, contrary to frequency. This bad behavior of the association criteria could be explained by the introduction of linguistic filters. We can notice anyway that frequency undoubtedly characterizes terms, contrary to association criteria, which select in their high values frozen compounds belonging to general language. However, we preferred to elect the Loglike criterion rather than frequency as the best score. The Loglike criterion takes into account the frequency of the pairs but provides a conceptual sort of high accuracy. Our system, which uses finite automata, improves the results of the extraction of lexical resources and demonstrates the value of incorporating linguistics into a statistical system. This method has been extended to bilingual terminology extraction using aligned corpora ([Daille et al., 1994]).</Paragraph>
    </Section>
  </Section>
</Paper>