<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1074">
  <Title>Flow Network Models for Word Alignment and Terminology Extraction from Bilingual Corpora</Title>
  <Section position="4" start_page="446" end_page="448" type="metho">
    <SectionTitle>
3 Multilingual terminology extraction
</SectionTitle>
    <Paragraph position="0"> Several works describe methods to extract terms, or candidate terms, in English and/or French (Justeson and Katz, 1995; Daille, 1994; Nkwenti-Azeh, 1992). Some more specific works describe methods to align noun phrases within parallel corpora (Kupiec, 1993). The underlying assumption behind these works is that the monolingually extracted units correspond to each other cross-lingually. Unfortunately, this is not always the case, and the above methodology suffers from the weaknesses pointed out by Wu (1997) concerning parse-parse-match procedures.</Paragraph>
    <Paragraph position="1"> It is not, however, possible to fully reject the notion of grammar for term extraction, in so far as terms are highly characterized by their internal syntactic structure. We can also admit that lexical affinities between the diverse constituents of a unit can provide a good clue for termhood, but lexical affinities, otherwise called collocations, affect different linguistic units that need to be distinguished anyway (Smadja, 1992).</Paragraph>
    <Paragraph position="2"> Moreover, a study presented in (Gaussier, 1995) shows that terminology extraction in English and in French is not symmetric. In many cases, it is possible to obtain a better approximation for English terms than it is for French terms. This is partly due to the fact that English relies on a composition of Germanic type, as defined in (Chuquet and Paillard, 1989) for example, to produce compounds, and of Romance type to produce free NPs, whereas French relies on Romance type for both, with the classic PP attachment problems.</Paragraph>
    <Paragraph position="3"> These remarks lead us to advocate a mixed model, where candidate terms are identified in English and their French correspondents are then searched for. But since terms constitute rigid units, lying somewhere between single word notions and complete noun phrases, we should not consider all possible French units, but only the ones made of consecutive words.</Paragraph>
    <Section position="1" start_page="446" end_page="447" type="sub_section">
      <SectionTitle>
3.1 Model
</SectionTitle>
      <Paragraph position="0"> It is possible to use flow network models to capture relations between English and French terms. But since we want to discover French units, we have to add extra vertices and edges to our previous model, in order to account for all possible combinations of consecutive French words. We do that by adding several layers of vertices, the lowest layer being associated with the French words themselves, and each vertex in any upper layer being linked to two consecutive vertices of the layer below. The topmost layer contains only one vertex and can be seen as representing the whole French sentence. We will call the graph thus obtained a fertility graph. Figure 1 gives an example of part of a fertility graph (we have shown the flow values on each edge for clarity; the brackets delimit a multiword candidate term; we have not drawn the whole fertility graph encompassing the French sentence, but only the part encompassing the unit largeur de bande utilisée, where the possible combinations of consecutive words are represented by A, B, and C).</Paragraph>
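      <Paragraph> The layered construction described above can be sketched as follows. This is only an illustrative reading of the text, not the authors' implementation: the function names build_fertility_layers and span_text, and the representation of vertices as half-open (start, end) spans over lexical word positions, are assumptions made for the example.

```python
def build_fertility_layers(words):
    """Build the layered span structure for a list of French lexical words.

    Each vertex is a half-open span (start, end) over word positions.
    Layer 0 holds single words; layer k holds spans of k + 1 consecutive words.
    """
    n = len(words)
    layers = []
    links = {}  # upper-layer span -> the two consecutive spans of the layer below
    for length in range(1, n + 1):
        layer = [(start, start + length) for start in range(n - length + 1)]
        layers.append(layer)
        if length > 1:
            for start, end in layer:
                links[(start, end)] = [(start, end - 1), (start + 1, end)]
    return layers, links


def span_text(words, span):
    """Return the words covered by a span, joined by spaces."""
    start, end = span
    return " ".join(words[start:end])


# Lexical words of the unit "largeur de bande utilisée" ("de" is a grammatical word).
words = ["largeur", "bande", "utilisée"]
layers, links = build_fertility_layers(words)
for depth, layer in enumerate(layers):
    print(depth, [span_text(words, span) for span in layer])
```

Under these assumptions, the last layer always holds a single span covering all the words passed in, which mirrors the single topmost vertex representing the whole French sentence.</Paragraph>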
      <Paragraph position="1"> Note that we restrict ourselves to lexical words (nouns, verbs, adjectives and adverbs), not trying to align grammatical words. Furthermore, we rely on lemmas rather than inflected forms, thus enabling us to conflate in one form all the variants of a verb, for example (we have kept inflected forms in our figures for readability).</Paragraph>
      <Paragraph position="3"> The minimal cost flow in the graphs thus defined may not be directly usable. This is due to two problems: 1. first, we can have ambiguous associations: in figure 1, for example, the association between bandwidth and largeur de bande can be obtained through the edge linking these two units (type 1), or through two edges, one from bandwidth to largeur de bande, and one from bandwidth to either largeur or bande (type 2), or even through the two edges from bandwidth to largeur and bande (type 3); 2. secondly, there may be conflicts between connections: in figure 1 both largeur de bande and télécommunications are linked to bandwidth even though they are not contiguous.</Paragraph>
      <Paragraph position="4"> To solve ambiguous associations, we simply replace each association of type 2 or 3 by the equivalent type 1 association (footnote 3). For conflicts, we use the following heuristic: first select the conflicting edge with the lowest cost and assume that the association thus defined actually occurred, then rerun the minimal cost flow algorithm with this selected edge fixed once and for all, and repeat these two steps until there are no more conflicting edges, replacing type 2 or 3 associations as above each time it is necessary. Finally, the alignment obtained in this way will be called a solved alignment (footnote 4). Footnote 3: We can formally define an equivalence relation, in terms of the associations obtained, but this is beyond the scope of this paper.</Paragraph>
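      <Paragraph> The conflict heuristic (fix the cheapest conflicting edge, re-solve, repeat) can be sketched as a small loop. Everything named here is hypothetical: solve stands in for the minimal cost flow computation, which is not implemented, conflicts are assumed to mean that one English unit is linked to French spans separated by a gap, and the costs in CANDIDATES are toy values chosen for the example.

```python
def spans_contiguous(a, b):
    """True when two half-open spans (start, end) overlap or touch."""
    return not (max(a[0], b[0]) > min(a[1], b[1]))


def find_conflicts(alignment):
    """Return the edges whose English unit is also linked to a non-contiguous span.

    alignment is a list of (english_unit, french_span, cost) edges carrying flow.
    """
    by_unit = {}
    for edge in alignment:
        by_unit.setdefault(edge[0], []).append(edge)
    conflicts = []
    for edges in by_unit.values():
        for edge in edges:
            if any(not spans_contiguous(edge[1], other[1]) for other in edges):
                conflicts.append(edge)
    return conflicts


def resolve_conflicts(solve, max_rounds=100):
    """Fix the cheapest conflicting edge and re-solve until no conflict remains.

    solve(fixed) is a hypothetical callback standing in for the minimal cost
    flow computation; it is expected to keep every edge listed in fixed.
    max_rounds is a safety bound added for this sketch.
    """
    fixed = []
    alignment = solve(fixed)
    for _ in range(max_rounds):
        open_conflicts = [e for e in find_conflicts(alignment) if e not in fixed]
        if not open_conflicts:
            break
        fixed.append(min(open_conflicts, key=lambda edge: edge[2]))
        alignment = solve(fixed)
    return alignment


# Toy candidate edges (assumed values, not taken from the paper).
CANDIDATES = [
    ("bandwidth", (0, 2), 1.0),  # a unit such as largeur de bande
    ("bandwidth", (5, 6), 2.5),  # a distant, non-contiguous unit
    ("satellite", (3, 4), 0.7),
]


def toy_solve(fixed):
    """Toy stand-in for the flow solver: drop edges that clash with a fixed edge."""
    return [
        edge for edge in CANDIDATES
        if all(edge[0] != f[0] or spans_contiguous(edge[1], f[1]) for f in fixed)
    ]


solved = resolve_conflicts(toy_solve)
print(solved)
```

The replacement of type 2 or 3 associations by their type 1 equivalent is left out of this sketch, since the equivalence relation is not defined in the text.</Paragraph>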
    </Section>
    <Section position="2" start_page="447" end_page="448" type="sub_section">
      <SectionTitle>
3.2 Experiment
</SectionTitle>
      <Paragraph position="0"> In order to test the previous model, we selected a small bilingual corpus consisting of 1000 aligned sentences, from a corpus on satellite telecommunications. We then ran the following algorithm, based on the previous model: 1. tag and lemmatise the English and French texts, and mark all the English candidate terms using morpho-syntactic rules encoded in regular expressions; 2. build a first set of association probabilities, using the likelihood ratio test defined in (Gaussier, 1995);</Paragraph>
      <Paragraph position="1"> 3. for each pair of aligned sentences, construct the fertility graph, allowing a candidate term of length n to be aligned with units of length (n-2) to (n+2); define the costs of edges linking English vertices to French ones as the opposite of the logarithm of the normalised sum of probabilities of all possible word associations defined by the edge (for the edge between multiple (e1) access (e2) and the French unit accès (f1) multiple (f2), the normalised sum is $\frac{1}{4}\sum_{i,j} p(e_i, f_j)$); give all the other edges an arbitrary cost value; compute the solved alignment; and increment the count of the associations obtained by the overall value of the solved alignment; 4. select the first 100 unit associations according to their count, and consider them as valid. Go back to step 2, excluding from the search space the associations selected, until all associations have been extracted. Footnote 4: Once the solved alignment is computed, it is possible to determine the word associations between aligned units, through the application of the process described in the previous section with multiword notions.</Paragraph>
    </Section>
    <Section position="3" start_page="448" end_page="448" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> To evaluate the results of the above procedure, we manually checked each set of associations obtained after each iteration of the process, going from the first 100 to the first 500 associations.</Paragraph>
      <Paragraph position="1"> We considered an association as being correct if the French expression is a proper translation of the English expression. The following table gives the precision of the associations obtained.</Paragraph>
      <Paragraph position="2"> N. Assoc. Prec.</Paragraph>
      <Paragraph position="3">  The associations we are faced with represent different linguistic units. Some consist of single content words, whereas others represent multi-word expressions. One particularity of our process is precisely that it automatically identifies multiword expressions in one language, knowing units in the other one. With respect to this task, we extracted the first two hundred multiword expressions from the associations above, and then checked whether they were valid or not.</Paragraph>
      <Paragraph position="4"> We obtained the following results: N. Assoc. Prec.</Paragraph>
      <Paragraph position="5">  As a comparison, (Kupiec, 1993) obtained a precision of 90% for the first hundred associations between English and French noun phrases, using the EM algorithm. Our experiments with a similar method showed a precision around 92% for the first hundred associations on a set of aligned sentences comprising the one used for the above experiment.</Paragraph>
      <Paragraph position="6"> An evaluation on single words showed a precision of 98% for the first hundred and 97% for the first two hundred. But these figures should in fact be seen as lower bounds on the actual values we can get, in so far as we have not tried to extract single word associations from multi-word ones. Here is an example of associations obtained.</Paragraph>
      <Paragraph position="7">
telecommunication satellite | satellite de télécommunication
communication satellite | satellite de télécommunication
new satellite system | nouveau système de satellite; système de satellite nouveau; système de satellite entièrement nouveau
operating fss telecommunication link | exploiter la liaison de télécommunication du sfs
implement | mise en oeuvre
wavelength | longueur d'onde
offer | offrir, proposer
operation | exploitation, opération
The empty words (prepositions, determiners) were extracted from the sentences. In all the cases above, the use of prepositions and determiners was consistent all over the corpus. There are cases where two French units differ on a preposition. In such a case, we consider that we have two possible different translations for the English term.</Paragraph>
    </Section>
  </Section>
</Paper>