<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1030">
  <Title>Terminology Finite-State Preprocessing for Computational LFG</Title>
  <Section position="4" start_page="0" end_page="196" type="metho">
    <SectionTitle>
2 Terminology Extraction
</SectionTitle>
    <Paragraph position="0"> The first stage of this work was to extract terminology from our corpus, a small French technical text of 742 sentences (7000 words). Because we have parallel aligned English/French texts at our disposal, we use the English translation to decide whether a potential term is actually a term. The terminology we are dealing with is mainly nominal. To perform this extraction task, we use a tagger (Chanod and Tapanainen, 1995) to disambiguate the French text, and then extract the syntactic patterns N Prep N, N N, N A, and A N, which are good candidates to be terms. These candidates are considered terms when the corresponding English translation is a unit, or when their translation differs from a word-for-word translation. For example, we extract the following terms: (1) vitesses rampantes (creepers); boîte de vitesse (gearbox); arbre de transmission (drive shaft); tableau de bord (instrument panel). This simple method allowed us to extract a set of 210 terms, which are then integrated into the preprocessing stages of the parser, as explained in the following sections.</Paragraph>
    <Paragraph position="1"> We are aware that this semi-automatic process works because of the small size of our corpus. A fully automatic method (Jacquemin, 1997) could be used to extract terminology, but the material extracted was sufficient for the comparison experiment we had in mind.</Paragraph>
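As an illustration only, the pattern-matching step described above can be sketched in Python. This is a minimal sketch, not the tagger-based pipeline used in the paper: the tag names (N, PREP, A, DET, V, NUM), the function name extract_candidates, and the hand-tagged sample sentence are all assumptions made for the example. The filtering of candidates against the English translation is not modelled.

```python
# Candidate term patterns named in the text: N Prep N, N N, N A, A N.
PATTERNS = [("N", "PREP", "N"), ("N", "N"), ("N", "A"), ("A", "N")]


def extract_candidates(tagged):
    """Return every word sequence whose tag sequence matches a pattern.

    `tagged` is a list of (word, tag) pairs. The tag names are illustrative
    and are not the tagset of the tagger used in the paper.
    """
    candidates = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append(" ".join(word for word, _ in window))
    return candidates


# Hand-tagged sample sentence (hypothetical tagger output).
sentence = [("la", "DET"), ("boîte", "N"), ("de", "PREP"), ("vitesse", "N"),
            ("est", "V"), ("en", "PREP"), ("deux", "NUM"), ("sections", "N")]
print(extract_candidates(sentence))
```

In this sketch every match is only a candidate; deciding that a candidate is a term still requires the comparison with the English translation described above.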
  </Section>
  <Section position="5" start_page="196" end_page="197" type="metho">
    <SectionTitle>
3 Grammar Preprocessing
</SectionTitle>
    <Paragraph position="0"> In this section, we present how tokenization and morphological analysis are handled in the system and then how we integrate terminology processing in these two stages.</Paragraph>
    <Section position="1" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
3.1 Tokenization
</SectionTitle>
      <Paragraph position="0"> The tokenization process consists of splitting an input string into tokens (Grefenstette and Tapanainen, 1994; Ait-Mokthar, 1997), i.e. determining the word boundaries.</Paragraph>
      <Paragraph position="1"> If there is one and only one output string, the tokenization is said to be deterministic; if there is more than one output string, the tokenization is nondeterministic. The tokenizer of our application is nondeterministic (Chanod and Tapanainen, 1996), which is valuable for the treatment of some ambiguous input strings (footnote 2), but in this paper we deal with fixed multiword expressions.</Paragraph>
      <Paragraph position="2"> The tokenization is performed by applying a two-level finite-state transducer to the input string. For example, applying this transducer to the sentence in (2) gives the following result, the token boundary being the @ sign.</Paragraph>
      <Paragraph position="3"> (2) Le tracteur est à l'arrêt.</Paragraph>
      <Paragraph position="5"> (Footnote 2: for example, bien que in French.) In this particular case, each word is a token. But several words can form a unit, for example compounds or multiword expressions. Here are some examples of the desired tokenization, where terms are treated as units: (3) La boîte de vitesse est en deux sections.</Paragraph>
      <Paragraph position="6"> (The gearbox is in two sections.) La@boîte de vitesse@est@en@deux@sections@.@ (4) Ce levier engage l'arbre de transmission.</Paragraph>
      <Paragraph position="7"> (This lever engages the drive shaft.) Ce@levier@engage@l'@arbre de transmission@.@ We need such an analysis for the terminology extracted from the text. This tokenization is realized in two logical steps. The first step is performed by the basic transducer, which splits the sentence into a sequence of single words. Then a second transducer containing a list of multiword expressions is applied. It recognizes these expressions and marks them as units. When more than one expression in the list matches the input, the longest matching expression is marked. We have included all the terms and their morphological variations in this second transducer, so that they are analyzed as single tokens later in the process. The problem now is to associate a morphological analysis with these units.</Paragraph>
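The two logical steps can be illustrated with a short Python sketch. It is an assumption-laden stand-in, not the finite-state transducers of the paper: a regular expression plays the role of the basic tokenizer, a plain list plays the role of the multiword transducer, and the names basic_tokenize, group_terms, render and TERMS are hypothetical. Terms are tried from longest to shortest so that the longest matching expression is the one marked.

```python
import re


def basic_tokenize(text):
    """Step 1: split into single words; an elided form such as l' is its own token."""
    return re.findall(r"\w+'|\w+|[^\w\s]", text)


def group_terms(tokens, terms):
    """Step 2: mark known multiword terms as units, preferring the longest match."""
    # Try longer terms first so that the longest matching expression wins.
    seqs = [term.split() for term in terms if term.strip()]
    seqs.sort(key=len, reverse=True)
    grouped, i = [], 0
    while len(tokens) > i:
        for seq in seqs:
            if tokens[i:i + len(seq)] == seq:
                grouped.append(" ".join(seq))
                i += len(seq)
                break
        else:
            # No term starts at this position: keep the single-word token.
            grouped.append(tokens[i])
            i += 1
    return grouped


def render(tokens):
    """Write the tokens with @ after each one, as in examples (3) and (4)."""
    return "".join(token + "@" for token in tokens)


TERMS = ["boîte de vitesse", "arbre de transmission"]  # illustrative subset
tokens = basic_tokenize("Ce levier engage l'arbre de transmission.")
print(render(group_terms(tokens, TERMS)))
# prints: Ce@levier@engage@l'@arbre de transmission@.@
```

Unlike the tokenizer described in the text, this sketch is deterministic and matches terms by exact string comparison, so inflected variants would have to be listed explicitly.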
    </Section>
    <Section position="2" start_page="196" end_page="197" type="sub_section">
      <SectionTitle>
3.2 Morphological Analysis
</SectionTitle>
      <Paragraph position="0"> The morphological analyzer used during the parsing process, just after tokenization, is a two-level finite-state transducer (Chanod, 1994). This lexical transducer links the surface form of a string to its morphological analysis, i.e. its canonical form and some characterizing morphological tags. Some examples are given in (5).</Paragraph>
      <Paragraph position="1"> The compound terms have to be integrated into this transducer. This is done by developing a local regular grammar which describes the morphological variation of compounds, according to the inflectional model proposed in (Karttunen et al., 1992).</Paragraph>
      <Paragraph position="2"> The hypothesis is that only the two main parts of a compound are able to vary, i.e. N1 or A1, and N2 or A2, in the patterns N1 prep N2, N1 N2, A1 N2, and N1 A2. In our corpus, we identify two kinds of morphological variation: * The first part varies in number: gyrophare de toit, gyrophares de toit; régime moteur, régimes moteur. * Both parts vary in number: roue motrice, roues motrices. This is of course not general for French compounds, which have other variation patterns, but it is reliable enough for the technical manual we are dealing with. Other inflectional schemes and exceptions are described in (Karttunen et al., 1992) and (Quint, 1997), and can easily be added to the regular grammar if needed.</Paragraph>
      <Paragraph position="3"> A cascade of regular rules is applied to the different parts of the compound to build the morphological analyzer of the whole compound. For example, roue motrice is marked with the diacritic +DPL, for double plural, and then a first rule, which just copies the morphological tags from the end to the middle, is applied if the diacritic is present in the right context:  The composition of these two layers gives us the direct mapping between surface inflected forms and morphological analysis. The same kind of rule is used when only the first part of the compound varies, but in this case the second rule just deletes the tags of the second word. The two morphological analyzers for the two variations are both unioned into the basic morphological analyzer for French. The result is the transducer we use after tokenization to complete input preprocessing. An example of compound analysis is given here:  The morphological analysis developed here for terminology allows multiword terms to be treated as regular nouns within the parsing process. Constraints on agreement remain valid, for example for relative or adjectival attachment.</Paragraph>
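The mapping that results from the two variation schemes can be sketched in Python as a lookup table. This is a simplification under stated assumptions, not the cascade of regular rules itself: a dictionary stands in for the transducer, the plural is formed by naively appending s (sufficient for the examples in the text, not for French in general), and the names plural, compound_forms, the scheme labels "first" and "double", and the tags +SG and +PL are hypothetical.

```python
def plural(word):
    """Naive plural: append 's' (enough for the examples in the text)."""
    return word + "s"


def compound_forms(lemma, scheme):
    """Map each surface form of a compound to (lemma, number tag).

    scheme "first":  only the first part varies (gyrophare de toit).
    scheme "double": the first and last parts vary together (roue motrice).
    The tags +SG and +PL are illustrative, not the paper's tagset.
    """
    parts = lemma.split()
    plural_parts = list(parts)
    plural_parts[0] = plural(parts[0])
    if scheme == "double":
        plural_parts[-1] = plural(parts[-1])
    return {lemma: (lemma, "+SG"), " ".join(plural_parts): (lemma, "+PL")}


# Union the compound analyses into one lexicon, by analogy with unioning
# the compound analyzers into the basic morphological analyzer.
lexicon = {}
lexicon.update(compound_forms("gyrophare de toit", "first"))
lexicon.update(compound_forms("roue motrice", "double"))
print(lexicon["roues motrices"])
```

Because each inflected compound maps to a single lemma with a single number tag, a later component can treat it like any other noun.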
    </Section>
  </Section>
  <Section position="6" start_page="197" end_page="199" type="metho">
    <SectionTitle>
4 Parsing with the Grammar
</SectionTitle>
    <Paragraph position="0"> One of the problems one encounters when parsing with a high-level grammar is the multiplicity of (valid) analyses one gets as a result.</Paragraph>
    <Paragraph position="1"> While syntactically correct, some of these analyses should be removed for semantic reasons or in a particular context. One of the challenges is to reduce the number of parses without affecting the relevance of the results and without removing the desired parses. There are several ways to perform such a task, as described for example in (Segond and Copperman, 1997); we show here that finite-state preprocessing for compounds is compatible with other possibilities.</Paragraph>
    <Section position="1" start_page="197" end_page="198" type="sub_section">
      <SectionTitle>
4.1 Experiment and Results
</SectionTitle>
      <Paragraph position="0"> The experiment reported here is very simple: it consists of parsing the technical corpus before and after integration of the morphological terms in the preprocessing components, using exactly the same grammar rules, and comparing the results obtained. As the compounds are mainly nominal, they are analyzed just as regular nouns by the grammar rules. For example, if we parse the NP (7) La boîte de vitesse (the gearbox), before integration we get the structures shown in Fig. 3, and after integration we get the simple structures shown in Fig. 4. The following tables show the results obtained on the whole corpus:</Paragraph>
      <Paragraph position="2"> Columns: number of sentences, token average, parse average, time average. With terms: 358, 8.86, 2.79, 0.987. Without terms: 384, 8.98, 3.77, 1.025. The results are straightforward: one observes a significant reduction in the number of parses as well as in the parsing time, and no change at all for sentences which do not contain technical terms. Looking closer at the results shows that the parses ruled out by this method are semantically undesirable. We discuss these results in the next section.</Paragraph>
    </Section>
    <Section position="2" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
4.2 Analysis of Results
</SectionTitle>
      <Paragraph position="0"> The good results we obtained in terms of parse number and parsing time reduction were predictable. As the nominal terminology groups nouns, prepositional phrases and adjectival phrases together in lexical units, there is a significant reduction in the number of attachments.</Paragraph>
      <Paragraph position="1"> For example, the adjective hydraulique in the sentence: (8) Le voyant de levier de distributeur hydraulique s'allume. (The control valve lever warning light comes on.) can syntactically attach to voyant, levier, and distributeur, which leads to three analyses. But in the domain the corpus is concerned with, distributeur hydraulique is a term. Parsing it as a nominal unit gives only one parse, which is the desired one. Moreover, grouping terms into units resolves some lexical ambiguity in the preprocessing stage: for example, in ceinture de sécurité, the word ceinture is a noun but may be a verb in other contexts. Parsing ceinture de sécurité as a nominal term avoids further syntactic disambiguation.</Paragraph>
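The attachment count for example (8) can be reproduced with a small Python sketch. It only counts the nouns that precede a free adjective in a hand-tagged noun phrase, which is a rough proxy for the attachment sites discussed above and not the behaviour of the LFG parser; the tag names and the function name adjective_attachment_sites are assumptions.

```python
def adjective_attachment_sites(tagged):
    """Count the nouns that precede the first free adjective in a noun phrase.

    Returns 0 when the phrase contains no free adjective, i.e. when no
    adjectival attachment decision is left to the parser.
    """
    sites = 0
    for _, tag in tagged:
        if tag == "N":
            sites += 1
        elif tag == "A":
            return sites
    return 0


# Example (8) tokenized word by word: hydraulique is a free adjective.
before = [("voyant", "N"), ("de", "PREP"), ("levier", "N"),
          ("de", "PREP"), ("distributeur", "N"), ("hydraulique", "A")]
# The same phrase once distributeur hydraulique is a single nominal token.
after = [("voyant", "N"), ("de", "PREP"), ("levier", "N"),
         ("de", "PREP"), ("distributeur hydraulique", "N")]

print(adjective_attachment_sites(before), adjective_attachment_sites(after))
```

With the term grouped, the adjective is no longer visible to the grammar as a separate word, so the three-way choice disappears before parsing starts.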
      <Paragraph position="2"> Of course, one has to be very careful with the terminology integration in order to prevent a loss of valid analyses. In this experiment, no valid analyses were ruled out, because the semi-automatic method we used for extraction and integration allowed us to choose accurate terms.</Paragraph>
      <Paragraph position="3"> The reduction in the number of attachments is the main source of the decrease in the number of parses.</Paragraph>
      <Paragraph position="4"> As the number of attachments and of lexical ambiguities decreases, the number of grammar rules applied to compute the results decreases as well. The parsing time is reduced as a consequence. The gain in efficiency is interesting in this approach, but perhaps more valuable is the perspicuity of the results. For example, in a translation application it is clear that the representation given in Fig. 4 is more relevant and more directly exploitable than the one given in Fig. 3, because in this case there is a direct mapping between the semantic predicate in French and English.</Paragraph>
    </Section>
  </Section>
</Paper>