<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0605"> <Title>AUTOMATIC LEXICON ENHANCEMENT BY MEANS OF CORPUS TAGGING</Title> <Section position="3" start_page="0" end_page="29" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With both Automatic Speech Processing and Natural Language Processing it is necessary to use a lexicon which associates each item with a certain number of characteristics (syntactic, morphologic, frequency, phonetic, etc.). In Speech Recognition, these lexicons are necessary in the lexical access phases and the language modelisation as they allow the association between lexical items and recognised sounds while maintaining syntactic coherence within the sentence under analysis. In Speech Synthesis, the grapheme-to-phoneme transcription phase uses morphological and syntactical information to constjrain the phonetic transcription of the graphemes.</Paragraph> <Paragraph position="1"> In both cases, using lexicons which have the maximum information about the subject is an important benefit.</Paragraph> <Paragraph position="2"> The actual performance of Automatic Speech Treatment systems often limits their application to smaller subject-areas of language (medical texts, economic articles, etc.). It is important to have specialised lexicons which cover these smaller subject-areas in order to optimise the synthesis or recognition applications. But although general lexicons are readily available now, this is not the case for specialised lexicons which contain, for example, technical terms relevant to a subject, or family and brand names as can be found in journalistic texts.</Paragraph> <Paragraph position="3"> When working with corpora we are faced by the evolutionary aspects of a given language. The quicker the evolution of a specialised area, the more the dictionary will lack the ability to cover the subject, because a dictionary represents the state of a language at a given time. The words missing from a lexicon (which we refer to here as Out-Of-Vocabulary words or OOV words) represent a significant problem. In effect, whatever the size of the lexicon used, one can always find OOV words in texts.</Paragraph> <Paragraph position="4"> If, for a given word, the lexical access fails, this failure can affect the processing of the word as well as the processing of the contextual words.</Paragraph> <Paragraph position="5"> It would be useful to have dynamic lexicons which evolve in accordance with the corpora processed in order to limit, as much as possible, the OOV words.</Paragraph> <Paragraph position="6"> Such an enhancement of lexicons could be automatic if big corpora of specialised texts were available : medical reports in an electronic form, newspaper available in CD-ROM, etc.</Paragraph> <Paragraph position="7"> This interesting idea of automatically enhancing specialised lexicons from a general lexicon and a big corpus, is the aim of this paper. By using statistical language models, we show how to automatically assign one or several categories to the OOV words which are found in our corpora. 
Then, by taking into account all the occurrences of each OOV word, we are able to automatically extract a new lexicon of OOV words with reliable labels associated with each word.</Paragraph> </Section> <Section position="4" start_page="29" end_page="29" type="metho"> <SectionTitle> 2 Processing OOV words </SectionTitle> <Paragraph position="0"> Various applications at LIA need a large lexicon, such as the automatic generation of written accents in French texts, language models for a dictation machine, the grapheme-to-phoneme transcription system, etc. As most of these applications process text corpora, the lexicon is mainly used through a syntactic labelling system developed at the laboratory (El-Bèze, 1995). This tagging system is based on a 3-class probabilistic language model which has been trained on a corpus of 39 million words taken from articles of the French newspaper Le Monde.</Paragraph> <Paragraph position="1"> The lexicon used is composed of 230 000 items.</Paragraph> <Paragraph position="2"> The use of a large general dictionary allows us to limit most of the OOV words to one of the following categories: proper names, composite words, unused inflections, neologisms, and mistakes. The problem of missing roots becomes important when the texts processed belong to a different area from the one used to build the lexicon. This is the case for corpora dedicated to sub-areas of language, such as technical documentation, for example.</Paragraph> <Paragraph position="3"> Previous studies (Ueberla, 1995; Maltese, 1991) show that the modelling of OOV words significantly improves the performance of a language model.</Paragraph> <Paragraph position="4"> The presence of OOV words in the corpus can produce errors, not only on the form itself, but also on its context in the sentence. This is the reason why the syntactic tagging system has been endowed with a module, called Devin (Spriet, 1996), which proposes a category for each OOV word that is found.</Paragraph> <Paragraph position="5"> The modules described here take into account all the simple OOV words, which are those composed only of alphabetical characters (no spaces, hyphens, digits, or special characters). A specific module dedicated to composite words is currently being developed. We classify these simple OOV words into two categories: the &quot;proper-names&quot; and the &quot;common-words&quot;, which represent all the others. By applying simple heuristics to a sentence we can separate the OOV words into proper-names and common-words.</Paragraph> </Section> <Section position="5" start_page="29" end_page="30" type="metho"> <SectionTitle> 3 Processing OOV common-words with the morpho-syntactic Devin </SectionTitle> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 3.1 Out-of-context process </SectionTitle> <Paragraph position="0"> The goal of this module is to assign a probability to each syntactic label which can represent an OOV common-word. These labels are distributed amongst 21 syntactic classes (adverbs, adjectives, nouns, verbs). It is commonly accepted that the ending of a word belonging to one of these classes strongly influences its syntactic category (Vergne, 1989; Guillet, 1989). Using this idea, we trained a statistical model with all the words from our dictionary. We make the hypothesis that this model will work correctly on unknown words, since these words should be governed by the same morphological principles. The approach chosen is based on decision trees (Breiman, 1984). An out-of-context evaluation of the morpho-syntactic Devin is presented in (Spriet, 1996).</Paragraph>
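The sketch below illustrates this out-of-context guesser. It is only a simplification of the method described here: instead of the decision trees of (Breiman, 1984), it backs off to the longest word ending observed in a small labelled word list, and the training pairs, tag names and maximum suffix length are illustrative assumptions rather than the actual LIA lexicon and tag set.

# Minimal sketch of an out-of-context Devin for OOV common-words.
# Assumption: a longest-matching-suffix table stands in for the decision trees
# used in the paper; the (word, tag) pairs below are illustrative placeholders.
from collections import defaultdict

MAX_SUFFIX_LEN = 4

def train_suffix_model(tagged_lexicon, max_len=MAX_SUFFIX_LEN):
    """Count, for every word ending, how often each syntactic class occurs."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, tag in tagged_lexicon:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts

def guess_categories(word, counts, max_len=MAX_SUFFIX_LEN):
    """Return P(tag | longest known suffix of the word)."""
    for n in range(min(max_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in counts:
            total = sum(counts[suffix].values())
            return {tag: c / total for tag, c in counts[suffix].items()}
    return {}  # no known suffix: no out-of-context hypothesis

# Toy usage with hypothetical French endings and tags.
lexicon = [("rapidement", "ADV"), ("finement", "ADV"),
           ("national", "ADJ"), ("régional", "ADJ"),
           ("chanter", "VERB"), ("manger", "VERB")]
model = train_suffix_model(lexicon)
print(guess_categories("approximativement", model))  # ending -ment -> {'ADV': 1.0}
print(guess_categories("fédéral", model))            # ending -al   -> {'ADJ': 1.0}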
</Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 3.2 Context analysis </SectionTitle> <Paragraph position="0"> The context analysis of OOV words permits the choice, from all the possible categories proposed by the Devin, of the one which best fits the context of the OOV word. The hypotheses produced for each OOV word are inserted into the graph of possible categories generated by the language model. The 3-class analysis then allows us to find the label which has the highest probability.</Paragraph> <Paragraph position="1"> We decided to test the module on a corpus containing &quot;forced&quot; OOV words. This means that we voluntarily removed a set of test words from the lexicon. The text corpus chosen contained 313 690 words, of which 10 850 were &quot;forced&quot; OOV words (these 10 850 occurrences represent 3430 different forms).</Paragraph> <Paragraph position="2"> In the first stage, we labelled this corpus without using the Devin. 1771 contextual errors (as compared to the initial reference) were induced by the addition of the 10 850 OOV words. Then we labelled the same corpus again, this time using the Devin. 88.3% of the OOV words were correctly labelled (as compared to the initial reference) and 86.2% of the induced contextual errors were corrected thanks to the attribution of a syntactic category to each OOV word. Thus, 87.5% of the labelling differences with the initial reference were corrected by using the Devin.</Paragraph> <Paragraph position="3"> It is important to point out that this type of evaluation does not take into account the errors which are intrinsic to the tagging system employed (about 4%, as mentioned in (El-Bèze, 1995)). Indeed, the syntactic categories calculated by the Devin were compared to those produced by the tagger when these words belonged to the lexicon. Nevertheless, the benefit of this technique is that it is automatic, which allows us to test our module on a large test corpus. A manual verification on a small corpus of &quot;true&quot; OOV words has also been carried out (Spriet, 1996); the results are quite similar.</Paragraph>
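The following sketch illustrates, under simplifying assumptions, how the Devin hypotheses can be combined with a 3-class model: each word contributes a set of candidate classes (taken from the lexicon, or from the Devin for an OOV word), and a Viterbi search over class trigrams keeps the most probable class sequence. The trigram and lexical probabilities are toy placeholders, not the model trained on Le Monde, and smoothing is reduced to a single default value.

# Minimal sketch of the contextual disambiguation step with a class-trigram model.
import math

def viterbi_3class(candidates, trigram_p, start="BOS"):
    """candidates: one dict per word mapping class -> P(word or Devin guess | class)."""
    # A state is the pair (class of word i-1, class of word i).
    best = {(start, start): (0.0, [])}  # (log-probability, class sequence)
    for cand in candidates:
        new_best = {}
        for (c1, c2), (score, path) in best.items():
            for c3, emit in cand.items():
                s = score + math.log(trigram_p(c1, c2, c3)) + math.log(emit)
                key = (c2, c3)
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, path + [c3])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Toy usage: the third token is an OOV word whose candidate classes and
# out-of-context probabilities come from the morpho-syntactic Devin.
def trigram_p(c1, c2, c3):
    # Hypothetical class trigrams favouring a DET NOUN ADJ pattern.
    table = {("BOS", "BOS", "DET"): 0.5,
             ("BOS", "DET", "NOUN"): 0.6,
             ("DET", "NOUN", "ADJ"): 0.5,
             ("DET", "NOUN", "NOUN"): 0.2}
    return table.get((c1, c2, c3), 0.05)

sentence = [{"DET": 0.9},                # "la"    (in the lexicon)
            {"NOUN": 0.8},               # "crise" (in the lexicon)
            {"ADJ": 0.54, "NOUN": 0.3}]  # OOV word, Devin hypotheses
print(viterbi_3class(sentence, trigram_p))  # ['DET', 'NOUN', 'ADJ']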
</Section> </Section> <Section position="6" start_page="30" end_page="30" type="metho"> <SectionTitle> 4 Proper-names process </SectionTitle> <Paragraph position="0"> The second category of OOV words represents the forms which have been identified as proper-names.</Paragraph> <Paragraph position="1"> We separate these words into the following classes: family name, first name, town name, company name, and country name. It is not possible to build a simple morphological module to process proper-names. Thus, the estimation of an out-of-context probability for each of these classes is independent of the graphical form of the proper-name.</Paragraph> <Paragraph position="2"> It is therefore the consideration of the context that allows us to attribute a reliable probability to an OOV proper-name belonging to a specific class. We present here a method based on a statistical 3-class model dedicated to OOV proper-names.</Paragraph> <Section position="1" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 4.1 Contextual Tagging using the Devin for proper-names </SectionTitle> <Paragraph position="0"> The general 3-class language model is, most of the time, unable to choose between the different categories of proper-names. In fact, when one has to decide whether an OOV word is a family name or a town name, the word context of the OOV word is more useful than its syntactic-class context. A 3-gram model seems natural for solving this problem. But, because we want to process OOV words, we use a 3-gram model specific to proper names in which some categories of words are represented by their classes (all the proper names as well as punctuation and non-alphabetical words) while the others are represented by their graphical form (all the other classes). In the labelling process, when an OOV proper-name Xi appears at position i in the sentence, the label given to Xi is the class t which maximizes P(t|Xi), the probability of Xi belonging to class t.</Paragraph> <Paragraph position="2"> We carried out experiments similar to those presented above. The test corpus was the same and we voluntarily removed 970 proper-names from the lexicon, which represented 5000 occurrences in the corpus. 86% of these OOV words were correctly tagged by the proper-names language model.</Paragraph> <Paragraph position="3"> It is important to point out that the average number of classes which can be attributed to a proper-name is very close to 1 (1.07 in our test corpus and 1.08 in the general lexicon). This shows that the comparison between the reference labels and the calculated labels is a meaningful evaluation.</Paragraph>
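The sketch below gives one possible reading of this proper-names model: the OOV proper-name is replaced in turn by each candidate class, the mixed word/class trigram model scores the trigrams covering its position, and the best-scoring class approximates the class t which maximizes P(t|Xi). The class inventory, the trigram table and the floor probability are illustrative assumptions.

# Minimal sketch of the proper-name Devin based on a mixed word/class 3-gram model.
import math

PN_CLASSES = ["FAMILY_NAME", "FIRST_NAME", "TOWN", "COMPANY", "COUNTRY"]

def score_class(tokens, i, pn_class, trigram_p):
    """Log-probability of the trigrams covering position i when tokens[i] is labelled pn_class."""
    toks = tokens[:i] + [pn_class] + tokens[i + 1:]
    logp = 0.0
    for j in range(max(0, i - 2), min(len(toks) - 3, i) + 1):
        logp += math.log(trigram_p(toks[j], toks[j + 1], toks[j + 2]))
    return logp

def guess_proper_name_class(tokens, i, trigram_p, classes=PN_CLASSES):
    return max(classes, key=lambda t: score_class(tokens, i, t, trigram_p))

# Toy usage on a hypothetical sentence: la ville de X est petite (X is the OOV form).
def trigram_p(a, b, c):
    table = {("ville", "de", "TOWN"): 0.4,
             ("de", "TOWN", "est"): 0.3,
             ("ville", "de", "FAMILY_NAME"): 0.05}
    return table.get((a, b, c), 0.001)

tokens = ["la", "ville", "de", "UNKNOWN_PN", "est", "petite"]
print(guess_proper_name_class(tokens, 3, trigram_p))  # TOWN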
</Section> </Section> <Section position="7" start_page="30" end_page="31" type="metho"> <SectionTitle> 5 Automatic lexicon production </SectionTitle> <Paragraph position="0"> By studying all the occurrences, in all their contexts, of the OOV words of a corpus, we aim to automatically obtain new lexicons which represent the corpus studied.</Paragraph> <Paragraph position="1"> As we have already mentioned, the syntactic tagger used was trained on a journalistic text corpus from the newspaper Le Monde. The test corpus chosen to validate our automatic lexicon enhancement method was composed of articles of the newspaper Le Monde Diplomatique from 1990 to 1995. This 6-million-word corpus contains a large number of proper-names and technical terms relating to various subjects.</Paragraph> <Paragraph position="2"> The test corpus contains 110 000 OOV words composed as follows: The lack of static coverage of our general lexicon is 1.85% (0.38% for the OOV common-words and 1.06% for the OOV proper-names).</Paragraph> <Paragraph position="3"> By tagging the corpus using the Devin modules (for common-words and proper-names) we are able to automatically extract a lexicon of OOV words which contains, for each word, its number of occurrences as well as the list of labels which have been attributed to it during the tagging process. The list of labels given to each word of the lexicon is sorted by frequency, as shown in the example below.</Paragraph> <Paragraph position="4"> OOV word      Nb   C1         C2         C3        C4
tchétchène    41   AFS (54%)  AMS (32%)  NFS (8%)  NMS (6%)
This frequency information allows us to filter the lexicon according to two criteria: the number of occurrences of each word, and the percentage of occurrences of each label given to a word.</Paragraph> <Section position="1" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 5.1 Lexicon of common-words </SectionTitle> <Paragraph position="0"> For the OOV common-words, we reduce the lexicon to the words which have at least 4 occurrences in the corpus, then we keep, for each word, only the syntactic labels which represent 80% of all the occurrences of the word. We obtain a lexicon of 1032 items representing 44% of all the occurrences of OOV common-words in our corpus.</Paragraph> </Section> <Section position="2" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 5.2 Lexicon of proper-names </SectionTitle> <Paragraph position="0"> The lexicon of OOV proper-names is limited to the words which have at least 4 occurrences in the corpus and for which the most frequent label has a frequency of at least 90%. Then we keep, for each word, only the most frequent label. The lexicon contains 2250 words representing 28.5% of all the occurrences of OOV proper-names in our corpus.</Paragraph> </Section> </Section>
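The sketch below applies these two filters to per-word label counts, using the thresholds given above (at least 4 occurrences; labels covering 80% of a common-word's occurrences; a 90% majority label for a proper-name). The data structure, the toy counts, and the reading of the 80% criterion as a cumulative coverage by the most frequent labels are assumptions made for illustration.

# Minimal sketch of the lexicon filtering step of sections 5.1 and 5.2.

def filter_common_words(entries, min_occ=4, coverage=0.80):
    """Keep frequent words; for each, keep the top labels covering `coverage` of its occurrences."""
    lexicon = {}
    for word, label_counts in entries.items():
        total = sum(label_counts.values())
        if total < min_occ:
            continue
        kept, acc = [], 0
        for label, count in sorted(label_counts.items(), key=lambda kv: -kv[1]):
            kept.append(label)
            acc += count
            if acc / total >= coverage:
                break
        lexicon[word] = kept
    return lexicon

def filter_proper_names(entries, min_occ=4, majority=0.90):
    """Keep frequent words whose most frequent label reaches `majority`; keep only that label."""
    lexicon = {}
    for word, label_counts in entries.items():
        total = sum(label_counts.values())
        if total < min_occ:
            continue
        best_label, best_count = max(label_counts.items(), key=lambda kv: kv[1])
        if best_count / total >= majority:
            lexicon[word] = best_label
    return lexicon

# Toy usage; the common-word counts roughly reproduce the percentages of the example above.
common = {"tchétchène": {"AFS": 22, "AMS": 13, "NFS": 4, "NMS": 2}}
print(filter_common_words(common))    # {'tchétchène': ['AFS', 'AMS']}
proper = {"Grozny": {"TOWN": 18, "FAMILY_NAME": 1}}   # hypothetical counts
print(filter_proper_names(proper))    # {'Grozny': 'TOWN'}

</Paper>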