<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1112"> <Title>A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Dictionary-Based Lemmatizer for Dutch </SectionTitle> <Paragraph position="0"> Statistical classification systems, like our WSD system, determine the most likely class for a given instance by computing how likely the words or linguistic features in the instance are for each class. Estimating these probabilities is difficult because corpora contain many different, often infrequent, words. Lemmatization (see footnote 2) reduces the number of wordforms that need to be taken into consideration, which helps because estimation is more reliable for frequently occurring data.</Paragraph> <Paragraph position="1"> Footnote 2: We chose lemmatization over stemming because the lemma (the canonical dictionary entry form) can be used to look up an ambiguous word in a dictionary or an ontology such as WordNet. This is not the case for a stem.</Paragraph> <Paragraph position="2"> Lemmatization reduces all inflected forms of a word to the same lemma. The number of different lemmas in a training corpus will therefore in general be much smaller than the number of different wordforms, and the frequency of a lemma will be higher than that of its individual inflected forms, which in turn suggests that probabilities can be estimated more reliably.</Paragraph> <Paragraph position="3"> For the experiments in this paper, we used a lemmatizer for Dutch based on dictionary lookup. Dictionary information is obtained from Celex (Baayen et al., 1993), a lexical database for Dutch. Celex contains 381,292 wordforms and 124,136 lemmas for Dutch. It also contains the PoS associated with the lemmas. 
This information is useful for disambiguation: when a particular wordform has two (or more) possible corresponding lemmas, the one matching the PoS of the wordform is chosen. Thus, in a first step, information about wordforms, their respective lemmas and their PoS is extracted from the database.</Paragraph> <Paragraph position="4"> Dictionary lookup can be time-consuming, especially for large dictionaries such as Celex. To guarantee fast lookup and a compact representation, the information extracted from the dictionary is stored as a finite state automaton (FSA) using Daciuk's (2000) FSA morphology tools. Given a wordform, the compiled automaton provides the corresponding lemmas in time linear in the length of the input word. Compared with a simple suffix stripper, such as the Dutch Porter Stemmer (Kraaij and Pohlman, 1994), our dictionary-based lemmatizer is more accurate, faster and more compact (see Gaustad and Bouma (2002) for a more elaborate description and evaluation).</Paragraph> <Paragraph position="5"> During the actual lemmatization procedure, the FSA encoding of the information in Celex assigns every wordform all its possible lemmas. For ambiguous wordforms, the lemma with the same PoS as the wordform in question is chosen. All wordforms that are not found in Celex are processed with a morphological guessing automaton. The key features of the lemmatizer employed are that it is fast, compact and accurate.</Paragraph> </Section> </Paper>
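<!--
The lookup and PoS-based disambiguation described in this section can be
illustrated with a minimal sketch. It is not the implementation used in the
paper: a plain Python dictionary stands in for the compiled finite state
automaton, the lexicon entries are hypothetical toy examples rather than
Celex data, the PoS labels "N" and "V" are assumed tag names, and the
fallback for a missing PoS match is an assumption of this sketch. Unknown
wordforms simply yield None here, whereas the paper passes them to a
morphological guessing automaton.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon (hypothetical entries, not taken from Celex):
# wordform mapped to its list of (lemma, PoS) candidates.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "fietsen": [("fiets", "N"), ("fietsen", "V")],  # ambiguous wordform
    "liep": [("lopen", "V")],
    "tafels": [("tafel", "N")],
}


def candidate_lemmas(wordform: str) -> List[Tuple[str, str]]:
    """Return every (lemma, PoS) pair stored for a wordform."""
    return LEXICON.get(wordform.lower(), [])


def lemmatize(wordform: str, pos: str) -> Optional[str]:
    """Pick the lemma whose PoS matches the PoS of the wordform.

    Returns None for wordforms missing from the lexicon; the guessing
    step for unknown words is not modelled in this sketch.
    """
    candidates = candidate_lemmas(wordform)
    if not candidates:
        return None
    for lemma, lemma_pos in candidates:
        if lemma_pos == pos:
            return lemma
    # No candidate has the requested PoS: fall back to the first
    # candidate (an assumption of this sketch, not stated in the paper).
    return candidates[0][0]


if __name__ == "__main__":
    for form, tag in [("fietsen", "N"), ("fietsen", "V"), ("liep", "V")]:
        print(form, tag, lemmatize(form, tag))
```

A hash-based dictionary gives fast lookup but not the compact storage that
motivates the automaton in the paper, so the sketch only mirrors the
selection logic, not the data structure.
-->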