<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1701"> <Title>Web-based frequency dictionaries for medium density languages</Title> <Section position="3" start_page="2" end_page="4" type="metho"> <SectionTitle> 2 The disambiguation of morphological analyses </SectionTitle> <Paragraph position="0"> In any morphologically complex language, the MA component will often return more than one possible analysis. In order to create a lemmatized frequency dictionary it is necessary to decide which MA alternative is the correct one, and in the vast majority of cases the context provides sufficient information for this. This morphological disambiguation task is closely related to, but not identical with, part of speech (POS) tagging, a term we reserve here for finding the major parts¹ (¹ This year, we are publishing smaller pilot corpora for Czech (10m words), Croatian (4m words), and Polish (12m words), and we feel confident in predicting that these will face as little actual opposition from copyright holders as the Hungarian Webcorpus has.)</Paragraph> <Paragraph position="1"> of speech (N, V, A, etc.). A full tag contains both POS information and morphological annotation: in highly inflecting languages the latter can lead to tagsets of high cardinality (Tufiş et al., 2000). Hungarian is particularly challenging in this regard, both because the number of ambiguous tokens is high (reaching 50% in the Szeged Corpus according to Csendes et al. (2004), who use a different MA), and because the ratio of tokens that are not seen during training (unseen) can be as much as four times higher than in comparable-size English corpora.
But if larger training corpora are available, significant disambiguation is possible: with a 1m-word training corpus (Csendes et al., 2004) the TnT (Brants, 2000) architecture can achieve 97.42% overall precision.</Paragraph> <Paragraph position="2"> The ratio of ambiguous tokens is usually calculated based on the alternatives offered by a morphological lexicon (either built during the training process or furnished by an external application; see below). If the lexicon offers alternative analyses, the token is taken as ambiguous irrespective of the probability of the alternatives. If an external resource is used in the form of a morphological analyzer (MA), this will almost always overgenerate, yielding false ambiguity. But even if the MA is tight, a considerable proportion of ambiguous tokens will come from legitimate but rare analyses of frequent types (Church, 1988). For example, the word nem can mean both 'not' and 'gender', so both ADV and NOUN are valid analyses, but the adverbial reading is about three and a half orders of magnitude more frequent than the noun reading (12,596 vs. 4 tokens in the 1m-word manually annotated Szeged Korpusz (Csendes et al., 2004)).</Paragraph> <Paragraph position="3"> Thus the difficulty of the task is better measured by the average information required to disambiguate a token. If word w is assigned the label T_i with probability P(T_i|w) (estimated as C(T_i, w)/C(w) from a labeled corpus), then the label entropy for a word can be calculated as H(w) = -Σ_i P(T_i|w) log P(T_i|w), and the difficulty of the labeling task as a whole is the weighted average of these entropies with respect to the frequencies of words w: Σ_w P(w)H(w). As we shall see in Section 3, according to this measure the disambiguation task is not as difficult as generally assumed.</Paragraph> <Paragraph position="4"> A more persistent problem is that the ratio of unseen items has a very significant influence on the performance of the disambiguation system.
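The label-entropy measure defined above is straightforward to estimate from a labeled corpus. The following is a minimal sketch; the function and variable names are ours, not from the paper:

```python
from collections import Counter
from math import log2

def label_entropy(tag_counts):
    """H(w) = -sum_i P(T_i|w) log P(T_i|w), with P(T_i|w) estimated
    as C(T_i, w)/C(w) from the counts in tag_counts."""
    total = sum(tag_counts.values())
    return -sum((c / total) * log2(c / total) for c in tag_counts.values())

def task_difficulty(corpus):
    """Weighted average of per-word label entropies, sum_w P(w) H(w),
    where corpus is a sequence of (word, tag) pairs."""
    tags_per_word = {}
    word_freq = Counter()
    for word, tag in corpus:
        tags_per_word.setdefault(word, Counter())[tag] += 1
        word_freq[word] += 1
    n = sum(word_freq.values())
    return sum((word_freq[w] / n) * label_entropy(tags_per_word[w])
               for w in tags_per_word)

# The 'nem' example: formally ambiguous between ADV and NOUN, but at
# 12596 vs. 4 tokens it contributes almost nothing to the average.
nem_entropy = label_entropy(Counter({"ADV": 12596, "NOUN": 4}))
```

Note the contrast this measure captures: a word with two equiprobable tags has a label entropy of 1 bit, while nem, despite being formally ambiguous, contributes only about 0.004 bits.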
The problem is more significant with smaller corpora: in general, if the training corpus has N tokens and the test corpus is a constant fraction of this, say N/10, we expect the proportion of new words to be cN^(q-1), where q is the reciprocal of the Zipf constant (Kornai, 1999). But if the test/train ratio is not kept constant because the training corpus is limited (manual tagging is expensive), the number of tokens that are not seen during training can grow very large. Using the 1.2m words of the Szeged Corpus for training, over 4% of the non-numeric tokens in the 699m-word webcorpus will be unseen. Given that TnT performs rather dismally on unseen items (Oravecz and Dienes, 2002), it was clear from the outset that for lemmatizing the webcorpus we needed something more elaborate.</Paragraph> <Paragraph position="5"> The standard solution to constrain the probabilistic tagging model for some of the unseen items is the application of an MA (Hakkani-Tür et al., 2000; Hajič et al., 2001; Smith et al., 2005). Here a distinction must be made between those items that are not found in the training corpus (these we have called unseen tokens) and those that are not known to the MA, which we call out of vocabulary (OOV). As we shall see shortly, the key to the best tagging architecture we found was to follow different strategies in the lemmatization and morphological disambiguation of OOV and known (in-vocabulary) tokens.</Paragraph> <Paragraph position="6"> The first step in tagging is the annotation of inflectional features, with lemmatization being postponed to later processing as in (Erjavec and Džeroski, 2004). This differs from the method of (Hakkani-Tür et al., 2000), where all syntactically relevant features (including the stem or lemma) of word forms are determined in one pass.
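The unseen-token estimate cN^(q-1) from earlier in this section can be sketched numerically. The constants c and q below are illustrative placeholders, not values fitted to the Hungarian data:

```python
def unseen_fraction(n_train, c=1.0, q=0.8):
    """Expected fraction of test tokens not seen in an n_train-token
    training corpus, modeled as c * N^(q-1) (Kornai, 1999), where q is
    the reciprocal of the Zipf constant. c and q are illustrative."""
    return c * n_train ** (q - 1)

# With q < 1 the unseen rate decays as training data grows, but only
# polynomially: a fixed-size training corpus used against a vastly
# larger test corpus leaves a substantial unseen residue, as with the
# 1.2m-word training set against the 699m-word webcorpus.
```

The polynomial (rather than exponential) decay is the practical point: doubling the manually tagged corpus buys only a modest reduction in the unseen rate.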
In our experience, the choice of stem depends so heavily on the type of linguistic information that later processing will need that it cannot be resolved in full generality at the morphosyntactic level.</Paragraph> <Paragraph position="7"> Our first model (MA-ME) is based on disambiguating the MA output in the maximum entropy (ME) framework (Ratnaparkhi, 1996). In addition to the MA output, we use ME features coding the surface form of the preceding/following word, capitalization information, and suffix strings of various lengths of the current word. The MA used is the open-source hunmorph analyzer (Trón et al., 2005) with the morphdb.hu Hungarian morphological resource; the ME is the OpenNLP package (Baldridge et al., 2001). The MA-ME model achieves 97.72% correct POS tagging and morphological analysis on the test corpus.</Paragraph> <Paragraph position="9"> Maximum entropy and other discriminative Markov models (McCallum et al., 2000) suffer from the label bias problem (Lafferty et al., 2001), while generative models (most notably HMMs) need strict independence assumptions to make the task of sequential data labeling tractable. Consequently, long-distance dependencies and non-independent features cannot be handled. To cope with these problems we designed a hybrid architecture, in which a trigram HMM is combined with the MA in such a way that for tokens known to the MA only the set of possible analyses are allowed as states in the HMM, whereas for OOVs all states are possible. Lexical probabilities P(w_i|t_i) for seen words are estimated from the training corpus, while for unseen tokens they are provided by the above MA-ME model. This yields a trigram HMM where emission probabilities are estimated by a weighted MA, hence the model is called WMA-T3.
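The state-restriction idea behind WMA-T3 can be sketched as follows, using a bigram HMM rather than the paper's trigram model for brevity; all names and probability tables here are illustrative, not the actual system:

```python
import math

def constrained_viterbi(tokens, all_tags, trans, emit, analyses):
    """Viterbi decoding in which the state space at each position is
    restricted to the MA's analyses for known tokens, while OOV tokens
    admit every tag. trans[(t_prev, t)] and emit[(t, word)] are
    log-probabilities; missing entries count as impossible."""
    def states(word):
        return analyses.get(word, all_tags)  # OOV -> all tags allowed

    def score(table, key):
        return table.get(key, -math.inf)

    # delta[t]: best log-probability of a path ending in tag t
    delta = {t: score(emit, (t, tokens[0])) for t in states(tokens[0])}
    backptrs = []
    for word in tokens[1:]:
        new_delta, ptr = {}, {}
        for t in states(word):
            prev = max(delta, key=lambda p: delta[p] + score(trans, (p, t)))
            new_delta[t] = (delta[prev] + score(trans, (prev, t))
                            + score(emit, (t, word)))
            ptr[t] = prev
        delta, backptrs = new_delta, backptrs + [ptr]
    # recover the best path by following back-pointers
    path = [max(delta, key=delta.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The design point is that for in-vocabulary tokens the MA prunes the candidate states before decoding, so lexical probabilities are only needed over the analyses the MA licenses, while OOV tokens fall back to the full tagset.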
This improves the score to 97.93%.</Paragraph> <Paragraph position="10"> Finally, it is possible to define another architecture, somewhat similar to Maximum Entropy Markov Models (McCallum et al., 2000), using the above components. Here again the states are the set of analyses the MA allows for known tokens, and all analyses for OOVs, while emission probabilities are estimated by the MA-ME model. In the first pass TnT is run with default settings over the data sequence, and in the second pass the ME receives as features the TnT label of the preceding/following token as well as of the token to be analyzed. This combined system (TnT-MA-ME) incorporates the benefits of all the submodules and reaches an accuracy of 98.17% on the Szeged Corpus. The results are summarized in Table 1.</Paragraph> <Paragraph position="11"> model accuracy TnT 97.42% MA-ME 97.72% WMA-T3 97.93% TnT-MA-ME 98.17% Table 1: accuracy of the models for morphological disambiguation</Paragraph> <Paragraph position="13"> We do not consider these results to be final: clearly, further enhancements are possible, e.g. by a Viterbi search on alternative sentence taggings using the T3 trigram tag model, or by handling OOVs on a par with known unseen words using the guesser function of our MA. But, as we discuss in more detail in (Halácsy et al., 2005), we are already ahead of the results published elsewhere, especially as these tend to rely on idealized MA systems that have their morphological resources extended so as to have no OOV on the test set.</Paragraph> </Section> </Paper>