<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2005"> <Title>Tagging Portuguese with a Spanish Tagger Using Cognates</Title>
<Section position="4" start_page="34" end_page="35" type="metho"> <SectionTitle> 4 Morphological Analysis </SectionTitle>
<Paragraph position="0"> Our morphological analyzer (Hana, 2005) is an open and modular system. It allows us to combine modules with different levels of manual input - from a module using a small manually provided lexicon, through a module using a large lexicon automatically acquired from a raw corpus, to a guesser that uses a list of paradigms as its only manually provided resource. The general strategy is to run modules that make fewer errors and overgenerate less before modules that make more errors and overgenerate more. This means, for example, that modules with manually created resources are used before modules with automatically acquired resources. (We ignored the POS tags.)</Paragraph>
<Paragraph position="1"> In the experiments below, we used the following modules: lookup in a list of (mainly) closed-class words, a paradigm-based guesser, and an automatically acquired lexicon.</Paragraph>
<Section position="1" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.1 Portuguese closed class words </SectionTitle>
<Paragraph position="0"> We created a list of the most common prepositions, conjunctions and pronouns, and a number of the most common irregular verbs. The list contains about 460 items, and it required about 6 hours of work. In general, the closed-class words can be derived either from a reference grammar book or elicited from a native speaker; this requires neither native-speaker expertise nor intensive linguistic training. The reason the creation of such a list took 6 hours is that the words were annotated with the detailed morphological tags used by our system.</Paragraph> </Section>
<Section position="2" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.2 Portuguese paradigms </SectionTitle>
<Paragraph position="0"> We also created a list of morphological paradigms.</Paragraph>
<Paragraph position="1"> Our database contains 38 paradigms. We simply encoded basic facts about Portuguese morphology from a standard grammar textbook (Cunha and Cintra, 2001). The paradigms include all three regular verb conjugations (-ar, -er, -ir), the most common adjective and noun paradigms, and a rule for adverbs of manner ending in -mente (analogous to the English -ly). We ignore the majority of exceptions. The creation of the paradigms took about 8 hours of work.</Paragraph> </Section>
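To make the paradigm-based guesser concrete, here is a minimal Python sketch. The paradigm entries, tag strings and function names are illustrative assumptions for exposition only - they are not the actual 38-paradigm database or the detailed tagset used by the system.

    # Toy paradigm database: each paradigm maps an ending to the tag it realizes.
    # Entries and tags are ILLUSTRATIVE, not the paper's actual resources.
    PARADIGMS = {
        "v-ar": {"o": "V.1sg.pres", "as": "V.2sg.pres", "a": "V.3sg.pres",
                 "amos": "V.1pl.pres", "am": "V.3pl.pres"},
        "n-o":  {"o": "N.masc.sg", "os": "N.masc.pl"},
        "n-a":  {"a": "N.fem.sg", "as": "N.fem.pl"},
    }

    def guess(word):
        """Return all (stem, paradigm, tag) analyses consistent with the paradigms."""
        analyses = []
        for pname, endings in PARADIGMS.items():
            for ending, tag in endings.items():
                if word.endswith(ending) and len(word) > len(ending):
                    stem = word[: len(word) - len(ending)]
                    analyses.append((stem, pname, tag))
        return analyses

    # 'canta' is analyzed both as a 3sg verb form and as a feminine singular noun:
    print(guess("canta"))
    # [('cant', 'v-ar', 'V.3sg.pres'), ('cant', 'n-a', 'N.fem.sg')]

As Section 4.3 discusses, a guesser of this kind massively overgenerates; the automatically acquired lexicon is what reins it in.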
<Section position="3" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 4.3 Lexicon Acquisition </SectionTitle>
<Paragraph position="0"> The morphological analyzer supports a module or modules employing a lexicon that contains information about lemmas, stems and paradigms. It is always possible to provide this information manually; that, however, is very costly. Instead, we created such a lexicon automatically.</Paragraph>
<Paragraph position="1"> Usually, automatically acquired lexicons and similar systems are used as a backup for large, high-precision, high-cost manually created lexicons (e.g. Mikheev, 1997; Hlaváčová, 2001). Such systems extrapolate information about the words known to the lexicon (e.g. the distributional properties of endings) to unknown words. Since our approach is resource-light, we do not have any such large lexicon to extrapolate from.</Paragraph>
<Paragraph position="2"> The general idea of our system is very simple. The paradigm-based guesser provides all the possible analyses of a word consistent with Portuguese paradigms. Obviously, this approach massively overgenerates. Part of the ambiguity is usually real, but most of it is spurious. We use a large corpus to weed out the spurious analyses. In such a corpus, open-class lemmas are likely to occur in more than one form. Therefore, if a lemma+paradigm candidate suggested by the guesser occurs in other forms elsewhere in the corpus, the likelihood increases that the candidate is real, and vice versa. If we encounter the word cantamos 'we sing' in a Portuguese corpus, the paradigm information lets us analyze it in two ways: as a plural noun with the ending -s, or as a verb in the 1st person plural with the ending -amos. Based on this single form we cannot say more. However, if we also encounter the forms canto, canta and cantam, the verb analysis becomes much more probable, and it is therefore chosen for the lexicon. If the only forms encountered in our Portuguese corpus were cantamos and the nonexistent cantamo (cf. the existing pair ramo and ramos), we would analyze it as a noun and not as a verb.</Paragraph>
<Paragraph position="3"> With such an approach, and assuming that the corpus contains the forms of the verb matar 'to kill' (mato.1sg, matas.2sg, mata.3sg, etc.), we would not discover that there is also a noun mata 'forest' with a plural form matas - the set of the two noun forms is a proper subset of the verb forms. A simple solution is to consider not the number of form types covered in a corpus, but the coverage of the possible forms of the particular paradigm. However, this brings other problems (e.g. it penalizes paradigms with a large number of forms, or paradigms with some obsolete forms). We combine both of these measures in Hana (2005).</Paragraph>
<Paragraph position="4"> Lexicon acquisition consists of three steps (sketched in code below): 1. A large raw corpus is analyzed with a lexicon-less MA (an MA using a list of mainly closed-class words and a paradigm-based guesser). 2. All possible hypothetical lexical entries are created over these analyses.</Paragraph>
<Paragraph position="5"> 3. Hypothetical entries are filtered with the aim of discarding as many nonexistent entries as possible without discarding real entries.</Paragraph>
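A minimal sketch of these three steps, reusing the toy guess() and PARADIGMS from the sketch in Section 4.2. The min_forms and coverage thresholds are illustrative placeholders; the actual combination of the form-count and paradigm-coverage measures is the one described in Hana (2005).

    from collections import defaultdict

    def acquire_lexicon(corpus_types, paradigms, min_forms=2, min_coverage=0.5):
        """Steps 1-3: analyze corpus types without a lexicon, collect hypothetical
        (stem, paradigm) entries, and keep those attested by enough distinct forms.
        Thresholds are illustrative, not the paper's actual filtering criteria."""
        attested = defaultdict(set)
        for word in corpus_types:                      # step 1: lexicon-less analysis
            for stem, pname, _tag in guess(word):      # step 2: hypothetical entries
                attested[(stem, pname)].add(word)
        lexicon = {}
        for (stem, pname), forms in attested.items():  # step 3: filtering
            coverage = len(forms) / len(paradigms[pname])
            if len(forms) >= min_forms and coverage >= min_coverage:
                lexicon[(stem, pname)] = forms
        return lexicon

    # With canto, canta and cantamos attested, the verb entry (cant, v-ar) survives,
    # while the noun readings attested by a single form each are discarded:
    print(acquire_lexicon({"canto", "canta", "cantamos"}, PARADIGMS))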
<Paragraph position="6"> Obviously, morphological analysis based on such a lexicon still overgenerates, but it overgenerates much less than analysis based on the endings alone. Consider, for example, the form funções 'functions' of the feminine noun função. The analyzer without a lexicon provides 11 analyses (6 lemmas, each with 1 to 3 tags); only one of them is correct. In contrast, the analyzer with an automatically acquired lexicon provides only two analyses: the correct one (noun fem. pl.) and an incorrect one (noun masc. pl.; note that POS and number are still correct). Of course, not all cases are so clear-cut.</Paragraph>
<Paragraph position="7"> The evaluation of the system is given in Table 3. The 98.1% recall is equivalent to the upper bound for the task: it is calculated assuming an oracle Portuguese tagger that always selects the correct POS tag whenever it is in the set of options given by the morphological analyzer. Notice also that for tagging accuracy, the drop in recall is less important than the drop in ambiguity.</Paragraph> </Section> </Section>
<Section position="5" start_page="35" end_page="37" type="metho"> <SectionTitle> 5 Tagging </SectionTitle>
<Paragraph position="0"> We used the TnT tagger (Brants, 2000), an implementation of the Viterbi algorithm for a second-order Markov model. In the traditional approach, we would train the tagger's transitional and emission probabilities on a large annotated corpus of Portuguese. However, our resource-light approach means that no such corpus is available to us, and we need other ways to obtain this information. We assume that the syntactic properties of Spanish and Portuguese are similar enough that we can use the transitional probabilities trained on Spanish (after a simple tagset mapping).</Paragraph>
<Paragraph position="1"> The situation with the lexical properties, as captured by the emission probabilities, is more complex.</Paragraph>
<Paragraph position="2"> Below we present three different ways to obtain the emissions, assuming: 1. they are the same: we use the Spanish emissions directly (§5.1).</Paragraph>
<Paragraph position="3"> 2. they are different: we ignore the Spanish emissions and instead uniformly distribute the results of our morphological analyzer (§5.2).</Paragraph>
<Paragraph position="4"> 3. they are similar: we map the Spanish emissions onto the result of morphological analysis using automatically acquired cognates (§5.3).</Paragraph>
<Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 5.1 Tagging - Baseline </SectionTitle>
<Paragraph position="0"> Our lower-bound measurement consists of training the TnT tagger on the Spanish corpus and applying this model directly to Portuguese (before training, we translated the Spanish tagset into the Portuguese one). The overall performance of such a tagger is 56.8% (see the min column in Table 4). That means that more than half of the information needed for tagging Portuguese is already provided by the Spanish model. This tagger has seen no Portuguese whatsoever, and is still much better than nothing.</Paragraph> </Section>
<Section position="2" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 5.2 Tagging - Approximating Emissions I </SectionTitle>
<Paragraph position="0"> The opposite extreme to the baseline is to assume that the Spanish emissions are useless for tagging Portuguese. Instead, we use the morphological analyzer to limit the number of possibilities, treating them all equally: the emission probabilities then form a uniform distribution over the tags given by the analyzer (see the sketch below). The results are summarized in Table 4 (the e-even column): an accuracy of 77.2% on full tags, or a 47% relative error reduction against the baseline.</Paragraph> </Section>
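A minimal sketch of the e-even setting, assuming an analyzer callback that returns the set of tags licensed for a word; the function names are ours, and the example reuses the toy guess() from Section 4.2 for illustration.

    def even_emissions(word, analyzer):
        """Uniform emission distribution over the tags proposed by the
        morphological analyzer; these replace the tagger's lexical model."""
        tags = analyzer(word)
        return {tag: 1.0 / len(tags) for tag in tags}

    # If the analyzer licenses two tags for a form, each receives probability 0.5:
    print(even_emissions("canta", lambda w: {t for _, _, t in guess(w)}))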
<Section position="3" start_page="35" end_page="37" type="sub_section"> <SectionTitle> 5.3 Tagging - Approximating Emissions II </SectionTitle>
<Paragraph position="0"> Although it is true that the forms and distributions of Portuguese and Spanish words are not the same, they are also not completely unrelated. As any Spanish speaker would agree, knowledge of Spanish words is useful when trying to understand a text in Portuguese.</Paragraph>
<Paragraph position="1"> Many of the corresponding Portuguese and Spanish words are cognates, i.e. historically they descend from the same ancestor root or they are mere translations. We assume two things: (i) cognate pairs usually have similar morphological and distributional properties; (ii) cognate words are similar in form.</Paragraph>
<Paragraph position="2"> Obviously, both of these assumptions are approximations: 1. Cognates may have drifted apart in their meanings, and thus probably also have different distributions. For example, Spanish embarazada 'pregnant' vs. Portuguese embaraçada 'embarrassed'.</Paragraph>
<Paragraph position="3"> 2. Cognates may have drifted apart in their morphological properties. For example, Spanish cerca 'near'.adverb vs. Portuguese cerca 'fence'.noun (from Latin circa, circus 'circle').</Paragraph>
<Paragraph position="4"> 3. There are false cognates - unrelated, but similar or even identical words. For example, Spanish salada 'salty'.adj vs. Portuguese salada 'salad'.noun, or Spanish doce 'twelve'.numeral vs. Portuguese doce 'candy'.noun.</Paragraph>
<Paragraph position="5"> Nevertheless, we believe that such examples are true exceptions to the rule and that in the majority of cases cognates look and behave similarly. The borrowings, counter-borrowings and parallel developments of the various Romance languages have of course been extensively studied, and we have no space for a detailed discussion here.</Paragraph>
<Paragraph position="6"> Identifying cognates. For the present work, however, we do not assume access to philological erudition, to accurate Spanish-Portuguese translations, or even to a sentence-aligned corpus. All of these are resources that we could not expect to obtain in a resource-poor setting. In the absence of this knowledge, we identify cognates automatically, using edit distance normalized by word length. Unlike in standard edit distance, the cost of an operation depends on its arguments. Similarly to Yarowsky and Wicentowski (2000), we assume that, in any language, vowels are more mutable in inflection than consonants; thus, for example, replacing a with i is cheaper than replacing s with r. In addition, the costs are refined based on some well-known and common phonetic-orthographic regularities; e.g. replacing q with c is less costly than replacing m with, say, s. However, we do not want to do a detailed contrastive morpho-phonological analysis, since we want our system to be portable to other languages, so a few facts from a simple reference grammar book should be enough.</Paragraph>
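The weighted, normalized edit distance can be sketched as follows. The concrete costs and the set of cheap letter correspondences are illustrative assumptions; the paper only commits to vowels being cheaper to substitute than consonants and to a handful of orthographic regularities.

    VOWELS = set("aeiou")
    CHEAP_PAIRS = {("q", "c"), ("c", "q"), ("b", "v"), ("v", "b")}  # illustrative

    def sub_cost(a, b):
        """Substitution cost: vowel-for-vowel swaps and known orthographic
        correspondences are cheaper than arbitrary consonant substitutions."""
        if a == b:
            return 0.0
        if (a, b) in CHEAP_PAIRS:
            return 0.3   # weight chosen for illustration
        if a in VOWELS and b in VOWELS:
            return 0.5   # weight chosen for illustration
        return 1.0

    def norm_edit_distance(s, t):
        """Weighted Levenshtein distance, normalized by the longer word's length."""
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = float(i)
        for j in range(1, n + 1):
            d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                              d[i][j - 1] + 1.0,      # insertion
                              d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
        return d[m][n] / max(m, n)

    # A Spanish-Portuguese pair differing in one consonant scores as a near-cognate:
    print(norm_edit_distance("embarazada", "embaracada"))  # 0.1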
<Paragraph position="7"> Using cognates. Having a list of Spanish-Portuguese cognate pairs, we can use it to map the emission probabilities acquired from the Spanish corpus onto Portuguese. Assume the Spanish word w_s and the Portuguese word w_p are cognates. Let T_s denote the set of tags with which w_s occurs in the Spanish corpus, and let p_s(t) be the emission probability of a tag t (so t ∉ T_s implies p_s(t) = 0). Let T_p denote the set of tags assigned to the Portuguese word w_p by our morphological analyzer, and let p_p(t) be the even emission probability, p_p(t) = 1/|T_p|. Then we can assign a new emission probability p'_p(t) to every tag t ∈ T_p, followed by normalization.</Paragraph>
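The combining equation itself is not reproduced above, so the sketch below implements one plausible reading that is consistent with the surrounding definitions - adding the Spanish emission mass p_s(t) to the even distribution p_p(t) over the analyzer's tags T_p and renormalizing. This is an assumption for illustration, not necessarily the paper's exact formula.

    def cognate_emissions(tags_p, spanish_emissions):
        """Map a Spanish cognate's emissions onto the analyzer's tag set T_p.
        ASSUMED scheme: p'_p(t) proportional to p_p(t) + p_s(t), renormalized;
        the paper's exact combination may differ."""
        p_even = 1.0 / len(tags_p)
        raw = {t: p_even + spanish_emissions.get(t, 0.0) for t in tags_p}
        total = sum(raw.values())
        return {t: v / total for t, v in raw.items()}

    # A tag strongly preferred by the Spanish cognate dominates the result:
    print(cognate_emissions({"V.3sg", "N.fem.sg"}, {"V.3sg": 0.9, "N.fem.sg": 0.1}))
    # {'V.3sg': 0.7, 'N.fem.sg': 0.3}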
<Paragraph position="9"> Results. This method provides the best results. The full-tag accuracy is 82.1%, compared to 56.9% for the baseline (a 58% error-rate reduction) and 77.2% for even emissions (a 21% reduction). The accuracy on POS alone is 87.6%. Detailed results are in the e-cognates column of Table 4.</Paragraph> </Section> </Section>
<Section position="6" start_page="37" end_page="37" type="metho"> <SectionTitle> 6 Evaluation & Comparison </SectionTitle>
<Paragraph position="0"> The best way to evaluate our results would be to compare them against the TnT tagger used in the usual way - trained on Portuguese and applied to Portuguese. We do not have access to a large Portuguese corpus annotated with detailed tags. However, we believe that Spanish and Portuguese are similar enough (see Sect. 2) to justify our assumption that the TnT tagger would be equally successful (or unsuccessful) on both. The accuracy of TnT trained on 90K tokens of the CLiC-TALP corpus is 94.2% (tested on 16K tokens). The accuracy of our best tagger is 82.1%; its error rate is thus more than three times higher (17.9% vs. 5.4%).</Paragraph>
<Paragraph position="1"> Branco and Silva (2003) report 97.2% tagging accuracy on a 23K-token test corpus. This is clearly better than our results; on the other hand, they needed a large annotated Portuguese corpus of 207K tokens. The details of the tagset used in their experiments are not provided, so a precise comparison with our results is difficult.</Paragraph> </Section>
<Section position="7" start_page="37" end_page="38" type="metho"> <SectionTitle> 7 Related work </SectionTitle>
<Paragraph position="0"> Previous research in resource-light language learning has defined "resource-light" in different ways. Some have assumed only partially tagged training corpora (Merialdo, 1994); some have begun with small tagged seed wordlists (Cucerzan and Yarowsky, 1999) for named-entity tagging; others have exploited the automatic transfer of an already existing annotated resource in a different genre or a different language (e.g. the cross-language projection of morphological and syntactic information of Yarowsky et al. (2001) and Yarowsky and Ngai (2001), which requires no direct supervision in the target language).</Paragraph>
<Paragraph position="1"> Ngai and Yarowsky (2000) observe that the total weighted human and resource cost is the most practical measure of the degree of supervision.</Paragraph>
<Paragraph position="2"> Cucerzan and Yarowsky (2002) observe that another useful measure of minimal supervision is the additional cost of obtaining a desired functionality from existing, commonly available knowledge sources. They note that for a remarkably wide range of languages there exists a wealth of reference grammar books and dictionaries, which are an invaluable linguistic resource.</Paragraph>
<Section position="1" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 7.1 Resource-light approaches to Romance languages </SectionTitle>
<Paragraph position="0"> Cucerzan and Yarowsky (2002) present a method for bootstrapping a fine-grained, broad-coverage POS tagger in a new language using only one person-day of data-acquisition effort. Like us, they use a basic reference grammar book and access to an existing monolingual text corpus in the language, but they also use a medium-sized bilingual dictionary.</Paragraph>
<Paragraph position="1"> In our work, we use a paradigm-based morphology, including only the basic paradigms from a standard grammar textbook. Cucerzan and Yarowsky (2002), in contrast, create a dictionary of regular inflectional affix changes and their associated POS and, on the basis of it, generate hypothesized inflected forms following the regular paradigms.</Paragraph>
<Paragraph position="2"> Clearly, these hypothesized forms are inaccurate and overgenerated. Therefore, the authors perform a probabilistic match between all lexical tokens actually observed in a monolingual corpus and the hypothesized forms. They combine these two models: one created on the basis of the dictionary information and one produced by the morphological analysis. This approach relies heavily on two assumptions: (i) words of the same POS tend to have similar tag-sequence behavior; and (ii) there are sufficient instances of each POS tag labeled either by the morphology models or by closed-class entries. For richly inflectional languages, however, there is no guarantee that the latter assumption always holds.</Paragraph>
<Paragraph position="3"> The accuracy of their model is comparable to ours. On a fine-grained (up to 5-feature) POS space, they achieve 86.5% for Spanish and 75.5% for Romanian. With a tagset of similar size (11 features), we obtain an accuracy of 82.1% for Portuguese.</Paragraph>
<Paragraph position="4"> Carreras et al. (2003) present work on developing low-cost named-entity recognizers (NER) for a language with no available annotated resources, using existing resources for a similar language as a starting point. They devise and evaluate several strategies for building a Catalan NER system using only annotated Spanish data and unlabeled Catalan text, and compare their approach with a classical bootstrapping setting in which a small initial corpus in the target language is hand-tagged. It turns out that hand translation of a Spanish model is better than a model learned directly from a small hand-annotated training corpus of Catalan. The best result is achieved using cross-linguistic features. Solorio and López (2005) follow their approach; however, they apply the Spanish NER system directly to Portuguese and train a classifier using its output and the true classes.</Paragraph> </Section>
<Section position="2" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 7.2 Cognates </SectionTitle>
<Paragraph position="0"> Mann and Yarowsky (2001) present a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages.</Paragraph>
<Paragraph position="1"> Bilingual lexicons within language families are induced using probabilistic string edit-distance models. Translation lexicons for arbitrary distant language pairs are then generated by a combination of these intra-family translation models and one or more cross-family online dictionaries. Like Mann and Yarowsky (2001), we show that languages are often close enough to others within their language family that cognate pairs between the two are common, and that significant portions of the translation lexicon can be induced with high accuracy where no bilingual dictionary or parallel corpus exists.</Paragraph> </Section> </Section> </Paper>