<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2401"> <Title>Named Entities Translation Based on Comparable Corpora</Title> <Section position="4" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 Experimental settings </SectionTitle> <Paragraph position="0"> We have built a Basque-Spanish comparable corpus from the news of two newspapers, one for each language: Euskaldunon Egunkaria, the only newspaper written entirely in Basque, for the Basque texts, and EFE for the Spanish texts. We collected the articles published in 2002 in both newspapers, obtaining 40,648 articles with 9,655,559 words for Basque and 16,914 articles with 5,192,567 words for Spanish. Both newspapers cover similar topics (international news, sports, politics, economy, culture, local issues and opinion articles), but with different scope.</Paragraph> <Paragraph position="1"> In order to extract Basque NEs, we have used Eihera (Alegria et al., 2003), a Basque NE recognizer developed in the IXA Group. Given a written text in Basque as input, this tool applies a grammar based on linguistic features to identify the entities in the text. To classify the identified expressions, we use a heuristic that combines both internal and external evidence. We labeled this corpus for the HERMES project (news databases: cross-lingual information retrieval and semantic extraction). Thus, we automatically obtained 142,464 different person, location and organization names.</Paragraph> <Paragraph position="2"> Since we participated in the HERMES project, we have access to the labeled corpora for the other languages, processed by other participants. The TALP research group was in charge of labeling the EFE 2002 articles for the Spanish version, in which 106,473 different named entities were labeled. 
We have built the comparable corpus using this data-set together with the Basque set mentioned above.</Paragraph> <Paragraph position="3"> Since Basque is an agglutinative language, entity elements may contain more than just lexical information. So before any translation attempt, a morphosyntactic analysis is required in order to obtain all the information from each element. Furthermore, Eihera works on lemmatized text, so lemmatizing the input text is a strong requirement.</Paragraph> <Paragraph position="4"> For that purpose, we apply the lemmatizer/tagger for Basque (Alegria et al., 1998) developed by the IXA group.</Paragraph> <Paragraph position="5"> The goal of our system is to translate Basque person, location and organization names into Spanish entities. These two languages share a lot of cognates, that is, words that are similar in both languages and differ only in small, usually predictable spelling changes. Two experts have reviewed an extended list of word pairs extracted from EDBL (Basque Lexical Database) in order to detect these differences. All the observed variations have been collected in a spelling-rule list. These rules are the ones applied to translate some of the words, but obviously not all of them.</Paragraph> <Paragraph position="6"> When translating Basque words into Spanish, the correct form is usually not obtained by applying the rules mentioned above, and a different strategy is required. For these words, we have used bilingual dictionaries, as in Al-Onaizan and Knight's work.</Paragraph> <Paragraph position="7"> We have used the Elhuyar 2000 bilingual dictionary, one of the most popular for that language pair. This dictionary has 74,331 Basque entries, each with its corresponding Spanish equivalents. For the evaluation, we have used a set of 180 named entity-pairs. 
We have taken that set from the Euskaldunon Egunkaria 2002 newspaper.</Paragraph> <Paragraph position="8"> Specifically, we applied Eihera, the Basque NE recognizer, to extract all the named entities in the corpus. Then we estimated the normalized frequency of each entity in the corpus, and we selected the most common ones. Finally, we translated them manually into Spanish.</Paragraph> <Paragraph position="9"> Although the NEs were extracted automatically from the corpus, we verified that all of them were correctly identified, so that the evaluation starts from correct Basque NEs. Because if the original entity was not a correct expression, the translation system could not get a</Paragraph> </Section> <Section position="5" start_page="2" end_page="6" type="metho"> <SectionTitle> 4 Systems' Development </SectionTitle> <Paragraph position="0"> As we have mentioned before, we have carried out two different experiments in order to get a Basque-Spanish NE translation tool. In both, we have used bilingual dictionaries and grammars to translate and transliterate entity elements, respectively. But the methodologies used to implement each transliteration grammar differ: in one, we have used Basque linguistic knowledge to develop the grammar; in the other, we have defined a language-independent grammar based on edit distance.</Paragraph> <Paragraph position="1"> Those dictionaries and grammars are used to obtain translation proposals for each entity element. But another methodology is needed for the system to propose translations of whole entities. For the system based on linguistic information, a specific set of arranging rules is applied to obtain a candidate list. To decide which candidate is the most suitable, we rank the list using a simple web count.</Paragraph> <Paragraph position="2"> For the language-independent system, a simpler methodology has been applied. 
We have generated all the possible candidate combinations, considering that every element can appear at any position in the entity. Then, a comparable corpus has been used to decide which candidate is the most probable.</Paragraph> <Paragraph position="3"> Now we will present the design of each experiment in detail.</Paragraph> <Section position="1" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 4.1 Linguistic Tool </SectionTitle> <Paragraph position="0"> The pseudo-code of the linguistic tool is shown in Figure 1.</Paragraph> <Paragraph position="1"> The linguistic tool first tries to obtain a translation proposal for each entity element using bilingual dictionaries. If that search yields no candidate, the transliteration grammar is applied. Once the system has obtained at least one proposal for each element, the arranging grammar is applied and, finally, the resulting whole-entity proposals are ranked by their number of occurrences on the web.</Paragraph> <Paragraph position="2"> Reviewing the extended list of words from EDBL (a Basque Lexical Database), we have obtained 24 common phonological/spelling transformations, some of which depend on others and can usually, although not always, be used together. We have implemented these 24 transformations with the XFST tool (Beesley and Karttunen, 2003), defining 30 rules. These rules have been ordered so that rules with possible interactions are applied first, followed by the rest. In this way we have avoided interaction problems. For instance, let's say that we want to translate Kolonbia into Colombia and that our grammar has the following two simple transformation rules: nb → mb and b → v. If we apply the first rule and then the second one, the candidate we obtain is Colomvia, which is not the correct translation. However, if we do not allow the second rule to apply after the nb → 
mb transformation, the grammar will propose the following candidates: Colonvia and Colombia. So it generates incorrect forms, but the correct form too.</Paragraph> <Paragraph position="3"> We can conclude from this that the rules must be applied in a given order.</Paragraph> <Paragraph position="4"> The number of possible rule combinations is so large that it causes an overgeneration of candidates. To avoid working with so many candidates in the following steps, we have decided to rank and select candidates using a probability-based measure.</Paragraph> <Paragraph position="5"> We have estimated rule probabilities using the Elhuyar 2000 bilingual dictionary. We simply applied all possible rule combinations to every Basque word in the dictionary and measured the normalized frequency of each rule and each rule pair. Thus, each translation proposal is assigned a probability based on the probability of the rules being applied, and only the most probable proposals are passed on to the following steps.</Paragraph> <Paragraph position="6"> At this point, we have at most N translation candidates for each input entity element, obtained either by applying the grammar or from the dictionary search. Our next goal is to build whole-entity translation proposals by combining all these candidates. But some word features, such as gender and number, must be handled beforehand.</Paragraph> <Paragraph position="7"> The number of an entity element is reflected in the whole entity. Suppose, for instance, that we want to translate the organization name Nazio Batuak.</Paragraph> <Paragraph position="8"> The translation proposals from the previous modules for these two words are Nación (for Nazio) and Unida (for Batuak). If we do not take into account that the Basque word corresponding to Unida is in the plural form, the whole translation candidate will not be correct. 
In this case, we will need to pluralize the corresponding Spanish words.</Paragraph> <Paragraph position="9"> Unlike Spanish, Basque has no morphological gender. This means that for some Basque words both the masculine and the feminine forms must be generated. The word idazkari, for example, has no morphological gender, and it has two corresponding Spanish words: the masculine secretario and the feminine secretaria. If we look up idazkari in the bilingual dictionary, we only obtain the masculine form, but the feminine form is needed for some entities, as is the case with Janet Reno Idazkaria. Since Janet Reno is a woman's name, the correct translation of Idazkaria is Secretaria. So before constructing the whole-entity translation, both masculine and feminine forms are generated for each element.</Paragraph> <Paragraph position="10"> The simplest entities to construct are those whose elements keep the same order in both the Basque and the Spanish forms. Person names usually follow this pattern.</Paragraph> <Paragraph position="11"> However, some entities are not as regular and easy to translate as the previous ones. Suppose that we want to translate the Basque entity Lomeko Bake Akordio into the Spanish form Acuerdo de Paz de Lome. After applying the grammar and the bilingual dictionaries, we obtain the following translated elements (to simplify the explanation, we assume that the system returns only one translation candidate per element): Lome, Acuerdo and Paz. As you can see, if we do not rearrange those elements, the proposal will not be the appropriate Spanish translation.</Paragraph> <Paragraph position="12"> An expert has manually defined the element rearrangements needed when going from one language to the other. 
The morphosyntactic information of the Basque entity elements (such as PoS, declension, and so on) has been used in this task.</Paragraph> <Paragraph position="13"> Based on this manual work, we have defined 10 element-arranging rules using the XFST tool. In the example above, it is clear that some element-arranging rules are needed to obtain the correct translation. Let's see how our grammar's rules arrange those elements.</Paragraph> <Paragraph position="14"> When the system arranges the Spanish words Lome, Acuerdo and Paz to get the correct translation of the Basque named entity Lomeko Bake Akordio, it proceeds from right to left, using the morphosyntactic information of the Basque elements. So it first arranges the translated elements for Bake Akordio. Both forms are common nouns with no declension case.</Paragraph> <Paragraph position="15"> In the grammar, the system finds a rule for this structure that switches the position of the elements and inserts the preposition de between them. So the partial translation is Acuerdo de Paz. The next step is to find the correct position for the translation of Lomeko, which is a location name declined in the genitive. There is a rule in the grammar that places elements declined in the genitive at the end of the partial entity and adds the preposition de before them. The system applies that rule, obtaining the Spanish translation of the whole entity, Acuerdo de Paz de Lome, which is the correct form.</Paragraph> <Paragraph position="16"> As we have explained, we combine the at most N translation candidates of each entity element with one another, using the corresponding arranging rule, to get the translation of the whole entity. So, at most we will obtain NxN entity translation proposals. 
To find out which candidate is the correct one, the tool performs a web search; but since the number of candidates is so high, we first apply the same candidate selection technique used earlier for element selection.</Paragraph> <Paragraph position="17"> This time we use the probabilities of the elements to obtain a scored proposal list. The x candidates with the highest probability are searched on the web and ranked in a final candidate list of translated entities.</Paragraph> <Paragraph position="18"> In our experiments, we have used the Google API to query the web. Searching for entities in Google has the advantage of returning the most common forms of entities in any type of document. But if higher precision is preferred over good recall, a higher certainty rate can be obtained by making a specialized search on the web. For those specialized searches we have used Wikipedia, a free encyclopedia written collaboratively by many of its readers in many languages.</Paragraph> </Section> <Section position="2" start_page="4" end_page="6" type="sub_section"> <SectionTitle> 4.2 Language Independent Tool </SectionTitle> <Paragraph position="0"> Since creating transformation rules for every language pair is not always viable, we have designed a general transformation grammar, which suits most language pairs that use the same alphabet. All we need is a written corpus for each language and a bilingual dictionary.</Paragraph> <Paragraph position="1"> We have constructed an NE translation tool based on comparable corpora using that general grammar. As you can see in Figure 2, the system finds translation proposals for Basque entity elements by applying the pseudo-transliteration module. Once it has at least one translation candidate for each element, it applies the whole-entity construction module, obtaining all the possible whole-entity candidates. 
Finally, it searches for each candidate in the corresponding comparable corpus and returns a candidate list ranked according to that search, in order to obtain the correct translation.</Paragraph> <Paragraph position="2"> The pseudo-transliteration module has two main sources: an edit distance (Kukich, 1992) grammar and a Spanish lexicon.</Paragraph> <Paragraph position="3"> The edit distance grammar is composed of three main rules: 1. a character can be replaced in a word; 2. a character can be deleted from a word; 3. a new character can be inserted in a word. There is no specific rule in the grammar for switching adjacent characters, because that transformation can be simulated by combining the deletion and insertion rules above. Since each rule can be applied n times to each word, the set of all translated words obtained by applying the rules independently and in combination is too large.</Paragraph> <Paragraph position="4"> In order to reduce the output proposal set, we have combined the grammar with a Spanish lexicon, and we have restricted the transformation rules to two applications, so words requiring more than two transformations are discarded. Thus, when the system applies the automaton resulting from this combination, only the Spanish words that can be obtained with at most two transformations are proposed as pseudo-transliterations of a Basque entity element. The Spanish lexicon has been built from all the words of EFE 2002 (the Spanish corpus for the year 2002) and the Elhuyar 2000 bilingual dictionary. 
And as we consider this corpus comparable to Euskaldunon Egunkaria 2002, the Basque corpus, we assume that most of the Basque words have their corresponding translation in the Spanish set.</Paragraph> <Paragraph position="5"> However, some words do not have their corresponding translation in EFE 2002, or their translation cannot be obtained by applying only two transformations. To obtain their translations in a different way, we have used the Basque-Spanish Elhuyar 2000 bilingual dictionary. To be precise, we have converted the bilingual dictionary into an automaton and combined it with the automaton obtained by applying the transliteration grammar to the Spanish lexicon.</Paragraph> <Paragraph position="6"> In this way the system is able to translate not only the words that can be transliterated into words of the EFE 2002 corpus, but also the words that cannot be translated using transformation knowledge and that need information from a bilingual dictionary, such as 'Erakunde' vs. 'Organización' (organization).</Paragraph> <Paragraph position="7"> Since we want to build a language-independent system that works with just two data-sets in different languages, we cannot use any linguistic feature to arrange the entity elements and obtain the correct whole translated entity.</Paragraph> <Paragraph position="8"> Many approaches could be used to arrange the elements, but we have chosen the simplest one: combining each proposed element with the rest, considering that each proposal can appear in any position within the entity. 
Thus, the system returns a large list of candidates, but the list is certain to include the correct one whenever all the elements have been translated correctly on their own.</Paragraph> <Paragraph position="9"> Although in some cases prepositions and articles are needed to obtain the correct Spanish form, the translation candidates for the whole entity do not contain any element other than the translated words of the original entity. So, the following step takes the absence of these elements into account.</Paragraph> <Paragraph position="10"> Once the system has computed all possible translation candidates for the whole entity, the next step is to select the most suitable proposal. For that purpose, we used the web in the linguistic tool. But this time, we have made use of the Spanish news data-set, in which entities were tagged. This set is smaller and permits faster searching; furthermore, since the Basque and Spanish sets are comparable, the correct translation is expected to occur in this smaller corpus, so it is very probable that the system will propose the right translation.</Paragraph> <Paragraph position="11"> Therefore, every translation proposal is searched for in the Spanish data-set and placed in the ranked list according to its frequency. Thus, the most frequent entities in the corpus appear at the top of the list.</Paragraph> <Paragraph position="12"> 4.2.4 Combining web and comparable corpus rankings. Both the Euskaldunon Egunkaria 2002 and EFE 2002 data-sets are news sets from the year 2002, and many named entities are likely to occur in both sets. 
But since the articles come from newspapers of different countries, there may be some non-shared named entities.</Paragraph> <Paragraph position="13"> When the system searches for these special entities in the Spanish comparable corpus, it will very probably find none of the candidates, and so the list will not be ranked.</Paragraph> <Paragraph position="14"> To avoid that random list ranking, when all translation candidates have a very low frequency, we propose using the web to obtain a better ranking. As we will present below, this optional second ranking step improves the final results.</Paragraph> </Section> </Section> </Paper>