<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1103">
  <Title>Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora</Title>
  <Section position="3" start_page="0" end_page="818" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Named Entity Recognition (NER) has been getting much attention in NLP research in recent years, since it is seen as a significant component of higher-level NLP tasks such as information distillation and question answering. Most successful approaches to NER employ machine learning techniques, which require supervised training data. However, for many languages, these resources do not exist. Moreover, it is often difficult to find experts in these languages, both for the expensive annotation effort and even for providing language-specific clues. On the other hand, comparable multilingual data (such as multilingual news streams) are becoming increasingly available (see Section 4).</Paragraph>
    <Paragraph position="1"> In this work, we make two independent observations about Named Entities encountered in such corpora, and use them to develop an algorithm that extracts pairs of NEs across languages. Specifically, given a bilingual corpus that is weakly temporally aligned, and a capability to annotate the text in one of the languages with NEs, our algorithm identifies the corresponding NEs in the second language's text and annotates them with the appropriate type, as in the source text.</Paragraph>
    <Paragraph position="2"> The first observation is that NEs in one language in such corpora tend to co-occur with their counterparts in the other. For example, Figure 1 shows a histogram of the number of occurrences of the word Hussein and of its Russian transliteration in our bilingual news corpus spanning the years 2001 through late 2005. One can see several common peaks in the two histograms, the largest one being around the time of the beginning of the war in Iraq. The word Russia, on the other hand, has a distinctly different temporal signature. We can exploit such weak synchronicity of NEs across languages to associate them: in order to score a pair of entities across languages, we compute the similarity of their time distributions.</Paragraph>
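As a minimal sketch of scoring such weak synchronicity (not the paper's exact metric, which is introduced later), the similarity of two temporal occurrence histograms can be computed as cosine similarity; the variable names and toy counts below are hypothetical:

```python
import math

def time_similarity(counts_a, counts_b):
    """Cosine similarity between two temporal occurrence histograms
    (mention counts per time bucket, e.g., per week)."""
    dot = sum(x * y for x, y in zip(counts_a, counts_b))
    norm_a = math.sqrt(sum(x * x for x in counts_a))
    norm_b = math.sqrt(sum(y * y for y in counts_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy counts: synchronized peaks score high; a flat, unrelated
# signature (like Russia in the example above) scores lower.
hussein_en = [0, 1, 9, 8, 1, 0, 2, 1]
hussein_ru = [0, 2, 8, 7, 2, 1, 1, 0]
russia_ru = [3, 3, 2, 4, 3, 3, 4, 3]
```

Normalizing by the vector lengths factors out overall corpus frequency, so only the shape of the temporal signature matters.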
    <Paragraph position="3"> The second observation is that NEs often contain, or are entirely made up of, words that are phonetically transliterated or have a common etymological origin across languages (e.g., parliament in English and парламент, its Russian translation), and thus are phonetically similar. Figure 2 shows an example list of NEs and their possible Russian transliterations.</Paragraph>
    <Paragraph position="4"> (Figure 1: temporal histograms of Hussein (top), of its Russian transliteration (middle), and of the word Russia (bottom).)</Paragraph>
    <Paragraph position="5"> Approaches that attempt to use these two characteristics separately to identify NEs across languages would have significant shortcomings.</Paragraph>
    <Paragraph position="6"> Transliteration based approaches require a good model, typically handcrafted or trained on a clean set of transliteration pairs. On the other hand, time sequence similarity based approaches would incorrectly match words which happen to have similar time signatures (e.g., Taliban and Afghanistan in recent news).</Paragraph>
    <Paragraph position="7"> We introduce an algorithm, which we call co-ranking, that exploits these observations simultaneously to match NEs on one side of the bilingual corpus to their counterparts on the other. We use a Discrete Fourier Transform (Arfken, 1985) based metric for computing the similarity of time distributions, and show that it has significant advantages over other metrics traditionally used. We score NE similarity with a linear transliteration model.</Paragraph>
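One way a DFT-based score could be instantiated (an illustrative stdlib-only sketch, not necessarily the paper's exact formulation): comparing low-frequency magnitude spectra makes the score insensitive to a circular shift between the two corpora's time axes, since DFT magnitudes are shift-invariant.

```python
import cmath
import math

def dft_magnitudes(signal, k_max=4):
    """Magnitudes of the first k_max discrete Fourier coefficients
    of a temporal histogram (hand-rolled DFT; k_max is an assumed
    parameter)."""
    n = len(signal)
    mags = []
    for k in range(1, k_max + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff) / n)
    return mags

def dft_distance(a, b, k_max=4):
    """Euclidean distance between low-frequency magnitude spectra;
    smaller means more similar temporal signatures."""
    ma = dft_magnitudes(a, k_max)
    mb = dft_magnitudes(b, k_max)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))
```

Keeping only the low-frequency coefficients also smooths over bucket-level noise in the mention counts.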
    <Paragraph position="8"> We first train a transliteration model on single-word NEs. During training, for a given NE in one language, the current model chooses a list of top-ranked transliteration candidates in the other language. Time sequence scoring is then used to re-rank the list and choose the candidate best temporally aligned with the NE. Pairs of NEs and the best candidates are then used to iteratively train the transliteration model.</Paragraph>
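A single co-ranking step along these lines might be sketched as follows. The string-similarity scorer is a toy stand-in for the paper's linear transliteration model, and all names and (romanized) toy data are hypothetical:

```python
import math
from difflib import SequenceMatcher

def translit_score(source, candidate):
    """Toy stand-in for the trained transliteration model: plain
    string similarity between the source NE and a romanized candidate."""
    return SequenceMatcher(None, source.lower(), candidate.lower()).ratio()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def co_rank_step(source_ne, source_hist, candidates, hists, top_k=2):
    """The model proposes its top_k candidates; temporal similarity
    then re-ranks the short list and picks the best-aligned one."""
    short_list = sorted(candidates,
                        key=lambda c: translit_score(source_ne, c),
                        reverse=True)[:top_k]
    return max(short_list, key=lambda c: cosine(source_hist, hists[c]))

# Toy run: two near-identical transliterations are disambiguated
# by their temporal signatures.
hists = {"khusein": [0, 2, 8, 7, 2],
         "khusain": [3, 3, 3, 3, 3],
         "rossiya": [4, 3, 4, 3, 4]}
best = co_rank_step("Hussein", [0, 1, 9, 8, 1], list(hists), hists)
```

In the iterative scheme the paper describes, the resulting NE/candidate pairs would be added to the training set and the transliteration model retrained.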
    <Paragraph position="10"> Once the model is trained, NE discovery proceeds as follows. For a given NE, the transliteration model selects a candidate list for each constituent word. If a dictionary is available, each candidate list is augmented with translations (if they exist).</Paragraph>
    <Paragraph position="11"> Translations will be the correct choice for some NE words (e.g., for queen in Queen Victoria), and transliterations for others (e.g., Bush in Steven Bush). We expect temporal sequence alignment to resolve many such ambiguities. It is used to select the best translation/transliteration candidate from each word's candidate set; the selected candidates are then merged into a possible NE in the other language.</Paragraph>
    <Paragraph position="12"> Finally, we verify that the NE is actually contained in the target corpus.</Paragraph>
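Put together, this discovery step for a multi-word NE could be sketched like this (hypothetical helper names and toy, romanized data; the actual system works over Russian text):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def discover_ne(ne_words, src_hists, translit_cands, dictionary,
                tgt_hists, target_corpus_nes):
    """For each constituent word, pool transliteration candidates with
    dictionary translations (when available), keep the candidate best
    temporally aligned with the source word, merge the picks, then
    verify the merged NE actually occurs in the target corpus."""
    chosen = []
    for word in ne_words:
        candidates = list(translit_cands.get(word, []))
        candidates += dictionary.get(word, [])  # augment with translations
        best = max(candidates,
                   key=lambda c: cosine(src_hists[word], tgt_hists[c]))
        chosen.append(best)
    merged = " ".join(chosen)
    return merged if merged in target_corpus_nes else None

# Toy example: "queen" is best handled by translation,
# "Victoria" by transliteration.
src_hists = {"queen": [1, 5, 1, 4], "victoria": [1, 5, 1, 4]}
tgt_hists = {"kvin": [3, 3, 3, 3], "koroleva": [1, 4, 2, 4],
             "viktoriya": [0, 5, 1, 5]}
result = discover_ne(["queen", "victoria"], src_hists,
                     {"queen": ["kvin"], "victoria": ["viktoriya"]},
                     {"queen": ["koroleva"]},
                     tgt_hists, {"koroleva viktoriya"})
```

The final membership check mirrors the verification step above: a merged candidate that never occurs in the target corpus is discarded.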
    <Paragraph position="13"> A major challenge inherent in discovering transliterated NEs is the fact that a single entity may be represented by multiple transliteration strings. One reason is language morphology: in Russian, for example, depending on the grammatical case used, the same noun may appear with various endings. Another reason is the lack of transliteration standards: again in Russian, several possible transliterations of an English entity may be acceptable, as long as they are phonetically similar to the source.</Paragraph>
    <Paragraph position="14"> Thus, in order to rely on the time sequences we obtain, we need to be able to group variants of the same NE into an equivalence class, and collect their aggregate mention counts. We would then score time sequences of these equivalence classes.</Paragraph>
    <Paragraph position="15"> For instance, we would like to count the aggregate number of occurrences of {Herzegovina, Hercegovina} on the English side in order to map it accurately to the equivalence class of that NE's variants we may see on the Russian side of our corpus (e.g., the case variants {Герцеговина, Герцеговины, Герцеговине, Герцеговиной}). One of the objectives for this work was to use as little knowledge of both languages as possible. In order to effectively rely on the quality of time sequence scoring, we used a simple, knowledge-poor approach to group NE variants for the languages of our corpus (see Section 3.2.1).</Paragraph>
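One simple knowledge-poor grouping in this spirit (a sketch under the assumption that variants of an NE share a word prefix; `prefix_len` is an assumed parameter, not taken from the paper):

```python
from collections import defaultdict

def group_variants(words, prefix_len=5):
    """Group NE variant strings into equivalence classes keyed by a
    shared lowercase prefix -- a knowledge-poor heuristic that needs
    no morphological analysis."""
    classes = defaultdict(list)
    for w in words:
        classes[w[:prefix_len].lower()].append(w)
    return dict(classes)

def aggregate_counts(classes, counts):
    """Sum the per-variant temporal histograms within each class, so
    time sequence scoring sees one aggregate signature per entity."""
    agg = {}
    for key, members in classes.items():
        hists = [counts[m] for m in members]
        agg[key] = [sum(col) for col in zip(*hists)]
    return agg
```

With a suitable prefix length, spelling variants such as Herzegovina/Hercegovina fall into one class, while unrelated NEs stay apart; the aggregated histograms are then what gets scored.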
    <Paragraph position="16"> In the rest of the paper, whenever we refer to a Named Entity or an NE constituent word, we imply its equivalence class. Note that although we expect that better use of language-specific knowledge would improve the results, it would defeat one of the goals of this work.</Paragraph>
  </Section>
</Paper>