<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2035"> <Title>Multilingual Lexical Database Generation from parallel texts in 20 European languages with endogenous resources</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> GIGUET EMMANUEL GREYC CNRS UMR 6072 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> LUQUET Pierre-Sylvain GREYC CNRS UMR 6072 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper deals with multilingual database generation from parallel corpora.</Paragraph> <Paragraph position="1"> The idea is to contribute to the enrichment of lexical databases for languages with few linguistic resources. Our approach is endogenous: it relies on the raw texts only and does not require external linguistic resources such as stemmers or taggers. The system produces alignments for the 20 European languages of the 'Acquis Communautaire' Corpus.</Paragraph> </Section> <Section position="5" start_page="0" end_page="271" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Automatic processing of bilingual and multilingual corpora </SectionTitle> <Paragraph position="0"> Processing bilingual and multilingual corpora constitutes a major area of investigation in natural language processing. The linguistic and translational information they contain makes them a valuable resource for translators, lexicographers, and terminologists. They constitute the nucleus of example-based machine translation and translation memory systems.</Paragraph> <Paragraph position="1"> Another field of interest is the constitution of multilingual lexical databases, such as the project planned by the European Commission's Joint Research Centre (JRC) or the more established Papillon project. Multilingual lexical databases store structured lexical data which can be used either by humans (e.g. to define their own dictionaries) or by natural language processing (NLP) applications.</Paragraph> <Paragraph position="2"> Parallel corpora are freely available for research purposes, and their increasing size demands the exploration of automatic methods.</Paragraph> <Paragraph position="3"> The 'Acquis Communautaire' (AC) Corpus is such a corpus. Many research teams are involved in the JRC project for the enrichment of a multilingual lexical database. The aim of the project is the automatic extraction of lexical tuples from the AC Corpus.</Paragraph> <Paragraph position="4"> The AC document collection was constituted when ten new countries joined the European Union in 2004. They had to translate an existing collection of about ten thousand legal documents covering a large variety of subject areas. The 'Acquis Communautaire' Corpus exists as a parallel text in 20 languages. The JRC has collected large parts of this document collection, has converted it to XML, and provides sentence alignments for most language pairs (Steinberger et al., 2006).</Paragraph> </Section> <Section position="2" start_page="0" end_page="271" type="sub_section"> <SectionTitle> 1.2 Alignment approaches </SectionTitle> <Paragraph position="0"> Alignment has become an important issue in research on bilingual and multilingual corpora.
Existing alignment methods define a continuum going from purely statistical methods to linguistic ones. A major point of divergence is the granularity of the proposed alignments (entire texts, paragraphs, sentences, clauses, words), which often depends on the application.</Paragraph> <Paragraph position="1"> In a coarse-grained alignment task, punctuation or formatting can be sufficient. At finer-grained levels, methods are more sophisticated and combine linguistic clues with statistical ones. Statistical alignment methods at sentence level have been thoroughly investigated (Gale & Church, 1991a, 1991b; Brown et al., 1991; Kay & Roscheisen, 1993). Others use various kinds of linguistic information (Simard et al., 1992; Papageorgiou et al., 1994). Purely statistical alignment methods have been proposed at word level (Gale & Church, 1991a; Kitamura & Matsumoto, 1995).</Paragraph> <Paragraph position="2"> (Tiedemann, 1993; Boutsis & Piperidis, 1996; Piperidis et al., 1997) combine statistical and linguistic information for the same task. Some methods make alignment suggestions at an intermediate level between sentence and word (Smadja, 1992; Smadja et al., 1996; Kupiec, 1993; Kumano & Hirakawa, 1994; Boutsis & Piperidis, 1998).</Paragraph> <Paragraph position="3"> A common problem is the delimitation and spotting of the units to be matched. This is not a real problem for methods aiming at alignments at a high level of granularity (paragraphs, sentences), where unit delimiters are clear. It becomes more difficult at lower levels of granularity (Simard, 2003), where correspondences between graphically delimited words are not always satisfactory.</Paragraph> </Section> </Section> <Section position="6" start_page="271" end_page="271" type="metho"> <SectionTitle> 2 The multi-grained endogenous alignment approach </SectionTitle> <Paragraph position="0"> The approach proposed here deals with the spotting of multi-grained translation equivalents. We do not impose rigid constraints on the size of the linguistic units involved, in order to account for the flexibility of language and for translation divergences. Alignment links can thus be established at various levels, from sentence to word, obeying no constraints other than the maximum size of candidate alignment sequences and their minimum frequency of occurrence.</Paragraph> <Paragraph position="1"> The approach is endogenous: the input, the multilingual parallel AC corpus itself, is the only linguistic resource used. The corpus contains no syntactic annotation, and the texts have not been lemmatised. No classical linguistic resources are required. The input texts have been segmented and aligned at sentence level by the JRC. Inflectional divergences of isolated words are taken into account without external linguistic information (lexicon) and without linguistic parsers (stemmer or tagger). The morphology is learnt automatically by an endogenous parsing module, based on (Dejean, 1998), integrated in the alignment tool.</Paragraph> <Paragraph position="2"> We adopt a minimalist approach, in the line of work at GREYC. In the JRC project, many languages have no linguistic resources available for automatic processing: neither inflectional nor syntactic annotation, no surface syntactic analysis, and no lexical resources (machine-readable dictionaries, etc.).
Therefore we cannot rely on a large amount of a priori knowledge about these languages.</Paragraph> </Section> <Section position="7" start_page="271" end_page="272" type="metho"> <SectionTitle> 3 Considerations on the Corpus </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="271" end_page="272" type="sub_section"> <SectionTitle> 3.1 Corpus definition </SectionTitle> <Paragraph position="0"> Concretely, the texts constituting the AC corpus (Steinberger et al., 2006) are legal documents translated into several languages and aligned at sentence level. The parallel corpus comprises the following documents in the 20 available languages:
- Czech: 7106 documents
- Danish: 8223 documents
- German: 8249 documents
- Greek: 8003 documents
- English: 8240 documents
- Spanish: 8207 documents
- Estonian: 7844 documents
- Finnish: 8189 documents
- French: 8254 documents
- Hungarian: 7535 documents
- Italian: 8249 documents
- Lithuanian: 7520 documents
- Latvian: 7867 documents
- Maltese: 6136 documents
- Dutch: 8247 documents
- Polish: 7768 documents
- Portuguese: 8210 documents
- Slovakian: 6963 documents
- Slovene: 7821 documents
- Swedish: 8233 documents
The documents contained in the archives are UTF-8 encoded XML files containing information on &quot;sentence&quot; segmentation. Each file is stamped with a unique identifier (the CELEX identifier), which refers to a unique document. Here is an excerpt of the document 31967R0741, in Czech: <P sid=&quot;9&quot;>vzhledem k tomu, že zavedením režimu jednotných a povinných náhrad při vývozu do třetích zemí od zavedení jednotné organizace trhu pro zemědělské produkty, jež ve značné míře existuje od 1. července 1967, vyšlo kritérium nejnižší průměrné náhrady stanovené pro financování náhrad podle čl. 3 odst. 1 písm. a) nařízení č. 25 o financování společné zemědělské politiky z používání;</P> [...] Sentence alignment files are also provided with the corpus, for 111 language pairs. These XML files, encoded in UTF-8, are about 2 MB packed and 10 MB unpacked. Here is an excerpt of the alignment file of the document 31967R0741, for the Czech-Danish language pair.</Paragraph> <Paragraph position="2"> In this file, the xtargets &quot;ids&quot; refer to the <P sid=&quot;...&quot;> elements of the Czech and Danish translations of the document 31967R0741.</Paragraph> <Paragraph position="3"> The current version of our alignment system deals with one language pair at a time, whatever the languages are. The algorithm takes as input a corpus of bitexts aligned at sentence level. Usually, the alignment at this level outputs aligned windows containing from 0 to 2 segments. A one-to-one mapping corresponds to the standard output (see link type &quot;1-1&quot; above). An empty window corresponds to a case of addition in the source language or omission in the target language. A one-to-two mapping corresponds to split sentences (see link types &quot;1-2&quot; and &quot;2-1&quot; above).</Paragraph> <Paragraph position="4"> Formally, each bitext is a quadruple <T1, T2, Fs, C>, where T1 and T2 are the two texts, Fs is the function that reduces T1 to an element set Fs(T1) and likewise reduces T2 to an element set Fs(T2), and C is a subset of the Cartesian product Fs(T1) x Fs(T2) (Harris, 1988).</Paragraph> <Paragraph position="5"> Different standards define the encoding of parallel text alignments. Our system natively handles the TMX and XCES formats, with UTF-8 or UTF-16 encoding.</Paragraph> </Section> </Section>
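To make the data model concrete, here is a minimal Python sketch (not the paper's implementation) of how the bitext quadruple <T1, T2, Fs, C> can be materialised from these files. It assumes XCES-style alignment files whose link elements carry the type and xtargets attributes discussed above, with the two sid lists separated by &quot;;&quot; and individual ids by spaces; the file layout, element names and function names are illustrative assumptions.

```python
# Minimal sketch of the bitext quadruple <T1, T2, Fs, C>.
# Assumed (hypothetical) input layout: document files whose sentences are
# <P sid="..."> elements, and an XCES-style alignment file containing
# <link type="1-1" xtargets="9;9"/> elements, source and target sid lists
# separated by ';' and individual sids by spaces.

import xml.etree.ElementTree as ET

def load_segments(doc_path):
    """Fs: reduce a text to its set of sentence-level elements, keyed by sid."""
    root = ET.parse(doc_path).getroot()
    return {p.get("sid"): (p.text or "").strip() for p in root.iter("P")}

def load_links(align_path):
    """Read the aligner's links; each side holds 0 to 2 sids."""
    root = ET.parse(align_path).getroot()
    pairs = []
    for link in root.iter("link"):
        src, tgt = link.get("xtargets").split(";")
        pairs.append((src.split(), tgt.split()))  # empty list = omission/addition
    return pairs

def bitext(doc1_path, doc2_path, align_path):
    """Return (T1, T2, C): the two segment sets and the aligned windows."""
    t1, t2 = load_segments(doc1_path), load_segments(doc2_path)
    c = [(" ".join(t1[i] for i in src), " ".join(t2[j] for j in tgt))
         for src, tgt in load_links(align_path)]
    return t1, t2, c
```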
<Section position="8" start_page="272" end_page="274" type="metho"> <SectionTitle> 4 The Resolution Method </SectionTitle> <Paragraph position="0"> The resolution method is composed of two stages, based on two underlying hypotheses. The first stage handles the document grain. The second stage handles the corpus grain.</Paragraph> <Section position="1" start_page="272" end_page="273" type="sub_section"> <SectionTitle> 4.1 Hypotheses </SectionTitle> <Paragraph position="0"> Hypothesis 1: let us consider a bitext composed of the texts T1 and T2. A repeated sequence in the texts of language L1 tends to have a unique translation in the corresponding texts of language L2.</Paragraph> <Paragraph position="2"> The first stage handles the document scale. It is thus applied to each document individually; there is no interaction at the corpus level.</Paragraph> <Paragraph position="3"> Determining the multi-grained sequences to be aligned First, we consider the two languages of the document independently, the source language L1 and the target language L2. For each language, we compute the repeated sequences as well as their frequency.</Paragraph> <Paragraph position="4"> The algorithm, based on suffix arrays, does not retain the sub-sequences of a repeated sequence if they are as frequent as the sequence itself. For instance, if &quot;subjects&quot; appears with the same frequency as &quot;healthy subjects&quot;, we retain only the latter. Conversely, if &quot;disease&quot; occurs more frequently than &quot;thyroid disease&quot;, we retain both.</Paragraph> <Paragraph position="5"> When computing the frequency of a repeated sequence, the offset of each occurrence is memorised. The output of this processing stage is thus a list of sequences, each with its frequency and the list of its offsets in the document, e.g.:</Paragraph> <Paragraph position="6"> &quot;thyroid cancer&quot;: segments where the sequence appears: 45, 46, 46, 48, 51, 51, ...</Paragraph> <Paragraph position="7"> Handling inflections Inflectional divergences of isolated words are taken into account without external linguistic information (lexicon) and without linguistic parsers (stemmer or tagger). The morphology is learnt automatically using an endogenous approach derived from (Dejean, 1998). The algorithm is reversible: it can compute prefixes in the same way, taking a reversed word list as input. The basic idea is to approximate the border between the nucleus and the suffixes. The border matches the position where the number of distinct letters preceding a suffix of length n is greater than the number of distinct letters preceding a suffix of length n-1.</Paragraph> <Paragraph position="8"> For instance, in the first English document of our corpus, &quot;g&quot; is preceded by 4 distinct letters, &quot;ng&quot; by 2 and &quot;ing&quot; by 10: &quot;ing&quot; is probably a suffix. In the first Greek document, &quot;ά&quot; is preceded by 5 letters, &quot;κά&quot; by 1 and &quot;ικά&quot; by 10: &quot;ικά&quot; is probably a suffix.</Paragraph> <Paragraph position="9"> From a strictly linguistic point of view, the algorithm can generate some wrong morphemes. But at this stage, no filtering is done to check their validity: we let the alignment algorithm do that job with the help of contextual information.</Paragraph>
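As a concrete illustration of the sequence-extraction step above, here is a short Python sketch. It substitutes naive n-gram counting for the paper's suffix-array algorithm (same retention behaviour on small inputs, worse complexity), assumes whitespace tokenisation, and uses illustrative values for the two constraints of Section 2 (max_len, min_freq).

```python
# Sketch of repeated-sequence extraction with the equal-frequency rule:
# a sub-sequence is dropped when a longer sequence containing it is exactly
# as frequent ("healthy subjects" subsumes "subjects"), but kept when it is
# more frequent on its own ("disease" vs "thyroid disease").

from collections import defaultdict

def repeated_sequences(segments, max_len=5, min_freq=2):
    """segments: {sid: text}. Returns {token tuple: list of sids},
    the sid list doubling as frequency and occurrence offsets."""
    occ = defaultdict(list)
    for sid, text in segments.items():
        tokens = text.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                occ[tuple(tokens[i:i + n])].append(sid)
    occ = {seq: sids for seq, sids in occ.items() if len(sids) >= min_freq}
    kept = dict(occ)
    for seq, sids in occ.items():
        for n in range(1, len(seq)):          # every strict sub-sequence
            for i in range(len(seq) - n + 1):
                sub = seq[i:i + n]
                if sub in kept and len(occ[sub]) == len(sids):
                    del kept[sub]             # as frequent as its extension
    return kept
```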
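The border criterion for suffix learning can be sketched the same way. The code below counts, for every word ending, the distinct letters that precede it, and proposes a suffix wherever the count rises from length n-1 to length n; max_suffix_len is an assumed bound, and prefixes are obtained by running the same code on reversed words.

```python
# Sketch of the endogenous suffix learner derived from (Dejean, 1998):
# posit a border where the number of distinct letters preceding a suffix
# of length n exceeds the number preceding the suffix of length n-1
# (e.g. "ing" is preceded by more distinct letters than "ng" in English).

from collections import defaultdict

def learn_suffixes(words, max_suffix_len=6):
    preceders = defaultdict(set)  # word ending -> distinct letters before it
    for w in set(words):
        for n in range(1, min(max_suffix_len, len(w) - 1) + 1):
            preceders[w[-n:]].add(w[-n - 1])
    return sorted(s for s in preceders
                  if len(s) > 1 and len(preceders[s]) > len(preceders[s[1:]]))

# Reversibility: learn_suffixes(w[::-1] for w in words) yields reversed prefixes.
```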
<Paragraph position="10"> Vectorial representation of the sequences An orthonormal space is then considered in order to explore possible translation relations between the sequences and to define translation pairs. The existence of a translation relation between two sequences is approximated by the cosine of the vectors associated with them in this space.</Paragraph> <Paragraph position="11"> The links in the alignment file allow the construction of this orthonormal space. This space has n_o dimensions, where n_o is the number of non-empty links. Alignment links with empty sets (type=&quot;0-?&quot; or type=&quot;?-0&quot;) correspond to cases of omission or addition in one language and are discarded.</Paragraph> <Paragraph position="12"> Every repeated sequence is seen as a vector in this space. To construct this vector, we first pick up, for each repeated sequence, the offsets of the segments in which it appears:</Paragraph> <Paragraph position="13"> &quot;thyroid cancer&quot;: segments where the sequence appears: 45, 46, 46, 48, 51, 51</Paragraph> <Paragraph position="14"> Then we convert this list into an n_o-dimensional vector. For each L1 sequence to be aligned, we look for the existence of a translation relation between it and every L2 sequence to be aligned. The existence of a translation relation between two sequences is approximated by the cosine of the vectors associated with them.</Paragraph> <Paragraph position="15"> The cosine is a mathematical tool used in Natural Language Processing for various purposes; e.g. (Roy & Beust, 2004) use the cosine for thematic categorisation of texts. The cosine is obtained by dividing the scalar product of two vectors by the product of their norms:</Paragraph> <Paragraph position="16"> cos(X1, X2) = (X1 · X2) / (||X1|| ||X2||)</Paragraph> <Paragraph position="17"> Note that the cosine is never negative here, as vector coordinates are always non-negative. The sequences proposed for alignment are those that obtain the largest cosine. We do not propose an alignment if the best cosine falls below a certain threshold.</Paragraph> </Section>
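Here is a minimal Python sketch of this matching stage. It assumes that the coordinate of a sequence on a given dimension is its number of occurrences in the segments of that alignment link (the paper does not spell out the weighting), and the 0.5 threshold is illustrative.

```python
# Sketch of the vector-space matching: every repeated sequence becomes a
# vector over the n_o non-empty alignment links; two sequences are proposed
# as translations when their cosine is maximal and reaches a threshold.

import math
from collections import Counter

def to_vector(segment_ids, link_of_segment):
    """Coordinates = occurrence counts per alignment link (dimension);
    segments that only appear in 0-? / ?-0 links are skipped."""
    return Counter(link_of_segment[sid] for sid in segment_ids
                   if sid in link_of_segment)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def align(seqs1, seqs2, links1, links2, threshold=0.5):
    """seqsX: {sequence: sid list} (stage-1 output);
    linksX: {sid: index of the non-empty link containing it}."""
    vecs2 = {s2: to_vector(ids, links2) for s2, ids in seqs2.items()}
    proposals = []
    for s1, ids in seqs1.items():
        v1 = to_vector(ids, links1)
        scored = [(cosine(v1, v2), s2) for s2, v2 in vecs2.items()]
        if scored:
            best_score, best = max(scored)
            if best_score >= threshold:
                proposals.append((s1, best, best_score))
    return proposals
```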
<Section position="2" start_page="273" end_page="273" type="sub_section"> <SectionTitle> 4.3 Stage 2: Corpus management </SectionTitle> <Paragraph position="0"> The second stage handles the corpus grain and merges the information found at the document grain in the first stage.</Paragraph> </Section> <Section position="3" start_page="273" end_page="274" type="sub_section"> <SectionTitle> Handling the Corpus Dimension </SectionTitle> <Paragraph position="0"> The bitext corpus is not a bag of aligned sentences and is not treated as if it were. It is a bag of bitexts, each bitext containing a bag of aligned sentences.</Paragraph> <Paragraph position="1"> Considering the bitext level (or document grain) is useful for several reasons. First, for operational reasons: the greedy algorithm for repeated-sequence extraction has cubic complexity, so it is better to apply it to document units rather than to the whole corpus. But this is not the main reason.</Paragraph> <Paragraph position="2"> Second, the alignment algorithm between sequences relies on the principle of translation coherence: a repeated sequence in L1 is very likely to be translated by the same sequence in L2 within the same text. This hypothesis holds inside a document but not across the corpus: a polysemous term can be translated in different ways according to the document genre or domain.</Paragraph> <Paragraph position="3"> Third, confidence in the generated alignments improves when the executions of the process on several documents yield compatible alignments.</Paragraph> </Section> <Section position="4" start_page="274" end_page="274" type="sub_section"> <SectionTitle> Alignment Filtering and Ranking </SectionTitle> <Paragraph position="0"> The filtering process accepts terms which have been produced (1) by the executions on at least two documents, or (2) by the execution on a single document, if the aligned terms correspond to the same character string or if the frequency of the terms is greater than an empirical threshold function. This threshold is proportional to the inverse of the term length, since complex repeated terms are rarer than simple terms.</Paragraph> <Paragraph position="1"> The ranking process sorts candidates by the product of the term frequency and the number of output agreements.</Paragraph> </Section> </Section>
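Finally, a Python sketch of these filtering and ranking rules. The constant k in the threshold function is an assumption made for illustration; the paper states only that the threshold is proportional to the inverse of the term length.

```python
# Sketch of corpus-level filtering and ranking. candidates maps an aligned
# term pair to the list of frequencies it obtained, one per document whose
# stage-1 run produced it (the number of output agreements).

def keep(pair, freqs, k=20.0):
    src, tgt = pair
    if len(freqs) >= 2:                        # (1) two or more documents agree
        return True
    threshold = k / max(len(src.split()), 1)   # inverse of the term length
    return src == tgt or freqs[0] > threshold  # (2) single-document evidence

def rank(candidates):
    kept = {p: f for p, f in candidates.items() if keep(p, f)}
    # product of the term frequency and the number of output agreements
    return sorted(kept, key=lambda p: sum(kept[p]) * len(kept[p]), reverse=True)
```
</Paper>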