<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2035"> <Title>Multilingual Lexical Database Generation from parallel texts in 20 European languages with endogenous resources</Title> <Section position="9" start_page="274" end_page="274" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The results concern an alignment task between English and the 19 other languages of the AC-Corpus. For each language pair, we considered 500 bitexts of the AC-Corpus. Annexes A, B, and C give samples of these results.</Paragraph> <Paragraph position="1"> Annex A covers English-French parallel texts, Annex B English-Spanish parallel texts, and Annex C English-German ones. The remainder of this section discusses the English-French alignment.</Paragraph> <Paragraph position="2"> Among the correct alignments, we find domain-dependent lexical terms: - legal terms of the EEC (EEC initial verification/verification primitive CEE, Regulation (EEC) No/reglement (CEE) no), - specialty terms (rear-view mirrors/retroviseurs, poultry/volaille).</Paragraph> <Paragraph position="3"> We also find invariant terms (km/h/km/h, kg/kg, mortem/mortem).</Paragraph> <Paragraph position="4"> We encounter alignments at different levels of granularity: territory/territoire, Member States/Etats membres, Whereas/Considerant que, fresh poultrymeat/viandes fraiches de volaille, Having regard to the Opinion of the/vu l'avis.</Paragraph> <Paragraph position="5"> The wrong alignments mainly come from candidates that were not confirmed across several documents (column ndoc=1), for example on/la commercialisation des.</Paragraph> <Paragraph position="6"> A permanent dedicated web site will open in March 2006 to detail all the results for each language pair. 
The URL is http://users.info.unicaen.fr/~giguet/alignment.</Paragraph> <Section position="1" start_page="274" end_page="274" type="sub_section"> <SectionTitle> 5.1 Discussion </SectionTitle> <Paragraph position="0"> First, the results are similar to those obtained on the Greek/English scientific corpus.</Paragraph> <Paragraph position="1"> Second, it is sometimes difficult to choose between distinct proposals for the same term when the granularity varies: Member/membre~, Member State~/membre~, Member States/Etats membres, State/membre, State~/membre~. The problem lies both in the definition of terms and in the ability of an automatic process to choose between the components of a term.</Paragraph> <Paragraph position="2"> Third, thematic terms of the corpus are not always aligned, since they are not repeated. Coreference is used instead, through nominal anaphora, acronyms, and lexical reductions. Accuracy depends on the document domain. In the medical domain, acronyms are aligned but not their expansions. However, we consider that this problem has to be solved by an anaphora resolution system, not by this alignment algorithm.</Paragraph> </Section> </Section> </Paper>