<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1117"> <Title>Philippe.Langlais~speech.kth.se</Title> <Section position="4" start_page="711" end_page="711" type="metho"> <SectionTitle> 3 Reference corpus </SectionTitle> <Paragraph position="0"> One of the main results of ARCADE has been to produce an aligned French-English corpus, combining texts of different genres and various degrees of difficulty for the alignment task. It is important to mention that until ARCADE, most alignment systems had been tested on judicial and technical texts which present relatively few difficulties for a sentence-level alignment. Therefore, diversity in the nature of the texts was preferred to the collection of a large quantity of similar data.</Paragraph> <Section position="1" start_page="711" end_page="711" type="sub_section"> <SectionTitle> 3.1 Format </SectionTitle> <Paragraph position="0"> ARCADE contributed to the development and testing of the Corpus Encoding Standard (CES), which was initiated during the MULTEXT project (Ide et al., 1995). The CES is based on SGML and it is an extension of the now internationally-accepted recommendations of the Text Encoding Initiative (Ide and Vdronis, 1995). Both the JOG and BAF parts of the ARCADE corpus (described below) are encoded in CES format.</Paragraph> </Section> </Section> <Section position="5" start_page="711" end_page="711" type="metho"> <SectionTitle> 3:2 JOC </SectionTitle> <Paragraph position="0"> The JOC corpus contains texts which were published in 1993 as a section of the C Series of the Official Journal of the European Community in all of its official languages. This corpus, which was collected and prepared during the MLCC and MULTEXT projects, contains, in 9 parallel versions, questions asked by members of the European Parliament on a variety of topics and the corresponding answers from the European Commission. JOC contains approximately 10 million words (ca. 1.1 million words per language). The part used for JOC was composed of one fifth of the French and English sections (ca. 200 000 words per language).</Paragraph> <Section position="1" start_page="711" end_page="711" type="sub_section"> <SectionTitle> 3.3 BAF </SectionTitle> <Paragraph position="0"> The BAF corpus is also a set of parallel French-English texts of about 400 000 words per language. It includes four text genres: 1) INST, four institutional texts (including transcription of speech from the Hansard corpus) for a totaling close to 300 000 words per language, 2) SCI-ENCE, five scientific articles of about 50 000 words per language, 3) TECH, technical documentation of about 40 000 words per language and 4) VERNE, the Jules Verne novel: &quot;De la terre d la lune&quot; (ca. 50 000 words per language). This last text is very interesting because the translation of literary texts is much freer than that of other types of tests. Furthermore, the English version is slightly abridged, which adds the problem of detecting missing segments.</Paragraph> <Paragraph position="1"> The BAF corpus is described in greater detail in (Simard, 1998).</Paragraph> </Section> </Section> <Section position="6" start_page="711" end_page="713" type="metho"> <SectionTitle> 4 Evaluation measures </SectionTitle> <Paragraph position="0"> We first propose a formal definition of parallel text alignment, as defined in (Isabelle and Simard, 1996). Based on that definition, the usual notions of recall and precision can be used to evaluate the quality of a given alignment with respect to a reference. 
However, recall and precision can be computed at various levels of granularity: an alignment at a given level (e.g. sentences) can be measured in terms of units of a lower level (e.g. words, characters). Such a fine-grained measure is less sensitive to segmentation problems, and can be used to weight errors according to the number of sub-units they span.</Paragraph> <Section position="1" start_page="712" end_page="712" type="sub_section"> <SectionTitle> 4.1 Formal definition </SectionTitle> <Paragraph position="0"> If we consider a text S and its translation T as two sets of segments S = {s1, s2, ..., sn} and T = {t1, t2, ..., tm}, an alignment A between S and T can be defined as a subset of the Cartesian product p(S) × p(T), where p(S) and p(T) are respectively the sets of all subsets of S and T.</Paragraph> <Paragraph position="1"> The triple (S, T, A) will be called a bitext. Each of the elements (ordered pairs) of the alignment will be called a bisegment.</Paragraph> <Paragraph position="2"> This definition is fairly general. However, in the evaluation exercise described here, segments were sentences and were assumed to be contiguous, yielding monotonic alignments.</Paragraph> <Paragraph position="3"> For instance, let us consider the following alignment, which will serve as the reference alignment in the subsequent examples. Its formal representation is: Ar = {({s1}, {t1}), ({s2}, {t2, t3})}.</Paragraph> </Section> <Section position="2" start_page="712" end_page="713" type="sub_section"> <SectionTitle> 4.2 Recall and precision </SectionTitle> <Paragraph position="0"> Let us consider a bitext (S, T, Ar) and a proposed alignment A. The alignment recall with respect to the reference Ar is defined as: recall = |A ∩ Ar| / |Ar|. It represents the proportion of bisegments in A that are correct with respect to the reference Ar. The silence corresponds to 1 - recall. The alignment precision with respect to the reference Ar is defined as: precision = |A ∩ Ar| / |A|. It represents the proportion of correct bisegments in A relative to the number of bisegments proposed. The noise corresponds to 1 - precision.</Paragraph> <Paragraph position="1"> We will also use the F-measure (Rijsbergen, 1979), which combines recall and precision in a single efficiency measure (the harmonic mean of recall and precision): F = 2 × recall × precision / (recall + precision). Consider, for instance, the proposed alignment A = {({s1}, {t1}), (∅, {t2}), ({s2}, {t3})}. We note that A ∩ Ar = {({s1}, {t1})}. Alignment recall and precision with respect to Ar are 1/2 = 0.50 and 1/3 = 0.33 respectively. The F-measure is 0.40.</Paragraph> <Paragraph position="2"> Improving recall and improving precision are antagonistic goals: efforts to improve one often result in degrading the other. Depending on the application, different trade-offs can be sought. For example, if the bisegments are used to automatically generate a bilingual dictionary, maximizing precision (i.e. omitting doubtful pairs) is likely to be the preferred option.</Paragraph> <Paragraph position="3"> Recall and precision as defined above are rather unforgiving. They do not take into account the fact that some bisegments may be partially correct. In the previous example, the bisegment ({s2}, {t3}) does not belong to the reference, but can be considered partially correct: t3 does match a part of s2.</Paragraph>
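As an illustration, the following minimal Python sketch (ours, for exposition only; the sentence identifiers are those of the running example) computes both the strict alignment-level measures and the flattened sentence-level variant developed below.

```python
from itertools import product

def bisegment(src, tgt):
    # frozensets make bisegments hashable, so alignments can be Python sets;
    # an empty tuple stands for the empty set of sentences.
    return (frozenset(src), frozenset(tgt))

def prf(proposed, reference):
    """Recall, precision and F-measure of one set against a reference set."""
    correct = len(proposed & reference)
    recall = correct / len(reference)
    precision = correct / len(proposed)
    f = 2 * recall * precision / (recall + precision) if correct else 0.0
    return recall, precision, f

def flatten(alignment):
    """Expand each bisegment into its set of sentence-to-sentence pairs."""
    return {pair for src, tgt in alignment for pair in product(src, tgt)}

# Running example: reference alignment Ar and proposed alignment A.
A_r = {bisegment({"s1"}, {"t1"}), bisegment({"s2"}, {"t2", "t3"})}
A   = {bisegment({"s1"}, {"t1"}), bisegment((), {"t2"}), bisegment({"s2"}, {"t3"})}

print(prf(A, A_r))                    # alignment level: 0.50, 0.33..., 0.40
print(prf(flatten(A), flatten(A_r)))  # sentence level:  0.66..., 1.0, 0.80
```

Word-level and character-level measures follow the same scheme, with each sentence replaced by its word or character units before flattening, so that errors are weighted by the amount of text they involve.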
<Paragraph position="4"> To take partial correctness into account, we need to compute recall and precision at the sentence level instead of the alignment level. Assuming the alignment A = {a1, a2, ..., ap} and the reference Ar = {ar1, ar2, ..., arq}, with ai = (asi, ati) and arj = (arsj, artj), we can derive the following sentence-to-sentence alignments: A' = ∪i (asi × ati) and A'r = ∪j (arsj × artj).</Paragraph> <Paragraph position="5"> Sentence-level recall and precision can thus be defined in the following way: recall = |A' ∩ A'r| / |A'r| and precision = |A' ∩ A'r| / |A'|.</Paragraph> <Paragraph position="6"> In the example above: A' = {(s1, t1), (s2, t3)} and A'r = {(s1, t1), (s2, t2), (s2, t3)}. Sentence-level recall and precision for this example are therefore 2/3 = 0.66 and 1 respectively, as compared to the alignment-level recall and precision, 0.50 and 0.33 respectively. The F-measure becomes 0.80 instead of 0.40.</Paragraph> </Section> <Section position="3" start_page="713" end_page="713" type="sub_section"> <SectionTitle> 4.3 Granularity </SectionTitle> <Paragraph position="0"> In the definitions above, the sentence is the unit of granularity used for the computation of recall and precision at both levels. This results in two difficulties. First, the measures are very sensitive to sentence segmentation errors. Secondly, they do not reflect the seriousness of misalignments. It seems reasonable that errors involving short sentences should be penalized less than errors involving longer ones, at least from the perspective of some applications.</Paragraph> <Paragraph position="1"> These problems can be avoided by taking advantage of the fact that a unit of a given granularity (e.g. sentence) can always be seen as a (possibly discontinuous) sequence of units of finer granularity (e.g. character).</Paragraph> <Paragraph position="2"> Thus, when an alignment A is compared to a reference alignment Ar using recall and precision measures computed at the character level, the values obtained reflect the quantity of text (i.e. the number of characters) in the misaligned sentences, rather than the number of misaligned sentences. In the example used above, the same sentence-level measures can likewise be computed using word granularity (punctuation marks being counted as words) or character granularity.</Paragraph> </Section> </Section> <Section position="7" start_page="713" end_page="714" type="metho"> <SectionTitle> 5 Systems tested </SectionTitle> <Paragraph position="0"> Six systems were tested, two of which were submitted by the RALI.</Paragraph> <Paragraph position="1"> RALI/Jacal As a first step, this system uses a program that reduces the search space to those sentence pairs that are potentially interesting (Simard and Plamondon, 1996). The underlying principle is the automatic detection of isolated cognates (i.e. cognates for which no other similar word exists within a window of a given size). Once the search space is reduced, the system aligns the sentences using the well-known sentence-length model described in (Gale and Church, 1991).</Paragraph> <Paragraph position="2"> RALI/Salign The second method proposed by RALI is based on a dynamic programming scheme which uses a score function derived from a translation model similar to that of (Brown et al., 1990). The search space is reduced to a beam of fixed width around the diagonal (which would represent the alignment if the two texts were perfectly synchronized); a schematic sketch of this kind of length-based dynamic programming is given below.</Paragraph>
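Both RALI systems, like several of the systems described next, rely on dynamic programming over a sentence-length signal. The following sketch illustrates this general technique in the spirit of Gale & Church; it is not the code of any evaluated system, and the length-ratio cost and skip penalty are crude stand-ins for their probabilistic length model.

```python
# Schematic length-based sentence aligner by dynamic programming,
# allowing 1-1, 1-0, 0-1, 2-1 and 1-2 bisegments.
PATTERNS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # (source, target) sentence counts
SKIP_PENALTY = 4.0  # cost of hypothesizing an omission or insertion (illustrative value)

def cost(src_chars, tgt_chars):
    if src_chars == 0 or tgt_chars == 0:
        return SKIP_PENALTY
    longer, shorter = max(src_chars, tgt_chars), min(src_chars, tgt_chars)
    return longer / shorter - 1.0  # 0 when lengths match, grows with mismatch

def align(src, tgt):
    """src, tgt: lists of sentence strings. Returns a list of bisegments."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]  # best[i][j]: cost over prefixes
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in PATTERNS:
                if i + di > n or j + dj > m:
                    continue
                c = best[i][j] + cost(
                    sum(len(s) for s in src[i:i + di]),
                    sum(len(t) for t in tgt[j:j + dj]),
                )
                if c < best[i + di][j + dj]:
                    best[i + di][j + dj] = c
                    back[i + di][j + dj] = (di, dj)
    # Trace back from the full texts to recover the bisegments.
    bisegments, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        bisegments.append((tuple(src[i - di:i]), tuple(tgt[j - dj:j])))
        i, j = i - di, j - dj
    return list(reversed(bisegments))
```

The beam used by Salign would correspond to skipping cells far from the diagonal in the loops above, and the skip penalty governs how readily omissions, such as those in the abridged VERNE translation, are hypothesized.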
<Paragraph position="3"> LORIA The strategy adopted in this system differs from that of the other systems, since sentence alignment is performed after the preliminary alignment of larger units, such as paragraphs and divisions (whenever possible using mark-up), on the basis of the SGML structure.</Paragraph> <Paragraph position="4"> A dynamic programming scheme is applied to all alignment levels in successive steps.</Paragraph> <Paragraph position="5"> IRMC This system involves a preliminary, rough word-alignment step which uses a transfer dictionary and a measure of the proximity of words (Débili et al., 1994). Sentence alignment is then achieved by an algorithm which optimizes several criteria, such as word-order conservation and synchronization between the two texts.</Paragraph> <Paragraph position="6"> LIA Like Jacal, the LIA system uses a pre-processing step involving cognate recognition to restrict the search space, but in a less restrictive way. Sentence alignment is then achieved through dynamic programming, using a score function which combines sentence length, cognates, a transfer dictionary and the frequency of translation schemes (1-1, 1-2, etc.).</Paragraph> <Paragraph position="7"> ISSCO Like the LORIA system, the ISSCO aligner is sensitive to the macro-structure of the document. It examines the tree structure of an SGML document in a first pass, weighting each node according to the number of characters contained within the subtree rooted at that node. The second pass descends the tree, first by depth, then by breadth, aligning sentences with a method resembling that of Gale & Church.</Paragraph> </Section> <Section position="8" start_page="714" end_page="715" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> Four sets of recall/precision measures were computed for the alignments produced by the six systems on each text type described above: Align (alignment level), Sent (sentence level), Word (word level) and Char (character level). The global efficiency of the different systems (average F-values) for each text type is given in Figure 1.</Paragraph> <Paragraph position="1"> Figure 1: Global efficiency (average F-values for the Align, Sent, Word and Char measures) of the different systems (Jacal, Salign, LORIA, IRMC, ISSCO, LIA), by text type (logarithmic scale).</Paragraph> <Paragraph position="2"> First, note that the Char measures are higher than the Align measures. This seems to confirm that systems tend to fail when dealing with shorter sentences. In addition, the reference alignment for the BAF corpus combines several 1-1 alignments in a single n-n alignment, for practical reasons owing to the sentence segmentation process. This results in decreased Align measures.</Paragraph> <Paragraph position="3"> The corpus on which all systems scored highest was the JOC. This corpus is relatively simple to align, since it contains 94% 1-1 alignments, reflecting a translation strategy based on speed and absolute fidelity. In addition, this corpus contains a large amount of data that remains unchanged during the translation process (proper names, dates, etc.) and which can be used as anchor points by some systems.
Note that the LORIA system achieves a slightly better performance than the others on this corpus, mainly because it is able to carry out structure alignment, since paragraphs and divisions are explicitly marked.</Paragraph> <Paragraph position="4"> The worst results were achieved on the VERNE corpus. This is also the corpus for which the results showed the most scattering across systems (22% to 90% char-precision).</Paragraph> <Paragraph position="5"> These poor results are linked to the literary nature of the corpus, where translation is freer and more interpretative. In addition, since the English version is slightly abridged, the occasional omissions result in de-synchronization for most systems. Nevertheless, the LIA system still achieves a satisfactory performance (90% char-recall and 94% char-precision), which can be explained by the efficiency of its word-based pre-alignment step, as well as by the scoring function used to rank the candidate bisegments.</Paragraph> <Paragraph position="6"> Significant discrepancies are also noted between the Align and Char recalls on the TECH corpus. This document contained a large glossary as an appendix, and since its terms are sorted in alphabetical order, they are ordered differently in each language. This portion of text was not manually aligned in the reference. The size of this bisegment (250-250) drastically lowers the Char-recall. Aligning two glossaries can be seen as a document-structure alignment task rather than a sentence-alignment task.</Paragraph> <Paragraph position="7"> Since the goal of the evaluation was sentence alignment, the TECH corpus results were not taken into account in the final grading of the systems.</Paragraph> <Paragraph position="8"> The overall ranking of the systems (excluding the TECH corpus results) is given in Figure 2, in terms of the Sent and Char F-measures. The LIA system obtains the best average results and shows good stability across texts, which is an important criterion for many applications.</Paragraph> </Section> </Paper>