File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2129_metho.xml
Size: 14,061 bytes
Last Modified: 2025-10-06 14:14:14
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2129"> <Title>Automatic Detection of Omissions in Translations</Title> <Section position="5" start_page="764" end_page="764" type="metho"> <SectionTitle> 3 The Basic Method </SectionTitle> <Paragraph position="0"> Omissions in translations give rise to distinctive patterns in bitext maps, as illustrated in Figure 1.</Paragraph> <Paragraph position="1"> The nearly horizontal part of the bitext map in region O takes up almost no part of the vertical axis.</Paragraph> <Paragraph position="3"> A and B correspond to regions a and b, respectively. Region O has no corresponding region on the vertical axis.</Paragraph> <Paragraph position="4"> This region represents a section of the text on the horizontal axis that has no corresponding section in the text on the vertical axis, the very definition of an omission. The slope between the end points of the region is unusually low. An omission in the text on the horizontal axis would manifest itself as a nearly vertical region in the bitext space. These unusual slope conditions are the key to detecting omissions.</Paragraph> <Paragraph position="5"> Given a noise-free bitext map, omissions are easy to detect. First, a bitext space is constructed by placing the original text on the y-axis, and the translation on the x-axis. Second, the known points of correspondence are plotted in the bitext space. Each adjacent pair of points bounds a segment of the bitext map. Any segment whose slope is unusually low is a likely omission. This notion can be made precise by specifying a slope angle threshold t. 
So, third, segments with slope angle a &lt; t are flagged as omitted segments.</Paragraph> </Section> <Section position="6" start_page="764" end_page="765" type="metho"> <SectionTitle> 4 Noise-Free Bitext Maps </SectionTitle> <Paragraph position="0"> The only way to ensure that a bitext map is noise-free is to construct one by hand.</Paragraph> <Paragraph position="1"> Simard et al. (1992) hand-aligned corresponding sentences in two excerpts of the Canadian Hansards (parliamentary debate transcripts available in English and French). For historical reasons, these bitexts are named &quot;easy&quot; and &quot;hard&quot; in the literature. The sentence-based alignments were converted to character-based alignments by noting the corresponding character positions at the end of each pair of aligned sentences. The result was two hand-constructed bitext maps. Several researchers have used these particular bitext maps as a gold standard for evaluating bitext mapping and alignment algorithms (Simard et al., 1992; Church, 1993; Dagan et al., 1993; Melamed, 1996).</Paragraph> <Paragraph position="2"> Surprisingly, ADOMIT found many errors in these hand-aligned bitexts, both in the alignment and in the original translation. ADOMIT processed both halves of both bitexts using slope angle thresholds from 5 deg to 20 deg, in increments of 5 deg. For each run, ADOMIT produced a list of the bitext map's segments whose slope angles were below the specified threshold t. The output for the French half of the &quot;easy&quot; bitext, with t = 15 deg, consisted of the following 10 items: 29175) to (26917, 29176) 45647) to (42179, 45648) 47794) to (44236, 47795) Each ordered pair is a co-ordinate in the bitext space; each pair of co-ordinates delimits one omitted segment. Examination of these 10 pairs of character 
ranges in the bitext revealed that * 4 omitted segments pointed to omissions in the original translation, * 4 omitted segments pointed to alignment errors, * 1 omitted segment pointed to an omission which apparently caused an alignment error (i.e. the segment contained one of each), * 1 omitted segment pointed to a piece of text that was accidentally repeated in the original, but only translated once.</Paragraph> <Paragraph position="3"> With t = 10 deg, 9 of the 10 segments in the list still came up; 8 out of 10 remained with t = 5 deg. Similar errors were discovered in the other half of the &quot;easy&quot; bitext, and in the &quot;hard&quot; bitext, including one omission of more than 450 characters. Other segments appeared in the list for t > 15 deg. None of the other segments were outright omissions or misalignments. However, all of them corresponded to non-literal translations or paraphrases. For instance, with t = 20 deg, ADOMIT discovered an instance of &quot;Why is the government doing this?&quot; translated as &quot;Pourquoi?&quot;. The hand-aligned bitexts were also used to measure ADOMIT's recall. The human aligners marked omissions in the original translation by 1-0 alignments (Gale &amp; Church, 1991; Isabelle, 1995). ADOMIT did not use this information; the algorithm has no notion of a line of text. However, a simple cross-check showed that ADOMIT found all of the omissions. The README file distributed with the bitexts admitted that the &quot;human aligners weren't infallible&quot; and predicted &quot;probably no more than five or so&quot; alignment errors. ADOMIT corroborated this prediction by finding exactly five alignment errors. 
ADOMIT's recall on both kinds of errors implies that when the ten troublesome segments were hand-corrected in the &quot;easy&quot; bitext, the result was very likely the world's first noise-free bitext map.</Paragraph> </Section> <Section position="7" start_page="765" end_page="765" type="metho"> <SectionTitle> 5 A Translators' Tool </SectionTitle> <Paragraph position="0"> As any translator knows, many omissions are intentional. Translations are seldom word for word. Metaphors and idioms usually cannot be translated literally; so, paraphrasing is common. Sometimes, a paraphrased translation is much shorter or much longer than the original. Segments of the bitext map that represent such translations will have slope characteristics similar to omissions, even though the translations may be perfectly valid. These cases are termed intended omissions to distinguish them from omission errors. To be useful, the omission detection algorithm must be able to tell the difference between intended and unintended omissions.</Paragraph> <Paragraph position="1"> Fortunately, the two kinds of omissions have very different length distributions. Intended omissions are seldom longer than a few words, whereas accidental omissions are often on the order of a sentence or more. So, an easy automatic way to separate the accidental omissions from the intended omissions is to sort all the omitted segments from longest to shortest. The longer accidental omissions will float to the top of the sorted list.</Paragraph> <Paragraph position="2"> Translators can search for omissions after they finish a translation, just as other writers run spelling checkers after they finish writing. A translator who wants to correct omission errors can find them by scanning the sorted list of omitted segments from the top, and examining the relevant regions of the bitext. Each time the list points to an accidental omission, the translator can make the appropriate correction in the translation. 
If the translation is reasonably complete, the accidental omissions will quickly stop appearing in the list and the correction process can stop. Only the smallest errors of omission will remain.</Paragraph> </Section> <Section position="8" start_page="765" end_page="766" type="metho"> <SectionTitle> 6 The Problem of Noisy Maps </SectionTitle> <Paragraph position="0"> The results of Experiment 1 demonstrate ADOMIT's potential. However, such stellar performance is only possible with a nearly perfect bitext map. Such bitext maps rarely exist outside the laboratory; today's best automatic methods for finding bitext maps are far from perfect (Church, 1993; Dagan et al., 1993; Melamed, 1996). At least two kinds of map errors can interfere with omission detection. One kind results in spurious omitted segments, while the other hides real omissions.</Paragraph> <Paragraph position="1"> Figure 2 shows how erroneous points in a bitext map can be indistinguishable from omitted segments.</Paragraph> <Paragraph position="3"> A real omission could result in the same map pattern as these erroneous points.</Paragraph> <Paragraph position="4"> When such errors occur in the map, ADOMIT cannot help but announce an omission where there isn't one. This kind of map error is the main obstacle to the algorithm's precision.</Paragraph> <Paragraph position="5"> The other kind of map error is the main obstacle to the algorithm's recall. A typical manifestation is illustrated in Figure 1. The map points in Region O contradict the injective property of bitext maps. Most of the points in Region O are probably noise, because they map many positions on the x-axis to just a few positions on the y-axis. Such spurious points break up large omitted segments into sequences of small ones. 
When the omitted segments are sorted by length for presentation to the translator, the fragmented omitted segments will sink to the bottom of the list along with segments that correspond to small intended omissions. The translator is likely to stop scanning the sorted list of omissions before reaching them.</Paragraph> </Section> <Section position="9" start_page="766" end_page="766" type="metho"> <SectionTitle> 7 ADOMIT </SectionTitle> <Paragraph position="0"> ADOMIT alleviates the fragmentation problem by finding and ignoring extraneous map points.</Paragraph> <Paragraph position="1"> A couple of definitions help to explain the technique. Recall that omitted segments are defined with respect to a chosen slope angle threshold t: Any segment of the bitext map with slope angle less than t is an omitted segment. An omitted segment that contains extraneous points can be characterized as a sequence of minimal omitted segments, interspersed with one or more interfering segments. A minimal omitted segment is an omitted segment between two adjacent points in the bitext map. A maximal omitted segment is an omitted segment that is not a proper subsegment of another omitted segment. Interfering segments are subsegments of maximal omitted segments with a slope angle above the chosen threshold. Interfering segments are always delimited by extraneous map points. If it were not for interfering segments, the fragmentation problem could be solved by simply concatenating adjacent minimal omitted segments. Using these definitions, the problem of reconstructing maximal omitted segments can be stated as follows: Which sequences of minimal omitted segments</Paragraph> <Paragraph position="2"> resulted from fragmentation of a maximal omitted segment? A maximal omitted segment must have a slope angle below the chosen threshold t. 
So, the problem can be solved by considering each pair of minimal omitted segments, to see if the slope angle between the starting point of the first and the end point of the second is less than t. This brute force solution requires approximately (1/2)n^2 comparisons. Since a large bitext may have tens of thousands of minimal omitted segments, a faster method is desirable.</Paragraph> <Paragraph position="3"> Theorem 1 suggests a fast algorithm to search for pairs of minimal omitted segments that are farthest apart, and that may have resulted from fragmentation of a maximal omitted segment. The theorem is illustrated in Figure 3. B and T are mnemonics for &quot;bottom&quot; and &quot;top.&quot; Theorem 1: Let A be the array of all minimal omitted segments, sorted by the horizontal position of the left end point. Let B be a line in the bitext space, whose slope equals the slope of the main diagonal, such that all the segments in A lie above B. Let s be the left end point of a segment in A. Let T be a ray starting at s with a slope angle equal to the chosen threshold t. Let i be the intersection of B and T. Let b be the point on B with the same horizontal position as s. Now, a maximal omitted segment starting at s must end at some point e in the triangle △sib.</Paragraph> <Paragraph position="4"> Proof Sketch: s is defined as the left end point, so e must be to the right of s. By definition of B, e must be above B. If e were above T, then the slope angle of segment se would be greater than the slope angle of T = t, so se could not be an omitted segment. □</Paragraph> <Paragraph position="6"> segments. The array of minimal omitted segments lies above line B. Any sequence of segments starting at s, such that the slope angle of the whole sequence is less than t, must end at some point e in the triangle △sib. ADOMIT exploits Theorem 1 as follows. 
Each minimal omitted segment z in A is considered in turn.</Paragraph> <Paragraph position="7"> Starting at z, ADOMIT searches the array A for the last (i.e. rightmost) segment whose right end point e is in the triangle △sib. Usually, this segment will be z itself, in which case the single minimal omitted segment is deemed a maximal omitted segment. When e is not on the same minimal omitted segment as s, ADOMIT concatenates all the segments between s and e to form a maximal omitted segment. The search starting from segment z can stop as soon as it encounters a segment with a right end point higher than i. For useful values of t, each search will span only a handful of candidate end points. Processing the entire array A in this manner produces the desired set of maximal omitted segments very quickly.</Paragraph> </Section> </Paper>
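<!--
Editorial sketch, not part of the original paper. The Python below illustrates the procedure described in Sections 3, 5 and 7 under stated assumptions: a bitext map is a list of (x, y) character positions with the translation on the x-axis and the original on the y-axis; segment length is measured along the x-axis; merging uses the plain pairwise slope test and leaves out the Theorem 1 cutoff, so it is quadratic in the worst case. The names slope_angle, minimal_omitted, maximal_omitted and demo_points are hypothetical and do not come from ADOMIT.

```python
import math


def slope_angle(p, q):
    """Slope angle, in degrees, of the segment from map point p to map point q."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def minimal_omitted(points, t):
    """Segments between adjacent map points whose slope angle is below t degrees."""
    pts = sorted(points)  # assumption: ordering by x position is the map order
    return [(p, q) for p, q in zip(pts, pts[1:]) if slope_angle(p, q) < t]


def maximal_omitted(points, t):
    """Merge fragmented omitted segments, then sort them longest first.

    From each minimal omitted segment, the search looks for the rightmost
    later minimal segment whose end point keeps the overall slope angle
    below t, and concatenates everything in between.
    """
    mins = minimal_omitted(points, t)
    merged = []
    k = 0
    while k < len(mins):
        start = mins[k][0]
        last = k
        for j in range(k + 1, len(mins)):
            if slope_angle(start, mins[j][1]) < t:
                last = j
        merged.append((start, mins[last][1]))
        k = last + 1
    # Longest first, so likely accidental omissions float to the top.
    return sorted(merged, key=lambda seg: seg[1][0] - seg[0][0], reverse=True)


# Made-up map: two flat stretches separated by one steep interfering segment.
demo_points = [(0, 0), (100, 100), (200, 105), (210, 140), (300, 145), (400, 250)]
print(maximal_omitted(demo_points, 15))
```

With a threshold of 15 degrees the two flat stretches in demo_points are joined across the interfering segment; with 10 degrees they stay separate, because the angle of the joined span exceeds the threshold.
-->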