<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1055">
  <Title>Using Confidence Bands for Parallel Texts Alignment</Title>
  <Section position="3" start_page="0" end_page="4" type="intro">
    <SectionTitle>
2 Correspondence Points Filters
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Overview
</SectionTitle>
      <Paragraph position="0"> The basic insight is that not all candidate correspondence points are reliable. Whatever heuristics are used (similar word distributions, search corridors, point dispersion, angle deviation, ...), we want to keep only the most reliable points. We assume that reliable points have similar characteristics. For instance, they tend to gather near the 'golden translation diagonal'.</Paragraph>
      <Paragraph position="1"> Homographs with equal frequencies may be good alignment points.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="4" type="sub_section">
      <SectionTitle>
2.2 Source Parallel Texts
</SectionTitle>
      <Paragraph position="0"> We worked with a mixed parallel corpus consisting of texts selected at random from the Official Journal of the European Communities.</Paragraph>
      <Paragraph position="1"> For each language, we included:
* five texts with Written Questions asked by members of the European Parliament to the European Commission and their corresponding answers (average: about 60k words or 100 pages / text);
* five texts with records of Debates in the European Parliament (average: about 400k words or more than 600 pages / text). These are written transcripts of oral discussions;
* five texts with judgements of the Court of Justice of the European Communities (average: about 3k words or 5 pages / text).</Paragraph>
      <Paragraph position="2"> Footnotes: (i) The same languages as those in footnote 1 plus Finnish (fi) and Swedish (sv). (ii) No Written Questions and Debates texts for Finnish and Swedish are available in ELRA (1997), since the texts provided are from the 1992-4 period and it was not until 1995 that the respective countries became part of the European Union.</Paragraph>
      <Paragraph position="3"> In order to reduce the number of possible pairs of parallel texts from 110 sets (11 languages × 10) to a more manageable 10 sets, we decided to take Portuguese as the kernel language of all pairs.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.3 Generating Candidate Correspondence Points
</SectionTitle>
      <Paragraph position="0"> We generate candidate correspondence points from homographs with equal frequencies in two parallel texts. Homographs, as a naive and particular form of cognate words, are likely translations (e.g. Hong Kong in various European languages). Here is a table with the percentages of occurrences of these words in the texts used: [Table: ... equal frequencies per pair of parallel texts (average percentage of homographs inside brackets).] For average size texts (e.g. the Written Questions), these words account for about 5% of the total (about 3k words / text). This number varies according to language similarity. For instance, on average, it is higher for Portuguese-Spanish than for Portuguese-English.</Paragraph>
      <Paragraph position="1"> These words end up being mainly numbers and names. Here are a few examples from a parallel Portuguese-English text: 2002 (numbers, dates), ASEAN (acronyms), Patten (proper names), China (countries), Manila (cities), apartheid (foreign words), Ltd (abbreviations), habitats (Latin words), ferry (common nouns), global (common vocabulary).</Paragraph>
      <Paragraph position="2"> In order to avoid pairing homographs that are not equivalent (e.g. 'a', a definite article in Portuguese and an indefinite article in English), we restricted ourselves to homographs with the same frequencies in both parallel texts. In this way, we select words with similar distributions. In fact, equal-frequency words helped Jean-François Champollion to decipher the Rosetta Stone, because the name of a king (Ptolemy V) occurred the same number of times in the 'parallel texts' of the stone.</Paragraph>
      <Paragraph position="3"> Each pair of texts provides a set of candidate correspondence points, to which we fit a line by linear regression. Points are defined using the co-ordinates of the word positions in each parallel text. For example, if the first occurrence of the homograph Patten is at word position 125545 in the Portuguese text and at 135787 in the English parallel text, then the point co-ordinates are (125545, 135787). The generated points may fit a linear regression line well or may be dispersed around it. So, first, we use a simple filter based on the histogram of the distances between the expected and real positions. After that, we apply a finer-grained filter based on statistically defined confidence bands for linear regression lines.</Paragraph>
      <Paragraph position="4"> We will now elaborate on these filters.</Paragraph>
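      <Paragraph position="5"> Before turning to the filters, the point generation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (word_positions, candidate_points, fit_line) are hypothetical, word positions are counted from 0, and pairing the k-th occurrence in one text with the k-th occurrence in the other is an assumption, since the text only spells out the first occurrence.

```python
from collections import defaultdict


def word_positions(tokens):
    """Map each word to the list of positions where it occurs (0-based)."""
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index)
    return positions


def candidate_points(source_tokens, target_tokens):
    """Build (source position, target position) points from homographs
    that have the same frequency in both texts.

    Assumption: the k-th occurrence in the source text is paired with the
    k-th occurrence in the target text.
    """
    source = word_positions(source_tokens)
    target = word_positions(target_tokens)
    points = []
    for word, source_pos in source.items():
        target_pos = target.get(word)
        # Keep only homographs with equal frequencies in both texts.
        if target_pos is not None and len(source_pos) == len(target_pos):
            points.extend(zip(source_pos, target_pos))
    return sorted(points)


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x
```

For instance, with the toy token lists "a China e a ASEAN em 2002" and "China and the ASEAN in 2002 a", the word a is discarded because it occurs twice in one text and once in the other, and the remaining homographs yield the points (1, 0), (4, 3) and (6, 5).</Paragraph>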
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.4 Eliminating Extreme Points
</SectionTitle>
      <Paragraph position="0"> The points obtained from the positions of homographs with equal frequencies are still prone to be noisy. Here is an example: [Figure 1: ... line') candidate correspondence points. The linear regression line equation is shown in the top right corner.]</Paragraph>
      <Paragraph position="1"> The figure above shows noisy points: their respective homographs appear at positions that are far apart. We should be reluctant to accept distant pairings, and that is what the first filter does. It filters out those points which are clearly too far from their expected positions to be considered reliable correspondence points. We find expected positions by building a linear regression line with all points and then determining the distances between the real and the expected word positions. Expected positions are computed from the linear regression line equation y = ax + b, where a is the line slope and b is the Y-axis intercept (the value of y when x is 0), substituting the Portuguese word position for x. In Table 3, the expected word position for the word I at pt word position 3877 is 0.9165 × 3877 + 141.65 ≈ 3695 (see the regression line equation in Figure 1) and, thus, the distance between its expected and real positions is |3695 - 24998| = 21303.</Paragraph>
      <Paragraph position="2"> If we draw a histogram ranging from the smallest to the largest distance, we get: [Figure 2: ... expected and real word positions.]</Paragraph>
      <Paragraph position="3"> In order to build this histogram, we use the Sturges rule (see 'Histograms' in Samuel Kotz et al. 1982). The number of classes (bars or bins) is given by 1 + log₂ n, where n is the total number of points. The size of the classes is given by (maximum distance - minimum distance) / number of classes. For example, for Figure 1, we have 3338 points and the distances between expected and real positions range from 0 to 35997. Thus, the number of classes is 1 + log₂ 3338 ≈ 12.7 → 13 and the size of the classes is (35997 - 0) / 13 ≈ 2769. In this way, the first class ranges from 0 to 2769, the second class from 2769 to 5538 and so forth.</Paragraph>
      <Paragraph position="4"> With this histogram, we are able to identify those words which are too far from their expected positions. In Figure 2, the gap in the histogram makes clear that there is a discontinuity in the distances between expected and real positions. So, we are confident that all points whose distance exceeds 22152 are extreme points. We filter them out of the candidate correspondence points set and proceed to the next filter.</Paragraph>
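      <Paragraph position="5"> The steps of this first filter can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, rounding the Sturges value up is an assumption (the text only shows 12.7 becoming 13), and taking the lower edge of the first empty class as the cut-off is one possible reading of 'the gap in the histogram'.

```python
import math


def sturges_classes(n):
    """Number of histogram classes by the Sturges rule, 1 + log2(n).

    Assumption: the value is rounded up to the next integer.
    """
    return math.ceil(1 + math.log2(n))


def distances_to_line(points, a, b):
    """Distance between the real position y and the expected position a*x + b."""
    return [abs(y - (a * x + b)) for x, y in points]


def gap_threshold(distances):
    """Lower edge of the first empty histogram class, or None if no class is empty."""
    classes = sturges_classes(len(distances))
    low, high = min(distances), max(distances)
    size = (high - low) / classes
    if size == 0:
        return None  # all distances are equal, so there is no gap
    counts = [0] * classes
    for d in distances:
        # The largest distance falls on the upper edge; put it in the last class.
        index = min(int((d - low) / size), classes - 1)
        counts[index] += 1
    for index, count in enumerate(counts):
        if count == 0:
            return low + index * size
    return None


def drop_extreme_points(points, a, b):
    """Remove points whose distance to the line lies beyond the histogram gap."""
    distances = distances_to_line(points, a, b)
    threshold = gap_threshold(distances)
    if threshold is None:
        return list(points)
    return [p for p, d in zip(points, distances) if threshold >= d]
```

With the figures quoted above, sturges_classes(3338) gives 13 classes, and distances_to_line([(3877, 24998)], 0.9165, 141.65) gives a distance of about 21303 for the word I.</Paragraph>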
    </Section>
    <Section position="5" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.5 Confidence Bands of Linear Regression Lines
</SectionTitle>
      <Paragraph position="0"> Confidence bands of linear regression lines (Thomas Wonnacott and Ronald Wonnacott, 1990, p. 384) help us to identify reliable points, i.e. points which belong to a regression line with a high confidence level (99.9%). The band is typically wider at the extremes and narrower in the middle of the regression line.</Paragraph>
      <Paragraph position="1"> The figure below shows an example of filtering using confidence bands: [Figure 3: ... confidence bands. Point A lies outside the confidence band. It will be filtered out.]</Paragraph>
      <Paragraph position="2"> We start from the regression line defined by the points that passed the histogram filter described in the previous section, and then we calculate the confidence band. Points which lie outside this band are filtered out since they are considered too unreliable for alignment (e.g. Point A in Figure 3). We repeat this step until no pieces of text belong to different translations, i.e. until there is no misalignment.</Paragraph>
      <Paragraph position="3"> The confidence band is the error admitted at an x co-ordinate of a linear regression line. A point (x,y) is considered outside a linear regression line with a confidence level of 99.9% if its y co-ordinate does not lie within the confidence interval [ ax + b - error(x); ax + b + error(x)], where ax + b is the linear regression line equation and error(x) is the error admitted at the x co-ordinate. The upper and lower limits of the confidence interval are given by the following equation (see Thomas Wonnacott &amp; Ronald Wonnacott, 1990, p. 385):</Paragraph>
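      <Paragraph position="4"> As a hedged illustration of how such a band can be computed, the sketch below assumes the usual textbook form of the confidence band for a regression line, error(x) = c · s · sqrt(1/n + (x - mean_x)² / Σ(x_i - mean_x)²), where s is the residual standard error with n - 2 degrees of freedom and c is the critical value for the chosen confidence level. This form is an assumption and may differ in detail from the equation the authors cite. The function names are hypothetical, and the critical value is taken from the normal distribution instead of Student's t, which is only reasonable for large numbers of points.

```python
import math
from statistics import NormalDist


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def critical_value(confidence):
    """Two-sided critical value. Assumption: normal approximation to Student's t."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def band_half_width(points, a, b, confidence=0.999):
    """Return a function error(x): the half-width of the band at co-ordinate x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    residual_ss = sum((y - (a * x + b)) ** 2 for x, y in points)
    s = math.sqrt(residual_ss / (n - 2))  # residual standard error
    c = critical_value(confidence)

    def error(x):
        return c * s * math.sqrt(1 / n + (x - mean_x) ** 2 / sxx)

    return error


def filter_with_band(points, confidence=0.999):
    """One filtering pass: keep points with y inside [ax + b - error(x); ax + b + error(x)]."""
    a, b = fit_line(points)
    error = band_half_width(points, a, b, confidence)
    return [(x, y) for x, y in points if error(x) >= abs(y - (a * x + b))]
```

Because error(x) grows with the distance of x from the mean x co-ordinate, the band computed this way is narrowest in the middle of the line and widest at its extremes. Only a single pass is shown; the stopping criterion for repeating the step is not modelled here.</Paragraph>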
    </Section>
  </Section>
</Paper>