<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2134">
  <Title>Bitext Correspondences through Rich Mark-up</Title>
  <Section position="3" start_page="812" end_page="816" type="metho">
    <SectionTitle>
2 Bitext tagging and segmentation
</SectionTitle>
    <Paragraph position="0"> A large bitext has been compiled consisting of a collection of administrative and legal bilingual documents written both in Spanish and Basque, with close to 7 million words in each language.</Paragraph>
    <Paragraph position="1"> For the experiments, we have worked on a representative subset of around 500,000 words in each language. Several stages of automatic tagging, based on pattern matching and heuristics, were undertaken, rendering different descriptive levels: General encoding (paragraph, sentence, quoted text, dates, numbers, abbreviations, etc.).</Paragraph>
    <Paragraph position="2"> * Document specific tags that identify document types and define document internal organisation (sections, divisions, identification code, number and date of issue, issuer, lists, itemised sections, etc.).</Paragraph>
    <Paragraph position="3"> * Proper noun tagging (identification and categorisation of proper nouns into several classes, including: person, place, organisation, law, title, publication and uncategorised). null This collection of tags (shown in Table 1) reflects basic structural and referential features, which appear consistently at both sides of the bitext. Although the alignment of smaller segments (multi-word lexical units and collocations) will require more expressive tagging, such as part-of-speech tagging (POS), for the task of sentence alignment, this is not only unnecessary, but also inappropriate, since it would introduce undesired language dependent information. The encoding scheme has been based on TEI's guidelines for SGML based mark-up (Ide &amp; Veronis 95).</Paragraph>
    <Section position="1" start_page="812" end_page="812" type="sub_section">
      <SectionTitle>
2.1 Proper noun tagging
</SectionTitle>
      <Paragraph position="0"> As for many other text processing applications, proper noun tagging plays a key role in our approach to sentence alignment. It has been reported that proper nouns reach up to 10% of tokens in text (newswire text (Wakao et al.</Paragraph>
      <Paragraph position="1"> 96) and (Coates-Stephens 92)) and one third of noun groups (in the Agence France Presse flow (Wolinski et al. 95)). We have calculated that proper nouns constitute a 15% of the tokens in our corpus. The module for the recognition of proper nouns relies on patterns of typography (capitalisation and punctuation) and on contextual information (Church 88). It also makes use of lists with most common person, organisation, law, publication and place names. The tagger annotates a multi-word chain as a proper noun when each word in the chain is uppercase initial.</Paragraph>
      <Paragraph position="2"> A closed list of functional words (prepositions, conjunctions, determiners, etc.) is allowed to appear inside the proper noun chain, see examples in Table 2. A collection of heuristics discard uppercase initial words in sentence initial position or in other exceptional cases.</Paragraph>
      <Paragraph position="3"> In contrast with other known classifications (e.g. MUC-6 95), we exclude from our list of proper nouns time expressions, percentage expression, and monetary amount expressions (which for us fall under a different descriptive level). However, on top of organisation, person and location names, we include other entities such as legal nomenclature, the name of publications as well as a number of professional titles whose occurrence in the bitext becomes of great value for alignment.</Paragraph>
    </Section>
    <Section position="2" start_page="812" end_page="813" type="sub_section">
      <SectionTitle>
2.2 Bitext asymmetries
</SectionTitle>
      <Paragraph position="0"> Because our approach to alignment relies on consistent tagging, bitext asymmetries of any type need to be carefully dealt with. For example, capitalisation conventions across languages may show great divergences. Although, in theory, this should not be the case between Spanish and Basque, since officially they follow identical conventions for capitalisation (which are by the way the same as in French), in practise these conventions have been interpreted very differently by the writers of the two versions (lawyers in Spanish and translators in Basque). In the Basque version, nouns referring to organisations saila 'Department', professional titles diputatua 'Deputy', as well as many orographic or geographical sites arana 'Valley', are often written in lowercase, while in the Spanish original documents these are normally written in uppercase (see Table 2). These nouns belong to the type described as 'trigger' words by (Wakao et al.</Paragraph>
      <Paragraph position="1"> 96), in the sense that they permit the identification of the tokens surrounding them as proper nouns. Then, it has been required to resort to contextual information. The results of the resolution of these singularities are shown in Table  .</Paragraph>
      <Paragraph position="2"> 3 Using tags as cognates for sentence alignment Algorithms for sentence alignment abound and range from the initial pioneering proposals of (Brown et al. 91), (Gale &amp; Church 91a), (Church 93), or (Kay &amp; Roscheisen 93), to the more recent ones of (Chang &amp; Chen 97), or (Tillmann et al. 97). The techniques employed include statistical machine translation, cognates identification, pattern recognition, and digital signal and image processing. Our algorithm, as (Simard et al. 92), and (Melamed 97) employs cognates to align sentences; and similar to (Brown et al. 91), it also uses mark-up for that purpose. Its singularity does not lie on the use of mark-up as delimiter of text regions (Brown et al. 91) in combination with other techniques, but on the fact that it is the sole foundation for sentence alignment. We call it the 'tags as cognates' algorithm, TasC. This algorithm is not disrupted by word order differences or small asymmetries in non-literal translation, and, unlike other reported algorithms (Melamed 97), it possesses the additional advantage of being portable to any pair of languages without the need to resort to any language-specific heuristics. Provided an adequate and consistent bi-text mark-up, sentence alignment becomes a simple and accurate process also in the case of typologically disparate or orthographically distinct language pairs for which techniques based on lexical cognates may be problematic. One of of proper nouns the best consequences of this approach is that the burden of language dependent processing is dispatched to the monolingual tagging and segmentation phase.</Paragraph>
    </Section>
    <Section position="3" start_page="813" end_page="814" type="sub_section">
      <SectionTitle>
3.1 Similarity calculus between bitexts
</SectionTitle>
      <Paragraph position="0"> The alignment algorithm establishes similarity metrics between candidate sentences which are delimited by corresponding mark-up. Dice's co-efficient is used to calculate these similarity metrics (Dice 45). The coefficient returns a real numeric value in the range 0 to 1. Two sentences which are totally dissimilar in the content of their internal mark-up will return a Dice score of 0, while two identical contents will return a Dice score of 1.</Paragraph>
      <Paragraph position="1"> For two text segments, P and Q, one in each language, the formula for Dice's similarity coefficient will be: Dice(P, Q) -- 2FpQ Fp + FQ where FpQ is the number of identical tags that P and Q have in common, and Fp and FQ are the number of tags contained by each text segment P and Q.</Paragraph>
      <Paragraph position="2"> Since the alignment algorithm determines the best matching on the basis of tag similarity, not only tag names used to categorise different cognate classes (number, date, abbreviation, proper noun, etc.), but also attributes contained by these tags may help identify the cognate itself: &lt;num num=57&gt;57&lt;/num&gt;. Furthermore, attributes  may serve also to subcategorise proper noun tags: &lt;rs type=place&gt;Bilbao&lt;/rs&gt;.</Paragraph>
      <Paragraph position="3"> Such subcategorisations are of great value to calculate the similarity metrics. If mark-up is consistent, the correlation between tags in the candidate text segments will be high and Dice's coefficient will come close to 1. For a randomly created bitext sample of source sentences, Figure 1 illustrates how correct candidate alignments have achieved the highest Dice's coefficients (represented by '*'s), while next higher coefficients (represented by 'o's ) have achieved significant lower values. It must be noted that the latter do not correspond to correct values. The difference mean between Dice's coefficients corresponding to correct alignments and next higher values is:</Paragraph>
      <Paragraph position="5"> Where for a given source sentence i, DCci represents Dice's coefficient corresponding to its correct alignment and DCwi represents the next higher value of Dice's coefficients for the same source sentence i. In all the cases, this difference is greater than 0.2.</Paragraph>
      <Paragraph position="6"> For consistently marked-up bitexts, these results show that sentence alignment founded on the similarity between annotations can be robust criterion.</Paragraph>
      <Paragraph position="7"> Figure 2 illustrates how the Dice's coefficient is calculated between candidate sentences to alignment.</Paragraph>
    </Section>
    <Section position="4" start_page="814" end_page="816" type="sub_section">
      <SectionTitle>
3.2 The strategy of the TasC algorithm
</SectionTitle>
      <Paragraph position="0"> The alignment of text segments can be formalised by the matching problem in bipartite</Paragraph>
      <Paragraph position="2"> graphs. Let G = (V, E, U) be a bipartite graph, such that V and U are two disjoint sets of vertices, and E is a set of edges connecting vertices from V to vertices in U. Each edge in E has associated a cost. Costs are represented by a cost matrix. The problem is to find a perfect matching of G with minimum cost.</Paragraph>
      <Paragraph position="3"> The minimisation version of this problem is well known in the literature as the assignment problem.</Paragraph>
      <Paragraph position="4"> Applying the general definition of the problem to the particular case of sentence alignment: V and U represent two disjoint sets of vertices corresponding to the Spanish and Basque sentences that we wish to align. In this case, each edge has not a cost but a similarity metric quantified by Dice's coefficient. The fact that vertices are materialised by sentences detracts gen- null Aginduaren&lt;/rs&gt; lehen lerroaldea ez dela geri detektatu ondoren beraren argitarapen osoa egitera jo da.&lt;/s&gt; The common tags are: &lt;date date=27/04&gt;, &lt;num num=79&gt;, &lt;rs type=law&gt; The Dice's similarity coefficient will be: Dice(P,Q)= 2x3 / 4+3 = 0.857  ported in the literature. These constraints take into account the order in which sentences in both the source and target texts have been written, and capture the prevailing fact that translators maintain the order of the original text in their translations, which is even a stronger property of specialised texts, By default, a whole document delimits the space in which sentence alignment will take place, although this space can be customised in the algorithm. The average number of sentences per document is approximately 18. Two types of alignment can take place: * 1 to 1 alignment: when one sentence in the source document corresponds to one sentence in the target document (94.39% of the cases).</Paragraph>
      <Paragraph position="5"> * N to M alignment: when N sentences in the source document correspond to M sentences in the target document (only 5.61% of the cases). It includes cases of 1-2, 1-3 and 0-1 alignments.</Paragraph>
      <Paragraph position="6"> Both alignment types are handled by the algorithm. null  The .</Paragraph>
      <Paragraph position="7"> The algorithm TasC algorithm works in two steps: It obtains the similarity matrix S from Dice's coefficients corresponding to candidate alignment options. Each row in S represents the alignment options of a source sentence classified in decreasing order of similarity. In this manner, each column represents a preference position (1 the best alignment option, 2 the second best and so on). Therefore, each Si,j is the identification of one or more target sentences which match the source sentence i in the preference position j. In order to obtain the similarity matrix, it is not necessary to consider all possible alignment options. Constraints regarding sentence ordering and grouping greatly reduce the number of cases to be evaluated by the algorithm. In the algorithm each source sentence xi is compared with candidate target sentences yj as follows: (xi, Yi); (xi, YjYj+I ..., where YjYj+I represents the concatenation of yj with Yj+I. The algorithm module that deals with candidate alignment options can be easily customised to cope with different bitext configurations (since bitexts may range from a very simple one-paragraph text to more complex structures). In the current version of the algorithm seven alignment options are taken into account.</Paragraph>
      <Paragraph position="8"> 2. The TasC algorithm solves an assignment problem with several constraints. It aligns sentences by assigning to each ith source sentence the Si,j target option with minimum j value, that is, the option with more similarity. Furthermore, the algorithm solves the possible conflicts when a sentence matches with other sentences already aligned. The average cost of the algorithm, experimentally contrasted, is linear in the size of the input, although in the worst case the cost is bigger.</Paragraph>
      <Paragraph position="9"> The result of sentence alignment is reflected in the bitext by the incorporation of the attribute 'corresp to sentence tags, as can be seen  in Figure 3. This attribute points to the corresponding sentence identification code in the other language.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="816" end_page="816" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> The current version of the algorithm has been tested against a subcorpus of 500,000 words in each language consisting of 5,988 sentences and has rendered the results shown in Table 4.</Paragraph>
    <Paragraph position="1"> The accuracy of the 1 to 1 alignment is 100%.</Paragraph>
    <Paragraph position="2"> In the N to M case only 1 error occurred out of 314 sentences, which reaches 99.68% accuracy.</Paragraph>
    <Paragraph position="3"> The algorithm to sentence alignment has been designed in such a modular way that it can easily change the tagset used for alignment and the weight of each tag to adapt it to different bitext annotations. The current version of the algorithm uses the tagset shown in Table 1 without weights.</Paragraph>
  </Section>
  <Section position="5" start_page="816" end_page="816" type="metho">
    <SectionTitle>
5 Future work
</SectionTitle>
    <Paragraph position="0"> Once sentences have been aligned, the next step is the alignment of sentence-internal segments. The sentence will delimit the search space for this alignment, and hence, by reducing the search space, the alignment complexity is also reduced.</Paragraph>
    <Section position="1" start_page="816" end_page="816" type="sub_section">
      <SectionTitle>
5.1 Proper noun alignment
</SectionTitle>
      <Paragraph position="0"> Proper nouns are a key factor for the efficient management of the corpus, since they are the basis for the indexation and retrieval of documents in the two versions. For this reason, at present we are concerned with proper noun alignment, something which is not usually done in the mapping of bitexts. The alignment is achieved by resorting to: * The identification of cognate nouns, aided by a set of phonological rules that apply when Spanish terms are taken to produce loan words in Basque.</Paragraph>
      <Paragraph position="1"> * The restriction of cognate search space to previously aligned sentences, and * The application of the TasC algorithm adapted to proper noun alignment.</Paragraph>
    </Section>
    <Section position="2" start_page="816" end_page="816" type="sub_section">
      <SectionTitle>
5.2 Alignment of collocations
</SectionTitle>
      <Paragraph position="0"> The next step is the recognition and alignment of other multi-word lexical units and collocations. Due to the still unstable translation choices of much administrative terminology in Basque, on top of the considerable typological and structural differences between Basque and Spanish, many of the techniques reported in the literature (Smadja et al. 96), (Kupiec 93) and (Eijk 93) cannot be effectively applied. POS tagging combined with recurrent bilingual glossary lookup is the approach we are currently experimenting with.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>