<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1405"> <Title>DTD-Driven Bilingual Document Generation ... Arantza Casillas ... Departamento de Automática, Universidad de Alcalá, e-mail: arantza@aut.alcala.es</Title> <Section position="2" start_page="0" end_page="32" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper discusses an approach to the architecture of an experimental interactive editing tool that integrates the processes of source document composition and translation into the target language. The tool has been conceived as an optimal solution for a particular case of bilingual production of legal documentation, but it also illustrates in a more general way how to exploit the possibilities of SGML (ISO8879, 1986) when it is used extensively to annotate a whole range of linguistic and extralinguistic information in specialized bilingual corpora.</Paragraph> <Paragraph position="1"> SGML is well established as the coding scheme underlying most Translation Memory-based systems (TMBS), and has been proposed as the coding scheme for the interchange of existing Translation Memory databases, Translation Memories eXchange, TMX (Melby, 1998). The advantages of SGML have also been perceived by a large community of corpus linguistics researchers, and considerable effort has gone into developing suitable markup options to encode a variety of textual types and functions, as clearly demonstrated by the Text Encoding Initiative, TEI (Burnard and Sperberg-McQueen, 1995). While the tag-sets employed by TMBS are simple and task-oriented, TEI offers a highly complex and versatile collection of tags. The guiding hypothesis in our experiment has been that it is possible to explore TEI/SGML markup in order to develop a system that carries the concept of Translation Memory one step further. One important feature of SGML is the DTD. 
DTDs determine the logical structure of documents and how to tag them accordingly. We have concentrated on the accurate description of documents by means of TEI-conformant SGML markup. The markup helps disclose the underlying logical structure of documents. From annotated documentation, DTDs can be induced, and these DTDs provide the basic scheme for producing new documents. We have collected a corpus of official publications from three main institutions in the Basque Autonomous Region in Spain: the Boletín Oficial de Bizkaia (BOB, 1990-1995), the Boletín Oficial de Álava (BOA, 1990-1994) and the Boletín Oficial del País Vasco (BOPV, 1995).</Paragraph> <Paragraph position="2"> Documents in the corpus were composed by Administration clerks and translated by translators. Both clerks and translators have been using a wide variety of word-processors, although since 1994 MSWord has been generalized as the standard editing tool. Administrative documentation shows a regular structure, and is rich in recurrent textual patterns. For each document type, different document tokens share a common global distribution of elements. Official document composers learn these global structures and apply them consistently. It is also the case that composers tend to reuse old document files when producing new documents of the same type. Despite the fact that no SGML software was used at the editing phase, texts in the corpus show regular logical structures and a consistent distribution of text segments. 
Our main goal in tagging the corpus was to make them all explicit (Martinez, 1997). The most common type of document in the corpus, the Orden Foral, was chosen (see Table 1). We analysed some 100 tokens and hand-marked the most salient elements. The heuristics to identify these elements were later expressed in a collection of recognition routines in Perl and tested against a set of 400 tokens, including the initial 100. As a result of this process of automatic tagging of structural elements we produced a TEI/SGML tagged corpus with as yet no corresponding overt DTD. In Section 2 we will explain how DTDs were later induced from the tagged corpus. [Table 1: Types of documents in the corpus.]</Paragraph> <Paragraph position="3"> Once the corpus was segmented, the next step was to align it. This was conducted at different levels: general document elements (DIV, SEG, P), as well as sentential and intra-sentential elements, such as S, RS, NUM, DATE, etc. (Martinez, 1998b). Aligned in this way, the corpus becomes an important resource for translation.</Paragraph> <Paragraph position="4"> Four complementary language databases may be obtained at any time from the annotated corpus: three translation memory databases (TM1, TM2, and TM3) as well as a terminology database (termbase). The three TMs differ in the nature of the translation units they contain.</Paragraph> <Paragraph position="5"> TM1 consists of aligned sentences that can feed commercial TM software. TM2 contains elements which are translation segments ranging from whole sections of a document or multi-sentence paragraphs to smaller units, such as short phrases or proper names. TM3 simply hosts the whole collection of aligned bilingual documents, where the whole document may be considered the translation unit. TM3 can be construed as a bilingual document-database. Much redundancy originates from this TM collection, although it should be noted that they are all by-products derived from the same annotated bitext, which subsumes them all. Good software packages for TM1 and TM3 already exist in the market, and hence their exploitation is beyond our interest (Trados Translator's Workbench, Star's Transit, SDLX, Déjà Vu, IBM's browsing tool for TM3). 
The originality of our editing tool lies in a design that combines the potential of DTDs with the elements in TM2, as will be shown in Sections 4 and 5.</Paragraph> </Section> </Paper>