File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1405_metho.xml
Size: 9,712 bytes
Last Modified: 2025-10-06 14:07:28
<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1405">
<Title>DTD-Driven Bilingual Document Generation ... Arantza Casillas ... Departamento de Automática, Universidad de Alcalá, e-mail: arantza@aut.alcala.es</Title>
<Section position="3" start_page="32" end_page="33" type="metho">
<SectionTitle> 2 DTD abstraction </SectionTitle>
<Paragraph position="0"> SGML mark-up determines the logical structure of a document and its syntax in the form of a context-free grammar. This grammar is called the Document Type Definition (DTD), and it contains specifications for: * Names and content for all elements that are permitted to appear in a document.</Paragraph>
<Paragraph position="1"> * Order in which these elements must appear. * Tag attributes with default values for those elements.</Paragraph>
<Paragraph position="2"> DTDs have been abstracted away from the annotations that were automatically introduced in the corpus. Similar experiments have been reported before in the literature. (Ahonen, 1995) uses a method to build document instances from tagged texts that consists of a deterministic finite automaton for each context model. Subsequently, these automata are generalized and converted into regular expressions, which are easily transcribed into SGML content models. (Shafer, 1995) combines document instances with simplification rules. Our method is similar to Shafer's, but with a modification in the way rules reduce document instances. A tool to obtain a DTD for all document instances has been developed (Casillas, 1999). Given that source and target documents show some syntactic and structural mismatches, two different DTDs are induced, one for each language, and are paired through a correspondence table. Correspondences in this table can be updated or deleted. At present, we have six DTDs, one for each document type in each language (there are three document types; Figure 2 shows a part of one of these DTDs). By means of these paired DTDs, document elements in each language are appropriately placed. In the process of generating the bilingual document, a document type must first be selected. Each document type has an associated DTD. This DTD specifies which elements are obligatory and which are optional.</Paragraph>
<Paragraph position="3"> With the aid of the DTD, the source document is generated. The target document will be generated with the aid of the corresponding target DTD.</Paragraph>
</Section>
<Section position="4" start_page="33" end_page="34" type="metho">
<SectionTitle> 3 Joining TM2 and DTD </SectionTitle>
<Paragraph position="0"> TM2 specifically stores a type of translation segment class, which we have tagged <seg1>, <seg2>... <segn>, <title> and <rs>, and which is relevant to the DTD. Segments tagged <segn> are variable recurrent language patterns that are very frequent in the specialized domain of the corpus and whose occurrence in the text is well established. These <segn> tags include two attributes, id and corresp, which locate the aligned segment both in the corpus and in the database (Figure 1). Segments tagged <rs> are referring expressions which have been recognized, tagged and aligned, and which correspond largely to proper names (Martinez, 1998a), (Martinez, 1998b). TM2 is managed in the form of a relational database where segments are stored as records.
Each record in the database consists of four fields: the segment string, a counter for the occurrences of that string in the corpus, the tag and the attributes (type, id and corresp).</Paragraph>
<Paragraph position="1"> Table 2 shows how the text fragment inside the content of the string field in the database maintains only the initial <segn> and <rs> tags. Furthermore, <rs> tagged segments inside <segn> records are simplified so that their content is dismissed and only the initial tag is kept (Lange et al., 1997). The reason is that they are considered variable elements within the segment (dates and numbers are also elements of this type). The strings Orden Foral of record 2, marked as <rs type=law>, and Sala de lo Contencioso-Administrativo del Tribunal Superior de Justicia del País Vasco of record 3, marked as <rs type=organization>, are thus not included in record 1 <seg9>, since they may differ in other instantiations of the segment. These internal elements are largely proper names that vary from one instantiation of the segment to another. The <rs> tag can be considered to be the name of the varying element. The value of the type attribute, as in <rs type=law>, constrains the kind of referential expression that may be inserted at that point of the translation segment. Table 2 shows that source and target records may not have straight one-to-one correspondences. Although this is by no means the general case (only about 5.61%; Martinez, 1998a), such one-to-N correspondences provide good ground to explain how TM2 is designed. The asymmetry can be easily explained.
The Spanish term recurso contencioso-administrativo has been translated into Basque by means of a category-changing operation, where the Spanish adjective administrativo has been translated as a Basque noun complement, Administrazioarekiko, which literally means &quot;Administration-the-with-of&quot;, triggering its identification as a proper noun.</Paragraph>
<Paragraph position="2"> Table 3 shows the way in which source language units are related to their corresponding target units, which, as can be observed, can be one-to-one or one-to-N. This means that one source element can have more than one translation. TM2 is created in three steps: * First, non-pertinent tags are filtered out from the annotated corpus. Tags marking sentence <s> and paragraph <p> alignment are removed because they are of no interest for TM2 (recall that they are registered in TM1).</Paragraph>
<Paragraph position="3"> * Second, translation segments <segn>, <title> phrases and referential expressions <rs> are detected in the source document and looked up in the database.</Paragraph>
<Paragraph position="4"> * Third, if they are not already present in the database, each is stored in its own database, and the values of the id and corresp attributes are used to set the correspondence between the source and target databases.</Paragraph>
</Section>
<Section position="5" start_page="34" end_page="36" type="metho">
<SectionTitle> 4 Composition Strategy </SectionTitle>
<Paragraph position="0"> Every phase in the process is guided by the markup contained in TM2 and the paired DTDs which control the application of this markup.</Paragraph>
<Paragraph position="1"> The composition process follows two main steps, which correspond to the traditional source document generation and translation into the target document. The markup and the paired DTDs guide the process in the following manner: 1. Before the user starts writing the source document, s/he must select a document type, i.e., a DTD. This has two consequences. On the one hand, the selected DTD produces a source document template that contains the logical structure of the document and some of its contents. On the other hand, the selected source DTD triggers a target paired DTD, which will be used later to translate the document. There are three different types of elements in the source document template: * Some elements are mandatory and are</Paragraph>
<Paragraph position="2"> provided to the user, who must only choose their content among some alternative usages (s/he will get a list of alternatives ordered by frequency, for example for <title>). Other obligatory elements, such as dates and numbers, will also be automatically generated.</Paragraph>
<Paragraph position="3"> * Some other elements in the template are optional (e.g., <seg9>). Again, a list of alternatives will be offered to the user. These optional elements are sensitive to the context (document or division type), and markup is also responsible for constraining the valid options given to the user. Obligatory and optional elements are retrieved from TM2, and make up a considerable part of the source document.</Paragraph>
<Paragraph position="4"> * All documents have an important part of their content which is not determined by the DTD (<div1>). It is the most variable part, and the system lets the writer input text freely.
It is when TM2 has nothing to offer that TM1 and TM3 may provide useful material.</Paragraph>
<Paragraph position="5"> Given the recurrent style of legal documentation, it is quite likely that the user will be using many of the bilingual text choices already aligned and available in TM1 and TM3.</Paragraph>
<Paragraph position="6"> 2. Once the source document has been completed, the system derives its particular logical structure, which, with the aid of the target DTD, is projected into the resulting target logical structure.</Paragraph>
</Section>
</Paper>
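<!--
The TM2 record layout and the <rs> simplification described in Section 3 can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the names Record, TM2, store and simplify_segment are hypothetical, an in-memory dictionary stands in for the relational database, and <rs> elements are assumed to be written as <rs ...>content</rs> without nesting.

```python
import re
from dataclasses import dataclass, field

# Assumption: a referring expression is written <rs ...>content</rs>, not nested.
RS_ELEMENT = re.compile(r"(<rs\b[^>]*>).*?</rs>", re.DOTALL)


def simplify_segment(text: str) -> str:
    """Dismiss the content of each <rs> element and keep only its initial tag."""
    return RS_ELEMENT.sub(r"\1", text)


@dataclass
class Record:
    """One record with the four fields named in the text."""
    string: str        # segment string, with <rs> content dismissed
    tag: str           # e.g. "seg9", "title" or "rs"
    attributes: dict   # e.g. type, id and corresp
    count: int = 0     # occurrences of the string in the corpus


@dataclass
class TM2:
    """Hypothetical in-memory stand-in for one language side of the database."""
    records: dict = field(default_factory=dict)

    def store(self, tag: str, text: str, **attributes) -> Record:
        """Add the simplified segment if it is absent, then bump its counter."""
        key = (tag, simplify_segment(text))
        record = self.records.get(key)
        if record is None:
            record = Record(key[1], tag, dict(attributes))
            self.records[key] = record
        record.count += 1
        return record
```

Under these assumptions, storing two instantiations of the same <seg9> pattern that differ only in their <rs> content yields a single record whose counter is 2, which mirrors the idea that referring expressions are variable slots within a segment.
-->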