<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2705"> <Title>Multi-dimensional Annotation and Alignment in an English-German Translation Corpus</Title> <Section position="3" start_page="35" end_page="36" type="metho"> <SectionTitle> 2 CroCo XML </SectionTitle> <Paragraph position="0"> The annotation in CroCo extends to different levels in order to cover possible linguistic evidence on each level. Thus, each kind of annotation (part-of-speech, morphology, phrase structure, grammatical functions) is realised in a separate layer. An additional layer contains comprehensive meta-information in separate header files, one for each text in the corpus.</Paragraph> <Paragraph position="1"> The file containing the indexed tokens (see section 2.1) includes an xlink attribute referring to this header file, as depicted in Figure 2.1. The metadata are based on the TEI guidelines and include register information. The complex multi-lingual structure of the corpus in combination with the multi-layer annotation requires indexing the corpus. The indexing is carried out on the basis of the tokenised corpus. Index and annotation layers are kept separate using XML stand-off mark-up. The mark-up builds on XCES. The differing output formats of the annotation and alignment tools are converted with Perl scripts. Each annotation and alignment unit is indexed.
The respective annotations and alignments are linked to the indexed units via XPointers. (XCES: http://www.xml-ces.org)</Paragraph> <Paragraph position="2"> The following sections describe the different annotation layers, exemplified by the German original sentence in (1) and its English translation in (2).</Paragraph> <Paragraph position="3"> (1) Ich spielte viele Möglichkeiten durch, stellte mir den Täter in verschiedenen Posen vor, ich und die Pistole, ich und die Giftflasche, ich und der Knüppel, ich und das Messer.</Paragraph> <Paragraph position="4"> (2) I ran through numerous possibilities, pictured the perpetrator in various poses, me with the gun, me with the bottle of poison, me with the bludgeon, me with the knife.</Paragraph> <Section position="1" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 2.1 Tokenisation and indexing </SectionTitle> <Paragraph position="0"> The first layer to be presented here is the tokenisation layer. Tokenisation is performed in CroCo for both German and English by TnT (Brants 2000), a statistical part-of-speech tagger. As shown in Figure 2.1, each token annotated with the attribute strg also has an id attribute, which indicates the position of the word in the text.</Paragraph> <Paragraph position="1"> This id, which starts with a &quot;t&quot;, is the anchor for all XPointers pointing to the tokenisation file. The file is identified by the name attribute. The xml:lang attribute indicates the language of the file; the docType attribute indicates whether the file is an original or a translation.</Paragraph> <Paragraph position="2"> Similar index files necessary for the alignment of the respective levels are created for the units chunk, clause and sentence.
These units stand in a hierarchical relation, with sentences consisting of clauses, clauses consisting of chunks, etc.</Paragraph> <Paragraph position="3"> All examples are taken from the CroCo Corpus.</Paragraph> </Section> <Section position="2" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.2 Part-of-speech tagging </SectionTitle> <Paragraph position="0"> The second layer annotated for both languages is the part-of-speech layer, which is again provided by TnT. The token annotation of the part-of-speech layer starts with the xml:base attribute, which indicates the index file it refers to. The part-of-speech information for each token is annotated in the pos attribute, as shown in Figure 2.2. The attribute strg in the token index file and pos in the tag annotation are linked by an xlink attribute pointing to the id attribute in the index file. For example, the German token pointing to &quot;t65&quot; in the token index file, whose strg value is spielte, is a finite verb (with the PoS tag vvfin).</Paragraph> </Section> <Section position="3" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.3 Morphological annotation </SectionTitle> <Paragraph position="0"> Morphological information is particularly relevant for German because this language carries much syntactic information within morphemes rather than in separate function words, as English does. Morphology is annotated in CroCo with MPro, a rule-based morphology tool (Maas 1998). This tool works on both languages.</Paragraph> <Paragraph position="1"> As shown in Figure 2.3, each token has morphological attributes such as person, case, gender, number and lemma. As before, the xlink attribute refers back to the index file, thus providing the connection between the morphological attributes and the strg information in the index file.
For the morphological annotation of the German token &quot;t65&quot; in Figure 2.3, the strg value is determined by following the XPointer &quot;t65&quot; to the token index file, i.e. spielte. The pos value is retrieved by searching in the tag annotation for the file with the same xml:base value. The matching tag, in this case vvfin, carries the same</Paragraph> <Paragraph position="2"> For German we use the STTS tag set (Schiller et al. 1999), and for English the Susanne tag set (Sampson 1995).</Paragraph> <Paragraph position="4"> &lt;token strg=&quot;Ich&quot; per=&quot;1&quot; case=&quot;nom&quot; nb=&quot;sg&quot; gender=&quot;f;m&quot; lemma=&quot;ich&quot; lb=&quot;ich&quot; xlink:href=&quot;#t64&quot;/&gt; &lt;token strg=&quot;spielte&quot; vtype=&quot;fiv&quot; tns=&quot;past&quot; per=&quot;3&quot; nb=&quot;sg&quot; lemma=&quot;spielen&quot; lb=&quot;spielen&quot; comp=&quot;spielen&quot; xlink:href=&quot;#t65&quot;/&gt; &lt;token strg=&quot;viele&quot; case=&quot;nom;acc&quot; nb=&quot;plu&quot; gender=&quot;f&quot; lemma=&quot;viel&quot; lb=&quot;viel&quot; comp=&quot;viel&quot; deg=&quot;base&quot; xlink:href=&quot;#t66&quot;/&gt; &lt;token strg=&quot;Möglichkeiten&quot; case=&quot;nom;acc&quot; nb=&quot;plu&quot; gender=&quot;f&quot; lemma=&quot;möglichkeit&quot; lb=&quot;möglich&quot; comp=</Paragraph> </Section> <Section position="4" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.4 Phrase chunking and annotation of grammatical functions </SectionTitle> <Paragraph position="0"> Moving up from the token unit to the chunk unit, we first have to index these units before we can annotate them. The chunk index file assigns an id attribute to each chunk within the file. The problem of discontinuous phrase chunks is solved by listing child tags that refer, via xlink attributes, to the individual tokens which make up the chunk.
Figure 2.4 shows that the VP &quot;ch14&quot; in the German phrase annotation consists of The phrase structure annotation (see Figure 2.5) assigns the ps attribute to each phrase chunk identified by MPro. XPointers link the phrase structure annotation to the chunk index file. It should be noted that in CroCo the phrase structure analysis is limited to higher chunk nodes, as our focus within this layer is more on complete phrase chunks and their grammatical functions.</Paragraph> <Paragraph position="1"> The annotation of grammatical functions is again kept in a separate file (see Figure 2.6). Only the highest phrase nodes are annotated for their grammatical function with the attribute gf. The XPointer links the annotation of each function to the chunk id in the chunk index file. From this file, in turn, the string can be retrieved in the token annotation. For example, the English chunk &quot;ch13&quot; carries the grammatical function of direct object (DOBJ). It is identified as an NP in the phrase structure annotation by comparing the xml:base attribute value of the two files and the</Paragraph> </Section> <Section position="5" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.5 Alignment </SectionTitle> <Paragraph position="0"> In the examples shown so far, the different annotation layers linked to each other all belonged to the same language. By aligning words, grammatical functions, clauses and sentences, the connection between original and translated text is made visible. The use of this multi-layer alignment will become clearer from the discussion of a sample query in section 3.</Paragraph> <Paragraph position="1"> For the purpose of the CroCo project, word alignment is realised with GIZA++ (Och &amp; Ney 2003), a statistical alignment tool. Chunks and clauses are aligned manually with the help of MMAX II (Müller &amp; Strube 2003), a tool that allows users to assign their own categories and to link units.
Finally, sentences are aligned using Win-Align, an alignment tool within the Translator's Workbench by Trados (Heyn 1996).</Paragraph> <Paragraph position="2"> The alignment procedure produces four new layers. It follows the XCES standard. Figure 2.7 shows the chunk alignment of (1) and (2). In this layer, we align on the basis of grammatical functions instead of phrases, since this annotation includes information on the phrase chunking as well as on the semantic relations of the chunks.</Paragraph> <Paragraph position="3"> The grammatical functions are mapped onto each other cross-linguistically and then aligned according to our annotation and alignment scheme.</Paragraph> <Paragraph position="4"> The trans.loc attribute locates the chunk index file for each of the aligned texts in turn. In addition, the respective language is given, as is the n attribute, which organises the order of the aligned texts.</Paragraph> <Paragraph position="5"> We thus have an alignment tag for each language in each chunk pointing to the chunk index file.</Paragraph> <Paragraph position="6"> As can be seen from Figure 2.7, chunks which do not have a matching equivalent receive the value &quot;#undefined&quot;, a phenomenon that is of interest for the linguistic interpretation based on querying the corpus.</Paragraph> </Section> </Section> <Section position="4" start_page="36" end_page="40" type="metho"> <SectionTitle> 3 Querying the CroCo Corpus </SectionTitle> <Paragraph position="0"> The comprehensive annotation, including the alignment described in section 2, is the basis for the interpretation presented in what follows.
We concentrate on two types of queries into the different alignment layers that we assume to be relevant to our research question.</Paragraph> <Section position="1" start_page="36" end_page="39" type="sub_section"> <SectionTitle> 3.1 Crossing lines and empty links </SectionTitle> <Paragraph position="0"> From the linguistic point of view we are interested in those units in the target text which do not have matches in the source text and vice versa, i.e. empty links, or whose alignment crosses the alignment of a higher level, i.e. crossing lines.</Paragraph> <Paragraph position="1"> We analyse, for instance, stretches of text contained in one sentence in the source text but spread over two sentences in the target text, as this probably has implications for the overall information contained in the target text.</Paragraph> <Paragraph position="2"> We would thus pose a query retrieving all instances where the alignment of the lower level is not parallel to the higher level alignment but points into another higher level unit. In the example below the German source sequence (3) as well as the English target sequence (4) both consist of three sentences. These sentences are each aligned as illustrated by dashed boxes in Figure 3.1.</Paragraph> <Paragraph position="3"> (3) Aus dem Augenwinkel sah ich, wie eine Schwester dem Bettnachbarn das Nachthemd wechselte. Sie rieb den Rücken mit Franzbranntwein ein und massierte den etwas jüngeren Mann, dessen Adern am ganzen Körper bläulich hervortraten. Ihre Hände liessen ihn leise wimmern.</Paragraph> <Paragraph position="4"> (4) Out of the corner of my eye I watched a nurse change his neighbor's nightshirt and rub his back with alcoholic liniment. She massaged the slightly younger man, whose veins stood out blue all over his body. He whimpered softly under her hands.</Paragraph> <Paragraph position="5"> In German the first two sentences are subdivided into two clauses each.
The English target sentences are co-extensive with the clauses contained in each sentence. This means that two English clauses have to accommodate four German clauses. Figure 3.1 shows that the German clause 3 (Sie rieb den Rücken mit Franzbranntwein ein) in sentence 2 is part of the bare infinitive complementation (...and rub his back with alcoholic liniment) in the English sentence 1. The alignment of this clause points out of the aligned first sentence, thus constituting crossing lines.</Paragraph> <Paragraph position="6"> The third sentence also contains a crossing line, in this case on the levels of chunk and word alignment: the words Ihre Hände in the German subject chunk are aligned with the words her hands in the English adverbial chunk. However, this sentence is particularly interesting in view of empty links. The query asks for units not matching any unit in the parallel text, i.e. for xlink attributes whose values are &quot;#undefined&quot; (cf. section 2.5). In Figure 3.2, the empty links are marked by a black dot.</Paragraph> <Paragraph position="7"> Our linguistic interpretation is based on a functional view of language. Hence, the finite liessen (word 3) in the German sentence is interpreted as a semi-auxiliary and thus as the finite part of the verbal group. Therefore, wimmern (word 6) receives the label PRED, i.e. the non-finite part of the verb phrase, in the functional analysis. This German word is linked to word 2 (whimpered) in the target sentence, which is assigned FIN, i.e. the finite verb in the layer of grammatical functions. As FIN exists both in the source and in the target sentences, this chunk is aligned. The German functional unit PRED does not have an equivalent in the target text and gets an empty link. Consequently, word 3 in the source sentence (liessen) receives an empty link as well.
This mismatch will be interpreted in view of our translation-oriented research. In the following subsection we will see how these two phenomena can be retrieved automatically.</Paragraph> </Section> <Section position="2" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 3.2 Corpus exploitation using XQuery </SectionTitle> <Paragraph position="0"> Since the multi-dimensional annotation and alignment is realised in XML, the queries are posed using XQuery. This query language is particularly well suited to retrieving information from different sources, for instance individual annotation and alignment files. Its use for multi-layer annotation is shown in Teich et al. (2001). The query for determining an empty link at word level can be formulated as follows: find all words which do not have an aligned correspondent, i.e. which carry the xlink attribute value &quot;#undefined&quot;. The same query can be applied on the chunk level, where it returns the grammatical functions that do not have an equivalent in the other language.</Paragraph> <Paragraph position="1"> (5) Ihre Hände liessen ihn leise wimmern. (6) He whimpered softly under her hands.</Paragraph> <Paragraph position="2"> Applied to the sentences in (5) and (6), the XQuery in Figure 3.3 returns all German and English words that receive an empty link because they lack an equivalent in the alignment (liessen and under). This query can be used analogously in all other alignment layers. It calls a user-defined XQuery function (see Figure 3.4), which searches the corresponding index file for words that are not aligned.</Paragraph> <Paragraph position="3"> Querying crossing lines in the German source sentence in (5) and the English target sentence in (6) is based on the annotation at word level as well as on the annotation at the chunk level.
As mentioned in section 3.1, crossing lines are identified in (5) and (6) if the words contained in the chunks aligned on the grammatical function layer are not aligned on the word level. This means that the German subject is aligned with the English subject, but the words within the subject chunk are aligned with words in other grammatical functions instead.</Paragraph> <Paragraph position="4"> In a first step, the query for determining a crossing line requires information about all aligned German chunks with an xlink attribute whose value is not &quot;#undefined&quot; and all aligned German words with an xlink attribute whose value is not &quot;#undefined&quot;. Then all German words that are not aligned on the word level but are aligned as part of chunks on the chunk level are filtered out. Figure 3.6 shows the corresponding XQuery.</Paragraph> <Paragraph position="5"> First, the aligned chunks ($ch1 and $ch2) are saved into variables. These values are needed to detect the span of each chunk ($ch1/tok[position()=1], $ch1/tok[last()] and $ch2/tok[position()=1], $ch2/tok[last()]), and to identify the words making up the source chunks as well as their German or English equivalents. In the second step, all words that do not have empty links are saved ($tok1 and $tok2). The last step filters the crossing lines, i.e. word alignments pointing out of the chunk alignment. For this purpose, we define a new function (local:containsToken) which tests whether a word belongs to a chunk or not. By applying local:containsToken for the German original and not(local:containsToken) for the English translation, all words in the German chunks whose aligned English equivalent words do not belong to the aligned English chunks are retrieved.
The example query returns the German words Ihre Hände, which are part of the German subject chunk and which are aligned with the English words her hands, which in turn are part of the second adverbial chunk.</Paragraph> </Section> </Section> <Section position="5" start_page="40" end_page="41" type="metho"> <SectionTitle> 4 Summary and conclusions </SectionTitle> <Paragraph position="0"> In a broader view, it can be observed that there is an increasing need for richly annotated corpora across all branches of linguistics. The same holds for linguistically interpreted parallel corpora in translation studies. Usually, though, the problem with large-scale corpora is that they do not reflect the complexity of linguistic knowledge we are used to dealing with in linguistic theory.</Paragraph> <Paragraph position="1"> Simple research questions can of course be answered on the basis of raw corpora or with the help of automatic part-of-speech tagging.</Paragraph> <Paragraph position="2"> Most linguists and translation scholars are, however, interested in more complex questions like the interaction of syntax and semantics across languages.</Paragraph> <Paragraph position="3"> The research described here shows the use of comprehensive multi-layer annotation across languages. By relating a highly abstract research question to multiple layers of lexical and grammatical realisations, characteristic patterns of groups of texts, e.g. explicitation in translations and originals in the case of the CroCo project, can be identified on the basis of statistically relevant linguistic evidence.</Paragraph> <Paragraph position="4"> If we want to enrich corpora with multiple kinds of linguistic information, we need a linguistically motivated model of the linguistic units and relations that we would like to extract and about which we would like to draw conclusions on the basis of an annotated and aligned corpus.
So the first step in the compilation of a parallel translation corpus is to provide a classification of linguistic units and relations and their mappings across source and target languages. The classification of English and German linguistic units and relations chosen for the CroCo project (i.e. for the investigation of explicitation in translations and originals) is reflected in the CroCo annotation and alignment schemes and thus in the CroCo Corpus annotation and alignment.</Paragraph> <Paragraph position="5"> From a technical point of view, a comprehensively annotated and aligned multilingual resource has to be represented in such a way that (i) multiple linguistic perspectives on the corpus are possible, since different annotations and alignments can be investigated independently or in combination, (ii) the corpus format guarantees the best possible accessibility and exchangeability, and (iii) the exploitation of the corpus is possible using easily available tools for search and analysis.</Paragraph> <Paragraph position="6"> We coped with this challenge by introducing a multi-layer stand-off corpus representation format in XML (see section 2), which takes into account not only the different annotation layers needed from a linguistic point of view, but also the multiple alignment layers necessary to investigate different translation relations.</Paragraph> <Paragraph position="7"> We also showed how the CroCo resource can be applied to complex research questions in linguistics and translation studies using XQuery to retrieve multi-dimensional linguistic information (see section 3).
Based on the stand-off storage of annotation and alignment layers, combined with the possibility of exploiting the required layers through intelligent queries, parallel text segments and/or parallel annotation units can be extracted and compared across languages.</Paragraph> <Paragraph position="8"> In order to make the CroCo resource available to researchers not familiar with the complexities of XML mark-up and the XQuery language, a graphical user interface will be implemented in Java which allows queries to be formulated without knowledge of the XQuery syntax.</Paragraph> </Section> </Paper>