<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1910"> <Title>Bootstrapping Parallel Treebanks</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Bootstrapping a German-Swedish parallel treebank </SectionTitle>
<Paragraph position="0"> We have built a small German-Swedish parallel treebank with 25 sentence pairs taken from the Europarl corpus. First, the German sentences were annotated with the Annotate treebank editor, using Brants' Part-of-Speech Tagger and Chunker for German. The PoS tagger employs the STTS, a set of around 50 PoS-tags for German. The tag set is comparatively large because it incorporates some morpho-syntactic features (e.g. it distinguishes between finite and non-finite verb forms). The chunker assigns a flat constituent structure with the usual node labels (e.g. AP, NP, PP, S, VP), but also special labels for coordinated phrases (e.g. CAP, CNP, CPP, CS, CVP).</Paragraph>
<Paragraph position="1"> In addition, the chunker suggests syntactic functions (such as subject, object, head or modifier) as edge labels.</Paragraph>
<Paragraph position="2"> The human treebank annotator checks the suggestions made by the tagger and the chunker and modifies them where necessary. The tagger and chunker thus speed up the annotation process for German sentences enormously. The upper tree in figure 1 shows the structure for a German sentence taken from Europarl (example 1). (EN: But citizens of some of our member states have become victims of terrible natural disasters.) Now let us look at the resources available for Swedish. First there is SUC (the Stockholm-Umeå Corpus), a 1-million-word corpus of written Swedish designed as a representative corpus along the lines of the Brown corpus. SUC contains PoS-tags, morphological tags and lemmas for all tokens as well as proper name classes.</Paragraph>
<Paragraph position="3"> All of this information has been hand-checked, so SUC is suitable training material for a PoS tagger.</Paragraph>
<Paragraph position="4"> Compared to the roughly 50 tags of the STTS, the 22 SUC PoS-tags (e.g. only one verb tag) are rather coarse-grained, but of course the combination of PoS-tags and morphological information can be used to automatically derive a richer tag set.</Paragraph>
<Paragraph position="5"> Training material for a Swedish chunker is harder to come by. There are two early Swedish treebanks, Mamba and SynTag (dating back to the 1970s (!) and 1980s respectively), but they are rather small (about 5000 sentences each), very heterogeneously annotated and somewhat faulty (cf. (Nivre, 2002)). Therefore, the most serious attempt at training a chunker for Swedish was based on an automatically created "treebank", which of course contained a certain error rate (Megyesi, 2002). Essentially, there exists no constituent-structure treebank for Swedish that could be used to train a chunker whose output structures correspond to those of the German sentences.</Paragraph>
<Paragraph position="6"> Therefore we have worked with a different approach (described in detail in (Samuelsson, 2004)). We first trained a PoS tagger on SUC and used it to assign PoS-tags to our Swedish sentences. We then converted the Swedish PoS-tags in these sentences into the corresponding German STTS tags (an alternative approach would have been to map all SUC tags to STTS and then train a Swedish tagger on this converted material). We loaded the Swedish sentences into Annotate (now with STTS tags), and we were then able to reuse the German chunker to make structural decisions over the Swedish sentences.</Paragraph>
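As an illustration of the conversion step just described, the following minimal sketch (our own illustration, not the authors' code) maps SUC PoS-tags to rough STTS equivalents before the re-tagged Swedish sentences are handed to the German chunker in Annotate. The concrete tag names and mapping entries are assumptions based only on the examples mentioned in the text (all adjectives mapped to ADJA, plus the interrogative/relative tags of table 1); the authors' full mapping is not reproduced here.

    # Sketch only: convert (token, SUC tag) pairs to (token, STTS tag) pairs.
    SUC_TO_STTS = {
        "JJ": "ADJA",   # adjective -> attributive adjective; corrected to ADJD by hand where needed
        "HA": "PWAV",   # interrogative/relative adverb (cf. table 1)
        "HP": "PRELS",  # interrogative/relative pronoun (cf. table 1)
        # ... remaining SUC tags omitted in this sketch
    }

    def suc_to_stts(tagged_sentence):
        """Map SUC tags to approximate STTS tags; unknown tags pass through unchanged."""
        return [(token, SUC_TO_STTS.get(tag, tag)) for token, tag in tagged_sentence]

    # Example: a tagged Swedish fragment; the re-tagged output is what would be
    # loaded into Annotate so that the German chunker can be reused.
    print(suc_to_stts([("förskräckliga", "JJ"), ("som", "HP")]))

After chunking, the STTS tags are converted back to the Swedish tag set, as described next.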
<Paragraph position="7"> This approach worked surprisingly well due to the structural similarities of Swedish and German. After the semi-automatic annotation of the syntactic structure, the PoS-tags were converted back to the usual Swedish tag set. This is a straightforward example of how resources for one language (in this case German) can be reused to bootstrap linguistic structure in another, albeit related, language (here Swedish).</Paragraph>
<Paragraph position="8"> The lower tree in figure 1 shows the structure for the Swedish sentence which corresponds to the German sentence in example 1.</Paragraph>
<Paragraph position="9"> (2) Däremot har invånarna i ett antal av våra medlemsländer drabbats av naturkatastrofer som verkligen varit förskräckliga. (EN: However, inhabitants of a number of our member states were affected by natural disasters which indeed were terrible.)</Paragraph>
<Paragraph position="10"> Since the German STTS is more fine-grained than the SUC tag set, the mapping from the SUC tag set to STTS does not entail losing any information. When converting in this direction, the problem is rather which option to choose. For example, the SUC tag set has one tag for adjectives, but the STTS distinguishes between attributive adjectives (ADJA) and adverbial or predicative adjectives (ADJD). We decided to map all Swedish adjectives to ADJA, since the information in SUC does not give any clue about this usage difference. The human annotator then needs to correct the ADJA tag to ADJD where appropriate, in order to enable the chunker to work as intended.</Paragraph>
<Paragraph position="11"> Other tag mapping problems come with the SUC tags for adverb, determiner, pronoun and possessive, all of which are marked as "interrogative or relative" in the guidelines. There is no clear mapping of these tags to STTS. We decided to use the mapping in table 1:
    Table 1: Mapping of the SUC "interrogative or relative" tags to STTS.
    SUC tag                        STTS tag
    HA  int. or rel. adverb        PWAV   adverbial interrog. or relative pronoun
    HD  int. or rel. determiner    PWS    (stand-alone) interrog. pronoun
    HP  int. or rel. pronoun       PRELS  (stand-alone) relative pronoun
    HS  int. or rel. possessive    PPOS   (stand-alone) possessive pronoun
</Paragraph>
<Paragraph position="12"> The benefit of using the German chunker for annotating the Swedish sentences is hard to quantify. A precise experiment would require one group of annotators to work with this chunker and another to work without it on the same sentences, in order to compare the time needed.</Paragraph>
<Paragraph position="13"> We performed a small experiment to see how often the German chunker suggests the correct node labels and edge labels for the Swedish sentences (when the children tags/nodes were manually selected). In 100 trials we observed 89 correct node labels (89%) and 93% correct edge labels (for 305 edges). If we assume that manual inspection of a correct suggestion takes about a third of the time of manual annotation, and that the correction of an erroneous suggestion takes the same amount of time as manual annotation, then with roughly 90% correct suggestions the relative effort is about 0.9 * 1/3 + 0.1 * 1 = 0.4, i.e. the employment of the German chunker for Swedish saves about 60% of the annotation time.</Paragraph>
<Paragraph position="14"> Reusing a chunker for bootstrapping a parallel treebank between closely related languages like German and Swedish is only a first step towards reusing annotation (be it automatic or manual) in one language for another language.</Paragraph>
<Paragraph position="15"> But it points to a promising research direction. (Yarowsky et al., 2001) have reported interesting results for an annotation-projection technique for PoS tagging, named entities and morphology.
And (Cabezas et al., 2001) have explored projecting syntactic dependency relations from English to Basque. This idea was followed up by (Hwa et al., 2002), who investigated English-to-Chinese projections based on the direct correspondence assumption. They conclude that annotation projections are nearly 70% accurate (in terms of unlabelled dependencies) when some linguistic knowledge is used. We believe that annotation projection is a difficult field, but even if we succeed only in a limited number of cases, it will be valuable for speeding up the development of parallel treebanks.</Paragraph>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Alignment </SectionTitle>
<Paragraph position="0"> The alignment in our experimental treebank is based on the nodes, not the edge labels. Figure 1 shows the phrase alignment as thick lines across the trees. All of the alignment mapping was done by hand.</Paragraph>
<Paragraph position="1"> We decided to make the alignment deterministic, i.e. a node in one language can only be aligned with one node in the other language (a small sketch of this constraint follows the list of differences below).</Paragraph>
<Paragraph position="2"> There are, of course, a number of problems with the alignment. We have looked at the meaning rather than the exact wording. Sometimes different words are used in an S or VP, but we still feel that the meaning is the same, and we have therefore aligned them. We might have alignment on one constituent level, while there are differences (i.e. no alignment) on lower levels of the tree. Therefore we consider it important to make the parse trees sufficiently deep: we need to be able to draw the alignment on as many levels as possible.</Paragraph>
<Paragraph position="3"> Another problem arises when the sentences are constructed in different ways, due to e.g. passivisation or topicalisation. Although German and Swedish are structurally close, there are some clear differences.</Paragraph>
<Paragraph position="4"> + German separable prefix verbs (e.g. fangen an = begin) do not have a direct correspondence in Swedish. However, Swedish has frequent particle verbs (e.g. ta upp = bring up). But whereas the German separated verb prefix occupies a specific position at the end of a clause ("Rechte Satzklammer"), the Swedish verb particle occurs at the end of the verb group.</Paragraph>
<Paragraph position="5"> + The general word order in Swedish subordinate clauses is the same as in main clauses. Unlike in German, there is no verb-final order in subordinate clauses.</Paragraph>
<Paragraph position="6"> + German uses accusative and dative case endings to mark direct and indirect objects.</Paragraph>
<Paragraph position="7"> This is reflected in the German function labels for accusative object (OA) and for dative object (DO). Swedish has lost these case endings, and the labels therefore need not reflect case but rather object function.</Paragraph>
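The following is a minimal sketch (our own, not the authors' code) of the one-to-one constraint on the phrase alignment mentioned above: the alignment is stored as pairs of node ids, and no node on either side may occur in more than one pair. The first pair uses the node numbers mentioned in section 4 (phrase 501 of the German sentence aligned with phrase 503 of the Swedish sentence); the second pair is invented for illustration.

    # Sketch only: a deterministic phrase alignment as (German node id, Swedish node id) pairs.
    def check_deterministic(alignment):
        """Raise an error if any node occurs in more than one alignment pair."""
        german_nodes = [g for g, _ in alignment]
        swedish_nodes = [s for _, s in alignment]
        assert len(set(german_nodes)) == len(german_nodes), "a German node is aligned more than once"
        assert len(set(swedish_nodes)) == len(swedish_nodes), "a Swedish node is aligned more than once"

    # Example: the pair (501, 503) is taken from section 4; (502, 505) is invented.
    check_deterministic([(501, 503), (502, 505)])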
<Paragraph position="8"> Our overall conclusion is that applying the German treebank annotation guidelines to Swedish works well when the few peculiarities of Swedish are taken care of.</Paragraph>
</Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Corpus representation </SectionTitle>
<Paragraph position="0"> After annotating the sentences in both languages with the Annotate treebank editor, the tree structures were exported from the MySQL database in the NEGRA export format. The file in NEGRA format is easily loaded into TIGERSearch via the TIGERRegistry, which provides an import filter for this format. This import process creates a TIGER-XML file which contains the same information as the NEGRA file. The difference is that the pointers in the NEGRA format go from the tokens to the pre-terminal nodes (and from nodes to parent nodes) in a bottom-up fashion, whereas in the TIGER-XML file the nodes point to their children by listing their id numbers (idref) and their edge labels (in a top-down perspective).</Paragraph>
<Paragraph position="1"> In this file the tokens of the sentence (terminals) are listed beneath each other with their corresponding PoS-tag (PPER for personal pronoun, VVFIN for finite verb, APPRART for contracted preposition etc.). The nodes (nonterminals) are listed with their name and their outgoing edges with labels such as HD for head, NK for noun kernel, SB for subject etc.</Paragraph>
<Paragraph position="3"> Since all tokens and all nodes are uniquely numbered, these numbers can be used for the phrase alignment. For the representation of the alignment we adapted a DTD that was developed for the Linköping Word Aligner (Ahrenberg et al., 2002). In the resulting XML file, the sentLink tags each contain one sentence pair, while each phraseLink element represents one aligned node pair.</Paragraph>
<Paragraph position="4"> The alignment file first specifies the two XML files involved, for German (De.xml) and Swedish (Sv.xml). It then states the phrase pairs for each sentence pair, e.g. for the sentence pair 1 - 1 from these files. For example, phrase number 501 from German sentence 1 is aligned with phrase number 503 of the Swedish sentence.</Paragraph>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Tools for Parallel Treebanks </SectionTitle>
<Paragraph position="0"> Treebank tools are usually of two types. First, there are tools for producing the treebank, i.e. for automatically adding information (taggers, chunkers, parsers) and for manual inspection and correction (treebank editors). On the other hand, we need tools for viewing and searching a treebank.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Treebank Editors </SectionTitle>
<Paragraph position="0"> Of course the tools for monolingual treebank production can also be used for building the language-specific parts of a parallel treebank.</Paragraph>
<Paragraph position="1"> Thus a treebank editor such as Annotate with a built-in PoS tagger and chunker is an invaluable resource. But such a tool should include or be complemented with a completeness and consistency checker.</Paragraph>
<Paragraph position="2"> In addition, the parallel treebank needs to be aligned on the sub-sentence level.
Automatic word alignment systems will help ((Tiedemann, 2003) discusses some interesting approaches).</Paragraph>
<Paragraph position="3"> But tools for checking and correcting this alignment will be needed. For example, the I*Link system (Ahrenberg et al., 2002) could be used for this task. I*Link comes with a graphical user interface for creating and storing associations between segments in a bitext. It is aimed at word and phrase associations and requires bitexts that are pre-aligned at the sentence level.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Treebank Search Tools </SectionTitle>
<Paragraph position="0"> With the announcement of the Penn Treebank, some 10 years ago, came a search tool called tgrep. It is a UNIX-based program that allows querying a treebank by specifying dominance and precedence relations over trees (plus regular expressions and boolean operators). The search results are bracketed trees in line-based or indented format, catering to the needs of different users. A tgrep query can, for example, search for a VP that dominates (not necessarily directly) an NP which immediately precedes a PP.</Paragraph>
<Paragraph position="2"> More recently, TIGERSearch was launched.</Paragraph>
<Paragraph position="3"> It is a Java-based program that comes with a graphical user interface and a powerful feature-value-oriented query language. The output consists of graphical tree representations in which the matched part of the tree is highlighted and focused. TIGERSearch's ease of installation and friendly user interface have made it the tool of choice for many treebank researchers.</Paragraph>
<Paragraph position="4"> To our knowledge, no specific search tools for parallel treebanks exist. In addition to the search options of tgrep and TIGERSearch sketched above, a search tool for parallel treebanks will have to allow queries that combine constraints over two trees. For example, one wants to issue queries such as "Find a tree in language 1 with a relative clause where the parallel tree in language 2 uses a prepositional phrase for the same content."</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Displaying Parallel Trees </SectionTitle>
<Paragraph position="0"> There is currently no off-the-shelf tool that can display parallel trees so that one could view two phrase structure trees at the same time together with their alignment. Therefore we discuss possible display options for such a future program.</Paragraph>
<Paragraph position="1"> One alternative is to show the two trees above each other (as in figure 1). There are many ways to visualize the alignment: either by drawing lines between the nodes (as we did), or by color-marking the nodes, or by opening another window in which only chosen parallel nodes are shown. The latter option corresponds to a zoom function, but it also entails that the user has to click on a node to view the alignment.</Paragraph>
<Paragraph position="2"> Another alternative would be a mirror-image arrangement: one language would have its tree with the root at the top, and the tree of the other language would be below it with the root at the bottom.
The alignment could be displayed in the same ways as above.</Paragraph>
<Paragraph position="3"> The display problem is then mainly a problem of today's computer screens: a large picture partly lands outside the screen, while a smaller-scale picture might result in words that are too small to be readable. One solution could be to use two screens (as is done in complex layout tasks), but then we cannot have a solution with the trees above each other, but rather next to each other, possibly with some kind of color marking of the nodes.</Paragraph>
<Paragraph position="4"> A last alternative is to use vertical trees, where the words are listed below each other and phrase depth is shown horizontally. The alignment could then be shown by placing the nodes side by side instead of above each other. This is the least space-consuming alternative, but it is also the least intuitive one. Furthermore, it is not viable if the trees contain crossing branches.</Paragraph>
<Paragraph position="5"> We currently favor the first approach with two trees above each other, and we have written a program that takes the SVG (scalable vector graphics) representations of two trees (as exported from TIGERSearch), merges the two graphs into a single graph and adds the phrase alignment lines based on the information in the alignment file.</Paragraph>
</Section> </Section> </Paper>