<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1112"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Structural Similarity Measure</Title> <Section position="4" start_page="0" end_page="92" type="metho"> <SectionTitle> 2 Imperfections of measures based </SectionTitle> <Paragraph position="0"> on string similarity There are many application areas in NLP in which it is useful to apply measures exploiting the similarity of word forms (strings). They serve very well, for example, for tasks like spellchecking (where the choice of the best candidates for the correction of a spelling error is typically based on the Levenshtein metric) or for estimating the similarity of a new source sentence to those stored in the translation memory of a Machine Aided Translation system. They are somewhat controversial in &quot;proper&quot; machine translation, where the popular BLEU score (Papineni et al., 2002), although widely accepted as a measure of translation accuracy, seems to favor stochastic approaches based on an n-gram model over other MT methods (see the results in (NIST, 2001)).</Paragraph> <Paragraph position="1"> The controversies the BLEU score seems to provoke arise from the fact that the evaluation of MT systems can, in general, be performed from two different viewpoints. The first one is that of a developer of such a system, who needs reliable feedback during the development and debugging of the system. The primary interest of such a person is grammar or dictionary coverage and system performance, and he needs a cheap, fast and simple evaluation method allowing frequent routine tests that indicate the improvements of the system during its development.</Paragraph> <Paragraph position="2"> The second viewpoint is that of a user, who is primarily concerned with the capability of the system to provide fast and reliable translation requiring as little post-editing effort as possible.
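For concreteness, the Levenshtein metric mentioned above can be sketched as follows. This is a standard dynamic-programming formulation; the function name and the illustrative usage are ours, not part of the original paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca by cb
        prev = cur
    return prev[-1]
```

For instance, the pair godina/hodina discussed in this section differs in a single substitution, so `levenshtein("godina", "hodina")` returns 1.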
Simplicity, speed and low costs are not of such importance here. If the evaluation is performed only once, at the moment when the system is considered ready, the evaluation method may even be relatively complicated, expensive and slow. A good example of such a complex measure is the FEMTI framework (Framework for the Evaluation of Machine Translation). The most complete description of the FEMTI framework can be found in (Hovy et al., 2002). Such measures are much more popular among translators than among language engineers and MT system developers.</Paragraph> <Paragraph position="3"> If we aim at measuring the similarity of languages or language distances, our point of view should be much closer to that of a human translator than to that of a system developer, if we stick to our MT analogy. When looking for clues concerning the desirable properties of a language similarity (or distance) measure, we can first try to formulate the reasons why we consider the simple string-based (or wordform-based) measures inadequate.</Paragraph> <Paragraph position="4"> If we take into account the number of languages existing in the world, the number of word forms existing in each of those languages, and the simple fact that a huge percentage of those word forms is no longer than five or six characters, it is quite clear that there is a huge number of overlapping word forms which have completely different meanings in the languages containing that particular word form.</Paragraph> <Paragraph position="5"> Let us take for illustration some pairs of unrelated languages.</Paragraph> <Paragraph position="6"> For example, for Czech and English (languages very different with regard to both lexicon and syntax) we can find several examples of overlapping word forms.
The English word house means a duckling in Czech; the English indefinite article a is also very frequent in Czech, where it represents the coordinating conjunction and, while an is an archaic form of a pronoun in Czech. On the other hand, if we look at identical (or nearly identical) word forms in similar languages, we can also find many examples of totally different meanings.</Paragraph> <Paragraph position="7"> For example, the word form život means life in Czech and belly in Russian; godina means year in Serbo-Croatian while hodina is an hour in Czech (by the way, an hour in Russian is čas -- and the same word means time in Czech).</Paragraph> <Paragraph position="8"> The overlapping word forms between relatively distant languages are so frequent that it is even possible to create (more or less) syntactically correct sentences in one language containing only word forms from the other language. Again, let us look at the Czech-English language pair. The English sentences Let my pal to pile a lumpy paste on a metal pan. or I had to let a house to a nosy patron. consist entirely of word forms existing also in Czech, while the Czech sentence Adept demise metal hole pod led. -- [A resignation candidate was throwing sticks under the ice.] consists of English word forms.</Paragraph> <Paragraph position="9"> Creating such a Czech sentence is more complicated -- as Czech is a highly inflected language, it uses a wide variety of endings, which makes it more difficult to create a syntactically correct sentence from word forms of a language which has an incomparably smaller repertoire of endings.
This fact leads directly to another argument against string-similarity-based measures: even though two languages may have very similar syntactic properties and very similar basic word forms, if the languages are highly inflected and the only difference between them lies in the endings used for expressing identical morphosyntactic properties, the string-similarity-based methods will probably show a substantial difference between these languages.</Paragraph> <Paragraph position="10"> This is highly probable especially for shorter words -- words with a basic form only four or five characters long may have endings longer than or equal to the length of the basic form, for example: nová/novata &quot;new&quot; (Cze/Mac), viděný/vidimyj &quot;seen&quot; (Cze/Rus), fotografující/fotografuojantysis &quot;photographing&quot; (Cze/Lit).</Paragraph> <Paragraph position="11"> The last but not least indirect argument against the use of string-based metrics can be found in (Kuboň and Bémová, 1990). The paper describes a so-called transducing dictionary, a set of rules designed for a direct transcription of a certain category of source language words into a target language. The system was tested on two language pairs (English-to-Czech and Czech-to-Russian) and although there was a natural original assumption that such a system would cover substantially more expressions when applied to a pair of related languages (which are not only related, but also quite similar), this assumption turned out to be wrong. The system covered an almost identical set of words for both language pairs -- namely words of Greek or Latin origin.
The similarity of coverage even made it possible to build an English-to-Russian transducing dictionary using Czech as a pivot language, with a negligible loss of coverage.</Paragraph> </Section> <Section position="5" start_page="92" end_page="92" type="metho"> <SectionTitle> 3 Experience from MT of similar languages </SectionTitle> <Paragraph position="0"> The Machine Translation field is a good testing ground for any theory concerning the similarity of natural languages. Systems dealing with related languages usually achieve higher translation quality than systems aiming at the translation of more distant language pairs -- the average MT quality for a given system and a given language pair might therefore also serve as a very rough metric of the similarity of the languages concerned.</Paragraph> <Paragraph position="1"> Let us demonstrate this idea using the example of a multilingual MT system described in several recently published papers (see e.g.</Paragraph> <Paragraph position="2"> (Hajič et al., 2003) or (Homola and Kuboň, 2004)). The system aims at translation from a single source language (Czech) into multiple more or less similar target languages, namely into Slovak, Polish, Lithuanian, Lower Sorbian and Macedonian.</Paragraph> <Paragraph position="3"> The system is very simple -- it does not contain any full-fledged parser, either rule-based or stochastic. It relies on the syntactic similarity of the source and target languages.</Paragraph> <Paragraph position="4"> It is transfer-based, with the transfer being performed as soon as possible, depending on the similarity of both languages. In its simplest form (Czech-to-Slovak translation) the system consists of the following modules: 1. Morphological analysis of the source language (Czech) 2. Morphological disambiguation of the source language text by means of a stochastic tagger 3.
Transfer exploiting domain-related bilingual glossaries and a general (domain-independent) bilingual dictionary 4. Morphological synthesis of the target language. The lower degree of similarity between Czech and the remaining target languages led to the inclusion of a shallow parsing module for Czech for some of the language pairs. This module directly follows the morphological disambiguation of Czech.</Paragraph> <Paragraph position="5"> The evaluation results presented in (Homola and Kuboň, 2004) indicate that even though Czech and Lithuanian are much less similar at the lexical and morphological levels (i.e., at both levels actually dealing with strings), the translation quality is very similar, due to the syntactic similarity of all languages concerned.</Paragraph> </Section> <Section position="6" start_page="92" end_page="97" type="metho"> <SectionTitle> 4 Typology of language similarity </SectionTitle> <Paragraph position="0"> The experience from the field of MT of closely related languages presented in the previous sections shows that it is very useful to classify language similarity into several categories: typological, lexical, syntactic and morphological.</Paragraph> <Paragraph position="2"> Let us now look at these categories from the point of view of machine translation.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 4.1 Typological similarity </SectionTitle> <Paragraph position="0"> The first type of similarity is probably the most important one. If the target and the source language are of different language types, it is more difficult to obtain good MT quality. Notions like word order, the existence or non-existence of articles, a different temporal system and several other properties have direct consequences for the translation quality.
Let us take Czech and Lithuanian as an example of a language pair whose members do not belong to the same group of languages (Czech is a Slavic and Lithuanian a Baltic language).</Paragraph> <Paragraph position="1"> Both languages have rich inflection and a very high degree of word-order freedom, so it is not necessary to change the word order at the constituent level. On the other hand, the two languages differ a lot in lexicon and morphology. For example, both (1) and (3) mean approximately &quot;The father read a/the book&quot;. What these sentences differ in is the information structure. (1) should be translated as &quot;The father read a book&quot;, whereas (3) means in fact &quot;The book has been read by the father&quot;.1 The category of voice differs in the two sentences because of the strict word order in English, although in both Czech equivalents the active voice is used.2 We see that in the Lithuanian translation, the word order is exactly the same.</Paragraph> </Section> <Section position="2" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 4.2 Lexical similarity </SectionTitle> <Paragraph position="0"> Lexical similarity does not mean that the vocabulary has to have the same origin, i.e., that words have to be created from the same (proto-)stem. What is important for shallow MT (and for MT in general) is the semantic correspondence (preferably a one-to-one relation). Lexical similarity is the least important one from the point of view of MT, because lexical differences are solved in the glossaries and general dictionaries.</Paragraph> </Section> <Section position="3" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 4.3 Syntactic similarity </SectionTitle> <Paragraph position="0"> Syntactic similarity is also very important, especially on higher levels, in particular on the verbal level.
The differences in verbal valences have a negative influence on the quality of translation, due to the fact that the transfer then requires a large-scale valence lexicon for both languages, which is extremely difficult to build. The syntactic structure of smaller constituents, such as nominal and prepositional phrases, is not that important, because it is possible to analyze those constituents using shallow syntactic analysis and thus to locally adapt the syntactic structure of the target sentence.</Paragraph> </Section> <Section position="4" start_page="93" end_page="97" type="sub_section"> <SectionTitle> 4.4 Morphological similarity </SectionTitle> <Paragraph position="0"> Morphological similarity means a similar structure of the morphological hierarchy and paradigms, such as the case system, the verbal system etc. In our understanding, Baltic and Slavic languages (except for Bulgarian and Macedonian) have a similar case system, and their verbal systems are quite similar as well. Some problems are caused by synthetic forms, which have to be expressed by analytical constructions in other languages (e.g., the future tense or the conjunctive in Czech and Lithuanian). The differences in morphology can be relatively easily overcome by the exploitation of full-fledged morphology of both languages (source and target).</Paragraph> <Paragraph position="1"> Similar morphological systems simplify the transfer. For example, Slavonic languages (except for Bulgarian and Macedonian) have 6-7 cases. The case system of East Baltic languages is very similar, although it has been formally reduced in Latvian (instrumental forms coincide with the dative and accusative, and the function of the instrumental is expressed by the preposition ar &quot;with&quot;, similarly as in Upper Sorbian). (Ambrazas, 1996) gives seven cases for Lithuanian, but there are in fact at least eight cases in Lithuanian (or ten cases, of which only eight are productive3).
Nevertheless, the case systems of Slavonic and East Baltic languages are very similar, which makes the languages quite similar even across the border between different language groups.</Paragraph> <Paragraph position="2"> Significant differences occur only in the verbal system: East Baltic languages have a huge number of participles and semi-participles that have no direct counterpart in Czech. The participle valdysiantis, for example, is used instead of an embedded sentence, because Lithuanian has future participles. These participles have to be expressed by an embedded sentence in Slavonic languages.</Paragraph> <Paragraph position="3"> 5 An outline of a structural similarity measure In this section, we propose a comparatively simple measure of syntactic (structural) similarity. There are generally two levels which may serve as a basis for such a structural measure: the surface or the deep syntactic level. Let us first explain the reasons supporting our choice of the surface syntactic level.</Paragraph> <Paragraph position="4"> Compared to the deep syntactic representation, surface syntactic trees are much more closely related to the actual surface form of a sentence. (Footnote 3: Although some Balticists argue that illative forms are adverbs, it is a fact that this case is productive and used quite often (Erika Rimkutė, personal communication), though it has been widely replaced by prepositional phrases. The allative and adessive are used only in some Lithuanian dialects, except for a few fixed allative forms, e.g., vakarop(i) &quot;in the evening&quot;, velniop(i) &quot;to hell&quot;.) It is quite common that every word form or punctuation sign is directly related to a single node of a surface syntactic tree. Deep syntactic trees, on the other hand, usually represent autosemantic words only; they may even contain more nodes than there are words in the input sentence (for example, when the input sentence contains ellipsis).
It is also quite clear that deep syntactic trees are much more closely related to the meaning of the sentence than to its original surface form; therefore, they may hide certain differences between the languages concerned. It is a generally accepted hypothesis that transfer performed at the deep syntactic level is easier than transfer at the surface syntactic level, especially for syntactically and typologically less similar languages.</Paragraph> <Paragraph position="5"> The second important decision we had to make was to choose the better type of surface syntactic trees, dependency or phrase structure trees. For practical reasons we have decided to use dependency trees. The main motivation for this decision is the enormous structural ambiguity of phrase structure trees that represent sentences with an identical surface form. Let us have a look at the following example. The syntactic structure of this sentence can be expressed by two phrase structure trees representing a different order of attaching nominal phrases to a verb.4</Paragraph> <Paragraph position="7"> There is no linguistically relevant difference between these two trees. Although generally useful, the information hidden in both trees is simply superfluous for our goal of designing a simple structural metric. The dependency tree obtained from the phrase structure trees by contraction of all head edges seems to be much more appropriate for our purpose. In our example, we therefore get the following form of the dependency tree:</Paragraph> <Paragraph position="9"> The nodes of the dependency trees representing the surface syntactic level directly correspond to the word forms present in the sentence.</Paragraph> <Paragraph position="10"> For the sake of simplicity, punctuation marks are not represented in our trees. They would probably cause a lot of technical problems and might distort the whole similarity measure.
The nodes of a tree are ordered and reflect the surface word order of the sentence.</Paragraph> <Paragraph position="11"> Different labels of nodes in the two languages (see the example below) do not influence the value of the measure; however, they are important for the identification of corresponding nodes (a bilingual dictionary is used here).</Paragraph> <Paragraph position="12"> The structural measure we are suggesting is based on an analogy to the Levenshtein measure. It is therefore pretty simple -- the distance between two trees is the minimal number of elementary operations that transform one tree into the other. We consider the following elementary operations: 1. adding a node, 2. removing a node, 3. changing the order of a node, 4. changing the father of a node.</Paragraph> <Paragraph position="13"> The similarity of languages can then be obtained as the average distance of individual sentences in a parallel corpus.</Paragraph> <Paragraph position="14"> The following examples show the use of the measure on individual trees. The correspondence between individual nodes of both trees can be established by exploiting the bilingual dictionary wherever necessary: &quot;Vesna has come.&quot; (Pol) The distance between (7) and (8) is equal to 1, since one node has been removed (the dotted line marks the removed node).</Paragraph> <Paragraph position="15"> The distance between (9) and (10) is equal to 1, since one node has been removed (the dotted line marks the removed node).</Paragraph> <Paragraph position="16"> The Czech-English example (11) shows two sentences with a mutual distance equal to 3 -- if we start changing the Czech tree into an English one, the first elementary operation is the deletion of the node rád, the second operation adds a new node corresponding to the English word likes, and the third and last operation is the change of the father of the node corresponding to the personal pronoun on [he] from swimming to likes.
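The elementary operations above can be sketched in code. The following is a minimal illustration of the idea, not the authors' implementation: it covers only the adding and removing of nodes (operations 1 and 2), aligns children greedily by a lemma key, and assumes the lemmas have already been mapped through the bilingual dictionary; reordering and reattachment would require a proper alignment step. All names are our assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical surface-syntactic node: a lemma (assumed already mapped
    # through a bilingual dictionary) and an ordered list of dependents.
    lemma: str
    children: list["Node"] = field(default_factory=list)

def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)

def distance(a: Node, b: Node) -> int:
    """Greedy sketch of the tree distance: if the roots do not correspond,
    charge removal of one subtree plus addition of the other; otherwise
    align the children and recurse on the aligned pairs."""
    if a.lemma != b.lemma:
        return size(a) + size(b)
    d = 0
    used = [False] * len(b.children)
    for ca in a.children:
        for j, cb in enumerate(b.children):
            if not used[j] and ca.lemma == cb.lemma:
                used[j] = True
                d += distance(ca, cb)
                break
        else:
            d += size(ca)  # subtree present only in a: removed
    # subtrees present only in b: added
    d += sum(size(cb) for j, cb in enumerate(b.children) if not used[j])
    return d
```

On a pair of trees that differ in one extra dependent node, as in the examples with one removed node above, the sketch returns 1.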
As mentioned above, the node labels are not taken into account; the fact that the Czech finite verbal form plave changes into an English gerund has no effect on the distance.</Paragraph> <Paragraph position="17"> A similar case is that of sentences with a dative agent, for example: In this sentence, the Czech mi does not match I, since it is not a subject. Similarly, the noun zima does not match cold, since it is a different part of speech. Hence two nodes are removed and two new nodes are added, which gives us a distance of 4. This example demonstrates that the measure tends to behave naturally: even short sentences containing syntactically different constructions get a relatively high score. To formalize the process described above, let us introduce the notions of lexical and analytical equality of nodes in analytical trees: * Two nodes are lexically equal if and only if they share the same meaning in the given context. Nevertheless, to simplify automatic processing, we treat two nodes as lexically equal if they share a particular meaning (defined e.g. as a non-empty intersection of Wordnet classes).</Paragraph> <Paragraph position="18"> * Two nodes are analytically equal if and only if they have the same analytical label (e.g. subject, spatial adverbial etc.).</Paragraph> <Paragraph position="19"> As for the measure, two nodes match each other if they 1) occur at the same position in the subtree of their parent and 2) are lexically and analytically equal.</Paragraph> <Paragraph position="20"> If a subtree (of size greater than 1) is added or removed, the operation contributes to the measure with the size of the subtree (the number of its nodes). In the following example, the distance is equal to 2.</Paragraph> <Paragraph position="21"> The automatic procedure can be described as follows (given two trees): 1. Align all sons of the root node.</Paragraph> <Paragraph position="22"> 2. Count discrepancies.</Paragraph> <Paragraph position="23"> 3.
For all matched nodes, go to step 1 to process their subtrees and sum up the distances.</Paragraph> </Section> <Section position="5" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 5.2 Discussion </SectionTitle> <Paragraph position="0"> It is obvious that our measure expresses the typological similarity of languages. We get comparatively high values even for genetically related languages if their typology is different.</Paragraph> <Paragraph position="1"> Let us demonstrate this fact on Czech and Macedonian: &quot;Ivan gave the book to Stojan.&quot; (Mac) The distance equals 5. The score is relatively high, taking into account that the two languages are related. It indicates again that, for the given purpose, the measure seems to provide consistent results.</Paragraph> <Paragraph position="2"> The proposed measure takes into account only the structure of the trees, completely ignoring node and edge labels. Let us analyze the following example: &quot;This book is read often.&quot; The syntactic trees of both sentences have the same structure, but (17) is passive and (18) active (with a general subject). This is of course a significant difference, and as such it should be captured by the measure; nevertheless, our simple measure does not reflect it. There are several reasons why the current version of the measure does not include morphological and morphosyntactic labels. One of the reasons is the different nature of the problem: designing a reliable measure combining structural information with the information contained in node labels is very difficult. From the technical point of view, a great obstacle is also the variety of tag systems used for this purpose in individual languages, which may not be compatible. For example, Macedonian has almost no cases on nouns, so it would make no sense to use cases in the noun annotation, while for other Slavic languages (and not only Slavic ones) this information is very important.
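The lexical and analytical equality tests defined in section 5 can be sketched as follows. This is a minimal illustration; the set-of-classes representation of lexical meaning and all names are our assumptions, not part of the original paper:

```python
def lexically_equal(classes_a: set, classes_b: set) -> bool:
    """Lexical equality approximated as a non-empty intersection of the
    nodes' semantic (e.g. Wordnet) class sets."""
    return bool(classes_a & classes_b)

def analytically_equal(label_a: str, label_b: str) -> bool:
    """Analytical equality: identical surface-syntactic function labels."""
    return label_a == label_b

def nodes_match(pos_a: int, pos_b: int,
                classes_a: set, classes_b: set,
                label_a: str, label_b: str) -> bool:
    """Two nodes match if they occupy the same position among their
    parent's children and are lexically and analytically equal."""
    return (pos_a == pos_b
            and lexically_equal(classes_a, classes_b)
            and analytically_equal(label_a, label_b))
```

For instance, two first children whose class sets overlap and which both carry a subject label would match, while nodes differing in position or in the analytical label (as mi vs. I above) would not.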
Finding a good way to integrate morphosyntactic features into the structural measure is definitely a very interesting topic for future research.</Paragraph> </Section> </Section> </Paper>