<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2409"> <Title>Modeling Monolingual and Bilingual Collocation Dictionaries in Description Logics</Title> <Section position="3" start_page="0" end_page="65" type="metho"> <SectionTitle> 2 Collocation Data </SectionTitle> <Paragraph position="0"> Properties of collocations. A mere list of word pairs or sequences (give a talk, lose one's patience) is not a collocation dictionary. For use in NLP, linguistic properties of the collocations and of their components must be provided: these include the category of the components (giveV + talkN), the distribution of base (talk) and collocate (give), as well as morphosyntactic preferences, e.g. with respect to the number of an element (have high hopes) or the use of a determiner (lose one'sposs|{} patience, cf. Evert et al. (2004)).</Paragraph> <Paragraph position="1"> For collocations to be identifiable in the context of a sentence (e.g. to avoid attachment ambiguity in parsing) and, conversely, to be correctly inserted into a sentence in generation, the syntagmatic behavior of collocations must be described. This includes their function within a sentence (e.g. in the case of adverbial NPs) and the subcategorization of their components, e.g. with support verb constructions (make the proposal to + INF). As the subcategorization of a collocation is not fully predictable from the subcategorization of the noun (how to explain the preposition choice in Unterstützung finden bei jmdm, 'find support in so.', be supported?), we prefer to encode the respective data in the monolingual dictionary. To support translation mapping at the complement level, the representation of each complement contains its grammatical category (NP, AP, etc.), its grammatical function (subject, object, etc.) and a semantic role inspired by FrameNet1. This allows us to cater for divergence cases: jmdSubj/SPEAKER bringt jmdmInd.Obj/ADDRESSEE etw.Obj/TOPIC in Erinnerung vs. 
someoneSubj/SPEAKER reminds someoneObj/ADDRESSEE of sth.Prep.Obj/TOPIC.</Paragraph> <Paragraph position="2"> Relations involving collocations. For language generation, paraphrasing or summarization, paradigmatic relations of collocations must also be modeled. These include synonymy, antonymy and taxonomic relations, but also morphological ones (word formation) and combinations of collocations. Synonymy and antonymy should relate collocations with other collocations, but also with single words and with idioms: all three types should have the same status. Next to strict synonymy, there may be 'quasi-synonymy'.</Paragraph> <Paragraph position="3"> Transparent noun compounds tend to share collocates with their heads (Pause einlegen, Rauchpause einlegen, Kaffeepause einlegen): if the relation between compound and head (Kaffeepause - Pause) and between the respective collocations is made explicit, this knowledge can be exploited in translation, when a compositional equivalent is chosen (have a (smoking/coffee) break). Paraphrasing and its applications also profit from an explicit representation of morphological relations between collocates: submit + proposal, submission of + proposal and submitter of + proposal all refer to the same collocational pattern. [Footnote 1: Cf. http://framenet.icsi.berkeley.edu/]</Paragraph> <Paragraph position="4"> A formal model for a collocation dictionary, monolingual and/or bilingual, has to keep track of the above-mentioned properties and relations of collocations; both should be queryable, alone and in arbitrary combinations.</Paragraph> <Paragraph position="5"> Other collocation dictionaries and dictionary architectures. Most of the above-mentioned properties and relations have been discussed in the descriptive literature, but to our knowledge, they have never all been modeled in a single electronic dictionary. 
The Danish STO dictionary (Braasch and Olsen, 2000) and Krenn's (2000) database of German support verb+PP-constructions both emphasize morphosyntactic preferences, but do not include relations. The electronic learners' dictionaries DAFLES and DICE focus on semantic explanations of collocations, but do not contain details about most of the properties and relations mentioned above. The implementation of Mel'čuk's Meaning-Text Theory in the DiCo/LAF model comes closest to our requirements, insofar as it is highly relational and includes some, though not all, of the morphological relations we described above.</Paragraph> <Paragraph position="6"> The Papillon project (Sérasset and Mangeot-Lerebours, 2001) proposes a general architecture for the interlingual linking of monolingual dictionaries; as it is inspired by the DiCo formalization, it foresees links between readings, e.g. to account for morphological relations. This mechanism could in principle be extended to syntagmatic phenomena; we are, however, not aware of a Papillon-based collocation dictionary.</Paragraph> </Section> <Section position="4" start_page="65" end_page="68" type="metho"> <SectionTitle> 3 Modeling in OWL DL </SectionTitle> <Paragraph position="0"> In this section, we present the main features of OWL DL and their relevance to the modeling of lexical data. Section 3.2 addresses the design of a monolingual collocation dictionary using OWL DL (Spohr, 2005).</Paragraph> <Section position="1" start_page="65" end_page="66" type="sub_section"> <SectionTitle> 3.1 Main Features of OWL </SectionTitle> <Paragraph position="0"> OWL DL is a sublanguage of OWL (Bechhofer et al., 2004), combining the expressivity of OWL with the computational completeness and decidability of Description Logics (Baader et al., 2003). Properties of OWL DL relevant for lexical modeling are listed and discussed in the following.</Paragraph> <Paragraph position="1"> Classes. An OWL DL data model consists of a subsumption hierarchy of classes, i.e. 
a class X subsumes all its subclasses X1 to Xn. While classes represent concepts, their instances (called OWL individuals) represent concrete manifestations in the model. Classes and their instances can be constrained by stating assertions in the model definition, e.g. a class can be defined as being disjoint with other classes, which means that instances of a certain class cannot at the same time be instances of the disjoints of this particular class. Properties. Classes are described by properties.</Paragraph> <Paragraph position="2"> These can be used either to specify XML Schema Datatypes (datatype properties) or to relate instances of one class to instances of (possibly) other classes (object properties). These classes are then defined as the domain and range of a property, i.e. a particular property may only relate instances of classes in its domain to instances of classes in its range. In addition to this, a property may be assigned several distinct formal attributes, such as symmetric, transitive or functional, and can be defined as the inverse of another property. Like classes, properties can be structured hierarchically, which, among other things, facilitates the use of underspecified information in queries (see section 3.2).</Paragraph> <Paragraph position="3"> Inferences. The ability to make implicit knowledge explicit through inference is a core feature of OWL DL; inference is performed by DL reasoners (such as FaCT, Pellet or RacerPro). The most basic inference is achieved via the subsumption relation among classes or properties in the respective hierarchy (see above), but more sophisticated inferences are also possible.</Paragraph> <Paragraph position="4"> Among others, these may involve the formal attributes of properties just mentioned. For example, stating that instance A is linked to B via a symmetric property P leads a reasoner to infer that B is also linked to A via P. 
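A minimal Python sketch of this kind of inference follows. It is an illustration only and does not reflect how FaCT, Pellet or RacerPro work internally; the function name and the sample instances are assumptions. Besides symmetry it also applies transitivity, so that a few explicit statements yield a whole group of interrelated instances.

```python
def close_symmetric_transitive(pairs):
    """Return the symmetric and transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        new_pairs = set()
        for a, b in closure:
            new_pairs.add((b, a))              # symmetry: P(a, b) gives P(b, a)
            for c, d in closure:
                if b == c and a != d:
                    new_pairs.add((a, d))      # transitivity: P(a, b), P(b, d) give P(a, d)
        if not new_pairs.issubset(closure):
            closure |= new_pairs
            changed = True
    return closure


# Hypothetical explicit statements about three instances.
explicit = {("A", "B"), ("B", "C")}
inferred = close_symmetric_transitive(explicit)
```

In this sketch, the two explicit statements are enough to link every pair of distinct instances among A, B and C in both directions.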
In conjunction with transitivity, a relatively small set of explicit statements may suffice to interrelate several instances implicitly (i.e. all instances in a particular equivalence class created by P).</Paragraph> <Paragraph position="5"> Consistency. In addition to inferences, DL reasoners can further be used to check the consistency of an OWL DL model. One of the primary objectives is to check whether the assertions made about classes and their instances (see above) are logically consistent or whether there are contradictions. This consistency checking is based on the open-world assumption, which states that &quot;what cannot be proven to be true is not believed to be false&quot; (Haarslev and Möller, 2005). Since lexical data occasionally demand a closed world, other checking formalisms are required, which are mentioned in section 3.2 below.</Paragraph> </Section> <Section position="2" start_page="66" end_page="68" type="sub_section"> <SectionTitle> 3.2 Monolingual Collocation Dictionary </SectionTitle> <Paragraph position="0"> A data model for a monolingual collocation dictionary based on OWL DL has been presented in (Spohr, 2005). It was designed using the Protégé OWL Plugin (Knublauch et al., 2004) and makes use of the advantages of OWL DL mentioned above.</Paragraph> <Paragraph position="1"> Lexical vs. descriptive entities. On the class level, the model distinguishes between lexical entities (e.g. single-word and multi-word entities, such as collocations or idioms) and descriptive entities (e.g. gender, part-of-speech, or subcategorisation frames), with lexical entities being linked to descriptive entities via properties. More than 40 of these descriptive properties have been modeled.</Paragraph> <Paragraph position="2"> In order to reflect the distinction between metalanguage vocabulary and object language vocabulary, the two types of entities can be separated such that they are part of different models. 
In other words, the classes and instances of descriptive entities constitute a model of descriptions, which is imported by a lexicon model containing classes and instances of lexical entities (see also section 4.1 below).</Paragraph> <Paragraph position="3"> Lexical relations. In addition to descriptive properties, the data model also contains a number of lexical relations linking lexical entities, such as morphological or semantic relations. These relations have been structured hierarchically and contain several subproperties, such as hasCompound or isSynonymOf, which use the formal attributes mentioned in section 3.1. For instance, isSynonymOf has been defined as a symmetric and transitive property (as opposed to the non-transitive isQuasiSynonymOf; see section 2), while hasCompound has been defined as the inverse of a property isCompoundOf. A small sample of descriptive and lexical relations of the collocation Kritik üben is illustrated in Figure 1 below. Semantic relations link lexical entities on the conceptual (i.e. word sense) level. Therefore, the synonym of Kritik üben is not some general single-word entity kritisieren VV, but a particular word sense of kritisieren, kritisieren VV 1 in this case (see Spohr (2005) for more detail).</Paragraph> <Paragraph position="4"> Queries. The data model can be queried very efficiently using the Sesame framework (Broekstra et al., 2002; Broekstra, 2005) and its associated query language SeRQL. An example query retrieving all collocations and their types is given below, along with a sample of the results.</Paragraph> <Paragraph position="5"> [Figure: query for collocations and their types, along with results.] [Footnote 8: In these examples, lex: is the namespace prefix for resources defined in the data model.]</Paragraph> <Paragraph position="6"> Because the relations in the data model have been structured hierarchically, it is possible to state underspecified queries. 
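The idea can be sketched in plain Python rather than SeRQL. Everything below is a hypothetical illustration: the superproperty names hasSemanticRelation and hasMorphologicalRelation, the property isAntonymOf, the antonym instance and the identifier spellings are assumptions, and only isSynonymOf, hasCompound and the lexical examples come from the text.

```python
# Hypothetical property hierarchy: child property -> parent property.
SUBPROPERTY_OF = {
    "isSynonymOf": "hasSemanticRelation",
    "isAntonymOf": "hasSemanticRelation",
    "hasCompound": "hasMorphologicalRelation",
}

# Hypothetical facts as (subject, property, object) triples.
TRIPLES = [
    ("Kritik_ueben", "isSynonymOf", "kritisieren_VV_1"),
    ("Kritik_ueben", "isAntonymOf", "some_antonym"),
    ("Pause", "hasCompound", "Kaffeepause"),
]


def is_subproperty(prop, ancestor):
    """True if prop equals ancestor or lies below it in the hierarchy."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = SUBPROPERTY_OF.get(prop)
    return False


def query(ancestor):
    """Return all triples whose property is subsumed by ancestor."""
    return [t for t in TRIPLES if is_subproperty(t[1], ancestor)]


# Underspecified query: any semantic relation, whatever its precise type.
semantic = query("hasSemanticRelation")
```

Asking for the superproperty returns both the synonym and the antonym triple, while asking for isSynonymOf returns only the former.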
Figure 3 illustrates an underspecified query for semantically related entities, regardless of the precise nature of this relation. Hence, the first two rows in the result table below contain synonym pairs, while the last two rows contain antonym pairs.</Paragraph> <Paragraph position="7"> As Figure 3 indicates, the results appear twice, i.e. they contain every combination of those entities between which the relation holds. This is due to the fact that the respective semantic relations have been defined as symmetric properties (see above).</Paragraph> <Paragraph position="8"> Consistency and data integrity. Section 3.1 mentioned the distinction between the open-world assumption and the closed-world assumption.</Paragraph> <Paragraph position="9"> While the consistency checking performed by DL reasoners is generally based on an open world, it is vital, especially for lexical data, to simulate a closed world in order to check data integrity. Consider, for instance, the assertion that every collocation has to have a base and a collocate. Due to the open-world assumption, a DL reasoner would never flag a collocation violating this constraint as inconsistent, simply because it cannot prove that this collocation has either no base or no collocate. In order for this to happen, the simulation of a closed world is needed. In our approach, this is achieved by stating consistency constraints in SeRQL. Figure 4 below illustrates a constraint for the purpose just mentioned.</Paragraph> <Paragraph position="10"> This query retrieves all collocations and subtracts those that have a path to both a base and a collocate. 
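The same subtraction can be sketched in plain Python. This is an illustration under assumptions, not the SeRQL constraint itself: the property names hasBase and hasCollocate, the identifier spellings and the incomplete sample entry are hypothetical.

```python
# Hypothetical instance data: collocation -> properties asserted for it.
ASSERTED = {
    "Kritik_ueben": {"hasBase": "Kritik", "hasCollocate": "ueben"},
    "Pause_einlegen": {"hasBase": "Pause", "hasCollocate": "einlegen"},
    "incomplete_entry": {"hasBase": "Pause"},  # collocate deliberately missing
}


def violations(asserted):
    """Return the collocations that lack a base or a collocate.

    Closed-world reading: a property that is not asserted is treated as absent.
    """
    all_collocations = set(asserted)
    complete = {
        name
        for name, props in asserted.items()
        if "hasBase" in props and "hasCollocate" in props
    }
    # All collocations minus the complete ones leaves the violators.
    return all_collocations - complete


missing = violations(ASSERTED)
```

Treating unasserted properties as absent is what distinguishes this check from the open-world reasoning of a DL reasoner.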
The result set then contains exactly those instances which have either no base or no collocate.</Paragraph> <Paragraph position="11"> [Figure 4: consistency constraint: does every collocation have a base and a collocate?]</Paragraph> </Section> </Section> <Section position="5" start_page="68" end_page="69" type="metho"> <SectionTitle> 4 Bilingual Model Architecture </SectionTitle> <Paragraph position="0"> Based on the definition of a monolingual collocation dictionary described above, the architecture of a bilingual dictionary model can be designed such that it is made up of several components (i.e. OWL models). These are introduced in the following.</Paragraph> <Section position="1" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 4.1 Components of a Bilingual Dictionary </SectionTitle> <Paragraph position="0"> The components of a bilingual dictionary are illustrated in Figure 5.</Paragraph> <Paragraph position="1"> Model of descriptions. The most basic component of a bilingual dictionary model is a model of descriptions, which contains language-independent classes and instances of descriptive entities, as well as the relations among them (see section 3.2).</Paragraph> <Paragraph position="2"> Lexicon model. The model of descriptions is imported by an abstract lexicon model via the owl:imports statement (see Bechhofer et al., 2004). The effect of using the import statement is that the lexicon model can access the classes, instances and properties defined in the description model without being able to alter the data therein. In addition to the classes thus made available, the lexicon model further provides classes of lexical entities and relations among them, as well as relations linking lexical and descriptive entities.</Paragraph> <Paragraph position="3"> Monolingual dictionary model. The lexicon model serves as input for the creation of a monolingual dictionary model, i.e. the lexicon model is not imported by the dictionary model, rather the dictionary model is an instantiation of it. 
There are practical reasons for doing so, the most important one being that the class of lexical entities (defined in the lexicon model) and its instances (defined in the monolingual dictionary) thus have the same namespace prefix, which would not be the case if the lexicon model were imported by the monolingual dictionary. The advantages are most obvious in the context of the mapping between monolingual dictionary models (see section 4.2). Finally, a monolingual dictionary may further introduce its own instances (or even classes) of descriptive entities, i.e. descriptions which are language-specific and which are hence not part of the language-independent model of descriptions (see above). Translation model. The translation model is an abstract model containing only relations between monolingual dictionary models, i.e. it does not contain class definitions. Since the model is required to be generic, these relations do not have a specified domain and range, as otherwise the translation model would be restricted to a single language pair. The specification of the domain and range of the relations is performed in the final model of the bilingual dictionary.</Paragraph> <Paragraph position="4"> Bilingual dictionary model. The bilingual dictionary model is an instantiation of the translation model. It further imports two monolingual dictionary models and specifies the domain and range of the abstract relations in the translation model (see section 4.2 below).</Paragraph> </Section> <Section position="2" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 4.2 Mapping between Models </SectionTitle> <Paragraph position="0"> By importing the monolingual dictionaries, each of these models is assigned a unique namespace prefix, e.g. english: or german:. Thus, in an English-German dictionary, for instance, a relation called hasTranslation may be defined as a symmetric property linking lexical entities of the English monolingual dictionary model (i.e. 
its domain is defined for instances with the english: prefix) to lexical entities of the German model (i.e. instances with german:). This translation mapping is illustrated in Figure 6 for the collocation Kritik üben.</Paragraph> <Paragraph position="1"> [Figure 6: mapping between monolingual dictionaries; recoverable label: express criticism.] As is indicated there, multi-word entities can be translated as single-word entities and vice versa. Moreover, since hasTranslation has been defined as a symmetric property, the translation mapping is bidirectional. However, since some instance in one language model might not have an equivalent instance in the other model, a further property can be defined which links the respective entity to a new instance created in the bilingual model (see Paraphrase in the figure above). As this instance is only required for the modeling of this particular bilingual dictionary, it is not part of the &quot;original&quot; monolingual models, and hence the relation between the respective entities is not bidirectional. In addition to the translation mapping of lexical entities, it may further be necessary to map instances of descriptive entities of one model onto instances in the other model. As was mentioned in section 4.1, the model of descriptions contains language-independent descriptive entities. Since both monolingual dictionaries import the model of descriptions (via the lexicon model), the two &quot;versions&quot; of it are unified in the bilingual model. However, it is certainly conceivable to have two languages which both avail themselves of a descriptive entity that is not language-independent, but which is the same for the two languages in question. For example, not all languages have the gender neuter. English and German, however, do have it, and therefore an English-German bilingual dictionary has to express that english:neuter is the same as german:neuter. 
In OWL, this can be achieved by using the owl:sameAs statement, which expresses exactly the circumstances just mentioned.</Paragraph> </Section> <Section position="3" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 4.3 Example Query </SectionTitle> <Paragraph position="0"> A query retrieving the situation depicted in Figure 6 is given below. It extracts the (quasi-)synonyms of Kritik üben (which Kritik üben itself is a part of) and their respective translations and/or paraphrases. The latter is achieved by restricting the properties that Rel2 may stand for to those having the prefix bdm:, i.e. the prefix defined for the bilingual dictionary model. In other words, the query leaves the exact relation between B and C underspecified and simply restricts it to being defined in the bilingual dictionary, which only contains relations linking instances belonging to different monolingual dictionaries. The results are shown in the table below.</Paragraph> </Section> </Section> </Paper>