<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1001">
  <Title>LEXICAL MARKUP FRAMEWORK (LMF) FOR NLP MULTILINGUAL RESOURCES</Title>
  <Section position="4" start_page="7" end_page="7" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Lexical Markup Framework (LMF) is a model that provides a common standardized framework for the construction of Natural Language Processing (NLP) lexicons. The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources.</Paragraph>
    <Paragraph position="1"> Individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications are to be used for both small and large lexicons. The descriptions range from morphology, syntax and semantics to translation information, organized as different extensions of an obligatory core package. The model is being developed to cover all natural languages. The range of targeted NLP applications is not restricted. LMF is also used to model machine readable dictionaries (MRD), which are not within the scope of this paper.</Paragraph>
  </Section>
  <Section position="5" start_page="7" end_page="7" type="metho">
    <SectionTitle>
2 History and current context
</SectionTitle>
    <Paragraph position="0"> In the past, this subject has been studied and developed by a series of projects such as GENELEX [Antoni-Lay], EAGLES, MULTEXT, PAROLE, SIMPLE, ISLE and MILE [Bertagna]. More recently, within ISO, the standard for terminology management has been successfully elaborated by sub-committee three of ISO-TC37 and published under the name &quot;Terminology Markup Framework&quot; (TMF) with the ISO-16642 reference. Afterwards, the ISO-TC37 national delegations decided to address standards dedicated to NLP. These standards are currently being elaborated as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO 24611, 24612 and 24615), feature structures (ISO 24610), and lexicons (ISO 24613), the last of which is the focus of this paper. These standards are based on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), script codes (ISO 15924), country codes (ISO 3166), dates (ISO 8601) and Unicode (ISO 10646).</Paragraph>
    <Paragraph position="1"> This work is in progress. The two level organization will form a coherent family of standards with the following simple rules: 1) the low level specifications provide standardized constants; 2) the high level specifications provide structural elements that are adorned by the standardized constants.</Paragraph>
  </Section>
  <Section position="6" start_page="7" end_page="7" type="metho">
    <SectionTitle>
3 Scope and challenges
</SectionTitle>
    <Paragraph position="0"> Designing a lexicon model that satisfies every user is not an easy task. All efforts are therefore directed at elaborating a proposal that fits the major needs of most existing models.</Paragraph>
    <Paragraph position="1"> In order to summarise the objectives, let's see what is in the scope and what is not.</Paragraph>
    <Paragraph position="2"> LMF addresses the following difficult challenges: * Represent words in languages where multiple orthographies (native scripts or transliterations) are possible, e.g. some Asian languages.</Paragraph>
    <Paragraph position="3"> * Represent explicitly (i.e. in extension) the morphology of languages where a description of all inflected forms (from a list of lemmatised forms) is manageable, e.g.</Paragraph>
    <Paragraph position="4"> English.</Paragraph>
    <Paragraph position="5"> * Represent the morphology of languages where a description in extension of all inflected forms is not manageable (e.g. Hungarian). In this case, representation in intension is the only manageable option.</Paragraph>
    <Paragraph position="6">  language.</Paragraph>
    <Paragraph position="7"> Linguistic constants, like /feminine/ or /transitive/, are not defined within LMF but are specified in the Data Category Registry (DCR) that is maintained as a global resource by ISO TC37 in compliance with ISO/IEC 11179-3:2003. The LMF specification complies with the modeling principles of Unified Modeling Language (UML) as defined by OMG [Rumbaugh 2004]. A model is specified by a UML class diagram within a UML package: the class name is not underlined in the diagrams. The various examples of word description are represented by UML instance diagrams: the class name is underlined.</Paragraph>
  </Section>
  <Section position="7" start_page="7" end_page="7" type="metho">
    <SectionTitle>
5 Structure and core package
</SectionTitle>
    <Paragraph position="0"> LMF is comprised of two components: 1) The core package consists of a structural skeleton that describes the basic hierarchy of information in a lexical entry.</Paragraph>
    <Paragraph position="1"> 2) Extensions to the core package are expressed in a framework that describes the reuse of the core components in conjunction with additional components required for the description of the contents of a specific lexical resource.</Paragraph>
    <Paragraph position="2"> In the core package, the class called Database represents the entire resource and is a container for one or more lexicons. The Lexicon class is the container for all the lexical entries of the same language within the database. The Lexicon Information class contains administrative information and other general attributes. The Lexical Entry class is a container for managing the top level language components. As a consequence, the number of representatives of single words, multi-word expressions and affixes of the lexicon is equal to the number of lexical entries in a given lexicon. The Form and Sense classes are parts of the Lexical Entry. Form consists of a text string that represents the word. Sense specifies or identifies the meaning and context of the related form. Therefore, the Lexical Entry manages the relationship between sets of related forms and their senses. If there is more than one orthography for the word form (e.g. transliteration), the Form class may be associated with one to many Representation Frames, each of which contains a specific orthography and one to many data categories that describe the attributes of that orthography. The core package classes are linked by the relations defined in the following UML class diagram: [Figure: UML class diagram of the LMF core package]. A subset of the core package classes is extended to cover different kinds of linguistic data. All extensions conform to the LMF core package and cannot be used to represent lexical data independently of the core package. From the point of view of UML, an extension is a UML package. Current extensions for NLP dictionaries are:</Paragraph>
  </Section>
  <Section position="8" start_page="7" end_page="7" type="metho">
    <SectionTitle>
NLP Morphology
</SectionTitle>
    <Paragraph position="0"> , NLP inflectional paradigm, NLP Multiword Expression pattern, NLP Syntax, NLP Semantic and Multilingual notations, which is the focus of this paper.</Paragraph>
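    <Paragraph> As a rough illustration only, the core package hierarchy described in Section 5 could be mirrored with plain Python data classes. The class and field names below (written_form, data_categories, representation_frames, information, and so on) and the sample entry are assumptions made for readability; they are not the normative LMF class names or its DTD.
<![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepresentationFrame:
    # One specific orthography plus the data categories describing it.
    written_form: str
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class Form:
    # Text string that represents the word.
    text: str
    representation_frames: List[RepresentationFrame] = field(default_factory=list)


@dataclass
class Sense:
    # Identifies the meaning and context of the related form.
    label: str


@dataclass
class LexicalEntry:
    # One entry per single word, multi-word expression or affix.
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Lexicon:
    # Container for all lexical entries of one language.
    language: str
    information: Dict[str, str] = field(default_factory=dict)  # Lexicon Information
    entries: List[LexicalEntry] = field(default_factory=list)


@dataclass
class Database:
    # The entire resource: a container for one or more lexicons.
    lexicons: List[Lexicon] = field(default_factory=list)


# Hypothetical content: one word with a native-script form and a transliteration.
entry = LexicalEntry(
    forms=[Form(
        text="東京",
        representation_frames=[
            RepresentationFrame("東京", {"script": "Jpan"}),
            RepresentationFrame("Tōkyō", {"script": "Latn"}),
        ],
    )],
    senses=[Sense(label="jpn:tokyo")],
)
database = Database(lexicons=[Lexicon(language="jpn", entries=[entry])])
print(len(database.lexicons[0].entries))
```
]]>
    In this sketch the containment chain Database, Lexicon, Lexical Entry, Form/Sense is expressed by nested lists, and the multiple orthographies of a form are carried by its Representation Frames.</Paragraph>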
  </Section>
  <Section position="9" start_page="7" end_page="7" type="metho">
    <SectionTitle>
6 NLP Multilingual Extension
</SectionTitle>
    <Paragraph position="0"> The NLP multilingual notation extension is dedicated to the description of the mapping between two or more languages in an LMF database.</Paragraph>
    <Paragraph position="1"> The model is based on the notion of Axis that links Senses, Syntactic Behavior and examples pertaining to different languages. &quot;Axis&quot; is a term taken from the Papillon project. Axis can be organized at the lexicon manager's convenience in order to link directly or indirectly objects of different languages.</Paragraph>
    <Paragraph position="2"> (Footnote: The Morphology, Syntax and Semantic packages are described in [Francopoulo].)</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.1 Considerations for standardizing multilingual data
</SectionTitle>
      <Paragraph position="0"> The simplest configuration of multilingual data is a bilingual lexicon where a single link is used to represent the translation of a given form/sense pair from one language into another.</Paragraph>
      <Paragraph position="1"> But a survey of actual practices clearly reveals other requirements that make the model more complex. Consequently, LMF has focused on the following ones: (i) Cases where a 1-to-1 relation is impossible because of lexical differences among languages. An example is the English word &quot;river&quot;, which relates to the French words &quot;riviere&quot; and &quot;fleuve&quot;, where the latter specifies that the referent is a river that flows into the sea. The bilingual lexicon should specify how these units relate.</Paragraph>
      <Paragraph position="2"> (ii) The bilingual lexicon approach should be optimized to allow the easiest management of large databases for real multilingual scenarios. In order to reduce the explosion of links in a multi-bilingual scenario, translation equivalence can be managed through an intermediate &quot;Axis&quot;. This object can be shared in order to keep the number of links within manageable proportions.</Paragraph>
      <Paragraph position="3"> (iii) The model should cover both transfer and pivot approaches to translation, taking also into account hybrid approaches. In LMF, the pivot approach is implemented by a &quot;Sense Axis&quot;. The transfer approach is implemented by a &quot;Transfer Axis&quot;.</Paragraph>
      <Paragraph position="4"> (iv) A situation that is not very easy to deal with is how to represent translations to languages that are similar or variants. The problem arises, for instance, when the task is to represent translations from English to both European Portuguese and Brazilian Portuguese. It is difficult to consider them as two separate languages. In fact, one is a variant of the other. The differences are minor: a certain number of words are different and some limited phenomena in syntax are different. (Footnote: To be more precise, Papillon uses the term &quot;axie&quot;, from &quot;axis&quot; and &quot;lexie&quot;. At the beginning of the LMF project we used the term &quot;axie&quot;, but after some negative comments about using a non-English term in a standard, we decided to use the term &quot;axis&quot;.)</Paragraph>
      <Paragraph position="5"> Instead of managing two distinct copies, it is more effective to manage one lexicon with some objects that are marked with a dialectal attribute.</Paragraph>
      <Paragraph position="6"> Concerning the translation from English to Portuguese: a limited number of specific Axis instances record this variation and the vast majority of Axis instances are shared.</Paragraph>
      <Paragraph position="7"> (v) The model should allow for representing the information that restricts or conditions the translations. The representation of tests that combine logical operations upon syntactic and semantic features must be covered.</Paragraph>
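      <Paragraph> The argument of point (ii) can be made concrete with a small piece of arithmetic. The sketch below assumes, for simplicity, that one concept has exactly one sense in each of n languages and that links are undirected; the function names are hypothetical and only serve the illustration.
<![CDATA[
```python
def pairwise_links(n_languages: int) -> int:
    # One undirected translation link per pair of languages.
    return n_languages * (n_languages - 1) // 2


def shared_axis_links(n_languages: int) -> int:
    # One link from each language's sense to the single shared axis.
    return n_languages


for n in (2, 4, 10, 20):
    print(n, pairwise_links(n), shared_axis_links(n))
```
]]>
      Under these assumptions, a purely bilingual lexicon needs fewer links (1 instead of 2), but from four languages onward the shared axis is cheaper, and with 10 languages it needs 10 links instead of 45.</Paragraph>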
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.2 Structure
</SectionTitle>
      <Paragraph position="0"> The model is based on the notion of Axis that links Senses, Syntactic Behavior and examples pertaining to different languages. Axis can be organized at the lexicon manager's convenience in order to link directly or indirectly objects of different languages. A direct link is implemented by a single axis. An indirect link is implemented by several axes and one or several relations.</Paragraph>
      <Paragraph position="1"> The model is based on three main classes: Sense Axis, Transfer Axis, Example Axis.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.3 Sense Axis
</SectionTitle>
      <Paragraph position="0"> Sense Axis is used to link closely related senses in different languages, under the same assumptions as the interlingual pivot approach, and, optionally, it can also be used to refer to one or several external knowledge representation systems. The use of the Sense Axis facilitates the representation of the translation of words that do not necessarily have the same valence or morphological form in one language as in another. For example, a single word in one language may be translated by a compound word in another language: English &quot;wheelchair&quot; to Spanish &quot;silla de ruedas&quot;. Sense Axis may have the following attributes: a label, the name of an external descriptive system, a reference to a specific node inside an external description.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.4 Sense Axis Relation
</SectionTitle>
      <Paragraph position="0"> Sense Axis Relation describes the link between two different Sense Axis instances. The element may have attributes like label, view, etc.</Paragraph>
    </Section>
    <Section position="5" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.6 Transfer Axis Relation
</SectionTitle>
      <Paragraph position="0"> Transfer Axis Relation links two Transfer Axis instances. The element may have attributes like label and variation.</Paragraph>
      <Paragraph position="1"> The label enables the coding of simple interlingual relations like the specialization of &quot;fleuve&quot; compared to &quot;riviere&quot; and &quot;river&quot;. It is not, however, the goal of this strategy to code a complex system for knowledge representation, which ideally should be structured as a complete coherent system designed specifically for that purpose.</Paragraph>
    </Section>
    <Section position="6" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.7 Source Test and Target Test
</SectionTitle>
      <Paragraph position="0"> Source Test expresses a condition on the translation on the source language side, while Target Test does so on the target language side.</Paragraph>
      <Paragraph position="1"> Both elements may have attributes like: text and comment.</Paragraph>
    </Section>
    <Section position="7" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.5 Transfer Axis
</SectionTitle>
      <Paragraph position="0"> Transfer Axis is designed to represent the multilingual transfer approach. Here, linkage refers to information contained in syntax. For example, this approach enables the representation of syntactic actants involving inversion, such as (1):</Paragraph>
    </Section>
    <Section position="8" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.8 Example Axis
</SectionTitle>
      <Paragraph position="0"> Example Axis supplies documentation for [...] Because a lexical entry can be a support verb, it is possible to represent translations that go from a plain verb to a support verb, as in (2), which means &quot;Mary dreams&quot;:</Paragraph>
    </Section>
    <Section position="9" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.9 Class Model Diagram
</SectionTitle>
      <Paragraph position="0"> The UML class model is a UML package. The diagram for multilingual notations is as follows:</Paragraph>
    </Section>
    <Section position="10" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
7.1 First example
</SectionTitle>
      <Paragraph position="0"> The first example is about the interlingual approach with two axis instances to represent a near match between &quot;fleuve&quot; in French and &quot;river&quot; in English. In the diagram, French is located on the left side and English on the right side. The axis on the top is not linked directly to any English sense because this notion does not exist in English.</Paragraph>
      <Paragraph position="1"> [Figure: UML instance diagram. Objects: Sense Axis Relation (comment = flows into the sea; label = more precise); Sense (label = fra:fleuve); Sense (label = fra:riviere); Sense (label = eng:river); Sense Axis; Sense Axis.]</Paragraph>
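      <Paragraph> The instance diagram of this first example can be approximated in code. The following is a minimal sketch, not the LMF data model itself: the field names, the direction of the relation (from the more precise axis to the more general one) and the helper function translations are assumptions introduced for the illustration, while the labels and the comment come from the diagram.
<![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    label: str  # language prefix plus lemma, as in the diagram


@dataclass
class SenseAxis:
    senses: List[Sense] = field(default_factory=list)


@dataclass
class SenseAxisRelation:
    source: SenseAxis  # assumed: the more precise axis
    target: SenseAxis  # assumed: the more general axis
    label: str
    comment: Optional[str] = None


def translations(sense, axes, relations, target_language):
    """Return (label, how) pairs for target-language senses linked to `sense`."""
    results = []
    prefix = target_language + ":"
    for axis in axes:
        if sense not in axis.senses:
            continue
        # Direct link: both senses sit on the same axis.
        for other in axis.senses:
            if other.label.startswith(prefix):
                results.append((other.label, "direct"))
        # Indirect link: follow a relation that starts at this axis.
        for rel in relations:
            if rel.source is axis:
                for other in rel.target.senses:
                    if other.label.startswith(prefix):
                        results.append((other.label, "via '" + rel.label + "'"))
    return results


fleuve = Sense("fra:fleuve")
riviere = Sense("fra:riviere")
river = Sense("eng:river")

top_axis = SenseAxis([fleuve])             # no English sense attached
bottom_axis = SenseAxis([riviere, river])  # French and English share this axis
relation = SenseAxisRelation(top_axis, bottom_axis,
                             label="more precise",
                             comment="flows into the sea")

axes = [top_axis, bottom_axis]
print(translations(riviere, axes, [relation], "eng"))
print(translations(fleuve, axes, [relation], "eng"))
```
]]>
      In this sketch &quot;riviere&quot; reaches &quot;river&quot; through a direct link (a single axis), whereas &quot;fleuve&quot; reaches it through an indirect link (two axes and one relation), and the relation label records that the French sense is more precise.</Paragraph>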
    </Section>
    <Section position="11" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
7.2 Second example
</SectionTitle>
      <Paragraph position="0"> Let us now look at an example of the transfer approach that involves slight variations between variants. The example concerns English on one side and European Portuguese and Brazilian Portuguese on the other. Because these two variants have a very similar syntax, with some local exceptions, the goal is to avoid a full and needless duplication. For instance, the nominative forms of the third person clitics are largely preferred in Brazilian Portuguese, whereas European Portuguese uses the oblique form. The transfer axis relations hold a label to distinguish which axis to use depending on the target object.</Paragraph>
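      <Paragraph> One possible reading of this mechanism is sketched below. It assumes that a shared Transfer Axis serves both variants by default and that a Transfer Axis Relation carrying a variation value points to a variant-specific axis; the field names, the axis labels and the function select_axis are hypothetical and are not taken from the LMF specification.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass
class TransferAxis:
    label: str


@dataclass
class TransferAxisRelation:
    source: TransferAxis  # assumed: the shared axis
    target: TransferAxis  # assumed: the variant-specific axis
    variation: str        # e.g. "Brazilian" or "European"


def select_axis(shared, relations, variant):
    """Pick the variant-specific axis if one is declared, else the shared one."""
    for rel in relations:
        if rel.source is shared and rel.variation == variant:
            return rel.target
    return shared


# Hypothetical axes: most constructions are shared, one has a local exception.
common = TransferAxis("eng-por: ordinary construction")
clitic = TransferAxis("eng-por: third person clitic, oblique form")
clitic_br = TransferAxis("eng-por: third person clitic, nominative form")
relations = [TransferAxisRelation(clitic, clitic_br, variation="Brazilian")]

print(select_axis(common, relations, "Brazilian").label)
print(select_axis(clitic, relations, "Brazilian").label)
print(select_axis(clitic, relations, "European").label)
```
]]>
      With this organization only the exceptional construction needs a second axis; every other axis is stored once and used for both variants, which is the duplication the example sets out to avoid.</Paragraph>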
    </Section>
    <Section position="12" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
7.3 Third example
</SectionTitle>
      <Paragraph position="0"> A third example shows how to use the Transfer Axis Relation to relate different information in a multilingual transfer lexicon. It represents the translation of the English &quot;develop&quot; into Italian and Spanish. Recall that the more general sense links &quot;eng:develop&quot; and &quot;esp:desarrollar&quot;. Both Spanish and Italian have restrictions that should be tested in the source language: if the second argument of the construction refers to certain elements (picture, mentalCreation, building), it should be translated into specific verbs.</Paragraph>
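      <Paragraph> The role of a Source Test in this third example can be sketched as follows. This is a simplified illustration under stated assumptions: the test is reduced to a single semantic type compared by equality (the model itself is meant to cover logical combinations of syntactic and semantic features), the function choose_target is hypothetical, and the labels esp:PICTURE_VERB and esp:BUILDING_VERB are placeholders because the specific target verbs are not named in the text.
<![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceTest:
    text: str                      # condition, here a semantic type name
    comment: Optional[str] = None


@dataclass
class TransferAxis:
    source_label: str
    target_label: str
    source_test: Optional[SourceTest] = None  # None means no restriction


def choose_target(axes, source_label, second_argument_type):
    """Prefer an axis whose source test matches; fall back to the untested one."""
    fallback = None
    for axis in axes:
        if axis.source_label != source_label:
            continue
        if axis.source_test is None:
            fallback = axis.target_label
        elif axis.source_test.text == second_argument_type:
            return axis.target_label
    return fallback


axes = [
    # General sense link given in the text.
    TransferAxis("eng:develop", "esp:desarrollar"),
    # Placeholder targets: the specific verbs are not given in the text.
    TransferAxis("eng:develop", "esp:PICTURE_VERB", SourceTest("picture")),
    TransferAxis("eng:develop", "esp:BUILDING_VERB", SourceTest("building")),
]

print(choose_target(axes, "eng:develop", "picture"))
print(choose_target(axes, "eng:develop", "software"))
```
]]>
      The restricted axes are tried first and the unrestricted axis acts as the general translation, so a second argument of type picture selects the specific verb while any other type falls back to &quot;esp:desarrollar&quot;.</Paragraph>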
      <Paragraph position="1"> During the last three years, the ISO group focused on the UML specification. In the latest version of the LMF document [LMF 2006], a DTD has been provided as an informative annex. The following conventions are adopted:  tions that are not aggregations) are transcoded as IDREF(S). The first example (i.e. &quot;river&quot;) can be represented with the following XML tags:</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="7" end_page="7" type="metho">
    <SectionTitle>
9 Comparison
</SectionTitle>
    <Paragraph position="0"> A serious comparison with previously existing models is not possible in this paper for lack of space. We advise interested readers to consult the technical report &quot;Extended examples of lexicons using LMF&quot;, located at &quot;http://lirics.loria.fr&quot; in the document area. The report explains how to use LMF in order to represent OLIF-2, Parole/Clips, LC-Star, WordNet, FrameNet and BDef.</Paragraph>
  </Section>
</Paper>