<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1110"> <Title>Frameworks, Implementation and Open Problems for the Collaborative Building of a Multilingual Lexical Database</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 1 The Papillon Project </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Motivations </SectionTitle> <Paragraph position="0"> The Papillon project brings together different people sharing common problems and solutions.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1.1 A Lack of Resources </SectionTitle> <Paragraph position="0"> On the Internet, many free dictionaries are available, but very few of them involve more than two languages. Most of these dictionaries include English as one of their languages.</Paragraph> <Paragraph position="1"> Furthermore, the existing dictionaries often lack information essential for beginners or NLP systems.</Paragraph> <Paragraph position="2"> Another factor contributing to this scarcity: the high development cost of large lexical resources for NLP implies a high price, dissuasive for the end user.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1.2 Existing Structures and Tools for Multilingual Dictionaries </SectionTitle> <Paragraph position="0"> Some partners of the Papillon project have been involved in research on the definition of structures and tools to handle multilingual lexical databases.</Paragraph> <Paragraph position="1"> They were looking for an opportunity to apply their research results to real-scale lexical data. Most partners were participating, as computer scientists, in the development of open source products.
With the democratisation of Internet access in many countries came the opportunity to apply the open source principles to the development of a multipurpose, multilingual lexical database.</Paragraph> <Paragraph position="2"> Cooperative projects for bilingual dictionaries are already under way, such as EDICT, a Japanese-English dictionary led by Jim Breen (2001) for more than 10 years, and, more recently, SAIKAM, a Japanese-Thai dictionary (see Ampornaramveth (2000)).</Paragraph> <Paragraph position="3"> With the Papillon project, the dictionary is extended to a multilingual lexical database. Volunteers will find lexicons developed by others and some tools to complete or correct the Papillon multilingual dictionary. Users will also be able to define their own personal views of the database.</Paragraph> </Section> <Section position="4" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 1.2 Dictionary Markup Language Framework </SectionTitle> <Paragraph position="0"> Mathieu Mangeot-Lerebours (2001) defines a complete framework for the consultation and the construction of dictionaries. The framework is completely generic in order to manage heterogeneous dictionaries with their own proper structures. This framework is extensively used in the Papillon project.</Paragraph> <Paragraph position="1"> The framework consists in the definition of a markup language, DML (Dictionary Markup Language); any dictionary of the database can be described with DML elements. The entire hierarchy of the XML files, elements and attributes is described using XML schemata and grouped into the DML namespace. Figure 1 describes the organisation of the main DML elements.</Paragraph> <Paragraph position="2"> The XML schemata are available online. This allows users to edit and validate their files online with an XML schema validator.</Paragraph> <Paragraph position="3"> The DML framework may be used to encode many different dictionary structures. Indeed, two dictionary structures can be radically different.
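As a toy illustration of such heterogeneity (the entry layouts and paths below are hypothetical, not actual Papillon or DML structures), two dictionaries may encode the same piece of lexical information, here a headword, under radically different element layouts, yet a per-dictionary pointer table can still retrieve it uniformly:

```python
import xml.etree.ElementTree as ET

# Two hypothetical dictionary entries with radically different structures.
entry_a = ET.Element("entry")
ET.SubElement(entry_a, "headword").text = "rice"

entry_b = ET.Element("article")
form = ET.SubElement(entry_b, "form")
ET.SubElement(form, "orth").text = "riz"

# One path per dictionary, pointing at the element that plays the
# role of "headword" in that dictionary's own structure.
pointers = {
    "dict_a": "headword",
    "dict_b": "form/orth",
}

def get_headword(dictionary_id, entry):
    # Follow the dictionary-specific pointer to the shared information.
    return entry.find(pointers[dictionary_id]).text

print(get_headword("dict_a", entry_a))  # rice
print(get_headword("dict_b", entry_b))  # riz
```

The pointer table is the idea behind a common markup: tools query the shared names and never need to know each dictionary's internal layout.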
In order to handle such heterogeneous structures with the same tools, we have defined a subset of DML elements and attributes (see http://www-clips.imag.fr/geta/services/dml) that are used to identify which parts of the different structures represent the same lexical information. This subset is called Common Dictionary Markup (CDM). This set is in constant evolution. If the same kind of information is found in several dictionaries, then a new element representing this piece of information is added to the CDM set. It allows tools to access common information in heterogeneous dictionaries by way of pointers into the structures of the dictionaries. 1.3 Three Layers for the Lexical Data The lexical data repository of the Papillon project is divided into 4 subdirectories: Administration contains guidelines and administrative files</Paragraph> <Paragraph position="5"> The names of the files and directories are normalised in order to allow easy navigation in the repository.</Paragraph> <Paragraph position="6"> All lexical data stored in the repository is free of rights or protected by a GPL-like licence.</Paragraph> <Paragraph position="7"> This directory contains lexical data in their original format. When a dictionary is received, it is first stored there while waiting to be &quot;recycled&quot;. For each dictionary, we create a metadata file containing all available information concerning the dictionary (name, languages covered, creation date, size, authors, domain, etc.). It is then used to evaluate the quality of the dictionary and to guide the recycling process. These dictionaries are freely downloadable as they are. The Purgatory directory receives the lexical data once the recuperation process is over. This process consists in converting the lexical data from its original format into XML encoded in UTF-8.
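As a toy sketch of such a conversion (assuming a hypothetical slash-delimited source line in the style of EDICT; the field names are illustrative, not the project's actual output format):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source line: headword, reading in brackets, then glosses.
line = "gohan [gohan] /cooked rice/meal/"

# Split the line into its three fields with a regular expression.
match = re.match(r"(\S+) \[(.+?)\] /(.+)/", line)
headword, reading, glosses = match.groups()

# Rebuild the entry as XML; ElementTree serialises it to UTF-8 on request.
entry = ET.Element("entry")
ET.SubElement(entry, "headword").text = headword
ET.SubElement(entry, "reading").text = reading
for gloss in glosses.split("/"):
    ET.SubElement(entry, "gloss").text = gloss

xml_bytes = ET.tostring(entry, encoding="utf-8")
```

Real source formats are messier than this single pattern, which is why a dedicated recuperation methodology is needed rather than one regular expression.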
To perform this task, we use the RECUPDIC methodology described in Doan-Nguyen (1998) together with regular-expression tools such as Perl scripts.</Paragraph> <Paragraph position="8"> If a dictionary is already encoded in XML, the recuperation process consists in mapping the elements of information into CDM elements and storing the correspondence in the metadata file. Internet users access these dictionaries as classical online dictionaries, retrieving individual entries by way of requests on the Papillon web site.</Paragraph> <Paragraph position="9"> The Paradise directory contains only one dictionary, often called the &quot;Papillon dictionary&quot;. This dictionary has a particular DML structure. Internet users access entries of this dictionary by way of requests to the Papillon web site.</Paragraph> <Paragraph position="10"> It is possible to retrieve only one entry, or any subset of entries, in any available output format. The &quot;native&quot; format is the Papillon textual XML DML format in UTF-8. Users also have ways to add new entries or correct existing ones online. Other purgatory dictionaries may be integrated into the Papillon dictionary with the help of the CDM elements.</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 The Papillon Multilingual </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Dictionary 2.1 Macrostructure </SectionTitle> <Paragraph position="0"> The architecture of the Papillon multilingual dictionary is based on Serasset (1994) and has been prototyped by Blanc (1999). This architecture uses a pivot structure based on multiple monolingual volumes linked to an interlingual acception volume.</Paragraph> <Paragraph position="1"> Each entry of a monolingual volume represents a word sense.
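This pivot macrostructure can be sketched as a toy data model (the identifiers and volume contents below are hypothetical; the real volumes are XML documents):

```python
# Monolingual volumes: one entry per word sense, keyed by (language, sense id).
monolingual = {
    ("fra", "riz.1"): "riz",
    ("eng", "rice.1"): "rice",
    ("msa", "nasi.1"): "nasi",
}

# Interlingual volume: each acception lists the word senses it unites.
acceptions = {
    "axi-1": [("fra", "riz.1"), ("eng", "rice.1")],
    "axi-2": [("msa", "nasi.1")],
}

def translations(language, sense_id, target_language):
    # Follow every acception containing the source sense into the target volume.
    results = []
    for senses in acceptions.values():
        if (language, sense_id) in senses:
            for lang, sid in senses:
                if lang == target_language:
                    results.append(monolingual[(lang, sid)])
    return results

print(translations("fra", "riz.1", "eng"))  # ['rice']
```

The last function illustrates why bilingual pairs never need to be stored directly: any source-target lookup goes through the interlingual volume.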
In this document, we use the term &quot;lexie&quot;, as in the Explanatory Combinatorial Dictionary, to name a monolingual entry. The meaning of &quot;lexie&quot; is not the same as &quot;lexeme&quot;. A lexie is a complete monolingual entry.</Paragraph> <Paragraph position="2"> The interlingual volume gathers all the interlingual acceptions. An interlingual acception represents the union of word senses or &quot;lexies&quot; considered as &quot;equivalent&quot; among different monolingual volumes. This equivalence is calculated from translation links. In this document, we use the term &quot;axie&quot; to name an interlingual acception.</Paragraph> <Paragraph position="3"> Real contrastive problems in lexical equivalence (not to be confused with monolingual polysemy, homonymy or synonymy, as clearly explained in Mel'cuk and Wanner (2001)) are handled by way of a special kind of link between axies. Figure 2 illustrates this architecture using a classical example involving &quot;rice&quot; in 4 languages. In this example, we used the word senses as given by the &quot;Petit Robert&quot; dictionary for French and the &quot;Longman Dictionary of Contemporary English&quot; for English. As shown, the French and English dictionaries do not make any word sense distinction between cooked and uncooked rice seeds. However, this distinction is clearly made in Japanese and Malay. No axie may be used to denote the union of the word senses for Malay &quot;nasi&quot; and &quot;beras&quot; unless we want to consider them as true synonyms in Malay (which would be false). Hence, we have to create 3 different axies: one for the union of &quot;nasi&quot; and Japanese &quot;gohan&quot;, another for the union of &quot;beras&quot; and &quot;kome&quot;, and one for the union of &quot;rice&quot; and &quot;riz&quot;.
A link (dashed line in Figure 2) has to be added between the third axie and the others in order to keep the translation equivalence between the word senses.</Paragraph> <Paragraph position="4"> Note that the links between axies do not bear any particular semantics and should not be confused with some kind of ontological links.</Paragraph> <Paragraph position="5"> Bilingual dictionaries can be obtained from the multilingual dictionary.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Microstructure </SectionTitle> <Paragraph position="0"> The structure of the lexies (units of the monolingual dictionaries) is based on Polguere (2000) and Mel'cuk's work on explanatory and combinatorial lexicography, a part of Meaning-Text theory. An XML schema using the DML framework has been defined to represent this structure as accurately as possible.</Paragraph> <Paragraph position="1"> This structure is common to all the monolingual dictionaries. In order to cope with language differences, small variations are authorised for each monolingual lexicon.
Up to now, these variations have been used to define the parts of speech for each language and to add information specific to each language, such as levels of politeness and counters for Japanese.</Paragraph> <Paragraph position="2"> Figure 3 presents an excerpt of the XML encoding of the French entry &quot;meurtre&quot; (murder) and Figure 4 shows a DEC-like view.</Paragraph> <Paragraph position="3"> The general schema has been presented in detail in Serasset and Mangeot-Lerebours (2001).</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 Implementation of the </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Collaborative Web Site </SectionTitle> <Paragraph position="0"> For the external user, the Papillon project is viewed as a dynamic web site providing access to the existing dictionaries and giving ways to contribute to the Papillon dictionary.</Paragraph> </Section> <Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.1 General Architecture </SectionTitle> <Paragraph position="0"> The Papillon web site is built with a Java-based open source framework called Enhydra. It is designed around a standard 3-tier architecture: * a presentation layer in charge of the interface with the user. We currently use classical HTML/CSS rendering, but plan to integrate WML access to the dictionaries (for mobile phones), * a business layer in charge of data manipulation and transformation. We currently use XML data (in UTF-8) and XSL transformations for data manipulation, * a data layer in charge of the communication with the database via a JDBC driver. The data layer should be managed by an XML database allowing language-dependent sorting. For the moment, XML databases are still at an early stage.
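The business-layer step, an entry fetched as XML and transformed entirely on the server before anything reaches the client, can be sketched as follows (plain Python stands in for the actual XSL stylesheets; the entry fields are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical entry as it might come back from the data layer.
entry = ET.Element("entry")
ET.SubElement(entry, "headword").text = "meurtre"
ET.SubElement(entry, "definition").text = "action de tuer un humain"

def render_view(entry):
    # Server-side transformation: the client receives already-rendered
    # text, so any browser works, XML-aware or not.
    return entry.findtext("headword").upper() + " : " + entry.findtext("definition")

print(render_view(entry))  # MEURTRE : action de tuer un humain
```

Swapping the transformation function is all it takes to produce a different view of the same entry, which is why views are kept separate from the stored data.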
In order to advance in the project, a mapping system for DML has been defined to store the XML data in conventional relational databases. PostgreSQL is used at this point.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Particular Features </SectionTitle> <Paragraph position="0"> As different users may have different needs (translators, learners...), we define different views of the Papillon dictionary. Each view is encoded as an XSL stylesheet that is applied to the result of each user query. In the future, we will also allow users to define their own custom views and store them on the server. All these transformations are done on the server in order to allow users to use their preferred browser (even if it is not XML-aware). Figure 4 shows an example of the French entry &quot;MEURTRE&quot; (murder) displayed using Mel'cuk's classical DEC view.</Paragraph> <Paragraph position="1"> To avoid the unintentional pollution of the database by erroneous data, the contributions of a user are to be validated by a central group of trusted users. In the meantime, the contributions are stored as XSL stylesheets in the contributor's private space.</Paragraph> <Paragraph position="2"> Each time a user requests a corresponding entry, the request is performed both in the main database and in the user space. The results from the user space are used to modify the results from the main database. This way, the contribution is immediately visible to the user, exactly as if it had been integrated into the main database.</Paragraph> <Paragraph position="3"> While contributions are waiting to be validated and integrated into the common space, the contributors may choose to share them with other users or groups of users.</Paragraph> <Paragraph position="4"> Every user can contribute at his/her level.
For example, a linguist specialising in lexical functions will enter values of lexico-semantic functions, a phonologist will enter pronunciations, and a professional bilingual translator will enter new interlingual links or check the semi-automatically generated ones. For this, different interfaces will be developed to accommodate the various user profiles.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Annex Tools </SectionTitle> <Paragraph position="0"> As the web site hosts a rather complex collaborative work, we have added some tools that are not related to lexicography but that have to work in a multilingual context.</Paragraph> <Paragraph position="1"> First, there is a tool to archive our Papillon mailing list. Such a tool is very common on Internet sites. However, as we found out, these tools cannot be used as-is in our multilingual context, where mails may contain discussions in different languages, written with different tools and encoded using different standards. Hence we patched an existing tool so that it archives all mail in UTF-8, regardless of its original encoding.</Paragraph> <Paragraph position="2"> To reduce the considerable workload of the webmaster and to facilitate communication and the exchange of information between the users of the database, we are developing tools to facilitate the use of a document repository.</Paragraph> <Paragraph position="3"> After registration and login, users can easily upload a file in any format.
It will immediately be stored in the document repository and made accessible online on the web.</Paragraph> </Section> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Actual Research and Development </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Directions </SectionTitle> <Paragraph position="0"> The Papillon project is an extremely interesting experimentation platform. We are currently working on the validation of monolingual data, the management of axies and the acquisition of new data.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Validation of the Monolingual Data </SectionTitle> <Paragraph position="0"> A team of trusted lexicographers validates user contributions before they are integrated into the main database.</Paragraph> <Paragraph position="1"> This validation is a time-consuming process and requires strong skills in linguistics and lexicography. Moreover, we may not find enough specialists volunteering for such work and we may have to pay a core team for this.</Paragraph> <Paragraph position="2"> This task is essential and should be conducted as quickly as possible, lest users be discouraged by the delays introduced by the central team.</Paragraph> <Paragraph position="3"> Hence, even in this validation process, we wish to enroll users as much as possible. For this task, we plan to implement tools for the indirect validation of information, using vote mechanisms and generating questions answerable without any special knowledge of linguistics.</Paragraph> <Paragraph position="4"> As a first experiment, we will use a French generator to produce many examples using the word to be validated and a set of known (already validated) words. These examples will be presented to native speakers, who will simply have to accept or reject them.
This strategy is very interesting in our context, as it will help validate the lexical functions.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Management of the Interlingual Links </SectionTitle> <Paragraph position="0"> The use of a pivot dictionary to represent translation equivalence is challenging. This macrostructure is very satisfying on a theoretical level, but introduces considerable management complexity.</Paragraph> <Paragraph position="1"> In Serasset (1994), we envisaged that these interlingual acceptions would be created and managed by hand by a team of specialists, helped by tools that would detect inconsistencies and propagate decisions among the different languages. This appeared to be unrealistic.</Paragraph> <Paragraph position="2"> However, we now have means to manage these acceptions automatically. For this, we use the fact that the interlingual acception volume does not, in any way, represent a semantic pivot. It is not related to an ontology.</Paragraph> <Paragraph position="3"> In fact, the only relevant purpose of this interlingual volume is to factorise the bilingual links we find in classical bilingual dictionaries (or the ones that will be specified by the users).
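A minimal sketch of this factorisation, assuming translation links come as pairs of hypothetical (language, sense) identifiers and ignoring the well-formedness criteria mentioned next: connected components of the link graph, computed here with union-find, then play the role of interlingual acceptions.

```python
# Hypothetical bilingual links between monolingual word senses.
links = [
    (("fra", "riz.1"), ("eng", "rice.1")),
    (("eng", "rice.1"), ("jpn", "gohan.1")),
    (("deu", "Reis.1"), ("eng", "rice.1")),
]

# Union-find over word senses: each component becomes one acception.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in links:
    union(a, b)

# Group the senses by their component representative.
components = {}
for sense in parent:
    components.setdefault(find(sense), set()).add(sense)

print(len(components))  # 1: all four senses factorise into one acception
```

The naive merge shown here is exactly what the rice example earlier warns against, which is why the real computation must also enforce well-formedness criteria before accepting a component as an acception.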
Hence, given a set of translation equivalences between monolingual acceptions of different languages, it is possible to compute a minimal set of acceptions (and their links) that conforms to a set of well-formedness criteria.</Paragraph> <Paragraph position="4"> One of the difficult tasks is to obtain bilingual translation equivalences between monolingual acceptions, since bilingual dictionaries often only provide links between mere lemmas.</Paragraph> <Paragraph position="5"> For this, we will use aligned corpora and translation memories to add contextual information to the translation pairs.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Acquisition of new data </SectionTitle> <Paragraph position="0"> To depend entirely on volunteer work is of course unrealistic, especially while beginning to build the lexical database. That is why we first reuse existing dictionaries in order to build the kernel of the database.</Paragraph> <Paragraph position="1"> Contributors will come in later, filling in missing information in existing entries and creating partial or complete new entries as well as links.</Paragraph> <Paragraph position="2"> However, as we are using a rather complex structure which requires some skills that are not shared by all Internet users, we will have to help them help us.</Paragraph> <Paragraph position="3"> In particular, we are beginning to use corpus-based techniques to extract lemmas that will be candidate values of lexical functions.
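A toy sketch of such corpus-based candidate extraction, with simple co-occurrence counting standing in for the project's actual techniques (the corpus and lemmas below are invented for illustration):

```python
from collections import Counter

# Hypothetical lemmatised corpus sentences.
corpus = [
    ["commettre", "meurtre", "horrible"],
    ["commettre", "meurtre"],
    ["elucider", "meurtre"],
]

def candidates(lemma, corpus):
    # Count the lemmas co-occurring with the target word; the most
    # frequent ones become candidate values of some lexical function.
    counts = Counter()
    for sentence in corpus:
        if lemma in sentence:
            counts.update(w for w in sentence if w != lemma)
    return [w for w, _ in counts.most_common()]

print(candidates("meurtre", corpus))  # 'commettre' ranks first
```

The ranked list only proposes candidates; deciding which lexical function, if any, each candidate realises remains a human task.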
Determining the appropriate lexical function is one of the jobs of our contributors, but they will be helped in this task by tools that will provide them with questions and candidate paraphrases.</Paragraph> <Paragraph position="4"> To provide complementary information or to help the contributors in their task, the database should also offer the consultation of other dictionaries stored locally or available online on the web.</Paragraph> <Paragraph position="5"> Moreover, to be really useful to readers, and especially to learners, the examples found in the dictionaries will be translated into other languages, both literally and semantically. Some of these translations will be extracted from aligned corpora.</Paragraph> <Paragraph position="6"> Conclusion The theoretical frameworks for the whole database, the macrostructure and the microstructure are very well defined. They constitute a solid basis for the implementation. Many open problems still have to be addressed for the Papillon project to be a success. In this respect, the Papillon project appears to be a very interesting experimentation platform for much NLP research, such as data acquisition or human access to lexical data, among others.</Paragraph> <Paragraph position="7"> All this research will increase the attractiveness of such a project to Internet users. This attractiveness is necessary for the project to go on, as it is highly dependent on its users' motivation.</Paragraph> <Paragraph position="8"> This way, we will be able to provide a very interesting multilingual lexical database that we hope will be useful to many people.</Paragraph> </Section> </Section> </Paper>