File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/92/c92-1012_intro.xml
Size: 6,077 bytes
Last Modified: 2025-10-06 14:05:11
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1012"> <Title>Towards Developing Reusable NLP Dictionaries</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Acquisition </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Strategies </SectionTitle> <Paragraph position="0"> There are three potentially useful strategies to develop large lexical resources, which are not in principle mutually exclusive.</Paragraph> <Paragraph position="1"> MRDs The extraction of data from machine-readable dictionaries has received much attention in the past decade. In our view the usefulness of existing material for NLP applications has been somewhat overestimated. Traditional dictionaries are oriented towards a market of human consumers, who consult the dictionary for entirely different reasons than NLP applications do. For instance, most of the information in NLP dictionaries</Paragraph> <Paragraph position="2"> is concerned with the grammatical description of words, which in many dictionaries is only rudimentarily available (see footnote).</Paragraph> <!-- Actes de COLING-92, Nantes, 23-28 août 1992; page 53; Proc. of COLING-92, Nantes, Aug. 23-28, 1992 --> <Paragraph position="3"> Furthermore, given that humans can use their intelligence and knowledge of the language(s), much information is only present in unformalized definitions and examples. As discussed in e.g. [McNaught, 1988], it is often feasible to extract (relatively) formalized information, but the cost-effectiveness of automatic extraction of information from less formalized data is highly questionable. From this discussion it follows that MRDs alone cannot be the source for NLP dictionaries. 
In section 2.2 we will discuss in more detail the evaluation of the potential sources of data for our specific purposes.</Paragraph> <Paragraph position="4"> Corpora Automatic extraction of lexical features by applying various pattern recognition techniques to large bodies of text has received some attention recently (cf.</Paragraph> <Paragraph position="5"> e.g. [Zernik and Jacobs, 1990]). However, the information needed for our applications cannot be extracted from corpora yet, although important improvements can be expected in the following years.</Paragraph> <Paragraph position="6"> Lexicography Given the present inadequacy of MRDs and corpus-related tools, manual labour is indispensable for lexicon development. The tools described in section 3 have been developed as a 'workbench' to support these lexicographical activities. We will show that this workbench allows for easy integration of information extracted from MRDs with lexicographic editing.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Sources </SectionTitle> <Paragraph position="0"> Evaluation Measure It is difficult to assess the &quot;reusability&quot; of existing data without an evaluation measure, i.e. without knowing for what purpose the data should be usable. This is especially difficult in the case of grammatical features. We developed a lexicon fragment (implemented as a TFS type hierarchy, cf. section 3) defining the classification scheme for the monolingual dictionaries. This fragment is inspired by HPSG and GB, and incorporates many of the (innovative) distinctions developed by the client applications EUROTRA and ROSETTA. It is, however, much more lexicalist than these systems.</Paragraph> <Paragraph position="1"> Eventually, all lexical entries in the two languages should be described using this scheme, so that they can be readily converted to client applications. 
The data that can be extracted from a potential source has been interpreted with respect to this classification scheme to assess the amount of information contained in it.</Paragraph> <Paragraph position="2"> Data Analysis The machine-readable sources we considered are the existing Van Dale Dutch monolingual and bilingual Dutch-Spanish machine-readable dictionaries and the CELEX lexical database. From our evaluation it followed that existing MRDs for Dutch (as for almost all other languages) contain only a small part of the information needed by NLP applications.</Paragraph> <Paragraph position="3"> [Footnote: Well-structured dictionaries like [Longman, 1987] are an important exception to this, cf. [Boguraev and Briscoe, 1989].] Fortunately, the CELEX lexical database has enriched a selection of 30,000 entries of the &quot;Van Dale Dictionary of Contemporary Dutch&quot; with grammatical information, taking into account the requirements of a number of (prototype) NLP applications under development in the Netherlands. A large amount of the information needed for our target applications can be converted automatically from this database. The entries, stored in a relational database, can be imported into the Dutch lexicon using the TFS constraint solver, in the same way as the conversion to client applications (see section 5). The CELEX dictionary has historic links to the Van Dale dictionaries (especially with respect to reading distinction), which greatly simplifies integration of these sources.</Paragraph> <Paragraph position="4"> With respect to translation information we found that the &quot;raw&quot; translational data could be extracted easily from the Van Dale bilingual dictionaries. The original Van Dale concept is especially interesting for multilingual applications, as the Dutch part is the same (at least in principle) in all bilingual dictionaries with Dutch as source language (cf. 
[van Sterkenburg et al., 1982]).</Paragraph> <Paragraph position="5"> Information about phrasal translation, such as the choice of the support verb of a noun in the target language, is unfortunately hidden in unrestricted text (example sentences etc.), from which it is difficult to extract. Phrasal information also suffers greatly from incompleteness.</Paragraph> </Section> </Section> </Paper>