<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2119"> <Title>Application Adaptive Electronic Dictionary with Intelligent Interface</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Overview of TransDict </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Feature space </SectionTitle> <Paragraph position="0"> TransDict was originally built over a set of features relevant to patent applications, including: Semantic features: SEM_Cl (semantic class) and CASE_ROLEs (a set of case-roles associated with a lexeme, if any).</Paragraph> <Paragraph position="1"> Syntactic features: FILLERs (sets of the most probable fillers of case-roles in terms of phrase types and lexical preferences).</Paragraph> <Paragraph position="2"> Linking features: PATTERNs (linearization patterns of lexemes that encode both the knowledge about co-occurrences of lexemes with their case-roles and the knowledge about their linear order). Morphological features: POS (part of speech) and MORPH (wordforms, number, gender, etc.); the sets of parts of speech and wordforms are domain- and application-specific (Sheremetyeva, cf.).</Paragraph> <Paragraph position="3"> Rank feature: RANK (corpus-based frequency within one semantic class). The more frequent a lexeme is, the lower its rank.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Organization and architecture </SectionTitle> <Paragraph position="0"> TransDict includes cross-referenced monolingual lexicons for every language. A monolingual dictionary consists of a set of entries. An entry identifies the lexical information for one meaning of a lexeme of a given language. 
Every entry is maximally defined as a tree of features:</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> SEM-CL[Language[POS RANK [MORPH CASE_ROLE FILLER PATTERN] </SectionTitle> <Paragraph position="0"> The CASE_ROLE, FILLER and PATTERN features might not be specified in certain entries, e.g., for nouns denoting physical objects.</Paragraph> <Paragraph position="1"> A maximal entry has the following fields: internal formats: in data files and index files. The developer works with the Main Dictionary File (MDF), visualised by the interface (Figure 2). When the lexicographer saves the data, multiple extractions from the MDF are automatically created. These extractions contain different data subsets relevant for different processing steps (tagging, disambiguation, transfer and generation). The extractions are created for every language and for every pair of languages. They are linked to applications by special DLL (dynamic link library) functions that access only one of the dictionary extractions for every processing step. This approach significantly increases access and processing speed, which is crucial for real-world systems. This, and the fact that TransDict is implemented for the PC, motivated our choice not to use an SQL database or XML (which would have slowed down application performance). It does not mean, however, that TransDict could not be used in online mode. An interface and a DLL can be written for this purpose.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Supporting tools </SectionTitle> <Paragraph position="0"> We developed the following TransDict tools: Data importer/merger imports wordlists and/or feature values from external files and applications. 
For example, the tool is pipelined to a tagger and to the AutoPat and AutoTrans user interfaces to automatically import unknown words.</Paragraph> <Paragraph position="1"> Defaulter automatically assigns entry structures and some feature values to entries. Editor a) edits feature values in an entry and b) edits dictionary settings: languages, semantic classes, parts of speech, wordforms and their tags. Any change of settings automatically propagates to the corresponding entries.</Paragraph> <Paragraph position="2"> Morphological generator automatically generates wordforms for a given word base form.</Paragraph> <Paragraph position="3"> Content and format checker reveals incomplete and/or badly formatted entries.</Paragraph> <Paragraph position="4"> Look-up tool performs wildcard search and search on any combination of specified parameters.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Interface design </SectionTitle> <Paragraph position="0"> A lexicographer interacts with the lexicon through an extremely user-friendly interface (Figure 2). The left pane of the interface screen contains a scrollable list of lexeme base forms in a selected language. A click on a language bookmark over the morphological zone displays an entry in the selected language equivalent to a highlighted word in the left column. All supporting tools are accessed through the interface menus.</Paragraph> <Paragraph position="1"> The &quot;Add&quot; button calls pop-up menus where the developer is prompted to select a semantic class and part of speech. Once this is done, an entry with a relevant structure, tags and default values is displayed. After the user types in a base form, all other wordforms are automatically generated on a mouse click. The developer is to review the default knowledge and edit it if necessary. 
The content and format checker takes care of correct descriptions with different kinds of alert messages and rewriting support. Powerful search can be done in both look-up and edit modes.</Paragraph> <Paragraph position="2"> Changing the dictionary settings makes it easy to change the base-form status of a wordform, the structure of the entry and other specification parameters. Figure 3 shows how the default noun entry, with two slots for its morphological forms (singular and plural), is reset for Danish, where definiteness is expressed morphologically, thus doubling the number of members of the noun paradigm compared with English.</Paragraph> </Section> </Paper>