File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2114_metho.xml
Size: 14,623 bytes
Last Modified: 2025-10-06 14:09:24
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2114"> <Title>Rebuilding the Oxford Dictionary of English as a semantic network</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Computational analysis of the </SectionTitle> <Paragraph position="0"> text must be designed to identify and aggregate cues, and to employ rules for inheriting information from entry or branch level down to sense level, in order to generate the right data to construct each lexical object (see section 4 below). On the other hand, ODE has some key features that make it particularly appropriate for enhancement as a comprehensive electronic lexical database, usable both as a dictionary and as a resource for exploitation in NLP applications:
* even-handed treatment of both British and American spellings, plus extensive coverage of other World English;
* coverage of proper-name items (places, people, events, etc.);
* relatively codified editorial style, facilitating computational evaluation of definitions;
* detailed and codified indication of syntactic characteristics, sense by sense;
* example sentences linked to individual senses, providing cues about grammatical behaviour and collocation.
Furthermore, ODE is very much a corpus-based dictionary: all key editorial decisions - including orthography, sense distinctions, choice of example sentences, and syntactic information - are motivated primarily by corpus evidence. This means that a formalized encoding of the information presented in the dictionary is likely to generate robust and comprehensive resources for NLP tasks dealing with arbitrary real-world text. [Footnote 2: An example of this in ODE is the high number of run-on derivatives, which are undefined on the assumption that the user will be able to infer the sense morphologically.]</Paragraph> <Paragraph position="2">
4 Data structure and lexical objects
The ODE data structure dispenses with the idea of the entry as the dictionary's basic unit. This move reflects the proposition that the traditional dictionary entry is an artefact of human-readable dictionaries, and expresses a predominantly etymological view of the language. The dictionary entry brings together lexical elements (definitions, parts of speech, phrases, etc.) which may have little relationship to each other from the point of view of language use (meaning, domain, frequency, collocation, etc.). The entry is therefore an inappropriate unit for computer processing. Instead, the ODE data structure is built as a series of discrete lexical objects which correspond roughly to individual dictionary senses or to senses of embedded lemmas (phrases, compounds, derivatives, etc.). These lexical objects function as packets of data which exist independently of each other, each containing all relevant information about meaning, morphology, lexical function, semantic category, etc. Hence every sense may be queried, extracted, or manipulated as an independent object without any reference to the entry in which it was originally situated. Although search results, etc., may still (typically) be represented to the user in the form of traditional dictionary entries, from a functional point of view the entry is rendered redundant as a data object, and for almost all purposes is ignored when handling search queries and other processing.
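To make the shape of such a sense-level object concrete, the following is a minimal sketch of how a self-contained lexical object might be represented and queried without reference to any entry. The field names and the query helper are illustrative assumptions, not ODE's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class LexicalObject:
    """One lexeme-meaning pair: the basic unit, independent of any dictionary entry.
    Field names are illustrative only, not ODE's actual schema."""
    object_id: str                                        # stable identifier for this sense
    lemma: str                                            # headword or embedded lemma (phrase, compound, ...)
    part_of_speech: str
    definition: str
    examples: list[str] = field(default_factory=list)     # example sentences for this sense only
    forms: dict[str, str] = field(default_factory=dict)   # morphology, e.g. {"plural": "mice"}
    variants: list[str] = field(default_factory=list)     # variant / British and US spellings
    syntax: list[str] = field(default_factory=list)       # codified syntactic patterns
    taxonomy_node: str | None = None                      # position in the semantic taxonomy
    domains: list[str] = field(default_factory=list)      # subject-area classifiers

def senses_in_domain(objects: list[LexicalObject], domain: str) -> list[LexicalObject]:
    """Boolean-style query over self-contained objects; no entry structure is consulted."""
    return [o for o in objects if domain in o.domains]

# Example: two senses of "mouse" held as independent objects.
mouse_animal = LexicalObject("mouse.n.01", "mouse", "noun", "a small rodent...",
                             forms={"plural": "mice"}, taxonomy_node="animal",
                             domains=["Zoology"])
mouse_device = LexicalObject("mouse.n.02", "mouse", "noun", "a hand-held pointing device...",
                             forms={"plural": "mice"}, taxonomy_node="device",
                             domains=["Computing"])
print([o.object_id for o in senses_in_domain([mouse_animal, mouse_device], "Computing")])
```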
The choice of lexeme-meaning pair rather than entry as the basic lexical object allows a much more detailed and exhaustive specification of the way the language functions on a sense-by-sense basis. Each object typically includes (minimally) a definition, one or more example sentences, a full specification of morphology and syntactic behaviour (see section 5 below), and a range of classifiers (see section 6 below). This creates some redundancy: for example, an entry with numerous senses may generate several lexical objects which repeat the same morphology and syntactic behaviour. However, this is more than offset by the advantages: by integrating data into self-contained lexical objects, the structure (a) supports robust and positive Boolean operations and (b) opens up the lexicon for reconfiguration according to any new type of lexical or semantic relation.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Morphological and phrasal specification </SectionTitle> <Paragraph position="0"> Every lexical object in ODE provides a complete specification of all the lexical forms relevant to its sense. This includes not only the morphology of the lexemes themselves, but also structured data about their syntactic roles, variant spellings, British and US spellings, and alternative forms. This is true not only for single-word lexemes but also for abbreviations, initialisms, affixes, contractions, compounds, phrases, and proper names. ODE thus provides a facility for look-up of real-world lexical forms in context, including:
As mentioned above, such data was generated for every lexical object (sense). For some polysemous headwords this may mean multiple repetitions of morphological listings. More typically, however, listings vary from sense to sense. One common variable, for example, is that for a given noun headword, some lexical objects may include a plural while others do not. Similarly, variant spellings may often be associated with some but not all lexical objects. More pertinently perhaps, phrasal patterns associated with the lexeme will vary significantly from sense to sense, particularly for verbs.</Paragraph> <Paragraph position="1"> It should be noted that this formal variability can play an important role in the early stages of disambiguation tasks; the particular form of a lexeme or extended phrase may often allow groups of senses to be reliably discounted or at least downgraded as possible meanings.[3] This simplifies the set of candidate meanings subsequently presented to semantic analysis algorithms. [Footnote 3: John Sinclair has proposed that 'Every distinction in meaning is associated with a distinction in form' (quoted in Hanks 2003, p. 61). Although doubtful about this as a universal rule, I agree that for a given lexeme one may often link major categories of meaning to typical formal characteristics.]</Paragraph> <Paragraph position="2"> Hence it was judged necessary to develop distinctive morphological and phrasal specifications for each lexical object. Principally this was done algorithmically: the dictionary text associated with each lexical object was examined for clues not only in explicit syntax information but also in definition patterns and especially in example sentences. In default of any strong pointers emerging from this process, forms were produced by a simple morphological generator.</Paragraph>
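The paper does not spell out the fallback generator's rules, but something along the following lines would serve: a minimal sketch that produces regular English inflections when no stronger cues are found in the definition or example sentences. The rules and function name are assumptions for illustration only.

```python
def generate_regular_forms(lemma: str, pos: str) -> dict[str, str]:
    """Fallback generator: regular English inflection only (a simplification;
    the ODE process prefers cues from definitions and example sentences)."""
    forms: dict[str, str] = {}
    if pos == "noun":
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            forms["plural"] = lemma + "es"
        elif lemma.endswith("y") and lemma[-2] not in "aeiou":
            forms["plural"] = lemma[:-1] + "ies"
        else:
            forms["plural"] = lemma + "s"
    elif pos == "verb":
        stem = lemma[:-1] if lemma.endswith("e") else lemma
        forms["3sg"] = lemma + ("es" if lemma.endswith(("s", "x", "z", "ch", "sh")) else "s")
        forms["present_participle"] = stem + "ing"
        forms["past"] = stem + "ed"
    elif pos == "adjective":
        # Only safe for short gradable adjectives; as noted below, gradability
        # is not marked sense by sense in the dictionary text.
        stem = lemma[:-1] if lemma.endswith("e") else lemma
        forms["comparative"] = stem + "er"
        forms["superlative"] = stem + "est"
    return forms

print(generate_regular_forms("church", "noun"))  # {'plural': 'churches'}
print(generate_regular_forms("move", "verb"))    # {'3sg': 'moves', 'present_participle': 'moving', 'past': 'moved'}
```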
<Paragraph position="3"> However, it became apparent that many formal variations were not cued in any way by the dictionary text. As a simple example, if an adjective is gradable then this is always indicated at headword level in the dictionary, but there is nothing to indicate which senses are gradable and which are not. Because of limitations like this, a significant stage of post-processing manual correction was necessary.</Paragraph> </Section> <Section position="7" start_page="0" end_page="4" type="metho"> <SectionTitle> 6 Classification </SectionTitle> <Paragraph position="0"> Each lexical object includes a set of classifiers which are used to position the object's definition in semantic taxonomies (structures of superordination/subordination) and in subject areas (domains). This may be used in two principal ways. Firstly, it allows the dictionary to be navigated along associative or thesaurus-like routes: it structures the lexicon's set of objects into a network of connections between meanings. Secondly, it allows calculations of semantic distance between lexical objects, supporting NLP tasks relating to sense disambiguation, context-sensitive look-up, and document classification. [Footnote 4: Metrics for using taxonomies to calculate semantic similarity are discussed in Leacock and Chodorow 1998. An implementation of ODE as an electronic dictionary uses such metrics in the automatic glossing of its own text.]</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 6.1 Semantic taxonomy </SectionTitle> <Paragraph position="0"> All lexical objects representing noun meanings (about 100,000) have been structured in a semantic taxonomy. This represents a substantial majority of all objects; the smaller sets of verbs and adjectives will be similarly structured as the next stage of the project.</Paragraph> <Paragraph position="1"> The taxonomy's high-level nodes (major classifications) were originally constructed to follow the file structure of WordNet nouns. Although the ODE taxonomy has subsequently diverged in detail, mappings to WordNet have been retained. At the early stages of the project, a relatively small number of noun objects were classified manually. The definitions contained in these objects were then used as training data to establish a definitional 'profile' for each high-level node. Definitional profiles were used to automatically classify further objects. Applied iteratively (with manual correction at each stage), this process succeeded in classifying all noun objects in a relatively coarse-grained way. Beyond this stage, the statistical data which had previously been integrated to build definition profiles was instead decomposed to help identify ways in which high-level nodes could be subdivided to develop greater granularity.</Paragraph>
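As an illustration of the bootstrapping loop just described (and, loosely, of the two-element definitional profiling set out below), here is a minimal sketch. The profile here is just a position-weighted bag of definition lexemes; the actual profiles also single out a 'key term' and combine mutual information scores, and the manual-correction step is reduced to a placeholder.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "of", "or", "and", "which", "that", "to", "in", "for", "with"}

def weighted_lexemes(definition: str) -> Counter:
    """Position-weighted bag of meaningful lexemes: words near the start of a
    definition (e.g. modifiers of the key term) count for more than later ones."""
    tokens = [t.strip(".,;()'\"").lower() for t in definition.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    weights: Counter = Counter()
    for i, tok in enumerate(tokens):
        weights[tok] += max(1.0, 3.0 - 0.2 * i)   # simple decay; illustrative only
    return weights

def build_profiles(seed: dict[str, str]) -> dict[str, Counter]:
    """seed maps definition text -> manually assigned high-level node."""
    profiles: dict[str, Counter] = defaultdict(Counter)
    for definition, node in seed.items():
        profiles[node] += weighted_lexemes(definition)
    return profiles

def classify(definition: str, profiles: dict[str, Counter]) -> str:
    """Assign the node whose profile best matches the definition's lexemes
    (a stand-in for the paper's mutual-information scoring)."""
    lex = weighted_lexemes(definition)
    return max(profiles, key=lambda node: sum(lex[t] * w for t, w in profiles[node].items()))

def manually_correct(assignments: dict[str, str]) -> dict[str, str]:
    return assignments   # placeholder: in practice, each pass is reviewed by hand

# One bootstrap iteration: seed -> profiles -> classify -> correct -> fold back in.
seed = {
    "a small rodent with a pointed snout": "animal",
    "a domesticated carnivorous mammal": "animal",
    "a hand-held device for controlling a cursor": "device",
}
unclassified = ["a nocturnal flying mammal", "an electronic device for storing data"]
profiles = build_profiles(seed)
new = manually_correct({d: classify(d, profiles) for d in unclassified})
seed.update(new)   # newly classified objects feed the next, finer-grained pass
print(new)
```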
<Paragraph position="2"> Definitional profiling here involves two elements:
1. Identification of the 'key term' in the definition. This is the most significant noun in the definition. It is not always coterminous with the genus term; for example, in a definition beginning 'a morsel of food which...', the 'key term' is calculated to be food rather than morsel.
2. Stemming and scoring of all other meaningful lexemes in the definition (ignoring articles, conjunctions, etc.). A simple weighting scheme is used to give more importance to lexemes at the beginning of a definition (e.g. a modifier of the key term) than to lexemes at the end.
These two elements are then assigned mutual information scores in relation to each possible classification, and the two MI scores are combined in order to give an overall score. This overall score is taken to be a measure of how 'typical' a given definition would be for a given classification. This enables one very readily to rank all the lexical objects attached to a given node, and to identify other objects which are candidates for that node.</Paragraph> <Paragraph position="3"> The semantic taxonomy currently has about 2200 nodes on up to 12 levels - on average, 46 objects per node. However, this average disguises the fact that there are a small number of nodes which classify significantly larger sets of objects. Further subcategorization of large sets is desirable in principle, but is not considered a priority in all cases: subcategorization of a given set is deprioritized if the set is relatively homogeneous, i.e. if the distribution of 'typicality' scores for each object is relatively small. [Footnote 5: For example, the tree set is several times larger than average, but since tree definitions have similar profiles, the set produces a high homogeneity score. This accords loosely with the intuition that for most NLP purposes, there is little value in making semantic distinctions between different species of tree.]</Paragraph> <Paragraph position="4"> Hence the goal is not to achieve granularity on the order of WordNet's 'synset' (a set in which all terms are synonymous, and hence are rarely more than four or five in number). Instead, granularity is based on a more thesaurus-like measure of parity between objects in a set.</Paragraph> </Section> <Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 6.2 Domains </SectionTitle> <Paragraph position="0"> As with semantic classification, a number of domain indicators were assigned manually, and these were then used iteratively to seed assignment of further indicators to statistically similar definitions. Automatic assignment is a little more straightforward and robust here, since most of the time the occurrence of strongly-typed vocabulary will be a sufficient cue, and there is little necessity to identify a key term or otherwise to parse the definition.</Paragraph> <Paragraph position="2"> Not every lexical object is domain-specific: currently 49% of all objects have been assigned at least one domain marker. Each iteration of the assignment process will continue to capture objects which are less and less strongly related to the domain. Beyond a certain point, the relationship will become too tenuous to be of much use in most contexts; but that point will differ for each subject field (and for each context). Hence a further objective is to implement a grading scheme which not only classifies an object by domain but also scores its relevance to that domain.</Paragraph>
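A rough sketch of this kind of seeded domain assignment with a relevance grade might look as follows. The cue vocabularies, the overlap scoring, and the threshold are illustrative assumptions rather than the procedure actually used; in practice, newly marked objects would feed back as additional seeds for the next iteration.

```python
# Illustrative seed vocabulary: strongly-typed terms that cue a domain on their own.
DOMAIN_VOCAB = {
    "Oceanography": {"ocean", "tide", "salinity", "seabed", "current"},
    "Pharmaceutics": {"drug", "dosage", "tablet", "excipient", "formulation"},
}

def domain_scores(definition: str, vocab: dict[str, set[str]]) -> dict[str, float]:
    """Score each domain by the share of its cue vocabulary found in the definition.
    No key-term identification or parsing is needed, as noted above."""
    tokens = {t.strip(".,;()").lower() for t in definition.split()}
    return {dom: len(tokens & cues) / len(cues) for dom, cues in vocab.items()}

def assign_domains(definition: str, vocab: dict[str, set[str]],
                   threshold: float = 0.2) -> list[tuple[str, float]]:
    """Return (domain, relevance grade) pairs above a cut-off; objects with no
    strongly-typed vocabulary simply receive no domain marker."""
    scores = domain_scores(definition, vocab)
    return sorted([(d, round(s, 2)) for d, s in scores.items() if s >= threshold],
                  key=lambda pair: pair[1], reverse=True)

print(assign_domains("a drug prepared as a tablet in a fixed dosage", DOMAIN_VOCAB))
# -> [('Pharmaceutics', 0.6)]
```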
<Paragraph position="3"> Currently, a set of about 220 domains is used, both technical (Pharmaceutics, Oceanography) and thematic (Crime, Sex). These were originally planned to be organized in a Dewey-like taxonomy. However, this was found to be unworkable, since within the lexicon domains emerged as a series of overlapping fields rather than as a hierarchy. Hence the domains have now been reorganized not taxonomically but rather as a network of multiple connections. Within the network, each edge connecting two domains represents the strength of association between those domains.[6]</Paragraph> </Section> <Paragraph position="4">
7 Work in progress
Ongoing work focuses mainly on the cumulative population of lexical objects with additional data fields, in particular collocations and frequency measures.</Paragraph> <Paragraph position="5"> Additionally, further classifier types are being trialled in order to define further relations between lexical objects. These include linguistic relations such as antonymy and root derivation, and real-world relations such as geographical region.</Paragraph> </Section> </Paper>