<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2117"> <Title>Language Resources for a Network-based Dictionary</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Language Resources </SectionTitle> <Paragraph position="0"> Zock (2002) proposes the use of only one type of information structure in his network, namely a type of semantic information. There are, however, a number of other types of information structures that may also be relevant for a user. Psychological experiments show that almost all levels of linguistic description reveal priming effects. Strong mental associations between words are based not only on semantic relationships but also on morphological and phonological relationships. These types of relationships should therefore be included in a network-based dictionary as well.</Paragraph> <Paragraph position="1"> A number of LRs that are suitable in this scenario already provide some sort of network-like structure, possibly closely related to networks meaningful to a human user. All of these areas are large research fields in their own right, and we will therefore only touch upon a few aspects.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Manually Constructed Networks </SectionTitle> <Paragraph position="0"> Manually constructed networks usually consist of paradigmatic information, since words of the same part of speech are related to each other. In ontologies usually only nouns are considered, and they are integrated in order to structure the knowledge to be covered.</Paragraph> <Paragraph position="1"> The main advantage of such networks, since they are hand-built, is the presumable correctness (if not completeness) of their content. Additionally, these semantic nets usually include typed relations between nodes, such as &quot;hyperonymy&quot; and &quot;is a&quot;, and therefore provide additional information for a user. It is safe to rely on the structure of a network coded by humans to a certain extent, even if such networks have disadvantages, too. For example, they tend to be selective in the data they include, i.e. sometimes only one restricted area of knowledge is covered. Furthermore, with few exceptions they contain only paradigmatic information. This, however, is only part of the greater structure of lexical networks.</Paragraph> <Paragraph position="2"> The most famous example is WordNet (Fellbaum, 1998) for English - which has already been visualized at http://www.visualthesaurus.com - and its various sisters for other languages. It reflects a certain cognitive claim and was designed to be used in computational tasks such as word sense disambiguation. Furthermore, ontologies may be used as a resource, because in ontologies usually single words or NPs are used to label the nodes of the network.</Paragraph> <Paragraph position="3"> An example is the &quot;Universal Decimal Classification&quot;, which was originally designed to classify all printed and electronic publications in libraries with the help of some 60,000 classes. However, one can also think of it as a knowledge representation system, as the information is coded in order to reflect the knowledge about the topics covered. (A sketch of how typed relations can be read out of such a hand-built network follows below.)</Paragraph> </Section>
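A minimal sketch of how typed relations could be read out of a hand-built network such as WordNet into a simple graph structure, assuming Python with NLTK and its WordNet corpus data installed; the function name typed_neighbours and the particular choice of relation types are illustrative, not taken from the paper.

```python
# Sketch: extracting typed paradigmatic relations from WordNet
# into a plain adjacency structure. Assumes NLTK plus the WordNet
# data (nltk.download('wordnet')) are available.
from collections import defaultdict
from nltk.corpus import wordnet as wn

def typed_neighbours(word):
    """Collect (relation, synset) edges for every sense of `word`."""
    graph = defaultdict(list)
    for synset in wn.synsets(word):
        for hyper in synset.hypernyms():
            graph[synset.name()].append(("hyperonymy", hyper.name()))
        for hypo in synset.hyponyms():
            graph[synset.name()].append(("hyponymy", hypo.name()))
        for mero in synset.part_meronyms():
            graph[synset.name()].append(("meronymy", mero.name()))
    return graph

for node, edges in typed_neighbours("dictionary").items():
    print(node, edges)
```

Because each edge carries its relation type, a dictionary interface could filter the network by relation, which is exactly the extra information that hand-built networks offer over automatically derived ones.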
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Automatically Generated Paradigmatic Networks </SectionTitle> <Paragraph position="0"> A common approach to the automatic generation of semantic networks is to use some form of the so-called vector space model in order to map words that occur in similar contexts in a corpus to nearby positions in vector space (Manning and Schütze, 1999). One example, Latent Semantic Analysis (LSA; Landauer et al., 1998), has been accepted as a model of the mental lexicon and is even used by psycholinguists as a basis for the categorization and evaluation of test items. The results from this line of research seem not only to describe relations between words but also to provide the basis for a network which could be integrated into a network-based dictionary. A disadvantage of LSA is that it places a polysemous word at a single position between the extremes, i.e. between its senses, which makes the approach worthless for polysemous words in the data.</Paragraph> <Paragraph position="1"> There are several other approaches, such as Ji and Ploux (2003) and the already mentioned Rapp (2002). Ji and Ploux also develop a statistics-based method, in their case to determine so-called &quot;contexonyms&quot;. This method allows one to determine different senses of a word, as the word connects to a different cluster for each of its senses; these clusters can be seen as automatically derived SynSets as known from WordNet. Furthermore, their group has developed a visualization tool that presents the results in a novel way. Even though they claim to have developed an &quot;organisation model&quot; of the mental lexicon, only the restricted class of paradigmatic relations shows up in their calculations.</Paragraph> <Paragraph position="2"> Common to almost all automatically derived semantic networks, as opposed to manually constructed ones, is the problem that the relations between items are untyped. On the one hand, a typed relation provides additional information for a user about two connected nodes; on the other hand, it seems questionable whether a known relation would really help to actually infer the meaning of a connected node (contrary to Zock (2002)). (A sketch of the vector-space construction underlying this family of approaches follows below.)</Paragraph> </Section>
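As a rough illustration of the vector-space idea, the following sketch builds word vectors from co-occurrence counts and reduces them with an SVD, the core step of LSA-style methods; the toy corpus, window size, and dimensionality are invented for the example, and a realistic model would be trained on a large corpus.

```python
# Sketch: LSA-style word vectors from co-occurrence counts.
# Toy corpus, window size, and dimensionality are illustrative.
import numpy as np

corpus = [
    "the doctor treats the patient in the clinic",
    "the nurse helps the doctor in the hospital",
    "the judge hears the case in the court",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in tokens:
    for i, w in enumerate(sent):
        lo = max(0, i - window)
        for j in range(lo, min(len(sent), i + window + 1)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# A truncated SVD projects words into a low-dimensional space in
# which words with similar contexts end up close together.
u, s, _ = np.linalg.svd(counts)
k = 2
vectors = u[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(vectors[index["doctor"]], vectors[index["nurse"]]))
print(cosine(vectors[index["doctor"]], vectors[index["court"]]))
```

The polysemy problem mentioned above is visible in this construction: a polysemous word gets exactly one row in the matrix and hence one point in the space, located between its senses.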
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Automatically Generated Syntagmatic Networks </SectionTitle> <Paragraph position="0"> Substantial parts of the mental lexicon probably also consist of syntagmatic relations between words, which are even more important for the interpretation of collocations. (We define collocations as syntactically more or less fixed combinations of words in which the meaning of one word is usually altered, so that a compositional construction of the meaning is prevented.) The automatic extraction of collocations, i.e. syntagmatic relations between words, from large corpora has been an area of interest in recent years, as it provides a basis for the automatic enrichment of electronic lexicons and dictionaries. Usually, attempts have been made at extracting verb-noun, verb-PP, or adjective-noun combinations. Noteworthy is the work of Krenn and Evert (2001), who compared the different lexical association measures used for the extraction of collocations. Even though most approaches are purely statistics-based and use little linguistic information, in a few cases a parser was applied in order to improve the recognition of collocations whose relevant words are not adjacent to each other (Seretan et al., 2003).</Paragraph> <Paragraph position="1"> The data available from collocation extraction research cannot, of course, be put together to give a complete and comprehensive network. However, examples such as the German project &quot;Deutscher Wortschatz&quot; and the visualization technique used there suggest that a network-like structure in this area, too, would be useful, for example in the language learning scenario mentioned above.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Phonological/Morphological Networks </SectionTitle> <Paragraph position="0"> Electronic lexica and rule systems for the phonological representation of words can be used for spell checking, as has been done e.g. in the Soundex approach (Mitton, 1996). In this approach, a word not contained in the lexicon is mapped onto a simplified and reduced phonological representation and compared with the representations of the words in the lexicon. The correct words that come closest to the misspelled word on the basis of this comparison are then chosen as possible correction candidates.</Paragraph> <Paragraph position="1"> However, this approach makes some drastic assumptions about the phonology of a language in order to keep the system simple. With a more elaborate set of rules describing the phonology of a language, a more complex analysis is possible, which even allows the determination of words that rhyme. By setting a suitable threshold on some measure of similarity, a network should emerge in which phonologically similar words are connected with each other. A related approach to spelling correction is the use of so-called &quot;tries&quot; for the efficient storage of lexical data (Oflazer, 1996). Here, the calculation of a minimal edit distance between an unknown word and the words in a trie determines possible correction candidates (a sketch of this candidate selection follows below). Contrary to Zock (2002), who suggests this as an analysis step of its own, we think that phonological and morphological similarity can be exploited to form yet another layer in a network-based dictionary. Zock's example of the looked-for &quot;relegate&quot; may then be connected to &quot;renegade&quot; and &quot;delegate&quot; via a single link and thus be found easily. Here again, probably only partial nets are created, but they may nevertheless help a user looking for a word whose spelling s/he is not sure of.</Paragraph>
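The following sketch combines a phonological code with a minimal edit distance to rank correction candidates, in the spirit of the approaches just described; the four-word lexicon is invented for the example, and the coding table is a simplified variant of the classic Soundex table rather than Mitton's or Oflazer's exact rule sets.

```python
# Sketch: Soundex-style codes plus edit distance for ranking
# spelling-correction candidates. Lexicon and rules are toy examples.
SOUNDEX = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word):
    """Reduce a word to a 4-character phonological code."""
    word = word.lower()
    code = word[0].upper()
    prev = SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

lexicon = ["relegate", "renegade", "delegate", "regulate"]
misspelled = "relagate"
# Phonological matches first, ties broken by edit distance.
ranked = sorted(lexicon, key=lambda w: (soundex(w) != soundex(misspelled),
                                        edit_distance(misspelled, w)))
print(ranked)
```

In a network-based dictionary the same scores could define links: any pair of lexicon words whose codes match, or whose edit distance stays below a threshold, would be connected, yielding exactly the kind of partial phonological nets described above.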
<Paragraph position="2"> Finally, there are even more types of LRs containing network-like structures which may contribute to a network-based dictionary. One example to be mentioned here is the content of machine-readable dictionaries. The words in the definitions contained in the dictionary entries - especially for nouns - are on the one hand usually semantically connected to the lemma, and on the other hand are mostly entries themselves, which again may provide data for a network. In computational linguistics, the relation between the lemma and the definition has been utilized especially for word sense disambiguation tasks and for the automatic enrichment of language processing systems (Ide and Véronis, 1994).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Open Issues and Conclusion </SectionTitle> <Paragraph position="0"> So far we have said nothing about two further important parts of such a dictionary: the representation and the visualization of the data. There are a number of questions which still need to be answered in order to build a comprehensive dictionary suitable for an evaluation. With respect to the representation, two major questions seem to be the following.</Paragraph> <Paragraph position="1"> * As statistical methods for the analysis of corpora and for the extraction of frequent co-occurrence phenomena tend to use non-lemmatized data, the question is whether it makes sense to provide the user with the more specific data based on inflected material.</Paragraph> <Paragraph position="2"> * Secondly, the question arises how to integrate different senses of a word into the representation, if the data provides this information (as WordNet does).</Paragraph> <Paragraph position="3"> With regard to visualization, especially the dynamic aspects of the presentation need to be considered.</Paragraph> <Paragraph position="4"> There are various techniques that can be used to focus on parts of the network and suppress others in order to make the network-based dictionary manageable for a user; these techniques, among them hyperbolic views and so-called cone trees, need to be evaluated in usability studies. As we have shown, a number of LRs, especially those including syntagmatic, morphological and phonological information, provide suitable data to be included in a network-based dictionary. The data in these LRs either correspond to the presumed content of the mental lexicon or seem especially suited for the intended usage. One major property of the new type of dictionary proposed here is the disintegration of the macro- and micro-structure of a traditional dictionary, because parts of the micro-structure (the definitions of the entries) become part of the macro-structure (primary links to related nodes) of the new dictionary. Reflecting the structure of the mental lexicon, this dictionary should allow new ways to access the lexical data and support language production and language learning.</Paragraph> </Section> </Paper>