<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2105"> <Title>Word lookup on the basis of associations: from an idea to a roadmap</Title>

<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Search based on the relations between concepts and words </SectionTitle>

<Paragraph position="0"> If one agrees with what we have just said, one could view the mental dictionary as a huge semantic network composed of nodes (words and concepts) and links (associations), with either being able to activate the other.7 Finding a word thus amounts to entering the network and following the links leading from the source node (the first word that comes to your mind) to the target word (the one you are looking for). Suppose you wanted to find the word "nurse" (the target word), yet the only token coming to your mind were "hospital". In this case the system would generate internally a graph with the source word at the center and all the associated words at the periphery. Put differently, the system would build internally a semantic network with "hospital" in the center and all its associated words as satellites (figure 1).8 Obviously, the greater the number of associations, the more complex the graph. Given the diversity of situations in which a given object may occur, we are likely to build many associations. In other words, lexical graphs tend to become complex, too complex to be a good representation to support navigation. Readability is hampered by at least two factors: high connectivity (the great number of links or associations emanating from each word) and distribution (conceptually related nodes, i.e. nodes activated by the same kind of association, are scattered around and do not necessarily occur next to each other, which is quite confusing for the user). In order to solve this problem we suggest displaying by category (in chunks) all the words linked by the same kind of association to the source word (see figure 2, and the sketch at the end of this section). Hence, rather than displaying all the connected words as a flat list, we suggest presenting them in chunks to allow for categorial search. Having chosen a category, the user is presented with a list of words or categories from which he must choose. If the target word is in the category chosen by the user (suppose he looked for a hypernym, hence he checked the ISA-bag), the search stops; otherwise it continues, the user choosing, according to the nature of the link, a word from the current list, which would then become the new starting point.</Paragraph>

<Paragraph position="1"> 5 The idea according to which the mental dictionary (or encyclopedia) is basically an associative network, composed of nodes (words or concepts) and links (associations), is not new, nor is the idea of spreading activation. Actually, the very notion of association goes back at least to Aristotle (350 BC), but it is also inherent in work done by philosophers (Locke, Hume), physiologists (James & Stuart Mills), psychologists (Galton, 1880; Freud, 1901; Jung and Riklin, 1906) and psycholinguists (Deese, 1965; Jenkins, 1970; Schvaneveldt, 1989). For surveys in psycholinguistics see (Hörmann, 1972) or, for more recent work, (Spitzer, 1999). The notion of association is also implicit in work on semantic networks (Quillian, 1968), hypertext (Bush, 1945), the web (Nelson, 1967), connectionism (Dell et al., 1999) and, of course, WordNet (Miller et al., 1993; Fellbaum, 1998).</Paragraph>

<Paragraph position="2"> 6 In the preceding sections we used the terms word and concept interchangeably several times, as if they were the same. Of course, they are very different. Yet, not knowing what a concept looks like (a single node, or every node, i.e. headword, of the word's definition?), we think it is safer to assume that the user can communicate with the computer (dictionary) only via words. Hence, concepts are represented by words, yet, since the two are connected, one can be accessed via the other, which addresses the interface problem with the computer. Another point worth mentioning is that associations may depend on the nature of the arguments (words vs. concepts). While in theory anything can be associated with anything (words with words, words with concepts, concepts with concepts, etc.), in practice words tend to trigger a different set of associations than concepts. Also, the connectivity between words and concepts explains to some extent the power and flexibility of the human mind. Words are shorthand labels for concepts, and given that the two are linked, one can make big leaps in no time, moving easily from one plane (say, the conceptual level) to the other (its linguistic counterpart). Words can be reached via concepts, but the latter can also serve as the starting point for finding a word. Compared to the links between concepts, which form a superhighway, associations between words are more like country roads.</Paragraph>

<Paragraph position="3"> 7 Actually, one could question the very notion of a mental dictionary, which is convenient but misleading in that it supposes a part of our brain dedicated to this task. A multiply indexed mental encyclopedia, composed of polymorphic information (concepts, words, meta-linguistic information), seems much more plausible to us.</Paragraph>

<Paragraph position="4"> 8 AKO: a kind of; ISA: subtype; TIORA: typically involved object, relation or actor.</Paragraph>
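To make the intended navigation concrete, the following minimal sketch (purely illustrative, and ours rather than the authors' implemented system: the toy word lists, the callback names and the restriction to the three link labels of footnote 8 are all assumptions) groups the words associated with a source term into one bag per link type and walks from bag to bag as described above:

```python
from collections import defaultdict

# Toy association network as (source, link-label, target) triples.
# The label inventory follows footnote 8 (AKO, ISA, TIORA); the word
# lists themselves are invented for the example.
ASSOCIATIONS = [
    ("hospital", "TIORA", "nurse"),
    ("hospital", "TIORA", "doctor"),
    ("hospital", "ISA", "building"),
    ("hospital", "AKO", "institution"),
    ("nurse", "TIORA", "hospital"),
]

def bags(source):
    """Chunk all words linked to `source` by the kind of association."""
    chunks = defaultdict(list)
    for s, label, target in ASSOCIATIONS:
        if s == source:
            chunks[label].append(target)
    return dict(chunks)

def navigate(source, target, choose_bag, choose_word):
    """The loop described in the text: show the bags, let the user pick
    one; stop if the target is there, otherwise restart from a chosen word."""
    current = source
    while True:
        chunks = bags(current)
        words = chunks[choose_bag(chunks)]   # e.g. the user checks the ISA-bag
        if target in words:
            return target                    # search stops
        current = choose_word(words)         # new starting point

# A 'user' who always opens the TIORA bag and picks its first word
# finds "nurse" from "hospital" in one step.
print(navigate("hospital", "nurse",
               choose_bag=lambda c: "TIORA",
               choose_word=lambda ws: ws[0]))
```

The point of the chunking is visible in `bags`: at each step the user scans a handful of labeled categories rather than one long flat list of associated words.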
</Section>

<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A resource still to be built </SectionTitle>

<Paragraph position="0"> The fact that the links are labeled has some very important consequences. (a) While maintaining the power of a highly connected graph (possible cyclic navigation), the network has, at the interface level, the simplicity of a tree: each node points only to data of the same type, i.e. the same kind of association. (b) Words being presented in clusters, navigation can be accomplished by clicking on the appropriate category. The assumption is that the user generally knows to which category the target word belongs (or can at least recognize within which of the listed categories it falls), and that categorical search is in principle faster than search in a huge list of unordered (or alphabetically ordered) words.</Paragraph>

<Paragraph position="1"> Word access, as described here, amounts to navigating in a huge associative network. Of course, such a network has to be built. The question is how. Our proposal is to build it automatically by parsing an existing corpus containing a sufficient amount of information on world knowledge (for example, an encyclopedia). This would yield a set of associations (see below),9 which still need to be labeled. A rich ontology should be helpful in determining the adequate label for many, if not most, of the links. Unlike private information,10 which by definition cannot and should not be put into a public dictionary,11 encyclopedic knowledge can be added in terms of associations, as this information expresses commonly shared knowledge, that is, the kind of associations most people have when encountering a given word. Take for example the word elephant. An electronic dictionary like WordNet associates the following gloss with the headword: large, gray, four-legged mammal, while Webster gives the following information: "A mammal of the order Proboscidia, of which two living species, Elephas Indicus and E. Africanus, and several fossil species, are known. They have a proboscis or trunk, and two large ivory tusks proceeding from the extremity of the upper jaw, and curving upwards. The molar teeth are large and have transverse folds. Elephants are the largest land animals now existing." While this latter entry is already quite rich (trunk, ivory tusks, size), an encyclopedia contains even more information.12 If all this information were added to an electronic resource, it would enable us to access the same word (e.g. elephant) via many more associations than ever before. Looking at the definition above, one will notice that many associations are quite straightforward (color, size, origin, etc.), and since most of them appear frequently in a pattern-like manner, it should be possible to extract them automatically (see footnote 18 below). If one agrees with these views, the remaining question is how to extract this encyclopedic information and add it to an existing electronic resource. Below we outline some methods for extracting associated words and discuss the feasibility of using current methodology to achieve this goal.</Paragraph>

<Paragraph position="2"> 9 The assumption being that every word co-occurring with another word in the same sentence is a candidate for an association. The more frequently two words co-occur in a given corpus, the greater their associative strength.</Paragraph>

<Paragraph position="3"> 10 For example, the word elephant may remind you of a specific animal, a trip or a location (zoo, country in Africa).</Paragraph>

<Paragraph position="4"> 11 This does not (and should not) preclude the possibility of adding it to one's personal dictionary.</Paragraph>

<Paragraph position="5"> 12 You may consider taking a look at Wikipedia (http://en.wikipedia.org/wiki/), which is free.</Paragraph>
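Footnote 9's working assumption translates almost directly into code. The sketch below is a toy illustration of ours (the three 'encyclopedia' sentences are invented, and real use would add at least lemmatization and stopword filtering): it treats every pair of words co-occurring within a sentence as a candidate association, and takes the pair's frequency across the corpus as its associative strength.

```python
from collections import Counter
from itertools import combinations

# Toy 'encyclopedia' fragment; the real input would be a large parsed corpus.
sentences = [
    "elephants are large gray mammals with a trunk and ivory tusks",
    "the trunk of the elephant is a proboscis",
    "elephants are the largest land animals",
]

# Footnote 9's heuristic: every pair of words co-occurring in the same
# sentence is a candidate association; the co-occurrence count across
# the corpus serves as associative strength.
strength = Counter()
for sentence in sentences:
    words = set(sentence.split())            # word types, not tokens
    for w1, w2 in combinations(sorted(words), 2):
        strength[(w1, w2)] += 1

# The most strongly associated (still unlabeled) pairs:
for pair, count in strength.most_common(5):
    print(pair, count)
```

The output of such a pass is precisely the kind of unlabeled, weighted word-pair set discussed in the next section: useful for ranking related words, but still in need of a labeling step before it can support the categorial navigation of section 3.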
</Section>

<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Automatic extraction of word associations </SectionTitle>

<Paragraph position="0"> Above we outlined the need for obtaining associations between words and for using them to improve dictionary accessibility. While associations can be obtained through association experiments with human subjects, this strategy is not very satisfying due to the high cost of running the experiments (time and money) and due to its static nature. Indeed, given the costs, it is impossible to repeat these experiments to take into account the evolution of a society.</Paragraph>

<Paragraph position="1"> Hence, the goal is to extract associations automatically from large corpora. This problem has been addressed by a large number of researchers, but in most cases it was reduced to the extraction of collocations, which are a proper subset of the set of associated words. While hard to define, collocations appear often enough in corpora to be extractable by statistical and information-theoretic methods.</Paragraph>

<Paragraph position="2"> There are several basic methods for evaluating associations between words: methods based on frequency counts (Choueka, 1988; Wettler and Rapp, 1993), information-theoretic methods (Church and Hanks, 1990) and statistical significance methods (Smadja, 1993). The latter typically evaluate whether two words are independent, using hypothesis tests such as the t-score (Church et al., 1991), the chi-square test, the log-likelihood ratio (Dunning, 1993) and Fisher's exact test (Pedersen, 1996). Extracted sets of associated words are further pruned using numerical methods or linguistic knowledge to obtain a subset of collocations.</Paragraph>

<Paragraph position="3"> The various extraction measures have been discussed in great detail in the literature (Manning and Schütze, 1999; McKeown and Radev, 2000), their performance has been compared (Dunning, 1993; Pedersen, 1996; Evert and Krenn, 2001), and the methods have been combined to improve overall performance (Inkpen and Hirst, 2002). Most of these methods were originally applied to large text corpora, but more recently the web has been used as a corpus (Pearce, 2001; Inkpen and Hirst, 2002).</Paragraph>
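To make two of these measures concrete, here is a minimal sketch of ours (the counts in the example are invented and the function names are illustrative, not taken from any library) of pointwise mutual information (Church and Hanks, 1990) and the log-likelihood ratio (Dunning, 1993), both computed from the usual 2x2 contingency counts:

```python
import math

# f(x), f(y): marginal word frequencies; f(x,y): co-occurrence frequency;
# n: total number of observed co-occurrence opportunities.

def pmi(fxy, fx, fy, n):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )."""
    return math.log2((fxy / n) / ((fx / n) * (fy / n)))

def log_likelihood(fxy, fx, fy, n):
    """Log-likelihood ratio G2 over the 2x2 contingency table of x and y:
    G2 = 2 * sum( observed * ln(observed / expected) )."""
    observed = [fxy, fx - fxy, fy - fxy, n - fx - fy + fxy]
    expected = [fx * fy / n, fx * (n - fy) / n,
                (n - fx) * fy / n, (n - fx) * (n - fy) / n]
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Example: two words co-occurring 30 times out of 100,000 opportunities,
# with marginal frequencies 120 and 80.
print(pmi(30, 120, 80, 100_000))             # strongly positive -> associated
print(log_likelihood(30, 120, 80, 100_000))  # large -> reject independence
```

As the final paragraph of this section notes, measures like the log-likelihood ratio and Fisher's exact test remain usable at the low co-occurrence counts typical of non-redundant text such as encyclopedia entries.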
<Paragraph position="4"> Collocation extraction methods have been applied not only to English but to many other languages as well, e.g. French (Ferret, 2002), German (Evert and Krenn, 2001) and Japanese (Nagao and Mori, 1994), to cite but a few.</Paragraph>

<Paragraph position="5"> The most obvious question in this context is to what extent available collocation extraction techniques fulfill our needs for extracting and labeling word associations. Since collocations are a subset of associations, it is possible to apply collocation extraction techniques to obtain related words, ordered in terms of the relative strength of association.</Paragraph>

<Paragraph position="6"> The result of this kind of numerical extraction would be a large set of numerically weighted word pairs. The problem with this approach is that the links are labeled only in terms of their relative associative strength, not categorically, which makes it impossible to group and present them in a meaningful way to the dictionary user. Clusters based solely on the notion of association strength are inadequate for the kind of navigation described above. Hence another step is necessary: qualification of the links according to their types. Only once this is done can a human being use the resource to navigate through a large conceptual-lexical network (the dictionary) as described above. Unfortunately, research on automatic link identification has been rather sparse. Most attempts have been devoted either to the extraction of certain types of links (usually syntactic ones (Lin, 1998)) or to extensions of WordNet with topical information contained in a thesaurus (Stevenson, 2002) or on the WWW (Agirre et al., 2000). Additional methods need to be considered in order to reveal (automatically) the kind of associations holding between words and/or concepts.</Paragraph>

<Paragraph position="7"> Earlier in this paper we suggested the use of an encyclopedia as a source of general world knowledge. It should be noted, though, that there are important differences between large corpora and encyclopedias. Large corpora usually contain a lot of repetitive text on a limited number of topics (e.g. newspaper articles), which makes them very suitable for statistical methods. Encyclopedias, on the other hand, while maximally informative and comprehensive, are written in a highly controlled language, and their content is continually updated and re-edited with the goal of avoiding unnecessary repetition. While most of the information contained in an entry is important, there is a lack of redundancy. Hence, measures capable of handling word pairs with low appearance counts (e.g. the log-likelihood ratio or Fisher's exact test) should be favored. Also, rather than looking at individual words, one might want to look at word patterns instead.</Paragraph> </Section> </Paper>