<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1054">
<Title>Quantifying lexical influence: Giving direction to context</Title>
<Section position="2" start_page="0" end_page="0" type="metho">
<SectionTitle> V. Kripasundar </SectionTitle>
<Paragraph position="0"/>
</Section>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> CEDAR & Dept. of Computer Science, SUNY at Buffalo, Buffalo, NY 14260, USA Abstract </SectionTitle>
<Paragraph position="0"> The relevance of context in disambiguating natural-language input has been widely acknowledged in the literature. However, most attempts at formalising the intuitive notion of context treat the word and its context symmetrically. We demonstrate here that traditional measures such as the mutual information score are likely to overlook a significant fraction of all co-occurrence phenomena in natural language. We also propose metrics for measuring directed lexical influence and compare their performance.</Paragraph>
<Paragraph position="1"> Keywords: contextual post-processing, defining context, lexical influence, directionality of context</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> It is widely accepted that context plays a significant role in shaping all aspects of language. Indeed, comprehension would be utterly impossible without the extensive application of contextual information. Evidence from psycholinguistic and cognitive-psychological studies also demonstrates that contextual information affects the activation levels of lexical candidates during perception (Weinreich, 1980; McClelland, 1987). Garvin (1972) describes the role of context as follows: [The meaning of] a particular text [is] not the system-derived meaning as a whole, but that part of it which is included in the contextually and situationally derived meaning proper to the text in question. (pp. 69-70) In effect, this means that the context of a word serves to restrict its sense.</Paragraph>
<Paragraph position="1"> The problem addressed in this research is that of improving the performance of a natural-language recogniser (such as a recognition system for handwritten or spoken language). The recogniser output typically consists of an ordered set of candidate words (word-choices) for each word position in the input stream. Since natural language abounds in contextual information, it is reasonable to exploit that information to improve the performance of the recogniser by disambiguating among the word-choices.</Paragraph>
<Paragraph position="2"> The word-choices for a given word position constitute a confusion set. The recogniser may further associate a confidence value with each of its word-choices to communicate finer resolution in its output. The language module must update these confidence values to reflect contextual knowledge; a minimal sketch of this interface appears below.</Paragraph>
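<Paragraph position="3"> The following sketch (in Python) illustrates the interface described above: a confusion set of word-choices with recogniser confidences, and a language module that re-scores and re-ranks them with contextual knowledge. All names here (WordChoice, rescore, the influence dictionary) are hypothetical illustrations, not the system's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class WordChoice:
        word: str
        confidence: float  # recogniser's confidence in this candidate

    def rescore(confusion_set: list[WordChoice], prev_word: str,
                influence: dict[tuple[str, str], float]) -> list[WordChoice]:
        """Update recogniser confidences to reflect contextual knowledge.

        influence[(w1, w2)] is any contextual score for w2 given w1 (e.g. a
        directed influence measure); unseen pairs get a small default so that
        no candidate is zeroed out.
        """
        rescored = [
            WordChoice(c.word,
                       c.confidence * influence.get((prev_word, c.word), 0.01))
            for c in confusion_set
        ]
        # Re-rank the confusion set by the updated confidence values.
        return sorted(rescored, key=lambda c: c.confidence, reverse=True)

    # Example: the recogniser slightly prefers 'ape' over 'age' after 'Paleolithic'.
    candidates = [WordChoice("ape", 0.55), WordChoice("age", 0.45)]
    context = {("Paleolithic", "age"): 0.9}
    print(rescore(candidates, "Paleolithic", context))  # 'age' now ranks first
</Paragraph>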
</Section>
<Section position="5" start_page="0" end_page="332" type="metho">
<SectionTitle> 2 Linguistic post-processing </SectionTitle>
<Paragraph position="0"> The language module can, in principle, perform several types of "post-processing" on the word-candidate lists that the recogniser outputs for the different word positions. The most promising possibilities are: (i) re-ranking the confusion set (and assigning new confidence values to its entries), and (ii) deleting low-confidence entries from the confusion set (after applying contextual knowledge). Several researchers in NLP have acknowledged the relevance of context in disambiguating natural-language input (Evett et al., 1991; Zernik, 1991; Hindle & Rooth, 1993; Rosenfeld, 1994). In fact, the recent revival of interest in statistical language processing is due in part to its (comparative) success in modelling context. However, a theoretically sound definition of context is needed to ensure that such re-ranking and deletion of word-choices helps rather than hinders (Gale & Church, 1990).</Paragraph>
<Paragraph position="1"> Researchers in information theory have developed many inter-related formalisations of the ideas of context and contextual influence, such as mutual information and joint entropy. However, to our knowledge, all attempts at arriving at a theoretical basis for formalising the intuitive notion of context have treated the word and its context symmetrically.</Paragraph>
<Paragraph position="2"> Many researchers (Smadja, 1991; Srihari & Baltus, 1993) have suggested that the information-theoretic notion of the mutual information score (MIS) directly captures the idea of context. However, MIS is deficient in its ability to detect one-sided correlations (cf. Table 1), and our research indicates that asymmetric influence measures are required to handle them properly (Kripasundar, 1994).</Paragraph>
<Paragraph position="3"> For example, it seems quite unlikely that any symmetric information measure can accurately capture the co-occurrence relationship between the two words 'Paleolithic' and 'age' in the phrase 'Paleolithic age'. The suggestion that 'age' exerts as much influence on 'Paleolithic' as vice versa seems ridiculous, to say the least. What is needed here is a directed (i.e., one-sided) influence measure (DIM): something that measures the influence of one word on another, rather than a simple, symmetric "co-existence probability" of two words. Table 1 illustrates how a DIM can be effective in detecting lexical and lexico-semantic associations; the toy computation below makes the asymmetry concrete.</Paragraph>
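<Paragraph position="4"> The following toy computation (Python, with invented counts rather than real WSJ frequencies) shows why a symmetric score misses one-sided correlations such as 'according' being followed almost invariably by 'to':

    import math

    N = 2_000_000  # assumed corpus size, in words
    count = {"according": 1_000, "to": 50_000, ("according", "to"): 990}

    p_w1 = count["according"] / N
    p_w2 = count["to"] / N
    p_pair = count[("according", "to")] / N

    # Symmetric mutual information score: one number for both directions.
    mis = math.log2(p_pair / (p_w1 * p_w2))

    # Directed influence via conditional frequencies: sharply asymmetric.
    p_to_given_according = count[("according", "to")] / count["according"]  # ~0.99
    p_according_given_to = count[("according", "to")] / count["to"]         # ~0.02

    print(f"MIS = {mis:.2f}")
    print(f"P(to | according) = {p_to_given_according:.2f}")
    print(f"P(according | to) = {p_according_given_to:.2f}")

MIS assigns the pair a single score, whereas the conditional frequencies reveal that 'according' strongly predicts 'to' while 'to' says almost nothing about 'according'.</Paragraph>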
</Section>
<Section position="6" start_page="332" end_page="332" type="metho">
<SectionTitle> 3 Comparing measures of lexical influence </SectionTitle>
<Paragraph position="0"> We used a section of the Wall Street Journal (WSJ) corpus containing 102K sentences (over two million words) as the training corpus for the partial results described here. The lexicon used was a simple 30K-word superset of the vocabulary of the training corpus.</Paragraph>
<Paragraph position="1"> The results shown here serve to strengthen our hypothesis that non-standard information measures are needed for the proper utilisation of linguistic context. Table 1 shows some pairs of words that exhibit differing degrees of influence on each other. It also demonstrates very effectively that one-sided information measures are much better than symmetric measures at utilising context properly. The arrow between each pair of words in the table indicates the direction of influence (or flow of information). The preponderance of word-pairs that exhibit only one direction of significant influence (e.g., 'according' → 'to') shows that no symmetric score could have captured the correlations in all of these phrases.</Paragraph>
<Paragraph position="2"> Our formulation of directed influence is still evolving. The word-pairs in Table 1 were selected randomly from the test set, with the criterion that they scored "significantly" (i.e., > 0.9) on at least one of the three measures D1, D2 and D3. The four measures (including MIS) are defined as follows:

    MIS(w1, w2)  = log [ P(w1 w2) / ( P(w1) P(w2) ) ]
    D1(w1 -> w2) = #w1w2 / #w1
    D2(w1 -> w2) = step2( #w1w2 / #Cmax ) * ( #w1w2 / #w1 )
    D3(w1 -> w2) = step3( #w1w2 / #Cmax ) * ( #w1w2 / #w1 )
</Paragraph>
<Paragraph position="3"> In these definitions, #w1w2 denotes the frequency of co-occurrence of the words w1 and w2 (the exact word order of w1 and w2 is irrelevant here), while #w1 and #w2 represent, respectively, the frequencies of their unconditional occurrence. #Cmax, defined as the maximum of #w1w2 over all word-pairs (w1, w2), is the maximum co-occurrence frequency in the corpus, and appears to be a better normalisation factor than the size of the corpus itself.</Paragraph>
<Paragraph position="4"> The definition of MIS implicitly incorporates the size of the corpus, since it has two P() terms in the denominator and only one in the numerator. The DIMs, on the other hand, have balanced fractions. Therefore, we have not included a log-term in the definitions of D1, D2 and D3 above.</Paragraph>
<Paragraph position="5"> D1 is a straightforward estimate of the conditional probability of co-occurrence. It forms a baseline for performance evaluations, but is prone to sparse-data problems (Dunning, 1993).</Paragraph>
<Paragraph position="6"> The step() functions in D2 and D3 represent two attempts at minimising such errors. These functions are piecewise-linear mappings of the normalised co-occurrence frequency, and are used as scaling factors. Their effect is apparent in Table 1, especially in the bottom third of the table, where the low frequency of the primer pushes D3 down to insignificant levels.</Paragraph>
<Paragraph position="7"> The metrics D2 and D3 can and should be normalised, perhaps to the 0-1 range, in order to facilitate integration with other metrics such as the recogniser's confidence value. Similarly, the lack of normalisation of MIS hampers direct comparison of its scores with the three DIMs.</Paragraph>
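<Paragraph position="8"> A minimal sketch (Python) of computing the four measures from raw text follows, assuming the definitions given above; since the exact piecewise-linear step() parameters are not specified here, the shapes of step2 and step3 below are illustrative guesses (step3 is deliberately harsher, so that low-frequency pairs are pushed towards zero):

    import math
    from collections import Counter

    def pair_counts(tokens):
        """Unigram counts and order-insensitive adjacent co-occurrence counts."""
        unigrams = Counter(tokens)
        pairs = Counter(frozenset(p) for p in zip(tokens, tokens[1:])
                        if p[0] != p[1])
        return unigrams, pairs

    def step2(x):
        return min(1.0, 2.0 * x)                               # gentle linear ramp

    def step3(x):
        return 0.0 if x < 0.1 else min(1.0, (x - 0.1) / 0.4)   # harsher cutoff

    def measures(w1, w2, unigrams, pairs, n_words):
        """Return (MIS, D1, D2, D3) for the directed pair w1 -> w2."""
        co = pairs[frozenset((w1, w2))]
        c_max = max(pairs.values())        # normalisation factor #Cmax
        p1 = unigrams[w1] / n_words
        p2 = unigrams[w2] / n_words
        p12 = co / n_words
        mis = math.log2(p12 / (p1 * p2)) if co else float("-inf")
        d1 = co / unigrams[w1]             # conditional co-occurrence probability
        d2 = step2(co / c_max) * d1
        d3 = step3(co / c_max) * d1
        return mis, d1, d2, d3

    tokens = ("according to the report the stone age tools "
              "date to the paleolithic age").split()
    uni, pairs = pair_counts(tokens)
    # D1 is asymmetric: 'paleolithic' -> 'age' scores higher than the reverse,
    # while MIS is identical in both directions.
    print(measures("paleolithic", "age", uni, pairs, len(tokens)))
    print(measures("age", "paleolithic", uni, pairs, len(tokens)))
</Paragraph>
</Section>
</Paper>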