<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1006"> <Title>Automatic Extraction of Idioms using Graph Analysis and Asymmetric Lexicosyntactic Patterns</Title> <Section position="4" start_page="48" end_page="50" type="metho"> <SectionTitle> 3 The Symmetric Graph Model as used for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="48" end_page="49" type="sub_section"> <SectionTitle> Lexical Acquisition and Idiom Extraction </SectionTitle> <Paragraph position="0"> This section of the paper describes the techniques used to extract potentially idiomatic patterns from text, as deduced from previously successful experiments in lexical acquisition.</Paragraph> <Paragraph position="1"> The main extraction technique is to use lexicosyntactic patterns of the form &quot;A, B and/or C&quot; to find nouns that are linked in some way. For example, consider the following sentence from the British National Corpus (BNC).</Paragraph> <Paragraph position="2"> Ships laden with nutmeg, cinnamon, cloves or coriander once battled the Seven Seas to bring home their precious cargo.</Paragraph> <Paragraph position="3"> Since the BNC is tagged for parts-of-speech, we know that the words highlighted in bold are nouns. Since the phrase &quot;nutmeg, cinnamon, cloves or coriander&quot; fits the pattern &quot;A, B, C or D&quot;, we create nodes for each of these nouns and create links between them all. When applied to the whole of the BNC, these links can be aggregated to form a graph with 99,454 nodes (nouns) and 587,475 links, as described by Widdows and Dorow (2002). This graph was originally used for lexical acquisition, since clusters of words in the graph often map to recognized semantic classes with great accuracy (> 80%, (Widdows and Dorow, 2002)).</Paragraph> <Paragraph position="4"> However, for the sake of smoothing over sparse data, these results made the assumption that the links between nodes were symmetric, rather than directed. 
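As a minimal sketch of the construction just described (not the original implementation), the following Python fragment links every pair of nouns found in a coordination list, in both directions. It assumes that lists such as &quot;nutmeg, cinnamon, cloves or coriander&quot; have already been identified from the part-of-speech tags; the function name build_symmetric_graph and the toy input are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_symmetric_graph(coordination_lists):
    """Link every pair of nouns that co-occur in a coordination list.

    Each input item is a list of nouns already matched by a pattern such
    as "A, B and/or C"; tagging and pattern matching are assumed done.
    """
    graph = defaultdict(set)
    for nouns in coordination_lists:
        for a, b in combinations(nouns, 2):
            if a != b:
                # Symmetric assumption: record the link in both directions.
                graph[a].add(b)
                graph[b].add(a)
    return graph


# Toy input (hypothetical, modelled on the BNC sentence quoted above).
lists = [
    ["nutmeg", "cinnamon", "cloves", "coriander"],
    ["cat", "mouse"],
]
graph = build_symmetric_graph(lists)
print(sorted(graph["nutmeg"]))
```

Storing neighbours in sets means that repeated occurrences of a pair collapse into a single link; a weighted variant would need counts instead.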
In other words, when the pattern &quot;A and/or B&quot; was encountered, a link from A to B and a link from B to A were introduced. The nature of symmetric and antisymmetric relationships is examined in detail by Widdows (2004). For the purposes of this paper, it suffices to say that the assumption of symmetry (like the assumption of transitivity) is a powerful tool for improving recall in lexical acquisition, but also leads to serious lapses in precision if the directed nature of links is overlooked, especially if symmetrized links are used to infer semantic similarity.</Paragraph> <Paragraph position="5"> This problem was brought strikingly to our attention by the examples in Figure 1. In spite of appearing to be a circle of related concepts, many of the nouns in this group are not similar at all, and many of the links in this graph are derived from very different contexts. In Figure 1, cat and mouse are linked (they are both animals and the phrase &quot;cat and mouse&quot; is used quite often), but then mouse and keyboard are also linked because they are both objects used in computing. A keyboard, as well as being a typewriter or computer keyboard, is also used to mean (part of) a musical instrument such as an organ or piano, and keyboard is linked to violin. A violin and a fiddle are the same instrument (as often happens with synonyms, they don't appear together often but have many neighbours in common).</Paragraph> <Paragraph position="6"> The unlikely circle is completed (it turns out) because of the phrase from the nursery rhyme &quot;Hey diddle diddle, The cat and the fiddle, The cow jumped over the moon.&quot; It became clear from examples such as these that idiomatic links, like ambiguous words, were a serious problem when using the graph model for lexical acquisition. 
However, with ambiguous words, this obstacle has been gradually turned into an opportunity, since we have also developed ways to use the apparent flaws in the model to detect which words are ambiguous in the first place (Widdows, 2004, Ch 4). It is now proposed that we can take the same opportunity for certain idioms: that is, to use the properties of the graph model to work out which links arise from idiomatic usage rather than semantic similarity.</Paragraph> </Section> <Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 3.1 Idiom Extraction by Recognizing Asymmetric Patterns </SectionTitle> <Paragraph position="0"> The link between the cat and fiddle nodes in Figure 1 arises from the phrase &quot;the cat and the fiddle.&quot; However, no corpus examples were ever found of the converse phrase, &quot;the fiddle and the cat.&quot; In cases like these, it may be concluded that placing a symmetric link between these two nodes is a mistake.</Paragraph> <Paragraph position="1"> Instead, a directed link may be more appropriate.</Paragraph> <Paragraph position="2"> We therefore formed the hypothesis that if the phrase &quot;A and/or B&quot; occurs frequently in a corpus, but the phrase &quot;B and/or A&quot; is absent, then the link between A and B should be attributed to idiomatic usage rather than semantic similarity.</Paragraph> <Paragraph position="3"> The next step was to rebuild the graph with directed links, finding those relationships that have a strong preference for occurring in a fixed order. Sure enough, several British English idioms were extracted in this way. 
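The hypothesis can be illustrated with a short sketch. This is an illustration under stated assumptions, not the procedure actually used in the experiment: the ordered pairs are assumed to have been extracted from the corpus already, and the function name asymmetric_pairs, the threshold min_count, and the strict requirement that the reverse order never occurs are choices made for the example only.

```python
from collections import Counter


def asymmetric_pairs(ordered_pairs, min_count=2):
    """Find pairs (a, b) seen at least min_count times whose reverse (b, a) never occurs."""
    counts = Counter(ordered_pairs)
    result = []
    for (a, b), n in counts.items():
        # A Counter returns 0 for a missing key without inserting it.
        if n >= min_count and counts[(b, a)] == 0:
            result.append((a, b, n))
    # Most frequent fixed-order pairs first; ties broken alphabetically.
    return sorted(result, key=lambda item: (-item[2], item[0], item[1]))


# Toy counts (invented for illustration, not BNC figures).
pairs = (
    [("cat", "fiddle")] * 3
    + [("chalk", "cheese")] * 2
    + [("cat", "dog")] * 4
    + [("dog", "cat")] * 3
    + [("tea", "coffee")]
)
fixed_order = asymmetric_pairs(pairs)
print(fixed_order)
```

In this toy data, &quot;cat and dog&quot; is excluded because the reverse order also occurs, and &quot;tea and coffee&quot; is excluded because it falls below the assumed frequency threshold. A softer criterion, such as a ratio between the two orders, would be an equally plausible reading of &quot;a strong preference for occurring in a fixed order.&quot;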
However, several other kinds of relationships were extracted as well, as shown in the sample in Table 1. After extracting these pairs, groups of them were gathered together into directed subgraphs. Some of these directed subgraphs are reproduced in the analysis in the following section.</Paragraph> </Section> </Section> <Section position="5" start_page="50" end_page="54" type="metho"> <SectionTitle> 4 Analysis of Results </SectionTitle> <Paragraph position="0"> The experimental results include representatives of several types of asymmetric relationships, including the following broad categories.</Paragraph> <Paragraph position="1"> 'True' Idioms: There are many results that display genuinely idiomatic constructions. By this, we mean phrases that have an explicitly lexicalized nature that a native speaker may be expected to recognize as having a special reference or significance. Examples include: 1. historic quotations such as &quot;lies, damned lies and statistics&quot; and &quot;bread and circuses.&quot; 2. titles of well-known works.</Paragraph> <Paragraph position="2"> 3. colloquialisms.</Paragraph> <Paragraph position="3"> 4. groups of objects that have become fixed nominals in their own right.</Paragraph> <Paragraph position="4"> All of these types share the common property that any NLP system that encounters such groups, in order to behave correctly, should recognize, generate, or translate them as phrases rather than words.</Paragraph> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> Hierarchical Relationships </SectionTitle> <Paragraph position="0"> Many of the asymmetric relationships follow some pattern that may be described as roughly hierarchical. A cluster of examples from two domains is shown in Figure 2. 
In chess, a rook outranks a bishop, and the phrase &quot;rook and bishop&quot; is encountered much more often than the phrase &quot;bishop and rook.&quot; In the church, a cardinal outranks a bishop, a bishop outranks most of the rest of the clergy, and the clergy (in some senses) outrank the laity.</Paragraph> <Paragraph position="1"> Sometimes these relationships coincide with figure / ground and agent / patient distinctions. Examples of this kind, as well as &quot;clergy and laity&quot;, include &quot;landlord and tenant&quot;, &quot;employer and employee&quot;, &quot;teacher and pupil&quot;, and &quot;driver and passengers&quot;. An interesting exception is &quot;passengers and crew&quot;, for which we have no semantic explanation. Pedigree and potency appear to be two other dimensions that can be used to establish the directedness of an idiomatic construction. For example, Figure 3 shows that alcoholic drinks normally appear before their cocktail mixers, but that wine outranks some stronger drinks.</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> Gender Asymmetry </SectionTitle> <Paragraph position="0"> The relationships between corresponding concepts of different genders also appear to be heavily biased towards appearing in one direction. Many of these relationships are shown in Figure 4. This shows that, in cases where one class outranks another, the higher class appears first, but if the classes are identical, then the male version tends to appear before the female. This pattern is repeated in many pairs of words such as &quot;host and hostess&quot;, &quot;god and goddess&quot;, etc. 
One exception appears to be in parenting relationships, where female precedes male, as in &quot;mother and father&quot;, &quot;mum and dad&quot;, &quot;grandma and grandpa&quot;.</Paragraph> </Section> <Section position="3" start_page="51" end_page="53" type="sub_section"> <SectionTitle> Temporal Ordering </SectionTitle> <Paragraph position="0"> If one word refers to an event that precedes another temporally or logically, it almost always appears first. The examples in Table 2 were extracted by our experiment. It has been pointed out that for cyclical events, it is perfectly possible that the order of these pairs may be reversed (e.g., &quot;late night and early morning&quot;), though the data we extracted from the BNC showed strong tendencies in the directions given.</Paragraph> <Paragraph position="1"> A directed subgraph showing many events in human lives is shown in Figure 5.</Paragraph> <Paragraph position="2"> Prototype precedes Variant: In cases where one participant is regarded as a 'pure' substance and the other is a variant or mixture, the pure substance tends to come first. These occur particularly in scientific writing, examples including &quot;element and compound&quot;, &quot;atoms and molecules&quot;, &quot;metals and alloys&quot;. Also, we see &quot;apples and pears&quot;, &quot;apple and plums&quot;, and &quot;apples and oranges&quot;, suggesting that an apple is a prototypical fruit (in agreement with some of the results of prototype theory; see Rosch (1975)).</Paragraph> <Paragraph position="3"> Another possible version of this tendency is that core precedes periphery, which may also account for asymmetric ordering of food items such as &quot;fish and chips&quot;, &quot;bangers and mash&quot;, &quot;tea and coffee&quot; (in the British National Corpus, at least!). 
In some cases such as &quot;meat and vegetables&quot;, a hierarchical or figure / ground distinction may also be argued.</Paragraph> <Paragraph position="4"> Mistaken extractions: Our preliminary inspection has shown that the extraction technique finds comparatively few genuine mistakes, and the reader is encouraged to follow the links provided to check this claim. However, there are some genuine errors, most of which could be avoided with more sophisticated preprocessing.</Paragraph> <Paragraph position="5"> To improve recall in our initial lexical acquisition experiments, we chose to strip off modifiers and to stem plural forms to singular forms, so that &quot;apples and green pears&quot; would give a link between apple and pear.</Paragraph> <Paragraph position="6"> However, in many cases this is a mistake, because the bracketing should not be of the form &quot;A and (B C),&quot; but of the form &quot;(A and B) C.&quot; Using part-of-speech tags alone, we cannot recover this information. One example is the phrase &quot;hardware and software vendors,&quot; from which we obtain a link between hardware and vendors, instead of a link between hardware and software.</Paragraph> <Paragraph position="7"> A fuller degree of syntactic analysis would improve this situation. For extracting semantic relationships, Cederberg and Widdows (2003) demonstrated that noun-phrase chunking does this work very satisfactorily, while being much more tractable than full parsing. The mistaken pair middle and class shown in Table 1 is another of these mistakes, arising from phrases such as &quot;middle and upper class&quot; and &quot;middle and working class.&quot; These examples could be avoided simply by more accurate part-of-speech tagging (since the word &quot;middle&quot; should have been tagged as an adjective in these examples).</Paragraph> <Paragraph position="8"> This concludes our preliminary analysis of results. 
5 Filtering using Latent Semantic Analysis and Combinatoric Analysis. From the results in the previous section, the following points are clear.</Paragraph> <Paragraph position="9"> 1. It is possible to extract many accurate examples of asymmetric constructions that would be necessary knowledge for generation of natural-sounding language.</Paragraph> <Paragraph position="10"> 2. Some of the pairs extracted are examples of general semantic patterns, others are examples of genuinely idiomatic phrases.</Paragraph> <Paragraph position="11"> Even for semantically predictable phrases, the fact that the words occur in fixed patterns can be very useful for the purposes of disambiguation, as demonstrated by Yarowsky (1995). However, it would be useful to be able to tell which of the asymmetric patterns extracted by our experiments correspond to semantically regular phrases which happen to have a conventional ordering preference, and which phrases correspond to genuine idioms. This final section demonstrates two techniques for performing this filtering task, which show promising results for improving our classification, though they should not yet be considered reliable.</Paragraph> </Section> <Section position="4" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 5.1 Filtering using Latent Semantic Analysis </SectionTitle> <Paragraph position="0"> Latent semantic analysis or LSA (Landauer and Dumais, 1997) is by now a tried and tested technique for determining semantic similarity between words by analyzing large corpora (Widdows, 2004, Ch 6).</Paragraph> <Paragraph position="1"> Because of this, LSA can be used to determine whether a pair of words is likely to participate in a regular semantic relationship, even though LSA may not contribute specific information regarding the nature of the relationship. 
However, once a relationship is expected, LSA can be used to predict whether this relationship is used in contexts that are typical uses of the words in question, or whether these uses appear to be anomalies such as rare senses or idioms.</Paragraph> <Paragraph position="2"> This technique was used successfully by Cederberg and Widdows (2003) to improve the accuracy of hyponymy extraction. It follows that it should be useful to tell the difference between regularly related words and idiomatically related words.</Paragraph> <Paragraph position="3"> To test this hypothesis, we used an LSA model built from the BNC using the Infomap NLP software. This was used to measure the LSA similarity between the words in each of the pairs extracted by the techniques in Section 4. In cases where a word was too infrequent to appear in the LSA model, we used 'folding in,' which assigns a word-vector 'on the fly' by adding together the vectors of any surrounding words of a target word that are in the model.</Paragraph> <Paragraph position="4"> The results are shown in Table 3. The hypothesis is that words whose occurrence is purely idiomatic would have a low LSA similarity score, because they are otherwise not closely related. However, this hypothesis does not seem to have been confirmed, partly due to the effects of overall frequency. For example, the word Porgy only occurs in the phrase &quot;Porgy and Bess,&quot; and the word bobs almost always occurs in the phrase &quot;bits and bobs.&quot; A more effective filtering technique would need to normalize to account for these effects. However, there are some good results: for example, the low score between assault and battery reflects the fact that this usage, though compositional, is a rare meaning of the word battery, and the same argument can be made for element and compound. 
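The similarity computation and the 'folding in' step can be sketched as follows. This fragment does not use the Infomap NLP software or its interface; it assumes that word vectors are available as a plain dictionary, that cosine is the similarity measure, and that the context words of an out-of-model word have been collected beforehand. All names and vector values are invented for illustration and are not results from the BNC.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fold_in(context_words, vectors, dim):
    """Build a vector on the fly by summing the vectors of in-model context words."""
    total = [0.0] * dim
    for word in context_words:
        if word in vectors:
            total = [t + x for t, x in zip(total, vectors[word])]
    return total


def pair_similarity(a, b, vectors, contexts, dim):
    """Similarity of a word pair, folding in any word that is missing from the model."""
    def vec(word):
        if word in vectors:
            return vectors[word]
        return fold_in(contexts.get(word, []), vectors, dim)
    return cosine(vec(a), vec(b))


# Toy two-dimensional model (hypothetical values).
vectors = {
    "chalk": [1.0, 0.0],
    "limestone": [0.9, 0.1],
    "cheese": [0.0, 1.0],
    "bits": [0.6, 0.8],
}
# "bobs" is not in the model; its only in-model context word is "bits".
contexts = {"bobs": ["bits", "and", "pieces"]}

print(pair_similarity("chalk", "cheese", vectors, contexts, 2))
print(pair_similarity("bits", "bobs", vectors, contexts, 2))
```

In this toy setting the folded-in vector for bobs is built entirely from its partner bits, so the pair receives a high score despite being idiomatic, which is one way a frequency effect of the kind noted above could arise.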
Thus LSA might be a better guide for recognizing rarity in meaning of individual words than it is for idiomaticity of phrases.</Paragraph> </Section> <Section position="5" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 5.2 Link analysis </SectionTitle> <Paragraph position="0"> Another technique for determining whether a link is idiomatic or not is to check whether it connects two areas of meaning that are otherwise unconnected. A hallmark example of this phenomenon is the &quot;chalk and cheese&quot; example shown in Figure 6. Note that none of the other members of the rock-types cluster is linked to any of the other foodstuffs. We may be tempted to conclude that the single link between these clusters is an idiomatic phenomenon. This technique shows promise, but has yet to be explored in detail.</Paragraph> </Section> </Section> </Paper>