<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1017"> <Title>Constructing Semantic Space Models from Parsed Corpora</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Dependency-based Vector Space Models </SectionTitle> <Paragraph position="0"> Once we move away from words as the basic context unit, the issue of representation of syntactic information becomes pertinent. Information about the dependency relations between words abstracts over word order and can be considered as an intermediate layer between surface syntax and semantics. More formally, dependencies are asymmetric binary relationships between a head and a modifier (Tesnière, 1959). The structure of a sentence can be represented by a set of dependency relationships that form a tree, as shown in Figure 1. Here the head of the sentence is the verb carry, which is in turn modified by its subject lorry and its object apples.</Paragraph> <Paragraph position="1"> It is the dependencies in Figure 1 that will form the context over which the semantic space will be constructed. The construction mechanism sets out by identifying the local context of a target word, which is a subset of all dependency paths starting from it. The paths consist of the dependency edges of the tree, labelled with dependency relations such as subj, obj, or aux (see Figure 1). The paths can be ranked by a path value function which gives different weight to different dependency types (for example, it can be argued that subjects and objects convey more semantic information than determiners). Target words are then represented in terms of syntactic features, which form the dimensions of the semantic space. Paths are mapped to features by the path equivalence relation, and the appropriate cells in the matrix are incremented.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Definition of Semantic Space </SectionTitle> <Paragraph position="0"> We assume the semantic space formalisation proposed by Lowe (2001). A semantic space is a matrix whose rows correspond to target words and whose columns correspond to dimensions, which Lowe calls basis elements: Definition 1. A Semantic Space Model is a matrix $K = B \times T$, where $b_i \in B$ denotes the basis element of column $i$, $t_j \in T$ denotes the target word of row $j$, and $K_{ij}$ the cell $(i, j)$.</Paragraph> <Paragraph position="3"> T is the set of words for which the matrix contains representations; this can be either word types or word tokens. In this paper, we assume that co-occurrence counts are constructed over word types, but the framework can be easily adapted to represent word tokens instead.</Paragraph> <Paragraph position="4"> In traditional semantic spaces, the cells $K_{ij}$ of the matrix correspond to word co-occurrence counts. This is no longer the case for dependency-based models. In the following we explain how co-occurrence counts are constructed.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Building the Context </SectionTitle> <Paragraph position="0"> The first step in constructing a semantic space from a large collection of dependency relations is to construct a word's local context.</Paragraph> <Paragraph position="1"> Definition 2. The dependency parse p of a sentence s is an undirected graph $p(s) = (V_p, E_p)$. The set of nodes corresponds to the words of the sentence: $V_p = \{w_1, \ldots, w_n\}$. The set of edges is $E_p \subseteq V_p \times V_p$.</Paragraph>
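<Paragraph> As an illustration of Definition 2, the sketch below (in Python) represents a dependency parse as a set of word nodes and undirected edges. It assumes that the Figure 1 sentence is "A lorry might carry sweet apples", which is consistent with the words and edges mentioned in the text; the class and identifier names are ours and purely illustrative. </Paragraph>
```python
# A minimal sketch of Definition 2: a dependency parse as an undirected graph
# p(s) = (V_p, E_p). The sentence and its edges are reconstructed from the
# Figure 1 discussion and are illustrative only.

from typing import Set, Tuple

Word = str
Edge = Tuple[Word, Word]

class DependencyParse:
    """An undirected graph over the words of a sentence."""

    def __init__(self, words: Set[Word], edges: Set[Edge]):
        self.nodes = set(words)                                      # V_p
        # store each undirected edge in both directions for easy traversal
        self.edges = set(edges) | {(w2, w1) for (w1, w2) in edges}   # E_p, a subset of V_p x V_p

figure1 = DependencyParse(
    words={"a", "lorry", "might", "carry", "sweet", "apples"},
    edges={("a", "lorry"), ("lorry", "carry"), ("carry", "might"),
           ("carry", "apples"), ("sweet", "apples")},
)

print(("lorry", "carry") in figure1.edges)   # True
```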
<Paragraph position="2"> Definition 3. A class q is a three-tuple consisting of a POS-tag, a relation, and another POS-tag. We write Q for the set of all classes $Cat \times R \times Cat$. For each parse p, the labelling function $L_p : E_p \to Q$ assigns a class to every edge of the parse.</Paragraph> <Paragraph position="3"> In Figure 1, the labelling function labels the left-most edge as $L_p((a, lorry)) = \langle Det, det, N \rangle$. Note that Det represents the POS-tag "determiner" and det the dependency relation "determiner".</Paragraph> <Paragraph position="4"> In traditional models, the target words are surrounded by context words. In a dependency-based model, the target words are surrounded by dependency paths.</Paragraph> <Paragraph position="5"> Definition 4. A path ph is an ordered tuple of edges $\langle e_1, \ldots, e_n \rangle \in E_p^n$ such that $\forall i : (e_{i-1} = (v_1, v_2) \wedge e_i = (v_3, v_4)) \Rightarrow v_2 = v_3$. Definition 5. A path anchored at a word w is a path $\langle e_1, \ldots, e_n \rangle$ such that $e_1 = (v_1, v_2)$ and $w = v_1$. We write $Ph_w$ for the set of all paths over $E_p$ anchored at w. In words, a path is a tuple of connected edges in a parse graph, and it is anchored at w if it starts at w. In Figure 1, the set of paths anchored at lorry is $\{\langle (lorry, carry) \rangle, \langle (lorry, carry), (carry, apples) \rangle, \langle (lorry, a) \rangle, \langle (lorry, carry), (carry, might) \rangle, \ldots\}$ (for the sake of brevity, we only show paths up to length 2). The local context of a word is the set or a subset of its anchored paths. The class information can always be recovered by means of the labelling function.</Paragraph> <Paragraph position="6"> Definition 6. A local context of a word w from a sentence s is a subset of the anchored paths at w. A function $c : W \to 2^{Ph_w}$ which assigns a local context to a word is called a context specification function. The context specification function makes it possible to eliminate paths on the basis of their classes. For example, it is possible to eliminate all paths from the set of anchored paths except those which contain immediate subject and direct object relations. This can be formalised as $c(w) = \{\langle e \rangle \in Ph_w \mid L_p(e) \in \{\langle N, subj, V \rangle, \langle V, obj, N \rangle\}\}$.</Paragraph> <Paragraph position="8"> In Figure 1, the labels of the two edges which form paths of length 1 and conform to this context specification are marked in boldface. Notice that the local context of lorry contains only one anchored path ($c(lorry) = \{\langle (lorry, carry) \rangle\}$).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Quantifying the Context </SectionTitle> <Paragraph position="0"> The second step in the construction of the dependency-based semantic models is to specify the relative importance of different paths. Linguistic information can be incorporated into our framework through the path value function.</Paragraph> <Paragraph position="1"> Definition 7. The path value function v assigns a real number to a path: $v : Ph \to \mathbb{R}$.</Paragraph> <Paragraph position="2"> For instance, the path value function could penalise longer paths for only expressing indirect relationships between words. An example of a length-based path value function is $v(ph) = \frac{1}{n}$, where $ph = \langle e_1, \ldots, e_n \rangle$. This function assigns a value of 1 to the one path from c(lorry) and fractions to longer paths.</Paragraph>
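<Paragraph> The following sketch puts Definitions 3-7 together: labelled edges, anchored paths, a subject/object context specification, and the length-based path value function. The edge labels are reconstructed from the Figure 1 discussion (the label of the sweet-apples edge is an assumption), and the function names are illustrative rather than part of the formalism. </Paragraph>
```python
# A sketch of Definitions 3-7 for the Figure 1 parse: the labelling function,
# anchored paths, a subject/object context specification, and the length-based
# path value function v(ph) = 1/n.

from typing import Dict, List, Tuple

Word = str
Edge = Tuple[Word, Word]
Path = Tuple[Edge, ...]
Klass = Tuple[str, str, str]   # <POS-tag, relation, POS-tag> (Definition 3)

# Labelling function L_p: one class per edge of the Figure 1 parse.
LABELS: Dict[Edge, Klass] = {
    ("lorry", "carry"):  ("N", "subj", "V"),
    ("carry", "apples"): ("V", "obj", "N"),
    ("a", "lorry"):      ("Det", "det", "N"),
    ("might", "carry"):  ("Aux", "aux", "V"),
    ("sweet", "apples"): ("A", "mod", "N"),    # label assumed, not given in the text
}
# The parse is undirected, so every edge can be traversed in both directions.
EDGES = set(LABELS) | {(w2, w1) for (w1, w2) in LABELS}

def anchored_paths(word: Word, max_len: int = 2) -> List[Path]:
    """All paths over E_p that start at `word` (Definition 5), up to max_len edges."""
    paths: List[Path] = []

    def extend(path: Path, node: Word, visited: set) -> None:
        if len(path) == max_len:
            return
        for (v1, v2) in EDGES:
            if v1 == node and v2 not in visited:
                new_path = path + ((v1, v2),)
                paths.append(new_path)
                extend(new_path, v2, visited | {v2})

    extend((), word, {word})
    return paths

def klass(edge: Edge) -> Klass:
    """Look up the class of an edge irrespective of traversal direction."""
    return LABELS.get(edge) or LABELS[(edge[1], edge[0])]

def subj_obj_context(word: Word) -> List[Path]:
    """Context specification of Section 2.2: keep length-1 subject/object paths."""
    keep = {("N", "subj", "V"), ("V", "obj", "N")}
    return [p for p in anchored_paths(word) if len(p) == 1 and klass(p[0]) in keep]

def path_value(path: Path) -> float:
    """Length-based path value function of Definition 7: v(ph) = 1/n."""
    return 1.0 / len(path)

print(subj_obj_context("lorry"))                               # [(('lorry', 'carry'),)], i.e. c(lorry)
print(path_value((("lorry", "carry"), ("carry", "apples"))))   # 0.5
```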
<Paragraph position="3"> Once the value of all paths in the local context is determined, the dimensions of the space must be specified. Unlike word-based models, our contexts contain syntactic information, and dimensions can be defined in terms of syntactic features. The path equivalence relation combines functionally equivalent dependency paths that share a syntactic feature into equivalence classes.</Paragraph> <Paragraph position="4"> Definition 8. Let $\sim$ be the path equivalence relation on Ph. The partition induced by this equivalence relation is the set of basis elements B.</Paragraph> <Paragraph position="5"> For example, it is possible to combine all paths which end at the same word: a path which starts at $w_i$ and ends at $w_j$, irrespective of its length and class, will count as a co-occurrence of $w_i$ and $w_j$. This word-based equivalence function can be defined in the following manner: $\langle (v_1, v_2), \ldots, (v_{n-1}, v_n) \rangle \sim \langle (v'_1, v'_2), \ldots, (v'_{m-1}, v'_m) \rangle$ iff $v_n = v'_m$. This means that in Figure 1 the set of basis elements is the set of words at which paths end. Although co-occurrence counts are constructed over words as in traditional semantic space models, it is only words which stand in a syntactic relationship to the target that are taken into account.</Paragraph> <Paragraph position="6"> Once the value of all paths in the local context is determined, the local observed frequency for the co-occurrence of a basis element b with the target word w is just the sum of the values of all paths ph in this context which express the basis element b. The global observed frequency is the sum of the local observed frequencies for all occurrences of a target word type t and is therefore a measure of the co-occurrence of t and b over the whole corpus.</Paragraph> <Paragraph position="7"> Definition 9. Global observed frequency: $f(b, t) = \sum_{w \in W(t)} \sum_{ph \in c(w),\, [ph] = b} v(ph)$, where $W(t)$ denotes the occurrences of the target word type $t$ and $[ph]$ the equivalence class of the path $ph$.</Paragraph> <Paragraph position="9"> As Lowe (2001) notes, raw frequency counts are likely to give misleading results. Due to the Zipfian distribution of word types, words occurring with similar frequencies will be judged more similar than they actually are. A lexical association function can be used to explicitly factor out chance co-occurrences. Definition 10. Write A for the lexical association function which computes the value of a cell of the matrix from a co-occurrence frequency: $K_{ij} = A(f(b_i, t_j))$.</Paragraph> </Section> </Section>
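<Paragraph> A minimal sketch of Definitions 8-10 is given below: the word-based path equivalence relation, the accumulation of global observed frequencies, and a lexical association hook that maps raw frequencies to matrix cells. Corpus handling is omitted, and the association function defaults to the identity as a placeholder; in the model described here, the log-likelihood ratio (Dunning, 1993) would be plugged in at that point. </Paragraph>
```python
# A sketch of Definitions 8-10: word-based path equivalence, global observed
# frequencies, and a lexical association function applied to the raw counts.
# `local_contexts` stands for the output of a context specification function
# applied to every target occurrence in a corpus; it is a placeholder here.

from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Word = str
Edge = Tuple[Word, Word]
Path = Tuple[Edge, ...]

def word_based_basis(path: Path) -> Word:
    """Word-based path equivalence: two paths are equivalent iff they end at
    the same word, so each equivalence class is named by that final word."""
    return path[-1][1]

def build_matrix(
    local_contexts: Iterable[Tuple[Word, List[Path]]],    # (target occurrence, c(w)) pairs
    basis_elements: List[Word],                            # e.g. the most frequent words
    path_value: Callable[[Path], float],
    association: Callable[[float], float] = lambda f: f,   # Definition 10 hook (placeholder)
) -> Dict[Word, Dict[Word, float]]:
    """Accumulate global observed frequencies f(b, t) (Definition 9) and map
    them through the lexical association function to fill the matrix cells."""
    basis = set(basis_elements)
    freq: Dict[Word, Dict[Word, float]] = defaultdict(lambda: defaultdict(float))
    for target, paths in local_contexts:            # every occurrence of a target type
        for ph in paths:
            b = word_based_basis(ph)
            if b in basis:
                freq[target][b] += path_value(ph)   # local observed frequency
    return {t: {b: association(f) for b, f in row.items()} for t, row in freq.items()}

# Toy usage with the Figure 1 context of "lorry":
contexts = [("lorry", [(("lorry", "carry"),)])]
matrix = build_matrix(contexts, basis_elements=["carry", "apples"],
                      path_value=lambda ph: 1.0 / len(ph))
print(matrix)   # {'lorry': {'carry': 1.0}}
```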
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Parameter Settings </SectionTitle> <Paragraph position="0"> All our experiments were conducted on the British National Corpus (BNC), a 100 million word collection of samples of written and spoken language (Burnard, 1995). We used Lin's (1998) broad-coverage dependency parser MINIPAR to obtain a parsed version of the corpus. MINIPAR employs a manually constructed grammar and a lexicon derived from WordNet with the addition of proper names (130,000 entries in total). Lexicon entries contain part-of-speech and subcategorisation information. The grammar is represented as a network of 35 nodes (i.e., grammatical categories) and 59 edges (i.e., types of syntactic (dependency) relationships).</Paragraph> <Paragraph position="1"> MINIPAR uses a distributed chart parsing algorithm. Grammar rules are implemented as constraints associated with the nodes and edges.</Paragraph> <Paragraph position="4"> The dependency-based semantic space was constructed with the word-based path equivalence function from Section 2.3. As basis elements for our semantic space, the 1000 most frequent words in the BNC were used. Each element of the resulting vector was replaced with its log-likelihood value (see Definition 10 in Section 2.3), which can be considered an estimate of how surprising or distinctive a co-occurrence pair is (Dunning, 1993).</Paragraph> <Paragraph position="5"> We experimented with a variety of distance measures such as cosine, Euclidean distance, L1 norm, Jaccard's coefficient, Kullback-Leibler divergence and the Skew divergence (see Lee 1999 for an overview). We obtained the best results for cosine (Experiment 1) and Skew divergence (Experiment 2). The two measures are shown in Figure 2.</Paragraph> <Paragraph position="6"> The Skew divergence represents a generalisation of the Kullback-Leibler divergence and was proposed by Lee (1999) as a linguistically motivated distance measure. We use a value of $\alpha = .99$.</Paragraph> <Paragraph position="7"> We explored in detail the influence of different types and sizes of context by varying the context specification and path value functions. Contexts were defined over the set of the 23 most frequent dependency relations, which accounted for half of the dependency edges found in our corpus. From these, we constructed four context specification functions: (a) minimum contains paths of length 1 (in Figure 1, sweet and carry are the minimum context for apples), (b) np adds dependency information relevant for noun compounds to the minimum context, (c) wide takes into account paths of length longer than 1 that represent meaningful linguistic relations such as argument structure, but also prepositional phrases and embedded clauses (in Figure 1, the wide context of apples is sweet, carry, lorry, and might), and (d) maximum combines all of the above into a rich context representation.</Paragraph> <Paragraph position="8"> Four path value functions were used: (a) plain assigns the same value to every path, (b) length assigns a value inversely proportional to a path's length, (c) oblique ranks paths according to the obliqueness hierarchy of grammatical relations (Keenan and Comrie, 1977), and (d) oblength combines length and oblique. The resulting 14 parametrisations are shown in Table 1. Length-based and length-neutral path value functions are collapsed for the minimum context specification since it only considers paths of length 1.</Paragraph> <Paragraph position="9"> In Experiments 1 and 2 we further compare our dependency-based model against a state-of-the-art vector-based model where context is defined as a "bag of words". Note that considerable latitude is allowed in setting parameters for vector-based models. In order to allow a fair comparison, we selected parameters for the traditional model that have been considered optimal in the literature (Patel et al., 1998), namely a symmetric 10-word window and the most frequent 500 content words from the BNC as dimensions. These parameters were similar to those used by Lowe and McDonald (2000) (symmetric 10-word window and 536 content words). Again, the log-likelihood score is used to factor out chance co-occurrences.</Paragraph> </Section>
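<Paragraph> For reference, the two distance measures used in the experiments can be sketched as follows: cosine and the Skew divergence with α = .99. The code follows the standard definitions (Lee, 1999) and is an illustrative sketch rather than an implementation from the paper; vectors are assumed to be non-negative and, for the divergence, are renormalised to probability distributions. </Paragraph>
```python
# A sketch of the two distance measures used in the experiments: cosine and the
# skew divergence (Lee, 1999), s_alpha(q, r) = D_KL(r || alpha*q + (1-alpha)*r),
# here with alpha = 0.99. Vectors are plain Python lists.

import math
from typing import List

def cosine(x: List[float], y: List[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def _normalise(x: List[float]) -> List[float]:
    total = sum(x)
    return [a / total for a in x]

def skew_divergence(q: List[float], r: List[float], alpha: float = 0.99) -> float:
    """KL divergence of r from the alpha-smoothed mixture of q and r."""
    q, r = _normalise(q), _normalise(r)
    div = 0.0
    for qi, ri in zip(q, r):
        if ri > 0.0:
            mix = alpha * qi + (1.0 - alpha) * ri   # always > 0 when ri > 0
            div += ri * math.log(ri / mix)
    return div

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))           # 1.0 (identical direction)
print(skew_divergence([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 0.0 (identical distributions)
```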
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Experiment 1: Priming </SectionTitle> <Paragraph position="0"> A large number of modelling studies in psycholinguistics have focused on simulating semantic priming studies. The semantic priming paradigm provides a natural test bed for semantic space models, as it concentrates on the semantic similarity or dissimilarity between a prime and its target, and it is precisely this type of lexical relation that vector-based models capture.</Paragraph> <Paragraph position="1"> In this experiment we focus on Balota and Lorch's (1986) mediated priming study. In semantic priming, transient presentation of a prime word like tiger directly facilitates pronunciation or lexical decision on a target word like lion. Mediated priming extends this paradigm by additionally allowing indirectly related words as primes, such as stripes, which is related to lion only by means of the intermediate concept tiger. Balota and Lorch (1986) obtained small mediated priming effects for pronunciation tasks but not for lexical decision. For the pronunciation task, reaction times were reduced significantly for both direct and mediated primes; however, the effect was larger for direct primes.</Paragraph> <Paragraph position="2"> There are at least two semantic space simulations that attempt to shed light on the mediated priming effect. Lowe and McDonald (2000) replicated both the direct and mediated priming effects, whereas Livesay and Burgess (1997) could only replicate direct priming. In their study, mediated primes were farther from their targets than unrelated words.</Paragraph> <Paragraph position="3"> Materials were taken from Balota and Lorch (1986). They consist of 48 target words, each paired with a related and a mediated prime (e.g., lion-tiger-stripes). Each related-mediated prime tuple was paired with an unrelated control randomly selected from the complement set of related primes.</Paragraph> <Paragraph position="4"> One stimulus was removed as it had a low corpus frequency (less than 100), which meant that the resulting vector would be unreliable. We constructed vectors from the BNC for all stimuli with the dependency-based models and the traditional model, using the parametrisations given in Section 3.1 and cosine as the distance measure. We calculated the distance in semantic space between targets and their direct primes (TarDirP), targets and their mediated primes (TarMedP), and targets and their unrelated controls (TarUnC) for both models.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3.2.3 Results </SectionTitle> <Paragraph position="0"> We carried out a one-way Analysis of Variance (ANOVA) with distance as the dependent variable (TarDirP, TarMedP, TarUnC). Recall from Table 1 that we experimented with fourteen different context definitions. A reliable effect of distance was observed for all models (p < .001). We used the η² statistic to calculate the amount of variance accounted for by the different models. Figure 3 plots η² against the different contexts. The best result was obtained for model 7, which accounts for 23.1% of the variance (F(2, 140) = 20.576, p < .001) and corresponds to the wide context specification and the plain path value function. A reliable distance effect was also observed for the traditional vector-based model (F(2, 138) = 9.384, p < .001).</Paragraph> <Paragraph position="1"> Pairwise ANOVAs were further performed to examine the size of the direct and mediated priming effects individually (see Table 2). There was a reliable direct priming effect (F(1, 94) = 25.290, p < .001), but we failed to find a reliable mediated priming effect (F(1, 93) = .001, p = .790). A reliable direct priming effect (F(1, 92) = 12.185, p = .001) but no mediated priming effect was also obtained for the traditional vector-based model. We used the η² statistic to compare the effect sizes obtained for the dependency-based and traditional models. The best dependency-based model accounted for 23.1% of the variance, whereas the traditional model accounted for 12.2% (see also Table 2).</Paragraph>
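<Paragraph> The analysis above can be reproduced in outline as follows: a one-way ANOVA over the three distance conditions together with an η² effect size (the same routine applies to the pairwise comparisons). The sketch assumes NumPy and SciPy are available; the arrays are random placeholders standing in for the per-item distances (TarDirP, TarMedP, TarUnC), not the actual data. </Paragraph>
```python
# A sketch of the reported analysis: one-way ANOVA over the three distance
# conditions plus an eta-squared effect size. The arrays are random placeholders
# for the 47 per-item distances (48 stimuli minus the one removed).

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
tar_dir_p = rng.normal(0.30, 0.05, size=47)   # target vs. direct prime
tar_med_p = rng.normal(0.45, 0.05, size=47)   # target vs. mediated prime
tar_un_c  = rng.normal(0.46, 0.05, size=47)   # target vs. unrelated control

f_stat, p_value = f_oneway(tar_dir_p, tar_med_p, tar_un_c)

# eta squared = between-group sum of squares / total sum of squares
groups = [tar_dir_p, tar_med_p, tar_un_c]
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((np.concatenate(groups) - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.3f}, p = {p_value:.4f}, eta^2 = {eta_squared:.3f}")
```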
<Paragraph position="2"> Our results indicate that dependency-based models are able to model direct priming across a wide range of parameters. Our results also show that larger contexts (see models 7 and 11 in Figure 3) are more informative than smaller contexts (see models 1 and 3 in Figure 3), but note that the wide context specification performed better than maximum. At least for mediated priming, a uniform path value as assigned by the plain path value function outperforms all other functions (see Figure 3).</Paragraph> <Paragraph position="3"> Neither our dependency-based model nor the traditional model was able to replicate the mediated priming effect reported by Lowe and McDonald (2000) (see L & McD in Table 2). This may be due to differences in the lemmatisation of the BNC, the parametrisations of the model, or the choice of context words (Lowe and McDonald use a special procedure to identify "reliable" context words). Our results also differ from Livesay and Burgess (1997), who found that mediated primes were further from their targets than unrelated controls, although they used a model and corpus different from the ones we employed for our comparative studies. In the dependency-based model, mediated primes were virtually indistinguishable from unrelated words.</Paragraph> <Paragraph position="4"> In sum, our results indicate that a model which takes syntactic information into account outperforms a traditional vector-based model which simply relies on word occurrences. Our model is able to reproduce the well-established direct priming effect but not the more controversial mediated priming effect. Our results point to the need for further comparative studies among semantic space models where variables such as corpus choice and size as well as preprocessing (e.g., lemmatisation, tokenisation) are controlled for.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Experiment 2: Encoding of Relations </SectionTitle> <Paragraph position="0"> In this experiment we examine whether dependency-based models construct a semantic space that encapsulates different lexical relations. More specifically, we will assess whether word pairs capturing different types of semantic relations (e.g., hyponymy, synonymy) can be distinguished in terms of their distances in the semantic space.</Paragraph> <Paragraph position="1"> Our experimental materials were taken from Hodgson (1991), who, in an attempt to investigate which types of lexical relations induce priming, collected a set of 142 word pairs exemplifying the following semantic relations: (a) synonymy (words with the same meaning, e.g., value and worth), (b) superordination and subordination (one word is an instance of the kind expressed by the other word, e.g., pain and sensation), (c) category coordination (words which express two instances of a common superordinate concept, e.g., truck and train), (d) antonymy (words with opposite meanings, e.g., friend and enemy), (e) conceptual association (the first word subjects produce in free association given the other word, e.g., leash and dog), and (f) phrasal association (words which co-occur in phrases, e.g., private and property).</Paragraph> <Paragraph position="2"> The pairs were selected to be unambiguous examples of the relation type they instantiate and were matched for frequency. The pairs cover a wide range of parts of speech, such as adjectives, verbs, and nouns. [Figure: results for model 7] As in Experiment 1, six words with low frequencies (less than 100) were removed from the materials. Vectors were computed for the remaining 278 words for both the traditional and the dependency-based models, again with the parametrisations detailed in Section 3.1. We calculated the semantic distance for every word pair, this time using the Skew divergence as the distance measure.</Paragraph> </Section> </Section> </Paper>