<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1105">
  <Title>Comparison of Similarity Models for the Relation Discovery Task</Title>
  <Section position="5" start_page="25" end_page="27" type="metho">
    <SectionTitle>
3 Modelling Relation Similarity
</SectionTitle>
    <Paragraph position="0"> The possible space of models for relation similarity can be explored in a principled manner by parameterisation. In this section, we discuss several 2Previous approaches select labels from the collection of context words for a relation cluster (Hasegawa et al., 2004; Zhang et al., 2005). Chen et al. (2005) use discriminative category matching to make sure that selected labels are also able to differentiate between clusters.</Paragraph>
    <Paragraph position="1">  parameters including the term context representation, whether or not we apply dimensionality reduction, and what similarity measure we use.</Paragraph>
    <Section position="1" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Term Context
</SectionTitle>
      <Paragraph position="0"> Representing texts in such a way that they can be compared is a familiar problem from the fields of information retrieval (IR), text mining (TM), textual data analysis (TDA) and natural language processing (NLP) (Lebart and Rajman, 2000).</Paragraph>
      <Paragraph position="1"> The traditional model for IR and TM is based on a term-by-document (TxD) vector representation. Previous approaches to relation discovery (Hasegawa et al., 2004; Chen et al., 2005) have been limited to TxD representations, using tf*idf weighting and the cosine similarity measure. In information retrieval, the weighted term representation works well as the comparison is generally between pieces of text with large context vectors.</Paragraph>
      <Paragraph position="2"> In the relation discovery task, though, the term contexts (as we will define them in Section 4) can be very small, often consisting of only one or two words. This means that a term-based similarity matrix between entity pairs is very sparse, which may pose problems for performing reliable clustering. null An alternative method widely used in NLP and cognitive science is to represent a term context by its neighbouring words as opposed to the documents in which it occurs. This term co-occurrence (TxT) model is based on the intuition that two words are semantically similar if they appear in a similar set of contexts (see e.g.</Paragraph>
      <Paragraph position="3"> Pado and Lapata (2003)). The current work explores such a term co-occurrence (TxT) representation based on the hypothesis that it will provide a more robust representation of relation contexts and help overcome the sparsity problems associated with weighted term representations in the relation discovery task. This is compared to a baseline term-by-document (TxD) representation which is a re-implementation of the approach used by Hasegawa et al. (2004) and Chen et al. (2005).</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.2 Dimensionality Reduction
</SectionTitle>
      <Paragraph position="0"> Dimensionality reduction techniques for document and corpus modelling aim to reduce description length and model a type of semantic similarity that is more linguistic in nature (e.g., see Landauer et al.'s (1998) discussion of LSA and synonym tests). In the current work, we explore singular value decomposition (Berry et al., 1994), a technique from linear algebra that has been applied to a number of tasks from NLP and cognitive modelling. We also explore latent Dirichlet allocation, a probabilistic technique analogous to singular value decomposition whose contribution to NLP has not been as thoroughly explored.</Paragraph>
      <Paragraph position="1"> Singular value decomposition (SVD) has been used extensively for the analysis of lexical semantics under the name of latent semantic analysis (Landauer et al., 1998). Here, a rectangular matrix is decomposed into the product of three matrices (Xwxp = WwxnSnxn(Ppxn)T) with n 'latent semantic' dimensions. The resulting decomposition can be viewed as a rotation of the n-dimensional axes such that the first axis runs along the direction of largest variation among the documents (Manning and Sch&amp;quot;utze, 1999). W and P represent terms and documents in the new space. And S is a diagonal matrix of singular values in decreasing order.</Paragraph>
      <Paragraph position="2"> Taking the product WwxkSkxk(Ppxk)T over the first D columns gives the best least square approximation of the original matrix X by a matrix of rank D, i.e. a reduction of the original matrix to D dimensions. SVD can equally be applied to the word co-occurrence matrices obtained in the TxT representation presented in Section 2, in which case we can think of the original matrix as being a term x co-occurring term feature matrix.</Paragraph>
      <Paragraph position="3"> While SVD has proved successful and has been adapted for tasks such as word sense discrimination (Sch&amp;quot;utze, 1998), its behaviour is not easy to interpret. Probabilistic LSA (pLSA) is a generative probabilistic version of LSA (Hofmann, 2001). This models each word in a document as a sample from a mixture model, but does not provide a probabilistic model at the document level.</Paragraph>
      <Paragraph position="4"> Latent Dirichlet Allocation (LDA) addresses this by representing documents as random mixtures over latent topics (Blei et al., 2003). Besides having a clear probabilistic interpretation, an additional advantage of these models is that they have intuitive graphical representations.</Paragraph>
      <Paragraph position="5"> Figure 3 contains a graphical representation of the LDA model as applied to TxT word co-occurrence matrices in standard plate notation. This models the word features f in the co-occurrence context (size N) of each word w (where w [?] W and |W |= W) with a mixture of topics z. In its generative mode, the LDA model samples a topic from the word-specific multino- null mial distribution th. Then, each context feature is generated by sampling from a topic-specific multinomial distribution phz.3 In a manner analogous to the SVD model, we use the distribution over topics for a word w to represent its semantics and we use the average topic distribution over all context words to represent the conceptual content of an entity pair context.</Paragraph>
    </Section>
    <Section position="3" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
3.3 Measuring Similarity
</SectionTitle>
      <Paragraph position="0"> Cosine (Cos) is commonly used in the literature to compute similarities between tf*idf vectors:</Paragraph>
      <Paragraph position="2"> In the current work, we use cosine over term and SVD representations of entity pair context.</Paragraph>
      <Paragraph position="3"> However, it is not clear which similarity measure should be used for the probabilistic topic models.</Paragraph>
      <Paragraph position="4"> Dagan et al. (1997) find that the symmetric information radius measure performs best on a pseudo-word sense disambiguation task, while Lee (1999) find that the asymmetric skew divergence - a generalisation of Kullback-Leibler divergence - performs best for improving probability estimates for unseen word co-occurrences.</Paragraph>
      <Paragraph position="5"> In the current work, we compare KL divergence with two methods for deriving a symmetric mea3The hyperparameters a and b are Dirichlet priors on the multinomial distributions for word features (ph [?] Dir(b)) and topics (th [?] Dir(a)). The choice of the Dirichlet is explained by its conjugacy to the multinomial distribution, meaning that if the parameter (e.g. ph, th) for a multinomial distribution is endowed with a Dirichlet prior then the posterior will also be a Dirichlet. Intuitively, it is a distribution over distributions used to encode prior knowledge about the parameters (ph and th) of the multinomial distributions for word features and topics. Practically, it allows efficient estimation of the joint distribution over word features and topics P(vectorf,vectorz) by integrating out ph and th.</Paragraph>
      <Paragraph position="6"> sure. The KL divergence of two probability distributions (p and q) over the same event space is defined as:</Paragraph>
      <Paragraph position="8"> In information-theoretic terms, KL divergence is the average number of bits wasted by encoding events from a distribution p with a code based on distribution q. The symmetric measures are defined as:</Paragraph>
      <Paragraph position="10"> The first is termed symmetrised KL divergence (Sym) and the second is termed Jensen-Shannon (JS) divergence. We explore KL divergence as well as the symmetric measures as it is not known in advance whether a domain is symmetric or not.</Paragraph>
      <Paragraph position="11"> Technically, the divergence measures are dissimilarity measures as they calculate the difference between two distributions. However, they can be converted to increasing measures of similarity through various transformations. We treated this as a parameter to be tuned during development and considered two approaches. The first is from Dagan et al. (1997). For KL divergence, this function is defined as Sim(p,q) = 10[?]bKL(p||q), where b is a free parameter, which is tuned on the development set (as described in Section 4.2). The same procedure is applied for symmetric KL divergence and JS divergence. The second approach is from Lee (1999). Here similarity for KL is defined as Sim(p,q) = C [?]KL(p||q), where C is a free parameter to be tuned.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="27" end_page="29" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Materials
</SectionTitle>
      <Paragraph position="0"> Following Chen et al. (2005), we derive our relation discovery data from the automatic content extraction (ACE) 2004 and 2005 materials for evaluation of information extraction.4 This is preferable to using the New York Times data used by Hasegawa et al. (2004) as it has gold standard annotation, which can be used for unbiased evaluation. null The relation clustering data is based on the gold standard relations in the information extraction  data. We only consider data from newswire or broadcast news sources. We constructed six data subsets from the ACE corpus based on four of the ACE entities: persons (PER), organisations (ORG), geographical/social/political entities (GPE) and facilities (FAC). The six data subsets were chosen during development based on a lower limit of 50 for the data subset size (i.e. the number of entity pairs in the domain), ensuring that there is a reasonable amount of data. We also set a lower limit of 3 for the number of classes (relation types) in a data subset, ensuring that the clustering task is not too simple.</Paragraph>
      <Paragraph position="1"> The entity pair instances for clustering were chosen based on several criteria. First, we do not use ACE's discourse relations, which are relations in which the entity referred to is not an official entity according to world knowledge. Second, we only use pairs with one or more non-stop words in the intervening context, that is the context between the two entity heads.5 Finally, we only keep relation classes with 3 or more members. Table 4.1 contains the full list of relation types from the subsets of ACE that we used. (Refer to Table 4.2 for definition of the relation type abbreviations.) We use the Infomap tool6 for singular value decomposition of TxT matrices and compute the conceptual content of an entity pair context as the average over the reduced D-dimensional representation of the co-occurrence vector of the terms in the relation context. For LDA, we use Steyvers and Griffiths' Topic Modeling Toolbox7). The input is produced by a version of Infomap which was modified to output the TxT matrix. Again, we compute the conceptual content of an entity pair as the average over the topic vectors for the context words. As documents are explicitly modelled in the LDA model, we input a matrix with raw frequencies. In the TxD, unreduced TxT and SVD models we use tf*idf term weighting.</Paragraph>
      <Paragraph position="2"> We use the same preprocessing when preparing the text for building the SVD and probabilistic topic models as we use for processing the intervening context of entity pairs. This consisted of Mx-Terminator (Reynar and Ratnaparkhi., 1997) for sentence boundary detection, the Penn Treebank  sed script8 for tokenisation, and the Infomap stop word list. We also use an implementation of the Porter algorithm (Porter, 1980) for stemming.9</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
4.2 Model Selection
</SectionTitle>
      <Paragraph position="0"> We used the ACE 2004 relation data to perform model selection. Firstly, dimensionality (D) needs to be optimised for SVD and LDA. SVD was found to perform best with the number of dimensions set to 10. For LDA, dimensionality interacts with the divergence-to-similarity conversion so they were tuned jointly. The optimal configuration varies by the divergence measure with D = 50 and C = 14 for KL divergence, D = 200 and C = 4 for symmetrised KL, and D = 150 and C = 2 for JS divergence. For all divergence measures, Lee's (1999) method outperformed Dagan et al.'s (1997) method. Also for all divergence measures, the model hyper-parameter b was found to be optimal at 0.0001. The a hyper-parameter was always set to 50/T following Griffiths and Steyvers (2004).</Paragraph>
      <Paragraph position="1"> Clustering is performed with the CLUTO software10 and the technique used is identical across models. Agglomerative clustering is used for comparability with the original relation discovery work of Hasegawa et al. (2004). This choice was motivated because as it is not known in advance how many clusters there should be in a new domain. null One way to view the clustering problem is as an optimisation process where an optimal clustering is chosen with respect to a criterion function over the entire solution. The criterion function used here was chosen based on performance on the development data. We compared a number of criterion functions including single link, complete link, group average, I1, I2, E1 and H1. I1 is a criterion function that maximises sum of pairwise similarities between relation instances assigned to each cluster, I2 is an internal criterion function that maximises the similarity between each relation instance and the centroid of the cluster it is assigned to, E1 is an external criterion function that minimises the similarity between the centroid vector of each cluster and the centroid vector of the</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="29" end_page="29" type="metho">
    <SectionTitle>
ORG-GPE ORG-ORG PER-FAC PER-GPE PER-ORG PER-PER
</SectionTitle>
    <Paragraph position="0"> basedin 54 subsidiary 36 located 127 located 222 staff 121 business 81 subsidiary 27 emporgothr 14 owner 14 resident 79 executive 100 family 20 located 15 partner 8 near 4 executive 42 member 44 persocothr 16 gpeaffothr 3 member 6 staff 30 emporgothr 27 perorgothr 9 employgen 7 employgen 9 near 7 located 4 ethnic 5 executive 3 ideology 3 member 3 Total 99 Total 64 Total 145 Total 380 Total 305 Total 147  entire collection, and H1 is a combined criterion function that consists of the ration of I1 over E1. The I2, H1 and H2 criterion functions outperformed single link, complete link and group average on the development data. We use I2, which performed as well as H1 and H2 and is superior in terms of computational complexity (Zhao and Karypis, 2004).</Paragraph>
  </Section>
class="xml-element"></Paper>