<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2045"> <Title>Unsupervised Feature Selection for Relation Extraction</Title> <Section position="3" start_page="262" end_page="264" type="metho"> <SectionTitle> 2 Proposed Method </SectionTitle> <Paragraph position="0"> Feature selection for relation extraction is the task of finding important contextual words which will help to discriminate relation types. Unlike supervised learning, where class labels can guide feature search, in unsupervised learning, it is expected to define a criterion to assess the importance of the feature subsets. Due to the interplay between feature selection and clustering solution, we should define an objective function to evaluate both feature subset and model order.</Paragraph> <Paragraph position="1"> In this paper, the model selection capability is achieved by resampling based stability analysis, which has been successfully applied to several unsupervised learning problems (e.g. (Levine and Domany, 2001), (Lange et al., 2002), (Roth and Lange et al., 2003), (Niu et al., 2004)). We extend the cluster validation strategy further to address both feature selection and model order identification. null Table 1 presents our model selection algorithm.</Paragraph> <Paragraph position="2"> The objective function MFk;k is relevant with both feature subset and model order. Clustering solution that is stable against resampling will give rise to a local optimum of MFk;k, which indicates both important feature subset and the true cluster number.</Paragraph> <Section position="1" start_page="262" end_page="262" type="sub_section"> <SectionTitle> 2.1 Entropy-based Feature Ranking </SectionTitle> <Paragraph position="0"> Let P = fp1;p2;:::pNg be a set of local context vectors of co-occurrences of entity pair E1 and E2. Here, the context includes the words occurring between, before and after the entity pair. Let W = fw1;w2;:::;wMg represent all the words occurred in P. To select a subset of important features from W, words are first ranked according to their importance on clustering. The importance can be assessed by the entropy criterion.</Paragraph> <Paragraph position="1"> Entropy-based feature ranking is based on the assumption that a feature is irrelevant if the presence of it obscures the separability of data set(Dash et al., 2000).</Paragraph> <Paragraph position="2"> We assume pn, 1 * n * N, lies in feature space W, and the dimension of feature space is tion Input: Corpus D tagged with Entities(E1;E2); Output: Feature subset and Model Order (number of relation types); 1. Collect the contexts of all entity pairs in the document corpus D, namely P; 2. Rank features using entropy-based method described in section 2.1; 3. Set the range (Kl;Kh) for the possible number of relation clusters; 4. Set estimated model order k = Kl; 5. Conduct feature selection using the algorithm presented in section 2.2; 6. Record ^Fk,k and the score of the merit of both of them, namely MF;k; 7. If k < Kh, k = k +1, go to step 5; otherwise, go to Step 7; 8. Select k and feature subset ^Fk which maximizes the score of the merit MF;k; M. Then the similarity between i-th data point pi and j-th data point pj is given by the equation: Si;j = exp(!fi / Di;j), where Di;j is the Euclidean distance between pi and pj, and fi is a positive constant, its value is !ln0:5D , where D is the average distance among the data points. 
<Paragraph position="4"> For the ranking of features, the importance I(w_k) of each word is defined as the entropy of the data after discarding feature w_k. It is calculated as follows: remove each word in turn from the feature space and compute E of the data in the reduced feature space using Equation 1. Based on the observation that a feature is the least important if its removal results in the minimum E, we obtain a ranking of the features.</Paragraph>
</Section>
<Section position="2" start_page="262" end_page="263" type="sub_section">
<SectionTitle> 2.2 Feature Subset Selection and Model Order Identification </SectionTitle>
<Paragraph position="0"> In this paper, for each specified cluster number we first perform K-means clustering analysis on each feature subset and adopt a scattering criterion, the "invariant criterion", to select an optimal feature subset F from the feature subset space.</Paragraph>
<Paragraph position="1"> Here, trace(P_W^{-1} P_B) is used to compare the cluster quality of different feature subsets¹; it measures the ratio of between-cluster to within-cluster scatter. The higher trace(P_W^{-1} P_B) is, the higher the cluster quality.</Paragraph>
¹ P_W = Σ_j Σ_{X_j} (X_j - m_j)(X_j - m_j)^t and P_B = Σ_j (m_j - m)(m_j - m)^t, where m is the total mean vector, m_j is the mean vector of the j-th cluster, X_j ranges over the points of the j-th cluster, and (X_j - m_j)^t is the matrix transpose of the column vector (X_j - m_j).
<Paragraph position="2"> To improve the efficiency of the search, features are first ranked according to their importance. Assume W_r = {f_1, ..., f_M} is the sorted feature list. The search can then be restricted to the feature subset space {(f_1, ..., f_k), 1 ≤ k ≤ M}.</Paragraph>
<Paragraph position="3"> The selected feature subset F is then evaluated together with the cluster number using the objective function, which can be formulated as:
F̂_k = argmax_F { M_{F,k} }, subject to coverage(P, F) ≥ θ ²
Here, F̂_k is the optimal feature subset, F and k are the feature subset and the cluster number under evaluation, and the criterion M_{F,k} is set up based on resampling-based stability, as Table 2 shows.</Paragraph>
² coverage(P, F) is the coverage rate of the feature set F with respect to P. In practice, we set θ = 0.9.

Table 2: Criterion for evaluating a feature subset and the model order
Function: criterion(F, k, P, q)
Input: feature subset F, cluster number k, entity pair set P, and sampling frequency q;
Output: the score of the merit of F and k;
1. With the cluster number k as input, perform k-means clustering analysis on the pair set P_F;
2. Construct the connectivity matrix C_{F,k} based on the above clustering solution on the full pair set P_F;
3. Use a random predictor ρ_k to assign uniformly drawn labels to each entity pair in P_F;
4. Construct the connectivity matrix C_{F,ρ_k} based on this random labelling of the full pair set P_F;
5. Construct q subsets of the full pair set by randomly selecting αN of the N original pairs, 0 ≤ α ≤ 1;
6. For each subset, perform the analyses of Steps 2, 3 and 4, which yields C^μ_{F,k} and C^μ_{F,ρ_k};
7. Compute M_{F,k} to evaluate the merit of F and k using Equation 3;
8. Return M_{F,k}.

<Paragraph position="6"> Let P^μ be a subset sampled from the full entity pair set P with size α|P| (α is set to 0.9 in this paper), and let C (C^μ) be the |P| × |P| (|P^μ| × |P^μ|) connectivity matrix based on the clustering results on P (P^μ). Each entry c_{ij} (c^μ_{ij}) of C (C^μ) is calculated as follows: if the entity pairs p_i ∈ P (P^μ) and p_j ∈ P (P^μ) belong to the same cluster, then c_{ij} (c^μ_{ij}) equals 1, otherwise 0. The stability is then defined as in Equation 2:
M(C^μ, C) = Σ_{i,j} 1{c^μ_{ij} = c_{ij} = 1, p_i ∈ P^μ, p_j ∈ P^μ} / Σ_{i,j} 1{c_{ij} = 1, p_i ∈ P^μ, p_j ∈ P^μ}   (2)
</Paragraph>
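The function criterion(F, k, P, q) of Table 2, together with the stability score of Equation 2 and the normalization of Equation 3, can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes scikit-learn's KMeans as the clustering step and a NumPy array X_F holding the context vectors restricted to the feature subset F; the names connectivity, consistency and criterion are ours.

# Minimal sketch of Table 2: resampling-based stability for a feature subset F
# and a cluster number k. Assumption: X_F is an N x |F| NumPy array.
import numpy as np
from sklearn.cluster import KMeans

def connectivity(labels):
    """Connectivity matrix C: c_ij = 1 iff points i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def consistency(C_full, sample_idx, labels_sub):
    """M(C^mu, C) of Equation 2: agreement of the sub-sample clustering with the
    full clustering, restricted to pairs that both fall inside the sub-sample."""
    C_sub = connectivity(labels_sub)
    C_restr = C_full[np.ix_(sample_idx, sample_idx)]
    return np.logical_and(C_restr == 1, C_sub == 1).sum() / (C_restr == 1).sum()

def criterion(X_F, k, q=20, alpha=0.9, seed=0):
    """M_{F,k} of Equation 3: stability of k-means on X_F, normalized by the
    stability of the random predictor rho_k, averaged over q sub-samples."""
    rng = np.random.default_rng(seed)
    n = X_F.shape[0]
    full_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_F)
    C = connectivity(full_labels)                       # Step 2
    C_rand = connectivity(rng.integers(0, k, size=n))   # Steps 3-4: random predictor
    scores = []
    for _ in range(q):                                  # Steps 5-6
        idx = rng.choice(n, size=int(alpha * n), replace=False)
        sub_labels = KMeans(n_clusters=k, n_init=10).fit_predict(X_F[idx])
        sub_rand = rng.integers(0, k, size=len(idx))
        scores.append(consistency(C, idx, sub_labels)
                      - consistency(C_rand, idx, sub_rand))
    return float(np.mean(scores))                       # Step 7: M_{F,k}

In the outer loop of Table 1, such a score would be evaluated for each candidate k in (K_l, K_h) and for feature subsets drawn as prefixes of the ranked list W_r that satisfy the coverage constraint, keeping the pair (F̂_k, k) with the highest M_{F,k}.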
<Paragraph position="12"> Intuitively, M(C^μ, C) denotes the consistency between the clustering results on C^μ and C. The assumption is that if the cluster number k is actually the "natural" number of relation types, then the clustering results on the subsets P^μ generated by sampling should be similar to the clustering result on the full entity pair set P. Obviously, the above function satisfies 0 ≤ M ≤ 1.</Paragraph>
<Paragraph position="13"> We notice that M(C^μ, C) tends to decrease as the value of k increases. Therefore, to avoid the bias towards selecting a small value of k as the cluster number, we use the cluster validity of a random predictor ρ_k to normalize M(C^μ, C). The random predictor ρ_k obtains its stability value by assigning uniformly drawn labels to objects, that is, by splitting the data into k clusters randomly. Furthermore, for each k we repeat the sampling q times. So, in Step 7 of the algorithm in Table 2, the objective function M(C^μ_{F,k}, C_{F,k}) is normalized as in Equation 3:
M_{F,k} = (1/q) Σ_{i=1}^{q} ( M(C^{μ_i}_{F,k}, C_{F,k}) - M(C^{μ_i}_{F,ρ_k}, C_{F,ρ_k}) )   (3)
Normalizing M(C^μ, C) by the stability of the random predictor yields values independent of k.</Paragraph>
<Paragraph position="16"> After the optimal number of clusters and the feature subset have been chosen, we adopt the K-means algorithm for the clustering phase. The output of context clustering is a set of context clusters, each of which is supposed to denote one relation type.</Paragraph>
</Section>
<Section position="3" start_page="263" end_page="264" type="sub_section">
<SectionTitle> 2.3 Discriminative Feature Identification </SectionTitle>
<Paragraph position="0"> To label each relation type, we use the DCM (discriminative category matching) scheme to identify discriminative labels. This scheme has also been used in document classification (Gabriel et al., 2002); it weights the importance of a feature based on its distribution. In this scheme, a feature is not important if it appears in many clusters and is evenly distributed among them; otherwise it is assigned a higher importance.</Paragraph>
<Paragraph position="1"> To weight a feature f_i within a category, we take into account the following information:
+ The relative importance of f_i within a cluster, WC_{i,k}, where pf_{i,k} is the number of entity pairs that contain feature f_i in cluster k, and N_k is the total number of entity pairs in cluster k.
+ The relative importance of f_i across clusters, CC_i, where C_i is the set of clusters that contain feature f_i, and N is the total number of clusters.</Paragraph>
<Paragraph position="7"> Here, WC_{i,k} and CC_i are designed to capture local information within a cluster and global information about the feature distribution across clusters, respectively. Combining WC_{i,k} and CC_i, we define the weight W_{i,k} of f_i in cluster k.</Paragraph>
</Section>
</Section>
</Paper>