<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2007"> <Title>N Semantic Classes are Harder than Two</Title> <Section position="5" start_page="49" end_page="50" type="metho"> <SectionTitle> 3 Identifying Candidate Phrases for Classification </SectionTitle> <Paragraph position="0"> In this section we introduce the two data sources we use to extract sets of candidate related phrases for classification: a TREC-WordNet intersection and query logs.</Paragraph> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 3.1 Noun-Phrase Pairs Cooccurring in TREC News Sentences </SectionTitle> <Paragraph position="0"> The first is a dataset derived from TREC news corpora and WordNet, used in previous work on binary semantic class classification (Snow et al., 2005). We extract two sets of candidate related pairs from these corpora: one restricted and one more complete.</Paragraph> <Paragraph position="1"> Snow et al. obtained training data from the intersection of noun phrases cooccurring in sentences in a TREC news corpus and those that can be labeled unambiguously as hypernyms or non-hypernyms using WordNet. We use a restricted set, since the instances selected in the previous work are a subset of the instances one is likely to encounter in text: the pairs are generally either related by one type of relationship or completely unrelated.</Paragraph> <Paragraph position="2"> In general we may be able to identify related phrases (for example with distributional similarity (Lee, 1999)), but we would like to be able to automatically classify the related phrases by the type of the relationship. For this task we identify a larger set of candidate related phrases.</Paragraph> </Section> <Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 3.2 Query Log Data </SectionTitle> <Paragraph position="0"> To find phrases that are similar or substitutable for web searchers, we turn to logs of user search sessions. We look at query reformulations: a pair of successive queries issued by a single user on a single day. We collapse repeated searches for the same terms, as well as query-pair sequences repeated by the same user on the same day.</Paragraph> <Paragraph position="1"> Whole queries tend to consist of several concepts together, for example &quot;new york | maps&quot; or &quot;britney spears | mp3s&quot; (segment boundaries shown here with |). We identify segments, or phrases, using a measure over adjacent terms similar to mutual information. Substitutions occur at the level of segments. For example, a user may initially search for &quot;britney spears | mp3s&quot;, then search for &quot;britney spears | music&quot;. By aligning query pairs with a single substituted segment, we generate pairs of phrases which a user has substituted. In this example, the phrase &quot;mp3s&quot; was substituted by the phrase &quot;music&quot;.</Paragraph> <Paragraph position="2"> Aggregating substitutable pairs over millions of users and millions of search sessions, we can calculate the probability of each such rewrite, then test each pair for statistical significance to eliminate phrase rewrites which occurred in a small number of sessions, perhaps by chance. To test for statistical significance we use the pair independence likelihood ratio, or log-likelihood ratio, test. This metric tests the hypothesis that the probability of phrase b is the same whether phrase a has been seen or not, by calculating the likelihood of the observed data under a binomial distribution using probabilities derived under each hypothesis (Dunning, 1993).</Paragraph> <Paragraph position="3"> log λ = log [ L(H1) / L(H2) ], where H1 is the hypothesis that P(b | a) = P(b | not a) = p, H2 is the hypothesis that P(b | a) = p1 and P(b | not a) = p2 with p1 distinct from p2, and L(H) is the likelihood of the observed counts under the binomial distribution with the parameters of H.</Paragraph> <Paragraph position="4"> A large negative value of log λ suggests a strong dependence between query a and query b.</Paragraph>
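<Paragraph position="5"> For concreteness, the test can be sketched in a few lines of Python (a minimal illustration, not the code used in our experiments; the count conventions and names below are our own): </Paragraph> <Paragraph position="6">
import math

def log_l(k, n, p):
    # Binomial log-likelihood of k successes in n trials with success
    # probability p. Guard p == 0 and p == 1 to avoid log(0).
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def log_lambda(k1, n1, k2, n2):
    """log lambda = log L(H1) - log L(H2).

    k1: sessions where phrase b follows phrase a; n1: sessions containing a.
    k2: sessions where b appears without a;       n2: sessions without a.
    """
    p = (k1 + k2) / (n1 + n2)   # shared P(b) under independence (H1)
    p1 = k1 / n1                # P(b | a) under dependence (H2)
    p2 = k2 / n2                # P(b | not a) under dependence (H2)
    return (log_l(k1, n1, p) + log_l(k2, n2, p)
            - log_l(k1, n1, p1) - log_l(k2, n2, p2))

# Example: b followed a in 400 of 1,000 sessions containing a, but occurred
# in only 500 of 99,000 sessions without a. The statistic -2 log lambda is
# compared to a chi-squared critical value (3.84 at the 0.05 level, 1 df).
score = -2 * log_lambda(400, 1000, 500, 99000)
</Paragraph>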
</Section> </Section> <Section position="6" start_page="50" end_page="50" type="metho"> <SectionTitle> 4 Labeling Phrase Pairs for Supervised Learning </SectionTitle> <Paragraph position="0"> We took a random sample of query segment substitutions from our query logs to be labeled. The sampling was limited to pairs that were frequent substitutions for each other, to ensure a high probability of the segments having some relationship.</Paragraph> <Section position="1" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 4.1 WordNet Labeling </SectionTitle> <Paragraph position="0"> WordNet is a large lexical database of English words. In addition to defining several hundred thousand words, it defines synonym sets, or synsets, of words that represent some underlying lexical concept, plus relationships between synsets. The most frequent relationships between noun phrases are synonym, hyponym, hypernym, and coordinate, defined in Table 1. We may also use meronym and holonym, the PART-OF relationship and its inverse.</Paragraph> <Paragraph position="1"> We used WordNet to automatically label the subset of our sample for which both phrases occur in WordNet. Any sense of the first segment having a relationship to any sense of the second results in the pair being labeled with that relationship. Since WordNet contains many other relationships in addition to those listed above, we group the rest into the other category. If the segments have no relationship in WordNet, they are labeled no relationship.</Paragraph>
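<Paragraph position="2"> This labeling procedure can be sketched with NLTK's WordNet interface (a simplified sketch, assuming noun senses and direct links only; the sense traversal behind the actual labels may differ, and relations outside this list fall into the other category): </Paragraph> <Paragraph position="3">
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def wordnet_label(x, y):
    """Label a phrase pair by the first relation found between any noun
    sense of x and any noun sense of y (simplified sketch)."""
    sx = wn.synsets(x.replace(" ", "_"), pos=wn.NOUN)
    sy = wn.synsets(y.replace(" ", "_"), pos=wn.NOUN)
    if not sx or not sy:
        return None  # outside WordNet, as for most query phrases
    for a in sx:
        for b in sy:
            if a == b:
                return "synonym"     # x and y share a synset
            if b in a.hyponyms():
                return "hypernym"    # y IS-A x, so x is a hypernym of y
            if b in a.hypernyms():
                return "hyponym"     # x IS-A y
            if b in a.part_meronyms():
                return "holonym"     # y is PART-OF x
            if b in a.part_holonyms():
                return "meronym"     # x is PART-OF y
            if set(a.hypernyms()).intersection(b.hypernyms()):
                return "coordinate"  # x and y share a direct hypernym
    return "no relationship"

# Example: wordnet_label("dog", "canine") returns "hyponym" (a dog IS-A canine).
</Paragraph>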
</Section> <Section position="2" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 4.2 Segment Pair Labels </SectionTitle> <Paragraph position="0"> Phrase pairs passing the statistical test are common reformulations, but can be of many semantic types. Rieh and Xie (2001) categorized types of query reformulations, defining 10 general categories: specification, generalization, synonym, parallel movement, term variations, operator usage, error correction, general resource, special resource, and site URLs. We redefine these slightly to apply to query segments. A summary of the definitions is shown in Table 1, along with the distribution in the data of pairs passing the statistical test.</Paragraph> <Paragraph position="1"> More than 90% of phrases in query logs do not appear in WordNet, because they are spelling errors, web site URLs, proper nouns of a temporal nature, and so on. Six annotators labeled 2,463 segment pairs selected randomly from our sample. Annotators agreed on the label of 78% of pairs, with a Kappa statistic of 0.74.</Paragraph> </Section> </Section> <Section position="7" start_page="50" end_page="51" type="metho"> <SectionTitle> 5 Automatic Classification </SectionTitle> <Paragraph position="0"> We wish to perform supervised classification of pairs of phrases into semantic classes. To do this, we assign to each pair of phrases features that may be predictive of their semantic relationship, then use a machine-learned classifier to assign weights to these features. In Section 7 we look at the learned weights and discuss which features are most significant for identifying which semantic classes.</Paragraph> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 5.1 Features </SectionTitle> <Paragraph position="0"> Features for query substitution pairs are extracted from query logs and web pages.</Paragraph> <Paragraph position="1"> We submit the two segments to a web search engine as a conjunctive query and download the top 50 results. Each result is converted into an HTML Document Object Model (DOM) tree and segmented into sentences.</Paragraph> <Paragraph position="2"> Dependency Tree Paths: the path from the first segment to the second in a dependency parse tree generated by MINIPAR (Lin, 1998) from sentences in which both segments appear. These features were previously used by Snow et al. (2005). They were extracted from web pages in all experiments, except where we note that we used TREC news stories (the same data as used by Snow et al.).</Paragraph> <Paragraph position="3"> HTML Paths: the paths from the DOM tree nodes in which the first segment appears to the nodes in which the second segment appears. The value is the number of times the path occurs with the pair.</Paragraph> <Paragraph position="4"> Table 1:
Class | Description | Example | %
synonym | one phrase can be used in place of the other without loss in meaning | low cost; cheap | 4.2
hypernym | X is a hypernym of Y if and only if Y is an X | muscle car; mustang | 2.0
hyponym | X is a hyponym of Y if and only if X is a Y (inverse of hypernymy) | lotus; flowers | 2.0
coordinate | there is some Z such that X and Y are both Zs | aquarius; gemini | 13.9
generalization | X is a generalization of Y if X contains less information about the topic | lyrics; santana lyrics | 4.8
specialization | X is a specification of Y if X contains more information about the topic | credit card; card | 4.7
spelling change | spelling errors, typos, punctuation changes, spacing changes | peopl; people | 14.9
stemmed form | X and Y have the same lemmas | ant; ants | 3.4
URL change | X and Y are related and X or Y is a URL | alliance; alliance.com | 29.8
other relationship | X and Y are related in some other way | flagpoles; flags | 9.8
no relationship | X and Y are not related in any obvious way | crypt; tree | 10.4
</Paragraph> <Paragraph position="5"> Lexico-syntactic Patterns (Hearst, 1992): a substring occurring between the two segments, extracted from text in nodes in which both segments appear. In the example fragment &quot;authors such as Shakespeare&quot;, the feature is &quot;such as&quot; and the value is the number of times the substring appears between &quot;authors&quot; and &quot;Shakespeare&quot;.</Paragraph> <Paragraph position="6"> Table 2 summarizes features that are induced from the query strings themselves or calculated from query log data.</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 5.2 Additional Training Pairs </SectionTitle> <Paragraph position="0"> We can double our training set by adding, for each pair u1, u2, a new pair u2, u1. The class of the new pair is the same as that of the old in all cases but hypernym, hyponym, specification, and generalization, which are inverted. Features are reversed from f(u1, u2) to f(u2, u1).</Paragraph> <Paragraph position="1"> A pair and its inverse have different sets of features, so splitting the set randomly into training and testing sets should not result in resubstitution error. Nonetheless, we ensure that a pair and its inverse are not separated for training and testing.</Paragraph> </Section> <Section position="3" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 5.3 Classifier </SectionTitle> <Paragraph position="0"> For each class we train a binary one-vs.-all linear-kernel support vector machine (SVM) using the optimization algorithm of Keerthi and DeCoste (2005).</Paragraph> <Paragraph position="1"> For n-class classification, we calibrate SVM scores to probabilities using the method described by Platt (2000). This gives us P(class | pair) for each pair. The final classification for a pair is argmax_class P(class | pair).</Paragraph>
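<Paragraph position="2"> This pipeline can be approximated with off-the-shelf tools; the sketch below uses scikit-learn with toy stand-in data (our experiments use the Keerthi and DeCoste optimizer and Platt's method directly, not scikit-learn): </Paragraph> <Paragraph position="3">
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy stand-ins for the real data: one feature vector per phrase pair,
# one label per pair drawn from the classes of Table 1.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, size=200)

# One-vs.-all linear SVMs; method="sigmoid" applies Platt-style
# calibration of the SVM scores to probabilities.
clf = OneVsRestClassifier(
    CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
)
clf.fit(X, y)

probs = clf.predict_proba(X)   # P(class | pair) for every class
labels = probs.argmax(axis=1)  # final label: argmax_class P(class | pair)
</Paragraph>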
<Paragraph position="4"> (Table caption, truncated: &quot;...ear regression, and our reproduction of the same experiment, using a support vector machine (SVM).&quot;)</Paragraph> <Paragraph position="5"> Binary classifiers are evaluated by ranking instances by classification score and finding the Max F1 (the harmonic mean of precision and recall, which ranges from 0 to 1) and the area under the ROC curve (AUC, which ranges from 0.5 to 1, with at least 0.8 considered &quot;good&quot;). The meta-classifier is evaluated by the precision and recall of each class and by classification accuracy over all instances.</Paragraph> </Section> </Section> </Paper>