<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0805"> <Title>Exploring Semantic Constraints for Document Retrieval</Title> <Section position="4" start_page="0" end_page="35" type="metho"> <SectionTitle> 2 Domain-Driven AV Extraction </SectionTitle> <Paragraph position="0"> This section describes a method that automatically discovers attribute-value (AV) structures in unstructured texts; the result is represented as texts annotated with semantic tags.</Paragraph> <Paragraph position="1"> We chose the digital camera domain to illustrate and evaluate the methodology described in this paper. We expect the method to be applicable to any domain whose main features can be represented as a set of specifications.</Paragraph> <Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 2.1 Construction of Domain Model </SectionTitle> <Paragraph position="0"> A domain model (DM) specifies a terminology of concepts, attributes, and values for describing objects in a domain. The relationships between the concepts in such a model can be heterogeneous (e.g., a link between two concepts can denote inheritance or containment). In this work, a domain model is used both for establishing a vocabulary and for establishing the attribute-value relationship between phrases.</Paragraph> <Paragraph position="1"> For the digital camera domain, we automatically constructed a domain model from existing Web resources. Web sites such as epinions.com and dpreview.com generally present information about cameras in HTML tables generated from internal databases. By querying these databases and extracting table content from the dynamic web pages, we can automatically reconstruct the databases as domain models usable for NLP purposes. These models can optionally be organized hierarchically. Although domain models generated from different websites for the same domain are not identical, they often share many common features.</Paragraph> <Paragraph position="2"> From the epinions.com product specifications for 1157 cameras, we extracted a nearly comprehensive domain model for digital cameras, consisting of a set of attributes (or features) and their possible values. A portion of the model is represented as follows:
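(The model fragment shown at this point in the original paper did not survive text extraction. The stand-in below is invented purely to illustrate the format described in the next two sentences; its attribute names, units, and frequencies are hypothetical.)

{price} $
  (187) 300
  (94) 100 - 200
{zoom} <optical zoom> x
  (587) 3
  (214) 10
{brand}
  (342) Canon
  (296) Nikon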
In this example, attributes are shown in curly brackets and sub-attributes in angle brackets.</Paragraph> <Paragraph position="3"> Attributes are followed by the possible units for their numerical values. Values come below the attributes, each headed by its frequency across all specifications. The frequency information (in parentheses) is used to calculate term weights of attributes and values.</Paragraph> <Paragraph position="4"> Specifications in HTML tables generally do not explicitly specify the type restrictions on values (even though the types are typically defined in the underlying databases). As type restrictions contain important domain information that is useful for value extraction, we recover them by identifying patterns in the values. For example, attributes such as price or dimension usually have numerical values, which can be a single number ("$300"), a range ("$100 - $200"), or a multi-dimensional value ("4 in. x 3 in. x 2 in."), often accompanied by a unit, e.g., $ or inches, whereas attributes such as brand and accessory usually have string values, e.g., "Canon" or "battery charger".</Paragraph> <Paragraph position="5"> We manually compile a list of units for identifying numerical values; the list is partially domain-general. We identify range and multi-dimensional values using patterns such as "A - B", "A to B", "less than A", and "A x B". Numerical values are then normalized to a uniform format.</Paragraph>
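A rough sketch of the range and multi-dimensional pattern matching and normalization just described; the regular expressions here are simplified assumptions for illustration, not the paper's actual implementation.

import re

NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Patterns for the value shapes described above: ranges ("A - B", "A to B"),
# bounds ("less than A"), and multi-dimensional values ("A x B [x C]").
RANGE = re.compile(rf"\$?({NUMBER})\s*(?:-|to)\s*\$?({NUMBER})")
BOUND = re.compile(rf"less than\s+\$?({NUMBER})")
MULTI = re.compile(rf"({NUMBER})\s*(?:in\.)?\s*x\s*({NUMBER})(?:\s*(?:in\.)?\s*x\s*({NUMBER}))?")

def normalize_value(text):
    """Map a raw value string to a uniform (type, numbers) representation."""
    for kind, pattern in (("range", RANGE), ("upper_bound", BOUND), ("multi_dim", MULTI)):
        m = pattern.search(text)
        if m:
            return kind, [float(g.replace(",", "")) for g in m.groups() if g]
    m = re.search(NUMBER, text)
    if m:
        return "single", [float(m.group(0).replace(",", ""))]
    return "string", [text.strip()]

# normalize_value("$100 - $200")          -> ("range", [100.0, 200.0])
# normalize_value("4 in. x 3 in. x 2 in.") -> ("multi_dim", [4.0, 3.0, 2.0])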
</Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 2.2 Identification of AV Pairs </SectionTitle> <Paragraph position="0"> Based on the constructed domain model, we can identify domain values in unstructured texts and assign attribute names and domains to them. We focus on extracting values of a domain attribute.</Paragraph> <Paragraph position="1"> Attribute names appearing by themselves are not of interest here because attribute names alone cannot establish attribute-value relations. However, identification of attribute names is necessary for disambiguation.</Paragraph> <Paragraph position="2"> The AV extraction procedure contains the following steps: 1. Use MINIPAR (Lin, 1998) to generate dependency parses of texts.</Paragraph> <Paragraph position="3"> 2. For all noun phrase chunks in the parses, iteratively match sub-phrases of each chunk with the domain model to find all possible matches of attribute names and values above a threshold: * A chunk contains all words up to the noun head (inclusive); * Post-head NP components (e.g., PPs and clauses) are treated as separate chunks.</Paragraph> <Paragraph position="4"> 3. Disambiguate values with multiple attribute assignments using the sentence context, with a preference toward closer context based on dependency.</Paragraph> <Paragraph position="5"> 4. Mark up the documents with XML tags that represent AV pairs.</Paragraph> <Paragraph position="6"> Steps 2 and 3 are the center of the AV extraction process, where different strategies are employed to handle values of different types and where ambiguous values are disambiguated. We describe these strategies in detail below.</Paragraph> </Section> <Section position="3" start_page="34" end_page="34" type="sub_section"> <SectionTitle> Numerical Value </SectionTitle> <Paragraph position="0"> Numerical values are identified based on the unit list and the range and multi-dimensional number patterns described in Section 2.1. The predefined mappings between units and attributes suggest attribute assignments. One unit can be mapped to multiple attributes; for example, "x" can be mapped to either optical zoom or digital zoom, both of which are kept as candidates for later disambiguation. For range and multi-dimensional numbers, we find all attributes in the domain model that have at least one matching range or multi-dimensional value, and keep attributes identified by either a unit or a pattern as candidates. Numbers without a unit can only be matched exactly against an existing value in the domain model.</Paragraph> </Section> <Section position="4" start_page="34" end_page="35" type="sub_section"> <SectionTitle> String Value </SectionTitle> <Paragraph position="0"> Human users often refer to a domain entity in different ways in text. For example, a camera called "Canon PowerShot G2 Black Digital Camera" in our domain model is seldom mentioned exactly this way in ads or reviews, but rather as "Canon PowerShot G2", "Canon G2", etc. However, a domain model generally records only full name forms rather than all their possible variations. This makes the identification of domain values difficult and rules out a trained classifier, which would need training samples covering a large variety of name references.</Paragraph> <Paragraph position="1"> An added difficulty is that web texts often contain grammatical errors and incomplete sentences as well as large numbers of out-of-vocabulary words, which together make the dependency parses very noisy. As a result, the effectiveness of extraction algorithms that rely on particular dependency patterns can be adversely affected.</Paragraph> <Paragraph position="2"> Our approach makes use of the more accurate parser functionalities of part-of-speech tagging and phrase boundary detection, while reducing the reliance on low-level dependency structures.</Paragraph> <Paragraph position="3"> For noun phrase chunks extracted from parse trees, we iteratively match all sub-phrases of each chunk with the domain model to find matching attributes and values above a threshold.</Paragraph> <Paragraph position="4"> It is often possible to find multiple AV pairs in a single NP chunk.</Paragraph> <Paragraph position="5"> Assigning domain attributes to an NP is essentially a classification problem. In our domain model, each attribute can be seen as a target class and its values as the training set. For a new phrase, the idea is to find the value in the domain model that is most similar and then assign the attribute of this nearest neighbor to the phrase.</Paragraph> <Paragraph position="6"> This motivates us to adopt K Nearest Neighbor (KNN) (Fix and Hodges, 1951) classification for handling NP values. The core of KNN is a similarity metric. In our case, we use word editing distance (Wagner and Fischer, 1974), which takes into account the cost of word insertions, deletions, and substitutions. We compute word editing distance using dynamic programming techniques. Intuitively, words do not carry equal weights in a domain. In the earlier example, words such as "PowerShot" and "G2" are more important than "digital" and "camera", so editing costs for such words should be higher. This draws an analogy to the metric of Inverse Document Frequency (IDF) in the IR community, used to measure the discriminative capability of a term in a document collection. If we regard each value string as a document, we can use IDF to measure the weight of each term in a value string, emphasizing important domain terms and de-emphasizing more general ones. The normalized cost is computed as log(TN/N) / log(TN), where TN is the total number of values for an attribute and N is the number of values in which the term occurs. This equation assigns higher cost to more discriminative terms and lower cost to more general terms. It is also used to compute costs of terms in attribute names. For words not appearing in a class, the cost is 1, the maximum cost.</Paragraph> <Paragraph position="7"> The distance between a new phrase and a DM phrase is then calculated as a word editing cost based on the costs of substitution, insertion, and deletion, where Cost_DM is the cost of a word in a domain value (i.e., its normalized IDF score) and Cost_new is that of a word in the new phrase. The cost is also normalized by the larger of the weighted lengths of the two phrases. We use a threshold of 0.6 to cut off phrases with higher cost.</Paragraph>
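The following is a minimal sketch of such an IDF-weighted word editing distance. Since the paper's exact cost combination is not reproduced in this copy, the sketch assumes one plausible reading: insertions and deletions cost the affected word's weight, and substituting non-identical words costs the sum of both weights.

import math

def term_cost(term, values, TN):
    """Normalized IDF-like cost log(TN/N)/log(TN) for a term within one
    attribute class; values are the class's value strings, TN = len(values)."""
    N = sum(1 for v in values if term in v.split())
    if N == 0 or TN <= 1:
        return 1.0  # maximum cost for words unseen in the class
    return math.log(TN / N) / math.log(TN)

def edit_distance(new_phrase, dm_phrase, cost_new, cost_dm):
    """Dynamic-programming word editing distance with per-word costs,
    normalized by the larger weighted length of the two phrases."""
    a, b = new_phrase.split(), dm_phrase.split()
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + cost_new(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + cost_dm(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost_new(a[i - 1]) + cost_dm(b[j - 1])
            d[i][j] = min(d[i - 1][j] + cost_new(a[i - 1]),  # deletion
                          d[i][j - 1] + cost_dm(b[j - 1]),   # insertion
                          d[i - 1][j - 1] + sub)             # substitution / match
    norm = max(sum(cost_new(w) for w in a), sum(cost_dm(w) for w in b))
    return d[-1][-1] / norm if norm else 0.0

Phrases whose normalized cost exceeds the 0.6 threshold are discarded; for the survivors, similarity is taken as 1 minus the cost, as described next.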
<Paragraph position="10"> For a phrase that returns only a couple of matches, the similarity, i.e., the matching probability, is computed as 1 - Cost_edit; otherwise, the similarity is the maximum likelihood of an attribute based on the number of returned values belonging to that attribute.</Paragraph> <Paragraph position="11"> Disambiguation by Sentence Context. The AV identification process often returns multiple attribute candidates for a phrase, which then need to be disambiguated. The words close to the phrase usually provide good indications of the correct attribute name. Motivated by this observation, we design the disambiguation procedure as follows. First, we examine the sibling nodes of the target phrase node in the dependency structure for a mention of an attribute name that overlaps with a candidate. Next, we recursively traverse upwards along the dependency tree until we find an overlap or reach the top of the tree. If an overlap is found, that attribute becomes the final assignment; otherwise, the attribute with the highest probability is assigned. This method gives priority to the context closest (in terms of dependency) to the target phrase. For example, in the sentence "The 4x stepless digital zoom lets you capture intricate details" (parse tree omitted here), "4x" can be mapped to both optical zoom and digital zoom, but the sentence context points to the second candidate.</Paragraph>
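A schematic rendering of this traversal, assuming a minimal dependency-node structure; the node fields and the overlap test are illustrative, not MINIPAR's actual API.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def overlaps(attribute, text):
    """Crude attribute-name mention test: any shared word."""
    return bool(set(attribute.lower().split()) & set(text.lower().split()))

def disambiguate(target: Node, candidates: Dict[str, float]) -> str:
    """Prefer the attribute mentioned in the dependency context closest
    to the target phrase; fall back to the most probable candidate."""
    current = target
    while current.parent is not None:
        for sibling in current.parent.children:
            if sibling is current:
                continue
            for attribute in candidates:
                if overlaps(attribute, sibling.text):
                    return attribute
        current = current.parent  # widen the context one level
    return max(candidates, key=candidates.get)

For the zoom example above, the word "digital" appears in a near sibling of the "4x" node, so digital zoom wins over optical zoom without any upward traversal.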
</Section> </Section> <Section position="5" start_page="35" end_page="36" type="metho"> <SectionTitle> 3 Document Retrieval Systems </SectionTitle> <Paragraph position="0"> This section introduces three document retrieval systems: the first retrieves unstructured texts based on vector space models, the second takes advantage of the semantic structures constructed by the methods in Section 2, and the third combines the first two.</Paragraph> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 3.1 Term-Based Retrieval (S1) </SectionTitle> <Paragraph position="0"> Our system for term-based retrieval from unstructured text is based on the CLARIT system, implementing a vector space retrieval model (Evans and Lefferts, 1995; Qu et al., 2005). The CLARIT system identifies terms in documents and constructs its index based on NLP-determined linguistic constituents (NPs, sub-phrases, and words). The index is built upon full documents or variable-length subdocuments. We used subdocuments in the range of 8 to 12 sentences as the basis for indexing and scoring documents in our experiments.</Paragraph> <Paragraph position="1"> Various similarity measures are supported in the model. For the experiments described in this paper, we used the dot product function for computing similarities between a query Q and a document D: sim(Q, D) = Σt WQ(t) · WD(t), where WQ(t) is the weight associated with the query term t and WD(t) is the weight associated with the term t in the document D. The two weights were computed as WQ(t) = C(t) · TF(t, Q) · IDF(t) and WD(t) = TF(t, D) · IDF(t), where IDF and TF are standard inverse document frequency and term frequency statistics, respectively. IDF(t) was computed over the target corpus for retrieval. The coefficient C(t) is an "importance coefficient", which can be modified either manually by the user or automatically by the system (e.g., updated during feedback).</Paragraph> <Paragraph position="2"> For term-based document retrieval, we also experimented with pseudo relevance feedback (PRF) with various numbers of retrieved documents and various numbers of terms from those documents for query expansion. While PRF did improve performance, the improvement was not significant. This is probably because, in this restricted domain, there is not much vocabulary variation, so the advantage of query expansion is not fully realized.</Paragraph> </Section> <Section position="2" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 3.2 Constraint-Based Retrieval (S2) </SectionTitle> <Paragraph position="0"> The constraint-based retrieval approach searches through the AV-annotated document collection based on the constraints extracted from queries.</Paragraph> <Paragraph position="1"> Given a query q, our constraint-based system scores each document in the collection by comparing the extracted AV pairs with the constraints in q. Suppose q has a constraint c(a, v) that restricts the value of the attribute a to v, where v can be either a concrete value (e.g., 5 megapixels) or a range (e.g., less than $400). If a is present in a document d with a value v' that satisfies v, that is, v' = v if v is a concrete value or v' falls in the range defined by v, then d is given a positive score w. However, if v' does not satisfy v, then d is given a negative score -w. No mention of a does not change the score of d, except that, when c is a string constraint, we use a back-off model that awards d a positive score w if it contains v as a substring. The final score of d given q is the sum of the scores for all constraints in q, normalized by the maximum possible score Σi wi, where ci is one of the n constraints specified in q and wi is its score. We rank the documents by their scores. This scoring scheme facilitates a sensible cutoff point, so that a constraint-based retrieval system can return zero documents, or fewer than the top N, when a query has no or very few relevant documents.</Paragraph>
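A simplified sketch of this scoring scheme; the document and constraint representations below are assumptions for illustration. Satisfied constraints add w, violated ones subtract w, and string constraints back off to substring matching when the attribute is not annotated.

def score_document(doc_avs, doc_text, constraints):
    """doc_avs: attribute -> extracted value; constraints: list of
    (attribute, satisfies, weight, string_value) where satisfies is a
    predicate over values and string_value is set for string constraints."""
    max_score = sum(w for _, _, w, _ in constraints)
    total = 0.0
    for attr, satisfies, w, string_value in constraints:
        if attr in doc_avs:
            total += w if satisfies(doc_avs[attr]) else -w
        elif string_value is not None and string_value.lower() in doc_text.lower():
            total += w  # back-off: the string value occurs as a substring
    return total / max_score if max_score else 0.0

# Example: "less than $400" and "brand = Nikon"
constraints = [("price", lambda v: v <= 400, 1.0, None),
               ("brand", lambda v: v == "Nikon", 1.0, "Nikon")]
# score_document({"price": 350}, "Nikon D70 for sale", constraints) -> 1.0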
</Section> <Section position="3" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 3.3 Combined Retrieval (S3) </SectionTitle> <Paragraph position="0"> Lee (1997) analyzed multiple post-search data fusion methods using TREC3 ad hoc retrieval data and explained the combination of different search results on the grounds that different runs retrieve similar sets of relevant documents but different sets of non-relevant documents. The combination methods therefore boost the ranks of the relevant documents. One method studied was the summation of individual similarities, which showed no significant difference from the best approach (multiplying the summation by the number of nonzero similarities).</Paragraph> <Paragraph position="1"> Our system therefore adopts the summation method for its simplicity. Because the scores from term-based and constraint-based retrieval are normalized, we simply add them together for each document retrieved by both approaches and re-rank the documents based on their new scores, as sketched below.</Paragraph> <Paragraph position="2"> More sophisticated combination methods could be explored here, such as deciding which score to emphasize based on characterizations of the queries, e.g., whether a query has more numerical values or string values.</Paragraph>
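A minimal sketch of this score combination; the document identifiers and score dictionaries are assumed for illustration.

def combine_runs(term_scores, constraint_scores):
    """Sum the normalized scores of the two runs for each document
    (documents retrieved by only one run keep that run's score),
    then re-rank by the combined score."""
    combined = dict(term_scores)
    for doc, score in constraint_scores.items():
        combined[doc] = combined.get(doc, 0.0) + score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)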
</Section> </Section> <Section position="6" start_page="36" end_page="38" type="metho"> <SectionTitle> 4 Experimental Study </SectionTitle> <Paragraph position="0"> In this section, we describe the experiments we performed to investigate combining terms and semantic constraints for document retrieval.</Paragraph> <Section position="1" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 4.1 Data Sets </SectionTitle> <Paragraph position="0"> To construct a domain corpus, we used search results from craigslist.org. We chose the "for sale - electronics" section for the "San Francisco Bay Area" and submitted the search term "digital camera" to retrieve advertisements. After manually removing duplicates and expired ads, our corpus consisted of 437 ads posted between 2005-10-28 and 2005-11-07. A typical ad (not reproduced here) carries a small set of XML tags specifying the fields: ad title (title), date of posting (date), ad body (text), ad id (docno), and document (doc). The length of the documents varies considerably, from 5 or 6 sentences to over 70 (with specifications copied from other websites). The ads have an average length of 230 words.</Paragraph> <Paragraph position="1"> The test queries were constructed from human-written questions on the Digital Photography Review website (www.dpreview.com) Q&A forums, which contain discussions from real users about all aspects of digital photography. Often, users ask for suggestions on purchasing digital cameras and formulate their needs as a set of constraints. These queries form the base of our topic collection.</Paragraph> <Paragraph position="2"> The following is an example of such a topic, manually annotated with the semantic constraints of interest to the user (example not reproduced here). In this example, the user query text is in the query field and the manually extracted AV constraints based on the domain model are in the constraint field. Two types of constraints are distinguished: hard and soft. Hard constraints must be satisfied, while soft constraints can be relaxed. The manual determination of hard vs. soft constraints is based on linguistic features in the text. Automatic constraint extraction goes one step beyond AV extraction in that it must also identify relations between attributes and values; for example, "nothing higher than" indicates a "<=" relationship. Such constraints can be extracted automatically from natural text using a pattern-based method. However, we have yet to produce a rich set of patterns addressing constraints. In addition, such query capability can be simulated with a form-based parametric search interface.</Paragraph> <Paragraph position="3"> In order to make a fair comparison between systems, we use only the phrases in the manually extracted constraints as queries to system S1. For the example topic, S1 extracted the NP terms "SLR", "1500", and "Nikon D70". During retrieval, a term is further decomposed into its subterms for similarity matching. For instance, the term "Nikon D70" is decomposed into the subterms "Nikon" and "D70", so documents that mention only the individual subterms can still be retrieved.</Paragraph> <Paragraph position="4"> For this topic, the system S2 produced annotations like those shown in the constraint field. Table 1 gives a summary of the distribution statistics of terms and constraints for the 30 topics selected from the Digital Photography Review website.</Paragraph> </Section> <Section position="2" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 4.2 Relevance Judgments </SectionTitle> <Paragraph position="0"> Instead of using human subjects to give relevance judgments for each document and query combination, we use a human annotator to mark up all AV pairs in each document, using the GATE annotation tool (Cunningham et al., 2002).</Paragraph> <Paragraph position="1"> The attribute set contains the 40 most important attributes for digital cameras, based on automatically computed term distributions in our data set. The inter-annotator agreement (without annotator training) as measured by Kappa is 0.72, which suggests satisfactory agreement.</Paragraph> <Paragraph position="2"> Annotating AV pairs in all documents gives us the capability of making relevance judgments automatically, based on the number of matches between the AV pairs in a document and the constraints in a topic. This automatic approach is reasonable because, unlike TREC queries, which are short and ambiguous, the queries in our application represent very specific information needs and are therefore much longer. The lack of ambiguity makes our problem closer to boolean search with structured queries, as in SQL, than to traditional IR search. In this case, a human assessor should give the same relevance judgments as our automatic system if they follow the same instructions closely. An example instruction could be "a document is relevant if it describes a digital camera whose specifications satisfy at least one constraint in the query; otherwise it is not relevant" (similar to the narrative field of a TREC topic).</Paragraph> <Paragraph position="3"> We specify two levels of relevance: strict and relaxed. Strict means that all hard constraints of a topic have to be satisfied for a document to be relevant to the topic, whereas relaxed means that at least half of the hard constraints have to be satisfied. Soft constraints play no role in a relevance judgment. The advantage of the automatic approach is that when the levels of relevance are modified for different application purposes, the relevance judgments can be recomputed easily, whereas in the manual approach, the human assessor has to examine all documents again.</Paragraph> <Paragraph position="4"> Figure 1 shows the distribution of relevant documents across topics for relaxed and strict judgments. With strict judgments, only 20 of the 30 topics have relevant documents, and among them 6 topics have fewer than 10 relevant documents. Topics with many constraints are likely to yield few relevant documents. The average number of relevant documents per topic is 57.3 for relaxed judgments and 18 for strict judgments.</Paragraph> </Section> </Section> </Paper>