<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1050">
<Title>A Probabilistic Answer Type Model</Title>
<Section position="4" start_page="393" end_page="395" type="metho">
<SectionTitle> 3 Resources </SectionTitle>
<Paragraph position="0"> Before introducing our model, we first describe the resources used in the model.</Paragraph>
<Section position="1" start_page="393" end_page="394" type="sub_section">
<SectionTitle> 3.1 Word Clusters </SectionTitle>
<Paragraph position="0"> Natural language data is extremely sparse. Word clusters are a way of coping with data sparseness by abstracting a given word to a class of related words. Clusters, as used by our probabilistic answer typing system, play a role similar to that of named entity types. Many methods exist for clustering, e.g., (Brown et al., 1990; Cutting et al., 1992; Pereira et al., 1993; Karypis et al., 1999).</Paragraph>
<Paragraph position="1"> We used the Clustering By Committee (CBC) algorithm (Pantel and Lin, 2002) on a 10 GB English text corpus to obtain 3607 clusters. The following is an example cluster generated by CBC: tension, anger, anxiety, tensions, frustration, resentment, uncertainty, confusion, conflict, discontent, insecurity, controversy, unease, bitterness, dispute, disagreement, nervousness, sadness, despair, animosity, hostility, outrage, discord, pessimism, anguish, ...</Paragraph>
<Paragraph position="2"> In the clustering generated by CBC, a word may belong to multiple clusters. The clusters to which a word belongs often represent the senses of the word. Table 1 shows two example words and their clusters.</Paragraph>
<Paragraph position="3"> Table 1: Example words and their clusters.
suite: {software, network, wireless, ...}; {rooms, bathrooms, restrooms, ...}; {meeting room, conference room, ...}
ghost: {rabbit, squirrel, duck, elephant, frog, ...}; {goblins, ghosts, vampires, ghouls, ...}; {punk, reggae, folk, pop, hip-hop, ...}; {huge, larger, vast, significant, ...}; {coming-of-age, true-life, ...}; {clouds, cloud, fog, haze, mist, ...}</Paragraph>
</Section>
<Section position="3" start_page="394" end_page="395" type="sub_section">
<SectionTitle> 3.2 Contexts </SectionTitle>
<Paragraph position="0"> The context in which a word appears often imposes constraints on the semantic type of the word. This basic idea has been exploited by many proposals for distributional similarity and clustering, e.g., (Church and Hanks, 1989; Lin, 1998; Pereira et al., 1993).</Paragraph>
<Paragraph position="1"> Similar to Lin and Pantel (2001), we define the contexts of a word to be the undirected paths in dependency trees involving that word at either the beginning or the end. The following diagram shows an example dependency tree for the sentence &quot;Which city hosted the 1988 Winter Olympics?&quot;</Paragraph>
<Paragraph position="3"> The links in the tree represent dependency relationships. The direction of a link is from the head to the modifier in the relationship. Labels associated with the links represent types of relations.</Paragraph>
<Paragraph position="4"> In a context, the word itself is replaced with a variable X. We say a word is the filler of a context if it replaces X. For example, the contexts for the word &quot;Olympics&quot; in the above sentence include the paths connecting it to the verb &quot;host&quot; and to its modifiers. In these paths, words are reduced to their root forms and proper names are reduced to their entity tags (we used MUC7 named entity tags).</Paragraph>
<Paragraph position="5"> Paths allow us to balance the specificity of contexts against the sparseness of data. Longer paths typically impose stricter constraints on the slot fillers. However, they tend to have fewer occurrences, making them more prone to errors arising from data sparseness. We restricted the path length to two (involving at most three words) and require the two ends of the path to be nouns.</Paragraph>
<Paragraph position="6"> We parsed the AQUAINT corpus (3 GB) with Minipar (Lin, 2001) and collected the frequency counts of words appearing in various contexts.</Paragraph>
<Paragraph position="7"> Parsing and database construction are performed off-line, as the database is identical for all questions. We extracted 527,768 contexts that appeared at least 25 times in the corpus. An example context and its fillers are shown in Figure 1.</Paragraph>
<Paragraph position="8"> Figure 1: An example context and its fillers. The context is X -subj- host -obj- Olympics. Its fillers, with frequency counts, include: Africa 2, AP 1, Argentina 1, Athens 16, Atlanta 3, Bangkok 1, ..., decades 1, facility 1, government 1, grant 1, he 2, homeland 3, IOC 1, Iran 2, Jakarta 1, ..., president 2, Pusan 1, race 1, readiness 2, Rio de Janeiro 1, Rome 1, Salt Lake City 2, school 1, S. Africa 1, ..., Zakopane 4.</Paragraph>
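To make the off-line database construction concrete, the following Python fragment sketches how (context, filler) pairs could be collected from dependency parses and counted. It is an illustrative sketch rather than the authors' implementation: it assumes each parsed sentence is available as a list of (head, relation, modifier) triples with words already reduced to root forms and entity tags, and the names (context_paths, build_database, nouns) are invented for the example.

```python
from collections import defaultdict

def context_paths(links, nouns):
    """Enumerate (context, filler) pairs from one parsed sentence.

    Contexts are undirected dependency paths of length one or two whose
    two ends are nouns; the filler word is replaced by the variable 'X'.
    `links` is a list of (head, relation, modifier) triples and `nouns`
    is the set of words tagged as nouns (words are treated as plain
    strings here for simplicity).
    """
    adj = defaultdict(list)   # undirected adjacency: word -> [(relation, neighbour)]
    for head, rel, mod in links:
        adj[head].append((rel, mod))
        adj[mod].append((rel, head))

    pairs = []
    for filler in adj:
        if filler not in nouns:
            continue
        for rel1, mid in adj[filler]:
            if mid in nouns:
                # length-1 path:  X -rel1- mid
                pairs.append((("X", rel1, mid), filler))
            for rel2, end in adj[mid]:
                if end != filler and end in nouns:
                    # length-2 path:  X -rel1- mid -rel2- end
                    pairs.append((("X", rel1, mid, rel2, end), filler))
    return pairs

def build_database(parsed_corpus, nouns_per_sentence, min_count=25):
    """Off-line construction of the context-filler frequency database,
    keeping only contexts that occur at least `min_count` times."""
    counts = defaultdict(lambda: defaultdict(int))   # context -> filler -> frequency
    for links, nouns in zip(parsed_corpus, nouns_per_sentence):
        for context, filler in context_paths(links, nouns):
            counts[context][filler] += 1
    return {c: dict(f) for c, f in counts.items()
            if sum(f.values()) >= min_count}
```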
<Paragraph position="9"> To build a probabilistic model for answer typing, we extract a set of contexts, called question contexts, from a question. An answer is expected to be a plausible filler of the question contexts. Question contexts are extracted from a question with two rules. First, if the wh-word in a question has a trace in the parse tree, the question contexts are the contexts of the trace. For example, the question &quot;What do most tourists visit in Reims?&quot; is parsed as: What_i do most tourists visit e_i in Reims?</Paragraph>
<Paragraph position="11"> The symbol e_i is the trace of what_i. Minipar generates the trace to indicate that the word what is the object of visit in the deep structure of the sentence. The following question contexts are extracted from the above question: X -obj- visit and X -obj- visit -in- Reims. The second rule deals with situations where the wh-word is a determiner, as in the question &quot;Which city hosted the 1988 Winter Olympics?&quot; (the parse tree for which is shown in Section 3.2). In such cases, the question contexts consist of a single context involving the noun that is modified by the determiner. The context for the above sentence is X -subj- city, corresponding to the sentence &quot;X is a city.&quot; This context overrides the other contexts extracted from the question because the question explicitly states that the desired answer is a city. Experimental results have shown that using this context in conjunction with the other contexts extracted from the question produces lower performance than using this context alone.</Paragraph>
<Paragraph position="12"> In the event that a context extracted from a question is not found in the database, we shorten the context in one of two ways. We start by replacing the word at the end of the path with a wildcard that matches any word. If this fails to yield entries in the context database, we shorten the context to length one and replace the end word with automatically determined similar words instead of a wildcard.</Paragraph>
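The two-step back-off just described can be sketched as follows. This is an illustrative Python fragment, not the authors' code: contexts are assumed to be represented as tuples such as ('X', 'obj', 'visit', 'in', 'Reims'), `database` maps contexts to filler counts as in the previous sketch, `similar_words` stands in for the CBC similarity lists mentioned in Section 4.2, and the function name is invented for the example.

```python
def back_off_question_context(context, database, similar_words):
    """Return usable variants of a question context that is missing from
    the database, following the two-step shortening described above."""
    if context in database:
        return [context]

    # Step 1: replace the word at the end of the path with a wildcard that
    # matches any word, i.e. accept any database context sharing the rest
    # of the path.
    matches = [c for c in database
               if len(c) == len(context) and c[:-1] == context[:-1]]
    if matches:
        return matches

    # Step 2: shorten the context to length one and replace the end word
    # with automatically determined similar words instead of a wildcard.
    shortened = context[:3]                    # ('X', relation, end_word)
    variants = [shortened[:2] + (w,) for w in similar_words.get(shortened[2], [])]
    return [v for v in variants if v in database]
```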
<Paragraph position="13"> Candidate contexts are very similar in form to question contexts, save for one important difference: candidate contexts are extracted from the parse trees of the answer candidates rather than from the question. In natural language, some words may be polysemous. For example, Washington may refer to a person, a city, or a state. The occurrences of Washington in &quot;Washington's descendants&quot; and &quot;suburban Washington&quot; should not be given the same score when the question is seeking a location. Given that the sense of a word is largely determined by its local context (Choueka and Lusignan, 1985), candidate contexts allow the model to take the candidate answers' senses into account implicitly.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="395" end_page="397" type="metho">
<SectionTitle> 4 Probabilistic Model </SectionTitle>
<Paragraph position="0"> The goal of an answer typing model is to evaluate the appropriateness of a candidate word as an answer to the question. If we assume that a set of answer candidates is provided to our model by some means (e.g., words comprising documents retrieved by an information retrieval engine), we wish to compute the value P(in(w,G_Q)|w). That is, the appropriateness of a candidate answer w is proportional to the probability that it will occur in the question contexts G_Q extracted from the question. To mitigate data sparseness, we can introduce a hidden variable C that represents the clusters to which the candidate answer may belong. As a candidate may belong to multiple clusters, we obtain:</Paragraph>
<Paragraph position="1"> P(in(w,G_Q)|w) = Σ_C P(C, in(w,G_Q)|w) = Σ_C P(C|w) P(in(w,G_Q)|C,w)   (2)</Paragraph>
<Paragraph position="2"> Given that a word appears, we assume that it has the same probability of appearing in a context as all other words in the same cluster. Therefore:</Paragraph>
<Paragraph position="3"> P(in(w,G_Q)|C,w) = P(in(C,G_Q)|C)   (3)</Paragraph>
<Paragraph position="4"> We can now rewrite the equation in (2) as:</Paragraph>
<Paragraph position="5"> P(in(w,G_Q)|w) = Σ_C P(C|w) P(in(C,G_Q)|C)   (4)</Paragraph>
<Paragraph position="6"> This equation splits our model into two parts: one models which clusters a word belongs to, and the other models how appropriate a cluster is to the question contexts. When G_Q consists of multiple contexts, we make the naïve Bayes assumption that each individual context g_Q ∈ G_Q is independent of all other contexts given the cluster C:</Paragraph>
<Paragraph position="7"> P(in(w,G_Q)|w) = Σ_C P(C|w) Π_{g_Q ∈ G_Q} P(in(C,g_Q)|C)   (5)</Paragraph>
<Paragraph position="8"> Equation (5) needs the parameters P(C|w) and P(in(C,g_Q)|C), neither of which is directly available from the context-filler database. We will discuss the estimation of these parameters in Section 4.2.</Paragraph>
<Section position="1" start_page="396" end_page="396" type="sub_section">
<SectionTitle> 4.1 Using Candidate Contexts </SectionTitle>
<Paragraph position="0"> The previous model assigns the same likelihood to every instance of a given word. As we noted in Section 3.2.2, a word may be polysemous. To take a word's context into account, we can instead compute P(in(w,G_Q)|w, in(w,G_w)), where G_w is the set of contexts for the candidate word w in a retrieved passage.</Paragraph>
<Paragraph position="1"> By introducing word clusters as intermediate variables as before and making a similar assumption as in equation (3), we obtain:</Paragraph>
<Paragraph position="2"> P(in(w,G_Q)|w, in(w,G_w)) = Σ_C P(C|w, in(w,G_w)) Π_{g_Q ∈ G_Q} P(in(C,g_Q)|C)   (7)</Paragraph>
<Paragraph position="3"> Like equation (4), equation (7) partitions the model into two parts. Unlike P(C|w) in equation (4), the probability of the cluster is now based on the particular occurrence of the word in the candidate contexts; its estimation is described in Section 4.2.</Paragraph>
</Section>
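As a summary of how the model would be evaluated in practice, the sketch below computes the score in equation (5) and, with a different cluster distribution passed in, equation (7). It is illustrative only: the parameter tables are assumed to have already been estimated as described in Section 4.2, and their representation as Python dictionaries is an assumption made for the example.

```python
def answer_type_score(question_contexts, cluster_dist, p_context_given_cluster):
    """Sum over clusters C of P(C|...) times the product over question
    contexts g_Q of P(in(C, g_Q)|C), using the naive Bayes assumption.

    cluster_dist:            {cluster: P(C|w)} for equation (5), or
                             {cluster: P(C|w, in(w, G_w))} for equation (7)
    p_context_given_cluster: {(cluster, g_Q): P(in(C, g_Q)|C)}, assumed to be
                             add-one smoothed so that no entry is zero
    """
    total = 0.0
    for cluster, p_c in cluster_dist.items():
        product = 1.0
        for g_q in question_contexts:
            product *= p_context_given_cluster[(cluster, g_q)]
        total += p_c * product
    return total

# Equation (5): score = answer_type_score(G_Q, P_C_given_w[w], P_in_C_gQ)
# Equation (7): score = answer_type_score(G_Q, P_C_given_w_Gw[(w, G_w)], P_in_C_gQ)
```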
<Section position="2" start_page="396" end_page="397" type="sub_section">
<SectionTitle> 4.2 Estimating Parameters </SectionTitle>
<Paragraph position="0"> Our probabilistic model requires the parameters P(C|w), P(C|w,in(w,g)), and P(in(C,g)|C), where w is a word, C is a cluster that w belongs to, and g is a question or candidate context. This section explains how these parameters are estimated without using labeled data.</Paragraph>
<Paragraph position="1"> The context-filler database described in Section 3.2 provides the joint and marginal frequency counts of contexts and words (|in(g,w)|, |in(*,g)|, and |in(w,*)|). These counts allow us to compute the probabilities P(in(w,g)), P(in(w,*)), and P(in(*,g)). We can also compute P(in(w,g)|w), which is smoothed with add-one smoothing (see equation (11) in Figure 2).</Paragraph>
<Paragraph position="2"> The estimation of P(C|w) presents a challenge: we have no corpus from which we can directly measure P(C|w), because word instances are not labeled with their clusters.</Paragraph>
<Paragraph position="5"> We use the weighted average of the &quot;guesses&quot; of the top similar words of w to compute P(C|w) (see equation (13)). The intuition is that if w' and w are similar words, P(C|w') and P(C|w) tend to have similar values. Since we do not know P(C|w') either, we substitute it with the uniform distribution P_u(C|w') as in equation (12) of Figure 2. Although P_u(C|w') is a very crude guess, the weighted average of a set of such guesses can often be quite accurate.</Paragraph>
<Paragraph position="6"> The similarities between words are obtained as a byproduct of the CBC algorithm. For each word, we use S(w) to denote the top-n most similar words (n=50 in our experiments) and sim(w,w') to denote the similarity between words w and w'. The following is a sample similar word list for the word suit: {..., complaint 0.29, lawsuits 0.27, jacket 0.25, countersuit 0.24, counterclaim 0.24, pants 0.24, trousers 0.22, shirt 0.21, slacks 0.21, case 0.21, pantsuit 0.21, shirts 0.20, sweater 0.20, coat 0.20, ...}</Paragraph>
<Paragraph position="9"> The estimation of P(C|w,in(w,g_w)) is similar to that of P(C|w), except that instead of all w' ∈ S(w) we use {w' | w' ∈ S(w) ∧ in(w',g_w)}. By only looking at a particular context g_w, we may obtain a different distribution over C than P(C|w) specifies. In the event that the data are too sparse to estimate P(C|w,in(w,g_w)), we fall back to using P(C|w).</Paragraph>
<Paragraph position="10"> P(in(C,g)|C) is computed in equation (14) by assuming that each instance of w contains a fractional instance of C, with fractional count P(C|w). Again, add-one smoothing is used.</Paragraph>
</Section>
</Section>
</Paper>