<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1008">
  <Title>Assigning Belief Scores to Names in Queries</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. EVALUATION OF NAME MATCH PROBABILITY VERSUS IDF
</SectionTitle>
    <Paragraph position="0"> To test our hypothesis that name match probability predicts relevance better than idf, we compared how well name queries with high match probabilities performed against name queries with high idf. We performed two experiments. In the first, we selected names of individuals in our legal directory. In the second, we used the names of currently active major league baseball players.</Paragraph>
    <Paragraph position="1"> To conduct the first experiment, we labeled person names in a collection of 27,000 WSJ documents with a commercially available name tagging program. We then extracted these names, merged them into a single list of first and last name pairs, and kept only the names that also occurred in our legal directory. We sorted this list in two ways, by name match probability and by document occurrence frequency (which carries the same ranking information as idf, since idf decreases monotonically with document frequency), to create two lists. We binned the names in the name match probability list into the following probability ranges: 1.0-0.9, 0.9-0.8, 0.8-0.7, 0.7-0.6, 0.6-0.5, 0.5-0.4, 0.4-0.3, 0.3-0.2, 0.2-0.1, and 0.1-0.0. We binned the names in the document frequency list by the following document occurrence frequencies: 1, 2, 3, 4, 5, 6, 7, 8, 9, and &gt;=10.</Paragraph>
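    <Paragraph> The two binning schemes above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the handling of values that fall exactly on a bin edge is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def probability_bin(p: float) -> str:
    """Return the label of the 0.1-wide bin holding match probability p."""
    # Assumption: a value on a bin edge (e.g. 0.8) goes to the higher bin;
    # the text does not say how edge cases were resolved.
    # The small offset guards against float error; 1.0 joins the top bin.
    index = min(int(p * 10 + 1e-9), 9)
    return f"{(index + 1) / 10:.1f}-{index / 10:.1f}"


def frequency_bin(doc_count: int) -> str:
    """Return the document-frequency bin label: '1' to '9', or '>=10'."""
    return ">=10" if doc_count >= 10 else str(doc_count)


def bin_names(match_prob: dict, doc_freq: dict):
    """Group names two ways: by match probability bin and by frequency bin."""
    by_prob, by_freq = defaultdict(list), defaultdict(list)
    for name, p in match_prob.items():
        by_prob[probability_bin(p)].append(name)
    for name, count in doc_freq.items():
        by_freq[frequency_bin(count)].append(name)
    return dict(by_prob), dict(by_freq)
```

Sampling up to 50 names from each resulting bin could then be done with the standard library call random.sample.</Paragraph>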
    <Paragraph position="2"> We then selected 50 names at random from each of these bins (except for the 0.8-0.7 and 0.7-0.6 probability bins, which contained only 42 and 31 names respectively). For each name selected, we identified the legal directory entry that was compatible with the name. In most cases, only one legal directory entry was compatible with the name. In some cases, multiple entries were compatible. For example, the name &quot;Paul Brown&quot; is compatible with 71 legal directory entries since there are 71 people in the directory with the first name &quot;Paul&quot; and the last name &quot;Brown&quot;. In these cases, we selected one of the entries at random.</Paragraph>
    <Paragraph position="3"> For each name in each bin, we found the set of documents in the WSJ collection that would be returned by the word proximity query &quot;First_name +2 Last_name&quot;, that is, the documents in which the first name is followed within two words by the last name.</Paragraph>
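    <Paragraph> A minimal sketch of such a proximity test is given below. It assumes that "within two words" means the last name is one of the two tokens following the first name, and it uses a simple letters-only tokenizer; both are assumptions made for illustration, and the function name is hypothetical rather than the query operator of any particular search engine.

```python
import re


def proximity_match(text: str, first: str, last: str, window: int = 2) -> bool:
    """True if `first` is followed by `last` within `window` word positions."""
    # Assumption: words are runs of letters and apostrophes, compared
    # case-insensitively; the original search engine may tokenize differently.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    first, last = first.lower(), last.lower()
    for i, token in enumerate(tokens):
        # Look only at the `window` tokens that directly follow the first name.
        if token == first and last in tokens[i + 1 : i + 1 + window]:
            return True
    return False
```

Under these assumptions the test accepts a name with one intervening token, such as a middle initial, and rejects text in which the last name precedes the first name.</Paragraph>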
    <Paragraph position="4"> The search precision results for match probability and document frequency bins are shown in tables 2 and 3 below. The search precision of each bin was the number of relevant documents returned by the names in the bin divided by the total number of documents returned. The row labeled &quot;Number Unique Names in Each Category&quot; is a count of the number of unique first and last name pairs found within the WSJ collection for the probability and document frequency ranges indicated. It was from these sets of names that we selected our queries.</Paragraph>
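    <Paragraph> The per-bin precision defined above pools the documents returned for all names in a bin before dividing. A minimal sketch, with a hypothetical function name and input layout:

```python
def bin_precision(results: list):
    """Pooled search precision for one bin.

    results: one (relevant_returned, total_returned) pair per name query.
    Returns None when the queries in the bin returned no documents at all.
    """
    relevant = sum(r for r, _ in results)
    returned = sum(n for _, n in results)
    return relevant / returned if returned else None
```

Because the counts are pooled, a name that returns many documents weighs more heavily in the bin's precision than a name that returns few.</Paragraph>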
    <Paragraph position="5"> The results in tables 2 and 3 show that match probability does a better job of estimating relevance than idf. Table 2 shows that search precision goes up as match probability rises. Table 3 shows no apparent correspondence between document frequency and search precision.</Paragraph>
    <Paragraph position="7"> In the second experiment, we followed essentially the same steps described above using the names of the 286 baseball players currently playing in the major leagues. We assigned name match probabilities to these names using the language model we derived from the legal directory. Of the 286 names, we found 82 that were compatible with one or more name instances in the WSJ collection. For all 82, we found the set of documents in the WSJ collection that would be returned by the word proximity query &quot;First_name +2 Last_name&quot;. We then measured how frequently the documents returned for a particular word proximity query actually referenced the player with whom the name query was paired. As in the attorney and judge name experiment, name match probability predicted relevance more accurately than idf.</Paragraph>
    <Paragraph position="8"> The results for baseball player names are shown in tables 4 and 5 above.</Paragraph>
    <Paragraph position="9"> Note that on average the search precision for baseball players was higher than for attorneys and judges. This is due to two combined effects: there are far fewer baseball player names than attorney and judge names, and a baseball player is on average more likely to be mentioned in the news than a judge or attorney.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. USING RARE NAMES TO IDENTIFY SEARCH FEATURES
</SectionTitle>
    <Paragraph position="0"> An important use of name match probabilities is the identification of co-occurrence features in text that can serve to disambiguate name references. If we know certain names in the corpora very probably refer to certain individuals listed in a professional directory, we can look for words that co-occur frequently with these names but infrequently with names in general. These words are likely to work well at disambiguating references to names of low match probability.</Paragraph>
    <Paragraph position="1"> As an example of feature identification, consider figures 1 and 2 above. In these figures, the word &quot;rare&quot; stands for the 20% of names in the legal directory that have the highest match probability. The phrase &quot;medium rare&quot; stands for the next 20% and so on. The word &quot;common&quot; then stands for the 20% of names with the lowest match probability. For each of the five categories of name rarity, the graphs in the figures show the probability of an appositive term occurring at a given word position relative to the position of a name.</Paragraph>
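    <Paragraph> The five rarity categories can be sketched as a quintile split over names ranked by match probability. In this sketch the categories are numbered 0 (the "rare" fifth) to 4 (the "common" fifth); the function name is hypothetical, and the arbitrary ordering of names with equal probability is an assumption.

```python
def rarity_quintiles(match_prob: dict) -> dict:
    """Map each name to a rarity category from 0 (rare) to 4 (common)."""
    # Rank names from highest to lowest match probability.
    ranked = sorted(match_prob, key=match_prob.get, reverse=True)
    n = len(ranked)
    # Integer arithmetic splits the ranking into five near-equal groups.
    return {name: (rank * 5) // n for rank, name in enumerate(ranked)}
```

When the number of names is not a multiple of five, the groups differ in size by at most one name.</Paragraph>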
    <Paragraph position="2"> Figure 1 shows the probability of attorney appositive nouns such as &quot;attorney&quot;, &quot;lawyer&quot;, &quot;counsel&quot;, or &quot;partner&quot; occurring at 12 different word positions around attorney names of varying degrees of rarity. Position -1 stands for the word position directly before the name. Position +1 stands for the position directly after.</Paragraph>
    <Paragraph position="3"> Position -2 stands for the word position two words before the name and so on. Figure 2 shows the probability of judge appositive nouns such as &quot;judge&quot; or &quot;justice&quot; occurring around judge names.</Paragraph>
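    <Paragraph> The positional probabilities plotted in the figures can be estimated with a simple count over name mentions, as sketched below. The sketch assumes that the 12 positions are offsets -6 to -1 and +1 to +6, that text is already tokenized into words without punctuation tokens, and that each mention is given as token indices; the function name and input layout are hypothetical.

```python
from collections import Counter

ATTORNEY_APPOSITIVES = {"attorney", "lawyer", "counsel", "partner"}


def appositive_position_probs(mentions, appositives, max_offset=6):
    """Estimate P(appositive at offset k) over a list of name mentions.

    mentions: list of (tokens, start, end), where tokens[start:end] is the name.
    Offset -1 is the word directly before the name, +1 the word directly after.
    """
    hits = Counter()
    for tokens, start, end in mentions:
        for k in range(1, max_offset + 1):
            before, after = start - k, end - 1 + k
            if before >= 0 and tokens[before].lower() in appositives:
                hits[-k] += 1
            if len(tokens) > after and tokens[after].lower() in appositives:
                hits[k] += 1
    offsets = [k for k in range(-max_offset, max_offset + 1) if k != 0]
    total = len(mentions)
    # Fraction of mentions with an appositive term at each offset.
    return {k: hits[k] / total for k in offsets} if total else {}
```

Running the function separately on the mentions of each rarity category would give one curve per category, which is the form of comparison the figures describe.</Paragraph>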
    <Paragraph position="4"> The graphs in figures 1 and 2 show that the probability of appositive terms occurring at particular word positions grows steadily as the name rarity increases. This demonstrates that appositive terms are good indicators for judge and attorney names within the WSJ collection. The figures also show the word positions in which we should look for appositive terms.</Paragraph>
    <Paragraph position="5"> Figure 1 shows that we should look for attorney appositives in word positions -2, -1, +2, +4, and +5. This makes intuitive sense because it accounts for sentence constructs such as those shown in table 6.</Paragraph>
    <Paragraph position="6"> The sudden drop-off in appositive term probability at word position +1 also makes sense, since an article, adjective, or other part of speech often occurs between a trailing appositive head noun and the proper noun it modifies. We cannot yet explain the drop-off at word position +3. Since the +3 behavior seems to have no linguistic basis that we can perceive, we do not rely on it in constructing our search operator.</Paragraph>
    <Paragraph position="7"> Figure 2 shows that we should look for judge appositives in word position -1. This makes perfect sense since it accounts for constructs such as &quot;Judge William Rehnquist&quot; and &quot;Justice Antonin Scalia&quot;. Figure 2 also suggests that using the -1 appositive test should yield good search recall since the conditional probability for rare names is about 0.9.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. PRELIMINARY SEARCH OPERATOR EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> We are currently investigating what levels of search precision and recall we can achieve with special attorney and judge name search operators using name rarity together with co-occurrence features such as appositive, city, state, firm, and court terms. Our preliminary results are encouraging. Initial experiments with the attorney search operator indicate that we can achieve a ninefold improvement in search precision over simple word proximity searches on the WSJ collection while sacrificing 18% recall.</Paragraph>
    <Paragraph position="1"> Preliminary results are shown in table 7 below. We produced these results by selecting at random 677 attorney names from the legal directory that also occurred in the WSJ collection. For each name, we ran word proximity searches using the first and last name of the lawyer and scored the results. Using the scored results from 377 of the names, we then trained a special Bayesian-based name operator that used first name, last name, city, state, firm, and name rarity information as sources of name match evidence. Finally, we tested the performance of the word proximity operator against that of the special name operator using the remaining 300 names.</Paragraph>
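    <Paragraph> The text does not describe the internals of the Bayesian name operator. Purely as an illustration of how several sources of evidence could be combined, the sketch below uses a naive Bayes model over binary features with add-one smoothing; the model form, the feature names, and all function names are assumptions and should not be read as the operator actually used in these experiments.

```python
import math


def train(examples, features, alpha=1.0):
    """Fit smoothed log likelihood ratios from scored search results.

    examples: list of (set_of_present_features, is_relevant) pairs.
    features: feature names, e.g. hypothetical "city", "firm", "rare_name".
    """
    n_rel = sum(1 for _, rel in examples if rel)
    n_non = len(examples) - n_rel
    model = {
        "prior": math.log((n_rel + alpha) / (n_non + alpha)),
        "present": {},
        "absent": {},
    }
    for f in features:
        rel_with = sum(1 for seen, rel in examples if rel and f in seen)
        non_with = sum(1 for seen, rel in examples if not rel and f in seen)
        p_rel = (rel_with + alpha) / (n_rel + 2 * alpha)
        p_non = (non_with + alpha) / (n_non + 2 * alpha)
        model["present"][f] = math.log(p_rel / p_non)
        model["absent"][f] = math.log((1 - p_rel) / (1 - p_non))
    return model


def relevance_probability(model, seen):
    """Posterior probability that a returned document is relevant."""
    log_odds = model["prior"]
    for f in model["present"]:
        log_odds += model["present"][f] if f in seen else model["absent"][f]
    return 1 / (1 + math.exp(-log_odds))
```

In a setup like the one described, such a model would be fitted on the scored results for the 377 training names and then thresholded on the documents returned for the 300 held-out names.</Paragraph>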
    <Paragraph position="2"> Note that we have assumed above that word proximity searches yield 100% recall. This is not wholly accurate since it does not account for nicknames, use of first name initials, and so on. We plan to revise this recall estimate in the future, but for now we assume that a word proximity search on first and last name provides close to 100% recall in a collection such as the WSJ.</Paragraph>
  </Section>
</Paper>