<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0315">
  <Title>Name Searching and Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="134" type="metho">
    <SectionTitle>
2 Definitions, Problems, and Issues
</SectionTitle>
    <Paragraph position="0"> Name searching is a term that has been used in a variety of ways. It is useful to define, for purposes of this paper, what is meant by name searching and related terminology, and to describe the application areas for which name searching systems have been developed. In their comprehensive review article on personal name-matching applications, Borgman and Siegfried [1992] categorize applications as 1) name authority control, 2) information retrieval, and 3) duplicate detection.</Paragraph>
    <Paragraph position="1"> Name-matching in a database context is the process of comparing two character strings and determining whether or not the two strings designate the same entity; in the applications Borgman and Siegfried considered, the same person, but more generally the same institutional, geographical, or other proper-named entities as well.</Paragraph>
    <Paragraph position="2">  This determination might be made solely on the basis of a direct comparison of the two strings, or more knowledge might be used, e.g., models of a) variant spelling or representation of names, b) keying errors, c) phonetic models, or d) record-linkage. That is, if the names to be compared are part of records containing additional fielded information, e.g., age or social security number, this information can be used as additional evidence in the name-matching process.</Paragraph>
    <Paragraph position="3"> Name-matching assumes that two character strings have been identified which are names and the question is only whether they are instances of the same name. Typically it is also important to determine if the names refer to the same entity. Another important class of algorithms is needed for name recognition in applications where the names are not already manually identified. Name recognition is the process of identifying that a given character string is in fact a name. Such techniques can be used to extract names from text in the case of an information extraction system [Proceedings 1992, 1995], or as part of the indexing process for an information retrieval system. The same, or similar, techniques can be used at retrieval time when parsing a user's query.</Paragraph>
    <Paragraph position="4"> Commercial products, such as Carnegie Group's NameFinder and IsoQuest's NameTag, are available to support these sorts of applications.</Paragraph>
    <Paragraph position="5"> Name matching in the context of information retrieval differs from name matching in either database or natural language understanding contexts. In all three types of applications what is ultimately of interest is not that two names match, whether exactly or approximately, as character strings, but that the entities to which they refer are identical. Such reference resolution is not generally possible without some additional context. In the case of database retrieval, additional context is provided by the structured nature of the data. A name typically is one field of a record corresponding to the named entity. The other fields, e.g., age or social security number, can be used to infer that the two names being matched do refer to the same individual. In the case of natural language understanding systems there is linguistic context, as well as, perhaps, a domain knowledge representation, which can be used to help infer that the two names being matched refer to the same individual. Information retrieval differs from both of these types of applications, because it has neither the structure provided by a database record, nor the linguistic depth or domain knowledge representation of the natural language understanding system. Practically, name matching becomes a matter of determining whether the surface forms of the two names being matched are close enough to indicate that it is plausible that they refer to the same individual.</Paragraph>
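The surface-form comparison described above can be illustrated with a minimal sketch. It is not the method used in this study: the normalization step, the use of Python's difflib ratio as the similarity measure, and the 0.85 cutoff are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Lowercase, replace punctuation with spaces, and sort the tokens so
    # that "Woods, Joe" and "Joe Woods" yield the same string.
    # Sorting discards word order, which is an illustrative simplification.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def surface_similarity(a, b):
    # Similarity ratio in [0, 1] computed by difflib on the normalized forms.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def plausibly_same(a, b, threshold=0.85):
    # threshold is a hypothetical cutoff, not a value taken from the study.
    return surface_similarity(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("Joe Woods", "Woods, Joe"),
                 ("Jon Smith", "John Smith"),
                 ("Joe Woods", "Mary Smith")]:
        print(pair, plausibly_same(*pair))
```

A match under such a test only indicates plausibility. As the Judge Smith example in this section shows, identical surface forms may still designate different individuals.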
    <Paragraph position="6"> Name searching can be defined as the process of using a name as part of a query in order to retrieve information associated with that name in a database. Name searching, in the general case, includes both name recognition and name-matching. If names are not already identified as such in the database's text records, e.g., when they appear as part of a free text field and have not been previously tagged as being names, then name recognition is required.</Paragraph>
    <Paragraph position="7"> Similarly, in parsing the query, if the name has not been identified as a name by the syntax of the query, then it will be necessary to recognize it. Once names are recognized in query and database record, then name-matching algorithms are needed to determine whether the names are the same, or that they in fact designate the same individual, e.g., two instances of the lexical entity Judge Smith are the same name, but may not designate the same individual.</Paragraph>
  </Section>
  <Section position="4" start_page="134" end_page="136" type="metho">
    <SectionTitle>
3 The Study
</SectionTitle>
    <Paragraph position="0"> This study consists of three parts. The first is a review of the literature on the accuracy of name recognition, in particular the results from the MUC-6 Named Entity Task [Proceedings 1995]. The second part of the study measures retrieval performance with name searching simulated by probabilistic searching with a proximity operator against a standard test collection with associated relevance judgments. The third part of the study analyzes the frequency of occurrence of personal and company names in legal and newspaper text collections and queries.</Paragraph>
    <Section position="1" start_page="134" end_page="135" type="sub_section">
      <SectionTitle>
3.1 Name Recognition Accuracy
</SectionTitle>
      <Paragraph position="0"> The Message Understanding Conferences (MUC) have evaluated the information extraction performance of the leading extraction systems for several years [Proceedings 1992, 1995]. Extracting names has always been part of the extraction task for MUC, but with MUC-6 [Proceedings 1995], a specific Named Entity sub-task was developed to focus exclusively on name extraction from news text. Participating systems were evaluated on personal, organizational, and other name recognition, as well as on related tasks, such as recognizing time and numeric expressions. The leading systems achieved very high accuracy for personal name recognition.</Paragraph>
    </Section>
    <Section position="2" start_page="135" end_page="136" type="sub_section">
      <SectionTitle>
3.2 Evaluation of Name Recognition and
Retrieval Performance
</SectionTitle>
      <Paragraph position="0"> To measure the gain in retrieval performance that might be achieved using name searching, a set of 38 queries containing personal names was developed by a domain expert and run against West's FED test collection. The FED collection consists of 410,883 federal case law documents. The expert also identified the set of relevant documents from the FED collection associated with each query.</Paragraph>
      <Paragraph position="1"> There are several ways that name searching could be implemented in a document retrieval context. One way would be to use name recognition software to tag all personal names in the document collection and also in queries. Alternatively, the collection could be tagged, but the user might be required to specify names in the query. Either way, strings designated as being names in the query would be matched against strings tagged as names in the text. Strings tagged as names in the collection might also be indexed differently from other strings. In particular, they might not be stemmed, since the similarity in meaning assumed to hold among general terms that reduce to a common stem presumably would not apply to names.</Paragraph>
      <Paragraph position="2"> A different approach to name searching would be to leave the collection unchanged, but to handle name queries differently from other queries. A combination of these two approaches would also be possible, i.e., tagging names in text and queries, as well as handling name queries differently. The strong personal name recognition results from MUC-6 [Proceedings 1995] suggest that approaches using name tagging are likely to work well.</Paragraph>
      <Paragraph position="3"> In this study, however, names were not tagged. Rather, name searching was simulated by probabilistic searching with a proximity operator for multiple word names.</Paragraph>
      <Paragraph position="4"> The 38 queries (shown in the appendix) were run against the FED. Retrieval performance using proximity-based name searching on this test collection, as described in section 4.2, was compared against a baseline provided by the WIN retrieval algorithm. WIN is West's probabilistic retrieval engine based on the inference network model (Turtle and Croft 1991).</Paragraph>
      <Paragraph position="5"> The baseline searches treated each term in the query as a separate concept. The relevance score for each document was computed as the sum of the logged products of each term's term frequency (tf) and inverse document frequency (idf).</Paragraph>
      <Paragraph position="6"> The proximity searches treated non-name terms in the same way the baseline searches did. However, for name terms, the proximity searches used the tf and idf of the proximally ordered name terms. The proximity searches computed relevance for names using the tf and idf of occurrences in which the first name occurred 2 or fewer word positions before the last name. In this way advantage was taken of the fact that name terms are ordered and resist interruption by non-name terms.</Paragraph>
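The ordered proximity condition can be sketched as a simple count over a tokenized document. This is an illustrative sketch, not the WIN implementation: the function name, the pre-lowercased token list, and the use of raw token positions (with no stopword handling) are assumptions.

```python
def ordered_proximity_count(tokens, first, last, window=2):
    """Count occurrences where `last` follows `first` by 1..window positions."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Inspect only the `window` positions that follow the first name,
        # so that "woods joe" (reversed order) is never counted.
        if last in tokens[i + 1 : i + 1 + window]:
            count += 1
    return count


if __name__ == "__main__":
    text = "joe woods filed while joe b woods and woods joe waited"
    print(ordered_proximity_count(text.split(), "joe", "woods"))
```

The per-document count plays the role of the tf for the combined name concept, and the number of documents with a non-zero count plays the role of its document frequency.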
      <Paragraph position="7"> For example, in the query Cases involving jailhouse lawyer Joe Woods, the baseline search treated Joe and Woods as independent concepts. Joe occurred in 7,669 documents within the 410,883 document test collection and had a normalized idf of 0.31. Woods occurred in 18,064 documents and had an idf of 0.24.</Paragraph>
      <Paragraph position="8"> The ordered proximity search treated Joe Woods as a single concept in which the terms comprising the concept were proximally ordered. Joe +2 Woods occurred in 17 documents and had an idf of 0.78. By treating Joe Woods in this manner, the proximity search boosted the scores of documents containing references to the person Joe Woods and thereby improved search performance.</Paragraph>
      <Paragraph position="9"> Our search engine computes the normalized idf, nidf, in the following way:</Paragraph>
      <Paragraph position="11"> where N = collection size and n = the number of documents containing the term.</Paragraph>
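The formula itself is missing from this copy of the text. As a hedged reconstruction, the normalized idf form commonly associated with inference-network retrieval engines, log((N + 0.5) / n) / log(N + 1), reproduces the three values reported for the Joe Woods example. That this is exactly the formula the authors printed is an assumption.

```python
import math


def nidf(n, N):
    # Assumed form (not confirmed by the source): log((N + 0.5) / n) / log(N + 1),
    # where N is the collection size and n the number of documents with the term.
    return math.log((N + 0.5) / n) / math.log(N + 1.0)


if __name__ == "__main__":
    N = 410883  # size of the FED collection, from the text
    # Document counts from the Joe Woods example in the text.
    for label, n in [("Joe", 7669), ("Woods", 18064), ("Joe +2 Woods", 17)]:
        print(label, round(nidf(n, N), 2))
```

Rounded to two places, the sketch yields 0.31, 0.24, and 0.78, which agree with the figures quoted for Joe, Woods, and Joe +2 Woods.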
      <Paragraph position="12"> Table 1 shows the frequency counts and normalized idf for the concepts in the query Cases involving jailhouse</Paragraph>
    </Section>
    <Section position="3" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
3.3 Name Recognition Case Law Collection
</SectionTitle>
      <Paragraph position="0"> A manually marked-up case law name recognition test collection of 724 test documents was created for evaluating name recognition and name frequency analysis. Guidelines and example marked-up pages from case law text were prepared for use by the manual markers. Personal and institutional, or company, names were tagged in an SGML-like manner. Other names, acronyms, and abbreviations were also tagged, including geographic, product, facility, and (court) case names.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="136" end_page="137" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The MUC-6 Named Entity Task [Proceedings 1995] results show the effectiveness of name recognition for news text, if not directly for case law text. Support for the hypothesis that name searching can lead to retrieval performance improvement was provided by simulating name searching using a proximity operator, which required that the terms of a multiple-word query name occur within two non-stopwords of each other in the text of a document. The name frequency analyses show that names occur frequently enough in case law to merit special handling. In news text and queries names occur with much greater frequency (see table 4).</Paragraph>
    <Section position="1" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
4.1 Name Recognition Accuracy
</SectionTitle>
      <Paragraph position="0"> The leading systems on the personal name recognition portion of the MUC-6 Named Entity Task, e.g., those developed by SRA and BBN, each had recall and precision scores of 98% or higher [Proceedings 1995]. While this performance was achieved on news text, and may not necessarily generalize to other types of text, it is a very strong result. It suggests that comparable levels of performance may be achievable for other text types as well. NameTag [NameTag 1996], for example, was able to obtain this high accuracy using two major knowledge sources: a representation of name structure, e.g., first name last name; and contextual knowledge about name occurrences, e.g., that a corporate executive's name often co-occurs with a title. These knowledge sources are implemented in a) name recognition rules consisting of a pattern and an action and in b) lexical resources, e.g., part of speech information.</Paragraph>
    </Section>
    <Section position="2" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
4.2 Effect on Retrieval Performance
</SectionTitle>
      <Paragraph position="0"> For the 38 queries with personal names (see section 3.2) run against the FED collection, proximity-based name searching led to significant improvement over the baseline WIN searching. Table 2 compares results for proximity-based searching to the baseline. The first column of table 2 shows eleven levels of recall, while the second and third columns show the precision scores for baseline and proximity-based name searching, respectively, for the corresponding level of recall. The final row shows the eleven-point averages, and the numbers in parentheses are the percentage improvement of the proximity-based approach over the baseline. This method of recall/precision evaluation is widely used in information retrieval research, and in particular has been used in the Text REtrieval Conferences (TREC) [Harman 1996].</Paragraph>
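The eleven-point evaluation described above can be sketched for a single query as follows. This is a minimal illustration of the standard interpolated-precision convention, which is assumed here rather than confirmed by the text. The function names and the toy ranking are hypothetical, and a table such as table 2 would average these values over all queries.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    points = []  # (recall, precision) recorded at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at a level is the best precision observed at
    # any recall greater than or equal to that level (0.0 if none is reached).
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


def eleven_point_average(ranking, relevant):
    values = eleven_point_precision(ranking, relevant)
    return sum(values) / len(values)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]  # hypothetical ranked output
    relevant = {"d1", "d4"}             # hypothetical relevance judgments
    print(eleven_point_precision(ranking, relevant))
    print(eleven_point_average(ranking, relevant))
```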
      <Paragraph position="1"> The proximity operator required that the name terms occur within two non-stopwords of each other in the text</Paragraph>
    </Section>
    <Section position="3" start_page="136" end_page="137" type="sub_section">
      <SectionTitle>
4.3 Name Frequencies in the Case Law
Collection
</SectionTitle>
      <Paragraph position="0"> There were 58,585 personal name word tokens in the manually marked set of 720 cases constituting the Case Law Collection. This represents 2.05% of all word tokens in the collection (not counting stopwords). Table 3 shows counts and percentages for the various types of names manually marked in this set of documents. Table 4 shows the percentage of user natural language queries containing person, company, and other names submitted to several news databases over periods of several days in 1995.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="137" end_page="137" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> This study suggests that the name recognition accuracy of name searching software is reasonably good, and it seems safe to assume that this accuracy can be improved using domain-specific heuristics and tuning. For queries containing names there was retrieval performance improvement using name searching, as simulated by proximity operators. This study further shows that the frequency of occurrence of personal and other names in cases is sufficient to warrant their separate treatment in document retrieval.</Paragraph>
    <Paragraph position="1"> The performance improvement obtained by proximity searching against a collection which had not had names pre-tagged suggests that retrieval performance gains may be possible using simple name matching heuristics if the query name term is known, rather than relying on pre-processed name tagging.</Paragraph>
    <Paragraph position="2"> Whether pre-tagging the collection with name recognition software could give even better retrieval performance is an open research question. The MUC-6 results imply that recognition accuracy is very high, at least for news text, but whether this would help retrieval much, given that the name to be searched is already known, i.e., specified in the query, is uncertain.</Paragraph>
    <Paragraph position="3"> This study supports the view that name recognition and matching in the context of information retrieval is a significantly different problem from either name searching, or matching, in relational databases, or name recognition, or extraction, i.e., tagging names in free text. Most research and development has focussed on these latter two applications, rather than information retrieval. The prospect of adapting the name recognition and matching techniques developed for these applications to information retrieval seems promising, however. For Boolean retrieval systems one approach would be to put the burden of query name recognition on the user by requiring that the user tag a query term as being a personal, company, or other name. Then name recognition techniques, much like those of information extraction, could be used to find candidate matching names in free text, and name matching techniques, much like those of database applications, could be used to determine whether names identified in query and text matched.</Paragraph>
    <Paragraph position="4"> For systems such as WIN, Freestyle, or Target, of West Publishing, Lexis-Nexis, and DIALOG, respectively, which take natural language queries as input, the approach to take is less clear. Although it would be possible to have the user, as in the Boolean situation, tag query terms as names, this would seem to violate the underlying philosophy of natural language input search systems, i.e., that the user communicate with the search engine in ordinary natural language. If the user does not provide query name recognition, then the system must do so automatically. It might be thought that the same query recognition software used to recognize names in text could do the same in queries. This is possible, but the nature of document and query text is quite different.</Paragraph>
    <Paragraph position="5"> Much less rich syntactic content is usually present in queries, which also tend to be quite short in commercial online systems [Lu and Keefer 1995]. This greatly changes the recognition problem, especially for software which finds patterns in text as the basis of its name recognition [Krupka 1995]. Software which relied much more on an exhaustive lexicon of names and variants might do better, but could not deal with names which were not contained in its lexicon.</Paragraph>
  </Section>
</Paper>