<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1028"> <Title>COMPUTATIONAL TECHNIQUES FOR IMPROVED NAME SEARCH</Title> <Section position="3" start_page="0" end_page="203" type="intro"> <SectionTitle> 1.0 INTRODUCTION </SectionTitle> <Paragraph position="0"> This paper describes enhancements to the name search techniques currently used to access large databases of proper names. The work focused on improving name search algorithms to yield better matching and retrieval performance on databases containing large numbers of non-European 'foreign' names. Because the linguistic mix of names in large computer-supported databases has changed as a result of recent immigration and other demographic factors, current name search procedures do not provide the accurate retrieval required by insurance companies, state motor vehicle bureaus, law enforcement agencies and other institutions. Because the potential consequences of incorrect retrieval are severe (e.g., loss of benefits, false arrest), name search techniques must be improved to handle the linguistic variability reflected in current databases.</Paragraph> <Paragraph position="1"> Our approach decomposed the name search problem into two main components: (1) language classification techniques, which identify the source language of a given query name, and (2) name association techniques, which, once the source language of a name is known, exploit language-specific rules to generate variants of the name arising from spelling variation, bad transcriptions, nicknames, and other naming conventions. A statistical classification technique based on Hidden Markov Models (HMMs) was used as the language discriminator. The test database contained about 11,000 names: about 2,000 from each of three target languages (Vietnamese, Farsi and Spanish) and 5,000 termed 'other' to broadly represent general European names.
The decision procedures assumed a closed-world situation in which every name must be assigned to exactly one of the four classes.</Paragraph> <Paragraph position="2"> Language-specific rules in the form of context-sensitive string rewrite rules were used to generate name variants. These rules were based on linguistic analysis of the naming conventions, pronunciations and common misspellings of each target language.</Paragraph> <Paragraph position="3"> These two components were incorporated into a front-end system that drives existing name search procedures. The front-end system was implemented in the C language and runs on a VAX-11/780 and on Sun 3 workstations under Unix 4.2. Preliminary tests indicate that retrieval (the number of correct items retrieved) improves by as much as 20-30% over standard</Paragraph> </Section> </Paper>