<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2198"> <Title>Locating noun phrases with finite state transducers.</Title> <Section position="2" start_page="0" end_page="1212" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Information Retrieval in full texts is one of the challenges of the coming years. Web engines attempt to select, among the millions of existing Web sites, those corresponding to an input request. Newspaper archives are another example: there are several gigabytes of news in electronic form, and the size is increasing every day. Different approaches have been proposed to retrieve precise information in a large database of natural texts: 1. Key-word algorithms (e.g. Yahoo): co-occurrences of the different words of the request are searched for in one and the same document. Generally, slight variations of spelling are allowed to take into account grammatical endings and typing errors.</Paragraph> <Paragraph position="1"> 2. Exact pattern algorithms (e.g. OED): sequences containing occurrences described by a regular expression on characters are located.</Paragraph> <Paragraph position="2"> 3. Statistical algorithms (e.g. LiveTopic): they offer to the user documents containing words of the request, and also words that are statistically and semantically close with respect to clustering or factorial analysis. The first method is the simplest one: it generally yields results with considerable noise (documents containing homographs of the words of the request that are unrelated to the request, or documents containing words whose form is very close to that of the request, but with a different meaning).</Paragraph> <Paragraph position="3"> The second method yields excellent results, provided that the pattern of the request is sufficiently complex, and thus allows the specification of synonymous forms. Also, the different grammatical endings can be described precisely. 
The drawback of such precision is the difficulty of building and handling complex requests.</Paragraph> <Paragraph position="4"> The third approach can provide good results for a very simple request. But, like any statistical method, it needs documents of a huge size, and thus cannot take into account words occurring a limited number of times in the database, which is the case for roughly one word out of two, according to Zipf's law 1 (Zipf, 1932).</Paragraph> <Paragraph position="5"> We are particularly interested in finding noun phrases containing or referring to proper nouns, in order to answer the following requests: 1. Who is John Major? 2. Find all documents referring to John Major. 3. Find all people who have been French ministers of culture.</Paragraph> <Paragraph position="6"> With the key-word method, texts containing the sequence 'John Major' are found, but also texts containing 'a UN Protection Force, Major Rob Anninck', 'P. Major', 'a former Long Islander, John Jacques' and 'Mr. Major'.</Paragraph> <Paragraph position="7"> The statistical approach will probably succeed (supposing the text is large enough) in associating the words John Major with the words Britain, prime and minister. Therefore, it would provide documents containing the sequence 'the prime minister, John Major', but also 'the French prime minister' or 'Timothy Eggar, Britain's energy minister', which have exactly the same number of correctly associated words. Such answers are an inevitable consequence of any method that is not grammatically founded.</Paragraph> <Paragraph position="8"> M. Gross and J. Senellart (1998) have proposed a preprocessing step which groups up to 50% of the words of the text into compound utterances. By hiding irrelevant meanings of simple words which are part of compounds, they obtain more relevant tokens. 
In the preceding example, the minimal tokens would be the compound nouns 'prime minister' or 'energy minister'; thus, the statistical engine could not have misinterpreted the word 'minister' in 'energy minister' and in 'prime minister'.</Paragraph> <Paragraph position="9"> We propose here a new method based on a formal and full description of the specific phrases actually used to describe occupations. We also use large-coverage dictionaries and libraries of general-purpose finite state transducers. Our algorithm finds answers to questions of types 1, 2 and 3, with nearly no errors due to silence or noise. The few remaining error cases are treated in section 5, where we show that, in order to avoid them by a general method, one must perform a complete syntactic analysis of the sentence. (Footnote 1: This is true whatever the size of the database is.)</Paragraph> <Paragraph position="10"> Our algorithm has three different applications. First, by using dictionaries of proper nouns and local grammars describing occupations, it answers requests. Synonyms and hyponyms are formally treated, as well as the chronological evolution of the corpus. By consulting a pre-processed index of the database, it provides results in real time. The second application of the algorithm consists in replacing proper nouns in FSTs by variables, and using them to locate and propose to the user new proper nouns not listed in dictionaries. In this way, the construction of the library of FSTs and of the dictionaries can be automated, at least in part. The third application is the automatic translation of such noun phrases, by constructing the equivalent transducers in the different languages.</Paragraph> <Paragraph position="11"> In section 2, we provide the formal description of the problem, and we show how we can use automaton representations. In section 3, we show how we can handle requests. In section 4, we give some examples. In section 5, we analyze failed answers. 
In section 6, we show how we use transducers to enrich a dictionary.</Paragraph> </Section> </Paper>