File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/p98-1045_abstr.xml
Size: 6,556 bytes
Last Modified: 2025-10-06 13:49:15
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1045"> <Title>Automatic Semantic Tagging of Unknown Proper Names</Title> <Section position="1" start_page="0" end_page="286" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Implemented methods for proper name recognition rely on large gazetteers of common proper nouns and a set of heuristic rules (e.g. Mr. as an indicator of a PERSON entity type). Though the performance of current PN recognizers is very high (over 90%), it is important to note that this problem is by no means a &quot;solved problem&quot;. Existing systems perform extremely well on newswire corpora by virtue of the availability of large gazetteers and rule bases designed for specific tasks (e.g. recognition of Organization and Person entity types as specified in recent Message Understanding Conferences, MUC).</Paragraph> <Paragraph position="1"> However, large gazetteers are not available for most languages and applications other than newswire texts and, in any case, proper nouns are an open class.</Paragraph> <Paragraph position="2"> In this paper we describe a context-based method to assign an entity type to unknown proper names (PNs). Like many others, our system relies on a gazetteer and a set of context-dependent heuristics to classify proper nouns. 
However, due to the unavailability of large gazetteers in Italian, over 20% of detected PNs cannot be semantically tagged.</Paragraph> <Paragraph position="3"> The algorithm that we propose assigns an entity type to an unknown PN based on the analysis of syntactically and semantically similar contexts already seen in the application corpus.</Paragraph> <Paragraph position="4"> The performance of the algorithm is evaluated not only in terms of precision, following the tradition of the MUC conferences, but also in terms of Information Gain, an information-theoretic measure that takes into account the complexity of the classification task.</Paragraph> <Paragraph position="5"> Introduction In terms of syntactic categories, proper nouns are lexical NPs that can be formed by primitive proper names (Adolfo_Battaglia), groups of proper nouns of different semantic categories (San Paolo di Brescia), and also of non-proper nouns (Banca dei regolamenti internazionali). In the latter case, capital letters are optional, making the problem of PN identification even more complex.</Paragraph> <Paragraph position="6"> In the literature, it is accepted that an adequate treatment of proper nouns requires the use of a context-sensitive grammar (McDonald, 1996). McDonald points out that the context sensitivity requirement involves two complementary types of evidence: internal and external.</Paragraph> <Paragraph position="7"> Internal evidence can be derived from the sequence of words in a text (proper nouns and trigger words, such as Inc., &amp;, Ltd., Company, etc.), and is gained in almost all state-of-the-art PN recognisers by the use of large gazetteers and lists of trigger words.</Paragraph> <Paragraph position="8"> External evidence is the context of a proper noun, which provides classificatory criteria to reinforce internal evidence, if any, or supplies classificatory evidence when internal evidence is absent. 
In fact, proper names form an open class, making the incompleteness of gazetteers an obvious problem.</Paragraph> <Paragraph position="9"> The methods for recognition of proper nouns (PNs) described in the literature closely reflect this view of the problem.</Paragraph> <Paragraph position="10"> PN identification typically includes: * a gazetteer lookup, which locates simple and complex nominals identifying common PNs, such as companies, person names, locations, etc.</Paragraph> <Paragraph position="11"> * a set of patterns or rules, stated in terms of part-of-speech, syntactic or lexical features (e.g. Mr. as an indicator of a PERSON entity type), orthographic features (e.g. capitalization), etc.</Paragraph> <Paragraph position="12"> Proper noun recognition has recently attracted much attention, especially in the area of Information Extraction, where this problem is known as the Named Entity recognition task. The highest performing systems include large numbers of hand-coded rules, or patterns, such as VIE (Humphreys et al. 1996), the UMass system (Fisher et al. 1997) and Proteus (Grishman et al. 1992), but lately high performance has also been obtained by the use of statistical methods. For example, Nymble (Bikel et al. 1997) learns names using a trained approach based on a variant of Hidden Markov Models.</Paragraph> <Paragraph position="13"> However, a 90% success rate is reached at the price of manually tagging around half a million words. Since PNs are mostly domain-specific, presumably a comparable effort is needed when shifting to different domains.</Paragraph> <Paragraph position="14"> The high performance of the existing systems is undoubtedly the result of many years of studies and research in the area of IE from English newswire texts, promoted and funded by the Message Understanding Conferences (MUC) organizers. 
Yet, there is no evidence that a similar performance could be obtained in other languages and domains, except at the price of a similar effort for rule writing (or manual training), and for the compilation of a high-coverage gazetteer. A recent study (Palmer and Day, 1997) established that the baseline performance of the PN recognition task for several languages and application domains varies between 34% and 71%. The lower bound is calculated by considering a simple algorithm that recognizes PNs on the basis of a list of frequent proper nouns seen in a training set.</Paragraph> <Paragraph position="15"> The method we propose in this paper combines symbolic and statistical approaches to classify unknown PNs using context evidence previously extracted from the application corpus. The method can be used to overcome the limitation of small gazetteers and poorly encoded rule bases.</Paragraph> <Paragraph position="16"> Our method is untrained: what is needed is a learning (raw) corpus, a surface syntactic analyzer, a dictionary of synonyms, a list of category names for classifying PNs (we used the categories proposed in the forthcoming MUC-7), and a &quot;start-up&quot; gazetteer and rule base, used to acquire an initial model of typical PN contexts.</Paragraph> <Paragraph position="17"> In the next section, we describe the method in detail. Section 3 is dedicated to a discussion of experimental results.</Paragraph> </Section> </Paper>