File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/c96-1071_intro.xml
Size: 4,986 bytes
Last Modified: 2025-10-06 14:05:57
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1071"> <Title>Evaluation of an Algorithm for the Recognition and Classification of Proper Names</Title> <Section position="2" start_page="0" end_page="418" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The appropriate treatment of proper names is essential in a natural language understanding system which processes unedited newswire text, since up to 10% of this type of text may consist of proper names (Coates-Stephens, 1992). Nor is it only the sheer volume of names that makes them important; for some applications, such as information extraction (IE), robust handling of proper names is a prerequisite for successfully performing other tasks such as template filling, where correctly identifying the entities which play semantic roles in relational frames is crucial. Recent research in the fifth and sixth Message Understanding Conferences (MUC5, 1993) (MUC6, 1995) has shown that the recognition and classification of proper names in business newswire text can now be done on a large scale and with high accuracy: the success rates of the best systems now approach 96%.</Paragraph> <Paragraph position="1"> We have developed an IE system, LaSIE (Large Scale Information Extraction) (Gaizauskas et al., 1995), which extracts important facts from business newswire texts. As a key part of the extraction task, the system recognises and classifies certain types of naming expressions, namely those specified in the MUC-6 named entity (NE) task definition (MUC6, 1995). These include organisation, person, and location names, time expressions, percentage expressions, and monetary amount expressions. As defined for MUC-6, the first three of these are proper names, the fourth contains some expressions that would be classified as proper names by linguists and some that would not, while the last two would generally not be thought of as proper names. 
In this paper we concentrate only on the behaviour of the LaSIE system with regard to recognising and classifying expressions in the first four classes, i.e. those which consist entirely or in part of proper names (though nothing hangs on omitting the others). The version of the system reported here achieves a combined precision and recall score of almost 92% on this task against blind test data.</Paragraph> <Paragraph position="2"> Of course the four name classes mentioned are not the only classes of proper names. Brand names, book and movie names, and ship names are just a few further classes one might choose to identify. One might also want to introduce sub-classes within the selected classes. We have not done so here for two reasons. First and foremost, to generate quantitative evaluation results we have used the MUC-6 data and scoring resources, and these restrict us to the above proper name classes. Second, these four name classes account for the bulk of proper name occurrences in business newswire text. Our approach could straightforwardly be extended to account for additional classes of proper names, and the points we wish to make about the approach can be adequately presented using only this restricted set.</Paragraph> <Paragraph position="3"> Our approach to proper name recognition is heterogeneous. We take advantage of graphological, syntactic, semantic, world knowledge, and discourse-level information to perform the task. In this paper we present details of the approach, describing those data and processing components of the overall IE system which contribute to proper name recognition and classification. Since name recognition and classification is achieved through the activity of four successive components in the system, we quantitatively evaluate the successive contribution of each component in our overall approach. 
We perform this analysis not only for all classes of names, but also for each class separately.</Paragraph> <Paragraph position="4"> The resulting analysis: (1) supports McDonald's observation (McDonald, 1993) that external evidence as well as internal evidence is essential for achieving high precision and recall in the recognition and classification task, i.e. not just the name string itself must be examined, but other information in the text must be used as well; (2) shows that all components in our heterogeneous approach contribute significantly; (3) shows that not all classes of proper names benefit equally from the contributions of the different components in our system: in particular, organisation names benefit most from the use of external evidence.</Paragraph> <Paragraph position="5"> In the second section an overview of the LaSIE system is presented. The third section explains in detail how proper names are recognised and classified in the system. The results of evaluating the system on a blind test set of 30 articles are presented and discussed in section 4. Section 5 concludes the paper.</Paragraph> </Section> </Paper>