File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1219_metho.xml
Size: 5,190 bytes
Last Modified: 2025-10-06 14:15:15
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1219"> <Title>Proper Name Classification in an Information Extraction Toolset</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Problem Domain </SectionTitle> <Paragraph position="0"> The Named Entity Test is one component of the message understanding conference (MUC 57 (DARPA, 1995)) evaluations. The goal of the NE test is to add SGML tags to the evaluation texts that mark up all the proper names. The body of text used in these trials is a selection of articles from the Wall Street Journal. McDonald (McDonald, 1996) characterizes the problem as having three sub-components: Wallis, Yuen and Chase 161 Proper Name Classification Peter Wallis, Edmund Yuen and Greg Chase (1998) Proper Name Classification in an Information Extraction Toolset. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 161-162.</Paragraph> <Paragraph position="1"> * delimit the sequence of words that make up the name, i.e. identify its boundaries; * classify the resulting constituent based on the kind of individual it names (e.g. Person, Organization, Location); and * record the name and the individual it denotes in the discourse model The emphasis in this paper is on a method for classifying the name using external evidence.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Classification </SectionTitle> <Paragraph position="0"> During this process, internal evidence (McDonald, 1996) may be gleaned as to the type of the named entity. Titles such as Mr, Ms, Dr, Sir, and Jr provide evidence of the named entity being a person.</Paragraph> <Paragraph position="1"> The presence of Ltd. or G.m.b.H. signify a company.</Paragraph> <Paragraph position="2"> External evidence (McDonald, 1996) about a named entity's type can also be used. If it is unclear whether a name refers to a person or a company, it can help to look at the verb it participates in, or at any modifiers it may have. People do things like &quot;head&quot; organizations, &quot;speak&quot; and &quot;walk&quot;. Companies &quot;merge&quot; and &quot;take measures&quot;. People have employment roles, gender, and age; companies have locations and managing directors. Ideally a system would have rules that say if a subject-of-averb( < NE >, (head, say, explain ...) ) then the named entity is of type person. Similarly a function modified-by( < NE >, (chairman, head, < number > years old, ...) ) could be used in a rule to determine if the < NE > is a person. Writing such rules require a list of terms which are good selector terms for the entity of interest. The proposal is to add a tool to the fact extractor workbench that helps the language engineer find good selector terms using probabilistic measures.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Finding Class Selectors </SectionTitle> <Paragraph position="0"> To measure how good a selector term is for an existing fact extractor, we need to compare the probablity that the word is present in a sentence and the probability that the word is in a sentence given that a &quot;fact&quot; is in that sentence.</Paragraph> <Paragraph position="2"> Sf = sentence with fact f Prob(w in S \[ f in S) = number of S I with w number of S f (1) number of S with w Prob(w in S) = number of S (2) If w and f are independent then 1 will approximate 2 however if they are dependent 1 will be different from 2.</Paragraph> <Paragraph position="3"> A measure of w's selective power can be calculated as a ratio.</Paragraph> <Paragraph position="4"> Sell(w ) ~ Prob(w in S t f in S)</Paragraph> <Paragraph position="6"> An Sel of close to 1 indicates little correlation between the term, w, and the fact, \]. An Sel significantly greater than 1 indicates a high degree of correlation between w and f and hence w is a good selector term. Interestingly, a Sel of significantly less than 1 (close to zero) indicates that the presence of w is a good indication of f being absent.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Incorporating Selective Power </SectionTitle> <Paragraph position="0"> A tool has been incorporated into the Fact Extractor Workbench that allows the user to run one or more fact extractors over the text corpus and produce and ordered set of candidate selector terms. This list of selector terms can then be considered for inclusion into a more refined fact extractor.</Paragraph> <Paragraph position="1"> For example, by measuring the selective power of corpus words for the &quot;City&quot; fact extractor pattern, we can find which words are used in the context of Washington, the city and which are used in the context of Washington, the person. By ranking corpus words based on selective power, we single out candidates as good selector terms to refine the &quot;City&quot; fact extractor.</Paragraph> </Section> class="xml-element"></Paper>