<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1050">
  <Title>A Synopsis of Learning to Recognize Names Across Languages</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Proper names represent a unique challenge for MT and IR systems. They are not found in dictionaries, are very large in number, come and go every day, and appear in many alias forms. For these reasons, list-based matching schemes do not achieve desired performance levels. Hand-coded heuristics can be developed to achieve high accuracy; however, this approach lacks portability, since much human effort is needed to port the system to a new domain.</Paragraph>
    <Paragraph position="1"> A desirable approach is one that maximizes reuse and minimizes human effort. This paper presents an approach to proper name recognition that uses machine learning and a language-independent framework.</Paragraph>
    <Paragraph position="2"> Knowledge incorporated into the framework is based on a set of measurable linguistic characteristics, or features. Some of this knowledge is constant across languages. The rest can be generated automatically through machine learning techniques.</Paragraph>
    <Paragraph position="3"> Whether a phrase (or word) is a proper name, and what type of proper name it is (company name, location name, person name, date, other), depends on (1) the internal structure of the phrase, and (2) the surrounding context.</Paragraph>
    <Paragraph position="4"> Internal: &quot;Mr. Brandon&quot; Context: &quot;The new company, Safetek, will make air bags.&quot; The person title &quot;Mr.&quot; reliably shows &quot;Mr. Brandon&quot; to be a person name. &quot;Safetek&quot; can be recognized as a company name by utilizing the preceding contextual phrase and appositive &quot;The new company,&quot;.</Paragraph>
    <Paragraph position="5"> The recognition task can be broken down into delimitation and classification. Delimitation is the determination of the boundaries of the proper name, while classification serves to provide a more specific category.</Paragraph>
    <Paragraph position="6">  his resignation yesterday.</Paragraph>
    <Paragraph position="7"> During the delimit step, proper name boundaries are identified. Next, the delimited names are categorized.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="357" type="metho">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The approach taken here is to utilize a data-driven knowledge acquisition strategy based on decision trees which uses contextual information. This differs from other approaches (Farwell et al., 1994; Kitani &amp; Mitamura, 1994; McDonald, 1993; Rau, 1992) which attempt to achieve this task by: (1) hand-coded heuristics, (2) list-based matching schemes, (3) human-generated knowledge bases, and (4) combinations thereof. Delimitation occurs through the application of phrasal templates. These templates, built by hand, use logical operators (AND, OR, etc.) to combine features strongly associated with proper names, including: proper noun, ampersand, hyphen, and comma. In addition, ambiguities with delimitation are handled by including other predictive features within the templates.</Paragraph>
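    <Paragraph> As a rough illustration of how one hand-built phrasal template might combine features with logical operators, the Python sketch below extends a candidate name over tokens that are proper nouns, or that are connectors (ampersand, hyphen, comma) with a proper noun on both sides. The tag name NNP, the function names, and the template itself are assumptions made for illustration; the actual templates are not listed in this text.

```python
# Hypothetical sketch: delimit candidate proper names with one hand-written
# template. Tokens are (word, pos) pairs. The "NNP" proper-noun tag is an
# assumption, not necessarily the tag set used in the paper.

AMPERSAND = chr(38)  # spelled via chr() so the enclosing XML stays well-formed
CONNECTORS = {AMPERSAND, "-", ","}


def is_proper(token):
    """True when the token carries the (assumed) proper-noun tag."""
    return token[1] == "NNP"


def is_connector(tokens, i):
    """Template clause: connector AND proper noun before AND proper noun after."""
    return (
        tokens[i][0] in CONNECTORS
        and i in range(1, len(tokens) - 1)
        and is_proper(tokens[i - 1])
        and is_proper(tokens[i + 1])
    )


def delimit(tokens):
    """Return (start, end) index pairs, end exclusive, for candidate names."""
    spans = []
    start = None
    for i in range(len(tokens)):
        inside = is_proper(tokens[i]) or is_connector(tokens, i)
        if inside and start is None:
            start = i  # a candidate name begins here
        elif not inside and start is not None:
            spans.append((start, i))  # the candidate name ended at i - 1
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans
```

    Under these assumptions, a comma is absorbed into a name only when proper nouns flank it, so the appositive commas around a single-word company name stay outside the delimited span.</Paragraph>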
    <Paragraph position="1"> To acquire the knowledge required for classification, each word is tagged with all of its associated features. Various types of features indicate the type of name: parts of speech (POS), designators, morphology, syntax, semantics, and more. Designators are features which alone provide strong evidence for or against a particular name type. Examples include &quot;Co.&quot; (company), &quot;Dr.&quot; (person), and &quot;County&quot; (location).</Paragraph>
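    <Paragraph> The tagging step described above could be sketched in Python roughly as follows. The three designators are the examples given in the text; the small city list, the capitalization feature, and the feature-string format are hypothetical stand-ins for whatever lists and conventions a real system would use.

```python
# Hypothetical sketch of per-word feature tagging. Only the three designators
# come from the text; everything else here is an illustrative assumption.

DESIGNATORS = {"Co.": "company", "Dr.": "person", "County": "location"}
CITY_LIST = {"Pittsburgh", "Tokyo"}  # assumed sample entries from an on-line list


def tag_features(word, pos):
    """Collect every feature that applies to one (word, pos) token."""
    features = {"pos=" + pos}
    if word in DESIGNATORS:
        features.add("designator=" + DESIGNATORS[word])
    if word in CITY_LIST:
        features.add("list=city")
    if word[:1].isupper():
        features.add("capitalized")  # assumed morphological feature
    return features
```

    Representing the result as a set of feature strings keeps every applicable feature attached to the word, so that a later learning step can test for the presence or absence of each one.</Paragraph>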
    <Paragraph position="2"> Features are derived through automated and manual techniques. On-line lists can quickly provide useful features such as cities, family names, nationalities, etc. Proven POS taggers (Farwell et al., 1994; Brill, 1992; Matsumoto et al., 1992) predetermine POS features.</Paragraph>
    <Paragraph position="3"> Other features are derived through statistical measures and hand analysis.</Paragraph>
    <Paragraph position="4"> A decision tree is built (for each name class) from the initial feature set using a recursive partitioning algorithm (Quinlan, 1986; Breiman et al., 1984) that uses the following function as its splitting criterion:</Paragraph>
    <Paragraph position="6"> where p represents the proportion of names within a tree node belonging to the class for which the tree is built. The feature which minimizes the weighted sum of this function across both child nodes resulting from a split is chosen. A multitree approach was chosen over learning a single tree for all name classes because it allows for the straightforward association of features within the tree with specific name classes, and facilitates troubleshooting. Once built, the trees are all applied individually, and then the results are merged.</Paragraph>
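    <Paragraph> The splitting function itself does not appear in this copy of the text, so the Python sketch below substitutes the two-class Gini impurity, 2p(1-p), purely as a stand-in; it should not be read as the paper's criterion. What the sketch does illustrate is the selection rule described above: for each candidate feature, weight the function's value in the two child nodes by node size and keep the feature with the smallest sum. All function and variable names are hypothetical.

```python
# Sketch of the split-selection rule for one per-class tree. impurity() is an
# assumed stand-in (Gini, two-class case), not the function used in the paper.


def impurity(p):
    """Assumed criterion; p is the proportion of names in the tree's class."""
    return 2.0 * p * (1.0 - p)


def node_score(labels):
    """Impurity of one node; labels are 1 (in the tree's class) or 0."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return impurity(p)


def weighted_split_score(examples, feature):
    """Weighted sum of impurity over the two children of a binary split.

    examples is a non-empty list of (feature_set, label) pairs.
    """
    with_f = [label for feats, label in examples if feature in feats]
    without_f = [label for feats, label in examples if feature not in feats]
    total = len(examples)
    return (
        len(with_f) / total * node_score(with_f)
        + len(without_f) / total * node_score(without_f)
    )


def best_feature(examples, candidates):
    """Pick the candidate feature with the lowest weighted split score."""
    return min(candidates, key=lambda f: weighted_split_score(examples, f))
```

    On a toy training set in which a company designator separates the classes perfectly, best_feature would prefer that designator over a feature, such as capitalization, that leaves both child nodes mixed.</Paragraph>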
    <Paragraph position="7"> Trees typically contained 100 or more nodes.</Paragraph>
    <Paragraph position="8"> In order to work with another language, the following resources are needed: (1) pre-tagged training text in the new language using the same tags as before, (2) a tokenizer for non-token languages, (3) a POS tagger (plus translation of the tags to a standard POS convention), and (4) translation of designators and lexical (list-based) features.</Paragraph>
    <Paragraph position="9"> Figure 1 shows the working development system.</Paragraph>
    <Paragraph position="10"> The starting point is training text which has been pre-tagged with the locations of all proper names. The tokenizer separates punctuation from words. For non-token languages (no spaces between words), it also separates contiguous characters into constituent words. The POS tagger (Brill, 1992; Farwell et al., 1994; Matsumoto et al., 1992) attaches parts of speech. The set of derived features is attached. Names are delimited using a set of POS-based hand-coded templates. A decision tree is built based on the existing feature set and the specified level of context to be considered. The generated tree is applied to test data and scored. Hand analysis of results leads to the discovery of new features. The new features are added to the tokenized training text, and the process repeats.</Paragraph>
    <Paragraph position="11"> Language-specific modules are highlighted with bold borders. Feature translation draws on on-line resources, dictionaries, atlases, bilingual speakers, etc. The remainder is constant across languages: a language-independent core, and an optimally derived feature set for English. Parts of the development system that are executed by hand appear shaded. Everything else is automatic.</Paragraph>
  </Section>
</Paper>