<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0705">
  <Title>Applying Coreference to Improve Name Recognition</Title>
  <Section position="4" start_page="1" end_page="3" type="metho">
    <SectionTitle>
2 Baseline Systems
</SectionTitle>
    <Paragraph position="0"> The task we consider in this paper is to identify three classes of names in Chinese text: persons (PER), organizations (ORG), and geo-political entities (GPE). Geo-political entities are locations which have an associated government, such as cities, states, and countries.</Paragraph>
    <Paragraph position="1">  Name recognition in Chinese poses extra challenges because neither capitalization nor word segmentation clues are explicitly provided, although most of the techniques we describe are more generally applicable.</Paragraph>
    <Paragraph position="2"> Our study builds on an extraction system developed for the ACE evaluation, a multi-site evaluation of information extraction organized by the U.S. Government. Following ACE terminology, we will use the term mention to refer to a name or noun phrase of one of the types of interest, and the term entity for a set of coreferring mentions. We briefly describe in this section the baseline Chinese named entity tagger, as well as the coreference system, used in our experiments.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Chinese Name Tagger
</SectionTitle>
      <Paragraph position="0"> Our baseline name tagger consists of an HMM tagger augmented with a set of post-processing rules. The HMM tagger generally follows the NYMBLE model (Bikel et al, 1997), but with a larger number of states (12) to handle name prefixes and suffixes, and transliterated foreign names separately. It operates on the output of a word segmenter from Tsinghua University. It uses a trigram model with dynamic backoff. The post-processing rules correct some omissions and systematic errors using name lists (for example, a list of all Chinese last names; lists of organization and location suffixes) and particular contextual patterns (for example, verbs occurring with people's names). They also deal with abbreviations and nested organization names.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Chinese Coreference Resolver
</SectionTitle>
      <Paragraph position="0"> For this study we have used a rule-based coreference resolver. Table 1 lists the main rules and patterns used. We have extensive rules for name-name coreference, including rules specific to the particular name types. For these experiments, we do not attempt to resolve pronouns, and we only resolve names with nominals when the name and nominal appear in close proximity in a specific structure, as listed in Table 1.</Paragraph>
      <Paragraph position="1"> We have used the MUC coreference scoring metric (Vilain et al, 1995) to evaluate this resolver, excluding all pronouns and limiting ourselves to noun phrases of semantic type PER, ORG, and GPE. Using a perfect (hand-generated) set of mentions, we obtain a recall of 82.7% and precision of 95.1%, for an F score of 88.47%.</Paragraph>
      <Paragraph position="2">  This class is used in the U.S. Government's ACE evaluations; it excludes locations without governments, such as bodies of water and mountains.</Paragraph>
      <Paragraph position="3"> Using the mentions generated by our extraction system, we obtain a recall of 74.3%, a precision of 84.5%, and an F score of 79.07%.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="99" type="metho">
    <SectionTitle>
3 Confidence Measures
</SectionTitle>
    <Paragraph position="0"> In order to decide when we need to rely on global (coreference) information for name tagging, we want to have some assessment of the confidence that the name tagger has in individual tagging decisions. In this paper, we use two tools to reach this goal. The first method is to use three manually built proper name lists which include common names of each type (selected from the high frequency names in the user query blog of COMPASS, a Chinese search engine, and name lists provided by Linguistic Data Consortium; the PER list includes 147 names, the GPE list 226 names, and the ORG list 130 names). Names on these lists are accepted without further review.</Paragraph>
    <Paragraph position="1"> The second method is to have the HMM tagger compute a probability margin for the identification of a particular name as being of a particular type.</Paragraph>
    <Paragraph position="2"> Scheffer et al. (2001) used a similar method to identify good candidates for tagging in an active learner. During decoding, the HMM tagger seeks the path of maximal probability through the Viterbi lattice. Suppose we wish to evaluate the confidence with which words w</Paragraph>
    <Paragraph position="4"> are identified as a name of type T. We compute  is the maximum path probability and P  is the maximum probability among all paths for which some word in w</Paragraph>
    <Paragraph position="6"> is assigned a tag other than T.</Paragraph>
    <Paragraph position="7"> A large margin indicates greater confidence in the tag assignment. If we exclude names tagged with a margin below a threshold, we can increase the precision of name tagging at some cost in recall. Figure 1 shows the trade-off between margin threshold and name recognition performance. Names with a margin over 3.0 are accepted on this basis.</Paragraph>
    <Paragraph position="8">  In our scoring, we use the ACE keys and only score mentions which appear in both the key and system response. This therefore includes only mentions identified as being in the ACE semantic categories by both the key and the system response. Thus these scores cannot be directly compared against coreference scores involving all noun phrases.</Paragraph>
    <Paragraph position="10"> name recognition performance</Paragraph>
  </Section>
  <Section position="6" start_page="99" end_page="99" type="metho">
    <SectionTitle>
4 Distribution of Name Errors
</SectionTitle>
    <Paragraph position="0"> We consider now names which did not pass the confidence measure tests: names not on the common name list, which were tagged with a margin below the threshold. We counted the accuracy of these &amp;quot;obscure&amp;quot; names as a function of the number of mentions in an entity; the results are shown in Table 2.</Paragraph>
    <Paragraph position="1"> The table shows that the accuracy of name recognition increases as the entity includes more mentions. In other words, if a name has more coref-ed mentions, it is more likely to be correct. This also provides us a linguistic intuition: if people mention an obscure name in a text, they tend to emphasize it later by repeating the same name or describe it with nominal mentions.</Paragraph>
    <Paragraph position="2"> The table also indicates that the accuracy of single name entities (singletons) is much lower than the overall accuracy. So, although they constitute only about 10% of all names, increasing their accuracy can significantly improve overall performance. Coreference information can play a great role here. Take the 157 PER singletons as an example; 56% are incorrect names. Among these incorrect names, 73% actually belong to the other two name types. Many of these can be easily fixed by searching for coreference to other mentions without type restriction. Among the correct names, 71% can be confirmed by the presence of a title word or a Chinese last name. From these observations we can conclude that without strong confirmation features, singletons are much less likely to be correct names.</Paragraph>
  </Section>
  <Section position="7" start_page="99" end_page="99" type="metho">
    <SectionTitle>
5 Incorporating Coreference Information into Name Recognition
</SectionTitle>
    <Paragraph position="0"> We make use of several features of the coreference relations a name is involved in; the features are listed in Table 3. Using these features, we built an independent classifier to predict whether a name identified by the baseline name tagger is correct. (Note that this classifier is trained on all name mentions, but at test time only the 'obscure' names that failed the tests in section 3 are processed by it.) Each name corresponds to a feature vector consisting of the factors described in Table 3. The PER context words are generated from the context patterns described in (Ji and Luo, 2001). We used a Support Vector Machine to implement the classifier, because of its state-of-the-art performance and good generalization ability. We used a polynomial kernel of degree 3.</Paragraph>
  </Section>
  <Section position="8" start_page="99" end_page="99" type="metho">
    <SectionTitle>
6 Name Rules based on Coreference
</SectionTitle>
    <Paragraph position="0"> Besides the factors in the above statistical model, additional coreference information can be used to filter and in some cases correct the tagging produced by the HMM. We developed the following rules to correct names generated by the baseline tagger.</Paragraph>
    <Section position="1" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
6.1 Name Structure Errors
</SectionTitle>
      <Paragraph position="0"> Sometimes the Name tagger outputs names which are too short (incomplete) or too long. We can make use of the relation among mentions in the same entity to fix them. For example, nested ORGs are traditionally difficult to recognize correctly. Errors in ORG names can take the following forms:  (1) Head Missed. Examples: &amp;quot;Zhong Guo Yi Zhu (Tuan ) / Chinese Art (Group)&amp;quot;, &amp;quot;Zhong Guo Xue Sheng (Hui ) / Chinese Student (Union)&amp;quot;, &amp;quot;E Luo Si He Dong Li (Suo ) / Russian Nuclear Power (Instituition)&amp;quot; Rule 1: If an ORG name x is coref-ed with other mentions with head y (an ORG suffix), and in the original text x is immediately followed by y, then tag xy instead of x; otherwise discard x.</Paragraph>
      <Paragraph position="1"> (2) Modifier Missed. Rule 1 can also be used to restore missed modifiers. For example, &amp;quot;(Ai Ding Bao )Da Xue / (Edinburgh) University&amp;quot;; &amp;quot;(Peng Cheng )You Xian Gong Si / (Peng Cheng) Limited Corporation&amp;quot;, and some incomplete translated PER names such as &amp;quot;(Ba )Le Si Tan / (Pa)lestine&amp;quot;.</Paragraph>
      <Paragraph position="2"> (3) Name Too Long Rule 2: If a name x has no coref-ed mentions but part of it, x', is identical to a name in another entity y, and y includes at least two mentions; then tag x' instead of x.</Paragraph>
    </Section>
    <Section position="2" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
6.2 Name Type Errors
</SectionTitle>
      <Paragraph position="0"> Some names are mistakenly recognized as other name types. For example, the name tagger has difficulty in distinguishing transliterated PER name and transliterated GPE names.</Paragraph>
      <Paragraph position="1"> To solve this problem we designed the following rules based on the relation among entities.</Paragraph>
      <Paragraph position="3"> is recognized as type1, the entity it belongs to has only one mention; and</Paragraph>
      <Paragraph position="5"> is recognized as type2, the entity it belongs to has at least two mentions; and name</Paragraph>
      <Paragraph position="7"> correct type1 to type2.</Paragraph>
      <Paragraph position="8"> For example, if &amp;quot; Ke Li Mu Lin / Kremlin&amp;quot; is mistakenly identified as PER, while &amp;quot;Ke Li Mu Lin Gong / Kremlin Palace&amp;quot; is correctly identified as ORG, and in coreference results, &amp;quot;Ke Li Mu Lin / Kremlin&amp;quot; belongs to a singleton entity, while &amp;quot;Ke Li Mu Lin Gong / Kremlin Palace&amp;quot; has coref-ed mentions, then we correct the type of &amp;quot;Ke Li Mu Lin / Kremlin&amp;quot; to ORG. Another common mistake gives rise to the sequence &amp;quot;PER+title+PER&amp;quot;, because our name tagger uses the title word as an important context feature for a person name (either preceding or following the title). But this is an impossible structure in Chinese. We can also use coreference information to fix it.</Paragraph>
      <Paragraph position="9"> Rule 4: If &amp;quot;PER+title+PER&amp;quot; appears in the name tagger's output, then we discard the PER name with lower coref certainty; and check whether it is coref-ed to other mentions in a GPE entity or ORG entity; if it is, correct the type.</Paragraph>
      <Paragraph position="10"> Using this rule we can correctly identify &amp;quot;[Si Li</Paragraph>
    </Section>
    <Section position="3" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
6.3 Name Abbreviation Errors
</SectionTitle>
      <Paragraph position="0"> Name abbreviations are difficult to recognize correctly due to a lack of training data. Usually people adopt a separate list of abbreviations or design separate rules (Sun et al. 2002) to identify them. But many wrong abbreviation names might be produced. We find that coreference information helps to select abbreviations.</Paragraph>
      <Paragraph position="1"> Rule 5: If an abbreviation name has no coref-ed mentions and it is not adjacent to another abbreviation (ex. &amp;quot;Zhong /China Mei /America&amp;quot;), then we discard it.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="99" end_page="99" type="metho">
    <SectionTitle>
7 System Flow
</SectionTitle>
    <Paragraph position="0"> Combining all the methods presented above, the flow of our final system is shown in Figure 2:</Paragraph>
  </Section>
class="xml-element"></Paper>