Named Entity Recognition using an HMM-based Chunk Tagger

2 HMM-based Chunk Tagger

2.1 HMM Modeling

Given a token sequence $G_1^n = g_1 g_2 \cdots g_n$, the goal of NER is to find a stochastic optimal tag sequence $T_1^n = t_1 t_2 \cdots t_n$ that maximizes

$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \dfrac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)}$   (2-1)

The second term in (2-1) is the mutual information between $T_1^n$ and $G_1^n$. In order to simplify its computation, we assume mutual information independence:

$MI(T_1^n, G_1^n) = \sum_{i=1}^{n} MI(t_i, G_1^n)$   (2-2)

that is,

$\log \dfrac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)} = \sum_{i=1}^{n} \log \dfrac{P(t_i, G_1^n)}{P(t_i)\,P(G_1^n)}$   (2-3)

Applying assumption (2-2) to equation (2-1), we have:

$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n)$   (2-4)

The basic premise of this model is to consider the raw text, encountered when decoding, as though it had passed through a noisy channel in which it had originally been marked with NE tags. The job of our generative model is to directly generate the original NE tags from the output words of the noisy channel. Our generative model is therefore the reverse of the generative model in the traditional HMM used in BBN's IdentiFinder, which models the original process that generates the NE-class annotated words from the original NE tags. Another difference is that our model assumes mutual information independence (2-2) while the traditional HMM assumes conditional probability independence (I-1). Assumption (2-2) is much looser than assumption (I-1), because assumption (I-1) has the same effect as the sum of assumptions (2-2) and (I-3). In this way, our model can apply more context information to determine the tag of the current token.

From equation (2-4), we can see that: 1) the first term can be computed by applying chain rules; in ngram modeling, each tag is assumed to be probabilistically dependent on the N-1 previous tags; 2) the second term is the summation of the log probabilities of all the individual tags; 3) the third term corresponds to the "lexical" component of the tagger.

We will not discuss the first and second terms further in this paper. This paper focuses on the third term, $\sum_{i=1}^{n} \log P(t_i \mid G_1^n)$, which is the main difference between our tagger and traditional HMM-based taggers such as the one used in BBN's IdentiFinder. Ideally, it could be estimated by recursively applying the forward-backward algorithm to a 1st-order [Rabiner89] or 2nd-order [Watson+92] HMM. However, an alternative back-off modeling approach is applied instead in this paper (more details in section 4).

2.2 HMM-based Chunk Tagger

In the traditional HMM, the optimal tag sequence is found as

$\arg\max_{T_1^n} \log P(T_1^n \mid G_1^n) = \arg\max_{T_1^n} \big( \log P(G_1^n \mid T_1^n) + \log P(T_1^n) \big)$   (I-2)

We can obtain equation (I-2) from (2-4) by assuming $P(t_i \mid G_1^n) = P(t_i \mid g_i)$.

Here, a token is denoted as an ordered pair of a word-feature and the word itself, $g_i = \langle f_i, w_i \rangle$, where $W_1^n = w_1 w_2 \cdots w_n$ is the word sequence and $F_1^n = f_1 f_2 \cdots f_n$ is the word-feature sequence. In the meantime, the NE-chunk tag $t_i$ is structural and consists of three parts (a minimal decoding sketch is given at the end of this subsection):

1) Boundary Category: BC = {0, 1, 2, 3}. Here 0 means that the current word is a whole entity by itself, and 1/2/3 mean that the current word is at the beginning/in the middle/at the end of an entity.

2) Entity Category: EC. This is used to denote the class of the entity name.

3) Word Feature: WF. Because of the limited number of boundary and entity categories, the word feature is added to the structural tag to obtain more accurate models.

Obviously, there exist some constraints between $t_{i-1}$ and $t_i$ on the boundary and entity categories, as shown in Table 1, where "valid"/"invalid" means that the tag sequence $t_{i-1} t_i$ is valid/invalid.
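The structural chunk tag can be made concrete with a short sketch. The following Python fragment is not from the paper: the function name, the PER/ORG labels and the treatment of non-entity tokens are illustrative assumptions. It uses only the BC definitions above (0 = whole entity, 1 = begin, 2 = middle, 3 = end) to turn a decoded (BC, EC) sequence back into entity spans; the validity constraints of Table 1 are not reproduced here.

```python
from typing import List, Optional, Tuple

# Illustrative sketch only (not the authors' code): recover entity spans from a
# decoded sequence of structural chunk tags, using the boundary categories
# BC = {0: whole entity, 1: begin, 2: middle, 3: end} and the entity category EC.
def chunk_tags_to_entities(
    tags: List[Tuple[Optional[int], Optional[str]]]
) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity_category) spans, indices inclusive."""
    entities = []
    start, current_ec = None, None
    for i, (bc, ec) in enumerate(tags):
        if bc == 0:                                # single-word entity
            entities.append((i, i, ec))
            start, current_ec = None, None
        elif bc == 1:                              # entity starts here
            start, current_ec = i, ec
        elif bc == 3 and start is not None:        # entity ends here
            entities.append((start, i, current_ec))
            start, current_ec = None, None
        # bc == 2 (middle) or a non-entity tag: nothing to open or close
    return entities

# "John Smith works for World Bank"; None marks tokens outside any entity.
tags = [(1, "PER"), (3, "PER"), (None, None), (None, None), (1, "ORG"), (3, "ORG")]
print(chunk_tags_to_entities(tags))   # [(0, 1, 'PER'), (4, 5, 'ORG')]
```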
3 Determining Word Feature

As stated above, a token is denoted as an ordered pair of a word-feature and the word itself: $g_i = \langle f_i, w_i \rangle$. Here, the word-feature is a simple deterministic computation performed on the word and/or word string, with appropriate consideration of context as looked up in the lexicon or added to the context. In our model, each word-feature consists of several sub-features, which can be classified into internal sub-features and external sub-features. The internal sub-features are found within the word and/or word string itself to capture internal evidence, while the external sub-features are derived from the context to capture external evidence.

3.1 Internal Sub-Features

Our model captures three types of internal sub-features (a sketch of the basic one is given at the end of this subsection):

1) $f^1$ is the basic sub-feature exploited in this model, shown in Table 2 in descending order of priority. For example, among non-disjoint feature classes such as ContainsDigitAndAlpha and ContainsDigitAndDash, the former takes precedence. The first eleven features arise from the need to distinguish and annotate monetary amounts, percentages, times and dates. The remaining features distinguish types of capitalization and all other words such as punctuation marks. In particular, the FirstWord feature arises from the fact that if a word is capitalized and is the first word of the sentence, we have no good information as to why it is capitalized (note that AllCaps and CapPeriod are computed before FirstWord and take precedence). This sub-feature is language dependent. Fortunately, the feature computation is an extremely small part of the implementation. This kind of internal sub-feature has been widely used in machine-learning systems such as BBN's IdentiFinder and New York Univ.'s MENE. The rationale behind this sub-feature is clear: a) capitalization gives good evidence of NEs in Roman languages; b) numeric symbols can automatically be grouped into categories.

2) $f^2$ is the semantic classification of important triggers, as seen in Table 3, and is unique to our system. It is based on the intuition that important triggers are useful for NER and can be classified according to their semantics. This sub-feature applies to both single words and multiple words. The set of triggers is collected semi-automatically from the NEs and their local context in the training data.

3) $f^3$, as shown in Table 4, is the internal gazetteer feature, gathered from look-up gazetteers: lists of names of persons, organizations, locations and other kinds of named entities. This sub-feature is determined by finding a match in the gazetteer of the corresponding NE type, where n (in Table 4) is the number of words in the matched word string. Instead of collecting gazetteer lists from the training data, we collect a list of 20 public holidays in several countries, a list of 5,000 locations from websites such as GeoHive, and a list of 10,000 famous people from websites such as Scope Systems. Gazetteers have been widely used in NER systems to improve performance.
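Because Table 2 itself is not reproduced in this extract, the following Python sketch only illustrates the kind of deterministic, priority-ordered computation that produces $f^1$. Apart from ContainsDigitAndAlpha, ContainsDigitAndDash, AllCaps, CapPeriod and FirstWord, which the text names explicitly, the class names (OtherContainsDigit, InitialCap, LowerCase, Other) are assumptions, and the real system distinguishes more digit-related classes for dates, times, monetary amounts and percentages.

```python
import re

# Illustrative sketch of the basic sub-feature f1: more specific classes are
# tested before more general ones, mirroring the priority ordering of Table 2.
def basic_feature(word: str, is_first_word: bool) -> str:
    """Return a coarse word-feature class for a single token."""
    if any(c.isdigit() for c in word):
        if any(c.isalpha() for c in word):
            return "ContainsDigitAndAlpha"   # e.g. "A8956-67"
        if "-" in word:
            return "ContainsDigitAndDash"    # e.g. "09-96"
        return "OtherContainsDigit"          # stand-in for the date/time/money/percent classes
    if re.fullmatch(r"[A-Z]\.", word):
        return "CapPeriod"                   # e.g. "M."
    if len(word) > 1 and word.isupper():
        return "AllCaps"                     # e.g. "BBN"
    if word[:1].isupper():
        # AllCaps and CapPeriod are tested before FirstWord, as noted above.
        return "FirstWord" if is_first_word else "InitialCap"
    if word.isalpha():
        return "LowerCase"
    return "Other"                           # punctuation marks etc.

print(basic_feature("BBN", False), basic_feature("Clinton", True))  # AllCaps FirstWord
```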
3.2 External Sub-Features

For external evidence, only one external macro context feature, $f^4$, as shown in Table 5, is captured in our model.

$f^4$ records whether and how an encountered NE candidate occurs in the list of NEs already recognized from the document, as shown in Table 5 (n is the number of words in the matched NE from the recognized NE list, and m is the number of words matched between the word string and that NE of the corresponding NE type). This sub-feature is unique to our system. The intuition behind it is the phenomenon of name aliases.

During decoding, the NEs already recognized from the document are stored in a list. When the system encounters an NE candidate, a name alias algorithm is invoked to dynamically determine its relationship with the NEs in the recognized list. Initially, we also considered a part-of-speech (POS) sub-feature. However, the experimental result was disappointing: incorporating POS even decreased performance by 2%. This may be because the capitalization information of a word is submerged among its several possible POS tags, and because the performance of POS tagging is not satisfactory, especially for unknown capitalized words (many NEs include unknown capitalized words). Therefore, POS is discarded.

Table 5. $f^4$: the External Macro Context Feature (L means Local document)

4 Back-off Modeling

Given the model in section 2 and the word feature in section 3, the main problem is how to obtain sufficient training data for every event whose conditional probability we wish to calculate. Unfortunately, there is rarely enough training data to compute accurate probabilities when decoding on new data, especially considering the complex word feature described above. In order to resolve the sparseness problem, two levels of back-off modeling (illustrated by the generic sketch at the end of this section) are applied to approximate $P(t_i \mid G_1^n)$: first over different combinations of the four sub-features described in section 3, and then $f^k$ is approximated in descending order of ...
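The back-off idea can be illustrated with a generic sketch. This is not the paper's exact scheme (the extract breaks off before the concrete back-off chain and its weighting are specified): the class and parameter names below (BackoffEstimator, min_count, the example contexts) are assumptions, and a real implementation would also smooth and weight the back-off levels rather than simply switching between them.

```python
from collections import Counter, defaultdict

# Generic sketch of back-off estimation: estimate P(t_i | context) from the most
# specific context that was seen often enough in training, otherwise back off to
# progressively coarser contexts.
class BackoffEstimator:
    def __init__(self, backoff_chain, min_count=5):
        # backoff_chain: list of functions, most specific first, each mapping a
        # full context to a coarser key (e.g. dropping some sub-features).
        self.backoff_chain = backoff_chain
        self.min_count = min_count
        self.tag_counts = [defaultdict(Counter) for _ in backoff_chain]

    def train(self, data):
        """data: iterable of (context, tag) pairs from the training corpus."""
        for context, tag in data:
            for level, project in enumerate(self.backoff_chain):
                self.tag_counts[level][project(context)][tag] += 1

    def prob(self, tag, context):
        """P(tag | context), taken from the first sufficiently observed level."""
        for level, project in enumerate(self.backoff_chain):
            counts = self.tag_counts[level][project(context)]
            total = sum(counts.values())
            if total >= self.min_count:
                return counts[tag] / total
        return 0.0   # unseen even at the coarsest level

# Example chain: back off from the (word-feature, word) pair to the word-feature alone.
chain = [lambda c: (c["f"], c["w"]), lambda c: c["f"]]
est = BackoffEstimator(chain, min_count=1)
est.train([({"f": "InitialCap", "w": "Bank"}, "ORG")])
print(est.prob("ORG", {"f": "InitialCap", "w": "Bank"}))   # 1.0
```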