<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3318">
  <Title>Recognizing Nested Named Entities in GENIA corpus</Title>
  <Section position="3" start_page="0" end_page="112" type="metho">
    <SectionTitle>
2 Methodology
</SectionTitle>
    <Paragraph position="0"> We use SVM-light (http://svmlight.joachims.org/) to train a binary classifier on the GENIA corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="112" type="sub_section">
      <SectionTitle>
2.1 Data Set
</SectionTitle>
      <Paragraph position="0"> The GENIA corpus (version 3.02) contains 97876 named entities (35947 distinct) of 36 types, and 490941 tokens (19883 distinct). There are 16672 nested entities, that is, entities that contain others or are nested in others (the maximum nesting depth is four). Among the outmost entities, 2342 are proteins and 1849 are DNAs, while 9298 proteins and 1452 DNAs are embedded in other entities.</Paragraph>
    </Section>
    <Section position="2" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
2.2 Features and Class Label
</SectionTitle>
      <Paragraph position="0"> For each token, we generate four types of features, reflecting its characteristics in terms of orthography, part-of-speech, morphology, and special nouns. We also use a (-2, +2) window, i.e., the two preceding and the two following tokens, as its context.</Paragraph>
      <Paragraph position="1"> For each token, we use two schemes to set the class label: outmost labeling and inner labeling. In outmost labeling, a token is labeled +1 if the outmost entity containing it is of the target entity type, while in inner labeling, a token is labeled +1 if any entity containing it is of the target entity type.</Paragraph>
      <Paragraph position="2"> Otherwise, the token is labeled -1.</Paragraph>
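      <Paragraph position="3"> The two labeling schemes can be illustrated with a minimal sketch. It is not the original implementation: the representation of an entity as a (start, end, type) token span with an exclusive end, the function names, and the four-token example are all assumptions made for illustration. The sketch also assumes that entities are properly nested, so that the widest span covering a token is its outmost entity.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end_exclusive, entity_type) over token positions.
Entity = Tuple[int, int, str]


def covering(entities: List[Entity], i: int) -> List[Entity]:
    """Return the entities whose token span contains position i."""
    return [e for e in entities if i >= e[0] and e[1] > i]


def outmost_labels(n_tokens: int, entities: List[Entity], target: str) -> List[int]:
    """Label +1 if the outmost entity covering the token has the target type, else -1."""
    labels = []
    for i in range(n_tokens):
        # Assumes proper nesting: the widest covering span is the outmost entity.
        outmost = max(covering(entities, i), key=lambda e: e[1] - e[0], default=None)
        labels.append(1 if outmost is not None and outmost[2] == target else -1)
    return labels


def inner_labels(n_tokens: int, entities: List[Entity], target: str) -> List[int]:
    """Label +1 if any entity covering the token has the target type, else -1."""
    return [
        1 if any(e[2] == target for e in covering(entities, i)) else -1
        for i in range(n_tokens)
    ]


# Made-up example: a DNA entity spanning four tokens with a protein nested at token 1.
example_entities = [(0, 4, "DNA"), (1, 2, "protein")]
protein_outmost = outmost_labels(4, example_entities, "protein")
protein_inner = inner_labels(4, example_entities, "protein")
```

In this example the nested protein is invisible to outmost labeling (all four tokens receive -1), whereas inner labeling marks token 1 as +1.</Paragraph>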
    </Section>
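    <Paragraph position="3"> As a rough illustration of how the (-2, +2) context window of Section 2.2 could be turned into SVM-light training instances, consider the sketch below. The feature names (such as init_cap), the use of binary feature values, and both helper functions are assumptions for illustration only. The one part taken from SVM-light itself is the input layout: a target value followed by feature:value pairs whose integer feature ids appear in increasing order.

```python
from typing import Dict, List


def window_features(token_feats: List[List[str]], i: int, size: int = 2) -> List[str]:
    """Collect the features of token i and of its neighbours at offsets -size..+size."""
    feats = []
    for offset in range(-size, size + 1):
        j = i + offset
        # Skip window positions that fall outside the token sequence.
        if j >= 0 and len(token_feats) > j:
            feats.extend(f"{offset:+d}:{name}" for name in token_feats[j])
    return feats


def to_svmlight_line(label: int, feats: List[str], index: Dict[str, int]) -> str:
    """Format one instance as 'label id:1 id:1 ...' with ids in increasing order."""
    # Unseen feature names receive the next free id, starting from 1.
    ids = sorted({index.setdefault(name, len(index) + 1) for name in feats})
    return f"{label:+d} " + " ".join(f"{k}:1" for k in ids)


# Hypothetical per-token features for a three-token sequence.
token_feats = [["init_cap", "pos=NN"], ["has_digit", "pos=NN"], ["pos=NN"]]
feature_index: Dict[str, int] = {}
line = to_svmlight_line(1, window_features(token_feats, 0), feature_index)
```

Prefixing each feature with its window offset keeps the same property at different positions distinct, which is one simple way of encoding context for a linear classifier.</Paragraph>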
  </Section>
  <Section position="4" start_page="112" end_page="112" type="metho">
    <SectionTitle>
3 Experiment And Discussion
</SectionTitle>
    <Paragraph position="0"> We report our preliminary experimental results on recognizing nested protein and DNA entities. For each target entity type (e.g., protein) and each labeling scheme, we obtain a data set containing 490941 instances, one per token. We run 5-fold cross-validation and measure precision, recall, and F-score (P/R/F) for exact match and for left and right boundary match, with respect to outmost and inner entities respectively. The results are shown in Table 1 and Table 2.</Paragraph>
    <Paragraph position="1"> From the tables, we can see that while the outmost labeling works (slightly) better for the outmost entities, the inner labeling works better for the inner entities. This result seems reasonable in that each labeling scheme tends to introduce more entities of its type into the training set.</Paragraph>
    <Paragraph position="2"> It is interesting to see that the inner labeling works much better in identifying inner proteins than in identifying inner DNAs. A possible reason is that there are about three times more inner proteins than outmost ones, while the numbers of inner DNAs and outmost DNAs are roughly the same (see Section 2.1).</Paragraph>
    <Paragraph position="3"> Another observation is that the inner labeling gains significantly (over the outmost labeling) on the inner entities, compared to its loss on the outmost entities. We are not sure whether this is a general trend for other types of entities, and if so, what causes it. We will address this issue in future work.</Paragraph>
    <Paragraph position="4"> We hope these results can help in recognizing nested NEs and also attract more attention to the nested NE problem. We plan to extend this study by looking into more related issues.</Paragraph>
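    <Paragraph position="5"> For reference, the exact match and boundary match measures used in this section can be sketched as follows. The precise matching rules are not spelled out in the text, so this is one plausible reading rather than the scoring procedure actually used: a predicted entity counts as a left (right) boundary match if some gold entity shares its start (end) position. The function name prf and the example spans are hypothetical.

```python
from typing import Callable, Hashable, Set, Tuple

# Hypothetical representation: (start, end_exclusive) token positions of an entity.
Span = Tuple[int, int]


def prf(gold: Set[Span], pred: Set[Span],
        key: Callable[[Span], Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F-score, comparing spans through the given key."""
    gold_keys = {key(s) for s in gold}
    pred_keys = {key(s) for s in pred}
    # Precision: share of predicted spans matched by some gold span.
    p = sum(key(s) in gold_keys for s in pred) / len(pred) if pred else 0.0
    # Recall: share of gold spans matched by some predicted span.
    r = sum(key(s) in pred_keys for s in gold) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


# Made-up example: the second prediction has the correct start but the wrong end.
gold_spans = {(0, 2), (3, 5)}
pred_spans = {(0, 2), (3, 4)}
exact = prf(gold_spans, pred_spans, key=lambda s: s)
left = prf(gold_spans, pred_spans, key=lambda s: s[0])
right = prf(gold_spans, pred_spans, key=lambda s: s[1])
```

Under this reading, a prediction with one wrong boundary is penalized by exact match but still credited by the measure for the boundary it got right, which is why boundary match scores are never lower than exact match scores.</Paragraph>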
  </Section>
</Paper>