<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0904">
  <Title>Comparison between Tagged Corpora for the Named Entity Task</Title>
  <Section position="5" start_page="20" end_page="20" type="metho">
    <SectionTitle>
3 Corpora
</SectionTitle>
    <Paragraph position="0"> We used two corpora in our experiments representing two popular domains in IE, molecular-biology (from MEDLINE) and newswire texts (from MUC-6). These are now described.</Paragraph>
    <Section position="1" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.1 MUC-6
</SectionTitle>
      <Paragraph position="0"> The corpus for MUC-6 (MUC, 1995) contains 60 articles, from the test corpus for the dry and formalruns. An example canbe seenin Figure 1. We can see several interesting features of the domain such as the focus of NF.,s on people and organization profiles. Moreover we see that there are many pre-name clue words such as &amp;quot;Ms.&amp;quot; or &amp;quot;Rep.&amp;quot; indicating that a Republican politician's name should follow.</Paragraph>
    </Section>
    <Section position="2" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.2 Biology
</SectionTitle>
      <Paragraph position="0"> In our tests in the domain of molecular-biology we are using abstracts available from PubMed's MEDLIhrE. The MEDLINE database is an online collection of abstracts for published journal articles in biology and medicine and contains more than nine million articles. Currently we have extracted a subset of MEDLINE based on a search using the keywords human AND blood cell AND transcription .factor yielding about 3650 abstracts.</Paragraph>
      <Paragraph position="1"> Of these 100 docmnents were NE tagged for our experiments using a human domain expert. An example of the annotated abstracts is shown in Figure 2. In contrast to MUC-6 each article is quite short and there are few pre-class clue words making the task much more like terminology identification and classification than pure name finding. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="20" end_page="22" type="metho">
    <SectionTitle>
4 A first attempt at corpus comparison based on simple token frequency
</SectionTitle>
    <Paragraph position="0"> A simple and intuitive approach to comparing NE task difficulty, used in previous studies such as (Palmer and Day, 1997), which examined corpora in six different languages, compares class-to-term token ratios on the assumption that rarer classes are more difficult to acquire. The relative frequency counts from these ratios also give an indirect measure of the granularity of a class, i.e. how wide it is. While this is appealing, we show that this approach does not necessarily give the best metric for comparison.</Paragraph>
    <Paragraph position="1"> Tables 2 and 3 show the ratio of the number of different words used in NEs to the total number of words in the NE class vocabulary. The number of different tokens is influenced by the corpus size and is not a suitable index that can uniformly show the difficulty for different NE tasks, therefore it should be normalized. Here we use words as tokens. A value close to zero indicates little variation within the class and should imply that the class is easier to acquire. We see that the NEs in the biology domain seem overall to be easier to acquire than those in the MUC-6 domain given hxical variation.</Paragraph>
    <Paragraph position="2"> The figures in the second columns of Tables 2 and 3 are normalized so that all numerals are replaced by a single token. It still seems though that MUC-6 is a considerably more eheJlenging domain than biology. This is despite the fact that the ratios for ENAMEX expressions such as Date,  A graduate of &lt;ENAMEX TYPE=&amp;quot; ORGANIZATION&amp;quot; &gt;Harvard Law SChooI&lt;/ENAMEX&gt;, Ms. &lt;ENAMEX TYPE=&amp;quot;PERSON'&gt;Washington&lt;/ENAMEX&gt; worked as a laywer for the corporate finance division of the &lt;ENAMEX TYPE='ORGANIZATION~&gt;SEC&lt;/ENAMEX&gt; in the late &lt;TIMEX  &lt;PROTEIN&gt;SOX-4&lt;/PROTEIN&gt;, an &lt;PROTEIN&gt;Sty-like HMG box protein&lt;/PROTEIN&gt;, is a transcriptional activator in &lt;SOLrRCE.cell-type&gt;lymphocytes&lt;/SOUl:tCE&gt;. Previous studies in &lt;SOURCE.cell-type&gt;lymphocytes&lt;/SOUB.CE&gt; have described two DNA-binding &lt;PROTEIN&gt;HMG bax proteins&lt;/PROTEIN&gt;, &lt;PROTEIN&gt;TCF-I&lt;/PROTEIN&gt; and &lt;PROTEIN&gt;LEF-I&lt;/PROTEIN&gt;, with affinity for the &lt;DNA&gt;A/TA/TCAAAG motif&lt;/DNA&gt; found in several &lt;SOURCE.cell-type&gt;T cell&lt;/SOUl~CE&gt;-specific enhancers. Evaluation of cotransfection experiments in &lt;SOURCE.cell-type&gt;non-T cells&lt;/SOURCE&gt; and the observed inactivity of an &lt;DNA&gt;AACAAAG concatamer&lt;/DNA&gt; in the &lt;PROTEIN&gt;TCF-1 &lt;/PROTEIN&gt; / &lt;PROTEIN&gt;LEF-1 &lt;/PROTEIN&gt;-expressing &lt;SOURCE.cell-line&gt;T cell line BW5147&lt;/SOURCE&gt;, led us to conclude that these two proteins did not mediate the observed enhancer effect.</Paragraph>
    <Paragraph position="3">  sions in the Time class are so rare however that it is di~cult to make any sort of meaningftfl comparison. In the biology corpus, the ratios are not significantly changed and the NE classes defined for biology documents seem to have the same chuj-acteristics as non-numeric ENAMEX classes in MUCC-6 documents.</Paragraph>
    <Paragraph position="4"> Comparing between the biology documents and the MUC-6 documents, we may say that identifying entities in biology docmnents is easier than identifying ENAMEX entities in MUC-6 documents. null</Paragraph>
  </Section>
  <Section position="7" start_page="22" end_page="97" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated the performance of our two systems using a cross validation method. For the MUC-6 corpus, 6-fold cross validation was performed on the 60 texts and 5-fold cross validation was performed for the 100 texts in the biology corpus.</Paragraph>
    <Paragraph position="1"> Norm. numerals</Paragraph>
    <Paragraph position="3"> We use &amp;quot;F-scores ~ for evaluation of our experiments (Van Rijsbergen, 1979). &amp;quot;F-score&amp;quot; is a measurement combining &amp;quot;Recall&amp;quot; and &amp;quot;Predsion&amp;quot; and defined in Equation 3. &amp;quot;Recall&amp;quot; is the percentage of answers proposed by the system that correspond to those in the human-made key set. &amp;quot;Precision&amp;quot; is the percentage of correct answers among the answers proposed by the system. The F-scores presented here are automatically calculated using a scoring program (Chinchor, 1995).</Paragraph>
    <Paragraph position="5"> In Table 4 we show the actual performance of our term recognition systems, NE-DT and NEHMM. We can see that corpus comparisons based only on class-token ratios are inadequate to explain why both systems' performance was about the same in both domains or why NEHMM did better in both test corpora than NE-DT. The difference in performance is despite there being more training examples in biology (3301 NEs) than in MUC-6 (2182 NEs). Part of the reason for this is  that the class-token ratios ignore individual system knowledge, i.e. the types of features that can be captured and useful in the corpus domain. Among other considerations they also fail to consider the overlap of words and features between classes in the same corpus domain.</Paragraph>
  </Section>
  <Section position="8" start_page="97" end_page="97" type="metho">
    <SectionTitle>
6 Corpus comparison based on information theoretical measures
</SectionTitle>
    <Paragraph position="0"> In this section we attempt to present measures that overcome some of the limitations of the class-token method. We evaluate the contribution from each feature used in our NE recognition systems by calculating its entropy. There are three types of feature information used by our two systems: lexical information, character type information, and part-of-speech information.</Paragraph>
    <Paragraph position="1"> The entropy for NE classes H(C) is defined by</Paragraph>
    <Paragraph position="3"> n(c): the number of words in class c N: the total number of words in text We can calculate the entropy for features in the same way.</Paragraph>
    <Paragraph position="4"> When a feature F is given, the conditional entropy for NE classes H(CIF) is defined by</Paragraph>
    <Paragraph position="6"> n(c, f): the number of words in class c with the feature value f n(/): the number of words with the feature value f Using these entropies, we can calculate information gain (Breiman et al., 1984) and gain ratio (Quinlan, 1990). Information gain for NE classes and a feature I(C; F) is given as follows:</Paragraph>
    <Paragraph position="8"> The information gain I(C; F) shows how the feature F is related with NE classes C. When F is completely independent of C, the value of I(C; F) becomes the minimum value O. The maximum value of I(C;_F) is equivalent to that of H(C), when the feature F gives sufficient information to recognize named entities. Information gain can also be calculated by:</Paragraph>
    <Paragraph position="10"> We show the values of the above three entropies in Table 5,6, and 7. In these tables, F is replaced with single letters which represent each of the model's features, i.e. character types (T), part-of-speech (P), and hxical information (W).</Paragraph>
    <Paragraph position="11"> Gain ratio is the normalized value of in.formation gain. The gain ratio GR(C; F) is defined by</Paragraph>
    <Paragraph position="13"> The range of the gain ratio GR(C; F) is 0 &lt; GR(C; F) _~ 1 even when the class entropy is different in various corpora, so we can compare the values directly in the different NE recognition tasks.</Paragraph>
    <Section position="1" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
6.1 Character types
</SectionTitle>
      <Paragraph position="0"> Character type features are used to identify named entities in the MUCC-6 and biology corpus.</Paragraph>
      <Paragraph position="1"> However, the distribution of the character types are quite different between these two types of documents as we can see in Table 5. We see through the gain-ratio score that character type information has a greater predictive power for classes in MUC~ than biology due to the higher entropy of character type and class sequences in the biology corpus, i.e. the greater disorder of this information. The result partially shows why identification and classification is harder in biological documents than in newspaper articles such as the MUC-6 corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
6.2 Part-of-speech
</SectionTitle>
      <Paragraph position="0"> Table 6 shows the entropy scores for part-of-speech (POS) sequences in the two corpora. We see through the gain ratio scores that POS information is not so powerful for acquiring NEs in the biology domain compared to the MUC-6 domain.</Paragraph>
      <Paragraph position="1">  In fact POS information for biology is far less useful than character information when we compare the results in Tables 5 and 6, whereas POS has about the same predictive power as character information in the MUC-6 domain. One likely explanation for this is that the POS tagger we use in NE-DT is trained on a corpus based on newspaper articles, therefore the assigned POS tags are often incorrect in biology documents.</Paragraph>
    </Section>
    <Section position="3" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
6.3 Lexical information
</SectionTitle>
      <Paragraph position="0"> Table 7 shows the entropy statistics for the two domains. Although entropy for words in biology is lower than MUC-6, the entropy for classes is higher leading to a lower gain ratio in biology. We also note that, as we would expect, in comparison to the other two types of knowledge, surface word forms are by far the most useful type of knowledge with a gain ratio in MUC-6 of 0.897 compared to 0.479 for POS and 0.478 for character types in the same domain. However, such knowledge is also the least generalizable and runs the risk of datasparseness. It therefore has to be complemented by more generalizable knowledge such as character</Paragraph>
    </Section>
    <Section position="4" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
6.4 Comparison between the combination of features
</SectionTitle>
      <Paragraph position="0"> In this section we show a comparison of the gain ratio for the feature sets used by both systems in each corpus. Values of the gain ratio for each feature set are shown in the 'GR' column of Tables 8, 9, 10 and 11 (see footnote 1 below). The values of GR show that surface words make the best contribution in both corpora for both systems. We can see that the gain ratio for all features in NE-DT is actually lower than that of the top-level model in NE-HMM in biology, reflecting the actual system performance that we observed.</Paragraph>
      <Paragraph position="1"> We also see that in the biology corpus, the combination of all features in NE-DT has a lower contribution than in the MUC-6 corpus. This indicates the limitation of the current feature set for the biology corpus and shows that we need to utilize other types of features in this domain.</Paragraph>
      <Paragraph position="2"> Values for cross entropy between training and test sets are shown in Tables 8, 9, 10 and 11 to-IOn the 'Features' col, mn~ &amp;quot;(Features) for A#&amp;quot; means the features used in each HMM sub-model which corresponds with the A# in Ecluation 2. And also, 'ALL' in Tables 10 and 11 means all the features used in decision tree, i.e.</Paragraph>
      <Paragraph position="3"> {P~-l,~,,+l,F~-l,t,t+l,W,-1,~,~+l).</Paragraph>
      <Paragraph position="4"> Table 10: Values of Entropy for NE-DT features in the MUC-6 corpus  gether with error bounds in parentheses. These values are calculated for pairs of an NE class and features, and averaged for the n-fold experiments. In the MUC-6 corpus, 60 texts are separated into 6 subsets, and one of them is used as the test set and the others are put together to form a training set. Similarly, 100 texts are separated into 5 subsets in the biology corpus. We also show the coverage of the pairs on the 'Coverage' col,,mn.</Paragraph>
      <Paragraph position="5"> Coverage means that how many pairs which appeared in a test set also appear in a trainlug set. In these columns, the greater the cross entropy between features and a class, the more different their occurrences between tr~iuing and test sets.</Paragraph>
      <Paragraph position="6"> On the other hand, as the coverage for classfeatures pairs increases, so does the part of the test set that is covered with the given feature set. The results in both corpora for both systems show a drawback of surface words, since their coverage for a test set is lower than that of features like POSs and character types in both corpora Also, the coverage of surface words in the biology corpus is higher than in the MUC6 corpus as opposed to other features. The result matches our intuition that vocabulary in the biology corpus is relatively restricted but has a variety of types other than normal English words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>