<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1118">
  <Title>Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition</Title>
  <Section position="10" start_page="156" end_page="157" type="evalu">
    <SectionTitle>
8 RESULTS
</SectionTitle>
    <Paragraph position="0"> MENE's maximum entropy training algorithm gives it reasonable performance with moderate-sized training corpora or few information sources, while allowing it to shine when more training data and information sources are added. Table 2 shows MENE's performance on the MUC-7 &quot;dry run&quot; corpus, which consisted of 25 articles, mostly on the topic of aviation disasters. All systems shown were trained on 350 articles in the same domain (this training corpus consisted of about 270,000 words, which our system turned into 321,000 tokens).</Paragraph>
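As a rough sketch of the combination idea (not MENE's actual feature set or tag scheme, which are far richer), each external system's tag vote can be treated as a binary feature of a conditional maximum entropy model; for a binary outcome such a model reduces to logistic regression, trainable by gradient ascent on the log-likelihood. All data below is invented for illustration:

```python
import math

def train_maxent(examples, labels, lr=0.5, epochs=200):
    """Fit weights for p(y=1|x) = sigmoid(w.x) by stochastic gradient ascent."""
    n_feats = len(examples[0])
    w = [0.0] * n_feats
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            # Gradient of the log-likelihood for this example is (y - p) * x
            for i in range(n_feats):
                w[i] += lr * (y - p) * x[i]
    return w

# Toy features: [bias, system_A_vote, system_B_vote]; label = gold tag.
# Here system A happens to agree with the gold tag, system B is noisy.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
Y = [1, 1, 0, 0, 1, 0]
w = train_maxent(X, Y)

def predict(x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

print(predict([1, 1, 1]) > 0.5, predict([1, 0, 0]) < 0.5)  # prints: True True
```

The trained weights end up reflecting each system's reliability on the training data, which is the mechanism by which a stronger combined system can emerge from weaker individual ones.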
    <Paragraph position="1"> Note the smooth progression of the scores as more data is added to the system. Also note that, when combined under MENE, the three weakest systems, MENE, Proteus, and Manitoba, outperform the strongest single system, IsoQuest's. Finally, the top score of 97.12 from combining all three systems is a very strong result. On a different data set, the MUC-7 formal run data, the accuracy of the two human taggers who prepared the answer key was tested: one had an F-measure of 96.95 and the other 97.60 (Marsh and Perzanowski, 1998). Although we don't have human performance measures on the dry run test set, it appears that we have attained a result at least competitive with that of a human.</Paragraph>
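The F-measures quoted throughout this section are the harmonic mean of precision and recall. A minimal sketch of the computation (the entity counts below are made up for illustration, not taken from any MUC evaluation):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 for the balanced F)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 950 entities correct out of 980 system responses,
# against 970 entities in the answer key.
precision = 950 / 980
recall = 950 / 970
print(round(100 * f_measure(precision, recall), 2))  # prints: 97.44
```

Because the harmonic mean punishes imbalance, a system cannot reach scores in the 97 range without both precision and recall being high.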
    <Paragraph position="2"> We also did a series of runs to examine how the systems performed with different amounts of training data. These experiments are summarized in table 3. Note the 97.38 all-systems result which we achieved by adding 75 articles from the formal-run test corpus to the basic 350-article training data. In addition to being an outstanding performance figure, this number shows MENE's responsiveness to good training material. A few other conclusions can be drawn from this data. First, MENE needs at least 20 articles of tagged training data to achieve acceptable performance on its own. Second, a minimum amount of training data is needed before MENE can improve an external system. For Proteus and the Manitoba system, this threshold seems to be about 80 articles, because they show a degradation of performance at 40. Since the IsoQuest system was stronger to start with, MENE required 150 articles to show an improvement. Note the anomaly in comparing the 250- and 350-article columns: Proteus shows only a very small gain and IsoQuest shows a deterioration. These last 100 articles added to the system were tagged by us at NYU, and we would humbly guess that we tagged them less carefully than the rest of the data, which was tagged by BBN and Science Applications International Corporation (SAIC).</Paragraph>
    <Paragraph position="3"> MENE has also been run against all-uppercase data. On this data we achieved an F-measure of 88.19 for the MENE-only system and 91.38 for the MENE + Proteus system. The latter figure matches the best currently published result (Bikel et al., 1997) on within-domain all-caps data. On the other hand, we scored lower on all-caps data than BBN's Identifinder in the MUC-7 formal evaluation, for reasons which are probably similar to those discussed in section 9 in the comparison of our mixed-case performances (Miller et al., 1998; Borthwick et al., 1998). We have put very little effort into optimizing MENE on this type of corpus and believe that there is room for improvement here.</Paragraph>
    <Paragraph position="4"> In another experiment, we stripped out all features other than the lexical features and still achieved an F-measure of 88.13. Since these features do not rely on any external knowledge sources and are automatically generated, this result is a strong indicator of MENE's portability.</Paragraph>
    <Paragraph position="5"> The MUC-7 formal evaluation involved a shift in topic which was not communicated to the participants beforehand: the training data focused on airline disasters, while the test data covered missile and rocket launches. MENE fared much more poorly on this data than on the within-domain data quoted above, achieving an F-measure of only 88.80 for the MENE + Proteus system and 84.22 for the MENE-only system. While 88.80 was still the fourth-highest score among the twelve participants in the evaluation, we feel that it is necessary to view this number as a cross-domain portability result rather than as an indicator of how the system can do on unseen data within its training domain. We believe that if the system had been allowed to train on missile/rocket-launch articles, its performance on these articles would have been much better. More MENE test results and discussion of the formal run can be found in (Borthwick et al., 1998).</Paragraph>
  </Section>
</Paper>