<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0430"> <Title>Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons</Title> <Section position="6" start_page="2" end_page="2" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> To perform named entity extraction on the news articles in the CoNLL-2003 English shared task, several families of features are used, all time-shifted by -2, -1, 0, 1, 2: (a) the word itself; (b) 16 character-level regular expressions, mostly concerning capitalization and digit patterns, such as A, A+, Aa+, Aa+Aa*, A., D+, where A, a and D indicate the regular expressions [A-Z], [a-z] and [0-9]; (c) 8 lexicons entered by hand, such as honorifics, days and months; (d) 15 lexicons obtained from specific web sites, such as countries, publicly-traded companies, surnames, stopwords, and universities; (e) 25 lexicons obtained by WebListing (including people names, organizations, NGOs and nationalities); (f) all of the above tests, with the prefix firstmention, taken from any previous duplicate of the current word (if capitalized). A small amount of hand-filtering was performed on some of the WebListing lexicons. Since GoogleSets' support for non-English languages is severely limited, only 5 small lexicons were used for German, but character bi- and tri-grams were added.</Paragraph> <Paragraph position="1"> A first-order CRF, implemented in Java, was trained for about 12 hours on a 1 GHz Pentium with a Gaussian prior variance of 0.5, inducing 1000 or fewer features (down to a gain threshold of 5.0) in each round of 10 L-BFGS iterations. Candidate conjunctions are limited to the 1000 atomic and existing features with the highest gain. Performance results for each of the entity classes can be found in Figure 1. The model achieved an overall F1 of 84.04% on the English test set using 6423 features.
(Using a set of fixed conjunction patterns instead of feature induction results in an F1 of 73.34%, with about 1 million features; trial-and-error tuning of the fixed patterns would likely improve this.) Further accuracy gains are expected from experimenting with the induction parameters and from improved WebListing.</Paragraph> </Section> </Paper>
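<!--
Illustrative sketch, not the authors' code. The feature families (a) to (e) described in Section 5 can be pictured as per-token tests that are copied to the offsets -2, -1, 0, 1, 2. The Python below makes several assumptions that the text does not confirm: shape patterns must match the whole token, the dot in "A." is read as a literal period, lexicons are sets of lowercase strings, and the feature names (W=, SHAPE=, LEX=, @offset) are invented for illustration. Only the six shape patterns listed in the text are included, and the firstmention tests in (f) are omitted.

```python
import re

# Shape patterns named in the text; A, a and D stand for [A-Z], [a-z], [0-9].
# The paper uses 16 patterns; only the six it lists are reproduced here.
# Assumption: the "." in "A." is a literal period.
SHAPES = {
    "A": r"[A-Z]",
    "A+": r"[A-Z]+",
    "Aa+": r"[A-Z][a-z]+",
    "Aa+Aa*": r"[A-Z][a-z]+[A-Z][a-z]*",
    "A.": r"[A-Z]\.",
    "D+": r"[0-9]+",
}
COMPILED = {name: re.compile(pattern) for name, pattern in SHAPES.items()}

# Time shifts applied to every atomic test, as stated in the text.
OFFSETS = (-2, -1, 0, 1, 2)


def atomic_tests(word, lexicons):
    """Return the unshifted tests that fire for a single token."""
    feats = {"W=" + word}  # (a) the word itself
    for name, rx in COMPILED.items():
        # Assumption: a shape fires only if it matches the whole token.
        if rx.fullmatch(word):
            feats.add("SHAPE=" + name)  # (b) character-level patterns
    for lex_name, entries in lexicons.items():
        # Assumption: lexicon entries are stored in lowercase.
        if word.lower() in entries:
            feats.add("LEX=" + lex_name)  # (c) to (e) lexicon membership
    return feats


def shifted_features(tokens, lexicons):
    """For each position, collect the tests of its neighbours at each offset."""
    per_token = [atomic_tests(tok, lexicons) for tok in tokens]
    out = []
    for i in range(len(tokens)):
        feats = set()
        for off in OFFSETS:
            j = i + off
            if 0 <= j < len(tokens):  # skip offsets that fall off the sentence
                feats.update(f"{f}@{off}" for f in per_token[j])
        out.append(feats)
    return out
```

Calling shifted_features(["Mr.", "Smith", "visited", "IBM"], {"honorific": {"mr."}}) yields one feature set per token; a CRF toolkit would then consume these sets as binary observation tests.
-->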