<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1038"> <Title>Large-scale Controlled Vocabulary Indexing for Named Entities</Title> <Section position="6" start_page="279" end_page="279" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> In the final pre-release test, Entity Indexing was applied to more than 13,500 documents from 250 publications. Each document in the test was reviewed by a data analyst. Several of these were also reviewed by a researcher to verify the analysts' consistency with the formal evaluation criteria. Recall was 92.0% and precision was 96.5% when targeting documents with major references.</Paragraph> <Paragraph position="1"> Additional spot tests were done after the process was applied in production to archived documents and to incoming documents. These tests routinely showed recall and precision to be in the 90% to 96% range on over 100,000 documents examined.</Paragraph> <Paragraph position="2"> Some recall errors were due to company names with unusual structure. Many such problems can be addressed through manual intervention in the topic definitions. Some publication styles also led to recall errors. One publisher introduces a variety of unanticipated abbreviations, such as Intnl for International. Trade publications tend to use only short forms of company names, even for lesser-known companies. Those companies may be well known only within an industry and thus to the audience of the trade publication. These types of problems can be addressed through manual intervention in the topic definitions, although for the abbreviations problem this is little more than patching.</Paragraph> <Paragraph position="3"> Capitalized and all-uppercase text in headlines and section headings was a routine source of precision errors. Such text often led to unwanted term matching, particularly affecting acronyms and one-word company name variants. Different companies with similar names also led to precision problems.
This was particularly true for subsidiaries of the same company whose names differed only by geographically distinct company designators.</Paragraph> </Section> </Paper>