File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/w06-3320_concl.xml

Size: 1,438 bytes

Last Modified: 2025-10-06 13:55:48

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3320">
  <Title>Refactoring Corpora</Title>
  <Section position="5" start_page="116" end_page="116" type="concl">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> The underlying motivation for this paper is the hypothesis that corpus refactoring is practical, economical, and useful. Erjavec (2003) converted the GENIA corpus from its native format to a TEI P4 format and noted that the translation process brought to light some previously covert problems with the GENIA format. Similarly, in the process of refactoring, we discovered and repaired a number of erroneous entity boundaries and spurious entities.</Paragraph>
    <Paragraph position="1"> A number of enhancements to the corpus are now possible that would have been difficult at best in its previous form. These include, but are not limited to, performing syntactic and semantic annotation and adding negative examples, which would expand the usefulness of the corpus. With revision control software, distributing iterative feature additions becomes simple.</Paragraph>
    <Paragraph position="2"> We found that this corpus could be refactored in about 3 person-weeks. Users can take advantage of the corrections that we made to the entity component of the data to evaluate novel named entity recognition techniques or information extraction approaches.</Paragraph>
  </Section>
</Paper>