<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3320">
  <Title>Refactoring Corpora</Title>
  <Section position="3" start_page="0" end_page="116" type="intro">
    <SectionTitle>
2 Methods
</SectionTitle>
    <Paragraph position="0"> The target WordFreak and XML-embedded formats were chosen for two reasons. First, there is some evidence suggesting that standoff annotation and embedded XML are the two most highly preferred corpus annotation formats, and second, these formats are employed by the two largest extant curated biomedical corpora, GENIA (Kim et al., 2001) and BioIE (Kulick et al., 2004).</Paragraph>
    <Paragraph position="1"> The PDG corpus we refactored was originally constructed by automatically detecting protein-protein interactions using the system described in Blaschke et al. (1999), and then manually reviewing the output. We selected it for our pilot project because it was the smallest publicly available corpus of which we were aware. Each block of text has a deprecated MEDLINE ID, a list of actions, a list of proteins and a string of text in which the actions and proteins are mentioned. The structure and contents of the original corpus dictate the logical steps of the refactoring process:  2. Locate the original source sentence in the title or abstract.</Paragraph>
    <Paragraph position="2"> 3. Locate the &quot;action&quot; keywords and the entities (i.e., proteins) in the text.</Paragraph>
    <Paragraph position="3"> 4. Produce output in the new formats.</Paragraph>
    <Paragraph position="4"> After each file creation step above, human curators verify the data. The creation and curation process is structured this way so that all data is known to be valid before the next step begins, thereby giving the automation the best chance of performing well on that step.</Paragraph>
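    <Paragraph position="5"> The locating and output steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system used for the PDG corpus: exact string matching, the function names, the tag names (sentence, protein, action) and the toy sentence are all hypothetical, and the standoff tuples shown are a generic offset-based representation rather than the actual WordFreak file format.

```python
import re
import xml.etree.ElementTree as ET


def locate_sentence(sentence, title, abstract):
    """Return (field_name, offset) of the first exact match, or None.

    Assumption: the corpus sentence appears verbatim in the source text.
    """
    for name, field in (("title", title), ("abstract", abstract)):
        offset = field.find(sentence)
        if offset != -1:
            return name, offset
    return None


def find_mentions(text, terms, label):
    """Return sorted (start, end, label) character spans for each term."""
    spans = []
    for term in terms:
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def to_embedded_xml(text, spans):
    """Wrap each non-overlapping span in an element named after its label."""
    root = ET.Element("sentence")  # hypothetical wrapper element
    cursor = 0
    last = None
    for start, end, label in sorted(spans):
        if cursor > start:
            continue  # skip spans that overlap one already emitted
        gap = text[cursor:start]
        if last is None:
            root.text = gap
        else:
            last.tail = gap
        last = ET.SubElement(root, label)
        last.text = text[start:end]
        cursor = end
    rest = text[cursor:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sentence = "ProtA binds ProtB."
    print(locate_sentence(sentence, "A toy title", "Background. " + sentence))
    standoff = (find_mentions(sentence, ["ProtA", "ProtB"], "protein")
                + find_mentions(sentence, ["binds"], "action"))
    print(sorted(standoff))  # standoff view: offsets kept apart from the text
    print(to_embedded_xml(sentence, standoff))  # embedded view: inline tags
```

    In this sketch the same list of character spans feeds both outputs: the standoff view leaves the sentence untouched and records offsets separately, while the embedded view writes the annotations into the text as inline elements. A curator could inspect the spans after each call before the next function is run, mirroring the verification between steps.</Paragraph>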
  </Section>
</Paper>