<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0303">
  <Title>Contrast And Variability In Gene Names</Title>
  <Section position="4" start_page="2" end_page="864" type="evalu">
    <SectionTitle>
5 Results II
</SectionTitle>
    <Paragraph position="0"> Table 2 shows the results. As intuition would suggest, all heuristics were effective in locating more names than strict pattern matches alone.</Paragraph>
    <Paragraph position="1"> For example, the optional hyphenation heuristic allowed the official gene name alpha-2-macroglobulin to find a match in the following sentence: "Moreover, C5a also enhanced transcription of the gene for the type-2 acute phase protein alpha 2-macroglobulin in HC indirectly by increasing LPS-dependent IL-6 release from KC."</Paragraph>
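The optional hyphenation idea can be illustrated with a minimal sketch. It assumes that each hyphen in the official name may appear in text as a hyphen, as whitespace, or not at all. The function name `optional_hyphen_pattern` and this exact definition are assumptions made for illustration, not the implementation used in the experiments.

```python
import re


def optional_hyphen_pattern(name):
    """Compile a regex in which every hyphen of `name` is optional.

    Assumption for this sketch: a hyphen in the official name may show up
    in running text as a hyphen, as one whitespace character, or be absent.
    """
    # Escape each hyphen-delimited piece so it is matched literally.
    parts = [re.escape(part) for part in name.split("-")]
    # Rejoin the pieces with an optional hyphen-or-whitespace separator.
    return re.compile(r"[-\s]?".join(parts))


official = "alpha-2-macroglobulin"
text = "the type-2 acute phase protein alpha 2-macroglobulin in HC"

# A strict match fails because the text uses a space after "alpha".
print(official in text)

# The relaxed pattern finds the variant spelling.
found = optional_hyphen_pattern(official).search(text)
print(found.group(0) if found else None)
```

A pattern this loose can also match inside longer tokens, which is one way such a heuristic could produce false positives; adding word-boundary checks would be a natural refinement.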
    <Paragraph position="2"> Names located by strict pattern matching: 1846
Additional names located by vowel sequence heuristic matches: [count missing]
Additional names located by optional hyphen heuristic matches: [count missing]
Additional names located by case insensitive heuristic matches: [count missing]
Additional names located by optional parentheses heuristic matches: [count missing]
Additional names located by plural matches: [count missing]
[Table 2 caption, truncated] ... match, heuristics, and plurals.</Paragraph>
    <Paragraph position="3"> However, we were concerned about the possibility of poor precision, i.e., false positives. For this reason, we ran our heuristics against the same body of Medline abstracts, then randomly selected up to 100 tokens of gene names suggested by each heuristic (some heuristics found fewer than 100 tokens in our corpus; see Table 2 above). We labelled each putative gene name with the canonical gene name that we believed it to refer to, and then asked a subject matter expert to evaluate whether each name that we had identified did or did not refer to that canonical gene name. The expert was presented with a three-way forced-choice paradigm, the options being yes, no, and can't tell. It seemed useful to be able to compare the precision of our technique with the incidence of false positives from strict pattern matches, so the expert was also presented with a number of strict matches (i.e., names not identified by our heuristics) to evaluate, in a quantity roughly equivalent to the number of heuristically suggested names that they were asked to evaluate. Table 3 shows the results.</Paragraph>
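The sampling and tallying steps of this evaluation can be sketched as follows. The function names, the fixed random seed, and the choice to report each of the three judgments as a fraction of all judged tokens are assumptions of this sketch; the source does not say how can't tell responses were counted. The judgments in the usage lines are made-up placeholders, not results from Table 3.

```python
import random
from collections import Counter

LABELS = ("yes", "no", "can't tell")


def sample_for_review(tokens, limit=100, seed=0):
    """Return up to `limit` randomly chosen tokens for expert review.

    If a heuristic produced no more than `limit` tokens, all are kept.
    """
    if len(tokens) > limit:
        return random.Random(seed).sample(tokens, limit)
    return list(tokens)


def judgment_rates(judgments):
    """Return the fraction of judged tokens that received each label."""
    counts = Counter(judgments)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical usage: 250 placeholder tokens from one heuristic.
candidates = ["token%d" % i for i in range(250)]
print(len(sample_for_review(candidates)))

# Hypothetical expert judgments, for illustration only.
print(judgment_rates(["yes", "yes", "no", "can't tell"]))
```

Under this convention the false positive rate for a heuristic is the `no` fraction, and the same two functions apply unchanged to the strict-match comparison sample.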
    <Paragraph position="4"> We note the following: 1. Even strict pattern matches and forms that vary only with respect to inflectional morphology (i.e., the plurals) yield a nontrivial percentage of false positives, a percentage that is actually higher than that of two of our heuristics (optionality of hyphenation and optionality of parenthesized material).</Paragraph>
    <Paragraph position="5"> 2. Two of our heuristics (equivalence of vowel sequences and case insensitivity) yielded unexpectedly high rates of false positives. The vowel sequence heuristic can probably be made to yield a lower rate by fine-tuning it. For example, false positives from this heuristic can be reduced by disregarding any weak matches that come from one or more name-final upper-case I's, since these are commonly used in gene names to form Roman numerals. The high false positive rate of the case insensitivity heuristic is unexpected, and will be investigated further.</Paragraph>
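One possible reading of the Roman numeral refinement is sketched below: a candidate match is discarded when the official name and the matched string end in different runs of upper-case I's. This interpretation, the function names, and the example strings (which are hypothetical and not real gene names) are all assumptions of the sketch; the vowel sequence heuristic itself is not reproduced here.

```python
def trailing_i_run(name):
    """Return the run of upper-case I's at the end of `name` ('' if none)."""
    return name[len(name.rstrip("I")):]


def keep_match(official, candidate):
    """Keep a heuristic match only if both names end in the same run of I's.

    Assumption: a differing name-final run of I's signals a different
    Roman numeral, so the pair probably refers to two distinct genes.
    """
    return trailing_i_run(official) == trailing_i_run(candidate)


# Hypothetical names, used only to exercise the filter.
print(keep_match("factor II", "factor III"))
print(keep_match("factor II", "factor-II"))
```

Such a filter would run after the vowel sequence heuristic has proposed a match, so it can only lower that heuristic's false positive rate, possibly at some cost in recall.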
  </Section>
</Paper>