<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0303"> <Title>Contrast And Variability In Gene Names</Title> <Section position="2" start_page="0" end_page="2" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Almost all current approaches to entity identification do not actually tackle identification per se, but only the (still difficult) location of named entities in text. Entity location consists of demarcating the boundaries of names in text, whereas entity identification consists of the same task plus mapping the located names to the canonical entities that they refer to. In this paper we present data on variability in the orthographic representation of gene names, and then show how knowledge about that variability can be used for heuristics that increase recall in the entity identification task. (We use the term &quot;gene name&quot; as shorthand for &quot;gene, protein, or RNA name.&quot;) To understand why it is important to be able to map located names to the canonical entities that they refer to, consider the outcome of running an information extraction routine that has access only to entity location against a hypothetical document about rat somatotropin.</Paragraph> <Paragraph position="1"> It contains the synonymous names rat somatotropin, somatotropin, and growth hormone, all of which refer to the same biomolecule, whose canonical name we will assume to be somatotropin. (The document is hypothetical; somatotropin and its synonyms are not.) Suppose that the document includes three separate assertions, of the form somatotropin is upregulated by X, transcription of rat somatotropin is blocked by Y, and growth hormone is expressed by cells of type Z. The system correctly extracts three assertions, but incorrectly attributes them to three separate biomolecules, only one of which carries the canonical name. 
Now consider the outcome of running an information extraction routine with access to entity identification against the same document.</Paragraph> <Paragraph position="2"> Again, the system extracts three assertions, but this time all three assertions are correctly attributed to the same biomolecule, i.e.</Paragraph> <Paragraph position="3"> somatotropin. Krauthammer et al. (2000), who are arguably the only researchers who have attempted to do actual identification as we define it, have noted that while it is possible to recognize gene names in the face of variability, it remains difficult to map the recognized names to their canonical referents. They point out that heuristics might be helpful in doing this.</Paragraph> <Paragraph position="4"> Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, July 2002, pp. 14-20. Association for Computational Linguistics.</Paragraph> <Paragraph position="5"> We studied variability in gene names with an eye toward finding such heuristics. Our goal was to distinguish kinds of variability that tend to differentiate names with different referents, e.g. aha vs. aho or ACE1 vs.</Paragraph> <Paragraph position="6"> ACE2, from kinds of variability that only differentiate synonyms that share a referent, such as tumour protein homologue and tumor protein homolog or ACE and ACE1.</Paragraph> <Paragraph position="7"> We then use data on contrast and variability to suggest our heuristics. The idea behind using such heuristics is that an identified entity in some text that differs minimally from the canonical name for some entity can be mapped to that canonically-labelled entity if some heuristic allows such a mapping.</Paragraph> <Paragraph position="8"> We use the term contrast to refer to dimensions or features that can be used to distinguish between two samples of natural language with different meanings. 
Issues of contrast versus variability can be discussed with reference to individual characters or sequences of characters, or with reference to more abstract features, such as orthographic case. In the molecular biology domain, we will say that some feature is contrastive if it encodes the difference between the names of two different genes. In other words, contrasts occur between entities. We will say that some feature is (noncontrastively) variable if it differs merely between synonyms; in other words, variability occurs among members of a synonym set.</Paragraph> <Paragraph position="9"> We can trivially identify the contrast between BRCA1 and MHC class I polypeptide-related sequence C, or between sonic hedgehog and eyeless. What we are really interested in is minimally different tuples, that is, sets that differ with respect to only one feature. For instance, we would want to look at BRCA1 and BRCA2, which differ with respect to whether the character at the right edge is 1 or 2, or estrogen receptor beta and oestrogen receptor beta, which differ with respect to the presence or absence of an o at the left edge. Ideally, then, we are looking for sets of names that differ only by a single unit. However, the size and scope of the unit need further discussion. When dealing with written language, the unit of concern will usually be the grapheme. A grapheme may be as small as a single character, but may also be considerably longer, e.g. the sequence ough in dough or through. In this study, we considered graphemes longer than a single character only in the case of vowels. Sometimes we will want to consider tuples that differ with respect to strings that are considerably larger than a grapheme, such as a word, or a string of parenthesized material; this will be discussed further in the first Methods section. 
The issue of tuple size will be discussed there, as well.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Corpus Methods I: Investigating dimensions of contrast and variability </SectionTitle> <Paragraph position="0"> We examined a large corpus of gene names and of synonyms for those gene names to determine what sorts of features are contrastive in gene names, and what sorts of features can vary without affecting the referential status of a gene name. The corpus was derived from the LocusLink LL_tmpl file (the version on the LocusLink download site at 2:32 p.m. on Sept. 13, 2001), available by ftp from ftp://ncbi.nlm.nih.gov. This is an easily readable dump of LocusLink, which &quot;provides a single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites&quot; (www.ncbi.nlm.nih.gov/locuslink).</Paragraph> <Paragraph position="1"> We then pulled out the names and synonyms for all LocusLink entries for the species Mus musculus, Rattus norvegicus, and Homo sapiens. We took the fields labelled as</Paragraph> </Section> </Section> </Paper>