<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0303">
  <Title>Contrast And Variability In Gene Names</Title>
  <Section position="3" start_page="2" end_page="2" type="metho">
    <SectionTitle>
OFFICIAL GENE NAME, PREFERRED
GENE NAME, OFFICIAL SYMBOL,
PREFERRED SYMBOL, PRODUCT,
PREFERRED PRODUCT, ALIAS SYMBOL,
</SectionTitle>
    <Paragraph position="0"> and ALIAS PROT. Some genes were unnamed, and these were excluded from the analysis. To our surprise, we also found that some gene names were duplicated within the same genome: e.g., in the M. musculus genome, there are two genes named reciprocal translocation, Ch4 6 and 7, Adler 17. We filtered out such duplicate names and excluded them as well. This left 42,608 genes for the mouse, 4,457 for the rat, and 25,915 for the human. For each organism, we created one file containing just gene names, and a set of files containing all gene names and their synonyms.</Paragraph>
    <Paragraph position="1"> For the gene name file, we used just those fields labelled OFFICIAL GENE NAME or PREFERRED GENE NAME; for the combined name/synonym files, we used all of the fields given above.</Paragraph>
    <Paragraph position="2"> Finding contrasts in the corpus. For each species, we pulled out a list of all names that were indicated as OFFICIAL GENE NAME or PREFERRED GENE NAME in the LL_tmpl file. Each name in this file represents a different gene. We examined the names in this single large file for contrastive differences.
Finding noncontrastive variability in the corpus. For each species, for each gene, we pulled out the list of all names that were indicated by any of the set of labels listed above, and stored them separately. With each of the many resulting files (one per gene), we examined the small set of synonymous names for noncontrastive variability.</Paragraph>
    <Paragraph position="3"> Finding minimal tuples. The most obvious way to find minimal tuples would be to first determine the minimum edit distance between all pairs of gene names, and then select all pairs with minimum edit distance below some cutoff value. However, this approach would suffer from two obvious flaws. The first flaw is that it is computationally expensive, since it is an O(n^2) problem.</Paragraph>
    <Paragraph position="4"> The second flaw is that it is ineffective: it yields only tuples of size 2, whereas sets of minimally differing gene names occur in sizes 3, 4, 5, and even considerably larger, e.g. the three-member set conserved sequence block I, conserved sequence block II, and conserved sequence block III. We chose an alternative approach to the problem of finding minimal tuples. It consists of the following steps:
For each gene name:
· transform the gene name to some reduced form
· using the reduced form as the key in a hash of keys → lists, add the full form to the list of full forms from which that reduced form was derived
For each key in the hash:
· retrieve the list of names that is mapped to by that key
· if the list of names pointed to by that key has more than one element, report the list
For example, if the input is the list of gene names gamma-glutamyltransferase 1, gamma-glutamyltransferase 2, gamma-glutamyltransferase 3, matrix metalloproteinase 23A, matrix metalloproteinase 23B, and acrosin, and the transformation that is being applied to each name consists of deletion of the last character, then the output will be two lists of more than one element, pointed to by gamma-glutamyltransferase and matrix metalloproteinase 23, and one list of a single element, acrosin. The two lists with more than one element would be reported as minimal tuples.
We applied four transformations designed to investigate syntagmatic, or positional, effects. These consisted of removing the first character, the first word, the last character, and the last word.</Paragraph>
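The hashing procedure and the four edge transformations described above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, splitting words on a single space is an assumption, and skipping names whose reduced form is empty (so that all one-word names are not lumped together) is likewise an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def drop_first_char(name: str) -> str:
    return name[1:]


def drop_last_char(name: str) -> str:
    return name[:-1]


def drop_first_word(name: str) -> str:
    # Assumption: words are separated by single spaces.
    parts = name.split(" ", 1)
    return parts[1] if len(parts) > 1 else ""


def drop_last_word(name: str) -> str:
    parts = name.rsplit(" ", 1)
    return parts[0] if len(parts) > 1 else ""


def minimal_tuples(
    names: Iterable[str], transform: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group full names by their reduced form; keep groups of size 2 or more."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        key = transform(name)
        if not key:
            # Assumption: an empty reduced form carries no information.
            continue
        groups[key].append(name)
    return {key: full for key, full in groups.items() if len(full) > 1}


NAMES = [
    "gamma-glutamyltransferase 1",
    "gamma-glutamyltransferase 2",
    "gamma-glutamyltransferase 3",
    "matrix metalloproteinase 23A",
    "matrix metalloproteinase 23B",
    "acrosin",
]

if __name__ == "__main__":
    for key, members in minimal_tuples(NAMES, drop_last_char).items():
        print(repr(key), members)
```

Each name is transformed and hashed once, so a pass over n names takes time roughly linear in n, and a group may have any number of members rather than exactly two.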
    <Paragraph position="5"> We applied four transformations designed to investigate paradigmatic, or content-based, effects. These consisted of mapping vowel sequences to a constant string (the purpose of this being to look at American vs. British dialectal differences in gene names); replacement of hyphens with spaces; removal of parenthesized material; and normalization of case. These relatively simple transformations miss a number of categories of differences between gene names, e.g. single-character differences in non-edge positions, such as 0 BETA-1 GLOBIN vs. 0 BETA-2 GLOBIN; single-word differences in non-edge positions, such as DOPAMINE D1A RECEPTOR vs.</Paragraph>
    <Paragraph position="6"> DOPAMINE D2 RECEPTOR; proper substring relationships, such as EYE vs. EYE2; and interactions between the features that we did examine, such as calsequestrin 1 (fast-twitch, skeletal muscle) vs. calsequestrin 2 (cardiac muscle), which is not found by any one heuristic but would be found by the combination of the parenthesized-material transformation followed by either of the right-edge transformations. Nonetheless, they seem like a reasonable starting point.</Paragraph>
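As a rough illustration of the four paradigmatic transformations, and of the combination of heuristics mentioned for the calsequestrin example, one might write something like the following. The function names, the placeholder string "#" for vowel sequences, and the haemoglobin/hemoglobin pair used for the dialect case are assumptions made for this sketch and are not taken from the paper.

```python
import re
from typing import Callable

VOWEL_RUN = re.compile(r"[aeiou]+", re.IGNORECASE)
# Parenthesized material plus any whitespace directly before it.
PAREN = re.compile(r"\s*\([^()]*\)")


def collapse_vowels(name: str) -> str:
    # Map every vowel sequence to one constant placeholder.
    return VOWEL_RUN.sub("#", name)


def hyphens_to_spaces(name: str) -> str:
    return name.replace("-", " ")


def strip_parenthesized(name: str) -> str:
    return PAREN.sub("", name).strip()


def normalize_case(name: str) -> str:
    return name.lower()


def drop_last_char(name: str) -> str:
    return name[:-1]


def chain(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Apply the given transformations from left to right."""
    def combined(name: str) -> str:
        for step in steps:
            name = step(name)
        return name
    return combined


if __name__ == "__main__":
    reduce_name = chain(strip_parenthesized, drop_last_char)
    a = "calsequestrin 1 (fast-twitch, skeletal muscle)"
    b = "calsequestrin 2 (cardiac muscle)"
    print(reduce_name(a) == reduce_name(b))
```

Any of these functions, or a chained combination, could serve as the reduced-form transformation in the hashing procedure for finding minimal tuples.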
    <Paragraph position="7"> 3 Results I. Table 1 and Graphs 1, 2, 3, and 4 summarize our findings on contrast and variability in gene names.</Paragraph>
    <Paragraph position="8"> One surprising finding was that every paradigmatic dimension of contrast that we examined turned out to be contrastive in at least some very small number of cases. We did not expect hyphenation ever to be contrastive, but found that within the H. sapiens genome, the two genes at LocusLink IDs 51086 and 112858 differ in just that feature, having the names putative protein-tyrosine kinase and putative protein tyrosine kinase, respectively. The two genes at LocusLink IDs 51251 and 90859 differ in the same way, being named uridine 5'-monophosphate hydrolase 1 and uridine 5' monophosphate hydrolase 1, respectively.</Paragraph>
    <Paragraph position="9">  variability in gene names. "Percentage" columns give percentage of total names considered for that species, rounded to three decimal places. DOC = dimension of</Paragraph>
    <Paragraph position="11"> case, PM = parenthesized material. S = species, M = mouse, R = rat, H = human. %N = contrastive names as percentage of total names for that species.</Paragraph>
    <Paragraph position="12"> We did not expect case ever to be contrastive, but found that within the R. norvegicus genome, the two genes at LocusLink IDs 24969 and 83789 differ with respect to just that feature, having the names Ribosomal protein S2 and ribosomal protein S2, respectively. The two genes at LocusLink IDs 56764 and 65028 differ in the same way, having the names dnaj-like protein and DnaJ-like protein. As Graphs 1 and 2 show, these contrasts were not common, but we were surprised to observe them at all.</Paragraph>
    <Paragraph position="13"> (In considering these findings, it should be noted that these results are specific to a particular version of LocusLink. We were interested in the extent to which these unexpected minimal pairs might be erroneous, so we examined the corresponding LOCUSIDs in a subsequent revision of the file from several months later (May 1, 2002, 10:21 a.m.). We found that some of these entries had been combined, and some had been assigned an OFFICIAL_GENE_NAME, but others were unchanged. We therefore cannot eliminate the possibility that they are in error and have simply eluded the editing process thus far. Nevertheless, these anomalous contrasts continue to exist in the database, and we have no reason to assume that such names will not continue to be entered, erroneously or otherwise; it therefore behooves us to consider their implications for entity identification.)
Graph 3. Parenthesized material: contrast and variability
We found marked edge effects. Contrasts are much more likely to be marked at the name boundary than are noncontrastive differences. There is a marked asymmetry in the directionality of the location of contrastive differences: they are much more likely to appear at the right edge of the word than at the left edge of the word. There are also marked inter-species differences. For example, although large edge effects are obvious for names (as opposed to synonyms) in the mouse and human genomes, they are not in the rat genome. In interpreting variability, it will likely be helpful to have some awareness of what species is being discussed.</Paragraph>
    <Paragraph position="14"> Graph 4. Edge effects: contrastive on left, synonymous on right
These findings suggested a set of heuristics for allowing weakened pattern matches on gene names. The heuristics are stated as transformations applied to regular expressions representing gene names to generate new regular expressions for the same gene names, but the heuristics can be applied in other ways as well, e.g. by grammar-based generation of alternate forms. The heuristics are listed below:
1. Equivalence of vowel sequences: for any regular expression representing a gene name, substitute the regular expression formed by replacing all vowel sequences with one or more of any vowel.</Paragraph>
    <Paragraph position="15"> 2. Optionality of hyphens: for any regular expression representing a gene name, substitute the regular expression formed by replacing every hyphen with the disjunction of a hyphen or a space.</Paragraph>
    <Paragraph position="16"> 3. Optionality of parenthesized material: for any regular expression representing a gene name, substitute the regular expression formed by making any paired parentheses and the material they enclose (and surrounding whitespace, as appropriate) optional.</Paragraph>
    <Paragraph position="17"> 4. Case insensitivity: for any regular expression, apply it case-insensitively.</Paragraph>
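One possible reading of the four heuristics, applied together to a literal gene name, is sketched below. The function name weakened_regex and the tokenizing pattern are hypothetical, and treating only a, e, i, o, u as vowels is an assumption. The paper states the heuristics as transformations of regular expressions; this sketch starts from a plain name string for simplicity.

```python
import re

# Alternatives, in priority order: parenthesized material (with leading
# whitespace), a vowel sequence, a hyphen, or any other single character.
TOKEN = re.compile(r"(\s*\([^()]*\))|([aeiou]+)|(-)|(.)", re.IGNORECASE | re.DOTALL)


def weakened_regex(name: str) -> "re.Pattern":
    """Build a relaxed, case-insensitive pattern for one gene name."""
    parts = []
    for match in TOKEN.finditer(name):
        paren, vowels, hyphen, other = match.groups()
        if paren is not None:
            # Heuristic 3: parenthesized material is optional.
            parts.append("(?:" + re.escape(paren) + ")?")
        elif vowels is not None:
            # Heuristic 1: any vowel sequence matches one or more vowels.
            parts.append("[aeiou]+")
        elif hyphen is not None:
            # Heuristic 2: a hyphen matches a hyphen or a space.
            parts.append("[- ]")
        else:
            parts.append(re.escape(other))
    # Heuristic 4: apply the pattern case-insensitively.
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    pattern = weakened_regex("putative protein-tyrosine kinase")
    print(bool(pattern.fullmatch("Putative protein tyrosine kinase")))
```

In this sketch the heuristics are combined into a single pattern; applying them one at a time would allow the contribution of each to be measured separately.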
    <Paragraph position="18">  To evaluate the extent to which each of these heuristics led to increased entity recognition, we ran our heuristics against a large body of Medline abstracts. We counted the number of entities that were found by an exact pattern match to a LocusLink name, and counted the number of additional names that were found by each heuristic. Although none of our heuristics specifically addressed morphologically-induced variability, we also added a search for pluralized gene names, so that we could compare the extent to which recognition of plurals improved recall to the extent to which our heuristics improved recall.</Paragraph>
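The counting scheme described above might be sketched as follows, assuming that a "hit" is a character span in an abstract and that a heuristic is credited only with spans the exact match did not already find. The helper names and the four toy abstracts are invented for illustration and are not data from the paper; the word-boundary anchors assume names that begin and end with a letter or digit.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int, int]  # (abstract index, start offset, end offset)


def exact(name: str) -> "re.Pattern":
    return re.compile(r"\b" + re.escape(name) + r"\b")


def hyphen_optional(name: str) -> "re.Pattern":
    body = "[- ]".join(re.escape(part) for part in name.split("-"))
    return re.compile(r"\b" + body + r"\b")


def case_insensitive(name: str) -> "re.Pattern":
    return re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)


def plural(name: str) -> "re.Pattern":
    # Assumption: plurals are formed by appending "s".
    return re.compile(r"\b" + re.escape(name) + r"s\b")


def spans(names: List[str], abstracts: List[str],
          make_pattern: Callable[[str], "re.Pattern"]) -> Set[Span]:
    found: Set[Span] = set()
    for name in names:
        pattern = make_pattern(name)
        for index, text in enumerate(abstracts):
            for match in pattern.finditer(text):
                found.add((index, match.start(), match.end()))
    return found


def count_hits(names: List[str], abstracts: List[str],
               heuristics: Dict[str, Callable[[str], "re.Pattern"]]) -> Dict[str, int]:
    baseline = spans(names, abstracts, exact)
    counts = {"exact": len(baseline)}
    for label, make_pattern in heuristics.items():
        # Count only spans that the exact match did not already find.
        counts[label] = len(spans(names, abstracts, make_pattern) - baseline)
    return counts


NAMES = ["putative protein-tyrosine kinase", "ribosomal protein S2"]
TOY_ABSTRACTS = [  # invented sentences, for illustration only
    "A putative protein-tyrosine kinase was cloned.",
    "The putative protein tyrosine kinase is expressed in liver.",
    "Ribosomal protein S2 binds RNA.",
    "Several putative protein-tyrosine kinases were identified.",
]
HEURISTICS = {"hyphen": hyphen_optional, "case": case_insensitive, "plural": plural}

if __name__ == "__main__":
    print(count_hits(NAMES, TOY_ABSTRACTS, HEURISTICS))
```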
  </Section>
</Paper>