<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1030"> <Title>Extracting the Names of Genes and Gene Products with a Hidden Markov Model</Title> <Section position="3" start_page="0" end_page="202" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Recent studies into the use of supervised learning-based models for the named entity task in the micro-biology domain have shown that models based on HMMs and decision trees such as (Nobata et al., 1999) are much more generalisable and adaptable to new classes of words than systems based on traditional hand-built patterns and domain-specific heuristic rules such as (Fukuda et al., 1998), overcoming the problems associated with data sparseness with the help of sophisticated smoothing algorithms (Chen and Goodman, 1996).</Paragraph> <Paragraph position="1"> HMMs can be considered to be stochastic finite state machines and have enjoyed success in a number of fields including speech recognition and part-of-speech tagging (Kupiec, 1992). It has been natural therefore that these models have been adapted for use in other word-class prediction tasks such as the named-entity task in IE. Such models are often based on ngrams. Although the assumption that a word's part-of-speech or name class can be predicted by the previous n-1 words and their classes is counter-intuitive to our understanding of linguistic structures and long-distance dependencies, this simple method does seem to be highly effective in practice. Nymble (Bikel et al., 1997), a system which uses HMMs, is one of the most successful such systems and trains on a corpus of marked-up text, using only character features in addition to word bigrams.</Paragraph> <Paragraph position="2"> Although it is still early days for the use of HMMs for IE, we can see a number of trends in the research. 
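The bigram assumption discussed above - that a word's name class depends only on the current word and the previous class - is typically decoded with the Viterbi algorithm. The following is a minimal illustrative sketch with invented toy classes and probabilities; it is not the model of this paper or of Nymble, and the smoothing here is just a small floor constant rather than the sophisticated schemes cited above:

```python
def viterbi(words, classes, trans, emit, start):
    """Return the most likely name-class sequence for `words` under a
    bigram (first-order) class model with per-class word emissions."""
    floor = 1e-6  # crude stand-in for unseen-event smoothing
    # best[c] = (probability of best path ending in class c, that path)
    best = {c: (start.get(c, floor) * emit[c].get(words[0], floor), [c])
            for c in classes}
    for w in words[1:]:
        nxt = {}
        for c in classes:
            # choose the best previous class to transition from
            score, path = max(
                ((best[p][0] * trans[p].get(c, floor), best[p][1])
                 for p in classes),
                key=lambda t: t[0])
            nxt[c] = (score * emit[c].get(w, floor), path + [c])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

# Toy parameters, assumed purely for illustration.
classes = ["PROTEIN", "OTHER"]
trans = {"PROTEIN": {"PROTEIN": 0.6, "OTHER": 0.4},
         "OTHER":   {"PROTEIN": 0.2, "OTHER": 0.8}}
emit = {"PROTEIN": {"IL-2": 0.5, "receptor": 0.4},
        "OTHER":   {"the": 0.5, "binds": 0.4}}
start = {"PROTEIN": 0.3, "OTHER": 0.7}

print(viterbi(["the", "IL-2", "receptor", "binds"],
              classes, trans, emit, start))
```

A real system would work in log space and estimate these tables from a tagged corpus; the sketch only shows the shape of the bigram class decoding.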
Systems can be divided into those which use one state per class such as Nymble (at the top level of their backoff model) and those which automatically learn the model's structure such as (Seymore et al., 1999). Additionally, there is a distinction to be made in the source of the knowledge for estimating transition probabilities, between models which are built by hand such as (Freitag and McCallum, 1999) and those which learn from tagged corpora in the same domain (such as the model presented in this paper), word lists, and corpora in different domains - so-called distantly-labeled data (Seymore et al., 1999).</Paragraph> <Section position="1" start_page="201" end_page="202" type="sub_section"> <SectionTitle> 2.1 Challenges of name finding in molecular-biology texts </SectionTitle> <Paragraph position="0"> The names that we are trying to extract fall into a number of categories that are often wider than the definitions used for the traditional named-entity task used in MUC, and may be considered to share many characteristics of term recognition. The particular difficulties with identifying and classifying terms in the molecular-biology domain are an open vocabulary and irregular naming conventions, as well as extensive cross-over in vocabulary between classes.</Paragraph> <Paragraph position="1"> The irregular naming arises in part because of the number of researchers from different fields who are working on the same knowledge discovery area, as well as the large number of substances that need to be named. Despite the best efforts of major journals to standardise the terminology, there is also a significant problem with synonymy, so that often an entity has more than one name that is widely used. The class cross-over of terms arises because many proteins are named after DNA or RNA with which they react. All of the names which we mark up must belong to only one of the name classes listed in Table 1. 
We determined that all of these name classes were of interest to domain experts and were essential to our domain model for event extraction. Example sentences from a marked-up abstract are given in Figure 1.</Paragraph> <Paragraph position="2"> We decided not to use separate states for pre- and post-class words as had been used in some other systems, e.g. (Freitag and McCallum, 1999). Contrary to our expectations, we observed that our training data provided very poor maximum-likelihood probabilities for these words as class predictors.</Paragraph> <Paragraph position="3"> We found that protein predictor words had the only significant evidence and even this was quite weak, except in the case of post-class words, which included a number of head nouns such as &quot;molecules&quot; or &quot;heterodimers&quot;. [Table 1 residue: PROTEIN - proteins, protein groups, families, complexes and substructures; DNA - DNA groups, regions and genes; RNA - RNA groups, regions and genes; SOURCE.cl 93 (leukemic T cell line Kit225); SOURCE.ct 417 (human T lymphocytes); SOURCE.mo 21 (Schizosaccharomyces pombe); SOURCE.mu 64 (mice); SOURCE.vi 90 (HIV-1); SOURCE.sl 77 (membrane); SOURCE.ti 37 (central nervous system); UNK (tyrosine phosphorylation)]</Paragraph> <Paragraph position="4"> In our early experiments using HMMs that incorporated pre- and post-class states we found that performance was significantly worse than without such states, and so we formulated the model as given in Section 3.</Paragraph> <Paragraph position="6"> and for all other words and their name classes as follows:</Paragraph> </Section> </Section> </Paper>