<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1216"> <Title>Named Entity Recognition in Biomedical Texts using an HMM Model</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Data Preparation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Labeled Data </SectionTitle>
<Paragraph position="0"> Our labeled data is from GENIA 3.02 (Ohta et al., 2002), which contains 2,000 abstracts (360K words). It has been annotated with semantic information such as DNA and protein annotations.</Paragraph>
<Paragraph position="1"> These annotations are useful for training models. The corpus contains Part-of-Speech (POS) information as well.</Paragraph>
<Paragraph position="2"> Although POS is not considered very useful for NER in newspaper articles, it can dramatically improve the performance of NER in biomedical texts (Zhou et al., 2004). Our model is trained on this labeled data.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Unlabeled Data </SectionTitle>
<Paragraph position="0"> We downloaded 17 GB of XML abstract data from MEDLINE, which contains 1,381,132 abstracts.</Paragraph>
<Paragraph position="1"> Compared to the labeled data, we have far more unlabeled data, and the amount of available unlabeled data increases every day. We used this unlabeled data to compute word similarity, extracting 66,303,526 proximity relationships from it.</Paragraph> </Section> </Section>
<Section position="4" start_page="0" end_page="84" type="metho"> <SectionTitle> 3 Distributional Word Similarity </SectionTitle>
<Paragraph position="0"> &quot;Words that tend to appear in the same contexts tend to have similar meanings&quot; (Harris, 1968). For example, the words corruption and abuse are similar because both of them can be subjects of verbs like arouse, become, betray, cause, continue, cost, exist, force, go on, grow, have, increase, lead to, and persist, and both of them can modify nouns like accusation, act, allegation, appearance, and case.</Paragraph>
<Paragraph position="1"> Many methods have been proposed to compute distributional similarity between words, e.g., (Hindle, 1990), (Pereira et al., 1993), (Grefenstette, 1994) and (Lin, 1998). Almost all of these methods represent a word by a feature vector, where each feature corresponds to a type of context in which the word appeared.</Paragraph>
<Section position="1" start_page="84" end_page="84" type="sub_section"> <SectionTitle> 3.1 Proximity-based Similarity </SectionTitle>
<Paragraph position="0"> It is natural to use dependency relationships (Mel'cuk, 1987) as features, but that requires a parser. Since biomedical text is highly irregular and very different from newspaper text, a parser developed for the newspaper domain may not perform well on it. Because most dependency relationships involve words that are situated close to one another, they can often be approximated by co-occurrence relationships within a small window (Turney, 2001; Terra and Clarke, 2003). We define the features of a word w to be the first non-stop word on either side of w together with the intervening stop words (where the stop words can be defined as the top-k most frequent words in the corpus).
For example, for the sentence &quot;He got a job from this company.&quot; (considering a, from, and this to be stop words), the features of job provided by this sentence are shown in Table 1.</Paragraph> </Section>
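As a concrete illustration of this feature definition, the following short Python sketch extracts proximity features for a target word. The function name, the direction labels, and the toy stop-word list are our own; the paper's Table 1 may encode the features differently.

```python
# Sketch of proximity-based feature extraction (Section 3.1).
# Assumption: a feature is the first non-stop word on one side of w
# together with the intervening stop words, tagged with its direction.

STOP_WORDS = {"a", "an", "the", "from", "this", "of", "to", "in"}  # stand-in for the top-k most frequent words

def proximity_features(tokens, i, stop_words=STOP_WORDS):
    """Return the proximity features of tokens[i]."""
    features = []
    # scan left: collect intervening stop words until the first non-stop word
    context = []
    for j in range(i - 1, -1, -1):
        context.append(tokens[j])
        if tokens[j].lower() not in stop_words:
            features.append(("L", " ".join(reversed(context))))
            break
    # scan right, symmetrically
    context = []
    for j in range(i + 1, len(tokens)):
        context.append(tokens[j])
        if tokens[j].lower() not in stop_words:
            features.append(("R", " ".join(context)))
            break
    return features

tokens = "He got a job from this company .".split()
print(proximity_features(tokens, tokens.index("job")))
# [('L', 'got a'), ('R', 'from this company')]
```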
<Section position="2" start_page="84" end_page="84" type="sub_section"> <SectionTitle> 3.2 Computing Word Similarity </SectionTitle>
<Paragraph position="0"> Once the contexts of a word are represented as a feature vector, the similarity between two words can be computed from their context vectors. We use $(u_1, \ldots, u_n)$ and $(v_1, \ldots, v_n)$ to denote the feature vectors for the words u and v respectively, where n is the number of feature types extracted from the corpus, and $f_i$ to denote the ith feature.</Paragraph>
<Paragraph position="1"> The point-wise mutual information (PMI) between a feature $f_i$ and a word u measures the strength of the association between them. It is defined as:
$$\mathrm{PMI}(f_i, u) = \log \frac{P(f_i, u)}{P(f_i)\,P(u)}$$
where $P(f_i, u)$ is the probability of $f_i$ co-occurring with u, $P(f_i)$ is the probability of $f_i$ co-occurring with any word, and $P(u)$ is the probability of any feature co-occurring with u. The similarity between words u and v is then defined as the cosine of PMI:
$$\mathrm{sim}(u, v) = \frac{\sum_{i=1}^{n} \mathrm{PMI}(f_i, u)\,\mathrm{PMI}(f_i, v)}{\sqrt{\sum_{i=1}^{n} \mathrm{PMI}(f_i, u)^2}\,\sqrt{\sum_{i=1}^{n} \mathrm{PMI}(f_i, v)^2}}$$</Paragraph>
<Paragraph position="2"> The choice of distributional similarity measure can affect the quality of the result to a statistically significant degree. Zhao and Lin (2004) show that the cosine of PMI is a significantly better similarity measure than several other commonly used measures.</Paragraph>
<Paragraph position="3"> Similar words are computed for each word in the unlabeled data. Only a subset of the similarity information is useful, because the similarity of words outside the vocabulary of the training and test data is not used. We only take into account similar words that occur in the training data more than 10 times, and only those word pairs whose point-wise mutual information is greater than a threshold (0.04). Table 2 shows the computed similar words for &quot;IL-0&quot;. A system trained on the labeled data alone may not be able to capture words like IL-0ra and IL-0Ralpha, which appear in the similar-word list of IL-0 and very likely belong to the same semantic category. Many different kinds of expressions for numbers (like 0, 00-00, 00.00, -00, 0/0, five, six, 0-, iii, IV, etc.) are grouped together automatically.</Paragraph> </Section> </Section>
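The cosine-of-PMI measure can be sketched in a few lines of Python. The count tables and names below are toy illustrations of our own; in the actual system the probabilities would be estimated from the 66,303,526 extracted proximity relationships.

```python
import math
from collections import Counter

# Sketch of the cosine-of-PMI similarity (Section 3.2).
# counts[(word, feature)] holds co-occurrence counts; probabilities are
# maximum-likelihood estimates derived from these counts.

def pmi(c_wf, c_w, c_f, total):
    """Point-wise mutual information; zero co-occurrence maps to 0."""
    if c_wf == 0:
        return 0.0
    return math.log((c_wf * total) / (c_w * c_f))

def cosine_of_pmi(u, v, counts):
    word_totals, feat_totals = Counter(), Counter()
    for (w, f), c in counts.items():
        word_totals[w] += c
        feat_totals[f] += c
    total = sum(counts.values())
    feats = sorted({f for (w, f) in counts if w in (u, v)})
    pu = [pmi(counts.get((u, f), 0), word_totals[u], feat_totals[f], total) for f in feats]
    pv = [pmi(counts.get((v, f), 0), word_totals[v], feat_totals[f], total) for f in feats]
    dot = sum(a * b for a, b in zip(pu, pv))
    norm = math.sqrt(sum(a * a for a in pu)) * math.sqrt(sum(b * b for b in pv))
    return dot / norm if norm else 0.0

# Toy counts: corruption and abuse share the contexts "cause" and "case".
counts = {("corruption", "cause"): 3, ("corruption", "case"): 2,
          ("abuse", "cause"): 2, ("abuse", "case"): 1, ("cat", "purr"): 5}
print(round(cosine_of_pmi("corruption", "abuse", counts), 3))
```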
<Section position="5" start_page="84" end_page="121" type="metho"> <SectionTitle> 4 HMM Model and Smoothing Schema </SectionTitle>
<Paragraph position="0"> We follow the HMM model introduced in (Zhou et al., 2004). The structure of an HMM comprises states and observations. In our model, each state is represented by a semantic tag, or by a POS tag if no semantic tag is available; each observation contains a word sequence. The main computational difficulty in (Zhou et al., 2004) is the probability of a tag given a word sequence, formula (1). We use the bigram-based formula (2) to estimate formula (1); if the bigram is unseen in the training data, we use the unigram-based formula (3); and if the unigram is also unseen, we use the unknown-word information, which is gathered from the low-frequency words in the training data:
$$P(t \mid w_1 \cdots w_n) \quad (1) \qquad P(t \mid w_{t-1}, w_t) \quad (2) \qquad P(t \mid w_t) \quad (3)$$
We find that about 26% of the bigrams $(w_{t-1}, w_t)$ in the testing data are unseen, so the smoothing is critical.</Paragraph>
<Paragraph position="1"> In order to compute formula (1), we can use the back-off approach (Katz, 1987; Bikel et al., 1999). Baseline1 and Baseline2 in our system use different back-off schemes.</Paragraph>
<Paragraph position="2"> The following similarity-based smoothing formula is introduced in (Lee, 1999):
$$P(t \mid w) = \sum_{w' \in S(w)} \frac{\mathrm{sim}(w, w')}{\sum_{w'' \in S(w)} \mathrm{sim}(w, w'')}\, P(t \mid w') \quad (4)$$
where S(w) is a set of candidate similar words and sim(w, w') is the similarity between words w and w'.</Paragraph>
<Paragraph position="3"> This word-similarity-based smoothing is used in our system to take advantage of the huge unlabeled corpus. In order to plug it into our HMM model, we made several extensions to formula (4).</Paragraph>
<Paragraph position="4"> For each word w, we define p as the distribution of w's tags as annotated in the training data. We use the KL-divergence to compute the distance between two such distributions p and q:
$$D(p \parallel q) = \sum_{t} p(t) \log \frac{p(t)}{q(t)}$$
and define the similarity between the tag distributions of words w and w' as:
$$\mathrm{sim}_{tag}(w, w') = e^{-D(p_w \parallel p_{w'})}$$
The similarity of words w and w' used in our system is defined as the harmonic average of the word similarity and the tag-distribution similarity:
$$\mathrm{sim}_{comb}(w, w') = \frac{2\, \mathrm{sim}(w, w')\, \mathrm{sim}_{tag}(w, w')}{\mathrm{sim}(w, w') + \mathrm{sim}_{tag}(w, w')}$$
Substituting this combined similarity into formula (4) for the bigram and unigram contexts, we get formulas (5) and (6); formula (5) is for bigram smoothing and formula (6) is for unigram smoothing:
$$P(t \mid w_{t-1}, w_t) = \sum_{w' \in S(w_t)} \frac{\mathrm{sim}_{comb}(w_t, w')}{\sum_{w'' \in S(w_t)} \mathrm{sim}_{comb}(w_t, w'')}\, P(t \mid w_{t-1}, w') \quad (5)$$
$$P(t \mid w_t) = \sum_{w' \in S(w_t)} \frac{\mathrm{sim}_{comb}(w_t, w')}{\sum_{w'' \in S(w_t)} \mathrm{sim}_{comb}(w_t, w'')}\, P(t \mid w') \quad (6)$$</Paragraph>
<Paragraph position="5"> Because it is natural to back off from bigram to unigram, in our system a back-off smoothing approach is combined with the word-similarity-based smoothing. We follow this procedure to compute formula (1):
1. Check the frequency of the bigram $(w_{t-1}, w_t)$. If the frequency is high (>10), use formula (2). Stop.
2. Check the frequency of the unigram $w_t$. If the frequency is high (>30), use formula (3). Stop.
3. Try formula (5) for bigram smoothing, and check the frequency sum of the similar words involved in the smoothing. If the sum is high (>10), use formula (5). Stop.
4. Try formula (6) for unigram smoothing, and check the frequency sum for this case. If the sum is high (>30), use formula (6). Stop.
5. If the bigram was seen in the training data, use formula (2). Stop.
6. If the unigram was seen in the training data, use formula (3). Stop.
7. Use the low-frequency (<5) word information in the training data and formula (3).
Our Baseline1 uses steps 5, 6 and 7; Baseline2 uses steps 1, 2, 5, 6 and 7.</Paragraph> </Section>
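To make the extended smoothing concrete, the sketch below combines KL-divergence-based tag similarity, the harmonic average, and formula (4)-style weighting. It is a minimal sketch under our own assumptions: the exponential form of the tag-distribution similarity, all function names, and the toy numbers are ours, not the paper's code.

```python
import math

# Sketch of word-similarity-based smoothing (Section 4).
# Assumptions (ours): sim_tag = exp(-KL divergence) and the combined
# similarity is the harmonic mean of word and tag-distribution similarity.

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) over the union of the two tag vocabularies; eps avoids log(0)."""
    tags = set(p) | set(q)
    return sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps)) for t in tags)

def combined_sim(word_sim, p_tags, q_tags):
    """Harmonic mean of distributional and tag-distribution similarity."""
    tag_sim = math.exp(-kl_divergence(p_tags, q_tags))
    if word_sim + tag_sim == 0:
        return 0.0
    return 2 * word_sim * tag_sim / (word_sim + tag_sim)

def smoothed_tag_prob(tag, w, similar, tag_dists, cond_prob):
    """Formula (4)-style smoothing: similarity-weighted average of
    P(tag | w') over the candidate similar words w' of w."""
    weights = {w2: combined_sim(s, tag_dists[w], tag_dists[w2])
               for w2, s in similar[w].items()}
    z = sum(weights.values())
    if z == 0:
        return 0.0
    return sum(weights[w2] * cond_prob(tag, w2) for w2 in weights) / z

# Toy example: smooth P(protein | "IL-0ra") using its similar word "IL-0".
tag_dists = {"IL-0": {"protein": 0.9, "DNA": 0.1},
             "IL-0ra": {"protein": 0.8, "DNA": 0.2}}
similar = {"IL-0ra": {"IL-0": 0.7}}
print(smoothed_tag_prob("protein", "IL-0ra", similar, tag_dists,
                        lambda t, w: tag_dists[w][t]))
```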
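The seven-step procedure reads naturally as a single decision function. Here is a minimal sketch, assuming count tables and callables that stand in for formulas (2), (3), (5), (6) and the unknown-word model; all names are illustrative.

```python
# Sketch of the back-off procedure for computing formula (1) (Section 4).
# counts["bigram"] / counts["unigram"] hold training-data frequencies;
# counts["sim_bigram"] / counts["sim_unigram"] hold the frequency sums of
# the similar words involved in the smoothing. The thresholds (10, 30)
# are taken from the paper.

def tag_probability(tag, prev_w, w, counts, probs):
    bigram_n = counts["bigram"].get((prev_w, w), 0)
    unigram_n = counts["unigram"].get(w, 0)

    if bigram_n > 10:                       # step 1: frequent bigram
        return probs["p2"](tag, prev_w, w)
    if unigram_n > 30:                      # step 2: frequent unigram
        return probs["p3"](tag, w)

    # steps 3-4: similarity smoothing, if the similar words are frequent enough
    if counts["sim_bigram"].get((prev_w, w), 0) > 10:
        return probs["p5"](tag, prev_w, w)
    if counts["sim_unigram"].get(w, 0) > 30:
        return probs["p6"](tag, w)

    if bigram_n > 0:                        # step 5: seen bigram
        return probs["p2"](tag, prev_w, w)
    if unigram_n > 0:                       # step 6: seen unigram
        return probs["p3"](tag, w)
    return probs["p_unknown"](tag)          # step 7: unknown-word model
```
</Paper>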