<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1218"> <Title>Adapting an NER-System for German to the Biomedical Domain</Title> <Section position="3" start_page="0" end_page="92" type="metho"> <SectionTitle> 2 A knowledge-poor approach to NER </SectionTitle> <Paragraph position="0"> The optimal practice in NER yields efficient and highly reliable results based only on cheaply available resources like an annotated corpus of reasonable size and non-annotated data. Approaches rich in handcrafted knowledge or dependent on other language technology tools suffer from several limitations: They are laborious to maintain and to adapt to new domains, especially with respect to the creation and evaluation of the domain-sensitive lists of NEs. Furthermore, the application of additional tools like part-of-speech tagger, syntactic chunker etc. increases processing time, and it is not clear at the moment whether such tools facilitate the task without additional adaptations to the new domain. In order to build an efficient and easy to adapt system, we developed a knowledge-poor approach. We refrain from * any additional linguistic tools like morphological analyser, part of speech tagger or syntactic chunker; * any handcrafted linguistic resources like dictionaries; * any handcrafted knowledge providing lists like gazetteers, lists of NEs or lists of trigger words.</Paragraph> <Paragraph position="1"> From a linguistic point of view, NEs are phenomena located at the phrase-level. Nevertheless, for the sake of straightforwardness, we restrict our model to single words. To overcome the knowledge sparseness, the so-called three-level model of word form observance was developed and successfully applied to German person names (Rossler, 2004). In Section 4 we discuss our attempts to apply this model to the biomedical domain.</Paragraph> <Paragraph position="2"> The approach is based on linear SVM classifiers.</Paragraph> <Paragraph position="3"> SVM (Vapnik, 1995) is a powerful machine learning algorithm for binary classification able to handle large numbers of parameters efficiently. It is common within the NLP community to use SVMs with non-linear kernels. Takeuchi and Col- null lier (2003) successfully applied a polynomial kernel function for biomedical NER. Beside the good classifier capabilities of non-linear kernels, they are very expensive in terms of processing time for training and applying. Therefore, we favor linear SVMs not suffering from these limitations.</Paragraph> <Paragraph position="4"> Instead of using surface words in combination with morphological analyses and/or handcrafted suffix and prefix lists, we represent words with a set of positional character n-grams. Using the training data, this set is compiled by extracting the last uni- and bigram, three trigrams from the end, and three trigrams from the beginning of every word. All the entries occurring less than four times are removed. Table 1 contains an example of this feature set f3. The representation is capable of capturing simple morphological regularities of NEs and the context words surrounding them. Additionally, we use deterministic word-surface features (feature set f1) commonly used in NER (see Bykel et al., 1997), indexing, for instance, whether a word form is capitalized, consists of numbers, contains capitals, etc. We also consider word length and map it to one dimension (feature set f2). 
<Paragraph position="5"> To capture the context of the word to classify, we use a six-word window consisting of the three preceding words, the current word, and the two succeeding words. All the features listed in Table 1 are extracted for all words of this window.</Paragraph>
<Paragraph position="6"> Table 1: The feature sets. Feature set f4 is described in Section 3.
f1: Word-surface features, e.g. "4-digit number", "ATCG-sequence", "Uppercase only", etc.
f2: Character-based word length.
f3: Sub-word form representation with positional character n-grams; "Hammer" is represented as "r", "er", and "mer" at the end, "ham" at first, "amm" at second, and "mme" at next-to-last position.
f4: Probabilities of all classes, where higher than zero, calculated by the second-order Markov model.</Paragraph>
</Section>
<Section position="4" start_page="92" end_page="93" type="metho">
<SectionTitle> 3 Adapting the System </SectionTitle>
<Paragraph position="0"> After adding ATCG-sequence (see Shen et al., 2003) and GreekLetter (see Collier et al., 2000) as domain-specific deterministic word-surface features, we ran first experiments on the GENIA (2003) corpus. All experiments were conducted with the SVMlight software package, freely available at http://svmlight.joachims.org. While inspecting the results, we noticed that special attention was necessary to address the correct boundary detection of the entities and the transformation of the output of the SVM classifiers into the IOB notation.</Paragraph>
<Paragraph position="1"> A first step to improve the boundary detection is based on the output of a second-order Markov model, intended to support the SVMs, which are not optimised for tagging linear sequences. We trained TnT (Brants, 1998), a Markov model implemented for POS tagging, on the surface words, and used the probabilities of all classes as features for the SVMs (feature set f4 in Table 1).</Paragraph>
<Paragraph position="2"> The second step was implemented within the post-processing component designed to transform the output of the SVM classifiers into the IOB notation. To handle the multi-class output, we set up a total of seven classifiers: five of them specific to the five NE classes, and two additional classifiers assigning a general begin tag and a general outside tag. Although a dynamic programming approach to resolve the multi-class issue for SVMs is an important desideratum, we implemented a simple heuristic as a first step.</Paragraph>
<Paragraph position="3"> To transform the output of the seven classifiers into the IOB output, we first applied a simple one-vs-rest method based on the decision values of the SVMs. The general begin tag was used to support the correct detection of the B-tags.</Paragraph>
<Paragraph position="4"> In a second post-processing step, we improved the results based on a definition of the revisability of an assigned label with respect to a competing label. A label is considered to be competing with the current label if it was assigned to the word before or the word after. According to this definition, a label is revisable if the competing label is among the three best labels and has a decision value higher than 0.2, or if the value of the outside classifier is lower than 0.2, i.e. the label OUTSIDE was not assigned with high confidence. This heuristic is sketched below.</Paragraph>
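The following is a minimal sketch of the two post-processing steps, assuming that `scores` maps each label (including "OUTSIDE") to the decision value of its binary classifier; the class names are assumed here to be those of the shared task, and the paper itself fixes only the 0.2 thresholds and the three-best condition.

```python
# Assumed names for the five NE classes; the paper does not list them.
NE_CLASSES = ("protein", "DNA", "RNA", "cell_line", "cell_type")

def one_vs_rest(scores):
    """First step: simple one-vs-rest decision, choosing the label
    whose binary SVM returned the highest decision value."""
    return max(scores, key=scores.get)

def is_revisable(scores, competing_label, threshold=0.2, top_k=3):
    """Second step: a label is revisable if the competing label
    (the label assigned to the preceding or succeeding word) is
    among the three best labels and has a decision value higher
    than 0.2, or if the value of the OUTSIDE classifier is lower
    than 0.2, i.e. OUTSIDE is not confident."""
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    competing_strong = (competing_label in best
                        and scores.get(competing_label, 0.0) > threshold)
    outside_weak = scores.get("OUTSIDE", 0.0) < threshold
    return competing_strong or outside_weak
```

Presumably a revisable label is then replaced by the competing label, smoothing adjacent decisions toward consistent IOB spans, although the paper does not spell out this final step.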
</Section>
<Section position="5" type="metho">
<SectionTitle> 4 Attempting to apply the three-level model to the biomedical domain </SectionTitle>
<Paragraph position="0"> The three-level model described in Rössler (2004) is motivated by the fact that lexical resources in the form of named-entity lists deal with surface words, i.e. word forms, and thus ignore the problems of homonymy and polysemy.</Paragraph>
<Paragraph position="1"> To address this issue, we distinguish three levels at which word forms and the semantic labels assigned to them can be observed, and show how these levels are related to and support NER:
* The local level describes a single occurrence of a word form. The correct labelling of these occurrences is the actual task of NER.
* The discourse level describes all occurrences of a word form within a text unit, together with the semantic labels assigned to them. Addressing word sense disambiguation, Gale et al. (1992) introduced the idea of a word sense located at the discourse level and observed a strong one-sense-per-discourse tendency, i.e. several occurrences of a polysemous word form tend to belong to the same semantic class within one discourse. It is common practice in NER to utilize the discourse level to disambiguate items in non-predictive contexts (see e.g. Mikheev et al., 1999).
* The corpus level describes all occurrences of a word form within all texts available to the application. The larger the corpus, the more likely it is that a particular word form is seen as a member of two or more semantic classes.</Paragraph>
<Paragraph position="2"> To utilize the discourse level, all words tagged as an entity within one MEDLINE abstract are stored in a dynamic lexicon. The processed discourse unit is then matched against the dynamic lexicon in order to detect entities in non-predictive contexts. To find the correct boundaries, the unit is post-processed as described in Section 3.</Paragraph>
<Paragraph position="3"> To reflect the issues concerning the polysemy and homonymy of lexical resources, we propose sophisticated word-form-based NE lists that represent how likely it is that a particular entry will be tagged with a particular label. These values are specific to a corpus, i.e. they are located at the corpus level. To create such resources, we propose a form of lexical bootstrapping, assuming that probabilities calculated on the basis of a weak classifier applied to a large unlabeled corpus are sufficient for our task. We therefore trained classifiers for all classes and applied them to a 30-million-word corpus extracted from MEDLINE (1999) using the search term ["blood cell" or "transcription factor"]. This automatically annotated corpus was used to create a corpus-specific lexicon containing about 95,000 word forms. For each entry, we extracted the total frequency of being tagged with a particular label, and the relative frequency of being tagged with a discretized decision value by the SVM classifiers, i.e. we set five thresholds and counted how often an item was labelled with a decision value reaching a particular threshold. One possible realization of this lexicon compilation is sketched below.</Paragraph>
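For concreteness, the following sketch shows one way the corpus-specific lexicon could be compiled. The threshold values and the (word, label, decision value) input layout are assumptions; the paper states only that five thresholds were used and that total and relative frequencies were recorded.

```python
from collections import defaultdict

# Hypothetical cut-offs: the paper says five thresholds were used
# to discretize the SVM decision values, but not which values.
THRESHOLDS = (0.0, 0.2, 0.4, 0.6, 0.8)

def build_corpus_lexicon(tagged_tokens):
    """Compile the corpus-level lexicon from an automatically
    annotated corpus.  `tagged_tokens` is an iterable of
    (word_form, label, decision_value) triples produced by the
    weak classifiers.  Returns, per word form and label, the
    total tagging frequency and the relative frequency with
    which the decision value reached each threshold."""
    totals = defaultdict(lambda: defaultdict(int))
    bins = defaultdict(lambda: defaultdict(lambda: [0] * len(THRESHOLDS)))
    for word, label, value in tagged_tokens:
        totals[word][label] += 1
        for i, t in enumerate(THRESHOLDS):
            if value >= t:
                bins[word][label][i] += 1
    # Relative frequency of reaching each threshold, per word and label.
    relative = {w: {lab: [c / totals[w][lab] for c in counts]
                    for lab, counts in labels.items()}
                for w, labels in bins.items()}
    return totals, relative
```

Each entry then characterizes, at the corpus level, how reliably a given word form attracts each label.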
<Paragraph position="4"> Both techniques failed completely: neither the utilization of the discourse level nor the lexical bootstrapping had a positive impact when applied to the biomedical domain. This raises the question of what is specific about the biomedical domain.</Paragraph>
<Paragraph position="5"> The utilization of the discourse level has proved valuable in most NE tasks, so its failure within the biomedical domain is surprising. The one-sense-per-discourse tendency is evidently weaker in the biomedical domain, since genes and proteins can share the same name and be mentioned in the same abstract. Additionally, the NEs occurring within the GENIA corpus consist on average of more than two words and are diverse in their appearance, even within one document. Since almost every word form, even brackets and stop words, can be part of an NE, it takes a great deal of work to develop heuristics that improve recall without dramatically lowering precision. Moreover, the method is highly sensitive to precision errors, as it propagates incorrectly tagged elements. Furthermore, it is questionable whether abstracts, given their enormous density and shortness, are appropriate text units for this method.</Paragraph>
<Paragraph position="6"> The failure of the lexical bootstrapping is more difficult to interpret, since this technique is not as well tested. In our experiments, it was successfully applied to German person names and also had some positive impact on German organization and location names. One source of problems may be the low precision of the classifier used to create the annotated corpus; we assume that a high-precision, low-recall classifier would produce better lexical resources. Another source may be the complexity and length of biological names: the restriction to single words is probably not appropriate for the bootstrapping process. In future research, we will investigate the bootstrapping of external evidence, i.e. we will focus not on learning the names themselves, but on the units that indicate the beginning or the end of a name class.</Paragraph>
</Section>
</Paper>