<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3325"> <Title>The Difficulties of Taxonomic Name Extraction and a Solution</Title> <Section position="4" start_page="127" end_page="128" type="metho"> <SectionTitle> 3 Preliminaries </SectionTitle> <Paragraph position="0"> This section introduces some preliminaries regarding word-level language recognition. We also describe a measure to quantify the user effort induced by interactions.</Paragraph> <Section position="1" start_page="127" end_page="127" type="sub_section"> <SectionTitle> 3.1 Measure for User Effort </SectionTitle> <Paragraph position="0"> In NLP, the f-Measure is a popular means of quantifying the performance of a word classifier:</Paragraph> <Paragraph position="2"> But components that use active learning have three possible outputs. If the decision between positive and negative is narrow, they may classify a word as uncertain and prompt the user. This prevents misclassifications, but induces intellectual effort. To quantify this effort as well, there are two further measures:</Paragraph> <Paragraph position="4"> Given this, Coverage C is defined as the fraction of all classifications that are not uncertain:</Paragraph> <Paragraph position="6"> To obtain a single measure for overall classification quality, we multiply f-Measure and coverage and define Quality Q as</Paragraph> <Paragraph position="8"/> </Section> <Section position="2" start_page="127" end_page="127" type="sub_section"> <SectionTitle> 3.2 Word-Level Language Recognition for Taxonomic Name Extraction </SectionTitle> <Paragraph position="0"> In earlier work (Sautter 2006), we have presented a technique to classify words either as parts of taxonomic names or as common English.</Paragraph> <Paragraph position="1"> It is based on two statistics that capture the N-Gram distributions of taxonomic names and of common English. Both statistics are built from examples of the respective languages. 
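The measures defined in Subsection 3.1 can be illustrated in code as follows; this is a minimal sketch, and the counts used in the example are hypothetical:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def coverage(n_certain, n_uncertain):
    """Fraction of all classifications that are not uncertain."""
    return n_certain / (n_certain + n_uncertain)

def quality(f, c):
    """Overall classification quality: f-Measure multiplied by coverage."""
    return f * c

# Hypothetical counts: 98% precision and recall, 3% uncertain classifications
f = f_measure(tp=98, fp=2, fn=2)              # 0.98
c = coverage(n_certain=970, n_uncertain=30)   # 0.97
print(round(quality(f, c), 4))                # 0.9506
```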
It uses active learning to deal with the lack of training data. Precision and recall reach a level of 98%.</Paragraph> <Paragraph position="2"> This is satisfactory, compared to common NER components. At the same time, the user has to classify about 3% of the words manually. In a text of 10,000 words, this would be 300 manual classifications. We deem this relatively high.</Paragraph> </Section> <Section position="3" start_page="127" end_page="128" type="sub_section"> <SectionTitle> 3.3 Formal Structure of Taxonomic Names </SectionTitle> <Paragraph position="0"> The structure of taxonomic names is defined by the rules of Linnaean nomenclature (Ereshefsky 1997). These rules are not very restrictive and allow many optional parts. For instance, both Prenolepis (Nylanderia) vividula Erin subsp. guatemalensis Forel var. itinerans Forel and Dolichoderus decollatus are taxonomic names. There are only two mandatory parts in such a name: the genus and the species. Table 1 shows the decomposition of the two examples. The parts with their names in brackets are optional. More formally, the rules of Linnaean nomenclature define the structure of taxonomic names as follows: * The genus is mandatory. It is a capitalized word, often abbreviated by its first one or two letters, followed by a dot.</Paragraph> <Paragraph position="1"> * The subgenus is optional. It is a capitalized word, often enclosed in brackets.</Paragraph> <Paragraph position="2"> * The species is mandatory. It is a lower case word. It is often followed by the name of the scientist who first described the species.</Paragraph> <Paragraph position="3"> * The subspecies is optional. It is a lower case word, often preceded by subsp. or subspecies as an indicator. It is often followed by the name of the scientist who first described it.</Paragraph> <Paragraph position="4"> * The variety is optional. It is a lower case word, preceded by var. or variety as an indicator. 
It is often followed by the name of the scientist who first described it.</Paragraph> </Section> </Section> <Section position="5" start_page="128" end_page="130" type="metho"> <SectionTitle> 4 Combining Techniques for Taxonomic Name Extraction </SectionTitle> <Paragraph position="0"> Due to its capability of learning at runtime, the word-level language recognizer needs only little training data, but it still needs some. In addition, the manual effort induced by uncertain classifications is high.</Paragraph> <Paragraph position="1"> Making use of the typical structure of taxonomic names, we can improve both aspects. First, we can use syntax-based rules to harvest training data directly from the documents. Second, we can use these rules to reduce the number of words the classifier has to deal with. However, as we will show later, it is not possible to find rules that extract taxonomic names with both high precision and high recall. But we have found rules that fulfill one of these requirements very well. In what follows, we refer to these as precision rules and recall rules, respectively.</Paragraph> <Section position="1" start_page="128" end_page="129" type="sub_section"> <SectionTitle> 4.1 The Classification Process </SectionTitle> <Paragraph position="0"> 1. We apply the precision rules. Every word sequence from the document that matches such a rule is a sure positive.</Paragraph> <Paragraph position="1"> 2. We apply the recall rules to the phrases that are not sure positives. A phrase not matching one of these rules is a sure negative.</Paragraph> <Paragraph position="2"> 3. We make use of domain-specific vocabulary and filter out word sequences containing at least one known negative word.</Paragraph> <Paragraph position="3"> 4. We collect a set of names from the set of sure positives (see Subsection 4.5). We then use these names to both include and exclude further word sequences.</Paragraph> <Paragraph position="4"> 5. 
We train the word-level language recognizer with the surely positive and surely negative words. We then apply it to the remaining uncertain word sequences.</Paragraph> <Paragraph position="5"> Figure 1 visualizes the classification process. At first sight, other orders seem to be possible as well, e.g., letting the language recognizer classify each word first and then applying the rules. But this is not feasible: It would require external training data. In addition, the language recognizer would have to classify all the words of the document. This would incur more manual classifications.</Paragraph> <Paragraph position="6"> This approach is similar to the bootstrapping algorithm proposed by Jones (1999). The difference is that this process works solely with the document it actually processes. In particular, it does not need any external data or a training phase. Average biosystematics documents contain about 15,000 words, which is less than 0.02% of the data used by Niu (2003). On the other hand, with the classification process proposed here, the accuracy of the underlying classifier has to be very high from the start.</Paragraph> </Section> <Section position="2" start_page="129" end_page="129" type="sub_section"> <SectionTitle> 4.2 Structural Rules </SectionTitle> <Paragraph position="0"> In order to make use of the structure of taxonomic names, we use rules that refer to this structure.</Paragraph> <Paragraph position="1"> We use regular expressions for the formal representation of the rules. In this section, we develop a regular expression matching any word sequence that conforms to the Linnaean rules of nomenclature (see 3.3). Table 2 provides some abbreviations to increase readability. We model taxonomic names as follows: * The genus is a capitalized word, often abbreviated. We denote it as <genus>, which stands for {<CapW>|<CapA>}.</Paragraph> <Paragraph position="2"> * The subgenus is a capitalized word, optionally surrounded by brackets. 
We denote it as <subGenus>, which stands for <CapW>|(<CapW>).</Paragraph> <Paragraph position="3"> * The species is a lower case word, optionally followed by a name. We denote it as <species>, which stands for <LcW>{_<Name>}?.</Paragraph> <Paragraph position="4"> * The subspecies is a lower case word, preceded by the indicator subsp. or subspecies, and optionally followed by a name. We denote it as <subSpecies>, standing for {subsp.|subspecies}_<LcW>{_<Name>}?.</Paragraph> <Paragraph position="5"> * The variety is a lower case word, preceded by the indicator var. or variety, and optionally followed by a name. We denote it as <variety>, which stands for {var.| variety}_<LcW>{_<Name>}?.</Paragraph> <Paragraph position="6"> A taxonomic name is now modeled as follows.</Paragraph> <Paragraph position="7"> We refer to the pattern as <taxName>: <genus>{_<subGenus>}? _<species>{_<subSpecies>}? {_<variety>}?</Paragraph> </Section> <Section position="3" start_page="129" end_page="129" type="sub_section"> <SectionTitle> 4.3 Precision Rules </SectionTitle> <Paragraph position="0"> Because <taxName> matches any sequence of words that conforms to the Linnaean rules, it is not very precise. The simplest match is a capitalized word followed by one in lower case. Any capitalized word at the beginning of a sentence, followed by a lower case word, is a match! To obtain more precise regular expressions, we rely on the optional parts of taxonomic names. In particular, we classify a sequence of words as a sure positive if it contains at least one of the optional parts <subGenus>, <subSpecies> and <variety>. Even though these regular expressions may produce false positives, our evaluation will show that this happens very rarely. 
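As an illustration, the <taxName> pattern can be approximated as a Python regular expression. The concrete sub-patterns below are our assumptions for this sketch, not the exact definitions from Table 2:

```python
import re

# Building blocks (assumed approximations of the abbreviations in Table 2)
CAP_W = r"[A-Z][a-z]+"        # capitalized word
CAP_A = r"[A-Z][a-z]?\."      # genus abbreviation: one or two letters and a dot
LC_W = r"[a-z][a-z-]+"        # lower case word, hyphens allowed
NAME = r"[A-Z][a-z]+"         # scientist's name, crudely approximated

genus = rf"(?:{CAP_W}|{CAP_A})"
sub_genus = rf"(?:{CAP_W}|\({CAP_W}\))"
species = rf"{LC_W}(?: {NAME})?"
sub_species = rf"(?:subsp\.|subspecies) {LC_W}(?: {NAME})?"
variety = rf"(?:var\.|variety) {LC_W}(?: {NAME})?"

tax_name = (rf"{genus}(?: {sub_genus})? {species}"
            rf"(?: {sub_species})?(?: {variety})?")

print(bool(re.fullmatch(tax_name, "Dolichoderus decollatus")))  # True
print(bool(re.fullmatch(
    tax_name,
    "Prenolepis (Nylanderia) vividula Erin subsp. guatemalensis Forel "
    "var. itinerans Forel")))                                   # True
# The pattern is deliberately weak: it also accepts sentence beginnings
print(bool(re.fullmatch(tax_name, "Additional evidence")))      # True
```

The last line shows why precision rules are needed on top of this pattern: any capitalized word followed by a lower case word, such as a sentence beginning, is accepted as well.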
Our set of precise regular expressions has three elements. We classify a word sequence as a sure positive if it matches at least one of these regular expressions; to this end, we combine them disjunctively and call the result <preciseTaxName>.</Paragraph> <Paragraph position="1"> A notion related to that of a sure positive is the one of a surely positive word. A surely positive word is a part of a taxonomic name that is not part of a scientist's name. For instance, the taxonomic name Prenolepis (Nylanderia) vividula Erin subsp. guatemalensis Forel var. itinerans Forel contains the surely positive words Prenolepis, Nylanderia, vividula, guatemalensis, and itinerans. We assume that surely positive words exclusively appear as parts of taxonomic names.</Paragraph> </Section> <Section position="4" start_page="129" end_page="130" type="sub_section"> <SectionTitle> 4.4 Recall Rules </SectionTitle> <Paragraph position="0"> <taxName> matches any sequence of words that conforms to the Linnaean rules, but there is a further issue: Enumerations of several species of the same genus tend to contain the genus only once.</Paragraph> <Paragraph position="1"> For instance, in &quot;Pseudomyrma arboris-sanctae Emery, latinoda Mayr and tachigalide Forel&quot;, we want to extract latinoda Mayr and tachigalide Forel as well. To address this, we make use of the surely positive words: We use them to extract parts of taxonomic names that lack the genus.</Paragraph> <Paragraph position="2"> Our technique also extracts the names of the scientists from the sure positives and collects them in a name lexicon. Based on the structure described in Section 3.3, a capitalized word in a sure positive is a name if it comes after the second position. From the sure positive Pseudomyrma (Minimyrma) arboris-sanctae Emery, the technique extracts Pseudomyrma, Minimyrma and arboris-sanctae. 
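This harvesting of surely positive words and scientists' names from a sure positive can be sketched as follows; the function name and the whitespace tokenization are our assumptions, not the actual implementation:

```python
# Indicator tokens are skipped when counting positions in a sure positive
INDICATORS = {"subsp.", "subspecies", "var.", "variety"}

def harvest(sure_positive):
    """Split a sure positive into surely positive words and scientists' names.

    Rule from Section 3.3: a capitalized word is a name if it comes
    after the second position; indicator tokens do not count.
    """
    positives, names = [], []
    position = 0
    for token in sure_positive.split():
        if token in INDICATORS:
            continue
        token = token.strip("()")    # drop brackets around the subgenus
        position += 1
        if position > 2 and token[0].isupper():
            names.append(token)      # goes into the name lexicon
        else:
            positives.append(token)  # surely positive word
    return positives, names

positives, names = harvest(
    "Prenolepis (Nylanderia) vividula Erin subsp. guatemalensis Forel "
    "var. itinerans Forel")
print(positives)  # ['Prenolepis', 'Nylanderia', 'vividula', 'guatemalensis', 'itinerans']
print(names)      # ['Erin', 'Forel', 'Forel']
```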
In addition, it would add Emery to the name lexicon.</Paragraph> <Paragraph position="3"> We cannot be sure that the list of surely positive words suffices to find all species names in an enumeration. Hence, our technique additionally collects all lower-case words followed by a word contained in the name lexicon. In the example, we extract latinoda Mayr and tachigalide Forel if Mayr and Forel are in the name lexicon.</Paragraph> </Section> <Section position="5" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 4.5 Data Rules </SectionTitle> <Paragraph position="0"> Because we want to achieve close to 100% recall, the recall rules are very weak. In consequence, many word sequences that are not taxonomic names are considered uncertain. Before the word-level language recognizer deals with them, we see some further ways to exclude negatives.</Paragraph> <Paragraph position="1"> Sure Negatives. As mentioned in Subsection 4.3, <taxName> matches any capitalized word followed by a word in lower case. This includes the start of any sentence. Making use of the sure negatives, we can recognize these phrases. In particular, our technique classifies any word sequence as negative that contains a word which is also in the set of sure negatives. For instance, in the sentence &quot;Additional evidence results from ...&quot;, Additional evidence matches <taxName>. Another sentence contains an additional advantage, which does not match <taxName>. Thus, the set of sure negatives contains an, additional, and advantage. Knowing that additional is a sure negative, we exclude the phrase Additional evidence.</Paragraph> <Paragraph position="2"> Names of Scientists. Though the names of scientists are valid parts of taxonomic names, they also cause false matches. The reason is that they are capitalized. A misclassification occurs if they are matched with the genus or subgenus part; <taxName> cannot exclude this. 
In addition, they might appear elsewhere in the text without belonging to a taxonomic name. Similarly to sure negatives, we exclude a match of <taxName> if the first or second word is contained in the name lexicon. For instance, in &quot;..., and Forel further concludes&quot;, Forel further matches <taxName>. If the name lexicon contains Forel, we know that it is not a genus, and thus exclude Forel further.</Paragraph> </Section> <Section position="6" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 4.6 Classification of Remaining Words </SectionTitle> <Paragraph position="0"> After applying the rules, some word sequences still remain uncertain. To deal with them, we use word-level language recognition. We train the classifier with the surely positive and surely negative words. We do not classify every word separately, but compute the classification score of all words of a sequence and then classify the sequence as a whole. This has several advantages: First, if one word of a sequence is uncertain, this does not automatically incur a feedback request. Second, if a word sequence is uncertain as a whole, the user gives feedback for the entire sequence. This resolves several uncertain words at the cost of only one feedback request. In addition, it is easier to determine the meaning of a word sequence than that of a single word.</Paragraph> </Section> </Section> </Paper>