2 Implementation

Figure 1 shows the overall system architecture.

[Figure 1. Overall system architecture: Concept-based Seeds, Decision List NE Learning, HMM NE Learning.]

Before the bootstrapping is started, a large raw training corpus is parsed. The bootstrapping experiment reported in this paper is based on a corpus containing ~100,000 news articles, ~88,000,000 words in total. The parsed corpus is saved into a repository, which supports fast retrieval through a keyword-based indexing scheme.

The unsupervised bootstrapping is performed as follows:

1. User provides concept-based seeds;
2. Retrieve parsing structures involving the concept-based seeds from the repository to train a decision list for NE classification;
3. Apply the learned rules to the NE candidates retrieved from the repository;
4. Construct an NE-annotated corpus using the tagged proper names and their neighboring words;
5. Train an HMM based on the annotated corpus.

A parser is necessary for concept-based NE bootstrapping because concept-based seeds share pattern similarity with the corresponding NEs only at the structural level, not at the string-sequence level. In fact, the anaphoric function of pronouns and common nouns in representing antecedent NEs indicates that proper names are substitutable by noun phrases headed by the corresponding common nouns or pronouns. For example, this man can substitute for the proper name John Smith in almost all structural patterns.

Five binary dependency relationships decoded by our parser are used for parsing-based NE rule learning: (i) a Has_Predicate(b): from logical subject a to verb b; (ii) a Object_Of(b): from logical object a to verb b; (iii) a Has_Amod(b): from noun a to its adjective modifier b; (iv) a Possess(b): from the possessive noun modifier a to head noun b; (v) a IsA(b): equivalence relation (including apposition) from one NP a to another NP b.

The concept-based seeds used in the experiments are: (i) he, she, his, her, him, man, woman for PER; (ii) city, province, town, village for LOC; (iii) company, firm, organization, bank, airline, army, committee, government, school, university for ORG.

From the parsed corpus in the repository, all instances (821,267) of the concept-based seeds involved in the five dependency relations are retrieved. Each seed instance is assigned a concept tag corresponding to its NE type; for example, each instance of he is marked as PER. The instances with concept tagging, plus their associated parsing relationships, are equivalent to an annotated NE corpus. Based on this training corpus, the Decision List Learning algorithm [Segal & Etzioni 1994] is used. The accuracy of each rule is evaluated using Laplace smoothing as follows:

accuracy = (positive + 1) / (positive + negative + m)

where positive and negative count the instances the rule tags correctly and incorrectly, and m is the number of NE types.
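To make the rule-scoring step concrete, the sketch below implements Laplace-smoothed accuracy for candidate decision-list rules. It is a minimal illustration, not the authors' code: the data layout (a list of (pattern, tag) pairs) and all function names are our own assumptions, and the smoothing denominator uses one pseudo-count per NE type.

```python
# Minimal sketch of decision-list rule scoring with Laplace smoothing.
# `instances` is assumed to be a list of (pattern, tag) pairs, where each
# pattern encodes a dependency relation anchored on a seed occurrence,
# e.g. ("Has_Predicate:said", "PER"). All names here are illustrative.

NE_TYPES = ("PER", "LOC", "ORG")

def rule_accuracy(instances, pattern, tag):
    """Laplace-smoothed accuracy of the candidate rule `pattern -> tag`."""
    matching = [t for p, t in instances if p == pattern]
    positive = sum(1 for t in matching if t == tag)
    # One pseudo-count per NE type keeps a pattern seen only once from
    # receiving a perfect score.
    return (positive + 1) / (len(matching) + len(NE_TYPES))

def learn_decision_list(instances, threshold=0.9):
    """Keep rules whose smoothed accuracy exceeds the threshold (0.9 in
    the experiments above), ordered best-first as a decision list."""
    candidates = {(p, t) for p, t in instances}
    scored = [(p, t, rule_accuracy(instances, p, t)) for p, t in candidates]
    kept = [r for r in scored if r[2] > threshold]
    return sorted(kept, key=lambda r: r[2], reverse=True)
```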
As the PER tag dominates the corpus due to the high occurrence frequency of he and she, learning is biased towards PER as the answer. To correct this bias, we adjust the instance counts as follows. Suppose there are a total of N_PER PER instances, N_LOC LOC instances, and N_ORG ORG instances; then, in the process of rule accuracy evaluation, the instance count for any NE type T is adjusted by the coefficient

min(N_PER, N_LOC, N_ORG) / N_T

so that no NE type dominates by raw frequency alone.

A total of 1,290 parsing-based NE rules with accuracy higher than 0.9 are learned.

Due to the unique equivalence nature of the IsA relation, we add the following IsA-based rules to the top of the decision list: IsA(seed) → tag of the seed, e.g., IsA(man) → PER.

This first, parsing-based learner is used to tag a raw corpus. First, we retrieve from the repository all named-entity candidates associated with at least one of the five parsing relationships. After applying the decision list to the 1,607,709 retrieved NE candidates, 33,104 PER names, 16,426 LOC names, and 11,908 ORG names are tagged. To improve the bootstrapping performance, we use the heuristic of one tag per domain for multi-word NEs, in addition to the one sense per discourse principle [Gale et al. 1992]. These heuristics are found to be very helpful both in increasing positive instances (tag propagation) and in decreasing spurious instances (tag elimination). The tag propagation/elimination scheme is adopted from [Yarowsky 1995]. After this step, a total of 367,441 proper names are classified, including 134,722 PER names, 186,488 LOC names, and 46,231 ORG names.

The classified proper-name instances lead to the construction of an automatically tagged training corpus consisting of the NE instances and their two neighboring words (left and right) within the same sentence.

In the final stage, a bigram HMM is trained on the above training corpus. The HMM training process follows [Bikel 1997].

3 Benchmarking

We used the same blind testing corpus of 300,000 words, containing 20,000 PER, LOC, and ORG instances, to measure the performance degradation of unsupervised learning from the existing supervised NE tagger (Table 1; P for Precision, R for Recall, F for F-measure, and F/D for F-measure degradation).

The performance for PER and LOC is above 80%, approaching that of supervised learning. The reason for the unsatisfactory ORG performance (52.7%) is not difficult to understand: there are numerous sub-types of ORG that cannot be represented by the fewer than a dozen concept-based seeds used in this experiment.

In addition to the key NE types in MUC, we also tested this method on recognizing user-defined NE types. We use the following concept-based seeds for the PRODUCT (PRO) NE: car, truck, vehicle, product, plane, aircraft, computer, software, operating system, database, book, platform, network. Table 2 shows the benchmarks for PRODUCT tagging.
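For reference, the benchmark figures in Tables 1 and 2 reduce to the following arithmetic. This sketch is ours, not the paper's: the function names are illustrative, and F/D is read here as the absolute drop in F-measure from the supervised baseline, which is an assumption.

```python
# Sketch of the benchmarking arithmetic behind Tables 1 and 2.

def precision_recall_f(correct, tagged, gold):
    """P, R, and F-measure from instance counts: `correct` tags out of
    `tagged` system outputs, against `gold` reference instances."""
    p = correct / tagged if tagged else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def f_degradation(f_unsupervised, f_supervised):
    """F/D: degradation of the unsupervised F-measure relative to the
    supervised tagger (assumed here to be the absolute difference)."""
    return f_supervised - f_unsupervised
```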