<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1038">
  <Title>Large-scale Controlled Vocabulary Indexing for Named Entities</Title>
  <Section position="4" start_page="277" end_page="278" type="metho">
    <SectionTitle>
3 Approach
</SectionTitle>
    <Paragraph position="0"> TTI was originally proposed as a tool to support controlled vocabulary indexing, and most early tests focused on narrowly defined legal and news topics such as insurable interest and earthquakes.</Paragraph>
    <Paragraph position="1"> TTI was also tested on a number of companies, people, organizations and places as topics. TTI was first put into production to categorize documents by broadly-defined topics such as Europe political and business news and federal tax law.</Paragraph>
    <Paragraph position="2"> When we began investigating the possibility of creating Entity Indexing, TTI was a natural starting point. It had demonstrated high accuracy and flexibility across a variety of topics and data types. Three problems were also apparent. First, TTI would not scale to support several thousand topics.</Paragraph>
    <Paragraph position="3"> Second, it took a long time to build a topic definition, about one staff day each. Third, topics were defined on a publication-specific basis. With then700 publications in our news archives in combination with our scale goals and the time needed to build topic definitions, the definition building costs were too high. We needed to scale the technology, and we needed to substantially automate the topic definition-building process.</Paragraph>
    <Paragraph position="4"> For Entity Indexing, we addressed scale concerns through software tuning, substantially improved memory management, a more efficient hash function in the lookup algorithm, and moving domain-specific functionality from topic definitions into the software. The rest of this paper focuses on the cost of building the definitions.</Paragraph>
    <Section position="1" start_page="277" end_page="277" type="sub_section">
      <SectionTitle>
3.1 Analyzing Companies in the News
</SectionTitle>
      <Paragraph position="0"> In order to reduce definition building costs, we originally believed that we would focus on increasing our reliance on TTI's training tools.</Paragraph>
      <Paragraph position="1"> Training data would have to include documents from a variety of publications if we were to be able to limit definitions to one per topic regardless of the publications covered.</Paragraph>
      <Paragraph position="2"> Unfortunately the data did not cooperate. Using a list of all companies on the major U.S. stock exchanges, we randomly selected 89 companies for investigation. Boolean searches were used to retrieve documents that mentioned those companies.</Paragraph>
      <Paragraph position="3"> We found that several of these companies were rarely mentioned in the news. One company was not mentioned at all in our news archives, a second one was mentioned only once, and twelve were mentioned only in passing. Several appeared as major references in only a few documents. In a second investigation involving 40,000 companies from various sources, fully half appeared in zero or only one news document in one two-year window.</Paragraph>
      <Paragraph position="4"> We questioned whether we even wanted to create topic definitions for such rarely occurring companies. Again, marketing and product positioning reasons dictated that we do so: it is easier to tell customers that the product feature covers all companies that meet one of a few criteria than it is to give customers a list of the individual companies covered. It is also reasonable to assume that public and larger companies may appear in the news at some future point.</Paragraph>
    </Section>
    <Section position="2" start_page="277" end_page="278" type="sub_section">
      <SectionTitle>
3.2 Company Name Usage
</SectionTitle>
      <Paragraph position="0"> While analyzing news articles about these companies, we noted how company names and their variants were used. For most companies discussed in documents that contained major references to the company, some form of the full company name typically appears in the leading text. Shorter variants typically are used for subsequent mentions as well as in the headline. Corresponding ticker symbols often appear in the headline or leading text, but only after a variant of the company name appears. In some publications, ticker symbols are used as shorter variants throughout the document.</Paragraph>
      <Paragraph position="1"> Acronyms are somewhat rare; when they are used, they behave like other shorter variants of the name.</Paragraph>
      <Paragraph position="2"> Shorter variants typically are substrings of the full company name beginning with the leftmost words in the name. For a company named  whose names consist of at least two words in addition to company designators such as Inc. or Corp. There is no consistency as to whether the company  designators contribute to acronyms. Thus for the above example</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="278" end_page="279" type="metho">
    <SectionTitle>
SCSI
SCS
</SectionTitle>
    <Paragraph position="0"> are both potential acronyms.</Paragraph>
    <Paragraph position="1"> Generating such variants from a full form of a company name is straightforward. Similarly, rules working with a table of equivalences can handle common abbreviations, so the variant</Paragraph>
    <Section position="1" start_page="278" end_page="278" type="sub_section">
      <SectionTitle>
Samson Computing Supply Incorporated
</SectionTitle>
      <Paragraph position="0"> can be generated from Samson Computing Supply Inc. We assign weights to term variants based on term length and the presence or absence of company designators. Longer variants with designators are regarded as less ambiguous than shorter variants without designators, and thus have higher weights. A table of particularly problematic one-word variants, such as American, National and General, is used to make a weighting distinction between these and other one-word variants. One-word variants are also marked so they do not match lower case strings during lookup. A label is assigned to each variant to indicate its function and relative strength in the document categorization process.</Paragraph>
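      <Paragraph> The variant generation and weighting described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production implementation: the function names, the contents of the designator table, and every numeric weight are hypothetical, and only the three problematic one-word variants are taken from the text.

```python
# Hypothetical equivalence table for company designators (not from the paper).
DESIGNATORS = {"Inc.": "Incorporated", "Corp.": "Corporation", "Co.": "Company"}
# One-word variants that the text names as particularly problematic.
PROBLEM_WORDS = {"American", "National", "General"}


def weight(text, has_designator):
    """Illustrative weight: longer variants and designators score higher."""
    n_words = len(text.split())
    if n_words == 1 and text in PROBLEM_WORDS:
        return 0.25  # assumed penalty for ambiguous one-word variants
    return float(n_words) + (1.0 if has_designator else 0.0)


def generate_variants(full_name):
    """Return a dict mapping each generated variant to its weight."""
    words = full_name.split()
    designator = words[-1] if words[-1] in DESIGNATORS else None
    core = words[:-1] if designator else words
    variants = {}
    # Left-anchored substrings of the name, without a designator.
    for n in range(1, len(core) + 1):
        short = " ".join(core[:n])
        variants[short] = weight(short, has_designator=False)
    if designator:
        # Full name with the abbreviated and the spelled-out designator.
        for form in (designator, DESIGNATORS[designator]):
            full = " ".join(core + [form])
            variants[full] = weight(full, has_designator=True)
    # Acronyms only for names with at least two words besides the designator.
    if len(core) >= 2:
        acronym = "".join(w[0] for w in core)
        variants[acronym] = weight(acronym, has_designator=False)
        if designator:
            # The designator may or may not contribute, so emit both forms.
            with_d = acronym + designator[0]
            variants[with_d] = weight(with_d, has_designator=False)
    return variants
```

      Calling generate_variants("Samson Computing Supply Inc.") in this sketch produces the left-anchored short forms, both designator forms, and the acronyms SCS and SCSI. Labels and the lower-case matching restriction on one-word variants are omitted for brevity.</Paragraph>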
    </Section>
    <Section position="2" start_page="278" end_page="279" type="sub_section">
      <SectionTitle>
3.3 Generating Topic Definitions
</SectionTitle>
      <Paragraph position="0"> Our company controlled vocabulary indexing process requires definition builders to provide a primary CVT and zero or more secondary CVTs for each targeted company. The CVTs are the primary input to the automatic topic definition generation process. If the definition builder provides the company name and ticker symbol to be used as CVTs for some company, as in</Paragraph>
      <Paragraph position="2"> the following definition can be generated automatically:</Paragraph>
      <Paragraph position="4"> That two acronyms were generated points out a potential problem with using robust variant generation as a means to automatically build topic definitions. Overgeneration will produce some name variants that have nothing to do with the company. However, although overgeneration of variants routinely occurs, testing showed that such overgeneration has little adverse effect.</Paragraph>
      <Paragraph position="5"> This approach to automatically generating topic definitions is successful for most companies, including those that appear rarely in our data, because most company names and their variants have consistent structure and use patterns. There are exceptions. Some companies are so well-known that they often appear in news articles without the corresponding full company name. Large companies and companies with high visibility (e.g., Microsoft, AT&amp;T, NBC and Kmart) are among these.</Paragraph>
      <Paragraph position="6"> Other companies simply have unusual names. Our authority file is an editable text file in which definition builders store and maintain the primary and secondary CVTs for each company. It also allows builders to specify exception information that can be used to override any or all of the results of the automatic definition generation process. In addition, builders can use two additional labels to identify especially strong name variants (e.g., IBM for International Business Machines) and related terms whose presence in a document provides disambiguating context (e.g., Delta, airport and flights for American Airlines, often referred to only as American). For our initial release of 15,000 companies, 17% of the definitions had some manual intervention beyond providing primary and secondary CVTs. Entity definitions built entirely manually usually took less than thirty minutes apiece. Overall, less than five minutes on average was spent per topic on definition building.</Paragraph>
      <Paragraph position="7"> This includes the time used to identify the targeted companies and add their primary and secondary CVTs to the authority file. Populating the authority file is required regardless of the technical approach used.</Paragraph>
    </Section>
    <Section position="3" start_page="279" end_page="279" type="sub_section">
      <SectionTitle>
3.4 Applying Definitions to Documents
</SectionTitle>
      <Paragraph position="0"> All topic definitions contain a set of labeled terms to look up. The document categorization process combines these into a large lookup table. A lookup step applies the table to a document and records term frequency information. If a match occurs in the headline or leading text, extra &quot;frequency&quot; is recorded in order to place extra emphasis on lookup matches in those parts of the document. If the same term is in several definitions (e.g., American is a short name variant in hundreds of definitions), frequency information is recorded for each definition.</Paragraph>
      <Paragraph position="1"> Once the end of the document is reached, frequency and term label-based weights are used to calculate a score for each topic. If the score exceeds some threshold, corresponding CVTs are added to the document. Typically a few matches of high-weight terms or a variety of lower-weighted terms are necessary to produce scores above the threshold. A document may be about more than one targeted topic.</Paragraph>
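      <Paragraph> The lookup and scoring steps above might be sketched as follows. This is an assumption-laden illustration rather than the system described here: the names build_lookup and categorize, the LEAD_BONUS and THRESHOLD values, and the use of plain case-sensitive substring counting are all hypothetical choices made to keep the example short.

```python
from collections import defaultdict

LEAD_BONUS = 1    # assumed extra frequency per match in headline or lead
THRESHOLD = 3.0   # assumed score needed before a CVT is assigned


def build_lookup(definitions):
    """Combine {cvt: {term: weight}} into one table keyed by term."""
    table = defaultdict(list)
    for cvt, terms in definitions.items():
        for term, term_weight in terms.items():
            # A term shared by several definitions keeps one entry per topic.
            table[term].append((cvt, term_weight))
    return table


def categorize(headline, lead, body, definitions, threshold=THRESHOLD):
    """Return the sorted CVTs whose score reaches the threshold."""
    table = build_lookup(definitions)
    scores = defaultdict(float)
    parts = ((headline, LEAD_BONUS), (lead, LEAD_BONUS), (body, 0))
    for text, bonus in parts:
        for term, entries in table.items():
            count = text.count(term)  # simplistic substring matching
            if count == 0:
                continue
            frequency = count * (1 + bonus)
            # Record the frequency for every definition that lists the term.
            for cvt, term_weight in entries:
                scores[cvt] += frequency * term_weight
    return sorted(cvt for cvt, score in scores.items() if score >= threshold)
```

      In this sketch a single headline match of a high-weight term can be enough to pass the threshold, while an isolated match of a low-weight term such as American is not, which mirrors the behavior described above. A real lookup would tokenize the text instead of counting substrings, since substring counting lets a short variant match inside a longer one.</Paragraph>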
    </Section>
    <Section position="4" start_page="279" end_page="279" type="sub_section">
      <SectionTitle>
3.5 System Implementation
</SectionTitle>
      <Paragraph position="0"> The tools used to build and maintain topic definitions were implemented in C/C++ on UNIX-based workstations. The document categorization process was implemented in PL1 and a proprietary lexical scanner, and operates in a mainframe MVS environment.</Paragraph>
    </Section>
  </Section>
</Paper>