<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0113">
  <Title>Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window Based Approaches</Title>
  <Section position="3" start_page="143" end_page="145" type="metho">
    <SectionTitle>
2 Gold Standards Evaluation
2.1 Thesauri
</SectionTitle>
    <Paragraph position="0"> Roget's Thesaurus is readily available via anonymous ftp (see footnote 1). In it are collected more than 30,000 unique words arranged in a shallow hierarchy under 1000 topic numbers such as Existence (Topic Number 1), Inexistence (2), Substantiality (3), Unsubstantiality (4), ..., Rite (998), Canonicals (999), and Temple (1000). Although this is far from the total number of semantic axes of which one could think, it does provide a wide swath of commonly accepted associations of English language words. We would expect that any system claiming to extract semantics from text should find some of the relations contained in this resource.</Paragraph>
    <Paragraph position="1"> By transforming the online source of such a thesaurus, we use it as a gold standard by which to measure the results of different similarity extraction techniques. This measurement is done by checking whether the 'similar words' discovered by each technique are placed under the same heading in this thesaurus.</Paragraph>
    <Paragraph position="2"> In order to create this evaluation tool, we extracted a list consisting of all single-word entries from our thesaurus with their topic number or numbers. A portion of the extracted Roget list in Figure 1 shows that abatement appears under two topics: Nonincrease (36) and Discount (813). Abbe and abbess both belong under the same topic heading 996 (Clergy). The extracted Roget list has 60,071 words (an average of 60 words for each of the 1000 topics). Of these, 32,000 are unique (an average of two occurrences for each word). If we assume for simplicity that each word appears under exactly 2 of the 1000 topics, and that the words are uniformly distributed, the chance that two words w1 and w2 appear under the same topic heading is approximately 2 × 2/1000 = 0.004,</Paragraph>
    <Paragraph position="4"> since w1 is under 2 topic headings and since the chance that w2 is under any specific topic heading is 2/1000. The probability of finding two randomly chosen words together under the same heading, then, is 0.4%.</Paragraph>
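The chance baseline above can be checked with a short calculation. The following is a minimal Python sketch under the same simplifying assumptions (each word under exactly 2 of 1000 topics, uniform distribution). The function names are hypothetical, and the second function, which computes an exact value for comparison, is an illustrative addition that is not part of the evaluation described here.

```python
from math import comb

def chance_same_topic(topics_per_word: int = 2, n_topics: int = 1000) -> float:
    """Approximate chance that two random words share a topic heading.

    Follows the simplification in the text: w1 sits under `topics_per_word`
    headings, and w2 falls under any specific heading with probability
    topics_per_word / n_topics.
    """
    return topics_per_word * (topics_per_word / n_topics)

def exact_same_topic(topics_per_word: int = 2, n_topics: int = 1000) -> float:
    """Exact value if each word's headings form a uniform random subset
    (an assumption added here for comparison only)."""
    disjoint = comb(n_topics - topics_per_word, topics_per_word)
    return 1 - disjoint / comb(n_topics, topics_per_word)

print(f"approximate: {chance_same_topic():.4%}")
print(f"exact:       {exact_same_topic():.4%}")
```

The two values differ only slightly, which suggests the rounded figure of 0.4% is a reasonable baseline under these assumptions.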
    <Paragraph position="5"> Our measurement of a similarity extraction technique using this gold standard is performed as follows.</Paragraph>
    <Paragraph position="6"> 1 For example, in March 1993 it was available via anonymous ftp at the Internet site world.std.com in the directory /obi/obi2/Gutenberg/etext91, as well as at over 30 other sites.</Paragraph>
    <Paragraph position="7">  Given a corpus, use the similarity extraction method to derive similarity judgements between the words appearing in the corpus. For each word, take the word appearing as most similar. Examine the human compiled thesaurus to see if that pair of words appears under the same topic number. If it does, count this as a hit.</Paragraph>
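The hit-counting procedure just described can be sketched as follows. This is a minimal Python illustration, not the original implementation: the function name, the data layout (a mapping from each word to its set of topic numbers), and the decision to skip pairs in which a word is missing from the thesaurus are all assumptions. The topic numbers for abatement, abbe and abbess are taken from the text; the entry for "discount" is invented for the example.

```python
from typing import Dict, Set

def thesaurus_hit_rate(most_similar: Dict[str, str],
                       topics: Dict[str, Set[int]]) -> float:
    """Fraction of (word, most-similar word) pairs sharing a topic number.

    Pairs in which either word is absent from the thesaurus are skipped;
    the text does not say how such pairs were handled.
    """
    hits = 0
    scored = 0
    for word, neighbour in most_similar.items():
        if word not in topics or neighbour not in topics:
            continue
        scored += 1
        # A hit: the two words appear under at least one common heading.
        if topics[word].intersection(topics[neighbour]):
            hits += 1
    return hits / scored if scored else 0.0

# Toy thesaurus extract (the "discount" entry is hypothetical).
topics = {
    "abatement": {36, 813},
    "abbe": {996},
    "abbess": {996},
    "discount": {813},
}
# Hypothetical output of a similarity extraction technique.
most_similar = {"abbe": "abbess", "abatement": "discount", "abbess": "abatement"}
print(thesaurus_hit_rate(most_similar, topics))
```

A technique's hit rate can then be compared with the chance level of roughly 0.4% derived earlier.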
    <Paragraph position="8"> This procedure was followed on the 4 megabyte corpus described below to test two semantic extraction techniques, one using syntactically derived contexts to judge similarity and one using window-based contexts. The results of these evaluations are also given below.</Paragraph>
    <Section position="1" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
2.2 Dictionary
</SectionTitle>
      <Paragraph position="0"> We also use an online dictionary as a gold standard following a slightly different procedure.</Paragraph>
      <Paragraph position="1"> Many researchers have drawn on online dictionaries in attempts to do semantic discovery [Sparck Jones, 1986; Vossen et al., 1989; Wilks et al., 1989], whereas we use one here only as a tool for evaluating extraction techniques from unstructured text. We have an online version of Webster's 7th available, and we use it in evaluating discovered similarity pairs.</Paragraph>
      <Paragraph position="2"> This evaluation is based on the assumption that similar words will share some overlap in their dictionary definitions. In order to determine overlap, the entire literal definition of each word is broken into a list of individual words. This list of tokens contains all the words in the dictionary entry, including dictionary-related markings and abbreviations. In order to clean this list of non-information-bearing words, we automatically removed any word or token 1. of fewer than 4 characters, 2. among the most common 50 words of 4 or more letters in the Brown corpus, 3. among the most common 50 words of 4 or more letters appearing in the definitions of Webster's 7th, 4. listed as a preposition, quantifier, or determiner in our lexicon, 5. of 4 or more letters from a common information retrieval stoplist, 6. among the dictionary-related set: slang, attrib, kind, word, brit, ness, tion, ment.</Paragraph>
      <Paragraph position="3"> These conditions generated a list of 434 stopwords of 4 or more characters, which are removed from any dictionary definition. The remaining words are sorted into a list. For example, the list produced for the definition of the word administration is given in Figure 2. For simplicity no morphological analysis or any other modifications were performed on the tokens in these lists. [Figure 2. Dictionary entry: ad-min-is-tra-tion n. 1. the act or process of administering 2. performance of executive duties :: c&lt;MANAGEMENT&gt; 3. the execution of public affairs as distinguished from policy making 4. a) a body of persons who administer b) i&lt;cap&gt; :: a group constituting the political executive in a presidential government c) a governmental agency or board 5. the term of office of an administrative officer, or body. Filtered list: administer, administering, administrative, affairs, agency, board, constituting, distinguished, duties, execution, executive, government, governmental, making, management, office, officer, performance, persons, policy, political, presidential, public, term.]</Paragraph>
      <Paragraph position="3"> To compare two words using these lists, we take the intersection of the two words' filtered definition lists. For example, the intersection between the lists derived from the dictionary entries of diamond and ruby is (precious, stone); between right and freedom it is (acting, condition, political, power, privilege, right). In order to use these dictionary-derived lists as an evaluation tool, we perform the following experiment on a corpus. Given a corpus, take the similarity pairs derived by the semantic extraction technique in order of decreasing frequency of the first term. Perform the intersection of their respective two dictionary definitions as described above. If this intersection contains two or more elements, count this as a hit.</Paragraph>
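The filtering and intersection steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation: the regular-expression tokenizer, the function names, the two short definitions, and the tiny stopword list are all invented for the example (the real list has 434 entries and the real definitions come from Webster's 7th).

```python
import re
from typing import Iterable, List

def filtered_definition(definition: str, stopwords: Iterable[str]) -> List[str]:
    """Tokenize a definition, drop short tokens and stopwords, sort the rest.

    No morphological analysis is applied, as in the text. Lowercasing and the
    letters-only tokenizer are assumptions made for this sketch.
    """
    stop = set(stopwords)
    tokens = re.findall(r"[a-z]+", definition.lower())
    kept = {t for t in tokens if len(t) >= 4 and t not in stop}
    return sorted(kept)

def is_hit(def_a: str, def_b: str, stopwords: Iterable[str],
           min_overlap: int = 2) -> bool:
    """True if the filtered definitions share two or more words."""
    a = set(filtered_definition(def_a, stopwords))
    b = set(filtered_definition(def_b, stopwords))
    return len(a.intersection(b)) >= min_overlap

# Invented definitions and stopwords, for illustration only.
STOP = {"that", "very"}
DIAMOND = "a very hard precious stone of crystallized carbon"
RUBY = "a precious stone that is a red corundum"

print(filtered_definition(DIAMOND, STOP))
print(filtered_definition(RUBY, STOP))
print(is_hit(DIAMOND, RUBY, STOP))
```

With these invented definitions the overlap is (precious, stone), which meets the two-element threshold and would be counted as a hit.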
      <Paragraph position="4"> This evaluation method was also performed on the results of both semantic extraction techniques applied to the corpus described in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="145" end_page="146" type="metho">
    <SectionTitle>
3 Corpus
</SectionTitle>
    <Paragraph position="0"> The corpus used for evaluating the two techniques was extracted from Grolier's Encyclopedia for other experiments in semantic extraction. In order to generate a relatively coherent corpus, we extracted only those sentences which contained the word Harvard or one of the thirty hyponyms found under the word institution in WordNet (see footnote 2) [Miller et al., 1990], viz. institution, establishment, charity, religion, ..., settlement. This produced a corpus of 3.9 megabytes of text.</Paragraph>
    <Paragraph position="1"> 2 WordNet was not used itself as a gold standard since its hierarchy is very deep and its inherent notion of semantic classes is not as clearly defined as in Roget.</Paragraph>
  </Section>
  <Section position="5" start_page="146" end_page="146" type="metho">
    <SectionTitle>
4 Semantic Extraction Techniques
</SectionTitle>
    <Paragraph position="0"> We will use these gold standard evaluation techniques to compare two techniques for extracting similarity lists from raw text.</Paragraph>
    <Paragraph position="1"> The first technique [Grefenstette, 1992] extracts the syntactic context of each word throughout the corpus. The corpus is divided into lexical units via a regular grammar; each lexical unit is assigned a list of context-free syntactic categories and a normalized form. Then a linear-time stochastic grammar similar to the one described in [de Marcken, 1990] selects a most probable category for each word. A syntactic analyzer described in [Grefenstette, 1993] chunks noun and verb phrases and creates relations within chunks and between chunks. A noun's context becomes all the other adjectives, nouns, and verbs that enter into syntactic relations with it.</Paragraph>
    <Paragraph position="2"> As a second technique, more similar to classical knowledge-poor techniques [Phillips, 1985] for judging word similarity, we do not perform syntactic disambiguation and analysis, but simply consider some window of words around a given word as forming the context of that word. We assume a lexicon, which we have, that gives all the possible parts of speech for a word. Each word in the corpus is looked up in this lexicon, as in the first technique, in order to normalize the word and know its possible parts of speech [Evans et al., 1991]. A noun's context will be all the words that can be nouns, adjectives, or verbs within a certain window around the noun. The window that was used was all nouns, adjectives, or verbs on either side of the noun within ten words and within the same sentence.</Paragraph>
    <Paragraph position="3"> In both cases we will compare nouns to each other, using their contexts. In the first case, the disambiguator determines whether a given ambiguous word is a noun or not. In the second case, we will simply decide that a word that can be either a noun or a verb, or either a noun or an adjective, is a noun. This distinction between the two techniques, using a cursory syntactic analysis or not, allows us to evaluate what is gained by the addition of this processing step.</Paragraph>
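The window-based context extraction can be sketched in Python as follows. This is a minimal illustration, not the original implementation. It assumes that the sentence is already tokenized and normalized, that the window of ten is counted in token positions, and that the lexicon is a mapping from each word to its set of possible parts of speech; the tag names, the function name, and the toy sentence are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Tags that qualify a word as a context attribute (tag names are assumed).
CONTENT_TAGS = {"noun", "adj", "verb"}

def window_contexts(sentence: List[str],
                    lexicon: Dict[str, Set[str]],
                    window: int = 10) -> List[Tuple[str, str]]:
    """Return (noun, context word) pairs for one tokenized sentence.

    A token counts as a noun if "noun" is among its possible tags; its
    context is every token within `window` positions on either side that
    can be a noun, adjective or verb. No disambiguation is attempted.
    """
    pairs = []
    for i, word in enumerate(sentence):
        if "noun" not in lexicon.get(word, set()):
            continue
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue
            other = sentence[j]
            if lexicon.get(other, set()).intersection(CONTENT_TAGS):
                pairs.append((word, other))
    return pairs

# Toy lexicon and sentence, invented for illustration.
lexicon = {
    "the": {"det"}, "a": {"det"},
    "university": {"noun"}, "library": {"noun"},
    "founded": {"verb"}, "large": {"adj"},
}
sentence = ["the", "university", "founded", "a", "large", "library"]
print(window_contexts(sentence, lexicon))
```

Because the function is called once per sentence, contexts never cross sentence boundaries, which matches the restriction described above.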
    <Paragraph position="4"> Figure 3 below shows the types of contexts extracted by the selective syntactic technique and by the windowing technique for a sentence from the corpus.</Paragraph>
    <Paragraph position="5"> Once context is extracted for each noun, the contexts are compared for similarity using a weighted Jaccard measure [Grefenstette, 1993]. In order to reduce run time for the similarity comparison, only those nouns appearing more than 10 times in the corpus were retained. A total of 2661 unique nouns appear 10 times or more. For the windowing technique, 33,283 unique attributes with which to judge the words are extracted. The similarity judging run takes 4 full days on a DEC 5000, compared to 3 and 1/2 hours for the similarity calculation using data from the syntactic technique, due to the greatly increased number of attributes for each word. For each noun, we retain the noun rated as most similar by the Jaccard similarity measure. Figure 4 shows some examples of words found most similar by both techniques.</Paragraph>
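The comparison step can be sketched in Python as follows. The text does not give the weighting scheme of [Grefenstette, 1993], so this sketch uses the generic weighted Jaccard form (sum of minimum weights divided by sum of maximum weights) with non-negative attribute weights supplied by the caller. The function names, the weights, and the toy nouns are all hypothetical.

```python
from typing import Dict

def weighted_jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Generic weighted Jaccard similarity between two attribute vectors.

    Each vector maps an attribute to a non-negative weight; a missing
    attribute has weight 0. How the weights are computed is left open here.
    """
    attrs = set(a).union(b)
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    return num / den if den else 0.0

def most_similar(contexts: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each noun, return the other noun with the highest similarity."""
    best = {}
    for noun in contexts:
        others = [o for o in contexts if o != noun]
        if others:
            best[noun] = max(
                others,
                key=lambda o: weighted_jaccard(contexts[noun], contexts[o]),
            )
    return best

# Invented contexts and weights, for illustration only.
contexts = {
    "college": {"found": 1.0, "private": 0.5},
    "university": {"found": 1.0, "private": 0.5, "large": 0.5},
    "river": {"large": 1.0, "flow": 1.0},
}
print(most_similar(contexts))
```

This all-pairs comparison grows with both the number of nouns and the number of attributes per noun, which is consistent with the much longer run time reported for the windowing technique.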
  </Section>
</Paper>