<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1061">
  <Title>A SEMANTIC CONCORDANCE</Title>
  <Section position="4" start_page="303" end_page="303" type="metho">
    <SectionTitle>
3. THE BROWN CORPUS
</SectionTitle>
    <Paragraph position="0"> The textual component of our universal semantic concordance is taken from the Brown Corpus [3, 4]. The corpus was assembled at Brown University in 1963-64 under the direction of W. Nelson Francis with the intent of making it broadly representative of American English writing. It contains 500 samples, each approximately 2,000 words long, for a total of approximately 1,014,000 running words of text, where a "word" is defined graphically as a string of contiguous alphanumeric characters with a space at either end.</Paragraph>
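The graphic definition of a "word" is easy to operationalize. The following minimal Python sketch (our illustration, not part of the original corpus tooling) counts words in roughly that sense by splitting on whitespace; punctuation stays attached to tokens, so it only approximates the alphanumeric definition.

    def graphic_words(text):
        """Return the graphically defined words: maximal strings of
        non-space characters bounded by spaces or line ends."""
        return text.split()

    sample = "The horse and men were saved, but the oxen drowned."
    print(len(graphic_words(sample)))  # 10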
    <Paragraph position="1"> The genres of writing range from newspaper reporting to technical writing, and from fiction to philosophical essays.</Paragraph>
    <Paragraph position="2"> The computer-readable form of the Brown Corpus has been used in a wide variety of research studies, and many laboratories have obtained permission to use it. It was initially used for studies of word frequencies, and subsequently was made available with syntactic tags for each word. Since it is well known in a variety of contexts, and widely available, the Brown Corpus seemed a good place to begin.</Paragraph>
  </Section>
  <Section position="5" start_page="303" end_page="303" type="metho">
    <SectionTitle>
4. SEMANTIC TAGGING
</SectionTitle>
    <Paragraph position="0"> Two contrasting strategies for connecting a lexicon and a corpus emerge depending on where the process starts. The targeted approach starts with the lexicon: target a polysemous word, extract all sentences from the corpus in which that word occurs, categorize the instances and write definitions for each sense, and create a pointer between each instance of the word and its appropriate sense in the lexicon; then target another word and repeat the process. The targeted approach has the advantage that concentrating on a single word should produce better definitions---it is, after all, the procedure that lexicographers regard as ideal. And it also makes immediately available a classification of sentences that can be used to test alternative methods of automatic sense resolution.</Paragraph>
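As an illustration of the first step of the targeted approach, the sketch below (our own, not the authors' code) pulls every sentence in which a target word occurs, using a simple case-insensitive token match.

    def target_sentences(sentences, target):
        """Return the sentences that contain the target word
        (case-insensitive, with surrounding punctuation stripped)."""
        target = target.lower()
        hits = []
        for sentence in sentences:
            tokens = [t.strip('.,;:!?"').lower() for t in sentence.split()]
            if target in tokens:
                hits.append(sentence)
        return hits

    corpus = ["The horse and men were saved, but the oxen drowned.",
              "He went down the hall to Eugene's bathroom."]
    print(target_sentences(corpus, "horse"))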
    <Paragraph position="1"> The alternative strategy starts with the corpus and proceeds through it word by word: the sequential approach. This procedure has the advantage of immediately revealing deficiencies in the lexicon: not only missing words (which could be found more directly), but also missing senses and indistinguishable definitions--deficiencies that would not surface so quickly with the targeted approach. Since the promise of improvements in WordNet was a major motive for pursuing this research, we initially adopted the sequential approach for the bulk of our semantic tagging.</Paragraph>
    <Paragraph position="2"> A second advantage of the sequential approach emerged as the work proceeded. One objective test of the adequacy of a lexicon is to use it to tag a sample of text, and to record the number of times it fails to have a word, or fails to have the appropriate sense for a word. We have found that such records for WordNet show considerable variability depending on the particular passage that is tagged, but over several months the averaged estimates of its coverage have been slowly improving: it is currently averaging a little better than 96%.</Paragraph>
  </Section>
  <Section position="6" start_page="303" end_page="304" type="metho">
    <SectionTitle>
5. CONTEXT: A TAGGING INTERFACE
</SectionTitle>
    <Paragraph position="0"> The task of semantically tagging a text by hand is notoriously tedious, but the tedium can be reduced with an appropriate user interface. ConText is an X-windows interface designed specifically for annotating written texts with WordNet sense tags [5]. Since WordNet contains only open-class words, ConText is used to tag only nouns, verbs, adjectives, and adverbs; that is to say, only about 50% of the running words in the Brown Corpus are semantically tagged.</Paragraph>
    <Paragraph position="1"> Manual tagging with ConText requires a user to examine each word of the text in its context of use and to decide which WordNet sense was intended. In order to facilitate this task, ConText displays the word to be tagged in its context, along with the WordNet synsets for all of the senses of that word (in the appropriate part of speech). For example, when the person doing the tagging reaches "horse" in the sentence: The horse and men were saved, but the oxen drowned.</Paragraph>
    <Paragraph position="2"> ConText displays WordNet synsets for five meanings of the noun "horse":
1. sawhorse, horse, sawbuck, buck (a framework used by carpenters)
2. knight, horse (a chess piece)
3. horse (a gymnastic apparatus)
4. heroin, diacetyl morphine, H, horse, junk, scag, smack (a morphine derivative)
5. horse, Equus caballus (herbivorous quadruped)
The tagger uses the cursor to indicate the appropriate sense (5, in this example), at which point ConText attaches a label, or semantic tag, to that word in the text. ConText then moves on to "men," the next content word, and the process repeats. If the word is missing, or if the appropriate sense is missing, the tagger can insert comments calling for the necessary revisions of WordNet.</Paragraph>
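A display much like the one above can be reproduced today with NLTK's WordNet interface; this is only an illustrative sketch, and because it queries a later WordNet release the senses and their ordering may differ from the five listed here.

    from nltk.corpus import wordnet as wn  # requires a prior nltk.download('wordnet')

    # List the noun senses of "horse" with their synonyms and glosses,
    # roughly the information ConText shows the human tagger.
    for i, synset in enumerate(wn.synsets("horse", pos=wn.NOUN), start=1):
        print(i, ", ".join(synset.lemma_names()), "-", synset.definition())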
    <Paragraph position="3"> 5.1. Input to ConText
In the current version of ConText, text to be tagged semantically must be preprocessed to indicate collocations and proper nouns (by concatenating them with underscores) and to provide syntactic tags. Since different corpora come in different formats and so require slightly different preprocessing, we have not tried to incorporate the preprocessor into ConText itself.</Paragraph>
    <Paragraph position="4"> A tokenizer searches the input text for collocations that WordNet knows about and when one is found it is made into a unit by connecting its parts with underscores. For example, if a text contains the collocation "took place," the tokenizer will convert it to "took_place." ConText can then display the synset for "take place" rather than successive synsets for "take" and "place." Syntactic tags indicate the part of speech of each word in the input text. We have used an automatic syntactic tagger developed by Eric Brill [6] which he generously adapted to our needs. For example, "store" can be a noun or a verb; when the syntactic tagger encounters an instance of "store" it tries to decide from the context whether it is being used as a noun or a verb. ConText then uses this syntactic tag to determine which part of speech to display to the user. ConText also uses syntactic tags in order to skip over closed-class words. Since the automatic syntactic tagger sometimes makes mistakes, ConText allows the user to change the part of speech that is being displayed, or to tag words that should not have been skipped.</Paragraph>
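A minimal sketch of the collocation step described above (our reconstruction, not the actual tokenizer): adjacent tokens that form a known WordNet collocation are joined with an underscore. Only two-word collocations are handled here; the real tokenizer also knows longer ones.

    def join_collocations(tokens, known_collocations):
        """Greedily join adjacent token pairs that form a known collocation,
        turning the two tokens took and place into the single token took_place."""
        out, i = [], 0
        while i < len(tokens):
            pair = (tokens[i].lower(), tokens[i + 1].lower()) if i + 1 < len(tokens) else None
            if pair in known_collocations:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(join_collocations("the meeting took place yesterday".split(),
                            {("took", "place")}))
    # ['the', 'meeting', 'took_place', 'yesterday']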
    <Paragraph position="5"> After the text has been syntactically tagged, all contiguous strings of proper nouns are joined with an underscore. For example, the string "Mr. Charles C. Carpenter" is output as "Mr._Charles_C._Carpenter." Here, too, the user can manually correct any mistaken concatenations.</Paragraph>
    <Paragraph position="6"> An example may clarify what is involved in preprocessing. The 109th sentence in passage k13 of the Brown Corpus is: He went down the hall to Eugene's bathroom, to turn on the hot-water heater, and on the side of the tub he saw a pair of blue wool swimming trunks.</Paragraph>
    <Paragraph position="7"> After preprocessing, this sentence is passed to ConText in the following form: br-k13:109: He/PP went_down/VB the/DT hall/NN to/TO</Paragraph>
    <Paragraph position="9"> The version displayed to the tagger, however, looks like the Brown Corpus, except that collocations are indicated by underscores. Note, incidentally, that the processor has made a mistake in this example: "went_down" (as in "the ship went down") is not the sense intended here.</Paragraph>
    <Section position="1" start_page="304" end_page="304" type="sub_section">
      <SectionTitle>
5.2. Output of ConText
</SectionTitle>
      <Paragraph position="0"> The output of ConText is a file containing the original text annotated with WordNet semantic tags; semantic tags are given in square brackets, and denote the particular WordNet synset that is appropriate. For example, when "hall" is tagged with [noun.artifact.1] it means that the word is being used to express the concept defined by the synset containing "hall" in the noun.artifact file. (Since WordNet is constantly growing and changing, references to the lexicographers' files have been retained; if the lexical component were frozen, some more general identifier could be used instead.) In cases where the appropriate sense of a word is not in WordNet, the user annotates that word with a comment that is later sent to the appropriate lexicographer.</Paragraph>
      <Paragraph position="1"> After the lexicographer has edited WordNet, the text must be retagged. In the retag mode, ConText skips from one commented word to the next.</Paragraph>
      <Paragraph position="2"> In addition to the syntactic and semantic tags, ConText adds SGML markers and reformats the text one word to a line.</Paragraph>
      <Paragraph position="3"> The SGML markers delimit sentences &lt;s&gt;, sentence numbers &lt;stn&gt;, words in the text &lt;wd&gt;, base forms of text words &lt;mwd&gt;, comments &lt;cmt&gt;, proper nouns &lt;pn&gt;, part-of-speech tags &lt;tag&gt; and semantic tags &lt;sn&gt; or &lt;msn&gt;. The sentence preprocessed above might come out of ConText looking like this: Note that the tokenizer's mistaken linking of "went_down" has now been corrected by the tagger. Also note "&lt;cmt&gt;WORD_MISSING&lt;/cmt&gt;" on line 16 of the output: that comment indicates that the tagger has connected "hot-water" and "heater" to form the collocation "hot-water heater," which was not in WordNet. This illustrates the kind of comments that are passed on to the lexicographers, who use them to edit or add to WordNet.</Paragraph>
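The exact ConText output for this sentence is not reproduced here, but the sketch below (with tag names taken from the list above and an invented word list) shows the general one-word-per-line, SGML-marked layout; it is our approximation, not the program's actual output routine.

    def emit_sgml(sentence_number, tagged_words):
        """Write one word per line, wrapped in SGML-style markers.
        tagged_words is a list of (word, pos_tag, semantic_tag_or_None)."""
        lines = ["<s>", "<stn>%d</stn>" % sentence_number]
        for word, pos, sense in tagged_words:
            line = "<wd>%s</wd><tag>%s</tag>" % (word, pos)
            if sense is not None:
                line += "<sn>%s</sn>" % sense
            lines.append(line)
        lines.append("</s>")
        return "\n".join(lines)

    # Illustrative words and tags only; not the paper's actual output.
    print(emit_sgml(109, [("He", "PP", None),
                          ("hall", "NN", "noun.artifact.1"),
                          ("saw", "VB", "verb.perception.1")]))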
      <Paragraph position="4"> The WordNet database is constantly growing and changing.</Paragraph>
      <Paragraph position="5"> Consequently, previously tagged texts must be updated periodically. In the update mode, ConText searches the tagged files for pointers to WordNet senses that have subsequently been revised. A new semantic tag must then be inserted by the tagger.</Paragraph>
    </Section>
    <Section position="2" start_page="304" end_page="304" type="sub_section">
      <SectionTitle>
5.3. Tracking
</SectionTitle>
      <Paragraph position="0"> As the number of semantically tagged files increased, the difficulty of keeping track of which files had been preprocessed, which had been tagged, which were ready to be retagged, which had been retagged, and which were complete and cleared for use made it necessary to create a master tracking system that would handle the record keeping automatically. Scripts were written that allowed an administrator to preprocess files and add them to the tracking system. Once files are in the tracking system, other scripts keep a log of all the tagging activities pertaining to each file, and ensure that taggers will not try to perform operations that are invalid for files with a given status. The administrator can easily generate simple reports on the status of all files in the tracking system.</Paragraph>
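The bookkeeping can be pictured as a small state machine over file statuses. The sketch below is our assumption about how such tracking might look; the status names and transitions are illustrative, not the authors' actual scripts.

    # Allowed moves through the tagging pipeline (assumed status labels).
    TRANSITIONS = {
        "preprocessed": {"tagged"},
        "tagged": {"retag_ready", "complete"},
        "retag_ready": {"retagged"},
        "retagged": {"complete"},
    }

    def advance(status, filename, new_state):
        """Record a tagging operation, refusing moves that are invalid
        for the file's current status."""
        current = status.get(filename, "preprocessed")
        if new_state not in TRANSITIONS.get(current, set()):
            raise ValueError("%s: cannot move from %s to %s" % (filename, current, new_state))
        status[filename] = new_state

    status = {}
    advance(status, "br-k13", "tagged")
    print(status)  # {'br-k13': 'tagged'}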
    </Section>
  </Section>
  <Section position="7" start_page="304" end_page="306" type="metho">
    <SectionTitle>
6. QUERYING THE TAGGED TEXT
</SectionTitle>
    <Paragraph position="0"> A program to query the semantically tagged database has also been written: prsent (print sentences) allows a user to retrieve sentences by entering the base form of a word and its semantic tag. It was developed as a simple interface to the semantic concordance, and puts the burden of knowing the word's semantic tag on the user. This program is useful to the lexicographers, who are intimately familiar with WordNet semantic tags and who use it to find sample sentences. A more robust interface is needed, however.</Paragraph>
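In spirit, prsent amounts to a lookup keyed on a (base form, semantic tag) pair. The sketch below is our approximation of that behaviour, with an assumed in-memory layout for the concordance; it is not the actual program.

    def prsent(concordance, base_form, semantic_tag):
        """Return every sentence in which base_form carries semantic_tag.
        concordance maps (base_form, semantic_tag) to a list of sentences."""
        return concordance.get((base_form, semantic_tag), [])

    concordance = {("hall", "noun.artifact.1"):
                   ["He went down the hall to Eugene's bathroom ..."]}
    print(prsent(concordance, "hall", "noun.artifact.1"))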
    <Paragraph position="1"> Presently under development is a comprehensive querying tool that will allow a user the flexibility of specifying various retrieval criteria and display options. Envisioned is an X-Windows application with two main windows: one area for entering searching information and another for displaying the retrieved sentences. A primary search key is the only required component. Additional search keys can be specified to find words that co-occur in sentences. This alone is a powerful improvement over prsent. Other  options will restrict or expand the retrieval, as listed here: 1. Search only given part(s) of speech.</Paragraph>
    <Paragraph position="2"> 2. Search only for a specific sense.</Paragraph>
    <Paragraph position="3"> 3. Expand search to include sentences for synonyms of search key.</Paragraph>
    <Paragraph position="4"> 4. Expand search to include sentences for hyponyms of search key.</Paragraph>
    <Paragraph position="5"> 5. Use primary key and all secondary keys, or primary key and any secondary key.</Paragraph>
    <Paragraph position="6"> 6. Search for a secondary key that is within n words of the primary key.</Paragraph>
    <Paragraph position="7">  As important as specifying searching criteria is how the retrieved information is displayed. An option will be provided to display retrieved sentences in a concordance format (all the target words vertically aligned and surrounded by context to the window's borders) or left justified. Search keys will be highlighted in the retrieved sentences.</Paragraph>
    <Paragraph position="8"> Implementation of this program requires the creation of a "master list" of semantically tagged words. Each line in the alphabetized list contains the target word, its semantic tag, and for each sentence containing the word, a list of all the co-occurring nouns, verbs, adjectives, and adverbs with numbers indicating their position in the sentence. For example, the sentence already dissected provides a context for "hall" that might look like this:</Paragraph>
    <Paragraph position="10"> Collecting entries for this sense of &amp;quot;hall&amp;quot; provides valuable information about the contexts in which it can occur.</Paragraph>
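Since the master-list entry itself is not reproduced above, the following sketch shows one plausible way to build such an index from tagged sentences; the input format and the sample tags are assumptions for illustration only.

    from collections import defaultdict

    def build_master_list(tagged_sentences):
        """Index (word, semantic_tag) -> contexts, where each context lists the
        co-occurring open-class words with their positions in the sentence.
        Each sentence is a list of (position, word, semantic_tag) triples."""
        index = defaultdict(list)
        for sentence in tagged_sentences:
            for position, word, tag in sentence:
                context = [(p, w) for p, w, _ in sentence if p != position]
                index[(word, tag)].append(context)
        return index

    sentence = [(2, "went", "verb.motion.2"), (4, "hall", "noun.artifact.1"),
                (7, "bathroom", "noun.artifact.2")]
    print(build_master_list([sentence])[("hall", "noun.artifact.1")])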
  </Section>
  <Section position="8" start_page="306" end_page="307" type="metho">
    <SectionTitle>
7. APPLICATIONS
</SectionTitle>
    <Paragraph position="0"> Our reasons for building this universal semantic concordance were to test and improve the coverage of WordNet and to develop resources for developing and testing procedures for automatic sense resolution in context. It should be pointed out, however, that semantic concordances can have other uses.</Paragraph>
    <Section position="1" start_page="306" end_page="306" type="sub_section">
      <SectionTitle>
7.1. Instruction
</SectionTitle>
      <Paragraph position="0"> Dictionaries are said to have evolved from the interlinear notations that medieval scholars added for difficult Latin words [7]. Such notations were found to be useful in teaching students; as the number of such notations grew, collections of them were extracted and arranged in lists. When the lists took on a life of their own their educational origins were largely forgotten. A semantic concordance brings this story back to its origins: lexical "footnotes" indicating the meaning that is appropriate to the context are immediately available electronically.</Paragraph>
      <Paragraph position="1"> One obvious educational use of a semantic concordance would be for people trying to learn English as a second language. By providing them with the appropriate sense of an unfamiliar word, they are spared the task of selecting a sense from the several alternatives listed in a standard dictionary. Moreover, they can retrieve other sentences that illustrate the same usage of the word, and from such sentences they can acquire both local and topical information about the use of a word: (1) local information about the grammatical constructions in which that word can express the given concept, and (2) topical information about other words that are likely to be used when that concept is discussed. A use for specific semantic concordances would be in science education: much of the new learning demanded of beginning students in any field of science is terminological.</Paragraph>
    </Section>
    <Section position="2" start_page="306" end_page="306" type="sub_section">
      <SectionTitle>
7.2. Sense Frequencies
</SectionTitle>
      <Paragraph position="0"> Much attention has been paid to word frequencies, but relatively little to the frequencies of occurrence of different meanings. Some lexicographers have attempted to order the senses of polysemous words from the most to the least frequent, but the more general question has not been asked because the data for answering it have not been available.</Paragraph>
      <Paragraph position="1"> We have enough tagged text now, however, to get an idea what such data would look like. For example, here are preliminary data for the 10 most frequent concepts expressed by nouns, based on some 80 selections from the Brown Corpus:
172 {year, (time period)}
144 {person, individual, someone, man, mortal, human, soul, (a human being)}
139 {man, adult_male, (a grown man)}
105 {consequence, effect, outcome, result, upshot, (a phenomenon that follows and is caused by some previous phenomenon)}
104 {night, night_time, dark, (time after sunset and before sunrise while it is dark outside)}
102 {kind, sort, type, form, ("sculpture is a form of art" or "what kind of man is this?")}
94 {eye, eyeball, oculus, optic, peeper, (organ of sight)}
89 {day, daytime, daylight, (time after sunrise and before sunset while it is light outside)}
88 {set, class, category, type, family, (a collection of things sharing a common attribute)}
87 {number, count, complement, (a definite quantity)}
Our limited experience suggests, however, that such statistics depend critically on the subject matter of the corpus that is used.</Paragraph>
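Given a semantically tagged corpus, producing such a list reduces to counting semantic tags. A minimal sketch, assuming the tagged text has been read into (word, semantic tag) pairs; the tags shown are illustrative:

    from collections import Counter

    def sense_frequencies(tagged_words, top_n=10):
        """Count how often each semantic tag (i.e. each concept) occurs and
        return the top_n most frequent concepts."""
        return Counter(tag for _, tag in tagged_words).most_common(top_n)

    stream = [("year", "noun.time.1"), ("years", "noun.time.1"),
              ("hall", "noun.artifact.1")]
    print(sense_frequencies(stream))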
    </Section>
    <Section position="3" start_page="306" end_page="307" type="sub_section">
      <SectionTitle>
7.3. Sense Co-occurrences
</SectionTitle>
      <Paragraph position="0"> One shortcoming of WordNet that several users have pointed out to us is its lack of topical organization. Peter Mark Roget's original conception of his thesaurus relied heavily on his list of topics, which enabled him to pull together in one place all of the words used to talk about a given topic. This tradition of topical organization has survived in many modern thesauri, even though it requires a double look-up by the reader. For example, under "baseball" a topically organized thesaurus would pull together words like "batter," "team," "lineup," "diamond," "homer," "hit," and so on. Topical organization obviously facilitates sense resolution: if the topic is baseball, the meaning of "ball" will differ from its meaning when the topic is, say, dancing. In WordNet, those same words are scattered about: a baseball is an artifact, batters are people, a team is a group, a lineup is a list, a diamond is a location, a homer is an act, to hit is a verb, and so on. By itself, WordNet does not provide topical groupings of words that can be used for sense resolution.</Paragraph>
      <Paragraph position="1"> One solution would be to draw up a list of topics and index all of the WordNet synsets to the topics in which they are likely to occur. Chapman [8], for example, uses 1,073 such classes and categories. But such lists are necessarily arbitrary. A universal semantic concordance should be able to accomplish the same result in a more natural way. That is to say, a passage discussing baseball would use words together in their baseball senses; a passage discussing the drug trade would use words together with senses appropriate to that topic, and so on. Instead of a long list of topics, the corpus should include a large variety of passages.</Paragraph>
      <Paragraph position="2"> In order to take advantage of this aspect of universal semantic concordances, it is necessary to be able to query the textual component for associated concepts. Data on sense co-occurrences build up slowly, of course, but they will be a valuable by-product of this line of work.</Paragraph>
      <Paragraph position="3"> 7.4. Testing
We are developing a version of the ConText interface that can be used for psychometric testing. The tagger's task in using ConText resembles an extended multiple-choice examination, and we believe that that feature can be adapted to test reading comprehension. Given a text that has already been tagged, readers' comprehension can be tested by seeing whether they are able to choose correct senses on the basis of the contexts of use.</Paragraph>
      <Paragraph position="4"> No doubt there are other, even better uses for semantic concordances. As the variety of potential applications grows, however, the need to automate the process of semantic tagging will become ever more pressing. But we must begin with what we have. We are now finishing a first installment of semantically tagged text consisting of 100 passages from the Brown Corpus; as soon as that much has been completed and satisfactorily cleaned up, we plan to make it, and the corresponding WordNet database, available to other laboratories that also have permission to use the Brown Corpus. We expect that such distribution will stimulate further uses for semantic concordances, uses that we have not yet imagined.</Paragraph>
    </Section>
  </Section>
</Paper>