<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0114">
  <Title>Statistical versus symbolic parsing for captioned-information retrieval</Title>
  <Section position="3" start_page="112" end_page="116" type="metho">
    <SectionTitle>
3. Example captions
</SectionTitle>
    <Paragraph position="0"> To illustrate the problems posed by the corpus, we present some example captions. All are single-case.</Paragraph>
    <Paragraph position="1"> an/apq-89 xan-1 radar set in nose of t-2 buckeye modified aircraft bu# 7074, for flight evaluation test.</Paragraph>
    <Paragraph position="2"> 3/4 overall view of aircraft on runway. This is typical of many captions: two noun phrases, each terminated with a period, where the first describes the photographic subject and the second describes the picture itself. Also typical are the complex nominal-compound strings, &amp;quot;an/apq-89 xan-1 radar set&amp;quot; and &amp;quot;t-2 buckeye modified aircraft bu# 7074&amp;quot;. Domain knowledge, or statistics as we shall argue, is necessary to recognize &amp;quot;an/apq-89&amp;quot; as a radar type, &amp;quot;xan-1&amp;quot; as a version number for that radar, &amp;quot;t-2&amp;quot; as an aircraft type, &amp;quot;buckeye&amp;quot; as a slang additional name for a T-2, &amp;quot;modified&amp;quot; as a conventional adjective, and &amp;quot;bu# 7074&amp;quot; as an aircraft code ID.</Paragraph>
    <Paragraph position="3"> program walleye, an/awg-16 fire control pod on a-4c bu# 147781 aircraft, china lake on tail, fit test.</Paragraph>
    <Paragraph position="4"> 3/4 front overall view and closeup 1/4 front view of pod.</Paragraph>
    <Paragraph position="5"> This illustrates some common domain-dependent noun-phrase syntax. &amp;quot;A-4c bu# 147781&amp;quot; is a common pattern of &lt;equipment-type&gt; &lt;prefix-code&gt; &lt;code-number&gt;, a pattern frequent enough to deserve its own grammar rule. Similarly &amp;quot;an/awg-16 fire control pod&amp;quot; is the common pattern of &lt;equipment-name&gt; &lt;equipment-purpose&gt; &lt;equipment-type&gt;, and &amp;quot;3/4 front overall view&amp;quot; is of the form &lt;view-qualifier&gt; &lt;view-qualifier&gt; &lt;view-type&gt;.</Paragraph>
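Patterns like these can be captured by simple dedicated rules. The following regular-expression sketch is our own illustration of the &lt;equipment-type&gt; &lt;prefix-code&gt; &lt;code-number&gt; pattern, not the paper's actual grammar rule:

```python
import re

# A minimal sketch of the caption-specific pattern
# <equipment-type> <prefix-code> <code-number>, e.g. "a-4c bu# 147781".
EQUIPMENT_ID = re.compile(
    r"(?P<equipment_type>[a-z]+-\d+[a-z]?)\s+"   # e.g. "a-4c", "t-2"
    r"(?P<prefix_code>bu#)\s*"                   # Bureau-number prefix
    r"(?P<code_number>\d+)"                      # e.g. "147781"
)

def parse_equipment_id(text):
    """Return (type, prefix, number) if the phrase matches, else None."""
    m = EQUIPMENT_ID.search(text)
    if m is None:
        return None
    return (m.group("equipment_type"), m.group("prefix_code"),
            m.group("code_number"))

print(parse_equipment_id("a-4c bu# 147781 aircraft"))
```

A real grammar would of course enumerate more prefix codes than "bu#"; the point is only that a single rule suffices for a frequent pattern.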
    <Paragraph position="6"> graphics presentation tid progress 76. sea site update, wasp head director and hawk screech/sun visor radars, tdp portion only, excellent.</Paragraph>
    <Paragraph position="7"> This illustrates the need for domain-dependent lexicon information. Here &amp;quot;wasp&amp;quot;, &amp;quot;hawk&amp;quot;, and &amp;quot;sun visor&amp;quot; should not be interpreted in their common English word senses, but as special equipment terms. Furthermore, &amp;quot;progress 76&amp;quot; means &amp;quot;progress in 1976&amp;quot;, and &amp;quot;excellent&amp;quot; refers to the quality of the picture. And the &amp;quot;head director&amp;quot; is not a person but a guidance system, and the &amp;quot;sea site&amp;quot; is not in the sea but a dry lakebed flooded with water to a few inches. Such unusual word senses strongly call for inference from domain-dependent statistics. They are also a good argument for natural-language processing for information retrieval instead of keyword matching.</Paragraph>
    <Paragraph position="8"> aerial low oblique, looking s from inyokem rd at main gate down china lake bl to bowman rd. on l, b to t, water reservoirs, trf crcl, pw cmpnd, vieweg school, capehart b housing, burroughs hs, cimarron gardens, east r/c old duplex stor.</Paragraph>
    <Paragraph position="9"> lot. on r, b to t, trngl, bar s motel, arrowsmith, comarco, hosp and on to bowman rd.</Paragraph>
    <Paragraph position="10"> This illustrates the problems with the misspellings and nonstandard abbreviations in the captions. &amp;quot;Trf crcl&amp;quot; is supposed to be &amp;quot;traffic circle&amp;quot;, &amp;quot;trngl&amp;quot; is &amp;quot;triangle&amp;quot;, &amp;quot;capehart b&amp;quot; is &amp;quot;capehart base&amp;quot;, but &amp;quot;b to t&amp;quot; is &amp;quot;bottom to top&amp;quot;. &amp;quot;Vieweg&amp;quot;, which looks like a misspelling of &amp;quot;viewed&amp;quot;, is actually the correct name of a former base commander, but &amp;quot;inyokem&amp;quot;, which looks correct, actually is a misspelling of &amp;quot;Inyokern&amp;quot;, a nearby town. Such abbreviations and misspellings can only be found by reference to known domain words and using heuristics. per-heps, parachute extraction rocket-helicopter escape propulsion system, test setup, 700# f in launcher showing 50# deadweight, nylon strap, and parachute cannister. This illustrates the difficulties of interpreting the numerous acronyms in the captions. Here the first word of the above is an immediately-explained acronym; a careful search for such constructs helps considerably, as often an acronym is explained in at least one caption. But even explained acronyms cause difficulties. We can generally take the subject of the appositive phrase after the acronym as the type of the acronym, &amp;quot;system&amp;quot; in this case, but how the other words relate to it is complicated and less determined by conventional English syntax than by the need to obtain a cute acronym.</Paragraph>
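Finding immediately-explained acronyms can be approximated by checking whether a candidate expansion's word initials spell out the acronym in order. This greedy matcher is our own sketch, not the paper's method:

```python
# Rough test of whether `phrase` could expand `acronym`, e.g.
# "per-heps" <- "parachute extraction rocket-helicopter escape propulsion system".
def explains_acronym(acronym, phrase):
    """Greedy in-order match of the acronym's letters against word initials."""
    letters = [c for c in acronym.lower() if c.isalpha()]
    # Split hyphenated words so "rocket-helicopter" contributes two initials.
    parts = [p for w in phrase.lower().split() for p in w.split("-") if p]
    i = 0
    for part in parts:
        if i < len(letters) and part[0] == letters[i]:
            i += 1
    return i == len(letters)

print(explains_acronym(
    "per-heps",
    "parachute extraction rocket-helicopter escape propulsion system"))
```

A greedy matcher of this kind over-accepts (any initials in the right order pass), so in practice it would be a candidate filter before closer inspection of the appositive phrase.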
    <Paragraph position="11"> 4. Our approach to statistical parsing MARIE-1 uses the standard approach of intelligent natural-language processing for information retrieval (Grosz et al., 1987; Rau, 1988; Sembok and van Rijsbergen, 1990): hand-coding of lexical and semantic information for the words in a narrow domain. We used the DBG software from Language Systems, Inc. (in Woodland Hills, CA) to help construct the parser for MARIE-1.</Paragraph>
    <Paragraph position="12"> Nonetheless, considerable additional work was needed to adapt DBG to our domain. Even though we focused on a random sample of only 220 captions, they averaged 50 words in length and required a lexicon and type hierarchy of 1000 additional words beyond the 1000 we could use from the prototype DBG application for cockpit speech. A large number of additional semantic rules had to be written for the many long and complicated noun-noun sequences that had no counterpart in cockpit speech.</Paragraph>
    <Paragraph position="13"> These required difficult debugging because DBG's multiple-pass semantic processing is tricky to figure out, and the inability of DBG to backtrack and find a second interpretation meant that we could find a maximum of only one bug per run. But hardest of all to use were DBG's syntactic features. These required a grammar with fixed assigned probabilities on each rule, which necessitated a delicate balancing act, considering the entire corpus, to choose what was often a highly sensitive number. The lack of context sensitivity meant that this number had to be programmed artificially for each rule to obtain adequate performance (for which some researchers have claimed success), instead of being taken from applicable statistics on the corpus, which makes more sense. But this &amp;quot;programming&amp;quot; was more trial and error than anything.</Paragraph>
    <Paragraph position="14"> MARIE-1's approach would be unworkable for the 29,538 distinct words in the full 100,000-caption NAWC database.</Paragraph>
    <Paragraph position="15"> Statistical parsing has emerged in the last few years as an alternative. It assigns probabilities of co-occurrence to sets of words, and uses these probabilities to guess the most likely interpretation of a sentence. The probabilities can be derived from statistics on a corpus, a representative set of example sentences, and they can capture fine semantic distinctions that would otherwise require additional lexicon information. Statistical parsing is especially well suited for information retrieval because the goal of the latter is to find data that will probably satisfy a user, but satisfaction is never guaranteed. Also, good information retrieval does not require the full natural-language understanding that hand-tailored semantic routines provide: understanding of the words matched is not generally helpful beyond their synonym, hierarchical type, and hierarchical part information. For instance, the query &amp;quot;missile mounted on aircraft&amp;quot; should match all three of: --&amp;quot;sidewinder on f-18&amp;quot; --&amp;quot;sidewinder attached to wing pylon&amp;quot; --&amp;quot;pylon mounted aim-9m sidewinders&amp;quot; since &amp;quot;sidewinder&amp;quot; and &amp;quot;aim-9m&amp;quot; are types of missiles, &amp;quot;f-18&amp;quot; is a kind of aircraft, and &amp;quot;on&amp;quot; and &amp;quot;attached&amp;quot; mean the same thing as &amp;quot;mounted&amp;quot;. NAWC-WD captions are often imprecise with verbs, so detailed semantic analysis of them is usually fruitless. Parsing is still essential to connect related words in a caption, as to recognize the similar deep structure of the three examples above.</Paragraph>
    <Paragraph position="16"> But a parser for information retrieval can have fewer grammatical categories and fewer rules than one for full natural-language understanding.</Paragraph>
    <Paragraph position="17"> Creating the full synonym list, type hierarchy, and part hierarchy for applications of the size of the NAWC-WD database (42,000 words including words closely related to those in the captions) is considerable work. Fortunately, a large part of this job for any English application has been already accomplished in the Wordnet system (Miller et al., 1990), a large thesaurus system that includes this kind of information, plus rough word frequencies and morphological processing. From Wordnet we obtained information for 6,843 words mentioned in the NAWC-WD captions (for 24,094 word-sense entries), together with 15,417 &amp;quot;alias&amp;quot; facts relating other word senses to these entries as synonyms.</Paragraph>
    <Paragraph position="18"> (The alias facts shortened the lexicon by about 85%.) This left 22,695 words in the captions that did not have available Wordnet data, for which we used a variety of methods to create lexicon entries. The full breakdown of the lexicon is given in the accompanying table. The special-format rules do things like interpret &amp;quot;BU# 462945&amp;quot; as an aircraft identification number and &amp;quot;02/21/93&amp;quot; as a date. Misspellings and abbreviations were obtained mostly automatically, with human checking, from rule-based systems described in (Rowe and Laitinen, 1994). The effort for lexicon-building, although it is not yet complete, was relatively modest (0.25 of a man-year) thanks to Wordnet, which suggests good portability. Some of this success can be attributed to the restrictions of caption semantics.</Paragraph>
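The special-format rules can be sketched as a small classifier. The category names and patterns below are our own illustration of the two examples just mentioned, not MARIE-2's actual rules:

```python
import re
from datetime import datetime

def special_format_sense(token_string):
    """Map tokens like 'bu# 462945' or '02/21/93' to a lexical category."""
    # Bureau-number aircraft identifiers: "bu#" followed by digits.
    if re.fullmatch(r"bu#\s*\d+", token_string):
        return "aircraft-id-number"
    # mm/dd/yy dates; strptime also validates the field ranges.
    try:
        datetime.strptime(token_string, "%m/%d/%y")
        return "date"
    except ValueError:
        return None

print(special_format_sense("bu# 462945"))
print(special_format_sense("02/21/93"))
```

A production lexicon would chain many such rules, one per caption code format, before falling back on the misspelling and abbreviation heuristics.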
    <Paragraph position="19"> We converted all this information to a Quintus Prolog format compatible with the rest of MARIE-2, and used this in parsing and interpretation. The basic meaning assigned to a noun or verb is that it is a subtype of the concept designated by its name in the type hierarchy, with additional pieces of meaning added by its relationships (like modification) to other words in the sentence. For instance, for &amp;quot;big missile on stand&amp;quot;, a representative meaning list is:</Paragraph>
    <Paragraph position="21"> where v3 and v5 are variables and the numbers after the hyphen indicate the word sense number.</Paragraph>
  </Section>
  <Section position="4" start_page="116" end_page="119" type="metho">
    <SectionTitle>
5. Statistical parsing techniques
</SectionTitle>
    <Paragraph position="0"> This approach can be fast since we just substitute standard synonyms for the words in a sentence, append the type and relationship specifications for all the nouns, verbs, adjectives, and adverbs, and resolve references using the parse tree, to obtain a &amp;quot;meaning list&amp;quot; or semantic graph, following the paradigm of (Covington, 1994) for the nonstatistical aspects. But this can still be slow because it would seem we need to find all the reasonable interpretations of a sentence in order to rank them. To simplify matters, we restricted the grammar to binary parse rules (context-free rules with one or two symbols for the replacement). The likelihood of an interpretation can be found by assigning probabilities to word senses and rules.</Paragraph>
    <Paragraph position="1"> If we could assume near-independence of the probabilities of each part of the sentence, we could multiply them to get the probability of the whole sentence (Fujisaki et al., 1991). Maximizing this product is mathematically equivalent to maximizing the sum of the logarithms of the probabilities, and hence a branch-and-bound search can quickly find the N best parses of a sentence.</Paragraph>
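Under that independence assumption, the score of an interpretation is the sum of log-probabilities of its parts, so a best-first (branch-and-bound) search over partial interpretations can enumerate the N most likely complete ones. A minimal sketch; the word-sense labels and probabilities are invented for illustration:

```python
import heapq
import math

def n_best(choices, n):
    """choices: one list of (label, probability) options per sentence position.
    Returns the n highest-probability complete assignments, best first,
    by best-first search on summed negative log-probabilities."""
    heap = [(0.0, ())]          # (negated log-prob so far, labels chosen)
    results = []
    while heap and len(results) < n:
        neg_logp, labels = heapq.heappop(heap)
        if len(labels) == len(choices):            # complete assignment
            results.append((labels, math.exp(-neg_logp)))
            continue
        for label, p in choices[len(labels)]:      # extend by one position
            heapq.heappush(heap, (neg_logp - math.log(p), labels + (label,)))
    return results

# Invented word-sense probabilities for "sidewinder on ...":
senses = [[("sidewinder/missile", 0.8), ("sidewinder/snake", 0.2)],
          [("on/location", 0.9), ("on/time", 0.1)]]
for labels, p in n_best(senses, 2):
    print(labels, round(p, 2))
```

Because every extension adds a nonnegative cost, the first complete assignments popped from the heap are guaranteed to be the overall best, which is what makes the branch-and-bound enumeration efficient.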
    <Paragraph position="2"> But words of sentences are obviously not often independent or near-independent. Statistical parsing often exploits the probabilities of strings of successive words in a sentence (Jones and Eisner, 1992). However, with binary parse rules, a simpler and more semantic idea is to consider only the probability of co-occurrence of the two subparses. For example, the probability of parsing &amp;quot;f-18 landing&amp;quot; by the rule &amp;quot;NP -&gt; NP PARTICIPLEPHRASE&amp;quot; should include the likelihood of an F-18 in particular doing a landing and the likelihood of this syntactic structure.</Paragraph>
    <Paragraph position="3"> The co-occurrence probability for &amp;quot;f-18&amp;quot; and &amp;quot;land&amp;quot; is especially helpful because it is unexpectedly large, since there are only a few things in the world that land.</Paragraph>
    <Paragraph position="4"> Estimates of co-occurrence probabilities can inherit in the type hierarchy (Rowe, 1985). So if we have insufficient statistics in our corpus about how often an F-18 lands, we may have enough on how often an aircraft lands; and assuming that F-18s are typical of aircraft in this respect, we can estimate how often F-18s land. The second word can be generalized too, so we can use statistics on &amp;quot;f-18&amp;quot; and &amp;quot;moving&amp;quot;, or both words can be simultaneously generalized, so we can use statistics on &amp;quot;aircraft&amp;quot; and &amp;quot;moving&amp;quot;. The idea is to find some statistics that can be reliably used to estimate the co-occurrence probability of the words. Each parse rule can have separate statistics, so the alternative parse of &amp;quot;f-18 landing&amp;quot; by &amp;quot;NP -&gt; ADJECTIVE GERUND&amp;quot; would be evaluated by separate statistics.</Paragraph>
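The backoff up the type hierarchy can be sketched as follows; the two-level hierarchy, the counts, and the sufficiency threshold are our own invented illustration:

```python
# Invented type hierarchy and counts; only the superconcept pair has data.
PARENT = {"f-18": "aircraft", "land": "move"}
PAIR_COUNTS = {("aircraft", "land"): 230}   # no direct ("f-18", "land") count
MIN_COUNT = 5                               # threshold for "sufficient" data

def cooccurrence_count(w1, w2):
    """Return the first pair with enough data, generalizing each word upward."""
    candidates = [(w1, w2),
                  (PARENT.get(w1, w1), w2),
                  (w1, PARENT.get(w2, w2)),
                  (PARENT.get(w1, w1), PARENT.get(w2, w2))]
    for pair in candidates:
        if PAIR_COUNTS.get(pair, 0) >= MIN_COUNT:
            return pair, PAIR_COUNTS[pair]
    return None

print(cooccurrence_count("f-18", "land"))   # falls back to ("aircraft", "land")
```

The candidate order matters: the most specific statistics are preferred, and each word is generalized separately before both are generalized together, mirroring the text's description.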
    <Paragraph position="5"> To keep the number of possible co-occurrence probabilities manageable, it is important to restrict them to binary (two-word) probabilities. When parse rules recognize multiword sequences as grammatical units, those sequences can be reduced to &amp;quot;headwords&amp;quot;. For instance, &amp;quot;the big f-18 from china lake landing at armitage field&amp;quot; can also be parsed by &amp;quot;NP -&gt; NP PARTICIPLEPHRASE&amp;quot; and the same co-occurrence probability used, since &amp;quot;f-18&amp;quot; is the principal noun and hence headword of the noun phrase &amp;quot;the big f-18 from china lake&amp;quot;, and &amp;quot;landing&amp;quot; is the participle and hence headword of the participial phrase &amp;quot;landing at armitage field&amp;quot;. We can get a measure of the interaction of larger numbers of words by multiplying the probabilities for all such binary nodes of the parse tree.</Paragraph>
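The headword-based scoring can be sketched as follows; the tree encoding and the co-occurrence probabilities are our own illustration, not MARIE-2's data structures:

```python
# A binary parse tree: a leaf is (word,); an internal node is
# (left, right, head_child) where head_child is 0 or 1.
COOC_PROB = {("f-18", "landing"): 0.05, ("big", "f-18"): 0.02}

def headword(tree):
    """The headword of a node is the headword of its designated head child."""
    if len(tree) == 1:
        return tree[0]
    left, right, head_child = tree
    return headword(left if head_child == 0 else right)

def score(tree):
    """Product of co-occurrence probabilities over all binary nodes."""
    if len(tree) == 1:
        return 1.0
    left, right, _ = tree
    p = COOC_PROB.get((headword(left), headword(right)), 1e-6)  # smoothing floor
    return p * score(left) * score(right)

# "big f-18 landing": (big + f-18, headed by f-18) + landing, headed by f-18
tree = ((("big",), ("f-18",), 1), ("landing",), 0)
print(round(score(tree), 6))  # 0.02 * 0.05 = 0.001
```

Note how "f-18" contributes to the probabilities at two different nodes, which is exactly the non-independence effect described in the next paragraph.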
    <Paragraph position="6"> This is not an independence assumption anymore because an important word can appear as headword of many different syntactic units, and thus affect the overall rating of a parse in many places.</Paragraph>
    <Paragraph position="7"> A big advantage for us of statistical parsing is in identification of unknown words. As we noted earlier, our corpus has many equipment terms, geographical names, and names of people that are not covered by Wordnet. But for information retrieval, detailed understanding of these terms is usually not required beyond recognizing their category, and this can be inferred by co-occurrence probabilities. For instance, in &amp;quot;personnel mounting ghw-12 on an f-18&amp;quot;, &amp;quot;ghw-12&amp;quot; must be a piece of equipment because of the high likelihoods of co-occurrence of equipment terms with &amp;quot;mount&amp;quot; and equipment terms with &amp;quot;on&amp;quot;.</Paragraph>
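This category inference can be sketched as picking the category whose co-occurrence counts with the unknown word's syntactic neighbors are largest; the categories and counts below are invented for illustration:

```python
# Invented counts of (category, neighbor-word) co-occurrences in a corpus.
CATEGORY_COOC = {
    ("equipment", "mount"): 120, ("equipment", "on"): 300,
    ("person", "mount"): 4,      ("person", "on"): 40,
}

def guess_category(neighbors, categories=("equipment", "person")):
    """Choose the category best supported by the unknown word's neighbors."""
    def support(cat):
        return sum(CATEGORY_COOC.get((cat, w), 0) for w in neighbors)
    return max(categories, key=support)

# "ghw-12" appears as the object of "mount" and the subject of "on":
print(guess_category(["mount", "on"]))
```

A fuller treatment would normalize the counts into probabilities per category, but the raw-count comparison already shows why "ghw-12" reads as equipment.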
    <Paragraph position="8"> 6. More about the statistical database We will obtain the necessary counts from running the parser on the 100,000 captions. Using branch-and-bound search, the parser will find what it considers the most likely parse; if this is incorrect, a human monitor will say so and force it to consider the second most likely parse, and so on. Counts are incremented for each binary node in the parse tree, and also for all superconcepts of the words involved. As counts accumulate, the system should gradually become more likely to guess the correct parse on its first try.</Paragraph>
    <Paragraph position="9"> The statistical database for binary co-occurrence statistics will need careful design because the data will be sparse and there will be many small entries.</Paragraph>
    <Paragraph position="10"> For instance, for the NAWC-WD captions there are about 20,000 synonym sets about which we have lexicon information. This means 200 million possible co-occurrence pairs, but the total of all their counts can only be 610,182, the total number of word instances in all captions. Our counts database uses four search trees indexed on the first word, the part of speech plus word sense of the first word, the second word, and the part of speech plus word sense of the second word. Storing counts rather than probabilities saves storage and reduces work on update. Various compression techniques can further reduce storage, but especially the elimination of data that can be closely approximated from other counts using sampling theory (Rowe, 1985). For instance, if &amp;quot;f-18&amp;quot; occurs 10 times in the corpus, all kinds of aircraft occur 1000 times, and there are 230 occurrences of aircraft landing, estimate the number of &amp;quot;f-18 landing&amp;quot;s in the corpus as 230 * 10 / 1000 = 2.3; if the actual count is within a standard deviation of this value, do not store it in the database. The standard deviation, when n is the size of the subpopulation, N is the size of the population, and A the count for the population, is sqrt(nA(N-A)(N-n) / (N^2 (N-1))) (Cochran, 1977).</Paragraph>
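The estimate and the pruning test can be written directly from the figures above (n = 10 occurrences of "f-18", N = 1000 aircraft, A = 230 aircraft-landing occurrences); the standard-deviation formula is the hypergeometric form from Cochran (1977), and the "actual" count is a hypothetical value for illustration:

```python
import math

def estimated_count(n, N, A):
    """Estimate a subpopulation pair count from superconcept statistics:
    n = subpopulation size, N = population size, A = population pair count."""
    return A * n / N

def count_sd(n, N, A):
    """Standard deviation of the subpopulation count under random sampling
    (hypergeometric model; Cochran, 1977)."""
    return math.sqrt(n * A * (N - A) * (N - n) / (N ** 2 * (N - 1)))

n, N, A = 10, 1000, 230            # "f-18", all aircraft, "aircraft landing"
est = estimated_count(n, N, A)     # 230 * 10 / 1000 = 2.3
sd = count_sd(n, N, A)             # about 1.32
actual = 3                         # hypothetical observed "f-18 landing" count
store = abs(actual - est) > sd     # store only counts the estimate misses
print(round(est, 2), round(sd, 2), store)
```

Here an actual count of 3 lies within one standard deviation of the estimate 2.3, so it need not be stored: the superconcept counts reconstruct it well enough.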
    <Paragraph position="11"> Such calculations also require &amp;quot;unary&amp;quot; counts stored with each word or standard phrase, but there are far fewer of these. (While unary counts also directly affect the likelihood of a particular sentence, that effect can be ignored since it is constant over all sentence interpretations.) We need not store statistics for every word in the statistical database. Many words and phrases used in our corpus are codes that appear rarely, like airplane ID numbers and dates. For such concepts, we only keep statistics on the superconcept, &amp;quot;ID number&amp;quot; and &amp;quot;date&amp;quot; for these examples. Which concepts are to be handled this way is domain-dependent, but generally simple.</Paragraph>
    <Paragraph position="12"> 7. More about the restriction to binary probabilities It may seem inadequate to restrict co-occurrence probabilities to pairs of headwords. We argue that while this is inadequate for general natural-language processing, information retrieval in general and captions in particular are minimally affected. That is because the sublanguages of such applications are highly case-oriented, and cases are binary. So structures like subject-verbal-object can be reduced to verbal-subject and verbal-object cases; adjective1-adjective2-noun can be reduced to two adjective-noun case relationships, either separately with each adjective or by reducing adjective1-adjective2 to a composite concept if adjective2 can be taken as a noun.</Paragraph>
    <Paragraph position="13"> Prepositional phrases would seem to be trouble, however, because the preposition simultaneously interacts with both its subject and its object. We handle them by subclassifying prepositions as location, time, social, abstract, or miscellaneous, reflecting the main features that affect compatibility with subjects and objects. Then, say, a parse of &amp;quot;aircraft at nawc&amp;quot; retrieves a high count only if the preposition is a location and not a time preposition, and so permits compatibility under those syntactically restricted circumstances of &amp;quot;nawc&amp;quot; and &amp;quot;aircraft&amp;quot;.</Paragraph>
    <Paragraph position="14"> Another objection raised to binary probabilities is the variety of nonlocal semantic relationships that can occur in discourse. Captions at NAWC-WD are usually multi-sentence, and anaphora do occur, which can usually be resolved by simple methods. More difficult is the problem of resolving multiple possible word senses. For &amp;quot;sidewinder on ground&amp;quot;, are we talking about the snake or the missile? (NAWC-WD captions are all lower case.) The proper interpretation depends on whether the previous sentence was &amp;quot;flora and fauna of the china lake area&amp;quot; (NAWC-WD has many such pictures for public relations) or &amp;quot;loading sequence for a missile&amp;quot;. We have three answers to this challenge.</Paragraph>
    <Paragraph position="15"> First, most of the words that have many multiple meanings in English are abstract or metaphorical, and not appropriate for use on captions.</Paragraph>
    <Paragraph position="16"> Second, when ambiguous words do occur, the odds are good that some immediate syntactic relationship will provide the necessary clues to resolving ambiguity; for instance, both &amp;quot;sidewinder mounted&amp;quot; and &amp;quot;sidewinder coiled&amp;quot; are unambiguous when using co-occurrence counts. Third, even if multiple interpretations cannot be ruled out for a word, an information retrieval system can just try each, and take the union of the results (i.e. as a logical disjunction); generally only one interpretation will ever match a query. Note that if count statistics are derived from the same corpus that is subsequently used for retrieval, as MARIE-2 intends to do, the probabilities obtained from our parse will be a rough estimate of the yield (selectivity) of each interpretation.</Paragraph>
  </Section>
</Paper>