<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1068">
  <Title>Fast Text Processing for Information Retrieval</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FAST TEXT PROCESSING FOR INFORMATION RETRIEVAL
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> We describe an advanced text processing system for information retrieval from natural language document collections. We use both syntactic processing as well as statistical term clustering to obtain a representation of documents which would be more accurate than those obtained with more traditional key-word methods. A reliable top-down parser has been developed that allows for fast processing of large amounts of text, and for a precise identification of desired types of phrases for statistical analysis.</Paragraph>
    <Paragraph position="1"> Two statistical measures are computed: the measure of informational contribution of words in phrases, and the similarity measure between words.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="347" type="metho">
    <SectionTitle>
APPROXIMATE PARSING WITH TTP
</SectionTitle>
    <Paragraph position="0"> Trp (Tagged Text Parser) is a top down English parser specifically designed for fast, reliable processing of large amounts of text. The parser operates on a tagged input, where each word has been marked with a tag indicating a syntactic category: a part of speech with selected morphological features such as number, tense, mode, case end degree) As an example, consider the following sentence from an article appearing in the Communications of the ACM: The binary number system often many advantages over a decimal representation for a high-performance, general-purpose computer.</Paragraph>
    <Paragraph position="1"> This sentence is tagged as follows (we show the best-tags option only; dt - determiner, nn - singular noun, nns - plural noun, in preposition, jj - adjective, vbz - verb in present tense third person singular): \[\[the,dt\],\[binary,jj\],\[number,nn\],\[system,rm\],\[offers,vbz\], \[many,jj\],\[adventages,rms\],\[over, in\], \[a,dt\],\[decimal,jj\], \[representafion,nn\], \[for,in\] ,\[a, dt\], \[high_per formence,nn\], \[comS,comS\],\[general purpose,nn\],\[computer,nn\],\[perS,perS\]\] Tagging of the input text substantially reduces the search space of a top-down parser since it resolves many lexical ambiguities, such as singular verb vs. plural noun, past tense vs. past participle, or preposition vs. wh-determiner. Tagging also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the demand for consulting of the dictionary, end simplifies dealing with unknown words.</Paragraph>
    <Paragraph position="2"> t At present we use the 35-tag Penn Treebank Tagset created at the University of Pennsylvania. Prior to parsing, the text is tagged automatically using a program supplied by Bolt Beranek and Newman. We wish to thank Ralph Weischedel and Marie Meeter of BBN for providing and assisting in the use of the tagger.</Paragraph>
    <Paragraph position="3"> T\]'P is based on the Linguistic String Grammar developed by Sager \[8\] and partially incorporated in the Proteus parser \[3\]. TIP is written in Quintus Prolog, and currently implements more than 400 grammar productions. The restriction component of the original LSP Grammar as well as the lamlxta-reduction based &amp;quot;semantics&amp;quot; of the Proteus implementation have been redesigned for the unification-hased environment. 2 TI'P produces a regularized representation of each parsed sentence that reflects the sentence's logical structure. This representation may differ considerably from a standard parse tree, in that the constituents get moved around (e.g., de-passivization, de-dativization), and carrain noun phrases get transformed into equivalent clauses (denon'finalization). The aim is to produce a uniform representation across different paraphrases; for example, the phrase context-free language recognition or parsing is represented as shown below: \[\[verb, \[or,\[recognize,parse\]\]\] \[subject, anyone\] \[object, \[np,\[n,language\], \[edj,\[context See\]\]\]\]\].</Paragraph>
    <Paragraph position="4"> The parser is equipped with a time-out mechanism that allows for fast closing of more difficult sub-constituents after a preset amount of time has elapsed without producing a parse. When the time-out option is turned on (which happens automatically during the parsing), the parser is permitted to skip portions of input to reach a starter terminal for the next constituent to be parsed, and closing the currently open one (or ones) with whatever partial representation has been generated thus far. The result is an approximate partial parse, which shows the overall structure of the sentence, from which some of the constituents may be missing. Since the time-out option can be regulated by setting an appropriate flag before the parsing starts, the parser may be tuned to reach an acceptable compromise between its speed and precision. null The time-out mechanism is implemented using a straight-forward parameter passing and is at present limited to only a sub-set of nonterminals used by the grammar. Suppose that X is such a nonterminal, and that it occurs on the right-hand side of a production S -&gt; X Y Z. The set of &amp;quot;starters&amp;quot; is computed for Y, which consists of the word tags that can occur as the left-most  constituent of Y. This set is passed as a parameter while the parser attempts to recognize X in the inpuL If X is recognized successfully within a preset time, then the parser proceeds to parse a Y, and nothing else happens. On the other hand, if the parser cannot determine whether there is an X in the input or not, that is, it neither succeeds nor fails in parsing X before being timed out, the unfinished X constituent is closed with a partial parse, and the parser is restarted at the closest element from the starters set for Y that can be found in the remainder of the input. If Y rewrites to an empty sizing, the starters for Z to the right of Y are added to the starters for Y and both sets are passed as a parameter to X. As an example consider the following clause in the TrP parser (some arguments are removed for expository reasons): null</Paragraph>
    <Paragraph position="6"> subtail(Tail).</Paragraph>
    <Paragraph position="7"> In this production, a (finite) clause rewrites into an (optional) santence adjunct (SA), a subject, a verbphrase and subject's right adjunct (SUBTAIL, also optional). With the exception of subtail, each predicate has a parameter that specifies the list of &amp;quot;starter&amp;quot; tags for restarting the parser, should the evaluation of this predicate exceed the allotted portion of time. Thus, in case sa is aborted before its evaluation is complete, the parser will jump over some elements of the unparsed portion of the input looking for a word that could begin a subject phrase (either a predeterminer, a determiner, a count word, a pronoun, an adjective, a noun, or a proper name). Likewise, when subject is timed out, the parser will restart with verbphrase at either vbz, vbd or vbp (finite forms of a verb). Note that if verbphrase is timed out, then subtail will be ignored, both verbphrase and clause will be closed, and the parser will restart at an element of set SR passed down from clause. The examples in Figures 1 to 3 show approximate parse structures generated by TTP.</Paragraph>
    <Paragraph position="8"> The sentence in Figure 1 has been parsed nearly to the end, but T\]'P has failed to find the main verb and it has thrown out much of the last phrase such as the LR(k) grammars, partly due to an improper tokenization of LR(k). In Figure 2, the parser has initially assumed that the conjunction in the sentence has the narrow scope, then it realized that something went wrong but, apparently, there was no time left to back up. Occasionally, however, sentences may come out substantially lruncated, as shown in Figure 3 (where although has been mistagged as a preposition).</Paragraph>
    <Paragraph position="9"> There are at least several options to realize the kind of time-regulated parsing discussed above. One involves allotting a certain amount of time per sentence and, when this time is up, entering the time-out mode for the rest of the current sentence processing. This amounts to a rapid, though controlled, exit from presently open constituents mostly ignoring the unparsed portion of the sentence. This option gives each sentence in the input roughly the same amount of time, and thus allows the parser to explore more alternatives while processing shorter sentences, while setting tight limits for the longer ones. In our experiments with the CACM collection we found that 0.7 sec/parse is acceptable for an average sentence. One other option is to set up time limits per nonterminal, and restore normal parsing after each time-out. The advantage here is that longer sentences receive proportionally more time to process (allowing for some backtracking to explore alternatives). The disadvantage is that one loses the SENTENCE: The problem of determining whether an arbitrary context-free grammar is a member of some easily parsed subclass of grammars such as the LR(k) grammars is considered.</Paragraph>
    <Paragraph position="10">  SENTEN(\]X: The TX-2 computer at MIT Lincoln Laboratory was used for the imple* me.ration of such a system and the characteristics of this implementation are reported.</Paragraph>
    <Paragraph position="11">  SENTENCE: In principle, the system can deal with any orthography, although at present it is limited to 4000 Chinese characters and some mathematical symbols.</Paragraph>
    <Paragraph position="12">  right control upon the overall speed of the parser; now complex sentences may take a considerably longer time to finish. Various mixed options are also possible, for instance one may initially allot x milliseconds to each sentence, and if necessary, restart it with a half that time, and so forth.</Paragraph>
    <Paragraph position="13"> Another method for containing the time allowed for parsing is to limit the amount of nondeterminism by a stricter control over the rule selection and by disabling backtracking at certain points, again at the expense of producing only an approximate parse. Certain types of structural ambiguity, such as preposirional phrase attachment which cannot be resolved at the syntax level anyway, frequently remain unresolved in the parse structure generated by 'ITP (although, 'I'\]'P attempts to resolve some structural ambiguities using preferences whenever possible).</Paragraph>
    <Paragraph position="14"> 'ITP is also quite robust; it can parse nearly every sentence or phrase, provided the latter is reasonably correctly tagged. We parsed the entire CACM-3204 collection and only two sentences were returned unparsed, because of multiple tagging errors. 3 To assure a gradual degradation of output rather than an outright failure, and also to allow for handling of sentence fragments and isolated phrases such as titles, each sentence/phrase is attempted to be analyzed in up to four ways: (1) as a sentence, (2) as a noun phrase or a preposition phrase with a right adjunct(s), (3) as a gerundive clause, and eventually (4) as a series of simple noun phrases. Each of these attempts is allotted a new time slice, and the next analysis is started after the previous one fails hut before it is timed out. Although parsing of some sentences may now approach four times the allotted time limit, we noted that the average parsing time per sentence remains basically unaffected. 4 3 CACM-3204 is a standard collection used in information retrieval experiments and includes, in addition to the abstracts, a set of 64 queries and relevance judgements for them. The pure text portion of the collection contains nearly 10,000 sentences and phrases, or about 235,000 words.</Paragraph>
  </Section>
  <Section position="4" start_page="347" end_page="347" type="metho">
    <SectionTitle>
EXTRACTION OF SYNTACTIC PHRASES
</SectionTitle>
    <Paragraph position="0"> The similarity measure that we use for term classification is based on quantitative information about word and phrase frequencies and word co-occurrences within the text. We collected this information for two-word &amp;quot;phrases&amp;quot; extracted from the parsed docurnents, s The co-occurrence analysis gives the best results when the words are connected by the same grammatical relation, for example verb-object, or noun-right adjunct, etc. We noted, however, that including multiple relations in the analysis is possible so long as they could be considered to convey similar &amp;quot;semantic&amp;quot; dependencies. In our experiments the following types of word pairs are extracted: (1) a noun and its left noun adjunct, (2) a noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) a noun and its adjective, where the noun is the head of a noun phrase as recognized by the parser.</Paragraph>
    <Paragraph position="1"> The pairs are extracted from the regularized parse structures with a pattern-matching procedure which uses an exclusion list to disregard some &amp;quot;uninteresting&amp;quot; words (such as be, such, any). The words with the common stem but different forms are replaced by a single &amp;quot;normal&amp;quot; form. Working on the parsed text ensures a high degree of precision in capturing the meaningful phrases, which is especially evident when compared with the results usually obtained from a &amp;quot;raw&amp;quot; text (either unprocessed or only partially processed). ~ On the other hand, since our parser is allowed to skip some portions of each sentence that cannot be parsed within a preset time limit, the structures it produces are occasionally incomplete so that the extraction procedure will generate orily a subset of all relevant phrases. The precision, however, remains very high: few undesired phrases are ever turned out (as far as the four specified types are concerned), which is particularly important in subsequent statistical processes, since these tend to be quite sensitive on the amount of noise in the analyzed material. An example is shown in Figure 4.</Paragraph>
  </Section>
  <Section position="5" start_page="347" end_page="347" type="metho">
    <SectionTitle>
STATISTICAL SIMILARITY MEASURE
</SectionTitle>
    <Paragraph position="0"> Classification of words and phrases based on similarities in their meaning is particularly important in information retrievai systems. Various word taxonomies derived from machine-readable dictionaries may be of relevance here \[1\], but general-purpose dictionaries, such as Oxford's Advanced Learner's Dictionary (OALD) or Longman's Dictionary of Contemporary English (LDOCE), both available on-line, are usually quite limited in their coverage of domain specific vocabulary, including domain-specific use of common words as well as technical terminology. Statistical methods for word clustering may provide a partial solution to this problem given a sufficient amount of textual data that display a certain uniformity of subject matter and style. These problems have been studied to some extent within the sublanguage paradigm \[4,5\], and also using elements of informarion theory \[2,6\]. One general problem with the latter approach is that information theory, which deals with code transmission,</Paragraph>
  </Section>
class="xml-element"></Paper>