<?xml version="1.0" standalone="yes"?> <Paper uid="P88-1025"> <Title>SYNTACTIC APPROACHES TO AUTOMATIC BOOK INDEXING</Title> <Section position="4" start_page="0" end_page="204" type="metho"> <SectionTitle> AUTOMATIC PHRASE CONSTRUCTION </SectionTitle> <Paragraph position="0"> Book indexing systems differ from standard automatic text indexing systems because complex, multi-word phrases are normally used for indexing purposes rather than the single-term entries that are preferred in conventional automatic indexing systems. The phrase generation system described in this note is based on an automatic syntactic analysis of the available texts, followed by a noun-phrase identification process that takes parse trees as input and produces lists of nominal constructions. The parsing system used in this study is based on an augmented phrase structure grammar, and was originally designed for use in the EPISTLE text-critiquing system (Heidorn, 1982; Jensen, 1983). (Footnote 1: The writer is indebted to the IBM Corporation and to Dr. George Heidorn for making available the PLNLP parsing system for use at Cornell University.) A typical document abstract is shown in Fig. 2, and the output produced by the syntactic analysis program for sentence 2 of the document is shown in Fig. 3. It may be noted that the syntactic output appears in the form of a standard phrase marker, the various levels of the syntax tree being listed in a column format from left to right. During the analysis, a head is identified for each syntactic constituent and marked by an asterisk (*) in the output. Thus in Fig. 
3, the VERB is the main head of the sentence; the head of the noun phrase preceding the main verb is the NOUN representing the term &quot;operations&quot;, etc.</Paragraph> <Paragraph position="1"> The phrase formation system used in this study builds two-term phrases by combining the head of a constituent with the head of each constituent that modifies it (Fagan, 1987a, 1987b).</Paragraph> <Paragraph position="2"> For the sample sentence of Fig. 3, such a strategy produces the phrases: development - exception dictionary - development negative - dictionary system operations. In the phrase output, the dependent term is listed first in each case, followed by the governing term. Note that the phrase generation system identifies apparently reasonable constructions such as &quot;dictionary development&quot; and &quot;system operations&quot;, but not the unwanted phrases &quot;exception operations&quot; or &quot;exception systems&quot;.</Paragraph> </Section> <Section position="5" start_page="204" end_page="205" type="metho"> <SectionTitle> AUTOMATIC PHRASE ASSIGNMENT </SectionTitle> <Paragraph position="0"> An automatic phrase construction system generates a large number of phrases for a given text item. Fig. 4 lists all the phrases produced for the abstract of Fig. 2.</Paragraph> <Paragraph position="1"> Phrases occurring in the document title are identified by the letter T, and phrases obtained more than once for a given document are identified by a frequency marker (2) in Fig. 4. The output of Fig. 4 could be used directly in a semi-automatic indexing environment by letting the user choose appropriate index entries from the available list. 
The standard entries from the figure might then be manually chosen for indexing purposes by the document author, or by a trained indexer.</Paragraph> <Paragraph position="2"> In a fully automatic indexing system, additional criteria must be used, leading to the choice of some of the proposed phrase constructions and the rejection of others. The following criteria, among others, may be useful: For sentences that produce more than one acceptable syntactic analysis output, all analyses except the first one may be eliminated (in the Heidorn-Jensen analyzer, multiple analyses are arranged in decreasing order of presumed correctness).</Paragraph> <Paragraph position="3"> Phrases consisting of identical juxtaposed words (&quot;computationscomputation&quot; in Fig. 4) may be eliminated. Phrases consisting of more than two words (e.g. &quot;document-retrievalsystem&quot;) may be given preference in the phrase assignment process.</Paragraph> <Paragraph position="4"> Phrases occurring in document titles and/or section headings may be given preference.</Paragraph> <Paragraph position="5"> Noun-noun constructions might be given preference over adjective-noun constructions.</Paragraph> <Paragraph position="6"> A further choice of phrases, as well as a phrase ordering system in decreasing order of apparent desirability, can be implemented by assigning a phrase weight to each phrase and listing the phrases in decreasing weight order. Two different frequency criteria are important in phrase weighting: the frequency of occurrence of a construct in a given document, or document section, known as the term frequency (tf); and the number of documents, or document sections, in which a given construct occurs, known as the document frequency (df). 
(Footnote 2: For book indexing purposes, a book can be broken down into sections, or paragraphs; the term frequency and document frequency factors are then computed for the individual book components.) The best constructs for indexing purposes are those exhibiting a high term frequency and a relatively low overall document frequency. Such constructs will distinguish the documents, or document sections, to which they are assigned from the remainder of the collection. The corresponding term weight, known as tf.idf, is computed by multiplying the term frequency factor by an inverse document frequency factor.</Paragraph> <Paragraph position="7"> Fig. 5 shows selected phrase output based in part on the use of automatically derived term weights. The top part of the figure contains the automatically derived constructs containing more than two terms. These might be used for indexing purposes regardless of term weight. In addition, the two-term phrases whose term frequency exceeds 1 in the document might also be used for indexing purposes. This would add the 9 phrases listed in the center portion of Fig. 5.</Paragraph> <Paragraph position="8"> Some of the phrases with tf > 1 have either a very high document frequency (125 for &quot;retrieval system&quot;) or a very low document frequency of 1, meaning that the phrase occurs only in the single document 659. In practice, a reasonable indexing policy consists in choosing phrases for which tf > k1 and k2 &lt; df &lt; k3 for suitable parameters k1, k2, and k3. When these parameters are set equal to 1, 1 and 100, respectively, the 5 phrases identified by asterisks in Fig. 5 are chosen as indexing units.</Paragraph> <Paragraph position="9"> The bottom part of Fig. 5 shows a ranked phrase list in decreasing order according to a composite (tf x idf) phrase weight. Using such an ordered list, a typical indexing policy consists in choosing the top n entries from the list, or choosing entries whose weight exceeds a given threshold T. 
When T is chosen as 0.1, the 12 phrases listed at the bottom of Fig. 5 are produced. It may be noted that most of the terms listed in Fig. 5 appear to be reasonable indexing units.</Paragraph> <Paragraph position="10"> In a practical book indexing system, a phrase classification system capable of determining relationships between similar, or identical, phrases becomes useful. Such a phrase classification then leads to the choice of a canonical representation for each group of equivalent phrases, and to the assignment of &quot;see&quot; and &quot;see also&quot; references. Phrase relationships can be determined by using synonym dictionaries and various kinds of phrase lists. In addition, attempts have been made to use the term definitions contained in machine-readable dictionaries to construct hierarchies of word meanings (Walker, 1987; Kucera, 1985; Chodorow, 1985). The automatic construction of phrase classification systems remains to be pursued in future work.</Paragraph> </Section> </Paper>