<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1003">
  <Title>The Comlex Syntax Project: The First Year</Title>
  <Section position="4" start_page="0" end_page="11" type="metho">
    <SectionTitle>
3. Methods
</SectionTitle>
    <Paragraph position="0"> Our basic approach is to create an initial lexicon manually and then to use a variety of resources, both commercial and corpus-derived, to refine this lexicon. Although methods have been developed over the last few years for automatically identifying some subcategorization constraints through corpus analysis \[2,5\], these methods are still limited in the range of distinctions they can identify and their ability to deal with low-frequency words. Consequently we have chosen to use manual entry for creation of our initial dictionary.</Paragraph>
    <Paragraph position="1"> The entry of lexical information is being performed by four graduate linguistics students, referred to as elves (&amp;quot;elf&amp;quot; = enterer of lexical features). The elves are provided with a menu-basedinterface coded in Common Lisp using the Garnet GUI package, and running on Sun workstations. This interface also provides access to a large text corpus; as a word is being entered, instances of the word can be viewed in one of the windows. Elves rely on citations from the corpus, definitions and citations from any of several printed  Noun feature NUNIT: a noun which can occur in a quantifier-noun measure expression ex: &amp;quot;two FOOT long pipe&amp;quot;/&amp;quot;a pipe which is two FEET in length&amp;quot; Noun complement NOUN-THAT-S: the noun complement is a full sentence ex:&amp;quot;the assumption that he will go to school (is wrong.)&amp;quot; Adj feature ATTRIBUTIVE: an adjective that occurs only attributively (ie before the noun) and never predicatively (after &amp;quot;be&amp;quot;) ex: &amp;quot;The LONE man rode through the desert&amp;quot;/*&amp;quot;the man was lone.&amp;quot; Adj complement ADJ-FOR-TO-INF: includes three infinitival complements of adjectives ex: &amp;quot;it is PRACTICAL for Evan to go to school.&amp;quot; (extrap-adj-for-to-inf) ''the race was easy for her to win.&amp;quot; (extrap-adj-for-to-inf-np-omit) &amp;quot;Joan was kind to invite me.&amp;quot; (extrap-adj-for-to-inf-rs) Verb feature VMOTION: a verb which occurs with a locative adverbial complement. ex: &amp;quot;he ran in&amp;quot; (which may permute to &amp;quot;in he ran.&amp;quot;) Verb complement: NP a verb which takes a direct object noun phrase.</Paragraph>
    <Paragraph position="2"> ex: &amp;quot;he ran a gambling den.&amp;quot;  :cs ((poss 2) (vp 3 :mood prespart :subject 2)) :gs (:subject 1, :comp 3) :ex &amp;quot;he discussed their writing novels.&amp;quot;) :cs (vp 2 :mood prespart :subject anyone) :features (:control arbitrary) :gs (:subject 1, :comp 2) :ex &amp;quot;he discussed writing novels.&amp;quot;) (*possing be-ing-sc)) :cs (vp 2 :mood prespart :subject 1) :features (:control subject) :gs (:subject 1, :comp 2) :ex &amp;quot;she began drinking at 9:00 every night.&amp;quot;)  dictionaries and their own linguistic intuitions in assigning features to words.</Paragraph>
    <Paragraph position="3"> Entry of the initial dictionary began in April 1993. To date, entries have be, en created for all the nouns and adjectives, and 60% of the verbs3; the initial dictionary is scheduled for completion in the spring of 1994.</Paragraph>
    <Paragraph position="4"> We expect to check this dictionary against several sources. We intend to compare the manual subeategorizations for verbs against those in the OALD, and would be pleased to make comparisons against other broad-coverage dictionaries if those can be made available for this purpose. We also intend to make comparisons against several corpus-derived lists: at the very least, with verb/preposition and verb/particle pairs with high mutualinformation \[3\] and, if possible, with the results of recently-developed procedures for extracting subcategorization frames from corpora \[2,5\]. While this corpus-derived information may not be detailed or accurate enough for fully-automated lexicon creation, it should be most valuable as a basis for comparisons.</Paragraph>
    <Paragraph position="5"> 4. Types and Sources of Error As part of the process of refining the dictionary and assuring its quality, we have spent considerable resources on reviewing dictionary entries and on occasion have had sections coded by two or even four of the elves. This process has allowed us to make some analysis of the sources and types of error in the lexicon, and how they might be reduced. We can divide the sources of error and inconsistency into four classes: 1. errors of elassUieation: where an instance of a word is improperly analyzed, and in particular where the words following a verb are not properly identified with regard to complement type. Specific types of problems include misclassifying adjuncts as arguments (or vice versa) and identifying the wrong control features. Our primary defenses against such errors have been a steady refinement of the feature descriptions in our manual and regular group review sessions with all the elves. In particular, we have developed detailed criteria for making adjunct/argument distinctions \[6\].</Paragraph>
    <Paragraph position="6"> A preliminary study, conducted on examples (drawn at random from a corpus not used for our concordance) of verbs beginning with &amp;quot;j&amp;quot;, indicated that elves were consistent 93% to 94% of the time in labeling argument/adjunct distinctions following our criteria and, when they were consistent in argument/adjunct labeling, rarely disagreed on the subcategorization. In more than half of the cases where there was disagreement, the elves separately flagged these as difficult, ambiguous, or figurative uses of the verbs (and therefore would probably not use them as the basis for assigning lexical features). The agreement rate for examples which were not flagged was 96% to 98%.</Paragraph>
    <Paragraph position="7"> 2. omitted features: where an elf omits a feature because it is not suggested by an example in the concordance, a citation in the dictionary, or the elf's introspection. In order to get an estimate of the magnitude of this problem we decided to establish a measure of coverage or &amp;quot;recall&amp;quot; for the subcategorization features assigned by our elves. To do this, we tagged 3No features are being assigned to adverbs or prepositions in the initial lexicon.</Paragraph>
    <Paragraph position="8"> the first 150 &amp;quot;j&amp;quot; verbs from a randomly selected corpus from a part of the San Diego Mercury which was not included in our concordance and then compared the dictionary entries created by our lexicographers against the tugged corpus. The results of this comparison are shown in Figure 4.</Paragraph>
    <Paragraph position="9"> The &amp;quot;Complements only&amp;quot; is the percentage of instances in the corpus covered by the subcategorization tugs assigned by the elves and does not include the identification of any prepositions or adverbs. The &amp;quot;Complements only&amp;quot; would correspond roughly to the type of information provided by OALD and LDOCE 4. The &amp;quot;Complements + Prepositions/Particles&amp;quot; column includes all the features, that is it considers the correct identification of the complement plus the specific prepositions and adverbs required by certain complements.</Paragraph>
    <Paragraph position="10"> The two columns of figures under &amp;quot;Complements + Prepositions/Particles&amp;quot; show the results with and without the enumeration of directional prepositions.</Paragraph>
    <Paragraph position="11"> We have recently changed our approach to the classificaton of verbs (like &amp;quot;run&amp;quot;, &amp;quot;send&amp;quot;, &amp;quot;jog&amp;quot;, &amp;quot;walk&amp;quot;, &amp;quot;jump&amp;quot;) which take a long list of directional prepositions, by providing our entering program with a P-DIR option on the preposition list. This option will automatically assign a list of directional prepositions to the verb and thus will save time and eliminate errors Of missing prepositions. Figure 5 shows the dictionary entry for&amp;quot;jump&amp;quot;, taken from the union of the four elves. If you note the large number of directional prepositions listed under PP (prepositional phrase), you can see how easy it would be for a single elf to miss one or more. The addition of P-DIR has eliminated that problem.</Paragraph>
    <Paragraph position="12"> In some cases this approach will provide a preposition list that is a little rich for a given verb but we have decided to err on the side of a slight overgeneration rather than risk missing any prepositions which actually occur. As you can see, the removal of the P-DIRs from consideration improves the individual elf scores.</Paragraph>
    <Paragraph position="13"> The elf union score is the union of the lexical entries for all four elves. Theseare certainly numbers to be proud of, but realistically, having the verbs done four separate times is not practical. However, in our original proposal we stated that because of the complexity of the verb entries we would like to have them done twice. As can be seen in Figure 6, with two passes we succeed in raising individual percentages in all cases.</Paragraph>
    <Paragraph position="14"> We would like to make clear that even in the two cases where our individual lexicographers miss 18% and 13% of the complements, there was only one instance in which this might have resulted in the inability to parse a sentence. This was a missing intransitive. Otherwise, the missed complements would have been analyzed as adjuncts since they were a combination ofprepositionalphrases and adverbials with one case of a subordinate conjunction &amp;quot;as&amp;quot;.</Paragraph>
    <Paragraph position="15"> We endeavored to make a comparison with LDOCE on the measurement. This was a bit difficult since LDOCE lacks some complements we have and combines others, not always consistently. For instance, our PP roughly corresponds to either L9 (our PP/ADVP) or prep/adv + T1 (e.g. &amp;quot;on&amp;quot; + T1) (our PP/PART-NP) but in some cases a preposition is mentioned but the verb is classified as intransitive. The straight forward comparison has LDOCE finding 73% of the tagged  comp'~\[ements but a softer measure eliminating complements that LDOCE seems to be lacking (PART-NP-PP, P-POSSING, PP-PP) and allowing for app complement for&amp;quot;joke&amp;quot;, although it is not specified, results in a percentage of 79.</Paragraph>
    <Paragraph position="16"> We have adopted two lines of defense against the problem of omitted features. First, critical entries (particularly high frequency verbs) will be done independently by two or more elves. Second, we are developing a more balanced corpus for the elves to consult. Recent studies (e.g., \[1\]) confirm our observations that features such as subcategorization patterns may differ substantially between corpora. We began with a corpus from a single newspaper (San Jose Mercury News), but have since added the Brown corpus, several literary works from the Library of America, scientific abstracts from the U.S.</Paragraph>
    <Paragraph position="17"> Department of Energy, and an additional newspaper (the Wall Street Journal). In extending the corpus, we have limited ourselves to texts which would be readily available to members of the Linguistic Data Consortium.</Paragraph>
    <Paragraph position="18"> excess features: when an elf assigns a spurious feature through incorrect extrapolation or analogy from available examples or introspection. Because of our desire to obtain relatively complete feature sets, even for infrequent verbs, we have permitted elves to extrapolate from the citations found.</Paragraph>
    <Paragraph position="19"> Such a process is bound to be less certain than the assignment of features from extant examples. However, this problem does not appear to be very severe. A review of the &amp;quot;j&amp;quot; verb enlries produced by all four elves indicates that the fraction of spurious entries ranges from 2% to 6%.</Paragraph>
    <Paragraph position="20"> fuzzy features: feature assignmentis defined in terms of the acceptability of words in particular syntactic frames. Acceptability, however, is often not absolute but a matter of degree. A verb may occur primarily with particular complements, but will be &amp;quot;acceptable&amp;quot; with others.</Paragraph>
    <Paragraph position="21"> This problem is compounded by words which take on particular features only in special contexts. Thus, we don't ordinarily think of&amp;quot;dead&amp;quot; as being gradable (*&amp;quot;Fred is more dead than Mary.&amp;quot;), but we do say &amp;quot;deader than a door nail&amp;quot;. It is also compounded by our decision not to make sense distinctions initially. For example, many words which are countable (require a determiner before the singular form) also have a generic sense in which the determiner is not required (*&amp;quot;Fred bought apple.&amp;quot; but &amp;quot;Apple is a wonderful flavor.&amp;quot;). For each such problematic feature we have prepared guidelines for the elves, but these still require considerable discretion on their part.</Paragraph>
    <Paragraph position="22"> These problems have emphasized for us the importance of developing a tagged corpus in conjunction with the dictionary, so that frequency of occurrence of a feature (and frequency by text type) will be available. We are planning to do such tagging beginning in March 1994, in parallel with the completion of our initial dictionary. Our plan is to begin by tagging verbs in the Brown corpus, in order to be able to correlate our tagging with the word sense tagging being done by the WordNet group on the same corpus \[7\].</Paragraph>
  </Section>
class="xml-element"></Paper>