<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1099">
  <Title>Combining Multiple, Large-Scale Resources in a Reusable Lexicon for Natural Language Generation</Title>
  <Section position="3" start_page="607" end_page="607" type="metho">
    <SectionTitle>
2 Constructing a generation lexicon by merging linguistic resources
</SectionTitle>
    <Section position="1" start_page="607" end_page="607" type="sub_section">
      <SectionTitle>
2.1 Linguistic resources
</SectionTitle>
      <Paragraph position="0"> In our selection of resources, we aim primarily for accuracy of the resource, large coverage, and providing a particular type of information especially useful for natural language generation.</Paragraph>
      <Paragraph position="1"> four linguistic resources:</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="607" end_page="607" type="metho">
    <SectionTitle>
1. The WordNet on-line lexical database
</SectionTitle>
    <Paragraph position="0"> (Miller et al., 1990). WordNet is a well known on-line dictionary, consisting of 121,962 unique words, 99,642 synsets (each synset is a lexical concept represented by a set of synonymous words), and 173,941 senses of words. 1 It is especially useful for generation because it is based on lexical concepts, rather than words, and because it provides several semantic relationships (hyponymy, antonymy, meronymy, entailment) which are beneficial to lexical choice. 2. English Verb Classes and Alternations (EVCA) (Levin, 1993). EVCA is an extensive linguistic study of diathesis alternations, which are variations in the realization of verb arguments. For example, the alternation &amp;quot;there-insertion&amp;quot; transforms A ship appeared on the horizon to There appeared a ship on the horizon. Knowledge of alternations facilitates the generation of paraphrases. (Levin, 1993) studies 80 alternations. null</Paragraph>
  </Section>
  <Section position="5" start_page="607" end_page="607" type="metho">
    <SectionTitle>
3. The COMLEX syntax dictionary
</SectionTitle>
    <Paragraph position="0"> (Grishman et al., 1994). COMLEX contains syntactic information for 38,000 English words. The information includes subcategorization and complement restrictions.</Paragraph>
  </Section>
  <Section position="6" start_page="607" end_page="610" type="metho">
    <SectionTitle>
4. The Brown Corpus tagged with WordNet senses
</SectionTitle>
    <Paragraph position="0"> (Miller et al., 1993). The original Brown Corpus (Kučera and Francis, 1967) has been used as a reference corpus in many computational applications. Part of the Brown Corpus has been tagged with WordNet senses manually by the WordNet group.</Paragraph>
    <Paragraph position="1"> We use this corpus for frequency measurements and for extracting selectional constraints.</Paragraph>
    <Section position="1" start_page="607" end_page="608" type="sub_section">
      <SectionTitle>
2.2 Combining linguistic resources
</SectionTitle>
      <Paragraph position="0"> In this section, we present an algorithm for merging data from the four resources in a manner that achieves high accuracy and completeness. We focus on verbs, which play the most important role in deciding phrase and sentence structure.</Paragraph>
      <Paragraph position="1"> Our algorithm first merges COMLEX and EVCA, producing a list of syntactic subcate~ gorizations and alternations for each verb. Distinctions in these syntactic restrictions according to each sense of a verb are achieved in the second stage, where WordNet is merged with the result of the first step. Finally, the corpus information is added, complementing the static resources with actual usage counts for each syntactic pattern. This allows us to detect rarely used constructs that should be avoided during generation, and possibly to identify alternatives that are not included in the lexical databases.</Paragraph>
      <Paragraph position="2">  Alternations involve syntactic transformations of verb arguments. They are thus a means to alleviate the usual lack of alternative ways to express the same concept in current generation systems.</Paragraph>
      <Paragraph position="3"> EVCA has been designed for use by humans, not computers. We need therefore to convert the information present in Levin's book (Levin, 1993) to a format that can be automatically analyzed. We extracted the relevant information for each verb using the verb classes to which the various verbs are assigned; members of the same class have the same syntactic behavior in terms of allowable alternations. EVCA specifies a mapping between words and word classes, associating each class with alternations and with subcategorization frames. Using the mapping from word and word classes, and from word classes to alternations, alternations for each verb are extracted.</Paragraph>
      <Paragraph position="4"> We manually formatted the alternate patterns in each alternation in COMLEX format.</Paragraph>
      <Paragraph position="5">  The reason to choose manual formatting rather than automating the process is to guarantee the reliability of the result. In terms of time, manual formatting process is no more expensive than automation since the total number of alternations is smail(80). When an alternate pattern can not be represented by the labels in COM-LEX, we need to added new labels during the formatting process; this also makes automating the process difficult.</Paragraph>
      <Paragraph position="6"> The formatted EVCA consists of sets of applicable alternations and subcategorizations for 3,104 verbs. We show the sample entry for the verb appear in Figure 1. Each verb has 1.9 alternations and 2.4 subcategorizations on average. The maximum number of alternations (13) is realized for the verb &amp;quot;roll&amp;quot;.</Paragraph>
      <Paragraph position="7"> The merging of COMLEX and EVCA is achieved by unification, which is possible due to the usage of similar representations. Two points are worth to mention: (a) When a more general form is unified with a specific one, the later is adopted in final result. For example, the unification of PP2 and PP-PRED-RS 3 is PP-PRED-RS. (b) Alternations are validated by the subcategorization information. An alternation is applicable only if both alternate patterns are applicable.</Paragraph>
      <Paragraph position="8"> Applying this algorithm to our lexical resources, we obtain rich subcategorization and alternation information for each verb. COM-LEX provides most subcategorizations, while EVCA provides certain rare usages of a verb which might be missing from COMLEX. Conversely, the alternations in EVCA are validated by the subcategorizations in COMLEX. The merging operation produces entries for 5,920 verbs out of 5,583 in COMLEX and 3,104 in EVCA. 4 Each of these verbs is associated with</Paragraph>
    </Section>
    <Section position="2" start_page="608" end_page="609" type="sub_section">
      <SectionTitle>
5.2 subcategorizations and 1.0 alternation on
</SectionTitle>
      <Paragraph position="0"> average. Figure 2 is an updated version of Figure 1 after this merging operation.</Paragraph>
      <Paragraph position="2"> a mapping between concepts and words. Its inclusion of rich lexical relations also provide basis for lexical choice. Despite of these advantages, the syntactic information in WordNet is relatively poor. Conversely, the result we obtained after combining COMLEX and EVCA has rich syntactic information, but this information is provided at word level thus unsuitable to use for generation directly. These complementary resources are therefore combined in the second stage, where the subcategorizations and alternations from COMLEX/EVCA for each word are assigned to each sense of the word.</Paragraph>
      <Paragraph position="3"> Each synset in WordNet is linked with a list of verb frames, each of which represents a simple syntactic pattern and general semantic constraints on verb arguments, e.g., Somebody -s something. The fact that WordNet contains this syntactic information(albeit poor) makes it possible to link the result from COMLEX/EVCA with WordNet.</Paragraph>
      <Paragraph position="4"> The merging operation is based on a compatibility matrix, which indicates the compatibility of each subcategorization in COMLEX/EVCA with each verb frame in WordNet. The sub- null categorizations and alternations listed in COM-LEX/EVCA for each word is then assigned to different senses of the word based on their compatibility with the verbs frames listed under that sense of the word in WordNet. For example, if for a certain word, the subcategorizations PP-PRED-RS and NP are listed for the word in COMLEX/EVCA, and the verb frame somebody -s PP is listed for the first sense of the word in WordNet, then PP-PRED-RS will be assigned to the first sense of the word while NP will not. We also keep in the lexicon the general constraint on verb arguments from Word-Net frames. Therefore, for this example, the entry for the first sense of w indicates that the verb can take a prepositional phrase as a complement, the subject of the verb is the same as the subject of the prepositional phrase, and the subject should be in the semantic category &amp;quot;somebody&amp;quot;. As you can see, the result incorporates information from three resources and but is more informative than any of them. An alternation is considered applicable to a word sense if both alternate patterns have matchable verb frames under that sense.</Paragraph>
      <Paragraph position="5"> The compatibility matrix is the kernel of the merging operations. The 147&amp;quot;35 matrix (147 subcategorizations from COMLEX/EVCA, 35 verb frames from WordNet) was first manually constructed based on human understanding. In order to achieve high accuracy, the restrictions to decide whether a pair of labels are compatible are very strict when the matrix was first constructed. We then use regressive testing to adjust the matrix based on the analysis of merging results. During regressive testing, we first merge WordNet with COMLEX/EVCA using current version of compatibility matrix, and write all inconsistencies to a log file. In our case, an inconsistency occurs if a subcategorization or alternation in COMLEX/EVCA for a word can not be assigned to any sense of the word, or a verb frame for a word sense does not match any subcategorization for that word. We then analyze the log file and adjust the compatibility matrix accordingly. This process repeated 6 times until when we analyze a fair amount of inconsistencies in the log file, they are no more due to over-restriction of the compatibility matrix. null Inconsistencies between WordNet and COMappear: null sense 1 give an impression  LEX/EVCA result unmatching subcategorizations or verb frames. On average, 15% of subcategorizations and alternations for a word can not be assigned to any sense of the word, mostly due to the incompleteness of syntactic information in WordNet; 2% verb frames for each sense of a word does not match any subcategorizations for the word, either due to incompleteness of COMLEX/EVCA or erroneous entries in WordNet.</Paragraph>
      <Paragraph position="6"> The lexicon at this stage is a rich set of subcategorizations and alternations for each sense of a word, coupled with semantic constraints of verb arguments. For 5,920 words in the result after combining COMLEX and EVCA, 5,676 words also appear in WordNet and each word has 2.5 senses on average. After the merging operation, the average number of subcategorizations is refined from 5.2 per verb in COM-LEX/EVCA to 3.1 per sense, and the average number of alternations is refined from 1.0 per verb to 0.2 per sense. Figure 3 shows the result for the verb appear after the merging operation.</Paragraph>
    </Section>
    <Section position="3" start_page="609" end_page="610" type="sub_section">
      <SectionTitle>
2.3 Corpus analysis
</SectionTitle>
      <Paragraph position="0"> Finally, we enriched the lexicon with language usage information derived from corpus analysis. The corpus used here is the Brown Corpus.</Paragraph>
      <Paragraph position="1"> The language usage information in the lexicon include: (1) frequency of each word sense; (2) frequency of subcategorizations for each word sense. A parser is used to recognize the subcategorization of a verb. The corpus analysis in- null formation complements the subcategorizations from the static resources by marking potential superfluous entries and supplying entries that are possibly missing in the lexicai databases; (3) semantic constraints of verb arguments. The arguments of each verb are clustered based on hyponymy hierarchy in WordNet. The semantic categories we thus obtained are more specific compared to the general constraint(animate or inanimate) encoded in WordNet frame representation. The language usage information is especially useful in lexicai choice.</Paragraph>
    </Section>
    <Section position="4" start_page="610" end_page="610" type="sub_section">
      <SectionTitle>
2.4 Discussion
</SectionTitle>
      <Paragraph position="0"> Merging resources is not a new idea and previous work has investigated integration of resources for machine translation and interpretation (Klavans et al., 1991), (Knight and Luk, 1994). Whereas our work differs from previous work in that for the first time, a generation lexicon is built by this technique; unlike other work which aims to combine resources with similar type of information, we select and combine multiple resources containing different types of information; while others combine not well formatted lexicon like LDOCE (Longman Dictionary of Contemporary English), we chose well formatted resources (or manually format the resource) so as to get reliable and usable results; semi-automatic rather than fully automatic approach is adopted to ensure accuracy; corpus analysis based information is also linked with information from static resources. By these measures, we are able to acquire an accurate, reusable, rich, and large-scale lexicon for natural language generation.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="610" end_page="612" type="metho">
    <SectionTitle>
3 Applications
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="610" end_page="611" type="sub_section">
      <SectionTitle>
3.1 Architecture
</SectionTitle>
      <Paragraph position="0"> We applied the lexicon to lexical choice and lexical realization in a practical generation system. First we introduce the architecture of lexical choice and realization and then describe the overall system.</Paragraph>
      <Paragraph position="1"> A multi-level feedback architecture as shown in Figure 4 was used for lexical choice and realization. We distinguish two types of concepts: semantic concepts and lexicai concepts. A semantic concept is the semantic meaning that a user wants to convey, while a lexical concept is a lexical meaning that can be represented by a set  of synonymous words, such as synsets defined in WordNet. Paraphrases are also distinguished into 3 types according to whether they are at the semantic, lexical, or syntactic level. For example, if asked whether you will be at home tomorrow, then the answers &amp;quot;I'll be at work tomorrow&amp;quot;, &amp;quot;No, I won't be at home.', and &amp;quot;I'm leaving for vacation tonight&amp;quot; are paraphrases at the semantic level. Paraphrases like &amp;quot;He bought an umbrella&amp;quot; and &amp;quot;He purchased an umbrella&amp;quot; are at the lexical level since they are acquired by substituting certain words with synonymous words. Paraphrases like &amp;quot;A ship appeared on the horizon&amp;quot; and &amp;quot;On the horizon appeared a ship&amp;quot; are at the syntactic level since they only involve syntactic transformations. Therefore, all paraphrases introduced by alternations are at syntactic level. Our architecture includes levels corresponding to these 3 levels of paraphrasing. null The input to the lexical choice and realization module is represented as semantic concepts. In the first stage, semantic paraphrasing is carried out by mapping semantic concepts to lexical concepts. Generally, semantic level paraphrases are very complex. They depend on the  situation, the domain, and the semantic relations involved. Semantic paraphrases are represented declaratively in a database file which can be edited by the users. The file is indexed by semantic concepts and under each entry, a list of lexical concepts that can be used to realize the semantic concept are provided.</Paragraph>
      <Paragraph position="2"> In the second stage, we use the lexical resource that we constructed to choose words for the lexical concepts produced by stage 1. The lexicon is indexed by lexical concepts that point to synsets in WordNet. These synsets represent a set of synonymous words and thus, it is at this stage that lexical paraphrasing is handled. In order to choose which word to use for the lexical concept, we use domain-independent constraints that are included in the lexicon as well as domain-specific constraints. Syntactic constraints that come from the detailed subcategorizations linked to each word sense is a domain-independent constraint. Subcategorizations are used to check that the input can be realized by the word. For example, if the input has 3 arguments, then words which take only 2 arguments can not be selected. Semantic constraints on verb argument derived from WordNet and the corpus are used to check the agreement of the arguments. For example, if the input subject argument is an animate, then words which take only inanimate subject can not be selected. Frequency information derived from the corpus is also used to constrain word choice. Besides the above domain-independent constraints other constraints specific to a domain might also be needed to choose an appropriate word for the lexical concept. Introducing the combined lexicon at this stage allows us to produce many lexical paraphrases without much effort; it also allows us to separate domain-independent and domain-specific constraints in lexical choice so that domain-independent constraints can be reused in each application.</Paragraph>
      <Paragraph position="3"> The third stage produces a structure represented as a high level sentence structure, with subcategorizations and words associated with each sentence. At this stage, information in the lexical resource about subcategorization and alternations are applied in order to generate syntactic paraphrases. Output of this stage is then fed directly to the surface realization package, the FUF/SURGE system (Elhadad, 1992; Robin, 1994). To choose which alternate pattern of an alternation to use, we use information such as focus of the sentence as criteria; when the two alternates are not distinctively different, such as &amp;quot;He knocked the door&amp;quot; and &amp;quot;He knocked at the door&amp;quot;, one of them is randomly chosen. The application of subcategorizations in the lexicon at this stage helps to check that the output is grammatically correct, and alternations can produce many syntactic paraphrases.</Paragraph>
      <Paragraph position="4"> The above refining processing is interactive.</Paragraph>
      <Paragraph position="5"> When a lower level can not find a possible candidate to realize the high level representation, feedback is sent to the higher level module, which then makes changes accordingly.</Paragraph>
    </Section>
    <Section position="2" start_page="611" end_page="612" type="sub_section">
      <SectionTitle>
3.2 PlanDOC
</SectionTitle>
      <Paragraph position="0"> Using the proposed architecture, we applied the lexicon to a practical generation system, PIan-DOC. PlanDOC is an enhancement to Bellcore's LEIS-PLAN TM network planning product. It transforms lengthy execution traces of engineer's interaction with LEIX-PLAN into human-readable summaries.</Paragraph>
      <Paragraph position="1"> For each message in PlanDOC, at least 3 paraphrases are defined at semantic level. For example, '~rhe base plan called for one fiber activation at CSA 2100&amp;quot; and &amp;quot;There was one fiber activation at CSA 2100&amp;quot; are semantic paraphrases in PlanDOC domain. At the lexical level, we use synonymous words from WordNet to generate lexical paraphrases. A sample lexical paraphrase for &amp;quot;The base plan called for one fiber activation at CSA 2100&amp;quot; is &amp;quot;The base plan proposed one fiber activation at CSA 2100&amp;quot;.</Paragraph>
      <Paragraph position="2"> Subcategorizations and alternations from the lexicon are then applied at the syntactic level.</Paragraph>
      <Paragraph position="3"> After three levels of paraphrasing, each message in PlanDOC on average has over 10 paraphrases. null For a specific domain such as PlanDOC, an enormous proportion of a general lexicon like the one we constructed is unrelated thus unused at all. On the other hand, domain-specific knowledge may need to be added to the lexicon.</Paragraph>
      <Paragraph position="4"> The problem of how to adapt a general lexicon to a particular application domain and merge domain ontologies with a general lexicon is out of the scope of this paper but discussed in (Jing, 1998).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>