<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0207"> <Title>Generation from Lexical Conceptual Structures</Title> <Section position="3" start_page="0" end_page="52" type="metho"> <SectionTitle> 2 Lexical Conceptual Structure </SectionTitle> <Paragraph position="0"> Lexical Conceptual Structure is a compositional abstraction with language-independent properties that transcend structural idiosyncrasies (Jackendoff, 1983; Jackendoff, 1990; Jackendoff, 1996). This representation has been used as the interlingua of several projects such as UNITRAN (Dorr et al., 1993) and MILT (Dorr, 1997).</Paragraph> <Paragraph position="1"> An LCS is adirected graph with a root. Each node is associated with certain information, including a type, a primitive and a field. The type of an LCS node is one of Event, State, Path, Manner, Property or Thing, loosely correlated with verbs prepositions, adverbs, adjectives and nouns. Within each of these types, there are a number of conceptual primitives of that type, which are the basic building blocks of LCS structures. There are two general classes of primitives: closed class or structural primitive (e.g., CAUSE, GO, BE, TO) and CONSTANTS, corresponding to the primitives for open lexical classes (e.g., reduce+ed, textile+, slash+ingly). I. Examples of fields include Locational, Possessional, Identificational. Children are also designated as to whether they are subject, argument, or modifier position.</Paragraph> <Paragraph position="2"> An LCS captures the semantics of a lexical item through a combination of semantic structure (specified by the shape of the graph and its structural primitives and fields) and semantic content (specified through constants). The semantic structure of a verb is the same for all members of a verb class (Levin and Rappaport Hovav, 1995) whereas the content is specific to the verb itself. So, all the verbs in the &quot;Cut Verbs - Change of State&quot; class have the same semantic structure but vary in their semantic content (for example, chip, cut, saw, scrape, slash and scratch).</Paragraph> <Paragraph position="3"> The lexicon entry or Root LCS (RLCS) of one sense of the Chinese verb xuel_jian3 is as follows:</Paragraph> <Paragraph position="5"> The top node in the. RLCS has the structural primitive ACT_ON in the locational field. Its sub-ject is a star-marked LCS, meaning a subordinate RLCS needs to be filled in here to form a complete event. It also has the restriction that the filler LCS be of the type thing. The number &quot;1&quot; in that node specifies the thematic role: in this case, agent. The second child node, in argument position, needs to t Suffixes such as /, /ed, +ingly are markers of the open class of primitives, indicating the type be of type thing too. The number &quot;2&quot; stands for theme. The last two children specify the manner of the locational act_on, that is &quot;cutting in a downward manner&quot;. The RLCS for nouns are generally much simpler since they usually include only one root node with a primitive* For instance (US+) or (quota+).</Paragraph> <Paragraph position="6"> The meaning of complex phrases is represented as a composed LCS (CLCS). This is constructed &quot;composed&quot; from several RLCSes corresponding to individual words. In the composition process, which starts with a parse tree of the input sentence, all the obligatory positions in the root and subordinate RLCS corresponding to lexical items are filled with other RLCSes from appropriately placed items in the parse tree. 
For example, the three RLCSes we have seen already can compose to give the CLCS in (2), corresponding to the English sentence: United States cut down (the) quota.</Paragraph>
<Paragraph position="7"> CLCS structures can be composed of different sorts of RLCS structures, corresponding to different words. A CLCS can also be decomposed on the generation side in different ways, depending on the RLCSes of the lexical items in the target language.</Paragraph>
<Paragraph position="8"> For example, the CLCS above will match a single verb and two arguments when generated in Chinese (regardless of the input language). But it will match four lexical items in English: cut, US, quota, and down, since the RLCS for the verb &quot;cut&quot; in the English lexicon, as shown in (3), does not include the modifier down.</Paragraph>
<Paragraph position="10"> The rest of the examples in this paper will refer to the slightly more complex CLCS shown in (4), corresponding to the English sentence: The United States unilaterally reduced the China textile export quota. This LCS is presented without all the additional features for the sake of clarity. Also, it is actually one of eight possible LCS compositions produced by the analysis component from the input Chinese sentence.</Paragraph> </Section>
<Section position="4" start_page="52" end_page="56" type="metho"> <SectionTitle> 3 The Generation System </SectionTitle>
<Paragraph position="0"> Since this generation system was developed in tandem with the most recent LCS composition system and language-specific lexicon extensions, a premium was put on the ability to experiment along a number of parameters and to adjust rapidly on the basis of intermediate inputs to and results from the generation system. This goal encouraged a modular design, and made Lisp a convenient language for implementation. We were also able to successfully integrate components from the Nitrogen generation system. The architecture of the generation system is shown in Figure 1, with the main modules and sub-modules and the flow of information between them.</Paragraph>
<Paragraph position="1"> The first main component translates, with the use of a language-specific lexicon, from the LCS interlingua to a language-specific representation of the sentence in a modified form of the AMR interlingua, using words and features specific to the target language, but also including syntactic and semantic information from the LCS representation. The second main component produces target-language sentences from this intermediate representation. We will now describe each of these components in more detail.</Paragraph>
<Paragraph position="2"> The input to the generation component is a text representation of a CLCS, the Lexical Conceptual Structure corresponding to a natural language sentence. The particular format, known as longhand, is equivalent to the form shown in (4), but makes certain information more explicit and regular (at the price of increased verbosity). The longhand CLCS can either be a fully language-neutral interlingua representation, or one which still incorporates some aspects of the source-language interpretation process. The latter may include grammatical features on LCS nodes, but also nodes, known as functional nodes, which correspond to words in the source language but are not LCS nodes themselves, serving merely as place-holders for feature information. Examples of these nodes include punctuation markers, coordinating conjunctions, grammatical aspect markers, and determiners.
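To illustrate how such a functional node might be represented and later stripped, here is a hedged sketch using a plain dictionary encoding (the node layout and function name are invented; the pre-processing step it anticipates is described in Section 3.1 below):

```python
def promote_functional(node):
    """Pre-processing sketch: a functional node is only a feature
    placeholder, so its features are merged into its single content child
    and the placeholder itself disappears from the graph."""
    node["children"] = [promote_functional(c) for c in node.get("children", [])]
    if node.get("functional"):
        (child,) = node["children"]        # functional nodes wrap one child
        child["features"] = {**node.get("features", {}),
                             **child.get("features", {})}  # child wins on conflict
        return child
    return node

# Example: a determiner placeholder passing definiteness down to COUNTRY+.
det = {"functional": True, "features": {"definite": "+"},
       "children": [{"primitive": "COUNTRY+", "features": {}, "children": []}]}
assert promote_functional(det)["primitive"] == "COUNTRY+"
```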
An additional extension of the LCS input language, beyond traditional LCS, is the in-place representation of an ambiguous sub-tree as a POSSIBLES node, which has the various possibilities represented as its own children.</Paragraph>
<Paragraph position="3"> Thus, for example, the following structure (with some aspects elided for brevity) represents a node that could be one of three possibilities. In the second one, the root of the sub-tree is a functional node, passing its features to its child, COUNTRY+:</Paragraph>
<Section position="1" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 3.1 Lexical Choice </SectionTitle>
<Paragraph position="0"> The first major component, divided into four pipelined sub-modules as shown in Figure 1, transforms a CLCS structure into what we call an LCS-AMR structure, using the syntax of the Abstract Meaning Representation (AMR) used in the Nitrogen generation system, but with words already chosen (rather than more abstract Sensus ontology concepts), and also augmented with information from the LCS that is useful for target-language realization. The pre-processing phase converts the text input format into an internal graph representation for efficient access by components (with links for parents as well as children), and also does away with extraneous source-language features, converting, for example, (5) to remove the functional node and promote COUNTRY+ to be one of the possible sub-trees. This involves a top-down traversal of the tree, including some complexities when functional nodes without children (which then assign features to their parents) are direct children of POSSIBLES nodes.</Paragraph>
<Paragraph position="1"> The lexical access phase compares the internal CLCS form to the target-language lexicon, decorating the CLCS tree with the RLCSes of target-language words which are likely to match sub-structures of the CLCS. In an off-line processing phase, the target-language lexicon is stored in a hash table, with each entry keyed on a designated primitive, the most distinguishing node in the RLCS.</Paragraph>
<Paragraph position="2"> On-line decoration then proceeds as a two-step process for each node in the CLCS: (6) a. look for RLCSes stored in the lexicon under the CLCS node's primitives; b. store each retrieved RLCS at the node in the CLCS that matches the root of that RLCS. Figure 2 shows some of the English entries matching the CLCS in (4). For most of these words, the designated primitive is the only node in the corresponding LCS for that entry. For reduce, however, reduce+ed is the designated primitive. While this entry will be retrieved in step (6)a while examining the reduce+ed node from (4), in step (6)b the LCS for &quot;reduce&quot; will be stored at the root node of (4) (cause).</Paragraph>
<Paragraph position="4"> The current English lexicon contains over 11,000 RLCS entries such as those in Figure 2, including over 4,000 verbs, and 6,200 unique primitive keys in the hash table.</Paragraph> </Section>
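The off-line indexing and the on-line decoration steps in (6) might be sketched as follows. This is a hypothetical illustration only: the dictionary fields designated_primitive, root_primitive, parent and candidates are invented stand-ins for the system's actual Lisp structures.

```python
from collections import defaultdict

def index_lexicon(entries):
    """Off-line step: hash each RLCS entry on its designated primitive,
    the most distinguishing node of that entry."""
    table = defaultdict(list)
    for rlcs in entries:
        table[rlcs["designated_primitive"]].append(rlcs)
    return table

def decorate(node, table):
    """On-line steps (6)a-b: (a) retrieve candidate RLCSes under this CLCS
    node's primitive; (b) store each one at the CLCS node matching the RLCS
    root; e.g., the entry for 'reduce', keyed on reduce+ed, ends up stored
    at the CAUSE root. Then recurse over the children."""
    for rlcs in table.get(node["primitive"], ()):
        target = node
        while target is not None and target["primitive"] != rlcs["root_primitive"]:
            target = target.get("parent")      # climb toward the RLCS root
        if target is not None:
            target.setdefault("candidates", []).append(rlcs)
    for child in node.get("children", []):
        decorate(child, table)
```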
<Section position="2" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 3.1.3 Alignment/Decomposition </SectionTitle>
<Paragraph position="0"> The heart of the lexical access algorithm is the decomposition process. This algorithm attempts to align RLCSes selected by the lexical access phase with parts of the CLCS, to find a complete covering of the CLCS graph. The main algorithm is very similar to that described in (Dorr, 1993), with some extensions to deal also with the in-place ambiguity represented by the POSSIBLES nodes.</Paragraph>
<Paragraph position="1"> The algorithm recursively checks a CLCS node against corresponding RLCS nodes coming from the lexical entries retrieved and stored in the previous phase. If significant incompatibilities are found, the lexical entry is discarded. If all (obligatory) nodes in the RLCS match against nodes in the CLCS, then the rest of the CLCS is recursively checked against other lexical entries stored at the remaining unmatched CLCS nodes. Some nodes, indicated with a &quot;*&quot; as in Figure 2, require not just a match against the corresponding CLCS node, but also a match against another lexical entry. Some CLCS nodes must thus match multiple RLCS nodes. A CLCS node matches an RLCS node if the following conditions hold:
(7) a. the primitives are the same (or the primitive for one is a wild-card, represented as nil)
b. the types (e.g., thing, event, state, etc.) are the same
c. the fields (e.g., identificational, possessive, locational, etc.) are the same
d. the positions (e.g., subject, argument, or modifier) are the same
e. all obligatory children of the RLCS node have corresponding matches to children of the CLCS node.</Paragraph>
<Paragraph position="6"> Subject and argument children of an RLCS node are obligatory unless specified as optional, whereas modifiers are optional unless specified as obligatory. In the RLCS for &quot;reduce&quot; in Figure 2, the nodes corresponding to agent and theme (numbered 1 and 2, respectively) are obligatory, while the instrument (the node numbered 19) is optional. Thus, even though in (4) there is no matching lexical entry for the node in Figure 2 numbered 20 (&quot;*&quot;-marked in the RLCS for &quot;with&quot;), the main RLCS for &quot;reduce&quot; is allowed to match, though without any realization for the instrument.</Paragraph>
<Paragraph position="7"> A complexity in the algorithm occurs when there are multiple possibilities filling a position in a CLCS. In this case, only one of these possibilities is required to match all the corresponding RLCS nodes in order for a lexical entry to match. In the case where some of these possibilities do not match any RLCS nodes (meaning there are no target-language realizations for these constructs), those possibilities can be pruned at this stage. On the other hand, ambiguity can also be introduced at the decomposition stage, if multiple lexical entries can match a single structure. The result of the decomposition process is a match structure indicating the hierarchical relationship between all the lexical entries which together cover the input CLCS.</Paragraph>
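A minimal sketch of the matching conditions in (7), again assuming a dictionary-based node encoding with invented field names (primitive, type, field, position, optional, obligatory, children):

```python
def nodes_match(clcs, rlcs):
    """Conditions (7)a-d for one CLCS/RLCS node pair; a nil (None)
    primitive on either side acts as a wild-card."""
    return ((clcs["primitive"] == rlcs["primitive"]
             or clcs["primitive"] is None or rlcs["primitive"] is None)
            and clcs["type"] == rlcs["type"]
            and clcs["field"] == rlcs["field"]
            and clcs["position"] == rlcs["position"])

def rlcs_matches(clcs, rlcs):
    """Condition (7)e on top of (7)a-d: every obligatory RLCS child
    (subjects and arguments unless marked optional; modifiers only if
    marked obligatory) must match some child of the CLCS node."""
    if not nodes_match(clcs, rlcs):
        return False
    for rchild in rlcs["children"]:
        if rchild["position"] in ("subject", "argument"):
            obligatory = not rchild.get("optional", False)
        else:                                   # modifiers
            obligatory = rchild.get("obligatory", False)
        if obligatory and not any(rlcs_matches(c, rchild)
                                  for c in clcs["children"]):
            return False
    return True
```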
<Paragraph position="8"> The match structure resulting from decomposition is then converted into the appropriate input format used by the Nitrogen generation system. Nitrogen's input, Abstract Meaning Representation (AMR), is a labeled directed graph written using the syntax of the PENMAN Sentence Plan Language (Penman, 1989). The structure of an AMR is basically as in (8). Since the roles expected by Nitrogen's English generation grammar do not match well with the thematic roles and features of a CLCS, we have extended the AMR language with LCS-specific relations, calling the result an LCS-AMR. To distinguish the LCS relations from those used by Nitrogen, we mark most of the new roles with the prefix :LCS-. Figure 3 shows the LCS-AMR corresponding to the CLCS in (4).</Paragraph>
<Paragraph position="9"> In the above example, the basic role / is used to specify an instance. So, the LCS-AMR can be read as an instance of the concept |reduce|, whose category is verb and which is in the active voice. Moreover, |reduce| has two thematic roles related to it, an agent and a theme, and it is modified by the concept |unilaterally|. The different roles modifying |reduce| come from different origins. The :LCS-NODE value comes directly from the unique node number in the input CLCS. The category, voice and telicity are derived from features of the LCS entry for the verb |reduce| in the English lexicon. The specifications of agent and theme come from the LCS representation of the verb reduce in the English lexicon as well, as can be seen from node numbers 1 and 2 in the lexicon entry in Figure 2. The role :LCS-MOD-MANNER is derived by combining the fact that the corresponding node filled a modifier role in the CLCS with the fact that its type is Manner.</Paragraph> </Section>
<Section position="3" start_page="53" end_page="56" type="sub_section"> <SectionTitle> 3.2 Realization </SectionTitle>
<Paragraph position="0"> The LCS-AMR representation is then passed to the realization module. The strategy used by Nitrogen is to over-generate possible sequences of English from the ambiguous or under-specified AMRs and then decide among them based on bigram frequency. The interface between the linearization module and the statistical extraction module is a word lattice of possible renderings. The Nitrogen package offers support for both subtasks, linearization and statistical extraction. Initially, we used the Nitrogen grammar to do linearization. But complexities in recasting the LCS-AMR roles as standard AMR roles, as well as efficiency considerations, compelled us to create our own English grammar, implemented in Lisp, to generate the word lattices.</Paragraph>
<Paragraph position="1"> In this module, we force linear order on the unordered parts of an LCS-AMR. This is done by recursively calling subroutines that create various phrase types (NP, PP, etc.) from aspects of the LCS-AMR. The result of the linearization phase is a word lattice specifying the sequence of words that make up the resulting sentence and the points of ambiguity where different generation paths are taken. (9) shows the word lattice corresponding to the LCS-AMR in Figure 3. The word OR specifies the existence of different paths for generation. In this example, the word 'quota' gets all possible determiners, since its definiteness is not specified. Also, the relative order of the words 'textile' and 'export' is not resolved, so both possibilities are generated.</Paragraph>
<Paragraph position="2"> Sentences were realized according to the pattern in (10): first subordinating conjunctions, if any; then modifiers in the temporal field (e.g., &quot;now&quot;, &quot;in 1978&quot;); then the first thematic role; then most other modifiers; then the verb (with collocations, if any); then spatial modifiers (&quot;up&quot;, &quot;down&quot;); then the second and third thematic roles, followed by prepositional phrases and relative sentences. Nitrogen's morphology component was also used, e.g., to give tense to the head verb. In the example above, since there was no tense specified in the input LCS, past tense was used on the basis of the telicity of the verb.</Paragraph>
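As a rough sketch of lattice construction for a noun phrase (hypothetical; the real module is Lisp code driven by LCS-AMR roles, and the function below is invented), under-specified choices simply become alternative paths for the statistical extractor to rank:

```python
from itertools import permutations

def np_alternatives(head, modifiers, definite=None):
    """Enumerate NP renderings: when definiteness is unspecified, try all
    determiners (including none); unordered modifiers become permutations.
    Every path is kept; the statistical extractor chooses later."""
    dets = {True: ["the"], False: ["a"]}.get(definite, ["the", "a", ""])
    paths = []
    for det in dets:
        for mods in permutations(modifiers):
            words = ([det] if det else []) + list(mods) + [head]
            paths.append(" ".join(words))
    return paths

print(np_alternatives("quota", ["textile", "export"]))
# 6 paths, e.g. 'the textile export quota', 'a export textile quota', ...
```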
<Paragraph position="4"> There is no one-to-one mapping between a particular thematic role and an argument position. For example, a theme can be the subject in some cases, the object in others, or even an oblique.</Paragraph>
<Paragraph position="5"> Observe &quot;cookie&quot; in (11).</Paragraph>
<Paragraph position="6"> (11) a. John ate a cookie (object)
b. the cookie contains chocolate (subject)
c. she nibbled at a cookie (oblique)
Thematic roles are numbered for their correct realization order, according to the hierarchy for arguments shown in (12).</Paragraph>
<Paragraph position="7"> (12) agent > instrument > theme > perceived > (everything else)
So, in the case of the occurrence of a theme alone, it is mapped to first argument position. If a theme and an agent occur, the agent is mapped to first argument position and the theme is mapped to second argument position. A more detailed discussion is available in (Dorr et al., 1998). For the LCS-AMR in Figure 3, the thematic hierarchy is what determined that |united states| is the subject and |quota| is the object of the verb |reduce|.</Paragraph>
<Paragraph position="8"> In our input CLCSs, in most cases little hierarchical information was given about multiple modifiers of a noun. Our initial, brute-force solution was to generate all permutations and depend on statistical extraction to decide. This technique worked for noun phrases of about 6 words, but was too costly for larger phrases (of which there were several examples in our test corpus). This cost was alleviated to some degree, while also providing slightly better results than pure bigram selection, by labelling adjectives in the English lexicon as belonging to one of several ordered classes, inspired by the adjective ordering scheme in (Quirk et al., 1985). This is shown in (13).
(13) a. Determiner (all, few, several, some, etc.)
b. Most Adjectival (important, practical, economic, etc.)
c. Age (old, young, etc.)
d. Color (black, red, etc.)
e. Participle (confusing, adjusted, convincing, decided)
f. Provenance (China, southern, etc.)
g. Noun (Bank_of_China, difference, memorandum, etc.)
h. Denominal (nouns made into adjectives by adding -al, e.g., individual, coastal, annual, etc.)
If multiple words fall within the same group, permutations are generated for them. This situation can be seen for the LCS-AMR in Figure 3 with the ordering of the modifiers of the word |quota|: |china|, |export| and |textile|. |china| fell within the Provenance class of modifiers, which gives it precedence over the other two words. The others fell in the Noun class, and therefore both permutations were passed on to the statistical component. The final step, extracting a preferred sentence from the word lattice of possibilities, is done using Nitrogen's Statistical Extractor without any changes. Sentences are scored using unigram and bigram frequencies calculated from two years of the Wall Street Journal (Langkilde and Knight, 1998b).</Paragraph>
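The class-based ordering of (13) might be sketched as follows (illustrative only; the class names are abbreviated and the function signature is invented). Words in distinct classes are ordered by class; words sharing a class yield permutations for the statistical extractor:

```python
from itertools import permutations, product

CLASS_RANK = {"determiner": 0, "most_adjectival": 1, "age": 2, "color": 3,
              "participle": 4, "provenance": 5, "noun": 6, "denominal": 7}

def order_modifiers(tagged):
    """tagged: (word, class) pairs. Orders words by class rank (13)a-h and
    returns every ordering consistent with it; words in the same class are
    permuted, and the alternatives go on to statistical extraction."""
    groups = {}
    for word, cls in tagged:
        groups.setdefault(CLASS_RANK[cls], []).append(word)
    per_group = [list(permutations(groups[r])) for r in sorted(groups)]
    return [sum(map(list, combo), []) for combo in product(*per_group)]

print(order_modifiers([("china", "provenance"),
                       ("textile", "noun"), ("export", "noun")]))
# [['china', 'textile', 'export'], ['china', 'export', 'textile']]
```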
</Section> </Section>
<Section position="5" start_page="56" end_page="56" type="metho"> <SectionTitle> 4 Dealing with Ambiguity </SectionTitle>
<Paragraph position="0"> A major issue in sentence generation from an interlingua or conceptual structure, especially as part of a machine translation project, is how and when to deal with ambiguity. There are several different sources of ambiguity in the generation process outlined in the previous section. Some of these include:
* ambiguity in source-language analysis (as represented by POSSIBLES nodes in the CLCS input to the generation system). This can include ambiguity between multiple concepts, such as the example in (5), between LCS types/structures (e.g., thing or event, which field), or structural ambiguity (subject, argument or modifier)
* ambiguity introduced in lexical choice (when multiple match structures can cover a single CLCS)
* ambiguity introduced in realization (when multiple orderings are possible, as well as multiple morphological realizations)</Paragraph>
<Paragraph position="1"> There are also several types of strategies for addressing ambiguity at various phases, including:
* passing all possible structures down for further processing stages to deal with
* filtering based on &quot;soft&quot; preferences (only pass the highest set of candidates, according to some metric)
* quota-based filtering, passing only the top N candidates
* threshold filtering, passing only candidates that exceed a fixed threshold (either a score or a binary test)
The generation system uses a combination of these strategies at different phases in the processing. Ambiguous CLCS sub-trees are sometimes annotated with scores, based on preference of attachment as an argument rather than a modifier. The alignment algorithm can be run in either of two modes: one which selects only the top-scoring possibility for which a matching structure can be found, and one in which all possible structures are passed on, regardless of score. The former method is the only one feasible when given very large (e.g., over 1 megabyte text file) CLCS inputs. Also at the decomposition level, soft preferences are used, in that missing lexical entries can be hypothesized to cover parts of the CLCS (essentially &quot;making up&quot; words in the target language). This is done, however, only when no legitimate matches are found using the available lexical entries. At the linearization phase, there are often many choices for the ordering of modifiers at the same level. As mentioned in the previous section, we are experimenting with separating these into positional classes, but our last resort is to pass along all permutations of elements in each sub-class. The ultimate arbiter is the statistical extractor, which orders and presents the top-scoring realizations. A schematic sketch of these filtering strategies is given below.</Paragraph>
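The sketch below summarizes the four filtering strategies as a single hypothetical dispatcher (the mode names and scoring interface are invented; in the actual system these decisions are interleaved across the phases described above, not centralized in one function):

```python
def filter_candidates(cands, score, mode="all", n=10, threshold=0.0):
    """Dispatch over the strategies listed above: pass everything ('all'),
    keep only the best-scoring set ('soft'), keep the top N ('quota'),
    or keep candidates above a fixed bar ('threshold')."""
    ranked = sorted(cands, key=score, reverse=True)
    if not ranked or mode == "all":
        return ranked
    if mode == "soft":
        best = score(ranked[0])
        return [c for c in ranked if score(c) == best]
    if mode == "quota":
        return ranked[:n]
    if mode == "threshold":
        return [c for c in ranked if score(c) >= threshold]
    raise ValueError(f"unknown filtering mode: {mode}")
```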
</Section>
<Section position="6" start_page="56" end_page="57" type="metho"> <SectionTitle> 5 Interlingual representation issues </SectionTitle>
<Paragraph position="0"> One issue that needs to be confronted in an interlingua such as LCS is what to do when the linguistic structures of languages vary widely, and useful conceptual structure may also diverge from these. A case in point is the representation of numbers. Languages diverge widely as to which numbers are primitive terms, and how larger numbers are built compositionally through modification (e.g., multiplication and addition). One question that immediately comes up is whether an interlingua such as LCS should represent numbers according to the linguistic structure of the source language (or of some particular designated natural language) or in some other internal numerical form (e.g., decimal numerals).</Paragraph>
<Paragraph position="1"> Likewise, on generation into a target language, the question arises of how much of the structure of the source language should be kept, especially when this is not the most natural way to group things in the target language.</Paragraph>
<Paragraph position="2"> One might be tempted to always convert to a standard interlingua representation of numbers; however, this does lose some possible classification into groups that might be present in the input (contrast in English: &quot;12 pair&quot; with &quot;2 dozen&quot;).</Paragraph>
<Paragraph position="3"> In our Chinese-English efforts, such issues came up, since the natural multiplication points in Chinese are 100, 10,000, and 100,000,000, rather than 100, 1,000, and 1,000,000, as in English. Our provisional solution is to propagate the source-language modification structure all the way through the LCS-AMR stage, and to include special-purpose rules that look for the &quot;Chinese&quot; numbers, multiply them together to get numerals, and then divide and realize the result in the English fashion, e.g., using the words thousand, million, and billion.</Paragraph> </Section> </Paper>