<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1062"> <Title>Learning Parse and Translation Decisions</Title> <Section position="3" start_page="0" end_page="483" type="metho"> <SectionTitle> 2 Basic Parsing Paradigm </SectionTitle> <Paragraph position="0"> As the basic mechanism for parsing text into a shallow semantic representation, we choose a shift-reduce type parser (Marcus, 1980). It breaks parsing into an ordered sequence of small and manageable parse actions such as shift and reduce. This ordered 'left-to-right' parsing is much closer to how humans parse a sentence than, for example, chart-oriented parsers; it allows a very transparent control structure and makes the parsing process relatively intuitive for humans. This is very important, because during the training phase, the system is guided by a human supervisor for whom the flow of control needs to be as transparent and intuitive as possible.</Paragraph> <Paragraph position="1"> The parsing does not have separate phases for part-of-speech selection and syntactic and semantic processing, but rather integrates all of them into a single parsing phase. Since the system has all morphological, syntactic and semantic context information available at all times, it can make well-based decisions very early, allowing a single-path, i.e. deterministic, parse, which avoids wasting computation on 'dead end' alternatives.</Paragraph> <Paragraph position="2"> Before the parsing itself starts, the input string is segmented into a list of words, including punctuation marks, which are then sent through a morphological analyzer that, using a lexicon (see footnote 1), produces primitive frames for the segmented words. A word gets a primitive frame for each possible part of speech. (Morphological ambiguity is captured within a frame.) 
[Figure 1: example of a parse action; boxes represent frames. Recoverable labels: parse stack containing &quot;bought&quot; (synt: verb); input list containing &quot;today&quot; (synt: adv); action (R 2 TO S-VP AS PRED (OBJ PAT)), glossed as &quot;reduce the 2 top elements of the parse stack to a frame with syntax 'vp' and roles 'pred' and 'obj and pat'&quot;; result &quot;bought a book&quot; (synt: vp; sub: (pred) (obj pat)) followed by &quot;today&quot; (synt: adv).]</Paragraph> <Paragraph position="4"> The central data structure for the parser consists of a parse stack and an input list. The parse stack and the input list contain trees of frames of words or phrases. Core slots of frames are surface and lexical form, syntactic and semantic category, subframes with syntactic and semantic roles, and form restrictions [...] (Footnote 1: The lexicon provides part-of-speech information and links words to concepts, as used in the KB (see next section). Additional information includes irregular forms and grammatical gender etc. (in the German lexicon).)</Paragraph> <Paragraph position="5"> [...] slots include special information like the numerical value of number words.</Paragraph> <Paragraph position="6"> Initially, the parse stack is empty and the input list contains the primitive frames produced by the morphological analyzer. After initialization, the deterministic parser applies a sequence of parse actions to the parse structure. The most frequent parse actions are shift, which shifts a frame from the input list onto the parse stack or backwards, and reduce, which combines one or several frames on the parse stack into one new frame. The frames to be combined are typically, but not necessarily, next to each other at the top of the stack. As shown in figure 1, the action (R 2 TO VP AS PRED (OBJ PAT)), for example, reduces the two top frames of the stack into a new frame that is marked as a verb phrase and contains the next-to-the-top frame as its predicate (or head) and the top frame of the stack as its object and patient. 
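As a rough illustration of the shift and reduce actions just described, the following Python sketch replays the reduction of figure 1 on a toy parse state. It is a minimal sketch under stated assumptions, not the system's implementation: the names Frame, shift and reduce_top, the reduced slot layout, and the treatment of "a book" as a ready-made np frame are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical, heavily simplified frame: surface form, syntactic class, subframes."""
    surface: str
    synt: str
    subs: list = field(default_factory=list)  # list of (roles, Frame) pairs


def shift(stack, input_list):
    """Move the first frame of the input list onto the top (right end) of the stack."""
    stack.append(input_list.pop(0))


def reduce_top(stack, n, synt, roles):
    """Combine the n top frames into one new frame; roles[i] labels the i-th combined frame."""
    parts = stack[-n:]
    del stack[-n:]
    surface = " ".join(p.surface for p in parts)
    stack.append(Frame(surface, synt, list(zip(roles, parts))))


# Toy parse state (assumed for illustration; "a book" is treated as an existing np frame).
stack = [Frame("bought", "verb")]
input_list = [Frame("a book", "np"), Frame("today", "adv")]

shift(stack, input_list)
# Rough analogue of (R 2 TO VP AS PRED (OBJ PAT)):
# the next-to-top frame becomes the predicate, the top frame the object and patient.
reduce_top(stack, 2, "vp", [("pred",), ("obj", "pat")])
```

Keeping the top of the stack at the right end of the Python list mirrors the convention that the top of the parse stack is at the right end.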
Other parse actions include add-into, which adds frames arbitrarily deep into an existing frame tree, mark, which can mark any slot of any frame with any value, and operations to introduce empty categories (i.e. traces and 'PRO', as in &quot;She_i wanted PRO_i to win.&quot;). Parse actions can have numerous arguments, making the parse action language very powerful.</Paragraph> <Paragraph position="7"> The parse action sequences needed for training the system are acquired interactively. For each training sentence, the system and the supervisor parse the sentence step by step, with the supervisor entering the next parse action, e.g. (R 2 TO VP AS PRED (OBJ PAT)), and the system executing it, repeating this sequence until the sentence is fully parsed. At least for the very first sentence, the supervisor actually has to type in the entire parse action sequence. With a growing number of parse action examples available, the system, as described below in more detail, can be trained using those previous examples. In such a partially trained system, the parse actions are then proposed by the system using a parse decision structure which &quot;classifies&quot; the current context. The proper classification is the specific action or sequence of actions that (the system believes) should be performed next. During further training, the supervisor then enters parse action commands by either confirming what the system proposes or overruling it by providing the proper action. As the corpus of parse examples grows and the system is trained on more and more data, the system becomes more refined, so that the supervisor has to overrule the system with decreasing frequency. 
The sequence of correct parse actions for a sentence is then recorded in a log file.</Paragraph> </Section> <Section position="4" start_page="483" end_page="484" type="metho"> <SectionTitle> 3 Features </SectionTitle> <Paragraph position="0"> To make good parse decisions, a wide range of features at various degrees of abstraction have to be considered. To express such a wide range of features, we defined a feature language. Parse features can be thought of as functions that map from partially parsed sentences to a value. Applied to the target parse state of figure 1, the feature (SYNT OF OBJ OF -1 AT S-SYNT-ELEM), for example, designates the general syntactic class of the object of the first frame of the parse stack (see footnote 2), in our example np (see footnote 3). So, features do not a priori operate on words or phrases, but only do so if their description references such words or phrases, as in our example through the path 'OBJ OF -1'.</Paragraph> <Paragraph position="1"> Given a particular parse state and a feature, the system can interpret the feature and compute its value for the given parse state, often using additional background knowledge such as the two resources listed below. (Footnote 2: S-SYNT-ELEM designates the top syntactic level; since -1 is negative, the feature refers to the 1st frame of the parse stack. Note that the top of stack is at the right end for the parse stack.) (Footnote 3: If a feature is not defined in a specific parse state, the feature interpreter assigns the special value unavailable.)</Paragraph> <Paragraph position="2"> 1. A knowledge base (KB), which currently consists of a directed acyclic graph of 4356 mostly semantic and syntactic concepts connected by 4518 is-a links, e.g. &quot;book_noun-concept is-a tangible-object_noun-concept&quot;. Most concepts representing words are at a fairly shallow level of the KB, e.g. 
under 'tangible object', 'abstract', 'process verb', or 'adjective', with more depth used only in concept areas more relevant for making parse and translation decisions, such as temporal, spatial and animate concepts (see footnote 4). 2. A subcategorization table that describes the syntactic and semantic role structures for verbs, with currently 242 entries.</Paragraph> <Paragraph position="3"> The following representative examples, for easier understanding rendered in English and not in feature language syntax, further illustrate the expressiveness of the feature language: [...] several elements on the parse stack or input list, and any of their subelements, at any depth. Since the current 205 features are supposed to bear some linguistic relevance, none of them are unjustifiably remote from the current focus of a parse state.</Paragraph> <Paragraph position="4"> The feature collection is basically independent from the supervised parse action acquisition. Before learning a decision structure for the first time, the supervisor has to provide an initial set of features that can be considered obviously relevant. (Footnote 4: Supported by acquisition tools, word/concept pairs are typically entered into the lexicon and the KB at the same time, which usually requires less than a minute per word or group of closely related words.)</Paragraph> <Paragraph position="5"> [Figure, presumably figure 3 (hybrid decision structure); recoverable labels: done-operation-p tree, START, shift, reduce 1..., reduce 2..., reduce 3...]</Paragraph> <Paragraph position="6"> Particularly during the early development of our system, this set was increased whenever parse examples had identical values for all current features but nevertheless demanded different parse actions. Given a specific conflict pair of partially parsed sentences, the supervisor would add a new relevant feature that discriminates the two examples. 
We expect our feature set to grow eventually to about 300 features when scaling up further within the Wall Street Journal domain, and quite possibly to a higher number when expanding into new domains. However, such feature set additions require fairly little supervisor effort. Given (1) a log file with the correct parse action sequences of the training sentences as acquired under supervision and (2) a set of features, the system revisits the training sentences and computes values for all features at each parse step. Together with the recorded parse actions, these feature vectors form parse examples that serve as input to the learning unit. Whenever the feature set is modified, this step must be repeated, but this is unproblematic, because the process is both fully automatic and fast.</Paragraph> </Section> <Section position="5" start_page="484" end_page="485" type="metho"> <SectionTitle> 4 Learning Decision Structures </SectionTitle> <Paragraph position="0"> Traditional statistical techniques also use features, but often have to sharply limit their number (for some trigram approaches to three fairly simple features) to avoid the loss of statistical significance.</Paragraph> <Paragraph position="1"> In parsing, only a very small number of features are crucial over a wide range of examples, while most features are critical in only a few examples, being used to 'fine-tune' the decision structure for special cases. 
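One standard way to quantify how discriminating a feature is over a set of parse examples is information gain, the selection criterion of the basic ID3 algorithm (Quinlan, 1986). The following Python sketch is illustrative only: the feature names, the four toy parse examples and the function names are invented for this example, and it shows plain information gain rather than whatever refinements the system itself may use.

```python
import math
from collections import Counter


def entropy(actions):
    """Shannon entropy (in bits) of a list of parse actions."""
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in Counter(actions).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by splitting (feature_vector, action) pairs on one feature."""
    actions = [action for _, action in examples]
    by_value = {}
    for vector, action in examples:
        by_value.setdefault(vector[feature], []).append(action)
    remainder = sum(len(sub) / len(actions) * entropy(sub) for sub in by_value.values())
    return entropy(actions) - remainder


def most_discriminating(examples, features):
    """Feature with the highest information gain on this subset of examples."""
    return max(features, key=lambda f: information_gain(examples, f))


# Invented toy parse examples: feature vector plus the logged parse action.
examples = [
    ({"synt-of-top": "verb", "synt-of-next-input": "noun"}, "shift"),
    ({"synt-of-top": "np", "synt-of-next-input": "adv"}, "reduce"),
    ({"synt-of-top": "verb", "synt-of-next-input": "adv"}, "shift"),
    ({"synt-of-top": "np", "synt-of-next-input": "noun"}, "reduce"),
]
features = ["synt-of-next-input", "synt-of-top"]
best = most_discriminating(examples, features)
```

In this toy data the syntactic class of the top stack frame separates the two actions completely, while the class of the next input frame carries no information, so only the former would be selected for the split.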
To overcome the antagonism between the importance of having a large number of features and the need to control the number of examples required for learning, particularly when acquiring parse action sequences under supervision, we choose a decision-tree based learning algorithm. It recursively selects the most discriminating feature of the corresponding subset of training examples and eventually ignores all locally irrelevant features, thereby tailoring the size of the final decision structure to the complexity of the training data.</Paragraph> <Paragraph position="2"> While parse actions might be complex for the action interpreter, they are atomic with respect to the decision structure learner; e.g. &quot;(R 2 TO VP AS PRED (OBJ PAT))&quot; would be such an atomic classification. A set of parse examples, as already described in the previous section, is then fed into an ID3-based learning routine that generates a decision structure, which can then 'classify' any given parse state by proposing what parse action to perform next.</Paragraph> <Paragraph position="3"> We extended the standard ID3 model (Quinlan, 1986) to more general hybrid decision structures.</Paragraph> <Paragraph position="4"> In our tests, the best performing structure was a decision list (Rivest, 1987) of hierarchical decision trees, whose simplified basic structure is illustrated in figure 3. Note that in the 'reduce operation tree', the system first decides whether or not to perform a reduction before deciding on a specific reduction.</Paragraph> <Paragraph position="5"> Using our knowledge of similarity of parse actions and the exceptionality vs. 
generality of parse action groups, we can provide an overhead structure that helps prevent data fragmentation.</Paragraph> </Section> <Section position="6" start_page="485" end_page="485" type="metho"> <SectionTitle> 5 Transfer and Generation </SectionTitle> <Paragraph position="0"> The output tree generated by the parser can be used for translation. A transfer module recursively maps the source language parse tree to an equivalent tree in the target language, reusing the methods developed for parsing with only minor adaptations. The main purpose of learning here is to resolve translation ambiguities, which arise for example when translating the English &quot;to know&quot; to German (wissen/kennen) or Spanish (saber/conocer).</Paragraph> <Paragraph position="1"> Besides word pair entries, the bilingual dictionary also contains pairs of phrases and expressions in a format closely resembling traditional (paper) dictionaries, e.g. &quot;to comment on SOMETHING_1&quot;/&quot;sich zu ETWAS_DAT_1 äußern&quot;. Even if a complex translation pair does not bridge a structural mismatch, it can make a valuable contribution to disambiguation. Consider for example the term &quot;interest rate&quot;. Both element nouns are highly ambiguous with respect to German, but the English compound conclusively maps to the German compound &quot;Zinssatz&quot;. We believe that an extensive collection of complex translation pairs in the bilingual dictionary is critical for translation quality and we are confident that its acquisition can be at least partially automated by using techniques like those described in (Smadja et al., 1996). Complex translation entries are preprocessed using the same parser as for normal text. 
During the transfer process, the resulting parse tree pairs are then accessed using pattern matching.</Paragraph> <Paragraph position="2"> The generation module orders the components of phrases, adds appropriate punctuation, and propagates morphologically relevant information in order to compute the proper form of surface words in the target language.</Paragraph> </Section> <Section position="7" start_page="485" end_page="487" type="metho"> <SectionTitle> 6 Wall Street Journal Experiments </SectionTitle> <Paragraph position="0"> We now present intermediate results on training and testing a prototype implementation of the system with sentences from the Wall Street Journal, a prominent corpus of 'real' text, as collected on the ACL-CD.</Paragraph> <Paragraph position="1"> In order to limit the size of the required lexicon, we work on a reduced corpus of 105,356 sentences, a tenth of the full corpus, that includes all those sentences that are fully covered by the 3000 most frequently occurring words (ignoring numbers etc.) in the entire corpus. The first 272 sentences used in this experiment vary in length from 4 to 45 words, averaging 17.1 words and 43.5 parse actions per sentence. One of these sentences is &quot;Canadian manufacturers' new orders fell to $20.80 billion (Canadian) in January, down 4% from December's $21.67 billion on a seasonally adjusted basis, Statistics Canada, a federal agency, said.&quot; [Table 1 (table body not recoverable): results for 16, 32, 64, 128 and 256 training sentences (Tr. snt.), with all 205 features and hybrid decision structure. Legend: Train. = number of training sentences; pr/prec. = precision; rec. = recall; l. = labeled; Tagging = tagging accuracy; Cr/snt = crossings per sentence; Ops = correct operations; OpSeq [...]]</Paragraph> <Paragraph position="2"> For our parsing test series, we use 17-fold cross-validation. The corpus of 272 sentences that currently have parse action logs associated with them is divided into 17 blocks of 16 sentences each. 
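As a rough illustration of this block partition, and of how each block can serve in turn as the test set while training draws only on the other blocks, consider the following Python sketch. The function names are hypothetical, integers stand in for the logged sentences, and taking the first n_train sentences of the remaining blocks is purely an assumption, since the text does not say how the training subset is chosen.

```python
def make_blocks(sentences, block_size):
    """Split the corpus into consecutive blocks of block_size sentences."""
    return [sentences[i:i + block_size] for i in range(0, len(sentences), block_size)]


def cross_validation_splits(sentences, block_size, n_train):
    """Yield one (training, test) pair per block; training never overlaps the test block."""
    blocks = make_blocks(sentences, block_size)
    for k, test in enumerate(blocks):
        rest = [s for j, block in enumerate(blocks) if j != k for s in block]
        # Assumption: use the first n_train sentences of the other blocks.
        yield rest[:n_train], test


corpus = list(range(272))  # stand-ins for the 272 sentences with parse action logs
splits = list(cross_validation_splits(corpus, 16, 128))
```

With 17 blocks of 16 sentences, at most 256 sentences remain for training in any sub-test, which matches the largest training size reported.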
The 17 blocks are then consecutively used for testing. For each of the 17 sub-tests, a varying number of sentences from the other blocks is used for training the parse decision structure, so that within a sub-test, none of the training sentences are ever used as a test sentence. The results of the 17 sub-tests of each series are then averaged.</Paragraph> <Paragraph position="3"> Precision (prec.): (number of correct constituents in system parse) / (number of constituents in system parse). Recall (rec.): (number of correct constituents in system parse) / (number of constituents in logged parse). Crossing brackets (cr): number of constituents which violate constituent boundaries with a constituent in the logged parse.</Paragraph> <Paragraph position="4"> Labeled (l.) precision/recall measures not only structural correctness, but also the correctness of the syntactic label. Correct operations (Ops) measures the number of correct operations during a parse that is continuously corrected based on the logged sequence. The correct operations ratio is important for example acquisition, because it describes the percentage of parse actions that the supervisor can confirm by just hitting the return key. A sentence has a correct operating sequence (OpSeq) if the system fully predicts the logged parse action sequence, and a correct structure and labeling (Str&amp;L) if the structure and syntactic labeling of the final system parse of a sentence is 100% correct, regardless of the operations leading to it.</Paragraph> <Paragraph position="5"> The current set of 205 features was sufficient to always discriminate examples with different parse actions, resulting in a 100% accuracy on sentences already seen during training. While that percentage is certainly less important than the accuracy figures for unseen sentences, it nevertheless represents an important upper ceiling.</Paragraph> <Paragraph position="6"> Many of the mistakes are due to encountering constructions that just have not been seen before at all, typically causing several erroneous parse decisions in a row. This observation further supports our expectation, based on the results shown in table 1 and figure 4, that with more training sentences, the testing accuracy for unseen sentences will still rise significantly. Table 2 shows the impact of reducing the feature set to a set of N core features. While the loss of a few specialized features will not cause a major degradation, the relatively high number of features used in our system finds a clear justification when evaluating compound test characteristics, such as the number of structurally completely correct sentences. When 25 or fewer features are used, all of them are syntactic. Therefore the 25-feature test is a relatively good indicator for the contribution of the semantic knowledge base. [Table 3 (table body not recoverable): comparison of types of decision structure (plain list, hier. list, plain tree, [...]); with 256 training sentences and 205 features.]</Paragraph> <Paragraph position="7"> In another test, we deleted all 10 features relating to the subcategorization table and found that the only metrics with degrading values were those measuring semantic role assignment; in particular, none of the precision, recall and crossing bracket values changed significantly. This suggests that, at least in the presence of other semantic features, the subcategorization table does not play as critical a role in resolving structural ambiguity as might have been expected.</Paragraph> <Paragraph position="8"> Table 3 compares four different machine learning variants: plain decision lists, hierarchical decision lists, plain decision trees and a hybrid structure, namely a decision list of hierarchical decision trees, as sketched in figure 3. 
The results show that extensions to the basic decision tree model can significantly improve learning results.</Paragraph> <Paragraph position="9"> [Table 4 (table body not recoverable): translation evaluation results (best possible = 1.00, worst possible = 6.00).] Table 4 summarizes the evaluation results of translating 32 randomly selected sentences from our Wall Street Journal corpus from English to German. Besides our system, CONTEX, we tested three commercial systems, Logos, SYSTRAN, and Globalink. In order to better assess the contribution of the parser, we also added a version that let our system start with the correct parse, effectively just testing the transfer and generation module. The resulting translations, in randomized order and without identification, were evaluated by ten bilingual graduate students, both native German speakers living in the U.S. and native English speakers teaching college-level German. As a control, half of the evaluators were also given translations by a bilingual human. Note that the translation results using our parser are fairly close to those starting with a correct parse. This means that the errors made by the parser have had a relatively moderate impact on translation quality. The transfer and generation modules were developed and trained based on only 48 sentences, so we expect a significant translation quality improvement by further development of those modules. Our system performed better than the commercial systems, but this has to be interpreted with caution, since our system was trained and tested on sentences from the same lexically limited corpus (but of course without overlap), whereas the other systems were developed on and for texts from a larger variety of domains, making lexical choices more difficult in particular. Table 5 shows the correlation between various parse and translation metrics. Labeled precision has the strongest correlation with both the syntactic and semantic translation evaluation grades.</Paragraph> <Paragraph position="10"> [Table 5 (table body not recoverable): correlation between various parse and translation metrics.] 
Values near -1.0 or 1.0 indicate very strong correlation, whereas values near 0.0 indicate a weak or no correlation. Most correlation values, including those for labeled precision, are negative, because a higher (better) labeled precision correlates with a numerically lower (better) translation score on the 1.0 (best) to 6.0 (worst) translation evaluation scale.</Paragraph> </Section> </Paper>