File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1166_metho.xml
Size: 22,747 bytes
Last Modified: 2025-10-06 14:08:46
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1166"> <Title>and Eleazar Eskin. 1999. Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning. In Proceedings of EMNLP/VLC- 99, College Park, United States. Martin Kay, Jean Mark Gawron, and Peter Norvig. 1994. Verbmobil { A Translation System for Face-to-Face Dialog. CSLI Lecture Notes.</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Document normalization </SectionTitle> <Paragraph position="0"> The authoring of documents in constrained domains and their translation into other languages is a very important activity in industrial settings. In some cases, the distinction between technical writers and technical translators has started to blur, so as to minimize the time and efiorts needed to obtain multilingual documents. The paradigm of translation for monolinguals introduced by Kay in 1973 (Kay, 1997)1 led the way to a new conception of the authoring task, which flrst materialized with systems involving human disambiguation (e.g. (Boitet, 1989; Somers et al., 1990)). A related paradigm emerged in the 90s (Hartley and Paris, 1997), whereby a technical author is responsible for providing the content of a document and a generation system produces multilingual versions of it. Updating documents is then done by updating the document content, and only some postediting may take place instead of full translation by a human translator.</Paragraph> <Paragraph position="1"> Systems implementing this paradigm range from template-based multilingual document creation to systems presenting the user with the evolving text of the document (often called the feedback or control text) in her language, following from the WYSIWYM (What You See Is What You Meant) approach (Power and Scott, 1998).2 Anchors (or active zones) in the text of the evolving document allow the user to specify further its semantics by making choices presented to her in her language. The underlying content representation is then used to generate the text of the document in as many languages as the system supports. The MDA</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> (Multilingual Document Authoring) system </SectionTitle> <Paragraph position="0"> (Dymetman et al., 2000; Brun et al., 2000) follows the WYSIWYM approach, but puts a strong emphasis on the well-formedness of document semantic content. More particularly, document content can be specifled in terms of communicative goals, allowing the selection of messages which are contrastive within the modelled class of documents in no more steps than is needed to identify a predeflned communicative goal. Figure 1 illustrates the architecture of a MDA system. A MDA grammar specifles the possible content representations of a document in terms of trees of typed semantic objects in a formalism inspired from Deflnite Clause Grammars (Pereira and Warren, 1980).</Paragraph> <Paragraph position="1"> 2We have done a review of these systems in (Max, 2003a) in which we have identifled and compared flve families of approaches.</Paragraph> <Paragraph position="2"> Considering all the possibilities ofiered by having the semantic description of a document, for example the exploitation within the Semantic Web, it seemed very interesting to reuse resources developed for controlled document authoring to analyze existing documents. Also, a corpus study of drug lea ets that we conducted (Max, 2003a) showed that documents from the same class of documents could contain a lot of variation, which can hamper the reader's understanding. We deflned document normalization as the process of the identiflcation of the content representation produced by an existing document model corresponding best to an input document, followed by the automatic generation of the text of the normalized document from that content representation. This is illustrated in flgure 2.</Paragraph> <Paragraph position="3"> In the next section, we brie y describe our paradigm for document content analysis, which exploits the MDA formalism in a reverse way.</Paragraph> <Paragraph position="4"> Candidate content representations expressed in the MDA formalism are flrst produced and ranked automatically, and a human expert then identifles the one that best accounts for the communicative content of the original document. The core of this paper is devoted to our implementation of interactive negotiation for document normalization. Finally, we discuss our results and propose ways of improving the system.</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Document normalization system </SectionTitle> <Paragraph position="0"> A MDA grammar can enumerate the well-formed content representations for documents of a given class and associate textual realizations to them (Dymetman et al., 2000).</Paragraph> <Paragraph position="1"> Content representations are typed abstract semantic trees in which dependencies can be established through uniflcation of variables.</Paragraph> <Paragraph position="2"> Generation of text is done in a compositional manner from the semantic representation.</Paragraph> <Paragraph position="3"> Figure 3 shows an excerpt of a MDA grammar describing well-formed commands for the Unix shell. Such a grammar describes both the abstract semantic syntax and the concrete syntax for a particular language, English in this case. The flrst rule reads as follows: lsCommand is a semantic object of type shellCommand type, which is composed of an object of type flleSelection type, an object of type sortCriteria type, and an object of type displayOptions type. Text strings appearing in the right-hand side of the rules are used together with the strings associated with the present semantic objects to compose the normalized text associated with the described abstract semantic trees.</Paragraph> <Paragraph position="4"> Our approach to normalize documents has been described in (Max, 2003b). A heuristic search procedure in the space of content representations deflned by a MDA grammar is flrst performed. Its evaluation function measures a similarity score between the document to be normalized and the normalized documents that can be produced from a partial content representation. The similarity score is inspired from information retrieval, and takes into account common descriptors and their relative informativity in the class of documents. The admissibility property of the search procedure guarantees that the flrst complete content representation found is the one with the best global similarity score. This process uses text generation to measure some kind of similarity, and has been called fuzzy inverted generation.</Paragraph> <Paragraph position="5"> In order to better cover the space of texts conveying the same communicative content, the MDA formalism has been extended to support non-deterministic generation, allowing the production of competing texts from the same content representation, as is illustrated in flgure 4. For each considered content representation, texts are produced and compared to the document to be normalized, thus allowing the ranking of candidate content representations % description for the type 'linksReferencesListing_type' type_display(e, linksReferencesListing_type, 'specifies how links are shown'). % description for the objects 'displayLinksReferences' and 'dontDisplayLinksReferences' functor_display(e, displayLinksReferences,'show the files and directories that are referenced by links'). functor_display(e, dontDisplayLinksReferences,'show links as such (not the files and directories they point to)'). representation through fuzzy inverted generation null by decreasing the similarity score.</Paragraph> <Paragraph position="6"> Given the limitations of the similarity measure inspired from information retrieval, the search is continued to flnd the N flrst documents with the best similarity scores. The identiflcation of the content representation that represents best the communicative content of the original document is then done by interactive negotiation between an expert of the class of the document and the system based on the candidates previously extracted.</Paragraph> <Paragraph position="7"> To demonstrate how the implemented system works, we will consider the normalization of the following description in English of a command for the Unix shell with the grammar of flgure 3: List all flles. Do not show hidden flles and visit subdirectories recursively. Sort results by date of last modiflcation in long format in single-column in reverse chronological order. Give flle size in bytes.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Finding candidate document </SectionTitle> <Paragraph position="0"> representations: fuzzy inverted generation The MDA grammar used is flrst precompiled o+ine by a separate tool, in order to associate proflles of text descriptors to semantic objects and types in the grammar (see (Max, 2003b) for details). In our current implementation, descriptors are WordNet synsets. The text of the input document is then lemmatized and the descriptors are extracted, yielding the proflle of descriptors for the input document. The grammar is then used to construct partial abstract semantic trees, which are ordered in a list of candidates according to the similarity score computed between their proflle and that of the input document. At each iteration, the search algorithm considers the most promising candidate content representation and performs one step of derivation on it, which corresponds to instantiating a variable in the tree with a value for its type. The flrst complete candidate (i.e. an abstract tree not containing any variable) found is then kept, and the search continues until a given number of candidates has been found. This number deflnes a value of system confldence, which can be selected by the user of our normalization system: the higher the confldence, the fewer candidates are kept, at the risk that the best one according to an expert may not be present. Given the size of the grammar used and the complexity of the analysed document, a small number of candidates can be kept (20 in our example).</Paragraph> <Paragraph position="1"> This process restricts the search space from a large collection of virtual documents3 to a comparatively smaller number of concrete textual documents, associated with their semantic structure. A factorization process then builds a unique content representation that contains all the difierent alternative subtrees found in the candidates. Each semantic object in the resulting factorized semantic tree is then decorated by a list of all the candidates to which it belongs. Competing semantic objects are ranked according to the score of the candidate with the highest score to which they belong. This compact representation permits to consider underspeciflcations from the analysis of the input document present at any depth in the candidate semantic trees.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Identifying the best document </SectionTitle> <Paragraph position="0"> representation: interactive negotiation Document normalization implies a normative view on the analysis of a document. Because the communicative content that will be ultimately retained may not be exactly that of the original document, some negotiation must take place to determine which alternative semantic content, if any, is acceptable. This is analoguous to what happens in translation. As (Kay et al., 1994) put it: Translation is not a meaning-preserving function from a source to a target text. Indeed, it is probably teractive negotiation not helpful to think of it as a function at all, but rather as a matter of compromise.</Paragraph> <Paragraph position="1"> In our view, a human expert should be responsible for making di-cult decisions that the machine cannot make without signiflcant interpretation capabilities. Furthermore, these decisions encompass cases where no explicit content in the input document can be used to determine content that is expected in order to obtain a well-formed representation in the semantic model used.4 This will be illustrated below with the negotiation dialogue of flgure 8. A naive way to select the candidate content representation found by the system that best corresponds to the input document would be to show to an expert all the normalized texts corresponding to the candidates. This would however be a tedious and error-prone task. The compact representation built at the end of fuzzy inverted generation allows the discrimination of candidates based on local underspeciflcations corresponding to competing semantic choices. We have implemented three methods for supporting interactive negotiation that will be described below.</Paragraph> <Paragraph position="2"> They allow an expert to resolve underspeciflcations and therefore update the factorized content representation by eliminating incorrect hypotheses. This is iterated until the factorized content representation does not contain any underspeciflcation, as illustrated in flgure 5. Figure 6 shows the main interface of our normalization system after the automatic selection 4This suggests that document normalization can be used as a corrective mecanism applied on ill-formed documents that can be incomplete or semantically incoherent relatively to a given semantic model. of candidate content representations and the construction of the compact representation.</Paragraph> <Paragraph position="3"> Semantic view The middle panel on the right of the window contains the semantic view, which is a graphical view of the factorized abstract semantic tree that can be interpreted by the expert. It uses the text descriptions for semantic objects and types as described by the functor display and type display predicates present in the original MDA formalism (see flgure 3). The tick symbol represents a semantic object that dominates a semantic subtree containing no underspeciflcations. In our example, this is the case for the object described as output type and detail level for display. The arrow symbol describes a semantic object that does not take part in an underspeciflcation, but which dominates a subtree that contains at least one. The exclamation mark symbol denotes a semantic type that is underspecifled, and for which at least two semantic objects are in competition. Semantic objects in competition are denoted by the interrogation mark symbol , and are ordered according to the highest score of the candidate representation to which they belong. This view can be used by the expert to the semantic view navigate at any depth inside the compact representation. By clicking on a semantic object in competition, the expert can decide whether this object belongs to the solution or not. On the example of flgure 7, the expert has selected the flrst possibility (subdirectories are recursively visited) for an underspecifled type (specifles how subdirectories are visited), which is itself dominated by another underspecifled type (specifles whether only directory names are shown or...). The menu that pops up allows the validation of the selected object: this will have for efiect to prune the factorized tree of any subtree that does not belong to at least one of the candidates of the validated object. In the present case, not only will it prune the alternative subtree dominated by subdirectories are not recursively visited, but also the subtree dominated by only show directory names (not Figure 8: Negotiation dialogue about how links should be shown their content) present at a shallower level in the tree. Furthermore, subtrees that would be incompatible elsewhere in the compact representation because of failed parameter uniflcation would disappear. Conversely, the invalidation operation prunes all the subtrees which have at least one candidate in common with the invalidated object. The expert can also ask for a negotiation dialogue, which will be introduced shortly.</Paragraph> <Paragraph position="4"> MDA view It seemed very natural to propose aview withwhichauserof aMDA systemwould already be familiar. Such a view shows the normalized text corresponding to all the objects from the root object that are not in competition. Underspecifled semantic types appear as underlined text spans called active zones, which trigger a pop up menu when clicked. Whereas in the MDA authoring mode all the possible objects for the semantic type that do not violate any semantic dependencies are shown, our system only proposes those that belong to candidates that are still in competition. Furthermore, these semantic objects are not ordered by their order in appearance in the grammar, but by the score of their most likely candidate according to our system. Selecting an object corresponds to validating it, implying that the invalidation operation is not accessible from this view. Also, underspecifled semantic types dominated by other underspecifled types cannot be resolved using this view, as they do not appear in the text.5 However, dealing with a text in natural language corresponding to the normalized document may be a more intuitive interface to some users, although it may require more operations.</Paragraph> <Paragraph position="5"> Negotiation dialogues The key element in this task is the minimization of the number of 5We thought that showing these types using cascade menus would be too confusing for the user.</Paragraph> <Paragraph position="6"> operations by the user. The two previous views allow the expert to choose some underspeciflcation to resolve. The List of underspeciflcations panel on the left of the window in flgure 6 contains an enumeration of all underspeciflcations found in the compact representation. They are ordered by decreasing score, where the score can indicate the average score of the objects in competition, or the inverse of the average number of candidates per object in competition. Therefore, the expert can choose to resolve flrst underspeciflcations that contain likely objects, or underspeciflcations that involve few candidates so that the validation of an object will prune more candidates from the compact representations. Clicking on an underspeciflcation in the list triggers a negotiation dialogue similar to that of flgure 8. The semantic type on that dialogue, specifles how links are shown, is not supported by any evidence in the input document. The expert can however choose a value for it. When the underspeciflcation is resolved, all the views are refreshed to re ect the new state of the compact representation, and a negotiation dialogue for the underspeciflcation then ranked flrst in the list is shown. The expert can either discard it, or continue in the dialogue mode, with the possibility to skip the resolution of an underspeciflcation.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Discussion and perspectives </SectionTitle> <Paragraph position="0"> We have presented an approach to normalize documents in constrained domains and its implementation. Our approach combines the strictness of well-formed content representations and and the exibility of information retrieval techniques, and makes use of human expertise to resolve di-cult interpretation problems in an attempt to build an operational system. Although our initial results are promising, our approach could be improved in several ways.</Paragraph> <Paragraph position="1"> First of all, an important evaluation factor of our approach is how much efiort has to be done by the human expert. We have only conducted informal experiments of evaluation by the task, which have revealed that normalization can be performed quite fast when the user has a good command of the difierent views available.</Paragraph> <Paragraph position="2"> Nevertheless, it seems crucial to be able to present the expert with at least some evidence from the text of the input document to support competing semantic objects. Morever, the evidence extracted from the input document could be used as the basis for learning new formulations for particular communicative goals that would match better subsequent similar input. Although our system already supports non-deterministic generation, we have not implemented a mechanism that would allow supervised learning of new formulations yet. We expect this \normalization memory&quot; functionality to have an important impact for the normalization of documents from the same origin, as it should improve the automatic selection of content representations.</Paragraph> <Paragraph position="3"> In case the candidates returned by fuzzy inverted generation do not contain the content representation representing best the input document, the user can choose to reanalyze the document. This will start search again from the (N+1)th content representation. But because the expert might have already resolved some underspeciflcations and thus identifled subparts that should belong to the solution, this information should be taken into account while reanalyzing the document, which is not the case in the current implementation. If the solution has to be present in the list of candidates returned, it should be as close to the top of the list as possible, so that the flrst choices for each underspeciflcation represent the actual best choices. To this end, we intend to implement a second-pass analysis that would rerank the candidates produced by fuzzy inverted generation by computing text similarities over short passages such as those proposed in (Hatzivassiloglou et al., 1999). These techniques were much harder to implement during the search in the virtual space of documents produced by the document model, because partial content representations are not actual texts.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Acknowledgements </SectionTitle> <Paragraph position="0"> Many thanks to Marc Dymetman, who supervised my work at Xerox Research Centre Europe and who originally came up with the concept of fuzzy inverted generation. Many thanks also to Christian Boitet, my university PhD supervisor, and to Anne-Lise Bully, CPedric Leray and Abdelkhalek Rherad for their programming work on the interface of the presented system.</Paragraph> <Paragraph position="1"> This work was funded by a PhD grant from</Paragraph> </Section> class="xml-element"></Paper>