<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1093">
  <Title>SCSL : a linguistic specification language for MT</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KEY-WORDS Machine Translation, Natural Language
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
I. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> In most NLP and second-generation MT systems, the information computed during processing is generally represented as abstract trees, a description tool very commonly used in linguistics. The modules implementing the various steps are written in Specialized Languages for Linguistic Programming (SLLP) (see for example &lt;Vauquois85a&gt;, &lt;Nakamura84&gt;, &lt;Slocum84&gt;, &lt;Maas84&gt;).</Paragraph>
    <Paragraph position="1"> In spite of the expressive power of SLLP compared to traditional programming languages such as LISP, the design and maintenance of programs become more and more difficult as the complexity of &quot;lingwares&quot; grows.</Paragraph>
    <Paragraph position="2"> To take up this challenge, we introduce into the field of computational linguistics the specification/implementation/validation framework, which has proved valuable in traditional programming. This leads to the introduction of new tools and new working methods. The expected benefits for computational linguists are to facilitate the design of the linguistic parts of NLP systems, to increase the speed of realisation, to improve the reliability of the final system and to facilitate maintenance.</Paragraph>
    <Paragraph position="3"> Writing an analysis program with a SLLP, the computational linguist must define the set of strings to be analysed, the structural descriptor corresponding to an input string, the strategies used for the computation of the descriptor, the heuristics used for ambiguity choices and the treatment of wrong inputs (errors). He generally writes a more or less precise and comprehensive document on those problems and begins programming from scratch. This method is highly impractical with large lingwares. We advocate the use of a more stringent methodology, which consists of the following steps: 1. Specify formally (i.e. using a formal language) the valid inputs and the corresponding outputs: the specification must be comprehensive and neutral with respect to the choices of implementation. At this stage, the computational linguist is concerned only with linguistic problems, not with programming. An interpreter for the specification language should be used to write and debug the specification.</Paragraph>
    <Paragraph position="4"> 2. Specify the implementation choices for data structures and control (decomposition into modules, strategies and heuristics) and the treatment of errors. This specification depends on the input/output specification and may partially depend on the kind of SLLP to be used for implementation.</Paragraph>
    <Paragraph position="5"> It should be as formal as possible, at least a strictly normalized document.</Paragraph>
    <Paragraph position="6">  3. Implement the module specified using a particular SLLP.</Paragraph>
    <Paragraph position="7"> 4. Validate the implementation: the interpreter of the specification language should be used to prepare a set of valid inputs/outputs; the results of the execution of the module to be validated on the input set are compared to the output set.</Paragraph>
    <Paragraph position="8"> An integrated software environment offering the development tools and ensuring coherence between the development steps should be provided to facilitate the use of the methodology.</Paragraph>
    <Paragraph position="9"> As a first step in this direction, we introduce a linguistic specification language for which an interpreter is being implemented. These tools are used in the first and fourth steps as defined above and are being integrated in a specialized environment based on the specification language &lt;Yam86&gt;.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="393" type="metho">
    <SectionTitle>
II. LINGUISTIC SPECIFICATION
1. A SPECIFICATION FORMALISM
</SectionTitle>
    <Paragraph position="0"> Before presenting the specification language itself, we shall consider what properties such a language should have.</Paragraph>
    <Paragraph position="1"> Most problems in NLP systems are found in the analysis stage (and some in the transfer stage in MT systems). The major gain should be to clarify the analysis stage using the proposed framework. Thus, a linguistic specification language should: define the set of valid input strings; define the corresponding outputs (structural descriptors of strings); define the mapping between those two sets.</Paragraph>
    <Paragraph position="2"> Analysis and synthesis are two complementary views of a language defined by a formal grammar. We may reasonably expect a linguistic specification language to be equally usable for the specification of analysis and synthesis modules &lt;Kay84&gt;.</Paragraph>
    <Paragraph position="3"> Formal grammars define formal languages, and formal grammars do not make any reference to the situation (the global context in which sentences are produced); thus formal languages used to describe natural language subsets must allow the expression of ambiguities and paraphrases. An element of the mapping should be a couple (string, tree), where many trees are generally associated with one string and, conversely, many strings are associated with one tree.</Paragraph>
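    <Paragraph> The many-to-many character of this mapping can be illustrated with a minimal Python sketch. The example strings, the bracketed tree notation and the names MAPPING, analyse and generate are assumptions made purely for illustration; none of them belongs to the specification language itself.

```python
# Hypothetical toy mapping: each element is a couple (string, tree).
# Trees are written as bracketed strings purely for illustration.
MAPPING = {
    ("old men and women", "NP(AP(old), NP(men and women))"),
    ("old men and women", "NP(NP(old men), and, NP(women))"),
    ("women and old men", "NP(NP(old men), and, NP(women))"),
}

def analyse(string, mapping=MAPPING):
    """Return every tree associated with the string (ambiguity)."""
    return {tree for s, tree in mapping if s == string}

def generate(tree, mapping=MAPPING):
    """Return every string associated with the tree (paraphrase)."""
    return {s for s, t in mapping if t == tree}

# One string, two trees: an ambiguity.
print(len(analyse("old men and women")))  # 2
# One tree, two strings: a paraphrase.
print(len(generate("NP(NP(old men), and, NP(women))")))  # 2
```

    Because the same set of couples serves both functions, analysis and generation appear as two readings of a single relation rather than as two separate programs.</Paragraph>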
    <Paragraph position="4"> The advantage of modularity is widely accepted: the description of the mapping should be done piece by piece, each piece describing a partial mapping, and the total mapping is then obtained by the composition of partial mappings (e.g. unification as in FUGs &lt;Kay84&gt;).</Paragraph>
    <Paragraph position="5"> An important feature of such a language is that a linguistic specification should be written by linguists who have no a priori knowledge of computer science: a linguist must be able to concentrate only on linguistic problems and not on computer science problems. The formalism should be free of all computer science impurities, and the mechanism of composition should be clear and simple.</Paragraph>
    <Paragraph position="6"> Within this framework, a graphic formalism for the specification of procedural analysis or generation grammars, the &quot;Static Grammars&quot; (SG) formalism, has been developed at GETA under the direction of Pr. B. Vauquois &lt;Vauquois85b&gt;. This formalism is now used in the French MT National Project to specify the grammars of an industrial English-French system. Up to now, SGs were hand-written and could not be edited on a computer because of the use of graphs. The formalism has therefore been modified in order to build a software environment based on SG (structural editor, interpreter, graphic outputs, ...). The modified formalism is called &quot;Structural Correspondence Specification Language&quot; (SCSL). A grammar written in SCSL is called a &quot;Structural Correspondence Specification Grammar&quot; (SCSG). SCSL (sect. III) allows one to write the grammar of any interesting formal language, such as programming languages or subsets of natural languages. This formalism is quite general and does not depend on a particular linguistic theory. GETA, under the direction of Pr. B. Vauquois, has elaborated its own linguistic framework and methodology, from which this work directly descends, but it is nevertheless perfectly possible to write grammars within different linguistic frameworks. We emphasize this point because the distinction between the properties of the formalism and the properties of the linguistic theory is not always clear. Moreover, it may be tempting to hard-wire the properties of some linguistic theory into a particular formalism, and this is sometimes done, leading to confusion.</Paragraph>
  </Section>
  <Section position="5" start_page="393" end_page="394" type="metho">
    <SectionTitle>
2. IMPLEMENTATION AND VALIDATION OF LINGUISTIC MODULES
</SectionTitle>
    <Paragraph position="0"> As mentioned earlier, a SCSG is used for the specification of analysis or generation modules written in one of the SLLPs of the ARIANE system. Since it defines a mapping, a SCSG is neutral with respect to implementation choices, which are essentially algorithmic in nature (organization in modules, control, etc.), and with respect to intrinsic ambiguity choices, which are essentially heuristic in nature.</Paragraph>
    <Paragraph position="1"> The same SCSG may be used to specify the inputs/outputs of different procedural grammars, each implementing different strategies and heuristics for comparative purposes: the result must nevertheless correspond to the same specification.</Paragraph>
    <Paragraph position="2"> The interpreter (not yet fully implemented) is used for debugging a SCSG (tests, traces, ...) and for the empirical validation of procedural grammars for analysis or generation: the function computed by a procedural grammar must be included in the mapping defined by the SCSG specifying the procedural grammar.</Paragraph>
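    <Paragraph> The inclusion condition used for empirical validation can be sketched as follows in Python. This is only an illustration under stated assumptions: SPECIFICATION stands for the couples a SCSG interpreter would enumerate on a test set, and validate, good_grammar and bad_grammar are hypothetical names, not part of ARIANE or SCSL.

```python
# Hypothetical specification: the set of couples (string, tree) that a SCSG
# interpreter would enumerate for a small test set of strings.
SPECIFICATION = {
    ("the books", "NP(the, books)"),
    ("some books", "NP(some, books)"),
    ("some books", "NP(QUANT(some), books)"),
}

def validate(procedural_grammar, specification, inputs):
    """Return the computed couples that fall outside the specified mapping."""
    failures = []
    for string in inputs:
        couple = (string, procedural_grammar(string))
        if couple not in specification:
            failures.append(couple)
    return failures

def good_grammar(string):
    # Stand-in for a procedural grammar that picks one tree per string.
    return {"the books": "NP(the, books)", "some books": "NP(some, books)"}[string]

def bad_grammar(string):
    # Stand-in that produces a tree the specification does not allow.
    return "S(" + string + ")"

print(validate(good_grammar, SPECIFICATION, ["the books", "some books"]))  # []
print(validate(bad_grammar, SPECIFICATION, ["the books"]))
```

    The procedural grammar is free to choose one tree among several for an ambiguous string; validation only requires that the chosen couple belongs to the specified mapping.</Paragraph>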
    <Paragraph position="3"> The interpreter may compute the trees corresponding to an input string (analysis) or the strings corresponding to an input tree (generation). A chart identifier may define an entry point for the interpreter.</Paragraph>
    <Paragraph position="4"> Before an execution, one can type in different trace commands. At the end of an execution, the trace and the derivation may be printed.</Paragraph>
    <Paragraph position="5"> One can trace, for different charts (step-by-step or otherwise), a tentative application of a chart, a success, a failure or a combination of these parameters. In the step-by-step mode, the interpreter stops on each traced trial/success/failure and it is possible to type in new commands (trace, untrace, stop) and choose the next chart to be applied.</Paragraph>
    <Paragraph position="6"> An output trace element has the following general pattern (several levels of detail are available): &lt;chart_id&gt;, &lt;tree_occurrence&gt;, &lt;string_occurrence&gt;.</Paragraph>
    <Paragraph position="7"> III. THE LANGUAGE To give a flavour of the specification language, we introduce a simplified version. Constructs that are not needed for this presentation (though essential for practical use) are removed. A more abstract view has been studied in &lt;Zaharin86&gt;.</Paragraph>
    <Paragraph position="8"> A SCSG describes simultaneously: the set of strings of the language; the set of structural descriptors of the language; the mapping between those two sets.</Paragraph>
    <Paragraph position="9"> A SCSG is composed of &quot;charts&quot;. The mapping between the string language and the tree language is described in parts: a chart describes a partial mapping (set of valid sub-strings &lt;-&gt; set of valid sub-trees); the total mapping is obtained by the composition of partial mappings (sect. IV).</Paragraph>
    <Paragraph position="10"> SCSL is a language using key-words: every important syntactic unit begins with a key-word (e.g. CHART).</Paragraph>
    <Paragraph position="11"> Identifiers begin with at least one letter; designators begin with at least one digit. Designators are preceded by a prefix indicating their type. A SCSG begins with the declaration of labels and decorations, followed by the charts. A chart consists of a tree part and a forest part, describing respectively a tree pattern and a forest pattern, followed by the contexts part and lastly the constraints part (sect. III.2).</Paragraph>
    <Paragraph position="12"> SCSL does not have the concept of assignment: a chart defines a correspondence between a tree and a forest, constrained by a boolean expression on the patterns of the chart.</Paragraph>
    <Paragraph position="13"> The basic construct of the language is a labeled and decorated tree pattern: each node of the described trees is a couple (label, decoration). The label has the basic string type; the decoration has a hierarchical definition which uses the SCALAR and SET constructors. A constraint is a boolean expression on the labels and decorations of the nodes of the patterns.</Paragraph>
    <Paragraph position="14"> 1. LABEL, DECORATION AND TREE PATTERNS Most SLLPs use trees as their basic data structure. Some associate attributes with a tree or a node, essentially a set of variable/value pairs which may be manipulated with a few operators. To offer a more powerful description tool, a SCSL tree node is a couple (label, decoration) where the decoration is a hierarchical attribute structure. This is intended to facilitate the manipulation of complex sets of attributes through a unified view.</Paragraph>
    <Paragraph position="15"> 1.1. Label The label is traditionally a non-terminal of a grammar, but it may be viewed as a particular attribute of a tree. The type definition of labels is expressed with a regular expression. The operation on this type is equality.</Paragraph>
    <Paragraph position="17"> The decoration is interpreted as an oriented non-ordered tree where attribute identifiers (SCALAR or SET type) are the labels of the nodes and the values of the attributes are the forests that they dominate (in the current version of SCSL, attributes may also have the STRING or INTEGER types with the associated operators). For the SCALAR type, the operation is equality. For the SET type, the operations are union, intersection and set difference. Relational operators are equality, membership and inclusion.</Paragraph>
    <Paragraph position="18"> The operations are defined on a hierarchical set structure: one must indicate on which level an operation is defined by suffixing the operator with an integer. The default value is the first level; &quot;*&quot; is used for the deepest level.</Paragraph>
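    <Paragraph> As a rough illustration of the operators just listed, the following Python sketch models a node as a couple (label, decoration) with one SCALAR-like and one SET-like attribute. The class Node and the attribute names number and sem are hypothetical, and the level suffix on operators is deliberately not modelled; this is not SCSL syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree node as a couple (label, decoration)."""
    label: str
    decoration: dict = field(default_factory=dict)

# Hypothetical attributes: "number" plays the role of a SCALAR attribute,
# "sem" the role of a SET attribute.
noun = Node("NP", {"number": "plural", "sem": frozenset({"concrete", "countable"})})
other = Node("NP", {"number": "plural", "sem": frozenset({"concrete", "animate"})})

# SCALAR type: the only operation is equality.
same_number = noun.decoration["number"] == other.decoration["number"]

# SET type: union, intersection and set difference.
sem_union = noun.decoration["sem"].union(other.decoration["sem"])
sem_common = noun.decoration["sem"].intersection(other.decoration["sem"])
sem_only_noun = noun.decoration["sem"].difference(other.decoration["sem"])

# Relational operators on sets: equality, membership and inclusion.
is_equal = noun.decoration["sem"] == other.decoration["sem"]
is_member = "concrete" in noun.decoration["sem"]
is_included = sem_common.issubset(noun.decoration["sem"])

print(same_number, sorted(sem_common), is_equal, is_member, is_included)
```

    A constraint in a chart would then be a boolean combination of such comparisons over the nodes of the patterns.</Paragraph>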
    <Paragraph position="20"> The basic notion of the language is a labeled and decorated tree. The types of a node, a tree, a forest are defined by the declaration of the labels and the decorations.</Paragraph>
    <Paragraph position="21"> A chart should be a comprehensive description of a linguistic fact which may have different realisations: the decoration allows the manipulation of sets of attributes at different levels of detail, and the structure should describe a whole family of trees. The structure of a tree pattern is described with designators which are implicitly declared. The scope of a designator is restricted to a chart. A designator begins with one digit; a node designator is prefixed with &quot;.&quot;. The content of a node is accessed by means of decoration and label identifiers: the label of a node .1 is accessed by lbl(.1) (if the label is declared as &quot;lbl&quot;), its decoration by deco(.1).</Paragraph>
    <Paragraph position="22"> A tree designator is prefixed with &quot;~&quot;. The tree may be reduced to a single node.</Paragraph>
    <Paragraph position="23"> A forest designator is prefixed with &quot;$&quot;. The forest may be empty. A tree pattern describes a set of trees, each tree being completely described in width and depth.</Paragraph>
    <Paragraph position="25"> Here, labels stand for a couple (label, decoration).</Paragraph>
  </Section>
  <Section position="6" start_page="394" end_page="396" type="metho">
    <SectionTitle>
2. CHARTS
</SectionTitle>
    <Paragraph position="0"> An element of the forest pattern may be: a string element described directly; a sub-string described indirectly using the corresponding structure (tree), defined by some chart.</Paragraph>
    <Paragraph position="1">  The forest pattern is a sequence of tree patterns described by a regular-like notation: a tree pattern suffixed by &quot;+&quot; may be iterated, one suffixed by &quot;?&quot; is optional, and one suffixed by &quot;*&quot; is optional or iterated. Contrary to regular expressions, these notations can be applied to single tree patterns only.</Paragraph>
    <Paragraph position="2"> To simplify the notation, an iterated tree pattern, e.g. ( .1(.2,.3) )*, will be written .1*(.2,.3), and the same convention will hold for &quot;?&quot; and &quot;+&quot;. Such a pattern must be used as a whole and is interpreted as a list: a boolean expression on nodes of such a pattern is interpreted as an expression on the nodes of each tree of the list.</Paragraph>
    <Paragraph position="3"> Example: .1? , .3*($4) , .5+($6) the node designated by .1 may be absent; the tree designated by .3($4) may be absent or iterated; the tree designated by .5($6) must be present and may be iterated.</Paragraph>
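    <Paragraph> The reading of the three suffixes can be checked with a small Python sketch. It is a simplification under stated assumptions: each tree pattern is reduced to a single label, the forest is a plain list of labels, and the function name matches and the labels A, B, C are hypothetical stand-ins for the designators .1, .3 and .5 of the example.

```python
def matches(pattern, forest):
    """Tell whether a list of labels fits a list of (label, suffix) items.

    The suffix is "" (exactly once), "?" (optional), "*" (optional or
    iterated) or "+" (present, possibly iterated).
    """
    if not pattern:
        return not forest
    (label, suffix), rest = pattern[0], pattern[1:]
    here = bool(forest) and forest[0] == label
    if suffix == "":
        return here and matches(rest, forest[1:])
    if suffix == "?":
        return matches(rest, forest) or (here and matches(rest, forest[1:]))
    if suffix == "*":
        return matches(rest, forest) or (here and matches(pattern, forest[1:]))
    if suffix == "+":
        # One mandatory occurrence, then the same item behaves like "*".
        return here and matches([(label, "*")] + rest, forest[1:])
    raise ValueError("unknown suffix: " + suffix)

# Stand-in for the example pattern  .1? , .3*($4) , .5+($6)
PATTERN = [("A", "?"), ("B", "*"), ("C", "+")]

print(matches(PATTERN, ["C"]))                      # True
print(matches(PATTERN, ["A", "B", "B", "C", "C"]))  # True
print(matches(PATTERN, ["A", "B"]))                 # False
```

    Since each suffix applies to a single item only, the sketch needs no grouping construct, which mirrors the restriction stated for forest patterns.</Paragraph>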
    <Section position="1" start_page="395" end_page="395" type="sub_section">
      <SectionTitle>
2.2. Correspondence and constraints
</SectionTitle>
      <Paragraph position="0"> a) Implicit correspondence between tree and forest To avoid the duplication of the same constraints in the tree part and in the forest part, we allow the following notational facility.</Paragraph>
      <Paragraph position="1"> The same node designators in the tree pattern and the forest pattern represent distinct objects related to each other in the following manner: if C(T.x) is the set of constraints on a node T.x of the tree part and C(F.x) the set of constraints on the node F.x of the forest part, then node T.x verifies C(T.x) and the constraints of C(F.x) which are not contradictory with those of C(T.x) (and conversely for node F.x).</Paragraph>
      <Paragraph position="2"> This relation may also be explicitly stated for nodes having different designators using the predefined CORRES function.</Paragraph>
      <Paragraph position="3"> Some formal constraints linking the tree pattern and the forest pattern are verified at compile time to ensure decidability.</Paragraph>
      <Paragraph position="4"> b) Constraints The constraints part is a boolean expression on labels and decorations of chart pattern nodes. All classical boolean operators are available (and, or, exclusive or, not, imply, equivalent).</Paragraph>
      <Paragraph position="5"> Designators are prefixed by A for the tree part and F for the forest part. An expression using non-prefixed designators is interpreted as an expression on the designators of the tree part and of the forest part. The designators of context patterns must be different from the tree part and forest part designators.</Paragraph>
      <Paragraph position="7"> A tree pattern of the context pattern is a member of a cut of the derivation tree of the context-free skeleton; a context pattern describes a set of cuts in the derivation tree (sect. IV.2).</Paragraph>
      <Paragraph position="8"> A context pattern is a forest pattern where each tree pattern may be prefixed by the &quot;not&quot; boolean operator (&quot;^&quot;), indicating the mandatory absence of the tree pattern. Context designators must not be used in other parts of the chart.</Paragraph>
      <Paragraph position="9"> Examples: we give some examples of right contexts and their interpretations. The constraint C(.5) stands for a boolean expression on the label and decoration of .5.</Paragraph>
      <Paragraph position="10"> there exists a cut such that the first element of this cut verifies C:  -- EXIST is a predefined boolean function -- testing the existence of an instance: -- there must be an instance of .1 or .3 -- constraints on right context and forest nodes:</Paragraph>
      <Paragraph position="12"/>
    </Section>
    <Section position="2" start_page="395" end_page="396" type="sub_section">
      <SectionTitle>
2.3. Contexts
</SectionTitle>
      <Paragraph position="0"> A partial mapping described by a chart in a context-free manner may be subordinated to contextual constraints on the left or right context of the described set of sub-strings. This is a powerful tool to describe contextual constraints, co-references, wh-movements, etc.</Paragraph>
      <Paragraph position="1"> A context element is a sub-string which is described with a corresponding tree pattern.</Paragraph>
      <Paragraph position="2"> [Figure: instance of tree and forest patterns for &quot;some of the books&quot;, with the string elements &quot;some&quot;, &quot;of&quot;, &quot;the&quot;, &quot;books&quot;.]</Paragraph>
      <Paragraph position="4"> Figure 2: Chart instance on &quot;some of the books&quot;</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="396" end_page="396" type="metho">
    <SectionTitle>
IV. THE DERIVATION MECHANISM
1. ELEMENT OF THE MAPPING
</SectionTitle>
    <Paragraph position="0"> An element of the mapping defined by a SCSG is a couple (string, tree) where the correspondence is defined for each sub-tree. The string is displayed as a linear graph labeled with string elements (terminals of the grammar).</Paragraph>
    <Paragraph position="1"> The tree is a correspondence tree: with each node is associated a list of paths of the string graph (the correspondence is generally not projective, e.g. when representing the &quot;respectively&quot; construct).</Paragraph>
    <Paragraph position="3"> Example of the couple for the string &quot;some of the books&quot;.</Paragraph>
    <Paragraph position="5"> Figure 3: Application of bx39 on &quot;some of the books&quot;</Paragraph>
  </Section>
  <Section position="8" start_page="396" end_page="397" type="metho">
    <SectionTitle>
2. DERIVATION IN THE CONTEXT-FREE FRAMEWORK
</SectionTitle>
    <Paragraph position="0"> In the context-free framework, a chart may be seen as a rule in the PROLOG II flavour: CHART is the chart identifier; &lt;tree, string&gt; is the computed couple; TERMINAL is a string element definition; *chart is a variable that will be instantiated with a chart identifier; EVAL is a predicate that evaluates the constraints part; ARC makes the reduction and memorizes the contexts for future evaluation.</Paragraph>
    <Paragraph position="1"> The algorithm of the context-free skeleton is a bottom-up version of Earley's algorithm defined and used by Quinton &lt;Quinton80&gt; in the KEAL speech recognition system.</Paragraph>
    <Paragraph position="2"> For the sake of clarity, the input tape and the factorized stack may be represented as a C-graph. When executing an analysis, the interpreter receives a linear labeled C-graph and works by adding arcs for each reduced constituent. An arc is labeled by a correspondence tree, the contexts to be evaluated and pointers to the reduced constituents.</Paragraph>
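    <Paragraph> The idea of starting from a linear graph and adding one arc per reduced constituent can be sketched in Python as below. This is a naive bottom-up closure written for illustration only, not the bottom-up Earley variant cited above: arcs carry only a category instead of a correspondence tree, contexts and constraints are ignored, and the lexicon, the rules and the names spans and build_graph are assumptions.

```python
def spans(arcs, start, categories):
    """End positions reached from start by following arcs labeled, in order, by categories."""
    positions = {start}
    for category in categories:
        positions = {end for (begin, end, label) in arcs
                     if label == category and begin in positions}
    return positions

def build_graph(words, lexicon, rules):
    """Add an arc (begin, end, label) for every constituent that can be reduced."""
    # The input string is a linear graph: one arc per string element.
    arcs = {(i, i + 1, lexicon[word]) for i, word in enumerate(words)}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for begin in range(len(words)):
                for end in spans(arcs, begin, rhs):
                    if (begin, end, lhs) not in arcs:
                        arcs.add((begin, end, lhs))
                        changed = True
    return arcs

# Hypothetical lexicon and context-free skeleton for the running example.
LEXICON = {"some": "QUANT", "of": "PREP", "the": "DET", "books": "N"}
RULES = [("NP", ("DET", "N")), ("PP", ("PREP", "NP")), ("NP", ("QUANT", "PP"))]

arcs = build_graph("some of the books".split(), LEXICON, RULES)
print((0, 4, "NP") in arcs)  # True
```

    An arc spanning the whole graph, here from position 0 to position 4, plays the role of a complete analysis of the input string.</Paragraph>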
    <Paragraph position="3">  Example of a derivation tree for the string &quot;some of the books&quot;. The couple calculated is written beside the chart identifiers.</Paragraph>
  </Section>
</Paper>