<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2146">
  <Title>A GRAMMAR BASED APPROACH TO A GRAMMAR CHECKING OF FREE WORD ORDER LANGUAGES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
A GRAMMAR BASED APPROACH TO A GRAMMAR CHECKING
OF FREE WORD ORDER LANGUAGES
</SectionTitle>
    <Paragraph position="0"> This paper shows one of the methods used for grammar checking, as it is being developed in the frame of the EC funded project</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="906" type="metho">
    <SectionTitle>
LATESLAV - Language Technology for Slavic
Languages (PECO 2824).
</SectionTitle>
    <Paragraph position="0"> The languages under consideration in the project - Czech and Bulgarian - are both free word order languages, therefore it is not sufficient to use only simple pattern based methods for error checking. The emphasis is on grammar-based methods, which are much closer to parsing than pattern-based methods.</Paragraph>
    <Paragraph position="1"> It is necessary to stress that we are dealing with a surface syntactic analysis.</Paragraph>
    <Paragraph position="2"> Therefore also the errors which are taken into consideration are surface syntactic errors. Our system for identification and localization of (surl:ace) syntactic errors consists of two basic modules - the module of lexical analysis and the module of surface syntax checking. In the present paper, we will describe the second module, which is more complicated and creates the core of the whole system. Although it is not crucial for our method, we would like to point out that our approach to the problems of grammar checking is based on dependency syntax.</Paragraph>
    <Paragraph position="3"> Let us illustrate the degree of licedom of the word order, which is provided by Czech, one of the languages under consideration in the project. If we take a sentence like &amp;quot;Oznaeen3~ (Adj. masc., Nom/Gen Sg.) soubor (N masc., Nom/Gen Sg.) se (Pron.) nepodafilo (V neutr., 3rd pers. Sg) tisp6~nE (Adv.) otev~ft (V inf.)&amp;quot; (The marked file failed to be opened sucessfully); word-tbr-word translation into English &amp;quot;Marked file itself failed succesfully to open&amp;quot;, we may modify the word order for instance in the following way:  etc.</Paragraph>
    <Paragraph position="4"> The only thing concerning the word order, which we can guarantee, is that the above introduced list of syntactically correct sentences does not exhaust all possibilities, which arc given by the combinations of those six Czech words 1. The example also shows, that although the word order is very free, there are certain limitations to that freedom, as e.g. the adjective - noun group (&amp;quot;oznaOcny soubor&amp;quot;), which is closely bound together in all of the sample sentences, but may not be bound together in some other context - cf. &amp;quot;...soubor Karlem veera veeer oznaeen~ jako ~patnsL...&amp;quot; (...file \[by Karel\[ yesterday evening marked as wrong...).</Paragraph>
    <Paragraph position="5"> The approach which we have chosen for the developlncnt of the grammar checker for free word order languages is based on the idea of reducing the complicated syntactic structure of the sentence into a more simple structure by means of deleting those words from the input sentence which do not cause any error.</Paragraph>
    <Paragraph position="6"> Let us take as an example lhe (ungrammatical) English sentence quoted in \[31: &amp;quot;*77ze little boys I mentioned runs very quickly.&amp;quot; The error may be extracted by a stepwise deletion of the correct parts which do not alTect the (non)correctness of the sentence. We will get successively the sentences &amp;quot;*The boys I mentioned runs very quickly&amp;quot;, &amp;quot;*boys Imentioned runs very quickly', &amp;quot;*boys runs very quickly&amp;quot;, &amp;quot;*boys runs quickly&amp;quot;, &amp;quot;*boys furls &amp;quot;.</Paragraph>
    <Paragraph position="7"> 1As mentioned above, we are concerned with surface syntactic shape of the Czech sentences and thus we leave aside the semantic relevance of the word order variations due to their different topic focus articulation. For a detailed discussion of these phenomena, see esp. \[1\] and \[7\].</Paragraph>
    <Paragraph position="8">  The example shows that it is useflfl to use a model which is able to deal with deletions in a natural way. We use the nondeterministic list automata (NLA); a list automaton works with a (doubly linked) list of items rather than with a tape. The version of the NLA which is used in our project is described briefly in the tollowing sections.</Paragraph>
  </Section>
  <Section position="3" start_page="906" end_page="908" type="metho">
    <SectionTitle>
2. ERROR CHECKING AUTOMATON
</SectionTitle>
    <Paragraph position="0"> The core module of our system is the Error: Checking Module. It recognizes grammatical correctness of a given portion of text, or, in other words, it distinguishes the grammatically correct and grammatically incorrect subsequences (not necessarily continuous) of lexical elements in the input text. The input of the Error Checking Module (ECM) consists el' tile results of the morphological and lexical analysis. The exact form of the inpnt elements is thoroughly described in \[511. For the purpose of this paper it is only necessary to say that the data, representing one lexical element, are contained in one complex feature structure. The attributes are divided into two groups, input and output attributes. The ECM tries to reduce the input sequence by means of deleting some symbols.</Paragraph>
    <Paragraph position="1"> The deleted symbols are stored. They create the history of simplifications of the input text.</Paragraph>
    <Paragraph position="2"> The whole process is nondeterministic -if there are more variants how to delete the symbols, all of them are sooner or later taken into account.</Paragraph>
    <Paragraph position="3"> For the purpose of the grammar checker, we use a slightly modified version of NI,A, called Error Checking Automaton (ECA). ECA has a two-level storage, with a two-way linear&amp;quot; list on each level composed of data items (fields, cells). In tile list there are two distinguished items: one at the leflmost end and the other at the rightmost end of the list. These items are called sentinels.</Paragraph>
    <Paragraph position="4"> The first level represents the input and the working storage of ECA, the other one ECA's output. ECA has a finite state control unit with one head moving on a linear (doubly linked) list ot' items. In each moment the head is connected with one particular cell of the list (&amp;quot;the head looks at the visited field&amp;quot;). The actions of the working head are delimited by the following four types of basic operations which the head may perform on the list: MOVE, DELETE, INSERT, and RESTART. The operations of the type MOVE ensure the bi-directional motion of the head on the list. The I)ELETE operations delete the input field in the input level. The INSERT operations add a field with a symbol to the output level, more exactly: to the close neighborhood of the visited field.</Paragraph>
    <Paragraph position="5"> The RESTART operation transfers ECA from the current configuration to the (re)start configuration, i.e. to the configuration in which ECA is in the initial (unique) state.</Paragraph>
    <Paragraph position="6"> The processing of the ECA is controlled by rules, written in a formalism called DABG (Deletion Autolnata Based Grammar), which was developed especially for the project I,ATESLAV. It is described in detail in 151. The theoretical background for erasing automata of the above described kind can be found in \[1611 and 121.</Paragraph>
    <Paragraph position="7"> The ECM is logically composed of the following three components:  (a) list automaton P of the type ECA; (b) list automaton N of the type ECA; (c) control module C.</Paragraph>
    <Paragraph position="8"> 2.1. The automaton P  The automaton works in cycles between (re)starts. The function of the autolnaton P is to apply one rule of the control grammar to the input during one cycle. That means to decide nondeterministically which finite subsequence of the text in the input level of the list is correct according to one rule of the control grammar, and to delete this part from the input level. After: that it continues the work in the next cycle. This means that if the input text is error free, the automaton P gradually repeats cycles and deletes (in at least one branch of its tree of computations) all the input elements (except for the sentinels).</Paragraph>
    <Paragraph position="9"> The automaton P accepts the input sequence, if the computation of P finishes by deleting all the items (except for the sentinels) from the input level of the list.</Paragraph>
    <Paragraph position="10"> Notation: L(I') is a set of strings accepted by 1'.</Paragraph>
    <Paragraph position="11"> rs(P,w) = {w I, where w I is a word, which is a result of one cycle performed on the word w by the automaton P }  We can see that the following two facts hold, due to the restarting nature of  computations of P: Fact 1: Let w be a word fi'om L(P), then rs(P,w) c~ L(P) ve Q.</Paragraph>
    <Paragraph position="12"> Fact 2: If w is not a word fi'om L(P), then</Paragraph>
    <Paragraph position="14"> Two basic principles how to formulate rules for the automaton P for a given natural language L lk~llow from the above mentioned facts:  1) P contains only those (deleting) rules, for which there exists a sentence in L which will remain syntactically correct after the application of the rule.</Paragraph>
    <Paragraph position="15"> 2) There may not be a syntactically incorrect scqucnce of words l'ronl L which would be changed to a syntactically correct sentence of L by means of the application of a rule fl:om P. Strong rules (S-rules) are the rules which meet the following requirement: 3) Any application of a rule will keep both  correctness and incorrectness of the input sequence.</Paragraph>
    <Paragraph position="16"> Clearly the S-rules meet also the requirements 1) and 2).</Paragraph>
    <Paragraph position="17"> The subautomaton of P, which uses S-rules only, is called Ps.</Paragraph>
    <Paragraph position="18"> One cycle (one compound step) of an automaton P (or Ps) can be described from the technical point of view as follows: First, the automaton searches through the input level for the place where there is a possibility to apply one of the deleting rules to the input level of the automaton. In the positive case the deleting rule is applied and P (or Ps) returns to the initial configuration (restarts). 2.2. The automaton N The task of the automaton N is to find in the input text a minimal limited error, to make a correction of it (cf. the following del'inition). In one compound step the automaton N perfl)rms (nondeterministically) the following actions: First, similarly as the automaton P, N locates the place where to apply a rule of its control grammar to the input level. Then it checks whether there is an error in the close neighborhood of that place. In the positive case it marks the occurrence of the error at the output level of the list and corrects the input level of the list by deleting some items from the environment of the current position of the head. Definition: The limited error is a string z, fl)r which there are no y, w such that the string yzw is a (grammatically) correct sentence (of a given language L). If z can be written in the form of</Paragraph>
    <Paragraph position="20"> and there are also strings s,r such that SUlU2... ukr is a grammatically correct sentence, u = UlU2 ... Uk is called a correction of Z.</Paragraph>
    <Paragraph position="21"> A minimal limited error is a st6ng z, which is a limited error and there is no possibility how to shorten z from either side preserving the property of being a limited error for z.</Paragraph>
    <Paragraph position="22"> 2.3. Tile control module C The C module is a control submodule of the entire ECM module. At the beginning of the operation of ECM, the C module calls the automaton P, which works as long as it is possible to shorten the input level list by deleting its items. As soon as the automaton P cannot shorten the list in the input level any more and the input level does not contain only the sentinels, C calls the module N. This automaton removes one error fi'om the input level, makes an error mark and transfers the control back to the C module, which invokes the automaton P again. Thus, the submodule C repeatedly invokes the automata P and N (switching between them) as long as they are able to shorten the sequence of input elements. ff there are more possibilities at a certain moment of computation, the automaton P chooses only one of them, C stores the other into the stack and it tries to apply another rule to the modified input. That means that C is searching the tree of possible computations depth-first.</Paragraph>
    <Paragraph position="23">  In any stage of selection of rules h)r P and N there may be some input sentences, which contain either syntactically correct subsequences of words which cannot be handled by the rules of P, or syntactic errors which are not covered by the rules of N. In this case both atttomata stop and thc input level still contains a subsequence of input elements. Its contingent final emptying is ensured by lhe C module, which marks this fact at the output level. Then the C module transfers control to the next phase of the whole system.</Paragraph>
    <Paragraph position="24"> At this point it is ncccssary to clarify, what kind of output structure is built by ECM. As we have already mentioned, our approach to the probleln is oriented towards dependency syntax 2.</Paragraph>
    <Paragraph position="25"> All the rules ()f control grammar for P and N delete the depending word from lhc input and put it into the output atlribute of the governing word. At the end of the process there is a tree, which contains the information ahout all the words front the input, about the order of deletions and also all error marks made by the automaton N.</Paragraph>
    <Paragraph position="26"> The switching between ! ), N and C guarantees that any possible path in the tree of computation will result in a complete structure of a deletion tree.</Paragraph>
    <Paragraph position="27"> The current best deletion tree is then compared to any new deletion tree. If the new tree is better (e.g. it contains a smaller number of errors or contains errors with a smaller weight), it is then stored for further use as the new current best result.</Paragraph>
    <Paragraph position="28"> At the end of the whole process we have the &amp;quot;best&amp;quot; result (or a set of best results, c.g, when there arc more possibilities how to parse the sentence), which contains all relevant inlormation about errors present in the input sentence.</Paragraph>
    <Paragraph position="29"> At the current stage of the work we have decided to distinguish as considerable only the folh)wing two types of errors: a) Only one call of N was used and the whole process o1' deletions is completed by P and N only.</Paragraph>
    <Paragraph position="30"> 2However, the use of DABG for creating the control grammars for P and N is not limited to dependency based approach only. Both the data structures (feature-structure based) and the DABG formalism allow to formulate rules which create the constituent structure of the sentence at the output level. b) If there wcrc only the rules of l's and N applied to a particular path in a tree of computation.</Paragraph>
    <Paragraph position="31"> Clearly the tree with error marks of thc lype a) will be among the best results of the C for any reasonable comparison of results. We have to assign a slightly smaller weight to the errors of the type b).</Paragraph>
  </Section>
class="xml-element"></Paper>