<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0508"> <Title>On Parsing Binary Dependency Structures Deterministically in Linear Time. Harri ARNOLA, Kielikone Oy</Title> <Section position="3" start_page="71" end_page="74" type="metho"> <SectionTitle> 2 The Practical Parser </SectionTitle> <Paragraph position="0"> From now on we assume that strings of nodes are natural language sentences and discuss a fully implemented parser (DCParser) that parses Finnish sentences. The DCParser differs from the simple theoretical model described above, but, as will be shown below, the differences do not alter the theory.</Paragraph> <Section position="1" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 2.1 Contexts </SectionTitle> <Paragraph position="0"> The formal part introduced binary relations as context-free ordered pairs (1). Dependency relations in the implemented parser use contexts.</Paragraph> <Paragraph position="1"> Formally, they could be expressed as context-sensitive ordered pairs as in (2), but the DCParser uses a different rule syntax, as discussed in 2.7.</Paragraph> <Paragraph position="2"> (2) Ri = { <[cxl]x[cxr], [cyl]y[cyr]> | x, y are morphosyntactic representations of the direct governor and the governed word form; cxl, cxr, cyl, cyr are morphosyntactic representations of the left and the right contexts of x and y, respectively; and x Ri y }.</Paragraph> <Paragraph position="3"> The use of contexts in relations adds another heuristic component to the BF-algorithm, and one dependency relation may require quite a few, though a fixed number of, such context-sensitive definitions. Contexts do not, however, alter the linear time behavior of Theorem 2. They only increase the value of the constant C in the linear time bound t ≤ C · nw, where nw is the number of the words in a sentence.</Paragraph> </Section> <Section position="2" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 2.3 Homographic disambiguation </SectionTitle> <Paragraph position="0"> The theoretical model did not discuss ambiguous nodes. In practice a word form can have several alternative morphotactic interpretations. The DCParser has a separate morphological analysis phase which produces all possible morphotactic interpretations for the word forms of input sentences. A separate preprocessing phase explicitly disambiguates most of the lexical and homographic ambiguities of Finnish word forms using context-sensitive rules designed for the purpose (Nylänen, 1986). The remaining ambiguities are resolved implicitly by the DCParser as follows. When an interpretation of an ambiguous word form qualifies as a governed node, the alternative interpretations are rejected. This strategy implements yet another heuristic component for the parser, but it does not alter the linearity argument presented earlier.</Paragraph> </Section> <Section position="3" start_page="71" end_page="72" type="sub_section"> <SectionTitle> 2.4 The dependency relations </SectionTitle> <Paragraph position="0"> The parser uses 32 different binary dependency relations for Finnish. The coordinating relations are discussed in 2.5. The most important of the other relations are listed in Table 4, together with the typical syntactic categories of the regents and the dependants. Space does not allow a discussion of the individual relations; they are visualized in the examples below. By stipulation, the finite verb of the main clause is the head of a grammatical sentence.</Paragraph> </Section>
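<Paragraph position="1"> To make the notion of a context-sensitive relation test concrete, the following sketch shows how a test of the form defined in (2) might look in code. It is a minimal illustration only: the Python representation, the class name, and the attribute patterns are invented for exposition and are not the DCParser's actual data structures (the DCParser uses the rule syntax discussed in 2.7).</Paragraph> <Paragraph position="2">
# A word form is a mapping from morpho-syntactic attributes to values,
# e.g. {"Form": "järjestöt", "Cat": "Noun", "Case": "Nom", "Number": "Pl"}.

def matches(word, pattern):
    """True if the word satisfies every attribute constraint in the pattern.
    A missing context word (None) satisfies only the empty pattern."""
    if word is None:
        return not pattern
    return all(word.get(k) == v for k, v in pattern.items())

class Relation:
    """A context-sensitive binary dependency relation, as in definition (2)."""

    def __init__(self, name, governor, dependant,
                 cxl=None, cxr=None, cyl=None, cyr=None):
        self.name = name
        self.governor = governor        # constraints on x
        self.dependant = dependant      # constraints on y
        self.contexts = (cxl or {}, cxr or {}, cyl or {}, cyr or {})

    def holds(self, sentence, i, j):
        """Test x Ri y for the governor at position i and the dependant at j."""
        def ctx(pos):                   # left/right neighbour, or None at the edges
            return sentence[pos] if pos in range(len(sentence)) else None
        cxl, cxr, cyl, cyr = self.contexts
        return (matches(sentence[i], self.governor)
                and matches(sentence[j], self.dependant)
                and matches(ctx(i - 1), cxl) and matches(ctx(i + 1), cxr)
                and matches(ctx(j - 1), cyl) and matches(ctx(j + 1), cyr))
</Paragraph> <Paragraph position="3"> Each such test inspects a fixed number of words, so testing a relation remains a constant-time operation per candidate pair; adding contexts only enlarges the patterns, that is, the constant C of the bound above.</Paragraph>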
<Section position="4" start_page="72" end_page="72" type="sub_section"> <SectionTitle> 2.5 Coordinations </SectionTitle> <Paragraph position="0"> Coordinations are one of the main sources of syntactic ambiguity in natural language sentences. For us they also cause a notational problem, since coordinations do not seem to be prima facie binary relations. The DCParser treats a coordination as two coexisting binary relations. One word governs the coordinator, which governs the other word. By stipulation, that word among the coordinated words which is closest to the regent becomes the head of the coordination. For example, the coordinated subject in the sentence John, Bill and Mary laughed is ascending, while the coordinated object in the sentence I saw John, Bill and Mary is descending, as Figure 3 illustrates.</Paragraph> </Section> <Section position="5" start_page="72" end_page="73" type="sub_section"> <SectionTitle> 2.6 Subordinate clauses </SectionTitle> <Paragraph position="0"> The DCParser treats finite subordinate clauses so that the subordinating conjunction serves as a linking word between the heads of the main and the subordinate clauses. The conjunction is in the relation in question, and the head of the subordinate clause is in the ConjPostComp relation with the conjunction. Below there is a Finnish example sentence from the corpus, its rough word-for-word translation and the parse tree produced by the DCParser (4). This sentence exemplifies both subordinate clauses and coordinations. In this output mode the DCParser displays word forms as triplets: surface form, Relation, base form. Hierarchy is indicated using indentation: the regent of a given dependant is the first word below that is indented one step less.</Paragraph> <Paragraph position="1"> Riittää, kun puolueet ja niiden järjestöt [It is enough] [when] [the parties and their organiz.] velvoitetaan lainsäädännön avulla julkaisemaan [are compelled] [using legislation] [to publish] tarkasti tilinpäätökset, budjettinsa ja [accurately] [financial statements, their budgets and] lahjoituksensa [their donations].</Paragraph> <Paragraph position="2"> Another sentence from the corpus and its parse tree are as follows: Kysymys askarruttaa koko maailmaa nyt, [The question now puzzles the whole world,]</Paragraph> </Section> <Section position="6" start_page="73" end_page="74" type="sub_section"> <SectionTitle> 2.7 The grammar </SectionTitle> <Paragraph position="0"> In the DCParser word forms are represented as objects of morpho-syntactic attributes. For example, the word form järjestöt (organizations) appears as [Form="järjestöt", Lex="järjestö", Cat=Noun, Case=Nom, Number=Pl].</Paragraph> <Paragraph position="1"> For efficiency reasons binary relations are expressed as active rules. The testing of a relation, then, corresponds to the activation of the respective rule or a set of alternative rules. For example, a simplified rule for AdjAttr (adjectival attribute of nouns) reads as in (6). A rule has two main parts: the condition part and the action part. The condition part searches for and tests qualifying dependants and possible contextual words. A word qualifies in a test if its attribute object satisfies the description given in the rule. Variables can be used for passing attribute values. (":=" assigns a value; "=" tests a value.) The action part binds and names dependants and assigns values to attributes.</Paragraph>
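<Paragraph position="2"> The condition/action structure can also be visualized procedurally. The sketch below is a hypothetical Python rendering of what the AdjAttr rule does, not the DCParser's actual rule syntax: the function and variable names are invented, and the behavior follows the description of rule (6) in the next paragraph.</Paragraph> <Paragraph position="3">
def adj_attr_rule(sentence, head_pos, bindings):
    """Hypothetical sketch of the AdjAttr rule (6).

    Condition part: the regent is a noun; a candidate dependant is an
    immediately preceding adjective agreeing with it in case and number.
    Action part: bind the adjective as an AdjAttr dependant.
    Redo: keep trying the next word to the left as long as the rule succeeds."""
    head = sentence[head_pos]
    if head.get("Cat") != "Noun":
        return                                            # condition part fails
    pos = head_pos - 1
    while pos >= 0:
        cand = sentence[pos]
        if (cand.get("Cat") == "Adj"
                and cand.get("Case") == head.get("Case")
                and cand.get("Number") == head.get("Number")):
            bindings.append((head_pos, "AdjAttr", pos))   # action part
            pos -= 1                                      # Redo
        else:
            break

# Invented example: "suuri talo" (big house), both Nom Sg.
# sentence = [{"Form": "suuri", "Cat": "Adj", "Case": "Nom", "Number": "Sg"},
#             {"Form": "talo", "Cat": "Noun", "Case": "Nom", "Number": "Sg"}]
# adj_attr_rule(sentence, 1, bindings=[])  binds "suuri" as AdjAttr of "talo".
</Paragraph> <Paragraph position="4"> The bindings list here merely stands in for the action part's effects; in the actual rule syntax the action part can also assign attribute values (":=") to the bound words.</Paragraph>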
<Paragraph position="5"> Rule (6) iteratively (Redo) binds immediately preceding adjectives as attributes if they agree in case and number with the head noun.</Paragraph> <Paragraph position="6"> Rules are classified into generic rules (the grammar proper) and lexical rules. Their expressive power is identical. The former are activated by syntactic categories; (6) visualizes a simple generic rule. Lexical rules are activated by specific lexemes. For example, (7) describes a part of a complex rule for the Finnish verb pitää.</Paragraph> <Paragraph position="7"> Pitää has several senses and subcategories in Finnish. (7) shows two of them. The first alternative treats the verb as a modal verb, as in Minun pitää mennä saunaan (I must go to the sauna). (In our linguistic analysis we treat the infinitive mennä (to go) as the subject of the modal verb pitää and the genitive minun (?I) as the subject of the infinitive.) The second alternative handles the idiomatic usage Minä pidän hänestä (I like her), where a surface elative adverbial represents a deep semantic object of pitää. The rule binds an elative as an adverbial, but does not bind it if the elative is followed by a participle, as in Minä pidän hänestä lähtevästä tuoksusta (?I like the fragrance coming from her).</Paragraph> <Paragraph position="8"> The grammar (Arnola, 1998) consists of about 950 generic rules and about 12,500 lexical rules. An algorithm which implements the Best-First strategy controls the activation of the rules.</Paragraph> </Section> </Section> <Section position="4" start_page="74" end_page="75" type="metho"> <SectionTitle> 3 Empirical Results </SectionTitle> <Section position="1" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.1 Benchmark test suite </SectionTitle> <Paragraph position="0"> The parser has been under development for years. It is an integral part of a commercial machine translation system called TranSmart®. A benchmark test suite of correctly parsed sentences (source sentences and their correct parse trees) has been accumulated during this period. Sentences that have revealed grammatical errors in the parser have been added to the test suite only after the errors were corrected; otherwise the test suite sentences have been randomly selected. The test suite sentences are periodically parsed to guarantee monotonous improvement of the grammar.</Paragraph> <Paragraph position="1"> As of this writing, the benchmark test suite comprises over 3000 sentences. The distribution of the sentence lengths (including delimiters) is shown in Figure 4. The average sentence length is 12.1 words.</Paragraph> </Section> <Section position="2" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.2 Linearity argument </SectionTitle> <Paragraph position="0"> We used the benchmark test suite sentences to test the linearity claim. Figure 5 shows the distribution of the parsing times in seconds. The processor is an old Intel 486, 66 MHz. A 150 MHz Pentium processor parses about 400 sentences per minute of running text.</Paragraph> <Paragraph position="1"> Average parsing times were also plotted as a function of sentence length. Sentences whose length is between 5 and 20 words form statistically meaningful sets. Their average parsing times form a clear linear function. Longer sentences do not support a contrary view.</Paragraph> </Section>
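<Paragraph position="1"> The linearity check of 3.2 amounts to grouping the test suite sentences by length, averaging the parsing times within each group, and fitting a line to the averages. A minimal sketch of such a check, assuming a log of (length, time) measurements as input, could look as follows; the helper names are invented.</Paragraph> <Paragraph position="2">
from collections import defaultdict

def average_times_by_length(samples):
    """samples: iterable of (sentence_length_in_words, parse_time_seconds)."""
    buckets = defaultdict(list)
    for length, t in samples:
        if length in range(5, 21):      # lengths 5..20 form statistically meaningful sets
            buckets[length].append(t)
    return {n: sum(ts) / len(ts) for n, ts in sorted(buckets.items())}

def fit_line(points):
    """Least-squares fit t = C*n + b over the (length, average time) pairs."""
    xs, ys = zip(*points.items())
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    C = cov / var
    return C, my - C * mx

# C, b = fit_line(average_times_by_length(samples))
</Paragraph> <Paragraph position="3"> The slope of the fitted line is an empirical estimate of the constant C in the linear bound of 2.1, and the quality of the fit over the 5-20 word range is what supports the linearity claim.</Paragraph>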
<Section position="3" start_page="74" end_page="75" type="sub_section"> <SectionTitle> 3.3 Quality </SectionTitle> <Paragraph position="0"> It remains to discuss the quality of the parser. We use the following strict criterion for the correctness of a parse tree. A sentence is parsed correctly if the sentence is grammatical and the produced dependency structure completely complies with the structure a competent human judge would assign to it. Otherwise the parse tree is judged incorrect. Hence, a single, local structural error in an otherwise correct parse tree disqualifies the structure. If a sentence is globally ambiguous but it is clear to a human reader which structure is meant, the structure is judged correct only if it is in agreement with the human decision. If a human reader cannot make the right choice for an ambiguous sentence without textual context, the structure is deemed correct if it is one of the possible correct structures.</Paragraph> <Paragraph position="1"> Presently the DCParser is fully developed in the sense that it is in practical use in commercial machine translation systems. However, the tuning of the parser still continues. The parser has been subjected to tens of thousands of genuine unedited sentences from different sources over the years. Each parse tree has been carefully studied, and all indicated errors or gaps that could be systematically corrected were corrected in the grammar and in the lexicons. About once a week the benchmark test suite was processed and possible errors found in the test suite were corrected.</Paragraph> <Paragraph position="2"> Occasionally (about once in a month or two) a fresh piece of text was randomly selected. The total number of sentences in the text and the number of sentences parsed correctly right away were recorded. The incorrectly parsed sentences were classified into three classes: the ones parsed correctly after (only) lexical corrections, the ones parsed correctly after grammatical corrections (and possible lexical corrections), and the ones whose parsing errors could not be corrected in a systematic fashion. These last errors exhibit a fundamental drawback of the Best-First strategy. Table 5 shows the data of these test samples. Each column presents both absolute and relative numbers: absolute/percentage%.</Paragraph> <Paragraph position="3"> [Table 5: for each sample text, the number of sentences, the number parsed correctly right away, the number requiring lexical corrections, the number requiring grammatical corrections, and the number fatally incorrect.]</Paragraph> <Paragraph position="4"> Figure 7 presents the data of the columns in graphic form. Lines are fitted to the data to indicate possible tendencies of the series. Table 5 and Figure 7 show that the parser seems to embody a stable 2-4% error ratio due to fundamental problems in the Best-First strategy. Approximately the same number of sentences (2-5%) have revealed grammatical deficiencies in the parser. This figure may have a slow, although not clear, declining trend. 9-17% of the sentences have revealed lexical deficiencies, and this figure seems to have a slow declining trend. 76-87% of the sentences were parsed correctly right away, and this figure seems to show a clear, if slow, upward trend. (The test samples cover almost two years of rather intense tuning.)</Paragraph>
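<Paragraph position="5"> The bookkeeping behind a Table 5 row can be made explicit. In the sketch below, which is illustrative only, each sentence of a test sample is assumed to carry one of the four judgements of the classification above, and each cell is reported as absolute/percentage%.</Paragraph> <Paragraph position="6">
from collections import Counter

# The four classes of 3.3, in reporting order: parsed correctly right away,
# required (only) lexical corrections, required grammatical corrections,
# fatally incorrect (no systematic correction possible).
CLASSES = ("correct", "lexical", "grammatical", "fatal")

def table5_row(text_id, judgements):
    """judgements: list with one class label per sentence of the sample text."""
    counts = Counter(judgements)
    total = len(judgements)
    cells = ["%d/%.0f%%" % (counts[c], 100.0 * counts[c] / total)
             for c in CLASSES]
    return [text_id, total] + cells

# Invented sample of 100 sentences, mimicking the ranges reported in the text:
# table5_row("Text 1", ["correct"] * 81 + ["lexical"] * 12 +
#                      ["grammatical"] * 4 + ["fatal"] * 3)
</Paragraph> <Paragraph position="7"> Reporting both absolute counts and percentages keeps small samples honest: a single misparsed sentence in a short text shifts the percentages markedly.</Paragraph>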
</Section> </Section> <Section position="5" start_page="75" end_page="75" type="metho"> <SectionTitle> Conclusion </SectionTitle> <Paragraph position="0"> In this paper we have argued that it is possible to parse binary dependency structures of natural language sentences deterministically and in linear time, and to keep parsing quality within acceptable limits, if syntactic heuristics is applied appropriately. The possibility of linear parsing has been proved theoretically and demonstrated empirically. The quality issue was discussed using empirical data. Determinism was accomplished with a Best-First search algorithm which implements syntactic heuristics in three ways: 1) in a permanent ordering of the testing of dependency relations, 2) in the implicit disambiguation of homographic word form interpretations, and 3) in the contexts of dependency relation rules.</Paragraph> <Paragraph position="1"> Linear behavior is strongly supported by the empirical data. It is difficult to be precise about the quality issue. The empirical data show that the upper limit of the quality of this deterministic strategy is 96-98%. The inherent error rate is due to the use of heuristics. Nondeterministic parsers do not have such theoretical barriers. But this inherent error ratio should be contrasted with the fact that a deterministic parser produces the right parse tree, while a nondeterministic parser usually produces only a forest of candidate parse trees.</Paragraph> <Paragraph position="2"> At the moment of this writing this deterministic parser seems to have reached about an 85% correctness rate (the average of the last five samples). Current errors are mainly lexical errors or gaps (about 9%), which usually can be corrected easily, but the corrections improve the quality only slightly. Some 3% of the current errors are errors and gaps in the grammar. One should be cautious, however, about giving any precise numbers for parsing quality, since our experience shows that quality numbers vary markedly from one text to another.</Paragraph> <Paragraph position="3"> An interactive demonstration of the parser is available to the public for testing purposes at http://www.kielikone.fi, as is the machine translation system (from Finnish into English).</Paragraph> </Section> </Paper>