File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/92/c92-1062_abstr.xml
Size: 12,950 bytes
Last Modified: 2025-10-06 13:47:22
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1062"> <Title>A Chart-based Method of ID/LP Parsing with Generalized Discrimination Networks</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Variations of word order are among the best-known phenomena of natural languages. From a well-represented sample of world languages, Steele [13] shows that about 76% of languages exhibit significant word order variation. In addition to the well-known Walpiri (an Australian language), several languages such as Japanese, Thai, German, Hindi, and Finnish also allow considerable word order variation. It is widely accepted that such variations are governed by generalizations that should be expressed by the grammars. Generalized Phrase Structure Grammar (GPSG) [7] provides a method to account for these generalizations by decomposing the grammar rules into Immediate Dominance (ID) rules and Linear Precedence (LP) rules. Using the ID/LP formalism, languages with flexible word order can be described more concisely and easily. However, designing an efficient algorithm that puts the separated components back together in actual parsing is a difficult problem.</Paragraph> <Paragraph position="1"> Given a set of ID/LP rules, one method for parsing is to compile it into another grammar description language, e.g. Context-Free Grammar (CFG), for which parsing algorithms already exist. However, the resulting object grammar tends to be huge and can slow down parsing dramatically. The method also loses the modularity of the ID/LP formalism.</Paragraph> <Paragraph position="2"> Another set of approaches [11, 4, 1] tries to keep ID and LP rules as they are, without expanding them into other formalisms. Shieber [11] has proposed an interesting algorithm for direct ID/LP parsing by generalizing Earley's algorithm [6] to use the constraints of ID/LP rules directly.
Although the algorithm may blow up in the worst case, Barton [3] has shown that Shieber's direct parsing algorithm usually has a time advantage over Earley's algorithm applied to the expanded CFG. Thus the direct parsing strategy is likely to be an appealing candidate for parsing with ID/LP rules from the computational point of view.</Paragraph> <Paragraph position="3"> In this paper, we present a new approach to direct ID/LP parsing that outperforms the previous methods. Besides the direct parsing property, three features contribute to its efficiency. First, ID rules are precompiled into generalized discrimination networks [9] to yield a compact representation of parsing states, and hence less computation time. Second, LP rules are also precompiled, into a Hasse diagram, to minimize the time used for the order legality check at run time. Third, its bottom-up depth-first parsing strategy minimizes the work of edge checking and therefore saves considerable processing time.</Paragraph> <Paragraph position="4"> We first briefly describe each feature of our parser. Then we present the parsing algorithm and an example of parsing. We also compare our approach with related work. Finally, we give a conclusion and outline future work.</Paragraph> <Paragraph position="6"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Bottom-up Depth-first Strategy </SectionTitle> <Paragraph position="0"> Chart parsing is one of the best-known and most efficient techniques for parsing general context-free grammars. The chart serves as bookkeeping storage for all parses generated while parsing. In general, to avoid redoing the same tasks, the chart has to be checked every time a new edge is proposed to see whether an identical edge has already been generated. Also, when an edge is entered into the chart, it must be checked against other edges to see whether it can be combined with them to create new edges.
In practice, these checks can occupy the majority of parsing time.</Paragraph> <Paragraph position="1"> To build an efficient parser, it is clearly necessary to minimize the checks above. Many different chart parsing strategies have been developed. Most of them try to minimize the number of useless edges to reduce the checking time.</Paragraph> <Paragraph position="2"> Our parsing strategy is based on the Word Incorporation (WI) algorithm [12] with some modifications to accommodate the ID/LP formalism. We follow the WI algorithm by restricting the parsing strategy to be solely bottom-up and depth-first. This makes the parsing proceed along the input in an orderly fashion (left to right or right to left) and keep processing at a vertex until no more new edges ending at the vertex can be generated. Once the parsing goes beyond a vertex, the processing will never be redone at that vertex again. As a consequence, the duplicated edge check can be completely omitted. Moreover, once a complete edge is used (for creating new active edges), we can delete it from the storage since it cannot affect other edges anymore.</Paragraph> <Paragraph position="3"> This reduces the number of edges and hence the checking time.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Generalized Discrimination Networks as ID rules compilation </SectionTitle> <Paragraph position="0"> In conventional chart parsing for context-free grammars, one method for reducing the number of edges is precompiling the grammar into discrimination trees. Assume two CFG rules, S → ABCD and S → ABEF. The right-hand sides of the two rules share the common left part AB and can therefore be merged into a single combined rule: S → AB(CD, EF). In parsing, the common part can then be represented by a single active edge.</Paragraph> <Paragraph position="1"> However, the case is different when the method is applied to the ID/LP formalism.
Suppose we have an ID/LP grammar G1 as shown in Fig. 1. If we view parsing as discrimination tree traversal, the parsing has to proceed in a fixed order from the root to the leaf nodes. Because of the order-free characteristic of ID rules, we can no longer simply combine the ID rules (1) and (2) as we did for the two CFG rules above.</Paragraph> <Paragraph position="2"> To achieve the same merit that a discrimination network gives for CFG rules, we use a generalized discrimination network (GDN) for representing ID rules. A GDN is a generalization of a discrimination tree that can be traversed according to the order in which constraints are obtained incrementally during the analytical process, independently of the order of the network's arcs. The technique was first proposed in [9] for use in an incremental semantic disambiguation model, but its characteristics also match our purpose. The technique of GDN is to assign each node in the network a unique identifier and a bit vector. For example, the ID rules of G1, shown in Fig. 1, can be represented as the discrimination network in Fig. 2, in which each node is assigned a unique identifier. The leftmost digit of the identifier of a node v indicates whether the node is a leaf or not: '0' for a leaf and '1' for a non-leaf. This digit is followed by the sequence S(v), which is the concatenation of the sequence S(u) and the integer k, where u is the immediate predecessor of v and k is the ordinal number of the arc from u to v among the arcs issuing from u. Note that the identifier of the root node r has only the leftmost digit (S(r) is null). [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 402 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992]</Paragraph> <Paragraph position="3"> As shown in Fig. 2, to each node identifier we attach a bit vector that has the same length as the identifier and consists of 1's except for the leftmost and rightmost bits.
These identifiers together with their corresponding bit vectors play an important role in the parsing process with GDN, as will be described later.</Paragraph> <Paragraph position="4"> Note that representing ID rules by GDN can combine the common parts of different ID rules into the same arcs in the network.</Paragraph> <Paragraph position="5"> Shieber's representation, in contrast, considers each single ID rule separately and thus cannot achieve this kind of compactness.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Representing LP rules as a Hasse diagram </SectionTitle> <Paragraph position="0"> A Hasse diagram is a representation of a partially ordered set used in graph theory [8]. Since a set of LP rules also defines a partially ordered set on a grammar's categories, we can construct its corresponding Hasse diagram. Fig. 3 shows a Hasse diagram for the LP rules of G1. To each node we assign a unique flag and construct a bit vector by setting the flag to '1' and the others to '0'. For this Hasse diagram, we assign flag(a) the first bit, flag(b) the second bit, ..., and flag(f) the sixth bit. The bit vectors of nodes a, b, c, d, e and f are then 000001, 000010, 000100, 001000, 010000 and 100000, respectively. The precedence vector of each node is the bitwise disjunction of its bit vector and the bit vectors of all its subordinate nodes. For example, the precedence vector of f is the disjunction of the bit vectors of f, a and e: 100000 ∨ 000001 ∨ 010000 = 110001. The resultant precedence vectors are shown in Fig. 3 with their leading 0's omitted.</Paragraph> <Paragraph position="1"> Using the above technique, the order legality check with respect to a given set of LP rules can be done efficiently by the following algorithm. Let Pre_A and Pre_B be the precedence vectors of symbols A and B, respectively, where A precedes B in the input.</Paragraph> <Paragraph position="2"> 1.
Take the bitwise disjunction between Pre_A and Pre_B.</Paragraph> <Paragraph position="3"> 2. Check equality: if the result is equal to Pre_A, fail. Otherwise, return the result as the precedence vector of the string AB.</Paragraph> <Paragraph position="4"> Note that, by the encoding algorithm described in the previous subsection, the precedence vector of a symbol A that must precede a symbol B is always included in B's precedence vector. As a result, if A comes after B, the disjunction of their precedence vectors will be equal to B's precedence vector. The above algorithm uses this fact to detect an order violation easily. Moreover, note that all LP constraints concerning the concatenated symbols are propagated in the resultant string's precedence vector through the disjunction. We can then use the algorithm to check the legality of the next input symbol with respect to all preceding symbols simply by checking it against the resultant string's precedence vector. In a real implementation, we can represent a precedence vector by an integer, so the order legality can be checked efficiently using the bitwise Boolean operations on integers provided by most machines.</Paragraph> <Paragraph position="5"> Next, the constraint-identifier table is replaced by the category-state table, notated as reduce(category, state), viewing each category as a constraint. This table is used to reduce a constituent to a higher-level constituent state when it is complete. A constituent is complete if its current state is at a leaf node and all bits of BitV are set to 0. Fig.
4 shows the table derived from G1.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Table for ID/LP Parsing </SectionTitle> <Paragraph position="0"> GDN can cope with any order of input constraints by referring to the constraint-identifier table, which is extracted from the network by collecting pairs of a branch and its immediate subordinate node. However, GDN was proposed to handle a completely order-free constraint system. To apply the model to parsing natural language whose word order is restricted by linear precedence constraints, some modifications are needed to take those constraints into account.</Paragraph> <Paragraph position="1"> First, the definition of a state is changed from a 2-tuple &lt;Id, BitV&gt; to a 4-tuple &lt;Cat, Id, Pre, BitV&gt;, where the elements are defined as follows: Cat: the mother category of the state; Id: the identifier of the state; Pre: the precedence vector of the state; BitV: the bit vector of the state.</Paragraph> <Paragraph position="2"> Because the grammar's nonterminal categories give rise to several networks, Cat is added to indicate which network the state belongs to. Moreover, in addition to the elements used to check ID rules, the precedence vector Pre is added for checking LP rules.</Paragraph> </Section> </Section> </Paper>
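The precedence-vector encoding and the two-step order legality check of Section 2.3 can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' implementation: the names `CATEGORIES`, `LP_RULES`, `precedence_vectors` and `check_order` are hypothetical, and since Fig. 3 is not reproduced here, the rule set contains only the two relations implied by the worked example in the text (a and e are subordinate to f).

```python
from typing import Dict, List, Optional, Set, Tuple

# Categories of the example; flag(a) is the first (lowest) bit, flag(f) the sixth.
CATEGORIES: List[str] = ["a", "b", "c", "d", "e", "f"]
BIT: Dict[str, int] = {c: 1 << i for i, c in enumerate(CATEGORIES)}

# ASSUMPTION: hypothetical LP rules (x, y) meaning "x must precede y".
# Only the relations implied by the text's example for f are included.
LP_RULES: List[Tuple[str, str]] = [("a", "f"), ("e", "f")]


def precedence_vectors(lp_rules: List[Tuple[str, str]]) -> Dict[str, int]:
    """OR each node's bit vector with those of all its subordinate nodes."""
    below: Dict[str, Set[str]] = {c: set() for c in CATEGORIES}
    for first, second in lp_rules:
        below[second].add(first)

    def collect(cat: str, seen: Set[str]) -> int:
        vec = BIT[cat]
        for sub in below[cat]:
            if sub not in seen:  # visit each subordinate once
                seen.add(sub)
                vec |= collect(sub, seen)
        return vec

    return {c: collect(c, {c}) for c in CATEGORIES}


def check_order(pre_left: int, pre_right: int) -> Optional[int]:
    """Order legality check for a left string followed by a right symbol.

    Step 1: take the bitwise disjunction of the two precedence vectors.
    Step 2: if it equals the left vector, fail (None); otherwise return it
    as the precedence vector of the concatenated string.
    """
    combined = pre_left | pre_right
    return None if combined == pre_left else combined


PRE = precedence_vectors(LP_RULES)

if __name__ == "__main__":
    print(format(PRE["f"], "06b"))            # precedence vector of f
    print(check_order(PRE["a"], PRE["f"]))    # a before f: legal
    print(check_order(PRE["f"], PRE["a"]))    # f before a: violation
```

Because the vector returned for a legal pair is itself a precedence vector, it can be passed back as `pre_left` when the next symbol arrives, which is how the constraints of all preceding symbols are carried forward in a single integer.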