<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1084"> <Title>Generalized Multitext Grammars</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Formal Definitions </SectionTitle> <Paragraph position="0"> Let $V_N$ be a finite set of nonterminal symbols and let $\mathbb{N}$ be the set of integers.[3] We define $I(V_N) =$</Paragraph> <Paragraph position="2"/> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Elements of </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> will be called indexed nonterminal symbols. In what follows we also consider a finite set of terminal symbols $V_T$, disjoint from $V_N$, and work with strings in $V_I^*$</Paragraph> <Paragraph position="4"> $V_N)\}$, i.e. the set of indexes that appear in $\gamma$.</Paragraph> <Paragraph position="5"> An indexed tuple vector, or ITV, is a vector of tuples of strings over $V_I$, having the form</Paragraph> <Paragraph position="7"> $i])$ to denote the arity of such a tuple, which is $q_i$. When $\phi$</Paragraph> <Paragraph position="9"> be confused with $(\epsilon)$, that is, the tuple of arity one containing the empty string.</Paragraph> <Paragraph position="10"> [Footnote 2: These are only a small subset of the necessary productions. The subscripts on the nonterminals indicate what terminals they will eventually yield; the terminal productions have been left out to save space.]</Paragraph> <Paragraph position="11"> [Footnote: ... other uses of superscripts in formal language theory. However, we shall omit the parentheses when the context is unambiguous.] A link is an ITV where each $\gamma_{ij}$ consists of one indexed nonterminal and all of these nonterminals are coindexed. 
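As an illustration only, the definitions of ITV and link can be sketched in Python. The encoding is an assumption made for this sketch and does not come from the paper: a terminal is a plain string, an indexed nonterminal is a (name, index) pair, and the names Symbol, String, ITV, index and is_link are hypothetical.

```python
from typing import List, Tuple, Union

# Hypothetical encoding (not from the paper): a terminal is a plain str,
# an indexed nonterminal A^(t) is the pair ("A", t).
Symbol = Union[str, Tuple[str, int]]
String = List[Symbol]            # a string of indexed nonterminals and terminals
ITV = List[Tuple[String, ...]]   # one tuple of strings per dimension


def index(string: String) -> set:
    """Return the set of indexes that appear in a string."""
    return {sym[1] for sym in string if isinstance(sym, tuple)}


def is_link(itv: ITV) -> bool:
    """True if every string is a single indexed nonterminal and all are coindexed.

    Assumption of this sketch: a vector with no strings at all is rejected.
    """
    indexes = set()
    for component in itv:
        for string in component:
            if len(string) != 1 or not isinstance(string[0], tuple):
                return False
            indexes |= index(string)
    return len(indexes) == 1
```

Under this encoding, is_link([([("S", 1)],), ()]) holds for a two-dimensional vector whose first component contains S with index 1 and whose second component is the empty tuple, whereas a vector containing a terminal string, or nonterminals with different indexes, is rejected.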
As we shall see, the notion of a link generalizes the notion of nonterminal in context-free grammars: each production rewrites a single link.</Paragraph> <Paragraph position="12"> Definition 1 Let $D \geq 1$ be some integer constant. A generalized multitext grammar with $D$ dimensions ($D$-GMTG for short) is a tuple $G =$</Paragraph> <Paragraph position="14"> ) where $V_N$, $V_T$ are finite, disjoint sets of nonterminal and terminal symbols, respectively, $S \in V_N$ is the start symbol and $P$ is a finite set of productions. Each production has the form $\alpha \to \beta$, where $\alpha$ is a $D$-dimensional link and $\beta$ is a $D$-dimensional ITV such that $\phi(\alpha[i]) = \phi(\beta[i])$ for</Paragraph> <Paragraph position="16"> We omit symbol $D$ from $D$-GMTG whenever it is not relevant. To simplify notation, we write productions as $p = [p$</Paragraph> <Paragraph position="18"> we omit the unique index appearing on the LHS of $p$. Each $p_i$ is called a production component. The production component $() \to ()$ is called the inactive production component. All other production components are called active and we set $\mathrm{active}(p) = \{i \mid q_i &gt; 0\}$. Inactive production components are used to relax synchronous rewriting on some dimensions, that is, to implement rewriting on $d &lt; D$ components. When $d = 1$, rewriting is licensed on one component, independently of all the others.</Paragraph> <Paragraph position="19"> Two grammar parameters play an important role in this paper. 
Let $p = [p$</Paragraph> <Paragraph position="21"> Definition 2 The rank $\rho$ of a production $p$ is the number of links on its RHS: $\rho(p) =$</Paragraph> <Paragraph position="23"> For example, the rank of Production (23) is two and its fan-out is four.</Paragraph> <Paragraph position="24"> In GMTG, the derives relation is defined over ITVs. GMTG derivation proceeds by synchronous application of all the active components in some production. The indexed nonterminals to be rewritten simultaneously must all have the same index $t$, and all nonterminals indexed with $t$ in the ITV must be rewritten simultaneously. Some additional notation will help us to define rewriting precisely. A reindexing is a one-to-one function on $\mathbb{N}$, and is extended to $V_I$</Paragraph> <Paragraph position="26"> and $f(A^{(t)}) = A^{(f(t))}$ for $A^{(t)} \in I(V_N)$. We also extend $f$ to strings in $V_I^*$ analogously. We say that $\alpha, \alpha' \in V_I^*$ are independent if $\mathrm{index}(\alpha) \cap$ [...] concatenation of all $\alpha_{ij}$ and that $\gamma$ is some concatenation of all $\gamma_{ij}$, $1 \leq i \leq D$, $1 \leq j \leq q_i$, and let $f$ be some reindexing such that strings $f(\alpha)$ and $\gamma$ are independent. The derives relation a66a9a8 a154a10 a6 holds whenever there exists an index $t \in \mathbb{N}$ such that the following two conditions are satisfied: [...] the usual way, to represent derivations.</Paragraph> <Paragraph position="27"> We can now introduce the notion of generated language (or generated relation). A start link of a $D$-GMTG is a $D$-dimensional link where at least one component is $(S^{(1)})$, $S$ the start symbol, and the rest of the components are $()$. Thus, there are $2^D - 1$ start links. 
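The count of start links can be checked by enumeration. The following Python sketch is illustrative only; the function name start_links and the encoding of components as nested tuples are assumptions of the sketch, not notation from the paper. Each of the D components is either the active tuple holding the start symbol with index 1 or the empty tuple, and at least one component must be active, which gives 2^D - 1 combinations.

```python
from itertools import product


def start_links(D, start="S"):
    """Enumerate the start links of a D-dimensional grammar.

    Hypothetical encoding: an active component is the one-element tuple
    ((start, 1),) and an inactive component is the empty tuple ().
    """
    active = ((start, 1),)
    links = []
    for mask in product([True, False], repeat=D):
        if any(mask):  # at least one component must be active
            links.append(tuple(active if on else () for on in mask))
    return links


if __name__ == "__main__":
    for D in range(1, 5):
        print(D, len(start_links(D)), 2 ** D - 1)
```

For every D printed by the loop, the number of enumerated links agrees with 2^D - 1; the single excluded combination is the vector whose components are all empty.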
The language generated by a $D$-GMTG $G$ is $L(G) =$ a44 a66a26a25 a28</Paragraph> <Paragraph position="29"> containing multitexts derived from a different start link. These subsets are disjoint, since every non-empty tuple of a start link is eventually rewritten as a string, either empty or not.[5] A start production is a production whose LHS is a start link. A GMTG writer can choose the combinations of components in which the grammar can generate, by including start productions with the desired combinations of active components. If a grammar contains no start productions with a certain combination of active components, then the corresponding subset of $L(G)$ will be empty. [Footnote 5: We are assuming that there are no useless nonterminals.]</Paragraph> <Paragraph position="30"> Allowing a single GMTG $G$ to generate multitexts with some empty tuples corresponds to modeling relations of different dimensionalities. This capability enables a synchronous grammar to govern lower-dimensional sublanguages/translations. For example, an English/Italian GMTG can include Production (9), an English CFG, and an Italian CFG. A single GMTG can then govern both translingual and monolingual information in applications. Furthermore, this capability simplifies the normalization procedure described in Section 6. Otherwise, this procedure would require exceptions to be made when eliminating epsilons from start productions.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Generative Capacity </SectionTitle> <Paragraph position="0"> In this section we compare the generative capacity of GMTG with that of mildly context-sensitive grammars. We focus on LCFRS, using the notational variant introduced by Rambow and Satta (1999), briefly summarized below. Throughout this section, strings $w \in V_T^*$ and vectors of the form $[(w)]$ will be identified. 
For lack of space, some proofs are only sketched, or entirely omitted when relatively intuitive: Melamed et al. (2004) provide more details.</Paragraph> <Paragraph position="1"> Let a37a59a58 be some terminal alphabet. A function a36 has rank a32a164a93a111a100 if it is defined on a5 a37 a60</Paragraph> <Paragraph position="3"> a49 represents some grouping into a35 strings of all and only the variables appearing in the left-hand side, possibly with some additional terminal symbols. (Symbols a145 , a113 and a8</Paragraph> <Paragraph position="5"> as in GMTGs, every a45 a52 a37 a38 is associated with an integer a113 a5a26a45 a9a124a93 a95 with a113 a5a19a18 a9a53a42 a95 , and a119 is a finite set of productions of the form</Paragraph> <Paragraph position="7"> The language generated by a117 is defined as a24</Paragraph> <Paragraph position="9"> tively, a113 a5 a127 a9a99a42 a113 a5a26a45 a9 and a113 a5 a117 a9a81a42a78a150a43a152a146a153 a154a163a155a146a156 a113 a5 a127 a9 .</Paragraph> <Paragraph position="10"> The proof of the following theorem is relatively intuitive and therefore omitted.</Paragraph> <Paragraph position="11"> Theorem 1 For any LCFRS a117 , there exists some</Paragraph> <Paragraph position="13"> Next, we show that the generative capacity of GMTG does not exceed that of LCFRS. In order to compare string tuples with bare strings, we introduce two special functions ranging over multitexts. Assume two fresh symbols a1 a1a3a2 a15a52 a5 a37a39a58 a64</Paragraph> <Paragraph position="15"> sets of multitexts in the obvious way: a5a7a6a10a8a146a72 a5 a24 a9 a42</Paragraph> <Paragraph position="17"> In a a17 -GMTG, a production with a32 active components, a95a151a105 a32 a105 a17 , is said to be a32 -active. 
A a17 -GMTG whose start productions are all a17 -active is called properly synchronous.</Paragraph> <Paragraph position="18"> Lemma 1 For any properly synchronous a17 -GMTG a117 , there exists some LCFRS a117 a79 with a145</Paragraph> <Paragraph position="20"> and a113 a5 a117a57a79a37a9 a42a129a113 a5 a117 a9 such that a24 a5 a117 a79a44a9a99a42a13a5a7a6a10a8a71a72 a5 a24 a5 a117 a9 a9 .</Paragraph> <Paragraph position="21"> Outline of the proof. We set a117 a79 a42 a5 a37 a79</Paragraph> <Paragraph position="23"> in the productions of a117 , and a119a67a79 is constructed as follows. Let a127 a1 a127a39a79 a52 a119 with a127a94a42 a0a127</Paragraph> <Paragraph position="25"> hand side of a127a80a79 , that is</Paragraph> <Paragraph position="27"> Then there must be at least one index a54 such that for</Paragraph> <Paragraph position="29"> a54a82a98a27a9 be the number of occurrences of a54a89a98 appearing in a121 a154 . We define an alphabet a23 a154a124a42 a44 a43a39a98a104a103 a28 a95 a105 a106 a105</Paragraph> <Paragraph position="31"> we define a string a44 a5 a127 a1a89a106 a1a2a108 a9 over a23 a154a110a64a142a37a91a58 as follows. Let a121 a98a104a103a110a42 a25</Paragraph> <Paragraph position="33"> the index of a25a19a16 and the indicated occurrence of a25a15a16 is the a22 -th occurrence of such symbol appearing from left to right in string a121 a154 .</Paragraph> <Paragraph position="34"> Next, for every possible a127 , a127a80a79 , and a54 as above, we</Paragraph> <Paragraph position="36"> and a113 a5 a127 a49 a9 a42 a113 a5 a127 a9 . Without loss of generality, we assume that a117 contains only one production with a18 appearing on the left-hand side, having the form a127 a27 a42 a0a6a5a19a18 a9a11a1a14a86a14a86a14a86a10a1 a5a19a18 a9a13a2 a3 a0a6a5a26a45 a7 a9a11a1a14a86a14a86a14a86 a1 a5a26a45 a7 a9a13a2 . 
To complete the construction of a119a67a79 , we then add a last production a0a18 a2 a3 a36 a5 a0a127 a27 a1a16a95 a2a24a9 where</Paragraph> <Paragraph position="38"> We claim that, for each a127 , a127 a79 and a54 as above lemma follows from this claim.</Paragraph> <Paragraph position="39"> The proof of the next lemma is relatively intuitive and therefore omitted.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Weak Language Preservation Property </SectionTitle> <Paragraph position="0"> GMTGs have the weak language preservation property, which is one of the defining requirements of synchronous rewriting systems (Rambow and Satta, 1996). Informally stated, the generative capacity of the class of all component grammars of a GMTG exactly corresponds to the class of all projected languages. In other words, the interaction among different grammar components in the rewriting process of GMTG does not increase the generative power beyond the above mentioned class. The next result states this property more formally.</Paragraph> <Paragraph position="1"> Let a117 be a a17 -GMTG with production set a119 .</Paragraph> <Paragraph position="2"> For a95a53a105 a106 a105 a17 , the a106 -th component grammar of a117 , written a4a1a0a3a2 a4 a5 a117 a1a89a106 a9 , is the 1-GMTG with productions a119a33a98a126a42 a44 a127a39a98 a28 a0a127 and a119 a79 are constructed from a117 a79 a79 almost as in the proof of Lemma 1.</Paragraph> <Paragraph position="3"> The only difference is in the definition of strings if a25a18a16a123a52 a41 a5 a37 a38 a9 , with a54 , a22 as in the original proof. Finally, the production rewriting a0a18 a2 has the form only with respect to string a2 . 
The theorem then follows from the fact that LCFRS is closed under intersection with regular languages (Weir, 1988).</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Generalized Chomsky Normal Form </SectionTitle> <Paragraph position="0"> Certain kinds of text analysis require a grammar in a convenient normal form. The prototypical example for CFG is Chomsky Normal Form (CNF), which is required for CKY-style parsing. A $D$-GMTG is in Generalized Chomsky Normal Form (GCNF) if it has no useless links or useless terminals, and every production is in one of two forms: (i) A nonterminal production has rank = 2 and no terminals or $\epsilon$'s on the RHS.</Paragraph> <Paragraph position="1"> The algorithm to convert a GMTG to GCNF has the following steps: (1) add a new start symbol, (2) isolate terminals, (3) binarize productions, (4) remove $\epsilon$'s, (5) eliminate useless links and terminals, and (6) eliminate unit productions. The steps are generalizations of those presented by Hopcroft et al. (2001) to the multidimensional case with discontinuities. The ordering of these steps is important, as some steps can restore conditions that others eliminate. Traditionally, the terminal isolation and binarization steps came last, but the alternative order reduces the number of productions that can be created during $\epsilon$-elimination. Steps (1), (2), (5) and (6) are the same for CFG and GMTG, except that the notion of nonterminal in CFG is replaced with links in GMTG. Some complications arise, however, in the generalization of steps (3) and (4).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Step 3: Binarize </SectionTitle> <Paragraph position="0"> The third step of converting to GCNF is binarization of the productions, making the rank of the grammar two. For $r \geq 0$ and $f \geq 1$, we write D-GMTG to represent the class of all $D$-GMTGs with rank $r$ and fan-out $f$. 
A CFG can always be binarized into another CFG: two adjacent nonterminals are replaced with a single nonterminal that yields them. In contrast, it can be impossible to binarize a $D$-GMTG [...] [Figure 1 caption: ... fan-out to binarize, and its 2D illustration.]</Paragraph> <Paragraph position="1"> [...] for every fan-out $f \geq 2$ and rank $r \geq 4$, there are some index orderings that can be generated by [...] The distinguishing characteristic of such index orderings is apparent in Figure 1, which shows a production in a grammar with fan-out two, and a graph that illustrates which nonterminals are coindexed. No two nonterminals are adjacent in both components, so replacing any two nonterminals with a single nonterminal causes a discontinuity. Increasing the fan-out of the grammar allows a single nonterminal to rewrite as non-adjacent nonterminals in the same string. Increasing the fan-out can be necessary even for binarizing a 1-GMTG production such as:</Paragraph> <Paragraph position="3"> To binarize, we nondeterministically split each nonterminal production $p_0$ of rank $r &gt; 2$ into two nonterminal productions $p$</Paragraph> <Paragraph position="5"> of rank $&lt; r$, but possibly with higher fan-out. Since this algorithm replaces a production of rank $r$ with two productions that have rank $&lt; r$, recursively applying the algorithm to productions of rank greater than two will reduce the rank of the grammar to two. The algorithm follows: (i) Nondeterministically choose $n$ links to be removed from $p_0$ and replaced with a single link [...] these links the m-links.</Paragraph> <Paragraph position="6"> (ii) Create a new ITV $\gamma$. Two nonterminals are neighbors if they are adjacent in the same string in a production RHS. 
For each set of m-link neighbors in component $d$ in $p_0$, place that set of neighbors into the $d$'th component of $\gamma$ in the order in which they appeared in $p_0$, so that each set of neighbors becomes a different string, for $1 \leq d \leq D$.</Paragraph> <Paragraph position="7"> (iii) Create a new unique nonterminal, say $B$, and replace each set of neighbors in production $p_0$ with $B$, to create $p$</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Step 4: Eliminate $\epsilon$'s </SectionTitle> <Paragraph position="0"> Grammars in GCNF cannot have $\epsilon$'s in their productions. Thus, GCNF is a more restrictive normal form than those used by Wu (1997) and Melamed (2003). The absence of $\epsilon$'s simplifies parsers for GMTG (Melamed, 2004). Given a GMTG $G$ with $\epsilon$ in some productions, we give the construction of a weakly equivalent grammar $G'$ without any $\epsilon$'s. First, determine all nullable links and associated strings in $G$. A nullable [...] strings of that link. There is one version for each of the possible combinations of the nullable strings being present or absent. The version of the link with all strings present is its original version. Each non-original version of the link (except in the case of start links) gets a unique subscript, which is applied to all the nonterminals in the link, so that each link is unique in the grammar. We construct a new grammar $G'$ whose set of productions $P'$ is determined as follows: for each production, we identify the nullable links on the RHS and replace them with each combination of the non-original versions found earlier. If a string is left empty during this process, that string is removed from the RHS and the fan-out of the production component is reduced by one. The link on the LHS is replaced with its appropriate matching non-original link.</Paragraph> <Paragraph position="1"> There is one exception to the replacements. 
If a production consists of all nullable strings, do not include this case. Lastly, we remove all strings on the RHS of productions that have $\epsilon$'s, and reduce the fan-out of the productions accordingly. Once [...]. We then alter the productions. Production (31) gets replaced by (40). A new production based on (30) is Production (38). Lastly, Production (29) has two nullable strings on the RHS, so it gets altered to add three new productions, (34), (35) and (36). The altered set of productions is the following: [...] Melamed et al. (2004) give more details about conversion to GCNF, as well as the full proof of our final theorem: Theorem 4 For each GMTG $G$ there exists a GMTG $G'$ in GCNF generating the same set of multitexts as $G$ but with each $(\epsilon)$ component in a multitext replaced by $()$.</Paragraph> </Section> </Section> </Paper>