File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/88/c88-1062_metho.xml
Size: 13,849 bytes
Last Modified: 2025-10-06 14:12:08
<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1062"> <Title>VOCNETS - A TOOL FOR HANDLING FINITE VOCABULARIES</Title> <Section position="4" start_page="0" end_page="306" type="metho"> <SectionTitle> 2. Modified Finite-State Representation </SectionTitle> <Paragraph position="0"> We have chosen to represent vocabularies as modified finite-state grammars, which we shall call vocnets.</Paragraph> <Paragraph position="1"> A vocnet will include a finite directed graph with edges, called arrows, labelled with elements of the alphabet A. Such a graph will specify a vocabulary over the alphabet A if we mark a subset S of the nodes as source nodes and define as an accepted word the concatenation of the labels of those paths through the graph from nodes in S that arrive, under certain side conditions, at a set of nodes which fulfills given target conditions. We do not assume a vocnet to be deterministic in the sense that for any node i and string $\alpha$ there exists only one node j such that $\alpha$ is a path from i to j. Should we introduce such a restriction, it can be proven that it is lost already under regular operations on the vocabularies, i.e., that this attractive feature will be absent from a vocnet derived in the manner we propose for the union, concatenation set or closure of vocabularies for which deterministic vocnets had been introduced.</Paragraph> <Paragraph position="2"> Precautions had to be taken to keep the mechanically generated representations compact. In particular, it was essential to eliminate the well-known multiplicative effect on the number of states arising when standard finite-state grammars are combined by intersection and complementation.</Paragraph> <Paragraph position="3"> 3. Definition of vocnet graphs A vocnet graph $U = \langle A, N, C', C'' \rangle$ is a quadruple, where A is an alphabet of atoms a, b, c, ... and N is a set of nodes h, i, j, k, ...</Paragraph> <Paragraph position="4"> C' and C&quot; are mappings of A into subsets of $N \times N$. 
We define $C(x) = C'(x) \cup C''(x)$ as the set of categories of the atom x. We define the product $C_1 \circ C_2$ of two category sets $C_1$ and $C_2$ as $C_1 \circ C_2 = \{(i, j) \mid \exists k\, ((i, k) \in C_1 \land (k, j) \in C_2)\}$ and the category set for a string $\alpha = x\beta$ as</Paragraph> <Paragraph position="6"> We shall say that the atom x connects the set M1 to the set M2 in U iff either M2 is the set of all j for which there is a node i in M1 such that $(i, j) \in C'(x)$, or M2 is the set of all j for which there is a node i in M1 such that $(i, j) \in C''(x)$.</Paragraph> <Paragraph position="7"> We shall also say that a string $\alpha = x\beta$ connects M1 to M2 if there is some set M3 such that x connects M1 to M3 and $\beta$ connects M3 to M2. By introducing two kinds of arrows, one can, so to speak, synchronize parallel paths: the restriction that in every path the arrow associated with one position in a string will have to be of the same kind can be utilized to partition the graph into zones which correspond to segments of the strings, if one kind of arrows, intrazone arrows (those in C'), join nodes within the same zone and another kind, interzone arrows (those in C&quot;), join nodes in one zone with nodes in another zone. 
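The product of category sets and the connects relation defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names compose, step and connected_sets are hypothetical, and representing C' and C'' as dictionaries from atoms to sets of node pairs is an assumption made only for this illustration.

```python
# Hypothetical representation: a category set is a set of (i, j) node pairs.

def compose(c1, c2):
    """Product C1 o C2: all (i, j) with (i, k) in c1 and (k, j) in c2 for some k."""
    return {(i, j) for (i, k) in c1 for (k2, j) in c2 if k == k2}


def step(nodes, arrows):
    """All nodes j reached from some node i in `nodes` by one arrow in `arrows`."""
    return {j for (i, j) in arrows if i in nodes}


def connected_sets(intra, inter, start, word):
    """Every node set M such that `word` connects the node set `start` to M.

    `intra` and `inter` map each atom to its C' and C'' arrow sets. At each
    position of the word, all parallel paths must use the same kind of arrow,
    so each atom yields two candidate successor sets (possibly empty).
    """
    current = {frozenset(start)}
    for atom in word:
        following = set()
        for nodes in current:
            for kind in (intra, inter):
                following.add(frozenset(step(nodes, kind.get(atom, set()))))
        current = following
    return current


# Toy graph, invented for the example: "a" has one intrazone arrow,
# "b" has one intrazone and one interzone arrow.
intra = {"a": {(0, 1)}, "b": {(1, 1)}}
inter = {"b": {(1, 2)}}
print(compose(intra["a"], intra["b"] | inter["b"]))
print(connected_sets(intra, inter, {0}, "ab"))
```

Empty node sets are kept in the result because the definition of connects does not exclude them; the requirement that M be non-empty only enters with the definition of the language of a vocnet.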
A string can then be seen as consisting of segments separated by junctures, where each segment is associated with parallel intrazone arrow sequences and each juncture with parallel interzone arrows.</Paragraph> <Paragraph position="8"> The sets E1 and E2 here form the target areas of G.</Paragraph> <Paragraph position="9"> The union of all minimal sets M for which P(M) is true in the vocnet G will be called the target set T of G.</Paragraph> <Paragraph position="10"> A vocnet G defines the language $L(G) = \{\alpha \mid \exists M \subseteq N : \alpha \text{ connects } S \text{ to the non-empty node set } M \text{ and } P(M) \text{ is true}\}$. Whereas for a string to be accepted by a conventional finite-state grammar it is enough that it is associated with one permitted path through the graph, a string will be accepted by a vocnet if it is associated with a set of simultaneous paths, each leading from a source node to a target node, these target nodes forming a permitted combination M (i.e., M is not empty and P(M) is true).</Paragraph> <Paragraph position="11"> The vocnet may contain special exit checkers. An exit checker is a dummy zone, consisting of exactly one node connected to itself by an arrow in C' for each atom in A. By using exit checkers, local conditions for zones can be accounted for in the target conditions for the whole vocnet. The exit checkers, in a way, will then freeze the zone exit conditions so that they remain accessible for verification when the whole graph has been passed through.</Paragraph> <Paragraph position="12"> 5. Generation of Vocnets from a List of Words A vocnet for a given vocabulary can be generated algorithmically in the following manner. Words are entered one by one. For each new word unique new nodes are introduced: if the new word is $x_1 x_2 \ldots x_n$, each letter $x_i$ is given the new category $(k_i, k_{i+1})$, where no $k_i$ 
existed before.</Paragraph> <Paragraph position="13"> Clearly, this procedure will create a vocnet which will account for all and only the words given. The set of nodes will typically be much larger than necessary, but it can be reduced, after one word has been entered or after the insertion of several words, by appropriate fusion of nodes; cf. section 8 infra.</Paragraph> </Section> <Section position="5" start_page="306" end_page="307" type="metho"> <SectionTitle> 6. Set Operations on Vocabularies </SectionTitle> <Paragraph position="0"> In the following, it will be assumed that the vocabularies considered consist of strings over the same alphabet A, that none of them includes the empty string, and that the vocnet graphs which we combine have disjoint sets of nodes.</Paragraph> <Section position="1" start_page="306" end_page="306" type="sub_section"> <SectionTitle> 6.1 Complement Formation </SectionTitle> <Paragraph position="0"> Given a vocnet G1 for a language L1, the vocnet G for the complement L is given immediately by replacing P1 by its negation</Paragraph> <Paragraph position="2"> if G1 is complete in the sense that for any string there exists some path beginning in an element of S1. 
If G1 is not complete in this sense, it can be made complete at the expense of adding one more node.</Paragraph> </Section> <Section position="2" start_page="306" end_page="306" type="sub_section"> <SectionTitle> 6.2 Union </SectionTitle> <Paragraph position="0"> In a vocnet $G = \langle U, S, P \rangle$ for the union of L(G1) and L(G2) the vocnet graph U is formed directly through union of the elements of U1 and U2, and P is formed through disjunction:</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="306" end_page="306" type="sub_section"> <SectionTitle> 6.3 Intersection </SectionTitle> <Paragraph position="0"> In a vocnet G for $L(G_1) \cap L(G_2)$, U and S are formed as in the case of union and</Paragraph> <Paragraph position="2"> Thus, one and the same vocnet graph will serve as a component in vocnets defining different languages.</Paragraph> <Paragraph position="3"> 7. String Operations on Vocabularies</Paragraph> </Section> <Section position="4" start_page="306" end_page="307" type="sub_section"> <SectionTitle> 7.1 Concatenation </SectionTitle> <Paragraph position="0"> The concatenation set V of V1 and V2, i.e., the set V of strings consisting of a string in V1, specified by the vocnet G1, concatenated with one in V2, specified by the vocnet G2, is defined by a vocnet G</Paragraph> <Paragraph position="2"> Here N1+ is N1 with the addition of exit checkers: if G1 has the target areas E1, E2, ..., N1+ will contain the exit checkers f1, f2, ...; C1&quot;+ is C1&quot; with the addition of arrows for each atom from each node in Ep to the exit checker fp; C12&quot;(x) is the set of all arrows (i, j) with $i \in T_1$ and $j \in N_2$ for which $(h, j) \in C_2'(x)$ for some $h \in S_2$.</Paragraph> <Paragraph position="3"> Q1(M) is the frozen version of P1(M), with f1, f2, ... substituted for E1, E2, ...</Paragraph> <Paragraph position="4"> The vocnet graphs U1 and U2 have thus been integrated as zones into the new vocnet graph. 
A few exit checkers have been added to permit expressing the restrictions on the passage through the zone U1 as target conditions on the totality of G. Thanks to the use of exit checkers the complexity of the target condition P of G in terms of the number of target areas is not the product of the complexities of P1 and P2 but less than their sum.</Paragraph> <Paragraph position="5"> 7.2. Restricted Iteration and Involution The languages $L(G_1) \cup L(G_1)^2 \cup \ldots \cup L(G_1)^q$ and $L(G_1)^q$ ($q \geq 2$) may be represented as vocnets that are constructed in a similar way as for concatenation, with G1 in the role of G2, but the exit checkers have to be stratified so that we may count the depth d of the concatenation. Therefore C&quot;(x) contains, besides the categories explained in 7.1, all pairs $({}^{d}f_p, {}^{d+1}f_p)$ for $1 \leq d \leq q-1$.</Paragraph> <Paragraph position="6"> The target condition for restricted iteration is</Paragraph> <Paragraph position="8"> and for the q-th power of L(G1)</Paragraph> <Paragraph position="10"> Here, ${}^{d}P_1(M)$ are the frozen stratified target conditions of G1.</Paragraph> </Section> <Section position="5" start_page="307" end_page="307" type="sub_section"> <SectionTitle> 7.3. Decatenation </SectionTitle> <Paragraph position="0"> Given one vocnet G1 (say for words beginning with a prefix) and another vocnet G2 (say for prefixes and prefix sequences), we seek a vocnet G (say for words stripped of their prefixes) such that $\alpha \in L(G)$ iff</Paragraph> <Paragraph position="2"> The following vocnet G will satisfy our</Paragraph> <Paragraph position="4"> where S is the union of all sets $M \subseteq N_1$ for which S1 is connected to M in G1 by some string contained in L(G2).</Paragraph> <Paragraph position="5"> 8. Equatability and Node Fusion Vocnets generated with the incremental algorithm described in section 5 above typically contain more nodes than a minimal vocnet for the same language. 
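To make this redundancy concrete, the incremental algorithm of section 5 can be sketched in Python. This is only an illustration under stated assumptions, not the authors' program: the names build_vocnet and accepts are hypothetical, all arrows are treated as intrazone arrows, the first node of each word is taken as a source node, and the target condition is assumed to have the simple form that the reached node set must overlap the set of final nodes.

```python
from collections import defaultdict


def build_vocnet(words):
    """Enter words one by one, giving each letter a fresh pair of nodes."""
    arrows = defaultdict(set)  # atom -> set of categories (k_i, k_i+1)
    sources, targets = set(), set()
    next_node = 0
    for word in words:
        first = next_node
        for offset, atom in enumerate(word):
            arrows[atom].add((first + offset, first + offset + 1))
        sources.add(first)               # assumption: k_1 is a source node
        targets.add(first + len(word))   # assumption: the last node is a target
        next_node = first + len(word) + 1
    return arrows, sources, targets


def accepts(vocnet, word):
    """Assumed target condition: the reached node set overlaps `targets`."""
    arrows, sources, targets = vocnet
    nodes = set(sources)
    for atom in word:
        nodes = {j for (i, j) in arrows.get(atom, set()) if i in nodes}
    return bool(nodes.intersection(targets))


net = build_vocnet(["cat", "car"])
print(accepts(net, "cat"), accepts(net, "car"), accepts(net, "ca"))
```

For the two words cat and car this sketch allocates eight nodes, although the shared beginning ca could be represented once; this is the kind of redundancy that node fusion is meant to remove.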
Similarly, vocnets derived from other vocnets tend to be highly redundant.</Paragraph> <Paragraph position="6"> Compacting of a given vocnet can be algorithmically performed as follows.</Paragraph> <Paragraph position="7"> We shall say that nodes in a vocnet G are equatable if they can be identified without affecting the language defined by G.</Paragraph> <Paragraph position="8"> The following definitions permit us to find pairs of equatable nodes.</Paragraph> <Paragraph position="9"> We first define some equivalence relations between nodes.</Paragraph> <Paragraph position="10"> The nodes i and j are precedence equivalent in a vocnet graph U iff for all k and x</Paragraph> <Paragraph position="12"> The nodes i and j are succession equivalent in a vocnet graph U iff for all k and x</Paragraph> <Paragraph position="14"> The nodes i and j are source equivalent in a vocnet G iff $i \in S \Leftrightarrow j \in S$. The nodes i and j are target equivalent in a vocnet G iff for any subset M of N, $P(M \cup \{i\}) \Leftrightarrow P(M \cup \{j\})$.</Paragraph> <Paragraph position="15"> Now the nodes i and j are left equivalent in a vocnet G iff they are precedence and source equivalent. They are right equivalent in a vocnet G iff they are succession and target equivalent. They are equatable if, but not necessarily only if, they are left or right equivalent.</Paragraph> <Paragraph position="16"> By successive fusion of pairwise equatable nodes, vocnets can be compacted, often drastically. It should be noted, however, that equatability is not an equivalence relation and that reduction of a given vocnet graph does not yield a unique result but depends on the choice of node pairs to identify in each step of the procedure.</Paragraph> </Section> </Section> <Section position="6" start_page="307" end_page="307" type="metho"> <SectionTitle> 9. 
Parasites </SectionTitle> <Paragraph position="0"> By parasites of a language L we shall mean strings which are neither members of L nor substrings of members of L.</Paragraph> <Paragraph position="1"> Clearly, if in the vocnet G the set $C(\alpha)$ is empty, $\alpha$ is a parasite of L(G): $\alpha$ is not a member, nor will it become a member whatever is appended at either end.</Paragraph> <Paragraph position="2"> We shall say a node i in a vocnet G is genuine if there is some string $\alpha$ associated with a path from a source node in G via i to a node in some M, such that $\alpha$ connects S to M and P(M) is true.</Paragraph> <Paragraph position="3"> If all nodes in a vocnet are genuine, a string $\alpha$ is a parasite iff $C(\alpha)$ is empty. The vocnet will then offer us an associative calculus for recognizing parasites (and strings which constitute the beginning of a word or the end of a word).</Paragraph> <Paragraph position="4"> A node i is ingenuine if no path leads from nodes in S to i or from i to nodes of T. If P(M) has the simple form that M must overlap with some given target set, a node i is ingenuine only if the preceding condition is fulfilled.</Paragraph> <Paragraph position="5"> 10. Node Elimination Ingenuine nodes can be removed from the graph U without affecting the language accepted by $G = \langle U, S, P \rangle$.</Paragraph> <Paragraph position="6"> Successive elimination of ingenuine nodes and fusion of equatable nodes may lead to considerable compression and simplification of a given vocnet. It should be observed that the final, irreducible result of such compression is not independent of the choice at each stage of what reduction operation to perform.</Paragraph> </Section> </Paper>