<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0812"> <Title>SDL--A Description Language for Building NLP Systems</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Motivation &amp; Idea </SectionTitle> <Paragraph position="0"> The shallow text processing system SProUT (Becker et al., 2002), developed at DFKI, is a complex platform for the development and processing of multilingual resources. SProUT arranges processing components (e.g., tokenizer, gazetteer, named entity recognition) in a strictly sequential fashion, as is known from standard cascaded finite-state devices (Abney, 1996).</Paragraph> <Paragraph position="1"> In order to connect such (independently developed) NL components, one must look at the application programmer interface (API) of each module, hoping that there are API methods that allow one, e.g., to call a module with a specific input, to obtain the result value, etc. In the best case, API methods from different modules can be used directly without much programming overhead. In the worst case, however, there is no API available, meaning that we have to inspect the program code of a module and write additional code to realize interfaces between modules (e.g., data transformation). Even more demanding, recent hybrid NLP systems such as WHITEBOARD (Crysmann et al., 2002) implement more complex interactions and loops, instead of using a simple pipeline of modules.</Paragraph> <Paragraph position="2"> We have overcome this inflexible behavior by implementing the following idea. Since we use typed feature structures (Carpenter, 1992) in SProUT as the sole data interchange format between processing modules, the construction of a new system can be reduced to the interpretation of a regular expression of modules. 
Because the sign for concatenation cannot be found on a keyboard, we have given the three characters +, |, and * the following meaning: • sequence or concatenation: m1 + m2 expresses the fact that (1) the input to m1 + m2 is the input given to m1, (2) the output of module m1 serves as the input to m2, and (3) the final output of m1 + m2 is equal to the output of m2.</Paragraph> <Paragraph position="3"> This is the usual flow of information in a sequential cascaded shallow NL architecture.</Paragraph> <Paragraph position="4"> • concurrency or parallelism: | denotes a quasi-parallel computation of independent modules, where the final output of each module serves as the input to a subsequent module (perhaps grouped in a structured object, as we do by default). This operator has far-reaching potential. We envisage, e.g., the parallel computation of several morphological analyzers with different coverage or the parallel execution of a shallow topological parser and a deep HPSG parser (as in WHITEBOARD). In a programming language such as Java, the execution of modules can even be realized by independently running threads.</Paragraph> <Paragraph position="5"> • unrestricted iteration or fixpoint computation: m* has the following interpretation. Module m feeds its output back into itself until no more changes occur, thus implementing a kind of fixpoint computation (Davey and Priestley, 1990). Such a fixpoint might not be reached in finite time, i.e., the computation need not stop. A possible application was envisaged in (Braun, 1999), where an iterative application of a base clause module was necessary to model recursive embedding of subordinate clauses in a system for parsing German clause sentential structures. 
Notice that unrestricted iteration would even allow us to simulate an all-paths context-free parsing behavior, since such a feedback loop can in principle simulate an unbounded number of cascade stages in a finite-state device (each level of a CF parse tree is constructed by a single cascade stage).</Paragraph> <Paragraph position="6"> We have defined a Java interface whose methods must be implemented by every module that is to be incorporated into a new system. Implementing such an interface means that a module must provide an implementation for all methods specified in the interface with exactly the same method name and method signature, e.g., setInput(), clear(), or run(). To ease this implementation, we have also implemented an abstract Java class that provides a default implementation for all these methods with the exception of run(), the method which starts the computation of the module and delivers the final result.</Paragraph> <Paragraph position="7"> The interesting point now is that a new system, declaratively specified by means of the above apparatus, can be automatically compiled into a single Java class. The newly generated Java class itself implements the above interface of methods. This Java code can then be compiled by the Java compiler into a running program, realizing exactly the intended behavior of the original system specification. 
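The interface, the abstract class, and a compiled composite as described above can be pictured with the following minimal Java sketch. It is an illustration under assumptions, not the original SDL code: the method names setInput(), clear(), and run() come from the text and the name IModule is borrowed from section 5, whereas the Object-typed signatures, the class names ModuleDefaults, Trim, Upper, and Sequence, and the helper seq() are hypothetical.

```java
// Sketch only: the signatures below are assumptions, not the original SDL API.
interface IModule {
    Object setInput(Object input);
    Object getInput();
    Object setOutput(Object output);
    Object getOutput();
    void clear();
    Object run(Object input);
}

// Default implementation of everything except run(), in the spirit of the abstract class.
abstract class ModuleDefaults implements IModule {
    private Object input;
    private Object output;
    public Object setInput(Object input) { this.input = input; return input; }
    public Object getInput() { return input; }
    public Object setOutput(Object output) { this.output = output; return output; }
    public Object getOutput() { return output; }
    public void clear() { input = null; output = null; }
}

// Two toy base modules, invented for this illustration.
class Trim extends ModuleDefaults {
    public Object run(Object input) { return ((String) input).trim(); }
}

class Upper extends ModuleDefaults {
    public Object run(Object input) { return ((String) input).toUpperCase(); }
}

// What a compiled m1 + m2 could look like: the composite is itself a module.
class Sequence extends ModuleDefaults {
    private final IModule m1;
    private final IModule m2;

    Sequence(IModule m1, IModule m2) { this.m1 = m1; this.m2 = m2; }

    // Assumed default sequence mediator: hand the output of the first module on unchanged.
    private Object seq(IModule first, IModule second) { return first.getOutput(); }

    public Object run(Object input) {
        m1.clear();
        m1.setInput(input);
        m1.setOutput(m1.run(input));
        m2.clear();
        Object mediated = seq(m1, m2);
        m2.setInput(mediated);
        return m2.setOutput(m2.run(mediated));
    }
}

class SequenceDemo {
    public static void main(String[] args) {
        IModule system = new Sequence(new Trim(), new Upper());
        System.out.println(system.run("  sdl  ")); // prints SDL
    }
}
```

Because Sequence is itself a module in this sketch, it can be nested inside further composites, which is the property that lets a whole system description end up as one class.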
The execution semantics for an arbitrary module m is defined to be always the execution of the run() method of m, written in Java as m.run(). Due to space limitations, we can only outline the basic idea and present a simplified version of the compiled code for a sequence of two module instances m1 + m2, for the independent concurrent computation m1 | m2, and for the unbounded iteration of a single module instance m*.</Paragraph> <Paragraph position="8"> Note that we use the typewriter font when referring to the concrete syntax or the implementation, but use italics to denote the abstract syntax.</Paragraph> <Paragraph position="10"> The pseudo code above contains three methods, seq(), par(), and fix(), which mediate between the output of one module and the input of a succeeding module. Clearly, such functionality should not be squeezed into independently developed modules, since otherwise a module m would need a notion of a fixpoint during the execution of m* or would have to be sensitive to the output of every other module, e.g., during the processing of (m1 | m2) + m. Note that the mediators take modules as input and so have access to their internal information via the public methods specified in the module interface (the API).</Paragraph> <Paragraph position="11"> The default implementation for seq is of course the identity function (speaking in terms of functional composition). par wraps the two results in a structured object (default implementation: a Java array). fix() implements a fixpoint computation (see section 5.3 for the Java code). These mediators can be made specific to special module-module combinations and are an implementation of the mediator design pattern, which loosely couples independent modules by encapsulating their interaction in a new object (Gamma et al., 1995, p. 273). 
That is, the mediators do not modify the original modules and only have read access to input and output via getInput() and getOutput().</Paragraph> <Paragraph position="12"> In the following, we present a graphical representation for displaying module combination. Given such pictures, it is easy to see where the mediators come into play. Depicting a sequence of two modules is, at first sight, not hard.</Paragraph> <Paragraph position="14"> Now, if the input format of m2 is not compatible with the output of m1, must we change the program code of m2? Even more seriously, if we had another expression m3 + m2, must m2 also be sensitive to the output format of m3? In order to avoid these and other cases, we decouple module interaction and introduce a special mediator method for the sequence operator (seq in the above code), depicted by '.</Paragraph> <Paragraph position="16"> ' connects two modules. This fact is reflected by making seq a binary method which takes m1 and m2 as input parameters (see example code).</Paragraph> <Paragraph position="17"> Let us now move to the parallel execution of several modules (not necessarily two, as in the above example).</Paragraph> <Paragraph position="19"> There is one problem here. What happens to the output of each module when the lines come together, meeting in the outgoing arrow? The next section has a few words on this and presents a solution. We only note here that there exists a mediator method par, which, by default, groups the output in a structured object. Since par does not know the number of modules in advance, it takes as its parameter an array of modules. Note further that the input arrows are fine: every module gets the same data.</Paragraph> <Paragraph position="20"> Hence, we have the following modified picture.</Paragraph> <Paragraph position="22"> Now comes the * operator. As we already said, the module feeds itself with its own output until a fixpoint has been reached, i.e., until input equals output. 
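The feedback loop just described can be sketched in a few lines of Java. This is a minimal illustration under assumptions, not the paper's fix() mediator: the interface Step, the class FixpointSketch, and the toy step are hypothetical, and the maxSteps bound is an added safeguard, whereas the operator in the text is unbounded.

```java
// Hypothetical stand-in for a module: one run step on strings.
interface Step {
    String run(String input);
}

class FixpointSketch {
    // Feeds the output of step back into step until input equals output.
    // maxSteps is an assumption added here so that the sketch always terminates.
    static String fix(Step step, String input, int maxSteps) {
        String current = input;
        for (int i = 0; i != maxSteps; i++) {
            String next = step.run(current);
            if (next.equals(current)) {
                return current; // fixpoint reached: no more changes occur
            }
            current = next;
        }
        throw new IllegalStateException("no fixpoint within " + maxSteps + " steps");
    }

    public static void main(String[] args) {
        // Toy step: replace the first pair of adjacent blanks by a single blank.
        Step squeeze = new Step() {
            public String run(String s) { return s.replaceFirst("  ", " "); }
        };
        System.out.println(fix(squeeze, "a    b", 10)); // prints a b
    }
}
```

Note that the loop itself, not the step, carries the knowledge about fixpoints, which mirrors the division of labour between mediator and module described in the text.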
Instead of writing m we make the mediator method for * explicit, since it embodies the knowledge about fixpoints (and not the module):</Paragraph> <Paragraph position="24"/> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Syntax </SectionTitle> <Paragraph position="0"> A new system is built from an initial set of already existing modules M with the help of the three operators +, |, and *. The set of all syntactically well-formed module descriptions D in SDL is inductively defined as follows: • $m \in M \Rightarrow m \in D$ • $m_1, m_2 \in D \Rightarrow m_1 + m_2 \in D$ • $m_1, \ldots, m_k \in D \Rightarrow (|\ m_1 \ldots m_k) \in D$ • $m \in D \Rightarrow m^* \in D$ Examples in the concrete syntax are written using the typewriter font, e.g., module. All operators have the same priority. Succeeding modules are written from left to right, using infix notation, e.g., m1 + m2. Modules executed in parallel must be put in parentheses with the | operator first, for instance (| m1 m2). Note that we use the prefix notation for the concurrency operator | to allow for an arbitrary number of arguments, e.g., (| m1 m2 m3). This technique furthermore circumvents notorious grouping ambiguities which might lead to different results when executing the modules. Notice that since | need be neither commutative nor associative, the result of (| m1 m2 m3) might differ from (| m1 (| m2 m3)), from (| (| m1 m2) m3), or even from (| m2 (| m1 m3)), etc. Whether | is commutative or associative is determined by the implementation of the concurrency mediator par. Let us give an example. Assume, for instance, that m1, m2, and m3 return typed feature structures and that par() joins the results by using unification. 
In this case, | is clearly commutative and associative, since unification is commutative and associative (and idempotent).</Paragraph> <Paragraph position="1"> Finally, the unrestricted self-application of a module is expressed in the concrete syntax by using the module name, prefixed by the asterisk sign, and grouped using parentheses, e.g., (* module). module here might represent a single module or a complex expression (which itself must be put in parentheses).</Paragraph> <Paragraph position="2"> Making | and * prefix operators (in contrast to +) eases the syntactic analysis of an SDL expression. The EBNF for a complete system description (system) is given in figure 1. A concrete running example is shown in figure 2.</Paragraph> <Paragraph position="3"> The example system from figure 2 should be read as follows: define a new module de.dfki.lt.test.System as (| rnd1 rnd2 rnd3) + inc1 + ...; variables rnd1, rnd2, and rnd3 refer to instances of module de.dfki.lt.sdl.test.Randomize; module Randomize belongs to package de.dfki.lt.sdl.test; the value of rnd1 should be initialized with (&quot;foo&quot;, &quot;bar&quot;, &quot;baz&quot;); etc. Individual lines must be separated by the newline character.</Paragraph> <Paragraph position="4"> The use of variables (instead of directly using module names, i.e., Java classes) has one important advantage: variables can be reused (viz., rnd2 and rnd3 in the example), meaning that the same instances are used at several places throughout the system description, instead of using several instances of the same module (which, of course, can also be achieved; cf. rnd1, rnd2, and rnd3, which are instances of module Randomize). 
Notice that the value of a variable cannot be redefined during the course of a system description.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Modules as Functions </SectionTitle> <Paragraph position="0"> Before we start the description of the implementation in the next section, we will argue that a system description can be given a precise formal semantics, assuming that the initial modules, which we call base modules, are well defined. First of all, we only need some basic mathematical knowledge from secondary school, viz., the concept of a function.</Paragraph> <Paragraph position="1"> A function $f$ (sometimes called a mapping) from $S$ to $T$, written as $f : S \to T$, can be seen as a special kind of relation, where the domain of $f$ is $S$ (written as $\mathrm{DOM}(f) = S$), and for each element in the domain of $f$, there is at most one element in the range (or codomain) $\mathrm{RNG}(f)$. If there always exists an element in the range, we say that $f$ is a total function (or well defined) and write $f\downarrow$. Otherwise, $f$ is said to be a partial function, and for an $s \in S$ for which $f$ is not defined, we then write $f(s)\uparrow$. Since $S$ itself might consist of ordered $n$-tuples and thus is the Cartesian product of $S_1, \ldots, S_n$, depicted as $\times_{i=1}^{n} S_i$, we use the vector notation and write $f(\vec{s})$ instead of $f(s)$. The $n$-fold functional composition of $f : S \to S$ is written $f^n$, where $f^{n+1} := f \circ f^n$ and</Paragraph> <Paragraph position="3"> $f^0$ is the identity function in $S$.</Paragraph> <Paragraph position="4"> Assuming that m is a module for which a proper run() method has been defined, we will, from now on, refer to the function m as abbreviating m.run(), the execution of method run() from module m. Hence, we define the execution semantics of m to be equivalent to m.run().</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Sequence </SectionTitle> <Paragraph position="0"> Let us start with the sequence m1 + m2 of two modules, regarded as two functions $m_1 : S_1 \to T_1$ and</Paragraph> <Paragraph position="2"/> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Parallelism </SectionTitle> <Paragraph position="0"> We now come to the parallel execution of $k$ modules $m_i : S_i \to T_i$ ($1 \le i \le k$), operating on the same input. [Figure 1 (excerpt): EBNF of an SDL system description. system → definition {command}* variables; definition → module &quot;=&quot; regexpr newline; module → a fully qualified Java class name; regexpr → var | &quot;(&quot; regexpr &quot;)&quot; | regexpr &quot;+&quot; regexpr | &quot;(&quot; &quot;|&quot; {regexpr}+ &quot;)&quot; | &quot;(&quot; &quot;*&quot; regexpr &quot;)&quot;]</Paragraph> <Paragraph position="2"> As already said, the default mediator for | returns an ordered sequence of the results of $m_1, \ldots, m_k$, hence is similar to the Cartesian product $\times$:</Paragraph> <Paragraph position="4"> defined and the domain of each module is a superset of the domain of the new composite module:</Paragraph> <Paragraph position="6"/> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Iteration </SectionTitle> <Paragraph position="0"> A proper definition of unrestricted iteration, however, deserves more attention and a bit more work. Since a module m feeds its output back into itself, it is clear that the iteration $(m^*)(\vec{s})$ need not terminate. I.e., the question whether $m^*\downarrow$ holds is undecidable in general. Obviously, a necessary condition for $m^*\downarrow$ is that $S \supseteq T$, and so if $m : S \to T$ and $m\downarrow$ holds, we have $m^* : S \to S$.</Paragraph> <Paragraph position="1"> Since m is usually not a monotonic function, m need not have a least and a greatest fixpoint. Of course, m might not possess any fixpoint at all.</Paragraph> <Paragraph position="2"> Within our very practical context, we are interested in finitely-reachable fixpoints. 
From the above remarks, it is clear that given $\vec{s} \in S$, $(m^*)(\vec{s})$ terminates in finite time iff no more changes occur during the iteration process,</Paragraph> <Paragraph position="4"> We can formalize the meaning of * with the help of Kleene's $\mu$ operator, known from recursive function theory (Hermes, 1978). $\mu$ is a functional and so, given a function $f$ as its input, returns a new function $\mu(f)$, the unbounded minimization of $f$. $\mu$ was originally employed to precisely define (partial) recursive functions of natural numbers; we need a slight generalization, so that we can apply $\mu$ to functions not necessarily operating on natural numbers.</Paragraph> <Paragraph position="5"> Let $f : \mathbb{N}^{k+1} \to \mathbb{N}$ ($k \in \mathbb{N}$). $\mu(f) : \mathbb{N}^k \to \mathbb{N}$ is given by</Paragraph> <Paragraph position="7"> I.e., $\mu(f)(\vec{x})$ returns the least $n$ for which $f(\vec{x}, n) = 0$.</Paragraph> <Paragraph position="8"> Such an $n$, of course, need not exist.</Paragraph> <Paragraph position="9"> We now move from the natural numbers $\mathbb{N}$ to an arbitrary (structured) set $S$ with equality relation $=_S$. The task of $\mu$ here is to return the number of iteration steps $n$ for which a self-application of module m no longer changes the output, when applied to the original input $\vec{s} \in S$. And so, we have the following definitional equation for the meaning of $m^*$:</Paragraph> <Paragraph position="11"> Obviously, the number of iteration steps needed to obtain a fixpoint is given by $\mu(m)(\vec{s})$, where $\mu : (S \to S) \to \mathbb{N}$. Given m, we define $\mu(m)$ as</Paragraph> <Paragraph position="13"> Compare this definition with the original $\mu(f)(\vec{x})$ on natural numbers above. Testing for zero is replaced here by testing for equality in $S$. This last definition completes the semantics for $m^*$.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Incorporating Mediators </SectionTitle> <Paragraph position="0"> The above formalization does not include the use of mediators. 
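As a reading aid, the mediator-free meaning of unrestricted iteration from section 4.3 can be written out as follows. This is a sketch based on the surrounding prose, not a quotation of the original equations: the min-based formulation and the worked example with a hypothetical halving module are assumptions.

```latex
% Unbounded minimization on natural numbers (least n with f(x, n) = 0; undefined if no such n exists):
\mu(f)(\vec{x}) := \min \{\, n \in \mathbb{N} \mid f(\vec{x}, n) = 0 \,\}

% Assumed generalization to a set S with equality =_S (testing for zero becomes testing for equality):
\mu(m)(\vec{s}) := \min \{\, n \in \mathbb{N} \mid m^{n}(\vec{s}) =_S m^{n+1}(\vec{s}) \,\}

% Meaning of unrestricted iteration as the \mu(m)(\vec{s})-fold composition of m:
[\![ m^{*} ]\!](\vec{s}) := m^{\mu(m)(\vec{s})}(\vec{s})

% Worked example with the hypothetical module m(x) = \lfloor x / 2 \rfloor on \mathbb{N} and input 5:
m^{0}(5) = 5, \quad m^{1}(5) = 2, \quad m^{2}(5) = 1, \quad m^{3}(5) = 0, \quad m^{4}(5) = 0
\mu(m)(5) = 3, \qquad [\![ m^{*} ]\!](5) = m^{3}(5) = 0
```

If no iterate ever repeats, the set under the minimum is empty and the iteration is undefined for that input, which corresponds to the non-terminating case discussed in the text.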
The effects the mediators have on the input/output of modules are an integral part of the definition for the meaning of m1 + m2, (| m1 ... mk), and m*. In case we explicitly want to represent (the default implementation of) the mediators in the above definitions, we must, first of all, clarify their status.</Paragraph> <Paragraph position="1"> Let us focus, for instance, on the mediator for the sequence operator +. We already said that the mediator for + uses the output of m1 to feed m2, and thus can be seen as the identity function id, speaking in terms of functional composition. Hence, we might redefine $[\![(m_1 + m_2)]\!](\vec{s})$ as</Paragraph> <Paragraph position="3"> If so, mediators would be functions and would have the same status as modules. Clearly, they pragmatically differ from modules in that they coordinate the interaction between independent modules (remember the mediator metaphor). However, we have also said that the mediator methods take modules as input. When adopting this view, a mediator is different from a module: it is a functional (as is $\mu$), taking functions as arguments (the modules) and returning a function. Now, let $\mathcal{S}$ be the mediator</Paragraph> <Paragraph position="5"> is the case in the default implementation for +. This view, in fact, precisely corresponds to the implementation.</Paragraph> <Paragraph position="6"> Let us quickly make the two other definitions reflect this new view and let $\mathcal{P}$ and $\mathcal{F}$ be the functionals for | and *, resp. For |, we now have $[\![(|\ m_1 \ldots m_k)]\!](\vec{s}) := (\mathcal{P}(m_1, \ldots, m_k) \circ (\times_{i=1}^{k} m_i))(\vec{s}^{\,k})$. Here, $(\times_{i=1}^{k} m_i)(\vec{s}^{\,k})$ denotes the ordered sequence $\langle m_1(\vec{s}), \ldots, m_k(\vec{s}) \rangle$ to which function $\mathcal{P}(m_1, \ldots, m_k)$ is applied. At the moment,</Paragraph> <Paragraph position="8"> i.e., the identity function is applied to the result of each $m_i(\vec{s})$, and so in the end, we still obtain $\langle m_1(\vec{s}), \ldots, m_k(\vec{s}) \rangle$.</Paragraph> <Paragraph position="9"> The adaptation of $m^*$ is also not hard: $\mathcal{F}$ is exactly the $\mu(m)(\vec{x})$-fold composition of m, given value $\vec{x}$. 
Since $\vec{x}$ are free variables, we use Church's lambda abstraction (Barendregt, 1984), make them bound, and write</Paragraph> <Paragraph position="11"> It is clear that the above set of definitions is still not complete, since it does not cover the cases where a module m consists of several submodules, as the syntax of SDL clearly admits. This leads us to the final four inductive definitions which conclude this section:</Paragraph> <Paragraph position="13"> Recall that the execution semantics of $m(\vec{s})$ has not changed after all and is still m.run(s), where s abbreviates the Java notation for the $k$-tuple $\vec{s}$.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Interfaces </SectionTitle> <Paragraph position="0"> This section gives a short sketch of the API methods which every module must implement and presents the default implementation of the mediator methods.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Module Interface IModule </SectionTitle> <Paragraph position="0"> The following seven methods must be implemented by a module which is to contribute to a new system. The next subsection provides a default implementation for six of them. The exception is the one-argument method run(), which is assumed to execute a module.</Paragraph> <Paragraph position="1"> • clear() clears the internal state of the module it is applied to. clear() is useful when a module instance is reused during the execution of a system. clear() might throw a ModuleClearError in case something goes wrong during the clearing phase.</Paragraph> <Paragraph position="2"> • init() initializes a given module by providing an array of init strings. init() might throw a ModuleInitError.</Paragraph> <Paragraph position="3"> • run() starts the execution of the module to which it belongs and returns the result of this computation. An implementation of run() might throw a ModuleRunError. 
Note that run() should store neither the input nor the output of the computation. This is supposed to be done independently by using setInput() and setOutput() (see below).</Paragraph> <Paragraph position="4"> • setInput() stores the value of parameter input and returns this value.</Paragraph> <Paragraph position="5"> • getInput() returns the input originally given to setInput().</Paragraph> <Paragraph position="6"> • setOutput() stores the value of parameter output and returns this value.</Paragraph> <Paragraph position="7"> • getOutput() returns the output originally given to setOutput().</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Module Methods </SectionTitle> <Paragraph position="0"> Six of the seven module methods are provided by a default implementation in class Modules, which implements interface IModule (see above). New modules are advised to inherit from Modules, so that only run() must actually be specified. Input and output of a module are stored in the two additional private instance fields input and output.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Mediator Methods </SectionTitle> <Paragraph position="0"> The public class Mediators provides a default implementation for the three mediator methods specified in interface IMediator. It is worth noting that although fix() returns the fixpoint, it relocates its computation into an auxiliary method fixpoint() (see below), due to the fact that mediators are not allowed to change the internal state of a module. 
Thus, the input field still contains the original input, whereas the output field finally refers to the fixpoint.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Compiler </SectionTitle> <Paragraph position="0"> In section 2, we have already seen how basic expressions are compiled into a sequence of instructions, consisting of API methods from the module and mediator interface.</Paragraph> <Paragraph position="1"> Here, we would like to glance at the compilation of more complex SDL expressions.</Paragraph> <Paragraph position="2"> First of all, we note that complex expressions are decomposed into flat basic expressions which are not further structured. Each subexpression is associated with a new module variable, and these variables are inserted into the original system description, which then becomes flat as well. In case of the example from figure 2, we have the following subexpressions together with their variables (we prefix every variable by the dollar sign): $1</Paragraph> <Paragraph position="4"> the original system description reduces to $1 + $inc1 + $inc2 + $2 + $4 and thus is normalized as $1, ..., $4 are. The SDL compiler then introduces so-called local or inner Java classes for such subexpressions and locates them in the same package to which the newly defined system belongs. Clearly, each new inner class must also fulfill the module interface IModule (see section 5), and the SDL compiler produces the corresponding Java code, similar to the default implementation in class Modules (section 5), together with the right constructors for the inner classes.</Paragraph> <Paragraph position="5"> For each base module and each newly introduced inner class, the compiler generates a private instance field (e.g., private Randomize $rnd1) and a new instance (e.g., this.$rnd1 = new Randomize()) to which the API methods can be applied. 
Each occurrence of the operators +, |, and * corresponds to the execution of the mediator methods seq, par, and fix (see below).</Paragraph> <Paragraph position="6"> Local variables (prefixed by the low line character, i.e., the underscore) are also introduced for the individual run() methods (_15, ..., _23 below). These variables are introduced by the SDL compiler to serve as handles (or anchors) to already evaluated subexpressions, helping to establish a proper flow of control during the recursive compilation process. We finish this paper by presenting the generated code for the run() method for system System from figure 2.</Paragraph> <Paragraph position="7"> We always generate a new mediator object (_med) for each local class in order to make the parallel execution of modules thread-safe. Note that in the above code, the mediator method seq() is applied four times due to the fact that + occurs four times in the original specification. The full code generated by the SDL compiler for the example from figure 2 can be found under http://www.dfki.de/~krieger/public/. The directory also contains the Java code of the involved modules, plus the default implementation of the mediator and module methods. At the workshop, we hope to further report on the combination of WHAT (Schäfer, 2003), an XSLT-based annotation transformer, with SDL.</Paragraph> </Section> </Paper>