<?xml version="1.0" standalone="yes"?> <Paper uid="J95-4003"> <Title>Modularity and Information Content Classes in Principle-based Parsing</Title> <Section position="3" start_page="518" end_page="518" type="metho"> <SectionTitle> 2. The Proposal </SectionTitle> <Paragraph position="0"> Two avenues have generally been pursued to build efficient GB parsers. In one case, a &quot;covering grammar&quot; is compiled, which overgenerates and is then filtered by constraints. The compilation is done in such a way that the overgeneration is wellbehaved. For instance, the correct distribution of empty categories is calculated off-line (Dorr 1993). In the other case, all the principles are applied on line, but they apply only to a portion of the tree, and are therefore restricted to a local computation (Frank 1992). 4 My proposal combines these two approaches: it adopts the idea of compiling the grammar, at least partially, off-line but it attempts to find a principled way of doing so. In this, I differ from Dorr, where the amount of compilation is heuristic and based on practical experimentation. The approach shares Frank's intuition that linguistic principles have a form, which can be exploited in structuring the parser.</Paragraph> <Paragraph position="1"> This proposal is based on two observations. First, each principle of linguistic theory has a canonical form, and second, primitives of linguistic theories can be partitioned into classes, based on their content.</Paragraph> <Paragraph position="2"> As an illustration of the first observation, we can look at the principle that regulates the distribution of the empty categories in the phrase marker, the Empty Category Principle (ECP), as stated below (adapted from Rizzi 1990, 25).</Paragraph> <Paragraph position="3"> (2) .</Paragraph> <Paragraph position="4"> The Empty Category Principle An empty category x is licensed if the 3 following conditions are satisfied: x is in the domain of a head H</Paragraph> </Section> <Section position="4" start_page="518" end_page="521" type="metho"> <SectionTitle> 3 For CF parsers, just how much compilation speeds up the parser is defined precisely by the analysis of </SectionTitle> <Paragraph position="0"> the algorithm. No such precise analysis is available for principle-based algorithms.</Paragraph> <Paragraph position="1"> 4 Frank (1992) presents a parsing model that is claimed not to allow any compilation of the linguistic theory, and to operate in linear time. Two objections can be raised to these claims: first, the use of TAG elementary trees to restrict the working space of the parser amounts to a precompilation of phrase-structure and locality constraints, so that locality is not computed in the course of the parse, but basically done as template matching. Second, in the measure of complexity, Frank does not count the cost of choosing which elementary tree to unadjoin or unsubstitute, or the cost of backtracking if the wrong decision is made. There are indeed cases where, in order to perform the correct operation, more than one elementary tree must be spanned. It is not clear that linear time complexity can actually be claimed if all factors are taken into account. 
For a more detailed discussion, see Merlo 1992, to appear.</Paragraph> <Paragraph position="2"> It can be observed that this principle has an internal structure and can be decomposed into separate pieces of information: (2.1) imposes a condition on configurations, namely, a condition on the shape of the tree; (2.2) imposes a condition on the labelling of the nodes in the tree; and (2.3) imposes a locality condition, as it defines the subtree available for the computation of the principle. These three conditions are independent.</Paragraph> <Paragraph position="3"> For instance, the configuration does not depend on the categorial labelling of the head node. The precompilation of these conditions would require computing all the possible combinations, without any reduction of the space of analysis.</Paragraph> <Paragraph position="4"> The second observation is based on a detailed inspection of the form of the principles of the grammar. What is presented in (2) as an illustrative example is, in fact, a consistent form of organization of the principles. If one looks at several of the principles of the grammar that are involved in building structure and annotating the phrase marker, one finds the same internal organization.</Paragraph> <Paragraph position="5"> Theta-assignment occurs in the configuration of sisterhood, it requires a θ-assigning head, and it must occur between a node and its most local assigner. Assignment of Case occurs in a given configuration (according to Chomsky (1988, 1992) it is always a specifier-head configuration), given a certain lexical property of the head ([-N]), and locally (within the same maximal projection). The same restriction occurs again for what is called the wh-criterion (Rizzi 1991), which regulates wh-movement, where the head must have a +wh feature and occur within a specifier-head configuration. Categorial selection and functional selection also occur under the same restrictions, in the complement configuration (i.e., between a head and a maximal projection). The licensing of subjects in the phrase marker, done by predication, must occur in the specifier-head configuration. The licensing of the empty category pro also requires the inflectional head of the sentence to bear the feature Strong Agr, and it occurs in the specifier-head configuration. The assignment of the feature [+barrier] depends on L-marking, which in turn requires that the head is lexical, and that marking occurs in the complement configuration.</Paragraph> <Paragraph position="6"> Thus, each different "factor" that composes a principle can be considered a separate primitive, and such primitives can be grouped into classes defined according to their content. Linguistic information can be classified into five different classes:
(3)
a. Configurations: sisterhood, c-command, m-command, ±maximal
b. Lexical features: ±N, ±V, ±Funct, ±c-selected
c. Syntactic features: ±Case, ±θ, ±γ, ±barrier, ±Strong Agr
d. Locality information: minimality, antecedent government
e. Referential information: ±anaphor, ±pronominal, indices
This qualitative classification forms a partitioning into natural classes based on information content. I call these IC Classes.5</Paragraph> <Paragraph position="7"> 5 Differently from Crocker (1992, to appear) and Frazier (1985), this partitioning does not rely on the particular representation used. The spirit of the hypothesis is that linguistic theory is formed by heterogeneous types of information, and that the representation used to describe them is a derived concept. Frazier (1990) proposes an evolutionary partitioning of the parser based on tasks. This perspective is not in opposition to the current proposal, as the specialization of the parser in different tasks is likely to be an adaptive reaction to the different types of inputs.</Paragraph>
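To make the decomposition in (2)-(3) concrete, here is a minimal sketch (mine, not the paper's; the Node attributes are illustrative assumptions) of a principle expressed as three independent predicates, one per IC class:

# Illustrative sketch only: the ECP of (2) decomposed by IC class, as in (3).
# The Node feature bundle is hypothetical, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Node:
    governing_head_cat: str   # category of the closest licensing head
    in_head_domain: bool      # is the node in the domain of that head?
    barrier_intervenes: bool  # does a barrier or closer head intervene?

def ecp_configuration(n: Node) -> bool:  # (2.1): condition on tree shape
    return n.in_head_domain

def ecp_labelling(n: Node) -> bool:      # (2.2): condition on node labels
    return n.governing_head_cat in {"A", "Agr", "N", "P", "T", "V"}

def ecp_locality(n: Node) -> bool:       # (2.3): condition on the visible subtree
    return not n.barrier_intervenes

def ecp_licenses(n: Node) -> bool:
    # The three conditions are independent, so each can be checked, stored,
    # or precompiled separately -- the point of the IC-class partitioning.
    return ecp_configuration(n) and ecp_labelling(n) and ecp_locality(n)

print(ecp_licenses(Node("V", True, False)))  # True: a licensed empty category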
<Paragraph position="6"> It can then be hypothesized that the amount of compilation (or, conversely, the modularity of the parser) is captured by the notion of IC classes; call this the Information Content Modularity Hypothesis (ICMH). In other words, a parser that takes advantage of the structure of linguistic principles will maintain a modular design based on the five classes in (3).</Paragraph> <Paragraph position="7"> Although the ICMH is not so stringent as to make predictions that converge on a single parsing architecture, it does provide some predictive power about the organization of the parser. First, structural information is encoded separately from lexical information. Standard context-free rules, specified with category, such as VP → V NP, are not compatible with the ICMH, nor are proposals in the spirit of licensing grammars (Abney 1989; Frank 1992), where information is encoded in each lexical item. Second, the ICMH predicts that long-distance dependencies, represented as chains, are computed in steps. Empty categories are licensed in two computational steps: structural licensing by an appropriate head, and feature instantiation. With respect to feature instantiation in particular, it is predicted that precompiling syntactic features speeds up the parsing process. This is different from functional approaches such as Fong (1991) and Fong and Berwick (1992), in which there is no precompilation.6 These predictions seem to be supported (and, consequently, so is the ICMH) by two main results, which are illustrated below:
1. separating X̄ information from lexical information yields more compact data structures; I propose a parser that uses two compiled tables: one that encodes structural information, and the other that encodes lexical information.</Paragraph> <Paragraph position="8"> 2. using syntactic features to compute empty categories reduces the search space; complex chains can be computed efficiently.</Paragraph> <Paragraph position="9"> These claims are supported in the next section, where I discuss the properties of an implemented parser, which computes simple, complex, and multiple chain formation, as exemplified in Figure 1. This subset of constructions has been chosen because it constitutes the crucial test set for principle-based parsers: it involves complex interactions of principles over large portions of the tree.7</Paragraph> <Paragraph position="10"> 6 At first sight it might appear that the notion of types proposed by Fong (1991) is similar to IC classes. In fact, the similarity is superficial. Clearly, both notions constitute an attempt to partition the set of principles into smaller subsets. However, Fong's types are a mechanism to interleave constraints and phrase structure rules automatically. They are a method to schedule the on-line computation of principles that are the direct translation of the theory, and not a way of defining the design of the parser. In Fong's view, all computations are done on-line and the parser reflects the theory as directly as possible.</Paragraph>
<Paragraph position="11">
1. mary seems to like john
2. john thinks that mary loves bill
3. john thinks that mary runs
4. mary thinks that john seems to like bill
5. who does john love?
6. who do you think that john likes?
7. who did you think that john seemed to like?
8. *who did you wonder why mary liked?
Figure 1 Types of Sentences.</Paragraph> <Paragraph position="12"> In the rest of the paper, I first discuss the advantages of storing X̄ information separately from lexical information (section 3). I then turn to the computation of long-distance dependencies. I illustrate two algorithms to compute chains: I show that a particular use of syntactic feature information speeds up the parse, and I discuss the plausibility of using algorithms that require strict left-to-right annotation of the nodes (section 4). In fact, the algorithm I propose appears to be interestingly correlated to a gap in the typology of natural languages.</Paragraph> </Section> <Section position="5" start_page="521" end_page="527" type="metho"> <SectionTitle> 3. The Computation of Phrase Structure </SectionTitle> <Paragraph position="0"> In order to explore the validity of the proposed hypothesis about the modularity of the parser, an analyzer for English was developed. Each of the data structures is the direct implementation of linguistic objects with different information contents. The input to the algorithm is an unannotated sentence. The output consists of a tree and a list of two chains: the list of Ā-chains and the list of A-chains, that is, chains formed by wh-movement and NP-movement, respectively. The main parsing algorithm is a modified LR parsing algorithm augmented by multi-action entries and constraints on reduction.8 The structure-building component of the parser is driven by an LR(k) parser (Knuth 1965) which consults two tables. One table encodes X̄ information (following Kornai and Pullum 1990). The other table encodes lexical information. Lexical information is consulted only if it is needed to disambiguate a state containing multiple actions in the LR parser. An overview of this design is shown in Figure 2.</Paragraph> <Paragraph position="1"> 7 [...] instance) or they require extensive backtracking (Fong 1991; Fong and Berwick 1992). In formalisms other than GB theory, gaps are encoded directly into the rules. Both GPSG and HPSG use slash features to percolate features to gaps. The use of slash features probably simplifies the computation. There has been a debate on the explanatory adequacy of grammars that employ slash features (see van de Koot 1990, and Stabler 1994). For my purposes, note that, if anything, I am dealing with the worst case for the parser.</Paragraph> <Paragraph position="2"> 8 The ICMH is not sufficient to predict a specific parsing architecture, but rather it loosely dictates the organization of the parser. The choice of an LR parser then is the result of the ICMH (with which the parser's organization must be compatible) and additional independent factors. First, LR parsers have the valid prefix property, namely they recognize that a string is not in the language as soon as possible (other parsing methods have this property as well, for instance Schabes 1991). A parser with this property is incremental, in the sense that it does not perform unnecessary work, and it fails as soon as an error occurs. Second, the stack of an LR parser encodes the notion of c-command implicitly. This is crucial for fast computation of chains. Third, LR parsers are fast.</Paragraph>
<Paragraph position="4"> Figure 2 Organization of the Parser: The data structures (tables, stack and chains) are represented as rectangles. Operations on feature annotation are performed by constraints, represented as ovals.</Paragraph> <Paragraph position="5"> Figure 3 Category-Neutral Grammar.</Paragraph> <Paragraph position="7"> The context-free grammar compiled in the LR table is shown in Figure 3. The crucial feature of this grammar is that nonterminals specify only the X̄ projection level, and not the category. Because the LR table is underspecified with respect to the categorial labels of the input, many instances of LR conflicts arise, which can be teased apart by looking at the co-occurrence restrictions on categories. This information would be stored in the rules themselves in ordinary context-free rules. However, ordinary context-free rules do not encode many other types of lexical information also used in parsing. Thus, they lose generality, without exploiting all the available information.</Paragraph> <Paragraph position="8"> As an illustration, consider the following set of context-free rules.</Paragraph> <Paragraph position="10"> (4)
1. C′ → C0 IP
2. I′ → I0 VP
3. V′ → V0 NP
4. V′ → V0 e
5. V′ → V0
Rules 1-4 have the same X̄ structure, but they differ in the labels of the nodes. In rules 1 and 2 the heads, C0 and I0 respectively, are followed by IP and VP obligatorily. Rules 4 and 5 cover the same string. Clearly, by writing 1-4 as different rules, the fact that they are instances of the same structure is not captured. Similarly, the obligatoriness of IP and VP as complements of C0 and I0 is lost. Finally, the choice of rule 4 or rule 5 depends on the actual verb in the string. If the verb is intransitive, rule 4 cannot apply.</Paragraph> <Paragraph position="11"> In the parser, structural information is separate from information about co-occurrence (rules 1-4), functional selection (rules 1, 2) and subcategorization (rules 4, 5). This information is stored in a table, called a co-occurrence table. The table stores information about obligatory complementation, such as the fact that I0 must be followed by a VP. It also stores compatible continuations based on subcategorization. For instance, consider the case in which the current token is an intransitive verb. The LR table contains two actions that match the input: one action generates a projection of the input node (V′), without branching, while the other action creates an empty object NP. By consulting the subcategorization information, the parser can eliminate the second option as incorrect.</Paragraph> <Paragraph position="12"> Using an LR table together with a co-occurrence table is equivalent in coverage to a fully instantiated LR table, but it is more advantageous in other respects. Conceptually, the latter organization encodes X̄ theory directly, and it maintains a general design, which makes it applicable to several languages. Practically, there is reason to think that it is more efficient.</Paragraph>
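The disambiguation step just described can be pictured with a small sketch (mine; the table entries and names are invented for illustration): lexical subcategorization filters the multiple actions that the category-neutral LR table proposes for a verb token.

# Runnable sketch (illustrative, not the parser's code) of resolving an LR
# conflict with subcategorization information from a co-occurrence table.
lr_actions = [
    {"name": "project V' without branching", "posits_empty_object": False},
    {"name": "project V' with empty object NP", "posits_empty_object": True},
]

subcat_table = {"run": "intransitive", "love": "transitive"}  # hypothetical entries

def filter_actions(verb: str, actions: list) -> list:
    # An intransitive verb cannot license an empty object NP, so the second
    # action is eliminated by consulting the lexical table.
    if subcat_table.get(verb) == "intransitive":
        return [a for a in actions if not a["posits_empty_object"]]
    return actions

print([a["name"] for a in filter_actions("run", lr_actions)])   # one action left
print([a["name"] for a in filter_actions("love", lr_actions)])  # both survive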
<Section position="1" start_page="523" end_page="525" type="sub_section"> <SectionTitle> 3.1 Testing the ICMH for phrase structure </SectionTitle> <Paragraph position="0"> The prediction made by the ICMH is that compiling together X̄ theory and categorial information will increase the size of the grammar without reducing the nondeterminism contained in the grammar, because category/subcategory information belongs to a different IC Class than structural (i.e., X̄) information.</Paragraph> <Paragraph position="1"> Method and Materials. The size of the grammar is measured as the number of rules or the number of states in the LR table. The amount of nondeterminism is measured as the average number of conflicts (the ratio between the number of actions and the number of entries in a table).9 Three grammars were constructed, constituting (pairwise) as close an approximation as possible to minimal pairs (with respect to IC Classes). They are shown in the Appendix. Grammar 1 differs minimally from Grammar 2, because each head is instantiated by category. The symbol YP stands for any maximal projection admitted by linguistic theory. Grammar 3 differs minimally from Grammar 2, because it also includes some subcategorization information (such as transitive, intransitive, raising), and some co-occurrence restrictions and functional selection. Moreover, empty categories are "moved up," so that they are encountered as high in the tree as possible. These three grammars are then compiled by the same program (BISON) into three (LA)LR tables. The results are shown in Table 1, which compares some of the indices of the nondeterminism in a given grammar to its size, and Table 2, which shows the distribution of actions in each of the grammars.</Paragraph> <Paragraph position="2"> 9 The average number of conflicts in the table gives a rough measure of the amount of nondeterminism the parser has to face at each step. However, it is only an approximate measure for at least two reasons: taking the mean of the conflicts abstracts away from the size of the grammar, which might be a factor, as the search in the table becomes more burdensome for larger tables (but, if anything, it plays against small grammars/tables); moreover, it does not take into account the fact that some states might be visited more often than others.</Paragraph> <Paragraph position="3"> Discussion. Consider Grammar 1 and Grammar 2 in Table 1. Grammar 2 has a slightly smaller average of conflicts, while it has three times the number of rules and twelve times the number of entries, compared to Grammar 1. The fact that Grammar 2 is larger than Grammar 1, with only a slightly smaller average of conflicts, confirms the prediction made by the ICMH that compiling X̄ theory with categorial information will increase the size of the grammar without decreasing nondeterminism. Since the number of rules is expanded, but no "filtering" constraint is incorporated in Grammar 2 with respect to Grammar 1, this result might not seem surprising.</Paragraph> <Paragraph position="4"> However, the ICMH is also confirmed by the other pairwise comparisons and by the global results. Grammar 3 has a higher average number of conflicts than Grammar 2, but it is smaller, both by rules and by LR entries, so it is more compact. Notice that adding information (subcategory, selection, etc.) has a filtering effect, and the resulting grammar is smaller. However, adding information does not reduce nondeterminism.</Paragraph>
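The nondeterminism measure used in these comparisons is just the ratio of actions to entries; a minimal sketch (mine, with an invented toy table) makes it explicit:

# Sketch of the nondeterminism measure: average conflicts = actions / entries.
# The toy table below is invented; real tables come from the compiled grammars.
def average_conflicts(table: dict) -> float:
    entries = [actions for actions in table.values() if actions]
    return sum(len(actions) for actions in entries) / len(entries)

toy_table = {
    (0, "X0"): ["shift 3"],
    (1, "X'"): ["reduce r2", "reduce r4"],             # two-way conflict
    (2, "XP"): ["shift 5", "reduce r1", "reduce r3"],  # three-way conflict
}
print(average_conflicts(toy_table))  # (1 + 2 + 3) / 3 = 2.0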
Compared to Grammar 1, Grammar 3 does not show any improvement on either dimension: Grammar 3 is both larger (four times as many LR entries) and more nondeterministic than Grammar 1. Globally, one can observe that an increase in grammar size, either as a number of rules or a number of LR entries, does not correspond to a parallel decrease in nondeterminism.</Paragraph> <Paragraph position="6"> As Table 2 shows, the distribution of the conflicts in Grammar 3 presents some gaps. This occurs because certain groups of actions go together. Two main patterns of conflict are observed: in those states that have the highest number of conflicts, all rules that cover the empty string can apply; in those states that have an intermediate number of conflicts, only some rules can apply, namely, those that have a certain X̄ projection level, and that cover the empty string (e.g., all XP's, independent of category, that cover the empty string). This observation confirms that categorial information does not reduce nondeterminism. On the contrary, adding categorial information multiplies nondeterminism by adding structural configurations. Even introducing "filtering" lexical information (co-occurrence restrictions and functional complementation) does not appear to help. In fact, ambiguities caused by empty categories occur according to structural partitions. The qualitative observation supports the numerical results: introducing categorial information is not advantageous, because it increases the size of the grammar without significantly decreasing the average number of conflicts.</Paragraph> </Section> <Section position="2" start_page="525" end_page="527" type="sub_section"> <SectionTitle> 3.2 Extending the test to other compilation techniques </SectionTitle> <Paragraph position="0"> The effects discussed above could be an artifact of the compilation technique. In order to check that this is not the case, the same three grammars (reported in the Appendix) were compiled into LL and Left Corner (LC) tables.</Paragraph> <Paragraph position="1"> LL compilation: Discussion. The LL compilation method yields results similar to those of the LR compilation, although less clear cut. This confirms the intuition that the results reflect some structural property of the grammar, and are not an artifact of the LR compilation.</Paragraph> <Paragraph position="2"> The results of the compilation of the same grammars into LL tables are shown in Table 3. Grammar 1' is a modified version of Grammar 1, without adjunction rules. These figures show that there is no relation between the increased specialization of the grammar and the decrease of nondeterminism. Note that the LL compilation does not maintain the paired rankings of actions and rules. So, for the LL table, the co-occurrence of lexical categories does not play a filtering role.</Paragraph> <Paragraph position="3"> Globally, there appears to be an inverse relation between the size of the grammar, measured by the number of rules, and the average number of conflicts: the larger the grammar the smaller the number of conflicts. This might make one think that there is some sort of relation between grammar size and nondeterminism after all. However, this is not true if we use the number of entries as the relevant measure of size.
Moreover, if one looks at Grammar 1', which is smaller than Grammar 1, one can see that the average number of conflicts decreases considerably. This confirms a weaker hypothesis, which is nonetheless related to the initial one, namely that nondeterminism does not vary as an inverse function of "content of information." Some qualitative observations might help clarify the sources of ambiguity in the tables. In all three grammars, the same ambiguities are repeated, for each terminal item. In other words, all columns of the LL table are identical (with the exception of cell [X0, wp] in Grammar 1). This suggests that lexical tokens do not provide any selective information. Moreover, as we saw in the LR tables, projections to the same level have the same pattern of conflicts. (In Grammar 2, the number of conflicts is multiplied by the number of categories.)10</Paragraph> <Paragraph position="4"> LC compilation: Discussion. The same three grammars were compiled into left corner (LC) tables. The results of the compilation are shown in Table 4, and the distribution of the conflicts is shown in Table 5. As can be seen from Table 4, Grammar 2 is three times larger than Grammar 1 and is compiled into a table that has twenty-nine times as many entries, but the average number of conflicts is not significantly smaller.</Paragraph> <Paragraph position="5"> The interpretation of the LC table derived from Grammar 3 poses a problem for the ICMH. Grammar 3 is larger than Grammar 1, as it contains category and some co-occurrence information, but its average of conflicts is smaller. In this case, it seems that adding information reduces nondeterminism. On the other hand, compared to Grammar 2, both the table and the average number of conflicts are smaller. I take this to mean that the ICMH is confirmed only by a global assessment of the relation between the content of information and the average conflicts, but not by pairwise comparisons of the grammars. Notice, however, that the difference in the two pairwise comparisons confirms that simple categorial information does not perform a filtering action on the structure, while lexical co-occurrence does. This is precisely what I propose to compile in the lexical co-occurrence table.</Paragraph> <Paragraph position="6"> The qualitative inspection of the tables confirms the clustering of conflicts suggested by Table 5. Grammar 1 and Grammar 2 show the same patterns of conflicts as the LR and LL tables: conflicting actions cluster with the bar level of the category. So, for example, in Grammar 2, one finds that when the left corner is a maximal projection the action is unique, while when the corner is a bar-level projection there are multiple actions and they are the same, independently of the input token.</Paragraph> <Paragraph position="7"> 10 In all cases, this is caused by the X̄ form of the grammar. Namely, the loci of recursion and gapping are at both sides of the head, and anything can occur there. Eliminating this property would be incorrect, as it would amount to eliminating one of the crucial principles of GB, namely move-α, which says that any maximal projection or head can be gapped.</Paragraph>
<Paragraph position="8"> In Grammar 3, the same patterns of actions are repeated for each left corner, independently of the goal or of the input token.</Paragraph> <Paragraph position="9"> The qualitative inspection of the compiled tables is coherent across compilation methods and appears, in general, to support the ICMH, as the interaction of structural and lexical information is the cause of repeated patterns of conflicts. Quantitatively, the results, which are very suggestive in the LR compilation, are less clear in the other two methods. However, in no case do they clearly disconfirm the hypothesis. I conclude that categorial information should be factored out of the compiled table and separate data structures should be used.11</Paragraph> </Section> </Section> <Section position="6" start_page="527" end_page="536" type="metho"> <SectionTitle> 4. The Computation of Chains </SectionTitle> <Paragraph position="0"> As a result of using a category-neutral context-free backbone to parse, most of the feature annotation is performed by conditions on rule reduction associated with each context-free rule, which are shown in Figure 4.</Paragraph> <Paragraph position="1"> The most interesting issues arise in the treatment of filler-gap dependencies, which are represented as chains. Informally, a chain is a syntactic object that defines an equivalence class of positions for the purpose of feature assignment and interpretation.
(5)
a. Maryi was loved ti
b. Whoi did John love ti?
c. Maryi seemed ti to have been loved ti
d. Whoi did John think ti that Mary loved ti?
The sentence in (5a), for example, contains the chain (Maryi, ti), which encodes the fact that Mary is the object of love, represented by the empty category t. In this parser, empty categories are postulated by the LR parser, when building structure, and their licensing is immediately checked by the appropriate condition on rule reductions, shown in Figure 4.</Paragraph> <Paragraph position="2"> Many principles regulate the distribution of chains. For the purpose of the following discussion, it is only necessary to recall that a chain can contain only one thematic position and one position that receives Case. Moreover, chains divide into two types:11</Paragraph> <Paragraph position="3"> 11 It should be noted that, although phrase-structure rules are reduced to the bare bones, they cannot be eliminated altogether. Parsers that project phrase structure and attachments entirely from the lexicon have been presented by Abney (1989) and Frank (1992), using licensing grammars (LS). They suffer from serious shortcomings when faced with ambiguous input, as they do not have enough global knowledge of the possible structures in the language to recover from erroneous parses. Abney alleviates this problem by attaching LR states to the constructed nodes, thus losing much of the initial motivation of the licensing approach. Frank's parser is augmented by a parse stack to parse head-final languages. Frank does not discuss this issue in detail, but it seems that a "shift" operation must be added to the operations of the parser. As there could always be a licensing head in the right context, which would license a left-branching structure, the "shift" operation is always correct. But then, the parser might reach the end of the input (or at least the end of the relevant elementary tree, i.e., the main predicate-argument structure) before realizing either that it pursued an incorrect analysis, in the case of ambiguous input, or that the input is ill-formed.
Thus, this augmented parser could not recognize errors as soon as they are encountered. Finally, note that all the augmentations necessary to make the LS grammar work make it equivalent to a phrase-structure grammar, possibly with the disadvantage of being procedurally instead of declaratively encoded. On the other hand, a precompiled table which keeps track of all the alternative configurations guarantees that incorrect parses are detected as soon as possible, and, if alternative parses exist, they will be found.</Paragraph> <Paragraph position="4">
Figure 4 The Constraints:
node labelling: determines what kind of chain link the current node is: head, intermediate, foot
chain selection: selects chain to unify with current node
chain unification: unifies node with selected chain
head feature percolation: consults co-occurrence table and determines co-occurrence restrictions among heads
θ-marked: marks node with available θ-role
case marked: marks node with available Case
c-select: categorial selection
is-a barrier: checks if maximal projection is a barrier
license empty head: checks features of closest lexical head
licensing head: finds a lexical head to license a maximal projection
locality: checks that the maximal projections between antecedent and empty category are not barriers</Paragraph> <Paragraph position="5"> wh-chains, also called Ā-chains, and NP-movement chains, also called A-chains; the empty categories that occur in these chains have different properties. More than one chain can occur in a sentence. Multiple chains occurring in the same sentence can either be disjoint or intersected.12 Disjoint chains are nested, as in (6a). If chains intersect, they share the same index and they have exactly one element in common, as in (6b).</Paragraph> <Paragraph position="6"> (6) a. Whoi did Maryj seem tj to like ti? b. Whoi did you think ti seemed ti to like Mary?</Paragraph> <Section position="1" start_page="528" end_page="531" type="sub_section"> <SectionTitle> 4.1 The Algorithms </SectionTitle> <Paragraph position="0"> When building chains, several problems must be solved. First of all, the parser must decide whether to start a new chain or not. It must also decide whether to start a chain headed by an element in an argument position (A-chain), such as the head of a passive chain, or a chain headed by an element in a non-argument position (Ā-chain), such as the head of a wh-chain. Second, on positing an empty element, the parser must decide to which chain it belongs.13 The two decisions can be seen as instances of the same problem, which consists in identifying the type of link in the chain that a given input node can form (whether head, intermediate, or foot, abbreviated as H, I, F in what follows).</Paragraph> <Paragraph position="1"> 12 Actually, chains can also compose. If chains compose they do not have intersecting elements, but they create a new link. This type of chain is exemplified in (i). We will only discuss chains of the types of (6). (i) Whoi did you meet ti Oi without greeting ti?</Paragraph>
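Before turning to the algorithms, here is a minimal sketch (mine; the field names are illustrative assumptions) of the chain object just described: an ordered list of links, subject to the well-formedness requirement of one θ-position and one Case position per chain.

# Illustrative chain representation: one theta-position and one Case position
# per chain, and a type distinguishing A-chains from A-bar chains.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Link:
    kind: str            # "H" (head), "I" (intermediate), or "F" (foot)
    has_theta: bool = False
    has_case: bool = False

@dataclass
class Chain:
    a_bar: bool          # True for wh-(A-bar) chains, False for NP-(A) chains
    links: List[Link] = field(default_factory=list)

    def well_formed(self) -> bool:
        # A chain can contain only one thematic position and only one
        # position that receives Case.
        return (sum(l.has_theta for l in self.links) == 1
                and sum(l.has_case for l in self.links) == 1)

# (5a) "Mary was loved t": Mary gets Case in subject position,
# the empty category t bears the theta-role assigned by "love".
passive = Chain(a_bar=False,
                links=[Link("H", has_case=True), Link("F", has_theta=True)])
print(passive.well_formed())  # True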
One can describe this sequence of decisions as two problems that must be solved in order to form chains: the Node Labelling Problem (NLAB), and the Chain Selection Problem (CSEL), formulated below.</Paragraph> <Paragraph position="2"> The Node Labelling Problem (NLAB). Given a node N to be inserted in a chain, determine its label L, where L ∈ {AH, ĀH, AI, ĀI, AF, ĀF}.</Paragraph> <Paragraph position="3"> This problem defines a relation R: N × L, where N belongs to the set of nodes, and L belongs to the set of labels for the elements of chains. The labels of possible chain links reflect the theoretical distinctions between A-movement and Ā-movement, and the fact that links of a chain can be either the first element of the chain, the head (H), or an intermediate element (I) in the case of chains formed by several links, or the last element, the foot (F).14 There are six possible outputs for this algorithm. The first case arises when the node N is a lexical wh-word, which starts a wh-chain. The second possibility is if the head is lexical, but not a question word. In this case, an argument chain (A-chain) is started, as in passives. The last four cases deal with empty categories. The feature annotation of the category is inspected: Case distinguishes the foot of an A-chain from the foot of an Ā-chain, while intermediate traces are characterized by a lack of θ-role and by their configurations (i.e., intermediate A empty categories occur in A-positions (spec of I), while intermediate Ā empty categories occur in Ā-positions (spec of C)). Once the potential chain links have been labelled, a second algorithm looks for a chain that can "accept" a node with that label.</Paragraph> <Paragraph position="4"> 13 Strictly speaking, it must also provide a rescuing procedure. This can be done by checking whether all the chains satisfy the well-formedness conditions. If not all the chains satisfy the well-formedness constraints, the parser can attempt to intersect or compose two or more chains in order to satisfy the well-formedness conditions. These two problems are not treated here. For an illustration, under the name of Chain Intersection Problem and Chain Composition Problem, see Merlo 1992.</Paragraph> <Paragraph position="5"> 14 I present here a simplified version of the algorithm, to avoid technical linguistic details, which are not relevant for the following discussion. However, one should also output a label ĀOp, which designates the empty operator that binds, for instance, the empty variable in a parasitic gap construction and other cases of non-overt movement, such as relative clauses. In "the man OP I saw" an empty operator is postulated by analogy to "the man whom/that I saw". ĀOp is licensed by the same conditions that license an intermediate Ā trace.</Paragraph> <Paragraph position="6"> The Chain Selection Problem (CSEL). Given a node N of label L, and an ordered list of chains C, return the chain Ci, possibly none, with which N has unified.</Paragraph> <Paragraph position="7">
Algorithm 2
Input: Node, Label(s), Ordered List of Chains
Output: Chain or empty set
- If Label ∈ {AH, ĀH} then start new chain
- If Label ∈ {AF, ĀF} then choose nearest unsaturated chain
- If Label = AI then choose nearest unsaturated chain, unless it is the immediately preceding element in the stack.</Paragraph> <Paragraph position="8"> The list of chains given as input is ordered by the structure-building algorithm: when new chains are started, they are added at the end of the list.
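A compact sketch of the two decisions (mine, simplified much like the text's own statement of the algorithms; "A'" renders Ā, and the feature bundles are illustrative): NLAB maps a node's features to a label, and CSEL unifies the node with the nearest compatible unsaturated chain.

# Simplified sketch of NLAB and CSEL; "A'" stands for A-bar.
def nlab(node: dict) -> str:
    # Returns one of AH, A'H, AI, A'I, AF, A'F.
    if node.get("lexical"):
        return "A'H" if node.get("wh") else "AH"   # wh-word vs. other lexical head
    if node.get("theta"):
        # feet bear a theta-role; a Case-marked foot is an A-bar foot (a
        # variable), a Caseless one is the foot of an A-chain (an NP-trace)
        return "A'F" if node.get("case") else "AF"
    # intermediate traces lack a theta-role; configuration decides A vs. A-bar
    return "AI" if node.get("spec_of_I") else "A'I"

def csel(node: dict, label: str, chains: list) -> list:
    if label.endswith("H"):                         # only lexical heads start chains
        chains.append({"type": label[:-1], "links": [node], "saturated": False})
        return chains
    for chain in reversed(chains):                  # nearest chain first: nesting
        if not chain["saturated"] and chain["type"] == label[:-1]:
            chain["links"].append(node)
            chain["saturated"] = label.endswith("F")  # a foot closes the chain
            return chains
    return chains                                   # none found: left for repair

# "who ... [t' ...] t": a wh-head, an intermediate A-bar trace, an A-bar foot.
chains = []
for node in [{"lexical": True, "wh": True},
             {"theta": False, "spec_of_I": False},
             {"theta": True, "case": True}]:
    chains = csel(node, nlab(node), chains)
print(len(chains), len(chains[0]["links"]))  # 1 chain with 3 links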
The first clause of Algorithm 2 starts a new chain whenever a lexical element is seen. No other type of chain link can start a chain. The second clause selects a chain when the foot is seen. By choosing the nearest chain (i.e., the last one in the list), only nested dependencies are built. The third clause assigns AI in a condition that is more complex than the others, to deal with subject-oriented parasitic gaps.15 In Figure 5, I show schematically how these algorithms build chains. A pseudo-Prolog notation is used, which is similar to the output of the parser, where chains are represented as lists enclosed in square brackets. I show the I/O of each algorithm, given the sentence Who did you think that John seemed to like?, where a multiple Ā-chain and an A-chain must be recovered. NLAB takes an input word and outputs a label, while CSEL takes a triple (Node, Label, Chains) as input, and returns a new chain list. Note that, in Algorithms 1 and 2, features such as Case and θ-role must be available as input for the correct labelling and chain assignment of the empty category. This is a crucial feature of the algorithms for chain formation proposed here.</Paragraph> <Paragraph position="9"> In GB theory, empty categories can be freely coindexed with an antecedent, from which they inherit their features. Features that are incompatible with a given context are automatically excluded, since the sentence will be ungrammatical (Brody 1984). This theory is called functional determination of empty categories. In GB parsing, there have been two approaches to the implementation of chains: one that mirrors the theory directly [...] I first show that not using features such as Case and thematic roles when building chains leads to an exponential growth of the space of hypotheses; second, I argue that using these features does not restrict the validity of the algorithm to specific constructions or languages.</Paragraph> </Section> <Section position="2" start_page="531" end_page="536" type="sub_section"> <SectionTitle> 4.2 Restricting the Search Space </SectionTitle> <Paragraph position="0"> As the previous section on phrase structure has shown, computing features is not always profitable, as some features reduce the search space while others do not. To see that checking features does indeed pay off, the cost of checking these features must be compared to the benefit of reducing the search space.</Paragraph> <Paragraph position="1"> This analysis mostly concerns the first algorithm, NLAB, which consists of a series of binary choices. More precisely, recall that the relevant information is: a) whether a node is lexical or not; b) whether it has a θ-role or not; c) whether it has Case or not; d) whether it is a sister of C (hence, in an Ā-position) or not (if not, it counts as an A-position). For the chain selection algorithm (CSEL) there are four main constraints: first, A-nodes can only be inserted in A-chains and Ā-nodes can only be inserted in Ā-chains. Second, empty nodes never start a new chain. Third, the closest head is always chosen as a potential chain with which to unify. Finally, only unsaturated chains are chosen.</Paragraph> <Paragraph position="2"> Consider what would result if NLAB did not check for all of these factors. If (b) were not checked, NLAB′ would not distinguish between feet and intermediate traces, even in the same type of chain, thus it would output four sets of labels: AH, ĀH, {AF, AI}, {ĀF, ĀI}.
If (c) were not checked, NLAB″ would not distinguish between A-feet and Ā-feet, thus it would output AH, ĀH, AI, ĀI, {AF, ĀF}. If (d) were not checked, NLAB‴ would output AH, ĀH, {AI, ĀI}, AF, ĀF. If (b), (c) and (d) together were not checked, NLAB⁗ would output AH, ĀH, {AI, ĀI, AF, ĀF}.</Paragraph> <Paragraph position="3"> In accounting for the growth rate in the space of hypotheses of these modified algorithms, two factors must be taken into consideration. One factor is the number of active chain types, namely, whether a sentence presents only A-chains, only Ā-chains, or both. This factor encodes the second and third restrictions of the CSEL algorithm, with the consequence that not all combinations are attempted. The second factor accounts for the growth rate proper, which is reducible to counting the set of k-strings over an n-sized alphabet, hence n^k. Here, k is the number of relevant links in the sentence (for instance, feet in NLAB′), and n is given by the size of the set of features collapsed by lifting some of these checks, hence, 2, 2, 2 and 4, respectively.</Paragraph> <Paragraph position="4"> The hypothesis space in the three algorithms grows in slightly different ways. In NLAB′, where there is no restriction on the number of active chains, the growth rate is n^k. For NLAB″ and NLAB‴, the formula is NA^k, where NA is the number of active chains. Practically, this amounts to 2^k at most, as the number of active chains is not more than 2, because of the restriction requiring that the nearest unsaturated chain be selected. For NLAB⁗, the restriction on active chains no longer holds. In this algorithm, no features are checked, so it is impossible to establish if a chain is saturated or not until structure building ends. Thus, the growth factor is a function of the number of heads seen up to a certain point in the parse, the number of empty categories, and their respective order in the input. Notice that the different size of the collapsed feature set, which is larger for NLAB⁗, is implicitly taken into account by k, as the number of relevant links varies with the size of the collapsed feature sets. For the same sentence, there are more relevant links if the collapsed feature set is larger.</Paragraph> <Paragraph position="5"> Now, in all cases, growth is exponential in the number of relevant links, while the possible gain obtained by not checking features can be at most logarithmic in the number of potential empty categories. Since the number of potential empty categories is at most 2^f, for f binary features, this gain is expressed as f. Hence, suppressing feature checks becomes beneficial only if kf > n^k. Now notice that 2 ≤ n ≤ 2^f. For n = 2 and f = 3, the inequality is satisfied for k < 4. This means that for algorithms NLAB″ and NLAB‴, all sentences with more than three relevant links are computed faster if features are checked. For n = 4, i.e., algorithm NLAB⁗, the inequality is never satisfied.17 The results of some calculations are reported in Table 6. The numbers in the "sentence" column refer to the type of construction, as exemplified in Figure 1 (sentence types 1 and 2 are not considered because they contain only trivial chains). If one considers a sentence such as Who did you say that John thought that Mary seemed to like?, with four gaps and four heads, there are 96 hypotheses about chain formation to explore using NLAB⁗. Clearly, checking features and using them for building chains, thereby keeping the hypothesis search space small, is beneficial in most cases.
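The inequality can be checked mechanically; the following worked sketch (mine) reproduces the cutoffs cited above:

# Worked check of the trade-off: suppressing feature checks pays only
# while k*f > n**k (k relevant links, f binary features, n collapsed labels).
def suppressing_pays(k: int, f: int, n: int) -> bool:
    return k * f > n ** k

for k in range(1, 6):
    print(k, suppressing_pays(k, f=3, n=2), suppressing_pays(k, f=3, n=4))
# n = 2: True for k < 4, False afterwards (NLAB'' and NLAB''');
# n = 4: never True (NLAB'''').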
</Paragraph> <Paragraph position="7"> Extensibility. These algorithms deal in detail with the somewhat neglected problem of what to do when more than one chain has to be constructed. They do not discuss specifically the issues of adjunction or rightward movement. However, they could be extended.</Paragraph> <Paragraph position="8"> In the unextended algorithm, the postulation and structural licensing of empty categories is always performed by the same mechanism. According to the ECP (as formulated in Rizzi 1990, 25; Cinque 1990; Chomsky 1986b, among others), for an empty category to be licensed, two conditions must be satisfied: the empty category must be within the maximal projection of a lexical head to be licensed structurally, and it must be identified by an antecedent. The structural licenser and the antecedent need not be the same element. In fact, they hardly ever are. Whether movement is to the left or to the right does not affect structural licensing (which is here performed by the conditions that apply to the reduction of an e-rule).</Paragraph> <Paragraph position="9"> Rightward movement requires an extension of the algorithm to incorporate the empty category in a chain. An empty category that is the foot of rightward movement must be licensed structurally before its antecedent is seen. When the NP that is the antecedent (head of chain) is found, it starts a new chain, according to CSEL. Therefore, an extension is needed to check if there are any empty categories waiting [...]</Paragraph> <Paragraph position="10"> 17 Note that here I am assuming that checking a feature and checking a chain have the same computational cost, which is an approximation, as a chain cannot be checked with a single operation.</Paragraph> <Paragraph position="11"> [...] an L-attributed rule into an S-attributed rule (Aho, Sethi and Ullman 1977, 282ff discuss the marker nonterminals technique). Such a transformation is possible if the attributes of the tokens on the left of the current token are at a fixed position in the stack.</Paragraph> <Paragraph position="12"> We can use this S-attribution transformation for Case assignment to the subject (nominative Case or structural Case). In English, structural Case is assigned to the subject position, if the subject is a sibling of (the projection of) a finite inflectional node. This position can occur both in main and embedded clauses. English is head-initial, and the Specifier precedes the head. These properties interact, so that when the subject NP is reduced, INFL is always the next token in the parsing configuration. Thus, rule (8) can be used.21 (8) IP → NP {Case assign, if I +fin} I′ This rule assigns Case correctly only if the attribution is not a function of the subconstituents of I. This is precisely what distinguishes Case assigned to the subject (structural Case assignment) from other types of Case assignment (e.g., Case assigned to the object by either a verb or a preposition): it is assigned independently of the properties of the main verb.</Paragraph>
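A minimal sketch (mine; names and token shapes are invented) of the S-attributed reduction in (8): when the subject NP is reduced, the next token is INFL, so Case can be assigned by looking only at the lookahead, never at I's subconstituents.

# Illustrative S-attributed action for rule (8): IP -> NP {Case assign, if I +fin} I'
def reduce_ip_subject(stack: list, lookahead: dict) -> dict:
    subject_np = stack[-1]                     # the NP just completed on the stack
    if lookahead.get("category") == "I" and lookahead.get("finite"):
        subject_np["case"] = "nominative"      # structural Case, no look inside I
    return subject_np

subject = {"label": "NP"}
print(reduce_ip_subject([subject], {"category": "I", "finite": True}))
# {'label': 'NP', 'case': 'nominative'}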
<Paragraph position="13"> The S-attribution transformation is not restricted to languages with the properties of English; it can also be extended to head-final languages. In verb-final languages (German, for example) the subject of the sentence in embedded clauses is not string-adjacent to the head of the sentence, as it is in English. However, structural Case can be assigned from left to right, since the complementizer, which necessarily marks the left edge of an IP, is obligatory, and the finite complementizer is always different from the infinitival complementizer.</Paragraph> <Paragraph position="14"> S-attribution could not be performed, however, in parsing a language with all the characteristics given in (9).</Paragraph> <Paragraph position="15"> (9)
a. no overt case marking
b. no distinct finite complementizer
c. verb final
d. right branching in the projections other than the verb</Paragraph> <Paragraph position="16"> 21 At first sight, this might appear as a wild overidealization. In fact, there are both theoretical and empirical reasons to think that this is the right way to idealize the data. A corpus analysis of 111 occurrences of the verb announce in the Penn Treebank shows that the subject is followed by an aspectual adverb 11 times, twice by incidental phrases, and 4 times by an apposition. In all other cases the subject and the verb are indeed adjacent. I do not consider appositions and incidentals as challenging for the general claim: incidentals are clearly outside of an X̄ structure assigned to the sentence, while appositions are "internal" to the NP; thus when the verb is reached, the phrase sitting on the stack is indeed the NP subject, which can therefore receive Case. The treatment of aspectual adverbs is more complex. There are at least two possible tacks. First, one can notice that adverbs, although they are analysed as maximal projections because they can be modified, never take a complement; thus they are usually limited to a very short sequence of words, and they do not have a recursive structure. A minimum amount of lookahead, even limited to these particular instances of aspectual adverbs, would solve the problem. Clearly, this is an inelegant solution. A more principled treatment comes from recent developments in the theory, which have changed somewhat the representation used for adverbs. Laenzlinger (1993) suggests that all maximal projections have two specifiers, one A and one Ā; the higher of the two is the Ā-position, which can be occupied by adverbs, if they are licensed by the appropriate head (the Adv-Criterion). For these adverbs, the appropriate head is Asp0, which we find only with finite verbs. The parser could compile this information and assign Case directly, without even waiting to see the (lexical) verb.</Paragraph> <Paragraph position="17"> Because of property (9a), Case could not be inferred from explicit information contained in the input (unlike Japanese or German); because of property (9b), the subject position of an embedded clause would not be unmistakably signalled (unlike German, but like Japanese); because of property (9c), the inflectional head would occur after the NP that needs to be assigned Case; finally, because of property (9d), an LR parser could give worst-case results (which is not the case for verb-final, but left-branching, languages, like Japanese): it could require the entire sentence to be stacked before starting to assemble it.</Paragraph> <Paragraph position="18"> Although a problem in principle, this limitation disappears in practice.
Inspection of some of the sources on language typology shows that such languages are very difficult to find (Steele 1978; Shopen 1985; Comrie 1981). According to Downing (1978), verb-final languages usually have prenominal relative clauses, which is a sign that they are left branching. Only two verb-final languages have postnominal relative clauses, Persian and Turkish. In Persian, the clause boundary is overtly marked by the suffix -i on the antecedent. Moreover, both languages have overt case marking of the subject. Although this is by no means definitive evidence, it suggests that the algorithm for chain formation and feature assignment that I have presented is not obviously inadequate, and that it is applicable to a variety of languages with different properties.</Paragraph> </Section> </Section> </Paper>