<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1507"> <Title>Categorial Type Logic meets Dependency Grammar to annotate an Italian Corpus</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Dependency and </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> functor-argument relations 3.1 Dependency structures in TUT </SectionTitle> <Paragraph position="0"> The Turin University Treebank (TUT) is a corpus of Italian sentences annotated by specifying relational structures augmented with morpho-syntactic information and semantic role (henceforth ARS) in a monostratal dependency-based representation. The treebank in its current release includes 38,653 words and 1,500 sentences adjectives, determiners, articles, adverbs, prepositions, conjunctions, numerals, interjections, punctuation and a class of residual items which differs from project to project.</Paragraph> <Paragraph position="1"> from the Italian civil law code, the national newspapers La Stampa and La Repubblica, and from various reviews, newspapers, novels, and academic papers.</Paragraph> <Paragraph position="2"> The ARS schema consists of i) morphosyntactic, ii) functional-syntactic and iii) semantic components, specifying part-of-speech, grammatical relations, and thematic role information, respectively. The reader is referred to (Bosco, 2003) for a detailed description of the TUT annotation schema. An example is given below (tr. &quot;The first steps have not been encouraging&quot;). In this example, the node TOP-VERB is the root of the whole structure4.</Paragraph> <Paragraph position="3"> Because we are interested in extracting dependency relations, we can focus on the functional-syntactic component of the TUT annotation, where information relating to grammatical relations (heads and dependents) is encoded. null The TUT annotation schema for dependents makes a primary distinction between (a) functional and (b) non-functional tags, for dependents that can and that cannot be assigned thematic roles, respectively. These two classes are further divided into (a') arguments (ARG) and modifiers (MOD) and (b'), AUX, COORDINATOR, INTERJECTION, CONTIN, EMPTYCOMPL, EMPTYLOC, SEPARATOR and VISITOR5; and furthermore, the arguments CONTIN, (ii) EMPTYCOMPL, (iii) EMPTYLOC and (iv) VISITOR. They are used for expressions that (i) introduce a part of an expression with a non-compositional interpretation (e.g. locative or idiomatic expressions and denominative structures: &quot;Arriv`o [prima]H [de]D ll'alba&quot;, lit. tr. &quot;(She) arrived ahead of the daybreak&quot;); (ii) link a re(ARG) and modifiers(MOD) are sub-divided as following null</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> ARG SUBJ OBJ INDOBJ INDCOMPL PREDCOMPL MODIFIER RMOD RELCLR RMODPRED APPOSITION RELCLA </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Categorial functor-argument </SectionTitle> <Paragraph position="0"> structures Categorial Type Logic (CTL) (Moortgat, 1997) is a logic-based formalism belonging to the family of Categorial Grammars (CG). In CTL, the type-forming operations of CG are viewed as logical connectives. 
As the slogan &quot;Parsing-as-Deduction&quot; suggests, such a view makes it possible to do away with combinatory syntactic rules altogether; establishing the well-formedness of an expression becomes a process of deduction in the logic of the type-forming connectives.</Paragraph> <Paragraph position="1"> The basic distinction expressed by the categorial type formulas is the Fregean opposition between complete and incomplete expressions. Complete expressions are categorized by means of atomic type formulas; grammaticality judgements for expressions with an atomic type do not require further contextual information. Typical examples of atomic types would be 'sentence' (s) and 'noun' (n). Incomplete expressions are categorized by means of fractional type formulas; the denominators of these fractions indicate the material that has to be found in the context in order to obtain a complete expression of the type of the numerator.</Paragraph> <Paragraph position="2"> Definition 3.1 (Fractional type formulas) Given a set of basic types ATOM, the set of types TYPE is the smallest set such that: i. if A ∈ ATOM, then A ∈ TYPE; ii. if A, B ∈ TYPE, then A/B, B\A ∈ TYPE.</Paragraph> <Paragraph position="3"> There are different ways of presenting valid type computations. In a Natural Deduction format, we write Γ ⊢ A for a demonstration that the structure Γ has the type A. The statement A ⊢ A is axiomatic. Each of the connectives / and \ has an Elimination rule and an Introduction rule. Below, we give these inference rules for / (incompleteness to the right). The cases for \ (incompleteness to the left) are symmetric.</Paragraph> <Paragraph position="4"> Given structures Γ and Δ of types A/B and B respectively, the Elimination rule builds a compound structure Γ * Δ of type A. The Introduction rule allows one to take apart a compound structure Γ * B into its immediate substructures.</Paragraph> <Paragraph position="5"> [/E]: from Γ ⊢ A/B and Δ ⊢ B, infer Γ * Δ ⊢ A. [/I]: from Γ * B ⊢ A, infer Γ ⊢ A/B.</Paragraph> <Paragraph position="6"> Notice that the language of fractional types is essentially higher-order: the denominator of a fraction does not have to be atomic, but can itself be a fraction. The Introduction rules are indispensable if one is interested in capturing the full set of theorems of the type calculus.</Paragraph> <Paragraph position="7"> Classical CG (in the style of Ajdukiewicz and Bar-Hillel) uses only the Elimination rules, and hence has restricted inferential capacities. It is impossible in classical CG to obtain the validity A ⊢ B/(A\B), for example. Still, the classical CG perspective will be useful to realize our aim of automatically inducing type assignments from structured data obtained from the TUT corpus, thanks to the type resolution algorithm explained below.</Paragraph> <Paragraph position="8"> Type inference algorithms for classical CG have been studied in (Buszkowski and Penn, 1990). The structured data needed by their type inference algorithms are so-called functor-argument structures (fa-structures).</Paragraph>
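To make Definition 3.1 and the Elimination step concrete before turning to fa-structures, here is a minimal illustrative sketch in Python (our own encoding, written for this presentation; the class and function names are invented and not part of the formalism):

# Illustrative only: the type language of Definition 3.1 and the two
# Elimination rules of classical CG.  Introduction rules are omitted.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atom:
    name: str                          # atomic types, e.g. 's', 'n'

@dataclass(frozen=True)
class Slash:                           # A/B: incomplete, looks for a B to its right
    num: "Type"
    den: "Type"

@dataclass(frozen=True)
class Backslash:                       # B\A: incomplete, looks for a B to its left
    den: "Type"
    num: "Type"

Type = Union[Atom, Slash, Backslash]

def eliminate(left: Type, right: Type):
    """Forward and backward application: the two Elimination rules."""
    if isinstance(left, Slash) and left.den == right:
        return left.num                # A/B , B   yields  A
    if isinstance(right, Backslash) and right.den == left:
        return right.num               # B , B\A   yields  A
    return None                        # no Elimination applies

# 'n' followed by 'n\s' yields a complete expression of type 's'
print(eliminate(Atom("n"), Backslash(Atom("n"), Atom("s"))))   # Atom(name='s')

The Introduction rules, which hypothesize an argument and abstract over it, are deliberately left out here, matching the restricted classical CG fragment that the induction procedure below relies on.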
<Paragraph position="9"> An fa-structure for an expression is a binary-branching tree; the leaf nodes are labeled by lexical expressions (words), the internal nodes by one of the symbols ◂ (for structures with the functor as the left daughter) or ▸ (for structures with the functor as the right daughter).</Paragraph> <Paragraph position="10"> To assign types to the leaf nodes of an fa-structure, one proceeds in a top-down fashion.</Paragraph> <Paragraph position="11"> The type of the root of the structure is fixed (for example: s). Compound structures are typed as follows: - to type a structure Γ ◂ Δ as A, type Γ as A/B and Δ as B; - to type a structure Γ ▸ Δ as A, type Γ as B and Δ as B\A.</Paragraph> <Paragraph position="12"> If a word occurs in different structural environments, the typing algorithm will produce distinct types. The set of type assignments to a word can be reduced by factoring: one identifies type assignments that can be unified. For an example, compare the structured input below.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 From TUT dependency structures to categorial types </SectionTitle> <Paragraph position="0"> To accomplish our aims, we will have occasion to use two extensions of the basic categorial machinery outlined in the section above: a generalization of the type language to multiple modes of composition, and the addition of structural rules of inference to the logical rules of slash Elimination and Introduction.</Paragraph> <Paragraph position="1"> Multimodal composition. The intuitions underlying the distinction between heads and dependents in Dependency Grammars (DG) and between functors and arguments in CG often coincide, but there are also cases where they diverge (Venneman, 1977). In the particular case of the TUT annotation schema, we see that for all instances of dependents labeled as ARG (or one of its sublabels), the DG head/dependent articulation coincides with the CG functor/argument asymmetry. But for DG modifiers, or dependents without thematic roles of the class AUX (auxiliary), there is a mismatch between dependency structure and functor-argument structure. Modifiers would be functors in terms of their categorial type: functors where the numerator and the denominator are identical. This makes them into 'identities' for the fractional multiplication, which explains their optionality and the possibility of iteration. AUX elements in DG would count as morphological modifiers of the head verbs. From the CG point of view, they would be typed as functors with non-identical numerator and denominator, distinguishing them in that way from optional modifiers, and capturing the fact that they are indispensable to build a complete grammatical structure.</Paragraph> <Paragraph position="2"> To reconcile the competing demands of the head-dependent and the functor-argument classification, we make use of the type calculus proposed in (Moortgat and Morrill, 1991), which treats dependency and functor-argument relations as two orthogonal dimensions of linguistic organization. Instead of one composition operation *, the system of (Moortgat and Morrill, 1991) has two: *l for structures where the left daughter is the head, and *r for right-headed structures.</Paragraph>
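Before spelling out the multimodal refinement, the top-down typing procedure of Section 3.2 can be sketched as follows (an illustrative Python rendering of our own, not the authors' implementation; fa-structures are nested tuples whose labels "L" and "R" stand for the functor-left node ◂ and the functor-right node ▸, and fresh argument types are written B1, B2, ...):

# Sketch of Buszkowski-Penn style top-down typing over fa-structures.
# Leaves are words (strings); internal nodes are ("L", left, right) when the
# functor is the left daughter and ("R", left, right) when it is the right one.
# Types: strings are atoms or variables; ("/", a, b) encodes a/b and
# ("\\", b, a) encodes b\a (denominator first, as in the fractional notation).
from collections import defaultdict
from itertools import count

def infer(structure, goal, lexicon=None, fresh=None):
    """Assign `goal` to the root and propagate types down to the leaves."""
    lexicon = defaultdict(set) if lexicon is None else lexicon
    fresh = count(1) if fresh is None else fresh
    if isinstance(structure, str):          # a leaf: record word -> type
        lexicon[structure].add(goal)
        return lexicon
    label, left, right = structure
    b = "B%d" % next(fresh)                 # fresh variable for the argument type
    if label == "L":                        # functor left: functor gets A/B, sister gets B
        infer(left, ("/", goal, b), lexicon, fresh)
        infer(right, b, lexicon, fresh)
    else:                                   # functor right: functor gets B\A, sister gets B
        infer(left, b, lexicon, fresh)
        infer(right, ("\\", b, goal), lexicon, fresh)
    return lexicon

# A hypothetical English input with goal type s: (John R (likes L Mary))
print(dict(infer(("R", "John", ("L", "likes", "Mary")), "s")))
# {'John': {'B1'}, 'likes': {('/', ('\\', 'B1', 's'), 'B2')}, 'Mary': {'B2'}}

Factoring would then inspect the sets of assignments collected for each word and identify those that can be unified.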
<Paragraph position="3"> The two composition operations each have their slash and backslash operations for the typing of incomplete expressions: - A/lB: a functor looking for a B to the right to form an A; the functor is the head, the argument the dependent; - A/rB: a functor looking for a B to the right to form an A; the argument is the head, the functor the dependent; - B\lA: a functor looking for a B to the left to form an A; the argument is the head, the functor the dependent; - B\rA: a functor looking for a B to the left to form an A; the functor is the head, the argument the dependent.</Paragraph> <Paragraph position="4"> The type inference algorithm of (Buszkowski and Penn, 1990) can be straightforwardly adapted to the multimodal situation. The internal nodes of the fa-structures are now labeled with a fourfold distinction: as before, the triangle points to the functor daughter of a constituent; in the case of a black triangle, the functor daughter is the head constituent, in the case of a white triangle, the functor daughter is the dependent.</Paragraph> <Paragraph position="5"> The four node labels are thus ◂ and ▸ (black: the functor daughter is the head) and ◃ and ▹ (white: the functor daughter is the dependent).</Paragraph> <Paragraph position="6"> The type-inference clauses can be adapted accordingly: - to type a structure Γ ◂ Δ as A, type Γ as A/lB and Δ as B; - to type a structure Γ ◃ Δ as A, type Γ as A/rB and Δ as B; - to type a structure Γ ▸ Δ as A, type Δ as B\rA and Γ as B; - to type a structure Γ ▹ Δ as A, type Δ as B\lA and Γ as B.</Paragraph> <Paragraph position="7"> Structural reasoning. The dependency relations in the TUT corpus abstract from surface word order. When we induce categorial type formulas from these dependency relations, as we will see in Section 4.1, the linear order imposed by '/' and '\' in the obtained formulas will not always be compatible with the observable surface order. Incompatibilities will arise, specifically, in the case of non-projective dependencies. Where such mismatches occur, the induced types will not be immediately useful for parsing -- the longer-term subtask of the project discussed here.</Paragraph> <Paragraph position="8"> To address this issue, we can extend the inference rules of our categorial logic with structural rules. The general pattern of these rules is: infer Γ' ⊢ A from Γ ⊢ A, where Γ' is some rearrangement of the constituents of Γ. These rules, in other words, characterize the structural deformations under which type assignment is preserved. Structural rules can be employed in two ways in CTL (see (Moortgat, 2001) for discussion). In an on-line use, they actually manipulate structural configurations during the parsing process. Such on-line use can be very expensive computationally. Used off-line, they play a role complementary to the factoring operation, producing a number of derived lexical type-assignments from some canonical assignment. With the derived assignments, parsing can then proceed without altering the surface structure.</Paragraph> <Paragraph position="9"> As indicated in the introduction, the use of CTL in the construction of a treebank for a part of the CILTA corpus belongs to a future phase of our project.</Paragraph>
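Returning to the adapted type-inference clauses above: in the illustrative encoding of the previous sketch they can be written as a small dispatch table (again our own code, not the authors'; "Lh" and "Rh" stand for the black triangles ◂ and ▸, "Ld" and "Rd" for the white triangles ◃ and ▹, and "/l", "/r", "\l", "\r" for the mode-annotated connectives):

# Multimodal variant of the typing sketch: the node label now also records
# whether the functor daughter is the head ('h', black triangle) or the
# dependent ('d', white triangle), and the induced slash carries the mode.
from collections import defaultdict
from itertools import count

FUNCTOR_SLASH = {
    "Lh": "/l",    # functor left,  functor is head       :  A/lB
    "Ld": "/r",    # functor left,  functor is dependent  :  A/rB
    "Rh": "\\r",   # functor right, functor is head       :  B\rA
    "Rd": "\\l",   # functor right, functor is dependent  :  B\lA
}

def infer_mm(structure, goal, lexicon=None, fresh=None):
    """Multimodal top-down typing of an fa-structure (illustrative sketch)."""
    lexicon = defaultdict(set) if lexicon is None else lexicon
    fresh = count(1) if fresh is None else fresh
    if isinstance(structure, str):
        lexicon[structure].add(goal)
        return lexicon
    label, left, right = structure
    b = "B%d" % next(fresh)
    slash = FUNCTOR_SLASH[label]
    if label.startswith("L"):               # functor is the left daughter
        infer_mm(left, (slash, goal, b), lexicon, fresh)
        infer_mm(right, b, lexicon, fresh)
    else:                                   # functor is the right daughter
        infer_mm(left, b, lexicon, fresh)
        infer_mm(right, (slash, b, goal), lexicon, fresh)
    return lexicon

The fa-structures induced from TUT in Section 4.1 below are typed with exactly these four clauses; off-line structural rules would subsequently relate the resulting assignments to surface-order-compatible variants, but, as noted, their exact shape is left open here.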
<Paragraph position="10"> For the purposes of this paper we must leave the exact nature of the required structural rules, and the trade-off between off-line and on-line uses, as a subject for further research.</Paragraph> <Paragraph position="11"> 4 A distributional study of Italian part-of-speech tagging
In order to annotate the CORIS corpus with a theory-neutral set of PoS tags, we plan to carry out a distributional study of its lexicon.</Paragraph> <Paragraph position="12"> Early approaches to this problem were based on the hypothesis that if two words are syntactically and semantically different, they will appear in different contexts. There are a number of studies that, starting from this hypothesis, have built automatic or semi-automatic procedures for clustering words (Brill and Marcus, 1992; Pereira et al., 1993; Martin et al., 1998), especially in the field of cognitive sciences (Redington et al., 1998; Gobet and Pine, 1997; Clark, 2000). They examine the distributional behaviour of some target words, comparing the lexical distribution of their respective collocates using quantitative measures of distributional similarity (Lee, 1999).</Paragraph> <Paragraph position="13"> In (Brill and Marcus, 1992), a semi-automatic procedure is presented that, starting from lexical statistical data collected from a large corpus, aims to arrange target words in a tree (more precisely, a dendrogram) instead of clustering them automatically. This procedure requires a linguistic examination of the resulting tree, in order to identify the word classes that are most appropriate to describe the phenomenon under investigation. In this sense, theirs is a semi-automatic word-class generation method.</Paragraph> <Paragraph position="14"> A similar procedure has been applied to Italian in (Tamburini et al., 2002). The novelty of this work is that it derives the distributional information on words from a very basic set of PoS tags, namely nouns, verbs and adjectives. This method, which completely avoids the data sparseness affecting Brill and Marcus' method, uses general information about the distribution of lexical words to study the internal subdivisions of the set of grammatical words, and turns out to be more stable than methods based only on lexical co-occurrence.</Paragraph> <Paragraph position="15"> The main drawback of these techniques is the limited context of analysis. Collecting information from a fixed context, typically two or three words, will invariably miss syntactic dependencies longer than the context window. To overcome this problem we propose to exploit the expressivity of categorial type assignments (CTAs), which encode core dependency relations, as we saw in the section above, by applying the clustering algorithms to them.</Paragraph> <Paragraph position="16"> Below we sketch how we intend to induce CTAs from the TUT dependency treebank, and the clustering method we plan to implement. The whole procedure can be summarized by the picture below.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Inducing categorial types from TUT </SectionTitle> <Paragraph position="0"> The first step is to reduce the distinctions encoded in the TUT treebank to bare head-dependent relations: the ARG type on the one hand, and the MOD and AUX types on the other.</Paragraph> <Paragraph position="1"> These relations are converted into fa-structures built by means of the dependency-sensitive operators ◂, ▸, ◃ and ▹.</Paragraph>
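This conversion step can be pictured as follows (an illustrative sketch of our own; the function name, the collapsed relation labels and the treatment of linear order are assumptions made for the illustration):

# Sketch: map one TUT head-dependent edge onto one of the four node labels
# used in the typing sketches above.  `rel` is the collapsed TUT label
# (ARG vs MOD/AUX); `head_first` says whether the head precedes the
# dependent in the surface string.
def node_label(rel: str, head_first: bool) -> str:
    functor_is_head = rel == "ARG"             # ARG: head and functor coincide
    # MOD and AUX: the dependent is the categorial functor
    functor_first = head_first if functor_is_head else not head_first
    side = "L" if functor_first else "R"       # which daughter is the functor
    return side + ("h" if functor_is_head else "d")

# "la mela":     ARG edge, head "la" comes first        :  "Lh"  (la ◂ mela)
# "mela rossa":  MOD edge, head "mela" comes first      :  "Rd"  (mela ▹ rossa)
# "ha mangiato": AUX edge, head "mangiato" comes last   :  "Ld"  (ha ◃ mangiato)
print(node_label("ARG", True), node_label("MOD", True), node_label("AUX", False))   # Lh Rd Ld

How a head with several dependents is binarised into nested constituents is a separate decision (see the remark on constituent groupings at the end of this section).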
<Paragraph position="2"> By way of example, we consider some simple sentences exemplifying the different relations. Figure 1 shows a head-dependent structure in which edges represent head-dependent relations and each edge points to the dependent of the relation. In this example, each H-D relation agrees with the F-A relation, i.e. each head corresponds to a functor and the dependents are all labeled as arguments (or one of its sub-tags). (Footnote: The example follows TUT practice in designating the determiner as the head of the noun phrase. We are aware of the fact that this is far from uncontroversial in the dependency community. In preprocessing TUT before type inference, we have the occasion to adjust such debatable decisions, and representational issues such as the use of empty categories, for which there is no need in a CTL framework.) We next consider the use of qualifying adjectives, which is an example of a modifier, and past tense auxiliaries. Considering the relation between &quot;mela&quot; (apple) and &quot;rossa&quot; (red), and between &quot;ha&quot; (has) and &quot;mangiato&quot; (eaten), we have the dependency trees in Figure 2.</Paragraph> <Paragraph position="3"> In the first case, the noun is the head and the adjective is the dependent, but from the functor-argument perspective, the adjective (in general, the modifier) is the incomplete functor component. A similar discrepancy is observed for the auxiliary and the main verb, where the auxiliary should be classified as the incomplete functor, but as the dependent element with respect to the main verb. In this case the absence of the auxiliary would result in an ungrammatical sentence. The relations of MOD and AUX exhibit a different behavior from ARG, and hence are depicted with different arcs.</Paragraph> <Paragraph position="4"> Our simple example sentences can be converted into the following fa-structures: - Allen ▸ (mangia ◂ (la ◂ mela)) - Allen ▸ (mangia ◂ (la ◂ (mela ▹ rossa))) - Allen ▸ ((ha ◃ mangiato) ◂ (la ◂ mela))</Paragraph> <Paragraph position="5"> The second step is to run the Buszkowski-Penn type-inference algorithm (in its extended form, discussed above) on the fa-structures obtained from TUT, and to reduce the lexicon by factoring (identification of unifiable assignments) and (in a later phase) structural derivability. Fixing the goal type for these examples as s, we obtain the following type assignments from the fa-structures given above (up to renaming of the type variables): Allen ⊢ A, mangia ⊢ (A\rs)/lB, la ⊢ B/lC, mela ⊢ C, rossa ⊢ C\lC (after factoring the two assignments obtained for &quot;mela&quot;), ha ⊢ ((A\rs)/lB)/rE, mangiato ⊢ E. Notice that from the output in our tiny sample, we have no information allowing us to identify the argument assignments A and B. Notice also that from an fa-structure which groups &quot;ha mangiato&quot; together in a constituent, we obtain a type assignment for &quot;mangiato&quot; that does not express its incompleteness anymore -- instead, the combination with the auxiliary expresses this. This is already an example where structural reasoning can play a role: compare the above analysis with the type solution one would obtain by starting from an fa-structure which takes &quot;mangiato la mela&quot; as a constituent, which yields a type solution (A\rs)/rE for the auxiliary, and E/lB for the head verb.</Paragraph>
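The factoring step used above amounts to ordinary first-order unification over type terms. A minimal sketch (ours, not the authors' implementation), reusing the tuple encoding of the earlier sketches, with single capital letters playing the role of the argument variables A, B, C, D:

# Factoring = identifying unifiable type assignments.  A deliberately minimal
# unification sketch over tuple-encoded types; variables are capital letters
# (possibly followed by a digit), connectives are '/l', '/r', '\l', '\r'.
def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def unify(t1, t2, subst=None):
    subst = {} if subst is None else subst
    t1, t2 = subst.get(t1, t1), subst.get(t2, t2)
    if t1 == t2:
        return subst
    if is_var(t1):
        subst[t1] = t2
        return subst
    if is_var(t2):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[0] == t2[0]:
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                                   # not unifiable

def substitute(subst, t):
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(subst, x) for x in t[1:])
    return subst.get(t, t)

# "mela" receives C in the first sentence and D in the second; identifying the
# two assignments (D := C) turns the raw type of "rossa" into the identity C\lC.
s = unify("D", "C")
print(substitute(s, ("\\l", "D", "C")))           # ('\\l', 'C', 'C')

Unifying the two assignments induced for &quot;mela&quot; is what yields the identity type for the modifier &quot;rossa&quot; reported above.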
<Paragraph position="6"> We are currently experimenting with the effect of different constituent groupings on the size of the induced type lexicon.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Clustering Algorithms </SectionTitle> <Paragraph position="0"> Once we have induced the categorial type assignments for the TUT lexicon, the last step of our first task is to divide it into clusters so as to study the distributional behavior of the corresponding lexical entries. The advantage of using categorial types as the objects of the clustering algorithm is that they represent long-distance dependencies as well as limited-context distributional information. Thus the categorial types become the basic elements of syntactic information associated with lexical entries and the basic &quot;distributional fingerprints&quot; used in the clustering process.</Paragraph> <Paragraph position="1"> Every clustering process is based on a notion of &quot;distance&quot; between the objects involved in the process. We therefore need to define an appropriate metric on categorial types. We believe that a crucial role will be played by the dependency relations encoded into the types by means of the composition modes.</Paragraph> <Paragraph position="2"> Currently, we are studying the application of suitable distance measures that treat types as trees, adapting theoretical results on tree metrics to our problem. The algorithm for computing the tree-edit distance (Shasha and Zhang, 1997), designed for generic trees, appears to be a good candidate for clustering in the categorial-type domain. What remains to be done is to experiment with the algorithm and to fine-tune the metric to our purposes.</Paragraph> </Section> </Section> </Paper>