<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0604">
  <Title>Probing the space of grammatical variation: induction of cross-lingual grammatical constraints from treebanks</Title>
  <Section position="4" start_page="21" end_page="23" type="intro">
    <SectionTitle>
2 Subjects and objects in Czech and Italian
</SectionTitle>
    <Paragraph position="0"> Grammatical relations - such as subject (S) and direct object (O) - are variously encoded in languages, the two most widespread strategies being: i) structural encoding through word order, and ii) morpho-syntactic marking. In turn, morpho-syntactic marking can apply either on the noun head only, in the form of case inflections, or on both the noun and the verb, in the form of agreement marking (Croft 2003).</Paragraph>
    <Paragraph position="1"> Besides formal coding, the distribution of subjects and objects is also governed by semantic and pragmatic factors, such as noun animacy, definiteness, topicality, etc. As a result, there exists a variety of linguistic clues jointly cooperating in making a particular noun phrase the subject or direct object of a sentence. Crucially for our present purposes, cross-linguistic variation does not only concern the particular strategy used to encode S and O, but also the relative strength of each factor in a given language. For instance, while English word order is by and large the dominant clue to identify S and O, in other languages the presence of a rich morphological system allows word order to have a much looser connection with the coding of grammatical relations, thus playing a secondary role in their identification. Moreover, there are languages where semantic and pragmatic constraints such as animacy and/or definiteness play a predominant role in the processing of grammatical relations. A large spectrum of variation exists, ranging from languages where S must have a higher degree of animacy and/or definiteness relative to O, to languages where this constraint only takes the form of a softer statistical preference (cf. Bresnan et al. 2001).</Paragraph>
    <Paragraph position="2"> The goal of this paper is to explore an area of this complex space of grammatical variation through careful assessment of the distribution of S and O tokens in Italian and Czech. For our present analysis, we have used a MaxEnt statistical model trained on data extracted from two syntactically annotated corpora: the Prague Dependency Treebank (PDT, Bohmova et al.</Paragraph>
    <Paragraph position="3"> 2003) for Czech, and the Italian Syntactic Semantic Treebank (ISST, Montemagni et al.</Paragraph>
    <Paragraph position="4"> 2003) for Italian. These corpora have been chosen not only because they are the largest syntactically annotated resources for the two languages, but also because of their high degree of comparability, since they both adopt a dependency-based annotation scheme.</Paragraph>
    <Paragraph position="5"> Czech and Italian provide an interesting vantage point for the cross-lingual analysis of grammatical variation. They are both Indo-European languages, but they do not belong to the same branch: Czech is a West Slavonic language, while Italian is a Romance language.</Paragraph>
    <Paragraph position="6"> For our present concerns, they appear to share two crucial features: i) the free order of grammatical relations with respect to the verb; ii) the possible absence of an overt subject.</Paragraph>
    <Paragraph position="7"> Nevertheless, they also differ greatly in two respects: i) the virtual non-existence of case marking in Italian (with the only marginal exception of personal pronouns), and ii) the degree of phrase-order freedom in the two languages. Empirical evidence supporting the latter claim is provided in Table 1, which reports data extracted from PDT and ISST. Notice that although in both languages S and O can occur either pre-verbally or post-verbally, Czech and Italian differ greatly in their propensity to depart from the (unmarked) SVO order. While in Italian pre-verbal O is highly infrequent (1.90%), in Czech more than 30% of O tokens occur before the verb. The situation is similar but somewhat more balanced in the case of S, which occurs post-verbally in 22.21% of the Italian cases, and in 40% of the Czech ones. Admittedly, one can argue that, in spoken Italian, the number of pre-verbal objects is actually higher, because of the greater number of left dislocations and topicalizations occurring in informal speech. However reasonable, this observation does not explain away the distributional differences between the two corpora, since both PDT and ISST contain written language only. We thus suggest that there is clear empirical evidence in favour of a systematically higher phrase-order freedom in Czech, arguably related to the well-known correlation of Czech constituent placement with sentence information structure, with the element carrying new information showing a tendency to occur sentence-finally (Stone 1990). For our present concerns, however, aspects of information structure, albeit central in Czech grammar, were not taken into account, as they happen not to be marked up in the Italian corpus.</Paragraph>
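A minimal sketch of how position figures of this kind could be computed from a dependency treebank. The record layout (relation label, token index, index of the governing verb) and the toy data are assumptions made purely for illustration; they are neither the PDT nor the ISST format, and the resulting shares are not the Table 1 figures.

```python
from collections import Counter

# Hypothetical toy records: (relation, token_index, verb_index).
# Real PDT/ISST annotations use richer, treebank-specific formats.
TOKENS = [
    ("S", 1, 2),
    ("O", 3, 2),
    ("S", 4, 3),
    ("O", 1, 3),
    ("S", 5, 2),
]


def position_shares(tokens):
    """Return, per relation, the share of tokens before and after the verb."""
    counts = Counter()
    totals = Counter()
    for relation, token_index, verb_index in tokens:
        # A dependent that precedes its governing verb is pre-verbal.
        side = "pre" if verb_index > token_index else "post"
        counts[(relation, side)] += 1
        totals[relation] += 1
    return {
        relation: {
            side: counts[(relation, side)] / total
            for side in ("pre", "post")
        }
        for relation, total in totals.items()
    }


shares = position_shares(TOKENS)
for relation in sorted(shares):
    print(relation, shares[relation])
```

Comparing the "pre" share of O across two corpora processed this way would give the kind of contrast discussed above, provided both treebanks are mapped onto the same record layout first.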
    <Paragraph position="8"> According to the data reported in Table 1, Czech and Italian show similar correlation patterns between animacy and grammatical relations. S and O in ISST were automatically annotated for animacy using the SIMPLE Italian computational lexicon (Lenci et al. 2000) as a background semantic resource. The annotation was then checked manually. Czech S and O were annotated for animacy using Czech WordNet (Pala and Smrz 2004); it is worth remarking that in Czech animacy annotation was done only automatically, without any manual revision.</Paragraph>
    <Paragraph position="9"> Italian shows a prominent asymmetry in the distribution of animate nouns in subject and object roles: over 50% of ISST subjects are animate, while only 10% of the objects are animate. Such a trend is also confirmed in Czech - although to a lesser extent - with 34.10% of animate subjects vs. 15.42% of objects. (Footnote 1: In fact, the considerable difference in animacy distribution between the two languages might only be an artefact of the way we annotated Czech nouns semantically, on the basis of their context-free classification in the Czech WordNet.) Such an overwhelming preference for animate subjects in corpus data suggests that animacy may play a very important role for S and O identification in both languages.</Paragraph>
    <Paragraph position="10"> Corpus data also provide interesting evidence concerning the actual role of morpho-syntactic constraints in the distribution of grammatical relations. Prima facie, agreement and case are the strongest and most directly accessible clues for S/O processing, as they are marked both overtly and locally. This is also confirmed by psycholinguistic evidence, showing that experimental participants tend to rely on these clues to identify S/O.</Paragraph>
    <Paragraph position="11"> However, it should be observed that agreement can be relied upon conclusively in S/O processing only when a nominal constituent and a verb do not agree in number and/or person (as in leggono il libro '(they) read the book').</Paragraph>
    <Paragraph position="12"> Conversely, when N and V share the same person and number, no conclusion can be drawn, as trivially shown by a sentence like il bambino legge il libro 'the child reads the book'. In ISST, more than 58% of O tokens agree with their governing V, thus being formally indistinguishable from S on the basis of agreement features. PDT exhibits a similar ratio, with 56% of O tokens agreeing with their verb head. Analogous considerations apply to case marking, whose perceptual reliability is undermined by morphological syncretism, whereby different cases are realized through the same marker. Czech data reveal the massive extent of this phenomenon and its impact on S and O identification (SOI).</Paragraph>
    <Paragraph position="13"> As reported in Table 2, more than 56% of O tokens extracted from PDT are formally indistinguishable from S in case ending.</Paragraph>
    <Paragraph position="14"> Similarly, 45% of S tokens are formally indistinguishable from O on the same grounds. All in all, this means that in 50% of the cases a Czech noun cannot be identified as the S or O of a sentence by relying on overt case marking only.</Paragraph>
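The agreement and case-syncretism arguments above can be sketched as two small checks. The dictionary layout, the function names and the simplified case sets are hypothetical and chosen only for illustration; they do not reproduce the feature extraction actually applied to PDT or ISST.

```python
def agreement_rules_out_subject(noun, verb):
    """True when noun and verb disagree, so the noun cannot be the subject."""
    return (
        noun["person"] != verb["person"]
        or noun["number"] != verb["number"]
    )


def case_is_ambiguous(possible_cases):
    """True when a form is compatible with both nominative and accusative."""
    return {"Nom", "Acc"}.issubset(possible_cases)


# Italian examples from the text (feature values entered by hand).
leggono = {"person": 3, "number": "pl"}
legge = {"person": 3, "number": "sg"}
il_libro = {"person": 3, "number": "sg"}
il_bambino = {"person": 3, "number": "sg"}

# 'leggono il libro': the noun disagrees with the verb, so it is not S.
conclusive = agreement_rules_out_subject(il_libro, leggono)

# 'il bambino legge il libro': both nouns agree, so agreement decides nothing.
undecided = [
    agreement_rules_out_subject(noun, legge)
    for noun in (il_bambino, il_libro)
]

# Simplified, illustrative case sets for two Czech word forms.
mesto_ambiguous = case_is_ambiguous({"Nom", "Acc"})  # 'město'
zenu_ambiguous = case_is_ambiguous({"Acc"})          # 'ženu'

print(conclusive, undecided, mesto_ambiguous, zenu_ambiguous)
```

Counting how often such checks leave a token undecided is one way ratios like those reported for agreement and case marking could be obtained, under the assumptions stated above.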
    <Paragraph position="15"> To sum up, corpus data lend support to the idea that in both Italian and Czech SOI is governed by a complex interplay of probabilistic constraints of a different nature (morpho-syntactic, semantic, word order, etc.), as the latter are neither singly necessary nor jointly sufficient to attack the processing task at hand. It is tempting to hypothesize that the joint distribution of these data can provide a statistically reliable basis upon which relevant probabilistic constraints are bootstrapped and combined consistently. This should be possible due to i) the different degrees of clue salience in the two languages and ii) the functional need to minimize processing ambiguity in ordinary communicative exchanges. With reference to the latter point, for example, we may surmise that a speaker will be more inclined to violate one constraint on S/O distribution (e.g. word order) when another clue is available (e.g. animacy) that strongly supports only the intended interpretation. The following section illustrates how a MaxEnt model can be used to model these intuitions by bootstrapping constraints and their interaction from language data.</Paragraph>
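As a rough illustration of how soft constraints of this kind can be combined, the sketch below uses the standard conditional log-linear (MaxEnt) form, in which the probability of a label is proportional to the exponential of the summed weights of the active features. The feature names and hand-set weights are hypothetical; they are not the feature set or the estimates of the model described in the paper, where weights would be learned from treebank data.

```python
import math

# Hypothetical weights for (feature, label) pairs; a real model would
# estimate them from treebank data rather than set them by hand.
WEIGHTS = {
    ("preverbal", "S"): 1.0,
    ("postverbal", "O"): 1.0,
    ("animate", "S"): 0.8,
    ("inanimate", "O"): 0.4,
    ("agrees", "S"): 0.5,
}
LABELS = ("S", "O")


def maxent_probs(features):
    """Conditional distribution p(label | features) in log-linear form."""
    scores = {
        label: math.exp(sum(WEIGHTS.get((f, label), 0.0) for f in features))
        for label in LABELS
    }
    z = sum(scores.values())  # normalising constant
    return {label: score / z for label, score in scores.items()}


# A post-verbal but animate, agreeing noun: word order favours O,
# while animacy and agreement favour S.
probs = maxent_probs(["postverbal", "animate", "agrees"])
print(probs)
```

With these illustrative weights, the animacy and agreement clues together outweigh the word-order clue, which mirrors the intuition that one constraint may be violated when other clues support the intended reading.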
  </Section>
</Paper>