<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1038">
  <Title>Lexicalization in Crosslinguistic Probabilistic Parsing: The Case of French</Title>
  <Section position="3" start_page="306" end_page="307" type="intro">
    <SectionTitle>
2 The French Treebank
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="306" end_page="306" type="sub_section">
      <SectionTitle>
2.1 Annotation Scheme
</SectionTitle>
      <Paragraph position="0"> The French Treebank (FTB; Abeill'e et al. 2000) consists of 20,648 sentences extracted from the daily newspaper Le Monde, covering a variety of authors and domains (economy, literature, politics, etc.).1 The corpus is formatted in XML and has a rich morphosyntactic tagset that includes part-of-speech tag, 'subcategorization' (e.g., possessive or cardinal), inflection (e.g., masculine singular), and lemma information. Compared to the Penn Treebank (PTB; Marcus et al. 1993), the POS tagset of the French Treebank is smaller (13 tags vs. 36 tags): all punctuation marks are represented as the single PONCT tag, there are no separate tags for modal verbs, whwords, and possessives. Also verbs, adverbs and prepositions are more coarsely defined. On the other hand, a separate clitic tag (CL) for weak pronouns is introduced. An example for the word-level annotation in the FTB is given in Figure 1 The phrasal annotation of the FTB differs from that for the Penn Treebank in several aspects. There is no verb phrase: only the verbal nucleus (VN) is annotated. A VN comprises the verb and any clitics, auxiliaries, adverbs, and negation associated with it.</Paragraph>
      <Paragraph position="1"> This results in a flat syntactic structure, as in (1).</Paragraph>
      <Paragraph position="3"> arr^et'es)) 'are systematically arrested' The noun phrases (NPs) in the FTB are also flat; a noun is grouped together with any associated determiners and prenominal adjectives, as in example (2). Note that postnominal adjectives, however, are adjoined to the NP in an adjectival phrase (AP).</Paragraph>
      <Paragraph position="4">  Treebank: d'entre 'between' (catint: compoundinternal POS tag) (2) (NP (D des) (A petits) (N mots) (AP (ADV tr`es) (A gentils))) 'small, very gentle words' Unlike the PTB, the FTB annotates coordinated phrases with the syntactic tag COORD (see the left panel of Figure 3 for an example).</Paragraph>
      <Paragraph position="5"> The treatment of compounds is also different in the FTB. Compounds in French can comprise words which do not exist otherwise (e.g., insu in the compound preposition `a l'insu de 'unbeknownst to') or can exhibit sequences of tags otherwise ungrammatical (e.g., `a la va vite 'in a hurry': Prep + Det + finite verb + adverb). To account for these properties, compounds receive a two-level annotation in the FTB: a subordinate level is added for the constituent parts of the compound (both levels use the same POS tagset). An example is given in Figure 2.</Paragraph>
      <Paragraph position="6"> Finally, the FTB differs from the PTB in that it does not use any empty categories.</Paragraph>
    </Section>
    <Section position="2" start_page="306" end_page="307" type="sub_section">
      <SectionTitle>
2.2 Data Sets
</SectionTitle>
      <Paragraph position="0"> The version of the FTB made available to us (version 1.4, May 2004) contains numerous errors. Two main classes of inaccuracies were found in the data: (a) The word is present but morphosyntactic tags are missing; 101 such cases exist. (b) The tag information for a word (or a part of a compound) is present but the word (or compound part) itself is missing. There were 16,490 instances of this error in the dataset.</Paragraph>
      <Paragraph position="1"> Initially we attempted to correct the errors, but this proved too time consuming, and we often found that the errors cannot be corrected without access to the raw corpus, which we did not have. We therefore decided to remove all sentences with errors, which lead to a reduced dataset of 10,552 sentences.</Paragraph>
      <Paragraph position="2"> The remaining data set (222,569 words at an average sentence length of 21.1 words) was split into a training set, a development set (used to test the parsing models and to tune their parameters), and a test set, unseen during development. The training set consisted of the first 8,552 sentences in the corpus, with the following 1000 sentences serving as the development set and the final 1000 sentences forming the test set. All results reported in this paper were obtained on the test set, unless stated otherwise.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>