<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2005">
  <Title>Tagging Portuguese with a Spanish Tagger Using Cognates</Title>
  <Section position="3" start_page="33" end_page="34" type="intro">
    <SectionTitle>
3 Resources
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
3.1 Tagset
</SectionTitle>
      <Paragraph position="0"> For both Spanish and Portuguese, we used positional tagsets developed on the basis of the Spanish CLiC-TALP tagset (Torruella, 2002). Every tag is a string of 11 symbols, each corresponding to one morphological category. For example, the Portuguese word partires 'you leave' is assigned the tag VM0S---2PI-, because it is a verb (V), main (M), gender is not applicable to this verb form (0), singular (S), case, possessor's number and form are not applicable to this category (-), 2nd person (2), present (P), indicative (I) and participle type is not applicable (-).</Paragraph>
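The positional scheme described above can be illustrated with a minimal Python sketch. The order of the positions is taken from the partires example and the gender values from the paper's footnote on the gender position. The English position names (for instance "subtype") and the function name decode_tag are illustrative assumptions, not part of the tagset definition.

```python
# Position names follow the order of the example in Sect. 3.1. The English
# labels are illustrative assumptions, not the official category names.
POSITIONS = (
    "part_of_speech",
    "subtype",
    "gender",
    "number",
    "case",
    "possessor_number",
    "form",
    "person",
    "tense",
    "mood",
    "participle_type",
)

# Gender values as listed in the paper's footnote on the gender position.
GENDER_VALUES = {
    "M": "masculine",
    "F": "feminine",
    "N": "neuter (certain pronouns)",
    "C": "common (either M or F)",
    "0": "unspecified for this form within the category",
    "-": "category does not distinguish gender",
}


def decode_tag(tag):
    """Split an 11-symbol positional tag into a {category: symbol} mapping."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} symbols, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


decoded = decode_tag("VM0S---2PI-")
print(decoded["person"], decoded["tense"], decoded["mood"])
print(GENDER_VALUES[decoded["gender"]])
```

Because every category has a fixed position, a tag can be decoded by index alone, without a lookup table of whole tags.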
      <Paragraph position="1"> A comparison of the two tagsets is in Table 2.[2] When possible, the Spanish and Portuguese tagsets use the same values; however, some differences are unavoidable. For instance, the pluperfect is a compound verb tense in Spanish, but in Portuguese it is a separate word form that needs a tag of its own. In addition, we added a tag for Portuguese "treatment" pronouns.</Paragraph>
      <Paragraph position="2"> The Spanish tagset has 282 tags, while that for Portuguese has 259 tags.</Paragraph>
    </Section>
    <Section position="2" start_page="33" end_page="34" type="sub_section">
      <SectionTitle>
3.2 Training corpora
</SectionTitle>
      <Paragraph position="0"> Spanish training corpus. The Spanish corpus we use for training the transition probabilities as well as for obtaining Spanish-Portuguese cognate pairs is a fragment (106,124 tokens, 18,629 types) of the Spanish section of CLiC-TALP (Torruella, 2002). CLiC-TALP is a balanced corpus, containing texts of various genres and styles. We automatically translated the CLiC-TALP tagset into our system (see Sect. 3.1) for easier detailed evaluation and for comparison with our previous work that used a similar approach for tagging (Hana et al., 2004; Feldman et al., 2006). [2] Notice that we have 6 possible values for the gender position: M (masc.), F (fem.), N (neutr., for certain pronouns), C (common, either M or F), 0 (unspecified for this form within the category), - (the category does not distinguish gender).</Paragraph>
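The token and type counts reported for the Spanish fragment can be related with a short Python sketch. Whitespace tokenization and case-sensitive types are assumptions made here for illustration; the paper does not state how tokens and types were counted, and the function name token_type_counts and the sample sentence are hypothetical.

```python
from collections import Counter


def token_type_counts(tokens):
    """Return (number of tokens, number of distinct types) for a token list."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Toy input, split on whitespace (an assumption, not the paper's tokenizer).
sample = "la casa y la mesa".split()
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)

# Figures reported in the paper for the Spanish training fragment.
REPORTED_TOKENS = 106124
REPORTED_TYPES = 18629
print(round(REPORTED_TOKENS / REPORTED_TYPES, 2))
```

With the reported figures, each type occurs roughly 5.7 times on average.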
      <Paragraph position="1"> Raw Portuguese corpus. For automatic lexicon acquisition, we use the NILC corpus,[3] containing 1.2M tokens.</Paragraph>
    </Section>
    <Section position="3" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.3 Evaluation corpus
</SectionTitle>
      <Paragraph position="0"> For evaluation purposes, we selected and manually annotated a small portion (1,800 tokens) of the NILC corpus.</Paragraph>
    </Section>
  </Section>
</Paper>