<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1902">
  <Title>PIPCA - Unisinos</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> Our first annotation schemes were Prolog lists of treebank sentences and their noun phrases (NPs), as shown in Figure 1. The lists were extracted from Lisp lists of the Penn Treebank. These lists were manipulated in our experiments on coreference annotation and resolution.</Paragraph>
    <Paragraph position="1"> The results of coreference annotation were lists of Prolog facts ddc(Index1,Index2,Code), as shown in Figure 2. Index1 refers to the sequential numbering of definite descriptions; Index2 refers to the sequential numbering of noun phrases; and Code refers to their classification according to discourse status (Poesio and Vieira, 1998). For some of them there were also ddsr facts indicating their antecedent NPs.</Paragraph>
    <Paragraph position="2"> [S,[NP,the,squabbling,[PP,within,[NP,the,Organization,[PP,of,[NP,Petroleum,Exporting,Countries]]]]],[VP,seems,[PP,under,[NP,control]],[PP,for,now]].]. [NP,Petroleum,Exporting,Countries].</Paragraph>
    <Paragraph position="3"> [NP,the,Organization,[PP,of,[NP,Petroleum,Exporting,Countries]]].</Paragraph>
    <Paragraph position="4"> [NP,the,squabbling,[PP,within,[NP,the,Organization...</Paragraph>
    <Paragraph position="5"> ...</Paragraph>
    <Paragraph position="6"> ddc(5, 16, r).</Paragraph>
    <Paragraph position="7"> ddc(6, 18, k).</Paragraph>
    <Paragraph position="8"> ddc(7, 28, r).</Paragraph>
    <Paragraph position="9"> ...</Paragraph>
    <Paragraph position="10"> ddsr(5, 16, r, np(5)).</Paragraph>
    <Paragraph position="11"> ddsr(7, 28, r, np(16)).</Paragraph>
    <Paragraph position="12"> ...</Paragraph>
    <Paragraph position="14"> We could only link the annotation to the data by running the Prolog code that loaded the lists of NPs and sentences and generated their indexes. Although we had carried out intensive research with these resources and tools, the re-use of our data in other environments was very difficult.</Paragraph>
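    <Paragraph> The re-use problem described above can be made concrete with a minimal sketch, assuming only the surface syntax of the facts listed in Figure 2. The record type CorefFact, the function parse_fact and the reading of the np(N) argument as the index of the antecedent NP are illustrative assumptions, not part of the original Prolog tools.
<![CDATA[
```python
import re
from typing import NamedTuple, Optional


class CorefFact(NamedTuple):
    dd_index: int                 # Index1: sequential number of the definite description
    np_index: int                 # Index2: sequential number of the noun phrase
    code: str                     # discourse-status code as written in the fact, e.g. "r" or "k"
    antecedent_np: Optional[int]  # ASSUMPTION: N in np(N) identifies the antecedent NP


# Matches "ddc(5, 16, r)." and "ddsr(7, 28, r, np(16))." and nothing else.
FACT_RE = re.compile(
    r"^(?:ddc|ddsr)\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\w+)\s*(?:,\s*np\((\d+)\)\s*)?\)\.$"
)


def parse_fact(line):
    """Return a CorefFact for a ddc/ddsr fact, or None for any other line."""
    match = FACT_RE.match(line.strip())
    if match is None:
        return None
    dd, np_, code, ante = match.groups()
    return CorefFact(int(dd), int(np_), code, int(ante) if ante is not None else None)


facts = [parse_fact(s) for s in ("ddc(6, 18, k).", "ddsr(7, 28, r, np(16)).")]
print(facts)
```
]]>
    Even when parsed this way, the indexes remain meaningful only relative to the numbering produced by the original loading code, which is the limitation noted above.</Paragraph>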
    <Paragraph position="15"> Despite the lack of fully annotated data for Portuguese, we tested whether the same heuristics we used for English would be suitable for this new language. To test our heuristics we used the PALAVRAS parser (Bick, 2000) to parse a Portuguese corpus. From the parsed texts we extracted Prolog lists of NPs, as illustrated in Figure 3. Experiments were carried out over these resources. Heuristics for coreference resolution were adapted to Portuguese and the results obtained were comparable to those previously obtained for English. However, the Portuguese resolver and annotated data still raised the same re-usability problems as for English, since the encoding format had not evolved.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 COMMOn-REFs
</SectionTitle>
    <Paragraph position="0"> In the COMMOn-REFs project we face the challenge of dealing with different languages (French and Portuguese), corpora and tools, initially available under different formats. We adopted MMAX as our manual annotation tool. With MMAX we could annotate our corpus according to our theoretical principles. The following corpus studies were developed with the aid of the tool: (Salmon-Alt and Vieira, 2002; Vieira et al., 2002b; Vieira et al., 2002a). In these studies, our annotation targets were manually marked and coreference information was added to them according to subjects' analysis of the texts.</Paragraph>
    <Paragraph position="1"> We are currently developing a coreference resolution tool on the basis of XML files and XSL scripts. The tool manipulates several levels of linguistic information. Parsing information has been provided by the PALAVRAS parser. The parser output is transformed into two XML files: one with POS and another with syntactic information (chunks).</Paragraph>
    <Paragraph position="2"> Coreference information, manually annotated with MMAX (markables), is used for evaluation. Our tool, besides manipulating three different annotation levels (POS, chunks, markables), creates two others: anaphors and candidates, as detailed in Section 5. As we are interested in having our resources made available, we relate our annotation schemes to standard proposals presented in (Ide and Romary, 2002; Ide and Romary, 2003; Ide and Romary, 2001).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Data model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Encoding standards
</SectionTitle>
      <Paragraph position="0"> Directions for standard corpus encoding in XML have been proposed in (Ide and Romary, 2002; Ide and Romary, 2003; Ide and Romary, 2001). Such efforts consist in defining abstract formats for corpus annotation that can be instantiated according to specific project requirements. An abstract XML file can be generated for each annotation level according to a Virtual Annotation Markup Language (VAML). The structure of this language is defined by a skeleton that consists of &lt;struct&gt; elements (a node/level in the annotation) and &lt;feat&gt; elements (a feature attached to the enclosing &lt;struct&gt; node).</Paragraph>
      <Paragraph position="1"> Particular project specifications are defined through data categories (component categories to be annotated) and dialect (encoding style). On the basis of these specifications, a mapping between VAML and Concrete AML (CAML) can be made. CAML is the language used for annotation encoding in particular projects.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Our schemes
</SectionTitle>
      <Paragraph position="0"> Our first experiments with MMAX were on manual coreference annotation. The tool required specific input and output formats. Our corpus, that is, the primary data, were first converted from raw texts to XML, encoded as &lt;word&gt; elements, like the example in Figure 4 for the sentences O jogador pode deixar o time. Ele recebeu uma proposta excelente.</Paragraph>
      <Paragraph position="1"> (The player may leave the club. He received an excellent proposal.) Each corpus token (words and punctuation) corresponds to a &lt;word&gt;.</Paragraph>
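      <Paragraph> A minimal sketch of this conversion follows, assuming a simple regular-expression tokenizer and a hypothetical &lt;words&gt; wrapper element; neither is taken from the MMAX input specification, and only the &lt;word&gt; element and the word_N identifier pattern come from our examples.
<![CDATA[
```python
import re
import xml.etree.ElementTree as ET


def words_to_xml(text):
    """Encode every token (word or punctuation mark) as a <word> element."""
    root = ET.Element("words")  # hypothetical wrapper element
    # ASSUMPTION: a token is a run of word characters or one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    for number, token in enumerate(tokens, start=1):
        word = ET.SubElement(root, "word", id=f"word_{number}")
        word.text = token
    return root


root = words_to_xml(
    "O jogador pode deixar o time. Ele recebeu uma proposta excelente."
)
print(ET.tostring(root, encoding="unicode"))
```
]]>
      Because the underscore counts as a word character in this tokenizer, a compound already joined as Sao_Paulo would stay a single token.</Paragraph>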
      <Paragraph position="2"> The coreference was manually annotated and encoded as &lt;markable&gt; elements. Each anaphoric expression and each antecedent was represented by a markable. Anaphor markables had an extra attribute &quot;pointer&quot;, which refers to the antecedent markable. An example of a markable file is presented in Figure 5(a). Markables correspond to our final level of annotation. The &quot;span&quot; attribute refers to our primary data, the words. The other attributes (coref, classific) were specified according to our application. Figure 5(b) represents the abstract XML encoding for our markables file, according to VAML. (Footnote 4: As we are not aware of a registry of data categories for the coreference level, in our examples throughout the paper we often use the same vocabulary in abstract and concrete encodings.) Pointer, coreference and classification compose our dialect vocabulary for the following data categories: antecedent and types of discourse status, as in (Poesio and Vieira, 1998), inspired by (Prince, 1981; Prince, 1992). According to our dialect instantiation style, the data categories are represented as attributes of &lt;markable&gt; elements.</Paragraph>
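      <Paragraph> The relation between the concrete and the abstract encoding of a markable can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Figure 5: the function markable_to_struct, the type value "markable", the use of a ref attribute for the span, and all attribute values in the sample are hypothetical; only the &lt;struct&gt;/&lt;feat&gt; skeleton and the attribute names pointer, coref and classific come from the text.
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Dialect attributes that become <feat> elements in the abstract encoding.
FEATURE_ATTRIBUTES = ("pointer", "coref", "classific")


def markable_to_struct(markable):
    """Map one concrete <markable> to an abstract <struct> with <feat> children."""
    struct = ET.Element(
        "struct",
        id=markable.get("id"),
        type="markable",            # ASSUMPTION: type label for this level
        ref=markable.get("span"),   # ASSUMPTION: span carried over as a reference
    )
    for name in FEATURE_ATTRIBUTES:
        value = markable.get(name)
        if value is not None:
            feat = ET.SubElement(struct, "feat", type=name)
            feat.text = value
    return struct


# Hypothetical markable; the identifiers and values are invented for illustration.
markable = ET.fromstring(
    '<markable id="markable_2" span="word_8" pointer="markable_1" '
    'coref="yes" classific="direct"/>'
)
struct = markable_to_struct(markable)
print(ET.tostring(struct, encoding="unicode"))
```
]]>
      The mapping is purely structural: each data category that the dialect writes as an attribute is rewritten as a feature of the enclosing node.</Paragraph>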
      <Paragraph position="3"> To develop a tool for automatic coreference resolution, we needed to consider other, intermediate levels of annotation: the sources of linguistic information used for resolving anaphora.</Paragraph>
      <Paragraph position="4"> Our corpus was analysed by the Portuguese parser PALAVRAS (Bick, 2000). The original format of PALAVRAS output is not standard. As previously presented in Figure 3, on each line of the figure: (i) the first symbol represents the syntactic function ('SUBJ'=subject, 'N'=noun modifier, 'H'=head, etc.); (ii) following ':', we have the syntactic form for groups of words ('np'=noun phrase, etc.) and POS tags for single words ('n'=noun, 'v'=verb, etc.); (iii) in brackets are the word's canonical form and other inflectional tags; (iv) after the brackets comes the word as it occurs in the corpus.</Paragraph>
      <Paragraph position="5"> The '=' signs at the beginning of each line represent the level of the phrase in the parse tree. (Footnote 5: A complete description of the tagset symbols is available at http://visl.hum.sdu.dk/visl/pt/info/symbolset-manual.html.) We defined an XML encoding for PALAVRAS output. We split PALAVRAS output into two annotation levels, one for POS and another for syntactic data. Figure 6(a) shows our scheme for the POS file.</Paragraph>
      <Paragraph position="6"> The corresponding abstract XML file is presented in Figure 6(b). Our data categories are word canonical form (lemma), pos, gender, number, person, tense, mode, and case. According to our dialect instantiation style, each POS data category is represented by a new XML element; the other inflectional tags are encoded as attributes of this element. By handling a parsed corpus we could treat compounds at word level; the multi-word expression &quot;Sao Paulo&quot;, for example, is tokenised as one word and encoded as &lt;word id=&quot;word_9&quot;&gt;Sao_Paulo&lt;/word&gt;.</Paragraph>
      <Paragraph position="7"> We encode syntactic data as chunks. Each syntactic structure is represented by a &lt;chunk&gt; element. Figure 7(a) shows our encoding. The mapping to abstract XML is presented in Figure 7(b). In our dialect, each &lt;chunk&gt; in the concrete XML encoding corresponds to a &lt;struct&gt; in the abstract one.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Automatic coreference resolution
</SectionTitle>
    <Paragraph position="0"> The tool we are developing for anaphora resolution takes as input word, POS and chunk files (the architecture design is shown in Figure 8). The resolution process is performed by a set of stylesheets, each one representing a different heuristic. This set is called the Resolution Heuristics Base (RHB). A stylesheet is connected to another through pipes and it filters the information flowing through the system (Gamma et</Paragraph>
    <Paragraph position="2"> &lt;struct id=&quot;s0&quot; type=&quot;T&quot;&gt; &lt;struct id=&quot;s1&quot; type=&quot;P&quot;&gt; &lt;struct id=&quot;s2&quot; type=&quot;S&quot; span=&quot;word_1..word_7&quot;&gt; &lt;struct id=&quot;s3&quot; type=&quot;NP&quot; rel=&quot;subj&quot; ref=&quot;word_1..word_2&quot;&gt; &lt;struct id=&quot;s4&quot; type=&quot;art&quot; rel=&quot;n-mod&quot; ref=&quot;word_1&quot;/&gt; &lt;struct id=&quot;s5&quot; type=&quot;n&quot; rel=&quot;h&quot; ref=&quot;word_2&quot;/&gt; &lt;/struct&gt; &lt;struct id=&quot;s6&quot; type=&quot;VP&quot; rel=&quot;p&quot; ref=&quot;word_3..word_4&quot;&gt; &lt;struct id=&quot;s7&quot; type=&quot;v&quot; rel=&quot;aux&quot; ref=&quot;word_3&quot;/&gt; &lt;struct id=&quot;s8&quot; type=&quot;v&quot; rel=&quot;h&quot; ref=&quot;word_4&quot;/&gt; &lt;/struct&gt; &lt;struct id=&quot;s9&quot; type=&quot;NP&quot; rel=&quot;acc&quot; ref=&quot;word_5..word_6&quot;&gt; &lt;struct id=&quot;s10&quot; type=&quot;art&quot; rel=&quot;n-mod&quot; ref=&quot;word_5&quot;/&gt; &lt;struct id=&quot;s11&quot; type=&quot;n&quot; rel=&quot;h&quot; ref=&quot;word_6&quot;/&gt; [...] when necessary. Our tool strategy follows four main steps: anaphor selection, candidate selection, resolution, and output generation.</Paragraph>
    <Paragraph position="3"> Two new intermediate annotation levels are generated: the anaphor entities (represented by &lt;anaphor&gt; elements) and antecedent candidates (represented by &lt;candidate&gt; elements).</Paragraph>
    <Paragraph position="4"> The &lt;candidate&gt; represents possible antecedents in the corpus, and it also has a &quot;span&quot; attribute (Figure 9(a)). Different candidate sets can be generated according to the heuristics used for their selection. [...] demonstrates the corresponding VAML encoding. The &lt;anaphor&gt; depicts the anaphoric noun phrases (pronouns, definite descriptions, demonstratives) and it has the attribute &quot;span&quot; (Figure 9(b)). Through the &quot;span&quot; value we can get the information from the input files (words, POS, chunks) needed for the resolution process.</Paragraph>
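    <Paragraph> How a &quot;span&quot; value leads back to the primary data can be sketched as below, assuming the first..last range notation seen in our abstract examples (for instance word_1..word_2). The function names expand_span and span_text and the dictionary that maps word identifiers to tokens are hypothetical helpers introduced for illustration.
<![CDATA[
```python
def expand_span(span):
    """Expand "word_5..word_6" into ["word_5", "word_6"]; a single id is kept as is."""
    if ".." not in span:
        return [span]
    first, last = span.split("..")
    prefix, start = first.rsplit("_", 1)
    end = last.rsplit("_", 1)[1]
    return [f"{prefix}_{number}" for number in range(int(start), int(end) + 1)]


def span_text(span, words):
    """Join the tokens that a span covers, looked up in an id -> token mapping."""
    return " ".join(words[word_id] for word_id in expand_span(span))


# Tokens of the first example sentence, keyed by their word identifiers.
words = {
    "word_1": "O", "word_2": "jogador", "word_3": "pode", "word_4": "deixar",
    "word_5": "o", "word_6": "time", "word_7": ".",
}
print(span_text("word_1..word_2", words))
print(span_text("word_5..word_6", words))
```
]]>
    The same lookup would apply to the POS and chunk files, since every level anchors its elements to word identifiers.</Paragraph>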
    <Paragraph position="5"> During the resolution process other attributes are added to anaphor elements, such as &quot;coref&quot;, &quot;pointer&quot; and &quot;classif&quot;, as seen in Figure 10(a). Figure 10(b) represents the corresponding VAML encoding for the &lt;anaphor&gt; elements.</Paragraph>
    <Paragraph position="6"> The heuristics to be applied to resolve coreference are based on previous studies about resolution of referring expressions (Vieira and Poesio, 2000; Lappin and Leass, 1994; Strube et al., 2002) and they are not discussed here.</Paragraph>
    <Paragraph position="7"> Output generation is the last step in the process; it is also performed by a stylesheet that translates the &lt;anaphor&gt; nodes into &lt;markable&gt; ones, so that the results can be visualized using the MMAX tool.</Paragraph>
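    <Paragraph> The flow from heuristics to output can be sketched as a chain of filters. The sketch below is written in Python rather than XSL and is only an analogy to the stylesheet pipeline: the heuristic nearest_preceding is a placeholder invented for illustration and is not one of the heuristics of the tool, and the wrapper elements, identifiers and the value "yes" for coref are likewise hypothetical.
<![CDATA[
```python
import xml.etree.ElementTree as ET


def first_index(span):
    """Return the number of the first word in a span such as "word_5..word_6"."""
    return int(span.split("..")[0].rsplit("_", 1)[1])


def nearest_preceding(anaphors, candidates):
    """PLACEHOLDER heuristic: point each unresolved anaphor to the closest
    candidate that starts before it. It only stands in for a real heuristic."""
    for anaphor in anaphors:
        if anaphor.get("pointer") is not None:
            continue  # already resolved by an earlier filter in the chain
        position = first_index(anaphor.get("span"))
        before = [c for c in candidates if first_index(c.get("span")) < position]
        if before:
            best = max(before, key=lambda c: first_index(c.get("span")))
            anaphor.set("pointer", best.get("id"))
            anaphor.set("coref", "yes")
    return anaphors


def to_markables(anaphors):
    """Output step: copy every <anaphor> into a <markable> with the same attributes."""
    output = ET.Element("markables")  # hypothetical wrapper element
    for anaphor in anaphors:
        ET.SubElement(output, "markable", dict(anaphor.attrib))
    return output


def run_pipeline(anaphors, candidates, heuristics):
    """Apply the heuristics in order, then generate the output level."""
    for heuristic in heuristics:
        anaphors = heuristic(anaphors, candidates)
    return to_markables(anaphors)


anaphors = ET.fromstring('<anaphors><anaphor id="anaphor_1" span="word_8"/></anaphors>')
candidates = ET.fromstring(
    '<candidates><candidate id="candidate_1" span="word_1..word_2"/>'
    '<candidate id="candidate_2" span="word_5..word_6"/></candidates>'
)
result = run_pipeline(anaphors, candidates, [nearest_preceding])
print(ET.tostring(result, encoding="unicode"))
```
]]>
    Each filter leaves already resolved anaphors untouched, so heuristics can be added to or removed from the chain without changing the output step.</Paragraph>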
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> We have presented the evolution of our annotation schemes over 7 years of corpus research. We believe that an orientation towards standards may shed some light for those who are defining their projects. Concerning relations between annotation levels, our annotation is based on object-based anchoring, especially because our primary data is represented by XML elements (words in our dialect, basic struct elements with id attributes in VAML).</Paragraph>
    <Paragraph position="1"> Considering relations like parallelism, alternatives and aggregation (Ide and Romary, 2002), we see that our model includes aggregation at the chunk level. When studying annotation agreement we need to represent alternative data according to the judgment of each annotator (although we have adopted duplicated annotated files previously in our project). Previous work on encoding standards has mentioned mainly POS and syntactic annotation. In this paper we extended their use to coreference annotation. Our data model could be adequately mapped to the standards.</Paragraph>
    <Paragraph position="2"> An issue raised by coreference annotation is the need for two references to primary data in the same &lt;struct&gt;, one for the anaphor (target) and another for its antecedent. In our examples, we encoded the reference to primary data indicating the antecedent by &lt;feat&gt; elements with attribute type=&quot;pointer&quot;. An advantage we could expect to gain from work related to standards is knowledge about how encoding decisions affect data-handling performance.</Paragraph>
    <Paragraph position="3"> Our project deals with different input and output formats. We intend to share our results and compare our techniques to different ones for anaphora resolution. Since we use XML for external and internal encoding, and there is a mapping between them and standard formats, such as VAML, we will be able to import and export the corresponding VAML for our CAML and share both our resources and tools.</Paragraph>
  </Section>
</Paper>