<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0710">
  <Title>Reference Resolution over a Restricted Domain: References to Documents</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Components of a Fully Automated
Ref2doc System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> Within the overall goal of a fully automated understanding of references to documents in meeting dialogs, several related sub-tasks can be distinguished, most simply envisaged as separate processes in a computational architecture:  1. Generate a transcript of the utterances produced by each speaker.</Paragraph>
      <Paragraph position="1"> 2. Detect the REs from the transcripts that make references to the documents of the meeting.</Paragraph>
      <Paragraph position="2"> 3. Generate a formal representation of the documents: articles, titles, etc.</Paragraph>
      <Paragraph position="3"> 4. Connect or match each RE to the document element it refers to.</Paragraph>
      <Paragraph position="4">  Each of these components can be further subdivided. Our main focus here is task (4). For this task, an evaluation procedure, an algorithm, and its evaluation are provided respectively in Sections 4.3, 5, and 6. Task (3) is discussed below in Section 3.2.1. Task (1), which amounts more or less to automated speech recognition, is of course a standard one, for which the performance level, as measured by the word error rate (WER), depends on the microphone used, the environment, the type of the meeting, etc. To factor out these problems, which are far beyond the scope of this paper, we use manual transcripts of recorded meetings (see Section 4.2.1).</Paragraph>
      <Paragraph position="5"> The present separation between tasks (2) and (4) needs further explanation--see also (van Deemter and Kibble, 2000; Popescu-Belis, 2003) for more details. Our interest here is the construction of reference links between REs and document elements (from which coreference can be inferred), so we do not focus on task (2). Instead, we use a set of REs identified by humans.</Paragraph>
      <Paragraph position="6"> Task (2) is not trivial, but could be carried out using a repertoire of pattern-matching rules. The patterns of the manually detected REs shown in Table 1 (Section 4.4) are a first step in this direction. The difficulty is that task (2) sometimes proposes candidate REs for which only task (4) can decide whether they can actually be matched to a document element. For instance, REs such as pronouns ('it') or deictics ('this') that refer to document elements can only be detected using a combination of (2) and (4). This is one of our future goals.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Construction of the Logical Structure of
Documents
</SectionTitle>
      <Paragraph position="0"> Inferring the structure of a document from its graphical aspect is a task that can be automated with good performance, as explained elsewhere (Hadjar et al., 2004). Here, the documents are front pages of newspapers, in French. We first define the template of document structures, then summarize the construction method.</Paragraph>
      <Paragraph position="1">  Many levels of abstraction are present in the layout and content of a document. They are conveyed by its various structures: thematic, physical, logical, relational or even temporal. The form of a document, i.e. its layout and its logical structure, carries important (and often underestimated) clues about the content, in particular for newspaper pages, where articles are organized by zones and titles are clearly marked.</Paragraph>
      <Paragraph position="2"> We consider that newspaper front pages have a hierarchical structure, which can be expressed using a very simple ontology. This is summarized in Figure 1 using a DTD-like declaration, as the document structure is encoded in XML.</Paragraph>
      <Paragraph position="3"> For instance, the first rule in Figure 1 states that a Newspaper front page bears the newspaper's Name, the Date, one Master Article, zero, one or more Highlights, one or more Articles, etc. Each content element has an ID attribute bearing a unique index.</Paragraph>
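To make the encoding concrete, here is a minimal Python sketch that builds such a front-page tree; the element names follow the first rule described above, but the concrete names and content values are assumptions (the authoritative declaration is the DTD in Figure 1):

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the first rule above: a Newspaper front page
# bears the newspaper's Name, the Date, one MasterArticle, zero or more
# Highlights, and one or more Articles. Names and values are illustrative.
page = ET.Element("Newspaper")
ET.SubElement(page, "Name").text = "Le Temps"
ET.SubElement(page, "Date").text = "2003-03-18"
master = ET.SubElement(page, "MasterArticle", {"ID": "a1"})
ET.SubElement(master, "Title").text = "..."
ET.SubElement(page, "Highlight", {"ID": "h1"})
ET.SubElement(page, "Article", {"ID": "a2"})

# Each content element carries a unique ID attribute, so REs can later
# point at it with an XPath expression.
print(ET.tostring(page, encoding="unicode"))
```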
      <Paragraph position="4">  The document structure can be extracted automatically from the PDF version of a document, along with a logical representation of the layout. Our approach merges low-level extraction methods applied to PDF files with layout analysis of a synthetically generated TIFF image (Hadjar et al., 2004). A segmentation algorithm first extracts the threads, frames and text lines from the image, then separates image and text zones, and finally merges lines into homogeneous blocks. In parallel, the objects contained in the PDF file (text, images, and graphics) are extracted and matched with the result of the layout analysis; for instance, text is associated with physical (graphical) blocks. Finally, the cleaned PDF is parsed into a unique tree, which can be transformed either into SVG or into an XML document, and used for various applications.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation Method and Data
</SectionTitle>
    <Paragraph position="0"> Two important elements for testing are the available data (4.2), which must be specifically annotated (4.1), and a scoring procedure (4.3), which is quite straightforward and provides several scores.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Annotation Model
</SectionTitle>
      <Paragraph position="0"> The annotation model for the references to documents builds upon a shallow dialog analysis model (Popescu-Belis et al., 2004), implemented in XML.</Paragraph>
      <Paragraph position="1"> The main idea is to add external annotation blocks that do not alter the master resource--here the timed meeting transcription, divided into separate channels. However, REs are annotated on the dialog transcription itself. A more principled solution, but more complex to implement, would be to index the master transcriptions by the number of words, then externalize the annotation of REs as well (Salmon-Alt and Romary, 2004).</Paragraph>
      <Paragraph position="2"> As shown in Figure 2, the ref pointers from the REs to the document elements are grouped in a ref2doc block at the end of the document, using as attributes the index of the RE (er-id), the document filename (doc-file), and an XPath expression (doc-id) that refers to a document element from the XML document representation.</Paragraph>
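Read back programmatically, such a block yields a mapping from REs to document elements. The sketch below uses a hypothetical serialization: only the er-id, doc-file, and doc-id attribute names come from the model above, while the enclosing tag layout, file name, and XPath value are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical ref2doc block; the attribute names follow the annotation
# model described above, the values are invented.
sample = """
<ref2doc>
  <ref er-id="er3" doc-file="lemonde_front_page.xml"
       doc-id="//MasterArticle[@ID='a1']/Title"/>
</ref2doc>
"""

# Map each RE index to the (document file, XPath) pair it points to.
links = {ref.get("er-id"): (ref.get("doc-file"), ref.get("doc-id"))
         for ref in ET.fromstring(sample).iter("ref")}

print(links["er3"])
```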
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Annotation Procedure and Results
</SectionTitle>
      <Paragraph position="0"> A document-centric meeting room has been set up at the University of Fribourg to record different types of meetings. Several modalities related to documents are recorded, thanks to a dozen cameras and eight microphones. These devices are controlled and synchronized by a master computer running a meeting capture and archiving application, which helps the users organize the numerous data files (Lalanne et al., 2004).</Paragraph>
      <Paragraph position="1"> At the time of writing, 22 press-review meetings of ca. 15 minutes each had been recorded, between March and November 2003. In such meetings, participants discuss (in French) the front pages of one or more newspapers of the day. Each participant presents a selection of the articles to his/her colleagues, for information purposes. In general, after a monologue of 5-10 utterances that summarize an article, a brief discussion ensues, consisting of questions, answers and comments. Then, the chair of the meeting shifts the focus of the meeting to another article. The recordings of the 22 meetings were manually transcribed using Transcriber, then exported as XML files. The structure of the documents was also encoded as XML files using the procedure described above (3.2.1), with manual correction to ensure near 100% accuracy.</Paragraph>
      <Paragraph position="2">  The annotation of the ground truth references was done directly in the XML format described above (Figure 2). We annotated 15 meetings with a total of 322 REs. In a first pass, the annotator marked the REs (with &lt;er&gt;...&lt;/er&gt; tags) if they referred to an article or to one of its parts, for instance its title or author. However, REs that corresponded only to quotations of an article's sentences were not annotated, since they refer to entities mentioned in the documents rather than to the document elements. Table 1 synthesizes the observed patterns of REs.</Paragraph>
      <Paragraph position="3"> The REs were then automatically indexed, and a template for the ref2doc block and an HTML view were generated using XSLT. In a second pass, the annotator filled in directly the attributes of the ref2doc block in the template. The annotators were instructed to fill in, for each RE (er-id), the name of the newspaper file the RE referred to (doc-file) and the XPath to the respective document element (doc-id), using its ID. Examples of XPath expressions were provided. The following separate windows are all required for the annotation: a text/XML editor for the ref2doc block of the dialog annotation file; and an HTML browser for the serialized HTML transcript (with REs in boldface). We tested the reliability of the annotators on the second part of their task, viz., filling in the ref2doc blocks. The experiment involved three annotators, for the three meetings that discuss several documents at a time, with a total of 92 REs. In a first stage, annotation was done without any communication between annotators, only using the annotation guidelines. The result was on average 96% agreement for document assignment (that is, 3 errors for 92 REs), and 90% agreement on document elements (that is, 9 errors).2 In a second stage, we analyzed and resolved some of the disagreements, thus reaching 100% agreement on document assignment and 97% agreement on document elements, that is, only two disagreements. These resulted from different interpretations of utterances--e.g., they in &amp;quot;they say. . . &amp;quot; could denote the author, the newspaper, etc.--and could not be resolved.</Paragraph>
      <Paragraph position="4"> This experiment shows that ref2doc annotation is a very reliable task: referents can be clearly identified in most cases. A system matching human performance would thus score above 95%.3</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> Unlike intra-document coreference resolution, for which evaluation is a complex task (Popescu-Belis, 2003), the evaluation of reference resolution over a specific domain is quite straightforward. One must compare for each RE the referent found by the system with the correct one selected by the annotators.</Paragraph>
      <Paragraph position="1"> If the two are the same, the system scores 1; otherwise it scores 0. The total score is the number of correctly solved REs out of the total number of REs (100% means perfect). The automatic evaluation measure we implemented using the XML annotation described above in fact provides three scores:  1. The number of times the document an RE refers to is correctly identified. This is informative only when a dialog deals with more than one document.</Paragraph>
      <Paragraph position="2"> 2These numbers were found using the evaluation software described below (Section 4.3). Document element agreement means here that the elements had the same ID.</Paragraph>
      <Paragraph position="3"> 3As for the first part of the process, recognizing the REs that refer to documents, we can only hypothesize that inter-annotator agreement is lower than for the second part.

2. The number of times the document element, characterized by its ID attribute, is correctly identified. Here, the possible types of document elements are: Master-Article, JournalArticle, Article or Highlight.</Paragraph>
      <Paragraph position="4"> 3. The number of times the specific part of an article is correctly identified (e.g., content, title, author, image, as indicated by the XPath annotation in the XML output format).</Paragraph>
      <Paragraph position="5"> The third score is necessarily lower than the second one, and the second one is necessarily lower than the first one. The third score is not used for the moment, since our ref2doc algorithms do not target sub-article elements. To help adjust the resolution algorithm, the scoring program also outputs a detailed evaluation report for each meeting, so that a human scorer can compare the system's output and the correct answer explicitly.</Paragraph>
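A minimal Python sketch of this three-level scoring; representing each referent as a (document, element ID, sub-part) triple is our assumption about the data layout, not the paper's implementation:

```python
def score(gold, system):
    """Return the three scores described above, each as a fraction of
    the total number of REs: correct document, correct element ID,
    correct sub-article part. The levels are nested, so each score is
    at most as high as the previous one."""
    doc = elem = part = 0
    for re_id, (g_doc, g_elem, g_part) in gold.items():
        s_doc, s_elem, s_part = system.get(re_id, (None, None, None))
        if s_doc == g_doc:
            doc += 1
            if s_elem == g_elem:          # same ID attribute
                elem += 1
                if s_part == g_part:      # e.g. title, author, content
                    part += 1
    n = len(gold)
    return doc / n, elem / n, part / n

# Illustrative gold and system annotations for two REs.
gold = {"er1": ("journal.xml", "a1", "title"),
        "er2": ("journal.xml", "a2", "content")}
system = {"er1": ("journal.xml", "a1", "content"),
          "er2": ("journal.xml", "a2", "content")}
print(score(gold, system))  # (1.0, 1.0, 0.5)
```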
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Empirical Analysis of Occurring REs
</SectionTitle>
      <Paragraph position="0"> The patterns of the annotated REs are synthesized in Table 1 according to the type of entity they refer to. This analysis attempts to derive regular expressions that describe the range of variation of the REs that refer to documents, without generalizing too much. Words in capital letters represent classes of occurring words: NEWSP are newspaper names, SPEC is a specifier (one or more words, e.g., an adjective or a relative clause), DATE and TITLE are self-explanatory. Items in brackets are optional, and '|' indicates an exclusive choice. The patterns derived here could be used to automatically recognize such REs, except for two categories--anaphors and (discourse) indexicals--that must be disambiguated.</Paragraph>
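As an illustration of how such patterns could drive automatic RE detection (task 2), here is a hedged Python sketch; the word classes follow the description above, but the concrete French patterns and newspaper names are hypothetical, since Table 1 lists the patterns actually observed:

```python
import re

# Illustrative only: the word classes NEWSP and DATE follow the
# description above; these concrete patterns and names are hypothetical.
NEWSP = r"(?:Le Monde|Le Temps|Liberation)"
DATE = r"(?:d'aujourd'hui|du jour)"

PATTERNS = [
    re.compile(rf"{NEWSP}(?: {DATE})?"),   # newspaper name, optional date
    re.compile(r"l'article|cet article"),  # bare definite / deictic
]

def is_candidate_re(phrase):
    """True if the phrase fully matches one of the RE patterns.
    Anaphors and discourse indexicals would still need disambiguation."""
    return any(p.fullmatch(phrase) for p in PATTERNS)

print(is_candidate_re("Le Monde d'aujourd'hui"))
```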
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Ref2doc Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Preliminary Study
</SectionTitle>
      <Paragraph position="0"> The first resolution method we implemented uses co-occurrences of words in the speech transcript and in the documents. More precisely, for each RE annotated in the transcript as referring to documents, the words it contains and the words surrounding it in the same utterance are matched, using the cosine metric, against the bag of words of each logical block of the document: article, title, author, etc. To increase the importance of the words within the REs, they are given twice the weight of the surrounding words. The most similar logical block is considered to be the referent of the RE, provided the similarity value exceeds a fixed threshold (confidence level).</Paragraph>
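A minimal sketch of this co-occurrence baseline; the tokenization, the exact weighting scheme, and the threshold value are assumptions:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def resolve(re_words, context_words, blocks, threshold=0.1):
    """Match an RE plus its utterance context against the bag of words
    of each logical block; words inside the RE are added with double
    weight, per the method above. Returns the best block, or None if
    no similarity exceeds the confidence threshold."""
    query = Counter(context_words)
    for w in re_words:
        query[w] += 2
    best, best_sim = None, threshold
    for block_id, words in blocks.items():
        sim = cosine(query, Counter(words))
        if sim > best_sim:
            best, best_sim = block_id, sim
    return best

# Illustrative logical blocks (article titles) from a front page.
blocks = {"a1/title": ["budget", "europeen"],
          "a2/title": ["meteo", "neige"]}
print(resolve(["article", "budget"], ["le", "budget", "europeen"], blocks))
```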
      <Paragraph position="1">  [Table 1: patterns of the REs in French, ordered by the type of the referent (9 REs out of 322 did not follow these patterns).]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Algorithm based on Anaphora Tracking
</SectionTitle>
      <Paragraph position="0"> A more complex algorithm was designed, which is based on the identification of anaphoric vs. non-anaphoric REs, as well as co-occurrences of words.</Paragraph>
      <Paragraph position="1"> The algorithm scans each meeting transcript linearly (not by channel/speaker), and stores as variables the 'current document' and the 'current document element' or article. For each RE, the algorithm first assigns the hypothesized document, from the list of documents associated with the meeting. REs that use a newspaper's name are taken to refer to the respective newspaper; the others are assumed to refer to the current newspaper, i.e. they are anaphors. This simple method does not handle complex references such as 'the other newspaper', but nevertheless obtains a sufficient score (see Section 6 below).</Paragraph>
      <Paragraph position="2"> The algorithm then attempts to assign a document element to the current RE. First, it tries to determine whether the RE is anaphoric by matching it against a list of typical anaphors found in the meetings: 'it', 'the article' (bare definite), 'this article', 'the author' (equivalents in French). If the RE is anaphoric, it is associated with the current article or document element--a very simple implementation of a focus stack (Grosz et al., 1995)--unless the RE is the first one in the meeting, which is never considered to be anaphoric.</Paragraph>
      <Paragraph position="3"> If the RE is not considered to be anaphoric, the algorithm attempts to link it to a document element by comparing the content words of the RE with those of each article. The words of the RE are considered, as well as those of its left and right contexts. A match with the article's title or author name is weighted more than a match with its content. Finally, the article that scores the most matches is considered to be the referent of the RE, and becomes the current document element.</Paragraph>
      <Paragraph position="4"> Several parameters govern the algorithm, in particular the weights of the various matches--the nine pairs generated by {RE word, left-context word, right-context word} × {title or subtitle word, author word, contents word}--and the size of the left and right context, i.e. the number of preceding and following utterances and the number of words retained. Evaluation provides insights about the best values for these parameters.</Paragraph>
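The two steps above can be sketched as follows; the anaphor list, the data layout, and the unweighted word overlap are simplifications of our algorithm (the real matches are weighted, and the weights are tuned by evaluation):

```python
# Typical anaphors (English glosses of the French forms used in the data).
ANAPHORS = {"it", "the article", "this article", "the author"}

def resolve_meeting(res, newspapers, articles):
    """Linear scan over a meeting's REs, keeping a 'current document'
    and 'current article'. res: list of (RE text, context words);
    newspapers: {name: document id}; articles: {document id:
    {article id: set of title words}}. The layout is our simplification."""
    current_doc = next(iter(newspapers.values()))
    current_art = None
    out = []
    for i, (text, context) in enumerate(res):
        # Step 1: document assignment. An explicit newspaper name wins;
        # otherwise the RE is an anaphor to the current newspaper.
        for name, doc in newspapers.items():
            if name in text:
                current_doc = doc
        # Step 2: element assignment. Known anaphors keep the current
        # article (never for the first RE of the meeting).
        if text in ANAPHORS and i > 0 and current_art is not None:
            out.append((current_doc, current_art))
            continue
        # Otherwise pick the article sharing most words with RE + context
        # (a faithful implementation weights title/author matches more).
        words = set(text.split()) | set(context)
        current_art = max(articles[current_doc],
                          key=lambda a: len(words & articles[current_doc][a]))
        out.append((current_doc, current_art))
    return out

newspapers = {"Le Monde": "lemonde"}
articles = {"lemonde": {"a1": {"budget", "europeen"},
                        "a2": {"meteo", "neige"}}}
res = [("the article on the budget", ["europeen"]), ("it", [])]
print(resolve_meeting(res, newspapers, articles))
```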
    </Section>
  </Section>
</Paper>