<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1058">
  <Title>Evaluating CETEMPublico, a free resource for Portuguese</Title>
  <Section position="1" start_page="0" end_page="19" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper we present a thorough evaluation of a corpus resource for Portuguese, CETEMPublico, a 180-million-word newspaper corpus free for R&amp;D in Portuguese processing.</Paragraph>
    <Paragraph position="1"> We provide information that should be useful to those using the resource and that should lead to considerable improvement in later versions. In addition, we think that the procedures presented can be of interest to the larger NLP community, since corpus evaluation and description is unfortunately not a common exercise.</Paragraph>
    <Paragraph position="2"> 1 Introduction CETEMPublico is a large corpus of European Portuguese newspaper language, available at no cost to the community dealing with the processing of Portuguese.</Paragraph>
    <Paragraph position="3">  It was created in the framework of the Computational Processing of Portuguese project, a government funded initiative to foster language engineering of the Portuguese language.</Paragraph>
    <Paragraph position="4">  Evaluating this resource, we have two main goals in mind: to help improve its usefulness, and to suggest ways of approaching corpus evaluation in general (noting that most corpus projects are simply described and not evaluated).</Paragraph>
    <Paragraph position="5">  In fact, and despite the amount of research devoted to corpus processing nowadays, there is not much information about the actual corpora being processed, which may lead naive users and/or readers to conclude that this is not an interesting issue. In our opinion, that is the wrong conclusion.</Paragraph>
    <Paragraph position="6"> There is, in fact, a lot to be said about any particular corpus. We believe, in addition, that such information should be available when one is buying, or even just browsing, a corpus, and it should be taken into consideration when, in turn, systems or hypotheses are evaluated with the help of that corpus.</Paragraph>
    <Paragraph position="7"> In this paper, we will solely be concerned with CETEMPublico, but it is our belief that similar kinds of information could be published about different corpora. Our intention is to make a positive contribution both to the whole community involved in the processing of Portuguese and to the particular users of this corpus. At the time of writing, 160 people have ordered (and, we assume, consequently received) it. There have also been more than four thousand queries via the Web site which gives access to the corpus.</Paragraph>
    <Paragraph position="8"> We want to provide evaluation data and describe how one can improve the corpus. We are genuinely interested in increasing its value, and have, since corpus release, made available four patches (e-mailing this information to all who ordered the corpus). We have also tried to considerably improve the Web page.</Paragraph>
    <Paragraph position="9"> [Footnotes: Although we also made available a CQP (Christ et al., 1999) encoded version in March 2001, the vast majority of the users received the text-only version. The corpus was ready in July 2000; the first copies were sent out in October, with the information that the creation date of version 1.0 was 25 July 2000.]</Paragraph>
    <Paragraph position="10"> We decided to concentrate on the evaluation of version 1.0, given that this was the version distributed on a large scale. Web access to the corpus (Santos and Bick, 2000) will not be dealt with here. Note that all trivial improvements described here have already been addressed in some patch.</Paragraph>
    <Paragraph position="11"> 2 Short technical description As described in detail in Rocha and Santos (2000) and also in the FAQ at the corpus Web page, CETEMPublico was built from the raw material provided by the Portuguese daily newspaper Publico: text files in Macintosh format, covering approximately the years 1991 to 1998, and including both published news articles and those created but not necessarily brought to print. These files were automatically tagged with a classification based on, but not identical to, the one used by the newspaper to identify sections, and with the semester the article was associated with. In addition, sentence separation and title and author identification were performed automatically. The texts were then divided into extracts with an average length of two paragraphs. These extracts were randomly shuffled (for copyright reasons) and numbered, and the final corpus is the sequence of extracts ordered by number.</Paragraph>
    <Paragraph position="12"> To illustrate the corpus in text format, we present in Appendix A an extract that includes all possible tags with the exception of &lt;marca&gt;. 3 General evaluation We start by commenting on the distribution process, and then go on to analyse the corpus contents and the specific options chosen in its creation.</Paragraph>
    <Paragraph position="13"> Let us first comment on the distribution options. While this resource is entirely free (one has only to register on a Web page in order to receive the corpus at the address of one's choice), several critical remarks are in order. [Footnote: We have no estimate of how many users have actually succeeded, or even tried, to apply the patches made available later on. We have just launched a Web questionnaire in order to have a better idea of our user community.]</Paragraph>
    <Paragraph position="14"> First of all, when publicizing the resource, it was not clear for whom the CD distribution was actually meant: later on, we discovered that many traditional linguists ordered it only to find out that they were much better off with the on-line version.</Paragraph>
    <Paragraph position="15"> Second, more accompanying information in the CD would not hurt, instead of pointing to a Web page as the only source: In fact, the assumption that everyone has access to the Web while working with CETEMPublico is not necessarily true in Portugal or Brazil.</Paragraph>
    <Paragraph position="16"> Finally, we did not produce a medium-size technical description; in addition to the FAQ on the Web page, we provided only a full paper (Rocha and Santos, 2000) describing the whole project, which is arguably overkill.</Paragraph>
    <Paragraph position="17"> About the corpus contents, several fundamental decisions can be, and actually have been, criticized (at previous conferences or by e-mail), in particular the use of a single text source and the inclusion of sentence tags (by criteria so far not yet documented). Still, we think that both are easy to defend, since 1) the time taken in copyright handling and contract writing with every copyright owner strongly suggests minimizing their number, and 2) although sentence separation is a controversial issue, it is straightforward to dispose of sentence separation tags. So, this option cannot really be considered an obstacle to users.</Paragraph>
    <Paragraph position="18">  We will concentrate instead on each annotation, after discussing the choice of texts and extracts.</Paragraph>
    <Paragraph position="19"> 3.1 Extract definition and choice Looking at the final corpus, it is evident that many extracts should be discarded or, at least, rewritten. We tried to remove specific kinds of &quot;text&quot;, namely soccer classifications, citations from other newspapers, etc., but it is still possible to detect several other objects of dubious interest in the resulting corpus.</Paragraph>
    <Paragraph position="20"> In fact, using regular expression patterns of the kind &quot;existence of multiple tabs in a line ending in numbers&quot;, we identified 5270 extracts having some form of classification, as well as 662 extracts with no valid content.</Paragraph>
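    <Paragraph> A pattern of the kind quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact expressions used for the corpus are not given in the text, and the names TABLE_LINE and looks_like_classification, as well as the threshold of two matching lines, are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
import re

# Assumed reading of "multiple tabs in a line ending in numbers":
# at least two tab characters, and a digit as the last non-blank character.
TABLE_LINE = re.compile(r"^(?:[^\t\n]*\t){2,}[^\t\n]*\d\s*$")


def looks_like_classification(extract, min_lines=2):
    """Return True if at least `min_lines` lines of the extract look tabular."""
    hits = sum(1 for line in extract.splitlines() if TABLE_LINE.match(line))
    return hits >= min_lines
```
]]></Paragraph>
    <Paragraph> Requiring more than one matching line is a design choice of this sketch: it avoids flagging an ordinary paragraph that happens to contain a single tabbed line.</Paragraph>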
    <Paragraph position="21">  Since extract definition is based on paragraph and not sentence boundary, the option of marking &lt;s&gt; boundaries has no other consequences.</Paragraph>
    <Paragraph position="22"> Now, it is arguable that classifications of other sports (e.g., athletics and motor races), solutions to crossword puzzles, film and book reviews, and TV programming tables, just to name a few, should have been extracted on the same grounds presented for removing soccer.</Paragraph>
    <Paragraph position="23"> Our decision was obviously based on a question of extent. (Soccer results are much more frequent.) However, we now regret this methodological flaw and would like to clean up a little more (as done in the patches), or add back soccer results.</Paragraph>
    <Paragraph position="24"> Another problem detected, concerning the extract structure, was our unfortunate algorithm of appending titles to the previous extract, just like authors, instead of joining them to the next extract. This means that 4.8% of the extracts in CETEMPublico end with a title (and 9.6% end with an author). 3.2 Spurious repetitions The worst problem presented by the CETEMPublico corpus is the question of repeated material. (Incidentally, it is interesting to note that this is also a serious problem in searching the Web, as mentioned by Kobayashi and Takeda (1999).) Repeated articles can be due to two independent factors: (a) parallel editions of the local section of the newspaper in the two main cities of Portugal (Lisboa and Porto); (b) later publication of previously &quot;rejected&quot; articles. In addition to manually inspecting rare items that one would not expect to appear more than a few times in the corpus (but which had higher frequency than expected), we used the following strategies to detect repeated extracts: 1. Record, in a hash table, the first and last 40 characters of each extract as well as its size in characters; then fully compare only the extracts that coincide under this criterion. 2. Using the Perl module MD5 (designed for cryptographic purposes), attribute to each extract a 32-character hexadecimal checksum and record it in a hash table. Repeated extracts have the same checksum, and it is extremely unlikely that two different extracts will share one.</Paragraph>
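    <Paragraph> The two strategies above can be sketched as follows. This is a minimal Python illustration, not the Perl code actually used for the corpus: the function names are hypothetical, extracts are assumed to be available as a list of strings, and the standard hashlib module stands in for the Perl MD5 module.</Paragraph>
    <Paragraph><![CDATA[
```python
import hashlib
from collections import defaultdict


def fingerprint(text, width=40):
    """Strategy 1 key: first and last `width` characters plus the size."""
    return (text[:width], text[-width:], len(text))


def duplicates_by_fingerprint(extracts):
    """Group indices of identical extracts, comparing in full only the
    extracts that share a fingerprint."""
    buckets = defaultdict(list)
    for idx, text in enumerate(extracts):
        buckets[fingerprint(text)].append(idx)
    groups = []
    for indices in buckets.values():
        by_text = defaultdict(list)  # full comparison inside one bucket
        for i in indices:
            by_text[extracts[i]].append(i)
        groups.extend(g for g in by_text.values() if len(g) > 1)
    return sorted(groups)


def duplicates_by_md5(extracts):
    """Strategy 2: group indices of extracts sharing an MD5 checksum."""
    buckets = defaultdict(list)
    for idx, text in enumerate(extracts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()  # 32 hex chars
        buckets[digest].append(idx)
    return sorted(g for g in buckets.values() if len(g) > 1)
```
]]></Paragraph>
    <Paragraph> In this sketch the fingerprint only narrows down the candidates, so the full comparison is still needed; the checksum method trades that second pass for hashing every extract in full.</Paragraph>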
    <Paragraph position="25">  Repeated sentences can also occur in the lead and in the body of an article, and (in the opinion section) to highlight parts of an article.</Paragraph>
    <Paragraph position="26"> The results obtained for exactly equal extracts are displayed in Table 1 for both methods.</Paragraph>
    <Paragraph position="27"> Another related (and obviously more complicated) problem is what to do with quasi-duplicates, i.e. sentences or texts that are almost, but not quite, identical. An estimate of the number of approximately equal extracts, obtained with the 40-character method but with relaxed size constraints (10%), yields a further 15,665 possibly repeated extracts. It is not obvious whether one can automatically identify which one is the revised version, or even whether it is desirable to choose that one. We have, anyway, compiled a list of these cases, thinking that they might serve as raw material for studying the revision process (and to obtain a list of errors and their correction).</Paragraph>
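    <Paragraph> One possible reading of the relaxed criterion is sketched below in Python. It is an assumption-laden illustration: the text does not say how the 10% tolerance was applied, so this sketch measures the size difference relative to the longer extract, and the function name near_duplicate_candidates is hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def near_duplicate_candidates(extracts, width=40, tolerance=0.10):
    """Return index pairs of non-identical extracts that share their first
    and last `width` characters and differ in size by at most `tolerance`."""
    buckets = defaultdict(list)
    for idx, text in enumerate(extracts):
        buckets[(text[:width], text[-width:])].append(idx)
    pairs = []
    for indices in buckets.values():
        for i, j in combinations(indices, 2):
            size_i, size_j = len(extracts[i]), len(extracts[j])
            close = abs(size_i - size_j) <= tolerance * max(size_i, size_j)
            if close and extracts[i] != extracts[j]:
                pairs.append((i, j))
    return sorted(pairs)
```
]]></Paragraph>
    <Paragraph> The pairs returned are only candidates: as noted above, deciding which member of a pair is the revised version still requires inspection.</Paragraph>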
    <Paragraph position="29"> In the CETEMPublico corpus, newspaper titles and subtitles, as well as author identifications, have been marked up as result of heuristic processing. In Rocha and Santos (2000), a preliminary evaluation of precision and recall for these tasks was published, but here we want to evaluate this in a different way, without making reference to the original text files.</Paragraph>
    <Paragraph position="30"> Given the corpus, we want to address precision and error rate (i.e., of all chunks tagged as titles, how many have been rightly tagged, and how many are wrong?). We manually reviewed the first 500 instances of &lt;t&gt;, of which 427 were undoubtedly titles, a further 4 were wrongly tagged authors, and at least 15 belonged to book or film reviews, indicating title, author and publisher, or director and broadcasting date, etc. [Footnote: In the 15th chunk of the corpus. This apparently naive choice of test data does not bias evaluation, since the extracts are randomly placed in the corpus and do not reflect any order of time period or kind of text.]</Paragraph>
    <Paragraph position="31"> We then looked into the following error-prone situation: after having noted that several paragraphs in a row including title and author tags were usually wrong (and should have been marked as list items instead), we looked for extracts containing sequences of four titles / authors and manually checked 200. The precision in this case was very low: only 38% were correctly tagged. Of the incorrect ones, as many as 34% were part of book reviews as described above. This indicates clearly that we should have processed special text formats prior to applying our general heuristic rules.</Paragraph>
    <Paragraph position="32"> Regarding recall, we did the following partial inspection: we noted several short sentences ending in ? or ! (a criterion to parse a text chunk as a full sentence) that should actually be tagged as titles. We therefore looked at 200 paragraphs with one single sentence ending in a question or exclamation mark and containing fewer than 8 words, and concluded that 41 cases (20%) could definitely be marked as titles, while no fewer than 85 of these cases were questions taken from interviews. Most other cases were questions inside ordinary articles.</Paragraph>
    <Paragraph position="33"> As far as authors are concerned, the phrase Leitor devidamente identificado (&quot;duly identified reader&quot;, used to sign readers' letters where the writer does not wish to disclose his or her identity) was correctly identified in only 78% of the cases (135 of 172). In 17% of the occurrences, it was wrongly tagged as title.</Paragraph>
    <Paragraph position="34"> From a list of 500 authors randomly extracted for evaluation purposes, only 395 (79%) were unambiguously so, while 8 (1.5%) could still be considered correct by somewhat more relaxed criteria. We thus conclude that up to 21% of the author tags in the corpus may be wrongly attributed, a figure much higher than the originally estimated 4%.</Paragraph>
    <Paragraph position="35"> Among those cases, foreign names (generally in the context of film or music reviews, or book presentations) were frequently mistagged as authors of articles in Publico, a situation highly unlikely and amenable to automatic correction. Figure 1 is an example.</Paragraph>
    <Paragraph position="37"> In addition to paragraph separation coming from the original newspaper files, CETEMPublico comes with sentence separation as an added-value feature.</Paragraph>
    <Paragraph position="38"> Now, sentence separation is obviously not a trivial question, and there are no foolproof rules for complicated cases (Nunberg, 1990; Grefenstette and Tapanainen, 1994; Santos, 1998). So, instead of trying to produce other subjective criteria for evaluating a particularly delicate area, we decided to look at the amount of work needed to revise the sentence separation for a given purpose, as reported in section 4.2.</Paragraph>
    <Paragraph position="39"> But we did some complementary searches for cases we would expect to be wrong whatever the sentence separation philosophy. We thus found 6,358 sentences initiated by a punctuation mark (comma, closing quotes, period, question mark and exclamation mark, respectively amounting to 4053, 410, 1607, 227 and 61 occurrences), as well as a plethora of suspiciously small sentences, cf. Table 2.</Paragraph>
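    <Paragraph> A search of this kind can be sketched in Python as follows. The sketch rests on assumptions that should be checked against the actual corpus files: sentences are taken to be delimited by &lt;s&gt; ... &lt;/s&gt; (possibly with attributes such as frag), closing quotes are taken to be the characters listed in SUSPECT_INITIALS, and all names are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
import re
from collections import Counter

# Assumed markup: <s> ... </s>, optionally with attributes (e.g. <s frag>).
S_ELEMENT = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL)

# Assumed inventory of suspicious sentence-initial characters.
SUSPECT_INITIALS = {
    ",": "comma",
    "\u00bb": "closing quote",   # right-pointing double angle quote
    "\u201d": "closing quote",   # right double quotation mark
    ".": "period",
    "?": "question mark",
    "!": "exclamation mark",
}


def count_punctuation_initial(corpus_text):
    """Count sentences whose first non-blank character is suspicious."""
    counts = Counter()
    for match in S_ELEMENT.finditer(corpus_text):
        sentence = match.group(1).strip()
        if sentence and sentence[0] in SUSPECT_INITIALS:
            counts[SUSPECT_INITIALS[sentence[0]]] += 1
    return counts
```
]]></Paragraph>
    <Paragraph> Summing the values of the returned counter gives the total number of suspicious sentences, which is how the five figures reported above add up to 6,358.</Paragraph>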
    <Paragraph position="40">  Sentence separation marks some sentences as fragments (&lt;s frag&gt;); in addition, the &lt;li&gt; tag was used to render list elements. We are not sure now whether it was worthwhile to have two different markup elements.</Paragraph>
    <Paragraph position="41">  Finally, the sentence separation module also introduces the &lt;marca&gt; tag to identify metacharacters that are used for later coreference (e.g. in footnotes). The asterisk &quot;*&quot; was marked as such in CETEMPublico, but not inside author or title descriptions, an undesirable inconsistency.</Paragraph>
    <Paragraph position="42"> 3.5 Extraneous characters An annoying detail is the number of strange characters that have remained in the corpus after font conversion, such as non-Portuguese characters, hyphens, bullet list marking, and the characters &lt; &gt; instead of quotes.</Paragraph>
    <Paragraph position="43"> It is straightforward to replace these with other ISO-8859-1 characters or combinations of characters, as was done with dashes and quotes.</Paragraph>
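    <Paragraph> Such a replacement step might look like the Python sketch below. The replacement table is purely hypothetical, since the actual list of characters and their substitutes is not reproduced in this text; the helper leftovers simply reports characters that still fall outside ISO-8859-1 after replacement.</Paragraph>
    <Paragraph><![CDATA[
```python
# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {
    "\u2013": "-",   # en dash
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2022": "*",   # bullet
}
TRANSLATION = str.maketrans(REPLACEMENTS)


def to_latin1_safe(text):
    """Replace the listed characters with ISO-8859-1 equivalents."""
    return text.translate(TRANSLATION)


def leftovers(text):
    """Characters that still cannot be encoded as ISO-8859-1."""
    return sorted({ch for ch in text if ord(ch) > 255})
```
]]></Paragraph>
    <Paragraph> Running leftovers over the cleaned text is a cheap way to find characters that the table does not yet cover.</Paragraph>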
    <Paragraph position="44">  Only the last line of Table 4 requires some care, since E is an otherwise valid Portuguese character that should only be replaced a few times.</Paragraph>
    <Paragraph position="46"> CETEMPublico extracts come with a subject classification derived from (but not equal to) the original newspaper section. Due to format differences of the original files, only 86% of the extracts have some classification associated.</Paragraph>
    <Paragraph position="47"> The others carry the label ND (not determined).</Paragraph>
    <Paragraph position="48"> We evaluate here this classification, since for half of the corpus article separation had to be carried out automatically and thus chances exist that errors may have crept in.</Paragraph>
    <Paragraph position="49"> The first thing we did was to check whether repeated extracts had been attributed the same classification. Astonishingly, there were many differences: of the 47,002 cases of multiple extracts, 10,872 (23%) had different categories, even though only in 2% of the cases none of the conflicting categories was ND.</Paragraph>
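    <Paragraph> The check described above can be sketched in Python as follows. The input format is an assumption of this sketch: each group is the list of category labels attached to one set of repeated extracts, and apart from ND the labels used in the example are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
def classification_conflicts(groups):
    """Given one list of category labels per set of repeated extracts,
    return (number of sets with differing labels,
            number of those in which no label is ND)."""
    conflicting = [labels for labels in groups if len(set(labels)) > 1]
    without_nd = [labels for labels in conflicting if "ND" not in labels]
    return len(conflicting), len(without_nd)


# Hypothetical labels for four sets of repeated extracts.
example = [["soc", "soc"], ["soc", "ND"], ["pol", "cul"], ["ND", "ND", "des"]]
print(classification_conflicts(example))
```
]]></Paragraph>
    <Paragraph> Separating the conflicts that involve ND from the others matters because a clash with ND only shows that one copy was left unclassified, whereas a clash between two real categories points to a genuine classification error.</Paragraph>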
    <Paragraph position="50"> Another experiment was to look at well-known polysemic or ambiguous items and see whether their meaning correlated with the kind of text they were purported to be in. We thus manually inspected several thousand concordances dealing with the following middle-frequency words: 201 occurrences of vassoura (broom; last vehicle in a bicycle race); 124 of passador (sieve; drug seller; emigrant dealer); 314 of cunha (wooden object; corruption device); 599 of coxa (noun thigh; adjective lame); 205 of prego (nail; meat sandwich; pawnshop); 145 of garfo (fork; biking); 5505 of estrela (star; filmstar; success); 375 of dobragem (folding; dubbing; parachuting and F1 term); 573 of escravatura (slavery).</Paragraph>
    <Paragraph position="51"> [Footnote: Note that it is not always possible to have a one-to-one mapping from MacRoman into ISO-8859-1.]</Paragraph>
    <Paragraph position="52"> [Footnote: Glosses provided are not exhaustive.]</Paragraph>
    <Paragraph position="53"> We could only find two cases of firm disagreement with source classification (in the two last mentioned queries). This is not as good a result as it seems, though, since it can be argued that subject classification is too high level (society, politics, culture) to allow for definite results.</Paragraph>
    <Paragraph position="54"> 4 Corpus in use The best way to evaluate a corpus resource is to see how well it fares regarding the tasks it is put to. We will not evaluate concordancing for human inspection, because we assume that this is a rather straightforward task for which CETEMPublico is useful, especially because it requires direct scrutiny. Obviously, human inspection and judgement make the results more robust.</Paragraph>
    <Paragraph position="55"> 4.1 Proper name identification One of the authors developed proper name identification tools (Santos, 1999) prior to the existence of CETEMPublico. We ran them on this corpus to see how they worked.</Paragraph>
    <Paragraph position="56"> We proceeded in the following way: we inspected manually the first 1,000 proper names obtained from CETEMPublico and got less than  This category encompasses &quot;deviant&quot; proper names, mainly including foreign accents and numbers, irrespective of proper name length.</Paragraph>
    <Paragraph position="57"> Then, we computed the distribution of the 52,665 proper nouns identified by the program (23,401 types) in the first million words of the corpus, as shown in Table 5, and manually inspected the 1,017 with a length of four words or more. Of these, 88% were correct and 6.5% were plainly wrong. Cases of merging two proper names and cases where it was easy to guess one missing (preceding or following) word accounted each for approximately 5% of the remaining instances.</Paragraph>
    <Paragraph position="58"> While use of CETEMPublico allowed us to uncover cases not catered for by the program, it also illuminated some potential tokenization problems in the corpus, namely a large quantity of tokens ending in a dash (21,455 tokens, 6,458 types) or in a slash (7,313 tokens, 4,530 types), as well as up to 132,455 tokens including one single parenthesis (28,466 types).</Paragraph>
    <Section position="1" start_page="12" end_page="19" type="sub_section">
      <SectionTitle>
4.2 Treebank building
</SectionTitle>
      <Paragraph position="0"> The first million words of CETEMPublico were selected for the creation of a treebank for Portuguese (Floresta Sinta(c)tica), given that its use is copyright cleared and the corpus is free.</Paragraph>
      <Paragraph position="1"> The treebank team engaged in a manual revision of the text prior to treebank coding, refining sentence separation with the help of syntactically-based criteria (Afonso and Marchi, 2001). We have tried to compute the amount of change produced by human intervention, which turned out to be a surprisingly complex task (Santos, 2001). This one million words subcorpus contained 8,043 extracts.</Paragraph>
      <Paragraph position="2"> Assuming that the first million is not different from the rest of the corpus, the results indicate an estimate of 17% of the corpus extracts in need of improvement. Looking at sentences, 2,977 sentences of the 42,026 original ones had to be re-separated into 4,304 of the resulting 43,271. Table 6 displays an estimate of what was actually involved in the revision of sentence tags (percentages are relative to the original number of sentences). [Footnotes: Different tokenizers may have different strategies, but we assume that these will be hard cases for most. Numbered from 1 to 8067, since version 1.2 was used, and therefore 24 invalid extracts had already been removed. In addition, the treebank reviewers considered that a further 129 should be taken out.]</Paragraph>
      <Paragraph position="3"> The &quot;Other&quot; category includes changes among the tags &lt;t&gt;, &lt;a&gt;, &lt;li&gt; and &lt;s&gt;.</Paragraph>
      <Paragraph position="4"> 4.3 Spelling checker evaluation One of the first and most direct uses of a large corpus is to study the coverage of, evaluate, and especially improve a spelling checker and morphological analyser.</Paragraph>
      <Paragraph position="5"> Our preliminary results of evaluating Jspell (Almeida and Pinto, 1994) as far as type and token spelling is concerned are as follows: among the 942,980 types of CETEMPublico, 574,199 were not recognized by the current version of Jspell (60.4%), amounting to 3.07% of the size of the corpus. A superficial comparison showed that CETEMPublico contains a higher percentage of unrecognized words, both types and tokens, than other Portuguese newspaper corpora. Numbers for a 1.5-million word corpus of Diario do Minho (a regional newspaper) and for a 4-million word corpus of a political party newspaper are respectively 26.5% and 25.41% unrecognized types and 2.26% and 1.67% unrecognized tokens. These numbers may be partially explained by Publico's higher coverage of international affairs, together with its cinema and music sections, both bringing an increase in foreign proper names. [Footnotes: The percentage of unrecognized tokens varies from 4.8% for culture to 2.0% for society extracts. We classify as Portuguese or foreign the word, not the location: thus, Tanzania is a Portuguese word. That is, words routinely used in Portuguese but which up to now have kept a distinctly foreign spelling, such as pullover.]</Paragraph>
      <Paragraph position="6"> We investigated the &quot;errors&quot; found by the system, to see how many were real and how many were due to deficient lexical (or rule) coverage. Table 7 shows the distribution of 1,000 &quot;errors&quot; randomly obtained from the 12th corpus chunk. [Table 7, surviving fragment: words missing in dict. 101 98 incorrectly spelled]</Paragraph>
      <Paragraph position="7"> The absolute frequencies of the most common spelling errors in CETEMPublico are another interesting evaluation parameter.</Paragraph>
      <Paragraph position="8"> Applying Jspell to types with frequency &gt; 100 (excluding capitalized and hyphenated words), we identified manually the &amp;quot;real&amp;quot; errors. Strikingly, all involved lack or excess of accents. The most frequent appeared 840 times (juiz), the second one (saiu) 659, and the third (impor) had 637 occurrences. Their correctly spelled variants (juiz, saiu, impor) appeared respectively 11896, 9892 and 5125 times.</Paragraph>
      <Paragraph position="9"> 5 Comparison with other corpora One can find excellent reports on the difficulties encountered in creating corpora (see e.g. Armstrong et al. (1998) and references therein), but it is significantly rarer to get an evaluation of the resulting objects. It is thus not easy to compare CETEMPublico with other corpora on the issues discussed here.</Paragraph>
      <Paragraph position="10"> For example, it was not easy to find thorough documentation of BNC problems (although there is a mailing list and a specific e-mail address to report bugs), nor is similar information to be found in distribution agencies' (such as LDC or ELRA) Web sites.</Paragraph>
      <Paragraph position="11"> It is obviously outside the scope of the present paper to do a thorough analysis of other corpora as well, but our previous experience shows that it is not at all uncommon to experience problems with characters and fonts, repeated texts or sentences, rubbish-like sections, wrong markup and/or lack of it. All this is independent of whether corpora are paid for and/or distributed by agencies supposed to have performed validation checks. The same happens for corpora that have been manually revised. As regards sentence separation, Johansson et al. (1996) mention that proofreading of the automatic insertion of &lt;s&gt;-units was necessary for the ENPC corpus, but they do not report problems of human editors in deciding what an &lt;s&gt; should be. Let us, however, note that ENPC compilers were free to use an &lt;omit&gt; tag for complicated cases and, last but not least, were not dealing with newspaper text.</Paragraph>
      <Paragraph position="12"> [Footnotes: Including one case of lack of space between two words, suacontribuicao. British National Corpus: http://info.ox.ac.uk/bnc/]</Paragraph>
      <Paragraph position="13"> 6 Concluding remarks This paper can be read from a user's angle as a complement to the documentation of the CETEMPublico corpus. In addition, by showing several simple forms of evaluating a corpus resource, we hope to have inspired others to do the same for other corpora.</Paragraph>
      <Paragraph position="14"> While the work described in this paper already allowed us to publish several patches, improve our corpus processing library and contribute to new versions of other people's programs, namely Jspell, our future plans are to do more extensive testing using more powerful techniques (e.g. statistical) to investigate other problems or features of the corpus. In any case, we believe that the work reported in this paper comes logically first.</Paragraph>
    </Section>
  </Section>
</Paper>