<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2304"> <Title>A Robust and Efficient Parser for Non-Canonical Inputs</Title> <Section position="5" start_page="21" end_page="23" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We experimented this approach during the French evaluation campaign EASy (cf.</Paragraph> <Paragraph position="1"> [Paroubek05]). The test consisted in parsing several files containing various kinds of material: literature, newspaper, technical texts, questions, e-mails and spoken language. The total size of this corpus is one million words. Part of this corpus was annotated with morpho-syntactic (POS tags) and syntactic annotations. The last one provides bracketing as well as syntactic relations between units. The annotated part of the corpus represents 60,000 words and constitutes the gold standard.</Paragraph> <Paragraph position="2"> The campaign consisted for the participants to parse the entire corpus (without knowing what part of the corpus constituted the reference). The results of the campaign are not yet available concerning the evaluation of the relations. The figures presented in this section concern constituent bracketing. The task consisted in identifying minimal non recursive constituents described by annotation guidelines given to the participants.</Paragraph> <Paragraph position="3"> The different categories to be built are: GA (adjective group: adjective or passed participle), GN (nominal group: determiner, noun adjective and its modifiers), GP (prepositional group), GR (adverb), NV (verbal nucleus: verb, clitics) and PV (verbal propositional group).</Paragraph> <Paragraph position="4"> Our system parses the entire corpus (1 million words) in 4 minutes on a PC. It presents then a very good efficiency.</Paragraph> <Paragraph position="5"> We have grouped the different corpora into three different categories: written texts (including newspapers, technical texts and literature), spoken language (orthographic transcription of spontaneous speech) and e-mails. The results are the following: These figures show then very stable results in precision and recall, with only little loss of efficiency for non-canonical material. When studying more closely the results, some elements of explanation can be given. The e-mail corpus is to be analyzed separately: many POS tagging errors, due to the specificity of this kind of input explain the difference. Our POS-tagger was not tuned for this kind of lexical material.</Paragraph> <Paragraph position="6"> The interpretation of the difference between written and oral corpora can have some linguistic basis. The following figures give quantitative indications on the categories built by the parser. The first remark is that the repartition between the different categories is the same. The only main difference concerns the higher number of nucleus VP in the case of written texts. This seems to support the classical idea that spoken language seems to use more nominal constructions than the written one.</Paragraph> <Paragraph position="7"> The problem is that our parser encounters some difficulties in the identification of the NP borders. It very often also includes some material belonging in the grammar given during the campaign to AP or VP. The higher proportion of NPs in spoken corpora is an element of explanation for the difference in the results.</Paragraph> </Section> class="xml-element"></Paper>