File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/00/w00-0903_evalu.xml
Size: 4,999 bytes
Last Modified: 2025-10-06 13:58:39
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0903"> <Title>Comparing corpora and lexical ambiguity</Title> <Section position="3" start_page="15" end_page="17" type="evalu"> <SectionTitle> 3. Results and comparison of ambiguities </SectionTitle> <Paragraph position="0"> Footnote 3: This class has only one representative within the medical corpus, the word patient (feminine patiente). An equivalent within the general corpus is politique (in English it means both political and politics), but the former (0.5% of tokens) is ten times more frequent than the latter (0.05%). The frequency of the word politique is consistent with the frequency lists distributed by Jean Véronis (http://www.up.univ-mrs.fr/~veronis), which were calculated on a one-million-word corpus from Le Monde Diplomatique (1987-1997). It should be noted that this result questions the concepts of &quot;unrestricted corpora&quot; and &quot;representativeness&quot; (Biber, 1994), as in fact it often refers to a mix of politics and newspaper topics! Note (tab. 4): Column 1 gives the ambiguity class. Column 2 provides the ratio of similarity (maximum similarity = 1, minimal similarity = 0 and 5) between the frequency of the considered ambiguity in medical (Fm.) and general texts (Fg.). Columns 3 and 4 (resp.</Paragraph> <Paragraph position="1"> Fm. and Fg.) indicate the frequency of the ambiguity in the medical texts and in the general texts, respectively. Column 5 provides some examples of the ambiguity class, or its best representative (BR) when one lexeme accounts for at least 80% of the class.</Paragraph> <Paragraph position="2"> List of abbreviations for the syntactic categories: proc, clitic pronoun; v, verb; nc, common noun; d, determiner; sp, preposition; prop, personal pronoun; cccs, conjunction; q, numeral. 
List of abbreviations for the morpho-syntactic features and subcategorizations: ms, masculine singular; n, verbal infinitive form; fs, feminine singular; bs, masculine or feminine singular; 12, first and second person singular or plural; s03, third person singular; p03, third person plural.</Paragraph> <Paragraph position="3"> When possible, this tagset follows the MULTEXT (Ide and Véronis 1994) morpho-syntactic description, as modified within the GRACE action. Note, however, that the original MULTEXT description and the GRACE version (Paroubek et al. 1998, Rajman et al. 1997) for the French language were not designed for annotating morphemes.</Paragraph> <Paragraph position="4"> Previously, while attempting to assess the performance of our tools, we used only a sample of the ad hoc corpus we built, whereas the following studies of ambiguities are carried out on the whole corpus. As in the validation task, the lexical ambiguities are based on the morphological analysis of each tagger, expressed in the MCT. First of all, table 3 gives the general ambiguity rate in each corpus: it clearly shows that the total ambiguity rate in the general corpus is about twice as high as in medical texts.</Paragraph> <Paragraph position="5"> A more precise table (tab. 4) provides at least two remarkable results. First, it shows that in the general corpus, fewer than a dozen words are responsible for half of the global ambiguity rate.</Paragraph> <Paragraph position="6"> These results must be compared to (Chanod and Tapanainen, 1995), who situate this number around 16, while about six words generate the same level of ambiguity in the medical corpus! This table also shows that the distribution of ambiguity types is domain-dependent. 
Thus, the ambiguity d[fs]-[bs]/proc is twice as frequent in medical texts, and the ambiguity represented by the tokens patient/patiente (masculine and feminine forms of patient, which may be a noun, an adjective, or some form of verb) is five times more frequent. Conversely, some classes of ambiguity are simply absent or very rare in the medical domain (for example v[12]/v[s03] or nc[ms]/v[n]).</Paragraph> <Paragraph position="7"> Finally, in table 5, we give the distribution of the most frequent syntactic categories according to the corpus. A particularly interesting result in this table concerns the imbalance between noun-phrase categories (determiner, noun, adjective...) and verb-phrase categories (verb, adverb...): the former are much more frequent in medical texts, whereas the latter are more frequent in general texts. This confirms a well-known stylistic trait: medical reports are often written in a telegraphic style, where the verb is frequently implicit. As a corollary, nominalization phenomena are very frequent.</Paragraph> <Paragraph position="8"> Simple or complex numeral tokens (dates, times, expressions with digits and measure symbols) are also much more frequent.</Paragraph> <Paragraph position="9"> syntactic categories according to the domain.</Paragraph> <Paragraph position="10"> Note (tab. 5): f refers to punctuation.</Paragraph> </Section> </Paper>