<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1807"> <Title>Extracting Multiword Expressions with A Semantic Tagger</Title> <Section position="4" start_page="3" end_page="3" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> In this section, we analyze the results of the MWE extraction in detail for a full evaluation of our approach to MWE extraction. null Overall, after we processed the test corpus, the USAS tagger extracted 4,195 MWE candidates from the test corpus. After manually checking through the candidates, we selected 3,792 as good MWEs, resulting in overall precision of 90.39%.</Paragraph> <Paragraph position="1"> As we explained earlier, due to the difficulty of obtaining the total number of true MWEs in the entire test corpus, we had to estimate recall of the MWE extraction on a sample corpus. In detail, we first randomly selected fifty texts containing 14,711 words from the test corpus, then manually marked-up good MWEs in the sample texts, finally counted the number of the marked-up MWUs.</Paragraph> <Paragraph position="2"> As a result, 1,511 good MWEs were found in the sample. Since the number of automatically extracted good MWEs in the sample is 595, the recall on the sample is calculated as follows: null Recall=(595/1511)x100%=39.38%.</Paragraph> <Paragraph position="3"> Considering the homogenous feature of the test data, we assume this local recall is roughly approximate to the global recall of the test corpus.</Paragraph> <Paragraph position="4"> To analyze the performance of USAS in respect to the different semantic field categories, we divided candidates according to the assigned semantic tag, and calculated the precision for each of them. Table 1 lists these precisions, sorting the semantic fields by the number of MWE candidates (refer to section 3 for definitions of the twenty-one main semantic field categories). As shown in this table, the USAS semantic tagger obtained precisions between 91.23% to 100.00% for each semantic field except for the field of &quot;names and grammatical words&quot; denoted by Z. As Z was the biggest field (containing 45.39% of the total MWEs and 43.12% of the accepted MWEs), we examined these MWEs more closely. We discovered that numerous pairs of words are tagged as person names (Z1) and geographical names (Z2) by mistake, e.g. Blackfriars crown (tagged as Z1), stabbed ries Another possible factor that affects the performance of the USAS tagger is the length of the MWEs. To observe the performance of our approach from this perspective, we grouped the MWEs by their lengths, and then checked precision for each of the categories. Table 2 shows the results (once again, they are sorted in descending order by MWE lengths). As we might expect, the number of MWEs decreases as the length increases. In fact, bi-grams alone constitute 80.52% and 81.88% of the candidate and accepted MWEs respectively. The precision also showed a generally increasing trend as the MWE length increases, but with a major divergence of trigrams. One main type of error occurred on tri-grams is that those with the structure of CIW(capital-initial word) + conjunction + CIW tend to be tagged as Z2 (geographical name). The table shows relatively high precision for longer MWEs, reaching 100% for 6grams. 
<Paragraph position="5"> As discussed earlier, purely statistical algorithms for MWE extraction generally filter out low-frequency candidates. However, such low-frequency terms in fact form the major part of the MWEs in most corpora. In our study, we therefore investigated the possibility of extracting low-frequency MWEs by using semantic field annotation. We divided the MWEs into different frequency groups and then checked the precision for each group. Table 3 shows the results, sorted by candidate MWE frequency. As we expected, 69.46% of the candidate MWEs and 68.22% of the accepted MWEs occur in the corpus only once or twice. This means that, with a frequency filter of Min(f)=3, a purely statistical algorithm would exclude more than half of the candidates from the process.</Paragraph>
<Paragraph position="6"> We also examined the relationship between the precisions and the frequencies. Generally, we would expect better precision for MWEs of higher frequencies, as higher co-occurrence frequencies should reflect stronger affinity between the words within the MWEs. By and large, slightly higher precisions were obtained for the higher-frequency groups (5-7, 8-20 and 21-117) than for the preceding lower-frequency groups, i.e. 94.07%-96.64% versus 87.43%-92.67%. Nevertheless, within these three higher-frequency groups the precision did not increase with frequency, contrary to our initial expectation.</Paragraph>
<Paragraph position="7"> When we examined the erroneous MWEs in this frequency range more closely, we found that some frequent domain-specific terms are misclassified by the USAS tagger.</Paragraph>
<Paragraph position="8"> For example, since the texts in the test corpus are newspaper reports of court stories, many law courts (e.g. Manchester crown court, Norwich crown court) are mentioned frequently throughout the corpus, giving such terms high frequencies (f=20 and f=31 respectively). Unfortunately, the templates used in the USAS tagger did not capture them as complete terms; rather, fragments were assigned a Z1 person name tag (e.g. Manchester crown). A solution to this type of problem is to improve the multiword-unit templates used in the USAS tagger. Other possible solutions include incorporating a statistical algorithm to help detect the boundaries of complete MWEs.</Paragraph>
<Paragraph position="9"> When we examined the error distribution within the semantic fields more closely, we found that most errors occurred within the Z and T categories (refer to Table 1). The errors occurring in these semantic field categories and their subdivisions make up 76.18% of the total errors (403). Table 4 shows the error distribution across 14 subdivisions (for definitions of these subdivisions, see http://www.comp.lancs.ac.uk/ucrel/usas). Notice that the majority of errors come from four semantic sub-categories: Z1, Z2, Z3 and T1.3, and that the first two of these alone account for 60.55% of the total errors. This shows that the main cause of the errors in the USAS tool is the algorithm and lexical entries used for identifying names, personal and geographical, and, to a lesser extent, those used for identifying periods of time. If these components of USAS can be improved, a much higher precision can be expected.</Paragraph>
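To illustrate why a purely statistical pipeline loses these low-frequency items, here is a small sketch of the Min(f)=3 cut-off such an extractor would apply. The candidate counts are invented for illustration; only the cut-off and the once-or-twice statistic come from the text above.

```python
from collections import Counter

# Hypothetical candidate occurrence counts standing in for Table 3;
# in the real corpus, 69.46% of candidates occur only once or twice.
freq = Counter({
    "blind eye": 1,
    "last bastion": 2,
    "crown court": 31,
    "in front of": 20,
})

MIN_F = 3  # a typical frequency cut-off in purely statistical extraction

kept = {mwe: f for mwe, f in freq.items() if f >= MIN_F}
dropped = {mwe: f for mwe, f in freq.items() if f < MIN_F}

# The semantic-field approach evaluated here applies no such cut-off,
# so the low-frequency MWEs in `dropped` remain extractable.
print(f"kept {len(kept)}, dropped {len(dropped)} "
      f"({len(dropped) / len(freq):.0%} of candidates)")
```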
<Paragraph position="10"> In sum, our evaluation shows that our semantic approach to MWE extraction is effective in identifying MWEs, in particular those of lower frequencies. In addition, a reasonably wide lexical coverage is obtained, as indicated by the recall of 39.38%, which is important for terminology building. Our approach provides a practical way of extracting MWEs on a large scale, which we envisage will be useful for both linguistic research and practical NLP applications.</Paragraph> </Section> </Paper>