<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1807"> <Title>Extracting Multiword Expressions with A Semantic Tagger</Title> <Section position="3" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Experiment of MWE extraction </SectionTitle> <Paragraph position="0"> To test our approach of extracting MWEs using semantic information, we first tagged the newspaper part of the METER Corpus with the USAS tagger. We then collected the multiword units that the tagger had assigned as a single semantic unit. Finally, we manually checked the results.</Paragraph> <Paragraph position="1"> The METER Corpus, chosen as the test data, is a collection of court reports from the British Press Association (PA) and some leading British newspapers (Gaizauskas, 2001; Clough et al., 2002). In our experiment, we used the newspaper part of the corpus, which contains 774 articles with more than 250,000 words. It provides a homogeneous corpus (in the sense that the reports come from a restricted domain of court events) and is thus a good source from which to extract domain-specific MWEs.</Paragraph> <Paragraph position="2"> Another reason for choosing this corpus is that it has not been used in training the USAS system. Because this is an open test, we assume that the results of the experiment reflect the true capability of our approach for real-life applications.</Paragraph> <Paragraph position="3"> The current USAS tagger may assign multiple possible semantic tags to a term when it fails to disambiguate between them. As mentioned previously, the first tag denotes the most likely semantic field of the term. Therefore, in our experiment we chose the first tag when such situations arose.</Paragraph> <Paragraph position="4"> A major problem we faced in our experiment is the definition of a MWE. Although it has been several years since people started to work on MWE extraction, we found that there is, as yet, no available &quot;clear-cut&quot; definition for MWEs. 
We noticed that various possible definitions have been suggested for MWE/MWU.</Paragraph> <Paragraph position="5"> For example, Smadja (1993) suggests that collocations and multiword units are basically characterised as recurrent, domain-dependent and cohesive lexical clusters. Sag et al. (2001b) suggest that MWEs can roughly be defined as &quot;idiosyncratic interpretations that cross word boundaries (or spaces)&quot;. Biber et al. (2003) describe MWEs as lexical bundles, which they go on to define as combinations of words that can be repeated frequently and tend to be used frequently by many different speakers/writers within a register.</Paragraph> <Paragraph position="6"> Although it is not difficult to interpret these definitions in theory, things became much more complicated when we undertook our practical checking of the MWE candidates. Quite often, we disagreed among ourselves about whether or not to accept a MWE candidate as a good one. In practice, we generally followed Biber et al.'s definition, i.e. we accepted a candidate MWE as a good one if it co-occurs repeatedly in the corpus.</Paragraph> <Paragraph position="7"> Another difficulty we experienced relates to estimating recall. Because the MWEs in the METER Corpus are not marked up, we could not automatically calculate the number of MWEs contained in the corpus. Consequently, we had to estimate this figure manually. It was not practical to check through the whole corpus manually within the limited time allowed. Therefore, we had to estimate the recall on a sample of the corpus, as will be described in the following section.</Paragraph> </Section> </Paper>