<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1016">
  <Title>The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> Following Keller and Lapata (2003), web counts for n-grams were obtained using a simple heuristic based on queries to the search engine Altavista.1 In this approach, the web count for a given n-gram is simply the number of hits (pages) returned by the search engine for the queries generated for this n-gram. Three different types of queries were used for the NLP tasks in the present paper: Literal queries use the quoted n-gram directly as a search term for Altavista (e.g., the bigram history changes expands to the query &amp;quot;history changes&amp;quot;).</Paragraph>
    <Paragraph position="1"> Near queries use Altavista's NEAR operator to expand the n-gram; a NEAR b means that a has to occur in the same ten word window as b; the window is treated as a bag of words (e.g., history changes expands to &amp;quot;history&amp;quot; NEAR &amp;quot;changes&amp;quot;).</Paragraph>
    <Paragraph position="2"> Inflected queries are performed by expanding an n-gram into all its morphological forms. These forms are then submitted as literal queries, and the resulting hits are summed up (e.g., history changes expands to &amp;quot;history change&amp;quot;, &amp;quot;histories change&amp;quot;, &amp;quot;history changed&amp;quot;, etc.). John Carroll's suite of morphological tools (morpha, morphg, and ana) was used to generate inflected forms of verbs and nouns.2 In certain cases (detailed below), determiners were inserted before nouns in order to make it possible to recognize simple NPs. This insertion was limited to a/an, the, and the empty determiner (for bare plurals).</Paragraph>
    <Paragraph position="3"> All queries (other than the ones using the NEAR operator) were performed as exact matches (using quotation marks in Altavista). All search terms were submitted to the search engine in lower case. If a query consists of a single, highly frequent word (such as the), Altavista will return an error message. In these cases, we set the web count to a large constant (108). This problem is limited to unigrams, which were used in some of the models detailed below. Sometimes the search engine fails to return a hit for a given n-gram (for any of its morphological variants). We smooth zero counts by setting them to .5.</Paragraph>
    <Paragraph position="4"> For all tasks, the web-based models are compared against identical models whose parameters were estimated from the BNC (Burnard, 1995). The BNC is a static 100M word corpus of British English, which is about 1000 times smaller than the web (Keller and Lapata, 2003). Comparing the performance of the same model on the web and on the BNC allows us to assess how much improvement can be expected simply by using a larger data set. The BNC counts were retrieved using the Gsearch corpus query tool (Corley et al., 2001); the morphological query expansion was the same as for web queries; the NEAR operator was simulated by assuming a window of five words to the left and five to the right.</Paragraph>
    <Paragraph position="5">  # best model on development set 6 (not) sign. different from best BNC model on test set</Paragraph>
    <Paragraph position="7"> Gsearch was used to search solely for adjacent words; no POS information was incorporated in the queries, and no parsing was performed.</Paragraph>
    <Paragraph position="8"> For all of our tasks, we have to select either the best of several possible models or the best parameter setting for a single model. We therefore require a separate development set. This was achieved by using the gold standard data set from the literature for a given task and randomly dividing it into a development set and a test set (of equal size). We report the test set performance for all models for a given task, and indicate which model shows optimal performance on the development set (marked by a '#' in all subsequent tables). We then compare the test set performance of this optimal model to the performance of the models reported in the literature. It is important to note that the figures taken from the literature were typically obtained on the whole gold standard data set, and hence may differ from the performance on our test set. We work on the assumption that such differences are negligible.</Paragraph>
    <Paragraph position="9"> We use kh2 tests to determine whether the performance of the best web model on the test set is significantly different from that of the best BNC model. We also determine whether both models differ significantly from the base-line and from the best model in the literature. A set of diacritics is used to indicate significance throughout this paper, see Table 2.</Paragraph>
  </Section>
class="xml-element"></Paper>