
<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1105">
  <Title>Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution</Title>
  <Section position="4" start_page="0" end_page="835" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Resolution of structural ambiguity problems such as noun compound bracketing, prepositional phrase (PP) attachment, and noun phrase coordination requires using information about lexical items and their cooccurrences. This in turn leads to the data sparseness problem, since algorithms that rely on making decisions based on individual lexical items must have statistics about every word that may be encountered. Past approaches have dealt with the data sparseness problem by attempting to generalize from semantic classes, either manually built or automatically derived.</Paragraph>
    <Paragraph position="1"> More recently, Banko and Brill (2001) have advocated for the creative use of very large text collections as an alternative to sophisticated algorithms and hand-built resources. They demonstrate the idea on a lexical disambiguation problem for which labeled examples are available &amp;quot;for free&amp;quot;. The problem is to choose which of 2-3 commonly confused words (e.g., {principle, principal}) are appropriate for a given context. The labeled data comes &amp;quot;for free&amp;quot; by assuming that in most edited written text, the words are used correctly, so training can be done directly from the text. Banko and Brill (2001) show that even using a very simple algorithm, the results continue to improve log-linearly with more training data, even out to a billion words. A potential limitation of this approach is the question of how applicable it is for NLP problems more generally - how can we treat a large corpus as a labeled collection for a wide range of NLP tasks? In a related strand of work, Lapata and Keller (2004) show that computing n-gram statistics over very large corpora yields results that are competitive with if not better than the best supervised and knowledge-based approaches on a wide range of NLP tasks. For example, they show that for the problem of noun compound bracketing, the performance of an n-gram based model computed using search engine statistics was not significantly different from the best supervised algorithm whose parameters were tuned and which used a taxonomy.</Paragraph>
    <Paragraph position="2"> They find however that these approaches generally fail to outperform supervised state-of-the-art models that are trained on smaller corpora, and so conclude that web-based n-gram statistics should be the base-line to beat.</Paragraph>
    <Paragraph position="3"> We feel the potential of these ideas is not yet fully realized. We are interested in finding ways to further exploit the availability of enormous web corpora as implicit training data. This is especially important for structural ambiguity problems in which the decisions must be made on the basis of the behavior  of individual lexical items. The trick is to figure out how to use information that is latent in the web as a corpus, and web search engines as query interfaces to that corpus.</Paragraph>
    <Paragraph position="4"> In this paper we describe two techniques - surface features and paraphrases - that push the ideas of Banko and Brill (2001) and Lapata and Keller (2004) farther, enabling the use of statistics gathered from very large corpora in an unsupervised manner. In recent work (Nakov and Hearst, 2005) we showed that a variation of the techniques, when applied to the problem of noun compound bracketing, produces higher accuracy than Lapata and Keller (2004) and the best supervised results. In this paper we adapt the techniques to the structural disambiguation problems of prepositional phrase attachment and noun compound coordination.</Paragraph>
  </Section>
class="xml-element"></Paper>