<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3012">
  <Title>Corpus representativeness for syntactic information acquisition</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The coverage of the computational lexicon used in deep Natural Language Processing (NLP) is crucial for parsing success. But rather frequently, the absence of particular entries or the fact that the information encoded for these does not cover very specific syntactic contexts --as those found in technical texts-- make high informative grammars not suitable for real applications. Moreover, this poses a real problem when porting a particular application from domain to domain, as the lexicon has to be re-encoded in the light of the new domain. In fact, in order to minimize ambiguities and possible over-generation, application based lexicons tend to be tuned for every specific domain addressed by a particular application. Tuning of lexicons to different domains is really a delaying factor in the deployment of NLP applications, as it raises its costs, not only in terms of money, but also, and crucially, in terms of time.</Paragraph>
    <Paragraph position="1"> A desirable solution would be a 'plug and play' system that, given a collection of documents supplied by the customer, could induce a tuned lexicon. By 'tuned' we mean full coverage both in terms of: 1) entries: detecting new items and assigning them a syntactic behavior pattern; and 2) syntactic behavior pattern: adapting the encoding of entries to the observations of the corpus, so as to assign a class that accounts for the occurrences of this particular word in that particular corpus. The question we have addressed here is to define the size and composition of the corpus we would need in order to get necessary and sufficient information for Machine Learning techniques to induce that type of information.</Paragraph>
    <Paragraph position="2"> Representativeness of a corpus is a topic largely dealt with, especially in corpus linguistics. One of the standard references is Biber (1993) where the author offers guidelines for corpus design to characterize a language. The size and composition of the corpus to be observed has also been studied by general statistical NLP (Lauer 1995), and in relation with automatic acquisition methods (Zernick, 1991, Yang &amp; Song 1999). But most of these studies focused in having a corpus that actually models the whole language. However, we will see in section 3 that for inducing information for parsing we might want to model just a particular subset of a language, the one that corresponds to the texts that a particular application is going to parse. Thus, the research we report about here refers to aspects related to the quantity and optimal composition of a corpus that will be used for inducing syntactic information.</Paragraph>
    <Paragraph position="3"> In what follows, we first will briefly describe the observation corpus. In section 3, we introduce the phenomena observed and the way we got an objective measure. In Section 4, we report on experiments done in order to check the validity of this measure in relation with word frequency. In section 5 we address the issue of corpus size and how it affects this measure.</Paragraph>
  </Section>
class="xml-element"></Paper>