<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1030"> <Title>Scaling Context Space</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Context plays an important role in many natural language tasks. For example, the accuracy of part-of-speech taggers or word sense disambiguation systems depends on the quality and quantity of contextual information these systems can extract from the training data. When predicting the sense of a word, for instance, the immediately preceding word is likely to be more important than the tenth previous word; similar observations can be made about POS taggers or chunkers. A crucial part of training these systems lies in extracting high-quality contextual information from the data, in the sense of defining contexts that are both accurate and correlated with the information (the POS tags, the word senses, the chunks) the system is trying to extract.</Paragraph> <Paragraph position="1"> The quality of contextual information is often determined by the size of the training corpus: with less data available, extracting context information for any given phenomenon becomes less reliable.</Paragraph> <Paragraph position="2"> However, corpus size is no longer a limiting factor: whereas up to now people have typically worked with corpora of around one million words, it has become feasible to build much larger document collections; for example, Banko and Brill (2001) report on experiments with a one billion word corpus.</Paragraph> <Paragraph position="3"> When using a much larger corpus and scaling the context space, there are, however, other trade-offs to take into consideration: the size of the corpus may make it infeasible to train some systems because of efficiency issues or hardware costs; it may also result in an unmanageable expansion of the extracted context information, reducing the performance of the systems that have to make use of this information.</Paragraph> <Paragraph position="4"> This paper reports on experiments that try to establish some of the trade-offs between corpus size, processing time, hardware costs and the performance of the resulting systems. We report on experiments with a large corpus (around 300 million words). We trained a thesaurus extraction system with a range of context-extracting front-ends to demonstrate the interaction between context quality, extraction time and representation size.</Paragraph> </Section> </Paper>