<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1704"> <Title>CUCWeb: a Catalan corpus built from the Web</Title> <Section position="2" start_page="0" end_page="19" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> CUCWeb is the outcome of the common interest of two groups, a Computational Linguistics group and a Computer Science group interested in Web studies. It fits into a larger project, The Spanish Web Project, aimed at empirically studying the properties of the Spanish Web (Baeza-Yates et al., 2005). The project set up an architecture to retrieve a portion of the Web roughly corresponding to the Web in Spain, in order to study its formal properties (analysing its link distribution as a graph) and its characteristics in terms of pages, sites, and domains (size, kind of software used, and language, among other aspects).</Paragraph> <Paragraph position="1"> One of the by-products of the project is a 166 million word corpus for Catalan.1 The biggest annotated Catalan corpus before CUCWeb was the CTILC corpus (Rafel, 1994), consisting of about 50 million words. [Footnote 1: Catalan is a relatively minor language. There are currently about 10.8 million Catalan speakers, a figure similar to Serbian (12 million), Greek (10.2 million), or Swedish (9.3 million). See http://www.upc.es/slt/alatac/cat/dades/catala-04.html]</Paragraph> <Paragraph position="2"> In recent years, the Web has been increasingly used as a source of linguistic data (Kilgarriff and Grefenstette, 2003). The most straightforward approach to using the Web as a corpus is to gather data online (Grefenstette, 1998), or to estimate counts (Keller and Lapata, 2003) using available search engines. This approach has a number of drawbacks, e.g. the data one looks for has to be known beforehand, and the queries have to consist of lexical material. 
In other words, it is not possible to perform structural searches or proper language modeling.</Paragraph> <Paragraph position="3"> Current technology makes it feasible and relatively cheap to crawl and store terabytes of data. In addition, crawling the data and processing it off-line provides more potential for its exploitation, as well as more control over the data selection and pruning processes. However, this approach is more challenging from a technological viewpoint.2 For a comprehensive discussion of the pros and cons of the different approaches to using Web data for linguistic purposes, see e.g. Thelwall (2005) and Lüdeling et al. (To appear).</Paragraph> <Paragraph position="4"> We chose the second approach (crawling the data and processing it off-line) because of the advantages discussed in this section, and because it allowed us to make the data available to a large number of non-specialised users, through a web interface to the corpus. We built a general-purpose corpus by crawling the Spanish Web, processing and filtering the retrieved pages with language-intensive tools, filtering out duplicates, and ranking the pages according to popularity.</Paragraph> <Paragraph position="5"> The paper has the following structure: Section 2 details the process that led to the constitution of the corpus, Section 3 explores some of the exploitation possibilities that are foreseen for CUCWeb, and Section 4 discusses the current architecture. Finally, Section 5 contains some conclusions and future work.</Paragraph> <Paragraph position="6"> [Footnote 2: ... at overcoming this challenge, by developing &quot;a set of tools (and interfaces to existing tools) that will allow a linguist to crawl a section of the web, process the data, index them and search them&quot;.]</Paragraph> </Section> </Paper>