<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1009"> <Title>Evaluation of the Bible as a Resource for Cross-Language Information Retrieval</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> This paper describes a project which is part of a larger, ongoing undertaking, the goal of which is to harvest a representative sample of material from the internet and determine, on a very broad scale, the answers to such questions as: what ideas in the global public discourse enjoy most currency, and how the popularity of ideas changes over time.</Paragraph> <Paragraph position="1"> Ideas are, of course, expressed in words; or, to put it another way, a document's vocabulary is likely to reveal something about the author's ideology (Lakoff, 2002). In view of this, and since ultimately we are interested in clustering the documents harvested from the internet by their ideology (and we understand 'ideology' in the broadest possible sense), we approach the problem as a textual information retrieval (IR) task. There is another level of complexity to the problem, however. The language of the internet is not, of course, confined to English; on the contrary, the representation of other languages is probably increasing (Hill and Hughes, 1998; Nunberg, 2000). Thus, for our results to be representative, we require a way to compare documents in one language to those in potentially any other language. Essentially, we would like to answer the question of how ideologically aligned two documents are, regardless of their respective languages. In cross-language IR, this must be approached by the use of a parallel multilingual corpus, or at least some kind of appropriate training material available in multiple languages.</Paragraph> </Section> </Paper>