<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3009"> <Title>The Linguist's Search Engine: An Overview</Title> <Section position="4" start_page="0" end_page="34" type="metho"> <SectionTitle> 2 LSE Interface Concepts </SectionTitle> <Paragraph position="0"> The design of the LSE was guided by a simple premise: a tool can't be a success unless people use it. This premise led to a set of guiding design principles, some of which conflict with each other. For example, sophisticated searches are difficult to specify in a linguist-friendly way and without requiring some learning by the user, and rapid interaction is difficult to accomplish for Web-sized searches.</Paragraph> <Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 2.1 Query By Example </SectionTitle> <Paragraph position="0"> The LSE adopts a strategy one can call &quot;query by example&quot; in order to provide sophisticated search functionality without requiring the user to learn a complex query language. For example, consider the so-called &quot;comparative correlative&quot; construction (Culicover and Jackendoff, 1999).</Paragraph> <Paragraph position="1"> Typing the bigger the house the richer the buyer automatically produces the analysis in Figure 1, which can be edited with a few mouse clicks to obtain the generalized structure in Figure 2, converted with one button push into the LSE's query language, and then submitted in order to find other examples of this construction, such as The higher the rating, the lower the interest rate that must be paid to investors; The more you bingo, the more chances you have in the drawing; The more we plan and prepare, the easier the transition.</Paragraph> <Paragraph position="2"> Crucially, users need not learn a query language, although advanced users can edit or create queries directly if so desired.
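The query-by-example idea can be sketched as follows. This is an illustrative assumption only: the nested-list tree representation, the `generalize` and `matches` helpers, and the toy parses are not the LSE's actual data structures or query language.

```python
# Hypothetical sketch of "query by example": a parse tree is generalized by
# replacing lexical leaves with wildcards, and the resulting structural
# pattern is matched against parses of corpus sentences.

def generalize(tree):
    """Replace every leaf word with the wildcard '*', keeping structure."""
    if isinstance(tree, str):
        return "*"
    label, *children = tree
    return [label] + [generalize(c) for c in children]

def matches(pattern, tree):
    """True if `tree` has the same labeled structure as `pattern`;
    '*' in the pattern matches any leaf word."""
    if pattern == "*":
        return isinstance(tree, str)
    if isinstance(pattern, str) or isinstance(tree, str):
        return pattern == tree
    if pattern[0] != tree[0] or len(pattern) != len(tree):
        return False
    return all(matches(p, t) for p, t in zip(pattern[1:], tree[1:]))

# Exemplar: a toy parse of "the bigger the house"
exemplar = ["NP", ["DT", "the"], ["JJR", "bigger"],
            ["NP", ["DT", "the"], ["NN", "house"]]]
pattern = generalize(exemplar)

# A structurally identical parse of "the higher the rating" matches:
candidate = ["NP", ["DT", "the"], ["JJR", "higher"],
             ["NP", ["DT", "the"], ["NN", "rating"]]]
print(matches(pattern, candidate))  # True
```

As in the LSE, the user-facing step is only the exemplar sentence; the generalization and matching happen behind the scenes, so whether the parse is "correct" matters less than its consistency across query and corpus.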
Nor do users need to agree with (or even understand) the LSE's automatic parse in order to find sentences with parses similar to the exemplar. Indeed, as is the case in Figure 1, the parse need not even be entirely reasonable; what matters is that the structure produced when analyzing the query is the same as the structure produced when analyzing the corresponding sentences in the corpus.</Paragraph> <Paragraph position="3"> Other search features include the ability to specify immediate versus non-immediate dominance; the ability to negate relationships (e.g. a VP that does not immediately dominate an NP); and the ability to specify that words should match on all morphological forms.</Paragraph> <Paragraph position="4"> Additional features include the ability to match nodes based on WordNet relationships (e.g. all descendants of a particular word sense); the ability to save and reload queries; the ability to download results in keyword-in-context (KWIC) format; and the ability to apply a simple keyword-based filter to avoid offensive results during live demonstrations.</Paragraph> <Paragraph position="5"> Results are typically returned by the LSE within a few seconds, in a simple search-engine style format. In addition, however, the user has rapid access to the immediate preceding and following contexts of returned sentences, their annotations, and the Web page where the example occurred.</Paragraph> </Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 2.2 Built-In and Custom Collections </SectionTitle> <Paragraph position="0"> Linguistically annotating and indexing the entire Web is impractical, and therefore there is a clear tradeoff between rapid response time and the ability to search the Web as a whole.
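The keyword-in-context (KWIC) download format mentioned among the search features of Section 2.1 can be sketched as follows; the `kwic` helper and its column widths are illustrative assumptions, not the LSE's actual formatter.

```python
# Sketch of keyword-in-context (KWIC) output: each hit is rendered as
# left-context | keyword | right-context, aligned on the keyword column.
def kwic(sentences, keyword, width=20):
    """Render every occurrence of `keyword` with its surrounding context."""
    lines = []
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            if w.lower().strip(".,;") == keyword:
                left = " ".join(words[:i])[-width:]
                right = " ".join(words[i + 1:])[:width]
                lines.append(f"{left:>{width}}  [{w}]  {right}")
    return lines

results = ["The bigger the house the richer the buyer.",
           "The more we plan, the easier the transition."]
for line in kwic(results, "the"):
    print(line)
```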
In order to manage this tradeoff, the LSE provides, by default, a built-in collection of English sentences taken randomly from a Web-scale crawl at the Internet Archive.</Paragraph> <Paragraph position="2"> This static collection is often useful by itself.</Paragraph> <Paragraph position="3"> In order to truly search the entire Web, the LSE permits users to define their own custom collections, piggybacking on commercial Web search engines. Consider, as an example, a search involving the verb titrate, which is rare enough that it occurs only twice in a collection of millions of sentences. Using the LSE's &quot;Build Custom Collection&quot; functionality, the user can specify that the LSE should:
* Query Altavista to find pages containing any morphological form of titrate
* Extract only sentences containing that verb
* Annotate and index those sentences
* Augment the collection by iterating this process with different specifications
Doing the Altavista query and extracting, parsing, and indexing the sentences can take some time, but the LSE permits the user to begin searching his or her custom collection as soon as any sentences have been added to it. Typically dozens to hundreds of sentences are available within a few minutes, and a typical custom collection, containing thousands or tens of thousands of sentences, is completed within a few hours.</Paragraph> <Paragraph position="4"> Collections can be named, saved, augmented, and deleted.</Paragraph> <Paragraph position="5"> Currently the LSE supports custom collections built using searches on Altavista and Microsoft's MSN Search.
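The custom-collection loop just described can be sketched as follows. The `web_search` and `annotate_and_index` functions are hypothetical stand-ins for the LSE's search-engine interface and annotation pipeline, and the sentence splitting is deliberately naive.

```python
# Hypothetical sketch of "Build Custom Collection": query a Web search
# engine, keep sentences containing any morphological form of the target
# verb, and hand them off for annotation and indexing as they arrive.
import re

FORMS = {"titrate", "titrates", "titrated", "titrating"}

def web_search(query):                       # placeholder for e.g. Altavista
    return ["Doses were titrated slowly. Patients improved.",
            "We titrate the solution until the endpoint is reached."]

def extract_sentences(page_text):
    """Naive sentence split; keep only sentences with a target form."""
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    return [s for s in sentences
            if FORMS & {w.lower().strip(".,;") for w in s.split()}]

def annotate_and_index(sentence, collection):
    collection.append(sentence)              # annotation/indexing stub

def build_custom_collection(query):
    collection = []
    for page in web_search(query):
        for sent in extract_sentences(page):
            # The collection is searchable as soon as anything is indexed.
            annotate_and_index(sent, collection)
    return collection

print(len(build_custom_collection("titrate OR titrates OR titrated")))  # 2
```

The incremental hand-off in the inner loop mirrors the paper's point that searching can begin before the collection is complete.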
It is interesting to note that the search engines' capabilities can be used to create custom collections based on extralinguistic criteria: for example, specifying pages originating only in the .uk domain in order to increase the likelihood of finding British usages, or specifying additional query terms in order to bias the collection toward particular topics or domains.</Paragraph> </Section> </Section> <Section position="5" start_page="34" end_page="35" type="metho"> <SectionTitle> 3 Architecture and Implementation </SectionTitle> <Paragraph position="0"> The LSE's design can be broken into a number of high-level components. The built-in LSE Web collection contains 3 million sentences at the time of this writing. We estimate that it can be increased by an order of magnitude without seriously degrading response time, and we expect to do so by the time of the demonstration.</Paragraph> <Paragraph position="1"> The design is centered on a relational database that maintains information about users, collections, documents, and sentences, and the implementation combines custom-written code with significant use of off-the-shelf packages. The interface with commercial search engines is accomplished straightforwardly by use of the WWW::Search Perl module (currently using a custom-written variant for MSN Search).</Paragraph> <Paragraph position="2"> Natural language annotation is accomplished via a parallel, database-centric annotation architecture (Elkiss, 2003). A configuration specification identifies dependencies between annotation tasks (e.g. tokenization as a prerequisite to part-of-speech tagging). After documents are processed to handle markup and identify sentence boundaries, individual sentences are loaded into a central database that holds annotations, as well as information about which sentences remain to be annotated.
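The dependency-driven configuration just described can be sketched as a topological ordering over annotation tasks; the task names and graph below are illustrative assumptions, not the LSE's actual configuration.

```python
# Sketch of the dependency-driven annotation architecture: a configuration
# maps each annotation task to its prerequisites (e.g. tokenization before
# part-of-speech tagging), and tasks are released in dependency order.
from graphlib import TopologicalSorter

CONFIG = {
    "tokenize": [],
    "pos_tag":  ["tokenize"],
    "parse":    ["pos_tag"],
    "index":    ["parse"],
}

def annotation_order(config):
    """Return tasks so every prerequisite precedes its dependents; in a
    distributed setting, independent tasks could run on separate nodes."""
    return list(TopologicalSorter(config).static_order())

print(annotation_order(CONFIG))  # ['tokenize', 'pos_tag', 'parse', 'index']
```

In the actual system the ordering constraint is enforced per sentence via the central database, which is what allows annotation to proceed in parallel across nodes.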
Crucially, sentences can be annotated in parallel by task processes residing on distributed nodes.</Paragraph> <Paragraph position="3"> Indexing and search of annotations is informed by the recent literature on semistructured data. However, linguistic databases are unlike most typical semistructured data sets (e.g., sets of XML documents) in a number of respects: the dataset has a very large schema (tens of millions of distinct paths from root node to terminal symbols), long path lengths, a need for efficient handling of queries containing wildcards, and a requirement that all valid results be retrieved. On the other hand, in this application incremental updating is not a requirement, and neither is 100% precision: results can be overgenerated and then filtered using a less efficient comparison tool such as tgrep2. Currently the indexing scheme follows ViST (Wang et al., 2003), an approach based on suffix trees that indexes structure and content together. The variant implemented in the LSE ignores insufficiently selective query branches, and achieves more efficient search by modifying the ordering within the structural index, creating an in-memory tree for the query, ordering processing of query branches from most to least selective, and memoizing query subtree matches.</Paragraph> </Section> </Paper>