<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1033"> <Title>Building a Generation Knowledge Source using Internet-Accessible Newswire</Title> <Section position="3" start_page="221" end_page="223" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Research related to ours falls into two main categories: extraction of information from input text and construction of knowledge sources for generation.</Paragraph> <Paragraph position="1"> Our work on extracting descriptions is quite similar to the work carried out under the DARPA message understanding program for extracting descriptions (MUC, 1992). The purpose of, and the scenario in which, description extraction is done are quite different, but the techniques are very similar. Like that work, ours is based on the paradigm of representing patterns that express the kinds of descriptions we expect; unlike previous work, we do not encode semantic categories in the patterns, since we want to capture all descriptions regardless of domain.</Paragraph> <Paragraph position="2"> Research on a system called Murax (Kupiec, 1993) is similar to ours from a different perspective. Murax also extracts information from a text to serve directly in response to a user question.</Paragraph> <Paragraph position="3"> Murax uses lexico-syntactic patterns and collocational analysis, along with information retrieval statistics, to find the string of words in a text that is most likely to serve as an answer to a user's wh-query. In our work, the string that is extracted may be merged, or regenerated, as part of a larger textual summary.</Paragraph> <Section position="1" start_page="221" end_page="221" type="sub_section"> <SectionTitle> 2.1 Information Extraction </SectionTitle> <Paragraph position="0"> Work on information extraction is quite broad and covers far more topics and problems than the information extraction problem we address. We restrict our comparison here to work on proper noun extraction, extraction of people descriptions in various information extraction systems developed for the message understanding conferences (MUC, 1992), and use of extracted information for question answering.</Paragraph> <Paragraph position="1"> Techniques for proper noun extraction include the use of regular grammars to delimit and identify proper nouns (Mani et al., 1993; Paik et al., 1994), the use of extensive name lists, place names, titles, and &quot;gazetteers&quot; in conjunction with partial grammars in order to recognize proper nouns as unknown words in close proximity to known words (Cowie et al., 1992; Aberdeen et al., 1992), statistical training to learn, for example, Spanish names from online corpora (Ayuso et al., 1992), and the use of concept-based pattern matchers that use semantic concepts as pattern categories as well as part-of-speech information (Weischedel et al., 1993; Lehnert et al., 1993). In addition, some researchers have explored the use of both local context surrounding the hypothesized proper nouns (McDonald, 1993; Coates-Stephens, 1991) and the larger discourse context (Mani et al., 1993) to improve the accuracy of proper noun extraction when large known word lists are not available. Like this research, our work also aims at extracting proper nouns without the aid of large word lists. We use a regular grammar encoding part-of-speech categories to extract certain text patterns, and we use WordNet (Miller et al., 1990) to provide semantic filtering.</Paragraph> </Section> <Section position="2" start_page="221" end_page="222" type="sub_section"> <SectionTitle> 2.2 Construction of Knowledge Sources for Generation </SectionTitle> <Paragraph position="0"> The construction of a database of phrases for re-use in generation is quite novel. Previous work on the extraction of collocations for use in generation (Smadja and McKeown, 1991) is related in that full phrases are extracted and syntactically typed so that they can be merged with individual words in a generation lexicon to produce a full sentence. However, extracted collocations were used only to determine the realization of an input concept.</Paragraph> <Paragraph position="1"> In our work, stored phrases would be used to provide content that can identify a person or place for a reader, in addition to providing the actual phrasing.</Paragraph> <Paragraph position="2"> Figure 1 shows the overall architecture of PROFILE and the two interfaces to it (a user interface on the World-Wide Web and an interface to a natural language generation system). In this section, we describe the extraction component of PROFILE; the following section focuses on the uses of PROFILE for generation, and Section 7 describes the Web-based interface.</Paragraph> <Paragraph position="3"> [Figure 1: Overall architecture of PROFILE, showing the news retrieval and entity extraction components.]</Paragraph> </Section> <Section position="3" start_page="222" end_page="222" type="sub_section"> <SectionTitle> 3.1 Extraction of entity names from old newswire </SectionTitle> <Paragraph position="0"> To seed the database with an initial set of descriptions, we used a 1.4 MB corpus containing Reuters newswire from March to June of 1995. The purpose of such an initial set of descriptions is twofold. First, it allows us to test the other components of the system.
Second, it limits the amount of online full-text Web search that must be done at the time a description is needed. At this stage, search is limited to the database of retrieved descriptions only, reducing search time because no connections are made to external news sources at query time. Only when a suitable stored description cannot be found does the system initiate a search of additional text.</Paragraph> <Paragraph position="1"> * Extraction of candidates for proper nouns. After tagging the corpus using the POS part-of-speech tagger (Church, 1988), we used a CREP (Duford, 1993) regular grammar to extract all possible candidates for entities. These consist of all sequences of words that were tagged as proper nouns (NP) by POS. Our manual analysis showed that of a total of 2150 entities recovered in this way, 1139 (52.9%) are not names of entities. Among these are n-grams such as &quot;Prime Minister&quot; or &quot;Egyptian President&quot;, which were tagged as NP by POS. Table 1 shows how many entities we retrieve at this stage and how many of them pass the semantic filtering test. The numbers in the left-hand column refer to two-word noun phrases that identify entities (e.g., &quot;Bill Clinton&quot;); counts for three-word noun phrases are shown in the right-hand column. We show counts for multiple and unique occurrences of the same noun phrase.</Paragraph> <Paragraph position="2"> * Weeding out of false candidates. Our system analyzed all candidates for entity names using WordNet (Miller et al., 1990) and removed from consideration those that contain words appearing in WordNet's dictionary. This resulted in a list of 421 unique entity names that we used for the automatic description extraction stage.
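In outline, this extraction-and-weeding procedure can be sketched as follows. This is a minimal Python sketch: the tagged input and the small common-word set are illustrative stand-ins for the POS tagger's output and WordNet's dictionary, which the system actually uses.

```python
# Stage 1: collect maximal runs of NP-tagged words as candidate entity names.
def np_candidates(tagged_tokens):
    """Return maximal sequences of words tagged NP, joined into strings."""
    candidates, run = [], []
    for word, tag in tagged_tokens:
        if tag == "NP":
            run.append(word)
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates

# Stage 2: weed out candidates containing any dictionary word
# (a stand-in for the WordNet lookup described in the text).
def weed(candidates, dictionary_words):
    return [c for c in candidates
            if not any(w.lower() in dictionary_words for w in c.split())]

# Illustrative tagged sentence fragments (tags are hypothetical).
tagged = [("Prime", "NP"), ("Minister", "NP"), ("said", "VBD"),
          ("Bill", "NP"), ("Clinton", "NP"), ("visited", "VBD"),
          ("Egyptian", "NP"), ("President", "NP")]
common = {"prime", "minister", "president", "egyptian"}

cands = np_candidates(tagged)  # -> ['Prime Minister', 'Bill Clinton', 'Egyptian President']
names = weed(cands, common)    # -> ['Bill Clinton']
```

As in the paper's analysis, n-grams such as &quot;Prime Minister&quot; are tagged NP but are weeded out because their constituent words appear in the dictionary.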
All 421 entity names retrieved by the system are indeed proper nouns.</Paragraph> </Section> <Section position="4" start_page="222" end_page="223" type="sub_section"> <SectionTitle> 3.2 Extraction of descriptions </SectionTitle> <Paragraph position="0"> There are two occasions on which we extract descriptions using finite-state techniques. The first case is when the entity that we want to describe was already extracted automatically (see Subsection 3.1) and exists in PROFILE's database. The second case is when we want a description to be retrieved in real time based on a request from either a Web user or the generation system.</Paragraph> <Paragraph position="1"> There exist many live sources of newswire on the Internet that can be used for this second case.</Paragraph> <Paragraph position="2"> Of particular interest are those that can be accessed remotely through small client programs that do not require any sophisticated protocols to access the newswire articles. Such sources include HTTP-accessible sites, such as the Reuters site at www.yahoo.com and CNN Interactive at www.cnn.com, as well as others, such as ClariNet, which is propagated through the NNTP protocol.</Paragraph> <Paragraph position="3"> All these sources share a common characteristic: they are updated in real time and contain information about current events. They are therefore likely to satisfy the criteria of pertinence to our task, such as the likelihood of the sudden appearance of new entities that could not possibly have been included a priori in the generation lexicon.</Paragraph> <Paragraph position="4"> Our system generates finite-state representations of the entities that need to be described. An example of a finite-state description of the entity &quot;Yasser Arafat&quot; is shown in Figure 2.
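To give a feel for the kind of matching involved, the following sketch uses Python regular expressions as a simplified stand-in for the CREP finite-state grammar. The patterns here are illustrative only (the real grammar works over part-of-speech tags, not word classes), and the helper name `describe` is hypothetical.

```python
import re

def describe(entity, sentence):
    """Return (pre-modifier, apposition) descriptions found around entity.

    Pre-modifier: a lowercase word immediately preceding the entity
    (e.g., "president Bill Clinton"). Apposition: a comma-delimited
    noun phrase following the entity.
    """
    pre = re.search(r"\b([a-z]+) " + re.escape(entity), sentence)
    app = re.search(re.escape(entity) + r", ((?:the |an? )?[^,.]+)[,.]", sentence)
    return (pre.group(1) if pre else None,
            app.group(1) if app else None)

describe("Bill Clinton", "A speech by president Bill Clinton was aired.")
# -> ('president', None)
describe("Gilberto Rodriguez Orejuela",
         "Gilberto Rodriguez Orejuela, the head of the Cali cocaine cartel, was arrested.")
# -> (None, 'the head of the Cali cocaine cartel')
```

The actual finite-state patterns cover a much wider range of noun-phrase structures on both sides of the entity, as described below.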
These expressions are used as input to the description-finding module, which uses them to locate candidate sentences in the corpus from which descriptions can be extracted.</Paragraph> <Paragraph position="5"> Since the need for a description may arise at a later time than when the entity was found, and may require searching new text, the description finder must first locate these expressions in the text.</Paragraph> <Paragraph position="6"> These representations are fed to CREP, which extracts noun phrases on either side of the entity (either pre-modifiers or appositions) from the news corpus. The finite-state grammar for noun phrases that we use represents a variety of syntactic structures for both pre-modifiers and appositions. Thus, they may range from a simple noun (e.g., &quot;president Bill Clinton&quot;) to a much longer expression (e.g., &quot;Gilberto Rodriguez Orejuela, the head of the Cali cocaine cartel&quot;). Other forms of description, such as relative clauses, are the focus of ongoing implementation.</Paragraph> <Paragraph position="9"> Table 2 shows some of the different patterns retrieved.</Paragraph> </Section> <Section position="5" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 3.3 Categorization of descriptions </SectionTitle> <Paragraph position="0"> We use WordNet to group extracted descriptions into categories. For each word in a description, we try to find a WordNet hypernym that can restrict the semantics of the description. Currently, we identify concepts such as &quot;profession&quot;, &quot;nationality&quot;, and &quot;organization&quot;. Each of these concepts is triggered by one or more words (which we call &quot;triggers&quot;) in the description. Table 2 shows some examples of descriptions and the concepts under which they are classified, based on the WordNet hypernyms of their &quot;trigger&quot; words.
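The hypernym-tracing step can be sketched as follows. The tiny hypernym table and the trigger-to-concept mapping below are illustrative stand-ins for WordNet's actual hierarchy and for the system's concept inventory.

```python
# Toy hypernym hierarchy (child -> parent), standing in for WordNet.
HYPERNYM = {
    "minister": "leader", "head": "leader",
    "administrator": "leader", "commissioner": "leader",
    "leader": "person", "person": None,
}

# Concepts triggered by nodes in the hierarchy (illustrative mapping).
CONCEPTS = {"leader": "profession"}

def categorize(description):
    """Trace each word up the hypernym chain until a known concept is hit."""
    for word in description.lower().split():
        node = word
        while node is not None:
            if node in CONCEPTS:
                return CONCEPTS[node]
            node = HYPERNYM.get(node)
    return None

categorize("the head of the Cali cocaine cartel")  # -> 'profession'
```

A description whose words never reach a trigger concept is simply left uncategorized (`None` in this sketch).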
For example, the &quot;triggers&quot; &quot;minister&quot;, &quot;head&quot;, &quot;administrator&quot;, and &quot;commissioner&quot; can all be traced up to &quot;leader&quot; in the WordNet hierarchy.</Paragraph> </Section> <Section position="6" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 3.4 Organization of descriptions in a database of profiles </SectionTitle> <Paragraph position="0"> For each retrieved entity we create a new profile in a database of profiles. We keep information about the surface string that is used to describe the entity in newswire (e.g., &quot;Addis Ababa&quot;), the source of the description, and the date that the entry was made in the database (e.g., &quot;reuters95_06_25&quot;). In addition to these pieces of meta-information, all retrieved descriptions and their frequencies are also stored.</Paragraph> <Paragraph position="1"> Currently, our system does not have the capability of matching references to the same entity that use different wordings. As a result, we keep separate profiles for each of the following: &quot;Robert Dole&quot;, &quot;Dole&quot;, and &quot;Bob Dole&quot;. We use each of these strings as a key in the database of descriptions. Figure 3 shows the profile associated with the key &quot;John Major&quot;.</Paragraph> <Paragraph position="2"> The database of profiles is updated every time a query retrieves new descriptions matching a certain key.</Paragraph> </Section> </Section> </Paper>