<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1033"> <Title>Building a Generation Knowledge Source using Internet-Accessible Newswire</Title> <Section position="4" start_page="223" end_page="224" type="metho"> <SectionTitle> 4 Generation </SectionTitle> <Paragraph position="0"> We have attempted to reuse the descriptions retrieved by the system in more than a trivial way. The content planner of a language generation system that needs to present to the user an entity that he has not seen previously might want to include some background information about it.</Paragraph> <Paragraph position="1"> However, if the extracted information does not contain a suitable description, the system can use descriptions retrieved by PROFILE.</Paragraph> <Section position="1" start_page="223" end_page="224" type="sub_section"> <SectionTitle> 4.1 Transformation of descriptions into Functional Descriptions </SectionTitle> <Paragraph position="0"> Since our major goal in extracting descriptions from on-line corpora was to use them in generation, we have written a utility that converts finite-state descriptions retrieved by PROFILE into functional descriptions (FDs) (Elhadad, 1991) that we can use directly in generation. A description retrieved by the system from the article in Figure 4 is shown in Figure 5. The corresponding FD is shown in Figure 6.</Paragraph> <Paragraph position="1"> We have implemented a TCP/IP interface to Surge. The FD generation component uses this interface to send a new FD to the surface realization component of Surge, which generates an English surface form corresponding to it.</Paragraph>
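<Paragraph> As an illustration of the kind of structure involved, the sketch below maps an extracted appositive description such as &quot;Maurizio Gucci, the former head of Italy's Gucci fashion dynasty&quot; to a nested attribute-value structure. The sketch is hypothetical: the feature names and the helper function are assumptions made for exposition, and the actual utility produces FUF/Surge functional descriptions in the notation of Elhadad (1991). </Paragraph>

    # Illustrative sketch only: a schematic, FD-like attribute-value structure
    # for an entity plus an appositive description. The feature names ("cat",
    # "head", "describer", "classifier", "qualifier") and the helper function
    # are hypothetical and are not taken from PROFILE's implementation.

    def description_to_fd(name, head_noun, modifiers=(), qualifier=None):
        """Build a schematic FD for an entity name and its appositive description."""
        return {
            "cat": "np",
            "head": {"lex": name},
            "describer": {                      # the appositive description itself
                "cat": "np",
                "head": {"lex": head_noun},
                "classifier": list(modifiers),  # pre-modifiers such as "former"
                "qualifier": qualifier,         # post-modifying prepositional phrase
            },
        }

    gucci_fd = description_to_fd(
        "Maurizio Gucci",
        head_noun="head",
        modifiers=["former"],
        qualifier="of Italy's Gucci fashion dynasty",
    )
    print(gucci_fd["describer"]["head"]["lex"])  # prints: head

<Paragraph> Once in this form, a description is a structure rather than a canned string, which is what makes the transformations discussed in the next subsection possible. </Paragraph>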
</Section> <Section position="2" start_page="224" end_page="224" type="sub_section"> <SectionTitle> 4.2 Lexicon creation </SectionTitle> <Paragraph position="0"> We have identified several major advantages of using the FDs produced by the system in generation, compared to using canned phrases.</Paragraph> <Paragraph position="1"> * Grammaticality. The deeper representation allows for grammatical transformations, such as aggregation: e.g., &quot;president Yeltsin&quot; + &quot;president Clinton&quot; can be generated as &quot;presidents Yeltsin and Clinton&quot;.</Paragraph> <Paragraph position="2"> * Unification with existing ontologies. E.g., if an ontology records that the word &quot;president&quot; is a realization of the concept &quot;head of state&quot;, then under certain conditions the description can be replaced by one referring to &quot;head of state&quot;.</Paragraph> <Paragraph position="3"> * Generation of referring expressions. In the previous example, if &quot;president Bill Clinton&quot; is used in a sentence, then &quot;head of state&quot; can be used as a referring expression in a subsequent sentence.</Paragraph> <Paragraph position="4"> * Enhancement of descriptions. If we have retrieved &quot;prime minister&quot; as a description for Silvio Berlusconi, and we later obtain knowledge that someone else has become Italy's prime minister, then we can generate &quot;former prime minister&quot; using a transformation of the old FD.</Paragraph> <Paragraph position="5"> * Lexical choice. When different descriptions are automatically marked for semantics, PROFILE can prefer to generate one over another based on semantic features. This is useful if a summary discusses events related to one description associated with the entity more than the others.</Paragraph> <Paragraph position="6"> * Merging lexicons. The lexicon generated automatically by the system can be merged with a domain lexicon generated manually.</Paragraph> <Paragraph position="7"> These advantages look very promising, and we will be exploring them in detail in our work on summarization in the near future.</Paragraph> </Section> </Section> <Section position="5" start_page="224" end_page="225" type="metho"> <SectionTitle> 5 Coverage and Limitations </SectionTitle> <Paragraph position="0"> In this section we provide an analysis of the capabilities and current limitations of PROFILE.</Paragraph> <Section position="1" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 5.1 Coverage </SectionTitle> <Paragraph position="0"> At the current stage of implementation, PROFILE has the following coverage.</Paragraph> <Paragraph position="1"> * Syntactic coverage. Currently, the system includes an extensive finite-state grammar that can handle various pre-modifiers and appositions. The grammar matches arbitrary noun phrases in each of these two cases, to the extent that the part-of-speech tagger provides a correct tagging (see the sketch at the end of this subsection).</Paragraph> <Paragraph position="2"> * Precision. In Subsection 3.1 we showed the precision of the extraction of entity names. Similarly, we have computed the precision of 611 retrieved descriptions, using randomly selected entities from the list retrieved in Subsection 3.1. Of the 611 descriptions, 551 (90.2%) were correct. The remaining 60 included roughly equal numbers of incorrect NP attachments and incorrect part-of-speech assignments. For our task (symbolic text generation), precision is more important than recall; it is critical that the extracted descriptions be correct so that they can be converted to FDs and generated.</Paragraph> <Paragraph position="3"> * Length of descriptions. The longest description retrieved by the system was 9 lexical items long: &quot;Maurizio Gucci, the former head of Italy's Gucci fashion dynasty&quot;. The shortest descriptions are 1 lexical item in length, e.g., &quot;President&quot; in &quot;President Bill Clinton&quot;.</Paragraph> <Paragraph position="4"> * Protocol coverage. We have implemented retrieval facilities to extract descriptions using the NNTP (Usenet News) and HTTP (World-Wide Web) protocols.</Paragraph>
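<Paragraph> The sketch below illustrates the flavor of such a pattern: a regular expression over part-of-speech-tagged text that captures a proper-name sequence followed by an appositive description. It is a schematic example only, with an assumed Penn-Treebank-style tag set; it is not the finite-state grammar actually used by PROFILE. </Paragraph>

    import re

    # Illustrative sketch only: one pattern over part-of-speech-tagged text
    # ("word/TAG" tokens) that captures a proper-name sequence followed by an
    # appositive description. The tag set and the pattern are assumptions.

    APPOSITION = re.compile(
        r"(?P<name>(?:\w+/NNP\s+)+)"                                   # entity name: proper nouns
        r",/,\s+"                                                      # comma opening the apposition
        r"(?P<desc>(?:[\w']+/(?:DT|JJ|NNPS|NNP|NNS|NN|POS|IN)\s*)+)"   # description noun phrase
        r",/,"                                                         # comma closing the apposition
    )

    tagged = ("Maurizio/NNP Gucci/NNP ,/, the/DT former/JJ head/NN of/IN "
              "Italy/NNP 's/POS Gucci/NNP fashion/NN dynasty/NN ,/, was/VBD shot/VBN")

    match = APPOSITION.search(tagged)
    if match:
        name = " ".join(tok.split("/")[0] for tok in match.group("name").split())
        desc = " ".join(tok.split("/")[0] for tok in match.group("desc").split())
        print(name, "->", desc)  # Maurizio Gucci -> the former head of Italy 's Gucci fashion dynasty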
</Section> <Section position="2" start_page="225" end_page="225" type="sub_section"> <SectionTitle> 5.2 Limitations </SectionTitle> <Paragraph position="0"> Our system currently does not handle entity cross-referencing. It will not realize that &quot;Clinton&quot; and &quot;Bill Clinton&quot; refer to the same person. Nor will it link a person's profile with the profile of the organization of which he is a member.</Paragraph> <Paragraph position="1"> At this stage, the system generates functional descriptions (FDs), but they are not being used in a summarization system yet.</Paragraph> </Section> </Section> <Section position="6" start_page="225" end_page="225" type="metho"> <SectionTitle> 6 Current Directions </SectionTitle> <Paragraph position="0"> One of the more important current goals is to increase the coverage of the system by providing interfaces to a large number of on-line sources of news. We would ideally want to build a comprehensive and shareable database of profiles that can be queried over the World-Wide Web.</Paragraph> <Paragraph position="1"> We need to refine the algorithm to handle cases that are currently problematic. For example, polysemy is not properly handled. For instance, we would not properly label noun phrases such as &quot;Rice University&quot;, as it contains the word &quot;rice&quot;, which can be categorized as a food.</Paragraph> <Paragraph position="2"> Another long-term goal of our research is the generation of evolving summaries that continuously update the user on a given topic of interest. In that case, the system will have a model containing all prior interaction with the user. To avoid repetitiveness, such a system will have to resort to using different descriptions (as well as referring expressions) to address a specific entity. (A large number of threads of summaries on the same topic from the Reuters and UPI newswires used up to 10 different referring expressions, mostly of the type of descriptions discussed in this paper, to refer to the same entity.) We will be investigating an algorithm that will select a proper ordering of multiple descriptions referring to the same person.</Paragraph> <Paragraph position="3"> After we collect a series of descriptions for each entity, we have to select among all of them. There are two scenarios. In the first one, we have to pick one single description from the database that best fits the summary that we are generating. In the second scenario, the evolving summary, we have to generate a sequence of descriptions, which might possibly view the entity from different perspectives. We are investigating algorithms that will decide the order of generation of the different descriptions.</Paragraph> <Paragraph position="4"> Among the factors that will influence the selection and ordering of descriptions, we can note the user's interests, his knowledge of the entity, and the focus of the summary (e.g., &quot;Democratic presidential candidate&quot; vs. &quot;U.S. president&quot; for Bill Clinton). We can also select one description over another based on how recently they were added to the database, whether one of them has already been used in a summary, whether the summary is an update to an earlier summary, and whether another description from the same category has already been used. We have yet to decide under what circumstances a description needs to be generated at all.</Paragraph> <Paragraph position="5"> We are interested in implementing existing algorithms, or designing our own, that will match different instances of the same entity appearing in different syntactic forms, e.g., to establish that &quot;PLO&quot; is an alias for the &quot;Palestine Liberation Organization&quot;. We will investigate using co-occurrence information to match acronyms to full organization names, and alternative spellings of the same name with each other.</Paragraph>
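<Paragraph> A minimal sketch of such a heuristic, given purely for illustration (the initial-letter test, the threshold, and the helper names are assumptions rather than an algorithm from the paper), is to propose an acronym-name link when the acronym spells out the initials of the capitalized words of the name, and to keep the link only if the two strings co-occur in a sufficient number of documents. </Paragraph>

    from collections import Counter

    # Illustrative sketch only: link acronyms to full organization names using an
    # initial-letter test plus document co-occurrence counts. The threshold and
    # the function names are assumptions, not part of PROFILE as described here.

    def initials(name):
        """Initial letters of the capitalized words, e.g. 'Palestine Liberation Organization' -> 'PLO'."""
        return "".join(w[0] for w in name.split() if w[:1].isupper())

    def acronym_matches(acronym, name):
        return acronym.upper() == initials(name).upper()

    def cooccurrence_links(documents, acronyms, names, min_docs=2):
        """Return (acronym, name) pairs that pass the initials test and co-occur in at least min_docs documents."""
        counts = Counter()
        for doc in documents:
            for a in acronyms:
                for n in names:
                    if a in doc and n in doc and acronym_matches(a, n):
                        counts[(a, n)] += 1
        return [pair for pair, c in counts.items() if c >= min_docs]

    docs = [
        "The Palestine Liberation Organization (PLO) announced a new position.",
        "A spokesman for the PLO said the Palestine Liberation Organization would respond.",
    ]
    print(cooccurrence_links(docs, ["PLO"], ["Palestine Liberation Organization"]))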
<Paragraph position="6"> An important application that we are considering is applying the technology to text available via other protocols, such as SMTP (electronic mail), and retrieving descriptions for entities mentioned in such messages.</Paragraph> <Paragraph position="7"> We will also look into connecting the current interface to news available on the Internet with an existing search engine such as Lycos (www.lycos.com) or AltaVista (www.altavista.digital.com). We can then use the existing indices of all Web documents mentioning a given entity as a news corpus on which to perform the extraction of descriptions.</Paragraph> <Paragraph position="8"> Finally, we will investigate the creation of KQML (Finin et al., 1994) interfaces to the different components of PROFILE, which will be linked to other information access modules at Columbia University.</Paragraph> </Section> <Section position="7" start_page="225" end_page="226" type="metho"> <SectionTitle> 7 Contributions </SectionTitle> <Paragraph position="0"> We have described a system that allows users to retrieve descriptions of entities using a Web-based search engine. Figure 7 shows the Web interface to PROFILE. Users can select an entity (such as &quot;John Major&quot;), specify what semantic classes of descriptions they want to retrieve (e.g., age, position, nationality), as well as the maximal number of queries that they want. They can also specify which sources of news should be searched. Currently, the system has an interface to Reuters at www.yahoo.com, to the CNN Web site, and to all local news delivered via NNTP to our local news domain.</Paragraph> <Paragraph position="1"> The Web-based interface is publicly accessible (currently within Columbia University only). All queries are cached, and the descriptions retrieved can be reused in a subsequent query. We believe that such an approach to information extraction can be classified as a collaborative database.</Paragraph> <Paragraph position="2"> The FD generation component produces syntactically correct functional descriptions that can be used to generate English-language descriptions using FUF and Surge, and can also be used in a general-purpose summarization system in the domain of current news.</Paragraph> <Paragraph position="3"> All components of the system assume no prior domain knowledge and are therefore portable to many domains, such as sports, entertainment, and business.</Paragraph> </Section> </Paper>