<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1012"> <Title>Automatic Evaluation of Summaries Using Document Graphs</Title> <Section position="3" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Researchers in the field of document summarization have been trying for many years to define a metric for evaluating the quality of a machine-generated summary. Most of these attempts involve human intervention, which makes the evaluation process expensive and time-consuming. We discuss some important work in the intrinsic category.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Sentence Precision-Recall Measure </SectionTitle> <Paragraph position="0"> Sentence precision and recall have been widely used to evaluate the quality of a summarizer (Jing et al., 1998). Sentence precision measures the percentage of sentences in the system summary that match sentences in the model summary. Recall, on the other hand, measures the percentage of sentences in the model summary that appear in the system summary. Even though sentence precision/recall can give us an idea of a summary's quality, they are not the best metrics for evaluating a system, because a small change in the output summary can dramatically affect the measured quality (Jing et al., 1998). For example, a system may pick a sentence that does not match the model sentence chosen by an assessor but is equivalent to it in meaning; this, of course, will affect the score assigned to the system dramatically. It is also clear that sentence precision/recall is only applicable to summaries generated by sentence extraction, not abstraction (Mani, 2001).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Content-Based Measure </SectionTitle> <Paragraph position="0"> Content-based measures compute similarity at the vocabulary level (Donaway et al., 2000; Mani, 2001). The evaluation is done by creating term-frequency vectors for both the summary and the model summary and measuring the cosine similarity (Salton, 1988) between the two vectors; the higher the cosine similarity, the higher the quality of the summary. Lin and Hovy (2002) used accumulative n-gram matching scores between model summaries and the summaries to be evaluated as a performance indicator for multi-document summarization. They achieved their best results by giving more credit to longer n-gram matches and using the Porter stemmer.</Paragraph>
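For concreteness, the following minimal Python sketch (not taken from any of the cited systems) shows how the two measures just described, sentence precision/recall and term-frequency cosine similarity, could be computed. Whitespace tokenization, exact sentence matching, and the function names are all illustrative assumptions.

# Minimal sketch of sentence precision/recall against a model extract and of
# cosine similarity between term-frequency vectors.  Tokenization is plain
# whitespace splitting, purely for illustration.
from collections import Counter
import math


def sentence_precision_recall(system_sentences, model_sentences):
    """Fraction of system sentences found in the model summary, and vice versa."""
    system = set(system_sentences)
    model = set(model_sentences)
    matched = system & model
    precision = len(matched) / len(system) if system else 0.0
    recall = len(matched) / len(model) if model else 0.0
    return precision, recall


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    system = ["Workers at a coal mine went on strike.", "The strike lasted a week."]
    model = ["Workers at a coal mine went on strike.", "Management refused to negotiate."]
    print(sentence_precision_recall(system, model))            # (0.5, 0.5)
    print(cosine_similarity(" ".join(system), " ".join(model)))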
<Paragraph position="1"> A problem with evaluation approaches that use the cosine measure is that a summary may use key terms different from those in the original document or the model summary. Since term frequency is the basis for scoring, a high-quality summary may receive a low score if the terms it uses are not the terms used in most of the document's text. Donaway et al. (2000) discussed using a common tool from information retrieval, latent semantic indexing (LSI) (Deerwester et al., 1990), to address this problem. LSI reduces the effect of the near-synonymy problem on the similarity score by penalizing the summary less in the reduced-dimension model when infrequent terms are synonymous with frequent terms. LSI averages the weights of terms that co-occur frequently with the same other terms; for example, both &quot;bank&quot; and &quot;financial institution&quot; often occur with the term &quot;account&quot; (Deerwester et al., 1990). Even though LSI can be useful in some cases, it can produce unexpected results when a document contains terms that are not synonymous with each other but co-occur with the same other terms.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Document Graph </SectionTitle> <Paragraph position="0"> Current approaches to content-based summarization evaluation ignore the relations between the keywords expressed in the document. Here, we introduce our approach, which measures the similarity between two summaries, or between a summary and a document, based on these relations. In our approach, each document/summary is represented as a document graph (DG), a directed graph of concepts/entities and the relations between them. A DG contains two kinds of nodes, concept/entity nodes and relation nodes. Currently, only two kinds of relations, &quot;isa&quot; and &quot;related to&quot;, are captured (Santos et al., 2001) for simplicity.</Paragraph> <Paragraph position="1"> To generate a DG, a document/summary in plain-text format is first tokenized into sentences; each sentence is then parsed using Link Parser (Sleator and Temperley, 1993), and the noun phrases (NP) are extracted from the parsing results. The relations are generated based on three heuristic rules: * The NP-heuristic sets up the hierarchical relations. For example, from the noun phrase &quot;folk hero stature&quot;, we generate the relations &quot;folk hero stature isa stature&quot;, &quot;folk hero stature related to folk hero&quot;, and &quot;folk hero isa hero&quot;.</Paragraph> <Paragraph position="2"> * The NP-PP-heuristic attaches prepositional phrases to adjacent noun phrases. For example, from &quot;workers at a coal mine&quot;, we generate the relation &quot;worker related to coal mine&quot;.</Paragraph> <Paragraph position="3"> * The sentence-heuristic relates concepts/entities contained in one sentence.</Paragraph> <Paragraph position="4"> The relations created by the sentence-heuristic are sensitive to verbs, since the interval between two noun phrases usually contains a verb. For example, from the sentence &quot;Workers at a coal mine went on strike&quot;, we generate the relation &quot;worker related to strike&quot;; from &quot;The usual cause of heart attacks is a blockage of the coronary arteries&quot;, we generate &quot;heart attack cause related to coronary artery blockage&quot;. Figure 1 shows an example of a partial DG.</Paragraph> </Section>
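As an illustration only (not the authors' implementation), the following Python sketch applies simplified versions of two of these heuristics to pre-extracted noun phrases. There is no Link Parser here, and the exact triggering conditions are guessed from the worked examples above, so both functions should be read as assumptions.

def np_heuristic(noun_phrase):
    """Hierarchical "isa"/"related to" relations implied by one noun phrase.

    "folk hero stature" ->
        (folk hero stature, isa, stature)
        (folk hero stature, related to, folk hero)
        (folk hero, isa, hero)
    """
    relations = []
    words = noun_phrase.split()
    while len(words) > 1:
        phrase, head, modifier = " ".join(words), words[-1], " ".join(words[:-1])
        relations.append((phrase, "isa", head))
        if len(words) > 2:  # guessed condition; reproduces the example in the text
            relations.append((phrase, "related to", modifier))
        words = words[:-1]
    return relations


def sentence_heuristic(noun_phrases):
    """Relate the noun phrases that co-occur in one sentence (simplified)."""
    relations = []
    for i, left in enumerate(noun_phrases):
        for right in noun_phrases[i + 1:]:
            relations.append((left, "related to", right))
    return relations


if __name__ == "__main__":
    print(np_heuristic("folk hero stature"))
    # NPs assumed extracted from "Workers at a coal mine went on strike."
    print(sentence_heuristic(["worker", "coal mine", "strike"]))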
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3.2 Similarity Comparison between two Document Graphs </SectionTitle> <Paragraph position="0"> The similarity of DG1 to DG2 is given by the equation similarity(DG1, DG2) = (n/N + m/M) / 2, which is modified from Montes-y-Gomez et al. (2000). N is the number of concept/entity nodes in DG1, M is the number of relations in DG1, n is the number of matched concept/entity nodes in the two DGs, and m is the number of matched relations.</Paragraph> <Paragraph position="1"> We say that a relation in two different DGs is matched only when the two concept/entity nodes linked to the relation node are matched and the relation node itself is matched. Since we might compare two DGs that differ significantly in size (for example, the DG of an extract vs. that of its source document), we use the numbers of concept/entity nodes and relation nodes in the target DG as N and M, instead of the total numbers of nodes in both DGs; otherwise, the similarity would always be very low. When comparing an extract with its source text, the target DG is the one for the extract. Currently, we weight all concepts/entities and relations equally; this can be fine-tuned in the future.</Paragraph>
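For illustration, the following minimal Python sketch (not the authors' implementation) computes this DG-to-DG similarity with equal weighting of node and relation matches, as stated above. The graph representation used here, a set of concept/entity labels plus a set of (concept, relation, concept) triples, and the exact-string matching criterion are simplifying assumptions.

def dg_similarity(target_nodes, target_relations, other_nodes, other_relations):
    """Similarity of the target DG (e.g. an extract) to another DG (e.g. its source)."""
    N = len(target_nodes)      # concept/entity nodes in the target DG
    M = len(target_relations)  # relations (triples) in the target DG
    matched_nodes = target_nodes & other_nodes
    # A relation matches only if both linked concept nodes and the relation match.
    matched_relations = {
        (a, rel, b)
        for (a, rel, b) in target_relations & other_relations
        if a in matched_nodes and b in matched_nodes
    }
    n, m = len(matched_nodes), len(matched_relations)
    node_term = n / N if N else 0.0
    relation_term = m / M if M else 0.0
    return (node_term + relation_term) / 2.0


if __name__ == "__main__":
    extract_nodes = {"worker", "coal mine", "strike"}
    extract_rels = {("worker", "related to", "coal mine"),
                    ("worker", "related to", "strike")}
    source_nodes = {"worker", "coal mine", "strike", "week"}
    source_rels = {("worker", "related to", "coal mine"),
                   ("worker", "related to", "strike"),
                   ("strike", "related to", "week")}
    print(dg_similarity(extract_nodes, extract_rels, source_nodes, source_rels))  # 1.0

</Section> </Section> </Paper>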