<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2213"> <Title>A Method for Relating Multiple Newspaper Articles by Using Graphs, and Its Application to Webcasting</Title> <Section position="3" start_page="1307" end_page="1309" type="metho"> <SectionTitle> 3 Creating a Graph Structure for Articles </SectionTitle> <Paragraph position="0"> This section describes how to construct a directed graph structure from a set of articles. Any directed graph can be represented by a matrix. Figure 3 shows the adjacency matrix M_G of the graph G in Figure 1.</Paragraph> <Paragraph position="1"> For example, a value of &quot;1&quot; for the (1, 2) element of M_G indicates that d1 is adjacent to d2. Since an article cannot follow itself, every (i, i) element is &quot;0&quot;. From the time constraint defined in Section 3, M_G is an upper triangular matrix.</Paragraph> <Paragraph position="2"> The following is a procedure for constructing a directed graph for related articles: 1. Calculate the similarity and difference between articles.</Paragraph> <Paragraph position="3"> 2. Construct a similarity matrix.</Paragraph> <Paragraph position="4"> 3. Convert the matrix into an adjacency matrix. In the following subsections, each step is illustrated by using the set of articles V in Figure 2, on the subject of nuclear testing, taken from the Nikkei Shinbun.</Paragraph> <Section position="1" start_page="1307" end_page="1307" type="sub_section"> <SectionTitle> 3.1 Calculating the similarities and differences between articles </SectionTitle> <Paragraph position="0"> The function sim(d_i, d_j) calculates the word-based similarity between two articles. It is defined on the basis of Salton's Vector Space Model (Salton, 1968). Words are extracted from an article by using a morphological analyzer. 
Next, nouns and verbs are extracted as keywords.</Paragraph> <Paragraph position="1"> [Figure 2, excerpt: articles in V] d7: President of France states that nuclear testing will restart in September, and that France will conduct eight tests between now and next May. d8: France states that it will restart nuclear testing. This will hamper nuclear disarmament. d9: France states that it will restart nuclear testing. Australia halts defense cooperation with France. d10: France states that it will restart nuclear testing. The U.S. expresses regret at the decision. [Equation: sim(d_i, d_j), defined over the keywords kw in terms of the weights w^{d_i}_{kw} and w^{d_j}_{kw}]</Paragraph> <Paragraph position="2"> Here, w^{d_i}_{kw} is the weight given to the keyword kw in article d_i. A modification of the TF.IDF value (Robertson et al., 1976) is used for the weighting. g^{d_i}_{kw} is the weight assigned to the keyword kw, which is a differentia word for d_i.</Paragraph> <Paragraph position="4"> Other parameters are defined as follows: k: a constant value; C_{d_i}(kw): the frequency of word kw in d_i; C_{d_i}: the number of words in d_i; N_k(kw): the number of articles that contain the word kw among the k articles d_{i-k}, ..., d_i. The function differentia(d_i) returns the set of keywords that appear in d_i but do not appear in the last k articles.</Paragraph> <Paragraph position="6"> where i - k &lt; l &lt; i, C_{d_l}(kw) = 0}</Paragraph> </Section> <Section position="2" start_page="1307" end_page="1307" type="sub_section"> <SectionTitle> 3.2 Constructing a similarity matrix </SectionTitle> <Paragraph position="0"> A similarity matrix for a set of articles is constructed by using the sim function. In a conventional hierarchical clustering algorithm, the similarity of every pair of articles is required in order to construct a hierarchical tree of the set of articles. This requires n(n-1)/2 calculations of the similarity function for n articles, with a consequent complexity of O(n^2). 
This is very expensive when n is large.</Paragraph> <Paragraph position="1"> In our algorithm for constructing a similarity matrix, shown in Figure 4, a constraint reduces the complexity of constructing a graph structure for an article set to O(n). The following constraint, which includes Constraint 1, is used in the threading algorithm. [Figure 4, fragment] procedure MakeDistanceMatrix for i = 2 to n begin if i - k &lt; 1 then s ← 1 else s ← i - k</Paragraph> <Paragraph position="3"> Constraint 2: For (d_i, d_j) ∈ A, j - (k + 1) &lt; i &lt; j. This constraint means that an article can only follow the last k articles. As a result, the number of similarity calculations is reduced to at most kn, giving a complexity of O(n).</Paragraph> <Paragraph position="4"> Using this algorithm, the similarities between nodes are calculated; Figure 5 shows the resulting similarity matrix S of V. In this case, keywords are extracted from title sentences, and k is set to five.</Paragraph> </Section> <Section position="3" start_page="1307" end_page="1309" type="sub_section"> <SectionTitle> 3.3 Conversion into an adjacency matrix </SectionTitle> <Paragraph position="0"> From the similarity matrix, an adjacency matrix is constructed. An element s(i, j) in the similarity matrix S corresponds to the element ss(i, j) in the adjacency matrix SS. There are various strategies for the conversion. In this paper, ss(i, j) is set to 1 when s(i, j) > 0.18, and any node can follow at most k/2 nodes, in this case two nodes. Figure 6 shows the result of the conversion. Finally, a directed graph for V is created (Figure 7). Figure 8 shows a graph that visualizes the content of the articles in our example.</Paragraph> <Paragraph position="2"/> <Paragraph position="4"> There are two threads in the graph. One concerns France's restarting of nuclear testing. The other concerns China's latest nuclear test. The &quot;France&quot; thread contains two sub-threads. 
One concerns requests by other countries for France to reconsider its stated intention of restarting nuclear testing, and the other concerns responses by other countries to the French government's official statement on testing.</Paragraph> <Paragraph position="5"> Some articles are followed by multiple articles. For example, d7 is the first official statement on France's restarting of nuclear testing, and many related articles on this topic follow.</Paragraph> <Paragraph position="6"> Each rectangle in Figure 8 represents an article.</Paragraph> <Paragraph position="7"> Words in a rectangle are the differentia words for the article. These words show new information in the article, and make it easy to understand the content of the articles. If a word in an article appears in the differentia words for its parent article, the word may represent a &quot;turning point&quot; in the story of the articles. For example, the word &quot;state&quot; is a differentia word for d7, and appears in its adjacent articles d8, d9, and d10. This means that d7 is the starting point of the new topic &quot;state.&quot; Such words are called topic words, and are shown in bold type in Figure 8.</Paragraph> <Paragraph position="8"> Several features of the graph visualize the characteristics and relationships of the articles; these features will be discussed in the next section.</Paragraph> <Paragraph position="9"> It is difficult to evaluate the result of threading.</Paragraph> <Paragraph position="10"> We are implementing it in a webcasting (push) application so that it can be evaluated by the many people who use ordinary web browsers. The attempt is described in Section 5.</Paragraph> </Section> </Section> <Section position="4" start_page="1309" end_page="1311" type="metho"> <SectionTitle> 4 Features of a Graph </SectionTitle> <Paragraph position="0"> This section describes how the features of a constructed graph represent the characteristics of articles. 
</Paragraph> <Section position="1" start_page="1309" end_page="1310" type="sub_section"> <SectionTitle> 4.1 In-degree and Out-degree </SectionTitle> <Paragraph position="0"> The in-degree is the number of arcs leading to a node, while the out-degree is the number of arcs leading from it. The in-degree of d_i can be calculated by adding up the elements in the i-th column of an adjacency matrix. The out-degree of d_i can be calculated by adding up the elements in the i-th row of the matrix (Figure 9). In their analysis of hypertext, Botafogo et al. (1992) call a node that has a high out-degree an index node, and a node that has a high in-degree a reference node.</Paragraph> <Paragraph position="1"> In the set of articles V shown in Figure 9, d7 is an index node. In this paper, an index node denotes the beginning of a new topic. When the topic is important, many articles follow, and consequently the out-degree for the node increases. The contribution of reference nodes is not clear in V (d6, d8, and d9 have the maximum in-degree). Nodes that have a high in-degree have two characteristics. The first is that when the articles contain multiple topics, they have many inbound arcs, each representing a different topic. The second is that when the articles are closely related for a particular topic, the in-degrees of related nodes increase, since these articles are connected to each other.</Paragraph> </Section> <Section position="2" start_page="1310" end_page="1310" type="sub_section"> <SectionTitle> 4.2 Path </SectionTitle> <Paragraph position="0"> A path from one node to another node shows the &quot;story flow&quot; of articles. Multiple paths between two nodes show different stories about the nodes.</Paragraph> <Paragraph position="1"> For example, there are three paths between d1, which is the first node, and d10. The shortest path (d1, d2, d7, d10) gives a simple outline of the articles. 
The longest path (d1, d2, d7, d8, d9, d10) contains all related information on the topic. By extracting long paths from the graph and combining them, various stories can be created.</Paragraph> <Paragraph position="2"> The length of a path shows how closely the nodes on it belong to the &quot;main stream&quot; of the story. For example, the maximum length of a path through d6 is three, while that of a path through d7 is five. This means that a path that contains d7 is on a main stream of the thread and is likely to be continued.</Paragraph> <Paragraph position="3"> The longest paths for nodes can be calculated by using the algorithm shown in Figure 11. Its complexity is O(n), since the number of arcs is at most nk for n nodes, from Constraint 2, defined in Section 3.2.</Paragraph> </Section> <Section position="3" start_page="1310" end_page="1311" type="sub_section"> <SectionTitle> 4.3 Cycle </SectionTitle> <Paragraph position="0"> A cycle (formally called a semi-cycle, since the graph is directed) shows the existence of a topic. In V, {d7, d8, d9, d10} is a cycle for the topic &quot;statement.&quot; By recognizing cycles, we can extract topics from the whole graph. Furthermore, we can abstract articles by reducing cycles to single nodes.</Paragraph> <Paragraph position="1"> 5 XML-based Representation for Threads It is important that the threading information be exchangeable when we apply our method to Web documents. Extensible Markup Language (XML) is a proposed standard (XML, 1997) specified by the World Wide Web Consortium (W3C). In XML, tags and attributes can be defined, whereas in HTML they are fixed. XML documents can be used to exchange information that has various data structures. For example, Channel Definition Format (CDF) (CDF, 1997) is a standard for offering frequently updated collections of information (channels) on the Web. A CDF document can contain a collection of articles that have a tree structure. 
In this paper, graph structures of created threads are represented in XML. Figure 10 shows part of the thread in Figure 8.</Paragraph> <Paragraph position="2"> The &lt;thread&gt; tag shows the beginning of the thread. It contains a set of descriptions for articles, each marked &lt;article&gt;. Each instance of the &lt;article&gt; tag has a reference to its source document, an identifying id, genus and differentia words, and other information on the article. The tag &lt;follows&gt; is used to denote arcs from the article to related articles.</Paragraph> <Paragraph position="3"> The XML documents can be separate from the source articles. They can be provided as part of a &quot;push&quot; service for Internet users, offering a solution to the problem of information overload. In such a service, a gatherer collects articles from Web sites and a threader makes threads for them. The results are stored in XML, and then pushed to subscribers, who can capture the flow of topics by following the threads. In another scenario, when a user gets an article and wants to see its origin or the next related article, he or she gets the thread containing the article by consulting the threading server. The advantage of using XML is that it will be supported by various tools, including Web browsers. We are now prototyping the threading service system by using an XML processor developed at our laboratory.</Paragraph> <Paragraph position="4"> Figure 12 shows a Java applet for viewing threads, which can run on major Web browsers. An XML document is parsed and visualized as a tree-like structure.</Paragraph> </Section> </Section> </Paper>