<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1033">
  <Title>Using collocations for topic segmentation and link detection</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overview of TOPICOLL
</SectionTitle>
    <Paragraph position="0"> In line with much work on discourse analysis, TOPICOLL processes texts linearly: it detects topic shifts and finds links between segments without delaying its decisions, i.e., by taking into account only the part of the text that has already been analyzed. A window that delimits the current focus of the analysis is moved over each text to be processed. This window contains the lemmatized content words of the text, as produced by pre-processing. A topic context is associated with this focus window. It is made up of both the words of the window and the words that are selected from a collocation network (footnote 1) as strongly linked to the words of the window. The current segment is also given a topic context, which results from the fusion of the contexts associated with the focus window while this window was inside the segment. A topic shift is then detected when the context of the focus window and the context of the current segment are no longer similar for several successive positions of the focus window. This process also performs link detection by comparing the topic context of each new segment to the contexts of the already delimited segments.</Paragraph>
    <Paragraph position="1"> The use of a collocation network allows TOPICOLL to find relations beyond word recurrence and to associate a richer topical representation with segments, which facilitates tasks such as link detection or topic identification. However, work such as (Kozima, 1993), (Ferret, 1998) or (Kaufmann, 1999) showed that using a domain-independent source of knowledge for text segmentation does not necessarily lead to better results than methods based only on word distribution in texts. One reason is that these methods do not precisely control the relations they select or do not take into account the sparseness of their knowledge. Hence, while they discard some incorrect topic shifts found by methods based on word recurrence, they also find incorrect shifts when the relevant relations are absent from their knowledge, or miss some correct shifts because they select relations that are not relevant from a topical viewpoint. By combining word recurrence and relations selected from a collocation network, TOPICOLL aims at exploiting a domain-independent source of knowledge for text segmentation more accurately.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Collocation networks
</SectionTitle>
    <Paragraph position="0"> TOPICOLL depends on a language-dependent resource, a collocation network. Two collocation networks were built for it: one for French, from the Le Monde newspaper (24 months between 1990 and 1994), and one for English, from the L.A. Times newspaper (2 years, part of the TREC corpus). The size of each corpus was around 40 million words.</Paragraph>
    <Paragraph position="1"> 1 A collocation network is a set of collocations between words. This set can be viewed as a network whose nodes are words and whose edges are collocations.</Paragraph>
    <Paragraph position="2"> The building process was the same for the two networks. First, the initial corpus was pre-processed in order to characterize texts by their topically significant words. Thus, we retained only the lemmatized forms of plain words, that is, nouns, verbs and adjectives. Collocations were extracted according to the method described in (Church and Hanks, 1990) by moving a window over the texts. Parameters were chosen in order to capture topical relations: the window was rather large, 20 words wide, and took into account the boundaries of texts; moreover, collocations were indifferent to word order. We also adopted an estimate of mutual information as the cohesion measure of each collocation. This measure was normalized by the maximal mutual information for the considered corpus.</Paragraph>
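    <Paragraph> The extraction step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the probability estimates and, in particular, the choice of log2(N) as the maximal mutual information are assumptions made for the example.
<![CDATA[
```python
import math
from collections import Counter


def extract_collocations(texts, window=20):
    """Count order-independent co-occurrences inside a sliding window.

    `texts` is a list of documents, each a list of lemmatized content
    words. The window never crosses a document boundary.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in texts:
        word_counts.update(words)
        for i, w in enumerate(words):
            # Pair w with the following words that fit in the same window.
            for other in words[i + 1:i + window]:
                if other != w:
                    pair_counts[frozenset((w, other))] += 1
    return word_counts, pair_counts


def cohesion(pair, word_counts, pair_counts):
    """Mutual information of a pair, divided by an assumed maximum."""
    total = sum(word_counts.values())
    a, b = tuple(pair)
    p_pair = pair_counts[pair] / total
    p_a = word_counts[a] / total
    p_b = word_counts[b] / total
    mi = math.log2(p_pair / (p_a * p_b))
    # Assumption: the maximum is reached by two words that each occur
    # once and occur together, which gives log2(total).
    return mi / math.log2(total)
```
]]>
    Storing each pair as a frozenset makes the counts indifferent to word order, as required in the text.</Paragraph>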
    <Paragraph position="3"> After filtering out the less significant collocations (collocations with fewer than 10 occurrences and cohesion lower than 0.1), we obtained a network with approximately 23,000 words and 5.2 million collocations for French, and 30,000 words and 4.8 million collocations for English.</Paragraph>
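    <Paragraph> The filtering step might look like the sketch below. The data layout is hypothetical, and the code reads the criterion as requiring a collocation to pass both thresholds in order to be kept, which is only one possible interpretation of the sentence above.
<![CDATA[
```python
def filter_collocations(pairs, min_count=10, min_cohesion=0.1):
    """Keep the collocations that pass both thresholds.

    `pairs` maps a collocation to a (count, cohesion) tuple.
    """
    return {
        pair: (count, coh)
        for pair, (count, coh) in pairs.items()
        if count >= min_count and coh >= min_cohesion
    }
```
]]>
    </Paragraph>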
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Description of TOPICOLL
</SectionTitle>
    <Paragraph position="0"> TOPICOLL is based on the creation, the update and the use of a topical representation of both the segments it delimits and the content of its focus window at each position of a text. These topical representations are called topic contexts. Topic shifts are found by detecting that the topic context of the focus window is no longer similar to the topic context of the current segment. Link detection is performed by comparing the context of a new segment to the contexts of the previous segments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Topic contexts
</SectionTitle>
      <Paragraph position="0"> A topic context characterizes the topical dimension of the entity with which it is associated by two vectors of weighted words. One of these vectors, called the text vector, is made up of words coming from the text that is analyzed. The other one, called the collocation vector, contains words selected from a collocation network and strongly linked to the words of the processed text. For both vectors, the weight of a word expresses its importance with regard to the other words of the vector.</Paragraph>
      <Paragraph position="1"> The text vector of the context associated with the focus window is made up of the content words of the window. Their weight is given by: $wght_{txt}(w) = occNb(w) \cdot signif(w)$ (1) where occNb(w) is the number of occurrences of the word w in the window and signif(w) is the significance of w. The weight given by (1) combines the importance of w in the part of the text delimited by the window and its general significance. This significance is defined, as in (Kozima, 1993), as its normalized information in a reference corpus, which in our case is the corpus used for building the collocation network [equation (2) is not preserved in this extraction].</Paragraph>
      <Paragraph position="2"> The building of the collocation vector for the window's context comes from the procedure presented in (Ferret, 1998) for evaluating the lexical cohesion of a text. It consists in selecting words of the collocation network that are topically close to those in the window. We assume that this closeness is related to the number of links that exist between a word of the network and the words of the window. Thus, a word of the network is selected if it is linked to at least wst (3 in our experiments) words of the window. A collocation vector may also contain some words of the window, as they are generally part of the collocation network and may be selected like its other words.</Paragraph>
      <Paragraph position="3"> Each selected word from the network is then assigned a weight. This weight is equal to the sum of the contributions of the window words to which it is linked. The contribution of a word of the window to the weight of a selected word is equal to its weight in the window, given by (1), modulated by the cohesion measure between these two words in the network (see Figure 1). More precisely, the combination of these two factors is achieved by a geometric mean: $wght_{coll}(w) = \sum_i \sqrt{wght_{txt}(w_i) \cdot coh(w, w_i)}$ (3)</Paragraph>
      <Paragraph position="5"> where coh(w,wi) is the measure of the cohesion between w and wi in the collocation network, and the sum runs over the window words wi linked to w.</Paragraph>
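      <Paragraph> The weighting of the selected words can be sketched as below, following the prose description (a sum of geometric means). The data layout is the same hypothetical one as before: `network` maps a window word to its neighbors and their cohesion values.
<![CDATA[
```python
import math


def collocation_vector(text_weights, network, selected):
    """Weight each selected word by summing geometric means.

    Each linked window word w_i contributes
    sqrt(wght_txt(w_i) * coh(w, w_i)).
    """
    vector = {}
    for w in selected:
        vector[w] = sum(
            math.sqrt(weight * network[wi][w])
            for wi, weight in text_weights.items()
            if w in network.get(wi, {})
        )
    return vector
```
]]>
    </Paragraph>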
      <Paragraph position="6">  The topic context of a segment results from the fusion of the contexts associated with the focus window when it was inside the segment. The fusion is achieved as the segment is extended: the context associated with each new position of the window is combined with the current context of the segment. This combination, which is done separately for text vectors and collocation vectors, consists in merging two lists of weighted words. First, the words of the window context that are not in the segment context are added to it. Then, the weight of each word of the resulting list is computed according to its weight in the window context and its previous weight in the segment context [equation (4) is not preserved in this extraction],</Paragraph>
      <Paragraph position="8"> with Cw, the context of the window, Cs, the context of the segment and $wght_x(w, C_{\{s,w\}}, t)$, the weight of the word w in the vector x (txt or coll) of the context $C_{\{s,w\}}$ for the position t. For the words from the window context that are not part of the segment context, $wght_x(w, C_s, t-1)$ is equal to 0.</Paragraph>
      <Paragraph position="9"> The revaluation of the weight of a word in a segment context given by (4) is a compromise between a fast and a slow evolution of the content of segment contexts. The context of a segment has to be stable: if it followed the topical evolution of the window context too closely, topic shifts could not be detected. However, it must also adapt to small variations in the way a topic is expressed as the text progresses, so that false topic shifts are not detected.</Paragraph>
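      <Paragraph> The merging procedure can be illustrated with the sketch below. Since the update rule of equation (4) is not available here, the code takes the rule as a caller-supplied function; the `mean_update` placeholder is a plain average chosen only so that the example runs, and it should not be read as the rule used by TOPICOLL. Treating segment words absent from the window as having a window weight of 0 is a further assumption.
<![CDATA[
```python
def merge_into_segment(segment_vec, window_vec, combine):
    """Merge a window vector into a segment vector, word by word.

    Words absent from the segment context get a previous weight of 0.
    `combine(previous, current)` stands in for equation (4).
    """
    merged = dict(segment_vec)
    for w in window_vec:
        merged.setdefault(w, 0.0)
    return {
        w: combine(prev, window_vec.get(w, 0.0))
        for w, prev in merged.items()
    }


def mean_update(previous, current):
    # Placeholder rule, NOT the paper's equation (4): a plain average.
    return (previous + current) / 2.0
```
]]>
    The same function would be called once for the text vectors and once for the collocation vectors, since the two are merged separately.</Paragraph>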
      <Paragraph position="10">  In order to determine if the content of the focus window is topically coherent or not with the current segment, the topic context of the window is compared to the topic context of the segment. This comparison is performed in two stages: first, a similarity measure is computed between the vectors of the two contexts; then, the resulting values are exploited by a decision procedure that states if the two contexts are similar.</Paragraph>
      <Paragraph position="11"> Like (Choi, 2000) and (Kaufmann, 1999), we use the cosine measure for evaluating the similarity between a vector of the window context (Vw) and the equivalent vector in the segment context (Vs): $sim(V_w, V_s) = \frac{\sum_i wg_x(w_i, C_s) \cdot wg_x(w_i, C_w)}{\sqrt{\sum_i wg_x(w_i, C_s)^2 \cdot \sum_i wg_x(w_i, C_w)^2}}$ (5)</Paragraph>
      <Paragraph position="13"> where $wg_x(w_i, C_{\{s,w\}})$ is the weight of the word wi in the vector x (txt or coll) of the context $C_{\{s,w\}}$.</Paragraph>
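      <Paragraph> For reference, the cosine measure over sparse weighted-word vectors can be written as in the sketch below. Representing each vector as a dictionary and returning 0 for an empty vector are choices made for the example.
<![CDATA[
```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(w, 0.0) for w, weight in vec_a.items())
    norm_a = math.sqrt(sum(x * x for x in vec_a.values()))
    norm_b = math.sqrt(sum(x * x for x in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)
```
]]>
    </Paragraph>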
      <Paragraph position="14"> As we assume that the most significant words of a segment context are the most recurrent ones, the similarity measure takes into account only the words of a segment context whose recurrence (footnote 3) is above a fixed threshold. This threshold is higher for text vectors than for collocation vectors. This filtering is applied only when the context of a segment is considered stable (see 4.2).</Paragraph>
      <Paragraph position="15"> The decision stage is rooted in work on combining the results of several systems that perform the same task. In our case, the evaluation of the similarity between Cs and Cw at each position is based on a vote that synthesizes the viewpoint of the text vector and the viewpoint of the collocation vector.</Paragraph>
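      <Paragraph> The recurrence filter can be sketched as follows, using the definition of recurrence given in footnote 3 (the ratio between the number of window contexts containing the word and the number of window contexts gathered by the segment context). The function name and the threshold value in the test are hypothetical; the source does not give the actual thresholds.
<![CDATA[
```python
def recurrent_words(presence_counts, window_count, threshold):
    """Keep the words whose recurrence exceeds `threshold`.

    `presence_counts[w]` is the number of window contexts containing w;
    `window_count` is the number of window contexts merged into the
    segment context.
    """
    return {
        w for w, n in presence_counts.items()
        if n / window_count > threshold
    }
```
]]>
    </Paragraph>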
      <Paragraph position="16"> First, the value of the similarity measure for each vector is compared to a fixed threshold, and a positive vote in favor of the similarity of the two contexts is cast if the value exceeds this threshold. Then, the global similarity of the two contexts is rejected only if the votes for both vectors are negative.</Paragraph>
      <Paragraph position="17"> 3 The recurrence of a word in a segment context is given by the ratio between the number of window contexts in which the word was present and the number of window contexts gathered by the segment context.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Description of TOPICOLL (continued)
</SectionTitle>
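    <Paragraph> The voting scheme of Section 4.1 reduces to a logical OR of two threshold tests, as the sketch below shows. The parameter names are hypothetical and the threshold values are left to the caller, since they are not given in the text.
<![CDATA[
```python
def contexts_similar(sim_txt, sim_coll, thr_txt, thr_coll):
    """Two-voter decision: reject only when both vectors vote no."""
    vote_txt = sim_txt > thr_txt
    vote_coll = sim_coll > thr_coll
    return vote_txt or vote_coll
```
]]>
    With this rule, a window whose words no longer recur in the segment can still be judged similar to it when the collocation vectors agree, and conversely.</Paragraph>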
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Topic segmentation
</SectionTitle>
      <Paragraph position="0"> The algorithm for detecting topic shifts is taken from (Ferret and Grau, 2000) and basically relies on the following principle: at each text position, if the similarity between the topic context of the focus window and the topic context of the current segment is rejected (see 4.1.3), a topic shift is assumed and a new segment is opened. Otherwise, the active segment is extended up to the current position.</Paragraph>
      <Paragraph position="1"> This algorithm assumes that the transition between two segments occurs at a single point. As TOPICOLL only operates at word level, its precision is limited. This imprecision makes it necessary to set a short delay before deciding that the active segment really ends and, similarly, before deciding that a new segment with a stable topic begins. Hence, the algorithm for detecting topic shifts distinguishes four states:
      - the NewTopicDetection state takes place when a new segment is going to be opened. This opening is then confirmed provided that the content of the focus window context does not change for several positions. Moreover, the core of the segment context is defined when TOPICOLL is in this state;
      - the InTopic state is active when the focus window is inside a segment with a stable topic;
      - the EndTopicDetection state occurs when the focus window is inside a segment but a difference between the context of the window and the context of the current segment suggests that this segment could end soon. As for the NewTopicDetection state, this difference has to be confirmed for several positions before a change of state is decided;
      - the OutOfTopic state is active between two segments. Generally, TOPICOLL stays in this state no longer than 1 or 2 positions, but when neither the words from the text nor the words selected from the collocation network are recurrent, i.e. no stable topic can be detected according to these features, this number of positions may be equal to the size of a segment.</Paragraph>
      <Paragraph position="2"> The transition from one state to another follows the automaton of Figure 2 according to three parameters:
      - its current state;
      - the similarity between the context of the focus window and the context of the current segment: Sim or no Sim;
      - the number of successive positions of the focus window for which the current state does not change: confirmNb. It must exceed the Tconfirm threshold (equal to 3 in our experiments) for leaving the NewTopicDetection or the EndTopicDetection state.</Paragraph>
      <Paragraph position="3">  The processing of a segment starts in the OutOfTopic state, after the end of the previous segment or at the beginning of the text. As soon as the context of the focus window is stable enough between two successive positions, TOPICOLL enters the NewTopicDetection state. The InTopic state can then be reached only if the window context is found stable for the next confirmNb-1 positions. Otherwise, TOPICOLL assumes that it is a false alarm and returns to the OutOfTopic state. The detection of the end of a segment is symmetrical to the detection of its beginning. TOPICOLL goes into the EndTopicDetection state as soon as the content of the window context begins to change significantly between two successive positions, but the transition towards the OutOfTopic state is made only if this change is confirmed for the next confirmNb-1 positions.</Paragraph>
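      <Paragraph> The four-state automaton can be sketched as below. Figure 2 is not available here, so the transitions are one reading of the prose: a detection state is left for the next stable state once Tconfirm successive positions agree, and EndTopicDetection falls back to InTopic when similarity returns (assumed by symmetry). The boolean `sim` abstracts the similarity or stability test performed at each position; all names are illustrative.
<![CDATA[
```python
from enum import Enum


class State(Enum):
    OUT_OF_TOPIC = "OutOfTopic"
    NEW_TOPIC_DETECTION = "NewTopicDetection"
    IN_TOPIC = "InTopic"
    END_TOPIC_DETECTION = "EndTopicDetection"


def step(state, confirm_nb, sim, t_confirm=3):
    """Return (next_state, next_confirm_nb) for one window position."""
    if state is State.OUT_OF_TOPIC:
        if sim:
            return State.NEW_TOPIC_DETECTION, 1
        return state, confirm_nb + 1
    if state is State.NEW_TOPIC_DETECTION:
        if not sim:
            return State.OUT_OF_TOPIC, 1  # false alarm
        if confirm_nb + 1 >= t_confirm:
            return State.IN_TOPIC, 1  # opening confirmed
        return state, confirm_nb + 1
    if state is State.IN_TOPIC:
        if sim:
            return state, confirm_nb + 1
        return State.END_TOPIC_DETECTION, 1
    # Remaining case: END_TOPIC_DETECTION.
    if sim:
        return State.IN_TOPIC, 1  # the change was not confirmed
    if confirm_nb + 1 >= t_confirm:
        return State.OUT_OF_TOPIC, 1  # end of segment confirmed
    return state, confirm_nb + 1


def run(sims, t_confirm=3):
    """Feed a sequence of similarity decisions through the automaton."""
    state, confirm_nb, trace = State.OUT_OF_TOPIC, 0, []
    for sim in sims:
        state, confirm_nb = step(state, confirm_nb, sim, t_confirm)
        trace.append(state)
    return trace
```
]]>
    Keeping `step` free of side effects makes each transition easy to check in isolation against the description above.</Paragraph>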
      <Paragraph position="4"> This algorithm is completed by a specific mechanism related to the OutOfTopic state. When TOPICOLL stays in this state for too long (10 positions of the focus window in our experiments), it assumes that the topic of the current part of the text is difficult to characterize by using word recurrence or selection from a collocation network, and it creates a new segment that covers all the positions concerned.</Paragraph>
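      <Paragraph> A minimal sketch of this fallback is given below. The function name, the representation of positions as a list, and the choice to trigger once the limit is reached (rather than strictly exceeded) are assumptions made for the example.
<![CDATA[
```python
def flush_out_of_topic(positions_out, max_out=10):
    """Return the span to wrap in a fallback segment, or None.

    `positions_out` lists the consecutive window positions spent in the
    OutOfTopic state so far.
    """
    if len(positions_out) >= max_out:
        return (positions_out[0], positions_out[-1])
    return None
```
]]>
    </Paragraph>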
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Link detection
</SectionTitle>
      <Paragraph position="0"> The algorithm of TOPICOLL for detecting identity links between segments is closely tied to its algorithm for delimiting segments. When TOPICOLL goes from the NewTopicDetection state to the InTopic state, it first checks whether the current context of the new segment is similar to one of the contexts of the previous segments. In this case, the similarity between contexts relies only on the similarity measure (see (5)) between their collocation vectors. A specific threshold is used for the decision. If the similarity value exceeds this threshold, the new segment is linked to the corresponding segment and takes the context of that segment as its own. In this way, TOPICOLL assumes that the new segment continues to develop a previous topic. When several segments fulfill the condition for link detection, TOPICOLL selects the one with the highest similarity value.</Paragraph>
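      <Paragraph> The link detection step can be sketched as a thresholded nearest-neighbor search over the collocation vectors of the previous segments. The names and the threshold value in the test are hypothetical; when two segments have the same similarity, this sketch keeps the earlier one, a case the text does not address.
<![CDATA[
```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_link(new_coll_vec, previous_coll_vecs, threshold):
    """Return the index of the most similar previous segment, or None.

    Only collocation vectors are compared; a link requires the
    similarity to exceed `threshold`.
    """
    best_index, best_sim = None, threshold
    for i, vec in enumerate(previous_coll_vecs):
        sim = cosine(new_coll_vec, vec)
        if sim > best_sim:
            best_index, best_sim = i, sim
    return best_index
```
]]>
    When a link is found, the new segment would then adopt the context of the linked segment, as described above.</Paragraph>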
    </Section>
  </Section>
</Paper>