<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1049">
  <Title>Building Semantic Perceptron Net for Topic Spotting</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Topic Representation
</SectionTitle>
    <Paragraph position="0"> The frame of Minsky (1975) is a well-known knowledge representation technique. A frame represents a high-level concept as a collection of slots, where each slot describes one aspect of the concept. The situation is similar in topic spotting.</Paragraph>
    <Paragraph position="1"> For example, the topic &amp;quot;water &amp;quot; may have many aspects (or sub-topics). One sub-topic may be about &amp;quot;water supply &amp;quot;, while the other is about &amp;quot;water and environment protection&amp;quot;, and so on. These sub-topics may have some common attributes, such as the word &amp;quot;water&amp;quot;, and each sub-topic may be further sub-divided into finer subtopics, etc.</Paragraph>
    <Paragraph position="2"> The above points to a hierarchical topic representation, which corresponds to the hierarchy of document classes (Figure 1). In the model, the contents of the topics and sub-topics (shown as circles) are modeled by a set of attributes, which is simply a group of semantically related words (shown as solid elliptical shaped bags or rectangles). The context (shown as dotted ellipses) is used to identify the exact meaning of a word.</Paragraph>
    <Paragraph position="3">  Hofmann (1998) presented a word occurrence based cluster abstraction model that learns a hierarchical topic representation. However, the method is not suitable when the set of training examples is sparse. To avoid the problem of automatically constructing the hierarchical model, Tong et al (1987) required the users to supply the model, which is used as queries in the system.</Paragraph>
    <Paragraph position="4"> Most automated methods, however, avoided this problem by modeling the topic as a feature vector, rule set, or instantiated example (Yang &amp; Liu, 1999). These methods typically treat each word feature as independent, and seldom consider linguistic factors such as the context or lexical chain relations among the features. As a result, these methods are not good at discriminating a large number of documents that typically lie near the boundary of two or more topics.</Paragraph>
    <Paragraph position="5"> In order to facilitate the automatic extraction and modeling of the semantic aspects of topics, we adopt a compromise approach. We model the topic as a tree of concepts as shown in Figure 1.</Paragraph>
    <Paragraph position="6"> However, we consider only one level of hierarchy built from groups of semantically related words.</Paragraph>
    <Paragraph position="7"> These semantic groups may not correspond strictly to sub-topics within the domain. Figure 2 shows an example of an automatically constructed topic tree  In Figure 2, node &amp;quot;a&amp;quot; contains the common feature set of the topic; while nodes &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot; and &amp;quot;d&amp;quot; are related to sub-topics on &amp;quot;water supply&amp;quot;, &amp;quot;rainfall&amp;quot;, and &amp;quot;water and environment protection&amp;quot; respectively. Node &amp;quot;e&amp;quot; is the context of the word &amp;quot;plant&amp;quot;, and node &amp;quot;f&amp;quot; is the context of the word &amp;quot;bank&amp;quot;. Here we use training to automatically resolve the corresponding relationship between a node and an attribute, and the context word to be used to select the exact meaning of a word. From this representation, we observe that: a) Nodes &amp;quot;c&amp;quot; and &amp;quot;d&amp;quot; are closely related and may not be fully separable. In fact, it is sometimes difficult even for human experts to decide how to divide them into separate topics.</Paragraph>
    <Paragraph position="8"> b) The same word, such as &amp;quot;water &amp;quot;, may appear in both the context node and the basic semantic node.</Paragraph>
    <Paragraph position="9"> c) Some words use context to resolve their meanings, while many do not need context.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Semantic Correlations
</SectionTitle>
    <Paragraph position="0"> Although there exists many methods to derive the semantic correlations between words (Lee, 1999; Lin, 1998; Karov &amp; Edelman, 1998; Resnik, 1995; Dagan et al, 1995), we adopt a relatively simple and yet practical and effective approach to derive three topic -oriented semantic correlations: thesaurus-based, co-occurrence-based and context-based correlation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Thesaurus based correlation
</SectionTitle>
      <Paragraph position="0"> WordNet is an electronic thesaurus popularly used in many researches on lexical semantic acquisition, and word sense disambiguation (Green, 1999; Leacock et al, 1998). In WordNet, the sense of a word is represented by a list of synonyms (synset), and the lexical information is represented in the form of a semantic network.</Paragraph>
      <Paragraph position="1"> However, it is well known that the granularity of semantic meanings of words in WordNet is often too fine for practical use. We thus need to enlarge the semantic granularity of words in practical applications. For example, given a topic on &amp;quot;children education&amp;quot;, it is highly likely that the word &amp;quot;child&amp;quot; will be a key term. However, the concept &amp;quot;child&amp;quot; can be expressed in many semantically relate d terms, such as &amp;quot;boy&amp;quot;, &amp;quot;girl &amp;quot;, &amp;quot;kid&amp;quot;, &amp;quot;child&amp;quot;, &amp;quot;youngster&amp;quot;, etc. In this case, it might not be necessary to distinguish the different meaning among these words, nor the different senses within each word. It is, however, important to group all these words into a large synset {child, boy, girl, kid, youngster}, and use the synset to model the dominant but more general meaning of these words in the context.</Paragraph>
      <Paragraph position="2"> In general, it is reasonable and often useful to group lexically related words together to represent a more general concept. Here, two words are considered to be lexically related if they are related to by the &amp;quot;is_a&amp;quot;, &amp;quot;part_of&amp;quot;, &amp;quot;member_of&amp;quot;, or &amp;quot;antonym&amp;quot; relations, or if they belong to the same synset. Figure 3 lists the lexical relations that we considered, and the examples.</Paragraph>
      <Paragraph position="3"> Since in our experiment, there are many antonyms co-occur within the topic, we also group antonyms together to identify a topic. Moreover, if a word had two senses of, say, sense-1 and sense-2.</Paragraph>
      <Paragraph position="4"> And if there are two separate words that are lexically related to this word by sense-1 and sense-2 respectively, we simply group these words together and do not attempt to distinguish the two different senses. The reason is because if a word is so important to be chosen as the keyword of a topic, then it should only have one dominant meaning in that topic. The idea that a keyword should have only one dominant meaning in a topic is also suggested in Church &amp; Yarowsky (1992).</Paragraph>
      <Paragraph position="5">  per Figure 3: Examples of lexical relationship Based on the above discussion, we compute the thesaurus-based correlation between the two terms t1 and t2, in topic Ti, as:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Co-occurrence based correlation
</SectionTitle>
      <Paragraph position="0"> Co-occurrence relationship is like the global context of words. Using co-occurrence statistics, Veling &amp; van der Weerd (1999) was able to find many interesting conceptual groups in the Reuters2178 text corpus. Examples of the conceptual groups found include: {water, rainfall, dry}, {bomb, injured, explosion, injuries}, and {cola, PEP, Pepsi, Pespi-cola, Pepsico}. These groups are meaningful, and are able to capture the important concepts within the corpus.</Paragraph>
      <Paragraph position="1"> Since in general, high co-occurrence words are likely to be used together to represent (or describe) a certain concept, it is reasonable to group them together to form a large semantic node. Thus for topic Ti, the co-occurrence-based correlation of two terms, t1 and t2, is computed as: )(/)(),( 21)(21)(21)( ttdfttdfttR iiico [?][?]= (2) where )( 21)( ttdf i [?] ( )( 21)( ttdf i [?] ) is the fraction of documents in Ti that contains t1 and (or) t2.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Context based correlation
</SectionTitle>
      <Paragraph position="0"> Broadly speaking, there are three kinds of context: domain, topic and local contexts (Ide &amp; Vernois, 1998). Domain context requires extensive knowledge of domain and is not considered in this paper. Topic context can be modeled approximately using the co-occurrence</Paragraph>
      <Paragraph position="2"> relationships between the words in the topic. In this section, we will define the local context explicitly.</Paragraph>
      <Paragraph position="3"> The local context of a word t is often defined as the set of non-trivial words near t. Here a word wd is said to be near t if their word distance is less than a given threshold, which is set to be 5 in our experiment.</Paragraph>
      <Paragraph position="4"> We represent the local context of term tj in topic Ti by a context vector cv(i)(tj). To derive cv(i)(tj), we first rank all candidate context words of ti by their density values: )(/)( )()()( jikijijk tnwdm=r (3) where )()( ji tn is the number of occurrence of tj in Ti, and )()( kij wdm is the number of occurrences of wdk near tj. We then select from the ranking, the top ten words as the context of tj in Ti as: ),(),...,,(),,{()( )(10)(10)(2)(2)(1)(1)( ijijijijijijji wdwdwdtcv rrr= (4) When the training sample is sufficiently large, the context vector will have good statistic meanings. Noting again that an important word to a topic should have only one dominant meaning within that topic, and this meaning should be reflected by its context. We can thus draw the conclusion that if two words have a very high context similarity within a topic, it will have a high possibility that they are semantic related. Therefore it is reasonable to group them together to form a larger semantic node. We thus compute the context-based correlation between two term t1 and</Paragraph>
      <Paragraph position="6"> For example, in Reuters 21578 corpus, &amp;quot;company&amp;quot; and &amp;quot;corp&amp;quot; are context-related words within the topic &amp;quot;acq&amp;quot;. This is because they have very similar context of &amp;quot;say, header, acquire, contract&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Semantic Groups &amp; Topic Tree
</SectionTitle>
    <Paragraph position="0"> There are many methods that attempt to construct the conceptual representation of a topic from the original data set (Veling &amp; van der Weerd, 1999; Baker &amp; McCallum, 1998; Pereira et al, 1993). In this Section, we will describe our semantic -based approach to finding basic semantic groups and constructing the topic tree. Given a set of training documents, the stages involved in finding the semantic groups for each topic are given below.</Paragraph>
    <Paragraph position="1"> A) Extract all distinct terms {t1, t2, ..tn} from the training document set for topic Ti. For each term tj, compute its df(i)(tj) and cv(i)(tj), where df(i)(tj) is defined as the fraction of documents in Ti that contain tj. In other words, df(i)(tj) gives the conditional probability of tj appearing in Ti.</Paragraph>
    <Paragraph position="2"> B) Derive the semantic group Gj using tj as the main keyword. Here we use the semantic correlations defined in Section 3 to derive the semantic relationship between tj and any other term tk. Thus:</Paragraph>
    <Paragraph position="4"> where d0, d1, d2, d3 are predefined thresholds.</Paragraph>
    <Paragraph position="5"> For all tk with Link(tj,tk)=1, we form a semantic group centered around tj denoted by:</Paragraph>
    <Paragraph position="7"> Here tj is the main keyword of node Gj and is denoted by main(Gj)=t j.</Paragraph>
    <Paragraph position="8"> C) Calculate the information value inf (i)(Gj) of each basic semantic group. First we compute the information value of each tj:</Paragraph>
    <Paragraph position="10"> and N is the number of topics. Thus 1/N denotes the probability that a term is in any class, and pij denotes the normalized conditional probability of tj in Ti. Only those terms whose normalized conditional probability is higher than 1/N will have a positive information value.</Paragraph>
    <Paragraph position="11"> The information value of the semantic group Gj is simply the summation of information value of its constituent terms weighted by their maximum semantic correlation with tj as:</Paragraph>
    <Paragraph position="13"> In the above grouping algorithm, the predefined thresholds d0,d1,d2,d3 are used to control the size of each group, and d4 is used to control the number of groups.</Paragraph>
    <Paragraph position="14"> The set of basic semantic groups found then forms the sub-topics of a 2-layered topic tree as illustrated in Figure 2.</Paragraph>
    <Paragraph position="15"> 5. Building and Training of SPN The Combination of local perception and global arbitrator has been applied to solve perception problems (Wang &amp; Terman, 1995; Liu &amp; Shi, 2000). Here we adopt the same strategy for topic spotting. For each topic, we construct a local perceptron net (LPN), which is designed for a particular topic. We use a global expert (GE) to arbitrate all decisions of LPNs and to model the relationships between topics. Here we discuss the design of both LPN and GE, and their training processes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Local Perceptron Net (LPN)
</SectionTitle>
      <Paragraph position="0"> We derive the LPN directly from the topic tree as discussed in Sectio n 2 (see Figure 2). Each LPN is a multi-layer feed-forward neural network with a typical structure as shown in Figure 4.</Paragraph>
      <Paragraph position="1"> In Figure 4, xij represents the feature value of keyword wdij in the ith semantic group; xijk-s (where k=1,...10) represent the feature values of the context words wdijk's of keyword wd ij; and aij denotes the meaning of keyword wd ij as determined by its context. Ai corresponds to the ith basic semantic node. The weights wi, wij, and wijk and biases ei and eij are learned from training, and y(i)(x) is the output of the network.</Paragraph>
      <Paragraph position="2">  Given a document:</Paragraph>
      <Paragraph position="4"> where m is the number of basic semantic nodes, ij is the number of key terms contained in the ith semantic node, and cvij={xij1,xij2...</Paragraph>
      <Paragraph position="5"> ijijkx } is the context of term xij. The output y(i) =y(i)(x) is calculated as follows:</Paragraph>
      <Paragraph position="7"> Equation (10) expresses the fact that only if a key term is present in the document (i.e. xij &gt; 0), its context needs to be checked.</Paragraph>
      <Paragraph position="8"> For each topic Ti, there is a corresponding net y(i) =y(i)(x) and a threshold q(i). The pair of (y(i)(x), q(i)) is a local binary classifier for Ti such that: If y(i)(x)-q(i) &gt; 0, then Ti is present; otherwise Ti is not present in document x.</Paragraph>
      <Paragraph position="9"> From the procedures employed to building the topic tree, we know that each feature is in fact an evidence to support the occurrence of the topic.</Paragraph>
      <Paragraph position="10"> This gives us the suggestion that the activation function for each node in the LPN should be a non-decreasing function of the inputs. Thus we impose a weight constraint on the LPN as: wi&gt;0, wij&gt;0, wijk&gt;0 (12)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Global expert (GE)
</SectionTitle>
      <Paragraph position="0"> Since there are relations among topics, and LPNs do not have global information, it is inevitable that LPNs will make wrong decisions. In order to overcome this problem, we use a global expert (GE) to arbitrate al local decisions. Figure 5 illustrates the use of global expert to combine the outputs of LPNs.</Paragraph>
      <Paragraph position="1">  Given a document x, we first use each LPN to make a local decision. We then combine the outputs of LPNs as follows:  where Wij-s are the weights between the global arbitrator i and the jth LPN; and )(iTh -s are the global bias. From the result of Equation (13), we have: If Y(i) &gt; 0; then topic Ti is present; otherwise Ti is not present in document x The use of Equation (13) implies that: a) If a LPN is not activated, i.e., y(i) PS q(i), then its output is not used in the GE. Thus it will not affect the output of other LPN.</Paragraph>
      <Paragraph position="2"> b) The weight Wij models the relationship or correlation between topic i and j. If Wij &gt; 0, it means that if document x is related to Tj, it may also have some contribution ( Wij) to topic Tj. On the other hand, if Wij &lt; 0, it means the two topics are negatively correlated, and a document x will not be related to both Tj and Ti.</Paragraph>
      <Paragraph position="3"> The overall structure of SPN is as follows:</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 The Training of SPN
</SectionTitle>
      <Paragraph position="0"> In order to adopt SPN for topic spotting, we employ the well-known BP algorithm to derive the optimal weights and biases in SPN. The training phase is divided to two stages. The first stage learns a LPN for each topic, while the second stage trains the GE. As the BP algorithm is rather standard, we will discuss only the error functions that we employ to guide the training process.</Paragraph>
      <Paragraph position="1"> In topic spotting, the goal is to achieve both high recall and precision. In particular, we want to allow y(x) to be as large (or as small) as possible in cases when there is no error, or when +Ohm[?]x and q&gt;)(xy (or [?]Ohm[?]x and q&lt;)( xy ). Here +Ohm and [?]Ohm denote the positive and negative training document sets respectively. To achieve this, we adopt a new error function as follows to train the LPN:  Ohm are used to ensure that the contributions of positive and negative examples are equal.</Paragraph>
      <Paragraph position="2"> After the training, we choose the node with the biggest wi value as the common attribute node.</Paragraph>
      <Paragraph position="3"> Also, we trim the topic representation by removing those words or context words with very small wij or wijk values.</Paragraph>
      <Paragraph position="4"> We adopt the following error function to train  where +Ohmi is the set of positive examples of Ti.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>