<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1038">
  <Title>Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words</Title>
  <Section position="5" start_page="298" end_page="298" type="metho">
    <SectionTitle>
3 Discovery of Patterns
</SectionTitle>
    <Paragraph position="0"> Our first step is the discovery of patterns that are useful for lexical category acquisition. We use two main stages: discovery of pattern candidates, and identification of the symmetric patterns among the candidates.</Paragraph>
    <Section position="1" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.1 Pattern Candidates
</SectionTitle>
      <Paragraph position="0"> An examination of the patterns found useful in previous work shows that they contain one or more very frequent word, such as 'and', 'is', etc. Our approach towards unsupervised pattern induction is to find such words and utilize them.</Paragraph>
      <Paragraph position="1"> We define a high frequency word (HFW) as a word appearing more than TH times per million words, and a content word (CW) as a word appearing less than TC times per a million words4.</Paragraph>
      <Paragraph position="2"> Now define a meta-pattern as any sequence of HFWs and CWs. In this paper we require that meta-patterns obey the following constraints: (1) at most 4 words; (2) exactly two content words; (3) no two consecutive CWs. The rationale is to see what can be achieved using relatively short patterns and where the discovered categories contain single words only. We will relax these constraints in future papers. Our meta-patterns here are thus of four types: CHC, CHCH, CHHC, and HCHC.</Paragraph>
      <Paragraph position="3"> In order to focus on patterns that are more likely to provide high quality categories, we removed patterns that appear in the corpus less than TP times per million words. Since we can ensure that the number of HFWs is bounded, the total number of pattern candidates is bounded as well. Hence, this stage can be computed in time linear in the size of the corpus (assuming the corpus has been already pre-processed to allow direct access to a word by its index.) 4Considerations for the selection of thresholds are discussed in Section 5.</Paragraph>
    </Section>
    <Section position="2" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.2 Symmetric Patterns
</SectionTitle>
      <Paragraph position="0"> Many of the pattern candidates discovered in the previous stage are not usable. In order to find a usable subset, we focus on the symmetric patterns.</Paragraph>
      <Paragraph position="1"> Our rationale is that two content-bearing words that appear in a symmetric pattern are likely to be semantically similar in some sense. This simple observation turns out to be very powerful, as shown by our results. We will eventually combine data from several patterns and from different corpus windows (Section 4.) For identifying symmetric patterns, we use a version of the graph representation of (Widdows and Dorow, 2002). We first define the single-pattern graph G(P) as follows. Nodes correspond to content words, and there is a directed arc A(x,y) from node x to node y iff (1) the words x and y both appear in an instance of the pattern P as its two CWs; and (2) x precedes y in P. Denote by Nodes(G),Arcs(G) the nodes and arcs in a graph G, respectively.</Paragraph>
      <Paragraph position="2"> We now compute three measures on G(P) and combine them for all pattern candidates to filter asymmetric ones. The first measure (M1) counts the proportion of words that can appear in both slots of the pattern, out of the total number of words. The reasoning here is that if a pattern allows a large percentage of words to participate in both slots, its chances of being a symmetric pattern are greater:</Paragraph>
      <Paragraph position="4"> M1 filters well patterns that connect words having different parts of speech. However, it may fail to filter patterns that contain multiple levels of asymmetric relationships. For example, in the pattern 'x belongs to y', we may find a word B on both sides ('A belongs to B', 'B belongs to C') while the pattern is still asymmetric.</Paragraph>
      <Paragraph position="5"> In order to detect symmetric relationships in a finer manner, for the second and third measures we define SymG(P), the symmetric subgraph of G(P), containing only the bidirectional arcs and nodes of G(P):</Paragraph>
      <Paragraph position="7"> The second and third measures count the proportion of the number of symmetric nodes and edges in G(P), respectively:</Paragraph>
      <Paragraph position="9"> All three measures yield values in [0,1], and in all three a higher value indicates more symmetry. M2 and M3 are obviously correlated, but they capture different aspects of a pattern's nature: M3 is informative for highly interconnected but small word categories (e.g., month names), while M2 is useful for larger categories that are more loosely connected in the corpus.</Paragraph>
      <Paragraph position="10"> We use the three measures as follows. For each measure, we prepare a sorted list of all candidate patterns. We remove patterns that are not in the top ZT (we use 100, see Section 5) in any of the three lists, and patterns that are in the bottom ZB in at least one of the lists. The remaining patterns constitute our final list of symmetric patterns.</Paragraph>
      <Paragraph position="11"> We do not rank the final list, since the category discovery algorithm of the next section does not need such a ranking. Defining and utilizing such a ranking is a subject for future work.</Paragraph>
      <Paragraph position="12"> A sparse matrix representation of each graph can be computed in time linear in the size of the input corpus, since (1) the number of patterns |P |is bounded, (2) vocabulary size |V  |(the total number of graph nodes) is much smaller than corpus size, and (3) the average node degree is much smaller than |V  |(in practice, with the thresholds used, it is a small constant.)</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="298" end_page="300" type="metho">
    <SectionTitle>
4 Discovery of Categories
</SectionTitle>
    <Paragraph position="0"> After the end of the previous stage we have a set of symmetric patterns. We now use them in order to discover categories. In this section we describe the graph clique-set method for generating initial categories, and category pruning techniques for increased quality.</Paragraph>
    <Section position="1" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
4.1 The Clique-Set Method
</SectionTitle>
      <Paragraph position="0"> Our approach to category discovery is based on connectivity structures in the all-pattern word relationship graph G, resulting from merging all of the single-pattern graphs into a single unified graph.</Paragraph>
      <Paragraph position="1"> The graph G can be built in time O(|V  |x |P |x</Paragraph>
      <Paragraph position="3"> rather than Nodes(G) for brevity.) When building G, no special treatment is done when one pattern is contained within another. For example, any pattern of the form CHC is contained in a pattern of the form HCHC ('x and y', 'both x and y'.) The shared part yields exactly the same subgraph. This policy could be changed for a discovery of finer relationships.</Paragraph>
      <Paragraph position="4"> The main observation on G is that words that are highly interconnected are good candidates to form a category. This is the same general observation exploited by (Widdows and Dorow, 2002), who try to find graph regions that are more connected internally than externally.</Paragraph>
      <Paragraph position="5"> We use a different algorithm. We find all strong n-cliques (subgraphs containing n nodes that are all bidirectionally interconnected.) A clique Q defines a category that contains the nodes in Q plus all of the nodes that are (1) at least unidirectionally connected to all nodes in Q, and (2) bidirectionally connected to at least one node in Q.</Paragraph>
      <Paragraph position="6"> In practice we use 2-cliques. The strongly connected cliques are the bidirectional arcs in G and their nodes. For each such arc A, a category is generated that contains the nodes of all triangles that contain A and at least one additional bidirectional arc. For example, suppose the corpus contains the text fragments 'book and newspaper', 'newspaper and book', 'book and note', 'note and book' and 'note and newspaper'. In this case the three words are assigned to a category.</Paragraph>
      <Paragraph position="7"> Note that a pair of nodes connected by a symmetric arc can appear in more than a single category. For example, suppose a graph G containing five nodes and seven arcs that define exactly three strongly connected triangles, ABC,ABD,ACE.</Paragraph>
      <Paragraph position="8"> The arc (A,B) yields a category {A,B,C,D}, and the arc (A,C) yields a category {A,C,B,E}.</Paragraph>
      <Paragraph position="9"> Nodes A and C appear in both categories. Category merging is described below.</Paragraph>
      <Paragraph position="10"> This stage requires an O(1) computation for each bidirectional arc of each node, so its complexity is O(|V  |x AverageDegree(G)) = O(|V |).</Paragraph>
    </Section>
    <Section position="2" start_page="298" end_page="300" type="sub_section">
      <SectionTitle>
4.2 Enhancing Category Quality: Category
Merging and Corpus Windowing
</SectionTitle>
      <Paragraph position="0"> In order to cover as many words as possible, we use the smallest clique, a single symmetric arc.</Paragraph>
      <Paragraph position="1"> This creates redundant categories. We enhance the quality of the categories by merging them and by windowing on the corpus.</Paragraph>
      <Paragraph position="2"> We use two simple merge heuristics. First, if two categories are identical we treat them as one. Second, given two categories Q,R, we merge them iff there's more than a 50% overlap between them: (|QintersectiontextR |&gt; |Q|/2) [?] (|QintersectiontextR |&gt; |R|/2).  This could be added to the clique-set stage, but the phrasing above is simpler to explain and implement. null In order to increase category quality and remove categories that are too context-specific, we use a simple corpus windowing technique. Instead of running the algorithm of this section on the whole corpus, we divide the corpus into windows of equal size (see Section 5 for size determination) and perform the category discovery algorithm of this section on each window independently. Merging is also performed in each window separately. We now have a set of categories for each window. For the final set, we select only those categories that appear in at least two of the windows. This technique reduces noise at the potential cost of lowering coverage. However, the numbers of categories discovered and words they contain is still very large (see Section 5), so windowing achieves higher precision without hurting coverage in practice.</Paragraph>
      <Paragraph position="3"> The complexity of the merge stage is O(|V |) times the average number of categories per word times the average number of words per category.</Paragraph>
      <Paragraph position="4"> The latter two are small in practice, so complexity amounts to O(|V |).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML