<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0617">
  <Title>Morphology Induction From Term Clusters</Title>
  <Section position="4" start_page="128" end_page="128" type="intro">
    <SectionTitle>
2 Clustering
</SectionTitle>
    <Paragraph position="0"> A prerequisite of our method is a clustering of terms in the corpus vocabulary into rough syntactic groups. To achieve this, we first collect co-occurrence statistics for each word, measuring the frequency of words found immediately adjacent to it in the corpus, treating left occurrences as distinct from right occurrences.</Paragraph>
    <Paragraph position="1"> recently soon slightly quickly ...</Paragraph>
    <Paragraph position="2"> underwriter designer commissioner ...</Paragraph>
    <Paragraph position="3"> increased posted estimated raised ...</Paragraph>
    <Paragraph position="4"> agreed declined expects wants ...</Paragraph>
    <Section position="1" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
Wall Street Journal corpus.
</SectionTitle>
      <Paragraph position="0"> This co-occurrence database serves as input to information theoretic co-clustering (Dhillon et al., 2003), which seeks a partition of the vocabulary that maximizes the mutual information between term categories and their contexts. This approach to term clustering is closely related to others from the literature (Brown et al., 1992; Clark, 2000). Recall that the mutual information between random variables $X$ and $Y$ can be written:</Paragraph>
      <Paragraph position="1"> $$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$</Paragraph>
      <Paragraph position="2"> Here, $X$ and $Y$ correspond to term and context clusters, respectively, each event $x$ and $y$ the observation of some term and contextual term in the corpus. We perform an approximate maximization of $I(X;Y)$ using a simulated annealing procedure in which each random trial move takes a word $x$ or context $y$ out of the cluster to which it is tentatively assigned and places it into another.</Paragraph>
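      <Paragraph> The two ingredients described so far, adjacency counts that keep left and right neighbors distinct and the mutual information of a given cluster assignment, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names cooccurrence_counts and mutual_information, the ("L", word) and ("R", word) context encoding, and the use of natural logarithms are all assumptions made for the example.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens):
    """Count (word, context) pairs from immediately adjacent tokens.

    A context is a tagged neighbor: ("L", w) when w stands directly to the
    left of the word, ("R", w) when it stands directly to the right.
    (Hypothetical encoding chosen for this sketch.)
    """
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        counts[(right, ("L", left))] += 1   # `left` is the left neighbor of `right`
        counts[(left, ("R", right))] += 1   # `right` is the right neighbor of `left`
    return counts


def mutual_information(counts, term_cluster, context_cluster):
    """Mutual information (in nats) between term and context clusters.

    `term_cluster` maps each word to a cluster id and `context_cluster`
    maps each context to a cluster id; probabilities are relative
    frequencies of the pooled counts.
    """
    joint = Counter()
    for (word, context), n in counts.items():
        joint[(term_cluster[word], context_cluster[context])] += n
    total = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), n in joint.items():
        px[x] += n
        py[y] += n
    # p(x,y) / (p(x) p(y)) simplifies to n * total / (px[x] * py[y]).
    return sum(
        (n / total) * math.log(n * total / (px[x] * py[y]))
        for (x, y), n in joint.items()
    )


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    counts = cooccurrence_counts(tokens)
    # Toy assignment: every word and every context gets its own cluster.
    terms = {w: i for i, w in enumerate(sorted(set(tokens)))}
    contexts = {c: i for i, c in enumerate(sorted({c for _, c in counts}))}
    print(mutual_information(counts, terms, contexts))
```

Placing all terms in a single cluster makes the objective zero, since the term cluster then carries no information about the context; any finer partition can only keep or raise it, which is why the number of clusters has to be fixed in advance.</Paragraph>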
      <Paragraph position="3"> We performed this procedure on the Wall Street Journal (WSJ) portion of the North American News corpus, forming 200 clusters. Table 1 shows sample terms from several hand-selected clusters.</Paragraph>
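      <Paragraph> A simulated annealing search of the kind described above might look like the sketch below. It is a toy version under stated assumptions: the text does not give the acceptance rule or cooling schedule, so the Metropolis-style acceptance test, the geometric cooling, the parameters steps, t0, cooling and seed, and the use of the same number k of term and context clusters are all hypothetical choices. The objective is recomputed from scratch after every trial move, which is only practical for tiny inputs.

```python
import math
import random
from collections import Counter


def mutual_information(counts, term_cluster, context_cluster):
    """Mutual information (in nats) between term and context clusters."""
    joint = Counter()
    for (word, context), n in counts.items():
        joint[(term_cluster[word], context_cluster[context])] += n
    total = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), n in joint.items():
        px[x] += n
        py[y] += n
    return sum(
        (n / total) * math.log(n * total / (px[x] * py[y]))
        for (x, y), n in joint.items()
    )


def anneal(counts, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Approximately maximize mutual information over k clusters.

    Each trial move takes one term or one context out of its current
    cluster and places it into a randomly chosen cluster.
    Returns (best_score, (term_cluster, context_cluster)).
    """
    rng = random.Random(seed)
    terms = sorted({w for w, _ in counts})
    contexts = sorted({c for _, c in counts})
    term_cluster = {w: rng.randrange(k) for w in terms}
    context_cluster = {c: rng.randrange(k) for c in contexts}
    current = mutual_information(counts, term_cluster, context_cluster)
    best = current
    best_state = (dict(term_cluster), dict(context_cluster))
    temperature = t0
    for _ in range(steps):
        # Choose a term or a context at random and propose a new cluster.
        if rng.random() > 0.5:
            assignment, item = term_cluster, rng.choice(terms)
        else:
            assignment, item = context_cluster, rng.choice(contexts)
        old = assignment[item]
        assignment[item] = rng.randrange(k)
        proposed = mutual_information(counts, term_cluster, context_cluster)
        delta = proposed - current
        # Keep improvements; keep a loss with probability exp(delta / T).
        if delta >= 0 or math.exp(delta / temperature) > rng.random():
            current = proposed
            if current > best:
                best = current
                best_state = (dict(term_cluster), dict(context_cluster))
        else:
            assignment[item] = old  # reject the move and restore the cluster
        temperature *= cooling  # geometric cooling (assumed schedule)
    return best, best_state
```

A practical implementation would update the joint cluster counts incrementally for each move instead of recomputing the full objective, since only the rows and columns touched by the moved item change.</Paragraph>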
    </Section>
  </Section>
</Paper>