<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0107"> <Title>Unsupervised Induction of Natural Language Morphology Inflection Classes</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> It is possible to organize much of the recent work on unsupervised morphology induction by considering the bias each approach has toward discovering morphologically related words that are also orthographically similar.</Paragraph> <Paragraph position="1"> At one end of the spectrum is the work of Yarowsky et al. (2001), who derive a morphological analyzer for a language, L, by projecting the morphological analysis of a resource-rich language onto L through a clever application of statistical machine translation style word alignment probabilities. The word alignments are trained over a sentence aligned parallel bilingual text for the language pair. While the probabilistic model they use to generalize their initial system contains a bias toward orthographic similarity, the unembellished algorithm contains no assumptions on the orthographic shape of related word forms.</Paragraph> <Paragraph position="2"> Next along the spectrum of orthographic similar- null Proceedings of the Workshop of the ity bias is the work of Schone and Jurafsky (2000), who first acquire a list of pairs of potential morphological variants (PPMV's) using an orthographic similarity technique due to Gaussier (1999), in which pairs of words from a corpus vocabulary with the same initial string are identified. They then apply latent semantic analysis (LSA) to score each PPMV with a semantic distance. Pairs measuring a small distance, those whose potential variants tend to occur where a neighborhood of the nearest hundred words contains similar counts of individual high-frequency forms, are then proposed as true morphological variants of one anther. 
In later work, Schone and Jurafsky (2001) extend their technique to identify not only suffixes but also prefixes and circumfixes by building both forward and backward tries over a corpus.</Paragraph> <Paragraph position="3"> Goldsmith (2001), by searching over a space of morphology models limited to substitution of suffixes, ties morphology yet more closely to orthography.</Paragraph> <Paragraph position="4"> Segmenting the word forms in a corpus, Goldsmith creates an inventory of stems and suffixes. Suffixes that can interchangeably concatenate onto a set of stems form a signature. Having defined the space of signatures, Goldsmith searches for the choice of word segmentations that results in a minimum description length local optimum.</Paragraph> <Paragraph position="5"> Finally, the work of Harris (1955; 1967), and later Hafer and Weiss (1974), has direct bearing on the approach taken in this paper. Couched in modern terms, their work involves first building tries over a corpus vocabulary and then selecting, as morpheme boundaries, those character boundaries with a high branching count in the tries.</Paragraph> <Paragraph position="6"> The work in this paper also has a strong bias toward discovering morphologically related words that share a similar orthography. In particular, the morphology model we use is, like Goldsmith's, limited to suffix substitution. The novel proposal we bring to the table, however, is a formalization of the full search space of all candidate inflection classes. With this bulwark in place, defining search strategies for morpheme discovery becomes a natural and straightforward activity.</Paragraph> </Section> </Paper>
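The Harris-style branching heuristic described in the section above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: prefix scans over the vocabulary stand in for an explicit forward trie, and the function names, toy vocabulary, and threshold value are all assumptions made for the example.

```python
# Illustrative sketch of the Harris (1955) successor-variety heuristic:
# a character boundary at which many distinct characters can follow the
# preceding prefix is proposed as a morpheme boundary.

def successor_variety(vocab, prefix):
    """Number of distinct characters that follow `prefix` in `vocab`."""
    return len({w[len(prefix)] for w in vocab
                if w.startswith(prefix) and len(w) > len(prefix)})

def candidate_boundaries(word, vocab, threshold=3):
    """Character positions in `word` whose branching count meets `threshold`."""
    return [i for i in range(1, len(word))
            if successor_variety(vocab, word[:i]) >= threshold]

vocab = {"walk", "walks", "walked", "walking", "talk", "talks", "talked"}
print(candidate_boundaries("walking", vocab))  # high branching after "walk"
```

In a real implementation the counts would be read off a trie built once over the whole vocabulary (and a backward trie for prefixes, as in Schone and Jurafsky's 2001 extension), rather than recomputed by scanning as done here for brevity.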