<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3210">
  <Title>A Naive Theory of Affixation and an Algorithm for Extraction</Title>
  <Section position="5" start_page="83" end_page="85" type="relat">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> For reasons of space we cannot cite and comment on every relevant paper, even within the narrow field of highly unsupervised extraction of affixes from raw corpus data, but we will cite enough to cover each line of research. The vast fields of word segmentation for speech recognition, or for languages that do not mark word boundaries, will not be covered.</Paragraph>
    <Paragraph position="1"> In our view, segmentation into lexical units is a different problem from affix extraction, since the frequencies of lexical items are different, i.e. they occur much more sparsely. [Displaced table caption fragment: "... detection. The high placement of English -eth and -ah is due to the fact that the Bible version used has drinketh, sitteth, etc. and a lot of personal names in -ah."]</Paragraph>
    <Paragraph position="2"> Results from this area which have been carried over to, or overlap with, affix detection will, however, be taken into account. Many of the papers cited have a wider scope and are still useful even though they are criticized here for having a non-optimal affix detection component.</Paragraph>
    <Paragraph position="3"> Many authors trace their approaches back to two early papers by Zellig Harris (Harris, 1955; Harris, 1970) which count letter successor varieties. The basic procedure is to ask how many different phonemes occur (in various utterances, e.g. a corpus) after the first n phonemes of some test utterance, and to predict that segmentation(s) occur where the number of successors reaches a peak. For example, if we have play, played, playing, player, players, playground and we wish to test where to segment plays, the successor count for the prefix pla would be 1, because only y occurs after it, whereas the number of successors of play peaks at three (i.e. {e,i,g}). Although the heuristic has had some success, it was shown (in various interpretations) as early as (Hafer and Weiss, 1974) that it is not really sound - even for English.</Paragraph>
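    <Paragraph> The successor count in the example above can be reproduced with a minimal Python sketch. It is only an illustration of the heuristic as described here, not an implementation taken from Harris or from this paper; the function name successor_variety is hypothetical, and the end of a word is not counted as a successor, which matches the count of three given for play.

```python
def successor_variety(corpus, prefix):
    """Return the set of distinct characters that follow `prefix` in `corpus`.

    Assumption: the end of a word is not counted as a successor.
    """
    return {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }


# Word list and test word taken from the example in the text.
words = ["play", "played", "playing", "player", "players", "playground"]
test_word = "plays"

# Successor counts after the first n characters of the test word, n = 1..5.
counts = [
    len(successor_variety(words, test_word[:n]))
    for n in range(1, len(test_word) + 1)
]

print(sorted(successor_variety(words, "pla")))   # ['y']
print(sorted(successor_variety(words, "play")))  # ['e', 'g', 'i']
print(counts)                                    # [1, 1, 1, 3, 0]
```

    The peak at n = 4 is what would lead the heuristic to predict the boundary play-s.</Paragraph>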
    <Paragraph position="4"> A slightly better method is to compile a set of words into a trie and predict boundaries at nodes with high activity (e.g. (Johnson and Martin, 2003; Schone and Jurafsky, 2001; Kazakov and Manandhar, 2001) and earlier papers by the same authors), but this is not sound either, as short, common, non-morphemic character sequences also show significant branching.</Paragraph>
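    <Paragraph> The objection to trie branching can be illustrated with a small Python sketch. It is not the method of any of the cited papers; the helper names build_trie and branching are hypothetical, and the words plan, plate and place are added here as an assumption, purely to create a non-morphemic branching point.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def branching(trie, prefix=""):
    """Yield (prefix, number of distinct next characters) for every node."""
    children = [ch for ch in trie if ch != "$"]
    yield prefix, len(children)
    for ch in children:
        yield from branching(trie[ch], prefix + ch)


# The first six words come from the running example; the last three are
# an illustrative assumption and do not appear in the original text.
words = ["play", "played", "playing", "player", "players", "playground",
         "plan", "plate", "place"]

scores = dict(branching(build_trie(words)))
print(scores["pla"], scores["play"])  # 4 3
```

    In this toy word list the node pla branches more than the node play, although no morpheme boundary follows pla.</Paragraph>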
    <Paragraph position="5"> The algorithm in this paper differs significantly from the Harris-inspired varieties. First, we do not record the number of phonemes/characters that continue a given prefix/suffix but the total number of continuations. In the example above, that would be the set {ed,ing,er,ers,ground} rather than the three-member set of continuing phonemes/characters.</Paragraph>
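    <Paragraph> The contrast between the two sets can be sketched in a few lines of Python. This covers only the first of the differences listed, not the algorithm of this paper as a whole; the function name continuations is hypothetical, and the empty continuation (the word play itself) is left out so that the result matches the set given in the text.

```python
def continuations(corpus, prefix):
    """Return the full, non-empty remainders of all words starting with `prefix`."""
    return {
        word[len(prefix):]
        for word in corpus
        if word.startswith(prefix) and word != prefix
    }


# Word list taken from the example in the text.
words = ["play", "played", "playing", "player", "players", "playground"]

full = continuations(words, "play")
next_chars = {c[0] for c in full}  # the Harris-style successor set

print(sorted(full))        # ['ed', 'er', 'ers', 'ground', 'ing']
print(sorted(next_chars))  # ['e', 'g', 'i']
```

    The same prefix thus has five continuations but only three successor characters.</Paragraph>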
    <Paragraph position="6"> Secondly, segmentation of a given word is not the immediate objective, and what amounts to identification of the end of a lexical (thus generally low-frequency) item is not within the direct reach of the model. Thirdly, and most importantly, the algorithm in this paper looks at the slope of the frequency curve, not at peaks in absolute frequency.</Paragraph>
    <Paragraph position="7"> A different approach, sometimes used as a complement to other sources of information, is to select aligned pairs (or sets) of strings that share a long character sequence (work includes (Jacquemin, 1997; Yarowsky and Wicentowski, 2000; Baroni et al., 2002; Clark, 2001)). A notable advantage is that one is not restricted to concatenative morphology.</Paragraph>
    <Paragraph position="8"> Many publications (Ćavar et al., 2004; Brent et al., 1995; Goldsmith et al., 2001; Déjean, 1998; Snover et al., 2002; Argamon et al., 2004; Goldsmith, 2001; Creutz and Lagus, 2005; Neuvel and Fulop, 2002; Baroni, 2003; Gaussier, 1999; Sharma et al., 2002; Wicentowski, 2002; Oliver, 2004), and various other works by the same authors, describe strategies that use frequencies, probabilities, and optimization criteria, often Minimum Description Length (MDL), in various combinations. So far, all of these are unsatisfactory on two main counts. On the theoretical side, they still owe an explanation of why compression or MDL should give rise to segmentations coinciding with morphemes as linguistically defined. On the experimental side, thresholds, supervised/developed parameters and selective input still cloud the success of the reported results, which, in any case, are not broad enough to sustain some overly rash claims of language independence.</Paragraph>
    <Paragraph position="9"> To be more specific, some MDL approaches aim to minimize the description of the set of words in the input corpus, and some to describe all tokens in the corpus, but none aims to minimize what one would otherwise expect, namely the set of possible words in the language. More importantly, none of the reviewed works allows any variation in the description language ("model") during the minimization search. Therefore they should more properly be labeled "weighting schemes", and it's an open question whether their yields correspond to linguistic analysis. Given an input corpus and a traditional linguistic analysis, it is trivial to show that it is possible to decrease description length (according to the given schemes) by stepping away from linguistic analysis.</Paragraph>
    <Paragraph position="10"> Moreover, various forms of codebook compression, such as Lempel-Ziv compression, yield shorter descriptions but without any known linguistic relevance at all. What is clear, however, regardless of whether they are theoretically motivated, is that MDL approaches are useful.</Paragraph>
    <Paragraph position="11"> A systematic test of segmentation algorithms over many different types of languages has yet to be published. For three reasons, it will not be undertaken here either. First, as Manning (1998), for example, already notes for sandhi phenomena, it is far from clear what the gold standard should be (even though we may agree, or agree to disagree, on some familiar European languages). Secondly, segmentation algorithms may have different purposes, and it might not make good sense to study segmentation in isolation from induction of paradigms. Lastly, and most importantly, all of the reviewed techniques (Wicentowski, 2004; Wicentowski, 2002; Snover et al., 2002; Baroni et al., 2002; Andreev, 1965; Ćavar et al., 2004; Snover and Brent, 2003; Snover and Brent, 2001; Snover, 2002; Schone and Jurafsky, 2001; Jacquemin, 1997; Goldsmith and Hu, 2004; Sharma et al., 2002; Clark, 2001; Kazakov and Manandhar, 1998; Déjean, 1998; Oliver, 2004; Creutz and Lagus, 2002; Creutz and Lagus, 2003; Creutz and Lagus, 2004; Hirsimäki et al., 2003; Creutz and Lagus, 2005; Argamon et al., 2004; Gaussier, 1999; Lehmann, 1973; Langer, 1991; Flenner, 1995; Klenk and Langer, 1989; Goldsmith, 2001; Goldsmith, 2000; Hu et al., 2005b; Hu et al., 2005a; Brent et al., 1995), as they are described, have threshold parameters of some sort, explicitly claim not to work well for an open set of languages, or require noise-free all-form input (Albright, 2002; Manning, 1998; Borin, 1991). Therefore it is not possible even to design a fair test.</Paragraph>
    <Paragraph position="12"> In any event, we wish to appeal to the merits of developing a theory in parallel with experimentation - as opposed to only ad hoc result chasing. If we have a theory and we don't get the results we want, we may scrutinize the assumptions behind the theory in order to modify or reject it (understanding why we did so). Without a theory there's no telling what to do or how to interpret intermediate numbers in a long series of calculations.</Paragraph>
  </Section>
</Paper>