<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2015">
  <Title>Unsupervised Learning of Morphology for English and Inuktitut</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> To recognize a morpheme boundary, for example between a root and a suffix, a learner must have seen at least two roots with that suffix and at least two suffixes with that root. For instance, 'helpful', 'helpless', 'harmful', and 'harmless' would be enough evidence to guess that those words could be divided as 'help/ful', 'help/less', 'harm/ful', and 'harm/less'. Without seeing varying roots and varying suffixes, there is no reason to prefer one division to another.</Paragraph>
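    <Paragraph> The evidence condition above can be illustrated with a minimal Python sketch. It is not the technique developed in this paper; the function name supported_splits and its two threshold parameters are hypothetical, chosen only for illustration. A split is accepted only when its suffix has been seen with at least two roots and its root with at least two suffixes.

```python
from collections import defaultdict


def supported_splits(words, min_roots=2, min_suffixes=2):
    """Return (root, suffix) splits backed by varying roots and varying suffixes.

    Illustrative sketch only: the name and thresholds are assumptions.
    """
    vocabulary = set(words)
    roots_by_suffix = defaultdict(set)
    suffixes_by_root = defaultdict(set)

    # Enumerate every split of every word into a non-empty root and suffix.
    for word in vocabulary:
        for i in range(1, len(word)):
            root, suffix = word[:i], word[i:]
            roots_by_suffix[suffix].add(root)
            suffixes_by_root[root].add(suffix)

    # Keep a split only if the suffix occurs with enough roots
    # and the root occurs with enough suffixes.
    splits = set()
    for suffix, roots in roots_by_suffix.items():
        if len(roots) >= min_roots:
            for root in roots:
                if len(suffixes_by_root[root]) >= min_suffixes:
                    splits.add((root, suffix))
    return splits


if __name__ == "__main__":
    print(sorted(supported_splits(["helpful", "helpless", "harmful", "harmless"])))
```

    On the four example words, only the splits 'help/ful', 'help/less', 'harm/ful', and 'harm/less' survive; with 'helpful' and 'helpless' alone, no split is supported.</Paragraph>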
    <Paragraph position="1"> We can represent a language's morphology as a graph or automaton, with the links labeled by characters and the nodes organizing which characters can occur after specific prefixes. In such an automaton, the morpheme boundaries would be hubs, that is, nodes with in-degree greater than one and out-degree greater than one. Furthermore, this automaton could be simplified by path compression to remove all nodes with in-degree and out-degree of one. The remaining automaton could be further modified to produce a graph with one source and one sink, in which all other nodes are hubs.</Paragraph>
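    <Paragraph> As a rough illustration of hubs in a character automaton, the following Python sketch builds a trie, merges states that accept the same continuations (giving a minimal acyclic automaton), and reports the prefixes that reach a node with in-degree and out-degree both greater than one. This construction is an assumption made for illustration, not the technique described later in the paper, and the name find_hub_prefixes is hypothetical. Path compression is not performed here.

```python
def find_hub_prefixes(words):
    """Return the prefixes that lead to hub states (illustrative sketch)."""
    # Build a character trie; each node records its children and finality.
    root = {"children": {}, "final": False}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(ch, {"children": {}, "final": False})
        node["final"] = True

    # Merge trie nodes with identical continuations, bottom-up.
    registry = {}  # structural key -> state id
    edges = {}     # state id -> {character: child state id}

    def canonical(node):
        children = tuple(sorted(
            (ch, canonical(child)) for ch, child in node["children"].items()
        ))
        key = (node["final"], children)
        if key not in registry:
            registry[key] = len(registry)
            edges[registry[key]] = dict(children)
        return registry[key]

    start = canonical(root)

    # Count incoming links per state; outgoing links are len(edges[state]).
    in_degree = {state: 0 for state in edges}
    for targets in edges.values():
        for child in targets.values():
            in_degree[child] += 1

    # A hub has in-degree > 1 and out-degree > 1.
    hubs = {s for s in edges if in_degree[s] > 1 and len(edges[s]) > 1}

    # Walk the automaton from the start state, recording prefixes that end at a hub.
    prefixes = set()
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in hubs:
            prefixes.add(prefix)
        for ch, child in edges[state].items():
            stack.append((child, prefix + ch))
    return prefixes


if __name__ == "__main__":
    print(sorted(find_hub_prefixes(["helpful", "helpless", "harmful", "harmless"])))
```

    For the four example words, the states reached by 'help' and 'harm' merge into a single hub, so the sketch reports those two prefixes. The recursive merge and the prefix walk are adequate for small word lists only.</Paragraph>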
    <Paragraph position="2"> A hub-automaton, as described above, matches the intuitive idea that a language's morphology allows one to assemble a word by chaining morphemes together.</Paragraph>
    <Paragraph position="3"> This representation highlights the morphemes while also representing morphotactic information. Phonological information can be represented in the same graph but may be more economically represented in a separate transducer that can be composed with the hub-automaton. For identifying the boundary between roots and suffixes, the idea of hubs is essentially the same as Goldsmith's (2001) signatures or the variations between Gaussier's (1999) p-similarity words. A signature is a set of suffixes, any of which can be added to several roots to create a word. For example, in English, any suffix in the set {NULL, 's', 'ed', 'ing'} can be added to 'want' or 'wander' to form a word. Here, NULL means the empty suffix.</Paragraph>
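    <Paragraph> The notion of a signature can be made concrete with a short Python sketch. It follows a simplified reading of the definition given above and is not Goldsmith's algorithm; the name find_signatures and its thresholds are hypothetical. The empty string stands in for the NULL suffix.

```python
from collections import defaultdict


def find_signatures(words, min_stems=2, min_suffixes=2):
    """Group stems by the exact set of suffixes they take (illustrative sketch)."""
    vocabulary = set(words)
    suffixes_by_stem = defaultdict(set)
    for word in vocabulary:
        # The split at len(word) pairs the whole word with the empty (NULL) suffix.
        for i in range(1, len(word) + 1):
            suffixes_by_stem[word[:i]].add(word[i:])

    # Stems that share an identical suffix set share a signature.
    stems_by_signature = defaultdict(set)
    for stem, suffixes in suffixes_by_stem.items():
        stems_by_signature[frozenset(suffixes)].add(stem)

    # Keep signatures with several suffixes that apply to several stems.
    return {
        signature: stems
        for signature, stems in stems_by_signature.items()
        if len(signature) >= min_suffixes and len(stems) >= min_stems
    }


if __name__ == "__main__":
    forms = ["want", "wants", "wanted", "wanting",
             "wander", "wanders", "wandered", "wandering"]
    for signature, stems in find_signatures(forms).items():
        print(sorted(signature), sorted(stems))
```

    On the eight forms of 'want' and 'wander', the only signature kept is the set containing NULL, 's', 'ed', and 'ing', shared by the stems 'want' and 'wander'.</Paragraph>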
    <Paragraph position="4"> In a hub-automaton, the idea is more general than in previous work and applies to more complex morphologies, such as those of agglutinative or polysynthetic languages. In particular, we are interested in unsupervised learning of Inuktitut morphology, in which a single lexical unit can often include a verb, two pronouns, adverbs, and temporal information.</Paragraph>
    <Paragraph position="5"> In this paper, we describe a very simple technique for identifying hubs as a first step in building a hub-automaton. We show that, for English, this technique does as well as more complex collections of techniques using signatures. We then show that the technique also works, in a limited way, for Inuktitut. We close with a discussion of the limitations and our plans for more complete learning of hub-automata.</Paragraph>
  </Section>
</Paper>