<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2015"> <Title>Unsupervised Learning of Morphology for English and Inuktitut</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Searching for hubs </SectionTitle> <Paragraph position="0"> The simplest way to build a graph from a raw corpus of words is to construct a trie. A trie is a tree representation of the distinct words with a character label on each branch. The trie can be transformed into a minimal, acyclic DFA (deterministic finite automaton) by sharing nodes that have identical continuations. There are well-known algorithms for doing this (Hopcroft &amp; Ullman, 1969). For example, suppose that, in a given corpus, the stem 'friend' occurs only with the suffixes NULL, 's', and 'ly', and the stem 'kind' occurs only with the same suffixes. The minimal DFA merges the nodes that represent those suffixes and therefore has fewer links and fewer nodes than the original trie.</Paragraph> <Paragraph position="1"> In this DFA, some hubs, nodes with both high in-degree and high out-degree, will be obvious, such as the merged node in the previous example. These are morpheme boundaries.</Paragraph> <Paragraph position="2"> There will be other nodes that are not obvious hubs.</Paragraph> <Paragraph position="3"> Some may have high out-degree but an in-degree of one; others will have high in-degree but an out-degree of one.</Paragraph> <Paragraph position="4"> Many researchers, including Schone and Jurafsky (2000), Harris (1958), and Dejean (1998), suggest looking for nodes with high branching (out-degree), that is, a large number of continuations. That technique is also used as the first step in Goldsmith's (2001) search for signatures. However, without further processing, such nodes are not reliable morpheme boundaries.</Paragraph> <Paragraph position="5"> Other candidate hubs are those nodes with high out-degree that are direct descendants, along a single path, of a node with high in-degree. In essence, these are stretched hubs. 
Figure 1 shows an idealized view of a hub and a stretched hub.</Paragraph> <Paragraph position="6"> Figure 1: An idealized hub and a stretched hub. The lines are links in the automaton, and each would be labeled with a character. The ovals are nodes; only branching points are shown.</Paragraph> <Paragraph position="7"> In a minimized DFA of the words in a corpus, we can identify hubs and the last node in stretched hubs as morpheme boundaries. These roughly correspond to the signatures found by other methods.</Paragraph> <Paragraph position="8"> The hub-searching technique described above misses boundaries if a particular signature appears only once in a corpus. For instance, the signature for 'help' might be 'ed', 's', 'less', 'lessly', and NULL; suppose that no other word in the corpus has the same signature. Then the morpheme boundaries 'help-less' and 'help-ed' will not be found.</Paragraph> <Paragraph position="9"> The way to generalize the hub-automaton to include words that were never seen is to merge hubs. This is a complex task in general. In this paper, we propose a very simple method. We suggest merging each node that is a final state (at the end of a word) with each hub or stretched hub that has in-degree greater than two.</Paragraph> <Paragraph position="10"> Doing so sharply increases the number of words accepted by the automaton. It will identify more correct morpheme boundaries at the expense of including some non-words.</Paragraph> <Paragraph position="11"> These two techniques, hub searching and simple node merging, were implemented in a program called &quot;HubMorph&quot; (hub-automaton morphology).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Related Work </SectionTitle> <Paragraph position="0"> Most previous work in unsupervised learning of morphology has focused on learning the division between roots and suffixes (e.g., Sproat, 1992; Gaussier, 1999; Dejean, 1996; Goldsmith, 2001). 
The hope is that the same techniques will work for extracting prefixes.</Paragraph> <Paragraph position="1"> However, even that will not handle the complex combinations of infixes that are possible in agglutinative languages like Turkish or polysynthetic languages like Inuktitut.</Paragraph> <Paragraph position="2"> This paper presents a generalization of one class of techniques that search for signatures or positions in a trie with a large branching factor. Goldsmith (2001) presents a well-developed and robust version of this class and has made his system, Linguistica, freely available (Goldsmith, 2002).</Paragraph> <Paragraph position="3"> Linguistica applies a wide array of techniques, including heuristics and the principle of Minimum Description Length (MDL), to find the best division of words into roots and suffixes, as well as prefixes in some cases. The first of these techniques finds the points in a word with the highest number of possible successors in other words. With all these techniques, Linguistica seeks optimal breakpoints in each word. Here, optimal means requiring the minimal number of bits to encode the whole collection.</Paragraph> <Paragraph position="4"> There are also techniques that attempt to use semantic cues, arguing that knowing the signatures is not sufficient for the task. For example, Yarowsky and Wicentowski (2000; cf. Schone &amp; Jurafsky, 2000) present a method for determining whether singed can be split into sing and ed based on whether singed and sing appear in the same contexts. Adopting a technique like this would increase the precision of HubMorph. In addition, some semantic approach is absolutely essential for identifying fusional morphology, where the word (sang) is not a simple composition of a root (sing) and other morphemes.</Paragraph> </Section> </Paper>
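
The hub search of Section 2 can be illustrated with a minimal Python sketch. It is not the HubMorph implementation. The function names (`build_minimal_dfa`, `boundary_states`, `segment`) are hypothetical, the thresholds `min_in` and `min_out` are assumed stand-ins for the paper's "high" in-degree and out-degree, and counting the end of a word (the NULL suffix) as one continuation is also an assumption. The sketch builds a trie, merges nodes with identical continuations, and then marks hubs and the last node of stretched hubs as boundaries.

```python
from collections import defaultdict


def build_minimal_dfa(words):
    """Return (root, nodes), where nodes[state] = (is_final, {char: child_state})."""
    # Plain trie as nested dicts; the empty-string key marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True

    registry = {}  # signature of continuations -> state id
    nodes = {}     # state id -> (is_final, edges)

    def canon(node):
        # Children are canonicalized first, so two trie nodes receive the same
        # state id exactly when they accept the same set of continuations.
        edges = {ch: canon(child) for ch, child in node.items() if ch != ""}
        sig = ("" in node, tuple(sorted(edges.items())))
        if sig not in registry:
            registry[sig] = len(registry)
            nodes[registry[sig]] = (sig[0], edges)
        return registry[sig]

    return canon(trie), nodes


def boundary_states(nodes, min_in=2, min_out=2):
    """States treated as morpheme boundaries: hubs and ends of stretched hubs."""
    in_deg = defaultdict(int)
    for _, edges in nodes.values():
        for child in edges.values():
            in_deg[child] += 1

    def out_deg(state):
        is_final, edges = nodes[state]
        return len(edges) + int(is_final)  # assumption: NULL counts as a continuation

    found = set()
    for state in nodes:
        if in_deg[state] < min_in:
            continue
        end = state
        # Follow a non-branching path; zero steps means a plain hub.
        while not nodes[end][0] and len(nodes[end][1]) == 1:
            (end,) = nodes[end][1].values()
        if out_deg(end) >= min_out:
            found.add(end)
    return found


def segment(word, root, nodes, boundaries):
    """Insert '-' wherever a word known to the automaton passes a boundary state."""
    state, cuts = root, []
    for i, ch in enumerate(word, start=1):
        state = nodes[state][1][ch]
        if state in boundaries and i < len(word):
            cuts.append(i)
    parts, prev = [], 0
    for cut in cuts + [len(word)]:
        parts.append(word[prev:cut])
        prev = cut
    return "-".join(parts)


corpus = ["friend", "friends", "friendly", "kind", "kinds", "kindly",
          "walk", "walks", "walked", "jump", "jumps", "jumped"]
root, nodes = build_minimal_dfa(corpus)
boundaries = boundary_states(nodes)
print(segment("walked", root, nodes, boundaries))    # walk-ed (plain hub)
print(segment("friendly", root, nodes, boundaries))  # friend-ly (stretched hub)
```

In this toy corpus, 'walk' and 'jump' meet at a plain hub. Because 'friend' and 'kind' also share the ending 'nd', minimization merges them two characters earlier, so the branching node after 'd' is reached along a single path from the merged node, which is the stretched-hub case. The node-merging generalization for signatures seen only once is not covered by this sketch.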