<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0712"> <Title>Knowledge-Free Induction of Morphology Using Latent Semantic Analysis</Title> <Section position="2" start_page="0" end_page="67" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Computational morphological analyzers have existed in various languages for years and it has been said that &quot;the quest for an efficient method for the analysis and generation of word-forms is no longer an academic research topic&quot; (Karlsson and Karttunen, 1997). However, development of these analyzers typically begins with human intervention requiring time spans from days to weeks. If it were possible to build such analyzers automatically without human knowledge, significant development time could be saved.</Paragraph> <Paragraph position="1"> On a larger scale, consider the task of inducing machine-readable dictionaries (MRDs) using no human-provided information (&quot;knowledge-free&quot;). In building an MRD, &quot;simply expanding the dictionary to encompass every word one is ever likely to encounter...fails to take advantage of regularities&quot; (Sproat, 1992, p. xiii). Hence, automatic morphological analysis is also critical for selecting appropriate and non-redundant MRD headwords.</Paragraph> <Paragraph position="2"> For the reasons expressed above, we are interested in knowledge-free morphology induction. Thus, in this paper, we show how to automatically induce morphological relationships between words.</Paragraph> <Paragraph position="3"> Previous morphology induction approaches (Goldsmith, 1997, 2000; Déjean, 1998; Gaussier, 1999) have focused on inflectional languages and have used statistics of hypothesized stems and affixes to choose which affixes to consider legitimate.
Several problems can arise using only stem-and-affix statistics: (1) valid affixes may be applied inappropriately (&quot;ally&quot; stemming to &quot;all&quot;), (2) morphological ambiguity may arise (&quot;rating&quot; conflating with &quot;rat&quot; instead of &quot;rate&quot;), and (3) non-productive affixes may get accidentally pruned (the relationship between &quot;dirty&quot; and &quot;dirt&quot; may be lost). Some of these problems could be resolved if one could incorporate word semantics. For instance, &quot;all&quot; is not semantically similar to &quot;ally,&quot; so with knowledge of semantics, an algorithm could avoid conflating these two words.</Paragraph> <Paragraph position="4"> To maintain the &quot;knowledge-free&quot; paradigm, such semantics would need to be automatically induced. Latent Semantic Analysis (LSA) (Deerwester, et al., 1990; Landauer, et al., 1998) is a technique which automatically identifies semantic information from a corpus. We here show that incorporating LSA-based semantics alone into the morphology-induction process can provide results that rival a state-of-the-art system based on stem-and-affix statistics (Goldsmith's Linguistica).</Paragraph> <Paragraph position="5"> Our algorithm automatically extracts potential affixes from an untagged corpus, identifies word pairs sharing the same proposed stem but having different affixes, and uses LSA to judge semantic relatedness between word pairs. This process serves to identify valid morphological relations. Though our algorithm could be applied to any inflectional language, we here restrict it to English in order to perform evaluations against the human-labeled CELEX database (Baayen, et al., 1993).</Paragraph> </Section> </Paper>