File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-0602_intro.xml
Size: 2,992 bytes
Last Modified: 2025-10-06 14:01:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0602"> <Title>Unsupervised Learning of Morphology Using a Novel Directed Search Algorithm: Taking the First Step</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> There are numerous languages for which no annotated corpora exist but for which there exists an abundance of unannotated orthographic text. Creating a corpus annotated by hand for morphological structure is extremely time-consuming and expensive. Furthermore, a preliminary, conservative analysis of a language's morphology would be useful in discovering linguistic structure beyond the word level. For instance, morphology may provide information about the syntactic categories to which words belong, knowledge that could be used by parsing algorithms. From a cognitive perspective, it is crucial to determine whether the amount of information found in pure speech is sufficient for discovering the level of morphological structure that children are able to find without any direct supervision. Thus, we believe the task of automatically discovering a conservative estimate of the orthographically based morphological structure in a language-independent manner is a useful one.</Paragraph> <Paragraph position="1"> Additionally, an initial description of a language's morphology could provide a starting point for supervised morphological models, such as the memory-based algorithm of Van den Bosch and Daelemans (1999), which cannot be used on languages for which annotated data is unavailable.</Paragraph> <Paragraph position="2"> During the last decade, several minimally supervised and unsupervised algorithms that address the problem have been developed. Gaussier (1999) describes an explicitly probabilistic system that is based primarily on spellings. It is an unsupervised algorithm but requires parameter tweaking to tune it to the target language. 
Brent (1993) and Brent et al. (1995) described Minimum Description Length (MDL) systems. One approach used only the spellings of the words; another, which attempted to find the set of suffixes in the language, used the syntactic categories from a tagged corpus as well. While both are unsupervised, the latter is not knowledge-free and requires data that is tagged for part of speech, making it less suitable for analyzing under-examined languages.</Paragraph> <Paragraph position="3"> A similar MDL approach is described by Goldsmith (2001). It is ideal in being both knowledge-free and unsupervised. The difficulty lies in the liberal definition of morphology that Goldsmith uses for evaluation; a more conservative approach would seem to be a better hypothesis to bootstrap from.</Paragraph> <Paragraph position="4"> We previously (Snover and Brent, 2001) presented a very conservative unsupervised system, July 2002, pp. 11-20. Association for Computational Linguistics.</Paragraph> </Section> </Paper>