<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0906"> <Title>Learning Argument/Adjunct Distinction for Basque</Title> <Section position="6" start_page="3" end_page="3" type="concl"> <SectionTitle> 4 Related work </SectionTitle> <Paragraph position="0"> Concerning the acquisition of verb subcategorization information, there are proposals ranging from manual examination of corpora (R.</Paragraph> <Paragraph position="1"> Grishman et al. 1994) to fully automatic approaches.</Paragraph> <Paragraph position="2"> Table 3, partially borrowed from A. Korhonen (2001), summarizes several systems on subcategorization frame acquisition.</Paragraph> <Paragraph position="3"> C. Manning (1993) presents the acquisition of subcategorization frames from unlabelled text corpora. He uses a stochastic tagger and a finite state parser to obtain instances of verbs with their adjacent elements (either arguments or adjuncts), and then a statistical filtering phase produces subcategorization frames (from a set of previously defined 19 frames) for each verb.</Paragraph> <Paragraph position="4"> T. Briscoe and J. Carroll (1997) describe a grammar based experiment for the extraction of subcategorization frames with their associated relative frequencies, obtaining 76.6% precision and 43.4% recall. Regarding evaluation, they use the ANLT and COMLEX Syntax dictionaries as gold standard. They also performed evaluation of coverage over a corpus. For our work, we could not make use of any previous information on subcategorization, because there is nothing like a subcategorization dictionary for Basque.</Paragraph> <Paragraph position="5"> A. Sarkar and D. Zeman (2000) report results on the automatic acquisition of subcategorization frames for verbs in Czech, a free word order language. The input to the system is a set of manually annotated sentences from a treebank, where each verb is linked with its dependents (without distinguishing arguments and adjuncts). The task consists in iteratively eliminating elements from the possible frames with the aim of removing adjuncts. For evaluation, they give an estimate of how many of the obtained frames appear in a set of 500 sentences where dependents were annotated manually, showing an improvement from a baseline of 57% (all elements are adjuncts) to 88%.</Paragraph> <Paragraph position="6"> Comparing this approach to our work, we must point out that Sarkar and Zeman's data does not come from raw corpus, and thus they do not deal with the problem of noise coming from the parsing phase. Their main limitation comes by relying on a treebank, which is an expensive resource.</Paragraph> <Paragraph position="7"> D. Kawahara et al. (2001) use a full syntactic parser to obtain a case frame dictionary for Japanese, where arguments are distinguished by their syntactic case, including their headword (selectional restrictions). The resulting case frame components are selected by a frequency threshold. M. Maragoudakis et al. (2001) apply a morphological analyzer and phrase chunking module to acquire subcategorization frames for Modern Greek. In contrast to this work, they use different machine learning techniques. They claim that Bayesian Belief Networks are the best learning technique.</Paragraph> <Paragraph position="8"> P. Merlo and M. Leybold (2001) present learning experiments for automatic distinction of arguments and adjuncts, applied to the case of prepositional phrases attached to a verb. 
<Paragraph position="9"> Note that both Manning's and Merlo and Leybold's systems learn from contexts with at most one PP per verb (a consequence of their finite state filters). Our system learns from contexts with up to 5 PPs. Furthermore, we distinguish 48 different kinds of cases, so the number of combinations is considerably larger.</Paragraph>
<Paragraph position="10"> Regarding the parsing phase, the systems presented so far are heterogeneous. While Manning, Merlo and Leybold, and Maragoudakis et al. use very simple parsing techniques, Briscoe and Carroll and Kawahara et al. use sophisticated parsers. Our system lies between these two approaches. The output of our shallow parsing is not simple, in that it relies on robust morphological analysis and disambiguation.</Paragraph>
<Paragraph position="11"> Remember that Basque is an agglutinative language with rich morphology; therefore, this stage is particularly relevant. Moreover, the finite state filter we used for parsing (L. Karttunen et al. 1997; I. Aldezabal et al. 2001) is very sophisticated compared to Manning's.</Paragraph>
<Paragraph position="13"> Conclusion
This work describes an initial effort to obtain subcategorization information for Basque. To perform this task successfully, we had to go deeper than mere syntactic categories (NP, PP, ...), enriching the set of possible arguments to 48 different classes. This leads to quite sparse data. Besides sparseness, another problem common to every subcategorization acquisition system is noise, coming from adjuncts and incorrectly parsed elements. For that reason, we defined subcategorization acquisition in terms of distinguishing between arguments and adjuncts.</Paragraph>
<Paragraph position="14"> The system presented was applied to a newspaper corpus. Subcategorization acquisition is closely tied to semantics, in that different senses of a verb will most of the time show different subcategorization information. Thus, the task of learning subcategorization information is influenced by the corpus. As for the evaluation of this work, we carried out two different kinds of evaluation. In this way, we verified the relevance of semantics to this kind of task.</Paragraph>
<Paragraph position="15"> In the future, we plan to incorporate the information resulting from this work into our parsing system. We hope that this will lead to better parsing results which, in turn, would yield better subcategorization information, in a bootstrapping cycle. We also plan to improve the results by using semantic information, as proposed in A. Korhonen (2001).</Paragraph>
</Section>
</Paper>