<?xml version="1.0" standalone="yes"?> <Paper uid="P04-2007"> <Title>Towards a Semantic Classification of Spanish Verbs Based on Subcategorisation Information</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Acquisition of Spanish Subcategorisation Frames </SectionTitle> <Paragraph position="0"> Subcategorisation frames encode how many arguments are required by the verb, and of what syntactic type. Acquiring the subcategorisation frames for a verb therefore involves, in the first place, distinguishing which constituents are its arguments and which are adjuncts, elements that merely add extra information to the sentence.</Paragraph> <Paragraph position="1"> Moreover, sentences contain other constituents that are not included in the subcategorisation frames of verbs: these are sub-constituents that are not structurally attached to the verb, but to other constituents.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Methodology and Materials </SectionTitle> <Paragraph position="0"> We test our methodology on two corpora of different sizes, both consisting of Spanish newswire text: a 3 million word corpus, hereafter called the small corpus, and a 50 million word corpus, hereafter called the large corpus. Both are POS tagged and partially parsed using the MS-analyzer, a partial parser for Spanish that includes named entity recognition (Atserias et al., 1998).</Paragraph> <Paragraph position="1"> In order to collect the frequency distributions of Spanish subcategorisation frames, we adapt a methodology that has been developed for English ((Brent, 1993), (Manning, 1993), (Korhonen, 2002b)) to the specificities of the Spanish language. It consists of extracting from the corpus pairs made of a verb and its co-occurring constituents that form a possible frame pattern, and then filtering out the patterns whose probability of co-occurrence with the verb is not high enough for them to be considered its arguments.</Paragraph> <Paragraph position="2"> We establish a set of 11 possible Spanish subcategorisation frames. These are the plausible combinations of a maximum of 2 of the following constituents: nominal phrases, prepositional phrases, temporal sentential clauses, gerundive sentential clauses, infinitival sentential clauses, and infinitival sentential clauses introduced by a preposition. The individual prepositions are also taken into account as part of the subcategorisation frame types.</Paragraph>
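To make the extraction step concrete, the sketch below shows one way the pattern collection could be implemented over chunked sentences. The chunk representation, the constituent labels and the function names are hypothetical illustrations, not the actual output format of the MS-analyzer, and the refinements discussed next (clitic handling, order variation) are not included.

    # Hypothetical sketch of candidate-frame extraction from a partially parsed
    # sentence. A sentence is assumed to be a list of (chunk_type, head) tuples;
    # for PP chunks the recorded head is assumed to be the preposition, e.g.
    # [("NP", "gobierno"), ("V", "presentar"), ("PP", "a"), ("NP", "alegatos")].

    ARG_CHUNKS = {"NP", "PP", "TEMP_CL", "GER_CL", "INF_CL", "INF_PP_CL"}
    MAX_ARGS = 2  # frame patterns combine at most two constituents

    def candidate_pattern(chunks, verb_index):
        """Collect up to MAX_ARGS constituents following the verb as a frame pattern."""
        pattern = []
        for chunk_type, head in chunks[verb_index + 1:]:
            if chunk_type not in ARG_CHUNKS:
                break
            # individual prepositions are kept as part of the frame type, e.g. PP(a)
            label = "PP(" + head + ")" if chunk_type == "PP" else chunk_type
            pattern.append(label)
            if len(pattern) == MAX_ARGS:
                break
        return tuple(pattern)

    def extract_pairs(parsed_corpus):
        """Yield (verb_lemma, pattern) pairs for every verb chunk in the corpus."""
        for chunks in parsed_corpus:
            for i, (chunk_type, head) in enumerate(chunks):
                if chunk_type == "V":
                    yield head, candidate_pattern(chunks, i)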
<Paragraph position="3"> Adapting a methodology that was designed for English presents a few problems, because English is a language with a strong word order constraint, while in Spanish the order of constituents is freer. Although the unmarked order of constituents is Subject Verb Object, with the direct object preceding the indirect object, in naturally occurring language the constituents can be moved to non-canonical positions. Since we extract the patterns from a partially parsed corpus, which has no information on the attachment or grammatical function of the constituents, we have to take into account that the extraction is an approximation. There are various phenomena that can lead us to an erroneous extraction of the constituents. As an illustrative example, in Spanish it is possible to invert the order of the objects, as can be observed in sentence (1), where the indirect object a Straw ('to Straw') precedes the direct object los alegatos ('the pleas').</Paragraph> <Paragraph position="4"> (1) El gobierno chileno presentará hoy a Straw los alegatos (...).</Paragraph> <Paragraph position="5"> 'The Chilean government will present today to Straw the pleas (...).'</Paragraph> <Paragraph position="6"> Dealing with this kind of phenomenon introduces some noise into the data. Matching a pattern for a subcategorisation frame in sentence (1), for example, we would misleadingly induce the pattern [PP(a)] for the verb presentar, 'present', when in fact the correct pattern for this sentence is [NP PP(a)].</Paragraph> <Paragraph position="7"> The solution we adopt for dealing with variations in the order of constituents is to take into account the functional information provided by clitics. Clitics are unstressed pronouns that refer to an antecedent in the discourse. In Spanish, clitic pronouns can only refer to the subject, the direct object, or the indirect object of the verb, and they can in most cases be disambiguated by taking into account their agreement (in person, number and gender) with the verb. When we find a clitic pronoun in a sentence, we know that an argument position is already filled by it, and the remaining constituents that are candidates for that position are either discarded or moved to another position. Sentence (2) shows how the presence of clitic pronouns allows us to transform the extracted patterns. The sentence would normally match the frame pattern [PP(por)], but the presence of the clitic (which has the form le) allows us to deduce that the sentence contains an indirect object, realised in the subcategorisation pattern as a prepositional phrase headed by a in the second position. Therefore, we look for the following nominal phrase, la aparición del cadáver, to fill the direct object slot, which would otherwise not have been included in the pattern. (2) Por la tarde, agentes del cuerpo nacional de policía le comunicaron por teléfono la aparición del cadáver.</Paragraph> <Paragraph position="8"> 'In the afternoon, agents of the national police [clitic-IO] reported by phone the appearance of the corpse.'</Paragraph> <Paragraph position="9"> The collection of verb+pattern pairs obtained with this method needs to be filtered, because we may have extracted constituents that are in fact adjuncts, elements that are not attached to the verb, or errors of the extraction process. We filter out the spurious patterns with a Maximum Likelihood Estimate (MLE), a method proposed by (Korhonen, 2002b) for this task. The MLE is calculated as the ratio of the frequency of the pair verb + pattern, f(verb, pattern), over the frequency of the verb, f(verb). Verb+pattern pairs whose probability of co-occurring together does not exceed a certain threshold are filtered out. The threshold is determined empirically using held-out data (20% of the total corpus), by choosing, from a range of values between 0.02 and 0.1, the value that yields the best results against a held-out gold standard of 10 verbs.</Paragraph> <Paragraph position="10"> In our experiments, this method yields a threshold value of 0.05.</Paragraph> </Section>
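As a rough illustration of the filtering step just described, the following sketch computes the relative-frequency (MLE) score for each extracted verb+pattern pair and applies the empirically chosen threshold. The data layout is an assumption for the example, not taken from the original implementation.

    from collections import Counter

    def filter_frames(pairs, threshold=0.05):
        """Keep (verb, pattern) pairs whose relative frequency exceeds the threshold.

        pairs: list of (verb, pattern) tuples extracted from the corpus.
        Returns a dict mapping each verb to its retained frame patterns.
        """
        pair_freq = Counter(pairs)           # f(verb, pattern)
        verb_freq = Counter()                 # f(verb)
        for (verb, _), n in pair_freq.items():
            verb_freq[verb] += n

        frames = {}
        for (verb, pattern), f_vp in pair_freq.items():
            mle = f_vp / verb_freq[verb]      # relative-frequency estimate
            if mle > threshold:
                frames.setdefault(verb, set()).add(pattern)
        return frames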
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Experimental Evaluation </SectionTitle> <Paragraph position="0"> We evaluate the obtained subcategorisation frames in terms of precision and recall compared to a gold standard.</Paragraph> <Paragraph position="1"> The gold standard is manually constructed for a sample of 41 verbs. The verb sample is chosen randomly from our data, with the condition that both frequent and infrequent verbs are represented and that we have examples of all our subcategorisation frame types. We perform experiments on two corpora of different sizes, expecting the differences in the results to show that a larger amount of data significantly improves the performance of a given system without any changes in the methodology. After the extraction process, the small corpus consists of 58,493 verb+pattern pairs, while the large corpus contains 1,253,188 pairs. (Footnote 1: Prepositional constituents in the second position of the pattern that are introduced with the preposition de, 'of', are excluded from the extraction. This is motivated by the observation that in 96.8% of the cases this preposition is attached to the preceding constituent, and not to the verb.) Since we include in our patterns the heads of the prepositional phrases, the corpora contain a large number of pattern types (838 in the small corpus, and 2,099 in the large corpus). We investigate grouping semantically equivalent prepositions together, in order to reduce the number of pattern types and thereby increase the probabilities of the patterns. The preposition groups are established manually.</Paragraph> <Paragraph position="2"> Table 1 shows the average results obtained on the two corpora for the 41 test verbs. The baselines are established by considering all the frame patterns obtained in the extraction process as correct frames. The experiments on the large corpus give better results than those on the small one, and grouping similar prepositions together is useful only on the large corpus. This is probably because the small corpus does not suffer from an excessively large number of frame types, so the effect of the groupings cannot be noticed. The F measure of 66% reported in the third line of Table 1, obtained on the large corpus with preposition groups, compares favourably to the F measure reported by (Korhonen, 2002b) for a similar experiment on English subcategorisation frames.</Paragraph> </Section> </Section>
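The evaluation against the gold standard can be pictured as a straightforward set comparison per verb. The sketch below assumes that both the acquired and the gold-standard frames are stored as sets of pattern labels per verb for the same 41 test verbs; the exact matching criteria (for example, how preposition groups are compared) are not specified here.

    def evaluate(acquired, gold):
        """Precision, recall and F measure over verb-frame pairs.

        acquired, gold: dicts mapping each test verb to a set of frame patterns.
        """
        tp = sum(len(acquired.get(v, set()).intersection(frames))
                 for v, frames in gold.items())
        n_acquired = sum(len(frames) for frames in acquired.values())
        n_gold = sum(len(frames) for frames in gold.values())

        precision = tp / n_acquired if n_acquired else 0.0
        recall = tp / n_gold if n_gold else 0.0
        f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        return precision, recall, f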
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Clustering Verbs into Classes </SectionTitle> <Paragraph position="0"> We use a bottom-up hierarchical clustering algorithm to group 514 verbs into K classes.</Paragraph> <Paragraph position="1"> The algorithm starts by computing the similarities between all possible pairs of objects in the data according to a similarity measure S. After having established the distances between all the pairs, it links the closest pairs of objects together with a linkage method L, forming a binary cluster. The linking process is repeated iteratively over the newly created clusters until all the objects are grouped into one cluster. K, S and L are parameters that can be set for the clustering. For the similarity measure S, we choose the Euclidean distance. For the linkage method L, we choose the Ward linkage method (Ward, 1963). Our choice of parameter settings is motivated by the work of (Stevenson and Joanis, 2003). By applying a clustering method to the verbs in our data, we expect to find a natural division of the data that is in accordance with the classification of verbs that we have set as our target classification. We perform experiments with different values of K in order to test which granularity yields better results.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Target Classification </SectionTitle> <Paragraph position="0"> In order to be able to evaluate the clusters output by the algorithm, we need a manual classification of the sample verbs. We adopt the manual classification of Spanish verbs developed by (Vázquez et al., 2000). In their classification, verbs are organised on the basis of meaning components, diathesis alternations and event structure.</Paragraph> <Paragraph position="1"> They classify a large number of verbs into three main classes (Trajectory, Change and Attitude) that are further subdivided into a total of 31 subclasses.</Paragraph> <Paragraph position="2"> Their classification follows the same basic hypotheses as Levin's, but the resulting classes differ in some important respects. For example, the Trajectory class groups together Levin's Verbs of Motion (move), Verbs of Communication (tell) and Verbs of Change of Possession (give), among others. Their justification for this grouping is that all the verbs in this class have a Trajectory meaning component, and that they all undergo the Underspecification alternation (in Levin's terminology, the Locative Preposition Drop and the Unspecified Object alternations). The size of the classes at the lower level of the classification hierarchy varies from 2 to 176.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Materials </SectionTitle> <Paragraph position="0"> The input to the algorithm is a description of each verb in the form of a vector containing the probabilities of its subcategorisation frames. We obtain the subcategorisation frames with the configuration of the method described in the previous section that gave the best results: using the large corpus, and reducing the number of frame types by merging individual prepositions into groups. In order to reduce the number of frame types still further, we only take into account those that occur more than 10 times in the corpus. In this way, we obtain a set of 66 frame types. Moreover, for the purpose of the classification task, the subcategorisation frames are enhanced with extra information that is intended to reflect properties of the verbs that are relevant for the target classification. The target classification is based on three aspects of the verbs' properties: meaning components, diathesis alternations, and event structure, but the information provided by subcategorisation frames only reflects the second of them. We expect to provide some information on the meaning components participating in the action by taking into account whether subjects and direct objects are recognised by the partial parser as named entities.</Paragraph> <Paragraph position="1"> The possible labels for these constituents are 'no NE', 'persons', 'locations', and 'institutions'.</Paragraph> <Paragraph position="2"> We introduce this new feature by splitting the probability mass of each frame among the possible labels, according to their frequencies. This gives a total of 97 features for each verb in our sample.</Paragraph> </Section>
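For reference, a minimal sketch of the clustering set-up described above (Euclidean distance, Ward linkage, K output clusters) using SciPy's agglomerative clustering; the feature matrix is assumed to hold one 97-dimensional probability vector per verb, built as just described.

    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_verbs(feature_matrix, k):
        """Group verbs into k clusters with bottom-up (agglomerative) clustering.

        feature_matrix: array of shape (n_verbs, n_features), e.g. (514, 97),
        holding the (NE-enhanced) subcategorisation frame probabilities.
        Returns an array of cluster labels in 1..k, one per verb.
        """
        # Ward linkage over Euclidean distances, as in the parameter setting above
        tree = linkage(feature_matrix, method="ward", metric="euclidean")
        return fcluster(tree, t=k, criterion="maxclust")

    # Example: 3-, 15- and 31-way partitions of the same data
    # labels_3  = cluster_verbs(X, 3)
    # labels_15 = cluster_verbs(X, 15)
    # labels_31 = cluster_verbs(X, 31)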
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Clustering Evaluation </SectionTitle> <Paragraph position="0"> Evaluating the results of a clustering experiment is a complex task, because ideally we would like the output to fulfil different goals. On the one hand, the clusters obtained should reflect a good partition of the data, yielding consistent clusters. On the other hand, the partition of the data should be as similar as possible to the manually constructed classification, the gold standard. We use the Silhouette measure (Kaufman and Rousseeuw, 1990) as an indication of the consistency of the obtained clusters, regardless of the division of the data in the gold standard. For each clustering experiment, we calculate the mean Silhouette value over all the data points, in order to obtain an indication of the overall quality of the clusters created. The main difficulty in evaluating unsupervised classification tasks against a gold standard lies in the fact that the class labels of the obtained clusters are unknown. Therefore, the evaluation is done according to the pairs of objects that the two groupings have in common. (Schulte im Walde, 2003) reports that the evaluation method most appropriate to the task of unsupervised verb classification is the Adjusted Rand measure. It gives a value of 1 if the two classifications agree completely on which pairs of objects are clustered together and which are not, while complete disagreement between the two classifications yields a value of -1.</Paragraph> </Section>
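Both evaluation measures, and the random baseline used in the next subsection, are available off the shelf in scikit-learn; a sketch under the assumption that gold classes and cluster labels are encoded as integer arrays of the same length.

    import numpy as np
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    def evaluate_clustering(features, cluster_labels, gold_labels, n_random=100, seed=0):
        """Mean Silhouette, Adjusted Rand against the gold standard, and a random baseline."""
        sil = silhouette_score(features, cluster_labels, metric="euclidean")
        ari = adjusted_rand_score(gold_labels, cluster_labels)

        # Baseline: average Adjusted Rand over random assignments to the same number of clusters
        rng = np.random.default_rng(seed)
        k = len(set(cluster_labels))
        baseline = np.mean([
            adjusted_rand_score(gold_labels, rng.integers(1, k + 1, size=len(gold_labels)))
            for _ in range(n_random)
        ])
        return sil, ari, baseline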
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Experimental Results </SectionTitle> <Paragraph position="0"> We perform various clustering experiments in order to test, on the one hand, the usefulness of our enhanced subcategorisation frames. On the other hand, we intend to discover which natural partition of the data best accommodates our target classification. The target classification is a hierarchy of three levels, dividing the data into 3, 15, or 31 classes. For this reason, we experiment with 3, 15, and 31 desired output clusters, and evaluate them against the corresponding level of the target classification.</Paragraph> <Paragraph position="1"> Table 2 shows the evaluation results of the clustering experiment that takes bare subcategorisation frames as input. Table 3 shows the evaluation results of the experiment that includes named entity recognition in the features describing the verbs. In both tables, each line reports the results of a classification task. The average Silhouette measure is shown in the second column. We can observe that the best classification tasks in terms of the Silhouette measure are the 3-way and 15-way classifications. The baseline is calculated, for each task, as the average value of the Adjusted Rand measure over 100 random cluster assignments. Although all the tasks perform better than the baseline, the increase is so small that it is clear that the experiments still need to be improved. According to the Adjusted Rand measure, the clustering algorithm seems to perform better on the tasks with a larger number of classes. On the other hand, the enhanced features are useful for the 15-way and 3-way classifications, but they are harmful for the 31-way classification. In spite of these results, a qualitative inspection of the output clusters reveals that they are intuitively plausible, and that the evaluation is penalised by the fact that the target classes are of very different sizes. Moreover, our data take only syntactic information into account, while the target classification is based not only on syntax but also on other aspects of the verbs' properties. These results compare poorly to the performance achieved by (Schulte im Walde, 2003), who obtains an Adjusted Rand measure of 0.15 on a similar task, in which she classifies 168 German verbs into 43 semantic verb classes. Nevertheless, our results are comparable to a subset of the experiments reported in (Stevenson and Joanis, 2003), where similar clustering experiments on English verbs, based on a general description of the verbs, obtain average Adjusted Rand measures of 0.04 and 0.07.</Paragraph> </Section> </Section> </Paper>