<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2102">
<Title>Unsupervised Induction of Modern Standard Arabic Verb Classes Using Syntactic Frames and LSA</Title>
<Section position="11" start_page="797" end_page="799" type="evalu">
<SectionTitle>7 Evaluation</SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="797" end_page="798" type="sub_section">
<SectionTitle>7.1 Data Preparation</SectionTitle>
<Paragraph position="0"> The four sets of features are cast as the column dimensions of a matrix, with the MSA lemmatized verbs constituting the row entries. The data used for the syntactic frames is obtained from the ATB (ATB1v3, ATB2v2, and ATB3v2). The ATB is a collection of 1800 stories of newswire text from three different press agencies, comprising a total of 800,000 Arabic tokens after clitic segmentation. The domain of the corpus covers mostly politics, economics, and sports journalism. To extract data sets for the frames, the treebank is first lemmatized by looking up lemma information for each word in its manually chosen corresponding output of BAMA (the choice is provided in the Treebank files). Next, each active verb is extracted along with its sister constituents under the VP, in addition to NP-TPC.</Paragraph>
<Paragraph position="1"> As mentioned above, the only constituents kept as the frame are those labeled NP-TPC, NP-SBJ, NP-OBJ, NP-DTV, PP-CLR, and SBAR. For PP-CLRs and SBARs, the head preposition or complementizer, which is assumed to be the left-most daughter of the phrase, is extracted. The verbs and frames are put into a matrix where the row entries are the verbs and the column entries are the frames. The elements of the matrix are the frequencies of the row verb occurring in a given frame column entry. There are 2401 verb types and 320 frame types, corresponding to 52167 total verb frame tokens.</Paragraph>
<Paragraph position="2"> For the LSA feature, we apply LSA to the AG corpus. AG (GIGAWORD 2) comprises 481 million words of newswire text. The AG corpus is morphologically disambiguated using MADA, an SVM-based system that disambiguates among the different morphological analyses produced by BAMA (Habash and Rambow, 2005). We extract the lemma forms of all the words in AG and use them for the LSA algorithm. To extract the LSA vectors, first the lemmatized AG data is split into 100-sentence-long pseudo-documents. Next, an LSA model is trained using the Infomap software on half of the AG (due to size limitations of Infomap). Infomap constructs a word similarity matrix in document space, then reduces the dimensionality of the data using SVD. LSA reduces AG to 44 dimensions. A 44-dimensional vector is then extracted for each verb, forming the LSA data set for clustering.</Paragraph>
<Paragraph position="3"> Subject animacy information is represented as three feature columns in our matrix. One column entry represents the frequency with which a verb co-occurs with an empty subject (represented as an NP-SBJ dominating the NONE tag; 21586 tokens). Another column has the frequency with which the NP-SBJ/NP-TPC dominates a pronoun (represented in the corpus as the tag PRON; 3715 tokens). Finally, the last subject animacy column entry represents the frequency with which an NP-SBJ/NP-TPC dominates a proper name (tagged NOUN_PROP; 4221 tokens).</Paragraph>
<Paragraph position="4"> The morphological pattern associated with each verb is extracted by looking up the lemma in the output of BAMA. The pattern information is added as a feature column to our matrix of verbs by features.</Paragraph>
</Section>
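As a concrete illustration of the matrix construction described above, the following is a minimal sketch that builds a verb-by-frame frequency matrix from extracted (verb, frame) pairs. The pair data and the frame-string encoding are hypothetical stand-ins; the actual ATB extraction (lemma lookup, VP sister constituents) is not shown.

```python
from collections import Counter
import numpy as np

# Hypothetical input: (lemma, frame) pairs extracted from the ATB, where a
# frame is the sequence of kept sister labels, with PP-CLR/SBAR reduced to
# their head preposition/complementizer as described above.
pairs = [
    ("kataba", "NP-SBJ_NP-OBJ"),
    ("kataba", "NP-SBJ_PP-CLR:li"),
    ("qAla",   "NP-SBJ_SBAR:>an"),
    ("qAla",   "NP-SBJ_SBAR:>an"),
]

verbs  = sorted({v for v, _ in pairs})   # row entries: 2401 types in the paper
frames = sorted({f for _, f in pairs})   # column entries: 320 types in the paper
v_idx  = {v: i for i, v in enumerate(verbs)}
f_idx  = {f: j for j, f in enumerate(frames)}

# Each cell holds the frequency of the row verb in the column frame.
counts = Counter(pairs)
M = np.zeros((len(verbs), len(frames)), dtype=int)
for (v, f), n in counts.items():
    M[v_idx[v], f_idx[f]] = n
```

The LSA, subject animacy, and morphological pattern features are then appended to this matrix as additional columns for the same verb rows.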
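The LSA step itself is performed with Infomap; the sketch below is only a rough analogue using scikit-learn's TruncatedSVD over a term-by-pseudo-document count matrix, reducing to the paper's 44 dimensions (clipped here because the toy corpus is tiny). The corpus strings are hypothetical lemma sequences.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the lemmatized AG corpus: one string of lemmas
# per 100-sentence pseudo-document.
pseudo_docs = [
    "kataba qAla kitAb jarIdap siyAsap",
    "qAla ra>aY >amal iqtiSAd sUq",
    "kataba ra>aY jarIdap sUq siyAsap",
]

# Term-document counts, then SVD, mirroring the word-by-document matrix
# plus dimensionality reduction that Infomap performs.
vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(pseudo_docs)               # docs x terms

k = min(44, X.shape[1] - 1)                      # 44 in the paper
svd = TruncatedSVD(n_components=k, random_state=0)
svd.fit(X)
term_vecs = svd.components_.T                    # terms x k: one vector per lemma

lsa_vector = dict(zip(vec.get_feature_names_out(), term_vecs))
# lsa_vector["kataba"] is that verb's reduced LSA feature vector.
```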
<Section position="2" start_page="798" end_page="798" type="sub_section">
<SectionTitle>7.2 Gold Standard Data</SectionTitle>
<Paragraph position="0"> The gold standard data is created automatically by taking the English translations corresponding to the MSA verb entries provided with the ATB distributions. We use these English translations to locate the lemmatized MSA verbs in the Levin English classes represented in the Levin Verb Index (Levin, 1993), thereby creating an approximated MSA set of verb classes corresponding to the English Levin classes. Admittedly, this is a crude way to create a gold standard set. Given the lack of a pre-existing classification for MSA verbs, and the novelty of the task, we consider it a first approximation towards the creation of a real gold standard classification set in the near future.</Paragraph>
<Paragraph position="1"> Since the translations are assigned manually to the verb entries in the ATB, we assume that they are a faithful representation of the MSA language used.</Paragraph>
<Paragraph position="2"> Moreover, we contend that lexical semantic meanings, if they hold cross-linguistically, would be defined by distributions of syntactic alternations.</Paragraph>
<Paragraph position="3"> Unfortunately, this gold standard set is noisier than expected due to several factors: each MSA morphological analysis in the ATB has several associated translations, which include both polysemy and homonymy. Moreover, some of these translations are adjectives, nouns, or phrasal expressions. Such divergences occur naturally, but they are rampant in this data set. Hence, the resulting Arabic classes are at a finer level of granularity than their English counterparts because of missing verbs in each cluster. There are also many gaps (unclassified verbs) when the translation is not a verb, or is a verb that is not in the Levin classification. Of the 480 most frequent verb types used in this study, 74 are not in the translated Levin classification.</Paragraph>
</Section>
<Section position="3" start_page="798" end_page="798" type="sub_section">
<SectionTitle>7.3 Clustering Algorithms</SectionTitle>
<Paragraph position="0"> We use the clustering algorithms implemented in the cluster library (Kaufman and Rousseeuw, 1990) in the R statistical computing language. The soft clustering algorithm, called FANNY, is a type of fuzzy clustering, where each observation is &quot;spread out&quot; over various clusters. Thus, the output is a membership function P(x_i, c), the membership of element x_i in cluster c. The memberships are nonnegative and sum to 1 for each fixed observation. The algorithm takes k, the number of clusters, as a parameter and uses a Euclidean distance measure. We determine k empirically, as explained below.</Paragraph>
</Section>
<Section position="4" start_page="798" end_page="799" type="sub_section">
<SectionTitle>7.4 Evaluation Metric</SectionTitle>
<Paragraph position="0"> The evaluation metric used here is a variation on an F-score derived for hard clustering (Chklovski and Mihalcea, 2003). The result is an Fβ measure, where β is the coefficient of the relative strengths of precision and recall. β = 1 for all results we report. The score measures the maximum overlap between a hypothesized cluster (HYP) and a corresponding gold standard cluster (GOLD), and computes a weighted average across all the GOLD clusters: Fβ = (1/‖C‖) Σ_{G ∈ GOLD} ‖G‖ · max_{H ∈ HYP} Fβ(H, G), where ‖C‖ is the total number of verbs to be clustered. This is the measure that we report; it weights precision and recall equally.</Paragraph>
</Section>
</Section>
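The clustering itself is done with FANNY from R's cluster library (Section 7.3). As a rough Python analogue under the same Euclidean distance, the sketch below implements plain fuzzy c-means, whose output, like FANNY's, is a nonnegative membership matrix with rows summing to 1, and then derives discrete cluster members with a probability threshold as in Section 7.5. All data and parameter values are illustrative.

```python
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns an (n x k) membership matrix whose
    rows are nonnegative and sum to 1, analogous to FANNY's P(x_i, c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))        # random soft assignment
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                      # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # renormalize per observation
    return U

# Hypothetical verb-by-feature matrix (rows = verbs) and parameters.
X = np.random.default_rng(1).random((30, 6))
U = fuzzy_cmeans(X, k=10)          # k is determined empirically in the paper

# Discrete cluster members, as in Section 7.5: keep verb i in cluster c
# whenever its membership U[i, c] exceeds a threshold such as 0.16.
members = [np.nonzero(U[:, c] > 0.16)[0] for c in range(U.shape[1])]
```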
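A minimal sketch of the evaluation metric as reconstructed above: each GOLD cluster is matched with its best-overlapping HYP cluster by F1, and the scores are averaged weighted by GOLD cluster size over ‖C‖. Treating ‖C‖ as the sum of GOLD cluster sizes is an assumption here, and the verb IDs are hypothetical.

```python
def f1(hyp, gold):
    """Harmonic mean of precision and recall of HYP against GOLD (beta = 1)."""
    overlap = len(hyp & gold)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(gold)
    return 2 * p * r / (p + r)

def clustering_fscore(hyps, golds):
    """Weighted average over GOLD clusters of the best-matching HYP F1,
    weighting each GOLD cluster by its share of the ||C|| clustered verbs."""
    total = sum(len(g) for g in golds)   # ||C||, assumed = sum of GOLD sizes
    return sum(len(g) / total * max(f1(h, g) for h in hyps) for g in golds)

# Toy example; clusters are sets of hypothetical verb IDs.
hyps  = [{1, 2, 3}, {4, 5}, {3, 6}]      # soft clustering: verb 3 is in two HYPs
golds = [{1, 2, 3}, {4, 5, 6}]
print(clustering_fscore(hyps, golds))    # 1.0 * 3/6 + 0.8 * 3/6 = 0.9
```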
<Section position="12" start_page="799" end_page="799" type="evalu">
<SectionTitle>7.5 Results</SectionTitle>
<Paragraph position="0"> To determine the features that yield the best clustering of the extracted verbs, we run tests comparing seven different factors of the model, in a 2×2×2×2×3×3×5 design, with the first four factors being the substantive informational factors and the last three being parameters of the clustering algorithm. For the feature selection experiments, the informational factors all have two conditions, which encode the presence or absence of the information associated with them. The first factor represents the syntactic frame vectors, the second the LSA semantic vectors, the third the subject animacy, and the fourth the morphological pattern of the verb.</Paragraph>
<Paragraph position="1"> The fifth through seventh factors are parameters of the clustering algorithm. The fifth factor is three different numbers of verbs clustered: the 115, 268, and 406 most frequent verb types, respectively. The sixth factor is the number of clusters (k). These values depend on the number of verbs tested at a time, so this factor is represented as a fraction of the number of verbs: the chosen values are 1/6, 1/3, and 1/2 of the number of verbs. The seventh and last factor is a threshold probability used to derive discrete members for each cluster from the cluster probability distribution produced by the soft clustering algorithm. In order to cover a good range of the variation in the effect of the threshold, we empirically choose five different threshold values: 0.03, 0.06, 0.09, 0.16, and 0.21. The purpose of the last three factors is to control for the amount of variation introduced by the parameters of the clustering algorithm, in order to determine the effect of the informational factors. Evaluation scores are obtained for all combinations of all seven factors (minus the no-information condition; the algorithm must have some input), resulting in 704 conditions.</Paragraph>
<Paragraph position="2"> We compare our best results to a random baseline. In the baseline, verbs are randomly assigned to clusters whose sizes are on average the same as each other and as GOLD. (It is worth noting that this gives an added advantage to the random baseline, since a cluster size comparable to GOLD implicitly contributes to a higher overlap score.) The highest overall Fβ=1 score is 0.456, and it results from using syntactic frames, LSA vectors, subject animacy, 406 verbs, 202 clusters, and a threshold of 0.16. The average cluster size is 3, because this is a soft clustering.</Paragraph>
<Paragraph position="3"> The random baseline achieves an overall Fβ=1 of 0.205 with comparable settings of 406 verbs randomly assigned to 202 clusters of approximately equal size.</Paragraph>
<Paragraph position="4"> To determine which features contribute significantly to clustering quality, a statistical analysis of the clustering experiments is undertaken in the next section.</Paragraph>
</Section>
</Paper>