<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2039"> <Title>Unsupervised Induction of Modern Standard Arabic Verb Classes</Title> <Section position="4" start_page="153" end_page="153" type="metho"> <SectionTitle> 3 Clustering </SectionTitle> <Paragraph position="0"> We employ both soft and hard clustering techniques to induce the verb classes, using the clustering algorithms implemented in the library cluster (Kaufman and Rousseeuw, 1990) in the R statistical computing language. The soft clustering algorithm, called FANNY, is a type of fuzzy clustering, where each observation is &quot;spread out&quot; over various clusters. Thus, the output is a membership function P(x</Paragraph> <Paragraph position="2"> to cluster c. The memberships are nonnegative and sum to 1 for each fixed observation. The algorithm takes k, the number of clusters, as a parameter and uses a Euclidean distance measure.</Paragraph> <Paragraph position="3"> The hard clustering used is a type of k-means clustering The canonical k-means algorithm proceeds by iteratively assigning elements to a cluster whose center (centroid) is closest in Euclidian distance.</Paragraph> </Section> <Section position="5" start_page="153" end_page="153" type="metho"> <SectionTitle> 4 Features </SectionTitle> <Paragraph position="0"> For both clustering techniques, we explore three different sets of features. The features are cast as the column dimensions of a matrix with the MSA lemmatized verbs constituting the row entries.</Paragraph> <Paragraph position="1"> Information content of frames This is the main feature set used in the clustering algorithm. These are the syntactic frames in which the verbs occur.</Paragraph> <Paragraph position="2"> The syntactic frames are defined as the sister constituents of the verb in a Verb Phrase (VP) constituent. null We vary the type of information resulting from the syntactic frames as input to our clustering algorithms. We investigate the impact of different levels of granularity of frame information on the clustering of the verbs. We create four different data sets based on the syntactic frame information reflecting four levels of frame information: FRAME1 includes all frames with all head information for PPs and SBARs, FRAME2 includes only head information for PPs but no head information for SBARs, FRAME3 includes no head information for neither PPs nor SBARs, and FRAME4 is constructed with all head information, but no constituent ordering information. For all four frame information sets, the elements in the matrix are the co-occurrence frequencies of a verb with a given column heading.</Paragraph> <Paragraph position="3"> Verb pattern The ATB includes morphological analyses for each verb resulting from the Buckwalter null analyzer. Semitic languages such as Arabic have a rich templatic morphology, and this analysis includes the root and pattern information of each verb. This feature is of particular scientific interest because it is unique to the Semitic languages, and has an interesting potential correlation with argument structure.</Paragraph> <Paragraph position="4"> Subject animacy In an attempt to allow the clustering algorithm to use information closer to actual argument structure than mere syntactic frames, we add a feature that indicates whether a verb requires an animate subject. 
<Section position="6" start_page="153" end_page="154" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="153" end_page="154" type="sub_section"> <SectionTitle> 5.1 Data Preparation </SectionTitle>
<Paragraph position="0"> The data used are obtained from the ATB. The ATB is a collection of 1,800 stories of newswire text from three different press agencies, comprising a total of 800,000 Arabic tokens after clitic segmentation.</Paragraph>
<Paragraph position="1"> The domain of the corpus covers mostly politics, economics, and sports journalism. Each active verb is extracted from the lemmatized treebank along with its sister constituents under the VP. The elements of the matrix are the frequencies of the row verb co-occurring with a feature column entry. There are 2,074 verb types and 321 frame types, corresponding to 54,954 total verb-frame tokens. Subject animacy information is extracted and represented as four feature columns in our matrix, corresponding to the four subject NP types. The morphological pattern associated with each verb is extracted by looking up the lemma in the output of the morphological analyzer, which is included with the treebank release.</Paragraph> </Section>
<Section position="2" start_page="154" end_page="154" type="sub_section"> <SectionTitle> 5.2 Gold Standard Data </SectionTitle>
<Paragraph position="0"> The gold standard data is created automatically by taking the English translations corresponding to the MSA verb entries provided with the ATB distributions. We use these English translations to locate the lemmatized MSA verbs in the Levin English classes represented in the Levin Verb Index, thereby creating an approximate set of MSA verb classes corresponding to the English Levin classes. Admittedly, this is a crude way to create a gold standard set.</Paragraph>
<Paragraph position="1"> Given the lack of a pre-existing classification for MSA verbs, and the novelty of the task, we consider it a first approximation towards the creation of a true gold standard classification in the near future.</Paragraph> </Section>
<Section position="3" start_page="154" end_page="154" type="sub_section"> <SectionTitle> 5.3 Evaluation Metric </SectionTitle>
<Paragraph position="0"> The evaluation metric used here is a variation on an F-score derived for hard clustering (Rijsbergen, 1979). The result is an Fβ measure, where β is the coefficient of the relative strengths of precision and recall; β = 1 for all results we report. The score measures the maximum overlap between a hypothesized cluster (HYP) and a corresponding gold standard cluster (GOLD), and computes a weighted average of these maximum scores across all the HYP clusters, with each cluster weighted by its size relative to ||A||, the total number of verbs that were clustered into the HYP set. This can be larger than the number of verbs to be clustered because verbs can be members of more than one cluster.</Paragraph> </Section> </Section> </Paper>
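The displayed formula for the clustering F-score in Section 5.3 did not survive extraction. The LaTeX block below is a hedged reconstruction from the surrounding prose, assuming each HYP cluster A_i is weighted by its size and scored against its best-overlapping GOLD cluster B_j; the symbols A_i and B_j are introduced here for illustration and are not taken from the paper.

```latex
% Hedged reconstruction of the Section 5.3 metric, not the paper's verbatim formula.
% A_i ranges over HYP clusters, B_j over GOLD clusters; \|A\| is the total number
% of verbs clustered into the HYP set (verbs may belong to more than one cluster).
F_\beta(\mathrm{HYP}, \mathrm{GOLD})
  = \sum_{A_i \in \mathrm{HYP}} \frac{|A_i|}{\|A\|}
    \max_{B_j \in \mathrm{GOLD}} F_\beta(A_i, B_j),
\qquad
F_\beta(A_i, B_j) = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\quad
P = \frac{|A_i \cap B_j|}{|A_i|}, \qquad
R = \frac{|A_i \cap B_j|}{|B_j|}.
```

With β = 1, the per-cluster score reduces to the harmonic mean of precision and recall against the best-matching GOLD cluster.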