<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2102">
  <Title>Unsupervised Induction of Modern Standard Arabic Verb Classes Using Syntactic Frames and LSA</Title>
  <Section position="5" start_page="795" end_page="795" type="metho">
    <SectionTitle>
1. INTRANSITIVE USE, optionally followed by
</SectionTitle>
    <Paragraph position="0"> a path: They skated (along the river bank).</Paragraph>
  </Section>
  <Section position="6" start_page="795" end_page="795" type="metho">
    <SectionTitle>
2. INDUCED ACTION (some verbs)
</SectionTitle>
    <Paragraph position="0"> Pat skated (Kim) around the rink.</Paragraph>
    <Paragraph position="1"> Levin lists 184 manually created classes for English, which are not intended as an exhaustive classification. Many verbs appear in multiple classes, due both to the inherent polysemy of the verbs and to other aspectual variations such as argument structure preferences. As an example of the latter, a verb such as eat occurs in two different classes: one defined by the Unspecified Object Alternation, where it can appear both with and without an explicit direct object, and another defined by the Conative Alternation, where its second argument appears either as a direct object or as the object of the preposition at. It is important to note that the Levin classes aim to group verbs based on their event structure, reflecting aspectual and manner similarities rather than similarity due to their describing the same or similar events. Therefore, the semantic class similarity in Levin classes is coarser grained than what one would expect from a semantic classification based on distributional similarity, such as Latent Semantic Analysis (LSA) algorithms. For illustration, one would expect an LSA algorithm to group skate and rollerblade in one class and bicycle, motorbike, and scooter in another; yet Levin puts them all in the same class based on their syntactic behavior, which reflects their common event structure: an activity with a possible causative participant. One of the purposes of this work is to test this hypothesis by examining the relative contributions of LSA and syntactic frames to verb clustering.</Paragraph>
  </Section>
  <Section position="7" start_page="795" end_page="796" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> Based on the Levin classes, many researchers have attempted to induce such classes automatically. Notably, the work of Merlo and Stevenson (2001) attempts to induce three main English verb classes on a large scale from parsed corpora: Unergative, Unaccusative, and Object-drop verbs. They report 69.8% accuracy on a task whose baseline is 34% and whose expert-based upper bound is 86.5%. In a task similar to ours, except for its use of English, Schulte im Walde clusters English verbs semantically by their alternation behavior, using frames from a statistical parser combined with WordNet classes. She evaluates against the published Levin classes and reports that 61% of all verbs are clustered into correct classes, against a baseline of 5%.</Paragraph>
  </Section>
  <Section position="8" start_page="796" end_page="796" type="metho">
    <SectionTitle>
4 Arabic Linguistic Phenomena
</SectionTitle>
    <Paragraph position="0"> In this paper, the language of interest is MSA.</Paragraph>
    <Paragraph position="1"> Arabic verbal morphology provides an interesting piece of explicit lexical semantic information in the lexical form of the verb. Arabic verbs have two basic parts, the root and pattern/template, which combine to form the basic derivational form of a verb. Typically a root consists of three or four consonants, referred to as radicals. A pattern, on the other hand, is a distribution of vowel and consonant affixes on the root, resulting in Arabic derivational lexical morphology. As an example, the root k t b,2 if interspersed with the pattern 1a2a3 - the numbers correspond to the positions of the first, second, and third radicals in the root, respectively - yields katab, meaning write. However, if the pattern were ma1A2i3, resulting in the word makAtib, it would mean offices/desks or correspondences.</Paragraph>
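The root-and-pattern interdigitation described above can be sketched programmatically. This is an illustrative toy, not part of the paper's pipeline: the digit-placeholder convention follows the paper's 1a2a3 notation, and the helper name is hypothetical.

```python
# Illustrative sketch: interdigitating a triliteral root with a
# pattern template, where digits 1-3 mark the radical slots.
def apply_pattern(root, pattern):
    """Replace digit placeholders in the pattern with root radicals."""
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(root[int(ch) - 1])  # '1' -> first radical, etc.
        else:
            out.append(ch)
    return "".join(out)

# The root k-t-b with pattern 1a2a3 yields katab ('write');
# with ma1A2i3 it yields makAtib ('offices/desks').
print(apply_pattern("ktb", "1a2a3"))    # katab
print(apply_pattern("ktb", "ma1A2i3"))  # makAtib
```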
    <Paragraph position="2"> There are fifteen pattern forms for MSA verbs, of which ten are commonly used. Not all verbs occur with all ten patterns. These root-pattern combinations tend to indicate a particular lexical semantic event structure in the verb.</Paragraph>
  </Section>
  <Section position="9" start_page="796" end_page="796" type="metho">
    <SectionTitle>
5 Clustering
</SectionTitle>
    <Paragraph position="0"> Taking the linguistic phenomena of MSA as features, we apply clustering techniques to the problem of inducing verb classes. We showed in Snider &amp; Diab (2006) that soft clustering performs best on this task compared to hard clustering; therefore we employ soft clustering techniques to induce the verb classes here. Clustering algorithms partition a set of data into groups, or clusters, based on a similarity metric. Soft clustering allows elements to be members of multiple clusters simultaneously and to have degrees of membership in all clusters.</Paragraph>
    <Paragraph position="1"> This membership is sometimes represented in a probabilistic framework by a distribution P(xi, c), which characterizes the probability that a verb xi is a member of cluster c.</Paragraph>
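A minimal sketch of what soft membership means in practice, assuming a simple similarity-based scheme (this is an illustration of the idea, not the paper's actual clustering algorithm): each item receives a probability distribution over all clusters rather than a single hard assignment.

```python
# Sketch of soft cluster membership: a verb's feature vector gets a
# degree of membership in every cluster, proportional to a Gaussian
# similarity to each cluster centroid. Illustrative only.
import math

def soft_memberships(vec, centroids):
    """Return a probability distribution over clusters for one item."""
    sims = [math.exp(-sum((v - c) ** 2 for v, c in zip(vec, cen)))
            for cen in centroids]
    total = sum(sims)
    return [s / total for s in sims]

centroids = [(0.0, 0.0), (3.0, 3.0)]
dist = soft_memberships((0.5, 0.5), centroids)
# The item leans toward cluster 0 but keeps nonzero mass on cluster 1,
# and the memberships sum to 1 like P(xi, c) over clusters.
print(dist)
```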
  </Section>
  <Section position="10" start_page="796" end_page="797" type="metho">
    <SectionTitle>
6 Features
</SectionTitle>
    <Paragraph position="0"> Syntactic frames The syntactic frames are defined as the sister constituents of the verb in a Verb Phrase (VP) constituent, namely, Noun Phrases (NP), Prepositional Phrases (PP), and Sentential Complements (SBARs and Ss). Not all of these constituents are necessarily arguments of the verb, so we take advantage of functional tag annotations in the ATB. Hence, we only include NPs with function annotation: subjects (NP-SBJ), topicalized subjects (NP-TPC),3 objects (NP-OBJ), and second objects in dative constructions (NP-DTV). The PPs deemed relevant to the particular sense of the verb are tagged by the ATB annotators as PP-CLR. We assume that these are argument PPs, and include them in our frames. Finally, we include sentential complements (SBAR and S). While some of these will no doubt be adjuncts (e.g., purpose clauses), we assume that those that are arguments will occur in greater numbers with particular verbs, while adjuncts will be distributed randomly across all verbs.</Paragraph>
    <Paragraph position="1"> Given Arabic's somewhat free constituent order, frames are counted as the same when they contain the same constituents, regardless of order.</Paragraph>
    <Paragraph position="2"> Also, for each constituent that is headed by a function word (PPs and SBARs), such as prepositions and complementizers, the headword is extracted to capture syntactic alternations that are sensitive to preposition or complementizer type. It is worth noting that this corresponds to the FRAME1 configuration described in our previous study (Snider and Diab, 2006). Finally, only active verbs are included in this study, rather than attempting to reconstruct the argument structure of passives.</Paragraph>
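The frame representation described above (order-insensitive constituents, with headwords attached to function-word-headed constituents) can be sketched as follows. The function name and the tuple encoding are hypothetical; the constituent labels follow the ATB tags cited in the text.

```python
# Hypothetical sketch of the frame key described above: constituents
# carry their functional label, PPs/SBARs additionally carry their
# preposition/complementizer headword, and constituent order is
# discarded by sorting, matching Arabic's free constituent order.
def frame_signature(constituents):
    """constituents: list of (label, headword_or_None) pairs."""
    parts = [label if headword is None else f"{label}:{headword}"
             for label, headword in constituents]
    return tuple(sorted(parts))  # order-insensitive frame key

# Two clause orders yield the same frame:
a = frame_signature([("NP-SBJ", None), ("PP-CLR", "fy")])
b = frame_signature([("PP-CLR", "fy"), ("NP-SBJ", None)])
print(a == b)  # True
```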
    <Paragraph position="3"> Verb pattern The ATB includes morphological analyses for each verb resulting from the Buckwalter Analyzer (BAMA).4 For each verb, one of the analyses resulting from BAMA is chosen manually by the treebankers. The analyses are matched with the root and pattern information derived manually in a study by Nizar Habash (personal communication). This feature is of particular scientific interest because it is unique to Semitic languages and, as mentioned above, has an interesting potential correlation with argument structure. Subject animacy In an attempt to allow the clustering algorithm to use information closer to actual argument structure than mere syntactic frames, we add a feature that indicates whether a verb requires an animate subject. Merlo and Stevenson (2001) found that this feature improved their English verb clusters, but in Snider &amp; Diab (2006) we found that it did not contribute significantly to Arabic verb clustering quality. However, upon further inspection of the data, we discovered that we could improve the quality of this feature extraction in this study. Automatically determining animacy is difficult because it requires extensive manual annotation or access to an external resource such as WordNet, neither of which currently exists for Arabic. Instead we rely on an approximation that takes advantage of two generalizations from linguistics: the animacy hierarchy and zero-anaphora. According to the animacy hierarchy, as described in Silverstein (1976), pronouns tend to describe animate entities. Following a technique suggested by Merlo and Stevenson (2001), we take advantage of this tendency by adding a feature that is the number of times each verb occurs with a pronominal subject. We also take advantage of the phenomenon of zero-anaphora, or pro-drop, in Arabic as an additional indicator of subject animacy. 
Pro-drop is a common phenomenon in Romance languages as well as Semitic languages, where the subject is implicit and the only indicator of the subject is incorporated in the conjugation of the verb. According to work on information structure in discourse (Vallduví, 1992), pro-drop tends to occur with more given and animate subjects. To capture this generalization, we add a feature for the frequency with which a given verb occurs without an explicit subject. We further hypothesize that proper names are more likely to describe animates (humans, or organizations, which metonymically often behave like animates), adding a feature for the frequency with which a given verb occurs with a proper name.</Paragraph>
    <Paragraph position="4"> With these three features, we provide the clustering algorithm with subject animacy indicators.</Paragraph>
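Collecting the three animacy indicators amounts to per-verb counting over subject observations. The sketch below assumes a simplified input format of (verb, subject_type) pairs, where subject_type is 'pron' (pronominal subject), 'none' (pro-drop), or 'propn' (proper-name subject); the data layout and names are illustrative, not the paper's.

```python
# Illustrative counting of the three subject-animacy indicators per
# verb: pronominal subjects, pro-drop (no explicit subject), and
# proper-name subjects. Input format is an assumption for this sketch.
from collections import Counter, defaultdict

ANIMACY_INDICATORS = ("pron", "none", "propn")

def animacy_features(observations):
    """observations: iterable of (verb, subject_type) pairs."""
    feats = defaultdict(Counter)
    for verb, subj in observations:
        if subj in ANIMACY_INDICATORS:
            feats[verb][subj] += 1
    return feats

obs = [("ktb", "pron"), ("ktb", "none"),
       ("ktb", "propn"), ("ktb", "other")]
f = animacy_features(obs)
print(dict(f["ktb"]))  # {'pron': 1, 'none': 1, 'propn': 1}
```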
    <Paragraph position="5"> LSA semantic vector This feature is the semantic vector for each verb, as derived by Latent Semantic Analysis (LSA) of the AG. LSA is a dimensionality reduction technique that relies on Singular Value Decomposition (SVD) (Landauer and Dumais, 1997). The main strength in applying LSA to large quantities of text is that it discovers the latent similarities between concepts. It may be viewed as a form of clustering in conceptual space.</Paragraph>
  </Section>
</Paper>