<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0612"> <Title>Constructing an English Valency Lexicon</Title> <Section position="3" start_page="94" end_page="94" type="metho"> <SectionTitle> 2 Valency Lexicon Construction 2.1 FGD </SectionTitle> <Paragraph position="0"> The Functional Generative Description (FGD) (Sgall et al., 1986) is a dependency-based formal stratificational language description framework that goes back to the functional-structural Prague School. For more detail see (Panevová, 1980) and (Sgall et al., 1986). The theory of FGD has been implemented in the Prague Dependency Treebank project (Sgall et al., 1986; Hajič, 2005).</Paragraph> <Paragraph position="1"> FGD captures valency in the underlying syntax (the so-called tectogrammatical language layer). It allows complementations (syntactically dependent autosemantic lexemes) to be listed in a valency lexicon regardless of their surface (morphosyntactic) forms, and provides them with semantic labels (functors) instead. Implicitly, a complementation present in the tectogrammatical layer can either be rendered directly in the surface shape of the sentence or be omitted, in which case it can be inferred from the context or from common knowledge. A valency lexicon describes the valency behavior of a given lexeme (verb, noun, adjective or adverb) in the form of valency frames.</Paragraph> <Section position="1" start_page="94" end_page="94" type="sub_section"> <SectionTitle> 2.2 Valency within FGD </SectionTitle> <Paragraph position="0"> A valency frame in the strict sense consists of inner participants and obligatory free modifications (see e.g. (Panevová, 2002)). Free modifications are prototypically optional and do not belong to the valency frame in the strict sense, though some frames require a free modification (e.g. direction with verbs of movement). Free modifications have semantic labels (there are more than 40 in PDT), which are assigned according to the semantic judgments of the annotators. 
FGD introduces five inner participants. Unlike free modifications, inner participants cannot be repeated within one frame. They can be obligatory or optional, which is determined by judging the grammaticality of the given sentence and by the so-called dialogue test (Panevová, 1974-75). Both the obligatory and the optional inner participants belong to the valency frame in the strict sense.</Paragraph> <Paragraph position="1"> Like the free modifications, the inner participants have semantic labels according to the cognitive roles they typically enter: ACT (Actor), PAT (Patient), ADDR (Addressee), ORIG (Origin) and EFF (Effect). Syntactic criteria are used to identify the first two participants, ACT and PAT ("shifting", see (Panevová, 1974-75)). The other inner participants are identified semantically; i.e. a verb with one inner participant will have ACT, a verb with two inner participants will have ACT and PAT regardless of the semantics, and a verb with three or more participants will have the remaining labels assigned by semantic judgment.</Paragraph> </Section> <Section position="2" start_page="94" end_page="94" type="sub_section"> <SectionTitle> 2.3 The Prague Czech-English Dependency Treebank </SectionTitle> <Paragraph position="0"> In order to develop a state-of-the-art machine translation system we are aiming at a high-quality annotation of the Penn Treebank data in a formalism similar to the one developed for PDT.</Paragraph> <Paragraph position="1"> When building PEDT we can draw on the successfully accomplished Prague Czech-English Dependency Treebank 1.0 (J. Cuřín and M. Čmejrek and J. Havelka and J. Hajič and V. Kuboň and Z.</Paragraph> <Paragraph position="2"> Žabokrtský, 2004) (PCEDT).</Paragraph> <Paragraph position="3"> PCEDT is a Czech-English parallel corpus, consisting of 21,600 sentences from the Wall Street Journal section of the Penn Treebank 3 corpus and their human translations to Czech. 
The Czech data was automatically morphologically analyzed and parsed by a statistical parser on the analytical (i.e. surface-syntax) layer. The Czech tectogrammatical layer was automatically generated from the analytical layer. The English analytical and tectogrammatical trees were derived automatically from the Penn Treebank phrasal trees.</Paragraph> </Section> <Section position="3" start_page="94" end_page="94" type="sub_section"> <SectionTitle> 2.4 The Prague English Dependency Treebank </SectionTitle> <Paragraph position="0"> The Prague English Dependency Treebank (PEDT) denotes the data from the Wall Street Journal section of the Penn Treebank annotated in the style of PDT 2.0. EngValLex is a supporting tool for the manual annotation of the tectogrammatical layer of PEDT.</Paragraph> </Section> </Section> <Section position="4" start_page="94" end_page="95" type="metho"> <SectionTitle> 3 Lexicon Structure </SectionTitle> <Paragraph position="0"> On the topmost level, EngValLex consists of word entries, which are characterized by lemmas. Verbs with a particle (e.g. give up) are treated as separate word entries.</Paragraph> <Paragraph position="1"> Each word entry consists of a sequence of frame entries, which roughly correspond to individual senses of the word entry and contain the valency information.</Paragraph> <Paragraph position="2"> Each frame entry consists of a sequence of valency slots, a sequence of example sentences and a textual note. Each valency slot corresponds to a complementation of the verb and is described by a tectogrammatical functor defining the relation between the verb and the complementation, and a form defining the possible surface representations of the functor. Valency slots can be marked as optional; otherwise they are considered obligatory. The form is listed in round brackets following the functor name. Surface representations of functors are basically defined by combinations of morphological tags and lemmas. 
To save the annotators' effort, however, we have introduced several abbreviations that stand for regularly co-occurring sequences. For example, the abbreviation n means 'noun in the subjective case' and is defined as NN:NNS:NP:NPS, meaning one of the Penn Treebank part-of-speech tags NN, NNS, NP and NPS (a colon delimits variants). Abbreviations may be defined recursively. Apart from describing only the daughter node of the given verb, the surface representation can describe an entire analytical subtree whose topmost node is the daughter of the given verb node. Square brackets are used to indicate descendant nodes, and they can be nested to indicate the dependency relations among the nodes of a given subtree. For example, the following statement describes a particle to whose daughter node is a verb.</Paragraph> <Paragraph position="3"> to.TO[VB] The following statement is an example of a definition of three valency slots and their corresponding forms:</Paragraph> <Paragraph position="5"> The ACT (Actor) can be any noun in the subjective case (the abbreviation n), the PAT (Patient) can be a particle to with a daughter verb, and the LOC (Locative) can be the preposition at.</Paragraph> <Paragraph position="6"> Moreover, EngValLex contains links to external data sources (e.g. lexicons) from words, frames, valency slots and example sentences.</Paragraph> <Paragraph position="7"> The lexicon is stored in an XML format which is similar to the format of the PDT-VALLEX lexicon used in the Prague Dependency Treebank 2.0.</Paragraph> </Section> <Section position="5" start_page="95" end_page="96" type="metho"> <SectionTitle> 4 Creating the Lexicon </SectionTitle> <Paragraph position="0"> The lexicon was automatically generated from PropBank using XSLT templates. Each PropBank example was expanded into a single frame in the destination lexicon. When generating the lexicon, we kept as many back links to PropBank as possible. 
Namely, we stored links from frames to PropBank rolesets, links from valency slots to PropBank arguments and links from examples to PropBank examples. Rolesets were identified by the roleset id attribute. Arguments were identified by the roleset id, the name and the function of the role. Examples were identified by the roleset id and their name.</Paragraph> <Paragraph position="1"> After the automatic conversion, we had 8,215 frames for 3,806 words.</Paragraph> <Paragraph position="2"> Tectogrammatical functors were assigned semi-automatically according to hand-written rules, which were conditioned by PropBank arguments.</Paragraph> <Paragraph position="3"> It was clear from the beginning, however, that manual corrections would be necessary, as the relations of Args to functors varied depending on linguistic decisions. The annotators were provided with an annotation editor based on the PDT-VALLEX editor. Apart from the interface for editing EngValLex, the tool contains integrated viewers of PropBank and VerbNet, which allow offline browsing of the lexicons. The viewers can also be run as stand-alone applications and are freely available on the web. The editor allows the annotator to create, delete, and modify word entries and frame entries. Links to PropBank can be set up, if necessary.</Paragraph> <Paragraph position="4"> Figure 1 displays the main window of the editor. The left part of the window shows a list of words. The central part shows the list of frames for the selected verb.</Paragraph> <Paragraph position="5"> For the purpose of annotation, we divided the lexicon into 1,992 files according to the name of PropBank rolesets (attribute name of the XML element roleset), and the files are annotated separately. When the annotation is finished, the files will be merged again. Currently, about 80% of the lexicon is annotated, and this portion already includes the most difficult cases.</Paragraph> </Section> </Paper>
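<!--
The form notation described in Section 3 can be sketched as a small parser. This is a minimal illustration in Python, not the EngValLex implementation: the function names, the dictionary layout, and the sample statement "ACT(n) PAT(to.TO[VB]) LOC(at.IN)" are assumptions made for this sketch. In particular, the tag IN for the preposition at, the reading of a dot as the separator between lemma and tag, and the restriction of colon-delimited variants to the top level of a form are all hypothetical. Only the abbreviation n with its expansion and the form to.TO[VB] are taken from the text.

```python
import re

# Abbreviation table: only "n" and its expansion are quoted in the text.
ABBREVIATIONS = {"n": "NN:NNS:NP:NPS"}

# A node token is any run of characters that is not a bracket, colon, or space.
TOKEN = re.compile(r"[^\[\]():\s]+")


def expand_variants(form, table=ABBREVIATIONS, seen=()):
    """Split a form on ':' and expand abbreviations recursively."""
    variants = []
    for variant in form.split(":"):
        if variant in table:
            if variant in seen:
                raise ValueError(f"cyclic abbreviation: {variant!r}")
            variants.extend(expand_variants(table[variant], table, seen + (variant,)))
        else:
            variants.append(variant)
    return variants


def parse_node(text, pos=0):
    """Parse one node such as 'to.TO[VB]'; return (node, next position)."""
    match = TOKEN.match(text, pos)
    if match is None:
        raise ValueError(f"expected a node at position {pos} in {text!r}")
    # Assumption: the text before the last dot is the lemma, the rest is the tag.
    lemma, dot, tag = match.group().rpartition(".")
    node = {"lemma": lemma if dot else None, "tag": tag, "children": []}
    pos = match.end()
    if pos < len(text) and text[pos] == "[":
        pos += 1
        while True:
            if pos >= len(text):
                raise ValueError(f"unclosed '[' in {text!r}")
            if text[pos] == "]":
                pos += 1
                break
            if text[pos].isspace():
                pos += 1
                continue
            # Nested brackets are handled by the recursive call.
            child, pos = parse_node(text, pos)
            node["children"].append(child)
    return node, pos


def parse_form(form):
    """Return one node tree per surface variant of a form."""
    trees = []
    for variant in expand_variants(form):
        node, end = parse_node(variant)
        if end != len(variant):
            raise ValueError(f"unexpected text after node in {variant!r}")
        trees.append(node)
    return trees


def parse_slots(statement):
    """Map each functor in 'FUNCTOR(form) ...' to its list of variant trees."""
    return {m.group(1): parse_form(m.group(2))
            for m in re.finditer(r"(\w+)\(([^()]*)\)", statement)}


if __name__ == "__main__":
    # Hypothetical statement built from the prose description of the three slots.
    slots = parse_slots("ACT(n) PAT(to.TO[VB]) LOC(at.IN)")
    for functor, trees in slots.items():
        print(functor, trees)
```

Keeping each variant as a separate tree mirrors the colon-delimited alternatives of the notation. How optional slots are marked is not specified in the text, so the sketch does not model it.
-->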