<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1068"> <Title>Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Project overview </SectionTitle> <Paragraph position="0"> The aim of the SALSA project is to construct a large semantically annotated corpus and to provide methods for its utilisation.</Paragraph> <Paragraph position="1"> Corpus construction. In the first phase of the project, we annotate the TIGER corpus in part manually, in part semi-automatically, having tools propose tags which are verified by human annotators. In the second phase, we will extend these tools for the weakly supervised annotation of a much larger corpus, using the TIGER corpus as training data.</Paragraph> <Paragraph position="2"> Utilisation. The SALSA corpus is designed to be utilisable for many purposes, such as improving statistical parsers and extending methods for information extraction and access. The focus in the SALSA project itself is on lexical semantics, and our first use of the corpus will be to extract selectional preferences for frame elements.</Paragraph> <Paragraph position="3"> The SALSA corpus will be tagged with the following types of semantic information: FrameNet frames. We tag all FEEs that occur in the corpus with their appropriate frames, and specify their frame elements. Thus, our focus is different from the lexicographic orientation of the FrameNet project mentioned above. As we tag all corpus instances of each FEE, we expect to encounter a wider range of phenomena. Currently, FrameNet only exists for English and is still under development. We will produce a &quot;light version&quot; of a FrameNet for German as a by-product of the annotation, reusing as many as possible of the semantic frame descriptions from the English FrameNet database. 
Our first results indicate that the frame structure assumed for the description of the English lexicon can be reused for German, with minor changes and extensions.</Paragraph> <Paragraph position="4"> Word sense. The additional value of word sense disambiguation in a corpus is obvious. However, exhaustive word sense annotation is a highly time-consuming task. We therefore opted for a selective annotation policy, annotating only the heads of frame elements. GermaNet, the German WordNet version, will be used as a basis for the annotation.</Paragraph> <Paragraph position="5"> Coreference. Similarly, we will selectively annotate coreference. If a lexical head of a frame element is an anaphor, we specify the antecedent to make the meaning of the frame element accessible.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Frame Annotation </SectionTitle> <Paragraph position="0"> Annotation schema. To give a first impression of frame annotation, we turn to the sentence in Fig. 2: (1) SPD fordert Koalition zu Gespräch über Reform auf.</Paragraph> <Paragraph position="1"> (SPD requests that coalition talk about reform.) Fig. 3 shows the frame annotation associated with (1). Frames are drawn as flat trees. The root node is labelled with the frame name. The edges are labelled with abbreviated FE names, like SPKR for SPEAKER, plus the tag FEE for the frame-evoking element. The terminal nodes of the frame trees are always nodes of the syntactic tree. Cases where a semantic unit (FE or FEE) does not form one syntactic constituent, like fordert . . . auf in the example, are represented by assigning the same label to several edges. Sentence (1), a newspaper headline, contains at least two FEEs: auffordern and Gespräch. The verb auffordern belongs to the frame REQUEST (see Fig. 1). 
In our example the SPEAKER is the subject NP SPD, the ADDRESSEE is the direct object NP Koalition, and the MESSAGE is the complex PP zu Gespräch über Reform. So far, the frame structure follows the syntactic structure, except for the fact that the FEE, as a separable prefix verb, is realized by two syntactic nodes. However, it is not always the case that frame structure parallels syntactic structure. The second FEE Gespräch introduces the frame CONVERSATION. In this frame two (or more) groups talk to one another and no participant is construed as only a SPEAKER or only an ADDRESSEE. In our example the only NP-internal frame element is the TOPIC (&quot;what the message is about&quot;) über Reform, whereas the INTERLOCUTOR-1 (&quot;the prominent participant in the conversation&quot;) is realized by the direct object of auffordern.</Paragraph> <Paragraph position="2"> As shown in Fig. 3, frames are annotated as trees of depth one. Although it might seem semantically more adequate to admit deeper frame trees, e.g. to allow the MSG edge of the REQUEST frame in Fig.</Paragraph> <Paragraph position="3"> 3 to be the root node of the CONVERSATION tree, as its &quot;real&quot; semantic argument, the representation of frame structure in terms of flat and independent semantic trees seems to be preferable for a number of practical reasons: It makes the annotation process more modular and flexible - this way, no frame annotation relies on previous frame annotation. The closeness to the syntactic structure makes the annotators' task easier. Finally, it facilitates statistical evaluation by providing small units of semantic information that are locally related to syntax.</Paragraph> <Paragraph position="4"> Difficult cases. Because frame elements may span more than one sentence, as in the case of direct speech, we cannot restrict ourselves to annotation at sentence level. Also, compound nouns require annotation below word level. 
For example, the word &quot;Gagenforderung&quot; (demand for wages) consists of &quot;-forderung&quot; (demand), which invokes the frame REQUEST, and a MESSAGE element &quot;Gagen-&quot;. Another interesting point is that one word may introduce more than one frame in cases of co-ordination and ellipsis. An example is shown in (2). In the elliptical clause only one fifth for daughters, the elided bought introduces a C T frame. So we let the bought in the antecedent introduce two frames, one for the antecedent and one for the ellipsis.</Paragraph> <Paragraph position="5"> (2) Ein Viertel aller Spielwaren würden für Söhne erworben, nur ein Fünftel für Töchter. (One quarter of all toys are bought for sons, only one fifth for daughters.) Annotation process. Frame annotation proceeds one frame-evoking lemma at a time, using subcorpora containing all instances of the lemma with some surrounding context. Since most FEEs are polysemous, there will usually be several frames relevant to a subcorpus. Annotators first select a frame for an instance of the target lemma. Then they assign frame elements.</Paragraph> <Paragraph position="6"> At the moment the annotation uses XML tags on bare text. The syntactic structure of the TIGER sentences can be accessed in a separate viewer. An annotation tool is being implemented that will provide a graphical interface for the annotation. It will display the syntactic structure and allow for graphical manipulation of semantic frame trees, similar to what is shown in Fig. 3.</Paragraph> <Paragraph position="7"> Extending FrameNet. Since FrameNet is far from complete, there are many word senses not yet covered. For example, the verb fordern, which belongs to the REQUEST frame, additionally has the reading challenge, for which the current version of FrameNet does not supply a frame.</Paragraph> </Section> </Paper>