<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0744"> <Title>Recognition and Tagging of Compound Verb Groups in Czech</Title> <Section position="4" start_page="219" end_page="221" type="metho"> <SectionTitle> 3 Learning Verb Rules </SectionTitle> <Paragraph position="0"> The algorithm for learning verb rules (ZPSbkovPS and Pala, 1999) takes as its input annotated sentences from corpus DESAM. The algorithm is split into three steps: finding w~rb chunks (i.e. finding boundaries of simple clauses in compound or in complex sentences, and elimination of gap words), generalisation and verb rule synthesis. These three steps are described</Paragraph> <Section position="1" start_page="219" end_page="220" type="sub_section"> <SectionTitle> 3.1 Verb Chunks </SectionTitle> <Paragraph position="0"> The observed properties of a verb group are the following: their components are either verbs or a reflexive pronoun se (si); the boundary of a verb group cannot be crossed by the boundary of a sentence; and between two components of the verb group there can be a gap consisting of an arbitrary number of non-verb words or even a whole sentence. In the first step, the boundaries of all sentences are found. Then each gap is replaced by tag gap.</Paragraph> <Paragraph position="1"> The method exploits only the lemma of each word (nominative singular for nouns, adjectives, pronouns and numerals, infinitive for verbs) and its tag. We will demonstrate the whole process using the third simplex sentence of the clause in Tab. 1 ( byla bych se jl zd6astnila (I would have participaied in iV):</Paragraph> </Section> <Section position="2" start_page="220" end_page="220" type="sub_section"> <SectionTitle> 3.2 Generalisation </SectionTitle> <Paragraph position="0"> The lemmata and the tags are now being generalised. Three generalisation operations are employed: elimination of (some of) lemmata, generalisation of grammatical categories and finding grammatical agreement constraints.</Paragraph> <Paragraph position="1"> All lemmata except forms of auxiliary verb bit (to be) (b~t, by, aby, kdyby) are rejected. Lemmata of modal verbs and verbs with similar behaviour are replaced by tag modal. These verbs have been found in the list of more than 15 000 verb valencies (Pala and SevePSek P., 1999). In our example it is the verb zdSastnit that is removed. null matical categories are not important for verb group description. Very often it is negation (e), or aspect (a - aI stands for imperfectum, aP for perfectum). These categories may be removed. For some of verbs even person (p) can be removed. In our example the values of those grammatical categories have been replaced by ? and we obtained Another situation appears when two or more values of some category are related. In the simplest case they have to be the same - e.g. the value of attribute person (p) in the first and the last word of our example. More complicated is the relation among the values of attribute number (n). They should be the same except when the polite way of addressing occurs, e.g. in byl byste se ji zdSastnil (you would have participated in it). Thus we have to check whether the values are the same or the conditions of polite way of addressing are satisfied. 
</Paragraph> </Section> <Section position="3" start_page="220" end_page="221" type="sub_section"> <SectionTitle> 3.3 DCG Rules Synthesis </SectionTitle> <Paragraph position="0"> Finally, the verb rule is constructed by rewriting the result of the generalisation phase. For the sentence byla bych se jí zúčastnila (I would have participated in it) we obtain

    verb_group(vg(Be, Cond, Se, Verb), Gaps) -->
        be(Be, _, P, N, tM, mP, _),           % být/k5e?p_n_tMmPa?
        cond(Cond, _, _, Ncond, tP, mC, _),   % by/k5e?p?n_tPmCa?
        { check_num(N, Ncond, Cond, Vy) },
        reflex_pron(Se, xX, _, _),            % k3xXnSc?
        gap([], Gaps),                        % gap
        k5(Verb, _, P, N, tM, mP, _).         % k5e?p_n_tMmPa?

If this rule does not yet exist in the set of verb rules, it is added to it. The meanings of the non-terminals used in the rule are the following: be() represents the auxiliary verb být, cond() represents the various conditional forms by, aby, kdyby, reflex_pron() stands for the reflexive pronoun se (si), gap() is a special predicate for the manipulation of gaps, and k5() stands for an arbitrary non-auxiliary verb. The particular values of some arguments of the non-terminals represent required properties. Simple cases of grammatical agreement are treated through the binding of variables. More complicated situations are solved by employing constraints like the predicate check_num().</Paragraph> <Paragraph position="1"> The method has been implemented in Perl. From the annotated corpus, 126 definite clause grammar rules were constructed that describe all verb groups frequent in Czech.</Paragraph> </Section> </Section> <Section position="5" start_page="221" end_page="221" type="metho"> <SectionTitle> 4 Recognition of Verb Groups </SectionTitle> <Paragraph position="0"> The verb rules have been used for recognition, and consequently for tagging, of verb groups in unannotated text. A portion of sentences which had not been used for learning was extracted from a corpus. Each sentence was ambiguously tagged with the LEMMA morphological analyser (Pala and Ševeček, 1995), i.e. each word of the sentence was assigned all possible tags. Then all the verb rules were applied to each sentence.
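To make this step concrete, the following minimal sketch shows one way the rules could be run over ambiguously tagged input; the token format w(Lemma, Tags) and the predicate recognise/3 are our assumptions, not the authors' Perl implementation.

    % Each token carries all readings proposed by the morphological
    % analyser; a non-terminal succeeds if any reading fits, e.g.:
    be('být', E, P, N, T, M, A) -->
        [w('být', Tags)],
        { member(tag(k5, E, P, N, T, M, A), Tags) }.

    % A clause contains a verb group if some verb rule derives one
    % from a prefix of the clause.
    recognise(Clause, VG, Gaps) :-
        phrase(verb_group(VG, Gaps), Clause, _Rest).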
The learned verb rules displayed quite good accuracy. For the corpus DESAM, a verb rule was correctly assigned to 92.3% of verb groups. We also tested how much this method depends on the corpus used for learning. As the source of testing data we used the Prague Tree Bank (PTB) corpus, which is under construction at Charles University in Prague. The accuracy did not differ from the results for DESAM. This may be explained by the fact that both corpora have been built from newspaper articles.</Paragraph> <Paragraph position="1"> Although the accuracy is acceptable for test data that also include clauses with just one verb, errors have been observed for complex sentences. In about 13% of them, some of the compound verb groups were not correctly recognised. It was observed that almost 70% of these errors were caused by incorrect lemma recognition. In the next section we describe a method for fixing this kind of error.</Paragraph> </Section> <Section position="6" start_page="221" end_page="222" type="metho"> <SectionTitle> 5 Fixing Misclassification Errors </SectionTitle> <Paragraph position="0"> We combined two approaches: elimination of lemmata which are very rare for a given word form, and inductive logic programming (Popelínský et al., 1999; Pavelek and Popelínský, 1999). The method is used in the post-processing phase to prune the set of rules that have fired for a sentence.</Paragraph> <Section position="1" start_page="221" end_page="222" type="sub_section"> <SectionTitle> 5.1 Elimination of infrequent lemmata </SectionTitle> <Paragraph position="0"> In Czech corpora it was observed that 10% of word positions - i.e. every 10th word of a text - have at least 2 lemmata, and about 1% of the word forms of the Czech vocabulary have at least 2 lemmata (Popelínský et al., 1999; Pavelek and Popelínský, 1999). E.g. the word form při can be either the preposition at (as in at the lesson) or the imperative of the verb přít se (to argue). We decided to remove all the verb rules that recognised a word-lemma couple with a very small frequency in the corpus. In practice, lemmata that did not appear more than twice in the DESAM corpus were assumed to be incorrect.</Paragraph> <Paragraph position="1"> For testing, we randomly chose a set of 600 examples containing compound or complex sentences from the corpus DESAM; 251 of the sentences contained only one verb. The results obtained are in Tab. 2. The first line contains the number of examples used. The following line gives the results of the original method as described in Section 4. The next line (+ infrequent lemmata) displays the results when word-lemma couples of very small frequency have been removed. The column '> 1 verb' concerns the sentences in which at least two verbs appeared. The column 'all' displays the accuracy for all sentences. Results for the corpus PTB are displayed in Tab. 3. It can be observed that after pruning the rules that contain a rare lemma the accuracy increased significantly.</Paragraph> </Section> <Section position="2" start_page="222" end_page="222" type="sub_section"> <SectionTitle> 5.2 Lemma disambiguation by ILP </SectionTitle> <Paragraph position="0"> Some incorrectly recognised lemmata cannot be fixed by the method just described. E.g. the word form se has two lemmata, se (reflexive pronoun) and s (preposition with), and both word-lemma couples are very frequent in Czech. For such cases we exploited inductive logic programming (ILP). The program reads the context of the lemma-ambiguous word and produces disambiguation rules (Popelínský et al., 1999; Pavelek and Popelínský, 1999). We employed the ILP system Aleph1.</Paragraph> <Paragraph position="1"> Domain knowledge predicates (Popelínský et al., 1999; Pavelek and Popelínský, 1999) have the form p(Context, first(N), Condition) or p(Context, somewhere, Condition), where Context contains the tags of either the left or the right context (for the left context in reverse order), first(N) defines a subset of Context, and somewhere does not narrow the context. The term Condition can have three forms: somewhere(List) (the values in List appear somewhere in the defined subset of Context), always(List) (the values appear in all tags in the given subset of Context) and n_times(N, Tag) (Tag appears N times in the specified context).
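A minimal sketch of how these background predicates might be realised, assuming each tag is itself represented as a list of category values (the helper holds/2 is our name, and n_times is read as "occurs in exactly N tags"):

    p(Context, first(N), Condition) :-     % restrict to the first N tags
        length(Prefix, N),
        append(Prefix, _, Context),
        holds(Condition, Prefix).
    p(Context, somewhere, Condition) :-    % test the whole context
        holds(Condition, Context).

    holds(somewhere(Values), Tags) :-      % Values all occur in some tag
        member(Tag, Tags),
        subset(Values, Tag).
    holds(always(Values), Tags) :-         % Values occur in every tag
        forall(member(Tag, Tags), subset(Values, Tag)).
    holds(n_times(N, Value), Tags) :-      % Value occurs in exactly N tags
        findall(x, (member(Tag, Tags), member(Value, Tag)), Hits),
        length(Hits, N).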
For example, p(Left, first(2), always([k5, eA])) succeeds if the values k5 and eA appear in each of the first two tags of the left context.</Paragraph> <Paragraph position="2"> The last lines of Tab. 2 and Tab. 3 give the percentage of correct rules when the lemma disambiguation has also been employed. The increase in accuracy was much smaller than after pruning the rules that contain a rare lemma. It has to be mentioned that in the case of PTB about half of the errors (incorrectly recognised verb rules) were caused by incorrect recognition of sentence boundaries.</Paragraph> <Paragraph position="3"> 1http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html</Paragraph> </Section> </Section> <Section position="7" start_page="222" end_page="223" type="metho"> <SectionTitle> 6 Tagging Verb Groups </SectionTitle> <Paragraph position="0"> We now describe a method for tagging compound verb groups in morphologically annotated corpora. We decided to use an SGML-like notation for tagging because it allows the new tags to be incorporated very easily into the DESAM corpus.</Paragraph> <Paragraph position="1"> The beginning and the end of the whole verb group, and the beginnings and ends of its particular components, are marked. For the sentence byla bych se jí zúčastnila (I would have participated in it) we receive a group opening with <vg tag="eApFnStPmCaPr1v0" fmverb="zúčastnit">, where <vg> </vg> point out the beginning and the end of the verb group and <vgp> </vgp> mark the components (parts) of the verb group.</Paragraph> <Paragraph position="2"> The assigned tag of the whole group - i.e. the values of its significant morphological categories - is included as the value of the attribute tag in the starting mark of the group. The value of the attribute fmverb is the full-meaning verb; this information can be exploited afterwards, e.g. for searching and processing of verb valencies.</Paragraph> <Paragraph position="3"> The value of the attribute tag is computed automatically from the verb rule that describes the compound verb group.</Paragraph> <Paragraph position="4"> We are also able to detect other properties of compound verb groups. In the example above the new category r is introduced; it indicates whether the group is reflexive (r1) or not (r0). The category v makes it possible to mark whether the group is in the form of the polite way of addressing (v1) or not (v0). The differences in the tag values can be observed by comparing the previous example with the polite form nebyl byste se jí zúčastnil (you would not have participated in it). The set of attributes can also be enriched, e.g. with the number of components. We also plan to include the compound verb group type among the attributes of <vg>; this will make it possible to find groups of the same type but with a different word order or number of components.
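For illustration, the complete markup of the example clause might look as follows; the placement of the component marks and of the gap word jí is our reconstruction following the notation above, not verbatim corpus output.

    <vg tag="eApFnStPmCaPr1v0" fmverb="zúčastnit">
      <vgp>byla</vgp> <vgp>bych</vgp> <vgp>se</vgp> jí <vgp>zúčastnila</vgp>
    </vg>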
</Paragraph> </Section> <Section position="8" start_page="223" end_page="223" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> Sometimes compound verb groups are defined in a less general way. Another approach that deals with the recognition and morphological tagging of compound verb groups in Czech appeared in (Osolsobě, 1999). There, basic compound verb groups in Czech, such as the active present, the passive past tense, the present conditional etc., are defined in terms of the grammatical categories used in the DESAM corpus. Two drawbacks of this approach can be observed. First, verb groups may only be composed of a reflexive pronoun, forms of the verb to be, and at most one full-meaning verb. Second, the gap between two words of a particular group cannot be longer than three words. The verb rules defined here are less general than the basic verb groups of (Osolsobě, 1999); in fact, the verb rules form a partition of them. Thus we can tag all these basic verb groups without the limitations mentioned above. In contrast to some other approaches, we also include in the groups some verbs which are in fact infinitive participants of verb valencies. However, we are able to detect such cases and recognise the "pure" verb groups afterwards. We believe that for some kinds of shallow semantic analysis - e.g. in dialogue systems - our approach is more convenient.</Paragraph> <Paragraph position="1"> We are also able to recognise the form of the polite way of addressing a person (which has no equivalent in English, although a similar phenomenon appears e.g. in French or German). We extend the tag of a verb group with this information because it is quite important for understanding the sentence. E.g. in šel jste (vous êtes allé) the word jste (êtes) should be counted as singular although it is always tagged as plural.</Paragraph> </Section> <Section position="9" start_page="223" end_page="224" type="metho"> <SectionTitle> 8 Ongoing Research </SectionTitle> <Paragraph position="0"> Our work is part of a project which aims at building a partial parser for Czech. The main idea of partial parsing (Abney, 1991) is to recognise those parts of sentences which can be recovered reliably and with a small amount of syntactic information. In this paper we have dealt with the recognition and tagging of potentially discontinuous verb chunks. The problem of recognising noun chunks in Czech was addressed in (Smrž and Žáčková, 1998). We aim at an advanced method that employs a minimum of ad hoc techniques and is, ideally, fully automatic. The first step in this direction, the method for pruning verb rules, has been described in this paper. In the future we want to make the method even more adaptive. Some preliminary results on finding sentence boundaries are given below.</Paragraph> <Paragraph position="1"> In Czech it is either a comma and/or a conjunction that makes the boundary between two sentences. From the corpus we randomly chose 406 pieces of text that contain a comma.</Paragraph> <Paragraph position="2"> In 294 cases the comma split two sentences. All "easy" cases (when a comma is followed by a conjunction, it must be a boundary) were removed.</Paragraph> <Paragraph position="3"> These accounted for 155 of the 294 cases. 80% of the remaining examples were used for learning; we again used Aleph. On the test set, the learned rules correctly recognised a comma as a delimiter of two sentences in 86.3% of cases. When the "easy" cases were added, the accuracy increased to 90.8%.</Paragraph> <Paragraph position="4"> We then tried to employ this method for the automatic finding of boundaries in our system for verb group recognition. A decrease in accuracy was expected, but it turned out to be quite small. Even when an extra boundary was found (i.e. the particular comma did not actually split two sentences), the correct verb groups were still found in most such cases. The reason is that such an incorrectly detected boundary very rarely splits a compound verb group.</Paragraph> <Paragraph position="5"> The last experiment concerned the case when a conjunction splits two sentences and the conjunction is not preceded by a comma. There are four such conjunctions in Czech: a (and), nebo (or), i (even) and ani (neither, nor). Using Aleph we obtained an accuracy on test data of 88.3% for a (500 examples, 90% used for learning) and 87.2% for nebo (110 examples). The last two conjunctions split sentences very rarely; in the current version of the corpus DESAM it has never happened.
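Purely as an illustration of the kind of rule Aleph induces here (this is not one of the actual learned rules), a boundary test could be expressed with the background predicates of Section 5.2:

    % A comma splits two sentences if a verb reading (k5) occurs
    % somewhere both in its left and in its right context.
    boundary(Left, Right) :-
        p(Left,  somewhere, somewhere([k5])),
        p(Right, somewhere, somewhere([k5])).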
</Paragraph> </Section> <Section position="10" start_page="224" end_page="224" type="metho"> <SectionTitle> 9 Related Work </SectionTitle> <Paragraph position="0"> Another approach to the recognition of compound verb groups in Czech (Osolsobě, 1999) has already been discussed in Section 7. Ramshaw and Marcus (Ramshaw and Marcus, 1995) view chunking as a tagging problem. They used transformation-based learning and achieved recall and precision rates of 93% for base noun phrases (non-recursive noun phrases) and 88% for chunks that partition the sentence. Verb chunking for English was addressed by Veenstra (Veenstra, 1999). He used the memory-based learner TiMBL for noun phrase, verb phrase and prepositional phrase chunking. Each word in a sentence was first assigned one of the tags I_T (inside the phrase), O_T (outside the phrase) and B_T (left-most word in a phrase that is preceded by another phrase of the same kind), where T stands for the kind of phrase.</Paragraph> <Paragraph position="1"> Chunking in Czech is more difficult than in English for two reasons. First, a gap inside a verb group may be more complex; it may even be a whole sentence. Second, Czech is a free word-order language, which makes recognising the structure of a verb group much more difficult.</Paragraph> </Section> </Paper>