File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-1025_metho.xml
Size: 14,229 bytes
Last Modified: 2025-10-06 14:13:36
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1025"> <Title>PROBABILISTIC TAGGING WITH FEATURE STRUCTURES</Title> <Section position="4" start_page="0" end_page="161" type="metho"> <SectionTitle> 2 MATHEMATICAL BACKGROUND </SectionTitle>
<Paragraph position="0"> In order to assign tags to a word sequence, an HMM can be used where the tagger selects among all possible tag sequences the most probable one (Garside, Leech and Sampson, 1987; Church, 1988; Brown et al., 1989; Rabiner, 1990). The joint probability of a tag sequence T = t_0 ... t_{N-1} and a word sequence W = w_0 ... w_{N-1} is, in the case of a second order HMM:
p(T, W) = \pi_{t_0 t_1} \cdot p(w_0 \mid t_0) \cdot p(w_1 \mid t_1) \cdot \prod_{i=2}^{N-1} p(t_i \mid t_{i-2}, t_{i-1}) \cdot p(w_i \mid t_i) \quad (1)
</Paragraph>
<Paragraph position="2"> The term \pi_{t_0 t_1} stands for the initial state probability, i.e. the probability that the sequence begins with the first two tags. N is the number of words in the sequence, i.e. the corpus size. The term p(w_i \mid t_i) is the probability of a word w_i in the context of the assigned tag t_i. It is called observation symbol probability (lexical probability) and can be estimated by:
p(w_i \mid t_i) = \frac{f(w_i, t_i)}{f(t_i)} \quad (2)
</Paragraph>
<Paragraph position="4"> The second order state transition probability (contextual probability) p(t_i \mid t_{i-2}, t_{i-1}) in formula (1) expresses how probable it is that the tag t_i appears in the context of its two preceding tags t_{i-2} and t_{i-1}. It is usually estimated as the ratio of the frequency of the trigram (t_{i-2}, t_{i-1}, t_i) in a given training corpus to the frequency of the bigram (t_{i-2}, t_{i-1}) in the same corpus:
p(t_i \mid t_{i-2}, t_{i-1}) = \frac{f(t_{i-2}, t_{i-1}, t_i)}{f(t_{i-2}, t_{i-1})} \quad (3)
With a large tag set and a relatively small hand-tagged training corpus, formula (3) has an important disadvantage: the majority of transition probabilities cannot be estimated exactly because most of the possible trigrams (sequences of three consecutive tags) will not appear at all or only a few times.²</Paragraph>
Footnote 2: A detailed description of problems caused by small and zero frequencies was given by Gale and Church (1989).
<Paragraph position="5"> In our example we have a French training corpus of 10,000 words tagged with a set of 386 different tags, which could form 386³ = 57,512,456 different trigrams; but because of the corpus size no more than 10,000 − 2 = 9,998 trigrams can appear. Actually, their number was only 4,815, i.e. 0.008 % of all possible ones, because some of them appeared more than once (table 1).</Paragraph>
[Table 1: frequency range vs. number and percentage of trigrams in that range, for a training corpus of 10,000 words.]
<Paragraph position="6"> When we divide, e.g., a trigram frequency 1 by a bigram frequency 2 according to formula (3), we get the probability p = 0.5, but we cannot trust it to be exact because the frequencies it is based on are too small.</Paragraph>
<Paragraph position="7"> We can take advantage of the fact that the 386 tags are constituted by only 57 different fv-pairs concerning POS, gender, number, etc. If we consider probabilistic relations between single fv-pairs then we get higher frequencies (fig. 2) and the resulting probabilities are more exact.</Paragraph>
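To make the estimation in formula (3) and the sparseness problem concrete, the following minimal sketch (not from the paper; the toy tag sequence is invented for illustration) counts tag trigrams and bigrams and derives the transition probability from their ratio:

```python
from collections import Counter

def transition_probs(tags):
    """Estimate second-order transition probabilities p(t_i | t_{i-2}, t_{i-1})
    as trigram frequency divided by bigram frequency (formula 3)."""
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    bigrams = Counter(zip(tags, tags[1:]))
    return {(t2, t1, t0): f / bigrams[(t2, t1)]
            for (t2, t1, t0), f in trigrams.items()}

# Toy corpus with invented tags; with 386 real tags, most of the
# 386^3 possible trigrams never occur in a 10,000-word corpus.
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]
print(transition_probs(tags)[("DET", "NOUN", "VERB")])
# 0.5 -- a trigram count of 1 over a bigram count of 2, too sparse to trust
```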
<Paragraph position="8"> From the equations
\{t_i\} = \{e_{i,0}\} \cap \{e_{i,1}\} \cap \ldots \cap \{e_{i,n-1}\} = \bigcap_{k=0}^{n-1} \{e_{i,k}\} \quad (4)
where t_i means a tag and the e_{i,k} symbolize its fv-pairs, and
p(t_i \mid C_i) = p(e_{i,0} \cap e_{i,1} \cap \ldots \cap e_{i,n-1} \mid C_i) \quad (5)
where C_i denotes the context given by the two preceding tags, follows
p(t_i \mid C_i) = \prod_{k=0}^{n-1} p(e_{i,k} \mid C_{i,k}) \quad (6)
where C_{i,k} is the context of the single fv-pair e_{i,k}.</Paragraph>
<Paragraph position="10"> The latter formula³ describes the relation between the contextual probability of a tag and the contextual probabilities of its fv-pairs.</Paragraph>
<Paragraph position="11"> The unification of morphological features inside a noun phrase is accomplished indirectly. In a given context of fv-pairs the correct fv-pair obtains the probability p = 1 and therefore will not influence the probability of the tag to which it belongs (e.g. p(0num:SG | ...) = 1 in fig. 2). A wrong fv-pair would obtain p = 0 and make the whole tag impossible.</Paragraph>
Footnote 3: Suggested by Mats Rooth, IMS, Univ. Stuttgart, Germany.
</Section> <Section position="5" start_page="161" end_page="163" type="metho"> <SectionTitle> 3 TRAINING ALGORITHM </SectionTitle>
<Paragraph position="0"> In the training process we are not interested in analysing and storing the contextual probabilities (state transition probabilities) of whole tags but of single fv-pairs. We note them in terms of probabilistic feature relations (PFR):
PFR: \; ( e_i \mid C_i^{sub} \; ; \; p(e_i \mid C_i^{sub}) ) \quad (7)
which later, in the tagging process, will be combined in order to obtain the contextual tag probabilities.</Paragraph>
<Paragraph position="1"> The term e_i in formula (7) is an fv-pair. C_i^{sub} is a reduced context which contains only a subset of the fv-pairs of a really appearing context C_i (fig. 1). C_i^{sub} is obtained from C_i by eliminating all fv-pairs which do not influence the relative frequency of e_i, according to the condition:
p(e_i \mid C_i^{sub}) \approx p(e_i \mid C_i), \quad \text{i.e.} \quad |p(e_i \mid C_i) - p(e_i \mid C_i^{sub})| \le \varepsilon \quad (8)
The considered fv-pair has nearly⁴ the same probability in the complete and in the reduced context, i.e. C_i does not supply more information about the probability of e_i than C_i^{sub} does.</Paragraph>
[Figure 1: (a) the fv-pair 0gen:FEM within a complete context C_i of fv-pairs; (b) the more general reduced context C_i^{sub} remaining after elimination of the fv-pairs which do not influence the relative frequency of 0gen:FEM.]
<Paragraph position="7"> In the example (fig. 1a) we consider the fv-pair 0gen:FEM. Within the given training corpus, its probability in the complete context C_i, i.e. in the context of all the other fv-pairs of figure 1a, is p = 44/44 = 1 (cf. fig. 2).</Paragraph>
<Paragraph position="8"> The presence of 1num:SG in tag t_{i-1} does not influence the probability of 0gen:FEM in tag t_i. Therefore 1num:SG can be eliminated. Only fv-pairs which really have an influence remain in the context. The reduced context C_i^{sub} with fewer fv-pairs, which we obtain this way, is more general (fig. 1b).</Paragraph>
<Paragraph position="9"> In the given training corpus, the probability of 0gen:FEM in the context C_i^{sub} is p_0 = 170/174 = 0.977 (cf. p_0 in PFR_0 in fig. 2), which is near to p = 1. The reduced context C_i^{sub} is used to form a PFR which will be stored.</Paragraph>
Footnote 4: A small change in the probability caused by the elimination of fv-pairs from the context is admitted if it does not exceed a defined small percentage ε. (We used ε = 3%.)
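The context-reduction step can be illustrated with a rough sketch. This is not the paper's code: the observation format, the greedy elimination order and the helper names are assumptions; only the acceptance test of condition (8) with ε = 3% follows the text.

```python
def rel_freq(fv_pair, context, observations):
    """Relative frequency of `fv_pair` among the tags observed at positions
    whose context contains all fv-pairs of `context`.  Each observation is
    (context_fv_pairs, tag_fv_pairs), both sets of fv-pair strings."""
    matching = [tag for ctx, tag in observations if context <= ctx]
    if not matching:
        return 0.0
    return sum(fv_pair in tag for tag in matching) / len(matching)

def reduce_context(fv_pair, context, observations, eps=0.03):
    """Greedily drop fv-pairs from `context` while the probability of
    `fv_pair` stays within eps of its value in the complete context
    (condition 8; the paper uses eps = 3%)."""
    p_full = rel_freq(fv_pair, context, observations)
    reduced = set(context)
    for c in sorted(context):
        trial = reduced - {c}
        if trial and abs(rel_freq(fv_pair, trial, observations) - p_full) <= eps:
            reduced = trial  # c does not influence the relative frequency
    return frozenset(reduced), rel_freq(fv_pair, reduced, observations)

# A stored PFR would then be the triple (fv_pair, reduced_context, probability).
```

The greedy elimination order is an arbitrary choice here; the paper only states the acceptance condition, not the order in which fv-pairs are tried.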
<Paragraph position="11"> We see two advantages in the use of reduced contexts instead of complete ones: (1) A great number of complete contexts containing many fv-pairs can lead, after elimination of irrelevant fv-pairs, to the same PFR, which makes the number of all possible PFRs much smaller than the number of all possible trigrams (cf. sec. 2). (2) The probability of an fv-pair can be estimated more exactly in a reduced context than in a complete one because of the higher frequencies in the first case.</Paragraph>
The Generation of PFRs
<Paragraph position="12"> In the training process we first extract from a training corpus a set of trigrams where the tags are split up into their fv-pairs. From these trigrams a set of PFRs is generated separately for every fv-pair e_i. We examined four different methods for this procedure:</Paragraph>
<Paragraph position="13"> Methods 1-3: For every trigram we generate all possible subsets of its fv-pairs. Many trigrams, e.g. if they differ in only one fv-pair, have most of their subsets of fv-pairs in common. The complete trigrams and the subsets together constitute the set of contexts and subcontexts (C_i and C_i^{sub}) wherein an fv-pair could appear. To generate PFRs for a given fv-pair, we preselect and mark those (sub-)contexts which are supposed to have an influence on the contextual probability of the fv-pair. A (sub-)context will not be preselected if its frequency is smaller than a defined threshold. We use different ways for the preselection. Method 1: A (sub-)context is preselected for an fv-pair if any fv-pair belonging to the same feature type ever appears in this (sub-)context. E.g., if gen:MAS appears in a certain (sub-)context, then this (sub-)context will be preselected for gen:FEM too. Furthermore, it is possible to impose special conditions on the preselection, e.g. that a (sub-)context can only be preselected if it contains a POS feature in tag t_i and t_{i-1} (cf. fig. 1a: 0pos and 1pos).</Paragraph>
<Paragraph position="14"> Method 2: In order to preselect (sub-)contexts for an fv-pair, we generate a decision tree⁵ (Quinlan, 1983) where the feature of the fv-pair, e.g. gen, num etc., serves to classify all existing (sub-)contexts. E.g., num produces three classes of contexts: those containing the fv-pair 0num:SG, those with 0num:PL and those without a 0num feature. We assign to the tree nodes other features than the one upon which the classification is based. The root node is labeled with the feature from which we expect most information about the probability of the currently considered feature. The values of the root node feature are assigned to the branches starting at the root node. We continue the branching until there remain no features with an expected information gain and a frequency higher than defined thresholds. To every leaf of the tree corresponds a (sub-)context which will be marked and thus preselected for further analysis.</Paragraph>
Footnote 5: Suggested by Helmut Schmid, IMS, Univ. Stuttgart, Germany. For reasons of space we explain only how we employ decision trees for our purposes. For details about the automatic generation of such trees see Quinlan (1983).
<Paragraph position="15"> Method 3: For each fv-pair concerning POS we preselect every (sub-)context containing only POS features in tags t_{i-2} and t_{i-1} (classical POS trigram), e.g. 2pos:PREP 1pos:DET for 0pos:NOUN. For the other fv-pairs we mark every (sub-)context containing any fv-pair of the same type in the previous tag t_{i-1} and any POS features in tags t_{i-1} and t_i, e.g. 1pos:DET 1gen:FEM 0pos:NOUN for 0gen:FEM.</Paragraph>
<Paragraph position="16"> With the methods 1-3, we next eliminate from every preselected (sub-)context all fv-pairs which in the above described sense do not influence the relative frequency of the currently considered fv-pair (eq. 8).</Paragraph>
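As a rough illustration of how the (sub-)contexts shared by methods 1-3 could be enumerated and filtered by frequency, the sketch below generates all subsets of a trigram's fv-pairs and keeps those reaching a threshold; the data representation and the threshold value are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def subcontexts(context_fv_pairs):
    """All non-empty subsets of a complete context's fv-pairs (the complete
    context itself included).  Exponential in general, but a real context
    only holds the few fv-pairs of two tags."""
    items = sorted(context_fv_pairs)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def preselect(trigram_contexts, min_freq=5):
    """Count every (sub-)context over all training trigrams and keep those
    whose frequency reaches a threshold (threshold value assumed)."""
    counts = Counter()
    for ctx in trigram_contexts:  # ctx: frozenset of fv-pair strings
        for sub in subcontexts(ctx):
            counts[sub] += 1
    return {sub for sub, freq in counts.items() if freq >= min_freq}
```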
<Paragraph position="17"> Method 4: From the set of trigrams extracted from a training corpus we generate, separately for every fv-pair, a binary-branched decision tree which shall describe various contextual probabilities of this fv-pair. The tree is generated with a modified version of the ID3 algorithm (Quinlan, 1983) and is similar to the one described by Schmid (1994).</Paragraph>
<Paragraph position="18"> We start with a binary classification of all trigrams based on the considered fv-pair. E.g., a classification for gen:FEM will divide the set of trigrams into two subsets, one where the trigrams contain 0gen:FEM in the tag t_i and one where they do not.</Paragraph>
[Figure 3: Decision tree for 0gen:FEM. Every number is a probability of 0gen:FEM in the context described by the path from the root node to the node labeled with the number.]
<Paragraph position="19"> The tree is built up recursively (fig. 3). At each step, i.e. with the construction of each node, we test which one of the other fv-pairs delivers most information concerning the above-described classification. The current node will be labeled with this fv-pair. One of its two branches concerns the trigrams which contain the fv-pair, the other branch concerns the trigrams which do not contain it. The recursive expansion of the tree stops if either the information gained by consulting further fv-pairs or the frequencies upon which the calculus is based are smaller than defined thresholds.</Paragraph>
[Figure 2: Estimation of the contextual probability of a tag (state transition probability) using probabilistic feature relations (PFR); the product of the fv-pair probabilities is ∏ p_i ≈ 0.145. The position index at the beginning of every feature-value-pair indicates the tag to which it belongs; e.g. 0gen:FEM belongs to tag t_i and 2num:SG to t_{i-2}.]
</Section> <Section position="6" start_page="163" end_page="163" type="metho"> <SectionTitle> 4 TAGGING ALGORITHM </SectionTitle>
<Paragraph position="0"> Starting point for the implementation of a feature structure tagger was a second-order-HMM tagger (trigrams) based on a modified version of the Viterbi algorithm (Viterbi, 1967; Church, 1988) which we had earlier implemented in C (Kempe, 1994). There we replaced the function which estimated the contextual probability of a tag (state transition probability) by dividing a trigram frequency by a bigram frequency (eq. 3) with a function which accomplishes this calculus either using PFRs in the above-described way (eqs. 6, 7) or by consulting a decision tree (fig. 3).</Paragraph>
<Paragraph position="1"> To estimate the contextual probability of a tag we have to know the contextual probabilities of its fv-pairs in order to multiply them (eq. 6).</Paragraph>
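The following sketch shows one way this tagging-time computation of eq. (6) might look: the contextual probability of a tag is the product over its fv-pairs, each factor looked up among stored PFRs, with several matching PFRs averaged as the following paragraphs describe for methods 1 and 2. The PFR container, the subset-based lookup and the floor for unseen contexts are illustrative assumptions, not the paper's implementation.

```python
def fv_pair_prob(fv_pair, complete_context, pfrs):
    """Contextual probability of one fv-pair, looked up among stored PFRs.
    `pfrs` maps an fv-pair to a list of (reduced_context, probability).
    Every stored reduced context that is a subset of the complete context
    applies; if several apply, their probabilities are averaged."""
    applicable = [p for ctx, p in pfrs.get(fv_pair, []) if ctx <= complete_context]
    if not applicable:
        return 1e-6  # assumed floor for contexts with no stored PFR
    return sum(applicable) / len(applicable)

def tag_transition_prob(tag_fv_pairs, complete_context, pfrs):
    """Contextual (state transition) probability of a tag as the product of
    the contextual probabilities of its fv-pairs (eq. 6)."""
    p = 1.0
    for fv in tag_fv_pairs:
        p *= fv_pair_prob(fv, complete_context, pfrs)
    return p
```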
<Paragraph position="2"> Using PFRs generated by method 1 or 2, when e.g. looking for the probability p(0pos:ADJ | ...) from figure 2, we may find in the list of PFRs, instead of a PFR which would directly correspond (but is not stored), two PFRs whose reduced contexts both match. Both of them contain subsets of the fv-pairs of the required complete context and could therefore both be applied. In such a case we need to know how to combine their probabilities p_1 and p_2 in order to get p (cf. fig. 2).</Paragraph>
<Paragraph position="5"> As there exists no mathematical relation between these three probabilities, we simply average p_1 and p_2 to get p, because this gives as good tagging results as a number of other more complicated approaches which we examined.</Paragraph>
<Paragraph position="6"> PFRs generated by method 3 do not create this problem. For every complete context only one PFR is stored.</Paragraph>
<Paragraph position="7"> When we use the set of decision trees generated by method 4, we obtain for every fv-pair in every possible context only one probability by going down the relevant branches until a probability is reached.</Paragraph>
<Paragraph position="8"> In opposition to the PFRs of the other methods, the decision trees also contain negative information about the context of an fv-pair, i.e. not only which fv-pairs have to be in the context but also which ones must be absent.</Paragraph> </Section> </Paper>