<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-1011">
  <Title>A SYSTEM FOR CREATING AND MANIPULATING GENERALIZED WORDCLASS TRANSITION MATRICES FROM LARGE LABELLED TEXT-CORPORA</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
A SYSTEM FOR CREATING AND
MANIPULATING GENERALIZED WORDCLASS
TRANSITION MATRICES FROM LARGE
LABELLED TEXT-CORPORA
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> This paper deals with the training phase of a Markov-type linguistic model that is based on transition probabilities between pairs and triplets of syntactic categories. To determine the optimal level of detail for a set of syntactic classes we developed a system that uses a set-theoretical formalism to define such sets and has some measures to compare and optimize them individually.</Paragraph>
    <Paragraph position="1"> In section two we describe the optimization problem (in terms of prediction, information and economy requirements) and our approach to its solution. Section three introduces the system that will assist a linguist in handling the prediction and economy criteria, and in the last section we present some sample results that can be achieved with it.</Paragraph>
    <Paragraph position="2"> 1. INTRODUCTION The context in which we started developing the system described in this paper is the ESPRIT project #860, 'Linguistic Analysis of the European Languages', which deals with seven European languages.</Paragraph>
    <Paragraph position="3"> The main objective of the project is to provide a language-independent software environment for dealing with the linguistic phase of a number of applications in the realm of office automation, such as high-quality, natural-sounding text-to-speech conversion for unlimited vocabularies, automatic speech recognition for large vocabularies, and omni-font optical character reading, including automatic reading of handwriting.</Paragraph>
    <Paragraph position="4"> The decision on what type of linguistic model to use in the project was made at an early stage. It was decided to aim at a probabilistic positional grammar (a Markov-type grammar) based on transition probabilities of pairs and triplets of syntactic categories. The use of Markov-type models immediately incurs the necessity of defining training texts. We started out with training corpora of approximately 100,000 words of official EEC publications that were available in all languages of the community. The training consists of building a number of data structures. The first is a lexicon of all words that occur in the text, with their attendant probability of occurrence and all possible wordclasses. The second structure is formed by two- and three-dimensional matrices describing the transition probabilities between pairs or triplets, respectively, of wordclasses. Clearly, the probabilities specified depend on the choice of syntactic categories along the dimensions. One of the major problems with a Markovian approach is to determine the optimal level of detail of the wordclasses for each dimension. In this paper we will describe a software system that helps linguists in carrying out experiments aimed at finding an 'optimal' system of wordclasses.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="50" type="metho">
    <SectionTitle>
2. MARKOW ANALYSIS OF LARGE CORPORA
AND WORDCLASS SYSTEMS
</SectionTitle>
    <Paragraph position="0"> The problem of finding a suitable wordclass set for statistical disambiguation of syntactic labelling may be formulated more precisely and formally as follows: Find a set of wordclass labels (with gross wordclass and complex information) that can label each word of a language and  1. is minimal in the number of labels (economy requirement) 2. provides high predictive power for adjacent wordclasses in a chain. A formal way to do this is by minimizing the average entropy of N-dimensional transition probabilities for subsequent labels in sentences, e.g. reduced to the two-dimensional case, to minimize:</Paragraph>
    <Paragraph position="2"> Legend: summation symbol; number of labels in the system (n); indices running from 1 to n; conditional probability of 'a' given 'b'. (prediction requirement) 3. is maximal in the amount of information about each labelled word, e.g. for syntactic analysis or disambiguation of alternative graphemic hypotheses. (information requirement) To find an exact solution to this problem is difficult, if not impossible, because of: the dimensionality of the optimization problem (given the large number of wordclasses needed to obtain useful parsing results); the difficulty of defining a unique starting set of wordclasses for an optimization; the dependence of a possible finite solution on the analysed corpus. Our approach to this problem is to start from a very detailed hierarchical wordclass system including complex information. The degree of detail can be reduced by means of the notion of &amp;quot;cover symbols&amp;quot; that form partitionings of the original system. Cover symbols and wordclasses not accounted for by cover symbols are called 'labels'. Initially, cover symbols will be created by combining wordclass symbols for related classes, e.g. the classes &amp;quot;verb, 1. person singular indicative present active&amp;quot; and &amp;quot;verb, 1. person singular conjunctive present active&amp;quot; giving a cover symbol &amp;quot;verb, 1. person singular present active&amp;quot;. At a later stage other cover symbols can be created by combining and excluding wordclass symbols and already existing cover symbols. In the optimization process different sets of labels are created subsequently and compared by measures related to either of the criteria mentioned.</Paragraph>
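    <Paragraph> The two-dimensional entropy criterion of the prediction requirement can plausibly be written out as follows. This is a sketch, not the authors' formula: the symbols H, p(i) and p(j | i) are assumptions introduced here, chosen only to agree with the legend (a summation, n labels, indices running from 1 to n, a conditional probability), and the weighting actually used may differ.

```latex
H = - \sum_{i=1}^{n} p(i) \sum_{j=1}^{n} p(j \mid i) \, \log_2 p(j \mid i)
```

    Under this reading p(i) is the relative frequency of label i and p(j | i) the probability that label j follows label i; terms with p(j | i) = 0 contribute nothing to the sum.</Paragraph>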
    <Paragraph position="3"> A user working in the optimization process needs measures to compare the significance of individual labels within a given set and to estimate the usefulness of joining labels into new, more comprehensive cover symbols. As one measure for criterion two we use the entropy directly in a global and diagnostic way. Additionally a number of measures have been defined that are related to entropy and give more specific information on the performance of individual labels. Given a text in which to each word a label has been assigned that is:  1. the basic wordclass, if this has not been defined as belonging to a cover symbol 2. the applicable cover symbol otherwise  and given a 2D-matrix that contains relative frequencies of transitions from any label (wordclass or cover symbol) to any other label in the text, then some useful measures are: the branching factor for a given label, that tells how many different labels actually followed/preceded it in an analysed text.</Paragraph>
    <Paragraph position="4"> the variance of the transition probabilities in a row/column of the matrix, that indicates how much the strength of connections from the label to surrounding labels varies as analysed from a text.</Paragraph>
    <Paragraph position="5"> the correlation between different rows/columns of the matrix, that gives information about how similarly the labels behave in a general right/left context, i.e. how much information will be lost by combining two labels into a new cover symbol.</Paragraph>
    <Paragraph position="6"> the relative frequency of a given label, that indicates the relative labelling relevance within a given system.</Paragraph>
    <Paragraph position="7"> The measures defined here for a 2D-matrix can be applied to a 3D-matrix in a similar way, e.g. the correlation between two labels in the same matrix dimension then means correlating the numbers of two planes.</Paragraph>
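    <Paragraph> The measures listed above can be illustrated on a small 2D count matrix. The following is a minimal sketch under stated assumptions: the label set, the counts and all function names are hypothetical, rows are read as "label followed by", and population variance and Pearson correlation are assumed as the concrete statistics, which the text does not specify.

```python
import math

# Hypothetical transition counts: COUNTS[a][b] = how often label b followed label a.
LABELS = ["D", "A", "N", "P"]
COUNTS = [
    [0, 30, 60, 0],   # successors of D
    [0, 5, 45, 0],    # successors of A
    [10, 0, 5, 35],   # successors of N
    [40, 5, 15, 0],   # successors of P
]


def row_probabilities(row):
    """Turn one row of counts into conditional probabilities p(next | label)."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]


def branching_factor(row):
    """Number of different labels that actually followed the row label."""
    return sum(1 for c in row if c > 0)


def variance(values):
    """Population variance of a list of numbers (assumed variant)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def correlation(xs, ys):
    """Pearson correlation of two rows; 0.0 if either row is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def relative_frequency(counts, index):
    """Share of all transitions that start at the given label."""
    grand_total = sum(sum(row) for row in counts)
    return sum(counts[index]) / grand_total


def row_entropy(row):
    """Entropy in bits of the successor distribution of one label."""
    return -sum(p * math.log2(p) for p in row_probabilities(row) if p > 0)


if __name__ == "__main__":
    for name, row in zip(LABELS, COUNTS):
        probs = row_probabilities(row)
        print(name, branching_factor(row), round(variance(probs), 4),
              round(row_entropy(row), 3))
```

    In this toy matrix the rows for D and A correlate strongly, which in the terms used above would make them candidates for a common cover symbol, subject to the other measures.</Paragraph>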
    <Paragraph position="8">  3. EMMA: AN EDITOR FOR MATRICES FROM MARKOV ANALYSIS In order to assist linguists in their task of designing an optimal set of wordclasses we designed a tool called EMMA: Editor for Matrices from Markov Analysis. The most important design considerations for implementing the system are:</Paragraph>
    <Paragraph position="10"> EMMA is split into two logical parts, though they are closely related. In the first part a user can create a set of cover symbols. A set-theoretical formalism has been defined for specifying cover symbols in a hierarchical way: recursively, sets of labels may be put into lists, then such lists may be excluded from other lists to specify the final set of wordclasses contained in a certain cover symbol (see appendix for notation). Different labels can be defined for each dimension (called &amp;quot;scope&amp;quot;) of a transition matrix separately, i.e. one can define a specific cover symbol only for e.g. the first position in a transition pair or triple. After a set of cover symbols has been defined, a consistency check is made to ensure that no wordclass symbol belongs to more than one cover symbol. A set of cover symbol definitions is called a &amp;quot;mapping&amp;quot;.</Paragraph>
    <Paragraph position="11"> A mapping has to be consistent but not necessarily complete, i.e. not every wordclass must belong to some cover symbol.</Paragraph>
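    <Paragraph> The consistency requirement on a mapping can be sketched as a simple ownership check. This is a minimal illustration, assuming a mapping is held as a dictionary from cover symbol names to sets of wordclass names; that representation and the names check_mapping and label_of are hypothetical and not taken from EMMA.

```python
def check_mapping(mapping):
    """Return the wordclasses that occur in more than one cover symbol.

    mapping: dict from cover symbol name to a set of wordclass names.
    An empty result means the mapping is consistent.
    """
    owner = {}
    conflicts = {}
    for cover, wordclasses in mapping.items():
        for wc in wordclasses:
            if wc in owner:
                # Remember every cover symbol that claims this wordclass.
                conflicts.setdefault(wc, {owner[wc]}).add(cover)
            else:
                owner[wc] = cover
    return conflicts


def label_of(wordclass, mapping):
    """Cover symbol for a wordclass, or the wordclass itself if uncovered."""
    for cover, wordclasses in mapping.items():
        if wordclass in wordclasses:
            return cover
    return wordclass


if __name__ == "__main__":
    # Hypothetical wordclass and cover symbol names.
    mapping = {"ZVERB": {"V01", "V02"}, "ZNOUN": {"N01"}}
    print(check_mapping(mapping))      # consistent: no conflicts reported
    print(label_of("P01", mapping))    # incomplete mapping: P01 stays a label
```

    Because a mapping need not be complete, label_of falls back to the wordclass itself, which matches the definition of 'labels' as cover symbols plus uncovered wordclasses.</Paragraph>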
    <Paragraph position="12"> Different sets of mappings can be merged together as long as they stay consistent. In the second part of the system a user can create and manipulate transition probability matrices with the help of a mapping. Matrices can be created from labelled text: in this case the system will subsume wordclasses in their respective cover symbols, and wordclasses not belonging to any cover symbol will extend the matrix. In this way the analysed text is not restricted with respect to the number of wordclasses. A second way to create matrices is from calculation on other matrices. Cover symbols can be defined interactively, and the new matrix belonging to the new mapping can be computed.</Paragraph>
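    <Paragraph> Creating a pair matrix from labelled text with the help of a mapping might look roughly like this. It is a sketch only: the function names, the dictionary form of the mapping and the sample tags are assumptions, and sentence boundaries are ignored for brevity.

```python
from collections import Counter


def build_pair_matrix(tagged_words, cover_of):
    """Count label-to-label transitions in a sequence of wordclass tags.

    tagged_words: list of wordclass names in text order.
    cover_of: dict from wordclass to its cover symbol (may be incomplete).
    Wordclasses without a cover symbol keep their own name as label,
    so they simply add new rows and columns to the matrix.
    """
    labels = [cover_of.get(wc, wc) for wc in tagged_words]
    return Counter(zip(labels, labels[1:]))


def relative_frequencies(counts):
    """Divide every transition count by the total number of transitions."""
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical tags and cover symbols.
    tags = ["D01", "A01", "N01", "P01", "D02", "N02"]
    cover_of = {"D01": "ZDET", "D02": "ZDET", "N01": "ZNOUN", "N02": "ZNOUN"}
    matrix = build_pair_matrix(tags, cover_of)
    for pair, share in relative_frequencies(matrix).items():
        print(pair, share)
```

    Only transitions that actually occur are stored, which fits the observation that such matrices are sparse.</Paragraph>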
    <Paragraph position="13"> To handle these matrices a data structure has been designed, based on the sparseness of the matrices. It fulfils two requirements: it is sufficiently fast for retrieval of data in an interactive environment and it can manipulate extremely large matrices (largest so far 750 x 750 x 750).</Paragraph>
    <Paragraph position="14"> [...] cover symbols and matrices in addition to the computation of the measures related to entropy. For these purposes the system includes a powerful mechanism to access matrices and related mappings for analysis and editing.</Paragraph>
    <Paragraph position="15"> One may take a number of labels from a dimension of a matrix, declare them a set with a new name and define a submatrix by specifying such sets in the different dimensions. This submatrix may then be processed selectively by display, statistics, change and quantization procedures.</Paragraph>
    <Paragraph position="16"> In the statistics part, information on sparseness and on the highest and lowest transition probabilities in matrices or submatrices may be gathered. Correlations of transition frequencies between labels may be calculated for a certain [...] range of [...] only.</Paragraph>
    <Paragraph position="17"> List, change and quantize commands may be specified for a numerical range of frequencies in the submatrix. This ensures that one may access certain &amp;quot;frequency layers&amp;quot; in the matrix, which is an essential operation for viewing very large matrices with only a few percent of the entries non-zero.</Paragraph>
    <Paragraph position="18"> If a user eventually finds that the labels in one dimension of a submatrix could be included into a new cover symbol, he/she may specify this directly and the overall matrix together with its mapping will be transformed into a new, smaller one.</Paragraph>
    <Paragraph position="19"> Different matrices may be merged as long as the related mappings are compatible in an analytic sense: cover symbols in one mapping must be either disjunct from the ones in the other mapping or in subset relation.</Paragraph>
  </Section>
  <Section position="4" start_page="50" end_page="52" type="metho">
    <SectionTitle>
4. SOME EXAMPLE RESULTS
</SectionTitle>
    <Paragraph position="0"> The partners within the consortium have just started the development of the optimal wordclass systems. Therefore, in this paper we will restrict ourselves to the presentation of a small number of examples that should convey the flavour of the kind of information that can be derived with the system.</Paragraph>
    <Paragraph position="1"> The data in the examples are derived from an [...] text in German ([?]0,000 words) and the same text in Dutch (100,000 words) labelled with the ESPRIT-860 wordclass system (ca. 250 wordclasses for German and 104 for Dutch were actually used). The symbols used in the examples can be interpreted as: 'P': preposition, 'D': determiner, 'N': noun, 'A': adjective, 'C': conjunction, 'B': [...].  If a user works on a 3D-matrix with the matrix editor and considers inclusion of all conjunctions into one cover symbol in the first scope, but wants to leave the most frequent labels out, he/she will look e.g. at a part of the matrix by a command</Paragraph>
    <Paragraph position="3"> which will give a display of only those parts of the matrix where a conjunction stands in the first position of the Markov chain.</Paragraph>
    <Paragraph position="4"> Let us assume that the most frequent labels are C00#######, C02..##### and all labels C01 but without C01..#####; then he/she could define the cover symbol 'ZCON' for scope 1 in the following way:</Paragraph>
    <Paragraph position="6"> with: '()' the list operator, '!' the exception operator, '_ZCEX' a local name. With the help of this new cover symbol we can transform the matrix accordingly.</Paragraph>
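    <Paragraph> How a definition built from the list and exception operators resolves to a set of wordclasses can be sketched as follows. The nested-tuple form of the expressions, the function name expand and the wordclass names C1 to C4 are hypothetical; this is not the EMMA file syntax, and the sketch assumes non-recursive definitions.

```python
def expand(expr, definitions):
    """Evaluate a cover symbol expression to a set of wordclass names.

    expr is one of (an assumed in-memory form):
      - a string: a wordclass, or the name of an entry in `definitions`
      - ("list", e1, e2, ...): the union of the sub-expressions
      - ("except", base, excluded): everything in base that is not in excluded
    """
    if isinstance(expr, str):
        if expr in definitions:
            return expand(definitions[expr], definitions)
        return {expr}
    op, *args = expr
    if op == "list":
        result = set()
        for arg in args:
            result |= expand(arg, definitions)
        return result
    if op == "except":
        base, excluded = args
        return expand(base, definitions) - expand(excluded, definitions)
    raise ValueError("unknown operator: " + str(op))


if __name__ == "__main__":
    definitions = {
        # Local helper name: the frequent conjunction classes to leave out.
        "_ZCEX": ("list", "C1", "C2"),
        # All conjunction classes except the ones in _ZCEX.
        "ZCON": ("except", ("list", "C1", "C2", "C3", "C4"), "_ZCEX"),
    }
    print(sorted(expand("ZCON", definitions)))
```

    Keeping the excluded classes under a local name mirrors the role of '_ZCEX' in the example: it is used inside another definition but does not itself become a label.</Paragraph>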
    <Paragraph position="7">  This is the well-known determiner-adjective-noun phrase and the preposition-determiner-noun phrase. The numbers indicate the frequency with which the triples occur in the training text.</Paragraph>
    <Paragraph position="8">  The very low standard deviation of the label A17.....## casts considerable doubt upon its significance; it will probably be included into a cover symbol. The label C00#######, on the other hand, will probably deserve to be given a class of its own.</Paragraph>
    <Paragraph position="9">  The labels M02####### and B02####### have a high correlation and are therefore candidates to be put into the same cover symbol. But before doing this one has to determine the significance of such an operation by checking the standard deviation, branching factor and the relative frequency. Also the third criterion as defined in section two has to be taken into account.</Paragraph>
    <Paragraph position="10">  This table has been derived from the Dutch corpus after definition of cover symbols for the main word classes. The entropies of these cover symbols are low compared to the maximum we encountered. Certainly this set of cover symbols is too small to fulfil the information requirement for e.g.  disambiguation of alternative graphemic forms. Definitions are not allowed to be directly or indirectly recursive.</Paragraph>
  </Section>
  <Section position="5" start_page="52" end_page="52" type="metho">
    <SectionTitle>
APPENDIX I: SYNTAX OF COVER SYMBOL
DEFINITIONS
</SectionTitle>
    <Paragraph position="0"> The grammar is in BN-form, where: '[ ]' means optionality, '|' alternative, '&lt;' and '&gt;' nonterminal; informal descriptions are between double quotes.</Paragraph>
  </Section>
  <Section position="6" start_page="52" end_page="52" type="metho">
    <SectionTitle>
SET
</SectionTitle>
    <Paragraph position="0"> Cover symbols used in the map can only be excluded from other cover symbols (not included, otherwise the mapping would be inconsistent). This gives the constraint use of cover symbol notations within a cover symbol definition. E.g. in an expression Z1 = &lt;exp1&gt;!(&lt;exp2&gt;!&lt;exp3&gt;), the cover symbol set becomes inconsistent if another cover symbol Z2 occurs included in &lt;exp1&gt; or &lt;exp3&gt;. Cover symbols occurring on the right side of a definition must be defined in the same file.</Paragraph>
    <Paragraph position="1">  &amp;quot;valid cover symbol notation&amp;quot; &amp;quot;valid wordclass symbol notation&amp;quot; &amp;quot;constraint use of CS-notation&amp;quot; In order to support order in the cover symbol definitions, cover symbols that are to be included into other cover symbols (i.e. they have only auxiliary function, but will not occur in a map) are notated differently from cover symbols that will occur in a map: auxiliaries have a name preceded by a [...]. Additional notations are used in a textual definition to specify the scope for subsequently defined cover symbols. Cover symbol definition files may include other cover symbol definition files by a C-like &amp;quot;#include&amp;quot; command, with the following constraints:</Paragraph>
  </Section>
  <Section position="7" start_page="52" end_page="55" type="metho">
    <SectionTitle>
INFORMATION FLOW IN THE EMMA MARKOW ANALYSIS SYSTEM
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
</Paper>