File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/94/c94-2119_evalu.xml
Size: 10,002 bytes
Last Modified: 2025-10-06 14:00:13
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2119"> <Title>Generalizing Automatically Generated Selectional Patterns</Title> <Section position="5" start_page="743" end_page="746" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="743" end_page="744" type="sub_section"> <SectionTitle> 4.1 Evaluation Metric </SectionTitle> <Paragraph position="0"> We have previously \[5\] described two methods for the evaluation of semantic constraints. For tile current ex-periments, we have used one of these methods, where the constraints are evaluated against a set of manually classitied semantic triples. '~ For this (waluation, we select a small test corpus separate fl'om the training corpus. We parse the corpus, regularize the parses, and extract triples just as we did tbr the semantic acquisition phase. We then manually classify each triph&quot; as valid or invalid, depending on whether or not it arises fl'om the correct parse for the sentence. G We then estahlish a threshold 7' for the weighted triples counts in our training set, and deline 4 If v:l allows a hi'o,taler range of argulnents than w2, then we can replace w2 by vq, but llOIb giC(~ versa, For (':xanlple~ w(; can repla(:e &quot;speak&quot; (which takes a human subject) by &quot;sleep&quot; (which takes an animate subject), and still have a selectionally valid pattern, \])tit. not the other wety around.</Paragraph> <Paragraph position="1"> ~&quot;l'his is similar to tests conducted by Pcreira ct al. \[9\] and l)agan et al. \[2\]. The cited tests, howevcl', were based ,m selected words or word pairs of high frequency, whereas ore&quot; test sets involve a representative set of high and low frequency triples. stiffs is a different criterion fl'om the one used in our earlier papers. In our earlier work, we marked a triple as wdid if it could be valid in some sentence in the domain. We found that it was very (lilIicult to apply such a standard consistmltly, and have therefore changed to a criterion based on an individual sentence. vq numl',er of l.rilllcs in test set which were ('.lassitied as vMid and which a,F,l)em'ed iu training sct with count > &quot;/' V__ llllllll)or oV tril)lcs in tcsl, set which were classilicd as valid m,d which nl)pearc(I in training set with COIlI/I. < ~/' i I- lmn,I)er of tril)lcs in t,est set. which were classitlcd as inwdid and which al)peared ilt trahfing set with CO\[llti, > &quot;\[' and then delinc</Paragraph> <Paragraph position="3"> By wu'ying the l, hreshold, we can a~lcct dill\went trade-olfs ()f recall and precisioli (at high threshold, we seh~ct: only a small n,,mher of triph:s which apl)eared frequ(mtly and in which we l.hereforc have \]ligh conli-(h!nce, t;hus obtaining a high precision lm(, \]()w recall; conversely, at a h)w t, hrcshohl we adndt a uuuch larger nund)er of i.riplcs, obt,aiuiug ~ high recall but lower precisiol 0.</Paragraph> </Section> <Section position="2" start_page="744" end_page="744" type="sub_section"> <SectionTitle> 4.2 .t .s~ Data </SectionTitle> <Paragraph position="0"> The trai,fing and Icst corpora were taken from the Wall Street ,hmrnaJ. In order to get higher-quality parses or I,\]lcse ,q(ml;elices, we disahlcd some of the recovery mechanisms normally t>ed in our parser. Of the 57,366 scnte,lCCS hi our t,rMidng corpus, we ohtMned comph%e pars('s Ibr 34,414 and parses of initial substrings for an additional 12,441 s(mtenccs. These i)m'ses were th(m regularized aim reduced to t,riph~s. 
</Section> <Section position="2" start_page="744" end_page="744" type="sub_section"> <SectionTitle> 4.2 Test Data </SectionTitle> <Paragraph position="0"> The training and test corpora were taken from the Wall Street Journal. In order to get higher-quality parses of these sentences, we disabled some of the recovery mechanisms normally used in our parser. Of the 57,366 sentences in our training corpus, we obtained complete parses for 34,414 and parses of initial substrings for an additional 12,441 sentences. These parses were then regularized and reduced to triples. We generated a total of 279,233 distinct triples from the corpus.</Paragraph> <Paragraph position="1"> The test corpus used to generate the triples which were manually classified consisted of 10 articles, also from the Wall Street Journal, distinct from those in the training set. These articles produced a test set containing a total of 1932 triples, of which 1107 were valid and 825 were invalid.</Paragraph> </Section> <Section position="3" start_page="744" end_page="746" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <SectionTitle> 4.3.1 Growth with Corpus Size </SectionTitle> <Paragraph position="0"> We began by generating triples from the entire corpus and evaluating the selectional patterns as described above; the resulting recall/precision curve generated by varying the threshold is shown in Figure 1.</Paragraph> <Paragraph position="1"> To see how pattern coverage improves with corpus size, we divided our training corpus into 8 segments and computed sets of triples based on the first segment, the first two segments, etc. We show in Figure 2 a plot of recall vs. corpus size, both at a constant precision of 72% and for maximum recall regardless of precision.[footnote 7]

[Figure 2: Recall vs. corpus size (percentage of total corpus used); o = recall at 72% precision; * = maximum recall regardless of precision; x = predicted values for maximum recall.]

[Footnote 7] No data point is shown at 72% precision for the first segment alone because we are not able to reach a precision of 72% with a single segment.</Paragraph> <Paragraph position="2"> The rate of growth of the maximum recall can be understood in terms of the frequency distribution of triples. In our earlier work [4] we fit the growth data to curves of the form 1 − exp(−βx), on the assumption that all selectional patterns are equally likely. This may have been a roughly accurate assumption for that application, which involved semantic-class-based patterns (rather than word-based patterns) and a rather sharply circumscribed sublanguage (medical reports).</Paragraph> <Paragraph position="3"> For the (word-level) patterns described here, however, the distribution is quite skewed, with a small number of very-high-frequency patterns,[footnote 8] which results in different growth curves. Figure 3 plots the number of distinct triples per unit frequency, as a function of frequency, for the entire training corpus. This data can be very closely approximated by a function of the form N(f) = a·f^(−α), where α = 2.9.[footnote 9]

[Figure 3: Frequency distribution of triples in the training corpus; the vertical scale shows the number of triples with a given frequency.]

[Footnote 8] The number of high-frequency patterns is accentuated by the fact that our lexical scanner replaces all identifiable company names by the token a-company, all currency values by a-currency, etc. Many of the highest-frequency triples involve such tokens.</Paragraph> <Paragraph position="4"> To derive a growth curve for maximum recall, we will assume that the frequency distribution for triples selected at random follows the same form. Let p(T) represent the probability that a triple chosen at random is a particular triple T. Let P(p) be the density of triples with a given probability; i.e., the number of triples with probabilities between p and p + ε is ε·P(p) (for small ε). Then we are assuming that P(p) = a·p^(−α), for p ranging from some minimum probability p_min to 1. For a triple T, the probability that we would find at least one instance of it in a corpus of τ triples is approximately 1 − e^(−τ·p(T)). The maximum recall for a corpus of τ triples is the probability of a given triple (the "test triple") being selected at random, multiplied by the probability that that triple was found in the training corpus, summed over all triples:

Σ_T p(T) · (1 − e^(−τ·p(T)))

which can be computed using the density function:

∫_{p_min}^{1} p · P(p) · (1 − e^(−τ·p)) dp = ∫_{p_min}^{1} a · p^(1−α) · (1 − e^(−τ·p)) dp</Paragraph> <Paragraph position="5"> By selecting an appropriate value of a (and a corresponding p_min so that the total probability is 1), we can get a good match to the actual maximum recall values; these computed values are shown as x in Figure 2. Except for the smallest data set, the agreement is quite good considering the very simple assumptions made.</Paragraph>
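<Paragraph> The predicted maximum-recall values can be computed numerically along the following lines; this is only a sketch, using the fitted exponent α = 2.9 from above, while the choice of p_min, the integration grid, and the corpus sizes in the example are illustrative assumptions rather than values reported in the paper.

import numpy as np

ALPHA = 2.9   # exponent of the fitted frequency distribution N(f) = a * f**(-ALPHA)

def normalization_constant(p_min, alpha=ALPHA):
    """Choose a so that the total probability, the integral of p * a * p**(-alpha) dp
    over [p_min, 1], equals 1 (closed form, valid for alpha != 2)."""
    mass = (1.0 - p_min ** (2.0 - alpha)) / (2.0 - alpha)
    return 1.0 / mass

def predicted_max_recall(tau, p_min, alpha=ALPHA, n_points=20000):
    """Evaluate the integral of a * p**(1-alpha) * (1 - exp(-tau*p)) dp over [p_min, 1]
    by the trapezoidal rule on a log-spaced grid."""
    a = normalization_constant(p_min, alpha)
    p = np.logspace(np.log10(p_min), 0.0, n_points)
    integrand = a * p ** (1.0 - alpha) * (1.0 - np.exp(-tau * p))
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(p)))

# Illustrative use: predicted maximum recall for growing corpora, where tau is the
# number of triples in the training corpus and p_min is an arbitrary small value.
for tau in (35000, 70000, 140000, 280000):
    print(tau, round(predicted_max_recall(tau, p_min=1e-7), 3))

Fitting p_min (equivalently, a) to the observed maximum-recall points and then evaluating predicted_max_recall at the remaining corpus sizes corresponds to the comparison shown as x in Figure 2.</Paragraph>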
<Paragraph position="6"> In order to increase our coverage (recall), we then applied the smoothing procedure to the triples from our training corpus. In testing our procedure, we first generated the confusion matrix P_C and examined some of the entries. Figure 4 shows the largest entries in P_C for the noun "bond", a common word in the Wall Street Journal. It is clear that (with some odd exceptions) most of the words with high P_C values are semantically related to the original word.

[Figure 4: The largest confusion-matrix entries for "bond", ranked by P_C.]</Paragraph> <Paragraph position="7"> To evaluate the effectiveness of our smoothing procedure, we have plotted recall vs. precision graphs for both unsmoothed and smoothed frequency data. The results are shown in Figure 5. Over the range of precisions where the two curves overlap, the smoothed data performs better at low precision/high recall, whereas the unsmoothed data is better at high precision/low recall. In addition, smoothing substantially extends the level of recall which can be achieved for a given corpus size, although at some sacrifice in precision.</Paragraph> <Paragraph position="8"> Intuitively, we can understand why these curves should cross as they do. Smoothing introduces a certain degree of additional error. As is evident from Figure 4, some of the confusion-matrix entries are spurious, arising from such sources as incorrect parses and the conflation of word senses. In addition, some of the triples being generalized are themselves incorrect (note that even at a high threshold the precision is below 90%). The net result is that a portion (roughly 1/3 to 1/5) of the triples added by smoothing are incorrect. At low levels of precision, this produces a net gain on the precision/recall curve; at higher levels of precision, there is a net loss. In any event, smoothing does allow for substantially higher levels of recall than are possible without smoothing.</Paragraph> </Section> </Section> </Paper>