<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1009"> <Title>Extracting Nested Collocations</Title> <Section position="3" start_page="0" end_page="42" type="metho"> <SectionTitle> 2 Collocations - The Problem </SectionTitle> <Paragraph position="0"> Collocations are pervasive in language: &quot;letters&quot; are &quot;delivered&quot;, &quot;tea&quot; is &quot;strong&quot; and not &quot;powerful&quot;, we &quot;run programs&quot;, and so on. Linguists have long been interested in collocations, and the definitions are numerous and varied. Some researchers include multi-element compounds as examples of collocations; some admit only collocations consisting of pairs of words, while others admit only collocations consisting of a maximum of five or six words; some emphasize syntagmatic aspects, others semantic aspects. The common points regarding collocations appear to be, as (Smadja, 1993) suggests: they are arbitrary (it is not clear why to &quot;Bill through&quot; means to &quot;fail&quot;), they are domain-dependent (&quot;interest rate&quot;, &quot;stock market&quot;), they are recurrent and cohesive lexical clusters: the presence of one of the</Paragraph> <Paragraph position="1"> collocates strongly suggests the rest of the collocation (&quot;United&quot; could imply &quot;States&quot; or &quot;Kingdom&quot;). Smadja classifies collocations into predicative relations, rigid noun phrases and phrasal templates.</Paragraph> <Paragraph position="2"> It is not the goal of this paper to provide yet another definition of collocation. We adopt as a working definition the one by (Sinclair, 1991): collocation is the occurrence of two or more words within a short space of each other in a text.</Paragraph> <Paragraph position="3"> Let us recall that collocations are domain-dependent. Sublanguages have remarkably high incidences of collocation (Ananiadou and McNaught, 1995).
(Frawley, 1988) neatly sums up the nature of sublanguage, showing the key contribution of collocation: sublanguage is strongly lexically based; sublanguage texts focus on content; lexical selection is syntactified in sublanguages; collocation plays a major role in sublanguage; sublanguages demonstrate elaborate lexical cohesion. The particular structures found in sublanguage texts reflect very closely the structuring of a sublanguage's associated conceptual domain. It is the particular syntactified combinations of words that reveal this structure. Since we work with sublanguages, we can use &quot;small&quot; corpora, which would not be the case if we were working with a general language corpus.</Paragraph> <Paragraph position="4"> In the Brown Corpus for example, which consists of one million words, there are only 2 occurrences of &quot;reading material&quot;, 2 of &quot;cups of coffee&quot;, 5 of &quot;for good&quot; and 7 of &quot;as always&quot; (Kjellmer, 1994). We extract uninterrupted and interrupted collocations. The interrupted ones are phrasal templates only, not predicative relations. We focus on the problem of the extraction of those collocations we call nested collocations. These collocations are at the same time substrings of other longer collocations. To make this clear, consider the following strings: &quot;New York Stock Exchange&quot;, &quot;York Stock&quot;, &quot;New York&quot; and &quot;Stock Exchange&quot;. Assume that the first string, being a collocation, is extracted by some method able to extract collocations of length two or more. Are the other three extracted as well? &quot;New York&quot; and &quot;Stock Exchange&quot; should be extracted, while &quot;York Stock&quot; should not. Though the examples here are from domain-specific lexical collocations, grammatical ones can be nested as well: &quot;put down as&quot;, &quot;put down for&quot;, &quot;put down to&quot; and &quot;put down&quot;.
(Smadja, 1993; Kita et al., 1994; Ikehara et al., 1995) mention substrings of collocations.</Paragraph> <Paragraph position="5"> Smadja's Xtract produces only the biggest possible n-grams. Ikehara et al. exclude the substrings of the retrieved collocations.</Paragraph> <Paragraph position="6"> A more precise approach to the problem is provided by (Kita et al., 1994). They extract a substring of a collocation if it appears a significant number of times by itself. The following example illustrates the problem and their approach: consider the strings a=&quot;in spite&quot; and b=&quot;in spite of&quot;, with n(a) and n(b) their numbers of occurrences in the corpus respectively. It will always be n(a) ≥ n(b), so whenever b is identified as a collocation, a is too. However, a should not be extracted as a collocation. So, they modify the measure of frequency of occurrence to become $K(a) = (|a| - 1)\,(n(a) - n(b))$ (1)</Paragraph> <Paragraph position="8"> where a is a word sequence, |a| is the length of a, n(a) is the number of occurrences of a in the corpus, b is every word sequence that contains a, and n(b) is the number of occurrences of b. As a result they do not extract the substrings of longer collocations unless they appear a significant number of times by themselves in the corpus. The problem is not solved. Table 2 gives the n-grams containing &quot;Wall Street&quot; that are extracted by Cost-Criteria. The corpus consists of 40,000 words of market reports. Only those n-grams of frequency</Paragraph> </Section> <Section position="4" start_page="42" end_page="43" type="metho"> <SectionTitle> 3 Our approach - The Algorithm </SectionTitle> <Paragraph position="0"> We call the extracted strings candidate collocations rather than collocations, since what we accept as collocations depends on the application.</Paragraph> <Paragraph position="1"> It is the human judge that will give the final decision.
This is the reason we consider the method semi-automatic.</Paragraph> <Paragraph position="2"> Let us consider the string &quot;New York Stock Exchange&quot;. Within this string, which has already been extracted as a candidate collocation, there are two substrings that should be extracted, and one that should not. The issue is how to distinguish when a substring of a candidate collocation is a candidate collocation, and when it is not. Kita et al.</Paragraph> <Paragraph position="3"> assume that the substring is a candidate collocation if it appears by itself (with a relatively high frequency). To this we add a second condition: the substring appears in more than one candidate collocation, even if it does not appear by itself.</Paragraph> <Paragraph position="4"> &quot;Wall Street&quot;, for example, appears 30 times in 6 longer candidate collocations, and 8 times by itself. If we considered only the number of times it appears by itself, it would get a low value as a candidate collocation. We have to consider the number of times it appears within longer candidate collocations. A second factor is the number of these longer collocations. The greater this number is, the better the string is distributed, and the greater its value as a candidate collocation. We make the above conditions more specific and give the measure for a string being a candidate collocation. The measure is called C-value and the factors involved are the string's frequency of occurrence in the corpus, its frequency of occurrence in longer candidate collocations, the number of these longer candidate collocations and its length. Regarding its length, we consider longer collocations to be &quot;more important&quot; than shorter ones appearing with the same frequency. More specifically, if |a| is the length of the string a, its C-value is proportional to |a| - 1.
The 1 is subtracted since the shortest collocations are of length 2, and we want them to have &quot;importance&quot; 2-1=1. (The notation is the same as in (Kita et al., 1994).)</Paragraph> <Paragraph position="5"> More specifically: 1. If a has the same frequency as a longer candidate collocation that contains a, it is assigned C-value(a)=0, i.e. it is not a collocation. In this case a clearly appears in only one longer candidate collocation. 2. If n(a) is the number of times a appears, and a is not a substring of an already extracted candidate collocation, then a is assigned $\text{C-value}(a) = (|a| - 1)\,n(a)$ (2) 3. If a appears as a substring in one or more collocations (not with the same frequency), then it is assigned $\text{C-value}(a) = (|a| - 1)\left(n(a) - \frac{t(a)}{c(a)}\right)$ (3) where t(a) is the total frequency of a in longer candidate collocations and c(a) the number of these candidate collocations. This is the most complicated case.</Paragraph> <Paragraph position="6"> The importance of the number of longer strings in which a string occurs is reflected in the denominator of the fraction in Equation 3. The bigger the number of strings a substring appears in, the smaller the fraction t(a)/c(a), and the bigger the C-value of the string.</Paragraph> <Paragraph position="7"> The algorithm for the extraction of the candidate collocations follows:
extract the n-grams
decide on the lowest frequency of collocations
remove the n-grams below this frequency
for all n-grams a of maximum length
    calculate $\text{C-value}(a) = (|a| - 1)\,n(a)$</Paragraph> <Paragraph position="9">
for all smaller n-grams a in descending order
    if (total frequency of a) = (frequency of a in a longer string)
        a is NOT a collocation
    else if a appears for the first time
        $\text{C-value}(a) = (|a| - 1)\,n(a)$
    else
        $\text{C-value}(a) = (|a| - 1)\left(n(a) - \frac{t(a)}{c(a)}\right)$</Paragraph> <Paragraph position="11"> The above algorithm computes the C-value of each string in an incremental way. That is, for each string a, we keep a tuple (n(a), t(a), c(a)) and we revise the t(a) and c(a) values.
For each n-gram b, every time it is found in a longer extracted n-gram a, the values t(b) and c(b) are revised: $t(b) = t(b) + n(a) - t(a)$, $c(b) = c(b) + 1$.</Paragraph> <Paragraph position="13"> In the initial stage, n(a) is set to the frequency of a appearing on its own, and t(a) and c(a) are set to 0. Let us calculate the C-value for the string &quot;Wall Street&quot;. Table 2 shows all the strings that appear more than twice, and that contain &quot;Wall Street&quot;. 1. We start with the longest string, the 7-gram &quot;Staff Reporter of The Wall Street Journal&quot;. Its C-value is calculated from Equation 2. For each substring contained in the 7-gram, the number 19 (the frequency of the 7-gram) is kept as its frequency of occurrence in longer strings so far. For each of them, the fact that it has already been found in a longer string is kept as well. Therefore, t(&quot;Wall Street&quot;)=19 and c(&quot;Wall Street&quot;)=1.</Paragraph> <Paragraph position="14"> 2. We continue with the two 6-grams. Both of them, &quot;Reporter of The Wall Street Journal&quot; and &quot;Staff Reporter of The Wall Street&quot;, get C-value=0 since they appear with the same frequency as the 7-gram that contains them. Therefore, they do not form candidate collocations and they do not change the t(&quot;Wall Street&quot;) and c(&quot;Wall Street&quot;) values.</Paragraph> <Paragraph position="15"> 3. For the 5-grams, there is one appearing with a frequency bigger than that of the 7-gram it is contained in, &quot;of The Wall Street Journal&quot;. This gets its C-value from Equation 3. Its substrings increase their frequency of occurrence (as substrings) by 20-19=1 (20 is the frequency of the 5-gram and 19 the frequency with which it appeared in longer candidate collocations), and their number of occurrences as a substring by 1. Therefore, t(&quot;Wall Street&quot;)=19+1=20 and c(&quot;Wall Street&quot;)=1+1=2. The other 5-gram is not a candidate collocation (it gets C-value=0).</Paragraph> <Paragraph position="16"> 4.
For the 4-grams, &quot;The Wall Street Journal&quot; occurs in two longer n-grams and therefore gets its C-value from Equation 3. From this string, t(&quot;Wall Street&quot;)=20+2=22 and c(&quot;Wall Street&quot;)=2+1=3. &quot;of The Wall Street&quot; is not accepted as a candidate collocation since it appears with the same frequency as &quot;of The Wall Street Journal&quot;.</Paragraph> <Paragraph position="17"> 5. &quot;Wall Street analysts&quot; appears for the first time, so it gets its C-value from Equation 2. &quot;Wall Street Journal&quot; and &quot;The Wall Street&quot;, appearing in longer extracted n-grams, get their values from Equation 3. They make t(&quot;Wall Street&quot;)=22+3+4+1=30 and c(&quot;Wall Street&quot;)=3+1+1+1=6.</Paragraph> <Paragraph position="18"> 6. Finally, we evaluate the C-value for &quot;Wall Street&quot; from Equation 3. With n(&quot;Wall Street&quot;)=30+8=38, we find C-value(&quot;Wall Street&quot;)=(2-1)(38-30/6)=33.</Paragraph> </Section></Paper>
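<!--
The worked example for "Wall Street" can be replayed in a few lines of Python. This is a minimal sketch, not the authors' implementation: the names NgramStats, c_value and record_nesting are hypothetical, and record_nesting takes the increment n(longer) - t(longer) directly, because the text reports only these increments for some of the longer strings. It assumes the three cases of the C-value definition and the incremental revision of t and c described in Section 3.

```python
from dataclasses import dataclass


@dataclass
class NgramStats:
    """Counts kept for one n-gram; field names follow the text's n, t and c."""
    length: int  # |a|: number of words in the string
    n: int       # n(a): total frequency of the string in the corpus
    t: int = 0   # t(a): frequency inside longer candidate collocations
    c: int = 0   # c(a): number of longer candidate collocations containing it


def c_value(s: NgramStats) -> float:
    """Return the C-value of an n-gram following cases 1 to 3 of the text."""
    if s.c == 0:
        # Case 2: not nested in any extracted candidate (Equation 2).
        return float((s.length - 1) * s.n)
    if s.t == s.n:
        # Case 1: every occurrence lies inside longer candidates, so reject.
        return 0.0
    # Case 3: nested, but also occurring elsewhere (Equation 3).
    return (s.length - 1) * (s.n - s.t / s.c)


def record_nesting(sub: NgramStats, increment: int) -> None:
    """Revise t and c of `sub` for one longer accepted candidate.

    `increment` is n(longer) - t(longer): the occurrences of the longer
    string not already counted through even longer candidates.
    """
    sub.t += increment
    sub.c += 1


# "Wall Street": 30 occurrences inside longer candidates plus 8 on its own.
wall_street = NgramStats(length=2, n=30 + 8)

# Increments reported in steps 1 to 5 of the worked example.
for increment in (19, 20 - 19, 2, 3, 4, 1):
    record_nesting(wall_street, increment)

print(wall_street.t, wall_street.c, c_value(wall_street))  # prints: 30 6 33.0
```

Starting from n=38, the replay reaches t=30 and c=6, which gives the C-value of 33 found in step 6. Rejected strings such as the two 6-grams are simply never passed to record_nesting, mirroring the statement that they do not change t and c.
-->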