<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1009">
  <Title>Extracting Nested Collocations</Title>
  <Section position="3" start_page="0" end_page="42" type="metho">
    <SectionTitle>
2 Collocations - The Problem
</SectionTitle>
    <Paragraph position="0"> Collocations are pervasive in language: &quot;letters&quot; are &quot;delivered&quot;, &quot;tea&quot; is &quot;strong&quot; and not &quot;powerful&quot;, we &quot;run programs&quot;, and so on. Linguists have long been interested in collocations, and the definitions are numerous and varied. Some researchers include multi-element compounds as examples of collocations; some admit only collocations consisting of pairs of words, while others admit only collocations consisting of a maximum of five or six words; some emphasize syntagmatic aspects, others semantic aspects. The common points regarding collocations appear to be, as (Smadja, 1993) suggests: they are arbitrary (it is not clear why to &quot;fall through&quot; means to &quot;fail&quot;), they are domain-dependent (&quot;interest rate&quot;, &quot;stock market&quot;), and they are recurrent and cohesive lexical clusters: the presence of one of the collocates strongly suggests the rest of the collocation (&quot;United&quot; could imply &quot;States&quot; or &quot;Kingdom&quot;).</Paragraph>
    <Paragraph position="1"> Smadja also classifies collocations into predicative relations, rigid noun phrases and phrasal templates.</Paragraph>
    <Paragraph position="2"> It is not the goal of this paper to provide yet another definition of collocation. We adopt as a working definition the one by (Sinclair, 1991): collocation is the occurrence of two or more words within a short space of each other in a text.</Paragraph>
    <Paragraph position="3"> Let us recall that collocations are domain-dependent. Sublanguages have remarkably high incidences of collocation (Ananiadou and McNaught, 1995). (Frawley, 1988) neatly sums up the nature of sublanguage, showing the key contribution of collocation: sublanguage is strongly lexically based; sublanguage texts focus on content; lexical selection is syntactified in sublanguages; collocation plays a major role in sublanguage; sublanguages demonstrate elaborate lexical cohesion. The particular structures found in sublanguage texts reflect very closely the structuring of a sublanguage's associated conceptual domain. It is the particular syntactified combinations of words that reveal this structure. Since we work with sublanguages, we can use &quot;small&quot; corpora, in contrast to what a general-language corpus would require.</Paragraph>
    <Paragraph position="4"> In the Brown Corpus, for example, which consists of one million words, there are only 2 occurrences of &quot;reading material&quot;, 2 of &quot;cups of coffee&quot;, 5 of &quot;for good&quot; and 7 of &quot;as always&quot; (Kjellmer, 1994). We extract uninterrupted and interrupted collocations. The interrupted ones are phrasal templates only, not predicative relations. We focus on the problem of extracting those collocations we call nested collocations. These collocations are at the same time substrings of other, longer collocations. To make this clear, consider the following strings: &quot;New York Stock Exchange&quot;, &quot;York Stock&quot;, &quot;New York&quot; and &quot;Stock Exchange&quot;. Assume that the first string, being a collocation, is extracted by some method able to extract collocations of length two or more. Are the other three extracted as well? &quot;New York&quot; and &quot;Stock Exchange&quot; should be extracted, while &quot;York Stock&quot; should not. Though the examples here are from domain-specific lexical collocations, grammatical ones can be nested as well: &quot;put down as&quot;, &quot;put down for&quot;, &quot;put down to&quot; and &quot;put down&quot;. (Smadja, 1993; Kita et al., 1994; Ikehara et al., 1995) mention substrings of collocations.</Paragraph>
    <Paragraph position="5"> Smadja's Xtract produces only the biggest possible n-grams. Ikehara et al. exclude the substrings of the retrieved collocations.</Paragraph>
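    <Paragraph> The nesting discussed above can be made concrete by listing every shorter word sequence contained in a longer collocation. The following is only an illustrative sketch: the function name nested_ngrams and its min_len parameter are assumptions introduced here, not part of any cited system. For a four-word string it yields five nested strings (two 3-grams and three 2-grams), among them the three shorter strings discussed in the text.

```python
def nested_ngrams(words, min_len=2):
    """List every contiguous word sequence nested inside `words`.

    Sequences shorter than `min_len` words and the full sequence itself
    are left out. Longer sequences are listed first.
    """
    n = len(words)
    result = []
    for length in range(n - 1, min_len - 1, -1):
        for start in range(n - length + 1):
            result.append(" ".join(words[start:start + length]))
    return result


subs = nested_ngrams("New York Stock Exchange".split())
for s in subs:
    print(s)
```

    A method that keeps only the longest n-gram reports none of these strings; one that keeps every substring also reports &quot;York Stock&quot;. The question is how to tell the two kinds of substring apart.</Paragraph>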
    <Paragraph position="6"> A more precise approach to the problem is provided by (Kita et al., 1994). They extract a substring of a collocation if it appears a significant number of times by itself. The following example illustrates the problem and their approach: consider the strings a=&quot;in spite&quot; and b=&quot;in spite of&quot;, with n(a) and n(b) their respective numbers of occurrences in the corpus. It will always be n(a) ≥ n(b), so whenever b is identified as a collocation, a is too. However, a should not be extracted as a collocation. So, they modify the measure of frequency of occurrence to become</Paragraph>
    <Paragraph position="8"> where a is a word sequence; |a| is the length of a; n(a) is the number of occurrences of a in the corpus; b is every word sequence that contains a; n(b) is the number of occurrences of b. As a result, they do not extract the substrings of longer collocations unless they appear a significant number of times by themselves in the corpus. The problem is not solved. Table 2 gives the n-grams containing &quot;Wall Street&quot; that are extracted by Cost-Criteria. The corpus consists of 40,000 words of market reports. Only those n-grams of frequency</Paragraph>
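    <Paragraph> The observation that a substring is never less frequent than the string containing it can be checked on a toy text. This is a minimal sketch: the sentence and the helper count_ngrams are invented for illustration and are not taken from the corpus used in the paper.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-gram (as a tuple of words) in `tokens`."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical toy text, invented for illustration only.
tokens = "in spite of the rain and in spite of the cold we left".split()

n_a = count_ngrams(tokens, 2)[("in", "spite")]        # a = "in spite"
n_b = count_ngrams(tokens, 3)[("in", "spite", "of")]  # b = "in spite of"
print(n_a, n_b)  # 2 2
```

    Here the two counts are equal, so raw frequency alone gives no reason to prefer b over a, which is why a modified measure is needed.</Paragraph>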
  </Section>
  <Section position="4" start_page="42" end_page="43" type="metho">
    <SectionTitle>
3 Our approach - The Algorithm
</SectionTitle>
    <Paragraph position="0"> We call the extracted strings candidate collocations rather than collocations, since what we accept as collocations depends on the application.</Paragraph>
    <Paragraph position="1"> It is the human judge that will give the final decision. This is the reason we consider the method as semi-automatic.</Paragraph>
    <Paragraph position="2"> Let us consider the string &quot;New York Stock Exchange&quot;. Within this string, which has already been extracted as a candidate collocation, there are two substrings that should be extracted, and one that should not. The issue is how to distinguish when a substring of a candidate collocation is a candidate collocation, and when it is not.</Paragraph>
    <Paragraph position="3"> Kita et al. assume that the substring is a candidate collocation if it appears by itself (with a relatively high frequency). To this we add that the substring is also a candidate collocation if it appears in more than one candidate collocation, even if it does not appear by itself.</Paragraph>
    <Paragraph position="4"> &amp;quot;Wall Street&amp;quot;, for exalnple, appears 30 times in 6 longer candidate colh)cations, and 8 times by itself. If we considered only the number of times il; appears by itself, it would get a low value as a candidate collocation. We have to consider the number of tilnes it apI)ears within hmger candidate collocations. A second fa(:tor is tit(! number of these hmger collocations. The greater this numt)er is, the better the string is distribute.d, an(l the greater its value as a (:andi(late collocat;ion. We make the above (:onditions more spe(:iti(: and give the measure for a string being a candidate coll()cation. The measure is called C-value and the fa(&gt; tors involved are the string's frequency of o(:eurrence in the corpus, its fre(luen(:y of oe(:urrence in longer candidate collocations, the immber of these longer ('andidate (:ollocations and its length. Regar(ling its length, we (:onsider hmger collocations to t)e &amp;quot;more important&amp;quot; than shorter appearii~g with the same fi'equency. More specifically, if \]a\] is the length 2 of the string a, its C-value is analog()us to la I - 1. The 1 is giv(m sin('e the shortest collocations are of length 2, and we want them to be &amp;quot;of ilnportan(;e&amp;quot; 2-1= 1.</Paragraph>
    <Paragraph position="5"> More specifically:
      1. If a has the same frequency as a longer candidate collocation that contains a, it is assigned C-value(a) = 0, i.e. it is not a collocation. It follows that in this case a appears in only one longer candidate collocation.
      2. If n(a) is the number of times a appears, and a is not a substring of an already extracted candidate collocation, then a is assigned $\text{C-value}(a) = (|a| - 1)\, n(a)$ (2)
      3. If a appears as a substring in one or more collocations (not with the same frequency), then it is assigned $\text{C-value}(a) = (|a| - 1)\left(n(a) - \frac{t(a)}{c(a)}\right)$ (3) where t(a) is the total frequency of a in longer candidate collocations and c(a) the number of those candidate collocations. This is the most complicated case.
      We use the same notation as (Kita et al., 1994).</Paragraph>
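    <Paragraph> The three cases above can be sketched as a single function. This is an illustrative sketch, not the authors' implementation. It assumes that Equation 2 is (|a| - 1) n(a) and Equation 3 is (|a| - 1)(n(a) - t(a)/c(a)), and that n(a) counts all occurrences of a in the corpus. Under those assumptions the &quot;Wall Street&quot; figures quoted earlier (30 occurrences inside 6 longer candidates plus 8 by itself, so n = 38) give the value 33 reported in the worked example. The function name c_value and its parameter names are introduced here for illustration.

```python
def c_value(length, n, t=0, c=0):
    """C-value of a string a under the three cases described in the text.

    length : |a|, the number of words in a
    n      : n(a), the frequency of a in the corpus (assumed to be the total)
    t      : t(a), the frequency of a inside longer candidate collocations
    c      : c(a), the number of those longer candidate collocations
    """
    if c > 0 and t == n:
        # Case 1: a never occurs outside longer candidates.
        return 0.0
    if c == 0:
        # Case 2: a is not nested in any extracted candidate (assumed Eq. 2).
        return float((length - 1) * n)
    # Case 3: a is nested, but also occurs elsewhere (assumed Eq. 3).
    return (length - 1) * (n - t / c)


wall_street = c_value(length=2, n=30 + 8, t=30, c=6)
print(wall_street)  # 33.0
```

    Dividing t(a) by c(a) means that a string spread over many different longer candidates is penalized less than one tied to a single longer candidate.</Paragraph>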
    <Paragraph position="6"> The importance of the number of longer strings in which a string occurs is illustrated by the denominator of the fraction in Equation 3. The bigger the number of strings a substring appears in, the smaller the fraction t(a)/c(a), and the bigger the C-value of the string.</Paragraph>
    <Paragraph position="7"> The algorithm for the extraction of the candidate collocations follows:
      extract the n-grams
      decide on the lowest frequency of collocations
      remove the n-grams below this frequency
      for all n-grams a of maximum length</Paragraph>
    <Paragraph position="9"> for all smaller n-grams a in descending order
        if (total frequency of a) = (frequency of a in a longer string)
          a is NOT a collocation
        else if a appears for the first time</Paragraph>
    <Paragraph position="11"> The above algorithm computes the C-value of each string in an incremental way. That is, for each string a, we keep a tuple (n(a), t(a), c(a)) and we revise the t(a) and c(a) values. For each n-gram b, every time it is found in a longer extracted n-gram a, the values t(b) and c(b) are revised:</Paragraph>
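    <Paragraph> The incremental bookkeeping can be sketched as follows. This is a hedged reading of the procedure, not the authors' code. The update rule used here (each accepted n-gram a adds n(a) - t(a) to t(b) and 1 to c(b) for every substring b) is inferred from the worked example, where a 5-gram of frequency 20 that was already seen 19 times in a longer string contributes 20 - 19 = 1. The forms assumed for Equations 2 and 3 are (|a| - 1) n(a) and (|a| - 1)(n(a) - t(a)/c(a)). The function name extract_candidates and the frequency table are hypothetical.

```python
def extract_candidates(freq, min_freq=1):
    """Return {n-gram: C-value} for a table mapping word tuples to n(a)."""
    kept = {a: n for a, n in freq.items() if n >= min_freq}
    t = dict.fromkeys(kept, 0)   # t(a): frequency inside longer candidates
    c = dict.fromkeys(kept, 0)   # c(a): number of those longer candidates
    values = {}
    for a in sorted(kept, key=len, reverse=True):   # longest n-grams first
        n = kept[a]
        if c[a] > 0 and t[a] == n:
            values[a] = 0.0      # a never occurs outside longer candidates
            continue             # rejected strings do not update substrings
        if c[a] == 0:
            values[a] = float((len(a) - 1) * n)           # assumed Eq. 2
        else:
            values[a] = (len(a) - 1) * (n - t[a] / c[a])  # assumed Eq. 3
        gain = n - t[a]          # occurrences of a not yet credited
        subs = {a[i:i + k] for k in range(2, len(a))
                for i in range(len(a) - k + 1)}
        for b in subs:
            if b in kept:
                t[b] += gain
                c[b] += 1
    return values


# Hypothetical frequencies, invented for illustration only.
freq = {
    ("New", "York", "Stock", "Exchange"): 10,
    ("New", "York", "Stock"): 10,
    ("York", "Stock", "Exchange"): 10,
    ("New", "York"): 25,
    ("York", "Stock"): 10,
    ("Stock", "Exchange"): 14,
}
values = extract_candidates(freq)
for ngram, value in values.items():
    print(" ".join(ngram), value)
```

    With these invented counts, &quot;New York&quot; and &quot;Stock Exchange&quot; receive positive values while &quot;York Stock&quot; and both 3-grams receive 0, which is the behaviour the text asks for.</Paragraph>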
    <Paragraph position="13"> In the initial stage, n(a) is set to the frequency of a appearing on its own, and t(a) and c(a) are set to 0. Let us calculate the C-value for the string &quot;Wall Street&quot;. Table 2 shows all the strings that appear more than twice, and that contain &quot;Wall Street&quot;.
      1. We start with the 7-gram &quot;Staff Reporter of The Wall Street Journal&quot;. Its C-value is calculated from Equation 2. For each substring contained in the 7-gram, the number 19 (the frequency of the 7-gram) is kept as its frequency of occurrence in longer strings so far. For each of them, the fact that it has already been found in a longer string is kept as well. Therefore, t(&quot;Wall Street&quot;)=19 and c(&quot;Wall Street&quot;)=1.</Paragraph>
    <Paragraph position="14"> 2. We continue with the two 6-grams. Both of them, &quot;Reporter of The Wall Street Journal&quot; and &quot;Staff Reporter of The Wall Street&quot;, get C-value=0 since they appear with the same frequency as the 7-gram that contains them. Therefore, they do not form candidate collocations and they do not change the t(&quot;Wall Street&quot;) and c(&quot;Wall Street&quot;) values.</Paragraph>
    <Paragraph position="15"> 3. For the 5-grams, there is one appearing with a frequency bigger than that of the 7-gram it is contained in, &quot;of The Wall Street Journal&quot;. This gets its C-value from Equation 3. Its substrings increase their frequency of occurrence (as substrings) by 20 - 19 = 1 (20 is the frequency of the 5-gram and 19 the frequency with which it appeared in longer candidate collocations), and their number of occurrences as a substring by 1. Therefore, t(&quot;Wall Street&quot;)=19+1=20 and c(&quot;Wall Street&quot;)=1+1=2. The other 5-gram is not a candidate collocation (it gets C-value=0).</Paragraph>
    <Paragraph position="16"> 4. For the 4-grams, &quot;The Wall Street Journal&quot; occurs in two longer n-grams and therefore gets its C-value from Equation 3. From this string, t(&quot;Wall Street&quot;)=20+2=22 and c(&quot;Wall Street&quot;)=2+1=3. &quot;of The Wall Street&quot; is not accepted as a candidate collocation since it appears with the same frequency as &quot;of The Wall Street Journal&quot;.</Paragraph>
    <Paragraph position="17"> 5. &quot;Wall Street analysts&quot; appears for the first time, so it gets its C-value from Equation 2. &quot;Wall Street Journal&quot; and &quot;The Wall Street&quot;, appearing in longer extracted n-grams, get their values from Equation 3. They make t(&quot;Wall Street&quot;)=22+3+4+1=30 and c(&quot;Wall Street&quot;)=3+1+1+1=6.</Paragraph>
    <Paragraph position="18"> 6. Finally, we evaluate the C-value for &quot;Wall Street&quot; from Equation 3. We find C-value(&quot;Wall Street&quot;)=33.</Paragraph>
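    <Paragraph> The running totals in the walk-through can be replayed in a few lines. This sketch only re-adds the increments quoted in steps 1 to 5; it does not recompute them from a corpus. It assumes that Equation 3 has the form (|a| - 1)(n(a) - t(a)/c(a)) and that n(&quot;Wall Street&quot;) is 30 + 8 = 38, i.e. the occurrences inside longer candidates plus the occurrences by itself.

```python
# Each entry is the amount that one accepted longer string adds to
# t("Wall Street"), in the order the walk-through visits the strings:
# the 7-gram, one 5-gram, one 4-gram, and three 3-grams.
increments = [19, 1, 2, 3, 4, 1]

t = 0   # t("Wall Street"): frequency inside longer candidate collocations
c = 0   # c("Wall Street"): number of those longer candidate collocations
for gain in increments:
    t += gain
    c += 1

n = 30 + 8    # assumption: total frequency of "Wall Street" in the corpus
length = 2    # |a| for "Wall Street"
value = (length - 1) * (n - t / c)   # assumed form of Equation 3
print(t, c, value)  # 30 6 33.0
```

    The rejected strings (the two 6-grams, the other 5-gram and &quot;of The Wall Street&quot;) contribute nothing, which is why only six increments appear.</Paragraph>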
  </Section>
</Paper>