<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1100">
  <Title>BUILDING A LEXICAL DOMAIN MAP FROM TEXT CORPORA</Title>
  <Section position="4" start_page="604" end_page="604" type="metho">
    <SectionTitle>
HEAD-MODIFIER STRUCTURES
</SectionTitle>
    <Paragraph position="0"> TTP p,'u'se structures are p~tssed to the phrase extraction module where head+modifier (including predicate+,'u'gument) pairs are extracted and collected into occurrence patterns. The following types of head+modifier pairs m'e extracted:  (1) a head noun and its left adjective or noun adjunct, (2) a head noun ,and the head of its right adjunct, (3) the m,'fin verb of a clanse and the head of its object pbrase.</Paragraph>
    <Paragraph position="1">  These types of p,'firs account for most of the syntactic vm'i~mts for relating two words (or simple phrases) into pairs c,'urying compatible semantic content. For example, the pair retrieve+information will be extracted from mty of the following fragments: information retrieval system; retrieval of it~rmation /)'om databases; and information that can be retrieved by a user-controlled interactive search process. 3 Figure 1 shows TTP parse and head+modifier pairs extracted. Whenever multiple-noun strings (two nouns plus another noun or adjective) are present, they need Io be structurally disambiguated before any pairs emt be extracted. This is accomplished using statistically-based preferences, e.g., world+third is pt'etizn'ed to either country+world or cot#ltry+third when extracted from third world country. If such preferences cannot be contputed, ,all alternatives ,'u'e discarded to avokl noisy input to clustering progrmns.</Paragraph>
    <Paragraph position="2"> \[S;m Jose Mercury News {}8/30/91 Busilmss Sectlonl For McCaw, it wouhl have hurt the company's stralegy of building a seamless national cellular ilelWolk.</Paragraph>
    <Paragraph position="4"> bui ld+nelwork net work+cclhJlar uet wet k+llali'ollal he|work+seamless F'tgnre 1. Extracting Ilead+Modilier pairs from parsed sentences.</Paragraph>
  </Section>
  <Section position="5" start_page="604" end_page="605" type="metho">
    <SectionTitle>
TERM CORRELATIONS FROM TEXT
</SectionTitle>
    <Paragraph position="0"> Head-modifier pairs serve as occurrence contexts for terms included in them: both single words (as shown in Fignre 1) and other pairs (in case of nested pairs, e.g., cottntry+\[world+third\]). If two terms tend to be modilicd with a number of common modifiers but otherwise appear in few distinct contexts, we assign them a simih'uity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characlerislics for both terms within the corpus: in general we will credit high-content terms appem'ing in multiple identical elmtexts, provided that these contexts are not too commonplace. 4 Figure 2 shows exmnples of terms sharing a number of common contexts along with frequencies of occurrence in a 250 MByte subset of Wall Street Journal database. A head context is when two distinct modifiers ,are attached to the same head element; a rood context is when the s,'une term modilles two distinct heads.</Paragraph>
    <Paragraph position="1"> To compute term similarities we used a variant of weighted Jacc\[u'd's measure described in e.g., (Grefen null chaiml,'m l(X)7 146 director 6 158 minisler 37 17 premier 7 8 sloly 9 3 chib fi 4 age 18 3 mother 4 5 bad 4 4 yotmg 258 12 ohler 18 ,I li'tgure 2. L:xample pairs of related re.ms.  language and logarithm on the basis of their co-occurrence with mztural. stette, 1992): 5 In another series of exf, orinmt~ts (Swzalkowskl &amp; Vauthey, 1992} we used a Mtllnal lnfo0maliou I}ased classillcalion formula (e.g., Church and ttanks, 1990; lliudle, 1990), but we l~,}und it less effeclive for diverse dalabases, such as WSJ.</Paragraph>
    <Paragraph position="3"> In rite above, f~,y stands for absolute fi'equency of pair \[x,y\] in tile corpus, ny is the frequency of term y, and N is ttte number of single-word terms.</Paragraph>
    <Paragraph position="4"> hi order to generate better sitnilarities, we require that words xt and x2 appear in at least M distinct conlilion contexts, where It common context is a couple of pairs \[xt,Y\] and \[x2,y\], or \[y,x 1\] and \[y,r 2\] such that they each occun'ed at legist K times. Thus, banana and Baltic will not be considered for similm-ity relation on the basis of tlteir occurrences in the common context of republic, no matter how frequent, unless there are M-1 other such common contexts comparably frequent (there wasn't any in TREC's WSJ database). For smaller or narrow domain databases M=2 is usually sufficient, e.g., CACM d:ltab:t,~e of computer science abstracts. For large databases covering a diverse subject matter, like WSJ or SJMN (S,'m Jose Mercury News), we used M&gt;_5. 6 This, however, turned out not to be sufficient. We would still genemle faMy strong simih'u'ity links between terms such as aerospace mid pharmaceutical where 6 and more comlnon contexts were found, even after a number of comlnon contexts, such ,'is company or market, have already been rejected because they were paired with too msmy different words, and thus had a dispersion ratio too high. The remaining common contexts m'e listed in Figure 3, ~dong with their GEW scores, all occurring at the head (left) position of a pair.</Paragraph>
  </Section>
  <Section position="6" start_page="605" end_page="606" type="metho">
    <SectionTitle>
6 For example banana and Dominican were found to have two
</SectionTitle>
    <Paragraph position="0"> common contexts: republic and plant, although tiffs second occurred in apparently different senses in Dominican plant and banatla ptattt.</Paragraph>
    <Paragraph position="1"> When analyzing Figure 3, we should note that while some of the GEW weights are quite low (GEW takes values between 0 and 1), thus indicating a low ilnportance context, the frequencies with which these contexts occurred wilh both ter,ns were high and balanced on both sides (e.g., concern), thus adding to tile slrength of association. To liher out such casts we established thresholds for adlnissible values of GEW factor, and disre-Du'ded contexts with entropy weights falling below the threshold. In the most recent experiments with WSJ texts, we found that 0.6 is a good threshold. We also observed that clustering bead terms using their moditiers as contexts converges faster and gives generally ntore reliable links thai\] when rood terms are clustered using heads as context (e.g., in the above example). In onr experiment with tile WSJ database, we fotmd that an occurrence of a common head context needs to be considered Its eoulribttting less to the total context cotint than an occurrence of a common rood context: we used 0.6 and l, respectively. Using this formtda, terms man and boy in Figure 2 share 5.4 contexts (4 head contexts and 3 rood contexts).</Paragraph>
    <Paragraph position="2"> hlilially, term similmities are organized into clusters around a centmid term. Figure 4 shows top 10 elements (sorted by similarity wflue) of tile chister for president. Note that in this case lhe SIM value drops suddenly after the second element of the cluster. Changes in SIM vahle are nsed to deternline cut-off points for clusters. Tile role of GTS factor will be explgfined later. Sample clusters obtained fi'om approx. 250 MByte (42 million words) snbset of WSJ (years 1990-1992) are given in  It may be worth pointing out that the similarities arc calculated ilsing term co-occurrences in syntaclic rather than in document-size contexts, the latter Ix:ing the usual practice it1 non-linguistic chlstering (e.g., Sparck Jones and Batlx:r, 1971; Crouch, 1988; Lewis and Croft, 1990).</Paragraph>
    <Paragraph position="3"> Although the two methods of te,'m clustering inay be COllsidered mntttally complementary in certaitt situations, we believe that more and slrouger associations can be obtained tllrough syntactic-context chlstering, given suflicient alnonnt of data and a reasonably accnralc syu- null earnings prffit, revemfe, income portfolio asset, invest, loan inflate growth, deniand, earnings ituhtstry business, eompatly, market growth increase, rise, gain firm bank, concern, group, tlnit environ climate, condition, siluation debt loan, sectire, botld lawyer attorney COltnsel attorney, administrator, secretary conlpule mac\]llne, software, eqtlO~ment competitor riwll, competition, bayer alliance i~artnersIiOl, veotnre, eoosortiunl big ktrge, major, bu.ee, significaot  fight battle, attack, war, cl allet ge base facile, source, reserve, stqqu~rt shareholder creditor, customer, client investor, stockhohler merge, bay-out, acquire, bM compensate, aid, espense cash, fitnd, money personnel, emfloyee,foree hire, draw, woo crucial, difficult, critical rtimor, tlncertainty, tension director, chairman deputy fi)recast, t~rospect, trend rule, policy, leg&amp;late, bill Tahle 1. Selected chlsters &amp;taiued fronl syntat:lic contexts, derived from approx. 40 millio~l words of WSJ tcxl, wiih weighted Jaceaid formula. null tactic parser\] ? Nell-syntactic contexts cross sentetlce lmundaries with no fuss, which is hell)ful with shorl, succinct documents (such as CACM abstracts), but less so wilh longer texls; see also (Grlshmali el al., 1986).</Paragraph>
  </Section>
  <Section position="7" start_page="606" end_page="607" type="metho">
    <SectionTitle>
QUERY EXPANSION
</SectionTitle>
    <Paragraph position="0"> Sitnilltl'ity rdaiions are t,sed to expand user queries with new lernts, lit an &amp;quot;tttelnpt to make tile tinal Seluch tiuery more colnprehensive (adding synonytlis) and/or more pointed (adding specializalions). 11 follows that not all similiu'ily relatiolls will be equally useful ill query expansion, liar instance, eomplemelltary anti aitlonymous relaliolts like Ihe one between Australian and Catladitl#l, ftCCel;t aild rejecl, or even gelieralizaliOilS like Iroill (1PS'1&amp;quot;0X13(IC( ~ tO industry may actually hllrin systeln's perlornialice, Siliee we Iliay end till retrieviiig many h'relevaill documenls. On the olher hand, dalal)ase search is likely to miss relewtill doctlnlenls if we overlook the fact that vh:e director Call also be depety dit+et?lor, of that ltlkt'ov('r cgln also be merge, buy-ottl, or acqtdsition. We noled that an average SOl of similarities generated from it lexl corpus conlahis abotit as many &amp;quot;good&amp;quot; relations (synottylny, specializalion) as &amp;quot;lind&amp;quot; rclaliolts {anlonyiny, conipleinorllalion, generalizalion), as seen froin the query exp;lliSiOli viewpoinl. Therefore aiiy alleinpt Io sepai~ile these two classes alid 1o hlerease Ihe proporlion o1 &amp;quot;good&amp;quot; relalions shotlld result in improved relrieval. Tills has hldeed heell tJonlirined in our experinlenls where a relalively crlide filler has visibly hlcreased reiriewil precision.</Paragraph>
    <Paragraph position="1"> hi order It) creale an appropriate liller, we devised a global lerm speciliciiy ineasiiro ((ITS) whidl is calculated for each lerili across all conicxis iii which ii occiirs. The general philosophy here is thal ti niore specilic word/phrase WOllld h/lYe 11 iilore Iillliled use, i.e., a illOle specilic term wotild appear iit fewer distinct contexts, hi this respecl, GTS is similar it) tile standard ire'erred tlOClimet# fi'eqttetu 7 (idJ) measure excepl lhal lerni frequency ix iltt3aStlie(l over syntactic tlililS Iather Ihall doctllllenl size unils. TenliS with higher GTS vahies are generally coilsidered more specilic, but the specificily compa,'isotl is only meanillgful for terms which are already kllown to be similar. We bdieve that measuring lerm specilicily over doeumelli-size contexts (e.g., Sparck Jones, 1972) ,nay iiot fie appropriale iii this case. In particular, synllax-based contexts allow for processint~ lexls without any inlernal doctinlenl slriiclllre, The new function is calculaled according to the folhtwing forltiill'i:</Paragraph>
    <Paragraph position="3"> In the ahove, dw is di.~7)ersion of lerm w mlderslood as Ihe mmd~er of distinct COlltexls in which w is found. For any two ternls W 1 alld w2, all(l it constant ~1 &gt; 1, ir (77&amp;quot;S (w2) _&gt; 8t * (;TF (w 1) then w 2 is considered more speciiic lhall w 1 . hi addition, if SlM,,o,.,n(Wl,W2)=fI&gt; 01, where 01 is an elrli)irically  established threshold, then w2 c,'m be added to the query containing term w t with weight ~*to, 8 where co is the weight w2 would have if it were present in the query.</Paragraph>
    <Paragraph position="5"> we may consider w~ as synonymous to w~. All other relations ,are discarded. For example, the following were obtained from the WSJ training database:</Paragraph>
    <Paragraph position="7"> Therefore both takeover and buy-out can be used to specialize merge or acquire. With this filter, the relationships between takeover ~md buy-out and between merge ~md acquire ,are either both discarded or accepted as synonymous. At this time we are unable to tell synonymous or ne,'u&amp;quot; synonymous relationships from those which ,are prim,wily complement~u-y, e.g., matt ,and womatt.</Paragraph>
    <Paragraph position="8"> Filtered simih'u'ity relations create a domain map of terms. At present it may cont~fin only two types of links: equiv,'dence (synonymy and near-synonymy) ,and subsumption (specification). Figure 5 shows a small fragment of such map derived from lexic,-d relation computed from WSJ datab`ase. The domain map is used to expand user queries with related terms, either automatically or in a feedback mode by showing the user appropriate p~u'ts of the map.</Paragraph>
    <Paragraph position="9">  senses of 'charge' as 'expense' and 'allege'.</Paragraph>
    <Paragraph position="10"> s For TREC-2 we used 0=0.2; ,5 varied between l 0 and 100. We should add that the query exp~msion (in the sense considered here, Ibough not quite in the stone way) has been used in information retrieval research befo*'e (e.g., Sp,'trck Jones and Tail, 1984; Harm\[m, 1988), usuaUy with mixed results. The main difference between the current approach ,'u~d those previous attempts is that we use lexico-sernantic evidence for selecting extra terms, while they relied on term co-occurrence within the same documents. In fact we consider these to methods colnplementary with the latter being more appropriate for automatic relevance feedback. An alternative query expansion to is to use term clusters to create new terms, &amp;quot;metaterms&amp;quot;, and use them to index the database instead (e.g., Crouch, 1988; Lewis ,and Croft, 1990). We found that the query exp~sion approach gives the system more flexibility, for inst,'mce, by making room lbr hypertextstyle topic exploration via user feedback.</Paragraph>
  </Section>
class="xml-element"></Paper>