File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-2114_metho.xml
Size: 16,296 bytes
Last Modified: 2025-10-06 14:13:43
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2114"> <Title>A Best-Match Algorithm for Broad-Coverage Example-Based Disambiguation</Title> <Section position="3" start_page="0" end_page="717" type="metho"> <SectionTitle> 2 A Best-Match Algorithm </SectionTitle> <Paragraph position="0"> In this section, conventional algorithms for example-based disambiguation, and their associated problems, are briefly introduced. The algorithms of most example-based systems consist of the following three steps (in some systems, the exact-match and the best-match steps are merged):</Paragraph> <Paragraph position="1"> (&quot;store+V&quot; *store1 &quot;in&quot; &quot;disk&quot; *disk 1) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;storage-device&quot; *device 2) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;cell&quot; *cell 1) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;computer&quot; *computer1 4) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;storage&quot; *storage2 3) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;format&quot; *format1 1) (&quot;store+V&quot; *store1 &quot;in&quot; &quot;data-network&quot; *network3 1)</Paragraph> <Paragraph position="3"> 1. Searching for examples 2. Exact matching 3. Best matching with a thesaurus Suppose the prepositional phrase attachment ambiguity in S1 is resolved by using these steps. (S1) A managed AS/400 system can store a new program in the repository.</Paragraph> <Paragraph position="4"> There are two candidates for the attachment of the prepositional phrase &quot;in the repository.&quot; They are represented by the following head-modifier relationships:</Paragraph> <Paragraph position="6"> In R1 the noun &quot;repository&quot; modifies the verb &quot;store&quot; with &quot;in,&quot; while in R2, it modifies the noun &quot;program.&quot; First, SENA searches for examples whose heads match the candidate. Figures 1 and 2 show the relevant examples for R1 and R2. 
They represent the head-modifier relationships, including word-senses, a relation label between the word-senses (e.g. &quot;in&quot;), and a frequency.</Paragraph> <Paragraph position="7"> If a relationship identical to either of the candidates R1 and R2 is found, a high similarity is attached to the candidate and the example (exact matching).</Paragraph> <Paragraph position="8"> Word-sense ambiguities are resolved by using the same framework [12]. In this case, each candidate represents a word sense. For example, the word-sense *store1 is preferred among the examples shown in Fig. 1.</Paragraph> <Paragraph position="9"> If no examples are obtained by the exact-matching process, the system executes the best-matching process, which is the most important mechanism in the example-based approach. For the comparison, synonym or is-a relationships described in a thesaurus are used. For example, if a synonym relation is found between &quot;repository&quot; and &quot;disk&quot; in the first example for R1, a similarity whose value is smaller than that for exact matching is given to the example. The most preferable candidate is selected by comparing all examples in Fig. 1 and computing the total similarity value for each candidate. If multiple candidates have the same similarity values, the frequency of the example and some heuristics (for example, innermost attachment is preferred) are used to weight the similarities.</Paragraph> <Paragraph position="10"> Experience with SENA reveals two problems that prevent an improvement in the performance of the best-matching algorithm. First, the approach is strongly dependent on the thesaurus. Many systems calculate the similarity or preference mainly or entirely by using the hierarchy of the thesaurus.</Paragraph> <Paragraph position="11"> However, these relationships indicate only a certain kind of similarity between words. 
To improve the coverage of the example-base, additional types of knowledge are required, as will be discussed in the following sections.</Paragraph> <Paragraph position="12"> Another problem is the existence of unknown words; that is, words that are described in the system dictionary but do not appear in the example-base or the thesaurus. In SENA, the New Collins Thesaurus [1] is used to disambiguate sentences in computer manuals. Many unknown words appear, especially nouns, since the thesaurus is for the general domain. Therefore, a mechanism for handling the unknown words is required. This is covered in Chapter 4.</Paragraph> </Section> <Section position="4" start_page="717" end_page="718" type="metho"> <SectionTitle> 3 Knowledge Acquisition for Robust Best-Matching </SectionTitle> <Paragraph position="0"> As described in the previous section, the best-matching algorithm is a basic element of example-based disambiguation, but is strongly dependent on the thesaurus. Nirenburg [8] discusses the type of knowledge needed for the matching; in his method, morphological information and antonyms are used in addition to synonym and is-a relationships. This section discusses the acquisition of knowledge from other aspects for a broad-coverage best-match algorithm.</Paragraph> <Section position="1" start_page="717" end_page="718" type="sub_section"> <SectionTitle> 3.1 Acquisition of Conjunctive Relationships from Corpora </SectionTitle> <Paragraph position="0"> The New Collins Thesaurus, which is used in SENA as a source of synonym or is-a relationships, gives the following synonyms of &quot;store&quot;: store: accumulate, deposit, garner, hoard, keep, etc. 
In our example-base, there are few examples for any of the words except &quot;keep,&quot; since the example-base was developed mainly to resolve sentences in technical documents such as computer manuals.</Paragraph> <Paragraph position="1"> When the domain is changed, the vocabulary and the usage of words also change. Even a general-domain thesaurus sometimes does not suit a specific domain. Moreover, development of a domain-specific thesaurus is a time-consuming task. The use of synonym or is-a relationships suggests the hypothesis that, from the viewpoint of the example-based approach, a word in a sentence can be replaced by its synonyms or taxonyms. That is, it supports the existence of the (virtual) example S1' when &quot;store&quot; and &quot;keep&quot; have a synonym relationship.</Paragraph> <Paragraph position="2"> (S1') A managed AS/400 system can keep a new program in the repository.</Paragraph> <Paragraph position="3"> Interchangeability is an important condition for calculating similarity or preferences between words. Our claim is that if words are interchangeable in sentences, they should have strong similarity.</Paragraph> <Paragraph position="4"> In this paper, conjunctive relationships, which are common in technical documents, are proposed as relationships that satisfy the condition of interchangeability. 
Sentences in which the word &quot;store&quot; is used as an element of a coordinated structure can be extracted from computer manuals, as the following examples show: (1) The service retrieves, formats, and stores a message for the user. (2) Delete the identifier being stored or modified from the table.</Paragraph> <Paragraph position="5"> (3) This EXEC verifies and stores the language defaults in your file.</Paragraph> <Paragraph position="6"> (4) You use the function to add, store, retrieve, and update information about documents.</Paragraph> <Paragraph position="7"> From the sentences, the following words that are interchangeable with &quot;store&quot; are acquired: store: retrieve, format, modify, verify, add, update. Often the words share case-patterns, which is a useful characteristic for determining interchangeability. Another reason we use conjunctive relationships is that they can be extracted semi-automatically from untagged or tagged corpora by using a simple pattern-matching method. We extracted about 700 conjunctive relationships from untagged computer manuals by pattern matching.</Paragraph> <Paragraph position="8"> The relationships include various types of knowledge, such as (a) antonyms (e.g. &quot;private&quot; and &quot;public&quot;), (b) sequences of actions (e.g. &quot;load&quot; and &quot;edit&quot;), (c) (weak) synonyms (e.g. &quot;program&quot; and &quot;service&quot;), and (d) part-of relationships (e.g. &quot;tape&quot; and &quot;device&quot;). Another merit of conjunctive relationships is that they reflect domain-specific relations. 
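<!-- Editorial sketch, not part of the original paper. The paper says only that conjunctive relationships were extracted with "a simple pattern-matching method" and does not give the pattern. The regular expression, the LEMMAS table, and the function names below are therefore assumptions made for illustration. The sketch collects words joined by "and" or "or" in example sentences (1) to (4) and treats the members of each coordinated group as interchangeable.

```python
import re
from collections import defaultdict

# Hypothetical surface-form to base-form table; a real system would use a
# morphological analyzer instead.
LEMMAS = {
    "retrieves": "retrieve", "formats": "format", "stores": "store",
    "stored": "store", "modified": "modify", "verifies": "verify",
}

# One coordinated group: zero or more "word, " items, one more word,
# an optional comma, then "and" or "or" and the final word.
COORD = re.compile(r"\b((?:\w+, )*\w+),? (?:and|or) (\w+)")


def conjuncts(sentence):
    """Return the coordinated word groups found in one sentence."""
    groups = []
    for match in COORD.finditer(sentence):
        words = match.group(1).split(", ") + [match.group(2)]
        groups.append([LEMMAS.get(w.lower(), w.lower()) for w in words])
    return groups


def conjunctive_relations(sentences):
    """Map each word to the set of words it was coordinated with."""
    related = defaultdict(set)
    for sentence in sentences:
        for group in conjuncts(sentence):
            for word in group:
                related[word].update(w for w in group if w != word)
    return related


# Example sentences (1) to (4) from the text above.
SENTENCES = [
    "The service retrieves, formats, and stores a message for the user.",
    "Delete the identifier being stored or modified from the table.",
    "This EXEC verifies and stores the language defaults in your file.",
    "You use the function to add, store, retrieve, and update information about documents.",
]

relations = conjunctive_relations(SENTENCES)
print(sorted(relations["store"]))
```

On these four sentences the sketch recovers the six words that the text lists for "store". Coordinated phrases, nested coordination, and unlisted inflections would need a parser or a morphological analyzer, which this sketch does not attempt. -->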
</Paragraph> </Section> <Section position="2" start_page="718" end_page="718" type="sub_section"> <SectionTitle> 3.2 Acquisition from Text to Be Disambiguated </SectionTitle> <Paragraph position="0"> If there are no examples of a word to be disambiguated, and the word does not appear in the thesaurus, no relationships are acquired.</Paragraph> <Paragraph position="1"> The existence of words that are unknown to the example-base and the thesaurus is inevitable when one is dealing with the disambiguation of sentences in practical domains. Computer manuals, for example, contain many special nouns such as names of commands and products, but there are no thesauruses for such highly domain-specific words.</Paragraph> <Paragraph position="2"> One way of resolving the problem is to use the text to be processed as the most domain-specific example-base. This idea is supported by the fact that most word-to-word dependencies including the unknown words appear many times in the same text. Nasukawa [7] developed the Discourse Analyzer (DIANA), which resolves ambiguities in a text by dynamically referring to contextual information. Kinoshita et al. [3] also proposed a method for machine translation by parsing a complete text in advance and using it as an example-base. However, neither system works for unknown words, since both use only dependencies that appear explicitly in the text.</Paragraph> </Section> </Section> <Section position="5" start_page="718" end_page="719" type="metho"> <SectionTitle> 4 An Algorithm to Search for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="718" end_page="719" type="sub_section"> <SectionTitle> Unknown Words </SectionTitle> <Paragraph position="0"> We first give an enhanced best-match algorithm for disambiguation. The steps given in Chapter 2 are modified as follows: 1. Searching for examples 2. Exact matching 3. 
Best matching with a thesaurus and conjunctive relationships 4. Unknown-word-matching using a context-base The outline of the algorithm is as follows: Sentences in the text to be processed are parsed in advance, and the parse trees are stored as a context-base. The context-base can include ambiguous word-to-word dependencies, since no disambiguation process is executed. Using an example-base and the context-base, the sentences in the text are disambiguated sequentially. If an ambiguous word does not appear in an example-base or in the thesaurus, an unknown-word search is executed (otherwise, the conventional best-match process is executed). The unknown-word-matching process includes the following steps: 1. The dependencies that include the unknown word are extracted from the context-base.</Paragraph> <Paragraph position="1"> 2. A candidate set of words that is interchangeable with the unknown word is searched for in the example-base by using the context dependency. 3. The candidate set acquired in step 2 is compared with the examples extracted for each candidate of interpretation. A preference value is calculated by using the sets, and the most preferred interpretation is selected.</Paragraph> <Paragraph position="2"> Let us see how the algorithm resolves the attachment ambiguity in sentence S1 from Chapter 2, which is taken from a text (manual) for the AS/400 system.</Paragraph> <Paragraph position="3"> (S1) A managed AS/400 system can store a new program in the repository.</Paragraph> <Paragraph position="4"> The text that contains S1 is parsed in advance, and stored in the context-base. The results of the example search are shown in Fig. 1. 
There are two candidate relationships for the attachment of the prepositional phrase &quot;in the repository&quot;.</Paragraph> <Paragraph position="6"> The noun &quot;repository&quot; does not appear in the example-base or thesaurus, and therefore no information for the attachment is acquired.</Paragraph> <Paragraph position="7"> Consequently, the word-to-word dependencies that contain &quot;repository&quot; are searched for in the context-base. The following sentences appear before or after S1 in the text: (CB1) The repository can hold objects that are ready to be sent or that have been received from another user library.</Paragraph> <Paragraph position="8"> (CB2) A distribution catalog entry exists for each object in the distribution repository.</Paragraph> <Paragraph position="9"> (CB3) A data object can be loaded into the distribution repository from an AS/400 library. (CB4) The object type of the object specified must match the information in the distribution repository.</Paragraph> <Paragraph position="10"> From the sentences, the head-modifier relationships that contain the unknown word &quot;repository&quot; are listed. These relationships are called the context dependency for the word. The context dependency of &quot;repository&quot; is as follows:</Paragraph> <Paragraph position="12"> The last number in each relation is the certainty factor (CF) of the relationship. The value is 1/(the number of candidates for resolving the ambiguity).</Paragraph> <Paragraph position="13"> For example, the attachment of &quot;repository&quot; in CB2 has two candidates, D2 and D3. Therefore, the certainty factors for D2 and D3 are 1/2.</Paragraph> <Paragraph position="14"> For each dependency, candidate words (CB) in the context-base are searched for in the example-base. The words in the set can be considered as substitutable synonyms of the unknown word. 
For example, the WORDs that satisfy the relationship (&quot;hold+V&quot; (subj) WORD+N) in the case of D1 are searched for. The following are candidate words in the context-base for the word &quot;repository.&quot;</Paragraph> <Paragraph position="16"> The total set of candidate words (CB) for &quot;repository&quot; is the union of CB1 through CB6. The set is compared with the extracted examples for each attachment candidate (Fig. 1). The words in the examples are candidate words in the example-base. By intersecting the candidate words in the context-base and the example-base, words that are interchangeable with the unknown word can be extracted. The intersections of each set are as follows: For R1, CB ∩ C1 = {storage, format} For R2, CB ∩ C2 = {} This result means that &quot;storage&quot; and &quot;format&quot; have the same usage (or are interchangeable) in the text. The preference value P(R) for the candidate R with the interchangeable word w is calculated by the formula:</Paragraph> <Paragraph position="18"> In this case, P(R1) = 0.5 × 1 + 0.5 × 1 = 1.0, and P(R2) = 0 (supposing that the frequency of the words is 1). As a result, R1 is preferred to R2.</Paragraph> <Paragraph position="19"> If both sets of candidates are empty, the numbers of extracted examples are compared (this is called Heuristic-1). If there are no related words in this case, R1 is preferred to R2 (see Fig. 1). This heuristic indicates that &quot;in&quot; is preferred after &quot;store,&quot; irrespective of the head word of the prepositional phrase.</Paragraph> </Section> </Section> </Paper>
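<!-- Editorial sketch, not part of the original paper. The formula for P(R) did not survive extraction, so the scoring below is an assumption: it sums certainty factor times example frequency over the words shared by the context-base and the example-base, which reproduces the worked figures P(R1) = 1.0 and P(R2) = 0 given above. The function names are illustrative, and the R2 example set is a hypothetical placeholder because Figure 2 is not reproduced in this text.

```python
def certainty_factor(num_candidates):
    """CF of a context dependency: 1 / number of attachment candidates."""
    return 1.0 / num_candidates


def preference(context_words, example_freq):
    """Assumed P(R): sum of CF * frequency over the shared words.

    context_words: word mapped to the CF of the dependency that produced it
    example_freq:  word mapped to the frequency of the matching example
    """
    shared = context_words.keys() & example_freq.keys()
    return sum(context_words[w] * example_freq[w] for w in shared)


def select(context_words, examples_by_candidate):
    """Pick the candidate with the highest preference value."""
    scores = {name: preference(context_words, freq)
              for name, freq in examples_by_candidate.items()}
    if any(scores.values()):
        return max(scores, key=scores.get)
    # Heuristic-1 as read from the text: with no shared words at all,
    # compare the number of extracted examples.
    return max(examples_by_candidate,
               key=lambda name: len(examples_by_candidate[name]))


# Candidate words for "repository", each found through a dependency that
# has two attachment candidates (CF = 1/2), as the worked example implies.
context_words = {"storage": certainty_factor(2), "format": certainty_factor(2)}

examples = {
    # Modifier words from Fig. 1; frequency set to 1 as in the worked example.
    "R1": dict.fromkeys(["disk", "storage-device", "cell", "computer",
                         "storage", "format", "data-network"], 1),
    # Hypothetical placeholder for the Fig. 2 examples.
    "R2": {"library": 1},
}

p_r1 = preference(context_words, examples["R1"])
p_r2 = preference(context_words, examples["R2"])
print(p_r1, p_r2, select(context_words, examples))
```

With an empty set of context words, select falls back to counting examples and still returns R1, which matches the remark that "in" is preferred after "store" regardless of the head word. -->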