<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2085"> <Title>Linguistic Knowledge Generator</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 3 Algorithm </SectionTitle> <Paragraph position="0"> In this section, we describe a scenario to illustrate how our idea of the &quot;Gradual Approximation&quot; works to obtain knowledge from actual corpora. The goal of the scenario is to discover semantic classes of nouns which are effective for determining (disambiguating) the internal structures of compound nouns, which consist of sequences of nouns. Note that, because there is no clear distinction in Japanese between noun phrases and compound nouns consisting of sequences of nouns, we refer to them collectively as compound nouns. The scenario comprises three programs, i.e. a Japanese tagging program, the Automatic Learning Program of Semantic Collocations and a clustering program. There is a phase of human intervention which accelerates the calculation, but in this scenario, we try to minimize it. In the following, we first give an overview of the scenario, then explain each program briefly, and finally report on an experiment that fits this scenario.
Note that, though we use this simple scenario as an illustrative example, the same learning program can be used in another more complex scenario whose aim is, for example, to discover semantic collocations between verbs and noun/prepositional phrases.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Scenario </SectionTitle> <Paragraph position="0"> This scenario takes a corpus without any significant annotation as the input data, and generates, as the result, plausibility values of collocational relations between two words and word clusters, based on the calculated semantic distances between words.</Paragraph> <Paragraph position="1"> The diagram illustrating this scenario is shown in the figure. The first program is the &quot;Japanese tagging program&quot;, which divides a sentence into words and generates lists of possible parts-of-speech for each word.</Paragraph> <Paragraph position="2"> Sequences of words with parts-of-speech are then used to extract candidates for compound nouns (or noun phrases consisting of noun sequences), which are the input for the next program, the &quot;Automatic Learning Program for Semantic Collocations&quot; (ALPSC). This program constitutes the main part of the scenario and produces the above-mentioned output (ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992; PROC. OF COLING-92, NANTES, AUG. 23-28, 1992). The output of the program contains errors. Errors here mean that the plausibility values assigned to collocations may lead to wrong determinations of compound noun structures. Such errors are contained in the results because of errors in the tagged data, the insufficient quality of the corpus and inevitable imperfections in the learning system.</Paragraph> <Paragraph position="3"> From the word distance results, word clusters are computed by the next program, the &quot;Clustering Program&quot;.
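The repeated trials described in this subsection can be sketched as a simple loop. Every function body below is a toy stand-in for one of the three programs or for the human step, and all the names (alpsc, cluster, human_revision, gradual_approximation) are ours, not the paper's:

```python
# Toy sketch of the outer "Gradual Approximation" loop: the ALPSC,
# the clustering program and human revision alternate over trials.
# Every body here is an illustrative stand-in, not the real program.

def alpsc(noun_sequences, clusters):
    # Stand-in: count adjacent noun pairs as MOD "plausibility" scores.
    # (The real ALPSC would also exploit the approved clusters.)
    scores = {}
    for seq in noun_sequences:
        for arg, head in zip(seq, seq[1:]):
            key = (head, "MOD", arg)
            scores[key] = scores.get(key, 0) + 1
    return scores

def cluster(scores):
    # Stand-in: group together arguments that share a head word.
    by_head = {}
    for (head, _rel, arg) in scores:
        by_head.setdefault(head, set()).add(arg)
    return list(by_head.values())

def human_revision(raw_clusters):
    # Stand-in for the specialist who keeps only intuitive (sub)clusters;
    # here every proposed cluster is accepted unchanged.
    return raw_clusters

def gradual_approximation(noun_sequences, n_trials=2):
    clusters = None  # no semantic classes before the first trial
    for _ in range(n_trials):
        scores = alpsc(noun_sequences, clusters)
        clusters = human_revision(cluster(scores))
    return scores, clusters
```

With the input [["file", "transfer"], ["data", "transfer"]] this toy loop groups "file" and "data" together, since both modify "transfer".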
Because of the errors in the word distance data, the computed clusters may be counter-intuitive.</Paragraph> <Paragraph position="4"> We expect human intervention at this stage to formulate more intuitively reasonable clusters of nouns. After revision of the clusters by human specialists, the scenario enters a second trial. That is, the ALPSC re-computes plausibility values of collocations and word distances based on the revised clusters, the &quot;Clustering Program&quot; generates the next generation of clusters, and humans intervene to formulate more reasonable clusters, and so on, and so forth. It is expected that the word clusters after the (i+1)-th trial become more intuitively understandable than those of the i-th trial, and that the repetition eventually converges towards ideal clusters of nouns and plausibility values, in the sense that they are consistent both with human introspection and the actual corpus.</Paragraph> <Paragraph position="5"> It should be noted that, while the overall process works as gradual approximation, the key program in the scenario, the ALPSC, also works in the mode of gradual approximation, as explained in Section 3.2.2.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Programs and Human interventions </SectionTitle> <Paragraph position="0"> We will explain each program briefly. However, the ALPSC is crucial and unique, so it will be explained in greater detail.</Paragraph> <Paragraph position="1"> 3.2.1 Program: Japanese tagging program This program takes Japanese sentences as input, finds word boundaries and assigns all possible parts-of-speech to each word under adjacency constraints. From the tagged sentences, sequences of nouns are extracted for input to the next program.</Paragraph> <Paragraph position="2"> 3.2.2 Program: Automatic Learning Program of Semantic Collocations (ALPSC) This is the key program which computes plausibility values and word distances.
In this scenario, the ALPSC treats only sequences of nouns, but it can generally be applied to any structure of syntactic relationships. It is a unique program with the following points [8]: 1. it does not need a training corpus, which is one of the bottlenecks of some other learning programs; 2. it learns by using a combination of linguistic knowledge and statistical analysis; 3. it uses a parser which produces all possible analyses; 4. it works as a relaxation process. While it is included as a part of a larger repetitive loop, this program itself contains a repetitive loop. Overview: Before formally describing the algorithm, the following simple example illustrates its working. A parser produces all possible syntactic descriptions among words in the form of syntactic dependency structures. The description is represented by a set of tuples, for example, [head-word, syntactic-relation, argument]. The only syntactic relation in a tuple is MOD for this scenario, but it can be either a grammatical relation like MOD, SUBJ, OBJ, etc. or a surface preposition like BY, WITH, etc. When two or more tuples share the same argument and the same syntactic-relation, but have different head-words, there is an ambiguity.</Paragraph> <Paragraph position="3"> For example, the description of the compound noun &quot;file transfer operation&quot; contains three tuples: [transfer, MOD, file] [operation, MOD, file] [operation, MOD, transfer] The first two tuples are mutually exclusive, because one word can only be an argument in one of the tuples. As repeatedly claimed in the literature of natural language understanding, in order to resolve this ambiguity, a system may have to be able to infer extra-linguistic knowledge. A practical problem here is that there is no systematic way of accumulating such extra-linguistic knowledge for given subject fields.
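The tuple notation and the ambiguity just described can be sketched as follows; the function names are ours, and the assumption that every noun may modify any noun to its right is an illustration of the compound-noun case:

```python
def possible_tuples(nouns):
    """All possible (head, relation, argument) dependency tuples for a
    compound noun, assuming each noun may modify any noun to its right."""
    tuples = []
    for i, arg in enumerate(nouns[:-1]):
        for head in nouns[i + 1:]:
            tuples.append((head, "MOD", arg))
    return tuples

def ambiguities(tuples):
    """Groups of tuples sharing the same argument and relation but with
    different head-words; each such group is one ambiguity."""
    groups = {}
    for head, rel, arg in tuples:
        groups.setdefault((rel, arg), []).append((head, rel, arg))
    return [g for g in groups.values() if len(g) > 1]

# "file transfer operation" yields the three tuples listed in the text,
# of which the two with argument "file" compete with each other.
ts = possible_tuples(["file", "transfer", "operation"])
amb = ambiguities(ts)
```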
That is, unless a system has a full range of contextual understanding abilities, it cannot reject either of the possible interpretations as 'impossible'. The best a system can do, without full understanding abilities, is to select more plausible ones or reject less plausible ones. This implies that we have to introduce a measure by which we can judge the plausibility of 'interpretations'. The algorithm we propose computes such measures from given data. It gives a plausibility value to each possible tuple, based on the sample corpus. For example, when the tuples (transfer, MOD, file) and (operation, MOD, file) are assigned 0.5 and 0.82 as their plausibility, this would show the latter tuple to be more plausible than the former.</Paragraph> <Paragraph position="4"> The algorithm is based on the assumption that the ontological characteristics of the objects and actions denoted by words (or linguistic expressions in general), and the nature of the ontological relations among them, are exhibited, though implicitly, in sample texts. For example, nouns denoting objects which belong to the same ontological classes tend to appear in similar linguistic contexts.</Paragraph> <Paragraph position="5"> Note that we talk about extra-linguistic 'ontology' for the sake of explaining the basic idea behind the actual algorithm. However, as you will see, we do not represent such things as ontological entities in the actual algorithm. The algorithm simply counts frequencies of co-occurrences among words, and word similarity algorithms interpret such co-occurrences as contexts.</Paragraph> <Paragraph position="6"> The algorithm in this program computes the plausibility values of hypothesis-tuples like (operation, MOD, file), etc., basically by counting frequencies of instance-tuples [operation, MOD, file], etc.
generated from the input data.</Paragraph> <Paragraph position="7"> Terminology and notation: instance-tuple [h, r, a]: a token of a dependency relation; part of the analysis of a sentence in a corpus.</Paragraph> <Paragraph position="8"> hypothesis-tuple (h, r, a): a dependency relation; an abstraction, or type, over identical instance-tuples. g: repeat time of the relaxation cycle.</Paragraph> <Paragraph position="9"> C_Ti: credit of instance-tuple T with identification number i, [0, 1]. V_T^g: plausibility value of a hypothesis-tuple T in cycle g, [0, 1]. D_g(w_a, w_b): distance between words w_a and w_b in cycle g, [0, 1]. Algorithm: The following explanation of the algorithm assumes that the inputs are sentences.</Paragraph> <Paragraph position="10"> 1. For a sentence we use a simple grammar to find all tuples possibly used. Each instance-tuple is then given credit in proportion to the number of competing tuples.</Paragraph> <Paragraph position="12"> C_Ti = 1 / (number of competing tuples). This credit shows which rules are suitable for this sentence. On the first iteration the split of the credit between ambiguous analyses is uniform, as shown above, but on subsequent iterations the plausibility values V_T^(g-1) of the hypothesis-tuples before the iteration are used to give preference to some analyses over others. The formula for this will be given later.</Paragraph> <Paragraph position="13"> 2. Hypothesis-tuples have a plausibility value which indicates their reliability by a number between 0 and 1. If an instance-tuple occurs frequently in the corpus or if it occurs where there are no alternative tuples, the plausibility value for the corresponding hypothesis must be large.
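Steps 1 and 2 can be sketched as follows. The uniform credit split is the one given above; since the plausibility formula itself is not legible in this copy, the combination used below (V_T = 1 - prod(1 - C_Ti), which grows with frequency and with unambiguous occurrences but never exceeds 1) is an illustrative assumption consistent with the stated constraints, not necessarily the paper's exact formula:

```python
from collections import defaultdict

def initial_credits(sentences):
    """Step 1: split credit uniformly among the instance-tuples competing
    in each ambiguity group, C_Ti = 1 / (number of competing tuples).
    A sentence is given here as a list of ambiguity groups."""
    credits = []
    for groups in sentences:
        for group in groups:
            for t in group:
                credits.append((t, 1.0 / len(group)))
    return credits

def plausibility(credits):
    """Step 2 (assumed form): V_T = 1 - prod(1 - C_Ti) over all instances
    of hypothesis-tuple T, so frequent or unambiguous tuples score high
    while the value stays in [0, 1]."""
    remainder = defaultdict(lambda: 1.0)
    for t, c in credits:
        remainder[t] *= 1.0 - c
    return {t: 1.0 - r for t, r in remainder.items()}

# One ambiguous sentence plus one unambiguous occurrence: the tuple seen
# unambiguously is driven to plausibility 1.0, its competitor stays at 0.5.
s1 = [[("operation", "MOD", "file"), ("transfer", "MOD", "file")]]
s2 = [[("operation", "MOD", "file")]]
V = plausibility(initial_credits([s1, s2]))
```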
After analysing all the sentences of the corpus, we get a set of sentences with weighted instance-tuples.</Paragraph> <Paragraph position="14"> Each instance-tuple invokes a hypothesis-tuple.</Paragraph> <Paragraph position="15"> For each hypothesis-tuple, we define the plausibility value by the following formula. This formula is designed so that the value does not exceed 1.</Paragraph> <Paragraph position="17"> At this stage, the word-distances can be used to modify the plausibility values of the hypothesis-tuples. The word-distances are either defined externally using human intuition or calculated in the previous cycle with a formula given later. Distance between words induces a distance between hypothesis-tuples. To speed up the calculation and to get better results, we use similar-hypothesis effects. The plausibility value of a hypothesis-tuple is modified based on the word distance and the plausibility value of a similar hypothesis. For each hypothesis-tuple, the plausibility value is increased only as a consequence of the similar hypothesis-tuple which has the greatest effect. The new plausibility value with the similar-hypothesis-tuple effect is calculated by the following formula.</Paragraph> <Paragraph position="19"> Here, the hypothesis-tuple T' is the hypothesis-tuple which has the greatest effect on the hypothesis-tuple T (the original one). Hypothesis-tuples T and T' have all the same elements except one. The distance between T and T' is the distance between the differing elements, w_a and w_b. Ordinarily the difference is in the head or argument element, but when the relation is a preposition, it is possible to consider the distance from another preposition.</Paragraph> <Paragraph position="20"> Distances between words are calculated on the basis of the similarity between hypothesis-tuples about them.
The formula is as follows:</Paragraph> <Paragraph position="22"> Here, the distance D_g(w_a, w_b) is computed from pairs of hypothesis-tuples whose arguments are w_a and w_b, respectively, and whose heads and relations are the same; β is a constant parameter. 5. This procedure will be repeated from the beginning, but modifying the credits of instance-tuples between ambiguous analyses using the plausibility values of hypothesis-tuples. This will hopefully be more accurate than the previous cycle.</Paragraph> <Paragraph position="23"> On the first iteration, we used just a constant figure for the credits of instance-tuples. But this time we can use the plausibility value of the hypothesis-tuple which was deduced in the previous iteration. Hence with each iteration we expect more reliable figures. To calculate the new credit of instance-tuple T, we use: C_Ti = (V_T^(g-1))^α / Σ (V_T'^(g-1))^α. Here, V_T^(g-1) in the numerator is the plausibility value of the hypothesis-tuple which is the same tuple as the instance-tuple T. The V_T'^(g-1) in the denominator are the plausibility values of the competing hypothesis-tuples in the sentence and the plausibility value of the same hypothesis-tuple itself. α is a constant parameter. 6. Iterate steps 1 to 5 several times, until the information is saturated.</Paragraph> <Paragraph position="24"> 3.2.3 Program: Clustering program Word clusters are produced based on the word distance data which are computed in the previous program. A non-overlapping clustering algorithm with the maximum (complete-linkage) method was used. The level of the clusters was adjusted experimentally to get suitable sizes for human intervention.</Paragraph> <Paragraph position="25"> 3.2.4 Human intervention: Select clusters The clusters may inherit errors contained in the word distance data. The errors can be classified into the following two types.</Paragraph> <Paragraph position="26"> Note that 'correct' here means that it is correct in terms of human intuition.
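The credit re-estimation of step 5 above can be sketched as follows; the normalised form C_Ti = V_T^α / Σ V^α is reconstructed from the legend (numerator: the matching hypothesis-tuple; denominator: the competing hypotheses in the sentence plus the tuple itself), so treat it as a plausible reading of the garbled formula rather than a verbatim copy:

```python
def reestimate_credits(group_values, alpha=1.0):
    """Step 5 (reconstructed): new credits for the instance-tuples of one
    ambiguity group, from the previous cycle's plausibility values:
        C_Ti = V_T**alpha / sum(V**alpha over the group, incl. V_T).
    An alpha above 1 sharpens the preference; one below 1 softens it."""
    denom = sum(v ** alpha for v in group_values)
    return [v ** alpha / denom for v in group_values]

# With the plausibilities 0.82 and 0.5 from the earlier example, the more
# plausible analysis now receives the larger share of the credit.
credits = reestimate_credits([0.82, 0.5])
```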
To ease the laborious job of correcting these errors by hand, we ignore the first type of error, which is much harder to remove than the second one. It is not difficult to remove the second type of error, because the number of words in a single cluster ranges from two to about thirty, and this number is manageable for humans. We try to extract purely 'correct' clusters, or a subset of a correct cluster, from a generated cluster.</Paragraph> <Paragraph position="27"> It is our contention that, though clusters contain errors, and are mixtures of clusters based on human intuition and clusters computed by the process, we will gradually converge on correct clusters by repeating this approximation.</Paragraph> <Paragraph position="28"> At this stage, some correct clusters among the produced clusters are extracted. This information will be an input to the next trial of the ALPSC.</Paragraph> </Section> </Section></Paper>