<?xml version="1.0" standalone="yes"?>
<Paper uid="E83-1011">
  <Title>AN EXPERIMENT WITH HEURISTIC PARSING OF SWEDISH</Title>
  <Section position="4" start_page="66" end_page="68" type="metho">
    <SectionTitle>
II AN OUTLINE OF AN ANALYZER BASED ON
THE HEURISTIC PRINCIPLE
</SectionTitle>
    <Paragraph position="0"> Figure 1 below shows the general outline of the system. Each of the various boxes (or subboxes) represents one specific process, usually a complete computer program in itself, or, in some cases, independent processes within a program. The big &amp;quot;container&amp;quot;, labelled &amp;quot;The Pool&amp;quot;, contains both the linguistic material as well as the current analysis of it. Each program or process looks into the Pool for things &amp;quot;it&amp;quot; can recognize, and when the process finds anything it is trained to recognize, it adds its observation to the material in the Pool. This added material may (hopefully) help other processes in recognizing what they are trained to recognize, which in its turn may again help the first process to recognize more of &amp;quot;its&amp;quot; units. And so on.</Paragraph>
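    The interplay described above, where each process adds observations that may help the others until nothing new is found, can be pictured with a minimal Python sketch. It is an illustration only, not the Beta programs of the project: the two demons, the function names and the sample sentence are hypothetical, and the rule that a word following a tagged preposition is a noun is a toy assumption.

```python
# Hypothetical sketch: the Pool as a shared text that demons annotate in turn.
def closed_cat(pool):
    # Tag the preposition PÅ with surrounding "e" brackets (lower case = meta-language).
    return pool.replace(" PÅ ", " ePÅe ")

def after_preposition(pool):
    # Toy demon that depends on closed_cat: tag the word following a tagged preposition.
    words = pool.split(" ")
    for i in range(1, len(words)):
        if words[i - 1] == "ePÅe" and words[i].isupper():
            words[i] = "n" + words[i] + "n"
    return " ".join(words)

def run_until_stable(pool, demons):
    # Apply every demon repeatedly until a full round adds nothing new.
    while True:
        before = pool
        for demon in demons:
            pool = demon(pool)
        if pool == before:
            return pool

# The demons are deliberately listed in the "wrong" order; the result is the same.
result = run_until_stable(" BOKEN LIGGER PÅ BORDET . ", [after_preposition, closed_cat])
```

    Because the loop runs until a whole round leaves the Pool unchanged, the order in which the demons are listed affects only the number of rounds, not the final content.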
    <Paragraph position="1"> The system is now under development, and during this build-up phase each process is, as was said above, essentially a complete, stand-alone module, and the Pool exists simply as successively updated text files on disc storage. At the moment some programs presuppose that other programs have already been run, but this state of affairs will hold only during the build-up phase. At the end of the build-up phase each program shall be able to run completely independently of any other program in the system and in arbitrary order relative to the others (but will, of course, usually perform better if more information is available in the Pool).</Paragraph>
    <Paragraph position="2"> In the second phase superordinate control programs are to be implemented. These programs will function as &amp;quot;traffic rules&amp;quot;, and via them one shall be able to test various strategies, i.e. to test which relative order of the different subsystems yields the optimal result under some kind of &amp;quot;performance metric&amp;quot;, an evaluation procedure that takes both speed and quality into account.</Paragraph>
    <Paragraph position="3"> The programs/processes shown in Figure 1 all represent rather straightforward Finite State Pattern Matching (FS/PM) procedures. It is rather trivial to show mathematically that a set of interacting FS/PM procedures of the type used in our system together will yield a system that formally has the power of a CF-parser; in practice it will yield a system that in some sense is stronger, at least from the point of view of convenience. Congruence and similar phenomena will be reduced to simple local observations. Transformational variants of sentences will be recognized directly - there will be no need for performing some kind of backward transformational operations. (In this respect a system like this will resemble Gazdar's grammar concept; Gazdar 1980.) The control structures later to be superimposed on the interacting FS/PM systems will also be of a Finite State type. A system of the type then obtained - a system of independent Finite State Automata controlled by another Finite State Automaton - will in principle have rather complex mathematical properties. It is, e.g., rather easy to see that such a system has stronger capacity than a Type 2 device, but it will not have the power of a full Type 1 system.</Paragraph>
    <Paragraph position="4"> Now a few comments on Figure 1. The &amp;quot;balloons&amp;quot; in the figure represent independent programs (later to be developed into independent processes inside one &amp;quot;big&amp;quot; program). The figure displays those programs that so far (January 1983) have been implemented and tested (to some extent). Other programs will successively be entered into the system.</Paragraph>
    <Paragraph position="5"> The big balloon labelled &amp;quot;The Closed Cat&amp;quot; represents a program that recognizes closed word classes such as prepositions, conjunctions, pronouns, auxiliaries, and so on. The Closed Cat recognizes full word forms directly. The SMURF balloon represents the morphological component (SMURF = &amp;quot;Swedish Murphology&amp;quot;). SMURF itself is organized internally as a complex system of independently operating &amp;quot;demons&amp;quot; - SMURFs - each knowing &amp;quot;its&amp;quot; little corner of Swedish word formation. (The name of the program is an allusion to the popular comic strip leprechauns &amp;quot;les Schtroumpfs&amp;quot;, which in Swedish are called &amp;quot;smurfar&amp;quot;.) Thus there is one little smurf recognizing derivational morphemes, one recognizing flectional endings, and so on. One special smurf, Phonotax, has an important controlling function: every other smurf must always consult Phonotax before identifying one of &amp;quot;its&amp;quot; (potential) formatives; the word minus this formative must still be pronounceable, otherwise it cannot be a formative.</Paragraph>
    <Paragraph position="6"> SMURF works entirely without stem lexicon; it adheres completely to the &amp;quot;philosophy&amp;quot; of using surface signals as far as possible.</Paragraph>
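    The Phonotax check, that a word minus a candidate formative must still be pronounceable, can be sketched as follows. This is a hedged illustration in Python, not SMURF itself: the ending list is a small hypothetical sample, and the test "contains at least one vowel" is a crude stand-in for real Swedish phonotactic constraints, which are not given in this section.

```python
VOWELS = set("AEIOUYÅÄÖ")
ENDINGS = ["ARNA", "ORNA", "EN", "AR", "A"]  # hypothetical sample, longest first

def pronounceable(stem):
    # Crude stand-in for Phonotax: a stem needs at least one vowel.
    return any(ch in VOWELS for ch in stem)

def segment(word):
    # Strip the first listed ending whose removal leaves a pronounceable stem.
    for ending in ENDINGS:
        stem = word[: len(word) - len(ending)]
        if word.endswith(ending) and stem and pronounceable(stem):
            return stem + "-" + ending
    return word  # no ending passed the check; leave the word unsegmented
```

    With these assumptions, segment("FLICKORNA") gives "FLICK-ORNA", while segment("BRA") leaves the word intact, because stripping -A would leave the unpronounceable "BR". No stem lexicon is consulted at any point.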
    <Paragraph position="7"> NOMFRAS, VERBAL, IFIGEN, CLAUS and PREPPS are other &amp;quot;demons&amp;quot; that recognize different phrases or word groups within sentences, viz. noun phrases, verbal complexes, infinitival constructions, clauses and prepositional phrases, respectively.</Paragraph>
    <Paragraph position="8"> N-lex, V-lex and A-lex represent various (sub)lexicons; so far we have tried to do without them as far as possible. One should observe that stem lexicons are not prerequisites for the system to work; adding them only enhances its performance.</Paragraph>
    <Paragraph position="9"> The format of the material inside the Pool is the original text, plus appropriate &amp;quot;labelled brackets&amp;quot; enclosing words, word groups, phrases and so on. In this way, the form of representation is consistent throughout, no matter how many different types of analyses have been applied to it. Thus, various people can join our group and write their own &amp;quot;demons&amp;quot; in whatever language they prefer, as long as they can take sentences in text format, be reasonably tolerant of whatever types of &amp;quot;brackets&amp;quot; they find in there, do their analysis, add their own brackets (in the specified format), and put the result back into the Pool.</Paragraph>
    <Paragraph position="10">  Of the various programs SMURF, NOMFRAS and IFIGEN are extensively tested (and, of course, The Closed Cat, which is a simple lexical lookup system), and various examples of analyses of these programs will be demonstrated in the next section.</Paragraph>
    <Paragraph position="11"> We hope to arrive at a crucial station in this project during 1983, when CLAUS has been more thoroughly tested. If CLAUS performs the way we hope (and preliminary tests indicate that it will), we will have means to identify very quickly the clausal structures of the sentences in an arbitrary running text, thus having a firm base for entering higher hierarchies in the syntactic domains.</Paragraph>
    <Paragraph position="12"> The programs are written in the Beta language developed by the present author; cf. Brodda-Karlsson, 1980, and Brodda, 1983, forthcoming. Of the actual programs in the system, SMURF was developed and extensively tested by B.B. during 1977-79 (Brodda, 1979), whereas the others are (being) developed by B.B. and/or Gunnel Källgren, Stockholm (mostly &amp;quot;and&amp;quot;).</Paragraph>
  </Section>
  <Section position="5" start_page="68" end_page="70" type="metho">
    <SectionTitle>
III EXPLODING SOME OF THE BALLOONS
</SectionTitle>
    <Paragraph position="0"> When a &amp;quot;fresh&amp;quot; text is entered into The Pool it first passes through a preliminary one-pass program, INIT (not shown in Fig. 1), that &amp;quot;normalizes&amp;quot; the text. The original text may be of any type as long as it is regularly typed Swedish.</Paragraph>
    <Paragraph position="1"> INIT transforms the text so that each graphic sentence will make up exactly one physical record.</Paragraph>
    <Paragraph position="2"> (Except in poetry, physical records, i.e. lines, usually are of marginal linguistic interest.) Paragraph ends will be represented by empty records. Periods used to indicate abbreviations are simply removed and the abbreviation itself is contracted to one graphic word, if necessary; thus &amp;quot;t.ex.&amp;quot; (&amp;quot;e.g.&amp;quot;) is transformed into &amp;quot;tex&amp;quot;, and so on. Otherwise, periods, commas, question marks and other typographic characters are provided with preceding blanks. Through this each word is guaranteed to be surrounded by blanks, and delimiters like commas, periods and so on are guaranteed to signal their &amp;quot;normal&amp;quot; textual functions. Each record is also ended by a sentence delimiter (preceded by a blank). Some manual post-editing is sometimes needed in order to get the text normalized according to the above. In the INIT-phase no linguistic analysis whatsoever is introduced (other than the division into what appear to be orthographic sentences).</Paragraph>
    <Paragraph position="3"> INIT also changes all letters in the original text to their corresponding upper case variants.</Paragraph>
    <Paragraph position="4"> (Originally capital letters are optionally provided with a prefixed &amp;quot;=&amp;quot;.) All subsequent analysis programs add their analyses in the form of lower case letters or letter combinations. Thus upper case letters or words will belong to the object language, and lower case letters or letter combinations will signal meta-language information. In this way, strict text (ASCII) format can be kept for the text as well as for the various stages of its analysis; the &amp;quot;philosophy&amp;quot; of using text input and text output for all programs involved represents the computational solution to the problem of how to make it possible for each process to work independently of all others in the system.</Paragraph>
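    The INIT conventions described above can be summarized in a short Python sketch. It is a simplified assumption-laden illustration, not the actual INIT program: the abbreviation table holds only the one example given in the text, the set of punctuation marks is a guess, and the function name init_normalize is hypothetical. Splitting the text into one sentence per record is left out.

```python
import re

ABBREVIATIONS = {"t.ex.": "tex"}  # the one example from the text; a fuller list is assumed

def init_normalize(sentence):
    # Contract known abbreviations to one graphic word without periods.
    for abbr, contracted in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, contracted)
    # Give each punctuation mark exactly one preceding blank.
    sentence = re.sub(r"\s*([.,?!;:])", r" \1", sentence)
    # Upper-case everything, prefixing originally capital letters with "=".
    return "".join("=" + ch if ch.isupper() else ch.upper() for ch in sentence)

record = init_normalize("Flickan läser, t.ex. en bok.")
```

    After this step the record contains only upper case letters, blanks, punctuation and the "=" marker, so any lower case letter added later can safely be read as meta-language information.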
    <Paragraph position="5"> The Closed Cat (CC) has the important role of marking words belonging to some well defined closed categories of words. This program makes no internal analysis of the words, and only takes full words into account. CC makes use of simple rewrite rules of the type &amp;quot;PÅ =&gt; ePÅe / (blank)__(blank)&amp;quot;, where the inserted e's represent the &amp;quot;analysis&amp;quot; (&amp;quot;e&amp;quot; stands for &amp;quot;preposition&amp;quot;; PÅ = &amp;quot;on&amp;quot;). A sample output from The Closed Cat is shown in illustration 2, where the various meta-symbols also are explained.</Paragraph>
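    A context-sensitive rewrite rule of this kind, tagging a closed-class word only when it stands between blanks, might look as follows in Python. This is a sketch under stated assumptions: the table CLOSED_WORDS contains only the preposition PÅ with the tag "e" from the text, and the function name is hypothetical; the real program enumerates many more word forms and categories.

```python
CLOSED_WORDS = {"PÅ": "e"}  # word form -> tag letter; only the example from the text

def closed_cat(record):
    # Rewrite W as tWt, but only for full word forms standing between two blanks.
    words = record.split(" ")
    for i in range(1, len(words) - 1):
        tag = CLOSED_WORDS.get(words[i])
        if tag:
            words[i] = tag + words[i] + tag
    return " ".join(words)
```

    On a normalized record such as " BOKEN LIGGER PÅ BORDET . " this yields " BOKEN LIGGER ePÅe BORDET . ". A word at the very edge of a string, with no blank on one side, is left alone, which mirrors the "(blank)__(blank)" context of the rule.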
    <Paragraph position="6"> The simple example above also shows the format of inserted meta-information. Each identified constituent is &amp;quot;tagged&amp;quot; with surrounding lower case letters, which then can be conceived of as labelled brackets. This format is used throughout the system, also for complex constituents. Thus the nominal phrase &amp;quot;DEN LILLA FLICKAN&amp;quot; (&amp;quot;the little girl&amp;quot;) will be tagged as &amp;quot;nDEN+LILLA+FLICKANn&amp;quot; by NOMFRAS (cf. below; the pluses are inserted to make the constituent one continuous string). We have reserved the letters n, v and s for the major categories nouns or noun phrases, verbs or verbal groups, and sentences, respectively, whereas other more or less transparent letters are used for other categories. (A list of used category symbols is presented in the Appendix: Printout Illustrations.) The program SWEMRF (or SMURF, as it is called here) has been extensively described elsewhere (Brodda, 1979). It makes a rather intricate morphological analysis word-by-word in running text (i.e. SMURF analyzes each word in itself, disregarding the context it appears in). SMURF can be run in two modes, in &amp;quot;segmentation&amp;quot; mode and &amp;quot;analysis&amp;quot; mode. In its segmentation mode SMURF simply strips off the possible affixes from each word; it makes no use of any stem lexicon. (The affixes it recognizes are common prefixes, suffixes - i.e. derivational morphemes - and flexional endings.) In analysis mode it also tries to make an optimal guess of the word class of the word under inspection, based on what (combinations of) word formation elements it finds in the word.</Paragraph>
    <Paragraph position="7"> SMURF in itself is organized entirely according to the heuristic principles as they are conceived here, i.e. as a set of independently operating processes that interactively work on each other's output. The SMURF system has been the test bench for testing out the methods now being used throughout the entire Heuristic Parsing Project.</Paragraph>
    <Paragraph position="8"> In its segmentation mode SMURF functions formally as a set of interactive transformations, where the structural changes happen to be extremely simple, viz. simple segmentation rules of the type &amp;quot;P=&gt;P-&amp;quot;, &amp;quot;S=&gt;-S&amp;quot; and &amp;quot;E=&gt;-E&amp;quot; for an arbitrary Prefix, Suffix and Ending, respectively, but where the &amp;quot;job&amp;quot; essentially consists of establishing the corresponding structural descriptions. These are shown in Ill. 1, below, together with sample analyses. It should be noted that phonotactic constraints play a central role in the SMURF system; in fact, one of the main objectives in designing the SMURF system was to find out how much information actually was carried by the phonotactic component in Swedish. (It turned out to be quite a lot; cf. Brodda 1979. This probably holds for other Germanic languages as well, which all have a rather elaborate phonotaxis.) NOMFRAS is the next program to be commented on. The present version recognizes structures of the type det/quant + (adj)^n + noun, where the &amp;quot;det/quant&amp;quot; categories (i.e. determiners or quantifiers) are defined explicitly through enumeration - they are supposed to belong to the class of &amp;quot;surface markers&amp;quot; and are as such identified by The Closed Cat. Adjectives and nouns on the other hand are identified solely on the ground of their &amp;quot;cadences&amp;quot;, i.e. what kind of (formally) ending-like strings they happen to end with. The number of adjectives that are accepted (n in the formula above) varies depending on what (probable) type of construction is under inspection. In indefinite noun phrases the substantial content of the expected endings is, to say the least, meager, as both nouns and adjectives in many situations only have Ø-endings.
In definite noun phrases the noun mostly - but not always - has a more substantial and recognizable ending and all intervening adjectives have either the cadence -A or a cadence from a small but characteristic set. In a (supposed) definite noun phrase all words ending in any of the mentioned cadences are assumed to be adjectives, but in (supposed) indefinite noun phrases not more than one adjective is assumed unless other types of morphological support are present.</Paragraph>
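    The pattern det/quant + adjectives + noun for definite noun phrases can be sketched as a small finite-state scan in Python. This is only an illustration under explicit assumptions: the determiner set and the noun cadences are hypothetical samples, determiners are looked up directly instead of being read from Closed Cat tags, and the treatment of indefinite phrases and of nouns that themselves end in -A is left out.

```python
DETERMINERS = {"DEN", "DET", "DE"}       # assumed sample of enumerated surface markers
ADJ_CADENCES = ("A",)                    # the adjective cadence named in the text
NOUN_CADENCES = ("AN", "EN", "ET")       # hypothetical definite-noun cadences

def nomfras(record):
    words = record.split(" ")
    out, i = [], 0
    while i != len(words):
        j = i + 1
        if words[i] in DETERMINERS:
            # Skip over any run of adjective-like words.
            while j != len(words) and words[j].endswith(ADJ_CADENCES):
                j += 1
            # Accept the phrase only if a noun-like word closes it.
            if j != len(words) and words[j].endswith(NOUN_CADENCES):
                out.append("n" + "+".join(words[i : j + 1]) + "n")
                i = j + 1
                continue
        out.append(words[i])
        i += 1
    return " ".join(out)
```

    Applied to " DEN LILLA FLICKAN SPRINGER . " the sketch returns " nDEN+LILLA+FLICKANn SPRINGER . ", the bracket format described earlier. A determiner that is not followed by a noun-like cadence is passed through untouched.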
    <Paragraph position="9"> The Finite State Scheme behind NOMFRAS is presented in Ill. 2, together with sample outputs; in this case the text has been preprocessed by The Closed Cat, and it appears that these two programs in cooperation are able to recognize noun phrases of the discussed type correctly in well over 95% of the cases in running text (at a speed of about 5 sentences per second, CPU-time); the errors were split about evenly between over- and undergeneration.</Paragraph>
    <Paragraph position="10"> Preliminary experiments aiming at including also SMURF and PREPPS (Prepositional Phrases) seem to indicate that about the same recall and precision could be maintained for arbitrary types of (nonsentential) noun phrases (cf. Ill. 6). (The systems are not yet trimmed to the extent that they can be operatively run together.) IFIGEN (Infinitive Generator) is another rather straightforward Finite State Pattern Matcher (developed by Gunnel Källgren). It recognizes (groups of) nonfinite verbs. Somewhat simplified, it can be represented by the following diagram (remember the conventions for upper and lower case): [diagram not preserved] where &amp;quot;Aux&amp;quot; and &amp;quot;Adv&amp;quot; are categories recognized by The Closed Cat (tagged &amp;quot;g&amp;quot; and &amp;quot;a&amp;quot;, respectively), and &amp;quot;nXn&amp;quot; are structures recognized by either NOMFRAS or, in the case of personal pronouns, by CC. (It is worth mentioning that the class of auxiliaries in Swedish is more open than the corresponding word class in English; besides the &amp;quot;ordinary&amp;quot; VARA (&amp;quot;to be&amp;quot;), HA (&amp;quot;to have&amp;quot;) and the modals, there is a fuzzy class of semi-auxiliaries like BÖRJA (&amp;quot;begin&amp;quot;) and others; IFIGEN makes use of about 20 of these in the present version.) The supine cadence -(A/I)T is supposed to appear only once in an infinitival group. A sample output of IFIGEN is given in Ill. 3. Also for IFIGEN we have reached a recognition level around 95%, which, again, is rather astonishing, considering how little information actually is made use of in the system.</Paragraph>
    <Paragraph position="11"> The IFIGEN case illustrates very clearly one of the central points in our heuristic approach, namely the following: The information that a word has a specific cadence, in this case the cadence -A, is usually of very little significance in itself in Swedish. Certainly it is a typical infinitival cadence (at least 90% of all infinitives in Swedish have it), but on the other hand, it is certainly a very typical cadence for other types of words as well: FLICKA (noun), HELA (adjective), DENNA/DETTA/DESSA (determiners or pronouns) and so on, and these other types are by far the dominant group having this specific cadence in running text. But, in connection with an &amp;quot;infinitive warner&amp;quot; - an auxiliary, or the word ATT - the situation changes dramatically. This can be demonstrated by the following figures: In running text words having the cadence -A represent infinitives in about 30% of the cases. ATT is an infinitive marker (equivalent to &amp;quot;to&amp;quot;) in almost exactly 50% of its occurrences (in the other 50% it is a subordinating conjunction). The conditional probability that the configuration ATT ..-A represents an infinitive is, however, greater than 99%, provided that characteristic cadences like -ARNA/-ORNA and quantifiers/determiners like ALLA and DESSA are disregarded. (In our system they are marked by SMURF and The Closed Cat, respectively, and thereby &amp;quot;saved&amp;quot; from being classified as infinitives.) Given this, there is almost no over-generation in IFIGEN, but Swedish allows for split infinitives to some extent. Quite a lot of material can be put in between the infinitive warner and the infinitive, and this gives rise to some undergeneration (at present). (Similar observations regarding conditional probabilities in configurations of linguistic units have been made by Mats Eeg-Olofsson, Lund, 1982.)</Paragraph>
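    The configuration rule discussed above, that a word with the cadence -A following ATT is almost certainly an infinitive once certain cadences and words are excluded, can be sketched in Python as follows. The function name is hypothetical, only direct adjacency to ATT is checked (so split infinitives are missed, as in the undergeneration mentioned above), and the exclusion lists contain just the examples named in the text.

```python
EXCLUDED_CADENCES = ("ARNA", "ORNA")   # cadences named in the text
EXCLUDED_WORDS = {"ALLA", "DESSA"}     # quantifiers/determiners named in the text

def infinitive_after_att(words):
    # Return the indexes of words that directly follow ATT and look like infinitives.
    hits = []
    for i in range(1, len(words)):
        w = words[i]
        if (words[i - 1] == "ATT" and w.endswith("A")
                and not w.endswith(EXCLUDED_CADENCES)
                and w not in EXCLUDED_WORDS):
            hits.append(i)
    return hits
```

    For "HON LOVADE ATT KOMMA" the sketch reports position 3 (KOMMA), while "HON VET ATT FLICKORNA KOMMER" and "HON VET ATT ALLA KOMMER" yield no hits, since FLICKORNA and ALLA are excluded before the cadence -A is considered.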
  </Section>
</Paper>