File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/c92-3160_metho.xml
Size: 17,371 bytes
Last Modified: 2025-10-06 14:13:01
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-3160"> <Title>AUTOMATIC PROOFREADING OF' FROZEN PIIRASES IN GERMAN</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AUTOMATIC PROOFREADING OF' FROZEN PIIRASES IN GERMAN RAI ,F KI';S\[';, FRIEDR1CH I)UDI)A, MARIANNE KUGI,F,R TA Triumph-Adler AG TA Research EF </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> \[&quot;rozen phrases are introduced as a new level of automatic proofieadiltg in between the stm,dard level of spelling verification tfl ist)lated words and tile levcl of grammar checking. The design and the iulplenlentatioll of a corresponding proofleading system are described in detail.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> European languages contain thousands (if what Mauriee Gross calls &quot;frozen&quot; or &quot;compound words&quot; (G ross, 1986). 111 contrast to &quot;free forms&quot;, frozen words - though being separable into several words and suffixes lack syntactic and/or semantic comtlositionality. This &quot;lack of conlpositionality is apparent fiom lexical restrictions&quot; (at night, but: *at day, *at evening, etc.) as well as &quot;by the impossibility of inserting material that is a pliori plausil~lc?' (*at {coming, tlresent, cokt, dark} night) (Gross, 1986).</Paragraph> <Paragraph position="1"> Now, these kinds of co-occurrence restrictions (ltaxris, 19701 determine not only the concrete lexical composition of an individual conlptltnld word but also its spelling.</Paragraph> <Paragraph position="2"> Consider, as an example, the Gernlan noun &quot;Bezug&quot; which-like any other nolni ill German- starts with a capital letter and occurs as a free form--e.g, in contexts like &quot;mit Bezug anl&quot;-such that its co-(iccnrrents can w~ry freely. There is a single exceptk/n to this rule, nalnely the phrase &quot;in bczug auff, which is entirely flozen, in tile sense that it excludes any variation of its tlarls tit' structure, and which, by the sanle token, restricts the spelling of the noun to lower case.</Paragraph> <Paragraph position="3"> t:or examples like this we introduce the term &quot;frozen phrases&quot; as referring to (the sub-class of) those frozen words that are compounds of several words. As Zimmermanll (Zimmcrmann, 1987) points out for multL words in general, frozen phrases are clearly (~ttt of scope of standard spelling correction systems due to tile fact that these systems ctlcck for isolated words Olfly and disregard thc respective contexts. Yet, as Gross (Gross, 1986) indicates, at least tile entirely and nearly entirely tl-OZell Rirms can be recognized and compared with the hel l) of simple string operations. Thus, these kinds of frozell phi%lSCS are accessible t(1 tile lllet\[lOdS of classical autolnatic proofleadillg.</Paragraph> <Paragraph position="4"> Folk)wing Gross (Gross, 1986) and Zimlnermann (Zimmermanll, 1987), we prt/posc to further extend standard spelling correction systems onto the level of frozen phrases by simply lnaking thenl capable of treating more than a single word at a time.</Paragraph> <Paragraph position="5"> This implies that we arrive at a context-sensitive system.</Paragraph> <Paragraph position="6"> Focusing (m individttal co-occt, rrences (Harris, 19701, the proposed extenskln will be a conservatiw: one, ill tile sense that it requires just widening the scope of the string matching/comparing operations that classically are used in spelling correction systenls. No deep and time-cousuming analysis, like palsiilg, is involved.</Paragraph> <Paragraph position="7"> Restricting the systenl that way makes our approach different fr(nn tile one considered in (bLimon and Herz, 1991), where a context sensitive spelling velificati(lli is proposed to be done with the help of &quot;local constraints auttnnata (I.CAs)&quot; which process contextual COilstraillls on the level of lexical or syntactic categories rather than on the basic level of strings, hi fact, proof-reading with LCAs rather anlounts to genuine granlnlar checking aim as such Imltmgs to a different and higher level of lmlguage checking.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> ACRES DE COLING-92, NANrl.;S, 23-28 A(:@I 1992 1 0 2 9 I'ROC. OF COLING-92, NANTES, AUG. 23-28, 1992 2 Phenomenology </SectionTitle> <Paragraph position="0"> The essential feature of the new level of automatic proofreading is that lexicalized co-occurrence restrictions determine the orthography of whole phrases beyond the scope of what can be accepted or rejected on the basis of isolated words alone.</Paragraph> <Paragraph position="1"> For German, these restrictions determine, among other things, whether or not certain forms (1) begin with an upper or lower case letter; (2) have to be separated by (2.1) blank, (2.2) hyphen, (2.3) or not at all; (3) combine with certain other forms; or even (4) influence punctuation. Examples are: (t) lch laufe eis. versus Ich laufe aufdem Eis.</Paragraph> <Paragraph position="2"> Er diirfle Bankrott machen, versus Er dfirfle bankrott se&.</Paragraph> <Paragraph position="3"> (2.1) Sic kann nicht Fahrrad fahren, versus (2.3) Sic kann nicht radfahren.</Paragraph> <Paragraph position="4"> (2.1) Es war bitter kalt. versus (2.3) Es war ein bitterkalter Tag.</Paragraph> <Paragraph position="5"> (2.2) Er liebt lch-Romane, versus (2.3) Er liebt Romane in lchforn,.</Paragraph> <Paragraph position="6"> (3) BetonblOcke vs. *Betonblocks versus Hiluserblocks vs. *HduserblOcke (4) Er rauchte~ ohne daft sic dawm wugte. vcrsus *Er rauchlc ohne~ daft sic daVOll wuBle.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Design </SectionTitle> <Paragraph position="0"> Like conventional spelling correction of isolated words, proofreading of frozen phrases is a lexicon-based process.</Paragraph> <Paragraph position="1"> However, while in conventional spelling correction &quot;referencing a dictionary of correctly spelled words&quot; (Frisch and Zamora, 1988) is standard, proofreading of the higher-level orthographic featnres mentioned in 2 above can be done on the basis of a lexicon that encodes the corresponding error patterns directly.</Paragraph> <Paragraph position="2"> Thus, each entry in the system lexicon is modelled as a quintuple <W,L,R,C,E> specifying an error pattern of a (multi-) word W for which a correction C will be proposed accompanied by an explanation E just in case a given match of W against some passage in the text nnder scrutiny differs significantly from C and the - possibly empty - left and right contexts L and R of W also match the environment of W's counterpart in the text.</Paragraph> <Paragraph position="3"> Disregarding E for a moment, this is tantalnotmt to saying that each such record is interpreted as a string rewriting nile W-->C / L R rephtcing W (e.g.: Bezug) by C (e.g.: bezug) in the environment 1_, R (e.g.: in auf).</Paragraph> <Paragraph position="4"> The form of these productions can best be characterized with an eye to the Chomsky hierarchy as unrestricted, since we can have any non-mill number of symbols on the LHS replaced by any number of symbols on the RHS, possibly by null (Partee et al., 1990). With an eye to semi-Thue o1 extended axiomatic systems one could say that a linearly ordered sequence of strings W, C1, C2, ..., Cm is a derivatkm of Cm iff (1) W is a (faulty) string (in the text to be corrected) and (2) each Ci follows from the immediately preceding string by one of the productions listed in the lexicon (Partee et al., 1990). Thus, theoretically, a single mistake can be corrected by applying a whole sequence of productions, though in practice the default is clearly that a correction be done in a single derivational step, at least as long as the system is just operating on strings and not on additional non-terlnimll sylnbols.</Paragraph> <Paragraph position="5"> Occurrences of W, L, and R in a text are recognized by pattern matching techniques.</Paragraph> <Paragraph position="6"> An error pattern W ignores the particnlarly error-prone aspects npper/k)wer case and word separator (sec the examples in 2 above). It thus matches both the correct and incorrect spellings with respect to these features.</Paragraph> <Paragraph position="7"> Beside wildcards ff)r characters, like &quot;*&quot;, a pattern for W, l J, or R may contain also wildcards for words allowing, for example, the specification of a maxinml distance of L or R with respect to W. Since the types of errors discussed here only occur within ACRES DE COLING-92, NANTES. 23-28 AObq' 1992 1 0 3 0 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 seuteuces, such a distant match has to be restricted by the sentence boundaries. Thus, by having the system operate seutencewisc, any left or right context is naturally restricted to be some string within the same sentence as W mr to be at bonndary of that sentence (e.g.: a punctuation mark).</Paragraph> <Paragraph position="8"> Any left or right context is either a positive or at negative one, i.e., its COUlpOneuts are homogeneously either required or forbidden in order for the corresponding rule to fire. So far it has not been necessary to allow tku * mixed modes within it left or right context.</Paragraph> <Paragraph position="9"> In case a correction C is proposed to the user, additionally at message will be displayed to him identifying the reason why C is correct rather than W. l)epending on the user's knowledge of tile language under investigation, he can take this either as an opporttulity to learn or rather as a help tier deciding whether to finally accept or reject the proposal.</Paragraph> <Paragraph position="10"> There are two kinds of explanations, absolute and conditional ones. Whereas absolute rules indicate that the system has necessary and sufficient evidence fi~r W's deviance, there clearly are cases where either W or C could he correct and this question cannot be decided on the basis of the system's lexical infl)rmation alone, hi these cases, a conditiol,al o1 if-then explanatkm is given to the user offering a higher-level deciskm criterion which the system itself is unable to apply.</Paragraph> <Paragraph position="11"> Take, as an example, the sentence Dieser Fihn hetrifft exit und Jung.</Paragraph> <Paragraph position="12"> which clearly allows for two readings, oue which renders &quot;Alt und Jmlg&quot; as the false spelling of the frozen phrase &quot;all und juug&quot; meaning &quot;everybody&quot;, and another one which takes &quot;AIt und Jung&quot; as the correct free form that literally designates the old aim the young while excluding the mktdle-aged.</Paragraph> <Paragraph position="13"> Thus, suhstitutahility by &quot;jedennann&quot; (i.e.: &quot;everybody&quot;) wouM be an adequate decision criterion to convey to the user.</Paragraph> <Paragraph position="14"> In its present design, tile system is based on the simplest possible working hyl~othesis, i.e. the assumption that the higherqevel or cognitive errors in a sentence cau be corrected independently. The intuition behind this assumption is that, normally, cognitive (or orthographical) errors are by far less frequent than ordinary motoric (or typographical) errors (for this distinction see Berkel and Smcdt, 1990), and that, as a consequence of this, they occur within distinct contexts such that it is excluded that the correction of one error be conditional upon the correction of another.</Paragraph> <Paragraph position="15"> According to this assumption, each sentence of the lext is processed only once in the fffllowing nmnner: The system reads from a plain text copy, T2, of the original formatted text, T I, processes the errors on T2 one after the other, and finally writes the corresponding corrections to TI, without also writing them to T2.</Paragraph> <Paragraph position="16"> Now, an abstract counterexample to the working hypothesis can be construed quite easily: Given a sentence containing tile sequence of errors ... Wl W2 ...</Paragraph> <Paragraph position="17"> and given lcxical rules</Paragraph> <Paragraph position="19"> rewriting (i,e. correcting) WI as C1 in any context, and W2 as C2 if preceeded by CI, then, clearly, tile system will correct Wl but will fail to correct W2.</Paragraph> <Paragraph position="20"> For the system to also correct W2, it must be able to take its previous output into accot, nt again. That is, it should not only read from T2 but also write to it.</Paragraph> <Paragraph position="21"> t Iowever, to allow corrections to be written also to q'2 would mean to stepwise introduce tleW contexts on W2. As a consequence, tile system woukl halve to redo the checking of tile whtfle sentence each time a correction has been made, tbr this correction might well bca relevant context of some previously cncouutered expression.</Paragraph> <Paragraph position="22"> Thus, giving up the working hypothesis would result in having the system take multiple runs through one and the same sentence instead of a single rtul, and this, of course, would drastically reduce system performance.</Paragraph> <Paragraph position="23"> Fortunately, we have not yet come across any (signi\[icaut atllount el) data that would justify such a redesign of the system.</Paragraph> <Paragraph position="24"> AC'I'ES l)E COLING-92, NANTES, 23-28 ^o(rr 1992 1 0 3 1 l'Roc, ov COLING-92, NANTES, AUG. 23-28, 1992 However, since the data captured in the system's lexicon covers at present some 50 % of the relevant phenomena compared to the Duden (Berger 1985), the ultimate complexity of tile system has to be regarded as an open and empirical question.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Status of Implementation </SectionTitle> <Paragraph position="0"> A first prototype of the system described above has been developed in C under UNIX within the ESPRIT I1 project 2315 &quot;Translator's Workbench&quot; (TWB) as one of several orthogonal modules checking basic as well as higher levels (like grammar and style; see (Thurmair, 1990) and (Winkelmann, 1990)) of various languages.</Paragraph> <Paragraph position="1"> A derived and extended version - covering 3.000 rewriting rules and some 80 explanations - has been integrated into both a proprietary text processing software under DOS and Microsoft's WOP, D FOR WINDOWS, version 1.1.</Paragraph> <Paragraph position="2"> This extended version has been combined with a conventional spelling verifier to form a single proofreader for the user. Internally, however, its hidden sub-modules are still totally independent from one another and process a sentence one alter the other.</Paragraph> <Paragraph position="3"> Thus, it may happen that the checkers disturb each others results by proposing antagonistic corrections with respect to one and the same expression: Within the correct passage &quot;in bezug auf', tbr example, &quot;bezug&quot; will first be regarded as an error by the standard checker which then will propose to rewrite it as &quot;Bezug&quot;. If the user accepts this proposal he will receive the exactly opposite advice by the frozen phrases checker.</Paragraph> <Paragraph position="4"> On the other hand, checking on different levels could nicely go hand in hand and produce synergetic effects: For, clearly, any context sensitive checking requires that the contexts themselves be correct antl thus possibly have been corrected in a previous, possibly context free, step. The checking of a single word could in turn profit from contextual knowledge in narrowing down the number of correction alternatives to be proposed for a given error: While there may be some eight or nine plausible candidates as corrections of &quot;Bezjg&quot; when regarded in isolation, only one candidate, i.e. &quot;bezug&quot;, is left when the context &quot;in auf&quot; is taken into account.</Paragraph> <Paragraph position="5"> The same interdependence seems to exist with respect to higher levels of language checking. At least it can be argued that a grammar checker will profit from integration with a frozen phrases checker: For nothing but an expert for frozen phrases can verit3, the correctness of (idiomatic) expressions like &quot;ruhig Blut bewahren&quot; or &quot;auf gut Gliick&quot;, and thus prevent a grammar checker from flagging the missing inflection of the adjective (&quot;ruhig&quot; or &quot;gut&quot;) in attributive position.</Paragraph> <Paragraph position="6"> Thus, there is a strong demand for arriving at a holistic solution ff)r multi-level language checking rather than for just having various level experts particularistically hooked together in series. This will be the task for the near future.</Paragraph> </Section> class="xml-element"></Paper>