File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/c92-1010_metho.xml
Size: 33,201 bytes
Last Modified: 2025-10-06 14:12:53
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1010"> <Title>PARSING AGGLUTINATIVE WORD STRUCTURES AND ITS APPLICATION TO SPELLING CHECKING FOR TURKISH</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> PARSING AGGLUTINATIVE WORD STRUCTURES AND ITS APPLICATION TO SPELLING CHECKING FOR TURKISH by AYŞİN SOLAK and KEMAL OFLAZER </SectionTitle> <Paragraph position="0"> Department of Computer Engineering and Information Sciences</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Bilkent University </SectionTitle> <Paragraph position="0"> Bilkent, Ankara, 06533 Türkiye</Paragraph> </Section> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Most of the research on parsing natural languages has been concerned with English, or with other languages morphologically similar to English. Parsing agglutinative word structures has attracted relatively little attention, most probably because agglutinative languages contain word structures of considerable complexity, and parsing words in such languages requires morphological analysis techniques. In this paper, we present the design and implementation of a morphological root-driven parser for Turkish word structures which has been incorporated into a spelling checking kernel for on-line Turkish text. The agglutinative nature of the language and the resulting complex word formations, various phonetic harmony rules and subtle exceptions present certain difficulties not usually encountered in the spelling checking of languages like English and make this a very challenging problem.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. 
Introduction </SectionTitle> <Paragraph position="0"> Morphological classification of natural languages according to their word structures places languages like Turkish, Finnish, and Hungarian into a class called &quot;agglutinative languages&quot;. In such languages, words are a combination of several morphemes. There is a root, and several suffixes are combined with this root in order to modify or extend its meaning. What characterizes agglutinative languages is that stem formation by affixation to previously derived stems is extremely productive. A given stem, even though itself quite complex, can generally serve as a basis for even more complex words. Consequently, agglutinative languages contain words of considerable complexity, and parsing such languages necessitates a thorough morphological analysis.</Paragraph> <Paragraph position="1"> Morphological parsing has attracted relatively little attention in computational linguistics. The reason is that nearly all parsing research has been concerned with English, or with languages morphologically similar to English. Since in such languages words contain only a small number of affixes, or none at all,</Paragraph> <Paragraph position="2"> almost all of the parsing models for them consider recognizing those affixes as being trivial and thus do not require a morphological analysis. In agglutinative languages, words contain no direct indication of morpheme boundaries, which are, in general, dependent on the morphological and phonological context. A morphological parser requires a morphophonological component which mediates between the surface form of a morpheme as encountered in the input text and the lexical form in which the morpheme is stored in the morpheme inventory, i.e., a means of recognizing variant forms of morphemes as the same, 
and a morphotactic component which specifies which combinations of morphemes are permitted [7]. Morphological parsing algorithms may be divided into two classes as affix stripping and root-driven analysis methods. Both approaches have been used from very early on in the history of morphological parsing. For instance, Packard's parser for ancient Greek [15]</Paragraph> <Paragraph position="3"> and Brodda and Karlsson's for Finnish [3] used affix stripping. Sagvall, on the other hand, devised a root-driven morphological analyzer for Russian [17]. In addition, other root-driven morphological parsers for the agglutinative languages Quechua [9, 10], Finnish [11], and Turkish [6] were developed independently in the early 1980's. All of these three parsers proceed from left to right. Roots are sought in the lexicon that match initial substrings of the word, and the grammatical category of the root determines what class of suffixes may follow. When a suffix in the permitted class is found to match a further substring of the word, grammatical information in the lexical entry for that suffix determines once again what class of suffixes may follow. If the end of the word can be reached by iteration of this process, and if the last suffix analyzed is one which may end a word, the parse is successful [7].</Paragraph> <Paragraph position="4"> Another left-to-right parsing algorithm for automatic analysis of Turkish words was proposed and applied by Köksal in his Ph.D. thesis [12]. His algorithm, called &quot;Identified Maximum Match (IMM) Algorithm&quot;, tries to find the maximum length substring, which is present in a root dictionary, from the left of the word. If a solution is obtained, i.e., the root morpheme is identified, the remaining part of the word is considered as the search element. 
This part is looked for in the suffix morpheme forms dictionary and the morphemes are identified one by one. The process stops when there is no remaining part. However, in some cases, although a solution is obtained, further consistency analysis proves that this solution is not the correct one. In such cases the previous pseudo solution is reduced by one character and the whole search procedure is initiated once more.</Paragraph> <Paragraph position="5"> [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 39 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992] These approaches to morphological parsing of Turkish words have the following shortcoming: they do not consider the fact that in Turkish, words contain a tremendous amount of semantic information that has to be taken into account. In these parsers, it is only the grammatical category of the stem that determines the suffixes that may follow. However, most of the suffixes in Turkish, especially the derivational ones, can be attached only to a limited number of roots or stems, mostly due to semantic reasons.</Paragraph> <Paragraph position="6"> Another shortcoming of the previous parsers for Turkish is that they allow the iterative usage of derivational suffixes. Although Köksal [12] prevents the consecutive usage of the same morpheme twice, he still parses the word GÖZLÜKÇÜLÜKÇÜLÜK correctly, and so does Hankamer [7]. It is true that some Turkish suffixes can form an iterative loop, but usually the number of iterations is not too high. 
The above word can be parsed correctly up to the point GÖZLÜKÇÜLÜK (the occupation of oculists), but the words GÖZLÜKÇÜLÜKÇÜ and GÖZLÜKÇÜLÜKÇÜLÜK are meaningless, and therefore some control mechanisms using semantic information should be included within the parser to avoid parsing such meaningless words as if they were correct.</Paragraph> <Paragraph position="7"> One of the most important application areas of parsing words in natural languages is checking their spellings. Although many spelling checkers for English and some other languages have been developed, so far no such tool was present for Turkish. The reason for this is probably the complexity of the parsing problem for Turkish as explained above. Wrong ordering of morphemes and errors in vowel or consonant harmonies may cause the wrong spelling of Turkish words. Consequently, in order to check the spelling of a Turkish word, it is necessary to make significant phonological and morphological analyses.</Paragraph> <Paragraph position="8"> This paper describes a morphological root-driven parser developed for the Turkish language and its application to spelling checking. A major portion of this work depends on detailed and careful research on some features of Turkish that make the parsing problem for this language especially hard and interesting.</Paragraph> <Paragraph position="9"> The following section presents an overview of certain morphophonemic and morphological aspects of the Turkish language which are especially relevant to the problem under consideration (for details see [20]).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. 
The Turkish Language </SectionTitle> <Paragraph position="0"> Turkish is an agglutinative language that belongs to a group of languages known as Altaic languages. For an agglutinative language, the concept of word is much larger than the set of vocabulary items. Word structures can grow to be relatively long by addition of suffixes and sometimes contain an amount of semantic information equivalent to a complete sentence in another language. A popular example of complex Turkish word formation is ÇEKOSLOVAKYALILAŞTIRAMADIKLARIMIZDANMIŞSINIZ whose equivalent in English is &quot;(it is speculated that) you had been one of those whom we could not convert to a Czechoslovakian.&quot; In this example, one word in Turkish corresponds to a full sentence in English.</Paragraph> <Paragraph position="1"> Each suffix has a certain function and modifies the semantic information in the stem preceding it. In our example, the root morpheme ÇEKOSLOVAKYA is the name of the country Czechoslovakia and the suffix -LI converts the meaning into Czechoslovakian, while the following suffix -LAŞ makes a verb from the previous stem meaning to become a Czechoslovakian (footnote 1), and so on.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1. Turkish Phonetic Model </SectionTitle> <Paragraph position="0"> Being phonetic, the Turkish language can be adapted to a number of different alphabets. In the past, various alphabets have been used to transcribe Turkish, e.g., Arabic. Since 1928, Latin characters have been used. The Turkish alphabet consists of 29 letters of which 8 (A, E, I, İ, O, Ö, U, Ü) are vowels, and 21 (B, C, Ç, D, F, G, Ğ, H, J, K, L, M, N, P, R, S, Ş, T, V, Y, Z) are consonants.</Paragraph> <Paragraph position="1"> Turkish word formation uses a number of phonetic harmony rules. 
Vowels and consonants change in certain ways when a suffix is appended to a root, so that such harmony constraints are not violated.</Paragraph> <Paragraph position="2"> 2.1.1. Vowel Change in Suffixes Almost all suffixes in Turkish use one of two basic vowels and their allophones. We have denoted these sets of allophones with braces around the main vowels A and I, as {A} and {I}. The allophones of {A} are A and E, whereas {I} represents I, İ, U, or Ü. The vowels O and Ö are only used in root morphemes (especially in the first syllable) of Turkish words (footnote 2). The vowel harmony rules require that vowels in a suffix change according to certain rules when they are affixed to a stem. The first vowel in the suffix changes according to the last vowel of the stem. Succeeding vowels in the suffix change according to the vowel preceding them. If we denote the preceding vowel (be it in the stem or in the suffix) by v, then {A} is resolved as A if v is A, I, O, or U, otherwise it is resolved as E.</Paragraph> <Paragraph position="3"> On the other hand, {I} is resolved as I if v is A or I, as İ if v is E or İ, as U if v is O or U, and as Ü if v is Ö or Ü. For example, the word &quot;YAPMAYACAKTINIZ&quot; can be broken into suffixes as: YAP/M{A}/[Y]{A}C{A}{K}/{D}{I}/N{I}Z (see footnotes 3, 4 and 5 on [Y], {K} and {D}). Footnote 1: From now on, we will indicate the English meaning of a word in Turkish in parentheses following it. Footnote 2: The progressive tense suffix {I}YOR is an exception. Footnote 3: [ ] indicates an optional morpheme that must be inserted before a suffix to satisfy certain harmony rules. In this case, [Y] indicates that the consonant Y must be inserted if the last letter of the stem is a vowel, otherwise it is dropped: e.g., OKU (read) - 
OKUYACAK (s/he will read), but SOR (ask) - SORACAK (s/he will ask). Footnote 4: The two allophones of {K} are K and Ğ. Footnote 5: The two allophones of {D} are D and T.</Paragraph> <Paragraph position="4"> It can be seen that the vowels in the correct spelling of the word obey the rules above, while a spelling like *YAPMAYACEKTİNİZ violates the harmony rules because an {A} in the suffix can not resolve to an E as the preceding vowel is an A. It should be mentioned in passing that there are also some suffixes, such as -KEN, whose vowels never change.</Paragraph> <Paragraph position="5"> 2.1.2. Consonant Harmony Another basic aspect of Turkish phonology is consonant harmony. It is based on the classification of Turkish consonants into two main groups, voiceless and voiced. The voiceless consonants are Ç, F, T,</Paragraph> <Paragraph position="6"> H, S, K, P, Ş. The remaining consonants are voiced. Interested readers can find the complete list of consonant harmony rules in Köksal [12] and Solak [20]. To give an example, one of the rules says that if a suffix begins with one of the consonants D, C, G,</Paragraph> <Paragraph position="7"> this consonant changes into T, Ç, K respectively, if a voiceless consonant is present as the final phoneme of the previous morpheme, e.g., YOLDA (on road), but UÇAKTA (on plane). Some morphemes are affixed with the insertion of either N, S, Ş, or Y when two vowels happen to follow each other (e.g., BAHÇESİ (his/her garden), BAHÇEYİ (accusative of garden), İKİŞER (two each)), or when there is another morpheme following (e.g., 
BAHÇESİNDE (in his/her garden)), or in the context of some pronouns (e.g., BUNA (to this), KENDİNDEN (from yourself)) and the pronominal suffix -Kİ (e.g., SENİNKİNİ (accusative of yours)).</Paragraph> <Paragraph position="8"> In our example above, the future tense suffix [Y]{A}C{A}{K} comes after the stem YAPMA, and since the last phoneme is a vowel, Y is inserted.</Paragraph> <Paragraph position="9"> 2.1.3. Deformation of Roots Normally Turkish roots are not flexed. However, there are some cases where some phonemes are changed by assimilation or various other deformations [12]. An exceptional case related to the flexion of roots is observed in the personal pronouns BEN (I) and SEN (you), having datives BANA (to me) and SANA (to you) respectively. These are individual cases and can be treated as exceptions.</Paragraph> <Paragraph position="10"> A more systematic ellipsis occurs when the suffix {I}YOR comes after the verbal roots and stems ending with the phoneme {A}. In such cases, the wide vowel at the end of the stem is narrowed, e.g., YAP - YAPIYOR (s/he/it is doing [it]), but ARA - ARIYOR (s/he/it is searching).</Paragraph> <Paragraph position="11"> Another root deformation occurs as a vowel ellipsis. When a suffix beginning with a vowel comes after some nouns, generally designating parts of the human body, which have a vowel {I} in their last syllable, this vowel drops, e.g., BURUN (nose) - BURNUM (my nose). Similarly, when the passiveness suffix {I}L is affixed to some verbs whose last vowel is {I}, this vowel also drops, e.g., ÇAĞIRMAK (to call) - ÇAĞRILMAK (to be called). 
Other root deformations and their exceptions can be found in Solak [20].</Paragraph> <Paragraph position="12"> 2.2. Turkish morphology Turkish roots can be classified into two main classes: nominal and verbal. The verbal class comprises the verbs, while the nominal class comprises nouns, pronouns, adjectives, etc. The suffixes that can be received by these two groups are different, i.e., a suffix which can be affixed to a nominal root can not be affixed to a verbal root with the same semantic function. Turkish suffixes can be classified as derivational and conjugational. Derivational suffixes change the meaning and sometimes the class of the stems they are affixed to, while a conjugated verb or noun remains as such after the affixation. Conjugational suffixes can be affixed to all of the roots in the class to which they belong. On the other hand, the number of roots to which each derivational suffix can be affixed varies. The simplified models for nominal and verbal grammars can be given as follows (footnote 6). The nominal model: nominal root + plural suffix + possessive suffix + case suffix + relative suffix. The verbal model: verbal root + voice suffixes + negation suffix + compound verb suffix + main tense suffix + question suffix + second tense suffix + person suffix. Footnote 6: Refer to Solak [20] for detailed information on each of the suffixes in those models and the exceptional cases about them.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Implementation </SectionTitle> <Paragraph position="0"> We have implemented a root-driven morphological analyzer for Turkish and used it as a spelling checking kernel that can be integrated into different applications on a variety of platforms.</Paragraph> <Paragraph position="1"> The program takes a list of Turkish words as input, and then checks them one 
by one in the order they appear. If the spelling of an input word is incorrect, it is output as misspelled. Each word is analyzed individually with no attention to the semantics or to the context. If a word is spelled correctly but is the wrong word in the context, we have no intention of, and no way of, flagging it as erroneous. Thus, as in all other spelling programs, the text is examined with respect to words, not with respect to sentences. In addition, we do not yet give any suggestion about the most likely correct words after detecting a misspelled word, i.e., spelling correction is not done. Word analysis is handled in four steps: syllabification check, root determination, morphophonemic check,</Paragraph> <Paragraph position="2"> and morphological analysis. During these steps a dictionary of Turkish root words, and a set of rules for Turkish syllable structure, morphophonemics, and morphology are used concurrently. All these steps will be explained in the following sections, after a</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> brief information on the dictionary used in this implementation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1. Dictionary </SectionTitle> <Paragraph position="0"> The dictionary is based on the Turkish Writing Guide [23] as the source. Some words in the dictionary have to be marked as having certain semantic and structural properties such as being a verbal root or a nominal root, being a proper noun, not obeying vowel harmony rules, deforming under certain conditions, and so on. 
For example, the word BURUN (nose) has to be marked as being a nominal root, and deforming by vowel ellipsis. For this reason, for each word in the dictionary a series of flags representing certain properties of that word are held. Thus, each entry of the dictionary contains a word in Turkish and a series of flags showing certain properties of that word.</Paragraph> <Paragraph position="1"> Nearly 23,500 words, each having 7 letters on the average, are listed in our current dictionary. 41 flags per word (footnote 7) have been used so far, but later it may be necessary to use more. Because of this, two long integers (whose bits represent flags, for a total of 64 flags) are assigned for every word.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2. Syllabification Check </SectionTitle> <Paragraph position="0"> Analyzing all the words in the Turkish Writing Guide [23] and all the suffixes in Turkish [1, 5], we have constructed a regular expression and a corresponding finite state automaton for validating if a word matches the syllable structure rules of Turkish [18]. This regular expression is used as a heuristic in our spelling checker. The input word is first processed with the regular expression. It is reported as misspelled if its syllable structure can not be matched with this expression, i.e., the phonemes of the word do not form valid sequences according to Turkish syllable structures. On the other hand, if it can be matched, it is further analyzed as it may still be a non-Turkish or a misspelled word.</Paragraph> <Paragraph position="1"> With the help of the syllabification check, most of the typographical errors can be detected. For example,</Paragraph> <Paragraph position="2"> if the word YAPMAK (to make) were typed as YPMAK or YAPMKA, 
the word would not be matched by the expression and its spelling would be reported incorrect. On the other hand, if it were written as YAPMEK, where a vowel harmony error is made, it would pass the syllabification check, but would be reported as misspelled during morphophonemic checks.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3. Root Determination </SectionTitle> <Paragraph position="0"> Before analyzing the morphophonemic and morphological structures of a Turkish word, the root has to be determined. If the word passes the syllabification check, its root is searched in the dictionary using a maximal match algorithm. (Footnote 7: The list of all flags can be found in Solak [20].) In this algorithm, first</Paragraph> <Paragraph position="1"> the whole word is searched in the dictionary. If it is found then the word has no suffixes and therefore its spelling is correct. Otherwise, we remove a letter from the right and search the resulting substring. We continue this by removing letters from the right until we find a root. If no root can be found although the first letter of the word is reached, the word is reported as misspelled.</Paragraph> <Paragraph position="2"> The maximum length substring of the word that is present in the dictionary is not always its root. If further analyses show that the word is misspelled, a new root is searched in the dictionary, this time removing letters from the end of the previous root. If a new root can be found the same operations are repeated, otherwise the word is reported as misspelled. Root determination presents some difficulties when the root of the word is deformed. For the root words which have to be deformed during certain agglutinations, a flag indicating that property is set in the dictionary. For example, the root of the word ŞEHRE (to the city) must be found as ŞEHİR (city). 
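Setting deformed roots aside for a moment, the maximal match search with fallback to shorter roots described in this section can be sketched as below. This is only an illustrative sketch: the toy dictionary, the function names `candidate_roots` and `check_word`, and the `analyse` callback (standing in for the later morphophonemic and morphological checks) are assumptions, not the authors' implementation, and the YAPI/YAP example is not taken from the paper.

```python
# Hypothetical toy dictionary; the real one stores roots with property flags.
ROOTS = {"YAP", "YAPI", "SOL", "BEYAZ"}


def candidate_roots(word, roots=ROOTS):
    """Yield dictionary roots that are prefixes of word, longest first."""
    # Start with the whole word and drop one letter from the right each time.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in roots:
            yield prefix


def check_word(word, analyse, roots=ROOTS):
    """Return True if some candidate root leads to a successful analysis.

    analyse(root, rest) is an assumed callback for the later checks; when it
    rejects a root, the search continues with the next shorter root.
    """
    for root in candidate_roots(word, roots):
        rest = word[len(root):]
        # A word found whole in the dictionary has no suffixes to analyse.
        if rest == "" or analyse(root, rest):
            return True
    # The first letter was reached without finding a usable root.
    return False


print(list(candidate_roots("YAPIYOR")))
```

With this toy dictionary, the longest match for YAPIYOR is YAPI, and YAP is only tried if the analysis of the remainder after YAPI fails, which mirrors the observation that the maximum length substring is not always the root.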
In order to determine it correctly, when the substring ŞEHR is not found in the dictionary, considering that it may be a deformed root by vowel ellipsis, the vowel İ is inserted between the consonants H and R, and the word ŞEHİR is searched in the dictionary. When it is found, the flag corresponding to vowel ellipsis is checked. Since it is set for this word, the root of the word ŞEHRE is determined as ŞEHİR, and the remaining analyses are continued. If that word were written as ŞEHİRE, we should report it as incorrect although the ŞEHİR + dative case suffix form looks correct. For all other root deformations, the real root of the word can be found by making such checks and some necessary changes (see [20]).</Paragraph> <Paragraph position="3"> For some roots both of the forms above are valid.</Paragraph> <Paragraph position="4"> For example, both METNİ (accusative of text) and METİNİ (accusative of strong) are correct although the root of both words is METİN (text, strong), because this word can be used in two different meanings.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4. Morphophonemic Check </SectionTitle> <Paragraph position="0"> Turkish words obey vowel and consonant harmony rules during agglutination (see sections 2.1.1 and 2.1.2). The vowel harmony check may be done just after the root determination, but other morphophonemic checks should be done during morphological analysis. After the root of the word is found, the rest of the word is considered as its suffixes. The first vowel in the suffixes part must be in harmony with the last vowel of the root, while the succeeding vowels must be in harmony with the vowel preceding them. 
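The basic check just stated can be sketched as follows. This is a minimal sketch under stated assumptions: the table `ALLOWED_AFTER` is derived from the resolution rules for {A} and {I} in section 2.1.1, the function name `in_vowel_harmony` is hypothetical, and the exceptions (suffixes with fixed vowels such as -KEN, the suffix {I}YOR, and roots flagged as not obeying harmony) are deliberately left out.

```python
VOWELS = "AEIİOÖUÜ"

# Vowels allowed after each preceding vowel v: the two values that
# {A} and {I} resolve to after v (section 2.1.1).
ALLOWED_AFTER = {
    "A": "AI", "I": "AI", "O": "AU", "U": "AU",
    "E": "Eİ", "İ": "Eİ", "Ö": "EÜ", "Ü": "EÜ",
}


def in_vowel_harmony(root, suffixes):
    """Check every vowel of the suffixes part against the vowel before it."""
    previous = None
    for ch in root:
        if ch in VOWELS:
            previous = ch  # ends up holding the last vowel of the root
    for ch in suffixes:
        if ch not in VOWELS:
            continue
        if previous is not None and ch not in ALLOWED_AFTER[previous]:
            return False
        previous = ch
    return True


print(in_vowel_harmony("YAP", "MAYACAKTINIZ"))   # True
print(in_vowel_harmony("YAP", "MAYACEKTİNİZ"))   # False
```

In this sketch a form such as KONTROL + LER is reported as disharmonic, so a complete checker would still have to consult the dictionary flags and the list of invariant suffixes before rejecting a word.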
Since there are some suffixes, such as -KEN, whose vowels never change, when a disharmony is found, we check whether it is the result of such a suffix (e.g., YANARKEN (while it is burning)).</Paragraph> <Paragraph position="1"> Some words of foreign origin do not obey vowel harmony rules during agglutination (e.g., KONTROL (control)). Before the vowel harmony checks are done, the flag corresponding to that property must be checked. If it is set for the root of the word, the vowel harmony check must be applied inversely. Thus, the first vowel in the suffixes part must be in disharmony with the last vowel of the root (e.g., KONTROLLER (controls)). As another interesting case, some roots that may be used in two meanings, i.e., the homonyms, obey vowel harmony rules when they are used with a certain meaning, while they do not obey them when they are used in the other meaning. For example, both SOLA (to the left) and SOLE (to the note sol) pass the vowel harmony check since their root SOL has two meanings as &quot;left&quot; and &quot;musical note&quot; (footnote 8). The suffixes must be determined before the consonant harmony checks are done. Because of this, these checks are done during morphological analysis, after each suffix is isolated.</Paragraph> <Paragraph position="2"> If a word fails any of the morphophonemic checks, considering the possibility that the root may have been determined wrongly, a new root is searched in the dictionary.</Paragraph> <Paragraph position="3"> 3.5. Morphological Analysis The spelling checker has two separate sets of rules for the two main root classes. For the implementation of the lexical analyzers and parsers in which the rules are included, two standard UNIX utilities, lex and yacc, have been utilized respectively [13]. 
Lex is used to separate the suffixes of a word from left to right, and yacc is used to parse those suffixes using morphological rules of Turkish grammar.</Paragraph> <Paragraph position="4"> The models given in various books on Turkish grammar [1, 2, 4, 5, 14] and previous research on Turkish computational linguistics [12, 16] have been utilized for generating the rules used in the parsers. Additionally, all of the known exceptional cases have also been considered (see [20]). Although all the conjugational suffixes have been included in the rules, only a small subset of the derivational suffixes have been handled. The reasons for this are that the majority of the derivational suffixes may be received by only a small group of roots, and determining such groups is a rather difficult and time-consuming job that depends on various semantic criteria. The derivational suffixes that may be affixed to all of the roots in a class and those which can be affixed to a large percentage, but not all, of the roots in their class are included in the rules. That makes it possible to eliminate a number of words from the dictionary.</Paragraph> <Paragraph position="5"> The two parsers are used alternately. The first parser to be used is determined according to the class of the root, but as the parsing continues it may be necessary to switch from one parser to another and continue there, or again pass back to the previous one, since the class of a stem can change when it receives certain suffixes. (Footnote 8: The word SOL is pronounced slightly differently in the latter.) The switches between parsers can sometimes be very complicated. Some suffixes can have two different usages. 
In such cases both possibilities have to be considered.</Paragraph> <Paragraph position="6"> If a word has received more than one derivational suffix then many switches between parsers will be necessary. For example, the root of the word BEYAZLAŞTIRMAYANLARDAN (from those which do not cause to become white) is found as the noun BEYAZ (white) in our dictionary. Then comes the suffix L{A}Ş, which makes a verb from a noun, therefore a switch to the verb parser has to be made. Parsing continues there until the suffix M{A} is matched.</Paragraph> <Paragraph position="7"> This suffix can either make a verb a noun or negate it. First, considering the possibility that it is used as a derivational suffix, the noun parser is invoked.</Paragraph> <Paragraph position="8"> The remaining part of the word can not be parsed by this parser. So, accepting M{A} as the negation suffix, the verb parser is returned to and parsing continues there. Later comes the suffix [Y]{A}N, which is a suffix that makes a noun from a verb, so again a switch to the noun parser is made. Continuing in this parser, the word is parsed correctly.</Paragraph> <Paragraph position="9"> Some Turkish roots can take the suffixes belonging to both nominal and verbal classes. For such roots, if parsing is unsuccessful in the first parser chosen, the other one must also be tried. For example, the root of the word AÇLAR (hungry people) is AÇ. This root may either be used as a verb (open) or as a noun (hungry). If parsing is first attempted with the verbal parser it will be unsuccessful. So we backtrack and use the nominal parser. With the nominal parser the word can be parsed successfully.</Paragraph> <Paragraph position="10"> Figure 1 shows the block diagram of the word analysis. Summarizing, first, the syllable structure of the word is checked. If it is wrong the word is added into the output list of misspelled words, otherwise the root is determined. 
If no root can be found the word is reported as misspelled. If a root is found, first the vowel harmony check is done. Then, according to the class of the root, one of the parsers is activated. In the parsers, as the suffixes are isolated one by one, necessary morphophonemic checks are done. Depending on the suffixes, switches between the parsers are possible. When the end of the word is reached, if no errors can be found then the spelling of the word is correct. If any error is found in any of the parsers or during morphophonemic checks, a new root is searched. If another root is found the same operations are done. If no successful parsing can be done although the first letter of the word is reached, the word is added into the output list.</Paragraph> </Section> </Section> </Paper>