<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1045"> <Title>Manipulating human-oriented dictionaries with very simple tools</Title> <Section position="2" start_page="0" end_page="283" type="metho"> <SectionTitle> 2. Logical form </SectionTitle> <Paragraph position="0"> The printed form of a dictionary reflects its internal structure (Boguraev 1990, Byrd, Calzolari, Chodorow &amp; al. 1987). This structure can be modelled with a logical form which gives the sequence of the information contained in the dictionary. This logical form contains entries, pronunciation parts, spelling variants, grammatical categories, semantic information, sub-entries, etc. We have defined a logical form of the dictionary (the lines in italic are optional in an entry).
3. ASCII normalized external form A label is linked with every type of information of the logical form and is included in the initial ASCII files. Thus, USM has obtained basic entries such as the one given in fig. 2, which corresponds to the French entry &quot;accident&quot; (the label 'e' corresponds to the French entry, 'pcn' to pronunciation, 'c' to grammatical category, etc.).</Paragraph>
<Paragraph position="1"> Our first aim is to produce from the ASCII normalized form a paper form of the FEM dictionary with a format approaching that of usual dictionaries (fig. 3). This involves the introduction of fonts, styles, etc. into the format.
accident /aksidɑ̃/ n.m. accident : kemalangan, (kejadian) tidak sengaja, (kejadian) secara kebetulan -- accident de train/d'avion train/plane crash : kemalangan keretapi, nahas kapal terbang -- accidenté /aksidɑ̃te/ a. damaged (in an accident) : rosak (dalam kemalangan) hurt (in an accident) : tercedera (dalam kemalangan) (terrain) uneven : tidak rata (kawasan, daerah) hilly : berbukit.</Paragraph>
<Paragraph position="2"> Fig. 3 : an entry of the publishable paper FEM dictionary</Paragraph> </Section> <Section position="3" start_page="283" end_page="283" type="metho"> <SectionTitle> 2. 
Electronic formatting </SectionTitle> <Paragraph position="0"> We also produce an electronic form. This electronic dictionary is supported by a generic multilingual dictionary tool, ALEX. The problem is to keep as much as possible of the logical form, so as to allow logical access such as searching on multiple keys, sorting, etc.</Paragraph> </Section> <Section position="4" start_page="283" end_page="285" type="metho"> <SectionTitle> 3. Dictionary revision </SectionTitle> <Paragraph position="0"> The &quot;crossing&quot; of the French-English and English-Malay dictionaries was done manually by people who were not fluent in French. Thus, some errors remain both in the logical structure and in the content. These errors have to be corrected before producing the final paper form of the FEM dictionary.</Paragraph>
<Paragraph position="1"> 4. Phonetic codes conversion USM did not use the standard phonetic transcription (International Phonetic Alphabet, IPA), but a local transcription using certain characters of the Times™ font which look like characters of the IPA. These characters have high character codes (128 to 255), so their rendering differs from one font to another. To be portable to the PC, for instance, the files must contain only lower ASCII characters (32 to 127).</Paragraph>
<Paragraph position="2"> III. Methodology Our methodology is generic enough to be applied to other projects dealing with the construction of real-size publishable human-oriented dictionaries. The methodology is based on the use of simple but powerful tools.</Paragraph>
<Paragraph position="3"> 1. Use of an editor for correcting errors The problem is to find appropriate software for this work. The first type of software is databases, but our experience with them (we have used dBASE III and 4D) shows that lexicographers don't like to work through DataBase Management Systems. 
They want to use the same word processor to see the texts they want to index and to construct the dictionary.</Paragraph>
<Paragraph position="4"> The most practical tool would be an editor of structured documents like Grif (André, Furuta &amp; Quint 1989, Phan &amp; Boitet 1992) which can manage the logical form of the dictionary. However, such editors are complex to learn and are not yet available on micros as they require large computing resources. Hence, we use Word, a widely available commercial word processor.</Paragraph>
<Paragraph position="5"> We approach this notion of structured documents by using Word's &quot;styling&quot; facility. A Word style is a named group of paragraph and character formats (e.g. the title of this section has the style 'Title1', which includes the information about the rendering of this title). We associate a particular style with each logical type of information in the dictionary.</Paragraph>
<Paragraph position="6"> To convert the initial normalized ASCII external form (fig. 2) into a printable form (fig. 3), we considered several solutions: the first solution is to use Word's macro facility. Unfortunately, that facility is only available on the PC version, and we found it very clumsy to constantly exchange large files between the PC and the Macintosh, not to mention unexpected character transformations in the phonetic font.</Paragraph>
<Paragraph position="7"> the second solution is to use transducers, but the commercial transcriptors available are only based on direct correspondences. They cannot take into account a forward context and they generally have no variables (or notion of state). 
Thus, they are not powerful enough for the problems at hand.</Paragraph>
<Paragraph position="8"> We used LT (Language of Transcriptions), a Specialized Language for Linguistic Programming for writing transcriptors.</Paragraph>
<Paragraph position="9"> LT transducers have one input tape with two reading heads (one standard head and one forward head) and one writing head. They can also handle variables and produce side effects. Thus, this kind of transcriptor is not reversible in general.</Paragraph>
<Paragraph position="10"> There have been previous versions of LT (Lepage 1986). The LT used in our work has been implemented on
transcriptor lower-upper
initial state is init
from init to init by
read &quot;a&quot; then write &quot;A&quot;
read &quot;b&quot; then write &quot;B&quot;
read &quot;c&quot; then write &quot;C&quot;
read &quot;d&quot; then write &quot;D&quot;
...
read &quot;z&quot; then write &quot;Z&quot;
read 1 character then write it
With LT, we have easily written all the necessary converters.
Phonetic transcriptions These conversions first concern the problem of special characters used in some fonts, especially the characters used at USM (standard Macintosh fonts, i.e. Courier or Times) to approximate the International Phonetic Alphabet (IPA). For example, the ' sign (as in /aksid'/) appears only in a standard Macintosh font.</Paragraph>
<Paragraph position="11"> We have thus defined three formats. Ph1 is the initial form of the Word files in a standard Macintosh font (files built at USM). Ph2 is the format where special characters are replaced by others which appear in all usual fonts (characters corresponding to the letters, the digits and the '+' and '-' signs, i.e. 7-bit ASCII). 
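The rule style of the LT transcriptors shown above (read X then write Y, with a default rule that copies one character) can be pictured with a minimal Python sketch. It is an illustration under our own assumptions and not the LT implementation: the names RULES and transcribe are hypothetical, only a single state is modelled, and the forward head is approximated by trying the longest rule first.

```python
import string

# Hypothetical rule table mirroring the lower-upper transcriptor:
# each entry stands for one 'read X then write Y' rule.
RULES = {letter: letter.upper() for letter in string.ascii_lowercase}


def transcribe(text, rules):
    """Apply single-state 'read X then write Y' rules to text.

    At each position the longest matching rule wins (a rough stand-in
    for a forward reading head); when no rule matches, one character
    is copied unchanged, like 'read 1 character then write it'.
    """
    longest = max((len(pattern) for pattern in rules), default=1)
    output = []
    position = 0
    while position != len(text):
        for size in range(longest, 0, -1):
            chunk = text[position:position + size]
            if chunk in rules:
                output.append(rules[chunk])
                position += len(chunk)
                break
        else:
            # No rule matched at this position: default copy rule.
            output.append(text[position])
            position += 1
    return "".join(output)
```

With these rules, transcribe("accident", RULES) returns "ACCIDENT". Two-character codes such as the '+' and '-' suffixed letters of Ph2 would simply be longer keys in the same table.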
This ASCII coding allows a safe exchange between Macintoshes and PCs.</Paragraph>
<Paragraph position="12"> aborigène: /ab>ri3 )-'n/ --> /ABORIJE+N/
To transcribe from the Ph1 to the Ph2 format, we use the LT transcriptor Ph1toPh2. An excerpt is given below.
transcriptor Ph1toPh2
initial state is init
from init to init via
read &quot;~&quot; then write &quot;E+&quot;
read &quot;>&quot; then write &quot;O&quot;
read &quot;3&quot; then write &quot;J&quot;
read &quot;e&quot; then write &quot;E-&quot;
read &quot;it&quot; then write &quot;A-&quot;
Fig. 6: excerpt of the LT transcriptor Ph1toPh2
Ph3 is the IPA phonetic format. PalPhon is the font used for this IPA transcription. The problem is to assign this font only to the lines which correspond to a phonetic transcription, and hence to identify those lines. Here, we work on the RTF (Rich Text Format) form, directly produced by Word, which records all the information describing Word documents (styles, fonts and other information such as italic, bold, etc).</Paragraph>
<Paragraph position="13"> Then, Ph2toPh3 performs the transition from Ph2 to Ph3. It transforms the RTF form of the lines corresponding to pronunciation (phonetic) by converting the Times font code (\f20) to the PalPhon font code (\f1138) and each character in Ph2 form to the IPA form.</Paragraph>
<Paragraph position="14"> aborigène: /ABORIJE+N/ --> /abɔriʒɛn/
An excerpt of Ph2toPh3 is given below. The RTF code \'ab corresponds to the character 'ɛ' in the PalPhon font, \'bf to 'ɔ', etc.</Paragraph>
<Paragraph position="15"> transcriptor Ph2toPh3
initial state is init 
from init to init via
read &quot;E+&quot; then write &quot;{\f1138 \'ab}&quot;
read &quot;O&quot; then write &quot;{\f1138 \'bf}&quot;
read &quot;J&quot; then write &quot;{\f1138 \'bd}&quot;
read &quot;E-&quot; then write &quot;{\f1138 e}&quot;
read &quot;A-&quot; then write &quot;{\f1138 \'8c}&quot;
Fig. 8: excerpt of the LT transcriptor Ph2toPh3
External format The first type of conversion dealt with the rendering of special characters in the dictionary. The second concerns the external format of the dictionary. We have defined three formats:
AN: the ASCII normalized form, which corresponds to the initial files (Ph1), these files with phonetic encoding (Ph2) and these files in the RTF format (Ph3)
WT: the Word transitory form, which corresponds to the stylized files with phonetic encoding (Ph2) and these files in the RTF format (Ph3) (fig 4)
WP: the Word printing form (fig 3), in which we have cancelled all information about styles but kept the other information such as font codes and other character formats (Ph3)
The conversions between the AN, WT and WP forms are made with LT transcriptors.
3. Use of a dictionary tool ALEX is a simple and easy to use generic dictionary tool. Its functionalities are quite classical (inserting and deleting items, sorting, searching). The interesting features are the possibility to index a base on several keys and to search according to these keys or the content of any non-indexed entry (although it is slower).</Paragraph>
<Paragraph position="16"> Entries can be structured objects and searches can be done according to the values of their features. 
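The indexed and non-indexed searches described above can be pictured with a small Python sketch. It is only an illustration under our own assumptions and not the interface of ALEX: the class TinyBase, its methods and the field names used in the usage note (headword, category, english) are hypothetical.

```python
from collections import defaultdict


class TinyBase:
    """Toy dictionary base indexed on several keys (hypothetical API)."""

    def __init__(self, indexed_keys):
        self.entries = []
        # One lookup table per indexed key: value -> list of entries.
        self.indexes = {key: defaultdict(list) for key in indexed_keys}

    def insert(self, entry):
        """Store an entry (a dict of features) and update every index."""
        self.entries.append(entry)
        for key, index in self.indexes.items():
            # Entries may lack some features, so the base can hold
            # heterogeneous objects.
            if key in entry:
                index[entry[key]].append(entry)

    def search(self, key, value):
        """Return the entries whose feature `key` equals `value`."""
        if key in self.indexes:
            # Indexed key: direct lookup.
            return list(self.indexes[key].get(value, []))
        # Non-indexed feature: slower scan over all entries.
        return [entry for entry in self.entries if entry.get(key) == value]
```

A base created with TinyBase(["headword", "category"]) answers searches on those two keys by direct lookup, while a search on any other feature, for example english, scans every entry.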
The same base can handle heterogeneous objects.</Paragraph>
<Paragraph position="17"> It is possible to pilot ALEX remotely (instead of interacting with it via the user interface), and this method has been used to fill the FEM electronic base.</Paragraph>
<Paragraph position="18"> To do so, we have written an LT transcriptor with strong side effects on ALEX. The goal, here, was not to produce a result in terms of a transcribed file, but instead to read a file and produce actions on the ALEX base. As any dialect of LT can mix Lisp commands into its scripts, it was possible to make these tools cooperate.</Paragraph>
<Paragraph position="19"> Conclusion The methodology for manipulating human-oriented dictionaries presented in this paper is based on simple but powerful tools which can be used by lexicographers who don't want to spend much time learning how to use structured document editors and, even less, how to program in a DBMS. We use Word, a commercial word processor; LT, a language of transcriptions; and ALEX, a dictionary tool. Contrary to our initial fears, these simple tools proved very convenient, and powerful enough for the tasks at hand.</Paragraph>
<Paragraph position="20"> LT and ALEX will soon be available by anonymous ftp at cambridge.apple.com.</Paragraph> </Section> </Paper>