<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1020"> <Title>Reading more into Foreign Languages</Title> <Section position="4" start_page="135" end_page="136" type="metho"> <SectionTitle> SENTENCE WITH SELECTED WORD </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> [...] modules for morphological analysis and disambiguation, dictionary access, and (indexed) corpora search with an output module. The &quot;suggestive&quot; pronunciation module is not shown.</Paragraph> <Paragraph position="3"> The core modules provide the information noted in Section 1, (1-3): morphology, bilingual dictionary entry, and examples from use. A fourth (user-interface and display) module controls interaction with the user and formats the information provided.</Paragraph> <Paragraph position="4"> Among other things, it allows the range of information to be tailored to individual preference.</Paragraph> <Paragraph position="5"> The usefulness of the first two sorts of information is evident. We chose to include the third sort as well because corpora seemed likely to be valuable in providing examples more concretely and certainly more extensively than other sources. They may provide a sense of collocation or even nuances of meaning.</Paragraph> <Paragraph position="6"> The realization of these design goals required extensive knowledge bases about morphology and the lexicon.</Paragraph> <Paragraph position="7"> * Most crucially, the morphological knowledge base provides the link between the inflected forms found in texts and the &quot;citation forms&quot; found in dictionaries (Sproat, 1992). LEMMATIZATION recovers citation forms from inflected forms and is a primary task of morphological analysis. 
A substantial morphological knowledge base is likewise necessary if one is to provide information about the grammatical significance of morphological information.</Paragraph> <Paragraph position="8"> The only effective means of providing such a knowledge base is through morphological analysis software. Even if one could imagine storing all the inflected forms of a language such as French, the information associated with those forms is available today only from analysis software. The software is needed to create the store of information.</Paragraph> <Paragraph position="9"> Even apart from this, people occasionally create new words. Analysis programs can provide information about these, since most are formed according to very general and regular morphological processes.</Paragraph> <Paragraph position="10"> * The quality of the online dictionary is essential. The only feasible option is to use an existing dictionary. Our investigative user study indicates that the dictionary is the most important factor in user satisfaction.</Paragraph> <Paragraph position="11"> * The essential design questions vis-à-vis the corpus were (i) how large must the corpus be in order to guarantee a high expectation that the most frequent words would be found; and (ii) what sort of access techniques are needed on a corpus of the requisite size, given that access must succeed within at most a very few seconds.</Paragraph> <Paragraph position="12"> We tried to use texts from a variety of genres, and we attempted (with some limited success) to find bilingual English-Bulgarian, English-Estonian and French-Dutch texts.</Paragraph> <Section position="1" start_page="135" end_page="136" type="sub_section"> <SectionTitle> 2.1 Morphological Analysis </SectionTitle> <Paragraph position="0"> As we have seen, morphological analysis is necessary if one wishes to access an online dictionary. 
Since broad-coverage analysis packages represent very major development efforts, GLOSSER was fortunate in having the use of Locolex, a state-of-the-art system from Rank Xerox (Bauer, Segond, and Zaenen, 1995).</Paragraph> <Paragraph position="1"> A French example analysis (from Figure 2): * atteignissent as atteindre+SubjI+PL+P3+FinV. The semi-regular form is recognized as a subjunctive, third-person plural finite form of the verb atteindre. The information about the stem (lemma) from the morphological parse enables a dictionary lookup, and the grammatical information is directly useful. Note that, in contrast to commercially available systems, the information is generated automatically, so that it is available on-line for any text.</Paragraph> <Paragraph position="2"> There are also words that admit more than one grammatical analysis. Locolex incorporates a stochastic POS tagger which it employs to disambiguate. If Locolex is wrong (which is possible, but quite unlikely), the user is free to specify an alternative morphological analysis, which is then looked up in the dictionary and for which corpora examples are sought.</Paragraph> </Section> <Section position="2" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 2.2 Dictionary </SectionTitle> <Paragraph position="0"> GLOSSER was likewise fortunate in obtaining the use of good online dictionaries: the Van Dale dictionary Hedendaags Frans (van Dale, 1993) is used for French-Dutch, and the Kernermann semi-bilingual dictionaries are used for mapping English to Bulgarian, Estonian, and Hungarian. Only the Estonian version is complete. 
Although there are no paper versions of the latter available, (Kernermann Publishing, 1993) demonstrates the basic concept for English-Finnish.</Paragraph> </Section> <Section position="3" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 2.3 Corpus </SectionTitle> <Paragraph position="0"> We have relied on other projects, the ECI and MULTEXT, for bilingual corpora, although this has involved some work in (re)aligning the texts.</Paragraph> <Paragraph position="1"> The results of disambiguation and morphological analysis serve not only as input to dictionary lookup but also to corpus search. The current implementation of this search uses a LEXEME-based index for rapid and varied access to the corpus.</Paragraph> <Paragraph position="2"> In order to determine the size of corpus needed, we experimented with a frequency list of the 10,000 most frequent word forms. A corpus of 2 MB contained 85% of these, and a corpus of 6 MB 100%.</Paragraph> <Paragraph position="3"> Our goal is 100% coverage of the words (lemmata) found in the 30,000-word dictionaries, and 100% coverage of the most frequent 20,000 words. The current corpus size is 8 MB.</Paragraph> <Paragraph position="4"> As the corpus grows, the time for incremental search likewise grows linearly. When the average search time grew to several seconds (on a 70 MIPS UNIX server), it became apparent that some sort of indexing was needed. This was implemented and is described in (van Slooten, 1995). The indexed lookup is most satisfactory: not only has the absolute time dropped an order of magnitude, but the time appears to be constant when corpus size is varied between 1 and 10 MB.</Paragraph> <Paragraph position="5"> Lexeme-based search looks not only for further occurrences of the same string, but also for inflectional variants of the word. If the selected word is livre+Masc+SG+Noun, the search should find other tokens of this and also tokens of the plural form livres. 
This is made possible by lemmatizing the entire corpus in a preprocessing step, and retaining the results in an index of lemmata. It is clear that this greatly improves the chance of finding examples of a given lexeme.</Paragraph> </Section> <Section position="4" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 2.4 User Interface </SectionTitle> <Paragraph position="0"> The text the user is reading is displayed in the main window. Each of the three sorts of information is displayed in a separate window: MORPHOLOGY, the results of morphological analysis; DICTIONARY, the French-Dutch dictionary entry; and EXAMPLES, the examples of the word found in corpora search. See Figure 2 for details.</Paragraph> </Section> </Section> <Section position="5" start_page="136" end_page="136" type="metho"> <SectionTitle> 3 Using Glosser </SectionTitle> <Paragraph position="0"> A pilot study involving 20 university-level students of French was conducted in Feb. 1996. Half of the students used GLOSSER and the other half used a paper version of the same dictionary; all read the same text and answered questions that tested text comprehension and satisfaction. The time needed for the task was also measured. The results of this pilot study were encouraging: although the students' level was too high (Dutch foreign language students have a high level of proficiency), so that no differences in comprehension were noted, the GLOSSER users were faster, and reported enjoying the experience and being interested in using the system further. 
We have just completed a more careful replication with more students at a lower level of French proficiency, and the predictions of the pilot are borne out: there are very significant differences in speed, insignificant advantages in comprehension, and high overall satisfaction (Dokter et al., to appear 1997).</Paragraph> </Section> <Section position="6" start_page="136" end_page="137" type="metho"> <SectionTitle> 4 Conclusions </SectionTitle> <Paragraph position="0"> GLOSSER was developed with the philosophy of exploiting available NLP technology wherever possible. Morphological analysis (lemmatization) is robust and accurate and more than up to the task of supporting instructional software.</Paragraph> </Section> <Section position="7" start_page="137" end_page="137" type="metho"> <SectionTitle> LE GUN-CLUB </SectionTitle> <Paragraph position="0"> [Figure 2, main window: opening of the French text LE GUN-CLUB; the running text is not recoverable from the extraction.]</Paragraph> <Paragraph position="1"> [Figure 2, side windows: Van Dale dictionary entry for atteindre with Dutch glosses (bereiken, geraken (tot), reiken (tot), halen, komen tot, treffen, kwetsen); morphological analysis atteindre+SubjI+PL+P3+FinV; corpus examples.]</Paragraph> <Paragraph position="2"> [...] atteignissent has been requested; on the right, from the top, are windows for dictionary (Van Dale), morphological analysis (Rank Xerox) and examples in bilingual corpora. 
The text processing techniques employed in GLOSSER are not exotic and are likely robust enough to support quick access to corpora on the order of 10 MB in size.</Paragraph> </Section> </Paper>
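The lexeme-based index of Section 2.3 can be illustrated with a minimal sketch. It is not the GLOSSER implementation (that is described in van Slooten, 1995). It assumes a toy lookup table in place of a real morphological analyser such as Locolex, and the names TOY_LEMMAS, lemmatize, build_lemma_index and find_examples are hypothetical. The idea shown is only the one stated in the text: lemmatize the corpus once in a preprocessing step, store sentence positions per lemma, and answer a query by looking up the lemma of the selected word.

```python
from collections import defaultdict

# Assumption: a toy table stands in for a real morphological analyser.
TOY_LEMMAS = {
    "livres": "livre",
    "atteignissent": "atteindre",
    "atteignait": "atteindre",
}


def lemmatize(token):
    """Map an inflected form to its citation form (falls back to the form itself)."""
    word = token.lower()
    return TOY_LEMMAS.get(word, word)


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation."""
    stripped = (t.strip(".,;:!?") for t in sentence.split())
    return [t for t in stripped if t]


def build_lemma_index(sentences):
    """Preprocessing step: record, for every lemma, the sentences containing it."""
    index = defaultdict(list)
    for sent_id, sentence in enumerate(sentences):
        # A set ensures each sentence is listed once per lemma.
        for lemma in {lemmatize(t) for t in tokenize(sentence)}:
            index[lemma].append(sent_id)
    return index


def find_examples(index, sentences, selected_word):
    """Return all sentences containing any inflectional variant of the word."""
    return [sentences[i] for i in index.get(lemmatize(selected_word), [])]


corpus = [
    "Il pose un livre sur l'armoire.",
    "Les livres sont chers.",
    "Le déficit atteignait 2 milliards de dollars.",
]
index = build_lemma_index(corpus)
examples = find_examples(index, corpus, "livre")
```

With this toy corpus, a query for livre returns both the singular and the plural sentence, and a query for atteignissent finds the sentence containing atteignait. Lookup cost depends on the index rather than on a scan of the corpus, which is in the spirit of the near-constant search time reported above.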