<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0109"> <Title>Natural Language Processing at the School of Information Studies for Africa</Title> <Section position="4" start_page="50" end_page="51" type="metho"> <SectionTitle> 3 Infrastructure and Student Body </SectionTitle> <Paragraph position="0"> Addis Ababa University (AAU) is Ethiopia's oldest, largest and most prestigious university. The Department of Information Science (formerly School of Information Studies for Africa) at the Faculty of Informatics conducts a two-year Master's Program.</Paragraph> <Paragraph position="1"> The students admitted to the program come from all over the country and have fairly diverse backgrounds. All have a four-year undergraduate degree, but not necessarily in any computer science-related subject. However, most of the students have been working with computers for some time after their undergraduate studies. Those admitted to the program are mostly among the top students of Ethiopia, but some places are reserved for public employees.</Paragraph> <Paragraph position="2"> The initiative of organising a language processing course as part of the Master's Program came from the students themselves: several students expressed interest in writing theses on speech and language subjects, but the faculty acknowledged that there was a severe lack of staff qualified to teach the course. In fact, the entire university is under-staffed, while admittance to the different graduate programs has been growing at an enormous speed: by 400% in the last two years alone. There was already an ICT support program in effect between AAU and SAREC, the Department for Research Cooperation at the Swedish International Development Cooperation Agency. This cooperation was used to establish contacts with Stockholm University and the Swedish Institute of Computer Science, both of which had experience in developing computational linguistics courses. 
Information Science is a modern department with contemporary technology. It has two computer labs with Internet-connected PCs and lecture rooms with all necessary teaching aids. A library supports the teaching work and is accessible to both students and staff. The only technical problems encountered arose from the frequent power failures in the country, which disrupted teaching and sometimes caused loss of data. Internet access in the region is also often slow and unreliable. However, as a result of the SAREC ICT support program, AAU is equipped with both an internal network and a broadband connection to the outside world. The central computer facilities are protected from power failures by generators, but the individual departments have no such back-up.</Paragraph> </Section> <Section position="5" start_page="51" end_page="51" type="metho"> <SectionTitle> 4 Course Design </SectionTitle> <Paragraph position="0"> The main aim of the course plan was to introduce the students successfully to the main subjects of language and speech processing and to trigger their interest in further investigation. Several factors were important when choosing the course materials and deciding on the content and order of the lectures and exercises, in particular the fact that the students did not have a solid background in either Computer Science or Linguistics, and the time limitation that the course could last only ten weeks. 
As a result, a curriculum with a holistic view of NLP was built in the form of a &quot;crash course&quot; (with many lectures and labs per week, often having to use Saturdays too) aiming at giving as much knowledge as possible in a very short time.</Paragraph> <Paragraph position="1"> The course was designed before the team travelled to Ethiopia, but was fine-tuned in the field based on the day-by-day experience and interaction with the students: even though the lecturers had some knowledge of the background and competence of the students, they obviously would have to be flexible and able to adjust the course set-up, paying attention both to the specific background knowledge of the students and to the students' particular interests and expectations of the course.</Paragraph> <Paragraph position="2"> From the outset, it was clear that, for example, very high programming skills could not be taken for granted, given that this is not in itself a requirement for being admitted to the Master's Program.</Paragraph> <Paragraph position="3"> On the other hand, it was also clear that some such knowledge could be expected, since this course would be the last of the program, just before the students were to start working on their theses; accordingly, several laboratory exercises were developed to give the students hands-on NLP experience.</Paragraph> <Paragraph position="4"> Coming to a department as external lecturers is also in general tricky and makes it more difficult to know what actual student skill level to expect. The lecturer team had quite extensive previous experience of giving external courses this way (in Sweden and Finland) and thus knew that &quot;the home department&quot; often tends to overestimate the knowledge of its students; another good reason for trying to be as flexible as possible in the course design. 
It was equally important to listen carefully to the feedback from the students during the course.</Paragraph> <Paragraph position="5"> The need for flexibility was, however, somewhat counteracted by the long geographical distance and time constraints. The course had to be given in only about two months' time, with one of the lecturers present during the first half of the course and the other two during the second half, with some overlap in the middle. Thus the course was split into two main parts, the first concentrating on general linguistic issues, morphology and lexicology, and the second on syntax, semantics and application areas.</Paragraph> <Paragraph position="6"> The choice of reading was influenced by the need not to assume very advanced student programming skills. This ruled out books based mainly on programming exercises, such as Pereira and Shieber (1987) and Gazdar and Mellish (1989), and it was decided to use Jurafsky and Martin (2000) as the main text of the course. The extensive web page provided by those authors was also a factor, since it could not be assumed that the students would have full-time access to the actual course book itself. The cost of buying a regular computer science book is normally too high for the average Ethiopian student.</Paragraph> <Paragraph position="7"> To partially ease the financial burden on the students, we brought some copies of the book with us and made them available at the department library.</Paragraph> <Paragraph position="8"> We also tried to make sure that as much as possible of the course material was available on the web. In addition to the course book we used articles on specific lecture topics, particularly material on Amharic, for which we also created a web page devoted to on-line Amharic resources and publications.</Paragraph> <Paragraph position="9"> The following sections briefly describe the different parts of the course and the laboratory exercises. 
The course web page contains the complete course materials, including the slides from the lectures and the resources and programs used for the exercises: www.sics.se/humle/ile/kurser/Addis</Paragraph> </Section> <Section position="6" start_page="51" end_page="53" type="metho"> <SectionTitle> 5 Linguistics and word level processing </SectionTitle> <Paragraph position="0"> The aim of the first part of the course was to give the students a brief introduction to Linguistics and human languages, and to introduce common methods to access, manipulate, and analyse language data at the word and phrase levels. In total, this part consisted of seven lectures that were accompanied by three hands-on exercises in the computer laboratory.</Paragraph> <Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 5.1 Languages: particularities and structure </SectionTitle> <Paragraph position="0"> The first two lectures presented the concept of a human language. The lectures focused on five questions: What is language? What is the ecological situation of the world's languages and of the main languages of Ethiopia? What differences are there between languages? What makes spoken and written modalities of language different? How are human languages built up? The second lecture concluded with a discussion of what information one would need to build a certain NLP application for a language such as Amharic.</Paragraph> </Section> <Section position="2" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 5.2 Phonology and writing systems </SectionTitle> <Paragraph position="0"> Phonology and writing systems were addressed in a lecture focusing on the differences between writing systems. 
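As a concrete illustration of mapping a syllabic script onto Latin characters, the sketch below transliterates a single Amharic word. The three-syllable mapping is invented for illustration and is not the official SERA table (the labs themselves used Perl, not Python).

```python
# Toy transliteration of a few Ethiopic syllables into Latin characters,
# in the spirit of romanisation schemes for Ethiopic script.
# ASSUMPTION: this tiny mapping is illustrative only, not the SERA standard.
ETHIOPIC_TO_LATIN = {
    "\u1230": "se",  # the Ethiopic syllable 'se'
    "\u120b": "la",  # the Ethiopic syllable 'la'
    "\u121d": "m",   # the Ethiopic syllable 'm(e)'
}

def transliterate(text: str) -> str:
    """Replace each known Ethiopic syllable; pass other characters through."""
    return "".join(ETHIOPIC_TO_LATIN.get(ch, ch) for ch in text)

# "selam" (hello/peace), written with the three syllables above
print(transliterate("\u1230\u120b\u121d"))  # selam
```

A real transliterator would need the full syllabary and rules for the sixth-order (vowel-less) forms; the point here is only the character-by-character mapping.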
The SERA standard for transliterating Ethiopic script into Latin characters was presented.</Paragraph> <Paragraph position="1"> These transliteration issues were also discussed in a lab class.</Paragraph> </Section> <Section position="3" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 5.3 Morphology </SectionTitle> <Paragraph position="0"> After a presentation of general morphological concepts, the students were given an introduction to the morphology of Amharic. As a means of handling morphology, regular languages/expressions and finite-state methods were presented, and their limitations when processing non-agglutinative morphology were discussed. The corresponding lab exercise aimed at describing Amharic noun morphology using regular expressions.</Paragraph> <Paragraph position="1"> In all, the areas of phonology and morphology were allotted two lectures and about five lab classes.</Paragraph> </Section> <Section position="4" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 5.4 Words, phrases and POS-tagging </SectionTitle> <Paragraph position="0"> Under this heading the students were acquainted with word-level phenomena during two lectures. Tokenisation problems were discussed and the concept of dependency relations introduced. This led on to the introduction of the phrase level and N-gram models of syntax. As examples of applications using this kind of knowledge, different types of part-of-speech taggers using local syntactic information were discussed. 
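The N-gram counting that underlies such models can be sketched minimally as follows (in Python rather than the UNIX tools and Perl used in the labs); the two-sentence corpus is an invented toy example.

```python
from collections import Counter

def bigram_counts(sentences):
    """Collect unigram and bigram counts, padding each sentence with
    sentence-boundary markers as in standard N-gram modelling."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# ASSUMPTION: tiny invented corpus, just to exercise the counts
corpus = ["the cat sat", "the dog sat"]
uni, bi = bigram_counts(corpus)
print(bigram_prob("the", "cat", uni, bi))  # 0.5
```

A tagger built on this idea would count tag-tag and word-tag pairs instead of word pairs, and would need smoothing for unseen events; both refinements are omitted here.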
The corresponding lab exercise, spanning four lab classes, aimed at building N-gram models for use in such a system.</Paragraph> <Paragraph position="1"> The last lecture of the first part of the course addressed lexical semantics, with a quick glance at word sense disambiguation and information retrieval.</Paragraph> <Paragraph position="2"> 6 Applications and higher level processing The second part of the course started with an overview lecture on natural language processing systems and finished with a final feedback lecture, in which the course and the exam were summarised and students could give overall feedback on the total course contents and requirements.</Paragraph> <Paragraph position="3"> The overview lecture addressed the topic of what makes up present-day language processing systems, using the metaphor of Douglas Adams' Babel fish (Adams, 1979): &quot;What components do we need to build a language processing system performing the tasks of the Babel fish?&quot;, that is, to translate unrestricted speech in one language into another, with Gambäck (1999) as additional reading material.</Paragraph> <Paragraph position="4"> In all, the second course part consisted of nine regular lectures, two laboratory exercises, and the final evaluation lecture.</Paragraph> </Section> <Section position="5" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 6.1 Machine Translation </SectionTitle> <Paragraph position="0"> The first main application area introduced was Machine Translation (MT). The instruction consisted of two 3-hour lectures during which the following subjects were presented: definitions and history of machine translation; different types of MT systems; paradigms of functional MT systems and translation memories today; problems, terminology, and dictionaries for MT; other kinds of translation aids; a brief overview of the MT market; and MT users, evaluation, and application of MT systems in real life. Parts of Arnold et al. 
(1994) complemented the course book.</Paragraph> <Paragraph position="1"> There was no obligatory assignment in this part of the course, but the students were able to try out and experiment with on-line machine translation systems. Since there is no MT system for Amharic, they used their knowledge of other languages (German, French, English, Italian, etc.) to experience the use of automatic translation tools.</Paragraph> </Section> <Section position="6" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 6.2 Syntax and parsing </SectionTitle> <Paragraph position="0"> Three lectures and one laboratory exercise were devoted to parsing and the representation of syntax, and to some present-day syntactic theories. After basic context-free grammars had been introduced, Dependency Grammar was taken as an example of a theory underlying many current shallow processing systems.</Paragraph> <Paragraph position="1"> Definite Clause Grammar, feature structures, the concept of unification, and subcategorisation were discussed when moving on to deeper-level, unification-based grammars.</Paragraph> <Paragraph position="2"> In order to give the students an understanding of the parsing problem, the processing of both artificial and natural languages was discussed, as well as human language processing, from the perspective of Kimball (1973). Several types of parsers were introduced, in order of increasing complexity: top-down and bottom-up parsing; parsing with well-formed substring tables and charts; head-first parsing and LR parsing.</Paragraph> </Section> <Section position="7" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 6.3 Semantics and discourse </SectionTitle> <Paragraph position="0"> Computational semantics and pragmatics were covered in two lectures. 
The first lecture introduced the basic tools used in current approaches to semantic processing, such as lexicalisation, compositionality and syntax-driven semantic analysis, together with different ways of representing meaning: first-order logic, model-based and lambda-based semantics. Important sources of semantic ambiguity (quantifiers, for example) were discussed, together with the solutions offered by underspecified semantic representations.</Paragraph> <Paragraph position="1"> The second lecture continued the semantic representation thread by moving on to how a complete discourse may be represented in a DRS, a Discourse Representation Structure, and how this may be used to solve problems like reference resolution. Dialogue and user modelling were introduced, covering several current conversational systems, with Zue and Glass (2000) and Wilks and Catizone (2000) as extra reading material.</Paragraph> </Section> <Section position="8" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 6.4 Speech technology </SectionTitle> <Paragraph position="0"> The final lecture before the exam was the only one devoted to speech technology and spoken language translation systems. Some problems in current spoken dialogue systems were discussed, while text-to-speech synthesis and multimodal synthesis were just briefly touched upon. 
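The lambda-based, syntax-driven composition covered in the semantics lectures above can be sketched with ordinary functions standing in for lambda terms; the tiny lexicon and the logical-form strings are invented for illustration.

```python
# Lambda-based semantic composition: each word denotes a function, and
# syntax-driven analysis combines meanings by function application.
# ASSUMPTION: toy lexicon and first-order-logic-style output strings,
# invented here for illustration.
lexicon = {
    "John":   lambda vp: vp("john"),                        # name applies the VP meaning
    "Mary":   "mary",                                       # object: an individual constant
    "sleeps": lambda subj: f"sleep({subj})",                # intransitive verb
    "loves":  lambda obj: lambda subj: f"love({subj},{obj})",  # transitive verb (curried)
}

# "John sleeps": S -> NP VP, meaning = NP-meaning applied to VP-meaning
print(lexicon["John"](lexicon["sleeps"]))        # sleep(john)

# "John loves Mary": first build the VP meaning, then apply the subject
vp = lexicon["loves"](lexicon["Mary"])
print(lexicon["John"](vp))                       # love(john,mary)
```

Quantified noun phrases and underspecified representations require considerably more machinery; the sketch only shows how meanings compose by application.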
The bulk of the lecture concerned automatic speech recognition: the parts and architectures of state-of-the-art speech recognition systems were introduced, including Bayes' rule, acoustic modelling, language modelling, and search strategies such as Viterbi and A-star, as well as attempts to build recognition systems based on hybrids of Hidden Markov Models and Artificial Neural Networks.</Paragraph> </Section> </Section> <Section position="7" start_page="53" end_page="54" type="metho"> <SectionTitle> 7 Laboratory Exercises </SectionTitle> <Paragraph position="0"> Even though we knew before the course that the students' actual programming skills were not extensive, we firmly believe that the best way to learn Computational Linguistics is by hands-on experience. Thus a substantial part of the course was devoted to a set of laboratory exercises, which made up almost half of the overall course grade.</Paragraph> <Paragraph position="1"> Each exercise was designed so that there was an (almost obligatory) short introductory lecture on the topic and the requirements of the exercise, followed by several opportunities for the students to work on the exercise in the computer lab under supervision from the lecturer. To pass, the students had both to show a working system solving the set problem and to hand in a written solution/explanation. Students were allowed to work together on solving the problem, while the textual part had to be handed in by each student individually, for grading purposes.</Paragraph> <Section position="1" start_page="53" end_page="53" type="sub_section"> <SectionTitle> 7.1 Labs 1-3: Word level processing </SectionTitle> <Paragraph position="0"> The laboratory exercises during the first half of the course were intended to give the students hands-on experience of simple language processing using standard UNIX tools and simple Perl scripts. The platform was Cygwin, a freeware UNIX-like environment for Windows. 
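The Viterbi search mentioned in the speech-recognition lecture above can be sketched over a toy Hidden Markov Model; the states, observations and probabilities below are all invented for illustration and have nothing to do with real acoustic models.

```python
# Minimal Viterbi decoding over a toy HMM, in the spirit of the search
# strategies discussed in the speech-recognition lecture.
# ASSUMPTION: two invented states "H"/"C" and made-up probabilities.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # Each column maps state -> (best probability, best path ending here)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev_col = V[-1]
        V.append({
            s: max(
                (prob * trans_p[prev][s] * emit_p[s][o], path + [s])
                for prev, (prob, path) in prev_col.items()
            )
            for s in states
        })
    return max(V[-1].values())[1]

states = ["H", "C"]
start_p = {"H": 0.6, "C": 0.4}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p  = {"H": {"1": 0.2, "2": 0.4, "3": 0.4},
           "C": {"1": 0.5, "2": 0.4, "3": 0.1}}

print(viterbi(["3", "1", "3"], states, start_p, trans_p, emit_p))
```

Real recognisers work in log space and prune the search; the dynamic-programming recurrence, however, is exactly the one shown.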
The first two labs focused on regular expressions, and the exercises included searching using 'grep', simple text preprocessing using 'sed', and building a (rather simplistic) model of Amharic noun morphology using regular expressions in (template) Perl scripts. The third lab exercise was devoted to the construction of probabilistic N-gram data from text corpora. Again, standard UNIX tools were used.</Paragraph> <Paragraph position="1"> Due to the students' lack of experience with this type of computer processing, more time than expected was spent on acquainting them with the UNIX environment during the first lab exercises.</Paragraph> </Section> <Section position="2" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 7.2 Labs 4-5: Higher level processing </SectionTitle> <Paragraph position="0"> The practical exercises during the second half of the course consisted of a demo and trial of on-line machine translation systems, and two obligatory assignments, on grammars and parsing and on semantics and discourse, respectively. Both of these exercises consisted of two parts and were carried out in the (freeware) SWI-Prolog framework. In the first part of the fourth lab exercise, the students were to familiarise themselves with basic grammars by trying out and testing parsing with a small context-free grammar. The assignments then consisted of extending this grammar both to add coverage and to restrict it (to stop &quot;leakage&quot;). The second part of the lab was related to parsing. The students received parsers encoding several different strategies: top-down, bottom-up, well-formed substring tables, head parsing, and link parsing (a link parser improves a bottom-up parser in a similar way to how a WFST parser improves a top-down parser, by saving partial parses). 
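In the same spirit as the parsers handed out in this lab, a minimal top-down recogniser for a small context-free grammar can be sketched as follows (in Python rather than the Prolog used in the lab); the grammar and lexicon are invented toy examples, not the ones distributed in the course.

```python
# A minimal top-down, depth-first recogniser for a toy context-free grammar.
# ASSUMPTION: invented grammar/lexicon; the grammar has no left recursion,
# which a naive top-down parser like this one cannot handle.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "noun"], ["name"]],
    "VP": [["verb", "NP"], ["verb"]],
}
LEXICON = {"det": {"the", "a"}, "noun": {"dog", "cat"},
           "name": {"mary"}, "verb": {"saw", "sleeps"}}

def parse(symbols, words):
    """Try to derive the word list from the symbol list, top-down."""
    if not symbols:
        return not words                    # success iff all words consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                    # non-terminal: try each rule
        return any(parse(expansion + rest, words)
                   for expansion in GRAMMAR[first])
    # pre-terminal: match the next word against the lexicon
    return bool(words) and words[0] in LEXICON.get(first, set()) \
        and parse(rest, words[1:])

print(parse(["S"], "the dog saw mary".split()))  # True
```

A WFST or chart parser improves on this by memoising sub-results instead of re-deriving them on every backtrack, which is precisely the contrast the lab asked the students to measure.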
The assignments included creating a test corpus for the parsers, running the parsers on the corpus, and trying to determine which of the parsers gave the best performance (and why).</Paragraph> <Paragraph position="1"> The assignments of the fifth lab were on lambda-based semantics and the problems arising in a grammar when considering left-recursion and ambiguity. The lab also had a pure demo part, in which the students tried out Johan Bos' &quot;Discourse Oriented Representation and Inference System&quot;, DORIS.</Paragraph> </Section> </Section> <Section position="8" start_page="54" end_page="54" type="metho"> <SectionTitle> 8 Course Evaluation and Grading </SectionTitle> <Paragraph position="0"> The students were encouraged from the beginning to interact with the lecturers and to give feedback on teaching and evaluation issues. With the aim of coming up with the best possible assessment strategy, in line with suggestions in work reviewed by Elwood and Klenowski (2002), three meetings with the students took place at the beginning, the middle, and the end of the course. In these meetings, students and lecturers together discussed the assessment criteria, the form of the exam, the percentage of the grade that each part of the exam would carry, and some examples of possible questions.</Paragraph> <Paragraph position="1"> This effort to better reflect the objectives of the course resulted in the following form of evaluation: the five exercises of the previous section were given, with the first one carrying 5% of the total course grade and the other four 10% each, while an additional written exam (consisting of thirteen questions from the whole curriculum taught) carried 55%.</Paragraph> <Paragraph position="2"> While correcting the exams, the lecturers tried to bear in mind that this was the students' first acquaintance with NLP. Given the restrictions on the course, the results were quite positive, as none of the students taking the exam failed the course. 
After the marking of the exams, an assessment meeting with all the students and the lecturers was held, during which each question of the exam was explained together with its correct answer. The evaluation of the group did not present any particular problems. For grading, the American system was used, according to the standards of Addis Ababa University (i.e., with the grades 'A+', 'A', ..., 'F').</Paragraph> </Section> </Paper>