<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0101">
  <Title>Teaching Applied Natural Language Processing: Triumphs and Tribulations</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 Course Role within the SIMS Program
</SectionTitle>
    <Paragraph position="0"> The primary target audience of the Applied NLP course were masters students, and to a lesser extent, PhD students, in the School of Information Management and Systems. (Nevertheless, PhD students in computer science and other fields also took the course.) MIMS students (as the SIMS masters students are known) pursue a professional degree studying information at the intersection of technology and social sciences. The students' technical backgrounds vary widely; each year a significant fraction have Computer Science undergraduate degrees, and another significant fraction have social science or humanities backgrounds. All students have an interest in technology and are required to take some challenging technical courses, but most non-CS background students are uncomfortable with advanced mathematics and are not as comfortable with coding as CS students are.</Paragraph>
    <Paragraph position="1"> A key aspect of the program is the capstone final project, completed in the last semester, that (ideally) combines knowledge and skills obtained from throughout the program. Most students form a team of 3-4 students and build a system, usually to meet the requirements of an outside client or customer (although some students write policy papers and others get involved in research with faculty mem- null bers). Often the execution of these projects makes use of user-centered design, including a needs assessment, and iterative design and testing of the artifact. These projects often also have a backend design component using database design principles, document engineering modeling, or information architecture and organization principles, with sensitivity to legal considerations for privacy and intellectual property. Students are required to present their work to an audience of students, faculty, and professionals, produce a written report, and produce a website that describes and demonstrates their work.</Paragraph>
    <Paragraph position="2"> In many cases these projects would benefit greatly from content analysis. Past projects have included a system to query on and monitor news topics as they occur across time and sources, a system to analyze when and where company names are mentioned in text and graph interconnections among them, a system to allow customization of news channels by topic, and systems to search and analyze blogs. Our past course offerings in this space focused on information retrieval with very little emphasis on content analysis, so students were using only IR-type techniques for these projects.</Paragraph>
    <Paragraph position="3"> The state of the art in NLP had advanced sufficiently that the available tools can be employed for a number of projects like these. Furthermore, it is important for students attempting such projects to have an understanding of what is currently feasible and what is too ambitious. In fact, I find that this is a key aspect of teaching an applied class: learning what is possible with existing tools, what is feasible but requires more expertise than can be engineered in a semester with existing tools, and what is beyond the scope of current techniques.</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 Choosing Tools and Readings
</SectionTitle>
    <Paragraph position="0"> The main challenges for a hands-on course as I'd envisioned surrounded finding usable interoperable tools, and defining feasible assignments that make use of programming without letting it interfere with learning.</Paragraph>
    <Paragraph position="1"> There is of course the inevitable decision of which programming language(s) to work with. Scripting tools such as python are fast and easy to prototype with, but require the students to learn a new programming language. Java is attractive because many tools are written in it and the MIMS students were familiar with java - they are required to use it for two of their required courses but still tend to struggle with it. I did not consider perl since python is a more principled language and is growing in acceptance and in tool availability.</Paragraph>
    <Paragraph position="2"> In the end I decided to require the students to learn python because I wanted to use NLTK, the Natural Language Toolkit (Loper and Bird, 2002). One goal of NLTK is to remove the emphasis on programming to enable students to achieve results quickly; and this aligned with my primary goal. NLTK seemed promising because it contained some well-written tutorials on n-grams, POS tagging and chunking, and contained text categorization modules. (I also wanted support for entity extraction, which NLTK does not supply.) NLTK is written in python, and so I decided to try it and have the students learn a new programming language. As will be described in detail below, our use of NLTK was somewhat successful, but we experienced numerous problems as well.</Paragraph>
    <Paragraph position="3"> I made a rather large mistake early on by not spending time introducing python, since I wanted the assignments to correspond to the lectures and did not want to spend lecture time on the programming language itself. I instructed students who had registered for the course to learn python during the summer, but (not surprisingly) many of did not and had to struggle in the first few weeks. In retrospect, I realize I should have allowed time for people to learn python, perhaps via a lab session that met only during the first few weeks of class.</Paragraph>
    <Paragraph position="4"> Another sticking point was student exposure to regular expressions. Regex's were very important and useful practical tools both for tokenization assignments and for shallow parsing. I assumed that the MIMS students had gotten practice with regular expressions because they are required to take a computer concepts foundations course which I designed several years ago. Unfortunately, the lecturer who took over the class from me had decided to omit regex's and related topics. I realized that I had to do some remedial coverage of the topic, which of course bored the CS students and which was not complete enough for the MIMS students. Again this suggests that perhaps some kind of lab is needed for getting people caught up in topics, or that perhaps  the first few weeks of the class should be optional for more advanced students.</Paragraph>
    <Paragraph position="5"> I was also unable to find an appropriate textbook.</Paragraph>
    <Paragraph position="6"> Neither Sch&amp;quot;utze &amp; Manning nor Jurafsky &amp; Martin focus on the right topics. The closest in terms of topic is Natural Language Processing for Online Applications by Peter Jackson &amp; Isabelle Moulinier, but much of this book focuses on Information Retrieval (which we teach in two other courses) and did not go into depth on the topics I most cared about.</Paragraph>
    <Paragraph position="7"> Instead of a text, students read a small selection of research papers and the NLTK tutorials.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Topics
</SectionTitle>
    <Paragraph position="0"> The course met twice weekly for 80 minute periods.</Paragraph>
    <Paragraph position="1"> The topic coverage is shown below; topics followed by (2) indicate two lecture periods were needed.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Class Presentations
</SectionTitle>
      <Paragraph position="0"> Note the lack of coverage of full syntactic parsing, which is covered extensively in Dan Klein's course.</Paragraph>
      <Paragraph position="1"> I touched on it briefly in the second shallow parsing lecture and felt this level of coverage was acceptable because shallow parsing is often as useful if not more so than full parsing for most applications. Note also the lack of coverage of word sense disambiguation. This topic is rich in algorithms, but was omitted primarily due to time constraints, but in part because of the lack of well-known applications.</Paragraph>
      <Paragraph position="2"> Based on the kinds of capstone projects the MIMS students have done in the past, I knew that the most important techniques for their needs surrounded text categorization and information extraction/entity recognition. There are terrific software resources for text categorization and the field is fairly mature, so I had my PhD students Preslav Nakov and Barbara Rosario gave the lectures on this topic, in order to provide them with teaching experience.</Paragraph>
      <Paragraph position="3"> The functionality provided by named entity recognition is very important for a wide range of real-world applications. Unfortunately, none of the free tools that we tried were particularly successful.</Paragraph>
      <Paragraph position="4"> Those that are available are difficult to configure and get running in a short amount of time, and have virtually no documentation. Furthermore, the state-of-the-art in algorithms is not present in the available tools in the way that more mature technologies such as POS tagging, parsing, and categorization are.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="3" type="metho">
    <SectionTitle>
5 Using NLTK
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Benefits
</SectionTitle>
      <Paragraph position="0"> We used the latest version of NLTK, which at the time was version 1.4.2 NLTK supplies some pre-processed text collections, which are quite useful. (Unfortunately, the different corpora have different types of preprocessing applied to them, which often lead to confusion and extra work for the class.) The NLTK tokenizer, POS taggers and the shallow parser (chunker) have terrific functionality once they are understood; some students were able to get quite accurate results using these and the supplied training sets. The ability to combine different n-gram taggers within the structure of a backoff tagger also supported an excellent exercise. However, a somewhat minor problem with the taggers is that there is no compact way to store the model resulting from tagging for later use. A serialized object could be created and stored, but the size of such object was so large that it takes about as long to load it into</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.2 Drawbacks
</SectionTitle>
      <Paragraph position="0"> There were four major problems with NLTK from the perspective of this course. The first major problem was the inconsistency in the different releases of code, both in terms of incompatibilities between the data structures in the different versions, and incompatibility of the documentation and tutorials within the different versions. It was tricky to determine which documentation was associated with which code version. And much of the contributed code did not work with the current version.</Paragraph>
      <Paragraph position="1"> The second major problem was related to the first, but threw a major wrench into our plans: some of the advertised functionality simply was not available in the current version of the software. Notably, NLTK advertised a text categorization module; without this I would not have adopted NLTK as the coding platform for the class. Unfortunately, the most current version did not in fact support categorization, and we discovered this just days before we were to begin covering this topic.</Paragraph>
      <Paragraph position="2"> The third major problem was the incompleteness of the documentation for much of the code. This to some degree undermined the goal of reducing the amount of work for students, since they (and I) had to struggle to figure out what was going on in the code and data structures.</Paragraph>
      <Paragraph position="3"> One of these documentation problems centered around the data structure for conditional probabilities. NLTK creates a FreqDist class which is explained well in the documentation (it records a count for each occurrence of some phenomenon, much like a hash table) and provides methods for retrieving the max, the count and frequency of each occurrence, and so on. It also provides a class called a CondFreqDist, but does not document its methods nor explain its implementation. Users have to scrutinize the examples given and try to reverse engineer the data structure. Eventually I realized that it is simply a list of objects of type FreqDist, but this was difficult to determine at first, and caused much wasting of time and confusion among the students. There is also confusion surrounding the use of the method names count and frequency for FreqDist. Count refers to number of occurrences and frequency to a probability distribution across items, but this distinction is never stated explicitly although it can be inferred from a table of methods in the tutorial. null A less dramatic but still hampering problem was with the design of the core data structures, which make use of attribute tags rather than classes. This leads to rather awkward code structures. For example, after a sentence is tokenized, the results of tokenization are appended to the sentence data structure and are accessed via use of a subtoken keyword such as 'TOKENS'. To then run a POS tagger over the tokenized results, the 'TOKENS' keyword has to be specified as the value for a SUBTOKENS attribute, and another keyword must be supplied to act as the name of the tagged results. In my opinion it would be better to use the class system and define objects of different types and operations on those objects.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="3" end_page="6" type="metho">
    <SectionTitle>
6 Assignments
</SectionTitle>
    <Paragraph position="0"> One of the major goals of the class was for the students to obtain hands-on experience using and extending existing NLP tools. This was accomplished through a series of homework assignments and a final project. My pedagogical philosophy surrounding assignments is to supply as much as the functionality as necessary so that the coding that students do leads directly to learning. Thus, I try to avoid making students deal with details of formatting files and so on. I also try to give students a starting point to build up on.</Paragraph>
    <Paragraph position="1"> The first assignment made use of some exercises from the NLTK tutorials. Students completed tokenizing exercises which required the use of the NLTK corpus tool accessors and the FreqDist and CondFreqDist classes. They also did POS tagging exercises which exposed them to the idea of ngrams, backoff algorithms, and to the process of training and testing. This assignment was challenging (especially because of some misleading text in the tagging tutorial, which has since been fixed) but the students learned a great deal. As mentioned above, I should have begun with a preliminary assignment which got students familiar with python basics before attempting this assignment.</Paragraph>
    <Paragraph position="2"> For assignment 2, I provided a simple set of regular expression grammar rules for the shallow parser class, and asked the students to improve on these.</Paragraph>
    <Paragraph position="3"> After building the chunker, students were asked to  choose a verb and then analyze verb-argument structure (they were provided with two relevant papers (Church and Hanks, 1990; Chklovski and Pantel, 2004)). As mentioned above, most of the MIMS students were not familiar with regular expressions, so I should have done a longer unit on this topic, at the expense of boring the CS students.</Paragraph>
    <Paragraph position="4"> The students learned a great deal from working to improve the grammar rules, but the verb-argument analysis portion was not particularly successful, in part because the corpus analyzed was too small to yield many sentences for a given verb and because we did not have code to automatically find regularities about the semantics of the arguments of the verbs. Other causes of difficulty were the students' lack of linguistic background, and the fact that the chunking part took longer than I expected, leaving students little time for the analysis portion of the assignment. null Assignments 3 and 4 are described in the following subsections.</Paragraph>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
6.1 Text Categorization Assignment
</SectionTitle>
      <Paragraph position="0"> As mentioned above, text categorization is useful for a wide range SIMS applications, and we made it a centerpiece of the course. Unfortunately, we had to make a mid-course correction when I suddenly realized that text categorization was no longer available in NLTK.</Paragraph>
      <Paragraph position="1"> After looking at a number of tools, we decided to use the Weka toolkit for categorization (Witten and Frank, 2000). We did not want the students to feel they had wasted their time learning python and NLTK, so we decided to make it easy for the students to reuse their python code by providing an interface between it and Weka.</Paragraph>
      <Paragraph position="2"> My PhD student Preslav Nakov provided great help by writing code to translate the output of our python code into the input format expected by Weka.</Paragraph>
      <Paragraph position="3"> (Weka is written in java but has command line and GUI interfaces, and can read in input files and store models as output files.) As time went on we added increasingly more functionality to this code, tying it in with the NLTK modules so that the students could use the NLTK corpora for training and testing.3  Both Preslav and I had used Weka in the past but mainly with the command-line interface, and not taking advantage of its rich functionality. As with NLTK, the documentation for Weka was incomplete and out of date, and it was difficult to determine how to use the more advanced features. We performed extended experimentation with the system and developed a detailed tutorial on how to use the system; this tutorial should be of general use.4 For the categorization task, we used the &amp;quot;twenty newsgroups&amp;quot; collection that was supplied with NLTK. Unfortunately, it was not preprocessed into sentences, so I also had to write some sentence splitting code (based on Palmer and Hearst (1997)) so students could make use of their tokenizer and tagger code.</Paragraph>
      <Paragraph position="4"> We selected one pair of newsgroups which contained very different content (rec.motorcycles vs. sci.space). We called this the diverse set. We then created two groups of newsgroups with more homogeneous content (a) rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, and (b) sci.crypt, sci.electronics, sci.med.original, sci.space. The intention was to show the students that it is easier to automatically distinguish the heterogeneous groups than the homogeneous ones.</Paragraph>
      <Paragraph position="5"> We set up the code to allow students to adjust the size of their training and development sets, and to separate out a reserved test set that would be used for comparing students' solutions.</Paragraph>
      <Paragraph position="6"> We challenged the students to get the best scores possible on the held out test set, telling them not to use this test set until they were completely finished training and testing on the development set. (We relied on the honor system for this.) We made it known that we would announce which were the top-scoring assignments. As a general rule I avoid competition in my classes, but this was kept very low-key; only the top-scoring results would be named. Furthermore, innovative approaches that perhaps did not do as well as some others were also highlighted. Students were required to try at least 2 different types of features and 3 different classifiers.</Paragraph>
      <Paragraph position="7"> This assignment was quite successful, as the stu- null dents were creative about building their features, and it was possible to achieve very strong results (much stronger than I expected) on both sets of newsgroups. The best scoring approaches got 99% accuracy on the 2-way diverse distinction and 97% accuracy on the 4-way homogeneous distinction.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
6.2 Enron Email Assignment
</SectionTitle>
      <Paragraph position="0"> Many of the SIMS students are interested in social networking and related topics. I decided as part of the class that we would analyze a relatively new text collection that had become available and that contained the potential for interesting text mining and analysis. I was also interested in having the class help produce a resource that would be of use to other classes and researchers. Thus we decided to take on the Enron email corpus,5 on which limited analysis had been done.</Paragraph>
      <Paragraph position="1"> My PhD student Andrew Fiore wrote code to pre-process this text, removing redundancies, normalizing email addresses, labeling quoted text, and so on. He and I designed a database schema for representing much of the structure of the collection and loaded in the parsed text. I created a Lucene6 index for doing free text queries while Andrew built a highly functional web interface for searching fielded components. Andrew's system eventually allowed for individual students to login and register annotations on the email messages.</Paragraph>
      <Paragraph position="2"> This collection consists of approximately 200,000 messages after the duplicates have been removed.</Paragraph>
      <Paragraph position="3"> We wanted to identify a subset of emails that might be interesting for analysis while at the same time avoiding highly personal messages, messages consisting mainly of jokes, and so on. After doing numerous searches, we decided to try to focus primarily on documents relating to the California energy crisis, trading discrepancies, and messages occurring near the end of the time range (just before the company's stock crashed).</Paragraph>
      <Paragraph position="4"> After selecting about 1500 messages, I devised an initial set of categories. In class we refined these.</Paragraph>
      <Paragraph position="5"> One student had the interesting idea of trying to identify change in emotional tone as the scandals surrounding the company came to light, so we added emotional tone as a category type. Each message  was then read and annotated by two students using the pre-defined categories. Students were asked to reconcile their differences when they had them.</Paragraph>
      <Paragraph position="6"> Despite these safeguards, my impression is that the resulting assignments are far from consistent and the categories themselves are still rather ad hoc and oftentimes overlapping. There were many difficult curation issues, such as how to categorize a message with forwarded content when that content differed in kind from the new material. If we'd spent more time on this we could have done a better job, but as this was not an information organization course, I felt we could not spend more time on perfecting the labels. Thus, I do not recommend the category labels be used for serious analysis. Nevertheless, a number of researchers have asked for the cleaned up database and categories, and we have made them publicly available, along with the search interface.7 The students were then given two weeks to process the collection in some manner. I made several suggestions, including trying to automatically assign the hand-assigned categories, extending some automatic acronym recognition work that we'd done in our research (Schwartz and Hearst, 2003), using named entity recognition code to identify various actors, clustering the collection, or doing some kind of social network analysis. Students were told that they could extend this assignment into their final projects if they chose.</Paragraph>
      <Paragraph position="7"> For most students it was difficult to obtain a strong result using this collection. The significant exception was for those students who worked on extending our acronym recognition algorithm; these projects were quite successful. (In fact, one student managed to improve on our results with a rather simple modification to our code.) Students often had creative ideas that were stymied by the poor quality of the available tools. Two groups used the MAL-LET named entity recognizer toolkit8 in order to do various kinds of social network analysis, but the results were poor. (Students managed to make up for this deficiency in creative ways.) I was a bit worried about students trying to use clustering to analyze the results, given the general difficulty of making sense of the results of cluster- null ing, and this concern was justified. Clustering based on Weka and other tools is of course memory- and compute-intensive, but more problematically, the results are difficult to interpret. I would recommend against allowing students to do a text clustering exercise unless within a more constrained environment. In summary, students were excited about building a resource based on relatively untapped and very interesting data. The resulting analysis on this untamed text was somewhat disappointing, but given that only two weeks were spent on this part of the assignment, I believe it was a good learning experience. Furthermore, the resulting resource seems to be of interest to a number of researchers, as was our intention.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
6.3 Final Projects
</SectionTitle>
      <Paragraph position="0"> I deliberately kept the time for the final projects short (about 3 weeks) so students would not go overboard or feel pressure to do something hugely timeconsuming. The goal was to allow students to tie together some of the different ideas and skills they'd acquired in the class (and elsewhere), and to learn them in more depth by applying them to a topic of personal interest.</Paragraph>
      <Paragraph position="1"> Students were encouraged to work in pairs, and I suggested a list of project ideas. Students who adopted suggested projects tended to be more successful than those who developed their own. Those who tried other topics were often too ambitious and had trouble getting meaningful results. However, several of those students were trying ideas that they planned to apply to their capstone projects, and so it was highly valuable for them to get a preview of what worked and what did not.</Paragraph>
      <Paragraph position="2"> One suggestion I made was to create a back-of-the-book indexer, specifically for a recipe book, and one team did a good job with this project. Another was to improve on or apply an automatic hierarchy generation tool that we have developed in our research (Stoica and Hearst, 2004). Students working on a project to collect metadata for camera phone images successfully applied this tool to this problem. Again, social networking analysis topics were popular but not particularly successful; NLP tools are not advanced enough yet to meet the needs of this intriguing topic area. Not surprisingly, when students started with a new (interesting) text collection, they were bogged down in the preprocessing stage before they could get much interesting work done.</Paragraph>
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
6.4 Reflecting on Assignments
</SectionTitle>
      <Paragraph position="0"> Although students were excited about the Enron collection and we created a resource that is actively being used by other researchers, I think in future versions of the class I will omit this kind of assignment and have the students start their final projects sooner.</Paragraph>
      <Paragraph position="1"> This will allow them time to do any preprocessing necessary to get the text into shape for doing the interesting work. I will also exercise more control over what they are allowed to attempt (which is not my usual style) in order to ensure more successful outcomes.</Paragraph>
      <Paragraph position="2"> I am not sure if I will use NLTK again or not. If the designers make significant improvements on the code and documentation, then I probably will. The style and intent of the tutorials are quite appropriate for the goals of the class. Students with stronger coding background tended to use java for their final projects, whereas the others tended to build on the python code we developed in the class assignments, which suggests that this kind of toolkit approach is useful for them.</Paragraph>
    </Section>
  </Section>
</Paper>