<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1067"> <Title>Making Computers Laugh: Investigations in Automatic Humor Recognition</Title> <Section position="3" start_page="531" end_page="533" type="intro"> <SectionTitle> 2 Humorous and Non-humorous Data Sets </SectionTitle> <Paragraph position="0"> To test our hypothesis that automatic classification techniques represent a viable approach to humor recognition, we first needed a data set consisting of both humorous (positive) and non-humorous (negative) examples. Such data sets can be used to automatically learn computational models for humor recognition, and at the same time to evaluate the performance of such models.</Paragraph> <Section position="1" start_page="531" end_page="532" type="sub_section"> <SectionTitle> 2.1 Humorous Data </SectionTitle> <Paragraph position="0"> For reasons outlined earlier, we restrict our attention to one-liners, short humorous sentences that have the characteristic of producing a comic effect in very few words (usually 15 or fewer). The one-liner humor style is illustrated in Table 1, which shows three examples of such one-sentence jokes.</Paragraph> <Paragraph position="1"> It is well known that large amounts of training data have the potential to improve the accuracy of the learning process, and at the same time provide insights into how increasingly larger data sets can affect the classification precision. The manual construction of a very large one-liner data set may, however, be problematic, since most Web sites or mailing lists that make such jokes available do not usually list more than 50-100 one-liners. To tackle this problem, we implemented a Web-based bootstrapping algorithm able to automatically collect a large number of one-liners, starting from a short seed list consisting of a few manually identified one-liners.</Paragraph> <Paragraph position="2"> The bootstrapping process is illustrated in Figure 1.
Starting with the seed set, the algorithm automatically identifies a list of webpages that include at least one of the seed one-liners, via a simple search performed with a Web search engine. Next, the webpages found in this way are parsed, and additional one-liners are automatically identified and added to the seed set. The process is repeated several times, until enough one-liners are collected.</Paragraph> <Paragraph position="3"> An important aspect of any bootstrapping algorithm is the set of constraints used to steer the process and prevent, as much as possible, the addition of noisy entries. Our algorithm uses: (1) a thematic constraint applied to the theme of each webpage; and (2) a structural constraint, exploiting HTML annotations indicating text of similar genre.</Paragraph> <Paragraph position="4"> The first constraint is implemented using a set of keywords, of which at least one has to appear in the URL of a retrieved webpage, thus potentially limiting the content of the webpage to a theme related to that keyword.
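The bootstrapping loop and its two constraints can be sketched as follows. This is a self-contained illustration, not the authors' implementation: the tiny in-memory "web", its URLs, and the helper names (search_pages, bootstrap) are hypothetical stand-ins for a real search-engine query and HTML parsing of list items.

```python
# Sketch of the Web-based bootstrapping process: seed one-liners are
# searched for, pages passing a thematic URL constraint are harvested,
# and items from the same enumeration are added to the seed set.

TINY_WEB = {
    "http://example.com/humor/jokes": [   # URL passes the thematic constraint
        "Take my advice; I don't use it anyway.",
        "I get enough exercise just pushing my luck.",
        "Beauty is in the eye of the beer holder.",
    ],
    "http://example.com/news": [          # URL fails the thematic constraint
        "Oil prices slip as refiners shop for bargains.",
    ],
}

# The six humor-indicating keywords from the paper.
KEYWORDS = ("oneliner", "one-liner", "humor", "humour", "joke", "funny")

def search_pages(seed, web):
    """Stand-in for a search-engine query: URLs of pages containing the seed."""
    return [url for url, items in web.items() if seed in items]

def bootstrap(seeds, web, iterations=2):
    collected = set(seeds)
    for _ in range(iterations):
        for seed in list(collected):
            for url in search_pages(seed, web):
                # Thematic constraint: a humor keyword must appear in the URL.
                if not any(kw in url.lower() for kw in KEYWORDS):
                    continue
                # Structural constraint: take the other items of the same
                # enumeration (e.g. sibling <li> elements) as candidates.
                collected.update(web[url])
    return collected

one_liners = bootstrap({"Take my advice; I don't use it anyway."}, TINY_WEB)
```

A real run would replace search_pages with search-engine calls and the enumeration lookup with HTML parsing, and repeat until enough one-liners are collected.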
The set of keywords used in the current implementation consists of six words that explicitly indicate humor-related content: oneliner, one-liner, humor, humour, joke, funny. For example, http://www.berro.com/Jokes or http://www.mutedfaith.com/funny/life.htm are the URLs of two webpages that satisfy this constraint.</Paragraph> <Paragraph position="5"> Table 1: Examples of one-liners, Reuters titles, BNC sentences, and proverbs.</Paragraph> <Paragraph position="6"> One-liners: Take my advice; I don't use it anyway. / I get enough exercise just pushing my luck. / Beauty is in the eye of the beer holder.</Paragraph> <Paragraph position="7"> Reuters titles: Trocadero expects tripling of revenues. / Silver fixes at two-month high, but gold lags. / Oil prices slip as refiners shop for bargains.</Paragraph> <Paragraph position="8"> BNC sentences: They were like spirits, and I loved them. / I wonder if there is some contradiction here. / The train arrives three minutes early.</Paragraph> </Section> <Section position="2" start_page="532" end_page="532" type="sub_section"> <SectionTitle> Proverbs </SectionTitle> <Paragraph position="0"> Creativity is more important than knowledge. / Beauty is in the eye of the beholder. / I believe no tales from an enemy's tongue.</Paragraph> <Paragraph position="4"> The second constraint is designed to exploit the HTML structure of webpages, in an attempt to identify enumerations of texts that include the seed one-liner. This is based on the hypothesis that enumerations typically include texts of similar genre, and thus a list including the seed one-liner is likely to include additional one-line jokes. For instance, if a seed one-liner is found in a webpage preceded by the HTML tag <li> (i.e.
&quot;list item&quot;), other lines found in the same enumeration preceded by the same tag are also likely to be one-liners.</Paragraph> <Paragraph position="5"> Two iterations of the bootstrapping process, starting with a small seed set of ten one-liners, resulted in a large set of about 24,000 one-liners.</Paragraph> <Paragraph position="6"> After removing the duplicates using a measure of string similarity based on the longest common subsequence metric, we were left with a final set of approximately 16,000 one-liners, which are used in the humor-recognition experiments. Note that since the collection process is automatic, noisy entries are also possible. Manual verification of a randomly selected sample of 200 one-liners indicates an average of 9% potential noise in the data set, which is within reasonable limits, as it does not appear to significantly impact the quality of the learning.</Paragraph> </Section> <Section position="3" start_page="532" end_page="533" type="sub_section"> <SectionTitle> 2.2 Non-humorous Data </SectionTitle> <Paragraph position="0"> To construct the set of negative examples required by the humor-recognition models, we tried to identify collections of sentences that were non-humorous, but similar in structure and composition to the one-liners. We do not want the automatic classifiers to learn to distinguish between humorous and non-humorous examples based simply on text length or obvious vocabulary differences. Instead, we seek to force the classifiers to identify humor-specific features, by supplying them with negative examples similar in most of their aspects to the positive examples, but different in their comic effect.</Paragraph> <Paragraph position="1"> We tested three different sets of negative examples, with three examples from each data set illustrated in Table 1. All non-humorous examples are required to follow the same length restriction as the one-liners, i.e.
one sentence with an average length of 10-15 words.</Paragraph> <Paragraph position="2"> 1. Reuters titles, extracted from news articles published in the Reuters newswire over a period of one year (8/20/1996 - 8/19/1997) (Lewis et al., 2004). The titles consist of short sentences with simple syntax, and are often phrased to catch the reader's attention (an effect similar to the one produced by one-liners).</Paragraph> <Paragraph position="3"> 2. Proverbs, extracted from an online proverb collection. Proverbs are sayings that transmit, usually in one short sentence, important facts or experiences that are considered true by many people. Their property of being condensed but memorable sayings makes them very similar to one-liners. In fact, some one-liners attempt to reproduce proverbs with a comic effect, as in e.g. &quot;Beauty is in the eye of the beer holder&quot;, derived from &quot;Beauty is in the eye of the beholder&quot;. 3. British National Corpus (BNC) sentences, extracted from BNC - a balanced corpus covering different styles, genres and domains. The sentences were selected such that they were similar in content to the one-liners: we used an information retrieval system implementing a vectorial model to identify the BNC sentence most similar to each of the 16,000 one-liners. (The sentence most similar to a one-liner is identified by running the one-liner against an index built for all BNC sentences with a length of 10-15 words; we use a tf.idf weighting scheme and a cosine similarity measure, as implemented in the Smart system (ftp.cs.cornell.edu/pub/smart).) Unlike the Reuters titles or the proverbs, the BNC sentences typically have no added creativity.</Paragraph> <Paragraph position="4"> However, we decided to add this set of negative examples to our experimental setting, in order to observe the level of difficulty of a humor-recognition task when performed with respect to simple text.</Paragraph> <Paragraph position="5"> To summarize, the humor-recognition experiments rely on data sets consisting of humorous (positive) and non-humorous (negative) examples. The positive examples consist of 16,000 one-liners automatically collected using a Web-based bootstrapping process. The negative examples are drawn from: (1) Reuters titles; (2) proverbs; and (3) BNC sentences.</Paragraph> </Section> </Section> </Paper>