<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0406"> <Title>Multiword Expression Filtering for Building Knowledge Maps</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many real world applications require extraction of word sequences or multiword expressions from text. Examples of such applications include, among others, creation of search engine indexes, knowledge discovery, data mining, machine translation, summarization and term suggestion for either knowledge engineering or query refinement by end users of a search system. null The application of interest to the authors of this paper was that of building knowledge maps that help bridge the gap between searchers and documents. A knowledge map for a particular domain is a collection of concepts, relationships between these concepts as well as evidence associated to each concept. A domain concept represents an abstraction that can be generalized from instances in the domain. It can be a person, a thing, or an event. An example of a concept in the operating system domain is 'installation guidelines'. Relationships between concepts can be either generalization or specialization (such as &quot;is a&quot;) as well as different types of association (such as &quot;part-of&quot;). The evidence associated to a concept is a set of single or multiword terms such that if any of those terms are found in a document, then it is likely that the document refers to that concept.</Paragraph> <Paragraph position="1"> The task we were trying to support was to identify multiword expressions in a corpus of documents belonging to a domain that can help ontologists identify the important concepts in the domain as well as their evidence.</Paragraph> <Paragraph position="2"> Our research was focused on domains where the corpus of documents representing the domain contains a high degree of technical content. The reason for this is that such documents are served on many company web sites to help provide technical support for both employees and customers. null Our research assumes that a term is &quot;useful&quot; when it meets all of the following conditions (1) it makes sense in context of the domain, (2) it represents an action, some tangible or intangible object, name of a product, or a troubleshooting phrase, and (3) it would be chosen by an ontologist to be incorporated to their knowledge map. Some examples of multiword expressions that may be considered useful for building knowledge maps about technical content are &quot;how to In common parlance, the words &quot;term&quot; and &quot;expression&quot; are generally used interchangeably. In this paper, a term refers to a expression with one or more words.</Paragraph> <Paragraph position="3"> Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 40-47 uninstall the program&quot;, &quot;Simple Mail Transfer Protocol&quot;, and &quot;cannot reboot the computer&quot;. Some expressions may not seem useful at first glance but may make sense to an ontologist familiar with that domain. For instance, the occurrence of the number &quot;error 37582&quot; may be an error code, and hence evidence of a particular kind of problem. Similarly, expressions such as &quot;after rebooting the system&quot; may not seem useful but may be good evidence of concepts related to problem identification. 
The task we were trying to support was to identify multiword expressions in a corpus of documents belonging to a domain that can help ontologists identify the important concepts in the domain as well as their evidence.

Our research focused on domains where the corpus of documents representing the domain contains a high degree of technical content. The reason for this is that such documents are served on many company web sites to help provide technical support for both employees and customers.

Our research assumes that a term is "useful" when it meets all of the following conditions: (1) it makes sense in the context of the domain, (2) it represents an action, some tangible or intangible object, the name of a product, or a troubleshooting phrase, and (3) it would be chosen by an ontologist for incorporation into their knowledge map. Some examples of multiword expressions that may be considered useful for building knowledge maps about technical content are "how to uninstall the program", "Simple Mail Transfer Protocol", and "cannot reboot the computer". Some expressions may not seem useful at first glance but may make sense to an ontologist familiar with the domain. For instance, "error 37582" may contain an error code, and hence be evidence of a particular kind of problem. Similarly, expressions such as "after rebooting the system" may not seem useful but may be good evidence of concepts related to problem identification.

Examples of expressions that may be acceptable for some purposes, but not for building knowledge maps, are "this software was implemented by" and "and reboot the system to". These expressions can, however, become useful after undergoing some processing or manipulation by humans.

We extracted n-grams from documents using an algorithm proposed by Tseng [1998], and cleaned them up iteratively using a stopword-based algorithm in order to make them more useful for building knowledge maps. Tseng's algorithm is based on the assumption that documents concentrating on a topic tend to mention a set of words in a specific sequence repeatedly. In other words, repeated multiword expressions are extracted because they make good evidence candidates.

Our experience with Tseng's algorithm was that it extracts many useful multiword expressions. However, it also extracts multiword expressions that are repeated frequently in the documents but are not useful when viewed independently of the sentences from which they originated. This may not matter for some applications, but it puts a heavy burden on librarians or ontologists who want to use those multiword expressions to build knowledge maps. Examples of such expressions are "software was", "computer crashed after", "installed in order to", and so on. Such expressions have to undergo further manipulation or processing by ontologists in order to be useful. A good weighting algorithm may eliminate some of these expressions in some cases. However, our experience has shown that in a sufficiently large and homogeneous set of documents, the occurrence of all of these variations is so high that many of them meet the threshold requirements to qualify as eligible multiword expressions. Setting higher frequency thresholds may be a solution to this problem, but that may result in the elimination of other useful multiword expressions.
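To illustrate the frequency-based extraction step, the sketch below counts repeated word n-grams and keeps those above a frequency threshold. This is a simplified stand-in written in the spirit of Tseng's repeated-sequence assumption, not his actual algorithm (which merges overlapping repeated sequences); the function name and the min_freq parameter are our own.

```python
from collections import Counter

def extract_repeated_ngrams(sentences, n_max=4, min_freq=5):
    """Collect word n-grams (length 2..n_max) that occur at least
    min_freq times across a corpus of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(gram): freq
            for gram, freq in counts.items()
            if freq >= min_freq}
```

As the discussion above suggests, raising min_freq suppresses noisy fragments like "software was" but also discards genuinely useful expressions that happen to occur rarely.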
One of the steps usually undertaken is to eliminate not-so-useful single-word terms extracted from documents. For instance, the word "the" is not considered useful for most purposes. If a user were to submit a query such as "the cat", returning documents that contain "cat" would be more useful than looking for documents that contain both "the" and "cat". Terms such as "a", "an", "the", and so on are referred to as "stopwords", "noise words", or "skipwords", and search engines usually ignore them when they occur by themselves while building indexes. There are many common stopword lists useful in various contexts.

Statistical and quantitative techniques using frequency or mutual information scores for multiword expressions, as well as syntactic techniques using phrase trees, have been used to extract multiword expressions from text [Choueka 1988, Dias 1999, Lessard 1991, Hamza 2003, Merkel 1994, Paynter 2000]. According to Dias et al. [1999], many multiword units identified using statistical methods cannot be considered terms, although it may be useful to identify them. Examples cited by the authors include terms such as "be notified" and "valid for".

Less commonly found in the literature is work done to "clean" or "filter" the extracted multiword expressions to make them suitable for certain purposes. An example of the implementation of a filter is found in the work done by Merkel et al. on their FRASSE system [Merkel 2000], where they defined words that should be stripped at the beginning and at the end of multiword expressions, as well as requirements on what kinds of characters should be regarded as pairs (quotation marks, parentheses, etc.). The reason for identifying characters that should be regarded in pairs is to make sure that multiword expressions retained after filtering do not have only one parenthesis character or quotation mark. Their filter was implemented with the use of entropy thresholds and stopwords for the Swedish language. Another example of a proposed filter is found in work by Dias et al. [1999], in which the authors suggest using a filter that removes stopwords occurring either at the beginning or at the end of multiword expressions. Our work uses a standard stopword list used by systems that suggest terms to ontologists and end users, together with part-of-speech information, to achieve the same goal. The part-of-speech information ensures that we treat the beginning and the end of multiword expressions differently.

Our contribution has been to extend Tseng's algorithm with a stopword- and part-of-speech-based algorithm that reduces the occurrence of expressions needing further processing by ontologists (a minimal sketch of this kind of boundary filtering appears at the end of this section). Our goal was to increase the proportion of extracted expressions that do not have to undergo any further manual processing by ontologists to make them useful. This is very valuable in situations such as term suggestion, where users can be saved the time and effort involved in going through long lists of terms, many of which may not be useful or may have to be manipulated in some way to become useful. Running our algorithm on large corpora of documents has shown that it helps to increase the percentage of useful terms from 40% (±10) to 70% (±10); in other words, the relative improvement is at least 20% and could be as high as 160%.

The rest of this paper is organized as follows. Section 2 describes our algorithm for extracting frequently occurring n-grams and converting them into useful multiword expressions. Sections 3 and 4 describe the results of evaluating our algorithm on large corpora of documents and conclude the paper.
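As promised above, here is a minimal sketch of the boundary-filtering idea: stopwords are trimmed from the edges of a candidate expression (inner stopwords are kept), and expressions left with a lone quotation mark or parenthesis are rejected, following the paired-character requirement described for FRASSE. The toy stopword list and the separate head/tail sets are illustrative assumptions; the paper itself uses part-of-speech information to treat the two edges differently.

```python
# Toy stopword list for illustration only; real systems use far larger lists.
STOPWORDS = {"a", "an", "and", "the", "or", "to", "of", "was", "in", "by"}

def trim_edges(tokens, head_stops=STOPWORDS, tail_stops=STOPWORDS):
    """Strip stopwords from the edges of a candidate expression.
    Separate head/tail sets stand in for the paper's part-of-speech
    rules, which treat the beginning and end differently."""
    while tokens and tokens[0].lower() in head_stops:
        tokens = tokens[1:]
    while tokens and tokens[-1].lower() in tail_stops:
        tokens = tokens[:-1]
    return tokens

def has_unbalanced_pairs(tokens):
    """Reject expressions with a lone parenthesis or quotation mark."""
    text = " ".join(tokens)
    return text.count('"') % 2 != 0 or text.count("(") != text.count(")")

def filter_expression(tokens):
    """Return a cleaned expression, or None if nothing useful remains."""
    tokens = trim_edges(list(tokens))
    if len(tokens) < 2 or has_unbalanced_pairs(tokens):
        return None
    return " ".join(tokens)

# filter_expression("and reboot the system to".split())
# -> "reboot the system"  (edge stopwords dropped, inner "the" kept)
```

On the paper's own example "and reboot the system to", edge trimming alone recovers the useful core "reboot the system", which is the kind of saving in manual post-processing the contribution aims for.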