<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1007">
  <Title>Retrieving Collocations from Text: Xtract</Title>
  <Section position="2" start_page="0" end_page="144" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Consider the following sentences:
  1. "The Dow Jones average of 30 industrials rose 26.28 points to 2,304.69 on Tuesday."
  2. "The Dow average rose 26.28 points to 2,304.69 on Tuesday."
  3. "The Dow industrials rose 26.28 points to 2,304.69 on Tuesday."
  4. "The Dow Jones industrial rose 26.28 points to 2,304.69 on Tuesday."
  *5. "The Jones industrials rose 26.28 points to 2,304.69 on Tuesday."
  *6. "The industrial Dow rose 26.28 points to 2,304.69 on Tuesday."
  *7. "The Dow of 30 industrials rose 26.28 points to 2,304.69 on Tuesday."
  *8. "The Dow industrial rose 26.28 points to 2,304.69 on Tuesday."
The above sentences contain expressions that are difficult to handle for nonspecialists. For example, among the eight different expressions referring to the famous Wall Street index, only those used in sentences 1-4 are correct. The expressions used in the starred sentences 5-8 are all incorrect. The rules violated in sentences 5-8 are neither rules of syntax nor of semantics but purely lexical rules. The word combinations used in sentences 5-8 are invalid simply because they do not exist; similarly, the ones used in sentences 1-4 are correct because they exist.
  * Computer Science Department, Columbia University, New York, NY 10027. smadja@cs.columbia.edu. © 1993 Association for Computational Linguistics. Computational Linguistics Volume 19, Number 1.
  Table 1. Cross linguistic comparisons of collocations.
    Language | English                         | Translation         | English correspondence
    French   | to see the door                 | voir la porte       | to see the door
    German   | to see the door                 | die Tür sehen       | to see the door
    Italian  | to see the door                 | vedere la porta     | to see the door
    Spanish  | to see the door                 | ver la puerta       | to see the door
    Turkish  | to see the door                 | kapiyi görmek       | to see the door
    French   | to break down/force the door    | enfoncer la porte   | to push the door through
    German   | to break down/force the door    | die Tür aufbrechen  | to break the door
    Italian  | to break down/force the door    | sfondare la porta   | to hit/demolish the door
    Spanish  | to break down/force the door    | tumbar la puerta    | to fall the door
    Turkish  | to break down/force the door    | kapiyi kirmak       | to break the door
</Paragraph>
    <Paragraph position="1"> Expressions such as these are called collocations. Collocations vary tremendously in the number of words involved, in the syntactic categories of the words, in the syntactic relations between the words, and in how rigidly the individual words are used together. For example, in some cases, the words of a collocation must be adjacent, as in sentences 1-5 above, while in others they can be separated by a varying number of other words. Unfortunately, with few exceptions (e.g., Benson, Benson, and Ilson 1986a), collocations are generally unavailable in compiled form. This creates a problem for persons not familiar with the sublanguage as well as for several machine applications such as language generation.</Paragraph>
    <Paragraph position="2"> In this paper we describe a set of techniques for automatically retrieving such collocations from naturally occurring textual corpora. These techniques are based on statistical methods; they have been implemented in a tool, Xtract, which is able to retrieve a wide range of collocations with high performance. Preliminary results obtained with parts of Xtract have been described in the past (e.g., Smadja and McKeown
  Figure 1. British English or American English? From Benson (1990).
    "Our firm made/did a deal with them"
    "The swimmer had/got a cramp"
    "Politicians are always on/in the firing line"
    "These decisions are to be made/taken rapidly"
    "The children usually set/lay the table"
    "You have to break in/run in your new car"
  Figure 2. Fill-in-the-blank test, from Benson (1990).
    Sentence | Candidates
    "If a fire breaks out, the alarm will ??" | "ring, go off, sound, start"
    "The boy doesn't know how to ?? his bicycle" | "drive, ride, conduct"
    "The American congress can ?? a presidential veto" | "ban/cancel/delete/reject" "turn down/abrogate/overrule"
    "Before eating your bag of microwavable popcorn, you have to ?? it" | "cook/nuke/broil/fry/bake"
</Paragraph>
    <Paragraph position="3"> Xtract now works in three stages. In the first stage, pairwise lexical relations are retrieved using only statistical information. This stage is comparable to Church and Hanks (1989) in that it evaluates a certain word association between pairs of words. As in Church and Hanks (1989), the words can appear in any order and they can be separated by an arbitrary number of other words. However, the statistics we use provide more information and allow us to have more precision in our output. The output of this first stage is then passed in parallel to the next two stages. In the second stage, multiple-word combinations and complex expressions are identified. This stage produces output comparable to that of Choueka, Klein, and Neuwitz (1983); however, the techniques we use are simpler and only produce relevant data. Finally, by combining parsing and statistical techniques, the third stage labels and filters the collocations retrieved at stage one. In our evaluation, the third stage raises the precision of Xtract from 40% to 80% with a recall of 94%.</Paragraph>
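The pairwise retrieval idea of the first stage can be illustrated with a small sketch. This is not Xtract's actual statistic, which this section does not specify; it only counts how often two words co-occur in either order within a fixed window, and records their relative offsets. The function names `cooccurrence_counts` and `pair_frequency`, the window size of 5, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=5):
    """Count co-occurrences of unordered word pairs within `window` words.

    Returns a dict mapping an alphabetically sorted pair (a, b) to a Counter
    of signed offsets, where offset = position(b) - position(a).
    The window size is an illustrative assumption, not a value from the paper.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i, left in enumerate(tokens):
            # Scan only to the right; sorting the pair makes word order irrelevant.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                right = tokens[j]
                if left != right:
                    key = tuple(sorted((left, right)))
                    offset = j - i if key[0] == left else i - j
                    counts[key][offset] += 1
    return counts


def pair_frequency(counts, w1, w2):
    """Total number of times w1 and w2 co-occur, in any order."""
    key = tuple(sorted((w1, w2)))
    return sum(counts.get(key, Counter()).values())


# Simplified versions of the valid example sentences from the introduction.
sentences = [
    "the dow jones average of 30 industrials rose 26.28 points",
    "the dow average rose 26.28 points",
    "the dow industrials rose 26.28 points",
]
counts = cooccurrence_counts(sentences, window=5)
print(pair_frequency(counts, "dow", "average"))  # prints 2
print(pair_frequency(counts, "dow", "rose"))     # prints 2
```

Keeping the offsets, and not only the totals, is a deliberate choice in this sketch: a pair whose occurrences concentrate on one offset behaves like a rigid expression, while a pair spread over many offsets is more flexible. How Xtract itself exploits such information is described in the later sections of the paper, not here.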
    <Paragraph position="4"> Section 2 is an introductory section on collocational knowledge, Section 3 describes the type of collocations that are retrieved by Xtract, and Section 4 briefly surveys related efforts and contrasts our work with them. The three stages of Xtract are then introduced in Section 5 and described respectively in Sections 6, 7, and 8. Some results obtained by running Xtract on several corpora are listed and discussed in Section 9. Qualitative and quantitative evaluations of our methods and of our results are discussed in Sections 10 and 11. Finally, several possible applications and tasks for Xtract are discussed in Section 12.</Paragraph>
  </Section>
</Paper>