<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1041"> <Title>Automatic Identification of Non-compositional Phrases</Title> <Section position="9" start_page="320" end_page="320" type="relat"> <SectionTitle> 7 Related Work </SectionTitle> <Paragraph position="0"> There has been much previous research on extracting collocations from corpora, e.g., (Choueka, 1988) and (Smadja, 1993). These approaches do not, however, distinguish between compositional and non-compositional collocations. Mutual information has often been used to separate systematic associations from accidental ones. It has also been used to compute the distributional similarity between words (Hindle, 1990; Lin, 1998). A method to determine the compositionality of verb-object pairs is proposed in (Tapanainen et al., 1998). The basic idea there is that &quot;if an object appears only with one verb (or few verbs) in a large corpus we expect that it has an idiomatic nature&quot; (Tapanainen et al., 1998, p. 1290). For each object noun o, (Tapanainen et al., 1998) computes the distributed frequency DF(o) and ranks the non-compositionality of o according to this value.</Paragraph> <Paragraph position="1"> Using the notation introduced in Section 3, DF(o) is computed as follows:</Paragraph> <Paragraph position="3"> where {v1, v2, ..., vn} are the verbs in the corpus that took o as the object and where a and b are constants.</Paragraph> <Paragraph position="4"> The first column in Table 5 lists the top 40 verb-object pairs in (Tapanainen et al., 1998). The &quot;mi&quot; column shows the result of our mutual information filter. The '+' sign means that the verb-object pair is also considered to be non-compositional according to mutual information filter (3). The '-' sign means that the verb-object pair is present in our dependency database, but it does not satisfy condition (3). 
For each '-' marked pair, the &quot;similar collocation&quot; column provides a similar collocation with a similar mutual information value (i.e., the reason why the pair is not considered to be non-compositional). The '&lt;&gt;' marked pairs are not found in our collocation database for various reasons. For example, &quot;finish seventh&quot; is not found because &quot;seventh&quot; is normalized as &quot;_NUM&quot;, &quot;have a go&quot; is not found because &quot;a go&quot; is not an entry in our lexicon, and &quot;take advantage&quot; is not found because &quot;take advantage of&quot; is treated as a single lexical item by our parser. The √ marks in the &quot;ntc&quot; column in Table 5 indicate that the corresponding verb-object pair is an idiom in (Spears and Kirkpatrick, 1993). It can be seen that none of the verb-object pairs in Table 5 that are filtered out by condition (3) is listed as an idiom in NTC-EID.</Paragraph> </Section> </Paper>