<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1810"> <Title>Quantitative Portraits of Lexical Elements</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Methods of term weighting </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Tfidf </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> wherea1a15a18a17 is the total frequency of a terma0a26,a24 is the total number of the documents, and a24 a26 is the total number of documents in which the terma0a26 occurs.</Paragraph> <Paragraph position="3"> Aizawa (2003) has shown that this can be derived from an information theoretic measure. Let a28 and a29 be random variables defined over events in a set of documentsa30 a12a32a31a33a5a35a34a37a36a38a5a40a39a41a36a43a42a44a42a44a42a44a36a38a5a26a36a43a42a44a42a44a42a44a36a38a5a46a45a48a47 and a set of different termsa10 a12a49a31a37a0a50a34a37a36a2a0a51a39a16a36a43a42a44a42a44a42a44a36a2a0a53a52a40a36a43a42a44a42a44a42a44a36a2a0a51a54a55a47 ina30 . Leta1a26a52 denote the frequency ofa0a26 ina5a56a52 ,a1a16a57a58a17 the total frequency ofa0a26, a1a41a59a6a60 the total number of running terms ina5a56a52 , anda61 the total number of term tokens</Paragraph> <Paragraph position="5"> Giving probabilities by relative frequencies, and assuming that all the documents have equal size and the frequency ofa0a26 in the documents that containa0a26 is equal, this measure becomesa0a2a1a4a3a6a5a7a1 ;a0a51a1a4a3a6a5a7a1 has an information theoretic meaning within the given set of documents (Figure 2).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Term representativeness </SectionTitle> <Paragraph position="0"> Hisamitsu, et al. (2000a) proposed a measure of &quot;term representativeness&quot;, in order to overcome the CompuTerm 2004 Poster Session - 3rd International Workshop on Computational Terminology 75 text texttext text texttexttext A set of actual texts (targets of IR) term termterm termtermterm termterm Terms as attributes of concrete set of documents text texttext text texttexttext A set of actual texts (a manifestation of discourse) Textual sphere / theoretical sphere of discourse term termterm termtermterm termterm Terms as attributes of theoretical discourse represented by the given set of documents excessive sensitivity of weighting measures to token frequencies. They hypothesised that, for a term a0 , if the term is representative, a30 a15 (the set of all documents containing a0 ) have some specific characteristic. They define a measure which calculates the distance between a distributional characteristic of words arounda0 and the same distributional characteristic in the whole document set.</Paragraph> <Paragraph position="1"> In order to remove the factor of data size dependency, Hisamitsu et al. (2000a) defines the &quot;baseline function,&quot; which indicates the distance between the distribution of words in the original document set and the distribution of words in randomly selected document subsets for each size. 
The distance between the distribution of words in the original document set and the distribution of words in the documents which accompany the focal term $t$ is normalised by this baseline function. Formally,</Paragraph> <Paragraph position="3"> $Rep(t) = \frac{Dist(P_t, P)}{B(|D_t|)}$, where the baseline $B(|D_t|)$ is the expected value of $Dist(P_R, P)$, (2)</Paragraph> <Paragraph position="4"> where $D$ denotes the set of all documents; $P$ the distribution of words in $D$; $t$ a focal term; $D_t$ the set of all documents containing $t$; $P_t$ the distribution of words in $D_t$; $P_R$ the distribution of words in randomly selected documents whose size equals that of $D_t$; and $Dist(P_i, P_j)$ the distance between two distributions of words $P_i$ and $P_j$. Log-likelihood ratio was used to measure the distance.</Paragraph> <Paragraph position="6"> This measure observes the centripetal force of a term vis-à-vis discourse, i.e. it captures the characteristic of terms in the general discourse as represented by the given set of documents (Figure 3).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Lexical productivity </SectionTitle> <Paragraph position="0"> Nakagawa (2000) incorporates a factor of lexical productivity of constituent elements of compound units for complex term extraction. The method observes in how many different compounds an element $t_i$ is used in a given document set (let us denote this as $d(t_i, N)$, where $N$ indicates the size of the overall document set as counted by the number of word tokens), and uses that in the weighting of compounds containing $t_i$ by taking a weighted average. By explicitly limiting the syntagmatic range of observation of cooccurrence to the unit of compounds, he focused on the lexical productivity as manifested in texts. [Figure 4: Terms as an attribute of the autonomous lexicological sphere; a set of actual texts (a ladder to be discarded); textual sphere / theoretical sphere of discourse; lexicological sphere / theoretical sphere of lexica.]</Paragraph> <Paragraph position="1"> This measure depends on token occurrence, but we can also think of the theoretical lexical productivity in the lexicological sphere: how many compounds $t_i$ can potentially make (let us denote this by $d(t_i)$). For that, it is necessary to remove the factor of token occurrence. This can be done by taking the limit $d(t_i) = \lim_{N \to \infty} d(t_i, N)$.</Paragraph> <Paragraph position="3"> This has so far been unexplored.</Paragraph>
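To make the two token-based weighting measures of Sections 3.1 and 3.2 concrete, here is a minimal Python sketch (not from the paper): a term-level tfidf as defined above, and a simplified representativeness in the spirit of equation (2), in which $Dist$ is a log-likelihood-ratio distance between word distributions and the baseline is approximated by averaging that distance over a few random document samples of the same size rather than by a fitted baseline function. The corpus format (each document a list of word tokens), the function names, and the number of samples are our own assumptions.

```python
import math
import random
from collections import Counter


def tfidf(term, documents):
    """Term-level tfidf: total frequency of the term times log(N / n_i)."""
    f_i = sum(doc.count(term) for doc in documents)
    n_i = sum(1 for doc in documents if term in doc)
    if f_i == 0 or n_i == 0:
        return 0.0
    return f_i * math.log(len(documents) / n_i)


def llr_distance(subset_counts, whole_counts):
    """Distance of a subset's word distribution from the whole collection's,
    measured (up to a constant factor) by the log-likelihood ratio."""
    n_sub = sum(subset_counts.values())
    n_all = sum(whole_counts.values())
    llr = 0.0
    for word, observed in subset_counts.items():
        expected = n_sub * whole_counts[word] / n_all
        llr += observed * math.log(observed / expected)
    return llr


def representativeness(term, documents, n_samples=10, seed=0):
    """Dist(P_t, P) normalised by a sampled approximation of the baseline:
    the average Dist(P_R, P) over random document sets of the same size."""
    whole_counts = Counter(w for doc in documents for w in doc)
    d_t = [doc for doc in documents if term in doc]
    if not d_t:
        return 0.0
    dist_t = llr_distance(Counter(w for doc in d_t for w in doc), whole_counts)

    # Baseline: average distance for randomly selected document sets of the
    # same size as D_t (a stand-in for the fitted baseline function).
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(n_samples):
        sample = rng.sample(documents, len(d_t))
        baseline += llr_distance(Counter(w for doc in sample for w in doc),
                                 whole_counts)
    baseline /= n_samples
    return dist_t / baseline if baseline else 0.0
```

Hisamitsu et al. avoid this per-term re-sampling by fitting the baseline function over subset sizes once; the sampling above is only to keep the sketch short.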
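The observed productivity $d(t_i, N)$, and the binomial expectation used in the estimation described in the paragraph that follows, can be sketched as below. This is again our own illustration, not the paper's code: the input format (a list of compounds given as tuples of constituent elements) is hypothetical, and a real estimate of the potential $d(t_i)$ would come from fitted LNRE models rather than from known compound probabilities.

```python
# Sketch (not from the paper): observed productivity d(t_i, N), counted from a
# hypothetical list of compounds (each a tuple of constituent elements), and
# the binomial expectation of how many potential compounds of t_i are seen
# after f occurrences of t_i.  Estimating the potential d(t_i) itself would
# use fitted LNRE models (Baayen, 2001) rather than known probabilities.

def observed_productivity(element, compounds):
    """d(t_i, N): number of distinct compound types containing the element."""
    return len({c for c in compounds if element in c})


def expected_seen_compounds(compound_probs, f):
    """E[d(t_i, N)] = sum_k (1 - (1 - pi_ik) ** f): expected number of a
    term's potential compounds that actually occur after f tokens of t_i,
    where pi_ik is the probability that an occurrence of t_i falls in
    compound c_ik."""
    return sum(1.0 - (1.0 - p) ** f for p in compound_probs)


# Toy usage with hypothetical data:
compounds = [("machine", "learning"), ("learning", "rate"),
             ("learning", "algorithm"), ("machine", "learning")]
print(observed_productivity("learning", compounds))    # 3 distinct types
print(expected_seen_compounds([0.5, 0.3, 0.2], f=10))  # approx. 2.86
```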
<Paragraph position="4"> Potential lexical productivity of an element can be estimated from textual data. Letting $p_i$ be the occurrence probability of $t_i$ in texts, $f(t_i, N)$ be the token occurrence of $t_i$ in texts, and $\pi_{ik}$ be the sample probability given to each compound $c_{ik}$, and assuming a combination of binomial distributions, we have $E[d(t_i, N)] = \sum_k (1 - (1 - \pi_{ik})^{f(t_i, N)})$,</Paragraph> <Paragraph position="5"> the expected number of compound types that actually occur in the document set among the potential compounds of $t_i$. $d(t_i)$ can be estimated by LNRE methods (Baayen, 2001).</Paragraph> <Paragraph position="6"> Being a measure representing the potential power of a lexical element $t_i$ for constructing compounds, $d(t_i)$ indicates the lexical productivity in the lexicological sphere, which corresponds to the theoretical sphere of discourse as represented by the given document set (Figure 4).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Portraits of lexical elements </SectionTitle> <Paragraph position="0"> As the three different measures capture three different aspects of lexical elements, they are not competitive.1 We here use these measures to illustrate the characteristics of a few lexical elements.</Paragraph> <Paragraph position="1"> We used the NII morphologically tagged corpus (Okada et al., 2001) for observation, which consists of Japanese abstracts in the field of artificial intelligence. Table 1 shows the basic quantitative information. We chose the six most frequently occurring nominal elements for observation, i.e. the elements glossed here as &quot;system&quot;, &quot;model&quot;, &quot;knowledge&quot;, &quot;learning&quot;, &quot;information&quot; and &quot;problem&quot;. Intuitively, &quot;system&quot; and &quot;model&quot; are rather general with respect to the domain of artificial intelligence, &quot;knowledge&quot; and &quot;learning&quot; are domain specific, and &quot;information&quot; and &quot;problem&quot; are in between. Table 2 shows the basic quantitative information for these six lexical elements.</Paragraph> <Paragraph position="2"> Figure 5 plots tfidf and term representativeness for the six elements. Table 3 shows the estimated values of lexical productivity.</Paragraph> <Paragraph position="3"> Figure 5 shows that &quot;learning&quot; and &quot;knowledge&quot;, intuitively the domain-dependent elements, take high tfidf values, but the term representativeness value of &quot;knowledge&quot; is much lower than that of &quot;learning&quot;, and about the same as that of &quot;information&quot;. Interestingly, the lexical productivity of &quot;knowledge&quot; and &quot;information&quot; is also very close. It is possible to infer from these values of term representativeness and lexical productivity that, within the discourse of artificial intelligence, neither &quot;information&quot; nor &quot;knowledge&quot; has strong centripetal force, as both are rather &quot;base&quot; concepts of the domain. Table 2 shows that &quot;knowledge&quot; is more often used on its own, while &quot;information&quot; tends to occur in compounds. From this we might hypothesise that &quot;knowledge&quot; is in itself a &quot;base&quot; concept of artificial intelligence, while &quot;information&quot; becomes a &quot;base&quot; concept in combination with other lexical items. This fits our intuition, as &quot;information&quot; in itself is more a &quot;base&quot; concept of information and computer science, a broader domain of which artificial intelligence is a subdomain.
The low tfidf value of &quot;information&quot; comes from its low token frequency coupled with its relatively high document frequency (DF), which shows that &quot;information&quot;, when it is used, tends to scatter across documents. This is in accordance with the interpretation that &quot;information&quot; tends to occur in compounds. Still, it is difficult to interpret sensibly the fact that the tfidf value of &quot;information&quot; is lower than those of &quot;model&quot; and &quot;system&quot;. Perhaps it is more sensible to interpret tfidf only among elements whose term representativeness values are higher than a certain threshold. Then we can say that &quot;learning&quot; and &quot;knowledge&quot; represent concepts more &quot;central&quot; to the domain of artificial intelligence than &quot;information&quot;.</Paragraph> <Paragraph position="4"> The element &quot;learning&quot;, which takes the highest values both in tfidf and in term representativeness, is conspicuous in its lexical productivity. Compared to &quot;knowledge&quot;, whose tfidf value is also high, and to the three elements &quot;problem&quot;, &quot;information&quot; and &quot;knowledge&quot;, whose term representativeness values are relatively high, the lexical productivity of &quot;learning&quot; is orders of magnitude (about a million times) higher, and similar to that of &quot;model&quot; or &quot;system&quot;. Table 2 shows that &quot;learning&quot; does not occur much on its own, nor does it occur much as the head of compounds. This indicates that &quot;learning&quot; represents an important concept in the given data and in the discourse of artificial intelligence, but only &quot;indirectly&quot;, in combination with other elements in compounds, where &quot;learning&quot; tends to contribute as a modifier rather than as a head.</Paragraph> <Paragraph position="5"> The two &quot;general&quot; lexical elements, i.e. &quot;model&quot; and &quot;system&quot;, take low term representativeness values.2 This is in accordance with our intuition. The lexical productivity of these two elements is extremely high (practically infinite). This indicates that these two elements can be widely used in a variety of discoursal contexts, without in themselves contributing much to consolidating the content of the discourse. This fits nicely with our intuitive interpretation of the meanings of these two elements, i.e. they are orthogonal to such domain-dependent elements as &quot;knowledge&quot; or &quot;learning&quot;.</Paragraph> <Paragraph position="6"> This leaves us with the final element, &quot;problem&quot;.</Paragraph> <Paragraph position="7"> Its term representativeness value is high, second only to &quot;learning&quot;, and in between &quot;learning&quot; and &quot;information&quot;/&quot;knowledge&quot;. Its lexical productivity is much closer to that of &quot;information&quot; and &quot;knowledge&quot; than to the other three. As such, &quot;problem&quot; can be interpreted as a kind of &quot;base&quot; concept, though it retains a stronger centripetal force than &quot;information&quot; and &quot;knowledge&quot;.</Paragraph>
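As a small illustration of the reading suggested earlier in this section, interpreting tfidf only among elements whose term representativeness exceeds a threshold, the following sketch (our own, not the paper's procedure; the function name, data structures and threshold value are hypothetical) filters on representativeness, ranks the survivors by tfidf, and reports their lexical productivity alongside.

```python
# Sketch (not from the paper): read tfidf only for elements whose term
# representativeness exceeds a (hypothetical) threshold, then report a
# simple "portrait" of each surviving element.
def portraits(elements, tfidf_score, representativeness, productivity,
              rep_threshold=1.5):
    """Rank elements by tfidf among those whose representativeness is at
    least rep_threshold; return (element, tfidf, representativeness,
    productivity) tuples."""
    central = [e for e in elements if representativeness[e] >= rep_threshold]
    central.sort(key=lambda e: tfidf_score[e], reverse=True)
    return [(e, tfidf_score[e], representativeness[e], productivity[e])
            for e in central]
```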
<Paragraph position="8"> If we ignore the tfidf values of &quot;model&quot; and &quot;system&quot; and only compare &quot;information&quot;, &quot;problem&quot;, &quot;learning&quot; and &quot;knowledge&quot;, it is also sensible to see that &quot;problem&quot; represents a concept more central to the domain than &quot;information&quot; but less central than &quot;learning&quot; and &quot;knowledge&quot;.</Paragraph> </Section> </Paper>