<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1063"> <Title>Hierarchical Orderings of Textual Units</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Numerical Text Representation </SectionTitle> <Paragraph position="0"> This paper uses semantic spaces as a format for text representation. Although it neglects sentence as well as rhetorical structure, it departs from the bag of words model by referring to paradigmatic similarity as the fundamental feature type: instead of measuring intersections of lexical distributions, texts are interrelated on the basis of the paradigmatic regularities of their constituents. A coordinate value of a feature vector of a sign mapped onto semantic space measures the extent to which this sign (or its constituents in the case of texts) shares paradigmatic usage regularities with the word defining the corresponding dimension. Because of this sensitivity to paradigmatics, semantic spaces can capture indirect meaning relations: words can be linked even if they never co-occur but tend to occur in similar contexts. Furthermore, texts can be linked even if they do not share content words but deal with similar topics (Landauer and Dumais, 1997). Using this model as a starting point, we go a step further in departing from the bag of words model by taking quantitative characteristics of text structure into account (see below).</Paragraph> <Paragraph position="1"> Semantic spaces focus on meaning as use, as described by the weak contextual hypothesis (Miller and Charles, 1991), which says that the similarity of the contextual representations of words contributes to their semantic similarity.</Paragraph> <Paragraph position="2"> Regarding the level of texts, reformulating this hypothesis is straightforward: Contextual hypothesis for texts: the contextual similarity of the lexical constituents of two texts contributes to their semantic similarity.</Paragraph> <Paragraph position="3"> In other words: the more two texts share semantically similar words, the higher the probability that they deal with similar topics. Clearly, this hypothesis does not imply that texts whose components are contextually similar to a high degree also share propositional content. It is the structural (connotative), not the propositional (denotative) meaning aspect to which this hypothesis applies. Moreover, this version of the contextual hypothesis neglects the structural dimension of similarity relations: not only is a text structured into thematic components, each of which may semantically relate to different units, but the units similar to the text as a whole also do not form isolated, unstructured clumps. Neglecting the former, we focus on the latter phenomenon, which demands a supplementary hypothesis: Structure-sensitive contextual hypothesis: units which are similar to a text according to the contextual hypothesis contribute to the structuring of its meaning.</Paragraph> <Paragraph position="4"> Since we seek a model for automatic text representation for which nonlinguistic context is inaccessible, we limit contextual similarity to paradigmatic similarity. On this basis the latter two hypotheses can be summarized as follows: Definition 1. Let C be a corpus in which we observe paradigmatic regularities of words. The textual connotation of a text x with respect to C comprises those texts of C whose constituents realize paradigmatic regularities similar to those of the lexical constituents of x.
The connotation of x is structured on the basis of the same relation of (indirect) paradigmatic similarity interrelating the connoted texts.</Paragraph> <Paragraph position="5"> In order to model this concept of structured connotation, we use the space model M0 of (Rieger, 1984) as a point of departure and derive three text representation models M1, M2, M3. Since M0 only maps words onto the semantic space, we extend it in order to derive meaning points of texts. This is done as follows: M0 analyses word meanings as the result of a two-stage process of unsupervised learning. It builds a lexical semantic space by modeling syntagmatic regularities with a correlation coefficient $\alpha: W \to C \subseteq \mathbb{R}^n$ and their differences with a Euclidean metric $\delta: C \to S \subseteq \mathbb{R}^n$, where W is the set of words, C is called corpus space, representing syntagmatic regularities, and S is called semantic space, representing paradigmatic regularities. $|W| = n$ is the number of dimensions of both spaces. Neighborhoods of meaning points assigned to words model their semantic similarity: the shorter the points' distances in semantic space, the more paradigmatically similar the words.</Paragraph> <Paragraph position="6"> The set of words W, spanning the semantic space, is selected on the basis of the criterion of document frequency, which proves to be comparable in effectiveness to information gain and $\chi^2$-statistics (Yang and Pedersen, 1997). Furthermore, instead of using explicit stop word lists, we restricted W to the set of lemmatized nouns, verbs, adjectives, and adverbs.</Paragraph> <Paragraph position="7"> M1: In a second step, we use S as a format for representing meaning points of texts, which are mapped onto S with the help of a weighted mean of the meaning points assigned to their lexical constituents:</Paragraph> <Paragraph position="8"> $$\vec{x}_k = \frac{\sum_{a_i \in W(x_k)} w_{ik}\,\vec{a}_i}{\sum_{a_i \in W(x_k)} w_{ik}} \quad (1)$$ </Paragraph> <Paragraph position="9"> $\vec{x}_k$ is the meaning point of text $x_k \in C$, $\vec{a}_i$ the meaning point of word $a_i \in W$, and $W(x_k)$ is the set of all types of all tokens in $x_k$. Finally, $w_{ik}$ is a weight having the same role as the tf-idf scores in IR (Salton and Buckley, 1988). As a result of mapping texts onto S, they can be compared with respect to the paradigmatic similarity of their lexical organization. This is done with the help of a similarity measure s based on a Euclidean metric d operating on meaning points and standardized to the unit interval: $$s: \{\vec{x} \mid x \in C\}^2 \to [0,1] \quad (2)$$ s is interpreted as follows: the higher $s(\vec{x},\vec{y})$ for two texts x and y, the shorter the distance of their meaning points $\vec{x}$ and $\vec{y}$ in semantic space, the more similar the paradigmatic usage regularities of their lexical constituents, and finally the more semantically similar these texts according to the extended contextual hypothesis. This is the point where semantic spaces depart from the vector space model, since they do not demand that the texts in question share any lexical constituents in order to be similar; the intersection of the sets of their lexical constituents may even be empty.</Paragraph> <Paragraph position="10"> M2: So far, only lexical features have been considered. We depart a step further from the bag of words model by additionally comparing texts with respect to their organization. This is done with the help of a set of quantitative text characteristics used by (Tuldava, 1998) for automatic genre analysis: type-token ratio, hapax legomena, (variation of) mean word frequency, average sentence length, and action coefficient (i.e. the standardized ratio of verbs and adjectives in a text).
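As an illustration of the M1 mapping and the standardized similarity measure s described above, the following minimal Python sketch (not the authors' implementation; the identifiers and the normalization by a corpus-wide maximal distance are assumptions) computes a text's meaning point as the weighted mean of its word types' meaning points and turns a Euclidean distance into a similarity on [0,1]:

```python
import numpy as np

def text_meaning_point(word_points, weights):
    """Weighted mean of the meaning points of a text's word types (cf. M1):
    word_points maps each type to its vector in semantic space,
    weights maps the same types to tf-idf-like scores w_ik."""
    types = list(word_points)
    vecs = np.array([word_points[t] for t in types])
    w = np.array([weights[t] for t in types], dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

def similarity(x_vec, y_vec, d_max):
    """Distance-based similarity standardized to the unit interval; d_max is
    taken to be the largest distance between any two meaning points in the
    corpus (one possible way to standardize, assumed here for illustration)."""
    return 1.0 - np.linalg.norm(x_vec - y_vec) / d_max

# toy example in a three-dimensional semantic space
points = {"market": np.array([0.9, 0.1, 0.0]), "trade": np.array([0.7, 0.2, 0.1])}
tfidf = {"market": 2.0, "trade": 1.0}
x = text_meaning_point(points, tfidf)
```

Note that two texts without any shared word types can still receive a high similarity here, since only the positions of their meaning points in semantic space are compared.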
In order to make these features comparable, they were standardized using z-scores so that random variables were derived with means of 0 and variances of 1. Beyond these characteristics, a further feature was considered: each text was mapped onto a so called text structure string representing its division into sections, paragraphs, and sentences as a course approximation of its rhetorical structure. For example, a text structure string</Paragraph> <Paragraph position="12"> denotes a text T of two sections D, where the first includes 1 and the second 3 sentences S.</Paragraph> <Paragraph position="13"> Using the Levenshtein metric for string comparison, this allows to measure the rhetorical similarity of texts in a first approximation. The idea is to distinguish units connoted by a text, which in spite of having similar lexical organizations differ texturally. If for example a short commentary connotes two equally similar texts, another commentary and a long report, the commentary should be preferred. Thus, in M2 the textual connotation of a text is not only seen to be structured on the basis of the criterion of similarity of lexical organization, but also by means of genre specific features modeled as quantitative text characteristics. This approach follows (Herdan, 1966), who programmatically asked, whether difference in style correlates with difference in frequency of use of linguistic forms.</Paragraph> <Paragraph position="14"> See (Wolters and Kirsten, 1999) who, following this approach, already used POS frequency as a source for genre classification, a task which goes beyond the scope of the given paper.</Paragraph> <Paragraph position="15"> On this background a compound text similarity measure can be derived as a linear model:</Paragraph> <Paragraph position="17"> a. where s1(x,y) = s(vectorx,vectory) models lexical semantics of texts x,y according to M1; b. s2 uses the Levenshtein metric for measuring the similarity of the text structure stings assigned to x and y; c. and s3 measures, based on an Euclidean metric, the similarity of texts with respect to the quantitative features enumerated above.</Paragraph> <Paragraph position="18"> oi biases the contribution of these different dimensions of text representation. We yield good results for o1 = 0.9, o2 = o3 = 0.05.</Paragraph> <Paragraph position="19"> M3: Finally, we experimented with a text representation model resulting from the aggregation (i.e. weighted mean) of the vector representations of a text in both spaces, i.e. vector and semantic space. This approach, which demands both spaces to have exactly the same dimensions and standardized coordinate values, follows the idea to reduce the noise inherent to both models: whether syntagmatic as in case of vector spaces, or paradigmatic as in case of semantic spaces. We experimented with equal weights of both input vectors.</Paragraph> <Paragraph position="20"> In the next section we use the text representation models M1, M2, M3 as different starting points for modeling the concept of structured connotation as defined in definition (1):</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Text Linkage </SectionTitle> <Paragraph position="0"> Departing from ordinary list as well as cluster structures, we model the connotation of a text as a hierarchy, where each node represents a single connoted text (and not a set of texts as in case of agglomerative cluster analysis). 
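To make the compound measure of M2 concrete, the following Python sketch combines the three components with the weights $o_1 = 0.9$, $o_2 = o_3 = 0.05$ given above. The interface (dicts carrying a meaning point, a text structure string, and a z-scored feature vector) and the way distances are turned into [0,1] similarities are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance between two
    text structure strings encoding sections, paragraphs, and sentences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def compound_similarity(x, y, o=(0.9, 0.05, 0.05)):
    """Linear combination of the three similarity components of M2.
    x and y are dicts with keys 'point' (meaning point in semantic space),
    'structure' (text structure string), and 'features' (z-scored
    quantitative characteristics) -- an assumed interface."""
    s1 = 1.0 / (1.0 + np.linalg.norm(x["point"] - y["point"]))        # lexical semantics (M1)
    s2 = 1.0 - levenshtein(x["structure"], y["structure"]) / max(
        len(x["structure"]), len(y["structure"]), 1)                  # rhetorical structure
    s3 = 1.0 / (1.0 + np.linalg.norm(x["features"] - y["features"]))  # quantitative genre features
    return o[0] * s1 + o[1] * s2 + o[2] * s3
```

The dominant weight on the lexical-semantic component reflects the setting reported above; the two structural components act as tie-breakers between units with similar lexical organization.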
In order to narrow down a solution for this task, we need a linguistic criterion which bridges between the linguistic knowledge represented in semantic spaces and the task of connotative text linkage. For this purpose we refer to the concept of lexical cohesion introduced by (Halliday and Hasan, 1976); see (Morris and Hirst, 1991; Hearst, 1997; Marcu, 2000), who already use this concept for text segmentation. According to this approach, lexical cohesion results from the reiteration of words which are semantically related on the basis of (un-)systematic relations (e.g.</Paragraph> <Paragraph position="1"> synonymy or hyponymy). Unsystematic lexical cohesion results from patterns of contextual, paradigmatic similarity: "[...] lexical items having similar patterns of collocation--that is, tending to appear in similar contexts--will generate a cohesive force if they occur in adjacent sentences." (Halliday and Hasan, 1976, p. 286).</Paragraph> <Paragraph position="2"> Several factors influencing this cohesive force are decisive for reconstructing the concept of textual connotation: (i) the contextual similarity of the words in question, (ii) their syntagmatic order, and (iii) the distances of their occurrences. These factors cooperate as follows: the shorter the distance between similar words in a text, the higher their cohesive force. Furthermore, preceding lexical choices restrict (the interpretation of) subsequent ones, an effect which weakens as their distance grows. But longer distances may be compensated by higher contextual similarities, so that highly related words can contribute to the cohesion of a text span even if they co-occur at a distance. By restricting contextual similarity to paradigmatic similarity, and therefore measuring unsystematic lexical cohesion as a function of paradigmatic regularities, the transfer of this concept to the task of hierarchically modeling textual connotations becomes straightforward. Given a text x whose connotation is to be represented as a tree T, we demand for any path P starting at the root x: (i) Similarity: If text y is more similar to x than z, then the path between x and y is shorter than the path between x and z, provided that y and z belong to the same path P.</Paragraph> <Paragraph position="3"> (ii) Order: The shorter the distance between y and z in P, the higher their cohesive force, and vice versa: the longer the path, the higher the probability that the subsequent z is paradigmatically dissimilar to y.</Paragraph> <Paragraph position="4"> (iii) Distance: A cohesive impact is preserved even in the case of longer paths, provided that the textual nodes lying in between are paradigmatically similar to a high degree.</Paragraph> <Paragraph position="5"> The reason underlying these criteria is the need to control negative effects of intransitive similarity relations: if text x is highly similar to y, and y to z, it is not guaranteed that (x,y,z) is a cohesive path, since similarity is not transitive. In order to reduce this risk of incohesive paths, the latter criteria demand that there is a cohesive force even between nodes which are not immediately linked. This demand decreases as the path distance of nodes increases, so that topic changes latently controlled by preceding nodes can be realized. In other words: when adding text z to the hierarchically structured connotation of x, we do not simply look for an already inserted text y to which z is most similar, but rather for a path P which minimizes the loss of cohesion in the overall tree when z is attached to it.
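One simple way to operationalize the interplay of contextual similarity and distance described above is to discount paradigmatic similarity by an exponential decay over the number of intervening units. This is a minimal sketch under assumed parameters (the decay function and its rate are not taken from the paper):

```python
import numpy as np

def cohesive_force(sim: float, distance: int, decay: float = 0.5) -> float:
    """Cohesive force between two occurrences: their paradigmatic similarity
    sim (in [0,1]) discounted by the number of intervening units `distance`.
    High similarity can compensate for larger distances, as described above."""
    return sim * np.exp(-decay * distance)

# strongly related words three units apart vs. weakly related adjacent words
print(cohesive_force(0.9, 3))   # ~0.20
print(cohesive_force(0.3, 1))   # ~0.18
```

The example illustrates the compensation effect: a highly similar but distant pair can contribute roughly as much cohesion as a weakly similar adjacent pair.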
These comments induce an optimality criterion which tries to optimize cohesion not only of directly linked nodes, but of whole paths, thereby reflecting their syntagmatic order. Looking for a mathematical model of this optimality criterion, minimal spanning trees (MSTs) are ruled out, since they only optimize direct node-to-node similarities, disregarding any path context. Furthermore, whereas we expect different trees modeling the connotations of different texts, MSTs ignore this aspect dependency, since they focus on a unique spanning tree of the underlying feature space. Another candidate is given by dependency trees (Rieger, 1984), which are equal to similarity trees (Lin, 1998): for a given root x, the nodes are inserted into its similarity tree (ST) in descending order of their similarity to x, where the predecessor of any node z is chosen to be the already inserted node y to which z is most similar. Although STs already capture the aspect dependency induced by their varying roots, the path criterion is still not met. Thus, we generalize the concept of a ST to that of a cohesion tree as follows: First, we observe that the construction of STs uses two types of order relations: the first, call it $<^1_x$, determines the order in which the nodes are inserted, depending on the root x; the second, call it $<^2_y$, varies with the node y to be inserted and determines its predecessor. Next, in order to build cohesion trees out of this skeleton, we instantiate all relations $<^2_y$ in a way that finds the path of minimal loss of cohesion when y is attached to it. This is done with the help of a distance measure which induces a descending order of cohesion of paths: Definition 2. Let $G = \langle V,E \rangle$ be a graph and $P = (v_1,\ldots,v_k)$ a simple path in G. The path-sensitive distance $\hat{d}(P,y)$ of $y \in V$ with respect to P is defined as</Paragraph> <Paragraph position="7"> value assumed by distance measure d, and $V(P)$ is the set of all nodes of path P.</Paragraph> <Paragraph position="8"> It is clear that for any of the text representation models M1, M2, M3 and their corresponding similarity measures we get different distance measures $\hat{d}$ which can be used to instantiate the order relations $<^2_y$ in order to determine the end vertex of the path of minimal loss of cohesion when y is attached to it. In the case of biases $o_i$ increasing with the index i in Definition 2, the syntagmatic order of path P is reflected in the sense that the shorter the distance of x to any vertex in P, the higher the impact of their (dis-)similarity measured by d, and the higher their cohesive force. Using the relations $<^2_y$ we can now formalize the concept of a cohesion tree: Definition 3. Let $G = \langle V,E,\omega \rangle$ be a complete weighted graph induced by a semantic space, and $x \in V$ a node. The graph $D(G,x) = \langle V,E',\nu \rangle$ with $E' = \{\{v,w\} \mid v <^1_x w \wedge \neg\exists y \in V : y <^1_x w \wedge y <^2_w v\}$ and $\nu: E' \to \mathbb{R}$, the restriction of $\omega$ to $E'$, is called the cohesion tree induced by x.</Paragraph> <Paragraph position="9"> Using this definition of a cohesion tree (CT), we can compute hierarchical models of the connotations of texts in which not only the aspect dependency induced by the corresponding root, but also path cohesion is taken into account.</Paragraph> <Paragraph position="10"> A note on the relation between CTs and cluster analysis: CTs do not only depart from cluster hierarchies because their nodes represent single objects and not sets, but also because they refer to a local, context-sensitive building criterion (with respect to their roots and paths).
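Since the exact form of the path-sensitive distance is not reproduced above, the following Python sketch only illustrates the construction scheme of Definitions 2 and 3 under one plausible instantiation: nodes are inserted in descending similarity to the root (the relation $<^1_x$), and each new node is attached to the end vertex of the already-built path to which it has, on a position-weighted average biased towards the path's end, the smallest distance (the relation $<^2_y$). The weighting scheme and all identifiers are assumptions, not the authors' code:

```python
import numpy as np

def cohesion_tree(dist, root):
    """Builds a cohesion-tree-like hierarchy over the node indices of a
    symmetric distance matrix `dist` (e.g. derived from M1, M2 or M3).
    Returns a dict mapping each node to its predecessor."""
    n = len(dist)
    order = sorted((i for i in range(n) if i != root), key=lambda i: dist[root][i])
    parent = {root: None}

    def path_to_root(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path[::-1]                      # root ... v

    def path_distance(path, y):
        # one plausible path-sensitive distance: vertices closer to the end
        # of the path receive higher biases o_i (cf. Definition 2)
        biases = np.arange(1, len(path) + 1, dtype=float)
        biases /= biases.sum()
        return float(sum(o * dist[v][y] for o, v in zip(biases, path)))

    for y in order:                            # descending similarity to the root
        best = min(parent, key=lambda v: path_distance(path_to_root(v), y))
        parent[y] = best                       # attach y where the loss of cohesion is minimal
    return parent
```

The non-overlapping "perspective clusters" mentioned below can then be read off by deleting edges {v, parent[v]} whose distance exceeds a chosen threshold.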
In contrast to this, cluster analysis tries to find a global partition of the data set. Nevertheless, there is a connection between both methods of unsupervised learning: given an MST, there is a simple procedure for deriving a divisive partition (Duda et al., 2001). Moreover, single linkage graphs are based on a criterion comparable to that of MSTs. Analogously, a given CT can be divided into non-overlapping clusters by deleting those edges whose length is above a certain threshold.</Paragraph> <Paragraph position="11"> This induces, so to speak, perspective clusters organized depending on the perspective of the root and the paths of the underlying CT.</Paragraph> </Section> </Paper>