File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/02/c02-1063_evalu.xml
Size: 5,260 bytes
Last Modified: 2025-10-06 13:58:45
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1063"> <Title>Hierarchical Orderings of Textual Units</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> Figure (1) exemplifies a CT based on M3 using a textual root dealing with the &quot;BSE Food Scandal&quot; from 1996. The text sample belongs to a corpus of 502 texts of the German newspaper Süddeutsche Zeitung, comprising about 320,000 running words.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> Each text is assigned to one element of a set T of 18 different subject categories (e.g. politics, sports).</Paragraph> <Paragraph position="1"> Based on the lemmatized corpus, a semantic space of 2715 lexical dimensions was built, and all texts were mapped onto this space according to the specifications of M3. In Figure (1), each textual node of the CT is represented by its headline and subject category as found in the newspaper. All computations were performed using a set of C++ programs implemented specifically for this study.</Paragraph> <Paragraph position="2"> In order to rate models M1, M2 and M3 in comparison to the vector space model (VS), using MSTs, STs and CTs as alternative hierarchical models, we proceed as follows: as a simple measure of representational goodness, we compute the average categorial cohesion of links of all MSTs, STs and CTs for the different models and all texts in the corpus. Let G = &lt;V,E&gt; be a tree of textual nodes x ∈ V, each of which is assigned to a subject category t(x) ∈ T, and let P(G) be the set of all paths in G starting with root x and ending with a leaf. Then the categorial cohesion of G is the average number of links (vi,vj) ∈ E per path P ∈ P(G) for which t(vi) = t(vj). 
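The definition of categorial cohesion given above can be illustrated with a minimal Python sketch. It is not the C++ implementation used in the study; the function names, the dictionary-based tree encoding and the toy data are assumptions made only for illustration.

```python
def root_to_leaf_paths(children, root):
    """Return every path from the root to a leaf as a list of nodes."""
    kids = children.get(root, [])
    if not kids:
        return [[root]]
    paths = []
    for child in kids:
        for tail in root_to_leaf_paths(children, child):
            paths.append([root] + tail)
    return paths


def categorial_cohesion(children, category, root):
    """Average number of same-category links per root-to-leaf path."""
    paths = root_to_leaf_paths(children, root)
    same = 0
    for path in paths:
        # Each consecutive pair of nodes on a path is one link of the tree.
        for parent, child in zip(path, path[1:]):
            if category[parent] == category[child]:
                same += 1
    return same / len(paths)


# Hypothetical toy tree: five texts with made-up subject categories.
children = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
category = {"a": "politics", "b": "politics", "c": "sports",
            "d": "politics", "e": "sports"}
score = categorial_cohesion(children, category, "a")
```

In this toy tree, the path a-b-d contributes two same-category links and the path a-c-e contributes one, so the score is 3 links over 2 paths.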
The more nodes of identical categories are linked in paths in G, the more categorially homogeneous these paths are, and the higher the average categorial cohesion of G. According to the conceptual basis of CTs, we expect these trees to have the highest categorial link cohesion, but this is not the case: MSTs produce the highest cohesion values for VS and M3. Furthermore, we observe that model M3 induces trees of highest cohesion and lowest variance, whereas VS shows the highest variance and lowest cohesion scores for STs and CTs. In other words: based on semantic spaces, models M1, M2 and M3 produce more stable results than the vector space model.</Paragraph> <Paragraph position="3"> Using M3 as a starting point, we can ask more precisely which tree class produces the most cohesive model of text connotation.</Paragraph> <Paragraph position="4"> Clearly, the measure of categorial link cohesion is not sufficient to evaluate the classes, since two immediately linked texts belonging to the same subject category may nevertheless deal with different topics.</Paragraph> <Paragraph position="5"> Thus we need a finer-grained measure which operates directly on the texts' meaning representations. For unsupervised clustering, where fine-grained class labels are missing, Steinbach et al. (2000) propose a measure which estimates the overall cohesion of a cluster. This measure can be directly applied to trees: let P_{v1,vn} = (v1, ..., vn) be a path in tree G = &lt;V,E&gt; starting with root v1 = x. We compute the cohesion of P irrespective of the order of its nodes as follows:</Paragraph> <Paragraph position="7"> The more similar the nodes of path P according to metric d, the more cohesive P is. The metric d is derived from the distance measure operating on the semantic space to which the texts vi are mapped. As before, all scores x(P) are summed up for all paths in P(G) and standardized by means of |P(G)|. 
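Since the formula of measure (5) is not reproduced in this section, the following Python sketch only assumes a form in the spirit of the overall-similarity measure of Steinbach et al. (2000): the mean similarity over all ordered pairs of nodes on a path, which is independent of node order. Cosine similarity stands in for the similarity derived from d, and all names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def path_cohesion(path, vectors, sim=cosine):
    """Assumed form: mean similarity over all ordered node pairs of the path."""
    n = len(path)
    total = sum(sim(vectors[a], vectors[b]) for a in path for b in path)
    return total / (n * n)


def tree_cohesion(paths, vectors, sim=cosine):
    """Sum of path scores over all root-to-leaf paths, divided by their number."""
    return sum(path_cohesion(p, vectors, sim) for p in paths) / len(paths)


# Hypothetical two-dimensional text vectors and two root-to-leaf paths.
vectors = {"x": [1.0, 0.0], "y": [1.0, 0.0], "z": [0.0, 1.0]}
paths = [["x", "y"], ["x", "z"]]
score = tree_cohesion(paths, vectors)
```

Here the path (x, y) links identical vectors and scores 1.0, whereas (x, z) links orthogonal vectors and scores 0.5, so the tree as a whole scores 0.75 under the assumed form.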
This standardization guarantees that neither trees of maximum height (MHT) nor trees of maximum degree (MDT), i.e. trees which trivially correspond to lists, are assigned the highest cohesion values. The results of summing up these scores for all trees of a given class and for all texts in the test corpus are shown in Table (2).</Paragraph> <Paragraph position="8"> Now, CTs and STs realize the most cohesive structures. This is more obvious if the scores x(G) are compared for each text separately: in 494 cases, CTs are of highest cohesion according to measure (5). In only 7 cases, MSTs are of highest cohesion, and in only one case, the corresponding ST is of highest cohesion. Moreover, even the stochastically organized so-called random successor trees (RST), in which successor nodes and their predecessors are randomly chosen, produce more cohesive structures than lists (i.e. MDTs and MHTs), which form the predominant format used to organize search results on the Internet.</Paragraph> <Paragraph position="9"> To sum up: Table (2) rates CTs in combination with model M3 highest. Thus, from the point of view of lexical semantics, CTs realize more cohesive branches than MSTs. Whether these differences are significant, however, is hard to evaluate, since the theoretical distribution of the scores is unknown. Future work will therefore focus on finding these distributions.</Paragraph> </Section> </Section> </Paper>