<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2244"> <Title>Optimal Multi-Paragraph Text Segmentation by Dynamic Programming</Title> <Section position="3" start_page="0" end_page="1485" type="metho"> <SectionTitle> 2 Fragmentation by Dynamic </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="1485" type="sub_section"> <SectionTitle> Programming </SectionTitle> <Paragraph position="0"> Fragmentation is the problem of choosing the paragraph boundaries that make the best fragment boundaries. The local minima of the similarity curve are the points of low lexical cohesion and thus the natural candidates. To get reasonably sized microdocuments, the similarity information alone is not enough; the lengths of the created fragments also have to be considered. In this section, we describe an approach that performs the fragmentation by using both the similarities and the length information in a robust manner. The method is based on dynamic programming (see, e.g., (Cormen et al., 1990)), which guarantees the optimality of the result with respect to the input and the parameters.</Paragraph>
<Paragraph position="1"> The idea of the fragmentation algorithm is as follows (see also Fig. 1). We start from the first boundary and calculate a cost for it as if the first paragraph were a single fragment. Then we take the second boundary and attach to it the minimum of the two available possibilities: the cost of the first two paragraphs as if they were a single fragment, and the cost (Figure 1 pseudocode: fragmentation(n, p, h, len[1..n], sim[1..n-1]) /* n number of paragraphs, p preferred fragment length, h scaling */ /* len[1..n] paragraph lengths, sim[1..n-1] boundary similarities */)</Paragraph>
<Paragraph position="3"> (Figure 1 pseudocode, continued: if e >= emin { /* optimization */ exit the innermost for loop; })</Paragraph>
<Paragraph position="5"> (Figure 1: the dynamic programming algorithm for fragment boundary detection.)</Paragraph>
<Paragraph position="6"> of the second paragraph as a separate fragment. In the following steps, the evaluation moves on by one paragraph at a time, and all possible locations of the previous breakpoint are considered. We continue this procedure until the end of the text, and finally we can generate the list of breakpoints that indicates the fragmentation.</Paragraph>
<Paragraph position="7"> The cost at each boundary is a combination of three components: the fragment length cost C_len, and the cost cost[.] and similarity sim[.] of some previous boundary. The cost function C_len gives the lowest cost for the preferred fragment length given by the user, say, 500 words. A fragment that is either shorter or longer gets a higher cost, i.e., is punished for its length. We have experimented with two families of cost functions: a family of second-degree functions (parabolas) and a family of linear functions,</Paragraph>
<Paragraph position="9"> (Figure 2 caption: (a) Linear. (b) Parabola. p is 600 words in both (a) and (b).)</Paragraph>
<Paragraph position="10"> &quot;H0.25&quot;, etc., indicates the value of h. Vertical bars indicate fragment boundaries, while short bars below the horizontal axis indicate paragraph boundaries.</Paragraph>
<Paragraph position="11"> where x is the actual fragment length, p is the preferred fragment length given by the user, and h is a scaling parameter that allows us to adjust the weight given to fragment length.
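The dynamic programming scheme described above can be sketched directly in code. The following is an illustrative sketch, not the paper's exact algorithm: it assumes a quadratic length cost with its minimum at the preferred length p, charges each chosen boundary the similarity across it (so breaks gravitate toward local minima of the similarity curve), and omits the early-exit optimization from Figure 1 for clarity. All function and variable names are ours.

```python
def length_cost(x, p, h):
    # Quadratic penalty with its minimum at the preferred length p;
    # h scales how strongly deviations from p are punished.
    # (Illustrative form only; the paper's exact cost families differ.)
    return h * ((x - p) / p) ** 2

def fragmentation(lengths, sims, p, h):
    """Choose fragment boundaries by dynamic programming.

    lengths[k] -- length (in words) of paragraph k, k = 0..n-1
    sims[k]    -- similarity across the boundary after paragraph k,
                  k = 0..n-2 (low similarity = good breakpoint)
    Returns the indices of paragraphs after which a fragment ends
    (the boundary after the final paragraph is implicit).
    """
    n = len(lengths)
    prefix = [0]                 # prefix[k] = total length of paragraphs 0..k-1
    for length in lengths:
        prefix.append(prefix[-1] + length)
    INF = float("inf")
    cost = [0.0] + [INF] * n     # cost[j]: best cost of segmenting paragraphs 0..j-1
    back = [0] * (n + 1)         # back[j]: start of the last fragment in that solution
    for j in range(1, n + 1):
        for i in range(j):       # last fragment covers paragraphs i..j-1
            c = cost[i] + length_cost(prefix[j] - prefix[i], p, h)
            if i > 0:
                c += sims[i - 1]  # penalty for breaking at a cohesive boundary
            if c < cost[j]:
                cost[j], back[j] = c, i
    breaks, j = [], n            # recover breakpoints from the back-pointers
    while j > 0:
        i = back[j]
        if i > 0:
            breaks.append(i - 1)  # fragment boundary after paragraph i-1
        j = i
    return sorted(breaks)
```

With four 100-word paragraphs, p = 200, and a sharp cohesion dip after the second paragraph (sims = [0.9, 0.0, 0.9]), this sketch places the single break exactly at that dip.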
The smaller the value of h, the less weight is given to the preferred fragment length in comparison with the similarity measure.</Paragraph> </Section> </Section> <Section position="4" start_page="1485" end_page="1485" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> As test data we used Mars by Percival Lowell, 1895.</Paragraph>
<Paragraph position="1"> As an illustrative example, we present the analysis of Section I (Evidence of it) of Chapter II (Atmosphere). The length of the section is approximately 6,600 words, and it contains 55 paragraphs. The fragments found with different parameter settings can be seen in Figure 2. One of the most interesting is the one obtained with the parabola cost function and h = .5. In this case the fragment length adjusts nicely according to the similarity curve. Looking at the text, most fragments have an easily identifiable topic, like atmospheric chemistry in fragment 7. Fragments 2 and 3 seem to have roughly the same topic, measuring the diameter of the planet Mars. The fact that they do not form a single fragment can be explained (Table 1 caption: l_avg, l_min, l_max denote the average, minimum, and maximum fragment length; d_avg the average deviation.)</Paragraph>
<Paragraph position="2"> by the preferred fragment length requirement.</Paragraph>
<Paragraph position="3"> Table 1 summarizes the effect of the scaling factor h on the fragment length variation with the two cost functions over those 8 sections of Mars that are at least 20 paragraphs long. The average deviation d_avg with respect to the preferred fragment length p is defined as d_avg = (sum_{i=1..m} |p - l_i|)/m, where l_i is the length of fragment i and m is the number of fragments. The cost function chosen affects the result considerably. As expected, the second-degree cost function allows more variation than the linear one, but the roles are reversed for small values of h.
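The average deviation d_avg defined above is straightforward to compute; a minimal sketch (the function name is ours):

```python
def avg_deviation(frag_lengths, p):
    # d_avg = (sum over fragments i of |p - l_i|) / m,
    # where m is the number of fragments produced.
    m = len(frag_lengths)
    return sum(abs(p - l) for l in frag_lengths) / m
```

For example, two fragments of 450 and 550 words with p = 500 give d_avg = 50.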
Although the experiment is limited, we can see that in this example a scaling factor h > 1.0 is unsuitable with the linear cost function (as is h = 1.5 with the parabola), since in these cases so much weight is given to the fragment length that fragment boundaries can appear very close to quite strong local maxima of the similarity curve.</Paragraph> </Section> </Paper>