<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0702">
  <Title>Experiments in Unsupervised Entropy-Based Corpus Segmentation</Title>
  <Section position="3" start_page="0" end_page="10" type="metho">
    <SectionTitle>
2 The Approach
2.1 The Corpus
</SectionTitle>
    <Paragraph position="0"> We assume that any corpus C can be described by the expression:</Paragraph>
    <Paragraph position="2"> There must be at least one token T (&quot;word&quot;), which is a string of one or more symbols s: $T = s^{+}$</Paragraph>
    <Paragraph position="4"> Different tokens T must be separated from each other by one or more separators S, which are strings of zero or more symbols s: $S = s^{*}$</Paragraph>
    <Paragraph position="6"> Separators can consist of blanks, new-lines, or &quot;real&quot; symbols. They can also be empty strings.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Recoding the Corpus
</SectionTitle>
      <Paragraph position="0"> We will describe the approach on the example of a German corpus.</Paragraph>
      <Paragraph position="1"> First, all symbols s (actually all character codes) of the corpus are recoded by strings of &quot;visible&quot; ASCII characters. For example (footnote 2): [...] If the language and the alphabet are unknown or unidentified, the symbols of the corpus can be encoded by arbitrarily defined ASCII strings.</Paragraph>
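      <Paragraph position="2"> The recoding step can be illustrated by a minimal Python sketch. The original recoding table is not available here, so the table RECODE, the function name recode, and every mapping other than BL and NL are assumptions made for illustration only. Symbols without an entry receive an arbitrarily defined ASCII string, as suggested for unknown alphabets.

```python
# Hypothetical recoding table. Only "BL" and "NL" are taken from the text;
# the umlaut codes are assumptions chosen for this illustration.
RECODE = {
    " ": "BL",        # blank
    "\n": "NL",       # new-line
    "\u00e4": "a:",   # a-umlaut (assumed code)
    "\u00f6": "o:",   # o-umlaut (assumed code)
    "\u00fc": "u:",   # u-umlaut (assumed code)
}


def recode(text):
    """Return one visible-ASCII code string per input symbol."""
    codes = []
    for ch in text:
        if ch in RECODE:
            codes.append(RECODE[ch])
        elif ch.isascii() and ch.isprintable():
            codes.append(ch)                   # already a visible ASCII character
        else:
            codes.append("U+%04X" % ord(ch))   # arbitrary but unique ASCII code
    return codes


print(recode("f\u00fcr uns\n"))
# ['f', 'u:', 'r', 'BL', 'u', 'n', 's', 'NL']
```

      Keeping the result as a list of code strings, rather than one concatenated string, preserves the symbol boundaries even when a code is longer than one character.</Paragraph>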
    </Section>
    <Section position="2" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Information and Entropy
</SectionTitle>
      <Paragraph position="0"> We estimate probabilities of symbols of the corpus using a 3rd order Markov model based on maximum likelihood. The probability of a symbol s with respect to this model M and to a context c can be estimated by:</Paragraph>
      <Paragraph position="2"> The information of a symbol s with respect to the model M and to a context c is defined by:</Paragraph>
      <Paragraph position="4"> Intuitively, information can be considered as the surprise of the model about the symbol s after having seen the context c. The more unexpected the symbol is given the model's experience, the higher the value of information [Shannon and Weaver, 1949]. The entropy of a context c with respect to this model M expresses the expected value of information, and is defined by:</Paragraph>
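      <Paragraph position="5"> The three quantities can be sketched in Python as follows. The formulas used are the standard maximum-likelihood and Shannon definitions (relative frequency, negative base-2 logarithm, expected information); they are assumed here to match the definitions of this section. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

ORDER = 3  # 3rd order Markov model, as stated in the text


def train(symbols, order=ORDER):
    """Count how often each symbol follows each context of `order` symbols."""
    counts = defaultdict(Counter)
    for k in range(order, len(symbols)):
        context = tuple(symbols[k - order:k])
        counts[context][symbols[k]] += 1
    return counts


def probability(counts, context, symbol):
    """Maximum-likelihood estimate: count(context, symbol) / count(context)."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return followers[symbol] / total if total else 0.0


def information(counts, context, symbol):
    """Surprise in bits: the negative base-2 logarithm of the probability."""
    p = probability(counts, context, symbol)
    return -math.log2(p) if p > 0 else math.inf


def entropy(counts, context):
    """Expected information over all symbols observed after the context."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return -sum((n / total) * math.log2(n / total) for n in followers.values())


counts = train(list("abcabcabd"))           # toy corpus, not from the paper
print(probability(counts, "cab", "c"))      # 0.5
print(information(counts, "cab", "d"))      # 1.0
```

      In this sketch, a right-to-left model could be obtained by training on the reversed symbol list; whether the original experiments were implemented this way is an assumption.</Paragraph>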
      <Paragraph position="6"> Monitoring entropy and information across a corpus shows that maxima often correspond with word boundaries [Alder, 1988; Hutchens and Alder, 1998; among many others].</Paragraph>
      <Paragraph position="7"> Footnote 3: Note that blanks become &quot;BL&quot; and new-lines become &quot;NL&quot;.</Paragraph>
      <Paragraph position="8"> More exactly, maxima in left-to-right entropy HLR and information ILR often mark the end of a separator string S, and maxima in right-to-left entropy HRL and information IRL often mark the beginning of a separator string, as can be seen in Figure 1. [...] of the symbol in a given left or right context. An entropy value is assigned between every two symbols. It expresses the model's uncertainty after having seen the left or right context, but not yet the symbol. When going from left to right, an end of a separator is often marked by a maximum in entropy because the next word to the right can start with almost any symbol, and the model has no &quot;idea&quot; what it will be. There is also a maximum in information because the first symbol of the word is (more or less) unexpected; the model has no particular expectation. Similarly, when going from right to left, a beginning of a separator is often marked by a maximum in entropy because the next word to the left can end with almost any symbol. There is also a maximum in information because the last symbol of the word is (more or less) unexpected; the model has no particular expectation.</Paragraph>
      <Paragraph position="9"> Usually, there is no maximum at the beginning of a separator when going from left to right, and no maximum at the end of a separator when going from right to left, because words often have &quot;typical&quot; beginnings or endings, e.g. prefixes or suffixes. This means that when we come from inside a word to the beginning or end of this word, the model will anticipate a separator, and since the number of alternative separators is usually small, the model will not be &quot;surprised&quot; to see a particular one. On the other hand, when we come from inside a separator to the beginning or end of this separator, although the model will expect a word, it will be &quot;surprised&quot; about any particular word because the number of alternative beginnings or endings of words is large. It may also be observed that the maxima in one direction are bigger than the maxima in the other direction because a particular language may have, e.g., stronger constraints on endings than on beginnings of words: a language may employ suffixes with most words in a corpus, which limits the number of endings, but rarely use prefixes, which allows a word to start with almost any symbol.</Paragraph>
      <Paragraph position="11"/>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 Thresholds
</SectionTitle>
      <Paragraph position="0"> Not all maxima correspond with word boundaries.</Paragraph>
      <Paragraph position="1"> Hutchens and Alder [1998] apply a threshold of 0.5 log2 |A| to select, among all maxima, those that represent boundaries.</Paragraph>
      <Paragraph position="2"> The present approach uses two thresholds that are based on the corpus data and contain no other factors: The first threshold v_av is the average of all values of the particular function, HLR, HRL, ILR, or IRL, across the corpus. The second threshold v_max is the average of all maxima of the particular function. All graphs of Figure 1 contain both thresholds (as dotted lines).</Paragraph>
      <Paragraph position="3"> To decide whether a value v of HLR, HRL, ILR, or IRL should be considered as a boundary, we use the four functions:</Paragraph>
      <Paragraph position="5"/>
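      <Paragraph position="6"> The two corpus-based thresholds described in this section, the average of all values and the average of all maxima, can be computed as in the following sketch. It assumes that a maximum is a value strictly larger than both of its neighbours, which the text does not specify. The function and variable names are hypothetical, and the four boundary functions themselves are not reproduced.

```python
def local_maxima(values):
    """Return the values that are strictly larger than both neighbours.

    Treating only strict interior peaks as maxima is an assumption.
    """
    return [values[k] for k in range(1, len(values) - 1)
            if values[k] > values[k - 1] and values[k] > values[k + 1]]


def thresholds(values):
    """Return (average of all values, average of all maxima)."""
    v_av = sum(values) / len(values)
    peaks = local_maxima(values)
    # Fall back to the overall average if the sequence has no interior peak.
    v_max = sum(peaks) / len(peaks) if peaks else v_av
    return v_av, v_max


# Toy entropy profile, not taken from the paper.
profile = [1.0, 3.0, 1.0, 2.0, 5.0, 0.0]
print(thresholds(profile))   # (2.0, 4.0)
```

      The same function would be applied separately to each of the four value sequences, so that each of them gets its own pair of thresholds.</Paragraph>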
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.5 Detection of Separators
</SectionTitle>
      <Paragraph position="0"> To find a separator, we are looking for a strong boundary to serve as a beginning or end of the separator. In the current example, we have chosen as a criterion for strong boundaries:</Paragraph>
      <Paragraph position="2"> Here H and I mean either HLR and ILR if we are looking for the end of a separator, or HRL and IRL if we are looking for the beginning of a separator. The variables h and i denote values of these functions at the considered point.</Paragraph>
      <Paragraph position="3"> Once a strong boundary is found, we search for a weak boundary to serve as an ending that matches the previously found beginning, or to serve as a beginning that matches the previously found ending.</Paragraph>
      <Paragraph position="4"> For weak boundaries, we use the criterion:</Paragraph>
      <Paragraph position="6"> If a matching pair of boundaries, i.e. a beginning and an end of a separator, is found, the separator is marked.</Paragraph>
      <Paragraph position="8"> In Figure 1 this is visualized by ] for empty and { } for non-empty separators.</Paragraph>
      <Paragraph position="9"> The search for a weak boundary that matches a strong one is stopped (without success) either after a certain distance (footnote 4) or at a breakpoint. For example, if we have the beginning of a separator and search for a matching end, then the occurrence of another beginning will stop the search. As a criterion for a breakpoint we have chosen:</Paragraph>
      <Paragraph position="11"> If the search for a matching point has been stopped for either reason, we need to decide whether the initially found strong boundary should be marked despite the missing match. It will only be marked if it is an obligatory boundary. Here we apply the criterion:</Paragraph>
      <Paragraph position="13"> In Figure 1 these unmatched obligatory boundaries are visualized by {u or }u.</Paragraph>
      <Paragraph position="14"> Each of the four criteria, for strong boundaries, weak boundaries, breakpoints, and obligatory boundaries, can be built from any of the four functions b00 to b30 (eqs. 7 to 10).</Paragraph>
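      <Paragraph position="15"> The control flow of the search described in this section can be sketched as follows for one direction: starting from a strong beginning and searching to the right for a matching weak end. The concrete criteria are not reproduced here, so the sketch takes them as precomputed lists of truth values, one per position. These lists, the function name, and the convention that a beginning and an end at the same position denote an empty separator are all assumptions.

```python
def match_beginning(p, weak_end, breakpoint, obligatory, max_dist):
    """Search to the right of a strong beginning at position p.

    weak_end, breakpoint and obligatory are hypothetical lists of booleans,
    one entry per position, assumed to be computed from the chosen criteria.
    Returns ("pair", p, q), ("unmatched", p), or None.
    """
    last = min(p + max_dist, len(weak_end) - 1)
    for q in range(p, last + 1):
        if weak_end[q]:
            return ("pair", p, q)      # matching end found: mark the separator
        if q > p and breakpoint[q]:
            break                      # e.g. another beginning: stop the search
    if obligatory[p]:
        return ("unmatched", p)        # marked despite the missing match
    return None                        # the strong boundary is discarded


no = [False] * 4
print(match_beginning(0, [False, False, True, False], no, no, max_dist=3))
# ('pair', 0, 2)
```

      The search from a strong end to the left for a matching weak beginning would mirror this function with the roles of the two directions exchanged.</Paragraph>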
    </Section>
    <Section position="5" start_page="2" end_page="10" type="sub_section">
      <SectionTitle>
2.6 Validation of Separators
</SectionTitle>
      <Paragraph position="0"> All separator strings that have a matching beginning and end marker are collected and counted.</Paragraph>
      <Paragraph position="1">  This seems sufficient because we found no separators longer than 3 so far (Tables 1 to 5).</Paragraph>
      <Paragraph position="2"> Table 1 shows such separators collected from the German example corpus. Column 5 contains the strings that constitute the separators, column 2 shows the count of these strings as separators, column 3 gives the number of different contexts (footnote 5) in which the separators occurred, column 4 shows the total count of the strings in the corpus, and column 1 contains aliases used further on to denote the separators. In Table 1 all separators are sorted with respect to column 3. From these separators we retain those that are above a defined threshold relative to the number of different contexts of the top-most separator. In all examples throughout this article, we use a relative threshold of 0.5, which means in this case (Table 1) that the top-most two separators, &quot;BL&quot; and &quot;NL&quot;, which occur in 1484 and 850 different contexts respectively, are retained (footnote 6). In the corpus, all separators that have been retained (Table 1) and that have at least one detected boundary (Fig. 1) are validated and marked. For the above corpus section this leads to: [...] Due to the approach, the precision for BL and NL is 100 %. A string which is different from BL and NL cannot be marked as a separator in the above example. If empty string separators were admitted, the precision would decrease.</Paragraph>
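      <Paragraph position="3"> The retention step with a relative threshold can be sketched as follows. The counts 1484 and 850 for BL and NL are those quoted above; the third entry and the function name are made up for illustration, since Table 1 is not reproduced here.

```python
def retain_separators(context_counts, relative_threshold=0.5):
    """Keep separators whose number of different contexts is above
    relative_threshold times that of the top-most separator."""
    top = max(context_counts.values())
    ranked = sorted(context_counts.items(), key=lambda item: -item[1])
    return [sep for sep, n in ranked if n > relative_threshold * top]


# "BL" and "NL" counts are quoted in the text; "e" and its count are made up.
contexts = {"BL": 1484, "NL": 850, "e": 310}
print(retain_separators(contexts))   # ['BL', 'NL']
```

      With a threshold of 0.5 the cut-off is 742 different contexts, so NL (850) is retained while the made-up third candidate is not.</Paragraph>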
    </Section>
  </Section>
  <Section position="4" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 More Examples
</SectionTitle>
    <Paragraph position="0"> We applied the approach to modified versions of the above-mentioned German corpus and to an English corpus.</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.1 German with Empty String Separators
</SectionTitle>
      <Paragraph position="0"> For this experiment, we removed all original separators, &quot;BL&quot; and &quot;NL&quot;, from the above German corpus [...] without blanks and new-line (truncated list) and obtained the result: FiirIn lost 10andsetz lou 10ng lound loNeu lobau loder loKanalis loation diirf loten loinden lon~ich losten loze hnJahren loBetr~ig \]oe loin loMillia rd \[oen IohSh loeaus ~ge \[0geben low erden Io. All loe loin loind loen loalte n loBun lod loes lol~ind loer lonmiisse n lobiszur \[0Jahrh lound loer lotwen de \[0die loKommunen lokrnde los loin s logesarn \[0t lokm lolangen loKanalund loLei lot loungs lonetzessan loie ren lo.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
3.2 German with Modified Separators
</SectionTitle>
      <Paragraph position="0"> For the next experiment, we changed all original separators, &quot;BL&quot; and &quot;NL&quot;, in the above German corpus into a string from the list { ........ , &quot;- -&quot;, ,,#,,, -# #,,, ,,__ _,,, ,,# #,, ,,_ _,, ~ :7  modified separators (truncated list) and obtained the result: Fiir 6In 6st 6and los loetz loung Io# # lound loNeu lobau loder-K loa loon lo al loisation lo--diirf loten#in h den h lon~ich losten \[1 zehn \[3 IoJahrenBe tr~ige-in h loMilliard loenhSh loe# loaus loge logeb loen lo## lowerden lo. hAIlein lo## 6in hd loenal loten</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
3.3 English Corpus
</SectionTitle>
      <Paragraph position="0"> On an English corpus where all original separators have been preserved (footnote 8): &quot;In the days when the spinning-wheels \ hummed busily in the farmhouses and even great ladies clothed \ in silk and thread lace had their toy spinning-wheels \ of polished oak there might be seen in \ districts far away among the lanes or deep in the bosom of the hills certain \ pallid undersized men who by the side of the brawny \ country-folk looked like the remnants of a disinherited race.&quot; we measured the information and entropy shown in Figure 2, collected the separators in Table 5 (truncated list), and obtained the result: In \[othe iodays iowhen \[othe \[ospinn ing-wheels \[ohummed bbusily ~in \[othe h farmhouses ~and \[oeven log teat 1oladies \[oclothed loin Iosilk \[o andNLthread \[olace \[ohad ~their \]a toy iospinning-wheelsBLof iapolis fled \[ooak \[othere \[omight lobe \[osee n loin \[odistricts \]ofar 10awayBLam ong \[othe \[olanesNLor \[odeep bin 10t he ~)bosom \[oof \[othe \[ohills \[0certa in \[0pallid \[0undersized \[lmen \[owh o \[0by \[othe \[osideBLof \[othe \[0braw ny \[0country-folk \[01ooked hlike \[0 the \[0remnantsBLof ~a \[0disinheri ted \[orace.</Paragraph>
      <Paragraph position="1"> Footnote 8: In this example, \ denotes that the current line is not finished yet but rather continues on the next line.</Paragraph>
    </Section>
  </Section>
</Paper>