<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0118">
  <Title>The Effects of Corpus Size and Homogeneity on Language Model Quality</Title>
  <Section position="6" start_page="188" end_page="189" type="concl">
    <SectionTitle>
5. Conclusions
</SectionTitle>
    <Paragraph position="0"> The analysis of the corpora has provided several revealing insights. Firstly, it is necessary to determine the homogeneity of a corpus prior to performing any similarity measures, since it is not clear what a measure of similarity would mean if a homogeneous corpus was being compared with a heterogeneous one. A methodology for calculating homogeneity has been described and the accuracy and usefulness of this is further described in Kilgarriff (1997).</Paragraph>
    <Paragraph position="1"> Clearly, the ernail corpus is highly heterogeneous. This means it is particularly prone to &amp;quot;burstiness&amp;quot; and unpredictability, which affects all levels of n-grams (including unlgrams). This may be due in part to the particular training corpus used, but it is more likely to be inherent to the medium, since email can fulfil so many communicative functions. It therefore exhibits a level of diversity surpassed perhaps only by spontaneous speech. Investigation of the spoken part of the BNC is therefore suggested as an area for further work.</Paragraph>
    <Paragraph position="2"> To a certain extent, the apparent heterogeneity of the emall undermines the results of any similarity measures applied to this corpus. Nevertheless, the extent to which the email is unlike all the other BNC domains is quite apparent and therefore mitigates any unprincipled approaches to corpus augmentation using crude, top-down techniques that involve complete domains taken from the BNC. Consequently, the best way to acquire more ernail data appears to be either: (a) to instigate a further collection initiative, (b) to use more sophisticated bottom.up methods, or (c) to use self-organising adaptation techniques (e.g.</Paragraph>
    <Paragraph position="3"> Clarkson and Robinson, 1997). The similarity metric used in (b) must be chosen carefully. Although the Loglikelihood and rank correlation metrics both produce results that can look intuitively plausible, this merely underlines the need for an objective, thorough evaluation method. Loglikefihood appears to be the more principled of the two measures, and it is suggested that this offers the greater potential.</Paragraph>
    <Paragraph position="4"> The results of the language modelling exercise provide clear evidence that it is possible to build effective LMs from small corpora. The email LM outperformed the other L_Ms on real spoken data (albeit taken from a technical, &amp;quot;ernaiMike&amp;quot; domain) for eight of the ten speakers. This is significant, considering the other LMs were trained on corpora that were several times larger. This effect can be mainly attributed to the source of the n-grams and the extent.to which the larger LMs &amp;quot;waste&amp;quot; probability mass on n-grams that never actually occur in the test data. Other researchers also have investigated methods for adapting large, general LMs using data from a small domain corpus and have found merit in simply building a  smaller LM directly from the domain corpus. For example, Ueberla (1997) observed that the improvements gained by using adaptation techniques compared to simply &amp;quot;starting from scratch&amp;quot; on the domain data become quite small when several tens of thousands of words of domain data are available.</Paragraph>
    <Paragraph position="5"> (Since the email corpus is almost 2 million words it clearly meets this criterion.) It is interesting to note also that this threshold is seen to vary according to the level of similarity between the adaptation (domainspecific) corpus and the background (general) corpus.</Paragraph>
    <Paragraph position="6"> It is also possible to adapt LMs dynamically, using cache-based methods (e.g. Kuhn &amp; de Mori 1990) and evidence suggests that this may prove the more effective approach (Matsunaga et al., 1992). It is clear that ernail is highly heterogeneous and therefore inherently unpredictable. Attempting to model this by static means can thus produce only limited success. By contrast, a dynamic LM would adapt to the current input and update its probabilities accordingly. However, dynamic LMs still need a set of static, baseline probabifities, so the email LM may present the best starting point for this.</Paragraph>
  </Section>
</Paper>