<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1804">
  <Title>Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora</Title>
  <Section position="8" start_page="18" end_page="18" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We ran a series of experiments with our C++ implementation on the CETEMPublico Portuguese corpus to derive positional ngram statistics (Frequency and Mutual Expectation). The experiments were run on an Intel Pentium 900 MHz PC with 390 MB of RAM. From the original corpus, we randomly defined five sub-corpora of different sizes. For each sub-corpus we measured the execution time of the different stages of the process: (1) the tokenization that transforms the corpus into a set of integers; (2) the preparation of the mask structure and the construction of the suffix-array data structure; (3) the sorting of the suffix-array data structure and the creation of the Matrix; (4) the calculation of the ME. The results are given in Table 5.</Paragraph>
    <Paragraph position="1">  The window context of the experiment is F=3.</Paragraph>
    <Paragraph position="2"> The results clearly show that the construction of the Matrix and the sort operation over the suffix-array data structure are the most time-consuming procedures. In contrast, the computation of the Mutual Expectation is quick, because the Matrix gives direct access to the frequencies of the sub-ngrams. To help understand the evolution of the results, Figure 12 presents them graphically.</Paragraph>
    <Paragraph position="3"> The graphical representation illustrates an approximately linear time complexity. Indeed, Alexandre Gil (2002) has proved that, mainly due to the use of the Multikey Quicksort algorithm, our implementation exhibits a time complexity of O(h(F) N log N), where N is the size of the corpus and h(F) is a function of the window context.</Paragraph>
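    <Paragraph position="4"> The tokenization and sorting stages described above can be illustrated with a minimal sketch. It is not the C++ implementation evaluated here: it is written in Python, it handles only contiguous ngrams (no masks, no Matrix, no Mutual Expectation), and it uses a list-based rather than in-place variant of Multikey Quicksort. The function names, the whitespace tokenization and the max_depth parameter are assumptions made for illustration.

```python
def tokenize(text):
    """Stage 1 (simplified): map each whitespace-separated token to an integer id."""
    vocab = {}
    ids = []
    for word in text.split():
        # A new word receives the next free id; a known word reuses its id.
        ids.append(vocab.setdefault(word, len(vocab)))
    return ids, vocab


def key_at(ids, start, depth):
    """Token id at offset depth of the suffix starting at start, or -1 past the end."""
    pos = start + depth
    return ids[pos] if len(ids) > pos else -1


def multikey_quicksort(ids, suffixes, depth, max_depth):
    """Sort suffix start positions on their first max_depth token ids.

    Three-way partition on the token at the current depth: suffixes with a
    smaller or larger token are sorted at the same depth, while suffixes
    that share the pivot token are sorted on the next token.
    """
    if 2 > len(suffixes) or depth >= max_depth:
        return list(suffixes)
    pivot = key_at(ids, suffixes[len(suffixes) // 2], depth)
    lower, equal, higher = [], [], []
    for s in suffixes:
        k = key_at(ids, s, depth)
        if pivot > k:
            lower.append(s)
        elif k > pivot:
            higher.append(s)
        else:
            equal.append(s)
    if pivot != -1:
        # Suffixes that continue past this depth are compared on the next token.
        equal = multikey_quicksort(ids, equal, depth + 1, max_depth)
    return (multikey_quicksort(ids, lower, depth, max_depth)
            + equal
            + multikey_quicksort(ids, higher, depth, max_depth))


def ngram_frequencies(ids, sorted_suffixes, n):
    """Count contiguous n-grams; equal n-grams are adjacent in sorted order.

    Assumes the suffixes were sorted with max_depth of at least n.
    """
    runs = []
    for s in sorted_suffixes:
        if s + n > len(ids):
            continue  # suffix too short to contain an n-gram
        gram = tuple(ids[s:s + n])
        if runs and runs[-1][0] == gram:
            runs[-1][1] += 1
        else:
            runs.append([gram, 1])
    return [(gram, count) for gram, count in runs]
```

 For the toy corpus "a b a b c", tokenize yields the ids 0 1 0 1 2, and sorting the five suffix positions with max_depth set to 3 places the two suffixes that begin with "a b" next to each other, so a single pass over the sorted positions is enough to count that bigram twice.</Paragraph>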
  </Section>
</Paper>