<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1804">
  <Title>Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper describes an implementation that computes positional ngram statistics (i.e. Frequency and Mutual Expectation) using masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extraction of discontinuous collocations from large corpora. However, it is computationally expensive. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus in a seven-word window context. In comparison, only 700,000 ngrams would be computed for the classical ngram model. Clearly, substantial effort is needed to compute positional ngram statistics in reasonable time and space.</Paragraph>
    <Paragraph position="1"> Our solution runs in O(h(F) N log N) time, where N is the corpus size and h(F) is a function of the window context.</Paragraph>
  </Section>
</Paper>