<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1804">
  <Title>Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora</Title>
  <Section position="5" start_page="3" end_page="5" type="metho">
    <SectionTitle>
2 Positional Ngrams
</SectionTitle>
    <Paragraph position="0"> In the specific field of multiword unit extraction, Dias (2002) has introduced the positional ngram model that has evidenced successful results for the extraction of discontinuous collocations from large corpora.</Paragraph>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
2.1 Principles
</SectionTitle>
      <Paragraph position="0"> The original idea of the positional ngram model comes from the lexicographic evidence that most lexical relations associate words separated by at most five other words (Sinclair, 1974). As a consequence, lexical relations such as collocations can be continuous or discontinuous sequences of words in a context of at most eleven words (i.e. 5 words to the left of a pivot word, 5  The CETEMPublico is a 180 million-word corpus of Portuguese. It can be obtained at http://www.ldc.upenn.edu/.</Paragraph>
      <Paragraph position="1">  This represents 46.986.831 positional ngrams.</Paragraph>
      <Paragraph position="2"> words to the right of the same pivot word and the pivot word itself). In general terms, a collocation can be defined as a specific  continuous or discontinuous sequence of words in a (2.F+1)-word size window context (i.e. F words to the left of a pivot word, F words to the right of the same pivot word and the pivot word itself). This situation is illustrated in Figure 1 for the colloca- null Thus, as computation is involved, we need to process all possible substrings (continuous or discontinuous) that fit inside the window context and contain the pivot word. Any of these substrings is called a positional ngram. For instance, [Ngram Statistics] is a positional ngram as is the discontinuous sequence [Ngram ___ from] where the gap represented by the underline stands for any word occurring between Ngram and from (in this case, Statistics). More examples are given in Table 1.</Paragraph>
      <Paragraph position="3">  In order to compute all the positional ngrams of a corpus, we need to take into account all the words as possible pivot words.</Paragraph>
      <Paragraph position="4">  A simple way would be to shift the two-window context to the right so that each word would sequentially be processed. However, this would inevitably lead to duplications of positional ngrams. Instead, we propose a  As specific, we intend a sequence that fits the definition of collocation given by Dias (2002): &amp;quot;A collocation is a recurrent sequence of words that co-occur together more than expected by chance in a given domain&amp;quot;.</Paragraph>
      <Paragraph position="5"> Virtual Approach to Deriving Ngram Statistics from Large Scale pivot F=3 F=3 one-window context that shifts to the right along the corpus as illustrated in Figure 2. It is clear that the size of the new window should be 2.F+1.</Paragraph>
      <Paragraph position="6"> This new representation implies new restrictions. While all combinations of words were valid positional ngrams in the two-window context, this is not true for a one-window context. Indeed, two restrictions must be observed. null Restriction 1: Any substring, in order to be valid, must contain the first word of the window context.</Paragraph>
      <Paragraph position="7"> Restriction 2: For any continuous or discontinuous sub-string in the window context, by shifting the substring from left to right, excluding gaps and words on the right and inserting gaps on the left, so that there always exists a word in the central position cpos (Equation 2) of the window, there should be at least one shift that contains all the words of the substring in the context window.</Paragraph>
      <Paragraph position="9"> Equation 2: Central position of the window For example, from the first case of Figure 2, the discontinuous sequence [A B _ _ E _ G] is not a positional ngram although it is a possible substring as it does not follow the second restriction. Indeed, whenever we try to align the sequence to the central position, at least one word is lost as shown in Table 2:</Paragraph>
      <Paragraph position="11"> In contrast, the sequence [A _ C _ E F _] is a positional ngram as the shift [_ A _ C _ E F], with C in the central position, includes all the words of the substring.</Paragraph>
      <Paragraph position="12"> Basically, the first restriction aims at avoiding duplications and the second restriction simply guarantees that no substring that would not be computed in a two-window context is processed.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
2.2 Virtual Representation
</SectionTitle>
      <Paragraph position="0"> The representation of positional ngrams is an essential step towards efficient computation. For that, purpose, we propose a reference representation rather than an explicit structure of each positional ngram. The idea is to adapt the suffix representation (Manber and Myers, 1990) to the positional ngram case.</Paragraph>
      <Paragraph position="1"> Following the suffix representation, any continuous corpus substring is virtually represented by a single position of the corpus as illustrated in Figure 3. In fact, the substring is the sequence of words that goes from the word referred by the position till the end of the corpus.</Paragraph>
      <Paragraph position="2">  Unfortunately, the suffix representation can not directly be extended to the specific case of positional ngrams.</Paragraph>
      <Paragraph position="3"> One main reason aims at this situation: a positional ngram may represent a discontinuous sequence of words. In order to overcome this situation, we propose a representation of positional ngrams based on masks.</Paragraph>
      <Paragraph position="4"> As we saw in the previous section, the computation of all the positional ngrams is a repetitive process. For each word in the corpus, there exists an algorithmic pattern that identifies all the possible positional ngrams in a 2.F+1-word size window context. So, what we need is a way to represent this pattern in an elegant and efficient way.</Paragraph>
      <Paragraph position="5"> One way is to use a set of masks that identify all the valid sequences of words in a given window context.</Paragraph>
      <Paragraph position="6"> Thus, each mask is nothing more than a sequence of 1 and 0 (where 1 stands for a word and 0 for a gap) that represents a specific positional ngram in the window context. An example is illustrated in Figure 4.</Paragraph>
      <Paragraph position="7">  Computing all the masks is an easy and quick process.</Paragraph>
      <Paragraph position="8"> In our implementation, the generation of masks is done recursively and is negligible in terms of space and time. In table 3, we give the number of masks h(F) for different values of F.</Paragraph>
      <Paragraph position="9">  In order to identify each mask and to prepare the reference representation of positional ngrams, an array of masks is finally built as in Figure 5.</Paragraph>
      <Paragraph position="10">  From these structures, the virtual representation of any positional ngram is straightforward. Indeed, any positional ngram can be identified by a position in the corpus and a given mask. Taking into account that a corpus is a set of documents, any positional ngram can be represented by the tuple {{id  stands for the document id of the corpus, pos doc for a given position in the document and id mask for a specific mask. An example is illustrated in Figure 6.</Paragraph>
      <Paragraph position="11">  As we will see in the following section, this reference representation will allow us to follow the Virtual Corpus approach introduced by Kit and Wilks (1998) to compute ngram frequencies.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="6" type="metho">
    <SectionTitle>
3 Computing Frequency
</SectionTitle>
    <Paragraph position="0"> With the Virtual Corpus approach, counting continuous substrings can easily and efficiently be achieved. After sorting the suffix-array data structure presented in Figure 3, the count of an ngram consisting of any n words in the corpus is simply the count of the number of adjacent indices that take the n words as prefix. We illustrate the Virtual Corpus approach in Figure 6.</Paragraph>
    <Paragraph position="1">  Counting positional ngrams can be computed exactly in the same way. The suffix-array structure is sorted using lexicographic ordering for each mask in the array of masks. After sorting, the count of a positional ngram in the corpus is simply the count of adjacent indices that stand for the same sequence. We illustrate the Virtual Corpus approach for positional ngrams in Figure 7.</Paragraph>
    <Paragraph position="2">  The efficiency of the counting mainly resides in the use of an adapted sort algorithm. Kit and Wilks (1998) propose to use a bucket-radixsort although they acknowledge that the classical quicksort performs faster for large-vocabulary corpora. Around the same perspective, Yamamoto and Church (2000) use the Manber and Myers's algorithm (1990), an elegant radixsort-based algorithm that takes at most O(N log N) time and shows improved results when long repeated substrings are common in the corpus.</Paragraph>
    <Paragraph position="3"> For the specific case of positional ngrams, we have chosen to implement the Multikey Quicksort algorithm (Bentley and Sedgewick, 1997) that can be seen as a mixture of the Ternary-Split Quicksort (Bentley and McIlroy, 1993) and the MSD  radixsort (Anderson and Nilsson, 1998).</Paragraph>
    <Paragraph position="4"> The algorithm processes as follows: (1) the array of string is partitioned into three parts based on the first symbol of each string. In order to process the split a pivot element is chosen just as in the classical quicksort giving rise to: one part with elements smaller than the pivot, one part with elements equal to the pivot and one part with elements larger than the pivot; (2) the smaller and the larger parts are recursively processed in exactly the same manner as the whole array; (3) the equal part is also sorted recursively but with partitioning starting from the second symbol of each string; (4) the process goes on recursively: each time an equal part is being processed, the considered position in each string is moved forward by one symbol.</Paragraph>
    <Paragraph position="5"> In Figure 8, we propose an illustration of the Multikey Quicksort taken from the paper (Bentley and Sedge- null MSD stands for Most Significant Digit.</Paragraph>
    <Paragraph position="6"> Different reasons have lead to use the Multikey Quicksort algorithm. First, it performs independently from the vocabulary size. Second, it shows O(N log N) time complexity in our specific case. Third, Anderson and Nilsson (1998) show that it performs better than the MSD radixsort and proves comparable results to the newly introduced Forward radixsort.</Paragraph>
    <Paragraph position="7"> Counting frequencies is just a preliminary step towards collocation extraction. The following step attaches an association measure to each positional ngram that evaluates the interdependency between words inside a given sequence. In the positional ngram model, Dias et al. (1999) propose the Mutual Expectation measure.</Paragraph>
  </Section>
  <Section position="7" start_page="6" end_page="18" type="metho">
    <SectionTitle>
4 Computing Mutual Expectation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.1 Principles
</SectionTitle>
      <Paragraph position="0"> The Mutual Expectation evaluates the degree of rigidity that links together all the words contained in a positional ngram ([?]n, n [?] 2) based on the concept of Normalized Expectation and relative frequency.</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="11" type="sub_section">
      <SectionTitle>
Normalized Expectation
</SectionTitle>
      <Paragraph position="0"> The basic idea of the Normalized Expectation is to evaluate the cost, in terms of cohesiveness, of the loss of one word in a positional ngram. Thus, the Normalized Expectation measure is defined in Equation 3 where the function k(.) returns the frequency of any positional</Paragraph>
      <Paragraph position="2"> For that purpose, any positional ngram is defined algebraically as a vector of words [p  . Thus, the positional ngram [A _ C D E _ _] would be rewritten as [0 A +2 C +3 D +4 E] and its Normalized Expectation would be given by Equation 4.</Paragraph>
      <Paragraph position="3">  The &amp;quot;^&amp;quot; corresponds to a convention used in Algebra that consists in writing a &amp;quot;^&amp;quot; on the top of the omitted term of a given succession indexed from 1 to n.</Paragraph>
      <Paragraph position="4">  By statement, any p ii is equal to zero.</Paragraph>
      <Paragraph position="6"> Unsorted array Sorted array as at be by he in is it of on or to as is be by on in at it of he or to</Paragraph>
      <Paragraph position="8"/>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
Mutual Expectation
</SectionTitle>
      <Paragraph position="0"> One effective criterion for multiword lexical unit identification is frequency. From this assumption, Dias et al.</Paragraph>
      <Paragraph position="1"> (1999) pose that between two positional ngrams with the same Normalized Expectation, the most frequent positional ngram is more likely to be a collocation. So, the Mutual Expectation of any positional ngram is defined in Equation 5 based on its Normalized Expectation and its relative frequency.</Paragraph>
      <Paragraph position="2">  In order to compute the Mutual Expectation of any positional ngram, it is necessary to build a data structure that allows rapid and efficient search over the space of all positional ngrams. For that purpose, we propose a multidimensional array structure called Matrix  .</Paragraph>
    </Section>
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.2 Matrix
</SectionTitle>
      <Paragraph position="0"> The attentive reader will have noticed that the denominator of the Normalized Expectation formula is the average frequency of all the positional (n-1)grams included in a given positional ngram. These specific positional ngrams are called positional sub-ngrams of order n-1  . So, in order to compute the Normalized Expectation and a fortiori the Mutual Expectation, it is necessary to access efficiently to the sub-ngrams frequencies. This operation is done through the Matrix.  The Matrix also speeds up the extraction process that applies the GenLocalMaxs algorithm (Gael Dias, 2002). We do not present this algorithm due to lack of space.</Paragraph>
      <Paragraph position="1">  In order to ease the reading, we will use the term sub-ngrams to denote positional sub-ngrams of order n-1.</Paragraph>
      <Paragraph position="2"> However, to understand the Matrix itself, we first need to show how the sub-ngrams of any positional ngram can be represented.</Paragraph>
    </Section>
    <Section position="5" start_page="11" end_page="18" type="sub_section">
      <SectionTitle>
Representing sub-ngrams
</SectionTitle>
      <Paragraph position="0"> A sub-ngram is obtained by extracting one word at a time from its related positional ngram as shown in Fig- null By representing a sub-ngram, we mean calculating its virtual representation that identifies its related substring. The previous figure shows that representing the first three sub-ngrams of the positional ngram {{0,0},14} is straightforward as they all contain the first word of the window context. The only difficulty is to know the mask they are associated to. Knowing this, the first three sub-ngrams would respectively be represented as: {{0,0},15}, {{0,0},16}, {{0,0},13}.</Paragraph>
      <Paragraph position="1"> For the last sub-ngram, the situation is different. The first word of the window context is omitted. As a consequence, in order to calculate its virtual representation, we need to know the position of the first word of the substring as well as its corresponding mask. In this case, the position in the document of the positional sub-ngram is simply the position of its related positional ngram plus the distance that separates the first word of the window context from the first word of the substring. We call delta this distance. The obvious representation of the fourth sub-ngram is then {{0,2},18} where the position is calculated as 0+(delta=2)=2.</Paragraph>
      <Paragraph position="2"> In order to represent the sub-ngrams of any positional ngram, all we need is to keep track of the masks related  to the mask of the positional ngram and the respective deltas. Thus, it is clear that for each mask, there exists a set of pairs {id mask , delta} that allows identifying all the sub-ngrams of any given positional ngram. Each pair is called a submask and is associated to its upper mask  as illustrated in Figure 10.</Paragraph>
      <Paragraph position="3"> Figure 10: Submasks Now that all necessary virtual representations are wellestablished, in order to calculate the Mutual Expectation, we need to build a structure that allows efficiently accessing any positional ngram frequency. This is the objective of the Matrix, a 2-dimension array structure. 2-dimension Array Structure Searching for specific positional ngrams in a huge sample space can be overwhelming. To overcome this computation problem, two solutions are possible: (1) keep the suffix array-based data structure and design optimized search algorithms or (2) design a new data structure to ease the searching process. We chose the second solution as our complete system heavily depends on searching through the entire space of positional ngrams  and, as a consequence, we hardly believe that improved results may be reached following the second solution.</Paragraph>
      <Paragraph position="4"> This new structure is a 2-dimension array where lines stand for the masks ids and the columns for the positions in the corpus. Thus, each cell of the 2-dimension array represents a given positional ngram as shown in Figure 11. This structure is called the Matrix.</Paragraph>
      <Paragraph position="5"> The frequency of each positional ngram can easily be represented by all its positions in the corpus. Indeed, a given positional ngram is a substring that can appear in different positions of the corpus being the count of these positions its frequency. From the previous suffix array- null The upper mask is the mask from which the submasks are calculated. While upper masks represent positional ngrams, submasks represent sub-ngrams.</Paragraph>
      <Paragraph position="6">  In fact, this choice mainly has to do with the extraction process and the application of the GenLocalMaxs algorithm.</Paragraph>
      <Paragraph position="7"> based data structure, calculating all these positions is straightforward.</Paragraph>
      <Paragraph position="8"> Calculating the Mutual Expectation is also straightforward and fast as accessing to any positional ngram can be done in O(1) time complexity. We will illustrate this reality in the next section.</Paragraph>
      <Paragraph position="9">  The illustration of our architecture is now complete. We now need to test our assumptions. For that purpose, we present results of our implementation over the CETEMPublico corpus.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>