<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2171"> <Title>N-GRAM CLUSTER IDENTIFICATION DURING EMPIRICAL KNOWLEDGE REPRESENTATION GENERATION</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. SYSTEM OVERVIEW </SectionTitle> <Paragraph position="0"> The approach acquires a domain specific semantic representation by carrying out stochastic analysis on a large corpus from a technical domain, ltigh frequency phrases are identified and used to recognise groups of paragraphs containing similar subsets of these phrases. It is assumed that, in general, the similarities between paragraphs within each group will define stereotypical concepts.</Paragraph> <Paragraph position="1"> Tools will enable a domain expert to view and manipulate these sets of paragraphs and generate a hierarchical semantic representation of concepts.</Paragraph> <Paragraph position="2"> The corpus ,and semantic representation are used to generate schematic structures within the technical domain. Each structure consists of a list of semantic concepts. Sets of structures which have a high level of correspondence are generated. It is assumed that stereotypical structures are represented by similarities between the members of sets containing a sufficient number of structures, and sufficient correspondence. These are stored in a structure knowledge base.</Paragraph> <Paragraph position="3"> The structures represent stereotypical situations such as lists of actions (e.g. scientific experiments), and common textual information (e.g. the definition of application areas). They are used to translate the existing texts into a semantic/pragmatic representation and store the knowledge in a concise and structured format in a technical knowledge base.</Paragraph> <Paragraph position="4"> New texts are processed immediately after publication, dynamically updating the technical knowledge base. If segments of new texts cannot be processed by the existing structures, then they are analysed and a novel structure is appended to the structure base.</Paragraph> <Paragraph position="5"> Collier (1993) presents a more comprehensive outline of the system's architecture ,and some preliminary stochastic analysis.</Paragraph> </Section> <Section position="5" start_page="0" end_page="7055" type="metho"> <SectionTitle> 3. PARAGRAPH CLUSTERING </SectionTitle> <Paragraph position="0"> The fundamental stage in the process described above is the generation of a domain specific semantic representation. The approach identifies clusters of useful n-grams within paragraphs which correlate with other paragraphs. The term useful defines n-grams that have certain qualities, such as a high frequency of occnrrence, and a wide distribution over texts within the domain.</Paragraph> <Paragraph position="1"> There are two principal steps in the identification of these chtsters: to recognise useful n-grams of varying lengths within a corpus, and to recoguise sets of paragraphs which contain similar clusters, and therefore correlate. null</Paragraph> <Section position="1" start_page="1054" end_page="1054" type="sub_section"> <SectionTitle> 3.1 Structures </SectionTitle> <Paragraph position="0"> Five fundamental strllctures are used during tile identification of correlating paragraphs.</Paragraph> <Paragraph position="1"> 3.1.1 Unique word/integer array Tbe tirst structure is an associative array containing an entry for each unique word ill the corpus. 
<Paragraph position="2"> This array is used to translate the textual corpus into a list of integers. All subsequent processing is carried out on this list of integers, which increases efficiency.</Paragraph>
<Paragraph position="3"> The remaining four structures have the same format.</Paragraph>
<Paragraph position="4"> Rather than being in word order, as the original text is, identical words are grouped together in the array. These word groups are ordered according to their size. For this reason, the word with the highest frequency of occurrence within the text will exist at the beginning of the array. Figure 1 gives an example of the typical array format. The highest frequency word that occurs within the text is the, therefore its group is at the beginning of the array. The second highest frequency word is and, then of, etc. The lowest frequency word is set; its group is positioned at the end of the array.</Paragraph>
<Paragraph position="5"> The information contained in each of the remaining four arrays is explained below.</Paragraph>
<Paragraph position="6"> 3.1.2 Word order array Due to the grouping of words, the word order will have been lost. The second structure defines this; it contains pointers to the next word in the text. Figure 2 shows the positions of the pointers representing the phrase &quot;... the set of ...&quot;.</Paragraph>
[Figure 2: word order pointers spanning &quot;... the set of ...&quot;]
<Paragraph position="7"> 3.1.3 Next word array The third structure contains the unique integer representing the next word pointed to in the text. The value of this will be the integer that represents the word group which the word ordering array element points to.</Paragraph>
<Paragraph position="8"> It is clear that the grouping of the words in the arrays makes it necessary to create additional arrays and complicates the existing ones. The advantage of this grouping is increased computational efficiency.</Paragraph>
<Paragraph position="9"> An example of the enhanced efficiency can be demonstrated by considering the identification of similar n-grams. The next word array groups together next word values which are present after identical words in the text. For example, if the two word phrases the book, the car, the book and the explosion were present in the text, then integers representing book, car, book and explosion would be grouped together in the next word array. When testing for similar n-grams it is only necessary to look through one section of the array to identify sets of identical (n+1)-grams, rather than it being necessary to jump to many different positions within an extremely large array. This increases the efficiency of memory access due to the enormous reduction in memory paging.</Paragraph>
<Paragraph position="10"> 3.1.4 Phrase length array The fourth structure contains a phrase length associated with each word. For example, a 1 represents an individual word, 2 represents a bi-gram (the word and the one that is pointed to as the next one), etc.</Paragraph>
<Paragraph position="11"> After the process is complete this array will associate the useful n-grams with their initial word and also define their length.</Paragraph>
<Paragraph position="12"> 3.1.5 Next phrase array The final structure is related to the fourth. Each corresponding entry is a pointer to the next identical phrase. For example, if there were three occurrences of the set of numbers in the corpus, then there would be three entries in the phrase length array containing a 4. Each of the corresponding entries in the next phrase array would point to the next identical phrase (figure 3).</Paragraph>
<Paragraph position="13"> Fig. 3: phrase length and next phrase arrays</Paragraph>
</Section>
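Taken together, the five structures can be pictured as one lookup table plus four parallel arrays over the grouped word positions of figure 1. A hedged Python sketch (field names hypothetical):

```python
# Illustrative layout of the five structures of section 3.1; all names
# are hypothetical. The four list fields are parallel arrays indexed by
# position in the grouped word layout of figure 1.

from dataclasses import dataclass

@dataclass
class CorpusStructures:
    word_to_int: dict     # 3.1.1: unique word -> unique integer
    word_order: list      # 3.1.2: pointer to the next word in the text
    next_word: list       # 3.1.3: unique integer of that next word
    phrase_length: list   # 3.1.4: length of the n-gram starting here
    next_phrase: list     # 3.1.5: pointer to the next identical phrase

# In the figure 3 example, three occurrences of "the set of numbers"
# yield phrase_length == 4 at three positions, and next_phrase chains
# each occurrence to the following identical phrase.
```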
<Section position="2" start_page="1054" end_page="7055" type="sub_section">
<SectionTitle> 3.2 Algorithm </SectionTitle>
<Paragraph position="0"> The two principal steps of the process described in section 3 can be divided into six substeps. The first four substeps represent the identification of useful n-grams of varying lengths within a corpus, and the last two represent the identification of sets of paragraphs which contain similar clusters.</Paragraph>
<Paragraph position="1"> Each of the substeps, which create and manipulate the structures defined in section 3.1, is explained below.</Paragraph>
<Paragraph position="2"> This procedure produces three arrays. The first associates each unique word with a unique integer, the second defines the frequency of occurrence of each word, and the third contains pointers to the first position of each word group in the array format defined in figure 1.</Paragraph>
<Paragraph position="3"> Initially, each word in the corpus is read sequentially. If an entry associating the word with a unique integer isn't already present, then one is created. If an entry is present, another array containing the frequency of occurrence of each word is incremented.</Paragraph>
<Paragraph position="4"> The array containing the words and their associated unique integer is sorted into descending order by considering each word's frequency. Therefore the highest frequency word is associated with 1, the second highest with 2, etc. An array is also created which contains the initial index positions of each unique word in the word grouping format (figure 1). For example, the highest frequency word would have an initial index of zero. If it had a frequency of 10, then the second highest frequency word's index would be 10. If the second word had a frequency of eight, then the third highest frequency word's index would be 18. This indexing array is required during the creation of the word order and next word arrays. This stage creates three arrays. The first and second are the word order and next word arrays, defined in sections 3.1.2 and 3.1.3. The third is an array associating each document in the corpus with the position, in the word order array, of its first word.</Paragraph>
<Paragraph position="5"> This procedure sequentially processes each word of each document. As each new document commences, the document name and the pointer value associated with the first word are stored in an array. This enables the beginning of any document to be accessed.</Paragraph>
<Paragraph position="6"> For each word, the associated index position from the array generated in the previous step is looked up. This index value is stored in the position in the word ordering array of the previous word that was read, thereby defining that this is the index of the next word after the previous one. It also stores the current word's unique integer in the position in the next word array of the previous word that was read, thereby defining that this is the unique integer of the next word after the previous one. Finally, it increments the index pointer of the word, as this position has now been filled.</Paragraph>
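A hedged sketch of these first two sub-steps (Python; all identifiers hypothetical, the corpus assumed to be an already-tokenised list of words, and the paragraph-boundary sentinel described below omitted for brevity):

```python
from collections import Counter

def step_one(corpus_words):
    """Count word frequencies, associate the highest-frequency word with
    1, the next with 2, etc., and compute the initial index of each word
    group in the grouped layout of figure 1."""
    freq = Counter(corpus_words)
    by_freq = sorted(freq, key=freq.get, reverse=True)
    word_to_int = {w: i + 1 for i, w in enumerate(by_freq)}
    group_start, offset = {}, 0
    for w in by_freq:
        group_start[w] = offset      # e.g. frequencies 10, 8 -> 0, 10, 18, ...
        offset += freq[w]
    return word_to_int, freq, group_start

def step_two(corpus_words, word_to_int, group_start):
    """Fill the word order and next word arrays: the slot of the previous
    word records the slot and unique integer of the word that follows it."""
    n = len(corpus_words)
    word_order = [-1] * n
    next_word = [-1] * n
    fill = dict(group_start)         # next free slot within each word group
    prev_slot = -1
    for w in corpus_words:
        slot = fill[w]
        fill[w] += 1                 # increment the word's index pointer
        if prev_slot != -1:
            word_order[prev_slot] = slot
            next_word[prev_slot] = word_to_int[w]
        prev_slot = slot
    return word_order, next_word
```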
<Paragraph position="7"> At the end of each paragraph a special integer representing the carriage return is placed in the next word array; this enables identification of paragraph boundaries. This step generates three arrays. The first is the phrase length array defined in section 3.1.4. The second is the next phrase array defined in section 3.1.5. The third is similar to the previous one, but it points to the previous identical phrase rather than the next. The algorithm becomes rather complicated when overwriting existing entries in the second and third arrays. This is due to the manipulation of the pointers to the next and previous identical phrases.</Paragraph>
<Paragraph position="8"> Each of the groups of similar words is processed in turn (e.g. in figure 1 all of the the's, then the and's, etc.). The next word array is used to identify the word following the first the in the group. Then all of the other the's are checked to identify those with the same next word, creating a set of those that match. This set represents all of the phrases within the corpus that are the same as the first bi-gram.</Paragraph>
<Paragraph position="9"> The phrase length is incremented to two, and this matching process is repeated for the next word of the original phrase (i.e. the third word of the n-gram), but only on the reduced set of previously matching words.</Paragraph>
<Paragraph position="10"> This process continues until the longest phrase which occurs a multiple number of times is generated, or a carriage return is encountered.</Paragraph>
<Paragraph position="11"> If the final phrase length is greater than one, then each of the words in the matching set is processed in turn. If the position pointed to by the word does not already have a phrase associated with it, then the phrase length is stored in the associated position in the phrase length array. The position in the next phrase array of the previous phrase in the matching set is updated with the current phrase's position, thereby defining that this is the next identical phrase after the previous one. Also, an array which defines the previous phrase's position is updated by storing the pointer value of the previous phrase in the current phrase's slot, thereby pointing to the previous identical phrase.</Paragraph>
<Paragraph position="12"> If the position does already contain a phrase length that is longer, then the current phrase is missed out and the next one processed. In this case the position already has a longer n-gram associated with it.</Paragraph>
<Paragraph position="13"> If the new phrase length is longer than the current one, then the phrase is overwritten, but the pointers to the previous and next phrase require updating. For example, if both a previous phrase and a next phrase pointer exist then the current position should be removed from the linking up of the existing set of identical phrases (figure 4). It is necessary to alter the next phrase value of the previous phrase (which is currently set to the position to be overwritten) to the current position's next phrase. Also, the next phrase's previous phrase position (which holds the current position to be overwritten) requires updating to the current phrase's previous phrase position.</Paragraph>
<Paragraph position="14"> This process is repeated for all of the other the's in turn, and then for each of the other groups, generating the longest n-grams which have at least two occurrences.</Paragraph>
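The pointer updates of figure 4 amount to deleting a node from a doubly linked list. A minimal sketch (Python; identifiers hypothetical, with -1 marking a missing neighbour):

```python
def unlink_phrase(pos, next_phrase, prev_phrase):
    """Detach position `pos` from its chain of identical phrases before
    it is overwritten with a longer n-gram (figure 4)."""
    nxt = next_phrase[pos]
    prv = prev_phrase[pos]
    if prv != -1:
        next_phrase[prv] = nxt   # previous identical phrase now bypasses pos
    if nxt != -1:
        prev_phrase[nxt] = prv   # next identical phrase points back past pos
    next_phrase[pos] = prev_phrase[pos] = -1
```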
<Paragraph position="15"> 3.2.4 Identify useful n-grams The fourth step is the identification of the n-grams that provide effective correlations between phrases and paragraphs. The phrase length and next phrase arrays are revised so that they only contain these n-grams.</Paragraph>
<Paragraph position="16"> The previous process will have identified the longest phrase that occurs a multiple number of times in the corpus. The phrase length array is traversed and each phrase with this longest length is stored in a set. At the same time, the next phrase array is used to identify the frequency of occurrence of each phrase. This can be obtained by counting while traversing through the pointers to the next identical phrase.</Paragraph>
<Paragraph position="17"> This set of longest phrases is arranged in ascending order by frequency of occurrence. The n-best remain in the phrase length and next phrase arrays. The value of n will depend on the domain being analysed. A domain with considerable correlation will have a greater n than a domain with little correlation. This is an area for further investigation after development of the entire system.</Paragraph>
<Paragraph position="18"> All of the subphrases that exist within these n-best are deleted from the arrays. For example, in the phrase the set of numbers, the subphrases set of numbers and of numbers will be deleted so that they are not considered during further analysis.</Paragraph>
<Paragraph position="19"> Those that do not exist within the n-best have their associated phrase lengths reduced by one. This shorter phrase is compared with all other phrases of the same length in the group to identify whether it is identical to an existing phrase. If this is the case, then the next phrase pointer of the last phrase in the set will be altered to point to the first phrase in the identical phrase set, and vice-versa for the previous phrase array.</Paragraph>
<Paragraph position="20"> This entire process is repeated, reducing the length of the phrases to be considered by one each time. Therefore, the second iteration will consider phrases with a length equal to the longest phrase minus one, the third iteration considers phrases with a length equal to the longest phrase minus two, etc.</Paragraph>
<Paragraph position="21"> When this process is complete the phrase length and next phrase arrays will contain all of the useful phrases. The final two processes identify clusters of phrases within individual paragraphs which correlate with clusters of phrases in other paragraphs.</Paragraph>
<Paragraph position="22"> This procedure associates each paragraph with a weight representing its probability of correlating with other paragraphs. The weight considers factors such as the size of the paragraph, the size and frequency of n-grams existing within that paragraph, and the distribution of the n-grams throughout the corpus.</Paragraph>
<Paragraph position="23"> The actual process is relatively straightforward. The corpus is parsed, beginning at the first word and using the pointers in the next word array. This will traverse the words in the order of the original text, enabling identification of all n-grams in each paragraph and using them in an equation to assign the correlation weight.</Paragraph>
<Paragraph position="24"> The current equation to generate paragraph weights is: ((no. of bi-grams * 2) + (no. of n-grams * (n + (fn - 1) * 0.5))) / (total no. of words in paragraph). This equation is simple but accounts for all the important factors listed above, apart from the distribution of the n-grams within the corpus.</Paragraph>
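Reading the n-gram term as a sum over each useful n-gram of length n and in-paragraph frequency fn (the equation above is reconstructed from a damaged original, so this is only a hedged sketch with hypothetical identifiers):

```python
def paragraph_weight(bigram_count, ngrams, total_words):
    """Assign a correlation weight to one paragraph: each bi-gram scores
    2, each longer n-gram scores its length n plus 0.5 for every repeat
    occurrence, normalised by the paragraph's word count."""
    score = bigram_count * 2
    for n, fn in ngrams:             # (length, frequency) per n-gram
        score += n + (fn - 1) * 0.5
    return score / total_words
```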
<Paragraph position="25"> These weights are used to sort the paragraphs into ascending order.</Paragraph>
</Section>
<Section position="3" start_page="7055" end_page="7055" type="sub_section">
<SectionTitle> 3.2.6 Identify useful paragraph clusters </SectionTitle>
<Paragraph position="0"> The final process identifies all of the sets of correlating paragraphs within the corpus, and extracts the highest quality correlations.</Paragraph>
<Paragraph position="1"> Each paragraph produced in the previous step is processed in turn. Using the next phrase array, all paragraphs which correlate with at least one n-gram are identified.</Paragraph>
<Paragraph position="2"> Groups of paragraphs containing identical subsets of n-grams are identified and placed into sets. Each of these sets can then be assigned a weight representing the quantity, i.e. number of paragraphs, and quality, i.e. number and size of n-grams.</Paragraph>
<Paragraph position="3"> The final step is to sort the correlation weights into ascending order.</Paragraph>
<Paragraph position="4"> The system has now produced a list of n-gram clusters representing paragraph correlations. These are ordered by considering the quality of n-grams within the cluster, and the quantity of correlation occurring with other paragraphs. From the assumptions outlined in section 2, &quot;the similarities between paragraphs within each group will define stereotypical concepts&quot;, these clusters will be extremely useful in the generation of a domain specific semantic representation.</Paragraph>
</Section>
</Section>
</Paper>