File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/86/p86-1020_metho.xml
Size: 28,040 bytes
Last Modified: 2025-10-06 14:11:54
<?xml version="1.0" standalone="yes"?> <Paper uid="P86-1020"> <Title>BULK PROCESSING OF TEXT ON A MASSIVELY PARALLEL COMPUTER</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 An Overview of the Dictionary Problem </SectionTitle> <Paragraph position="0"> This paper discusses one of the text processing problems encountered during the implementation of the CM-Indexer, a natural language processing program that runs on the Connection Machine (CM). The problem is that of parallel dictionary lookup: given both a dictionary and a text consisting of many thousands of words, how can the appropriate definitions be distributed to the words in the text as rapidly as possible? A parallel dictionary lookup algorithm that makes efficient use of the CM hardware was discovered and is described in this paper.</Paragraph> <Paragraph position="1"> It is clear that there are many natural language processing applications in which such a dictionary algorithm is necessary. Indexing and searching of databases consisting of unformatted natural language text is one such application. The proliferation of personal computers, the widespread use of electronic memos and electronic mail in large corporations, and the CD-ROM are all contributing to an explosion in the amount of useful unformatted text in computer readable form. Parallel computers and algorithms provide one way of dealing with this explosion.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The CM: Machine Description </SectionTitle> <Paragraph position="0"> The CM consists of a large number of processor/memory cells. These cells are used to store data structures. 
In accordance with a stream of instructions that are broadcast from a single conventional host computer, the many processors can manipulate the data in the nodes of the data structure in parallel.</Paragraph> <Paragraph position="1"> Each processor in the CM can have its own local variables. These variables are called parallel variables, or parallel fields. When a host computer program performs a serial operation on a parallel variable, that operation is performed separately in each processor in the CM. For example, a program might compare two parallel string variables. Each CM processor would execute the comparison on its own local data and produce its own local result. Thus, a single command can result in tens of thousands of simultaneous CM comparisons.</Paragraph> <Paragraph position="2"> In addition to their computational ability, CM processors can communicate with each other via a special hardware communication network. In effect, communication is the parallel analog of the pointer-following executed by a serial computer as it traverses the links of a data structure or graph.</Paragraph> </Section> <Section position="5" start_page="0" end_page="128" type="metho"> <SectionTitle> 3 Dictionary Access </SectionTitle> <Paragraph position="0"> A dictionary may be defined as a mapping that takes a particular word and returns a group of status bits. Status bits indicate which sets or groups of words a particular word belongs to. Some of the sets that are useful in natural language processing include syntactic categories such as nouns, verbs, and prepositions. Programs can also use semantic characterization information. For example, knowing whether a word is the name of a famous person (e.g. Lincoln, Churchill), a place, an interjection, or a time or calendar term will often be useful to a text processing program.</Paragraph> <Paragraph position="1"> The task of looking up the definition of a word consists of returning a binary number that contains 1's only 
in bit positions that correspond to the groups to which that word belongs. Thus, the definition of &quot;Lincoln&quot; contains a zero in the bit that indicates a word can serve as a verb, but it contains a 1 in the famous person's name bit.</Paragraph> <Paragraph position="2"> While all of the examples in this paper involve only a few words, it should be understood that the CM is efficient and cost effective only when large amounts of text are to be processed. One would use the dictionary algorithms described in this paper to look up all of the words in an entire novel; one would not use them to look up the ten words in a user's query to a question answering system.</Paragraph> <Paragraph position="3"> [Figure 1: a. Select processors containing &quot;Lincoln&quot;. b. Mark selected processors as famous names. c. Select processors containing &quot;Michaelangelo&quot;. d. Mark selected processors as famous names. Note: the famous name is marked.] [Figure 2: a. Select processors with an upper case, alphabetic first character. b. Subselect for processors not at start of sentence. c. Mark selected processors as proper nouns.]</Paragraph> </Section> <Section position="6" start_page="128" end_page="129" type="metho"> <SectionTitle> 4 A Simple Broadcasting Dictionary Algorithm </SectionTitle> <Paragraph position="0"> One way to implement a parallel dictionary is to serially broadcast all of the words in a given set. Processors that contain a broadcast word check off the appropriate status bits. When all of the words in one set have been broadcast, the next set is then broadcast. For example, suppose that the dictionary lookup program begins by attempting to mark the words that are also famous last names. 
Figure 1 illustrates the progress of the algorithm as the words &quot;Lincoln&quot; and then &quot;Michaelangelo&quot; are broadcast. In the first step, all occurrences of &quot;Lincoln&quot; are marked as famous names. Since that word does not occur in the sample sentence, no marking action takes place. In the second step, all occurrences of &quot;Michaelangelo&quot; are marked, including the one in the sample sentence.</Paragraph> <Paragraph position="1"> In step d, where all processors containing &quot;Michaelangelo&quot; are marked as containing famous names, the program could simultaneously mark the selected processors as containing proper nouns. Such shortcuts will not be examined at this time.</Paragraph> <Paragraph position="2"> After all of the words in the set of famous names have been broadcast, the algorithm would then begin to broadcast the next set, perhaps the set containing the names of the days of the week.</Paragraph> <Paragraph position="3"> In addition to using this broadcast algorithm, the CM-Indexer uses syntactic definitions of some of the dictionary sets. For example, it defines a proper noun as a capitalized word that does not begin a sentence. (Proper nouns that begin a sentence are not found by this capitalization-based rule; this can be corrected by a more sophisticated rule. The more sophisticated rule would mark the first word in a sentence as a proper noun if it could find another capitalized occurrence of the word in a nearby sentence.) Figure 2 illustrates the progress of this simple syntactic algorithm as it executes.</Paragraph> <Paragraph position="4"> The implementation of both the broadcast algorithm and the syntactic proper noun rule takes fewer than 30 lines of code in the *Lisp (pronounced &quot;starlisp&quot;) programming language. The entire syntactic rule that finds all proper nouns executes in less than 5 milliseconds. However, the algorithm that transmits word lists is much slower.</Paragraph> <Paragraph position="6"> Figure 3 steps: a. 
Select all processors where D? = 0 (not yet defined). If no processors are selected, then the algorithm terminates. Otherwise, find the minimum of the selected processors' addresses.</Paragraph> <Paragraph position="7"> The host machine quickly determines that the minimum address is 1. b. The host machine pulls out the word in that minimum processor and looks up its definition in its own serial dictionary/hash table. In this case, the definition of &quot;the&quot; is determined to be the bit sequence 001. (The bits are the status bits discussed in the text.) Next, the host machine selects all processors containing the word whose definition was just looked up. c. The entire looked-up definition is assigned to all selected processors, and all selected processors are marked as defined. d. Go to a. Broadcasting a list of words from the host to the CM takes an average of more than 5 milliseconds per word. Thus, since it takes time proportional to the number of words in a given set, the algorithm becomes a bottleneck for sets of more than a few thousand words.</Paragraph> <Paragraph position="8"> This means that the larger sets listed above (all nouns, all verbs, etc.) cannot be transmitted. The reason that this slow algorithm was used in the CM-Indexer was the ease with which it could be implemented and tested.</Paragraph> </Section> <Section position="7" start_page="129" end_page="129" type="metho"> <SectionTitle> 5 An Improved Broadcasting Dictionary Algorithm </SectionTitle> <Paragraph position="0"> One improvement to the simple broadcasting algorithm would be to broadcast entire definitions (i.e. several bits), rather than a single bit indicating membership in a set. This would mean that each word in the dictionary would only be broadcast once, even if it belongs to several sets (e.g. &quot;fly&quot; is both a noun and a verb). A second improvement would be to broadcast only the words that are actually contained in the text being looked up. 
Thus, words that rarely occur in English, which make up a large percentage of the dictionary, would rarely be broadcast.</Paragraph> <Paragraph position="1"> In summary, this improved dictionary broadcasting algorithm will loop over the unique words that are contained in the text to be indexed, look up the definition of each such word in a serial dictionary on the host machine, and broadcast the looked-up definition to the entire CM. Figure 3 illustrates how this algorithm would assign the definition to all occurrences of the word &quot;the&quot; in a sample text. (Again, in practice the algorithm operates on many thousands of words, not on one sentence.) In order to select a currently undefined word to look up, the host machine executing this algorithm must determine the address of a selected processor. The figure indicates that one way to do this is to take the minimum address of the processors that are currently selected. This can be done in constant time on the CM.</Paragraph> <Paragraph position="2"> This improved dictionary lookup method is useful when the dictionary is much larger than the number of unique words contained in the text to be indexed. However, since the same basic operation is used to broadcast definitions as in the first algorithm, it is clear that this second implementation of a dictionary will not be feasible when a text contains more than a few thousand unique words.</Paragraph> <Paragraph position="3"> By analyzing a number of online texts ranging in size from 2,000 words to almost 60,000 words, it was found that as the size of the text approaches many tens of thousands of words, the number of unique words increases into the thousands. 
Therefore, it can be concluded that the second implementation of the broadcasting dictionary algorithm is not feasible when there are more than a few tens of thousands of words in the text file to be indexed.</Paragraph> </Section> <Section position="8" start_page="129" end_page="129" type="metho"> <SectionTitle> 6 Making Efficient Use of Parallel Hardware </SectionTitle> <Paragraph position="0"> In both of the above algorithms, the &quot;heart&quot; of the dictionary resided in the serial host. In the first case, the heart was the lists that represented sets of words; in the second case, the heart was the call to a serial dictionary lookup procedure. Perhaps if the heart of the dictionary could be stored in the CM, alongside the words from the text, the lookup process could be accelerated.</Paragraph> </Section> <Section position="9" start_page="129" end_page="130" type="metho"> <SectionTitle> 7 Implementation of Dictionary Lookup by Parallel Hashing </SectionTitle> <Paragraph position="0"> One possible approach to dictionary lookup would be to create a hash code for each word in each CM processor in parallel. The hash code represents the address of a different processor. Each processor can then send a lookup request to the processor at the hash-code address, where the definition of the word that hashes to that address has been stored in advance. [Figure 4: Format of processor diagram: string, definition bits (BBBB), original address (N). a. Select all processors, set the original address field to be the processor number. b. Call sort with string as the key, and string and N as the fields to copy.] 
The processors that receive requests would then respond by sending back the pre-stored definition of their word to the address contained in the request packet.</Paragraph> <Paragraph position="1"> One problem with this approach is that all of the processors containing a given word will send a request for a definition to the same hashed address. To some extent, this problem can be ameliorated by broadcasting a list of the n (e.g. 200) most common words in English before attempting any dictionary lookup cycles. Another problem with this approach is that the hash code itself will cause collisions between different text words that hash to the same value.</Paragraph> </Section> <Section position="10" start_page="130" end_page="134" type="metho"> <SectionTitle> 8 An Efficient Dictionary Algorithm </SectionTitle> <Paragraph position="0"> There is a faster and more elegant approach to building a dictionary than the hashing scheme. This other approach has the additional advantage that it can be built from two generally useful submodules, each of which has a regular, easily debugged structure.</Paragraph> <Paragraph position="1"> The first submodule is the sort function; the second is the scan function. After describing the two submodules, a simple version of the fast dictionary algorithm will be presented, along with suggestions for dealing with memory and processor limitations.</Paragraph> <Section position="1" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 8.1 Parallel Sorting </SectionTitle> <Paragraph position="0"> A parallel sort is similar in function to a serial sort. It accepts as arguments a parallel data field and a parallel comparison predicate, and it sorts among the selected processors so that the data in each successive (by address) processor increases monotonically. There are parallel sorting algorithms that execute in time proportional to the square of the logarithm of the number of items to be sorted. 
One easily implemented sort, the enumerate-and-pack sort, takes about 1.5 milliseconds per bit to sort 64,000 numbers on the CM. Thus, it takes 48 milliseconds to sort 64,000 32-bit numbers.</Paragraph> <Paragraph position="1"> Figure 4 illustrates the effect a parallel sort has on a single sentence. Notice that pointers back to the original location of each word can be attached to words before the textual order of the words is scrambled by the sort.</Paragraph> </Section> <Section position="2" start_page="130" end_page="131" type="sub_section"> <SectionTitle> 8.2 Scan: Spreading Information in Logarithmic Time </SectionTitle> <Paragraph position="0"> A scan algorithm takes an associative function of two arguments, call it F, and quickly applies it to data field values in successive processors: if successive processors hold a, b, c, d, and so on, they end up holding a, F(a,b), F(F(a,b),c), F(F(F(a,b),c),d), and so on. The key point is that a scan algorithm can take advantage of the associative law and perform this task in logarithmic time. Thus, 16 applications of F are sufficient to scan F across 64,000 processors. Figure 5 shows one possible scheme for implementing scan. While the scheme in the diagram is based on a simple linked list structure, scan may also be implemented on binary trees, hypercubes, and other graph data structures. The nature of the routing system of a particular parallel computer will determine which data structures can be scanned most rapidly and efficiently.</Paragraph> <Paragraph position="1"> [Figure 5.] Format of processor diagram: string, processor address, function value F, forward pointer P (P is a processor address). The backward pointer can be calculated in constant time: all processors send their own addresses to the processors pointed to by P.</Paragraph> <Paragraph position="2"> f is any associative function of two arguments. a. Select all processors, initialize the function value to the string and the forward pointer to self address + 1. b. Get the back pointer, get the function value from the processor at the back pointer, and call this value BF. 
Replace the current function value, F, with f(BF,F).</Paragraph> <Paragraph position="4"> c. Calculate a forward pointer that goes twice as far as the current forward pointer.</Paragraph> <Paragraph position="5"> This can be done as follows: get the value of P at the processor pointed to by your own P, and replace your own P with that new value. d. If any processor has a valid forward pointer, go to b. (The next execution of b has the following effect on the first 4 processors: a; f(a,b); f(a, f(b,c)); f(f(a,b), f(c,d)); with P: 3, P: 4, P: 5, P: 6.) Note that since f is associative, f(a, f(b,c)) is always equal to f(f(a,b), c), and f(f(a,b), f(c,d)) = f(f(f(a,b), c), d). When combined with an appropriate F, scan has applications in a variety of contexts. For example, scan is useful in the parallel enumeration of objects and for region labeling. Just as the FFT can be used to efficiently solve many problems involving polynomials, scan can be used to create efficient programs that operate on graphs, and in particular on linked lists that contain natural language text.</Paragraph> </Section> <Section position="3" start_page="131" end_page="132" type="sub_section"> <SectionTitle> 8.3 Application of Scan and Sort to Dictionary Lookup </SectionTitle> <Paragraph position="0"> To combine these two modules into a dictionary, we need to allocate a bit, DEFINED?, that is 1 only in processors that contain a valid definition of their word. Initially, it is 1 in the processors that contain words from the dictionary, and 0 in processors that contain words that come from the text to be looked up. The DEFINED? bit will be used by the algorithm as it assigns definitions to text words. As soon as a word receives its definition, it will have its DEFINED? bit turned on. The word can then begin to serve as an additional copy of the dictionary entry for the remainder of the lookup cycle. (This is the &quot;trick&quot; that allows scan to execute in logarithmic time.) 
First, an alphabetic sort is applied in parallel to all processors, with the word stored in each processor serving as the primary key, and the DEFINED? bit acting as a secondary key. The result will be that all copies of a given word are grouped together into sequential (by processor address) lists, with the single dictionary copy of each word immediately preceding any and all text copies of the same word.</Paragraph> <Paragraph position="1"> The definitions that are contained in the dictionary processors can then be distributed to all of the text words in logarithmic time by scanning the processors with an associative function f. Here x and y are processors that each hold four fields (parallel variables), and function f returns a variable containing the same four fields. This is a pseudo language; the actual program was written in *Lisp.</Paragraph> <Paragraph position="2"> function f(x,y):</Paragraph> <Paragraph position="4"> ;; note that text words that are not found in the dictionary correctly end up with DEFINED? = 0. This function F will spread dictionary definitions from a dictionary word to all of the words following it (in processor address order), up until the next dictionary word. Therefore, each word will have its own copy of the dictionary definition of that word. All that remains is to have a single routing cycle that sends each definition back to the original location of its text word. Figure 6 illustrates the execution of the entire sort-scan algorithm on a sample sentence.</Paragraph> </Section> <Section position="4" start_page="132" end_page="133" type="sub_section"> <SectionTitle> 8.4 Improvements to the Sort-Scan Dictionary Algorithm </SectionTitle> <Paragraph position="0"> Since the CM is a bit-serial machine, string operations are relatively expensive. The dictionary function F described above performs a string comparison and a string copy operation each time it is invoked. 
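The sort-scan lookup with the dictionary function can be simulated serially. The following Python sketch is an illustration under stated assumptions, not the *Lisp program: the Cell record, the function names, and the use of None for words that are not in the dictionary are hypothetical, and the parallel scan is emulated on an ordinary list by a loop that doubles the distance over which f is applied on each pass.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    word: str        # the string held by one processor
    defined: bool    # the DEFINED? bit
    definition: int  # status bits (meaningful only when defined)
    origin: int      # original text position, or -1 for a dictionary cell


def f(x, y):
    """Combine neighbours: y inherits x's definition if both hold the same word."""
    if x.defined and not y.defined and x.word == y.word:
        return Cell(y.word, True, x.definition, y.origin)
    return y


def sort_scan_lookup(dictionary, text):
    # Load dictionary entries (defined) and text words (undefined) into cells.
    cells = [Cell(w, True, bits, -1) for w, bits in dictionary.items()]
    cells += [Cell(w, False, 0, i) for i, w in enumerate(text)]
    # Sort by word; within a word the dictionary copy comes first.
    cells.sort(key=lambda c: (c.word, not c.defined))
    # Emulate the scan: each pass doubles the distance over which f is applied.
    step = 1
    while len(cells) > step:
        cells = [f(cells[i - step], c) if i >= step else c
                 for i, c in enumerate(cells)]
        step *= 2
    # Route each definition back to the original position of its text word.
    result = [None] * len(text)
    for c in cells:
        if c.origin >= 0 and c.defined:
            result[c.origin] = c.definition
    return result


print(sort_scan_lookup({"the": 0b001, "fly": 0b110},
                       ["the", "fly", "ate", "the", "pie"]))
```

With this toy dictionary the call prints [1, 6, None, 1, None]: both copies of "the" receive the bits 001, "fly" receives 110, and the two words missing from the dictionary stay undefined. Note that f compares and copies the word on every pass, which is the cost the text goes on to remove.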
On a full size CM, the function is invoked 16 times (the base-2 logarithm of 64K words). A simple optimization can be made to the sort-scan algorithm that allows the string comparison to be performed only once. This allows a faster dictionary function that performs no string comparisons to be used.</Paragraph> <Paragraph position="1"> The optimization consists of two parts. First, a new stage is inserted after the sorting step, before the scanning step. In this new step, each word is compared to the word to its left, and if it is different, it is marked as a &quot;header.&quot; Such words begin a new segment of identical words. All dictionary words are headers, because the sort places them before all occurrences of identical words. In addition, the first word of each group of words that does not occur in the dictionary is also marked as a header.</Paragraph> <Paragraph position="2"> Next, the following function creates the field that will be scanned: ;; header-p is a parallel boolean variable that is ;; true in headers, false otherwise function create-field-for-scan(header-p): ;; define a type for a large bit field</Paragraph> <Paragraph position="4"> ;; next, the headers that are dictionary words store ;; their definitions in the correct part of FIELD. ;; Non-dictionary headers (text words not found ;; in dictionary) are given null definitions.</Paragraph> <Paragraph position="5"> if header</Paragraph> <Paragraph position="7"> Finally, instead of scanning the dictionary function across this field, the maximum function (which returns the maximum of two input numbers) is scanned across it. Definitions will propagate from a header to all of the words within its segment, but they will not cross past the next header. This is because the next header has a greater self-address in the most significant bits of the field being scanned, and the maximum function selects it rather than the earlier header's smaller field value. 
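A serial Python sketch of this header and max-scan step follows. It is an illustration under stated assumptions rather than the *Lisp code: the function name, the list-based inputs, the 8-bit definition width, and the choice of 0 as the null definition are all hypothetical, and the scan is again emulated by a distance-doubling loop.

```python
DEF_BITS = 8  # assumed width of the definition part of the scanned field


def max_scan_lookup(sorted_words, definitions):
    """Spread definitions within each run of identical words.

    sorted_words: words after the sort, dictionary copy first in each run.
    definitions:  definition bits for dictionary cells, None for text cells.
    """
    fields = []
    for addr, word in enumerate(sorted_words):
        # A header is a word that differs from its left neighbour.
        is_header = addr == 0 or word != sorted_words[addr - 1]
        if is_header:
            bits = definitions[addr] or 0  # null definition if not in dictionary
            # Address in the high bits, definition in the low bits.
            fields.append(addr * 2**DEF_BITS + bits)
        else:
            fields.append(0)
    # Inclusive max-scan: each pass doubles the distance that is combined.
    step = 1
    while len(fields) > step:
        fields = [max(fields[i - step], v) if i >= step else v
                  for i, v in enumerate(fields)]
        step *= 2
    # Keep only the definition part of each scanned field.
    return [v % 2**DEF_BITS for v in fields]


words = ["ate", "fly", "fly", "pie", "the", "the", "the"]
defs = [None, 6, None, None, 1, None, None]
print(max_scan_lookup(words, defs))
```

For the sample input the sketch prints [0, 6, 6, 0, 1, 1, 1]. The string comparison happens once, during header detection; every pass of the scan compares only small integers.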
If a header had no definition, because a word was not found in the dictionary, the null definition would be propagated to all copies of that word.</Paragraph> <Paragraph position="8"> The process of scanning the maximum function across a field was determined to be generally useful. As a result, the max-scan function was implemented in an efficient pipelined, bit-serial manner by Guy Blelloch, and was incorporated into the general library of CM functions. Figure steps: a. After the sort, detect the headers (words different from their left neighbor). b. In headers only, set A to the self address and D to the definition, if there is one.</Paragraph> <Paragraph position="9"> c. Scan the maximum function across the A:D field.</Paragraph> <Paragraph position="10"> d. Copy definition bits from D to B, and set D? if defined. ... scanning of the maximum function across it. Note that the size of the field being scanned is the size of the definition (8 bits for the timings below) plus the size of a processor address (16 bits). In comparison, the earlier dictionary function had to be scanned across the definition and the original address, along with the entire string. Scanning this much larger field, even if the dictionary function were as fast as the maximum function, would necessarily result in slower execution times.</Paragraph> </Section> <Section position="5" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 8.5 Evaluation of the Sort-Scan Dictionary Algorithm </SectionTitle> <Paragraph position="0"> The improved sort-scan dictionary algorithm is much more efficient than the broadcasting algorithms described earlier. The algorithm was implemented and timed on a Connection Machine.</Paragraph> <Paragraph position="1"> In a bit-serial computer like the CM, the time needed to process a string grows linearly with the number of bits used to store the string. A string length of 8 characters is adequate for the CM-Indexer. 
Words longer than 8 characters are represented by the simple concatenation of their first 4 and last 4 characters. ASCII characters therefore require 64 bits per word in the CM; 4 more bits are used for a length count.</Paragraph> <Paragraph position="2"> Because dictionary lookup is only performed on alphabetic characters, the 64 bits of ASCII data described above can be compacted without collision. Each of the twenty-six letters of the alphabet can be represented using 5 bits, instead of 8, thereby reducing the length of the character field to 40 bits; 4 bits are still needed for the length count. Additional compression could be achieved, perhaps by hashing, although that would introduce the possibility of collisions. No additional compression is performed in the prototype implementation. The timings given below assume that each processor stores an 8 character word using 44 bits.</Paragraph> <Paragraph position="3"> First of all, to sort a bit field in the CM currently takes about 1.5 milliseconds per bit. Second, the function that finds the header words was timed and took less than 4 milliseconds to execute. The scan of the max function across all of the processors completed in under 2 milliseconds. The routing cycle to return the definitions to the original processors of the text took approximately one millisecond to complete.</Paragraph> <Paragraph position="4"> As a result, with the improved sort-scan algorithm, an entire machine full of 64,000 words can be looked up in about 73 milliseconds. 
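The 73 millisecond total is consistent with summing the component timings quoted above, assuming that the sort key is the 44-bit word field and that the quoted upper bounds are taken as the actual costs. A small Python sketch of that arithmetic follows; the variable names are illustrative only.

```python
BITS_PER_WORD = 44      # 40 character bits + 4 length bits, from the text
SORT_MS_PER_BIT = 1.5   # quoted sort cost per bit of key
HEADER_MS = 4           # quoted as "less than 4 milliseconds" (upper bound)
SCAN_MS = 2             # quoted as "under 2 milliseconds" (upper bound)
ROUTE_MS = 1            # quoted as approximately one millisecond
WORDS = 64_000          # a full machine of words

sort_ms = BITS_PER_WORD * SORT_MS_PER_BIT
total_ms = sort_ms + HEADER_MS + SCAN_MS + ROUTE_MS

# Equivalent serial lookup rate, in words per microsecond.
words_per_us = WORDS / (total_ms * 1000)

print(sort_ms, total_ms, round(words_per_us, 2))
```

The sort accounts for 66 of the 73 milliseconds, so the width of the word field dominates the lookup time, and the resulting rate is a little under 0.9 words per microsecond.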
In comparison to this, the original sort-scan implementation requires an additional 32 milliseconds (2 milliseconds per invocation of the slow dictionary function), along with a few more milliseconds for the inefficient communications pattern it requires.</Paragraph> <Paragraph position="5"> This lookup rate is approximately equivalent to a serial dictionary lookup of 0.9 words per microsecond.</Paragraph> <Paragraph position="6"> In comparison, a Symbolics Lisp Machine can look up words at a rate of 1/500 words per microsecond. (The timing was made for a lookup of a single bit of information about a word in a hash table containing 1500 words.) Thus, the CM can perform dictionary lookup about 450 times faster than the Lisp Machine.</Paragraph> </Section> <Section position="6" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 8.6 Coping with Limited Processor Resources </SectionTitle> <Paragraph position="0"> Since there are obviously more than 64,000 words in the English language, a dictionary containing many words will have to be handled in sections. Each dictionary processor will have to hold several dictionary words, and the lookup cycle will have to be repeated several times. These adjustments will slow the CM down by a multiplicative factor, but Lisp Machines also slow down when large hash tables (often paged out to disk) are queried. There is an alternative way to view the above algorithm modifications: since they are motivated by limited processor resources, they should be handled by some sort of run-time package, just as virtual memory is used to handle the problem of limited physical memory resources on serial machines. In fact, a virtual processor facility is currently being used on the CM.</Paragraph> </Section> </Section> </Paper>