<?xml version="1.0" standalone="yes"?> <Paper uid="C69-5101"> <Title>A SEARCH ALGORITHM AND DATA STRUCTURE FOR AN EFFICIENT INFORMATION SYSTEM</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Letter Table Method </SectionTitle> <Paragraph position="0"> This attractive method, suggested by Lamb and Jacobsen in 1961 for dictionary lookup in a machine translation system, has not received much attention for its possible applications in general information systems. The reasons could be the immediate reaction to the numerous letter tables after the second level, which suggested inefficiency in storage, and the fact that no clear search efficiency and update efficiency were expressed.</Paragraph> <Paragraph position="1"> Suppose only the twenty-six English letters are involved; in theory there are twenty-six tables at the first level, 26^2 tables at the second level, 26^3 tables at the third level, etc. The number of letter tables will in practice be reduced drastically after the second level because of the actual limitation in letter combination in forming a vocabulary. However, no studies of this sort are available for the calculation of storage requirement to disprove its storage inefficiency.</Paragraph> <Paragraph position="2"> The average number of searches, or the expected search length, of this method cannot be calculated as a function of the file or dictionary size. It is simply the average number of letters or characters of a certain language plus one space character or any other delimiter. For the English language, it is a favorable 5.8 searches (S = 4.8 + 1), regardless of the file size. 
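As a hedged sketch (in Python rather than the machine code of the period), the letter-table lookup described above can be modeled as nested 27-slot tables indexed directly by the converted letter values; all names here are illustrative:

```python
# Sketch of the letter-table method: each level is a 27-slot table addressed
# directly by the converted letter value (delimiter = 0, A = 1 ... Z = 26),
# so no search is needed within a level. Lookup cost is word length + 1.

def letter_value(ch):
    """Convert a character to its direct-access value: space = 0, A..Z = 1..26."""
    return 0 if ch == " " else ord(ch.upper()) - ord("A") + 1

def new_table():
    return [None] * 27          # slot 0 holds the entry for the delimiter

def insert(root, word, entry):
    table = root
    for ch in word + " ":       # the trailing space acts as the delimiter
        v = letter_value(ch)
        if v == 0:
            table[0] = entry    # the word ends here
        else:
            if table[v] is None:
                table[v] = new_table()
            table = table[v]

def lookup(root, word):
    table = root
    for ch in word + " ":       # exactly len(word) + 1 probes
        v = letter_value(ch)
        if v == 0:
            return table[0]
        if table[v] is None:
            return None
        table = table[v]
```

The number of probes per lookup is the word length plus one, matching the S = 4.8 + 1 figure quoted for English.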
Its update efficiency is comparable with its search efficiency and may be estimated at less than twice the average number of searches.</Paragraph> <Paragraph position="3"> In order to achieve the above efficiency, the letter tables at each level should be structured in alphabetic order, and every letter should be converted into a numeric value such as A = 1, B = 2, C = 3, ... , Z = 26 and the space delimiter = 0 or 27 through a simple table-lookup procedure. Those converted values would then be used as the direct-access address within each subset of alphabetic letters at each letter-table level.</Paragraph> <Paragraph position="4"> This discards the need for binary search within each subset of &quot;brothers&quot; as in the cases of Hibbard's and Sussenguth's searches.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Open Addressing Method </SectionTitle> <Paragraph position="0"> As early as 1957, Peterson introduced this method for random access storage addressing. This method is also called linear probing. It assumes the existence of a certain hash function to transform the key or keyword of an entry into a numerical value within the range of the table size, which is predetermined as 2^M for any integer value of M. The table size should be large enough to accommodate all the entries of the file. As in other methods, this method also assumes the probability of finding an entry in the file is equal to one. Under these two assumptions, and if a good hash function is selected for a balanced distribution of hash values, the open addressing method will resolve the situation if more than one key is mapped into a particular slot in the table, and yields a very attractive average number of searches in most of the cases. 
The algorithm is best described in Morris' phrases: &quot;The first method of generating successive calculated addresses to be suggested in the literature was simply to place colliding entries as near as possible to their nominally allocated position, in the following sense.</Paragraph> <Paragraph position="1"> Upon collision, search forward from the nominal position (the initial calculated address), until either the desired entry is found or an empty space is encountered, searching circularly past the end of the table to the beginning, if necessary. If an empty space is encountered, that space becomes the home for the new entry.&quot; Peterson did some simulations of open addressing by generating random numbers and storing them into a 500-entry table, and the result of the average number of searches from nine different runs is compared with the calculation obtained through Morris' formula or Salton's formula (L is the loading factor, or the percentage of table fullness, at the time of search). It is thus clear that unless the table is nearly full, the average number of searches will be surprisingly small. For example, if the loading factor is equal to or less than 0.9, the average number of searches will be an amazing 1.965. This can be achieved by allowing an extra ten percent of the table size. In this case, its storage efficiency will become less attractive. However, its search efficiency and update efficiency are excellent due to its extremely low average number of searches.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6. Indirect Chaining Method </SectionTitle> <Paragraph position="0"> Since this method makes the same two assumptions as the open addressing method and is heavily dependent upon hash addressing, a more descriptive name for this method is suggested as Hash-Addressed Indirect-Chaining Search (HAICS). 
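The linear-probing scheme quoted from Morris above can be sketched as follows; the hash function and table size are illustrative assumptions:

```python
# Sketch of open addressing (linear probing): on collision, search forward
# from the nominal position, circularly past the end of the table, until the
# desired entry or an empty slot is found.

M = 4
TABLE_SIZE = 2 ** M                 # table size predetermined as 2**M
table = [None] * TABLE_SIZE         # each slot holds (key, value) or None

def hash_key(key):
    return hash(key) % TABLE_SIZE   # stand-in for the paper's hash function

def store(key, value):
    i = hash_key(key)               # the nominal (calculated) address
    for _ in range(TABLE_SIZE):
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)         # empty slot becomes the home
            return
        i = (i + 1) % TABLE_SIZE            # search circularly
    raise RuntimeError("table full")

def retrieve(key):
    i = hash_key(key)
    for _ in range(TABLE_SIZE):
        if table[i] is None:
            return None                     # empty slot: key is absent
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % TABLE_SIZE
    return None
```

As the text notes, the probe count stays small until the table approaches fullness.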
Other names found in the literature are scatter index tables, direct chaining (a variation in chaining structure), closed addressing (direct and indirect chaining), and virtual scatter tables (matching additional hashed bits).</Paragraph> <Paragraph position="1"> The HAICS method uses a structured four-field table, an additional non-addressable overflow area of the table or a separate overflow table, and a free storage area called the available space list. It is aimed at fully utilizing all spaces reserved for the table before using the overflow area and the free storage area. This method treats the addressable table area as end-rounded, i.e., the first address of the table is considered to follow the last address. When overflow occurs, the non-addressable overflow area is made available as an extended table area. This is so arranged to achieve better storage efficiency, since in most cases there is no need for the additional overflow area and thus it can be omitted at the beginning or added on when the need arises.</Paragraph> <Paragraph position="2"> The HAICS chaining table has four fields: keyword (key or data), index, link and pointer. The keyword field is usually one computer word in size for accommodating the symbols which identify the entry. The index field should have enough bits to specify the largest relative address in the available space list, so that the variable-length entry stored in the available space list can be indexed from this table. The link field is used to indicate the linkage to the next table address where information of entries of the same hashed value can be found. The pointer field is designated to contain the address of the first entry of the entries with the same hashed value. 
Both the link field and the pointer field should have a field length in bits large enough to store the largest relative address of the addressable table area, i.e., the size of the addressable table.</Paragraph> <Paragraph position="3"> Entries are entered at their hashed addresses first, and then upon collision allotted to the next (or surrounding) empty addresses. Their pointers and links are set up for the proper chaining. When an entry is being looked up, the first step is to check the pointer of the entry at the hashed address, then go to the pointed address and start searching from this beginning entry. If it is not found, the entry pointed to by the link of the current entry is searched until it is found or there is no further link. The latter case indicates that there is no match for this search. When the entry is found through keyword identification, the address stored in the index field will direct to the actual entry storage in the available space list. The index field and the available space list are needed only if the entries are of variable length, so that storage space can be conserved. In the case of fixed-length entries, the available space list is no longer needed and the index field in the table should be changed into an entry field with the desired fixed length. A great advantage of hash addressing is that updating entries in a file requires no sorting or resorting of any kind. In the HAICS method, to delete an entry is to follow the algorithm until the entry is retrieved, and then to hook up the next entry in the chain to the previous entry. All the storage previously occupied by this entry is freed for later use. To add an entry will use the same algorithm to retrieve the last entry in a particular chain and then to set up the linkage to the next empty space in the table and have the information of the added entry stored there. 
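The chain mechanics just described can be sketched with four parallel arrays, one per field. This is a hedged illustration, not the paper's Fortran: fixed-length entries are assumed (so the entry field stands in for the index field), the hash function is a stand-in, and -1 replaces the paper's 0 as the "no link" sentinel:

```python
# Hedged sketch of the HAICS four-field chaining table. POINTER at a hashed
# address gives the first member of that hash value's chain; LINK threads the
# chain; entries displaced by collisions occupy the next empty slot found by
# an end-round scan. No overflow-area handling in this sketch.

N = 8                                   # addressable table size
KEYWORD = [None] * N
ENTRY   = [None] * N                    # stands in for the index field
LINK    = [-1] * N                      # -1 marks the end of a chain
POINTER = [-1] * N                      # -1 marks "no chain for this hash"

def h(key):
    return sum(key.encode()) % N        # stand-in for a real hash function

def next_empty(start):
    i = start                           # end-round scan for an empty slot
    while KEYWORD[i] is not None:
        i = (i + 1) % N
    return i

def store(key, entry):
    home = h(key)
    if POINTER[home] == -1:             # first entry of this hash value
        slot = next_empty(home)         # use the hashed address if free
        POINTER[home] = slot
    else:                               # follow the chain to its last link
        slot = POINTER[home]
        while LINK[slot] != -1:
            slot = LINK[slot]
        LINK[slot] = next_empty(slot)
        slot = LINK[slot]
    KEYWORD[slot], ENTRY[slot] = key, entry

def retrieve(key):
    slot = POINTER[h(key)]              # check the pointer at the hashed
    while slot != -1:                   # address, then walk the links
        if KEYWORD[slot] == key:
            return ENTRY[slot]
        slot = LINK[slot]
    return None
```

Deletion, as the text says, amounts to unlinking the found slot from its chain and freeing its storage.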
The added entry itself is stored in some free storage area in the available space list, being indexed in the chaining table.</Paragraph> <Paragraph position="4"> This method was first introduced in 1961 by Johnson and its average number of searches is calculated simply as: S = 1 + L/2. More interesting yet, this formula is still valid when the loading factor L is greater than one, which means the number of entries exceeds the allotted table size and the information of overflow entries is kept in the overflow area while the entries themselves are again placed in the available space list. The cost of overflow increases linearly at merely 0.9 searches per 100% increase of overflow. This provision virtually eliminates the fear of overflow, which frequently causes almost unmanageable difficulties at very high expense.</Paragraph> <Paragraph position="5"> Before the table is full, as in the usual case, the average number of searches of the indirect chaining method is a hard-to-believe 1.25, with a maximum of 1.5 when the table is about full. It is on account of these two figures, the above-mentioned update efficiency, and the overflow advantage that the author believes some storage inefficiency and programming complexity should be tolerated painlessly. Most search algorithms not included in the above table are variations and combinations of the linear search, single chain, directory search, binary search, double chain and distributed key, aimed at the improvement of a certain efficiency. 
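Johnson's cost formula can be checked directly against the figures the text quotes (1.25 in the usual case, 1.5 at fullness); the function name is illustrative:

```python
# The indirect-chaining average-search formula quoted above: S = 1 + L/2,
# where L is the loading factor. Note that it remains valid for L > 1,
# i.e., when the entries overflow the allotted table size.

def expected_searches(loading_factor):
    return 1 + loading_factor / 2

print(expected_searches(0.5))   # 1.25, the figure cited for the usual case
print(expected_searches(1.0))   # 1.5, the maximum when the table is full
```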
For example, the double chain method itself is a combination of binary search, a variation of single chain, and the linear search, and it aims to improve the update efficiency of the binary search.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> III. KEYWORD CONSTRUCTION AND HASH FUNCTION </SectionTitle> <Paragraph position="0"> It is understood from the previous comparison of various search algorithms that the turning point for the excellent performance in search and update efficiencies is at hash addressing, which is essentially a simple procedure applying a certain hash function to a search key or keyword. Since the same keyword will always be hashed into the same hash value for table addressing, the criterion for keyword selection or construction to identify an entry is uniqueness. And, in order to minimize undesired collisions upon hashed addresses, a good hash function should be selected such that it would yield a balanced distribution of hashed values within the range of the table size.</Paragraph> <Paragraph position="1"> 1. Keyword Construction In consideration of programming and computing efficiency and of storage efficiency, a keyword of one computer word in size is usually more desirable, e.g., an eight-character keyword on a 48-bit-word machine.</Paragraph> <Paragraph position="2"> In machine files such as dictionaries, thesauruses, keyword indices, and merchandise catalogs, the keyword is almost readily available for hashing. 
If the keyword is longer than the allowable number of characters, a simple word truncation at the right end or some word compression scheme can be used to reduce the word size to the desired number of characters.</Paragraph> <Paragraph position="3"> For example, standard word abbreviation, or a simple procedure to eliminate all the vowels and one of two identical consecutive consonants in a word, will be acceptable for this purpose.</Paragraph> <Paragraph position="4"> Author indices and catalogs, membership rosters, alphabetical telephone directories, taxation records, census records, personnel records, student files, and any file using a person's name as the primary source of indexing can conveniently use the last name plus one space character and the initials as the keyword. The word compression scheme is certainly applicable if it is necessary.</Paragraph> <Paragraph position="5"> When title indices and catalogs, subject indices and catalogs, business telephone directories, scientific and technical dictionaries, lexicons and idiom-and-phrase dictionaries, and other descriptive multi-word information are desired, the first character of each nontrivial word may be selected in the original word sequence to form a keyword. For example, the rather lengthy title of this paper may have SADSIRS as its keyword. 
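The three construction schemes just described can be sketched as follows; the eight-character limit, the stop-word list, and the decision to keep a word's first letter during compression are illustrative assumptions:

```python
# Hedged sketches of keyword construction: right-end truncation, vowel and
# double-consonant compression, and first-letter selection from non-trivial
# words of a title.

import re

def truncate(word, n=8):
    """Simple truncation at the right end to at most n characters."""
    return word[:n]

def compress(word):
    """Drop one of two identical consecutive consonants, then the vowels
    (the leading character is kept so the word stays recognizable)."""
    no_doubles = re.sub(r"([B-DF-HJ-NP-TV-Z])\1", r"\1", word.upper())
    return no_doubles[0] + re.sub(r"[AEIOU]", "", no_doubles[1:])

def initials(title, trivial={"A", "AN", "THE", "AND", "FOR", "OF"}):
    """First character of each non-trivial word, in original word sequence."""
    return "".join(w[0] for w in title.upper().split() if w not in trivial)

print(initials("Basic Indexing and Retrieval System"))   # BIRS
```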
Several known information systems are named exactly in this manner, such as SIR (Raphael's Semantic Information Retrieval), SADSAM (Lindsay's Sentence Appraiser and Diagrammer and Semantic Analyzing Machine), BIRS (Vinsonhaler's Basic Indexing and Retrieval System), and CGC (Klein and Simmons' Computational Grammar Coder).</Paragraph> <Paragraph position="6"> An alternative to meet the need of the multi-word situation, but with a possible improvement in the uniqueness of the resulting hashed value, is to perform some arithmetic or logical manipulation on the binary representation of the multi-word.</Paragraph> <Paragraph position="7"> When the multi-word is stored in consecutive computer words, each binary representation of a computer word is treated as an individual constant. Then either an arithmetic operation (e.g., ADD, SUBTRACT, MULTIPLY, and DIVIDE) or a logical operation (e.g., AND, OR) is performed on these computer words to collapse them into one single computer word as the keyword. The resulting keyword from this kind of manipulation is not human readable but will serve its purpose for hash addressing.</Paragraph> <Paragraph position="8"> In some cases where a unique number is assigned to an entry, there is no need to hash this number provided that number is inside the range of the allotted table size.</Paragraph> <Paragraph position="9"> This is mostly seen when a record or document is arranged by its accession number or location index. Otherwise the number can be treated as letters and be constructed by one of the methods described above.</Paragraph> </Section> <Section position="9" start_page="0" end_page="116" type="metho"> <SectionTitle> 2. Hash Function </SectionTitle> <Paragraph position="0"> The different functions used for random number generation can also serve as the hash function if a likely one-to-one relation can be established between the keyword and the resulting random number. 
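The multi-word collapsing just described can be sketched as below. The 48-bit word, the six-character packing, and the choice of XOR as the logical operation are all illustrative assumptions:

```python
# Hedged sketch of collapsing a multi-word key into one "computer word":
# pack the text into 48-bit constants, then fold them with a logical
# operation. The result is not human readable but is usable for hashing.

def machine_words(text, chars_per_word=6):
    """Pack the text into 48-bit integers, six 8-bit characters per word."""
    chunks = [text[i:i + chars_per_word].ljust(chars_per_word)
              for i in range(0, len(text), chars_per_word)]
    return [int.from_bytes(c.encode("ascii"), "big") for c in chunks]

def fold(words):
    """Collapse the constants into a single word (XOR here; ADD under a
    48-bit mask would serve equally well)."""
    out = 0
    for w in words:
        out ^= w
    return out

key = fold(machine_words("MACHINE TRANSLATION SYSTEM"))
```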
This is also subject to the restriction that only numbers inside the range of the table size are acceptable. Frequently this method will not give a balanced distribution of table addresses and thus affects the search and update efficiencies.</Paragraph> <Paragraph position="1"> The arithmetic or logical manipulation described above for handling multi-word items can also be used as a hash function.</Paragraph> <Paragraph position="2"> One method, called division hash code, is suggested by Maurer, in which the binary representation of a keyword is treated as an integer and divided by the table size. The remainder of this division is thus inside the range of the table size and is used as the hash value. As Maurer noticed, this method has the disadvantage that sometimes it does not produce indices which are equally distributed.</Paragraph> <Paragraph position="3"> Three methods of computing hash addresses with proven satisfactory results were described very neatly by Morris: &quot;If the keys are names or other objects that fit into a single machine word, a popular method of generating a hash address from the key is to choose some bits from the middle of the square of the key--enough bits to be used as an index to address any item in the table. Since the value of the middle bits of the square depends on all of the bits of the key, we can expect that different keys will give rise to different hash addresses with high probability, more or less independently of whether the keys share some common feature, say all beginning with the same bit pattern.</Paragraph> <Paragraph position="4"> &quot;If the keys are multiword items, then some bits from the product of the words making up the key may be satisfactory as long as care is taken that the calculated address does not turn out to be zero most of the time. 
The most dangerous situation in this respect is when blanks are coded internally as zeros or when partial word items are padded to full word length with zeros.</Paragraph> <Paragraph position="5"> &quot;A third method of computing a hash address is to cut the key up into N-bit sections, where N is the number of bits needed for the hash address, and then to form the sum of all of these sections. The low order N bits of the sum are used as the hash address. This method can be used for single-word keys as well as for multiword keys .... &quot; All three of these methods assume one slight restriction: the size of the table has to be a power of two because of the binary bit selection. The author personally prefers the first of these three methods due to the extremely simple programming involved. Depending on the machine, the main operation requires about five machine language instructions: load the A register with the keyword; integer-multiply by the keyword; left-shift the A and Q registers X bits so that the desired bits are at the left end of the Q register; clear the A register; left-shift the A and Q registers again Y bits so that the desired bits reside at the right end of the A register (CDC 3600 COMPASS). If the second method described by Morris is used, the keyword construction for multi-word items can be eliminated if there is no risk of the kind described. The third method is more interesting because it has the generality of accepting both single-word and multi-word items, but at a slight cost of some more programming, which is to be offset by the cost of multi-word construction.</Paragraph> <Paragraph position="6"> IV. HAICS DATA STRUCTURE In response to the needs of search and update efficiencies, the data structure for the HAICS technique has to be organized in a much more sophisticated way, with some additional storage requirement over the entries themselves. 
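Two of the hash computations quoted from Morris above can be sketched as follows; the 48-bit word size and the table size 2^M are illustrative assumptions:

```python
# Hedged sketches of mid-square bit selection (for one-word keys) and the
# N-bit section sum (for keys of any length). Both require the table size
# to be a power of two, 2**M, because of the binary bit selection.

M = 10                                  # hash addresses need M bits
WORD_BITS = 48

def midsquare(key_word):
    """Choose M bits from the middle of the square of the one-word key."""
    square = key_word * key_word        # up to 2 * WORD_BITS bits
    shift = (2 * WORD_BITS - M) // 2    # centre the selected bit field
    return (square >> shift) & ((1 << M) - 1)

def section_sum(key_word):
    """Cut the key into M-bit sections and keep the low M bits of the sum."""
    total = 0
    while key_word:
        total += key_word & ((1 << M) - 1)
        key_word >>= M
    return total & ((1 << M) - 1)
```

Both return values in the range 0 to 2^M - 1, directly usable as table addresses.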
As previously described under the Indirect Chaining Method, it requires a fixed-size four-field addressable chaining table for bookkeeping the keyword and all the information for the chaining mechanism, a reserved free storage area called the available space list for storing the variable-length entries themselves, and an added-on non-addressable overflow table area for overflow chainings.</Paragraph> <Paragraph position="7"> The overall HAICS data structure is quite list oriented, but it is packed into the form of arrays for a more efficient indexing and searching procedure. A test program has been written in CDC 3600 Fortran (a variation of Fortran IV) for the convenience of adapting it to other computers. The discussions following will frequently refer to the Fortran language and the list structure for better clarification.</Paragraph> <Paragraph position="8"> 1. The Chaining Table The four-field table can be easily set up as four single-dimensional arrays or as a four-dimensional array, at the cost of several wasted bits in the computer word for storing the index to the available space list, the link to the next table address, or the pointer to indicate the beginning of a chained sub-list. The saving is fewer computer word-packing and unpacking operations.</Paragraph> <Paragraph position="9"> The positions of each four-field table item in relation to the first item, i.e., the hash addresses, can be viewed as the main list of the table. The linked entries of the same hashed address are treated as a sub-list. Since the relative position of each table item is identified with its hashed address, there is no need to set up an address index for each item. 
Besides, since each entry can be hash-addressed with an average number of searches of 1.25, and since most chains are not much longer than one or two entries, it is not necessary to have a backward link within a chain.</Paragraph> <Paragraph position="10"> The layout of the chaining table is shown in Table 4 with some sample linkages indicated in hand-drawn circles and arrows.</Paragraph> <Paragraph position="11"> [Table 4: chaining table layout (keyword, index, link, pointer fields) with sample linkages; not reproduced here] </Paragraph> <Paragraph position="13"> The four-field table items can also be viewed as four-field nodes or cells in a list-structured presentation: [keyword | index | link | pointer]. An example of the list-structured presentation of the HAICS data structure is given in Figure 1.</Paragraph> </Section> <Section position="10" start_page="116" end_page="116" type="metho"> <SectionTitle> 2. The Available Space List </SectionTitle> <Paragraph position="0"> This can be a single one-dimensional array for the best storage efficiency in accommodating variable-length entries.</Paragraph> <Paragraph position="1"> The beginning of an entry is indexed in the index field of the chaining table as the relative position of its first computer word in the available space list. This is also shown in the examples of Figure 1. The ending of an entry can either be indicated by a special symbol, such as two consecutive blanks or EOE as the abbreviation of end-of-entry, or calculated as the entry length and placed at the beginning of the entry.</Paragraph> <Paragraph position="2"> Multi-dimensional arrays are usually wasteful for storing variable-length entries. If the entries are of fixed length then, as described before, there is no need to have a separate available space list. These fixed-length entries can be put directly in the enlarged index field of the chaining table for more efficient processing.</Paragraph> </Section> <Section position="11" start_page="116" end_page="116" type="metho"> <SectionTitle> 3. 
The Overflow Table </SectionTitle> <Paragraph position="0"> The overflow table is structured the same way as the chaining table except that it is not accessible through the hash function. It serves as an emergency device only after the chaining table is fully utilized and additional storage area is available at that time. When the overflow table is established to meet the emergency, its array names and size are made available to the HAICS procedure as an extended area for the chaining table.</Paragraph> <Paragraph position="1"> V. HAICS ALGORITHMS FOR STORING, RETRIEVING, UPDATING, AND UTILITY</Paragraph> </Section> <Section position="12" start_page="116" end_page="116" type="metho"> <SectionTitle> FUNCTIONS </SectionTitle> <Paragraph position="0"> The logical procedure of the HAICS technique is described in algorithms for easy adaptation to procedure-oriented languages such as Fortran and Algol. Currently, eight commands have been implemented and are operational in the CDC 3600 test program: STORE, RETRIEVE, ADD, DELETE, REPLACE, PRINT, COMPRESS, and LIST. They can be functionally classified into three groups: the main algorithms for STORE and RETRIEVE; the updating algorithms for ADD, DELETE, and REPLACE; and the utility algorithms for PRINT, COMPRESS, and LIST.</Paragraph> <Paragraph position="1"> The two main algorithms are frequently utilized by the other algorithms except PRINT.</Paragraph> <Paragraph position="2"> These algorithms are presented in detail in the following: 1. Algorithm STORE (S) This algorithm is to be used for establishing a HAICS file at the very beginning. 
It is assumed that the arrays for the chaining table and the available space list have been set up properly, and that keywords and the hash function have been constructed or selected appropriately throughout the subsequent uses of this file.</Paragraph> <Paragraph position="3"> S1 Clear the arrays for the Chaining Table and the Available Space List, and set up proper indices for these tables S2 Compute the hash value of the given keyword, S5 The entry is stored in the Available Space List (ASL) sequentially starting at ASL(J) and with a special symbol EOE placed at the end of the entry in ASL, and exit on success.</Paragraph> <Paragraph position="4"> S6 If POINTER(I) = 0 and KEYWORD(I) ≠ 0, then S7 Search the keyword array downward and end-round until a KEYWORD(I) = 0 is found S8 Set I = POINTER(I), KEYWORD(I) = KEY, and go to Step S4 S9 If a KEYWORD(I) = 0 cannot be found in the keyword array, a message is given to indicate the overflow of the Chaining Table and then exit on failure.</Paragraph> <Paragraph position="6"> upon success, go to Step S1. 2. Algorithm RETRIEVE (R) With a given keyword, this algorithm will retrieve the entry which is associated with the keyword under the same hash function.</Paragraph> <Paragraph position="8"> move the entry in ASL starting from ASL(J) to a working area until an EOE is encountered, and exit on success.</Paragraph> <Paragraph position="9"> If KEYWORD(I) ≠ KEY, and if LINK(I) = 0, then exit on failure.</Paragraph> <Paragraph position="10"> If KEYWORD(I) ≠ KEY, and if LINK(I) ≠ 0, then</Paragraph> <Paragraph position="12"> Repeat for additional entries starting at Step R1.</Paragraph> <Paragraph position="13"> Examples in Tables 5 and 6 will also illustrate this algorithm in actual applications. The execution of the RETRIEVE command will not change the contents of the Chaining Table and the Available Space List in any event.</Paragraph> </Section> <Section position="13" start_page="116" end_page="116" type="metho"> <SectionTitle> 3. 
Algorithm ADD (A) </SectionTitle> <Paragraph position="0"> This algorithm is used when an additional or new entry is put into the already established HAICS file. It is an operation of &quot;adding&quot; an entry to the end of the chain of its hashed address, rather than breaking up the chain and &quot;inserting&quot; the entry according to some order or hierarchy. This is so because each chain in the HAICS file is mostly very short, with only one or two entries, and the &quot;inserting&quot; will gain very little in search and update efficiencies.</Paragraph> <Paragraph position="2"> This algorithm is different from Algorithm STORE in that no clear-up operations are performed on the arrays of the Chaining Table and the Available Space List. In addition, a relative address in the Available Space List is accepted as the first available address to store the added entries themselves.</Paragraph> <Paragraph position="3"> This algorithm, DELETE (D), is used to delete an entry from the HAICS file. It is heavily dependent upon Algorithm RETRIEVE, but instead of just retrieving the entry it traces back and deletes the entry itself from the Available Space List and all the pertinent information in the Chaining Table. D1 Go to Step R1 in Algorithm RETRIEVE; return to Step D2 upon exit on failure from Algorithm RETRIEVE, or to Step D3 upon exit on success from Algorithm RETRIEVE D2 Exit on failure.</Paragraph> <Paragraph position="4"> D3 Clear up the occupied section of the entry in ASL, including the special symbol EOE at the end of the entry</Paragraph> <Paragraph position="6"> trace back the previous link which contains I and set it to zero when it is found; otherwise trace back the original pointer and set it to zero, exit on success.</Paragraph> </Section> <Section position="14" start_page="116" end_page="116" type="metho"> <SectionTitle> 5. 
Algorithm REPLACE (RP) </SectionTitle> <Paragraph position="0"> This is for the replacement of an entry itself in the Available Space List, with the same keyword and linkages in the Chaining Table unchanged. Replacement entries longer than the original entries can be treated in a few different ways.</Paragraph> <Paragraph position="1"> The current algorithm will truncate the excessive end and give a message to indicate the situation. A remedy, if desired, can then be made through the deletion of the incomplete entry and the addition of the complete entry as a new entry. This algorithm will make use of the Algorithms RETRIEVE and STORE to find the desired entry and then replace the old contents with the new contents in the Available Space List.</Paragraph> <Paragraph position="3"> the old entry in ASL starting from ASL(J) and including an EOE RP5 The new entry is stored in ASL starting from ASL(J) RP6 If the new entry plus an EOE can be accommodated in the old space, exit on success.</Paragraph> <Paragraph position="4"> RP7 If the new entry plus an EOE cannot be accommodated in the old space, then store the new entry up to the length of the old entry and put an EOE at the end, exit on partial success.</Paragraph> </Section> <Section position="15" start_page="116" end_page="433" type="metho"> <SectionTitle> 6. Algorithm PRINT (P) </SectionTitle> <Paragraph position="0"> This is a simple algorithm for the utility function of arranging information in table form and printing out the Chaining Table and the Available Space List, as in Table 4 to Table 14 of this paper.</Paragraph> <Paragraph position="1"> This algorithm, COMPRESS (C), is designed to serve as a &quot;garbage collector&quot; as in the list processing languages, for better storage efficiency. In practical applications, the Available Space List is a huge free storage area which can be on a secondary bulk storage device such as a drum or disk for random access. 
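The garbage-collection role just described can be illustrated with a simplified sliding compaction. Note the swap: the paper's own COMPRESS moves the last entry into interior holes, whereas this sketch simply slides all live entries to the front; the list-of-cells layout and names are assumptions for illustration:

```python
# Hedged analogue of repacking the available space list: copy each live
# entry (terminated by an "EOE" cell) to the front so that all free cells
# accumulate in one large group at the end for subsequent additions.

def compress(asl, index):
    """asl: list of cells (None = free); index: keyword -> start position.
    Returns the repacked list and the updated index."""
    packed, new_index = [], {}
    for key, start in sorted(index.items(), key=lambda kv: kv[1]):
        new_index[key] = len(packed)    # entry's new starting address
        i = start
        while asl[i] != "EOE":          # copy the entry up to its terminator
            packed.append(asl[i])
            i += 1
        packed.append("EOE")
    packed.extend([None] * (len(asl) - len(packed)))   # free space at the end
    return packed, new_index
```

After repacking, the chaining table's index field would be rewritten from the returned index, just as the paper's COMPRESS updates INDEX(I).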
After several updating functions are performed on the HAICS file, there will inevitably be some space groups residing in the middle of the used portion of the Available Space List. And eventually a situation will be reached in which the end of the Available Space List is reached but many space groups are scattered in the middle.</Paragraph> <Paragraph position="2"> To remedy this situation, a periodic operation of the COMPRESS command is desirable to repack the Available Space List for better storage utilization. Many strategies or hierarchies can be used to achieve this purpose, with some variations in computing efficiency. The current algorithm starts with the last entry in the Available Space List and moves it to the first accommodatable space group found from the beginning of the List. The process is repeated until all the accommodatable space groups found are filled, and thus a large space group is accumulated at the end of the Available Space List for subsequent additions of new entries. C1 Search for the largest INDEX(I) in the Chaining Table, then set J = INDEX(I) C2 Count the length of this last entry in ASL starting at ASL(J) until an EOE is found C3 Check an internal table of space groups in ASL to find an accommodatable space group such that the number of spaces in the space group is greater than the length of the last entry, go to Step C6 if it is not found, go to Step C4 if it is found C4 Move the last entry to the space group found, go to Step C5 if some spaces are left unused, otherwise exit on success.</Paragraph> <Paragraph position="3"> C5 Store information of the space group to the internal table, exit on success.</Paragraph> <Paragraph position="4"> C6 Search ASL from its beginning for a space group, go to Step C10 if it is not found, go to Step C7 if it is found C7 If the space group found is accommodatable, then go to Step C4, otherwise go to Step C8 C8 Store information of the space group found to the internal table C9 Search ASL continuously for a 
space group, go to Step C10 if it is not found before the search reaches the original location of the last entry, go to Step C7 if it is found
C10 Exit on table not compressible.</Paragraph> <Paragraph position="5"> C11 Repeat for additional compression upon exit on success by entering at Step C1.</Paragraph> <Paragraph position="6"> A sample result of the COMPRESS algorithm upon Tables 9 and 11 is shown in the following Tables 12 and 13:</Paragraph> <Paragraph position="8"> 8. Algorithm LIST (L) In contrast to the Algorithm PRINT, LIST will initiate an alphabetical sort on the keywords stored in the Chaining Table, and output an alphabetical list of all entries in the HAICS file. The final output of this algorithm performed on Tables 12 and 13 is illustrated in Table 14.</Paragraph> <Paragraph position="9"> Sort array KEYWORD(I) alphabetically and carry along the original sequence in the array, I, during the sort process. Take the first original sequence number in the sorted keyword order, I. Set J = INDEX(I), print the hash value, the keyword, the entry starting address in ASL, and the entry itself, exit on success.</Paragraph> <Paragraph position="10"> Repeat for the next keyword and its original sequence number in the sorted array until this array is exhausted.</Paragraph> <Paragraph position="11"> For the purpose of demonstrating the actual performance of the HAICS main and update algorithms, the statistics gathered from the test run (which also produced Tables 5-11) are listed below in Tables 15 and 16 and are the basis for a preliminary discussion.</Paragraph> </Section> <Section position="16" start_page="433" end_page="433" type="metho"> <SectionTitle> ADD DELETE REPLACE </SectionTitle> <Paragraph position="0"> [Table 15 columns, repeated for each of ADD, DELETE, and REPLACE: sample entry sequence; number of searches of current entry; accumulative average number of searches.]
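The compaction idea behind steps C1-C11 can be sketched as follows. This is a simplified illustrative sketch, not the paper's implementation: the ASL is modeled as a flat list of cells, free cells carry an assumed FREE marker, and the internal table of space groups is replaced by a direct scan.

```python
# Sketch of COMPRESS: repeatedly move the last entry in the Available
# Space List into the first free "space group" nearer the front that can
# hold it, accumulating one large free space group at the end.
EOE, FREE = "$", "."   # end-of-entry mark and free-cell marker (assumed)

def space_groups(asl):
    """Yield (start, length) of each run of FREE cells, front to back."""
    i = 0
    while i < len(asl):
        if asl[i] == FREE:
            start = i
            while i < len(asl) and asl[i] == FREE:
                i += 1
            yield start, i - start
        else:
            i += 1

def compress(asl):
    while True:
        # C1-C2: locate the last entry (ending in EOE) and its length
        end = max((i for i, c in enumerate(asl) if c == EOE), default=-1)
        if end < 0:
            return asl
        start = end
        while start > 0 and asl[start - 1] not in (EOE, FREE):
            start -= 1
        length = end - start + 1
        # C3/C6-C9: first accommodatable space group before the entry
        gap = next((s for s, n in space_groups(asl)
                    if n >= length and s < start), None)
        if gap is None:
            return asl                  # C10: table not further compressible
        # C4: move the entry, freeing its old cells; C11: repeat
        asl[gap:gap + length] = asl[start:end + 1]
        asl[start:end + 1] = [FREE] * length

asl = list("AB") + [EOE] + [FREE] * 4 + list("XY") + [EOE]
print("".join(compress(asl)))           # prints "AB$XY$...."
```

After compression the entries sit contiguously at the front and the accumulated free space group at the end can serve subsequent ADD commands, as the text describes.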
The STORE efficiency, i.e., the accumulative average number of searches for the STORE algorithm, as shown in Table 15, reveals that starting with an empty chaining table, it is a low 0.286 at 87.5% fullness of the table. Most entries are entered into this table with no search at all, which implies a well-balanced distribution of keyword hash values.</Paragraph> <Paragraph position="1"> The ADD efficiency is a function of the STORE efficiency. In this sample's statistics, the ADD efficiency obtained through the addition of four entries to make a full chaining table is in fact the same as if these four entries had been placed at the end of the STORE command. Thus the ADD efficiency of 0.75 for four entries can be combined with the STORE efficiency for twenty-eight entries, and the result is a STORE efficiency of 0.344 for a full 32-entry chaining table. It is noted that the ADD efficiency is always greater than (or equal to) the STORE efficiency due to the nonemptiness of the chaining table.</Paragraph> <Paragraph position="2"> The RETRIEVE efficiency is always identical with the search efficiency as indicated in Table 3, which is an average of 1.25 for the indirect chaining method. The accumulative average number of searches does fall into the range between the minimum of 1.0 and the maximum of 1.5, being 1.263 at 59.4% table fullness. Both the DELETE and REPLACE efficiencies are functions of the RETRIEVE efficiency or the search efficiency. The sample statistics of an accumulative average number of searches of 1.2 for deleting five entries and of 1.333 for replacing three entries give some indication that the DELETE and REPLACE efficiencies are compatible with the RETRIEVE efficiency.</Paragraph> <Paragraph position="3"> As mentioned before, the above discussion is preliminary and even premature. The statistics in Tables 15 and 16 do not cover some unusual circumstances, although they represent a typical example of several regular test runs.
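The combined figure of 0.344 quoted above is simply a weighted average of the 28 STOREd entries (0.286 searches each on average) and the 4 ADDed entries (0.75 searches each on average), which can be checked directly:

```python
# Check of the combined STORE efficiency for the full 32-entry table:
# a weighted average of the STORE and ADD sample efficiencies.
store_eff, store_n = 0.286, 28   # STORE efficiency at 87.5% fullness
add_eff, add_n = 0.75, 4         # ADD efficiency for the last 4 entries
combined = (store_eff * store_n + add_eff * add_n) / (store_n + add_n)
print(round(combined, 3))        # prints 0.344
```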
To support, or oppose, the above discussion will demand several further extensive tests of each of these five efficiencies under a controlled and isolated environment.</Paragraph> </Section> <Section position="17" start_page="433" end_page="433" type="metho"> <SectionTitle> 2. A Framework for Information Systems </SectionTitle> <Paragraph position="0"> The HAICS method is a basic framework aimed at improving the total efficiency of an information system. It can be programmed in a number of languages, from the fundamental machine language or assembly language of a particular family of computers to the high-level procedure-oriented languages such as Fortran and Algol which are acceptable to most computers.</Paragraph> <Paragraph position="1"> With an amazing 1.25 average number of searches per entry, this method will certainly make natural language processing not much worse than numerical computation. It is ready to be implemented for text processing and document retrieval; numerical data retrieval; and for the handling of large files such as dictionaries, catalogs, and personnel records, as well as graphic information. In the test program coded in Fortran and a machine language COMPASS, eight commands as described before are currently implemented and operational in batch mode on a CDC 3600. Further development will be on the use of teletype console, CRT terminal, and plotter under a time-sharing environment for producing immediate responses. This is under the ideal of placing the most complete encyclopedia or a tailored index-reference work at one's fingertips.</Paragraph> <Paragraph position="2"> Specifically, the dictionary lookup operation, as the principal operation of an information system, is no longer a lengthy and painful procedure and thus a barrier in natural language processing.
Linguistic analysis may be provided with complete freedom in referring back and forth to any entry in the dictionary and the grammar, and the information gained at any stage of analysis can be stored and retrieved in the same way. Document retrieval may go deeper in content analysis and provide a synonym dictionary for better query descriptor transformations and matching functions. As Shoffner noted, &quot;it is important to be able to determine the extent to which file structure and search techniques influence recall, precision, and other measures of system performance.&quot; This paper tends to support Shoffner's statement by presenting an analysis of current search techniques and a detailed description of the HAICS method, which is a possible framework for most information systems.</Paragraph> </Section> </Paper>