File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/80/c80-1046_metho.xml
Size: 26,936 bytes
Last Modified: 2025-10-06 14:11:17
<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1046"> <Title>GENERATION OF THESAURUS IN DIFFERENT LANGUAGES: A COMPUTER BASED SYSTEM</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> GENERATION OF THESAURUS IN DIFFERENT LANGUAGES A COMPUTER BASED SYSTEM F J DEVADASON </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="304" type="metho"> <SectionTitle> INDIA </SectionTitle> <Paragraph position="0"> The development of the theory of library classification and of subject indexing, for the organisation, storage and retrieval of subjects embodied in documents, has a striking parallelism to the search for 'universal forms' and 'deep structure' in language and linguistic studies. The significant contributions of the theories of classification and subject indexing are the subject analysis techniques of Ranganathan and Bhattacharyya's POPSI. A computer based system has been developed for generating an information retrieval thesaurus from modulated subject-propositions, formulated according to these subject analysis techniques and enriched with certain codes for relating the terms in the subject-propositions. The system generates hierarchic, associative, coordinate and synonymous relationships between terms and presents them as an alphabetical thesaurus. Also, once a thesaurus is generated in one language, it is possible to produce the same thesaurus in different languages by just forming a table of equivalent terms in the required language.</Paragraph> <Paragraph position="1"> Information Retrieval Thesaurus An information retrieval thesaurus could be defined as &quot;a controlled dynamic vocabulary of semantically related terms offering comprehensive coverage of a domain of knowledge&quot;. 
Its main use is in the subject characterization of documents and queries in information storage and retrieval systems based on concept coordination. The application of computers for updating, testing, editing and printing thesaurus 25 has gained much importance due to the use of thesaurus as a vocabulary control device in bibliographic information storage and retrieval systems, at the input stage for controlled indexing and at the retrieval stage for expanding the search query to increase recall, both in batch and on-line modes of processing 18, 22, 62, 64.</Paragraph> <Paragraph position="2"> Automatic Generation of Thesaurus Several experiments in automatic generation of thesaurus have been carried out in which relationships between terms have been determined by taking into account the number of documents in which the respective terms occur jointly 27, 65. Various clustering techniques have been investigated, based on a range of similarity criteria.</Paragraph> <Paragraph position="3"> The role played by similarity criteria in obtaining the environment of each term and the use of this environment for retrieval have been explored 57.</Paragraph> <Paragraph position="4"> Computational procedures for generating thesaurus include keyword statistics, calculation of the Tanimoto coefficient 61, matrix inversion, formation of a similarity matrix, automatic cluster analysis using the minimal tree procedure and compilation of groups and main groups of descriptors 58, 65, 66.</Paragraph> <Paragraph position="5"> But &quot;the difficulty, however, is that text-scanning is more effective in syntactic and morphological analysis where there is sufficient repetition to justify the belief that a particular fact is significant&quot; 31, 32. Further, all these techniques use a lot of computer time and are capable of producing a list of selected and grouped keywords. 
But it has also been observed that, although a large variety of clusters and associated query expansions have been obtained, no significant improvements in document retrieval performance have been achieved. In other words, an information retrieval thesaurus is something more than a list of grouped and ranked keywords. Basic Aspects of Thesaurus There are two basic aspects of thesaurus construction. They are: 1 selection of keywords/descriptors of the subject for which the thesaurus is constructed; and 2 establishment of interrelationships among these selected keywords, as to whether the terms form a broader or narrower or related or synonymous or 'use' relation.</Paragraph> <Paragraph position="6"> Using computers alone for both the above mentioned aspects of thesaurus construction is not practicable: using computers for selecting the keywords from free language text is not an economic approach, and it is not feasible to make a computer automatically distinguish the relationship between terms as broader, narrower, synonymous etc.</Paragraph> <Paragraph position="7"> Metalanguage for Information Organisation It was realised that the reason for the failure of experiments with automatic abstracting, indexing etc. &quot;should be sought above all, in an insufficient knowledge of the structure of the text from the standpoint of relationships between an apparent formal linguistic representation on one hand and on the other hand, the informational content involved in the text ... As the result of such investigations we can arrive, among others, at various descriptive formulas of the structure of scientific texts&quot; 28. One way of arriving at structures that reflect specific textual content is to make use of the restrictions in language usage which are characteristic of the texts in a particular subject matter, that is, to exploit the fact that on a particular topic, only certain words in certain combinations actually appear 55. 
In other words, it was realised that what is required is a special purpose artificial language to cater to the needs of information storage, processing and retrieval. The Automatic Language Processing Advisory Committee (1966) realised and reported that &quot;A deeper knowledge of language could help ... to enable us to engineer artificial languages for special purposes... and to use machines as aids in translation and in information retrieval&quot; 40. Subsequently, suggestions for a metatheory of linguistics and information science, with a metalanguage having all the properties of a classification schema, have been made. The term 'metalanguage' specifies a 'public' metalanguage, such as a document classification system, as distinguished from the 'object language' represented by the documents. The written record of a document classification schema is not really parallel to the surface structure of the object language - the natural language sentences of a document. A classification schema is intended to classify and, therefore, the language of the schema is mainly classificatory. In other words, the metalanguage does not explicitly include all relevant terms in the object language, but the object language does include all terms in the metalanguage.</Paragraph> <Paragraph position="8"> Moreover, superset-subset (class inclusion) relations are usually explicitly given by the structure of the classification. 
Thus some of the 'logical semantic' relations, specifically those of implication, are specified in the so-called 'surface structure' of the metalanguage, but not in the surface structure of the object language 38.</Paragraph> <Paragraph position="9"> Universal Forms and Subject Representation Parallel to the search for universal linguistic forms such as that expounded by Chomsky, Fodor and others (the discovery that certain features of given languages can be reduced to universal properties of language and explained in terms of deeper aspects of linguistic form 11, 12, 37; and that such deep structures of sentences determine the semantic content while their surface structures determine the phonetic interpretation), steps towards the formulation of a generic framework for structuring the representation of the name of a subject, for the development of classification schemes and subject indexing languages, were investigated 3, 45-48. Such universals are being arrived at and used in various other areas dealing with information and information processing. For instance, in the area of data modelling, the basic problem now is to identify the world as a domain of objects with properties and relations 10.</Paragraph> <Paragraph position="10"> Such categorisation of objects of study is not new to the library profession. 
As early as the 1930s, the categorisation of the component ideas forming the name of a subject into Personality/core object of study, Matter/property/method, Energy/action, Space/place and Time, and the definition of an order of these categories to form a 'logical, classificatory language' resulting in 'faceted' library classification schemes, was known in India 45, 47.</Paragraph> <Paragraph position="11"> It is interesting to note that it has now been realised that Ranganathan's above mentioned categories Personality, Matter and Energy are &quot;general categories building the system's structure as a spatiotemporal neighbourhood relationship&quot;, useful in deriving meta information for a process of automatic analysis too 13, 14.</Paragraph> <Paragraph position="12"> The order of the component ideas denoting the different categories in the name of a subject, as prescribed, is a context-dependent order. More specifically, it is a context-specifying order. Every component category sets the context for the next and following ones.</Paragraph> <Paragraph position="13"> Also, in this classificatory language, every category should explicitly have the corresponding superordinate component ideas preceding it. The reason for fixing the superordinates before the component elements concerned is to make the component elements denote precisely the ideas they represent.</Paragraph> <Paragraph position="14"> Further, it has been conjectured 46, 52 that the syntax (order) of representation of the component elements in the name of a subject, as prescribed by the principles for sequence (facet sequence), is more or less parallel to the Absolute Syntax, i.e., the sequence in which the component ideas of subjects falling in a subject-field arrange themselves in the minds of a majority of normal intellectuals. 
If the syntax of the representation of the component ideas of subjects is made to conform to, or parallel, the Absolute Syntax, then the pattern of linking of the component ideas, i.e., the resulting knowledge structure, is likely to be: 1 More helpful in organising subjects in a logical sequence for efficient storage and retrieval; 2 Free from the aberrations due to variations in linguistic syntax from the use of the verbal plane in naming subjects; and 3 Helpful in probing deeper into the pattern of human thinking and modes of combination of ideas.</Paragraph> <Paragraph position="15"> Subject Indexing and Thesaurus Following the development of techniques for structuring of subjects and for classification of subjects, several experiments were conducted at the Documentation Research and Training Centre to use them for thesaurus construction.</Paragraph> <Paragraph position="16"> To begin with, a faceted library classification scheme for a specific subject field was used in the computer-generation of thesaurus, in which it was possible to incorporate the hierarchic relationships of terms. But it was not possible to incorporate the generation of non-hierarchic associative relationships of terms.</Paragraph> <Paragraph position="17"> Terms that have an associative relationship to each other have to be established by consensus of experts in the field concerned. But the validity of the assumption that knowledge based on the consensus of experts in a field is different from the knowledge expressed in the literature of the field has been challenged, as two lists of keywords, one given by experts and the other formed by analysis of published literature, were not significantly different 33. In other words, terms that are related to each other associatively could be easily ascertained by an analysis of the statement of the name of the subject of a document or of a reader's query. 
For instance, whether &quot;x-ray treatment&quot; is associatively related to &quot;cancer&quot; or not could be established if there exists a document on &quot;x-ray treatment of cancer&quot;. In other words, a published document on &quot;x-ray treatment of cancer&quot; brings both &quot;x-ray treatment&quot; and &quot;cancer&quot; into an associative relationship. Also, it is unimportant which terms co-occur frequently in the names of subjects, for any term that is used even once in the statement of the name of a subject qualifies for admission into the thesaurus for that subject and is related to the other terms in that name of the subject in some particular way. In order to incorporate associatively related terms in thesauri, experiments were conducted 35, 53 on thesaurus construction using subject representations formulated for the purpose of developing classification schedules, which were arrived at by Ranganathan's facet analysis 21, 49. With certain limitations it was possible to generate broader, narrower and associative relationships, but not coordinate relationships. Further, it was realised 2, 17 that selection of candidate terms and ascertaining of multiple linkage of relationships among terms can be done in several ways, such as from the text of a dictionary, glossary, encyclopaedia and even text books and treatises.</Paragraph> <Paragraph position="18"> Artificial Language for Thesaurus Further research into the fundamentals of subject indexing languages, resulting in the development of a general theory of subject indexing languages 4 and the development of the Postulate-based Permuted Subject Indexing (POPSI) language 3, 8, has provided a basis for a more efficient and flexible system for thesaurus construction. According to the general theory of subject indexing languages, information is the message conveyed or intended to be conveyed by a systematised body of ideas, or its accepted or acceptable substitutes. 
Information, in general, is of two types: discursive information and non-discursive information or unit facts. Non-discursive information or unit facts may be either qualitative or quantitative 6. The name of a subject is essentially a piece of non-discursive information and it is conveyed by an indicative formulation that summarises in its message 'what a particular body of information is about'. &quot;The language for indicating what a body of information is about need not necessarily be in terms of sentences of the natural language. It can be an artificial language of indicative formulation used to indicate what a body of information is about&quot;.</Paragraph> <Paragraph position="19"> The essential ingredients of a language - natural or artificial - are the elementary constituents, and rules for the formulation of admissible expressions using the elementary constituents. A Subject Indexing Language consists of elementary constituents and rules for the formulation of admissible subject-propositions. It is used to summarise in indicative formulations what the contents of a source of information are about. The purpose of these summarising indicative formulations is to create groups of sources of information to facilitate expeditious retrieval of information about them by providing necessary and sufficient access points.</Paragraph> <Paragraph position="20"> The component ideas in the name of a subject can be deemed to fall in any one of the elementary categories: Discipline, Entity, Action and Property. The term 'manifestation' is used to denote an idea or a term denoting an idea, falling in any one of the elementary categories. Apart from the elementary categories there are Modifiers to the elementary categories. A modifier refers to an idea or a term denoting an idea, used or intended to be used to qualify the manifestation without disturbing the conceptual wholeness of the latter. 
A modifier can modify a manifestation of any one of the elementary categories, as well as a combination of two or more manifestations of two or more elementary categories.</Paragraph> <Paragraph position="21"> Modifiers can be common modifiers like time, place etc., or special modifiers, which can be entity based, action based or property based. Apart from the elementary categories and modifiers there is a Base and Core. Because recent research work is generally project-oriented, mission-oriented and inter-disciplinary, and not generally discipline-oriented, there may be a need to bring together all or a major portion of the information pertaining to a manifestation or manifestations of a particular elementary category. This manifestation or elementary category is the Base. Similarly, a need may arise to bring together, within a recognised Base, all or a major portion of the information pertaining to manifestations of one or more elementary categories; the category or categories concerned are the Core of the concerned Base. Also, the elementary categories may admit of Species (genus-species) and Parts (whole-part).</Paragraph> <Paragraph position="22"> The elementary constituents of a specific Subject Indexing Language, POPSI 7, 8, are given below:</Paragraph> </Section> <Section position="3" start_page="304" end_page="308" type="metho"> <SectionTitle> 2 Relation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="304" end_page="304" type="sub_section"> <SectionTitle> 2.1 General 2.2 Bias 2.3 Comparison 2.4 Similarity 2.5 Difference 2.6 Application 2.7 Influence </SectionTitle> <Paragraph position="0"> gous to D, E, A and P.</Paragraph> <Paragraph position="1"> The rules of syntax of POPSI prescribed for the subject-propositions are: D followed by E (both modified or unmodified), appropriately interpolated or extrapolated, wherever warranted, by A and/or P (both modified or unmodified). 
A manifestation of Action (A) follows immediately the manifestation in relation to which it is an A. A manifestation of Property (P) follows immediately the manifestation in relation to which it is a P. A Species (type)/Part follows immediately the manifestation in relation to which it is a Species/Part. A Modifier follows immediately the manifestation in relation to which it is a modifier. Generally a modifier gives rise to a species. Also, if necessary, auxiliary words within brackets could be inserted in between terms. These form the basis of the POPSI language.</Paragraph> <Paragraph position="2"> While examining whether a classification scheme could form a 'metalanguage' of a metatheory of linguistics and information science, it has been observed that &quot;all relational information necessary for the explication of an object language&quot; is not present in classification schema, especially role notions and presuppositions 38. If such 'relational modifiers' or 'role indicators' 20, 63, which describe the role of the concept in context and represent basic 'role notions' such as the cause of the event, the effect of the event etc., similar to the case relations (nominative, accusative, instrumental etc.) 19, are incorporated in the subject-propositions formulated according to the 'subject analysis' techniques mentioned above 3-8, 45-52, then these could form a 'metalanguage' for thesaurus, from which a thesaurus could be generated automatically.</Paragraph> </Section> <Section position="2" start_page="304" end_page="308" type="sub_section"> <SectionTitle> Input Subject-propositions for Thesaurus </SectionTitle> <Paragraph position="0"> The preparation of input to the thesaurus construction system starts with writing out sentences such as &quot;this book is about ..., this report is about ..., this paper is about ..., this query is about ...&quot; 23, 36. 
&quot;To tell what is the subject or topic of a play, a picture, a story, a lecture, a book etc., forms part of the individual's mastery of a natural language ... They are the starting point of most requesters when approaching a bibliographic information retrieval system or in a dialogue with a librarian or documentalist&quot; 60. To aid in such an indicative formulation that summarises in its message what a particular body of information is about, the title of the document, or the raw specification of the reader's query, or even a sentence or sentences in the text of a dictionary, glossary, abstract or text-book, is taken as the starting point. Each of the specific subjects dealt with in the document, or specified in the reader's query or text statements, is determined and expressed in natural language.</Paragraph> <Paragraph position="1"> Let one of the names of subjects be expressed as &quot;Re-tanning of chrome tanned leather using chestnut&quot;. Each of the component ideas, such as the name of the discipline (base), the core object of study (entity) etc., that are implied in the expressed statement of the subject is explicitly stated to form an 'expressive title' 48, 50, 51 as follows: &quot;In Leather Technology, retanning of chrome tanned leather by vegetable tanning using chestnut&quot;.</Paragraph> <Paragraph position="2"> The 'expressive title' is then analysed to identify the 'elementary categories' and 'modifiers', and the component terms are written down, removing irrelevant auxiliaries, as a formalised representation, following the principles of sequence of components 9, 49. 
The analysed and formalised subject-proposition is given below: (Discipline) Leather Technology, (Core Entity) Chrome Tanned Leather, (Action on Entity) Re-tanning, [By] (Action based Modifier) Vegetable Tanning, [Using] (Entity based Modifier) Chestnut.</Paragraph> <Paragraph position="3"> The subject-proposition is then modulated by augmenting it, by interpolating and extrapolating as the case may be, with the successive superordinates of each elementary category, found by asking 'of which it is a species (type) or part'. The synonymous terms, if any, are attached to the corresponding standard terms. The modulated subject-proposition is given below: The auxiliary words (even if relevant) are removed from the subject-proposition, and phrases enclosed within brackets indicating 'role notions' or 'role indicators' are inserted between the kernel terms. The resulting subject-proposition ... (used-) chestnut.</Paragraph> <Paragraph position="4"> The subject-proposition is further analysed to determine which terms are associatively related to each other specifically. For instance, in the above subject-proposition 'Chestnut' is related to 'Vegetable tanning' and also to 'Re-tanning', as an agent used in both the processes. 'Chrome tanned leather' is related to 'Re-tanning' as it admits of being re-tanned, and also to 'Vegetable tanning' as it admits of being vegetable (re)tanned. After this analysis, the subject-proposition is formulated as a relation map showing the 'links'. The relation map for the above subject-proposition is given in the figure below: LEATHER TECHNOLOGY. LEATHER (type of-) TANNED</Paragraph> <Paragraph position="5"> In the relation map given above, the dotted lines indicate NT/BT relationship, continuous lines indicate RT relationship and a slash indicates synonym/use relationship.</Paragraph> <Paragraph position="6"> The relationships between pairs of terms, NT or RT, indicated by dotted lines and continuous lines respectively in the example, are replaced by appropriate codes to form the input to the thesaurus generation system.</Paragraph> <Paragraph position="7"> The codes used in the subject-propositions for generating entries for a thesaurus are of the following types: 1 those that indicate which terms are to be related (codes for relating terms) and whether the relation is NT or RT or SYN; and 2 those that denote the role indicators. The codes for relating terms are of the following three types: The codes for generating NT relation and the associated computer manipulation are: '$2' -- Generate a NT relation with the immediately succeeding term using the role indicator code of the term being manipulated and generate a reverse BT relation changing the position of '-' in the role indicator code (genus-species relation); and '$3' -- Generate a NT relation with the immediately succeeding term and generate a reverse BT relation. No role indicator code is used (whole-part relation). The codes for generating RT relation and the associated computer manipulation are: '$1' -- Generate a RT relation with the immediately succeeding term using the role indicator code and generate a reverse RT relation changing the position of '-' in the role indicator code; and '$5, $6, $7, $8, and $0' -- Generate a RT relation with the immediately preceding term with the same '$ code' taking the role indicator code of the term being manipulated and generate a reverse RT relation changing the position of '-' in the role indicator code.</Paragraph> <Paragraph position="8"> The code for generating Synonymous relation and the associated computer manipulation is: '/' -- Generate a Synonymous relation with the immediately preceding term and generate a reverse 'Use' relation.</Paragraph> <Paragraph position="9"> It is to be noted that the role indicators are used specifically for further categorisation of RTs, as they are expected to be numerous. 
But the representation of genus-species relations could also be categorised to achieve a better display format and for proper generation of coordinate RTs out of the NTs to a particular term. The following is an extract of the role indicators used in our experimental codes described above to reflect the different NT and RT links as given ... An assorted number of subject-propositions from a specific subject field, augmented with codes for relating terms and codes for role indicators, are read by a program 'CODEK'. Each of the unique terms in the subject-propositions is internally assigned a unique serial number, and the respective terms in the subject-propositions are replaced by their serial numbers. As and when a term is encountered in a subject-proposition, it is matched against the existing terms; if the term is available, its serial number is picked up; if not, the term is entered as the last entry with the appropriate serial number. In either case, the term in the subject-proposition is replaced by that serial number. The term dictionary thus built and the translated subject-propositions are written separately as two different files for further processing. A sample of the dictionary is given below:</Paragraph> </Section> </Section> </Paper>
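The term-numbering step that the text attributes to the program 'CODEK' can be sketched in Python as follows. This is a minimal illustration, not the original program: the function name build_term_dictionary, the list-of-terms input layout, the case-insensitive matching and the second sample proposition are all assumptions made for the example, and the codes for relating terms and the role indicators are left out.

```python
# Hypothetical sketch of the term-numbering step described for 'CODEK'.
# Assumptions: each subject-proposition is already split into a list of terms,
# and terms are matched after trimming spaces and ignoring case.

def build_term_dictionary(propositions):
    """Give every unique term a serial number in order of first appearance
    and translate each proposition into a list of serial numbers."""
    serial_of = {}      # term -> serial number, starting from 1
    translated = []     # propositions with terms replaced by serial numbers
    for proposition in propositions:
        numbers = []
        for term in proposition:
            key = term.strip().upper()
            if key not in serial_of:
                # New term: enter it as the last entry of the dictionary.
                serial_of[key] = len(serial_of) + 1
            numbers.append(serial_of[key])
        translated.append(numbers)
    return serial_of, translated


# The first proposition uses the terms of the leather example in the text;
# the second one is invented to show a term being matched a second time.
propositions = [
    ["Leather Technology", "Chrome Tanned Leather", "Re-tanning",
     "Vegetable Tanning", "Chestnut"],
    ["Leather Technology", "Leather", "Vegetable Tanning"],
]
dictionary, translated = build_term_dictionary(propositions)
for term, serial in dictionary.items():
    print(serial, term)
print(translated)
```

In the system described, the dictionary and the translated subject-propositions are written out as two separate files; the sketch keeps both in memory for brevity.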