<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1002"> <Title>TASKS, DOMAINS, AND LANGUAGES</Title> <Section position="4" start_page="8" end_page="8" type="metho"> <SectionTitle> THE MICROELECTRONICS DOMAIN </SectionTitle> <Paragraph position="0"> The reporting task in the domain of Microelectronics involves capturing information about advances in four types of chip fabrication processing technologies: layering, lithography, etching, and packaging. For each process, this information relates to process-specific parameters that typify advancements. For example, the introduction of a new type of film in layering or a reduction in granularity in lithography both indicate new developments in fabrication technology. To be relevant, these advances must be associated with some identifiable entity that is manufacturing, selling, or distributing equipment, or developing or using processing technology.</Paragraph> <Paragraph position="1"> The MICROELECTRONICS-CAPABILITY template object links together information about the four fabrication technologies (LITHOGRAPHY, LAYERING, PACKAGING, and ETCHING) with the ENTITYs, typically companies, associated with one of the technologies as its DEVELOPER, MANUFACTURER, DISTRIBUTOR, or PURCHASER USER. Additionally, the template captures information about the specific EQUIPMENT used, developed, or sold, as well as information about the type of chips or DEVICES that are expected to be produced by that technology. There is a total of nine objects in the domain.</Paragraph> <Paragraph position="2"> Figure 2 below illustrates the information types captured in the Microelectronics template.</Paragraph> </Section> <Section position="5" start_page="8" end_page="9" type="metho"> <SectionTitle> LAYERING </SectionTitle> <Paragraph position="0"> Appendix B provides an example from the Microelectronics domain, including an excerpt from an EME article, along with its corresponding filled-out template.
There are two microelectronic capabilities in this example. The first capability is succinctly represented in the first sentence, which identifies a lithography process (&quot;a new stepper&quot;) associated with an entity (&quot;Nikon Corp.&quot;) as the manufacturer and distributor (&quot;to market&quot;) of a piece of equipment that implements a lithographic process. Note also that the technology will be used to produce a device (&quot;64-Mbit DRAMs&quot;), which satisfies the reporting condition that the technology be connected to integrated circuit production. Additional information on process and equipment occurs in the text. The second capability stems from information in the second sentence (i.e., &quot;compared to the 0.5 micron of the company's latest stepper&quot;). The need to interpret this segment within the context of the discourse demonstrates the level of text understanding required in this domain.</Paragraph> </Section> <Section position="6" start_page="9" end_page="10" type="metho"> <SectionTitle> DOMAIN DIFFERENCES </SectionTitle> <Paragraph position="0"> The JV and ME domains differ in the focus of their task, type of complexity, and level of technicality. The focus of the JV task is the tie-up formation and the corresponding activities of the resulting agreement. Thus, to a large extent, the task is event-driven. The information to be extracted includes the participants in the event, the economic activity of the event, and adjunct information about the event, such as time, facilities, revenue, and ownership. Entities are central, specifically within the context of the tie-up relationship. In addition, relationships also dominate in that the tie-up event presents a cohesive collection of linked objects, e.g., persons and facilities linked to entities, entities linked to other entities, industries linked to activities, and so on.
The overarching task is fitting together the interrelated pieces of the single tie-up event.</Paragraph> <Paragraph position="1"> The focus of the ME task is the four microelectronics chip fabrication processes and their attributes. The task is not triggered by a particular event, as in JV; the focus is on more static information. The information to be extracted includes the processes with their attributes and associated devices and pieces of equipment. Processes are central in ME, whereas entities are in some sense auxiliary. Although clearly the information about processes must be associated with an entity to be relevant, the task design centers on the processes themselves and their attributes. Essentially, the domain fractures into four separate subtasks, one for each process. Linking attributes to a process, like film or temperature to the layering process, involves defining the process in terms of key characteristics inherent in the process itself. Both devices and pieces of equipment are also associated with processes, but in quite different and indirect ways. Equipment represents the implementation of a process, whereas devices represent the application of a process.</Paragraph> <Paragraph position="2"> No single overarching task applies for the ME domain; rather, there are four separate, concurrent subtasks in which associated characteristics of processes are identified.</Paragraph> <Paragraph position="3"> The two domains also differ in the nature of their complexity. The complexity of the JV domain lies not in the predominance of technical jargon but in the intricacies of the interrelationships within a tie-up event. These intricacies cover a broad range of activities that legitimately fall within the domain of joint business ventures. Since there is no single way to create a business relationship of the sort captured in this domain, there can be many points at which interpretation or judgment comes into play.
Although this interpretation can be minimized by specification (sometimes arbitrary) in the fill rules, the open-endedness, and in some ways potential for creativity, in how a tie-up is realized results in domain complexity. For example, determining whether or not a text has enough information to warrant reporting a tie-up, or whether there is sufficient evidence for a tie-up activity, may require a substantial amount of judgment on the part of the analyst. Initially, there was a wide variation in interpretation of these issues among the JV analysts for each language. However, through frequent meetings, these differences in interpretation were narrowed over time, and there was a convergence of viewpoints on what information to extract from a given JV document and how to represent it in the template. The fill rules were continually modified and updated to incorporate the heuristics developed by the analysts for determining when a valid tie-up or activity existed.</Paragraph> <Paragraph position="4"> The resolution of coreferences, which also contributes to domain complexity, is a key task in the Joint Ventures domain. In particular, the entities in the JV documents were typically referenced in multiple ways. The EJV example in Appendix A illustrates one case where each of the ENTITYs is referred to at least three times in the text, and each of those multiple (and differing) references may contribute additional information to the ENTITY objects. For example, the phrase &quot;the Japanese sports goods maker&quot; needs to be coreferenced with &quot;Bridgestone Sports Co.&quot; in order to identify the nationality of Bridgestone. Of equal importance in the JV domain is event-level coreference determination, in other words, determining which joint ventures are unique among a set of multiple apparent joint ventures in the text.
For example, the article in Appendix A has multiple paragraphs, each discussing a joint venture, and event-level coreference resolution is required to determine that they are all discussing the same joint venture, not four different ones. This coreference layering problem at both entity and event levels makes extraction difficult in this domain.</Paragraph> <Paragraph position="5"> In comparison, the ME domain derives complexity not from interrelationships, but from its composition. There are four sub-domains, one for each process. Each sub-domain corresponds to a process with attributes, two of which can be devices or pieces of equipment. In addition, entities are associated with these processes in one of four different capacities: developer, manufacturer, distributor, or purchaser/user. Adding complexity to the ME domain is the prerequisite to connect the technology to integrated chip production.</Paragraph> <Paragraph position="6"> The third area of domain difference is the level of technicality, namely, the extent to which highly technical terms and knowledge are used. The JV domain lies within the financial/economic area, and the articles are typical of general business news. The one element of the JV domain that relies more on technical jargon or specific technical descriptions is the product or service that the joint venture will be involved in. This information, in addition to being reported as an exact string fill from the text, also is reported in the JV template as a two-digit code, according to the Standard Industrial Classification manual compiled by the U.S. Office of Management and Budget. These strings may involve technical terms; for example, &quot;ignition wiring harness&quot; is classified as an automobile component.</Paragraph> <Paragraph position="7"> In contrast, the ME domain lies within the scientific and technical arena with a corpus composed of product announcements and reports on research advances.
The texts are loaded with domain-specific technical terms, at times detailing chip fabrication methodology. The fill rules provide a resource for this technical terminology, which essentially provides hooks into the text for extracted information. These hooks mean that in the pre-processing stage, some of the extracted information can be identified as discrete tagged elements and then confirmed for extraction in later stages of processing. This &quot;bias for keywording&quot; is lessened to some extent by the higher percentage of irrelevant documents in the ME corpus than the JV corpus and by two requirements in the reporting conditions (i.e., a process must be associated with an entity in one of four roles and the application for the process must be related to integrated circuits).</Paragraph> </Section> <Section position="7" start_page="10" end_page="17" type="metho"> <SectionTitle> LANGUAGE DIFFERENCES </SectionTitle> <Paragraph position="0"> Although the Japanese and English tasks are apparently identical (other than the language of the texts and templates), subtle differences emerge with closer scrutiny of the corpora, template definitions, and fill rules (see &quot;Corpora and Data Preparation&quot; paper in this volume) for each of the two languages. Even the corpora for English and Japanese differ, in that the two English corpora are drawn from more than 200 sources each, and have a fairly low percentage of irrelevant documents in the set, whereas the Japanese corpora have a limited set of sources, but a higher percentage of irrelevant documents.</Paragraph> <Paragraph position="1"> Over the course of the data preparation task, differences between English and Japanese as reflected in the corpora were gradually incorporated into the fill rules.
A major difference between the Japanese and English texts in the JV domain is the fact that in the JJV corpus, the most typical relationship involves two entities joining together in a tie-up where no joint venture company is created, whereas in EJV, the typical relationship involves one in which two entities form a joint venture company as part of the agreement. In EJV, texts which were produced by Japanese news sources (in English) could also reflect the type of tie-up arrangement typical of the Japanese texts, i.e., where no joint venture company is formed.</Paragraph> <Paragraph position="2"> Differences between Japanese and English are also reflected in minor discrepancies in the Japanese and English template definitions and more substantial divergences in the corresponding fill rules. While every attempt was made to keep the template definition for each domain identical across languages, there are some differences. Thus, although the English and Japanese templates have the same objects and slots for each domain, there are cases where the content or format of the fills for a particular slot vary from one language to the other, reflecting differences in the two corpora. In the JJV and EJV templates, an example of a content difference in fillers is seen in the FACILITY object's FACILITY-TYPE slot, which is a set fill for both EJV and JJV. However, for EJV the fillers include COMMUNICATIONS, SITE, FACTORY, FARM, OFFICE, MINE, STORE, TRANSPORTATION, UTILITIES, WAREHOUSE, and OTHER, whereas in JJV, the fillers are (translated): STORE, RESEARCH_INSTITUTE, FACTORY, CENTER, OFFICE, TRANSPORTATION, COMMUNICATIONS, CULTURE/LEISURE, and OTHER.
The fillers were defined and selected by the analysts to reflect the types of information most commonly found in the corpora.</Paragraph> <Paragraph position="3"> A format difference in slot fills between languages (for both JV and ME) is exhibited in the ENTITY object's NAME slot, where English requires a normalized form for the entity name, based on a standardized list of abbreviations for corporate designators, including more familiar ones like INC (incorporated) and LTD (limited), as well as some specifically used by foreign firms, such as AG (for Aktiengesellschaft -- Germany), EC (for Exempt Company -- Bahrain), and PERJAN (for Perusuhan Jawatan -- Indonesia). For Japanese, such a list of designators was not available, and in the corpus itself, most companies are indicated by the ending sha or kaisha, so it was decided that a string fill would be more appropriate for this slot filler.</Paragraph> <Paragraph position="4"> The JJV fill rules give detailed decision trees for determining who the tie-up partners are. This reflects the fact that in the JJV corpus, the texts often begin by mentioning a tie-up between two groups. For example, the two groups might be Mitsubishi Group and Daimler Group, but then, in the second paragraph, one learns that the actual tie-up is between Mitsubishi Shoji and Daimler Benz. The JJV fill rules explicitly address this type of situation, since it occurs frequently in the corpus. The fill rules stipulate that in cases of tie-ups between groups, the group leaders are to be taken as the tie-up partners, if they are mentioned in the text. The EJV fill rules address a slightly different problem, namely, how to represent tie-up partners if the text states &quot;Four Malaysian finance firms announced a joint venture...,&quot; in which case a tie-up between two (not four) identical partner entities would be created.
This situation did not typically arise in the JJV corpus.</Paragraph> <Paragraph position="5"> As with the JV domain, the two ME corpora highlight significant differences. First, there are basically three news sources for the JME corpus, the same set of sources as for the JJV corpus. The EME corpus, on the other hand, is selected from a business and trade database with more than 200 different sources. Second, the JME corpus (30%) contains a higher percentage of irrelevant documents than the English corpus (20%). Third, even though the relative proportion of the four process types is similar, there is a distinct difference between languages in the type of information available for the PACKAGING object. Not only is there considerably more information available for all PACKAGING slots for English, there is also clear evidence that information for the BONDING and UNITS_PER_PACKAGE slots is infrequently available in Japanese. English texts are also more likely than Japanese texts to contain two or more PACKAGING objects, which may partially explain anecdotal reports that PACKAGING texts were considered difficult to code for the English analysts, but easy for the Japanese analyst.</Paragraph> <Paragraph position="6"> There are actually no substantive differences reflected in the two sets of fill rules for the ME domain. However, differences between languages are indicated in the type of information available for extraction. For example, some set fill choices in the template simply do not occur in Japanese, like some of the hierarchical set fill choices for the PACKAGING object's TYPE slot in ME. Also, the keywords &quot;gate size&quot; and &quot;feature size&quot; that indicate granularity for the LITHOGRAPHY object do not occur in the Japanese corpus. Other minor differences are also indicated in the fill rules as to how the information is represented in the English and Japanese texts.
To illustrate, in contrast to the EME fill rules, the Japanese fill rules are more likely to list relevant keywords in the text associated with ENTITY roles and to identify relevant stereotypic format clues for location information. This approach suggests the greater likelihood of identifiable patterns within the Japanese text. Another illustration of the dissimilarity in information presentation is the Japanese inclusion of English within the Japanese text, for example in layering or packaging types or in entity names.</Paragraph> <Paragraph position="7"> In the second quarter of 1991, Nikon Corp. (7731) plans to market the &quot;NSR-1755EX8A,&quot; a new stepper intended for use in the production of 64-Mbit DRAMs. The stepper will use an 248-nm excimer laser as a light source and will have a resolution of 0.45 micron, compared to the 0.5 micron of the company's latest stepper.</Paragraph> </Section> </Paper>