<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0401">
  <Title>Concept Identification and Presentation in the Context of Technical Text Summarization</Title>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
Abstract Introducing the Topics
</SectionTitle>
    <Paragraph position="0"> ! ! Virtual prototyping is a technique which has been suggested for use in, for example, telecommunication product development as a high-end technology to achieve a quick digital model that could be used in the same way as a real prototype. Presents the design rationale of WebShaman, starting from the concept design perspective by introducing a set of requirements to support communication via a concept model between industrial designer and a customer. In the article, the authors suggest that virtual prototyping in collaborative use between designers is a potential technique to facilitate design and alleviate the problems created by geographical distance and complexities in the work between different parties. The technique, was implemented in the VRP project, allows component level manipulation of a virtual prototype in a WWW (World Wide Web) browser. The user services, the software architecture, and the techniques of WebShaman were developed iteratively during the fieldwork in order to illustrate the ideas and the feasibility of the system. The server is not much different from the other servers constructed to support synchronous collaboration.</Paragraph>
    <Paragraph position="1"> Identified Topics: 3D model - VIRPI project - WWW - WW-vV technique - WebShaman CAD system - conceptual.model- customer - object-oriented model- product - product concept - product design - requirement - simulation model - smart virtual prototype - software component - system- technique - technology - use - virtual componentvirtual prototype - virtual prototype system- virtual prototyping Information about the Topics An example of a conceptual model, a pen-shaped'wireless user interface for a mobile telephone.</Paragraph>
    <Paragraph position="2"> A virtual prototype is a computer-based simulation of a prototype or a subsystem with a degree of functional realism, comparable to that of a physical prototype.</Paragraph>
    <Paragraph position="3"> A computer system implementing the high-end aspects of virtual prototyping has been developed in the VRP project (VRP, 1998) at VTT Electronics, in Oulu, Finland.</Paragraph>
    <Paragraph position="4"> The two-and-a-half-year VIRPI project consists of three parts.</Paragraph>
    <Paragraph position="5"> Nowadays, CAD (computer-aided design) systems are used as an aid in industrial, mechanical and electronics design for the specification and development of a product.</Paragraph>
    <Paragraph position="6"> A virtual prototype system can be used for concept testing in the early phase of product</Paragraph>
    <Paragraph position="8"> source documents. One of the alignments is presented in Table 1. The first column contains the information of the professional abstract.</Paragraph>
    <Paragraph position="9"> The second and third columns contain the information from the source document that matches the sentences of the professional abstract, and its location in the source document. We have produced 100 of these tables containing a total of 309 sentences of professional abstracts aligned with 568 sentences of source documents.</Paragraph>
    <Paragraph position="10"> These alignments allowed us to identify on one hand, concepts, relations and types of information usually conveyed in abstracts; and on the other hand, valid transformations in the source in order to produce a compact and coherent text. The transformations include verb transformation, concept deletion, concept reformulation, structural deletion, parenthetical deletion, clause deletion, acronym expansion,  rithm.</Paragraph>
    <Paragraph position="11"> In this paper we have presented a more efficient distributed algorithm which construct a breadth-first search tree in an asynchronous communication network</Paragraph>
    <Paragraph position="13"> Presents a model and gives an First we present a model and give overview of lst/overview of related research, related research.</Paragraph>
    <Paragraph position="14"> Analyzes the complexity of the algo- We analyse the complexity of our algorithm, lst/rithm, and gives some examples of per- and give some examples of performance on formance on typical networks, typical networks.</Paragraph>
    <Paragraph position="15">  rithm.&amp;quot; S.A.M. Makki. Computer Communications, 19(8) Jul 96, p628-36. abbreviation, merge and split. In our corpus, 89% of the sentences from the professional abstracts included at least one transformation.</Paragraph>
    <Paragraph position="16"> Results of the corpus study are detailed in (Saggion and Lapalme, 1998) and (Saggion and Lapalme, 2000).</Paragraph>
    <Paragraph position="17"> We have identified a total of 52 different types of information (coming from the corpus and from technical articles) for technical text summarization that we use to identify some of the main themes. These types include: the explicit topic of the document, the situation, the identification of the problem, the 'identification of the solution, the research goal, the explicit topic of a section, the * authors' development, the inferences, the description of a topical entity, the definition * of a topical entity, the relevance of a topical enthy, the advantages, etc. Information types are classified as indicative or informative depending on the type of abstract they contribute to (i.e. the topic of a document is indicative while the description of a topical entity is informative). Types of information are identified in sentences of the source document using co-occurrence of concepts and relations and specific linguistic patterns. Technical articles from different domains refer to specific concepts and relations (diseases and treatments in Medicine, atoms and chemical reactions in Chemistry, and theorems and proofs in Mathematics). We have focused on concepts and relations that are common across domains such as problem, solution, research need, experiment, relevance, researchers~ etc.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Text Interpretation
</SectionTitle>
    <Paragraph position="0"> Our approach to text summarization is based on a superficial analysis of the source document and on the implementation of some text re-generation techniques such as merging of topical information, re-expression of concepts and acronym expansion. The article (plain text in English without mark-up) is segmented in main units (title, author information, author abstract, keywords, main sections and references) using typographic information and some keywords. Each unit is passed through a bipos statistical tagger. In each unit, the system identifies titles, sentences and paragraphs, and then, sentences are interpreted using finite state transducers identifying and packing linguistic constructions and domain specific constructions. Following that, a conceptual dictionary that relates lexical items to domain concepts and relations is used to associate semantic tags to the different structural elements in the sentence. Subsequently, terms (canonical form of noun groups), their associated semantic (head of the noun group) and theirs positions are extracted from each sentence and stored in an AVL tree (te~ tree) along with their frequency. A conceptual index is created which specifies to which particular type of information each sentence could contribute. Finally, terms and words are extracted from titles and  stored in a list (the topical structure) and acronyms and their expansions are recorded.</Paragraph>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Content Selection
</SectionTitle>
      <Paragraph position="0"> In order to represent types of information we use templates. In Table 2, we present the Topic of the Document, Topic of the Section and Signaling Information templates. Also presented are some indicative and informative patterns. Indicative patterns contain variables, syntactic constructions, domain concepts and relations. Informative patterns also include one specific position for the topic under consideration. Each element of the pattern matches one or more elements of the sentence (conceptual, syntactic and lexical elements match one element while variables match zero or more).</Paragraph>
      <Paragraph position="1">  The system considers sentences ~hat were identified as carrying indicative information (their position is found in the conceptual index).</Paragraph>
      <Paragraph position="2"> Given a sentence* S and a type of information T the system verifies if the sentence matches some of the patterns associated with type T.</Paragraph>
      <Paragraph position="3"> For each matched pattern, the system extracts information from the sentence and instantiates a template of type T. For example, the Content slot of the problem identification template is instantiated with all the sentence * :(avoiding references, structural elements and parenthetical expressions) while the What slot 'of the topic of the document template is instantiated with a parsed sentence fragment * to the left or to the right of the make known relation depending on the attribute voice of the verb (active vs. passive). All the instantiated templates constitute the Indicative Data Base (IDB). The system matches the topical structure with the topic candidate slots from the IDB.</Paragraph>
      <Paragraph position="4"> The system selects one template for each term in that structure: the one with the greatest weight (heuristics are applied if there are more than one). The selected templates constitute the indicative content and the terms appearing in the topic candidate slots and their expansions constitute the potential topics of the document. Expansions are obtained looking for terms in the term tree sharing the semantic of some terms in the indicative  content.</Paragraph>
      <Paragraph position="5"> The indicative content is sorted using positional information and the following conceptual order: situation, need for research, problem, solution, entity introduction, topical information, goal of conceptual entity, focus of conceptual entity, methodological aspects, inferences and structural information. Templates of the same type are grouped together if they appeared in sequence in the list. The types considered in this process are: the topic, section topic and structural information.</Paragraph>
      <Paragraph position="6"> The sorted templates constitute the text plan.</Paragraph>
      <Paragraph position="7">  For each potential: topic and sentence where it appears (that information is found on the term tree) the system verifies if the sentence contains an informative marker (conceptual index) and satisfies an informative pattern. If so, the potential topic is considered a topic of the document and a link will be created between the topic and the sentence which will be part of the informative abstract.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4 Content Presentation
</SectionTitle>
    <Paragraph position="0"> Our approach to text generation is based on the regularities observed in the corpus of professional abstracts and so, it does not implement a general theory of text generation by computers. Each element in the text plan is used to produce a sentence. The structure of the sentence depends on the type of template.</Paragraph>
    <Paragraph position="1"> The information about the situation, the problem, the need for research, etc. is reported as in the original document with few modifications (concept re-expression). Instead other types require additional re-generation: for the topic of the document template the generation procedure is as follows: (i) the verb form for the predicate in the Predicate slot is generated in the present tense (topical information is always reported in present tense), 3rd person of singular in active voice at the beginning of the sentence; (ii) the parsed sentence fragment from the N'hat slot is generated in the middle of the sentence (so the appropriate case for the first element</Paragraph>
    <Paragraph position="3"> instance of make known instance of {research paper, study, work, research} instance of{research paper, author, study, work, research, none} parsed sentence fragment section and sentence id list of terms from the What filler  has, to be generated); and (iii) a full stop is generated. This schema of generation avoids the formulation of expressions like &amp;quot;X will be presented&amp;quot;, &amp;quot;X have been presented&amp;quot; or &amp;quot;We have presented here X&amp;quot; which are usually found on source documents but which are awkward in the context of the abstract text-type. Note that each type of information prescribes its own schema of generation.</Paragraph>
    <Paragraph position="4"> Some elements in the parsed sentence fragment require re-expression while others are presented in &amp;quot;the words of the author.&amp;quot; If the system detects an acronym without expansion in the string it would expand it and record that situation in order to avoid repetitions. Note that as the templates contain parsed sentence fragments, the correct punctuation has to be re-generated. For merged templates the generator implements the following patterns of production: if n adjacent templates are to be presented using the same predicate, only one verb will be generated whose argument is the conjunction of the arguments from the n templates. If the sequence of templates have no common predicate, the information will be presented as a conjunction of propositions. These patterns of sentence production are exemplified in Table 3.</Paragraph>
    <Paragraph position="5"> The elaboration of the topics is presented upon reader's demand. The information is presented in the order of the original text. The informative abstract is the information obtained by this process as it is shown in Figure 1.</Paragraph>
  </Section>
  <Section position="7" start_page="4" end_page="4" type="metho">
    <SectionTitle>
5 Limitations of the Approach
</SectionTitle>
    <Paragraph position="0"> Our approach is based on the empirical examination of abstracts published by second services. In our first study, we examined 100 abstracts and source documents in order to deduce a conceptual and linguistic model for the task of summarization of technical articles.</Paragraph>
    <Paragraph position="1"> Then, we expanded the corpus with 100 more items in order to validate the model. We believe that the concepts, relations and types .of information identified account for interesting ,phenomena appearing in the corpus and constitute a sound basis for text summarization.</Paragraph>
    <Paragraph position="2"> 'Nevertheless, we have identified only a few * linguistic expressions used in order to express -particular elements of the conceptual model (241 domain verbs, 163 domain nouns, 129 adj.ectives , 174 indicative patterns, 87 informative patterns). This is because we are mainly concerned with the development of a general method of automatic abstracting and the task of constructing such linguistic resources is time consuming as recent work have shown (Minel et al., 2000).</Paragraph>
    <Paragraph position="3"> The implementation of our method relies* on State-of-the-art techniques in natural language processing including noun and verb group identification and conceptual tagging. The interpreter relies on the output produced by a shallow text segmenter and on a statistical POStagger. Our prototype only analyses sentences for the specific purpose of text summarization and implements some patterns of generation observed in the corpus. Additional analysis could be done on the obtained representation to produce better results.</Paragraph>
  </Section>
  <Section position="8" start_page="4" end_page="4" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> (Paice and Jones, 1993) have already addressed the issue of content identification and expression in technical summarization using templates, but while they produced indicative abstracts for a specific domain, we are producing domain independent indicative-informative abstracts. Being designed for one specific domain, their abstracts are fixed in structure while our abstracts are dynamically constructed. Radev and McKeown (1998) also used instantiated templates, but in order to produce summaries of multiple documents in one specific domain. They focus on the generation of the text while we are addressing the overall process of automatic abstracting. Our concern regarding the presentation of the information is now being addressed by other researchers as well (Jing and McKeown, 1999).</Paragraph>
  </Section>
  <Section position="9" start_page="4" end_page="7" type="metho">
    <SectionTitle>
7 Evaluating Content and Quality in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
Text Summarization
</SectionTitle>
      <Paragraph position="0"> Abstracts are texts used in tasks such as assessing the content of the document and deciding if the source is worth reading. If text summarization systems are designed to fulfill those requirements, the generated texts have to be evaluated according to their intended function and its quality. The quality and success of human produced abstracts have already been addressed in the literature (Grant, 1992; Gibson, 1993) using linguistic criteria such as cohesion and coherence, thematic structure, sentence structure and lexical density. But in automatic text summarization, this is an emergent research topic.</Paragraph>
      <Paragraph position="1"> (Minel et al., 1997) have proposed two methods of evaluation addressing the content of the abstract and its quality. For content evaluation, they asked human judges to classify summaries in broad categories and also verify if the key ideas of source documents are appropriately expressed in the Summaries. For text quality, they asked human judges to identify problems such as dangling anaphora and broken textual segments and also to make subjective judgments about readability. In the context of the TIPSTER program, (Firmin and Chrzanowski,</Paragraph>
      <Paragraph position="3"> Re-Generated Sentences Sentences from Source Documents Illustrates the principle of virtual prototyping and the different techniques and models required.</Paragraph>
      <Paragraph position="4"> Presents the mechanical and electronic design o\] the robot harvester including all subsystems, namely, fruit localisation module, harvesting arm and gripper-cutter as well as the integration of subsystems and the specific mechanical design of the picking arm addressing the reduction of undesirable dynamic effects during high velocity operation.</Paragraph>
      <Paragraph position="5"> Shows configuration of the robotic fruit harvester Agribot and schematic view of the detaching tool.</Paragraph>
      <Paragraph position="6"> PAWS (the programmable automated welding system) was designed to provide an automated means of planning, controlling, and performing critical welding operations for improving productivity and quality.</Paragraph>
      <Paragraph position="7"> Describes HuDL (local autonomy) in greater detail; discusses system integration and the 1MA (the intelligent machine architecture); and also gives an example implementation.</Paragraph>
      <Paragraph position="8"> Figure 1 Virtual prototyping models and techniques illustrates the principle of virtual prototyping and the different techniques and models required.</Paragraph>
      <Paragraph position="9"> After a brief introduction, we present the mechanical and electronic design of the robot harvester including all subsystems, namely, fruit localisation module, harvesting arm and gripper-cutter as well as the integration of subsystems.</Paragraph>
      <Paragraph position="10"> Throughout this work, we present the specific mechanical design of the picking arm addressing the reduction of undesirable dynamic effects during high velocity operation. null The final prototype consists of two jointed harvesting arms mounted on a human guided vehicle as shown schematically in Figure 1 Configuration of the robotic . fruit' harvester Agribot.</Paragraph>
      <Paragraph position="11"> Schematic representation of the operations involved in the detaching step can be seen in Figure 5 Schematic view of the detaching tool and operation.</Paragraph>
      <Paragraph position="12"> PAWS was designed to provide an automated means of planning, controlling, and performing critical welding operations for improving productivity and quality. Section 2 describes HuDL in greater detail and section 3 discusses system integration and the IMA.</Paragraph>
      <Paragraph position="13"> An example implementation is given in section 4 and section 5 contains the conclusions.</Paragraph>
      <Paragraph position="14">  .egorization task using TREC topics. For text quality, they addressed subjective aspects such * as the length of the summary, its intelligibility and its usefulness. We have carried out an eval* uation of our summarization method in order to assess the function of the abstract and its text quality.</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
7.1 Experiment
</SectionTitle>
      <Paragraph position="0"> We compared abstracPSs produced by our method with abstracts produced by Microsoft'97 Summarizer and with others published with source documents (usually author abstracts). We have chosen Microsoft'97 Summarizer because, even if it only produces extracts, it was the only summarizer available in order to carry out this evaluation and because it has already been used in other evaluations (Marcu, 1997; Barzilay and Elhadad, 1997).</Paragraph>
      <Paragraph position="1"> In order to evaluate content, we presented judges with randomly selected abstracts and five lists of keywords (content indicators). The judges had to decide to which list of keywords the abstract belongs given that different lists share some keywords and that they belong to the same technical domain. Those. lists were obtained from the journals where the source documents were published. The idea behind this evaluation is to see if the abstract convey the very essential content of the source document.</Paragraph>
      <Paragraph position="2"> In Order to evaluate the quality of the text, we asked the judges to provide an acceptability score between 0-5 for the abstract (0 for unacceptable and 5 for acceptable) based on the following criteria taken from (Rowley, 1982) (they were only suggestions to the evaluators and were not enforced): good spelling and grammar; clear indication of the topic of  the source document; impersonal style; one paragraph; conciseness; readable and understandable; acronyms are presented along with their expansions; and other criteria that the judge considered important as an experienced reader of abstracts of technical documents.</Paragraph>
      <Paragraph position="3"> We told the judges that we would consider the abstracts with scores above 2.5 as acceptable. Some criteria are more important than other, for example judges do not care about impersonal style but care about readability.</Paragraph>
      <Paragraph position="4">  Source Documents: we used twelve source documents from the journal Industrial Robots found on the Emerald Electronic Library (all technical articles). The articles were downloaded in plain text format. These documents are quite long texts with an average of 23K characters (minimum of llK characters and a maximum of 41K characters). They contain an average of 3472 words (minimum of 1756 words and a maximum of 6196 words excluding punctuation), and an average of 154 sentences (with a minimum of 85 and a maximum of 288).</Paragraph>
      <Paragraph position="5"> Abstracts: we produced twelve abstracts us:ing our method and computed the compression ,ratio in number of words, then we produced twelve abstracts by Microsoft'97 Summarizer 1 using a compression rate at least as high as our (i.e. if our method produced an abstract with a compression rate of 3.3% of the source, we produced the Microsoft abstract with a compression rate of 4% of the source). We extracted the twelve abstracts and the twelve lists of keywords published with the source documents. We thus obtained 36 different abstracts and twelve lists of keywords.</Paragraph>
      <Paragraph position="6"> Forms: we produced 6 different forms each containing six different abstracts randomly 2 chosen out of twelve different documents (for a total of 36 abstracts). Each abstract was printed in a 1We had to format the source document in order for the Microsoft Summarizer to be able to recognize the structure of the document (titles, sections, paragraphs and sentences).</Paragraph>
      <Paragraph position="7"> 2Random numbers for this evaluation were produced using software provided by SICSTus Prolog.</Paragraph>
      <Paragraph position="8"> different page. It included 5 lists of keywords, a field to be completed with the quality score associated to the abstract and a field to be filled with comments about the abstract. One of the lists of keywords was the one published with the source document, the other four were randomly selected from the set of 11 remaining keyword lists, they were printed in the form in random order. One page was also available to be completed with comments about the task, in particular with the time it took to the judges to complete the evaluation. We produced three copies of each form for a total of 18 forms.</Paragraph>
      <Paragraph position="9">  We had a total of 18 human judges or evaluators. Our evaluators were 18 students of the M.Sc. program in Information Science at McGill Graduate School of Library &amp; Information Studies. All of the subjects had good reading and comprehension skills in English. This group was chosen because they have knowledge about what constitutes a good abstract and they are educated to become professionals in Information Science.</Paragraph>
      <Paragraph position="10">  The evaluation was performed in one hour session at McGill University. Each human judge received a form (so he/she evaluated six different abstracts) and an instruction booklet. No other material was required for the evaluation (i.e. dictionary). We asked the judges to read carefully the abstract. They had to decide which was the list of keywords that matched the abstract (they could chose more than one or none at all) and then, they had to associate a numeric score to the abstract representing its quality based on the given criteria. This procedure produced three different evaluations of content and text quality for each of the 36 abstracts. The overall evaluation was completed in a maximum of 40 minutes.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>