File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/92/c92-4200_intro.xml
Size: 4,297 bytes
Last Modified: 2025-10-06 14:05:17
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4200"> <Title>KNOWLEDGE EXTRACTION FROM TEXTS BY SINTESI</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> In recent years there has been growing interest in information retrieval from texts; indeed, the more the capacity to store large quantities of data increases, the harder it becomes to extract information from them. Classical information retrieval approaches, which consider the input only from a formal point of view, are neither powerful nor user-friendly enough: they cannot cope with the real content of the texts, and the quality of their results is generally poor. When higher quality is needed, it is necessary to adopt methods derived from the field of natural language processing in order to extract structured knowledge (such as the objects mentioned and their relations). In this case, different kinds of applications require different architectures. To extract information from news, for instance, it is necessary to cover a wide range of correct syntactic forms but only a few extragrammatical ones. At the same time, general knowledge sources are more important than domain-dependent ones [Jacobs 88]. In contrast, for short technical texts (for example, diagnostic reports) wide syntactic coverage is not the main need, but heavy use of extragrammatical forms or of a sublanguage must be taken into account [Liddy 91]. 
At the same time, domain-dependent knowledge sources play a major role, especially because of the presence of implicit knowledge.</Paragraph> <Paragraph position="1"> In the latter case many of the approaches proposed in the field of NLP are not suitable or powerful enough, because a suitable approach must guarantee three main features: efficiency, robustness and accuracy.</Paragraph> <Paragraph position="2"> Efficiency is necessary because the input is generally long (more than one sentence) and requires thorough treatment of phenomena such as anaphora and ellipsis. Moreover, efficiency is important when a system must operate in real time.</Paragraph> <Paragraph position="3"> Robustness is needed because of the use of a sublanguage and the presence of implicit knowledge and/or ill-formed sentences.</Paragraph> <Paragraph position="4"> Accuracy is needed because the objects involved in a technical description are generally very complex from a linguistic point of view (for example, a car part name may be composed of as many as ten words), and these descriptions are generally affected by the problems of the sublanguage and of implicit knowledge.</Paragraph> <Paragraph position="5"> Accuracy makes it possible to resolve those problems by using not only knowledge of the world but also linguistic information.</Paragraph> <Paragraph position="6"> Accuracy and robustness are difficult to obtain at the same time: many classical approaches to natural language processing guarantee accuracy but fail in robustness, while other methods are robust but not accurate. 
The following techniques have been proposed in recent years to cope with ill-formedness [Kirtner 91]: 1. The addition of special rules to a formal grammar [Weischedel 83]; 2. The introduction of grammar-independent syntax-driven [Mellish 89] or semantics-driven rules [Kirtner 91];</Paragraph> <Paragraph position="7"> 3. The treatment of ill-formedness as a correct form of a sublanguage [Lehrberger 86];</Paragraph> <Paragraph position="8"> 4. The use of semantics-driven approaches such as case-frame parsing [Carbonell 84].</Paragraph> <Paragraph position="9"> In this paper we present our experience in building a system to extract knowledge from short technical diagnostic texts; we adopted a full-text semantics-driven analysis complemented by the use of a general syntactic parser. Ill-formed input is treated without introducing special rules or sublanguage concepts; as we will see, we exploit the advantages of the case-frame approach to obtain robustness, while syntactic knowledge is used when accuracy is needed. ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 1244 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992</Paragraph> </Section> </Paper>