<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4200"> <Title>KNOWLEDGE EXTRACTION FROM TEXTS BY SINTESI</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. OVERVIEW OF SINTESI </SectionTitle> <Paragraph position="0"> SINTESI (Sistema INtegrato per TESti in Italiano) \[Ciravegna 89-92\] is a prototype for knowledge extraction from Italian texts that performs a full-text analysis. It is used on short descriptive technical texts (four or five sentences in seven or eight lines) containing complex linguistic constructions such as conjunctions, negations, ellipsis and anaphorae. The use of extragrammatical language and of implicit knowledge is also frequent. Typical of our domain (car fault reports) are complex object descriptions (a full car part description may involve as many as ten words). The goal of SINTESI is to extract the diagnostic information (main fault, chain of causes, chain of effects, car parts involved, etc.) from each text and to build its semantic representation. In the rest of the paper we will show how the different knowledge sources contribute to the text analysis and how they guarantee robustness, efficiency and accuracy.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 THE SENTENCE ANALYSIS </SectionTitle> <Paragraph position="0"> SINTESI integrates five knowledge sources: lexical, syntactic, pragmatic, general semantic and world knowledge. The input text is first pre-processed by a morphological analyser with the help of a dictionary (currently containing about 4,000 entries). A lexical-semantic analysis is then applied to recognise some special patterns, such as ranges of data or numbers, measurements, chemical formulas, etc., through a context-free grammar. Some special preliminary information (&quot;the semantic markers&quot;) is also put in the sentence to help semantics in the following steps.</Paragraph>
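<Paragraph> To make the pre-processing step concrete, the following sketch (in Python, not part of SINTESI itself) annotates a tokenised sentence with semantic markers taken from a toy lexicon; the regular expressions merely stand in for the context-free grammar used for the special patterns, and every name and lexicon entry is invented for the illustration.

import re

# Toy lexicon mapping word stems to semantic markers; the real SINTESI
# dictionary contains about 4,000 entries (the entries here are invented).
MARKER_LEXICON = {
    "fissagg": "car_part",   # fissaggi/fissaggio
    "coppa": "car_part",
    "motore": "car_part",
    "cricc": "fault",        # cricche
}

# Regular expressions standing in for the context-free grammar that
# recognises special patterns (ranges, measurements, etc.).
SPECIAL_PATTERNS = [
    ("range",       re.compile(r"\d+\s*-\s*\d+")),
    ("measurement", re.compile(r"\d+(?:[.,]\d+)?(?:mm|cm|km|bar|kg)")),
]

def preprocess(sentence):
    """Return (token, marker) pairs: the semantic markers that later
    guide the semantics-driven analysis."""
    annotated = []
    for token in sentence.lower().split():
        marker = None
        for name, pattern in SPECIAL_PATTERNS:
            if pattern.fullmatch(token):
                marker = name
                break
        if marker is None:
            for stem, m in MARKER_LEXICON.items():
                if token.startswith(stem):
                    marker = m
                    break
        annotated.append((token, marker))
    return annotated

print(preprocess("Fissaggi della coppa olio motore con cricche"))
# [('fissaggi', 'car_part'), ('della', None), ('coppa', 'car_part'), ...]
</Paragraph>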
<Paragraph position="1"> The rest of the analysis is based on a semantics-driven approach that flexibly integrates two modules dedicated to linguistic analysis, two to semantic analysis and one to pragmatic analysis (fig. 1).</Paragraph> <Paragraph position="2"> The semantic modules lead the analysis in order to perform two main steps: object recognition and object linking. This separation is introduced in order to be able to recognize fragments when a complete analysis is impossible. The syntactic modules are used to build the linguistic structure of the objects and of the whole sentence. The pragmatics is mainly used to control the object linking. Additional modules provide the discourse analysis and the correlation of the knowledge extracted from the different sentences of the text.</Paragraph> <Paragraph position="3"> The pragmatics and the additional modules are not discussed in this paper.</Paragraph> <Paragraph position="4"> 3 The formalism was developed in collaboration with a group of the University of Torino.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 OBJECT RECOGNITION </SectionTitle> <Paragraph position="0"> The object recognition step processes the text from left to right, trying to identify the objects in each sentence. It is a bottom-up task that uses four kinds of knowledge: the semantic markers, the general semantic knowledge, the world knowledge and the syntactic knowledge. The schema of this step is shown in fig. 2. The presence of the objects is signalled by the semantic markers put by the lexicon. The markers activate an expectation in the semantic module that is used to control the analysis and guarantee robustness. The syntactic module is then activated to recognise the structural form of each object without trying to build a full sentence representation. The syntactic knowledge \[Campia et al. 90\] is represented by an independent grammar. The syntactic structure of the linguistic expressions is represented by dependency trees. This kind of representation was chosen because it makes the syntax-semantics interaction easier, the interdependencies among words being shown in a very clear manner. The syntactic analysis is done by a set of production rules in which the conditions test the current tree status, whereas the actions modify it. A semantic test is performed immediately after each syntactic action in order to improve the efficiency and reduce the use of backtracking \[Ciravegna 91a\]. Each semantic test activates the semantic module, which uses two kinds of knowledge to answer the test and to build the semantic representation of the current object: the general and the world knowledge.</Paragraph> <Paragraph position="1"> The general semantic knowledge is based on caseframes and contains the basic information needed to answer the test (general description of objects, information about roles and role-fillers, etc.). The caseframe model was derived from the Entity Oriented Parsing approach \[Hayes 84\]. The information contained in the caseframes produces a first hint on the identity of the object. Performing a semantic test at this level means filling a role in a caseframe.</Paragraph> <Paragraph position="2"> The precise identity of all the already known objects is contained in the world knowledge, which is formed by the syntactic and semantic descriptions of each object \[Lesmo 90,91\]3. Only the objects interesting for the knowledge extraction are contained in this structure. Performing a semantic test at this level means matching the syntactic representation against the contained descriptions via a set of structural rules. Every accepted connection between two words contributes both to filling the roles of the caseframe and to extracting a semantic identity. The syntactic analysis of the current object description ends when the next word does not belong to the current object description (for semantic, pragmatic or linguistic reasons). The object is then closed and the control is returned to the semantic controller, which will try to identify another object, until the end of the sentence is reached.</Paragraph>
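<Paragraph> As a minimal illustration of this recognition loop (a sketch under simplifying assumptions, not the actual production-rule formalism), the fragment below opens an object when a semantic marker is found, attaches the following words as dependents while a caseframe-style compatibility test succeeds, and closes the object as soon as the test fails; the COMPATIBLE table and all type names are invented, and the dependency tree is flattened to a head plus a list of dependents.

from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)

# Toy caseframe-level test: which marker types may extend an object
# opened by a given marker (invented; stands in for the caseframes).
COMPATIBLE = {
    "car_part": {"car_part", None},   # unmarked words may specify a car part
    "fault": {"fault", None},
}

def recognise_objects(annotated):
    """Left-to-right object recognition over (word, marker) pairs."""
    objects, current, current_type = [], None, None
    for word, marker in annotated:
        if current is None:
            if marker is not None:          # a marker signals a new object
                current, current_type = Node(word), marker
            continue
        if marker in COMPATIBLE.get(current_type, {None}):
            # "syntactic action" (attach as dependent) immediately
            # followed by the semantic test encoded in COMPATIBLE
            current.children.append(Node(word))
        else:                               # close the object, open a new one
            objects.append((current_type, current))
            current, current_type = Node(word), marker
    if current is not None:
        objects.append((current_type, current))
    return objects

pairs = [("fissaggi", "car_part"), ("della", None), ("coppa", "car_part"),
         ("olio", None), ("motore", "car_part"), ("con", None), ("cricche", "fault")]
for kind, tree in recognise_objects(pairs):
    print(kind, tree.word, [c.word for c in tree.children])
# car_part fissaggi ['della', 'coppa', 'olio', 'motore', 'con']
# fault cricche []
</Paragraph>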
<Paragraph position="3"> As an example of the analysis of a very simple description of a car fault, consider: &quot;Fissaggi della coppa \[dell'\] olio \[del\] motore con cricche&quot; (literally: &quot;Bolts of the sump \[of the\] oil \[of the\] engine with breaks&quot;). The first semantic marker is put on &quot;fissaggi&quot;, which activates an entity of type &quot;car_part&quot; (fig. 3a). At this step the role-expectation given by the caseframe shows that in the rest of the sentence we will probably have some other words specifying the current object and a fault associated with the car part. The expectation given by the world knowledge consists of the following car part descriptions: a. (fissaggio (perno (banco (motore)))) b. (fissaggio (coppa (olio (motore)))) c. (fissaggio (coppa (riciclaggio (olio)))) d. (fissaggio (ammortizzatori (anteriori (telaio)))) These descriptions are transformed into the corresponding dependency trees (see fig. 3). The syntactic module is then activated and tries to connect &quot;coppa&quot; to &quot;fissaggio&quot; through the preposition &quot;del&quot;; the connection is semantically acceptable because it is plausible for the caseframe and the preposition is acceptable for the connection. Moreover, there are domain-based semantic descriptions that support the connection ('b' and 'c' only, because the others do not support the word &quot;coppa&quot;), so the semantic module is able to continue building the current object and the parser does the same. The analysis continues in the same way until the word &quot;cricche&quot; is found; &quot;cricche&quot; points to another kind of entity (a fault), so the semantic controller stops the syntactic module. The fault description is then analysed in the same way. At the end of the object recognition we will have two objects: a car part (given by a caseframe and the 'b' description) and a fault.</Paragraph>
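<Paragraph> To make the structural match against the world knowledge concrete, here is a small sketch (assuming, for simplicity, that the dependency trees above are flattened into chains of content words; this is not the original matching formalism). A known description remains a candidate as long as the content words recognised so far form a subsequence of its chain; skipping function words is also what tolerates the dropped prepositions and determiners discussed below.

# World-knowledge car-part descriptions a-d from the example above,
# flattened to chains of content words.
WORLD_KNOWLEDGE = {
    "a": ["fissaggio", "perno", "banco", "motore"],
    "b": ["fissaggio", "coppa", "olio", "motore"],
    "c": ["fissaggio", "coppa", "riciclaggio", "olio"],
    "d": ["fissaggio", "ammortizzatori", "anteriori", "telaio"],
}

FUNCTION_WORDS = {"del", "della", "dell'", "di", "con"}

def normalise(word):
    # crude stand-in for the morphological analyser (fissaggi -> fissaggio)
    return {"fissaggi": "fissaggio"}.get(word, word)

def is_subsequence(needle, haystack):
    it = iter(haystack)
    return all(any(x == y for y in it) for x in needle)

def match_description(words):
    """Labels of the descriptions still compatible with the words seen so far."""
    content = [normalise(w) for w in words if w not in FUNCTION_WORDS]
    return [label for label, chain in WORLD_KNOWLEDGE.items()
            if is_subsequence(content, chain)]

print(match_description(["fissaggi", "della", "coppa"]))           # ['b', 'c']
print(match_description(["fissaggi", "coppa", "olio", "motore"]))  # ['b']
</Paragraph>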
<Paragraph position="4"> SINTESI is able to cope with the loss of determiners and prepositions in the description, because the dependency grammar shows only the structural relations between the different syntactic types; for example, a typical rule states that a determiner (or a noun) may be attached to a noun, rather than something like &quot;NP -> det + Noun&quot;. This kind of ill-formedness is then overcome by the formalism, without introducing metarules, sublanguage concepts or semantics-driven rules.</Paragraph> <Paragraph position="5"> At this level the interaction between syntax and semantics leads to the recognition of the object from both a structural and a semantic point of view when the object is present in the world knowledge base.</Paragraph> <Paragraph position="6"> When this identity is not known, it is possible to recognize the presence of the object by using only the syntactic module and the caseframe level of the knowledge base (role expectation). Even if it is not possible to derive a direct identity (for structural reasons), it is possible to recognize the object by using the role-expectation coming from the other objects in the sentence. In this way it is possible to maintain the accuracy on the identity and the structure of the objects (if the description is already known), while maintaining the robustness (when it is not known or it contains unknown words or unsolvable gaps).</Paragraph> <Paragraph position="7"> Note that even if the object recognition is semantics-driven, the approach is flexible: sometimes, in fact, it becomes syntax-driven; this happens in order to treat some special cases, such as the noun + adjectives + noun construction due to the loss of the preposition between the two nouns. These forms often give rise to nominal compounds and, especially in conjunctions, lead to garden paths and to different rules for the adjectives \[Campia et al. 90\].</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 OBJECT LINKING </SectionTitle> <Paragraph position="0"> When all the objects of a sentence have been identified by the object recognition step, a connection among them is tried, using the role-expectation contained in the caseframe level. There are two possible kinds of connection strategies: total linking and partial linking. The first one is tried when no failures were reported during the object recognition. It integrates Bottom-Up (BUS) and Top-Down (TDS) strategies in order to build the structure of the whole sentence. The TDS is an expectation-driven analysis of the connections among objects, driven by the main roles of some kinds of constituents (verbs, conjunctions, etc.). It is also driven by linguistic and pragmatic rules.</Paragraph> <Paragraph position="1"> The TDS is not executed in a left-to-right way and it is able to cope with some kinds of garden paths introduced by conjunctions. The BUS, instead, is performed from left to right and connects the objects not considered by the TDS. For every object a role-driven connection is tried with all the objects that are linguistically and semantically acceptable.</Paragraph> <Paragraph position="2"> A focus stack is used to control the connections.</Paragraph> <Paragraph position="3"> A score is given to every linking. This score is integrated with other evaluations coming from the domain-specific knowledge. Different connection strategies are adopted according to different cases: sometimes even some linguistically unacceptable relations may be accepted for pragmatic reasons.</Paragraph> <Paragraph position="4"> When the object recognition step has reported some problems, only the BUS is adopted, in order to form some aggregates of objects. This aggregation is influenced by the content of the unknown or incorrect parts of the input.</Paragraph> <Paragraph position="5"> Connections among objects separated by obscure parts are considered unlikely. A classification of the unknown parts is done by trying to apply some lexical, syntactic or semantic heuristic rules to the identity of the words they contain (a main verb has a strong power of separation, a group of adjectives does not, and so on).</Paragraph> <Paragraph position="6"> In the example of section 3.1, the two objects (the car part and the fault) are linked via the slot &quot;with fault&quot; of the first object. The connection is done by the BUS strategy because the lack of a verb in the sentence makes the system hypothesize some gaps or ellipsis. The result of the analysis is shown in fig. 3b.</Paragraph> <Paragraph position="7"> The separation between object recognition and object linking guarantees the accuracy (when possible) and the robustness, by adopting different strategies according to the reliability of the object recognition. Note that even the TDS+BUS combination makes it possible to cope with sentences that do not lead to a complete structure (i.e. when it is not possible to connect some objects to any role in the sentence).</Paragraph> <Paragraph position="8"> This is possible because the BUS is always applied, and it is this strategy that guarantees the robustness during the object linking.</Paragraph>
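<Paragraph> A minimal sketch of the bottom-up linking strategy, under invented role expectations (the top-down strategy, the focus stack and the pragmatic evaluations are only hinted at by the score field); it reproduces the linking of the example, where the fault fills the &quot;with fault&quot; slot of the car part.

from dataclasses import dataclass, field

@dataclass
class Obj:
    kind: str                 # e.g. "car_part" or "fault"
    words: str
    fillers: dict = field(default_factory=dict)

# Invented caseframe-level role expectations: (head kind, filler kind)
# -> (role name, base score).  The real system integrates this score
# with pragmatic and domain-specific evaluations.
ROLE_EXPECTATIONS = {
    ("car_part", "fault"): ("with_fault", 1.0),
}

def bus_linking(objects):
    """BUS sketch: scan left to right and try to attach each object to an
    earlier one through an expected role, preferring the highest score."""
    top_level = []
    for obj in objects:
        best = None
        for earlier in reversed(top_level):      # most recent first, focus-like
            expectation = ROLE_EXPECTATIONS.get((earlier.kind, obj.kind))
            if expectation and (best is None or expectation[1] > best[2]):
                best = (earlier, expectation[0], expectation[1])
        if best:
            head, role, _ = best
            head.fillers[role] = obj             # obj becomes a role filler
        else:
            top_level.append(obj)                # no link found: keep at top level
    return top_level

sentence = [Obj("car_part", "fissaggi coppa olio motore"), Obj("fault", "cricche")]
result = bus_linking(sentence)
print(result[0].words, "->", result[0].fillers["with_fault"].words)
# fissaggi coppa olio motore -> cricche
</Paragraph>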
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. ADVANTAGES OF THE SCHEMA </SectionTitle> <Paragraph position="0"> The illustrated strategy fits our requirements of accuracy, robustness and efficiency.</Paragraph> <Paragraph position="1"> Accuracy is achieved because the syntax-semantics interaction during the object recognition and the TDS+BUS combination during the object linking lead to a full structural and semantic definition of the sentence when the input is correct.</Paragraph> <Paragraph position="2"> Robustness is achieved during the object recognition because the syntactic analyser is able to cope with some kinds of ill-formed input (the loss of prepositions and determiners) without introducing extra rules. In any case the analysis is still led by semantics, so it is possible to force the syntactic module to accept an incorrect input.</Paragraph> <Paragraph position="3"> Moreover, the semantic previsional ability makes it possible to cope even with unknown or incorrect object descriptions. This is possible by excluding the world knowledge and remaining at the caseframe level.</Paragraph> <Paragraph position="4"> The role-expectation driven analysis makes it possible to detect new descriptions and to propose their addition to the knowledge base. The robustness is also achieved during the object linking, because it is possible to adopt different strategies to cope with different rates of understanding or ill-formedness (unknown parts of the input, the lack of some fundamental constituents such as verbs, etc.).</Paragraph> <Paragraph position="5"> Efficiency is guaranteed because at the object recognition level each syntactic connection is immediately tested semantically, so the interaction is efficient. Moreover, the tests at the world knowledge level are efficient, because they are reduced to a comparison between graphs (the current syntactic tree and the possible deep descriptions); this is particularly important in reducing the gaps introduced by conjunctions \[Ciravegna 91b\].</Paragraph> <Paragraph position="6"> In addition, the separation between object recognition and object linking makes it possible to insert between them an additional module, called SKIMMER, that improves the efficiency; it is a function that, given the types of objects interesting for the knowledge extraction (faults, car parts, etc.) and the types of the objects contained in the current sentence, decides whether the sentence will bring new interesting information or not. If not, the object linking and the knowledge extraction are skipped. It is necessary to apply the skimmer after the end of the object recognition (and not before) because a negation or a semantic modifier may change the meaning of an object. Moreover, the anaphora resolution is not affected by the skipping of a sentence, because the objects are already recognised.</Paragraph> </Section> </Paper>