<?xml version="1.0" standalone="yes"?> <Paper uid="W03-2008"> <Title>Natural Language Analysis of Patent Claims</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Analysis algorithm </SectionTitle> <Paragraph position="0"> The analyzer takes a claim text as input and after a sequence of analysis procedures produces a set of internal knowledge structures in the form of predicate-argument templates filled with chunked and supertagged natural language strings. The implementation of an experimental version is being carried out in C++. In further description we will use the example of a claim text shown in Figure 1.</Paragraph> <Paragraph position="1"> The basic analysis scenario for the patent claim consists of the following sequence of procedures: Every procedure relies on a certain amount of static knowledge of the model and on the dynamic knowledge collected by the previous analyzing procedures.</Paragraph> <Paragraph position="2"> The top-level procedure of the claim analyser is tokenization. It detects tabulation and punctuation flagging them with different types of &quot;border&quot; tags. Following that runs the supertagging procedure, a look-up of words in the shallow sitional, adverbial, gerundial and infinitival phrases in the claim text shown in Figure 1. lexicon (see Section 2.1). It generates all possible assignments of supertags to words.</Paragraph> <Paragraph position="3"> Then the supertag disambiguation procedure attempts to disambiguate multiple supertags. It uses constraint-based hand-crafted rules to eliminate impossible supertags for a given word in a 5-word window context with the supertag in question in the middle. The rules use both lexical, &quot;supertag&quot; and &quot;border&quot; tags knowledge about the context. The disambiguation rules are of several types, not only &quot;reductionistic&quot; ones. For example, substitution rules may change the tag &quot;Present Plural&quot; into &quot;Infinitive&quot; (We do not have the &quot;Infinitive&quot; feature in the supertag feature space). If there are still ambiguities pending after this step of disambiguation the program outputs the most frequent reading in the multiple supertag.</Paragraph> <Paragraph position="4"> After the supertags are disambiguated the chunking procedure switches on. Chunking is carried out by matching the strings of supertags against patterns in the right hand side of the rules in the PG component of our grammar. &quot;Border&quot; tags are included in the conditioning knowledge.</Paragraph> <Paragraph position="5"> During the chunking procedure we use only a subset of PG rewriting rules. This subset includes neither the basic rule &quot;S = NP+VP&quot;, nor any rules for rewriting VP. This means that at this stage of analysis we cover only those sentence components that are not predicates of any clause (be it a main clause or a subordinate/relative clause). We thus do not consider it the task of the chunking procedure to give any description of syntactic dependencies.</Paragraph> <Paragraph position="6"> The chunking procedure is a succession of processing steps itself starting with the simplenoun-phrase procedure, followed the complexnoun-phrase procedure, which integrates simple noun phrases into more complex structures (those including prepositions and conjunctions). 
<Paragraph position="6"> The chunking procedure is itself a succession of processing steps, starting with the simple-noun-phrase procedure, followed by the complex-noun-phrase procedure, which integrates simple noun phrases into more complex structures (those including prepositions and conjunctions). Then the prepositional-, adverbial-, infinitival- and gerundial-phrase procedures are applied in turn.</Paragraph>
<Paragraph position="7"> The order of the calls to these component procedures in the chunking algorithm is established to minimize processing time and effort. The ordering is based on a set of heuristics, such as the following: noun phrases are chunked first, as they are the most frequent type of phrase and many other phrases build around them. Figure 2 is a screenshot of the interface of the analysis grammar acquisition tool. It shows traces of chunking noun, prepositional, adverbial, gerundial and infinitival phrases in the example of a claim text shown in Figure 1.</Paragraph>
<Paragraph position="8"> The next step in claim analysis is the procedure determining dependencies. At this step, in addition to PG, we start using our DG mechanism. The procedure determining dependencies falls into two components: determining elementary (one-predicate) predicate-argument structures and unifying these structures into a tree. In this paper we limit ourselves to a detailed description of the first of these tasks.</Paragraph>
<Paragraph position="9"> The elementary predicate structure procedure, in turn, consists of three components, which are described below.</Paragraph>
<Paragraph position="10"> The first find-predicate component searches for all possible predicate-pattern matches over the &quot;residue&quot; of &quot;free&quot; words in a chunked claim and returns flagged predicates of elementary predicate-argument structures. The analyzer is capable of extracting distantly located parts of one predicate (e.g. &quot;is arranged&quot; from &quot;A is substantially vertically arranged on B&quot;).</Paragraph>
<Paragraph position="11"> The second find-case-roles component retrieves semantic (case-role) and syntactic dependencies (such as the syntactic subject), requiring that all and only dependent elements (chunked phrases in our case) be present within the same predicate structure. The rules can use a 5-phrase context with the phrase in question in the middle. The conditioning knowledge is very rich at this stage. It includes syntactic and lexical knowledge about phrase constituents, knowledge about supertags and &quot;border&quot; tags, and all the knowledge about the properties of a predicate as specified in the predicate dictionary. This rich feature space allows quite good performance in solving the most difficult analysis problems, such as recovery of empty syntactic nodes, long distance dependencies, and disambiguation of PP attachment and parallel structures. A particular phrase may match several case-roles within one predicate structure. This type of ambiguity can be resolved with probabilistic knowledge about case-role weights from the predicate dictionary, given the meaning of the predicate.</Paragraph>
<Paragraph position="12"> If a predicate has several meanings, the disambiguate-predicate procedure starts; it relies on all the static and dynamic knowledge collected so far. During this procedure, once a predicate is disambiguated, it is possible to correct the case-role status of a phrase if it does not fit the predicate description in the lexicon.</Paragraph>
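<Paragraph> To make the shape of these knowledge structures concrete, the following C++ sketch shows a predicate-argument template filled with chunked phrases, where a phrase compatible with several case-roles is resolved in favour of the highest case-role weight. The predicate entry, role names and weights are invented for illustration; they are not taken from the actual predicate dictionary.

// predicate_template_sketch.cpp -- illustrative only.
#include <iostream>
#include <string>
#include <vector>

struct CaseRoleEntry  { std::string role; double weight; };         // from a (toy) predicate dictionary
struct PredicateEntry { std::string meaning; std::vector<CaseRoleEntry> roles; };

struct Filler             { std::string chunk; std::string role; }; // chunked phrase + assigned case-role
struct PredicateStructure { std::string predicate; std::vector<Filler> fillers; };

// If a phrase is compatible with several case-roles of the predicate,
// pick the candidate with the highest weight for this predicate meaning.
std::string resolve_role(const PredicateEntry& entry,
                         const std::vector<std::string>& candidates) {
    std::string best;
    double best_w = -1.0;
    for (const CaseRoleEntry& cr : entry.roles)
        for (const std::string& c : candidates)
            if (cr.role == c && cr.weight > best_w) { best = cr.role; best_w = cr.weight; }
    return best;
}

int main() {
    // Hypothetical dictionary entry for one meaning of "comprise".
    PredicateEntry comprise{"consist-of",
        {{"whole", 0.9}, {"part", 0.8}, {"location", 0.1}}};

    PredicateStructure ps{"comprising", {}};
    ps.fillers.push_back({"[NP: a sealing device]",
                          resolve_role(comprise, {"whole", "location"})});
    ps.fillers.push_back({"[NP: an annular housing]",
                          resolve_role(comprise, {"part", "location"})});

    std::cout << "predicate: " << ps.predicate << "\n";
    for (const Filler& f : ps.fillers)
        std::cout << "  " << f.role << " <- " << f.chunk << "\n";
}
</Paragraph>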
<Paragraph position="13"> Figure 3 shows the result of assigning case-roles to the predicates of the claim in Figure 1. The set of predicate-argument structures conforms to the format of knowledge representation given in Section 2.3. As we have already mentioned, the analyzer might stop at this point. It can also proceed further and unify this set of predicate structures into a tree. We do not describe this rather complex procedure here and note only that for this purpose we can reuse the planning component of the generator described in (Sheremetyeva and Nirenburg, 1996).</Paragraph> </Section>
<Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 4 Examples of possible applications </SectionTitle>
<Paragraph position="0"> In general, the final parse in the format shown in Figure 3 can be used in any patent-related application. It is impossible to give a detailed description of these applications in one paper. We thus limit ourselves to sketching just two of them: machine translation and improving the readability of patent claims.</Paragraph>
<Paragraph position="1"> Long and complex sentences, of which patent claims are an ultimate example, are often mentioned as sentences of extremely low translatability (Gdaniec, 1994). One strategy currently used to cope with the problem in the MT framework is to automatically limit the number of words in a sentence by cutting it into segments on the basis of punctuation only. In general this results in too few phrase boundaries (and some incorrect ones, e.g. enumerations). Another well-known strategy is pre-editing and post-editing and/or using controlled language, which can be problematic for the MT user. It is difficult to judge whether current MT systems use more sophisticated parsing strategies to deal with the problems caused by the length and complexity of real-life utterances, as most system descriptions are done on examples of simple sentences.</Paragraph>
[Figure caption fragments recovered from the running text: The right pane shows an input claim (see Figure 1) chunked into predicates and other phrases (case-role fillers); the structure of complex phrases can be deployed by clicking on the &quot;+&quot; sign. The right pane contains the claim text as a set of simple sentences.]
<Paragraph position="3"> To test our analysis module for its applicability to machine translation, we used the generation module of our previous application, AutoPat, a computer system for authoring patent claims (Sheremetyeva, 2003), and modeled a translation experiment within one (English) language, thus avoiding (for now) transfer problems to better concentrate on the analysis proper. Raw claim sentences were input into the analyzer and parsed.</Paragraph>
<Paragraph position="4"> The parse was input into the AutoPat generator, which due to its architecture output the &quot;translation&quot; in two formats: as a single sentence, which is required when a claim is supposed to be included in a patent document, and as a set of simple sentences in TL. (The transfer module, currently under development, transfers every individual SL parse structure into an equivalent TL structure, keeping the format of its representation. It then &quot;glues&quot; the individual structures into a tree to output the translation as one sentence, or generates a set of simple sentences directly from the parse in Figure 3.)</Paragraph>
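<Paragraph> The two output formats can be pictured with the following toy C++ sketch, which renders the same small set of predicate structures either as one sentence or as a set of simple sentences. The example data, the naive string concatenation and the comma-based clause attachment are assumptions made purely for illustration; they do not reflect the actual behaviour of the AutoPat generator or of the transfer module.

// output_formats_sketch.cpp -- illustrative only.
#include <iostream>
#include <string>
#include <vector>

struct SimplePredicate {
    std::string subject;
    std::string predicate;
    std::string complement;
};

// Render one predicate structure as a stand-alone simple sentence.
std::string as_simple_sentence(const SimplePredicate& p) {
    return p.subject + " " + p.predicate + " " + p.complement + ".";
}

// Very rough single-sentence rendering: the first structure becomes the
// main clause and the remaining ones are appended as participial clauses.
std::string as_one_sentence(const std::vector<SimplePredicate>& ps) {
    std::string s = ps.front().subject + " " + ps.front().predicate + " " +
                    ps.front().complement;
    for (std::size_t i = 1; i < ps.size(); ++i)
        s += ", " + ps[i].subject + " " + ps[i].predicate + " " + ps[i].complement;
    return s + ".";
}

int main() {
    std::vector<SimplePredicate> parse = {
        {"The apparatus", "comprises", "a housing and a sealing member"},
        {"the sealing member", "being arranged", "on the housing"}};

    std::cout << "One sentence:\n  " << as_one_sentence(parse) << "\n\n";
    std::cout << "Simple sentences:\n";
    for (const SimplePredicate& p : parse)
        std::cout << "  " << as_simple_sentence(p) << "\n";
}
</Paragraph>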
<Paragraph position="5"> The modules proved to be compatible, and the results of such &quot;translation&quot; showed a reasonably small number of failures, mainly due to the incompleteness of the analysis rules.</Paragraph>
<Paragraph position="6"> The second type of translation output (a set of sentences) shows how to use our analyzer in a separate (unilingual or multilingual) application for improving the readability of patent claims, which is relevant, for example, for information dissemination. Figure 4 is a screenshot of the user interface of a prototype of such an application.</Paragraph>
<Paragraph position="7"> We are aware of two efforts to deal with the problem of claim readability. Shinmori et al. (2002) investigate NLP technologies to improve the readability of Japanese patent claims, concentrating on rhetorical structure analysis. This approach uses shallow analysis techniques (cue phrases) to segment the claim into more readable parts and visualizes a patent claim in the form of a rhetorical structure tree. This differs from our final output, which seems to be easier to read. Shinmori et al. (2002) also refer to other NLP research in Japan directed towards dependency analysis of patent claims to support their analytical reading. Unfortunately, the author of this paper cannot read Japanese, and we thus cannot judge for ourselves how well the latter approach works.</Paragraph> </Section> </Paper>