File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-2008_intro.xml
Size: 10,316 bytes
Last Modified: 2025-10-06 14:02:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-2008"> <Title>Natural Language Analysis of Patent Claims</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Knowledge </SectionTitle> <Paragraph position="0"> The structure and content of the knowledge base have been designed to a) help solve analysis problems (different kinds of ambiguity) and b) minimize the knowledge acquisition effort by drawing heavily on the linguistic restrictions of patent claims. A patent claim shares technical terminology with the rest of a patent but differs greatly in its content and syntax. It must be formulated according to a set of precise syntactic, lexical and stylistic guidelines as specified by the German Patent Office at the turn of the last century and commonly accepted in the U.S., Japan, and other countries.</Paragraph> <Paragraph position="1"> The claim describes essential features of the invention in the obligatory form of a single extended nominal sentence, which frequently includes long and telescopically embedded predicate phrases. 
A US patent claim that we use as a running example in our description is shown in Figure 1.</Paragraph> <Paragraph position="2"> A cassette for holding excess lengths of light waveguides in a splice area comprising a cover part and a pot-shaped bottom part having a bottom disk and a rim extending perpendicular to said bottom disk, said cover and bottom parts are superimposed to enclose jointly an area forming a magazine for excess lengths of light waveguides, said cover part being rotatable in said bottom part, two guide slots formed in said cover part, said slots being approximately radially directed, guide members disposed on said cover part, a splice holder mounted on said cover part to form a rotatable splice holder.</Paragraph> <Paragraph position="3"> In our system the knowledge is coded in the system lexicon, which has been acquired from two kinds of corpora: a corpus of complete patent disclosures and a corpus of patent claims. The lexicon consists of two parts: a shallow lexicon of lexical units and a deep (information-rich) lexicon of predicates. Predicates in our model are words that are used to describe interrelations between the elements of an invention. They are mainly verbs, but can also be adjectives or prepositions.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Shallow Lexicon </SectionTitle> <Paragraph position="0"> The word list for this lexicon was automatically acquired from a 5-million-word corpus drawn from a US patent web site. A semi-automatic supertagging procedure was used to label these lexemes with their supertags.</Paragraph> <Paragraph position="1"> Supertagging is a process of tagging lexemes with labels (or supertags) that code richer information than standard POS tags. The use of supertags, as noted in (Joshi and Srinivas, 1994), localizes some crucial linguistic dependencies and thus yields significant performance gains. 
The content of a supertag differs from work to work and is tailored to the needs of an application. For example, Joshi and Srinivas (1994), who appear to have coined the term, use elementary trees of Lexicalized Tree-Adjoining Grammar to supertag lexical items. In (Gnasa and Woch, 2002) it is grammatical structures of the ontology that are used as supertags. In our model a supertag codes morphological information (such as POS and inflection type) and semantic information: an ontological concept that defines a word's membership in a certain semantic class (such as object, process, substance, etc.). For example, the supertag Nf shows that a word is a singular noun (N), denotes a process (f), and does not end in -ing. This supertag will be assigned, for example, to such words as activation or alignment.</Paragraph> <Paragraph position="2"> At present we use 23 supertags that are combinations of 1 to 4 features out of a set of 19 semantic, morphological and syntactic features for 14 parts of speech. For example, the feature structure of noun supertags is as follows: Tag [ POS [ Noun [ object [plural, singular] process [-ing, other [plural, singular]] substance [plural, singular] other [plural, singular] ] ] ] In this lexicon the number of semantic classes (concepts) is domain based. The &quot;depth&quot; of supertags is specific to every part of speech and codes only as much knowledge as is believed to be sufficient for our analysis procedure. That means that we do not assign equally &quot;deep&quot; supertags to every word in this lexicon. For example, supertags for verbs include only morphological features such as verb forms (-ing form, -ed form, irregular form, finite form). For finite forms we further code the number feature (plural or singular). 
Semantic knowledge about verbs is found in the predicate lexicon.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Predicate Lexicon </SectionTitle> <Paragraph position="0"> This lexicon contains rich and very elaborate linguistic knowledge about claim predicates and covers both the lexical and, crucially for our system, the syntactic and semantic knowledge. Our approach to syntax is, thus, fully lexicalist. Below, as an example, we describe the predicate lexicon for claims on apparatuses. It was manually acquired from a corpus of 1000 US patent claims.</Paragraph> <Paragraph position="1"> Every entry includes morphological, semantic and syntactic knowledge.</Paragraph> <Paragraph position="2"> Morphological knowledge contains a list of practically all forms of a predicate that could be found in the claim corpus.</Paragraph> <Paragraph position="3"> Semantic knowledge is coded by associating every predicate with a concept of a domain-tuned ontology and with a set of case-roles. The semantic status of every case-role is defined as &quot;agent&quot;, &quot;place&quot;, &quot;mode&quot;, etc. The distinguishing feature of the case frames in our knowledge base is that, within the case frame of every predicate, the case-roles are ranked according to their weight, which is calculated from the frequency of their co-occurrence with the predicate in the actual corpus. The set of case-roles is not necessarily the same for every predicate.</Paragraph> <Paragraph position="4"> Syntactic knowledge includes the knowledge about linearization patterns of predicates, which codes both the knowledge about co-occurrences of predicates and case-roles and the knowledge about their linear order in the claim text. 
Thus, for example, the following phrase from an actual claim: (1: the splice holder) *: is arranged (3: on the cover part) (4: to form a rotatable splice holder) (where 1, 3 and 4 are case-role ranks and &quot;*&quot; shows the position of the predicate) will match the linearization pattern (1 * 3 4). Not all case-roles defined for a predicate co-occur every time it appears in the claim text. Syntactic knowledge in the predicate dictionary also includes sets of the most probable fillers of case-roles in terms of types of phrases and lexical preferences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Grammar and Knowledge Representation </SectionTitle> <Paragraph position="0"> In an attempt to bypass the weaknesses of different types of grammars, the grammar description in our model is a mixture of context-free lexicalized Phrase Structure Grammar and Dependency Grammar formalisms.</Paragraph> <Paragraph position="1"> Our Phrase Structure Grammar consists of a number of rewriting rules and is specified over a space of supertags. The grammar is augmented with local information, such as lexical preference and some rhetorical knowledge: the knowledge about claim segments, anchored to tabulations, commas and a period (there can only be one rhetorically meaningful period in a claim, which is just one sentence). This allows the description of such phrases as, for example, &quot;several rotating, spinning and twisting elements&quot;. The head of a phrase (its most important lexical item) is assigned by the grammar rule used to build this phrase.</Paragraph> <Paragraph position="2"> The second component of our grammar is a version of Dependency Grammar. It is specified over the space of phrases (NP, PP, etc.) 
and a residue of &quot;ungrammatical&quot; words, i.e., words that do not satisfy any of the rules of our Phrase Structure Grammar.</Paragraph> <Paragraph position="3"> The Dependency Grammar in our model is a strongly lexicalized case-role grammar. All syntactic and semantic knowledge within this grammar is anchored to one type of lexeme, namely predicates (see Section 2.2). This grammar assigns a final parse (representation) to a claim sentence in the form: text ::= {template}{template}* template ::= {label predicate-class predicate ((case-role)(case-role)*)} case-role ::= (rank status value) value ::= phrase{(phrase(word supertag)*)}* where label is a unique identifier of the elementary predicate-argument structure (by convention, marked by the number of its predicate as it appears in the claim sentence), predicate-class is a label of an ontological concept, predicate is a string corresponding to a predicate from the system lexicon, case-roles are ranked according to the frequency of their co-occurrence with each predicate in the training corpus, status is the semantic status of a case-role, such as agent, theme, place, instrument, etc., and value is a string that fills a case-role. Supertag is a tag that conveys both morphological information and semantic knowledge as specified in the shallow lexicon (see Section 2.1). Word and phrase are a word and a phrase (NP, PP, etc.) in the standard sense. The representation is thus quite informative and captures to a large extent both the morpho-syntactic and the semantic properties of the claim.</Paragraph> <Paragraph position="4"> For some purposes such a set of predicate templates can be used as the final claim representation, but it is also possible to output a unified representation of a patent claim as a tree of predicate-argument templates.</Paragraph> </Section> </Section> </Paper>