<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-2007">
  <Title>Makoto IWAYAMA</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 Characteristics of Patent Claim
</SectionTitle>
    <Paragraph position="0"> Typical Japanese patent claims taken from two patents are shown in Figures 1 and 2.</Paragraph>
    <Paragraph position="1"> In general, Japanese sentences are punctuated internally with the touten "、" or "，" (comma) and end with the kuten "。" or "．" (period). The touten segments the sentence to disambiguate its meaning and to improve readability. According to the literature (Maekawa, 1995), the average length of Japanese sentences is 55.85 characters in newspaper articles on politics and 75.37 characters in articles on social affairs.</Paragraph>
    <Paragraph position="2"> The claims in Figures 1 and 2 are each written as a single sentence. Though they are appropriately punctuated with the touten "、", they are unusually long, at 295 characters and 119 characters respectively. Most Japanese readers who are not accustomed to reading patent claims have difficulty reading them. In fact, according to (Kasuya, 1999), Japanese patent attorneys themselves recognize that Japanese patent claims are difficult to read. [Figure caption fragment: ... containing a newline (Publication Number=10-146993). Note: &lt;nl&gt; means a newline.] The salient characteristics of Japanese patent claims from the viewpoint of readability are as follows: 1. Sentences are long.</Paragraph>
    <Paragraph position="3"> 2. The structure of description is complex.</Paragraph>
    <Paragraph position="4"> 3. Several terms are difficult to understand or require explanation. To examine the first point, we extracted all of the first claims of the sample data (59,968 patents) in the NTCIR3 patent collection and calculated the average sentence length. We found that it is 242 characters, which confirms that Japanese patent claims are unusually long.</Paragraph>
    <Paragraph position="5"> With regard to the second point, we surveyed several books and articles written for patent applicants that explain how to draft patent claims (Kasai, 1999; Kasuya, 1999) and how to translate patent claims (Lise, 2002).</Paragraph>
    <Paragraph position="6"> Based on the survey, we classify the description style into the following three styles. [Note: In the following explanation, Japanese phrases are followed by their literal expression in [] and their English translation in ().] Process sequence style: As in "...し[shi](does), ...し[shi](does), ...した[shita](and does)...", the sequence of processes is described. Mainly used in method inventions.</Paragraph>
    <Paragraph position="7"> Element enumeration style: As in "...と[to](and), ...と[to](and), ...とからなる[to kara naru](comprising), ...", the set of elements is described. Mainly used in product inventions.</Paragraph>
    <Paragraph position="8"> Jepson-like style: As in "...において[ni oite](in), ...を特徴とする[wo tokuchou to suru](be characterized by), ...", the description consists of a first half and a last half. The first half describes either the known part or the precondition. The last half describes either the new part or the main part.</Paragraph>
    <Paragraph position="9"> These patterns are not mutually exclusive. For example, the first half part of the Jepson-like style may be written in the process sequence style or in the element enumeration style.</Paragraph>
    <Paragraph position="10"> With regard to the third point, Figure 1 contains a term meaning "an actuator" and Figure 2 contains a term meaning "sticky ink", both of which require explanation to be understood.</Paragraph>
    <Paragraph position="11"> Because of these characteristics, the well-known Japanese parser KNP (Kurohashi, 2000) incorrectly analyzes or cannot process most Japanese patent claims.</Paragraph>
    <Paragraph position="12"> KNP's dependency analysis works by detecting parallel structures using a thesaurus and dynamic programming, but it does not work well for patent claims because they often include "chain expressions", in which one concept is defined first and another concept is then defined using the first.</Paragraph>
    <Paragraph position="13"> Note that the term "Jepson claim" is rigidly defined and used in Europe and in the USA to describe the kind of claim in which the known part and the new part are clearly separated. In Japan, that is not common and the separation is vaguer (Lise, 2002). That is why we call this the "Jepson-like style".</Paragraph>
    <Paragraph position="14"> For the claim in Figure 1, the five elements (a load detection method, a frequency transfer device no.1, a frequency transfer device no.2, a modulation method, and an oscillation generation method) need to be recognized as parallel, but they cannot be recognized as such because of the expressions underlined in the figure.</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="5" type="metho">
    <SectionTitle>
3 Structure Analysis of Patent Claims
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Background
</SectionTitle>
      <Paragraph position="0"> To improve readability of Japanese patent claims, we claim that the structure of description needs to be presented in a readable way. To do so, the structure needs to be analyzed first.</Paragraph>
      <Paragraph position="1"> Japanese patent claims are described in such a way that multiple sentences are coerced into one sentence (Kasuya, 1999). In other words, a claim is composed of multiple sentences that have some kind of relationship with each other. Therefore, we decided to apply RST (Rhetorical Structure Theory) (Mann, 1999), which was proposed to analyze discourse structure composed of multiple sentences.</Paragraph>
      <Paragraph position="2"> RST was proposed in the 1980s and has been successfully applied to automatic summarization (Marcu, 2000), automatic layout (John Bateman, 2000), and so on. A Tcl/Tk-based interactive tool (O'Donnell, 1997) was developed to support manual editing and visual display of the structure.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Framework
</SectionTitle>
      <Paragraph position="0"> For the structure analysis of Japanese patent claims, we defined six relations, as in Table 1. Two of them are multi-nuclear, where the component elements are equally important. Four of them are mono-nuclear, where one element is the nucleus, the other is the satellite, and the nucleus is more important than the satellite.</Paragraph>
      <Paragraph position="1"> In the "Example" column of Table 1, the regions enclosed by "[" and "]" are segments or spans, and the underlined ones are nuclei.</Paragraph>
      <Paragraph position="2"> Given the patent claims in Figure 1 and Figure 2, we can analyze their structure and present them visually by using RSTTool (O'Donnell, 1997), as in Figure 3 and Figure 4.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Cue-phrase-based Approach
</SectionTitle>
      <Paragraph position="0"> In designing the algorithm, we took a similar approach to (Marcu, 2000). We collected cue phrases that can be used for segmenting long claims and establishing relations among segments or spans.</Paragraph>
      <Paragraph position="1">  Because RSTTool is written in Tcl/Tk and Tcl/Tk is an internationalized language, we did not have to localize it to display Japanese characters.</Paragraph>
      <Paragraph position="2"> [Note: "pKlo" means "in".] Cue phrases were first collected manually by reading patent claims. We then found that about half of the claims contain newlines at apparent segment boundaries, as in Figure 2.</Paragraph>
      <Paragraph position="3"> We investigated all of the extracted first claims of the sample data and found that 48.5% of them are newline-inserted claims. It seems that the drafters of patent claims explicitly inserted those newlines for their own readability. We checked the description pattern of the last three morphemes just before each newline in those claims. The result is shown in Table 2. In Table 2, "Verb-Cont-Form" denotes a verb in continuous form and "AuxVerb-Cont-Form" denotes an auxiliary verb in continuous form. Note that the description patterns are expressed in the regular expression notation of Perl.</Paragraph>
      <Paragraph position="4"> Summarizing the above, we came up with the cue phrases in Table 3. In Table 3, "Verb-Basic-Form" denotes a verb in basic form and "AuxVerb-Basic-Form" denotes an auxiliary verb in basic form.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Algorithm and Implementation
</SectionTitle>
      <Paragraph position="0"> We designed an algorithm for analyzing the structure of independent claims. Although patent claims are written in natural language, they are not written in free form; they are restricted in the sense that there are description styles established in the community. So, we designed an algorithm composed of a lexical analyzer and a parser, as in formal language processors. Independent claims are claims that do not refer to any other claims.</Paragraph>
      <Paragraph position="1"> First, the input claim is analyzed with the morphological analyzer "chasen" (Matsumoto et al., 2002). Because some patent claims explicitly contain newlines, as in Figure 2, we use the "-j" option, setting the sentence delimiter to "。" in ".chasenrc". Next, the output from chasen is analyzed with the lexical analyzer. The main point of our algorithm is the context-dependent behavior of the lexical analyzer, as follows: * The lexical analyzer outputs two types of token: cue phrase tokens and morpheme tokens.</Paragraph>
      <Paragraph position="2"> * Morpheme tokens are output only under certain contextual conditions, to avoid ambiguities in parsing.</Paragraph>
      <Paragraph position="3"> * For other morphemes whose context did not satisfy the above conditions, an anonymous morpheme token (WORD) is output.</Paragraph>
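      <Paragraph> The context-dependent behavior described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name lex, the cue-phrase table, the ELEMENT_CUE token name, and the single contextual condition (a noun is emitted as a typed token only when a cue phrase follows it) are all assumptions made for the example. <![CDATA[
```python
from typing import List, Tuple

# Hypothetical cue-phrase table (surface string -> token name).
# "において" is the phrase romanized as "ni oite" and "と" the one romanized
# as "to"; the real table (Table 3 of the paper) is not reproduced here.
CUE_PHRASES = {"において": "JEPSON_CUE", "と": "ELEMENT_CUE"}


def lex(morphemes: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (surface, part_of_speech) pairs to (token_type, surface) pairs."""
    tokens = []
    for index, (surface, pos) in enumerate(morphemes):
        has_next = index + 1 < len(morphemes)
        next_surface = morphemes[index + 1][0] if has_next else ""
        if surface in CUE_PHRASES:
            # Cue phrase token.
            tokens.append((CUE_PHRASES[surface], surface))
        elif pos == "NOUN" and next_surface in CUE_PHRASES:
            # Assumed contextual condition: emit a typed morpheme token
            # only when a cue phrase immediately follows.
            tokens.append(("NOUN", surface))
        else:
            # Everything else becomes the anonymous morpheme token.
            tokens.append(("WORD", surface))
    return tokens
```
]]> In this sketch, any morpheme that neither is a cue phrase nor satisfies the assumed condition falls through to the anonymous WORD token, which keeps the set of terminals seen by the parser small.</Paragraph>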
      <Paragraph position="4"> Next, the output from the lexical analyzer is processed with a parser generated from a context-free grammar (CFG) by a parser generator compatible with Bison (Donnelly and Stallman, 1995). The CFG we designed for Japanese patent claims consists of 57 rules, 11 terminals, and 19 non-terminals.</Paragraph>
      <Paragraph position="5"> Finally, a structure tree is constructed as an ".rs2" file, the format used by RSTTool v2.7. By using RSTTool, the output is displayed visually, as in Figure 3 and Figure 4.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Evaluation
</SectionTitle>
      <Paragraph position="0"> The evaluation was done using the first claims of 59,956 patents extracted from the NTCIR3 patent data collection.</Paragraph>
      <Paragraph position="1"> The NTCIR3 patent data collection consists of 697,262 patents opened to the public in 1998 and 1999. For the analysis, the collection of cue phrases, and the creation of the CFG, we used patents from 1998. For the evaluation, we used patents from 1999. We checked the IPC (International Patent Classification) codes of the 59,956 patents and confirmed that their distribution is similar to that of all patents opened in 1999, as disclosed by the JPO (Japan Patent Office). The evaluation covered the following points. The accept ratio was more than 99.77%. The processing speed was 0.30 seconds per claim (measured on a Linux PC with a 1 GHz Pentium III and 512 MB of memory), so it is almost real-time.</Paragraph>
      <Paragraph position="2"> By specifying a command-line switch, our program can be run without using the originally inserted newlines. The newline insertion positions can be predicted from the result of structure analysis and some heuristics. So, an indirect evaluation was done by comparing the newline positions in the originally newline-inserted claims with those in claims where newlines were inserted automatically from the result of structure analysis. The recall (R), the precision (P), and the F-measure (F) are calculated as R = c/n, P = c/i, and F = 2PR/(P+R), where c is the number of correctly inserted newlines, n is the number of newlines in the original claim, and i is the number of inserted newlines.</Paragraph>
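      <Paragraph position="3"> As a worked illustration of these measures, the following sketch assumes the standard definitions R = c/n, P = c/i, and F = 2PR/(P+R), since the original equations did not survive in this copy of the text. The function name newline_scores and the zero-division handling are assumptions made for the example. <![CDATA[
```python
from typing import Tuple


def newline_scores(correct: int, original: int, inserted: int) -> Tuple[float, float, float]:
    """Return (recall, precision, f_measure) for newline insertion.

    correct  -- c, the number of correctly inserted newlines
    original -- n, the number of newlines in the original claim
    inserted -- i, the number of automatically inserted newlines
    """
    recall = correct / original if original else 0.0
    precision = correct / inserted if inserted else 0.0
    total = recall + precision
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure
```
]]> For instance, with 3 correct newlines out of 4 original and 6 inserted ones, the recall is 0.75 and the precision is 0.5.</Paragraph>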
      <Paragraph position="4"> The baseline inserts newlines mechanically at the end of every sequence matching "(NOUN|SYMBOL)(、|，)" or "(Verb-Cont-Form|AuxVerb-Cont-Form)(、|，)". Note that newlines are sometimes inserted at positions that are not segment boundaries in the RST sense. For example, newlines are often inserted after a postpositional particle representing the subject followed by a touten. So, our newline-insertion prediction algorithm has an inherent upper limit on recall of 0.873.</Paragraph>
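      <Paragraph> The baseline can be sketched as follows. This is an illustrative reading of the two patterns, not the authors' code: it assumes the input is a list of (surface, tag) pairs from a morphological analyzer, that the tag names match those used in the patterns, and that the touten characters are "、" and "，". The function name baseline_newlines is hypothetical. <![CDATA[
```python
from typing import List, Tuple

# Tags that, when followed by a touten, trigger a newline in the baseline.
BREAK_TAGS = {"NOUN", "SYMBOL", "Verb-Cont-Form", "AuxVerb-Cont-Form"}
# Assumed touten characters (the originals are garbled in the source text).
TOUTEN = {"、", "，"}


def baseline_newlines(morphemes: List[Tuple[str, str]]) -> str:
    """Insert a newline after each touten preceded by a BREAK_TAGS morpheme."""
    pieces = []
    previous_tag = None
    for surface, tag in morphemes:
        pieces.append(surface)
        if surface in TOUTEN and previous_tag in BREAK_TAGS:
            pieces.append("\n")
        previous_tag = tag
    return "".join(pieces)
```
]]> Because the rule looks only at the tag of the single morpheme before the touten, it ignores the structure of the claim, which is what makes it a purely mechanical baseline.</Paragraph>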
      <Paragraph position="5"> The result is shown in Table 4.</Paragraph>
      <Paragraph position="6"> The direct evaluation of accuracy was done using 100 randomly selected claims. All of these claims are first claims. Again, we checked the distribution of IPC codes and confirmed that it is similar to that of all patents opened in 1999, as disclosed by the JPO.</Paragraph>
      <Paragraph position="7"> The 100 claims were analyzed by our program, and the visually displayed outputs, like those in Figures 3 and 4, were presented to a subject who had some experience in reading patent specifications. The subject evaluated the result by the following criteria: * when the claim is in the Jepson-like style, whether that is correctly recognized.</Paragraph>
      <Paragraph position="8"> * when the claim is in the Jepson-like style, whether the structure is correctly analyzed for the first half part and for the last half part.</Paragraph>
      <Paragraph position="9"> * when the claim is not in the Jepson-like style, whether the structure is correctly analyzed for the whole.</Paragraph>
      <Paragraph position="10"> The result is shown in Table 5.</Paragraph>
    </Section>
    <Section position="6" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
3.6 Application to Patent Claim Paraphrase
</SectionTitle>
      <Paragraph position="0"> Once the structure of patent claims is analyzed, we can apply the result to paraphrase patent claims.</Paragraph>
      <Paragraph position="1"> To do so, the following actions are incorporated into the lexical analyzer and the parser.</Paragraph>
      <Paragraph position="2"> * The lexical analyzer deletes three words that each mean "the". * For the parser, new actions are added that relocate the "noun group" located at the end to the front, and likewise the "noun group" located just before JEPSON CUE in Jepson-like style claims.</Paragraph>
      <Paragraph position="3"> * For the process sequence style, the lexical analyzer conjugates verbs and adverbs from their continuous form to basic form and replaces the touten "(、|，)" with the kuten "。". * For the element enumeration style, the lexical analyzer converts cue phrases such as those meaning "consist of" and "include" to their "ている" ("teiru") form plus "。" and deletes "と(、|，)" (and) at the end of each element. * The lexical analyzer converts the word meaning "thing" just before the cue phrase meaning "characterized by" to the word meaning "the following".</Paragraph>
      <Paragraph position="4"> * For the Jepson-like style, the parser separates the first-half part and the last-half part by inserting a newline.</Paragraph>
      <Paragraph position="5"> By the above processing, long patent claim sentences are divided into multiple sentences. Because some of the generated sentences may still be too long, sentences longer than the threshold length (75 characters) are processed recursively.</Paragraph>
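      <Paragraph> The recursive step can be sketched as follows. This is a minimal illustration under stated assumptions: split_once stands in for one pass of the structure-based paraphrasing described above and is supplied by the caller, and the names split_recursively and THRESHOLD are hypothetical. Only the threshold of 75 characters comes from the text. <![CDATA[
```python
from typing import Callable, List

THRESHOLD = 75  # threshold length in characters, as stated in the text


def split_recursively(sentence: str,
                      split_once: Callable[[str], List[str]],
                      threshold: int = THRESHOLD) -> List[str]:
    """Split a sentence until every part fits the threshold or cannot be split."""
    if len(sentence) <= threshold:
        return [sentence]
    parts = split_once(sentence)
    # Stop when no real progress is made, so that the recursion terminates.
    if len(parts) <= 1 or any(len(part) >= len(sentence) for part in parts):
        return [sentence]
    result: List[str] = []
    for part in parts:
        result.extend(split_recursively(part, split_once, threshold))
    return result
```
]]> The progress check means that a sentence which the splitter cannot shorten is returned unchanged even if it exceeds the threshold.</Paragraph>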
      <Paragraph position="6"> An example of paraphrase is shown in Figure 5.</Paragraph>
      <Paragraph position="7"> We believe that paraphrasing can not only improve the readability of patent claims but also work effectively as preprocessing for machine translation. In fact, several commercial machine translation software products perform special preprocessing for patent claims before translating from Japanese to English.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="5" end_page="5" type="metho">
    <SectionTitle>
4 Term Explanation for Patent Claims
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.1 Background and Motivation
</SectionTitle>
      <Paragraph position="0"> Once the structure of patent claims is analyzed and presented visually, the next hurdle for readability is terms.</Paragraph>
      <Paragraph position="1"> Many novel terms are used in patent claim descriptions. They can be classified into the following categories. Terms specific to the invention: Patent drafters often assign unique names to the invention, its elements, and its processes to identify them. Terms specific to the domain: Patent law requires that patents be written so that those who have ordinary knowledge in the domain can understand and carry out the invention. So, technical terms that are established in the domain are often used. Additionally, there is "patent jargon" created by combining two kanji characters, such as the compounds meaning "put and insert" and "put into the hall" (Kasai, 1999). Such terms were first created by some patent drafters for the sake of brevity and have since been widely used in the community. So, they are terms specific to the inventions of the domain. Those who do not have enough knowledge of the domain or who are not accustomed to reading patent specifications have difficulty understanding them.</Paragraph>
      <Paragraph position="2"> Giving appropriate explanations for these terms would help to improve readability of patent claims.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.2 Approach
</SectionTitle>
      <Paragraph position="0"> First of all, it is necessary to recognize the terms to be explained. There are many research issues in term extraction in general, but for our purpose we use the following morphological pattern to extract terms from patent claims:  By using the above pattern, we can extract terms that contain verbs, such as those meaning "method to blow heat wind", "read value", and "liquid drop".</Paragraph>
      <Paragraph position="1"> Second, by using the result of structure analysis, we can infer the categories of the terms as follows: * If the term appears at the end of the claim, or just before the JEPSON CUE in the Jepson-like style, or just before "と" (and) in the element enumeration style, it is a term specific to the invention. For example, the terms meaning "... generating device" and "a load detection method" in Figure 1 are terms specific to this invention.</Paragraph>
      <Paragraph position="2"> * If the term appears in the middle of the first half in the Jepson-like style, it can be a term specific to the domain. For example, the term meaning "an actuator" in Figure 1 is a technical term in the domain.</Paragraph>
      <Paragraph position="3"> * If the term consists of two kanji characters and is not listed in ordinary dictionaries, it can be patent jargon.</Paragraph>
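      <Paragraph> The three inference rules above can be sketched as follows. This is only an illustration: the position labels are hypothetical names for what the structure analysis would report, the order in which the rules are tried is an assumption, and the kanji test uses the CJK Unified Ideographs range U+4E00 to U+9FFF as an approximation. <![CDATA[
```python
import re
from typing import Set

# Two characters from the basic CJK Unified Ideographs block (approximation).
TWO_KANJI = re.compile(r"[\u4e00-\u9fff]{2}")

# Hypothetical position labels assumed to come from structure analysis.
INVENTION_POSITIONS = {"claim_end", "before_jepson_cue", "before_to"}
DOMAIN_POSITIONS = {"jepson_first_half_middle"}


def infer_category(term: str, position: str, dictionary: Set[str]) -> str:
    """Guess the category of an extracted term from its position and form."""
    if position in INVENTION_POSITIONS:
        return "invention-specific"
    if position in DOMAIN_POSITIONS:
        return "domain-specific"
    if TWO_KANJI.fullmatch(term) and term not in dictionary:
        return "patent-jargon"
    return "unknown"
```
]]> The returned label is only a first guess; the checks against the detailed description that follow are what would back it up.</Paragraph>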
      <Paragraph position="4"> Finally, by looking at the detailed description of the invention or related inventions, we can back up the above inference as follows: * The terms specific to the invention should be described after the &amp;quot;means to solve the problem&amp;quot; section in the detailed description of the invention.</Paragraph>
      <Paragraph position="5"> * The terms specific to the domain are widely used in the inventions of the domain. So, it is highly likely that they occur frequently in the related inventions. We can treat a collection of search results as the related inventions. * Some of the technical terms specific to the domain are described in the "prior art" section of the detailed description of the invention or of related inventions in the domain.</Paragraph>
      <Paragraph position="6"> For those technical terms specific to the domain, explanatory portions such as the following can be found:</Paragraph>
      <Paragraph position="7"> (... driving the oil pressure cylinder (or the actuator) at the speed of ...); (... the spout (or the orifice) ...); (... blowing out ink preliminarily (namely, purging ink) ...); (... ink of the hot-melt type (or solid ink) ...). As can be seen in the above, explanatory portions can be found by using cue phrases, including ones meaning "in the following" and "or" or "namely".</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.3 Sample Scenario
</SectionTitle>
      <Paragraph position="0"> From the patent claim in Figure 2, we find many terms that are candidates for explanation, such as the terms meaning "time measurement", "the method to measure time", "measurement result", "sticky ink", "removal of sticky ink", "removal processing of sticky ink", and "the method to remove sticky ink".</Paragraph>
      <Paragraph position="1"> Among the above terms, "the method to measure time" and "the method to remove sticky ink" are terms specific to the invention because they are judged to be elements by the structure analysis.</Paragraph>
      <Paragraph position="2"> By searching the detailed description, we can find the explanatory portion for "sticky ink" as follows.</Paragraph>
      <Paragraph position="3"> (... the ink of increased stickiness (in the following, we call it "sticky ink") ...)</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
4.4 Further Analysis and Experimentation
</SectionTitle>
      <Paragraph position="0"> We continue to analyze the NTCIR3 patent data collection, specifically the "Patolis Test Collection", which is a test collection for patent retrieval consisting of a set of queries and search results. We use each search result as "related inventions" and analyze them to collect cue phrases for finding explanatory portions for technical terms specific to the domain.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> NLP research on patent claims has already been reported in (Kameda, 1995). It is directed toward dependency analysis of patent claims. Although it was proposed to support "analytic reading" of patent claims, no evaluation on large-scale real patent data is reported. Our approach differs from (Kameda, 1995) in that the top-level structure is analyzed.</Paragraph>
    <Paragraph position="1"> In (Sheremetyeva and Nirenburg, 1996), research on a system for authoring patent claims using NLP and knowledge engineering techniques is reported.</Paragraph>
  </Section>
  <Section position="7" start_page="5" end_page="5" type="metho">
    <SectionTitle>
6 Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> We have presented a framework to represent the structure of patent claims and a method to analyze it automatically. The evaluation results suggest that our approach is robust and practical.</Paragraph>
    <Paragraph position="1"> We are currently investigating a method to clarify terms in patent claims and to find the explanatory portions in the detailed description part of the patent specifications.</Paragraph>
    <Paragraph position="2"> It is not only a step toward improving readability, but it can also lead to the more challenging task of automatic patent map generation (Study group on patent map, 1990).</Paragraph>
  </Section>
</Paper>