<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2153">
  <Title>AN EFFICIENT SYNTACTIC TAGGING TOOL FOR CORPORA</Title>
  <Section position="4" start_page="949" end_page="952" type="metho">
    <SectionTitle>
2 DESIGN OF CSTT
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="949" end_page="950" type="sub_section">
      <SectionTitle>
2.1 The context-dependent shift/reduce tagging mechanism
</SectionTitle>
      <Paragraph position="0"> The process of context-dependent tagging is as follows: when a sentence is input (the input string is its sequence of parts of speech), we look up the rule base with the top two elements of the stack to see whether there exist rules coinciding with the current context. If not, human operation is required to determine whether to reduce or shift. If reduce, the operator further decides what phrase structure will be constructed and what dependency relation will be constructed between these top two elements. The system records the current context and the operations to form a new rule, and puts it into the rule base.</Paragraph>
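As a rough illustration, the tagging process described above can be sketched in Python. The dictionary rule base, the exact context width, and the oracle function standing in for the human operator are our simplifications, not the paper's implementation:

```python
def tag_sentence(tags, rule_base, oracle):
    """Context-dependent shift/reduce tagging sketch.

    tags: list of part-of-speech tags for one sentence.
    rule_base: dict mapping a context to an action; grows during tagging.
    oracle: stand-in for the human operator, called when no rule matches.
    Returns the dependency triples (modifier, head, relation).
    """
    stack, buffer, triples = [], list(tags), []
    while buffer or len(stack) > 1:
        # context: top two stack elements plus a lookahead window
        context = (tuple(stack[-2:]), tuple(buffer[:5]))
        action = rule_base.get(context)
        if action is None:               # no rule coincides: ask the human
            action = oracle(context)
            rule_base[context] = action  # record the new rule
        if action == "s":                # shift
            stack.append(buffer.pop(0))
        else:                            # reduce action (z, r, h)
            z, rel, head = action
            y, x = stack.pop(), stack.pop()   # y was the top element
            mod, hd = (y, x) if head == "B" else (x, y)
            triples.append((mod, hd, rel))
            stack.append(z)              # phrase label replaces the pair
    return triples
```

A run over a two-tag sentence with an oracle that shifts until two elements are stacked, then reduces, yields one triple and three newly recorded rules.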
      <Paragraph position="1"> Formally, a context-dependent rule is represented as:</Paragraph>
      <Paragraph position="3"> where x, y are the top two elements in the stack, and α, β are the context on the left-hand side of x and the context on the right-hand side of y respectively. The context is represented as a sequence of parts of speech. There are two actions on the right-hand side of a rule: the shift action, denoted as s, and the reduce action, denoted as (z, r, h). For the reduce action, z denotes the phrase structure after reduction, r denotes the dependency relation between x and y, and h denotes which element is the head of the phrase structure and dependency relation: h='A' means the top element is the head, h='B' means that the second top element of the stack is the head. Now let us see the tagging process for a simple sentence: R VY R USDE A NG (where R: pronoun, VY: verb, USDE: structural particle, A: adjective, NG: general noun; NP: noun phrase, SS: sub-sentence, SP: sentence; SUB: subject, DEP: particle structure, ATTA: modifier, OBJ: object, MARK: punctuation mark, GOV: the predicate of the sentence). A dependency relation is represented as a triple of the form &lt;modifier, head, dependency relation&gt;, and the tagging result is represented as a set of such triples. At each step, we can obtain a rule by recording the content of the stack and input string, and the operation (shift or reduce) given by the user. If the operation is a reduction, the phrase structure and dependency relation are to be decided by the user. Here are two rules obtained:</Paragraph>
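A context-dependent rule of the above form can be modelled, for illustration only, as a small record; the field names and the example tag values are ours, not taken from the paper's rule base:

```python
from collections import namedtuple

# Rule: left context alpha, stack pair (x, y), right context beta -> action.
Rule = namedtuple("Rule", "alpha x y beta action")

# A shift rule: the action is simply the symbol 's'.
r1 = Rule(alpha=("R",), x="VY", y="R", beta=("USDE", "A", "NG"), action="s")

# A reduce rule: the action is a triple (z, r, h) -- resulting phrase z,
# dependency relation r, and head flag h ('A' = top element is the head).
r2 = Rule(alpha=(), x="A", y="NG", beta=(), action=("NP", "ATTA", "A"))

def is_reduce(rule):
    """True when the rule's action is a reduce triple rather than a shift."""
    return rule.action != "s"
```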
      <Paragraph position="5"> After the reduction, the phrase structure formed replaces the top two elements in the stack, and the head will represent this phrase in later processing. Since sentences vary in length, we use the three elements on the left side of the top two elements in the stack and the top five elements in the input string as the context.</Paragraph>
      <Paragraph position="6"> The input is the sequence of parts of speech of a sentence, and the output is the dependency tree denoted as a set of triples of the form (modifier, head, dependency relation); as a by-product, context-dependent rules are acquired. Obviously, we could work out the phrase structure tree as well by modifying the algorithm (not detailed in this paper).</Paragraph>
      <Paragraph position="7"> Let CDG be the context-dependent rule base acquired so far; CDG is empty if the system has just been put into use. NUMBER-OF-ACTIONS records the number of total actions (either shift or reduce) during tagging; NUMBER-OF-AUTOMATION is the number of actions (given by the system itself) which are confirmed to be right by the human. The automatic tagging ratio is therefore defined as NUMBER-OF-AUTOMATION / NUMBER-OF-ACTIONS.</Paragraph>
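The ratio just defined is a single division; the zero-action guard in this small sketch is our addition, not part of the paper:

```python
def automatic_tagging_ratio(n_automation, n_actions):
    """NUMBER-OF-AUTOMATION / NUMBER-OF-ACTIONS, guarding the empty case."""
    return n_automation / n_actions if n_actions else 0.0
```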
      <Paragraph position="8"> At present, the system is under supervision: human intervention is applied at each step, either to confirm the actions given by the system or to append new actions. Ideally, the tagging process should be nearly fully automatic with minimum human intervention, but this is a long-term process. We believe that as the size of the tagged corpora increases, the automatic tagging ratio will improve, and when it reaches a degree high</Paragraph>
    </Section>
    <Section position="2" start_page="950" end_page="952" type="sub_section">
      <SectionTitle>
2.2 The tagging algorithm
</SectionTitle>
      <Paragraph position="0"> enough, human intervention may be removed, or may only be needed in the case that no rule is matched.</Paragraph>
      <Paragraph position="2"> first five elements of the stack and the input string respectively. If there are fewer than five elements in the stack or in the input string, the list is filled with blanks. APPEND merges the two lists to obtain the current context.</Paragraph>
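The fixed-width context construction described here might be sketched as follows; the BLANK sentinel and the function names mirror the description but are our assumptions:

```python
BLANK = ""  # assumed sentinel for the blank filler positions

def first_five(seq):
    """Take the first five elements, padding with blanks when fewer exist."""
    items = list(seq)[:5]
    return items + [BLANK] * (5 - len(items))

def append_context(stack_part, input_part):
    """APPEND: merge the two five-element lists into one 10-slot context."""
    return first_five(stack_part) + first_five(input_part)
```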
      <Paragraph position="3"> CONSULT-TO-CDG looks up the rule base and returns a list of rules matching with the current context.</Paragraph>
      <Paragraph position="4"> The list is empty when no rule is matched. If the list is not empty, rules are sorted in descending order of their usage frequency. If human intervention is turned off by default (this may be appropriate once the automatic tagging ratio reaches some high degree), the system will take an action according to the rule with the highest frequency.</Paragraph>
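A possible sketch of CONSULT-TO-CDG, assuming each rule is stored as a (context, action, frequency) entry; this flat representation is chosen for illustration only:

```python
def consult_to_cdg(rules, context):
    """Return the rules matching the current context, sorted in
    descending order of usage frequency, so that an unattended system
    can simply take the first entry."""
    matched = [r for r in rules if r[0] == context]
    return sorted(matched, key=lambda r: r[2], reverse=True)
```

With two rules sharing a context, the more frequently used one comes first, ready to be applied automatically.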
      <Paragraph position="5"> CONSULT-TO-HUMAN returns only one rule, by the human's inspection. In this interactive process, the human is asked to determine what action should be taken. He first inspects the rule list to see if there is already a rule correctly conforming with the current context; if not, he tells the system whether to "shift" or "reduce"; if "reduce", he is requested to tell the system what phrase structure and what dependency relation are to be built, and which element, the top element of the stack or the second, is the head. A new rule is acquired, by recording the current context and the operation, whenever the human makes an operation different from the existing rules.</Paragraph>
      <Paragraph position="6"> NUMBER-OF-AUTOMATION records the number of times that the rule with the highest frequency coincides with the human's decision, which means that if the system were working automatically, the rule with the highest frequency would be right. NUMBER-OF-ACTIONS records the total number of operations (shift or reduce) during tagging. The function HEAD returns the head word of a phrase. The function PUSH pushes an element onto the stack, and POP pops the top element off the stack; FIRST and SECOND return the first and second elements of a list respectively. In the matching process, the weighted matching approach (Simmons &amp; Yu, 1992) is used. Assume the set of CDG rules is R = {R1, R2, ..., Rm}, where the left-hand side of each rule is Ri = {ri1, ri2, ..., ri10}, and assume the context of the top two elements of the stack is C = {c1, c2, ..., c10}, where c4 and c5 are the top two elements in the stack. We set up a match function: μ(cj, rij) = 1 if cj = rij, and μ(cj, rij) = 0 if cj ≠ rij. The score function is</Paragraph>
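Since the score formula itself is not reproduced in this extract, the following sketch assumes SCORE is a weighted sum of the match function μ over the ten context positions; the uniform default weights are our assumption, not the paper's weighting:

```python
def mu(c, r):
    """Match function from the text: 1 if the context element equals the
    rule element, 0 otherwise."""
    return 1 if c == r else 0

def score(context, rule_left, weights=None):
    """SCORE sketch: a (possibly weighted) sum of mu over the positions
    of the context and the rule's left-hand side."""
    if weights is None:
        weights = [1] * len(rule_left)  # assumed uniform weighting
    return sum(w * mu(c, r) for w, c, r in zip(weights, context, rule_left))
```

Under full (uniform) matching, the score equals the number of positions where context and rule agree.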
      <Paragraph position="8"> some cases. The CDG base is controlled dynamically so as to keep the matching efficient: a rule is removed from the CDG base if it is seldom used.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="952" end_page="953" type="metho">
    <SectionTitle>
3 EXPERIMENT AND ANALYSIS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="952" end_page="953" type="sub_section">
      <SectionTitle>
3.1 The experiment
</SectionTitle>
      <Paragraph position="0"> A small corpus of 1300 sentences about daily life was prepared for the experiment, with an average length of 20 Chinese characters per sentence; the corpus covers the main classes of Chinese simple declarative sentences. The experiment is conducted in the following steps: (1) input a sentence; (2) word segmentation; (3) part-of-speech tagging.</Paragraph>
      <Paragraph position="1">  The tagging model is a bi-gram modcl(Bai &amp; Xia, 1991), and the correct ratio is about 94% , so human confirmation is needed.</Paragraph>
      <Paragraph position="2"> (4) tagging the dependency relation by CSTT. A rule is preferred if and only if its SCORE is greater than a threshold ξ set in advance; ξ = 21 means full matching. When the system is first put into use, full matching is recommended in order to reduce conflicts; after a certain period of tagging, we may set the threshold smaller than 21 to overcome the shortage of rules in some cases. As shown in Table 3, 1455 rules were obtained from the first 300 sentences. In the whole experiment, 6521 rules were obtained in total. The more sentences are tagged, the higher the automatic tagging ratio may be. After 1200 sentences had been tagged, the ratio of automatic operation was above 50%.</Paragraph>
    </Section>
    <Section position="2" start_page="953" end_page="953" type="sub_section">
      <SectionTitle>
3.2 Discussion
</SectionTitle>
      <Paragraph position="0"> (1) The rule conflict. Although this system has some power of disambiguation due to the context-dependent rules, it is difficult to resolve some ambiguities. It is therefore easy to understand that a conflict will occur when such an ambiguity is encountered. For example, the sequence VG A NG may be analysed as {(A, VG, COMPLEMENT), (NG, VG, OBJ)} or as {(A, NG, ATTA), (NG, VG, OBJ)}, and the sequence NG1 NG2 may be {(NG2, NG1, COORDINATE)} or {(NG1, NG2, ATTA)}, as the following two pairs of sentences demonstrate. There are two kinds of ambiguity: one is context-dependent ambiguity, the other is context-independent ambiguity. For the former, CSTT can resolve some cases. For example, the sequence VG NG1 USDE NG2 is an ambiguous phrase, which may be {(VG, nil, GOV), (NG1, USDE, DEP), (USDE, NG2, ATTA), (NG2, VG, OBJ)}, meaning "killed the hunter's dog", or {(VG, USDE, DEP), (NG1, VG, OBJ), (USDE, NG2, ATTA), (NG2, nil, GOV)}, meaning "the dog which killed the hunter". However, if the context is considered, the ambiguity may be resolved:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="953" end_page="953" type="metho">
    <SectionTitle>
VG NG USDE NG VG Y
</SectionTitle>
    <Paragraph position="0"> M Q VG NG USDE NG Unfortunately, CSTT cannot resolve the ambiguity of the latter kind; human intervention is necessary. (2) The convergence of the CDG rules. According to the analysis of (Simmons &amp; Yu, 1992), 25,000 CDG rules will be sufficient to cover 99% of the phenomena of common English sentences. In this sense, the CDG rule base is convergent. If we aim only at syntactic tagging, the convergence issue can be put aside temporarily: once the automatic ratio reaches 80% or more, we can stop acquisition, and at that point the tagging can already provide a lot of help to the users. Of course, with some effective improvements, CSTT may be developed into an efficient dependency parser as well.</Paragraph>
  </Section>
</Paper>