File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/p94-1032_metho.xml
Size: 15,443 bytes
Last Modified: 2025-10-06 14:13:55
<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1032"> <Title>Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation</Title> <Section position="5" start_page="234" end_page="235" type="metho"> <SectionTitle> 3. Language Model </SectionTitle> <Paragraph position="0"> Parsing can be viewed as optimization. Given an n-word sentence w_1, w_2, ..., w_n (including punctuation marks), the parsing task is to find a parse tree T such that P(T | w_1, w_2, ..., w_n) is maximal. We define T here to be a sequence of chunks c_1, c_2, ..., c_m, where each c_i (0 &lt; i ≤ m) contains one or more words w_j (0 &lt; j ≤ n). For example, the sentence &quot;parsing can be viewed as optimization.&quot; consists of 7 words. One possible parse under our definition is: (2) [parsing] [can be viewed] [as optimization] [.], where the four bracketed chunks are c_1, c_2, c_3 and c_4 respectively. The parsing task is now to find the best chunk sequence C* such that (3) C* = argmax_{C_i} P(C_i | w_1, w_2, ..., w_n). Here C_i is one possible chunk sequence c_1, c_2, ..., c_{m_i}, where m_i is the number of chunks in that sequence. Chunking raw text without any other information is very difficult, since there are many millions of possible word patterns. Therefore, we apply a tagger to preprocess the raw text and give each word a unique part of speech. That is, for an n-word sentence w_1, w_2, ..., w_n (including punctuation marks), we assign parts of speech t_1, t_2, ..., t_n to the respective words. The actual working model is then:</Paragraph> <Paragraph position="2"> Using a bigram language model, we then reduce P(C_i | t_1,</Paragraph> <Paragraph position="4"> where P_i(·) denotes the probability for the i-th chunk sequence and c_0 denotes the beginning mark of a sentence. 
Following (5), formula (4) becomes</Paragraph> <Paragraph position="6"> In order to make expression (6) match human intuition, namely that 1) the scoring metrics are all positive, 2) a larger value means a higher score, and 3) the scores lie between 0 and 1, we define a score function</Paragraph> <Paragraph position="8"> The final language model is to find a chunk sequence C* that satisfies expression (8).</Paragraph> <Paragraph position="9"> The dynamic programming procedure shown in (9) is used to find the best chunk sequence. score[i] denotes the score of position i. The words between position pre[i] and position i form the best chunk from the viewpoint of position i. dscore(c_i) is the score of the probability P(c_i), and cscore(c_i | c_{i-1}) is the score of the probability P(c_i | c_{i-1}). These scores are collected from the training corpus, the SUSANNE Corpus (Sampson, 1993; Sampson, 1994). The details are given in Section 5.</Paragraph> <Paragraph position="10"> (9) Algorithm. Input: word sequence w_1, w_2, ..., w_n and the corresponding POS sequence t_1, t_2, ..., t_n. Output: a sequence of chunks c_1, c_2, ..., c_m. 1. score[0] = 0; pre[0] = 0; 2. for (i = 1; i &lt; n+1; i++) do 3 and 4; 3. j* = argmax (score[pre[j]] + dscore(c_j) + cscore(c_j | c_{j-1}));</Paragraph> <Paragraph position="12"> output the words w_{pre[i]+1}, ..., w_i to form a chunk;</Paragraph> </Section> <Section position="6" start_page="235" end_page="236" type="metho"> <SectionTitle> 4. Linguistic Knowledge </SectionTitle> <Paragraph position="0"> In order to assign a head to each chunk, we first define priorities among parts of speech (POSes). X'-theory (Sells, 1985) defines the X'-equivalences shown in Table 1.</Paragraph> <Paragraph position="2"> Table 1 defines five different phrasal structures and their hierarchical structures. 
The heads of these phrasal structures are the first level of the X'-equivalences, that is, X.</Paragraph> <Paragraph position="3"> The other grammatical constituents function as specifiers or modifiers; that is, they are accompanying words rather than core words. Following this line, we define the primary priority of the POSes listed in Table 1.</Paragraph> <Paragraph position="4"> (10) Primary POS priority (footnote 1): V > N > A > P. In order to extract the exact head, we further define a secondary POS priority among the 134 POSes defined in the LOB Corpus (Johansson, 1986).</Paragraph> <Paragraph position="5"> (11) Secondary POS priority is a linear precedence relationship within the primary priorities for coarse POSes. (Footnote 1: We do not consider INFL, since our model does not touch on this structure.)</Paragraph> <Paragraph position="6"> For example, the LOB Corpus defines four kinds of verbal words under the coarse POS V: VB*, DO*, BE* and HV* (footnote 2). The secondary priority within the coarse POS V is: (12) VB* > HV* > DO* > BE*. Furthermore, we define the semantic head and the syntactic head (Abney, 1991).</Paragraph> <Paragraph position="7"> (13) The semantic head is the head of a phrase according to semantic usage, whereas the syntactic head is the head based on grammatical relations.</Paragraph> <Paragraph position="8"> Both the syntactic head and the semantic head are useful in extracting noun phrases. For example, if the semantic head of a chunk is a noun and the syntactic head is a preposition, the chunk is a prepositional phrase. It can therefore be connected to the previous noun chunk to form a new noun phrase. In some cases, a chunk contains only one word; we call these one-word chunks. They may contain a conjunction, e.g., that. Therefore, 
the syntactic head and the semantic head of a one-word chunk are the word itself.</Paragraph> <Paragraph position="9"> Following these definitions, we extract the noun phrases by procedure (14): (14) (a) Tag the input sentences.</Paragraph> <Paragraph position="10"> (b) Partition the tagged sentences into chunks by using a probabilistic partial parser.</Paragraph> <Paragraph position="11"> (c) Decide the syntactic head and the semantic head of each chunk.</Paragraph> <Paragraph position="12"> (d) According to the syntactic and the semantic heads, extract noun phrases from these chunks and connect as many noun phrases as possible by a finite state mechanism.</Paragraph> <Paragraph position="13"> [Figure 1: pipeline from raw text to tagged text, chunked text and NPs, with a TAG-MAPPER component.] The input texts are assigned a POS for each word and then pipelined into a chunker. (Footnote 2: The asterisk * denotes a wildcard. Thus VB* represents VB (verb, base form), VBD (verb, preterite), VBG (present participle), VBN (past participle) and VBZ (third-person singular form of verb).) The tag sets of LOB and SUSANNE are different. Since the tag set of the SUSANNE Corpus is subsumed by the tag set of the LOB Corpus, a TAG-MAPPER is used to map tags of the SUSANNE Corpus to those of the LOB Corpus. The chunker outputs a sequence of chunks. Finally, a finite state NP-TRACTOR extracts the NPs. Figure 2 shows the finite state mechanism used in our work.</Paragraph> <Paragraph position="14"> The symbols in Figure 2 are tags of the LOB Corpus. N* denotes nouns; P* denotes pronouns; J* denotes adjectives; A* denotes quantifiers, qualifiers and determiners; IN denotes prepositions; CD* denotes cardinals; OD* denotes ordinals; and NR* denotes adverbial nouns. The asterisk * denotes a wildcard. For convenience, some constraints, such as syntactic and semantic head checking, are not shown in Figure 2.</Paragraph> </Section> <Section position="7" start_page="236" end_page="238" type="metho"> <SectionTitle> 5. 
First Stage of Experiments </SectionTitle> <Paragraph position="0"> Following the procedure depicted in Figure 1, we first train a chunker. This is done by using the SUSANNE Corpus (Sampson, 1993; Sampson, 1994) as the training text. The SUSANNE Corpus is a modified and condensed version of the Brown Corpus (Francis and Kucera, 1979). It contains only one tenth of the Brown Corpus but carries more information than the Brown Corpus. The Corpus consists of four kinds of texts: 1) A: press reportage; 2) G: belles lettres, biography, memoirs; 3) J: learned writing; and 4) N: adventure and Western fiction. The categories A, G, J and N are named after the respective categories of the Brown Corpus. Each category consists of 16 files and each file contains about 2000 words.</Paragraph> <Paragraph position="1"> The following shows a snapshot of the SUSANNE Corpus.</Paragraph> <Paragraph position="2"> In order to avoid errors introduced by the tagger, the SUSANNE Corpus is used as both the training and the testing text. Note that the tags of the SUSANNE Corpus are mapped to those of the LOB Corpus. Three quarters of the texts in each category of the SUSANNE Corpus are used both for training and for testing the chunker (inside test). The remaining texts are used only for testing (outside test). Every tree structure contained in the parse field is extracted to form a potential chunk grammar, and adjacent tree structures are also extracted to form a potential context chunk grammar.</Paragraph> <Paragraph position="3"> After the training process, a total of 10937 chunk grammar rules associated with different scores and 37198 context chunk grammar rules are extracted. These chunk grammar rules are used in the chunking process.</Paragraph> <Paragraph position="4"> Table 3 lists the time taken to process the SUSANNE Corpus. This experiment was executed on a Sun Sparc 10, model 30 workstation. T denotes time, W word, C chunk, and S sentence. 
Therefore, T/W means the average time taken to process a word.</Paragraph> <Paragraph position="5"> seconds on average. Processing the whole SUSANNE Corpus takes about 436 seconds, or 7.27 minutes.</Paragraph> <Paragraph position="6"> In order to evaluate the performance of our chunker, we compare its results with the annotation in the SUSANNE Corpus. This comparison is based on the following criterion: This criterion rests on the observation that each non-terminal node has a chance to dominate a chunk.</Paragraph> <Paragraph position="7"> Table 4 gives the experimental results of testing on the SUSANNE Corpus according to the specified criterion. As before, the symbol C denotes chunk and S denotes sentence.</Paragraph> <Paragraph position="8"> Table 4 shows that the chunker achieves a chunk correct rate of more than 98% and a sentence correct rate of 94% in the outside test, and a chunk correct rate of 99% and a sentence correct rate of 97% in the inside test. Note that once a chunk is mischopped, the sentence is also mischopped. Therefore, the sentence correct rate is always lower than the chunk correct rate. Figure 3 gives a direct view of the correct rate of this chunker. We employ the SUSANNE Corpus as the test corpus. Since the SUSANNE Corpus is a parsed corpus, we may use it as the criterion for evaluation. The volume of test text is around 150,000 words including punctuation marks.</Paragraph> <Paragraph position="9"> The time needed from inputting the texts of the SUSANNE Corpus to outputting the extracted noun phrases is listed in Table 5. Compared with Table 3, the time spent combining chunks to form the candidate noun phrases is not significant.</Paragraph> <Paragraph position="10"> The evaluation is based on two metrics: precision and recall. Precision is the proportion of what the system extracts that is correct. Recall is the proportion of the real noun phrases contained in the texts that the system retrieves. 
Table 6 describes how to calculate these metrics.</Paragraph> <Paragraph position="12"> [Table 6: contingency table with &quot;System&quot; rows (NP, non-NP) and &quot;SUSANNE&quot; columns (NP, non-NP); the surviving cell labels are a and b.] The rows of &quot;System&quot; indicate whether our NP-TRACTOR takes the candidate as an NP or not an NP; the columns of &quot;SUSANNE&quot; indicate whether the SUSANNE Corpus takes the candidate as an NP or not an NP. Following Table 6, we calculate precision and recall as shown in (16).</Paragraph> <Paragraph position="14"> Calculating precision and recall based on the parse field of the SUSANNE Corpus is not so straightforward at first glance. For example, (17) is itself a noun phrase, but it contains four other noun phrases. A tool for extracting noun phrases must decide what kind of noun phrases, and how many, to output when it processes texts like (17). Three kinds of noun phrases (maximal noun phrases, minimal noun phrases and ordinary noun phrases) are defined first. Maximal noun phrases are those noun phrases which are not contained in other noun phrases. In contrast, minimal noun phrases do not contain any other noun phrases.</Paragraph> <Paragraph position="15"> Clearly, a noun phrase may be both a maximal noun phrase and a minimal noun phrase. Ordinary noun phrases are noun phrases with no restrictions. Take (17) as an example. It has three minimal noun phrases, one maximal noun phrase and five ordinary noun phrases.</Paragraph> <Paragraph position="16"> In general, a noun-phrase extractor forms the front end of other applications, e.g., acquisition of verb subcategorization frames. Under this consideration, it is not appropriate to take (17) as a whole to form a noun phrase. Our system will extract two noun phrases from (17): 
&quot;a black badge of frayed respectability&quot; and &quot;his neck&quot;.</Paragraph> <Paragraph position="17"> (17) [[[a black badge] of [frayed respectability]] that ought never to have left [his neck]] We count the maximal noun phrases, minimal noun phrases and ordinary noun phrases annotated in the SUSANNE Corpus, respectively, and compare these numbers with the number of noun phrases extracted by our system.</Paragraph> <Paragraph position="18"> Table 7 lists the number of ordinary noun phrases (NP), maximal noun phrases (MNP) and minimal noun phrases (mNP) in the SUSANNE Corpus. MmNP denotes the maximal noun phrases which are also minimal noun phrases. On average, a maximal noun phrase subsumes 1.61 ordinary noun phrases and 1.09 minimal noun phrases.</Paragraph> <Paragraph position="19"> To calculate the precision, we examine the extracted noun phrases (ENP) and judge their correctness against the SUSANNE Corpus. CNP denotes the correct ordinary noun phrases, CMNP the correct maximal noun phrases, CmNP the correct minimal noun phrases, and CMmNP the correct maximal noun phrases which are also minimal noun phrases. The results are itemized in Table 8. The average precision is 95%.</Paragraph> <Paragraph position="20"> Here, the computation of recall is ambiguous to some extent. Comparing columns CMNP and CmNP in Table 8 with columns MNP and mNP in Table 7, 70% of the MNPs and 72% of the mNPs in the SUSANNE Corpus are extracted. In addition, 95% of the MmNPs are extracted by our system. This means that the recall for extracting noun phrases that exist independently in the SUSANNE Corpus is 95%. Which types of noun phrases should be extracted depends heavily on the intended application. We will discuss this point in Section 7. Therefore, the real number of applicable noun phrases in the Corpus is not known.</Paragraph> <Paragraph position="21"> The number should lie between the number of NPs and that of MNPs. 
In the original design of NP-TRACTOR, a maximal noun phrase which contains clauses or prepositional phrases with prepositions other than &quot;of&quot; is not considered an extracted unit. As a result, the number of such applicable noun phrases (ANPs) forms the basis for calculating recall. These numbers are listed in Table 9, together with the corresponding recalls.</Paragraph> <Paragraph position="22"> The automatic validation of the experimental results gives us an estimated recall. The Appendix provides a sample text and the extracted noun phrases. Interested readers can examine the sample text and calculate recall and precision for comparison.</Paragraph> </Section></Paper>