<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1814"> <Title>Extracting Pronunciation-translated Names from Chinese Texts using Bootstrapping Approach</Title> <Section position="3" start_page="1" end_page="11" type="metho"> <SectionTitle> 2 Bootstrapping Algorithm for Locating P-Names </SectionTitle> <Paragraph position="0"> Currently, there is no standard corpus that annotates all P-Names. Since annotating thousands of P-Names is more difficult than collecting thousands of P-Names from the Internet, we resort to using an Internet search engine to collect a large set of P-Names. Figure 1 illustrates the main components of our bootstrapping process.</Paragraph> <Paragraph position="1"> consisting of 60 context words, such as {&quot;NULL&quot;, &quot;g16772g13785&quot;, &quot;g16840&quot;, &quot;g6466&quot;, &quot;g3324&quot;, &quot;g2052&quot;}. These are typical context words found around person, location and organization names in as query to retrieve relevant web pages from the Internet using a commercial search engine. We then extract possible P-Names from the returned web pages to update P. We then perform lexical chaining to generalize these context words and generate semantic classes.</Paragraph> <Paragraph position="2"> d) We repeat the process from step (a) until any of the following conditions is satisfied: (i) no new P-Name is found; (ii) the desired number of iterations is reached; or (iii) the number of P-Names found exceeds a desired number.</Paragraph> <Paragraph position="3"> The following subsections discuss the details of the bootstrapping process.</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.1 Querying and Extracting the P-Names from the Web </SectionTitle> <Paragraph position="0"> The first step of the algorithm is to derive good queries from the character set C</Paragraph> <Paragraph position="2"> to search the Internet to obtain new web pages. 
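As a minimal sketch of the stopping rule in step (d), the loop below runs one bootstrapping iteration at a time and stops on any of the three conditions. The names `bootstrap`, `run_iteration`, `max_iterations` and `max_names` are hypothetical and are not part of the original system; `run_iteration` merely stands in for steps (a) to (c).

```python
from typing import Callable, Set


def bootstrap(
    run_iteration: Callable[[Set[str]], Set[str]],
    seed_names: Set[str],
    max_iterations: int,
    max_names: int,
) -> Set[str]:
    """Repeat iterations until one of the three stopping conditions holds.

    run_iteration is a hypothetical stand-in for steps (a) to (c): it takes
    the P-Names known so far and returns the candidates found this round.
    """
    names = set(seed_names)
    for _ in range(max_iterations):          # (ii) iteration limit reached
        new_names = run_iteration(names) - names
        if not new_names:                    # (i) no new P-Name found
            break
        names |= new_names
        if len(names) > max_names:           # (iii) enough P-Names collected
            break
    return names
```

Passing the iteration step in as a callable keeps the stopping logic separate from the querying and extraction details discussed in the following subsections.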
If we use all single</Paragraph> <Paragraph position="4"> to perform the search, we are likely to get too many pages containing irrelevant information. As a compromise, we use every two</Paragraph> <Paragraph position="6"> (except those combinations that have been used in the previous iterations) to search the Internet using Google (by using its language tool). We consider at most 300 entries returned by Google. We divide the content of the web pages into text segments, using non-alphanumeric characters as delimiters. We extract those text segments that contain the search</Paragraph> <Paragraph position="8"> or both and store them in R^(m). For example, from the web page given in Figure 2, the text segments extracted include: strings &quot;g3534g7043g8943g3839g7043 g3534&quot; and &quot;g3534g7043g8943g3839g7043g3534g11352g5445g10267g15999g16760g1038&quot; from the first entry; and strings &quot;g3624g12197g3839g7043g3534&quot;, &quot;g1332g7171g2554g11705g17959g5415g3624g12197 g3839g7043g3534g12461g12447g3324g15915g3848g5067g4626g11352g7114g1517&quot; and &quot;g16843g16772g1315g3624g12197g3839 g7043g3534g2555&quot; from the second entry.</Paragraph> <Paragraph position="9"> Given R^(m), we next extract the possible P-Names.</Paragraph> <Paragraph position="10"> First, we segment the entries in R^(m) by performing the longest forward match using the common word dictionary. We then remove all</Paragraph> <Paragraph position="12"> candidates by string matching. We store the new P-Name candidates found in P</Paragraph> <Paragraph position="14"> For example, if we use &quot;g7043, g3534&quot; obtained from C Here the bracketed words are common words or English letters, which are excluded from string matching. 
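The segment extraction and longest-forward-match steps described earlier in this subsection can be sketched roughly as follows. This is an illustration only: `extract_segments` and `forward_max_match` are hypothetical names, the `max_len` bound is an assumption, and Python's Unicode-aware `\w` class is assumed to be an acceptable stand-in for "alphanumeric" (it treats Chinese characters as word characters).

```python
import re
from typing import List, Set


def extract_segments(page_text: str, query_chars: Set[str]) -> List[str]:
    """Split on non-alphanumeric characters and keep the segments that
    contain at least one of the query characters."""
    segments = re.split(r"[\W_]+", page_text)
    return [s for s in segments if s and any(c in s for c in query_chars)]


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Longest forward match: at each position take the longest dictionary
    word; characters not covered by any word are emitted one at a time."""
    tokens = []
    i = 0
    while i != len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word fits.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

In this sketch, the tokens that are dictionary words would be the common words to discard, and the remaining character runs would be what is passed on to string matching.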
The sub-string &quot;g5064g4584g14305g3839g7043g3534&quot; appears 5 times and is matched as a possible P-Name.</Paragraph> </Section> <Section position="2" start_page="2" end_page="11" type="sub_section"> <SectionTitle> 2.2 Deriving New P-Name Characters </SectionTitle> <Paragraph position="0"> Given the set of P-Name candidates in P^(m), we next use both the context words and corpus statistics to confirm the P-Name and extract new P-Name characters.</Paragraph> <Paragraph position="1"> From observation, context information is useful for confirming a P-Name and determining its left and right boundaries. Thus we use one-word context to confirm and classify P-Name candidates into person names or parts of location or organization names. For each context word w_c in CW^(m), we first compute its probability vector of occurrences, PV(w_c), around person, location and organization names in PKUC as follows:</Paragraph> <Paragraph position="3"> ) respectively give the number of times w_c appears at the left (or right) boundary of person (p), location (l) and organization (o) names in PKUC. c</Paragraph> <Paragraph position="5"> ) gives the probability that the P-Name is of type x, if this cue-word is at the left (or right) boundary of the P-Name.</Paragraph> <Paragraph position="6"> Given a P-Name candidate p</Paragraph> <Paragraph position="8"> the set of its left and right context words as W</Paragraph> <Paragraph position="10"> >, and use these to compute the confidence vector of p</Paragraph> <Paragraph position="12"> . Here we simply average the probabilities of the left and right context words to derive the final probability vector. We assign p</Paragraph> <Paragraph position="14"> to be part of a named entity of type x, if c</Paragraph> <Paragraph position="16"> for x ∈ {p, l, o}. Here we set t_p to 0.8. 
If more than one value is greater than</Paragraph> <Paragraph position="18"> , we select the one with the highest value in the type vector as the type of that P-Name.</Paragraph> <Paragraph position="19"> We next derive an objective measure to evaluate how likely a candidate in P^(m) is to be a P-Name.</Paragraph> <Paragraph position="20"> We observe that a string is likely to be a P-Name if: (a) it contains some sub-strings that frequently appear in typical P-Names such as &quot;g19475g4584&quot;, &quot;g7043g3534&quot;, &quot;g8943g3839&quot;, etc.; and (b) it has context words in the CW^(m-1) set that indicate a high probability of its being part of a named entity. Thus for each P-Name</Paragraph> <Paragraph position="22"> respectively number of times the character strings</Paragraph> <Paragraph position="24"> . b and o are predefined constants (here we use b = 0.5 and o = 1.5). Equation (3.1) gives higher weight to p in the current iteration, before fusing the two values in Equation (3).</Paragraph> <Paragraph position="25"> Since we want to obtain more new P-Names during bootstrapping, we expand the P-Name character set in each iteration. In order to select the most likely P-Name characters, we derive a quasi-probability, Conf(c_i^(m)), to estimate how likely a character c_i^(m) in the P-Name candidate set P^(m) is to be used as a P-Name character. To do this, we make use of both the PKUC corpus and P^(m). We observe that most characters in P^(m) also appear in the PKUC corpus, sometimes as P-Name characters and sometimes as common characters. Thus, intuitively we estimate</Paragraph> <Paragraph position="27"/> <Paragraph position="29"> single-character word in PKUC. 
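The type-assignment rule described earlier in this subsection (average the left and right context vectors, then take the highest-valued type above the threshold) can be illustrated with a small sketch. The function name `assign_type` and the dictionary representation of the probability vectors are assumptions made for illustration, as is the use of a strict "greater than" comparison; only the averaging and the threshold value 0.8 come from the text.

```python
from typing import Dict, Optional

TYPES = ("p", "l", "o")  # person, location, organization


def assign_type(
    left: Dict[str, float],
    right: Dict[str, float],
    t_p: float = 0.8,
) -> Optional[str]:
    """Average the left and right context probabilities per type and
    return the best type whose averaged value is greater than t_p."""
    averaged = {x: (left[x] + right[x]) / 2.0 for x in TYPES}
    # Taking the maximum also resolves the case of several values above t_p.
    best = max(TYPES, key=lambda x: averaged[x])
    return best if averaged[best] > t_p else None
```

Returning `None` when no type clears the threshold is a design choice of this sketch; it simply marks the candidate as unconfirmed.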
Equation (4) aims to identify characters that appear frequently as part of P-Names, but rarely as part of common words.</Paragraph> <Paragraph position="30"> It also favors characters that appear in more probable P-Names through the s(p</Paragraph> <Paragraph position="32"> ) measures.</Paragraph> <Paragraph position="33"> Although Equation (4) is effective in identifying individual P-Name characters, it is not good at locating the sequences of P-Name characters that form the P-Names. This is because many characters that are part of a P-Name have low Conf(c</Paragraph> <Paragraph position="35"> ) values. For example, in a P-Name &quot;g8886g2027g1474 g1823&quot;, the character &quot;g1474&quot; has a low confidence of being a P-Name character as defined by Equation (4).</Paragraph> <Paragraph position="36"> However, it co-occurs with high-confidence P-Name characters such as &quot;g2027&quot; and &quot;g1823&quot;. To overcome this problem, we modify the confidence value of each character by considering its neighbors (context) to derive a smoothed confidence measure in Equation</Paragraph> <Paragraph position="38"> is respectively the co-occurrence of characters c</Paragraph> <Paragraph position="40"> in the P-Name set. Equation (5) tries to supplement the confidence of c_i with its context; that is, it uses the higher of the bi-gram statistics with its preceding and succeeding word to enhance its confidence. We rank all the characters in P</Paragraph> </Section> <Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 2.3 Deriving New Context Words </SectionTitle> <Paragraph position="0"> In addition to finding new P-Name characters, there , we regard it as part of a named entity. 
For these P-Names, which could be part of named entities, the following steps are performed: (for x ∈ {p, l, o}) if their probabilities under that category are greater than a threshold t_g (say, 0.5).</Paragraph> <Paragraph position="1"> e) We then perform lexical chaining using HowNet to generalize the context words under each of the 6 categories separately. The general lexical chaining algorithm is given in detail in Chua and Liu (2002).</Paragraph> <Paragraph position="2"> related words are grouped together. We update the confidence vectors of the semantic groups by averaging the confidence values of the words in each semantic group.</Paragraph> <Paragraph position="3"> At the end of this process, we obtain a new set of a new document, we want to use these resources to identify all P-Names. The process is carried out as follows: a) We first use our common word dictionary to remove all common words.</Paragraph> <Paragraph position="4"> b) Next we use knowledge of P-Name candidates and corpus statistics to identify a sequence of likely P-Name characters. Any sub-string in which the Sconf(c_i) (see Equation (5)) of each consecutive character is greater than a pre-specified threshold t_c (we use 5) is considered a P-Name.</Paragraph> <Paragraph position="5"> c) A frequent problem during testing is how to handle new characters not found in set, whose confidence values are therefore unknown. This problem occurs because the same foreign name may be translated into different P-Names with similar Chinese PinYin. For these characters, we adopt a similar homophone approach to relate unknown characters to the known characters in C</Paragraph> </Section> </Section> </Paper>