File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/x96-1051_metho.xml
Size: 8,700 bytes
Last Modified: 2025-10-06 14:14:27
<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1051"> <Title>AN INTERPRETATIVE DATA ANALYSIS OF CHINESE NAMED ENTITY SUBTYPES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AN INTERPRETATIVE DATA ANALYSIS OF CHINESE NAMED ENTITY SUBTYPES </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. MOTIVATIONS FOR AN INTERPRE- TATIVE DATA ANALYSIS </SectionTitle> <Paragraph position="0"> &quot;In assessing the performance of information extraction systems, we are interested in knowing the classes of errors made and the circumstances in which they are made.&quot;\[1\] However, to date the Tipster scoring categories (correct, partial, incorrect, spurious, missing, and noncommittal) have not been applied to classes of data based on structural distinctions in the language, or on semantic subclasses more finely differentiated than the NE types (person, location, organization, time, date, money, and percent). For example, there has been no attempt to score the extraction of transliterated foreign person names, or of short-form aliases of corporation names, or of Julian dates as opposed to Gregorian dates as opposed to dates of the Chinese lunar calendar.</Paragraph> <Paragraph position="1"> There are obvious practical reasons for this. The scothag criteria are limited to those that can be measured without access to anything more than the annotations the systems generate \[2\], and those applied by human taggers to the answer keys. Moreover, any new annotations that might become available represent a limited subset of the infinite number of ways that NE data might be subcategorized, in accordance with particular interests, applications, and capabilities.</Paragraph> <Paragraph position="2"> Yet, from among these innumerable possible subcategoties of entity names, a few would seem likely to emerge as more well-motivated than the rest. Note that an appendix on &quot;VIP Names&quot; or &quot;Country and Capital City Names&quot; is more likely to appear in a desk-top dictionary than a list called &quot;Ethnic Surnames in their Native Scripts and Common Anglicized English Renderings.&quot; One would expect observant end-users of information extraction systems to notice rather quickly that certain high frequency, hard-to-get, or thematically significant categories of names are missing or incorrect in the output. And one might desire, at some point in the system development loop, to capture these observations systematically, so as to direct efforts at system improvement. This would be especially desirable if system developmeat includes relatively labor-intensive linguistic analysis. null Doing this &quot;systematically&quot; is not the same as measuring errors scientifically. To count the number of tagged VIP person names, for example, presupposes somebody's interpretation of whether &quot;VIP&quot; includes only chiefs-of-state, or chiefs-of-state and cabinet ministers, or these plus nobel prize winning scientists, novelists, peace activists, etc. So, the following observations are at best an interpretative error analysis, informed by knowledge of the language and of likely user expectations. However, we try to define this as a series of steps that reasonably approximates a scientific discovery procedure. null</Paragraph> </Section> <Section position="3" start_page="0" end_page="453" type="metho"> <SectionTitle> 2. A PROCEDURE FOR INTERPRETA- TIVE DATA ANALYSIS </SectionTitle> <Paragraph position="0"> The following steps were taken to analyze the MET Chinese test data.</Paragraph> <Paragraph position="1"> - Step I: Scan the Input Data for Salient Subtypes The MET Chinese named-entity data tagged in the test keys was scanned for sub-classes of names appearing to meet one or more of the following criteria: - they are frequently-occurring semantic subtypes (for example it was apparent that country names comprised a large subclass within the LOCATION-tagged data); - they are readily listable, high-interest subtypes (for example, &quot;chiefs-of-state&quot; comprise a class small enongh to be readily listable by a human analyst; in addition, we might expect their activities to be more newsworthy than those of the newspaper reporters or official government spokespersons whose names also appear frequently in the data).</Paragraph> <Paragraph position="2"> - they are readily patternable subtypes (for example, many taggable organization names begin with a location name and end with a unit designator, as in the name &quot;Minnesota Mining and Manufacturing Corporation&quot;. Other organization names, such as &quot;Hamas,&quot; had no obviously specifiable morphological features in common with large numbers of other names.</Paragraph> <Paragraph position="3"> This scanning process identified a small number of data subtypes, which were individually describable in terms of the meaning, forms or distributions of names, and which collectively seemed to comprise a large percentage of all names extracted \[3\]. The resulting inventory of subtypes can be thought of as an hypothesis that the NE data is describable in a certain interpretative yet systematic way.</Paragraph> <Paragraph position="4"> - Step 2: Count NE Occurrences by Subtype Tagged names were searched by NE &quot;type&quot; (person / location / organization) using a concordance tool (NMSU's &quot;XConcord&quot;), then copied to files representing each of the posited subtype classes, or to a catch-all &quot;residual&quot; class. The number of names in each file was then counted to arrive at an overall profile of the data distribution. This step can be thought of as a test of the data distribution hypothesis.</Paragraph> <Paragraph position="5"> - Step 3: Chart the Distribution of NE Data Table 1 (following the tex0 provides a summary of the &quot;test&quot; results.</Paragraph> <Paragraph position="6"> - Step 4: Check for Inconsistencies in the Data Distribution null The numbers in the boxes of Table 1 were tallied and analyzed for internal consistency and non-conformity to our original expectations, that is, to show that the &quot;hypothesis&quot; was not invalidated. If no inconsistencies were found and an acceptably high percentage of the data had been accounted for, then the descriptive category set might have appeared adequate. Note, however, that the ratio of &quot;residual&quot; person names, 40%, is considerably higher than the ratios of residual location and organization names. This suggested that the initial description was leaving a significant portion of the data unaccounted for.</Paragraph> <Paragraph position="7"> - Step 5: Loop Back to Step 1 (or stop when an acceptably high percentage of the data is accounted for, and inconsistencies are resolved) Re-examination of the data revealed that, among the 2-3 syllable, non-VIE &quot;residual&quot; person names, 40% \[4\] are directly preceded on the left by a &quot;title&quot; (e.g. &quot;Representative so-and-so&quot;). This still leaves 24% \[5\] of person names unaccounted for. Some high percentage of this 24% presumably could be accounted for by an adequate structural description of Chinese &quot;surname pins given name&quot; patterns. This description, and measurement of the data it would cover, was not attempted, due to time constraints and the complexity of the problem.</Paragraph> </Section> <Section position="4" start_page="453" end_page="453" type="metho"> <SectionTitle> 3. APPLICATIONS OF THE INTERPRE- TATIVE DATA ANALYSIS </SectionTitle> <Paragraph position="0"> As suggested above, variations of the above procedure can be used to generate profiles of the data in order to direct efforts at system improvement. This may or may not be worth the cost of analysis if system improvement is driven solely by pipingmore and more massive amounts of development data into a statistical learning engine. If sufficiently massive and varied development data is available, presumably the system eventually will train upon something approaching all of the relevant data subtypes, without any need to know and describe what those subtypes are. However, when the approach involves labor-intensive pattern development based on linguistic structures, future language-analytic development could be focused by applying in advance something like the foregoing procedure, supported by tailored versions of concordance tools and other on-line analytic aids.</Paragraph> </Section> class="xml-element"></Paper>