<?xml version="1.0" standalone="yes"?>
<Paper uid="C67-1023">
  <Title>AN EXPERIMENTAL SYSTEM FOR AUTOMATIC RECOGNITION OF PERSONAL TITLES AND PERSONAL NAMES IN NEWSPAPER TEXTS</Title>
  <Section position="1" start_page="0" end_page="14" type="metho">
    <SectionTitle>
AN EXPERIMENTAL SYSTEM FOR
AUTOMATIC RECOGNITION OF PERSONAL TITLES
AND PERSONAL NAMES IN NEWSPAPER TEXTS
Casimir Borkowski
</SectionTitle>
    <Paragraph position="0"> Thomas J. Watson IBM Research Center Yorktown Heights, N.Y.</Paragraph>
    <Paragraph position="1"> Summary.</Paragraph>
    <Paragraph position="2"> Natural language seems to contain various special-purpose sublanguages (e.g., personal titles, personal names) -- each with its own structure which relative to the total structure of language is quite simple.</Paragraph>
    <Paragraph position="3"> An ability to generate and to recognize automatically words and word strings belonging to various special-purpose sublanguages may prove to be very useful since they play an important role in indexing and in various systems for extracting and distributing information.</Paragraph>
    <Paragraph position="4"> This paper (1) describes some of the main problems involved in automatic recognition of personal titles and names in newspaper texts, (2) outlines some rules of an algorithm designed to perform this task, (3) presents statistics concerning the algorithm's accuracy and exhaustiveness obtained in manual application of the algorithm to texts, (4) discusses and interprets some of the results, and (5) suggests some applications for computer programs capable of recognizing personal titles and names.</Paragraph>
    <Paragraph position="5"> Motivation for the Experiment.</Paragraph>
    <Paragraph position="6"> One of the major questions of the day is the extent to which a computer can be instructed to identify various parts of texts written in plain, ordinary language. In trying to answer this question, we set ourselves the preliminary limited objective of developing an automatic procedure for identifying personal titles and personal names in English-language texts.</Paragraph>
    <Paragraph position="7"> Experimental Design.</Paragraph>
    <Paragraph position="8"> Our procedure in setting up an automatic method for identifying personal names and titles was approximately as follows: (1) We investigated permissible patterns of personal titles and of English, French, Russian, German, Spanish, Chinese, Arabic, and other personal names whose occurrence in texts we could anticipate. (2) We obtained a 60,000-word sample of newspaper texts and determined: (a) patterns of occurrence of personal names and titles in texts, (b) patterns of personal names and titles occurring in texts, and (c) problems involved in distinguishing personal names and titles from each other and from other parts of texts.</Paragraph>
    <Paragraph position="9"> (3) Based on (1) and (2) above, we set up an automatic procedure designed to identify personal names and titles in newspaper texts. This procedure was embodied in flowcharts and a dictionary of about 8,000 entries.</Paragraph>
    <Paragraph position="10"> (4) We tested our procedure manually on a 100,000-word sample of new newspaper texts, and we emended the rules and expanded the dictionary on the basis of the information provided by the tests. (5) We then stabilized the improved procedure and (a) tested it out manually on a new 40,000-word sample of newspaper texts and (b) collected statistics concerning its accuracy and exhaustiveness. (Our reasons for applying the algorithm manually were as follows: (a) our identification system was embodied in dictionary entries and flow charts which were sufficiently detailed to permit accurate execution of recognition procedures, and (b) we thought that it would not pay to code and debug over a period of months what would probably turn out to be a &amp;quot;one-shot&amp;quot; program.) (6) We then investigated what types of errors had occurred and proposed various amendments to the automatic recognition procedure. Some Problems of Automatic Identification of Personal Titles and Names. Automatic identification of titles and names in texts is of course not without its difficulties. First of all, many personal names are orthographically identical with other types of words in the language. This is the case since among the main sources of surnames are: (1) titles (e.g., &amp;quot;King&amp;quot;), (2) names of occupations (e.g., &amp;quot;Baker&amp;quot;), (3) topographic terms (e.g., &amp;quot;Hill&amp;quot;), (4) personal attributes (e.g., &amp;quot;Coward&amp;quot;), (5) place names (e.g., &amp;quot;London&amp;quot;), (6) names of animals (e.g., &amp;quot;Fox&amp;quot;), (7) names of trees (e.g., &amp;quot;Pine&amp;quot;), etc.</Paragraph>
    <Paragraph position="11"> There is considerable ambiguity between personal names and place names, not only because the names of places are a frequent source of personal names but also because many localities were named after people, as for example Elizabeth, New Jersey and Dallas, Texas. And to make matters worse, hotels, business firms, universities, etc. can be named after people and are often referred to by an abbreviated name which is that of a person (e.g., &amp;quot;He is staying at the Hilton&amp;quot;, &amp;quot;He graduated from Stanford&amp;quot;, &amp;quot;Ford was hit by a strike last week&amp;quot;, &amp;quot;[...] of them climbed the [...]&amp;quot;). As for personal names like &amp;quot;Helena Rubinstein&amp;quot; and &amp;quot;Max Factor&amp;quot;, they designate persons as well as business firms, while &amp;quot;Philip Morris&amp;quot; is the name of a person, of a corporation, and of a brand of cigarettes.</Paragraph>
    <Paragraph position="12"> Yet another difficulty arises in the case of names of persons (e.g., &amp;quot;Madison&amp;quot;) when they perform a naming function with regard to something, say an avenue (e.g., &amp;quot;Madison Avenue&amp;quot;). Presumably, it would be worthwhile to distinguish automatically references to persons from references to things named after persons.</Paragraph>
    <Paragraph position="13"> Further difficulties in automatic recognition result from the co-occurrence in texts of names belonging to different name strings (e.g., &amp;quot;John Byron&amp;quot; as in &amp;quot;Estelle gave John Byron's Don Juan&amp;quot;, &amp;quot;Mary Jane&amp;quot; as in &amp;quot;For Mary Jane had nothing but sympathy&amp;quot;, &amp;quot;Alexander Montgomery&amp;quot; as in &amp;quot;According to Alexander Montgomery was slow in exploiting successes&amp;quot;).</Paragraph>
    <Paragraph position="14"> Other difficulties in recognizing personal names result from the fact that personal titles are not unfailing aids in identifying and disambiguating personal names since titles themselves can be homographic with other types of words. For instance, &amp;quot;General&amp;quot; is a military rank in &amp;quot;General Mobutu&amp;quot;, but not in &amp;quot;General Motors&amp;quot;.</Paragraph>
    <Paragraph position="15"> Further difficulties result from the fact that some titles are homographic with given names. How is an automaton to tell that &amp;quot;Dean&amp;quot; is a title in &amp;quot;Dean Wiesner&amp;quot; but a name in &amp;quot;Dean Rusk&amp;quot;, that &amp;quot;King&amp;quot; is a title in &amp;quot;King James&amp;quot; but a name in &amp;quot;James King&amp;quot;, that &amp;quot;Earl&amp;quot; is a title in &amp;quot;the fourth Earl Russell&amp;quot; but a name in &amp;quot;the Chief Justice Earl Warren&amp;quot;?</Paragraph>
    <Section position="1" start_page="0" end_page="14" type="sub_section">
      <SectionTitle>
Some Recognition Rules.
</SectionTitle>
      <Paragraph position="0"> Our recognition algorithm was intended as a frame of reference in an investigation of the trade-off between the efficiency and the complexity of a series of algorithms.</Paragraph>
      <Paragraph position="1"> To be able to investigate the trade-off between the efficiency and the complexity of a series of algorithms, we avoided taking as our point of departure a strong theory about the structure of language, semantics, pragmatics, etc., and about the amount of syntactic recognition required for successful identification in texts of personal titles and names. Instead, we sought to discover what assumptions and what information about natural language and texts may be pertinent to the resolution of the limited problems in recognition which we set for ourselves.</Paragraph>
      <Paragraph position="4"> Stronger assumptions and more elaborate techniques of analysis can be built into subsequent algorithms if required and as required. For instance, since parsing may be helpful in identifying sequences of names each of which is followed by its title (e.g., &amp;quot;The President nominated John Gordon Ambassador to Guatemala, William T. M. Beale Jr. Ambassador to Jamaica...&amp;quot;), future recognition algorithms may parse sentences containing: (1) double-object verbs (e.g., &amp;quot;nominate&amp;quot;) and (2) strings consisting of personal names followed by titles.</Paragraph>
      <Paragraph position="5"> At a later time, parsing and/or other types of analyses may be extended to sentences, paragraphs, and articles containing other kinds of words and phrases. However, since parsing and other types of analyses may be expensive, it would seem advisable to apply them only when they can reasonably be expected to provide economic solutions to valid problems.</Paragraph>
      <Paragraph position="6"> Our rules describe the arrangement in the sentence of the words, phrases, and punctuation marks which are pertinent to recognition of names and titles. Generally, the description starts with the first, that is, the leftmost pertinent element and terminates with the last, or rightmost pertinent element. Recognition rules were given a &amp;quot;left-to-right&amp;quot; format because rules expressed in this way are easy to implement on an electronic computer.</Paragraph>
      <Paragraph position="7"> For greater ease of understanding, the rules are expressed here in narrative form. For the sake of brevity, only some recognition rules are listed here. A more complete description of identification rules is available elsewhere (1). Our rules for recognizing names of persons take advantage of the style rules of The New York Times. We would conjecture that whereas details of name recognition rules may vary from newspaper to newspaper, their general pattern will remain fairly stable and independent of editorial conventions.</Paragraph>
      <Paragraph position="8"> The rule for identifying personal titles which was selected as a reasonable first approximation states that a word or phrase in text is a personal title either: (1) if it matches a word or string of words on a list of titles or (2) if it matches a word or a string of words which is on a list of words and phrases which commonly combine with titles (e.g., &amp;quot;Acting&amp;quot;, &amp;quot;Assistant&amp;quot;, &amp;quot;Vice&amp;quot;) and is followed by a personal title (e.g., &amp;quot;Acting Mayor&amp;quot;, &amp;quot;Acting Assistant Vice President&amp;quot;) or (3) if it is a personal title followed by a word or a string of words which is on a list of words which commonly combine with titles (e.g., &amp;quot;-elect&amp;quot;, &amp;quot;at Large&amp;quot;, &amp;quot;pro tempore&amp;quot;) as in &amp;quot;Senator-elect&amp;quot;, &amp;quot;Ambassador at Large&amp;quot;, &amp;quot;President pro tempore&amp;quot; or (4) if it is a title designated by a list, like say &amp;quot;Commissioner&amp;quot;, followed by the word &amp;quot;of&amp;quot; and any capitalized word (e.g., &amp;quot;Commissioner of Parks&amp;quot;)</Paragraph>
      <Paragraph position="9"> or (5) if it is a word beginning with a capital letter and followed by a title designated by a list (e.g., &amp;quot;Commissioner&amp;quot; as in &amp;quot;Police Commissioner&amp;quot;). A preliminary (and a highly tentative) rule specifies how titles concatenate. This rule permits distinguishing some strings such as &amp;quot;Prime Minister, Sir&amp;quot; as in &amp;quot;Prime Minister, Sir Alec Douglas-Home&amp;quot;, &amp;quot;Rev. Dr.&amp;quot; as in &amp;quot;Rev. Dr. Martin Luther King&amp;quot;, &amp;quot;Mr. Chairman&amp;quot;, &amp;quot;Mr. Counsel&amp;quot;, &amp;quot;Mr. Chairman, Ladies, and Gentlemen:&amp;quot;, and so forth from titles followed by names.</Paragraph>
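      <Paragraph> The five title rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the flowcharts of the original system: the word lists `TITLES`, `PREFIXES`, `SUFFIXES`, and `DESIGNATED` are tiny hypothetical stand-ins for the dictionary of about 8,000 entries, and the function name `is_title` is invented for this example.

```python
# Hypothetical miniature word lists; the original dictionary is not reproduced here.
TITLES = {"Mayor", "President", "Senator", "Ambassador",
          "Commissioner", "Vice President"}
PREFIXES = {"Acting", "Assistant", "Vice"}        # combine with a following title
SUFFIXES = ("-elect", "at Large", "pro tempore")  # combine with a preceding title
DESIGNATED = {"Commissioner"}                     # titles singled out by a list


def is_title(phrase):
    """Return True if `phrase` counts as a personal title under rules (1)-(5)."""
    # Rule (1): the phrase itself is on the list of titles.
    if phrase in TITLES:
        return True
    words = phrase.split()
    # Rule (2): a listed modifier followed by a personal title.
    if len(words) >= 2 and words[0] in PREFIXES and is_title(" ".join(words[1:])):
        return True
    # Rule (3): a personal title followed by a listed suffix.
    for suffix in SUFFIXES:
        if phrase.endswith(suffix):
            head = phrase[:-len(suffix)].rstrip()
            if head and is_title(head):
                return True
    # Rule (4): a designated title, the word "of", and any capitalized word.
    if (len(words) == 3 and words[0] in DESIGNATED
            and words[1] == "of" and words[2][:1].isupper()):
        return True
    # Rule (5): a capitalized word followed by a designated title.
    if len(words) == 2 and words[0][:1].isupper() and words[1] in DESIGNATED:
        return True
    return False
```

With these toy lists, `is_title("Acting Assistant Vice President")` and `is_title("Senator-elect")` hold, while `is_title("General Motors")` does not. The recursion in rules (2) and (3) is one possible reading of how modifiers stack; the source does not say how its flowcharts handled this.</Paragraph>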
      <Paragraph position="10"> The present set of rules for identifying titles which are homographic (that is, orthographically identical) with other words is relatively simple. It divides ambiguous titles into four classes. Words of Class I (e.g., &amp;quot;King&amp;quot;, &amp;quot;Pope&amp;quot;, &amp;quot;Prince&amp;quot;) are assumed to be titles (e.g., &amp;quot;King John&amp;quot;, &amp;quot;Pope John&amp;quot;) unless: (1) preceded by either personal titles designated by a list such as &amp;quot;Mr.&amp;quot;, &amp;quot;Dr.&amp;quot;, &amp;quot;M. Sgt.&amp;quot;, &amp;quot;General&amp;quot;, etc., or by given names and initials in various combinations (e.g., &amp;quot;Mr. King&amp;quot;, &amp;quot;John King&amp;quot;, &amp;quot;Dr. Pope&amp;quot;, &amp;quot;John Pope&amp;quot;, &amp;quot;John M. King&amp;quot;, &amp;quot;J. M. King&amp;quot;) or (2) followed by such postnominal elements as &amp;quot;Sr.&amp;quot; (e.g., &amp;quot;King, Sr.&amp;quot;), &amp;quot;&amp; Bros.&amp;quot; (e.g., &amp;quot;King &amp; Bros.&amp;quot;), &amp;quot;and Company&amp;quot; (e.g., &amp;quot;King and Company&amp;quot;), and so forth.</Paragraph>
      <Paragraph position="11"> Occasionally, words of Class I are followed by capitalized words or phrases which designate various institutions, establishments, locations, and so forth which are frequently named after persons (e.g., &amp;quot;Drug Store&amp;quot;, &amp;quot;College&amp;quot;, &amp;quot;Avenue&amp;quot;, &amp;quot;Theorem&amp;quot;). Although the list of such words and phrases is open-ended, its most frequently occurring members can be discovered and listed quite easily. Furthermore, there is some evidence that many or most such phrases can be identified by means of recognition rules.</Paragraph>
      <Paragraph position="12"> Words which are members of Class I are assumed to be names when they are followed by words such as &amp;quot;College&amp;quot; or phrases such as &amp;quot;Drug Store&amp;quot;.</Paragraph>
      <Paragraph position="13"> Words which are members of Class II (e.g., &amp;quot;Kaiser&amp;quot;, &amp;quot;Chamberlain&amp;quot;, &amp;quot;Earl&amp;quot;) are assumed to be personal names. Commonly occurring exceptions to this rule (e.g., &amp;quot;Kaiser Wilhelm&amp;quot;, &amp;quot;Lord Chamberlain&amp;quot;, &amp;quot;Earl of&amp;quot; (if followed by a word beginning with a capital letter)) are listed.</Paragraph>
      <Paragraph position="14"> Words which are members of Classes III and IV (e.g., &amp;quot;General&amp;quot;, &amp;quot;Principal&amp;quot;, &amp;quot;Justice&amp;quot;) are assumed to be titles. Commonly occurring exceptions to this rule are listed (&amp;quot;General Assembly&amp;quot;, &amp;quot;Major Medical Plan&amp;quot;, &amp;quot;Principal Investigator&amp;quot;, &amp;quot;Justice Department&amp;quot;).</Paragraph>
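      <Paragraph> The class defaults and exception lists described above might be sketched as follows. All list contents and the names `classify` and `is_initial` are assumptions made for illustration; the source does not say how Classes III and IV differ from one another, so the sketch treats them as one group.

```python
# Hypothetical miniature lists standing in for the original dictionary entries.
CLASS_I = {"King", "Pope", "Prince"}                         # default: title
CLASS_II = {"Kaiser", "Chamberlain", "Earl"}                 # default: name
CLASS_III_IV = {"General", "Principal", "Justice", "Major"}  # default: title
LISTED_TITLES = {"Mr.", "Dr.", "General"}
GIVEN_NAMES = {"John", "James"}
NAME_CONTEXTS = ("Sr.", "and Company", "College", "Avenue", "Drug Store")
CLASS_II_EXCEPTIONS = ("Kaiser Wilhelm", "Lord Chamberlain")
CLASS_III_IV_EXCEPTIONS = ("General Assembly", "Major Medical Plan",
                           "Principal Investigator", "Justice Department")


def is_initial(token):
    """True for a single capital letter followed by a period, e.g. "M."."""
    return len(token) == 2 and token[0].isupper() and token[1] == "."


def classify(prev, word, following):
    """Label an ambiguous word from the token before it and the text after it."""
    following = following.lstrip(", ")
    if word in CLASS_I:
        # Preceded by a listed title, a given name, or an initial: a surname.
        if prev in LISTED_TITLES or prev in GIVEN_NAMES or is_initial(prev):
            return "name"
        # Followed by a postnominal element or a namesake word: a name.
        if following.startswith(NAME_CONTEXTS):
            return "name"
        return "title"
    if word in CLASS_II:
        if (prev + " " + word).startswith(CLASS_II_EXCEPTIONS):
            return "title"
        if (word + " " + following).startswith(CLASS_II_EXCEPTIONS):
            return "title"
        parts = following.split()
        if word == "Earl" and parts[:1] == ["of"] and parts[1:2] and parts[1][:1].isupper():
            return "title"
        return "name"
    if word in CLASS_III_IV:
        if (word + " " + following).startswith(CLASS_III_IV_EXCEPTIONS):
            return "not a title"
        return "title"
    return "unknown"
```

Under these toy lists, `classify("", "King", "John")` gives "title", `classify("John", "King", "said")` gives "name", and `classify("", "General", "Assembly voted")` gives "not a title". As the text notes, "Earl" in "the fourth Earl Russell" would still be labeled a name unless that phrase were added to the exception list.</Paragraph>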
      <Paragraph position="15"> As our rules become more sophisticated, the need for lists of exceptions will diminish. However, it is likely that listing exceptions will often be an attractive alternative to rendering a rule more complicated.</Paragraph>
      <Paragraph position="16"> Personal titles in the plural are recognized by means of a simple rule. It states that a string of characters is a personal title in the plural if: (1) it is recognizable as a personal title and if either (2) its final word is followed by the letter &amp;quot;s&amp;quot; (e.g., &amp;quot;Major Generals&amp;quot;) or else (3) if one of the words of which it is composed and which a list designates as the stem for the plural is followed by the letter &amp;quot;s&amp;quot; (e.g., &amp;quot;Collector&amp;quot; as in &amp;quot;District Collector of Internal Revenue&amp;quot;). A procedure similar to the one for identifying titles in the plural is used to recognize personal titles in the possessive case (e.g., singular: &amp;quot;Major General's&amp;quot;, plural: &amp;quot;Major Generals'&amp;quot;). Our rules assume that the capitalized word or string of words and initials which frequently follows a title is the name of a person (e.g., &amp;quot;President Nkrumah&amp;quot;, &amp;quot;Mr. Paul-Henri Spaak&amp;quot;, &amp;quot;Governor Nelson Rockefeller&amp;quot;).</Paragraph>
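      <Paragraph> The rule for titles in the plural stated in this paragraph can be illustrated with a short sketch. The lists `TITLES` and `PLURAL_STEMS` and the function name `singular_of_plural_title` are hypothetical; the possessive case, which the text says is handled similarly, is left out.

```python
# Hypothetical miniature lists; the original dictionary is not reproduced here.
TITLES = {"Major General", "Senator", "District Collector of Internal Revenue"}
# For multi-word titles, the word that a list designates as the plural stem.
PLURAL_STEMS = {"District Collector of Internal Revenue": "Collector"}


def singular_of_plural_title(phrase):
    """Return the singular title if `phrase` is a plural title, else None."""
    words = phrase.split()
    # Condition (2): the final word of a known title is followed by "s".
    if words and words[-1].endswith("s"):
        candidate = " ".join(words[:-1] + [words[-1][:-1]])
        if candidate in TITLES:
            return candidate
    # Condition (3): the designated stem word is followed by "s".
    for title, stem in PLURAL_STEMS.items():
        if phrase == title.replace(stem, stem + "s", 1):
            return title
    return None
```

For example, `singular_of_plural_title("Major Generals")` returns "Major General" and `singular_of_plural_title("District Collectors of Internal Revenue")` returns the singular form, while an unrelated word such as "Motors" yields `None`.</Paragraph>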
      <Paragraph position="17"> If a title is followed by a word beginning with a lower-case letter or by certain punctuation marks, this indicates that the title is not followed by a name. However, occasionally personal titles are followed by names beginning with lower-case letters (e.g., &amp;quot;President de Gaulle&amp;quot;).</Paragraph>
      <Paragraph position="18"> On occasion, titles are followed by capitalized words which are not names. This happens in particular when a title is followed by a capitalized word or phrase which designates an institution, an establishment, a site, and so forth, named after a personal title (e.g., &amp;quot;Ambassador Bar&amp;quot;, &amp;quot;Archduke Trio&amp;quot;, &amp;quot;Emperor Concerto&amp;quot;, &amp;quot;President Hotel&amp;quot;, &amp;quot;Queens County&amp;quot;, &amp;quot;Viceroy Lumber Company&amp;quot;). Although -- as mentioned earlier -- the list of such words and phrases is open-ended, its most frequently occurring members have been listed and are consequently identifiable. Commonly occurring exceptions to this rule are also listed and are therefore identifiable. In addition, we have some simple preliminary rules for identifying phrases whose designata often bear as names words or phrases which are personal titles (e.g., &amp;quot;President Radio Repair Shop&amp;quot;, &amp;quot;Viceroy Lumber Company&amp;quot;). However, since the identification of phrases which designate such namesakes has been given little attention, these rules are very tentative.</Paragraph>
      <Paragraph position="19"> Words which are generally names of weekdays when they occur after personal titles (e.g., &amp;quot;They saw the President Monday&amp;quot;) are of course listable and therefore identifiable.</Paragraph>
      <Paragraph position="20"> Occasionally, titles are followed by capitalized words which are not names and for the recognition of which the rules make no provisions (e.g., &amp;quot;British&amp;quot; in &amp;quot;The Prime Minister, British sources said, will arrive on Monday.&amp;quot; and &amp;quot;New York&amp;quot; in &amp;quot;Mr. Stevenson prefers Washington and Mr. Rusk, the Secretary of State, New York.&amp;quot;). Constructions such as these are, however, quite rare. Occasionally, prepositional and other phrases intrude between a title and a name (e.g., &amp;quot;the French Ambassador to the United States, Herve Alphand&amp;quot;, &amp;quot;the Foreign Minister of France, Maurice Couve de Murville&amp;quot;). Prepositional phrases of this sort and other adjuncts are identified by various rules of the &amp;quot;brute force&amp;quot; type whose statements are constructed as follows: If a personal title (e.g., &amp;quot;Foreign Minister&amp;quot;) is followed by a phrase consisting of the preposition &amp;quot;of&amp;quot; and of the name of a country, such phrase is part of the title.</Paragraph>
      <Paragraph position="21"> In general, in a string of words consisting of titles and names, titles precede names (e.g., &amp;quot;the Secretary of State, Dean Rusk, the Foreign Minister of the Federal Republic, Gerhard Schröder, the Foreign Minister of France, Maurice Couve de Murville&amp;quot;). Sequences of names each of which is followed by its title (e.g., &amp;quot;Dean Rusk, the Secretary of State, Gerhard Schröder, the Foreign Minister of the Federal Republic, ...&amp;quot;) are rare.</Paragraph>
      <Paragraph position="22"> (Ordinarily, in a construction of this type, each title is set off from the name which follows it by a semi-colon.) In spite of counter-examples such as the ones above, one can reasonably assume that if the capitalized word or string of words and initials which frequently follows a personal title is NOT an identifiable word like &amp;quot;Hotel&amp;quot;, &amp;quot;Garage&amp;quot;, &amp;quot;Street&amp;quot;, &amp;quot;Monday&amp;quot;, etc., or a phrase like &amp;quot;Barber Shop&amp;quot;, &amp;quot;Drug Store&amp;quot;, &amp;quot;Meat Packing Company&amp;quot;, etc., then it is a personal name.</Paragraph>
      <Paragraph position="23"> While counter-examples to this rule and to similar rules are easy to invent, the inventors of counter-examples usually miss the point that rules such as these are statistical observations and that in actual application to texts they hold up rather well. As stated earlier, among the goals of an investigation of this type is to obtain experimental evidence as to how well the rules hold up and what amendments are required to simplify them, to render them more accurate, and to expand their scope.</Paragraph>
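      <Paragraph> The default assumption discussed above (a capitalized string after a title is a personal name unless it is a listed non-name word or phrase) can be sketched as follows. The lists and the function name `name_after_title` are hypothetical, the input is assumed to begin with the title, and only a few end-of-name punctuation marks are handled.

```python
# Hypothetical miniature lists; the original dictionary is not reproduced here.
TITLES = ("President", "Governor", "Ambassador", "Mr.")
NON_NAME_WORDS = {"Hotel", "Garage", "Street", "Monday", "Bar"}
NON_NAME_PHRASES = ("Barber Shop", "Drug Store", "Meat Packing Company")


def name_after_title(text):
    """Return the presumed personal name following a leading title, or None."""
    for title in TITLES:
        if text.startswith(title + " "):
            rest = text[len(title) + 1:]
            break
    else:
        return None
    run = []
    for token in rest.split():
        # An initial such as "A." belongs to the name; its period is kept.
        if len(token) == 2 and token[0].isupper() and token[1] == ".":
            run.append(token)
            continue
        word = token.rstrip(",.;:")
        if not word[:1].isupper():
            break                      # a lower-case word ends the name string
        run.append(word)
        if word != token:
            break                      # trailing punctuation ends the name string
    if not run:
        return None
    candidate = " ".join(run)
    # Listed non-name words and phrases override the default assumption.
    if any(w in NON_NAME_WORDS for w in run):
        return None
    if any(p in candidate for p in NON_NAME_PHRASES):
        return None
    return candidate
```

With these lists, `name_after_title("Governor Nelson A. Rockefeller, who spoke")` returns "Nelson A. Rockefeller", while "President Hotel" and "President Monday" return `None`. Names that begin with a lower-case particle, such as "President de Gaulle", are deliberately not covered here.</Paragraph>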
      <Paragraph position="24"> If a personal title (e.g., &amp;quot;President&amp;quot;) is conjoined to the titles &amp;quot;Mrs.&amp;quot; or &amp;quot;Miss&amp;quot; (as in &amp;quot;the President and Mrs. Johnson&amp;quot;), the capitalized word or string of words which follows the conjoined titles is generally the name of a person.</Paragraph>
      <Paragraph position="25"> In general, the name of a person acts distributively with regard to the preceding titles, that is to say, a phrase like &amp;quot;the President and Mrs. Johnson&amp;quot; decomposes into &amp;quot;President Johnson and Mrs. Johnson&amp;quot;.</Paragraph>
      <Paragraph position="26"> Occasionally, a personal name does not act distributively with regard to conjoined titles which precede it (as in &amp;quot;an agreement between the Cardinal and Mrs. Johnson&amp;quot;, &amp;quot;a meeting between the President and Mrs. Luce&amp;quot;); however, the present set of rules makes no provisions for recognizing such cases.</Paragraph>
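      <Paragraph> The distributive reading of conjoined titles can be shown in a few lines. The function name `distribute_name` is an assumption, and, like the rules in the text, the sketch makes no attempt to detect the non-distributive cases just mentioned.

```python
def distribute_name(phrase):
    """Expand "TITLE and Mrs./Miss NAME" into one (title, name) pair per title."""
    words = phrase.split()
    if "and" not in words:
        return None
    i = words.index("and")
    # Drop the article so that "the President" becomes "President".
    first_title = " ".join(w for w in words[:i] if w.lower() != "the")
    tail = words[i + 1:]
    if len(tail) in (0, 1) or tail[0] not in ("Mrs.", "Miss"):
        return None
    if not first_title:
        return None
    name = " ".join(tail[1:])
    return [(first_title, name), (tail[0], name)]
```

Here `distribute_name("the President and Mrs. Johnson")` yields the pairs ("President", "Johnson") and ("Mrs.", "Johnson"), which matches the decomposition into "President Johnson and Mrs. Johnson" given in the text.</Paragraph>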
      <Paragraph position="27"> Generally, in newspaper articles, capitalized words and initials which (a) frequently follow a personal title in the plural (e.g., &amp;quot;Senators&amp;quot;) and (b) which are not followed by other titles (e.g., &amp;quot;Senators, Congressmen, and Generals&amp;quot;) are strings of personal names (e.g., &amp;quot;Senators Javits and Kennedy&amp;quot;, &amp;quot;Senators Jacob Javits, Robert Kennedy and George D. Aiken&amp;quot;, &amp;quot;Presidents Johnson and Lopez Mateos&amp;quot;).</Paragraph>
      <Paragraph position="28"> Of course, &amp;quot;Ambassadors Bar and Grill&amp;quot; is not a title in the plural followed by two names. However, commonly occurring phrases such as &amp;quot;Bar and Grill&amp;quot; are listable and therefore identifiable. As a rule, the first name string -- which is often separated from the title by a comma -- terminates before the first conjunction &amp;quot;and&amp;quot; or before the next comma; the second name string begins after &amp;quot;and&amp;quot; or after the comma; etc.</Paragraph>
      <Paragraph position="29"> If two name strings which follow a title in the plural (e.g., &amp;quot;Senators Jacob Javits, Robert Kennedy, ...&amp;quot;) are separated by a comma, then -- generally -- the end of the second name string is marked by a comma or by the conjunction &amp;quot;and&amp;quot;.</Paragraph>
      <Paragraph position="30"> If two name strings which follow a title in the plural are separated by &amp;quot;and&amp;quot;, then -- generally -- the end of the second name string is marked either by punctuation marks such as sentence period, a colon, a semi-colon, etc., or by a word beginning with a lower-case letter (e.g., &amp;quot;arrived&amp;quot; in &amp;quot;Presidents Johnson and Lopez Mateos arrived today.&amp;quot;). However, if the word beginning with a lower-case letter is a name conjunction (e.g., &amp;quot;de&amp;quot;, &amp;quot;von&amp;quot;), then such word does not mark the end of the second name (e.g., &amp;quot;Presidents Lyndon Johnson and Charles de Gaulle&amp;quot;).</Paragraph>
      <Paragraph position="33"> Occasionally, the string of words which follows a title in the plural consists of both names and prepositional phrases (e.g., &amp;quot;Senators Javits of New York and Fulbright of Arkansas&amp;quot;, &amp;quot;Senators from New York, Javits and Kennedy&amp;quot;). Our rules permit identifying some prepositional and other phrases which may intrude between titles in the plural and the names which follow them.</Paragraph>
      <Paragraph position="34"> Generally, the title in the plural acts distributively with regard to the names which follow it, that is to say, a phrase like &amp;quot;Senators Javits and Kennedy&amp;quot; decomposes into &amp;quot;Senator Javits and Senator Kennedy&amp;quot;.</Paragraph>
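      <Paragraph> The segmentation of name strings after a title in the plural, and the distribution of the title over them, might look like this in outline. The function name `distribute_title` and the set `NAME_PARTICLES` are assumptions; intruding prepositional phrases such as "of New York" and listed exceptions such as "Bar and Grill" are not handled.

```python
NAME_PARTICLES = {"de", "von"}   # lower-case words that may occur inside a name


def distribute_title(text, plural, singular):
    """Split the names after a plural title and attach the singular title to each."""
    if not text.startswith(plural + " "):
        return []
    names, current = [], []
    for token in text[len(plural) + 1:].split():
        # An initial such as "D." stays inside the current name string.
        if len(token) == 2 and token[0].isupper() and token[1] == ".":
            current.append(token)
            continue
        word = token.rstrip(",.;:")
        if word == "and":
            if current:
                names.append(" ".join(current))
            current = []
            continue
        if not word[:1].isupper() and word not in NAME_PARTICLES:
            break                          # lower-case word: end of the last name
        current.append(word)
        if token.endswith(","):
            names.append(" ".join(current))    # comma: next name string begins
            current = []
        elif token != word:
            break                          # period, colon, semi-colon: stop
    if current:
        names.append(" ".join(current))
    return [singular + " " + name for name in names]
```

For instance, `distribute_title("Presidents Johnson and Lopez Mateos arrived today.", "Presidents", "President")` returns "President Johnson" and "President Lopez Mateos", and the name conjunction in "Charles de Gaulle" does not end the second name string.</Paragraph>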
      <Paragraph position="35"> Generally, the end of a name string which may follow a personal title in the singular is marked by punctuation (comma, sentence period, dash, semi-colon, colon, exclamation point, apostrophe, three dots, left or right parenthesis, etc.) or by a word beginning with a lower-case letter.</Paragraph>
      <Paragraph position="36"> However, a lower-case letter does not mark the end of a name string if the word which begins with it is either: (1) a name conjunction (e.g., &amp;quot;de&amp;quot; as in &amp;quot;Attorney General Nicholas deB. Katzenbach&amp;quot;) or (2) the last element of a hyphenated Chinese given name (e.g., &amp;quot;lai&amp;quot; in &amp;quot;Premier Chou En-lai&amp;quot;) or (3) the one-letter Spanish word &amp;quot;y&amp;quot; (e.g., &amp;quot;President Jose Bustamante y Rivero&amp;quot;) or (4) one of the Arabic words &amp;quot;ibn&amp;quot;, &amp;quot;el&amp;quot;, &amp;quot;al&amp;quot;, &amp;quot;er&amp;quot;, and so forth (as in &amp;quot;Abdul-Assiz ibn-Saud&amp;quot;, &amp;quot;Abd-el-Kader&amp;quot;, &amp;quot;Abd-al-Kadir&amp;quot;, &amp;quot;Abd-er-Rahman&amp;quot;).</Paragraph>
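      <Paragraph> The lower-case exceptions listed above can be folded into a boundary test along the following lines. The names `PARTICLES`, `continues_name`, and `name_string` are hypothetical. Tokens are split on whitespace, so a hyphenated form such as "En-lai" is seen as one capitalized token, and the sentence period is not treated as an end marker because telling it apart from "deB." or "Jr." would need an abbreviation list.

```python
import re

# Lower-case words that may occur inside a name (a small illustrative list).
PARTICLES = ("de", "von", "y", "ibn", "el", "al", "er")
# A particle, optionally fused to a capital initial as in "deB.".
PARTICLE_RE = re.compile(r"^(%s)([A-Z]\.?)?$" % "|".join(PARTICLES))


def continues_name(word):
    """True if `word` may belong to a name string rather than end it."""
    if word[:1].isupper():
        return True
    if PARTICLE_RE.match(word):
        return True
    # A particle joined by a hyphen to the next element, as in "ibn-Saud".
    return word.split("-")[0] in PARTICLES and "-" in word


def name_string(text):
    """Collect the tokens of a name until an end marker is reached."""
    name = []
    for token in text.split():
        word = token.rstrip(",;:!")
        if not continues_name(word):
            break
        name.append(word)
        if word != token:
            break          # comma, semi-colon, colon, or exclamation point
    return " ".join(name)
```

As a usage example, `name_string("Jose Bustamante y Rivero arrived")` returns "Jose Bustamante y Rivero", and `name_string("Chou En-lai, who")` returns "Chou En-lai".</Paragraph>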
      <Paragraph position="37"> The end of a name string is often marked by its last element (e.g., &amp;quot;Jr.&amp;quot; as in &amp;quot;Rev. Dr. Martin Luther King Jr.&amp;quot;, &amp;quot;2nd&amp;quot; as in &amp;quot;Douglas MacArthur 2nd&amp;quot;, the Roman numerals &amp;quot;I&amp;quot;, &amp;quot;II&amp;quot;, &amp;quot;III&amp;quot;, etc., as in &amp;quot;King Idris I&amp;quot;). Cases in which Roman numerals are in the middle of a name are rare and can be treated as listable exceptions (e.g., &amp;quot;King Gustaf VI Adolf&amp;quot;).</Paragraph>
      <Paragraph position="38"> NOTE: The present rule makes no provisions for distinguishing Roman numerals &amp;quot;I&amp;quot;, &amp;quot;V&amp;quot;, and &amp;quot;X&amp;quot; from the first person pronoun &amp;quot;I&amp;quot; and from the letters &amp;quot;V&amp;quot; and &amp;quot;X&amp;quot; since contexts in which these ambiguities may cause error in name recognition seem rare (e.g., &amp;quot;Malcolm X&amp;quot;, &amp;quot;Pope Leo X&amp;quot;, &amp;quot;Idris I&amp;quot;, &amp;quot;May I leave?&amp;quot;). Ordinarily, a left parenthesis or a left bracket is among the punctuation marks which mark the end of a name string. This, however, is not the case when a person's title and given name are followed by his nickname in quotation marks, as for example in &amp;quot;Gen. Howell (&amp;quot;Howling Mad&amp;quot;) Smith&amp;quot;, &amp;quot;Adm. William (&amp;quot;Bull&amp;quot;) Halsey&amp;quot;, etc. Similarly, whereas ordinarily a left bracket marks the end of a name string, occasionally, when quoting someone, newspapers supply in brackets the part of the name which the original statement omitted (e.g., &amp;quot;My agreement with Senator [Richard B.] Russell...&amp;quot;); sequences such as these do not mark the end of a name.</Paragraph>
      <Paragraph position="39"> The rule for identifying nicknames which was selected as a reasonable first approximation states that &amp;quot;strings of words in parentheses and quotes which occur immediately after the title and/or given names and before a surname are nicknames&amp;quot;.</Paragraph>
      <Paragraph position="40"> A parallel rule serves to identify names in brackets which act as amplifications of original statements.</Paragraph>
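      <Paragraph> The nickname rule and its parallel for bracketed amplifications lend themselves to a pattern-matching sketch. The regular expressions and the function names `strip_nickname` and `merge_amplification` are assumptions made for this example and are not taken from the original flowcharts.

```python
import re

# A quoted string inside parentheses, e.g. ("Bull").
NICKNAME_RE = re.compile(r'\(\s*"([^"]+)"\s*\)')
# A bracketed amplification, e.g. [Richard B.].
BRACKET_RE = re.compile(r"\[([^\]]+)\]")


def strip_nickname(name_string):
    """Return (name without the nickname, nickname), or (name, None) if none."""
    m = NICKNAME_RE.search(name_string)
    if not m:
        return name_string, None
    before = name_string[:m.start()].rstrip()
    after = name_string[m.end():].lstrip()
    # The rule requires a title or given name before and a surname after.
    if not before or not after:
        return name_string, None
    return before + " " + after, m.group(1)


def merge_amplification(name_string):
    """Fold a bracketed amplification into the name string."""
    return BRACKET_RE.sub(lambda m: m.group(1), name_string)
```

For example, `strip_nickname('Adm. William ("Bull") Halsey')` returns the pair ("Adm. William Halsey", "Bull"), and `merge_amplification("Senator [Richard B.] Russell")` returns "Senator Richard B. Russell".</Paragraph>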
      <Paragraph position="41"> The preceding section has stated in considerable although by no means full detail some rules for identifying titles and names. We hope that this form of presentation indicates the vast amount of detail involved in rules for automatic recognition without, however, overburdening the reader with a multitude of minute points of information.</Paragraph>
      <Paragraph position="42"> Results of the Experiment.</Paragraph>
      <Paragraph position="43"> Since our identification rules were embodied in dictionary entries and flow charts which were sufficiently detailed to permit an accurate manual execution of identification procedures, it was decided that our identification system would be tested out by hand on a sample of The New York Times texts.</Paragraph>
      <Paragraph position="44"> Identification procedures were applied manually to some 40,000 words of texts. Altogether eighty-eight articles from eleven issues were selected and processed. Only news articles were included in the sample. All materials found in the special sections such as (1) entertainment, (2) food-fashions-family-furnishings, (3) social events, (4) necrology, etc. were omitted. Materials in the sample consisted of only texts of news articles; picture captions, advertisements, italicized lists of various sorts, charts and diagrams, etc. were excluded from the data.</Paragraph>
      <Paragraph position="45"> Our 40,577-word sample contained 806 occurrences of names of persons. Of the 806 occurrences of names of persons, 46 or about 6% of the total were missed. In addition, 47 words and word strings were mistakenly identified as personal names or personal titles.</Paragraph>
      <Paragraph position="46"> Figure of merit F for the results of this identification system was computed by means of the following formula: [...] where C is the number of correct identifications, M is the number of mistaken identifications, and T is the number of names of persons in the sample (2). For T = 806, C = 746, and M = 47, [...]</Paragraph>
      <Paragraph position="48"> Analysis of Major Errors.</Paragraph>
      <Paragraph position="49"> Twenty-six misses (out of a total of forty-six) and thirty mistaken identifications (out of a total of forty-seven) occurred in attempted identifications of words, word stems, and word strings which perform a naming function vis-a-vis some namesake (e.g., &amp;quot;Grumman Aircraft Engineering Corporation&amp;quot;). This source of misses and false identifications would be eliminated if in the future the automatic identification system was not required to decide whether words, word stems, and word strings (e.g., &amp;quot;Grumman&amp;quot;) performing a naming function vis-a-vis some identifiable namesake (e.g., &amp;quot;Aircraft Engineering Corporation&amp;quot;) are names of persons. We also need more effective rules for computing namesake phrases (e.g., &amp;quot;Aircraft Company&amp;quot;) and personal titles (e.g., &amp;quot;Fireman Apprentice&amp;quot;) from their respective elements (e.g., &amp;quot;Aircraft&amp;quot;, &amp;quot;Company&amp;quot;, &amp;quot;Fireman&amp;quot;, &amp;quot;Apprentice&amp;quot;).</Paragraph>
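      <Paragraph> The proportions quoted in the results and in the error analysis can be checked with a few lines of arithmetic. The counts are those stated in the text; the variable names are illustrative, and the figure of merit is not recomputed here because its formula is not available in this text.

```python
# Counts as stated in the text.
total_names = 806            # occurrences of names of persons in the sample
missed = 46                  # names the procedure failed to identify
false_identifications = 47   # strings mistakenly identified
namesake_misses = 26         # misses involving namesake constructions
namesake_false = 30          # mistaken identifications involving namesakes

miss_rate = missed / total_names
namesake_miss_share = namesake_misses / missed
namesake_false_share = namesake_false / false_identifications

print(f"miss rate: {miss_rate:.1%}")
print(f"share of misses due to namesakes: {namesake_miss_share:.1%}")
print(f"share of false identifications due to namesakes: {namesake_false_share:.1%}")
```

The miss rate comes to about 5.7%, consistent with the "about 6%" quoted above, and namesake constructions account for roughly 57% of the misses and 64% of the mistaken identifications.</Paragraph>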
      <Paragraph position="50"> In addition, we need to prevent or eliminate the errors caused by the assumption that all capitalized words occurring after ambiguous words such as &amp;quot;General&amp;quot;, &amp;quot;Justice&amp;quot;, &amp;quot;Major&amp;quot;, &amp;quot;Principal&amp;quot;, etc. are names of persons.</Paragraph>
      <Paragraph position="51"> We also require more effective rules to distinguish strings of titles (e.g., &amp;quot;President, Secretary of State&amp;quot;) from titles followed by names. In addition, we need more effective rules for distributing a title among all names of persons which follow it in the text (e.g., &amp;quot;Senators Vance Hartke and Birch Bayh of Indiana and Eugene J. McCarthy and Walter F. Mondale of Minnesota&amp;quot;).</Paragraph>
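As an illustration of what distributing a title among several names might involve, the following Python sketch handles only the pattern of the example in the preceding paragraph: a plural title, names joined by 'and', and optional 'of Place' qualifiers. It is not the rule set of the algorithm described here; the table PLURAL_TITLES and the function distribute_title are hypothetical names introduced for this sketch.

```python
import re

# Hypothetical table mapping plural titles to their singular forms.
PLURAL_TITLES = {'Senators': 'Senator', 'Governors': 'Governor'}

def distribute_title(phrase):
    """Return (title, name) pairs for a plural title followed by conjoined names."""
    title, _, rest = phrase.partition(' ')
    singular = PLURAL_TITLES.get(title)
    if singular is None:
        return []          # not a plural title this toy table knows about
    pairs = []
    for chunk in rest.split(' and '):
        # Drop a trailing 'of Place' qualifier such as 'of Indiana'.
        name = re.sub(r'\s+of\s+[A-Z]\w+$', '', chunk.strip())
        pairs.append((singular, name))
    return pairs
```

Applied to the phrase quoted above, the sketch pairs 'Senator' with each of the four names. A usable rule would also have to cope with names that themselves contain 'and' or 'of', which this toy version does not attempt.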
      <Paragraph position="52"> In addition, we may require rules which would check on the old ones rather than supersede them. The new set of rules would be applied to words and phrases which were identified as names of persons by the old set of rules. The new rules could indicate the degree of confidence with which the algorithm identified a word or a string of words as name of a person, or as a personal title followed by the name of a person, etc.</Paragraph>
      <Paragraph position="53"> The advantage of this procedure consists in not having to revamp the algorithm in order to accommodate new rules. New rules would simply be tacked on to the old ones. New rules might check whether the elements of a string of letters, punctuation marks, spaces, numbers, etc. which the old rules had identified as a name of a person can be (a) words of the English language and (b) names of persons.</Paragraph>
      <Paragraph position="54"> Whether a string of characters is a personal name could be decided by probability tables constructed along these lines:  The new rules should be relatively easy to implement.</Paragraph>
      <Paragraph position="55"> The question &amp;quot;Is this string of letters an English word?&amp;quot; could be answered by means of (a) a lookup in a dictionary based on some desk dictionary -- say Webster's Collegiate, and (b) simple rules for identifying affixes of the plural, the past tense, the gerund, the negation, etc. The question &amp;quot;Can this string of letters be a personal name?&amp;quot; could be answered by means of (a) a lookup in a dictionary based on a large telephone directory -- say the Manhattan Telephone Directory, and (b) simple rules for identifying the plural (e.g., &amp;quot;es&amp;quot; and &amp;quot;s&amp;quot; as in &amp;quot;the Joneses&amp;quot; and &amp;quot;the Weinbergs&amp;quot;) and other affixes.</Paragraph>
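The two lookups just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists ENGLISH_WORDS and SURNAMES are tiny hypothetical stand-ins for a desk dictionary and a telephone directory, the affix lists are far from complete, and the function names are invented for the example.

```python
# Tiny stand-ins for the two dictionaries; both word lists are hypothetical.
ENGLISH_WORDS = {'major', 'walk', 'general', 'happy'}
SURNAMES = {'jones', 'weinberg', 'major'}

# Affixes tried when the bare form is not listed (plural, past tense, gerund, negation).
WORD_SUFFIXES = ('ing', 'ed', 'es', 's')
NAME_SUFFIXES = ('es', 's')

def in_lexicon(token, lexicon, suffixes, prefixes=()):
    """True if token, or token minus one listed affix, is in the lexicon."""
    form = token.lower()
    candidates = [form]
    candidates += [form[:-len(s)] for s in suffixes if form.endswith(s)]
    candidates += [form[len(p):] for p in prefixes if form.startswith(p)]
    return any(c in lexicon for c in candidates)

def classify(token):
    """Return (can_be_english_word, can_be_personal_name) for one token."""
    is_word = in_lexicon(token, ENGLISH_WORDS, WORD_SUFFIXES, prefixes=('un',))
    is_name = in_lexicon(token, SURNAMES, NAME_SUFFIXES)
    return is_word, is_name
```

With these toy lists, classify('Joneses') and classify('Weinbergs') answer only the name question affirmatively, while classify('Major') answers both, which is exactly the kind of ambiguous case the checking rules would have to flag with lower confidence.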
      <Paragraph position="56"> Improving the automatic identification system may require several subsidiary investigations. For instance, we may be well advised to determine the relationship -- if any -- between, on the one hand, the length, the date, the place of origin, the subject matter, the authorship, and the type of newspaper articles processed through the system and, on the other, the effectiveness of the algorithm.</Paragraph>
      <Paragraph position="57"> Automatic classification of words and phrases of the type described here can be regarded as a particularly simple case of machine translation. However, the goal of this type of machine translation is not translation into another natural language but TEXT REDUCTION: certain words and word strings are identified as &amp;quot;pertinent&amp;quot; (e.g., personal titles, personal names, place names, street addresses, numbers and measures, dates and other time phrases, company names, trade names, chemical formulas, etc.) and others as &amp;quot;not pertinent&amp;quot;. Pertinent words and phrases are retained and labeled, and all others are suppressed.</Paragraph>
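A minimal Python sketch of text reduction in this sense is given below. It assumes two toy recognizers written as regular expressions; the patterns, the labels, and the function reduce_text are hypothetical and stand in for the much richer rules a real system would need.

```python
import re

# Hypothetical recognizers, one per sublanguage; real rules would be far richer.
RECOGNIZERS = [
    ('DATE', re.compile(r'\b(?:January|February|March) \d{1,2}, \d{4}\b')),
    ('TITLE+NAME', re.compile(r'\b(?:Senator|Governor) [A-Z][a-z]+(?: [A-Z][a-z]+)?')),
]

def reduce_text(text):
    """Return labelled pertinent strings in text order; everything else is dropped."""
    found = []
    for label, pattern in RECOGNIZERS:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    # Sort by position so the reduced text preserves the original order.
    return [(label, s) for _, label, s in sorted(found)]
```

For the sentence 'Senator Birch Bayh spoke in Gary on March 3, 1967 about taxes.' the sketch retains the title with its name and the date, each with its label, and suppresses the rest.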
      <Paragraph position="58"> Even this simple goal requires rules which are rather complex. However, because many word strings which algorithms such as this one attempt to recognize have simple structure (&amp;quot;phrase structure&amp;quot;), they can be generated and possibly also recognized with a reasonable degree of accuracy by simple automata (&amp;quot;pushdown storage&amp;quot;) or by a combination of linguistic and statistical techniques.</Paragraph>
      <Paragraph position="59"> More generally, it may be useful to view natural language as a macro-language containing certain special-purpose micro-languages (or &amp;quot;sublanguages&amp;quot;) -- each with its own structure which relative to the total structure of language is quite simple. It may be of some practical and theoretical interest (a) to investigate the structures and the inter-relations of such sublanguages and (b) to construct algorithms for identifying in texts words and word strings belonging to such sublanguages.</Paragraph>
      <Paragraph position="60"> An ability to produce and identify automatically words and word strings belonging to various special-purpose categories (i.e., sublanguages, each with its own set of rules) may prove to be very useful in information retrieval because they play an important role in various systems for extracting and distributing information.</Paragraph>
      <Paragraph position="61"> It would appear that along with researching and developing methods for high-quality fully automatic classification of words in texts, it may be advisable to set up efficient procedures for (a) manual classification and tagging of words and word strings in texts, and (b) subsequent automatic extraction of data from texts which were recognized either manually or automatically. One procedure for manual classification of words in texts would require computer-legible texts which can be projected on TV-type tubes (hereafter, &amp;quot;display screens&amp;quot;) and either lightpens or cursors for writing on display screens. It may look approximately as follows: A newspaper article would be copied from some type of machine-readable tape into a suitable computer. The computer would then project the article on a display screen. A clerk would then scan the display screen and locate various types of words and phrases in the article (say, names of persons, names of organizations, dates, addresses, and so forth).</Paragraph>
      <Paragraph position="62"> Upon identifying a type of word or of word string, the clerk would flash a lightpen or a cursor at the display screen and bracket that word or word string in suitable identifying symbols.</Paragraph>
      <Paragraph position="63"> Next, identifying symbols would be transferred from display screen to tape by means of a computer program. The recognized tape could then be processed in various ways by miscellaneous information extracting programs.</Paragraph>
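The extraction step can be pictured with a short Python sketch. The text does not say what the identifying symbols look like, so the tag syntax used here, double square brackets around a label, a vertical bar, and the word string, is purely an assumption made for the example, as are the function names.

```python
import re

# Hypothetical tag syntax: [[LABEL|text]]; the paper does not specify the symbols.
TAG = re.compile(r'\[\[([A-Z]+)\|([^\]]+)\]\]')

def extract_tagged(text):
    """Return (label, string) pairs for every bracketed span, in text order."""
    return TAG.findall(text)

def strip_tags(text):
    """Recover the plain text by removing the identifying symbols."""
    return TAG.sub(r'\2', text)
```

Any extraction program of this kind works the same way whether the brackets were placed by a clerk or by an automatic recognizer, which is what makes the interim manual procedure compatible with later automation.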
      <Paragraph position="64"> It seems likely that manual assignment of word strings in texts to special-purpose sublanguages (akin to thesaurus classes) would provide a valuable interim service while methods for high-quality automatic classification are researched and developed. If and when automatic procedures for recognizing in texts dates, personal titles, various technical and professional terms, meta-linguistic terms, names, etc. become competitive with manual ones, the data processing community will be already in possession of operational computer programs capable of extracting data from recognized texts.</Paragraph>
      <Paragraph position="65"> Some Possible Applications.</Paragraph>
      <Paragraph position="66"> In the absence of figures on the cost of identifying personal titles and names by computer, the subject of the applications of computer programs capable of recognizing names of persons in newspaper texts must remain in the domain of speculations.</Paragraph>
      <Paragraph position="67"> We would conjecture that if the speed of computation was high and its price could be kept low, and if the figure of merit could be raised to .98 or higher, then a computer program for identifying names of persons in texts would be worth incorporating into existing information retrieval systems of very large newspapers and periodicals.</Paragraph>
      <Paragraph position="68"> It is still unknown whether a program with a figure of merit lower than .98 would be useful in information retrieval. We would surmise that it might be adequate for some purposes provided that it is sufficiently fast and cheap.</Paragraph>
      <Paragraph position="69"> Several uses suggest themselves immediately for computer programs capable of identifying cheaply, rapidly, accurately, and exhaustively the names and titles of persons in computer-legible newspaper texts. They seem to fall into five broad and overlapping categories: (1) automatic indexing of newspaper articles, (2) determining how the names of persons cluster with one another and with other words, (3) establishing frequency counts of names of persons, (4) tracing associations between names of persons, and (5) answering questions of the &amp;quot;Who?&amp;quot; type within an automatic or semi-automatic system capable of providing answers to &amp;quot;Who?&amp;quot; &amp;quot;Whom?&amp;quot; &amp;quot;Whose?&amp;quot; &amp;quot;When?&amp;quot; and &amp;quot;Where?&amp;quot; types of questions addressed to a newspaper file.</Paragraph>
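Two of these uses, frequency counts and the clustering of names with one another, reduce to simple counting once the names in each article have been identified. The Python sketch below assumes that identification has already been done and that each article is given as a list of names; the function name and the input format are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def name_statistics(articles):
    """articles: list of lists of names already identified in each article."""
    frequency = Counter()   # how often each name occurs overall
    pairs = Counter()       # in how many articles two names occur together
    for names in articles:
        frequency.update(names)
        # Count each unordered pair of distinct names once per article.
        for a, b in combinations(sorted(set(names)), 2):
            pairs[(a, b)] += 1
    return frequency, pairs
```

The pair counts give a crude measure of association between persons; an index of articles by name could be built from the same input by recording article numbers instead of counts.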
      <Paragraph position="70"> Systems for (a) either automatic or manual classification of words and word strings in texts, and (b) subsequent automatic extraction of data from texts which were recognized either automatically or manually may be useful to many groups, among them: (1) political scientists, sociologists, lexicographers, onomasticians, and literary scholars concerned with the occurrence of names, titles, and other words in texts, (2) editors, documentalists, librarians, and others concerned with automation of editing and of literature searching, and (3) opinion survey and market research statisticians concerned with the occurrence of names in texts, celebrity ratings, measurement of opinion trends, etc.</Paragraph>
    </Section>
  </Section>
</Paper>