<?xml version="1.0" standalone="yes"?> <Paper uid="M95-1013"> <Title>CRL/NMSU Description of the CRL/NMSU Systems Used for MUC- 6</Title> <Section position="2" start_page="0" end_page="157" type="metho"> <SectionTitle> Basic NE System </SectionTitle> <Paragraph position="0"> The Basic system consists of a pipeline of Unix process . These can be identified as carrying out four different types of task .</Paragraph> <Paragraph position="1"> 1. Recognizing names based on character patterns (numbers, dates) . 2. Recognizing names based on pre-stored names.</Paragraph> <Paragraph position="2"> 3. Applying mark-up to potential components of complete names . 4. Recognizing names based on patterns of components .</Paragraph> <Paragraph position="3"> For the most part the system is context free . A few of the patterns used require some additiona l context before a name is recognized. For example an ambiguous human name in isolation may be recognized if it is followed closely by a title .</Paragraph> <Paragraph position="4"> The system consists of suite of `C' and lex programs .</Paragraph> <Paragraph position="5"> Component units recognized by the system are cities, provinces, countries, company prefixes an d suffixes, company beginning and ending words (Club, Association etc .), unambiguous and ambiguous human first and last names, titles and human position words .</Paragraph> </Section> <Section position="3" start_page="157" end_page="158" type="metho"> <SectionTitle> Additional Fix-up Procedures </SectionTitle> <Paragraph position="0"> Final patterns are used to join together units of the same type which are immediately next to eac h other in the text.</Paragraph> <Paragraph position="1"> After all the pattern based procedures have operated on the text a final pass is made to recogniz e abbreviated forms of names . This takes the lists of names found so far and truncates them . (right to left for person and left to right for companies) . These new lists are then used as lists of known organizations and persons and any occurrences of these in the text are marked . In particular for organizations the headline is not processed apart from this last stage . This avoids recognition o f organizations such as &quot;Leaves Bank&quot; . The assumption that names mentioned in the heading wil l be repeated in the body of the text holds almost universally .</Paragraph> <Section position="1" start_page="157" end_page="158" type="sub_section"> <SectionTitle> Data Sources </SectionTitle> <Paragraph position="0"> The data used in the Basic system is derived from public domain source, university phone lists , the Tipster Gazetteer and government data-bases of company names.</Paragraph> <Paragraph position="1"> Performance The performances of the for the test set and for the walk through article are given in Appendix A . Overall performance was Recall - 85% and Precision - 87% giving an F measure of 85 .8 . Walk through article Performance here was Recall - 63% and Precision - 83% .</Paragraph> <Paragraph position="2"> The main source of error was missing patterns in the system . For example Robert L. James was partially recognized (as L . James), McCann-Erickson was missed as no hyphenated company pat tern had been added. Once a frequently mentioned name is ignored in its full form the syste m unfortunately misses all abbreviated forms . This article also shows the importance of context in reliably recognizing some names (e .g. 
<Section position="1" start_page="157" end_page="158" type="sub_section"> <SectionTitle> Data Sources </SectionTitle> <Paragraph position="0"> The data used in the Basic system is derived from public domain sources, university phone lists, the Tipster Gazetteer, and government databases of company names.</Paragraph> <Paragraph position="1"> Performance The performances of the system for the test set and for the walk-through article are given in Appendix A. Overall performance was Recall - 85% and Precision - 87%, giving an F measure of 85.8. Walk-through article Performance here was Recall - 63% and Precision - 83%.</Paragraph> <Paragraph position="2"> The main source of error was missing patterns in the system. For example, Robert L. James was only partially recognized (as L. James), and McCann-Erickson was missed as no hyphenated company pattern had been added. Once a frequently mentioned name is missed in its full form, the system unfortunately misses all abbreviated forms. This article also shows the importance of context in reliably recognizing some names (e.g. an analyst with PaineWebber).</Paragraph> </Section> </Section> <Section position="4" start_page="158" end_page="161" type="metho"> <SectionTitle> AutoLearn NE System </SectionTitle> <Paragraph position="0"> The AutoLearn system was developed to explore the possibility of using simple learning algorithms to detect specific features in text. An implementation of Quinlan's ID3 algorithm was used [2,3]. This algorithm constructs a decision tree which decides whether an element of a collection satisfies a property or not. Each element of a collection has a finite number of attributes, each of which may take one of several values. Quinlan's original paper suggests the range of values of the attributes should be &quot;small&quot;. In the case of the AutoLearn system the values are every word occurring in the training collection.</Paragraph> <Paragraph position="1"> Collections for Name Recognition In order to apply the ID3 algorithm the data needs to be structured into a collection, each member of which has specific values for a set of attributes, and for each of which it is known whether the member has a specific property or not. For the name recognition problem the training data was converted into tuples of five words. Each tuple was marked as having the start (or end) of a specific type of proper name at the middle word of the tuple (the label records whether the middle word begins such a name). This data can be easily generated from the training articles. Thus, for the beginning of a person:

    many differences between Robert L.       -1
    differences between Robert L. James       1
    between Robert L. James ,                -1

etc.</Paragraph> <Paragraph position="2"> Fourteen sets of training data were generated using the 318 development articles supplied for MUC-6. The quality of the tagging is not particularly uniform, but no attempt has been made to improve this.</Paragraph> <Paragraph position="3"> Generating the decision trees As each word of the training data is read it is hashed and stored in a hash table. Thereafter words are referred to by their hash values. For each of the values of the five attributes (words 1 through 5) a count is maintained of the number of times this value contributed to an element holding a proper-name occurrence at the middle attribute. The attribute to be tested first is chosen by computing for each value the relative frequency of positive and negative outcomes for this value. This is used to approximate the information content of that attribute.</Paragraph>
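<Paragraph position="4"> The formula printed at this point did not survive extraction. A plausible reconstruction, assuming it showed Quinlan's standard ID3 information measure [2] computed from the counts p (positive outcomes) and n (negative outcomes) for a value:

$$ I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} $$
</Paragraph>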
<Paragraph position="5"> The sum of the approximate information contents for each column is calculated, and the column with the highest value is chosen as the primary decision. All the values which always contributed to a positive outcome are used in the primary decision. Values which are always negative are ignored (this is primarily to reduce the size of the data being handled). New sub-collections are formed by collecting elements containing a value which contributed to both positive and negative outcomes, and the tree-building process is repeated for each of these new collections.</Paragraph> <Paragraph position="6"> The decision trees thus formed can be output in a readable if somewhat lengthy form. In most cases the first choice is the third word in a group, taking one of a large number of values. Thereafter a group of fairly impenetrable tests occurs. For example, for location beginnings:

    if word 3 is one of the following - Milwaukee Ridgefild Pa ST. (around 300 more words) then location_beginning
    else if word 3 is Illinois and word 1 is Indiana then location_beginning
    else if word 3 is Northeast and word 1 is `in' then location_beginning

The printed decision table takes about 5 pages.</Paragraph> <Paragraph position="7"> Running the AutoLearn System A pass through the texts is made for each decision tree (beginning and end) of each named entity. First the hash table of words is read, then the corresponding decision tree. The text is then processed in groups of five words. Whenever a positive decision is made a new tag is added to the output stream.</Paragraph> <Paragraph position="8"> Ideally at this stage the tagging would be done. However, given that we are processing new texts, there are many occasions where an end or a beginning is identified but the corresponding beginning or end is not. For example, a surname may have been seen previously, but not the attached forename. At this point a heuristic is applied which, for every unmatched bracket in the text, works forward or backward until some appropriate point is reached. The actual skipping heuristics need to be different for organizations, persons, locations, dates and numbers.</Paragraph>
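<Paragraph position="9"> The paper describes this repair step only in prose; the Python sketch below illustrates one plausible form of the backward scan for person names. The token representation, the capitalization test, and the function names are assumptions, and a full version would treat titles such as &quot;Mr.&quot; specially and use different skip rules for organizations, locations, dates and numbers, as noted above:

    def looks_like_name_word(tok):
        # Crude test: a capitalized word or an initial such as "L."
        return tok[:1].isupper()

    def repair_person_begin(tokens, end_index):
        # end_index is the position of the last word of a person name
        # for which an end tag was produced but no begin tag.  Scan
        # backward over plausible name words to find where the begin
        # tag belongs.
        begin = end_index
        for i in reversed(range(end_index)):
            if looks_like_name_word(tokens[i]):
                begin = i
            else:
                break
        return begin

    tokens = "yesterday Mr. Robert L. James resigned".split()
    # An end was tagged after "James" (index 4) with no matching begin:
    print(repair_person_begin(tokens, 4))  # 1 -- the scan stops before "yesterday"
</Paragraph>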
<Section position="1" start_page="159" end_page="160" type="sub_section"> <SectionTitle> Data Sources </SectionTitle> <Paragraph position="0"> The only data source used for the AutoLearn system was the 318 MUC-6 training texts.</Paragraph> <Paragraph position="1"> Performance A high precision was expected from this system. Most of the errors that occur are due to failures of the bracket insertion heuristic. The overall scores were Recall - 47% and Precision - 81%, giving an F measure of 59.3.</Paragraph> <Paragraph position="2"> No specific code was inserted to handle numbers or dates. The method was more successful with organizations and locations than with persons. More training data is perhaps required to make the system aware of the spread of examples for human names.</Paragraph> <Paragraph position="3"> Walk-through article The performance here was Recall - 36% and Precision - 88%.</Paragraph> <Paragraph position="4"> The major problem here is that the system has not learned a rule which uses &quot;Mr.&quot; to identify the word previous to a name.</Paragraph> <Paragraph position="5"> Relationship of Performance to Amount of Training Data </Paragraph> <Paragraph position="6"> The evaluation texts were processed with decision trees generated using subsets of the MUC-6 development data. This was done in increasing units of 50 texts. The results are shown in Figure 1 below. Both recall and precision increase with increasing training data. Precision appears to tail off at around 82%. Recall, however, increases (with one exception) steadily over the whole range.</Paragraph> </Section> <Section position="2" start_page="160" end_page="161" type="sub_section"> <SectionTitle> Future Developments </SectionTitle> <Paragraph position="0"> We intend to rebuild the Basic system. One of the principal drawbacks of the system is its sequential application of component tags. In many cases a second tag is not applied because the word or phrase is ambiguous. The correct solution here is to apply all tags in a manner that allows the correct tags to be selected by the pattern-processing mechanisms. In addition we plan to improve our collection of patterns. The current version of the system is being made generally available. This, we hope, will provide us with some feedback on patterns and errors in the data files.</Paragraph> <Paragraph position="3"> Some further experiments are also planned with the AutoLearn system. The main drawback with the system is that it does not make maximal use of the training data, in so far as with small training samples one word may be sufficient to make a decision. This situation can probably be improved by replacing specific words with a NULL word. This will force the system to develop rules based more on context. In particular, when the system encounters unknown words these will be considered equivalent to the NULL word.</Paragraph> <Paragraph position="4"> We also intend to apply the learning method described here to other NLP tasks such as part of speech tagging and disambiguation.</Paragraph> </Section> </Section> <Section position="5" start_page="161" end_page="433" type="metho"> <SectionTitle> References </SectionTitle> <Paragraph position="0"> [1] Cowie, J., Guthrie, L., Pustejovsky, J., Waterman, S., and Wakao, T., The CRL/Brandeis System as Used for MUC-5. In Proceedings of the Fifth Message Understanding Conference (MUC-5), Baltimore, MD, Morgan Kaufmann, 1993.</Paragraph> <Paragraph position="1"> [2] Quinlan, J.R., Discovering Rules by Induction from Large Collections of Examples. In Expert Systems in the Micro-Electronic Age, ed. Michie, D., Edinburgh University Press, 1979.</Paragraph> <Paragraph position="2"> [3] Quinlan, J.R., Machine Learning: Easily Understood Decision Rules. In Computer Systems that Learn, eds. Weiss, S.M. and Kulikowski, C.A., Morgan Kaufmann, 1991.</Paragraph> </Section> </Paper>