<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1005"> <Title>UNIVERSITY OF DURHAM: DESCRIPTION OF THE LOLITA SYSTEM AS USED IN MUC-7</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> The Named Entity Task The system's total score for the 100 articles of the formal run was: P&R 76.43 2P&R 77.31 P&2R 75.57 This is an improvement on the scores for the named entity task that the system achieved during MUC-6. However, the score was a little disappointing, as during training the system consistently achieved scores in the mid-to-high eighties.</Paragraph> <Paragraph position="1"> A shift in the topic from airlines and aircraft to satellite, rocket and missile launches explains some of the problems that were encountered. LOLITA's data in the latter area was not strong. Apart from the specific data, some basic data was found to be missing too, e.g., the names of the planets of the Solar System. Furthermore, the system was not prepared to recognise space shuttle names (e.g., Endeavour, Columbia) or missile names like Scud, Patriot, etc.</Paragraph> <Paragraph position="2"> A number of company names in the satellite television market were also missing. Several of them appear without a clear designator and were not recognised by the system, e.g., BSkyB, SatelLife, Intelsat, Comsat, Canal-Plus, etc. (NB. BSkyB might have been resolved correctly in the presence of British Sky Broadcasting Group Plc in the same article. Unfortunately, the rules in the acronym matching algorithm didn't handle this case correctly.) Finally, a mistake was made in interpreting the guidelines of the task. The names of newspapers were not marked as ORGANIZATIONS, and this too contributed to a drop in scores.</Paragraph> <Paragraph position="3"> Walk-through article The score for the walk-through article was: P&R 75.57 2P&R 76.63 P&2R 74.55 This is very slightly lower than the overall score for the formal run. The worst-scoring group of entities in this article was ENAMEX PERSON, where out of 16 entities, just over half were marked correctly (R 56%, P 53%). This differs from the overall trend, where the system's score for PERSON was a lot higher (R 80%, P 74%). An example of an error that occurred in this text concerns Llennel Evangelista. The sentence: Llennel Evangelista, a spokesman for Intelsat, a global satellite consortium based in Washington, said the accident occurred at 2 p.m. EST Wednesday...</Paragraph> <Paragraph position="4"> was incorrectly analysed for two reasons. First, the name Llennel was not in the database. Second, a parsing error resulted in the analysis of Llennel Evangelista as both a spokesman and a consortium. The label of ORGANIZATION was then (incorrectly) chosen.</Paragraph> <Paragraph position="5"> Although it is helpful for the system to have the names of people in its database, it is not crucial.
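To illustrate the point, the following is a minimal sketch in plain Haskell (not LOLITA's actual SemNet-based analysis) of how an appositive description such as "a spokesman for ..." can supply the entity type even when the name itself is unknown; the mini-lexicon of head nouns is purely illustrative.

-- A minimal illustrative sketch, not LOLITA's semantic analysis: an
-- appositive description can supply the entity type when the name is unknown.
module AppositionSketch where

import Data.Char (toLower)

data EntityType = Person | Organization | Unknown
  deriving (Show, Eq)

-- Purely illustrative mini-lexicon of head nouns and the entity type
-- their concept implies.
conceptType :: String -> EntityType
conceptType w = case map toLower w of
  "spokesman"   -> Person
  "spokeswoman" -> Person
  "consortium"  -> Organization
  _             -> Unknown

-- Classify an unknown name from the head noun of the description it is
-- in apposition with, as delivered by the parser.
classifyFromApposition :: String -> String -> EntityType
classifyFromApposition _unknownName descriptionHead = conceptType descriptionHead

main :: IO ()
main = do
  -- With the correct parse, "a spokesman for Intelsat" yields PERSON.
  print (classifyFromApposition "Llennel Evangelista" "spokesman")
  -- The mis-parse described above attached "consortium" to the name
  -- instead, which is why ORGANIZATION was (incorrectly) chosen.
  print (classifyFromApposition "Llennel Evangelista" "consortium")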
For example, the name LaRae Marsik was also unknown to the system, but this was dealt with correctly, because here the correct parsing facilitated a correct analysis: A spokeswoman for Tele-Communications, LaRae Marsik, said the partners in the Latin American venture intended to begin service by the end of 1996.</Paragraph> <Paragraph position="6"> So, because LaRae Marsik was understood to be a spokeswoman (a concept known to the system), it was possible to conclude that the entity must be a PERSON.</Paragraph> <Paragraph position="7"> A number of named entities in the SLUG and PREAMBLE fields were missed, e.g., three occurrences of MURDOCH. The strings in these fields appeared to be in some special format rather than natural language. The rules required by the system for handling these fields were therefore rather specialised. As these were MUC-7 specific, relatively little effort was spent polishing them; the effort was instead expended on the more generally useful core rule sets.</Paragraph> <Paragraph position="8"> The recognition of organisations was much better in this article: R 64%, P 80%. This is higher than the overall results for the whole test (the score there was R 63%, P 67%). The most common reasons for errors appear to be problems with tokenisation or parsing.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> The Co-reference Task </SectionTitle> <Paragraph position="0"> The LOLITA system's score for the 20 articles in this evaluation was: Recall 46.9% Precision 57.0% f-value 51.5% As with the named entity task, this was a much better score than the system achieved in MUC-6. One of the reasons leading to co-reference resolution errors was the lack of full parsing. The performance of the co-reference resolution task within our system is sensitive to the basic analysis being correct: possibly more so than the other two tasks in which the system was entered. In many cases, the full parse was not available and the parsing recovery mechanism didn't always provide sufficient input to facilitate a good semantic analysis. This led to problems whereby even seemingly easy co-reference links were sometimes lost.</Paragraph> <Paragraph position="1"> Another problem was that, due to a lack of resources, not enough time was devoted to dealing with co-references involving conjoined noun phrases. In the previous MUC such co-references were excluded by the task, and although the LOLITA system has never explicitly excluded co-references involving conjoined noun phrases, the rules that were used required more thorough testing than resources allowed.</Paragraph> <Paragraph position="2"> Some trivial errors also contributed to a reduction in score. For example, in the document 9601160264 the string McDonald was consistently marked instead of McDonald's. Additionally, due to a text output error, a second chain containing some occurrences of McDonald was built. This lowered the score considerably. Once the error was corrected, the score for this article increased by just over 10%. This increase leads to a slightly better overall score of: Recall 48.0% Precision 58.6% f-value 52.8% In a number of articles the system scored particularly well: six of the articles scored f-values well above 60% and one of them as high as 70.5%. The problems that were encountered in other articles were typically due to the lack of resources available for debugging and testing.
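For reference, the f-values quoted in this section combine recall and precision as a balanced harmonic mean; the short sketch below (ordinary Haskell, not the official MUC scoring software) reproduces the figures quoted above from their recall and precision values.

-- A minimal sketch, not the official MUC-7 scorer: the f-value quoted in
-- this section is the balanced harmonic mean of recall and precision.
module FValue where

import Text.Printf (printf)

fValue :: Double -> Double -> Double
fValue recall precision = 2 * precision * recall / (precision + recall)

main :: IO ()
main = do
  printf "formal run: %.1f\n" (fValue 46.9 57.0)  -- 51.5, as reported above
  printf "corrected : %.1f\n" (fValue 48.0 58.6)  -- 52.8, after the McDonald's fix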
The developers believe that with a modest amount of further effort the system's overall scores for this task would have been even higher.</Paragraph> <Paragraph position="3"> Problems such as the one illustrated by the McDonald example point to a difficulty in automatic scoring and evaluation. On a semantic level, the system connected together the correct chain; however, for scoring purposes this constituted a spurious chain. The resulting drop in score is treated the same as other spurious connections that could have been semantically incorrect. A scorer able to include a semantic component would give a more accurate reflection of the success of any 'deeper' analysis that a system may have undertaken.</Paragraph> <Paragraph position="4"> Walk-through article The official score for the co-reference walk-through article is: Recall 45.6% Precision 57.1% f-value 50.7% This was slightly lower than our system's overall score for this task.</Paragraph> <Paragraph position="5"> Although the system performed reasonably well on most of the smaller chains within the article, problems occurred with two of the longer chains: the first involving Hughes and the second the Federal Communications Commission/FCC. In the case of Hughes, problems with the analysis of Hughes' Galaxy VIII(I) led to losses in both recall and precision. The string Hughes was not marked up at all, while Galaxy VIII(I) was split into two separate units, Galaxy VIII and I.</Paragraph> <Paragraph position="6"> We noticed also that in one case our system marked a larger maximal noun phrase than the key. Having changed the key (according to what we believe is consistent with the task description) to include the following as an antecedent: ... the Federal Communications Commission's allocation of a swath of spectrum that will let their earth stations communicate with satellites in space rather than just: ... the Federal Communications Commission's allocation of a swath of spectrum we gain some extra points in the score: Recall 46.8%, Precision 58.7%, f-value 52.1%.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Template Elements Task </SectionTitle> <Paragraph position="0"> The overall score on this task was: P&R 66.75 2P&R 69.74 P&2R 64.01 This result is probably the most satisfying of the three MUC tasks that the system entered. Using LOLITA's template support it was possible to produce very reasonable templates within a relatively short space of time. To prepare the system for this task took only about 10 person-days.</Paragraph> <Paragraph position="1"> The best-performing subtask was the entity slot, particularly where the task required the extraction of organisations and persons. The lowest score was obtained in the ENT_DESCRIPTOR category; it could have been higher had the co-reference performance been better.</Paragraph> <Paragraph position="2"> Walk-through article The walk-through article score was: P&R 76.92 2P&R 77.05 P&2R 76.80 which is better than the system's overall score. The errors made were mainly due to tokenisation. For example, in International Technology Underwriters of Bethesda, Maryland, the system didn't treat the string as the full name of a company but split off Maryland into a separate entity.
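The following is a minimal sketch of the kind of extent check that would keep such names together; it is illustrative only, not the system's tokeniser, and the tiny gazetteer of state names is an assumption made for the example.

-- An illustrative sketch, not LOLITA's tokeniser: keep "<Name> of <City>,
-- <State>" together as a single organisation name rather than splitting
-- the trailing state off as a separate entity.
module ExtentSketch where

import Data.List (isInfixOf, isSuffixOf)

-- Tiny illustrative gazetteer; a real system would use complete data.
usStates :: [String]
usStates = ["Maryland", "Virginia"]

isSingleOrgName :: String -> Bool
isSingleOrgName s =
  " of " `isInfixOf` s && any (\st -> (", " ++ st) `isSuffixOf` s) usStates

main :: IO ()
main =
  print (isSingleOrgName "International Technology Underwriters of Bethesda, Maryland")  -- True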
There was a similar problem with Space Transportation Association of Arlington, Virginia.</Paragraph> <Paragraph position="3"> Also, some inaccurate co-reference resolution resulted in spurious ENT_DESCRIPTORS: for example, the company's Washington headquarters for Bloomberg and Rupert Murdoch's for News Corporation. The latter descriptor error is much less serious than the former, as it does actually make some sense; the former descriptor, however, is simply erroneous.</Paragraph> <Paragraph position="4"> Other errors are data-dependent: for example, the system didn't have some basic geographical data for well-known cities of Europe, resulting in Paris being classified as a region.</Paragraph> <Paragraph position="5"> Despite the above problems, the majority of the underlying analysis for this text was correct. The system was then able to successfully apply the higher-level heuristics at the semantic and pragmatic levels. This demonstrates that, given a correct underlying analysis, the development of the high-level template element application is relatively trivial. Results such as these provide further evidence to the developers that concentrating development on the core analysis is a strategy that will produce the best long-term results.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="evalu"> <SectionTitle> CURRENT ACTIVITIES </SectionTitle> <Paragraph position="0"> Re-engineering of the System The use of Haskell has proved very beneficial in the development of the LOLITA system (see [1] for more details). Of particular benefit is the ability to quickly prototype complex algorithms. More recently the need for a large amount of prototyping has diminished, as the core parts of the system have become relatively stable. In order to improve the system's performance, a major effort is now underway to re-engineer large parts of the system in C++. This has also provided the developers with the opportunity to make some alterations to the structure of the overall system.</Paragraph> <Paragraph position="1"> The aim of the project is still to develop a powerful core set of tools (e.g., the parser) which can then be used by a number of high-level applications. However, the re-engineering process has also offered an opportunity for some more fundamental alterations to the system. An important one of these is the mechanism used for expressing the system's linguistic rules. In the past, rule sets were written as pieces of code. Although this allowed a great amount of control in the rule-writing process, it had obvious limitations: the system had to be recompiled after each change, and the rule writers needed the programming knowledge to translate their rules into code. To avoid these problems, a number of engines are now being implemented that are able to process sets of linguistic rules written in a more appropriate language. Linguists can now write and test rules without needing to understand the underlying engine's code. The programmers can also now concentrate on optimising the engines which process these rule sets.</Paragraph> <Paragraph position="2"> Perhaps the clearest example of such an engine is the parser. In the past, grammatical rules were added as pieces of code and the system re-compiled. This relatively inefficient grammar development process has now been superseded by one in which the running system loads the grammar from appropriate files.
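As an illustration of this change, the sketch below loads rules at run time; it is written in Haskell for brevity rather than the C++ of the new engine, and the one-rule-per-line "LHS -> RHS1 RHS2 ..." file format and file name are assumptions, not the actual grammar format.

-- A minimal sketch of run-time grammar loading, under an assumed file
-- format; it is not the re-engineered C++ parser itself.
module GrammarLoad where

data Rule = Rule { lhs :: String, rhs :: [String] }
  deriving Show

-- Parse one line of the assumed "LHS -> RHS1 RHS2 ..." format.
parseRule :: String -> Maybe Rule
parseRule line =
  case words line of
    (l : "->" : rest@(_:_)) -> Just (Rule l rest)
    _                       -> Nothing

-- Load the grammar from a file while the system is running, so that
-- changing a rule no longer requires recompilation.
loadGrammar :: FilePath -> IO [Rule]
loadGrammar path = do
  contents <- readFile path
  return [r | Just r <- map parseRule (lines contents)]

main :: IO ()
main = do
  rules <- loadGrammar "grammar.rules"   -- hypothetical file name
  mapM_ print rules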
This approach allows the grammarian to concentrate their effort on the grammatical rules rather than on their implementation. Using C++, the programmers have also been able to develop some very low-level optimisations that have greatly increased the speed of the parser whilst at the same time reducing its memory requirements. The new parser is estimated to be some 10 times faster and to require a fifth less memory. (This parser was not available in time for the MUC-7 evaluations.) The developers view this re-engineering process as a natural progression of LOLITA from the research arena to the commercial world.</Paragraph> <Paragraph position="3"> Addition of Dictionary Definitions to Knowledge Base A natural language processing system requires knowledge at a number of important levels. The required knowledge includes: Grammatical word information: knowledge about the structure of a word.</Paragraph> <Paragraph position="4"> Semantic word information: knowledge about the meaning of a word, e.g., - `soccer' is a game played on a pitch by two teams, - `sell' involves a transfer of money from a buyer to a seller.</Paragraph> <Paragraph position="5"> World knowledge: knowledge such as - what particular objects are used for, - why particular events happen.</Paragraph> <Paragraph position="6"> It is widely recognised that knowledge about words and their meanings (the first two types) is already available in conventional dictionaries. However, these are aimed at human readers. For a computer to utilise knowledge in a dictionary, the definitions need to be processed to extract and represent the information in a suitable form. A major project is currently underway which uses LOLITA to help process dictionary knowledge for computer use. The dictionary which has been selected for this project is the Cambridge International Dictionary of English (CIDE). The aim is to incorporate the knowledge contained in CIDE into SemNet. It is estimated that this will increase the size of SemNet from over 100,000 nodes to well over a million.</Paragraph> <Paragraph position="7"> To carry out this process, LOLITA is used to analyse as much of the definition as possible. However, help is required in resolving various problems and ambiguities that occur in the original definition. This help is given by the user in the form of a question-answering session (see [10] for more details). The questions fall into a number of different categories: choosing grammatical categories, picking word meanings, entering word information, solving structural ambiguities, finding referents for mentioned words, finding referents for implicit objects, making objects more specific, naming events and entities, naming relationships, and confirming the analysis.</Paragraph> <Paragraph position="8"> The categories of questions in the above list also represent a rough estimate of the order of the question-answering process (earlier questions such as the picking of word meanings may also occur in a different context later in the process). In practice, the interaction with the natural language system is via a Graphical User Interface. LOLITA processes as much of the definition as it is able and then typically presents the user with a question and a list of possible answers. The user then simply uses the mouse to select the most appropriate answer. Occasionally a question is asked that requires the name of an entity (or relationship) to be entered; a text entry box is then provided for that purpose.
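The sketch below illustrates the two answer modes just described (selecting from a list of candidates, or typing in a name); a plain console loop stands in for the Graphical User Interface, and the example questions and word senses are hypothetical rather than taken from CIDE.

-- A minimal sketch of the question-answering interaction described above:
-- multiple-choice questions answered by selection, and occasional questions
-- answered by typing a name. Illustrative only; not the actual interface.
module DictionarySession where

data Question
  = MultipleChoice String [String]   -- question text and candidate answers
  | FreeText String                  -- question text requiring a typed name
  deriving Show

ask :: Question -> IO String
ask (MultipleChoice q options) = do
  putStrLn q
  mapM_ (\(n, o) -> putStrLn (show n ++ ". " ++ o)) (zip [1 :: Int ..] options)
  choice <- readLn                   -- the user would click in the real GUI
  return (options !! (choice - 1))
ask (FreeText q) = do
  putStrLn q
  getLine                            -- stands in for the text entry box

main :: IO ()
main = do
  -- Hypothetical questions of the kinds listed above.
  sense <- ask (MultipleChoice "Which meaning of 'pitch' is intended?"
                               ["playing field", "musical tone", "tar-like substance"])
  name  <- ask (FreeText "Name the event described by this definition:")
  putStrLn ("Chosen sense: " ++ sense ++ "; event name: " ++ name)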
It is anticipated that it will take several person-years to completely analyse the dictionary. The system has been designed so that it can be used, with little in the way of training, by people who are not specialists in the area of NL. This greatly increases the number of people who can enter the knowledge and so reduces the time required to complete the project.</Paragraph> </Section> </Paper>