<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1024"> <Title>Automatic Extraction of Facts from Press Releases to Generate News Stories</Title> <Section position="4" start_page="170" end_page="171" type="metho"> <SectionTitle> 3. Business Problem </SectionTitle> <Paragraph position="0"> A major component of Reuters' business is to provide real-time financial news to financial traders. Corporate earnings and dividend reports are two routine, but extremely important, types of financial news that Reuters handles.</Paragraph> <Paragraph position="1"> Publicly-traded companies must, by law, provide this information periodically, and equities traders rely on news services like Reuters to distill the companies' reports and make the information available within minutes or even seconds so that they can use it to make decisions about which stocks to buy and sell. It is imperative that the reports be generated very quickly and very accurately; if Reuters can produce important earnings and dividend stories first, they will have the edge in the very competitive real-time financial news market.</Paragraph> <Paragraph position="2"> One important electronic source of earnings information is PR Newswire, which provides a wide range of press releases on many topics to subscribers. Figure 1 is a typical earnings press release received through the PR Newswire service. Figure 2 shows the corresponding Reuters news story which a reporter would generate from this release.</Paragraph> <Paragraph position="3"> While the production of these reports is crucial to Reuters' business, it is a routine, tedious task that demands just enough domain knowledge and human intelligence to require trained reporters. JASPER helps Reuters produce earnings and dividend news stories substantially faster, with fewer errors, and with less tedium. 
JASPER automatically generates draft earnings and dividend stories from the press releases carried on PR Newswire and makes them available</Paragraph> </Section> <Section position="5" start_page="171" end_page="171" type="metho"> <SectionTitle> /FROM PR NEWSWIRE MINNEAPOLIS 612-871-7200/ TO BUSINESS EDITOR: GREEN TREE ANNOUNCES THIRD QUARTER RESULTS </SectionTitle> <Paragraph position="0"> ST. PAUL, Minn., Oct. 17 /PRNewswire/ -- Green Tree Acceptance, Inc. (NYSE, PSE: GNT) today reported net earnings for the third quarter ended Sept. 30 of $10,395,000, or 70 cents per share, compared with net earnings of $10,320,000, or 70 cents per share, in the same quarter of 1989.</Paragraph> <Paragraph position="1"> For the nine months, net earnings were $26,671,000, or $1.70 per share, compared with the first nine months of 1989, which had net earnings of $20,800,000, or $1.21 per share.</Paragraph> </Section> <Section position="6" start_page="171" end_page="171" type="metho"> <SectionTitle> GREEN TREE ACCEPTANCE, INC. STATEMENT OF EARNINGS </SectionTitle> <Paragraph position="0"> to reporters for editing. Reporters need only check the information and make any necessary changes.</Paragraph> <Paragraph position="1"> In all, JASPER attempts to extract 56 different values from an earnings release, though not all of these will ever be present in any given release. Most of the values that JASPER extracts are numbers -- net income, per share income, revenues, sales, average number of shares outstanding, etc. -- and most information types are reported for four time periods: the quarter just ended, the corresponding quarter of the prior year, the fiscal year to date just ended, and the corresponding year to date period of the prior year. 
Other information types have only one value; these include the quarter being reported (Q1, Q2, Q3, or Q4), the end date of the quarter being reported, the place of origin of the release, the dividend, the date on which the dividend will be paid, etc.</Paragraph> <Paragraph position="2"> The JASPER system was developed between December, 1990 and August, 1991. The software was installed in early August, 1991, and reporters in New York and other Reuters offices in the United States began experimental use of the system immediately.</Paragraph> <Paragraph position="3"> Results of this use have shown that JASPER does its job quickly and accurately.</Paragraph> <Paragraph position="4"> * JASPER processes the average earnings or dividend release in approximately 25 seconds.</Paragraph> <Paragraph position="5"> * By the standard measures of recall and precision, the system is over 96% accurate overall in selecting relevant releases for processing.</Paragraph> <Paragraph position="6"> * By corresponding measures for fact extraction, the system is over 84% accurate overall in extracting the desired information from the selected releases. Over 90% of the values that JASPER places in the stories it generates are correct.</Paragraph> <Paragraph position="7"> * JASPER handles 33% of targeted releases perfectly. It handles 21% of all earnings stories with no errors or omissions whatever, and handles 82% of all dividend releases with no errors or omissions.</Paragraph> </Section> <Section position="7" start_page="171" end_page="174" type="metho"> <SectionTitle> 4. Technical Approach </SectionTitle> <Paragraph position="0"> Upon receiving a press release from PR Newswire, JASPER first determines whether it is &quot;relevant&quot; -- that is, whether it is one of the earnings or dividend releases from which we wish to extract information. Carnegie Group's Text Categorization Shell (TCS) [1] is used to do this selection. 
Only about 20% of the information on the wire is relevant.</Paragraph> <Paragraph position="1"> JASPER has a frame representation which defines the specific information types to be extracted from relevant texts. These frames guide the remainder of the processing. The slots of the frame define what information is to be extracted and also hold information about how the processing for each slot is to be performed.</Paragraph> <Paragraph position="2"> For each slot in the frame, the system tries to match against each sentence an associated set of patterns of words; if any of the patterns match, a procedure, or extraction method, also associated with the particular slot, is called to decide whether the patterns which matched can be used to assign a value to the slot. The extraction method may decide that no slot value should be assigned, or it may translate the information that matched into a canonical form and store it in the frame. Once all available information has been extracted and stored in the frame, JASPER generates a news story from the information and makes the story available to reporters for editing.</Paragraph> <Paragraph position="3"> Together, the patterns and the extraction methods make up the application-specific rulebase. The rulebase is tailored to the syntactic structures and vocabulary that we have observed in our analysis of the corpus. JASPER does not do complete syntactic parsing or complete semantic analysis of the text. Instead, it matches &quot;sketchy&quot; patterns, looking only for relevant words or phrases within sentences. The extraction methods too were written expressly to handle the forms that we have observed in PR Newswire texts. 
The rulebase makes certain assumptions about the language it expects to find in a text; while these assumptions are not always borne out, they are in most cases, and JASPER reaches a very high level of accuracy because of them.</Paragraph> <Paragraph position="4"> The input press releases often have a table along with the textual part, as in the example in Figure 1. The information contained in the two parts often overlaps, but in most cases neither the textual nor the tabular part gives all the required information. We therefore extract the information from both the text and the table and then merge the two sets of values. In this paper we do not discuss the techniques used to extract information from tables.</Paragraph> <Paragraph position="5"> JASPER runs under Ultrix on a DECstation 3100. The dedicated standalone DECstation has loose system interfaces to the PR Newswire feed and to a Tandem computer on which the reporters edit stories. The core extraction system runs in Lucid Common Lisp and uses the Common Lisp Object System (CLOS) to represent its frames.</Paragraph> <Section position="1" start_page="172" end_page="172" type="sub_section"> <SectionTitle> 4.1. Text Understanding Control </SectionTitle> <Paragraph position="0"> The control of the text understanding component of JASPER follows a simple algorithm. For each sentence in the release, JASPER checks every item on an ordered list of targeted information types, or slots, to determine whether a value has already been assigned to the corresponding slot.</Paragraph> <Paragraph position="1"> If no value has yet been stored, JASPER tries to match the current sentence against a set of patterns associated with that slot. 
If any pattern matches, tentatively identified values from the sentence are bound to pattern matcher variables, and the extraction method associated with that information type is called to interpret the results of the pattern matching.</Paragraph> <Paragraph position="2"> The extraction methods are application-specific procedures associated with individual slots which use the results of pattern matching to determine whether any slots should be filled and what value(s) should be used. If an extraction method assigns a value to a slot, the slot is marked as &quot;done&quot; and is removed from the list of slots to try on subsequent sentences.</Paragraph> <Paragraph position="3"> 4.2. The Pattern Matcher One important component of Carnegie Group's Text Categorization Shell is a powerful pattern matcher which matches complex patterns of words written in a specialized pattern language against text. This technology is also central to JASPER's fact extraction technology. The network-based left-to-right pattern matcher includes disjunction, negation, optionality, and skipping operators, and performs regular and irregular English morphology transformations when words are specified as nouns or verbs.</Paragraph> <Paragraph position="4"> The following pattern illustrates the pattern matching operators:</Paragraph> <Paragraph position="6"> This pattern says to match either the word profit or profits (+!; indicates that it is a noun) or earnings, followed within eight words by any number ($n), followed optionally by million, followed by dollar or dollars; and a match will fail if the phrase per share follows dollar. This pattern would match in sentences like the following: quarter of 1990 will exceed expectations at 45.6 million dollars.</Paragraph> <Paragraph position="7"> The former sentence will fail because per share follows dollars. 
The latter will fail because more than eight words intervene between earnings and the number. JASPER uses an extended version of the TCS pattern matcher for extracting information. It not only provides a boolean indication of whether a pattern matched, but also saves the information that we want to extract from the matches as special variables. A variable binding operator was added which can transform words matched in the text into a canonical form or simply save the words that matched. For example, this pattern</Paragraph> <Paragraph position="9"> will match any number and bind the number that matched to the pattern matcher variable %number.</Paragraph> <Paragraph position="10"> This variable binding operator can also canonicalize values, as shown in the following pattern:</Paragraph> <Paragraph position="12"> Patterns like this one can match a variety of expressions with the same meaning, binding a pattern matcher variable to a single form representing this meaning. This pattern matches all of the following phrases and binds the variable %quarter to 4 in every case: fourth quarter, 4th quarter, 4th qtr, fourth qtr, Q4. Once the crucial information is saved as pattern matcher variables, it can be used by the extraction methods to fill in values in a frame representation of the text.</Paragraph> </Section> <Section position="2" start_page="172" end_page="174" type="sub_section"> <SectionTitle> 4.3. Knowledge Representation </SectionTitle> <Paragraph position="0"> JASPER uses CLOS to control the extraction processing and to store the extracted information. Each type of release from which we extract information -- earnings and dividends -- has a frame, or CLOS class, associated with it, with a slot for each information type that JASPER extracts. Figure 3 shows a portion of the earnings frame.</Paragraph> <Paragraph position="1"> As mentioned above, we are interested in extracting numbers for four different time periods for many information types. 
The slots current-quarter-net, prior-quarter-net, current-ytd-net, and prior-ytd-net in Figure 3 represent the four slots for net income. All four slots are processed together using the same patterns and extraction methods; in order to accomplish this, a group slot, net-income-group in the example, is defined to hold the information required for processing these slots. The individual slots corresponding to each time period then hold the specific values extracted from the text.</Paragraph> <Paragraph position="2"> Other information types have just one slot; for example period-reported in the example represents the period for which earnings are being reported (Q1, Q2, Q3, or Q4). This slot contains the information about how to extract the information -- the patterns and extraction methods -- and also holds the value once it is extracted.</Paragraph> <Paragraph position="3"> Each of the slots in the earnings frame in turn has a class as its value; these classes store information about how to do the extraction, and once information has been extracted from the press release, they store the value extracted. Each of these classes has the following slots associated with it for extracting from text: This section describes application-specific patterns and procedures used for fact extraction. In analyzing the relevant texts, we found tremendous regularity in the language and syntactic structures used due to stylistic conventions followed by U.S. companies in reporting earnings and dividends. The patterns and extraction methods take advantage of these regularities, handling the forms that are most likely to occur in the text with a high level of accuracy, and the forms that occur less frequently or not at all less accurately.</Paragraph> <Paragraph position="4"> The patterns used for extraction tend to match &quot;sketchy&quot; phrases, with skipping between the relevant elements of the pattern. 
For example, in order to find the net income we need to know that earnings are under discussion and we need to know what the amount of the earnings was; we can skip over other irrelevant information. A pattern like the following was used for net income:</Paragraph> <Paragraph position="6"> The patterns and extraction methods follow a few main strategies, depending on the kind of information to be extracted. Each of the strategies is described below.</Paragraph> <Paragraph position="7"> 4.4.1. Extracting Information for Simple Slots Several slots for earnings and dividends required a very simple strategy. The reporting period for earnings is an example of this type of slot. The patterns match simple phrases and bind a variable to the value to be extracted; the extraction method then takes the value bound to the variable, canonicalizes it if necessary, and fills in the appropriate value in the frame.</Paragraph> <Paragraph position="8"> Below is the pattern for the fourth quarter reporting period:</Paragraph> <Paragraph position="10"> If this pattern is matched, the pattern matcher variable %quarter is bound to the value 4. The extraction method for the reporting-period slot is then called to fill in the value for the slot in the frame.</Paragraph> <Paragraph position="11"> 4.4.2. Understanding Time Context in Text Earnings figures are generally given for four periods. In order to interpret the numbers in an earnings release, the system must not only find the figures reported and determine which information type they refer to (e.g. net income), but must also know the time period they apply to -- the current or prior year, and the quarter or the year to date. For efficiency and for accuracy in handling elliptical time expressions, we handled time phrases separately, maintaining a time context which is then used to determine which of the four group slots to fill with the figures extracted. 
This time context makes it possible to process pairs of sentences like the following: * Earnings during the fourth quarter of 1990 were 50.5 million dollars. Sales were 74.3 million dollars.</Paragraph> <Paragraph position="12"> When JASPER processes the first sentence it stores as the time context in working memory the fact that the last period mentioned was a quarter and the last year mentioned was the current one. After the time context is set up in this way, the earnings information is processed. The following sentence gives sales information, but does not provide any information about time. Despite this, the persistent time context in working memory allows us to determine that the slot to fill is the sales slot for the current quarter rather than for the prior quarter or for one of the year-to-date slots. The extraction procedures for time handling use heuristics based on our analysis of the particular texts to be handled and on our knowledge of English syntax, semantics, and pragmatics. While JASPER does not handle all time contexts correctly, it performs very well on the types that occur in the corpus of PR Newswire earnings reports.</Paragraph> <Paragraph position="13"> 4.4.3. Extracting Numbers for Group Slots JASPER uses the same strategy for filling in all slots in earnings releases that require number values. We will use net income as an example. Net income has four specific slots to fill, one for each of the reported time periods; all are handled together by the net-income-group slot, which has a single set of patterns to match and a single extraction method to sort out which of the specific slot(s) to fill when relevant patterns match.</Paragraph> <Paragraph position="14"> The net-income-group slot has two sets of patterns, informally called current patterns and prior patterns: * current patterns match a word or phrase like earnings followed at some distance by a number; the number is bound to a pattern matcher variable. 
For example,</Paragraph> <Paragraph position="16"> * prior patterns match, in different orders, a word like earnings and a comparison word (e.g., compared, increase ... from, rise ... from, versus, etc.) followed at some distance by a number, which is bound to a pattern matcher variable. The following is an example of one such pattern: These two patterns match the net income from the current and prior period in sentences like the following: * XFZ Company's profits for the current year increased from 45.5 million dollars last year to 50 million dollars. The time context described above is used to help determine which time period the extracted numbers refer to.</Paragraph> <Paragraph position="17"> Conflicts between multiple matches are resolved by a heuristic procedure which allows JASPER to handle very complex sentences like the following with perfect accuracy: * ABC Company reported net earnings of 50 million dollars or 45 cents per share on revenues of 62 million dollars this year compared to earnings of 55 million dollars or 51 cents per share on revenues of 71.1 million dollars last year.</Paragraph> </Section> </Section> <Section position="8" start_page="174" end_page="175" type="metho"> <SectionTitle> 5. Status and Results </SectionTitle> <Paragraph position="0"> JASPER was deployed for testing and use by reporters in early August 1991. Reporters in New York and other Reuters offices in the United States are currently using the system as an aid in producing earnings reports from PR Newswire announcements.</Paragraph> <Paragraph position="1"> Accuracy tests run at Carnegie Group on a set of press releases that the system developers had never seen showed that JASPER's accuracy compares favorably with the results seen at the Second Message Understanding Conference (MUCK-II) [9].</Paragraph> <Paragraph position="2"> JASPER also runs quickly enough to be used in this real-time application at an average of about 25 seconds per relevant press release. 
Reuters required processing to be less than 30 seconds in order for the journalists to get the stories out in the very tight timeframes they have to work with.</Paragraph> <Section position="1" start_page="174" end_page="175" type="sub_section"> <SectionTitle> 5.1. Accuracy </SectionTitle> <Paragraph position="0"> Before delivering JASPER we ran an accuracy test on 100 earnings releases and 50 dividend releases that the system developers had not seen or analyzed prior to the test.</Paragraph> <Paragraph position="1"> Accuracy scores were calculated by manually comparing the values extracted by JASPER with the correct values specified by a Reuters journalist. We measured accuracy separately for selection of relevant releases and fact extraction. Results are reported below.</Paragraph> <Paragraph position="2"> Selection refers to the identification of relevant earnings and dividend reports in the stream of press releases from PR Newswire. Selection is measured with the standard measures of recall and precision. Recall is the percentage of actual earnings and dividend announcements that the selection process succeeds in finding. If recall is high, the system is not missing many items that it should select.</Paragraph> <Paragraph position="3"> Precision is the percentage of announcements that JASPER selects that are actually relevant, i.e. relate to earnings or dividends. If precision is high, the system is not wrongly selecting many items that should not be selected. These measures correlate closely with the recall and precision measures used for MUCK-II, with only minor differences.</Paragraph> <Paragraph position="4"> The figures in Figure 4 are based on 1047 PR Newswire releases, representing four days' transmissions. The &quot;Expected&quot; figures represent the number of relevant releases actually present in the sample; the &quot;Assigned&quot; figures represent the number of releases selected by JASPER. 
We calculate overall accuracy as the average of the recall and precision scores.</Paragraph> <Paragraph position="5"> To compare our results with those of MUCK-II, we have chosen the highest score for recall and precision for each of four tests: two tests each with two different data sets. The first test on each data set was run &quot;cold&quot; -- the system developers had not seen the data in advance. The second test in each case was run after the system developers had made some changes to accommodate the test data. The best recall and precision scores for each test do not necessarily come from the same system.</Paragraph> <Paragraph position="6"> We use two measures of accuracy for fact extraction: completeness and correctness. Completeness corresponds roughly to the recall measure used in MUCK-II, and to the recall measure used for selection; it measures the percentage of targeted values available in the PR announcements that are actually extracted correctly by the system. A targeted value is one that should, according to Reuters practice and style guidelines, appear in the Reuters news story.</Paragraph> <Paragraph position="7"> completeness = (correct values extracted) / (total targeted values). Correctness corresponds roughly to the precision measure used in MUCK-II and the precision measure used for selection; it measures the percentage of times that a value extracted by the system is correct.</Paragraph> <Paragraph position="8"> correctness = (correct values extracted) / (total values extracted). JASPER was designed with an emphasis on correctness rather than on completeness on the assumption that reporters are less likely to overlook gaps than wrong values in the story. 
To compensate for this built-in bias, we also calculate an overall accuracy figure for extraction by averaging the percentages obtained for completeness and correctness.</Paragraph> <Paragraph position="9"> In the accuracy results given in Figure 6, the &quot;Unadjusted&quot; figures are the raw results of the test. The &quot;Adjusted&quot; figures take into account typographical errors in the PR Newswire input (treating them in our favor), as well as the judgments of the same Reuters reporter regarding permissible deviations from his output. The figures &quot;With Changes&quot; are based on a second test on the same input after some changes had been made to the extraction rulebase. scores from each of four tests for their correlates of completeness and correctness are given. The completeness and correctness scores do not necessarily come from the same system for any given test. The four tests involved two tests each of two different data sets. The first test on each data set was run &quot;cold&quot; -- the system developers had not seen the data in advance. The second test in each case was run after the system developers had made some changes to accommodate the test data, and so should correspond roughly to our test &quot;with changes&quot;. While the MUCK-II accuracy measures differ somewhat from JASPER's, we believe that they are similar enough to show that JASPER compares favorably with the results of the systems which competed in MUCK-II.</Paragraph> </Section> </Section> </Paper>