<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1109"> <Title>From Information Retrieval to Information Extraction</Title> <Section position="5" start_page="85" end_page="91" type="metho"> <SectionTitle> 3 Highlight </SectionTitle> <Paragraph position="0"> Highlight (Thomas et al., 2000) is a general-purpose IE engine for use in commercial applications. The work described in the current paper extends Highlight by building an interface on top of it and replacing the internal representation of linguistically analysed texts with a representation based on distributed representations c.f. (Milward, 2000). Demos of the Highlight System are accessible from the SRI Cambridge web site at http://www.cam.sri.com</Paragraph> <Section position="1" start_page="85" end_page="86" type="sub_section"> <SectionTitle> 3.1 Distributed Representation </SectionTitle> <Paragraph position="0"> The inspiration for the approach taken here was the work in (Milward, 2000) on interpretation of dialogue utterances. This uses distributed semantic representations (based on indexing individual parts of a semantic structure) to encode both full and partial semantic analyses. For example, the logical form P(a,b) is represented by the set of constraints given in Figure 1.</Paragraph> <Paragraph position="1"> This approach aims to combine the advantages of shallow pattern-matching approaches (which are fast and robust) with those of deeper analysis which tends to do better for phenomena which involve scope. For example, consider the utterance &quot;The 2.10 flight to Boston will not stop at New York&quot;. A simple pattern matching approach looking for \[flight 1 to \[location/via \[location\] would extract the wrong information since it has no notion of the scope of the negation.</Paragraph> <Paragraph position="2"> In this work we take the same idea, but apply it to Information Extraction. Sets of indexed constraints (which are themselves partial descriptions of semantic structure) are used directly as search expressions. Since the indexed constraints can encode full semantic structures, the search terms can be more or less specific: we can start with just constraints on lexical items, and then build up to a full semantic structure by adding structural con- null straints.</Paragraph> <Paragraph position="3"> Search is now a question of expressing the appropriate constraints on the information to be extracted. For example, Figure 2 is a sketch rule to extract departure time from a sentence such as &quot;I leave at 3pm.&quot; The rule subsumes a range of rules which vary in the specificity with which they apply. For example, applying only the obligatory conditions will result in the extraction of 3pro as departure time from sentences such as &quot;I leave Cambridge to arrive at 3pm&quot; since in the obligatory component there is no restriction on M with respect to T, i.e. the leave event and the time. Adding the optional constraints in (1) gives us the restriction that the leaving must be dominated by the preposition at, e.g. &quot;I leave from the station at 3pm,&quot; and (2) requires immediate dominance: &quot;I leave at 3pm.&quot; This ranges from pattern matching to almost full semantics. Any particular application of a rule can be scored according to the number of optional conditions which are fulfilled, thereby using linguistic constraints if they are available.</Paragraph> <Paragraph position="4"> In practice, the system described here departs from this approach in several respects. 
<Paragraph position="4"> In practice, the system described here departs from this approach in several respects. Firstly, we were working with the pre-existing Highlight system, which uses shallow syntactic processing based on cascaded patterns (similar to (Hobbs et al., 1996)). Deriving semantic relationships from this is not particularly reliable, so it is preferable to use search terms which rely more on positional clues. We therefore used a distributed syntactic representation to provide more reliable syntactic constraints (e.g. head word of, subject of, same sentence) augmented by positional constraints (e.g. precedes, immediately precedes). Secondly, we wanted an intuitive user interface, so constraints such as 'dominates' were not appropriate.</Paragraph>
</Section>
<Section position="2" start_page="86" end_page="88" type="sub_section">
<SectionTitle> 3.2 User Interface </SectionTitle>
<Paragraph position="0"> Given the underlying representation described above, the user's task in building a query is to propose a set of constraints which can be matched against the representation of a text or set of texts.</Paragraph>
<Paragraph position="1"> The interface attempts to preserve the convenience of keyword-based IR while also enabling more refined searches and control over the presentation of results. Keyword-based search is a special case in which the user specifies one or more keywords which they want to find in a document. However, users can also specify that they want to find a class of item (e.g. companies) and refine the search to items within the same sentence, not just the same document.</Paragraph>
<Paragraph position="2"> For example, to find 'Esso' and a location in the same sentence we need the set of constraints given below:</Paragraph>
<Paragraph position="4"> The interface emphasises the items the user wants to search for. Consider Figure 3. The user is looking for two items, the first including the word Esso (this could be e.g. "Esso Corp.", "Esso Holdings", etc.), and the second an item of sort 'location'. The effect of pressing 'Add' is to include the positional constraint that the two items must appear in the same sentence. In later parts of this paper we provide examples of more sophisticated queries which imitate what is currently achieved by pattern matching in more standard IE systems (e.g. Fastus (Hobbs et al., 1996)).</Paragraph>
<Paragraph position="6"> In our approach IE is a seamless extension of IR. This can be contrasted with more typical ways of tying together IE and IR in which an IR search is followed by IE: IR is used first to select a limited set of documents, then IE is applied to them. This is fine if the queries and the keywords are fixed (and appropriate for each other), but it is not ideal otherwise. For example, suppose you are interested in Esso's profits. To achieve your query you might use IR to find documents containing Esso, then use an IE system which has been customised to look for company profits. However, some of the results are likely to be unexpected: for example, you would obtain Shell's profits if there were a document describing Shell's profits which just happened to mention Esso in passing.</Paragraph>
<Paragraph position="7"> Items can be constrained to have a particular head word, to include a particular word, or to be of a particular sort. Multiple constraints on a single item are possible, as shown in Figure 5. Positional constraints can include any kind of inter-item constraint, e.g. precedes or same sentence.</Paragraph>
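As an illustration of such a query (the relation vocabulary below is our assumption, not the interface's actual notation), the Figure 3 search for 'Esso' and a location in the same sentence might be rendered and satisfied as follows:

    from itertools import permutations

    # Item I1 must include the word "Esso", item I2 must be of sort
    # location, and the two must occur in the same sentence.
    query = {
        ("I1", "word", "Esso"),
        ("I2", "sort", "location"),
        ("I1", "same_sentence", "I2"),
    }

    # Constraints as preprocessing might produce them for one
    # (invented) document:
    doc = {
        ("e1", "word", "Esso"), ("e1", "sort", "company"),
        ("e2", "word", "Rotterdam"), ("e2", "sort", "location"),
        ("e1", "same_sentence", "e2"),
    }

    def satisfy(query, doc):
        """Bind the query's item variables (I1, I2, ...) to document
        entities so that every query constraint appears in the doc."""
        qvars = sorted({x for c in query for x in (c[0], c[2])
                        if x.startswith("I")})
        entities = sorted({c[0] for c in doc})
        for assignment in permutations(entities, len(qvars)):
            b = dict(zip(qvars, assignment))
            if all((b.get(i, i), r, b.get(v, v)) in doc
                   for (i, r, v) in query):
                return b
        return None

    print(satisfy(query, doc))  # {'I1': 'e1', 'I2': 'e2'}

Dropping the same_sentence constraint recovers plain keyword-and-sort search within a document, which is the sense in which keyword IR is a special case of these queries.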
<Paragraph> There are further option buttons which concern the files to be queried and the layout of the output. The default is to provide a table which includes each item found, plus the sentence in which it appears in the document.</Paragraph>
<Paragraph position="8"> Two levels of user expertise (novice and expert) allow more or less control over the technical details of the query. Expert mode (accessed by clicking the button at the top right) looks much the same but has facilities such as altering the template output, naming the items, and more options on the pull-down menus. For instance, in expert mode the user may specify which items are optional in the query, and what their syntactic class might be. There are also additional parameters for the expert user concerning the output templates.</Paragraph>
<Paragraph position="10"> A typical user query is given in Figure 4. Here the user is looking for appositives in the pattern Person Noun of Company, for example John Smith, chairman of X-Corp.1 Note that the query is not particularly linguistically sophisticated: the Position item is glossed as a noun group, and the single preposition of is specified when a looser restriction (perhaps allowing any preposition) would certainly turn up more results. However, this query is quick and simple to construct and can be used as a diagnostic for a more detailed query.</Paragraph>
1 We currently do not index commas, but a positional constraint representing separation by a comma could easily be added.
<Paragraph position="11"> A more complex query is shown in Figure 5 for a protein interaction task. The sort interaction in Figure 5 could be defined as a disjunction of the constraints (head word = interact, head word = bind, head word = associate).</Paragraph>
</Section>
<Section position="3" start_page="88" end_page="88" type="sub_section">
<SectionTitle> 3.3 One-off Query vs Pattern Base </SectionTitle>
<Paragraph position="0"> Highlight can operate in two distinct modes: interactive and batch. The interactive mode suits one-off queries as seen above, but can also be used to prototype queries on a subset of a corpus. Once a user is satisfied with the accuracy of a query, this user-defined 'pattern' can be stored for later use. Batch mode is used when a query is to be run over a large amount of text and requires no intervention from the user other than the initial set-up.</Paragraph>
</Section>
<Section position="4" start_page="88" end_page="91" type="sub_section">
<SectionTitle> 3.4 Preprocessing Files and Scalability </SectionTitle>
<Paragraph position="0"> If a user makes two queries over the same set of documents it does not make sense to do all the linguistic processing twice. To avoid this, documents can be preprocessed. Preprocessing involves tagging, chunking, recognition of sorts, and conversion of the results into a set of constraints. At query time there is no further linguistic processing to be done, just constraint satisfaction.</Paragraph>
<Paragraph position="2"> The system has a relatively simple outer loop which considers a query for each document in turn (rather than e.g. using a single cross-document index). If a document has not been preprocessed, it is processed and then tested against the query. If a document has been preprocessed, the results of preprocessing are loaded and tested against the query. Preprocessing produces a worthwhile increase in speed.</Paragraph>
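A sketch of that outer loop, under the obvious assumptions (preprocess stands in for the tagging, chunking, and sort-recognition pipeline, satisfy for the constraint matcher sketched above, and the pickle cache is purely illustrative):

    import os
    import pickle

    def run_query(query, doc_paths, cache_dir="preprocessed"):
        """One pass over the documents: load cached constraint sets
        where they exist, build and cache them where they do not,
        then do constraint satisfaction only."""
        os.makedirs(cache_dir, exist_ok=True)
        results = []
        for path in doc_paths:
            cached = os.path.join(cache_dir,
                                  os.path.basename(path) + ".pre")
            if os.path.exists(cached):
                with open(cached, "rb") as f:
                    constraints = pickle.load(f)   # no linguistics here
            else:
                constraints = preprocess(path)     # full linguistic pass
                with open(cached, "wb") as f:
                    pickle.dump(constraints, f)
            binding = satisfy(query, constraints)
            if binding is not None:
                results.append((path, binding))
        return results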
<Paragraph> Loading the preprocessed files ready for constraint satisfaction takes (on average) less than a tenth of the time it takes to process the files to get to the same stage. However, preprocessed files do take around 60 times more filespace than the source files from which they are derived.2</Paragraph>
2 This is worse than it need be: we have not yet attempted to rationalise the preprocessed files, or to use encoding schemes to reduce redundancy.
<Paragraph position="3"> Although loading a preprocessed file is much faster than processing from scratch, the loading is still a significant factor in the total processing time. In fact, for simple queries the loading accounts for over 90% of processing time. We were therefore keen to load only those files where there was a chance of success for a query. One way to do this is to split the query into an IR and an IE component and to use IR to pre-filter the set of documents. If we only have to load one tenth of the documents then again we can expect a tenfold speed-up (assuming the time taken by the IR stage is trivial relative to the whole).</Paragraph>
<Paragraph position="4"> A simple way to achieve IR filtering is to extract any non-optional keywords from the query, and then only process those documents that contain the keywords. However, many of the queries which we use do not contain keywords at all, only sorts. In these cases we cannot run IR over the source documents, since these do not contain the sortal information. Instead we search over a summary file which is created during preprocessing. This contains a set of all the sorts and words found in the file. The IR stage consists of selecting just those files which match the sorts and keywords in the query. This set is then passed to the IE component, which deals with relational constraints such as same sentence, and with interactions between constraints which have to be calculated on the fly, such as precedes, which are not indexed during preprocessing.</Paragraph>
<Paragraph position="6"> The IE component satisfies the constraints in the query against the constraints in the preprocessed file. The constraint solver tries to satisfy the most specific constraints in the query first. Constraints are indexed and reverse indexed for efficiency.</Paragraph>
<Paragraph position="7"> The processing times we currently achieve are very good by the standards of Information Extraction, and are adequate for the applications for which we have been using the system. The current approach is similar to that of e.g. Molla and Hess (Aliod and Hess, 1999), who first partition the index space into separate documents and use the IR component of queries as a filter.</Paragraph>
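Under the same assumptions as the earlier sketches, the summary file and the IR stage reduce to a subset test: the summary keeps only the words and sorts of a document, and a file is loaded for full constraint satisfaction only if its summary covers everything the query demands.

    def summarise(constraints):
        """Built once at preprocessing time: the set of all words and
        sorts in the document, with no positions or relations."""
        return {(rel, val) for (_, rel, val) in constraints
                if rel in ("word", "sort")}

    def ir_filter(query, summaries):
        """summaries maps a file path to its summary.  Relational
        constraints (same_sentence, precedes, ...) are ignored here
        and left to the IE stage."""
        needed = {(rel, val) for (_, rel, val) in query
                  if rel in ("word", "sort")}
        return [path for path, summary in summaries.items()
                if needed.issubset(summary)]

For the Figure 3 query, needed is {("word", "Esso"), ("sort", "location")}, so only files mentioning Esso and containing at least one location are ever loaded.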
<Paragraph position="8"> Table 1 shows the difference in processing times for two queries on two different datasets. Times for processing each query on each dataset are labelled Old for Highlight with no IR and no preprocessed files, New for Highlight with preprocessed files, and NewIR for Highlight with both an IR stage and preprocessed files.3 New% and NewIR% give New/Old and NewIR/Old respectively. From the table we can see that (for these queries) adding the preprocessed files reduces total processing time by around 75%. Adding the IR stage reduces it by a further couple of percent in the case of the FT files and by 10% for the WSJ files. The performance increase on FT data is less dramatic because the data provides more hits (i.e. we extract more templates per input file). Note, though, that for both datasets these improvements are at the worse end of the spectrum: both the WSJ and FT files stand a very good chance of containing both person and company (the sorts in our test queries), so the IR component will propose a large number of files for IE treatment, and we can also expect several hits per file, which tends to slow query processing. In other queries, e.g. a search for a company name, we would expect things to be much quicker, as borne out by the results in Table 2.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="91" end_page="92" type="metho">
<SectionTitle> 4 Performance </SectionTitle>
<Paragraph position="0"> Figures 6 and 7 correspond to the queries in Figures 3 and 4 respectively. Figure 8 is the result of searching for protein interactions (a more general version of the query in Figure 5).4 Table 2 contrasts these results. FilesIR and FilesIE refer to the number of files input to each stage of the system, i.e. the query in Figure 6 was run over 500 news articles; only one of them was selected by our IR component and was thus input to the IE component. Hits denotes how many of the files passed to IE actually had at least one template in them, and Templates shows how many templates were extracted as a result of the query. Time is the total time for the query in seconds.</Paragraph>
<Paragraph position="2"> The query in Figure 6 looks for documents containing Esso and a location.5 Because Esso is such a good discriminator in this document set, appearing in only one out of 500 documents, the IE query only considers one document. For the query in Figure 7, by contrast, the 500 files contain the common sorts person and company. Figure 8 is similar but has the additional overhead of around 2/3 of all files having hits, where Figure 7 has only 1/20.</Paragraph>
<Paragraph position="3"> Several factors affect the total amount of time that a query takes. With the addition of the IR stage, the specificity of the query has the potential to impact greatly on total query time, as already seen. Having a hit in the file can also significantly affect timings, since there is no chance of an early failure of constraint satisfaction. For example, taking a file from each of the queries in Figures 7 and 8 which each take 270msec to load (i.e. are of the same size), we find that the one in which there is no hit takes 10msec to process but the one in which there is a hit takes 6.5sec (of which just 110msec is spent writing results to file).</Paragraph>
<Paragraph position="4"> The lessons from these results are not particularly surprising: queries should use the most specific words and sorts possible to get good value from the IR component. If there are many solutions (and remember that this system aims to extract specific information rather than a relevant document or document passage) then there will be a time penalty.</Paragraph>
5 Because there is no sentential restriction on these two items, they do not appear on the same row in the output. This is an option available to the user, which means that in queries of this kind the output is a summary of the "true" output (a cross product of each item with each other item), which can obviously result in very large output tables.
<Paragraph position="5"> Our expectation is that complex queries which also involve a lot of hits are more likely to be run in batch mode, so the time penalty will not be so crucial. One-off queries will tend to be simpler, involving searches for less frequent information. However, we are also investigating further ways to improve processing speed, in particular during constraint satisfaction.</Paragraph>
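One avenue, suggested by the hit/no-hit timings above, is to exploit constraint ordering aggressively. A sketch (our illustration, not Highlight's actual solver) of why a file with no hit fails so quickly when the most specific constraints are tried first:

    def prune(query, index):
        """index maps a (relation, value) pair to the document items
        carrying it, i.e. the indexed and reverse-indexed constraints
        of Section 3.4.  Unary word/sort constraints are tried
        rarest-first, so a file with no hit fails on an early lookup,
        while a file with hits goes on to full relational matching."""
        unary = [c for c in query if c[1] in ("word", "sort")]
        unary.sort(key=lambda c: len(index.get((c[1], c[2]), ())))
        candidates = {}
        for item, rel, val in unary:
            found = set(index.get((rel, val), ()))
            candidates[item] = candidates.get(item, found) & found
            if not candidates[item]:
                return None        # early failure: no hit possible
        return candidates          # survivors go to the full matcher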
</Section>
</Paper>