<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1044"> <Title>Architecture Committee</Title> <Section position="2" start_page="308" end_page="309" type="metho"> <SectionTitle> 1.11 Technical References </SectionTitle> <Paragraph position="0"> (a). Guidelines for Electronic Text Encoding and Interchange, TEI P3, Vol. 1 &amp; 2, ACH, ACL, ALLC (b). ISO 8879:1986, SGML standards</Paragraph> </Section> <Section position="3" start_page="309" end_page="311" type="metho"> <SectionTitle> 2.0 GENERAL TIPSTER FEATURES </SectionTitle> <Paragraph position="0"> The Architecture shall provide appropriate information to support TIPSTER features that are included in specific Applications. Features developed as a result of the requirements specified in this document will generally, but not always, be initiated by the User through User Interface commands that may be passed to the User Information Request for processing. The results from such processing, or other internal TIPSTER processes, will be passed to the User Interface via the User Information Output. Supporting details, for those requirements considered to be an internal part of the Architecture, will be found in the remainder of this document.</Paragraph> <Section position="1" start_page="309" end_page="309" type="sub_section"> <SectionTitle> 2.1 Task Handling </SectionTitle> <Paragraph position="0"> Processing tasks shall operate interactively with progress status shown to the User or in the background with progress status available upon User request. Depending upon the specific task, the User may direct the task to operate interactively or in the background mode. Tasks may be canceled or the mode changed. The nature of this requirement implies co-operation between the Operating System, the Application and the Architecture. 
The Architecture shall facilitate this co-operation by allowing status information to be updated and passed between components. (5, 6, 7) Verification Method: Demonstration.</Paragraph> </Section> <Section position="2" start_page="309" end_page="311" type="sub_section"> <SectionTitle> 2.2 Multi-Lingual </SectionTitle> <Paragraph position="0"> The Architecture shall allow documents to be processed in any human language, provided the appropriate knowledge bases have been built and language specific modules are available. This includes the processing of documents that contain multiple languages. The Architecture shall be based on multi-lingual layers so that additional languages can easily be added without recreating the basic Application. The existence and use of multi-lingual layers shall not have a significant adverse effect on single language processing. (5, 6, 7) Verification Method: Inspection.</Paragraph> </Section> </Section> <Section position="4" start_page="311" end_page="314" type="metho"> <SectionTitle> 3.0 PERSISTENT KNOWLEDGE REPOSITORY </SectionTitle> <Paragraph position="0"> The Persistent Knowledge Repository is a common place (storage device) where information may be retained.</Paragraph> <Paragraph position="1"> The goal of Persistent Knowledge is to further the use of common knowledge items as new Applications are built. 
A significant pay-off may accrue through common usage, particularly in the development of operational Applications.</Paragraph> <Paragraph position="2"> It is recognized that there may be some reduction in capability when common knowledge items are used in an Application instead of unique, customized knowledge items; however, the history of other technical areas indicates that common items and sharing have merit and possible pay-off.</Paragraph> <Paragraph position="3"> Generally, Persistent Knowledge is that knowledge which is retained from one run of an Application(s) to the next.</Paragraph> <Paragraph position="4"> However, there are degrees of persistency. Some items, such as lexicons, gazetteers, glossaries, dictionaries, marking rules and grammar rules, are comparatively stable, subject to normal maintenance operations. A Query Library, on the other hand, may be modified frequently. Results from Detection, Extraction or User actions may be used to modify the items in the Persistent Knowledge Repository.</Paragraph> <Paragraph position="5"> Specific algorithms, even those which use items in the Persistent Knowledge Repository, are not considered part of the Repository, but instead are included in the particular instance of an implemented requirement; however, the structure of the particular knowledge item is an Architectural item.</Paragraph> <Paragraph position="6"> It should be noted that while Persistent Knowledge requires a place to keep knowledge items and a defined format for each item or storage area, not all items must be completely filled initially under the TIPSTER project. Some items will be filled, grow or be augmented as Applications are implemented under the Architecture. As a minimum, the format and access method of each Persistent Knowledge item shall be established. 
Since Persistent Knowledge items will require frequent access, the format and access design shall place a high priority on efficiency.</Paragraph> <Paragraph position="7"> Requirement 2.2 for multi-lingual capability applies to this section, and Persistent Knowledge items should be designed to apply to multiple languages.</Paragraph> <Section position="1" start_page="311" end_page="311" type="sub_section"> <SectionTitle> 3.1 Persistent Knowledge Sharing </SectionTitle> <Paragraph position="0"> The Architecture shall allow for sharing of Persistent Knowledge items. This includes sharing by different components in an Application and by different Applications built in accordance with the TIPSTER Architecture. (2, The Architecture shall provide for the use of a common Lexicon that can support various document processing tasks in different Applications. The intent of this requirement is to allow new Applications to adopt and modify existing TIPSTER lexicons, where suitable. The common lexicon format should cover a frequently used set of data fields and be applicable across languages. (5, 8) Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="1"> The Architecture shall provide for the use of a common Gazetteer that can support various document processing tasks in different Applications. (5, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="2" start_page="311" end_page="313" type="sub_section"> <SectionTitle> 3.1.3 Common Parts of Speech Word Lists </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of common parts of speech word lists with a range of different tags to support various document processing tasks in different Applications. These are the word lists used by a part of speech tagger. 
(5) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="3" start_page="313" end_page="313" type="sub_section"> <SectionTitle> 3.1.4 Common Grammar Rules Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of common grammar rules that can support various document processing tasks in different Applications. This may be considered a far term requirement. (0) Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="1"> The Architecture shall provide for the use of a common Document Structure Library which will store formal descriptions of the structures of specific document types to aid in their processing. Reference (b), SGML standards, defines the concept of a DTD, Document Type Definition. Reference (a), TEI, addresses the concept and retention of common document structures as well as methods for assembling complex structures, including graphics, multi-media and text, from basic structure definitions. This is accomplished through the use of DTDs. While much of the scope of Reference (a), TEI, is directed to embedding the associated tagging information during document creation in order to support publication needs, it also provides supporting concepts that can be used to dynamically mark documents. Document structure definitions are not restricted to the TEI methodology; however, whatever method is used, the document structure definition shall be stored in the Library so as to facilitate use by new Applications. (0) 3.1.5.1 Basic Near Term Requirement A basic library defining commonly used message types and standard document divisions, such as communication header and text, shall be available for Applications implemented in the near term. 
The near-term requirement shall be met through the use of the Core Tag Sets and the Base Tag Sets as defined in Reference (a), TEI, or, optionally and less desirably, by document structure definitions not restricted to the TEI methodology. Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="2"> 3.1.5.2 Enhanced Far Term Requirement An enhanced library with richer membership which covers more complex document structures than the basic library shall be implemented over the long term. The long-term requirement shall be met through the use of Additional Tag Sets as defined in Reference (a), TEI.</Paragraph> <Paragraph position="3"> Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="4"> The Architecture shall provide for the use of a Common SGML Tag Sets Dictionary. The basis for these Tag Sets shall be Reference (a), TEI, and any augmentation provided by various Applications. (5) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="4" start_page="313" end_page="313" type="sub_section"> <SectionTitle> 3.1.7 Common Complete Template Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Complete Template Library that can support various document processing tasks in different Applications. Templates in the library shall be composed of template objects, fill rules and patterns associated with them. This is a library of templates already defined and used by TIPSTER Applications. (0, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="5" start_page="313" end_page="314" type="sub_section"> <SectionTitle> 3.1.8 Common Template Object Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Template Object Library that can support various document processing tasks in different Applications. 
The Library shall contain common objects composed of slot definitions with fill rules but without patterns. The purpose of this library is to allow construction of templates using pre-defined objects. (0, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="6" start_page="314" end_page="314" type="sub_section"> <SectionTitle> 3.1.9 Common Pattern Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Pattern Library that can support various document processing tasks in different Applications. The purpose of this library is to provide pattern strings for use in building new template filling capability. (0, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="7" start_page="314" end_page="314" type="sub_section"> <SectionTitle> 3.1.10 Common Detection Criteria Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Detection Criteria Library containing statements of user information needs as well as the associated Application-translated, user-understandable queries that can support various document processing tasks in different Applications. Reuse of such criteria can facilitate the building and modification of requests for retrieval and routing. (0, 8)</Paragraph> </Section> <Section position="8" start_page="314" end_page="314" type="sub_section"> <SectionTitle> 3.1.11 Common Stemming Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Stemming Library that can support various document processing tasks in different Applications. 
The Library shall include word stems, prefixes and suffixes; thus a stem may be identified by direct look-up or the word may be parsed and parts compared with prefixes or suffixes, as appropriate, to identify a word stem. Stems, prefixes and suffixes may reside in separate parts of the Library so as to The Architecture shall provide for the use of common Stop Word Lists that can support various document processing tasks in different Applications. Different Stop Word Lists shall be applicable to different parts of a document to allow for different usage/meaning of the same word. (0) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="9" start_page="314" end_page="314" type="sub_section"> <SectionTitle> 3.1.13 Common Phrase Lists </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of common Phrase Lists that can be shared by various document processing tasks in different Applications. This requirement means that domain specific Phrase Lists can be a shareable resource. (0) Verification Method: Demonstration and inspection.</Paragraph> </Section> </Section> <Section position="5" start_page="314" end_page="316" type="metho"> <SectionTitle> 3.1.14 Common Predicate-Argument Dictionary </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Predicate-Argument Dictionary that can support various document processing tasks in different Applications. 
(0) Verification Method: Demonstration and inspection.</Paragraph> <Section position="1" start_page="314" end_page="314" type="sub_section"> <SectionTitle> 3.1.15 Common Term Expansion Dictionary </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common Term Expansion Dictionary that can be used to look up equivalent terms, variation terms, synonym terms or abbreviation expansions to support various document processing tasks in different Applications. 
In general, there may be Application dependencies with the same term having different meanings, depending upon the Application. (0) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="2" start_page="314" end_page="316" type="sub_section"> <SectionTitle> 3.1.16 Common User Annotation Library </SectionTitle> <Paragraph position="0"> The Architecture shall provide for the use of a common User Annotation Library that provides a repository for pre-defined User Annotations in the form of partially or fully completed Annotations for pre-defined document locations or associated with particular attributes that a User may use for comments about the document or the Annotations created by the Application. Specific Annotation type(s) shall be assigned for User Annotations. (6) The Architecture shall support the use of Dictionaries in machine readable form (e.g. CD-ROM) to support document processing tasks in different Applications.</Paragraph> </Section> </Section> <Section position="6" start_page="316" end_page="319" type="metho"> <SectionTitle> 4.0 DOCUMENT MANAGEMENT </SectionTitle> <Paragraph position="0"> The Architecture Concept, Reference (9), identifies several typical forms of documents, Form 0 through Form 4.</Paragraph> <Paragraph position="1"> Form 0 is the original document and Forms 1 and 2 are intermediate forms of the document. The Document Management process is concerned with Form 3 and Form 4 documents. A Form 3 document is the input to TIPSTER and Form 4 is the internal TIPSTER form. Document sources shall be in machine readable form and may be from communication lines or from computer files.</Paragraph> <Section position="1" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.1 Document Attributes </SectionTitle> <Paragraph position="0"> There are a variety of document attributes that may be used by TIPSTER processing functions. 
Some of these attributes, such as Date of Information, Author or Source, may be used directly by Detection to select documents.</Paragraph> <Paragraph position="1"> Other attributes, such as Original Language or Code Set, may be used to control the internal TIPSTER processing.</Paragraph> <Paragraph position="2"> The specific, necessary group of attributes is Application dependent. See Appendix A for a list of the most common attributes that might be used by an Application. (5) Verification Method: Inspection and demonstration.</Paragraph> </Section> <Section position="2" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.2 Other Attributes </SectionTitle> <Paragraph position="0"> Other document attributes may be specified by the Application, by the operating system, by the installation, by the user group and by the user. These may include such varied items as: processing network configuration, access/security control information or document collection names. (5) Verification Method: Inspection and demonstration.</Paragraph> </Section> <Section position="3" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.3 Original Document </SectionTitle> <Paragraph position="0"> A document shall always be available in an original form. The Application may define this to be a Form 1, Form 2 or Form 3 type of document. See Reference (9). (5) Verification Method: Demonstration.</Paragraph> </Section> <Section position="4" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.4 Document Sets </SectionTitle> <Paragraph position="0"> Documents shall be managed in such a way that ordered and unordered groups can be created. These groups may be composed of sets with various characteristics such as source, publisher or discourse. 
The document sets can have access controlled through Access Control that limits usage to the specified User or User Group.</Paragraph> </Section> <Section position="5" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.5 Corrections </SectionTitle> <Paragraph position="0"> Corrections to a document shall be applied so as to leave the original document, as defined in Paragraph 4.3, intact.</Paragraph> <Paragraph position="1"> Corrections are made through the use of Annotations. Revision numbers shall be associated with correction Annotations. A new persistent document may be created by applying corrections and assigning the appropriate revision number. Revised, corrected documents may be processed as any other Form 3 document. It shall be possible to make correction Annotations to correction Annotations. (5, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="6" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.6 Adaptive Document Structures </SectionTitle> <Paragraph position="0"> Adaptive identification of formats and document structure may be made based upon representative documents. This implies a degree of learning about document structures as they change over time. Representative document structures are maintained in the Document Structure Library. (4)</Paragraph> <Paragraph position="1"> Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="7" start_page="316" end_page="317" type="sub_section"> <SectionTitle> 4.7 Annotation </SectionTitle> <Paragraph position="0"> Annotations are information added to a document by User or computer processing. The Architecture shall recognize various types of annotations associated with specific passages of text as specified by a text span with begin and end values or the whole document. An Annotation to an Annotation is permitted. 
Also, it should be possible to obtain all Annotations associated with a particular document location through specific begin and end location values. Retrieval of Annotations should also be possible by type, by Annotation Group and for the entire document. (5, 7, 8) Provisions must be made for permanent Annotations. These may be stored separately from the document to which they apply. In such cases appropriate references or links shall tie the Annotations to the document. (5, 7) Verification Method: Demonstration.</Paragraph> <Paragraph position="1"> A single annotation may be associated with multiple document locations (spans). (5) Verification Method: Demonstration.</Paragraph> <Paragraph position="2"> Annotations shall be extensible, in accordance with the Architecture Maintenance Policy, as the Architecture matures. In addition to a reference mechanism (span and document ID) that identifies the scope of the Annotation, the two principal parts are the Annotation Type identification and the specific information in the Annotation. Common Annotation Types shall be defined and these definitions shall be maintained as part of the Architecture Configuration Policy. New Annotation Types may be created. (5) Verification Method: Inspection.</Paragraph> <Paragraph position="3"> Annotations may be treated together as sets of the same or differing types. These groups are necessary for retrieval purposes and to control processing. (0, 7, 8) A type of Annotation is required to allow the User to make notes about a document, other Annotations or the Detection and Extraction processing. User Annotations may have access controlled by the mechanism specified in Access Control. 
(6, 7, 8) Verification Method: Inspection.</Paragraph> </Section> <Section position="8" start_page="317" end_page="319" type="sub_section"> <SectionTitle> 4.8 Document Control </SectionTitle> <Paragraph position="0"> Basic Document Control includes the maintenance of document collections, document lists, correction and version records, ownership record, access control and security. (0, 8) Successor versions to the original document shall be marked with a revision number and the document ID and the cause for revision recorded as a document history attribute. (5) Verification Method: Demonstration.</Paragraph> <Paragraph position="1"> Applications shall not be prevented from interfacing to external source control information. In general, the Architecture shall not restrict an Application from using: The Architecture shall support conversion of document files written in various character encodings to a standard encoding before TIPSTER processing. The conversion shall be from a selected set of conversion tables and a set of conversion algorithms for the documents. The Architecture permits a standard Code Set for all internal operations. Retention of the original document is still required. The Architecture shall recognize on-going work of Reference (a) in the area of Code Sets. (5, 7) Verification Method: Demonstration.</Paragraph> </Section> </Section> <Section position="7" start_page="319" end_page="323" type="metho"> <SectionTitle> 5.0 USER INFORMATION REQUEST </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="319" end_page="319" type="sub_section"> <SectionTitle> User Information Requests define the manner in which the Application shall operate and perform its various tasks. </SectionTitle> <Paragraph position="0"> These criteria are usually originated by the User; however, they may be integral to the specific Application. They are embodied in Selection Statements, other Detection Criteria, Queries, Routing Query and Templates. 
User commands from the User Interface component pass through the User Information Request area to initiate specific TIPSTER operations; for example, show a document, create a private collection, show a document list, etc. User information requests shall be presented in the language of the document except for two cases noted below.</Paragraph> </Section> <Section position="2" start_page="319" end_page="322" type="sub_section"> <SectionTitle> 5.1 User Defined Detection Criteria </SectionTitle> <Paragraph position="0"> The User shall be able to establish the document detection criteria in a variety of formats. These include creating a free text Selection Statement, a shorter free text need statement, example document or keyword Boolean statements with negative operators to specify the desired documents or sub-sets of documents. Previously created and saved detection criteria may be obtained from the appropriate library and modified. Criteria used for retrospective search and for routing shall have the same formats available. Different types of criteria may be used together, such as a Selection Statement with keywords. (3, 4, 6, 7, 8) The Application should provide any necessary assistance for Selection Statement generation. Selection Statements shall describe, in free text form, the kind of information the user requires. The statement shall be sufficiently complete so that relevance can be determined with reasonable certainty. The format for Selection Statements must be common and sharable between Applications. (0, 6, 7, 8) Verification Method: Inspection.</Paragraph> <Paragraph position="1"> Detection criteria may be stated as Boolean keyword criteria and negative operators to exclude documents. Keyword stated criteria may include statements of document attributes, such as author, source, date of composition, date of receipt, country of origin, etc. (0, 6, 7, 8) The user shall be able to view a detection component's interpretation of the submitted Detection Criteria. 
This query shall be in a format that is understandable by an interested user. The user shall be able to relate this query to both the criteria and the document it retrieves. The user shall be able to modify this component version query directly and shall be able to create new Detection Criteria in this component's format if the user learns the format. Verification Method: Demonstration.</Paragraph> <Paragraph position="2"> In the detection process, items such as personal names, organization names, equipment names, locations, dates and identification numbers may be treated as single units when specified as such by the user. (0, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="3"> The Architecture shall support a detection component that allows foreign language documents to be searched or distributed using English language Selection Statements. A multi-lingual retrieval or routing Application may have a separate module for each language it handles, but builds an index that is language independent. The component shall return a single ranked list of documents in multiple languages. Foreign language documents in the list may be annotated with English glosses of selected words and phrases. The Application can provide a custom browsing interface to display annotated documents to the user. (5, 6, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="4"> The Architecture shall allow assistance to the native speaker of English in formulating detection criteria in a foreign language. This includes providing interfaces for a program that shall take an English Selection Statement and shall return a version of the Selection Statement in another language or several versions in several different languages. The Architecture shall maintain the logical association between the English version of the query and the foreign language version of the query. 
(6, 7, 8) Verification Method: Inspection.</Paragraph> <Paragraph position="5"> 5.1.10 Query and Detection Criteria Refinement Refinement of all types of Detection Criteria, including the queries for retrospective search or routing, shall be supported by the Architecture. Relevance tags may provide guidance for the refinement; also, as an option, documents already seen may be suppressed from the re-run. (2, 3, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="6"> The Architecture shall allow prioritization to affect the manner in which documents are retrieved and presented for review. Priorities may be attached to Detection Criteria, including Selection Statements or portions thereof. In particular, the User shall be able to prioritize references to personalities, events, objects, times, or locations as well as identifying the priority of Detection Criteria in a submission of multiple criteria statements. (6, 7, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="3" start_page="322" end_page="323" type="sub_section"> <SectionTitle> 5.2 User Defined Extraction Criteria </SectionTitle> <Paragraph position="0"> Presently, this process is labor intensive and requires a co-operative effort between Users and Application developers. The objective is to allow the End User or Developer to establish the criteria for extracting information from documents. This may include creating Templates, Patterns and Fill Rules needed for extraction and keeping items in libraries for future use. Toward this objective the Architecture should support more automation with interactive assistance to the User and Developer in preparing these items. Sample or training Templates may be used to assist the User. (0, 4, 7, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="1"> The Architecture shall accept as input Template Schema with empty slot relationships and treated as formatted information. 
Templates may be of varying complexity, from just single entities to Message Understanding Conference (MUC) type Templates. The language used to specify Template Schema should be that which is evolving under MUC. (0, 5, 8).</Paragraph> <Paragraph position="2"> Verification Method: Demonstration.</Paragraph> <Paragraph position="3"> The Architecture allows the specification of criteria for correct filling of each template slot and object. These criteria may be provided as a &quot;Fill Rules Document&quot; or as a sufficient number of examples of correct fills with context obtained from sample text. The extraction component shall determine how to fill the Template based upon these criteria. (0, 8) The Architecture shall permit complete Templates, Template Objects and Patterns to be stored in their respective libraries. These items may be retrieved, modified and stored as new items. When Template Objects are retrieved the Fill Rules associated with each slot shall also be retrieved to assist in understanding existing Templates, constructing new Templates or modifying the fill criteria. Template Objects and Patterns may be selected, modified and re-combined or attached to different Templates to establish new extraction criteria. (2, 4, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> </Section> <Section position="8" start_page="323" end_page="323" type="metho"> <SectionTitle> 5.2.3.1 Manual Processes </SectionTitle> <Paragraph position="0"> In the near term, Template Objects and Patterns, that is, the component specific structures which allow the extraction component to recognize slot fills, shall be constructed/combined by manual or semi-automated processes and the appropriate libraries updated by normal maintenance operations. 
(2) Verification Method: Demonstration.</Paragraph> </Section> <Section position="9" start_page="323" end_page="326" type="metho"> <SectionTitle> 5.2.3.2 Automated Processes </SectionTitle> <Paragraph position="0"> In the future, the Architecture shall allow construction/combining of Template Objects and Patterns by automated processes, with the libraries updated automatically. The goal is to permit the development of Template Schema and fill criteria connected to component specific patterns via an interactive process by a user with subject domain knowledge. (2, 4, 7, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="1"> This requirement is applicable to Requirements 5.2.1, 5.2.2 and 5.2.3. The Architecture shall support, as a minimum, the following information types as slot fillers: a. String fills b. Set fills c. Hierarchical set fills d. Normalized fills e. Pointers to other entities.</Paragraph> <Paragraph position="2"> The Architecture shall specify how to store and pass filled Templates between components as well as the detailed representation of these types. Filled templates shall appear as Annotations with links to relevant text spans in the source document. Security requirements are applicable to this requirement. (6, 7, 8) Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="3"> The Architecture shall allow a multi-lingual extraction component that represents Template definition in any combination of multi-lingual Fill Rules, Template Objects, slots and documents. For example, Fill Rules may be in a language different than the language of the source document. The Architecture shall support Template, Object and slot-level language and code set identification, as necessary, either in the Template Schema (5.2.1) or in individual filled Templates. 
(6, 7, 8) Verification Method: Inspection.</Paragraph> <Section position="1" start_page="323" end_page="324" type="sub_section"> <SectionTitle> 5.3 Document Clustering </SectionTitle> <Paragraph position="0"> The Architecture shall allow requests for document clustering provided the appropriate clustering algorithms and interfaces to Document Management and User Information Output are established. (0) Verification Method: Inspection.</Paragraph> </Section> <Section position="2" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 6.1 Document Viewing </SectionTitle> <Paragraph position="0"> The Architecture shall allow an Application to define which parts of formatted or structured documents, including document attributes, summaries, abstracts, subject, word delineated zones, etc., get displayed when a document is shown to the user. The Architecture shall allow viewing of documents to commence before a query operation is completed, if appropriate for the particular detection component. (3, 4, 5, 6, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="3" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 6.2 Document Ordering </SectionTitle> <Paragraph position="0"> The Architecture shall allow the User to order document lists to support document viewing by any document attribute(s), result(s), Annotation(s), Template(s), slots or combination thereof related to document Detection or Extraction processing. For example, view by date, source, relevance rank and template slot. (4, 5, 6, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="4" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 6.3 Merged Results </SectionTitle> <Paragraph position="0"> The Architecture shall allow detection results from different collections or different detection components to be combined for viewing purposes. A common sorting and priority ordering may be applied by the User. 
(6, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="5" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 6.4 Document Grouping </SectionTitle> <Paragraph position="0"> The Architecture shall allow the grouping of documents based upon common occurrences of specific strings, nearly identical passages of text or similar Template Objects. The selection specification may identify the length of the string that is considered for &quot;identical&quot; selection. Identical or nearly identical documents may be viewed together or removed from the viewing list. The output shall result from either finding all &quot;identical&quot; documents in a collection which match a specific document or finding all &quot;identical&quot; documents in a collection based only on the selection specification. (6, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="6" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 6.5 Marking of Significant Text </SectionTitle> <Paragraph position="0"> The Architecture shall support tagging of the text strings in a document that caused selection of the document, so that when the document is passed to the User Interface for viewing, highlighting or other notification may be applied. The Architecture shall also support tagging of text that caused a particular slot to be filled. Additionally, text that caused template or object instantiation shall be tagged. (6, 7, 8) Verification Method: Demonstration.</Paragraph> </Section> <Section position="7" start_page="324" end_page="325" type="sub_section"> <SectionTitle> 6.6 Viewing and Editing Filled Templates </SectionTitle> <Paragraph position="0"> The Architecture shall make filled Templates available to the User Interface for viewing, editing and disposition.</Paragraph> <Paragraph position="1"> Editing includes modification or deletion of any tags or links. 
Any assessment of the component confidence of the slot fills shall also be available for viewing and, if tags or links were edited by the user, the confidence field may also be edited. (5, 6, 7) Verification Method: Demonstration.</Paragraph> </Section> <Section position="8" start_page="325" end_page="325" type="sub_section"> <SectionTitle> 6.7 Component Processing Confidence </SectionTitle> <Paragraph position="0"> The detection and extraction components may provide an estimate of the component confidence level about each document selected or each piece of information that has been extracted. In the case of the extraction component estimate, it is the component's confidence in its process for filling each slot. (6, 7) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="9" start_page="325" end_page="325" type="sub_section"> <SectionTitle> 6.8 Processing Log </SectionTitle> <Paragraph position="0"> The Architecture shall allow an Application to create a Processing Log which records each individual process step of an Application run and any related errors. The log may be turned on or off or particular levels of reporting may be selected, e.g. critical error, warning error, etc. (4) Verification Method: Demonstration.</Paragraph> </Section> <Section position="10" start_page="325" end_page="326" type="sub_section"> <SectionTitle> 6.9 Document Processing Statistics </SectionTitle> <Paragraph position="0"> The Architecture shall allow an Application to collect document processing and error statistics. Processing statistics shall include various counts related to document processing and progress status during runs. Error statistics shall include counts of the number of times particular errors occurred. The specific level and scope of operational statistics are Application dependent, but may include Document Management, Detection and Extraction statistical items. 
(3, 7)</Paragraph> </Section> </Section> <Section position="10" start_page="326" end_page="329" type="metho"> <SectionTitle> 7.0 SCOPE OF PROCESSING SERVICES </SectionTitle> <Paragraph position="0"> The Architecture shall address document detection and information extraction. The statistical, linguistic and semantic analysis performed by Detection and Extraction may be considered the key work of TIPSTER processing.</Paragraph> <Paragraph position="1"> It is here that document information is manipulated, in various ways, to support the identification of a desired sub-set of documents or extracted information. This is accomplished by applying Persistent Knowledge information in conjunction with matching algorithms, annotations and template/pattern matching techniques to reduce a Document Collection to the desired sub-set or to extract information from the Document Collection. This sub-set or its extracted information is then presented to the user, passed to other components or stored for later use.</Paragraph> <Paragraph position="2"> The analytical methods and tools representative of the techniques to be applied to this functional area are given in Appendix B; however, this list should not be considered either exhaustive or restrictive.</Paragraph> <Paragraph position="3"> The Architecture shall support various generic information types that are applicable to TIPSTER processes. The use of any particular type of information is dependent upon the Application. Appendix C gives a list of the more common generic types that may be used by an Application. (0, 5, 7, 8)</Paragraph> <Section position="1" start_page="326" end_page="327" type="sub_section"> <SectionTitle> 7.1 Detection </SectionTitle> <Paragraph position="0"> The Detection process shall use the Detection Criteria in conjunction with Document Management and the Persistent Knowledge Repository to select and route documents from document sets. 
The specific search and identification algorithms are dependent upon the particular Application. Annotations may be created and/or used by the Detection component. The Detection component shall be compatible with other components and modules of the TIPSTER Architecture. (0, 7, 8) Compiled Queries are the forms of the Detection Criteria, generated by the detection capability from the Selection Statement and used for document selection. Compiled Queries operate for both retrospective retrieval and routing applications. Compiled Queries shall indicate and retrieve the document sub-sets that are of interest to the User. Query generation may be done automatically by the Application or with help from the User. (3, 6, 7, 8) Every document detection component shall be able to serve as part of a document routing function that compares new documents from a specified source to, potentially, very large numbers of profiles from many users. The routing function is expected to have compared, within some specified interval that may be minutes to hours, each new document with all profiles. Certain categories of documents shall be processed at higher priority than others, if required by the Application, for example, FLASH messages before ROUTINE messages. Depending upon the Application, as soon as a document has been routed, it must be available for retrospective search. In these cases the routing process shall coordinate its operations with the process that controls the indexing of documents for search and retrieval. (3, 6, 7, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> <Section position="2" start_page="327" end_page="327" type="sub_section"> <SectionTitle> 7.1.7 Prioritization of Documents </SectionTitle> <Paragraph position="0"> Documents may be assigned priorities based on matches between documents and prioritized portions of queries, profiles and Selection Statements. 
Various algorithms for combining weights, when there are multiple matches in one document, may be selected. Priority scores can be weighted, based on where in the document the match occurs, such as title, source attribute, etc. If any portion of a document matches a portion of a query that has a priority attribute, then the document is assigned that priority. The priority can be scaled, based on where in the document the match occurs. If several such portions of the query match, then an algorithm can combine priorities to get a net priority for the document. The priority, once calculated, is available as an attribute for sorting documents for subsequent processing, whether viewing by the user, information extraction, or subsequent routing. (6, 7, 8) Verification Method: Demonstration and Inspection.</Paragraph> </Section> <Section position="3" start_page="327" end_page="329" type="sub_section"> <SectionTitle> 7.2 Information Extraction </SectionTitle> <Paragraph position="0"> The Extraction process shall use Templates, Fill Rules, Patterns or other methods in conjunction with Document Management, Detection and the Persistent Knowledge Repository to extract desired information from documents in Document Collections. The specific extraction methodology and algorithms are dependent upon the particular Application. Annotations may be created and/or used by the Extraction component. The Extraction component shall be compatible with other components and modules of the TIPSTER Architecture. Typical factual information to be extracted includes, but is not limited to, entities, objects, events, entity relationships and event relationships. (6, 7, 8) The Architecture shall support the concept of using Detection to filter documents as the input to the Extraction process; however, the Extraction process may operate independently from the Detection process, if required by the Application. 
(0, 8) Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="1"> The Architecture shall support the processing of Selection Statements and the natural language portions of the user defined Detection Criteria by the Extraction component, i.e. the desired information to be extracted may be specified through the use of natural language and the specifics identified by an Extraction component. The resulting specifics can be used to aid in the query formulation. (0) Verification Method: Demonstration.</Paragraph> <Paragraph position="2"> The Architecture shall allow Extraction to provide Abstracts or document summaries. The specific quality and content of the Abstract are Application dependent. (0, 8) Verification Method: Demonstration.</Paragraph> <Paragraph position="3"> The Architecture shall recognize a standard set of objects. All information extraction components are expected to be able to extract instances of these objects. These objects are expected to be part of an expanding library that shall be augmented as Applications are implemented. See Paragraph 3.1.8. (6, 8) Verification Method: Demonstration and inspection.</Paragraph> <Paragraph position="4"> The output of an information extraction component may include a set of document Annotations and, optionally, one or more lists of filled templates. Each template should support a tabular or spreadsheet view of extracted information. The building of a data base is beyond the scope of the Architecture. However, it shall support the access to specific information to build a data base by allowing an Application to use filled templates. (6, 7, 8) The Architecture shall provide an interface to existing standard extraction module evaluation tools. This shall include the use of test corpora, although actual evaluation is not part of the Architecture. 
(6, 8) Verification Method: Demonstration and inspection.</Paragraph> </Section> </Section> <Section position="11" start_page="329" end_page="329" type="metho"> <SectionTitle> 8.0 INTERFACE CONTROL DOCUMENT </SectionTitle> <Paragraph position="0"> The Interface Control Document is the defining document identifying specific inputs and outputs for TIPSTER components and modules.</Paragraph> <Section position="1" start_page="329" end_page="329" type="sub_section"> <SectionTitle> 8.1 Modularity </SectionTitle> <Paragraph position="0"> The Architecture shall provide for modularity. The ICD shall unambiguously define the Architecture component interfaces. These interfaces shall describe all the external information needed by the component. This serves to bound the component and thereby modularizes it. (2, 5, 8) Verification Method: Inspection.</Paragraph> </Section> </Section> <Section position="12" start_page="329" end_page="331" type="metho"> <SectionTitle> 8.2 Interchangeability </SectionTitle> <Paragraph position="0"> The Architecture shall provide for interchangeability. The ICD shall unambiguously define the Architecture component interfaces. The inputs and outputs shall be defined with sufficient detail to allow an Application's TIPSTER components and modules to be exchanged with similar TIPSTER components or modules. This shall also allow vendors to develop alternative components that also meet the specifications of the ICD. In this way, the components shall be Interchangeable. (2, 8) Verification Method: Inspection.</Paragraph> <Section position="1" start_page="329" end_page="329" type="sub_section"> <SectionTitle> 8.3 Specific Interfaces and Protocols </SectionTitle> <Paragraph position="0"> The Architecture shall provide for specific interfaces and protocols. The ICD shall unambiguously define the Architecture component interfaces. This shall include any specific standards and protocols allowed in the Architecture. 
Included in this requirement is the specification of Application Program Interfaces (API). (0) Verification Method: Inspection.</Paragraph> </Section> <Section position="2" start_page="329" end_page="329" type="sub_section"> <SectionTitle> 8.4 Application Language </SectionTitle> <Paragraph position="0"> Interfaces should be specified to facilitate future development of an Application language. Such a high-level language would allow the construction of Applications by use of APIs corresponding to various modules and components of the Architecture. (5) Verification Method: Inspection.</Paragraph> </Section> <Section position="3" start_page="329" end_page="331" type="sub_section"> <SectionTitle> 8.5 Extensible Architecture </SectionTitle> <Paragraph position="0"> The Architecture shall provide for extension and adoption of new implementation approaches. The ICD shall unambiguously define the Architecture component interfaces. This shall provide a basis for any future enhancements to the Architecture. Enhancements can only be envisioned and designed when the base Architecture is well defined.</Paragraph> </Section> </Section> <Section position="13" start_page="331" end_page="339" type="metho"> <SectionTitle> 9.0 OPERATING ENVIRONMENT </SectionTitle> <Paragraph position="0"> The Operating Environment is concerned with such items as client/server schemes, file handling methods, operating systems, communications and support items.</Paragraph> <Section position="1" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 9.1 Research Framework </SectionTitle> <Paragraph position="0"> The Architecture shall provide a design that can serve as an efficient research framework. 
(2) Verification Method: Demonstration.</Paragraph> </Section> <Section position="2" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 9.2 Transportability </SectionTitle> <Paragraph position="0"> The Architecture shall provide a design that maximizes platform transportability. The use of capabilities which are Operating System or environment dependent must be clearly identified and modularized so as to isolate them from transportable components and modules. (2, 3, 4, 8) Verification Method: Inspection and demonstration</Paragraph> </Section> <Section position="3" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 9.3 Scaleability </SectionTitle> <Paragraph position="0"> Components shall be scaleable to a large number of documents and a high document flow rate; up to a maximum of 1,000,000 documents per day with access to 2 terabytes of text. (2, 3, 4, 5, 8) Verification Method: Inspection.</Paragraph> </Section> <Section position="4" start_page="331" end_page="339" type="sub_section"> <SectionTitle> 9.4 Tools </SectionTitle> <Paragraph position="0"> Tools and enhancements to assist in applying the Architecture to new tasks, applications and languages should be identified and, where possible, developed. (1, 7) Verification Method: Demonstration.</Paragraph> <Paragraph position="1"> Architectural choices should recognize the desirability of incorporating multi-level security in component implementation. The Architecture shall support the Application security requirements of the organization responsible for the Application, for example, security labels on processes and/or data items. In such cases labels shall not be separable from the process or data item. (5, 8) The architecture shall support physical and software boundaries to devices where documents and data are stored. The boundaries shall ensure that persons only have access to the appropriate level of classified information. 
These may be implemented at the API level of the software components.</Paragraph> <Paragraph position="2"> Auditing and administrative support shall be available for marking and filtering data to ensure proper document/data access, distribution and viewing and also to record improper access attempts. The scope of marking of data may be at the document, paragraph, data item or object level.</Paragraph> <Paragraph position="3"> The Architecture shall recognize the importance of appropriate response times, particularly in the interactive mode, and not impede implementations from meeting accepted standards. The goal is 2 seconds for interactive activities such as document or list displays and a few tens of seconds for activities such as query or search that require significant computer resources. (2, 3, 8)</Paragraph> </Section> </Section> <Section position="14" start_page="339" end_page="343" type="metho"> <SectionTitle> APPENDIX B Analytical Methods </SectionTitle> <Paragraph position="0"> The list below identifies Typical Analytical Methods that may be encountered in any TIPSTER Application.</Paragraph> <Paragraph position="1"> The list is open ended. The use of a particular method is dependent upon the specific TIPSTER Application.</Paragraph> <Paragraph position="2"> Matching conditions and exceptions</Paragraph> </Section> </Paper>