File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/x98-1003_metho.xml
Size: 11,227 bytes
Last Modified: 2025-10-06 14:15:20
<?xml version="1.0" standalone="yes"?> <Paper uid="X98-1003"> <Title>TIPSTER LESSONS LEARNED: THE SE/CM PERSPECTIVE</Title> <Section position="4" start_page="17" end_page="19" type="metho"> <SectionTitle> 3 LESSON LEARNED </SectionTitle> <Paragraph position="0"> This section describes several lessons we believe have been learned as a result of having pursued the development of the architecture.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 3.1 Architectures Are &quot;Good Things to Have&quot; </SectionTitle> <Paragraph position="0"> Whatever the specific achievements or failings of the current TIPSTER architecture, the consensus of the program is that having an architecture was a good idea. It provided a central focus to an otherwise somewhat loosely-coupled set of contracts and it provided a forum for the researchers to share concerns and make progress on areas of mutual interest. It made it much easier to describe the program and its goals, both to participants and to outside interested parties.</Paragraph> <Paragraph position="1"> Participants always expressed great hope for the architecture and agreed that its goals were important to achieve. Though it may be judged that the final TIPSTER architecture did not achieve all these goals, we should be encouraged by the achievements and use the lessons learned for an improved architecture in the future. We should consider architectures as a good idea for text processing systems.</Paragraph> </Section> <Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 3.2 Programmatic Incentives May Be Necessary To Get The Architecture Used </SectionTitle> <Paragraph position="0"> Usually, an architecture is a model of standards and interfaces used to develop common, reusable components or modules in support of various domain operational applications.
Applying this concept to the research environment was new and unusual. Normally, researchers are concerned with new algorithms and concepts and are not involved in the bigger application or system picture. They must be properly indoctrinated in the need for an architecture, provided with adequate support tools, and directed to use the architecture.</Paragraph> <Paragraph position="1"> It cannot be an optional consideration except where their work makes no larger contribution to an application.</Paragraph> </Section> <Section position="3" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 3.3 Direction and Support </SectionTitle> <Paragraph position="0"> In Phase II, the Government and COTRs would meet once a month for architecture discussions.</Paragraph> <Paragraph position="1"> This provided common ground to support architecture development. In Phase III, these meetings did not occur, with the result that architecture development slowed because the researchers lacked common guidance as to the importance of the architecture. Also, the work statements and funding for some of the researchers may not have supported their needed contributions to the architecture through the Technical Working Groups.</Paragraph> <Paragraph position="2"> The message here is that when there are so many contracts, a high level of coordination and cooperation is necessary.</Paragraph> </Section> <Section position="4" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 3.4 The Architecture Should Be Available At The Beginning </SectionTitle> <Paragraph position="0"> Since the architecture was designed in parallel with the researchers' main tasks, it changed frequently and could not serve as a framework on which they could do their work.
Thus, there was uncertainty as to the environment and structure into which their work should fit and how their components would work with other components.</Paragraph> <Paragraph position="1"> The architecture development process should have proceeded quite differently. The architecture should have been essentially complete after two or three months so the researchers would know the framework in which they were expected to work. This improved schedule could have been achieved by having an Architecture Design Team (ADT) consisting of no more than five people working together continuously, at one location, for two or three months. The ADT should be composed of domain specialists AND system specialists. A suggested composition would be: (knowledge of, and experience in, building real applications is important) The major focus of the architecture for Phase III was the development of a Common Object Request Broker Architecture (CORBA) compliant Architecture Capabilities Platform (ACP) to host TIPSTER-compliant software components and modules. The ACP provides a software platform to test individual TIPSTER tools and capabilities. Developers will be able to demonstrate to the Government the modularity of their text handling systems by plugging components and modules into the ACP and interacting with the other TIPSTER components on the platform. In addition, the ACP will demonstrate the capability to interact with systems based on Z39.50 standards. The ACP will also have various supporting components such as document collections, standard detection needs, lexicons, a document manager and a default graphical user interface (GUI).</Paragraph> <Paragraph position="2"> The unavailability of the ACP in Phase III is very similar to the unavailability of the architecture in Phase II. Both were needed early so the researchers could properly design and test their products during development.
Even though there was a limited budget for the ACP, it probably would have been more effective if most of the money had been allocated to the beginning of the project so it could &quot;get on the air&quot; sooner.</Paragraph> </Section> <Section position="5" start_page="18" end_page="18" type="sub_section"> <SectionTitle> 3.5 Architectures Can Promote Sharing and Increase Efficiency </SectionTitle> <Paragraph position="0"> Text processing systems are complicated systems composed of many components. Broadly speaking, these components are arranged in a serial pipeline, each component building on the output of the preceding component. While researchers concentrated on developing individual components, they all needed input data. In the extreme case shown in Figure 1, a version of each component m in an N-stage pipeline gets built N-m+1 times (by that many researchers). Ideally, each component could be built once and then shared (Figure 2).</Paragraph> <Paragraph position="1"> TIPSTER demonstrated this sharing with several components. The most highly shared component was the Document Manager. Several versions (but far fewer than the number of research efforts) were built and shared throughout Phases II and III. Lexicons, semantic nets, and some part-of-speech tagging components were also shared. There was also a great deal of discussion of sharing that we feel would have materialized had there been more time (even simply the third year of Phase III). This component sharing was primarily hampered by the delay in fielding the ACP.</Paragraph> </Section> <Section position="6" start_page="18" end_page="19" type="sub_section"> <SectionTitle> Architecture Design and Application Development Experience Are Critical </SectionTitle> <Paragraph position="0"> The CAWG approach could have been more tightly controlled, directed, and limited in duration.
The contributors were experts in their particular domain, but the group as a whole would have benefited from additional expertise in system architecture design.</Paragraph> <Paragraph position="1"> Early on it was apparent that the CAWG could not agree on the scope, selection, design or utility of numerous small modules. This resulted in an architecture of large components (the equivalent of a Computer Software Component in lifecycle terminology). The issues the CAWG faced with small modules were ownership of algorithms and software, module interfaces, which modules would be designed and whether small modules were technically feasible. There also appeared to be some resistance to the concept of a larger, generalized systems approach for TIPSTER. This is somewhat understandable since many of the researchers were used to working independently on small algorithmic pieces of code.</Paragraph> <Paragraph position="2"> The end result was that the architecture was made of three large components: Document Manager, Document Detection and Information Extraction. Since all of the researchers could use a Document Manager, it became the centerpiece of the architecture and controlled much of the remaining work of Phase II. In our opinion, this significantly weakened the concept of an architecture that could be used to build a variety of domain applications. If the document manager functions were smaller and more flexible, they would likely have been able to support a broader set of needs (e.g., detection researchers were unable to develop a compliant document manager that was also fast enough for their needs). The program also would have benefited from greater focus on the detection and extraction components.</Paragraph> </Section> </Section> <Section position="5" start_page="19" end_page="19" type="metho"> <SectionTitle> UNFINISHED WORK </SectionTitle> <Paragraph position="0"> The architecture currently is a work-in-progress.
Some things that might be considered for a future program are: The existing architecture is a mixture of standards, interfaces and implementation approaches. This causes confusion as to which parts are really architecture and which parts are module and component code. An overhaul of the architecture is needed to separate these things into a document which provides organization and structure through standards, augmented with compliant modules which are built to the architecture standard as components in a toolbox. The ACP is a partial step toward the toolbox; however, more tools are needed.</Paragraph> <Paragraph position="1"> The interfaces need clarification so that they specify only what is needed for compatible interfaces, leaving room for different implementations that let systems optimize for different constraints. Many of the current interfaces are overly constrained.</Paragraph> <Paragraph position="2"> A standardized storage method for documents should be established. This would allow different and possibly more efficient Document Manager components to be used in the architecture on an interchangeable basis. If an architecture is as general, flexible and open-ended as the TIPSTER architecture, it becomes nearly impossible to build an application which can have interchangeable components.</Paragraph> <Paragraph position="3"> Code should be provided which supports the Detection Need and Queries function. Since this is a generic function, a tool to support it is appropriate.</Paragraph> <Paragraph position="4"> The Pattern Specification Language capability should be completed and tested. This could be a critical area in standardizing rules to bring Information Extraction technology up the level of</Paragraph> </Section> </Paper>