<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1013"> <Title>The Berkeley FrameNet Project</Title> <Section position="4" start_page="88" end_page="88" type="metho"> <SectionTitle> 2. Subcorpus Extraction. </SectionTitle> <Paragraph position="0"> Based on the Vanguard's work, the subcorpus extraction tools (2.2) produce a representative collection of sentences containing these words.</Paragraph> <Paragraph position="1"> This selection of examples is achieved through a hybrid process partially controlled by the preliminary lexical description of each lemma. Sentences containing the lemma are extracted from a corpus, classified into subcorpora by syntactic pattern (2.2.1) using a CASCADE FILTER (2.2.2, 2.2.5, 2.2.6) representing a partial regular-expression grammar of English over part-of-speech tags (cf. Gahl (forthcoming)), formatted for annotation (2.2.4), and automatically sampled (2.2.3) down to an appropriate number.</Paragraph> <Paragraph position="2"> (If these heuristics fail to find appropriate examples by means of syntactic patterns, sentences are selected using INTERACTIVE SELECTION TOOLS (2.3)).</Paragraph> </Section> <Section position="5" start_page="88" end_page="89" type="metho"> <SectionTitle> 3. Annotation. </SectionTitle> <Paragraph position="0"> Using the annotation software (3.2) and the tagsets (3.2.1) derived from the Frame Database, the Annotators (3.1) mark selected constituents in the extracted subcorpora according to the frame elements which they realize, and identify canonical examples, novel patterns, and problem sentences.¹⁰ 4. Entry Writing.
The Rearguard (4.1) reviews the skeletal lexical record created by the Vanguard, the annotated example sentences (5.3), and the FEGs extracted from them, and builds both the entries for the lemmas in the Lexical Database (5.2) and the frame descriptions in the Frame Database (5.1), using the Entry Writing Tools (4.2).</Paragraph> <Paragraph position="1"> ¹⁰We are building a &quot;constituent type identifier&quot; which will semi-automatically assign Grammatical Function (GF) and Phrase Type (PT) attributes to these FE-marked constituents, eliminating the need for Annotators to mark them.</Paragraph> </Section> <Section position="6" start_page="89" end_page="89" type="metho"> <SectionTitle> 3 Implementation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="89" end_page="89" type="sub_section"> <SectionTitle> 3.1 Data Model </SectionTitle> <Paragraph position="0"> The data structures described above are implemented in SGML.¹¹ Each is described by a DTD, and these DTDs are structured to provide the necessary links between the components.</Paragraph> </Section> <Section position="2" start_page="89" end_page="89" type="sub_section"> <SectionTitle> 3.2 Software </SectionTitle> <Paragraph position="0"> The software suite currently supporting database development is an aggregate of existing software tools held together with PERL/CGI-based &quot;glue&quot;. To get the project started, we have depended on off-the-shelf software which in some cases is not ideal for our purposes. Nevertheless, using these programs allowed us to get the project up and running within just a few months. Below we describe, in approximate order of application, the programs used and their state of completion.</Paragraph> <Paragraph position="1"> interactive, web-based tool.</Paragraph> <Paragraph position="2"> * CQP (2.2.1) is a high-performance Corpus Query Processor, developed at IMS Stuttgart (IMS, 1997).
The cascade filter, which partitions lemma-specific subcorpora by syntactic patterns, is built using a preprocessor (written in PERL, 2.2.2) which generates queries in CQP's native query language.</Paragraph> <Paragraph position="3"> * XKWIC (2.3) is an X-window, interactive tool, also from IMS, which facilitates manipulating corpora and subcorpora.</Paragraph> <Paragraph position="4"> * Subcorpora are prepared for annotation by a program (&quot;arf&quot; for Annotation Ready Formatter, 2.2.4) which wraps SGML tags around sentences, target words, comments and other distinguishable text elements. Another program, &quot;whittle&quot; (2.2.3), combines subcorpora in a preselected order, removing very long and very short sentences, and sampling to reduce large subcorpora.</Paragraph> <Paragraph position="5"> * Alembic (3.2) (Mitre, 1998) allows the interactive markup (in SGML) of text files according to predefined tagsets (3.2.1). It is used to introduce frame element annotations into the subcorpora.</Paragraph> <Paragraph position="6"> * Sgmlnorm, etc. (from James Clark's SGML tool set) are used to validate and manage the SGML files. * Entry Writing Tools (4.2) (in development) * Database management tools to manage the catalog of subcorpora, schedule the work, render the SGML files into HTML for convenient viewing on the web, etc. are being written in PERL. RCS maintains version control over most files.</Paragraph> <Paragraph position="7"> ¹¹Eventually, we plan to migrate to an XML data model, which appears to provide more flexibility while reducing complexity. Also, the FrameNet software is being developed on Unix, but we plan to provide cross-platform capabilities by making our tool suite web-based and XML-compatible.</Paragraph> </Section> </Section> </Paper>