<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1503"> <Title>The TELRI tool catalogue: structure and prospects</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> • morpho-syntactic taggers • concordancers • aligners </SectionTitle> <Paragraph position="0"> Each catalogue entry is contained in <sect1>, the top-level section element.</Paragraph> <Paragraph position="1"> The section, besides containing a <title> and being marked with an ID, is composed of two <sect2> elements. The first gives the information that is common to all types of tools, while the second is tool-type specific.</Paragraph> <Paragraph position="2"> The information records are encoded as <formalpara>, where each such element has a <title>, followed by the text of the record as a <para>. Various other DocBook elements, e.g. <address> and <affiliation>, are used to annotate specific pieces of information. As an example, Table 1 gives a complete dummy catalogue entry, in which the variable parts are prefixed by 'this is'.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Catalogue input and output </SectionTitle> <Paragraph position="0"> While the initial catalogue was input directly with an SGML editor and then validated, further additions are to be made via a Web form interface, available at http://gnu.nytud.hu/telri/. Figure 1 displays the top part of a screenshot of the HTML form designed to collect the descriptions of catalogue items.</Paragraph> <Paragraph position="1"> The definition of the particular information sought about the software tools required some consideration. Obviously, we would like to have as detailed a description of each item as possible. On the other hand, one has to bear in mind that the TELRI Catalogue will appeal for voluntary contributions.
Hence, the form should be as easy as possible to fill in, so as not to deter potential contributors. The crucial task was to find the right balance between required and optional items. In the end, the required information fields were confined to the bare minimum of name, task, description and TELRI helpline. Table 2 displays the full list of questions used in the HTML form.</Paragraph> <Paragraph position="2"> The form interface runs a Perl CGI script, which mails the output, encoded as the DocBook <sect1> element described above, to the editors of the catalogue. After checking, fresh entries are included in the official release of the catalogue. The DocBook format is suitable for storage and interchange, but it is, of course, not appropriate for displaying the information. However, one of the benefits of using standardised solutions is that conversion tools and specifications are, to a large extent, already available. For presentation, we have so far been experimenting with the Extensible Stylesheet Language, XSL, or, more precisely, XSLT, the XSL Transformations language (W3C, 2000). XSLT is a recommendation of the W3C and is a language for transforming XML documents into other XML documents. There already exist several freely available XSLT processors, e.g., Xalan (http://xml.apache.org/xalan/), produced by the Apache XML Project.</Paragraph> <Paragraph position="3"> XSLT is most often used to produce HTML output for viewing on the Web, and so-called Formatting Objects, which are then further transformed into print formats, usually PDF.</Paragraph> <Paragraph position="4"> For DocBook XML there exist ready-made stylesheets for both kinds of output, made by Norman Walsh and available on the Web (http://nwalsh.com/docbook/xsl/).
In the current version we have used these 'out of the box' tools to render the catalogue, although some slight modifications would be needed to produce output better tailored to the catalogue application. Figure 2 shows sample HTML output for one item in the Catalogue.</Paragraph> <Paragraph position="5"> In summary, Figure 3 gives a graphical overview of the data processing of the TELRI Catalogue items.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Catalogue Contents </SectionTitle> <Paragraph position="0"> The catalogue currently contains only a few sample entries, which, nevertheless, exemplify the kinds of software that are most relevant for inclusion in the catalogue: • tools that at least one TELRI partner has experience in using and that the partner is willing to support for new users • tools that are available free of cost, at least for academic purposes and, preferably, are open source • tools that are language independent or adapt easily to new languages • tools that are primarily meant for corpus processing At present, the catalogue lists the following tools: • The morpho-syntactic tagger TnT (Brants, 2000) A robust and very efficient statistical part-of-speech tagger that is trainable on different languages and on virtually any tagset. It is available under a license agreement which is free of charge for non-commercial purposes. Distribution is available, in binary form only, for</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Linux and SunOS/Solaris. </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> • The IMS Corpus Workbench concordancer </SectionTitle> <Paragraph position="0"> (Christ, 1994) Comprises a powerful Corpus Query Processor and a graphical user interface.
It is available under a license agreement which is free of charge for non-commercial purposes. Distribution, in binary form only, is available for Linux and SunOS/Solaris.</Paragraph> <Paragraph position="1"> • The Vanilla sentence aligner (Danielsson and Ridings, 1997) A simple but useful program that aligns a parallel corpus by comparing sentence lengths in characters using dynamic time warping. The program assumes that hard boundaries are correctly aligned and performs alignment on soft boundaries. It is freely available and is distributed with C source code.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> • The Twente Word Aligner (Hiemstra, 1998) </SectionTitle> <Paragraph position="0"> The program constructs a bilingual lexicon from a sentence-aligned parallel corpus. The translations are ranked according to computed confidence. The system uses statistical measures and works for single words (tokens) only. It is available under the GNU General Public License and is written in C.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> • PLUG Word Aligner (Ahrenberg et al., </SectionTitle> <Paragraph position="0"> 1998) The system integrates a set of modules for knowledge-lite approaches to word alignment, with various possibilities to change the configuration and to adapt the system to other language pairs and text types. The system takes a sentence-aligned parallel corpus as input and produces a list of word and phrase correspondences in the text (link instances) and, additionally, a bilingual lexicon derived from these instances (type links). It is available under a license agreement which is free of charge for non-commercial purposes. Distribution is available, in binary form only, for Linux and MS Windows.</Paragraph> </Section> </Paper>