File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/relat/05/p05-3017_relat.xml
Size: 2,363 bytes
Last Modified: 2025-10-06 14:15:52
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3017"> <Title>Supporting Annotation Layers for Natural Language Processing</Title> <Section position="4" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> There are several specialized tools for indexing and querying treebanks. (See Bird et al. (2005) for an overview and critical comparisons.) TGrep2 is a grep-like utility for the Penn Treebank corpus of parsed Wall Street Journal texts. It allows Boolean expressions over nodes and regular expressions inside nodes. Matching uses a binary index and is performed recursively starting at the top node in the query. TIGERSearch is associated with the German syntactic corpus TIGER. The tool is more typed than TGrep2 and allows search over discontinuous constituents, which are common in German. TIGERSearch stores the corpus in a Prolog-like logical form and searches using unification matching. LPath is an extension of XPath with three features: immediate precedence, subtree scoping and edge alignment. The queries are executed in an SQL database (Lai and Bird, 2004). Other tree query languages include CorpusSearch, Gsearch, Linguist's Search Engine, Netgraph, TIQL and VIQTORYA.</Paragraph> <Paragraph position="1"> Some tools go beyond the tree model and allow multiple intersecting hierarchies. Emu (Cassidy and Harrington, 2001) supports sequential levels of annotations over speech datasets. Hierarchical relations may exist between tokens in different levels, but precedence is defined only between elements within the same level. The queries cannot express immediate precedence and are executed using a linear search. NiteQL is the query language for the MATE annotation workbench (McKelvie et al., 2001). It is highly expressive and, similarly to TIGERSearch, allows querying of intersecting hierarchies.
However, the system uses XML for storage and retrieval, with an in-memory representation, which may limit its scalability.</Paragraph> <Paragraph position="2"> Bird and Liberman (2001) introduce an abstract general annotation approach, based on annotation graphs. The model is best suited for speech data, where time constraints are limited within an interval, but it is unnecessarily complex for supporting annotations on written text.</Paragraph> </Section> </Paper>