<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2710"> <Title>The NITE XML Toolkit: Demonstration from five corpora</Title> <Section position="3" start_page="0" end_page="65" type="metho"> <SectionTitle> 2 The NITE XML Toolkit </SectionTitle> <Paragraph position="0"> At its core, NXT consists of three libraries: one for data handling, one for searching data, and one for building GUIs for working with data. The data handling library includes support for loading, serialization using a stand-off XML format, navigation around loaded data, and changes to the data in line with a specific data model that is intended especially for data sets that contain both timing information and overlapping structural markup. The search facility implements a query language that is designed particularly for the data model and allows the user to find n-tuples of data objects that match a set of conditions based on types, temporal conditions, and structural relationships in the data. The GUI library defines signal players and data displays that update against loaded data and can highlight parts of the display that correspond to current time on the signals or that correspond to matches to a query typed into a standard search interface. This includes support for selection as required for building annotation tools and a specific transcription-oriented display.</Paragraph> <Paragraph position="1"> NXT also contains a number of end user interfaces and utilities built on top of these libraries. These include a generic display that will work for any NXT format data, configurable GUIs for some common hand-annotation tasks such as markup of named entities and dialogue acts, and a tool for segmenting and labelling a signal as it plays.
They also include command line utilities for common search tasks such as counting query results and some utilities for transforming data into, for instance, tab-delimited tables.</Paragraph> <Paragraph position="2"> Finally, a number of projects have contributed sample data and annotation tools as well as mechanisms for transforming data to and from other formats. Writing and testing a new up-translation typically takes someone who understands NXT's format between one and three days. The actual time depends on the complexity of the structure represented in the input data and whether a parser for the data format must be written from scratch.</Paragraph> <Paragraph position="3"> Badly documented formats and ill-formed data take longer to transform.</Paragraph> </Section> <Section position="4" start_page="65" end_page="67" type="metho"> <SectionTitle> 3 Examples </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> currently NXT's biggest user, and is also its largest provider of financial support. AMI, which is collecting and transcribing 100 hours of meeting data (Carletta et al., 2005) and annotating part or all of it for a dozen different phenomena, is using NXT's data storage for its reference format, with data being generated natively using NXT GUIs as well as up-translated from other sources. The project uses ChannelTrans (ICSI, nd) for orthographic transcription and Event Editor (Sumec, nd) for straightforward timestamped labelling of video; although NXT comes with an interface for the latter, Event Editor, which is Windows-only and not based on JMF, has better video control and was already familiar to some of the annotators. For annotation, AMI is using the configurable dialogue act and named entity tools as well as tailored GUIs for topic segmentation and extractive summarization that links extracted dialogue acts to the sentences of an abstractive summary they support.
Figure 1 shows the named entity annotation tool as configured for the AMI project. Aside from the sheer scale of the exercise, the AMI effort is unique in requiring simultaneous annotation of different levels at different sites. NXT does not support data management, but its stand-off XML data format has made it relatively easy to manage the process using a combination of a CVS repository for version control, web forms for data upload, and wikis for work assignment and progress reports.</Paragraph> <Paragraph position="3"> The AMI Project piloted many of their techniques on the ICSI Meeting Corpus (Janin et al., 2003), which shares some characteristics with the AMI corpus but is audio-only. More information about this closely related use of NXT can be found in (Carletta and Kilgour, 2005).</Paragraph> <Paragraph position="4"> This example is a small project that is looking at the relationship between referring expressions and the hand gestures used to point at a map. Although the transcription, referring expression, and gesture annotations were done in other tools and then up-translated, NXT gave the best support for linking referring expressions with gestures and analysing the results. Figure 2 shows the linking tool. One interesting aspect of this project was that the analysis was performed by a postgraduate psychologist. Analysts with no computational experience find it more difficult to learn how to use the query language, but several have done so. With this kind of data set, simply the ability to play the signals and annotations together and highlight query results provides insights into behaviours that are dif- This example is an annotation of Genesis in classical Hebrew that shows its structural division into books, chapters, verses, and half-verses. The data itself, which is purely textual, was originally stored in an MS Access relational database, but overlapping hierarchies in the structure made it difficult to query in this format.
After finding NXT on the web and consulting us about the best way to represent the data using the NXT data model, the user successfully up-translated his data, searched it using NQL, and exported counts to SPSS to create corpus statistics.</Paragraph> <Section position="1" start_page="66" end_page="67" type="sub_section"> <SectionTitle> Example 4: The Switchboard Corpus </SectionTitle> <Paragraph position="0"> The Switchboard Dialogue Corpus (Godfrey et al., 1992) has been popular for computational discourse research.</Paragraph> <Paragraph position="1"> (Carletta et al., 2004) describes an effort which up-translated its Penn Treebank syntactic analysis to NXT format, added annotations of markables for animacy, information structure, and coreference, and used this information all together. This project made heavy use of NXT's query language, including the ability to index query results in the data storage format itself for easy access. The work is now being extended to align an improved version of the transcriptions that includes word timestamps derived by forced alignment with the transcriptions used for the syntactic and discourse annotation, and to add annotations for phonology and syllable structure, all within the same corpus structure.</Paragraph> </Section> </Section> </Paper>