File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1205_metho.xml
Size: 3,624 bytes
Last Modified: 2025-10-06 14:07:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1205"> <Title>Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface</Title> <Section position="3" start_page="0" end_page="30" type="metho"> <SectionTitle> 2. Design Criteria </SectionTitle> <Paragraph position="0"> There are three important design criteria for the Sinica Treebank: maximal resource sharing, minimal structural complexity, and optimal semantic information.</Paragraph> <Paragraph position="1"> First, to achieve maximal resource sharing, the construction of the Sinica Treebank is bootstrapped from existing Chinese computational linguistic resources. The textual material is extracted from the tagged Sinica Corpus (http://www.sinica.edu.tw/ftms-bin/kiwi.sh, Chen et al. 1996). In other words, the tasks and issues involving tokenization / word segmentation and category assignment have already been resolved. It is worth noting that the segmentation and tagging of the Sinica Corpus have undergone vigorous post-editing. Hence the precision of category assignment is much higher than in an automatically tagged corpus. In addition, since the same research team carried out both the tagging of the Sinica Corpus and the annotation of the Sinica Treebank, consistency in the interpretation of texts and tags is ensured. For structure assignment, an automatic parser (Chen 1996) is applied before human post-editing.</Paragraph> <Paragraph position="2"> Second, the criterion of minimal structural complexity is motivated by the need to ensure that the assigned structural information can be shared regardless of users' theoretical presuppositions. It is observed that theory-internal motivations often require abstract intermediate phrasal levels (such as in various versions of X-bar theory). Other theories may also call for an abstract covert phrasal category (such as INFL in the GB theory for Chinese).
In either case, although the phrasal categories are well-motivated within the theory, their significance cannot be maintained in the context of other theoretical frameworks. Since a primary goal of annotated corpora is to serve as the empirical base of linguistic investigations, it is desirable to annotate the structural divisions that are most commonly shared among theories. We came to the conclusion that the minimal basic-level structures are the ones that are shared by all theories. Thus our annotation is designed to achieve minimal structural complexity. All abstract phrasal levels are eliminated and only canonical phrasal categories are marked.</Paragraph> <Paragraph position="3"> Third, a critical issue involving Treebank construction as well as theories of NLP is how much semantic information, if any, should be incorporated. The original Penn Treebank took a fairly straightforward syntactic approach. A purely semantic approach, though tempting in terms of theoretical and practical considerations, has not yet been attempted. A third approach is to annotate partial semantic information, especially that pertaining to argument relations. This is an approach shared by us and the Prague Dependency Treebank (e.g. Bohmova and Hajicova 1999). In this approach, the thematic relation between a predicate and an argument is marked in addition to grammatical category. Note that the predicate-argument relation is usually grammatically instantiated and generally considered to be the semantic relation that interacts most closely with syntactic behavior. This allows optimal semantic information to be encoded without going too far beyond the partially automatic process of argument identification.</Paragraph> </Section> </Paper>