<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2157">
  <Title>A Description Language for Syntactically Annotated Corpora</Title>
  <Section position="4" start_page="0" end_page="1058" type="metho">
    <SectionTitle>
2 The Query Language
2.1 The right kind of graphs
</SectionTitle>
    <Paragraph position="0"> If syntactic analysis is meant to provide a basis for semantic interpretation, the predicate-argument structure of a sentence must be recoverable from its syntactic analysis. Nonlocal dependencies such as topicalization and right extraposition show that trees are not expressive enough. We need a way to connect an extraposed constituent with its syntactic or semantic head. This can be done either by introducing empty leaf nodes plus a means for node coreference (as in the Penn Treebank) or by admitting crossing edges. In our project, the latter solution has been chosen (Skut et al., 1997), partly because it is simpler to annotate (no decision on the right place of a trace has to be taken). We call this extension of trees with crossing edges syntax graphs. An example is shown in Fig. 1.</Paragraph>
    <Paragraph position="1"> In order to discuss the details of the language, we will make reference to the simpler syntax graph in Fig. 2.</Paragraph>
    <Paragraph position="3"/>
    <Section position="1" start_page="1056" end_page="1056" type="sub_section">
      <SectionTitle>
2.2 Nodes: feature records
</SectionTitle>
      <Paragraph position="0"> Syntactic phrases and lexical entries usually come with a bundle of morphosyntactic information like part-of-speech, case, gender, and number. In computational linguistics, feature structures are used for that purpose. Since we need only a way to represent morphosyntactic information (not syntactic or semantic structures themselves), we restrict ourselves to feature records, i.e. flat feature structures whose feature values are constants. We admit Boolean formulas for the feature values, as well as for the feature-value pairs themselves.</Paragraph>
      <Paragraph position="1"> For example, the query [pos=&quot;NE&quot; | &quot;NN&quot;] retrieves all proper nouns (&quot;NE&quot;) and nouns (&quot;NN&quot;). As usual, structural identity can be expressed by the use of logical variables. However, variables must not occur in the scope of negation, since this would introduce the computational overhead of inequality constraints. The values of a feature with 'infinite' range like word or lemma can be referred to by regular expressions, e.g. the nouns (&quot;NN&quot;) with initial M can be retrieved by</Paragraph>
      <Paragraph position="3"> The /-symbols mark a regular expression.</Paragraph>
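      <Paragraph position="4"> As an illustration only (a minimal sketch, not the implementation described in this paper), the following Python fragment shows how a flat feature record could be tested against a description with disjunctive values and regular expressions. The dictionary representation, the function names value_matches and record_matches, the example words, and the choice to match a regular expression against the whole value are all assumptions made for this sketch.
<![CDATA[
```python
import re

def value_matches(constraint, value):
    """Check one feature value against a constraint.

    A constraint is a string constant, a compiled regular expression,
    or a set of such constraints read as a disjunction.
    """
    if isinstance(constraint, (set, frozenset)):
        return any(value_matches(c, value) for c in constraint)
    if isinstance(constraint, re.Pattern):
        # Assumption: a regular expression must cover the whole value.
        return constraint.fullmatch(value) is not None
    return constraint == value

def record_matches(description, record):
    """A description is read as a conjunction of feature constraints."""
    return all(
        feature in record and value_matches(constraint, record[feature])
        for feature, constraint in description.items()
    )

# Hypothetical terminal nodes as flat feature records.
nodes = [
    {"word": "Mann", "pos": "NN"},
    {"word": "Maria", "pos": "NE"},
    {"word": "Haus", "pos": "NN"},
]

# Proper nouns or nouns: the value of pos is a disjunction.
nouns = [n["word"] for n in nodes if record_matches({"pos": {"NE", "NN"}}, n)]

# Nouns with initial M: the compiled pattern stands for a /.../ expression.
m_nouns = [
    n["word"] for n in nodes
    if record_matches({"pos": "NN", "word": re.compile(r"M.*")}, n)
]
```
]]>
      In this sketch, nouns collects all three hypothetical words, while m_nouns contains only the entry tagged as a noun whose word form starts with M.</Paragraph>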
    </Section>
    <Section position="2" start_page="1056" end_page="1057" type="sub_section">
      <SectionTitle>
2.3 Node relations
</SectionTitle>
      <Paragraph position="0"> Since graphs are two-dimensional objects, we need one basic node relation for each dimension: direct precedence . for the horizontal dimension and direct dominance &gt; for the vertical dimension (the precedence of two inner nodes is defined as the precedence of their leftmost terminal successors (Lezius and König, 2000a)). Some convenient derived node relations are the following:
        &gt;* dominance (minimum path length 1)
        &gt;n dominance in n steps (n &gt; 0)
        &gt;m,n dominance between m and n steps (0 &lt; m &lt; n)
        &gt;@l leftmost terminal successor ('left corner')
        &gt;@r rightmost terminal successor ('right corner')
        .* precedence (minimum number of intervals: 1)
        .n precedence with n intervals (n &gt; 0)
        .m,n precedence between m and n intervals (0 &lt; m &lt; n)
        $ siblings
        $.* siblings with precedence</Paragraph>
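      <Paragraph position="1"> The relations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the small graph, the node names, and all function names are hypothetical. The sketch follows the definition given above, in which the precedence of inner nodes is reduced to the precedence of their leftmost terminal successors.
<![CDATA[
```python
# Hypothetical syntax graph: inner node -> list of children.
children = {
    "S": ["NP", "VP"],
    "NP": ["t0", "t1"],
    "VP": ["t2"],
}
# Terminal nodes mapped to their position in the sentence.
position = {"t0": 0, "t1": 1, "t2": 2}

def dominates_directly(a, b):
    """Direct dominance: b is a child of a."""
    return b in children.get(a, [])

def dominates(a, b):
    """Dominance with minimum path length 1."""
    return any(c == b or dominates(c, b) for c in children.get(a, []))

def terminals(a):
    """Terminal successors of a; a terminal yields itself."""
    if a in position:
        return [a]
    return [t for c in children[a] for t in terminals(c)]

def left_corner(a):
    """Leftmost terminal successor."""
    return min(terminals(a), key=position.get)

def right_corner(a):
    """Rightmost terminal successor."""
    return max(terminals(a), key=position.get)

def precedes_in(a, b, n):
    """Precedence with n intervals, computed on the left corners."""
    return position[left_corner(b)] - position[left_corner(a)] == n

def precedes_directly(a, b):
    """Direct precedence is precedence with exactly one interval."""
    return precedes_in(a, b, 1)

def siblings(a, b):
    """Two distinct nodes that share a parent."""
    return a != b and any(a in cs and b in cs for cs in children.values())
```
]]>
      Note that, under this definition, the hypothetical NP does not directly precede the VP: their left corners t0 and t2 are two intervals apart, so only the relation with two intervals holds between them.</Paragraph>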
    </Section>
    <Section position="3" start_page="1057" end_page="1057" type="sub_section">
      <SectionTitle>
2.4 Graph descriptions
</SectionTitle>
      <Paragraph position="0"> We admit restricted Boolean expressions over node relations, i.e. conjunction and disjunction, but no negation. For example, the queries</Paragraph>
      <Paragraph position="2"> are both satisfied by the NP-constituent in Fig. 2. #n1, #n2 are variables. The symbol &quot;NR&quot; is an edge label. Edges can be labelled in order to indicate the syntactic relation between two nodes.</Paragraph>
    </Section>
    <Section position="4" start_page="1057" end_page="1057" type="sub_section">
      <SectionTitle>
2.5 Types
</SectionTitle>
      <Paragraph position="0"> For the purpose of conceptual clarity, the user can define type hierarchies. 'Subtypes' may also be constants, e.g. as in the case of part-of-speech symbols. Here is an excerpt from the type hierarchy for the STTS tagset:</Paragraph>
      <Paragraph position="2"> This hierarchy can be used to formulate queries in a more concise manner: [pos=nominal] .* [pos=&quot;VVFIN&quot;]</Paragraph>
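      <Paragraph position="3"> One way to read a type in a query is as shorthand for the disjunction of all constants below it in the hierarchy. The following Python sketch illustrates this reading. The hierarchy shown is a hypothetical excerpt built from STTS tags and is not the hierarchy defined by the authors; the function name constants is likewise an assumption.
<![CDATA[
```python
# Hypothetical excerpt of a type hierarchy over STTS tags; this is an
# assumption for illustration, not the authors' actual hierarchy.
hierarchy = {
    "nominal": ["noun", "properNoun", "pronoun"],
    "noun": ["NN"],
    "properNoun": ["NE"],
    "pronoun": ["PPER", "PRF"],
}

def constants(type_name):
    """Expand a type into the set of tag constants it covers."""
    if type_name not in hierarchy:
        # Anything without subtypes is treated as a constant.
        return {type_name}
    result = set()
    for sub in hierarchy[type_name]:
        result |= constants(sub)
    return result

nominal_tags = constants("nominal")
```
]]>
      Under these assumptions, a constraint on the type nominal would be evaluated like a disjunction over the four tags in nominal_tags, whereas a constant such as VVFIN expands only to itself.</Paragraph>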
    </Section>
    <Section position="5" start_page="1057" end_page="1058" type="sub_section">
      <SectionTitle>
2.6 Templates
</SectionTitle>
      <Paragraph position="0"> For a concrete lexicon acquisition task, for example, one might have to define a collection of interdependent, complex queries. In order to keep the resulting code tractable and reusable, queries can be organised into templates (or macros). Templates can take logical variables as arguments and may refer to other templates, as long as there is no (embedded) self-reference. Logically, templates are offline-compilable Horn formulas.</Paragraph>
      <Paragraph position="1"> Here are some examples of template definitions. A simple notion of VerbPhrase is defined with reference to a notion of</Paragraph>
      <Paragraph position="3"/>
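      <Paragraph position="4"> The requirement that templates contain no direct or embedded self-reference can be checked before templates are compiled. The Python sketch below shows one possible check, a depth-first search for cycles in the reference graph. It is an assumption about how such a check could look, not a description of the system; the mapping from template names to referenced names and the name SomeOtherTemplate are hypothetical.
<![CDATA[
```python
def has_self_reference(templates):
    """Return True if some template refers to itself, directly or
    through a chain of other templates.

    `templates` maps a template name to the names it refers to.
    """
    visiting, done = set(), set()

    def visit(name):
        if name in done:
            return False
        if name in visiting:
            return True  # reached a template that is still being expanded
        visiting.add(name)
        cyclic = any(visit(ref) for ref in templates.get(name, []))
        visiting.discard(name)
        done.add(name)
        return cyclic

    return any(visit(name) for name in templates)

# Hypothetical definitions: an admissible set and an inadmissible one.
admissible = {"VerbPhrase": ["SomeOtherTemplate"], "SomeOtherTemplate": []}
inadmissible = {"A": ["B"], "B": ["A"]}
```
]]>
      A template set that passes this check can be expanded by repeatedly replacing template calls with their bodies, which is what makes offline compilation possible.</Paragraph>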
    </Section>
  </Section>
  <Section position="5" start_page="1058" end_page="1058" type="metho">
    <SectionTitle>
3 The Corpus Annotation Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1058" end_page="1058" type="sub_section">
      <SectionTitle>
3.1 Corpus annotation vs. queries
</SectionTitle>
      <Paragraph position="0"> Actually, the query language is rather a description language which can also be used for encoding the syntactic annotation of a corpus. In the current project, a syntactically disambiguated corpus is being produced. This means that, for corpus annotation, only a sublanguage of the proposed language is admissible, with the following restrictions: * The graph constraints may only include the basic node relations (&gt;, .).</Paragraph>
      <Paragraph position="1"> * The only logical connective on all structural levels is the conjunction operator &amp;.</Paragraph>
      <Paragraph position="2"> * Regular expressions are not admitted.</Paragraph>
      <Paragraph position="3"> * Types and templates are not admitted.</Paragraph>
      <Paragraph position="4"> The automatically generated corpus annotation code (generated from the output of the graphical annotation interface) for Fig. 2 looks as follows, with some additional mark-up for ease of processing.</Paragraph>
    </Section>
    <Section position="2" start_page="1058" end_page="1058" type="sub_section">
      <SectionTitle>
3.2 An XML representation
</SectionTitle>
      <Paragraph position="0"> When designing the architecture of our system, we had to deal with the problem of various different formats for the representation of syntactically annotated corpora: Penn Treebank, NeGra (Skut et al., 1997), Tipster, Susanne, several formats for chunked texts and the proposed description language.</Paragraph>
      <Paragraph position="1"> Thus, we have developed an XML-based format which guarantees maximum portability (Mengel and Lezius, 2000). An online conversion tool (NeGra, Penn Treebank → XML) is available on our project homepage.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1058" end_page="1058" type="metho">
    <SectionTitle>
4 Formal Semantics
</SectionTitle>
    <Paragraph position="0"> In contrast to most other corpus description and corpus query languages, our graph description language comes with a formal and a clear-cut operational semantics, which has been described in a technical report (Lezius and König, 2000a). The semantics has been compiled from the corresponding parts of the formal semantics of the typed, unification-based grammar formalisms and constraint-based logic programming languages which have been cited above. Because the corpus and the query are represented in the same description language, one can define a consequence relation between the corpus and the query. Essentially, the annotated corpus corresponds to a Prolog database, and the corpus query to a Prolog query. A query result is a syntax graph from the corpus.</Paragraph>
  </Section>
  <Section position="7" start_page="1058" end_page="1059" type="metho">
    <SectionTitle>
5 Implementation
</SectionTitle>
    <Paragraph position="0"> One might argue that commercial and research implementations for structurally annotated texts are already available, i.e.</Paragraph>
    <Paragraph position="1"> XML-retrieval systems, cf. (LTG, 1999).</Paragraph>
    <Paragraph position="2"> However, we intend to solve problems which are specific to natural language descriptions: non-embedding (non-tree-like) structural annotations (crossing edges) and, in the long term, retrieval of co-indexed substructures (co-reference phenomena). A domain-specific implementation of the search engine gives the basis for optimizations with respect to linguistic applications (Lezius and König, 2000b).</Paragraph>
    <Paragraph position="3"> Before queries can be evaluated on a new corpus (encoded in the NeGra, Penn Treebank or XML format), a preprocessing tool has to convert it into the format of the description language. Subsequently, the corpus is indexed in order to guarantee efficient lookups during query evaluation. The query processor to date is capable of evaluating basic queries (cf. Sect. 2.2-2.4). To support all popular platforms, the tool is implemented in Java. There is a servlet available on the project web page which illustrates the current stage of the implementation.</Paragraph>
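    <Paragraph position="4"> The index structure itself is not specified in this section. Purely as an illustration of why indexing helps, the following Python sketch builds an inverted index from feature-value pairs to node identifiers and answers a conjunctive lookup by set intersection. The data layout, the function names build_index and lookup, and the example corpus are assumptions and should not be read as the design of the actual tool.
<![CDATA[
```python
from collections import defaultdict

def build_index(corpus):
    """Map each (feature, value) pair to the ids of nodes carrying it."""
    index = defaultdict(set)
    for node_id, record in corpus.items():
        for feature, value in record.items():
            index[(feature, value)].add(node_id)
    return index

def lookup(index, **constraints):
    """Ids of nodes satisfying all feature=value constraints."""
    sets = [index.get(pair, set()) for pair in constraints.items()]
    return set.intersection(*sets) if sets else set()

# Hypothetical corpus: node id -> flat feature record.
corpus = {
    0: {"word": "Mann", "pos": "NN"},
    1: {"word": "Maria", "pos": "NE"},
    2: {"word": "Mann", "pos": "NE"},
}
index = build_index(corpus)
```
]]>
      With such an index, a basic feature constraint can be answered without scanning every node of the corpus; only the candidate sets for the mentioned feature-value pairs are touched.</Paragraph>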
  </Section>
</Paper>