<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-2008">
  <Title>References</Title>
  <Section position="1" start_page="0" end_page="14" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In the beginning, there was simply linguistics. Over the years, linguistics has acquired various attributes.</Paragraph>
    <Paragraph position="1"> One example is the attribute corpus. While the term corpus is relatively young in linguistics, what it denotes - a collection of texts and speech in a language - is nothing new at all. When we speak about corpus linguistics nowadays, we have in mind the collecting of language resources in electronic form.</Paragraph>
    <Paragraph position="2"> There is one more attribute that computers, together with mathematics, bring into linguistics - computational. The progress from working with corpora towards the computational approach is driven by the fact that electronic data, combined with the unlimited potential of computers, make it possible to solve natural language processing problems quickly (compared with what a human being can do) on a statistically significant amount of data.</Paragraph>
    <Paragraph position="3"> While listing the attributes, we have to pause at the notion of annotated corpora. Let us build a big corpus including all Czech text data available in electronic form and look at it as a sequence of characters in which the space has a dominant status - that of a separator of words. It is very easy to compare two words (as strings), to calculate how many times these two words appear next to each other in the corpus, how many times they appear separately, and so on. Moreover, it is possible to do this for (more or less) every language. Calculations of this kind are language independent - they are not restricted by knowledge of the language, its morphology, or its syntax. However, if we want to solve more complex language tasks, such as machine translation, we cannot do so without deep knowledge of the language. Thus, we have to transform language knowledge into electronic form as well, i.e. we have to formalize it and then assign it to words (e.g., in the case of morphology) or to sentences (e.g., in the case of syntax). A corpus with such additional information is called an annotated corpus.</Paragraph>
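The language-independent counting described above can be illustrated with a minimal Python sketch. It assumes the corpus is available as one plain string and that whitespace is the only word separator; the function names and the sample sentence are hypothetical and are not part of PDT or Styx.

```python
from collections import Counter

# Hypothetical toy "corpus"; a real corpus would be read from files.
SAMPLE = "the dog sees the cat and the dog barks"


def count_words_and_pairs(text):
    """Split on whitespace and count single words and adjacent word pairs."""
    words = text.split()
    word_counts = Counter(words)
    # zip(words, words[1:]) yields every pair of neighbouring words in order.
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts


def adjacency_stats(text, first, second):
    """Report how often `first` is directly followed by `second`,
    and how often each word occurs in the text overall."""
    word_counts, pair_counts = count_words_and_pairs(text)
    return {
        "together": pair_counts[(first, second)],
        "first_total": word_counts[first],
        "second_total": word_counts[second],
    }


stats = adjacency_stats(SAMPLE, "the", "dog")
```

Nothing in the sketch depends on the morphology or syntax of the language, which is exactly why it cannot help with tasks that require such knowledge.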
    <Paragraph position="4"> We are lucky. There is a real annotated corpus of Czech - the Prague Dependency Treebank (PDT). PDT ranks among the top achievements of world corpus linguistics, and its second edition is ready to be officially published (for the first release see (Hajič et al., 2001)). PDT was born in Prague and arose from the tradition of the successful Prague School of Linguistics. The dependency approach to syntactic analysis, with the verb in the main role, has been applied. The annotations go from the morphological level to the tectogrammatical level (the level of underlying syntactic structure) through the intermediate syntactical-analytical level. The data (2 mil. words) have been annotated in the same direction, i.e., from a simpler level to a more complex one. This is reflected in the amount of data annotated on each level: the largest number of words has been annotated morphologically (2 mil. words) and the lowest number tectogrammatically (0.8 mil. words).</Paragraph>
    <Paragraph position="5"> In other words, 0.8 million words have been annotated on all three levels, 1.5 mil.</Paragraph>
    <Paragraph position="6"> words on both the morphological and the syntactical levels, and 2 mil. words on the lowest, morphological level.</Paragraph>
    <Paragraph position="7"> Besides the verification of 'pre-PDT' theories and the formulation of new ones, PDT serves as training data for machine learning methods. Here, we present Styx, a system designed to be an exercise book of Czech morphology and syntax with exercises selected directly from PDT.</Paragraph>
    <Paragraph position="8"> Schoolchildren can use a computer to write, to draw, to play games, to browse an encyclopedia, to compose music - why could they not use it to parse a sentence, to determine gender, number, case, . . . ? During the development of Styx, two main phases were completed: 1. Transformation of the academic version of PDT into a school one. 20 thousand sentences were automatically selected out of 80 thousand morphologically and syntactically annotated sentences. The complexity of the selected sentences exactly corresponds to the complexity of the sentences practised in current textbooks of Czech. A syntactically annotated sentence in PDT is represented as a tree with the same number of nodes as there are words in the given sentence. This differs from the schemes used at schools (Grepl and Karlík, 1998). On the other hand, the linear structure of PDT morphological annotations was taken as it is - only the morphological categories relevant to school syllabuses were preserved.</Paragraph>
    <Paragraph position="9"> 2. Proposal and implementation of exercises. The general computer facilities of basic and secondary schools were taken into account when choosing the programming language. Styx is implemented in Java, which meets our main requirements - platform independence and system stability.</Paragraph>
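The tree representation mentioned in phase 1 (one node per word, with the verb in the governing role) can be sketched in Python as follows. This is only an illustration under stated assumptions: the list-of-heads encoding, the function names and the example sentence are hypothetical and do not reproduce the actual PDT data format or the Styx implementation.

```python
def build_children(heads):
    """heads[i] is the index of the word governing word i, or None for the root.
    Assumes every non-None head is a valid index into the sentence."""
    children = {i: [] for i in range(len(heads))}
    roots = []
    for node, head in enumerate(heads):
        if head is None:
            roots.append(node)
        else:
            children[head].append(node)
    return roots, children


def is_dependency_tree(heads):
    """Check that the heads describe a single tree covering every word."""
    roots, children = build_children(heads)
    if len(roots) != 1:
        return False
    seen = set()
    stack = [roots[0]]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    # Words caught in a cycle are never reached from the root.
    return len(seen) == len(heads)


# Hypothetical example: "Marie čte knihu" (Marie reads a book).
words = ["Marie", "čte", "knihu"]
heads = [1, None, 1]  # the verb is the root; both nouns depend on it
valid = is_dependency_tree(heads)
```

The structure has exactly as many nodes as the sentence has words, which is the property the text contrasts with the schemes used at schools.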
    <Paragraph position="10"> To our knowledge, there is no such system for any language corpus that makes schoolchildren familiar with an academic product. At the same time, our system represents a challenge and an opportunity for academics to popularize natural language processing, a field with a promising future.</Paragraph>
    <Paragraph position="11"> A number of electronic exercises of Czech morphology and syntax were created.</Paragraph>
    <Paragraph position="12"> However, they were built manually, i.e.</Paragraph>
    <Paragraph position="13"> the authors selected sentences either from memory or at random from books and newspapers, and then analyzed them manually. In this manner, there is no chance to build an exercise system that reflects real language usage to the extent that the Styx system offers.</Paragraph>
  </Section>
</Paper>