<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0114">
  <Title>Towards Automatic Grammar Acquisition from a Bracketed Corpus Thanaruk Theeramunkong</Title>
  <Section position="3" start_page="0" end_page="168" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Designing and refining a natural language grammar is a diiBcult and time-consuming task and requires a large amount of skilled effort. A hand-crafted grammar is usually not completely satisfactory and frequently fails to cover many unseen sentences. Automatic acquisition of grammars is a solution to this problem. Recently, with the increasing availability of large, machine-readable, parsed corpora, there have been numerous attempts to automatically acquire a CFG grammar through the application of enormous existing corporaILar90\]\[Mi194\]\[Per92\]\[Shi95 \].</Paragraph>
    <Paragraph position="1"> Lari and Young\[Lar90\] proposed so-called inside-outside algorithm, which constructs a grammar from an unbracketed corpus based on probability theory. The grammar acquired by this method is assumed to be in Chomsky normal form and a large amount of computation is required. Later, Pereira\[Per92\] applied this algorithm to a partially bracketed corpus to improve the computation time. Kiyono\[Kiy94b\]\[Kiy94a\] combined symbolic and statistical approaches to extract useful grammar rules from a partially bracketed corpus. To avoid generating a large number of grammar rules, some basic grammatical constraints, local boundaries constraints and X bar-theory were applied.</Paragraph>
    <Paragraph position="2"> Kiyono's approach performed a refinement of an original grammar by adding some additional rules while the inside-outside algorithm tries to construct a whole grammar from a corpus based on Maximum Likelihood. However, it is costly to obtain a suitable grammar from an unbracketed corpus and hard to evaluate results of these approaches. As the increase of the construction of bracketed corpora, an attempt to use a bracketed (tagged) corpus for grammar inference was made by Shiral\[Shi95\]. Shirai constructed a Japanese grammar based on some simple rules to give a name (a label) to each bracket in the corpus. To reduce the grammar size and ambiguity, some hand-encoded knowledge is applied in this approach.</Paragraph>
    <Paragraph position="3"> In our work, like Shirai's approach, we make use of a bracketed corpus with lexical tags, but instead of using a set of human-encoded predefined rules to give a name (a label) to each bracket, we introduce some statistical techniques to acquire such label automatically. Using a bracketed corpus, the grammar learning task is reduced to the problem of how to determine the nonterminal label of each bracket in the corpus. More precisely, this task is concerned with the way to classify brackets to some certain groups and give each group a label. We propose a method to group brackets in  a bracketed corpus (with lexical tags), according to their local contextual information, as a first step towards the automatic acquisition of a context-free grammar. In the grouping process, a single nontermina\] label is assigned to each group of brackets which are similar. To do this, we apply and compare two types of techniques called distributional analysis\[HarSl\] and hierarchical Bayesian clustering\[Iwa95\] for setting a measure representing similarity among the bracket groups. We also propose a method to determine the appropriate number of bracket groups based on the concept of entropy analysis. Finally, we present a set of experimental results and evaluate our methods with a model solution given by humans.</Paragraph>
  </Section>
class="xml-element"></Paper>