<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1053">
  <Title>A Word-Order Database for Testing Computational Models of Language Acquisition</Title>
  <Section position="3" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 The Language Domain Database
</SectionTitle>
    <Paragraph position="0"> The focus of the language domain database, (hereafter LDD), is to make readily available the different word order patterns that children are typically exposed to, together with all possible syntactic derivations of each pattern. The patterns and their derivations are generated from a large battery of grammars that incorporate many features from the domain of natural language.</Paragraph>
    <Paragraph position="1"> At this point the multilingual language domain contains sentence patterns and their derivations generated from 3,072 abstract grammars. The patterns encode sentences in terms of tokens denoting the grammatical roles of words and complex phrases, e.g., subject (S), direct object (O1), indirect object (O2), main verb (V), auxiliary verb (Aux), adverb (Adv), preposition (P), etc. An example pattern is S Aux V O1 which corresponds to the English sentence: The little girl can make a paper airplane. There are also tokens for topic and question markers for use when a grammar specifies overt topicalization or question marking.</Paragraph>
    <Paragraph position="2"> Declarative sentences, imperative sentences, negations and questions are represented within the LDD, as is prepositional movement/stranding (pied-piping), null subjects, null topics, topicalization and several types of movement.</Paragraph>
    <Paragraph position="3"> Although more work needs to be done, a first round study of actual child-directed sentences from the CHILDES corpus (MacWhinney, 1995) indicates that our patterns capture many sentential word orders that children typically encounter in the period from 1-1/2 to 2-1/2 years; the period generally accepted by psycholinguists to be when children establish the correct word order of their native language. For example, although the LDD is currently limited to degree-0 (i.e. no embedding) and does not contain DP-internal structure, after examining by hand, several thousand sentences from corpora in the CHILDES database in five languages (English, German, Italian, Japanese and Russian), we found that approximately 85% are degree-0 and an approximate 10 out of 11 have no internal DP structure.</Paragraph>
    <Paragraph position="4"> Adopting the principles and parameters (P&amp;P) hypothesis (Chomsky, 1981) as the underlying framework, we implemented an application that generated patterns and derivations given the following points of variation between languages:  1. Affix Hopping 2. Comp Initial/Final 3. I to C Movement 4. Null Subject 5. Null Topic 6. Obligatory Topic 7. Object Final/Initial 8. Pied Piping 9. Question Inversion 10. Subject Initial/Final 11. Topic Marking 12. V to I Movement 13. Obligatory Wh movement  The patterns have fully specified X-bar structure, and movement is implemented as HPSG local dependencies. Pattern production is generated top-down via rules applied at each subtree level. Subtree levels include: CP, C', IP, I', NegP, Neg', VP, V' and PP. After the rules are applied, the subtrees are fully specified in terms of node categories, syntactic feature values and constituent order. The subtrees are then combined by a simple unification process and syntactic features are percolated down. In particular, movement chains are represented as traditional &amp;quot;slash&amp;quot; features which are passed (locally) from parent to daughter; when unification is complete, there is a trace at the bottom of each slash-feature path. Other features include +/-NULL for non-audible tokens (e.g.</Paragraph>
    <Paragraph position="5"> S[+NULL] represents a null subject pro), +TOPIC to represent a topicalized token, +WH to represent &amp;quot;who&amp;quot;, &amp;quot;what&amp;quot;, etc. (or &amp;quot;qui&amp;quot;, &amp;quot;que&amp;quot; if one prefers), +/-FIN to mark if a verb is tensed or not and the illocutionary (ILLOC) features Q, DEC, IMP for questions, declaratives and imperatives respectively. null Although further detail is beyond the scope of this paper, those interested may refer to Fodor et al. (2003) which resides on the LDD website.</Paragraph>
    <Paragraph position="6"> It is important to note that the domain is suitable for many paradigms beyond the P&amp;P framework. For example the context-free rules (with local dependencies) could be easily extracted and used to test probabilistic CFG learning in a multilingual domain. Likewise the patterns, without their derivations, could be used as input to statistical/connectionist models which eschew traditional (generative) structure altogether and search for regularity in the left-to-right strings of tokens that makeup the learner's input stream. Or, the patterns could help bootstrap the creation of a domain that might be used to test particular types of lexical learning by using the patterns as templates where tokens may be instantiated with actual words from a lexicon of interest to the investigator. The point is that although a particular grammar formalism was used to generate the patterns, the patterns are valid independently of the formalism that was in play during generation.</Paragraph>
    <Paragraph position="7">  To be sure, similar domains have been constructed. The relationship between the LDD and other artificial domains is summarized in Table 1. In designing the LDD, we chose to include syntactic phenomena which: i) occur in a relatively high proportion of the known natural languages;  If this is the case, one might ask: Why bother with a grammar formalism at all; why not use actual child-directed speech as input instead of artificially generated patterns? Although this approach has proved workable for several types of non-generative acquisition models, a generative (or hybrid) learner is faced with the task of selecting the rules or parameter values that generate the linguistic environment being encountered by the learner. In order to simulate this, there must be some grammatical structure incorporated into the experimental design that serves as the target the learner must acquire. Constructing a viable grammar and a parser with coverage over a multilingual domain of real child-directed speech is a daunting proposition. Even building a parser to parse a single language of child-directed speech turns out to be extremely difficult. See, for example, Sagae, Lavie, &amp; MacWhinney (2001), which discusses an impressive number of practical difficulties encountered while attempting to build a parser that could cope with the EVE corpus; one the cleanest transcriptions in the CHILDES database. By abstracting away from actual child-directed speech, we were able to build a pattern generator and include the pattern derivations in the database for retrieval during simulation runs, effectively sidestepping the need to build an online multilingual parser. ii) are frequently exemplified in speech directed to 2-year-olds; iii) pose potential learning problems (e.g. cross-language ambiguity) for which theoretical solutions are needed;  iv) have been a focus of linguistic and/or psycholinguistic research; v) have a syntactic analysis that is broadly agreed on.</Paragraph>
    <Paragraph position="8"> As a result the following have been included: * By criteria (i) and (ii): negation, nondeclarative sentences (questions, imperatives). null * By criterion (iv): null subject parameter (Hyams 1986 and since).</Paragraph>
    <Paragraph position="9"> * By criterion (iv): affix-hopping (though not widespread in natural languages).</Paragraph>
    <Paragraph position="10"> * By criterion (v): no scrambling yet.</Paragraph>
    <Paragraph position="11"> There are several phenomena that the LDD does not yet include: * No verb subcategorization.</Paragraph>
    <Paragraph position="12"> * No interface with LF (cf. Briscoe 2000; Villavicencio 2000).</Paragraph>
    <Paragraph position="13"> * No discourse contexts to license sentence fragments (e.g., DP or PP fragments).</Paragraph>
    <Paragraph position="14"> * No XP-internal structure yet (except PP = P + O3, with piping or stranding).</Paragraph>
    <Paragraph position="15"> * No Linear Correspondence Axiom (Kayne 1994).</Paragraph>
    <Paragraph position="16"> * No feature checking as implementation of  inversion, etc.</Paragraph>
    <Paragraph position="17"> The LDD on the web: The two primary purposes of the web-interface are to allow the user to interactively peruse the patterns and the derivations that the LDD contains and to download raw data for the user to work with locally.</Paragraph>
    <Paragraph position="18"> Users are asked to register before using the LDD online. The user ID is typically an email address, although no validity checking is carried out. The benefit of entering a valid email address is simply to have the ability to recover a forgotten password, otherwise a user can have full access anonymously.</Paragraph>
    <Paragraph position="19"> The interface has three primary areas: Grammar Selection, Sentence Selection and Data Download. First a user has to specify, on the Grammar Selection page, which settings of the 13 parameters are of interest and save those settings as an available grammar. A user may specify multiple grammars. Then in the sentence selection page a user may peruse sentences and their derivations.</Paragraph>
    <Paragraph position="20"> On this page a user may annotate the patterns and derivations however he or she wishes. All grammar settings and annotations are saved and available the next time the user logs on. Finally on the Data Download page, users may download data so that they can use the patterns and derivations offline. The derivations are stored as bracketed strings representing tree structure. These are practically indecipherable by human users. E.g.:  To be readable, the derivations are displayed graphically as tree structures. Towards this end we have utilized a set of publicly available LaTex macros: QTree (Siskind &amp; Dimitriadis, [online]). A server-side script parses the bracketed structures into the proper QTree/LaTex format from which a pdf file is generated and subsequently sent to the user's client application.</Paragraph>
    <Paragraph position="21"> Even with the graphical display, a simple sentence-by-sentence presentation is untenable given the large amount of linguistic data contained in the database. The Sentence Selection area allows users to access the data filtered by sentence type and/or by grammar features (e.g. all sentences that have obligatory-wh movement and contain a prepositional phrase), as well as by the user's defined grammar(s) (all sentences that are &amp;quot;Italian-like&amp;quot;). On the Data Download page, users may filter sentences as on the Sentence Selection page and download sentences in a tab-delimited format. The entire LDD may also be downloaded - approximately 17 MB compressed, 600 MB as a raw ascii file.</Paragraph>
  </Section>
  <Section position="4" start_page="2" end_page="8" type="metho">
    <SectionTitle>
3 A Case Study: Evaluating the efficiency of parameter-setting acquisition models
</SectionTitle>
    <Paragraph position="1"> We have recently run experiments of seven parameter-setting (P&amp;P) models of acquisition on the domain. What follows is a brief discussion of the algorithms and the results of the experiments. We note in particular where results stemming from work with the LDD lead to conclusions that differ from those previously reported. We stress that this is not intended as a comprehensive study of parameter-setting algorithms or acquisition algorithms in general. There is a large number of models that are omitted; some of which are targets of current investigation. Rather, we present the study as an example of how the LDD could be effectively utilized.</Paragraph>
    <Paragraph position="2"> In the discussion that follows we will use the terms &amp;quot;pattern&amp;quot;, &amp;quot;sentence&amp;quot; and &amp;quot;input&amp;quot; interchangeably to mean a left-to-right string of tokens drawn from the LDD without its derivation.</Paragraph>
    <Section position="1" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
3.1 A Measure of Feasibility
</SectionTitle>
      <Paragraph position="0"> As a simple example of a learning strategy and of our simulation approach, consider a domain of 4 binary parameters and a memoryless learner  which blindly guesses how all 4 parameters should be set upon encountering an input sentence. Since there are 4 parameters, there are 16 possible combinations of parameter settings. i.e., 16 different grammars. Assuming that each of the 16 grammars is equally likely to be guessed, the learner will consume, on average, 16 sentences before achieving the target grammar. This is one measure of a model's efficiency or feasibility.</Paragraph>
      <Paragraph position="1">  By &amp;quot;memoryless&amp;quot; we mean that the learner processes inputs one at a time without keeping a history of encountered inputs or past learning events.</Paragraph>
      <Paragraph position="2"> However, when modeling natural language acquisition, since practically all human learners attain the target grammar, the average number of expected inputs is a less informative statistic than the expected number of inputs required for, say, 99% of all simulation trials to succeed. For our blind-guess learner, this number is 72.</Paragraph>
      <Paragraph position="3">  We will use this 99-percentile feasibility measure for most discussion that follows, but also include the average number of inputs for completeness.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
3.2 The Simulations
</SectionTitle>
      <Paragraph position="0"> In all experiments:  * The learners are memoryless. * The language input sample presented to the learner consists of only grammatical sentences generated by the target grammar. * For each learner, 1000 trials were run for each of the 3,072 target languages in the LDD. * At any point during the acquisition process,  each sentence of the target grammar is equally likely to be presented to the learner. Subset Avoidance and Other Local Maxima: Depending on the algorithm, it may be the case that a learner will never be motivated to change its current hypothesis (G curr ), and hence be unable to ultimately achieve the target grammar (G targ ). For example, most error-driven learners will be trapped if G curr generates a language that is a superset of the language generated by G targ . There is a wealth of learnability literature that addresses local maxima and their ramifications.</Paragraph>
      <Paragraph position="1">  However, since our study's focus is on feasibility (rather than on whether a domain is learnable given a particular algorithm), we posit a built-in avoidance mechanism, such as the subset principle and/or default values that preclude local maxima; hence, we set aside trials where a local maximum ensues.  The average and 99-percentile figures (16 and 72) in this section are easily derived from the fact that input consumption follows a hypergeometric distribution.  Discussion of the problem of subset relationships among languages starts with Gold's (1967) seminal paper and is discussed in Berwick (1985) and Wexler &amp; Manzini (1987). Detailed accounts of the types of local maxima that the learner might encounter in a domain similar to the one we employ are given in Frank &amp; Kapur (1996), Gibson &amp; Wexler (1994), and Niyogi &amp; Berwick (1996).</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="8" type="sub_section">
      <SectionTitle>
3.3 The Learners' strategies
</SectionTitle>
      <Paragraph position="0"> In all cases the learner is error-driven: if G curr can parse the current input pattern, retain it.  The following refers to what the learner does when G curr fails on the current input. * Error-driven, blind-guess (EDBG): adopt any grammar from the domain chosen at random not psychologically plausible, it serves as our baseline.</Paragraph>
      <Paragraph position="1"> * TLA (Gibson &amp; Wexler, 1994): change any one parameter value of those that make up G  . Adopt it. (I.e. there is no testing of the new grammar against the current input). * Non-SVC TLA (Niyogi &amp; Berwick, 1996): try any grammar in the domain. Adopt it only in the event that it can parse the current input. * Guessing STL (Fodor, 1998a): Perform a structural parse of the current input. If a choice point is encountered, chose an alternative based on one of the following and then set parameter values based on the final parse tree: * STL Random Choice (RC) - randomly pick a parsing alternative.</Paragraph>
      <Paragraph position="2"> * Minimal Chain (MC) - pick the choice that obeys the Minimal Chain Principle (De Vincenzi, 1991), i.e., avoid positing movement transformations if possible.</Paragraph>
      <Paragraph position="3"> * Local Attachment/Late Closure (LAC) -pick the choice that attaches the new word to the current constituent (Frazier, 1978).</Paragraph>
      <Paragraph position="4">  The EDBG learner is our first learner of interest. It is easy to show that the average and 99% scores increase exponentially in the number of parameters and syntactic research has proposed more than 100 (e.g. Cinque, 1999). Clearly, human learners do not employ a strategy that performs as poorly as this. Results will serve as a baseline to compare against other models.</Paragraph>
      <Paragraph position="5">  We intend for a &amp;quot;can-parse/can't-parse outcome&amp;quot; to be equivalent to the result from a language membership test. If the current input sentence is one of the set of sentences  The TLA: The TLA incorporates two search heuristics: the Single Value Constraint (SVC) and Greediness. In the event that G  is rejected as a hypothesis (Greediness). Following Berwick and Niyogi (1996), we also ran simulations on two variants of the TLA - one with the Greediness heuristic but without the SVC (TLA minus SVC, TLA-SVC) and one with the SVC but without Greediness (TLA minus Greediness, TLA-Greed). The TLA has become a seminal model and has been extensively studied (cf. Bertolo, 2001 and references therein; Berwick &amp; Niyogi, 1996; Frank &amp; Kapur, 1996; Sakas, 2000; among others). The results from the TLA variants operating in the LDD are presented in Table 3.</Paragraph>
      <Paragraph position="6">  Particularly interesting is that contrary to results reported by Niyogi &amp; Berwick (1996) and Sakas &amp; Nishimoto (2002), the SVC and Greediness constraints do help the learner achieve the target in the LDD. The previous research was based on simulations run on much smaller 9 and 16 language domains (see Table 1). It would seem that the local hill-climbing search strategies employed by the TLA do improve learning efficiency in the LDD. However, even at best, the TLA performs less well than the blind guess learner. We conjecture that this fact probably rules out the TLA as a viable model of human language acquisition. The STL: Fodor's Structural Triggers Learner (STL) makes greater use of the parser than the TLA. A key feature of the model is that parameter values are not simply the standardly presumed 0 or 1, but rather bits of tree structure or treelets. Thus, a grammar, in the STL sense, is a collection of treelets rather than a collection of 1's and 0's. The STL is error-driven. If G curr cannot license s, new treelets will be utilized to achieve a successful parse.</Paragraph>
      <Paragraph position="7">  Treelets are applied in the same way as any &amp;quot;normal&amp;quot; grammar rule, so no unusual parsing activity is necessary. The STL hypothesizes grammars by adding parameter value treelets to</Paragraph>
      <Paragraph position="9"> when they contribute to a successful parse.</Paragraph>
      <Paragraph position="10"> The basic algorithm for all STL variants is:  1. If G curr can parse the current input sentence, retain the treelets that make up G curr .</Paragraph>
      <Paragraph position="11"> 2. Otherwise, parse the sentence making use of  any or all parametric treelets available and adopt those treelets that contribute to a successful parse. We call this parametric decoding. null Because the STL can decode inputs into their parametric signatures, it stands apart from other acquisition models in that it can detect when an input sentence is parametrically ambiguous. During a parse of s, if more than one treelet could be used by the parser (i.e., a choice point is encountered), then s is parametrically ambiguous. The TLA variants do not have this capacity because they rely only on a can-parse/can't-parse outcome and do not have access to the on-line operations of the parser. Originally, the ability to detect ambiguity was employed in two variations of the STL: the strong STL (SSTL) and the weak STL.</Paragraph>
      <Paragraph position="12"> The SSTL executes a full parallel parse of each input sentence and adopts only those treelets (parameter values) that are present in all the generated parse trees. This would seem to make the SSTL an extremely powerful, albeit psychologically implausible, learner.</Paragraph>
      <Paragraph position="13">  However, this is not necessarily the case. The SSTL needs some unambiguity to be present in the structures derived from the sentences of the target language. For example, there may not be a single input generated by G targ that when parsed yields an unambiguous treelet for a particular parameter.</Paragraph>
      <Paragraph position="14">  In addition to the treelets, UG principles are also available for parsing, as they are in the other models discussed above.  It is important to note that Fodor (1998a) does not put forth the strong STL as a psychologically plausible model. Rather, it is intended to demonstrate the potential effectiveness of parametric decoding.</Paragraph>
      <Paragraph position="15"> Unlike the SSTL, the weak STL executes a psychologically plausible left-to-right serial (deterministic) parse. One variant of the weak STL, the waiting STL (WSTL), deals with ambiguous inputs abiding by the heuristic: Don't learn from sentences that contain a choice point. These sentences are simply discarded for the purposes of learning. This is not to imply that children do not parse ambiguous sentences they hear, but only that they set no parameters if the current evidence is ambiguous.</Paragraph>
      <Paragraph position="16"> As with the TLA, these STL variants have been studied from a mathematical perspective (Bertolo et al., 1997a; Sakas, 2000). Mathematical analyses point to the fact that the strong and weak STL are extremely efficient learners in conducive domains with some unambiguous inputs but may become paralyzed in domains with high degrees of ambiguity. These mathematical analyses among other considerations spurred a new class of weak STL variants which we informally call the guessing STL family.</Paragraph>
      <Paragraph position="17"> The basic idea behind the guessing STL models is that there is some information available even in sentences that are ambiguous, and some strategy that can exploit that information. We incorporate three different heuristics into the original STL paradigm, the RC, MC and LAC heuristics described above.</Paragraph>
      <Paragraph position="18"> Although the MC and LAC heuristics are not stochastic, we regard them as &amp;quot;guessing&amp;quot; heuristics because, unlike the WSTL, a learner cannot be certain that the parametric treelets obtained from a parse guided by MC and LAC are correct for the target. These heuristics are based on well-established human parsing strategies. Interestingly, the difference in performance between the three variants is slight. Although we have just begun to look at this data in detail, one reason may be that the typical types of problems these parsing strategies address are not included in the LDD (e.g. relative clause attachment ambiguity). Still, the STL variants perform the most efficiently of the strategies presented in this small study (approximately a 100-fold improvement over the TLA). Certainly this is due to the STL's ability to perform parametric decoding. See Fodor (1998b) and Sakas &amp; Fodor (2001) for detailed discussion about the power of decoding when applied to the acquisition process.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>