<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0743"> <Title>The Acquisition of Word Order by a Computational Learning System</Title> <Section position="4" start_page="0" end_page="213" type="metho"> <SectionTitle> 2 The Learning System </SectionTitle> <Paragraph position="0"> The learning system is composed of a language learner equipped with a UG and a learning algorithm that updates the initial parameter settings, based on exposure to a corpus of utterances. Each of these components is discussed in more detail in the following sections.</Paragraph> <Section position="1" start_page="209" end_page="210" type="sub_section"> <SectionTitle> 2.1 The Universal Grammar </SectionTitle> <Paragraph position="0"> The UG consists of principles and parameters, and the latter are set according to the linguistic environment (Chomsky 1981). This proposal suggests that human languages follow a common set of principles and differ from one another only in finitely many respects, represented by a finite number of parameters that can vary over a finite number of values (which makes them learnable in Gold's paradigm). In this section, we discuss the UG and associated parameters, which are formalised in terms of a Unification-Based Generalised Categorial Grammar (UB-GCG), embedded in a default inheritance network of lexical types. We concentrate on the description of word order parameters, which reflect the basic order in which constituents occur in different languages.</Paragraph> <Paragraph position="1"> UB-GCGs extend the basic Categorial Grammars (Bar-Hillel 1964) by including the use of attribute-value pairs associated with each category and by using a larger set of rules and operators. Words, categories and rules are represented in terms of typed default feature structures (TDFSs), which encode orthographic, syntactic and semantic information. There are two types of categories: atomic categories (s - sentence, np - noun phrase, and n - noun), which are saturated, and complex categories, which are unsaturated. Complex categories have a functor category (defined in RESULT) and a list of subcategorised elements (defined in ACTIVE), with each element in the list defined in terms of two features: SIGN, encoding the category, and DIRECTION, encoding the direction in which the category is to be combined (where VALUE can be either forward or backward). As an example, in English an intransitive verb (s\np) is encoded as shown in figure 1, where only the relevant attributes are shown.</Paragraph> <Paragraph position="2"> In this work, we employ the rules of (forward and backward) application, (forward and backward) composition and generalised weak permutation. A more detailed description of the UB-GCG used can be found in (Villavicencio 2000).</Paragraph> <Paragraph position="3"> The UG is implemented as a UB-GCG, embedded in a default inheritance network of lexical types (Villavicencio 1999), implemented in the YADU framework (Lascarides and Copestake 1999).</Paragraph>
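As a side note to the description of figure 1 above, a minimal sketch of how such a complex category could be represented is given below, in Python. The class names and fields are purely illustrative (they echo RESULT, ACTIVE, SIGN and DIRECTION from the text) and are not the TDFS/YADU encoding actually used by the system.

# Illustrative sketch only: a complex category in the spirit of figure 1,
# not the actual typed default feature structure encoding used in the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Arg:
    sign: str        # category of the subcategorised element, e.g. "np"
    direction: str   # "forward" or "backward"

@dataclass
class Category:
    result: str                                       # functor's result category (RESULT)
    active: List[Arg] = field(default_factory=list)   # subcategorisation list (ACTIVE)

# An English intransitive verb, s\np: result s, one np argument combined backwards.
intransitive = Category(result="s", active=[Arg(sign="np", direction="backward")])
print(intransitive)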
<Paragraph position="4"> The categories and rules in the grammar are defined as types in the hierarchy, represented in terms of TDFSs, and the feature structures associated with any given category or rule are defined by the inheritance chain.</Paragraph> <Paragraph position="5"> With different sub-networks used to encode different kinds of linguistic knowledge, linguistic regularities are encoded near the top of a network, while types further down the network are used to represent sub-regularities or exceptions.</Paragraph> <Paragraph position="6"> Thus, types are concisely defined, with only specific information being described, since more general information is inherited from the supertypes. The resulting UB-GCG is compact, since it avoids redundant specifications, and the information is structured in a clear and concise way through the specification of linguistic regularities, sub-regularities and exceptions.</Paragraph> <Paragraph position="7"> Regarding the categories of the UB-GCG, word order parameters are those that specify the direction of each element in the subcategorisation list of a complex category. In figure 1, subjdir is a parameter specifying that the np subject is to be combined backwards. As the categories are defined in terms of an inheritance hierarchy, the parameters (and their values) in these categories are propagated throughout the hierarchy, from supertypes to subtypes, which inherit this information by default. There are 28 parameters defined, and they are also in a hierarchical relationship, with the supertype being gendir, which specifies, by default, the general direction for a language, and from which all the other parameters inherit. Among the subtypes, we have subjdir, which specifies the direction of the subject, vargdir, which specifies the direction of the other verbal arguments, and ndir, which specifies the direction of nominal categories. A fragment of the parameters hierarchy can be seen in figure 2. With these 28 binary-valued parameters the UG defines a space of almost 800 grammars.</Paragraph> <Paragraph position="8"> The parameters are set based on exposure to a particular language, and while they are unset, they inherit their value, by default, from their supertypes. Then, when they are set, they can either continue to inherit by default, in case they have the same value as the supertype, or they can override this default and specify their own value, breaking the inheritance chain. For instance, in the case of English, the value of gendir is defined, by default, as forward, capturing the fact that it is a predominantly right-branching language, and all its subtypes, like subjdir and vargdir, inherit this default information. Then an intransitive verb, which has the direction of the subject specified by subjdir, will be defined as S/NP, with subjdir having the default value forward.</Paragraph>
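A minimal sketch of this default-inheritance behaviour is given below, in Python; the Parameter class is a hypothetical stand-in for the actual type hierarchy and YADU machinery, and only illustrates how an unset parameter falls back on its supertype until it is overridden. The English case it plays through is spelled out in the prose that follows.

# Illustrative sketch of default inheritance for word order parameters:
# an unset parameter returns its supertype's current value; setting a value
# that differs from the supertype overrides the default (breaks the chain).
class Parameter:
    def __init__(self, name, supertype=None):
        self.name = name
        self.supertype = supertype
        self.local_value = None      # None means "unset": inherit by default

    def value(self):
        if self.local_value is not None:
            return self.local_value
        return self.supertype.value() if self.supertype else None

    def set(self, value):
        self.local_value = value

gendir = Parameter("gendir")
gendir.set("forward")                            # English: predominantly right-branching
subjdir = Parameter("subjdir", supertype=gendir)
vargdir = Parameter("vargdir", supertype=gendir)

print(subjdir.value())    # forward, inherited by default
subjdir.set("backward")   # subject NPs occur to the left: override the default
print(subjdir.value())    # backward
print(vargdir.value())    # forward, still inherited from gendir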
<Paragraph position="9"> However, as the subject NP in English occurs to the left of the verb, utterances with the subject to the left will trigger a change in subjdir to backward, which overrides the default value, breaking the inheritance chain (figure 3). As a result, intransitive verbs are defined as S\NP (figure 1), so that the grammar accounts for these sentences. In the syntactic dimension of this network, intransitive verbs can be considered the general case of verbs, and the information defined in this node is propagated through the hierarchy to its subtypes, such as the transitive verbs (figure 3). For the learner, the information about subjects (subjdir = backward) has already been acquired while learning intransitive verbs, and the learner does not need to learn it again for transitive verbs, which not only inherit this information, but also have the direction of the object defined by vargdir (vargdir = forward), as shown in figure 3. The use of such a hierarchy and a default inheritance schema reduces the pieces of information to be acquired by the learner, since the information is structured and what it learns is not a single isolated category, but a structure that represents this information in a general manner. This is a clear and concise way of defining the UG, with the parameters being straightforwardly defined in the categories, in a way that takes advantage of the default inheritance mechanism to propagate information about parameters throughout the lexical inheritance network.</Paragraph> </Section> <Section position="2" start_page="210" end_page="211" type="sub_section"> <SectionTitle> 2.2 The Corpus </SectionTitle> <Paragraph position="0"> The UG has to be general enough to capture the grammar for any language, and the parameters have to be set to account for a particular language, based on exposure to that language.</Paragraph> <Paragraph position="1"> This can be obtained by means of a corpus of utterances annotated with logical forms, which is described in this section. Among these sentences, some will be triggers for certain parameters, in the sense that, to parse such a sentence, some of the parameters will have to be set to a given value. We are using the Sachs corpus (Sachs 1983) from the CHILDES project (MacWhinney 1995), which contains interactions between only one child and her parents, from the age of 1 year and 1 month to 5 years and 1 month. From the resulting corpus, we extracted material for generating two different corpora: one containing only the child's sentences and the other containing the caretakers' sentences.</Paragraph> <Paragraph position="2"> The caretakers' corpus is given as input to the learner to mirror the input to which a child learning a language is exposed, and the child's corpus is used for comparative purposes.</Paragraph> <Paragraph position="3"> In order to annotate the caretakers' corpus with the associated logical forms, a UB-GCG for English was built that covers all the constructions in the corpus: several verbal constructions (intransitives, transitives, ditransitives, obliques, control verbs, verbs with sentential complements, etc.), declarative, imperative and interrogative sentences, and unbounded dependencies (wh-questions and relative clauses), among others. Thus the caretakers' corpus contains sentences annotated with logical forms, and an example can be seen in figure 4, for the sentence I will take him, where a simplified version of the relevant attributes is shown for reasons of clarity. Each predicate in the semantics list is associated with a word in the sentence and, among other things, it contains information about the identifier of the predicate (SIT), the required arguments (e.g. ACTOR and UNDERGOER for the verb take), as well as about the interaction with other predicates, specified by the boxed indices (e.g. take:ACTOR is coindexed with I:SIT). This grammar is not only used for annotating the corpus, but is also the target to which the learner has to converge.
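As a rough illustration of what such an annotated entry might look like, a sketch in Python follows; the predicate and attribute names merely echo the description of figure 4, and the index names (x1, e2, etc.) are invented for the example, whereas the actual corpus entries are full feature structures.

# Illustrative sketch of a corpus entry paired with a simplified logical form,
# loosely following the description of figure 4 (names are only mnemonic here).
sentence = "I will take him"
logical_form = [
    {"pred": "i",    "sit": "x1"},
    {"pred": "will", "sit": "e1", "arg": "e2"},
    {"pred": "take", "sit": "e2", "actor": "x1", "undergoer": "x2"},
    {"pred": "him",  "sit": "x2"},
]
# Shared identifiers (x1, x2, e2) play the role of the boxed indices in
# figure 4, e.g. take's ACTOR is coindexed with the SIT of the predicate for "I".
print(sentence, logical_form)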
At the moment, around 1,300 utterances have been annotated with corresponding logical forms, covering data from when the child is 14 months old to 20 months old.</Paragraph> </Section> <Section position="3" start_page="211" end_page="213" type="sub_section"> <SectionTitle> 2.3 The Learning Algorithm </SectionTitle> <Paragraph position="0"> The learning algorithm implements the Bayesian Incremental Parameter Setting (BIPS) algorithm defined by Briscoe (1999).</Paragraph> <Paragraph position="1"> The parameters are binary-valued, and each possible value of a parameter is associated with a prior and a posterior probability. The value with the highest posterior probability is used as the current value. Initially, in the learning process, the posterior probability associated with each parameter is initialised to the prior probability, and these values define the parameter settings used. Then, as trigger sentences are successfully parsed, the posterior probabilities of the parameter settings that allowed the sentence to be parsed are reinforced.</Paragraph> <Paragraph position="2"> Otherwise, when a sentence cannot be parsed (with the correct logical form), the learning algorithm checks if a successful parse can be achieved by changing the values of some of the parameters, in constrained ways. If that is the case, the posterior probabilities of the values used are reinforced in each of the parameters, and if they achieve a certain threshold, they are retained as the current values; otherwise the previous values are kept. This constraint on the setting of the parameters ensures that a trigger does not cause an immediate change to a different grammar. The learner, instead, has to wait for enough evidence in the data before it can change the value of any parameter. As a consequence, the learner behaves in a more conservative way, being robust to noise present in the input data.</Paragraph> <Paragraph position="3"> Following Briscoe (1999), the probabilities associated with the parameter values correspond to weights represented in terms of fractions, with the denominator storing the total evidence for a parameter and the numerator storing the evidence for a particular value of that parameter. For instance, if the value backward of the subjdir parameter has a weight of 9/10, it means that of the 10 times evidence was provided for subjdir, 9 times it was for the value backward and only once for the other value, forward. Table 1 shows a possible initialisation for the subjdir parameter, where the prior has a weight of 1/10 for forward, corresponding to a probability of 0.1, and a weight of 9/10 for backward, corresponding to a probability of 0.9. The posterior is initialised with the same values as the prior, and as backward has a higher posterior probability it is used as the current value for the parameter. These initial parameter values determine the initial grammar for the learner. As triggers are processed, they provide evidence for certain parameters, and this evidence is represented as additions to the denominator and/or numerator of each of the posterior weights of the parameter values. Table 2 shows the status of the parameter after 5 triggers that provided evidence for the value backward. Initially, the learner uses the evidence provided by the triggers to choose certain parameter values, in order to be able to parse these triggers successfully while generating the appropriate logical form.
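The following is a schematic sketch, in Python, of the weight bookkeeping just described. The fraction representation and the five-trigger run mirror the worked subjdir example (tables 1 and 2); the concrete 0.7 threshold, however, is an invented placeholder, since the text only states that some threshold must be reached before a current value changes.

# Schematic sketch of the BIPS-style bookkeeping described above (a
# simplification, not the system's implementation): each value of a binary
# parameter keeps a weight evidence_for_value / total_evidence, and the
# current value only changes once the competing value crosses a threshold.
class BinaryParameter:
    def __init__(self, name, prior):
        # prior: dict mapping each value to a (numerator, denominator) weight
        self.name = name
        self.posterior = {v: list(w) for v, w in prior.items()}
        self.current = max(prior, key=lambda v: prior[v][0] / prior[v][1])

    def prob(self, value):
        num, den = self.posterior[value]
        return num / den

    def reinforce(self, value, threshold=0.7):   # threshold value is hypothetical
        # A trigger provided evidence for `value`: bump its numerator and
        # every value's denominator, then re-check the current setting.
        for v in self.posterior:
            self.posterior[v][1] += 1
        self.posterior[value][0] += 1
        if value != self.current and self.prob(value) > threshold:
            self.current = value

subjdir = BinaryParameter("subjdir", {"forward": (1, 10), "backward": (9, 10)})
for _ in range(5):                    # five triggers with the subject to the left
    subjdir.reinforce("backward")
print(subjdir.current, round(subjdir.prob("backward"), 2))   # backward 0.93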
After that, the triggers are used to reinforce these values, or to negate them.</Paragraph> <Paragraph position="4"> As the parameters are defined in a default inheritance hierarchy, each time the posterior probability of a given parameter is updated, it is necessary to update the posterior probabilities of its supertypes and examine the current parameter settings to determine what the most appropriate hierarchy for these settings is, given the goal of converging to the target. The learner has a preference for grammars (and thus hierarchies) that not only model the data (represented by the current settings) well, but are also compact, following the Minimum Description Length (MDL) Principle. In this case, the most probable grammar in the grammar space, among the ones consistent with the parameter settings, is the one whose default inheritance hierarchy is the most concise, having the minimum number of non-default parameter values specified, as described in (Villavicencio 2000).</Paragraph> <Paragraph position="5"> The 28 word order parameters are defined in a hierarchical relation, with the supertype parameters being set in accordance with the subtypes, to reflect the value of the majority of the subtypes. In this way, as the values of the subtypes are being set, they influence the value of the supertypes. If the value of a given subtype differs from the value of the supertype, the subtype overrides the inherited default value and specifies its own value, breaking the inheritance chain. For instance, in figure 3, subjdir overrides the default value specified by gendir. Unset subtype parameters inherit, by default, the current value of their supertypes, and while they are unset they do not influence the values of their supertypes.</Paragraph> </Section> </Section> <Section position="5" start_page="213" end_page="216" type="metho"> <SectionTitle> 3 The Acquisition of Word Order </SectionTitle> <Paragraph position="0"> We are investigating the acquisition of word order, which reflects the underlying order in which constituents occur in different languages.</Paragraph> <Paragraph position="1"> In this section we describe one experiment, in which we compare the performance of different learners under four conditions. Each learner is given as input the annotated corpus of sentences paired with logical forms, and it has to change the values of the parameters corresponding to the relevant constituents to account for the order in which these constituents appear in the input sentences. We defined five different learners corresponding to five different initialisations of the parameter settings of the UG, to investigate how the initialisations, or starting points, of the learners influence convergence to the target grammar. The first one, the unset learner, is initialised with all parameters unset, and the others, the default learners, are each initialised with default parameter values corresponding to one of four basic word orders, defined in terms of the canonical order of the verb (V), subject (S) and objects (O): SVO, SOV, VSO and OVS. We initialised the parameters subjdir, vargdir and gendir of the default learners according to each of the basic orders, with gendir having the same direction as vargdir, and all the other parameters having unset values. These parameters have the prior and posterior probabilities initialised with 0.1 for one value and 0.9 for the other, as summarised in the sketch below.
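A compact rendering of these five starting points is sketched below (a hypothetical summary for illustration; the 0.1/0.9 figures come from the text, but the dictionary layout is not the system's representation). The SVO case is spelled out in the prose that follows.

# Illustrative summary of the five learners' initialisations: each listed
# direction starts with prior = posterior = 0.9, the opposite value with 0.1;
# all other word order parameters start unset.
UNSET = None

def default_learner(subj, varg):
    # gendir takes the same direction as vargdir, as described in the text.
    return {"subjdir": subj, "vargdir": varg, "gendir": varg}

learners = {
    "unset": {"subjdir": UNSET, "vargdir": UNSET, "gendir": UNSET},
    "SVO": default_learner(subj="backward", varg="forward"),   # subject left, objects right
    "SOV": default_learner(subj="backward", varg="backward"),
    "VSO": default_learner(subj="forward",  varg="forward"),
    "OVS": default_learner(subj="forward",  varg="backward"),
}
print(learners["SVO"])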
In this way, an SVO learner, for example, is initialised with subjdir having backward as its current value (0.9), vargdir forward (0.9) and gendir forward (0.9).</Paragraph> <Paragraph position="2"> The sentences in the input corpus are presented to a learner only once, sequentially, in the original order. The input to a learner is pre-processed by a system (Waldron 2000) that assigns categories to each word in a sentence.</Paragraph> <Paragraph position="3"> The sentences with their putative category assignments are given as input to the learner. The learner then evaluates the category assignments for each sentence and only uses those that are valid according to the UG to set the parameters; the others are discarded. The corpus contains 1,041 English sentences (which follow the SVO order), but of these only a small proportion are triggers for the parameters, in the sense that, for the learner to process them, it has to select certain parameter values. As each triggering sentence is processed, the learner changes or reinforces its parameter values to reflect the order of constituents in these sentences.</Paragraph> <Paragraph position="4"> We wanted to check how the different learners performed in a normal noisy environment, with a limited corpus as input, and also to check if there is an interaction between the different initialisations and the noise in the input data. To do that, we tested how the learners performed under four conditions. Each condition was run 10 times for each learner, and we report here the average results obtained.</Paragraph> <Section position="1" start_page="213" end_page="214" type="sub_section"> <SectionTitle> 3.1 Condition 1: Learners-10 in a Noisy Environment </SectionTitle> <Paragraph position="0"> In the first condition, we initialised the parameters subjdir, vargdir and gendir of the default learners with the prior and posterior probabilities of 0.1 corresponding to a weight of 1/10, and those of 0.9 to a weight of 9/10. Results from the first experiment can be seen in table 3, where the learners are specified in the first column, the number of input triggers in the second, the number of parameters correct in relation to the target in the third, and the number of parameters set with these triggers in the fourth.</Paragraph> <Paragraph position="1"> The results show no significant variation in the performance of the different learners. This is the case with the number of parameters that are correct in relation to the target, with an average of 22.3 parameters out of 28, and also with the number of parameters that are set given the triggers available, with an average of 10.5 parameters out of 28.</Paragraph> <Paragraph position="2"> The only difference between the learners was the time needed for each learner to converge: the closer the starting point of the learner was to the target, the faster it converged, as can be seen in figure 5 for the subjdir parameter. This figure shows all the learners converging to the target value, with high probability, and with a convergence pattern very similar to the one presented by the unset learner. Even those default learners that were initialised with values incompatible with the target soon overcame this initial bias and converged to the target. The same thing happens for vargdir and gendir. This figure also shows some sharp falls in the convergence to the target value for these learners.
For example, the unset learner had a sharp drop in probability, which fell from 0.94 to 0.85, around trigger 16. These declines were caused by noise in the category assignments of the input triggers, which provided incorrect evidence for the parameter values.</Paragraph> </Section> <Section position="2" start_page="214" end_page="214" type="sub_section"> <SectionTitle> 3.2 Condition 2: Learners-10 in a Noise-free Environment </SectionTitle> <Paragraph position="0"> In order to test whether, and how much, the learners' performance was affected by the presence of noisy triggers, we tested how the learners performed in a noise-free environment, using the same initialisations as in condition 1.</Paragraph> <Paragraph position="1"> To obtain such an environment, as each trigger was processed, a module was used to correct the category assignment if noise was detected.</Paragraph> <Paragraph position="2"> The results are shown in table 4.</Paragraph> <Paragraph position="3"> These learners have performances similar to those in condition 1 (section 3.1), with an average of 22.3 of the 28 parameters correct in relation to the target, and an average of 10.6 parameters that can be set with the triggers available. But in this condition the convergence was slightly faster for all learners, as can be seen in figure 6. These results show that, indeed, the presence of noise slows down the convergence of the learners, because they need more triggers to compensate for the effect produced by the noisy triggers.</Paragraph> </Section> <Section position="3" start_page="214" end_page="215" type="sub_section"> <SectionTitle> 3.3 Condition 3: Learners-50 in a Noisy Environment </SectionTitle> <Paragraph position="0"> We then tested whether the use of stronger weights to initialise the learners would affect the learners' performance. The parameters subjdir, vargdir and gendir were initialised with a weight of 5/50 for the probability of 0.1 and a weight of 45/50 for the probability of 0.9. Figure 7 shows the convergence pattern presented by these learners for the subjdir parameter. The effect produced by the noise was increased with these stronger weights, such that all the learners had a slower convergence to the target. Even those default learners initialised with values compatible with the target had a slightly slower convergence when compared to those in condition 1, with weaker weights, because they had to overcome the stronger initial bias before converging to the target values. But, in spite of that, the performance of the learners is only slightly affected by the stronger weights, as shown in table 5. They had a performance similar to that obtained by the learners in the previous conditions, as shown in figure 8, which compares these learners with those in condition 1.</Paragraph> </Section> <Section position="4" start_page="215" end_page="216" type="sub_section"> <SectionTitle> 3.4 Condition 4: Learners-50 in a Noise-free Environment </SectionTitle> <Paragraph position="0"> When the noise-free environment was used with these stronger weights, the convergence pattern was slightly faster for all learners when compared to condition 3 (which used a noisy environment), but still slower than in conditions 1 and 2, as shown in figure 9.
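To make the effect of the stronger weights concrete, here is a toy calculation (a hypothetical illustration using the same fraction-style update as the sketch in section 2.3, not output from the system): starting from the same probability of 0.1, the 1/10 and 5/50 initialisations move at very different speeds under identical counter-evidence.

# Toy calculation: why stronger initial weights slow down convergence.
# Same initial probability (0.1), different amounts of "virtual" evidence.
def prob_after_counter_evidence(num, den, n_triggers):
    # n_triggers pieces of evidence all supporting the initially dispreferred value
    return (num + n_triggers) / (den + n_triggers)

for num, den in [(1, 10), (5, 50)]:       # weak vs strong initial weights
    probs = [round(prob_after_counter_evidence(num, den, n), 2) for n in range(0, 16, 5)]
    print(f"start {num}/{den}: {probs}")
# With 1/10 the dispreferred value reaches 0.64 after 15 triggers; with 5/50
# it only reaches about 0.31, matching the slower convergence reported above.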
These learners had a performance similar to those obtained in all the previous conditions, as can be seen in table 6 and in figure 10, which also shows the results obtained by the learners in condition 2, which also used a noise-free environment.</Paragraph> </Section> </Section> <Section position="6" start_page="216" end_page="217" type="metho"> <SectionTitle> 3.5 Discussion </SectionTitle> <Paragraph position="0"> As confirmed by these results, there is a strong interaction between the different starting points and the presence of noise. The noise has a strong influence on the convergence of the learners, slowing down the learning process, since the learners need more triggers to compensate for the effect caused by the noisy ones. The different initialisations had little impact on the learners' performance, although they noticeably delayed convergence to the target for those learners initialised with values incompatible with it. Thus, when the presence of noise was combined with the use of stronger weights, there was a significant delay in convergence, where the final posterior probability was up to 10% lower than in the noise-free case (e.g. for the OVS learner), as can be seen in figures 7 and 9.</Paragraph> <Paragraph position="1"> Nonetheless, these learners were robust to the presence of noise in the input data, only selecting or changing a value for a given parameter when there was enough evidence for it. As a consequence, all the learners were converging towards the target, even with the small number of available triggers, regardless of the initialisations and the presence of noise. This is the case even with an extreme bias in the initial values.</Paragraph> <Paragraph position="2"> Moreover, the learners make effective use of the inheritance mechanism to propagate default values, with an average of around 4.2 non-default specifications for these learners.</Paragraph> </Section> </Paper>