<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1101"> <Title>New Zealand</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Modelling Phonological Complexity </SectionTitle> <Paragraph position="0"> The mechanistic model we have used to represent diachronic phonological derivations is that of Probabilistic Finite State Automata (PFSA). These are state determined machines which have stochastic transition functions. The derivation of each word in MB or MC from Middle Chinese consists of a sequence of diachronic rules. These rule sequences for each of the approximately 2700 words are used to construct our PFSA. Node 0 of the PFSA corresponds to the reconstructed form of the word in Middle Chinese. Arcs leading out of states in the PFSA represent particular rules that were applied to a form at that state, transforming it into a new intermediate form. A transition on a delimiter symbol, which always returns to state 0, signifies the end of a derivation process whereby the final form in the daughter language has been arrived at. The weightings on the arcs represent the number of times that particular arc was traversed in processing the entire corpus of words. The complete PFSA then represents the phonological complexity of the derivation process from Middle Chinese into one of the modern dialects.</Paragraph> <Paragraph position="1"> If this is the case, then the length of the minimal description of the PFSA would be indicative of the distance between the parent and daughter languages. There are two levels at which the diachronic complexity can be measured. The first is of the canonical PFSA, which is a trie encoding of the rules. This is the length of the diachronic phonological hypothesis accounting for the given dataset. The second is of a minimised version of the canonical machine. 
Our minimisation is performed initially using the sk-strings method of Raman and Patrick (1997b) and then by reducing the resultant automaton further with a beam search heuristic (1997a). The sk-strings method constructs a non-deterministic finite state automaton from its canonical version by successively merging states that are indistinguishable for the top s% of their most probable output strings, limited to a length of k symbols. Both s and k are variable parameters that can be set when starting program execution. In this paper, the reduced automata are the best ones that could be inferred using any value of string size (k) from 1 to 10 and any value of the agreement percentage (s) from 1 to 100. The beam search method reduces the PFSA by searching recursively through the best m descendants of the current PFSA, where a descendant is defined to be the result of merging any two nodes in the parent PFSA. The variable parameter m is called the beam size and determines the exhaustiveness of the search. In this paper, m was set to 200, which was the maximum the Sun Sparcserver 1000 with 256 MB of main memory could tolerate.</Paragraph> <Paragraph position="2"> The final resultant PFSA, thus minimised, is, strictly speaking, a generalisation of the proposed phonology. Its size is not really indicative of the complexity of the original hypothesis, but it serves to bring to light important patterns which repeat themselves in the data. The minimisation, in effect, forms additional diachronic rules and highlights regular patterns to a linguist. The size of this structure is also given in our results to show the effect of further generalisation to the linguistic hypothesis.</Paragraph> <Paragraph position="3"> A final point needs to be made regarding the motivation for the additional sophistication embodied in this method as compared to, say, a simpler phonological approach such as a distance measure based on a simple summation of the number of proposed rules. 
Our method gives a measure that depends not only on the number of rules but also on the inter-relationship between them, or the regularity present in the whole phonology. A lower value indicates the presence of greater regularity in the derivation process. As a case in point, we may look at two closely related dialects which have the same number of rules in their phonology from a common parent. It may be the case that one has diverged more by losing more of its original structure. As in the method of internal reconstruction, if we assume that the complexity of a language increases with time due to the presence of residual forms (Crowley, 1987, p.150-453), the PFSA derived for the more distant language will have a greater complexity than the other.</Paragraph> </Section> <Section position="6" start_page="0" end_page="3" type="metho"> <SectionTitle> 4 Procedural Decisions </SectionTitle> <Paragraph position="0"> The derivations that were used in constructing the PFSA were traced out individually for each of the 2714 forms and entered into a spreadsheet for further processing. The Relative Chronologies (RC) of the diachronic rules given in Chen76 and CN84 propose rule orderings based on bleeding and feeding relationships between rules. 3 Consistency with the RC proposed in Chen76 and CN84 has been maintained as far as possible. For the most part, violations of the RC are viewed as exceptions to their hypothesis. Thus if Rule A is ordered before Rule B in the RC, but is required to apply after Rule B in a specific instance under consideration, it is made an exceptional application of Rule A, denoted by &quot;[A]&quot;. Such exceptional rules are considered distinct from their normal forms. 
The sequence of rules deriving Beijing tou from Middle Chinese *to (&quot;all&quot;), for example, is given as &quot;t1-split:raise-u:diphthong-u:chamel:&quot;. However, &quot;diphthong-u&quot; is ordered before &quot;raise-u&quot; in the RC. The earlier rule in the RC is thus made an exceptional application and the rule sequence is given instead as &quot;t1-split:raise-u:[diphthong-u]:chamel:&quot;. There are also some exceptional phonological changes not accounted for by CN84 or Chen76. In these cases, we form a new rule representing the change that took place and denote it in square brackets to show its exceptional status. Related exceptions are grouped together as a single exceptional rule. For example, Tone-4 in Middle Chinese only changes to Tone-1a or Tone-2 in Beijing when the form has a voiceless initial. However, for the Middle Chinese form *niat (&quot;pinch with fingers&quot;) in Tone-4, the corresponding Beijing form is nie in Tone-1a. Since the n-initial is voiced, the t4-tripart rule is considered to apply exceptionally. The complete rule sequence is thus denoted by &quot;raise-i:apocope:chamel:[t4]:&quot; where the &quot;[t4]&quot; exceptional rule covers cases when Tone-4 in SMC unexpectedly changed into Tone-1a or Tone-2 in Beijing in the absence of a voiceless initial.</Paragraph> <Paragraph position="1"> It also needs to be mentioned that there are a few cases where an environment for the application of a rule might exist, but the rule itself may not apply although it is required to by the linguistic hypothesis.</Paragraph> <Paragraph position="2"> of applying before it, then A is said to bleed B. If rule A causes rule B to apply by applying before it, it is said to feed rule B.</Paragraph> <Paragraph position="3"> This would constitute an exception again. The details of how to handle this situation more accurately are left as a topic for future work, but we try to account for it here by applying a special rule [!A] where the '!' 
is meant to indicate that the rule A didn't apply when it ought to have. As an example, we may consider the derivation of Modern Cantonese hap (Tone 4a) from Middle Chinese *khap (Tone 4) (&quot;exactly&quot;). The sequence of rules deriving the MC form is &quot;t4-split:spirant:x-weak:&quot;. However, since the environment is appropriate (voiceless initial) for the application of a further rule, AC-split, after t4-split had applied, the non-application of this additional rule is specified as an exception. Thus, &quot;t4-split:spirant:x-weak:[!AC-split]:&quot; is the actual rule sequence used.</Paragraph> <Paragraph position="4"> In general, the following conventions in representing and treating exceptions have been followed as far as possible: Exceptional rules are always denoted in square brackets. They are considered excluded from the RC and thus are consistently ordered at the end of the rest of the derivation process wherever possible. A final detail concerns the status of allophonic changes in the phonology. The derivation process is actually two-stage, comprising a diachronic phase during which phonological changes take place and a synchronic phase during which allophonic changes are automatically applied. Changes caused by Cantonese or Beijing Phonotactic Constraints (PCs) are treated as allophonic rules and fall into the synchronic category, whereas PCs applying to earlier forms are treated in line with the regular diachronic rules which Chen76 calls P-rules.</Paragraph> <Paragraph position="5"> A minor problem presents itself when it comes to making a clear-cut separation between the historical rules proper and the synchronic allophonic rules. In Chen76 and CN84, allophonic rules are not really considered part of the historical derivation process. Yet it was found that the environment for the application of a diachronic rule is sometimes produced by an allophonic rule. 
Such feeding relationships between allophonic and diachronic rules make the classification of those allophonic rules difficult.</Paragraph> <Paragraph position="6"> The only rule considered allophonic in Beijing is the *CHAMEL PC, this being a rule which determines the exact qualities of MB vowels. For Cantonese, CN84 has included two allophonic rules within its RC under bleeding and feeding relationships with P-rules. These are the BREAK-C and Y-FUSE rules, both of which concern vocalic detail.</Paragraph> <Paragraph position="7"> In these cases, every instance of their application within the diachronic phonology has been treated as an exception, effectively elevating these exceptions to the status of diachronic rules. In other cases, as with other allophonic rules, they are always ordered after all the diachronic rules. Since the problem regarding the status of allophonic rules in general is properly in the domain of historical linguists, it is beyond the scope of this work. It was thus decided to provide two complexity measures -- one including allophonic detail and one excluding all allophonic detail not required for the derivation process.</Paragraph> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 5 Minimum Message Length </SectionTitle> <Paragraph position="0"> The Minimum Message Length (MML) principle of Georgeff and Wallace (1984) is used to compute the complexity of the PFSA. For brevity, we will henceforth call the Minimum Message Length of PFSA as the MML of PFSA or where the context serves to disambiguate, simply MML.</Paragraph> <Paragraph position="1"> In the context of data transmission, the MML of a set of symbols is the minimum number of bits needed to transmit a static model together with the data symbols given this model a priori. 
In the context of PFSA, the MML is the sum of (i) the length of encoding a description of the proposed machine and (ii) the length of encoding the dataset assuming it was emitted by the proposed machine. The following formula is used to compute the MML:</Paragraph> <Paragraph position="3"> where N is the number of states in the PFSA, tj is the number of times the jth state is visited, V is the cardinality of the alphabet including the delimiter symbol, nij is the frequency of the ith arc from the jth state, mj is the number of different arcs from the jth state and m'j is the number of different arcs on non-delimiter symbols from the jth state. The logs are to the base 2 and the MML is in bits.</Paragraph> <Paragraph position="4"> The MML formula given above assumes a non-uniform prior on the distribution of outgoing arcs from a given state. This contrasts with the MDL criterion due to Rissanen (1978), which recommends the usage of uniform priors. The specific prior used in the specification of mj is 2^(-mj), i.e. the probability that a state has n outgoing arcs is 2^(-n).</Paragraph> <Paragraph position="5"> Thus mj is directly specified in the formula using just mj bits and the rest of the structure specification assumes this. It is also assumed that targets of transitions on delimiter symbols return to the start state (State 0 for example) and thus don't have to be specified. The formula is a modification for non-deterministic automata of the formula in Patrick and Chong (1987), where it is stated with two typographical errors (the factorials in the numerators are absent). That formula is itself a correction (through personal communication) of the formula in Wallace and Georgeff (1984), which follows on from work in numerical taxonomy (Wallace and Boulton, 1968) that applied the MML principle to derive information measures for classification.</Paragraph> </Section> </Paper>