<?xml version="1.0" standalone="yes"?> <Paper uid="J94-3004"> <Title>The Reconstruction Engine: A Computer Implementation of the Comparative Method</Title> <Section position="3" start_page="383" end_page="387" type="metho"> <SectionTitle> 2. Overview of Algorithms and Data Structures </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="383" end_page="383" type="sub_section"> <SectionTitle> 2.1 A Few Preliminary Remarks about the Data and Terminology </SectionTitle> <Paragraph position="0"> We shall attempt to be precise in our use of the linguistic terminology related to historical reconstruction: when lexical items, or modern forms from the various lexicons of individual languages, are grouped into cognate sets on the basis of recurring phonological regularities (correspondences) they will be referred to as reflexes. The ancestor word-form from which these regular reflexes derive is called a reconstruction, protoform, or etymon. Thus, English father, German Vater, Greek pater, and Sanskrit pitr- are all reflexes of a Proto-Indo-European (PIE) etymon reconstructed as something like *poter- (the asterisk indicates that this word is a reconstruction and not an attested form). The relations between the constituent phonological elements of etyma and their modern reflexes are called sound laws and are usually written in the form of diachronic phonological rules; for example, PIE *p > English/f/./f/is said to be the outcome of PIE *p in English. Languages that share a common ancestor are said to be the daughters of that ancestor.</Paragraph> <Paragraph position="1"> The data on which our study and these examples are based and that are used in exemplifying the operation of the program are taken from the Tamang group of the Bodic division of the Tibeto-Burman branch of the Sino-Tibetan family in Shafer's classification (Shafer 1955), spoken in Nepal (Mazaudon 1978, 1988). The reconstructed ancestor, Proto Tamang-Gurung-Thakali-Manang, is abbreviated *TGTM. Four modern tones (numbered 1 to 4) are recognized in the modern languages and two proto-tone categories (labelled A and B) are reconstructed. The tones of both reconstructed and daughter forms are transcribed before the syllable, e.g. Abap. The eight dialects used are discussed in detail by Mazaudon (1978). 
The dialects and their abbreviations are (as cited in columns 5 to 12 of the Table of Correspondences in Figure 9a): Risiangku (ris), Sahu (sahu), Taglung (tag), Tukche (tuk), Marpha (mar), Syang (syang), Ghachok (gha), and Prakaa (pra).</Paragraph> </Section> <Section position="2" start_page="383" end_page="387" type="sub_section"> <SectionTitle> 2.2 Synopsis of the Reconstruction Engine </SectionTitle> <Paragraph position="0"> RE implements (i) a set of algorithms that generate possible reconstructions given word forms in modern languages (and vice versa as well) and (ii) a set of algorithms that arrange input modern forms into possible cognate sets based on those reconstructions.</Paragraph> <Paragraph position="1"> The first set implements a simple bottom-up parser; the second automates database management chores, such as reading multiple input files and sorting, merging, and indexing the parser's output.</Paragraph> <Paragraph position="2"> Input-output diagram of RE's basic projection functions.</Paragraph> <Paragraph position="3"> The core functions of RE compute all possible ancestor forms (using a Table of Correspondences and a phonotactic description, a Syllable Canon, both described in Section 3.1) and makes sets of those modern forms that share the same reconstructions. Tools for further dividing of the computer-proposed cognate sets based on semantic distinctions are also provided. The linguist (that is, the user) collects and inputs the source data, prepares the table of correspondences and phonotactic description (syllable canon), and verifies the semantics of the output of the phonologically based reconstruction process. RE, qua &quot;linguistic bookkeeper,&quot; makes the projections and keeps track of several competing hypotheses as the research progresses. Specifically, the linguist provides as input to the program: (a) Word forms from several modern languages, with glosses.</Paragraph> <Paragraph position="4"> (b) Parameters that control the operation of the program and interpretation of input data (mostly not described here).</Paragraph> <Paragraph position="5"> (c) A file containing the Table of Correspondences, detailed below.</Paragraph> <Paragraph position="6"> (d) The Syllable Canon, described below.</Paragraph> <Paragraph position="7"> (e) Semantic information for disambiguating modern and reconstructed homophones, described below.</Paragraph> <Paragraph position="8"> The parsing algorithm implemented in RE is bi-directional (in the sense of time): the &quot;upstream &quot;3 process involves projecting each modern form backward in time and merging the sets of possible ancestors generated thereby to see which, if any, are identical. Conversely, given a protoform, the program computes the expected regular reflexes in the daughter languages, as illustrated in Figure 2.</Paragraph> <Paragraph position="9"> The process can be done interactively (as illustrated in Figure 3 below) or in batch using machine-readable lexicons prepared for this purpose.</Paragraph> <Paragraph position="10"> Figure 3 is a representation of the contents of the computer screen after the user has entered three modern words (1). The program has generated the reconstructions from which these forms might derive (2). The list of numbers (called the analysis) following the reconstruction refers to the row numbers in the table of correspondences used by 3 Upstream in the sense of time. We had originally described the temporal directions of the program as backward and forward. 
The opposition of upstream and downstream, suggested to us by John Hewson, one of the developers of the first &quot;Electronic Neogrammarian&quot; (Hewson 1973), is much more intuitive.

[Figure 3: screen panels listing the Modern Forms entered and the Reflexes computed for the dialects 1. ris, 2. sahu, 3. tag, 4. tuk, 5. mar, 6. syang, 7. gha. Caption: A simple example of interactive &quot;upstream&quot; computation (transcription and languages exemplified are described in Section 2.1).]</Paragraph>

<Paragraph position="11"> the program in generating the reconstructions. In two cases reflexes have more than one possible ancestor. The program has then proposed the two cognate sets that result from computing the set intersection of the possible ancestors (3). The proposed sets are listed in descending order by population of supporting forms. 4 Conversely, given a protoform, RE will predict (actually &quot;postdict&quot;) the regular reflexes in each of the daughter languages. Figure 4 reproduces the results on the computer screen of performing such a &quot;downstream&quot; calculation. Here the etymon entered by the user (1) produced reflexes (2) through two different syllabic analyses (numbered 1. and 2. in the &quot;Reflexes of ...&quot; window): Abap as initial /b-/ plus vowel /-a-/ plus final /-p/, and as initial /b-/ followed by rhyme /-ap/. The algorithms used in this process are described in Section 4.2.</Paragraph>

<Paragraph position="12"> 3. Previous Research in Computational Historical Linguistics
In order to provide some context for a discussion of our efforts, we first present a brief discussion of the computational approaches to the study of sound change and review some of the software developed (see also Hewson 1989).</Paragraph>

<Paragraph position="13"> Applications of computers to problems in historical linguistics fall into two distinct categories: those based on numerical techniques, usually relying on methods of statistical inference; and those based on combinatorial techniques, usually implementing some type of rule-driven apparatus specifying the possible diachronic development 4 In fact, the situation is slightly more complicated than is shown here: there are two other possible reconstructions and another possible cognate set that are not shown because of space considerations. This example is discussed in more detail in Section 5.1.</Paragraph>

<Paragraph position="15"> Figure 4: The expected outcomes of *Abap (a &quot;downstream&quot; computation).</Paragraph>

<Paragraph position="16"> of language forms. The major features of a few of these programs are reviewed briefly below. The programs discussed by no means exhaust the field; the criterion for selecting them is that they have been described in the literature sufficiently for an evaluation, and that for this reason they have come to the attention of the authors. Indeed, the literature in this field is fragmented: starting in the 1960s and 1970s a sizable literature on the lexicostatistic properties of language change developed following Swadesh's earlier glottochronological studies (for example, Swadesh 1950). On the other hand, only a handful of attempts to produce and evaluate software of the rule-application type (for use in historical linguistics) are documented in the literature (Becker 1982; Brandon 1984; Durham & Rogers 1971; Frantz 1970; Kemp 1976).
In general these programs seem to have been abandoned after a certain amount of experimentation.</Paragraph>

<Paragraph position="17"> Certainly the problem of articulating a set of algorithms and associated data sets that completely describe the regular sound changes evidenced by a group of languages is a daunting task.</Paragraph>

<Paragraph position="18"> To the first class belong lexicostatistic models of language change. The COMPASS module of the WORDSURV program described below belongs to this class (cf. Wimbish 1989). It measures degree of affiliation using a distance metric based on the degree of similarity between corresponding phonemes in different languages. Also to this class belong applications that measure genetic affiliation as a function of the number of shared words in a selected vocabulary set. Any method that depends on counting &quot;shared words,&quot; we note, assumes the existence and prior application of a means of determining which forms are cognate; and any such estimates of the relatedness of languages are only as good as the metric that determines which ones are cognate.</Paragraph>

<Paragraph position="19"> To the second class belong programs in which diachronic rules are applied to derive later forms from earlier forms, and RE is a member of this class.</Paragraph>

<Paragraph position="20"> Examples of programs of this sort are PHONO, being applied to Latin-to-Spanish data (and described below); VARBRUL (by Susan Pintzuk), used to analyze Old English; and two programs used to analyze Romance languages: Iberochange, based on a rule-processing subsystem called BETA, used for Ibero-Romance languages (Eastlack 1977), and one unnamed (Burton-Hunter 1976).</Paragraph> </Section>

<Section position="3" start_page="387" end_page="387" type="sub_section"> <SectionTitle> 3.1 Hewson's Proto-Algonkian Experiment </SectionTitle> <Paragraph position="0"> The &quot;proto-projection&quot; techniques used by RE were implemented earlier by John Hewson and others at the Memorial University of Newfoundland (Hewson 1973, 1974). 5 The strategy is transparent; as Hewson notes, he and his team decided to &quot;follow the basic logic used by the linguist in the comparative method&quot; (Hewson 1974:193). The results of this research have recently been published in the form of an etymological dictionary of Proto-Algonkian (Hewson 1993).</Paragraph>

<Paragraph position="1"> The program as first envisioned was to operate on &quot;consonant only&quot; transcriptions of polysyllabic morphemes from four Amerindian languages. The program would take a modern form, &quot;project it backwards&quot; into one or more proto-projections, then project these proto-projections into the next daughter language, deriving the expected regular reflexes. The lexicon for this language would be checked for these predicted reflexes; if found, the program would repeat the projection process, zig-zagging back and forth in time until all reflexes were found. For example, given Fox /poohke~amwa/ 'he cuts it open,' the program would match the correct Cree form, as indicated in Figure 5.</Paragraph>

<Paragraph position="2"> There were problems with this approach. In cases where no reflex could be found (as in Figure 5, where no Menomini cognates for this form existed in the database), the process would grind to a halt.
Recognizing that &quot;the end result of such a programme would be almost nil&quot; (Hewson 1973:266), the team developed another approach in which the program generated all possible proto-projections for the 3,403 modern forms.</Paragraph> <Paragraph position="3"> These 74,049 reconstructions were sorted together, and 'only those that showed identical proto-projections in another language' (some 1,305 items) were retained for further examination. At this point Hewson claimed that he and his colleagues were then able to quickly identify some 250 new cognate sets (Hewson 1974:195). The vowels were added back into the forms, and from this a final list of cognate sets was created. A cognate set from this file, consisting of a reconstruction and two supporting forms, is reproduced below (Figure 6).</Paragraph> </Section> </Section> <Section position="4" start_page="387" end_page="403" type="metho"> <SectionTitle> 5 The authors of RE developed this technique independently and later discovered this methodologically similar computer project on Proto-Algonkian. </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="388" end_page="389" type="sub_section"> <SectionTitle> 3.2 WORDSURV </SectionTitle> <Paragraph position="0"> The Summer Institute of Linguistics (SIL), a prodigious developer of software for the translating and field linguist located in Dallas, Texas, provides a variety of integrated tools for linguistic analysis. One of these tools, the COMPASS module of WORDSURV, allows linguists to compare and analyze word lists from different languages and to perform phonostatistic analysis. To do so, the linguist first enters &quot;survey data&quot; into the program; reflexes are arranged together by gloss, as illustrated in the reproduction in Figure 7.</Paragraph> <Paragraph position="1"> In addition to the a priori semantic grouping of reflexes by gloss, the linguist must also re-transcribe the data in such a way that each constituent of a reflex is a single character, that is, &quot;no digraphs are allowed. Single unique characters must be used to represent what might normally be represented by digraphs.., e.g. N for ng.&quot; (Wimbish 1989:43). The program also requires that part of the diachronic analysis be carried out before entering the data into the computer in order to incorporate that analysis into the data. For example, when the linguist hypothesizes that &quot;a process of sound change has caused a phone to be lost (or inserted), a space must be inserted to hold its place in the forms in which it has been deleted (or not been inserted)&quot; (Wimbish 1989:43). That is, the zero constituent must be represented in the data itself. The program also contains a &quot;provision for metathesis .... Enter the symbols > n (where n is a one- or two-digit Computational Linguistics Volume 20, Number 3 number) after a word to inform WORDSURV that metathesis has occurred with the nth character and the one to its right&quot; (Wimbish 1989:43). An example of this may be seen in column 3 of Figure 7. To represent tone, the author notes that &quot;there are at least two solutions. The first is to use a number for each tone (for example lma3na). The second solution is to use one of the vowel characters with an accent .... The two methods will produce different results&quot; when the analysis is performed (Wimbish 1989:44). 
While the last statement may surprise some strict empiricists (after all, the same data should give the same results under an identical analysis), it should come as no surprise to linguists who recognize that the selection of unit size, the type of constituency, and other problems of representation may have a dramatic effect on conclusions. RE is distinguished from this program in that (i) no a priori grouping of forms by gloss is required (a step that is fraught with methodological problems inasmuch as it requires the linguist to decide a priori which forms might be related), (ii) no alignment of segments is required (also a problematic step for a number of reasons), and (iii) the constituent inventory is not limited to segments. In passing, the lexicostatistics that are computed are based on the &quot;Manhattan distance&quot; (in a universal feature matrix) between corresponding phonemes from different languages as a measure of their affiliation. The validity of this measure for establishing genetic affiliation is suspect: corresponding phonemes may be quite different in terms of their phonological features without altering the strength of the correspondence or the closeness of the genetic affiliation. Also, the metrics of feature spaces are notoriously hard to quantify, so any distance measures are themselves likely to be unreliable. RE computes no such statistics, though some tools (described below) that might be used in subgrouping do exist.</Paragraph> </Section>

<Section position="2" start_page="389" end_page="389" type="sub_section"> <SectionTitle> 3.3 DOC: Chinese Dialect Dictionary on Computer </SectionTitle> <Paragraph position="0"> DOC is one of the earliest projects to attempt a comprehensive treatment of the lexicons of a group of related languages. DOC was developed &quot;for certain problems [in which] the linguist finds it necessary to organize large amounts of data, or to perform rather involved logical tasks--such as checking out a body of rules with intricate ordering relations&quot; (Wang 1970:57). A sample dialect record (in one of the original formats) is illustrated in Figure 8. Note that as in the case of WORDSURV, the data must be pre-segmented according to a universal phonotactic description (in this case the Chinese syllable canon) that the program is built to handle. The one-byte-one-constituent restriction does not exist, though the (maximum) size of constituents is fixed by the data structure.</Paragraph>

<Paragraph position="1"> At least four versions of this database and associated software were produced (Cheng 1993:13). Originally processed as a punched-card file on a LINC-8, the program underwent several metamorphoses. An intelligent front-end was developed in Clipper (a microcomputer-based database management system) that allows the user to perform faceted queries (i.e. multiple keyterm searches) against the database. The database is available as a text file (slightly over one megabyte) containing forms in 17 dialects for some 2,961 Chinese characters (Cheng 1993:12). DOC has no &quot;active&quot; component: it is a database of phonologically analyzed lexemes organized for effective retrieval.</Paragraph> </Section>

<Section position="3" start_page="389" end_page="390" type="sub_section"> <SectionTitle> 3.4 Phono: A Program for Testing Models of Sound Change </SectionTitle> <Paragraph position="0"> PHONO (Hartman 1981, 1993) is a DOS program that applies ordered sets of phonological rules to input forms.
The rules are expressed by the user in a notation composed of if-then clauses that refer to feature values and locations in the word. PHONO converts input strings (words in the ancestor language) into their equivalent feature matrices using a table of alphabetic characters and feature values supplied by the user. The program then manipulates the feature matrices according to the rules, converting the</Paragraph>

<Paragraph position="2"> Figure 8: A dialect record in DOC (cited from Figure 7 in Wang [1970]).</Paragraph>

<Paragraph position="3"> matrices back into strings for output. Hartman has developed a detailed set of rules that derive Spanish from Proto-Romance. Besides allowing the expression of diachronic rules in terms of features, facilities are included to handle metathesis. Unlike RE, which handles only one step (at a time) in the development of multiple languages, PHONO traces the history of the words of a single language through multiple stages.</Paragraph>

<Paragraph position="4"> 4. Description of Data Structures and Algorithms
We turn now to RE, which represents another step in the application of computational techniques to the problems faced by historical linguists. As will become clear, any computational tools designed to be used by historical linguists must be able to operate in the face of considerable uncertainty. In the course of carrying out a diachronic analysis the linguist is likely to have several competing hypotheses that might explain the observed variation. Data from many sources, varying in quality and transcription, will be compared. The research will proceed incrementally, both in terms of the portions of the lexicons and phonologies treated and in the number of languages or dialects included. RE as a tool helps with only a portion of this task, the problem of creating and maintaining regular cognate sets and the reconstructions that accompany them.</Paragraph> </Section>

<Section position="4" start_page="390" end_page="392" type="sub_section"> <SectionTitle> 4.1 Principal Data Structures: The Table and the Canon </SectionTitle> <Paragraph position="0"> Two data structures (internal to the program) are relevant to the phonological reconstruction, and these are passed as arguments to RE. The first is a Table of Correspondences (Fig. 9a) representing the linguist's hypothesis about the development of the languages being treated. The columns of the table are (1) a correspondence set number, uniquely identifying the correspondence; (2) the distribution of the correspondence within the syllable structure (i.e. the type of syllable constituent: in this case, Tone, Initial, Liquid, Glide, Onset, Rhyme, Vowel, or Final); (3) the PROTOCONSTITUENT itself; (4) the phonological context (if any) to which the correspondence is limited; (5-12) the OUTCOME or reflex of the protoconstituent in the daughter languages. 6 The term reflex will be reserved for describing a complete modern form that is the regular descendant of some protoform. Outcome will be used for the regular descendent of a protoconstituent. [Table excerpt, partly garbled in this version: rows 102-106 give correspondences for *kr in various contexts (e.g. /_e:, /_a, /_at) along with an unconditioned row; row 31 gives the rhyme/vowel *a and row 186 *a before *-p; columns (5)-(12) give the outcomes in ris, sahu, tag, tuk, mar, syang, gha, and pra.]</Paragraph>

<Paragraph position="1"> Figure 9a Excerpt from the Table of Correspondences.</Paragraph>
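To make this data structure concrete, here is a minimal sketch of how such a table row might be represented in a program of this kind. Python is used only for illustration, and the names, contexts, and dialect outcomes below are placeholders loosely modelled on the rows mentioned in the text (93 for initial *k, 104 for *kr before *-a, 186 for *a before *-p); they are not the actual *TGTM table.

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Correspondence:
    number: int              # column (1): unique correspondence set number
    ctype: str               # column (2): syllable constituent type (T, I, L, G, O, R, V, F)
    proto: str               # column (3): the protoconstituent
    context: Optional[str]   # column (4): phonological context; None = unconditioned/elsewhere
    outcomes: Dict[str, str] # columns (5)-(12): outcome per daughter language

# Toy rows; the outcomes are invented placeholders, not Tamang data.
TABLE = [
    Correspondence(93,  "I", "k",  None,  {"ris": "k",  "tuk": "k"}),
    Correspondence(104, "I", "kr", "/_a", {"ris": "kr", "tuk": "k"}),
    Correspondence(186, "V", "a",  "/_p", {"ris": "a",  "tuk": "o"}),
]

def rows_for(proto: str, ctype: str):
    """All table rows listing a given protoconstituent in a given slot type."""
    return [c for c in TABLE if c.proto == proto and c.ctype == ctype]

print([c.number for c in rows_for("kr", "I")])   # -> [104]

Keeping the constituent type on each row is what lets a lookup distinguish, for example, a final *k from an initial *k, as the text notes below.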
<Paragraph position="2"> [T,∅] [O(G),I(L)(G),∅] [R,VF]</Paragraph>

<Paragraph position="4"> Figure 9b Syllable canon in Proto-Tamang.</Paragraph>

<Paragraph position="5"> The CONSTITUENT TYPES (T, I, F, L, etc. in column 2) are specifiable by the user. So, for example, C and V could be chosen if no other types of constituents need to be recognized for the research. Note that the table allows for several different outcomes depending on context; the absence of context indicates either an unconditioned sound change or the Elsewhere case of a set of related rules (as discussed below).</Paragraph>

<Paragraph position="6"> The second data structure is a syllable canon that provides a template for building monosyllables. It specifies how the constituents of the table of correspondences may be combined based on the (syllable) constituent types (column 2 of the table). Thus, the outcomes for a final /k/ (correspondence 181) and an initial /k/ (correspondence 93) are never confused by the program. The program takes the syllable canon as an argument expressing the adjacency constraints on the constituents found in the table. For example, the canon for *TGTM, illustrated in Figure 9b, has three slots, each of which has its own substructure: first, a tone (optional, as indicated by the possibility of a zero element); followed by an (also optional) initial element consisting of various combinations of Onset, Glide, Initial, and Liquid; and terminating with either a Rhyme element or a Vowel plus Final consonant. A syllable is composed of zero or more elements from each of these slots. Picking the longest possible combination from each slot produces the maximal syllable permitted by the canon containing six constituents (TILGVF), one from each constituent type except O and R. Similarly, the minimal one has only one constituent (R). Parentheses indicate optional elements and brackets separate sequential slots in the syllable structure.</Paragraph>

<Paragraph position="7"> This description, a type of regular expression, provides a shorthand device for expressing several possible syllable structure trees. Indeed, the Proto-Tamang Syllable Canon is quite complex in this respect, because several hypotheses about syllable structure are encoded in it. For other languages, in which only consonant and vowel need be distinguished in describing syllable structure, a simpler canon (e.g. CV(C)) might suffice. Polysyllabic syllable canons can be expressed and used in two ways:

* Explicitly, for example [CV(C)][CV,∅], a bisyllabic canon in which the minimal form is CV and the maximal is CVCCV.</Paragraph>

<Paragraph position="8"> * As a recursive application of a single syllable. This is done via a software toggle that allows the canon structure to be repeatedly mapped over an input form. For example, if the polysyllable toggle is turned on, the canon [(C)V(C)] would match forms of the shape V, CVC, CVCCV, CVCVCVV, etc.</Paragraph>
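As a rough illustration of how such a canon can be used, the sketch below expands the three slots of Figure 9b into the constituent-type sequences they license. The spelling of the slots is our reading of the canon (optional elements written out as explicit alternatives, the empty tuple standing for ∅); it is not RE's internal encoding.

from itertools import product

# The *TGTM canon of Figure 9b, each slot spelled out as explicit alternatives:
# [T,∅]  [O(G), I(L)(G), ∅]  [R, VF]
CANON = [
    [("T",), ()],
    [("O",), ("O", "G"), ("I",), ("I", "G"), ("I", "L"), ("I", "L", "G"), ()],
    [("R",), ("V", "F")],
]

def licensed_type_sequences(canon):
    """Yield every sequence of constituent types the canon allows."""
    for choice in product(*canon):
        yield tuple(t for slot in choice for t in slot)

seqs = set(licensed_type_sequences(CANON))
print(("T", "I", "L", "G", "V", "F") in seqs)  # maximal TILGVF syllable -> True
print(("R",) in seqs)                          # minimal syllable (bare rhyme) -> True

Enumerations like this are one way to check a segmentation against the canon; RE's own implementation is not described at this level of detail in the text.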
</Section>

<Section position="5" start_page="392" end_page="396" type="sub_section"> <SectionTitle> 4.2 Algorithms </SectionTitle> <Paragraph position="0"> Three processes are applied to transform a modern form into a set of possible reconstructions or vice versa: (i) tokenizing the given form into a list of row numbers in the table of correspondences (column 1 of Figure 9a), (ii) filtering the tokenized forms according to syllabic and phonological constraints, and (iii) substituting the actual outcomes in the Table of Correspondences for the tokens.</Paragraph>

<Paragraph position="1"> Tokenization. On a first recursive pass RE generates (recursively from the left of the input form) all possible segmentations of the form. That is, starting from the left, the program divides the form into two, and then repeats the process on the right-hand part until the end of the form is reached. Essentially, this algorithm implements a standard solution to a standard problem, that of finding all parses of an input form given a regular expression (encoded in this case in the syllable canon and table). As the segmentation tree is created, the program checks to see that the node being built is actually specified as an element of the Table of Correspondences and thus avoids building branches of the tree that cannot produce outcomes (according to the Table of Correspondences). The pseudocode in Figure 10 outlines the algorithm. Consider for example the segmentations of (1) *Akra 'head hair': A-k-r-a, A-k-ra, A-kr-a, A-kra, Ak-r-a, Ak-ra, Akr-a, and Akra.

Figure 10 (pseudocode for tokenization):
if InputString is null then return(TokenList)
for i = 1 to the length of InputString
    leftside = leftmost i characters of InputString
    rest = the rest of InputString
    lookup leftside in list of constituents for this table column
    if found then
        add tokens (i.e. TofC row numbers) for this constituent to TokenList

Of these eight segmentations, only two are composed completely of elements that occur in the protoconstituent column (3) of the table. For each of the valid segmentations, RE constructs a tokenized version of the form, in which each element of the segmented form is replaced with the correspondence or list of correspondences for that constituent in the table. *k, for example, has three possible outcomes (given by rows 93, 94, and 181 of the Table of Correspondences), depending on its syllabic position and environment.</Paragraph>
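Here is a minimal runnable sketch of the segmentation step that Figure 10 outlines. It is written in Python purely for illustration; the toy constituent inventory is hypothetical (the row numbers for *k, *kr, *a, and the tone follow the text, but the row assigned to *r is invented).

def tokenize(form, inventory):
    """Return every segmentation of `form` into constituents listed in the table.

    `inventory` maps a constituent string to the list of Table-of-Correspondences
    row numbers that mention it.  Each parse is a list of (constituent, rows) pairs.
    Branches whose left edge is not in the table are pruned, as in Figure 10.
    """
    if form == "":
        return [[]]
    parses = []
    for i in range(1, len(form) + 1):
        left, rest = form[:i], form[i:]
        if left in inventory:
            for tail in tokenize(rest, inventory):
                parses.append([(left, inventory[left])] + tail)
    return parses

# Toy inventory for *Akra: rows 1 (tone A), 93/94/181 (*k), 104-106 (*kr), and
# 31/186 (*a) follow the text; the row number for *r is invented for the example.
INVENTORY = {"A": [1], "k": [93, 94, 181], "kr": [104, 105, 106], "r": [120], "a": [31, 186]}

for parse in tokenize("Akra", INVENTORY):
    print("-".join(seg for seg, _ in parse))
# prints the two segmentations built entirely from table constituents:
# A-k-r-a and A-kr-a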
<Paragraph position="2">
if TokenList is NULL then return(ListofPossibleForms)
/* recursive step */
for each RowNumber of first segmented element in TokenList
    if phonological and syllabic context constraints are met then
        for each language in the table
            add outcomes for this RowNumber to each output form in ListofPossibleForms for this language
        end if
    otherwise /* do not use this token in building output forms */
    end if
end for
remove first segmented element from TokenList</Paragraph>

<Paragraph position="4"> Figure 11 Pseudocode for filtering possible projections and substituting regular outcomes.</Paragraph>

<Paragraph position="5"> Filtering. Having created and tokenized a list of all valid segmentations, the algorithm traverses each tokenized form, looking up each correspondence row number of each segment in the table and substituting the outcome of that row from the appropriate column of the table. As the output form is being created, the phonological and phonotactic contexts are checked to eliminate disallowed structures, as illustrated in the pseudocode given in Figure 11.</Paragraph>

<Paragraph position="6"> With these constraints, however, only one combination is licensed: 1.104.31, because: (i) only the tone correspondence for row 1 applies since it specifies the outcome of prototone A for voiceless initials; (ii) only outcomes of row 104 for *kr- are generated since this is the most specific rule that applies; and because (iii) row 186 is eliminated as a possibility for *-a in this case, since these outcomes only occur when *-a is followed by *-p.</Paragraph>

<Paragraph position="7"> Some complications in the application of the rules should be noted here. The program does apply Panini's principle, also known as the Elsewhere Condition (Kiparsky 1973, 1982). Thus, of all the possible *kr- correspondences, only the most specific is selected. For example, though the context in line 104 *-a is a substring (or subcontext) of line 105 *-at, only one or the other is selected for any particular segmentation of a protoform ending in *-at (i.e. 104 for *-a- + *-t vs. 105 for *-at). If the &quot;specificity&quot; of several applicable contexts is the same, all are used by the program in generating the forms. 7 Also, note that since the context is stated in terms of proto-elements, when computing backwards (upstream) the program must tokenize and substitute ahead to determine if the context of a correspondence applies.</Paragraph>

<Paragraph position="8"> Substitution. In the final step, the program substitutes the outcomes for each correspondence row in each of the language columns of the table and outputs the expected reflexes. The expected outcome of just the segmentation A-kr-a in Tukche, for example (Figure 9a, column 8, tuk), is either 1[o or H[O.8 The segmentation A-k-r-a, though a valid segmentation of the input form into table constituents, would fail to produce any reflexes because the phonological context criterion is not met.</Paragraph>

<Paragraph position="9"> This process is performed for each language column in the table, resulting in a list of the modern reflexes of the input protoform. This assumes, of course, that the reconstructed forms are correct, the rules are correct, and no external influences have come into play. By comparing these computer-generated modern forms with the forms actually attested in the living languages we can check the adequacy of the proposed analysis, and make improvements and extensions as required.</Paragraph>

<Paragraph position="10"> The example above has only a few possible segmentations. Consider, however, the possible valid segmentations of the *TGTM form *Bgrwat 'hawk, eagle,' schematized in Figure 12. There are, of course, a substantially larger number of invalid segmentations. Each token of a segmentation may have a sizable list of possible outcomes. One can see that even relatively uncomplicated monosyllables are capable of causing massive ambiguity in structural interpretation. Indeed, some of the monosyllabic forms in the Tamang database generate nearly 100 reconstructions, even given the limitations of syllabic and phonological context.</Paragraph>
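The Elsewhere Condition described above can be made concrete with a small sketch. The specificity scores follow footnote 7 below (0 = no context, 1 = one-sided context, 2 = two-sided context); the rows and contexts used here are illustrative stand-ins, not the full *TGTM table, and row 106 is merely taken here as an unconditioned row for *kr-.

def specificity(context):
    """0 = no context, 1 = context on one side only, 2 = context on both sides."""
    if not context:
        return 0
    left, _, right = context.partition("_")
    return int(bool(left.strip("/"))) + int(bool(right))

def most_specific(rows):
    """Panini's principle: among the applicable correspondence rows for the same
    protoconstituent, keep only those with the highest specificity (ties are all
    kept, as the text notes)."""
    best = max(specificity(ctx) for _, ctx in rows)
    return [num for num, ctx in rows if specificity(ctx) == best]

# Rows for *kr-: 104 applies before *-a; 106 is treated as the Elsewhere row.
applicable = [(104, "/_a"), (106, None)]   # both match a protoform in *-a-
print(most_specific(applicable))           # -> [104], the most specific rule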
<Paragraph position="11"> The preceding discussion (i.e. in Section 4.2.1) shows how the Table of Correspondences can be read from ancestor to daughter (left to right), downstream in the sense of history. It can also be read from daughter to ancestor (right to left), upstream, revealing all the possible ancestors of a given modern segment of a particular language. For example, we can see from the excerpt in Figure 9a that Syang /k/ (in column 10 of the table) could derive from either *k- or *kr- (to be read from the column of protoconstituents (column 3) of the table).</Paragraph>

<Paragraph position="12"> By combining, according to the syllable canon, all the possible permutations of *initial, *tone, and *rhyme for the initial, tone, and rhyme of a modern word, the computer can, using exactly the same procedures as described in Section 4.2.1, create a list of its possible reconstructions. If the possible reconstructions of a set of words that are cognate are compared, it must be true that one or more of the reconstructions is the same for all words in the set (assuming, of course, that the words are related via regular sound changes).</Paragraph>

<Paragraph> 7 There is a great deal more to say about specificity and the complexity of the environmental constraints, so much so that a separate and rather lengthy discussion of it is merited. As currently implemented in RE, context must be stated in terms of immediately adjacent constituents (remote context cannot be used). Also, the context must be stated in terms of constituents (i.e. atoms), or lists of constituents: regular expressions and other possible definitions are not supported. Specificity is measured in a straightforward way: correspondence rules with no context have low specificity (specificity = 0). Rules with a one-sided context have specificity 1. Rules with a contextualizing element on both sides have specificity 2. Only integer specificities are supported.</Paragraph>

<Paragraph> 8 The cover symbol &quot; is used to permit the upstream reconstruction of Tukche forms in which the tone of the modern form is not precisely known. In the downstream direction, however, it licenses the generation of two possible reflexes.</Paragraph>

<Paragraph position="14"> Figure 12 Segmentation of *Bgrwat 'hawk, eagle.' The eight valid segmentations correspond to the constituent-type analyses T O G V F, T O G R, T O V F, T O R, T I L G V F, T I L G R, T I L V F, and T I L R.</Paragraph>

<Paragraph position="15"> In the example in Figure 13, this computation has been done on the modern forms for the word snow in four languages of the TGTM group. Each column contains the possible reconstructions for the modern reflex listed on top of the column. A comparison of the columns (or examination of the Venn diagram below) shows that one reconstructed form, *Bgli~ (in row 1), is indeed supported by all the members of the cognate set, and that these four languages provide sufficient data to rule out some of the other reconstructions proposed on the basis of one language alone. 9</Paragraph>
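The comparison just described amounts to intersecting the candidate sets proposed for each reflex. A minimal sketch, with schematic protoform labels P1-P3 standing in for the candidates shown in Figure 13 (the actual forms are not reproduced here):

def shared_reconstructions(candidates_by_language):
    """Candidate protoforms supported by every reflex in a putative cognate set.

    `candidates_by_language` maps each language to the set of reconstructions
    its modern form could derive from (the columns of Figure 13); the
    intersection is what the whole set supports.
    """
    sets = list(candidates_by_language.values())
    return set.intersection(*sets) if sets else set()

# Schematic stand-ins for the 'snow' example: P1 plays the role of *Bgli~.
candidates = {
    "ris":   {"P1", "P2"},
    "tag":   {"P1"},
    "tuk":   {"P1", "P3"},
    "syang": {"P1", "P2", "P3"},
}
print(shared_reconstructions(candidates))  # -> {'P1'}

In RE itself this comparison is done in batch over whole lexicons, as Section 4.3 describes.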
</Section>

<Section position="6" start_page="396" end_page="401" type="sub_section"> <SectionTitle> 4.3 The Database Management Side of Historical Reconstruction </SectionTitle> <Paragraph position="0"> Using the interactive mode of RE described above is a good way to &quot;debug&quot; the table of correspondences and canon. However, RE is most useful as a means of analyzing complete lexicons. The four processes involved in creating reasonable cognate sets from a set of lexicons of modern forms are schematized in Figure 14. They are:

1. segmentation of lexemes and generation of proto-projections,
2. comparison of proto-projections and creation of tentative cognate sets,
3. merging (conflation) of subsets in the list of tentative cognate sets, and
4. conflict resolution within and between cognate sets of homophonous reflexes and homophonous reconstructions (i.e. the application of semantic information).</Paragraph>

<Paragraph position="5"> 9 While the Taglung form itself is sufficient to determine the 'proper' reconstruction in this case, and if the Syang form were not available, it would break the tie between the other competing reconstructions (Bglin, BgilJ, and Bgin), it is usually difficult to pick out such decisive lexical items from a list of words.</Paragraph>

<Paragraph position="6"> (Figure 13 caption: Bgli~ snow in Proto-Tamang (8 different protoforms produced from 4 reflexes). Selecting the 'best' reconstruction from the list of possible reconstructions.) The algorithms for each of these processes are outlined in the pseudocode in Figures 15a-c. First, the Tokenize and FilterAndSubstitute procedures are performed for each form in each source dictionary.</Paragraph>

<Paragraph position="7"> Next, the list of reconstructions generated is examined and those reconstructions that fail to have sufficient support are eliminated. The remaining reconstructions are retained.</Paragraph>

<Paragraph position="8"> Third, each set is compared with each other set to get rid of those which are subsets of other sets (a type of &quot;set covering problem,&quot; discussed in Section 5.1 below). This is primarily a data reduction process, and not interesting algorithmically. We have therefore not provided pseudocode describing it. It is, however, NP-hard, and therefore takes a lot of time for a dataset of any size. 10 For a discussion of set-covering and NP-complete problems, see, for example, Ralston and Reilly (1993), 938-941.</Paragraph>

<Paragraph position="9">
setup tables for the language data file to be processed
get appropriate columns from TofC
set language codes, etc. for output
    if the reconstruction already exists in the list, link modern_form to existing reconstruction
    otherwise add reconstruction and link to modern_form into list
    end if
end for
end for

Figure 15a Pseudocode for RE's basic batch functions--first create reconstructions.</Paragraph>

<Paragraph position="10"> Finally, if the linguist is able to provide (on the basis of analysis of previous runs) semantic criteria for distinguishing homophones, the program can separate the sets into sets that contain only semantically compatible reflexes. The method for accomplishing this is described in Section 6.1 below.

divide a set into two based on list of glosses selected
for each of the newly created divided sets
    if (it is supported by data from at least two different languages) and (it is not now a subset of some other existing cognate set) then retain the divided set
    otherwise delete the divided set
/* check the division in the rest of the sets */
for all other sets containing any subset of these glosses
    if words with semantically incompatible glosses appear then divide the set (as was done above)

Pseudocode for RE's basic batch functions--semantic component.</Paragraph>
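A compact sketch of the first two batch steps (generate proto-projections for every entry, then keep only reconstructions supported by forms from at least two languages). It is only an illustration in Python of the logic outlined in Figures 15a-b, not RE's actual code; the toy lexicon and the stand-in reconstruct() function are invented.

from collections import defaultdict

def propose_cognate_sets(lexicons, reconstruct, min_languages=2):
    """Index reflexes by proposed reconstruction and keep the well-supported ones.

    `lexicons` maps a language to (form, gloss) pairs; `reconstruct` returns the
    set of candidate protoforms for one form (the role played by RE's Tokenize
    and FilterAndSubstitute procedures).
    """
    by_proto = defaultdict(list)
    for lang, entries in lexicons.items():
        for form, gloss in entries:
            for proto in reconstruct(lang, form):
                by_proto[proto].append((lang, form, gloss))
    return {proto: reflexes for proto, reflexes in by_proto.items()
            if len({lang for lang, _, _ in reflexes}) >= min_languages}

# Toy data: two invented reflexes that both project back to a protoform "*P".
lexicons = {"ris": [("kra", "head hair")], "tuk": [("ko", "head hair")]}
reconstruct = lambda lang, form: {"*P"} if form in ("kra", "ko") else set()
print(propose_cognate_sets(lexicons, reconstruct))
# -> {'*P': [('ris', 'kra', 'head hair'), ('tuk', 'ko', 'head hair')]}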
The method for accomplishing this is described in Section 6.1 below.</Paragraph> <Paragraph position="15"> The first step, creating the list of proto-projections, is merely a matter of iteratively applying the reconstruction-generating procedures described in Section 4.2.1 to all the forms in the files. The list of protoforms obtained by running all the entries of a modern dictionary backward through tlhe program is saved for later combination with reconstructions generated by words from other languages. The process is illustrated in Figure 16. Note that forms that fail to produce any reconstructions are saved in a residue file for further analysis. In the example below (Figure 16), we see that a Nepali loan word, Tukche/2gar/ house, failed to produce any reconstructions in the protolanguage, because there exists a phonological subsystem for Nepali loans in Tukche that does not conform to the phonology of native words (i.e. the phonology described by the Table of Correspondences). In particular, no voiced initials have survived in Tukche. In other cases, forms collected :in the check files may indicate a mistake in the Table of Correspondences, which needs to be corrected to allow the word to reconstruct successfully. Note that the Tukche words in this example are glossed in French (neige &quot;snow,&quot; oeil &quot;eye,&quot; maison &quot;house&quot;), as they are taken from a French-Tukche dictionary. This is a significant fact, as will be explained in Section 6.1.</Paragraph> <Paragraph position="16"> Combining the lists of reconstructions for several languages into a single sequence and sorting by the proposed reconstructions brings together all reflexes that could descend from a particular reconstruction (Figure 17).</Paragraph> <Paragraph position="17"> Proposition of protoforms and the residue (&quot;check&quot;) file.</Paragraph> <Paragraph position="18"> From this sorted list, RE extracts matching reconstructions, with their supporting forms, and proposes them as potential cognate sets (Figure 18). Ideally, rules would be sufficiently precise for the program to propose only valid sets, and sufficiently broad not to exclude legitimate possibilities. However, there is a certain amount of redundancy and uncertainty in the rules that tends to result in several possible reconstructions for the same cognate set. On the other hand, some forms that do produce possible reconstructions cannot be included in a cognate set because their reconstruction does not match that of any word in the other languages. These isolates (not illustrated) are collected by the program during the set creation process and maintained as a separate list.</Paragraph> <Paragraph position="19"> The first evaluation measure hypothesized for establishing the validity of a cognate set was the number of supporting forms. The program retained cognate sets when the number of supporting forms from different languages reached a certain threshold value. However, many reasonable cognate sets had forms from only a few languages.</Paragraph> <Paragraph position="20"> The handling of this problem and other problems having to do with the composition of the proposed cognate sets is dealt with in more detail in Section 5.</Paragraph> <Paragraph position="21"> mutate in phonological shape, but would remain distinct from other words in both form and meaning. Making cognate sets in such a situation would be quite straightforward. In reality, neither semantic nor phonological distinctions are maintained over time. 
We will examine some of the implications of this situation.</Paragraph> </Section>

<Section position="7" start_page="401" end_page="402" type="sub_section"> <SectionTitle> 5.1 Many Reconstructions May Be Possible for a Given Set of Cognates </SectionTitle> <Paragraph position="0"> The process of &quot;triangulation&quot; (discussed in Section 4.2.3 and in some detail in Lowe and Mazaudon [1989, 1990] and Mazaudon and Lowe [1991]) provides a means for selecting the best reconstruction out of several candidates. If it were possible a priori to determine which reflexes might fall together based on the correspondences, it would be possible to preserve just those reconstructions from which all the reflexes might descend (and discard the other reconstructions). This situation is the one illustrated with the words for snow in Figure 13. However, when entire lexicons are processed, it is not necessarily possible nor even desirable to attempt to partition the lexemes into comparable sets to begin with. It is necessary to generate all possibilities and later to eliminate the undesirable ones through a sort of competition. Using the number of supporting forms in a cognate set as the sole or primary criterion for keeping the set was found to be an inadequate heuristic: some perfectly good sets have only three members, while others with more members are shaky and in some cases simply wrong.</Paragraph>

<Paragraph> 11 Merger refers to the diachronic process by which the distinction between two (or more) phonemes is lost. Words that were minimal pairs on the basis of this distinction become homophones. Split refers to the process by which a phoneme becomes two (usually because of some modification in the context).</Paragraph>

<Paragraph position="1"> Figure 18 A &quot;Cognate Set&quot; generated by RE.</Paragraph>

<Paragraph position="3"> An example of this competition is illustrated in Figure 19 (another representation of the data presented in Figure 3). Here, as in the Bgli~ snow example (Figure 13), several different cognate sets composed of the same reflexes but having different reconstructions have been generated. The reflexes clearly fall together into the same overall cognate set, but the reconstruction of several of the forms is non-unique: the reconstruction *Abo: is supported by forms from Marpha and Syang, while *Abap is supported by all four languages. This cognate set, then, reflects a merger of several smaller sets, as indicated in the Venn diagram. The Risiangku form disambiguates the reconstructions, showing that *Abap should be recognized as the winning reconstruction (it is marked with the *; other reconstructions supported by various subsets of the reflexes are marked with !*). As an aside, this process of set conflation can only be accomplished once all the words in all the lexicons are processed. The problem then presented is a set-covering problem (one of the various kinds of NP-complete problems for which no fast or easy solution exists). In the case of our Tamang database, in which about 7,000 modern forms yield about 8,000 different reconstructions that have supporting forms from at least two languages, set conflation takes several hours on our 386-based machine.</Paragraph> </Section>
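As an illustration of the conflation step (step 3 in Section 4.3), here is a small sketch that drops any proposed cognate set whose supporting reflexes are contained in another proposal's support. The reconstruction and support labels are schematic, loosely following the *Abap/*Abo: competition in Figure 19, and this simple quadratic filter is only a stand-in for the much more expensive set-covering computation the text describes.

def conflate(proposed):
    """Keep a proposed set only if its support is not a proper subset of another's.

    `proposed` maps a candidate reconstruction to the set of languages (or
    reflexes) supporting it.
    """
    return {proto: support for proto, support in proposed.items()
            if not any(support < other
                       for p2, other in proposed.items() if p2 != proto)}

# Schematic version of Figure 19: *Abo: is supported only by a subset of the
# languages that support *Abap, so it is absorbed into the larger set.
proposed = {
    "*Abap": {"ris", "tuk", "mar", "syang"},
    "*Abo:": {"mar", "syang"},
}
print(sorted(conflate(proposed)))  # -> ['*Abap']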
<Section position="8" start_page="402" end_page="403" type="sub_section"> <SectionTitle> 5.2 Other Kinds of Competition for Reflexes </SectionTitle> <Paragraph position="0"> A number of other types of overlapping can occur. And in general, the cognate sets of competing reconstructions do not nest as neatly as they do in the previous example. It is often the case that the cognate sets may merely overlap to some extent. These cases fall into several distinct classes: First, there exist some overlapping sets in which neither set is a proper subset of the other; the reflexes fit semantically, so the problem is one of reconciling the reconstructions. Figure 20 illustrates the problems that arise when semantically compatible reflexes support different reconstructions.</Paragraph>

<Paragraph position="1"> Figure 19 Nested cognate sets.</Paragraph>

<Paragraph position="2"> This particular situation results from an uncompensated merger (of length) that is now in progress: the length distinction appears to be on its way out in the Risiangku dialect (abbreviated &quot;ris&quot; in Figure 20). This explanation is based on knowledge of the language and an internal analysis of its phonology. At this time it is not clear what algorithm (if any) could be used to sort out cases like these.</Paragraph>

<Paragraph position="3"> A similar but more complicated situation can be seen in Figure 21. Here the free variation in vowel length generates additional possible reconstructions, and those dialects where final consonants have been lost permit the reconstruction of variants with a final stop as well. However, the form from the Risiangku dialect, which normally preserves finals, cannot reconstruct (according to the table) to either the short vowel or stopped rhyme, and so another cognate set supporting a long vowel reconstruction is created.</Paragraph> </Section> </Section> </Paper>