<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3201">
  <Title>A Combined Phonetic-Phonological Approach to Estimating Cross-Language Phoneme Similarity in an ASR Environment</Title>
  <Section position="3" start_page="1" end_page="3" type="metho">
    <SectionTitle>
2 Phoneme specification
</SectionTitle>
    <Paragraph position="0"> In the CPP approach to estimating cross-language phoneme similarity, each phoneme in our multilingual ASR dataset is associated with a distinctive feature matrix. Feature categories are fixed for all phonemes, hierarchically related, and binary-valued. Feature contradiction, associated with allophonic variance, is explicitly addressed through the introduction of a small set of special corollary features.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
2.1 The phoneme feature matrix
</SectionTitle>
      <Paragraph position="0"> As noted in the introduction, cross-language phoneme comparison requires accurate feature specification. Because a phoneme comprises one or more  allophones which may contrast in particular features, a distinctive feature strategy that allows for feature contradiction is preferred. Omitting contradictory features and underspecifying contradictory values are two well-known methods.</Paragraph>
      <Paragraph position="1"> However, cross-language phoneme comparison in a computational environment is greatly facilitated by agreeing on a fixed set of binary-valued features for all phonemes. A fixed set of distinctive features is favored as this enables cross-class phoneme comparison. A binary-valued system is easy to manipulate and naturally lends itself to mathematical formulation. However, strict binary-valued feature systems only indicate the presence or absence of a feature, and feature contradiction must then be indicated by feature omission - which is not possible in a fixed distinctive feature set.</Paragraph>
      <Paragraph position="2"> The phoneme specification method that we employ indicates feature contradiction associated with allophony in a strict binary-valued, fixed set of distinctive features through the introduction of special feature categories. Specifically, we utilize a small set of corollary features to mark the occasional, allophonic realizations of some primary features. A corollary feature is defined as a feature that supplements a primary feature in the system.</Paragraph>
      <Paragraph position="3"> The corollary features mark &amp;quot;occasionality&amp;quot; (associated with context dependency, dialectal variation, speech style variation, etc.) in the primary feature as either present or absent.</Paragraph>
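The fixed, binary-valued specification with corollary features can be sketched as follows. This is an illustrative toy encoding, not the authors' actual feature set; the feature names and phoneme choices below are hypothetical stand-ins for the paper's 26 primary and 6 corollary features.

```python
# Illustrative sketch: a phoneme as a fixed-length, binary-valued feature
# vector. Primary features come first; "spread_occ" is a corollary feature
# marking occasional aspiration. Feature names here are assumptions.
FEATURES = ["sonorant", "consonantal", "continuant", "voice", "coronal",
            "spread_glottis",   # primary feature
            "spread_occ"]       # corollary feature: occasional aspiration

def make_phoneme(positive):
    """Return a fixed-length 0/1 vector; features not listed default to 0."""
    return [1 if f in positive else 0 for f in FEATURES]

# Hypothetical examples: an occasionally aspirated /k/ vs. a consistently
# aspirated one (compare the [spread glottis] / [spread-occ] discussion).
k_occasional = make_phoneme({"consonantal", "spread_occ"})
k_aspirated = make_phoneme({"consonantal", "spread_glottis"})
```

Because every phoneme shares the same fixed feature list, any two phonemes, even across classes and languages, can be compared position by position, which is what the later distance calculations rely on.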
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Primary and corollary features
</SectionTitle>
      <Paragraph position="0"> Our feature set includes twenty-six primary articulatory features and six corollary features. The selected primary features conform to a typical set of hierarchically-related distinctive features (e.g. syllabic, sonorant, consonantal, labial, coronal, nasal, continuant, high, low, back, etc.) (Ladefoged 1975). In this hierarchical system, the presence of one feature presupposes the presence of its hierarchically dominant features. For example, the presence of the feature [alveolar] requires the presence of the feature [coronal], and the presence of the feature [nasal] requires the presence of the feature [sonorant]. Significantly, the reverse of these relations does not hold. As explained in the next section, this feature structure allows for a linguistically-principled determination of feature salience in phonetic distance calculation.</Paragraph>
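The hierarchical presupposition described above lends itself to a simple validity check on a feature specification. The fragment below is a hedged sketch: the DOMINATES map covers only the two example implications from the text plus one assumed link, not the full hierarchy.

```python
# Sketch of the hierarchical dependency: a positive subordinate feature
# presupposes a positive dominant feature. This map is a hypothetical
# fragment of the hierarchy, for illustration only.
DOMINATES = {
    "alveolar": "coronal",    # [alveolar] requires [coronal]
    "nasal": "sonorant",      # [nasal] requires [sonorant]
    "coronal": "consonantal", # assumed link, not stated in the text
}

def is_well_formed(features):
    """True if every positive feature's dominant feature is also positive."""
    positive = {f for f, v in features.items() if v == 1}
    return all(DOMINATES[f] in positive for f in positive if f in DOMINATES)

ok = is_well_formed({"alveolar": 1, "coronal": 1, "consonantal": 1})
bad = is_well_formed({"nasal": 1, "sonorant": 0})  # violates [nasal] requires [sonorant]
```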
      <Paragraph position="1"> Corollary features are restricted to specifying those primary features that are judged to be most significant to cross-language phoneme comparison in an ASR environment. Phoneme inventories designed for ASR comprise both phonemes and significant allophones, where a significant allophone is characteristically both acoustically distinct from the primary allophone and associated with a sufficiently high count of occurrence in the associated speech database. Thus American English ASR inventories regularly include an alveolar tap, a contextually-realized allophonic variant of both /t/ and /d/. Furthermore, pronunciation transcriptions in ASR lexica are typically phonetic - within the context of the phoneme-based inventory. So, word-final voice neutralization in German is overtly indicated throughout the lexicon (e.g. hund : h U n t).</Paragraph>
      <Paragraph position="2"> A typical ASR phoneme then does not represent a true phoneme; rather it encompasses only that phonemic variation that is not explicitly captured by its existing significant allophones in the inventory. Corollary features specify variance that is not usually overtly indicated in ASR inventories and lexica but that is important to cross-language phoneme comparison in an acoustic, ASR environment. Internal phoneme recognition experiments indicate that generally major class features (syllabic, sonorant, etc.), manner features (nasal, continuant, etc.) and laryngeal features (voice, spread glottis, etc.) are more robustly identified than place features (labial, coronal, etc.); accordingly, the set of corollary features, provided in Table 1, predominantly targets particular major class, manner, and laryngeal features.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
Table 1: Corollary features
</SectionTitle>
      <Paragraph position="0"> Table 1 lists each corollary feature with its description:
- syllabic-occ: positive value marks the occasional realization of the phoneme as a syllabic consonant or glide
- voice-occ: positive value marks the occasional voicing of phonemes
- labial-occ: positive value marks the occasional rounding of vowels
- nasal-occ: positive value marks the occasional nasalization of vowels and glides
- rhotic-occ: positive value marks the occasional rhotization of liquids and vowels
- spread-occ: positive value marks the occasional aspiration of obstruents
It should be pointed out that allophones that express a place contrast or a difference in continuance with the primary realization of a phoneme are typically considered significant allophones in the ASR phoneme system and are therefore overtly represented. As an illustration of the usefulness of corollary features in cross-language phoneme comparison, consider Table 2, which includes a partial feature matrix for the phoneme /k/ associated with 17 languages and dialects. Note that the realization of the phoneme /k/ differs across the seventeen languages and dialects in the two features provided: [spread glottis] and [spread-occ]. The presence of the feature [spread glottis], marked by 1, combined with the absence of the corollary feature [spread-occ], marked by 0, indicates that the glottis is always open during the articulation of the phoneme; i.e. this phoneme is consistently associated with aspiration. The precise IPA transcription of this segment is /kʰ/. A positive value for the corollary feature [spread-occ] means that the phoneme is only sometimes associated with aspiration; such a phoneme has two principal phonetic realizations, marked [k] and [kʰ] in IPA notation. A 0 value for both the feature [spread glottis] and the corollary feature [spread-occ] indicates that the segment is never aspirated; this phoneme is most accurately labeled /k/ in IPA notation.</Paragraph>
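The three-way interpretation of [spread glottis] and [spread-occ] can be expressed as a small lookup. This is an illustrative sketch of the reading given above, not code from the paper; the function name is a hypothetical choice.

```python
# Sketch: map the [spread glottis] / [spread-occ] value pair for a /k/-like
# phoneme to the IPA label discussed in the text.
def aspiration_label(spread_glottis, spread_occ):
    if spread_glottis == 1 and spread_occ == 0:
        return "/kʰ/"          # consistently aspirated
    if spread_occ == 1:
        return "[k] ~ [kʰ]"    # occasionally aspirated: two realizations
    return "/k/"               # never aspirated
```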
      <Paragraph position="1"> Because this methodology incorporates phoneme feature contradiction, overall phonological similarity among languages and dialects is more precisely predicted. Table 3 reveals that Germanic languages tend to aspirate /k/ only occasionally, Romance languages avoid aspirating /k/, and Sinitic languages typically aspirate /k/. Of course, closely related languages tend to be phonologically similar.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="3" end_page="11" type="metho">
    <SectionTitle>
3 Phonetic distance
</SectionTitle>
    <Paragraph position="0"> Most techniques for measuring phonetic distance between phonemes that do not assume speech data availability are based on articulatory features, though perceptual distance, judged (subjective) distance, and historical distance are also attested (Kessler 2005). We base our phonetic distance measurement on articulatory features because of their cross-linguistic consistency and general availability.</Paragraph>
    <Paragraph position="1"> As Kessler notes, standard phonological theory provides no guidance in comparing phonetic distance between phonemes across multiple features (Kessler 2005). In our experiments to date, we use the Manhattan distance where the distance between phonemes equals the sum of the absolute values of individual feature distances. This approach is fairly standard in the literature, though the Euclidean distance has also been reported to attain good results (Kessler 2005).</Paragraph>
    <Paragraph position="2"> Because features are known to differ in relative importance (Ladefoged 1969), some researchers apply weights or saliencies to the individual features for distance calculation. Nerbonne and Heeringa (1997), for example, weighted each feature by information gain, or entropy reduction. Kondrak (2002) expressed weights as coefficients that could  be changed to any numeric value. He adjusted the coefficients until he achieved optimal performance on aligning cognate words.</Paragraph>
    <Paragraph position="3"> In our approach, weights are derived from the lexica of all the considered languages. Specifically, the value of a weight for a feature is derived from the frequency of the feature in the lexica. Each language is treated equally in this approach; thus, the weights are not subject to the relative size of a language's lexicon.</Paragraph>
    <Paragraph position="4"> Because our phoneme specification method incorporates hierarchical relations between features, feature weights are necessarily interdependent.</Paragraph>
    <Paragraph position="5"> Hierarchically dominant features are more frequently attested than their subordinate features and thus receive more weight. Further, hierarchically superior features tend to correspond to major phonetic categories (sonorant, consonantal, syllabic, etc.), which are expected to be more contrastive or distant to each other than sister subordinate categories. Thus, in a hierarchical feature system, lexical frequency of features is a reasonable indication of feature importance in phonetic contrast or distance. In the following two subsections the phonetic distance algorithm is described.</Paragraph>
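A minimal sketch of the weight derivation described above, assuming each lexicon is given as (feature vector, occurrence count) pairs; the data structure is hypothetical, but the two normalization steps (within each language, then equally across languages) follow the text.

```python
# Sketch: derive per-feature weights from lexical feature frequency.
# Each language's counts are normalized by its lexicon size, then languages
# are averaged with equal weight, so lexicon size does not matter.
def feature_weights(lexica):
    """lexica: {lang: [(feature_vector, count), ...]} -> list of weights."""
    num_feats = len(next(iter(lexica.values()))[0][0])
    weights = [0.0] * num_feats
    for phonemes in lexica.values():
        total = sum(c for _, c in phonemes)
        for vec, count in phonemes:
            for j, v in enumerate(vec):
                weights[j] += v * count / total  # per-language normalization
    return [w / len(lexica) for w in weights]    # languages weighted equally

# Toy lexica: feature 0 behaves like a dominant feature (always present)
# and so receives the largest weight.
lex = {
    "L1": [([1, 1, 0], 10), ([1, 0, 1], 30)],
    "L2": [([1, 1, 1], 5)],
}
w = feature_weights(lex)
```

The toy data illustrates the claim in the text: the hierarchically dominant (more frequent) feature ends up with more weight than its subordinates.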
    <Paragraph position="6"> Quantitative representation of phonemes: A phoneme is denoted by p_l(i), where l (= 1, ..., L) indexes the language that includes the phoneme, and i (= 1, ..., I_l) indexes the phoneme in language l. Thus, the phoneme inventory of language l is P_l = {p_l(1), ..., p_l(I_l)}.</Paragraph>
    <Paragraph position="8"> Each phoneme is represented as a binary feature vector, p_l(i) = [f_l(i,1), ..., f_l(i,J)]^T, where J is the number of features and T denotes vector transposition.</Paragraph>
    <Paragraph position="9"> Weighted phonetic distance: As mentioned, the value of a weight for a feature in the present phonetic distance approach is derived from the frequency of the feature in the lexica of all the considered languages. Let c_l[p_l(i)] denote the occurrence count of a phoneme p_l(i) in a lexicon of language l; the frequency of feature j contributed by the phoneme p_l(i) is then proportional to c_l[p_l(i)] f_l(i,j), normalized within each language and averaged over languages to give the weight w(j).</Paragraph>
    <Paragraph position="11"> The weights are collected in a diagonal matrix W = diag(w(1), ..., w(J)), where diag(vector) gives a diagonal matrix with the elements of the vector as the diagonal entries. We define the phonetic distance between phonemes p_t(k) and p_l(i)</Paragraph>
    <Paragraph position="13"> in the form of a Manhattan distance, which is expressed as d_lt(i,k) = sum_j w(j) |f_t(k,j) - f_l(i,j)|,</Paragraph>
    <Paragraph position="15"> for k = 1, ..., I_t, where the weights, given in the diagonal matrix W, depend upon the feature identity j.</Paragraph>
  </Section>
  <Section position="5" start_page="11" end_page="11" type="metho">
    <SectionTitle>
4 Phonological distance metrics
</SectionTitle>
    <Paragraph position="0"> Although our phoneme specification approach is designed to account for allophonic variance, not all variation is captured. Because of this, the effectiveness of measuring phonetic distance as a stand-alone strategy for predicting cross-language phoneme similarity is compromised. Furthermore, phonetic distance does not determine relative phoneme similarity in the not atypical scenario where two or more phonemes share the same phonetic distance to some target phoneme. In order to address these problems, phonological distance metrics are used to bias cross-language phoneme similarity predictions toward languages that have similar phoneme inventories and phoneme frequency distributions. The general idea is that the more similar the phoneme inventory and the relative importance of each corresponding phoneme between languages, the more likely it is that the corresponding phonemes will be similar.</Paragraph>
    <Paragraph position="1"> Phonological distance consideration is especially desirable in an ASR environment because ultimately HMMs corresponding to those source-language phonemes predicted to be most similar to  target-language phonemes must interact in a system that is intended to reflect a single target language. Use of phonological metrics then ensures that the overall model pool will have a bias toward a reduced set of phonologically similar languages, and it is reasonable to expect that similarity in languages of the model pool provides consistency in the target HMM system (see Schultz and Waibel 2000).</Paragraph>
    <Paragraph position="2"> In this section, we define two distance metrics to characterize cross-language phonological similarity. One is based on monophoneme inventories while the other is based on biphoneme inventories.</Paragraph>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.1 Monophoneme distribution distance
</SectionTitle>
      <Paragraph position="0"> Monophoneme distribution distance characterizes the difference in lexical phoneme distribution between two languages. Specifically, the distribution, or normalized histogram, of the phonemes is obtained from a large lexicon of a language, with the probability in the distribution corresponding to the frequency of a phoneme in the lexicon. We derive the distribution from a lexicon as we consider it more representative of a language's phonology than a particular database.</Paragraph>
      <Paragraph position="1"> The monophoneme distribution metric is a typological comparison that is based on two principal classes of information: (1) types of sounds and (2) frequencies of these sounds in the lexicon. The former class is directly associated with phoneme inventory correspondence while the latter concerns relative phoneme importance.</Paragraph>
      <Paragraph position="2"> Because the phoneme inventories of the two languages to be compared may not be identical, we first need to define a combined inventory for them, P_lt = P_t U P_l = {p_lt(1), ..., p_lt(I_lt)},</Paragraph>
      <Paragraph position="4"> where p_lt(m) is a phoneme in the combined inventory, of which there are I_lt in total.</Paragraph>
      <Paragraph position="5"> The frequency of the phoneme p_lt(m) in language l can be expressed as h_l(m) = c_l[p_lt(m)] / sum_m' c_l[p_lt(m')], where c_l[p_lt(m)] is the occurrence count of p_lt(m)</Paragraph>
      <Paragraph position="7"> in a lexicon of language l. If a phoneme p_lt(m) does not exist in the language, its frequency is zero. The difference of phoneme frequencies between the two languages can be calculated as |h_t(m) - h_l(m)|.</Paragraph>
      <Paragraph position="9"> Then the monophoneme distribution distance between the target language t and source language l is D_mono(t, l) = sum_m |h_t(m) - h_l(m)|.</Paragraph>
      <Paragraph position="11"> The distance is calculated between the target language and every one of the source languages.</Paragraph>
      <Paragraph position="12"> In view of the known differences in phonological characteristics between vowels and consonants, we make separate calculations for the vowel and consonant categories. Thus Eq. (9) becomes a sum of per-category distances, D_mono(t, l) = sum_g D_mono,g(t, l),</Paragraph>
      <Paragraph position="14"> where g = Vowels or Consonants.</Paragraph>
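A hedged sketch of the monophoneme distribution distance with per-category comparison; the toy counts and the vowel/consonant split are hypothetical, and the exact normalization used in the paper may differ.

```python
# Sketch: compare normalized phoneme histograms of two languages over the
# combined inventory, restricted to one category (vowels or consonants).
def mono_distance(counts_t, counts_l, category):
    """counts_*: {phoneme: lexical count}; category: set of phonemes."""
    combined = set(counts_t).union(counts_l).intersection(category)
    tot_t = sum(counts_t.values())
    tot_l = sum(counts_l.values())
    # Phonemes absent from a language get frequency zero, as in the text.
    return sum(abs(counts_t.get(p, 0) / tot_t - counts_l.get(p, 0) / tot_l)
               for p in combined)

consonants = {"k", "t", "s"}
d_cons = mono_distance({"k": 2, "t": 2, "a": 4},
                       {"k": 1, "s": 1, "a": 2}, consonants)
```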
    </Section>
    <Section position="2" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.2 Biphoneme distribution distance
</SectionTitle>
      <Paragraph position="0"> The biphoneme distribution distance metric characterizes the difference in lexical distribution of phoneme pairs, or biphonemes, between two languages. Similar to the monophoneme distribution distance, the distribution of biphonemes in a language is obtained based on the frequency of biphonemes in a large lexicon.</Paragraph>
      <Paragraph position="1"> The biphoneme metric indicates how phonemes can combine in a language and how important these combinations are. Though the phonotactics provided in this approach is limited to sequences of two, the overall biphoneme inventory and distribution provide important phonological information. For example, they indicate whether and to what extent consonants can cluster. Some languages, like the Romance languages, tend to disfavor consonant clustering, while others, like the Germanic languages, allow for broad clustering. They also indicate whether and to what extent vowels may co-occur. Many languages require an onset consonant, so vowels will never co-occur; other languages have no such restriction.</Paragraph>
      <Paragraph position="2"> The biphoneme metric then yields types of information that are distinct from the monophoneme metric. It explicitly provides a biphoneme inventory, permissible phonotactic sequences, and phonotactic sequence importance. It also implicitly incorporates phoneme inventory and phonological complexity information.</Paragraph>
      <Paragraph position="3"> Similar to the monophoneme distribution distance, the distribution of biphonemes in a language is obtained from the frequency of each biphoneme in a large lexicon. The combined biphoneme inventory for the target language t and a source language l is expressed as Q_lt = {q_lt(1), ..., q_lt(I'_lt)}, where q_lt(n) is a biphoneme in the combined inventory, of which there are I'_lt in total. For a phoneme at the beginning or end of a word, q_lt(n) takes the format of &amp;quot;void+phoneme&amp;quot; or &amp;quot;phoneme+void&amp;quot;, respectively.</Paragraph>
      <Paragraph position="4"> The frequency of a biphoneme q_lt(n) in language l can be expressed analogously to the monophoneme frequency, and the biphoneme distribution distance is computed as the sum of absolute frequency differences over the combined biphoneme inventory. Similarly, the distance is better characterized within the categories of vowels and consonants separately. In our algorithm we count each biphoneme twice, first as a left-contact biphoneme and second as a right-contact biphoneme. Thus D_bi(t, l) = sum_g D_bi,g(t, l), where g = Vowels or Consonants.</Paragraph>
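The void-padded biphoneme extraction can be sketched as follows; the "void" token and the word representation follow the description above, while the function itself is an assumed implementation.

```python
# Sketch: extract biphonemes (adjacent phoneme pairs) from a word, padding
# word boundaries with "void" so initial and final phonemes also form pairs.
def biphonemes(word):
    """word: list of phonemes -> list of (left, right) biphoneme pairs."""
    padded = ["void"] + list(word) + ["void"]
    return list(zip(padded, padded[1:]))

# German 'hund' transcription from the text: h U n t
pairs = biphonemes(["h", "U", "n", "t"])
```

A word of n phonemes yields n+1 biphonemes, including the two boundary pairs.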
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.3 CPP phoneme distance
</SectionTitle>
      <Paragraph position="0"> For phoneme similarity prediction, we unite the phonetic and phonological distance metrics to arrive at the CPP phoneme distance measurement.</Paragraph>
      <Paragraph position="1"> Since the three distances are from different domains and provide distinct types of information, normalization is necessary before combination.</Paragraph>
      <Paragraph position="2"> The normalization, aimed at extracting the relative ranking between source phonemes and languages, is a linear transformation that scales the score range from each domain into the range [0, 1].</Paragraph>
      <Paragraph position="3"> We equate the overall importance of phonetics with that of phonology by providing a weight of 2 to the phonetic score and 1 to each of the phonological scores. By doing this, a source-language phoneme can have a greater phonetic distance to some target-language phoneme than other source-language phonemes but a lower phonological distance and receive a lower overall phoneme distance score. It is because phonological distance is considered as important as phonetic distance that the overall constructed target-language model pool will tend to be restricted to a subset of phonologically similar languages.</Paragraph>
      <Paragraph position="4"> The combined, feature-based phoneme distance metric is defined as the weighted sum of the three normalized scores: twice the normalized phonetic distance d_lt(i, k) plus the normalized monophoneme and biphoneme distribution distances. For the phonological distances D_g, the original range is determined by the scores of all the source languages, and their scaling is done once for a target language t. For d_lt(i, k), we found that it is better to do the scaling once for each target phoneme p_t(k),</Paragraph>
      <Paragraph position="6"> with the original range determined by the scores of a group of candidate phonemes that includes at least one phoneme from every source language.</Paragraph>
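The normalization-and-combination step can be sketched as min-max scaling followed by the 2:1:1 weighting described above. This is a simplified sketch: it scales each score list once, whereas the paper scales the phonetic distance separately per target phoneme.

```python
# Sketch: min-max normalize each score domain to [0, 1], then combine with
# weight 2 on the phonetic score and 1 on each phonological score.
def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi != lo else 0.0 for s in scores]

def cpp_scores(phonetic, mono, bi):
    """Per-candidate combined distance; lower means more similar."""
    p, m, b = minmax(phonetic), minmax(mono), minmax(bi)
    return [2 * pi + mi + bi_ for pi, mi, bi_ in zip(p, m, b)]

# Hypothetical raw scores for three candidate source phonemes.
combined = cpp_scores([0.2, 0.8, 0.5], [1.0, 3.0, 2.0], [4.0, 6.0, 5.0])
```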
    </Section>
  </Section>
  <Section position="6" start_page="11" end_page="11" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> To test our CPP approach to phoneme similarity prediction, we compared it to an acoustic distance approach in ASR experiments. Because native language speech data is used in measuring model distance in the acoustic approach, it is expected to work better than the knowledge-based approach, which only estimates acoustic similarity indirectly through articulatory phonetic distance and overall phonological distance.</Paragraph>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.1 Model construction
</SectionTitle>
      <Paragraph position="0"> We employ the regular 3-state, left-right, multi-mixture, continuous-Gaussian HMMs as the acoustic models and assume that the models from all the source and target languages have the same topology, except that the number of mixtures in a state may vary. Once the top source phonemes are determined from our feature-based phoneme distance metric for each target phoneme, the target HMM is constructed by gathering all the mixtures for a corresponding state from the source candidates. The original mean and variance values are maintained, while the mixture weights are uniformly scaled down so that the new weights add up to one for each state. It is possible to weight mixtures according to the relative importance of the candidates if the phoneme distance metric reflects a sufficiently large difference in importance. The transition probabilities are adopted from the top candidate model.</Paragraph>
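The mixture-pooling step can be sketched as follows, assuming each candidate state is a list of (weight, mean, variance) tuples; this data layout is a hypothetical simplification (real HMM states hold vector means and covariances), but the uniform rescaling so weights again sum to one follows the text.

```python
# Sketch: build one target HMM state by pooling the Gaussian mixtures of
# the top candidate models' corresponding states. Means and variances are
# kept; mixture weights are uniformly rescaled to sum to one.
def pool_state_mixtures(candidate_states):
    pooled = [mix for state in candidate_states for mix in state]
    total = sum(w for w, _, _ in pooled)
    return [(w / total, mean, var) for w, mean, var in pooled]

state = pool_state_mixtures([
    [(0.6, 0.0, 1.0), (0.4, 1.0, 1.0)],  # top-1 candidate's state
    [(1.0, 2.0, 0.5)],                   # top-2 candidate's state
])
```

Since each candidate's weights already sum to one, pooling two candidates simply halves every weight, which is the uniform scale-down described above.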
    </Section>
    <Section position="2" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.2 CPP phoneme model construction
</SectionTitle>
      <Paragraph position="0"> We used the 17 languages and dialects provided in Table 2 in the experiments testing our CPP phoneme distance approach to phoneme HMM similarity. For each language, a native monolingual model set had been built by training with native speech data. The acoustic features are 39 regular MFCC features, including cepstral, delta, and delta-delta coefficients. The individual ASR databases derive from a variety of projects and protocols, including, but not limited to, CallHome, EUROM, SpeechDat, Polyphone, and GlobalPhone. In each of the following experiments, we select one language as the target language and construct its acoustic models by using all the other languages as source languages. A phoneme distance score is calculated for each target phoneme, and the top two candidate source-language phonemes are chosen for HMM model construction. We conducted experiments with Italian, Latin American Spanish, European Portuguese, Japanese, and Danish as target languages.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.3 Acoustic model construction
</SectionTitle>
      <Paragraph position="0"> In the acoustic distance approach, models are built with the top two models chosen from the source languages based on their acoustic distance from the corresponding native target model. For these experiments, we adopt the widely used Bhattacharyya metric for the distance measurement (Mak and Barnard 1996). It should be noted that the recognition performance of the acoustics-constructed models is not a theoretically strict upper bound for HMM similarity, because the measurement in the acoustic space is probabilistic.</Paragraph>
    </Section>
  </Section>
</Paper>