<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1094"> <Title>A concurrent approach to the automatic extraction of subsegmental primes and phonological constituents from speech</Title>
<Section position="3" start_page="578" end_page="578" type="metho"> <SectionTitle> 1 Phonological primes and constituents </SectionTitle>
<Paragraph position="0"> Much of the phonological research work of the past twenty years has focussed on phonological representations: on the make-up of individual segments and on the prosodic hierarchy binding skeletal positions together.</Paragraph>
<Paragraph position="1"> Some researchers (e.g. Anderson and Ewen 1987 and Kaye et al. 1985) have proposed a small set of subsegmental primes which may occur in isolation but can also be compounded to model the many phonologically significant sounds of the world's languages. To give an example, in one version of GP (see Brockhaus et al. 1996), nine primes or ELEMENTS are recognised, viz. the manner elements h (noise) and ? (occlusion), the source elements H (voicelessness), L (non-spontaneous voicing) and N (nasality), and the resonance elements A (low), I (palatal), U (labial) and R (coronal). These elements are phonologically active - they can spread to neighbouring segments, be lenited, etc.</Paragraph>
<Paragraph position="2"> The skeletal positions to which elements may be attached (alone or in combination) enter into asymmetric binary relations with each other, so-called GOVERNING relations. A CONSTITUENT is defined as an ordered pair, governor first on the left and governee second on the right. Words are composed of well-formed sequences of constituents. Which skeletal positions may enter into governing relations with each other is mainly determined by the elements which occupy a particular skeletal slot, so elemental make-up is an important factor in the construction of phonological constituents.</Paragraph>
<Paragraph position="3"> GP proponents have claimed that elements, which were originally described in articulatory terms, have audible acoustic identities. As we shall see in § 2, it is possible to define the acoustic signatures of individual elements, so that the presence of an element can be detected by analysis of the speech signal.</Paragraph>
<Paragraph position="4"> Picking out elements from the signal is much more straightforward than identifying phonemes.</Paragraph>
<Paragraph position="5"> Firstly, elements are subject to less variation due to the contextual effects (e.g. place assimilation) of preceding and following segments than phonemes.</Paragraph>
<Paragraph position="6"> Secondly, elements are much smaller in number than phonemes (nine elements compared to c. 44 phonemes in English) and, thirdly, elements, unlike phonemes, have been shown to participate in the kind of phonological processes which lead to variation in pronunciation (see references in Harris 1994).</Paragraph>
<Paragraph position="7"> Fourthly, although there is much variation of phoneme inventory from language to language, the element inventory is universal.</Paragraph>
<Paragraph position="8"> These four characteristics of its elements, plus the availability of reliable element detection, make a phonological framework such as GP a highly attractive basis for multi-speaker speech-driven software. This includes not only traditional ASR applications (e.g. dictation, database access), but also embraces multilingual speech input, medical (speech therapy) and teaching (computer-assisted language learning) applications.</Paragraph> </Section>
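To make the element inventory and its head/operator compounding concrete, the following minimal Python sketch models an elemental expression as a head plus a set of operators. The enum names mirror the nine elements listed above (? is written Q, since ? is not a legal identifier); the two example expressions at the end are illustrative assumptions only, not analyses taken from the paper.

from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet

class Element(Enum):
    """The nine GP elements named in Section 1."""
    h = "noise"                     # manner
    Q = "occlusion"                 # manner (written ? in the text)
    H = "voicelessness"             # source
    L = "non-spontaneous voicing"   # source
    N = "nasality"                  # source
    A = "low"                       # resonance
    I = "palatal"                   # resonance
    U = "labial"                    # resonance
    R = "coronal"                   # resonance

@dataclass(frozen=True)
class Expression:
    """An elemental expression attached to one skeletal position.
    Elements occur either as the head or as operators (cf. Section 2,
    where separate detection thresholds are set for the two roles)."""
    head: Element
    operators: FrozenSet[Element] = frozenset()

    def contains(self, e: Element) -> bool:
        return e == self.head or e in self.operators

# Illustrative expressions only -- assumed for the sake of the example.
vowel_u = Expression(head=Element.U)                                   # [U] as in "hood"
vowel_e = Expression(head=Element.I, operators=frozenset({Element.A})) # a hypothetical I-A compound
print(vowel_e.contains(Element.A), vowel_e.contains(Element.U))        # True False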
<Section position="4" start_page="578" end_page="579" type="metho"> <SectionTitle> 2 Signatures of GP elements </SectionTitle>
<Paragraph position="0"> Table 1 below details the acoustic cues used in PhonMaster. Using training data from five speakers, male and female, synthetic and real, with different regional accents, these cues discriminate between the simplest speech segments containing an element in a minimal combination with others. In the case of a resonance element, say, U, the minimal state of combination corresponds to isolated occurrence in a vowel such as [U], as in RP English hood or German Bus.</Paragraph>
<Paragraph position="1"> The accuracy of cues such as those in Table 1 for discrimination of simplest speech segments has been tested by different researchers using ratios of within-class to between-class variance-covariance and dendrograms (Brockhaus et al. 1996, Williams 1997), as described in PhonMaster's documentation.</Paragraph>
<Paragraph position="2"> The cues are calculated from fast Fourier transforms (FFTs) of speech signals in terms of total amplitude or energy distribution ED across low, middle and high frequency parts of the vocal range, and the angular frequencies ω(F) and amplitudes a(F) of formants. The first four cues, φ1 to φ4, are properties of a single spectral slice, and the change in these four from slice to slice is logged as φ5, which peaks at segment boundaries. The duration cue φ6 is segment-based, computable only after segmentation from the length in slices from boundary to boundary; this length is normalised using the JSRU database of the relative durations of segments in different manner classes (see Chalfont 1997). The normalisation is a simple form of time-warping without the computational complexity of dynamic time-warping. The other segment-based cues contrast steady-state formant values at the centre of a segment with values at the entrance and exit boundaries. They describe the context of a segment without going to the computational complexity of triphone HMMs (e.g. Young 1996). The PhonMaster approach is not tied to a particular set of cues, so long as the members of the set are concerned with ratios which vary much less from speaker to speaker than absolute frequencies and intensities. Nor is the approach bound to FFTs - linear predictive coding would extract energy density and formants just as well.</Paragraph>
<Paragraph position="3"> Signatures are defined from cues by locating cluster centres in cue space and defining a quadratic discriminant based on the variance-covariance matrix of the cluster. When elements occur in higher degrees of combination than those selected for the training sample, separate detection thresholds for distance from cluster centre are set for occurrence as head and occurrence as operator.</Paragraph> </Section>
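The signature test amounts to a distance check against a cluster centre under a quadratic discriminant. The sketch below shows one way such a detector could be organised in Python with NumPy, assuming the Mahalanobis form of the discriminant; the class name, method names and threshold values are assumptions, as the paper does not give implementation detail.

from typing import Optional
import numpy as np

class ElementSignature:
    """Quadratic-discriminant detector for one element, fitted on cue
    vectors of training segments containing the element in minimal
    combination with others (cf. Table 1)."""

    def __init__(self, training_cues: np.ndarray,
                 head_threshold: float, operator_threshold: float):
        # training_cues: shape (n_segments, n_cues)
        self.centre = training_cues.mean(axis=0)
        # Variance-covariance matrix of the cluster and its inverse,
        # which defines the quadratic discriminant.
        self.inv_cov = np.linalg.inv(np.cov(training_cues, rowvar=False))
        # Separate thresholds for the two ways an element can occur in
        # a compound (assumed values, not given in the paper).
        self.head_threshold = head_threshold
        self.operator_threshold = operator_threshold

    def distance(self, cues: np.ndarray) -> float:
        d = cues - self.centre
        return float(d @ self.inv_cov @ d)   # squared Mahalanobis distance

    def detect(self, cues: np.ndarray) -> Optional[str]:
        """Return 'head', 'operator' or None for a single cue vector."""
        dist = self.distance(cues)
        if dist <= self.head_threshold:
            return "head"
        if dist <= self.operator_threshold:
            return "operator"
        return None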
<Section position="5" start_page="579" end_page="581" type="metho"> <SectionTitle> 3 Stagewise element recognition </SectionTitle>
<Paragraph position="0"> The detection of elements in the signal proceeds in three stages, with concurrent processes (lexical access, phonological process repair...) being launched after each stage and before the full identity of a segment has been established.</Paragraph>
<Paragraph position="1"> The overall architecture of the recognition task is shown in Figure 1. At Stage 1, the recogniser checks for the presence of the manner elements h and ?.</Paragraph>
<Paragraph position="2"> This launches the calculation of cues φ5 (for the automatic segmentation process) and φ6 (to distinguish vowels from approximants, and to determine vowel length). The ensuing manner-class assignment process produces the classes:
Occ - Occlusion (i.e. ? present as head, as in plosives and affricates)
Sfr - Strong fricative (i.e. h present as head, as in [s], [z], [S] and [Z])
Wfr - Weak fricative (i.e. h present as operator, as in plosives and non-sibilant fricatives)
Vowel (not readily identifiable as being either long or short).</Paragraph>
<Paragraph position="3"> As soon as such a sequence of manner classes becomes available, repair processes and lexical searches can be launched concurrently. The repair object refers to the constituent structure which can be built on the basis of manner-class information alone and checks its conformance to the universal principles of grammar in GP as well as to language-specific constraints. In cases of conflict with either, a new structure is created to resolve the conflict. For example, the word potential is often realised without a vowel between the first two consonants. This elided vowel would be restored automatically by the repair object, as illustrated in Figure 2, where a nuclear position (N) has been inserted between the two onset (O) positions occupied by the plosives. Constituent structure is less specific than manner classes (in certain cases, different manner-class sequences are assigned the same constituent structure), so manner classes form the key for lexical access at Stage 1. Zue (1985) reports that, even in a large lexicon of c. 20,000 words, around a third of the words can be identified uniquely by manner class alone. This is the case for languages such as English, German, French and Italian, so the accessing of an individual word may be successful as early as Stage 1, and no further data processing need be carried out. If, however, as in Figure 3, the manner-class sequence identified is a common one, shared by several words, then the recognition process moves on to Stage 2, where the phonatory properties of the segments identified at Stage 1 are determined.</Paragraph>
<Paragraph position="4"> Continuing with the example in Figure 3, the lexical access object would now discard words such as seed or shade, as neither of them contains the element H (voicelessness in obstruents), whose presence has been detected in both the initial fricative and the final plosive at Stage 2. Again, it may be possible to identify a unique word candidate at the end of Stage 2, but if several candidates are available, recognition moves on to Stage 3.</Paragraph>
<Paragraph position="5"> Here, the focus is on the four resonance elements. As the manifestations of U, R, I and A vary between voiced vs. voiceless obstruents vs. sonorants, appropriate cues are invoked for each of these three broad classes (some of the cues reusing information gathered at Stage 1). The detection of certain resonance elements then provides all the necessary information for a final lexical search.</Paragraph>
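The stagewise narrowing of lexical candidates can be sketched as successive filters over a lexicon keyed first on manner classes and then on detected elements. The toy lexicon, manner-class sequences and per-segment element sets below are our own assumptions, chosen only to mirror the seed/shade/seep example in the text.

from typing import Dict, List, Sequence, Set

# Toy lexicon: each entry records the manner-class sequence (Stage 1 key)
# and a set of elements per segment (consulted at Stages 2 and 3).
# All analyses below are illustrative assumptions, not the paper's.
LEXICON: Dict[str, Dict[str, object]] = {
    "seed":  {"manner": ("Sfr", "Vowel", "Occ"),
              "elements": [{"h", "H", "R"}, {"I"}, {"?", "R"}]},
    "shade": {"manner": ("Sfr", "Vowel", "Occ"),
              "elements": [{"h", "H"}, {"I", "A"}, {"?", "R"}]},
    "seep":  {"manner": ("Sfr", "Vowel", "Occ"),
              "elements": [{"h", "H", "R"}, {"I"}, {"?", "h", "H", "U"}]},
}

def stage1(manner_sequence: Sequence[str]) -> List[str]:
    """Stage 1: retrieve all words sharing the detected manner-class sequence."""
    return [w for w, entry in LEXICON.items()
            if tuple(entry["manner"]) == tuple(manner_sequence)]

def filter_by_elements(candidates: List[str],
                       detected: List[Set[str]]) -> List[str]:
    """Stages 2/3: keep candidates whose segments contain every element
    detected so far in the corresponding position."""
    surviving = []
    for word in candidates:
        segments = LEXICON[word]["elements"]
        if all(req <= seg for req, seg in zip(detected, segments)):
            surviving.append(word)
    return surviving

# Stage 1: manner classes alone leave three candidates.
candidates = stage1(("Sfr", "Vowel", "Occ"))
# Stage 2: H detected in the initial fricative and the final plosive.
candidates = filter_by_elements(candidates, [{"H"}, set(), {"H"}])
# Stage 3: resonance elements confirm the remaining candidate.
candidates = filter_by_elements(candidates, [{"R"}, {"I"}, {"U"}])
print(candidates)   # expected: ['seep']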
<Paragraph position="6"> In our example, only one word, seep, contains all the elements detected at Stages 1 to 3, as illustrated in the accompanying figure. Concurrently with this lexical search, repair processes check for the effects of assimilation, allowing adjacent segments (especially in clusters involving nasals and plosives) to share one or more resonance elements, thus resolving possible access problems arising from words such as input /'InpUt/ being realised as ['ImpUt].</Paragraph> </Section> </Paper>
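As a companion to the sketch above, the fragment below illustrates one way the assimilation repair could be expressed: a nasal is allowed to share the resonance (place) element of an immediately following plosive before surface and lexical forms are compared. The function name and the per-segment element sets are hypothetical, carried over from the previous sketch rather than taken from the paper.

from typing import List, Set

RESONANCE = {"A", "I", "U", "R"}   # the four resonance elements

def repair_nasal_assimilation(segments: List[Set[str]]) -> List[Set[str]]:
    """Return a copy of the segment sequence in which each nasal (N present)
    also carries the resonance elements of an immediately following plosive
    (? present), so that an assimilated surface form such as ['ImpUt] can
    still be matched against the lexical form /'InpUt/."""
    repaired = [set(seg) for seg in segments]
    for i in range(len(repaired) - 1):
        current, nxt = repaired[i], repaired[i + 1]
        if "N" in current and "?" in nxt:
            current |= nxt & RESONANCE
    return repaired

# Illustrative element sets for the five segments of "input".
surface = [{"I"}, {"N", "U"}, {"?", "h", "H", "U"}, {"U"}, {"?", "h", "H", "R"}]
lexical = [{"I"}, {"N", "R"}, {"?", "h", "H", "U"}, {"U"}, {"?", "h", "H", "R"}]

# Without the repair, the labial nasal [m] would fail to match lexical /n/.
repaired = repair_nasal_assimilation(lexical)
matches = all(s & RESONANCE <= l & RESONANCE for s, l in zip(surface, repaired))
print(matches)   # True: the nasal now shares U with the following plosive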