<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4205">
  <Title>Event Relations at the Phonetics/Phonology Interface</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EVENT RELATIONS AT THE PHONETICS/PHONOLOGY INTERFACE
JULIE CARSON-BERNDSEN
DAFYDD GIBBON
</SectionTitle>
    <Paragraph position="0"> In this paper a procedure for the constrnction of event relations at the phonetics/phonolngy interface is presented. The approach goes further than previous formal interpretations of autosegmental phonology in that phonological relations are explicitly related to intervals in actual speech signals as required by a speech recognition system. An event structure containing the temporal relations of overlap, precedence and inclusion is automatically constructed on the basis of an event lattice with time annotations derived from the speech signal. The event structure can be interpreted linguistically as an antosegmental representation with assimilation, hmg components or coarticulation. The theoretical interest of this work lies in its contrilmtion to the solution of the projection problem in speech recognition, since a rigid mapping to segments is not required.</Paragraph>
    <Paragraph position="1"> t. Motivation In the processing of speech one nf the major problems is the projection problem at the phonetics/phonology interface: sounds and words are realised with different degrees of coarticulation (overlap of properties) iu different lexical, syntactic, and phonostylistic contexts and thus a segmentation into phonemes alone is too rigid in order to capture all variants. Furthermore, the set of possible words in natural languages, analogous to the set of sentences, is infinite, in fact, even subsets of these sets may be so large that a simple list is no longer tractable. This has so far proved to be an insuperable problem for the simple concatenative word inodels of current speech recognition systems, whether phoneme, disyllable, or word based. In this paper, a new approach to this problem is proposed, starting from recent well-motivated developments in phonology such as autosegmental phonology (Goldsmith, 1976,1~)0), artietdatory phonology (Browman &amp; Goldstein, 1986,1989), underspecification theory (Archangeli, 1988; Keating, 1988) and phonological events (Bird &amp; Klein, 1990). The overall context for the work presented here is a further development of the PhoPa system (Carson, 1988; Carson-Berndsen, It)0()) for phonological word parsing with a feature-based phonotactic net. The present approach goes beyond these studies in deriving phonological relations directly from speech data, and in providing detailed languageospecific top-down phonotactic coustraints.</Paragraph>
    <Paragraph position="2"> For phonological parsing a flexible notion of compositiouality is utilised based on underspecified structures with 'autosegmental' tiers of parallel phonological events which avoid a rigid mapping from phonetic parameters to simple sequences of segments.</Paragraph>
    <Paragraph position="3"> The motivation for using an event-based phonological rcprescutatiou was to use phonological knowledge as represented in the phonotactic net (thus also maintaining the notion of underspecification and optimisation by the use of feature eooccurrence restrictions) while cateriug for those phenomena arising in continuous speech which do not correspond to the phonotactics of the lm~guage. An example of this kind of phenomenon found during the labelling of the EUROM-0 speech data in the SAM project (F~SPRIT 2589 of. Brauu, 1991b) is the cluster \[szs\] in the German word \[vE:RUNsz.stc:m\] as a pronunciation of /vE:RUNszYste:m/ WctTmtngs.,ystem (see section 3).</Paragraph>
    <Paragraph position="4"> By using a phonotactic description based ou au autosegmental representation of events and the temporal relations which exist between them, a rigid s%,mentatkm at the phonetic level is no longer necessary. A further advantage of an event representatiou with temporal annotatkms at the phoneticsphonology interface coucerns the exchange of differing types of information between the two levels. An event is interpreted as an interval with a particular property, and it is not necessary to confine the possible set of properties to couventional phonological features such as vnice or nasal but acoustic properties of actual speech signals such as &amp;quot;fi'icatiou noise&amp;quot; or &amp;quot;syllable peak&amp;quot; may be included.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Event Relations
</SectionTitle>
    <Paragraph position="0"> Three stages are involved in the determination of signal-derived event relations at the phonetics/phonology interface. These are: (1) Event Detection, which will be discussed from the point of view of phonetic and phonological levels of representation in section 2.1., (2) Event Mapping where the relations between the individual events are constructed automatically, which is discussed in section 2.2 and (3) Evnut Structure Constraints, defining phonological ACTES D1! COLING-92, NArcre's, 23-28 Ao~ 1992 1 2 6 9 PROC. OF COl JNG-92, NAWrES, Ant;. 23-28, 1992 structure, which are discussed in section 2.3. The work described here ks concerned primarily with speech recognition rather than synthesis and in particular with its phonological parsing component as opposed to the acoustic front end. The event relations generated at the phonetics/phonology interface serve as input to a constraint-based phonological parser whose knowledge base is an event-based description of the phonotactics of the language.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1. Phonetic and Phonological Events
</SectionTitle>
      <Paragraph position="0"> Assuming that the feature detectors at the acoustic level recognise events each consisting of a property and an interval together with a measure of confidence, it is possible to define a procedure which automatically constructs temporal relations of overlap, precedence and inclusion over intervals Bird &amp; Klein (1990) have some reservations about the use of endpoints of intervals at tile phonological level. However, absolute temporal annotatious must indeed be provided at the phonetic level on the basis of threshold and confidence values for a particular acoustic event in a speech signal token, and the use of these in the calculation of temporal relations for a given signal within an actual speech recognition procedure ks in fact necessary, not an option.</Paragraph>
      <Paragraph position="1"> At the phonological level, an event is simply a pair of a property and an interval &lt; P, 1 &gt;. At the phonetic level, an event is a quadruple &lt;P, ts, t~, C&gt;, providing information on event-type (property), start of interval, end of interval and confidence value. This serves as input to event mapping. The output of the mapping is a set of tuples &lt;ei, R, ej&gt; where ei and ej represent events and R is the temporal relation which exists between them (overlap, precedence or temporal inclusion). Using phonological constraints based on simplex and complex phonolological event structures, the phonologically relevant information is abstracted from this set of tuples. It is not the temporally annotated events themselves which are interesting for the phonological parser but the temporal relations which exist between these events (cf. section 2.3).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2. Event Mapping
</SectionTitle>
      <Paragraph position="0"> lu the speech recognition context there is a mapping of absolute phonetic eveuts to abstract temporal relations between events is described. The algorithm for the automatic construction of event relations has the following properties: Each event pair is tested only once; there is no explicit statement of reflexivity. The reflexivity and symmetry of overlap are uot reflected in the output, but can be inferred by Modus Poneos fi'om the axioms at the phonological level. Inclusion is a special case of overlap; thus, when an event is temporally included in another these events also overlap, and the algorithm makes use of this fact.</Paragraph>
      <Paragraph position="1"> There are nine types of overlap, seven of which are instances of inclusion, and all are catered for by the algorithm. It was flmnd that the relation of temporal inclusion played an important role in the constraints needed for phonological parsing (Carson-Berndsen, 1991). Simultaneity was not considered due to the fact that phonetic decisions are made on the basis of confidence values and thus the likelihood of true simultaneity is low. There is no difficulty, however, in augmenting the algorithm to cater for this if required since it is in fact a relationship of mutual temporal inclusion.</Paragraph>
      <Paragraph position="2"> The relations of overlap and precedence which hold between pairs of events are governed by a set of axioms; event structures are defined as a collection of events and constraints. These axioms can be regarded as having three different functions: inference, abbreviation and consistency checking.</Paragraph>
      <Paragraph position="3"> With respect to the abbreviation function of the axioms, this feature is not currently availed of in the algorithm as this would not reduce the search space.</Paragraph>
      <Paragraph position="4"> The consistency checking function of the axioms would be an extra step after the relations have been constructed. The output of the event mapping is an event lattice, analogous to the traditional disjunctive lattices of phoneme, syllable or word-based speech recognition, but not so far considered in previous work based on autosegmeutal structures.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3. Event Structure Constraints
</SectionTitle>
      <Paragraph position="0"> There is clearly no direct correspondence between events as measured in a signal, and abstract phonological structures. These levels differ in five major ways: first, the signal-derived relations may be incomplete, owing to noisy input; second, the signal-derived event relations approximate to the transitive closure of the phonologically relevant minimal specification of the event structure, and must therefore be reduced by appropriate criteria; third, contextually conditioned phonetic reductions, assimilations and epentheses must be resolved; lourth, explicit complex phonological structures need to be defined; fifth, there may be no simple relation between event endpoints and nodes in parse chart structures. To complete the mapping from phonetic events to phonological event structures, constraiuts must be formulated which fulfil these tasks. The third type will be briefly discussed in section 3; the rest of the present section is mainly concerned with the fourth type. For the phonological component in the present system, a distinction is made between simplex and complex events.</Paragraph>
      <Paragraph position="1"> A simplex phonological event is defined as the basic unit of input from the phonetic component; at the phonetic level these events are in general a function of several parameters and are therefore by no means 'simplex' at this level. A complex phonological event is constructed compositionally in terms of the precedence, overlap and inclusion relations at the phonological level. So for example the composition of the simplex events occlusion, transient and noise results in the complex event plosive. Complex events also refer to AclT.'.s DE COIANG-92. NANqlLS, 23-28 AO~':I&amp;quot; 1992 1 2 7 0 PROC. OF COL1NG-92, NANTES, AUG. 23-28, 1992 larger structures relevant at the phonological level such as syllable onset or reduced syllable. Using the co~mtraint axiom set, further relations between these complex events are inferred. In the speech recognition context, absolute speech signal constants are required to be assigned to the largest complex events in order to permit synchronisation at higher levels. The output of constraint application is thus a complex event lattice which is subsequently mapped to a linguistic parse chart (cf. Chien &amp; al. 1990).</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. An Example
</SectionTitle>
    <Paragraph position="0"> In this section, an example of input and output in the system for generating the relations between phonetic events in a token of the English word p_aLm__lm /pa:m/is discussed (cf. also Carson-Berndsen, 1991).</Paragraph>
    <Paragraph position="1"> The speech signal k~ shown in Figure 1; the phonemic annotations and display were produced with the SAMLAB speech signal labelling system (Braun, 1991a). The events used in this analysis are based on a feature set proposed by Fant (1973); although the features have labels which indicate articulatory features, they are in fact acoustically based. A diagrammatic representation of the detected events in an approximately 520 msec interval is shown in Figure 2.</Paragraph>
    <Paragraph position="2"> The temporally annotated events arc passed to the phonological component of the speech recognition system in the interface format given in (3), Before the alx)ve algorithm is applied, the tuples are uniquely identified and translated into a variety of attribute-value notation as shown in (4) (note that confidence values are not considered further here).</Paragraph>
    <Paragraph position="3">  (3) T.empor~l input from the phonetic leve_l  &lt;voiceless, 0, 91A9, C&gt; &lt;voiced, 91.2, 517.5, C&gt; &lt;glide, 452.6, 498.2, C&gt; &lt;occlusive, 0, 35.4, C&gt; &lt;transient, 34.5, 641.6, C&gt; &lt;noise, (:~).61, 91.16, C&gt; &lt;vowellike, 94.29, 392.6, C&gt; &lt;nasal, 402.9, 518.6, C&gt; &lt;bilabial, 20.45, 93.2, C&gt; &lt;tongue-retracted, 93.21, 392.6, C&gt; &lt;bilabial, 392.62, 518.2, C &gt;  eL~: LAB(bilabial, &lt; 392.62,518.2 &gt; ) Of particular interest to the phonological parser are the precedence relations between those event properties of the same type and the overlap and temlx~ral inclusion relations between event properties of differing types. Initially all relations between the individual events are generated automatically in (5). The temporal relations of overlap, precedence and inclusion are represented by the symbols 'deg', '&lt;' and '{' respectively.</Paragraph>
    <Paragraph position="4"> One of the motivations for having chosen an event-based phonology for coping with the interface between phonetics and phonology was to be able to cater for phenomena which do not correspond to the phonotactics of the language. It may be the case, as given in the example Wi#Jrungssystern in section 2, that the information on the centre portion of the signal, which is shown in (6) after the translation into attribute-value structure, is provided by the phonetic component.</Paragraph>
    <Paragraph position="5"> (6) TCmp0rol annotations for !szstl closter e: FRICATION(fricative, &lt; 0, 3(}1.3 &gt; ) e~: VOICE(voiced, &lt; 79.9, 229.3 &gt; ) e~: VOWELLIKE(vowellike, &lt; 128.5, 202.6 &gt;) e~: OCCLUSION(occlusive, &lt;301.31, 334.6&gt;) There is not a fall match between the output of the event mapping and any phonological representation, because FRICATION is continuous throughout and and thus overlaps VOWELLIKE rather than both preceding and following it. \]However, the pbonological constraints include information on possible phonotactic structures; these will not be discussed here in detail (but cf. Carson-Bcrndsea 1992). Positions in these strnctures ;ire underspecificd in terms of events, thus indirectly defining a priority between specified and non~ specified event types at those positions. In this case, at the relevant VOWELLIKE interval FRICATION overlap is not specified, and titus a phonotactic match is permitted; VOICE is also not specified for initial sibilants. Note that vowel quality does not need to be specified in detail in the phonotactics, if an actual lexical item is morc highly specified at these positions, it will match this part of the phonotactic structure, thus ultimately allowing the relevant portion of phonological representation of Wii.hrttngs.wstetn to be derived.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Final Remarks
</SectionTitle>
    <Paragraph position="0"> In this paper a new solution to the projection problem in speech recognition is proposed in the form of a three-stage procedure for the automatic construction of event relations and phonological event structures, starting with an event lattice of simplex events in the form of temporal annotations provided by the acoustic phonetic component of a speech recognition system. In contrast to the purely concatenative solutions to word compositionality which are conventionally used, the present flexible approach using the three compositional relations of overlap, precedence and temporal inclusion promise a principled and effective solution to the projection problem at the phonetics/phonology interface.</Paragraph>
  </Section>
</Paper>