<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0803">
  <Title>Speech Annotation by Multi-sensory Recording</Title>
  <Section position="3" start_page="0" end_page="25" type="metho">
    <SectionTitle>
2. Multi-sensory Recording
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the four signals that are simultaneously recorded. Next, we describe the physical set up for recording and the recording session.</Paragraph>
    <Section position="1" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
2.1 Sensors
</SectionTitle>
      <Paragraph position="0"> Four signals are received and recorded in a multi-sensor), recording session and these are the acoustic signal (Sp), laryngograph signal (Lx), plosive signal (Fx) and nasal signal (Nx). Lx provides information about vocal fold vibrations and enables the identification of voiced/unvoiced segment as well as occurrence of each epoch (i.e. vocal fold closure). The latter is important for pitch detection as well as subsequent signal processing that are pitch synchronous. Figure 2 shows the use of Lx to define the voiced segment and epoch positions.</Paragraph>
      <Paragraph position="1">  ~: &amp;quot;&amp;quot; slwaO4C .............</Paragraph>
      <Paragraph position="2"> l.'~;:::-'2+- ::-':~:::'-,;~ ~',~. ---- -- -..--- ,- -~ ........................... -- &amp;quot;i'~l'~l ..... .-.., . .1!1 ) i --.,: ,.:,.: :::.+- ........... : ............................... : _...::.._..:.. _)..L_..:.._.L::..__.~.L. _Z:.:.____:.: .__.~.__:. ;_.:~  syllable /pal The 4 channels from the top to bottom are: acoustic signal (Sp), laryngograph signal (Lx), turbulence signal (Fx) and nasal signal (AS:), respectively.</Paragraph>
      <Paragraph position="3"> The *v signal is picked up by a high-frequency sensitive miniature microphone placed 1 to 2 cm near the mouth. The signal is drastically attenuated so that only a sudden burst of air can provide sufficient excitation for recording. The burst of air is registered for aspiration or turbulence near the month (e.g. fricative and aspirated voice stop) which may be undetectable in the acoustic signal (Sp). Figure 3 shows the aspirated voice stop/p/that is not registered in Sp. We anticipate that for continuous speech this type of events occur often.</Paragraph>
      <Paragraph position="4"> A piezeo-ceramic transducer is placed near the nose bridge to detect nasal resonance. The signal from this transducer is Nx and it is useful to  determine when nasalization occurs. This would be useful for detecting nasal consonants because it is simply an absorption of the vocalic energy, represented as a spectral zero. Figure 4 shows the recording of nasal resonance for the word /ma/.</Paragraph>
      <Paragraph position="5"> ..... !' !!':iy i l;i &amp;quot;~i\] i'~;:7' ~f&amp;quot;(~ !~'.!~i':fi~(~i&amp;quot;&amp;quot;~i ~i?.~t:~ ........ L ...... ' ';iii :::: &amp;quot;'  mixer, sensor, laryngograph and microphone power supply are placed inside the ancoehic chamber where as the recorder and PC are placed outside because of noise from cooling fans. The mixer provides amplification for the Nx signal and attenuation for the Fx signal.</Paragraph>
      <Paragraph position="6"> Likewise, the microphone power supply and the laryngograph provide amplification of the Sp and Lx signals, respectively. Four channel tape recordings are carried out first because they can serve as back up. Afterwards, the recorded data are transferred to the PC by the computer interface under computer control via the RS232 link. It is possible to mark the beginning and ending of each utterance using the DAT recorder.</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.3 Recording Session
</SectionTitle>
      <Paragraph position="0"> We have carried out recording isolated Cantonese \[2\] speech sounds as well as read speech of phrases and sentences. For isolated syllables, subjects are asked to pronounce all combinations of Cantonese initials, finals and tones which amounts to several thousand syllables. To save time and manual effort, the subject reads aloud a page of syllables (about 50) which are recorded on to the DAT tape before transfer to the PC. To maintain some consistency, subjects are asked to read aloud a carrier sentence by heart and pronounce only the target syllable.</Paragraph>
      <Paragraph position="1"> For continuous read speech, subjects are given a list of sentences or phrases to read aloud. These sentences are selected from a corpus, that maximizes the coverage of Cantonese diphones based on a greed),' algorithm \[3\]. The 104 sentences covered 348 Cantonese diphones. The corpus is a collection of news articles from the PH corpus \[4\].</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="25" end_page="27" type="metho">
    <SectionTitle>
3. Isolated Syllable Marking
</SectionTitle>
    <Paragraph position="0"> Each file contains a set of syllables read aloud in a recording session. The 4 channels are sampled at 16kHz and quantized to 16 bits. The first step is to isolate the syllables from silence and label these syllables with the corresponding phonetic spelling augmented with a tone. Next, the four channel data is compressed into a marked speech data to save storage by a multiplicative factor of 4. The marked speech data uses the least significant three bits to encode where an epoch, some turbulence at the month, some nasalization or silence have occurred, according to the scheme shown in Table 1. Silence is also encoded because the recording will be carried out for an utterance instead of isolated syllables for later work.</Paragraph>
    <Paragraph position="1">  different marks of speech data. Nasal and turbulence are assumed not to simultaneously occur.</Paragraph>
    <Paragraph position="2"> The least significant three bits instead of the most significant three bits are chosen for encoding because of compatibility reasons. The three bits can be considered as an additive noise component of magnitude at most 3 bits (i.e. 8). Usually, speech signals are much larger than 8 so that the noise due to the least 3 bits are almost negligible based on this encoding. We have found no noticeable degradation in the marked speech signal, which can be fed to other software like MATLAB as binary data.</Paragraph>
    <Paragraph position="3">  energy &gt; thres and energy &lt; thres and duration &lt; glitch startdUratidegn i~ &amp;quot;~.~..~..~.~..x.// e n e r g y * -( 0 ) thres energy &lt;~ thres / m duration = glitch  state machine for speech segmentation.</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Speech Segmentation
</SectionTitle>
      <Paragraph position="0"> The 4-channel recording is segmented based on the running energy of the speech signal Sp. A fnite-state machine (FSM) keeps track of the segmentation decision (Figure 5). At state 0, the FSM considers the speech as silence. When the running energy is beyond a threshold T, the FSM makes a transition to state 1. The FSM remains in state 1 provided that the running energy remains beyond the threshold. Otherwise, it will make a transition back to state 0. If tile FSM  remains in state 1 for a sufficiently long time that the speech signal cannot be a glitch, the FSM makes a transition to state 2. It will remain in state 2 if the running energy is beyond T divided by m. The multiplicative reduction m accounts for the steady reduction of speech energy near the end. Otherwise, the FSM makes a transition back to state 0.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.2 Phonetic Spelling Labeling
</SectionTitle>
      <Paragraph position="0"> Each segmented speech data corresponds to a syllable and the data has to be labeled with the corresponding phonetic spelling. Due to noise, sometimes glitches are mis-recognized as speech data and there are usually more segmented speech files than the amount of labels. We used a simple strategy to sort the data by size and delete the extra small files before labeling is carried out.</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.3 Epoch Detection
</SectionTitle>
      <Paragraph position="0"> The detection of epoch is based on the Lx signal.</Paragraph>
      <Paragraph position="1"> The epoch is roughly located when the Lx signal is at the maximum near the largest change in the Lx signal. A simple detection strategy is to determine the first order backward difference:</Paragraph>
      <Paragraph position="3"> The detection selects those with a positive slope (i.e. DLx\[i\] &gt; 0). A threshold T,, is set according to the following rule:</Paragraph>
      <Paragraph position="5"> in order to decide those slopes which are definitely too small to consider for the identification of the epoch. Another threshold Tk is determined by the k-means algorithm which decides which of the remaining slopes are large enough and which are too small. Any remaining slopes, which are larger than T~ and which occurred consecutively, are deleted except at the last position. The remaining slopes positions are then the epoch positions (Figure 6).</Paragraph>
      <Paragraph position="6">  shown at the top. The result is shown at the bottom where each spike represents the largest positive slop found.</Paragraph>
      <Paragraph position="8"> amplitude of the positive backward difference of the Lx signal. The threshold was found to be 1530 by the k-means algorithm which is reasonable.</Paragraph>
      <Paragraph position="9"> The k-means algorithm for determining Tk assumes there are two clusters: cl for slopes that are significantly large and c2 for those slopes which are significantly small. Initially, the algorithm selects the two extreme slope values (i.e. maximum and minimum) as the centriod of the two respective clusters. A slope x is randomly selected and decided which cluster it belongs based on the following rule: if d(cl,x) &gt; d(c2,:U then</Paragraph>
      <Paragraph position="11"> where dO is the distance between the centroid of a cluster and the slope x. After each assignment, the centroid of the changed cluster is updated.</Paragraph>
      <Paragraph position="12"> Assignment of slopes to the two clusters; is repeatedly carried out until no more slope values to assign (Figure 7).</Paragraph>
    </Section>
    <Section position="4" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
3.4 Plosive/Fricative Detection
</SectionTitle>
      <Paragraph position="0"> Certain plosives (e.g./p/) and fricatives (e.g./f/) produces turbulence near the month (Figure 8).</Paragraph>
      <Paragraph position="1"> This sudden burst of air is registered in the Fx signal as a sudden rise in magnitude. We follow Chan and Fourcin \[1\] to find the envelop of the Fx signal by first high-pass filtering (with sigma smoothing) the signal at l kHz and smooth it by a</Paragraph>
    </Section>
    <Section position="5" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
3.5 Nasalization
</SectionTitle>
      <Paragraph position="0"> The amount of nasality is computed based on both the Nx signal and the Lx signal as in \[1\].</Paragraph>
      <Paragraph position="1"> Nasality is considered as the energy absorbed in the nasal cavity, reflected by the amount of nasal resonance picked up by the peizeo-ceramic transducer. The absolute value of A5. would indicate the amount of energy in the vibration but this has to be summed over one pitch period to indicate the amount of absorption for the pulse of air released in one vocal cord openclose cycle. Thus, we compute A~ as the sum of the absolute value of the Nx signal in one pitch period between two consecutive epoches.</Paragraph>
      <Paragraph position="2"> The presence of nasality (Figure 8) is determined by a threshold Tx where an)' N,, value larger than Tx implies there exists some significant nasalization. To decide a better threshold between significant nasalization and insignificant nasalization, a different threshold T,, is used, which is determined by the k-means algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="27" end_page="30" type="metho">
    <SectionTitle>
4. Continuous Speech Annotation
</SectionTitle>
    <Paragraph position="0"> Annotation for a speaker-independent continuous speech is not an easy task without training. Our main idea is to find a reliable coarse match between the available phonetic spelling of the speech and perform additional processing to locate fine details.</Paragraph>
    <Paragraph position="1"> A reliable cue is voicing which is available from the Zx signal because it is decoupled from the acoustic environment, making voice identification under extreme noisy environment possible. Also, since the Zx signal represents the source signal without convolving with the vocal track, it is relatively easy and reliable to detect the occurrence of pitch marks and therefore voicing. For matching phonetic spelling with speech sound, usually a syllable corresponds to a voice segment because each syllable must have a peak. Thus, the voice segment can be used the basic unit for finding the annotation of the speech.</Paragraph>
    <Section position="1" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Voiced segment identification
</SectionTitle>
      <Paragraph position="0"> To detect voicing, tile Lx signal is differenced and thresholded by the k-means algorithm, as in marking speech data (Section 3.3). In addition, the voice segment must have some continuity in the vocal cord vibration which restricts the duration of the voice segment to have at least 2 cycles. Taking the range of pitch to be between 2 and 20ms \[5\], the duration of voiced segment must be at least 40ms long. Figure 9 shows an example of fnding the voiced segments when reading aloud a sentence. The accuracy of the Lx signal is usually within 1 Lx cycle.</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.2 Sentence/Phrase Boundary Detection
</SectionTitle>
      <Paragraph position="0"> Sentence and phrase boundary can be manually marked by the DAT recorder or from the annotation software. The later is particularly tiresome because the amount of speech data is large, b'pically around 100Mbytes. Therefore, the visualization software takes time to scan and display' the data.</Paragraph>
      <Paragraph position="1"> The alternative explored in here is to automatically identify these sentence/phrase boundaries by measuring the duration between two voiced segments. If the duration is more than 900ms, a sentence/phrase boundary is found. However, the subject has to be aware of this arrangement for sentence/phrase separations.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.3 Unvoiced Context Computation
</SectionTitle>
      <Paragraph position="0"> For computational efficiency, unvoiced context computation only identifies the existence or absence of air burst, noise above 4kHz and nasal resonance. The existence and absence of these events are used for coarse matching between identified voice segment from speech and from phonetic spelling (Figure 10).</Paragraph>
      <Paragraph position="1"> Figure 10: The detection of unvoiced contexts of an utterance of 30 syllables. Key: LB for the existence of an air burst in the left context of a voiced segment and LF for the existence of fricative noise in the left context. Since there are no right context air burst of fricative noise, they were not identified (as RB and RF respectively).</Paragraph>
      <Paragraph position="2"> 4. 3.1 Air Bw'st Detection Air burst is detected in the Fx signal. For each voice segment, the left and right contexts for air burst detection are between 10ms and 40ms away from the voiced segment. Within these two portions of the speech data, we obtain the maximum absolute differenced Fx signal. If this maximum is larger than a threshold (set at 800), then air burst is detected.</Paragraph>
      <Paragraph position="3"> 4. 3.2 Fricative-like Noise For fricative-like consonants, the turbulence is registered as noise above 4kHz in tile Sp signal. Since these noise can extend quite far from the voiced segment, Sp signal between 10ms and 800 ms away from the voiced segment is examined. For each context, Sp signal is high-pass filtered at a cutoff of 4kHz. The filtered signal is differenced and the largest magnitude is compared with a threshold. If the signal is larger than the threshold, than fricative noise is present.</Paragraph>
    </Section>
    <Section position="4" start_page="28" end_page="30" type="sub_section">
      <SectionTitle>
4.4 Coarse Matching
</SectionTitle>
      <Paragraph position="0"> The aim of coarse matching is to associate the voiced segments of the phonetic spelling and  those identified in the speech signal. The voiced segment identified in the speech signal may represent one or more voiced segment of the phonetic spelling because of co-articulation. For example, the greeting sentence can have the following phonetic spelling/li ho ma/. The three voiced segments of this phonetic spelling are/li/, /ho/ and /ma/. However, in continuous speech, the voiced segment identified may be co-articulated to gather giving rise to only 2 voiced segments:/li homa/since nasal/m/and vowels /o/ and /a/ are voiced. Here, voicing has the special meaning that the vocal fold vibrates.</Paragraph>
      <Paragraph position="1"> Therefore, some voiced consonants like fricatives are not considered as voiced because the production does not involve vocal fold vibrations.</Paragraph>
      <Paragraph position="2"> The voiced segments in the phonetic spelling and found in speech are temporally ordered so that these segments can be considered as strings where each character is a voiced segment.</Paragraph>
      <Paragraph position="3"> Coarse matching can be considered as a string matching problem but due to co-articulation approximate string matching that caters for merging voiced segment in matching is needed (Figure 11).</Paragraph>
      <Paragraph position="4"> ;ii~ T i T i'llll I, hll ,,m J,_ l.t,.,,t,~. ,,~.,~ .... ~ IOll~ ~dtl~!  Let s be the sequence of voiced segments identified in the Sp signal. Likewise, let p be the sequence of voiced segments in the phonetic spelling of Sp. Let s\[i\] denote the /,h voiced segment and likewise forp\[i\].</Paragraph>
      <Paragraph position="5"> The distance D(s,p) between s and p is the minimal number of edit operations that transform s to p and vice versa. The minimal distance and the sequence of operations can be found by dynamic programming, using the following rule:</Paragraph>
      <Paragraph position="7"> where d\[i,j\] is the minimal edit distance from (0,0) to position (i,j), representing the matching of voiced segments s\[O,i\] in Sp with those p\[O.j\] in the phonetic spelling.</Paragraph>
      <Paragraph position="8">  Unlike approximate string matching, the edit distance of insertion, deletion and substitutions are determined differently. For insertion, we consider the two voiced speech segments at i and i -l are associated with a single voiced segment of phonetic spelling. Effectively, there is an error in voice segmentation where one of the segment (i or i + l ) is a spurious detection.</Paragraph>
      <Paragraph position="9"> For deletion, the voiced segments of the phonetic spelling is associated with one voiced speech segment. Effectively, this edit operation is accounting for the co-articulation of two voiced segments as in/li homa/.</Paragraph>
      <Paragraph position="10"> Such co-articulation does not occur freely. For example, if there are plosives or fricatives in the unvoiced context between the two voiced segments (e.g. co-articulation in /li/ and /ho/), then it is very unlikely that the voiced segments are co-articulated together. In addition, if the voiced speech segment is very short, then it is also unlikely that the two voiced segments of the phonetic spelling are read with the single voiced speech segment. Thus, both unvoiced context constraints and voiced segment duration are weighting factors of the deletion operation.</Paragraph>
      <Paragraph position="11">  For substitution at position (i, j) , we consider whether the voiced speech segment is the same as the voiced segment of the phonetic spelling. If they have the same unvoiced context, the substitution cost would be low. Otherwise, the cost would be based on the number of mismatches in the unvoiced context.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="30" end_page="31" type="metho">
    <SectionTitle>
5. Software Design and Implementation
</SectionTitle>
    <Paragraph position="0"> Inevitably, it is necessary to check and correct automatic annotation. Software tool developed for this purposes needs to visualize a large volume of data that runs into hundreds of Mbytes. This is particularly the case for multisensor3' recording.</Paragraph>
    <Paragraph position="1">  of the speech data in Figure 12. Pitch mark identification was carried out as shown in the second channel.</Paragraph>
    <Paragraph position="2"> For visualization, our software decimates tile given speech data since the resolution of the screen is only 1024 (Figure 12). This provides a bird's eye view of the data. For speech data details, the user can zoom (Figure 13) into a region within two markers defined by clicking the mouse at the appropriate screen location. Within the magnified scale, the user can move (Figure 14) the speech data to the left or right of the current magnified region of data. The user can also save the marked region directly into a file. The name of the file can be automatically generated or found from a list of labels in a file.  manual labeling speech data. This dialog box is invoked when tile user double clicks between two markers or within a voiced segment.</Paragraph>
    <Paragraph position="3"> Signal processing for visualization is carried out with the data stored in tile buffer and it is not directly operating on the speech data in the file. The purpose is to visualize tile effect of setting parameters of certain signal processing function. Once the desired parameter values are found, signal processing is carried out for tile speech data in the file. Since tile buffer data is a decimated version of the data in tile file, the signal processing parameters have to be scaled by the amount of decimation. For example, a 16 kHz signal may be decimated 4 times and the cutoff frequency of high-pass filtering at 4kHz has to reduced to l kHz.</Paragraph>
    <Paragraph position="4"> The software also enable us to visualize tile marked speech data for verification and modification. Non-silence is shown as a ribbon on the top of the view window. Since nasal and air-burst do not occur simultaneously, they are shown as different color ribbons at tile horizontal level in the view window. The pitch  epoches are displayed as vertical lines. Due to decimation, most pitch epoches are not displayed (Figure 15). They will appear again when a segment is magnified (Figure 16).</Paragraph>
    <Paragraph position="5">  speech data showing the location of the identified pitch epoches which were absent in Figure 15.</Paragraph>
  </Section>
  <Section position="7" start_page="31" end_page="31" type="metho">
    <SectionTitle>
6. Discussion
</SectionTitle>
    <Paragraph position="0"> We have described how 4-channels of speech data are recorded and transferred to the computer. We demonstrated that marked speech data provide important information for both annotation and speech analysis. Our marking scheme is space efficient and it is compatible with other speech processing software without regard to marking. We have also described how to post-process the 4-channels of data to obtain the marking information. Although the marking process can be completely automatic, human checking is still necessary for full correctness. Many decisions are based on setting an appropriate threshold, which can be deternained by the k-means algorithm.</Paragraph>
  </Section>
class="xml-element"></Paper>