File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/w02-1813_abstr.xml
Size: 4,309 bytes
Last Modified: 2025-10-06 13:42:41
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1813"> <Title>Using the Segmentation Corpus to define an inventory of concatenative units for Cantonese speech synthesis</Title> <Section position="1" start_page="0" end_page="2" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The problem of word segmentation affects all aspects of Chinese language processing, including the development of text-to-speech synthesis systems. In synthesizing a Hong Kong Cantonese text, for example, words must be identified in order to model fusion of coda [p] with initial [h], and other similar effects that differentiate word-internal syllable boundaries from syllable edges that begin or end words. Accurate segmentation is also necessary for developing any list of words large enough to identify the word-internal cross-syllable sequences that must be recorded to model such effects using concatenated synthesis units. This paper describes our use of the Segmentation Corpus to constrain such units.</Paragraph> <Paragraph position="1"> Introduction What are the best units to use in building a fixed inventory of concatenative units for an unlimited vocabulary text-to-speech (TTS) synthesis system for a language? Given a particular choice of unit type, how large is the inventory of such units for the language, and what is the best way to design materials to cover all or most of these units in one recording session? Are there effects such as prosodically conditioned allophony that cannot be modeled well by the basic unit type? These are questions that can only be answered language by language. In answering them for Cantonese, one major challenge involves the definition of the &quot;word&quot;. As in other varieties of Chinese, morphemes in Cantonese are typically monosyllabic and syllable structure is extremely simple, which might suggest the demi-syllable or even the syllable (Chu &amp; Ching, 1997) as an obvious basic unit. At the same time, however, there are segmental &quot;sandhi&quot; effects that conjoin syllables within a word.</Paragraph> <Paragraph position="3"> For example, when the morpheme 集 zaap6 stands as a word alone (meaning 'to collect'), the [p] is a glottalized and unreleased coda stop, but when the morpheme occurs in the longer word 集合 zaap6hap6 ('to assemble'), the coda [p] often resyllabifies and fuses with the following [h] to make an initial aspirated stop. Accurate word segmentation at the text analysis level is essential for identifying the domain of such sandhi effects in any full-fledged TTS system, whatever method is used for generating the waveform from the specified pronunciation of the word. A further challenge is to find a way to capture such sandhi effects in systems that use concatenative methods for waveform generation.</Paragraph>
<Paragraph position="4"> This paper reports on research aimed at defining an inventory of concatenative units for Cantonese using the Segmentation Corpus, a lexicon of 33k words extracted from a large corpus of Cantonese newspaper texts. The corpus is described further in Section 2 after an excursus (in Section 1) on the problems posed by the Cantonese writing system.</Paragraph> <Paragraph position="5"> We use &quot;Cantonese&quot; to mean the newer Hong Kong standard, and not the older Canton City one. We use the Jyutping romanization developed by the Linguistic Society of Hong Kong in 1993. See http://www.cpct92.cityu.edu.hk/lshk.</Paragraph> <Paragraph position="6"> 
Section 3 outlines facts about Cantonese phonology relevant to choosing the concatenative unit, and Section 4 calculates the number of units that would be necessary to cover all theoretically possible syllables and sequences of syllables.</Paragraph> <Paragraph position="7"> The calculation is done for three models: (1) syllables, as in Chu &amp; Ching (1997), (2) Law &amp; Lee's (2000) mixed model of onsets, rhymes, and cross-syllabic rhyme-onset units, and (3) a positionally sensitive diphone model. Section 4 closes by reporting how the number of units in the last model is reduced by exploiting the sporadic and systematic phonotactic gaps discovered by looking for words exemplifying each possible unit in the Segmentation Corpus.</Paragraph> </Section></Paper>