File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/91/j91-3001_abstr.xml
Size: 14,042 bytes
Last Modified: 2025-10-06 13:47:16
<?xml version="1.0" standalone="yes"?> <Paper uid="J91-3001"> <Title>Dictionary \] ,L Elimination & Identification Rules Tagnm Analysis C Let~er-to-soun Language</Title> <Section position="2" start_page="0" end_page="260" type="abstr"> <SectionTitle> 1. Background </SectionTitle> <Paragraph position="0"> There has been a great deal of interest recently in the generation of accurate phonetic equivalences for proper names. New and enhanced services in the telecommunications industry, together with the growing interest in speech I/O for the workstation, have renewed interest in applications such as the automation of name pronunciation by speech synthesizer in reverse directory assistance (number to name) applications (Karhan et al. 1986). In addition, speech recognition research can benefit from automatic lexicon construction, ultimately for use in such applications as directory assistance (name to number) and a variety of workstation applications (Cole et al. 1989).</Paragraph> <Paragraph position="1"> * 30 Forbes Rd. (NRO5/I4), Northboro, MA 01532 USA. © 1991 Association for Computational Linguistics. Computational Linguistics Volume 17, Number 3. The inaccuracy of name pronunciation by parametric speech synthesizer has been a problem often addressed in the literature (Church 1986; Golding and Rosenbloom 1991; Liu and Haas 1988; Macchi and Spiegel 1990; Spiegel 1985, 1990; Spiegel and Macchi 1990; Vitale 1987, 1989a, 1989b, and others). 
The difficulty stemmed from the fact that high-quality speech synthesizers were so optimized for a particular language (e.g., American English) that a non-English form such as an unassimilated or partially assimilated loanword would be processed according to English letter-to-sound rules only. 1 Since non-Anglo-Saxon personal names fall into the category of loanwords, the pronunciation of these forms ranged from slightly inaccurate to grossly unintelligible.</Paragraph> <Section position="1" start_page="0" end_page="258" type="sub_section"> <SectionTitle> 1.1 General Letter-to-Sound Rules </SectionTitle> <Paragraph position="0"> Letter-to-sound rules are a requirement in any text-to-speech architecture and take slightly different forms from system to system; however, they typically follow a standard linguistic format such as x → y / z, where x is some grapheme sequence, y some phoneme sequence, and z the environment, usually graphemic. The following is a typical example of a set of letter-to-sound rules:</Paragraph> <Paragraph position="2"> This set would handle all such forms as CELLAR, CILIA, CY, CAT, COD, etc., but clearly not loanwords such as CELLO, for exactly the same reasons that make the pronunciation of last names so difficult for a synthesizer having only English letter-to-sound rules. A number of letter-to-sound rule sets are in the public domain (e.g., Hunnicutt 1976; Divay 1984, 1990). However, many rule sets that are currently in use in commercial speech synthesizers remain confidential. Venezky (1970) contains an extensive discussion of issues involving phoneme-grapheme correspondence.</Paragraph> <Paragraph position="3"> The accuracy of pronunciation of normal text in high-quality speech synthesizers using exclusively or primarily letter-to-sound processing can now range as high as 95+%. 2 In tests we ran, however, this accuracy (without dictionary lookup) was degraded by as much as 30% or more when the corpus changed to high-frequency proper names. 
The degradation was even higher when the names were chosen at random and could be from any language group. Spiegel (1985) cites the average error rate for the pronunciation of names over four synthesizers as 28.7%, which was consistent with our results.</Paragraph> <Paragraph position="4"> This degradation arises because the phonological intelligence of a speech synthesizer for a given language cannot discriminate among loanwords that are not contained in its memory (i.e., dictionary). In the case of names, these are really loanwords ranging from the commonly found Indo-European languages such as French, Italian, Polish, Spanish, German, Irish, etc. to the more &quot;exotic&quot; ones such as Japanese, Armenian, Chinese, Latvian, Arabic, Hungarian, and Vietnamese. Clearly, the pronunciation of these names from the many ethnic groups does not conform to the phonological pattern of English. For example, as pronounced by the average English speaker, most German names have syllable-initial stress, Japanese and Spanish names tend to have penultimate stress, and some French names have word-final stress.</Paragraph> <Paragraph position="5"> 1 That is, phonemic rules. Obviously, the phonetics output by a synthesizer would not be sufficient for multiple languages. 2 In an informal study, Klatt (personal communication) tested our rule set for English by replicating a study by Bill Huggins (Bolt, Beranek and Newman) using letter-to-sound rules without a dictionary over 1678 complex polysyllabic forms. The algorithm tested (and the one used in this study) had an error rate of 5.1%. 
The error rate using a dictionary would be much lower.</Paragraph> <Paragraph position="6"> Vitale Algorithm for High Accuracy Name Pronunciation Chinese names tend to be monosyllabic and consequently stress is a non-issue; in Italian names, stress may be penultimate or antepenultimate, as is the case with Slavic languages and certain other groups.</Paragraph> <Paragraph position="7"> But while stress patterns are relatively few in number, the letter-to-sound correspondences are extremely varied. For example, the orthographic sequence CH is pronounced [č] in English names (e.g., CHILDERS), [š] in French names (e.g., CHARPENTIER), and [k] in Italian names (e.g., BRONCHETTI) or in the anglicized version of some German names (e.g., BACH). This means that letter-to-sound must account for a potentially large number of diverse languages in order to output the correct phonetics.</Paragraph> <Paragraph position="8"> Most researchers understand that in order to process a name accurately, at least two parameters must be known: (1) that the string is a name and thus needs to be processed by a special algorithm; and (2) which particular set of languages or language groups the string is identified with, so that the specifics of the pronunciation (i.e., the letter-to-sound rules) can be formally described (Church 1986; Liu and Haas 1988; and others). While there has been some interest in attempting to identify a word as a name from random text, the present work assumes a database in which name fields are indexed as such (e.g., a machine-readable telephone directory), and no further mention of this will be made. This paper simply describes an implementation of this two-stage process, and details the first stage: the correct identification of a name as belonging to a certain language group. 
It should be stressed that there have been other attempts to implement similar algorithms, although few descriptions of such implementations are available.</Paragraph> </Section> <Section position="2" start_page="258" end_page="258" type="sub_section"> <SectionTitle> 1.2 Language Groups </SectionTitle> <Paragraph position="0"> For purposes of identification, sets of similar languages are more efficiently grouped together. However, the language groups used in this study may not always correspond to the set of language families familiar to most linguists. For example, while Japanese or Greek may be in groups by themselves, languages such as Spanish and Portuguese may be grouped together into a So. Romance group, and this set may be different from, say, Italian, which may be grouped with Rumanian, or French, which may be grouped by itself. This is done to reduce the complexity of letter-to-sound (Section 4.1). However, the software is set up such that groupings can be moved around to accommodate different letter-to-sound rule sets. In addition, the number of groups is a variable parameter and could be modified, as would be required by the inclusion of any new rule sets in the letter-to-sound subsystem. Thus, for n language groups, the probability P of some language group Li being the correct etymology is P(Li) = 1/n.</Paragraph> </Section> <Section position="3" start_page="258" end_page="260" type="sub_section"> <SectionTitle> 1.3 Etymology </SectionTitle> <Paragraph position="0"> Identification of a particular language group in the United States and many countries of Western Europe is not an easy task. According to the United States Social Security files (Smith 1969), there are approximately 1.5 million different last names in the United States, with about one-third of these being unique in that they occur only once in the register. 
3 Furthermore, the etymologies of the names span the entire range of the world's languages, although the spread of these groupings is obviously related to geopolitical units and historical patterns of immigration and is different in the United States than it is, say, in Iceland, Ireland, or Italy.</Paragraph> <Paragraph position="1"> 2. Role of the Dictionary The first step in the process was the construction of a dictionary that contained both common and unusual names in their orthographic representation and phonetic equivalent. All sophisticated speech synthesizers today use a lexical database for dictionary lookup to process words that are, for one reason or another, exceptions to the rule. In generic synthesizers, these are typically functors that undergo vowel or stress reduction, partially assimilated or unassimilated loanwords that cannot be processed by language-specific letter-to-sound rules, abbreviations that are both generic and domain-specific, homographs that need to be distinguished phonetically, and selected proper nouns, such as geographical place names or company names.</Paragraph> <Paragraph position="2"> In the case of proper surnames, however, dictionary lookups, while necessary, are of limited use. There are a number of reasons for this. First, while the most common names would have an extremely high hit rate (much like functors in a generic system), the curve quickly becomes asymptotic. Church (1986) has shown that while the most common 2,000 names can account for 46% of the Kansas City telephone book, it would take 40,000 entries to obtain a 93% accuracy rate. Furthermore, accuracy would decrease if one considers that geographic area has a profound influence on name grouping, and thus the figures for a large East or West Coast metropolitan area would certainly be significantly lower. It can be easily shown that the functional load of each name changes with the geographical location. 
4 The name SCHMIDT, for example, is not in the list of the most frequent 2,000 names, yet it appears in the Social Security files as the most common name in Milwaukee (Spiegel 1985). Liu and Haas (1988) conducted a similar experiment that included 75 million households in the U.S. The first few thousand names account for 60% of the database, but the curve flattens out after 50,000 names, and it would take 175,000 names in a dictionary to cover 88.7% of the population. This would mean that even with an extremely large dictionary (each entry of which would have to be phoneticized), there would still be an error rate of over 11%.</Paragraph> <Paragraph position="3"> Even with these limitations, dictionary lookups are still quite important. Frequently occurring names, like functors, have a high functional load (above). Spiegel (1985) claims that if the most common 5,000 names are used in a dictionary for a population of 10 million people, even if letter-to-sound had an accuracy of only 75% (which is extremely low for a high-quality speech synthesizer), the error rate would be below 2.5%. Most other researchers have also assumed a dictionary lookup as part of any procedure to increase the accuracy of name pronunciation. Therefore the general flow of text from the grapheme space to the phonetic realization must proceed first through a dictionary. Common last names such as SMITH, JOHNSON, WILLIAMS, BROWN, JONES, MILLER, DAVIS, WILSON, ANDERSON, TAYLOR, etc. and common names (both first and last names) from a variety of other languages should be included. The size of this dictionary is up to its creator. The dictionary used in this software contained about 4,000 lexical entries that were proper names. 5 There is, however, no reason to exclude 4 Functional load here is used in a slightly different sense than in linguistics. 
The functional load of a grapheme is its frequency of occurrence, in relation to other graphemes in the language, weighted equally, as measured over a sizable corpus of orthographic data. 5 In practice, the name dictionary could be contained within a larger dictionary that would be part of a generic text-to-speech system. Moreover, the dictionary should be easily modifiable by an applications writer. Functions such as add, remove, find, modify, and the like can be used to maximize the effect of the dictionary, especially if some preliminary analysis has been done on population statistics. Experience has also shown that a programmer should be able to easily merge new word or name lists with a base dictionary and quickly examine a variety of statistics including the size in entries, bytes, or blocks as [...] very large dictionaries (e.g., > 50,000 words), although the choice of a search algorithm then becomes more important in real-time implementations.</Paragraph> <Paragraph position="4"> When a dictionary lookup is used and a match occurs, the result is simply a translation from graphemes to phonemes, and the phoneme string (along with many other acoustic parameters picked up along the way) is output to the synthesizer. 6 When there is no match (i.e., in most cases), however, some algorithm is needed to increase pronunciation accuracy.</Paragraph> </Section> </Section> </Paper>