<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1008">
  <Title>What makes a word: Learning base units in Japanese for speech recognition</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The problem with Japanese
</SectionTitle>
    <Paragraph position="0"> The Japanese language is written without spaces in between words. This means that before one can even start designing a recognition or translation system for Japanese the units that will be recognized, or translated, must be defined. Many sequences of phonemes, particularly those representing nouns, are clearly independent and can be designated as free-standing units. Japanese has a rich and fusional inflectional system, though, and delimiting where a verb ending ends and another begins, for example, is seldom straightforward.</Paragraph>
    <Paragraph position="1"> Japanese has typically been segmented in variations on four ways for the purposes of recognition and parsing, although since many papers on Japanese recognition do not specify what units they are using, or how they arrived at the definition of a &amp;quot;word&amp;quot; in Japanese, it is hard to compare systems. * Phrase/Bunsetsu level: (Early ASURA (Morimoto et al. , 1993), QJP (Kameda, 1996)) -advantages: long enough for accurate recognition, captures common patterns  - disadvantages: requires dictionary entry for each possible phrase; vocabulary explosion null * &amp;quot;Word&amp;quot; level: (JANUS (Schultz and Koll, Mayfield Tomokiyo ~ Ries 60 Learning base units in Japanese Laura Mayfield Tomokiyo and Klaus Ries (1997) What makes a word: Learning base units in Japanese for speech recognition. In T.M. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL pp 60-69. (~) 1997 Association for Computational Linguistics 1997)) - advantages: units long enough not to cause confusion, but short enough to capture generalizations null - disadvantages: not natural for Japanese; easy to be inconsistent; may hide qualities of Japanese that could help in recognition * Morpheme level: (Verbmobil (Yoshimoto and Nanz, 1996)) - advantages: mid-length units that are natural to Japanese - disadvantages: a lot of room for incon null sistency; &amp;quot;morpheme&amp;quot; can be interpreted broadly and if segmented in the strictest sense units can be single phonemes * Phoneme cluster level: (NEC demi-syllable (Shinoda and Watanabe, 1996)), JANUS KSST 1 - advantages: only need a short dictionary - disadvantages: high confusability, although confusability seems less of a problem for Japanese than some other languages null The bunsetsu is a unit used to segment Japanese which generally consists of a content component on the left side and a function component on the right side. Bunsetsu boundaries seem to be natural points for pausing and repetition, and most elementary schools include bunsetsu segmentation as a formalized part of grammar education. John-ga (&amp;quot;John-NOM&amp;quot;), hon-o (&amp;quot;book-ACC&amp;quot;), and yonda (&amp;quot;gave&amp;quot;) are all examples of bunsetsu.</Paragraph>
    <Paragraph position="2"> Bunsetsu can be quite long in terms of both phonemes and morphemes, however, and quite unique. For example, saseteitadakitaindesuga would be considered a single bunsetsu. This phrase contains a causative form of the verb &amp;quot;to do&amp;quot;, sase-, a gerunditive suffix -re-, the root of a formal verb meaning to receive -itadaki-, a desidirative suffix tai-, a complementizer -n-, a copula -desu-, and a softener -ga.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Our approach
</SectionTitle>
    <Paragraph position="0"> Our approach, described in detail in (Ries et al., 1996), uses a statistical tool that automatically finds important sequences. This tool was originally developed to help mitigate the bias introduced by a  word-based orthography by explicitly modeling important multi-word units. The target of the tool was languages for which the word seemed already a useful level of abstraction from which to expand, and experiments were first performed on English and German for the scheduling task. One important motivation for this work was the desire to capture lexicalized expressions that exhibit, in natural speech, markedly different pronunciation from what concatenating the constituent words would predict. Examples of such expressions are don't-know (dunno), i-would-have (ida), you-all (yaw).</Paragraph>
    <Paragraph position="1"> The objective of the phrase-finding procedure is to find a pair of frequently co-occuring basic units for which joining all occurrences in the corpus is a useful operation. Until very recently most implementations of this idea have made use of measures of co-occurrence that have been useful in other domains, and the pair is chosen by maximizing that criterion.</Paragraph>
    <Paragraph position="2"> In contrast we assume that we want to model the corpus with a statistical language model and search for those sequences that increase the modeling power of the model by the largest amount. Our measurements are based on information theoretic principles and the usage of m-gram models of language, a common practice in the speech community. The model described here will therefore implicitly consider the words surrounding the phrase candidates and use information about the context to determine the goodness of a sequence, which is in contrast to traditional measures.</Paragraph>
    <Paragraph position="3"> (Ries et al., 1996) has compared a variety of measure as reported in the literature and has found these to be not competitive with the new technique if used in statistical language models. In a very vague statement we want to add that this corresponds to the experience in eyeballing these sequences. The measures that were compared against in this earlier work</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Statistical language modeling and speech recognition
</SectionTitle>
      <Paragraph position="0"> speech recogmtion Statistical models of language are, to our knowledge, the type of language model used in all modern speech Mayfield Tomokiyo 8J Ries 61 Learning base units in Japanese recognition engines, especially in research systems but also in most commercial large vocabulary systems that can recognize naturally fluent spoken language. In principle the speech recognition problem is to find the most likely word sequence W given the acoustic A.</Paragraph>
      <Paragraph position="1"> argmaxwP(W \[A) Using Bayes theorem and the knowledge that p(A) does not change the maximization we arrive at argm xwp(AIW), v(w) p(AIW ) is commonly referred to as the acoustic model, p(W) is the language model and the argmax operator is realized by specialized search procedures. This paper for the most part ignores the search problem. The acoustic model is in part influenced by the sequences since we can change the entries in the pronunciation dictionary that encode the phoneme sequences the speech system uses to generate its models. During this generation process most modern systems make only partial use of neighboring words and the construction process is up to date also unable to model contractions, especially at word boundaries.</Paragraph>
      <Paragraph position="2"> It is therefore of great advantage to have a basic unit in the decoder that allows for manual or automatic dictionary modification that captures these phenomena. This has recently been reported to be a very promising modeling idea on several different speech recognition tasks in English. The underlying assumption is that sequences of units that have a high stickiness are by conventional usage very likely to show idiosyncratic pronuncations much like single words do: They are for the most part lexicalized.</Paragraph>
      <Paragraph position="3"> The statistical language modeling problem for the sequence of words W = Wl,...,wn where wn is a special end of sentence symbol can then be rephrased as n</Paragraph>
      <Paragraph position="5"> We will for most applications probably never be able to find enough data to estimate p as presented above.</Paragraph>
      <Paragraph position="6"> An often practiced shortcut is therefore to assume that each word is only dependent on the last m - 1 words and that this distribution is the same in all positions of the string. These models are called m-gram models and have proved to be very effective in a large number of applications, even though they are a naive model of language.</Paragraph>
      <Paragraph position="7"> Information theoretic measures (Cover and Thomas, 1991) are frequently used to describe the power of language models. (Cover and Thomas, 1991) shows in chapter 4.2 that the entropy rate of a random process converges, under additional assumptions, to the entropy of the random source. This has been taken as the justification for using an approximation of a notational difference of the entropy rate,dubbed perplexity, as a measure of the strength of the language model. Given a bigram model p and a test text wl,..., w,~ the perplexity PP is defined as PP = 2- ~ ~=1 logP(w,lw,_~) where we make usage of a special &amp;quot;start-of-sentence&amp;quot; symbol as w0. In the sequel we happily ignore this for notational convenience.</Paragraph>
      <Paragraph position="8"> Since we will be changing the basic unit during the sequence finding procedure it is useful to normalize the perplexity onto one standard corpus. Say the standard test corpus has length n and the new test corpus has length n' we define for the test corpus ppret = pp-~. ppr~l is therefore a notational variant of the probability of the test text given the model which is independent of the used sequences of words and is the only meaningful measure in this context.</Paragraph>
      <Paragraph position="9"> The calculation of the model p itself from empirical data involves a number of estimation problems. We are using the well understood and empirically tested backoff method, as recently described e.g. by (Kneser and Ney, 1995).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Algorithm description
</SectionTitle>
      <Paragraph position="0"> The idea of the algorithm is to search for sequences that reduce the relative perplexity of the corpus in an optimal way. For example, if we were working with a bigram model and came across the sequence credit card bill, not only would we have to choose among words like &amp;quot;report,&amp;quot; &amp;quot;history&amp;quot; and &amp;quot;check&amp;quot; as possible successors for &amp;quot;credit,&amp;quot; but the word &amp;quot;card&amp;quot; itself has many senses and &amp;quot;game,&amp;quot; &amp;quot;shop&amp;quot; and &amp;quot;table&amp;quot; might all be more likely followers of &amp;quot;card&amp;quot; than &amp;quot;bill,&amp;quot; if no other context is known. By creating a new word, credit_card, we eliminate one choice and decrease the surprise of seeing the next word.</Paragraph>
      <Paragraph position="1"> Since the new word is now treated exactly like other word instances in the corpus, it can in turn be the first or second half of a future joining operation, leading to multi-word compounds.</Paragraph>
      <Paragraph position="2"> The sequence-finding algorithm iterates over all word pairs in a training corpus, and in each iteration chooses the pair (recall that one or both elements of this pair can themselves be sequences) that reduces the bigram perplexity the most. This can be done by just calculating the number of times all possible word triples appeared and going over this table (except for those entries that have a count of zero) Mayfield Tomokiyo 8J Ries 62 Learning base units in Japanese once. This is iterated until no possible compound reduces perplexity. This technique is obviously just an approximation of an algorithm that considers all word sequences at once and would allow the statistical model to produce the components of a sequence separately. The clustering is therefore a bottom up procedure and during the training of our models we are making a variation of the Viterbi assumption in joining the sequences in the corpus blindly.</Paragraph>
      <Paragraph position="3"> For the corpora we worked with, this technique was sufficiently fast with the efficient implementation described in (Ries et al., 1996), which makes further use of estimation tools from pattern recognition such as the leaving one out technique.</Paragraph>
      <Paragraph position="4"> Inspired by (Lauer, 1995), we have very recently extended this technique so that the algorithm has the option of, instead of replacing a sequence of two units by a new symbol, replacing it by either the left or right component of that sequence. The idea is that the resulting model could capture head information. We have tested this approach on some of our English corpora; the resulting sequences look unpromising, however, and the new option was seldom used by the algorithm.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Application to Japanese
</SectionTitle>
      <Paragraph position="0"> Realizing that the phrase-finding procedure we used on English and German was producing units that were both statistically important and semantically meaningful, we decided to apply the same techniques to Japanese. We needed units that were long enough for recognition and wanted to generalize on inflected forms that are used over and over again with different stems, as well as longer sequences that are frequently repeated in the domain. Other motivations for such a process include:  The approach described in Section 3.2 is a bottom-up approach to sequence finding, and the segmentation of Japanese is more intuitively viewed as a top-down problem in which an input string is broken down to some level of granularity. In applying the algorithm in (Ries et al., 1996) to Japanese, we reversed the problem, first breaking the corpus down to the smallest possible stand-alone units in Japanese, and then building up again, constructing phrases.</Paragraph>
      <Paragraph position="1"> We chose the mora as our fundamental unit. A mora is a suprasegmental unit similar to a syllable, with the important distinctions that a mora does not need to contain a vowel (syllabic /n/ and the first of double consonants are considered independent morae) and a mora-based segmentation would treat long vowels as two morae. The word gakkoo (school) would be two syllables, but four morae: gak-ko-o. Each kana of the Japanese syllabary represents one mora. In some cases kana can be combined and remain a single mora; kyo, as in Tokyo, is an example. null There is some argument as to whether it is natural to break multi-phoneme (CV) kana down further, to the phoneme level; specifically, some analyses of Japanese verb inflections consider the root to include the first phoneme of the alternating kana, as shown in Table 1.</Paragraph>
      <Paragraph position="2"> kana phoneme example stem intl. stem intl.</Paragraph>
      <Paragraph position="3"> hashi ra hashir a hashiranai hashi ri hashir i hashirimasu hashi ru hashir u hashiru hashi re hashir e hashireba hashi ro hashir o hashiroo  verb stems and inflections The nasal consonant kana is considered an independent unit.</Paragraph>
      <Paragraph position="4"> The problem of segmentation is not unique to Japanese; there are other languages without spaces in the written language, and verb conjugations and other inflective forms are issues in almost any language. Words as defined by orthography can be more a curse than a blessing, as having such convenient units of abstraction at our disposal can blind us to more natural representations.</Paragraph>
      <Paragraph position="5"> (Ito and Kohda, 1996) describes an approach similar to ours. Our work is different because of the phrase finding criterion we use, which is to maximize the predictive power of the m-gram model directly. The recent (Ries et al., 1996) showed that a variation of that measure, coined bigram perplexity, outperforms classical measures often used to find phrases. For Chinese (Law and Chan, 1995), a similar measure was combined with a tagging scheme since the basic dictionary already consisted of 80,000 words. The algorithm presented in (Ries et al., 1996) is comparatively attractive computationally, and avoids problems with initialization as it works Mayfield Tomokiyo 8J Ries 63 Learning base units in Japanese in pure bottom up fashion. Ries did not find specific improvements from using word classes in the tasks under consideration.</Paragraph>
      <Paragraph position="6"> Masataki (Masataki and Sagisaka, 1996) describes work on word grouping at ATR, although what they describe is critically different in that they are grouping previously defined words into sequences, not defining new words from scratch. Nobesawa presents a method for segmenting strings in (Nobesawa et al. , 1996) which uses a mutual information criterion to identify meaningful strings. They evaluate the correctness of the segmentation by cross-referencing with a dictionary, however, and seem to depend to a certain extent on grammar conventions. Moreover, a breaking-down approach is less suitable for speech recognition applications than a building-up one because the risk of producing out-of-vocabulary strings is higher. Teller and Batchelder (Teller and Batchelder, 1994) describe another segmentation algorithm which uses extensively knowledge about the type of a character (hiragana/katakana/kanji, etc). This work, though, as well as Nobesawa's, is designed for processing Japanese text, and not speech. Our process is similar to noun compounding procedures, such as described in (Lauer, 1995), but does not use a mutual information criterion. The algorithm was originally developed to find sequences of words in English, initially in order to reduce language model perplexity, then to predict sequences that would be contracted in fast speech, again in English. The work described in this paper is an application of this algorithm to learning of word units in Japanese.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Since the phrase-finding algorithm described in 3.2 is designed to maximize bigram perplexity, the evaluations described here measure this criterion.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Task
</SectionTitle>
      <Paragraph position="0"> are a collection of dialogues in which two speakers are trying to schedule a time to meet together.</Paragraph>
      <Paragraph position="1"> Speakers are given a calendar and asked to find a two-hour slot given the constraints marked on their respective calendars. Dialogues have been collected for English (ESST), German (GSST), Spanish (SSST), Korean (KSST) and Japanese (JSST).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Test corpora
</SectionTitle>
      <Paragraph position="0"> Six language models were created for the scheduling task JSST (Schultz and Koll, 1997). The models were drawn from six different segmentations of the</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Mayfield Tomokiyo ~ Ries 64
</SectionTitle>
      <Paragraph position="0"> same corpus, as described below. Segments (also referred to as &amp;quot;chunks&amp;quot;) were found using the compounding algorithm described in Section 3.2.</Paragraph>
      <Paragraph position="1">  1. Corpus C1 comprised only romanized mora syllables. A romanization tool was run over the original kanji transcriptions; the romanized text was then split into kana (morae).</Paragraph>
      <Paragraph position="2"> 2. Corpus C2 was the result of running C1 through the sequencer.</Paragraph>
      <Paragraph position="3"> 3. Corpus C3 comprised chunks that were learned before romanization. The chunked kanji text was then run through the same romanization tool.</Paragraph>
      <Paragraph position="4"> 4. Corpus C4 was a hand-edited version of C3,  where some word classes (like day of the week if only &amp;quot;tuesday&amp;quot; existed in the corpus the rest of the days were added by hand) were fleshed  out and superfluous chunks removed.</Paragraph>
      <Paragraph position="5"> 5. Corpus C5 was the hand-segmented text used in the current JSST system, with the errorful segmentations described in 5 6. Corpus C6 was C5 + chunks from C4  Only experiments involving romanized corpora were used. The choice of using romanized text over kana text was primarily based on the requirements of our language modeling toolkit; we used a one=toone mapping between kana and roman characters. Equipped with a list of chunks (between 800 and 900 were identified in these corpora), one can always reproduce kanji representations. Breaking down a kanji-based corpus, though, would require a dictionary entry for each individual kanji, of which there are over 2500 that occur in our database. Not only is this difficult to do, given the 3-12 possible readings for each kanji, we would be left after the chunking process with singleton kanji for which it is often impossible to determine the correct reading out of context. One experiment combining chunks extracted from a kanji corpus with chunks from a kana corpus was performed, but the results were not encouraging. Kanji are an extremely informative form of representation, and we will continue to look for ways to incorporate them in future work. However, experiments do show that even without them phrasebuilding can produce significant results.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Perplexity results
</SectionTitle>
      <Paragraph position="0"> The relative perplexities reported below are all normalized with respect to corpus C1. The result below clearly indicates that we can do at least as good Learning base units in Japanese or even better than human segmentations using automatically derived segmentations from the easily definable mora level. We also want to point out that the sequence trigram is better than a four-gram which indicates that the sequences play a critical role in the calculation of the model.</Paragraph>
      <Paragraph position="1"> Our measure of success so far is relative perplexity, and for speech recognition the ultimate measure is of course the accuracy of the recognition results. These results however are in our judgement much better than our results on English or German and we are hopeful that we can integrate this into our JANUS Japanese speech system.</Paragraph>
      <Paragraph position="2">  The dictionary size is the base dictionary size, without the chunks included. The mora dictionary has only 189 word types because it comprises only the legal syllables in Japanese, plus the letters of the alphabet, human and non-human noise, and some other symbols. The word dictionary, used in modeling C5 and C6, had 2357 word types.</Paragraph>
      <Paragraph position="3"> To make the results as strong as possible we used a pseudo closed vocabulary for C5 and C6. This means that we included all word types that occur in the training and test set in the vocabulary. The dictionary size is therefore exactly the number of word types found in both training and test sets and includes the number of sequences added to the model. This favors C5 and C6 strongly, since words that are not in the dictionary cannot be predicted by the language model at all nor can a speech recognition system detect them. However this setup at least guarantees that the models built for C5 and C6 predict all words on the test set as C1-4 do. For larger tasks we assume that the unknown word problem in Japanese will be very pronounced.</Paragraph>
      <Paragraph position="4"> A speech system can obviously recognize only words that are in its dictionary. Therefore, every unknown word causes at least one word error, typically even more since the recognizer tries to fit in another word with a pronounciation that does not fit in well. This may lead to wrong predictions of the language model and to wrong segmentations of the acoustic signal into base units. C1-C4 have a closed vocabulary that can in principle recognize all possible sentences and these segmentations do not suffer from this problem.</Paragraph>
      <Paragraph position="5"> In English, this would be equivalent to having been able to build phoneme based language models that are better than word models, even if we choose the vocabulary such that we have just covered the training and test sets. In some pilot experiments we actually ran the sequence finding procedure on an English phoneme corpus and a letter corpus without word boundaries and found that the algorithm tends to discover short words and syllables; however, the resulting models are not nearly as strong as word models.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Emergence of units
</SectionTitle>
    <Paragraph position="0"> One of the exciting things about this study was the emergence of units that are contracted in fast and casual speech. A problem with morphological breakdowns of Japanese, which are good for the purposes of speech recognition because they are consistent and publicly available tokenizers can be used, is that multi-morph units are often abbreviated in casual speech (as in &amp;quot;don't know&amp;quot; ~ &amp;quot;dunno&amp;quot; in English) and segmenting purely along morphological boundaries hides the environment necessary to capture these phenomena of spontaneous speech. We found that the chunking process actually appeared to be extracting these sequences.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Reducible sequences captured
</SectionTitle>
      <Paragraph position="0"> Following is an example comparing the chunking to the original (termed word-based here) segmentation in JSST. The task, again, is appointment scheduling.</Paragraph>
      <Paragraph position="1"> Numbered sentences are glossed in Table 2; (1) and  (6) correspond to (A); (2,7) to (B); (3,8) to (C), etc. (1) gozenchuu ni shi te itadake reba (2) getsuyoobi ni shi te itadakere ba to omoi masu (3) ukagawa shi te itadakere ba (4) renraku shi nakere ba to omot te (5) sorosoro kime nake re ba nara nai  Sentences 1-5 are shown as segmented by human transcribers. Sentences 6-10 are the same three sentences, segmented by our automated process.  (6) (gozenchuu) ni (shiteitada) (kereba) (7) (getsuyoobi) ni (shiteitada) (kereba) (toomoimasu) (8) (ukagawa) (shiteitada) (kereba)  yuugata.nara $ aite fi)masu kedo $ evening-if open Is SOFTENER kaigi ga$ haitte orimasu $ meeting SUBJ in is  discussed at the point in the text where the sentences occur; dollar signs indicate bunsetsu boundaries. (9) (renraku) shi na (kereba) (toomo) (tte) (10) (sorosoro) (kime) na (kereba) (nara) (nai)  There are two issues of importance here. First, the hand-segmenting, while it can be tailored to the task, is inconsistent; the sequence &amp;quot;...ni-shi-te-i-tada-ke-re-ba&amp;quot; (If I could humbly receive the favor of doing...) is segmented at one mora boundary in (1) and at another in (2). Sentences (4) and (7) show the same sequences as segmented by the chunker; the segmentation is consistent. The same is true for &amp;quot;...na-ke-re-ba in (4) and (5) as compared to (9) and (10).</Paragraph>
      <Paragraph position="2"> The second important issue is the composition of the sequences. The sequence &amp;quot;kereba&amp;quot; in (6-10), while used here in a formal context, is one that is often reduced to &amp;quot;kya&amp;quot; or &amp;quot;kerya&amp;quot; in casual speech. The knowledge that &amp;quot;kereba&amp;quot; can be a word is very valuable for the speech recognizer. Once it has access to this information, it can train its expected pronunciations of the. sequence &amp;quot;kereba&amp;quot; to include &amp;quot;kya&amp;quot; pronunciations as they occur in the spoken corpus. Without the knowledge that these three morae can form one semantic unit, the recognizer cannot abstract the information that when combined in certain contexts they can be reduced in this special way.</Paragraph>
      <Paragraph position="3"> Although the (kereba) in (6) and (7) is attached to a verb, itadaku, that is very formal and would not be abbreviated in this way, let us consider sentences  (D) and (E), here segmented into bunsetsu phrases: (11) renraku shinakereba to omotte (12) renraku shinakya to omotte (13) sorosoro kimenakereba naranai (14) sorosoro kimenakya naranai  Sentence (D) is shown in (11) in full form and in (12) in contracted form; sentence (E) is shown in (13) in full form and in (14) in contracted form. Selection of the chunk (kereba) provides the environment necessary for modeling the contraction &amp;quot;kya&amp;quot; with some verbs and adjectives in informal speech. May\]ield Tomokiyo ~ Ries 66 Learning base units in Japanese Basing a tokenizer on syntactic factors can hide precisely such environments* A second example of a frequently contracted sequence in Japanese is to yuu or tte yuu which becomes something close to &amp;quot;chuu&amp;quot; or &amp;quot;tyuu&amp;quot; in fast and sloppy speech.</Paragraph>
      <Paragraph position="4">  (15) naN tte yuu ka (16) sono hi wa gogo wa muri desu, to yuu ka, sanji made  kaigi ga haitte, iru node sanji ikoo nara daijoubu desu l~edo The to yuu sequence is recognized as a single sequence in some tokenization methods and not in others, so the idea of treating it as a single word is not novel, but in order for the variant &amp;quot;chuu&amp;quot; to be considered during recognition, it is important that our system recognize this environment* There are cases in which the combination to yuu will not collapse to &amp;quot;chuu:&amp;quot; (17) asa hayaku to yuugata nara aitemasu kedo In the scheduling domain, the word yuugata (evening) is common enough for it to be identified as a word on its own, and the utterance is correctly segmented as (to) (yuugata). In a different domain, however, the extraction of (toyuu) might take precedence over other segmentation, which would indeed be incorrect.</Paragraph>
      <Paragraph position="5"> Yet another type of contraction common in casual speech is blending of the participial suffix te and the beginning of the auxiliary oru, as in (J).</Paragraph>
      <Paragraph position="6"> The -te form of the verb, also often referred to as the participial (Shibatani, 1987) or gerundive (Matsumoto, 1990) form, is constructed by adding the suffix te to the verb stem plus the renyoo inflection* This renyoo (conjunctive) form of the verb is also used with the past-tense suffix ta and provisional suffix tara.</Paragraph>
      <Paragraph position="7"> In the majority of the literature, the -te form seems to be analyzed either as a single unit independent of the auxiliary verb (iru/oku/aru/morau etc.) (Sells, 1990) or broken down into its morphological constituents (Yoshimoto and Nanz, 1996). An exception is (Sumita and Iida, 1995)* With certain auxiliary verbs, though, the e in te is dropped and the suffix-initial t is affixed to the initial vowel of the auxiliary, as in hait-torimasu, shi-tokimasu. This phenomenon is very pronounced in some dialects and only slight in others* Our method does identify several units that have the -te appended directly onto the auxiliary verb, creating a very useful phonetic environment for us.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Long enough for speech recognition
</SectionTitle>
      <Paragraph position="0"> In speech recognition systems, short recognition units are to be avoided because they are confusible it is much harder to distinguish between &amp;quot;bee&amp;quot; and &amp;quot;key&amp;quot; than &amp;quot;BMW&amp;quot; and &amp;quot;key lime pie.&amp;quot; This is one reason that we did not want to use a morphological breakdown of input sentences. Segmented in the strictest sense (Teller and Batchelder, 1994), the sentence &amp;quot;\[I\] was studying&amp;quot; could be written as: benkyoo shi te i mashi ta study do PART PROG POLITE PAST Single-phoneme units like/i/and syllabic/n/are so small that they are easy to misrecognize. Even /te/and/ta/are shorter than would normally be desired, although Japanese phonemes appear to be less confusible than their English and German counterparts (Schultz and Koll, 1997)* Units such as (shite) and (imashita), as produced by our algorithm, are long enough to be distinguishable from other words, yet short enough to generalize* Since the basic unit from which we were building was the mora, ending up with units that were too short was a concern.</Paragraph>
      <Paragraph position="1"> We found that the average unit length in mora was comparable to that of the hand-segmented system, however* It is also important, though, to control the vocabulary size if a reasonable search space is desired* Early experiments with recognizing at the bunsetsu level in Korean indicated that vocabulary did explode, since most full bunsetsu were used only once.</Paragraph>
      <Paragraph position="2"> The vocabulary growth actually did level off eventually, but the initial growth was unacceptable, and we switched to a syllable-based system in the end.</Paragraph>
      <Paragraph position="3"> Figure 5.2 shows vocabulary growth rates in Janus for different languages in the scheduling domain.</Paragraph>
      <Paragraph position="4"> Mayfield Tomokiyo ~ Ries 67 Learning base units in Japanese</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Undesired effects
</SectionTitle>
      <Paragraph position="0"> Since our algorithm evaluates all sequences with the same components identically, some compounding that is clearly wrong occurs.</Paragraph>
      <Paragraph position="1">  For example, the chunk (kuno} was identified by the system. This was because the phrases daigaku-no &amp;quot;university-GEN&amp;quot; and boku-no &amp;quot;I/me-GEN&amp;quot; were both very common- the algorithm abstracted incorrectly that (kuno) was a meaningful unit before it found the word daigaku, which it eventually did identify.</Paragraph>
      <Paragraph position="2">  Although the point where a stem should end and an inflection begin can be ambiguous, most stems have definite starting points, and this algorithm can miss them. For example, mooshiwakegozaimasen &amp;quot;I'm very sorry&amp;quot; occurs many times in the database, but our algorithm only extracted part: (shiwakegozaimaseN}. Because of the way our stopping criterion is defined, we can infer from the fact that the full phrase was not extracted that by forming this compound we would actually have increased the difficulty of the corpus; more analysis is needed to understand exactly why.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> The results reported here show that we can get similar entropies in our language model by using an automatic process to segment the data. This means that we do not have to rely on human segmenters, which can be inconsistent and time consuming. We can also tailor the segmentation style to the task; the inflected forms and general word choice in casual and formal speech are very different, and our method allows us to target those which are most relevant. This is in itself a significant result.</Paragraph>
    <Paragraph position="1"> Additionally, we found that our method finds sequences which are likely to undergo contractions and reductions in casual speech. This has implications not only for Japanese, but also for speech recognition in general. If our algorithm is finding a natural base unit in Japanese, we should be able to use a similar approach to find units more natural than the word in other languages.</Paragraph>
  </Section>
class="xml-element"></Paper>