<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0707">
  <Title>Koji TOCHINAI Graduate school of Business Administration</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Speech processing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Speech data
</SectionTitle>
      <Paragraph position="0"> It is necessary to extract time-varying spectral characteristics from utterances and supply them to the system. We used several conversation sets from an English conversation book (GEOS Publishing Inc., 1999). The Japanese speech data was recorded on DAT at a 48kHz sampling rate and downsampled to 8kHz. All speech data in the source language was spoken by Japanese male students of our laboratory. The speech data was spoken by 2 speakers in the source and target languages, respectively.</Paragraph>
      <Paragraph position="1"> The content of the data sets consists of conversations between a client and the front desk at a hotel and conversations between a client and train station staff.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Spectral characteristics of speech
</SectionTitle>
      <Paragraph position="0"> In our approach, the acoustic characteristics of speech are very important because we must find common and different acoustic parts by comparing them. It is assumed that acoustic characteristics are not dependent on any language. Table 1 shows the conditions for speech analysis. The same conditions and the same kind of characteristic parameters of speech are used throughout the experiments.</Paragraph>
      <Paragraph position="1"> In this report, the LPC coefficients are applied as spectral parameters because Murakami et al. (2002) obtained better results using these parameters than with other representations of speech characteristics.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Searching for the start point of parts between utterances
</SectionTitle>
      <Paragraph position="0"> When comparing speech samples, we had to consider how to normalize elasticity in the time domain. Many methods have been investigated to resolve this problem, and we devised one that obtains a result similar to dynamic programming (H. Sakoe et al., 1978; H. F. Silverman et al., 1990) for time-domain normalization. We adopted a method that investigates the difference between two characteristic vectors of speech samples to determine common and different acoustic parts, using the Least-Squares Distance Method to calculate the similarity between the vectors.</Paragraph>
      <Paragraph position="2"> Two sequences of characteristic vectors, named the test vector and the reference vector, are prepared.</Paragraph>
      <Paragraph position="3"> The test vector is picked out from the test speech by a window of definite length. At the same time, the reference vector is prepared from the reference speech. A distance value is calculated by comparing the present test vector with a portion of the reference vector. Then, the calculation is repeated between the current test vector and every portion of the reference vector, picked out and shifted at a constant interval along the time domain.</Paragraph>
      <Paragraph position="7"> When the portion reaches the end of the whole reference vector, a sequence of distance values is obtained as a result. The procedure for comparing the two vectors is shown in Figure 3. Next, a new test vector is picked out at the constant interval, and the calculation mentioned above is repeated until the end of the test speech. Finally, we obtain several distance curves as the result of comparing the two speech samples.</Paragraph>
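The sliding-window comparison described above can be sketched as follows. This is a minimal sketch in Python; the window length, shift interval, function name, and array shapes are our own illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def distance_curves(test, reference, win=20, step=5):
    """Compare a test and a reference sequence of frame-wise feature
    vectors (frames x order arrays), producing one distance curve per
    test window, as in the procedure of Figure 3."""
    curves = []
    for t in range(0, len(test) - win + 1, step):
        test_vec = test[t:t + win]
        curve = []
        # slide over the whole reference at a constant interval
        for r in range(0, len(reference) - win + 1, step):
            ref_vec = reference[r:r + win]
            # least-squares distance between the two vector sequences
            curve.append(float(np.sum((test_vec - ref_vec) ** 2)))
        curves.append(curve)
    return curves
```

Comparing a sample against itself yields a zero distance exactly where the windows align, which is the kind of obvious minimum the method looks for.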
      <Paragraph position="8"> Figure 4 and Figure 5 show examples of the difference between two utterances. The speech samples used were spoken by the same speaker. The contents of the compared utterances are the same in Figure 4, and quite different in Figure 5. The horizontal axis shows the shift number of the reference vector on the time domain, and the vertical axis shows the shift number of the test vector, i.e., the portion of test speech. In the figures, the curve in the lowest location was drawn by comparing the top of the test speech with the whole reference speech. If a distance value in a distance curve is obviously lower than the other distance values, the two vectors have high acoustic similarity.</Paragraph>
      <Paragraph position="9"> As shown in Figure 5, no obvious local minimum distance point is found, even though each distance curve has a lowest point. On the other hand, as shown in Figure 4, when the test and reference speech have the same content, minimum distance values are found sequentially across the distance curves. According to these results, if a distance curve has a position with an obviously smallest distance value, that portion should be regarded as a common part. Moreover, if such points appear sequentially across several distance curves, they are considered one common part. In that case, the part may correspond to several semantic segments, longer than a phoneme or a syllable.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Evaluation of the obvious minimal distance value
</SectionTitle>
      <Paragraph position="0"> To determine whether the obviously lowest distance value in a distance curve marks a common part, we adopt a threshold calculated from statistical information: we compute the variance σ² and the mean value of the distance values within the curve.</Paragraph>
      <Paragraph position="1"> The threshold θ is derived as θ = 4σ² from the equations of the Gaussian distribution and the standardized normal distribution.</Paragraph>
      <Paragraph position="2"> The point of the smallest distance value within a curve is represented by x, and the parameter m is the mean value of the distances. A common part is detected if (x − m)² &gt; θ, because at that point the portion of reference speech has high similarity with the test vector of the distance curve; such a common part is represented by '0'. Otherwise, the speech portion for the test vector is regarded as a different part and represented by '1'. If several common parts are detected consecutively, we treat them as one common part, and the first point of that part finally becomes the start point. In our method, the acoustic similarities evaluated by these calculations are the only factor for judging whether parts of the speech samples are common or different.</Paragraph>
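A minimal sketch of this thresholding and of the merging of consecutive common parts, assuming each distance curve is a plain list of values (function names are our own):

```python
import numpy as np

def classify_curve(curve):
    """Label a distance curve '0' (obvious minimum, common part) or
    '1' (different part) using the threshold theta = 4 * sigma^2,
    i.e. a minimum more than two standard deviations from the mean."""
    d = np.asarray(curve, dtype=float)
    m = d.mean()           # mean distance m within the curve
    theta = 4.0 * d.var()  # threshold theta = (2 * sigma)^2
    x = d.min()            # smallest distance value x in the curve
    return 0 if (x - m) ** 2 > theta else 1

def common_spans(labels):
    """Merge runs of consecutive '0' labels into single common parts;
    the first index of each run is taken as its start point."""
    spans, start = [], None
    for i, v in enumerate(list(labels) + [1]):  # sentinel ends last run
        if v == 0 and start is None:
            start = i
        elif v == 1 and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans
```

A perfectly flat curve has zero variance, so no point can exceed the threshold and the curve is labeled a different part, matching the behavior described for Figure 5.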
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Generation and application of translation rules
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Correction of acquired parts
</SectionTitle>
      <Paragraph position="0"> The two reference speech samples are divided into several common and different parts by comparison.</Paragraph>
      <Paragraph position="1"> However, these parts may include errors of elasticity normalization, because the distance calculation does not fully resolve this problem on the time domain. We attempt to correct incomplete common and different parts using heuristic techniques when a common part is interrupted by an isolated different part, or a different part is interrupted by an isolated common part.</Paragraph>
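One minimal way to realize such a correction, assuming the simplest heuristic of flipping a single isolated label (the paper does not specify the exact heuristics, so this is our own illustration):

```python
def correct_parts(labels):
    """Heuristic smoothing of a 0/1 part sequence: a single label
    surrounded on both sides by the opposite label is assumed to be
    an elasticity-normalization error and is flipped to match its
    neighbors."""
    fixed = list(labels)
    for i in range(1, len(fixed) - 1):
        if fixed[i - 1] == fixed[i + 1] != fixed[i]:
            fixed[i] = fixed[i - 1]
    return fixed
```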
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Acquisition of translation rules
</SectionTitle>
      <Paragraph position="0"> The common and different parts corrected in 3.1 are applied to determine the rule elements needed to generate translation rules. Figures 6 and 7 show the results of comparing utterances. In the first case, a part containing continuous values of '0' represents a common part. In the second case, a part consisting only of '1' is regarded as a different part. In Figure 6, the two utterances are evaluated as one long common part; on the contrary, the two utterances in Figure 7 are evaluated as one long different part. These results are consistent with the lexical contents, because the syntactic sentence structures are the same in both cases.</Paragraph>
      <Paragraph position="1"> Moreover, when a sentence structure includes both common and different parts, we treat this structure as a third case. We deal with these three cases of sentence structure as rule types. In all the above-mentioned cases, several sets of common and different parts are acquired whether the utterances matched almost entirely or did not match at all. Combined sets of common parts of the source and target languages become elements of the translation rules. At this time, only the sets of common parts extracted from the source language that have a correspondence of meaning with a set of common parts in the target language are kept. The sets of different parts become elements of the translation rules as well.</Paragraph>
      <Paragraph position="2"> Finally, translation rules are generated by completing all their elements as below. It is important to note that rules are acquired only if the sentence types in both languages are the same. When the types of sentence structure differ, translation rules cannot be obtained and registered in the rule dictionary, because we cannot decide the correspondence between the two language samples uniquely. Acquired rules are categorized into the following types: Rule type 1, those with very high sentence similarity; Rule type 2, those with sentences including both common and different parts; Rule type 3, those with very low sentence similarity. When a new rule containing the information of several common parts is generated, the rule keeps the sentence form, with the different parts in the speech sample replaced by variables. The information a translation rule holds is as follows: the rule type as mentioned above, the index number of the source language's utterance, and the sets of start and end points of each common and different part.</Paragraph>
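The information listed above might be held in a structure such as the following (a hypothetical sketch; the field names and span encoding are our own, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TranslationRule:
    """One acquired translation rule.

    rule_type: 1 = very high sentence similarity,
               2 = sentence with both common and different parts,
               3 = very low sentence similarity.
    utterance_index: index number of the source-language utterance.
    source_parts / target_parts: (start, end, kind) spans on the
    time domain, with kind '0' for common and '1' for different
    parts; different parts play the role of variables in the rule.
    """
    rule_type: int
    utterance_index: int
    source_parts: List[Tuple[int, int, str]] = field(default_factory=list)
    target_parts: List[Tuple[int, int, str]] = field(default_factory=list)
```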
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Translation and speech synthesis
</SectionTitle>
      <Paragraph position="0"> When an unknown speech utterance in the source language is input for translation, the acoustic information of the acquired parts in the translation rules is compared in turn with the unknown speech, and the matched rules become candidates for translation. The input utterance should be reproducible by a combination of several candidate rules. Then, the corresponding parts of the target language in the candidate rules are referred to in order to obtain translated speech. Although the final synthesized target speech may be rough, speech can be produced directly by concatenating the suitable parts of rules in the target language, using the time-domain location information in the rules.</Paragraph>
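The concatenation step might be sketched as follows, assuming the target speech is stored frame by frame and the matched rules supply (start, end) time-domain spans (the function name and data layout are our own assumptions):

```python
def synthesize(target_speech, matched_spans):
    """Produce rough target speech by directly concatenating the
    suitable target-language parts, using the rules' time-domain
    location information."""
    out = []
    for start, end in matched_spans:
        out.extend(target_speech[start:end])
    return out
```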
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Inductive Learning Method
</SectionTitle>
    <Paragraph position="0"> The Inductive Learning Method proposed by Araki et al. (2001) acquires rules by extracting common and different parts through the comparison of two samples. The method is designed on the assumption that a human being is able to find common and different parts between two samples even when both are unknown. The method can also obtain further rules by repeated comparison of the acquired rules registered in the rule dictionary.</Paragraph>
    <Paragraph position="1"> Figure 8 shows an overview of recursive rule acquisition by this learning method. Two acquired rules, rule(i) and rule(j), are prepared and compared to extract common and different acoustic parts, in the same way as comparisons between speech samples.</Paragraph>
    <Paragraph position="2"> These extracted parts then become new rules.</Paragraph>
    <Paragraph position="3"> If the compared rules consist of several common or different parts, the calculation is repeated within each part. It is assumed that these new rules are much more reliable for translation.</Paragraph>
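Treating rules as symbol sequences for illustration, one acquisition step might look like the following toy sketch (the real method compares acoustic vectors as in Section 2; all names here are our own):

```python
def compare(a, b):
    """Stand-in for the acoustic comparison of Section 2: here,
    plain element-wise equality, labeling matches '0' (common)
    and mismatches '1' (different)."""
    return [0 if x == y else 1 for x, y in zip(a, b)]

def acquire_rules(a, b, dictionary):
    """Extract common/different parts from two rules and register
    each part in the rule dictionary as a new rule."""
    labels = compare(a, b)
    spans, start = [], 0
    # group consecutive identical labels into (start, end, kind) runs
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((start, i, labels[start]))
            start = i
    for s, e, kind in spans:
        dictionary.append((kind, a[s:e], b[s:e]))
    return spans
```

In the full method, the same calculation would be repeated recursively within each extracted part when a part itself contains further structure.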
    <Paragraph position="4"> If several rules prove not useful for translation, they are eliminated by generalizing the rule dictionary optimally, keeping it within a designed memory size. This ability of optimal generalization is an advantage of the Inductive Learning Method, as fewer examples have to be prepared beforehand, whereas conventional approaches need much sample data to acquire many suitable rules.</Paragraph>
  </Section>
</Paper>