<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1073"> <Title>The Design for the Wall Street Journal-based CSR Corpus*</Title> <Section position="5" start_page="357" end_page="357" type="metho"> <SectionTitle> THE WSJ-CORPUS STRUCTURE AND CAPABILITIES </SectionTitle> <Paragraph position="0"> Specifically, the WSJ corpus is scalable and built to accommodate variable-size large vocabularies (5K, 20K, and larger), variable perplexities (80, 120, 160, 240, and larger), speaker-dependent (SD) and speaker-independent (SI) training with variable amounts of data (ranging from 100 to 9600 sentences/speaker), including equal portions of verbalized and non-verbalized punctuation (to reflect both dictation-mode and non-dictation-mode applications), separate speaker-adaptation materials (40 phonetically rich sentences/speaker), simultaneous recording with a standard close-talking microphone and multiple secondary microphones, variable moderate-noise environments, and equal numbers of male and female speakers chosen for diversity of voice quality and dialect. In order to collect large quantities of speech data very cost-effectively, it was decided to collect the majority of the recorded speech in a &quot;read&quot; speech mode, whereby speakers are prompted by newspaper text paragraphs. The presentation of coherent paragraph blocks of text provides semantically meaningful material, thereby facilitating the production of realistic speech prosodies. Small amounts of unprompted &quot;spontaneous&quot; speech are provided for comparison (utilizing some naive speakers as well as some who are experienced at dictation for human transcription).</Paragraph> <Paragraph position="1"> Testing paradigms were carefully constructed to accommodate efficient comparisons of SI and SD performance and variable-size vocabulary &quot;open&quot; and &quot;closed&quot; tests to permit evaluation both with and without &quot;out-of-vocabulary&quot; lexical items. The value of variable amounts of training-set materials can be directly assessed both within and across speakers. Well-trained speaker-dependent performance provides an upper bound against which the success of different speaker-independent modeling and speaker-adaptive methodologies may be rigorously compared.</Paragraph> <Paragraph position="2"> Adaptive acoustic and language modeling is easily supported through the following simple though rigorous automatic paradigm: 1) Recognition of a sentence is performed and assessed as usual against the existing system acoustic and language models. 2) The system then adapts, either using (supervised) or not using (unsupervised) the correct &quot;clear text&quot;, modifying its internal acoustic and language models automatically before proceeding to recognition of the next utterance.</Paragraph> <Paragraph position="3"> Recognition performance with this kind of automatic adaptation is assessable with standard scoring routines. This mode provides an easy means to maximize performance for individual speakers by tracking and accommodating speaker and environmental changes dynamically. It also simulates, in a reproducible fashion, an interactive system mode in which speakers correct system recognition errors and the system uses this feedback to improve performance continuously and automatically.
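For concreteness, a minimal sketch of this recognize-then-adapt loop follows (the recognize/adapt interface is hypothetical and is not part of the corpus or its evaluation software):

    def run_adaptation_test(recognizer, utterances, clear_texts, supervised=True):
        # Step 1: recognize each utterance against the current acoustic and
        # language models; Step 2: adapt those models before the next utterance.
        hypotheses = []
        for audio, truth in zip(utterances, clear_texts):
            hyp = recognizer.recognize(audio)
            hypotheses.append(hyp)
            # Supervised adaptation uses the correct "clear text";
            # unsupervised adaptation uses the recognizer's own hypothesis.
            reference = truth if supervised else hyp
            recognizer.adapt(audio, reference)
        return hypotheses  # scored afterwards with the standard scoring routines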
The results of automatic adaptation can be assessed in an on-going &quot;dynamic&quot; fashion, or stopped after varying amounts of adaptation for subsequent &quot;static&quot; testing on materials to which the system is not subsequently adapted[1,2,3].</Paragraph> <Paragraph position="4"> The availability of large amounts of machine-readable text from nearly three years of the Wall Street Journal enables meaningful statistical benchmark language models (including bigrams and trigrams) to be generated, and the results from these to be easily contrasted. By varying the types of language models chosen, the effect on recognition performance of variable perplexities for the same textual materials can be assessed. The availability of this text also provides a valuable resource for developing and evaluating novel language models and language models adapted from other tasks.</Paragraph> </Section> <Section position="6" start_page="357" end_page="358" type="metho"> <SectionTitle> THE WSJ-PILOT DATABASE </SectionTitle> <Paragraph position="0"> It was judged to be too ambitious to immediately record a 400-hour recognition database. Therefore, a smaller pilot database built around the WSJ task was designed. A joint BBN/Lincoln proposal for the pilot was adopted by the CSR committee. In an attempt to &quot;share the shortage&quot;, this proposal provided equal amounts of training data for each of three popular training paradigms. The proposal was also rich enough to provide for &quot;multi-mode&quot; use of the data, allowing many more than just three paradigms to be explored.</Paragraph> <Paragraph position="1"> The original plan was for about a 45-hour database, but the three recording sites (MIT, SRI, and TI) each recorded about a half share for a total of 80 hours. The resultant database is shown in Table 4 and described below. (About</Paragraph> <Paragraph position="3"> It is important to be able to train a language model that is well matched to the (text) source, to be used as a control condition to isolate the performance of the acoustic modeling from the language modeling[12]. (It is always possible to train a mismatched language model, but its effects cannot be adequately assessed without a matched control language model.) Ideally, one would have access to a very large amount (tens to hundreds of millions of words) of accurately transcribed speech. Such data were not available to us. Therefore, this condition was simulated by preprocessing the WSJ text in a manner that removed the ambiguity in the word sequence that a reader might choose. (This preprocessing is similar to that which might be used in a text-to-speech system[4].) This ensures that the unread (and unchecked) text used to train the language model is representative of the spoken test material.</Paragraph> <Paragraph position="4"> The original WSJ text data were supplied by Dow Jones, Inc. to the ACL/DCI[9], which organized the data and distributed it to the research community in CD-ROM format.</Paragraph> <Paragraph position="5"> The WSJ text data were supplied as 313 1MB files from the years 1987, 1988, and 1989. The data consisted of articles that were paragraph- and sentence-marked by the ACL/DCI.</Paragraph> <Paragraph position="6"> (Since automatic marking methods were used, some of the paragraph and sentence marks are erroneous.)
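As an illustration of the benchmark bigram/trigram language models mentioned above (a simplified sketch, not the toolkit used to build the distributed baseline models, which also apply back-off smoothing), n-gram counts over the preprocessed truth texts could be collected as follows:

    from collections import Counter

    def count_ngrams(sentences, n=2):
        # sentences: iterable of tokenized truth-text sentences (lists of words)
        ngram_counts, history_counts = Counter(), Counter()
        for words in sentences:
            padded = ["<s>"] * (n - 1) + words + ["</s>"]
            for i in range(len(padded) - n + 1):
                gram = tuple(padded[i:i + n])
                ngram_counts[gram] += 1
                history_counts[gram[:-1]] += 1
        return ngram_counts, history_counts

    def mle_prob(gram, ngram_counts, history_counts):
        # Maximum-likelihood estimate; the baseline models replace this with
        # back-off smoothing to handle unseen n-grams.
        h = history_counts[gram[:-1]]
        return ngram_counts[gram] / h if h else 0.0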
The article headers contained a WSJ-supplied document-control number.</Paragraph> <Paragraph position="7"> The preprocessing began with integrity checks: one file from 1987 and 38 from 1988 were discarded due to duplication of articles within the same file (1987) or duplication of data found in other files (1988). 274 files were retained, which yielded 47M with-verbalized-punctuation words from 1.8M sentences. (The yield is on the order of 10% fewer words in the non-verbalized-punctuation version.) Each file contained a scatter of dates, usually within a few days of each other but sometimes up to six months apart. Each file was characterized by its most frequent date (used later to temporally order the files).</Paragraph> <Paragraph position="8"> Since the CSR Committee had decided to support both with- and without-verbalized-punctuation modes, it was necessary to produce four versions of each text: with/without verbalized punctuation x prompt/truth texts. (A prompt text is the version read by the speaker; the truth text is the version used by the training, recognition, and scoring algorithms.) The preprocessing consisted of a general preprocessor (GP) followed by four customizing preprocessors to convert the GP output into the four specific outputs. The traditional computer definition of a word is used--any white-space-separated object is a word. Thus, a word with an attached comma is treated as a different word unless the comma is first separated from it. (Resolving the role of a period or an apostrophe/single quote can be a very difficult problem requiring full understanding of the text.) The general preprocessor started by labeling all paragraphs and sentences using an SGML-like scheme based upon the file name, document-control number, paragraph number within the article, and sentence number within the paragraph. This marking scheme, which was carried transparently through all of the processing, made it very easy to locate any of the text at any stage of the processing. A few bug fixes were applied for such things as common typos and misspellings. Next, numbers are converted into orthographic form. &quot;Magic numbers&quot; (numbers such as 386 and 747 which are not pronounced normally because they have a special meaning) are pronounced from an exceptions table. The remaining numbers are pronounced by rule--the algorithms cover money, time, dates, &quot;serial numbers&quot; (mixed digits and letters), fractions, feet-and-inches, real numbers, and integers. Next, sequences of letters are separated (U.S. becomes U. S.), Roman numerals are written out as cardinals or ordinals depending on the left context, acronyms are spelled out or left as words according to the common pronunciation, and abbreviations (except for Mr., Mrs., Ms., and Messrs.) are expanded to the full word. Finally, single letters are followed by a &quot;.&quot; to distinguish them from the words &quot;a&quot; and &quot;I&quot;. This output is the input to the four specific preprocessors.</Paragraph> <Paragraph position="9"> The punctuation processor is used in several modes. In its normal mode, it is used to produce the with-verbalized-punctuation texts. It distinguishes apostrophes from single quotes (an apostrophe is part of the word, a single quote is not), resolves whether a period indicates an abbreviation or is a punctuation mark, and separates punctuation into individual marks apart from the words. This punctuation is written out in a word-like form (e.g., ,COMMA) to ensure that the speaker will pronounce it. This output is the with-punctuation prompting text.
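A much-simplified sketch of this separation step follows (the mark-to-word mapping shown is illustrative--only ,COMMA appears in the text above--and the abbreviation-period and apostrophe/quote resolution described earlier is omitted):

    import re

    PUNCT_WORDS = {",": ",COMMA", ".": ".PERIOD", ";": ";SEMI-COLON",
                   ":": ":COLON", "?": "?QUESTION-MARK"}  # illustrative names

    def punctuation_pass(tokens, verbalize=True):
        # Separate a trailing punctuation mark from each white-space word and
        # either spell it out as a word-like token (verbalized-punctuation texts)
        # or drop it entirely (without-punctuation truth text).
        out = []
        for tok in tokens:
            word, mark = re.match(r"^(.*?)([,.;:?]?)$", tok).groups()
            if word:
                out.append(word)
            if mark and verbalize:
                out.append(PUNCT_WORDS[mark])
        return out

The same routine run with verbalize=False corresponds to the punctuation-stripping mode used for the without-punctuation truth text described below.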
Until this point, the text retains the original case as supplied on the CD-ROM. If one wishes to perform case-sensitive recognition (i.e., the language model predicts the case of the word), this same text can be used as the with-punctuation truth text; if one wishes to perform case-insensitive recognition, the text may be mapped to upper case. (A post-processor is supplied with the database to perform the case mapping without altering the sentence markings.) Initial use of the database will center on case-insensitive recognition.</Paragraph> <Paragraph position="10"> The without-punctuation prompting text is very similar to the GP output. Only a few things, such as mapping &quot;%&quot; to &quot;percent&quot;, need to be performed. This text contains the mixed case and normal punctuation to help the subject speak the sentence. (The subject is instructed not to pronounce any of the punctuation in this mode.) The punctuation processor is used in a special mode to produce the without-punctuation truth text. It performs all of the same processing as described above to locate the punctuation marks, but now, rather than spelling them out, it eliminates them from the output. (Since the punctuation marks do not appear explicitly in the acoustics, they must be eliminated from the truth texts. Predicting punctuation from the acoustics has been shown to be impractical--human transcribers don't punctuate consistently, and, in an attempt to perform punctuation prediction with the language model in a CSR, IBM found a high percentage of their errors to be due to incorrectly predicted punctuation[14]. People dictating to a human transcriber verbalize the punctuation if they feel that correct punctuation is important, e.g., lawyers. They also verbally spell uncommon words and issue formatting commands where appropriate.) This without-punctuation truth text is again mixed case and can be mapped to upper case if the user desires.</Paragraph> </Section> <Section position="7" start_page="358" end_page="359" type="metho"> <SectionTitle> WSJ TEXT SELECTION INTO DATABASE PARTS </SectionTitle> <Paragraph position="0"> Next it was necessary to divide the text into sections for the various parts of the database. Since the plan called for the pilot to become a portion of the full database, all text processing and selection were performed according to criteria consistent with the full database.</Paragraph> <Paragraph position="1"> Ninety percent of the text, including all of the Penn Treebank[17] (about 2M words), was reserved for training, 5% for development testing, and the remaining 5% for evaluation testing. The non-Treebank text files were temporally ordered (see above) and 28 were selected for testing--the odd ordinal files for development testing and the even ordinal files for evaluation testing. (The Treebank included the 21 most recent files, so it was not possible to simulate the real case--train on the past and test on the &quot;present&quot;.) All of the non-test data, with the exception of the sentences recorded for acoustic training, are available for training language models. The acoustic training data are eliminated to allow a standard sanity check: CSR testing on the acoustic training data without also performing a closed test on the language model.</Paragraph> </Section> <Section position="8" start_page="359" end_page="359" type="metho"> <SectionTitle> WSJ TEXT SELECTION FOR RECORDING </SectionTitle> <Paragraph position="0"> Next the recording sentences were selected.
Separate sentence &quot;pools&quot; were selected from the appropriate text sections for SI training (10K sentences), SD training (20K sentences), the 20K-word vocabulary test (4K development-test and 4K evaluation-test sentences), and the 5K-word vocabulary test (2K development-test and 2K evaluation-test sentences). It was originally hoped that the 5K-vocabulary test set could be formed as a subset of the 20K test set, but this was not possible--thus the four test sets are completely independent.</Paragraph> <Paragraph position="1"> The recording texts were filtered for readability. (The WSJ uses a lot of uncommon words and names and uses complex sentence structures that were never intended to be read aloud.) The first step was to form a word-frequency list (WFL) (i.e., a frequency-ordered unigram list) from all of the upper-case with-punctuation truth texts. This yielded a list of 173K words. (For comparison, mixed case yields 210K words.) Next, a process of automated &quot;quality filtering&quot; was devised to filter out the majority of the erroneous and unreadable paragraphs. This filtering is applied only to the recorded texts, not to the general language-model training texts. Since many typos, misspellings, and processing errors (from both the ACL/DCI marking and the preprocessing) map into low-frequency words, any paragraph which contained an out-of-top-64K-WFL word or was shorter than 3 words was rejected. (The top 64K WFL words cover 99.6% of the frequency-weighted words in the database.) Any paragraph containing fewer than three sentences or more than eight sentences was rejected to maintain reasonable selection-unit sizes. Any paragraph containing a sentence longer than 30 words was rejected as too difficult to read.1 Because the WSJ contains many instances of certain &quot;boiler-plate&quot; figure captions which would be pathologically over-represented in the test data, duplicate sentences were removed from the test sentence pools. Finally, human checks verified the high overall quality of the chosen sentences. Note that this does not mean perfect--there were errors in both the source material and the preprocessing.</Paragraph> <Paragraph position="2"> 1 One of the authors (dbp) has recorded about 2500 WSJ sentences. The most difficult sentences to record were the longest ones. After a little practice, verbalized-punctuation sentences were only slightly harder to read than the non-verbalized-punctuation ones. This slight additional difficulty can be accounted for by the fact that the verbalized-punctuation sentences average about 10% longer than the non-verbalized-punctuation ones.</Paragraph> <Paragraph position="3"> The 20K test pools were produced by randomly selecting quality-filtered paragraphs until 8K (4K dev. test and 4K eval. test) sentences had been selected. This produced a realized vocabulary of 13K words. Since this data set was produced in a vocabulary-insensitive manner, it can be used without bias for open- and closed-recognition-vocabulary testing at any vocabulary size up to 64K words. (However, using it for open-vocabulary testing at any vocabulary size less than 20K will yield a large number of out-of-vocabulary errors--the top-20K of the WFL (the 20K open vocabulary) has a frequency-weighted coverage of 97.8% of the data.) Attempts to produce the 5K-vocabulary test pools by the same method produced too few sentences to be useful (~1200). Thus it was necessary to use a vocabulary-sensitive procedure--paragraphs were allowed to contain up to one out-of-top-5.6K-WFL word (the filtering rules are sketched below).
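A compact sketch of the quality-filtering rules just described follows (simplified: duplicate-sentence removal for the test pools and the final human checks are omitted, and the paragraph/word representations are assumed):

    def quality_filter(paragraphs, top_wfl_words, max_oov=0):
        # paragraphs: list of paragraphs, each a list of sentences (token lists).
        # top_wfl_words: set of allowed words (the top-64K WFL; the top-5.6K list
        # with max_oov=1 for the vocabulary-sensitive 5K selection).
        kept = []
        for para in paragraphs:
            words = [w for sent in para for w in sent]
            if len(words) < 3:                            # too short to be useful
                continue
            if not (3 <= len(para) <= 8):                 # keep selection units reasonable
                continue
            if any(len(sent) > 30 for sent in para):      # long sentences are hard to read
                continue
            oov = sum(1 for w in words if w not in top_wfl_words)
            if oov > max_oov:                             # typos and rare words land here
                continue
            kept.append(para)
        return kept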
This produced the highest yield (~4K sentences with a realized vocabulary of 5K words) and reduces, but does not completely eliminate, the tail of the word-frequency distribution. This test set allows open- and closed-vocabulary testing at a 5K-word vocabulary, but would be expected to yield somewhat biased test results if used at larger test vocabularies[10,14]. The top-5K of the WFL (the 5K open vocabulary) has a frequency-weighted coverage of 91.7% of the data.</Paragraph> <Paragraph position="4"> Finally, the evaluation test paragraphs were broken into four separate groups. This was done to provide four independent evaluation test sets.</Paragraph> <Paragraph position="5"> The recording sites selected a randomly chosen subset of the paragraphs from the pool corresponding to the database section being recorded (with replacement between subjects) for each subject to read. The sentences were recorded one per audio file. All subjects recorded one set of the 40 adaptation sentences.</Paragraph> </Section> <Section position="9" start_page="359" end_page="360" type="metho"> <SectionTitle> OTHER WSJ DATABASE COMPONENTS </SectionTitle> <Paragraph position="0"> The above describes the selection and recording of the acoustic portion of the WSJ-pilot database. Additional components--such as a dictionary and language models--are required to perform recognition experiments. Dragon Systems Inc., under a joint license agreement with Random House, has provided a set of pronouncing dictionaries--totaling 33K words--to cover the training and the 5K- and 20K-word open and closed test conditions. This dictionary also includes the 1K-word Resource Management[15] vocabulary to allow cross-task tests with an existing database. MIT Lincoln Laboratory, as part of its text selection and preprocessing effort, has provided baseline open and closed test vocabularies based upon the test-set realized vocabularies and the WFL for the 5K and 20K test sets. Lincoln has also provided 8 baseline bigram back-off[8,11] language models (5K/20K words x open/closed vocab. x verbalized/non-verbalized punct.) for research and cross-site comparative evaluation testing. Finally, language-model training data and utilities for manipulating the processed texts have been made available to the recording and CSR research sites.</Paragraph> <Paragraph position="1"> NIST compiled the data from the three recording sites (MIT, SRI, and TI), formatted it, and shipped it to MIT, where WORM CD-ROMs were produced for rapid distribution to the CSR development sites.</Paragraph> </Section> </Paper>