<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1004"> <Title>Lexicons for Human Language Technology</Title> <Section position="4" start_page="0" end_page="13" type="metho"> <SectionTitle> 2. Current English Lexicon Efforts </SectionTitle> <Paragraph position="0"> Our primary effort is to provide lexicons for English. We are funding a large, high-quality English pronunciation lexicon; an English syntactic lexicon, including detailed information about syntactic properties of verbs; and a set of improvements to an existing lexicon of English word-sense differences. All three lexicons will eventually be tied to appropriately sampled occurrences in text and speech corpora.</Paragraph> <Paragraph position="1"> Based on an original proposal from Ralph Grishman and James Pustejovsky, we have called our English lexicon &quot;COMLEX,&quot; for COMmon LEXicon.</Paragraph> <Section position="1" start_page="0" end_page="13" type="sub_section"> <SectionTitle> 2.1. Pronunciation: PRONLEX </SectionTitle> <Paragraph position="0"> For the COMLEX English pronouncing dictionary (&quot;PRONLEX&quot;), the LDC has obtained (by purchase or donation) rights to combine four existing large, high-quality lexicons. Bill Fisher at NIST has been carrying out a pilot project to design a consensus segment set and to map the representations in the multiple sources into it automatically. Words on which the various sources agree will then be accepted, disagreements will be adjudicated by human judges, and new words will be added as needed.</Paragraph> <Paragraph position="1"> The sources we are starting from will provide coverage of more than 250K word forms. Appropriate coverage of the words found in the various ARPA speech recognition databases will also be guaranteed. We solicit suggestions for lists of other words to cover, such as proper names, including surnames, place names, and company names.</Paragraph> <Paragraph position="2"> The pronunciation representations used in the first release of PRONLEX, being based on those in the lexicons we are starting from, will be similar to those provided in typical dictionary pronunciation fields and used in most of today's speech recognition systems. This level is best described as &quot;surface phonemic&quot; rather than phonetic: it abstracts away from most dialect variation, context-conditioned variation, and casual-speech reduction. Pat Keating at UCLA has been carrying out a pilot project to examine systematically the relationship between such normative pronunciations and the actual phonetic segments found when the corresponding words are used in conversational speech. We provided a sample of occurrences of words with high, medium, and low frequencies, drawn from the Switchboard database. We will use the results of this study to plan how to improve the pronunciations in the initial release of PRONLEX. Readers are invited to join an on-going email discussion of this topic.</Paragraph> </Section>
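As a rough illustration of the merging step described above for PRONLEX, the following sketch maps each source lexicon's pronunciations into a shared segment set, accepts the words on which the mapped sources agree, and queues the rest for human adjudication. This is a minimal sketch, not the NIST tooling: the source formats, segment symbols, and mapping tables are invented for the example, and the real pilot project also involves designing the consensus segment set itself.

```python
# Minimal sketch of the PRONLEX merging idea: map each source lexicon's
# pronunciations into a shared "consensus" segment set, accept words on
# which the mapped sources agree, and queue disagreements for human review.
# Source formats, segment symbols, and mapping tables here are hypothetical.

from collections import defaultdict

def to_consensus(pron, segment_map):
    """Rewrite one pronunciation (a list of source-specific segment symbols)
    into the consensus segment set via a per-source mapping table."""
    return tuple(segment_map[seg] for seg in pron)

def merge_lexicons(sources):
    """sources: list of (lexicon, segment_map) pairs, where each lexicon
    maps a word to its pronunciation in that source's own notation."""
    candidates = defaultdict(set)
    for lexicon, segment_map in sources:
        for word, pron in lexicon.items():
            candidates[word].add(to_consensus(pron, segment_map))

    accepted, to_adjudicate = {}, {}
    for word, prons in candidates.items():
        if len(prons) == 1:                 # all sources agree after mapping
            accepted[word] = next(iter(prons))
        else:                               # conflicting pronunciations
            to_adjudicate[word] = prons     # hand these to the human judges
    return accepted, to_adjudicate

# Toy example with two hypothetical sources and trivial mapping tables;
# "tomato" ends up in the adjudication queue because the sources disagree.
src_a = ({"tomato": ["t", "ah", "m", "ey", "t", "ow"]},
         {"t": "T", "ah": "AH", "m": "M", "ey": "EY", "ow": "OW"})
src_b = ({"tomato": ["t", "ah", "m", "aa", "t", "ow"]},
         {"t": "T", "ah": "AH", "m": "M", "aa": "AA", "ow": "OW"})
accepted, disputed = merge_lexicons([src_a, src_b])
print(accepted, disputed)
```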
<Section position="2" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 2.2. COMLEX Syntax </SectionTitle> <Paragraph position="0"> A lexicon of syntactic information, known as &quot;COMLEX Syntax,&quot; is under development by Ralph Grishman and others at NYU. After designing the feature set and representational conventions, Grishman created a zeroth-order mock-up from existing resources. This has been circulated for comments and is available to interested parties from the LDC, along with the specifications for the syntactic features and lexical representations to be used. The project at NYU is now redoing the lexicon by hand, guided by corpus-derived examples. The first release will occur later this year.</Paragraph> </Section> <Section position="3" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 2.3. COMLEX Semantics </SectionTitle> <Paragraph position="0"> The existing WordNet lexical database, available from George Miller's group at Princeton, provides a number of kinds of semantic information, including hypo/hypernym relations and word sense specification. In order to improve the quality of its coverage of real word usage, and to provide material for training and testing &quot;semantic taggers,&quot; the LDC has funded an effort by Miller's group to tag the Brown corpus using WordNet categories, modifying WordNet as needed.</Paragraph> </Section> <Section position="4" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 2.4. COMLEX Corpus </SectionTitle> <Paragraph position="0"> Because of the Zipfian 1/f distribution of word frequencies, a corpus would have to be unreasonably large in order to offer a reasonable sample of an adequate number of words. Although it is no longer difficult to amass a corpus of hundreds of millions or even billions of words, complete human annotation of such a corpus is impractical. Therefore the LDC proposes to create a new kind of sampled corpus, offering a reasonable sample of the words in a lexicon the size of COMLEX Syntax, so that human annotation or verification of (for instance) four million tokens would provide 100 instances of each of 40K word types. This sampled corpus (in reality to be sampled according to a more complex scheme) can then be &quot;tagged&quot; with both syntactic and semantic categories. The entire corpus from which the sample is drawn will also be available, so that arbitrary amounts of context can be provided for each citation. The design of this sampled corpus is still under discussion, and reader participation is again invited.</Paragraph> </Section> </Section>
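As a rough illustration of the simple version of this sampling scheme (up to 100 citations per type for each word in a target lexicon, with the full corpus retained so that more context can always be supplied), the following sketch gathers occurrences from a tokenized corpus. The corpus representation, the target word list, and the per-type quota are assumptions made for the example; the actual COMLEX corpus, as noted above, will be sampled according to a more complex scheme.

```python
# Minimal sketch of the simple sampling scheme: for each word type in a
# target lexicon, draw up to N occurrences (as pointers into the corpus)
# so that annotators see roughly N citations per type. Corpus format,
# target word list, and quota are assumptions; the real design is richer.

import random
from collections import defaultdict

def sample_citations(sentences, target_types, n_per_type=100, seed=0):
    """sentences: list of tokenized sentences (lists of lowercased words).
    Returns {word_type: [(sentence_index, token_index), ...]}."""
    occurrences = defaultdict(list)
    for s_idx, sent in enumerate(sentences):
        for t_idx, tok in enumerate(sent):
            if tok in target_types:
                occurrences[tok].append((s_idx, t_idx))

    rng = random.Random(seed)
    sampled = {}
    for wtype, occs in occurrences.items():
        if len(occs) > n_per_type:          # cap frequent words at the quota
            occs = rng.sample(occs, n_per_type)
        sampled[wtype] = sorted(occs)       # rare words keep all occurrences
    return sampled

# Toy usage; 100 citations each for 40K types gives the ~4M-token figure
# mentioned above.
corpus = [["the", "red", "herring", "was", "obvious"],
          ["a", "chair", "lift", "carried", "the", "skiers"]]
print(sample_citations(corpus, {"herring", "lift", "the"}, n_per_type=100))
```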
<Section position="5" start_page="13" end_page="15" type="metho"> <SectionTitle> 3. Other Languages </SectionTitle> <Paragraph position="0"> This past year, we cooperated with the CELEX group in the Netherlands to publish their excellent lexical databases for English, German and Dutch. In this case, our willingness to pay for CD-ROM production, to handle the technical arrangements for publication, and to shoulder some of the burden of distribution was enough to help bring this resource into general availability.</Paragraph> <Paragraph position="1"> As a first step towards providing new lexical resources in languages other than English, the LDC has begun an effort to provide medium-sized pronouncing dictionaries in a variety of languages. This effort, which will be co-ordinated with efforts to provide transcribed speech and text resources in the same languages, is beginning this year with Japanese, Mandarin, and Spanish. It aims at coverage comparable to an English dictionary with about 20K words.</Paragraph> <Paragraph position="2"> As the U.S. speech research community begins to work on languages other than English, it is confronted with new issues that have reflexes in the design and implementation of even such a simple-seeming object as a pronouncing dictionary. Again, we solicit the community's participation in helping us choose a useful approach. In the next section, we would like to highlight one of the questions that will need to be answered, language by language, in the early stages of such a project.</Paragraph> <Paragraph position="3"> Morphology? The question is, how should orthographically-defined units be broken up or combined in a lexicon? One answer, which is the easiest one to give for an English pronouncing dictionary for speech recognition applications, is &quot;not at all: list all and only the orthographic units paired with their pronunciations.&quot; For other languages, this answer may no longer apply.</Paragraph> <Paragraph position="4"> Table 1 shows (for 5 databases of journalistic text) how many word types are needed to account for various percentages of word tokens. In all languages except Chinese, the principles for defining &quot;words&quot; in the text were the same: a contiguous string of alphabetic characters flanked by white space, preceded and followed by any number of punctuation characters, with case distinctions ignored. All &quot;words&quot; containing digits or other non-alphabetic characters were left out of the counts, except that a single internal hyphen was permitted (a rough sketch of this counting rule appears below). In the case of Chinese, the notion of &quot;word&quot; was replaced by &quot;character&quot; for purposes of calculating this table. As Table 1 shows, languages with a larger number of inflected forms per word, or with more productive derivational processes not split up in the orthography (such as German compounding), tend to require a larger number of word types to match a given number of word tokens.</Paragraph> <Paragraph position="5"> The counts for Chinese represent the other extreme, in which every morpheme (= Chinese character) is written separately, and the orthography does not even indicate how these morphemes are grouped into words (either in the phonological sense, or in the sense that any Chinese dictionary lists tens of thousands of 2-, 3-, or 4-character combinations whose meaning is not predictable from the meaning of the parts).</Paragraph> <Paragraph position="6"> In English, the orthographic word is a fairly convenient unit both for pronunciation determination and for language modeling. Depending on the mix of word types in the sample, there are only about 2 to 2.5 inflected forms per &quot;lemma&quot; (base form before inflection), and the rules of regular inflection are fairly easy to write. Productive derivation of new words from old (e.g. &quot;sentencize&quot;) is not all that common. Most compounds are written with white space between the members, even if their meaning and stress are not entirely predictable (e.g. &quot;red herring,&quot; &quot;chair lift&quot;). For these reasons, a moderate-sized list of English orthographic forms can be found that will achieve good coverage in new text or speech.</Paragraph> <Paragraph position="7"> Smoothing is required for good-quality n-gram modeling of English word sequences in text, but morphological relations among words have not been an important dimension in most approaches. Language models, like pronunciation models, can thus treat English orthographic words as atoms. As a result, from the point of view of speech recognition technology, there has not been a strong need for an English pronouncing dictionary that encodes morphological structure and features.</Paragraph>
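The counting rule described for Table 1, and the type-versus-token coverage it tabulates, can be approximated along the following lines. This is our own reconstruction rather than the original counting scripts: it handles only ASCII letters, the punctuation stripping is simplified, and real newswire text would need more careful preprocessing.

```python
# Reconstruction (not the original scripts) of the word-counting convention
# used for Table 1: a "word" is a contiguous alphabetic string with at most
# one internal hyphen, case-folded, with flanking punctuation stripped;
# tokens containing digits or other non-alphabetic material are discarded.
# ASCII letters only in this sketch.

import re
import string
from collections import Counter

WORD = re.compile(r"^[a-z]+(?:-[a-z]+)?$")

def count_words(text):
    counts = Counter()
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)   # strip flanking punctuation
        if WORD.match(token):
            counts[token] += 1
    return counts

def types_needed(counts, coverage=0.9):
    """How many of the most frequent word types account for the given
    fraction of word tokens (the quantity tabulated in Table 1)."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= coverage:
            return n
    return len(counts)

text = "The well-known facts, the facts themselves, speak for themselves."
c = count_words(text)
print(c, types_needed(c, coverage=0.8))
```

The cross-sample figures discussed for Table 2 below can be obtained in a similar way, by building the word list from one sample and measuring token coverage on a different one.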
<Paragraph position="8"> However, the situation in German may be different. As Table 1 suggests, simple reliance on word lists derived from a given amount of German text will produce significantly lower coverage than in a corresponding English case, and even very large lexicons will leave a surprisingly large number of words uncovered. Thus the Celex German lexicon, which contains 359,611 word forms corresponding to 50,708 lemmas, failed to cover about 10% of a sample of German text and transcribed speech. Of the missing words, about half were regular compounds whose pieces were in the lexicon (e.g. Lebensqualität), while by comparison less than 1/6 were proper names.</Paragraph> <Paragraph position="9"> The same sort of relative difficulty in unigram coverage can be seen in Table 2, where we look at the count of word types for a lexicon derived from one sample in order to cover a given percentage of word tokens in another sample. German requires a two- or three-times larger lexicon than English does to achieve a given level of coverage, and the factor increases with the coverage level. This is not because of differences in the type of text--all samples are drawn from the same or similar newswires, covering the same or similar distributions of topics. Spanish is in between German and English in this matter.</Paragraph> <Paragraph position="10"> One simple approach is to make the lexicon into a network that generates a large set of words and their pronunciations. Thus German Lebensqualität will be derived as a compound made up of Leben and Qualität. The point of such an exercise is not to shrink the size of the lexicon, or to express its redundancies (although both are consequences), but rather to predict how the forms we have seen will generalize to the much larger number of forms we have not seen yet (a toy sketch of such compound analysis appears below).</Paragraph> <Paragraph position="11"> A similar issue arises for inflectional morphology. An Italian verb has at least 53 inflected forms (3 persons by 2 numbers by 7 combinations of tense, aspect and mood, plus 4 past participle forms, 5 imperative forms, the infinitive and the gerund). Several hundred additional &quot;cliticized&quot; forms (joining the infinitive, the gerund and three of the imperative forms with various combinations of the 10 direct object and 10 indirect object pronouns) are also written without internal white space. In a database of 3.2M words of Italian, forms of the common verb &quot;cercare&quot; (to look for) occur 1818 times, but 8 of the 53 regular forms are missing, as are a larger number of the possible combinations with object pronouns.</Paragraph> <Paragraph position="12"> Forms of the (also fairly common) verb &quot;congiungere&quot; occur 89 times, and 41 of its 53 forms are missing. This indicates both the difficulty of finding all inflected forms as unigrams by simple observation, and also the greater problem for language modeling caused by the distribution of a lemma's probability mass among its various forms.</Paragraph> <Paragraph position="13"> It is not obvious what the right approach is to these cases, so researchers should have convenient access to lexicons that can easily be reconfigured to provide various types and degrees of subword analysis.</Paragraph>
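The compound analysis suggested above for German can be illustrated as a recursive lookup against the base lexicon, optionally allowing a linking element between members. This is a toy sketch: the lexicon contents and the list of linking elements are invented for the example, and a real lexicon-as-network would also supply pronunciations, rank competing analyses, and cover inflection.

```python
# Toy sketch of treating the lexicon as a generative device for German
# compounds: an out-of-vocabulary form is accepted if it can be split into
# known entries, optionally joined by a linking element ("Fugenelement").
# The lexicon contents and linker list are invented for the example.

LINKERS = ("", "s", "es", "n", "en")

def split_compound(word, lexicon, min_part=3):
    """Return a list of lexicon entries that concatenate (with optional
    linking elements) to form `word`, or None if no analysis is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for link in LINKERS:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], lexicon, min_part)
                if tail:
                    return [head] + tail
    return None

lexicon = {"leben", "qualität", "bahn", "hof"}
print(split_compound("Lebensqualität", lexicon))   # ['leben', 'qualität']
print(split_compound("Bahnhof", lexicon))          # ['bahn', 'hof']
```

As emphasized above, the point of such an analysis is not compression but generalization: forms like Lebensqualität that never occurred in the training text can still be assigned a structure, and hence a pronunciation and a place in the language model.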
<Paragraph position="14"> Chinese presents exactly the opposite problem. The Taiwanese newspaper text used in the counts (done by Richard Sproat of AT&T Bell Labs) employs a total of about 7,300 character types in a corpus of more than 17M character tokens. Each character (with a few exceptions) is pronounced in just one way, as a single syllable. However, a given syllable might be written as quite a few different possible characters, each one (roughly speaking) a separate morpheme. There is no inflection in Chinese, but there is a lot of compounding of morphemes into words with unpredictable meanings. A typical Chinese dictionary will list tens of thousands of such combinations, and new forms are seen all the time, just as in German. However, this compounding is not indicated in the orthography.</Paragraph> <Paragraph position="15"> A language model based on (at least some) compound words will of course be effectively of higher order than one based only on characters. Again, there are several approaches to this question, ranging from explicit listing of the largest possible number of multiple-character words on standard lexicographical criteria, to a simple smoothed N-gram model based on individual characters as the only unigrams. This issue has a phonetic side as well, since multiple-character words in Mandarin often have a fixed or strongly preferred stress pattern, and at least for some dialects, unstressed syllables may be strongly reduced.</Paragraph> <Paragraph position="16"> Both issues--explicit representation of the internal structure of certain orthographic words, and grouping of several contiguous orthographic words as a single lexical entry--have scattered echoes in speech recognition technology as applied to English. However, other languages put these (and other) questions on the agenda in a much stronger form.</Paragraph> </Section> <Section position="6" start_page="15" end_page="17" type="metho"> <SectionTitle> 4. New Kinds of Lexicons </SectionTitle> <Paragraph position="0"> New ARPA tasks are likely to require new kinds of resources. For instance, the outcome of the on-going discussion about semantic evaluation will probably motivate new sorts of lexicons as well as new kinds of annotated corpora.</Paragraph> <Paragraph position="1"> [Fragment of the caption for Table 2: word types, from a lexicon derived from a 500K-word sample, needed to cover various percentages of word tokens in a non-contiguous sample (about two months away). An asterisk means coverage at that level was not possible from the given sample.]</Paragraph> </Section> </Paper>