<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-1063">
  <Title>The Typology of Unknown Words: An Experimental Study of Two Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.0 Related work
</SectionTitle>
    <Paragraph position="0"> In his pioneering article [Damerau 64], Damerau gives valuable information about the frequency of typographical errors. He indicates that, typically, 80% of all ill-formed words in a document are the result of one of four typographical errors: * Transposition of two letters, e.g. étbali instead of établi; * Insertion of one extra letter, e.g. économioque instead of économique; * Deletion of one letter, e.g. additionelle instead of additionnelle; * Substitution of a valid letter by one that is wrong, e.g. ogligé instead of obligé.</Paragraph>
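The four error types above can be told apart mechanically when the intended word is known. The following is a minimal sketch, not taken from the paper; the function name classify_single_error is hypothetical, and the sketch assumes accented letters are stored as single precomposed characters.

```python
def classify_single_error(wrong, intended):
    """Return which of the four single-error types turns `intended` into `wrong`.

    Returns None when the words are identical or differ by more than one error.
    """
    if wrong == intended:
        return None
    lw, li = len(wrong), len(intended)
    if lw == li:
        diffs = [i for i in range(lw) if wrong[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and wrong[diffs[0]] == intended[diffs[1]]
                and wrong[diffs[1]] == intended[diffs[0]]):
            return "transposition"
        return None
    if lw == li + 1:
        # The typed word has one extra letter.
        if any(wrong[:i] + wrong[i + 1:] == intended for i in range(lw)):
            return "insertion"
    if li == lw + 1:
        # The typed word is missing one letter.
        if any(intended[:i] + intended[i + 1:] == wrong for i in range(li)):
            return "deletion"
    return None
```

Applied to the examples in the text, the pairs (étbali, établi), (économioque, économique), (additionelle, additionnelle) and (ogligé, obligé) fall into the transposition, insertion, deletion and substitution classes respectively.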
    <Paragraph position="1"> More recent results [Pollock and Zamora 83] also indicate that in most cases there is only one error per word. The classification of possible errors has been extended over the years to include other types of errors [Srihari 85, Szanzer 69, Veronis 88]. Based on this body of work, we can propose the following incomplete list of the possible nature of errors: * Typographical errors, which are errors of execution in carrying out the task of typing text on a keyboard; * Orthographic errors, which are errors of intention attributable to distraction or lack of knowledge on the part of the author; * Syntactic and semantic errors; * Errors committed during the input procedure, either by an optical character recognition device or by a speech recognition system; * Storage and transmission errors due to noisy electronics or communication channels. (Actes de COLING-92, Nantes, 23-28 août 1992, p. 408; Proc. of COLING-92, Nantes, Aug. 23-28, 1992.)</Paragraph>
    <Paragraph position="2"> 3.0 Corpus  Our typology of unknown words is based on the study of two corpora.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Hansard
</SectionTitle>
      <Paragraph position="0"> The first one, a French corpus called the Hansard, is a transcript of all the proceedings that took place in the Canadian House of Commons in 1986.</Paragraph>
      <Paragraph position="1"> Since Canada is officially a bilingual country, whenever Members of Parliament gather to debate laws, the transcripts of the session have to be made available in both English and French. On the day a session is held, transcripts are translated and printed rapidly so that the Members of Parliament have a bilingual copy of the previous day's session on their desks the next morning. The main characteristics of this corpus are:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Jobs
</SectionTitle>
      <Paragraph position="0"> The second corpus, called Jobs, was obtained from Employment and Immigration Canada and consists of English job offers. Employment centres across Canada receive calls from employers offering job opportunities.</Paragraph>
      <Paragraph position="1"> Clerks are responsible for answering the telephone and writing up the job postings.</Paragraph>
      <Paragraph position="2"> The main characteristics of this corpus are: * Telegraphic style; * Manually typed into a computer program that has a rigidly formatted interface; * Made up solely of text originally written in English; * Written rapidly by a clerk.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Extracting unknown words
</SectionTitle>
      <Paragraph position="0"> The two corpora differ in nature and in the way their respective lists of unknown words were extracted.</Paragraph>
      <Paragraph position="1"> For the Hansard corpus we tokenized the text and automatically tagged each token with a part of speech [Foster 91]. From this list we then removed all punctuation, numbers and words beginning with a capital letter (proper nouns and abbreviations merit separate study). We then singled out all the words that could not be found in an electronic dictionary. For this operation we used the DMF [Bourbeau, Pinard 86], which contains the equivalent of 59 000 entries.</Paragraph>
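The filtering steps just described can be sketched as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer and the function name extract_unknown_words are hypothetical, the part-of-speech tagging step is omitted, and a plain Python set stands in for the DMF dictionary.

```python
import re
from collections import Counter

# Hypothetical tokenizer: words (with internal hyphens or apostrophes)
# or single punctuation marks.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def extract_unknown_words(text, dictionary):
    """Count tokens that survive the filters and are absent from `dictionary`."""
    counts = Counter()
    for tok in TOKEN.findall(text):
        if not tok[0].isalpha():   # punctuation and numbers are dropped
            continue
        if tok[0].isupper():       # capitalized words are set aside
            continue
        if tok in dictionary:      # known words are dropped
            continue
        counts[tok] += 1
    return counts
```

The result maps each distinct unknown word to its number of occurrences, which corresponds to the two quantities (cases and occurrences) reported later in the paper.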
      <Paragraph position="2"> As for the English corpus most of the work was done by hand. We tokenized the text as previously described but the sifting of punctuation, numbers, words beginning with a capital letter and known words, was done manually, leaving a list of unknown words.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.0 Typology
</SectionTitle>
    <Paragraph position="0"> We have divided the list of unknown words into two main groups. G1 contains words that could not be recognized but were correct, while G2 contains erroneous words. We have further subdivided these two groups into different types of unknown word.</Paragraph>
    <Paragraph position="1"> Our goal has been to identify tendencies in this group we call "unknown words". In doing so, we increased the number of types, and inevitably some of these types intersect. We have relied on our intuition and experience to assign the most plausible type to each unknown word.</Paragraph>
    <Paragraph position="2"> In this section each of these types is described and illustrated with numerous examples. In addition, in the case of G2 types, we speculate on the possible causes of error.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 G1 : Correct words
</SectionTitle>
      <Paragraph position="0"> In principle, proper nouns should not be part of the list of unknown words since we removed all words beginning with a capital letter. But a few occurrences of proper nouns appeared with the wrong capitalization, and in other cases a lower case component of a proper noun (isolated by the tokenization process) was found.</Paragraph>
      <Paragraph position="1"> E.g. ottawa (Ottawa) [footnote 1], nai (B'nai Brith)</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.1.2 Abbreviations
</SectionTitle>
    <Paragraph position="0"> Upper case abbreviations (acronyms, initials, etc.) are not considered to be unknown words, but a few (common) abbreviations are written in lower case and thus end up in the unknown word list.</Paragraph>
    <Paragraph position="1"> E.g. km (kilomètre), pub (publicité)
    Although numbers and punctuation have not been considered valid unknown word candidates, letters are sometimes used as Roman numerals, and so a few ordinal numbers were found.</Paragraph>
    <Paragraph position="2"> E.g. i (1), iv (4)
    Regional words are words or expressions that cannot be found in traditional dictionaries. Some of them can be found in specialized dictionaries [Shiaty 88] and some of them can be identified by native speakers.</Paragraph>
    <Paragraph position="3"> E.g. abrier (couvrir), bécosses (toilettes), cenne (sou)
    Scholarly words include technical or rare words. They can be found in large reference tools like Termium [footnote 2]. E.g. écosphère, amoxicillin, anadrome, ayatollah
    Certain expressions (French and Latin mostly) are made up of several elements separated by spaces. Isolated from the rest of the expression, some of these elements cannot be recognized.</Paragraph>
    <Paragraph position="4"> E.g. facto (de facto), wa (oskee wa wa), feminem (ad feminem)
    In the Hansard this category corresponds to anglicisms or English words appearing in a quote.</Paragraph>
    <Paragraph position="5"> E.g. abortionniste, affluente, runnés. However, we also found foreign words in the English corpus.</Paragraph>
    <Paragraph position="6"> E.g. chad chow, noel, solicite
    Derived words are very productive. The number of occurrences of this type of unknown word in the Hansard represents almost 30% of all unknown words. In French we found 96 affixes that were used to form new words.
    1. In the context of an example, parentheses indicate the correct or intended word.</Paragraph>
    <Paragraph position="7"> 2. The terminological data bank of the Translation Bureau of the Department of the Secretary of State of Canada.
    Certain words have both a prefix and a suffix at the same time.</Paragraph>
    <Paragraph position="8"> E.g. rééchelonnement, précommercialisation. Certain affixes are more productive than others: anti-, dé-, dés-, extra-, sur-, in-, inter-, ré-, super-; -age, -ation, -ien, -eur, -iser, -ment
    We excluded from the unknown word list compounds beginning with a capital letter and compounds that cannot be recognized either as a whole or when the elements are considered individually. The unknown words classified as compounds are: ones that should start with a capital letter but do not; those in which the necessary spaces or hyphens have been deleted, i.e. the elements have been concatenated; and compounds made up of an element that cannot be recognized (often because of the 'o' infix).</Paragraph>
    <Paragraph position="9"> E.g. câblodistributeurs, chimio-dépendance, radioastronomique</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.1.10 Garbled words
</SectionTitle>
    <Paragraph position="0"> We include in this category words that are divided by a blank space, words that are joined together but are not compounds and words which are, in general, affected by electronic noise. Although in some ways this could be considered an error, we did not want to put this category in G2 because contrary to other types in G2, in this case the writer cannot be held responsible for the error.</Paragraph>
    <Paragraph position="1"> E.g. employEs, sAvez-vous, afinque, erreur.Ce</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 G2: Erroneous words
</SectionTitle>
      <Paragraph position="0"> These errors are unique to the French corpus and can be subdivided into four types:
      This type of error is unique to the English corpus and corresponds to problems with hyphens and apostrophes. There are three cases:
      Knowing the configuration of a standard keyboard and the way people type suggests several plausible reasons for the insertion of superfluous characters. * A key is held down too long, generating sequences of identical letters.</Paragraph>
      <Paragraph position="1"> E.g. étonnne, access, beaaucoup, paartnership * The finger strikes two contiguous keys at the same time. E.g. économioque, égalememnt, profcessional, tltgen * 'Influence' of other letters in the same word. E.g. évidememenl, aéoroport, accueuillir, taboubli, electrolologist. Other instances of insertion seem to be simply attributable to a lack of knowledge of the language. E.g. éperduement, absoluement, orthopaedic, paediatric.</Paragraph>
      <Paragraph position="2"> For another group of insertion-type errors, no obvious explanation could be found.</Paragraph>
      <Paragraph position="3"> E.g. constinué, lotusi, manchine, experiencep
      The omission of a character is the most common typographical error. This is probably related to situations where rapid typing is required and where the mind might work faster than the hand. Here is a list of the ten most frequently omitted letters (the percentages are based on the total number of words in this class): Letter: r s i n t e p c l a
      This is a fairly complex category. Substitution of one letter for another can be typographical or orthographic in nature. Some tentative explanations include: * The letter is replaced by an adjacent letter.
      ... include here the displacement of a single letter. E.g. avatanges, comagpnies, avalaible, expierence
      There are not many errors under this heading, since no syntactic analysis has been done in order to extract the list of unknown words. What we have here are errors of morphology and conjugation.</Paragraph>
      <Paragraph position="4"> E.g. étée (été), pines (pin), cloths (clothes)
      There are a few remaining words which we could not fit in the other categories; some of them are incorrect while others can be considered spelling variations that are not fully standard.</Paragraph>
      <Paragraph position="5"> E.g. tee shirt (T-shirt), thru (through)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.0 Frequency of unknown words
</SectionTitle>
      <Paragraph position="0"> The Hansard corpus contains 4 173 506 tokens. Among these tokens we found 2 982 distinct unknown words occurring 9 301 times. This represents 0.2% of all tokens.</Paragraph>
      <Paragraph position="1"> The Jobs corpus contains 140 482 tokens. Of those, 1 016 were distinct unknown words occurring 2 109 times. This represents 1.5% of all tokens.</Paragraph>
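As a quick check of the two rates quoted above, the percentages can be recomputed directly from the token and occurrence counts given in the text. The helper name unknown_rate is hypothetical.

```python
def unknown_rate(occurrences, tokens):
    """Percentage of corpus tokens that are occurrences of unknown words."""
    return 100.0 * occurrences / tokens

hansard = unknown_rate(9301, 4173506)   # counts quoted for the Hansard
jobs = unknown_rate(2109, 140482)       # counts quoted for Jobs
print(f"Hansard: {hansard:.2f}%  Jobs: {jobs:.2f}%")
# prints "Hansard: 0.22%  Jobs: 1.50%"
```

These values agree with the rounded figures of 0.2% and 1.5% reported in the text.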
      <Paragraph position="2"> We now present in tabular form the frequency distribution of unknown words in both corpora. For each type of unknown word we indicate the number of distinct words (cases) and the total number of occurrences (occ.) found.</Paragraph>
      <Paragraph position="3"> For each of these numbers, we also give the associated percentages over the total number of unknown words in both G1 and G2. Therefore the total percentages of G1 and G2 add up to 100 percent.</Paragraph>
      <Paragraph position="4">  The following points should be noted: * A word containing two errors is accounted for in two categories. This explains why the total is slightly higher than the total number of unknown words given previously. * In the Hansard there are 16 words (0.17%) that contain more than one error per word and 94 words (1.01%) that belong to both G1 and G2 (e.g. a word can be incorrect and be derived at the same time).</Paragraph>
      <Paragraph position="5"> On the other hand, with Jobs there are 42 words (1.99%) that contain more than one error per word. These results are comparable to Damerau's findings about the preponderance of single error words.</Paragraph>
      <Paragraph position="6"> * Of course, different extraction procedures give different results. The Hansard contains a great many correct words not in the DMF; on the other hand the Jobs list of unknown words contains very few of those correct words. When faced with a word they do not recognize immediately, humans have the option of consulting a dictionary (general or specialized) and even if the word is not in any of those, the person can still rely on his or her intuition about word composition and derivation in order to accept a word.</Paragraph>
      <Paragraph position="7"> * In the case of the Hansard the total number of occurrences in G1 (71.93%) is much higher than the total number of occurrences in G2 (28.07%). This significant result shows that instead of putting all of our efforts into trying to develop a better error corrector, we would gain a lot from looking into ways of dealing with the deficiencies of our lexical databases.</Paragraph>
      <Paragraph position="8"> * Since English does not have accents, this category is not represented in G2 of Jobs.</Paragraph>
      <Paragraph position="9"> * On the other hand, errors involving hyphens and apostrophes are very common in the Jobs corpus. We classified these as punctuation errors.</Paragraph>
      <Paragraph position="10"> * We believe that the punctuation category of G2 Jobs is not representative of English in general. The high frequency of this type of error is due to a peculiarity of the program responsible for the input of job descriptions which encourages the use of hyphens to parenthesize text.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.0 Recognizing unknown words
</SectionTitle>
      <Paragraph position="0"> In this section we examine possible avenues of investigation designed to deal with the different unknown word types.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 G2: Erroneous words
</SectionTitle>
      <Paragraph position="0"> When confronted with an unknown word, the ideal NLP system would be able to understand the text and to deal with what was intended by the writer, and not just what he wrote. But of course this is not within the scope of current technology.</Paragraph>
      <Paragraph position="1"> A more realistic goal is to try to deal with typographical errors and a lot of attention over the years has been given to the detection and correction of such errors. Different methods have been proposed, some completely automatic, others meant to assist humans in proof reading, some practical and usable, others of theoretical interest only.</Paragraph>
      <Paragraph position="2"> For a good overview of this field of research we suggest [Peterson 80], while [Pollock 82] contains an extensive bibliography.</Paragraph>
      <Paragraph position="3"> Despite years of research, the detection and correction of typographical errors remains a problem not entirely resolved. Commercial software as well as state-of-the-art techniques described in the literature can only propose approximate solutions. No program is capable of detecting every error and capable of always suggesting the right correction.</Paragraph>
      <Paragraph position="4"> Despite their limitations, some existing methods can still be useful and sometimes even better than most human correctors. This fact is well illustrated by the success of commercial detector/correctors available on the market, despite an overall performance that can at best be described as acceptable [Dinnematin and Sanz 90].</Paragraph>
      <Paragraph position="5"> In order to detect errors most techniques rely on a list of correct words known to the system (a dictionary), possibly augmented by a set of morphological rules.</Paragraph>
      <Paragraph position="6"> Amongst the possible approaches to typographical error correction, two methods seem to be more successful than the others. We can either compare the unknown word against each of the dictionary words, and if one of those comes close enough to the original word according to some measure of similarity, it can be used in its place (for an example see [Wagner and Fischer 74]). Or we can take an erroneous word, undo all possible errors we want to detect and then search the dictionary to see if any of those potential corrections produces a valid word. We call this method the hypothesis generation method. For example, a transposition error can be detected by transposing each pair of adjacent characters in the unknown word and then consulting the dictionary with the resulting words. This is essentially the technique used in such programs as the DEC-10 Spell software.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> The method based on a measure of similarity is too inefficient to be practical and is mostly of theoretical interest.</Paragraph>
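To make the cost of the similarity-based method concrete, here is a minimal sketch of a dictionary scan using the classic dynamic-programming edit distance in the style of Wagner and Fischer. It is an illustration rather than the procedure evaluated in the paper; the names edit_distance and closest_words and the max_distance threshold are assumptions. Every lookup computes a full distance table against every dictionary word, which is the source of the inefficiency noted above.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]

def closest_words(unknown, dictionary, max_distance=1):
    """Scan the whole dictionary and keep the words within `max_distance`."""
    scored = [(edit_distance(unknown, w), w) for w in dictionary]
    return sorted(w for d, w in scored if max_distance >= d)
```

Note that this plain distance counts a transposition of two adjacent letters as two operations, so a threshold of 1 would miss transposition errors unless the measure is extended.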
    <Paragraph position="1"> The latter is more efficient but also more approximate in that it is not guaranteed that we will find a correction if we did not expect the offending error.</Paragraph>
    <Paragraph position="2"> In both cases the contents of the dictionary must be carefully selected. It must be large enough to offer reasonable coverage, but on the other hand there is a real danger of using a list of words that is too big, in that a very extensive list will usually contain rare and archaic words that could correspond to errors on more frequent words.</Paragraph>
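The hypothesis generation method can be sketched as follows: each of the four single-error types is undone at every position, and only the hypotheses found in the dictionary are kept. This is an illustrative sketch, not the DEC-10 Spell implementation; the function name candidate_corrections and the default lower-case ASCII alphabet are assumptions.

```python
import string

def candidate_corrections(word, dictionary, alphabet=string.ascii_lowercase):
    """Undo every possible single error in `word` and keep the valid words."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    hypotheses = set()
    for left, right in splits:
        if len(right) >= 2:
            # Undo a transposition: swap two adjacent characters.
            hypotheses.add(left + right[1] + right[0] + right[2:])
        if right:
            # Undo an insertion: drop one character.
            hypotheses.add(left + right[1:])
        for ch in alphabet:
            # Undo a deletion: put one character back.
            hypotheses.add(left + ch + right)
            if right:
                # Undo a substitution: replace one character.
                hypotheses.add(left + ch + right[1:])
    hypotheses.discard(word)
    return sorted(hypotheses.intersection(dictionary))
```

The number of hypotheses grows with the word length and the alphabet size but not with the dictionary size, which is why this approach is cheaper than scanning the dictionary. As the text points out, an error outside the anticipated types yields no candidate at all.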
    <Paragraph position="3"> An error corrector integrated in an NLP system should allow us to reduce the dictionary search space by comparing the erroneous word only with dictionary words complying with the syntactic and semantic requirements valid at that point in the processing. This should make the search significantly more efficient. For example, if at some point we are expecting a verb and we encounter an unknown word, we could limit ourselves to the verbs in the dictionary when suggesting corrections.
    One interesting aspect of typographical error correction methods such as the hypothesis generation method is that they can also be used to correct some of the other types of errors. So with these methods, not only do we have a (somewhat approximate) solution to insertion, deletion, transposition and substitution errors, but in some cases they will also solve punctuation, accent and grammar errors. For example, in the case of accents, we can extend the French alphabet with the possible accented letters and simply use this alphabet to generate more candidate corrections.
    Again, if the hypothesis generation method is chosen, then further use can be made of the knowledge gained about the types of errors usually committed. For example, in order to minimize the number of hypotheses generated and to maximize the probability of finding the right correction, when testing the deletion of a character, one could attempt to "re-introduce" the character only in the case of the 10 most frequent deletions. More anecdotal knowledge gained through the sifting of the list of unknown words could also be of some use. For example, duplication of consonants was a frequent type of insertion error and thus, if only a few hypotheses are to be tried, unknown words with duplicate consonants could be considered prime candidates for insertion errors.</Paragraph>
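Two of the refinements mentioned above, re-introducing only the most frequently omitted letters and generating accented variants, might look like the sketch below. Several assumptions are made: the letter list reads the ten omitted letters given earlier as r, s, i, n, t, e, p, c, l, a; the table of accented variants is a simplified guess at the French accented letters; and the function names are hypothetical.

```python
# Assumed reading of the ten most frequently omitted letters.
FREQUENTLY_OMITTED = "rsintepcla"

# Simplified, assumed table of accented variants for French.
ACCENT_VARIANTS = {"a": "àâ", "c": "ç", "e": "éèêë",
                   "i": "îï", "o": "ô", "u": "ùûü"}

def restore_deleted_letter(word, dictionary, letters=FREQUENTLY_OMITTED):
    """Try re-inserting only the most frequently omitted letters."""
    found = []
    for ch in letters:
        for i in range(len(word) + 1):
            cand = word[:i] + ch + word[i:]
            if cand in dictionary and cand not in found:
                found.append(cand)
    return found

def restore_accents(word, dictionary):
    """Try replacing one unaccented letter by each accented variant."""
    found = []
    for i, ch in enumerate(word):
        for variant in ACCENT_VARIANTS.get(ch, ""):
            cand = word[:i] + variant + word[i + 1:]
            if cand in dictionary and cand not in found:
                found.append(cand)
    return found
```

Restricting the letters trades coverage for speed: a word whose missing letter is not in the list is simply not corrected. The accent helper handles only a single missing accent, so a word lacking two accents would need a second pass.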
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 G1 : Correct words
</SectionTitle>
      <Paragraph position="0"> The results collected in the course of our study should, at the very least, influence the amount of effort put into dealing with each of the different types of errors. The realization that a large percentage of unknown words are part of the G1 group warrants renewed effort in treating this type of problem.</Paragraph>
      <Paragraph position="1"> There will always be words that a system cannot recognize, if only because some of them belong to so-called open classes. But we can still reduce the number of such words.</Paragraph>
      <Paragraph position="2"> One obvious solution is to enrich the dictionary, for example with common abbreviations and expressions. Another similar, but more modular solution consists in supplementing the basic dictionary with auxiliary dictionaries. One could envision separate dictionaries for regional and scholarly words for example.</Paragraph>
      <Paragraph position="3"> The ordinals found in our corpora could easily be recognized by a grammar describing the formation of roman numerals.</Paragraph>
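Such a grammar for Roman numerals can be written as a regular expression. The sketch below is one possible formulation for values up to 3999, not something specified in the paper; the names ROMAN, VALUES and roman_value are assumptions.

```python
import re

# Thousands, hundreds, tens and units, in that order (lower case).
ROMAN = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)
VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_value(token):
    """Return the integer value of a Roman numeral, or None if not one."""
    token = token.lower()
    if not token or not ROMAN.fullmatch(token):
        return None
    total = 0
    # Pair each symbol with its successor; a smaller symbol placed
    # before a larger one is subtracted (as in iv or xc).
    for ch, nxt in zip(token, token[1:] + " "):
        v = VALUES[ch]
        total += -v if VALUES.get(nxt, 0) > v else v
    return total
```

A grammar of this kind accepts any well-formed numeral, so ordinary words that happen to be well-formed (for instance "mix" or "di") would also be accepted; context would be needed to separate the two readings.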
      <Paragraph position="4"> Foreign words represent a difficult problem. They are exceptions to the usual assumption that the whole text to be processed is expressed in the same language throughout.</Paragraph>
      <Paragraph position="5"> Although it does not completely solve the problem, the detection of such signs as double quotes, setting the words apart from the text, could be used to suggest that the following unknown words might be foreign.</Paragraph>
      <Paragraph position="6"> It might be possible to recognize garbled words and compounds by using methods similar to the ones used to treat G2 words. For example the deletion of a necessary hyphen could be detected and possibly corrected as is done for the deletion of an ordinary character.</Paragraph>
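For concatenated words and compounds, undoing the deletion of a space or hyphen amounts to trying every split point and checking both halves against the dictionary. A minimal sketch, with the hypothetical name split_candidates and an assumed pair of separators:

```python
def split_candidates(word, dictionary, separators=(" ", "-")):
    """Propose ways of restoring a deleted space or hyphen in `word`."""
    results = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            results.extend(left + sep + right for sep in separators)
    return results
```

With a dictionary containing "afin" and "que", the garbled form "afinque" cited earlier yields the candidates "afin que" and "afin-que"; choosing between them would require further knowledge.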
      <Paragraph position="7"> As we have seen, derived words represent an impressive percentage of the total number of unknown words. Even if we were to enlarge the dictionary we would never be able to include every derived word, for they are much too productive. Therefore the solution seems to lie in a rule-based description of derivation similar to the description of inflectional morphology. This will require integrating detailed studies of affixation and of the structure and semantic compositionality of derived words.</Paragraph>
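A very rough idea of what a rule-based treatment of derivation involves is given by the affix-stripping sketch below. It uses the productive affixes listed earlier in the paper, but it is only a toy: the function name derivation_analyses is hypothetical, and real French derivation involves stem changes and semantic constraints that simple string stripping does not capture.

```python
PREFIXES = ("anti", "dés", "dé", "extra", "inter", "in", "super", "sur", "ré")
SUFFIXES = ("ation", "ment", "iser", "age", "ien", "eur")

def derivation_analyses(word, dictionary):
    """List (prefix, base, suffix) analyses whose base is a known word."""
    analyses = []
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    for prefix in prefix_options:
        rest = word[len(prefix):]
        suffix_options = [""] + [s for s in SUFFIXES if rest.endswith(s)]
        for suffix in suffix_options:
            base = rest[:len(rest) - len(suffix)]
            # At least one affix must have been removed.
            if (prefix or suffix) and base in dictionary:
                analyses.append((prefix, base, suffix))
    return analyses
```

For example, with "nucléaire" in the dictionary, "antinucléaire" is analysed as the prefix anti- attached to a known base. A word such as précommercialisation would not be analysed by this sketch, since neither its prefix nor its stem alternation is covered.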
      <Paragraph position="8"> Finally, G1 words are perhaps more difficult to process than G2 words. As [Hayes and Mouradian 81] put it: "Since novel words are by definition not in the known vocabulary, how can we distinguish them from misspelling?" Most of the time (but not always) they will not be close enough to words in the dictionary for the system to make suggestions. The best one can hope for in this situation is to deduce from the context the maximum amount of information about the word, such as its role in the sentence. As for the ability to learn new vocabulary, this is beyond the capabilities of current artificial intelligence.</Paragraph>
    </Section>
  </Section>
</Paper>