<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0505"> <Title>Inflected Languages. Application to Basque Language</Title> <Section position="3" start_page="29" end_page="33" type="metho"> <SectionTitle> 2 Word Prediction Methods for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> Non-Inflected Languages </SectionTitle> <Paragraph position="0"> In this section some of the methods that have been used in word prediction for non-inflected languages are summarised. This short review will serve as a basis for the coming sections, helping to identify the key aspects involved in prediction. The methods are presented in order of increasing complexity, from the simplest to the most complex.</Paragraph> </Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 2.1 Probabilistic Methods </SectionTitle> <Paragraph position="0"> The simplest word prediction method is to build a dictionary containing words and their relative frequencies of occurrence. When the user starts typing a string of characters a, the predictor offers the n most frequent words beginning with this string, in the order in which they are stored in the system. The user can then choose the intended word from the list, or continue typing if it is not there.</Paragraph> <Paragraph position="1"> There are several studies of word frequencies in different languages; for instance, (Beukelman et al., 1984) gives information about the frequency of word occurrence in the English used by some disabled people.</Paragraph> <Paragraph position="2"> If the dictionary does not contain inflected words (that is, if it contains just the lemmas), a proposed word may need some correction by the user (or by the system) in order to adjust its concordance with other related words.
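The frequency-ordered prefix lookup described above can be sketched as follows. This is a minimal illustration: the tiny dictionary, its counts and the function names are invented, not taken from the paper.

```python
# Sketch of frequency-based prediction: a dictionary of words with
# occurrence counts, a prefix lookup offering the n most frequent
# matches, and a simple adaptation step. All data here is invented.
freq_dict = {"car": 120, "cars": 95, "care": 60, "card": 40, "carpet": 12}

def predict(prefix, n=3):
    """Return the n most frequent dictionary words beginning with prefix."""
    candidates = [w for w in freq_dict if w.startswith(prefix)]
    candidates.sort(key=freq_dict.get, reverse=True)  # most frequent first
    return candidates[:n]

def accept(word):
    """Adapt the dictionary to the user: update the chosen word's count."""
    freq_dict[word] = freq_dict.get(word, 0) + 1  # also admits new words

print(predict("car"))  # ['car', 'cars', 'care']
```

Updating the count on acceptance implements the adaptation described above: new words only need a frequency, so inclusion is trivial.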
For instance, it may need to adjust the gender (&quot;C'est une voiture fantastique&quot;) or the number (&quot;A lot of cars&quot;). The dictionary is usually an alphabetically ordered list of words and their frequencies, but other possible dictionary structures can be found in (Swiffin et al., 1987a). This prediction system can be adapted to the user by updating the frequency of a word in the dictionary each time that word is used. Words that are seldom used can be replaced by new ones that are not yet in the dictionary. The inclusion of new words is not difficult, because the only information required is their frequency. Further information about this type of prediction can be found in (Colby et al., 1982), (Garay et al., 1994a), (Heckathorne et al., 1983), (Hunnicutt, 1987), (Swiffin et al., 1987a), (Venkatagiri, 1993).</Paragraph> <Paragraph position="5"> To enhance the results of this method, an indication of the &quot;recency&quot; of use of each word may be added. In this way, the prediction system is able to offer the most recently used words among the most probable ones beginning with a. Each entry in the dictionary is composed of a word, its frequency and its recency of use. Adaptation of the dictionary to the user's vocabulary is possible by updating the frequency and recency of each word used. (Swiffin et al., 1987a) observes that this method produces slightly better keystroke savings than the previous approach, but more information must be stored in the dictionary and the complexity is also increased.</Paragraph> <Paragraph position="6"> Tables: another possibility is to use the relative probability of appearance of a word depending on the previous one. To implement this system, a two-dimensional table is needed to store the conditional probability of appearance of each word Wj after each word Wi. If the dictionary contains N words, the table will be of dimension N*N.
That is, it will have N^2 entries, although most of the values in the table will be zero or close to zero.</Paragraph> <Paragraph position="7"> In some cases the system could even give proposals before the beginning of a word has been entered. The recency of use may also be included in this approach. This method is hardly adaptable to include the user's preferred words, because the dimensions of the table cannot be changed. This difficulty has led to the design of modified versions, like the one that uses only the most probable word pairs, as reported in (Hunnicutt, 1987).</Paragraph> </Section> <Section position="3" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 2.2 Syntactic Word Prediction </SectionTitle> <Paragraph position="0"> This approach takes into account the syntactic information inherent in the language. To this end, two kinds of statistical data are used: the frequency of appearance of each word and the conditional probability of each syntactic category following every other syntactic category. In this way, the set of words that are candidates to be proposed by the predictor is restricted to those that match the most probable syntactic role at the current position in the sentence, thus increasing the hint rate. This syntactic table is smaller than the one used in the previous approach, and the proportion of probabilities which are close to zero is also smaller. Each entry in the dictionary associates a word with its syntactic category and its frequency of appearance. Words can be sorted by syntactic category to facilitate the selection process. When a word is syntactically ambiguous, that is, when more than one category is possible for a given word, one entry for each possible category may be created. The table of conditional probabilities of syntactic categories has a fixed size and is built before the predictor is used.
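The category-restricted lookup just described might be sketched as follows; the categories, the conditional probabilities and the tiny lexicon are all invented for illustration, not taken from the paper.

```python
# Sketch of syntactic word prediction: a fixed-size table of conditional
# category probabilities restricts which words may be proposed next.
# Categories, probabilities and words are invented example data.
cat_table = {  # P(next category given previous category)
    "DET": {"NOUN": 0.7, "ADJ": 0.3},
    "ADJ": {"NOUN": 0.9, "ADJ": 0.1},
}
lexicon = {  # words grouped by syntactic category, with frequencies
    "NOUN": {"friend": 50, "frame": 20},
    "ADJ": {"free": 30, "fresh": 15},
}

def predict(prev_cat, prefix, n=3):
    """Propose words matching the most probable categories after prev_cat."""
    proposals = []
    next_cats = cat_table[prev_cat]
    # Try categories in decreasing order of conditional probability.
    for cat in sorted(next_cats, key=next_cats.get, reverse=True):
        words = [w for w in lexicon.get(cat, {}) if w.startswith(prefix)]
        words.sort(key=lexicon[cat].get, reverse=True)
        proposals.extend(words)
    return proposals[:n]

print(predict("DET", "fr"))  # ['friend', 'frame', 'free']
```

Because the table is indexed by category rather than by word, its size stays fixed when new words are added to the lexicon, which is exactly the adaptability advantage noted above.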
Adaptation to the user's lexicon is possible because there is no need to increase the size of the table. New words are included in the dictionary with a provisional syntactic category deduced from their use. Later on, the system may require some help from the user to verify whether the categorisation was correct. It would also be possible to add some morphological information to the dictionary in order to propose words with the most appropriate morphological characteristics (gender, number). This could increase the hint rate of the predictor.</Paragraph> <Paragraph position="1"> Some systems that use this approach are described in (Garay et al., 1994a) and (Swiffin et al., 1987b).</Paragraph> <Paragraph position="2"> In these approaches, the current sentence is parsed using a grammar to obtain the most probable categories. Parsing methods for word prediction can be either &quot;top-down&quot; (Van Dyke, 1991) or &quot;bottom-up&quot; (Garay et al., 1994b), (Garay et al., 1997). So, there is a need to define the syntactic rules used in a language (typically of the form LEFT <- [RIGHT]+, where LEFT and RIGHT are syntactic categories defined in the system). Within a rule, it is possible to define concordance amongst the components of the right-hand side (in gender and/or number). Then, the proposals may be offered with the most appropriate morphological characteristics. It is necessary to leave the user the possibility of changing a word's ending: for example, if there is a mismatch in the rule used by the system, it may be necessary to modify the end of an accepted proposal.
The dictionary is similar to the one used in the previous approach, with the addition of morphological information to allow concordance.</Paragraph> <Paragraph position="3"> The complexity of this system is also larger because, in this case, all the words of the sentence that appear before the current word are taken into account, while in the previous approaches only one previous word was used. The adaptation of the system to new words is made by increasing the word frequencies and the weights of the rules. The inclusion of new words is similar to that of the previous approach. The use of grammars for word prediction is also shown in (Hunnicutt, 1989), (Le Pévédic et al., 1996), (Morris et al., 1991) and (Wood et al., 1993).</Paragraph> </Section> <Section position="4" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 2.3 Semantic Word Prediction </SectionTitle> <Paragraph position="0"> These methods are not widely used because their results are similar to those of the syntactic approaches, while the increase in complexity is great. Perhaps the simplest method is semantic word prediction using parsing methods. In this approach each word has some associated semantic categories, while in the previous one the categories were purely syntactic. The rest of the features (the procedure, complexity, structure of the dictionary, adaptability...) are similar to the previous approach. Nevertheless, the problem of assigning semantic categories to words is very complex and difficult to program. Some authors propose semantic categorisation done &quot;by hand&quot; (Hunnicutt, 1989).
There may be other methods of treating semantic information, but their complexity would be very great for a real-time system, which is what word predictors are intended to be, even though the time requirements (perhaps a few seconds between two consecutive keystrokes of an impaired person) are not very demanding for the computational capacities of today's computers. In this section the use of the previously reviewed word prediction methods for non-inflected languages is studied and their suitability for inflected languages is discussed. So, the key question is: are the word prediction methods shown above useful for inflected languages? As mentioned in the introduction, in non-inflected languages it is feasible to include in the dictionary all the forms derived from each lemma, given that the number of variations is quite small. For instance, in English, friends is the only variation (without creating compound words) of friend, and verbs have only a few variations too.</Paragraph> <Paragraph position="1"> * In Spanish, the word amigo (with the same meaning as friend) may vary in gender and number, giving the words amiga, amigos and amigas. But the variations that the word adiskide (same meaning as friend or amigo) may have in Basque make it impossible to store them all in the dictionary of the system. This is one of the changes to be taken into account in the design of a predictor for this type of language. In inflected languages, the complexity of making the changes is very high because of the number of possibilities. One possibility is to group the suffixes by their syntactic function, making easy automatisation possible. In addition, it should not be forgotten that suffixes may be recursively concatenated.</Paragraph> <Paragraph position="2"> Among the previously presented prediction methods, the ones using probabilistic information mainly work with words as isolated entities.
That is, they see each word in the dictionary as a whole to be guessed, without taking into account the morpho-syntactic information inherent in the language. So, a word that is not in the lexicon cannot be guessed. The impossibility of storing all the combinations of a word makes these methods not very suitable for inflected languages.2</Paragraph> <Paragraph position="3"> Therefore, it would be very interesting to treat the entire sentence. But then the first syntactic approach is not very useful, because it only takes the previous word into account. And the second one is very hard to implement, because of the number of variations a word may have. A great number of rules might have to be defined to cope with all the variations, but then the probability of guessing the rule that is being used is very small, because of their variety.</Paragraph> <Paragraph position="4"> The same thing happens with the semantic approach, which, as said before, has the same procedural characteristics as the second syntactic one.</Paragraph> <Paragraph position="5"> So, the complexity needed to create a correct word in an inflected language, including all the suffixes it requires, may make it necessary to search for other prediction methods, apart from those shown in the previous section.</Paragraph> <Paragraph position="6"> 2To see what the suitability of the approaches shown next can be, let us consider a special case in Basque: verbs, mainly auxiliary verbs. They depend not only on the subject (which normally appears in the absolutive or ergative case) but also on the direct complement (if the sentence is transitive this complement takes the absolutive case while the subject takes the ergative case) and on the indirect complement (the dative case).
For instance, the auxiliary dizut corresponds to a subject in the first person singular, a direct complement in the singular and an indirect complement in the second person singular.</Paragraph> <Paragraph position="7"> But if the subject is in the third person plural, the indirect complement in the first person plural and the direct complement in the plural, the required auxiliary is dizkigute. Both cases are in the present indicative. If the tense of the verb changes, the verb itself also changes (for example, the past indicative of dizut is nizun and the past of dizkigute is zizkiguten). There are also some cases in which the verb depends on the gender of the absolutive, ergative or dative complement.</Paragraph> <Paragraph position="8"> As we have seen in the previous section, it is very difficult to predict complete words in inflected languages because of the variations a word may have.</Paragraph> <Paragraph position="9"> As there is a huge variety of inflected languages, let us concentrate on the particular characteristics of the Basque language, adapting the operational method to this case.</Paragraph> <Paragraph position="10"> For this first approach, due to the above-mentioned primacy of suffixes (over other affixes) in the Basque language, and to simplify the problem, prediction in Basque is divided into two parts: prediction of lemmas and prediction of suffixes. Thus, two dictionaries (one for lemmas and another for suffixes) are used. The first one includes the lemmas of the language, alphabetically ordered, with their frequencies and some morphological information indicating which declensions are possible for each word.</Paragraph> <Paragraph position="11"> The second one includes the suffixes and their frequencies, ordered by frequency.</Paragraph> <Paragraph position="12"> To start the prediction, the system tries to anticipate the lemma of the next word.
Most of the methods seen in the previous sections can be used for this purpose. When the lemma is accepted (or typed entirely, if the predictor fails), the system offers the suffixes that are correct for this lemma, ordered by frequency. As there can be about 62 acceptable suffixes for a noun (as we have seen in Table 1), only the n most probable suffixes are offered (with n depending on the interaction method). As can be seen, the operational method is very similar to word prediction using tables of probabilities, but there is some added complexity because the system (and also the user) has to distinguish between lemmas and suffixes. In addition, more than one table of probabilities may be necessary to make predictions properly. Apart from the increase in complexity, a decrease in keystroke savings may be expected, because of the need to accept at least two proposals to complete a word (while a single proposal can suffice with predictors for non-inflected languages).</Paragraph> <Paragraph position="13"> Even if some promising results have been obtained, there are still some problems to solve in this approach.</Paragraph> <Paragraph position="14"> * First of all, due to the possibility of recursively composed suffixes (concatenating the existing ones), the system has to propose lists of suffixes repeatedly until the user explicitly marks the end of the current word (perhaps by inserting a space character).</Paragraph> <Paragraph position="15"> * This recursive behaviour is one of the reasons to create more than one table of probabilities, storing the probability of appearance of a suffix immediately after the previous one.</Paragraph> <Paragraph position="16"> * The system may be adapted to the user by updating the frequencies in the lexicons and the probabilities in the tables.
To include a new lemma in the dictionary, it is necessary to obtain its morphological characteristics.</Paragraph> <Paragraph position="17"> * Finally, due to the special characteristics of the verbs (which can include any kind of affix, in concordance with other words in the entire sentence), their prediction requires special treatment.</Paragraph> <Paragraph position="18"> Therefore, it seems interesting to try a syntactic approach for these types of languages, because otherwise the problems of this approach are very difficult to solve.</Paragraph> </Section> <Section position="5" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 4.2 Second Approach to Solving the Prediction Problem in an Inflected Language </SectionTitle> <Paragraph position="0"> This approach tries to alleviate the above-mentioned problems. The lemmas and the suffixes are still treated separately, but syntactic information is included in the system. This can be done by adding syntactic information to the entries of the dictionary of lemmas, and some weighted grammatical rules to the system. The main idea is to parse the sentence while it is being composed and to propose the most appropriate lemmas and suffixes. In principle, the parsing allows storing and extracting the information that influences the formation of the verb.</Paragraph> <Paragraph position="1"> There exist systems that verify the morphological and syntactic correctness of a Basque sentence, but the complexity of the Basque verb prevents its anticipation. To face this problem, the most frequent verb forms are included in the dictionary, and a morphological generator permits their modification or the addition of suffixes when necessary.</Paragraph> <Paragraph position="2"> As there are no probability tables, there is no problem related to their extension. The adaptation of the system is made by updating the frequencies of the lemmas and suffixes and the weights of the defined rules.
The inclusion of a new lemma in the lexicon might cause some lack of syntactic information. To solve this problem, there are several possibilities. First, the predictor tries to guess the category, based on the most highly weighted rule at that point of the sentence. Second, the predictor asks the user directly for the information. The first approach can produce false assumptions, while the second one slows the message composition rate and demands a great knowledge of the syntax by the user. There is another possibility: the predictor marks the lemma and the user is asked to complete the needed information after ending the session.</Paragraph> <Paragraph position="3"> Finally, recursion may be included in the defined rules. Most grammars have an implicit recursion which may be shown by rules. For instance, let us consider these rules:</Paragraph> <Paragraph position="4"> NP <- Noun PP and PP <- Prep NP,</Paragraph> <Paragraph position="5"> where NP means Noun Phrase; PP, Prepositional Phrase; Noun is a noun and Prep, a preposition. As can be seen, these rules can be expanded to: NP <- Noun Prep NP.</Paragraph> <Paragraph position="6"> So, the NP is on both the left and the right of the same rule, and recursion occurs. This recursion may be used as a way to express the recursive concatenation of the suffixes, because suffixes can express the syntactic role of a word in a sentence, as noted in the introduction.</Paragraph> <Paragraph position="7"> The operational method and the order of complexity are similar to those of word prediction using grammars. Nevertheless, higher complexity may be expected, mainly due to the existence of lemmas and suffixes; so, poorer keystroke savings are expected. To enhance this approach, it seems interesting to try to guess the entire word, that is, a lemma and its associated suffix.
This system will be easier to use (there is no need to force users to know what the lemma and the suffix of a word are) and may have better results, measured in terms of keystroke savings.</Paragraph> </Section> <Section position="6" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 4.3 Third Approach to Solving the Prediction Problem in an Inflected Language </SectionTitle> <Paragraph position="0"> Taking into account the previous experience, a third approach could be tried. Built as a combination of the previous ones, its main idea is to guess the entire current word. It treats the beginning of the sentence like the first approach, using statistical information.</Paragraph> <Paragraph position="1"> While advancing in the composition of the sentence, the system parses it and uses this information to offer the most probable words, including both lemma and suffix, as the second approach does. The first word of the sentence is treated using the first approach; but, to minimise the problems related to that approach, the rest of the sentence is treated using the second approach.</Paragraph> <Paragraph position="2"> In this way, only three tables would be needed: one with the probabilities of the syntactic categories of the lemmas appearing at the start of a sentence, another with the probabilities of the basic suffixes appearing after those words, and a third with the probabilities of the basic suffixes appearing after another basic suffix (making recursion possible). All of these tables would have fixed sizes, even when new lemmas are added to the system.</Paragraph> <Paragraph position="3"> The adaptation of the system would be made by updating the first table and, as suffixes are added to the word, the other two tables as well.
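The three fixed-size tables just described might be sketched as follows; the categories, suffixes and probabilities are invented illustrations, not real Basque statistics.

```python
# Sketch of the three-table approach: sentence-initial category
# probabilities, suffix-after-category probabilities, and
# suffix-after-suffix probabilities (the last one allows recursion).
start_cat = {"NOUN": 0.5, "VERB": 0.3, "ADV": 0.2}
suffix_after_cat = {"NOUN": {"-a": 0.4, "-ak": 0.3, "-ko": 0.2}}
suffix_after_suffix = {"-ko": {"-a": 0.5, "-ak": 0.3}}

def rank(table, n):
    """Return the n most probable entries of a probability table."""
    return sorted(table, key=table.get, reverse=True)[:n]

def propose_suffixes(context, n=2):
    """Offer suffixes after a lemma category or after a previous suffix."""
    table = suffix_after_cat.get(context) or suffix_after_suffix.get(context, {})
    return rank(table, n)

print(rank(start_cat, 2))        # ['NOUN', 'VERB']
print(propose_suffixes("NOUN"))  # ['-a', '-ak']
print(propose_suffixes("-ko"))   # ['-a', '-ak']
```

Because all three tables have fixed dimensions, adapting them to the user only changes the stored probabilities, never the table sizes.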
As for new lemmas whose information is not complete, they might or might not update the first table, depending on whether an entry for unknown cases is included; otherwise the table would remain unchanged. Finally, the problem of verb formation in Basque is not solved, and the most frequent verb forms are included in the dictionary in the same way as in the second approach.</Paragraph> </Section> </Section> <Section position="4" start_page="33" end_page="33" type="metho"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> Our experience with the Basque language in word prediction applied to Alternative and Augmentative Communication for people with disabilities shows that prediction methods successful for non-inflected languages are hardly applicable to inflected ones.</Paragraph> <Paragraph position="1"> The high number of inflections for each word makes their inclusion in the lexicon impossible. Different approaches have been studied to overcome this problem. To be able to predict whole words it is necessary to determine the syntactic role of the next word in the sentence. That can be done by means of a syntactic analysis &quot;on the fly&quot;. Nevertheless, the results of the evaluation of these methods with the Basque language are not as good as those obtained with non-inflected languages.</Paragraph> </Section> <Section position="5" start_page="33" end_page="33" type="metho"> <SectionTitle> 6 Acknowledgements </SectionTitle> <Paragraph position="0"> The authors would like to acknowledge the work of the rest of the members of the Laboratory of Human-Computer Interaction of the Computer Science Faculty of the University of the Basque Country. They would also like to acknowledge the aid given by Jose Mari Arriola, Kepa Sarasola and Ruben Urizar, who work in the IXA Taldea of the above-mentioned Computer Science Faculty.</Paragraph> </Section> </Paper>