<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1117"> <Title>Cognate Mapping -- A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Lexicographic Aspects of Morpho-Semantic Indexing </SectionTitle> <Paragraph position="0"> We briefly outline the lexicographic and semantic aspects of our approach, called Morpho-Semantic Indexing (henceforth, MSI), which translates source documents (and queries) into an interlingual representation in which their content is represented by language-independent semantic descriptors.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Subwords as Lexicon Units </SectionTitle> <Paragraph position="0"> Our work is based on the assumption that neither fully inflected nor automatically stemmed words constitute the appropriate granularity level for lexicalized content description. Especially in scientific and technical sublanguages, we observe a high frequency of domain-specific and content-bearing suffixes (e.g., '-itis', '-ectomia' in the medical domain), as well as the tendency to construct highly complex word forms such as 'pseudo hypo para thyroid ism', 'gluco corticoid s', or 'pancreat itis'.1 In order to properly account for the particularities of &quot;medical&quot; morphology, we introduced subwords (Schulz et al., 2002) as self-contained, semantically minimal units and motivated their existence by their usefulness for document retrieval rather than by linguistic arguments.</Paragraph> <Paragraph position="1"> The minimality criterion is quite difficult to define in a general way, but its implications can be illustrated by the following example. Given the text token 'diaphysis', a linguistically plausible morpheme decomposition might yield 'dia phys is'. 
From a medical perspective, a segmentation into 'diaphys is' seems much more reasonable, because the linguistically canonical morphological decomposition is far too fine-grained and likely to create too many ambiguities. For instance, comparable 'low-level' segmentations of semantically unrelated tokens such as 'dia lyt ic', 'phys io logy' lead to morpheme-style units 'dia' and 'phys', which unwarrantedly match segmentations such as 'dia phys is', too. The (semantic) self-containedness of the chosen subword is often supported by the existence of a synonym, e.g., for 'diaphys' we have 'shaft'.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Subword Lexicon and Thesaurus </SectionTitle> <Paragraph position="0"> Subwords are assembled in a multilingual lexicon and thesaurus, which contain subword entries, special subword attributes and semantic relations between subwords. Up until now, the lexicon and the thesaurus have both been constructed manually, with the following considerations in mind: Subwords are entered, together with their attributes such as language (English, German, Portuguese) and subword type (stem, prefix, suffix, invariant). Each lexicon entry is assigned a unique identifier representing one synonymy class, the MORPHOSAURUS identifier (MID), which contains this entry as its unique member.</Paragraph> <Paragraph position="1"> 1' ' denotes the concatenation operator.</Paragraph> <Paragraph position="2"> Synonymy classes which contain intralingual synonyms and interlingual translations of subwords are fused. We restrict intra- and inter-lingual semantic equivalence to the context of medicine.</Paragraph> <Paragraph position="3"> Semantic links between synonymy classes are added. 
We subscribe to a shallow approach in which semantic relations are restricted to a paradigmatic relation has-meaning, which relates one ambiguous class to its specific readings,2 and a syntagmatic relation expands-to, which consists of predefined segmentations in the case of very short subwords.3 We refrain from introducing hierarchical relations between MIDs, because such links can be acquired from domain-specific vocabularies, e.g., the</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Medical Subject Headings (MESH, 2001). </SectionTitle> <Paragraph position="0"> Table 1 depicts how source documents (cf. the first column with an English and Portuguese fragment) are converted into an interlingual representation by a three-step procedure. First, each input word is orthographically normalized by conversion to lower case and by applying language-specific rules for the transcription of diacritics (second column). Next, words are segmented into sequences of semantically plausible sublexical items according to the subwords listed in the lexicon (third column). Finally, each meaning-bearing subword is replaced by its language-independent semantic identifier, the MID, which unifies intralingual and interlingual (quasi-)synonyms. This yields the interlingual output representation of the system (fourth column).</Paragraph> <Paragraph position="1"> The manual construction of the trilingual subword lexicon and the subword thesaurus has consumed, up until now, three and a half person-years. The project originally started from a bilingual German-English lexicon, while the Portuguese part was added in a later project phase. 
The combined subword lexicon contains 58,479 entries,4 with 21,397 for English, 22,053 for German, and 15,029 for Portuguese.</Paragraph> <Paragraph position="2"> Taking into account, on the one hand, the outstanding importance of Spanish as a major Western language and, on the other hand, the close lexical ties between Portuguese and Spanish as Romance languages, we intended to augment the existing MORPHOSAURUS system with Spanish as its fourth language and, at the same time, to reuse the knowledge of Portuguese in order to speed up and facilitate the Spanish lexicon acquisition. ... princeton.edu/ wn/doc.shtml, last visited on January 3, 2004). Linguistically speaking, the entries are basic forms of verbs, nouns, adjectives and adverbs.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Original Document | Orthographic Normalization | Morphological Segmentation | Semantic Normalization </SectionTitle> <Paragraph position="0"> High TSH values suggest the diagnosis of primary hypothyroidism while a suppressed TSH level suggests hyperthyroidism. high tsh values suggest the diagnosis of primary hypothyroidism while a suppressed tsh level suggests hyperthyroidism.</Paragraph> <Paragraph position="1"> high tsh value s suggest the diagnos is of primar y hypo thyroid ism while a suppress ed tsh level suggest s hyper thyroid ism.</Paragraph> <Paragraph position="3"> A presença de valores elevados de TSH sugere o diagnóstico de hipotireoidismo primário, enquanto níveis suprimidos de TSH sugerem hipertireoidismo.</Paragraph> <Paragraph position="4"> a presenca de valores elevados de tsh sugere o diagnostico de hipotireoidismo primario, enquanto niveis suprimidos de tsh sugerem hipertireoidismo.</Paragraph> <Paragraph position="5"> a presenc a de valor es elevad os de tsh suger e o diagnost ico de hipo tireoid ismo primari o, enquanto niveis suprimid os de tsh suger em hiper tireoid ismo.</Paragraph> <Paragraph position="7"> (column 1) is orthographically transformed (column 2), segmented according to the subword lexicon (column 3), while content-bearing subwords are mapped to MSI-specific equivalence classes whose identifiers (MIDs) are automatically generated by the system (column 4). (Bold MIDs co-occur in both documents.)</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> We use the following resources for the experiments: A Portuguese subword lexicon, as described in the previous section.</Paragraph> <Paragraph position="1"> A manually created list of 842 Spanish affixes. Medical corpora for Spanish and Portuguese.</Paragraph> <Paragraph position="2"> These corpora were compiled exploiting heterogeneous WWW sources. The size of the acquired corpora amounts to 2,267,841 tokens with 118,021 types for Spanish and 3,406,589 tokens with 133,146 types for Portuguese.</Paragraph> <Paragraph position="3"> Word frequency lists generated from these corpora, for Spanish and Portuguese.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Spanish Subword Generation </SectionTitle> <Paragraph position="0"> In order to acquire a first-shot Spanish subword lexicon we designed the following lexeme generation strategy: Using the Portuguese lexicon, identical and similarly spelled Spanish subword candidates (cognates) are generated. As an example, the Portuguese word stem 'estomag' ('stomach') is identical with its Spanish cognate. An example for a pair of similar stems is 'mulher' ('woman') (Portuguese) vs. 'mujer' (Spanish).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Rule (P → S) | Portuguese Example | Spanish Example </SectionTitle> <Paragraph position="0">
qua → cua | quadr | cuadr
eia → ena | veia | vena
ss → s | fracass | fracas
lh → j | mulher | mujer
lh → ll | detalh | detall
l → ll | lev | llev
i → y | ensai | ensay
f → h | formig | hormig
+ca → za | cabeca | cabeza
+o+ → ue | sort | suert
... | ... | ...</Paragraph> <Paragraph position="1"> Similar subword candidates were generated by applying a set of string substitution rules, some of which are listed in Table 2. In total, we formulated 45 rules as a result of identifying common-language Portuguese-Spanish cognates in a commercial dictionary. Some of these substitution patterns cannot be applied to starting or ending sequences of characters in the Portuguese source subword. These regularities are captured by using a wildcard ('+' in Table 2) representing at least one arbitrary character.</Paragraph> <Paragraph position="2"> First, for each Portuguese lexicon entry (n = 14,183 stems and invariants, excluding affixes), all possible Spanish variant strings were generated based upon the set of string substitution rules. This led, on average, to 9.53 Spanish variant hypotheses per Portuguese subword entry (ranging from 5.3 variants for high-frequency four-character words to 355.2 for low-frequency 17-character words). All these candidates were subsequently compared to the Spanish word frequency list that we had previously compiled from our Spanish text corpus. Wherever a left-sided string match (in the case of stems) or an exact one (in the case of invariants) occurred, the matching string was listed as a potential Spanish cognate of the Portuguese subword it originated from. Whenever several Spanish substitution alternatives for a Portuguese subword had to be considered (cognate ambiguity), we chose the one whose relative distribution in the corpus-derived Spanish word frequency list was closest to that of its Portuguese equivalent in the Portuguese word list. 
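The generation and matching steps described so far can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: only the rules are taken from Table 2, whereas the function names, the toy frequency lists, the reading of "all possible variants" as every combination of non-overlapping rule matches, and the use of relative corpus frequency for disambiguation are hypothetical choices made for the example.

```python
# Subset of the Table 2 rules. '+' is the wildcard from the paper: at least
# one arbitrary character must precede (leading '+') or follow (trailing '+').
RULES = [("qua", "cua"), ("eia", "ena"), ("ss", "s"), ("lh", "j"),
         ("lh", "ll"), ("l", "ll"), ("i", "y"), ("f", "h"),
         ("+ca", "za"), ("+o+", "ue")]


def variants(stem, rules=RULES):
    """Return every string obtainable by applying any subset of
    non-overlapping rule matches to the Portuguese stem (assumed reading
    of 'all possible Spanish variant strings'); includes the stem itself."""
    parsed = [(src.strip("+"), dst, src.startswith("+"), src.endswith("+"))
              for src, dst in rules]
    found = set()

    def walk(pos, built):
        if pos == len(stem):
            found.add(built)
            return
        walk(pos + 1, built + stem[pos])      # leave this character unchanged
        for core, dst, needs_left, needs_right in parsed:
            end = pos + len(core)
            if not stem.startswith(core, pos):
                continue
            if (needs_left and pos == 0) or (needs_right and end == len(stem)):
                continue                      # required wildcard context missing
            walk(end, built + dst)

    walk(0, "")
    return found


def relative_frequency(stem, freq, exact=False):
    """Share of corpus tokens matching the stem: left-sided match for stems,
    exact match for invariants (exact=True)."""
    total = sum(freq.values())
    hits = sum(n for word, n in freq.items()
               if ((word == stem) if exact else word.startswith(stem)))
    return hits / total


def best_cognate(pt_stem, pt_freq, es_freq, exact=False):
    """Pick the attested Spanish variant whose relative frequency is closest
    to that of the Portuguese stem (hypothetical disambiguation criterion)."""
    pt_rel = relative_frequency(pt_stem, pt_freq, exact)
    scored = []
    for cand in variants(pt_stem):
        es_rel = relative_frequency(cand, es_freq, exact)
        if es_rel > 0:                        # candidate occurs in Spanish corpus
            scored.append((abs(es_rel - pt_rel), cand))
    return min(scored)[1] if scored else None


# Toy word frequency lists (invented for illustration only).
pt_freq = {"mulher": 30, "mulheres": 10, "veia": 20, "estomago": 40}
es_freq = {"mujer": 28, "mujeres": 12, "vena": 25, "estomago": 35}

for stem in ("mulher", "veia", "estomag"):
    print(stem, "->", best_cognate(stem, pt_freq, es_freq))
```

Enumerating combinations of rule matches, rather than applying one rule at a time, is what makes the number of hypotheses grow quickly with stem length, in line with the wide range of variants per entry reported above.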
As a result, we obtained a list of tentative Spanish subwords, each linked by the associated MIDs to its corresponding cognate in the Portuguese lexicon.</Paragraph> <Paragraph position="3"> Quantitatively, starting from 14,183 Portuguese subwords, a total of 132,576 Spanish subword candidates were created using the string substitution rules. Matching these Spanish candidates against the Spanish corpus and allowing for a maximum of one Spanish candidate per Portuguese subword, we identified 11,206 tentative Spanish cognates (79% of the Portuguese seed lexicon) which are linked to a total of 8,992 MIDs from their Portuguese correlates (hence, 2,214 synonym relationships have also been hypothesized). 2,977 generated items could not be found in the Spanish corpus at all.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Manual Semantic Validation </SectionTitle> <Paragraph position="0"> One of the authors manually evaluated a random sample of 388 (3.5% of all generated) cognate pairs in order to identify false friends, i.e., similar words in different languages with different meanings. In our sample we found, e.g., the Spanish candidate *'crianz' for the Portuguese 'crianc' (the normalized stem of 'criança'; English: 'child'). The correct translation of Portuguese 'crianc' to Spanish, however, would have been 'nin' (the stem of 'niño'), whilst the Spanish 'crianz' refers to 'criac' (stem of 'criação' in Portuguese; English: 'breed'). Taking these false friend errors into account, the automatic generation of Portuguese-Spanish cognate pairs still yields 89.4% accuracy.</Paragraph> <Paragraph position="1"> Assuming then that approximately 1,188 false friends are among the list of 11,206 generated Spanish subword translations (10.6%), the question arises how to distinguish false friends from true positives (cognates). 
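The derived figures quoted in Sections 3.1 and 3.2 follow from the reported counts by simple arithmetic. The short calculation below merely re-derives them; the variable names are illustrative, and the extrapolation assumes, as the text does, that the accuracy measured on the sample carries over to the full candidate list.

```python
portuguese_subwords = 14_183   # stems and invariants in the seed lexicon
tentative_cognates = 11_206    # Spanish candidates attested in the corpus
linked_mids = 8_992            # distinct MIDs inherited from Portuguese
sample_accuracy = 0.894        # accuracy measured on the manual sample

coverage = tentative_cognates / portuguese_subwords
hypothesized_synonyms = tentative_cognates - linked_mids
unmatched = portuguese_subwords - tentative_cognates
expected_false_friends = round(tentative_cognates * (1 - sample_accuracy))

print(f"coverage of seed lexicon: {coverage:.0%}")
print(f"hypothesized synonym relationships: {hypothesized_synonyms}")
print(f"subwords without attested candidate: {unmatched}")
print(f"expected false friends: {expected_false_friends}")
```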
Because a manual examination of the entire candidate set is tedious and still error-prone, we shifted our attention to automatic semantic validation techniques.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Automatic Semantic Validation </SectionTitle> <Paragraph position="0"> In order to automatically validate all the generated cognate pairs, we examined the local context in which these cognates occur in non-parallel corpora of both languages involved. The basic idea that underlies this approach is that a subword that appears in a certain context should have a (true positive) cognate that occurs in a similar context, at least when (very) large corpora are taken into account.</Paragraph> <Paragraph position="1"> Cognate similarity can then be measured in terms of context vector comparison (cf. also Rapp (1999) or Koehn and Knight (2002)).</Paragraph> <Paragraph position="2"> We therefore processed the Portuguese corpus using the morpho-semantic normalization routines as discussed in Section 2. In the next step, we created a context vector for each MID, the components of which contained the relative frequencies of co-occurring MIDs in a local window of four subsequent, yet unordered MID units (a size also endorsed by Rapp (1999)).</Paragraph> <Paragraph position="3"> In order to compute the context vector for each Spanish subword candidate, we then constructed a seed lexicon with all the automatically created Spanish subword candidates, together with the list of Spanish affixes. Based on this lexicon, the Spanish corpus was morphologically normalized in the same way, using the MIDs that were licensed by the Portuguese cognates. For each of the candidate cognate MIDs, we built a corresponding context vector. We then measured the context similarity for each MID considering its Portuguese source context and the corresponding Spanish one. We chose two similarity metrics, viz. 
the well-known cosine metric (Salton and McGill, 1983) and an inverted, normalized (within the interval [0,1]) variant of the city-block metric (advocated by Rapp (1999) as an alternative that outperformed cosine in his experiments).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Spanish Corpora </SectionTitle> <Paragraph position="0"> Figure 1 depicts the resulting curves. Both metrics reveal almost the same characteristics. Only for higher similarities does city-block allow a more fine-grained distinction.</Paragraph> <Paragraph position="1"> For 5,183 (57.6%) of the 8,992 pairs of MIDs (one from a 'Portuguese' vector, the other from a 'Spanish' vector), no vector similarity at all could be measured. We distinguish between the following cases: There was no MID occurrence in the Spanish corpus.</Paragraph> <Paragraph position="2"> There was a MID occurrence in the Spanish corpus, but none in the Portuguese one.</Paragraph> <Paragraph position="3"> The vectors were orthogonal, i.e., the contexts did not overlap at all, although the MID occurred in the Spanish corpus, as well as in the Portuguese one. This can be interpreted in two ways: For reasonably frequent MIDs (cf. Figure 2 for the distribution in the corpora) this is the strongest evidence for false friends (formal cognates which are not semantically related), whereas for sparsely distributed MIDs, it hardly permits any valid judgment concerning their status as false or true cognates.</Paragraph> <Paragraph position="4"> On the other hand, 1,540 MID pairs (in the above sense) exceed similarity values of 0.2 (17.1%) and 2,065 pairs still share values greater than 0.15 (23%). The obvious question is: What is an adequate threshold? Figures 3 and 4 convey an answer to this question. 
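The context-vector comparison underlying these similarity values can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the code used in the experiments: the window is taken to extend up to four positions to either side of a MID, the inverted city-block variant is normalized as 1 - d/2 (which lies in [0,1] when both vectors hold relative frequencies), and the MID names in the example are invented.

```python
from collections import Counter
from math import sqrt


def context_vectors(mids, window=4):
    """Relative co-occurrence frequencies of MIDs within a local, unordered
    window (assumed here to span `window` positions on either side)."""
    counts = {}
    for i, mid in enumerate(mids):
        left = mids[max(0, i - window):i]
        right = mids[i + 1:i + 1 + window]
        counts.setdefault(mid, Counter()).update(left + right)
    return {mid: {k: n / sum(c.values()) for k, n in c.items()}
            for mid, c in counts.items() if c}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u).intersection(v))
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def city_block_similarity(u, v):
    """Inverted, normalized city-block distance (assumed form: 1 - d/2)."""
    if not u or not v:
        return 0.0
    dist = sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in set(u).union(v))
    return 1.0 - dist / 2.0


# Hypothetical context vectors of one MID in the two corpora.
pt_vec = {"mid_thyroid": 0.5, "mid_diagnosis": 0.3, "mid_level": 0.2}
es_vec = {"mid_thyroid": 0.4, "mid_diagnosis": 0.4, "mid_value": 0.2}

cos_sim = cosine(pt_vec, es_vec)
cb_sim = city_block_similarity(pt_vec, es_vec)
print(f"cosine={cos_sim:.3f}  city-block={cb_sim:.3f}")
print("kept at cosine threshold 0.2:", cos_sim >= 0.2)
```

With this normalization, vectors without any shared context receive a similarity of zero under both metrics, which corresponds to the orthogonal case discussed above.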
Both figures are meant to illustrate the trade-off when one increases the threshold for the similarity of both vectors, the Portuguese and the Spanish one, for the MIDs under consideration. The central notion in these two figures is that of Kept Hypotheses, i.e., the proportion of MIDs for which the assignment of the underlying cognate is judged as being semantically valid. [Figure caption fragment: ... to the City-Block Metrics for the Validation of MID Hypotheses] When we consider all (100%) of the generated MIDs (n=8,992) as valid (i.e., with the cosine and city-block thresholds both set to zero), we get 953 false positives (given our empirically determined accuracy rate of 89.4%, and, hence, error rate of 10.6%) and, obviously, no false negatives. Alternatively, when we consider instead 50% of the generated MIDs (n=4,496) as valid (with thresholds for cosine set at 0.05 and for city-block at 0.035), we get 297 (3.3%) false positives, and the number of false negatives rises to 3,687 (around 41%, for both metrics). In order to reduce the set of false friends to zero using the cosine metric, 92.2% of all generated MID cognates must be rejected by the automatic validation and passed on for manual revision (analogously, the number of false negatives will increase). Interestingly, the same procedure using the city-block metric will lead to a rejection rate of 97%.</Paragraph> <Paragraph position="5"> At first glance, this seems to contradict the statement of Rapp (1999), who found in a number of experiments that the city-block metric yields the best results among others, viz. cosine and Jaccard measure, Euclidean distance and scalar product. However, his measures were taken to find the most similar vector for a given word in order to automatically identify word translations. On the other hand, in our experiments, we intended to express the degree of similarity given a pair of cognates. 
We hypothesized that the city-block metric allows a more fine-grained similarity judgment whilst others, e.g., cosine, the Jaccard and Dice coefficients, etc., which only account for overlapping elements of a vector, have a stronger demarcation power.</Paragraph> <Paragraph position="6"> Summarizing, when we increase the similarity thresholds, the number of MID hypotheses decreases, as does the number of false positives (already at a rather low level), while the number of false negatives increases in almost inverse relation to the number of MID hypotheses. Therefore, it is up to the lexicon engineer to determine the level of pre-selection in these three dimensions. We also conclude from our experiments that a much larger corpus is needed in order to collect reasonable context evidence, in particular for the infrequent MIDs.</Paragraph> </Section> </Section> </Paper>