<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0626"> <Title>Automatic Construction of Weighted String Similarity Measures</Title> <Section position="3" start_page="0" end_page="213" type="intro"> <SectionTitle> 2 Resources </SectionTitle> <Paragraph position="0"> Two types of textual resources were used in this study: reference lexicons for the automatic generation of string matching functions, and bilingual word pairs to be investigated with regard to string similarity. A collection of bilingual word pairs is easy to produce. Similarity metrics should be applicable to every possible word pair from this set. However, some restrictions can be imposed on the choice of appropriate pairs. In this study, all word pairs were derived from sentence-aligned corpora of technical texts which were collected in the PLUG corpus (Tiedemann, 1998b) as part of the PLUG project (Ahrenberg et al., 1998).</Paragraph> <Paragraph position="1"> Technical texts are suitable for investigations of string similarity. The text collection examined comprises about 180,000 words per language and includes a large number of technical expressions. Therefore, a comprehensive list of cognates can be expected from this corpus.</Paragraph> <Paragraph position="2"> Some further constraints were set in order to restrict the set of bilingual word pairs to be investigated: minimal token length: Each token should contain at least a certain number of characters. Very short strings are not a reliable basis for string comparison. The minimal length of tokens used in this study was set to four characters.</Paragraph> <Paragraph position="3"> maximal distance: Token pairs were taken from sentence-aligned bi-text. The position of each token in its sentence can be used to reduce the number of potential candidate pairs. One possibility is to set a maximum for the difference in position for each token pair. 
In this study the position difference may not exceed 10 tokens.</Paragraph> <Paragraph position="4"> minimal length difference ratio: Cognates should be of comparable length. Therefore, it is appropriate to restrict the set of candidates to strings whose lengths do not differ by more than a certain amount. The quotient of the length of the shorter string and the length of the longer string can be used as a ratio for measuring this difference. In this study the set of candidates was restricted to token pairs whose length difference ratio is at least 0.7.</Paragraph> <Paragraph position="5"> Using these three restrictions, a set of 308,362 candidate pairs was obtained from parts of the PLUG corpus.</Paragraph> <Paragraph position="6"> The selection of reference lexicons should be done with care. These lists of word pairs are decisive for the quality of the string matching function which will be produced. For reasons of availability, it was decided to use bilingual word lists which were produced in an automatic word alignment process. This is not a perfect solution, because the lists contain quite a few errors and therefore degrade the quality of the results.</Paragraph> <Paragraph position="7"> The reference lexicons were generated by word alignment based on statistical measures and empirical investigations. The software used for the extraction is the Uppsala word alignment tool (Tiedemann, 1997; Tiedemann, 1999). 
The following two word lists were investigated: GordSVEN: A list of 2,431 Swedish/English word alignments derived from the English/Swedish bi-text 'A Guest of Honour' by Nadine Gordimer, with an estimated precision of about 95.8%.</Paragraph> <Paragraph position="8"> ScaniaSVEN: A list of 2,223 Swedish/English word alignments derived from the Swedish/English bi-texts in the Scania95 corpus (Sca, 1998) by measuring LCSR (longest common subsequence ratio) scores, with an estimated precision of about 92.5%.</Paragraph> <Paragraph position="9"> Both bi-texts are part of the PLUG corpus.</Paragraph> </Section> </Paper>
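<!--
The three restrictions on candidate pairs described in the Resources section (minimal token length, maximal distance, minimal length difference ratio) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function and constant names are hypothetical, token positions are taken to be zero-based indices within each sentence, and the 0.7 threshold is read as a lower bound on the shorter/longer length quotient.

```python
from itertools import product

# Thresholds quoted in the text; the constant names are hypothetical.
MIN_TOKEN_LENGTH = 4    # each token must have at least four characters
MAX_POSITION_DIFF = 10  # positions may differ by at most 10 tokens
MIN_LENGTH_RATIO = 0.7  # assumption: treated as a lower bound on the ratio


def length_ratio(a, b):
    """Length of the shorter string divided by the length of the longer one."""
    shorter, longer = sorted((len(a), len(b)))
    return shorter / longer if longer else 0.0


def candidate_pairs(source_tokens, target_tokens):
    """Yield token pairs from one aligned sentence pair that pass all three filters."""
    indexed_pairs = product(enumerate(source_tokens), enumerate(target_tokens))
    for (i, src), (j, tgt) in indexed_pairs:
        # Restriction 1: very short tokens are unreliable for string comparison.
        if len(src) < MIN_TOKEN_LENGTH or len(tgt) < MIN_TOKEN_LENGTH:
            continue
        # Restriction 2: tokens must occupy similar positions in their sentences.
        if abs(i - j) > MAX_POSITION_DIFF:
            continue
        # Restriction 3: cognates should be of comparable length.
        if length_ratio(src, tgt) < MIN_LENGTH_RATIO:
            continue
        yield src, tgt
```

As a made-up illustration (the sentences are not taken from the PLUG corpus), calling `list(candidate_pairs(["motorn", "har", "en", "turbokompressor"], ["the", "engine", "has", "a", "turbocharger"]))` keeps only `("motorn", "engine")` and `("turbokompressor", "turbocharger")`: the short tokens are dropped by the length filter, and the remaining cross pairs fail the length ratio test.
-->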