File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/e06-2014_intro.xml
Size: 12,035 bytes
Last Modified: 2025-10-06 14:03:22
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2014"> <Title>ASSIST: Automated semantic assistance for translators</Title> <Section position="3" start_page="0" end_page="141" type="intro"> <SectionTitle> 2 System design </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="139" type="sub_section"> <SectionTitle> 2.1 The research hypothesis </SectionTitle> <Paragraph position="0"> Our research hypothesis is that translators can be assisted by software which suggests contextual examples in the target language that are semantically and syntactically related to a selected example in the source language. To enable greater coverage we will exploit comparable rather than parallel corpora.</Paragraph> <Paragraph position="1"> Our research hypothesis leads us to a number of research questions: * Which semantic and syntactic contextual features of the selected example in the source language are important? * How do we find similar contextual examples in the target language? * How do we sort the suggested target language contextual examples in order to maximise their usefulness? In order to restrict the research to what is achievable within the scope of this project, we are focussing on translation from English to Russian using a comparable corpus of British and Russian newspaper texts. Newspapers cover a large set of clearly identifiable topics that are comparable across languages and cultures. In this project, we have collected a 200-million-word corpus of four major British newspapers and a 70-million-word corpus of three major Russian newspapers for roughly the same time span (2003-2004).[1] In our proposed method, contexts of uses of English expressions defined by keywords are compared to similar Russian expressions, using semantic classes such as persons, places and institutions. 
For instance, the word agreement in the example the parties were frustratingly close to an agreement = стороны были до обидного близки к достижению соглашения belongs to a semantic class that also includes arrangement, contract, deal, treaty. As a result, the search for collocates of близкий (close) in the context of agreement words in Russian gives a short list of modifiers, which also includes the target: до обидного близки.</Paragraph> </Section> <Section position="2" start_page="139" end_page="140" type="sub_section"> <SectionTitle> 2.2 Semantic taggers </SectionTitle> <Paragraph position="0"> In this project, we are porting the Lancaster English Semantic Tagger (EST) to the Russian language. We have reused the existing semantic field taxonomy of the Lancaster UCREL semantic analysis system (USAS), and applied it to Russian. [Footnote 1: Russian newspapers are significantly shorter than their British counterparts.]</Paragraph> <Paragraph position="1"> We have also reused the existing software framework developed during the construction of a Finnish Semantic Tagger (Lofberg et al., 2005); the main adjustments and modifications required for Finnish were to cope with the Unicode character set (UTF-8) and word compounding.</Paragraph> <Paragraph position="2"> USAS-EST is a software system for automatic semantic analysis of text that was designed at Lancaster University (Rayson et al., 2004). 
The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981).</Paragraph> <Paragraph position="3"> It has a multi-tier structure with 21 major discourse fields, subdivided into 232 sub-categories.[2] In the ASSIST project, we have been working on both improving the existing EST and developing a parallel tool for Russian, the Russian Semantic Tagger (RST). We have found that the USAS semantic categories were compatible with the semantic categorizations of objects and phenomena in Russian, as in the following example:[3]</Paragraph> <Paragraph position="5"> However, we needed a tool for analysing the complex morpho-syntactic structure of Russian words. Unlike English, Russian is a highly inflected language: generally, what is expressed in English through phrases or syntactic structures is expressed in Russian via morphological inflections, especially case endings and affixation.</Paragraph> <Paragraph position="6"> For this purpose, we adopted the Russian morpho-syntactic analyser Mystem, which identifies word forms, lemmas and morphological characteristics for each word. Mystem is used as the equivalent of the CLAWS part-of-speech (POS) tagger in the USAS framework. Furthermore, we adopted the Unicode UTF-8 encoding scheme to cope with the Cyrillic alphabet. Despite these modifications, the architecture of the RST software mirrors that of the EST components in general.</Paragraph> <Paragraph position="7"> The main lexical resources of the RST include a single-word lexicon and a lexicon of multi-word expressions (MWEs). We are building the Russian lexical resources by exploiting both dictionaries and corpora. We use readily available resources, e.g. lists of proper names, which are then semantically classified.</Paragraph> <Paragraph position="9"> [Footnote 3: Quantities: little; E4.1- = Unhappy; X9.1- = Ability, intelligence: poor; A6.3- = Comparing: little variety; O4.2- = Judgement of appearance: bad] 
To bootstrap the system, we have hand-tagged the 3,000 most frequent Russian words based on a large newspaper corpus. Subsequently, the lexicons will be further expanded by feeding texts from various sources into the RST and classifying words that remain unmatched. In addition, we will experiment with semi-automatic lexicon construction using an existing machine-readable English-Russian bilingual dictionary to populate the Russian lexicon by mapping words from each of the semantic fields in the English lexicon in turn. We aim at coverage of around 30,000 single lexical items and up to 9,000 MWEs, compared to the EST, which currently contains 54,727 single lexical items and 18,814 MWEs.</Paragraph> </Section> <Section position="3" start_page="140" end_page="141" type="sub_section"> <SectionTitle> 2.3 The user interface </SectionTitle> <Paragraph position="0"> The interface is powered by IMS Corpus Workbench (Christ, 1994) and is designed to be used in the day-to-day workflow of novice and practising translators, so the syntax of the CWB query language has been simplified to adapt it to the needs of the target user community.</Paragraph> <Paragraph position="1"> The interface implements a search model for finding translation equivalents in monolingual comparable corpora, which integrates a number of statistical and rule-based techniques for extending search space, translating words and multiword expressions into the target language and restricting the number of returned candidates in order to maximise precision and recall of relevant translation equivalents. In the proposed search model queries can be expanded by generating lists of collocations for a given word or phrase, by generating similarity classes[4] or by manual selection of words in concordances. Transfer between the source language and target language is done via lookup in a bilingual dictionary or via UCREL semantic codes, which are common for concepts in both languages. 
The search space is further restricted by applying knowledge-based and statistical filters (such as part-of-speech and semantic class filters, IDF filter, etc.), by testing the co-occurrence of members of different similarity classes or by manually selecting the presented variants. These procedures are elementary building blocks that are used in designing different search strategies efficient for different types of translation equivalents and contexts.</Paragraph> <Paragraph position="2"> [Footnote 4: Simclasses consist of words sharing collocates and are computed using Singular Value Decomposition, as used by Rapp (2004), e.g. Paris and Strasbourg are produced for Brussels, or bus, tram and driver for passenger.]</Paragraph> <Paragraph position="3"> The core functionality of the system is intended to be self-explanatory and to have a shallow learning curve: in many cases default search parameters work well, so it is sufficient to input a word or an expression in the source language in order to get back a useful list of translation equivalents, which can be manually checked by a translator to identify the most suitable solution for a given context. For example, the word combination frustrated passenger is not found in the major English-Russian dictionaries, while none of the candidate translations of frustrated are suitable in this context. The default search strategy for this phrase is to generate the similarity classes for the English words frustrate and passenger, produce all possible translations using a dictionary and test the co-occurrence of the resulting Russian words in target language corpora. This returns a list of 32 Russian phrases, which follow the pattern of 'annoyed / impatient / unhappy + commuter / passenger / driver'. 
Among other examples the list includes an appropriate translation недовольный пассажир ('unsatisfied passenger').</Paragraph> <Paragraph position="4"> The following example demonstrates the system's ability to find equivalents when there is a reliable context to identify terms in the two languages. Recent political developments in Russia produced a new expression представитель президента ('representative of president'), which is as yet too novel to be listed in dictionaries.</Paragraph> <Paragraph position="5"> However, the system can help to identify the people that perform this duty, translate their names to English and extract the set of collocates that frequently appear around their names in British newspapers, including Putin's personal envoy and Putin's regional representative, even if no specific term has been established for this purpose in the British media.</Paragraph> <Paragraph position="6"> As words cannot be translated in isolation and their potential translation equivalents also often consist of several words, the system detects not only single-word collocates, but also multiword expressions. For instance, the set of Russian collocates of бюрократия (bureaucracy) includes Брюссель (Brussels), which offers a straightforward translation into English and has such multiword collocates as red tape, which is a suitable contextual translation for бюрократия. More experienced users can modify default parameters and try alternative strategies, construct their own search paths from available basic building blocks and store them for future use. Stored strategies comprise several elementary stages but are executed in one go, although intermediate results can also be accessed via the &quot;history&quot; frame. 
Several search paths can be tried in parallel and displayed together, so an optimal strategy for a given class of phrases can be more easily identified. Unlike Machine Translation, the system does not translate texts. The main thrust of the system lies in its ability to find several target language examples that are relevant to the source language expression. In some cases this results in suggestions that can be directly used for translating the source example, while in other cases the system provides hints for the translator about the range of target language expressions beyond what is available in bilingual dictionaries. Even if the precision of the current version is not satisfactory for an MT system (2-3 suitable translations out of 30-50 suggested examples), human translators are able to skim through the suggested set to find what is relevant for the given translation task.</Paragraph> </Section> </Section> </Paper>
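<!--
A minimal sketch of the default search strategy described in Section 2.3: expand each source word to its similarity class, translate the members with a bilingual dictionary, and keep the target-language pairs that co-occur in the corpus. Every name below, the toy dictionary, the toy corpus and the window-based co-occurrence test are assumptions made for illustration only. The paper itself states that similarity classes are computed with Singular Value Decomposition and that corpora are queried through IMS Corpus Workbench, neither of which is reproduced here.

```python
from itertools import product

# Hypothetical toy resources; none of these names or entries come from ASSIST.
SIMILARITY_CLASSES = {
    "frustrated": ["frustrated", "annoyed", "unhappy"],
    "passenger": ["passenger", "commuter", "driver"],
}
BILINGUAL_DICT = {
    "annoyed": ["раздражённый"],
    "unhappy": ["недовольный"],
    "passenger": ["пассажир"],
    "driver": ["водитель"],
}
# Toy target-language corpus as tokenised sentences.
TARGET_CORPUS = [
    ["недовольный", "пассажир", "ждал", "поезд"],
    ["недовольный", "пассажир", "написал", "жалобу"],
    ["раздражённый", "водитель", "сигналил"],
]


def expand(word):
    """Return the similarity class of a word, or the word itself."""
    return SIMILARITY_CLASSES.get(word, [word])


def translate(words):
    """Collect dictionary translations, preserving first-seen order."""
    out = []
    for w in words:
        for t in BILINGUAL_DICT.get(w, []):
            if t not in out:
                out.append(t)
    return out


def cooccurrence_count(first, second, corpus, window=3):
    """Count sentences where `second` follows `first` within `window` tokens."""
    count = 0
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == first and second in sent[i + 1 : i + 1 + window]:
                count += 1
                break
    return count


def suggest(source_phrase, corpus=TARGET_CORPUS, min_count=1):
    """Expand, translate, then keep target pairs attested in the corpus."""
    modifier, head = source_phrase.split()
    mods = translate(expand(modifier))
    heads = translate(expand(head))
    scored = [(cooccurrence_count(m, h, corpus), m, h) for m, h in product(mods, heads)]
    kept = [(c, m, h) for c, m, h in scored if c >= min_count]
    kept.sort(key=lambda item: -item[0])  # most frequent pairs first
    return [(f"{m} {h}", c) for c, m, h in kept]
```

Calling suggest("frustrated passenger") on this toy data ranks the attested pairs by corpus frequency, which mirrors how a translator would then skim the suggested list for a suitable equivalent.
-->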