<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2142">
  <Title>Rapid Development of Translation Tools: Application to Persian and Turkish</Title>
  <Section position="4" start_page="0" end_page="982" type="metho">
    <SectionTitle>
2 General architecture
</SectionTitle>
    <Paragraph position="0"> MEAT is a publicly available environment¹ that assists a linguist in rapidly developing a machine translation system. In order to keep the overhead involved in learning and using the system as low as possible, the linguist uses simple yet powerful basic data and control structures. These structures are oriented towards contemporary linguistic and computational linguistic theories.</Paragraph>
    <Paragraph position="1"> In MEAT, linguistic knowledge is entirely represented using Typed Feature Structures (TFS) (Carpenter, 1992; Zajac, 1992), the most widely used representational formalism today. We developed a fast implementation of Typed Feature Structures with appropriateness, based on an abstract machine view (cf. Carpenter and Qu (1995), Wintner and Francez (1995)). Bilingual dictionary entries as well as all kinds of rules (morphology, syntax, transfer, generation) are expressed as feature structures of specific types, so only one description language has to be mastered. This usually leads to a rapid familiarity with the system, yielding a high productivity almost from the start.</Paragraph>
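    <Paragraph> As an illustration only, the core operation on feature structures, unification, can be sketched in a few lines of Python. This is a deliberately simplified sketch and not the MEAT implementation: it assumes untyped feature structures encoded as nested dictionaries with atomic string values, and it ignores types, appropriateness conditions, and structure sharing. The function name unify and the sample features (cat, agr, num, per) are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures; return the result, or None on failure.

    Assumption: a feature structure is either an atomic value or a dict
    mapping feature names to feature structures (no types, no reentrancy).
    """
    # An atomic value unifies only with an equal atomic value.
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None
    result = dict(a)  # shallow copy, so the inputs are not modified
    for feature, value in b.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None  # conflicting values: unification fails
            result[feature] = merged
        else:
            result[feature] = value
    return result


# Hypothetical sample structures.
noun = {"cat": "noun", "agr": {"num": "sg"}}
third = {"agr": {"per": "3"}}
plural = {"agr": {"num": "pl"}}

print(unify(noun, third))   # compatible: the information is merged
print(unify(noun, plural))  # incompatible values for num: None
```

    The first call merges the compatible agreement information, while the second fails because the two structures disagree on the value of num.</Paragraph>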
    <Paragraph position="2"> ¹ http://crl.nmsu.edu/~jamtrup/Meat  Runtime linguistic objects (words, syntactic structures etc.) are stored in a central data structure. We use an extension of the well-known concept of a chart (Kay, 1973) to hold all temporary and final results. As multiple components have to process data of different origins, the chart is equipped with several different layers, each of which denotes a specific aspect of processing. Thus, at every point during the runtime of the system, the contents of the chart reflect what operations have been performed so far (Amtrup, 1999). The complete chart is available to the user through a graphical interface. This chart browser can be used to trace exactly why a specific solution was produced, a significant aid in developing and debugging grammars.</Paragraph>
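    <Paragraph> The idea of a chart with several layers that also records how each result was derived can be sketched as follows. This is a minimal sketch under stated assumptions, not the data structure used in MEAT: the class LayeredChart, its methods, the layer names (input, morphology, syntax), and the edge labels are all hypothetical.

```python
from collections import defaultdict


class LayeredChart:
    """Toy chart: edges span input positions and live on named layers."""

    def __init__(self):
        # Maps a layer name to the list of edges stored on that layer.
        self.layers = defaultdict(list)

    def add(self, layer, start, end, label, sources=()):
        # 'sources' records which earlier edges this edge was built from.
        edge = {"layer": layer, "start": start, "end": end,
                "label": label, "sources": tuple(sources)}
        self.layers[layer].append(edge)
        return edge

    def history(self, edge):
        """Return the edge followed by all edges it was derived from."""
        trail = [edge]
        for source in edge["sources"]:
            trail.extend(self.history(source))
        return trail


chart = LayeredChart()
token = chart.add("input", 0, 1, "token-1")
morph = chart.add("morphology", 0, 1, "noun+sg", sources=[token])
phrase = chart.add("syntax", 0, 1, "NP", sources=[morph])

# Trace a result back through the layers that produced it.
print([edge["layer"] for edge in chart.history(phrase)])
```

    Because every edge keeps references to the edges it was built from, a result on the syntax layer can be traced back to the input, which is the kind of inspection a chart browser makes available.</Paragraph>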
    <Paragraph position="3"> MEAT addresses the necessity of carrying out several different tasks by providing a component-based architecture. The core of the system consists of the formalism and the chart data representation. All processing components are implemented in the form of plug-ins, components that obey a small interface to communicate with the main application. The choice of which components to apply to an actual input, the order in which the components are processed, and individual parameters for components can be specified by the user, allowing for a highly flexible way of configuring a machine translation (or word-lookup, or glossing etc.) system (cf. Amtrup et al. (2000) for a more detailed description of the architecture).</Paragraph>
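    <Paragraph> A user-configurable chain of plug-in components might look like the following sketch. It is illustrative only and makes several assumptions: the actual plug-in interface of MEAT is not given here, so the Component base class, the two sample components, the REGISTRY table, and the configuration format (an ordered list of name and parameter pairs) are hypothetical.

```python
class Component:
    """Minimal plug-in interface: a name and a process() method."""
    name = "component"

    def __init__(self, **params):
        self.params = params  # individual parameters set by the user

    def process(self, data):
        raise NotImplementedError


class Tokenizer(Component):
    name = "tokenizer"

    def process(self, data):
        # Split the input string on whitespace.
        return data.split()


class Lookup(Component):
    name = "lookup"

    def process(self, data):
        # Replace each word found in the table; keep unknown words as-is.
        table = self.params.get("table", {})
        return [table.get(word, word) for word in data]


# Components known to the main application, indexed by name.
REGISTRY = {cls.name: cls for cls in (Tokenizer, Lookup)}


def run(config, data):
    # config is an ordered list of (component name, parameters) pairs.
    for name, params in config:
        data = REGISTRY[name](**params).process(data)
    return data


config = [("tokenizer", {}), ("lookup", {"table": {"a": "A"}})]
print(run(config, "a b"))
```

    Changing the list in config changes which components run, in which order, and with which parameters, without touching the components themselves.</Paragraph>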
    <Paragraph position="4"> The MEAT system is completely implemented in C++, resulting in a relatively fast mode of operation. The implementation of the TFS formalism supports between 3000 and 4500 unifications per second, depending on the application it is used in. Translating an average-length sentence (20-25 words) takes about 3.5 seconds on a Pentium PII400 (in non-optimized debug mode). The system supports Unix (tested on Solaris and Linux) and Windows 95/98/NT. We use Unicode to represent character data, as we face translations of several different non-European languages with a variety of scripts.</Paragraph>
  </Section>
  <Section position="5" start_page="982" end_page="984" type="metho">
    <SectionTitle>
3 Development cycle
</SectionTitle>
    <Paragraph position="0"> One of the main requirements for deploying a new language with possibly scarce pre-existing resources is the ability to incrementally develop knowledge sources and translation capability (the incremental approach to MT development is described in Zajac (1999)). In the case of translation systems at our laboratory, we mostly translate into English. Thus, a complete set of English resources is already available (dictionary, generation grammars and morphological generation) and does not need to be developed.</Paragraph>
    <Paragraph position="1"> The first step in bootstrapping a running system is to build a bilingual dictionary. The work on the dictionary usually continues throughout the development process of higher level knowledge sources. We use dictionaries where entries are encoded as flat feature-value pairs, as shown in Figure 1.</Paragraph>
    <Paragraph position="2">  While this is already enough information to facilitate a basic word-for-word translation, in general a morphological analyzer for the source language is needed to translate real-world text. For MEAT, one can either import the results of an existing morphological analyzer, or use the native description language, based on a finite-state transducer using characters as left projections and typed feature structures as right projections (Zajac, 1998). After completing the morphological analysis of the source language and specifying the mapping of lexical features to English, glossing is available. The Glosser is a MEAT application consisting of morphological analysis of source language words, followed by dictionary lookup for single words and compounds, and the translation into English inflected word forms. An example of the interface for the glosser is shown in Figure 2.</Paragraph>
    <Paragraph position="4"> [Figure 2: glosser interface (screenshot not reproduced)]</Paragraph>
    <Paragraph position="7"> The next step in developing a medium-quality, broad-coverage translation system is to develop knowledge sources for the structural analysis of input sentences. MEAT supports the use of modular unification grammars, which facilitate development and debugging. Each grammar module can be developed and tested in isolation, with the final system applying each grammar in a linear fashion (Zajac and Amtrup, 2000). The main component used is a bidirectional island parser (cf. Stock et al. (1988)) for unification-based grammars. The grammar rules are usually written in the style of context-free rules, with the right-hand side given as a regular expression of feature structures. We plan to add more restricted types of grammars (e.g. based on finite-state transducers) to give the linguist a richer choice of syntactic processes.</Paragraph>
    <Paragraph position="10"> For the time being, the transfer capabilities of the system are restricted to lexical transfer, as we have not finished the implementation of a complex transfer module. Thus, the grammar developer either needs to create structural descriptions that match the English generation, or the English generation grammar has to be modified for each language.</Paragraph>
    <Paragraph position="11"> At each point during the development of a translation system, we consider it essential to be able not only to see the results, but also to monitor the processing history of a result. Thus, the chart that leads to the construction of an English output can be viewed in order to examine all intermediate constructions. In the MEAT system, each module records various steps of computation in the chart, which can be inspected statically after processing. A unified data interface for all modules in the system allows both the inspection of recorded internal data structures for each module (when it makes sense, such as in a chart parser), and the inspection of the input/output of all modules. The graphical interface used to view complex analyses is shown in Figure 3.</Paragraph>
    <Paragraph position="12"> 4 Applications  In this section, we give an overview of the capabilities of MEAT using two current examples from work at our laboratory. In the Shiraz project (Amtrup et al., 2000), we developed a machine translation system from Farsi to English, for which no previous knowledge sources were available. We mainly target news material and the translation of web pages. The Turkish-English system has been developed within the Expedition project, an enterprise for the rapid development of MT systems for low-density languages.</Paragraph>
    <Paragraph position="13"> Both systems use a common user interface for access to MEAT, which is shown in Figure 5. The MT systems are targeted to the translation of newspaper text and other sources available online (e.g. web pages). The emphasis is therefore put on extensive coverage rather than very high quality translation. Currently, we reach for both systems a level of quality that allows the content of source texts to be assessed in detail, at the expense of some infelicitous English.</Paragraph>
  </Section>
  <Section position="6" start_page="984" end_page="984" type="metho">
    <SectionTitle>
4.1 Persian-English MT
</SectionTitle>
    <Paragraph position="0"> The input for the Persian-English system is usually taken from web pages (on-line news articles), although plain text can be handled as well. The internal encoding is Unicode, and various codeset converters are available; we also developed an ASCII-based transliteration to facilitate the easy acquisition of dictionaries and grammars (see Figure 1).</Paragraph>
    <Paragraph position="1"> The dictionary consists of approximately 50,000 entries, single words as well as multi-word compounds.</Paragraph>
    <Paragraph position="2"> Additionally, we utilize a multi-lingual onomasticon maintained locally to identify proper names.</Paragraph>
    <Paragraph position="3">  The knowledge sources for syntax were (manually) developed using a corpus of 3,000 tagged and bracketed sentences extracted from a 10MB corpus of Persian news articles. We use three grammars, responsible for the attachment of auxiliaries to main verbs, the recognition and processing of light verb phenomena, and phrasal and sentential syntax, respectively. The combined size is about 110 rules. The development of the Persian resources took several months, primarily due to the fact that the translation system was developed in parallel to the linguistic knowledge. The Persian resources were developed by a team of one computational linguist (morphological and syntactic grammars, overall supervision for language resources) and three lexicographers (dictionary and corpus annotation).</Paragraph>
  </Section>
  <Section position="7" start_page="984" end_page="985" type="metho">
    <SectionTitle>
4.2 Turkish-English MT
</SectionTitle>
    <Paragraph position="0"> Within yet another project (Expedition), we developed a machine translation system for Turkish. This application functioned as a benchmark on how much effort the building of a medium-quality system requires, given that an appropriate framework is already available.</Paragraph>
    <Paragraph position="2"> For Turkish, we use a pre-existing morphological analyzer (Oflazer, 1994). Turkish shows a rich derivational and inflectional morphology, which accounts for most of the system development work that was necessary to build a wrapper for integrating the Turkish morphological analyzer in the system (approximately 60 person-hours). The development of the Turkish syntactic grammars took around 100 person-hours, resulting in 85 unification-based phrase structure rules describing the basics of Turkish syntax. The development of the bilingual Turkish-English dictionary had been going on for some time prior to the application of MEAT, and currently contains approximately 43,000 headwords.</Paragraph>
  </Section>
</Paper>