<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0826">
  <Title>Combining Linguistic Data Views for Phrase-based SMT</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The main motivation behind our work is to introduce linguistic information, other than lexical units, to the process of building word and phrase alignments. Many other authors have tried to do so. See (Och and Ney, 2000), (Yamada and Knight, 2001), (Koehn and Knight, 2002), (Koehn et al., 2003), (Schafer and Yarowsky, 2003) and (Gildea, 2003).</Paragraph>
    <Paragraph position="1"> Rather than pursuing full syntactic complexity, we suggest going back to the simpler alignment methods first described by (Brown et al., 1993). Our approach exploits the possibility of working with alignments at two different levels of granularity: lexical (words) and shallow parsing (chunks). To avoid confusion, we will henceforth talk about tokens instead of words as the minimal alignment unit.</Paragraph>
    <Paragraph position="2"> Apart from redefining the scope of the alignment unit, we may use different degrees of linguistic annotation. We introduce the general concept of data view, which is defined as any possible representation of the information contained in a bitext. We enrich data view tokens with features beyond the lexical, such as PoS, lemma, and chunk label.</Paragraph>
    <Paragraph position="3"> As an example of the applicability of data views, consider the case of the word 'plays' being seen in the training data acting as a verb. Representing this information as 'playsVBZ' would allow us to distinguish it from its homograph 'playsNNS' for 'plays' as a noun. Ideally, one would wish to have still deeper information, moving through syntax onto semantics, such as word senses. It would then be possible to distinguish, for instance, between two realizations of 'plays' with different meanings: 'hePRP playsVBZ guitarNN' and 'hePRP playsVBZ basketballNN'.</Paragraph>
    <Paragraph position="4"> Of course, there is a natural trade-off between the use of data views and data sparsity. Fortunately, we have enough data for statistical parameter estimation to remain reliable.</Paragraph>
  </Section>
</Paper>