<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3126"> <Title>The LDV-COMBO system for SMT</Title> <Section position="3" start_page="166" end_page="166" type="intro"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> The LDV-COMBO system follows the SMT architecture suggested by the workshop organizers. We use the Pharaoh beam-search decoder (Koehn, 2004).</Paragraph> <Paragraph position="1"> First, the training data are linguistically annotated. In order to achieve robustness, the same tools have been used to annotate both languages. The SVMTool has been used for PoS tagging (Giménez and Màrquez, 2004), and the Freeling package (Carreras et al., 2004) has been used for lemmatization.</Paragraph> <Paragraph position="2"> Finally, the Phreco software (Carreras et al., 2005) has been used for shallow parsing. In this paper we focus on data views at the word level. Six different data views have been built: (W) word, (L) lemma, (WP) word and PoS, (WC) word and chunk IOB label, (WPC) word, PoS and chunk IOB label, and (LC) lemma and chunk IOB label.</Paragraph> <Paragraph position="3"> Then, running GIZA++ (Och and Ney, 2003), we obtain token alignments for each of the data views.</Paragraph> <Paragraph position="4"> Combined phrase-based translation models are built on top of the Viterbi alignments output by GIZA++.</Paragraph> <Paragraph position="5"> Phrase extraction is performed following the phrase-extract algorithm described by Och (2002). We do not apply any heuristic refinement. We work with phrases of up to 5 tokens, and phrase pairs appearing only once have been discarded. Scoring is performed by relative frequency; no smoothing is applied.</Paragraph> <Paragraph position="6"> In this paper we focus on the global phrase extraction strategy of Giménez and Màrquez (2005). We build a single translation model from the union of alignments from the six data views described above. This model must match the input format. 
For instance, if the input is annotated with word and PoS (WP), so must be the translation model. Therefore, either the input must be enriched with linguistic annotation, or the translation models must be post-processed to remove the additional linguistic annotation. We observed no significant differences between the two alternatives, so we simply adapted the translation models to work under the assumption of unannotated input (W).</Paragraph> </Section> </Paper>