<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1013"> <Title>A Categorial Variation Database for English</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Building the CatVar </SectionTitle> <Paragraph position="0"> The CatVar database was developed using a combination of resources and algorithms including the Lexical Conceptual Structure (LCS) Verb and Preposition Databases (Dorr, 2001), the Brown Corpus section of the Penn Treebank (Marcus et al., 1993), an English morphological analysis lexicon developed for PC-Kimmo (Englex) (Antworth, 1990), NOMLEX (Macleod et al., 1998), and the Porter stemmer. The contribution of each of these sources is clearly labeled in the CatVar database, thus enabling the use of different cross-sections of the resource for different applications.4 Some of these resources were used to extract seed links between different words (Englex lexicon, NOMLEX and LDOCE). Others were used to provide a large-scale coverage of lexemes. In the case of the Brown Corpus, which doesn't provide lexemes for its words, the Englex morphological analyzer was used together with the part of speech specified in the Penn Tree Bank to extract the lexeme form. The Porter stemmer was later used as part of a clustering step to expand the seed links to create clusters of words that are categorial variants of each other, e.g., hungera1 , hungrya2a4a3 , hungera0 , hungrinessa1 .</Paragraph> <Paragraph position="1"> The current version of the CatVar (version 2.0) includes 62,232 clusters covering 96,368 unique lexemes.</Paragraph> <Paragraph position="2"> The lexemes belong to one of four parts-of-speech (Noun 62%, Adjective 24%, Verb 10% and Adverb 4%). Almost half of the clusters currently include one word only. Three-quarters of these single-word clusters are nouns and one-fifth are adjectives. The other half of the words is distributed in a Zipf fashion over clusters from size 2 to 27. Figure 1 shows the word-cluster distribution.</Paragraph> <Paragraph position="3"> A smaller supplementary database devoted to verb-preposition variations was constructed solely from the LCS verb and preposition lexicon using shared LCS primitives to cluster. The database was inspired by pairs such as crossa0 and acrossa5 which are used in Generation-Heavy MT. But since verb-preposition clusters are not typically morphologically related, they are higher Bleu scores were obtained when using the portions of the CatVar database that are most relevant to nominalized events (e.g., NOMLEX).</Paragraph> <Paragraph position="4"> kept separate from the rest of the CatVar database and they were not included in the evaluation presented in this paper.5 The CatVar is web-browseable at http://clipdemos.umiacs.umd.edu/catvar/. Figure 2 shows the CatVar web-based interface with the hunger cluster as an example. The interface allows searching clusters using regular expressions as well as cluster length restrictions. The database is also available for researchers in perl/C and lisp searchable formats.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Applications </SectionTitle> <Paragraph position="0"> Our project is focused on resource building and evaluation. However, the CatVar database is relevant to a number of natural language applications, including generation for MT, headline generation, and cross-language divergence unraveling for bilingual alignment. 
<Paragraph position="1"> The current version of the CatVar (version 2.0) includes 62,232 clusters covering 96,368 unique lexemes.</Paragraph> <Paragraph position="2"> The lexemes belong to one of four parts of speech (Noun 62%, Adjective 24%, Verb 10%, and Adverb 4%). Almost half of the clusters currently include only one word. Three-quarters of these single-word clusters are nouns and one-fifth are adjectives. The remaining words are distributed in Zipfian fashion over clusters ranging in size from 2 to 27. Figure 1 shows the word-cluster distribution.</Paragraph> <Paragraph position="3"> A smaller supplementary database devoted to verb-preposition variations was constructed solely from the LCS verb and preposition lexicon, using shared LCS primitives for clustering; it covers more than 230 verbs and 29 prepositions. The database was inspired by pairs such as cross_V and across_P, which are used in Generation-Heavy MT. Other examples of verb-preposition clusters include avoid_V and away-from_P, enter_V and into_P, and border_V and beside_P (or next-to_P). But since verb-preposition clusters are not typically morphologically related, they are kept separate from the rest of the CatVar database and were not included in the evaluation presented in this paper.</Paragraph> <Paragraph position="4"> The CatVar is web-browseable at http://clipdemos.umiacs.umd.edu/catvar/. Figure 2 shows the CatVar web-based interface with the hunger cluster as an example. The interface allows searching clusters using regular expressions as well as cluster-length restrictions. The database is also available to researchers in Perl/C and Lisp searchable formats.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Applications </SectionTitle> <Paragraph position="0"> Our project is focused on resource building and evaluation. However, the CatVar database is relevant to a number of natural language applications, including generation for MT, headline generation, and cross-language divergence unraveling for bilingual alignment. Each of these is discussed below, in turn.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Generation-Heavy Machine Translation </SectionTitle> <Paragraph position="0"> The Generation-Heavy Hybrid Machine Translation (GHMT) model was introduced in (Habash, 2002) to handle translation divergences between language pairs with asymmetrical (poor-source/rich-target) resources. The approach does not rely on a transfer lexicon or a common interlingual representation to map divergent structural configurations from the source to the target language.</Paragraph> <Paragraph position="1"> Instead, different alternative structural configurations are over-generated and statistically ranked using a language model. The CatVar database is used as one of the constraints on the structural expansion step. For example, to allow the conflation of verbs such as make_V or cause_V with an argument such as development_N, the first condition for conflatability is finding a verb categorial variant of the argument development_N. In this case the verb categorial variant is develop_V.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Headline Generation </SectionTitle> <Paragraph position="0"> The HeadGen headline generator was introduced in (Zajic et al., 2002) to create headlines automatically from newspaper text. The goal is to generate an informative headline (one that specifies the event and its participants), not just an indicative headline (one that specifies the topic only). The system is implemented as a Hidden Markov Model enhanced with a postprocessor that filters out headlines that do not contain a verbal or nominalized event. This is achieved by verifying that at least one word in the generated headline appears in CatVar as a V (a verbal event) or as an N whose verbal counterpart is in the same cluster (a nominalized event).</Paragraph> <Paragraph position="1"> A recent study indicates a significant improvement in Bleu scores (using human-generated headlines as references) when running headline generation with the CatVar filter: HeadGen scores 0.1740 with the CatVar filter and 0.1687 without it. This quantitative distinction correlates with human-perceived differences, e.g., between the two headlines Washingtonians fight over drugs and In the nation's capital (generated for the same story, with and without CatVar, respectively).</Paragraph>
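The filtering check itself is easy to state in code. The sketch below is a hypothetical illustration, not HeadGen internals: the index layout, the function names is_event_word and passes_catvar_filter, and the toy data are all assumptions. A headline survives the filter only if some word in it is listed in CatVar as a verb, or as a noun whose cluster also contains a verb.

# Hypothetical sketch of the CatVar-based event filter described above.
# `catvar` maps a (word, pos) entry to its cluster, a set of (word, pos) entries;
# building that index from the database itself is not shown here.
from typing import Dict, Set, Tuple

Entry = Tuple[str, str]          # (lexeme, part of speech), e.g. ("develop", "V")
CatVarIndex = Dict[Entry, Set[Entry]]


def is_event_word(word: str, catvar: CatVarIndex) -> bool:
    """True if `word` can denote a verbal or nominalized event according to CatVar."""
    if (word, "V") in catvar:                      # verbal event
        return True
    cluster = catvar.get((word, "N"), set())       # nominalized event: a noun whose
    return any(pos == "V" for _, pos in cluster)   # cluster has a verbal counterpart


def passes_catvar_filter(headline_words, catvar: CatVarIndex) -> bool:
    """Keep a candidate headline only if it contains at least one event word."""
    return any(is_event_word(w, catvar) for w in headline_words)


# Toy index: "fight" is listed as a verb and "development" has develop_V in its
# cluster, so the first headline passes while the second does not.
toy_catvar: CatVarIndex = {
    ("develop", "V"): {("develop", "V"), ("development", "N")},
    ("development", "N"): {("develop", "V"), ("development", "N")},
    ("fight", "V"): {("fight", "V"), ("fight", "N")},
    ("fight", "N"): {("fight", "V"), ("fight", "N")},
}
print(passes_catvar_filter(["washingtonians", "fight", "over", "drugs"], toy_catvar))  # True
print(passes_catvar_filter(["in", "the", "nation's", "capital"], toy_catvar))          # False

Essentially the same cluster lookup underlies the other two applications described in this section: GHMT's conflatability condition (Section 4.1) asks whether a noun argument such as development_N has a verb variant in its cluster, and DUSTer's rule selection (Section 4.3) checks V[CatVar=N] in the opposite direction.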
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 DUSTer </SectionTitle> <Paragraph position="0"> DUSTer (Divergence Unraveling for Statistical Translation) was introduced in (Dorr et al., 2002).</Paragraph> <Paragraph position="1"> In this system, common divergence types are systematically identified, and English sentences are transformed so that their structure bears a closer resemblance to that of another language, using a mapping referred to as E-to-E'. The objective is to enable more accurate alignment and projection of dependency trees in another language without requiring any training on dependency-tree data in that language.</Paragraph> <Paragraph position="2"> The CatVar database has been incorporated into two components of the DUSTer system: (1) in the E-to-E' mapping, e.g., the transformation from kick_V to LightVB kick_N (corresponding to the English/Spanish divergence pair kick/dar patada); and (2) during an automatic mark-up phase prior to this transformation, where the particular E-to-E' mapping is selected from a set of possibilities based on the two input sentences. For example, the rule V[CatVar=N] -> LightVB N is selected for the transformation above by first checking that the verb V is associated with a word of category N in CatVar. Transforming divergent English sentences using this mechanism has been shown to facilitate word-level alignment by reducing the number of unaligned and multiply-aligned words.</Paragraph> </Section> </Section> </Paper>