<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0201"> <Title>Applications of Lexical Information for Algorithmically Composing Multiple-Choice Cloze Items</Title> <Section position="4" start_page="1" end_page="2" type="intro"> <SectionTitle> 3 Source Corpus and Lexicons </SectionTitle> <Paragraph position="0"> Using a web crawler, we retrieve the contents of Taiwan Review &lt;publish.gio.gov.tw&gt;, Taiwan Journal &lt;taiwanjournal.nat.gov.tw&gt;, and China Post &lt;www.chinapost.com.tw&gt;. The corpus currently contains 127,471 sentences comprising 2,771,503 words of 36,005 types. We look for useful sentences in web pages encoded in HTML. To do so, we must extract the text from a mixture of titles, report bodies, and multimedia content, and then segment the extracted paragraphs into individual sentences. We segment sentences with MXTERMINATOR (Reynar and Ratnaparkhi, 1997). We then tokenize the words in each sentence before assigning tags to the tokens.</Paragraph> <Paragraph position="1"> We augment the text with an array of tags that facilitate cloze item generation. We assign part-of-speech (POS) tags to the words with MXPOST, which adopts the Penn Treebank tag set (Ratnaparkhi, 1996). Based on the assigned POS tags, we annotate words with their lemmas. For instance, we annotate classified with the lemma classify when its POS tag is VBN, and with classified when its POS tag is JJ. We also employ MINIPAR (Lin, 1998) to obtain partial parses of the sentences, which we use extensively in our system. Words with direct relationships can be identified easily in the partially parsed trees, and we rely heavily on these relationships for word sense disambiguation (WSD). 
For easy reference, we refer to words that have a direct syntactic relationship with a word W as W's signal words, or simply signals.</Paragraph> <Paragraph position="2"> Since we focus on creating items for verbs, nouns, adjectives, and adverbs (Liu et al., 2005), we use the signals of words with these POS tags to disambiguate word senses. Specifically, the signals of a verb include its subject, its object, and the adverbs that modify it. The signals of a noun include the adjectives that modify the noun and the verb that takes the noun as its object or predicate. For instance, in &quot;Jimmy builds a grand building.&quot;, both &quot;build&quot; and &quot;grand&quot; are signals of &quot;building&quot;. The signals of adjectives and adverbs include the words that they modify and the words that modify them.</Paragraph> <Paragraph position="3"> When we need lexical information about English words, we consult electronic lexicons. We use WordNet &lt;www.cogsci.princeton.edu/~wn/&gt; when we need definitions and sample sentences of words for disambiguating word senses, and we employ HowNet &lt;www.keenage.com&gt; when we need information about classes of verbs, nouns, adjectives, and adverbs.</Paragraph> <Paragraph position="4"> HowNet is a bilingual lexicon. An entry in HowNet includes slots for Chinese words, English words, POS information, etc. We rely heavily on the slot that records the semantic ingredients related to the word being defined. HowNet uses a limited set of words in the semantic-ingredient slot, and the leading ingredient in the slot is generally considered the most important one.</Paragraph> </Section> </Paper>