File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-3002_intro.xml
Size: 1,322 bytes
Last Modified: 2025-10-06 14:03:50
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3002"> <Title>Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering</Title> <Section position="4" start_page="7" end_page="7" type="intro"> <SectionTitle> 1.3 Outline </SectionTitle> <Paragraph position="0"> This work constructs an unsupervised POS tagger from scratch. The input to our system is a considerable amount of unlabeled, monolingual text without any POS information. In a first stage, we employ a clustering algorithm on distributional similarity, which groups a subset of the 10,000 most frequent words of a corpus into several hundred clusters (partitioning 1). In a second stage, we use similarity scores on neighbouring co-occurrence profiles to obtain a further several hundred clusters of medium- and low-frequency words (partitioning 2). Combining both partitionings yields sets of word forms, where the members of each set belong to the same derived syntactic category. To increase text coverage, we add to the lexicon the ambiguous high-frequency words that were discarded for partitioning 1. Finally, we train a Viterbi tagger with this lexicon and augment it with an affix classifier for unknown words.</Paragraph> <Paragraph position="1"> The resulting taggers are evaluated against the outputs of supervised taggers for various languages.</Paragraph> </Section> </Paper>