File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/90/h90-1055_abstr.xml
Size: 2,906 bytes
Last Modified: 2025-10-06 13:47:00
<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1055"> <Title>Deducing Linguistic Structure from the Statistics of Large Corpora</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with an error rate of less than 3-4%, when trained on moderate-sized (500K word) corpora of English text (e.g. Church, 1988; Hindle, 1989). The success of these techniques suggests that much of the grammatical structure of language may be derived automatically through distributional analysis, an approach attempted and abandoned in the 1950s.</Paragraph> <Paragraph position="1"> We describe here two experiments to see how far purely distributional techniques can be pushed to automatically provide both a set of part-of-speech tags for English, and a grammatical analysis of free English text. We also discuss the state of a tagged NL corpus to aid such research (now amounting to 4 million words of hand-corrected part-of-speech tagging).</Paragraph> <Paragraph position="2"> In the experiment described in Section 2, we have developed a constituent boundary parsing algorithm which derives an (unlabelled) bracketing given text annotated for part of speech as input. This method is based on the hypothesis that constituent boundaries can be extracted from a given part-of-speech n-gram by analyzing the mutual information values within the n-gram, extended to a new generalization of the information-theoretic measure of mutual information. This hypothesis is supported by the performance of an implementation of this parsing algorithm which determines recursively nested sentence structure, with an error rate of roughly two misplaced boundaries for test sentences of length 10-15 words, and five misplaced boundaries for sentences of 15-30 tokens.
To handle a limited set of specific circumstances in which the hypothesis fails, we use a small (4-rule, 8-symbol) distituent grammar, which indicates when two parts of speech cannot remain in the same constituent.</Paragraph> <Paragraph position="3"> In another experiment, described in Section 3, we investigate whether a distributional analysis can discover a part-of-speech tag set which might prove adequate to support experiments like that discussed above. We have developed a similarity measure which accurately clusters closed-class lexical items of the same grammatical category, excepting words which are ambiguous between multiple parts of speech.</Paragraph> <Paragraph position="4"> No. AFOSR-90-0066, and by ARO grant No. DAAL 03-89-C0031 PRI. Thanks to Ken Church, Stuart Shieber, Max Mintz, Aravind Joshi, Lila Gleitman and Tom Veatch for their valued suggestions and discussion.</Paragraph> </Section> </Paper>