<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3234">
<Title>Trained Named Entity Recognition Using Distributional Clusters</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Co-Clustering </SectionTitle>
<Paragraph position="0"> As in Brown et al. (1992), we seek a partition of the vocabulary that maximizes the mutual information between term categories and their contexts. </Paragraph>
<Paragraph position="1"> To achieve this, we use information-theoretic co-clustering (Dhillon et al., 2003), in which a space of entities, on the one hand, and their contexts, on the other, are alternately clustered to maximize the mutual information between the two spaces. </Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Background </SectionTitle>
<Paragraph position="0"> The input to our algorithm is two finite sets of symbols, say X = {x_1, x_2, ..., x_m} (e.g., terms) and Y = {y_1, y_2, ..., y_n} (e.g., term contexts), together with co-occurrence count data consisting of a non-negative integer N(x_i, y_j) for every pair of symbols (x_i, y_j) from X and Y. The output is two partitions, \hat{X} of X and \hat{Y} of Y, chosen to (locally) maximize the mutual information between them, under a constraint limiting the total number of clusters in each partition. </Paragraph>
<Paragraph position="1"> Recall that the entropy or Shannon information of a discrete distribution is:

H(X) = -\sum_{x \in X} p(x) \log p(x)

This quantifies the average improvement in one's knowledge upon learning the specific value of an event drawn from X. It is large or small depending on whether X has many or few probable values. </Paragraph>
<Paragraph position="4"> The mutual information between random variables X and Y can be written:

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

This quantifies the amount that one expects to learn indirectly about X upon learning the value of Y, or vice versa. </Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 The Algorithm </SectionTitle>
<Paragraph position="0"> Let X be a random variable over vocabulary terms as found in some text corpus. We define Y to range over immediately adjacent tokens, encoding co-occurrences in such a way as to distinguish left from right occurrences. </Paragraph>
<Paragraph position="1"> Given co-occurrence matrices tabulated in this way, we perform an approximate maximization of I(\hat{X}; \hat{Y}) using a simulated annealing procedure in which each trial move takes a symbol x or y out of the cluster to which it is tentatively assigned and places it into another. Candidate moves are chosen by selecting a non-empty cluster uniformly at random, randomly selecting one of its members, then randomly selecting a destination cluster other than the source cluster. When temperature 0 is reached, all possible moves are repeatedly attempted until no further improvements are possible. </Paragraph>
<Paragraph position="4"> For efficiency and noise reduction, we first cluster only the 5000 most frequent terms and context terms. The remaining terms in the corpus vocabulary are then added by assigning each term to the cluster that maximizes the mutual information objective function. </Paragraph>
</Section>
</Section>
</Paper>
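
As a concrete illustration of the quantities defined in Section 3.1, here is a minimal Python sketch (not from the paper) that computes the entropy H(X) and the mutual information I(X; Y) from a co-occurrence count matrix. The use of numpy, the base-2 logarithm, the function names, and the toy counts are all assumptions made for illustration.

import numpy as np

def entropy(p):
    """H(P) = -sum_v p(v) log2 p(v), skipping zero-probability values."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(counts):
    """I(X; Y) from a non-negative count matrix with counts[i, j] = N(x_i, y_j)."""
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()       # p(x, y)
    px = joint.sum(axis=1)              # p(x): marginal over contexts (columns)
    py = joint.sum(axis=0)              # p(y): marginal over terms (rows)
    # Uses the identity I(X; Y) = H(X) + H(Y) - H(X, Y).
    return entropy(px) + entropy(py) - entropy(joint.ravel())

# Toy example: 3 terms x 4 context tokens.
counts = [[8, 1, 0, 1],
          [1, 7, 2, 0],
          [0, 2, 6, 2]]
print(round(mutual_information(counts), 3))

The identity I(X; Y) = H(X) + H(Y) - H(X, Y) used here is equivalent to the double-sum form of the mutual information given in Section 3.1.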
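
The simulated annealing co-clustering of Section 3.2 could look roughly like the sketch below. This is one reading of the description, not the authors' implementation: the Metropolis acceptance rule exp(delta/T), the geometric cooling schedule, the cluster counts, and the random toy matrix are assumptions, and the final zero-temperature pass over all possible moves is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def cluster_mi(counts, row_labels, col_labels, k_rows, k_cols):
    """I(Xhat; Yhat): mutual information of the counts aggregated by cluster."""
    n_rows, n_cols = counts.shape
    R = np.zeros((k_rows, n_rows))
    R[row_labels, np.arange(n_rows)] = 1.0      # row-cluster indicator matrix
    C = np.zeros((n_cols, k_cols))
    C[np.arange(n_cols), col_labels] = 1.0      # column-cluster indicator matrix
    joint = R @ counts @ C                      # cluster-level co-occurrence counts
    joint /= joint.sum()                        # p(xhat, yhat)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz]))

def trial_move(labels, k):
    """Pick a non-empty cluster, one of its members, and a different destination."""
    src = rng.choice(np.unique(labels))
    item = rng.choice(np.flatnonzero(labels == src))
    dst = rng.choice([c for c in range(k) if c != src])
    return item, src, dst

def co_cluster(counts, k_rows, k_cols, steps=20000, t0=1.0, cooling=0.9995):
    n_rows, n_cols = counts.shape
    rows = rng.integers(k_rows, size=n_rows)    # tentative term-cluster labels
    cols = rng.integers(k_cols, size=n_cols)    # tentative context-cluster labels
    score = cluster_mi(counts, rows, cols, k_rows, k_cols)
    temp = t0
    for step in range(steps):
        # Alternate between moving a term and moving a context token.
        labels, k = (rows, k_rows) if step % 2 == 0 else (cols, k_cols)
        item, src, dst = trial_move(labels, k)
        labels[item] = dst
        new_score = cluster_mi(counts, rows, cols, k_rows, k_cols)
        delta = new_score - score
        if delta >= 0 or rng.random() < np.exp(delta / temp):
            score = new_score                   # keep the improving (or lucky) move
        else:
            labels[item] = src                  # undo the move
        temp *= cooling                         # geometric cooling (an assumption)
    return rows, cols, score

# Toy run: 20 "terms" x 15 "context tokens", 4 term clusters, 3 context clusters.
counts = rng.integers(0, 5, size=(20, 15)).astype(float)
rows, cols, score = co_cluster(counts, k_rows=4, k_cols=3, steps=5000)
print(round(score, 3), rows)

In the paper's setting, the rows would correspond to the 5000 most frequent terms; each remaining vocabulary term would afterwards be assigned to whichever existing term cluster gives the largest value of the same mutual information objective.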