Experiments on Unsupervised Learning for Extracting Relevant Fragments from Spoken Dialog Corpus

2 Description of the algorithm

There are four main steps in our approach.

The first step is to automatically label each sentence from the spoken dialog corpus and chunk it into a set of short subphrases. We assume that in spoken dialog a sentence consists of loosely related subphrases. For labeling and chunking we use a relatively small set of domain-independent words such as prepositions, determiners, articles, modals, and adverbs; for example, articles: a, an, the; prepositions: in, with, about, under, for, of, to; determiners: some, many.

The domain-independent words are grouped into subvocabularies. For instance, the subvocabulary <article> includes the words a, an, and the; some subvocabularies include only one word. If a given sentence contains the article a, we replace it with the label (<article>A); the article the becomes (<article>THE), and so on. An important feature of our algorithm is that some of the words selected as tags can predict the semantics of the words or subphrases that follow them, although in all cases this prediction is a possibility rather than a certainty. For example, the word from predicts that the following words or subphrases express a "start point", a "reason", a "source", or something else. For each such tag word we create a separate subvocabulary. During labeling we scan a given sentence from left to right and replace the tag words with their labels, using tools based on the AT&T CHRONUS system described by Levin (1995).

Chunking also proceeds through the sentence from left to right. A chunk consists of one tag-word label, or several consecutive tag-word labels, together with the non-tag words to their right, up to but excluding the next tag-word label. Two examples of chunks are (<what>WHAT) TYPE and (<pronouns>I) (<article>A) FARE.

We describe each non-tag word by a vector of features. Every component of the vector corresponds to one subvocabulary of tag words, as shown below:

component 1 → (<article>...)
component 2 → (<determiner>...)
component 3 → (<modal>...)
component 4 → (<of>OF)
component 5 → (<to>TO)
...
component n → (<from>FROM)

Each component is an integer recording how many times the corresponding tag-word label occurred in the left context of the described non-tag word. Thus we obtain a list of non-tag words and, for each word, a vector of integers.
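To make the labeling and chunking step concrete, here is a minimal sketch in Python. The subvocabularies, their contents, and all function names are illustrative assumptions; the paper itself uses tools based on the AT&T CHRONUS system rather than this code.

```python
# Hypothetical subvocabularies of domain-independent tag words;
# the paper's actual tag-word set is larger and not reproduced here.
SUBVOCABULARIES = {
    "article": {"a", "an", "the"},
    "determiner": {"some", "many"},
    "preposition": {"in", "with", "about", "under", "for", "of", "to"},
    "what": {"what"},   # single-word subvocabulary
    "from": {"from"},   # single-word subvocabulary
}

def label(sentence):
    """Scan left to right and replace tag words with labels such as
    (<article>THE); non-tag words pass through unchanged."""
    tokens = []
    for word in sentence.lower().split():
        for name, vocab in SUBVOCABULARIES.items():
            if word in vocab:
                tokens.append(("tag", name, word.upper()))
                break
        else:
            tokens.append(("word", None, word.upper()))
    return tokens

def chunk(tokens):
    """Group one or more consecutive tag-word labels with the non-tag
    words to their right, up to but excluding the next tag-word label."""
    chunks, current = [], []
    for tok in tokens:
        # a tag label following non-tag words starts a new chunk
        if tok[0] == "tag" and current and current[-1][0] == "word":
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks
```

On a sentence such as "what type is a fare", chunk(label(...)) would group the tokens into (<what>WHAT) TYPE IS and (<article>A) FARE.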
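The feature vectors can then be built by a single pass over the labeled tokens. The paper does not pin down the exact extent of the "left context", so the sketch below assumes it is the tag labels seen earlier in the same sentence, accumulated over all occurrences of the word in the corpus; the fixed ordering of subvocabularies is likewise an assumption.

```python
from collections import Counter

def feature_vectors(labeled_sentences, subvocab_names):
    """For every non-tag word, count how many times each tag-word label
    occurred in its left context. Components are indexed by an assumed
    fixed ordering of the subvocabularies."""
    vectors = {}
    for tokens in labeled_sentences:
        left = Counter()  # tag labels seen so far in this sentence
        for kind, name, word in tokens:
            if kind == "tag":
                left[name] += 1
            else:
                vec = vectors.setdefault(word, [0] * len(subvocab_names))
                for i, sub in enumerate(subvocab_names):
                    vec[i] += left[sub]
    return vectors
```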
The second step is to cluster the words from all chunks using these feature vectors. In this step we extract from the chunks the words that have enough semantically charged tags in their left context, and we group such words into clusters.

For clustering we take the first non-tag word from the list and check whether the number of different tags (the number of non-zero components of its vector) is greater than a threshold. The threshold value must be greater than the number of tag words with low semantic predictive power (articles, modals, auxiliaries, determiners); in our experiments we used threshold values from 6 to 9. If the number of different tags for the tested vector exceeds the threshold, we consider this vector the centre of a cluster and look for other vectors neighbouring it. Once the neighbouring vectors are selected, we remove them from the list of vectors. This procedure is repeated for every vector not yet selected as a member of a cluster. For these experiments we used a distance measure based on the Hamming metric.

In the third step we go back to the chunks and extract those chunks that include words from a single cluster. In this way we generate clusters of chunks.

In the fourth step we reduce the number of chunk clusters. We form the union of all chunk clusters except one tested cluster and then intersect the tested cluster with this union. If all chunks from the tested cluster lie inside the union, we delete the tested cluster.

Let us also consider a baseline algorithm that uses the "stop words" known from information retrieval systems. The idea of this algorithm is to delete the stop words from a given sentence and return all of the remaining words as lexicon items.

There are principal differences between the baseline algorithm and the suggested algorithm. In the suggested algorithm we look for words that have enough semantically charged tags in their left context and then extract the chunks that include the selected words. In the baseline algorithm we keep only the words that remain after deleting the stop words.
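For concreteness, steps 2 to 4 above might be sketched as follows. The threshold and the Hamming-style distance come from the description above; the neighbourhood radius and all function names are assumptions, since the paper states only that the distance measure is based on the Hamming metric.

```python
def hamming(u, v):
    """Hamming-style distance: number of components that differ."""
    return sum(a != b for a, b in zip(u, v))

def cluster_words(vectors, threshold=6, radius=2):
    """Step 2: greedy clustering. A word whose vector has more than
    `threshold` non-zero components seeds a cluster; vectors within
    `radius` (an assumed parameter) join it and are removed from
    further consideration."""
    remaining = dict(vectors)
    clusters = []
    for word, vec in vectors.items():
        if word not in remaining:
            continue
        if sum(c != 0 for c in vec) <= threshold:
            continue
        members = [w for w, v in remaining.items() if hamming(vec, v) <= radius]
        for w in members:
            del remaining[w]
        clusters.append(set(members))
    return clusters

def chunk_clusters(chunks, clusters):
    """Step 3: for each word cluster, collect the chunks that
    contain one of its words."""
    return [{tuple(ch) for ch in chunks
             if any(tok[0] == "word" and tok[2] in cl for tok in ch)}
            for cl in clusters]

def reduce_clusters(chunk_cls):
    """Step 4: delete a tested cluster if all of its chunks already
    appear in the union of the remaining clusters."""
    kept = list(chunk_cls)
    for tested in list(kept):
        union = set().union(*(c for c in kept if c is not tested))
        if tested <= union:
            kept.remove(tested)
    return kept
```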
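The baseline, by contrast, can be stated in a few lines; the stop-word list below is a placeholder assumption, as any standard IR stop list would do.

```python
# Hypothetical stop-word list standing in for a standard IR stop list.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "i", "what"}

def baseline_lexicon(sentences):
    """Baseline: delete stop words and return every remaining word as
    a lexicon item, with no chunking, vectors, or clustering."""
    return {w for s in sentences for w in s.lower().split()
            if w not in STOP_WORDS}
```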