<?xml version="1.0" standalone="yes"?> <Paper uid="N06-4001"> <Title>InfoMagnets: Making Sense of Corpus Data</Title> <Section position="3" start_page="253" end_page="254" type="intro"> <SectionTitle> 2 Functionality </SectionTitle> <Paragraph position="0"> Exploring a textual corpus in search of interesting topical patterns that correlate with externally observable variables is a non-trivial task. Take as an example the task of characterizing the process by which students and tutors negotiate with one another over a chat interface as they navigate instructional materials together in an on-line exploratory learning environment. A sensible approach is to segment all dialogue transcripts into topic-oriented segments and then group the segments by topic similarity. If done manually, this is a challenging task in two respects. First, to segment each dialogue the analyst must rely on their knowledge of the domain to locate where the focus of the dialogue shifts from one topic to the next. This, of course, requires the analyst to know what to look for and to remain consistent throughout the whole set of dialogues. Second, and more importantly, it introduces a primacy bias into the topic analysis. The analyst may miss important dialogue digressions simply because they are not expected based on observations from the first few dialogues viewed in detail. InfoMagnets addresses these issues by offering users a constant bird's eye view of their data. See Figure 1.</Paragraph> <Paragraph position="1"> As input, InfoMagnets accepts a corpus of textual documents. As an option to the user, the documents can be automatically fragmented into topically-coherent segments (referred to also as documents from here on), which then become the atomic textual unit. The documents (or topic segments) are automatically clustered into an initial organization that the user then incrementally adjusts through the interface. 
Figure 1 shows the initial document-to-topic assignment that InfoMagnets produces as a starting point for the user. The large circles represent InfoMagnets, or topic-oriented cluster centroids, and the smaller circles represent documents. An InfoMagnet can be thought of as a set of words representative of a topic concept. The similarity between the vector representation of the words in a document and that of the words in an InfoMagnet translates into attraction in the two-dimensional InfoMagnet space.</Paragraph> <Paragraph position="2"> This semantic similarity is computed using Latent Semantic Analysis (LSA) (Landauer et al., 1998).</Paragraph> <Paragraph position="3"> Thus, a document appears closest to the InfoMagnet that best represents its topic.</Paragraph> <Paragraph position="4"> A document that appears equidistant to two InfoMagnets shares its content equally between the two represented topics. InfoMagnets with many documents nearby represent popular topics. InfoMagnets with only a few documents nearby represent infrequent topics. Should the user decide to remove an InfoMagnet, any document with some level of attraction to that InfoMagnet will animate and reposition itself based on the topics still represented by the remaining InfoMagnets. At all times, the InfoMagnets interface offers the analyst a bird's eye view of the entire corpus as it is being analyzed and organized.</Paragraph> <Paragraph position="5"> Given the automatically-generated initial topic representation, the user typically starts by browsing the different InfoMagnets and documents. Using a magnifying cross-hair lens, the user can view the contents of a document on the top pane. As noted above, each InfoMagnet represents a topic concept through a collection of words (from the corpus) that convey that concept. Selecting the InfoMagnet displays this list of words on the left pane. The list is shown in descending order of importance with respect to that topic. 
By browsing each InfoMagnet's list of words and browsing nearby documents, the user can start recognizing topics represented in the InfoMagnet space and can start labeling those InfoMagnets. (Due to lack of space, we do not focus on our topic-segmentation algorithm. We intend to discuss this in the demo.)</Paragraph> <Paragraph position="6"> InfoMagnets with only a few neighboring documents can be removed. Likewise, InfoMagnets attracting too many topically-unrelated documents can be split into multiple topics. The user can do this semi-automatically (by requesting a split, and allowing the algorithm to determine where the best split is) or by manually selecting a set of terms from the InfoMagnet's word list and creating a new InfoMagnet using those words to represent the new InfoMagnet's topic. If the user finds words in an InfoMagnet's word list that lack topical relevance, the user can remove them from that InfoMagnet's word list or from all the InfoMagnets' word lists at once.</Paragraph> <Paragraph position="7"> Users may also choose to manually assign a document to a topic by &quot;snapping&quot; that document to an InfoMagnet. &quot;Snapping&quot; is a way of overriding the attraction between the document and other InfoMagnets. By &quot;snapping&quot; a document to an InfoMagnet, the relationship between the &quot;snapped&quot; document and the associated InfoMagnet remains constant, regardless of any changes made to the InfoMagnet space subsequently.</Paragraph> <Paragraph position="8"> If a user would like to remove the influence of a subset of the corpus from the behavior of the tool, the user may select an InfoMagnet and all the documents close to it and place them in the &quot;quarantine&quot; area of the interface. When placed in the quarantine, as when &quot;snapped&quot;, a document's assignment remains unchanged. 
This feature is used to free screen space for the user.</Paragraph> <Paragraph position="9"> If the user opts for segmenting each input discourse and working with topic segments rather than whole documents, an alternative interface allows the user to quickly browse through the corpus sequentially (Figure 2). By switching between this view and the bird's eye view, the user is able to see where each segment fits sequentially into the larger context of the discourse it was extracted from. The user can also use the sequential interface for making minor adjustments to topic segment boundaries and topic assignments where necessary. Once the user is satisfied with the topic representation in the space and the assignments of all documents to those topics, the tool can automatically generate an</Paragraph> </Section> </Paper>