File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-1093_metho.xml
Size: 18,571 bytes
Last Modified: 2025-10-06 14:14:55
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1093"> <Title>Information Classification and Navigation Based on 5W1H of the Target Information</Title> <Section position="4" start_page="571" end_page="572" type="metho"> <SectionTitle> 3 5W1H Classification and Navigation </SectionTitle> <Paragraph position="0"> Conventional keyword-based retrieval does not consider logical relationships between keywords. For example, the condition &quot;NEC &amp; semiconductor &amp; produce&quot; retrieves an article containing &quot;NEC formed a technical alliance with B company, and B company produced semiconductor X.&quot; Mine et al. and Satoh et al. reported that this problem leads to retrieval noise and unnecessary results (Mine et al., 1997; Satoh and Muraki, 1993). This problem makes it difficult to meet the requirements of an office because it produces retrieval noise in these three types of operations.</Paragraph> <Paragraph position="1"> 5W1H information is who, when, where, what, why, how, and predicate information extracted from text data by the 5W1H extraction module using language dictionaries and sentence analysis techniques. The 5W1H extraction module assigns 5W1H indexes to the text data. The indexes are stored as lists of predicates and arguments (when, who, what, why, where, how) (Lesk et al., 1997). The 5W1H index can suppress retrieval noise because the index considers the logical relationships between keywords. For example, the 5W1H index makes it possible to retrieve texts using the retrieval condition &quot;who: NEC &amp; what: semiconductor &amp; predicate: produce.&quot; It can filter out the article containing &quot;NEC formed a technical alliance with B company, and B company produced semiconductor X.&quot; Based on 5W1H information, we propose a 5W1H classification and navigation model which can meet office retrieval requirements. 
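The contrast between a keyword condition and a 5W1H condition can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout of the 5W1H sets and the helper names matches_keywords and matches_roles are hypothetical and are not taken from the system described here.

```python
# Illustrative sketch only: the record layout and helper names are assumptions,
# not the index format of the system described in the text.

ARTICLE_TEXT = ("NEC formed a technical alliance with B company, "
                "and B company produced semiconductor X.")

# One 5W1H set per predicate found in the article (hand-built for this example).
ARTICLE_INDEX = [
    {"who": "NEC", "what": "technical alliance", "predicate": "form"},
    {"who": "B company", "what": "semiconductor", "predicate": "produce"},
]


def matches_keywords(text, keywords):
    """Keyword retrieval: every keyword occurs somewhere in the text."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def matches_roles(index, condition):
    """5W1H retrieval: a single 5W1H set must satisfy the whole condition."""
    return any(
        all(entry.get(role) == value for role, value in condition.items())
        for entry in index
    )


keyword_hit = matches_keywords(ARTICLE_TEXT, ["NEC", "semiconductor", "produce"])
role_hit = matches_roles(
    ARTICLE_INDEX,
    {"who": "NEC", "what": "semiconductor", "predicate": "produce"},
)
print(keyword_hit, role_hit)  # prints: True False
```

The keyword condition accepts the alliance article, whereas the 5W1H condition rejects it because no single 5W1H set has NEC as its who element together with produce as its predicate.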
The model has three functions: episodic retrieval, multi-dimensional classification, and overall classification (Figure 1).</Paragraph> <Section position="1" start_page="571" end_page="571" type="sub_section"> <SectionTitle> 3.1 Episodic Retrieval </SectionTitle> <Paragraph position="0"> The 5W1H index can easily do episodic retrieval by choosing a set of related events and arranging the events in temporal order. The results are readable by users as a kind of episode. For example, an NEC semiconductor production episode is made by retrieving texts containing &quot;who: NEC &amp; what: semiconductor &amp; predicate: produce&quot; indexes and sorting the retrieved texts in temporal order (Figure 2).</Paragraph> <Paragraph position="1"> 96.10 NEC adjusts semiconductor production downward.</Paragraph> <Paragraph position="2"> NEC postpones semiconductor production plant construction.</Paragraph> <Paragraph position="3"> NEC shifts semiconductor production to 64 Megabit next generation DRAMs.</Paragraph> <Paragraph position="4"> NEC invests ¥40 billion for next generation semiconductor production.</Paragraph> <Paragraph position="6"> NEC semiconductor production 18% more than</Paragraph> <Paragraph position="7"> The 5W1H index can suppress the retrieval noise produced by conventional keyword-based retrieval such as &quot;NEC &amp; semiconductor &amp; produce.&quot; Also, the result is an easily readable series of events, which meets the episodic viewpoint requirements of office retrieval.</Paragraph> </Section> <Section position="2" start_page="571" end_page="572" type="sub_section"> <SectionTitle> 3.2 Multi-dimensional Classification </SectionTitle> <Paragraph position="0"> The 5W1H index has seven-dimensional axes for classification. Texts are classified into categories on the basis of whether they contain a certain combination of 5W1H elements or not. 
Though 5W1H elements create a seven-dimensional space, users are provided with a two-dimensional matrix because this makes it easier for them to understand text distribution. Users can choose a fundamental viewpoint from the 5W1H elements to be the vertical axis. The other elements are arranged on the horizontal axis, as the left matrix of Figure 3 shows. Classification makes it possible to access data from a user's comparative viewpoints by combining 5W1H elements.</Paragraph> <Paragraph position="1"> For example, the cell specified by NEC and PC shows the number of articles containing NEC as a &quot;who&quot; element and PC as a &quot;what&quot; element. Users can easily obtain comparable data by switching their fundamental viewpoint, for example from the &quot;who&quot; viewpoint to the &quot;what&quot; viewpoint, as the right matrix of Figure 3 shows. This meets comparative viewpoint requirements in office retrieval.</Paragraph> </Section> <Section position="3" start_page="572" end_page="572" type="sub_section"> <SectionTitle> 3.3 Overall Classification </SectionTitle> <Paragraph position="0"> When there are a large number of 5W1H elements, the classification matrix can be packed by using a thesaurus. As 5W1H elements are represented by upper concepts in the thesaurus, the matrix can be condensed. Figure 4 has an example with six &quot;who&quot; elements which are represented by two categories.</Paragraph> <Paragraph position="1"> The matrix provides users with overall classification as well as detailed sub-classification through the selection of appropriate hierarchical levels. 
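The two-dimensional matrix and its condensation through a thesaurus can be illustrated with a short Python sketch. The article list, the one-level thesaurus, the category names, and the function names build_matrix and condense_rows are all assumptions made for illustration; they are not data or code from the system.

```python
from collections import Counter

# Illustrative sketch only: the articles, the thesaurus entries, and the
# function names are made-up assumptions, not data from the system.

ARTICLES = [
    {"who": "NEC", "what": "PC"},
    {"who": "NEC", "what": "semiconductor"},
    {"who": "A Co.", "what": "PC"},
    {"who": "B Co.", "what": "semiconductor"},
    {"who": "NEC", "what": "PC"},
]

# Hypothetical one-level thesaurus mapping each element to an upper concept.
WHO_THESAURUS = {"NEC": "electronics", "A Co.": "electronics", "B Co.": "materials"}


def build_matrix(articles, row_role, column_role):
    """Count articles per (row element, column element) cell."""
    return Counter((a[row_role], a[column_role]) for a in articles)


def condense_rows(matrix, thesaurus):
    """Merge rows whose elements share an upper concept in the thesaurus."""
    condensed = Counter()
    for (row, column), count in matrix.items():
        condensed[(thesaurus.get(row, row), column)] += count
    return condensed


who_by_what = build_matrix(ARTICLES, "who", "what")
what_by_who = build_matrix(ARTICLES, "what", "who")  # switched fundamental axis
overall = condense_rows(who_by_what, WHO_THESAURUS)
print(who_by_what[("NEC", "PC")], overall[("electronics", "PC")])  # prints: 2 3
```

Switching the fundamental axis only swaps the roles used for rows and columns, and condensing replaces each row element by its upper concept before the cell counts are summed.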
This meets overall classification requirements in office retrieval.</Paragraph> </Section> </Section> <Section position="5" start_page="572" end_page="575" type="metho"> <SectionTitle> 4 5W1H Information Extraction </SectionTitle> <Paragraph position="0"> 5W1H extraction was done by a case-based shallow parsing (CBSP) model based on the algorithm used in VENIEX, a Japanese information extraction system (Muraki et al., 1993). CBSP is a robust and effective method of analysis which uses lexical information, expression patterns and case-markers in sentences. Figure 5 shows the details of the CBSP algorithm.</Paragraph> <Paragraph position="1"> In this algorithm, input sentences are first segmented into words by Japanese morphological analysis (Japanese sentences have no blanks between words). Lexical information such as the part of speech, root form, and semantic category is linked to each word.</Paragraph> <Paragraph position="2"> Next, 5W1H elements are extracted by proper noun extraction, pattern expression matching and case-marker matching.</Paragraph> <Paragraph position="3"> In the proper noun extraction phase, a 60,050-word proper noun dictionary made it possible to identify people's names and organization names as &quot;who&quot; elements and place names as &quot;where&quot; elements. 
For example, NEC and China are respectively extracted as a &quot;who&quot; element and a &quot;where&quot; element from &quot;[...] (NEC produces semiconductors in China.)&quot;
procedure CBSP;
begin
  Apply morphological analysis to the sentence;
  foreach word in the sentence do begin
    if the word is a person's name or an organization name then
      Mark the word as a &quot;who&quot; element and push it to the stack;
    else if the word is a place name then
      Mark the word as a &quot;where&quot; element and push it to the stack;
    else if the word matches an organization name pattern then
      Mark the word as a &quot;who&quot; element and push it to the stack;
    else if the word matches a date pattern then
      Mark the word as a &quot;when&quot; element and push it to the stack;
    else if the word is a noun then
      if the next word is [case marker] or [case marker] then
        Mark the word and the kept unspecified elements as &quot;who&quot; elements and push them to the stack;
      if the next word is [case marker] or [case marker] then
        Mark the word and the kept unspecified elements as &quot;what&quot; elements and push them to the stack;
      else
        Keep the word as an unspecified element;
    else if the word is a verb then begin
      Fix the word as the predicate element of a 5W1H set;
      repeat
        Pop one marked word from the stack;
        if the 5W1H element corresponding to the mark of the word is not fixed then
          Fix the word as the 5W1H element corresponding to its mark;</Paragraph> <Paragraph position="5"> In the pattern expression matching phase, the system extracts words matching predefined patterns as &quot;who&quot; and &quot;when&quot; elements. There are several typical patterns for organization names and people's names, dates, and places (Muraki et al., 1993). For example, nouns followed by [...] (Co., Inc., Ltd.) or [...] (Univ.) are organizations and therefore &quot;who&quot; elements. Likewise, 1998年4月18日 (April 18, 1998) can be identified as a date. 
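A date pattern of this kind can be sketched as a regular expression. The Python fragment below is a minimal illustration, not the system's own rule: the expression itself and the names DATE_PATTERN and find_when_elements are hypothetical.

```python
import re

# Illustrative sketch only: this regular expression is an assumption about
# what a year/month/day pattern could look like, not the system's own rule.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def find_when_elements(text):
    """Return (year, month, day) tuples for every date expression in text."""
    return [tuple(int(part) for part in m.groups())
            for m in DATE_PATTERN.finditer(text)]


print(find_when_elements("1998年4月18日"))  # prints: [(1998, 4, 18)]
```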
&quot;When&quot; elements can be recognized by focusing on the pattern for 年 (year), 月 (month), and 日 (day).</Paragraph> <Paragraph position="6"> For words which are not extracted as 5W1H elements in previous phases, the system decides its [...] works in the information access platform. The platform disseminates newspaper information to users through the company intranet. The platform structure is shown in Figure 6.</Paragraph> <Paragraph position="7"> Web robots collect newspaper articles from specified URLs every day. The data is stored in the database, and 5W1H index data is created for it. Currently, 6398 news articles are stored in the database. Some articles are disseminated to users according to their profiles. Users can browse all the data through WWW browsers and use 5W1H classification and navigation functions by typing sentences or specifying regions in the browsing texts.</Paragraph> <Paragraph position="9"/> <Section position="1" start_page="573" end_page="575" type="sub_section"> <SectionTitle> 5.1 5W1H Information Extraction </SectionTitle> <Paragraph position="0"> &quot;When,&quot; &quot;who,&quot; &quot;what,&quot; and &quot;predicate&quot; information has been extracted from 6398 electronics industry news articles since August, 1996. We have evaluated extracted information for 6398 news headlines. The average headline length is approximately 12 words. Table 1 shows the result of evaluating &quot;who,&quot; &quot;what,&quot; and &quot;predicate&quot; information and overall extracted information.</Paragraph> <Paragraph position="1"> In this table, the results are classified with regard to the presence of corresponding elements in the news headlines. More than 90% of &quot;who,&quot; &quot;what,&quot; and &quot;predicate&quot; elements can correctly be extracted with our extraction algorithm from headlines having such elements. On the other hand, the algorithm is not highly precise when there is no corresponding element in the article. 
The errors are caused by picking up other elements despite the absence of the element to be extracted. However, the errors hardly affect applications such as episodic retrieval [...]</Paragraph> <Paragraph position="3"> they only add unnecessary information and do not remove necessary information.</Paragraph> <Paragraph position="4"> The precision independent of the presence of the element is from 85% to 95% for each, and the overall precision is 82.4%.</Paragraph> <Paragraph position="5"> [...] an example of episodic retrieval based on headline news saying, &quot;NEC [...] (NEC produces 18% more semiconductors than expected.)&quot; The user specifies the region &quot;NEC [...] (NEC produces semiconductors)&quot; on the headline for episodic retrieval. A &quot;who&quot; element NEC, a &quot;what&quot; element [...] (semiconductor), and a &quot;predicate&quot; element [...] (produce) are the episodic retrieval keys. The extracted results are NEC's semiconductor production story.</Paragraph> <Paragraph position="6"> The upper frame of the window lists a set of headlines arranged in temporal order. In each article, NEC is a &quot;who&quot; element, the semiconductor is a &quot;what&quot; element and production is a &quot;predicate&quot; element. By tracing episodic headlines, the user can find that the semiconductor market was not good at the end of 1996 but that it began turning around in 1997. The lower frame shows an article corresponding to the headline in the upper frame. When the user clicks the 96/10/21 headline, the complete article is displayed in the lower frame.</Paragraph> <Paragraph position="7"> [...]ery techniques.).&quot; &quot;Who&quot; elements are &quot;NEC, A Co., and B Co.&quot; listed on the vertical axis, which is the fundamental axis in the upper frame of Figure 8. &quot;What&quot; elements are &quot;[...] 
(encode), [...] (data), [...] (recovery), and [...] (technique).&quot; A &quot;predicate&quot; element is &quot;[...] (develop).&quot; &quot;What&quot; and &quot;predicate&quot; elements are both arranged on the horizontal axis in the upper frame of Figure 8. When clicking a cell for &quot;who&quot;: NEC and &quot;what&quot;: [...] (encode), users can see the headlines of articles containing the above two keywords in the lower frame of Figure 8.</Paragraph> <Paragraph position="8"> When clicking on the &quot;What&quot; cell in the upper frame of Figure 8, the user can switch the fundamental axis from &quot;who&quot; to &quot;what&quot; (Figure 9, upper frame). By switching the fundamental axis, the user can easily see classification from different viewpoints. On clicking the cell for &quot;what&quot;: [...] (encode) and &quot;predicate&quot;: [...] (develop), the user finds eight headlines (Figure 9, lower frame). 
The user can then see different company activities such as the 97/04/07 headline, &quot;C [...] (C Company has developed data transmission encoding technology using a satellite),&quot; shown in the lower frame of Figure 9.</Paragraph> <Paragraph position="9"> In this way, a user can classify article headlines by switching 5W1H viewpoints.</Paragraph> <Paragraph position="10"> Overall classification is condensed by using an organization thesaurus and a technical thesaurus. The organization thesaurus has three layers and 2800 items, and the technical thesaurus has two layers and 1000 technical terms. &quot;Who&quot; and &quot;what&quot; elements are respectively represented by the upper classes of the organization thesaurus and the technical thesaurus. The upper classes are vertical and horizontal elements in the multi-dimensional classification matrix. &quot;Predicate&quot; elements are categorized by several frequent predicates based on the user's priorities.</Paragraph> <Paragraph position="11"> Figure 10 shows the results of overall classification for 250 articles disseminated in April, 1997. Here, &quot;who&quot; elements on the vertical axis are represented by industry categories instead of company names, and &quot;what&quot; elements on the horizontal axis are represented by technical fields instead of technical terms. On clicking the second cell from the top of the &quot;who&quot; elements, [...] (electrical and mechanical) in Figure 10, the user can view the subcategorized classification of electrical and mechanical industries as indicated in Figure 11. 
Here, [...] (electrical and mechanical) is expanded to the subcategories [...] (general electric), [...] (power electric), [...] (home electric), [...] (communication), and so on.</Paragraph> </Section> </Section> <Section position="6" start_page="575" end_page="576" type="metho"> <SectionTitle> 6 Current Status </SectionTitle> <Paragraph position="0"> The information access platform was exploited during the MIIDAS (Multiple Indexed Information Dissemination and Acquisition Service) project, which NEC used internally (Okumura et al., 1997). The DEC Alpha workstation (300 MHz) is a server machine providing 5W1H classification and navigation functions for 50 users through WWW browsers.</Paragraph> <Paragraph position="1"> User interaction occurs through CGI and Java programs. After a six-month trial by 50 users, four areas for improvement became evident.</Paragraph> <Paragraph position="2"> 1) 5W1H extraction: 5W1H extraction precision was approximately 82% for newspaper headlines. The extraction algorithm should be improved so that it can deal with embedded sentences and compound sentences.</Paragraph> <Paragraph position="3"> Also, dictionaries should be improved in order to be able to deal with different domains such as patent data and academic papers.</Paragraph> <Paragraph position="4"> 2) Episodic retrieval: The interface should be improved so that the user can switch retrieval from episodic to normal retrieval in order to compare retrieval data.</Paragraph> <Paragraph position="5"> Episodic retrieval is based on the temporal sorting of a set of related events. At present, geographic arrangement is expected to become a branch function for episodic retrieval. It is possible to arrange each event on a map by using 5W1H index data. This would enable users to trace moving events such as the onset of a typhoon or the escape of a criminal. 
3) Multi-dimensional classification: Some users need to edit the matrix for themselves on the screen.</Paragraph> <Paragraph position="6"> Moreover, it is necessary to insert new keywords and delete unnecessary keywords.</Paragraph> </Section> </Paper>