<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1084"> <Title>Automatic Semantic Sequence Extraction from Unrestricted Non-Tagged Texts</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recognition of contained words is an important preprocessing step for syntactic parsing. Word recognition is mostly done by dictionary lookup, and unknown words often cause parse errors. Thus most research has been done on fixed corpora with special dictionaries for the domain.</Paragraph> <Paragraph position="1"> Part-of-speech (POS) tags are often used for term recognition. This kind of preprocessing is often time-consuming and causes ambiguity. When it comes to a corpus with a high rate of unknown words, it is not easy to achieve fair parsing with dictionaries and rules.</Paragraph> <Paragraph position="2"> Obtaining the contained terms and phrases correctly can be an efficient preprocessing step. In this paper we propose a method to recognize domain-specific sequences with simple and non-costly processing, which enables the use of unrestricted corpora for NLP tools.</Paragraph> <Paragraph position="3"> We concentrate on building a tool for extracting meaningful sequences automatically with less preparation. Our system only needs a fair-sized non-tagged training corpus of the target language. No restriction is required for the training corpus, and we do not need any preprocessing for it.</Paragraph> <Paragraph position="4"> We ran experiments on email messages in Japanese, and our system recognized 69.06% of the undefined sequences in the test corpus.</Paragraph> </Section> </Paper>