<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0903"> <Title>Automatic Dating of Documents and Temporal Text Classification</Title> <Section position="7" start_page="20" end_page="20" type="concl"> <SectionTitle> 5 Evaluation, Results and Conclusion </SectionTitle> <Paragraph position="0"> The system was trained using 67,000 news items selected at random from the GigaWord corpus.</Paragraph> <Paragraph position="1"> The evaluation was carried out on 678,924 news items drawn from those marked as being of type &quot;story&quot; or &quot;multi&quot; in the GigaWord corpus. Table 1 presents a summary of the evaluation results. Processing took around 2.33 ms per item.</Paragraph> <Paragraph position="2"> The actual date was extracted from each news item in the GigaWord corpus, and the day of week (DOW), week number and quarter were calculated from it.</Paragraph> <Paragraph position="3"> This information was then used to evaluate the system performance automatically. The average error for each type of classifier was also calculated automatically. For a result to be considered correct, the value that the system ranked in first position had to equal the actual value for that type of period.</Paragraph> <Paragraph position="4"> The system results show that reasonably accurate dates can be guessed at the quarterly and yearly levels. The weekly classifier had the worst performance of all classifiers, likely as a result of the weak association between periodical word frequencies and week numbers. Logical/sanity checks can be performed on ambiguous results.</Paragraph> <Paragraph position="5"> For example, consider a document written on 4 January 2006 and suppose that the periodical classifiers give the following results for this particular document: These results are typical of the system, as particular classifiers sometimes get the period incorrect. 
In this example, the weekly classifier incorrectly classified the document as pertaining to week 52 (at the end of the year) rather than to the beginning of the year. The system uses the fact that the monthly and quarterly classifiers agree, together with the fact that week 1 follows week 52 when the weeks are viewed as a continuous cycle, to correctly classify the document as having been created on a Wednesday in January 2006.</Paragraph> <Paragraph position="6"> The capability to automatically date texts and documents solely from their contents (without any additional external clues or hints) is undoubtedly useful in various contexts, such as the forensic analysis of undated instant messages or emails (where the Day of Week classifier can be used to create partial orderings), and in authorship identification studies (where the Year classifier can be used to check that the text pertains to an acceptable range of years).</Paragraph> <Paragraph position="7"> The temporal classification and analysis system presented in this paper can handle any Indo-European language in its present form. Further work is being carried out to extend the system to Chinese and Arabic. Evaluations will be carried out on the GigaWord Chinese and GigaWord Arabic corpora for consistency. Current research aims to improve the accuracy of the classifier by using the non-periodic components and by integrating a combined classification method with other systems.</Paragraph> </Section> </Paper>