<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0207"> <Title>LoLo: A System based on Terminology for Multilingual Extraction</Title> <Section position="8" start_page="62" end_page="63" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We used the Rules Editor and the Information Extractor to evaluate the patterns on a corpus comprising 2408 texts and 858,650 tokens, created by merging the Test1 and Test2 corpora. The Arabic evaluation corpus comprised 5118 texts and 860,134 tokens. The N-gram pattern extractor (where N > 4) showed considerable promise: who or what went up or down was unambiguously extracted from the English test corpus using patterns generated from the training corpus. Initial results show high precision with the longer N-grams in English (Table 19) and Arabic (Table 20).</Paragraph> <Paragraph position="1"> However, some patterns return extracted information that requires trimming. For example, many organization names are extracted in Arabic using the pattern shown in Table 21, but they usually have the word by-a-ratio (be-nesba, بنسبة) attached at the end, resulting in low precision.</Paragraph> <Section position="1" start_page="63" end_page="63" type="sub_section"> <SectionTitle> and Arabic </SectionTitle> <Paragraph position="0"> Because we used the same training thresholds for English and Arabic, the patterns in Arabic appeared without the motion words.</Paragraph> <Paragraph position="1"> However, the system can extract these words along with the org/instrument/index names because they appear frequently as slots in the patterns. The shorter N-gram patterns (N ≤ 4) show poor results: either such patterns found in the training corpus are not found in the test corpus, or the patterns retrieved from the test corpora are at semantic variance with the same pattern in the training corpus. 
This suggests that there is an optimal length of individual patterns in our local grammar.</Paragraph> </Section> </Section> </Paper>