<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2005"> <Title>Scaled log likelihood ratios for the detection of abbreviations in text corpora</Title> <Section position="6" start_page="11" end_page="11" type="concl"> <SectionTitle> 5 Weaknesses and future steps </SectionTitle> <Paragraph position="0"> We noted in section 2 that the scaling factors do not yield a perfect classification. This is most visible in the application of S(log l) to WSJ_1 and NZZ_7, which show the same problem. In the training corpus, ounces was always followed by *. In WSJ_1, the word said was always followed by *, and the same holds for kann in NZZ_7. Without the inclusion of additional metrics, non-abbreviations that occur exclusively at the end of sentences are wrongly classified. However, the table in (20) shows that the error rate for false negatives drops significantly once plausible corpus sizes are considered.</Paragraph> <Paragraph position="1"> (20) False negatives (f.n.) and corpus size. We have also ignored abbreviations occurring at the end of the sentence. The next step will be to integrate methods for detecting abbreviations at the end of the sentence, e.g. by adding phonotactic information, and to cover the problematic cases reported above.</Paragraph> <Paragraph position="2"> Conclusion: We have presented an accurate and comparatively simple method for the detection of abbreviations that makes use of scaled log likelihood ratios. Experiments have shown that the method works well both with large files and with small samples containing sparse data. We expect further improvements once additional classification schemata have been integrated.</Paragraph> </Section> </Paper>