<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1045">
<Title>Encoding a Parallel Corpus for Automatic Terminology Extraction</Title>
<Section position="4" start_page="275" end_page="275" type="concl">
<SectionTitle>4 Discussion</SectionTitle>
<Paragraph position="0">The general approach we adopted in the preprocessing and primary data encoding phases was to pass the raw texts through a sequence of filters. Each filter adds a small piece of new information and writes a logfile entry in doubtful cases.</Paragraph>
<Paragraph position="1">The output and the logfile are in turn used to improve the filter programs in order to minimize manual post-editing. This modular bootstrapping approach has advantages over huge parameterizable programs: the filters are relatively simple and can be partially reused or easily adapted for texts with different formats; tuning the filters becomes less complex; and when a stage has to be re-run, the loss of work is minimized. The filters have been implemented in Perl, whose regular-expression pattern matching makes it a very powerful language for such applications.</Paragraph>
<Paragraph position="2">For the linguistic annotation we use the MULTEXT tools available from http://www.lpl.univ-aix.fr/projects/multext. We already have extensive experience with the tokenizer MtSeg, which distinguishes 11 classes of tokens, such as abbreviations, dates, and various punctuation marks. The customization of MtSeg via language-specific resource files was done in a bootstrapping process similar to that used for the filter programs. An evaluation of 10% of the Civil Code (~28,000 words) revealed only one type of tokenization error: a full stop that is not part of an abbreviation and is followed by an uppercase letter is recognized as an end-of-sentence marker, e.g. in &quot;6. Absatz&quot;. This kind of error is unavoidable in German if we refuse to mark such patterns as compounds.</Paragraph>
<Paragraph position="3">We are currently preparing the lemmatization and the POS tagging with MtLex. MtLex is equipped with an Italian and a German lexicon containing 138,823 and 51,010 different word forms, respectively. To cover the 15,013 new Italian and 58,217 new German word forms occurring in our corpus, the corresponding lexicons have been extended. The creation of the Italian lexicon took two person-months.</Paragraph>
<Paragraph position="4">Future work will include the completion of the linguistic annotation. The MULTEXT tagger MtTag will be used for the disambiguation of POS tags. Word alignment still requires the study of various approaches, e.g. (Dagan et al., 1993; Melamed, 1997). Finally, we are working on a sophisticated interface for navigating through parallel documents, so that the text corpus can be disseminated even before the terminology extraction has been completed.</Paragraph>
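<Paragraph position="5">To make the filter architecture concrete, the following is a minimal Perl sketch of one such filter; the paragraph-marking heuristic, the tag, and the logfile name are illustrative, not those of the actual system. It adds one small piece of new information and records doubtful cases in a logfile, which is then used to refine the filter in the next bootstrapping round.</Paragraph>

    #!/usr/bin/perl
    # Minimal filter sketch: reads raw text on STDIN, marks paragraph
    # boundaries, and logs doubtful cases for manual inspection.
    # Tag name, heuristic, and logfile name are illustrative only.
    use strict;
    use warnings;

    open(my $log, '>', 'filter.log') or die "cannot open logfile: $!";

    my $lineno = 0;
    while (my $line = <STDIN>) {
        $lineno++;
        chomp $line;
        if ($line =~ /^\s*$/) {
            # blank line: assume a paragraph boundary and mark it
            print "<p/>\n";
        } elsif ($line =~ /^\s{4,}\S/) {
            # heavily indented line: heading or continuation? pass it
            # through unchanged, but record the doubt in the logfile
            print $log "line $lineno: suspicious indentation: $line\n";
            print "$line\n";
        } else {
            print "$line\n";
        }
    }
    close($log);

<Paragraph position="6">Because each filter is this small, it can be re-tuned or reused for differently formatted texts without touching the rest of the pipeline.</Paragraph>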
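<Paragraph position="7">The tokenization error discussed above can at least be flagged for manual inspection by a small post-filter. The sketch below assumes, purely for illustration, one token per line with a separate boundary line marking each sentence break; this is an assumed format, not the actual MtSeg output.</Paragraph>

    #!/usr/bin/perl
    # Post-filter sketch: flags sentence breaks that follow a bare
    # ordinal number, e.g. the German "6. Absatz", where the full stop
    # marks an ordinal rather than an end of sentence. Assumes one
    # token per line and a hypothetical '<s>' boundary line; this is
    # NOT the actual MtSeg output format.
    use strict;
    use warnings;

    my $prev = '';
    while (my $tok = <STDIN>) {
        chomp $tok;
        if ($tok eq '<s>' and $prev =~ /^\d+\.$/) {
            warn "input line $.: ordinal '$prev' possibly misread as sentence end\n";
        }
        print "$tok\n";
        $prev = $tok;
    }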
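<Paragraph position="8">Extending the MtLex lexicons amounts to listing the corpus word forms not yet covered. The following is a minimal sketch under simplified assumptions about the file formats: the corpus file holds one word form per line, and the first whitespace-separated field of a lexicon entry is the word form (the actual MtLex lexicon format is richer).</Paragraph>

    #!/usr/bin/perl
    # Sketch: list corpus word forms missing from a lexicon, as
    # candidates for lexicon extension. Both file formats are
    # simplified assumptions, not the actual MtLex formats.
    use strict;
    use warnings;

    my ($lexfile, $corpusfile) = @ARGV;
    die "usage: $0 lexicon corpus\n" unless defined $corpusfile;

    my %known;
    open(my $lex, '<', $lexfile) or die "cannot open $lexfile: $!";
    while (<$lex>) {
        my ($form) = split;          # first field: the word form
        $known{$form} = 1 if defined $form;
    }
    close($lex);

    my %missing;
    open(my $corpus, '<', $corpusfile) or die "cannot open $corpusfile: $!";
    while (my $form = <$corpus>) {
        chomp $form;
        $missing{$form} = 1 if length($form) and not $known{$form};
    }
    close($corpus);

    print "$_\n" for sort keys %missing;

</Section>
</Paper>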