<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1031">
  <Title>ISSCO, Genève †</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Brief Overview
</SectionTitle>
    <Paragraph position="0"> The BCP package consists of four submodules: pre-processor, morphology, alignment, and access. The text pre-processor, bcpmark, marks paragraph and sentence boundaries, numbers, words, and punctuation. The morphological analyzer, bcpmorf, is built around a unification-based parser and returns feature-structure descriptions in SGML format, although the feature structure itself is in a linear notation only. The alignment module is the subject of much experimentation and currently runs the Gale-Church alignment algorithm [Gale and Church, 1991]. The access module has been described in previous work [Warwick et al., 1989] and will not be discussed further here. The focus of this abstract is on bcpmark and bcpmorf.</Paragraph>
    <Paragraph position="1"> *Many thanks to Graham Russell for his invaluable advice on this abstract.</Paragraph>
    <Paragraph position="2"> †ISSCO, 54 route des Acacias, Genève 1227, Switzerland. **Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00 Praha 1, Czechoslovakia</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="227" type="metho">
    <SectionTitle>
3 bcpmark: The Pre-Processor
</SectionTitle>
    <Paragraph position="0"> bcpmark is the first step in preparing text for the alignment program. It marks paragraph and sentence boundaries, numbers, words, and punctuation, and produces its output in SGML notation. bcpmark is easily customized to suit a particular text type or language via a user-defined data file, so extensions and alterations to the data are simple. Accompanying tools check the results of number standardization and sentence boundary marking. Languages currently supported are French, German, Italian, Czech, and English.</Paragraph>
    <Section position="1" start_page="0" end_page="227" type="sub_section">
      <SectionTitle>
3.1 Input Text
</SectionTitle>
      <Paragraph position="0"> bcpmark is intended to be usable on all text types, which entails a certain amount of flexibility. Regardless, there are two major problems: the program does no interpretation of the input text, and it needs to be &quot;parameterized&quot; for different textual conventions.</Paragraph>
      <Paragraph position="1"> Problems instantly arise in conjunction with numbers, abbreviations, conflicts between differing punctuation conventions, and capitalization. In particular, German noun capitalization causes great problems for a system which relies heavily on capitalization to mark sentence beginnings. In bcpmark, sentences are marked either by the onset of a paragraph marker or by an end-of-sentence punctuation mark encountered in the appropriate context for a particular language. We define seven contexts essential for delimiting sentences:  1. Characters which are always considered part of a word.</Paragraph>
      <Paragraph position="2"> 2. Abbreviations which can never end a sentence even if they are followed by a dot. There may also be contracted abbreviations.</Paragraph>
      <Paragraph position="3"> 3. Abbreviations which cannot end a sentence when they precede a number.</Paragraph>
      <Paragraph position="4"> 4. Words which, when followed by a number and then a period, usually signal a sentence boundary.</Paragraph>
      <Paragraph position="5"> 5. The sequence single-capital-letter, capitalized-word is normally not recognized to be a sentence boundary. 6. Certain words followed by sequences of the form number, capitalized-word (especially in German texts) should not be marked as sentence boundaries.</Paragraph>
      <Paragraph position="6"> 7. Words which probably do not start a new sentence if preceded by the sequence number, period. This is especially useful for languages like German, which mark ordinal numerals with dots that do not indicate the end of a sentence.</Paragraph>
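      <Paragraph position="7"> To make the role of these contexts concrete, the following is a minimal sketch of how a few of them (contexts 2, 3, 5, and 7) might be applied when deciding whether a period ends a sentence. It is not the bcpmark implementation: the names BoundaryRules, is_boundary, and split_sentences, the whitespace tokenization, and the word lists are all hypothetical stand-ins for the user-defined data file.
<![CDATA[
```python
from dataclasses import dataclass, field

END_PUNCT = (".", "!", "?")


@dataclass
class BoundaryRules:
    """Hypothetical stand-in for a per-language data file (lowercase entries)."""
    never_final_abbrevs: set = field(default_factory=set)    # context 2
    pre_number_abbrevs: set = field(default_factory=set)     # context 3
    no_start_after_number: set = field(default_factory=set)  # context 7


def is_boundary(tok, nxt, rules):
    """Return True if `tok` is taken to end a sentence, given the next token."""
    if not tok.endswith(END_PUNCT):
        return False
    if not tok.endswith("."):
        return True  # assumption: "!" and "?" are treated as unambiguous
    stem = tok[:-1]
    # Context 2: abbreviations that never end a sentence.
    if stem.lower() in rules.never_final_abbrevs:
        return False
    # Context 3: abbreviations that cannot end a sentence before a number.
    if stem.lower() in rules.pre_number_abbrevs and nxt[:1].isdigit():
        return False
    # Context 5: single capital letter followed by a capitalized word.
    if len(stem) == 1 and stem.isupper() and nxt[:1].isupper():
        return False
    # Context 7: number + period followed by a word that rarely starts a sentence.
    if stem.isdigit() and nxt.strip(".,;:!?").lower() in rules.no_start_after_number:
        return False
    return True


def split_sentences(text, rules):
    """Split whitespace-tokenized text into sentences using the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if is_boundary(tok, nxt, rules):
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing material without end punctuation
        sentences.append(" ".join(current))
    return sentences


german = BoundaryRules(never_final_abbrevs={"dr"}, no_start_after_number={"mai"})
print(split_sentences("Er kam am 3. Mai an. Dr. Meier war da.", german))
```
]]>
With the illustrative German rules, the ordinal in &quot;3. Mai&quot; and the abbreviation &quot;Dr.&quot; are not treated as sentence ends; with an empty rule set both would be split. Keeping the word lists in a data object rather than in the code mirrors the idea of customizing per language without changing the program.</Paragraph>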
    </Section>
  </Section>
  <Section position="4" start_page="227" end_page="227" type="metho">
    <SectionTitle>
4 Morphology
</SectionTitle>
    <Paragraph position="0"> Morphological variations can be classified as inflection, derivation, and compounding [van Gaalen et al., 1991].</Paragraph>
    <Paragraph position="1"> An adequate morphology should be able to handle all three. There are several parts to the BCP morphology: the morphology grammar, the regular and irregular dictionaries, and the code. There is also a facility for testing and debugging the morphology grammar. The output format is an SGML-notation version of feature structures, where ambiguous analyses are expressed in tags rather than as multiple word-forms in the text.</Paragraph>
  </Section>
  <Section position="5" start_page="227" end_page="227" type="metho">
    <SectionTitle>
5 Alignment
</SectionTitle>
    <Paragraph position="0"> The technique originally used for aligning texts was to link regions of texts according to the regularity of word co-occurrences across texts [Catizone et al., 1989]. Pairs of words were linked if they had similar distributions in their home texts. This strategy does not always work well, because in many languages a good writer does not use exactly the same word many times in a text. Similarly, a good translator does not translate a word in exactly the same way every time it occurs. Clearly this algorithm is heavily text dependent. For texts with limited vocabularies it might work extremely well, but on &quot;free&quot; text it fails.</Paragraph>
    <Paragraph position="1"> Currently we are experimenting with assorted algorithms; a major problem is finding good test texts to run them on. So far the best results on reasonable text come from the Gale-Church algorithm [Gale and Church, 1991]. It has been tested on English, German, French, Czech, and Italian parallel texts. The Gale-Church algorithm relies on the length of regions, where the character is the unit of measurement. (For details see their paper.) We have experienced three problems with this method. First, the implementation of the algorithm published in [Gale and Church, 1991] severely limits the size of the input file. This is, however, merely an implementation problem. Second, there is no way to set &quot;anchor points&quot; and align around them. That is, one cannot pick two anchor points, one in each text, and have the program align the corresponding regions above and below the anchor points. (See [Brown et al., 1991] for discussion of an alternative.) This is not necessarily a problem either, and can be worked around. Lastly, it does not give usable results on texts which are not absolutely parallel, that is, on texts which do not have exactly the same number of large regions with the same hierarchical structure. A single extra line of characters in one text will cause a complete failure of the alignment algorithm. This is a major difficulty.</Paragraph>
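    <Paragraph position="2"> The idea of aligning by region length in characters can be illustrated with a small dynamic-programming sketch. This is a simplified illustration in the spirit of length-based alignment, not the published implementation and not the BCP code: the function names, the set of match types, and the constants PRIORS, C, and S2 are assumptions chosen for the example and should be checked against [Gale and Church, 1991] before any real use.
<![CDATA[
```python
import math

# Illustrative assumptions (not taken from this paper): prior probability of
# each match type (regions consumed from text A, regions consumed from text B).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099, (2, 1): 0.089, (1, 2): 0.089}
C = 1.0   # assumed expected number of B characters per A character
S2 = 6.8  # assumed variance of that ratio per character


def length_cost(len_a, len_b):
    """Negative log of a two-sided normal tail probability for a length mismatch."""
    if len_a == 0 and len_b == 0:
        return 0.0
    mean = (len_a + len_b / C) / 2.0
    delta = (len_b - len_a * C) / math.sqrt(mean * S2)
    tail = math.erfc(abs(delta) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|delta|))
    return -math.log(max(tail, 1e-300))            # guard against log(0)


def align(lens_a, lens_b):
    """Align two lists of region lengths (in characters).

    Returns a list of (indices_in_a, indices_in_b) pairs in text order.
    """
    n, m = len(lens_a), len(lens_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (da, db), prior in PRIORS.items():
                ni, nj = i + da, j + db
                if ni > n or nj > m:
                    continue
                cand = (cost[i][j] - math.log(prior)
                        + length_cost(sum(lens_a[i:ni]), sum(lens_b[j:nj])))
                if cand < cost[ni][nj]:
                    cost[ni][nj] = cand
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs


print(align([20, 35, 50], [21, 33, 52]))
print(align([40, 45], [86]))
```
]]>
In the first call the lengths are close pairwise, so the cheapest path pairs the regions one-to-one; in the second, two short regions are matched against a single long one. Because only lengths are compared, nothing in such a procedure looks at the words themselves, which is why it is sensitive to how the regions are delimited.</Paragraph>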
  </Section>
</Paper>