<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2180"> <Title>United Kingdom</Title> <Section position="4" start_page="0" end_page="1028" type="intro"> <SectionTitle> 2 The basic approach </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="1028" type="sub_section"> <SectionTitle> 2.1 Identification using regular expressions </SectionTitle> <Paragraph position="0"> The types of word construct of interest here lend themselves well to identification by matching regular expressions over each input sentence (considered as a record), tagging them as specific instances of general phenomena (e.g. dates, numbers, etc.).</Paragraph> <Paragraph position="1"> awk is a programming language specifically designed for this type of string manipulation (matching, replacement, splitting). It has special provisions for treating text in the form of records. awk reads input record by record, matching user-defined regular expressions and executing corresponding actions according to whether a match has been found.</Paragraph> <Paragraph position="2"> The regular expressions can be stored in variables and reused to build more complex expressions. This is important, as some of the phenomena we were attempting to match were complex (see below) and occurred in a variety of formats. The awk-implemented tagger developed for this project, tagit, can be used as a stand-alone tagger for SGML texts. It has been integrated within the text handling procedures of the ALEP system after sentence recognition and before word recognition. When a pattern matches against the input, the matched string is replaced with a general tag of the relevant type (e.g. DATE, NUMBER). Subsequent tagging and morphological parsing then skip these tags, and further processing (i.e. 
syntactic analysis) is based on the tag value, not the original input string.</Paragraph> </Section> <Section position="2" start_page="1028" end_page="1028" type="sub_section"> <SectionTitle> 2.2 Sample case : currency patterns in German </SectionTitle> <Paragraph position="0"> tagit has been integrated into the German LS-GRAM grammar for the identification of word constructs occurring in the mini-corpus taken as the departure point for the work of the German group, consisting of an article on economics (taken from the weekly newspaper &quot;Die Zeit&quot;). As usual when using 'real-world' texts, many messy details were found, including dates and numbers used within percentages and, as would be expected from the text type, within amounts of currency.</Paragraph> <Paragraph position="1"> These occur both with and without numerals, e.g.</Paragraph> <Paragraph position="2"> &quot;16,7 Millionen Dollar&quot;, &quot;Sechsundzwanzig Milliarden D-Mark&quot;. The text examples are especially problematic given the German method of expressing the ones-digit before the tens-digit, e.g.</Paragraph> <Paragraph position="3"> &quot;Sechsundzwanzig&quot; is literally &quot;six-and-twenty&quot;. In order to deal with this phenomenon, regular expression patterns describing the currency amounts were defined in awk. First, patterns for cardinals were specified (umlauted characters and &quot;ß&quot; are matched by the system, though they are not shown here).</Paragraph> <Paragraph position="4"> Note that regular expressions are specified as strings and must be quoted using pairs of double quotes. 
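As a minimal sketch of this pattern-variable style (the variable names ones and tens and their contents are invented for illustration; only card is a name used in the actual implementation, whose patterns are far more complete), quoting ends before each variable and resumes after it when a larger pattern is built:

```shell
# Illustrative sketch only: "ones" and "tens" are hypothetical;
# the real cardinal patterns cover the full range up to 999.
awk '
BEGIN {
    ones = "(Ein|Zwei|Drei|Vier|Fuenf|Sechs|Sieben|Acht|Neun)"
    tens = "(zwanzig|dreissig|vierzig|fuenfzig)"
    # e.g. "Zweiundzwanzig" = ones "und" tens; a bare ones form also matches
    card = "(" ones "und" tens "|" ones ")"
}
{
    # replace each matched cardinal with a general tag, as tagit does
    gsub(card, "NUMBER")
    print
}' <<'EOF'
Zweiundzwanzig Dollar
EOF
```

For the input above this prints "NUMBER Dollar": POSIX leftmost-longest matching picks the full compound "Zweiundzwanzig" rather than the bare "Zwei" alternative.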
Variables are not evaluated when they occur in quotes, so quoting is ended, and then restarted after the variable name, whence the proliferation of double quotes within the complex patterns.</Paragraph> <Paragraph position="5"> Some awk syntax: &quot;=&quot; is the assignment operator, parentheses are used for grouping, &quot;|&quot; is the disjunction operator, &quot;?&quot; indicates optionality of the preceding expression, &quot;+&quot; means one or more instances of the</Paragraph> <Paragraph position="7"> The actual pattern used in the implementation is more complex and goes up to 999, but the example shows the principle. Given this set of variables, the pattern assigned to card can match the text version of all cardinal numbers from 1 to 999, e.g. &quot;Drei&quot;, &quot;Neunzehn&quot;, &quot;Zweiundzwanzig&quot;, &quot;Achthundert Fünfundvierzig&quot;, etc. The value assigned to range can match numbers, optionally with a comma as decimal point, e.g. &quot;99,09&quot;. The following patterns are also needed:</Paragraph> <Paragraph position="9"> The last two patterns define measure, the succession of a cardinal number (as a digit or a string) followed by curmeasure, which is itself the concatenation of amount and currency.</Paragraph> <Paragraph position="10"> Both amount and currency are defined as optional, so that inputs like &quot;30,6 Milliarden Dollar&quot;, &quot;Zweiundzwanzig Dollar&quot; or &quot;Dreiundvierzig Milliarden Dollar&quot; are automatically recognized. But the definition of 'measure' disallows the tagging of &quot;Zweiundzwanzig&quot; on its own as a 'measure' expression; the tag provided for this string will be the same as for any other cardinal.</Paragraph> <Paragraph position="11"> tagit applies these patterns to each record within the input, assigning the appropriate tag information in case a match is found. Further processing is described below.</Paragraph> </Section> </Section> </Paper>