<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1035"> <Title>A Cascaded Finite-State Parser for Syntactic Analysis of Swedish</Title> <Section position="4" start_page="245" end_page="246" type="metho"> <SectionTitle> 3 The Grammar Framework </SectionTitle> <Paragraph position="0"> The Swedish grammar has been semi-automatically extracted from written text corpora by observing two phenomena: (i) which part-of-speech n-grams are not allowed to be adjacent to each other in a constituent, and (ii) how and which function words signal boundaries between phrases and clauses. Observation (i) uses mutual information statistics based on the n-grams. Low n-gram frequencies, such as verb/noun-determiner, gave reliable cues for a clause boundary, while high values, such as numeral-noun, did not and were thus rejected. Observation (i) is related to the notion of distituent grammars, &quot;...a distituent grammar is a list of tag pairs which cannot be adjacent within a constituent...&quot;, Magerman &amp; Marcus (1990); observation (ii) supplements (i) by recognizing formal indicators of subordination/co-ordination, such as conjunctions, subjunctions, and punctuation.</Paragraph> <Section position="1" start_page="245" end_page="245" type="sub_section"> <SectionTitle> 3.1 Syntactic Labelling and the Underlying Corpus </SectionTitle> <Paragraph position="0"> The syntactic analysis is completed through the recognition of a variety of phrasal constituents, sentential clauses, and subclauses. We follow the proposal of the EAGLES (1996) Syntactic Annotation Group, which recognizes a number of syntactic, metasymbolic categories that are subsumed in most current categories of constituency-based syntactic annotation. The labelled bracketing consists of the syntactic category of the phrasal constituent enclosed between brackets. 
Unlabelled bracketing is only adopted in cases of unrecognized syntactic constructions.</Paragraph> <Paragraph position="1"> The corpora we used came from a variety of sources, about 200,000 tokens in total, collected in AVENTINUS. The rules are divided into levels, with each level consisting of groups of patterns ordered according to their internal complexity and length. A pattern consists of a category and a regular expression. The regular expressions are translated into finite-state automata, and the union of the automata yields a single, deterministic, finite-state level recognizer (Abney, 1996). Moreover, words and/or part-of-speech tags can be grouped using morphological and semantic criteria.</Paragraph> </Section> <Section position="2" start_page="245" end_page="246" type="sub_section"> <SectionTitle> 3.2 Grammar Rules </SectionTitle> <Paragraph position="0"> Some of the most important groups include: * Noun Phrases, Grammar0: the number of patterns in grammar0 is 180, divided into six different groups, depending on the length and complexity of the patterns. A large number of (parallel) coordination rules are also implemented at this level, depending on the similarity of the conjuncts with respect to several different characteristics (cf. Nagao, 1992).</Paragraph> <Paragraph position="1"> phrases preceded by a preposition. Trapped adverbials, belonging to the noun phrase and not identified while applying grammar0, are merged within the np. Both simple and multi-word prepositions are used.</Paragraph> <Paragraph position="2"> * Verbal Groups, Grammar2: identifies and labels phrasal, non-phrasal, and complex verbal formations. The rules allow for any number of auxiliary verbs and possible intervening adverbs, and end with a main verb or particle. 
A distinction is made between finite/infinite and active/passive verbal groups.</Paragraph> <Paragraph position="3"> * Clauses, Grammar3 and Grammar4: the clause resolution is based on surface criteria, outlined at the beginning of this chapter, and the rather fixed word order of Swedish.</Paragraph> <Paragraph position="4"> Grammar3 distinguishes different types of subordinate clauses, while Grammar4 recognizes main clauses. A unique level is designated for each type of clause.</Paragraph> </Section> <Section position="3" start_page="246" end_page="246" type="sub_section"> <SectionTitle> 3.3 Grammatical Functions </SectionTitle> <Paragraph position="0"> Grammatical functions are heuristically recognized using the topographical scheme, originally developed for Danish, in which the relative position of all functional elements in the clause is mapped in the sentence (Diderichsen, 1966).</Paragraph> </Section> <Section position="4" start_page="246" end_page="246" type="sub_section"> <SectionTitle> 3.4 An Example </SectionTitle> <Paragraph position="0"> The following short example illustrates the input and output to Cass-SWE: 'Under 1998 gick 8 799 företag i konkurs i Sverige.', i.e. 'During 1998, 8 799 companies went bankrupt in Sweden.' The input to Cass-SWE is an annotated version of the text: in the input, as well as the head of each phrase and the grammatical functions: TIME, SUBJ(ect) and P-OBJ(ect).</Paragraph> </Section> </Section> </Paper>
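<!--
The level-by-level pattern matching described in Sections 3.1 and 3.2 can be illustrated with a minimal sketch. It is not the Cass-SWE implementation: the tag names (PREP, NUM, NOUN, VERB, ...), the three toy levels, and the helper names apply_level, parse and bracket are all assumptions made for this illustration, and Python's re module with a longest-match scan stands in for the single deterministic level recognizer obtained from the union of automata. Each pattern is a category paired with a regular expression over the labels of the previous level, and each level rewrites the node sequence before the next level runs.

```python
import re

# Hypothetical levels and tag set, invented for illustration only.
# Every label inside a pattern is followed by one space, so a match
# always ends on a node boundary.
LEVELS = [
    # level 0: noun phrases (toy patterns, far fewer than the 180 in the paper)
    [("np", r"(?:DET )?(?:NUM )*(?:ADJ )*NOUN "), ("np", r"NUM ")],
    # level 1: a preposition followed by a level-0 noun phrase
    [("pp", r"PREP np ")],
    # level 2: verbal groups with optional auxiliaries and adverbs
    [("vg", r"(?:AUX )*(?:ADV )*VERB ")],
]


def apply_level(nodes, patterns):
    """Scan left to right and reduce the longest match at each position."""
    compiled = [(cat, re.compile(rx)) for cat, rx in patterns]
    out, i = [], 0
    while i < len(nodes):
        # Labels of the remaining nodes, each terminated by a space.
        rest = "".join(label + " " for label, _ in nodes[i:])
        best_cat, best_len = None, 0
        for cat, rx in compiled:
            m = rx.match(rest)
            if m:
                n = m.group(0).count(" ")  # number of nodes consumed
                if n > best_len:
                    best_cat, best_len = cat, n
        if best_cat is None:
            out.append(nodes[i])  # nothing matched: pass the node through
            i += 1
        else:
            out.append((best_cat, nodes[i:i + best_len]))
            i += best_len
    return out


def parse(tokens):
    """Run all levels in order over (word, tag) pairs."""
    nodes = [(tag, word) for word, tag in tokens]
    for patterns in LEVELS:
        nodes = apply_level(nodes, patterns)
    return nodes


def bracket(node):
    """Render a node as labelled bracketing; leaves print as bare words."""
    label, body = node
    if isinstance(body, str):
        return body
    return "[" + label + " " + " ".join(bracket(c) for c in body) + "]"


# The example sentence of Section 3.4 with assumed part-of-speech tags.
TOKENS = [
    ("Under", "PREP"), ("1998", "NUM"), ("gick", "VERB"),
    ("8 799", "NUM"), ("företag", "NOUN"), ("i", "PREP"),
    ("konkurs", "NOUN"), ("i", "PREP"), ("Sverige", "NOUN"), (".", "PUNCT"),
]

tree = parse(TOKENS)
print(" ".join(bracket(n) for n in tree))
```

Because each level only sees the labels produced by the levels before it, a pattern such as the assumed pp rule can refer to np as if it were a tag, which is the property that lets later levels stay short. The bracketing printed here reflects only these toy patterns and should not be read as the analysis produced by Cass-SWE.
-->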