<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2016"> <Title>Rapid Parser Development: A Machine Learning Approach for Korean</Title> <Section position="4" start_page="118" end_page="119" type="metho"> <SectionTitle> 3 Treebanking Effort </SectionTitle> <Paragraph position="0"> The additional resources used to train and test a parser for Korean, which we will describe in more detail in the next section, were (1) a 1187 sentence treebank, (2) a set of 133 context features, and (3) background knowledge in form of an 'is-a' ontology with about 1000 entries. These resources were built by a team consisting of the principal researcher and two graduate students, each contributing about 3 months.</Paragraph> <Section position="1" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.1 Treebank </SectionTitle> <Paragraph position="0"> The treebank sentences are taken from the Korean newspaper Chosun, two-thirds from 1994 and the remainder from 1999. Sentences represent continuous articles with no sentences skipped for length or any other reason. The average sentence length is 21.0 words.</Paragraph> </Section> <Section position="2" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.2 Feature Set </SectionTitle> <Paragraph position="0"> The feature set describes the context of a partially parsed state, including syntactic features like the part of speech of the constituent at the front/top of the input list (as sketched in figure 2) or whether the second constituent on the parse stack ends in a comma, as well as semantic features like whether or not a constituent is a time expression or contains a location particle. The feature set can accommodate any type of feature as long as it is computable, and can thus easily integrate different types of background knowledge.</Paragraph> </Section> <Section position="3" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.3 Background Knowledge </SectionTitle> <Paragraph position="0"> The features are supported by background knowledge in the form of an ontology, which for example has a time-particle concept with nine sub-concepts (accounting for 9 of the 1000 entries mentioned above). Most of the background knowledge groups concepts like particles, suffixes, units (e.g. for lengths or currencies), temporal adverbs - semantic classes that are not covered by part of speech information of the lexicon, yet provide valuable clues for parsing.</Paragraph> </Section> <Section position="4" start_page="118" end_page="119" type="sub_section"> <SectionTitle> 3.4 Time Effort </SectionTitle> <Paragraph position="0"> The first graduate student, a native Korean and linguistics major, hired for 11 weeks, spent about 2 weeks getting trained, 6 weeks on building two-thirds of the treebank, 2 weeks providing most background knowledge entries and 1 week helping to</Paragraph> <Paragraph position="2"> synt: adv I &quot;reduce the 2 top elements of the parse stack to a frame with syntax 'vp' and roles 'pred' and 'obj'&quot; Boxes represent frames. The asterisk (*) represents the current parse position. Optionally, parse actions can have additional arguments, like target syntactic or semantic classes to overwrite any default. Elements on the input list are identified by positive integers, elements on the parse stack by negative integers. The feature 'Synt of -1' for example refers to the (main) syntactic category of the top stack element. 
<Section position="5" start_page="119" end_page="120" type="metho"> <SectionTitle> 4 Learning to Parse </SectionTitle> <Paragraph position="0"> We base our training on the machine learning based approach of (Hermjakob & Mooney, 1997), allowing, however, unrestricted text and deriving the parse action sequences required for training from a treebank. The basic mechanism for parsing text into a shallow semantic representation is a shift-reduce type parser (Marcus, 1980) that breaks parsing into an ordered sequence of small and manageable parse actions. Figure 2 shows a typical reduce action. The key task of machine learning then is to learn to predict which parse action to perform next.</Paragraph> <Paragraph position="1"> Two key advantages of this type of deterministic parsing are that its linear run-time complexity with respect to sentence length makes the parser very fast, and that the parser is very robust in that it produces a parse tree for every input sentence.</Paragraph> <Paragraph position="2"> Figure 3 shows the overall architecture of parser training. From the treebank, we first automatically generate a parse action sequence. Then, for every step in the parse action sequence, typically several dozen per sentence, we automatically compute the value of every feature in the feature set, add the parse action as the proper classification of the example, and then feed these examples into a machine learning program, for which we use an extension of decision trees (Quinlan, 1986; Hermjakob & Mooney, 1997).</Paragraph> <Paragraph position="3"> We built our parser incrementally. Starting with a small set of syntactic features that are useful across all languages, early training and testing runs reveal machine learning conflict sets and parsing errors that point to additionally required features and possibly also additional background knowledge. A conflict set is a set of training examples that have identical values for all features, yet differ in their classification (= parse action). Machine learning therefore cannot possibly learn how to handle all of these examples correctly.</Paragraph>
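To make the definition concrete, here is a small, hypothetical Python sketch of how such conflict sets could be detected in a set of training examples; the feature names and parse actions are invented for illustration and do not come from the authors' system.

# Hypothetical sketch: group training examples into conflict sets, i.e. examples
# whose feature values are identical but whose classifications (parse actions) differ.

from collections import defaultdict

def conflict_sets(examples):
    """examples: list of (feature_dict, parse_action) pairs.
    Returns the groups that share all feature values yet disagree on the action."""
    groups = defaultdict(list)
    for feats, action in examples:
        key = tuple(sorted(feats.items()))   # identical feature values -> identical key
        groups[key].append((feats, action))
    return [g for g in groups.values() if len({action for _, action in g}) > 1]

if __name__ == "__main__":
    examples = [
        ({"synt_of_-1": "np", "synt_of_1": "verb"}, "shift"),
        ({"synt_of_-1": "np", "synt_of_1": "verb"}, "reduce 2 to vp"),  # conflicts with the first
        ({"synt_of_-1": "vp", "synt_of_1": "nil"}, "done"),
    ]
    for group in conflict_sets(examples):
        print("conflict set:", group)

Any non-empty result would signal that an additional, linguistically motivated feature is needed to separate the conflicting examples, which is exactly the resolution strategy described next.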
<Paragraph position="4"> This is typically resolved by adding an additional feature that differentiates between the examples in a linguistically relevant way.</Paragraph> <Paragraph position="5"> Even treebanking benefits from an incremental approach. Trained on more and more sentences, and at the same time with more and more features, parser quality improves, so that the parser as a treebanking tool has to be corrected less and less frequently, thereby accelerating the treebanking process. (Figure 3 caption, beginning lost: ... and a feature set. The resulting parser has the form of a decision structure, an extension of decision trees. Given a seen or unseen sentence in the form of a list of words, the decision structure keeps selecting the next parse action until a single parse tree covering the entire sentence has been built.)</Paragraph> <Paragraph position="6"> The analyzer divides '31il' into groups with a varying number of sub-components with different parts of speech. When shifting in an element, the parser has to decide which one to pick, the third one in this case, using context of course.</Paragraph> <Paragraph position="7"> The module generating parse action sequences from a tree needs special split and merge operations for cases where the correct segmentation is not offered as a choice at all. To make things a little ugly, these splits can not only occur in the middle of a leaf constituent, but even in the middle of a character that might have been contracted from two characters, each with its own meaning.</Paragraph> </Section> <Section position="6" start_page="120" end_page="121" type="metho"> <SectionTitle> 5 Chosun Newspaper Experiments </SectionTitle> <Paragraph position="0"> Table 1 presents evaluation results with the number of training sentences varying from 32 to 1024 and with the remaining 163 sentences of the treebank used for testing.</Paragraph> <Paragraph position="1"> Precision: (number of correct constituents in the system parse) / (number of constituents in the system parse). Recall: (number of correct constituents in the system parse) / (number of constituents in the logged parse). Crossing brackets: the number of constituents that violate constituent boundaries with a constituent in the logged parse. Labeled precision/recall measures not only structural correctness, but also the correctness of the syntactic label. Correct operations measures the number of correct operations during a parse that is continuously corrected based on the logged sequence; it measures the core machine learning algorithm performance in isolation. A sentence has a correct operating sequence if the system fully predicts the logged parse action sequence, and a correct structure and labeling if the structure and syntactic labeling of the final system parse of the sentence is 100% correct, regardless of the operations leading to it.</Paragraph> <Paragraph position="2"> Figures 4 and 5 plot the learning curves for two key metrics. While both curves are clearly heading in the right direction, up for precision and down for crossing brackets, their appearance is somewhat jagged. For smaller data sets like in our case, this can often be avoided by running an n-fold cross-validation test. However, we decided not to do so, because many training sentences were also used for feature set and background knowledge development as well as for intermediate inspection, and therefore might have unduly influenced the evaluation.</Paragraph> <Section position="1" start_page="120" end_page="121" type="sub_section"> <SectionTitle> 4.1 Special Adaptation for Korean </SectionTitle> <Paragraph position="0"> The segmenter and morphological analyzer KMA returns a list of alternatives for each eojeol. However, the alternatives are not atomic but rather two-level constituents, or mini-trees. Consider for example the four alternatives for the eojeol '31il' (the 31st day of a month) shown in the corresponding figure (omitted). (KMA actually produces 10 different alternatives in this case, of which only four are shown.)</Paragraph> </Section>
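The exact output format of KMA is not given in the text. The following hypothetical Python sketch merely illustrates the idea of two-level mini-tree alternatives for a single eojeol and of a shift action that must commit to one of them; the listed analyses are invented placeholders, not KMA's actual output for '31il'.

# Hypothetical sketch of two-level mini-tree alternatives for one eojeol and a
# shift action that commits to one of them. Placeholder data, not KMA output.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class MiniTree:
    """A two-level constituent: a parent category over analyzed sub-components."""
    cat: str
    leaves: list[tuple[str, str]]   # (surface form, part of speech)

# Invented alternative analyses of a single eojeol, standing in for the analyzer's output.
alternatives = [
    MiniTree("np",   [("31il", "noun")]),
    MiniTree("np",   [("31", "number"), ("il", "noun")]),
    MiniTree("np",   [("31", "number"), ("il", "time-suffix")]),
    MiniTree("advp", [("31il", "adverb")]),
]

def shift(parse_stack: list[MiniTree], choices: list[MiniTree], pick: int) -> None:
    """Shift one of the alternatives onto the parse stack. In the real parser the
    choice of 'pick' would be made by the learned decision structure, using context."""
    parse_stack.append(choices[pick])

if __name__ == "__main__":
    stack: list[MiniTree] = []
    shift(stack, alternatives, pick=2)   # e.g. the third alternative, as in the text above
    print(stack[-1])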
<Section position="2" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 5.1 Tagging accuracy </SectionTitle> <Paragraph position="0"> A particularly striking number is the tagging accuracy, 94.2%, which is dramatically below the equivalent 98% to 99% range for a good English or Japanese parser. In a Korean sentence, only larger constituents that typically span several words are separated by spaces, and even then not consistently, so that segmentation errors are a major source of tagging problems (as they are to some degree also for Japanese: while Japanese does not use spaces at all, script changes between kanji, hiragana, and katakana provide a lot of segmentation guidance, whereas modern Korean almost exclusively uses a single phonetic script). We found that the segmentation part of KMA sometimes still struggles with relatively simple issues like punctuation, proposing for example words that contain a parenthesis in the middle of standard alphabetic characters. We have corrected some of these problems by pre- and post-processing the results of KMA, but believe that there is still significant potential for further improvement. In order to assess the impact of the relatively low tagging accuracy, we conducted experiments that simulated a perfect tagger by initializing the parser with the correctly segmented, morphologically analyzed and tagged sentence according to the treebank. By construction, the tagging accuracy in table 2 rises to 100%. Since the segmenter/tagger returns not just atomic but rather two-level constituents, the precision and recall values benefit particularly strongly, possibly inflating the improvements for these metrics, but other metrics like crossing brackets per sentence show substantial gains as well.</Paragraph> <Paragraph position="1"> Thus we believe that refined pre-parsing tools, as they are in the process of becoming available for Korean, will greatly improve parsing accuracy.</Paragraph> <Paragraph position="2"> However, for true low density languages, such high quality preprocessors are probably not available, so that our experimental scenario might be more realistic for those conditions. On the other hand, some low density languages, like for example Tetun, the principal indigenous language of East Timor, are based on the Latin alphabet, separate words by spaces, and have relatively little inflection, and therefore make morphological analysis and segmentation relatively simple.</Paragraph> </Section> </Section>
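As a concrete reading of the evaluation measures defined in section 5, the following simplified Python sketch computes labeled or unlabeled precision, recall, and crossing brackets from constituents represented as (label, start, end) spans. It is a stand-in for illustration only, not the authors' evaluation code, and ignores issues such as duplicate constituents.

# Simplified, hypothetical sketch of the evaluation measures defined in section 5.
# Constituents are represented as (label, start, end) spans over the sentence.

def precision_recall(system, logged, labeled=False):
    """Precision = correct constituents in the system parse / constituents in the system parse.
    Recall    = correct constituents in the system parse / constituents in the logged parse."""
    strip = (lambda c: c) if labeled else (lambda c: (c[1], c[2]))
    gold = {strip(c) for c in logged}
    correct = sum(1 for c in system if strip(c) in gold)
    return correct / len(system), correct / len(logged)

def crossing_brackets(system, logged):
    """Number of system constituents whose span crosses the span of some logged constituent."""
    def crosses(a, b):
        (_, s1, e1), (_, s2, e2) = a, b
        return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)
    return sum(1 for c in system if any(crosses(c, g) for g in logged))

if __name__ == "__main__":
    logged = [("s", 0, 5), ("np", 0, 2), ("vp", 2, 5)]
    system = [("s", 0, 5), ("np", 0, 3), ("vp", 2, 5)]
    print("labeled precision, recall:", precision_recall(system, logged, labeled=True))
    print("crossing brackets:", crossing_brackets(system, logged))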
<Section position="7" start_page="121" end_page="122" type="metho"> <SectionTitle> 6 Treebank Consistency Checking </SectionTitle> <Paragraph position="0"> It is difficult to maintain high treebank quality.</Paragraph> <Paragraph position="1"> When training on a small treebank, this is particularly important, because there is not enough data to allow generous pruning.</Paragraph> <Paragraph position="2"> Treebanking is done by humans, and humans err.</Paragraph> <Paragraph position="3"> Even with annotation guidelines there are often additional inconsistencies when there are several annotators. In the Penn Treebank (Marcus, 1993), for example, the word ago as in 'two years ago' is tagged 414 times as an adverb and 150 times as a preposition. In many treebanking efforts, basic taggers and parsers suggest parts of speech and tree structures that can be accepted or corrected, typically speeding up the treebanking effort considerably. However, incorrect defaults can easily slip through, leaving blatant inconsistencies like the one where the constituent 'that' as in 'the dog that bit her' is treebanked as a noun phrase containing a conjunction (as opposed to a pronoun).</Paragraph> <Paragraph position="4"> From the very beginning of treebanking, we have therefore passed all trees to be added to the treebank through a consistency checker that looks for any suspicious patterns in the new tree. For every type of phrase, the consistency checker draws on a list of acceptable patterns in a BNF-style notation. While this consistency checking certainly does not guarantee to find all errors, and can produce false alarms when encountering rare but legitimate constructions, we have found it a very useful tool for maintaining treebank quality from the very beginning, easily offsetting the roughly three person days it took to adapt the consistency checker to Korean.</Paragraph> <Paragraph position="5"> For a number of typical errors, we extended the checker to automatically correct errors for which this could be done safely, or, alternatively, to suggest a likely correction and prompt for confirmation or correction by the treebanker.</Paragraph> </Section> </Paper>