<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3009"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics Integrated Morphological and Syntactic Disambiguation for Modern Hebrew</Title> <Section position="6" start_page="50" end_page="52" type="metho"> <SectionTitle> 4 The Integrated Model </SectionTitle> <Paragraph position="0"> As a first attempt to model the interaction between the morphological and the syntactic tasks, we incorporate an intermediate level of part-of-speech (POS) tagging into our model. The key idea is that the POS tags assigned to morphological segments at the word level coincide with the lowest level of non-terminals in the syntactic parse trees (cf. (Charniak et al., 1996)). Thus, POS tags can be used to pass information between the different tasks while ensuring agreement between the two.</Paragraph> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 4.1 Formal Setting </SectionTitle> <Paragraph position="0"> Let w^m_1 be a sequence of words from a fixed vocabulary, s^n_1 a sequence of segments of words from a (different) vocabulary, t^n_1 a sequence of morphosyntactic categories from a finite tag-set, and let π be a syntactic parse tree.</Paragraph> <Paragraph position="1"> We define segmentation as the task of identifying the sequence of morphological constituents that were concatenated to form a sequence of words. Formally, we define the task as (1), where seg(w^m_1) is the set of segmentations resulting from all possible morphological analyses of w^m_1.</Paragraph> <Paragraph position="3"> Syntactic analysis (parsing) identifies the structure of phrases and sentences. In MH, such tree structures combine segments of words that serve different syntactic functions. 
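The body of equation (1) was lost in extraction. A plausible reconstruction from the surrounding prose (our notation, not necessarily the paper's original typesetting):

```latex
% (1) Segmentation: the most probable segment sequence for the words w_1^m
\hat{s}_1^{\,n} = \operatorname*{argmax}_{s_1^n \,\in\, seg(w_1^m)} P\!\left(s_1^n \mid w_1^m\right)
```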
We define it formally as (2), where yield(π′) is the ordered set of leaves of a syntactic parse tree π′.</Paragraph> <Paragraph position="5"> Similarly, we define POS tagging as (3), where analysis(s^n_1) is the set of all possible POS tag assignments for s^n_1.</Paragraph> <Paragraph position="7"> The task of the integrated model is to find the most probable segmentation and syntactic parse tree given a sentence in MH, as in (4).</Paragraph> <Paragraph position="9"> We reinterpret (4) to distinguish the morphological and syntactic tasks, conditioning the latter on the former, yet maximizing for both.</Paragraph> <Paragraph position="11"> Agreement between the tasks is implemented by incorporating morphosyntactic categories (POS tags) that are assigned to morphological segments and constrain the possible trees, resulting in (7).</Paragraph> <Paragraph position="13"> Finally, we employ the assumption that</Paragraph> <Paragraph position="15"> conjoined in a certain order.7 (Footnote 7: Since concatenated particles (conjunctions and the like) appear in front of the stem, pronominal and inflectional affixes at the end of the stem, and derivational morphology inside the stem, there is typically a unique way to restore word boundaries.) So, instead of (5) and (7) we end up with (8) and (9), respectively.</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 4.2 Evaluation Metrics </SectionTitle> <Paragraph position="0"> The intertwined nature of morphology and syntax in MH poses additional challenges to standard parsing evaluation metrics. First, note that we cannot use morphemes as the basic units for comparison, as the proposed segmentation need not coincide with the gold segmentation for a given sentence. Since words are complex entities that can span across phrases (see figure 2), we cannot use them for comparison either. 
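The equation bodies for (2) through (7) in section 4.1 above were likewise lost in extraction. The prose definitions suggest reconstructions along the following lines (our notation and our reading of the decomposition, offered as a sketch rather than the paper's original formulas):

```latex
% (2) Parsing: the most probable tree whose leaves are the segments s_1^n
\hat{\pi} = \operatorname*{argmax}_{\pi' :\; yield(\pi') = s_1^n} P\!\left(\pi' \mid s_1^n\right)

% (3) POS tagging over the possible tag assignments of s_1^n
\hat{t}_1^{\,n} = \operatorname*{argmax}_{t_1^n \,\in\, analysis(s_1^n)} P\!\left(t_1^n \mid s_1^n\right)

% (4) The integrated task, and (5) its decomposition: syntax conditioned
% on morphology, with both maximized jointly
(\hat{\pi}, \hat{s}_1^{\,n})
  = \operatorname*{argmax}_{\pi,\, s_1^n} P\!\left(\pi, s_1^n \mid w_1^m\right)
  = \operatorname*{argmax}_{\pi,\, s_1^n} P\!\left(\pi \mid s_1^n\right) P\!\left(s_1^n \mid w_1^m\right)

% (7) Agreement via POS tags: the tag assignment constrains the trees
% and is maximized alongside the segmentation
(\hat{\pi}, \hat{t}_1^{\,n}, \hat{s}_1^{\,n})
  = \operatorname*{argmax}_{\pi,\, t_1^n,\, s_1^n}
    P\!\left(\pi \mid t_1^n, s_1^n\right) P\!\left(t_1^n \mid s_1^n\right) P\!\left(s_1^n \mid w_1^m\right)
```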
We propose to redefine precision and recall by considering the spans of syntactic categories based on the (space-free) sequences of characters to which they correspond. Formally, we define syntactic constituents as <i,A,j>, where i,j mark the locations of characters. T = {<i,A,j> | A spans from i to j in the test parse} and G = {<i,A,j> | A spans from i to j in the gold parse} represent the test and gold parses, respectively, and we calculate:8</Paragraph> </Section> <Section position="3" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 4.3 Experimental Setup </SectionTitle> <Paragraph position="0"> Our departure point for the syntactic analysis of MH is that the basic units for processing are not words, but morphological segments that are concatenated together to form words. Therefore, we obtain a segment-based probabilistic grammar by training a Probabilistic Context-Free Grammar (PCFG) on a segmented and annotated MH corpus (Sima'an et al., 2001). Then, we use existing tools, i.e., a morphological analyzer (Segal, 2000), a part-of-speech tagger (Bar-Haim, 2005), and a general-purpose parser (Schmid, 2000), to find compatible morphological segmentations and syntactic analyses for unseen sentences.</Paragraph> <Paragraph position="1"> The Data The data set we use is taken from the MH treebank, which consists of 5001 sentences from the daily newspaper 'ha'aretz' (Sima'an et al., 2001). We employ the syntactic categories and POS tag sets developed therein. Our data set includes 3257 sentences of length greater than 1 and less than 21. The number of segments per sentence is 60% higher than the number of words per sentence.9 We conducted 8 experiments in which the data is split into training and test sets, applying cross-validation to obtain robust averages.</Paragraph> <Paragraph position="2"> The Models Model I uses the morphological analyzer and the POS tagger to find the most probable segmentation for a given sentence. 
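The redefined span-based metric lends itself to a compact sketch. The following is an illustrative Python implementation under our own reading of the definitions above (not the authors' evaluation code); constituents are (i, A, j) triples over positions in the space-free character string:

```python
def span_prf(test_spans, gold_spans):
    """Labeled precision/recall over character-based constituent spans.

    Each span is a triple (i, A, j): category A covers positions i..j of
    the space-free character string, so parses built over different
    segmentations of the same sentence remain directly comparable.
    """
    if not test_spans or not gold_spans:
        return 0.0, 0.0  # empty trees count as zero
    matched = len(set(test_spans) & set(gold_spans))
    return matched / len(test_spans), matched / len(gold_spans)

# Toy example: the test parse finds the NP over characters 0-5 but
# mislabels the 5-9 span, so both precision and recall are 0.5.
gold = {(0, "NP", 5), (5, "VP", 9)}
test = {(0, "NP", 5), (5, "S", 9)}
precision, recall = span_prf(test, gold)  # (0.5, 0.5)
```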
This is done by providing the POS tagger with multiple morphological analyses per word and maximizing the sum over all POS tag assignments, Σ_{t^n_1} P(t^n_1, s^n_1 | w^m_1). (Footnote 9: The average number of words per sentence in the complete corpus is 17, while the average number of morphological segments per sentence is 26.)</Paragraph> <Paragraph position="3"> Then, the parser is used to find the most probable parse tree for the selected sequence of morphological segments. Formally, this model is a first approximation of equation (8) using a step-wise maximization instead of a joint one.10 In Model II we percolate the morphological ambiguity further, to the lowest level of non-terminals in the syntactic trees. Here we use the morphological analyzer and the POS tagger to find the most probable segmentation and POS tag assignment by maximizing the joint probability P(t^n_1, s^n_1 | w^m_1) (Bar-Haim, 2005, section 5.2). Then, the parser is used to parse the tagged segments. Formally, this model attempts to approximate equation (9).</Paragraph> <Paragraph position="4"> (Note that here we couple a morphological and a syntactic decision, as we are looking to maximize P(t^n_1, s^n_1 | w^m_1) ≈ P(t^n_1 | s^n_1) P(s^n_1 | w^m_1) and constrain the space of trees to those that agree with the resulting analysis.)11 In both models, smoothing the estimated probabilities is delegated to the relevant subcomponents. Out-of-vocabulary (OOV) words are treated by the morphological analyzer, which proposes all possible segmentations, assuming that the stem is a proper noun. The trigram model used for POS tagging is smoothed using Good-Turing discounting (see (Bar-Haim, 2005, section 6.1)), and the parser uses absolute discounting with various backoff strategies (Schmid, 2000, section 4.4).</Paragraph> <Paragraph position="5"> The Tag-Sets To examine the usefulness of various morphological features shared with the parsing task, we alter the set of morphosyntactic categories to include more fine-grained morphological distinctions. 
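The difference between the step-wise Model I and the joint Model II can be illustrated on a toy tagged-segmentation lattice. All strings and probabilities below are invented for illustration: Model I marginalizes the tags out (a sum per segmentation), while Model II maximizes the joint probability, so the two can select different segmentations from the same lattice.

```python
from collections import defaultdict

# Hypothetical lattice for one word sequence. Keys are
# (segmentation, tag assignment) pairs; values stand in for
# P(t^n_1, s^n_1 | w^m_1). Numbers are made up.
joint_p = {
    (("bcl", "m"), ("NN", "PREP")): 0.20,
    (("b", "cl", "m"), ("PREP", "NN", "POS")): 0.15,
    (("b", "cl", "m"), ("PREP", "VB", "POS")): 0.14,
}

def model_i_segmentation(joint_p):
    """Model I: pick the segmentation maximizing the *sum* over tag
    assignments, sum_t P(t, s | w); tags are marginalized out."""
    marginal = defaultdict(float)
    for (seg, _tags), p in joint_p.items():
        marginal[seg] += p
    return max(marginal, key=marginal.get)

def model_ii_analysis(joint_p):
    """Model II: pick the (segmentation, tags) pair maximizing the
    joint probability P(t, s | w) itself."""
    return max(joint_p, key=joint_p.get)

seg_i = model_i_segmentation(joint_p)         # ("b", "cl", "m"): mass 0.29
seg_ii, tags_ii = model_ii_analysis(joint_p)  # ("bcl", "m"): mass 0.20
```

Note how the three-segment analysis wins under Model I (0.15 + 0.14 = 0.29 > 0.20) even though no single tagged variant of it beats the two-segment analysis, which is exactly why the step-wise and joint architectures are not equivalent.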
We use three sets: Set A contains bare POS categories, Set B also identifies definite nouns marked for possession, and Set C adds the distinction between finite and non-finite verb forms.</Paragraph> <Paragraph position="6"> Evaluation We use seven measures to evaluate our models' performance on the integrated task.</Paragraph> <Paragraph position="7"> 10At the cost of incurring independence assumptions, a step-wise architecture is computationally cheaper than a joint one, and this is perhaps the simplest end-to-end architecture for MH parsing imaginable. In the absence of previous MH parsing results, this model is suitable to serve as a baseline against which we compare more sophisticated models.</Paragraph> <Paragraph position="8"> 11We further developed a third model, Model III, which is a more faithful, yet computationally affordable, approximation of equation (9). There we percolate the ambiguity all the way through the integrated architecture by providing the parser with the n-best sequences of tagged morphological segments and selecting the analysis <π, t^n_1, s^n_1> which maximizes the product P(π | t^n_1, s^n_1) P(s^n_1, t^n_1 | w^m_1). However, we had not yet obtained robust results for this model prior to the submission of this paper, and therefore we leave it for future discussion.</Paragraph> <Paragraph position="9"> First, we present the percentage of sentences for which the model could propose a pair of corresponding morphological and syntactic analyses.</Paragraph> <Paragraph position="10"> This measure is referred to as string coverage. To indicate morphological disambiguation capabilities we report segmentation precision and recall. 
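Since the proposed and gold segmentations may contain different numbers of segments, segmentation precision and recall need a common currency. One plausible realization (our sketch, not necessarily the paper's exact protocol) scores segments as character-offset spans over the space-free string:

```python
def seg_spans(segments):
    """Render a segment sequence as character-offset spans over the
    space-free string, so segmentations of different lengths compare."""
    spans, i = set(), 0
    for seg in segments:
        spans.add((i, i + len(seg)))
        i += len(seg)
    return spans

def seg_prf(test_segs, gold_segs):
    """Segmentation precision/recall via exact span matches."""
    test, gold = seg_spans(test_segs), seg_spans(gold_segs)
    matched = len(test & gold)
    return matched / len(test), matched / len(gold)

# Hypothetical surface string "bclm": the test splits it b+cl+m while
# the gold has bcl+m; only the final segment's span agrees, giving
# precision 1/3 and recall 1/2.
p, r = seg_prf(["b", "cl", "m"], ["bcl", "m"])
```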
To capture tagging and parsing accuracy, we refer to our redefined Parseval measures and separate the evaluation of morphosyntactic categories, i.e., POS tag precision and recall, from that of phrase-level syntactic categories, i.e., labeled precision and recall (where root nodes are discarded and empty trees are counted as zero).12 The labeled categories are evaluated against the original tag set.</Paragraph> </Section> </Section> </Paper>