<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1806"> <Title>Multiword Unit Hybrid Extraction</Title> <Section position="3" start_page="1" end_page="1" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> For the purpose of MWU extraction, syntactical, statistical and hybrid syntactico-statistical methodologies have been proposed. On the one hand, purely linguistic systems (Didier Bourigault, 1993) extract relevant MWUs by using techniques that analyse specific syntactical structures in the texts. However, these methodologies suffer from their monolingual basis, as the systems require highly specialised linguistic techniques to identify clues that isolate possible MWU candidates.</Paragraph> <Paragraph position="1"> On the other hand, purely statistical systems (Frank Smadja, 1993; Ted Dunning, 1993; Gael Dias, 2002) extract discriminating MWUs from text corpora by means of association measure regularities. As they use plain text corpora and only require the information appearing in texts, such systems are highly flexible and extract relevant units independently of the domain and the language of the input text. However, these methodologies can only identify textual associations in the context of their usage. As a consequence, many relevant structures cannot be introduced directly into lexical databases, as they do not guarantee adequate linguistic structures for that purpose.</Paragraph> <Paragraph position="2"> Finally, hybrid syntactico-statistical systems (Beatrice Daille, 1996; Jean-Philippe Goldman et al., 2001) define co-occurrences of interest in terms of syntactical patterns and statistical regularities. 
Thus, such systems reduce the search space to groups of words that correspond to a priori defined syntactical patterns (e.g.</Paragraph> <Paragraph position="3"> Adj+Noun, Noun+Prep+Noun) and apply statistical scores to identify the most relevant sequences of words.</Paragraph> <Paragraph position="4"> One major drawback of such systems is that they do not deal with a great proportion of interesting MWUs (e.g.</Paragraph> <Paragraph position="5"> phrasal verbs, prepositional locutions). Moreover, they lack flexibility, as the syntactical patterns have to be revised whenever the targeted language changes.</Paragraph> <Paragraph position="6"> In order to overcome these difficulties, we propose an original architecture that combines word statistics with endogenously acquired linguistic information. We base our study on two assumptions. On the one hand, many studies in lexicography and terminology assert that most MWUs exhibit well-known morpho-syntactic structures (Gaston Gross, 1996). On the other hand, MWUs are recurrent combinations of words. Indeed, according to Benoit Habert and Christian Jacquemin (1993), MWUs may represent a fifth of the overall surface of a text. Consequently, it is reasonable to think that the syntactical patterns embodied by MWUs may be endogenously identified by applying statistical scores to texts of part-of-speech tags, in exactly the same manner as word dependencies are identified in corpora of words. So, the global degree of cohesiveness of any sequence of words may be evaluated by combining the degree of cohesiveness of its words with the degree of cohesiveness of its associated part-of-speech tag sequence (see Figure 1).</Paragraph> <Paragraph position="7"> Compared to existing systems, our architecture offers two main benefits. 
By avoiding human intervention in the definition of syntactical patterns, HELAS (1) provides total flexibility of use, being independent of the targeted language, and (2) allows the identification of various MWUs such as phrasal verbs, adverbial locutions, compound determiners, prepositional locutions and institutionalized phrases.</Paragraph> </Section> </Paper>