<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1008">
  <Title>Bootstrapping Deep Lexical Resources: Resources for Courses</Title>
  <Section position="4" start_page="69" end_page="70" type="metho">
    <SectionTitle>
3 Morphology-based Deep Lexical
Acquisition
</SectionTitle>
    <Paragraph position="0"> We first perform DLA based on the following morphological LRs: (1) word lists, and (2) morphological lexicons with a description of derivational word correspondences. Note that in evaluation, we presuppose that we have access to word lemmas although in the first instance, it would be equally possible to run the method over non-lemmatised data.5</Paragraph>
    <Section position="1" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
3.1 Character n-grams
</SectionTitle>
      <Paragraph position="0"> In line with our desire to produce DLA methods which can be deployed over both low- and high-density languages, our first feature representation takes a simple word list and converts each lexeme into a character n-gram representation.6 In the case of English, we generated all 1- to 6-grams for each lexeme, and applied a series of filters to: (1) filter out all n-grams which occurred less than 3 times in the lexicon data; and (2) filter out all n-grams which occur with the same frequency as larger n-grams they are proper substrings of. We then select the 3,900 character n-grams with highest saturation across the lexicon data (see Section 2.2).</Paragraph>
      <Paragraph position="1"> The character n-gram-based classifier is the simplest of all classifiers employed in this research, and can be deployed on any language for which we have a word list (ideally lemmatised).</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="70" type="sub_section">
      <SectionTitle>
3.2 Derivational morphology
</SectionTitle>
      <Paragraph position="0"> The second morphology-based DLA method makes use of derivational morphology and analysis of the process of word formation. As an example of how derivational information could assist DLA, knowing that the noun achievement is deverbal and incorporates the -ment suffix is a strong predictor of it being optionally uncountable and optionally selecting for a PP argument (i.e. being of lexical type n mass count ppof le).</Paragraph>
      <Paragraph position="1"> We generate derivational morphological features for a given lexeme by determining its word cluster in CATVAR7 (Habash and Dorr, 2003) and then for each sister lexeme (i.e. lexeme occurring in the  attempt to dehyphenate and then deprefix the word to find a match, failing which we look for the lexeme of smallest edit distance.</Paragraph>
      <Paragraph position="2">  same cluster as the original lexeme with the same word stem), determine if there is a series of edit operations over suffixes and prefixes which maps the lexemes onto one another. For each sister lexeme where such a correspondence is found to exist, we output the nature of the character transformation and the word classes of the lexemes involved. E.g., the sister lexemes for achievementN in CAT-VAR are achieveV, achieverN, achievableAdj and achievabilityN; the mapping between achievementN and achieverN, e.g., would be analysed as: N [?]ment$ - N +r$ Each such transformation is treated as a single feature. null We exhaustively generate all such transformations for each lexeme, and filter the feature space as for character n-grams above.</Paragraph>
      <Paragraph position="3"> Clearly, LRs which document derivational morphology are typically only available for high-density languages. Also, it is worth bearing in mind that derivational morphology exists in only a limited form for certain language families, e.g. agglutinative languages.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="70" end_page="71" type="metho">
    <SectionTitle>
4 Syntax-based Deep Lexical Acquisition
</SectionTitle>
    <Paragraph position="0"> Syntax-based DLA takes a raw text corpus and preprocesses it with either a tagger, chunker or dependency parser. It then extracts a set of 39 feature types based on analysis of the token occurrences of a given lexeme, and filters over each feature type to produce a maximum of 50 feature instances of highest saturation (e.g. if the feature type is the word immediately proceeding the target word, the feature instances are the 50 words which proceed the most words in our lexicon). The feature signature associated with a word for a given preprocessor type will thus have a maximum of 3,900 items (39x50x2).8</Paragraph>
    <Section position="1" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.1 Tagging
</SectionTitle>
      <Paragraph position="0"> The first and most basic form of syntactic pre-processing is part-of-speech (POS) tagging. For our purposes, we use a Penn treebank-style tagger custom-built using fnTBL 1.0 (Ngai and Florian, 2001), and further lemmatise the output of the tagger using morph (Minnen et al., 2000).</Paragraph>
      <Paragraph position="1"> 8Note that we will have less than 50 feature instances for some feature types, e.g. the POS tag of the target word, given that the combined size of the Penn POS tagset is 36 elements (not including punctuation).</Paragraph>
      <Paragraph position="2"> The feature types used with the tagger are detailed in Table 2, where the position indices are relative to the target word (e.g. the word at position [?]2 is two words to the left of the target word, and the POS tag at position 0 is the POS of the target word). All features are relative to the POS tags and words in the immediate context of each token occurrence of the target word. &amp;quot;Bi-words&amp;quot; are word bigrams (e.g. biword (1,3) is the bigram made up of the words one and three positions to the right of the target word); &amp;quot;bi-tags&amp;quot; are, similarly, POS tag bigrams.</Paragraph>
    </Section>
    <Section position="2" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.2 Chunking
</SectionTitle>
      <Paragraph position="0"> The second form of syntactic preprocessing, which builds directly on the output of the POS tagger, is CoNLL 2000-style full text chunking (Tjong Kim Sang and Buchholz, 2000). The particular chunker we use was custom-built using fnTBL 1.0 once again, and operates over the lemmatised output of the POS tagger.</Paragraph>
      <Paragraph position="1"> The feature set for the chunker output includes a subset of the POS tagger features, but also makes use of the local syntactic structure in the chunker input in incorporating both intra-chunk features (such as modifiers of the target word if it is the head of a chunk, or the head if it is a modifier) and inter-chunk features (such as surrounding chunk types when the target word is chunk head). See Table 2 for full details. null Note that while chunk parsers are theoretically easier to develop than full phrase-structure or tree-bank parsers, only high-density languages such as English and Japanese have publicly available chunk parsers.</Paragraph>
    </Section>
    <Section position="3" start_page="70" end_page="71" type="sub_section">
      <SectionTitle>
4.3 Dependency parsing
</SectionTitle>
      <Paragraph position="0"> The third and final form of syntactic preprocessing is dependency parsing, which represents the pinnacle of both robust syntactic sophistication and inaccessibility for any other than the highest-density languages. null The particular dependency parser we use is RASP9 (Briscoe and Carroll, 2002), which outputs head-modifier dependency tuples and further classifies each tuple according to a total of 14 relations; RASP also outputs the POS tag of each word token. As our features, we use both local word and  and chunker, and also dependency-derived features, namely the modifier of all dependency tuples the target word occurs as head of, and conversely, the head of all dependency tuples the target word occurs as modifier in, along with the dependency relation in each case. See Table 2 for full details.</Paragraph>
    </Section>
    <Section position="4" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
4.4 Corpora
</SectionTitle>
      <Paragraph position="0"> We ran the three syntactic preprocessors over a total of three corpora, of varying size: the Brown corpus ([?]460K tokens) and Wall Street Journal corpus ([?]1.2M tokens), both derived from the Penn Tree-bank (Marcus et al., 1993), and the written component of the British National Corpus ([?]98M tokens: Burnard (2000)). This selection is intended to model the effects of variation in corpus size, to investigate how well we could expect syntax-based DLA methods to perform over both smaller and larger corpora.</Paragraph>
      <Paragraph position="1"> Note that the only corpus annotation we make use of is sentence tokenisation, and that all preprocessors are run automatically over the raw corpus data. This is in an attempt to make the methods maximally applicable to lower-density languages where annotated corpora tend not to exist but there is at least the possibility of accessing raw text collections.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="71" end_page="72" type="metho">
    <SectionTitle>
5 Ontology-based Deep Lexical
Acquisition
</SectionTitle>
    <Paragraph position="0"> The final DLA method we explore is based on the hypothesis that there is a strong correlation between the semantic and syntactic similarity of words, a claim which is best exemplified in the work of Levin (1993) on diathesis alternations. In our case, we take word similarity as given and learn the syntactic behaviour of novel words relative to semanticallysimilar words for which we know the lexical types. We use WordNet 2.0 (Fellbaum, 1998) to determine word similarity, and for each sense of the target word in WordNet: (1) construct the set of &amp;quot;semantic neighbours&amp;quot; of that word sense, comprised of all synonyms, direct hyponyms and direct hypernyms; and (2) take a majority vote across the lexical types of the semantic neighbours which occur in the training data. Note that this diverges from the learning paradigm adopted for the morphology- and syntax-based DLA methods in that we use a simple voting strategy rather than relying on an external learner to carry out the classification. The full set of lexical entries for the target word is generated by taking the union of the majority votes across all senses of the word, such that a polysemous lexeme can potentially give rise to multiple lexical entries. This learning  procedure is based on the method used by van der Beek and Baldwin (2004) to learn Dutch countability. null As for the suite of binary classifiers, we fall back on the majority class lexical type as the default in the instance that a given lexeme is not contained in WordNet 2.0 or no classification emerges from the set of semantic neighbours. It is important to realise that WordNet-style ontologies exist only for the highest-density languages, and that this method will thus have very limited language applicability.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML