<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1316">
  <Title>Selecting Text Features for Gene Name Classification: from Documents to Terms</Title>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Feature selection and engineering
</SectionTitle>
    <Paragraph position="0"> The main aim while selecting classification features is to find (and use) textual attributes that can improve the classification accuracy and accelerate the learning phase. In our experiments we examined the impact of different types of features on the performance of an SVM-based gene name classification task. The main objective was to investigate whether additional linguistic pre-processing of documents could improve the SVM results, and, in particular, whether semantic processing (such as terminological analysis) was beneficial for the classification task. In other words, we wanted to see which textual units should be generated as input feature vectors, and what level of pre-processing was appropriate in order to produce more accurate predictions.</Paragraph>
    <Paragraph position="1"> We have experimented with two types of textual features: in the first case, we have used a classic bag-of-single-words approach, with different levels of lexical pre-processing (i.e. single words, lemmas, and stems). In the second case, features related to semantic pre-processing of documents have been generated: a set of automatically extracted multi-word terms (other than gene names to be classified) has been used as a feature set. Additionally, we have experimented with features reflecting simple gene-gene co-occurrences within the same documents.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Single words as features
</SectionTitle>
      <Paragraph position="0"> The first set of experiments included a classic bag-of-single-words approach. All abstracts (from a larger collection, see Section 4) that contained at least one occurrence of a given gene or its aliases have been selected as documents relevant for that gene. These documents have been treated as a single virtual document pertinent to the given gene.</Paragraph>
      <Paragraph position="1"> All words co-occurring with a given gene in any of the abstracts were used as its features.</Paragraph>
      <Paragraph position="2"> A word has been defined as an alphanumeric sequence between two standard separators, with all numeric expressions that were not part of other words filtered out. In addition, a standard list of around 300 stop-words has been used to exclude some frequent non-content words.</Paragraph>
      <Paragraph position="3"> An idf-like measure has been used for feature weights: the weight of a word w for gene g is given  is a set of relevant documents for the gene g, f j (w) is the frequency of w in document j, and N w is the global frequency of w. Gene vectors,  containing weights for all co-occurring words, have been used as input for the SVM.</Paragraph>
      <Paragraph position="4"> It is widely accepted that rare words do not have any significant influence on accuracy (cf. (Leopold and Kindermann, 2002)), neither do words appearing only in few documents. In our experiments (demonstrated in Section 4), we compared the performance between the 'all-words approach' and an approach featuring words appearing in at least two documents. In the latter case, the dimension of the problem (expressed as the number of features) was significantly reduced (with factor 3), and consequently the training time was shortened (see Section 4).</Paragraph>
      <Paragraph position="5"> Since many authors claimed that the biomedical literature contained considerably more linguistic variations than text in general (cf. Yakushiji et al., 2001), we applied two standard transformations in order to reduce the level of lexical variability. In the first case, we used the EngCG POS tagger (Voutilainen and Heikkila, 1993) to generate lemmas, so that lemmatised words were used as features, while, in the second case, we generated stems by the Porter's algorithm (Porter, 1980). Analogously to words, the same idf-based measure was used for weights, and experiments were also performed with all features and with the features appearing in no less than two documents.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Terms as features
</SectionTitle>
      <Paragraph position="0"> Many literature-mining techniques rely heavily on the identification of main concepts, linguistically represented by domain specific terms (Nenadic et al., 2002b). Terms represent the most important concepts in a domain and have been used to characterise documents semantically (Maynard and Ananiadou, 2002). Since terms are semantic indicators used in scientific discourse, we hypothesised that they might be useful classification features.</Paragraph>
      <Paragraph position="1"> The high neology rate for terms makes existing glossaries incomplete for active and time-limited research, and thus automatic term extraction tools are needed for efficient terminological processing.</Paragraph>
      <Paragraph position="2"> In order to automatically generate term as features, we have used an enhanced version of the C-value method (Frantzi et al., 2000), which assigns termhoods to automatically extracted multi-word term candidates. The method combines linguistic formation patterns and statistical analysis. The linguistic part includes part-of-speech tagging, syntactic pattern matching and the use of a stop list to eliminate frequent non-terms, while statistical termhoods amalgamate four numerical characteristic of a candidate term, namely: the frequency of occurrence, the frequency of occurrence as a nested element, the number of candidate terms containing it as a nested element, and term's length.</Paragraph>
      <Paragraph position="3"> Due to the extensive term variability in the domain, the same concept may be designated by more than one term. Therefore, term variants conflation rules have been added to the linguistic part of the C-value method, in order to enhance the results of the statistical part. When term variants are processed separately by the statistical module, their termhoods are distributed across different variants providing separate frequencies for individual variants instead of a single frequency calculated for a term candidate unifying all of its variants. Hence, in order to make the most of the statistical part of the C-value method, all variants of the candidate terms are matched to their normalised forms by applying rule-based transformations and treated jointly as a term candidate (Nenadic et al., 2002a).</Paragraph>
      <Paragraph position="4"> In addition, acronyms are acquired prior to the selection of the term candidates and also mapped to their expanded forms, which are normalised in the same manner as other term candidates.</Paragraph>
      <Paragraph position="5"> Once a corpus has been terminologically processed, each target gene is assigned a set of terms appearing in the corresponding set of documents relevant to the given gene. Thus, in this case, gene vectors used in the SVM classifier contain co-occurring terms, rather than single words. As term weights, we have used a formula analogous to (1).</Paragraph>
      <Paragraph position="6"> Also, similarly to single-word features, we have experimented with terms appearing in at least two documents.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Combining word and term features
</SectionTitle>
      <Paragraph position="0"> The C-value method extracts only multi-word terms, which may be enriched during the normalisation process with some single-word terms, sourcing from e.g. acronyms or orthographic variations.</Paragraph>
      <Paragraph position="1"> In order to assess impact of both single and multi-word terms as features, we experimented with combining single-word based features with multi-word terms by using a simple kernel modification that concatenates the corresponding feature vectors. Thus, gene vectors used in this case contain both words and terms that genes co-occur with.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Document identifiers as features
</SectionTitle>
      <Paragraph position="0"> Term co-occurrences have been traditionally used as an indication of their similarity (Ushioda, 1986), with documents considered as bags of words in the majority of approaches. For example, Stapley et al.</Paragraph>
      <Paragraph position="1"> (2000) used document co-occurrence statistics of gene names in Medline abstracts to predict their connections. The co-occurrence statistics were represented by the reciprocal Dice coefficient. Similar approach has been undertaken by Jenssen et al.</Paragraph>
      <Paragraph position="2"> (2001): they identified co-occurrences of gene names within abstracts, and assigned weights to their &amp;quot;relationship&amp;quot; based on frequency of cooccurrence. null In our experiments, abstract identifiers (Pub-Med identifiers, PMIDs) have been used as features for classification, where the dimensionality of the feature space was equal to the number of documents in the document set. As feature weights, binary values (i.e. a gene is present/absent in a document) were used.</Paragraph>
      <Paragraph position="3"> We would like to point out that - contrary to other features - this approach is not a general learning approach, as document identifiers are not classification attributes that can be learnt and used against other corpora. Instead, this approach can be only used to classify new terms that appear in a closed corpus used for training.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>