File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-2083_metho.xml
Size: 17,802 bytes
Last Modified: 2025-10-06 14:10:29
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2083"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics A Term Recognition Approach to Acronym Recognition</Title> <Section position="5" start_page="643" end_page="647" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="643" end_page="644" type="sub_section"> <SectionTitle> 3.1 Term-based long-form identification </SectionTitle> <Paragraph position="0"> We propose a method for identifying the long forms of an acronym based on a term extraction technique. We focus on terms appearing frequently in the proximity of an acronym in a text collection. More specifically, if a word sequence co-occurs frequently with a specific acronym and not with other surrounding words, we assume that there is a relationship (footnote 8) between the acronym and the word sequence.</Paragraph> <Paragraph position="1"> Figure 1 illustrates our hypothesis taking the acronym TTF-1 as an example. The tree consists of expressions collected from all sentences with the acronym in parentheses and appearing before the acronym. A node represents a word, and a path from any node to TTF-1 represents a long-form candidate (footnote 9). The number above each node shows the co-occurrence frequency of the corresponding long-form candidate. 
For example, the long-form candidates 1, factor 1, transcription factor 1, and thyroid transcription factor 1 co-occur 218, 216, 213, and 209 times, respectively, with the acronym TTF-1. Although the candidates 1, factor 1, and transcription factor 1 co-occur frequently with the acronym TTF-1, we note that they also co-occur frequently with the word thyroid.</Paragraph> <Paragraph position="2"> Meanwhile, the candidate thyroid transcription factor 1 is used in a number of contexts (e.g., expression of thyroid transcription factor 1, expressed thyroid transcription factor 1, gene encoding thyroid transcription factor 1, etc.).</Paragraph> <Paragraph position="3"> Therefore, we observe this to be the strongest relationship between the acronym TTF-1 and its long-form candidates in the tree. We apply a number of validation rules (described later) to the candidate pair to make sure that it has an acronym-definition relation. In this example, the candidate pair is likely to be an acronym-definition relation because the long form thyroid transcription factor 1 contains all alphanumeric letters in the short form TTF-1.</Paragraph> <Paragraph position="4"> Footnote 8: A sequence of words that co-occurs with an acronym does not always imply the acronym-definition relation. For example, the acronym 5-HT co-occurs frequently with the term serotonin, but their relation is interpreted as a synonymous relation.</Paragraph> <Paragraph position="5"> Footnote 9: Words accompanied by function words (e.g., expression of, regulation of the, etc.) are combined into a single node. This is due to the requirement for a long-form candidate discussed later (Section 3.3).</Paragraph> <Paragraph position="6"> Figure 1 also shows another notable characteristic of long-form recognition. Assuming that the term thyroid transcription factor 1 has an acronym TTF-1, we can disregard candidates such as transcription factor 1, factor 1, and 1 since they lack the necessary elements (e.g., thyroid for all candidates; thyroid transcription for candidates factor 1 and 1; etc.) to produce the acronym TTF-1. 
Similarly, we can disregard candidates such as expression of thyroid transcription factor 1 and encoding thyroid transcription factor 1 since they contain unnecessary elements (i.e., expression of and encoding) attached to the long form. Hence, once thyroid transcription factor 1 is chosen as the most likely long form of the acronym TTF-1, we prune the unlikely candidates: nested candidates (e.g., transcription factor 1); expansions (e.g., expression of thyroid transcription factor 1); and insertions (e.g., thyroid specific transcription factor 1).</Paragraph> </Section> <Section position="2" start_page="644" end_page="645" type="sub_section"> <SectionTitle> 3.2 Extracting acronyms and their contexts </SectionTitle> <Paragraph position="0"> Before describing in detail the formalization of long-form identification, we explain the whole process of acronym recognition. We divide the acronym extraction task into three steps (Figure 2): 1. Short-form mining: identifying and extracting short forms (i.e., acronyms) in a collection of documents. 2. Long-form mining: generating a list of ranked long-form candidates for each short form by using a term extraction technique. 3. Long-form validation: extracting short/long form pairs recognized as having an acronym-definition relation and eliminating unnecessary candidates.</Paragraph> <Paragraph position="1"> Table 1: Acronyms and their contextual sentences.
Acronym | Contextual sentence
... | ...
HML | Hard metal lung diseases (HML) are rare, and complex to diagnose.
HMM | Heavy meromyosin (HMM) from conditioned hearts had a higher Ca++-ATPase activity than from controls.
HMM | Heavy meromyosin (HMM) and myosin subfragment 1 (S1) were prepared from myosin by using low concentrations of alpha-chymotrypsin.
HMM | Hidden Markov model (HMM) techniques are used to model families of biological sequences.
HMM | Hexamethylmelamine (HMM) is a cytotoxic agent demonstrated to have broad antitumor activity.
HMN | Hereditary metabolic neuropathies (HMN) are marked by inherited enzyme or other metabolic defects.
... | ...</Paragraph> <Paragraph position="6"> The first step, short-form mining, enumerates all short forms in a target text that are likely to be acronyms. Most studies use the following pattern to find candidate acronyms (Wren and Garner, 2002; Schwartz and Hearst, 2003): long form '(' short form ')'. Following the heuristic rules of Schwartz and Hearst (2003), we consider short forms to be valid only if they consist of at most two words; their length is between two and ten characters; they contain at least one alphabetic letter; and the first character is alphanumeric. All sentences containing a short form in parentheses are inserted into a database, which returns all contextual sentences for a short form to be processed in the next step. Table 1 shows an example of the database content.</Paragraph> </Section> <Section position="3" start_page="645" end_page="646" type="sub_section"> <SectionTitle> 3.3 Formalizing long-form mining as a term extraction problem </SectionTitle> <Paragraph position="0"> The second step, long-form mining, generates a list of long-form candidates and their likelihood scores for each short form. As mentioned previously, we focus on words or word sequences that co-occur frequently with a specific acronym and not with any other surrounding words. We treat the extraction of long-form candidates from the contextual sentences of an acronym in a manner similar to the term recognition task, which extracts terms from a given text. 
For that purpose, we used a modified version of the C-value method (Frantzi and Ananiadou, 1999).</Paragraph> <Paragraph position="1"> C-value is a domain-independent method for automatic term recognition (ATR) which combines linguistic and statistical information, with emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text by applying part-of-speech tagging, candidate extraction (e.g., extracting sequences of adjectives/nouns based on part-of-speech tags), and a stop-list. The statistical analysis assigns a termhood (the likelihood of being a term) to a candidate term by using the following features: the frequency of occurrence of the candidate term; the frequency of the candidate term as part of other longer candidate terms; the number of these longer candidate terms; and the length of the candidate term.</Paragraph> <Paragraph position="2"> The C-value approach is characterized by the extraction of nested terms: it gives preference to terms appearing frequently in a given text but not as a part of specific longer terms. This is a desirable feature for acronym recognition, where long-form candidates must be identified in contextual sentences. The rest of this subsection describes the method to extract long-form candidates and to assign scores to the candidates based on the C-value approach.</Paragraph> <Paragraph position="3"> Given a contextual sentence as shown in Table 1, we tokenize it by non-alphanumeric characters (e.g., space, hyphen, colon, etc.) and apply Porter's stemming algorithm (Porter, 1980) to obtain a sequence of normalized words. We use the following pattern to extract long-form candidates from the sequence: [:WORD:].*$ (1)</Paragraph> <Paragraph position="5"> Therein: [:WORD:] matches a non-function word; .* matches an empty string or any word(s) of any length; and $ matches a short form of the target acronym. 
The extraction pattern accepts a word or word sequence if it begins with any non-function word and ends with the word just before the corresponding short form in the contextual sentence. We have defined 113 function words such as a, the, of, we, and be in an external dictionary so that long-form candidates cannot begin with these words.</Paragraph> <Paragraph position="6"> Let us take the example of a contextual sentence, &quot;we studied the expression of thyroid transcription factor-1 (TTF-1)&quot;. We extract the following substrings as long-form candidates (words are stemmed): 1; factor 1; transcript factor 1; thyroid transcript factor 1; expression of thyroid transcript factor 1; and studi the expression of thyroid transcript factor 1. Substrings such as of thyroid transcript factor 1 (which begins with a function word) and thyroid transcript (which ends prematurely before the short form) are not selected as long-form candidates.</Paragraph> <Paragraph position="7"> We define the likelihood LF(w) for candidate w to be the long form of an acronym: $$ LF(w) = \mathrm{freq}(w) - \sum_{t \in T_w} \mathrm{freq}(t) \times \frac{\mathrm{freq}(t)}{\mathrm{freq}(T_w)} $$ (2)</Paragraph> <Paragraph position="9"> Therein: w is a long-form candidate; freq(x) denotes the frequency of occurrence of a candidate x in the contextual sentences (i.e., co-occurrence frequency with a short form); T_w is the set of nested candidates, i.e., long-form candidates each of which consists of a preceding word followed by the candidate w; and freq(T_w) represents the total frequency of the candidates in T_w.</Paragraph> <Paragraph position="10"> The first term is equivalent to the co-occurrence frequency of a long-form candidate with a short form. The second term discounts the co-occurrence frequency based on the frequency distribution of nested candidates. Given a long-form candidate t ∈ T_w, freq(t)/freq(T_w) represents the occurrence probability of candidate t in the nested candidate set T_w. Therefore, the second term of the formula calculates the expected frequency of occurrence of a nested candidate, accounting for the frequency of candidate w.</Paragraph> <Paragraph position="11"> Table 2 shows a list of long-form candidates for the acronym ADM extracted from 7,306,153 MEDLINE abstracts. The long-form mining step extracted 10,216 unique long-form candidates from 1,319 contextual sentences containing the acronym ADM in parentheses. Table 2 arranges long-form candidates with their scores in descending order. The long-form candidates adriamycin and adrenomedullin co-occur frequently with the acronym ADM.</Paragraph> <Paragraph position="13"> Table 2: Long-form candidates for the acronym ADM.
Candidate | Length | Freq | Score | Valid
... | ... | ... | ... | ...
effect of adriamycin | 3 | 25 | 23.6 | E
adrenodemedullated | 1 | 19 | 17.7 | o
acellular dermal matrix | 3 | 17 | 15.9 | o
peptide adrenomedullin | 2 | 17 | 15.1 | E
effects of adrenomedullin | 3 | 15 | 13.2 | E
resistance to adriamycin | 3 | 15 | 13.2 | E
amyopathic dermatomyositis | 2 | 14 | 12.8 | o
vincristine (vcr) and adriamycin | 4 | 11 | 10.0 | E
drug adriamycin | 2 | 14 | 10.0 | E
brevis and abductor digiti minimi | 5 | 11 | 9.8 | E
minimi | 1 | 83 | 5.8 | N
digiti minimi | 2 | 80 | 3.9 | N
right abductor digiti minimi | 4 | 4 | 2.5 | E
automated digital microscopy | 3 | 1 | 0.0 | m
adrenomedullin concentration | 2 | 1 | 0.0 | N
Valid = { o: valid, m: letter match, L: lacks necessary letters, E: expansion, N: nested, B: below the threshold }</Paragraph> <Paragraph position="14"> Note the huge difference in scores between the candidates abductor digiti minimi and minimi. 
Even though the candidate minimi co-occurs more frequently (83 times) than abductor digiti minimi (78 times), its co-occurrence frequency is mostly derived from the longer candidate digiti minimi. In this case, the second term of Formula 2, the occurrence-frequency expectation of expansions of minimi (e.g., digiti minimi), has a high value and therefore lowers the score of the candidate minimi. The same holds for the candidate digiti minimi, whose score is lowered by the longer candidate abductor digiti minimi. In contrast, the candidate abductor digiti minimi preserves its co-occurrence frequency since the second term of the formula is low, which means that each expansion (e.g., brevis and abductor digiti minimi, right abductor digiti minimi, ...) is expected to have a low frequency of occurrence.</Paragraph> </Section> <Section position="4" start_page="646" end_page="647" type="sub_section"> <SectionTitle> 3.4 Validation rules for long-form candidates </SectionTitle> <Paragraph position="0"> The final step of Figure 2 validates the extracted long-form candidates to generate a final set of short/long form pairs. According to the scores in Table 2, adriamycin is the most likely long form for the acronym ADM. Since the long-form candidate adriamycin contains all letters in the acronym ADM, it is considered an authentic long form (marked as 'o' in the Valid field). This is also true for the second and third candidates (adrenomedullin and abductor digiti minimi).</Paragraph> <Paragraph position="1"> The fourth candidate, doxorubicin, is interesting: the proposed method assigns a high score to it even though it lacks the letters a and m, which are necessary to form the corresponding short form. This is because doxorubicin is a synonymous term for adriamycin and is described directly with its acronym ADM. 
In this paper, we deal with the acronym-definition relation, although the proposed method would be applicable to mining other types of relations marked by parenthetical expressions. Hence, we introduce a constraint that a long form must cover all alphanumeric letters in the short form.</Paragraph> <Paragraph position="2"> Figure 3: Pseudo-code for the long-form validation algorithm.
# [Variables]
# sf: the target short-form.
# candidates: long-form candidates.
# result: the list of decisive long-forms.
# threshold: the threshold of cut-off.

# Sort long-form candidates in descending order of scores.
candidates.sort(key=lambda lf: lf.score, reverse=True)

# Initialize result list as empty.
result = []

# Pick up a long form one by one from candidates.
for lf in candidates:
    # Apply a cut-off based on termhood score.
    # Allow candidates with letter matching. ..... (a)
    if lf.score &lt; threshold and not lf.match:
        continue
    # A long-form must contain all letters. ...... (b)
    if letter_recall(sf, lf) &lt; 1:
        continue
    # Apply pruning of redundant long form. ...... (c)
    if redundant(result, lf):
        continue
    # Insert this long form to the result list.
    result.append(lf)

# Output the decisive long-forms.
print(result)</Paragraph> <Paragraph position="7"> The fifth candidate, effect of adriamycin, is an expansion of the long form adriamycin, which has a higher score than effect of adriamycin. 
As we discussed previously, the candidate effect of adriamycin is skipped since it contains unnecessary word(s) to form an acronym. Similarly, we prune the candidate minimi because it forms a part of another long form, abductor digiti minimi, which has a higher score than the candidate minimi. The likelihood score LF(w) determines the most appropriate long form among similar candidates sharing the same words or lacking some words.</Paragraph> <Paragraph position="8"> We do not include candidates with scores below a given threshold. Therefore, the proposed method cannot extract candidates appearing rarely in the text collection. Whether an acronym recognition system should extract such rare long forms depends on the application and on the trade-off between precision and recall. When integrating the proposed method with, e.g., Schwartz and Hearst's algorithm, we treat candidates recognized by the external method as if they pass the score cut-off. In Table 2, for example, the candidate automated digital microscopy is inserted into the result set, whereas the candidate adrenomedullin concentration is skipped since it is nested by the candidate adrenomedullin.</Paragraph> <Paragraph position="9"> Figure 3 shows pseudo-code for the long-form validation algorithm described above. A long-form candidate is considered valid if the following conditions are met: (a) it has a score greater than a threshold or is nominated by a letter-matching algorithm; (b) it contains all letters in the corresponding short form; and (c) it is not a nested form, an expansion, or an insertion of a previously chosen long form.</Paragraph> </Section> </Section> </Paper>
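The validation conditions (a) to (c) can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the multiset-based `letter_recall`, and the word-subsequence test in `redundant` are hypothetical stand-ins for helpers that the pseudo-code leaves undefined. The threshold value and the scores marked as placeholders are invented for the example; the remaining scores are taken from Table 2.

```python
from collections import Counter


class Candidate:
    """A long-form candidate (hypothetical container, not from the paper)."""

    def __init__(self, text, score, match=False):
        self.text = text
        self.score = score
        self.match = match  # True if nominated by an external letter-matching method


def letter_recall(short_form, long_form):
    """Fraction of the short form's alphanumeric characters found in the long form.

    Assumption: characters are compared case-insensitively as a multiset,
    ignoring their order.
    """
    needed = Counter(c for c in short_form.lower() if c.isalnum())
    available = Counter(c for c in long_form.lower() if c.isalnum())
    total = sum(needed.values())
    if total == 0:
        return 0.0
    covered = sum(min(n, available[c]) for c, n in needed.items())
    return covered / total


def is_subsequence(shorter, longer):
    """True if the words of `shorter` appear in `longer` in the same order."""
    remaining = iter(longer)
    return all(word in remaining for word in shorter)


def redundant(result, lf):
    """Assumed test covering nested candidates, expansions, and insertions:
    one word sequence is an ordered subsequence of the other."""
    words = lf.text.split()
    for chosen in result:
        chosen_words = chosen.text.split()
        if is_subsequence(words, chosen_words) or is_subsequence(chosen_words, words):
            return True
    return False


def validate(short_form, candidates, threshold):
    """Return the accepted long forms, highest score first."""
    result = []
    for lf in sorted(candidates, key=lambda c: c.score, reverse=True):
        if lf.score < threshold and not lf.match:    # condition (a)
            continue
        if letter_recall(short_form, lf.text) < 1:   # condition (b)
            continue
        if redundant(result, lf):                    # condition (c)
            continue
        result.append(lf)
    return result


candidates = [
    Candidate("adriamycin", 100.0),              # placeholder score
    Candidate("adrenomedullin", 90.0),           # placeholder score
    Candidate("abductor digiti minimi", 80.0),   # placeholder score
    Candidate("doxorubicin", 70.0),              # placeholder score
    Candidate("effect of adriamycin", 23.6),
    Candidate("acellular dermal matrix", 15.9),
    Candidate("minimi", 5.8),
    Candidate("automated digital microscopy", 0.0, match=True),
    Candidate("adrenomedullin concentration", 0.0, match=True),
]
accepted = validate("ADM", candidates, threshold=2.0)  # threshold is an assumption
print([lf.text for lf in accepted])
```

In this sketch the word-subsequence test treats nested candidates, expansions, and insertions uniformly, which is a simplification; a fuller implementation might distinguish the three cases in order to report the reason for each rejection, as the Valid column of Table 2 does.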