<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2083"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Term Recognition Approach to Acronym Recognition</Title> <Section position="4" start_page="643" end_page="643" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> The goal of acronym identification is to extract pairs of short forms (acronyms) and long forms (their expanded forms, or definitions) occurring in text. Most current methods identify short/long form candidates, e.g., hidden Markov model (HMM), through letter matching of the acronym-definition pair. Existing methods of short/long form recognition divide into pattern-matching approaches, e.g., exploring an efficient set of heuristics/rules (Adar, 2004; Ao and Takagi, 2005; Schwartz and Hearst, 2003; Wren and Garner, 2002; Yu et al., 2002), and pattern-mining approaches, e.g., Longest Common Substring (LCS) formalization (Chang and Schütze, 2006; Taghva and Gilbreth, 1999).</Paragraph> <Paragraph position="1"> Schwartz and Hearst (2003) implemented an algorithm that identifies acronyms by using parenthetical expressions as a marker of a short form. A character-matching technique determines the long form: all letters and digits in a short form must appear in the corresponding long form in the same order. Even though the core algorithm is very simple, the authors report 99% precision and 84% recall on the Medstract gold standard.</Paragraph> <Paragraph position="2"> However, the letter-matching approach is sensitive to the wording of the source text and sometimes finds incorrect long forms, such as acquired syndrome and a patient with human immunodeficiency syndrome, instead of the correct one, acquired immune deficiency syndrome, for the acronym AIDS. 
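The letter-matching idea described above can be sketched as a simple subsequence check. This is a hypothetical helper, not Schwartz and Hearst's exact implementation: every letter and digit of the short form must occur in the long form in the same order.

```python
def chars_in_order(short_form, long_form):
    """Illustrative sketch of letter matching in the Schwartz-and-Hearst
    style: every letter/digit of the short form must appear in the long
    form in the same order (case-insensitive). Name is hypothetical."""
    lf = long_form.lower()
    pos = 0
    for c in (ch.lower() for ch in short_form if ch.isalnum()):
        pos = lf.find(c, pos)
        if pos == -1:
            return False  # character not found after the previous match
        pos += 1
    return True
```

Note how such a naive check accepts the spurious long form quoted above for AIDS, since a, i, d, and s do occur in order in "a patient with human immunodeficiency syndrome"; this is exactly the failure mode the paragraph describes.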
This approach also encounters difficulties finding a long form whose short form is arranged in a different word order, e.g., beta 2 adrenergic receptor (ADRB2).</Paragraph> <Paragraph position="3"> To improve the accuracy of long/short form recognition, some methods measure the appropriateness of these candidates with a set of rules (Ao and Takagi, 2005), scoring functions (Adar, 2004), statistical analysis (Hisamitsu and Niwa, 2001; Liu and Friedman, 2003), or machine-learning approaches (Chang and Schütze, 2006; Pakhomov, 2002; Nadeau and Turney, 2005).</Paragraph> <Paragraph position="4"> Chang and Schütze (2006) present an algorithm for matching short/long forms with a statistical learning method. They discover a list of abbreviation candidates based on parentheses and enumerate possible short/long form candidates with a dynamic programming algorithm. The likelihood of each recognized candidate is estimated by a logistic regression over nine features, such as the percentage of long-form letters aligned at the beginning of a word. Their method achieved 80% precision and 83% recall on the Medstract corpus.</Paragraph> <Paragraph position="5"> Hisamitsu and Niwa (2001) propose a method for extracting useful parenthetical expressions from Japanese newspaper articles. Their method measures the co-occurrence strength between the inner and outer phrases of a parenthetical expression with statistical measures such as mutual information, the χ2 test with Yates' correction, the Dice coefficient, and the log-likelihood ratio. 
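Two of the co-occurrence measures just listed can be illustrated with a small sketch. The frequency counts and function names are hypothetical; the paragraph does not give Hisamitsu and Niwa's exact formulations.

```python
import math

def dice_coefficient(f_inner, f_outer, f_pair):
    # Dice coefficient: how often the inner and outer phrases of a
    # parenthetical expression co-occur, relative to their own counts.
    return 2.0 * f_pair / (f_inner + f_outer)

def pointwise_mi(f_inner, f_outer, f_pair, n):
    # Pointwise mutual information over n observed parenthetical pairs:
    # positive when the phrases co-occur more often than chance predicts.
    return math.log2((f_pair * n) / (f_inner * f_outer))
```

A pair whose phrases always appear together scores a Dice coefficient of 1.0, while independent phrases drive the PMI toward zero; high-scoring pairs are the "useful" parenthetical expressions the method keeps.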
Their method handles generic parenthetical expressions (e.g., abbreviations, non-abbreviation paraphrases, supplementary comments) rather than focusing exclusively on acronym recognition.</Paragraph> <Paragraph position="6"> Liu and Friedman (2003) proposed a method based on mining collocations that occur before parenthetical expressions. Their method creates a list of potential long forms from collocations appearing more than once in a text collection and eliminates unlikely candidates with three rules, e.g., &quot;remove a set of candidates Tw formed by adding a prefix word to a candidate w if the number of such candidates Tw is greater than 3&quot;. Their approach cannot recognize expanded forms that occur only once in the corpus. They reported 96.3% precision and 88.5% recall for abbreviation recognition on their test corpus.</Paragraph> </Section> </Paper>