<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0804"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. How to Find Better Index Terms Through Citations</Title> <Section position="4" start_page="0" end_page="26" type="metho"> <SectionTitle> 2 Index Terms Through Link Structure </SectionTitle> <Paragraph position="0"> We aim to improve automatic indexing of scientific papers by finding additional index terms outside of the documents themselves. In particular, we believe that good index terms can be found by following the link structure between documents.</Paragraph> <Section position="1" start_page="0" end_page="25" type="sub_section"> <SectionTitle> 2.1 Hyperlinks </SectionTitle> <Paragraph position="0"> There is a wealth of literature on exploiting link structure between web documents for IR, including the 'sharing' of index terms between hyperlinked pages. Bharat & Mihaila (2001), for instance, propagate title and header terms to the pointed-to page, while Marchiori (1997) recursively augments the textual content of a page with all the text of the pages it points to.</Paragraph> <Paragraph position="1"> Research has particularly concentrated on anchor text as a good place to find index terms, i.e., the text enclosed in the &lt;a&gt; tags of the HTML document. It is a well-documented problem that webpages are often poorly self-descriptive (e.g., Brin & Page 1998, Kleinberg 1999). For instance, www.google.com does not contain the phrase search engine. Anchor text, on the other hand, is often a higher-level description of the pointed-to page. Davison (2000) discusses how well anchor text does this and provides experimental results to support the claim. Thus, beginning with McBryan (1994), there is a trend of propagating anchor text along its hyperlink to associate it with the linked page, as well as with the page in which it is found. 
Google, for example, includes anchor text as index terms for the linked page (Brin & Page 1998).</Paragraph> <Paragraph position="2"> Extending beyond anchor text, Chakrabarti et al. (1998) look for topic terms in a window of text around hyperlinks and weight that link accordingly, in the framework of a link structure algorithm, HITS (Kleinberg 1999).</Paragraph> </Section> <Section position="2" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.2 Citations </SectionTitle> <Paragraph position="0"> The anchor text phenomenon is also observed with citations: they are introduced purposefully alongside some descriptive reference to the cited document. Thus, this text should contain good index terms for the cited document. In the following sections, we motivate the use of reference terms as index terms for cited documents, firstly, with some citation examples and, secondly, by discussing previous work.</Paragraph> <Paragraph position="1"> Examples: Reference Terms as Index Terms. Figure 1 shows some citations that exemplify why reference terms should be good index terms for the cited document. (1) is an example of a citation with intuitively good index terms (those underlined) for the cited paper around it; a searcher looking for papers about a learning system, particularly one that uses theory refinement and/or one that learns non-recursive NP and VP structures, might be interested in the paper, as might those searching for information about ALLiS.</Paragraph> <Paragraph position="2"> The fact that an author has chosen those particular terms in referring to the paper means that they reflect what that author feels is important about the paper. It is reasonable, then, that other researchers interested in the same things would find the cited paper useful and could plausibly use such terms as query terms. It is true that the cited paper may well contain these terms, and they may even be important, prominent terms, but this is not necessarily the case. 
There are numerous situations in which the terms in the document are not the best indicators of what is important in it. Firstly, what is important in a paper in terms of what it is known and cited for is not always the same as what is important in it in terms of subject matter or focus. Secondly, what are considered to be the important contributions of a paper may change over time. Thirdly, the terminology used to describe the important contributions may be different from that used in the paper or may change over time.</Paragraph> <Paragraph position="3"> (2) exemplifies this special case, where a paper is referred to using terms that are not in the paper itself: the cited paper is the standard reference for the HITS algorithm yet the name HITS was only attributed to the algorithm after the paper was written and it doesn't contain the term at all[1].</Paragraph> <Paragraph position="4"> The last two examples show how citing authors can provide higher level descriptions of the cited paper, e.g., good overview and comparison.</Paragraph> <Paragraph position="5"> These meta-descriptors are less likely to appear in the papers themselves as prominent terms yet, again, could plausibly be used as query terms for a searcher.</Paragraph> </Section> <Section position="3" start_page="25" end_page="26" type="sub_section"> <SectionTitle> Reference Directed Indexing </SectionTitle> <Paragraph position="0"> These examples (and many more) suggest that text used in reference to papers can provide useful index terms, just as anchor text does for webpages. Bradshaw & Hammond (2002) even go so far as to argue that reference is more valuable as a source of index terms than the document's own content. 
Bradshaw's theory is that, when citing, authors describe a document in terms similar to a searcher's query for the information it contains.</Paragraph> <Paragraph position="1"> However, there is no anchor text, per se, in papers, i.e., there are no HTML tags to delimit the text associated with a citation, unlike in webpages.</Paragraph> <Paragraph position="2"> This raises the question of what the anchor text equivalent is for formal citations. Bradshaw (2003) extracts NPs from a fixed window of around one hundred words around the citation and uses these as the basis of his Reference-Directed Indexing (RDI).</Paragraph> <Paragraph position="3"> [Figure 1: citation examples.] (1) ALLiS (Architecture for Learning Linguistic Structures) is a learning system which uses theory refinement in order to learn non-recursive NP and VP structures (Dejean, 2000). (2) Such estimation is simplified from HITS algorithm (Kleinberg, 1998). (3) As two examples, (Rabiner, 1989) and (Charniak et al., 1993) give good overviews of the techniques and equations used for Markov models and part-of-speech tagging, but they are not very explicit in the details that are needed for their application. (4) For a comparison to other taggers, the reader is referred to (Zavrel and Daelemans, 1999). Bradshaw evaluates RDI by, first, indexing documents provided by Citeseer (Lawrence, Bollacker & Giles 1999). A set of 32 queries was created from 24 documents in the collection with an author-written keywords section. 
A document's relevance was determined by judging whether it addressed the same topic as the one identified by the query keywords in the query source paper.</Paragraph> <Paragraph position="4"> Thus, the performance of RDI was compared to that of a standard vector-space model implementation (TF*IDF term weighting and cosine similarity retrieval), with RDI achieving better precision at top 10 documents (0.484 compared to 0.318, statistically significant at 99.5% confidence).</Paragraph> </Section> <Section position="4" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Citing Statements </SectionTitle> <Paragraph position="0"> In a considerably earlier study, closer to our own project, O'Connor (1982) motivated the use of words from citing statements as additional terms to augment an existing document representation. Though O'Connor did not have machine-readable documents, procedures for 'automatic' recognition of citing statements were developed and manually carried out on a collection of chemistry journal articles.</Paragraph> <Paragraph position="1"> Proceeding from the sentence in which a citation is found, a set of hand-crafted, mostly sentence-based rules was applied to select the parts of the citing paper that conveyed information about the cited paper. For instance, the citing sentence, S, was always selected. If S contained a connector (a keyword, e.g., this, similarly, former) in its first twelve words, its predecessor, S−1, was also selected, and so on. The majority of rules selected sentences from the text; others selected titles and words from tables, figures and captions.</Paragraph> <Paragraph position="2"> The selected statements (minus stop words) were added to an existing representation for the cited documents, comprising human index terms and title and abstract terms, and a small-scale retrieval experiment was performed. 
A 20% increase in recall was found using the citing statements in addition to the existing index terms, though in a follow-up study on biomedical papers, the increase was only 4%[2] (O'Connor 1983).</Paragraph> <Paragraph position="3"> O'Connor concludes that citing statements can aid retrieval but notes the inherent difficulty in identifying them. Some of the selection rules were only semi-automatic (e.g., required human identification of an article as a review) and most relied on knowledge of sentence boundaries, which is a non-trivial problem in itself. In all sentence-based cases, sentences were either selected in their entirety or not at all, and O'Connor notes this as a source of falsely assigned terms.</Paragraph> </Section> </Section> <Section position="5" start_page="26" end_page="28" type="metho"> <SectionTitle> 3 Complex Citation Contexts </SectionTitle> <Paragraph position="0"> There is evidence, therefore, that good index terms for scholarly documents can be found in the documents that cite them. Identifying which terms around a citation really refer to it, however, is non-trivial. In this section, we discuss some examples of citations where this is the case and propose potential ways in which computational linguistics techniques may be useful in more accurately locating those reference terms. We take as our theoretical baseline all terms in a fixed window around a citation.</Paragraph> <Section position="1" start_page="26" end_page="28" type="sub_section"> <SectionTitle> 3.1 Examples: Finding Reference Terms </SectionTitle> <Paragraph position="0"> The first two examples in Figure 2 illustrate how the amount of text that refers to a citation can vary.</Paragraph> <Paragraph position="1"> Sometimes, only two or three terms will refer to a citation, as is often the case in enumerations such as (5). On the other hand, (6) shows a citation where much of the following section refers to the cited work. 
When a paper is heavily based on previous work, for example, extensive text may be devoted to describing that work in detail. Thus, this context could contribute dozens of legitimate index terms. A fixed-size window around a citation would not capture all the terms referring to it and only those.</Paragraph> <Paragraph position="2"> (5) Similar advances have been made in machine translation (Frederking and Nirenburg, 1994), speech recognition (Fiscus, 1997) and named entity recognition (Borthwick et al., 1998). (6) Brown et al. (1993) proposed a series of statistical models of the translation process. IBM translation models try to model the translation probability ... which describes the relationship between a source language sentence ... and a target language sentence ... . In statistical alignment models ... a 'hidden' alignment ... is introduced, which describes a mapping from a target position ... to a source position ... . The relationship between the translation model and the alignment model is given by: ...</Paragraph> <Paragraph position="3"> (7) The results of disambiguation strategies reported for pseudo-words and the like are consistently above 95% overall accuracy, far higher than those reported for disambiguating three or more senses of polysemous words (Wilks et al. 1993; Leacock, Towell, and Voorhees 1993).</Paragraph> <Paragraph position="4"> (8) This paper concentrates on the use of zero, pronominal, and nominal anaphora in Chinese generated text. We are not concerned with lexical anaphora (Tutin and Kittredge 1992) where the anaphor and its antecedent share meaning components, while the anaphor belongs to an open lexical class.</Paragraph> <Paragraph position="5"> (9) Previous work on the generation of referring expressions focused on producing minimal distinguishing descriptions (Dale and Haddock 1991; Dale 1992; Reiter and Dale 1992) or descriptions customized for different levels of hearers (Reiter 1990). 
Since we are not concerned with the generation of descriptions for different levels of users, we look only at the former group of work, which aims at generating descriptions for a subsequent reference to distinguish it from the set of entities with which it might be confused.</Paragraph> <Paragraph position="6"> (10) Ferro et al. (1999) and Buchholz et al. (1999) both describe learning systems to find GRs. The former (TR) uses transformation-based error-driven learning (Brill and Resnik, 1994) and the latter (MB) uses memory-based learning (Daelemans et al., 1999).</Paragraph> <Paragraph position="7"> [2] O'Connor attributes this to a lower average number of citing papers in the biomedical domain.</Paragraph> <Paragraph position="8"> In list examples such as (5), where multiple citations are in close proximity, almost any window size would result in overlapping windows and in terms being attributed to the wrong citation(s), as well as the right one. In such examples, the presence of other citations indicates a change in reference term 'ownership'. The same is often true of sentence boundaries, as they often signal a change in topic. Citations frequently occur at the start of sentences, as in (6), where a different approach is introduced. Similarly, a citation at the end of a sentence, as in (7), often indicates the completion of the current topic. In both cases, the sentence boundary (cf. topic change) is also the boundary of the reference text. The same arguments increasingly apply to paragraph and section boundaries.</Paragraph> <Paragraph position="9"> (8) is another example where the reference text does not extend beyond the citation sentence, though the citation is not at a sentence boundary.</Paragraph> <Paragraph position="10"> Instead, the topic contrast is indicated by a linguistic cue, i.e., the negation in We are not. This illustrates another phenomenon of citations: in contrasting their work with others', researchers often explicitly state what their paper is not about. 
Intuitively, not only are these terms better descriptors of the cited rather than citing paper, they might even raise the question of whether one should go as far as excluding selected terms during indexing of the citing paper. We are not advocating this here, though, and note that, in practice, such terms would not have much impact on the document: we would expect them to have low term frequencies in comparison to the important terms in that document and in comparison to their frequencies in other documents where they are important.</Paragraph> <Paragraph position="11"> (9) is another example of this negation effect (We are not concerned with...). Along with (10), it also shows how complex the mapping between reference terms and citations can be. Firstly, reference terms may belong to more than one citation. For instance, in (10), describe learning systems to find GRs refers to both Ferro et al. (1999) and Buchholz et al. (1999). Here, the presence of a second citation does not end the domain of the first's reference text, indicated by the use of both and the conjunction between the citations. Similarly, transformation-based error-driven learning also refers to two citations but, in this case, they are on opposite sides of the reference text, i.e., Ferro et al. (1999) and (Brill and Resnik, 1994).</Paragraph> <Paragraph position="12"> Moreover, there is an intervening citation that it does not refer to, i.e., Buchholz et al. (1999). The same is true of memory-based learning.</Paragraph> </Section> </Section> <Section position="6" start_page="28" end_page="30" type="metho"> <SectionTitle> 4 Case Study </SectionTitle> <Paragraph position="0"> In this section, we study the effect of adding citation index terms to one document: The Mathematics of Statistical Machine Translation: Parameter Estimation from the Computational Linguistics journal[3]. 
Our experimental setting is a corpus of ~9000 papers in the ACL Anthology[4], a digital archive of computational linguistics research papers. We found 24 citations to the paper in 10 other Anthology papers (that we knew to have citations to this paper through an unrelated study). As a simulation of ideal processing, we then manually extracted, from the terms around those citations, those that specifically referred to the paper, henceforth ideal reference terms. Next, we extracted all terms from a fixed window of ~50 terms on either side (equivalent to Bradshaw (2003)'s window size), henceforth fixed reference terms. Finally, we calculated various term statistics, including IDF values across the corpus. All terms were decapitalized. We now attempt to draw a 'term profile' of the document, both before and after those reference terms are added to the document, and discuss the implications for IR.</Paragraph> <Section position="1" start_page="28" end_page="30" type="sub_section"> <SectionTitle> 4.1 Index Term Analysis </SectionTitle> <Paragraph position="0"> Table 1 gives the top twenty ideal reference terms ranked by their TF*IDF values in the original document. Note that we observe the effects on the relative rankings of the ideal reference terms only, since it is these hand-picked terms that we consider to be important descriptors for the document and whose statistics will be most affected by the inclusion of reference terms.</Paragraph> <Paragraph position="2"> To give an indication of their importance relative to other terms in the document, however, the second column in Table 1 gives the absolute rankings of these terms in the original document. These numbers confirm that our ideal reference terms are, in fact, relatively important in the document; indeed, the top five terms in the document are all ideal reference terms. Further down the ranking, the ideal reference terms become more 'diluted' with terms not picked from our 24 citations. 
An inspection revealed that many of these terms were French words from example translations, since the paper deals with machine translation between English and French. Thus, they were bad index terms for our purposes.</Paragraph> <Paragraph position="3"> Hence, we observed the effect of adding, first, the ideal reference terms then, separately, the fixed reference terms to the document, summarized in Tables 2 to 5. Tables 2 and 3 show the terms with the largest differences in positions as a result of adding the ideal and fixed reference terms respectively. For instance, ibm's TF*IDF value more than doubled. The term ibm appears only six times in the document (and not even from the main text but from authors' institutions and one bibliography item) yet one of its major contributions is the machine translation models it introduced, now standardly referred to as 'the IBM models'. Consequently, 'IBM' was contained in many citation contexts in citing papers, leading to an ideal reference term frequency of 11 for ibm. As a result, ibm is boosted eight places to rank 20. This exemplifies how reference terms can better describe a document, in terms of what searchers might plausibly look for (cf. Example 2).</Paragraph> <Paragraph position="4"> There were twenty terms that do not occur in the document itself but are nevertheless used by citing authors to describe it, shown in Tables 4 and 5. Many of these have high IDF values, indicating their distinctiveness in the corpus, e.g., decoders (6.41), corruption (6.02) and noisy-channel (5.75). This, combined with the fact that citing authors use these terms in describing the paper, means that these terms are intuitively high quality descriptors of the paper. Without the reference index terms, however, the paper would score zero for these terms as query terms.</Paragraph> <Paragraph position="5"> Many more fixed reference terms were found per citation than ideal ones. This can introduce noise. 
In general, the TF*IDF values of ideal reference terms can only be further boosted by including more terms, and a comparison of Tables 2 with 3 (or 4 with 5) shows that this is sometimes the case, e.g., ibm occurred a further eleven times in the fixed reference terms, doubling its increase in TF*IDF. However, instances of those terms that only occurred in the fixed reference terms did not, in fact, refer to the citation of the paper, by definition of the ideal reference terms. For instance, one such extra occurrence of ibm is from a sentence following the citation that describes the exact model used in the current work: (11) According to the IBM models (Brown et al., 1993), the statistical word alignment model can be generally represented as in Equation (1) ... In this paper, we use a simplified IBM model 4 (Al-Onaizan et al., 1999), which ...</Paragraph> <Paragraph position="6"> Here, the second occurrence refers to (Al-Onaizan et al., 1999) but, by its proximity to the citation to our example paper (Brown et al., 1993), is picked up by the fixed window. Since the term was arguably not directly intended to describe our paper, then, a different term might equally have been used; one that was inappropriate as an index term. Table 6 lists the fixed reference terms that were not also in the ideal reference terms; almost 400 in total. The vast majority of these occur very infrequently, which suggests that they should not greatly affect the term profile of the document.</Paragraph> <Paragraph position="7"> However, the argument for adding good, high IDF reference terms that are not in the document itself conversely applies to adding bad ones: an 'incorrect' reference term added to the document will have its TF*IDF pushed off the zero mark, giving it the potential to score against inappropriate query terms. If such a term is distinctive (i.e., has a high IDF), the effect may be significant. 
The term giza, for example, has an IDF of 6.34 and is the name of a particular tool that is not mentioned in our example paper. However, since the tool is used to train IBM models, the two papers in the example above are often cited by the same papers and in close proximity. This increases the chances of such terms being picked up as reference terms for the wrong citation by a fixed window, heightening the adverse effect on its term profile.</Paragraph> </Section> </Section> </Paper>