<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1203">
  <Title>Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions</Title>
  <Section position="5" start_page="15" end_page="16" type="metho">
    <SectionTitle>
3 Corpus annotation and interaction subgraphs
</SectionTitle>
    <Paragraph position="0"> To compile a corpus of sentences describing protein-protein interactions, we first selected pairs of proteins that are known to interact from the Database of Interacting Proteins. We entered these pairs as search terms into the PubMed retrieval system. We then split the publication abstracts returned by the searches into sentences and included titles. These were again searched for the protein pairs. This gave us a set of 1927 sentences that contain the names of at least two proteins that are known to interact. A domain expert annotated these sentences for protein names and for words stating their interactions. Of these sentences, 1114 described at least one protein-protein interaction. Thereafter, we performed a dependency analysis and produced annotation of dependencies. To minimize the number of mistakes, each sentence was independently annotated by two annotators and differences were then resolved by discussion. The assigned dependency structure was produced according to the LG linkage conventions. Link types were not included in the annotation, and no cycles were introduced in the dependency graphs. All ambiguities where the LG parser is capable of at least enumerating all alternatives (such as prepositional phrase attachment) were enforced in the annotation.</Paragraph>
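The retrieval-and-filtering step above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: `sentences_mentioning_pair` is a hypothetical helper that keeps only sentences mentioning both proteins of a known interacting pair, using whole-word, case-insensitive matching (real protein-name matching would additionally need to handle synonyms and naming variation).

```python
import re

def sentences_mentioning_pair(sentences, protein_a, protein_b):
    """Keep sentences that mention both proteins of an interacting pair.

    Names are matched as whole words, case-insensitively; this is a
    simplification of real protein-name matching.
    """
    def mentions(sentence, name):
        return re.search(r"\b" + re.escape(name) + r"\b",
                         sentence, re.IGNORECASE)
    return [s for s in sentences
            if mentions(s, protein_a) and mentions(s, protein_b)]
```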
    <Paragraph position="1"> A random sample consisting of 300 sentences, including 28 publication titles, has so far been fully annotated, giving 7098 word-to-word dependencies. This set of sentences is the corpus we refer to in the following sections.</Paragraph>
    <Paragraph position="2"> An information extraction system targeted at protein-protein interactions and their types needs to identify three constituents that express an interaction in a sentence: the proteins involved and the word or phrase that states their interaction and suggests the type of this interaction. To extract this information from an LG linkage, the links connecting these items must be recovered correctly by the parser. The following definition formalizes this notion.</Paragraph>
    <Paragraph position="3"> Definition 1 (Interaction subgraph) The interaction subgraph for an interaction between two proteins A and B in a linkage L is the minimal connected subgraph of L that contains A, B, and the word or phrase that states their interaction.</Paragraph>
    <Paragraph position="4"> The recovery of a connected component containing the protein names and the interaction word is not sufficient: by the definition of a complete linkage, such a component is always present. Consequently, the exact set of links that forms the interaction subgraph must be recovered. For each interaction stated in a sentence, the corpus annotation specifies the proteins involved and the interaction word. The interaction subgraph for each interaction can thus be extracted automatically from the corpus. Because the corpus does not contain cyclic dependencies, the interaction subgraphs are unique. In total, 366 interaction subgraphs were identified from the corpus, one for each described interaction.</Paragraph>
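Because the annotated linkages are acyclic, the interaction subgraph of Definition 1 is unique and can be computed as the union of the paths from each protein to the interaction word. A minimal sketch under that assumption (function names are ours; the linkage is represented as an undirected adjacency mapping over words):

```python
from collections import deque

def tree_path(adj, src, dst):
    """Unique path between two nodes in an acyclic linkage (BFS)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nbr in adj[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def interaction_subgraph(adj, protein_a, protein_b, interaction_word):
    """Edge set of the minimal connected subgraph containing the three items.

    In a tree, the union of the paths from each protein to the interaction
    word already covers the path between the two proteins, so these two
    paths suffice.
    """
    edges = set()
    for terminal in (protein_a, protein_b):
        path = tree_path(adj, terminal, interaction_word)
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges
```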
    <Paragraph position="5"> The interaction subgraphs can be partially overlapping, because a single link can be part of more than one interaction subgraph. Figure 1 shows an example of an annotated text fragment.</Paragraph>
  </Section>
  <Section position="6" start_page="16" end_page="16" type="metho">
    <SectionTitle>
4 Evaluation criteria
</SectionTitle>
    <Paragraph position="0"> We evaluated the performance of the LG parser according to the following three quantitative criteria: the number of dependencies recovered, the number of fully correct linkages, and the number of interaction subgraphs recovered. The number of recovered dependencies gives an estimate of the probability that a dependency will be correctly identified by the LG parser (this criterion is also employed by, e.g., Collins et al. (1999)). The number of fully correct linkages, i.e. linkages where all annotated dependencies are recovered, measures the fraction of sentences that are parsed without error. However, a fully correct linkage is not necessary to extract protein-protein interactions from a sentence; to estimate how many interactions can potentially be recovered, we measure the number of interaction subgraphs for which all dependencies were recovered.</Paragraph>
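The three criteria could be computed per sentence along these lines. The representation of dependencies as sets of unordered word-index pairs, and the function name, are our assumptions for illustration:

```python
def evaluate_linkage(gold_edges, predicted_edges, interaction_subgraphs):
    """Score one linkage against the annotation on the three criteria.

    gold_edges / predicted_edges: sets of frozenset word-index pairs;
    interaction_subgraphs: one edge set per annotated interaction.
    """
    recovered = gold_edges & predicted_edges
    return {
        # dependencies present in both annotation and parser output
        "dependencies_recovered": len(recovered),
        # a linkage is fully correct if every annotated dependency appears
        "fully_correct": gold_edges <= predicted_edges,
        # a subgraph counts as recovered only if all of its edges appear
        "subgraphs_recovered": sum(
            1 for sub in interaction_subgraphs if sub <= predicted_edges),
    }
```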
    <Paragraph position="1"> For each criterion, we measure the performance for the first linkage returned by the parser. However, the first linkage as ordered by the heuristics of the LG parser was often not the best (according to the criteria above) of the linkages returned by the parser. To separate the effect of the heuristics from overall LG performance, we identify separately for each of the three criteria the best linkage among the linkages returned by the parser, and we also report performance for the best linkages.</Paragraph>
    <Paragraph position="2"> We further divide the parsed sentences into three categories: (1) sentences for which the time tmax for producing a normal parse was exhausted and the parser entered panic mode, (2) sentences where linkages were sampled because more than kmax linkages were produced, and (3) stable sentences for which neither of these occurred. A full analysis of all linkages that the grammar allows is only possible for stable sentences. For sentences in the other two categories, random effects may affect the results: sentences for which more than kmax linkages are produced are subject to randomness in sampling, and sentences where the parser enters panic mode were always subject to subsequent sampling in our experiments.</Paragraph>
  </Section>
  <Section position="7" start_page="16" end_page="17" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> To evaluate the ability of the LG parser to produce correct linkages, we increased the number of stable sentences by setting the tmax parameter to 10 minutes and the kmax parameter to 10000 instead of using the defaults tmax = 30 seconds and kmax = 1000. When parsing the corpus using these parameters, 28 sentences fell into the panic category, 61 into the sampled category, and 211 were stable. The measured parser performance for the corpus is presented in Table 1.</Paragraph>
    <Paragraph position="1"> While the fraction of sentences that have a fully correct linkage as the first linkage is quite low (approximately 7%), for 28% of sentences the parser is capable of producing a fully correct linkage. Performance was especially poor for the publication titles in the corpus. Because titles are typically fragments not containing a verb, and LG is designed to model full clauses, the parser failed to produce a fully correct linkage for any of the titles.</Paragraph>
    <Paragraph position="2"> The performance for recovered interaction subgraphs is more encouraging, as 25% of the subgraphs were recovered in the first linkage and more than half in the best linkage. Yet many interaction subgraphs remain unrecovered by the parser: the results suggest an upper limit of approximately 60% to the fraction of protein-protein interactions that can be recovered from any linkage produced by the unmodified LG.</Paragraph>
    <Paragraph position="3"> In the following sections we further analyze the reasons why the parser fails to recover all dependencies.</Paragraph>
    <Section position="1" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
5.1 Panics
</SectionTitle>
      <Paragraph position="0"> No fully correct linkages and very few interaction subgraphs were found in the panic mode.</Paragraph>
      <Paragraph position="1"> This effect may be partly due to the complexity of the sentences for which the parser entered panic mode. (Table 1: the sentence categories are explained in Section 4; the total rows give the totals for each category, and the overall column gives combined results for all categories.) The effect of panics can be better estimated by forcing the parser to bypass standard parsing and to directly apply panic options. For the 272 sentences where the parser did not enter the panic mode, 77% of dependencies were recovered in the first linkage. When these sentences were parsed in forced panic mode, 67% of dependencies were recovered, suggesting that on average parses in panic mode recover approximately 10 percentage points fewer dependencies than in standard parsing mode. Similarly, the number of fully correct first linkages decreased from 22 to 6 and the number of interaction subgraphs recovered in the first linkage from 91 to 65. These numbers indicate that panics are a significant cause of error.</Paragraph>
      <Paragraph position="2"> Experiments indicate that on a 1GHz machine approximately 40% of sentences can be fully parsed in under a second and 80% in under 10 seconds, while the most ambiguous sentences take far longer to fully parse. With tmax set to 10 minutes, the total parsing time was 165 minutes.</Paragraph>
      <Paragraph position="5"> Long parsing times are caused by ambiguous sentences for which the parser creates thousands or even millions of alternative linkages. In addition to simply increasing the time limit, the fraction of sentences where the parser enters the panic mode could therefore be reduced by reducing the ambiguity of the sentences, for example, by extending the dictionary of the parser (see Section 7).</Paragraph>
    </Section>
    <Section position="2" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
5.2 Heuristics
</SectionTitle>
      <Paragraph position="0"> When several linkages are produced for a sentence, the LG parser applies heuristics to order the linkages so that those that are more likely to be correct are presented first. The heuristics are based on examination of, and intuitions about, general English, and may not be optimal for biomedical text. Note in Table 1 that both for recovered full linkages and interaction subgraphs, the number of items that were recovered in the best linkage is more than twice the number recovered in the first linkage, suggesting that a better ordering heuristic could dramatically improve the performance of the parser.</Paragraph>
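The "best linkage" figures reported here correspond to an oracle that picks, per criterion, the best of the parser's alternative linkages. For the dependency criterion, that selection can be sketched as follows; `best_linkage` is a hypothetical evaluation helper, not part of the LG parser:

```python
def best_linkage(linkages, gold_edges):
    """Among alternative linkages (each a set of frozenset word pairs),
    return the one that recovers the most annotated dependencies.

    This is the oracle 'best linkage', as opposed to the first linkage
    ordered by the parser's own heuristics.
    """
    return max(linkages, key=lambda edges: len(edges & gold_edges))
```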
      <Paragraph position="1"> Such improvements could perhaps be achieved by tuning the heuristics to the domain or by adopting a probabilistic ordering model.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="17" end_page="19" type="metho">
    <SectionTitle>
6 Failure analysis
</SectionTitle>
    <Paragraph position="0"> A significant fraction of dependencies were not recovered in any linkage, even in sentences where resources were not exhausted. In order to identify reasons for the parser failing to recover the correct dependencies, we analyze sentences for which it is certain that the grammar cannot produce a fully correct linkage. We thus analyzed the 132 stable sentences for which some dependencies were not recovered.</Paragraph>
    <Paragraph position="1"> For each sentence, we attempt to identify the reason for the failure of the parser. For each identified reason, we manually edit the sentence to remove the source of failure. We repeat this procedure until the parser is capable of producing a correct parse for the sentence. Note that this implies that the interaction subgraphs in the sentence are also correctly recovered, and therefore the reasons for failures to recover interaction subgraphs are a subset of the identified issues. The results of the analysis are summarized in Table 2. In many of the sentences, more than one reason for parser failure was found; in total, 209 issues were identified in the 132 sentences. The results are described in more detail in the following sections.</Paragraph>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
6.1 Fragments and ungrammatical
sentences
</SectionTitle>
      <Paragraph position="0"> As some of the analyzed sentences were taken from publication titles, not all of them were full clauses. To identify further problems when parsing fragments not containing a verb, the phrase "is explained" and required determiners were added to these fragments, a technique also used by Ding et al. (2003). The completed fragments were then analyzed for potential further problems.</Paragraph>
      <Paragraph position="1"> A number of other ungrammatical sentences were also encountered. The most common problem was the omission of determiners, but some other issues such as missing possessive markers and errors in agreement (e.g., "expressions...has") were also encountered. Ungrammatical sentences pose interesting challenges for parsing. Because many authors are not native English speakers, a greater tolerance for grammatical mistakes should allow the parser to identify the intended parse for more sentences. Similarly, the ability to parse publication titles would extend the applicability of the parser; in some cases it may be possible to extract information concerning the key findings of a publication from the title. However, while relaxing completeness and correctness requirements, such as mandatory determiners and subject-predicate agreement, would allow the parser to create a complete linkage for more sentences, it would also be expected to lead to increased ambiguity for all sentences, and subsequent difficulties in identifying the correct linkage. If the ability to parse titles is considered important, a potential solution not incurring this cost would be to develop a separate version of the grammar for parsing titles.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
6.2 Unknown grammatical structures
</SectionTitle>
      <Paragraph position="0"> The method of the LG implementation for parsing coordinations was found to be a frequent cause of failures. A specific coordination problem occurs with multiple noun modifiers: the parser assumes that coordinated constituents can be connected to the rest of the sentence through exactly one word, and the grammar attaches all noun modifiers to the head. Biomedical texts frequently contain phrases that cause these requirements to conflict: for example, in the phrase "capping protein and actin genes" (where "capping protein genes" and "actin genes" is the intended parse), the parser allows only one of the words "capping" and "protein" to connect to the word "genes", and is thus unable to produce the correct linkage (for illustration, see Figure 2(a)).</Paragraph>
      <Paragraph position="1"> This multiple modifier coordination issue could be addressed by modifying the grammar to chain modifiers (Figure 2(b)). This alternative model is adopted by another major dependency grammar, the EngCG-based Connexor Machinese. The problem could also be addressed by altering the coordination handling system in the parser.</Paragraph>
      <Paragraph position="2"> Other identified grammatical structures not known to the parser were number postmodifiers to nouns (e.g., "serine 38"), specifiers in parentheses (e.g., "profilin mutant (H119E)"), coordination with the phrase "but not", and various unknown uses of colons and quotes. Single instances of several distinct unknown grammatical structures were also noted (e.g., "5 to 10", "as expected from", "most concentrated in"). Most of these issues can be addressed by local modifications to the grammar.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
6.3 Unknown word handling
</SectionTitle>
      <Paragraph position="0"> The LG parser assigns unknown words to categories based on morphological or other surface clues when possible. For the remaining unknown words, parses are attempted by assigning the words to the generic noun, verb and adjective types in all possible combinations.</Paragraph>
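The fallback just described, trying every combination of generic categories for dictionary misses, can be sketched as follows. The names are ours, and this is a simplification: the real parser interleaves category choice with linkage construction rather than enumerating whole assignments up front.

```python
from itertools import product

# Generic fallback categories tried for words missing from the dictionary.
GENERIC_CATEGORIES = ("noun", "verb", "adjective")

def unknown_word_assignments(words, dictionary):
    """Yield every candidate category assignment for a sentence.

    Words found in `dictionary` keep their listed category; unknown words
    are tried as generic noun, verb and adjective in all combinations.
    """
    options = [(dictionary[w],) if w in dictionary else GENERIC_CATEGORIES
               for w in words]
    for combination in product(*options):
        yield dict(zip(words, combination))
```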
      <Paragraph position="1"> Some problems with the unknown word processing method were encountered during analysis; for example, the assumption that unknown capitalized words are proper nouns often caused failures, especially in sentences beginning with an unknown word. Similarly, the assumption that words containing a hyphen behave as adjectives was violated by a number of unknown verbs (e.g., "cross-links").</Paragraph>
      <Paragraph position="2"> Another problem that was noted occurred with lowercase unknown words that should be treated as proper nouns: because LG does not allow unknown lowercase words to act as proper nouns, the parser assigns incorrect structure to a number of phrases containing words such as "actin". Improving unknown word handling requires some modifications to the LG parser.</Paragraph>
    </Section>
    <Section position="4" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
6.4 Dictionary issues
</SectionTitle>
      <Paragraph position="0"> Cases where the LG dictionary contains a word, but not in the sense in which it appears in a sentence, almost always lead to errors. For example, the LG dictionary does not contain the word "assembly" in the sense "construction", causing the parser to erroneously require a determiner for "protein assembly". A related frequent problem occurred with proper names headed by a common noun, where the parser expects a determiner for such names (e.g., "myosin heavy chain"), and fails when one is not present.</Paragraph>
      <Paragraph position="1"> These issues are mostly straightforward to address in the grammar, but difficult to identify automatically.</Paragraph>
    </Section>
    <Section position="5" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
6.5 Biomedical entity names
</SectionTitle>
      <Paragraph position="0"> Many of the causes for parser failure discussed above are related to the presence of biomedical entity names. While the causes for failures relating to names can be addressed in the grammar, the existence of biomedical named entity (NE) recognition systems (for a recent survey, see, e.g., Bunescu et al. (2004)) suggests an alternative solution: NEs could be identified in preprocessing, and treated as single (proper noun) tokens during the parse. During failure analysis, 59 cases (28% of all cases) were noted where this procedure would have eliminated the error, assuming that no errors are made in NE recognition. (Footnote: 30 distinct problematic word definitions were identified, including "breakdown", "composed", "factor", "half", "independent", "localized", "parallel", "promoter", "segment", "upstream" and "via".) However, the performance of current NE recognition systems is not perfect, and it is not clear what the effect of adopting such a method would be on parser performance.</Paragraph>
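The preprocessing alternative, collapsing each recognized entity name into a single proper-noun token before parsing, might look like the following sketch. `mask_entities` and its `(start, end)` span format are our assumptions, not an implementation described in the paper.

```python
def mask_entities(tokens, entity_spans):
    """Collapse recognized named-entity spans into single capitalized tokens.

    `entity_spans` holds (start, end) token indices, assumed sorted and
    non-overlapping; joining with "_" and capitalizing makes the result
    look like a proper noun to the parser.
    """
    out, i = [], 0
    for start, end in entity_spans:
        out.extend(tokens[i:start])
        out.append("_".join(tokens[start:end]).capitalize())
        i = end
    out.extend(tokens[i:])
    return out
```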
    </Section>
  </Section>
  <Section position="9" start_page="19" end_page="19" type="metho">
    <SectionTitle>
7 Dictionary extension
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
Szolovits (2003) describes an automatic method
</SectionTitle>
      <Paragraph position="0"> for mapping lexical information from one lexicon to another, and applies this method to augment the LG dictionary with terms from the extensive UMLS Specialist lexicon. The extension introduces more than 125,000 new words into the LG dictionary, more than tripling its size.</Paragraph>
      <Paragraph position="1"> We evaluated the effect of this dictionary extension on LG parser performance using the criteria described above. The fraction of distinct tokens in the corpus found in the parser dictionary increased from 52% to 72% with the dictionary extension, representing a significant reduction in uncertainty. This reduction was coupled with a 32% decrease in total parsing time.</Paragraph>
      <Paragraph position="2"> Because the LG parser is unable to produce any linkage for sentences where it cannot identify a verb (even incorrectly), extending the dictionary significantly reduced the ability of LG to extract dependencies in titles, where the fraction of recovered dependencies fell from the already low value of 67% to 55%.</Paragraph>
      <Paragraph position="3"> For the sentences excluding titles, the benefits of the dictionary extension were most significant for sentences that were in the panic category when using the unextended LG dictionary; 12 of these 28 sentences could be parsed without panic with the dictionary extension. In the first linkage of these sentences, the fraction of recovered dependencies increased by 8%, and the fraction of recovered interaction subgraphs increased from zero to 15% with the dictionary extension.</Paragraph>
      <Paragraph position="4"> The overall effect of the dictionary extension was positive but modest, with no more than a 2.5% improvement for either the first or best linkages on any criterion, despite the threefold increase in dictionary size. This result agrees with the failure analysis: most problems cannot be removed by extending the dictionary and must instead be addressed by modifications of the grammar or parser.</Paragraph>
    </Section>
  </Section>
</Paper>