<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1051"> <Title>Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature</Title> <Section position="3" start_page="401" end_page="402" type="intro"> <SectionTitle> 2 Problem Formalization </SectionTitle> <Paragraph position="0"> The problem considered here is that of identifying interactions between genes and proteins from biomedical literature. More specifically, we performed experiments on two slightly different benchmark data sets (see Section 4 for a detailed description). In the former (AImed) gene/protein interactions are annotated without distinguishing the type and roles of the two interacting entities.</Paragraph> <Paragraph position="1"> The latter (LLL challenge) is more realistic (and complex) because it also aims at identifying the roles played by the interacting entities (agent and target). For example, in Figure 1 three entities are mentioned and two of the six ordered pairs of entities actually interact, one of them being (sigma(K), cwlH).</Paragraph> <Paragraph position="2"> In our approach we cast relation extraction as a classification problem, in which examples are generated from sentences as follows. First of all, we describe the complex case, namely the protein/gene interactions (LLL challenge). For this data set entity recognition is performed using a dictionary of protein and gene names in which the type of the entities is unknown.</Paragraph> <Paragraph position="3"> We generate examples for all the sentences containing at least two entities. Thus the number of examples generated for each sentence is given by the number of combinations of distinct entities (N) selected two at a time, i.e. NC2. For example, as the sentence shown in Figure 1 contains three entities, the total number of examples generated is 3C2 = 3. 
In each example we assign the attribute CANDIDATE to each of the two candidate interacting entities, while the other entities in the example are assigned the attribute OTHER, meaning that they do not participate in the relation. If a relation holds between the two candidate interacting entities the example is labeled 1 or 2 (according to the roles of the interacting entities, agent and target, i.e. to the direction of the relation); 0 otherwise. Figure 2 shows the examples generated from the sentence in Figure 1.</Paragraph> <Paragraph position="4"> Note that in generating the examples from the sentence in Figure 1 we did not create three negative examples (there are six potential ordered relations between three entities), thereby implicitly under-sampling the data set. This allows us to make the classification task simpler without losing information. As a matter of fact, generating examples for each ordered pair of entities would produce two subsets of the same size containing similar examples (differing only in the attributes CANDIDATE and OTHER), but with different classification labels. Furthermore, under-sampling allows us to halve the data set size and reduce the data skewness.</Paragraph> <Paragraph position="5"> For the protein-protein interaction task (AImed) we use the correct entities provided by the manual annotation. As said at the beginning of this section, this task is simpler than the LLL challenge because there is no distinction between types (all entities are proteins) and roles (the relation is symmetric). As a consequence, the examples are generated as described above with the following difference: an example is labeled 1 if a relation holds between the two candidate interacting entities; 0 otherwise.</Paragraph> </Section></Paper>
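The example-generation scheme described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: entity names (E1, E2, E3) and the representation of relations as a set of ordered (agent, target) pairs are assumptions made for the sketch. Each unordered pair of distinct entities yields one example (NC2 per sentence); the candidate pair is marked CANDIDATE, all remaining entities OTHER, and the label encodes whether and in which direction the relation holds (1, 2, or 0).

```python
from itertools import combinations

def generate_examples(entities, relations):
    """One classification example per unordered pair of distinct entities.

    entities  -- entity mentions in the sentence (illustrative names)
    relations -- set of ordered (agent, target) pairs that interact
    Returns a list of (attributes, label) pairs, where attributes maps
    each entity to CANDIDATE or OTHER and label is 1, 2, or 0.
    """
    examples = []
    for e1, e2 in combinations(entities, 2):
        # The candidate pair gets CANDIDATE; all other entities get OTHER.
        attrs = {e: "OTHER" for e in entities}
        attrs[e1] = attrs[e2] = "CANDIDATE"
        if (e1, e2) in relations:       # e1 is agent, e2 is target
            label = 1
        elif (e2, e1) in relations:     # relation holds in the reverse direction
            label = 2
        else:
            label = 0                   # no relation between the candidates
        examples.append((attrs, label))
    return examples

# A sentence with three entities yields 3C2 = 3 examples, as in the text;
# entity names and the single known interaction are hypothetical.
entities = ["E1", "E2", "E3"]
relations = {("E1", "E2")}
examples = generate_examples(entities, relations)
print(len(examples))                        # 3
print([label for _, label in examples])     # [1, 0, 0]
```

For the symmetric AImed task, the direction check collapses to a single membership test, so labels are restricted to 1 and 0, matching the difference noted in the last paragraph.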