<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2016">
  <Title>The GOD model</Title>
  <Section position="3" start_page="147" end_page="148" type="metho">
    <SectionTitle>
2 The GOD algorithm
</SectionTitle>
    <Paragraph position="0"> The basic assumption of the GOD model is that paradigmatic relations can be established only among terms in the same Semantic Domain, while concepts belonging to different fields are mainly unrelated (Gliozzo, 2005). Such relations can be identified by considering Subject-Verb-Object (SVO) patterns involving domain specific terms (i.e. syntagmatic relations).</Paragraph>
    <Paragraph position="1"> When a query Q = (q1,q2,...,qn) is formulated, GOD operates as follows: Domain Discovery Retrieve the ranked list dom(Q) = (t1,t2,...,tk) of domain specific terms such that sim(ti,Q) &gt; thprime, where sim(Q,t) is a similarity function capturing domain proximity and thprime is the domain specificity threshold.</Paragraph>
    <Paragraph position="2"> Relation Extraction For each SVO pattern involving two different terms ti [?] dom(Q) and tj [?] dom(Q) such that the term ti occurs in the subject position and the term tj occurs in the object position return the relation tivtj if score(ti,v,tj) &gt; thprimeprime, where score(ti,v,tj) measures the syntagmatic association among ti, v and tj.</Paragraph>
    <Paragraph position="3"> In Subsection 2.1 we describe into details the Domain Discovery step. Subsection 2.2 is about the relation extraction step.</Paragraph>
    <Section position="1" start_page="147" end_page="148" type="sub_section">
      <SectionTitle>
2.1 Domain Discovery
</SectionTitle>
      <Paragraph position="0"> Semantic Domains (Magnini et al., 2002) are clusters of very closely related concepts, lexicalized by domain specific terms. Word senses are determined and delimited only by the meanings of other words in the same domain. Words belonging to a limited number of domains are called domain words. Domain words can be disambiguated by simply identifying the domain of the text.</Paragraph>
      <Paragraph position="1"> As a consequence, concepts belonging to different domains are basically unrelated. This observation is crucial from a methodological point of view, allowing us to perform a large scale structural analysis of the whole lexicon of a language, otherwise computationally infeasible. In fact, restricting the attention to a particular domain is a way to reduce the complexity of the overall relation extraction task, that is evidently quadratic in the number of terms.</Paragraph>
      <Paragraph position="2"> Domain information can be expressed by exploiting Domain Models (DMs) (Gliozzo et al., 2005). A DM is represented by a k x kprime rectangular matrix D, containing the domain relevance for each term with respect to each domain, where k is the cardinality of the vocabulary, and kprime is the size of the Domain Set.</Paragraph>
      <Paragraph position="3"> DMs can be acquired from texts in a totally unsupervised way by exploiting a lexical coherence assumption (Gliozzo, 2005). To this aim, term clustering algorithms can be adopted: each cluster represents a Semantic Domain. The degree of association among terms and clusters, estimated by the learning algorithm, provides a domain relevance function. For our experiments we adopted a clustering strategy based on Latent Semantic Analysis, following the methodology described in (Gliozzo, 2005). This operation is done off-line, and can be efficiently performed on large corpora. To filter out noise, we considered only those terms having a frequency higher than 5 in the corpus.</Paragraph>
      <Paragraph position="4"> Once a DM has been defined by the matrix D, the Domain Space is a kprime dimensional space, in which both texts and terms are associated to Domain Vectors (DVs), i.e. vectors representing their domain relevances with respect to each domain.</Paragraph>
      <Paragraph position="5"> The DV vectortprimei for the term ti [?] V is the ith row ofD, where V = {t1,t2,...,tk} is the vocabulary of the corpus. The similarity among DVs in the Domain Space is estimated by means of the cosine operation.</Paragraph>
      <Paragraph position="6"> When a query Q = (q1,q2,...,qn) is formulated, its DV vectorQprime is estimated by</Paragraph>
      <Paragraph position="8"> and then compared to the DVs of each term ti [?] V by adopting the cosine similarity metric</Paragraph>
      <Paragraph position="10"> where vectortprimei and vectorqprimej are the DVs for the terms ti and qj, respectively.</Paragraph>
      <Paragraph position="11"> All those terms whose similarity with the query is above the domain specificity threshold thprime are  thenreturnedasanoutputofthefunctiondom(Q).</Paragraph>
      <Paragraph position="12"> Empirically, we fixed this threshold to 0.5. In general, the higher the domain specificity threshold, thehighertherelevanceofthediscoveredrelations for the query (see Section 3), increasing accuracy while reducing recall. In the previous example, dom(god) returns the terms lord, prayer, creator and mercy, among the others.</Paragraph>
    </Section>
    <Section position="2" start_page="148" end_page="148" type="sub_section">
      <SectionTitle>
2.2 Relation extraction
</SectionTitle>
      <Paragraph position="0"> As a second step, the system analyzes all the syntagmatic relations involving the retrieved entities.</Paragraph>
      <Paragraph position="1"> To this aim, as an off-line learning step, the system acquires Subject-Verb-Object (SVO) patterns from the training corpus by using regular expressions on the output of a shallow parser.</Paragraph>
      <Paragraph position="2"> In particular, GOD extracts the relations tivtj for each ordered couple of domain specific terms (ti,tj) such that ti [?] dom(Q), tj [?] dom(Q) and score(ti,v,tj) &gt; thprimeprime. The confidence score is estimated by adopting the heuristic confidence measure described in (Reinberger et al., 2004), reported below:</Paragraph>
      <Paragraph position="4"> where F(t) is the frequency of the term t in the corpus, F(t,v) is the frequency of the SV pattern involving both t and v, F(v,t) is the frequency of the VO pattern involving both v and t, and F(ti,v,tj) is the frequency of the SVO pattern involving ti, v and tj. In general, augmenting thprimeprime is a way to filter out noisy relations, while decreasing recall.</Paragraph>
      <Paragraph position="5"> It is important to remark here that all the extractedpredicatesoccuratleastonceinthecorpus, null then they have been asserted somewhere. Even if it is not a sufficient condition to guarantee their truth, it is reasonable to assume that most of the sentences in texts express true assertions.</Paragraph>
      <Paragraph position="6"> The relation extraction process is performed on-line for each query, then efficiency is a crucial requirement in this phase. It would be preferable to avoid an extensive search of the required SVO patterns, because the number of sentences in the corpus is huge. To solve this problem we adopted an inverted relation index, consisting of three hash tables: the SV(VO) table report, for each term, the frequency of the SV(VO) patterns where it occurs as a subject(object); the SVO table reports, for each ordered couple of terms in the corpus, the frequency of the SVO patterns in which they co-occur. All the information required to estimate Formula 3 can then be accessed in a time proportional to the frequencies of the involved terms. In general, domain specific terms are not very frequent in a generic corpus, allowing a fast computation in most of the cases.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>