<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0719">
  <Title>I I I I I I I I I I</Title>
  <Section position="3" start_page="0" end_page="137" type="intro">
    <SectionTitle>
2 FreeNet
</SectionTitle>
    <Paragraph position="0"> FreeNet, an acronym for finite relation expression network, is a system for describing and exploring finite binary relations. Here we mean relation in the mathematical sense, i.e. a set of ordered pairs. We concern ourselves with finite sets of pairs of tokens drawn from a finite set of tokens, or vocabulary.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Tokens and relations
</SectionTitle>
      <Paragraph position="0"> A token in FreeNet is simply a normalized string of characters drawn from a finite vocabulary. The vocabulary might be a dictionary of English words, a set of movie titles, or a set of names of researchers. The system is assumed to implement normalization as a function from input strings to strings.</Paragraph>
      <Paragraph position="1"> A relation in FreeNet is a finite set of ordered pairs of tokens, or links. Each relation has a name that, like a token, is simply a normalized string of characters drawn from a finite vocabulary (which we shall do better to call an alphabet, for reasons made clear below). Use of the FreeNet system consists of three distinct processing phases: the relation computation stage, in which a set of relations is derived from some knowledge or data source and transduced to an explicit set of labeled ordered pairs; the graph construction stage, in which this set of labeled pairs is transduced to an efficient multigraph representation; and the query stage, in which a user can interact with the system to find paths in the multigraph that match a certain specification.</Paragraph>
      <Paragraph position="2"> FreeNet consists of software to do the second and third phases. Implementation of a specific instance of FreeNet requires the user to write software to do the first phase, but support software exists for an optional filtering substage that constrains the input pair set in certain ways, for instance by eliminating pairs that contain stopwords, enforcing limits on the fanout of tokens, and enforcing strength thresholds. The second phase, graph building, simply entails providing a set of triples (two tokens and a relation) to the system. The order in which the triples appear in the input does not matter, as it is the program's responsibility to reorder the links as necessary and to store the graph efficiently.</Paragraph>
      <Paragraph position="3"> The third phase, querying, is the chief novel contribution, and is described below.</Paragraph>
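      <Paragraph> As an illustration of the optional filtering substage mentioned above, the following minimal sketch applies the three constraints to a list of candidate links. It is not the FreeNet support software: the function name filter_links, the per-link strength field, the choice to count fanout per source token, and the strongest-first tie-breaking are all assumptions made for the example.

```python
def filter_links(links, stopwords, max_fanout, min_strength):
    """Filter candidate links before graph construction (hypothetical helper).

    links: iterable of (source, relation, target, strength) tuples.
    Returns (source, relation, target) triples that survive all three filters.
    """
    kept = []
    fanout = {}  # number of links kept so far for each source token
    # Visit the strongest links first so the fanout limit keeps the best ones.
    for src, rel, dst, strength in sorted(links, key=lambda link: -link[3]):
        if src in stopwords or dst in stopwords:
            continue  # drop pairs that contain a stopword
        if min_strength > strength:
            continue  # drop links below the strength threshold
        if fanout.get(src, 0) >= max_fanout:
            continue  # enforce the per-source fanout limit
        fanout[src] = fanout.get(src, 0) + 1
        kept.append((src, rel, dst))
    return kept


candidates = [
    ("cat", "syn", "feline", 0.9),
    ("cat", "syn", "the", 0.8),
    ("cat", "rhymes", "hat", 0.5),
    ("cat", "rhymes", "bat", 0.4),
    ("dog", "syn", "hound", 0.1),
]
triples = filter_links(candidates, stopwords={"the"}, max_fanout=2, min_strength=0.3)
```

The surviving triples are in the form that the graph-building phase expects, and their order is irrelevant to that phase.</Paragraph>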
    </Section>
    <Section position="2" start_page="0" end_page="136" type="sub_section">
      <SectionTitle>
2.2 Regular expressions
</SectionTitle>
      <Paragraph position="0"> The power behind FreeNet lies in the user's ability to compose primitive relations into more complex relations for use in queries.</Paragraph>
      <Paragraph position="1"> The primary mechanism for building complex relations is the regular expression over the alphabet of relation names. Just as a regular expression over ASCII characters specifies a regular set of strings recursively in terms of other sets, so too can a regular expression over relation names specify a set of ordered pairs recursively in terms of other sets and various operators.</Paragraph>
      <Paragraph position="2"> The following grammar specifies allowable regular expressions in FreeNet.</Paragraph>
      <Paragraph position="4"> These regexp-building operators are described below. Concatenation The concatenation operator is used to compose two relations directly. The expression r1 r2 denotes the set of pairs (a, b) such that for some token c, (a, c) ∈ r1 and (c, b) ∈ r2. For example, a network implementing a genealogy database might offer primitive parent and brother relations. In that case, the relation denoted by the regular expression (parent brother) is what we know of as the uncle relation.</Paragraph>
      <Paragraph position="5">  Conjunction Conjunction takes the intersection of two relations: plainly, the intersection of their respective pair sets. The expression r1 * r2 denotes the set of pairs (a, b) such that (a, b) ∈ r1 and (a, b) ∈ r2. Supposing that in a lexical semantic net we have the relations required_by and requires, then a symmetric symbiotic_with relation might be implemented as their conjunction.</Paragraph>
      <Paragraph position="6"> Union The union operator is used to join two relations. The expression r1 | r2 denotes the set of pairs (a, b) such that (a, b) ∈ r1 or (a, b) ∈ r2. In an Erdős-number-like application, for example, two authors may be "related" if they have coauthored a paper or if one has cited the other.</Paragraph>
      <Paragraph position="7"> Transitive closure We commonly reason about the transitive closure of relations. The transitive closure operator implements homogeneous reachability: is there a path between the tokens using links only of a certain type? Namely, let r^1 denote the relation r and r^i for i &gt; 1 denote the relation (r r^(i-1)). Then r* denotes the union of all r^i as i ranges from 1 to infinity. (Note that since we assume finite relations, this set is always finite.) In the genealogy example, parent* would be what we consider the "ancestor" relation.</Paragraph>
      <Paragraph position="8"> Inverse, Complement, and Sibling A few more unary operators are minor conveniences in building relations. The inverse operator swaps every pair: r- denotes the set of pairs (a, b) such that (b, a) ∈ r. Taking the union of a relation with its inverse produces a new relation that is guaranteed to be symmetric.</Paragraph>
      <Paragraph position="9"> The complement operator produces a set containing all pairs but those in a certain relation: r' denotes the set of pairs (a, b) such that (a, b) ∉ r. (The vocabulary is assumed to be fixed after the graph is built, and so the universe is well-defined.) The sibling operator produces pairs that have in common their relation with a certain other token. rX denotes the set of pairs (a, b) such that a ≠ b and there exists a c such that (a, c) ∈ r and (b, c) ∈ r. Thus the (parent-)~ relation is the genealogical sibling relation formed by applying the inverse operator and then the sibling operator to the "parent" relation. Note A simple structural induction can be used to prove that any relation built from these operators is also a relation. Additional operators to support set addition and subtraction of constant pair sets are also available.</Paragraph>
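      <Paragraph> The set-theoretic definitions of the operators above can be sketched directly over Python sets of pairs. This is an illustration of the definitions only, not of how FreeNet evaluates them; the function names are hypothetical, the closure is taken as the union of r^i for i of at least 1 (so no identity pairs are added), and the example assumes that a pair (a, c) in parent is read as "c is a parent of a".

```python
def compose(r1, r2):
    """Concatenation: (a, b) such that (a, c) is in r1 and (c, b) is in r2."""
    return {(a, b) for (a, c) in r1 for (c2, b) in r2 if c == c2}


def conjunction(r1, r2):
    """Pairs present in both relations."""
    return set(r1).intersection(r2)


def union(r1, r2):
    """Pairs present in either relation."""
    return set(r1).union(r2)


def inverse(r):
    """Swap every pair."""
    return {(b, a) for (a, b) in r}


def complement(r, vocabulary):
    """All pairs over the fixed vocabulary that are not in r."""
    return {(a, b) for a in vocabulary for b in vocabulary if (a, b) not in r}


def sibling(r):
    """Distinct tokens a and b that are both related by r to some common c."""
    return {(a, b) for (a, c) in r for (b, c2) in r if c == c2 and a != b}


def transitive_closure(r):
    """Union of r, (r r), (r r r), ... computed up to a fixed point."""
    closure = set(r)
    while True:
        larger = closure.union(compose(closure, r))
        if larger == closure:
            return closure
        closure = larger


# Assumed reading: (a, c) in parent means "c is a parent of a".
parent = {("ann", "bob"), ("bob", "carl")}
brother = {("bob", "dan")}  # dan is a brother of bob
uncle = compose(parent, brother)
ancestor = transitive_closure(parent)
```

Because every relation is a finite set over a finite vocabulary, the loop in transitive_closure must terminate: each pass either adds a pair or stops.</Paragraph>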
    </Section>
    <Section position="3" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
2.3 Queries
</SectionTitle>
      <Paragraph position="0"> Queries in FreeNet are path specifications expressed as a sequence of tokens or token variables with interleaved relation regexps. More precisely, every query is of the form (W &lt;regexp&gt;)* W, where W is either a constant token or a variable wi, and &lt;regexp&gt; is a regular expression over relations, as defined above.</Paragraph>
      <Paragraph position="3"> FreeNet returns a shortest path (or all paths) in the multigraph that match the query, binding the variables in the query to concrete tokens. The output includes the names of all of the primitive relation links traversed.</Paragraph>
      <Paragraph position="4"> Queries in the Internet version of FreeNet are limited to four forms, each parameterized by one or two tokens, but these demonstrate what are expected to be common queries. Below, the "ANY" regexp is the union of all available (or selected) primitive relations. The comma (",") represents the universal relation, linking all pairs of tokens; the comma relation can thus be used in FreeNet queries to implement conjunction of clauses.</Paragraph>
      <Paragraph position="5">  * Shortest path: This query takes two arguments s and t, and outputs the result of the query "s ANY* t". This finds a shortest path, using any of the selected relations, between the source and the target.</Paragraph>
      <Paragraph position="6"> * Fanout: This query takes a single argument s and outputs the result of "s ANY w1". This simply shows all words related in some way to the source.</Paragraph>
      <Paragraph position="7"> * Intersection search: This query takes two arguments s and t and outputs the result of "s ANY w1 , t ANY w1". This is useful for finding what two tokens "have in common" in terms of primitive relationships with other tokens. The two relations involved in such a path need not be identical.</Paragraph>
      <Paragraph position="8"> * Coercion: This query takes two arguments s and t, two relations re1 and re2, and outputs the result of "s re1 w1 re2 w2 re1 t". This is useful for a wide variety of constraint-solving, such as, in the lexical semantic net case, pun and rhyme generation.</Paragraph>
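      <Paragraph> Three of the query forms above can be sketched with plain breadth-first search over a labeled multigraph. The sketch below is an in-memory illustration under our own assumptions, not the FreeNet implementation: the function names, the dictionary-based graph, and the relation names in the sample data are hypothetical, and the coercion form is omitted.

```python
from collections import deque


def build_graph(triples):
    """Map each source token to its list of (relation, target) links."""
    graph = {}
    for source, relation, target in triples:
        graph.setdefault(source, []).append((relation, target))
    return graph


def shortest_path(graph, s, t, allowed=None):
    """Return a shortest list of (source, relation, target) links, or None."""
    backtrace = {s: None}  # one backtrace entry per visited node
    queue = deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            break
        for relation, nxt in graph.get(node, []):
            if allowed is not None and relation not in allowed:
                continue  # relation not selected by the user
            if nxt not in backtrace:
                backtrace[nxt] = (node, relation)
                queue.append(nxt)
    if t not in backtrace:
        return None
    path = []
    node = t
    while backtrace[node] is not None:
        previous, relation = backtrace[node]
        path.append((previous, relation, node))
        node = previous
    path.reverse()
    return path


def fanout(graph, s):
    """All (relation, token) links leaving s, as in the query 's ANY w1'."""
    return sorted(graph.get(s, []))


def intersection_search(graph, s, t):
    """Tokens w1 linked from both s and t, with the two relations used."""
    from_s = {}
    for relation, w in graph.get(s, []):
        from_s.setdefault(w, []).append(relation)
    return sorted(
        (w, rel_s, rel_t)
        for rel_t, w in graph.get(t, [])
        for rel_s in from_s.get(w, [])
    )


graph = build_graph([
    ("cat", "rhymes", "hat"),
    ("hat", "synonym", "cap"),
    ("cat", "synonym", "feline"),
    ("dog", "antonym", "feline"),
])
path = shortest_path(graph, "cat", "cap")
common = intersection_search(graph, "cat", "dog")
```

The returned path lists every primitive relation link traversed, mirroring the output described above, and the optional allowed set corresponds to restricting ANY to the selected relations.</Paragraph>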
    </Section>
    <Section position="4" start_page="136" end_page="137" type="sub_section">
      <SectionTitle>
2.4 Implementation issues
</SectionTitle>
      <Paragraph position="0"> A FreeNet multigraph is stored sparsely for efficient offline (disk) access as a list of variable-length adjacency lists. Each element in an adjacency list is a single 32-bit word that describes an arc by combining its destination token ID and relation ID; the source token ID for an arc is implicit in its row.</Paragraph>
      <Paragraph position="1"> An index of offsets into the list is precomputed and stored together with hash tables for the token and relation namespaces. At no point in query processing is more than a single line of the list (equivalently, a set of links emanating from the same source node) in memory at once.</Paragraph>
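      <Paragraph> The storage layout described above can be sketched as follows. The text does not say how the 32 bits are divided, so the split used here (24 bits of destination token ID, 8 bits of relation ID) is purely an assumption, as are the function names and the use of a byte string in place of a disk file.

```python
import struct

REL_BITS = 8                       # assumption: low 8 bits hold the relation ID
REL_SPAN = 2 ** REL_BITS
DEST_SPAN = 2 ** (32 - REL_BITS)   # assumption: high 24 bits hold the destination ID


def pack_arc(dest_id, rel_id):
    """Combine a destination token ID and a relation ID into one 32-bit word."""
    if dest_id not in range(DEST_SPAN) or rel_id not in range(REL_SPAN):
        raise ValueError("ID out of range for the assumed 24/8 split")
    return dest_id * REL_SPAN + rel_id


def unpack_arc(word):
    """Recover (dest_id, rel_id) from a packed word."""
    return divmod(word, REL_SPAN)


def build_rows(num_tokens, triples):
    """Lay out one variable-length row per source token.

    triples: iterable of (source_id, rel_id, dest_id).
    Returns (data, offsets), where offsets[i] is the word index of row i.
    """
    rows = [[] for _ in range(num_tokens)]
    for source_id, rel_id, dest_id in triples:
        rows[source_id].append(pack_arc(dest_id, rel_id))
    offsets = [0]
    chunks = []
    for row in rows:
        chunks.append(struct.pack(f"={len(row)}I", *row))
        offsets.append(offsets[-1] + len(row))
    return b"".join(chunks), offsets


def read_row(data, offsets, source_id):
    """Decode only the row for source_id; no other row is touched."""
    start, end = offsets[source_id], offsets[source_id + 1]
    words = struct.unpack_from(f"={end - start}I", data, 4 * start)
    return [unpack_arc(word) for word in words]


data, offsets = build_rows(3, [(0, 1, 2), (0, 0, 1), (2, 1, 0)])
row0 = read_row(data, offsets, 0)
```

The source token never appears in the stored words; it is recovered from the row index, which is what lets each arc fit in a single word.</Paragraph>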
      <Paragraph position="2"> Graph construction A number of optimizations in the layout of the multigraph on disk are essential if arbitrary searches over large multigraphs are to be efficient. Of particular concern is disk seek time, because traversing the graph entails accessing different rows of the adjacency list representation in rapid succession. One simple preprocessing step is to sort each row of the representation by the word identifier's row location, so that all of the nodes linked from a fixed source can be accessed with a unidirectional sweep.</Paragraph>
      <Paragraph position="3"> A trickier concern is the ordering of the rows themselves. We desire to order the rows so that related words tend to appear near each other so that seek time between them is minimized. We can formalize this problem by asking for an ordering that minimizes the average offset difference between the endpoints of a randomly chosen edge in the multigraph. This problem is at least as computationally hard as the well-studied, NP-complete bandwidth problem in graph theory (Papadimitriou, 1976), which is to find a linear ordering of the vertices of a given graph such that the maximum difference in the ordering between any two adjacent vertices is minimal. We are studying approximation algorithms (Blum et al., to appear) that allow this preprocessing step to be carried out efficiently during database construction.</Paragraph>
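      <Paragraph> To make the two objectives above concrete, the following sketch evaluates a candidate row ordering under both of them. It only scores a given ordering and does not search for a good one; the function names are hypothetical, and the offset of a row is simplified to its index in the ordering rather than its byte position on disk.

```python
def average_offset_difference(order, edges):
    """Mean distance in the ordering between the endpoints of each edge."""
    if not edges:
        return 0.0
    position = {token: index for index, token in enumerate(order)}
    total = sum(abs(position[a] - position[b]) for a, b in edges)
    return total / len(edges)


def bandwidth(order, edges):
    """Largest distance in the ordering between the endpoints of any edge."""
    position = {token: index for index, token in enumerate(order)}
    return max(abs(position[a] - position[b]) for a, b in edges)


edges = [("a", "b"), ("b", "c")]
good = ["a", "b", "c"]   # neighbours in the graph sit next to each other
poor = ["a", "c", "b"]   # the edge (a, b) now spans two positions
```

On this small path graph the first ordering scores 1.0 on the average measure and 1 on bandwidth, while the second scores 1.5 and 2.</Paragraph>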
      <Paragraph position="4"> Querying Supporting arbitrary FreeNet queries that allow the full range of regular expression operators is a non-trivial data structures problem, because it is prohibitively expensive to add new links with the occurrence of a new regexp. Instead, the graph is static. Each relation in the "alphabet" of relations is converted to an ASCII character, and stock regexp processing software is used to convert each regexp in a query to a state machine. A query is converted to a single state machine by concatenating its constituent regexp state machines, interleaving "constraint points" that enforce the identity of multiple bindings of the same variable. A dynamic set of state IDs and backtrace IDs is associated with each token to support breadth-first search.</Paragraph>
      <Paragraph position="5"> The query templates above are implemented without all this machinery, by simply performing breadth-first search on the graph, maintaining a single backtrace ID for each node, and allowing or prohibiting certain relations as specified by the user. Coercion is implemented as a hard-coded path constraint.</Paragraph>
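      <Paragraph> The encoding step described under Querying, in which each relation name becomes one ASCII character so that stock regexp software can be applied, can be illustrated with Python's standard re module. This sketch is deliberately simplified and is not the FreeNet algorithm: it enumerates paths up to a hypothetical max_links bound and tests each label string against the pattern, instead of driving the search with the regexp's state machine. It covers only the operators that stock regexps provide (juxtaposition, |, and repetition, where Python's * means zero or more and + means one or more); all function names are assumptions.

```python
import re
from collections import deque


def encode_relations(names):
    """Assign one lowercase ASCII letter to each relation name."""
    ordered = sorted(set(names))
    if len(ordered) > 26:
        raise ValueError("this sketch supports at most 26 relations")
    return {name: chr(ord("a") + i) for i, name in enumerate(ordered)}


def compile_relation_regexp(expr, codes):
    """Rewrite relation names in expr as their letters and compile the result."""
    translated = re.sub(r"[A-Za-z_]+", lambda m: codes[m.group(0)], expr)
    return re.compile(translated.replace(" ", ""))


def find_matching_path(triples, expr, s, t, max_links=4):
    """Return the first path from s to t whose relation labels match expr."""
    codes = encode_relations(relation for _, relation, _ in triples)
    pattern = compile_relation_regexp(expr, codes)
    graph = {}
    for source, relation, target in triples:
        graph.setdefault(source, []).append((relation, target))
    queue = deque([(s, [])])  # breadth-first, so shorter paths come first
    while queue:
        node, links = queue.popleft()
        labels = "".join(codes[relation] for _, relation, _ in links)
        if node == t and pattern.fullmatch(labels):
            return links
        if max_links > len(links):
            for relation, nxt in graph.get(node, []):
                queue.append((nxt, links + [(node, relation, nxt)]))
    return None


triples = [
    ("ann", "parent", "bob"),
    ("bob", "parent", "carl"),
    ("carl", "brother", "dan"),
    ("ann", "brother", "dan"),
]
result = find_matching_path(triples, "parent+ brother", "ann", "dan")
```

Here the direct brother link from ann to dan is reached first but rejected, because its label string does not match the pattern; the three-link path through bob and carl is returned instead.</Paragraph>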
    </Section>
  </Section>
</Paper>