<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1065">
  <Title>A Generic Approach to Parallel Chart Parsing with an Application to LinGO</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Analysis of Parsings
</SectionTitle>
    <Paragraph position="0"> To analyze the possibilities for parallelism in computations they are often represented as task graphs. A task graph is a directed acyclic graph, where the nodes represent some unit of computation, called a task, and the arcs represent the execution dependencies between the tasks. Task graphs can be used to analyze the critical path, which is the minimal time required to complete a computation, given an in nite amount of processors. From Brent (1974) and Graham (1969) we know that there exist P-processor schedulings where the execution time TP is bound as follows: null</Paragraph>
    <Paragraph position="2"> where T1 is the total work, or the execution time for the one processor case, and T1 is the critical path. Furthermore, to e ectively use P processors, the average parallelism P = T1=T1 should be larger than P.</Paragraph>
    <Paragraph position="3"> The rst step of the analysis is to nd an appropriate graph representation for parsing computations. According to Caroll (1994), performing a complexity analysis solely at the level of grammars and parsing schemata can give a distorted image of the parsing process in practice. For this reason, we based our analysis on actual parsings. The experiments were based on the fuse test suite, which is a balanced extract from four appointment scheduling (spoken) dialogue corpora (incl.</Paragraph>
    <Paragraph position="4"> VerbMobil). Fuse contains over 2000 sentences with an average length of 11.6.</Paragraph>
    <Paragraph position="5"> We de ne a task graph for a single parsing computation as follows. First, we distinguish two types of tasks: uni cation tasks and match tasks. A uni cation task executes a single uni cation operation. A match task is responsible for all the actions that are taken when a uni cation succeeds: matching the resulting edge with other edges in the chart and putting resulting uni cation tasks on the agenda. The match task is also responsible for applying ltering techniques like the quick check (Malouf et al., 2000). The tasks are connected by directed arcs that indicate the execution dependencies.</Paragraph>
    <Paragraph position="6"> We de ne the cost of each uni cation task as the number of nodes visited during the uni cation and successive copying operation.</Paragraph>
    <Paragraph position="7"> Uni cation operations are typically responsible for over 90% of the total work. In addition, the cost of the match tasks are spread out over succeeding uni cation tasks. We therefore simply neglect the cost for match operations, and assume that this does not have a signi cant impact on our measurements. The length of a path in the graph can now be dened as the sum of the costs of all nodes on  type 2 task graphs (average and worst case).</Paragraph>
    <Paragraph position="8"> the path. The critical path length T1 can be de ned as the longest path between any two nodes in the graph.</Paragraph>
    <Paragraph position="9"> The presented model resembles a very ne-grained scheme for distributing work, where each single uni cation tasks to be scheduled independently. In a straightforward implementation of such a scheme, the scheduling overhead can become signi cant. Limiting the scheduling overhead is crucial in obtaining considerable speedup. It might therefore be tempting to group related tasks into a single unit of execution to mitigate this overhead.</Paragraph>
    <Paragraph position="10"> For this reason we also analyzed a task graph representation where only match tasks spawn a new unit of execution. The top graph in Figure 1 shows an example of a task graph for the rst approach. The bottom graph of Figure 1 shows the corresponding task graph for the second approach. Note that because a uni cation task may depend on more than one match task, a choice has to be made in which unit of execution the uni cation task is put.</Paragraph>
    <Paragraph position="11"> Table 1 shows the results of the critical path analysis of both approaches. For the rst approach, the critical path is uniquely de ned. For the second approach we show both the worst case, considering all possible schedulings, and an average case. The results for T1, T1, and P are averaged over all sentences.1 The results show that, using the rst approach, there is a considerable amount of parallelism in the parsing computations. The results also show that a small change in the design of a parallel parser can have a signi cant impact on the value for P. To obtain a speedup of P, in practice, there should be a safety margin between P and P. This suggests that the rst approach is a considerably saver choice, especially when one is considering using more than a dozen of processors.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Design and Implementation
</SectionTitle>
    <Paragraph position="0"> Based on the discussion in the preceding sections, we can derive two requirements for the design of a parallel parser: it should be close in design to a sequential parser and it should allow each single uni cation operation to be scheduled dynamically. The parallel parser we will present in this section meets both requirements. null Let us rst focus on how to meet the rst requirement. Basically, we let each processor, run a regular sequential parser augmented with a mechanism to combine the results of the di erent parsers. Each sequential parser component is contained in a di erent thread.</Paragraph>
    <Paragraph position="1"> By using threads, we allow each parser to share the same memory space. Initially, each thread is assigned a di erent set of work, for example, resembling a di erent part of the input string. A thread will process the uni cation tasks on the agenda and, on success, will perform the resulting match task to match the new edge with the edges on its chart. After completing the work on its agenda, a thread will match the edges on its chart with the edges derived so far by the other threads. This may produce new uni cation tasks, which the thread puts on its agenda. After the communication phase is completed, it returns to normal parsing mode to execute the work on its agenda. This process continues until all edges 1Note that since PT1=PT16= PT1=T1, the results for P turn out slightly lower than might have been expected from the values of T1 and T1.</Paragraph>
    <Paragraph position="2">  of all threads have been matched against each other and all work has been completed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Data Structures
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows an outline of our approach in terms of data structures. Each thread contains an agenda, which can be seen as a queue of uni cation tasks, a chart, which stores the derived edges, and a heap, which is used to store the typed-feature structures that are referenced by the edges. Each thread has full access to its own agenda, chart, and heap, and has read-only access to the respective structures of all other threads. Grammars are read-only and can be read by all threads.</Paragraph>
      <Paragraph position="1"> In the communication phase, threads need read-only access to the edges derived by other threads. This is especially problematic for the feature structures. Many uni cation algorithms need write access to scratch elds in the graph structures. Such algorithms are therefore not thread-safe.2 For this reason we use the thread-safe uni cation algorithm presented by Van Lohuizen (2000), which is comparable in performance to Tomabechi's algorithm (Tomabechi, 1991).</Paragraph>
      <Paragraph position="2"> Note that each thread also has its own agenda. Some parsing systems require strict control over the order of evaluation of tasks.</Paragraph>
      <Paragraph position="3"> The distributed agendas that we use in our approach may make it hard to implement such a strict control. One solution to the problem would be to use a centralized agenda. The disadvantage of such a solution is that it might increase the synchronization overhead. Techniques to reduce the synchronization overhead 2In this context, thread safe means that the same data structure can be involved in more than one operation, of more than one thread, simultaneously. global shared NrThreadsIdle, Generation, IdleGen Sched() var threadGen, newWork, isIdle  threadGen Generation Generation+1 while NrThreadsIdle6= P do 1. newWork not IsEmpty(agenda).</Paragraph>
      <Paragraph position="4"> 2. Process the agenda as in the sequential case. In addition, stamp each newly derived I edge by setting I:generation to the current value for threadGen and add I to this thread's edge list.</Paragraph>
      <Paragraph position="5"> 3. Examine all the other threads for newly derived edges. For each new edge I and for each edge J on the chart for which holds I:generation &gt; J:generation, add the corresponding task to the agenda if it passes the lter. If any edge was processed, set newWork to true.</Paragraph>
      <Paragraph position="6"> 4. if not newWork then newWork Steal() 5. lock GlobalLock 6. if newWork then  in such a setup can be found in (Markatos and LeBlanc, 1992).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Scheduling Algorithm
</SectionTitle>
      <Paragraph position="0"> At startup, each thread calls the scheduling algorithm shown in Figure 3. This algorithm can be seen as a wrapper around an existing sequential parser that takes care of combining the results of the individual threads. The functionality of the sequential parser is embedded in step 2. After this step, the agenda will be empty. The communication between threads takes place in step 3. Each time a thread executes this step, it will proceed over all the newly derived edges of other threads (foreign edges) and match them with the edges on its own chart (local edges). Checking the newly derived edges of other threads can simply be done by proceeding over a linked list of derived edges maintained by the respective threads. Threads record the last visited edge of the list of each other thread. This ensures that each newly derived item needs to be visited only once by each thread.</Paragraph>
      <Paragraph position="1"> As a result of step 3, the agenda may become non-empty. In this case, newWork will be set and step 2 is executed again. This cycle continues until all work is completed.</Paragraph>
      <Paragraph position="2"> The remaining steps serve several purposes: load balancing, preventing double work, and detecting termination. We will explain each of these aspects in the following sections. Note that step 6 and 7 are protected by a lock.</Paragraph>
      <Paragraph position="3"> This ensures that no two threads can execute this code simultaneously. This is necessary because Step 6 and 7 write to variables that are shared amongst threads. The overhead incurred by this synchronization is minimal, as a thread typically iterates over this part only a small number of times. This is because the depth of the derivation graph of any edge is limited (average 14, maximum 37 for the fuse test set).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Work Stealing
</SectionTitle>
      <Paragraph position="0"> In the design as presented so far, each thread exclusively executes the uni cation tasks on its agenda. Obviously, this violates the requirement that each uni cation task should be scheduled dynamically.</Paragraph>
      <Paragraph position="1"> In (Blumofe and Leiserson, 1993), it is shown that for any multi-threaded computation with work T1 and task graph depth T1, and for any number P of processors, a scheduling will achieve TP T1=P +T1 if for the scheduling holds that whenever there are more than P tasks ready, all P threads are executing work. In other words, as long as there is work on any queue, no thread should be idle.</Paragraph>
      <Paragraph position="2"> An e ective technique to ensure the above requirement is met is work stealing (Frigo et al., 1998). With this technique, a thread will rst attempt to steal work from the queue of another thread before denouncing itself to be idle. If it succeeds, it will resume normal execution as if the stolen tasks were its own. Work stealing incurs less synchronization overhead than, for example, a centralized work queue.</Paragraph>
      <Paragraph position="3"> In our implementation, a thread becomes a thief by calling Steal, at step 4 of Sched.</Paragraph>
      <Paragraph position="4"> Steal allows stealing from two types of queues: the agendas, which contain outstanding uni cation tasks, and the unchecked foreign edges, which resemble outstanding match tasks between threads.</Paragraph>
      <Paragraph position="5"> A thief rst picks a random victim to steal from. It rst attempts to steal the victim's match tasks. If it succeeds, it will perform the matches and put any resulting uni cation tasks on its own agenda. If it cannot gain exclusive access to the lists of unchecked foreign edges, or if there were no matches to be performed, it will attempt to steal work from the victim's agenda. A thief will steal half of the work on the agenda. This balances the load between the two threads and minimizes the chance that either thread will have to call the expensive steal operation soon thereafter. Note that idle threads will keep calling Steal until they either obtain new work or all other threads become idle.</Paragraph>
      <Paragraph position="6"> Obviously, stealing eliminates the exclusive ownership of the agenda and unchecked foreign edge lists of the respective threads. As a consequence, a thread needs to lock its agenda and edge lists each time it needs to access it. We use an asymmetric mutual exclusion scheme, as presented in (Frigo et al., 1998), to minimize the cost of locking for normal processing and move more of the overhead to the side of the thief.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Preventing Duplicate Matches
</SectionTitle>
      <Paragraph position="0"> When two matching edges are stored on the charts of two di erent threads, it should be prevented that both threads will perform the corresponding match. Failing to do so can cause the derivation of duplicate edges and eventually a combinatorial explosion of work.</Paragraph>
      <Paragraph position="1"> Our solution is based on a generation scheme.</Paragraph>
      <Paragraph position="2"> Each newly derived edge is stamped with the current generation of the respective thread, threadGen (see step 2). In addition, a thread will only perform the match for two edges if the edge on its chart has a lower generation than the foreign edge (see step 3). Obviously, because the value of threadGen is unique for the thread (see step 6), this scheme prevents two edges from being matched twice.</Paragraph>
      <Paragraph position="3"> Sched also ensures that two matching edges will always be matched by at least one thread. After a thread completes step 3, it will always raise its generation. The new generation will be greater than that of any foreign edge processed before. This ensures that when an edge is put on the chart, no foreign edge with a higher generation has been matched against the respective chart before.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Termination
</SectionTitle>
      <Paragraph position="0"> A thread may terminate when all work is completed, that is, if and only if the following conditions hold simultaneously: all agendas of all threads are empty, all possible matches between edges have been processed, and all threads are idle. Step 7 of Sched enforces that these conditions hold before any thread leaves Sched. Basically, each thread determines for itself whether its queues are empty and raises the global counter NrThreadsIdle accordingly. When all threads are idle simultaneously, the parser is nished.</Paragraph>
      <Paragraph position="1"> A thread's agenda is guaranteed to be empty whenever newWork is false at step 7.</Paragraph>
      <Paragraph position="2"> The same does not hold for the unchecked foreign edges. Whenever a thread derives a new edge, all other edges need to perform the corresponding matches. The following mechanism enforces this. The rst thread to become idle raises the global generation and records it in IdleGen. Subsequent idle threads will adopt this as their idle generation. Whenever a thread derives a new edge, it will raise Generation and reset NrThreadsIdle (step 6).</Paragraph>
      <Paragraph position="3"> This invalidates IdleGen which implicitly removes the idle status from all threads. Note that step 7 lets each thread perform an additional iteration before raising NrThreadsIdle. This allows a thread to check for foreign edges that were derived after step 3 and before 7.</Paragraph>
      <Paragraph position="4"> Once all work is done, detecting termination  suite for various number of processors.</Paragraph>
      <Paragraph position="5"> requires at most 2P synchronization steps.3</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Implementation
</SectionTitle>
      <Paragraph position="0"> The implementation of the system consists of two parts: MACAMBA and CaLi.</Paragraph>
      <Paragraph position="1"> MACAMBA stands for Multi-threading Architecture for Chart And Memoization-Based Applications. The MACAMBA framework provides a set of objects that implement the scheduling technique presented in the previous section. It also includes a set of support objects like charts and a thread-safe unication algorithm. CaLi is an instance of a MACAMBA application that implements a Chart parser for the LinGO grammar. The design of CaLi was based on PET (Callmeier, 2000), one of the fastest parsers for LinGO.</Paragraph>
      <Paragraph position="2"> It implements the quick check (Malouf et al., 2000), which, together with the rule check, takes care of ltering over 90% of the failing uni cation tasks before they are put on the agenda. MACAMBA and CaLi were both implemented in Objective-C and currently run on Windows NT, Linux, and Solaris.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Performance Results
</SectionTitle>
    <Paragraph position="0"> The performance of the sequential version of CaLi is comparable to that of PET.4 In addition, for the single-processor parallel version of CaLi the total overhead incurred by scheduling is less than 1%.</Paragraph>
    <Paragraph position="1"> The rst set of experiments consisted of running the fuse test suite on a SUN Ultra Enterprise with 8 nodes, each with a 400 MHz 3No locking is required once a thread is idle.</Paragraph>
    <Paragraph position="2"> 4Respectively, 1231s and 1339s on a 500MHz P-III, where both parsers used the same parsing schema. UltraSparc processor, for a varying number of processors. Table 2 shows the results of these experiments.5 The execution times for each parse are measured in wall clock time. The time measurement of a parse is started before the rst thread starts working and ends only when all threads have stopped. The fuse test suite contains a large number of small sentences that are hard to parallelize. These results indicate that deploying multiple processors on all input sentences unconditionally still gives a considerable overall speedup.</Paragraph>
    <Paragraph position="3"> The second set of experiments were run on a SUN Enterprise10000 with 64 250 MHz UltraSparc II processors. To limit the amount of data generated by the experiments, and to increase the accuracy of the measurements, we selected a subset of the sentences in the fuse suite. The parser is able to parse many sentences in the fuse suite in fewer than several milliseconds. Measuring speedup is inaccurate in these cases. We therefore eliminated such sentences from the test suite. From the remaining sentences we made a selection of 500 sentences of various lengths.</Paragraph>
    <Paragraph position="4"> The results are shown in Figure 4. The gure includes a graph for the maximum, minimum, and average speedup obtained over all sentences. The maximum speedup of 31.4 is obtained at 48 processors. The overall peak is reached at 32 processors where the average speedup is 17.3. One of the reasons for the decline in speedup after 32 processors is the overhead in the scheduling algorithm. Most notably, the total number of top-level iterations of Sched increases for larger P. The minimum speedups of around 1 are obtained for, often small, sentences that contain too little inherent parallelism to be parallelized effectively. null Figure 4 shows a graph of the parallel efciency, which is de ned as speedup divided by the number of processors. The average efciency remains close to 80% up till 16 processors. Note that super linear speedup is achieved with up to 12 processors, repeatedly for the same set of sentences. Super lin5Because the system was shared with other users, only 6 processors could be utilized.</Paragraph>
    <Paragraph position="5">  speedup and parallel e ciency based on wall clock time.</Paragraph>
    <Paragraph position="6"> ear speedup can occur because increasing the number of processors also reduces the amount of data handled by each node. This reduces the chance of cache misses.</Paragraph>
  </Section>
class="xml-element"></Paper>