On-the-Fly Pipeline Parallelism

I-TING ANGELINA LEE, CHARLES E. LEISERSON, TAO B. SCHARDL, and ZHUNPING ZHANG, MIT CSAIL
JIM SUKHA, Intel Corporation

Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism.

Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this paper investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the PIPER algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The PIPER algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with \( T_1 \) work and \( T_\infty \) span (critical-path length), PIPER executes the computation on \( P \) processors in \( T_P \leq T_1/P + O(T_\infty + \log P) \) expected time. PIPER also limits stack space, ensuring that it does not grow unboundedly with running time.

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.

Categories and Subject Descriptors: D.3.3 [Language Constructs and Features]: Concurrent programming structures; D.3.4 [Programming Languages]: Processors—Run-time environments

General Terms: Algorithms, Languages, Theory.

Additional Key Words and Phrases: Cilk, multicore, multithreading, parallel programming, pipeline parallelism, on-the-fly pipelining, scheduling, work stealing.

1. INTRODUCTION

Pipeline parallelism\(^1\) [Blelloch and Reid-Miller 1997; Giacomoni et al. 2008; Gordon et al. 2006; MacDonald et al. 2004; McCool et al. 2012; Navarro et al. 2009; Pop and Cohen 2011; Reed et al. 2011; Sanchez et al. 2011; Suleman et al. 2010] is a well-known parallel-programming pattern that can be used to parallelize a variety of applications, including streaming applications from the domains of video, audio, and digital signal processing. Many applications, including the ferret, dedup, and x264 benchmarks from the PARSEC benchmark suite [Bienia et al. 2008; Bienia and Li 2010], exhibit parallelism in the form of a linear pipeline, where a linear sequence $S = \langle S_0, \ldots, S_{m-1} \rangle$ of abstract functions, called stages, is executed on an input stream $I = \langle a_0, a_1, \ldots, a_{n-1} \rangle$. Conceptually, a linear pipeline can be thought of as a loop over the elements of $I$, where each loop iteration $i$ processes an element $a_i$ of the input stream. The loop body encodes the sequence $S$ of stages through which each element is processed. Parallelism arises in linear pipelines because the execution of iterations can overlap in time, that is, iteration $i$ may start after the preceding iteration $i-1$ has started, but before $i-1$ has necessarily completed.

\(^1\) Pipeline parallelism should not be confused with instruction pipelining in hardware [Rojas 1997] or software pipelining [Lam 1988].

Most systems that provide pipeline parallelism employ a construct-and-run model, as exemplified by the pipeline model in Intel Threading Building Blocks (TBB) [McCool et al. 2012], where the pipeline stages and their dependencies are defined a priori before execution. Systems that support construct-and-run pipeline parallelism include the following: [Agrawal et al. 2010; Consel et al. 2003; Gordon et al. 2006; MacDonald et al. 2004; Mark et al. 2003; McCool et al. 2012; OpenMP 3.0 2008; Ottoni et al. 2005; Pop and Cohen 2011; Rangan et al. 2004; Sanchez et al. 2011; Suleman et al. 2010; Thies et al. 2007].

We have extended the Cilk parallel-programming model [Frigo et al. 1998; Leiserson 2010; Intel Corporation 2013] to Cilk-P, a system that augments Cilk’s native fork-join parallelism with on-the-fly pipeline parallelism, where the linear pipeline is constructed dynamically as the program executes. The Cilk-P system provides a flexible linguistic model for pipelining that allows the structure of the pipeline to be determined dynamically as a function of data in the input stream. Cilk-P also admits a variable number of stages across iterations, allowing the pipeline to take on shapes other than simple rectangular grids. The Cilk-P programming model is flexible, yet restrictive enough to allow provably efficient scheduling, as Sections 5 through 8 will show. In particular, Cilk-P’s scheduler provides automatic “throttling” to ensure that the computation uses bounded space. As a testament to the flexibility provided by Cilk-P, we were able to parallelize the x264 benchmark from PARSEC, an application that cannot be programmed easily using TBB [Reed et al. 2011].

Although Cilk-P’s support for defining linear pipelines on the fly is more flexible than construct-and-run approaches and the ordered directive in OpenMP [OpenMP 3.0 2008], which supports a limited form of on-the-fly pipelining, it is less expressive than other approaches. Blelloch and Reid-Miller [1997] describe a scheme for on-the-fly pipeline parallelism that employs futures [Friedman and Wise 1978; Baker and Hewitt 1977] to coordinate the stages of the pipeline, allowing even nonlinear pipelines to be defined on the fly. Although futures permit more complex, nonlinear pipelines to be expressed, this generality can lead to unbounded space requirements to attain even modest speedups [Blumofe and Leiserson 1998].

To illustrate the ideas behind the Cilk-P model, consider a simple 3-stage linear pipeline such as in the ferret benchmark from PARSEC [Bienia et al. 2008; Bienia and Li 2010]. Figure 1 shows a pipeline dag (directed acyclic graph) $G = (V, E)$ representing the execution of the pipeline. Each of the 3 horizontal rows corresponds to a stage of the pipeline, and each of the $n$ vertical columns is an iteration. We define a pipeline node $(i, j) \in V$, where $i = 0, 1, \ldots, n - 1$ and $j = 0, 1, 2$, to be the execution of $S_j(a_i)$, the $j$th stage in the $i$th iteration, represented as a vertex in the dag. The edges between nodes denote dependencies. A stage edge from node $(i, j)$ to node $(i, j')$, where $j < j'$, indicates that $(i, j')$ cannot start until $(i, j)$ completes. A cross edge from node $(i - 1, j)$ to node $(i, j)$ indicates that $(i, j)$ can start execution only after node $(i - 1, j)$ completes. Cilk-P always executes nodes of the same iteration in increasing order by stage number, thereby creating a vertical chain of stage edges. Cross edges between corresponding stages of adjacent iterations are optional.

We can categorize the stages of a Cilk-P pipeline. A stage is a serial stage if all nodes belonging to the stage are connected by cross edges, it is a parallel stage if none of the nodes belonging to the stage are connected by cross edges, and it is a hybrid stage otherwise. The ferret pipeline, for example, exhibits a static structure often referred to as an “SPS” pipeline, since stage 0 and stage 2 are serial and stage 1 is parallel. Cilk-P requires that pipelines be linear, since iterations are totally ordered and dependencies go between adjacent iterations, and in fact, stage 0 of any Cilk-P pipeline is always a serial stage. Later stages may be serial, parallel, or hybrid, as we shall see in Sections 2 and 3.

To execute a linear pipeline, Cilk-P follows the lead of TBB and adopts a bind-to-element approach [McCool et al. 2012; MacDonald et al. 2004], where workers (scheduling threads) execute pipeline iterations either to completion or until an unresolved dependency is encountered. In particular, Cilk-P and TBB both rely on “work-stealing” schedulers (see, for example, [Arora et al. 2001; Blumofe and Leiserson 1999; Burton and Sleep 1981; Frigo et al. 1998; Finkel and Manber 1987; Kranz et al. 1989]) for load balancing. In contrast, many systems that support pipeline parallelism, including typical Pthreaded implementations, execute linear pipelines using a bind-to-stage approach, where each worker executes a distinct stage and coordination between workers is handled using concurrent queues [Gordon et al. 2006; Sanchez et al. 2011; Thies et al. 2007]. Some researchers report that the bind-to-element approach generally outperforms bind-to-stage [Navarro et al. 2009; Reed et al. 2011], since a work-stealing scheduler can do a better job of dynamically load-balancing the computation, but our own experiments show mixed results.

A natural theoretical question is, how much parallelism is inherent in the ferret pipeline (or in any pipeline)? How much speedup can one hope for? Since the computation is represented as a dag $G = (V, E)$, one can use a simple work/span analysis [Cormen et al. 2009, Ch. 27] to answer this question. In this analytical model, we assume that each vertex $v \in V$ executes in some time $w(v)$. The work of the computation, denoted $T_1$, is essentially the serial execution time, that is, $T_1 = \sum_{v \in V} w(v)$. The span of the computation, denoted $T_\infty$, is the length of a longest weighted path through $G$, which is essentially the time of an infinite-processor execution. The parallelism is the ratio $T_1/T_\infty$, which is the maximum possible speedup attainable on any number of processors, using any scheduler.

Unlike in some applications, in the ferret pipeline, each node executes serially, that is, its work and span are the same. Let $w(i, j)$ be the execution time of node $(i, j)$. Assume that the serial stages 0 and 2 execute in unit time, that is, for all $i$, we have $w(i, 0) = w(i, 2) = 1$, and that the parallel stage 1 executes in time $r \gg 1$, that is, for all $i$, we have $w(i, 1) = r$. Because the pipeline dag is grid-like, the span of this SPS pipeline can be realized by some staircase walk through the dag from node $(0, 0)$ to node $(n - 1, 2)$. The work of this pipeline is therefore $T_1 = n(r + 2)$, and the span is

$$T_\infty = \max_{0 \leq x < n} \left\{ \sum_{i=0}^{x} w(i, 0) + w(x, 1) + \sum_{i=x}^{n-1} w(i, 2) \right\} = n + r.$$

Consequently, the parallelism of this dag is $T_1/T_\infty = n(r+2)/(n+r)$, which for $1 \ll r \leq n$ is at least $r/2 + 1$. Thus, if stage 1 contains much more work than the other two stages, the pipeline exhibits good parallelism.
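The lower bound on the parallelism follows in one step from the work and span stated above. The derivation below is a short worked check, using only the quantities $n$ and $r$ already defined; the numeric values at the end are hypothetical and chosen purely for illustration. Since $r \leq n$ implies $n + r \leq 2n$,

$$\frac{T_1}{T_\infty} = \frac{n(r+2)}{n+r} \geq \frac{n(r+2)}{2n} = \frac{r}{2} + 1.$$

For instance, with $n = 1000$ and $r = 100$, the ratio is $102000/1100 \approx 92.7$, comfortably above the bound $r/2 + 1 = 51$.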

On an ideal shared-memory computer, Cilk-P is guaranteed to execute the *ferret* pipeline efficiently. In particular, Cilk-P guarantees linear speedup on a computer with up to $T_1/T_\infty = O(r)$ processors. Generally, Cilk-P executes a pipeline with linear speedup as long as the parallelism of the pipeline exceeds the number of processors on which the computation is scheduled. Moreover, as Section 3 will describe, Cilk-P allows stages of the pipeline themselves to be parallel using recursive pipelining or fork-join parallelism.

In practice, it is also important to limit the space used during an execution. Unbounded space can cause thrashing of the memory system, leading to slowdowns not predicted by simple execution models. In particular, a bind-to-element scheduler must avoid creating a *runaway* pipeline — a situation where the scheduler allows many new iterations to be started before finishing old ones. In Figure 1, a runaway pipeline might correspond to executing many nodes in stage 0 (the top row) without finishing the other stages of the computation in the earlier iterations. Runaway pipelines can cause space utilization to grow unboundedly, since every started but incomplete iteration requires space to store local variables.

Cilk-P automatically *throttles* pipelines to avoid runaway pipelines. On a system with $P$ workers, Cilk-P inhibits the start of iteration $i + K$ until iteration $i$ has completed, where $K = \Theta(P)$ is the *throttling limit*. Throttling corresponds to putting *throttling edges* from the last node in each iteration $i$ to the first node in iteration $i + K$. For the simple pipeline from Figure 1, throttling does not adversely affect asymptotic scalability if stages are uniform, but it can be a concern for more complex pipelines, as Section 11 will discuss. The Cilk-P scheduler guarantees efficient scheduling of pipelines as a function of the parallelism of the dag in which throttling edges are included in the calculation of span.
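The throttling rule can be made concrete with a small sketch. The code below is not taken from the Cilk-P runtime; it is a minimal illustration, assuming iterations are numbered from 0 and that the hypothetical helpers `throttling_edges` and `may_start` are given the iteration count `n`, the throttling limit `K`, and a completion flag per iteration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A throttling edge runs from the last node of iteration `first`
// to the first node of iteration `second` (= first + K).
using ThrottlingEdge = std::pair<int, int>;

// Enumerate the throttling edges of a pipeline with n iterations and
// throttling limit K. Both parameters are illustrative inputs; in
// Cilk-P the runtime chooses K = Theta(P).
std::vector<ThrottlingEdge> throttling_edges(int n, int K) {
    assert(n >= 0 && K >= 1);
    std::vector<ThrottlingEdge> edges;
    for (int i = 0; i + K < n; ++i) {
        edges.emplace_back(i, i + K);
    }
    return edges;
}

// Iteration i may start only if iteration i - K has completed.
// The first K iterations have no throttling predecessor.
bool may_start(int i, int K, const std::vector<bool>& completed) {
    assert(i >= 0 && K >= 1);
    return i < K || completed[i - K];
}
```

Because stage 0 is serial, iterations start in order, so under this rule the started-but-incomplete iterations always fall within a window of $K$ consecutive iteration indices.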

**Contributions**

Our prototype Cilk-P system adapts the Cilk-M [Lee et al. 2010] runtime scheduler to support on-the-fly pipeline parallelism using a bind-to-element approach. This paper makes the following contributions:

— We describe linguistics for Cilk-P that allow on-the-fly pipeline parallelism to be incorporated into the Cilk fork-join parallel programming model (Section 2).
— We illustrate how Cilk-P linguistics can be used to express the x264 benchmark as a pipeline program (Section 3).
— We characterize the execution dag of a Cilk-P pipeline program as an extension of a fork-join program (Section 4).
— We introduce the PIPER scheduling algorithm, a theoretically sound randomized work-stealing scheduler (Section 5).
— We prove that PIPER is asymptotically efficient, executing Cilk-P programs on $P$ processors in $T_P \leq T_1/P + O(T_\infty + \log P)$ expected time (Sections 6 and 7).
— We bound space usage, proving that PIPER on $P$ processors uses $S_P \leq P(S_1 + fDK)$ stack space for pipeline iterations, where $S_1$ is the serial stack space, $f$ is the “frame size,” $D$ is the depth of nested pipelines, and $K$ is the throttling limit (Section 8).
— We describe our implementation of PIPER in the Cilk-P runtime system, introducing two key optimizations: lazy enabling and dependency folding (Section 9).
— We demonstrate that the *ferret*, *dedup*, and x264 benchmarks from PARSEC, when hand-compiled for the Cilk-P runtime system, run competitively with existing Pthreaded implementations (Section 10).

We conclude in Section 11 with a discussion of the performance implications of throttling.

2. ON-THE-FLY PIPELINE PROGRAMS

Cilk-P’s linguistic model supports both fork-join and pipeline parallelism, which can be nested arbitrarily. For convenience, we shall refer to programs containing nested fork-join and pipeline parallelism simply as pipeline programs. Cilk-P’s on-the-fly pipelining model allows the programmer to specify a pipeline whose structure is determined during the pipeline’s execution. This section reviews the basic Cilk model and shows how on-the-fly pipeline parallelism is supported in Cilk-P using a pipe_while construct.

We first outline the basic semantics of Cilk without the pipelining features of Cilk-P. We use the syntax of Cilk++ [Leiserson 2010] and Intel® Cilk™ Plus [Intel Corporation 2013],\(^2\) which augments serial C/C++ code with two principal keywords: cilk_spawn and cilk_sync. When a function invocation is preceded by the keyword cilk_spawn, the function is spawned as a child subcomputation, but the runtime system may continue to execute the statement after the cilk_spawn, called the continuation, in parallel with the spawned subroutine without waiting for the child to return. The complementary keyword to cilk_spawn is cilk_sync, which acts as a local barrier and joins together all the parallelism forked by cilk_spawn within a function. Every function contains an implicit cilk_sync before the function returns.

To support on-the-fly pipeline parallelism, Cilk-P provides a pipe_while keyword. A pipe_while loop is similar to a serial while loop, except that loop iterations can execute in parallel in a pipelined fashion. The body of the pipe_while can be subdivided into stages, with stages named by user-specified integer values that strictly increase as the iteration executes. Each stage can contain nested fork-join and pipeline parallelism.

The boundaries of stages are denoted in the body of a pipe_while using the special functions pipe_stage and pipe_stage_wait. These functions accept an integer stage argument, which is the number of the next stage to execute and which must strictly increase during the execution of an iteration. Every iteration i begins executing stage 0, represented by node (i, 0). While executing a node (i, j′), if control flow encounters a pipe_stage(j) or pipe_stage_wait(j) statement, where j > j′, then node (i, j′) ends, and control flow proceeds to node (i, j). A pipe_stage(j) statement indicates that node (i, j) can start executing immediately, whereas a pipe_stage_wait(j) statement indicates that node (i, j) cannot start until node (i − 1, j) completes. The pipe_stage_wait(j) in iteration i creates a cross edge from node (i − 1, j) to node (i, j) in the pipeline dag. Thus, by design choice, Cilk-P imposes the restriction that pipeline dependencies only go between adjacent iterations. As we shall see in Section 9, this design choice facilitates the “lazy enabling” and “dynamic dependency folding” runtime optimizations.

The pipe_stage and pipe_stage_wait functions can be used without an explicit stage argument. Omitting the stage argument while executing stage j corresponds to an implicit stage argument of j + 1, meaning that control moves onto the next stage.

Cilk-P’s semantics for pipe_stage and pipe_stage_wait statements allow for stage skipping, where execution in an iteration i can jump stages from node (i, j′) to node (i, j), even if j > j′ + 1. If control flow in iteration i + 1 enters node (i + 1, j″) after a pipe_stage_wait, where j′ < j″ < j, then we implicitly create a null node (i, j″) in the pipeline dag, which has no associated work and incurs no scheduling overhead, and insert stage edges from (i, j′) to (i, j″) and from (i, j″) to (i, j), as well as a cross edge from (i, j″) to (i + 1, j″).
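The rules for stage edges, cross edges, and null nodes can be sketched as a small dag builder. This is an illustrative reconstruction and not part of Cilk-P: the types `StageEntry` and `PipelineDag` and the function `build_pipeline_dag` are hypothetical names, and the input is assumed to be a per-iteration trace of the pipe_stage and pipe_stage_wait statements that were executed. A wait on a stage beyond the previous iteration's last stage is not covered by the passage above, so the sketch simply records no cross edge in that case.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using Node = std::pair<int, int>;    // (iteration i, stage j)
using Edge = std::pair<Node, Node>;  // (source, destination)

// One stage boundary executed by an iteration.
struct StageEntry {
    int stage;  // argument j of pipe_stage(j) or pipe_stage_wait(j)
    bool wait;  // true for pipe_stage_wait, false for pipe_stage
};

struct PipelineDag {
    std::set<Node> null_nodes;
    std::set<Edge> stage_edges;
    std::set<Edge> cross_edges;
};

PipelineDag build_pipeline_dag(
        const std::vector<std::vector<StageEntry>>& trace) {
    const int n = static_cast<int>(trace.size());
    // nodes[i] collects every stage number that has a node in iteration i.
    // It starts with the real nodes: stage 0 plus each stage entered.
    std::vector<std::set<int>> nodes(n);
    for (int i = 0; i < n; ++i) {
        int prev = 0;
        nodes[i].insert(0);
        for (const StageEntry& e : trace[i]) {
            assert(e.stage > prev);  // stage numbers strictly increase
            nodes[i].insert(e.stage);
            prev = e.stage;
        }
    }
    PipelineDag dag;
    // Each pipe_stage_wait(j) in iteration i adds a cross edge from
    // (i - 1, j). If iteration i - 1 skipped stage j, a null node is
    // created there first.
    for (int i = 1; i < n; ++i) {
        for (const StageEntry& e : trace[i]) {
            if (!e.wait) continue;
            std::set<int>& before = nodes[i - 1];
            if (e.stage > *before.rbegin()) continue;  // not modeled here
            if (before.insert(e.stage).second) {
                dag.null_nodes.insert(Node(i - 1, e.stage));
            }
            dag.cross_edges.insert(
                Edge(Node(i - 1, e.stage), Node(i, e.stage)));
        }
    }
    // Stage edges chain the real and null nodes of each iteration in
    // increasing order of stage number.
    for (int i = 0; i < n; ++i) {
        int prev = -1;
        for (int j : nodes[i]) {
            if (prev >= 0) {
                dag.stage_edges.insert(Edge(Node(i, prev), Node(i, j)));
            }
            prev = j;
        }
    }
    return dag;
}
```

As a usage example, if iteration 0 jumps from stage 0 directly to stage 3 and iteration 1 enters stage 1 with pipe_stage_wait, the builder creates the null node (0, 1), splits the stage edge of iteration 0 around it, and records the cross edge from (0, 1) to (1, 1).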

3. ON-THE-FLY PIPELINING OF x264

To illustrate the use of Cilk-P’s pipe_while loop, this section describes how to parallelize the x264 video encoder [Wiegand et al. 2003].

We begin with a simplified description of x264. Given a stream \( \langle f_0, f_1, \ldots \rangle \) of video frames to encode, x264 partitions each frame into a two-dimensional array of “macroblocks” and encodes each macroblock. A macroblock in frame \( f_i \) is encoded as a function of the encodings of similar macroblocks within \( f_i \) and similar macroblocks in frames “near” \( f_i \). A frame \( f_j \) is \emph{near} a frame \( f_i \) if \( i - b \leq j \leq i + b \) for some constant \( b \). In addition, we define a macroblock \((x',y')\) to be \emph{near} a macroblock \((x,y)\) if \( x - w \leq x' \leq x + w \) and \( y - w \leq y' \leq y + w \) for some constant \( w \).

\(^2\) Cilk++ and Cilk Plus also include other features that are not relevant to the discussion here.

The type of a frame \( f_i \) determines how a macroblock \((x,y)\) in \( f_i \) is encoded. If \( f_i \) is an \emph{I-frame}, then macroblock \((x,y)\) can be encoded using only \emph{previous} macroblocks within \( f_i \) — macroblocks at positions \((x',y')\) where \( y' < y \) or \( y' = y \) and \( x' < x \). If \( f_i \) is a \emph{P-frame}, then macroblock \((x,y)\)’s encoding can also be based on nearby macroblocks in nearby preceding frames, up to the most recent preceding I-frame,\(^3\) if one exists within the nearby range. If \( f_i \) is a \emph{B-frame}, then macroblock \((x,y)\)’s encoding can be based also on nearby macroblocks in nearby frames, likewise, up to the most recently preceding I-frame and up to the next succeeding I- or P-frame.

Based on these frame types, an \emph{x264} encoder must ensure that frames are processed in a valid order such that dependencies between encoded macroblocks are satisfied. A parallel \emph{x264} encoder can pipeline the encoding of I- and P-frames in the input stream, processing each set of intervening B-frames after encoding the latest I- or P-frame on which the B-frame may depend.

Figure 2 shows Cilk-P pseudocode for an \emph{x264} linear pipeline. Conceptually, the \emph{x264} pipeline begins with a serial stage (lines 7–16) that reads frames from the input stream and determines the type of each frame. This stage buffers all B-frames at the head of the input stream until it encounters an I- or P-frame. After this initial stage, \( s \) hybrid stages process this I- or P-frame row by row (lines 17–24), where \( s \) is the number of rows in the video frame. After all rows of this I- or P-frame have been processed, the \emph{PROCESS_BFRAMES} stage processes all B-frames in parallel (lines 26–28), and then the END stage updates the output stream with the processed frames (line 30).

Two issues arise with this general pipelining strategy, both of which can be handled using on-the-fly pipeline parallelism. First, the encoding of a P-frame must wait for the encoding of rows in the previous frame to be completed, whereas the encoding of an I-frame need not. These conditional dependencies are implemented in lines 19–23 of Figure 2 by executing a pipe_stage_wait or pipe_stage statement conditionally based on the frame’s type. In contrast, many construct-and-run pipeline mechanisms assume that the dependencies on a stage are fixed for the entirety of a pipeline’s execution, making such dynamic dependencies more difficult to handle. Second, the encoding of a macroblock in row $x$ of P-frame $f_i$ may depend on the encoding of a macroblock in a later row $x+w$ in the preceding I- or P-frame $f_{i-1}$. The code in Figure 2 handles such offset dependencies on line 16 by skipping $w$ additional stages relative to the previous iteration. A similar stage-skipping trick is used on line 25 to ensure that the processing of a P-frame in iteration $i$ depends only on the processing of the previous I- or P-frame, and not on the processing of preceding B-frames. Figure 3 illustrates the pipeline dag corresponding to the execution of the code in Figure 2, assuming that $w=1$. Skipping stages shifts the nodes of an iteration down, adding null nodes to the pipeline, which do not increase the work or span.

\(^3\) To be precise, up to a particular type of I-frame called an \emph{IDR-frame}.

Fig. 3. The pipeline dag generated for x264. Each iteration processes either an I- or P-frame, each consisting of $s$ rows. As the iteration index $i$ increases, the number of initial stages skipped in the iteration also increases. This stage skipping produces cross edges into an iteration $i$ from null nodes in iteration $i-1$. Null nodes are represented as the intersection between two edges.

4. COMPUTATION-DAG MODEL

Although the pipeline-dag model provides intuition for programmers to understand the execution of a pipeline program, it is not precise enough to prove theoretical performance guarantees. For example, a pipeline dag has no real way of representing nested fork-join or pipeline parallelism within a node. This section describes how to represent the execution of a pipeline program as a more refined “computation dag.” First, we present an example of a simple pipeline program using pipe_while loops, and explain how to transform it into an ordinary Cilk program with special function calls to enforce non-fork-join dependencies. Then we describe how to model these transformed programs as computation dags.

We shall model an execution of a pipeline program as a “(pipeline) computation dag,” which is based on the notion of a fork-join computation dag for ordinary Cilk programs [Blumofe and Leiserson 1998, 1999] without pipeline parallelism. Let us first review this computation dag model for ordinary Cilk programs. A fork-join computation dag \( G = (V,E) \) represents the execution of a Cilk program, where each vertex in \( V \) denotes a unit-cost instruction. For convenience, we shall assume that instructions that call into the runtime system execute in unit time. Edges in \( E \) indicate ordering dependencies between instructions. The normal serial execution of one instruction after another creates a serial edge from the first instruction to the next. A cilk_spawn of a function creates two dependency edges emanating from the instruction immediately before the cilk_spawn: the spawn edge goes to the first instruction of the spawned function, and the continue edge goes to the first instruction after the spawned function. A cilk_sync creates a return edge from the final instruction of each spawned function to the cilk_sync instruction (as well as an ordinary serial edge from the instruction that executed immediately before the cilk_sync). We can model a particular execution of an ordinary fork-join Cilk program as conceptually generating the computation dag \( G \) dynamically, as it executes. Thus, after the program has finished executing, we have a complete dag \( G \) that captures the structure of parallelism in that execution.
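The fork-join computation dag lends itself to a direct work/span calculation. The sketch below is an illustration under stated assumptions and not code from the Cilk runtime: the names `EdgeKind`, `DagEdge`, `WorkSpan`, and `work_and_span` are hypothetical, every vertex is a unit-cost instruction as in the model above, and vertices are assumed to be numbered in serial execution order so that every edge goes from a lower to a higher number.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The four kinds of dependency edges in a fork-join computation dag.
enum class EdgeKind { Serial, Spawn, Continue, Return };

struct DagEdge {
    int from;
    int to;
    EdgeKind kind;
};

struct WorkSpan {
    long work;  // number of unit-cost vertices
    long span;  // number of vertices on a longest path
};

// Compute work and span for unit-cost vertices 0..num_vertices-1.
// Assumes every edge satisfies from < to (a topological numbering).
WorkSpan work_and_span(int num_vertices, const std::vector<DagEdge>& edges) {
    assert(num_vertices >= 0);
    // longest[v] = length of a longest path ending at v, counting v.
    std::vector<long> longest(num_vertices, 1);
    std::vector<DagEdge> sorted = edges;
    // Processing edges by increasing source guarantees that longest[u]
    // is final before any edge leaving u is relaxed.
    std::sort(sorted.begin(), sorted.end(),
              [](const DagEdge& a, const DagEdge& b) {
                  return a.from < b.from;
              });
    for (const DagEdge& e : sorted) {
        assert(0 <= e.from && e.from < e.to && e.to < num_vertices);
        longest[e.to] = std::max(longest[e.to], longest[e.from] + 1);
    }
    long span = 0;
    for (long len : longest) {
        span = std::max(span, len);
    }
    return WorkSpan{num_vertices, span};
}
```

For example, a function that spawns one single-instruction child, executes one instruction in the continuation, and then syncs yields four vertices: the instruction before the spawn, the child, the continuation, and the cilk_sync. The spawn, continue, return, and serial edges among them give a work of 4 and a span of 3. The edge kind does not affect the calculation; it is kept only to mirror the terminology of the text.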

Intuitively, we shall model an execution of a pipeline program as a (pipeline) computation dag by augmenting a traditional fork-join computation dag with cross and throttling dependencies. More formally, to generate a pipeline computation dag for an arbitrary pipeline-program execution, we use the following three-step process:

1. Transform the executed code in each pipe_while loop into ordinary Cilk code, augmented with special functions to implement cross and throttling dependencies.
2. Model the execution of this augmented Cilk program as a fork-join computation dag, ignoring cross and throttling dependencies.
3. Augment the fork-join computation dag with cross and throttling edges derived from the special functions.

The remainder of this section examines each of these steps in detail.

**Code transformation for a pipe_while loop**

Let us first consider the process of translating a pipe_while loop into ordinary Cilk code. Conceptually, a pipe_while loop is transformed into an augmented ordinary Cilk program in which an ordinary while loop sequentially spawns off each iteration of the pipe_while loop. Each iteration of this while loop first executes stage 0. Upon executing the first pipe_stage or pipe_stage_wait instruction in an iteration \( i \), the remainder of iteration \( i \) is spawned off, allowing the remaining stages of iteration \( i \) to execute in parallel with iteration \( i + 1 \). Because stage 0 of a pipe_while iteration executes before the remaining stages are spawned, stage 0 is guaranteed to execute sequentially across all iterations of the while loop. Each iteration may execute additional runtime functions to enforce cross and throttling dependencies between iterations.

This conceptual transformation of a pipe_while loop is complicated by specific semantic features of pipe_while iterations. For example, although stage 0 of each iteration executes before the remaining stages of each iteration are spawned, the runtime ensures that all stages of an iteration operate on the same set of iteration-local variables. Furthermore, to ensure that an iteration executes pipeline stages sequentially, the runtime executes an implicit cilk_sync at the end of each stage to sync all child functions spawned within the stage before allowing the next stage to begin.

int fd_out = open_output_file();
bool done = false;
pipe_while (!done) {
    chunk_t *chunk = get_next_chunk();
    if (chunk == NULL) {
        done = true;
    } else {
        pipe_stage_wait(1);
        bool isDuplicate = deduplicate(chunk);
        pipe_stage(2);
        if (!isDuplicate)
            compress(chunk);
        pipe_stage_wait(3);
        write_to_file(fd_out, chunk);
    }
}

Fig. 4. Cilk-P pseudocode for the parallelization of the dedup compression program as an SSPS pipeline.

To illustrate more precisely the semantic features of pipe_while iterations, including how the Cilk-P runtime manages frames and iterations of a pipe_while loop, let us consider a Cilk-P implementation of a specific pipeline program, namely, the dedup compression program from PARSEC [Bienia et al. 2008; Bienia and Li 2010]. The benchmark can be parallelized by using a pipe_while to implement an SSPS pipeline. Figure 4 shows Cilk-P pseudocode for dedup, which compresses the provided input file by removing duplicated “chunks,” as follows. Stage 0 (lines 4–6) of the program reads data from the input file and breaks the data into chunks (line 4). As part of stage 0, it also checks the loop-termination condition and sets the done flag to true (line 6) if the end of the input file is reached. If there is more input to be processed, the program begins stage 1, which calculates the SHA1 signature of a given chunk and queries a hash table whether this chunk has been seen using the SHA1 signature as key (line 9). Stage 1 is a serial stage as dictated by the pipe_stage_wait on line 8. Stage 2, which the pipe_stage on line 10 indicates is a parallel stage, compresses the chunk if it has not been seen before (line 12). The final stage, a serial stage, writes either the compressed chunk or its SHA1 signature to the output file depending on whether it is the first time the chunk has been seen (line 14).
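For orientation, the control flow of Figure 4 can be mimicked by an ordinary serial loop. The sketch below is only a toy analog and not the PARSEC or Cilk-P code: it assumes that replacing pipe_while with while and dropping the stage markers yields the serial behavior, it models chunks as strings, it substitutes a hash map keyed by the chunk itself for the SHA1-keyed hash table, and the helpers `toy_compress` and `dedup_serial` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for compress(): run-length encoding, e.g. "aaab" -> "a3b1".
std::string toy_compress(const std::string& chunk) {
    std::string out;
    for (std::size_t i = 0; i < chunk.size();) {
        std::size_t j = i;
        while (j < chunk.size() && chunk[j] == chunk[i]) ++j;
        out += chunk[i];
        out += std::to_string(j - i);
        i = j;
    }
    return out;
}

// Serial analog of the dedup loop. The returned vector plays the role
// of the output file.
std::vector<std::string> dedup_serial(const std::vector<std::string>& input) {
    std::unordered_map<std::string, std::size_t> first_seen;
    std::vector<std::string> output;
    std::size_t next = 0;
    bool done = false;
    while (!done) {
        // Stage 0: fetch the next chunk and test for termination.
        if (next == input.size()) {
            done = true;
        } else {
            const std::string& chunk = input[next++];
            // Stage 1 (serial in Figure 4): has this chunk been seen?
            auto result = first_seen.emplace(chunk, first_seen.size());
            bool is_duplicate = !result.second;
            // Stage 2 (parallel in Figure 4): compress only new chunks.
            std::string payload;
            if (!is_duplicate) {
                payload = toy_compress(chunk);
            } else {
                // A reference to the first occurrence stands in for the
                // SHA1 signature written by the real program.
                payload = "ref:" + std::to_string(result.first->second);
            }
            // Stage 3 (serial in Figure 4): write in input order.
            output.push_back(payload);
        }
    }
    return output;
}
```

The serial stages 1 and 3 are exactly the ones that touch shared state (the table of seen chunks and the output), which is why Figure 4 enters them with pipe_stage_wait, whereas the compression in stage 2 touches only the iteration's own chunk.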

Figure 5 illustrates how the Cilk-P runtime system manages frames and pipeline iterations for the pipe_while loop for dedup presented in Figure 4. This code transformation has six key components, which illustrate the general structure of parallelism in pipeline programs.

1. As shown in lines 3–50, a pipe_while loop is “lifted” using a C++ lambda function [Stroustrup 2013, Sec.11.4] and converted to an ordinary while loop whose iterations correspond to iterations of the pipeline. This lambda function declares a control frame object pcf (on line 4) to keep track of runtime state needed for the pipe_while loop, including a variable pcf.i to index iterations, which line 4 initializes to 0.

2. Each iteration of the while loop allocates an iteration frame to store local data for each pipeline iteration. Before starting a pipeline iteration pcf.i, the loop allocates a new iteration frame next_iter_f for iteration pcf.i, as shown in line 6. The iteration frame stores local variables declared in the body of an iteration that persist across pipeline stages. For dedup, for example, Figure 4 shows that the local variable chunk is used through all stages. The iteration frame also stores a stage counter variable, iter_f->stage_counter, to track the currently executing stage for the iteration.

3. The body of this while loop is split into two nested lambda functions, the first for stage 0 of the iteration (lines 8–18), and the second for the remaining stages in the iteration (lines 21–41), if they exist. This transformation guarantees that stage 0 is always a serial stage, since the first lambda function is directly called in the body of the while loop. The test condition of the pipe_while loop is evaluated as part of stage 0, as demonstrated in line 10. In contrast, the cilk_spawn in line 21 allows the remaining stages of an iteration to execute in parallel with the next iteration of the loop. The cilk_sync immediately after the end of the while loop (line 49) ensures that all spawned iterations complete before the pipe_while loop finishes.

int fd_out = open_output_file();
bool done = false;
[&]() {
    _Cilk_pipe_control_frame pcf(0);  // control frame; pcf.i starts at 0
    while (true) {
        _Cilk_pipe_iter_frame* next_iter_f = pcf.get_new_iter_frame(pcf.i);
        // Stage 0 of an iteration.
        [&]() {
            next_iter_f->continue_after_stage0 = false;
            if (!done) {
                next_iter_f->chunk = get_next_chunk();
                if (next_iter_f->chunk == NULL) {
                    done = true;
                } else {
                    next_iter_f->continue_after_stage0 = true;
                }
            }
        }();
        // Spawn the remaining stages of iteration pcf.i, if they exist.
        if (next_iter_f->continue_after_stage0) {
            cilk_spawn [&](_Cilk_pipe_iter_frame* iter_f) {
                // assert (iter_f->stage_counter < 1);
                iter_f->stage_counter = 1;
                // node (i,1) begins
                iter_f->stage_wait(1);
                bool isDuplicate = deduplicate(iter_f->chunk);
                cilk_sync;
                // assert (iter_f->stage_counter < 2);
                iter_f->stage_counter = 2;
                // node (i,2) begins
                if (!isDuplicate)
                    compress(iter_f->chunk);
                cilk_sync;
                // assert (iter_f->stage_counter < 3);
                iter_f->stage_counter = 3;
                // node (i,3) begins
                iter_f->stage_wait(3);
                write_to_file(fd_out, iter_f->chunk);
                cilk_sync;
                iter_f->stage_counter = INT64_MAX;
            }(next_iter_f);
        } else {
            break;
        }
        // Advance to next iteration and check for throttling.
        pcf.i++;
        pcf.throttle(pcf.i - pcf.K);
    }
    cilk_sync;
}();

Fig. 5. Pseudocode resulting from translating the execution of the Cilk-P dedup implementation from Figure 4 into Cilk Plus code augmented by cross and throttling dependencies, implemented by iter_f->stage_wait and pcf.throttle, respectively. The unbound variable pcf.K is the throttling limit.

4. The last statement in the while loop (line 47) is a call to a special function throttle, defined by the control frame pcf, which enforces the throttling dependency that iteration pcf.i cannot start until iteration pcf.i - pcf.K has completed.

5. A pipe_stage statement in the original pipe_while loop is transformed into an update to iter_f->stage_counter, while a pipe_stage_wait statement is transformed into an update followed by a call to iter_f->stage_wait, which ensures that the cross dependency on the previous iteration is satisfied. In dedup, stages 1, 2, and 3 are thus delineated by updates to iter_f->stage_counter in lines 23, 29, and 35, respectively. The end of the iteration is delineated by setting iter_f->stage_counter to its maximum value, as in line 40.

6. At the end of each stage, a cilk_sync (lines 17, 27, 33, and 39) guarantees that any nested fork-join parallelism is enclosed within the stage, that is, any functions spawned by cilk_spawn statements within the stage return before the next stage begins.
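The stage-counter bookkeeping in the list above can be sketched in plain C++. This is a minimal single-threaded illustration under stated assumptions, not the Cilk-P runtime: the type `IterFrameSketch`, its `prev` pointer, and the method names are hypothetical, and the sketch assumes that a wait on stage j in iteration i is satisfied once iteration i-1 has advanced its stage counter past j.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an iteration frame; not the Cilk-P runtime type.
struct IterFrameSketch {
    std::int64_t stage_counter = 0;         // stage this iteration is executing
    const IterFrameSketch* prev = nullptr;  // frame of iteration i-1 (assumed link)

    // Mirrors a pipe_stage statement: advance the counter to stage j.
    void enter_stage(std::int64_t j) {
        assert(stage_counter < j);  // stages only move forward in an iteration
        stage_counter = j;
    }

    // Mirrors the end of an iteration: the counter jumps to its maximum value.
    void finish() { stage_counter = INT64_MAX; }

    // True when stage j of this iteration may run: the previous iteration has
    // moved past stage j, or there is no previous iteration.
    bool cross_dependency_satisfied(std::int64_t j) const {
        return prev == nullptr || prev->stage_counter > j;
    }
};
```

A real runtime would suspend or spin inside the wait rather than return a Boolean; the predicate only shows which comparison a wait has to make.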

Figure 5 uses lambda functions to capture the parallel control structure of Figure 4 directly in Cilk, without changing the semantics of the cilk_spawn or cilk_sync keywords. It also introduces an additional variable in the iteration frame, continue_after_stage0, so that execution can resume correctly at the continuation of stage 0 in the second lambda function in each iteration. While these lambdas capture the parallel control structure of the Cilk-P dedup implementation in Figure 4, for more complicated pipeline iterations, such as when stage 0 ends in the middle of a loop, this transformation can be tricky to express at the level of ordinary Cilk Plus code. At a lower level, however, a compiler need only generate code that saves the program state analogously to an ordinary cilk_spawn. It may be simpler and more efficient, therefore, to eliminate one or more of the lambda functions, and instead implement modified versions of cilk_sync and cilk_spawn statements specifically for pipe_while loop transformations.

For example, the code in Figure 5 uses lambda functions in line 3 and line 8 only to create nested 
scopes for parallelism and ensure the desired behavior for a cilk_sync statement. Without the 
lambda function in line 3, the last cilk_sync in line 49 would also synchronize with any functions 
that were spawned in the enclosing function before calling the pipe_while loop. Similarly, the 
lambda for stage 0 in line 8 exists only to guarantee that the cilk_sync in line 17 joins only the 
parallelism within stage 0, and not with any of the lambda functions spawned in line 21. All the 
lambda functions in Figure 5 capture the environment of the enclosing function by reference because 
the body of the pipe_while loop is allowed to access variables declared in the enclosing function, 
such as fd_out and done. In practice, a compiler might avoid generating these nested lambdas, and 
instead simply generate a special kind of cilk_sync instruction at the end of a stage or pipe_while 
loop that joins only the appropriate functions. This code would not be directly expressible in Cilk 
Plus, but is likely to be simpler and more efficient.

Similarly, although Figure 5 describes an iteration as being split into two lambda functions — 
one for stage 0 and one for the subsequent stages of the iteration — in practice, it may be simpler 
to merge those lambda functions. Instead of using an ordinary cilk_spawn to spawn the rest of 
the stages of an iteration separately from stage 0, for example, a system might instead try to spawn 
a single lambda function for the entire iteration. Then, the system might allow other workers to 
steal the continuation of the spawn of the iteration only after the iteration finishes its stage 0, not 
immediately after the spawn occurs.

**Pipeline computation dag for dedup**

Given the transformed code for a pipe_while loop, the second and third steps generate a pipeline 
computation dag that models the execution of this transformed loop. The second step models the 
execution of the transformed code when ignoring all calls to stage_wait and throttle, and then 
the third step augments the resulting fork-join computation dag with cross and throttling edges 
derived from those calls. Figure 6 illustrates the salient features of the final pipeline computation 
dag that corresponds to executing the code in Figure 5. Let us examine the structure of the dag in 
Figure 6 by first considering the vertices and edges that model the execution of Figure 5, ignoring 
calls to stage_wait and throttle, and then examining the cross and throttling edges added by 
these calls.

Let us first see how the vertices in Figure 6 correspond to the lines of code in Figure 5. Let $i$ be 
an integer where $0 \leq i \leq n$, and let $j$ be an integer greater than 0.

- The vertices labeled $x_i$ and $z_i$ correspond to the execution of instructions inserted by the runtime. Vertices $x_0$ and $z_n$, for example, correspond to executing the first and final instructions, respectively.

Fig. 6. An example pipeline computation dag for a `pipe_while` loop with \( n \) iterations, corresponding to the transformation shown in Figure 5. The vertices are organized to reflect their organization in a pipeline dag, where columns of vertices correspond to distinct iterations. Conceptually, the vertices corresponding to the execution of a node are contained in a rounded box. A column of these boxes corresponds to an iteration of the `pipe_while`, while a row of these boxes corresponds to a stage. Additional vertices and edges appear in this dag to denote instructions executed by the runtime to handle iterations of a `pipe_while`, as well as their parallel control dependencies. Cross and throttling edges are colored blue, while edges in typical Cilk programs are colored black.

- The computation subdag rooted at \( a_{i,0} \) and terminated at vertex \( b_{i,0} \) corresponds to executing stage 0 and associated runtime instructions for managing the `while` loop in iteration \( i \). Vertex \( a_{i,0} \) corresponds to executing line 5. Vertex \( b_{i,0} \) corresponds to executing the `cilk_spawn` statement on line 21, except when \( i = n \), in which case \( b_{n,0} \) corresponds to executing line 43. The vertices in Figure 6 on paths from \( a_{i,0} \) to \( b_{i,0} \) correspond to executing the intervening instructions in lines 5–21. The `cilk_sync` statement in the lambda for stage 0 ensures that vertex \( b_{i,0} \) is the single leaf vertex for this computation subdag.

- For \( i < n \), the computation subdag rooted at \( a_{i,j} \) and terminated at \( b_{i,j} \) corresponds to the execution of node \((i, j)\) in the pipeline dag. For example, in iteration \( i \), vertex \( a_{i,1} \) corresponds to executing line 25, the first instruction in node \((i, 1)\), and vertex \( b_{i,1} \) corresponds to executing line 27, the final instruction in node \((i, 1)\). The vertices on paths from \( a_{i,1} \) to \( b_{i,1} \) in Figure 6 correspond to executing the intervening instructions in lines 25–27. Notice that, if node \((i, j)\) is the destination of a cross edge, then \( a_{i,j} \) corresponds to executing `stage_wait`. The `cilk_sync` statement at the end of each stage (lines 27, 33, and 39 for stages 1, 2, and 3, respectively) ensures that \( b_{i,j} \) is the single leaf in the computation subdag corresponding to the execution of node \((i, j)\).

- The `stage_counter` vertices \( c_{i,\text{end}} \) and \( c_{i,j} \) for integers \( j > 0 \) correspond to updates in iteration \( i \) to the iteration frame’s `stage_counter` variable. For example, \( c_{i,2} \) corresponds to executing line 29 in iteration \( i \). Vertex \( c_{i,\text{end}} \) corresponds to executing line 40 in iteration \( i \), which terminates the iteration. We call \( c_{i,\text{end}} \) the **terminal** vertex for iteration \( i \).

For convenience, in the computation subdag that models the execution of node \((i, j)\), we call vertex \(a_{i,j}\) the **node root** vertex, and we call vertex \(b_{i,j}\) the **node terminal** vertex.

The correspondence between instructions in Figure 5 and the vertices of Figure 6 describes most of the edges in Figure 6, based on the structure of fork-join computation dags. For example, the code in Figure 5 shows that, for each \(i\) where \(0 \leq i < n\), edge \((x_i, a_{i,0})\) is a serial edge, edge \((b_{i,0}, c_{i,1})\) is a spawn edge, and edge \((b_{i,0}, z_i)\) is a continue edge. Meanwhile, for each iteration \(i\) where \(0 \leq i < n\), edge \((z_i, x_{i+1})\) is a serial edge, reflecting the fact that stage 0 is a serial stage. Similarly, for \(j > 1\), the edges \((b_{i,(j-1)}, c_{i,j})\) and \((c_{i,j}, a_{i,j})\) that connect the node terminal of \((i, j-1)\) to the node root of \((i, j)\) are serial edges, reflecting the fact that each iteration of the pipeline executes the pipeline stages sequentially. Finally, for each iteration \(i\) where \(0 \leq i < n\), edge \((b_{i,3}, c_{i,\text{end}})\) is a serial edge, and edge \((c_{i,\text{end}}, z_n)\) is a return edge. These vertex and edge definitions are established by modeling an execution of the transformed code as an ordinary Cilk Plus program, when stage_wait and throttle instructions are ignored.

Finally, we consider the cross and throttling edges in Figure 6 enforced by stage_wait and throttle instructions.

For each iteration \(i\) where \(0 < i < n\), a call to stage_wait implements a cross edge, which connects a stage counter vertex in iteration \(i-1\) to a node root in iteration \(i\). For example, in each iteration \(i\) of the loop in Figure 5, the stage_wait call on line 25 implements the cross edge \((c_{i-1,2}, a_{i,1})\), and the stage_wait call on line 37 implements the cross edge \((c_{i-1,\text{end}}, a_{i,3})\). Conceptually, because a stage counter vertex \(c_{i,j}\) occurs after the node terminal for stage \(j-1\) and before the node root for stage \(j\), a cross edge \((c_{i-1,j}, a_{i,j-1})\) ensures that node \((i, j-1)\) in iteration \(i\) executes after node \((i-1, j-1)\). When \(j\) is the final stage in an iteration \(i-1\), the iteration terminal \(c_{i-1,\text{end}}\) fills the role of the stage counter vertex \(c_{i-1,j+1}\).

A throttling edge connects the terminal of iteration \(i \leq n-K\) to \(x_{i+K}\) in iteration \(i+K\), where \(K\) is the throttling limit. Figure 6 illustrates throttling edges when \(K = 2\) and shows that a throttling edge exists from \(c_{i,\text{end}}\) to \(x_{i+2}\) for each iteration \(i\) where \(0 < i < n-2\). These throttling edges thus prevent node \((i, 0)\) from executing before all nodes in iteration \(i-K\) complete, thereby limiting the number of iterations that may execute simultaneously. Notice that the destination of a throttling edge is some vertex \(x_i\), not \(z_n\). In other words, only return edges, not throttling edges, terminate at \(z_n\).
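The throttling dependency can be phrased as a simple predicate. The following is a minimal sketch under stated assumptions, not Cilk-P code: the helper `throttle_satisfied` and the `finished` vector (which records whether each iteration's terminal vertex has executed) are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if iteration i may begin stage 0 under throttling limit K.
// finished[k] records whether the terminal vertex of iteration k has executed.
// Hypothetical helper for illustration; not part of Cilk-P.
inline bool throttle_satisfied(std::size_t i, std::size_t K,
                               const std::vector<bool>& finished) {
    assert(K > 0);
    if (i < K) return true;  // no iteration i-K exists, so nothing to wait for
    assert(i - K < finished.size());
    return finished[i - K];  // iteration i waits on the terminal of iteration i-K
}
```

With K = 2, for instance, iterations 0 and 1 are never throttled, and iteration 3 may start only once iteration 1 has finished.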

**General pipeline computation dags**

To generalize the structure of the pipeline computation dag in Figure 6 for arbitrary Cilk-P pipelines, we must specify how null nodes are handled. In some iteration \(i\), for stage \(j > 0\), suppose that node \((i, j)\) is a null node. In this case, none of the vertices \(c_{i,j}\), \(a_{i,j}\), \(b_{i,j}\), nor any of the vertices on paths between these, map to executed instructions, and therefore these vertices do not exist in the computation dag. To demonstrate what happens to the edges that would normally enter and exit these vertices, we may suppose that the computation dag is originally constructed with dummy vertices \(c_{i,j}\), \(a_{i,j}\), and \(b_{i,j}\) connected in a path, and then all three of these vertices are contracted into the stage counter vertex following \(b_{i,j}\). Notice that, because \(a_{i,j}\) is a dummy vertex, it does not correspond to a call to stage_wait, and thus it has no incoming cross edge. Furthermore, notice that this model for handling null nodes may cause multiple cross edges to exit the same stage counter vertex. We shall see that this does not pose a problem for the PIPER scheduler.

### 5. THE PIPER SCHEDULER

PIPER executes a pipeline program on a set of \(P\) workers using work-stealing. For the most part, PIPER’s execution model can be viewed as a modification of the scheduler described by Arora, Blumofe, and Plaxton [Arora et al. 2001] (henceforth referred to as the ABP model) for computation dags arising from pipeline programs. PIPER deviates from the ABP model in one significant way, however, in that it performs a "tail-swap" operation.

We describe the operation of PIPER in terms of the pipeline computation dag \(G = (V, E)\). Each worker \(p\) in PIPER maintains an **assigned vertex** corresponding to the instruction that \(p\) executes on the current time step. We say that a vertex \( u \) is ready if all its predecessors have been executed. Executing an assigned vertex \( v \) may enable a vertex \( u \) that is a direct successor of \( v \) in \( G \) by making \( u \) ready. Each worker maintains a deque of ready vertices. Normally, a worker pushes and pops vertices from the tail of its deque. A “thief,” however, may try to steal a vertex from the head of another worker’s deque. It is convenient to define the extended deque \( \langle v_0, v_1, \ldots, v_r \rangle \) of a worker \( p \), where \( v_0 \in V \) is \( p \)’s assigned vertex and \( v_1, v_2, \ldots, v_r \in V \) are the vertices in \( p \)’s deque in order from tail to head.

On each time step, each PIPER worker \( p \) follows a few simple rules for execution based on the type of \( p \)’s assigned vertex \( v \) and how many direct successors are enabled by the execution of \( v \), which is at most 2. (Although \( v \) may have multiple direct successors in the next iteration due to cross-edge dependencies from null nodes, executing \( v \) can enable at most one such vertex, since the nodes in the next iteration execute serially.) We assume that the rules are executed atomically.

First, we consider the cases where the assigned vertex \( v \) of a worker \( p \) is not the last vertex of an iteration, that is, \( v \) is not an iteration terminal.

- If executing \( v \) enables only one direct successor \( u \), then \( p \) simply changes its assigned vertex from \( v \) to \( u \).

- If executing \( v \) enables two successors \( u \) and \( w \), then \( p \) changes its assigned vertex from \( v \) to one successor \( u \), and pushes the other successor \( w \) onto its deque. Only two types of vertices \( v \) can enable two successors, and the decision of which successor to push onto the deque depends on the type of \( v \). If \( v \) is a vertex corresponding to a normal spawn, then \( u \) follows the spawn edge (\( u \) is the child), and \( w \) follows the continue edge (\( w \) is the continuation). If \( v \) is a stage counter vertex in iteration \( i \) that does not end the iteration, then \( u \) is the node root of the next node in iteration \( i \), and \( w \) is the node root of a node in iteration \( i + 1 \).

- If executing \( v \) enables no successors and the deque of \( p \) is not empty, then \( p \) pops the bottom element \( u \) from its deque and changes its assigned vertex from \( v \) to \( u \).

- If executing \( v \) enables no successors and the deque of \( p \) is empty, then \( p \) becomes a thief. As a thief, \( p \) randomly picks another worker to be its victim, tries to steal the vertex \( u \) at the head of the victim’s deque, and sets the assigned vertex of \( p \) to \( u \) if successful. Otherwise, \( p \)’s assigned vertex becomes NULL, and \( p \) remains a thief.

These cases are consistent with the normal ABP model.
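The four rules above amount to a handful of operations on the extended deque. The following is a minimal sequential sketch, assuming integer vertex identifiers and leaving out both random victim selection and any concurrency control; `WorkerSketch`, `kNoVertex`, and the method names are hypothetical and are not taken from the Cilk-P runtime.

```cpp
#include <cassert>
#include <deque>

using Vertex = int;                 // hypothetical vertex identifier
constexpr Vertex kNoVertex = -1;    // stands in for a NULL assigned vertex

struct WorkerSketch {
    Vertex assigned = kNoVertex;    // v_0, or kNoVertex when the worker is a thief
    std::deque<Vertex> dq;          // front = head (steal end), back = tail (owner end)

    // Executing the assigned vertex enabled exactly one successor u.
    void enable_one(Vertex u) { assigned = u; }

    // Executing the assigned vertex enabled two successors: continue with u
    // (the child, or the next node in the same iteration) and push w.
    void enable_two(Vertex u, Vertex w) {
        assigned = u;
        dq.push_back(w);
    }

    // Executing the assigned vertex enabled nothing: pop from the tail, or
    // become a thief when the deque is empty.
    void enable_none() {
        if (dq.empty()) {
            assigned = kNoVertex;
            return;
        }
        assigned = dq.back();
        dq.pop_back();
    }

    // A thief takes the vertex at the head of this worker's deque, if any.
    Vertex steal_from() {
        if (dq.empty()) return kNoVertex;
        Vertex u = dq.front();
        dq.pop_front();
        return u;
    }
};
```

The owner works only at the tail while thieves take from the head, which is why the oldest enabled vertex is always the one exposed to stealing.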

PIPER handles the end of an iteration differently, however, due to throttling edges. Suppose that a worker \( p \) has an assigned vertex \( v \) representing the terminal of an iteration in a given pipe_while loop, and suppose that the edge leaving \( v \) is a throttling edge to a vertex \( z \). When \( p \) executes \( v \), two cases are possible.

- Suppose executing \( v \) does not enable \( z \). Then, no new vertices are enabled, and \( p \) acts like the normal ABP model, that is, either popping the bottom element \( u \) from its deque, or becoming a thief.

- Suppose executing \( v \) does enable \( z \). Then, \( p \) performs two actions. First, \( p \) changes its assigned vertex from \( v \) to \( z \). Second, if \( p \) has a nonempty deque, then \( p \) performs a tail swap: it exchanges its assigned vertex \( z \) with the vertex at the tail of its deque.

This tail-swap operation is designed to empirically reduce PIPER’s space usage and cause PIPER to favor retiring old iterations over starting new ones. Without the tail swap, in a normal ABP-style execution, when a worker \( p \) finishes an iteration \( i \) that enables a vertex via a throttling edge, \( p \) would conceptually choose to start a new iteration \( i + K \), even if iteration \( i + 1 \) were already suspended and on its deque. With the tail swap, \( p \) resumes iteration \( i + 1 \), leaving \( i + K \) available for stealing. The tail swap also enhances cache locality by encouraging \( p \) to execute consecutive iterations.
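The tail swap itself is a two-step update of the extended deque. The following is a minimal sketch assuming integer vertex identifiers; `TailSwapWorker` and `enable_via_throttling` are hypothetical names, and the sketch covers only the case in which executing an iteration terminal enables a vertex z through a throttling edge.

```cpp
#include <cassert>
#include <deque>
#include <utility>

using Vertex = int;  // hypothetical vertex identifier

struct TailSwapWorker {
    Vertex assigned = -1;    // v_0 of the extended deque
    std::deque<Vertex> dq;   // back = tail (owner end), front = head (steal end)

    // Called when executing an iteration terminal enables z via a throttling edge.
    void enable_via_throttling(Vertex z) {
        assigned = z;                        // first action: z becomes the assigned vertex
        if (!dq.empty())
            std::swap(assigned, dq.back());  // second action: tail swap
    }
};
```

After the swap the worker resumes the vertex that was at the tail of its deque, and z sits in the deque where a thief can take it.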

It may seem, at first glance, that a tail-swap operation might significantly reduce the parallelism, since the vertex \( z \) enabled by the throttling edge is pushed onto the bottom of the deque. Intuitively, if there were additional work above \( z \) in the deque, then a tail swap could significantly delay the start of iteration \( i + K \). Lemma 6.4 in Section 6 will show, however, that a tail-swap operation only occurs on deques with exactly 1 element. Thus, whenever a tail swap occurs, \( z \) is at the top of the deque and is immediately available to be stolen.

void F(int n) {
  if (n < 2)
    G(n);
  else {
    cilk_spawn F(n-1);
    F(n-2);
    cilk_sync;
  }
}

void G(int n) {
  int i = 0;
  pipe_while (i < 2) {
    ++i; // Stage 0.
    pipe_stage_wait(1);
    H(); // Stage 1.
  }
  cilk_spawn foo();
  cilk_sync;
}

void H() {
  cilk_sync;
  ...
}

Fig. 7. Contours for a computation with fork-join and pipeline parallelism. The fork-join function F contains nested calls to a function G that contains a pipe_while loop with two iterations. The function G itself calls a fork-join function H in stage 1 of each iteration. Each letter a through h labels a contour in the dag for F(4). The vertices a_1, b_1, ..., h_1 are contour roots. For example, c(a_k) = a for all k.

ACM Transactions on Parallel Computing, Vol. V, No. N, Article A, Publication date: January YYYY.

### 6. STRUCTURAL INVARIANTS

During the execution of a pipeline program by PIPER, the worker deques satisfy two structural invariants, called the “contour” property and the “depth” property. This section states and proves these invariants.

Intuitively, we would like to describe the structure of the worker deques in terms of frames — activation records — of functions’ local variables, since the deques implement a “cactus stack” [Hauck and Dent 1968; Lee et al. 2010]. As Figure 5 illustrates, a pipe_while loop corresponds to a parent control frame with a spawned child for each iteration executing on its own iteration frame. Although the actual Cilk-P implementation manages frames in this fashion, the parallel control of a pipe_while, which more directly affects the contents of worker deques, really does follow the schema illustrated in Figure 5, where stage 0 of an iteration i executes in a lambda called from the parent, rather than in the spawned child lambda function which contains the rest of i. Consequently, we introduce “contours” to represent this structure.

Consider a computation dag G = (V, E) that arises from executing a pipeline program. A contour is a path in G composed only of serial and continue edges. A contour must be a path, because there can be at most one serial or continue edge entering or leaving any vertex. We call the first vertex of a contour the root of the contour, which is the only vertex in the contour that has an incoming spawn edge (except for the initial instruction of the entire computation, which has no incoming edges). Consequently, contours can be organized into a tree hierarchy, where one contour is a parent of another if the first contour contains a vertex that spawns the root of the second. Given a vertex v ∈ V, let c(v) denote the contour to which v belongs. For convenience, we shall assume that all contours are maximal, meaning that no two vertices in distinct contours are connected by a serial or continue edge.
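The definition of contours suggests a direct way to compute them from a dag whose edges are labeled by kind. The following is a minimal sketch, assuming vertices are numbered 0 to n-1 and that at most one serial or continue edge enters or leaves any vertex, as stated above; the types `EdgeKind`, `Edge`, and `Contours` and the function `compute_contours` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

enum class EdgeKind { Serial, Continue, Spawn, Return, Cross, Throttle };

struct Edge {
    std::size_t from, to;
    EdgeKind kind;
};

struct Contours {
    std::vector<std::size_t> of;      // of[v] = root vertex of the contour c(v)
    std::vector<std::size_t> parent;  // parent[r] = root of the parent contour (r itself if none)
};

// Follows representative links with path halving.
inline std::size_t find_root(std::vector<std::size_t>& rep, std::size_t v) {
    while (rep[v] != v) {
        rep[v] = rep[rep[v]];
        v = rep[v];
    }
    return v;
}

inline Contours compute_contours(std::size_t n, const std::vector<Edge>& edges) {
    std::vector<std::size_t> rep(n);
    std::iota(rep.begin(), rep.end(), std::size_t{0});
    // Merge along serial and continue edges. Attaching the destination's set
    // under the source's set keeps the first vertex of each path as its
    // representative, so the representative is the contour root.
    for (const Edge& e : edges)
        if (e.kind == EdgeKind::Serial || e.kind == EdgeKind::Continue)
            rep[find_root(rep, e.to)] = find_root(rep, e.from);

    Contours c;
    c.of.resize(n);
    for (std::size_t v = 0; v < n; ++v) c.of[v] = find_root(rep, v);
    c.parent = c.of;
    // A spawn edge links the spawning vertex's contour to the child contour.
    for (const Edge& e : edges)
        if (e.kind == EdgeKind::Spawn) c.parent[c.of[e.to]] = c.of[e.from];
    return c;
}
```

Return, cross, and throttling edges are deliberately ignored: they connect vertices in different contours and so never extend a contour.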

Figure 7 illustrates contours for a simple function F with both nested fork-join and pipeline parallelism. For the pipe_while loop in G, stage 0 of pipeline iteration 0 (a_8 and a_9) is considered part of the same contour that starts the pipe_while loop, not part of contour f which represents the rest of the stages of iteration 0. In terms of function frames, however, it is natural to consider stage 0 as sharing a function frame with the rest of the stages in the same iteration. Although contour boundaries happen to align with function boundaries when we consider only fork-join parallelism in Cilk, contours and function frames are actually distinct orthogonal concepts, as highlighted by pipe_while loops.

One important property of contours, which can be shown by structural induction, is that, for any function invocation \( f \), the vertices \( p \) and \( q \) corresponding to the first and last instructions in \( f \) belong to the same contour, that is, \( c(p) = c(q) \). Using this property and the identities of its edges, one can show the following facts about contours in a pipeline computation dag.

**FACT 1.** For a given pipe_while loop on \( n \) iterations, the vertices \( x_i \), \( a_{i,0} \), \( b_{i,0} \), and \( z_i \), for all \( i \) where \( 0 \leq i \leq n \), all lie in the same contour, that is, \( c(x_i) = c(a_{i,0}) = c(b_{i,0}) = c(z_i) \).

**FACT 2.** For an iteration \( i \) of a pipe_while loop, let \( J = \{ j_1, j_2, \ldots, j_{i-1} \} \) denote the set of stage numbers such that, for \( j \in J \), node \((i, j)\) is not a null node. For all \( j \in J \), the stage counter vertices \( c_{i, j} \) and \( c_{i, \text{end}} \) and the node root and node terminal vertices \( a_{i, j} \) and \( b_{i, j} \) for node \((i, j)\) all lie in the same contour, that is, \( c(c_{i, \text{end}}) = c(c_{i, j}) = c(a_{i, j}) = c(b_{i, j}) \).

The following two lemmas describe two important properties exhibited in the execution of a pipeline program.

**LEMMA 6.1.** Only one vertex in a contour can belong to any extended deque at any time.

**PROOF.** The vertices in a contour form a chain and are, therefore, enabled serially. \( \square \)

The structure of a pipe_while guarantees that each iteration creates a separate contour for all its stages after stage 0, and that all contours for iterations of the pipe_while share a common parent in the contour tree. These properties lead to the following lemma.

**LEMMA 6.2.** If an edge \((u, v)\) is a cross edge, then \( c(u) \) and \( c(v) \) are siblings in the contour tree and correspond to adjacent iterations in a pipe_while loop. If an edge \((u, v)\) is a throttling edge, then \( c(v) \) is the parent of \( c(u) \) in the contour tree.

**PROOF.** For some \( i > 0 \), every cross edge \((u, v)\) connects a stage counter vertex \( u \) in iteration \( i-1 \) (that is, \( u \) equals either \( c_{i-1, j} \) for some \( j \) or \( c_{i-1, \text{end}} \)) to a node root \( v = a_{i,k} \) in iteration \( i \) in the same pipe_while loop. By Fact 2, the root of the contour \( c(u) \) is the first stage counter vertex in iteration \( i-1 \), that is, the stage counter vertex in \( i-1 \) that is the destination of a spawn edge from the node terminal \( b_{i-1, 0} \). Thus the contour \( c(u) \) is a child of the contour \( c(b_{i-1, 0}) \) in the contour tree. Similarly, the root of the contour \( c(v) = c(a_{i,k}) \) is a child of the contour \( c(b_{i, 0}) \) containing the node terminal \( b_{i, 0} \). Because \( b_{i-1, j} \) and \( a_{i,k} \) belong to iterations of the same pipe_while loop, Fact 1 implies that \( c(b_{i-1, 0}) = c(b_{i, 0}) \). Because \( c(u) \) and \( c(v) \) are both children of this contour, \( c(u) \) and \( c(v) \) are siblings in the contour tree, showing the first part of the lemma statement.

For a pipe_while loop of \( n \) iterations with throttling limit \( K \), a throttling edge \((u, v)\) connects \( u \), the terminal of an iteration \( i < n - K \), to a vertex \( v = x_{i+K} \) in the computation dag. By the reasoning above, we know \( u \) is in a child contour of \( c(b_{i, 0}) \). By Fact 1, we know \( c(b_{i, 0}) = c(x_{i+K}) = c(v) \). Thus, \( u \) is in a child contour of \( c(v) \), showing the second part of the lemma statement. \( \square \)

As PIPER executes a pipeline program, the deques of workers are highly structured with respect to contours.

**Definition 6.3.** At any time during an execution of a pipeline program which produces a computation dag \( G = (V, E) \), consider the extended deque \( \langle v_0, v_1, \ldots, v_r \rangle \) of a worker \( p \). This deque satisfies the **contour property** if for all \( k = 0, 1, \ldots, r-1 \), one of the following two conditions holds.

1. Contour \( c(v_{k+1}) \) is the parent of \( c(v_k) \).
2. The root of \( c(v_k) \) is the root for some iteration \( i \), the root of \( c(v_{k+1}) \) is the root for the next iteration \( i+1 \), and if \( k+2 \leq r \), then \( c(v_{k+2}) \) is the common parent of both \( c(v_k) \) and \( c(v_{k+1}) \).

Contours allow us to prove an important property of the tail-swap operation.

**LEMMA 6.4.** At any time during an execution of a pipeline program which produces a computation dag \( G = (V, E) \), suppose that worker \( p \) enables a vertex \( x \) via a throttling edge as a result of executing its assigned vertex \( v_0 \). If \( p \)’s deque satisfies the contour property (Definition 6.3), then one of the following conditions holds.

1. Worker \( p \)’s deque is empty and \( x \) becomes \( p \)’s new assigned vertex.
2. Worker \( p \)’s deque contains a single vertex \( v_1 \) which becomes \( p \)’s new assigned vertex and \( x \) is pushed onto \( p \)’s deque.

**PROOF.** Because \( x \) is enabled by a throttling edge, \( v_0 \) must be the terminal of some iteration \( i \), and Lemma 6.2 implies that \( c(x) \) is the parent of \( c(v_0) \). Because \( x \) is just being enabled, Lemma 6.1 implies that no other vertex in \( c(x) \) can belong to \( p \)’s deque. Suppose that \( p \)’s extended deque \( \langle v_0, v_1, \ldots, v_r \rangle \) contains \( r \geq 2 \) vertices. By the contour property, either \( v_1 \) or \( v_2 \) would then belong to contour \( c(x) \), which Lemma 6.1 rules out, and hence \( r = 0 \) or \( r = 1 \). If \( r = 0 \), then \( x \) becomes \( p \)’s assigned vertex. If \( r = 1 \), then the root of \( c(v_1) \) is the start of iteration \( i + 1 \). Since \( x \) is enabled by a throttling edge, a tail swap occurs, making \( v_1 \) the assigned vertex of \( p \) and putting \( x \) onto \( p \)’s deque. \( \square \)

To analyze the time required for PIPER to execute a computation dag \( G = (V, E) \), define the enabling tree \( G_T = (V, E_T) \) as the tree containing an edge \((u, v) \in E_T\) if \( u \) is the last predecessor of \( v \) to execute. The enabling depth \( d(u) \) of \( u \in V \) is the depth of \( u \) in the enabling tree \( G_T \).
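Because the enabling tree depends on the order in which vertices actually execute, the enabling depth is a property of a particular execution rather than of the dag alone. The following is a minimal sketch that computes enabling depths from a serialized execution order; the function `enabling_depths` and its inputs are hypothetical, and it assumes that no two predecessors of a vertex execute on the same step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// preds[v] lists the direct predecessors of v. order lists every vertex once,
// in the order executed (assumed to respect the dag's dependencies).
// Returns d, where d[v] is the depth of v in the enabling tree.
inline std::vector<std::size_t> enabling_depths(
        const std::vector<std::vector<std::size_t>>& preds,
        const std::vector<std::size_t>& order) {
    const std::size_t n = preds.size();
    assert(order.size() == n);
    std::vector<std::size_t> when(n, 0), depth(n, 0);
    for (std::size_t t = 0; t < n; ++t) when[order[t]] = t;
    for (std::size_t v : order) {
        if (preds[v].empty()) continue;  // a source vertex is a root of the tree
        std::size_t last = preds[v][0];
        for (std::size_t u : preds[v]) {
            assert(when[u] < when[v]);   // predecessors execute before v
            if (when[u] > when[last]) last = u;
        }
        depth[v] = depth[last] + 1;      // tree edge (last, v)
    }
    return depth;
}
```

For a vertex with two predecessors at different depths, two valid execution orders can therefore give it two different enabling depths.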

**Definition 6.5.** At any time during an execution of a pipeline program which produces a computation dag \( G = (V, E) \), consider the extended deque \( \langle v_0, v_1, \ldots, v_r \rangle \) of a worker \( p \). The deque satisfies the **depth property** if the following conditions hold:

1. For \( k = 1, 2, \ldots, r - 1 \), we have \( d(v_{k-1}) \geq d(v_k) \).
2. For \( k = r \), we have \( d(v_{k-1}) \geq d(v_k) \) or \( v_k \) has an incoming throttling edge.
3. The inequalities are strict for \( k > 1 \).
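The three conditions can be checked mechanically on a snapshot of an extended deque. The following is a minimal sketch under one reading of the definition, in which the throttling-edge exemption of condition 2 also waives the strictness requirement of condition 3 for \( v_r \); the function `satisfies_depth_property` and its parameters are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// d[k] is the enabling depth of v_k in the extended deque <v_0, ..., v_r>.
// head_has_throttling_edge says whether v_r has an incoming throttling edge.
inline bool satisfies_depth_property(const std::vector<std::size_t>& d,
                                     bool head_has_throttling_edge) {
    if (d.empty()) return true;  // an empty extended deque satisfies it vacuously
    const std::size_t r = d.size() - 1;
    for (std::size_t k = 1; k <= r; ++k) {
        if (k == r && head_has_throttling_edge) continue;  // condition 2 exempts v_r
        const bool ok = (k > 1) ? d[k - 1] > d[k]    // condition 3: strict for k > 1
                                : d[k - 1] >= d[k];  // k = 1 may hold with equality
        if (!ok) return false;
    }
    return true;
}
```

In words, depths must not increase from the assigned vertex toward the head of the deque, except possibly at the head when it was enabled through a throttling edge.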

**Theorem 6.6.** At all times during an execution of a pipeline program by PIPER, all deques satisfy the contour and depth properties (Definitions 6.3 and 6.5).

**PROOF.** The proof follows a similar induction to the proof of Lemma 3 from [Arora et al. 2001]. Intuitively, we replace the “designated parents” discussed in [Arora et al. 2001] with contours, which exhibit similar parent-child relationships.

The claim holds vacuously in the base case, that is, for any empty deque.

Assuming inductively that the statement is true, consider the possible actions of PIPER that modify the contents of the deque. For \( r > 1 \), let \( v_0, v_1, \ldots, v_r \) denote the vertices on \( p \)’s extended deque before \( p \) executes \( v_0 \), and let \( v'_0, v'_1, \ldots, v'_r \) denote the vertices on \( p \)’s extended deque afterwards. A worker \( p \) may execute its assigned vertex \( v_0 \), thereby enabling 0, 1, or 2 vertices, or another worker \( q \) may steal a vertex from the top of the deque.

Worker \( q \) steals a vertex from \( p \)’s deque. The statement holds because the identities of the remaining vertices in \( p \)’s deque are unchanged. Similarly, the claim holds vacuously for \( q \) because \( q \)’s extended deque has only the stolen vertex.

Executing \( v_0 \) enables 0 vertices. Worker \( p \) pops \( v_1 \) from the bottom of its deque to become its new assigned vertex \( v'_0 \). This action shifts all vertices in the deque down, that is, \( r' = r - 1 \) and for all \( k \) we have \( v'_k = v_{k+1} \). The statement holds because the identities of the remaining vertices in \( p \)’s deque are unchanged.

Executing \( v_0 \) enables 1 vertex \( u \). Worker \( p \) changes its assigned vertex from \( v_0 \) to \( v'_0 = u \) and leaves all other vertices in the deque unchanged, that is, \( r' = r \) and \( v'_k = v_k \) for all \( k > 1 \). For vertices \( v_2, v_3, \ldots, v_r \), if they exist, Definition 6.3 holds by induction. We therefore only need to consider the relationship between \( u \) and \( v_1 \).

The contour property holds by induction if \( c(u) = c(v_0) \), that is, if the edge \((v_0, u)\) is a serial or continue edge. The depth property also holds by induction because we are replacing \( v_0 \) on the extended deque with a successor node \( u \), and thus \( d(u) > d(v_0) \). Consequently, we need only consider the cases where \( (v_0, u) \) is either a spawn edge, a return edge, a cross edge, or a throttling edge.

— Edge \( (v_0, u) \) cannot be a spawn edge because executing a spawn node always enables 2 children.

— If \( (v_0, u) \) is a return edge, then \( c(u) \) is the parent of \( c(v_0) \). By Lemma 6.1, at most one vertex in \( c(u) \) may be on \( p \)'s deque, and thus the inductive hypothesis shows that the deque contains at most 1 vertex \( v_1 \). In particular, if the root of \( c(v_0) \) is the root of an iteration \( i \) for some pipe_while loop, then the root of \( c(v_1) \) is the root of iteration \( i + 1 \). This situation is impossible, however, because every vertex in \( c(v_1) \) serially precedes \( u \), and thus executing \( v_0 \) cannot enable \( u \). The deque must therefore be empty, in which case the properties hold vacuously.

— If \( (v_0, u) \) is a throttling edge, then Lemma 6.4 specifies the structure of worker \( p \)'s extended deque. In particular, Lemma 6.4 states that the deque contains at most 1 vertex. If \( r = 0 \), the deque is empty and the properties hold vacuously. Otherwise, \( r = 1 \) and the deque contains one element \( v_1 \), in which case the tail-swap operation assigns \( v_1 \) to \( p \) and puts \( u \) into \( p \)'s deque. The contour property holds, because \( c(u) \) is the parent of \( c(v_1) \). The depth property holds, because \( u \) is enabled by a throttling edge.

— Suppose that \( (v_0, u) \) is a cross edge. Lemma 6.2 shows that a cross edge \( (v_0, u) \) can only exist between vertices in sibling iteration contours. By the inductive hypothesis, \( c(v_1) \) must be either the parent of \( c(v_0) \) or equal to \( c(u) \). In the latter case, however, enabling \( u \) would place two vertices from \( c(u) \) onto the same deque, which Lemma 6.1 implies is impossible. Contour \( c(v_1) \) is therefore the common parent of \( c(v_0) \) and \( c(u) \), and thus setting \( v_0' = u \) maintains the contour property. The depth property holds because \( u \) is a successor of \( v_0 \), and \( d(u) > d(v_0) \).

**Executing \( v_0 \) enables 2 vertices, \( u \) and \( w \).** Without loss of generality, assume PIPER pushes the vertex \( w \) onto the bottom of its deque and assigns itself vertex \( u \). Hence, we have \( r' = r + 1 \), \( v_0' = u \), \( v_1' = w \), and \( v_k' = v_{k-1} \) for all \( 1 < k \leq r' \). Definition 6.5 holds by induction, because the enabling edges \( (v_0, u) \) and \( (v_0, w) \) imply that \( d(v_0) < d(u) = d(w) \). For vertices \( v_2, v_3, \ldots, v_r \), if they exist, Definition 6.3 holds by induction. We therefore need only verify Definition 6.3 for vertices \( u \) and \( w \).

To enable 2 vertices, \( v_0 \) must have at least 2 outgoing edges. Then, we need to consider only three cases for \( v_0 \), namely \( v_0 \) executes a cilk_spawn, \( v_0 \) is the terminal of some iteration \( i \) in a pipe_while loop, or \( v_0 \) is a stage counter vertex.

— If \( v_0 \) executes a cilk_spawn, then \( c(w) = c(v_0) \) and \( c(u) \) is a child contour of \( c(v_0) \), maintaining Definition 6.3.

— If \( v_0 \) is the terminal of an iteration \( i \), then its 2 outgoing edges are a throttling edge and a return edge. The destination of each edge, however, is in the same contour — the parent contour of \( c(v_0) \). Thus, executing \( v_0 \) could have enabled at most 1 vertex, making this case impossible.

— If \( v_0 \) is a stage counter vertex, then \( w \) must be the destination of a cross edge. In this case, vertex \( u \) is a node root in the same contour \( c(u) = c(v_0) \), and by Lemma 6.2, \( c(u) \) and \( c(w) \) are adjacent siblings in the contour tree. As such, we need only show that \( c(v_1) \), if it exists, is their parent. But if \( c(v_1) \) is not the parent of \( c(v_0) = c(u) \), then by induction it must be that \( c(u) \) and \( c(v_1) \) are adjacent siblings. In this case \( c(v_1) = c(w) \), which is impossible by Lemma 6.1.

☐

### 7. TIME ANALYSIS OF PIPER

This section bounds the completion time for PIPER, showing that PIPER executes pipeline programs asymptotically efficiently. Specifically, suppose that a pipeline program produces a computation dag \( G = (V, E) \) with work \( T_1 \) and span \( T_\infty \) when executed by PIPER on \( P \) processors. We show that for any \( \epsilon > 0 \), the running time is \( T_P \leq T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon)) \) with probability at least \( 1 - \epsilon \), which implies that the expected running time is \( E[T_P] \leq T_1/P + O(T_\infty + \lg P) \). This bound is comparable to the work-stealing bound for fork-join dags originally proved in [Blumofe and Leiserson 1999].

We adapt the potential-function argument of [Arora et al. 2001]. PIPER executes computation dags in a style similar to their work-stealing scheduler, except for tail swapping. Although [Arora et al. 2001] ignores the issue of memory contention, we handle it using the “recycling game” analysis from [Blumofe and Leiserson 1999], which contributes the additive $O(\lg P)$ term to the bounds.

The crux of the proof is to bound the number of steal attempts performed during the execution of a computation dag $G$ in terms of its span $T_\infty$. We measure progress through the computation by defining a potential function for a vertex in the computation dag based on its depth in the enabling tree. Consider a particular execution of a computation dag $G = (V, E)$ by PIPER. For that execution, we define the weight of a vertex $v$ as $w(v) = T_\infty - d(v)$, and we define the potential of vertex $v$ at a given time as

$$
\phi(v) = \begin{cases} 
3^{2w(v) - 1} & \text{if $v$ is assigned,} \\
3^{2w(v)} & \text{otherwise.}
\end{cases}
$$

We define the potential of a worker $p$’s extended deque $\langle v_0, v_1, \ldots, v_r \rangle$ as $\phi(p) = \sum_{k=0}^{r} \phi(v_k)$.
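To make these definitions concrete, the following minimal sketch evaluates the vertex and deque potentials with exact rational arithmetic. The function names and the example depths are illustrative assumptions and are not part of PIPER or Cilk-P.

```python
from fractions import Fraction


def weight(t_inf, depth):
    """Weight w(v) = T_inf - d(v) of a vertex at depth d(v) in the enabling tree."""
    return t_inf - depth


def phi(t_inf, depth, assigned):
    """Potential of one vertex: 3^(2w-1) if it is assigned, 3^(2w) otherwise."""
    w = weight(t_inf, depth)
    return Fraction(3) ** (2 * w - 1 if assigned else 2 * w)


def deque_potential(t_inf, depths):
    """Potential of an extended deque <v_0, ..., v_r>.

    depths[0] is the depth of the assigned vertex v_0; depths[1:] are the
    depths of v_1, ..., v_r, listed from the bottom of the deque to the top.
    """
    return sum(phi(t_inf, d, k == 0) for k, d in enumerate(depths))


# Hypothetical example: span 5, assigned vertex at depth 4,
# deque vertices at depths 3 and 1.
example = deque_potential(5, [4, 3, 1])
```

In this example the vertex nearest the top of the deque (depth 1) contributes 6561 of the total potential 6645, so most of the potential sits where thieves steal.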

Given this potential function, the proof of the time bound follows the same overall structure as the
proof in [Arora et al. 2001].

First, we prove two properties of worker deques involving the potential function.

Lemma 7.1. At any time during an execution of a pipeline program which produces a computation
dag $G = (V, E)$, the extended deque $\langle v_0, v_1, \ldots, v_r \rangle$ of every worker $p$ satisfies the following:

1. $\phi(v_r) + \phi(v_{r-1}) \geq 3\phi(p)/4$.
2. Let $\phi'$ denote the potential after $p$ executes $v_0$. Then we have $\phi(p) - \phi'(p) = 2(\phi(v_0) + \phi(v_1))/3$, if $p$ performs a tail swap, and $\phi(p) - \phi'(p) \geq 5\phi(v_0)/9$ otherwise.

Proof. The analysis to show Property 1 is analogous to the analysis in [Arora et al. 2001, Lem. 6]. Since Theorem 6.6 shows that $p$’s extended deque satisfies the depth property, we have

$$
d(v_0) \geq d(v_1) > d(v_2) > \cdots > d(v_{r-2}) > d(v_{r-1}).
$$

If $r = 0$, then $\phi(p) = \phi(v_0)$. If $r \geq 1$, we have

$$
\begin{aligned}
\phi(p) &= \sum_{k=1}^{r} 3^{2w(v_k)} + 3^{2w(v_0) - 1} \\
&\leq 3^{2w(v_r)} + \left(3^{2w(v_{r-1})} \sum_{k=1}^{r-1} \frac{1}{3^{2(r-k-1)}} \right) + 3^{2w(v_0) - 1} \\
&\leq 3^{2w(v_r)} + 3^{2w(v_{r-1})} + \frac{3^{2w(v_{r-1})}}{4} \\
&\leq \phi(v_r) + \phi(v_{r-1}) + \frac{\phi(p)}{4},
\end{aligned}
$$

and thus $\phi(v_r) + \phi(v_{r-1}) \geq 3\phi(p)/4$.

Now we argue that, in any time step $t$ during which worker $p$ executes its assigned vertex $v_0$, the potential of $p$’s extended deque decreases. Let $\phi'$ denote the potential after the time step. If $v_0$ is the terminal of an iteration $i$, and PIPER performs a tail swap after executing $v_0$, then Lemma 6.4 dictates the state of the deque before and after $p$ executes $v_0$, from which we deduce that $\phi(p) - \phi'(p) = 2(\phi(v_0) + \phi(v_1))/3$. The remaining cases follow from [Arora et al. 2001], which shows that $\phi(p) - \phi'(p) \geq 5\phi(v_0)/9$. □
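The two potential drops in Property 2 can be checked with exact arithmetic. The sketch below rests on two assumptions taken from the throttling-edge case in the proof of Theorem 6.6: a tail swap turns the extended deque ⟨v_0, v_1⟩ into ⟨v_1, u⟩, and a vertex enabled by v_0 has depth d(v_0) + 1, hence weight w(v_0) − 1. The helper names are illustrative only.

```python
from fractions import Fraction


def phi(w, assigned):
    """Potential of a vertex of weight w: 3^(2w-1) if assigned, else 3^(2w)."""
    return Fraction(3) ** (2 * w - 1 if assigned else 2 * w)


def tail_swap_drop(w0, w1):
    """Potential drop when <v_0, v_1> becomes <v_1, u> with w(u) = w0 - 1."""
    before = phi(w0, True) + phi(w1, False)
    after = phi(w1, True) + phi(w0 - 1, False)
    return before - after


def two_children_drop(w0):
    """Potential drop when v_0 enables two children of weight w0 - 1.

    One child becomes the assigned vertex and the other is pushed on the deque.
    """
    before = phi(w0, True)
    after = phi(w0 - 1, True) + phi(w0 - 1, False)
    return before - after
```

Under these assumptions the tail-swap drop equals 2(φ(v_0) + φ(v_1))/3 exactly, and enabling two children gives a drop of exactly 5φ(v_0)/9, so that case meets the non-tail-swap bound with equality.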

As in [Arora et al. 2001], we analyze the behavior of workers randomly stealing from each other using a balls-and-weighted-bins analog. We want to analyze the case where the top 2 elements are stolen out of any deque, however, not just the top element. To address this case, we modify Lemma 7 of [Arora et al. 2001] to consider the probability that 2 out of $2P$ balls land in the same bin.

**Lemma 7.2.** Consider $P$ bins, where for $p = 1, 2, \ldots, P$, bin $p$ has weight $W_p$. Suppose that $2P$ balls are thrown independently and uniformly at random into the $P$ bins. For bin $p$, define the random variable $X_p$ as

$$X_p = \begin{cases} W_p & \text{if at least 2 balls land in bin } p, \\ 0 & \text{otherwise}. \end{cases}$$

Let $W = \sum_{p=1}^P W_p$ and $X = \sum_{p=1}^P X_p$. For any $\beta$ in the range $0 < \beta < 1$, we have $\Pr \{X \geq \beta W\} > 1 - 3/((1 - \beta)e^2)$.

**Proof.** For each bin $p$, consider the random variable $W_p - X_p$. It takes on the value $W_p$ when 0 or 1 ball lands in bin $p$, and otherwise it is 0. Thus, we have

$$
\begin{aligned}
E[W_p - X_p] &= W_p \left( \left(1 - \frac{1}{P}\right)^{2P} + 2P \left(1 - \frac{1}{P}\right)^{2P-1} \left(\frac{1}{P}\right) \right) \\
&= W_p \left(1 - \frac{1}{P}\right)^{2P} \frac{3P - 1}{P-1}.
\end{aligned}
$$

Since $(1 - 1/P)^P$ approaches $1/e$ and $(3P - 1)/(P-1)$ approaches 3, we have $\lim_{P \to \infty} E[W_p - X_p] = 3W_p/e^2$. In fact, one can show that $E[W_p - X_p]$ is monotonically increasing, approaching the limit from below, and thus $E[W - X] \leq 3W/e^2$. By Markov’s inequality, we have that $\Pr \{(W - X) > (1 - \beta)W\} < E[W - X]/((1 - \beta)W)$, from which we conclude that $\Pr \{X < \beta W\} \leq 3/((1 - \beta)e^2)$. □
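As a numerical sanity check on this proof, the following sketch computes the probability that at most one of the $2P$ balls lands in a given bin, compares it with the closed form, and evaluates the resulting Markov bound. It is a floating-point illustration only, and the function names are assumptions introduced here.

```python
import math


def miss_probability(P):
    """Probability that 0 or 1 of 2P uniformly thrown balls land in one of P bins."""
    none = (1 - 1 / P) ** (2 * P)
    exactly_one = 2 * P * (1 / P) * (1 - 1 / P) ** (2 * P - 1)
    return none + exactly_one


def closed_form(P):
    """The simplified expression (1 - 1/P)^(2P) * (3P - 1) / (P - 1), for P >= 2."""
    return (1 - 1 / P) ** (2 * P) * (3 * P - 1) / (P - 1)


def success_lower_bound(beta):
    """Lower bound 1 - 3 / ((1 - beta) e^2) on Pr{X >= beta W}."""
    return 1 - 3 / ((1 - beta) * math.e ** 2)


LIMIT = 3 / math.e ** 2  # limiting value of E[W_p - X_p] / W_p
values = [miss_probability(P) for P in range(2, 201)]
```

For $P$ from 2 to 200 the computed values increase and stay below $3/e^2$, consistent with the monotonicity claim, and `success_lower_bound(0.5)` gives the constant $1 - 6/e^2$ used in Lemma 7.3.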

To use Lemma 7.2 to analyze PIPER, we divide the time steps of the execution of $G$ into a sequence of rounds, where each round (except the first, which starts at time 0) starts at the time step after the previous round ends and continues until the first time step such that at least $2P$ steal attempts (and hence fewer than $3P$ steal attempts) occur within the round. The following lemma shows that a constant fraction of the total potential in all deques is lost in each round, thereby demonstrating progress.

**Lemma 7.3.** Consider a pipeline program executed by PIPER on $P$ processors. Suppose that a round starts at time step $t$ and finishes at time step $t'$. Let $\phi$ denote the potential at time $t$, let $\phi'$ denote the potential at time $t'$, let $\Phi = \sum_{p=1}^P \phi(p)$, and let $\Phi' = \sum_{p=1}^P \phi'(p)$. Then we have $\Pr \{\Phi - \Phi' \geq \Phi/4\} > 1 - 6/e^2$.

**Proof.** We first show that stealing twice from a worker $p$’s deque contributes a potential drop of at least $\phi(p)/2$. The proof follows a similar case analysis to that in the proof of Lemma 8 in [Arora et al. 2001] with two main differences. First, we use the two properties of $\phi$ in Lemma 7.1. Second, we must consider the case unique to PIPER, where $p$ performs a tail swap after executing its assigned vertex $v_0$.

We first observe that, if $p$ is the target of at least 2 steal attempts, then PIPER’s actions on $p$’s extended deque between time steps $t$ and $t'$ contribute a potential drop of at least $\phi(p)/2$. Let $\langle v_0, v_1, \ldots, v_r \rangle$ denote the vertices on $p$’s extended deque at time $t$, and suppose that at least 2 steal attempts target $p$ between time step $t$ and time step $t'$.

- If $p$’s extended deque is empty, then $\phi(p) = 0$, and the statement holds trivially.
- If $r = 0$, then $p$’s extended deque consists solely of a vertex $v_0$ assigned to $p$, and $\phi(p) = \phi(v_0)$. By Lemma 7.1, executing $v_0$ decreases the potential by at least $5\phi(p)/9 \geq \phi(p)/2$.
- Suppose that $r > 1$. By time $t'$, both $v_r$ and $v_{r-1}$ have been removed from $p$’s deque, either by being stolen or by being assigned to $p$. In either case, Lemma 7.1 implies that the overall potential decreases by at least $2(\phi(v_r) + \phi(v_{r-1}))/3 \geq \phi(p)/2$.
- Suppose that $r = 1$. If executing $v_0$ results in $p$ performing a tail swap, then by Lemma 7.1, the potential drops by at least $2(\phi(v_0) + \phi(v_1))/3 \geq \phi(p)/2$, since $\phi(p) = \phi(v_0) + \phi(v_1)$. Otherwise, by Lemma 7.1, the execution of $v_0$ results in an overall decrease in potential of at least $5\phi(v_0)/9$, and because $p$ is the target of at least 2 steal attempts, $v_1$ is assigned by time $t'$, decreasing the potential by $2\phi(v_1)/3$. Together these two decreases amount to at least $\phi(p)/2$.

ACM Transactions on Parallel Computing, Vol. V, No. N, Article A, Publication date: January YYYY.

We now consider all \( P \) workers and \( 2P \) steal attempts between time steps \( t \) and \( t' \). We model these steal attempts as ball tosses in the experiment described in Lemma 7.2. Suppose that we assign each worker \( p \) a weight of \( W_p = \phi(p)/2 \). These weights \( W_p \) sum to \( W = \Phi/2 \). If we think of steal attempts as ball tosses, then the random variable \( X \) from Lemma 7.2 bounds from below the potential decrease due to actions on \( p \)'s deque. Specifically, if at least 2 steal attempts target \( p \)'s deque in a round (which corresponds conceptually to at least 2 balls landing in bin \( p \)), then the potential drops by at least \( W_p \). Moreover, \( X \) is a lower bound on the potential decrease within the round, i.e., \( X \leq \Phi - \Phi' \). By Lemma 7.2, we have \( \Pr \{ X \geq W/2 \} > 1 - 6/e^2 \). Substituting for \( X \) and \( W \), we conclude that \( \Pr \{ (\Phi - \Phi') \geq \Phi/4 \} > 1 - 6/e^2 \). □

We are now ready to prove the completion-time bound.

**Theorem 7.4.** Consider an execution of a pipeline program by PIPER on \( P \) processors which produces a computation dag with work \( T_1 \) and span \( T_\infty \). For any \( \epsilon > 0 \), the running time is \( T_P \leq T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon)) \) with probability at least \( 1 - \epsilon \).

**Proof.** On every time step, consider each worker as placing a token in a bucket depending on its action. If a worker \( p \) executes an assigned vertex, \( p \) places a token in the work bucket. Otherwise, \( p \) is a thief and places a token in the steal bucket. There are exactly \( T_1 \) tokens in the work bucket at the end of the computation. The interesting part is bounding the size of the steal bucket.

Divide the time steps of the execution of \( G \) into rounds. Recall that each round contains at least \( 2P \) and fewer than \( 3P \) steal attempts. Call a round successful if the potential drops by at least a \( 1/4 \) fraction during that round. From Lemma 7.3, a round is successful with probability at least \( 1 - 6/e^2 \geq 1/6 \). Since the potential starts at \( \Phi_0 = 3^{2T_\infty - 1} \), ends at 0, and is always an integer, the number of successful rounds is at most \( (2T_\infty - 1)\log_{4/3}(3) < 8T_\infty \). Consequently, the expected number of rounds needed to obtain \( 8T_\infty \) successful rounds is at most \( 48T_\infty \), and the expected number of tokens in the steal bucket is therefore at most \( 3P \cdot 48T_\infty = 144PT_\infty \).

For the high-probability bound, suppose that the execution takes \( n = 48T_\infty + m \) rounds. Because each round succeeds with probability at least \( p = 1/6 \), the expected number of successes is at least \( np = 8T_\infty + m/6 \). We now compute the probability that the number \( X \) of successes is less than \( 8T_\infty \). As in [Arora et al. 2001], we use the Chernoff bound \( \Pr \{ X < np - a \} < e^{-a^2/2np} \), with \( a = m/6 \). Choosing \( m = 48T_\infty + 24\ln(1/\epsilon) \), so that \( 16T_\infty \leq m/3 \), we have

\[
\Pr \{ X < 8T_\infty \} < e^{-\frac{(m/6)^2}{2np}} = e^{-\frac{(m/6)^2}{16T_\infty + m/3}} \leq e^{-\frac{(m/6)^2}{m/3 + m/3}} = e^{-m/24} \leq \epsilon.
\]

Hence, the probability that the execution takes \( n = 96T_\infty + 24\ln(1/\epsilon) \) rounds or more is less than \( \epsilon \), and with probability at least \( 1 - \epsilon \) the number of tokens in the steal bucket is at most \( 3Pn = 288PT_\infty + 72P\ln(1/\epsilon) \). Because the \( P \) workers together place \( P \) tokens on every time step, dividing the total number of tokens by \( P \) yields a running time of \( T_1/P + O(T_\infty + \lg(1/\epsilon)) \) before contention is taken into account.

The additional \( \lg P \) term comes from the “recycling game” analysis described in [Blumofe and Leiserson 1999], which bounds any delay that might be incurred when multiple processors try to access the same deque in the same time step in randomized work-stealing. □
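The constants in this proof can be checked numerically. The sketch below is an illustration under assumed inputs (the span and $\epsilon$ values in the usage line are arbitrary, and the function names are introduced here); it evaluates the bound on successful rounds, the choice of $m$, and the Chernoff failure probability.

```python
import math

SUCCESS_PROBABILITY = 1 - 6 / math.e ** 2  # per-round success bound from Lemma 7.3


def max_successful_rounds(t_inf):
    """Upper bound (2 T_inf - 1) * log_{4/3}(3) on the number of successful rounds."""
    return (2 * t_inf - 1) * math.log(3, 4 / 3)


def extra_rounds(t_inf, eps):
    """The choice m = 48 T_inf + 24 ln(1/eps) made in the proof."""
    return 48 * t_inf + 24 * math.log(1 / eps)


def chernoff_failure_bound(t_inf, m):
    """Chernoff bound exp(-a^2 / (2 n p)) with n = 48 T_inf + m, p = 1/6, a = m/6."""
    n = 48 * t_inf + m
    p = 1 / 6
    a = m / 6
    return math.exp(-a * a / (2 * n * p))


# Hypothetical inputs: span 10 and failure probability 0.01.
m = extra_rounds(10, 0.01)
failure = chernoff_failure_bound(10, m)
```

With these inputs the Chernoff bound is far below the target $\epsilon$, which reflects the slack introduced by rounding $1 - 6/e^2$ down to $1/6$ and $2\log_{4/3} 3$ up to 8.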

### 8. SPACE ANALYSIS OF PIPER

This section derives bounds on the stack space required by PIPER by extending the bounds in [Blumofe and Leiserson 1999] for fully strict fork-join parallelism to include pipeline parallelism. We show that PIPER on \( P \) processors uses \( S_P \leq P(S_1 + fDK) \) stack space for pipeline iterations, where \( S_1 \) is the serial stack space, \( f \) is the “frame size,” \( D \) is the depth of nested linear pipelines, and \( K \) is the throttling limit.

To model PIPER’s usage of stack space, we partition the vertices of a pipeline computation dag G into a tree of contours, in a similar manner to that described in Section 6. We assume that every contour c of G has an associated frame size representing the stack space consumed by c while it or any of its descendant contours are executing. The space used by PIPER on any time step is the sum of frame sizes of all contours c which are either active (c is associated with a vertex in some worker’s extended deque) or suspended (the earliest unexecuted vertex in the contour is not ready).

As Section 4 describes, the contours of a pipe_while loop do not directly correspond to the control and iteration frames allocated for the loop. In particular, as demonstrated in the code transformation in Figure 5, stage 0 allocates an iteration frame for all stages of the iteration and executes using that iteration frame. To account for the space used when executing an iteration i of a pipe_while loop, consider an active or suspended contour c, and let v be the vertex in c on a worker’s deque, if c is active, or the earliest unexecuted vertex in c, if c is suspended. If v lies on a path in c between a stage 0 node root a_{i,0} and its corresponding node terminal b_{i,0} for some iteration i of a pipe_while loop, then v incurs an additional space cost equal to the size of i’s iteration frame.

The following theorem bounds the stack space used by PIPER. Let S_P denote the maximum over all time steps of the stack space PIPER uses during a P-worker execution of G. Thus, S_1 is the stack space used by PIPER for a serial execution. Define the pipe nesting depth D of G as the maximum number of pipe_while contours on any path from leaf to root in the contour tree. The following theorem generalizes the space bound S_P ≤ PS_1 from [Blumofe and Leiserson 1999], which deals only with fork-join parallelism, to pipeline programs.

**Theorem 8.1.** Consider a pipeline program with pipe nesting depth D executed on P processors by PIPER with throttling limit K. The execution requires S_P ≤ P(S_1 + fDK) stack space, where f is the maximum frame size of any contour of any pipe_while iteration and S_1 is the serial stack space.

**Proof.** We show that, at each time step during PIPER’s execution of a pipeline program, PIPER satisfies a variant of the “busy-leaves property” [Blumofe and Leiserson 1999] with respect to the tree of active and suspended contours during that time step. This proof follows a similar induction to that in [Blumofe and Leiserson 1999, Thm. 3], with one change, namely, that a leaf contour c may stall if the earliest unexecuted vertex in c is the destination of a cross edge.

If v is the destination of a cross edge, then v must be associated with some pipe_while loop, specifically, as a node root a_{i,j} for some iteration i in the loop. The source of this cross edge must be in a previous iteration of the same pipe_while loop. Consider the leftmost iteration i’ < i of this pipe_while loop, that is, the iteration with the smallest iteration index that has not completed. By definition of the leftmost iteration, all previous iterations in this pipe_while have completed, and thus no node root a_{i’,j} in iteration i’ may be the destination of a cross edge whose source has not executed. In other words, the contour associated with this leftmost iteration must be active or have an active descendant. Consequently, each suspended contour c is associated with a pipe_while loop, and for each such contour c, there exists an active sibling contour associated with the same pipe_while loop.

We account for the stack space PIPER uses by separately considering the active and suspended leaf contours. Because there are P processors executing the computation, at each time step, PIPER has at most P active leaf contours, each of which may use at most S_1 stack space. Each suspended leaf contour, meanwhile, is associated with a pipe_while loop, whose leftmost iteration during this time step either is active or has an active descendant. Any pipe_while loop uses at most fK stack space to execute its iterations, because the throttling edge from the terminal of the leftmost iteration precludes having more than K active or suspended iterations in any one pipe_while loop. Thus, for each vertex in its deque that is associated with an iteration of a pipe_while loop, each worker p accounts for at most fK stack space in suspended sibling contours for iterations of the same pipe_while loop, or fDK stack space overall. Summing the stack space used over all workers gives PfDK additional stack-space usage. □
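As a small worked illustration of Theorem 8.1, the sketch below evaluates the bound $P(S_1 + fDK)$. The function name and all numeric inputs are hypothetical and chosen only to show the arithmetic; they are not measurements from Cilk-P.

```python
def stack_space_bound(P, S1, f, D, K):
    """Upper bound P * (S1 + f * D * K) on stack space from Theorem 8.1.

    P: number of processors, S1: serial stack space, f: maximum iteration
    frame size, D: pipe nesting depth, K: throttling limit.
    """
    return P * (S1 + f * D * K)


# Hypothetical inputs: 16 workers, 8 KiB serial stack, 256-byte frames,
# nesting depth 2, and a throttling limit of K = 4P = 64.
bound_bytes = stack_space_bound(P=16, S1=8192, f=256, D=2, K=4 * 16)
```

Note that when the throttling limit is chosen proportional to $P$, as in this example, the $PfDK$ term grows quadratically with the number of processors.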
### 9. CILK-P RUNTIME DESIGN

This section describes the Cilk-P implementation of the PIPER scheduler. We first introduce the data structures Cilk-P uses to implement a pipe_while loop. Then we describe the two main optimizations that the Cilk-P runtime exploits: lazy enabling and dynamic dependency folding.

Data structures

Like the Cilk-M runtime [Lee et al. 2010] on which it is based, Cilk-P organizes runtime data into frames. The transformed code in Figure 5 reflects the frames Cilk-P allocates to execute a pipe_while loop. In particular, Cilk-P executes a pipe_while loop in its own control frame, which handles the spawning and throttling of iterations. Furthermore, each iteration of a pipe_while loop executes as an independent child function, with its own iteration frame. This frame structure is similar to that of an ordinary while loop in Cilk-M, where each iteration spawns a function to execute the loop body. Cross and throttling edges, however, may cause the iteration and control frames to suspend.

Cilk-P’s runtime employs a simple mechanism to track progress of an iteration $i$. As seen in Figure 5, the frame of iteration $i$ maintains a stage counter, which stores the stage number of the current node in $i$. In addition, the iteration $i$’s frame maintains a status field, which indicates whether $i$ is suspended due to an unsatisfied cross edge. Because executed nodes in an iteration $i$ have strictly increasing stage numbers, checking whether a cross edge into iteration $i$ is satisfied amounts to comparing the stage counters of iterations $i$ and $i-1$. Any iteration frame that is not suspended corresponds to either a currently executing or a completed iteration.

Cilk-P implements throttling using a join counter in the control frame. Normally in Cilk-M, a frame’s join counter simply stores the number of active child frames. Cilk-P also uses the join counter to limit the number of active iteration frames in a pipe_while loop to the throttling limit $K$. Starting an iteration increments the join counter, while returning from an iteration decrements it. If a worker tries to start a new iteration when the control frame’s join counter is $K$, the control frame suspends until a child iteration returns.
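The join-counter mechanism described above can be sketched as follows. This is a single-threaded model for illustration only: the class and method names are hypothetical, and a real runtime would update these fields atomically (for example under a frame lock) rather than as plain attribute writes.

```python
class ControlFrame:
    """Toy model of a pipe_while control frame that throttles its iterations."""

    def __init__(self, throttle_limit):
        self.throttle_limit = throttle_limit  # the throttling limit K
        self.join_counter = 0                 # number of active iteration frames
        self.suspended = False                # True while waiting for a child to return

    def try_start_iteration(self):
        """Start a new iteration if fewer than K are active; otherwise suspend."""
        if self.join_counter >= self.throttle_limit:
            self.suspended = True
            return False
        self.join_counter += 1
        return True

    def iteration_returned(self):
        """Record a returning iteration; report whether the frame was resumed."""
        self.join_counter -= 1
        resumed = self.suspended
        self.suspended = False
        return resumed


# Hypothetical usage with K = 2.
frame = ControlFrame(throttle_limit=2)
started = [frame.try_start_iteration() for _ in range(3)]
```

In this run the first two iterations start and the third attempt suspends the control frame; a later call to `iteration_returned()` resumes it, mirroring the suspend-until-return behavior in the text.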

Using these data structures, one could implement PIPER directly, by pushing and popping the appropriate frames onto deques as specified by PIPER’s execution model. In particular, the normal THE protocol [Frigo et al. 1998] could be used for pushing and popping frames from a deque, and frame locks could be used to update fields in the frames atomically. Although this approach directly matches the model analyzed in Sections 7 and 8, it incurs unnecessary overhead for every node in an iteration. Cilk-P implements lazy enabling and dynamic dependency folding to reduce this overhead.

Lazy enabling

In the PIPER algorithm, when a worker $p$ finishes executing a node in iteration $i$, it may enable an instruction in iteration $i+1$, in which case $p$ pushes this instruction onto its deque. To implement this behavior, intuitively, $p$ must check right — read the stage counter and status of iteration $i+1$ — whenever it finishes executing a node. The work to check right at the end of every node could amount to substantial overhead in a pipeline with fine-grained stages.

Lazy enabling allows $p$’s execution of an iteration $i$ to defer the check-right operation, as well as avoid any operations on its deque involving iteration $i+1$. Conceptually, when $p$ enables work in iteration $i+1$, this work is kept on $p$’s deque implicitly. When a thief $q$ tries to steal iteration $i$’s frame from $p$’s deque, $q$ first checks right on behalf of $p$ to see whether any work from iteration $i+1$ is implicitly on the deque. If so, $q$ resumes iteration $i+1$ as if it had found it on $p$’s deque. In a similar vein, the Cilk-P runtime system also uses lazy enabling to optimize the check-parent operation — the enabling of a control frame suspended due to throttling.

Lazy enabling requires $p$ to behave differently when $p$ completes an iteration. When $p$ finishes iteration $i$, it first checks right, and if that fails (because iteration $i+1$ need not be resumed), it checks its parent. It turns out that these checks find work only if $p$’s deque is empty, that is, if all other work on $p$’s deque has been stolen. Therefore, $p$ can avoid performing these checks at the end of an iteration if its deque is not empty.

Lazy enabling is an application of the work-first principle [Frigo et al. 1998]: minimize the scheduling overheads borne by the work of a computation, and amortize them against the span. Requiring a worker to check right every time it completes a node adds overhead proportional to the work of the pipe_while in the worst case. With lazy enabling, the overhead can be amortized against the span of the computation. For programs with sufficient parallelism, the work dominates the span, and the overhead becomes negligible.

Dynamic dependency folding

In dynamic dependency folding, the frame for iteration i stores a cached value of the stage counter of iteration i−1, hoping to avoid checking already satisfied cross edges. In a straightforward implementation of PIPER, before a worker p executes each node in iteration i with an incoming cross edge, it reads the stage counter of iteration i−1 to see if the cross edge is satisfied. Reading the stage counter of iteration i−1, however, can be expensive. Besides the work involved, the access may contend with whatever worker q is executing iteration i−1, because q may be constantly updating the stage counter of iteration i−1.

Dynamic dependency folding mitigates this overhead by exploiting the fact that an iteration’s stage counter must strictly increase. By caching the most recently read stage-counter value from iteration i−1, worker p can sometimes avoid reading this stage counter before each node with an incoming cross edge. For instance, if q finishes executing a node (i−1,j), then all cross edges from nodes (i−1,0) through (i−1,j) are necessarily satisfied. Thus, if p reads j from iteration i−1’s stage counter, p need not reread the stage counter of i−1 until it tries to execute a node with an incoming cross edge (i,k) where k > j. This optimization is particularly useful for fine-grained stages that execute quickly.
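The caching idea can be sketched in a few lines. This is a simplified single-threaded model, not the Cilk-P implementation: the class, its fields, and the `remote_reads` counter are hypothetical, and the stage counter is modeled as the highest stage that an iteration has finished.

```python
class IterationFrame:
    """Toy iteration frame that folds cross-edge checks using a cached value."""

    def __init__(self, left=None):
        self.left = left             # frame of iteration i-1, or None for the first
        self.finished_stage = -1     # highest stage finished so far (simplified counter)
        self.cached_left_stage = -1  # last value read from left.finished_stage
        self.remote_reads = 0        # how often the neighbor's counter was read

    def cross_edge_satisfied(self, k):
        """Return True if the cross edge into stage k of this iteration is satisfied."""
        if self.left is None:
            return True
        if k <= self.cached_left_stage:
            return True              # folded: the cached value already covers stage k
        self.remote_reads += 1       # otherwise reread the neighbor's stage counter
        self.cached_left_stage = self.left.finished_stage
        return k <= self.cached_left_stage


# Hypothetical usage: iteration i-1 has finished stages 0 through 5.
left = IterationFrame()
left.finished_stage = 5
cur = IterationFrame(left)
results = [cur.cross_edge_satisfied(k) for k in range(6)]
```

Here six cross-edge checks cost a single read of the neighboring frame, because the counter only increases and the cached value 5 already covers stages 0 through 5.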

### 10. EVALUATION

This section presents empirical studies of the Cilk-P prototype system. We investigated the performance and scalability of Cilk-P using the three PARSEC [Bienia et al. 2008; Bienia and Li 2010] benchmarks that we ported, namely ferret, dedup, and x264. The results show that Cilk-P’s implementation of pipeline parallelism has negligible overhead compared to its serial counterpart. We compared the Cilk-P implementations to TBB and Pthreaded implementations of these benchmarks. We found that the Cilk-P and TBB implementations perform comparably, as do the Cilk-P and Pthreaded implementations for ferret and x264. The Pthreaded version of dedup outperforms both Cilk-P and TBB, because the bind-to-element approaches of Cilk-P and TBB produce less parallelism than the Pthreaded bind-to-stage approach. Moreover, the Pthreaded approach benefits more from “oversubscription.” We study the effectiveness of dynamic dependency folding on a synthetic benchmark called pipe-fib, demonstrating that this optimization can be effective for applications with fine-grained stages.

We ran two sets of experiments on two different machines. The first set was collected on an AMD Opteron system with four 2-GHz quad-core CPUs having a total of 8 GBytes of memory. Each processor core has a 64-KByte private L1-data-cache and a 512-KByte private L2-cache. The 4 cores on each chip share the same MByte L3-cache. The benchmarks were compiled with GCC (or G++ for TBB) 4.4.5 using -O3 optimization, except for x264, which by default comes with -O4. The machine was running a custom version of Debian 6.08 (squeeze) modified by MIT CSAIL, using a custom Linux kernel 3.4.0, patched with support for thread-local memory mapping for Cilk-M [Lee et al. 2010]. The second set was collected on an Intel Xeon E5 system with two 2.4-GHz 8-core CPUs having a total of 32 GBytes of memory. Each processor core has a 32-KByte private L1-data-cache and a 256-KByte private L2-cache. The 8 cores on each chip share the same 2-MByte L3-cache. The benchmarks were compiled with GCC (or G++ for TBB) 4.6.3 using -O3 optimization, except for x264, which by default comes with -O4. The machine was running Fedora 16, using a custom Linux kernel 3.6.11, also patched with support for thread-local memory mapping for Cilk-M.

Fig. 8. Performance comparison of the three ferret implementations running on the AMD Opteron system. The experiments were conducted using native, the largest input data set that comes with the PARSEC benchmark suite. The left-most column shows the number of cores used (P). Subsequent columns show the running time (T_P), speedup over serial running time (T_S / T_P), and scalability (T_1 / T_P) for each system. The throttling limit was K = 10P.

Performance evaluation on PARSEC benchmarks

We implemented the Cilk-P versions of the three PARSEC benchmarks by hand-compiling the relevant pipe_while loops using techniques similar to those described in [Lee et al. 2010]. We then compiled the hand-compiled benchmarks with GCC. The ferret and dedup applications can be parallelized as simple pipelines with a fixed number of stages and a static dependency structure. In particular, ferret uses the 3-stage SPS pipeline shown in Figure 1, while dedup uses a 4-stage SPS pipeline as described in Figure 4.

For the Pthreaded versions, we used the code distributed with PARSEC. The PARSEC Pthreaded implementations of ferret and dedup employ the oversubscription method [Reed et al. 2011], a bind-to-stage approach that creates more than one thread per pipeline stage and utilizes the operating system for load balancing. For the Pthreaded implementations, when the user specifies an input parameter of Q, the code creates Q threads per stage, except for the first (input) and last (output) stages which are serial and use only one thread each. To ensure a fair comparison, for all applications, we ran the Pthreaded implementation using taskset to limit the process to P cores (which corresponds to the number of workers used in Cilk-P and TBB), but experimented to find the best setting for Q.

We used the TBB version of ferret that came with the PARSEC benchmark, and implemented the TBB version of dedup, both using the same strategies as for Cilk-P. TBB’s construct-and-run approach proved inadequate for the on-the-fly nature of x264, however, and indeed, in their study of these three applications, Reed, Chen, and Johnson [Reed et al. 2011] say, “Implementing x264 in TBB is not impossible, but the TBB pipeline structure is not suitable.” Thus, we had no TBB benchmark for x264 to include in our comparisons.

For each benchmark, we throttled all versions similarly, unless specified otherwise in the figure captions. For Cilk-P on the AMD Opteron system, we used the default throttling limit of 4P, where P is the number of cores. This default value seems to work well in general, although since ferret scales slightly better with less throttling, we used a throttling limit of 10P for ferret in our experiments. Similarly, for Cilk-P on the Intel Xeon system, a throttling limit of 10P seems to work well, and thus we used a throttling limit of 10P for the experiments on the Intel Xeon system. TBB supports a settable parameter that serves the same purpose as Cilk-P’s throttling limit. For the Pthreaded implementations, we throttled the computation by setting a size limit on the queues between stages, although we did not impose a queue size limit on the last stage of dedup (the default limit is 2^30), since doing so causes the program to deadlock.
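The effect of a throttling limit can be illustrated with a small sketch. The Python code below is not how Cilk-P, TBB, or the PARSEC Pthreaded code implement throttling. It assumes a generic thread pool and uses a semaphore so that at most `limit` iterations have been started but not yet finished at any time. The names `run_throttled` and `work` are hypothetical and chosen only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(num_iterations, num_workers, limit, work):
    """Run work(i) for every iteration, keeping at most `limit` in flight.

    Returns the peak number of iterations that were in flight at once.
    This is an illustrative model, not the Cilk-P runtime.
    """
    slots = threading.Semaphore(limit)  # one slot per in-flight iteration
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def iteration(i):
        nonlocal in_flight
        try:
            work(i)
        finally:
            with lock:
                in_flight -= 1
            slots.release()             # allow a new iteration to start

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for i in range(num_iterations):
            slots.acquire()             # blocks while `limit` iterations are active
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            pool.submit(iteration, i)
    return peak
```

With `num_workers = P` and `limit = 4 * P`, the sketch mirrors the shape of the default setting described above: new iterations are admitted only while fewer than 4P are outstanding.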

<table>
<thead>
<tr>
<th rowspan="2">P</th>
<th colspan="3">Processing Time (T_P)</th>
<th rowspan="2">Speedup (T_S / T_P)</th>
<th rowspan="2">Scalability (T_1 / T_P)</th>
</tr>
<tr>
<th>Cilk-P</th>
<th>Pthreads</th>
<th>TBB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>432.4</td>
<td>430.0</td>
<td>432.7</td>
</tr>
<tr>
<td>2</td>
<td>220.4</td>
<td>212.2</td>
<td>223.8</td>
</tr>
<tr>
<td>3</td>
<td>146.9</td>
<td>140.8</td>
<td>147.0</td>
</tr>
<tr>
<td>4</td>
<td>111.5</td>
<td>106.0</td>
<td>111.8</td>
</tr>
<tr>
<td>5</td>
<td>89.2</td>
<td>89.9</td>
<td>90.8</td>
</tr>
<tr>
<td>6</td>
<td>74.8</td>
<td>73.8</td>
<td>76.1</td>
</tr>
<tr>
<td>7</td>
<td>64.7</td>
<td>64.2</td>
<td>65.9</td>
</tr>
<tr>
<td>8</td>
<td>57.3</td>
<td>57.0</td>
<td>57.7</td>
</tr>
<tr>
<td>9</td>
<td>51.1</td>
<td>49.8</td>
<td>52.9</td>
</tr>
<tr>
<td>10</td>
<td>46.4</td>
<td>45.5</td>
<td>47.3</td>
</tr>
<tr>
<td>11</td>
<td>42.5</td>
<td>41.7</td>
<td>43.2</td>
</tr>
<tr>
<td>12</td>
<td>39.4</td>
<td>38.6</td>
<td>40.0</td>
</tr>
<tr>
<td>13</td>
<td>36.6</td>
<td>37.2</td>
<td>37.6</td>
</tr>
<tr>
<td>14</td>
<td>34.4</td>
<td>35.0</td>
<td>35.3</td>
</tr>
<tr>
<td>15</td>
<td>32.2</td>
<td>32.9</td>
<td>33.5</td>
</tr>
</tbody>
</table>

were conducted using native, the largest input data set that comes with the PARSEC benchmark suite. The column headers are the same as in Figure 8. The throttling limit was $K = 4P$.

Figures 8–10 show the performance results for the different implementations of the three benchmarks running on the AMD Opteron system. Each data point in the study was computed by averaging the results of 10 runs. The standard deviation of the numbers was typically just a few percent, indicating that the numbers should be accurate to within better than 10 percent with high confidence (2 or 3 standard deviations). We suspect that the superlinear scalability obtained for some measurements is due to the fact that more L1- and L2-cache is available when running on multiple cores.

The three tables from Figures 8–10 show that the Cilk-P and TBB implementations of ferret and dedup are comparable, indicating that there is no performance penalty incurred by these applications for using the more general on-the-fly pipeline instead of a construct-and-run pipeline. Recall that both Cilk-P and TBB execute using a bind-to-element approach.

The dedup performance results for Cilk-P and TBB are inferior to those for Pthreads, however. The Pthreaded implementation scales to about 8.5 on 16 cores, whereas Cilk-P and TBB seem to plateau at around 6.7. There appear to be two reasons for this discrepancy.

We dropped four out of the 3500 input images from the original native data set, because those images are black-and-white, which triggers an array index out-of-bounds error in the image library provided.

<table>
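<thead>
<tr>
<th rowspan="2">P</th>
<th colspan="3">Processing Time (T_P)</th>
<th colspan="3">Speedup (T_S / T_P)</th>
<th colspan="3">Scalability (T_1 / T_P)</th>
</tr>
<tr>
<th>Cilk-P</th>
<th>Pthreads</th>
<th>TBB</th>
<th>Cilk-P</th>
<th>Pthreads</th>
<th>TBB</th>
<th>Cilk-P</th>
<th>Pthreads</th>
<th>TBB</th>
</tr>
</thead>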
<tbody>
<tr>
<td>1</td>
<td>153.2</td>
<td>152.5</td>
<td>151.5</td>
<td>1.04</td>
<td>1.04</td>
<td>1.05</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>77.5</td>
<td>89.7</td>
<td>77.0</td>
<td>2.05</td>
<td>1.77</td>
<td>2.06</td>
<td>1.98</td>
<td>1.70</td>
<td>1.97</td>
</tr>
<tr>
<td>3</td>
<td>51.8</td>
<td>56.9</td>
<td>53.5</td>
<td>3.07</td>
<td>2.79</td>
<td>2.97</td>
<td>2.96</td>
<td>2.68</td>
<td>2.83</td>
</tr>
<tr>
<td>4</td>
<td>39.9</td>
<td>42.8</td>
<td>40.0</td>
<td>3.99</td>
<td>3.72</td>
<td>3.98</td>
<td>3.84</td>
<td>3.57</td>
<td>3.79</td>
</tr>
<tr>
<td>5</td>
<td>32.3</td>
<td>34.4</td>
<td>32.5</td>
<td>4.93</td>
<td>4.63</td>
<td>4.89</td>
<td>4.75</td>
<td>4.44</td>
<td>4.66</td>
</tr>
<tr>
<td>6</td>
<td>27.4</td>
<td>29.7</td>
<td>27.6</td>
<td>5.81</td>
<td>5.36</td>
<td>5.75</td>
<td>5.59</td>
<td>5.14</td>
<td>5.48</td>
</tr>
<tr>
<td>7</td>
<td>23.5</td>
<td>25.7</td>
<td>24.0</td>
<td>6.76</td>
<td>6.18</td>
<td>6.61</td>
<td>6.51</td>
<td>5.93</td>
<td>6.30</td>
</tr>
<tr>
<td>8</td>
<td>21.0</td>
<td>22.7</td>
<td>21.3</td>
<td>7.57</td>
<td>7.00</td>
<td>7.45</td>
<td>7.29</td>
<td>6.72</td>
<td>7.10</td>
</tr>
<tr>
<td>9</td>
<td>19.0</td>
<td>21.8</td>
<td>19.4</td>
<td>8.37</td>
<td>7.31</td>
<td>8.20</td>
<td>8.07</td>
<td>7.01</td>
<td>7.81</td>
</tr>
<tr>
<td>10</td>
<td>17.3</td>
<td>18.8</td>
<td>17.8</td>
<td>9.19</td>
<td>8.46</td>
<td>8.94</td>
<td>8.86</td>
<td>8.11</td>
<td>8.52</td>
</tr>
<tr>
<td>11</td>
<td>15.8</td>
<td>17.3</td>
<td>16.4</td>
<td>10.05</td>
<td>9.20</td>
<td>9.68</td>
<td>9.68</td>
<td>8.82</td>
<td>9.23</td>
</tr>
<tr>
<td>12</td>
<td>14.7</td>
<td>15.8</td>
<td>15.3</td>
<td>10.81</td>
<td>10.08</td>
<td>10.37</td>
<td>10.42</td>
<td>9.67</td>
<td>9.88</td>
</tr>
<tr>
<td>13</td>
<td>13.8</td>
<td>14.6</td>
<td>14.5</td>
<td>11.52</td>
<td>10.88</td>
<td>10.98</td>
<td>11.10</td>
<td>10.43</td>
<td>10.47</td>
</tr>
<tr>
<td>14</td>
<td>13.0</td>
<td>13.7</td>
<td>13.8</td>
<td>12.20</td>
<td>11.64</td>
<td>11.56</td>
<td>11.76</td>
<td>11.17</td>
<td>11.02</td>
</tr>
<tr>
<td>15</td>
<td>12.2</td>
<td>12.9</td>
<td>13.1</td>
<td>12.98</td>
<td>12.30</td>
<td>12.15</td>
<td>12.51</td>
<td>11.80</td>
<td>11.58</td>
</tr>
<tr>
<td>16</td>
<td>11.6</td>
<td>12.2</td>
<td>12.5</td>
<td>13.77</td>
<td>13.02</td>
<td>12.70</td>
<td>13.26</td>
<td>12.48</td>
<td>12.10</td>
</tr>
</tbody>
</table>

Fig. 11. Performance comparison of the three ferret implementations running on the Intel Xeon system. The experiments were conducted using native, and the column headers are the same as in Figure 8. The throttling limit was $K = 10P$.
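As a reading aid for the table columns, the following Python sketch recomputes scalability from the printed running times. It uses a few Cilk-P entries from Figure 11. The function names are hypothetical, and because the printed times are rounded, recomputed ratios can differ slightly from the printed ones. The serial time T_S is not listed in the table, so `speedup` takes it as a parameter rather than assuming a value.

```python
def speedup(t_serial, t_p):
    """Speedup T_S / T_P relative to the serial program."""
    return t_serial / t_p


def scalability(t_1, t_p):
    """Scalability T_1 / T_P relative to the one-worker parallel run."""
    return t_1 / t_p


# Cilk-P running times T_P for ferret, copied from Figure 11.
cilk_p_times = {1: 153.2, 2: 77.5, 4: 39.9, 8: 21.0, 16: 11.6}

# Scalability for each listed worker count, rounded to two decimals.
cilk_p_scalability = {
    p: round(scalability(cilk_p_times[1], t_p), 2)
    for p, t_p in cilk_p_times.items()
}
```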

First, the dedup benchmark on the test input has limited parallelism. We modified the Cilkview scalability analyzer [He et al. 2010] to measure the work and span of our hand-compiled Cilk-P dedup programs, and we measured the parallelism of dedup to be merely 7.4. The bind-to-stage Pthreaded implementation creates a pipeline whose structure differs from that of the bind-to-element Cilk-P and TBB versions, and this pipeline enjoys slightly more parallelism.

Second, since file I/O is the main performance bottleneck for dedup, the Pthreaded implementation effectively benefits from oversubscription (using more threads than processing cores) and its strategic allocation of threads to stages. Specifically, since the first and last stages perform file I/O, which is inherently serial, the Pthreaded implementation dedicates one thread to each of these stages, but dedicates multiple threads to the other compute-intensive stages. While the writing thread is performing file I/O to write data out to the disk, the OS may deschedule it, allowing the compute-intensive threads to be scheduled. This behavior explains how the Pthreaded implementation scales by more than a factor of $P$ for $P = 1$ through $4$, even though the computation is restricted to only $P$ cores using taskset. Moreover, when we ran the Pthreaded implementation without throttling on a single core, the computation ran about 20% faster than the original serial implementation of dedup.

This performance boost may be explained if the computation and file I/O operations are effectively overlapped. With multiple threads per stage, however, the Pthreaded implementation performs better with throttling, since throttling appears to inhibit threads working on stages that are further ahead, allowing threads working on heavier stages to obtain more processing resources and thereby balancing the load.

Figures 11–13 show the performance results for the different implementations of the three benchmarks running on the Intel Xeon system. The relative performance of the three implementations for each benchmark follows a trend similar to that in the results from the AMD Opteron system. The two sets of results differ in the following ways. First, the serial running time across implementations for each benchmark is about 2–3 times shorter on the Intel Xeon system than on the AMD Opteron system. This discrepancy in serial running times can be explained by the fact that the Intel Xeon processors have a higher clock frequency and that the system overall has more memory. In addition, the memory bandwidth on the Intel system is about 2–3 times higher than on the AMD system, depending on the data size accessed by the computation. Second, we observed less speedup from dedup across the three implementations on the Intel system than on the AMD system. The reason for this smaller speedup is that the number of last-level cache misses increases substantially on the Intel system between a serial execution and a parallel execution. This increase in the number of cache misses causes the time spent in the user code to double when running on 16 processors compared with running serially. The AMD system, on the other hand, does not appear to exhibit the same cache behavior. While we suspect that the culprit is false or true sharing on some memory location, our measurements unfortunately suggest that these additional cache misses come mostly from the compress library used by dedup, for which we lack the source code.

---

5 We measured the memory latencies using the latest software release of lmbench [McVoy and Staelin 1996] at the time of printing, that is, version 3.0-a9.

Fig. 12. Performance comparison of the three dedup implementations running on the Intel Xeon system. The experiments were conducted using native, and the column headers are the same as in Figure 8. The throttling limit was \( K = 10P \) for Cilk-P and TBB, and \( 4P \) for Pthreads, because the Cilk-P and TBB implementations performed better with a throttling limit of \( 10P \) than \( 4P \), whereas the Pthreaded implementation was the other way around.

Fig. 13. Performance comparison between the Cilk-P implementation and the Pthreaded implementation of x264 (encoding only) running on the Intel Xeon system. The experiments were conducted using native, and the column headers are the same as in Figure 8. The throttling limit was \( K = 10P \) for Cilk-P and \( 4P \) for Pthreads, because the Cilk-P implementation performed better with a throttling limit of \( 10P \) than \( 4P \), whereas the Pthreaded implementation was the other way around.

In summary, Cilk-P performs comparably to TBB while admitting more expressive semantics for pipelines. Cilk-P also performs comparably to the Pthreaded implementations of ferret and x264, although its bind-to-element strategy seems to suffer on dedup compared to the bind-to-stage strategy of the Pthreaded implementation. Despite losing the dedup “bake-off,” Cilk-P’s strategy has the significant advantage that it allows pipelines to be expressed as deterministic programs. Determinism greatly reduces the effort for debugging, release engineering, and maintenance (see, for example, [Bocchino, Jr. et al. 2009]) compared with the inherently nondeterministic code required to set up Pthreaded pipelines.

**Evaluation of dynamic dependency folding**

We also studied the effectiveness of dynamic dependency folding. Since the PARSEC benchmarks are too coarse grained to permit such a study, we implemented a synthetic benchmark, called pipe-fib, to study this optimization technique.

<table>
<thead>
<tr>
<th>Program</th>
<th>Folding</th>
<th>Serial Overhead</th>
<th>Speedup</th>
<th>Scalability</th>
</tr>
</thead>
<tbody>
<tr>
<td>pipe-fib</td>
<td>no</td>
<td>20.7</td>
<td>4.13</td>
<td>4.8</td>
</tr>
<tr>
<td>pipe-fib-256</td>
<td>no</td>
<td>20.7</td>
<td>1.13</td>
<td>12.32</td>
</tr>
<tr>
<td>pipe-fib</td>
<td>yes</td>
<td>20.7</td>
<td>1.05</td>
<td>11.65</td>
</tr>
<tr>
<td>pipe-fib-256</td>
<td>yes</td>
<td>20.7</td>
<td>1.05</td>
<td>12.43</td>
</tr>
</tbody>
</table>

Fig. 14. Performance evaluation using the pipe-fib benchmark on the AMD Opteron system. We tested the Cilk-P system with two different programs, the ordinary pipe-fib and pipe-fib-256, which is coarsened. Each program is tested with and without the dynamic dependency folding optimization. For each program in a given setting, we show the running time of its serial counterpart (\( T_S \)), its running time on a single worker (\( T_1 \)), its running time on 16 workers (\( T_{16} \)), and its serial overhead, scalability, and speedup obtained running on 16 workers.

The pipe-fib benchmark computes the \( n \)th Fibonacci number \( F_n \) in binary. It uses a pipeline algorithm that operates in \( \Theta(n^2) \) work and \( \Theta(n) \) span. To construct the base case, pipe-fib allocates three arrays of size \( \Theta(n) \) and initializes the first two arrays with the binary representations of \( F_1 \) and \( F_2 \), both of which are 1. To compute \( F_3 \), pipe-fib performs a ripple-carry addition on the two input arrays and stores the sum into the third output array. To compute \( F_n \), pipe-fib repeats the addition by rotating through the arrays for inputs and output until it reaches \( F_n \). In the pipeline for this computation, each iteration \( i \) computes \( F_{i+2} \), and a stage \( j \) within the iteration computes the \( j \)th bit of \( F_{i+2} \). Since the benchmark stops propagating the carry bit as soon as possible, it generates a triangular pipeline dag in which the number of stages increases with iteration number. Given that each stage in pipe-fib starts with a pipe_stage_wait, and each stage contains little work, it serves as an excellent microbenchmark to study the overhead of pipe_stage_wait.
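The addition scheme behind pipe-fib can be sketched serially in Python. This is an illustration of the description above, not the benchmark's Cilk-P source. The function name `pipe_fib_serial` is hypothetical, bits are stored least-significant first, and the early-stop rule (stop once both operands are exhausted and no carry remains) is one plausible reading of "stops propagating the carry bit as soon as possible". Each pass of the outer loop corresponds to one pipeline iteration, and each pass of the inner loop to one stage.

```python
def pipe_fib_serial(n):
    """Compute F_n in binary with three rotating bit arrays.

    Returns (F_n, stages), where stages[k] is the number of bit positions
    (pipeline stages) touched by the k-th addition, starting with F_3.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    nbits = n                              # F_n < 2^n, so n bits suffice
    arrays = [[0] * nbits for _ in range(3)]
    arrays[0][0] = 1                       # F_1 = 1
    arrays[1][0] = 1                       # F_2 = 1
    lengths = [1, 1, 0]                    # significant bits held by each array
    stages = []
    for i in range(3, n + 1):
        # F_i is stored in array (i - 1) % 3; its inputs are F_{i-2} and F_{i-1}.
        a, b, out = (i - 3) % 3, (i - 2) % 3, (i - 1) % 3
        top = max(lengths[a], lengths[b])
        carry = 0
        j = 0
        while j < top or carry:            # stop as soon as nothing is left to add
            total = arrays[a][j] + arrays[b][j] + carry
            arrays[out][j] = total & 1
            carry = total >> 1
            j += 1
        lengths[out] = j
        stages.append(j)
    result = arrays[(n - 1) % 3]
    value = sum(bit << pos for pos, bit in enumerate(result))
    return value, stages
```

The returned `stages` list never decreases, which matches the triangular shape of the pipeline dag: later iterations touch at least as many bit positions as earlier ones.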

Figure 14 shows the performance results on the AMD Opteron system, obtained by running the ordinary pipe-fib with fine-grained stages, as well as pipe-fib-256, a coarsened version of pipe-fib in which each stage computes 256 bits instead of 1. As the data in the first row show, even though the serial overhead for pipe-fib without coarsening is merely 13%, it fails to scale and exhibits poor speedup. The reason is that checking for dependencies due to cross edges has a relatively high overhead compared to the little work in each fine-grained stage. As the data for pipe-fib-256 in the second row show, coarsening the stages improves both serial overhead and scalability. Ideally, one would like the system to coarsen automatically, which is what dynamic dependency folding effectively does.

Further investigation revealed that the time spent checking for cross edges increases noticeably when the number of workers increases from 1 to 2. It turns out that when iterations are run in parallel, each check for a cross-edge dependency necessarily incurs a true-sharing conflict between the two adjacent active iterations, an overhead that occurs only during parallel execution. Dynamic dependency folding eliminated much of this overhead for pipe-fib, as shown in the third row of Figure 14, leading to scalability that is much closer to the coarsened version without the optimization, although a slight price is still paid in speedup. Employing both optimizations, as shown in the last row of the table, produces the best numbers for both speedup and scalability.
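To make the saving concrete, the following Python sketch counts how often an iteration must read its predecessor's shared progress counter. It is a simplified model, not the Cilk-P runtime. It assumes that folding works by letting an iteration keep a private copy of the most recently observed value of the predecessor's stage counter and skip the shared read whenever that copy already shows the dependency to be satisfied. The function name `count_shared_reads` and the `pred_progress` input are hypothetical.

```python
def count_shared_reads(pred_progress, fold):
    """Count reads of the predecessor's shared stage counter.

    pred_progress[j] is the value the predecessor's stage counter would have
    at the moment this iteration reaches stage j (non-decreasing in j).
    Stage j may proceed once the counter exceeds j.
    """
    reads = 0
    cached = -1                  # private copy of the last observed counter value
    for j, counter in enumerate(pred_progress):
        if fold and cached > j:
            continue             # dependency already known to be satisfied
        reads += 1               # touch the shared counter (possible true sharing)
        cached = counter
        # A real runtime would suspend the iteration here if counter <= j.
    return reads
```

When the predecessor is far ahead, for example `pred_progress = [8] * 8`, the model performs 8 shared reads without folding and a single read with folding. When the two iterations advance in lockstep, folding saves nothing, because every cached value is already stale by the next stage.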

We conducted a similar performance study on the Intel Xeon system, and the relative performance of pipe-fib and pipe-fib-256, with and without dependency folding, shows trends similar to those in the results on the AMD system. The parallel running times on the Intel system are affected more by cache misses due to false sharing on the three arrays (since different workers may be working on different parts of the same array), which yields less speedup for pipe-fib with dependency folding (specifically, a speedup of 8.8). This cache effect goes away if we simply use a larger data type for the array and coarsen it slightly to avoid false sharing.

11. CONCLUSION

What impact does throttling have on theoretical performance? PIPER relies on throttling to achieve its provable space bound and avoid runaway pipelines. Ideally, the user should not worry about throttling. The following two theorems examine how throttling affects the execution of an unthrottled dag \( \hat{G} \), that is, the pipeline dag without throttling edges.

First, for uniform pipelines, which contain no hybrid stages and in which, for each stage \( j \), node \( (i, j) \) has cost within a constant factor of the cost of executing that stage in any other iteration, throttling does not affect the asymptotic performance of PIPER executing \( \hat{G} \).

**Theorem 11.1.** Consider a uniform unthrottled linear pipeline \( \hat{G} = (V, \hat{E}) \). Suppose that PIPER throttles the execution of \( \hat{G} \) on \( P \) processors using a throttling limit of \( K = aP \), for some constant \( a > 1 \). Then PIPER executes \( \hat{G} \) in time \( T_P \leq (1 + c/a)\hat{T}_1/P + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)) \) for some sufficiently large constant \( c \) with probability at least \( 1 - \varepsilon \), where \( \hat{T}_1 \) is the total work in \( \hat{G} \) and \( \hat{T}_\infty \) is the span of \( \hat{G} \).

**Proof.** For convenience, let us suppose that each stage has exactly the same cost in every iteration. It is straightforward to generalize the result such that the cost of a stage may vary by a constant amount between iterations.

Let \( \hat{G} \) denote the computation dag of a uniform pipeline with \( n \) iterations and \( s \) stages in each iteration. Let \( W \) be the total work in each iteration. We assume that \( W \) includes the work performed within each stage in an iteration, as well as the cost incurred in checking any cross-edge dependencies from the previous iteration. Then for the unthrottled dag \( \hat{G} \), we know the work is \( \hat{T}_1 = nW \). Similarly, let \( S \) be the work of the most expensive serial stage in an iteration.

For the unthrottled uniform dag \( \hat{G} \), any longest path from the beginning of any node \((i, 0)\) to the end of any node \((i + x, s - 1)\) has cost \( W + xS \). Any path connecting these nodes is represented by some interleaving of \( s \) vertical steps and \( x \) horizontal steps. For any path, the work of all vertical steps is exactly \( W \), since the vertical steps must execute each stage 0 through \( s - 1 \). Each horizontal step executes exactly one stage and thus incurs cost at most \( S \), so the \( x \) horizontal steps together incur cost at most \( xS \). In particular, this result implies that \( \hat{T}_\infty = W + nS \), for the path from node \((0, 0)\) to \((n - 1, s - 1)\).

Now consider the work \( T_1 \) and the span \( T_\infty \) of the throttled dag \( G \). Conceptually, since \( G \) adds \( n - K \) zero-cost throttling edges to \( \hat{G} \), we have \( T_1 = \hat{T}_1 \); the work remains the same.

The throttling edges may increase the span, however. Consider a critical path \( p \) through \( G \), which has cost \( T_\infty \). Suppose this path \( p \) contains \( q \geq 0 \) throttling edges. Label the throttling edges along \( p \) in order of increasing iteration number, with throttling edge \( k \) connecting the terminal of (the subdag corresponding to the execution of) node \((i_k, s - 1)\) to the root of node \((i_k + K, 0)\), for \( 1 \leq k \leq q \). Removing all \( q \) throttling edges from \( p \) splits \( p \) into \( q + 1 \) segments \( p_0, p_1, \ldots, p_q \). More precisely, let \( p_k \) denote the path from the beginning of node \((i_k + K, 0)\) to the end of node \((i_{k+1}, s - 1)\), where \( i_0 = -K \) and \( i_{q+1} = n - 1 \). By our previous result, the cost of each segment \( p_k \) is \( W + (i_{k+1} - i_k - K)S \). We thus have that \( T_\infty \), which is the total cost of \( p \), satisfies

\[
\begin{aligned}
T_\infty &= \sum_{k=0}^{q-1} \bigl(W + (i_{k+1} - i_k - K)S\bigr) + W + (i_{q+1} - i_q - K)S \\
&= qW + W + (i_{q+1} - i_0)S - (q + 1)KS \\
&= qW + W + (n - qK)S \\
&= W + nS + q(W - KS) \\
&= \hat{T}_\infty + q(W - KS).
\end{aligned}
\]
A cut-and-paste argument shows that throttling edges in \( G \) can only increase the span if we have \( W > KS \); otherwise, one can create an equivalent or longer path by going horizontally through \( K \) copies of the most expensive stage, rather than down an iteration \( i \), and across a throttling edge that skips to the beginning of iteration \( i + K \).

Applying Theorem 7.4 to the throttled dag \( G \), we know that PIPER executes \( G \) in expected time \( T_P \), which satisfies

\[
\begin{aligned}
T_P &\leq \frac{T_1}{P} + c(T_\infty + \lg P + \lg(1/\varepsilon)) \\
&= \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)) + cq(W - KS).
\end{aligned}
\]

If \( q = 0 \), then the extra \( cq(W - KS) \) term is 0, giving the desired bound. Otherwise, assume \( q > 0 \). Since every throttling edge skips ahead \( K \) iterations, we know that the critical path uses \( q < n/K \) throttling edges. Using this bound for \( q \) and letting \( K = aP \) for some constant \( a > 1 \), we can rewrite the expression for \( T_P \) as

\[
\begin{aligned}
T_P &\leq \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)) + \frac{cn}{aP} (W - aPS) \\
&= \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)) + \left( \frac{c}{a} \right) \left( \frac{nW}{P} \right) \left( 1 - \frac{aPS}{W} \right) \\
&= \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)) + \left( \frac{c}{a} \right) \left( \frac{\hat{T}_1}{P} \right) \left( 1 - \frac{aPS}{W} \right) \\
&= \left( 1 + \frac{c}{a} \left( 1 - \frac{aPS}{W} \right) \right) \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)).
\end{aligned}
\]

We know \( W \geq KS = aPS \) for the throttling edges to be included as part of the critical path, so the factor \( 1 - aPS/W \) lies between 0 and 1. In general, \( W \) could be much larger than \( aPS \), in which case this factor approaches 1, giving us the final result:

\[
T_P \leq \left( 1 + \frac{c}{a} \right) \frac{\hat{T}_1}{P} + c(\hat{T}_\infty + \lg P + \lg(1/\varepsilon)).
\]

\( \square \)

Second, we consider nonuniform pipelines, where the cost of a node \((i, j)\) may vary across iterations. It turns out that nonuniform pipelines can pose performance problems, not only for PIPER, but for any scheduler that throttles the computation. Figure 15 illustrates a pathological nonuniform pipeline for any scheduler that uses throttling. In this dag, $T_1$ work is distributed across $(T_1^{1/3} + T_1^{2/3})/2$ iterations such that any $T_1^{1/3} + 1$ consecutive iterations consist of 1 heavy iteration with $T_1^{2/3}$ work and $T_1^{1/3}$ light iterations of $T_1^{1/3}$ work each. Intuitively, achieving a speedup of 3 on this dag requires having at least 1 heavy iteration and $\Theta(T_1^{1/3})$ light iterations active simultaneously, which is impossible for any scheduler that uses a throttling limit of $K = o(T_1^{1/3})$. The following theorem formalizes this intuition.

**Theorem 11.2.** Let $\hat{G} = (V, \hat{E})$ denote the nonuniform unthrottled linear pipeline shown in Figure 15, with work $T_1$ and span $T_\infty = 2T_1^{2/3}$. Let $S_1$ denote the optimal stack-space usage when $\hat{G}$ is executed on 1 processor. Any $P$-processor execution of $\hat{G}$ that achieves $T_P \leq T_1/\rho$, where $\rho$ satisfies $3 \leq \rho \leq O(T_1/T_\infty)$, uses space $S_P \geq S_1 + (\rho - 2)T_1^{1/3}/2$.

**Proof.** Consider the pipeline computation shown in Figure 15. Suppose that a scheduler executing the pipeline dag requires $S_1 + x - 1$ space to execute $x$ iterations of the pipeline simultaneously, that is, $S_1$ stack space to execute the function containing the pipeline, plus unit space per pipeline iteration the scheduler executes in parallel. Consequently, the scheduler executes the pipeline serially using $S_1$ space, and incurs an additional unit space overhead per pipeline iteration it executes in parallel. Furthermore, suppose that each node is a serial computation, that is, no nested parallelism exists in the nodes of Figure 15.

Consider a time step during which the scheduler is executing instructions from $k$ heavy iterations in parallel. Because stage 0 of the pipeline in Figure 15 is serial, to execute instructions from $k$ heavy iterations in parallel requires executing the node for stage 0 in at least $(k - 1)T_1^{1/3} + 1$ consecutive iterations. Because stage 2 is serial, the scheduler must have executed stage 0, but not stage 2, for at least $(k - 1)T_1^{1/3} + 1$ iterations. Consequently, the scheduler requires at least $S_P \geq S_1 + (k - 1)T_1^{1/3}$ stack space to execute instructions from $k$ heavy iterations in parallel.

We now bound the number $k$ of heavy iterations the scheduler must execute in parallel to achieve a speedup of $\rho$. The total time $T_P$ the scheduler takes to execute the instructions of the pipeline in Figure 15 is at least the total time it takes to execute all $T_1^{2/3} \cdot T_1^{1/3}/2 = T_1/2$ instructions in heavy iterations. Assuming the scheduler executes instructions from $k$ heavy iterations simultaneously on each time step it executes an instruction from a heavy iteration, we have $T_P \geq T_1/(2k)$. Rearranging terms gives us that $k \geq T_1/(2T_P) \geq \rho/2$, since $T_P \leq T_1/\rho$.

Combining these bounds, we find that the scheduler requires at least $S_P \geq S_1 + (\rho - 2)T_1^{1/3}/2$ space to achieve a speedup of $\rho$ when executing the pipeline in Figure 15.  
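A small numeric instance may help in reading the bound. The Python sketch below takes an even integer `m` in the role of $T_1^{1/3}$ and assumes, consistent with the proof's count of $T_1/2$ instructions in heavy iterations, that there are $m/2$ heavy iterations of $m^2$ work each and that the remaining $T_1/2$ work is split into light iterations of $m$ work each. The function name and the returned field names are hypothetical.

```python
def pathological_pipeline(m, rho):
    """Quantities from the lower-bound argument, with m standing in for T_1^(1/3)."""
    if m % 2 != 0:
        raise ValueError("m must be even so that the iteration counts are integers")
    heavy_iters = m // 2                 # T_1^(1/3) / 2 heavy iterations
    light_iters = m * m // 2             # T_1^(2/3) / 2 light iterations
    total_work = heavy_iters * m ** 2 + light_iters * m
    k_min = rho / 2                      # heavy iterations needed in parallel
    extra_space = (k_min - 1) * m        # lower bound on S_P - S_1
    return {
        "T_1": total_work,
        "iterations": heavy_iters + light_iters,
        "k_min": k_min,
        "extra_space": extra_space,
    }
```

For example, `pathological_pipeline(100, 4)` describes a dag with $T_1 = 10^6$ spread over 5050 iterations; a speedup of $\rho = 4$ needs at least 2 heavy iterations in parallel and therefore at least 100 units of stack space beyond $S_1$.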

Intuitively, these two theorems present two extremes of the effect of throttling on pipeline dags. One interesting avenue for research is to determine the minimum restrictions on the structure of an unthrottled linear pipeline $\hat{G}$ that would allow a scheduler to achieve parallel speedup on $P$ processors using a throttling limit of only $\Theta(P)$.

**Acknowledgments**

Thanks to Loren Merritt of x264 LLC and Hank Hoffman of University of Chicago (formerly of MIT CSAIL) for answering questions about x264. Thanks to Yungang Bao of Institute of Computing Technology, Chinese Academy of Sciences (formerly of Princeton) for answering questions about the PARSEC benchmark suite. Thanks to Bradley Kuszmaul of MIT CSAIL for tips and insights on file I/O related performance issues. Thanks to Arch Robison of Intel for providing constructive feedback on an early draft of this paper. Thanks to Will Hasenplaugh of MIT CSAIL and Nasro Min-Allah of COMSATS Institute of Information Technology in Pakistan for helpful discussions. We especially thank the reviewers for their thoughtful comments.

**References**


