Toward Speculative Loop Pipelining for High-Level Synthesis

Loop pipelining (LP) is a key optimization in modern high-level synthesis (HLS) tools for synthesizing efficient hardware datapaths. Existing techniques for automatic LP are limited by static analysis that cannot precisely analyze loops with data-dependent control flow and/or memory accesses. We propose a technique for speculative LP that handles both control-flow and memory speculation in a unified manner. Our approach is expressed entirely at the source level, allowing seamless integration into development flows using HLS. Our evaluation shows significant improvement in throughput over standard LP.


I. INTRODUCTION
Although FPGA accelerators benefit from excellent energy/performance characteristics, their usage is hindered by the lack of high-level programming tools. High-level synthesis (HLS) tools aim at making FPGAs more accessible by enabling the automatic derivation of highly optimized hardware designs directly from high-level specifications (C, C++, and OpenCL). Although HLS is now a mature technology, existing tools still suffer from many limitations and are usually less efficient than manual designs by experts.
HLS tools take algorithmic specifications in the form of C/C++ as inputs, just like standard compilers. They benefit from decades of progress in program optimization and parallelization since most of these optimizations are also relevant for deriving efficient hardware.
In practice, the usage of HLS tools resembles that of CAD tools, such as Logic/RTL synthesizers, where the design process involves a lot of interactions with the user. These interactions allow the exploration of performance and area tradeoffs in the resulting hardware. They usually consist of evaluating the impact of certain key optimizations to derive the best solution given performance and/or area constraints.
Loop pipelining (LP) is one of these key transformations as it enables the synthesis of complex yet area-efficient hardware datapaths. Current HLS tools are effective at applying LP to loops with regular control and simple memory access patterns, but struggle with data-dependent control flow and/or memory accesses. The main reason is that existing LP techniques rely on static schedules, which cannot precisely capture data-dependent behaviors.
In this work, we address this limitation through a speculative LP (SLP) framework supporting both control flow and memory dependence speculation. More specifically, our contributions are the following.
1) SLP for HLS expressed entirely as source-level transformations. 2) Automation of SLP in a source-to-source compiler.
3) An experimental evaluation of the approach that shows significant performance improvement over standard LP. An important strength of our work is that the speculative design is expressed entirely at the source level. This allows our approach to be seamlessly integrated into HLS design flows, providing two key benefits: 1) the pipelined datapath is synthesized by the HLS tools, which are capable of deriving efficient designs, and 2) we do not compromise on ease-of-use: programmers keep all the productivity benefits (e.g., easier/faster testing) of having high-level specifications.
The remainder of this article is organized as follows. Section II introduces LP in HLS and its current limitations. Section III details our source-level transformations to realize SLP, and Section IV describes its automation. We evaluate our work in Section V, discuss related work in Section VI, and conclude in Section VII.

II. BACKGROUND AND MOTIVATION
Designing efficient hardware accelerators using HLS relies on the ability to uncover and exploit parallelism from the algorithm. Unlike compilers targeting CPUs or GPUs, where the target hardware is known at compile time, HLS tools enjoy more degrees of freedom. As a consequence, quantifying the effective performance of a design is more complex than for a programmable device.

A. Loop Pipelining in HLS
The performance of a hardware accelerator is determined by its achievable clock speed (f_max) and its operation-level parallelism (i.e., the number of useful operations per cycle). HLS optimizations may affect either of these two characteristics.
LP is one such optimization; it consists of overlapping the execution of consecutive iterations of a loop. A pipelined loop is characterized by two metrics.
0278-0070 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
(Fig. 1 caption) The function S is mapped to a three-stage pipelined operator and has a dependence distance of three. The function F is mapped to a single-stage operator, with a dependence distance of one. This gives a pipelined schedule with II = 1 and Λ = 3.
1) The pipeline latency (Λ) corresponds to the number of pipeline stages in the synthesized datapath.
2) The initiation interval (II) is the delay separating the execution of two consecutive iterations. When targeting FPGAs, designers generally aim for fully pipelined loops with II = 1. Fig. 1 shows a simple kernel along with its fully pipelined execution (i.e., with II = 1). The value produced by S is said to have a dependence distance of three, since it is used three iterations later. Were the distance shorter than three, the II would have to be increased so that each iteration waits for its input to be computed. HLS tools use compiler analyses to determine whether consecutive iterations of a loop can be executed in parallel, which may not be possible due to dependences and/or resource constraints (e.g., memory ports).
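The Fig. 1 kernel is not reproduced here, but a loop with the stated properties can be sketched as follows (the function bodies and array names are placeholders of ours; only the dependence distances matter):

```c
/* Hypothetical sketch of a Fig. 1-style kernel. S maps to a
 * three-stage pipelined operator whose value is consumed three
 * iterations later; F maps to a single-stage operator whose value is
 * consumed in the next iteration. Both recurrences tolerate II = 1. */
float S(float v) { return v * 0.5f + 1.0f; } /* placeholder body */
float F(float v) { return v + 2.0f; }        /* placeholder body */

void kernel(float y[], float z[], int n) {
    for (int i = 3; i < n; i++) {
        y[i] = S(y[i - 3]); /* dependence distance of three */
        z[i] = F(z[i - 1]); /* dependence distance of one */
    }
}
```

Because each operator's latency does not exceed its dependence distance, the loop can be fully pipelined with II = 1 and Λ = 3.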
LP slightly differs from software pipelining, which targets instruction set-based architectures [1], because the latency of operations in the resulting hardware datapath can be flexibly tuned to the target clock speed specified by the designer. Therefore, the depth of the resulting pipelined schedule can be made larger to enable higher clock speeds. Because of the abundance of registers in FPGAs, the area overhead due to pipelining is low, making it an effective transformation (datapaths with 100+ pipeline stages are not uncommon).

B. Limits of Loop Pipelining
HLS tools rely on simple dependence analysis techniques and often fail at precisely identifying how much a loop can actually be pipelined. Although there are more advanced analyses and optimizations [2]–[5], compilers still fail to identify pipelining opportunities in many situations. This is the case when the control flow and/or access patterns become too irregular and/or data dependent to be amenable to any compiler analysis.
For example, consider the loop kernel shown in Fig. 2(a), which exposes two execution paths: 1) a fast path (single cycle) and 2) a slow path (three cycles). In this kernel, the actual execution path depends on the value of x, which is only known at runtime. Because of this, any static dependence analysis fails at precisely computing the dependences. Therefore, the compiler has no choice but to follow a conservative (i.e., worst case) schedule, as illustrated in Fig. 2(b). In this schedule, the compiler conservatively assumes that the slow path is always taken and does not expose iteration-level parallelism.
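A kernel of this shape can be sketched as follows (the bodies of C, S, and F are placeholders of ours; the figure's actual code is not reproduced here):

```c
/* Hypothetical sketch of a Fig. 2(a)-style kernel: the next value of x
 * depends on a data-dependent condition C(x). A static scheduler must
 * assume the slow path S is taken on every iteration. */
int C(int x) { return x < 0; }           /* placeholder condition */
int S(int x) { return (x * x) / 3 + 7; } /* placeholder slow path: three cycles in HW */
int F(int x) { return x + 1; }           /* placeholder fast path: single cycle in HW */

int kernel(int x, int n) {
    for (int i = 0; i < n; i++) {
        if (C(x))
            x = S(x); /* slow path */
        else
            x = F(x); /* fast path */
    }
    return x;
}
```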
Such examples are quite common in practice, especially when dealing with emerging application domains, such as graph analytics, network packet inspection, or algorithms operating on sparse data structures.
However, the execution of multiple iterations may be overlapped by speculating that the fast path is always taken, as illustrated in Fig. 2(c). In case of a misspeculation, operations started with incorrect values [C(x) and F(x) at the fourth iteration] must be canceled and wait for the correct values to become available [S(x) at the third iteration]. This approach is similar to that of a pipelined in-order CPU, where the execution of instructions is speculatively overlapped, which potentially causes pipeline hazards.
The existing work on speculative execution for HLS either addresses the problem of runtime data-dependence management [6], [7] or improves tool support for loops with complex control flow [8]–[13]. The only exception is the work by Josipović et al. [14], which adds speculation to dataflow circuits and is capable of addressing both issues. We discuss related work further in Section VI.

III. PROPOSED APPROACH
Our goal is to automatically derive circuits implementing SLP from C programs. This section describes the program transformations we use to achieve this.

A. Speculative Loop Pipelining at the Source Level
We propose to express SLP directly at the source level, in contrast to prior work [7], [14] that operates in the HLS back end or at the RTL level. Our key insight is to decouple the pipeline control logic from the pipelined datapath in the source code.
We take advantage of this decoupling as follows. 1) We increase the dependence distance exposed in the datapath by speculating over the outcome of a conditional. This extra slack allows the HLS tool to derive more aggressively pipelined designs. 2) We enhance the control logic to monitor the loop execution and handle misspeculations. A key characteristic of this source-level transformation is that it does not directly perform operation-level pipelining. Instead, this task is delegated to the HLS LP optimization. This approach has many benefits.
1) It simplifies code generation and testing, and it preserves the benefits of static pipelining (low control overhead, ability to share resources when targeting II > 1, etc.) while benefiting from the performance improvements offered by a speculative schedule. 2) Since the misspeculation detection logic is directly woven into the code, cycle-accurate performance numbers can be obtained simply by executing the program and tracking the misspeculation overhead.
Fig. 3 shows an example of our source-level implementation of SLP. (Fig. 3 caption: In (c), C(x) at the third iteration evaluates to true, causing a misspeculation. The average II assumes a 20% misspeculation rate. The figure makes a few simplifications. The FSM is represented as a state diagram, but is expressed as a switch/case statement in the C code. Arrays of infinite length are used for mis_x, s_x, s_y, and ctrl, whereas the actual implementation uses circular buffers (with modulo addressing) whose depths correspond to the reuse distances. Initializations of the circular buffers before entering the loop are omitted.) The transformation takes the estimated pipeline latency of each path as input: Λ_C = 2, Λ_S = 5, and Λ_F = Λ_H = 1. The resulting code exposes a shorter critical path length (Λ_F) to the HLS tools, enabling fully pipelined schedules with II = 1 by speculatively using the outputs of F. In this transformed source code, loop iterations correspond to cycles in the eventual hardware and no longer to the original loop iterations. The functions operate over circular buffers (see 1), and the dependence distances over these buffers bound the pipeline depth. For example, S uses s_x[t-5], and hence S may take up to five cycles to produce its output. Since the dependence analyses in HLS tools are sometimes too conservative, we use annotations (see 1) to override their analysis when necessary.
The control is delegated to a finite-state machine (see 3 ) whose state is updated every iteration. The role of each state is as follows.
1) The FILL state is the pipeline startup phase.
2) The RUN state is the speculative pipeline steady state. During this stage, the correctly speculated values are committed to their corresponding variables. The FSM remains in the RUN state until a misspeculation is detected, in which case the state transits to the STALL state.
3) The STALL state is used during misspeculation recovery.
It serves as a wait state until the outcome of the recovery path is ready. When this is the case, the FSM switches to the ROLLB state. 4) The ROLLB state is the last stage of the misspeculation recovery process. In this state, the recovered state is committed and the pipeline is restarted by looping back to the FILL state.
The rollback mechanism (see 4) is implemented in the datapath as a conditional statement. The speculated value is replaced by the value computed in the recovery path. The commit step (see 5) is implemented as a collection of guarded assignments to the original variables. The committed value for a speculated variable is selected from either the speculative path (when speculation succeeds) or the recovery path (when it fails). In the former case, the commit uses a previously computed result (s_x[t-1] in this case) instead of s_x[t] to compensate for the pipeline latency of computing ctrl[t], which is scheduled at stage 2 in our example.
(Fig. 4 caption) Rollback on arrays using a store buffer mechanism implemented at the C level. We use arrays of infinite length to simplify the presentation. The actual implementation uses shift registers.
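The paper states that the FSM is expressed as a switch/case statement in the generated C code; a minimal sketch of the state-transition logic could look as follows (the function name, the counter argument, and the duration parameters are our assumptions, not the paper's code):

```c
/* Minimal sketch of the recovery FSM of Section III-A. The durations
 * (fill_cycles, stall_cycles) are derived from the path latencies in
 * the real transformation; here they are plain parameters. */
enum state { FILL, RUN, STALL, ROLLB };

enum state next_state(enum state s, int misspec, int cnt,
                      int fill_cycles, int stall_cycles) {
    switch (s) {
    case FILL:  /* pipeline startup phase */
        return (cnt >= fill_cycles) ? RUN : FILL;
    case RUN:   /* speculative steady state: commit correct values */
        return misspec ? STALL : RUN;
    case STALL: /* wait until the recovery (slow) path is ready */
        return (cnt >= stall_cycles) ? ROLLB : STALL;
    case ROLLB: /* commit recovered state, then restart the pipeline */
        return FILL;
    }
    return FILL;
}
```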

B. Parameters of the Recovery Mechanism
The recovery mechanism is implemented as extra control logic in the FSM, and circular buffers to keep track of the history of values produced by the datapath.
The correctness of the transformation can be reasoned about in two parts: 1) data and 2) control. The datapath (1) does not affect correctness, since it is the control logic that dictates which values are committed. The estimated latency of each function is used as the dependence distance to give the HLS tools sufficient slack to pipeline the datapath.
The control path (3 through 6) is responsible for providing the right data to the datapath and committing correct values in order. The durations of the recovery states in the FSM, as well as the rollback distances, are computed from a schedule of all operations in the loop body, as illustrated in Fig. 3 (rightmost). The FILL state waits until both C and F are ready: one cycle in this example. After a misspeculation is detected, the pipeline is stalled until the slow path finishes computing; the stall duration is hence the difference between the latencies of S and C (Λ_S − Λ_C = 3 cycles), which is split between the STALL and ROLLB states as described above. During rollback, all values that were computed with incorrectly speculated values must be reverted. In this example, x is replaced by the output of S, and the outputs of H are rolled back by four cycles, i.e., Λ_S − Λ_H: the difference between when H and S finish computing.

C. Managing Arrays
In this section, we discuss how our approach can be extended to support array accesses. The main difficulty is the need to roll back incorrectly speculated operations. This can easily be achieved for scalar variables by recording their previous values in shift registers. However, recording the entire state of arrays is not practical.
We alleviate this issue through a mechanism similar to the store buffers used in speculative processors, which we implement at the source level as depicted in Fig. 4. This mechanism delays write operations to the memory bank (array) through a queue of pending store operations (data_q and addr_q; see 1).
(Fig. 5 caption) Loop with a data-dependent memory dependence (left) and its speculative equivalent (right), assuming one-iteration write latency.
Whenever a read operation is issued to the array, it is first matched against the pending writes (see 2). We make sure matching is performed in parallel by fully unrolling the comparison loop. In the case of a match, the most recent write value is returned. Otherwise, the read is forwarded to the memory bank. This ensures that reads always access the most recent array state. It is also possible to read data from any of the k last states of the array by restricting the set of pending writes matched during the read operation.
A rollback to the d-th previous array state is implemented by dropping the d most recent pending writes from the queue (see 3). If a pending write has not been invalidated by the time it reaches the end of the queue, its value is written to the memory. However, this mechanism may incur significant area overhead, so the depth of the buffer (i.e., the number of pending writes) must be kept to a minimum.
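The store buffer behavior described above can be sketched at the C level as follows. The names data_q and addr_q come from the text; the queue depth, the memory array, and the function names are our assumptions, and a simple shift-register-style array stands in for the circular buffer:

```c
#define QDEPTH 8 /* assumed store-buffer depth */

/* Pending store queue; index 0 is the oldest entry. */
static int addr_q[QDEPTH];
static int data_q[QDEPTH];
static int q_len = 0;

static int mem[256]; /* the memory bank (array) being shadowed */

/* Write: enqueue instead of updating the array directly. The oldest
 * pending write retires to memory when the queue is full. */
void buffered_write(int addr, int data) {
    if (q_len == QDEPTH) { /* retire the oldest pending write */
        mem[addr_q[0]] = data_q[0];
        for (int i = 1; i < QDEPTH; i++) {
            addr_q[i - 1] = addr_q[i];
            data_q[i - 1] = data_q[i];
        }
        q_len--;
    }
    addr_q[q_len] = addr;
    data_q[q_len] = data;
    q_len++;
}

/* Read: match against pending writes, newest first. In hardware this
 * comparison loop is fully unrolled so it runs in parallel. */
int buffered_read(int addr) {
    for (int i = q_len - 1; i >= 0; i--)
        if (addr_q[i] == addr)
            return data_q[i];
    return mem[addr]; /* no match: forward to the memory bank */
}

/* Rollback to the d-th previous array state: drop the d most recent
 * pending writes. */
void rollback(int d) {
    q_len = (d > q_len) ? 0 : q_len - d;
}
```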

D. Supporting Memory Speculation
Loop bodies with irregular control flow tend to have data-dependent memory access patterns as well. Because they are not amenable to static analysis, these patterns limit the ability to statically pipeline the loop. Prior work has addressed this limitation through memory speculation: speculate that the dependence distances are sufficiently large to overlap consecutive iterations [6], [7], [12]–[14]. These approaches use runtime checks to determine the actual dependence distance and to detect and recover from memory dependence hazards. We use the approach by Alle et al. [6] to implement memory speculation at the source level. The main difference from their work lies in the automation: we support both control-flow and memory speculation in a single framework, as discussed in Section IV.
The rest of this section illustrates their approach through an example in Fig. 5. The original loop (left) has two statements (R and W) that access the array tab, creating RAW dependences. The dependence distance cannot be analyzed at compile time because of data-dependent indexing functions (H and G). Thus, a static scheduler must conservatively assume that the dependence distance is one. Consider the case when the main operation (E+W) is mapped to a two-stage pipelined operator. Then, II must be at least two because of the conservatively assumed dependence distance. This II may be reduced by speculating that the dependence distance is at least two.
The approach by Alle et al. [6] transforms the loop into the one depicted in Fig. 5 (right). In this example, the memory write latency is assumed to be one iteration: writes to the tab[] array are correctly reflected to reads two or more iterations later. (One iteration eventually translates to one cycle when II = 1.) This transformation consists of several steps as follows.
1) The loop body is executed speculatively by reading possibly incorrect data from array tab[] (see 1 ). The prefix s_ denotes speculative values. 2) If no address aliasing occurred, the speculation is valid, and speculative values are committed (see 2 ). This includes the update of array tab[]. When read and write addresses alias, speculative values are simply discarded.
3) The pending write address (d_wr_ad) stores either the address of an ongoing commit (wr_ad) or an invalid address (−1) in case of misspeculation (see 3). This address is used in the next iteration to test for aliasing. Note that in the transformed code, loop iterations do not correspond to the iterations of the original code. The code implements a stall by skipping the commit and recomputing it in the next iteration. The commit condition evaluates to true in the second try because the pending write address is set to −1 in the first (failed) try.
In this modified code, the reuse distance for the RAW dependence is extended to two iterations, offering additional slack to the pipeline scheduler.
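Under our reading of Fig. 5 (right) and the steps above, the transformed loop can be sketched as follows. The names d_wr_ad, wr_ad, and the s_ prefix come from the text; the bodies of G, H, and E, the array size, and the loop structure are our assumptions. Note that in plain C the array is always up to date, so this sketch models the control behavior (stall and retry on aliasing), not the stale reads of the actual hardware:

```c
/* Hypothetical sketch of the Fig. 5 (right) transformation. */
int G(int i) { return (i * 7) % 16; } /* placeholder write index */
int H(int i) { return (i * 3) % 16; } /* placeholder read index */
int E(int v) { return v + 1; }        /* placeholder main operation */

void kernel(int tab[], int n) {
    int d_wr_ad = -1; /* pending write address; -1 = none/invalid */
    int i = 0;
    while (i < n) {
        int rd_ad = H(i);
        int wr_ad = G(i);
        int s_v = E(tab[rd_ad]);  /* 1: speculative read and compute */
        if (rd_ad != d_wr_ad) {   /* 2: no aliasing: commit */
            tab[wr_ad] = s_v;
            d_wr_ad = wr_ad;      /* 3: remember the pending write */
            i++;                  /* advance the original iteration */
        } else {
            d_wr_ad = -1;         /* 3: misspeculation: discard, stall
                                        one iteration, then retry */
        }
    }
}
```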
The technique can be generalized to speculate over longer distances (longer read/write latency). Then, the current read address must be matched against a history of pending write addresses. This is equivalent to a load-store queue (LSQ) structure used in speculative out-of-order processors. In this case, our implementation becomes similar to that described in Fig. 4.

E. Profitability of SLP
The profitability of SLP is a function of the relative area overhead and the misspeculation rate. The effective II increases with the misspeculation rate, and at some point the speedup no longer justifies the area overhead of SLP. The effectiveness of SLP for a given loop is determined by comparing the effective II and achievable clock speed against the area overhead caused by speculation. The effective II can be obtained through source-level simulation, or computed by assuming a certain misspeculation rate. Area and frequency estimates can only be obtained after the HLS synthesis step.
In addition, there are some characteristics of loops that favor SLP, which may be used to further guide the selection of target loops.
1) A relatively small number of arrays (in low-latency memory), to avoid excessive cost in LSQs. 2) Complex conditions and/or additional computations using the results of the fast/slow paths. These computations must be performed regardless of the outcome of the condition (i.e., of the speculation) and occupy area of their own, which reduces the relative cost of SLP. In other words, when the loop body is simple, the relative overhead of SLP becomes large, requiring low misspeculation rates to be effective.
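As a back-of-the-envelope illustration (our simplification, not a formula from the article), the effective II under a given misspeculation rate can be modeled as the steady-state II plus the expected recovery penalty per iteration:

```c
/* Simple effective-II model (our assumption): each misspeculation adds
 * a fixed recovery penalty, in cycles, on top of the steady-state II. */
double effective_ii(double ii, double misspec_rate, double penalty_cycles) {
    return ii + misspec_rate * penalty_cycles;
}
```

For example, at II = 1 with a 25% misspeculation rate and a four-cycle penalty, the effective II doubles to 2.0, which quickly erodes the benefit of speculation.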

IV. AUTOMATION IN COMPILER
In this section, we describe how the transformations described in Section III can be automated. We envision our approach to be part of the interactive development flow using HLS tools, where the use of SLP becomes another performance tuning knob for the developer. Thus, the focus is on how to transform the code in a systematic manner in a compiler.

A. Program Representation
We use a representation based on Gated-SSA [15]. Gated-SSA is an extension of standard SSA [16] where φ nodes are replaced by either μ or γ nodes, and where arrays are considered as values and updated through α operations. The semantics of these operations are as follows.
1) μ(x_out, x_in) nodes appear at loop headers and select between the initial (x_out) and loop-carried (x_in) values of a given variable x. 2) γ(c, x_false, x_true) nodes appear at the join points of a conditional and act as multiplexers determining which definition reaches the confluence node. 3) α(a, i, v) nodes are used for array assignment. A new array "value" is computed from the previous value a by replacing the i-th element with v. This allows arrays to be treated as atomic objects.
Fig. 7 shows the Gated-SSA representation of the program in Fig. 6. In this representation, the conditional updates of x are captured by the γ_x node, and the updates to array tab are represented by the α_tab node. The loop-carried dependences involving x and tab are modeled with the μ nodes (μ_x and μ_tab) and their corresponding back edges, tagged with the minimum reuse distance inferred by the compiler (in this case, a distance of one). We extend Gated-SSA with an additional node type: rollback nodes are used to model rollbacks to previous iterations. A rollback node has two inputs (data d and control c). When c = 0, the rollback node forwards its most recent data value to the output. When c > 0, the node discards the c most recent values and forwards the value stored c iterations ago. During code generation, rollback nodes are lowered either as circular buffers (for scalars) or store buffers (for arrays), as explained in Section III-C.
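The γ and α semantics can be mirrored by tiny C helpers (illustrative encodings of ours; the compiler manipulates these as IR nodes, and μ nodes, which select between initial and loop-carried values across iterations, have no natural functional encoding):

```c
/* gamma(c, x_false, x_true): a multiplexer at a conditional's join. */
int gamma_node(int c, int x_false, int x_true) {
    return c ? x_true : x_false;
}

/* alpha(a, i, v): a new array "value" equal to a with the i-th element
 * replaced by v. Copying into out[] keeps the previous value intact,
 * which is what lets arrays be treated as atomic objects. */
void alpha_node(const int a[], int len, int i, int v, int out[]) {
    for (int k = 0; k < len; k++)
        out[k] = a[k];
    out[i] = v;
}
```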

B. Identifying Speculative Execution Points
In the Gated-SSA representation, conditional branches are modeled as γ nodes. Implementing control-flow speculation amounts to speculating the outcome of a γ node by choosing one of its input data before control data are known. The following criteria define conditionals where speculation is relevant.
1) The γ node must be part of a critical loop whose delay prevents pipelining. Otherwise, speculation does not improve the II. 2) The conditional must expose a fast path and a slow path, as in Fig. 7. If the paths are balanced, speculation is useless. The length (in terms of pipeline stages) of each path is obtained through an analysis based on predefined delays of operators. The delays between pairs of operations in the strongly connected component (SCC) are computed assuming no resource constraints, as done when computing the recurrence-constrained minimum II (RecII) in iterative modulo scheduling [1]. The delay from the μ node to the immediate predecessors of the γ node gives the path latency estimates.
Note that the accuracy of this analysis does not affect correctness. The path latency estimates are used to identify fast/slow patterns and serve as slacks given to the HLS tools in the form of dependence distances. Inaccurate estimates lead to excessive or insufficient slack, affecting the performance of the design: excessive slack increases the misspeculation penalty, while insufficient slack constrains LP and may result in designs with larger IIs and/or slower clock speeds.
In addition, the profitability of SLP should be determined as described in Section III-E to decide if SLP is applied or not.
In this article, we assume that the conditionals to speculate are given as programmer input.

C. Constructing the Recovery Logic
We also use latency estimates as described above to compute the parameters of the recovery logic. One difference is that only the dependence edge corresponding to the fast path is considered for the speculative γ node to account for speculation. In addition, we make the following assumptions.
1) The latency of the fastest path is one; the latency estimates are normalized to satisfy this assumption. The HLS tools are capable of pipelining the resulting code with larger II if desired.
2) The transformation to expose memory speculation (discussed in Section IV-E) is already applied.
3) The γ nodes to speculate are provided by the user as annotations. The operations corresponding to the (end of the) fast, slow, and control paths are called F, S, and C, respectively. The delay between the μ node and the point at which the result of an operation X becomes available is denoted θ_X.
All the parameters of the recovery logic are computed from the relative difference between when the output of each operation is produced and two key timestamps.
1) θ_validate: When the validation of the speculation happens. This is the max of θ_C, θ_F, and θ_X over all other operations X that depend on F. When the validation succeeds, all values computed with the speculative value are committed. 2) θ_rollback: When the rollback happens in case of misspeculation. This is the max of θ_S and θ_X over all other operations X that depend on S. Then, the parameters used in the recovery logic are as follows.
3) Rollback Distance for Operation X: θ_rollback − θ_X. 4) Commit Distance for Operation X: θ_validate − θ_X. The rollback distance also gives the size of the buffers used to store speculative values (or of the store buffers in the case of arrays). Note that buffers are only necessary for live-out values and for arrays with write accesses within the SCC.
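The parameter computation above can be sketched directly (helper names and argument shapes are ours; the dep arrays hold the θ values of the other operations depending on the fast or slow path results):

```c
/* Illustrative computation of the recovery-logic parameters of
 * Section IV-C. */
static int max2(int a, int b) { return a > b ? a : b; }

int theta_validate(int theta_C, int theta_F, const int dep_on_F[], int n) {
    int t = max2(theta_C, theta_F); /* max over C, F, ... */
    for (int i = 0; i < n; i++)
        t = max2(t, dep_on_F[i]);   /* ... and all ops depending on F */
    return t;
}

int theta_rollback(int theta_S, const int dep_on_S[], int n) {
    int t = theta_S;                /* max over S ... */
    for (int i = 0; i < n; i++)
        t = max2(t, dep_on_S[i]);   /* ... and all ops depending on S */
    return t;
}

/* Per-operation buffer sizing and commit scheduling. */
int rollback_distance(int t_rollback, int theta_X) { return t_rollback - theta_X; }
int commit_distance(int t_validate, int theta_X)  { return t_validate - theta_X; }
```

With the latencies of the Fig. 3 example (θ_C = 2, θ_S = 5, θ_F = θ_H = 1), θ_rollback = 5 and the outputs of H get a rollback distance of 5 − 1 = 4 cycles, matching the four-cycle rollback of Section III-B.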

D. Transformation on Gated-SSA Representation
Our speculative pipelining transformation is implemented as a set of simple structural modifications to the Gated-SSA IR, following the principles depicted in Fig. 3. These changes include the following.
1) Modifying the distance associated with all input edges of the γ node to expose dependence distances matching the expected number of pipeline stages along each input path. 2) Creating shadow variables for every speculated live-out variable, with their corresponding multiplexer nodes. 3) Instantiating (and wiring) the FSM implementing the misspeculation recovery logic.
(Fig. 8 caption) Transformed IR in which memory speculation is exposed through an additional path with a dependence distance of 2. Since speculated paths belong to distinct SCCs, a FIFO channel with rollback capabilities must be inserted between them.

4) Creating an additional path exporting the committed values out of the SCC. 5) Inserting rollback nodes along the back edges of all nonspeculated live-out variables. We then generate the HLS C code implementing the selected speculation mechanism.

E. Memory Dependence Speculation
As explained in Section III, memory speculation can be expressed directly at the source level. We now show how this speculation mechanism can be exposed as a control-flow speculation opportunity. Fig. 8 shows a transformed loop IR starting from our earlier example (Figs. 6 and 7). An additional path involving array tab is exposed after the transformation. This path delays the array value by one iteration (a delay is added to the edge from μ_y to γ_tab) and is selected as the outcome of node γ_tab when the RAW dependence distance is guaranteed to be greater than one.
This modification exposes a new cycle involving γ_tab, in which the cumulative dependence distance is two, enabling a two-stage pipelined implementation of function H. From there, we can simply take advantage of control-flow speculation to select the fast path as a speculation target.
Memory dependence speculation should only be used when the dependence distance is data dependent. Otherwise, memory speculation does not help reduce the critical path length. We assume that the memory accesses to apply memory dependence speculation are annotated by the user.

F. Managing Concurrent Execution of SCCs
In the Gated-SSA representation, a loop involving interiteration dependences manifests as a collection of SCCs, each of them involving one or more μ nodes. This is illustrated in Fig. 7, where the loop IR consists of two distinct SCCs.
Since loop-carried dependences only manifest within an SCC, we can safely apply SLP separately for each SCC. This leads to multiple concurrently running SCCs with their own FSMs, where some of the SCCs may be running speculatively.
We apply the principle of decoupled software pipelining [17] to allow these SCCs to run without synchronizations. In decoupled software pipelining, SCCs synchronize with each other through FIFO channels that are used to exchange data. An SCC halts when there is no input available or when the output buffers are full. In our case, this means that each speculated SCC operates independently and communicates through FIFOs.
In some cases, we also need to roll back and access a previously consumed data token from an input node outside of the current SCC. We solve this problem by inserting rollback nodes on all incoming edges of a speculated SCC, as illustrated in Fig. 8.
The absence of deadlocks in our approach is guaranteed by two important properties.
1) The graph formed by all the SCCs is by definition acyclic (cycles being embedded within SCCs). 2) Within an SCC, all the values corresponding to an iteration of the loop are written to the FIFOs at the same time, once all values are ready (that is, in the commit stage for SCCs with speculation). The FIFOs between SCCs, therefore, act as buffers to absorb variability in latency caused by speculation. Note that all SCCs involved should have similar effective II. Otherwise, the SCC with the largest II would become the bottleneck, slowing down all other SCCs.

G. Multiple Speculations
As explained above, our approach naturally handles multiple speculations in different SCCs. However, it is also possible for a single SCC to contain multiple conditionals that benefit from speculation. These speculations may be intertwined, making scheduling more challenging. In particular, the outcome of a speculated γ node may depend on speculated values. This is illustrated in Fig. 9, where γ_y indirectly depends on γ_x.
In our approach, we handle multiple speculations in a single SCC by viewing them as a single speculation. In other words, we delay the resolution of all speculated γ nodes until the last one is ready (at stage 3 in our example). When a misspeculation is detected, the pipeline is rolled back by multiple iterations, corresponding to the maximum number of speculated γ nodes along the path (two iterations in our example). This approach greatly simplifies the control logic, as misspeculation detection is performed at a single pipeline stage and the rollback distance is a constant relative to this stage.
We acknowledge that this approach increases misspeculation penalty when compared to a more fine-grained resolution of speculations. There are many subtle tradeoffs on how to manage multiple speculations, exposing an interesting design space of its own. We leave further exploration of this design space as future work.

V. EVALUATION
In this section, we evaluate speculative designs produced by our approach. We first evaluate the resulting designs in terms of their characteristics (frequency, II, and area cost) compared with fully static designs. Then, we discuss the final performance assuming different misspeculation rates.
For our experiments, we used Xilinx Vivado HLS 2019.1 as a back end, targeting a Zynq xc7z020clg484-3 FPGA.

A. Benchmarks
We use the following benchmarks in our evaluation. 1) Ex-Simple: Our first motivating example, depicted in Fig. 2(a). 2) Ex-Rollback: Our example requiring a rollback mechanism, depicted in Fig. 3. 3) Ex-Memory: An example requiring a store queue to roll back on an array, as explained in Section III-C. 4) CPU: A simple instruction set simulator (ISS) for a toy CPU with four instructions (memory load, memory store, jump, and add) operating on floating-point data. This benchmark demonstrates that our approach can automatically infer an in-order pipelined microarchitecture (with its corresponding hazard detection logic) directly from an ISS specification. 5) Newton-Raphson: A root-finding algorithm used by Josipović et al. in their work on speculative dataflow circuits (SDCs) [14]. We used the code made available by the authors. 6) Histogram: A weighted histogram that accumulates floating-point weights in the corresponding bins. We use the sparse variant used by Cheng et al. [18] in their work on combining dynamically and statically scheduled circuits. This benchmark is an example of speculation on a memory dependence. 7) gSum, gSumIf, and getTanh(double): Benchmarks used by Cheng et al. [18]. All of these kernels contain a simple data-dependent condition that branches into either an inexpensive path (no computation, or a default value) or an expensive path with a large number of floating-point operations.
We used the code made available by the authors.
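To make the memory-speculation target concrete, the Histogram benchmark follows the pattern below (our reconstruction of the kernel's shape, not the authors' exact code; all names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Sparse weighted histogram: when two nearby iterations hit the same bin,
// the floating-point accumulation creates a loop-carried read-after-write
// dependence through memory whose distance is data dependent. A static
// scheduler must assume the worst-case dependence on every iteration,
// inflating II; memory speculation instead assumes no aliasing and rolls
// back on the (rare) collisions.
void histogram(const std::vector<int>& bin, const std::vector<float>& weight,
               std::vector<float>& hist) {
    for (std::size_t i = 0; i < bin.size(); ++i) {
        // hist[bin[i]] may alias hist[bin[j]] for a recent iteration j.
        hist[bin[i]] += weight[i];
    }
}
```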

B. Area Overhead
Table I summarizes the hardware characteristics of our speculative loop-pipelined (SLP) designs and of baseline designs with standard LP. There are multiple sources of area overhead.
1) Cost of Reduced II: Improving II directly increases area cost, simply because of increased parallelism. This overhead is proportional to the complexity of the datapath. When there is a long chain of floating-point operations in the slow path (e.g., gSum and gSumIf), DSP usage is multiplied by large factors correlated with the reduction in II. 2) Cost of (Load) Store Buffers: These buffers are necessary for programs that perform speculative writes to arrays, increasing the overhead on some benchmarks (Ex-Memory and Histogram). 3) Cost of Inter-SCC FIFOs: When the design has multiple SCCs, we add a FIFO array to synchronize the two loops, increasing the number of BRAMs used. This is the case for gSum and gSumIf. 4) Cost of Delay Lines: Rollback and delay mechanisms should ideally be implemented using shift registers, which are mapped either to SRLs or to FFs (the final decision is up to the Vivado P&R tool). However, Vivado HLS may fail to infer precise reuse distances. In that case, we model our delay lines as circular buffers with modulo indexing. This can lead the tool to infer BRAM primitives even for very short delay lines, as is the case for our Ex-Simple example. 5) Cost of Misspeculation Control: Other components of the misspeculation mechanism, such as the FSM and the rollback mechanisms for scalars, account for a small part of the overhead.

Overall, the area overhead is small compared to the theoretical maximum speedup with no misspeculation, which is equal to the II of the fully static design, since we reduce II to 1 in all benchmarks.

Table II summarizes the impact on performance assuming fixed misspeculation rates. The penalty of a misspeculation is STALL + FILL cycles. The speedup is thus computed as II_static / (1 + r × (STALL + FILL)), where II_static is the II of the fully static design and r is the misspeculation rate.
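The circular-buffer workaround mentioned in item 4 above can be sketched as follows (illustrative C++; the struct and its names are ours, not tool output):

```cpp
#include <vector>

// A D-cycle delay line modeled as a circular buffer with modulo indexing.
// Ideally this maps to an SRL-based shift register, but when the HLS tool
// cannot infer the reuse distance, the modulo form may be lowered onto a
// BRAM primitive even for a tiny D.
template <int D>
struct DelayLine {
    std::vector<int> buf = std::vector<int>(D, 0);
    int head = 0;
    // Push a value and return the value inserted D cycles earlier.
    int shift(int in) {
        int out = buf[head];
        buf[head] = in;
        head = (head + 1) % D;   // modulo indexing
        return out;
    }
};
```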

C. Effective Throughput
The benefit of SLP is most visible for the CPU and Newton-Raphson benchmarks, where the relative area cost of SLP is low. SLP has a relatively low area overhead for these benchmarks because they satisfy the characteristics discussed in Section III-E, making them ideal targets for SLP. The other benchmarks are less ideal targets, either because their datapath is too simple (Histogram, gSum, and gSumIf) or because a large portion of the datapath is only used in case of misspeculation (getTanh(double)). These benchmarks require low misspeculation rates for SLP to be effective. Table III provides a qualitative comparison of related work on the topic of speculative execution for HLS.
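The throughput model underlying this discussion can be written as a one-line calculation (our formulation; the exact cycle accounting in Table II may differ): with SLP at II = 1 and a misspeculation penalty of STALL + FILL cycles, the average cost per iteration is 1 + r × (STALL + FILL).

```cpp
// Effective speedup over the fully static design (II = ii_static) under a
// misspeculation rate r, assuming the speculative design achieves II = 1
// and each misspeculation costs stall + fill cycles.
double effective_speedup(int ii_static, int stall, int fill, double r) {
    double cycles_per_iter = 1.0 + r * (stall + fill);
    return ii_static / cycles_per_iter;
}
```

At r = 0 this recovers the theoretical maximum speedup of II_static, and the speedup degrades smoothly as r grows, which is why the simple-datapath benchmarks need low misspeculation rates to remain profitable.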

VI. DISCUSSION AND RELATED WORK
Our work on speculative pipelining has some connections with earlier work on the automatic synthesis of pipelined instruction set processor architectures [19], [20]. These approaches operate from a high-level description of the target architecture specified in a domain-specific language. Our work is much more general, as it operates on a general-purpose language and requires much less user intervention.
Holtmann and Ernst [10] were the first to discuss SLP in the context of HLS. They identified the need for rollback registers and proposed a scheduling algorithm, but they neither evaluated the area overhead of their approach nor addressed memory access aliasing, which severely limits the applicability of their work. Gupta et al. [8] and Lapotre et al. [9] developed techniques to perform control-flow speculation (branch prediction and speculative code motion) in HLS. However, they do not speculate across multiple loop iterations as we do. Alle et al. [6] and Dai et al. [7] proposed mechanisms to allow pipelining of loops with dynamic memory dependences. Our work unifies control speculation and memory speculation in a single framework.
SDCs [14] apply speculation in the context of dataflow execution. They add mechanisms into dataflow circuits to allow nodes to execute with speculated data and to rollback when necessary. They use LSQs to support memory accesses during speculative execution. Our work has similar goals, but we focus on loops and target mostly static designs in contrast to fully dynamic designs in their work. Our code transformations expose optimization opportunities to the HLS tools by decoupling the control logic from the datapaths. This allows the HLS tools to perform static pipelining that has many benefits over fully dynamic circuits, including resource sharing.
In addition, our designs are expressed completely at the source level, unlike some of these works, which use external HDL components. This gives an important practical advantage to developers by allowing testing and verification to be performed at the source level. Dynamic and static scheduling (DSS) [18] is closely related to our work, although their approach does not perform speculation. The main idea is to separately schedule the parts of the code that do not benefit from dynamic scheduling and to perform dataflow execution only among these statically scheduled components. Their work targets programs with patterns similar to ours (a branch into fast and slow paths) and further optimizes area by increasing the II of the infrequently taken path. When the condition is simple, their approach achieves a small II, since the correct path to take can be determined almost immediately. In contrast, our approach starts the next iteration without waiting for the evaluation of the condition to finish, achieving a better II on programs with long conditions, such as Newton-Raphson or CPU in our benchmarks. Table IV compares the area cost of our work with the most recent related work [14], [18]. Note that we took the published values for the other works, even though the experimental setups are not exactly the same. We target the same FPGA as DSS [18].
SLP achieves similar or better performance than DSS on the benchmarks we used. This is partly because DSS assumes, when scheduling, that the slower path is always taken, which gives the most compact design but a higher II. Our approach speculates the opposite: the fast path is always taken, which minimizes II but uses more area. For these benchmarks, we believe that the DSS approach would give results similar to ours if it assumed that the fast path is always taken. However, this holds only for benchmarks with inexpensive conditions. Our work is fundamentally different in that we speculate, and it is thus able to achieve a shorter II when the condition is complex.
Our approach yields similar area and performance to SDCs on the Newton-Raphson benchmark. Since both designs achieve II = 1, there is little room for resource sharing, and hence the differences lie mostly in the control logic. We did not include the other benchmarks used in their work because they all speculate on the exit condition of a loop. Although our approach handles such programs (it also executes invalid iterations at the end of the loop, which are simply not committed), we focus on programs with data-dependent branches in the loop body for our evaluation.
In our opinion, the main benefit of SLP with respect to dataflow-based approaches [13], [14] is that it builds on a classical HLS framework (centralized control through a single FSM, static pipelining). This makes its integration in existing HLS tools much easier than for dynamic dataflow-based approaches, which follow a radically different execution model.

VII. CONCLUSION
In this work, we proposed a complete flow to support SLP in the context of high-level synthesis. Our approach supports both control flow and memory access speculation and is implemented as a source-to-source transformation. The experimental results show that speculation can bring significant performance improvement for several important HLS kernels while allowing seamless integration within existing tools.