Certified and efficient instruction scheduling. Application to interlocked VLIW processors.

Cyril Six, Sylvain Boulmé, David Monniaux

To cite this version:
Cyril Six, Sylvain Boulmé, David Monniaux. Certified and efficient instruction scheduling. Application to interlocked VLIW processors.. Proceedings of the ACM on Programming Languages, ACM, 2020, OOPSLA 2020, pp.129. 10.1145/3428197. hal-02185883v3

HAL Id: hal-02185883
https://hal.archives-ouvertes.fr/hal-02185883v3
Submitted on 23 Nov 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Certified and Efficient Instruction Scheduling
Application to Interlocked VLIW Processors

CRYL SIX, Kalray S.A., France and Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, France
SYLVAIN BOULMÉ, Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, France
DAVID MONNIAUX, Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, France

CompCert is a moderately optimizing C compiler with a formal, machine-checked, proof of correctness: after successful compilation, the assembly code has a behavior faithful to the source code. Previously, it only supported target instruction sets with sequential semantics, and did not attempt reordering instructions for optimization.

We present here a CompCert backend for a VLIW core (i.e. with explicit parallelism at the instruction level), the first CompCert backend providing scalable and efficient instruction scheduling. Furthermore, its highly modular implementation can be easily adapted to other VLIW or non-VLIW pipelined processors.

CCS Concepts: • Software and its engineering → Formal software verification; Retargetable compilers; • Theory of computation → Scheduling algorithms; • General and reference → Performance; • Computer systems organization → Superscalar architectures; Very long instruction word.

Additional Key Words and Phrases: Formal verification of compiler optimizations, Instruction-level parallelism, the Coq proof assistant, Symbolic execution, Hash-consing.

Note: this paper extends with appendices the peer-reviewed paper


1 INTRODUCTION

The CompCert certified compiler [Leroy 2009a,b] is the first optimizing C compiler with a formal proof of correctness that is used in industry [Bedin Franca et al. 2012; Kästner et al. 2018]. In particular, it does not have the middle-end bugs usually found in compilers [Yang et al. 2011], thus making it a major success story of software verification.

CompCert features a number of middle-end optimizations (constant propagation, inlining, common subexpression elimination, etc.) as well as some backend optimizations (register allocation using live ranges, clever instruction selection on some platforms). However, it does not attempt to reorder operations, which are issued in almost the same order as they are written in the source code. This may not be so important on processors with out-of-order or speculative execution (e.g. x86), since such hardware may dynamically find an efficient ordering on its own; yet it hinders

Authors’ addresses: Cyril Six, Kalray S.A., 180 avenue de l’Europe, Montbonnot, 38330, France, Cyril.Six@kalray.eu, Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, 700 avenue Centrale, Saint Martin d’Hères, 38401, France, Cyril.Six@univ-grenoble-alpes.fr; Sylvain Boulmé, Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, 700 avenue Centrale, Saint Martin d’Hères, 38401, France, Sylvain.Boulme@univ-grenoble-alpes.fr; David Monniaux, Univ. Grenoble Alpes, CNRS, Grenoble INP, Verimag, 700 avenue Centrale, Saint Martin d’Hères, 38401, France, David.Monniaux@univ-grenoble-alpes.fr.
performance on in-order processors, especially superscalar ones (multiple execution units able to execute several instructions at once, in parallel).

VLIW (Very Long Word Instruction) processors [Fisher 1983] require the assembly code to specify explicitly which instructions are to be executed in parallel. A VLIW bundle of instructions is an aggregate of atomic computations running in parallel on the execution units of the processor. Compared to out-of-order architectures, an in-order VLIW processor has a simpler control logic, thus using less CPU die space and energy for the same computing power; it is more predictable with respect to execution time, which is important in safety-critical applications where a worst-case execution time (WCET) must be estimated or even justified by a sound analysis [França et al. 2011]. In addition, a simpler control structure may be more reliable.\(^1\)

Due to their simpler design, such processors require more complex compilers to benefit from their potential. Compilers must indeed find an efficient way to decompose the behavior of the high-level program (typically in C) into a sequence of parallel atomic computations. Optimizing compilers for VLIW processors has a long and successful history since the seminal work of Fisher [1981]; Rau et al. [1982], followed by Feautrier [1991]; Lam [1988] and the Multiflow compiler [Lowney et al. 1993]. In the case of CompCert, the problem is made harder by the need to formally verify that this transformation is sound, that is, that it preserves the program semantics.

This paper presents an extension of CompCert with certified assembly generation for an interlocked VLIW processor (Kalray KVX core), along with an intrablock postpass scheduling optimization (postpass meaning that it occurs after instruction selection, register allocation, and spilling). However, only a few parts are specific to this processor: many of the insights and a large part of the implementation are likely to be applicable to other architectures, in particular to multiple-issue in-order cores (e.g. ARM Cortex A-53). Furthermore, we think general insights can be gained from our experiment, beyond the issue of instruction scheduling, such as efficiently certifying the output of compiler optimization phases by certified symbolic execution with hash-consing.

1.1 Overview of the Kalray KVX VLIW Core

The Kalray KVX core implements a 6-issue Fisher-style VLIW architecture [Fisher et al. 2005] (partial predication, dismissible loads, no rotating registers). It sequentially executes blocks of instructions called bundles, with parallel execution within them.

**Bundles.** A bundle is a block of instructions that are to be issued into the pipeline at the same cycle. They execute in parallel with the following semantics: if an instruction writes into a register that is read by another instruction of the same bundle, then the value that is read is the value of the register prior to executing the bundle. If two instructions of the same bundle write to the same register, then the behavior at runtime is non-deterministic. For example, the bundle written in pseudo-code "\(R_1 := 1; R_1 := 2\)" assigns \(R_1\) non-deterministically. On the contrary, "\(R_1 := R_2; R_2 := R_1\)" is deterministic and swaps the contents of \(R_1\) and \(R_2\) registers in one atomic execution step. In assembly code, bundles are delimited by `;` (Fig. 1). Compilers must ensure that each bundle does not require more resources than available—e.g., the KVX has only one load/store unit, thus a bundle should contain at most one load/store instruction. The assembler refuses ill-formed bundles.

**Execution Pipeline.** In the case of the KVX, bundles are executed through a 8-stage interlocked pipeline: the first stage prefetches the next bundle (PF stage), the second decodes it (ID stage), the third reads the registers (RR stage), then the last five stages (E1 through E5) perform the actual computation and write to the destination registers; depending on the instructions the writes occur

\(^1\)For instance, Intel’s Skylake processor had a bug that crashed programs, under complex conditions [Leroy 2017].
sooner or later (e.g., an addition takes fewer stages than a multiplication). If, during the RR stage\(^2\), one of the read registers of an instruction in the bundle is not available, the pipeline stalls: the bundle stops advancing through the pipeline until the register gets its result (Fig. 1).\(^3\)

![Cycle | ID | RR | E1 | E2 | E3 | E4+E5](attachment:cycle_table.png)

<table>
<thead>
<tr>
<th>Cycle</th>
<th>ID</th>
<th>RR</th>
<th>E1</th>
<th>E2</th>
<th>E3</th>
<th>E4+E5</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>B1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>B2</td>
<td>B1</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>B3</td>
<td>B2</td>
<td>B1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>B4</td>
<td>B3</td>
<td>B2</td>
<td>B1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>B4</td>
<td>B3</td>
<td></td>
<td>B2</td>
<td>B1</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>B4</td>
<td>B3</td>
<td></td>
<td></td>
<td>STALL</td>
<td>STALL</td>
</tr>
<tr>
<td>7</td>
<td>B5</td>
<td>B4</td>
<td>B3</td>
<td></td>
<td></td>
<td>STALL</td>
</tr>
</tbody>
</table>

B1 : \( R_1 := \text{load}(R_0 + 0); \)
B2 : \( R_2 := \text{load}(R_0 + 4); \)
B3 : \( R_3 := R_1 + R_2; R_4 := R_1 * R_2; \)
B4 : \( R_4 := R_3 + R_4; \)
B5 : \( \text{store}(R_0, R_3); \)
B6 : \( R_6 := R_7 + R_8; \)

Fig. 1. The pipeline stalls at cycles 5 and 6 because B3 is waiting for the results of \( R_1 \) and \( R_2 \) from bundles B1 and B2, which are completed at stage E3. Stage PF (not shown here) happens just before the ID stage.

Processor implementations can be divided into: out-of-order processors (e.g. modern x86), which may locally re-schedule instructions to limit stalls;\(^4\) in-order, which execute the instructions exactly in the order of the assembly code. On an in-order processor, an optimizing compiler should provide an efficient schedule; this is all the more important if the processor is multiple-issue or VLIW, since a single stalling cycle could have been used for executing multiple instructions.

1.2 Modular Design of the CompCert Compiler

Usual compilers (GCC, Clang/LLVM, ICC) split the compilation process into several components. In the case of CompCert, a frontend first parses the source code into an intermediate representation (IR)—called Cminor—that is independent of the target machine [Blazy et al. 2006]. Then, a backend transforms the Cminor program into an assembly program for the target machine [Leroy 2009b]. Each of these components introduces several IRs, which are linked by compilation passes. A compilation pass can either transform a program from an IR to another (transformation pass), or optimize within an IR (optimization pass). As illustrated in Fig. 2, CompCert introduces more IRs than usual compilers. This makes its whole proof more modular and manageable, because each compilation pass comes with its own proof of semantic preservation.

Fig. 2. The Intermediate Languages of CompCert

---

\(^2\)Or the ID stage, for some instructions such as conditional branching.

\(^3\)When a register is read before some prior instruction has written to it, non-interlocked VLIW processors use the old value. The compiler must then take instruction latencies and pipeline details into account to generate correct code, including across basic blocks. This is not the case for the KVX, where these aspects are just matters of code efficiency, not correctness.

\(^4\)For instance, in Fig. 1, seeing that the bundle B3 is stalled because its arithmetic instructions depend on a load in B2, an out-of-order processor could instead schedule the execution of B6.
Within the backend, compilers usually first introduce an unbounded number of pseudo-registers, which are then mapped to actual machine registers, with possible spills (saving on the stack, then reloading) when needed. This mapping is performed by the register allocation pass. Compiler backend passes are usually divided into two groups: those happening before register allocation, and those happening after. This paper presents a postpass scheduling optimization: it reorders and bundles instructions at the very end of the backend, after register allocation.

1.3 Porting CompCert to a VLIW Architecture

Porting a VLIW architecture such as the KVX processor presents two main challenges:

- **How to represent bundles in CompCert?** The existing Asm languages are sequential. We need to define a parallel semantics within bundles for our VLIW processor.
- **How to include a scheduling pass within CompCert?** On in-order processors, particularly those capable of executing multiple instructions at the same time, it is of paramount importance for execution speed that instructions are ordered in a way that minimizes stalls, which is not, in general, the order in which they are written in the C program. A scheduling pass reorders the instructions, with knowledge of their execution latencies, to minimize stalling. For instance, in Fig. 1, this pass could schedule B6 before B3. The task of grouping instructions into bundles (bundling) on a VLIW processor is usually performed by a postpass scheduler: instructions are in the same bundle if they are scheduled in the same time slot.

Certified scheduling was already explored by Tristan and Leroy [2008], who extended CompCert with a certified postpass list-scheduler, split into (i) an untrusted oracle written in OCaml that computes a scheduling for each basic block in order to minimize pipeline stalls (ii) a checker—certified in Coq—that verifies the oracle results. We identified three issues with their approach.

Firstly, their scheduling operates at the Mach level, simpler than Asm. Since some aspects (stack and control flow handling) are only detailed in the Mach to Asm pass, a model of latencies and pipeline use at the Mach level cannot be accurate. Furthermore, our scheduling needs the actual Asm instructions to construct well-formed bundles.

Secondly, their checker has exponential complexity w.r.t. the size of basic blocks, making it slow or even impractical as the number of instructions within a basic block grows. We thus needed to devise new algorithms that scale much better but that can still be proved correct in Coq.

Finally, Tristan and Leroy’s proof only holds for a preliminary version of CompCert where non-terminating executions were not modeled (see Tristan [2009, Section 3.5.3]). CompCert now models diverging executions as well. This makes the semantic preservation proof more complex.

Tristan and Leroy’s approach neither was integrated into CompCert, nor, to our best knowledge, seriously evaluated experimentally, probably due to prohibitive compile-times.

1.4 Contributions

Our main contribution is a scalable, certified and highly modular scheduler with bundling, combining an untrusted scheduling oracle with a verified scheduling checker. Both the oracle and checker are highly generic; we instantiated them with the instruction set and (micro-)architecture of the Kalray KVX core. We evaluated experimentally both scalability and the quality of the produced code.

Our solution solves the issues in Tristan and Leroy [2008]. Our certified scheduler is made of:

- An oracle, written in OCAML, producing a sequence of bundles for each basic block. We implemented a greedy list-scheduler with a priority heuristic based on latencies.  

---

5 A basic block is defined as a sequence of instructions with a single entry point (possibly named by a label in front of the sequence) and a single exit point (e.g., a control-flow instruction at the end of the sequence).

6 We briefly evaluate this choice compared to an optimal scheduler based on integer linear programming in Appendix D.
• A generic certified scheduling checker, written in Coq, with a proof of semantic preservation, implementing two independent checks: (1) Verifying that, assuming sequential execution within each bundle, the reordered basic block preserves the sequential semantics of the original one. This is achieved by comparing the symbolic execution of two basic blocks, as did Tristan and Leroy. The exponential complexity of their approach is avoided by introducing (verified) hash-consing. (2) Verifying that, for each bundle, the sequential and parallel executions have the same semantics. This reduces to checking that each bundle never uses a register after writing to it.

These checks are performed on a new IR, called AbstractBasicBlock, which makes them easier to implement and prove, and which is moreover generic w.r.t the instruction set. The core of our certified scheduler is independent from the instruction set: it can be reused for other processors, or other IRs (e.g. in another prepass intrablock scheduling).

We compiled various software packages with our version of CompCert for the KVX, including our scheduler7 and compared their execution time to that of the same software compiled with the reference compiler for the KVX (versions of the GNU C Compiler supplied by the chip designers).

1.5 Related Works
Our CompCert backend for the KV3 processor initially benefited from that of Barany [2018] for the KV2, though Barany’s backend generates only one instruction per bundle, and does not model the VLIW semantics. He also faced the challenge of representing pairs of 32-bit registers in CompCert for handling 64-bit floating-point values on the KV2. The KV3 natively has 64-bit registers.

Scheduling with timing and resource constraints is a classical problem; [Micheli 1994, §5.4]. Ample work exists on scheduling for VLIW processors [Dupont de Dinechin 2004]—but with no machine-checked proof of correctness of the compiler implementation.

Tristan and Leroy [2008]; Tristan [2009]; Tristan and Leroy [2010] studied more advanced scheduling techniques, including software pipelining, which are particularly suited to prepass optimization. We plan to consider these in future work.

Schulte et al. apply constraint programming to instruction selection, register allocation, code motion and other optimizations [Blindell et al. 2017; Castañeda Lozano et al. 2019]. Their process can even be optimal (w.r.t. their cost model) on medium-sized functions. They consider a wider class of optimizations than we do, but they do not provide any machine-verified proof of correctness.

Our CompCert optimization at assembly level postdates the one of Mullen et al. [2016]. They target x86-32 with more advanced peephole optimizations than we do: modulo register liveness and considering pointers as 32 bit-integers. But, they do not tackle instruction scheduling. Moreover, we show how to transfer basic blocks to the assembly, whereas they shortcut this issue by requiring unchecked assumptions (entitled “calling conventions”) on the generated assembly.

1.6 Architecture of Our Solution (and of This Paper)
Our ultimate goal is to generate efficient assembly code for our VLIW architecture: the AsmVLIW language is our final representation. It formalizes the assembly semantics of our VLIW target: the bundles are defined as basic blocks with a parallel execution inside.

Our postpass scheduler is formalized as a transformation on basic blocks. It takes as input our Asmblock IR, which shares its syntax with AsmVLIW, but with sequential execution inside basic blocks instead of a parallel one; these two languages are described in Section 3. Our postpass scheduling itself is described in Sections 4 to 6. Before that, Section 2 recalls the necessary details of CompCert. Finally, Section 8 presents experimental evaluations of our backend.

7Our full source code is available on https://gricad-gitlab.univ-grenoble-alpes.fr/certicompil/compcert-kvx.
In summary, we extended the CompCert architecture with the passes shown in Fig. 3. The preliminary stage of our backend constructs the Asmblock program from the Mach program. As Section 7 explains, the basic block structure cannot be recovered from the usual Asm languages of CompCert. Thus, we recover it from Mach, through a new IR—called Machblock—whose syntax reflects the basic block structure of Mach programs, and then translates each basic block from Machblock to obtain an Asmblock program. Then, our postpass scheduling from Asmblock to AsmVLIW takes each block from Asmblock, performs intra-block scheduling via an external untrusted oracle, and uses a certified checker to verify the generated AsmVLIW bundles. The architecture of this pass and its verification are given in Sect. 4, while those on the actual intra-block scheduling problem solved by our oracle are given in Sect. 6. The core of our scheduling checker—involving symbolic evaluation of basic blocks with hash-consing—operates on a new auxiliary IR, called AbstractBasicBlock, presented in Sect. 4 and 5, and further detailed in Appendix C.

2 COMPCERT BACKEND SEMANTICS

2.1 Correctness of Compilation Passes

In CompCert [Leroy 2009b], the semantics of a program $P$ consists of predicates for describing initial and final states, as well as a predicate $S \rightarrow S'$ (usually named step) indicating if one execution step can run from state $S$ to state $S'$ by generating a trace $t$—where $t$ is either a single observable event (e.g., an external call or an access to a volatile variable) or $e$ (absence of observable event). The formal correctness property of CompCert expresses that, given a source program $P_1$ without undefined behavior (i.e., that can always run step from a non-final state), if the compilation of $P_1$ produces some assembly program $P_2$, then the “observational behaviors” of $P_2$ are included in those of $P_1$, as formalized by Leroy [2009a,b]. In order to simplify correctness proofs of its successive passes (Fig. 2), CompCert uses an alternative definition for the correctness. One of them is the forward simulation applicable on passes between deterministic languages.

In its simplest form (lockstep simulation), given a relation ($\sim$) matching states between $P_1$ and $P_2$, a forward simulation involves proving that: (1) the initial (resp. final) states of $P_1$ match those of $P_2$; (2) given two matching states, if $P_1$ steps to a state $S'_1$, then $P_2$ steps to a state $S'_2$ matching $S'_1$.

Another form of forward simulation is the plus simulation, where the target program can execute one or more steps instead of just one. Fig. 4 schematizes both simulations.
2.2 The Asm IR

CompCert defines one Asm language per target processor. An Asm program consists of functions, each with a function body. Asm states are of a single kind “State(rs, m)” where rs is the register state (a mapping from register names to values), and m is the memory state (a mapping from addresses to values). An initial state is one where the PC register points to the first instruction of the main function, and the memory is initialized with the program code. The instructions of Asm are those of the target processor, each one with its associated semantics that specifies how the instruction modifies the registers and the memory.

In each CompCert backend the instruction semantics are modeled by an exec_instr function which takes as argument an instruction i, a register state rs and a memory state m, and returns the next state (rs’, m’) given by the execution of i; or the special state Stuck if the execution failed.

Here are some instructions that can be modeled in Asm:

- Pcall(s): calls the function from symbol s (saves PC into RA and then sets PC to the address of s);
- Paddw(rd, r1, r2): writes in rd the result of the addition of the lower 32-bits of r1 and r2;
- Pcb(bt, r, l): evaluates the test bt on register r—if it evaluates to true, PC jumps to the label l;
- Pigoto(r): PC jumps to the value of the register r.

Most of the instructions modeled in Asm directly correspond to the actual instructions of the processor. There can also be a few pseudo-instructions like Pallocframe or builtins, which are specified in Asm, then replaced by a sequence of instructions in a non-certified part of CompCert. Due to distinctions in immediate vs register operand, word vs doubleword operand size etc. there are as many as 193 instructions to be modeled in the KVX processor.

3 SEMANTICS FOR A VLIW ASSEMBLY LANGUAGE

The Asm language of our target processor introduces a syntax and semantics for bundles of instructions. Bundles can be seen as a special case of basic blocks: zero or more labels giving (equivalent) names to the entry-point of the block; followed by zero or more basic instructions – i.e. instructions that do not branch, such as arithmetic instructions or load/store; and ended with at most one control flow instruction, such as conditional branching to a label.

Semantically, basic blocks have a single entry-point and a single exit-point: branching from/to the middle of a basic block is impossible. It is thus possible to define a semantics that steps through each block atomically, sequentially executing the program block by block. We call such a semantics a blockstep semantics. The notion of basic block is interesting for scheduling optimizations: reordering the sequence of basic instructions in a basic block without changing its (local) blockstep does not change the (global) semantics of the surrounding program.

We provide two such blockstep semantics which only differ on how they combine instructions within basic blocks: an IR with sequential semantics called AsmBlock, and one with parallel semantics called AsmVLIW. Our backend first builds an AsmBlock program from a Mach program, by detecting syntactically its basic block structure (Section 7). Then, each basic block is split into bundles of the AsmVLIW IR (Sections 4 & 6). Below, Section 3.1 defines the syntax shared between AsmVLIW and AsmBlock. Then, Section 3.2 defines AsmVLIW, and Section 3.3 defines AsmBlock.

3.1 Syntax of Bundles & Basic Blocks

We first split the instructions into two main inductive types: basic for basic instructions and control for control flow ones. Then, a basic block (or a bundle) is syntactically defined as a record of type bblock with three fields: a list of labels, a list of basic instructions, and an optional control flow one.

```plaintext
Record bblock := { header: list label; body: list basic; exit: option control; correct: wf_bblock body exit }
```
In our AsmVLIW and Asmblock semantics, on a None exit, the PC is incremented by the amount of instructions in the block. This convention makes reasoning easier when splitting a basic block into a sequence of smaller ones. In order to avoid infinite stuttering (due to incrementing PC by 0), we further require that a block should contain at least one instruction.

Sections 3.2 and 3.3 define, respectively, the parallel and the sequential blockstep semantics of this syntax. A state in AsmVLIW and Asmblock is expressed the same way as in Asm: it is either a pair \((rs, m)\) where \(rs\) (register state) maps registers to values and \(m\) (memory) maps addresses to values, or a Stuck in case of failure (e.g. division by zero). Hence, executing a single instruction in our semantics gives an outcome defined as either a (Next \(rs\) \(m\)) state, or a Stuck execution. Then, each blockstep takes as input an initial state \((rs, m)\), fetches the block pointed by \(rs[PC]\) (the value of PC in \(rs\)), and executes the content of that block. Finally, that blockstep either returns the next state or propagates any encountered failure.

### 3.2 Parallel Semantics of AsmVLIW

A bundle is a group of instructions that are to be issued in the same cycle through the pipeline. The pipeline stages of our interlocked VLIW processor can be abstracted into:

**Reading Stage** the contents of the registers are fetched.

**Computing Stages** the output values are computed, which can take several cycles. Once an output is computed, it is available to other bundles waiting at the reading stage.

**Writing Stage** the results are written to the registers.\(^8\)

Our processor stalls at a reading stage whenever the result is not yet available. The exact number of cycles required to compute a value thus only has an impact on the performance: our formal semantics abstracts away the computing stages and only considers the reading and writing stages.

Reads are always deterministic: they happen at the start of the execution of our bundle. However, the write order is not necessarily deterministic, e.g. if the same register is written twice within the same bundle. We first introduce a deterministic semantics where the writes are performed in the order in which they appear in the bundle. For instance, the bundle “\(R_0 := 1; R_0 := 2\)” assigns 2 to \(R_0\), in our in-order semantics. The actual non-deterministic semantics is then defined by allowing the execution to apply an arbitrary permutation on the bundle, before applying the in-order semantics.

#### 3.2.1 In-Order Parallel Semantics

We model the reading stage by introducing an internal state containing a copy of the initial state (prior to executing the bundle). Such an internal state is thus of the form \((rs, \text{rsr}, \text{mr}, \text{mw})\) where \((\text{rsr}, \text{mr})\) is the copy of the initial state, and \((\text{rsr}, \text{mw})\) is the running state where the values are written. Fig. 5 schematizes the semantics.

- The function \((b\text{step } b \text{rsr }\text{rs }\text{mr }\text{mw})\) executes the basic instruction \(b\), fetching the values from \(\text{rsr} \text{ and } \text{mr}\), and performing the writes on \(\text{rs} \text{ and } \text{mw}\) to give an outcome.
- The function \((e\text{step } f \text{ext } sz \text{rsr }\text{rs }\text{mw})\) does the same with the optional control flow instruction \(\text{ext}\): if there is no instruction, then it just increments PC by \(sz\), the size of the block; here, \(f\) is the current function—in which branching instructions look for labels, like in other Asm semantics.
- The function \((\text{parexec}_\text{wio }\ldots \text{rsr }\text{mr})\) is the composition of the basic and control steps.

For example, the bundle “\(R_0 := R_1; R_1 := R_0; \text{jump@toto}\)” runs over an initial register state \(\text{rsr}_0 = \text{rs}\) in 3 steps:

1. “\(R_0 := R_1\)” leads to \(\text{rsr}_1 = \text{rsr}_0[R_0 \leftarrow \text{rs}[R_1]]\)
2. “\(R_1 := R_0\)” leads to \(\text{rsr}_2 = \text{rsr}_1[R_1 \leftarrow \text{rs}[R_0]]\)
3. “\(\text{jump }@\text{toto}\)” leads to \(\text{rsr}_3 = \text{rsr}_2[\text{PC }\leftarrow @\text{toto}]\)

\(^8\)This includes the Program Counter register, which is updated at the end of each bundle execution.
The final register state of the parallel in-order blockstep is
\[ rsw_3 = rsr[R_0 \leftarrow rsr[R_1] ; R_1 \leftarrow rsr[R_0] ; PC \leftarrow @toto] \]
As expected, this bundle swaps the contents of \( R_0 \) and \( R_1 \).

The in-order parallel execution of a list of basic instructions is formally defined in Coq by the following function, where \( \text{"NEXT } rs,m \leftarrow\ e_1 \text{ IN } e_2 \text{"} \) is a notation for:
\[
\text{match } e_1 \text{ with } \text{Next } rs,m \Rightarrow e_2 | \_ \Rightarrow \text{Stuck end}
\]

Fixpoint parexec_wio_body bdy rsr rsw mr mw : outcome
match bdy with nil ⇒ Next rsw mw |
| bi :: bdy’ ⇒ NEXT rsw’, mw’ ← bstep bi rsr rsw mr mw IN parexec_wio_body bdy’ rsr rsw’ mr mw’ end

The in-order parallel execution of a block (defined below) first performs a parallel in-order execution on the body (the list of basic instructions), and then performs a parallel execution with the optional control flow instruction. Here, \( f \) is the current function and \( sz \) is the offset by which PC is incremented in the absence of a control-flow instruction.

Definition parexec_wio f bdy ext sz rs m :=
\[
\text{NEXT rsw’, mw’ ← parexec_wio_body bdy rs rs m m IN estep f ext sz rsw’ mw’}
\]

3.2.2 Deterministic Out-of-Order Parallel Semantics. The in-order parallel semantics defined above is not very representative of how a VLIW processor works, since concurrent writes may happen in any order. This issue is solved by relation \( \text{parexec_bblock } f \ b \ rs \ m \ o \), which holds if there exists a permutation of instructions such that the in-order parallel execution of block \( b \) with initial state \( (rs, m) \) gives the outcome \( o \).

Definition parexec_bblock f b rs m o :=
\[
\exists bdy1 bdy2, \text{Sorting.\text{Permutation (bdy1 ++ bdy2) b.(body) \land o=(NEXT rsw’, mw’ ← parexec_wio f bdy1 b.(exit) (Ptrofs.repr (size b)) rs m IN parexec_wio_body bdy2 rs rs’ mw’)}
\]

Formally, the execution takes any permutation of the body and splits this permutation into two parts \( bdy1 \) and \( bdy2 \). It first executes \( bdy1 \), then the control flow instruction, then \( bdy2 \). While PC is possibly written before the end of the execution of the bundle, the effect on the control-flow takes place when the next bundle is fetched, reflecting the behavior detailed in Footnote 8.

This semantics gives a fair abstraction of the actual VLIW processor. However, the proof of semantic preservation of CompCert requires the target language to be deterministic. Consequently, we force our backend to emit bundles that have the same semantics irrespectively of the order of writes. This is formalized by the relation below:

Definition det_parexec f b rs m rs’ m’ :=
\[
\forall o, \text{parexec_bblock } f \ b \ rs \ m \ o \rightarrow o = \text{Next } rs’ m’
\]

Given \( (rs’, m’) \), the above holds only if all possible outcomes \( o \) satisfying \( \text{parexec_bblock } f \ b \ rs \ m \ o \) are exactly \( (\text{Next } rs’ m’) \); that is, it only holds if \( (rs’, m’) \) is the only possible outcome. We then use the \( \text{det_parexec} \) relation to express the step of our AsmVLIW semantics: if \( \text{det_parexec} \) does not hold then it is not possible to construct a step.
3.3 Sequential Semantics in Asmblock

Asmblock is the IR just before AsmVLIW: instructions are grouped in basic blocks. These are not re-ordered and split into bundles yet: execution within a block is sequential.

The given sequential semantics of a basic block, called exec_bblock below, is similar to the semantics of a single instruction in other Asm representations of CompCert. Just like AsmVLIW, its execution first runs the body and then runs the control flow. Our sequential semantics of single instructions reuses bstep and estep by using the same state for reads and for writes. Our semantics of single instructions is thus shared between the sequential Asmblock and the parallel AsmVLIW.

\[
\text{Fixpoint exec_body bdy rs m: outcome} \\
\text{match body with nil ⇒ Next rs m} \\
\text{| bi::bdy' ⇒ NEXT rs', m' ← bstep bi rs rs m m IN exec_body bdy' rs' m'} \\
\text{end}
\]

\[
\text{Definition exec_bblock f b rs m: outcome} = \\
\text{NEXT rs', m' ← exec_body b. (body) rs m m IN estep f.(exit) (Ptrofs.repr (size b)) rs' rs' m'}
\]

4 CERTIFIED INTRABLOCK POSTPASS SCHEDULING

Our postpass scheduling takes place during the pass from Asmblock to AsmVLIW (Fig. 3). This pass has two goals: (1) reordering the instructions in each basic block to minimize the stalls; (2) grouping into bundles the instructions that can be executed in the same cycle. Similarly to Tristan and Leroy [2008], our scheduling is computed by an untrusted oracle that produces a result which is checked by Coq-proved verifiers. A major benefit of this design is the ability to change the untrusted oracle without modifying our Coq proofs.

The verifiers check semantic correctness only. If some generated bundle exceeds resource constraints, it will be rejected by the assembler.

Scheduling is performed block by block from the Asmblock program. As depicted in Fig. 6, it generates a list 1b of AsmVLIW bundles from each basic block B. More precisely, a basic block B from Asmblock enters the PostpassScheduling module. This module sends B to an external untrusted scheduler, which returns a list of bundles 1b, candidates to be added to the AsmVLIW program (scheduling is detailed in Section 6). The PostpassScheduling module then checks that B and 1b are indeed semantically equivalent through dedicated verifiers. Then, PostpassScheduling either adds 1b to the AsmVLIW program, or stops the compilation if the verifier returned an error.

In Coq, the scheduler is declared as a function \(^9\) splitting a basic block “B: bblock” into a value that is then transformed into a sequence of bundles “1b: list bblock”.\(^{10}\)

\[\text{Axiom schedule: bblock → (list (list basic))∗(option control)}\]

\(^9\)The scheduler is declared as a pure function like other CompCert oracles.

\(^{10}\)It would be unsound to declare schedule returning directly a value of type “list bblock”, since the “correct” proof field of bblock does not exist for the OCAML oracle.
The proof of the pass uses a “Plus” simulation (Fig. 4): one step of the initial basic block $B$ in the sequential AsmBlock semantics is simulated by stepping sequentially all bundles of $1b$ for the parallel AsmVLIW semantics. This forward simulation results from composition of these two ones:

1. A plus simulation ensuring that executing $B$ is the same as executing $1b$ in the sequential AsmBlock semantics, which proves the re-ordering part of the postpass scheduling.
2. A lockstep simulation ensuring that executing each bundle of $1b$ with the AsmBlock semantics gives the same result as executing this bundle with the parallel AsmVLIW semantics.

Each of these two forward simulations is actually derived from the correctness property of a dedicated verifier. In other words, we prove that if each of these two verifiers returns "OK", then the corresponding forward simulation holds. The following sections describe these two verifiers and their correctness proof. We first introduce AbstractBasicBlock, a helper IR that we use for both simulations. Then we describe the “parallelizability checker” ensuring a lockstep simulation (2). Finally, we describe the “simulation checker” ensuring a plus simulation (1).

### 4.1 AbstractBasicBlock IR

The core of our verifiers lies in the AbstractBasicBlock specific representation. AbstractBasicBlock provides a simplified syntax, in which the registers that are read or assigned by each instruction appear syntactically. The memory is encoded by a pseudo-register denoted by $m$. We illustrate in Example 4.1 how we have translated some instructions into AbstractBasicBlock assignments.

**Example 4.1 (Syntax of AbstractBasicBlock).** Examples of translations into AbstractBasicBlock:

1. the addition of two registers $r_2$ and $r_3$ into $r_1$ is written “$r_1 := \text{add}[r_2, r_3]$”;
2. the load into register $r_1$ of memory address “$\text{ofs}[r_2]$” (representing $\text{ofs} + r_2$ where $\text{ofs}$ is an integer constant) is written “$r_1 := (\text{load }\text{ofs})[m, r_2]$”;
3. the store of register $r_1$ into memory address “$\text{ofs}[r_2]$” is written “$m := (\text{store }\text{ofs})[m, r_1, r_2]$”.

AbstractBasicBlock is dedicated to intra-block analyses. In AbstractBasicBlock, a block is encoded as a list of assignments, meant to be executed either sequentially or in parallel depending on the semantics. Each assignment involves an expression, composed of operations and registers.

The syntax and semantics of AbstractBasicBlock are generic, and have to be instantiated with the right parameters for each backend. Though we only did that task with the KV3 backend, we believe it could easily be extended to other backends, and possibly other IRs as well.

AbstractBasicBlock provides a convenient abstraction over assembly instructions like most IR of CompCert except Asm: it leverages the hundreds of AsmVLIW instructions into a single unified representation. Compared to the Mach IR—used in the initial approach of [Tristan and Leroy 2008]— AbstractBasicBlock is more low-level: it allows to represent instructions that are not present at the Mach level, like those presented in Section 4.6. It is also more abstract: there is no particular distinction between basic and control-flow instructions, the latter being represented as an assignment of a regular pseudo-register PC. The simplicity of its formal semantics (given in Section 5) is probably the key design point that allows us to program and prove efficiently the verifiers.

The following sections summarize how we used AbstractBasicBlock to certify the scheduling of AsmBlock/AsmVLIW. See Section 5 for a more detailed presentation.

---

11This encoding should be refined in order to introduce alias analysis
4.2 Parallelizability Checker

To check whether the sequential and parallel semantics of a certain bundle give the same result, we translate the bundle into AbstractBasicBlock with a trans_block function, then we use the is_parallelizable function from AbstractBasicBlock.

Function is_parallelizable analyzes the sequence of AbstractBasicBlock assignments and checks that no pseudo-register is read or rewritten after being written once. For example, blocks “\( r_1 := r_2 \); \( r_2 := r_2 \)” and “\( r_1 := r_2 \); \( r_2 := r_1 \)” are accepted as parallelizable. However, “\( r_1 := r_2 \); \( r_1 := r_3 \)” are rejected, because \( r_1 \) is used after being written. See details in Appendix C.

When is_parallelizable returns true, the list of assignments has the same behavior in both the sequential and the parallel semantics. This property at the AbstractBasicBlock level can be lifted back to the AsmVLIW/Asmblock level, because the list of assignments returned by trans_block is proven to bisimulate the input block—both for the sequential and the parallel semantics. This bisimulation is also useful for the simulation checker described next section.

Proving the previously mentioned forward simulation (2) then relies on proving the following lemma bblock_para_check_correct, which is proven by using the above bisimulation property.

| Definition bblock_para_check bundle: bool := is_parallelizable (trans_block bundle) |
|Lemma bblock_para_check_correct ge f bundle rs m m': bblock_para_check bundle = true → exec_bblock ge f bundle rs m = Next rs' m' → det_parexec ge f bundle rs m rs' m' |

4.3 Verifying Intrablock Reordering

In order to reason on reordering, we define a concatenation predicate: “\( \text{is_concat } tb \ \text{lb} \)” means that the bblock \( tb \) is the concatenation of the list of bundles \( lb \). Formally, \( lb \) must be non-empty, only its head may have a non-empty header, only its tail may have a control flow instruction, \( tb.(\text{body}) \) must be the concatenation of all the bodies of \( lb \), and the header (resp. exit) of the head (resp. tail) of \( lb \) must correspond to the header (resp. exit) of \( tb \).

We also define a block simulation: a block \( b \) is simulated by a block \( b' \) if and only if when the execution of \( b \) is not Stuck, executing \( b \) and \( b' \) from the same initial state gives the same result (using the sequential semantics). That is, \( b' \) preserves any non-Stuck outcome of \( b \).

| Definition bblock_simu ge f b b' := V rs m, exec_bblock ge f b rs m <> Stuck → exec_bblock ge f b rs m = exec_bblock ge f b' rs m |

The forward simulation (1) reduces to proving the correctness of our verified_schedule function on each basic block, as formalized by this property.

| Theorem verified_schedule_correct: V ge f B lb, \((\text{verified_schedule } B) = (\text{OK } lb) \rightarrow \exists \ \text{tb}, \ \text{is_concat } tb \ lb \ \land \ \text{bblock_simu} \ ge f B \ tb\) |

In detail, (verified_schedule \( B \)) calls our untrusted scheduler (schedule \( B \)) and then builds the sequence of bundles \( lb \) as well as their concatenation \( tb \). Then, it checks each bundle for parallelizability with the bblock_para_check function described in the previous section. Finally, it calls a function bblock_simub: bblock \( \rightarrow \) bblock \( \rightarrow \) bool checking whether \( tb \) simulates \( B \).

Function bblock_simub is composed of two steps. First, each basic block is compiled (through the trans_block function mentioned in Section 4.2) into a sequence of AbstractBasicBlock assignments. Second, like in [Tristan and Leroy 2008], the simulation test symbolically executes each AbstractBasicBlock code and compares the resulting symbolic memories (Fig. 7). A symbolic memory roughly corresponds to a parallel assignment equivalent to the input block. More precisely,

\[ \text{More precisely, if the result of } \text{“bblock_simub } B \ \text{tb} \text{” is true, then } \text{“bblock_simu } ge f B \ \text{tb} \text{” holds.} \]
this symbolic execution computes a term for each pseudo-register assigned by the block: this term represents the final value of the pseudo-register in function of its initial value.

Example 4.2 (Equivalence of symbolic memories). Let us consider the two blocks $B_1$ and $B_2$ below:

\begin{align*}
(B_1) & \quad r_1 := r_1 + r_2; \quad r_3 := \text{load}(m, r_2); \quad r_1 := r_1 + r_3 \\
(B_2) & \quad r_3 := \text{load}(m, r_2); \quad r_1 := r_1 + r_2; \quad r_1 := r_1 + r_3
\end{align*}

They are both equivalent to this parallel assignment:

$$r_1 := (r_1 + r_2) + \text{load}(m, r_2) \quad \parallel \quad r_3 := \text{load}(m, r_2)$$

Indeed, $B_1$ and $B_2$ bisimulate (they simulate each other).

Collecting only the final term associated with each pseudo-register is actually incorrect: an incorrect scheduling oracle could insert additional failures. The symbolic memory must thus also collect a list of all intermediate terms on which the sequential execution may fail and that have disappeared from the final parallel assignment. See Example 4.3 below. Formally, the symbolic memory and the input block must be bisimulable\(^{13}\) as pictured on Fig 7.

\begin{figure}[h]
\centering
\scalebox{0.8}{
\begin{tikzpicture}
\node (B) at (0,0) {$B$};
\node (tb) at (4,0) {$\text{tb}$};
\node (Asmblock) at (0,-2) {Asmblock};
\node (AbstractBasicBlock) at (0,-3.5) {AbstractBasicBlock};
\node (Symbolic memories) at (0,-5) {Symbolic memories};
\node (bblock_simu) at (4,-2) {$b\text{block\_simu}$};
\node (trans_block) at (1,-2) {	extit{trans\_block}};
\node (compilations) at (1,-3.5) {	extit{compilations}};
\node (symbolic executions) at (1,-5) {	extit{symbolic executions}};
\node (simulated by) at (4,-5) {simulated by};
\node (bisimulation) at (4,-4) {bisimulation};
\draw (B) -- (Asmblock);
\draw (Asmblock) -- (AbstractBasicBlock);
\draw (AbstractBasicBlock) -- (Symbolic memories);
\draw (Symbolic memories) -- (bblock_simu);
\end{tikzpicture}}
\caption{Diagram of $b\text{block\_simu}$ Correctness}
\end{figure}

Example 4.3 (Simulation on symbolic memories). Consider:

\begin{align*}
(B_1) & \quad r_1 := r_1 + r_2; \quad r_3 := \text{load}(m, r_2); \quad r_1 := r_1 + r_3 \\
(B_1) & \quad r_3 := r_1 + r_2; \quad r_1 := r_3 + r_3
\end{align*}

Both $B_1$ and $B_2$ lead to the same parallel assignment:

$$r_1 := (r_1 + r_2) + (r_1 + r_2) \quad \parallel \quad r_3 := r_1 + r_3$$

However, $B_1$ is simulated by $B_2$ whereas the converse is not true. This is because the memory access in $B_1$ may cause its execution to fail, whereas this failure cannot occur in $B_2$. Thus, the symbolic memory of $B_1$ should contain the term “$\text{load}(m, r_2)$” as a potential failure. We say that a symbolic memory $d_1$ is simulated by a symbolic memory $d_2$ if and only if their parallel assignment are equivalent, and the list of potential failures of $d_2$ is included in the list of potential failures of $d_1$. See Section 5 for the formal definitions.

As illustrated in Examples 4.2 and 4.3, the computation of symbolic memories involves many duplications of terms. Thus, comparing symbolic memories with structural equalities of terms, as performed in [Tristan and Leroy 2008], is exponential time in the worst case. In order to solve this issue, we have developed a generic verified hash-consing factory for Coq. Hash-consing consists in memoizing the constructors of some inductive data-type in order to ensure that two structurally equal terms are actually allocated to the same object in memory. This enables us to replace (expensive) structural equalities by (constant-time) pointer equalities.

4.4 Generic and Verified Hash-Consing

We give here a brief overview of the hash-consing mechanism. More details on AbstractBasicBlock and its hash-consing mechanism are described in Section 5.

\(^{13}\)A "symbolic memory" corresponds here to a kind of parallel assignment, and does not represent a memory (despite the terminology). This confusion comes from the fact that "symbolic execution" as introduced by King [1976] refers to \textit{how to compute} "symbolic memories" and does not refer to \textit{what} they represent. Indeed, in our case, "symbolic execution" computes this alternative representation of each block by mimicking the sequential execution. See Sect. 5 for a formal overview.
Hash-consing a data-type simply consists in replacing the usual constructors of this data-type by smart constructors that perform the memoization of each constructor. This memoization is usually delegated to a dedicated function that can in turn be generated from a generic factory [Filliâtre and Conchon 2006]. Our hash-consing technique follows this principle. However, whereas the memoization factory of Filliâtre and Conchon [2006] (in OCaml) has no formal guarantee, ours satisfies a simple correctness property that is formally verified in Coq: each memoizing function observationally behaves like an identity.

Our Coq memoization function on terms invokes an external untrusted OCaml oracle that takes as input a given term, and returns a memoized term (possibly memoizing the input term in the process). Then, our Coq memoization function dynamically checks that the memoized term and the input term are isomorphic, or aborts the computation if it cannot ensure they are. This check is kept constant-time by using OCaml pointer equality to compare already memoized subterms: \( f(t_1, \ldots, t_m) \) and \( g(u_1, \ldots, u_n) \) are tested for isomorphism by checking that the head symbols \( f \) and \( g \) are identical, the numbers of arguments \( m \) and \( n \) are the same, and for all \( i \), \( t_i \) and \( u_i \) are the same pointer. We have thus imported OCaml pointer equality into Coq.

Importing an OCaml function into Coq is carried out by declaring the type of this OCaml function through an axiom: the Coq axiom is replaced by the actual OCaml function at extraction. Using a pure function type in this Coq axiom implicitly assumes that the OCaml function is logically deterministic (like any Coq function): calling the function twice on equal inputs should give equal outputs—where equality is Coq equality: structural equality. In contrast, the OCaml pointer equality does not satisfy this property: two structurally equal values do not necessarily have the same pointer.\(^{14}\)

We solve this issue by using the \texttt{pointer equality} from the IMPURE library of [Boulmé and Vandendorpe 2019], which embeds OCaml functions in Coq through a non-deterministic monad. In particular, it represents OCaml pointer equality as a non-deterministic function.\(^{15}\) We then use the axiom from IMPURE stating that, if pointer equality returns true, then the two values are (structurally) equal.

Here, \texttt{bool} is logically interpreted as the type of all “subsets” of Booleans; \texttt{phys_eq} is the physical equality, later extracted as the OCaml \((==)\); and \texttt{true} is the may-return relation of the IMPURE library: \texttt{phys_eq x y \Rightarrow true} means that \texttt{true} is a possible result of \texttt{phys_eq x y}. In other words, even if \texttt{phys_eq x y \sim true} and \texttt{phys_eq x y \sim b}, then we cannot conclude that \texttt{b=true} (we could also have \texttt{b=false}). The IMPURE library does not even assume that \texttt{\forall x, phys_eq x x \sim true}.

These axioms are proved to be non-contradictory w.r.t. the Coq logic. They express a correctness property of OCaml physical equality, from which we derive efficient and formally verified hash-consing.

### 4.5 Peephole Optimization

We have expressed the semantics of assembly instructions by decomposing them into atomic operations, which we used to define the symbolic execution. This means that distinct groups of instructions that decompose into the same atomic operations are considered equivalent. We exploit

\(^{14}\)Hence, if we model pointer equality (OCAML’s \(==\)) as an infixed function “\texttt{phys_eq:V \{A\}. A \rightarrow A \rightarrow bool}”, then we are able to prove this wrong proposition (when \(x\) and \(y\) are structurally equals but are distinct pointers):

\[
y=x \land (\texttt{phys_eq x y})=\text{false} \land (\texttt{phys_eq x x})=\text{true} \rightarrow \text{false=true}
\]

\(^{15}\)In the Coq logic, two occurrences of variable “\(x\)” may correspond to two distinct objects in memory (e.g. after substituting \(y=x\) in “\(p\ x\ y\)”). This is why \texttt{phys_eq} must appear as “non-deterministic” to Coq’s eyes.
this to implement peephole optimization: local replacement of groups of instructions by faster ones, prior to scheduling. On Fig. 6, this untrusted optimization is performed by a preliminary pass of the “Scheduler” oracle, and thus dynamically checked by our bbblock_simub trusted verifier.

Currently our only peephole optimizations are the replacement of two (respectively, four) store instructions from an aligned group of double-word (64-bit) registers (e.g. $r18, r19$) to a succession of offsets from the same base register (e.g. $8[r12], 16[r12]$) by a single quadruple-word (respectively, octuple-word) store instruction; and the same with loads. In practice, this quickens register spilling and restoring sequences, such as those around function calls: CompCert spills and restores registers in ascending order, whereas in general it would only be sheer chance that register allocation placed data that is consecutive in memory into a aligned group of consecutive registers. Similar optimizations could be conducted on the ARM (multiple loads and stores) and AAarch64 (loads and stores of arbitrary pairs of registers) architectures.

4.6 Atomic Sequences of Assignments in AbstractBasicBlock

In the previous examples of this paper, each AsmVLIW instruction is translated to a single assignment of AbstractBasicBlock. However, in many cases (extended-word accesses of Sect. 4.5, control-flow instructions, frame-handling pseudo-instructions, etc), AsmVLIW instructions cannot correspond to a single assignment: they are rather translated to AbstractBasicBlock as an atomic sequence of assignments (ASA). A form of parallelism may occur in such a sequence through the special operator Old($e$) where $e$ is an expression, meaning that the evaluation of $e$ occurs in the initial state of the ASA. And an AsmVLIW bundle (resp. an Asmblock basic-block) actually corresponds to the parallel (resp. sequential) composition of a list of such ASA.

For example, the parallel load from a 128-bit memory word involves two contiguous (and adequately aligned) destination registers $d_0$ and $d_1$ that are distinct from each other by construction—but not necessarily from the base address register $a$. This parallel load is thus translated into the following ASA, that emulates a parallel assignment of $d_0$ and $d_1$, even if $a = d_0$:

$$d_0 := (\text{load } i)[m, a] ; \quad d_1 := (\text{load } (i + 8))[m, (\text{Old } a)]$$

See Section 5.1 for a formal definition of atomic sequences of assignments.

5 FORMALIZING A SYMBOLIC EXECUTION WITH HASH-CONSING

This section sketches how the symbolic execution with hash-consing of AbstractBasicBlock is formalized. This requires to first present the formal sequential semantics of this IR (Sect. 5.1). See Appendix C for more details on AbstractBasicBlock (in particular its parallel semantics and the formalization of the parallelizability test).

5.1 Syntax and Sequential Semantics of AbstractBasicBlock

We sketch below the formal definition of AbstractBasicBlock syntax and its sequential semantics. Its syntax is parametrized by a type $R.t$ of pseudo-registers (positive integers in practice) and a type $\text{op}$ of operators. Its semantics is parametrized by a type $\text{value}$ of values, a type $\text{genv}$ for global environments, and a function $\text{op}_\text{eval}$ evaluating operators to an “option value”.

Let us introduce the semantics in a top-down (i.e. backward) style. Function $\text{run}$ defines the semantics of a bbblock by sequentially iterating over the execution of instructions, called $\text{inst}_\text{run}$. The $\text{inst}_\text{run}$ function takes two memory states as input: $m$ as the current memory, and $\text{Old}$ as the initial state of the instruction run (the duplication is carried out in $\text{run}$). It invokes the evaluation of an expression, called $\text{exp}_\text{eval}$. Similarly, the $\text{exp}_\text{eval}$ function takes two memory states as input: the current memory is replaced by $\text{Old}$ when entering under the $\text{Old}$ operator.
5.2 Sketch of Our Verified Hash-Consed Terms

Using King [1976] terminology, a symbolic value is a kind of term. In such terms, a pseudo-register represents its value in the initial memory of block execution. Hence, the structure of our terms is similar to type \texttt{exp} without the \texttt{Old} operator. Below, we define an inductive type \texttt{hterm} (together with \texttt{list_hterm}) for hash-consed terms: it adds an hash-tag information (of type \texttt{hashcode}) to each constructor. These hash-tags are computed by our external memoizing factory and intend to identify each hash-consed term as a unique integer. Actually, a misuse on these hash-tags will not affect the correctness of our verifier but only its performance (including its success). Indeed, it is ignored by the formal semantics of \texttt{hterm}, called \texttt{ht_eval}.

Our symbolic execution with hash-consed terms is parametrized by two memoizing functions \texttt{hC_term} and \texttt{hC_list_term}. Indeed, our simulation test ultimately performs two symbolic executions, one for each block: these two symbolic executions share the same memoizing functions, leading to an efficient comparison of the symbolic memories through pointer equality. The correctness property associated with each of these functions is directly derived from our generic certified memoization factory (which is detailed in Appendix C.4.2). Here is the specification of \texttt{hC_term} using notations of IMPURE (\texttt{hC_list_term} is similar).

These memoizing functions are invoked in the smart constructors of \texttt{hterm} and \texttt{list_hterm}. Below, we give the smart constructor—called \texttt{hApp}—for the \texttt{App} case with its correctness property. It uses a special hash-tag called \texttt{unknown_hid} (never allocated by our memoizing oracle): \texttt{hC_term} replaces this special hash-tag by the one actually allocated for this term.
5.3 An Abstract Model of Our Simulation Test (without Hash-Consing)

The formal proof of our simulation test is decomposed into two parts using a data-refinement style. In the first part, we define an abstract model of the symbolic execution and the simulation test (without hash-consing): this allows to reduce the simulation of two basic blocks for their sequential semantics to the simulation of their symbolic memories computed through an abstract definition of the symbolic execution. In a second part, sketched in Section 5.4, this abstract symbolic execution is refined using concrete data-structures and in particular hash-consing.

The symbolic execution of a block is modelled as a function \( bblock \_smem : \text{block} \to \text{smem} \), where a symbolic memory of type \( \text{smem} \) is abstractly modelled by the pair of a predicate \( \text{pre} \) expressing at which condition the intermediate computations of the block do not fail, and of a parallel assignment \( \text{post} \) on the pseudo-registers. For the sake of this presentation, we model terms with type \( \text{hterm} \), but without real hash-consing (all hash-tags are set to \( \text{unknown\_hid} \)).

\[
\text{Record \ smem} = \{ \text{pre: genv} \to \text{mem} \to \text{Prop}; \text{post: R.t} \to \text{hterm} \}. (* \text{abstract symbolic memories} *)
\]

Then, the bisimulation property between symbolic and sequential execution is expressed by:

\[
\text{Lemma \ bblock\_smem\_correct \ p \ d: \ bblock\_smem \ p = d } \to \forall m m', \text{run \ ge \ p \ m=Some \ m'} \leftrightarrow (d.\text{pre} \ ge m \land \forall x, \text{ht}\_eval \ ge (d.\text{post} \ x) m = \text{Some} \ (m' \ x))
\]

This lemma allows to reduce the simulation of block executions to the simulation of symbolic memories, formalized by \( \text{smem\_simu} \) below.

\[
\text{Definition \ smem\_valid \ ge \ (d: \text{smem}) \ (m: \text{mem}): \text{Prop} := d.\text{pre} \ ge m \land \forall x, \text{ht}\_eval \ ge (d.\text{post} \ x) m <> \text{None}\\
\text{Definition \ smem\_simu \ (d1 d2: \text{smem}): \text{Prop} := \forall ge m, \text{smem\_valid \ ge \ d1 \ m } \leftrightarrow \text{smem\_valid \ ge \ d2 \ m } \land \text{ht}\_eval \ ge (d1.\text{post} \ x) m = \text{ht}\_eval \ ge (d2.\text{post} \ x) m \)
\]

Theorem \( bblock\_smem\_simu \ p1 p2: \text{smem\_simu} (bblock\_smem \ p1) (bblock\_smem \ p2) \to \forall ge m, (\text{run \ ge \ p1 \ m}) <> \text{None} \to (\text{run \ ge \ p1 \ m}) = (\text{run \ ge \ p2 \ m})
\]

Internally, as coined in the name of “symbolic execution” by King [1976], \( bblock\_smem \) mimics \( \text{run} \) (the sequential execution of the block), by replacing operations on memories of type \( \text{mem} \) by operations on type \( \text{smem} \) given in Fig. 8. The initial symbolic memory is defined by \( \text{smem\_empty} \). The evaluation of expressions on symbolic memories is defined by \( \text{exp\_term} \): it outputs a term (a symbolic value). Also, the assignment on symbolic memories is defined by \( \text{smem\_set} \). To conclude, starting from \( \text{smem\_empty} \), the symbolic execution preserves the bisimulation of symbolic memories w.r.t. the sequential execution, on each assignment.

5.4 Refining Symbolic Execution with Hash-Consed Terms

We now refine the type \( \text{smem} \) into type \( \text{hsmem} \). The latter involves a dictionary of type \( \text{Dict.t\ hterm} \) (positive maps in practice) associating pseudo-registers of type \( \text{R.t} \) to hash-consed terms. Type \( \text{hsmem} \) is related to \( \text{smem} \) (in a given environment \( \text{ge} \)) by relation \( \text{smem\_model} \).

\[
\text{(* The type of our symbolic memories with hash-consing *)}\\
\text{Record \ hsmem= \{ \text{hpre: list\ hterm}; \text{hpost: Dict.t\ hterm} \}}\\
\text{(* implementation of the [smem\_valid] predicate *)}\\
\text{Definition \ hsmem\_valid\ ge \ (hd: \text{hsmem}) \ (m: \text{mem}): \text{Prop} := \forall ht, \text{List.In \ ht \ hd(hpre)} \to \text{ht}\_eval \ ge \ ht \ m <> \text{None}}\\
\text{(* implementation of the symbolic memory evaluation *)}\\
\text{Definition \ hsmem\_post\_eval \ ge \ (hd: \text{hsmem}) \ (x: \text{mem}): \text{option \ value} := \text{match Dict.get \ hd(hpost)} \ x \ \text{with None} \Rightarrow \text{Some} \ (m \ x) | \text{Some \ ht} \Rightarrow \text{ht}\_eval \ ge \ ht \ m \ \text{end}}\\
\text{(* The data-refinement relation *)}\\
\text{Definition \ smem\_model \ ge \ (hd: \text{smem}) \ (d: \text{mem}): \text{Prop} := (V m, \text{smem\_valid \ ge \ hd \ m} \leftrightarrow \text{smem\_valid \ ge \ d \ m}) \\
\land \forall m x, \text{smem\_valid \ ge \ d \ m } \to \text{hsmem\_post\_eval \ ge \ hd \ x \ m } = \text{ht}\_eval \ ge (d.\text{post} \ x) m}
\]

Fig. 9 provides an implementation of the operations of Fig. 8 that preserves the data-refinement relation \( \text{smem\_model} \). It uses the monadic operators provided by \text{IMPURE}: its unit noted "RET _"
(* initial symbolic memory *)
Definition smem_empty : smem := { | pre := (fun _ _ => True); post := (fun x => Input x unknown_hid) |}
(* symbolic evaluation of the right-hand side of an assignment *)
Fixpoint exp_term (e : exp) (d old : smem) : hterm :=
  match e with
  | PReg x => d. (post) x | Old e => exp_term e old old
  | Op o le => App o (list_exp_term le d old) unknown_hid
  end with
list_exp_term (le : list_exp) (d old : smem) : list_term := ...
(* effect of an assignment on the symbolic memory *)
Definition smem_set (d : smem) x (t : term) :=
  { | pre := (fun ge m => (t_eval ge (d. (post) x) m) <> None \land (d. (pre) ge m));
        post := (fun y => if R.eq_dec x y then t else d. (post) y) |}

Fig. 8. Basic Operations of the Symbolic Execution in the Abstract Model

(* initial symbolic memory *)
Definition hsmem_empty : hsmem := { | hpre := nil; hpost := Dict.empty |}
Lemma hsmem_empty_correct ge : smem_model ge smem_empty hsmem_empty
  (* symbolic evaluation of the right-hand side of an assignment *)
Fixpoint exp_hterm (e : exp) (hd hod : hsmem) :=
  match e with
  | Old e => exp_hterm e hod hod
  | PReg x =>
    match Dict.get hd. (post) x with
    | Some ht => RET ht
    | None => hInput x (* smart constructor for Input *) end
  | Op o le => DO lt f list_exp_hterm le hd hod ;
        hApp o lt (* smart constructor for App *) end with
list_exp_hterm (le : list_exp) (d od : hsmem) := ...
Lemma exp_hterm_correct ge e hod od d ht :
  smem_model ge od od \land smem_model ge d hd \land exp_hterm e hd hod \lor ht \rightarrow
  \forall m, smem_valid ge d m \land smem_valid ge od m \land ht_eval ge t m = ht_eval ge (exp_term e d od) m
(* effect of an assignment on the symbolic memory (naive wrt the actual implementation *)
Definition hsmem_set (hd : hsmem) x (ht : hterm) :=
  { | hpre := hd. hpre; hpost := Dict.set hd x ht |}
Lemma hsmem_set_correct hd x ht hd' :
  hsmem_model ge d hd \land \forall m, smem_valid ge d m \land ht_eval ge t m = ht_eval ge (ht_eval ge t m) \rightarrow
  hsmem_set hd x ht \lor hd' \rightarrow smem_model ge (smem_set d x t) hd'

Fig. 9. Data-Refinement of Symbolic Execution with Hash-Consing

and its bind operator noted “DO _ \larr _ ; _”. The smart constructors building hash-consed terms
are invoked by the exp_hterm (i.e. the evaluation of expressions on symbolic memories).

Then, the symbolic execution bblock_hsmem : bblock \rightarrow ?? hsmem invokes these operations
on each assignment of the block. We prove that it refines bblock_smem from lemma of Fig. 9.

Lemma bblock_hsmem_correct p hd : bblock_hsmem p \rightarrow ge, smem_model ge (bblock_smem p) hd

Finally, the main function of the simulation test (detailed in Appendix C.4.3) creates two memo-
izing functions hC_term and hC_list_term as presented Sect. 5.2. Then, it invokes the symbolic
execution bblock_hsmem on each block. These two symbolic executions share the memoizing
functions hC_term and hC_list_term, meaning that each term produced by one of the symbolic
executions is represented by a unique pointer. The symbolic executions produce thus two hsmem
and we compare them efficiently using physical equality on hash-consed terms.

6 INTRABLOCK SCHEDULING ORACLE
The postpass schedule is computed by an untrusted oracle, which first invokes a processor-
dependent frontend that turns the scheduling problem of a given basic-block into an optimization
problem (Sec. 6.2), which is then solved by one of our processor-independent oracles: (1) one instruc-
tion per bundle (2) greedy bundling without reordering (3) list scheduling (default) (4) reduction to
ILP (Integer Linear Programming; Appendix D), solved by an external tool (e.g., Gurobi).
Find \( t : 0 \ldots n \rightarrow \mathbb{N} \) satisfying:

- A resource usage constraint
  \[
  \forall i \sum_{j|t(j)=i} u(K(j)) \leq r
  \]  

- Latency constraints of the kind (where \( j' > j \)):
  \[
  t(j') - t(j) \geq \delta
  \]

Fig. 10. Our Family of Scheduling Problems

6.1 Bundlers without Reordering

In order to measure the performance impact of scheduling, we provide two simple backends without any reordering: the first one trivially issues one instruction per bundle, while the second one attempts to greedily “pack” successive instructions without altering the sequential semantics.

6.2 Scheduling as an Optimization Problem

We refer the reader to [Micheli 1994, Ch. 5] for a general background on scheduling problems in hardware, which is not far from our software problem [Dupont de Dinechin 2004]. Here, we explain the exact problem we need to solve on the Kalray VLIW architecture.

We have \( n \) instructions to schedule, that is, compute a function \( t : 0 \ldots n - 1 \rightarrow \mathbb{N} \) assigning a time slot to each instruction. These time slots will be used to group instructions into bundles: first bundle is all instructions \( j \) such that \( t(j) = 0 \), next bundle all those such that \( t(j) = 1 \) etc.

Each instruction \( j \) is characterized by a kind \( K(j) \) (whether it is an addition, a multiplication, a load, etc.). This schedule must satisfy three classes of constraints:

**Semantic Dependencies** Read and write dependencies are examined for each processor register, as well as the pseudo-register \( m \), standing for the whole addressable memory. These dependencies are functionally relevant: code reordered without paying attention to them is generally incorrect.

- **Read after write**: If instruction \( j \) writes to register \( r \) and this is the last write to \( r \) before an instruction \( j' \) reading from \( r \), then the schedule should respect \( t(j') - t(j) \geq 1 \).
- **Write after write**: If instruction \( j \) writes to register \( r \) and this is the last write to \( r \) before an instruction \( j' \) writing to \( r \), then the schedule should respect \( t(j') - t(j) \geq 1 \).
- **Write after read**: Instruction \( j \) reads from \( r \), the next write to \( r \) is instruction \( j' \), then \( t(j') - t(j) \geq 0 \).

**Latency Constraints** The description of the processor microarchitecture states, for each instruction, the number of clock cycles after which the values it produces are ready. More precisely, it states that if an instruction of kind \( k' \) is scheduled at least \( \delta \) cycles after an instruction of kind \( k \) then it incurs no waiting for reading the output of the other instruction. In most cases, \( \delta \) depends only on \( k \), but there are “bypasses” for some \( k' \) with lower \( \delta \) than for others. All these are mentioned in the processor documentation. For memory loads, we take the timing for a L1 cache hit.

The KVX processor is **interlocked**: latencies do not affect architectural semantics. If an instruction is scheduled before its operands are ready, the result is unchanged, the only consequence is that the whole bundle to which the instruction belongs is stalled. Thus, mistakes in the latencies may lead only to suboptimal performance, not to incorrect results.

**Resource Usage Constraints** The processor has a limited number of processing units. Therefore, a bundle of instructions must not request more processing units of a given kind than available. Also, there is a limit on the number of instruction words (“syllables”) inside a bundle. Bundles that do not abide by these rules will be rejected by the assembler.

The architecture documentation describes these limitations as a constant vector of available resources \( r \in \mathbb{N}^m \) and, for each instruction kind \( k \), a vector \( u(k) \in \mathbb{N}^m \). The constraint is that the sum of all \( u(K(j)) \) for all instructions \( j \) scheduled within the same bundle \( i \) should be coordinate-wise less than or equal to \( r \), as expressed by Inequality (1) in Fig. 10.
The semantic dependencies and the latency constraints are instances of Inequality (2). In fact, the “read after write” dependencies are subsumed by the latency constraints between the output values and the read operands.

Finally, we introduce an extra time slot \( t(n) \) representing the time at which all instructions have already been completely executed, in the sense that all their outputs have been computed. We thus add extra latency constraints of the form \( t(n) - t(j) \geq \delta \) to express that output operands should be available at time \( t(n) \). Hence, \( t(n) \) is the makespan of our basic block, which we wish to minimize.

Our scheduling problem is thus an instance of the system of inequalities in Fig. 10: a correct sequence of bundles using \( t(n) \) cycles in total is directly built from any solution \( t \).

6.3 (Critical Paths) List Scheduler

Our default solver is based on a variant of Coffman-Graham list scheduling [Dupont de Dinechin 2004] [Micheli 1994, §5.4] with one heuristic: instructions with the longest latency path to the exit get priority. This is fast (quasi-linear-time) and computes an optimal schedule in almost all practical cases.

We consider that time \( i \) starts from 0, and we choose at each step which instructions \( j \) to schedule at time \( i \) (those for which \( t(j) \leq i \)). Our Algorithm 1 chains two ideas 16:

Maximal Scheduling Sets Assume we have already chosen a set \( S \) of instructions to be scheduled at time \( i \), such that \( \sum_{j \in S} u(K(j)) \leq r \). Assume there is \( j' \notin S \) such that all its operands are ready, and \( \sum_{j \in S \cup \{j'\}} u(K(j)) \leq r \). Then it is always at least as good to schedule \( j' \) in the same time slot as the instructions in \( S \), compared to scheduling only \( S \): this cannot increase the makespan. Thus, at every step we can restrict the search to \( S \) maximal w.r.t. the inclusion ordering among the feasible \( S \).

Critical Path Heuristic The question is then which \( S \) to consider if there are many of them, often the case for the first bundles of a block—since all instructions using only registers with their values at the start of the block can be scheduled in the first bundle.

Consider the (multi)graph \( G \) with an edge \( j \rightarrow j' \) for each inequality (2). It is acyclic, since all these edges satisfy \( j' > j \). In a valid schedule, \( t(j) \) is at most \( t(n) - l(j, n) \) where \( l(j, n) \) is the maximal length of paths from \( j \) to \( n \) in \( G \). If we had no resource constraints, in an optimal schedule we would have \( t(j) = t(n) - l(j, n) \). When constructing a maximal \( S \), we thus consider \( j \) in decreasing order of \( l(j, n) \); in other words, we try to schedule first the instructions on the critical path.

This algorithm never backtracks. If the choice is non-optimal, it may miss a better solution. This happens on example of Figure 11 (sequential code shown, one line per bundle). The timing constraints of this examples are: \( t_8 \geq \) all other times, \( t_6 \geq t_3 \) (write-after-read), \( t_6 - t_1 \geq 1 \) (write-after-write, WAW), \( t_3 - t_0 \geq 1 \) (WAW), \( t_3 - t_1 \geq 1 \) (read-after-write, RAW), \( t_3 - t_2 \geq 1 \) (RAW). The critical path lengths to \( t_8 \) are 1 for \( t_0, t_1, t_2, t_6 \) and 0 for \( t_3, t_4, t_5, t_7 \).

Algorithm 1: Sketch of our List Scheduler

\[
i := 0 \quad Q := \{0 \ldots n-1\}
\]

while \( Q \neq \emptyset \) do
\[
R := \emptyset \quad a := r
\]
for \( j' \in Q \) do
\[
\text{ready} := \text{true}
\]
for \( j \rightarrow j' \in G \) do
\[
\text{if } t(j) > j' - \delta \text{ then } \text{ready} := \text{false};
\]
if \( \text{ready} \) then \( R := R \cup \{j'\} \);
for \( j \in R \) (in descending \( l(j, n) \) order) do
\[
\text{if } a \geq u(K(j)) \text{ then } a := a - u(K(j)) \parallel Q := Q \setminus \{j\} \parallel t(j) := i
\]
i := i + 1

\[16\]This algorithm gives a simplified view of our implementation. The latter pre-computes all \( l(j, n) \) by graph traversal. And, it avoids scanning for all \( j' \in Q \) by updating an array, indexed by \( i \), of sets of instructions ready to be scheduled \( R(i) \) at time \( i \): an instruction is added to the appropriate \( R(i) \) when its last predecessor has been scheduled.
The list scheduler breaks ties between instructions 0, 1, 2, 6 according to their order in the original program and schedules instructions 0, 1, 2 at time slot 0, saturates the “maximum number of syllables” resource at this time slot and yields a schedule with a makespan of 4 (Fig. 12). In contrast, the optimal scheduler, using integer linear programming, yields a makespan of 3 (Fig. 13).

7 BASIC BLOCK RECONSTRUCTION

We motivate and briefly present our solution to reconstruct basic blocks in CompCert, necessary for the later scheduling pass. More details can be found in appendices A and B.

7.1 Necessity of Constructing Basic Blocks at the Mach Level

Mach is the IR closest to assembly with a significant level of abstraction. It features 3-address code instructions, with generic instructions (such as Mload, Mstore and Mop) that are to be translated into their Asm architecture specific equivalents. The ABI (Application Binary Interface) is also abstracted away into specific Mach instructions handling the stack and function parameters.

One major difference between Mach and Asm lies in their semantics: the “next” Mach instructions to be executed are directly in a Mach state (as a list of instructions), whereas the Asm instructions are stored in memory, accessed by the PC register. In such Asm semantics, the PC register can jump anywhere within the function body, not necessarily to a label. On the contrary, the Mach semantics ensures that jumps can only branch to particular points (labels, function entry points and return addresses). This property is not carried within the Asm IR. This makes reconstructing basic blocks from Asm (in a hypothetical Asm to Asmblock pass) impossible: our AsmVLIW semantics requires that PC never jump inside a basic block.

Our solution is to construct the basic blocks earlier, at the Mach level, by introducing a new Machblock IR as well as an architecture independent Mach to Machblock translation, and then adapting the former Mach to Asm pass to a new Machblock to Asmblock pass. Not only does introducing Machblock allows separating the proof of the basic block reconstruction from the proof of the Mach to Asm translation, but it also makes part of the process reusable for other backends.

7.2 Translating and Proving Mach to Machblock

The Mach to Machblock translation is a purely syntactic one: Mach instructions are separated into labels, basic and control-flow instructions (much like Section 3). We create an initial basic block and start filling basic instructions inside. The translation ends the current basic block whenever a label or control-flow instruction is met - after which it creates a new basic block.

The semantic preservation proof then checks that stepping through a certain number of instructions at the Mach level is equivalent to stepping through a corresponding basic block in Machblock.

17See the igoto instruction in Section 2.2.
This is done with an Option simulation: executing a Mach instruction either leads to a “stuttering” with a decreasing measure, or the execution of the whole Machblock block once the end is reached. See Appendix A.2 for details.

7.3 Translating and Proving Machblock to Asmblock

The Machblock to Asmblock translation relies on translating each Machblock basic block into an Asmblock basic block. Each Machblock basic instruction is translated into one or more Asmblock basic instructions—and each Machblock control flow instruction is translated into any number of Asmblock basic instructions, followed by a control flow instruction.

The proof of Machblock to Asmblock is a star simulation: each Machblock basic block is simulated by one (or more, in the case of a builtin) Asmblock basic blocks; like in usual Mach to Asm translation there is only a single stuttering case, on the Mach step restoring the caller state (see Appendix A.1.2 for detailed explanations). The simulation used for that proof is a blockstep simulation (stepping one basic block at a time) that we have to decompose in terms of instruction-step simulations.

In usual Mach to Asm passes, the match_states simulation specifies that the value of registers and memory stay correct throughout the execution. It also specifies that the code to which the PC register points must correspond to the code that is executed. Here, our instruction-step simulations cannot directly use such a match_states relation, since the PC register would have to point in the middle of a basic block (to fetch the current instruction), however our semantics disallow that.

We thus split the match_states relation into two different simulations: match_codestate which handles the simulation between Machblock and Asmblock regardless of the presence of the instructions in memory (these are instead saved in an intermediate “ghost state” called codestate), and a match_asmstate relation ensuring that the instructions stored in the codestate are present at the memory address pointed to by PC.

The actual blockstep simulation theorem is then cut into several smaller theorems, following the simulation diagram of Fig. 14. This approach re-uses most of the existing Mach to Asm proofs with a minimum amout of readaptations. More details can be found in Appendix B.

---

18 In our case, the measure is the (statically known) number of instructions left to run before the end of the basic block.

19 For instance, translating a Machblock conditional instruction may require basic instructions to evaluate the condition, followed by the actual Pcb conditional branch instruction.
8 EXPERIMENTAL EVALUATION

Our implementation adds to CompCert around 28Kloc of Coq and 5Kloc of OCAML, much more than e.g. 10 Kloc of Coq and 2Kloc of OCAML each for the Risc-V and x86 targets. Our target assembly is described by around 1.8K lines of specification in the AsmVLIW module. This is a little more than other Asm (1–1.3Klines). Our scheduling oracle is implemented by 2.4Kloc of OCAML (half for its frontend, half for its backend).

The remaining of this section describes our evaluation of this implementation. Firstly, we measure the compilation time of our optimization w.r.t the other optimization passes of CompCert. Then, we compare timings of compiled code with our CompCert to that of the default compiler for the target platform, measuring clock cycles using performance counters. We also compare our timings in different configurations to study the impact of postpass scheduling in CompCert.

8.1 Experimental Compiling Time

We experimentally checked that our oracle and its verifier have linear running times, by instrumenting the generated OCAML code of the compiler to get the user timings and basic block sizes.

Fig. 15 shows our measurements in logarithmic scales. Each point in this figure corresponds to an actual basic block from our benchmarks, verified or scheduled (for the list-scheduling) 1000 times. The verifier is generally a little slower than the oracle, but both are experimentally of linear complexity. The biggest basic block we came accross, of around 500 instructions, was scheduled and verified in approximately 4 ms, which is the same time required for other CompCert optimizations such as constant propagation or common subexpression elimination.

These compile times are in line with other optimization passes of CompCert.

8.2 Benchmarks Used

We evaluated our optimization on a range of applications that could be used on the Kalray KVX core: critical embedded systems and computational benchmarks. This is our target benchmark. We also evaluated CompCert on the Polybench benchmarks [Pouchet 2012].

radiotrans, convertible and heater-control are benchmarks compiled from synchronous dataflow programming languages (respectively Heptagon, Lustre v6 and Lustre v4). Such languages are for instance used to specify and implement fly-by-wire aircraft controls [França et al. 2011]. The C source code compiled from the high-level specification is then often compiled without optimization so that it can be easily matched to the resulting assembly code. CompCert’s advantage in this area is that it allows using optimizations, its semantics preservation proof replacing the manual structural matching between assembly and C code. lift is a lift controller program from TACLeBench, a collection of benchmarks used for worst-case execution time research [Falk et al. 2016].

On the computational part, bitsliced-aes and sha-256 are cryptography primitives taken from [Mosnier 2019] and [Patrick 2015], glpk runs GLPK (GNU Linear Programming Kit [Makhorin 2012]) on an example. picosat is an optimized SAT solver [Biere 2008], ran over a Sudoku example. genann is a minimal artificial neural network [Van Winkle 2018]. float-mat is a textbook implementation of floating-point matrix multiplication; float-mat-v2 is a version with high level transformations such
as loop unrolling done at the source level. jpeg-6b is from the Libjpeg [Lane and the Independent JPEG Group (IJG) 1998]. zlib [Gailly and Adler 2017] is a data-compression library.

8.3 Impact of Optimizations

Figure 16 illustrates the impact of our optimization pass on the performance of the generated code. The reference version uses the list scheduler of Section 6.3. Two others use the bundlers without reordering of Section 6.1: “orig” emits one instruction per bundle (close to what straightforwardly generating instructions one by one as other CompCert backends would produce), and “pack” uses the greedy bundler. In Figure 16, the execution time of each version is compared to those of the reference. Higher percentages mean better timings.

Postpass scheduling has a noticeable impact on performance: compared to the reference “orig” version, we get an average performance increase of 41%. However, some benchmarks such as convertible are barely affected by the optimization—indeed, the main loop features around 800 different variables in the same scope, which do not fit into the 64 registers of the KVX. Register spills are thus generated, which in turn prevent scheduling since we do not yet have any alias analysis yet, which would allow reordering memory accesses.

The “pack” version increases performance slightly w.r.t. “orig”, but not by much compared to true scheduling. We gain an average of 23% by scheduling instead of naive greedy bundling.

A word of warning: since our scheduler operates after register allocation,20 it is highly sensitive to register reuse: write($v_1$, $r_1$) . . . read($r_1$) . . . write($v_2$, $r_2$) . . . read($r_2$) (with no other accesses to $r_1$, $r_2$) can be rescheduled to write($v_1$, $r_1$) . . . write($v_2$, $r_2$) . . . read($r_1$) . . . read($r_2$) but write($v_1$, $r_1$) . . . read($r_1$) . . . write($v_2$, $r_1$) . . . read($r_1$) cannot. In some cases, optimizations prior to register allocation, with actual improvements in intermediate code, lead to worse final performance because the optimized code, after register allocation, happens to reuse a register in a “hot” loop in a way that prevents instructions from being scheduled optimally, whereas the less optimized code does

20We have not modified CompCert’s register allocator, except for allowing float and integer values to be allocated to a single bank of registers.
Fig. 17. Relative Speed of Generated Code with Kalray’s GCC, from -O0 to -O3 Optimization Levels; 100% being the Speed with CompCert by List Scheduling (Lower is Slower). Note that for some safety-critical applications with mandatory traceability from source to object code (e.g., DO178 level-A avionics), only gcc -O0 is likely to be usable, whereas CompCert’s correctness proof is accepted for traceability [Bedin França et al. 2012]. Our target benchmark is pictured at the top, Polybench is at the bottom picture.
not. This makes it difficult to measure the impact of optimizations, because whether or not registers are reused in this way is a matter of luck. Adding a prepass scheduler could solve this problem.\footnote{But, even with a prepass scheduler, a postpass scheduler on the assembly code would still be required: the assembly code is the level where instructions (e.g. loads of spilled registers) are precisely scheduled.}

### 8.4 Comparison of CompCert with Kalray’s GCC

We also compared our CompCert compiler to the GCC\footnote{The GNU Compiler Collection, https://gcc.gnu.org/} compiler supplied by Kalray, adapted from version 7.5.0, at -O0...-O3 optimization levels. -O1 deactivates scheduling and thus only generates bundles of one instruction. -O0 loads and stores variables from memory at every use.

The results (see Fig. 17) vary considerably depending on the benchmark—furthermore, at the time of this writing, the GCC backend is still being developed by Kalray: in particular some optimizations are not yet functional, and code selection could be improved in a few places. It is thus hard to draw meaningful conclusions on the comparison with GCC, though it allowed us to outline some optimizations that could be made by CompCert to improve performance.

For example, while CompCert is outperformed by GCC on `float-mat`, it comes close in performance on `float-mat-v2`, meaning that we identified which high-level transformations need to be integrated into CompCert to improve performance on this kind of functions.

Regardless, in average, we produce code 276\% faster than code produced by -O0, 19\% faster than -O1, 18\% slower than -O2, and 25\% slower than -O3. In some cases, we produce faster code than GCC, in the best case 8\% faster than -O3.

### 8.5 Remarks and Limitations

Even though we aim at comparing the quality of instruction scheduling, we actually compare two very different whole compilers. In particular, the machine-independent parts of CompCert do not perform certain optimizations that GCC does, among which:

- certain \textit{strength reductions}: GCC converts a multiplicative expression \(ci\), where \(c\) is a loop-invariant constant and \(i\) is a loop index with a step of one into a new variable \(x\) stepping in increments of \(c\);
- \textit{loop invariant code motion} and, more generally, any form of code motion across basic blocks;
- \textit{loop unrolling} and other loop optimizations; CompCert compiles loops straightforwardly;
- \textit{structure or array “disaggregation”:} expanding an aggregate (accessed through \texttt{base + index} loads and stores) into independent scalar variables, which can be allocated to registers;
- \textit{interprocedural optimisations}: the only one performed by CompCert is \textit{inlining}, and CompCert’s inlining heuristic is less aggressive than GCC’s.

In contrast, CompCert replaces a 32-bit signed division by a constant with a small efficient sequence of code [Granlund and Montgomery 1994], whereas GCC calls a generic library function.

Certain compiler differences are subtler but may have dramatic effects in some cases: e.g., our version of CompCert sometimes does not simplify an inlined function if a parameter is a constant value allowing simpler instructions to be used, e.g., replacing a floating-point division by 2 by a multiplication by 0.5. Some of these discrepancies have great importance on some benchmarks. For instance, textbook matrix multiplication can be greatly optimized by strength reduction (removal of multiplications for computing the address of array cells according to indices and stride), loop unrolling and loop invariant code motion. In some benchmarks, we consider both the original code and some slight manual optimization thereof, reproducing optimizations that GCC would perform.
9 CONCLUSION AND FUTURE WORK

Trusted Computing Base. Customizing Coq’s extraction mechanism to call external OCaml procedures increases the trusted computing base. Yet we limited this increase: the only property we trust is that OCaml’s pointer equality implies structural equality (of Coq).\textsuperscript{23}

Lessons learned. Formal proof forces developers to rigorously document the compiler, with precise semantics and invariants. Proving programs in Coq is heavyweight, but there is almost no bug-finding after testing on real programs: the compiler just works. We however had a few bugs in the parts of CompCert not checked by formal proofs—printing of assembly instructions, stack frame (de)allocation. Most bugs in the scheduling oracles were found by testing them on random examples; an untrusted checker with detailed error messages is useful for such early testing.

Apart from the difficulty of finding suitable invariants and proof techniques, another main hurdle was interpreting benchmark results. Missed optimization bugs (e.g. an inferior schedule was selected) were particularly hard to catch.

Future work. Our scheduler is intra-block; this implies it cannot for instance anticipate computations by moving them to free time slots before a branching instruction. We plan to implement a superblock scheduler allowing such movements on the critical path, before register allocation.

In some cases, we were able to identify register reuse as the cause of disappointing performance. Our postpass scheduler has to obey read-over-write and write-over-write dependencies. CompCert’s register allocator sometimes reuses registers in ways that prevent some better scheduling from being adopted; again, prepass scheduling should help in this respect: it would generate a reasonable schedule, register allocation would be performed on that schedule and the postpass scheduler would then perform local adjustments.

Our backend cannot at present reorder a memory read and a memory write, or two memory writes, even when their addresses cannot overlap. Also, CompCert sometimes does not recognize that it is reloading a value that it recently stored. We plan to add some form of alias analysis to our system to resolve both these issues.

Despite our efforts, our instruction selection is still perfectible: a few instructions that could be of use are still not selected. We shall work on this, though we do not hope much improvement can be gained. Selecting vector instructions automatically (loop vectorization) could improve performance in some cases, but doing so would entail considerable changes to CompCert’s notions for mapping variables to registers. Hardware loops could save some cycles on tight loops, but again these would require considerable changes to CompCert’s backend.

The main cause of inefficiency of the code generated for some examples, compared to GCC’s, is the lack of some high level optimizations in CompCert, for instance better inlining heuristics, structure or array “disaggregation” (expanding an aggregate into scalar variables), loop-invariant code motion and strength reduction. Again, this is left to future work.

\textsuperscript{23}Our correctness proof does not rely on other properties that our OCaml code provides (in particular, the correctness proof of our verifier does not assume that two isomorphic hash-consed data structures in existence at the same time are always allocated to the same place in memory). In order not to have to convert CompCert’s code generation flow to the full monadic style of Impure, we also unsafely exit from the Impure monad in order to obtain the scheduled code as though the verified scheduler were guaranteed to be (logically) deterministic. We do not think that this weakness hides a real issue: even if an unexpected bug in some of our OCaml oracles makes them non-deterministic, we do not call the scheduler twice on the same code, so there is no absurd case where we could go to if two different calls gave different results. This is in line with similar implicit assumptions elsewhere in CompCert that oracles are deterministic. These assumptions result from a shortcut in the formalization and are expected to be useless, without any bad consequence if they are wrong.
ACKNOWLEDGMENTS
We wish to thank Benoît Dupont de Dinechin, Xavier Leroy, Gergő Barany, Thomas Vandendorpe as well as the anonymous reviewers for their helpful remarks.
REFERENCES


A BASIC-BLOCK RECONSTRUCTION

This appendix explains how we recover the basic-block structure of a Mach program. Appendix A.1 first recalls some preliminaries on Mach and Asm IR in CompCert. This helps to motivate our new Machblock IR, between Mach and Asmblock, introduced at Appendix A.2. The latter also presents the translation between Mach and Machblock. The pass of Machblock to Asmblock is presented in Appendix B.

A.1 Preliminaries: from Mach to Asm in CompCert

Appendix A.1.1 presents the Mach IR of CompCert. Appendix A.1.2 explains the difference between Mach and Asm and presents the proof from Mach to Asm in existing CompCert backends.

A.1.1 The Mach IR of CompCert. The Mach IR consists of a set of functions, each with a function body. Roughly speaking, Mach is a simplified assembly language (with 3-address code instructions handling the actual registers of the target processor, except for some special registers like the program counter PC) where:

- The ABI (Application Binary Interface) is abstracted into Mach instructions allowing access to the stack and function parameters Mgetstack, Msetstack and Mgetparam.
- Loads and stores stay generic, and are not yet expanded into the “real” load/store instructions. The same applies to branching instructions.
- Calling instructions Mcall and Mtailcall can only branch either on a function symbol, or a register on an address that must be the first address of a function.
- Branching instructions such as Mgoto branch to labels in the current function (like in LLVM).
- There is neither a PC (Program Counter) nor a RA (Return Address) register. The remaining code to execute is an explicit part of the current state.

Mach states describe the register state rs, the global memory state m, and the stack state st. They are of three kinds, with the following meanings:

- (State st f c rs m): the first instruction of code c is about to be run in current function f;
- (Callstate st f rs m): function f is about to be run, the caller context has just been pushed on stack st;
- (Returnstate st rs m): a caller context is about to be restored from stack st.

A.1.2 Proving a Translation from Mach to Asm. The major difference between Mach and Asm really lies in the execution semantics. In Mach, the remaining code to execute is directly in the state and Mach semantics provide a clean notion of internal function call: execution can only enter into an internal function by its syntactic entry-point. In Asm, the code to execute resides in memory,
and is pointed to by the PC register. Through jumps into registers and a bit of pointer arithmetic, an Asm program can jump into the middle of a function, like in Return-Oriented-Programming (ROP) [Buchanan et al. 2008]. Thus, proving the code generation pass from Mach to Asm implies ensuring that the generated code does not have such a behavior (assuming that the Mach program does not have any undefined behavior): it simulates Mach execution where such a behavior does not exist.\(^{24}\)

Formally, it involves introducing a suitable “\(\sim\)” relation matching Mach states with Asm states. The gist of it consists in expressing a correspondence between the register states as well as the memory, in addition to the following properties depending on the Mach state: if it is a State, then the PC should point to the Asm code generated from Mach; if it is a Callstate, then the PC should point to the callee function, and the RA register to the return address (i.e. the address following the call instruction in the caller); otherwise, it is a Returnstate and the PC should point to the return address.

Then, the proof involves a “Star” simulation. Such a simulation of a program \(P_1\) by a program \(P_2\) is defined — for a relation \(S_1 \sim S_2\) matching states \(S_1\) from \(P_1\) with states \(S_2\) from \(P_2\) — by the following conditions:

- The initial states match
  \[\forall S_1, P_1.\text{istate } S_1 \implies \exists S_2, P_2.\text{istate } S_2 \land S_1 \sim S_2\]

- The final states match
  \[\forall S_1 S_2 r, S_1 \sim S_2 \land P_1.\text{fstate } S_1 r \implies P_2.\text{fstate } S_2 r\]

- The execution steps match through the “Star” simulation diagram (also depicted in Figure 4)

\[\forall S_1 S_2 S'_1 t, S_1 \sim S_2 \land S_1 \xrightarrow{t} S'_1 \implies \exists S'_2, S'_1 \sim S'_2 \land (S_2 \xrightarrow{t} S'_2 \lor (S_2 \xrightarrow{t} S'_2 \land |S'_2| < |S_1|))\]

The “Star” simulation diagram expresses that each single step of \(P_1\) producing a trace \(t\) can be simulated by several steps of \(P_2\) producing the same trace \(t\). In particular, when \(P_1\) performs an internal step (where \(t = \epsilon\), \(P_2\) can stutter, i.e. perform no computation step. But, if \(P_1\) loops forever without producing any observable event, then \(P_2\) cannot stutter infinitely.\(^{25}\) Indeed, stuttering is only allowed if the step of \(P_1\) makes the state decrease for a well-founded order (hence, sequences of successive stutterings cannot be infinite).

The “Star” simulation from Mach to Asm thus corresponds to prove that one Mach step (i.e. one transition of Fig. 18) gives the same result as several Asm instructions. For instance, the Mach step from Callstate into State is simulated by the steps of the Asm function prologue that allocate the stack-frame and save it into registers FP (Frame Pointer) and RA. The Mach conditional branching

\(^{24}\)Thus, ROP attacks on code generated by CompCert are only possible from undefined behaviors of the source code.

\(^{25}\)Otherwise an infinite silent loop \(P_1\) could be compiled into a program \(P_2\) returning in one step, and this would be incorrect.
step is simulated by the Asm steps that compute the result of the condition, and then branch accordingly. Actually, the only stuttering step of Asm w.r.t Mach corresponds to the Restoring step from Returnstate.

### A.2 Mach to Machblock Translation

As explained in A.1.2, on an arbitrary Asm program, we cannot prove that an execution cannot jump to the middle of a function, and in particular to the middle of a basic block. Thus, the basic block structure can only be recovered syntactically from the function structure of Mach programs. Moreover, proving that a block-step semantics simulates an instruction-step semantics is not trivial. Hence, it seems interesting to capitalize this effort for a generic component w.r.t the processor. This motivates us to introduce a new IR between Mach and Asmblock, called Machblock (Fig. 3), which introduces a sequential blockstep semantics on Mach programs.

The basic blocks syntax in Machblock is very similar to that of Asmblock, except that instructions are Mach instructions and that empty basic blocks are allowed (but not generated by our translation from Mach). We have only defined a sequential semantics for Machblock, which is Mach one, except that a whole basic block is run in one single step. This block-step is internally made up of several computation steps, one by basic or control-flow instruction.

```plaintext
Record bblock := { header: list label; body: list basic_inst; exit: option control_flow_inst }
```

The code of our translation from Mach to Machblock is straightforward: it groups successive Mach instructions into a sequence of “as-big-as-possible” basic blocks, while preserving the initial order of instructions. Indeed, each Mach instruction corresponds syntactically to either a label, a basic instruction, or a control-flow instruction of Machblock.

Proving that this straightforward translation is a forward simulation is much less simple than naively expected. Our proof is based on a special case of “Option” simulation (Fig. 19). Intuitively, the Machblock execution stutters until the Mach execution reaches the last execution of the current block, then while the Mach execution runs the last step of the current block, the Machblock execution runs the whole block in one step. Hence, the measure over Mach states—that indicates the number of Machblock successive stuttering steps—is simply the size of the block (including the number of labels) minus 1. Formally, we have introduced a dedicated simulation scheme, called “Block” simulation in order to simplify this simulation proof. This specialized scheme avoids defining the simulation relation—written “∼” in Fig. 19—relating Mach and Machblock states in the stuttering case. In other words, our scheme only requires relating of Mach and Machblock states at the beginning of a block. Indeed, the “Block” simulation scheme of a Mach program P1 by a Machblock program P2 is defined by the two conditions below (where S1 and S1’ are states of P1; S2 and S2’ are states of P2; and t is a trace):

1. Stuttering case of P2 for one step of P1:

   \[ |S_1| > 0 \land S_1 \xrightarrow{t} S_1' \quad \Rightarrow \quad t = \epsilon \land |S_1| = |S_1'| + 1 \]

2. One step of P2 for \(|S_1|+1\) steps of P1:

   \[ S_1 \sim S_2 \land S_1 \xrightarrow{t^{\mid S_1\mid+1}} S_1' \quad \Rightarrow \quad \exists S_2', S_2 \xrightarrow{t} S_2' \land S_1' \sim S_2' \]

Naively, relation \( S_1 \sim S_2 \) would be defined as “\( S_2 = \text{trans\_state}(S_1) \)” where \text{trans\_state} translates the Mach state in \( S_1 \) into a Machblock state, only by translating the Mach codes of \( S_1 \) into Machblock codes. However, this simple relation is not preserved by goto instructions on labels. Indeed, in Mach semantics, a goto branches to the instruction following the label. On the contrary, in Machblock semantics, a goto branches to the block containing the label. Hence, we define \( S_1 \sim S_2 \) as the following relation: “either \( S_2 = \text{trans\_state}(S_1) \), or the next Machblock step
from \( S_2 \) reaches the same Machblock state as the next step from \( \text{trans\_state}(S_1) \). The condition (2) of the “Block” simulation is then proved according to the decomposition of Figure 20.

On the right-hand side, \( c \) and \( b::bl \) are, respectively, the initial Mach and Machblock codes where \( b \) is the basic block at the head of the Machblock code. Relation \( \text{match\_state} \) is the Coq name of the simulation relation (also noted "\( \sim \)" in the paper). The Machblock step from \( b::bl \) simulates the following \( |b| \) Mach steps from \( c \). First, skip all labels: this leads to Mach code \( c_0 \). Second, run all basic instructions: this leads to Mach code \( c_1 \). Finally, run the optional control-flow: this leads to code \( c_2 \). Each of these three subdiagrams is an independent lemma of our Coq proof.

![Fig. 20. Overview of our Proof for Condition (2) of the “Block” Simulation of Mach by Machblock](image-url)

**B MACHBLOCK TO ASMBLOCK**

The Machblock to Asmblock pass is adapted from the Mach to Asm pass of other backends (see Appendix A.1.2), but with a consequential change: instead of manipulating instructions (and reasoning on the execution of single instructions), we are now manipulating basic blocks of instructions, and must reason on the execution of an entire basic block (in one single step). We are giving here details to hint the reader on the differences between the two approaches, in particular for the star simulation proof.

This appendix details the biggest differences, to give the viewer an idea of the difficulty involved in porting an existing Mach to Asm translation into a Machblock to Asmblock variant.

**B.1 Machblock to Asmblock Translation**

**B.1.1 Overview.** Given a basic block from Machblock, the translation consists in emitting an Asmblock basic block which performs the same operation.

The Mach to Asm pass of other backends translates code function by function, then within each function, instruction by instruction.

Our translation goes through the Machblock code function by function, then basic block by basic block. Within each basic block, the labels, basic instructions, then the control-flow instruction are translated consecutively.

The label translation is straightforward. A label in COMPJERT is a positive identification number, so we translate a label from Machblock to Asmblock by applying the identity function.

The translation of Machblock basic instructions give Asmblock basic instructions. Indeed, if a Machblock instruction does not modify the control, then there is no reason for the corresponding Asmblock instructions to do so. By "modifying the control" we mean here: not taking any branch, and going to the "next" instruction in line instead. This is done by the \( \text{transl\_basic\_code} \) and \( \text{transl\_instr\_basic} \) with the following signatures (the Boolean is there for an optimization explained in the next section):

\[
\begin{align*}
\text{transl\_basic\_code} : & \quad \text{Machblock.function} \rightarrow \text{list Machblock.basic\_inst} \rightarrow \text{bool} \rightarrow \text{list Asmblock.basic\_inst} \\
\text{transl\_instr\_basic} : & \quad \text{Machblock.function} \rightarrow \text{Machblock.basic\_inst} \rightarrow \text{bool} \rightarrow \text{list Asmblock.basic\_inst}
\end{align*}
\]
The translation of Machblock control-flow instructions is less trivial. It gives a list of Asmblock basic instructions (possibly empty), followed by a single Asmblock control-flow instruction.

For example, the translation of a MBreturn cannot be just a ret assembly instruction, because other operations such as restoring the stack pointer must be handled in the function epilogue.

Another example is the (MBcond cond args lbl) instruction which evaluates the arguments args with the condition cond, and jumps to lbl if this evaluation is true. Evaluating a comparison of a register to 0 can be done in one instruction in the Kalray assembly. However, comparing two registers must be done by at least two instructions: one (compd rd rs1 rs2) instruction comparing the two registers rs1 and rs2, writing the Boolean result to rd, then a cb (Conditional Branch) instruction using the value of rd to decide of the branch.

The translation of control-flow instructions is done by a transl_instr_control function of signature:

\[
\text{transl}_\text{instr}_\text{control}: \text{Machblock} \cdot \text{function} \rightarrow \text{option \ Machblock} \cdot \text{control}_{-}\text{flow}_{-}\text{inst} \rightarrow \text{list \ Asmblock} \cdot \text{instruction}
\]

If we note \((L; lb; c)\) a Machblock basic block, the translation then consists in generating the Asmblock basic block \((L; t_b(lb) \leftrightarrow l'; c')\) where \(t_b\) is transl_basic_code, and \(l', c'\) are defined such that \(t_c(c) = l' \leftrightarrow (c' :: \text{nil})\), with \(t_c\) being transl_instr_control.

B.1.2 The Argument Pointer Code Optimization. There are many possible ways to pass the arguments to a function in assembly code. In order to ensure a compatibility between the different compilers as well as hand-written assembly code, each architecture defines an ABI (Application Binary Interface), a specification of how arguments should be handled.

The arguments are usually first passed into registers, and then onto the stack if there isn’t enough place in the registers. In the case of the Kalray Coolidge ABI (but it is also the case for some other ABIs like the Risc-V), when there are too many arguments to be contained by the registers, the caller pushes the extra arguments on the stack. Since each stack-frame is adjacent, the callee can then access the arguments directly via its SP (Stack Pointer) register. For instance, if the stack size of the callee is 24 bytes long, and the stack is in decreasing order (parents are in higher addresses), then the first extra argument will be at address \((\text{SP}+32)\).

This poses an issue within the memory model of CompCert. A CompCert address is a couple \((b, off)\), where \(b\) is an integer representing the memory block, and \(off\) is the offset within that memory block. In this model, memory blocks are not contiguous. If we take the above example, \((\text{SP}+32)\) would actually be pointing to an invalid memory. Consequently, in CompCert, function arguments cannot be accessed with the SP register directly.

The solution adopted by CompCert backends, is to keep the value of the "old SP " somewhere on the stack. It is then part of the specifications that the function prologue should be saving SP before modifying SP. This is done by the Pallocframe pseudo-instruction, for which we give below the specification:

\[
\begin{array}{c}
| \text{Pallocframe sz pos} =\hline
\text{let } (\text{mw, stk) } := \text{Mem.alloc mr } 0 \text{ sz in} \\
\text{let } \text{sp } := \text{(Vptr stk Ptrofs.zero) in} \\
\text{match } \text{Mem.storev Mptr mw (Val.offset_ptr sp pos) rsr#SP with} \\
| \text{None } \Rightarrow \text{ Stuck}
\end{array}
\]
This pseudo-instruction is doing two actions in regards to the old value of SP:
- Storing it at (Val.offset_ptr sp pos)
- Copying it at the register FP, a register arbitrarily chosen among the general purpose registers that can be used for any operation.

This pseudo-instruction does not correspond to any particular assembly instruction: it is expended into a sequence of assembly instructions by not-formally-verified (but trusted) part of CompCert\textsuperscript{26}.

The copy into FP is part of a small optimization done during code generation. Loading a value from memory is expensive in most architectures, and having to load this "old SP" for each access of a parameter would result in a particularly inefficient code. In order to alleviate this, CompCert backends remember during the code generation whether they previously loaded the old SP or not. Then, when translating a MBgetparam, additional code is inserted if the old SP isn’t already loaded in FP. For instance, translating two consecutive MBgetparam writing a parameter in a different register than FP, will result in at most one FP reload operation.

In order to remember whether FP is loaded, a Boolean value parameter ep is added to each translation function. The translation starts with ep=true (the prologue initially loads the old SP into FP), then throughout the translation sets ep to true if it has been reloaded, or to false if it might have been destroyed. For instance, ep is set to false every time a label is encountered, since it is not possible to know where did the control originate from.

More precisely, the next value of ep (outside of labels, handled separately) is given by this function (in the case of Kalray backend, but other backends also have a similar function):

\begin{verbatim}
Definition fp_is_parent (before : bool) (i: Machblock.basic_inst) : bool :=
  match i with
  | MBgetstack ofs ty dst => before && negb (mreg_eq dst MFP)
  | MBsetstack src ofs ty => before
  | MBgetparam ofs ty dst => negb (mreg_eq dst MFP)
  | MBop op args res => before && negb (mreg_eq res MFP)
  | MBload chunk addr args dst => before && negb (mreg_eq dst MFP)
  | MBstore chunk addr args res => before
end
\end{verbatim}

MFP stands for the Mach register corresponding to FP. In the case of an operation whose result is to be stored in FP, we must invalidate FP by setting ep to false, hence the \((\text{negb } (\text{mreg}\_\text{eq } \text{dst MFP})\)) used in most cases of the match.

B.1.3 Constraints of an Asmblock Basic Block. There are two main constraints to verify when forming an Asmblock basic block. These are enforced by a wf\_bblock predicate inside a bbblock record.

The first constraint is that the bbblock should never be empty. Indeed, by our semantics, PC is incremented by the size of the block in the end of its execution. If there are three instructions

\textsuperscript{26}It can be a source of bugs to perform such a critical operation in the uncertified part of CompCert, especially when taking into account the complex support for variadic arguments. Yet, as of now, all backends adopt that approach, and careful testing can rule out these bugs.
in the bblock, then, at the end of the bblock step, PC is incremented by three. If the bblock was allowed to be empty, then PC would not get modified, we would then execute the same bblock, which would cause an infinite stuttering.

The second constraint comes from the Pbuiltin special instructions. A Pbuiltin instruction represents a builtin instruction, that is, an instruction that is inserted directly from the C code to the assembly code, without any translation from the compiler.

Here is an example of a C code containing such a builtin:

```c
/* Invalidating data cache */
__builtin_k1_dinval();
int v = tab[40];
printf("%d\n", v+3);
```

The __builtin_k1_dinval builtin produces directly the instruction dinval, which is used to invalidate the data cache, before accessing a value on the heap. Such instructions are specially handled by compilers.

In terms of CompCert assembly semantics, they are treated as external function calls (generating a trace), however they pose the problem that, in assembly, we do not know exactly what they are made of. Their expansion lies in the uncertified part of CompCert, with the only restriction that a builtin cannot create new functions or sections. Each builtin instruction could potentially be generating an entire block of code, with its own labels and more than one control-flow instructions.

We could treat them the same way we treat function call instructions, but then the uncertified scheduler could return us a schedule where the builtin is to be bundled with another basic instruction, which could lead to absurd code if the builtin is eventually expanded to several instructions and/or labels.

We are hitting here the limits of the builtin specification inside CompCert. In order to prevent the above case, we chose to isolate each builtin, with one desired property: if an Asmblock basic block has a builtin instruction, then that instruction must be the only instruction of the basic block.

This ensures that the uncertified expansion should have no effect on the rest of the generated code - though this is not proven formally, by lack of precise builtin semantics. Just like the function prologue and epilogue, test benchmarks are used to remedy this lack of formal certification.

Let us also recall a third constraint, which is not specific to a bblock but rather a direct definition of our Asmblock semantics: we cannot branch in the middle of a bblock, i.e. a Stuck state is induced when the PC register does not point to the beginning of a bblock. This third constraint is necessary to define a blockstep semantics.

B.1.4 Function Prototypes for Implementation. We describe here the actual functions used for implementing the translation.

For each function, linearly, each basic block is translated via a transl_blocks function, starting with ep=true:

```ocaml
Definition transl_function (f: Machblock.function) :=
  do lb ← transl_blocks f (Machblock.fn_code) true;
```
The `make_prologue` consists in inserting the `Pallocframe` pseudo-instruction, as well as instructions making sure the return address is stored on the stack, in order to have a correct linking when executing the `ret` instruction of the function epilogue.

The `transl_blocks` goes through each basic block of the list, translates it, then propagates `ep` if the block does not have any label.

transl_block is split into three functions: `transl_basic_code` which translates linearly each basic instruction of the Machblock basic block, then `transl_exit_code` which translates the optional control-flow instruction, and finally, `gen_bblocks` which makes one or several basic blocks based on the results of the two last functions.

The `gen_bblocks` function ensures that we generate `bblocks` that satisfy the two earlier described constraints:
- a `bblock` cannot be empty: while we believe this should never happen, we must nevertheless tackle what happens in the case that it does to ensure a forward simulation. We have chosen to generate a `bblock` with a single `nop` instruction if we ever come across an empty Machblock basic block.
- a `builtin` instruction must be alone in its `bblock`. When we encounter one, we split the basic block into two `bblocks`.

Finally, the `extract_basic` and `extract_ctl` functions (not detailed here) extract respectively the basic and the control-flow instructions of a list of instructions.

### B.2 Forward Simulation Proof

#### B.2.1 Issues of Using the Usual Mach to Asm Diagram.

The usual diagram consists in starting from two states $s_1$ and $s_2$ for which a certain `match_states` relation holds, executing one Mach
Certified and Efficient Instruction Scheduling

instruction, then executing the translated Asm instruction and proving that the two obtained states \(s'_1\) and \(s'_2\) also verify the match_states relation, that we give below for our backend:

\[
\text{Inductive match_states : Machblock.state \rightarrow Asmvliw.state \rightarrow Prop :=}
\]
\[
| \text{match_states.intro :}
| \quad \forall s \text{ fb sp c ep ms m m' rs f tf tc}
| \quad (\text{STACKS: match_stack ge s})
| \quad (\text{FIND: Genv.find_funct_ptr ge fb = Some (Internal f)})
| \quad (\text{MEXT: Mem.extends m m'})
| \quad (\text{AT: transl_code_at_pc ge (rs PC) fb f c ep tf tc})
| \quad (\text{AG: agree ms sp rs})
| \quad (\text{DXP: ep = true \rightarrow rs#FP = parent_sp s})
| \quad \text{match_states (Machblock.State s fb sp c ms m)}
| \quad \text{(Asmvliw.State rs m')}
\]

This relation ensures several things, among which:

- **AG** and **MEXT** ensures the value of registers and memory match those of the Machblock state
- **AT** ensures that, in the memory pointed by PC register, the Machblock code \(c\) is translated into the Asm block code \(tc\) with the Boolean parameter \(ep\).
- **DXP** ensures that if \(ep\) is \(true\), then FP must contain the value of the old SP.

In the other CompCert backends, the **match_states** relation holds between each execution of a Mach instruction. For each possible Mach instruction, lemmas are then proved to ensure that executing on one hand the Mach instruction, on the other hand the translated Asm instructions, lead to the same result.

In our case, using the above **match_states** directly is tricky: instead of executing one Mach instruction and reasoning on its translation, we execute an entire basic block of Machblock instructions, and must reason on the translation of the whole block.

A possibility for us could be to define a finer grain step relation, which would execute a single instruction of a basic block on the Machblock part. In Machblock, executing one instruction could be as simple as executing its effects, then removing it from the current basic block. On the Asm block side, one could suggest to increment PC instruction by instruction instead of doing it all at the end of the block, however that would not be compatible with the constraint that we must not branch in the middle of a block.

**B.2.2 A New Diagram to Prove Machblock to Asmblock.** In the lights of this difficulty, we chose to introduce a new state definition, used for reasoning at a finer grain. We call it a Codestate, and define it as follows:

\[
\text{Record codestate :=}
\]
\[
\text{Codestate ( pstate: state; (* projection to Asmblock.state *)}
| \quad \text{pheader: list label; (* list of label *)}
| \quad \text{pbody1: list basic; (* list of basics coming from the Machblock body *)}
| \quad \text{pbody2: list basic; (* list of basics coming from the Machblock exit *)}
| \quad \text{pctl: option control; (* exit instruction, coming from the Machblock exit *)}
| \quad \text{ep: bool; (* reflects the [ep] variable used in the translation *)}
| \quad \text{rem: list AB.bblock; (* remaining bblocks to execute *)}
| \quad \text{cur: bblock (* current bblock to execute - useful to increment PC *)}
\]

A Codestate augments a state of Asmblock by also including the instructions to execute, much like Machblock. It also includes the value of ep used when translating the first instruction of Codestate.

\[27\text{We are here only describing the particular case of executing code within an internal function}\]
With this new Codestate, we can decompose the `match_states` relation into two relations.

The first of these two relations is `match_codestate`, which ensures an agreement between a Machblock state and a Codestate, namely: the code residing in the Codestate must have been a result of a translation of a Machblock code, the memory and register states should correspond, and also the ep value of the Codestate should match with the one used in the translation.

```coq
Inductive match_codestate fb: Machblock.state -> codestate -> Prop :=
  | match_codestate_intro :
    
    ∀ s sp ms m rs0 m0 f tc ep c bb tbb tbi
    (STACKS: match_stack ge s) 
    (FIND: Genv.find_funct_ptr ge fb = Some (Internal f)) 
    (MEXT: Mem.extends m m0) 
    (TBC: transl_basic_code f (MB.body bb) (if MB.header bb then ep else false) = OK tbc) 
    (TIC: transl_instr_control f (MB.exit bb) = OK tbi) 
    (TBLS: transl_blocks f c false = OK tc) 
    (AG: agree ms sp rs0) 
    (DXP: (if MB.header bb then ep else false) = true -> rs0#FP = parent_sp s)

match_codestate fb (Machblock.State s fb sp (bb::c) ms m)
{"| pstate := (Asmvliw.State rs0 m0);
  pheader := (MB.header bb);
  pbody1 := tbc;
  pbody2 := extract_basic tbi;
  pctl1 := extract_ctl tbi;
  ep := ep;
  rem := tc;
  cur := tbb |
} 
```

The second relation is `match_asmstate` between a Codestate and an Asmblock state, ensuring that the code present in a Codestate actually resides in memory of the Asmblock state, at the address pointed by the PC register.

```coq
Inductive match_asmstate fb: codestate -> Asmvliw.state -> Prop :=
  | match_asmstate_some :
    ∀ rs f tf tc m tbb of s ep tbdy tex lhd
    (FIND: Genv.find_funct_ptr ge fb = Some (Internal f)) 
    (TRANSF: transf_function f = OK tf) 
    (PCeq: rs PC = Vptr fb ofs) 
    (TAIL: code_tail (Ptrofs.unsigned ofs) (fn_blocks tf) (tbb::tc))

match_asmstate fb
{"| pstate := (Asmvliw.State rs m);
  pheader := lhd;
  pbody1 := tbdy;
  pbody2 := extract_basic tex;
  pctl1 := extract_ctl tex;
  ep := ep;
  rem := tc;
  cur := tbb |
} (Asmvliw.State rs m)
```

Both relations take an extra `fb` parameter, which is the function block, an integer value identifying the current function.

The details of proving the Machblock to Asmblock pass with these two new `match` relations are very cumbersome (it is in general the case for all of the Mach to Asm proofs of the various backends - there are a lot of details and corner cases to consider). In particular, we are not covering here the case of a builtin instruction. However, we are giving below the general idea of simulating a `bblock` without builtin, assuming we already have an existing Mach to Asm proof to base ourselves on.
Figure 21 depicts the diagram we went for. It can be decomposed in three theorems for simulating respectively the Machblock header, body, and exit. The exec_header is a predicate which removes the header from the block, and sets ep to false if there was a header:

\[
\text{Inductive exec_header: codestate \to codestate \to Prop :=}
\]

We start by the theorem \text{match_state_codestate}, allowing us to decompose \((\text{match_states} S_1 S'_1)\) into a \((\text{match_codestate} S_1 cs_{10})\) and a \((\text{match_asmstate} cs_{10} S'_1)\) with a fitting \(cs_{10}\):

\[
\text{Theorem match_state_codestate:}
\]

The Machblock header simulation is then straightforward, and is proven by the \text{step_simu_header} theorem below. After this theorem, the state \(s_{11}\) is free of any header:

\[
\text{Theorem step_simu_header:}
\]

The body simulation is then proven by induction on the list of basic instructions of \(s_{11}\), each individual case is covered by adapting the old proofs of Mach to Asm for the basic instructions.
This theorem gives a state \( s_{12} \) without any body anymore, just the exit instruction. We can then use the last theorem, \( \text{step_simu_control} \):

Theorem \( \text{step_simu_control} \):
\[
\begin{align*}
\forall \ bb \ bs \ fb \ sp \ c \ ms \ m \ rs1 \ m1 \ ms' \ cs1 \ m',
MB.\text{header} \ bb = \text{nil} \rightarrow \\
(\forall \ ef \ args \ res, MB.\text{exit} \ bb <> \text{Some (MBbuiltin ef args res)}) \rightarrow \\
\text{body_step} \ ge \ s \ fb \ sp \ (MB.\text{body} \ bb) \ ms \ m \ ms' \ m' \rightarrow \\
\text{pstate} \ cs1 = (\text{State rs1 m1}) \rightarrow \\
\text{match_codestate} \ fb \ (MB.\text{State} \ s \ fb \ sp (bb::c) ms \ m) \ cs1 \rightarrow \\
(\exists \ rs2 \ m2 \ cs2 \ ep, \\
\text{cs2} = (\{ \text{pstate} := (\text{State rs2 m2}); \text{pheader} := \text{nil}; \text{pbbody1} := \text{nil}; \text{pbbody2} := \text{pbbody2} \text{cs1}; \\
\text{pctl} := \text{pctl} \text{cs1}; \text{ep} := \text{ep}; \text{rem} := \text{rem} \text{cs1}; \text{cur} := \text{cur} \text{cs1} \}) \\
\land \text{exec_body} \ tge (\text{pbbody1} \text{cs1}) \text{rs1 m1} = \text{Next} \ text{rs2 m2} \\
\land \text{match_codestate} \ fb \ (MB.\text{State} \ s \ fb \ sp (bb::c) ms \ m) \ cs2 \\
\text{exec_body} \ tge \ (pbbody2 \text{cs1}) \text{rs2 m2} = \text{Next} \ text{rs3 m3} \\
\land \text{exec_control_rel} \ tge \ \text{tbb} \ \text{tbdy2} \ \text{tbdy2} \ \text{tbbd2} \ \text{tbbd2} \ \text{tbb} \ \text{tbdy2} \\
\land \text{match_states} \ S'' \ (\text{State rs4 m4})
\end{align*}
\]

That last theorem gives us a \text{match_states} between a \( S'_2 \) and \( S_2 \) which is what we are looking for. The final theorem \( \text{step_simulation_bblock} \) then uses the four theorems to prove the plus simulation.

Theorem \( \text{step_simulation_bblock} \):
\[
\begin{align*}
\forall \ sf \ f \ sp \ bs \ fb \ sp \ (\text{Machblock.body} \ bb) \ ms \ m \ ms' \ m' \rightarrow \\
(\forall \ ef \ args \ res, MB.\text{exit} \ bb <> \text{Some (MBbuiltin ef args res)}) \rightarrow \\
\text{exit_step} \ return_address_offset \ ge \ (MB.\text{exit} \ bb) (MB.\text{State} \ s \ fb \ sp (bb::c) ms' m') \ t S'' \rightarrow \\
(\exists \ rs3 \ m3 \ rs4 \ m4, \\
\text{exec_body} \ tge \ tbdy2 \ rs2 \ m2 = \text{Next} \ rs3 m3 \\
\land \text{exec_control_rel} \ tge \ \text{tbb} \ \text{tbdy2} \ \text{tbdy2} \ \text{tbbd2} \ \text{tbbd2} \ \text{tbb} \\
\land \text{match_states} \ S'' \ (\text{State rs4 m4})
\end{align*}
\]

C OVERVIEW OF THE ABSTRACTBASICBLOCK INTERMEDIATE REPRESENTATION

AbstractBasicBlock is an IR (Intermediate Representation) dedicated to verification of the results of scheduling/bundling oracles operating on basic blocks. This IR is only used for verification: there is no translation from AbstractBasicBlock to another IR of CompCert. It is independent of the target processor and from the remainder of CompCert.

Because of this good feature, this appendix describing AbstractBasicBlock intends to be self-contained, and does not even require to read other parts of the paper in details (except to understand the applications motivating AbstractBasicBlock). In particular, this appendix reformulates entirely Sect. 5, while providing much more details.\(^{28}\)

Appendix C.1 explains informally how our assembly instructions are compiled into AbstractBasicBlock: this introduces the syntax of AbstractBasicBlock instructions. Appendix C.2 formally defines this syntax and its associated semantics. Appendix C.3 presents the \text{parallelizability test},\(^{28}\) of course, this comes at the price of repetitions. Sorry for the careful reader of Sect. 5.
which checks that a bundle/basic block has the same behavior in sequential and in parallel executions. Appendix C.4 presents the simulation test, which checks that the sequential semantics of basic blocks is preserved by scheduling.

C.1 Introduction through the Translation from Asmblock and AsmVLIW

AbstractBasicBlock defines a (deeply-embedded) language for representing the semantics of single assembly instructions as the assignment of one or more pseudo-registers. For example, an instruction “add r1, r2, r3” is represented as an assignment “r1 := add[r2, r3]”. Hence, AbstractBasicBlock distinguishes syntactically which pseudo-registers are in input or output of each instruction. Moreover, it gives to all operations (including load/store and control-flow ones) a single signature “\text{list exp} \to \text{exp}”. A binary operation like add will just dynamically fail, if applied to an unexpected list of arguments. This makes the syntax of AbstractBasicBlock very simple.

Let us consider less straightforward examples. Our translation from Asmblock to AbstractBasicBlock represents the whole memory as a single pseudo-register called here \(m\). Hence, instruction “load \(r_1, r_2, i\)” (where \(i\) is an integer constant representing offset) is encoded an assignment “\(r_1 := \text{load}(i)[m, r_2]\)” where the underlying operation is “(load \(i\))”. In other words, the syntax of AbstractBasicBlock provides an infinite number of operations “(load \(i\))” (one for each \(i\)). Similarly, a “store \(r_1, r_2, i\)” is encoded an assignment “\(m := \text{store}(i)[m, r_1, r_2]\)” reflecting that the whole memory is potentially modified.

We also encode control-flow instructions in AbstractBasicBlock: a control-flow instruction modifies the special register \(PC\) (the program counter). Actually, we consider that each bundle of a VLIW processor has one control-flow instruction: when the latter is implicit in the assembly code, it corresponds to the increment of \(PC\) by the size of the bundle. Hence, in our translation of bundles to AbstractBasicBlock, each control-flow instruction performs at least the assignment “\(PC := (\text{incr} i)[PC]\)” where \(i\) is an integer representing the size of the bundle. Typically, a conditional branch such as “\(1, r, l\)” (where \(l\) is the label and \(r\) a register) is translated as the sequence of two assignments in AbstractBasicBlock:

\[
PC := (\text{incr} i)[PC] ; \quad PC := (1, l)[PC, r]
\]

It could equivalently be coded as the assignment “\(PC := (1, l)[(\text{incr} i)[PC], r]\)”. However, we find it more convenient to insert the incrementation of \(PC\) before the assignments specific to each control-flow instruction. A more complex control-flow instruction such as “\(\text{call } f\)” (where \(f\) is a function symbol) modifies two registers: \(PC\) and \(RA\) (the returned address). Hence “\(\text{call } f\)” is translated as the sequence of 3 assignments in AbstractBasicBlock:

\[
PC := (\text{incr} i)[PC] ; \quad RA := PC ; \quad PC := (\text{cte address}_f)[]
\]

To resume, an instruction of AbstractBasicBlock is a sequence of assignments. An abstract basic block is simply a list of such instructions: this list is run in sequence (for the sequential semantics), or in parallel (for the parallel semantics). Hence, there is a single translation from our assembly to AbstractBasicBlock: this translation produces a bisimulable basic block, both for the sequential semantics and the parallel semantics.

Finally, Asmblock contains instructions modifying several pseudo-registers in parallel. One of them is an atomic parallel load from a 128-bit memory word in two contiguous (and adequately aligned) destination registers \(d_0\) and \(d_1\). These two destination registers are distinct from each other by construction—but not necessarily from the base address register \(a\). These parallel assignments are expressed in the sequential semantics of AbstractBasicBlock instructions with the special \(\text{old}\) operator of AbstractBasicBlock expressions: an expression “(old \(e\))” evaluates “\(e\)” in the initial state.
of the surrounding AbstractBasicBlock instruction. Hence, the parallel load of 128-bit words is given in terms of two loads of 64-bit words:

\[ d_0 := (\text{load } i)[m, a] \quad d_1 := (\text{load } (i + 8))[m, (\text{Old } a)] \]

Similarly, our assembly provides a pseudo-instruction \texttt{freeframe} modifying both the memory and some registers. It is involved in the epilogue of functions. In the semantics, \texttt{freeframe} modifies the memory \(m\) by deallocating the current stack frame in the memory model of CompCert. It also updates register \(SP\) (the stack pointer) accordingly and destroys the contents of a scratch register called here \(tmp\). The modifications to \(SP\) and \(m\) are performed in “parallel”, since \(SP\) indicates the current stack frame in \(m\), and the new value of \(SP\) is read from this stack frame. For the pseudo-instruction \texttt{“freeframe}\_i1\_i2\texttt{”}\ (where \(i_1\) and \(i_2\) are two integers), our translation from Asmblock to AbstractBasicBlock introduces two intermediate operations: first, \texttt{“(freeframe\_m i1 i2)”}\ for the effect on memory, and second, \texttt{“(freeframe\_SP i1 i2)\”}\ for the effect on the stack pointer. Then, the pseudo-instruction \texttt{“freeframe}\_i1\_i2\texttt{”}\ is translated as the \textit{sequence} of 3 assignments in AbstractBasicBlock:

\[
m := (\text{freeframe}\_m \_i1 \_i2)[SP, m]; \\
SP := (\text{freeframe}\_SP \_i1 \_i2)[SP, (\text{Old } m)]; \\
tmp := \text{Vundef}[]
\]

In conclusion, each instruction of our assembly is translated into a sequence of assignments, where some of these assignments modify several pseudo-registers in “parallel” thanks to the special \texttt{Old} operator. We speak about \textit{atomic sequences of assignments}: these sequences represent atomic instructions which can themselves be combined either sequentially or in parallel.

### C.2 Syntax, Sequential and Parallel Semantics

The syntax of AbstractBasicBlock is parametrized by: a type \(R.t\) of pseudo-registers (the type of positive integers in practice) and a type \(op\) of operators. The semantics of AbstractBasicBlock is parametrized by: a type \(value\) of values and a type \(genv\) for global environments; and an evaluation function:

\[
\text{Parameter } \text{op\_eval} : genv \rightarrow op \rightarrow \text{list } value \rightarrow \text{option } value
\]

By convention, a None value in the semantics represents an error. For the underlying assembly instruction, it is either a dynamic error (like an invalid pointer dereference) or a syntactic error (the operation is not called on the right numbers of parameters).

The syntax of the language is given by:

\[
\text{Inductive } \text{exp} := \text{PReg}(x:R.t) \mid \text{Op}(o:op) (le: \text{list}\_\text{exp}) \mid \text{Old}(e:\text{exp}) \\
\quad \text{with } \text{list}\_\text{exp} := \ldots \\
\text{Definition } \text{inst} := \text{list } (R.t \ast \text{exp}).(* \text{inst = atomic sequence of assignments }*)
\]

The semantics introduces a notion of memory from pseudo-registers into values.

\[
\text{Definition } \text{mem} := R.t \rightarrow value \\
\text{Definition } \text{assign } (m:\text{mem}) (x:R.t) (v: value): \text{mem} := \text{fun } y \Rightarrow \text{if } R.\text{eq\_dec } x y \text{ then } v \text{ else } m y
\]

\[29\text{Such an operator Old is quite standard in Hoare logic assertions. For example, see the ACSL annotation language of Frama-C [Kirchner et al. 2015].}\]

\[30\text{A benefit of this translation is that our scheduling oracle may replace two loads of 64-bit words into one load of a 128-bit words, and our verifier is able to check “for free” whether the replacement is semantically correct.}\]
Then, the sequential semantics of a block takes a memory $m$ as input and returns an optional memory. It simply iterates sequentially over the execution of instructions, called $\text{inst\_run}$, and detailed below. Here, “$\text{SOME } v \leftarrow e_1 \text{ IN } e_2$” means “$\text{match } e_1 \text{ with Some } v \Rightarrow e_2 \text{ | } _- \Rightarrow \text{None }$ end”.

```
Fixpoint run ge (p: bblock) (m: mem): option mem :=
  match p with
  | nil ⇒ Some m
  | i::p' ⇒ SOME m' ← inst_run ge i m m IN run ge p' m'
end
```

The $\text{inst\_run}$ function takes two memory states as input: $m$ as the current memory, and $\text{old}$ as the initial state of the instruction run (the duplication is carried out in $\text{run}$ above). It invokes the evaluation of expression, called $\text{exp\_eval}$ and defined below.

```
Fixpoint inst_run ge (i: inst) (m old: mem): option mem :=
  match i with
  | nil ⇒ Some m
  | (x,e)::i' ⇒ SOME v' ← exp_eval ge e m old
    IN inst_run ge i' (assign m x v') old
end
```

```
Fixpoint exp_eval ge (e: exp) (m old: mem): option value :=
  match e with
  | PReg x ⇒ Some (m x)
  | Op o le ⇒ SOME lv ← list_exp_eval ge le m old
    op_eval ge o lv
  | Old e ⇒ exp_eval ge e old old
end
```

Similarly, the $\text{exp\_eval}$ function takes two memory states as input: the current memory is replaced by $\text{old}$ when entering under the $\text{Old}$ operator.

```
Fixpoint list_exp_eval ge (le: list_exp) (m old: mem): option (list value) :=
...
```

Now, we define the non-deterministic out-of-order parallel semantics of AbstractBasicBlock as the $\text{prun}$ relation below. Like the semantics of AsmVLIW defined at Section 3.2, it is defined from the in-order parallel semantics, called $\text{prun\_iw}$ below. This out-of-order execution simply invokes the $\text{prun\_iw}$ on an arbitrary permutation $p'$ of the block and after duplicating the initial memory.

```
Definition prun ge p m (om: option mem) := \exists p', om = (prun_iw ge p' m m) ∧ Permutation p p'
```

Indeed, $\text{prun\_iw}$ is parametrized by $m$ for the write-only memory and by $\text{old}$ for the read-only memory (which is thus the initial memory of the block).

```
Fixpoint prun_iw ge p m old: option mem :=
  match p with
  | nil ⇒ Some m
  | i::p' ⇒ SOME m1 ← inst_prun ge i m old old IN
    prun_iw ge p' m1 old
end
```

The parallel semantics of an instruction now takes three memories as input: $m$ for the write-only memory, $\text{old}$ for the read-only memory (which is thus the initial memory of the block), and $\text{tmp}$ a duplication of the old memory, with modifications that are purely local to the instruction.

```
Fixpoint inst_prun ge (i: inst) (m tmp old:mem): option mem :=
  match i with
  | nil ⇒ Some m
  | (x,e)::i' ⇒ SOME v' ← exp_eval ge e tmp old IN
    inst_prun i' ge (assign m x v') (assign tmp x v') old
end
```
Note that, like in AsmVLIW, the sequential semantics of an instruction is a particular case of the parallel one. We have \((\text{inst\_run } ge \ i \ m \ old) = (\text{inst\_prun } ge \ i \ m \ m \ old)\). Moreover, in the sequential and parallel semantics of a block, instructions are considered \textit{atomically}: splitting/merging instructions in the block does generally not preserve the semantics.

C.3 Parallelizability Testing

Our parallelizability test is a function \texttt{is\_parallelizable} taking a basic block \(p\) and returning a Boolean. If this Boolean is true, then any out-of-order parallel execution returns the same\(^{31}\) result (possibly \texttt{None}) as the sequential execution. In this case, out-of-order parallel execution is thus deterministic.

\begin{equation}
\text{Theorem is\_parallelizable\_correct (p: bblock): is\_parallelizable p = true \rightarrow } \\
\forall \text{ ge m om}', \text{ prun ge p m om}' \leftrightarrow \text{ om}' = \text{ run ge p m}
\end{equation}

The \texttt{is\_parallelizable} test analyzes the sets of pseudo-registers used by each instruction. The type of such sets of pseudo-registers is noted here \texttt{S.\_\_\_} and is implemented by prefix-trees from module \texttt{PositiveSet} of the Coq standard library. Function \texttt{is\_parallelizable} invokes two functions, \texttt{inst\_wframe} and \texttt{inst\_frame}, of type inst \rightarrow \texttt{S.\_\_\_}:

- \((\text{inst\_wframe } i)\) is the set of all pseudo-registers \textit{written} by instruction \(i\);
- \((\text{inst\_frame } i)\) is the set of all pseudo-registers \textit{used}—i.e. read or written—by \(i\);

Then, \((\text{is\_parallelizable } p)\) simply checks the absence of Use-After-Write: no instruction of \(p\) uses a pseudo-register after a previous instruction of \(p\) has written in it.

\begin{verbatim}
Fixpoint is\_pararec (p: bblock) (previously_written: S.\_\_\_): bool :=
  match p with
  | nil => true
  | i::p' => S.is_disjoint (inst\_frame i) previously_written
           &&& is\_pararec p' (S.union (inst\_wframe i) previously_written)
  end

Definition is\_parallelizable (p: bblock) := is\_pararec p S.empty
\end{verbatim}

The proof of \texttt{is\_parallelizable\_correct} results from the conjunction of two properties: the absence of Write-After-Write ensures that out-of-order parallel execution is deterministic; the absence of Read-After-Write ensures that sequential execution gives the same result as in-order parallel execution. To simplify this proof, we use a \textit{data-refinement} style: first, we prove it when \texttt{frames} are implemented by lists instead of prefix-trees; then, we prove that the handling of frames implemented by prefix-trees emulates the handling of frames using lists. There is thus little proof about prefix-trees. A more detailed and significant example of data-refinement style is given in the next section.

C.4 Simulation Testing

The sequential simulation of a block \(p_1\) by a block \(p_2\) is defined by the following pre-order:

\begin{equation}
\text{Definition bblock\_simu (p1 p2: bblock): Prop := } \\
\forall \text{ ge m, (run ge p1 m) \lnot\lnot (run ge p2 m)}
\end{equation}

We have implemented a simulation test: it takes two blocks \(p_1\) and \(p_2\), and returns a Boolean, such that if this latter is true then \((\text{bblock\_simu } p1 p2)\). This test is largely inspired by the list-scheduling verifier of Tristan and Leroy \cite{2008}, but with two major differences. First, they define their verifier for the Mach IR, while ours defined for \texttt{AbstractBasicBlock} is more generic. Second, we use hash-consing in order to avoid a combinatorial explosion of the test.

\(^{31}\text{Here, we admit functional extensionality to compare memories, like other parts of CompCert.}\)
Like in Tristan and Leroy [2008], our simulation test symbolically executes each block, and then simply compares the resulting symbolic memories. As introduced in Section 4.3, such a symbolic memory bisimulates the input block by combining a parallel assignment together with a list of potential failures. We recall the examples of Section 4.3 below.

Example (Reminder of Example 4.2). Let us consider the two blocks below (in informal syntax):

\[
\begin{align*}
    r_1 &:= r_1 + r_2; \\
    r_2 &:= \text{load}[m, r_2]; \\
    r_3 &:= r_1 + r_3
\end{align*}
\]

These two blocks are both equivalent to the parallel assignment (in an informal syntax):

\[
    r_1 := (r_1 + r_2) + \text{load}[m, r_2]; \\
    r_3 := r_1 + r_3
\]

Indeed, these two blocks simulate each other (they bisimulate).

Example (Reminder of Example 4.3). Let us consider the two blocks below:

\[
\begin{align*}
    r_1 &:= r_1 + r_2; \\
    r_2 &:= \text{load}[m, r_2]; \\
    r_3 &:= r_1; \\
    r_4 &:= r_1 + r_3
\end{align*}
\]

Again, \( p_1 \) and \( p_2 \) lead to the same parallel assignment:

\[
    r_1 := (r_1 + r_2) + (r_1 + r_2); \\
    r_3 := r_1 + r_2
\]

However, \( p_1 \) is simulated by \( p_2 \) whereas the converse is not true. This is because the “useless” memory access in \( p_1 \) may cause its execution to fail, whereas this failure cannot happen in \( p_2 \). Thus, the symbolic memory of \( p_1 \) should contain the term “\( \text{load}[m, r_2] \)” as a potential failure.

Our formal development is decomposed into two parts using a data-refinement style. In a first part, presented in Appendix C.4.1, we define a model of the symbolic execution and the simulation test. In a second part, sketched by Appendix C.4.3, we refine this model with efficient data-structures and algorithms, involving hash-consing of terms. Indeed, as illustrated by the previous examples, without a mechanism dealing efficiently with duplication of terms, symbolic execution produces terms that may be exponentially big w.r.t to the size of the source block. Our technique for hash-consing terms is explained in Appendix C.4.2.

C.4.1 A Model of our Simulation Test. The principle of symbolic execution has been first introduced by King [1976]. “Symbolic execution” refers to how to compute “symbolic memories” (and not to what they are): mimicking the concrete execution while replacing operations on “concrete memories” by operations on “symbolic memories”.

In this analogy, “values” are replaced by “symbolic values”, which are actually terms evaluated in the initial memory. Hence, our type \( \text{term} \) of terms—defined below—is similar to \( \text{type} \exp \) without the \( \text{old} \) operator: in a term, a pseudo-register represents its value in the initial memory of block execution.

\[
\begin{align*}
\text{Inductive} & \quad \text{term} := \text{Input}(x:R.t) \mid \text{App}(o: \text{op})(l: \text{list\_term}) \quad \text{with} \quad \text{list\_term} := \ldots \\
\text{Fixpoint} & \quad \text{term\_eval}(\text{ge: genv})(t: \text{term})(m: \text{mem}): \text{option value} := \ldots
\end{align*}
\]

In our model, the symbolic execution of a block is a function \( \text{bblock\_smem}: \text{block} \to \text{smem} \), where a symbolic memory of type \( \text{smem} \) is the pair of a predicate \( \text{pre} \) expressing at which condition the intermediate computations of the block do not fail, and of a parallel assignment \( \text{post} \) on the pseudo-registerners.

\[
\begin{align*}
\text{Record} & \quad \text{smem} := (\text{pre: genv} \to \text{mem} \to \text{Prop}; \text{post: Rt} \to \text{term})
\end{align*}
\]

Then, the bisimulation property between the symbolic memory and sequential execution is expressed by the \( \text{bblock\_smem\_correct} \) lemma below. It uses the \( \text{smem\_correct} \) predicate, relating the symbolic memory \( d \) with an initial memory \( m \) and a final optional memory \( o m \).

\[
\begin{align*}
\text{Definition} & \quad \text{smem\_correct ge (d: smem) (m: mem) (om: option mem): Prop :=} \\
& \quad \quad \forall m', \text{om} = \text{Some m'} \to (d(\text{pre}) \text{ ge m} \land \forall x, \text{term\_eval ge (d(\text{post}) x) m = \text{Some (m' x)})}
\end{align*}
\]

\[
\begin{align*}
\text{Lemma} & \quad \text{bblock\_smem\_correct p m: smem\_correct ge (bblock\_smem p) m (run ge p m)}
\end{align*}
\]
By using this lemma, we transfer the notion of simulation of block executions into the simulation of symbolic memories, through the predicate \( \text{smem\_simu} \) below. In particular, proposition \((\text{smem\_valid ge d m})\) holds iff the underlying execution does not return a \text{None} result from the initial memory \(m\).

**Definition** \(\text{smem\_valid ge (d: smem) (m: mem): Prop} :=\)
\[
d.(\text{pre}) \text{ ge m } \land \forall x, \text{term\_eval ge (d.(post) x) m} \neq \text{None}
\]

**Definition** \(\text{smem\_simu (d1 d2: smem): Prop} :=\)
\[
(\forall \text{ ge m, smem\_valid ge d1 m } \rightarrow \text{smem\_valid ge d2 m})
\land (\forall \text{ ge m x, smem\_valid ge d1 m } \rightarrow \text{term\_eval ge (d1.(post) x) m} = \text{term\_eval ge (d2.(post) x) m})
\]

**Theorem** \(\text{bblock\_smem\_simu p1 p2:}\)
\[
\text{smem\_simu (bblock\_smem p1) (bblock\_smem p2)} \rightarrow \text{bblock\_simu ge p1 p2}
\]

Internally, as coined in the name of “symbolic execution”, \(\text{bblock\_smem}\) mimics \text{run}, by replacing operations on memories of type \(\text{mem}\) by operations of type \(\text{smem}\): these operations on the symbolic memory are given in Fig. 22. The initial symbolic memory is defined by \(\text{smem\_empty}\). The evaluation of expressions on symbolic memories is defined by \(\text{exp\_term}\): it outputs a term (i.e. a symbolic value). Also, the assignment on symbolic memories is defined by \(\text{smem\_set}\). To conclude, starting from \(\text{smem\_empty}\), the symbolic execution preserves the \(\text{smem\_correct}\) relation w.r.t the initial memory and the current (optional) memory, on each assignment.

---

(* initial symbolic memory *)

**Definition** \(\text{smem\_empty} := \{| \text{pre}=(\text{fun } _ _ \Rightarrow \text{True}); \text{post}=(\text{fun } x \Rightarrow \text{Input } x) \}|\)

(* symbolic evaluation of the right-hand side of an assignment *)

**Fixpoint** \(\text{exp\_term (e: exp) (d old: smem) : term} :=\)
\[
\text{match e with}
| \text{PReg x} \Rightarrow d.(\text{post}) x
| \text{Op o le} \Rightarrow \text{App o (list\_exp\_term le d old)}
| \text{Old e} \Rightarrow \text{exp\_term e old old}
\text{end}
\]

(* effect of an assignment on the symbolic memory *)

**Definition** \(\text{smem\_set (d:smem) x (t:term) :=}\)
\[
\{| \text{pre}=(\text{fun ge m} \Rightarrow (\text{term\_eval ge (d.(post) x) m} \neq \text{None} \land (d.(pre) ge m));}
\text{post}=(\text{fun y} \Rightarrow \text{if R.eq\_dec x y then t else d.(post) y}) \}|\)

---

**Fig. 22. Basic Operations of the Symbolic Execution in the Abstract Model**

---

**C.4.2 Formally Verified Hash-Consed Terms in Coq.** Hash-consing is a standard technique of imperative programming, which in our case has two benefits: it avoids duplication of structurally equal terms in memory, and importantly, it reduces (expansive) structural equality tests over terms, to (very cheap) pointer equality tests. In our verified backend, we thus need to import pointer equality from OCAML. However, importing pointer equality as a pure function, such as \(\forall \{A\}, A \rightarrow A \rightarrow \text{bool}\), would be unsafe. Indeed, such a Coq function is—by construction—compatible with the logical equality of Coq (i.e. the structural equality), which is not the case of pointer equality. Thus, we instead import pointer equality from the IMPURE library of [Boulmé and Vandendorpe 2019].

The IMPURE Library and its Pointer Equality Test. The IMPURE library provides an approach to safely embed type-safe OCAML impure computations into Coq-verified code: impure computations are abstracted as non-deterministic computations. For a given type \(A\), type \(\text{??A}\) represents the type of non-deterministic computations returning values of type \(A\): it can be interpreted as \(\mathcal{P}(A)\),
the type $A \rightarrow \text{Prop}$ of predicates over $A$. Formally, the type transformer “??” is axiomatized as a monad that provides a may-return relation “??$A \rightarrow A \rightarrow \text{Prop}$“. Intuitively, when “$A$” is seen as “$\mathcal{P}(A)$”, then “??$A$” simply corresponds to identity. At extraction, “??$A$” is extracted like $A$, and its binding operator is efficiently extracted as an OCaml let-in. See details in [Boulmé and Vandendorpe 2019]. Moreover, this library declares a trusted pointer equality with the following axioms.

\[
\begin{align*}
\text{Axiom } \text{phys_eq} &: \forall (A), A \rightarrow A \rightarrow ?? \text{ bool} \\
\text{Axiom } \text{phys_eq_true} &: \forall A (x \ y: A), \text{phys_eq } x\ y \rightarrow \text{true} \rightarrow x = y
\end{align*}
\]

In other words, in our Coq model, the pointer equality test is seen as a non-deterministic function, since it can distinguish more objects than the logical equality of Coq. Moreover, when it answers true, we know that the two objects under test are logically equals.

A Generic and Verified Factory of Memoizing Functions for Hash-Consing. Hash-consing is a fundamentally impure construction, and it is not easy to retrofit it into a pure language; Braibant et al. [2014] propose several approaches for hash-consing in Coq and in code extracted from Coq to OCAML. However, we need weaker properties than what they aim for. They wish to use physical equality (or equality on an “identifier” type) as equivalent to semantic equality; they use this to provide a fast equality test for Binary Decision Diagrams (BDD)—two Boolean functions represented by reduced ordered binary decision diagrams are equal if and only if the roots of the diagrams are physically the same. In contrast, we just need physical equality to imply semantic equality. This allows a lighter approach.

Hash-consing consists in memoizing the constructors of some inductive data-type—such as the terms described above—in order to ensure that two structurally equal terms are allocated to the same object in memory. In practice, this technique simply replaces the usual constructors of the data-type by smart constructors that perform memoization. Memoization is usually delegated to a dedicated function in turn generated from a generic factory.

On the top of the IMPURE library, we have defined in Coq a generic and verified memoization factory. This factory is inspired by that of Filliâtre and Conchon [2006] in OCAML. However, whereas their factory was not formally verified, ours satisfies a simple correctness property that is formally verified in Coq (and shown sufficient for the formal correctness of our simulation test). Actually, we use an external untrusted OCAML oracle that creates memoizing functions and we only dynamically check that these untrusted functions behave observationally like an identity. Let us insist on this point: the formal correctness of our memoization factory does not assume nor prove that our oracle is correct; it only assumes that the embedding of OCAML untrusted oracles in Coq verified code through the IMPURE library is correct (see the details in [Boulmé and Vandendorpe 2019]). We now detail a slightly simplified version of this factory.\(^{32}\)

Our generic memoization factory is parametrized by a record of type (hashP $A$), where $A$ is the type of objects to memoize. Below, hashcode is an abstract data type on the Coq side, extracted as an OCAML int. Function hash_eq is typically a fast equality test, for comparing a new object to already memoized ones in smart constructors. This test typically compares the sons of the root node w.r.t pointer equality (the example for terms is given below by term_hash_eq function). Function hashing is expected to provide a unique hashcode for data that are equal modulo hash_eq. Finally, set_hid is invoked by memoizing functions to allocate a fresh and unique hash-tag to new objects (this hash-tag is used by efficient implementations of hashing).

\(^{32}\)Our actual factory also provides some debugging features, which are useful for printing a trace when the whole simulation test fails. We omit these implementation details in this presentation.
Record hashP (A:Type) := {
  hash_eq : A → A → ?? bool;
  hashing : A → ?? hashcode;
  set_hid : A → hashcode → A;
}

The details on hashing and set_hid are only relevant for efficiency: these functions are simply ignored in our formal proofs. Hence, given such (hashP A) structure, our OCAML oracle xhCons returns a (fresh) memoizing function of type (A → ??A).

Axiom xhCons : ∀ (A), hashP A → ??( A → ??A). (* declares our OCaml oracle in Coq *)

Such a memoizing function of type (A → ??A) is expected to behave as an identity w.r.t hash_eq. Actually, as we do not trust xhCons, we dynamically check this property.

33 Hence, our verified generic memoization factory in CoQ—called hCons below—simply wraps each function returned by xhCons with this defensive check: it raises an exception if the memoizing function does not return a result equal to its input (w.r.t hash_eq). Below, the notation “DO x ← e1 ;; e2” stands for a bind operation of the may-return monad of the IMPURE library (it is extracted as “let x = e1 in e2”). Moreover, “RET e” is the unit of this monad (it is extracted as “e”). Function “assert_b” is also provided by IMPURE.

Definition hCons (A) (hp: hashP A): ??( A → ??A) :=
  DO hC ← xhCons hp ;;
  RET (fun x ⇒
    DO y ← hC x ;;
    DO b ← hp.(hash_eq) x y ;;
    assert_b b "xhCons: hash_eq differs" ;; (* exception raised if Boolean [b] is [false] *)
  RET y)

We are thus able to formally prove the following (trivial) correctness property on hCons, which is sufficient in our development to reason about hash-consing. Here, the relation R is typically an equivalence under which we want to observe hash-consed objects.

Lemma hCons_correct A (hp: hashP A) (R: A → A → Prop):
  (∀ x y, hp. (hash_eq) x y {true → R x y}) → ∀ hC, hCons hp {hC → ∀ x y, hC x {y → R x y}}

Smart Constructors for Hash-Consed Terms. In our development, we need hash-consing on two types of objects: term and list_term, because they are mutually inductive. First, we redefine type term and list_term into hterm and list_hterm by inserting a hash-tag—called below hid—at each node.

Inductive hterm :=
  | Input (x:R) (hid:hashcode)
  | App (o: op) (l: list_hterm) (hid:hashcode)
with list_hterm :=
  | LTnil (hid:hashcode)
  | LTcons (t:hterm) (l: list_hterm) (hid:hashcode)

Thus, we also have to redefine term_eval and list_term_eval for their “hterm” versions. Note that these functions simply ignore hash-tags.

Fixpoint hterm_eval (ge: genv) (t: hterm) (m: mem): option value :=
  match t with
    | Input x _ ⇒ Some (m x)
    | App (o: op) (l: list_hterm) (hid:hashcode)
      with list_hterm_eval ge (l: list_hterm) (m: mem) (struct l): option (list value) :=...

33 As hash_eq is expected to be constant-time, this dynamic check only induces a small overhead.
Then, we define two records of type \((\text{hashP hterm})\) and \((\text{hashP list_hterm})\). Below, we only detail the case of \((\text{hashP hterm})\), as the \((\text{hashP list_hterm})\) case is similar. First, the hash_eq field of \((\text{hashP hterm})\) is defined as function \text{term_hash_eq} below. On the Input case, we use the structural equality over pseudo-registers. On the App case, we use an equality \text{op_eq} on type op in parameters of the simulation test, and we use the pointer equality over the list of terms.

\begin{verbatim}
Definition term_hash_eq (ta tb: hterm): ?? bool :=
  match ta, tb with
  | Input xa _, Input xb _ => if R.eq_dec xa xb then RET true else RET false
  | App oa lta _, App ob ltb _ =>
    DO b f op_eq oa ob ;;
    if b then phys_eq lta ltb else RET false
  | _,_ => RET false
end
\end{verbatim}

Second, the hashing field of \((\text{hashP hterm})\) is defined as function \text{term_hashing} below. This function uses an untrusted oracle “\text{hash}: \forall \{A\}, A \rightarrow ??\text{hashcode}” extracted as the polymorphic Hashing.hash of the OCaml standard library. It also uses \text{list_term_get_hid} defined below—that returns the hash-tag at the root node. To ensure memoization efficiency, two terms that are distinct w.r.t \text{term_hash_eq} are expected to have distinct \text{term_hashing} with a high probability.\(^3\)

\(^3\)Two terms equals w.r.t \text{term_hash_eq} must also have the same \text{term_hashing}.

Finally, the set_hid field of \((\text{hashP hterm})\) updates the hash-tag at the root node. It is defined by:

\begin{verbatim}
Definition term_set_hid (t: hterm) (hid: hashcode): hterm :=
  match t with
  | Input x _ => Input x hid
  | App op l _ => App op l hid
end
\end{verbatim}

Having defined two records of type \((\text{hashP hterm})\) and \((\text{hashP list_hterm})\) as sketched above, we can now instantiate \text{hCons} on each of these records. We get two memoizing functions \text{hC_term} and \text{hC_list_term} (Fig. 23). The correctness property associated with each of these functions is derived from \text{hCons_correct} with an appropriate relation R: the semantical equivalence of terms (or list of terms). These memoizing functions and their correctness properties are parameters of the code building \text{hterm} and \text{list_hterm} described below.
Indeed, these functions are involved in the smart constructors of \( h\text{term} \) and \( \text{list}\_h\text{term} \). Below, we give the smart constructor—called \( h\text{App} \)—for the \( \text{App} \) case with its correctness property. It uses a special hash-tag called \( \text{unknown\_hid} \) (never allocated by our \( \text{xhCons} \) oracle). The three other smart constructors are similar.

\[
\text{Definition} \quad h\text{App} \ (o:o) \ (l:\text{list}\_h\text{term}) : \ ?\ h\text{term} := h\text{C\_term} \ (\text{App} \ o \ l \ \text{unknown\_hid})
\]

\[
\text{Lemma} \quad h\text{App\_correct} \ o \ l : \ \forall \ t, \ h\text{App} \ o \ l \leadsto t \rightarrow \ \\
\ \forall \ ge \ m, \ h\text{term}\_eval \ ge \ t \ m = \ (\text{SOME} \ v \leftarrow \text{list}\_h\text{term}\_eval \ ge \ l \ m \ \text{IN} \ \text{op}\_eval \ ge \ o \ v)
\]

In the next section, we only build \( h\text{term} \) and \( \text{list}\_h\text{term} \) by using the smart constructors defined above. This ensures that we can replace the structural equality over type \( h\text{term} \) by the physical equality. However, this property does not need to be formally proved (and we have no such formal proof, since this property relies on the correctness of our untrusted memoization factory).

### C.4.3 Implementing the Simulation Test

Our implementation can be decomposed in two parts. First, we implement the symbolic execution function as a data-refinement of the \( \text{bblock}\_\text{smem} \) function of Appendix C.4.1. Then, we exploit the \( \text{bblock}\_\text{smem}\_\text{simu} \) theorem to derive the simulation test.

**Refining Symbolic Execution with Hash-Consed Terms.** Our symbolic execution builds hash-consed terms. It invokes the smart constructors of Appendix C.4.2, and is thus itself parametrized by the memoizing functions \( h\text{C\_term} \) and \( h\text{C\_list\_term} \) defined in Figure 23. Note that our simulation test will ultimately perform two symbolic executions, one for each block. Furthermore, these two symbolic executions share the same memoizing functions, leading to an efficient comparison of the symbolic memories through pointer equality. In the following paragraph, functions \( h\text{C\_term} \) and \( h\text{C\_list\_term} \) remain implicit parameters as authorized by the section mechanism of Coq.

Figure 24 refines the type \( \text{smem} \) of symbolic memories into a type \( \text{hsmem} \). The latter involves a dictionary with pseudo-registers of type \( R \ . t \) as keys, and terms of \( h\text{term} \) as associated data. These dictionaries of type \( (\text{Dict.} \ t \ h\text{term}) \) are implemented as prefix-trees, through the \text{PositiveMap} module of the Coq standard library.

Figure 24 also relates type \( \text{hsmem} \) to type \( \text{smem} \) (in a given environment \( ge \)), by a relation called \( \text{smem}\_\text{model} \). The \( \text{hpre} \) field of the symbolic memory is expected to contain a list of all the potential failing terms in the underlying execution. Hence, predicate \( \text{hsmem\_valid} \) gives a precondition on the initial memory \( m \) ensuring that the underlying execution will not fail. This predicate is thus expected to be equivalent to the \( \text{smem\_valid} \) predicate of the abstract model. Function \( \text{hsmem\_post\_eval} \) gives the final (optional) value associated with pseudo-register \( x \) from the initial memory \( m \); if \( x \) is not in the \( \text{hpost} \) dictionary, then its associated value is that of the initial memory (it is expected to be unassigned by the underlying execution). This function is thus expected to simulate the evaluation of the symbolic memory of the abstract model.

Hence, \( \text{smem}\_\text{model} \) is the (data-refinement) relation for which our implementation of the symbolic execution simulates the abstract model of Appendix C.4.1. Figure 25 provides an implementation of the operations of Figure 22 that preserves the data-refinement relation. The smart
Certified and Efficient Instruction Scheduling

The type of our symbolic memories with hash-consing

Record hsmem = (hpre: list hterm; hpost: Dict t hterm)

Implementation of the [smem_valid] predicate

Definition hsmem_valid ge (hd: hsmem) (m: mem): Prop :=
  ∀ ht, List.In ht hd. (hpre) hterm_eval ge ht m <> None

Implementation of the symbolic memory evaluation

Definition hsmem_post_eval ge (hd: hsmem) x (m: mem): option value :=
  match Dict.get hd. (hpost) x with
  | None ⇒ Some (m x)
  | Some ht ⇒ hterm_eval ge ht m
  end

The data-refinement relation

Definition smem_model ge (d: smem) (hd: hsmem): Prop :=
  (∀ m, hsmem_valid ge hd m ↔ smem_valid ge d m)
  ∧ ∀ m x, smem_valid ge d m → hsmem_post_eval ge hd x m = term_eval ge (d. (post) x) m

Fig. 24. Data-Refinement of Symbolic Memories with Hash-Consed Terms

The data-refinement relation

Definition bblock_hsmem : bblock → ?? hsmem := ...

Lemma bblock_hsmem_correct p hd: bblock_hsmem p ≃ hd → ∀ ge, smem_model ge (bblock_smem p) hd

The Main Function of the Simulation Test. Let us now present the main function of the simulation test, called bblock_simu_test below. First, it creates two memoizing functions hC_term and hC_list_term (Fig. 23) from the generic factory hCons (see Appendix C.4.2 for details). Then, it invokes the symbolic execution bblock_hsmem on each block. Notice that these two symbolic executions share the memoizing functions hC_term and hC_list_term, meaning that each term produced by one of the symbolic executions is represented by a unique pointer. The symbolic executions produce two symbolic memories d1 and d2. We compare them using two auxiliary functions specified in Fig. 26. Hence, (Dict.eq_test d1. (hpost) d2. (hpost)) compares whether each pseudo-register is assigned to the same term w.r.t pointer equality in both symbolic memories. Finally, (test_list_incl d2. (hpre) d1. (hpre)) compares whether each term of d2. (hpre) is also present in d1. (hpre): i.e., whether all potential failures of d2 are potential failures of d1. Again, in test_list_incl, terms are compared for pointer equality. Let us note that test_list_incl is itself efficiently implemented (with a linear execution time), by using an untrusted OCAML oracle.

Our implementation of hsmem_set is actually able to apply some rewriting rules. But, this feature is still not used by our verified scheduler.

The code of bblock_simu_test has been largely simplified, by omitting the complex machinery which is necessary to produce an understandable trace for CompCert developers in the event of a negative answer.

35
36
Cyril Six, Sylvain Boulmé, and David Monniaux

(* initial symbolic memory *)

\[
\text{Definition } \text{hsmem_empty} : \text{hsmem} \equiv \{ \text{hpre} \equiv \text{nil} ; \text{hpost} \equiv \text{Dict.empty} \}.
\]

(* symbolic evaluation of the right-hand side of an assignment *)

\[
\text{Fixpoint } \text{exp_hterm} (e : \text{exp}) (\text{hd} : \text{hsmem}) : \text{?? hterm} \equiv
\begin{cases}
\text{match } e \text{ with} \\
| \text{PReg } x & \Rightarrow \text{match } \text{Dict.get} \text{hd}(\text{post}) x \text{ with} \\
| \text{None} & \Rightarrow \text{hInput} x (* \text{smart constructor for Input} *) \\
| \text{Some } ht & \Rightarrow \text{RET } ht \\
end\\n| \text{Op } o \text{ le} & \Rightarrow \text{DO } \text{lt} \leftarrow \text{list_exp_hterm} \text{le} \text{hd} \text{hod}; \text{hApp}\text{ o } \text{lt} (* \text{smart constructor for App} *) \\
| \text{Old } e & \Rightarrow \text{exp_hterm} e \text{hod} \text{hod} \\
end
\end{cases}
\]

\[
\text{Lemma } \text{exp_hterm_correct } e \text{ hod} od \text{d} \text{t} \text{dod} \text{ht} : \text{smem_model ge od} \text{hod} \rightarrow \text{smem_model ge d} \text{hd} \rightarrow \text{exp_hterm e} \text{hd} \text{hod} \rightarrow \text{ht} \rightarrow \\
\forall m, \text{smem_valid ge d} m \rightarrow \text{smem_valid ge od} m \rightarrow \\
\text{hterm_eval ge t m = term_eval ge (exp_term e} d \text{od} m)
\]

(* effect of an assignment on the symbolic memory *)

\[
\text{Definition } \text{hsmem_set} (\text{hd} : \text{hsmem}) x (\text{ht} : \text{hterm}) : \text{?? hsmem} \equiv
\begin{cases}
\text{match } \text{ht} \text{ with} \\
| \text{hInput} x & \Rightarrow \text{RET } \{ \text{hpre} \equiv \text{ht} ; \text{hpost} \equiv \text{Dict.set} \text{hd} x \text{ht} \}
end
\end{cases}
\]

\[
\text{Lemma } \text{hsmem_set_correct } hd x \text{ht} \text{d} \text{t} \text{dod} \text{ht} : \text{smem_model ge d} \text{hd} \rightarrow \forall m, \text{smem_valid ge d} m \rightarrow \text{hterm_eval ge t m = term_eval ge (exp_term e} d \text{od} m)
\]

with a hash-table. More precisely, the formal proof of \text{test_list_incl_correct} relies on a property derived by parametricity from the polymorphic type of this untrusted oracle. This applies a "theorems for free" technique described in \cite{Boulmé and Vandendorpe 2019}.

\[
\text{Definition } \text{bblock_simu_test} (p1 p2 : \text{bblock}) : \text{?? bool} \equiv
\begin{cases}
\text{DO } \text{hC_term} \leftarrow \text{hCons} (\text{hash_eq}=\text{term_hash_eq}; \text{hashing}=\text{term_hashing}; \text{set_hid}=\text{term_set_hid});
\text{DO } \text{hC_list_term} \leftarrow \text{hCons} ...
\text{DO } \text{d2} \leftarrow \text{bblock_hsmem} \text{hC_term} \text{hC_list_term} p2;
\text{DO } \text{d} \leftarrow \text{Dict.eq_test} \text{d1}(\text{hpost}) \text{d2}(\text{hpost});
\text{if } b \text{ then } \text{test_list_incl} \text{d2}(\text{hpre}) \text{d1}(\text{hpre});
\text{else } \text{RET } \text{false}
end
\end{cases}
\]

\[
\text{Lemma } \text{bblock_simu_test_correct } (p1 p2 : \text{bblock}) : \\
\text{bblock_simu_test reduce p1 p2} \rightarrow \text{true} \rightarrow \forall \text{ge}, \text{bblock_simu ge p1 p2}
\]

The proof of \text{bblock_simu_test_correct} directly results from the conjunction of the two correctness properties of Fig. 26 with \text{bblock_smem_correct} and \text{bblock_hsmem_correct}.

\[
\text{Definition } \text{Dict.eq_test} : \forall (A), \text{Dict.t A} \rightarrow \text{Dict.t A} \rightarrow \text{?? bool}
\]

\[
\text{Lemma } \text{Dict.eq_test_correct A} (d1 d2 : \text{Dict.t A}) : \text{Dict.eq_test d1 d2} \rightarrow \text{true} \\
\forall \text{x:Rt}, \text{Dict.get} d1 x = \text{Dict.get} d2 x
\]

\[
\text{Definition } \text{test_list_incl} : \forall (A), \text{list A} \rightarrow \text{list A} \rightarrow \text{?? bool}
\]

\[
\text{Lemma } \text{test_list_incl_correct A} (l1 l2 : \text{list A}) : \text{test_list_incl l1 l2} \rightarrow \text{true} \\
\forall \text{t:A}, \text{List.In t} l1 \rightarrow \text{List.In t} l2
\]

Fig. 26. Formal Specification of the two Auxiliary Functions Used by the Simulation Test
D   OPTIMAL ILP SCHEDULER

We provide a solver based on Integer Linear Programming (ILP), that optimally solves scheduling problem of Section 6.2. This ILP solver is not intended for production use as it is too costly. Rather, it is used to validate how often the “critical-path” scheduler of Section 6.3 actually computes an optimal solution (for a counter-example, see Fig. 12 vs Fig. 13).

Roughly speaking, our optimal solver will turn the scheduling problem into an ILP problem, and then, invoke an external ILP solver to find a solution. More precisely, we first compute an upper bound $B$ on the makespan $t(n)$, typically by running the “critical-path” scheduler of Section 6.3: if the latter computes a solution with makespan $B + 1$, then we can restrict our ILP search to solutions of makespan at most $B$.

We first compute, for every instruction $j$, the maximum length $\alpha(j)$ of a path ending at $j$ in the graph, and the maximum length $l(j)$ of a path starting at $j$ (necessarily to $n$). Let us define $\beta(j) = B - l(j)$. Then we know that any solution $t$ satisfies $\alpha(j) \leq t(j) \leq \beta(j)$ for all $j$.

For every $j$ and every $\alpha(j) \leq i \leq \beta(j)$, we introduce a Boolean variable $x(j, i)$, meaning that instruction $j$ is scheduled at time $i$. An instruction is scheduled at a single time slot, thus all these variables are exclusive, as expressed by Equation (3) of Fig. 27. Then, $t(j)$ is easily recovered from the Boolean variables (Equation 4). The latency (and dependency) constraints $t(j') - t(j) \geq \delta$ are directly copied into the ILP problem. Alternatively, a constraint $t(j') - t(j) \geq \delta$ can for instance be encoded as Inequality (5). The resource constraints are implemented by introducing, for all $0 \leq i \leq B$ and all $1 \leq h \leq m$, an instance of Inequality (6) where $u_h(k)$ denotes the $h$-th coordinate of $u(k)$ and $r_h$ the $h$-th coordinate of $r$.

\[
\sum_{i=\alpha(j)}^{\beta(j)} x(j, i) = 1 \\
\]  \hspace{1cm} (3)

\[
t(j) = \sum_{i=\alpha(j)}^{\beta(j)} x(j, i)i \\
\]  \hspace{1cm} (4)

\[
x(j, i) \leq \sum_{i'=\max(i+\delta, \alpha(j'))}^{\beta(j')} x(j', i') \\
\]  \hspace{1cm} (5)

\[
\sum_{j|\alpha(j) \leq i \leq \beta(j)} x(j, i)u_h(K(j)) \leq r_h \\
\]  \hspace{1cm} (6)

Fig. 27. Turning the Scheduling Problem into one of ILP

The resulting ILP problem can be used in two ways:

**Optimization** Minimize $t(n)$.

**Decision** Test for the existence of a solution with makespan $t(n) \leq B$. Then set $B$ to be $t(n) - 1$ and restart the process, until no better solution is found.

While it may appear that this optimization, requiring one ILP call, would be more efficient than a sequence of decision problems, it seems that, experimentally, this is not the case.

Such an ILP problem can also be formulated as *pseudo-Boolean*, that is, a problem where all variables are Boolean and the constraints are linear inequalities. It suffices to replace $t(j)$ by its definition from Eq. (4) everywhere.

We support Gurobi and CPlex as ILP backends, and we have run experiments with Sat4J and others as pseudo-Boolean backends.

Experimental Evaluation of the (Sub)Optimality of our List Scheduler. We compiled our benchmarks with the ILP scheduling described above. The ILP scheduler outperformed our list scheduler of
Section 6.3 in only 8 of 26,161 basic blocks: for 1 block, removing 2 cycles (from 106 to 104); in others, removing 1.

As detailed above, we first compute an initial schedule by list scheduling, then try to find improvements using ILP. The ILP solver is not called if that initial schedule is obviously optimal since its makespan is the minimum makespan imposed by a critical path. Among the 26,161, we called the ILP solver only 2,368 times.

In most cases, solving ILP problems using Gurobi takes a few milliseconds of CPU time. However, it can be considerably costlier in some rare cases: for instance, Gurobi took more than 7 hours to check that the initial schedule for a 147 instruction block was optimal.