Automatic Source-to-Source Error Compensation of Floating-Point Programs

Abstract—Numerical programs with IEEE 754 floating-point computations may suffer from inaccuracies since finite precision arithmetic is an approximation of real arithmetic. Solutions that reduce the loss of accuracy are available, such as compensated algorithms or more precise computation with double-double or similar libraries. Our objective is to automatically improve the numerical quality of a numerical program with the smallest impact on its performance. We define and implement a source code transformation to automatically derive compensated programs. We present several experimental results comparing the transformed programs to existing solutions. The transformed programs are as accurate and efficient as the implementations of compensated algorithms when the latter exist.


I. INTRODUCTION
In this paper, we focus on numerical programs using IEEE 754 floating-point arithmetic. Several techniques have been introduced to improve the accuracy of numerical algorithms, for instance expansions [4], [23], compensations [7], [10], differential methods [14], or extended precision arithmetic using multiple-precision libraries [5], [8]. Nevertheless, bugs caused by numerical failures are numerous and well known [2], [18]. This illustrates that these improvement techniques are not known well enough outside the floating-point arithmetic community, or not sufficiently automated to be applied more systematically. For example, the programmer has to modify the source code by overloading floating-point types with double-double arithmetic [8] or, less easily, by compensating the floating-point operations with error-free transformations (EFT) [7]. The latter transformations are difficult to implement without a preliminary manual step to define the modified algorithm.
We present a method that allows a non-expert in floating-point arithmetic to improve the numerical accuracy of a program without impacting its execution time too much. Our approach facilitates numerical accuracy improvement by automating the compensation process. Even if we do not provide error bounds on the processed algorithms, our approach takes advantage of a fast program transformation, available to a large community of developers. So, we propose to automatically introduce at compile time a compensation step by using error-free transformations. We have developed a tool to parse C programs and generate new C code with a compensated treatment: the floating-point operations ± and × are replaced by their respective error-free TWOSUM and TWOPRODUCT algorithms [21, Chap. 4]. The main advantage of this method compared to operator overloading is that it benefits from code optimizations and efficient code generation. Program transformation is strongly motivated by our perspectives for the multi-criteria optimization of programs [25]. These optimizations will allow a trade-off between accuracy and execution time. Programs will be partially compensated using transformation strategies to meet time and accuracy constraints which could be difficult to reach with operator overloading.
To demonstrate the efficiency of this approach, we compare our automatically transformed algorithms to existing compensated ones such as floating-point summation [22] and polynomial evaluation [7], [10]. The goal of this demonstration is to automatically recover the same results in terms of accuracy and execution time. Compensation is known to be a good choice to benefit from the good instruction-level parallelism (ILP) of compensated algorithms compared to the ones derived using fixed-length expansions such as double-double or quad-double [8], [15]. Results for the automatically transformed algorithms, both in terms of accuracy and execution time, are shown to be very close to the results for the implementations of the studied compensated algorithms. This article is organized as follows. Section II introduces background material on floating-point arithmetic, error-free transformations, and accuracy improvement techniques such as double-double arithmetic and compensation. The core of this article is Section III, where we present our automatic code transformation to optimize the accuracy of floating-point computations with the smallest execution time overhead. In Section IV, we present some experimental results to illustrate the interesting behavior of our approach compared to existing ones. Conclusions and perspectives are proposed in Section V.

II. PRELIMINARIES
In this section, we recall classical notations for IEEE floating-point arithmetic, basic methods to analyze the accuracy of floating-point computations, and EFTs of the basic operations ± and ×. We also present how to exploit these EFTs with expansions and compensations.

A. IEEE Floating-Point Arithmetic
In base β and precision p, IEEE floating-point numbers have the form x = (−1)^s · m · β^(e−p+1), where s is the sign bit, m is the integer significand with 0 ≤ m < β^p, and e is the exponent. The IEEE 754-2008 standard [24] defines such numbers for several formats, that is, for various pairs (β, p). It also defines rounding modes and the semantics of the basic operations ±, ×, ÷, and √.
Notation and assumptions. Throughout the paper, all computations are performed in the binary64 format with the round-to-nearest mode. We assume that neither overflow nor underflow occurs during the computations. We use the following notations:
• F is the set of all normalized floating-point numbers. For example, binary64 floating-point numbers are expressed with β = 2 over 64 bits, including p = 53 bits for the significand, 11 bits for the exponent e, and 1 bit for the sign s.
• fl(·) denotes the result of a floating-point computation where every operation inside the parentheses is performed in the working precision with the round-to-nearest mode.
• ulp(x) is the floating-point value of the unit in the last place of x, defined by ulp(x) = 2^e · 2^(1−p). Let x̂ = fl(x) for a real number x. We have |x − x̂| ≤ ulp(x̂)/2.
Accuracy analysis. One way of estimating the accuracy of x̂ = fl(x) is through the number of significant bits #sig shared by x and x̂:

#sig(x̂) = −log2(E_rel(x̂)),

where E_rel(x̂) is the relative error defined by:

E_rel(x̂) = |x̂ − x| / |x|.

B. Error-Free Transformations
Error-free transformations (EFT) provide lossless transformations of the basic floating-point operations • ∈ {+, −, ×}. Let a, b ∈ F and x = fl(a • b). There exists a floating-point value y = a • b − x such that a • b = x + y. We have |y| ≤ ulp(x)/2. Hence x (resp. y) is the upper (resp. lower) part of a • b, and no digit of x overlaps with y. The practical interest of EFTs comes from Algorithms 1, 2, 4, and 5, which exactly compute in floating-point arithmetic the error term y for the sum and the product.
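For the sum, the classical 6-flop TWOSUM algorithm (Knuth) computes both x and the exact error y without any branch. A minimal C sketch, with variable names of our choosing:

```c
#include <assert.h>

/* TWOSUM (Knuth): x = fl(a + b) and the exact rounding error y
 * such that a + b = x + y, in 6 floating-point operations. */
static void two_sum(double a, double b, double *x, double *y) {
    *x = a + b;
    double b_virtual = *x - a;          /* part of b actually added */
    double a_virtual = *x - b_virtual;  /* part of a actually added */
    double b_round = b - b_virtual;     /* error on b's contribution */
    double a_round = a - a_virtual;     /* error on a's contribution */
    *y = a_round + b_round;
}
```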
Algorithm 3, due to Veltkamp [4], splits a binary floating-point number into two floating-point numbers containing its upper and lower parts. It is used in Algorithm 4 (TWOPRODUCT), introduced by Dekker [4], to compute the EFT of a product for the cost of 17 flops.
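A C sketch of Veltkamp's splitting and Dekker's TWOPRODUCT as they are commonly presented; the splitting constant 2^27 + 1 assumes binary64 (p = 53), and the exact formulation in Algorithms 3 and 4 may differ slightly:

```c
#include <assert.h>
#include <math.h>

/* SPLIT (Veltkamp): x = xh + xl where xh and xl each fit in half
 * the precision, using the binary64 factor 2^27 + 1. */
static void split(double x, double *xh, double *xl) {
    double c = 134217729.0 * x;   /* 2^27 + 1 */
    *xh = c - (c - x);
    *xl = x - *xh;
}

/* TWOPRODUCT (Dekker): x = fl(a * b) and y with a * b = x + y
 * exactly; 17 flops when no FMA is available. */
static void two_product(double a, double b, double *x, double *y) {
    double ah, al, bh, bl;
    *x = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    /* each partial product below is exact in binary64 */
    *y = ((ah * bh - *x) + ah * bl + al * bh) + al * bl;
}
```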
Some processors have a fused multiply-add (FMA) instruction which evaluates expressions such as a × b ± c with a single rounding error. Algorithm 5 takes advantage of this instruction to compute the exact product of two floating-point numbers much faster, namely with 2 flops instead of the 17 flops of TWOPRODUCT.
Table I presents the number of operations and the depth of the dependency graph (that is, the critical path in the EFT data-flow graph) for each of the output values x and y of these algorithms. It is shown in [13] that TWOSUM is optimal, both in terms of the number of operations and the depth of the dependency graph.

The result x = fl(a • b) is computed and available after only one floating-point operation. Moreover, the computation of y exposes some parallelism which can be exploited and, therefore, explains the efficiency of the algorithms [15].

C. Double-Double and Compensated Algorithms
We now focus on two methods using these EFTs to double the accuracy: double-double expansions and compensations. Then we recall why compensated algorithms are more efficient than double-double algorithms.
Double-double expansions. We present here the algorithms by Briggs, Kahan, and Bailey used in the QD library [8]. Let a, a_H, and a_L be floating-point numbers of precision p. The double-double representation of a is the unevaluated sum a_H + a_L, where a_H and a_L do not overlap: |a_L| ≤ ulp(a_H)/2. Double-double arithmetic simulates computations with precision 2p. Proofs are detailed in [17].
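As an illustration, here is a simplified C sketch of a double-double addition in the spirit of the QD library: a TWOSUM of the high parts, accumulation of the low parts, then a FASTTWOSUM renormalization. This is our own sketch, not the exact code of Algorithm 6:

```c
#include <assert.h>

/* A double-double number: the unevaluated sum hi + lo,
 * with |lo| <= ulp(hi)/2 after renormalization. */
typedef struct { double hi, lo; } dd;

static dd dd_add(dd a, dd b) {
    /* TWOSUM of the high parts: s + e = a.hi + b.hi exactly */
    double s = a.hi + b.hi;
    double bv = s - a.hi;
    double e = (a.hi - (s - bv)) + (b.hi - bv);
    /* fold in the low parts */
    e += a.lo + b.lo;
    /* FASTTWOSUM renormalization step */
    dd r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}
```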
Algorithms 6 and 7 compute the sum and the product of two double-double numbers more accurately than Dekker's algorithms [4]. Double-double algorithms need a renormalization step to guarantee |x_L| ≤ ulp(x_H)/2. This step is ensured by a FASTTWOSUM EFT and is represented by dotted boxes in Figure 2. In practice, double-double algorithms can be simply used by overloading the basic operations, as for example in Algorithm 8 (SUMDD), the double-double version of SUM, the classical recursive algorithm to evaluate the sum of n floating-point numbers.

Compensated algorithms. Like double-double algorithms, compensated algorithms can double the accuracy. We focus here on this class of algorithms. We already mentioned that double-double algorithms are easy to derive. On the contrary, compensated algorithms have been, up to now, defined case by case and by experts of rounding error analysis [7], [9], [10], [11], [22]. For example, the compensated Algorithm 9, SUM2 [22], returns a twice more accurate sum.

Double-double versus compensation. The previous double-double and compensated sums provide roughly the same accuracy. How do they compare in terms of computing time? Algorithm 9 needs 7n − 6 flops, compared to n − 1 for the original SUM algorithm. The double-double summation implemented by Algorithm 8 needs 10n − 9 flops, that is, almost 1.43 times more floating-point operations than the compensated algorithm. Now, let us consider the instruction-level parallelism by inspecting the number of instructions which could be simultaneously executed per cycle (IPC). In the case of the classical SUM algorithm, each iteration performs one floating-point operation. Each iteration can be followed immediately by the next one, so IPC(SUM) = (n − 1)/n ≈ 1. With the SUMDD algorithm, each iteration of the loop contains 10 operations, versus 7 for the SUM2 algorithm. Nevertheless, the main difference between the two algorithms is in the parallelization of the loop iterations. The SUMDD algorithm suffers from its renormalization, and one iteration may only be followed by the next one with a latency of 7 floating-point operations, so IPC(SUMDD) = (10n − 9)/(7n − 5) ≈ 1.42. The SUM2 algorithm does not suffer from such drawbacks, and its iterations can be executed with a latency of only one flop: IPC(SUM2) = (7n − 6)/(n + 5) ≈ 7. So SUM2 benefits from a seven times higher ILP. A detailed analysis has been presented in [16]. This fact is measurable in practice, and compensated algorithms exploit this low-level parallelism much better than double-double ones. The example of HORNER's polynomial evaluation algorithm is detailed in [15], which shows that the compensated HORNER algorithm runs at least twice as fast as its double-double counterpart with the same output accuracy. This efficiency motivates us to automatically generate existing compensated algorithms.
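A C sketch of compensated recursive summation in the spirit of SUM2 [22], with the TWOSUM steps inlined; the errors accumulate in a separate variable, independently of the running sum, which is what breaks the renormalization latency:

```c
#include <assert.h>

/* Compensated recursive summation (SUM2-style, 7n - 6 flops):
 * each step's exact TWOSUM error is accumulated in e and added
 * back once at the end (the compensation step). */
static double sum2(const double *p, int n) {
    double s = p[0], e = 0.0;
    for (int i = 1; i < n; i++) {
        double x = s + p[i];
        double bv = x - s;
        double err = (s - (x - bv)) + (p[i] - bv); /* TWOSUM error */
        s = x;
        e += err;   /* independent of s: one-flop iteration latency */
    }
    return s + e;   /* compensation */
}
```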

III. AUTOMATIC CODE TRANSFORMATION
We present how to improve accuracy thanks to code transformation. Experimental results are presented in Section IV.

A. Improving Accuracy: Methodology
Our code transformation automatically compensates programs and follows three steps.

1) First, detect the sequences of floating-point computations. A sequence is the set S of dataflow-dependent operations required to obtain one or several results.
2) Then, for each sequence S_i, compute the error terms and accumulate them beside the original computation sequence by (a) replacing floating-point operations by the corresponding EFTs, and (b) accumulating error terms following Algorithms 10 and 11 given hereafter. At this stage, every floating-point number x ∈ S_i becomes a compensated number, denoted (x, δ_x), where δ_x ∈ F is the accumulated error term attached to the computed result x.
3) Finally, close the sequences. Closing is the compensation step itself, so close(S_i) means computing x ← fl(x + δ_x) for each result x of S_i.

B. Compensated Operators
Algorithms 10 and 11 allow us to automatically compensate for the error of the basic floating-point operations. Inputs are now compensated numbers. Two different sources of error have to be considered. First, the error generated by the elementary operation itself, which is computed with an EFT. This computation corresponds to δ_± and δ_× in the first line of Algorithms 10 and 11. Second, the errors inherited from the two operands, denoted by δ_a and δ_b, are accumulated with the previously generated error. This accumulation corresponds to the second line of Algorithms 10 and 11. These inherited errors come from previous floating-point calculations. Operands with no inherited error are processed so as to minimize the added compensations. Figure 3 shows such variants, which can be obtained by removing the dashed or dotted lines.

IV. EXPERIMENTAL RESULTS
We now describe our CoHD tool that implements this code transformation. We apply it to several case studies chosen such that there exist compensated versions to compare with. We also add comparisons with the corresponding double-double versions.

A. The CoHD Tool
CoHD is a source-to-source transformer written in OCaml and built as a compiler. The front-end, which reads input C files, comes from a previous development by Casse [3]. The middle-end implements several optimization passes, including classical compiler passes such as operand renaming and three-address code conversion [1, Chap. 19]. It also implements one pass of floating-point error compensation. This pass uses the methodology and the algorithms defined in Section III. Then, the back-end translates the intermediate representation into C code.

B. Case Studies
We study here the cases described in Table II, which are representative of existing compensated algorithms.
Accuracy and execution time measurements. Accuracy is measured as the number of significant bits in the floating-point mantissa. So 53 is the maximum value we can expect from the binary64 format.
A reliable measure of the execution time is more difficult to obtain. Such measurements are not always reproducible because of many side effects (operating system, other executing programs, etc.). Significant measures are provided here using two software tools. First, PAPI (Performance Application Programming Interface) [20] allows us to read the physical counters of cycles or instructions corresponding to an actual execution. The second tool, PERPI [6], measures the numbers of cycles and instructions of an ideal execution, that is, an execution by a machine with infinite resources. The latter measure is more related to a performance potential than to the actual performance reported by PAPI. Using both tools provides confident and complementary results.
Horner's polynomial evaluation. We automatically compensate HORNER's scheme and compare it with DDHORNER (a double-double HORNER evaluation) and COMPHORNER (a compensated HORNER algorithm). The compensated algorithm and the data come from [7]. Let p_H(x) = (x − 0.75)^5 (x − 1)^11 be evaluated with HORNER's scheme, for 512 values x ∈ F ∩ [0.68, 1.15]. Figure 4 shows the accuracy of this evaluation using HORNER (original), DDHORNER, COMPHORNER, and our automatically generated ACHORNER algorithm. In each case, we measure the number of significant bits #sig. The original HORNER accuracy is low since the evaluation is processed in the neighborhood of multiple roots: most of the time, there is no significant bit. The other algorithms yield better accuracy. Our automatically generated algorithm has the same accuracy behavior as the twice more accurate DDHORNER and COMPHORNER.

Table III shows the real and ideal performances of the algorithms in terms of numbers of instructions, cycles, and instructions per cycle (IPC). The automatically generated algorithm has almost the same number of instructions and cycles as the compensated one. Moreover, Table III confirms that compensated algorithms expose more ILP than double-double ones. Even if the code of the algorithm generated by our approach is slightly different from the code of the existing algorithm, it appears to be as accurate and efficient as the one of [7]. Moreover, our future multi-criteria optimizations will produce quite different algorithms, by trading off accuracy against speed [25].
Further results. We now synthesize the case studies of algorithms and data presented in Table II. They are chosen such that the original algorithm returns no significant digit at the working precision, while all digits are recovered by the twice more accurate versions.
Table IV presents the differences in the number of significant bits between the automatically compensated (AC) algorithms and the compensated (COMP) or double-double (DD) ones. For example, the first line concerns HORNER's scheme of p_H for data x_1. The difference between the number of significant bits of ACHORNER and COMPHORNER is zero: the AC algorithm is as accurate as the existing compensated one. The difference with the DD algorithm is of one bit. Most of the other results exhibit a similarly good behavior. The slight differences for the last three data sets of the summation are due to the different effects of the sum length on the compensated and double-double solutions: the accuracy bound of the former is quadratically affected by the length, while it is only linearly dependent in the double-double case. This appears only when the condition number is large enough. The top and middle parts of Figure 5 show respectively the ratios of the number of cycles and of the number of instructions. For example, the rightmost plot is SUM(d_7). We see that the instruction ratio AC/COMP = 1.
The original SUM2 algorithm and our automatically generated one share the same number of instructions (both for the real and ideal measurements). The instruction ratio compared to the DD version is 0.8, which means 20% fewer generated instructions. Similarly, the compensated algorithms (original and generated) require only 30% of the total number of cycles of the DD one.
The bottom part of Figure 5 shows the ratio of the number of instructions per cycle. We observe that the AC algorithms have the same features as the original compensated ones. The measurements also confirm the interest of compensated algorithms, which have a better ILP potential than DD ones.
Finally, we note that the ILP potential (shown as dotted lines in Figure 5) is not fully exploited in our experimental environment. In a more favorable environment, where the hardware could exploit much more ILP, even better results for compensated algorithms are expected.

V. CONCLUSIONS AND PERSPECTIVES
In this article, we discussed the automated transformation of programs using floating-point arithmetic. We proposed a new method for automatically compensating the floating-point errors of computations, which improves accuracy without impacting execution time too much. The automatic transformation produces compensated algorithms which are as accurate and efficient as the ones derived case by case. The efficiency of our approach has been illustrated on various case studies.
It now remains to validate this approach (and the CoHD tool) on real and more sophisticated programs. To achieve this, we have to add support for floating-point division, square root, and the elementary functions. Moreover, this work is actually a first step toward the automatic generation of multi-criteria program optimizations (with respect to accuracy and execution time). It will allow us to apply partial error compensation and optimize the execution time overhead. Strategies of partial transformations ensured by code synthesis will be the subject of a forthcoming paper, whose abstract is given in [25].

Fig. 1: Diagrams for the floating-point operations ± and ×, and for their EFTs. These diagrams allow us to graphically represent transformation algorithms as basic computational blocks.

Fig. 3: Diagrams for Algorithms 10 and 11 for the automatic compensation of the sum (a) and the product (b). The dashed or dotted lines are removed when a or b are not compensated numbers but standard floating-point numbers.

Figure 5 presents the performance ratios between automatically compensated algorithms (AC) and existing compensated (COMP) or double-double (DD) ones. Here again, PAPI real measures (the mean of 10^6 executions) and PERPI ideal measures are proposed (plain or dotted lines, respectively).

Fig. 5: Performance ratios between automatically compensated algorithms (AC) and existing compensated (COMP) or double-double (DD) ones. Line drawings are real measurements done with PAPI (mean of 10^6 values), while dotted ones are ideal measures done with PERPI.

TABLE I: Floating-point operation count and dependency graph depth for various EFTs.

TABLE II: Case studies and data for SUM and polynomial evaluation with HORNER, CLENSHAW, and DECASTELJAU.

TABLE III: Performance measurements of the algorithms COMPHORNER, DDHORNER, and ACHORNER. Real values (PAPI) are the mean of 10^6 measures. Ideal values (PERPI) are displayed within parentheses.

TABLE IV: Differences in the number of significant bits #sig for the automatically compensated (AC) algorithms versus the existing compensated (COMP) and double-double (DD) ones.