Let q1, q2, and q be the numbers of decimal digits
needed to represent C1, C2, and C’. If not zero, the
rounded coefficient C will require between 1 and p
decimal digits. Rounding is not necessary if C’
represented in decimal requires at most p digits, but
it is necessary otherwise.
If q ≤ p, then the result is exact:
n = (n’)
rnd,p
= (C’ · 10
e2
)
rnd,p
=
(C’)
rnd,p
· 10
e2
= C’ · 10
e2
Otherwise, if q > p let x = q – p ≥ 1. Then:
n = (n’)
rnd,p
= (C’ · 10
e2
)
rnd,p
=
(C’)
rnd,p
· 10
e2
= C · 10
e2+x
If after rounding C = 10
p
(rounding overflow), then
n =10
p-1
· 10
e2+x+1
.
A simple analysis shows that rounding is trivial if q1
+ e1 – q2 – e2 ≥ p. If this is not the case, i.e. if
|q1 + e1 – q2 – e2| ≤ p – 1
then the sum C’ has to be calculated and it has to be
rounded to p decimal digits. This case can be
optimized by separating it in sub-cases as shall be
seen further.
The algorithm presented next uses Property 1 in
order to round correctly (to the destination precision)
the result of a decimal floating-point addition in
rounding to nearest mode, and also determines
correctly the exactness of the result by using a
simple comparison operation. First, an
approximation of the result’s coefficient is
calculated using Property 1. This will be either the
correctly rounded coefficient, or it will be off by one
ulp (unit-in-the-last-place). The correct result as well
as its exactness can be determined directly from the
calculation, without having to compute a remainder
through a binary multiplication followed by a
subtraction for this purpose. This makes the
rounding operation for decimal floating-point
addition particularly efficient.
Decimal Floating-Point Addition with Rounding
to Nearest
The straightforward method to calculate the result is
to convert both coefficients to a decimal encoding,
perform a decimal addition, round the exact decimal
result to nearest to the destination precision, and
then convert the coefficient of the final result back to
binary. It would also be possible to store the
coefficients in decimal all the time, but then neither
software nor hardware implementations could take
advantage easily of existing instructions or circuitry
that operate on binary numbers. The algorithm used
for decimal floating-point addition in rounding to
nearest mode is Algorithm 1, shown further.
If the smaller operand represents more than a
rounding error in the larger operand, the sum C’ =
C1 · 10
e1–e2
+ C2 is calculated. If the number of
decimal digits q needed to represent this number
does not exceed the precision p of the destination
format, then no rounding is necessary and the result
is exact. If q > p, then x = q – p decimal digits have
to be removed from the lower part of C’, and C’ has
to rounded correctly to p decimal digits. For correct
rounding to nearest, 0.5 ulp is added to C’: C’’ = C’
+ 1/2 · 10
x
. The result is multiplied by k
x
≈ 10
-x
(C*
= C’’ · k
x),
where the pre-calculated values k
x
are
stored for all x {1, 2, …, p}. A test for midpoints
follows (0 < f* < 10
–p
, where f* is the fractional part
of C*) and if affirmative, the result is rounded to the
nearest even integer. (For example if the exact result
4567.5 has to be rounded to nearest to four decimal
places, the rounded result will be 4568.) Next the
algorithm checks for rounding overflow (p+1
decimal digits are obtained instead of p) and finally
it checks for exactness.
Note that the straightforward method for the
determination of midpoints and exactness is to
calculate a remainder r = C’ – C 10
x
∈[0, 10
x
).
Midpoint results could be identified by comparing
the remainder with 1/2·10
x
, and exact results by
comparing the remainder with 0. However, the
calculation of a remainder – a relatively costly
operation – was avoided in Algorithm 1 and instead
a single comparison to a pre-calculated constant was
used. This simplified method to determine midpoints
and exactness along with the ability to use Property
1 make Algorithm 1 more efficient for decimal
floating-point addition than previously known
methods.
Algorithm 1. Calculate the sum of two decimal
floating-point numbers rounded to nearest to p
decimal digits, and determine its exactness.
q1, q2 = number of decimal digits needed to
represent C1, C2 // from table lookup
if |q1 + e1 – q2 – e2| ≥ p then
// assuming that e1 ≥ e2 round the result
// directly as 0 < C2 < 1 ulp (C1 · 10
e1–e2
);
the result n = C1 · 10
e1
or
n = C1 · 10
e1
± 10
e1+q1–p
is inexact
else // if |q1 + e1 – q2 – e2| ≤ p – 1
C’ = C1 · 10
e1–e2
+ C2 // binary integer
// multiplication and addition;
// 10
e1–e2
from table lookup
q = number of decimal digits needed to
represent C’ // from table lookup
if q ≤ p the result n = C’ 10
e2
is exact
else if q ∈ [p+1, 2·p] continue
x = q – p, number of decimal digits to be
removed from lower part of C’, x ∈ [1, p]
C’’ = C’ + 1/2 · 10
x
// 1/2 · 10
x
// pre-calculated, from table lookup
k
x
= 10
–x
(1 + ε), 0 < ε < 2
–⎡2·ρ·p⎤
/ / pre-calculated as specified in Property 1
C* = C’’ · k
x
= C’’ · K
x
· 2
–Ex
// binary integer multiplication with
ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES
16