On most hardware, multiplication is significantly faster than division. So if you have to divide many numbers by the same divisor, it is usually faster to determine the reciprocal of the divisor once and multiply the numbers with the reciprocal. If you divide by a constant, Gforth performs this optimization automatically.
However, for cases where the divisor is not known during compilation, Gforth provides words that allow you to implement this optimization without to much fuss.
Let’s start with an example: You want to divide all elements of an array of cells by the same number n. A straightforward way to implement this is:
: array/ ( addr u n -- ) -rot cells bounds u+do i @ over / i ! 1 cells +loop drop ;
A possibly more efficient version looks like this:
: array/ ( addr u n -- ) {: | reci[ staged/-size ] :} reci[ /f-stage1m cells bounds u+do i @ reci[ /f-stage2m i ! 1 cells +loop ;
This example first creates a local buffer reci[
with size
staged/-size
for storing the reciprocal data. Then
/f-stage1m
computes the reciprocal of n and stores it in
reci[
. Finally, inside the loop /f-stage2m
uses the
data in reci[
to compute the quotient.
There are some limitations: Only positive divisors are supported for
/f-stage1m
; for u/-stage1m
you can use a divisor of 2 or
higher. You get an error if you try to use an unsupported divisor.
You must initalize the reciprocal buffer for the floored second-stage
words with /f-stage1m
and for the unsigned second-stage words
with u/-stage1m
. You must not modify the reciprocal buffer
between the first stage and the second stage; basically, don’t treat
it as a memory buffer, but as something that is only mutable by the
first stage; the point of this rule is that future versions of Gforth
will not consider aliasing of this buffer.
Measurements show that staged division is not always beneficial:
break 100 elem even speedup core 7 2.09 Skylake (Core i5-6600K) - 0.94 Rocket Lake (Xeon E-2388G) 40 1.09 Golden Cove (Core i3-1315U P-core) - 0.85 Gracemont (Core i3-1315U E-core) 6 1.68 Zen2 (Ryzen 9 3900X) - 0.56 Zen3 (Ryzen 7 5800X)
The words are:
staged/-size
( – u ) gforth-1.0 “staged-slash-size”
Size of buffer for u/-stage1m
or /f-stage1m
.
/f-stage1m
( n a-reci – ) gforth-1.0 “slash-f-stage1m”
Compute the reciprocal of n and store it in the buffer
a-reci of size staged/-size
. Throws an error if
n<1.
/f-stage2m
( n1 a-reci – nquotient ) gforth-1.0 “slash-f-stage2m”
Nquotient is the result of dividing n1 by the divisor represented
by a-reci, which was computed by /f-stage1m
.
modf-stage2m
( n1 a-reci – umodulus ) gforth-1.0 “mod-f-stage2m”
Umodulus is the remainder of dividing n1 by the divisor represented
by a-reci, which was computed by /f-stage1m
.
/modf-stage2m
( n1 a-reci – umodulus nquotient ) gforth-1.0 “slash-mod-f-stage2m”
Nquotient is the quotient and umodulus is the remainder of
dividing n1 by the divisor represented by a-reci, which was
computed by /f-stage1m
.
u/-stage1m
( u a-reci – ) gforth-1.0 “u-slash-stage1m”
Compute the reciprocal of u and store it in the buffer
a-reci of size staged/-size
. Throws an error if
u<2.
u/-stage2m
( u1 a-reci – uquotient ) gforth-1.0 “u-slash-stage2m”
Uquotient is the result of dividing u1 by the divisor represented
by a-reci, which was computed by u/-stage1m
.
umod-stage2m
( u1 a-reci – umodulus ) gforth-1.0 “u-mod-stage2m”
Umodulus is the remainder of dividing u1 by the divisor represented
by a-reci, which was computed by u/-stage1m
.
u/mod-stage2m
( u1 a-reci – umodulus uquotient ) gforth-1.0 “u-slash-mod-stage2m”
Uquotient is the quotient and umodulus is the remainder of
dividing u1 by the divisor represented by a-reci, which was
computed by u/-stage1m
.
Gforth currently does not support staged symmetrical division.
You can recover the divisor from (the address of) a reciprocal with
staged/-divisor @
:
staged/-divisor
( addr1 – addr2 ) gforth-1.0 “staged-slash-divisor”
Addr1 is the address of a reciprocal, addr2 is the address containing the divisor from which the reciprocal was computed.
This can be useful when looking at the decompiler output of Gforth: a division by a constant is often compiled to a literal containing the address of a reciprocal followed by a second-stage word.
The performance impact of using these words strongly depends on the
architecture (does it have hardware division?) and the specific
implementation (how fast is hardware division?), but just to give you
an idea about the relative performance of these words, here are the
cycles per iteration of a microbenchmark (which performs the mentioned
word once per iteration) on two AMD64 implementations; the norm
column shows the normal division word (e.g., u/
), while the
stg2 column shows the corresponding stage2 word (e.g.,
u/-stage2m
):
Skylake Zen2 norm stg2 norm stg2 41.3 15.8 u/ 35.2 21.4 u/ 39.8 19.7 umod 36.9 25.8 umod 44.0 25.3 u/mod 43.0 33.9 u/mod 48.7 16.9 /f 36.2 22.5 /f 47.9 20.5 modf 37.9 27.1 modf 53.0 24.6 /modf 45.8 35.4 /modf 227.2 u/stage1 101.9 u/stage1 159.8 /fstage1 97.7 /fstage1