6.5.5 Two-stage integer division

On most hardware, multiplication is significantly faster than division. So if you have to divide many numbers by the same divisor, it is usually faster to determine the reciprocal of the divisor once and multiply the numbers with the reciprocal. If you divide by a constant, Gforth performs this optimization automatically.

However, for cases where the divisor is not known during compilation, Gforth provides words that allow you to implement this optimization without to much fuss.

Let’s start with an example: You want to divide all elements of an array of cells by the same number n. A straightforward way to implement this is:

: array/ ( addr u n -- )
  -rot cells bounds u+do
    i @ over / i !
  1 cells +loop
  drop ;

A possibly more efficient version looks like this:

: array/ ( addr u n -- )
  {: | reci[ staged/-size ] :}
  reci[ /f-stage1m
  cells bounds u+do
    i @ reci[ /f-stage2m i !
  1 cells +loop ;

This example first creates a local buffer reci[ with size staged/-size for storing the reciprocal data. Then /f-stage1m computes the reciprocal of n and stores it in reci[. Finally, inside the loop /f-stage2m uses the data in reci[ to compute the quotient.

There are some limitations: Only positive divisors are supported for /f-stage1m; for u/-stage1m you can use a divisor of 2 or higher. You get an error if you try to use an unsupported divisor. You must initalize the reciprocal buffer for the floored second-stage words with /f-stage1m and for the unsigned second-stage words with u/-stage1m. You must not modify the reciprocal buffer between the first stage and the second stage; basically, don’t treat it as a memory buffer, but as something that is only mutable by the first stage; the point of this rule is that future versions of Gforth will not consider aliasing of this buffer.

Measurements show that staged division is not always beneficial:

break  100 elem 
even   speedup  core
  7      2.09   Skylake (Core i5-6600K)
  -      0.94   Rocket Lake (Xeon E-2388G)
 40      1.09   Golden Cove (Core i3-1315U P-core)
  -      0.85   Gracemont (Core i3-1315U E-core)
  6      1.68   Zen2 (Ryzen 9 3900X)
  -      0.56   Zen3 (Ryzen 7 5800X)

The words are:

staged/-size ( – u  ) gforth-1.0 “staged-slash-size”

Size of buffer for u/-stage1m or /f-stage1m.

/f-stage1m ( n a-reci –  ) gforth-1.0 “slash-f-stage1m”

Compute the reciprocal of n and store it in the buffer a-reci of size staged/-size. Throws an error if n<1.

/f-stage2m ( n1 a-reci – nquotient ) gforth-1.0 “slash-f-stage2m”

Nquotient is the result of dividing n1 by the divisor represented by a-reci, which was computed by /f-stage1m.

modf-stage2m ( n1 a-reci – umodulus ) gforth-1.0 “mod-f-stage2m”

Umodulus is the remainder of dividing n1 by the divisor represented by a-reci, which was computed by /f-stage1m.

/modf-stage2m ( n1 a-reci – umodulus nquotient ) gforth-1.0 “slash-mod-f-stage2m”

Nquotient is the quotient and umodulus is the remainder of dividing n1 by the divisor represented by a-reci, which was computed by /f-stage1m.

u/-stage1m ( u a-reci –  ) gforth-1.0 “u-slash-stage1m”

Compute the reciprocal of u and store it in the buffer a-reci of size staged/-size. Throws an error if u<2.

u/-stage2m ( u1 a-reci – uquotient ) gforth-1.0 “u-slash-stage2m”

Uquotient is the result of dividing u1 by the divisor represented by a-reci, which was computed by u/-stage1m.

umod-stage2m ( u1 a-reci – umodulus ) gforth-1.0 “u-mod-stage2m”

Umodulus is the remainder of dividing u1 by the divisor represented by a-reci, which was computed by u/-stage1m.

u/mod-stage2m ( u1 a-reci – umodulus uquotient ) gforth-1.0 “u-slash-mod-stage2m”

Uquotient is the quotient and umodulus is the remainder of dividing u1 by the divisor represented by a-reci, which was computed by u/-stage1m.

Gforth currently does not support staged symmetrical division.

You can recover the divisor from (the address of) a reciprocal with staged/-divisor @:

staged/-divisor ( addr1 – addr2  ) gforth-1.0 “staged-slash-divisor”

Addr1 is the address of a reciprocal, addr2 is the address containing the divisor from which the reciprocal was computed.

This can be useful when looking at the decompiler output of Gforth: a division by a constant is often compiled to a literal containing the address of a reciprocal followed by a second-stage word.

The performance impact of using these words strongly depends on the architecture (does it have hardware division?) and the specific implementation (how fast is hardware division?), but just to give you an idea about the relative performance of these words, here are the cycles per iteration of a microbenchmark (which performs the mentioned word once per iteration) on two AMD64 implementations; the norm column shows the normal division word (e.g., u/), while the stg2 column shows the corresponding stage2 word (e.g., u/-stage2m):

 Skylake              Zen2
norm stg2           norm stg2
41.3 15.8 u/        35.2 21.4 u/           
39.8 19.7 umod      36.9 25.8 umod         
44.0 25.3 u/mod     43.0 33.9 u/mod        
48.7 16.9 /f        36.2 22.5 /f           
47.9 20.5 modf      37.9 27.1 modf         
53.0 24.6 /modf     45.8 35.4 /modf        
    227.2 u/stage1      101.9 u/stage1
    159.8 /fstage1       97.7 /fstage1