Compiler technology is crucial for the success of modern processors. Since parallelization is a difficult problem, the principles of a compiler for the 4stack processor are outlined here briefly.
Stack code has been used as intermediate or target code by a wide range of compilers. However, these compilers don't exploit instruction-level parallelism, because they have been designed for single-stack machines. A compiler for the 4stack processor must therefore apply more sophisticated techniques.
Good parallelizing schedulers are known only at the basic-block level. The 4stack processor makes it possible to lengthen basic blocks using conditional execution. Well-known techniques such as loop unrolling and function inlining lengthen basic blocks further.
A compiler backend prototype for the 4stack processor exists. This prototype generates a directed acyclic graph (DAG) in static single assignment form (SSA) from a (single) stack-oriented intermediate language. The DAG is attributed with the latest execution time (LET), the path length (from the beginning of the block), the number of uses, and the latest use.
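The node attributes described above can be sketched as follows. This is a minimal Python illustration, not the prototype's actual data structures; the class and field names are assumptions, and the path length and use counts are computed over a topologically ordered node list.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)
class Node:
    """One SSA value in the basic-block DAG (names are illustrative)."""
    op: str
    args: List["Node"] = field(default_factory=list)
    let: int = 0        # latest execution time (filled in by the scheduler)
    path_len: int = 0   # path length from the beginning of the block
    uses: int = 0       # number of uses
    last_use: int = 0   # path length of the latest user

def attribute(nodes):
    """Compute path_len, uses and last_use; nodes must be in topological
    order, so every argument is visited before its users."""
    for n in nodes:
        n.path_len = 1 + max((a.path_len for a in n.args), default=0)
        for a in n.args:
            a.uses += 1
            a.last_use = max(a.last_use, n.path_len)

# tiny example DAG for t = (x + y) * x
x = Node("load x"); y = Node("load y")
add = Node("+", [x, y]); mul = Node("*", [add, x])
attribute([x, y, add, mul])
```

Note that `x` ends up with two uses and a latest use at path length 3, which is exactly the information the later passes need to decide whether an argument may be consumed.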
The first pass applies list scheduling, the dominant technique. In short, list scheduling is a sort operation using some heuristics. We schedule the operation with the largest LET last and break ties by placing the maximum path length last.
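Viewed as a sort, the heuristic reduces to a compound sort key. A minimal sketch, assuming each operation is represented as a `(name, LET, path_length)` tuple with illustrative values:

```python
# Each operation is (name, LET, path_length); the values are made up.
ops = [("c", 3, 2), ("a", 1, 1), ("d", 3, 3), ("b", 2, 1)]

def list_schedule(ops):
    # Ascending LET puts the largest LET last; among equal LETs,
    # ascending path length puts the maximum path length last.
    return sorted(ops, key=lambda op: (op[1], op[2]))

order = [name for name, *_ in list_schedule(ops)]  # → ['a', 'b', 'c', 'd']
```

Operations `c` and `d` share the same LET, so the path length decides their relative order.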
The 4stack processor imposes a number of restrictions on instruction scheduling. Operations consume their arguments, so additional stack effects have to be inserted if an argument is needed later (one stack effect is free). Arguments are consumed if the corresponding pseudo-register is dead afterwards. A natural heuristic to get close to optimal code is to place consuming operations on the same stack as the generating operation.
Therefore the next pass tries to slot operations according to these constraints, starting with the last operation. If two operations compete for the same resource, the operation with the shorter path loses and is rescheduled one cycle earlier.
Each operation may consume one or both of its source values, so it checks whether it is the last use. If it is the last user, it selects the operation that generates the value to be scheduled on the same stack and marks itself as the consumer.
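The backwards walk with consumer marking can be sketched like this. This is a simplified greedy illustration under assumed data shapes (a schedule of `(op, args)` pairs and a remaining-use count per value), not the prototype's actual pass:

```python
def assign_stacks(schedule, uses_left, n_stacks=4):
    """Walk the schedule backwards; when an operation is the last user of
    an argument, pin the argument's generating operation to the same stack
    so the value can be consumed in place (a sketch)."""
    stack_of = {}
    next_stack = 0
    for op, args in reversed(schedule):
        if op not in stack_of:              # not yet pinned by a later consumer
            stack_of[op] = next_stack % n_stacks
            next_stack += 1
        for a in args:
            uses_left[a] -= 1
            if uses_left[a] == 0 and a not in stack_of:
                stack_of[a] = stack_of[op]  # this operation marks itself consumer
    return stack_of

# t = (x + y) * x: "mul" is the last user of "add" and "x",
# "add" is the last user of "y"
schedule = [("x", []), ("y", []), ("add", ["x", "y"]), ("mul", ["add", "x"])]
stacks = assign_stacks(schedule, {"x": 2, "y": 1, "add": 1})
```

In this example the whole dependence chain is pulled onto one stack, which is the point of the heuristic: consumed values never have to move between stacks.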
A second scheduling pass inserts the necessary stack operations. If the value is buried deeper in the stack, a stack swap operation is scheduled on the same stack one cycle earlier. If neither value can be consumed, a stack copy operation is scheduled on the same stack one cycle earlier. If these stack operations displace other operations, a new stack slot must be found for them, possibly resulting in a reschedule of all operations before the inserted stack operation.
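The decision which stack operation to insert can be condensed into a small rule. A sketch, assuming depth 0 is the top of stack; the function and operation names are illustrative, not taken from the 4stack instruction set:

```python
def stack_fixup(depth, is_last_use):
    """Which stack operation must be scheduled one cycle earlier on the
    same stack before a value can be used (a sketch)."""
    if not is_last_use:
        return "copy"   # value is needed again: duplicate instead of consuming
    if depth > 0:
        return "swap"   # value is buried: bring it to the top first
    return None         # on top of stack and dead: consume it directly
```

Only the `None` case is free; both other cases cost a slot one cycle earlier and may displace an already-scheduled operation.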
Since each rescheduling pass slots at least one operation plus the inserted stack operation, the worst-case complexity is O(n^2). The 4stack processor can only access a limited number of stack elements through stack effects. If values are buried too deep in the stack, access is impossible. If such a value is needed, a copy to memory (into a stack frame) is necessary.
Local variables should stay on the stack as long as possible, and copies to memory should be avoided when they are not necessary. Values generated in different control-flow branches should therefore be produced at the same stack positions to minimize stack operations. To this end, the stack order of one control-flow branch (after scheduling) is used as the starting point for scheduling the other branch(es).
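Using one branch's result as the starting point amounts to recording a target layout. A minimal sketch, assuming each of the four stacks is a Python list with the top of stack last; the function name and representation are illustrative:

```python
def target_layout(branch_stacks):
    """Record where each live value ended up after scheduling the first
    branch: the remaining branches are then scheduled toward the same
    (stack, depth) positions (a sketch; depth 0 is the top of stack)."""
    return {v: (s, d)
            for s, stack in enumerate(branch_stacks)
            for d, v in enumerate(reversed(stack))}

# branch 1 left x below t on stack 0 and y on stack 1:
layout = target_layout([["x", "t"], ["y"], [], []])
```

When the other branches produce `x`, `t` and `y` at these positions, the join point needs no reconciling stack operations at all.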
Experiments showed that if the instruction-level parallelism (ILP) of the compiled block is well suited to the 4stack processor, this approach leads to fast compilation and good code. If the block has more ILP than the processor can exploit, the result degrades. Therefore the compiler should tie dependent parts together so that they are easy to schedule. It is also recommended not to unroll loops beyond the ``sweet spot'' of ILP.