Two important considerations have to be made when introducing a new architecture:
The office PC market is certainly in the first category; it's becoming a saturated market. Performance upgrades still give some vitality to this market, but the typical office application doesn't need many more resources, especially for users who refuse to buy every software update.
A rapidly growing demand for computing power arises in telecommunications, both wire-bound and wireless, and on both the service provider and the subscriber side (ADSL, cable modem, high bandwidth wireless telecommunication like UMTS and EDGE, ATM switching). Except for ATM switching, all these applications demand signal processing power, especially UMTS.
High bandwidth mobile wireless communication (UMTS, EDGE) enables applications like video conferencing or portable video on demand, which themselves require a significant amount of signal processing on the client side (compressing and decompressing video and audio). Low power consumption is very important in this area, too. Furthermore, small devices require new input methods like speech recognition, which also requires a good deal of computing performance.
More growth is also expected in the consumer/appliance market. There, much more performance-demanding applications (3D games, especially) actually drive the PC market. However, just for playing, PCs are way too expensive and too difficult to use to gain a much wider market. Can you imagine people who are not capable of programming their VCR buying a PC?
Besides games, other applications are spreading into homes, such as internet surfing and digital television. The current approach is to have a special purpose box for every consumer application:
All these boxes have one thing in common: they demand lots of CPU power to decode MPEG streams or to render 3D graphics. The least CPU-demanding box is the one for internet access. You can have all the functionality in one box, and this box need not be a PC, because none of the applications demands hard disk access; games, digital VCR and DVD player should, however, be able to read (and write) a DVD.
There's another reason why you won't buy a PC instead of these boxes: you may not get by with only one of them. The kids want to play games, while the dad wants to watch sports, which the mom can't stand...
Aside from playing games, watching TV and surfing the internet, this box can be used as a graphical terminal to access a multiuser PC, mostly as an X terminal to run Unix-flavour applications.
The market above has some specific product requirements:
So what you need is a processor tuned for signal processing and 3D rendering. There's a lot of parallelism in these applications, so to get speed, the architecture has to exploit fine-grain parallelism and thus be either superscalar or VLIW. Price considerations rule out superscalar architectures.
Another requirement is low memory usage. Cheap boxes won't come with lots of memory. Traditional VLIWs, however, have bloated code. So an architecture has to be found that reduces code bloat.
The 4stack processor is an eight-issue VLIW processor, with four ALU and four stack operations (counted as four operations in total), two load/store operations and two address update operations. Two DSP MAC units and two floating point units (one add, one multiply) allow high performance signal processing and 3D geometry computations. This gives it enough performance to be used for all the applications above.
The simplification of a VLIW architecture results in most transistors of the core actually being used for work, not for decoding, scheduling, register renaming and other operations a superscalar CPU needs to do. Fewer than 500k transistors are required for the core, which leaves more space for caches and allows for cheaper implementations with less cache. Also, a low transistor budget and smaller caches reduce power consumption, which is important for mobile applications. While an Athlon with about 20 million logic transistors burns 40 W at 1 GHz, the 4stack core is estimated to consume about 1 W at 1 GHz when implemented in the same technology; careful design with gated clocks will reduce this number even further.
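The 1 W figure is consistent with a simple back-of-the-envelope scaling. The following Python sketch reproduces it under one loudly stated assumption that the text does not spell out: power at a fixed clock and in the same technology is taken to scale linearly with the number of logic transistors. The function name `scaled_power` is made up for this illustration; the input numbers are the ones quoted above.

```python
# Figures quoted in the text.
ATHLON_LOGIC_TRANSISTORS = 20_000_000   # "about 20 million logic transistors"
ATHLON_POWER_WATTS = 40.0               # "burns 40 W at 1 GHz"
CORE_TRANSISTORS = 500_000              # upper bound given for the 4stack core


def scaled_power(ref_power_watts: float,
                 ref_transistors: int,
                 transistors: int) -> float:
    """Scale a reference power figure by transistor count.

    ASSUMPTION (not from the text): at the same clock and in the same
    process technology, power is proportional to the logic transistor count.
    """
    if ref_transistors <= 0 or transistors < 0:
        raise ValueError("transistor counts must be positive")
    return ref_power_watts * transistors / ref_transistors


if __name__ == "__main__":
    estimate = scaled_power(ATHLON_POWER_WATTS,
                            ATHLON_LOGIC_TRANSISTORS,
                            CORE_TRANSISTORS)
    print(f"Estimated core power at 1 GHz: {estimate:.1f} W")
```

This is only a rough proportionality argument; it ignores caches, switching activity and clock gating, the last of which the text expects to lower the number further.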
The stack paradigm greatly increases instruction density. While a normal RISC processor needs 32 bits for one instruction, the 4stack processor encodes 8 operations in 64 bits. VLIW architectures can't always fill each operation slot; however, if at least two operation slots are filled, the 4stack architecture breaks even. This gives better utilization of program memory, leading to cheaper chips (smaller instruction cache) and lower system costs (less memory required). Zero-cycle branches and predicated instructions allow fast execution of conventional and object-oriented workloads (needed to run recompiled Java fast).
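The break-even claim follows directly from the encoding sizes. As a minimal sketch, the Python snippet below computes the code size per useful operation for a 64-bit instruction word as a function of how many of its 8 slots are filled, and compares it with the 32 bits per operation of a RISC encoding. The function names are hypothetical and exist only for this illustration; the three constants come from the text.

```python
RISC_BITS_PER_OP = 32   # one 32-bit RISC instruction per operation
VLIW_WORD_BITS = 64     # one 4stack instruction word
VLIW_SLOTS = 8          # operations encoded per instruction word


def bits_per_operation(filled_slots: int) -> float:
    """Code size per useful operation when only some slots do work."""
    if not 1 <= filled_slots <= VLIW_SLOTS:
        raise ValueError("filled_slots must be between 1 and 8")
    return VLIW_WORD_BITS / filled_slots


def break_even_slots() -> int:
    """Smallest slot count at which the VLIW word is no larger per operation."""
    for filled in range(1, VLIW_SLOTS + 1):
        if bits_per_operation(filled) <= RISC_BITS_PER_OP:
            return filled
    raise RuntimeError("no break-even point within one instruction word")


if __name__ == "__main__":
    for filled in range(1, VLIW_SLOTS + 1):
        print(f"{filled} slot(s) filled: {bits_per_operation(filled):5.1f} bits/op")
    print("break-even at", break_even_slots(), "filled slots")
```

With all eight slots filled, the encoding costs 8 bits per operation, a quarter of the RISC figure; with a single slot filled it costs twice as much. Real code density depends on how well a compiler or programmer fills the slots, which this sketch does not model.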
The 4stack processor features memory protection, virtual memory, supervisor/user mode distinction, and other things PC CPUs have, while most DSP CPUs don't. So an upgrade path from the game/digital TV console to a real computer is possible.