You are looking at an unfinished tutorial which when complete will become part of the new Jitter web page

GNU Jitter — Luca Saiu

[2021-12-17: Jitter has been officially accepted as part of the GNU project! The web page for Jitter is now on; however this tutorial is still only hosted here on]

GNU Jitter is a software generating C code for very efficient language virtual machines from a relatively high-level specification describing instructions and runtime structures.

[registers, stacks] [gc, threads; what is not completely automatic yet can already be accomplished by using provided macros]

The generated code is easy to use and very portable. A VM will behave identically on any platform, the only observable difference being performnace.

[dispatches; no assembly in VM-specific or user code]

Jitter includes a relatively extensive test suite currently comprising around 700 test cases, which supports cross-compilation and running test cases under qemu-user. Generated native code can be checked and disassembled even in cross configurations, relying on cross-compiled GNU binutils.


[define VM runtime data structures, VM instructions, optimisation rewrite rules code generation API] [VM programs: executing, printing, parsing, disassembling]

Users of Jitter

The GNU poke project by José Marchesi relies on a Jittery VM for the efficient execution of Poke programs.

The Jitter distribution itself comes with examples complex enough to require their own documentation and examples, inclduing a Lisp system.

A concrete example

[This is a boring example and not a particularly good application, to make it more accessible. See JitterLisp and this message for something more interesting] example-vms/structured/

Euclidean algorithm

Structured language source example-vms/structured/examples/euclid-iterative.structured:

var a = 1333333333,
    b = 1;

while a <> b do
  if a < b then
    b := b - a;
    a := a - b;

print a;

Defining a register-based VM

[ example-vms/structured/structured.jitter Make sure to edit the file structured.jitter and set fast-register-no to 3 or 4 before recompiling if you want to run these tests on your machine. Recompiling will take a while, but code execution will be fast. The file is currently distributed with fast-register-no set to 0 so that it can be compiled quickly. ]

register code example-vms/structured/structured-code-generator-register.c

[ Only one register class named r, holding integer values. Routines using this VM have access to an unlimited number of registers named %r0, %r1, and so on. More complex VMs may define additional register classes, for example to hold floating-point, string or vector values. Dynamically typed languages may use just one register class containing data such as a C struct containing a tag and a union, or some tagged pointer to similar data. For the structured example a single r register class suffices. ]

[what mov is] [ In this VM, by an arbitrary convention, I chose to write the destination operand of an instruction on the right. Therefore in a VM routine the instruction mov %r1, %r2 will cause the value held by the register %r1 to be copied into the register %r2. The VM instruction mov 7, %r3 would set or initialise the register %r3 to the value 7. ]

instruction mov (?Rn 0 1 -1 2 structured_literal_printer,

[ mov R (r) n ? ! ?! 0 1 -1 2 structured_literal_printer code end JITTER_ARGN0 JITTER_ARGN1 N JITTER_ARGN1 = JITTER_ARGN0; ]

instruction minus (?Rn structured_literal_printer,
                   ?Rn 1 2 structured_literal_printer,
instruction b (?f)


instruction bge (?Rn 0 structured_literal_printer,
                 ?Rn 0 structured_literal_printer,



instruction be (?Rn 0 structured_literal_printer,
                ?Rn 0 structured_literal_printer,

Register VM code

[ ]

[ structured --print ]

        mov    1333333333, %r0
        mov    1, %r1
        be     %r0, %r1, $L9
        bge    %r0, %r1, $L6
        minus  %r1, %r0, %r1
        b      $L7
        minus  %r0, %r1, %r0
        bne    %r0, %r1, $L3
        b      $L9
        print  %r0

[ This routine only uses two VM registers, %r0 and %r1. [if ...] ]

[third-to-last instruction, b $L9 unneeded]

[ register code, x86_64, no-threading structured --disassemble ]

# 0x466d40: !BEGINBASICBLOCK 0x7ffff7d7e6e0 (0 bytes):
# 0x466d48: mov/nR/%r0 0x4f790d55 (9 bytes):
    0x00007ffff7d7e6e0 41 bc 55 0d 79 4f        movl   $0x4f790d55,%r12d
    0x00007ffff7d7e6e6 4d 89 e5                 movq   %r12,%r13
# 0x466d50: mov/n1/%r1 (6 bytes):
    0x00007ffff7d7e6e9 41 b8 01 00 00 00        movl   $0x1,%r8d
# 0x466d50: be/%r0/%r1/fR 0x466d90 (9 bytes):
    0x00007ffff7d7e6ef 4d 39 c5                 cmpq   %r8,%r13
    0x00007ffff7d7e6f2 0f 84 22 00 00 00        je     0x00007ffff7d7e71a
# 0x466d58: !BEGINBASICBLOCK 0x7ffff7d7e6f8 (0 bytes):
# 0x466d60: bge/%r0/%r1/fR 0x466d70 (9 bytes):
    0x00007ffff7d7e6f8 4d 39 c5                 cmpq   %r8,%r13
    0x00007ffff7d7e6fb 0f 8d 08 00 00 00        jge    0x00007ffff7d7e709
# 0x466d68: minus/%r1/%r0/%r1 (3 bytes):
    0x00007ffff7d7e701 4d 29 e8                 subq   %r13,%r8
# 0x466d68: b/fR 0x466d78 (5 bytes):
    0x00007ffff7d7e704 e9 03 00 00 00           jmpq   0x00007ffff7d7e70c
# 0x466d70: !BEGINBASICBLOCK 0x7ffff7d7e709 (0 bytes):
# 0x466d78: minus/%r0/%r1/%r0 (3 bytes):
    0x00007ffff7d7e709 4d 29 c5                 subq   %r8,%r13
# 0x466d78: !BEGINBASICBLOCK 0x7ffff7d7e70c (0 bytes):
# 0x466d80: bne/%r0/%r1/fR 0x466d58 (9 bytes):
    0x00007ffff7d7e70c 4d 39 c5                 cmpq   %r8,%r13
    0x00007ffff7d7e70f 0f 85 e3 ff ff ff        jne    0x00007ffff7d7e6f8
# 0x466d88: b/fR 0x466d90 (5 bytes):
    0x00007ffff7d7e715 e9 00 00 00 00           jmpq   0x00007ffff7d7e71a
# 0x466d90: !BEGINBASICBLOCK 0x7ffff7d7e71a (0 bytes):
# 0x466d98: print/%r0/retR 0x7ffff7d7e72b (17 bytes):
    0x00007ffff7d7e71a 49 bc 2b e7 d7 f7 ff 7f 00 00    movabsq $0x7ffff7d7e72b,%r12
    0x00007ffff7d7e724 ff a4 24 88 00 00 00     jmpq   *0x88(%rsp)
# 0x466da0: exitvm (7 bytes):
    0x00007ffff7d7e72b 48 8b 44 24 58           movq   0x58(%rsp),%rax
    0x00007ffff7d7e730 ff e0                    jmpq   *%rax

[ Lines starting with # are comments added by Jitter's disassembly function to delimit specialised VM instructions. ]

[ The two three-argument VM instructions minus %r1, %r0, %r1 and minus %r0, %r1, %r0 have turned into two different specialised instructions, with no arguments: minus/%r1/%r0/%r1 and minus/%r0/%r1/%r0: each consists of a single hardware instruction subq, with different register operands. [fast!] ] [ If minus had been used with some other arguments, then the specialised instructions would have been different, and in some case residual arguments would have remained in the specialised version: [ mov/nR/%r0 vs mov/n1/%r1 ] ]

The hardware instructions for specialised VM instructions come at consecutive addresses in memory: for example the first b instruction starts (in this run) at the memory address 0x00007ffff7d7e704, three bytes after the beginning of the minus/%r1/%r0/%r1 right before it at 0x00007ffff7d7e701, which takes three bytes. Specialised instructions are replicated: the second b below starts at a different memory address, 0x00007ffff7d7e715, and does not share code with the first one. This means that, in the absence of VM-level branches, control can simply fall thru from one VM instruction to the next, sparing the cost of any hardware-level branching. [the name: no-threading: trivial dispatch, or no dispatch cost]

[ VM register %r0 mapped to the hardware register %r13. VM register %r1 mapped to the hardware register %r8. Chosen by GCC. If there are not enough available hardware registers, some fast registers will be mapped to memory at a statically known offset from a register. Slow registers will be mapped to memory at an offset from the base to be computed at run time.
This information, while not difficult to deduce from the disassembly, can also be obtained directly from a Jitter function by calling the structured program with the command-line option --print-locations: ]

 0.                     base: %rbx         (register)
 1.               residual 0: %r12         (register)
 2.            mainstack top: %rbp         (register)
 3.   mainstack undertop ptr: %r14         (register)
 4.                      %r0: %r13         (register)
 5.                      %r1: %r8          (register)
 6.                      %r2: %r9          (register)

[ Look at how fast branches are compiled ]

[ mov/nR/%r0 not quite efficient as mov/n1/%r1, despite both being specialisations of the same VM instruction mov. 1333333333 not a common constant in programs, differently from 1. When translating the VM program 1333333333 is residualised, where here 1 has been translated as a fast literal, due to the definition of the VM instruction mov, above. Residualised instruction arguments must be materialized before use with no-threading dispatch, usually into a hardware register -- in this case %r12. ]

Defining a stack-based VM

stack vm code example-vms/structured/structured-code-generator-stack.c

[not the actual definition, but this generates the same code thanks to GCC's cleverness]

instruction push-stack (?Rnl 0 1 -1 2 structured_literal_printer)


instruction pop-stack (!R)
    jitter_int top = JITTER_TOP_MAINSTACK ();
    JITTER_ARGN0 = top;


instruction minus-stack ()
    jitter_int top = JITTER_TOP_MAINSTACK ();
    jitter_int under_top = JITTER_UNDER_TOP_MAINSTACK ();

    JITTER_PUSH_MAINSTACK (under_top - top);

[less-stack implemented as an ordinary stack operation like minus-stack]

instruction bt-stack (?f)
    jitter_int top = JITTER_TOP_MAINSTACK ();
instruction bf-stack (?f)
    jitter_int top = JITTER_TOP_MAINSTACK ();

Stack VM code

[ ]

        mov             1333333333, %r0
        mov             1, %r1
        push-stack      %r0
        push-stack      %r1
        bf-stack        $L24
        push-stack      %r0
        push-stack      %r1
        bf-stack        $L15
        push-stack      %r1
        push-stack      %r0
        pop-stack       %r1
        b               $L19
        push-stack      %r0
        push-stack      %r1
        pop-stack       %r0
        push-stack      %r0
        push-stack      %r1
        bt-stack        $L6
        b               $L24
        push-stack      %r0

[...the eagle-eyed reader will have noticed how... mov...]

[...rewrite rule (named push-pop--movn)... ]:

rule push-pop--movn
    push-stack n $a; pop-stack R $b
    mov $a, $b

[ other optimisation opportunities remain: for example it would be easy to rewrite different-stack followed by bf-stack $L into a new single instruction consuming two elements off the stack to branch when they are equal. I recommend this optimisation as a beginner exercise; other rewrites are possible, which would turn some common stack instructions patterns into code similar to the register VM case. ]

stack code, disassembled, no-threading, x86_64

# 0x466d40: !BEGINBASICBLOCK 0x7ffff7d7cde0 (0 bytes):
# 0x466d48: mov/nR/%r0 0x4f790d55 (9 bytes):
    0x00007ffff7d7cde0 41 bc 55 0d 79 4f        movl   $0x4f790d55,%r12d
    0x00007ffff7d7cde6 4d 89 e5                 movq   %r12,%r13
# 0x466d50: mov/n1/%r1 (6 bytes):
    0x00007ffff7d7cde9 41 b8 01 00 00 00        movl   $0x1,%r8d
# 0x466d50: push-stack/%r0 (11 bytes):
    0x00007ffff7d7cdef 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7cdf3 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7cdf7 4c 89 ed                 movq   %r13,%rbp
# 0x466d50: push-stack/%r1 (11 bytes):
    0x00007ffff7d7cdfa 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7cdfe 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ce02 4c 89 c5                 movq   %r8,%rbp
# 0x466d50: different-stack (15 bytes):
    0x00007ffff7d7ce05 49 39 2e                 cmpq   %rbp,(%r14)
    0x00007ffff7d7ce08 40 0f 95 c5              setne  %bpl
    0x00007ffff7d7ce0c 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce10 40 0f b6 ed              movzbl %bpl,%ebp
# 0x466d50: bf-stack/fR 0x466d90 (21 bytes):
    0x00007ffff7d7ce14 48 89 e8                 movq   %rbp,%rax
    0x00007ffff7d7ce17 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce1b 49 8b 6e 08              movq   0x8(%r14),%rbp
    0x00007ffff7d7ce1f 48 83 f8 00              cmpq   $0x0,%rax
    0x00007ffff7d7ce23 0f 84 de 00 00 00        je     0x00007ffff7d7cf07
# 0x466d58: !BEGINBASICBLOCK 0x7ffff7d7ce29 (0 bytes):
# 0x466d60: push-stack/%r0 (11 bytes):
    0x00007ffff7d7ce29 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ce2d 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ce31 4c 89 ed                 movq   %r13,%rbp
# 0x466d60: push-stack/%r1 (11 bytes):
    0x00007ffff7d7ce34 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ce38 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ce3c 4c 89 c5                 movq   %r8,%rbp
# 0x466d60: less-stack (15 bytes):
    0x00007ffff7d7ce3f 49 39 2e                 cmpq   %rbp,(%r14)
    0x00007ffff7d7ce42 40 0f 9c c5              setl   %bpl
    0x00007ffff7d7ce46 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce4a 40 0f b6 ed              movzbl %bpl,%ebp
# 0x466d60: bf-stack/fR 0x466d70 (21 bytes):
    0x00007ffff7d7ce4e 48 89 e8                 movq   %rbp,%rax
    0x00007ffff7d7ce51 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce55 49 8b 6e 08              movq   0x8(%r14),%rbp
    0x00007ffff7d7ce59 48 83 f8 00              cmpq   $0x0,%rax
    0x00007ffff7d7ce5d 0f 84 35 00 00 00        je     0x00007ffff7d7ce98
# 0x466d68: push-stack/%r1 (11 bytes):
    0x00007ffff7d7ce63 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ce67 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ce6b 4c 89 c5                 movq   %r8,%rbp
# 0x466d68: push-stack/%r0 (11 bytes):
    0x00007ffff7d7ce6e 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ce72 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ce76 4c 89 ed                 movq   %r13,%rbp
# 0x466d68: minus-stack (13 bytes):
    0x00007ffff7d7ce79 49 8b 06                 movq   (%r14),%rax
    0x00007ffff7d7ce7c 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce80 48 29 e8                 subq   %rbp,%rax
    0x00007ffff7d7ce83 48 89 c5                 movq   %rax,%rbp
# 0x466d68: pop-stack/%r1 (13 bytes):
    0x00007ffff7d7ce86 49 8b 06                 movq   (%r14),%rax
    0x00007ffff7d7ce89 49 89 e8                 movq   %rbp,%r8
    0x00007ffff7d7ce8c 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ce90 48 89 c5                 movq   %rax,%rbp
# 0x466d68: b/fR 0x466d78 (5 bytes):
    0x00007ffff7d7ce93 e9 30 00 00 00           jmpq   0x00007ffff7d7cec8
# 0x466d70: !BEGINBASICBLOCK 0x7ffff7d7ce98 (0 bytes):
# 0x466d78: push-stack/%r0 (11 bytes):
    0x00007ffff7d7ce98 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ce9c 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7cea0 4c 89 ed                 movq   %r13,%rbp
# 0x466d78: push-stack/%r1 (11 bytes):
    0x00007ffff7d7cea3 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7cea7 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ceab 4c 89 c5                 movq   %r8,%rbp
# 0x466d78: minus-stack (13 bytes):
    0x00007ffff7d7ceae 49 8b 06                 movq   (%r14),%rax
    0x00007ffff7d7ceb1 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7ceb5 48 29 e8                 subq   %rbp,%rax
    0x00007ffff7d7ceb8 48 89 c5                 movq   %rax,%rbp
# 0x466d78: pop-stack/%r0 (13 bytes):
    0x00007ffff7d7cebb 49 8b 06                 movq   (%r14),%rax
    0x00007ffff7d7cebe 49 89 ed                 movq   %rbp,%r13
    0x00007ffff7d7cec1 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7cec5 48 89 c5                 movq   %rax,%rbp
# 0x466d78: !BEGINBASICBLOCK 0x7ffff7d7cec8 (0 bytes):
# 0x466d80: push-stack/%r0 (11 bytes):
    0x00007ffff7d7cec8 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7cecc 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7ced0 4c 89 ed                 movq   %r13,%rbp
# 0x466d80: push-stack/%r1 (11 bytes):
    0x00007ffff7d7ced3 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7ced7 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7cedb 4c 89 c5                 movq   %r8,%rbp
# 0x466d80: different-stack (15 bytes):
    0x00007ffff7d7cede 49 39 2e                 cmpq   %rbp,(%r14)
    0x00007ffff7d7cee1 40 0f 95 c5              setne  %bpl
    0x00007ffff7d7cee5 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7cee9 40 0f b6 ed              movzbl %bpl,%ebp
# 0x466d80: bt-stack/fR 0x466d58 (21 bytes):
    0x00007ffff7d7ceed 48 89 e8                 movq   %rbp,%rax
    0x00007ffff7d7cef0 49 83 ee 08              subq   $0x8,%r14
    0x00007ffff7d7cef4 49 8b 6e 08              movq   0x8(%r14),%rbp
    0x00007ffff7d7cef8 48 83 f8 00              cmpq   $0x0,%rax
    0x00007ffff7d7cefc 0f 85 27 ff ff ff        jne    0x00007ffff7d7ce29
# 0x466d88: b/fR 0x466d90 (5 bytes):
    0x00007ffff7d7cf02 e9 00 00 00 00           jmpq   0x00007ffff7d7cf07
# 0x466d90: !BEGINBASICBLOCK 0x7ffff7d7cf07 (0 bytes):
# 0x466d98: push-stack/%r0 (11 bytes):
    0x00007ffff7d7cf07 49 89 6e 08              movq   %rbp,0x8(%r14)
    0x00007ffff7d7cf0b 49 83 c6 08              addq   $0x8,%r14
    0x00007ffff7d7cf0f 4c 89 ed                 movq   %r13,%rbp
# 0x466d98: print-stack/retR 0x7ffff7d7cf23 (17 bytes):
    0x00007ffff7d7cf12 49 bc 23 cf d7 f7 ff 7f 00 00    movabsq $0x7ffff7d7cf23,%r12
    0x00007ffff7d7cf1c ff a4 24 b0 00 00 00     jmpq   *0xb0(%rsp)
# 0x466da0: exitvm (7 bytes):
    0x00007ffff7d7cf23 48 8b 44 24 58           movq   0x58(%rsp),%rax
    0x00007ffff7d7cf28 ff e0                    jmpq   *%rax

[TOS optimisation: %rbp is the TOS, %r14 is the under-top pointer. The benefit is clear in specialised instructions such as push-stack/%r0. ]

[ The specialised VM instruction minus-stack looks clumsy on x86_64 because of the limitation on the number and the order of the operands in the hardware instruction subq. ]

[again mov/nR/%r0 less efficient than mov/n1/%r1, for the same reason as the specialisations of mov in the previous case.]

The same code on MIPS

register code on mips:

# 0x4728e8: !BEGINBASICBLOCK 0x7f6fe6f0 (0 bytes):
# 0x4728ec: mov/nR/%r0 0x4f790d55 (12 bytes):
    0x7f6fe6f0 3c114f79         lui     $17,0x4f79
    0x7f6fe6f4 36310d55         ori     $17,$17,0xd55
    0x7f6fe6f8 0220b025         or      $22,$17,$0
# 0x4728f0: mov/n1/%r1 (4 bytes):
    0x7f6fe6fc 24150001         addiu   $21,$0,1
# 0x4728f0: be/%r0/%r1/fR 0x472910 (8 bytes):
    0x7f6fe700 12d5000c         beq     $22,$21,0x7f6fe734
    0x7f6fe704 00000000         sll     $0,$0,0x0
# 0x4728f4: !BEGINBASICBLOCK 0x7f6fe708 (0 bytes):
# 0x4728f8: bge/%r0/%r1/fR 0x472900 (12 bytes):
    0x7f6fe708 02d5102a         slt     $2,$22,$21
    0x7f6fe70c 10400004         beqz    $2,0x7f6fe720
    0x7f6fe710 00000000         sll     $0,$0,0x0
# 0x4728fc: minus/%r1/%r0/%r1 (4 bytes):
    0x7f6fe714 02b6a823         subu    $21,$21,$22
# 0x4728fc: b/fR 0x472904 (8 bytes):
    0x7f6fe718 0bdbf9c9         j       0x7f6fe724
    0x7f6fe71c 00000000         sll     $0,$0,0x0
# 0x472900: !BEGINBASICBLOCK 0x7f6fe720 (0 bytes):
# 0x472904: minus/%r0/%r1/%r0 (4 bytes):
    0x7f6fe720 02d5b023         subu    $22,$22,$21
# 0x472904: !BEGINBASICBLOCK 0x7f6fe724 (0 bytes):
# 0x472908: bne/%r0/%r1/fR 0x4728f4 (8 bytes):
    0x7f6fe724 16d5fff8         bne     $22,$21,0x7f6fe708
    0x7f6fe728 00000000         sll     $0,$0,0x0
# 0x47290c: b/fR 0x472910 (8 bytes):
    0x7f6fe72c 0bdbf9cd         j       0x7f6fe734
    0x7f6fe730 00000000         sll     $0,$0,0x0
# 0x472910: !BEGINBASICBLOCK 0x7f6fe734 (0 bytes):
# 0x472914: print/%r0/retR 0x7f6fe748 (20 bytes):
    0x7f6fe734 3c117f6f         lui     $17,0x7f6f
    0x7f6fe738 3631e748         ori     $17,$17,0xe748
    0x7f6fe73c 8fa20490         lw      $2,1168($29)
    0x7f6fe740 00400008         jr      $2
    0x7f6fe744 00000000         sll     $0,$0,0x0
# 0x472918: exitvm (12 bytes):
    0x7f6fe748 8fa204a8         lw      $2,1192($29)
    0x7f6fe74c 00400008         jr      $2
    0x7f6fe750 00000000         sll     $0,$0,0x0

[ ]

 0.                     base: $16          (register)
 1.                  scratch: $23          (register)
 2.               residual 0: $17          (register)
 3.               residual 1: $18          (register)
 4.               residual 2: $19          (register)
 5.            mainstack top: $20          (register)
 6.   mainstack undertop ptr: $fp          (register)
 7.                      %r0: $22          (register)
 8.                      %r1: $21          (register)
 9.                      %r2: $7           (register)

[ suboptimal use of delay slots, for example in the first b/fR: The unconditional jump j comes right after a VM instruction ending with a subu hardware instruction, which could be moved to the delay slot in place of the nop (sll $0, $0, 0x0). ] [ cannot schedule hardware instructions across VM instruction boundaries. SPARC and SH have the same problem. Luckily delay slots are becoming less common in recent architectures, including the latest MIPS. ]

stack code on mips:

# 0x4728e8: !BEGINBASICBLOCK 0x7f6fcdf0 (0 bytes):
# 0x4728ec: mov/nR/%r0 0x4f790d55 (12 bytes):
    0x7f6fcdf0 3c114f79         lui     $17,0x4f79
    0x7f6fcdf4 36310d55         ori     $17,$17,0xd55
    0x7f6fcdf8 0220b025         or      $22,$17,$0
# 0x4728f0: mov/n1/%r1 (4 bytes):
    0x7f6fcdfc 24150001         addiu   $21,$0,1
# 0x4728f0: push-stack/%r0 (12 bytes):
    0x7f6fce00 afd40004         sw      $20,4($30)
    0x7f6fce04 27de0004         addiu   $30,$30,4
    0x7f6fce08 02c0a025         or      $20,$22,$0
# 0x4728f0: push-stack/%r1 (12 bytes):
    0x7f6fce0c afd40004         sw      $20,4($30)
    0x7f6fce10 27de0004         addiu   $30,$30,4
    0x7f6fce14 02a0a025         or      $20,$21,$0
# 0x4728f0: different-stack (16 bytes):
    0x7f6fce18 8fc20000         lw      $2,0($30)
    0x7f6fce1c 27defffc         addiu   $30,$30,-4
    0x7f6fce20 00544826         xor     $9,$2,$20
    0x7f6fce24 0009a02b         sltu    $20,$0,$9
# 0x4728f0: bf-stack/fR 0x472910 (20 bytes):
    0x7f6fce28 27defffc         addiu   $30,$30,-4
    0x7f6fce2c 02801025         or      $2,$20,$0
    0x7f6fce30 8fd40004         lw      $20,4($30)
    0x7f6fce34 1040003c         beqz    $2,0x7f6fcf28
    0x7f6fce38 00000000         sll     $0,$0,0x0
# 0x4728f4: !BEGINBASICBLOCK 0x7f6fce3c (0 bytes):
# 0x4728f8: push-stack/%r0 (12 bytes):
    0x7f6fce3c afd40004         sw      $20,4($30)
    0x7f6fce40 27de0004         addiu   $30,$30,4
    0x7f6fce44 02c0a025         or      $20,$22,$0
# 0x4728f8: push-stack/%r1 (12 bytes):
    0x7f6fce48 afd40004         sw      $20,4($30)
    0x7f6fce4c 27de0004         addiu   $30,$30,4
    0x7f6fce50 02a0a025         or      $20,$21,$0
# 0x4728f8: less-stack (12 bytes):
    0x7f6fce54 8fc20000         lw      $2,0($30)
    0x7f6fce58 27defffc         addiu   $30,$30,-4
    0x7f6fce5c 0054a02a         slt     $20,$2,$20
# 0x4728f8: bf-stack/fR 0x472900 (20 bytes):
    0x7f6fce60 27defffc         addiu   $30,$30,-4
    0x7f6fce64 02801025         or      $2,$20,$0
    0x7f6fce68 8fd40004         lw      $20,4($30)
    0x7f6fce6c 10400010         beqz    $2,0x7f6fceb0
    0x7f6fce70 00000000         sll     $0,$0,0x0
# 0x4728fc: push-stack/%r1 (12 bytes):
    0x7f6fce74 afd40004         sw      $20,4($30)
    0x7f6fce78 27de0004         addiu   $30,$30,4
    0x7f6fce7c 02a0a025         or      $20,$21,$0
# 0x4728fc: push-stack/%r0 (12 bytes):
    0x7f6fce80 afd40004         sw      $20,4($30)
    0x7f6fce84 27de0004         addiu   $30,$30,4
    0x7f6fce88 02c0a025         or      $20,$22,$0
# 0x4728fc: minus-stack (12 bytes):
    0x7f6fce8c 8fc20000         lw      $2,0($30)
    0x7f6fce90 27defffc         addiu   $30,$30,-4
    0x7f6fce94 0054a023         subu    $20,$2,$20
# 0x4728fc: pop-stack/%r1 (16 bytes):
    0x7f6fce98 8fc20000         lw      $2,0($30)
    0x7f6fce9c 0280a825         or      $21,$20,$0
    0x7f6fcea0 27defffc         addiu   $30,$30,-4
    0x7f6fcea4 0040a025         or      $20,$2,$0
# 0x4728fc: b/fR 0x472904 (8 bytes):
    0x7f6fcea8 0bdbf3b9         j       0x7f6fcee4
    0x7f6fceac 00000000         sll     $0,$0,0x0
# 0x472900: !BEGINBASICBLOCK 0x7f6fceb0 (0 bytes):
# 0x472904: push-stack/%r0 (12 bytes):
    0x7f6fceb0 afd40004         sw      $20,4($30)
    0x7f6fceb4 27de0004         addiu   $30,$30,4
    0x7f6fceb8 02c0a025         or      $20,$22,$0
# 0x472904: push-stack/%r1 (12 bytes):
    0x7f6fcebc afd40004         sw      $20,4($30)
    0x7f6fcec0 27de0004         addiu   $30,$30,4
    0x7f6fcec4 02a0a025         or      $20,$21,$0
# 0x472904: minus-stack (12 bytes):
    0x7f6fcec8 8fc20000         lw      $2,0($30)
    0x7f6fcecc 27defffc         addiu   $30,$30,-4
    0x7f6fced0 0054a023         subu    $20,$2,$20
# 0x472904: pop-stack/%r0 (16 bytes):
    0x7f6fced4 8fc20000         lw      $2,0($30)
    0x7f6fced8 0280b025         or      $22,$20,$0
    0x7f6fcedc 27defffc         addiu   $30,$30,-4
    0x7f6fcee0 0040a025         or      $20,$2,$0
# 0x472904: !BEGINBASICBLOCK 0x7f6fcee4 (0 bytes):
# 0x472908: push-stack/%r0 (12 bytes):
    0x7f6fcee4 afd40004         sw      $20,4($30)
    0x7f6fcee8 27de0004         addiu   $30,$30,4
    0x7f6fceec 02c0a025         or      $20,$22,$0
# 0x472908: push-stack/%r1 (12 bytes):
    0x7f6fcef0 afd40004         sw      $20,4($30)
    0x7f6fcef4 27de0004         addiu   $30,$30,4
    0x7f6fcef8 02a0a025         or      $20,$21,$0
# 0x472908: different-stack (16 bytes):
    0x7f6fcefc 8fc20000         lw      $2,0($30)
    0x7f6fcf00 27defffc         addiu   $30,$30,-4
    0x7f6fcf04 00544826         xor     $9,$2,$20
    0x7f6fcf08 0009a02b         sltu    $20,$0,$9
# 0x472908: bt-stack/fR 0x4728f4 (20 bytes):
    0x7f6fcf0c 27defffc         addiu   $30,$30,-4
    0x7f6fcf10 02801025         or      $2,$20,$0
    0x7f6fcf14 8fd40004         lw      $20,4($30)
    0x7f6fcf18 1440ffc8         bnez    $2,0x7f6fce3c
    0x7f6fcf1c 00000000         sll     $0,$0,0x0
# 0x47290c: b/fR 0x472910 (8 bytes):
    0x7f6fcf20 0bdbf3ca         j       0x7f6fcf28
    0x7f6fcf24 00000000         sll     $0,$0,0x0
# 0x472910: !BEGINBASICBLOCK 0x7f6fcf28 (0 bytes):
# 0x472914: push-stack/%r0 (12 bytes):
    0x7f6fcf28 afd40004         sw      $20,4($30)
    0x7f6fcf2c 27de0004         addiu   $30,$30,4
    0x7f6fcf30 02c0a025         or      $20,$22,$0
# 0x472914: print-stack/retR 0x7f6fcf48 (20 bytes):
    0x7f6fcf34 3c117f6f         lui     $17,0x7f6f
    0x7f6fcf38 3631cf48         ori     $17,$17,0xcf48
    0x7f6fcf3c 8fa2047c         lw      $2,1148($29)
    0x7f6fcf40 00400008         jr      $2
    0x7f6fcf44 00000000         sll     $0,$0,0x0
# 0x472918: exitvm (12 bytes):
    0x7f6fcf48 8fa204a8         lw      $2,1192($29)
    0x7f6fcf4c 00400008         jr      $2
    0x7f6fcf50 00000000         sll     $0,$0,0x0

[ Here the TOS register is $20, and the under-top pointer $30 (named $fp, like in the output from --print-locations above). ]

[ TOS optimisation is easier to appreciate on MIPS where minus-stack looks like a good example of a stack operation consuming two elements and producing one: a lw instruction loads the under-top stack elements, addiu decrements the under-top-pointer, and subu does the actual work, using the top register as both one the sources and the destination.
Here less-stack follows the same pattern of minus-stack. ]

[ again suboptimal use of delay slots, for example at the end of bf-stack/fR, where it would have been correct to move lw $20,4($30) past the branch, replacing the nop sll $0,$0,0x0. This problem is easier to observe in stack code than register code, due to conditionals having side effects on the stack. ]

Minimal threading

register code on RISC-V, minimal threading [2021 update: this is still a good example for minimal-threading but now that Jitter supports no-threading on RISC-V it is no longer the best that can be achieved on this architecture]:

# 0x4d740: !BEGINBASICBLOCK 0x40009fe6e0 (2 bytes):
    0x00000040009fe6e0 0421                     c.addi  x8,8
# 0x4d748: mov/nR/%r0 0x4f790d55 (6 bytes):
    0x00000040009fe6e2 00043903                 ld      x18,0(x8)
    0x00000040009fe6e6 0421                     c.addi  x8,8
# 0x4d750: mov/n1/%r1 (2 bytes):
    0x00000040009fe6e8 4985               x19,1
# 0x4d750: be/%r0/%r1/fR 0x4d790 (12 bytes):
    0x00000040009fe6ea 01391563                 bne     x18,x19,0x00000040009fe6f4
    0x00000040009fe6ee 6000                     c.ld    x8,0(x8)
    0x00000040009fe6f0 601c                     c.ld    x15,0(x8)
    0x00000040009fe6f2 8782                     c.jr    x15
    0x00000040009fe6f4 0421                     c.addi  x8,8
# 0x4d758: !BEGINBASICBLOCK 0x40009fe6f6 (2 bytes):
    0x00000040009fe6f6 0421                     c.addi  x8,8
# 0x4d760: bge/%r0/%r1/fR 0x4d770 (12 bytes):
    0x00000040009fe6f8 01394563                 blt     x18,x19,0x00000040009fe702
    0x00000040009fe6fc 6000                     c.ld    x8,0(x8)
    0x00000040009fe6fe 601c                     c.ld    x15,0(x8)
    0x00000040009fe700 8782                     c.jr    x15
    0x00000040009fe702 0421                     c.addi  x8,8
# 0x4d768: minus/%r1/%r0/%r1 (4 bytes):
    0x00000040009fe704 412989b3                 sub     x19,x19,x18
# 0x4d768: b/fR 0x4d778 (6 bytes):
    0x00000040009fe708 6000                     c.ld    x8,0(x8)
    0x00000040009fe70a 601c                     c.ld    x15,0(x8)
    0x00000040009fe70c 8782                     c.jr    x15
# 0x4d770: !BEGINBASICBLOCK 0x40009fe70e (2 bytes):
    0x00000040009fe70e 0421                     c.addi  x8,8
# 0x4d778: minus/%r0/%r1/%r0 (4 bytes):
    0x00000040009fe710 41390933                 sub     x18,x18,x19
# 0x4d778: !BEGINBASICBLOCK 0x40009fe714 (2 bytes):
    0x00000040009fe714 0421                     c.addi  x8,8
# 0x4d780: bne/%r0/%r1/fR 0x4d758 (12 bytes):
    0x00000040009fe716 01390563                 beq     x18,x19,0x00000040009fe720
    0x00000040009fe71a 6000                     c.ld    x8,0(x8)
    0x00000040009fe71c 601c                     c.ld    x15,0(x8)
    0x00000040009fe71e 8782                     c.jr    x15
    0x00000040009fe720 0421                     c.addi  x8,8
# 0x4d788: b/fR 0x4d790 (6 bytes):
    0x00000040009fe722 6000                     c.ld    x8,0(x8)
    0x00000040009fe724 601c                     c.ld    x15,0(x8)
    0x00000040009fe726 8782                     c.jr    x15
# 0x4d790: !BEGINBASICBLOCK 0x40009fe728 (2 bytes):
    0x00000040009fe728 0421                     c.addi  x8,8
# 0x4d798: print/%r0/retR 0x40009fe730 (6 bytes):
    0x00000040009fe72a 748bb783                 ld      x15,1864(x23)
    0x00000040009fe72e 8782                     c.jr    x15
# 0x4d7a0: exitvm (6 bytes):
    0x00000040009fe730 718bb783                 ld      x15,1816(x23)
    0x00000040009fe734 8782                     c.jr    x15

[ ]

[ !BEGINBASICBLOCK ] [ Branches more expensive: load-load-branch. See for example b/fR. x8 thread pointer. Address where to branch loaded into a temporary, here always x15, branch via register. Residuals are loaded from memory, at a known constant offset from the thread pointer x8. The thread pointer needs to be incremented when using residuals, so that it points to the next residual in a different VM instruction. ]

Direct threading

register code on arm, direct-threading:

# 0x5cd68: mov/nR/%r0 0x292c0, 0x4f790d55 (16 bytes):
    0x000292c0 e5941008         ldr     r1, [r4, #8]
    0x000292c4 e2844008         add     r4, r4, #8
    0x000292c8 e5145004         ldr     r5, [r4, #-4]
    0x000292cc e1a0f001         mov     r15, r1
# 0x5cd70: mov/n1/%r1 0x291f8 (16 bytes):
    0x000291f8 e5941004         ldr     r1, [r4, #4]
    0x000291fc e3a06001         mov     r6, #1
    0x00029200 e2844004         add     r4, r4, #4
    0x00029204 e1a0f001         mov     r15, r1
# 0x5cd74: be/%r0/%r1/fR 0x25718, 0x5cda4 (28 bytes):
    0x00025718 e1550006         cmp     r5, r6
    0x0002571c 05944004         ldreq   r4, [r4, #4]
    0x00025720 05941000         ldreq   r1, [r4]
    0x00025724 01a0f001         moveq   r15, r1
    0x00025728 e5941008         ldr     r1, [r4, #8]
    0x0002572c e2844008         add     r4, r4, #8
    0x00025730 e1a0f001         mov     r15, r1
# 0x5cd7c: bge/%r0/%r1/fR 0x260d4, 0x5cd90 (28 bytes):
    0x000260d4 e1560005         cmp     r6, r5
    0x000260d8 d5944004         ldrle   r4, [r4, #4]
    0x000260dc d5941000         ldrle   r1, [r4]
    0x000260e0 d1a0f001         movle   r15, r1
    0x000260e4 e5941008         ldr     r1, [r4, #8]
    0x000260e8 e2844008         add     r4, r4, #8
    0x000260ec e1a0f001         mov     r15, r1
# 0x5cd84: minus/%r1/%r0/%r1 0x28640 (16 bytes):
    0x00028640 e5941004         ldr     r1, [r4, #4]
    0x00028644 e0466005         sub     r6, r6, r5
    0x00028648 e2844004         add     r4, r4, #4
    0x0002864c e1a0f001         mov     r15, r1
# 0x5cd88: b/fR 0x25700, 0x5cd94 (12 bytes):
    0x00025700 e5944004         ldr     r4, [r4, #4]
    0x00025704 e5941000         ldr     r1, [r4]
    0x00025708 e1a0f001         mov     r15, r1
# 0x5cd90: minus/%r0/%r1/%r0 0x28450 (16 bytes):
    0x00028450 e5941004         ldr     r1, [r4, #4]
    0x00028454 e0455006         sub     r5, r5, r6
    0x00028458 e2844004         add     r4, r4, #4
    0x0002845c e1a0f001         mov     r15, r1
# 0x5cd94: bne/%r0/%r1/fR 0x26df8, 0x5cd7c (28 bytes):
    0x00026df8 e1550006         cmp     r5, r6
    0x00026dfc 15944004         ldrne   r4, [r4, #4]
    0x00026e00 15941000         ldrne   r1, [r4]
    0x00026e04 11a0f001         movne   r15, r1
    0x00026e08 e5941008         ldr     r1, [r4, #8]
    0x00026e0c e2844008         add     r4, r4, #8
    0x00026e10 e1a0f001         mov     r15, r1
# 0x5cd9c: b/fR 0x25700, 0x5cda4 (12 bytes):
    0x00025700 e5944004         ldr     r4, [r4, #4]
    0x00025704 e5941000         ldr     r1, [r4]
    0x00025708 e1a0f001         mov     r15, r1
# 0x5cda4: print/%r0/retR 0x29fa8, 0xffffffff (24 bytes):
    0x00029fa8 e1a01005         mov     r1, r5
    0x00029fac e59d0004         ldr     r0, [r13, #4]
    0x00029fb0 ebff9b09         bl      0x00010bdc
    0x00029fb4 e5941008         ldr     r1, [r4, #8]
    0x00029fb8 e2844008         add     r4, r4, #8
    0x00029fbc e1a0f001         mov     r15, r1
# 0x5cdac: exitvm 0x282bc (4 bytes):
    0x000282bc ea001913         b       0x0002e710

[ ]

 0.           thread pointer: r4           (register)
 1.                     base: r6           (register)
 2.            link register: [sp, #20]    (memory)
 3.            mainstack top: r10          (register)
 4.   mainstack undertop ptr: fp           (register)
 5.                      %r0: r7           (register)
 6.                      %r1: r8           (register)
 7.                      %r2: r9           (register)

[ Similar to minimal threading, but here every VM instruction ends with the load-load-branch sequence. VM instruction addresses not consecutive. There is no !BEGINBASICBLOCK so VM branches are less expensive than on minimal threading, but VM instruction fallthru is as expensive as a branch. In this particular example, featuring very short basic blocks, direct threading performs better than usual compared to minimal threading. ]

[comment. The typical direct-threading "load-load-branch" sequence.]

Here GCC has chosen to use r4 as the thread pointer, and r1 as a temporary holding the address where to branch.
On ARM the hardware program counter is accessible as the general-purpose register r15; any instruction writing into r15, for example mov r15, r1, is therefore a branch. This pattern is common in threaded code (direct threading and minimal threading), to express an indirect jump via register.

switch dispatching

[some generic comment: portability, non-GCC. Behavior will be identical, performance will be worse.] [specialisation remains, so with Jitter the user still gains fast registers and fast literals compared to a naïve solution, even with switch dispatch.]

Not shown here

[procedure calls, overflow handling, conditional fast branching on data tags, predefined macros for more sophisticated stack operations; multiple stacks; user-defined runtime data structures]



[comparison between stack and registers not entirely fair in this case; the register-based code generator is more sophisticated, and the stack version could benefit from more rewrite rules] Same program above translated into C.

[GCC 9 snapshot from 2018-12-30, on every architecture. -march=haswell -march=loongson2f -march=mips32 -static]

#include <stdio.h>

main (void)
  long a = 1333333333;
  long b = 1;

  while (a != b)
    if (a < b)
      b = b - a;
      a = a - b;

  printf ("%li\n", a);

  return 0;

[ I have verified that in this case GCC does not exploit the fact that the value of a at the end is actually a known constant. ]

Intel i7-4700MQ (Haswell) variable frequency up to 3400MHz:

User times on Haswell (very complex out-of-order superscalar), seconds, other cores idle.
Jittery VM, stack
fast registers, fast literals no fast registers, no fast literals
switch 24.28 24.36
direct threading 12.33 13.60
minimal threading 6.54 9.54
no threading 6.37 6.79
Jittery VM, registers
fast registers, fast literals no fast registers, no fast literals
switch 5.18 6.40
direct threading 3.99 5.96
minimal threading 5.18 6.38
no threading 0.80 3.22
C, x86_64-unknown-linux-gnu-gcc -march=haswell
-O2 0.40
-O3 0.83


Lemote Yeeloong 8089_B (Loongson2F 800MHz MIPS CPU):

User times on Loongson2F (simpler out-of-order superscalar), seconds.
Jittery VM, stack
fast registers, fast literals no fast registers, no fast literals
switch 553.56 452.62
direct threading 134.70 151.57
minimal threading 112.81 183.06
no threading 104.38 190.23
Jittery VM, registers
fast registers, fast literals no fast registers, no fast literals
switch 128.24 138.89
direct threading 21.88 85.35
minimal threading 23.57 82.35
no threading 10.10 63.97
C, mipsel-unknown-linux-gnu-gcc -march=loongson2f
-O2 6.18
-O3 10.10

Ben Nanonote (JZ4720 333MHz MIPS CPU, non-superscalar pipeline):

User times on JZ4720 (very simple non-superscalar pipeline), seconds.
Jittery VM, stack
fast registers, fast literals no fast registers, no fast literals
switch 1165.38 1267.46
direct threading 494.86 694.23
minimal threading 269.78 490.75
no threading 200.29 302.53
Jittery VM, registers
fast registers, fast literals no fast registers, no fast literals
switch 283.53 437.60
direct threading 126.70 278.14
minimal threading 118.52 265.77
no threading 24.54 143.33
C, mipsel-unknown-linux-uclibc-gcc -march=mips32
-O2 20.44
-O3 20.44

[no-threading is the clear winner, on any supported architecture and micro-architecture]


I am currently testing the following configurations, in some cases with multiple GCC versions, via cross-toolchains and emulation:

Right now Jitter includes enough assembly support to enable no-threading dispatch on x86_64, M68k, 32-bit MIPS (both traditional and r6), 32-bit PowerPC, RISC-V (32- and 64-bit) SPARC (32- and 64-bit), and SH. I mean to extend the same level of support to all of the architectures in the triplets above, with the exception of SH (see the information about SH below.

Minimal-threading support, which is extremely easy to add, currently exists for Aarch64, Alpha, ARM, PA-RISC, S390x.
Support for more architectures will come.

Adding support for a new architecture requires relatively little effort, and the internals are being worked on to make porting even easier. [somewhere: Jitter's source code is quite well factored and contains very little assembly, all confined within an architecture-specific subdirectory]

The SH architecture (sh4eb-unknown-linux-gnu, sh4-unknown-linux-gnu) presents special challenges with the most efficient dispatches; I will make an effort to support it, but that may not be entirely feasible in the end. switch and direct threading remain of course supported on SH, and are already very reliable: direct threading simply relies on GNU C, with no special assembly required; switch dispatching is even more portable.

Other architectures, for example 32-bit s390, are probably easy to support as well. I am now concentrating my efforts on the configurations which I can test with conveniently generated cross-toolchains and emulators.


Already usable. An external project actually using it. API not completely stable yet. Future tasks: debugging API, machine-generation of tagging and untagging macros for dynamically-typed languages, a copying garbage collector. Some minor optimisations.

Existing source code mostly clean, and very well commented. A few modules to be rewritten, particularly jitter/jitter-replicate.c. The bug. Manual. Ports (mostly trivial).

GNU Jitter is free software, released under the GNU GPL version 3 or later.

An elaborate Jitter example, JitterLisp, has evolved into an interesting sub-project in its own right.

Downloading the sources

Jitter's source code is currently hosted on my own git server, available to anybody via HTTP. Please contact me if you have some reason to require SSH access.

I am now back to using the master branch on git, with short-lived feature branches.

Other resources

[hacker emblem]
Luca Saiu
Last modified: 2022-01

Copyright © 2017-2022 Luca Saiu
Verbatim copying and redistribution of this entire page are permitted provided this notice is preserved.