[2021-12-17: Jitter has been officially accepted as part of the GNU project! The web page for Jitter is now on gnu.org; however this tutorial is still only hosted here on ageinghacker.net.]
GNU Jitter is a software generating C code for very efficient language virtual machines from a relatively high-level specification describing instructions and runtime structures.
[registers, stacks] [gc, threads; what is not completely automatic yet can already be accomplished by using provided macros]
The generated code is easy to use and very portable. A VM will behave identically on any platform, the only observable difference being performnace.
[dispatches; no assembly in VM-specific or user code]
Jitter includes a relatively extensive test suite currently comprising around 700 test cases, which supports cross-compilation and running test cases under qemu-user. Generated native code can be checked and disassembled even in cross configurations, relying on cross-compiled GNU binutils.
[define VM runtime data structures, VM instructions, optimisation rewrite rules code generation API] [VM programs: executing, printing, parsing, disassembling]
The GNU poke project by José Marchesi relies on a Jittery VM for the efficient execution of Poke programs.
The Jitter distribution itself comes with examples complex enough to require their own documentation and examples, inclduing a Lisp system.
[This is a boring example and not a particularly good application, to make it more accessible. See JitterLisp and this message for something more interesting] example-vms/structured/
Structured language source example-vms/structured/examples/euclid-iterative.structured:
var a = 1333333333, b = 1; while a <> b do if a < b then b := b - a; else a := a - b; end; end; print a;
[ example-vms/structured/structured.jitter Make sure to edit the file structured.jitter and set fast-register-no to 3 or 4 before recompiling if you want to run these tests on your machine. Recompiling will take a while, but code execution will be fast. The file is currently distributed with fast-register-no set to 0 so that it can be compiled quickly. ]
register code example-vms/structured/structured-code-generator-register.c
[ Only one register class named r, holding integer values. Routines using this VM have access to an unlimited number of registers named %r0, %r1, and so on. More complex VMs may define additional register classes, for example to hold floating-point, string or vector values. Dynamically typed languages may use just one register class containing data such as a C struct containing a tag and a union, or some tagged pointer to similar data. For the structured example a single r register class suffices. ]
[what mov is] [ In this VM, by an arbitrary convention, I chose to write the destination operand of an instruction on the right. Therefore in a VM routine the instruction mov %r1, %r2 will cause the value held by the register %r1 to be copied into the register %r2. The VM instruction mov 7, %r3 would set or initialise the register %r3 to the value 7. ]
instruction mov (?Rn 0 1 -1 2 structured_literal_printer, !R) code JITTER_ARGN1 = JITTER_ARGN0; end end
[ mov R (r) n ? ! ?! 0 1 -1 2 structured_literal_printer code end JITTER_ARGN0 JITTER_ARGN1 N JITTER_ARGN1 = JITTER_ARGN0; ]
instruction minus (?Rn structured_literal_printer, ?Rn 1 2 structured_literal_printer, !R) code JITTER_ARGN2 = JITTER_ARGN0 - JITTER_ARGN1; end end
instruction b (?f) code JITTER_BRANCH_FAST (JITTER_ARGF0); end end
[ b f JITTER_ARGF0 JITTER_BRANCH_FAST ]
instruction bge (?Rn 0 structured_literal_printer, ?Rn 0 structured_literal_printer, ?f) code JITTER_BRANCH_FAST_IF_NOTLESS_SIGNED (JITTER_ARGN0, JITTER_ARGN1, JITTER_ARGF2); end end
[ JITTER_BRANCH_FAST_IF_NOTLESS_SIGNED JITTER_BRANCH_FAST within an ordinary C conditional such as if (JITTER_ARGN0 >= JITTER_ARGN1) ]
[...]
instruction be (?Rn 0 structured_literal_printer, ?Rn 0 structured_literal_printer, ?f) code JITTER_BRANCH_FAST_IF_EQUAL (JITTER_ARGN0, JITTER_ARGN1, JITTER_ARGF2); end end
[ ]
[ structured --print ]
mov 1333333333, %r0 mov 1, %r1 be %r0, %r1, $L9 $L3: bge %r0, %r1, $L6 minus %r1, %r0, %r1 b $L7 $L6: minus %r0, %r1, %r0 $L7: bne %r0, %r1, $L3 b $L9 $L9: print %r0 exitvm
[ This routine only uses two VM registers, %r0 and %r1. [if ...] ]
[third-to-last instruction, b $L9 unneeded]
[ register code, x86_64, no-threading structured --disassemble ]
# 0x466d40: !BEGINBASICBLOCK 0x7ffff7d7e6e0 (0 bytes): # 0x466d48: mov/nR/%r0 0x4f790d55 (9 bytes): 0x00007ffff7d7e6e0 41 bc 55 0d 79 4f movl $0x4f790d55,%r12d 0x00007ffff7d7e6e6 4d 89 e5 movq %r12,%r13 # 0x466d50: mov/n1/%r1 (6 bytes): 0x00007ffff7d7e6e9 41 b8 01 00 00 00 movl $0x1,%r8d # 0x466d50: be/%r0/%r1/fR 0x466d90 (9 bytes): 0x00007ffff7d7e6ef 4d 39 c5 cmpq %r8,%r13 0x00007ffff7d7e6f2 0f 84 22 00 00 00 je 0x00007ffff7d7e71a # 0x466d58: !BEGINBASICBLOCK 0x7ffff7d7e6f8 (0 bytes): # 0x466d60: bge/%r0/%r1/fR 0x466d70 (9 bytes): 0x00007ffff7d7e6f8 4d 39 c5 cmpq %r8,%r13 0x00007ffff7d7e6fb 0f 8d 08 00 00 00 jge 0x00007ffff7d7e709 # 0x466d68: minus/%r1/%r0/%r1 (3 bytes): 0x00007ffff7d7e701 4d 29 e8 subq %r13,%r8 # 0x466d68: b/fR 0x466d78 (5 bytes): 0x00007ffff7d7e704 e9 03 00 00 00 jmpq 0x00007ffff7d7e70c # 0x466d70: !BEGINBASICBLOCK 0x7ffff7d7e709 (0 bytes): # 0x466d78: minus/%r0/%r1/%r0 (3 bytes): 0x00007ffff7d7e709 4d 29 c5 subq %r8,%r13 # 0x466d78: !BEGINBASICBLOCK 0x7ffff7d7e70c (0 bytes): # 0x466d80: bne/%r0/%r1/fR 0x466d58 (9 bytes): 0x00007ffff7d7e70c 4d 39 c5 cmpq %r8,%r13 0x00007ffff7d7e70f 0f 85 e3 ff ff ff jne 0x00007ffff7d7e6f8 # 0x466d88: b/fR 0x466d90 (5 bytes): 0x00007ffff7d7e715 e9 00 00 00 00 jmpq 0x00007ffff7d7e71a # 0x466d90: !BEGINBASICBLOCK 0x7ffff7d7e71a (0 bytes): # 0x466d98: print/%r0/retR 0x7ffff7d7e72b (17 bytes): 0x00007ffff7d7e71a 49 bc 2b e7 d7 f7 ff 7f 00 00 movabsq $0x7ffff7d7e72b,%r12 0x00007ffff7d7e724 ff a4 24 88 00 00 00 jmpq *0x88(%rsp) # 0x466da0: exitvm (7 bytes): 0x00007ffff7d7e72b 48 8b 44 24 58 movq 0x58(%rsp),%rax 0x00007ffff7d7e730 ff e0 jmpq *%rax
[ Lines starting with # are comments added by Jitter's disassembly function to delimit specialised VM instructions. ]
[ The two three-argument VM instructions minus %r1, %r0, %r1 and minus %r0, %r1, %r0 have turned into two different specialised instructions, with no arguments: minus/%r1/%r0/%r1 and minus/%r0/%r1/%r0: each consists of a single hardware instruction subq, with different register operands. [fast!] ] [ If minus had been used with some other arguments, then the specialised instructions would have been different, and in some case residual arguments would have remained in the specialised version: [ mov/nR/%r0 vs mov/n1/%r1 ] ]
The hardware instructions for specialised VM instructions come at consecutive addresses in memory: for example the first b instruction starts (in this run) at the memory address 0x00007ffff7d7e704, three bytes after the beginning of the minus/%r1/%r0/%r1 right before it at 0x00007ffff7d7e701, which takes three bytes. Specialised instructions are replicated: the second b below starts at a different memory address, 0x00007ffff7d7e715, and does not share code with the first one. This means that, in the absence of VM-level branches, control can simply fall thru from one VM instruction to the next, sparing the cost of any hardware-level branching. [the name: no-threading: trivial dispatch, or no dispatch cost]
[
VM register %r0 mapped to the hardware register %r13.
VM register %r1 mapped to the hardware register %r8.
Chosen by GCC.
If there are not enough available hardware registers, some fast registers will be mapped to
memory at a statically known offset from a register.
Slow registers will be mapped to memory at an offset from the base to be computed
at run time.
This information, while not difficult to deduce from the disassembly, can also be
obtained directly from a Jitter function by calling the structured
program with the command-line option
--print-locations:
]
0. base: %rbx (register) 1. residual 0: %r12 (register) 2. mainstack top: %rbp (register) 3. mainstack undertop ptr: %r14 (register) 4. %r0: %r13 (register) 5. %r1: %r8 (register) 6. %r2: %r9 (register)
[ Look at how fast branches are compiled ]
[ mov/nR/%r0 not quite efficient as mov/n1/%r1, despite both being specialisations of the same VM instruction mov. 1333333333 not a common constant in programs, differently from 1. When translating the VM program 1333333333 is residualised, where here 1 has been translated as a fast literal, due to the definition of the VM instruction mov, above. Residualised instruction arguments must be materialized before use with no-threading dispatch, usually into a hardware register -- in this case %r12. ]
stack vm code example-vms/structured/structured-code-generator-stack.c
[not the actual definition, but this generates the same code thanks to GCC's cleverness]
instruction push-stack (?Rnl 0 1 -1 2 structured_literal_printer) code JITTER_PUSH_MAINSTACK (JITTER_ARGN0); end end
[]
instruction pop-stack (!R) code jitter_int top = JITTER_TOP_MAINSTACK (); JITTER_DROP_MAINSTACK (); JITTER_ARGN0 = top; end end
[]
instruction minus-stack () code jitter_int top = JITTER_TOP_MAINSTACK (); jitter_int under_top = JITTER_UNDER_TOP_MAINSTACK (); JITTER_DROP_MAINSTACK (); JITTER_DROP_MAINSTACK (); JITTER_PUSH_MAINSTACK (under_top - top); end end
[less-stack implemented as an ordinary stack operation like minus-stack]
instruction bt-stack (?f) code jitter_int top = JITTER_TOP_MAINSTACK (); JITTER_DROP_MAINSTACK (); JITTER_BRANCH_FAST_IF_NONZERO (top, JITTER_ARGF0); end end
instruction bf-stack (?f) code jitter_int top = JITTER_TOP_MAINSTACK (); JITTER_DROP_MAINSTACK (); JITTER_BRANCH_FAST_IF_ZERO (top, JITTER_ARGF0); end end
[ ]
mov 1333333333, %r0 mov 1, %r1 push-stack %r0 push-stack %r1 different-stack bf-stack $L24 $L6: push-stack %r0 push-stack %r1 less-stack bf-stack $L15 push-stack %r1 push-stack %r0 minus-stack pop-stack %r1 b $L19 $L15: push-stack %r0 push-stack %r1 minus-stack pop-stack %r0 $L19: push-stack %r0 push-stack %r1 different-stack bt-stack $L6 b $L24 $L24: push-stack %r0 print-stack exitvm
[...the eagle-eyed reader will have noticed how... mov...]
[...rewrite rule (named push-pop--movn)... ]:
rule push-pop--movn rewrite push-stack n $a; pop-stack R $b into mov $a, $b end
[ other optimisation opportunities remain: for example it would be easy to rewrite different-stack followed by bf-stack $L into a new single instruction consuming two elements off the stack to branch when they are equal. I recommend this optimisation as a beginner exercise; other rewrites are possible, which would turn some common stack instructions patterns into code similar to the register VM case. ]
stack code, disassembled, no-threading, x86_64
# 0x466d40: !BEGINBASICBLOCK 0x7ffff7d7cde0 (0 bytes): # 0x466d48: mov/nR/%r0 0x4f790d55 (9 bytes): 0x00007ffff7d7cde0 41 bc 55 0d 79 4f movl $0x4f790d55,%r12d 0x00007ffff7d7cde6 4d 89 e5 movq %r12,%r13 # 0x466d50: mov/n1/%r1 (6 bytes): 0x00007ffff7d7cde9 41 b8 01 00 00 00 movl $0x1,%r8d # 0x466d50: push-stack/%r0 (11 bytes): 0x00007ffff7d7cdef 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7cdf3 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7cdf7 4c 89 ed movq %r13,%rbp # 0x466d50: push-stack/%r1 (11 bytes): 0x00007ffff7d7cdfa 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7cdfe 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ce02 4c 89 c5 movq %r8,%rbp # 0x466d50: different-stack (15 bytes): 0x00007ffff7d7ce05 49 39 2e cmpq %rbp,(%r14) 0x00007ffff7d7ce08 40 0f 95 c5 setne %bpl 0x00007ffff7d7ce0c 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce10 40 0f b6 ed movzbl %bpl,%ebp # 0x466d50: bf-stack/fR 0x466d90 (21 bytes): 0x00007ffff7d7ce14 48 89 e8 movq %rbp,%rax 0x00007ffff7d7ce17 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce1b 49 8b 6e 08 movq 0x8(%r14),%rbp 0x00007ffff7d7ce1f 48 83 f8 00 cmpq $0x0,%rax 0x00007ffff7d7ce23 0f 84 de 00 00 00 je 0x00007ffff7d7cf07 # 0x466d58: !BEGINBASICBLOCK 0x7ffff7d7ce29 (0 bytes): # 0x466d60: push-stack/%r0 (11 bytes): 0x00007ffff7d7ce29 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ce2d 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ce31 4c 89 ed movq %r13,%rbp # 0x466d60: push-stack/%r1 (11 bytes): 0x00007ffff7d7ce34 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ce38 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ce3c 4c 89 c5 movq %r8,%rbp # 0x466d60: less-stack (15 bytes): 0x00007ffff7d7ce3f 49 39 2e cmpq %rbp,(%r14) 0x00007ffff7d7ce42 40 0f 9c c5 setl %bpl 0x00007ffff7d7ce46 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce4a 40 0f b6 ed movzbl %bpl,%ebp # 0x466d60: bf-stack/fR 0x466d70 (21 bytes): 0x00007ffff7d7ce4e 48 89 e8 movq %rbp,%rax 0x00007ffff7d7ce51 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce55 49 8b 6e 08 movq 0x8(%r14),%rbp 0x00007ffff7d7ce59 48 83 f8 00 cmpq $0x0,%rax 0x00007ffff7d7ce5d 0f 84 35 00 00 00 je 0x00007ffff7d7ce98 # 0x466d68: push-stack/%r1 (11 bytes): 0x00007ffff7d7ce63 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ce67 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ce6b 4c 89 c5 movq %r8,%rbp # 0x466d68: push-stack/%r0 (11 bytes): 0x00007ffff7d7ce6e 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ce72 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ce76 4c 89 ed movq %r13,%rbp # 0x466d68: minus-stack (13 bytes): 0x00007ffff7d7ce79 49 8b 06 movq (%r14),%rax 0x00007ffff7d7ce7c 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce80 48 29 e8 subq %rbp,%rax 0x00007ffff7d7ce83 48 89 c5 movq %rax,%rbp # 0x466d68: pop-stack/%r1 (13 bytes): 0x00007ffff7d7ce86 49 8b 06 movq (%r14),%rax 0x00007ffff7d7ce89 49 89 e8 movq %rbp,%r8 0x00007ffff7d7ce8c 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ce90 48 89 c5 movq %rax,%rbp # 0x466d68: b/fR 0x466d78 (5 bytes): 0x00007ffff7d7ce93 e9 30 00 00 00 jmpq 0x00007ffff7d7cec8 # 0x466d70: !BEGINBASICBLOCK 0x7ffff7d7ce98 (0 bytes): # 0x466d78: push-stack/%r0 (11 bytes): 0x00007ffff7d7ce98 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ce9c 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7cea0 4c 89 ed movq %r13,%rbp # 0x466d78: push-stack/%r1 (11 bytes): 0x00007ffff7d7cea3 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7cea7 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ceab 4c 89 c5 movq %r8,%rbp # 0x466d78: minus-stack (13 bytes): 0x00007ffff7d7ceae 49 8b 06 movq (%r14),%rax 0x00007ffff7d7ceb1 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7ceb5 48 29 e8 subq %rbp,%rax 0x00007ffff7d7ceb8 48 89 c5 movq %rax,%rbp # 0x466d78: pop-stack/%r0 (13 bytes): 0x00007ffff7d7cebb 49 8b 06 movq (%r14),%rax 0x00007ffff7d7cebe 49 89 ed movq %rbp,%r13 0x00007ffff7d7cec1 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7cec5 48 89 c5 movq %rax,%rbp # 0x466d78: !BEGINBASICBLOCK 0x7ffff7d7cec8 (0 bytes): # 0x466d80: push-stack/%r0 (11 bytes): 0x00007ffff7d7cec8 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7cecc 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7ced0 4c 89 ed movq %r13,%rbp # 0x466d80: push-stack/%r1 (11 bytes): 0x00007ffff7d7ced3 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7ced7 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7cedb 4c 89 c5 movq %r8,%rbp # 0x466d80: different-stack (15 bytes): 0x00007ffff7d7cede 49 39 2e cmpq %rbp,(%r14) 0x00007ffff7d7cee1 40 0f 95 c5 setne %bpl 0x00007ffff7d7cee5 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7cee9 40 0f b6 ed movzbl %bpl,%ebp # 0x466d80: bt-stack/fR 0x466d58 (21 bytes): 0x00007ffff7d7ceed 48 89 e8 movq %rbp,%rax 0x00007ffff7d7cef0 49 83 ee 08 subq $0x8,%r14 0x00007ffff7d7cef4 49 8b 6e 08 movq 0x8(%r14),%rbp 0x00007ffff7d7cef8 48 83 f8 00 cmpq $0x0,%rax 0x00007ffff7d7cefc 0f 85 27 ff ff ff jne 0x00007ffff7d7ce29 # 0x466d88: b/fR 0x466d90 (5 bytes): 0x00007ffff7d7cf02 e9 00 00 00 00 jmpq 0x00007ffff7d7cf07 # 0x466d90: !BEGINBASICBLOCK 0x7ffff7d7cf07 (0 bytes): # 0x466d98: push-stack/%r0 (11 bytes): 0x00007ffff7d7cf07 49 89 6e 08 movq %rbp,0x8(%r14) 0x00007ffff7d7cf0b 49 83 c6 08 addq $0x8,%r14 0x00007ffff7d7cf0f 4c 89 ed movq %r13,%rbp # 0x466d98: print-stack/retR 0x7ffff7d7cf23 (17 bytes): 0x00007ffff7d7cf12 49 bc 23 cf d7 f7 ff 7f 00 00 movabsq $0x7ffff7d7cf23,%r12 0x00007ffff7d7cf1c ff a4 24 b0 00 00 00 jmpq *0xb0(%rsp) # 0x466da0: exitvm (7 bytes): 0x00007ffff7d7cf23 48 8b 44 24 58 movq 0x58(%rsp),%rax 0x00007ffff7d7cf28 ff e0 jmpq *%rax
[TOS optimisation: %rbp is the TOS, %r14 is the under-top pointer. The benefit is clear in specialised instructions such as push-stack/%r0. ]
[ The specialised VM instruction minus-stack looks clumsy on x86_64 because of the limitation on the number and the order of the operands in the hardware instruction subq. ]
[again mov/nR/%r0 less efficient than mov/n1/%r1, for the same reason as the specialisations of mov in the previous case.]
register code on mips:
# 0x4728e8: !BEGINBASICBLOCK 0x7f6fe6f0 (0 bytes): # 0x4728ec: mov/nR/%r0 0x4f790d55 (12 bytes): 0x7f6fe6f0 3c114f79 lui $17,0x4f79 0x7f6fe6f4 36310d55 ori $17,$17,0xd55 0x7f6fe6f8 0220b025 or $22,$17,$0 # 0x4728f0: mov/n1/%r1 (4 bytes): 0x7f6fe6fc 24150001 addiu $21,$0,1 # 0x4728f0: be/%r0/%r1/fR 0x472910 (8 bytes): 0x7f6fe700 12d5000c beq $22,$21,0x7f6fe734 0x7f6fe704 00000000 sll $0,$0,0x0 # 0x4728f4: !BEGINBASICBLOCK 0x7f6fe708 (0 bytes): # 0x4728f8: bge/%r0/%r1/fR 0x472900 (12 bytes): 0x7f6fe708 02d5102a slt $2,$22,$21 0x7f6fe70c 10400004 beqz $2,0x7f6fe720 0x7f6fe710 00000000 sll $0,$0,0x0 # 0x4728fc: minus/%r1/%r0/%r1 (4 bytes): 0x7f6fe714 02b6a823 subu $21,$21,$22 # 0x4728fc: b/fR 0x472904 (8 bytes): 0x7f6fe718 0bdbf9c9 j 0x7f6fe724 0x7f6fe71c 00000000 sll $0,$0,0x0 # 0x472900: !BEGINBASICBLOCK 0x7f6fe720 (0 bytes): # 0x472904: minus/%r0/%r1/%r0 (4 bytes): 0x7f6fe720 02d5b023 subu $22,$22,$21 # 0x472904: !BEGINBASICBLOCK 0x7f6fe724 (0 bytes): # 0x472908: bne/%r0/%r1/fR 0x4728f4 (8 bytes): 0x7f6fe724 16d5fff8 bne $22,$21,0x7f6fe708 0x7f6fe728 00000000 sll $0,$0,0x0 # 0x47290c: b/fR 0x472910 (8 bytes): 0x7f6fe72c 0bdbf9cd j 0x7f6fe734 0x7f6fe730 00000000 sll $0,$0,0x0 # 0x472910: !BEGINBASICBLOCK 0x7f6fe734 (0 bytes): # 0x472914: print/%r0/retR 0x7f6fe748 (20 bytes): 0x7f6fe734 3c117f6f lui $17,0x7f6f 0x7f6fe738 3631e748 ori $17,$17,0xe748 0x7f6fe73c 8fa20490 lw $2,1168($29) 0x7f6fe740 00400008 jr $2 0x7f6fe744 00000000 sll $0,$0,0x0 # 0x472918: exitvm (12 bytes): 0x7f6fe748 8fa204a8 lw $2,1192($29) 0x7f6fe74c 00400008 jr $2 0x7f6fe750 00000000 sll $0,$0,0x0
[ ]
0. base: $16 (register) 1. scratch: $23 (register) 2. residual 0: $17 (register) 3. residual 1: $18 (register) 4. residual 2: $19 (register) 5. mainstack top: $20 (register) 6. mainstack undertop ptr: $fp (register) 7. %r0: $22 (register) 8. %r1: $21 (register) 9. %r2: $7 (register)
[ suboptimal use of delay slots, for example in the first b/fR: The unconditional jump j comes right after a VM instruction ending with a subu hardware instruction, which could be moved to the delay slot in place of the nop (sll $0, $0, 0x0). ] [ cannot schedule hardware instructions across VM instruction boundaries. SPARC and SH have the same problem. Luckily delay slots are becoming less common in recent architectures, including the latest MIPS. ]
stack code on mips:
# 0x4728e8: !BEGINBASICBLOCK 0x7f6fcdf0 (0 bytes): # 0x4728ec: mov/nR/%r0 0x4f790d55 (12 bytes): 0x7f6fcdf0 3c114f79 lui $17,0x4f79 0x7f6fcdf4 36310d55 ori $17,$17,0xd55 0x7f6fcdf8 0220b025 or $22,$17,$0 # 0x4728f0: mov/n1/%r1 (4 bytes): 0x7f6fcdfc 24150001 addiu $21,$0,1 # 0x4728f0: push-stack/%r0 (12 bytes): 0x7f6fce00 afd40004 sw $20,4($30) 0x7f6fce04 27de0004 addiu $30,$30,4 0x7f6fce08 02c0a025 or $20,$22,$0 # 0x4728f0: push-stack/%r1 (12 bytes): 0x7f6fce0c afd40004 sw $20,4($30) 0x7f6fce10 27de0004 addiu $30,$30,4 0x7f6fce14 02a0a025 or $20,$21,$0 # 0x4728f0: different-stack (16 bytes): 0x7f6fce18 8fc20000 lw $2,0($30) 0x7f6fce1c 27defffc addiu $30,$30,-4 0x7f6fce20 00544826 xor $9,$2,$20 0x7f6fce24 0009a02b sltu $20,$0,$9 # 0x4728f0: bf-stack/fR 0x472910 (20 bytes): 0x7f6fce28 27defffc addiu $30,$30,-4 0x7f6fce2c 02801025 or $2,$20,$0 0x7f6fce30 8fd40004 lw $20,4($30) 0x7f6fce34 1040003c beqz $2,0x7f6fcf28 0x7f6fce38 00000000 sll $0,$0,0x0 # 0x4728f4: !BEGINBASICBLOCK 0x7f6fce3c (0 bytes): # 0x4728f8: push-stack/%r0 (12 bytes): 0x7f6fce3c afd40004 sw $20,4($30) 0x7f6fce40 27de0004 addiu $30,$30,4 0x7f6fce44 02c0a025 or $20,$22,$0 # 0x4728f8: push-stack/%r1 (12 bytes): 0x7f6fce48 afd40004 sw $20,4($30) 0x7f6fce4c 27de0004 addiu $30,$30,4 0x7f6fce50 02a0a025 or $20,$21,$0 # 0x4728f8: less-stack (12 bytes): 0x7f6fce54 8fc20000 lw $2,0($30) 0x7f6fce58 27defffc addiu $30,$30,-4 0x7f6fce5c 0054a02a slt $20,$2,$20 # 0x4728f8: bf-stack/fR 0x472900 (20 bytes): 0x7f6fce60 27defffc addiu $30,$30,-4 0x7f6fce64 02801025 or $2,$20,$0 0x7f6fce68 8fd40004 lw $20,4($30) 0x7f6fce6c 10400010 beqz $2,0x7f6fceb0 0x7f6fce70 00000000 sll $0,$0,0x0 # 0x4728fc: push-stack/%r1 (12 bytes): 0x7f6fce74 afd40004 sw $20,4($30) 0x7f6fce78 27de0004 addiu $30,$30,4 0x7f6fce7c 02a0a025 or $20,$21,$0 # 0x4728fc: push-stack/%r0 (12 bytes): 0x7f6fce80 afd40004 sw $20,4($30) 0x7f6fce84 27de0004 addiu $30,$30,4 0x7f6fce88 02c0a025 or $20,$22,$0 # 0x4728fc: minus-stack (12 bytes): 0x7f6fce8c 8fc20000 lw $2,0($30) 0x7f6fce90 27defffc addiu $30,$30,-4 0x7f6fce94 0054a023 subu $20,$2,$20 # 0x4728fc: pop-stack/%r1 (16 bytes): 0x7f6fce98 8fc20000 lw $2,0($30) 0x7f6fce9c 0280a825 or $21,$20,$0 0x7f6fcea0 27defffc addiu $30,$30,-4 0x7f6fcea4 0040a025 or $20,$2,$0 # 0x4728fc: b/fR 0x472904 (8 bytes): 0x7f6fcea8 0bdbf3b9 j 0x7f6fcee4 0x7f6fceac 00000000 sll $0,$0,0x0 # 0x472900: !BEGINBASICBLOCK 0x7f6fceb0 (0 bytes): # 0x472904: push-stack/%r0 (12 bytes): 0x7f6fceb0 afd40004 sw $20,4($30) 0x7f6fceb4 27de0004 addiu $30,$30,4 0x7f6fceb8 02c0a025 or $20,$22,$0 # 0x472904: push-stack/%r1 (12 bytes): 0x7f6fcebc afd40004 sw $20,4($30) 0x7f6fcec0 27de0004 addiu $30,$30,4 0x7f6fcec4 02a0a025 or $20,$21,$0 # 0x472904: minus-stack (12 bytes): 0x7f6fcec8 8fc20000 lw $2,0($30) 0x7f6fcecc 27defffc addiu $30,$30,-4 0x7f6fced0 0054a023 subu $20,$2,$20 # 0x472904: pop-stack/%r0 (16 bytes): 0x7f6fced4 8fc20000 lw $2,0($30) 0x7f6fced8 0280b025 or $22,$20,$0 0x7f6fcedc 27defffc addiu $30,$30,-4 0x7f6fcee0 0040a025 or $20,$2,$0 # 0x472904: !BEGINBASICBLOCK 0x7f6fcee4 (0 bytes): # 0x472908: push-stack/%r0 (12 bytes): 0x7f6fcee4 afd40004 sw $20,4($30) 0x7f6fcee8 27de0004 addiu $30,$30,4 0x7f6fceec 02c0a025 or $20,$22,$0 # 0x472908: push-stack/%r1 (12 bytes): 0x7f6fcef0 afd40004 sw $20,4($30) 0x7f6fcef4 27de0004 addiu $30,$30,4 0x7f6fcef8 02a0a025 or $20,$21,$0 # 0x472908: different-stack (16 bytes): 0x7f6fcefc 8fc20000 lw $2,0($30) 0x7f6fcf00 27defffc addiu $30,$30,-4 0x7f6fcf04 00544826 xor $9,$2,$20 0x7f6fcf08 0009a02b sltu $20,$0,$9 # 0x472908: bt-stack/fR 0x4728f4 (20 bytes): 0x7f6fcf0c 27defffc addiu $30,$30,-4 0x7f6fcf10 02801025 or $2,$20,$0 0x7f6fcf14 8fd40004 lw $20,4($30) 0x7f6fcf18 1440ffc8 bnez $2,0x7f6fce3c 0x7f6fcf1c 00000000 sll $0,$0,0x0 # 0x47290c: b/fR 0x472910 (8 bytes): 0x7f6fcf20 0bdbf3ca j 0x7f6fcf28 0x7f6fcf24 00000000 sll $0,$0,0x0 # 0x472910: !BEGINBASICBLOCK 0x7f6fcf28 (0 bytes): # 0x472914: push-stack/%r0 (12 bytes): 0x7f6fcf28 afd40004 sw $20,4($30) 0x7f6fcf2c 27de0004 addiu $30,$30,4 0x7f6fcf30 02c0a025 or $20,$22,$0 # 0x472914: print-stack/retR 0x7f6fcf48 (20 bytes): 0x7f6fcf34 3c117f6f lui $17,0x7f6f 0x7f6fcf38 3631cf48 ori $17,$17,0xcf48 0x7f6fcf3c 8fa2047c lw $2,1148($29) 0x7f6fcf40 00400008 jr $2 0x7f6fcf44 00000000 sll $0,$0,0x0 # 0x472918: exitvm (12 bytes): 0x7f6fcf48 8fa204a8 lw $2,1192($29) 0x7f6fcf4c 00400008 jr $2 0x7f6fcf50 00000000 sll $0,$0,0x0
[ Here the TOS register is $20, and the under-top pointer $30 (named $fp, like in the output from --print-locations above). ]
[
TOS optimisation is easier to appreciate on MIPS where
minus-stack looks like a good
example of a stack operation consuming two elements and producing
one: a lw instruction loads the
under-top stack elements, addiu
decrements the under-top-pointer, and
subu does the actual work, using
the top register as both one the sources and the destination.
Here less-stack follows the same
pattern of minus-stack.
]
[ again suboptimal use of delay slots, for example at the end of bf-stack/fR, where it would have been correct to move lw $20,4($30) past the branch, replacing the nop sll $0,$0,0x0. This problem is easier to observe in stack code than register code, due to conditionals having side effects on the stack. ]
register code on RISC-V, minimal threading [2021 update: this is still a good example for minimal-threading but now that Jitter supports no-threading on RISC-V it is no longer the best that can be achieved on this architecture]:
# 0x4d740: !BEGINBASICBLOCK 0x40009fe6e0 (2 bytes): 0x00000040009fe6e0 0421 c.addi x8,8 # 0x4d748: mov/nR/%r0 0x4f790d55 (6 bytes): 0x00000040009fe6e2 00043903 ld x18,0(x8) 0x00000040009fe6e6 0421 c.addi x8,8 # 0x4d750: mov/n1/%r1 (2 bytes): 0x00000040009fe6e8 4985 c.li x19,1 # 0x4d750: be/%r0/%r1/fR 0x4d790 (12 bytes): 0x00000040009fe6ea 01391563 bne x18,x19,0x00000040009fe6f4 0x00000040009fe6ee 6000 c.ld x8,0(x8) 0x00000040009fe6f0 601c c.ld x15,0(x8) 0x00000040009fe6f2 8782 c.jr x15 0x00000040009fe6f4 0421 c.addi x8,8 # 0x4d758: !BEGINBASICBLOCK 0x40009fe6f6 (2 bytes): 0x00000040009fe6f6 0421 c.addi x8,8 # 0x4d760: bge/%r0/%r1/fR 0x4d770 (12 bytes): 0x00000040009fe6f8 01394563 blt x18,x19,0x00000040009fe702 0x00000040009fe6fc 6000 c.ld x8,0(x8) 0x00000040009fe6fe 601c c.ld x15,0(x8) 0x00000040009fe700 8782 c.jr x15 0x00000040009fe702 0421 c.addi x8,8 # 0x4d768: minus/%r1/%r0/%r1 (4 bytes): 0x00000040009fe704 412989b3 sub x19,x19,x18 # 0x4d768: b/fR 0x4d778 (6 bytes): 0x00000040009fe708 6000 c.ld x8,0(x8) 0x00000040009fe70a 601c c.ld x15,0(x8) 0x00000040009fe70c 8782 c.jr x15 # 0x4d770: !BEGINBASICBLOCK 0x40009fe70e (2 bytes): 0x00000040009fe70e 0421 c.addi x8,8 # 0x4d778: minus/%r0/%r1/%r0 (4 bytes): 0x00000040009fe710 41390933 sub x18,x18,x19 # 0x4d778: !BEGINBASICBLOCK 0x40009fe714 (2 bytes): 0x00000040009fe714 0421 c.addi x8,8 # 0x4d780: bne/%r0/%r1/fR 0x4d758 (12 bytes): 0x00000040009fe716 01390563 beq x18,x19,0x00000040009fe720 0x00000040009fe71a 6000 c.ld x8,0(x8) 0x00000040009fe71c 601c c.ld x15,0(x8) 0x00000040009fe71e 8782 c.jr x15 0x00000040009fe720 0421 c.addi x8,8 # 0x4d788: b/fR 0x4d790 (6 bytes): 0x00000040009fe722 6000 c.ld x8,0(x8) 0x00000040009fe724 601c c.ld x15,0(x8) 0x00000040009fe726 8782 c.jr x15 # 0x4d790: !BEGINBASICBLOCK 0x40009fe728 (2 bytes): 0x00000040009fe728 0421 c.addi x8,8 # 0x4d798: print/%r0/retR 0x40009fe730 (6 bytes): 0x00000040009fe72a 748bb783 ld x15,1864(x23) 0x00000040009fe72e 8782 c.jr x15 # 0x4d7a0: exitvm (6 bytes): 0x00000040009fe730 718bb783 ld x15,1816(x23) 0x00000040009fe734 8782 c.jr x15
[ ]
[ !BEGINBASICBLOCK ] [ Branches more expensive: load-load-branch. See for example b/fR. x8 thread pointer. Address where to branch loaded into a temporary, here always x15, branch via register. Residuals are loaded from memory, at a known constant offset from the thread pointer x8. The thread pointer needs to be incremented when using residuals, so that it points to the next residual in a different VM instruction. ]
register code on arm, direct-threading:
# 0x5cd68: mov/nR/%r0 0x292c0, 0x4f790d55 (16 bytes): 0x000292c0 e5941008 ldr r1, [r4, #8] 0x000292c4 e2844008 add r4, r4, #8 0x000292c8 e5145004 ldr r5, [r4, #-4] 0x000292cc e1a0f001 mov r15, r1 # 0x5cd70: mov/n1/%r1 0x291f8 (16 bytes): 0x000291f8 e5941004 ldr r1, [r4, #4] 0x000291fc e3a06001 mov r6, #1 0x00029200 e2844004 add r4, r4, #4 0x00029204 e1a0f001 mov r15, r1 # 0x5cd74: be/%r0/%r1/fR 0x25718, 0x5cda4 (28 bytes): 0x00025718 e1550006 cmp r5, r6 0x0002571c 05944004 ldreq r4, [r4, #4] 0x00025720 05941000 ldreq r1, [r4] 0x00025724 01a0f001 moveq r15, r1 0x00025728 e5941008 ldr r1, [r4, #8] 0x0002572c e2844008 add r4, r4, #8 0x00025730 e1a0f001 mov r15, r1 # 0x5cd7c: bge/%r0/%r1/fR 0x260d4, 0x5cd90 (28 bytes): 0x000260d4 e1560005 cmp r6, r5 0x000260d8 d5944004 ldrle r4, [r4, #4] 0x000260dc d5941000 ldrle r1, [r4] 0x000260e0 d1a0f001 movle r15, r1 0x000260e4 e5941008 ldr r1, [r4, #8] 0x000260e8 e2844008 add r4, r4, #8 0x000260ec e1a0f001 mov r15, r1 # 0x5cd84: minus/%r1/%r0/%r1 0x28640 (16 bytes): 0x00028640 e5941004 ldr r1, [r4, #4] 0x00028644 e0466005 sub r6, r6, r5 0x00028648 e2844004 add r4, r4, #4 0x0002864c e1a0f001 mov r15, r1 # 0x5cd88: b/fR 0x25700, 0x5cd94 (12 bytes): 0x00025700 e5944004 ldr r4, [r4, #4] 0x00025704 e5941000 ldr r1, [r4] 0x00025708 e1a0f001 mov r15, r1 # 0x5cd90: minus/%r0/%r1/%r0 0x28450 (16 bytes): 0x00028450 e5941004 ldr r1, [r4, #4] 0x00028454 e0455006 sub r5, r5, r6 0x00028458 e2844004 add r4, r4, #4 0x0002845c e1a0f001 mov r15, r1 # 0x5cd94: bne/%r0/%r1/fR 0x26df8, 0x5cd7c (28 bytes): 0x00026df8 e1550006 cmp r5, r6 0x00026dfc 15944004 ldrne r4, [r4, #4] 0x00026e00 15941000 ldrne r1, [r4] 0x00026e04 11a0f001 movne r15, r1 0x00026e08 e5941008 ldr r1, [r4, #8] 0x00026e0c e2844008 add r4, r4, #8 0x00026e10 e1a0f001 mov r15, r1 # 0x5cd9c: b/fR 0x25700, 0x5cda4 (12 bytes): 0x00025700 e5944004 ldr r4, [r4, #4] 0x00025704 e5941000 ldr r1, [r4] 0x00025708 e1a0f001 mov r15, r1 # 0x5cda4: print/%r0/retR 0x29fa8, 0xffffffff (24 bytes): 0x00029fa8 e1a01005 mov r1, r5 0x00029fac e59d0004 ldr r0, [r13, #4] 0x00029fb0 ebff9b09 bl 0x00010bdc 0x00029fb4 e5941008 ldr r1, [r4, #8] 0x00029fb8 e2844008 add r4, r4, #8 0x00029fbc e1a0f001 mov r15, r1 # 0x5cdac: exitvm 0x282bc (4 bytes): 0x000282bc ea001913 b 0x0002e710
[ ]
0. thread pointer: r4 (register) 1. base: r6 (register) 2. link register: [sp, #20] (memory) 3. mainstack top: r10 (register) 4. mainstack undertop ptr: fp (register) 5. %r0: r7 (register) 6. %r1: r8 (register) 7. %r2: r9 (register)
[ Similar to minimal threading, but here every VM instruction ends with the load-load-branch sequence. VM instruction addresses not consecutive. There is no !BEGINBASICBLOCK so VM branches are less expensive than on minimal threading, but VM instruction fallthru is as expensive as a branch. In this particular example, featuring very short basic blocks, direct threading performs better than usual compared to minimal threading. ]
[comment. The typical direct-threading "load-load-branch" sequence.]
Here GCC has chosen to use r4 as the
thread pointer, and
r1 as a temporary holding the address where
to branch.
On ARM the hardware program counter is accessible as the general-purpose
register r15; any instruction writing
into r15, for example
mov r15, r1, is therefore a branch.
This pattern is common in threaded code (direct threading and minimal
threading), to express an indirect jump via register.
[some generic comment: portability, non-GCC. Behavior will be identical, performance will be worse.] [specialisation remains, so with Jitter the user still gains fast registers and fast literals compared to a naïve solution, even with switch dispatch.]
[procedure calls, overflow handling, conditional fast branching on data tags, predefined macros for more sophisticated stack operations; multiple stacks; user-defined runtime data structures]
Performance:
[comparison between stack and registers not entirely fair in this case; the register-based code generator is more sophisticated, and the stack version could benefit from more rewrite rules] Same program above translated into C.
[GCC 9 snapshot from 2018-12-30, on every architecture. -march=haswell -march=loongson2f -march=mips32 -static]
#include <stdio.h> int main (void) { long a = 1333333333; long b = 1; while (a != b) if (a < b) b = b - a; else a = a - b; printf ("%li\n", a); return 0; }
[ I have verified that in this case GCC does not exploit the fact that the value of a at the end is actually a known constant. ]
Intel i7-4700MQ (Haswell) variable frequency up to 3400MHz:
Jittery VM, stack | ||
---|---|---|
fast registers, fast literals | no fast registers, no fast literals | |
switch | 24.28 | 24.36 |
direct threading | 12.33 | 13.60 |
minimal threading | 6.54 | 9.54 |
no threading | 6.37 | 6.79 |
Jittery VM, registers | ||
fast registers, fast literals | no fast registers, no fast literals | |
switch | 5.18 | 6.40 |
direct threading | 3.99 | 5.96 |
minimal threading | 5.18 | 6.38 |
no threading | 0.80 | 3.22 |
C, x86_64-unknown-linux-gnu-gcc -march=haswell | ||
-O2 | 0.40 | |
-O3 | 0.83 |
foo
Lemote Yeeloong 8089_B (Loongson2F 800MHz MIPS CPU):
Jittery VM, stack | ||
---|---|---|
fast registers, fast literals | no fast registers, no fast literals | |
switch | 553.56 | 452.62 |
direct threading | 134.70 | 151.57 |
minimal threading | 112.81 | 183.06 |
no threading | 104.38 | 190.23 |
Jittery VM, registers | ||
fast registers, fast literals | no fast registers, no fast literals | |
switch | 128.24 | 138.89 |
direct threading | 21.88 | 85.35 |
minimal threading | 23.57 | 82.35 |
no threading | 10.10 | 63.97 |
C, mipsel-unknown-linux-gnu-gcc -march=loongson2f | ||
-O2 | 6.18 | |
-O3 | 10.10 |
Ben Nanonote (JZ4720 333MHz MIPS CPU, non-superscalar pipeline):
Jittery VM, stack | ||
---|---|---|
fast registers, fast literals | no fast registers, no fast literals | |
switch | 1165.38 | 1267.46 |
direct threading | 494.86 | 694.23 |
minimal threading | 269.78 | 490.75 |
no threading | 200.29 | 302.53 |
Jittery VM, registers | ||
fast registers, fast literals | no fast registers, no fast literals | |
switch | 283.53 | 437.60 |
direct threading | 126.70 | 278.14 |
minimal threading | 118.52 | 265.77 |
no threading | 24.54 | 143.33 |
C, mipsel-unknown-linux-uclibc-gcc -march=mips32 | ||
-O2 | 20.44 | |
-O3 | 20.44 |
[no-threading is the clear winner, on any supported architecture and micro-architecture]
I am currently testing the following configurations, in some cases with multiple GCC versions, via cross-toolchains and emulation:
Right now Jitter includes enough assembly support to enable no-threading dispatch on x86_64, M68k, 32-bit MIPS (both traditional and r6), 32-bit PowerPC, RISC-V (32- and 64-bit) SPARC (32- and 64-bit), and SH. I mean to extend the same level of support to all of the architectures in the triplets above, with the exception of SH (see the information about SH below.
Minimal-threading support, which is extremely easy to add, currently
exists for
Aarch64,
Alpha,
ARM,
PA-RISC,
S390x.
Support for more architectures will come.
Adding support for a new architecture requires relatively little effort, and the internals are being worked on to make porting even easier. [somewhere: Jitter's source code is quite well factored and contains very little assembly, all confined within an architecture-specific subdirectory]
The SH architecture (sh4eb-unknown-linux-gnu, sh4-unknown-linux-gnu) presents special challenges with the most efficient dispatches; I will make an effort to support it, but that may not be entirely feasible in the end. switch and direct threading remain of course supported on SH, and are already very reliable: direct threading simply relies on GNU C, with no special assembly required; switch dispatching is even more portable.
Other architectures, for example 32-bit s390, are probably easy to support as well. I am now concentrating my efforts on the configurations which I can test with conveniently generated cross-toolchains and emulators.
Already usable. An external project actually using it. API not completely stable yet. Future tasks: debugging API, machine-generation of tagging and untagging macros for dynamically-typed languages, a copying garbage collector. Some minor optimisations.
Existing source code mostly clean, and very well commented. A few modules to be rewritten, particularly jitter/jitter-replicate.c. The bug. Manual. Ports (mostly trivial).
GNU Jitter is free software, released under the GNU GPL version 3 or later.
An elaborate Jitter example, JitterLisp, has evolved into an interesting sub-project in its own right.
Jitter's source code is currently hosted on my own git server, available to anybody via HTTP. Please contact me if you have some reason to require SSH access.
I am now back to using the master branch on git, with short-lived feature branches.
Copyright © 2017-2022 Luca Saiu
Verbatim copying and redistribution of this entire page are permitted provided this notice is preserved.