source
raw/articles/2026-04-16-retrofitting-jit-compilers-into-c-interpreters.md
original: https://tratt.net/laurie/blog/2026/retrofitting_jit_compilers_into_c_interpreters.html
Laurence Tratt: Retrofitting JIT Compilers into C Interpreters
source: https://tratt.net/laurie/blog/2026/retrofitting_jit_compilers_into_c_interpreters.html
fetched_at: 2026-04-16T02:20:28Z
fetch_status: ok
Home
>
Blog
≡
Home
>
Blog
email
Bluesky
Mastodon
Twitter
Bluesky
Twitter
Retrofitting JIT Compilers into C Interpreters
RSS feed:
whole site
or
just the blog
Recent posts
Retrofitting JIT Compilers into C Interpreters
PLISS 2026
Upcoming Talk in London
Async and Finaliser Deadlocks
What Context Can Bring to Terminal Mouse Clicks
Why Firsts Matter
LLM Inflation
Comparing the Glove80 and Maltron keyboards
The LLM-for-software Yo-yo
The Fifth Kind of Optimisation
Blog archive
C interpreters are a common language implementation technique and the basis for
the reference implementations of languages such as Lua, Ruby, and
Python. Unfortunately, C interpreters are slow, especially compared to
language implementations powered by JIT compilers. In this post I’m going to show
that it is possible to take C interpreters and, by changing a tiny proportion
of code, automatically turn them into JIT compiling VMs (Virtual Machines)
1
. This offers a point in the language performance design space
that was previously out of reach: better performance while retaining
full compatibility with reference implementations.
The system that lets us do this is
yk
,
something we’ve have been working on for some time. I want to
set expectations appropriately. yk is alpha-stage software: it’s not production ready,
though it’s some way beyond a hacky research prototype in breadth and
quality. You can quite easily hit
TODO
s that halt the program, or encounter
missing parts where execution continues suboptimally; we only have a subset
of the optimisations we’d like; only an x64 is currently implemented; and we don’t yet have the breadth of
languages and benchmarks that one would need to draw strong conclusions about
performance.
That said, you can already see some interesting results.
Seeing
programming
language performance is quite hard, but this video might give you an idea.
First, we run a Mandelbrot program through the reference
PUC Lua
VM
2
and then through
yklua
, our fork of PUC Lua with
an automatically derived JIT compiler:
[Video]
I must confess that I have cherry picked an example
where we get better-than-average performance. Rather than the ~4x improvement
in the video above, on
our Lua benchmark suite
3
so far a more
realistic geometric mean is a little under 2x. Small benchmarks inevitably flatter the sorts of techniques
I’ll be talking about, so “real” programs will on average see less than that
4
.
What might surprise you is that we only added about 400LoC to PUC Lua, and changed
under 50LoC. This hopefully makes the trade-off yk offers clear: yklua does
not reach the performance peaks of the wonderful, carefully hand-written,
LuaJIT; but LuaJIT is stuck in the 13 year old past of Lua 5.1
5
,
whereas yklua trivially tracks updates to PUC Lua. The point of this post isn’t
to tell you that one side or the other of this trade-off is best. Rather, it’s
to show that the trade-off exists, and that yk fills in a part of the design
space that was largely out of bounds before.
There’s also nothing inherently Lua specific about yk. Alongside various smaller
implementations, recently we’ve put a few days effort into adjusting
MicroPython
leading to – you might spot a naming
pattern here –
ykmicropython
. The
MicroPython interpreter uses some idioms that we have yet to support, and any
benchmark which encounters one of those currently has terrible performance. But
if you have a program which doesn’t hit many of those parts you can
already see a meaningful performance improvement. The Mandelbrot
benchmark isn’t where ykmicropython performs best, but it’s still good enough
to see a difference (“normal” MicroPython top, ykmicropython bottom):
[Video]
Eventually, my hope is that if you have a C interpreter, big or small, yk will
allow you to add a JIT to it for little effort, gaining additional performance
for free.
In this post I’ll explain why we created yk, why it works
the way it does, and some of the technical challenges involved. Despite
the length of this post, I won’t come close to covering everything I
arguably should, and I’ll inevitably simplify much of what I do cover. Future posts will
hopefully fill in more details, add nuance, and track yk’s evolution.
I want to thank Shopify and the Royal Academy of Engineering, who together generously
funded the underlying research. I am immensely grateful to both, though of course
I speak for neither organisation! The work I’m presenting in this post is a
joint effort with
Edd Barrett
,
Lukas
Diekmann
, and
Pavel Durov
. I want to dedicate this work
to the late Chris Seaton, who was an early champion of yk when it was no
more than a tiny demo: he is much missed by those of us who considered him a
friend.
Why are JIT compilers hard?
Many programming languages are implemented as
interpreters
.
This is the most common approach for dynamically typed languages, but is also
true of a surprising number of other languages, from CPU simulators to domain
specific languages. Interpreters are simple to write, but do not have great
performance.
It is therefore common to add a Just-In-Time (JIT) compiler to interpreters. The
fundamental thesis of JIT compilation is that the way a program executed in
the near past is indicative of how it will execute in the future. JIT compilers
thus
speculatively optimise
6
a program; whilst the speculations hold, performance
is good; if the speculations fail to hold, performance will (possibly temporarily) decrease.
JIT compilers therefore observe a running program, identify the
hot
(i.e.
most frequently executed) parts and some surrounding context (e.g. the types of
objects used), extract and optimise those, and compile them to machine code
(henceforth “JITed code”). The JIT-compiled code can then be run in place of
the interpreter.
JIT compilers are difficult and expensive to create. Most obviously, they have
lots
of delicate, moving parts, and JIT compiler authors need to be expert in
multiple domains. Minor mistakes can lead to days of frustrating debugging;
architectural mistakes can take years to undo. Over a decade ago, someone who
knew more than me estimated that
HotSpot
(aka “the JVM” i.e. the
standard Java VM) had already consumed well over 1,000 person years of effort.
V8, the JavaScript JIT compiling VM in Chrome, has had a large team working on
it for many years.
Less obviously, because of their complexity, JIT compilers can be difficult to evolve.
Even if one creates a fully compatible JIT compiler for a given version of a
language, the blighters in control of the language specification are bound to
extend or alter the language. It’s quite easy for changes in the language to
break assumptions embedded deep within a JIT compiler, causing the latter to lag
behind the language specification. Eventually, “lag” has
a tendency to turn into “will never catch up”, which tends to be the beginning
of irrelevance.
JIT compilers also make achieving full – and I mean “full”! –
compatibility difficult. Often it is necessary to create a new VM in order to
create the best possible JIT compiler. However, doing so almost inevitably
means foregoing full compatibility. The new VM might not be able to interface
with existing extension modules; it might run finalisers at slightly different
points in time; not implement every last function in the standard library; and
so on. If a user tries a new VM and their software doesn’t work out of the box,
not only will they not experiment further, but they will often resolve never to
waste time on that VM again.
Perhaps the best evidence for how hard JIT compilers are to create is simply to
enumerate the publicly
7
known JIT compilers for Python include Cinder, IronPython, Jython,
Nuitka, Psyco, Pyjion, PyPy, Pyston, S6, Shed Skin, Skybison, Starkiller, TrufflePython,
Unladen Swallow, WPython, and Zippy. Few of these are still active, despite
the skilled people involved, and the desires and hopes of many users.
Automatic JIT compilers
We could bypass most of the problems I’ve noted above if we could somehow
automatically create JIT compilers. There are two main approaches to automatically creating JIT
compilers from interpreters, with those two approaches exemplified by two
systems: meta-tracing in
RPython
(which is part of the PyPy project); and (a form of) partial evaluation in
Truffle
.
RPython and Truffle are brilliant technical achievements: I remain in awe
of both projects. They can, in different ways and with different trade-offs,
create astonishingly high-performance JIT compilers from interpreters.
There is, however, a catch: both systems require you to create a
new
interpreter. They thus run into the almost-but-not-quite-compatible challenge I
mentioned earlier. Perhaps because of this, neither system has seen the
widespread use that their excellent performance would suggest they should.
Deriving a JIT compiler from C interpreters
While most of the languages we’re interested in optimising
have an English specification, users tend to take what the reference interpreter
does as the specification. From the perspective of a language lawyer, users are clearly wrong to
do so, but few users know that language lawyers even exist — they mostly care about
completing the task at hand!
What this means is so obvious that I did not fully internalise it for many
years: for many languages, the C interpreter
is
the source of truth for the language.
Rightly or wrongly, any
implementation which deviates from that source of truth, even in minor ways, will
be unacceptable to most users.
This means that RPython and Truffle, wonderful though they are, leave an
important gap in the design space. When I eventually woke up to that
reality, the question we needed to answer became obvious: is it
possible to automatically derive a JIT compiler from a C interpreter? If we
could do so, we could avoid the compatibility issues that prevent many users
from using faster language implementations.
Other people have had, at least to some degree, similar thoughts and
there is a small body of work sharing the same aim, notably
this often forgotten
paper
and the later
copy-and-patch
. Putting
aside the (sometimes large) amount of manual work these approaches require,
their basic aim is to JIT compile an interpreter’s core
interpreter loop. This is a worthwhile thing to do, but many programs spend
much of their time outside of that loop. For example, basic functionality
like “is object o1
==
o2?” is often deferred to functions outside of the core
loop. In my opinion, if you want to get good performance, you have to be able
to inline code far beyond the core interpreter loop.
Ultimately, at least in my pea brain, there is then only one technique that can
realistically achieve the aim we want: meta-tracing
8
. And,
indeed, that’s exactly what yk does: it meta-traces C interpreters. That’s why
we could take in the normal C Lua VM, add a few lines of code, and have it
become a JIT compiler.
However “meta-tracing for C” is a far harder thing than that simple phrase
suggests. Put bluntly, it’s very close to trying to ram a square peg into a
round hole, which is why more than one person I’ve met has suggested, mostly
politely, that it’s never going to work. Indeed, we’ve had to solve all sorts
of problems, big and small, to make this a practical reality.
Tracing and meta-tracing
Meta-tracing is a surprisingly challenging technique to explain
9
, but
I’ll do my best. First let’s reuse terminology that I think came from
Truffle: a
host
language is the one we write interpreters in (e.g. C); a
guest
language is the language being implemented (e.g. Lua, Ruby, your favourite CPU
simulator, etc).
Tracing
Let’s start briefly by looking at (non-meta) tracing JIT compilation. Such a
system looks for
hot loops
(e.g.
for
or
while
loops) in a guest program.
When it encounters one, it then
traces
(i.e. records) the execution of the
next iteration of the loop, optimises it, and compiles it into machine code.
When the interpreter next comes to execute that loop, it hands execution over to
the JIT-compiled code.
For example, imagine I have this guest language program:
for
x
in
...
do
if
x >
0
then
y = y +
3
end
end
and the
for
loop has become hot enough to trace. Let’s assume that in the
particular iteration that was traced
x
had a concrete value greater than 0.
The resulting trace might look roughly as follows:
OP_LOOKUP("x")
guard true, pop() > 0
OP_LOOKUP("y")
OP_INT(3)
OP_ADD
OP_SET("y")
The trace is, mostly, a record of the guest language operations (“opcodes” or
“bytecodes”) that were executed. We can then JIT compile the trace above into machine
code and use it in place of the interpreter.
There is, however, one obvious oddity in the trace. When we executed the
for
loop,
x > 0
evaluated to true, and we thus executed the true branch of the
if
statement. You can clearly see
y = y + 3
in the above fragment but the
if
has become a
guard
. Formally speaking, we say a trace linearises control
flow. In plain English, a trace records only one an
if
statement’s
true
or
false
branches, not both.
What that means is that a trace only records a subset of a program’s behaviour:
at some point, an execution of the trace might need to move beyond that
subset. Guards record the points when this can happen: if a guard fails, the
rest of the trace is invalid. When that happens, the JIT-compiled code must
deoptimise
back to the interpreter.
We can thus immediately see a fundamental difference between tracing JIT compilers
and the more common “method-based” JIT compilers. In the latter, one is looking
for functions that execute frequently. When such a function is found, it is
compiled (at run-time!) in much the same way that a traditional ahead-of-time
compiler like gcc/clang would compile a function. In other words, method-based
JIT compilers produce code which captures far more of a program’s behaviour
10
than a trace-based compiler.
Some readers with long memories will remember that the first JavaScript JIT
compiler in Firefox was TraceMonkey. As its name suggest, it was a tracing JIT
compiler for JavaScript, but its performance track
record was uneven, and after 2 or 3 years it was replaced by a method-based
JIT compiler. This experience has led a number of people over the years to declare that
tracing JIT compilers are inferior in all ways.
My view is more nuanced. If you’ve got enough time, enough
money, and a green field, method-based JIT compilers are almost certainly going
to be superior in the general case
11
. If you lack any of those three factors, the
right technical choice might be different. yk is explicitly aimed at the case
where you lack a green field
12
.
Meta-tracing
How does meta-tracing differ from tracing? A tracing JIT compiler is
specialised to a particular guest language, and contains a hand-written trace
recorder, a hand-written optimiser, and a hand-written code generator. A
meta-tracing system says “hand me an interpreter for your favourite guest
language and I will do the rest for you”. It does that by recording what the
host language interpreter is doing while executing the guest language.
For example if I have a C interpreter along these lines:
Instruction code = ...;
int
pc =
0
;
Value stack = ...;
while
(
true
) {
Instruction i = code[pc];
switch
(
GET_OPCODE
(i)) {
case
OP_LOOKUP:
push
(
lookup
(
GET_OPVAL
())); pc++;
break
;
case
OP_ADD:
push
(
pop
() +
pop
()); pc++;
break
;
case
OP_JGT:
if
(
pop
() >
0
) pc++;
else
pc +=
GET_OPVAL
(i);
break
;
case
...: ...
}
}
then my guest language fragment from earlier will lead to a trace along the
following lines:
push(lookup("x")); pc++;
guard true, pop() > 0; pc++;
push(lookup("y")); pc++;
push(3); pc++;
push(pop() + pop()); pc++;
set("y", pop()); pc++;
As this might suggest, we have recorded the code the host-language interpreter
executed while it was itself running the guest program. The (meta-)trace we see
here is thus a trace of the C interpreter’s actions. That trace was
automatically constructed by the meta-tracer; it can then be passed to the
meta-tracing systems optimiser and code generator. In other words, meta-tracing
automates what must be hand-written in a (non-meta) tracing JIT compiler.
There’s one additional factor that’s worth highlighting about meta-tracing: it
naturally, and aggressively, inlines into function calls. In other words rather
than recording
push(...)
the trace will record the contents of
push
so the
trace will look more like:
guard true, used < capacity;
storage[used] = elem;
used += 1;
This ability to inline is a fundamental part of meta-tracing’s strengths, and why
yk is different than previous approaches to speeding up C interpreters.
From a C interpreter to a meta-tracing interpreter
Now we’ve seen the high-level idea of meta-tracing, the next question is: how
does yk convert a C interpreter into a meta-tracing-compatible interpreter? The
first key part is that we use a fork of LLVM called – wait for it –
ykllvm
which must be used to compile the
interpreter. At a basic level this just requires changing
cc
and
ld
in the
interpreter’s
Makefile
to:
‘yk-config release --cc‘ -o vm.o vm.c
‘yk-config release --ldflags‘ -o vm vm.o
At a deeper level, ykllvm is doing three main things: adding the ability to
record a trace; serialising IR so that the interpreter can be reconstructed
after trace recording; and making deoptimisation possible. We’ll look at
the last of these later: let’s start with the first two.
When we’ve detected a hot loop at the guest language level, we need a way to
record what actions the interpreter takes. There are various ways one could
do this
13
, but the one we use now is to insert calls to a
function
__yk_trace_basicblock(id)
in every
basic block
: roughly speaking,
this can be thought of as “at every point there’s an
if
or equivalent”. We
give every basic block a unique ID.
After an interpreter has run through ykllvm it is compiled in a way that is
roughly equivalent to the following:
Instruction code = ...;
int
pc =
0
;
Value stack = ...;
while
(
true
) {
__yk_tracebasicblock
(
0
);
Instruction i = code[pc];
switch
(
GET_OPCODE
(i)) {
case
OP_LOOKUP:
__yk_tracebasicblock
(
1
);
push
(
lookup
(
GET_OPVAL
()));
pc++;
break
;
case
OP_ADD:
__yk_tracebasicblock
(
2
);
push
(
pop
() +
pop
());
pc++;
break
;
case
OP_JGT:
__yk_tracebasicblock
(
2
);
if
(
pop
() >
0
) {
__yk_tracebasicblock
(
3
);
pc++;
}
else
{
__yk_tracebasicblock
(
4
);
pc +=
GET_OPVAL
(i);
}
break
;
case
...: ...
}
}
When a trace is recorded we’ll end up with a sequence along the lines of
[0, 1, 0, 1, 0, 2, ...]
as a representation of the guest program executing
OP_LOOKUP; OP_LOOKUP; OP_ADD
.
When we’ve got that sequence of basic block IDs we then need to reconstruct
ykllvm’s view of the interpreter’s IR (Intermediate Representation). Roughly
speaking, this can be thought of as a tree / graph representation of the
interpreter that ykllvm built up whilst compiling it. If we take a chunk of
LLVM’s “normal” IR for yklua and represent it textually we get:
359: ; preds = %341
%360 = load ptr, ptr %355, align 8, !tbaa !11
%361 = load ptr, ptr %352, align 8, !tbaa !11
%362 = call ptr @luaH_getshortstr(ptr noundef %361, ptr noundef %360) #18
%363 = getelementptr inbounds nuw i8, ptr %362, i64
%364 = load i8, ptr %363, align 8, !tbaa !9
%365 = and i8 %364, 15
%366 = icmp eq i8 %365, 0
br i1 %366, label %371, label %367
367: ; preds = %359
%368 = load i64, ptr %362, align 8, !tbaa !11
store i64 %368, ptr %345, align 8, !tbaa !11
%369 = load i8, ptr %363, align 8, !tbaa !9
%370 = getelementptr inbounds nuw i8, ptr %345, i64 8
store i8 %369, ptr %370, align 8, !tbaa !9
br label %3329
371: ; preds = %341, %359
%372 = phi ptr [ %362, %359 ], [ null, %341 ]
store ptr %203, ptr %171, align 8, !tbaa !11
%373 = load ptr, ptr %172, align 8, !tbaa !11
store ptr %373, ptr %13, align 8, !tbaa !11
call void @luaV_finishget(ptr noundef nonnull %0, ptr noundef nonnull %352, ptr noundef %355, ptr noundef %345, ptr noundef %372)
%374 = load volatile i32, ptr %173, align 8, !tbaa !11
br label %3329
ykllvm converts that to a slightly different IR: we tweak a few things (e.g.
the fearsomely complicated
getelementptr
instruction is split into
easier-to-handle variants) and add
safepoint
s (we’ll come back to those
later):
bb47:
%47_0: ptr = ptr_add %46_2, 8
%47_1: i8 = load %47_0
%47_2: i8 = and %47_1, 15i8
%47_3: i1 = eq %47_2, 0i8
condbr %47_3, bb49, bb48 [safepoint: 396i64, (%0_0, %0_2, %0_5, %0_6, ...
bb48:
%48_0: i64 = load %46_2
%45_3 = %48_0
%48_2: i8 = load %47_0
%48_3: ptr = ptr_add %45_3, 8
%48_3 = %48_2
br bb812
bb49:
%49_0: ptr = phi bb47 -> %46_2, bb45 -> 0x0
%6_10 = %29_0
%49_2: ptr = load %11_5
%0_21 = %49_2
call luaV_finishget(%0_0, %45_10, %45_13, %45_3, %49_0) [safepoint: 397i64, ...
br bb50
This “ykllvm IR” is then serialised (i.e. converted into a moderately efficient
sequence of bytes) and placed in a special section in the binary alongside the
“normally compiled” C code. When a trace is recorded, we can then map basic
block IDs (e.g.
47
) to a fragment in ykllvm IR, stitch all those fragments
together, optimise, and compile them. As this might suggest, although I’ve
framed this post in terms of “meta-tracing C interpreters”, arguably what we’re
really doing is better described as meta-tracing (a view of) LLVM IR.
Optimising a trace
Just stitching together a trace of recorded basic blocks and compiling them
into machine code gives at best moderate performance gains. Indeed, in yk the
situation would be worse than in previous approaches because of things like
yk_trace_basicblock
, which frustrate some static optimisations in LLVM!
What we need to do is optimise the
traces. yk has several
14
traditional compiler optimisations built in.
For example constant folding allows us to take a trace like this:
%7: i8 = 8
%8: i8 = 7
%9: i8 = add %7, %8
call print(%9)
and optimise it to:
%9: i8 = 15
call print(%9)
Dead load / store analysis allows us to take a trace such as:
%8: ptr = ...
%9: i8 = ...
store %9, %8 ; store the value of %9 to the pointer in %8
%11: i8 = load %8
call print(%11)
and optimise it to:
%8: ptr = ...
%9: i8 = ...
store %9, %8
call print(%9)
and so on.
Helping the optimiser
The optimisations above are effective and fast to run — but on a normal
interpreter, they only have a moderate effect. The fundamental reason for
that is that a C interpreter is an arbitrary program and yk’s optimiser
often either can’t soundly optimise parts of it, or isn’t sure that doing so will be
worth it.
Fortunately, interpreter authors have additional knowledge that can hugely help
the optimiser. Sometimes that knowledge is more about the guest language (e.g.
a function’s opcodes are immutable) and sometimes it is more about the
interpreter (e.g. that a seemingly unbounded loop will in fact only run a small
number of times).
Crucially, none of this knowledge is directly expressed in the C code. What we
need is a way for the interpreter author to pass this extra knowledge to yk so
that its optimiser can take advantage of it. Fortunately, RPython has done a
lot of the intellectual heavy lifting in this regard, and we can reuse many of
its concepts. Surprisingly, adding or tweaking just a few lines of C code
can lead to significant performance improvements in JIT-compiled code. In order
to understand why, I’ll dive into detail about two of the major optimisation
features that yk offers interpreter authors.
Promotion
Let’s start with one of the core features,
promotion
, which takes a run-time value
and “burns it into” a trace as a constant, and an associated
guard (to ensure that if the value changes on subsequent execution,
deoptimisation occurs, and execution is still correct).
yk exposes promotion as a C function
yk_promote
. In normal use, this is
just the identity function
15
, but when a trace is being recorded,
yk_promote
copies the value passed to it into a buffer: in the compiled trace, the call to
yk_promote
will be replaced with the associated constant.
For example, in our C interpreter from earlier we might implement
OP_INT(x)
which pushes a constant integer (sometimes called an “immediate”) onto the
stack as follows:
...
case
OP_INT:
push
(
yk_promote
(constant_pool[
GET_OPVAL
(i)])); pc++;
break
;
Without the
yk_promote
the trace for
OP_ADD(3)
might look roughly as follows:
tmp = constant_pool[i & 8]
push(tmp)
With
yk_promote
it will become:
tmp = constant_pool[i & 8]
guard true, tmp == 3
push(3)
The bad news is that we’ve had to insert a
guard
(in case the speculation
turns out not to hold) but the good news is that after that guard we know for
sure that
tmp
is the constant 3, rather than some unknown value. That means – if
I hand-wave a bit, because if I went into full detail we’d be here for hours –
that when we see a sequence such as:
tmp = constant_pool[i & 8]
guard true, tmp == 3
push(3)
pc += 1
i = code[pc]
tmp = constant_pool[i & 8]
guard true, tmp == 4
push(4)
push(pop() + pop())
we can optimise that last instruction into
push(7)
without even looking at
the stack (though we will have to ensure that the
used
portion of the stack
is adjusted appropriately). Doing so not only removes some instructions (
pop
must do a
load
at the very least) but makes it possible for us to generate
better machine code.
Idempotent functions
As you might have guessed from the example above,
yk_promote
isn’t
enough on its own to make a big dent in performance. Happily, though, we can
combine it with another feature to make what is perhaps the single biggest
individual optimisation for a yk interpreter: exposing that opcodes
are immutable
16
.
The additional tool we need is a way of expressing
idempotent
functions: that
is, functions that always return the same output given the same input
17
. yk
allows this to be expressed with the
yk_idempotent
function annotation
18
.
Cast your mind back to the C interpreter from earlier and this line:
Instruction i = code[pc];
Let’s assume that for a given
pc
, we always return the same
Instruction
(e.g. “at pc 4 you will always see OP_ADD”). We then change this line to
be a function call:
Instruction i =
load_inst
(
yk_promote
(code + pc));
and then add the function
load_inst
with the appropriate
yk_idempotent
annotation:
__attribute
((yk_idempotent))
Instruction
load_inst
(Instruction *
pc
) {
return
*pc;
}
During recording trace, yk will record the value that
load_inst
returned. If the trace
optimiser can prove that
load_inst
was called with a constant value (e.g.
because of
yk_promote
), it will then remove the call to
load_inst
and
replace it with that return value. In other words if
load_inst(0x1234)
returns
0b01110111
then in a trace we will simply see that latter constant.
The obvious advantage of this is that we have removed the load of the
opcode entirely, but what we’ve really done is open the floodgates for
yk’s optimiser to do its thing on instruction decoding. For example if
GET_OPVAL
is:
define
GET_OPVAL
(
i
) ((i & 0b11110000) >>
4
)
then yk, knowing that
i
is
0b01110111
, will then optimise
GET_OPVAL(i)
to
0b0111
. As this suggests, optimisations tend to accumulate, particularly in
the context of tracing interpreter loops: proving that something is constant
often allows other things further down the trace to be optimised too.
The good news doesn’t stop there. Since we had to promote
pc
, calculations
like
pc += GET_OPVAL(i)
optimise to a constant address; and that allows us to
remove most guards based on them (which we can now trivially prove are true).
In other words, by simply breaking out
load_inst
and declaring it idempotent
– a diff of less than 10 lines in yklua – we are able to remove nearly all
instructions relating to instruction decoding, as well as several knock-on
operations. Individually each operation that’s optimised away is small, and
very fast to run, but when enough of them are optimised away there’s a big
effect. In yklua, removing the
yk_idempotent
annotation slows things down by
around 4x!
Other optimisations
yk provides other optimisations. For example, by default yk
does not inline host-language functions which contain loops into a trace: annotating
functions with
yk_unroll
tells yk to inline these functions, unrolling the
loops they contain. This is always semantically correct, but generally
dangerous for performance (how often does a loop go around?) — except in
cases where you know that the loop will go around a small, constant or
near-constant number of times. A classic example of this is a function
in an interpreter which unpacks function arguments: guest language functions
tend to have a fixed number of parameters. The argument unpacking function
tends to be a good candidate for
yk_unroll
.
However, rather than going through all the possible optimisations, what I hope you’re starting to
realise from my description is that there is quite a lot of low-hanging fruit
in many C interpreters that can help yk. Indeed, nearly all the performance
gains I showed at the start of this post result from basic tweaks to the
interpreter (a few
yk_promote
s, a couple of
yk_idempotent
s, and several
yk_unroll
s and so on).
An open question is how much more benefit can one get? My intuition is: quite a
lot, but it will be highly dependent on how invasive the changes you’re willing
to make are.
At the trivial end of things are simple refactorings that make the interpreter
more yk friendly. For example, sometimes a very large, important, function has
a single (unbounded) loop in it: moving that loop into its own function makes
the larger function inlineable. More difficult – though heavily dependent on
the interpreter – would be to add
Self-style
maps
, which
would open up a lot of opportunities for yk’s optimiser.
Backwards code generation
yk is big enough, complex enough, and with enough novelty that, wherever
possible, we’ve tried to filch good ideas from elsewhere. One idea we’ve
recently made use of comes from LuaJIT: yk generates machine code by traversing
the trace in reverse. This turns out to allow us to generate better code, more
quickly, and to do so with less implementation complexity. Since, as far as I
know, yk is only the second system to do this, it’s worth giving an overview of
what the advantages of doing so are.
For example, given a trace such as:
%5: i8 = load %4
%6: i8 = load %4
%7: i8 = add %5, %6
yk generates the (say) x64 code for
add
first, and only then for the two
load
s. There are two main reasons for doing this: simple register allocation;
and implicit dead code elimination.
Register allocation sounds like it should be a simple thing for a compiler, and
in some ways it is. We have variables (e.g. the
%x
variables in traces) and
we assign them to registers in the CPU (e.g.
rax
). When we run out of
registers (of which there are typically few, commonly 16 on modern CPUs), we
put variables on the stack (which is near-infinite in size). Unfortunately,
done naively, too many variables end up on the stack and performance tanks
19
.
As that might suggest, good register allocation isn’t simple. Indeed, so
challenging are the trade-offs between “generate code that runs fast enough”
and “don’t make an unusably slow compiler” that register allocation has been an
active research topic for decades. Summarising the reasons for that succinctly
is hard, but the basic problem is that you need to look arbitrarily far
backwards and forwards in a program to do good register allocation, and that’s
difficult and expensive for programs with complex control flow.
Fortunately, in tracing compilers, LuaJIT shows that we can do something
completely different: backwards code generation. In normal forwards code
generation, one assigns a fairly arbitrary register to a variable
and hopes that will prove to be a good choice for later parts of the trace. In
backwards code generation, one assigns a register to a variable knowing that
it’s a good choice now, and leaves it up to “earlier” code to adjust as
necessary. Those two choices might sound equivalent, but they’re not.
For quite a while, yk did forwards code generation. The register allocator was
fiddly and fragile, and in order to make it generate even vaguely good code, we had to
have an initial backwards pass that calculated the lifetime of variables and also,
where appropriate, “this variable will definitely need to
end up in
rdi
later” and so on.
That we had to have an initial backwards pass was, eventually, the clue for me
to consider adopting LuaJIT’s approach. I must admit that at first I found
backwards code generation made my head hurt: forwards code generation is far
more intuitive. But, once I’d got used to the idea
20
, the eventual
register
allocator
became relatively simple to work with. Crucially, it’s easy to use
21
, faster to run, and generates far better code than forwards code
generation.
I also mentioned above that backwards code generation implicitly does dead code
elimination. When a platform backend asks what register variable
%7
should be
in, it is implicitly saying “
%7
will be used later in the trace”. We record
that using a single bit. That means that we no longer need to work out
%7
s
lifetime: we find it implicitly upon our first encounter (which, given this is
backwards code generation, is its last use). Then, when we come to process
%7
we say “is this variable constructed using side effects?” (if so we must
generate code for it) or “if it doesn’t have side effects, is it used later in
the trace?”. That latter condition
is
dead code elimination. All in all, dead
code elimination is fewer than 10LoC.
The reason I’m highlighting this is because traces end up with quite a lot of
dead code, whether due to trace optimisation or platform specific code generation tricks.
For example, consider this trace:
%2: ptr = ...
%3: ptradd %2, 8
%4: load %3
On x64 yk this will generate the following x64 code:
; %2: ptr = ...
mov rax, ...
; %3: ptradd %2, 8
; %4: load %3
load rdi, [rax+8]
Note that we didn’t generate any code for the
ptradd
instruction because
load
was able to incorporate it indirectly. This might seem trivial, but it
means we’ve avoided using a register for
%3
: you don’t have to do this very
often to have a big, positive, impact on register allocation!
Locations
Astute readers will have noticed that I haven’t made clear how an interpreter
tells yk where guest language loops are. As you might expect, yk has absolutely
no idea where a Lua or Ruby or Python loop might start or end: indeed, it
doesn’t even know whether such languages have
while
loops or some other
construct entirely.
To tell where yk where loops are, interpreters have to use a
location
, which
we pass to a
control point
. Locations are simple: they’re markers that tell
yk something about part of the guest program (typically about a specific
opcode in the user’s program). Control points
22
are
more subtle: they’re the point in the interpreter where control is handed over
to yk alongside a location. A control point call might do nothing, might start
tracing, or might start executing a compiled trace.
For our purposes
23
, we can consider locations to come in two
flavours:
null
: this guest-level
pc
is not the start of a loop.
loop
: this guest-level
pc
is the start of a loop.
One constructs these with
yk_location_null
or
yk_location_new
respectively, with those two functions returning an opaque
YkLocation
type.
Strictly speaking, one doesn’t need null locations, but they often make it easier
to get an interpreter up and running with yk.
YkLocation
s are quite deliberately, a machine word wide, so that you can
store them efficiently (e.g. alongside relevant opcodes). That said, the
easiest thing to do is to create an array with one
YkLocation
per opcode in
the user’s program. For example, if we have a language that represents
while
loops with a backwards jump opcode
OP_JMP
we might create the array of
YkLocation
s as follows:
Instruction code = ...;
YkLocation yklocs =
calloc
(sizeof(YkLocation), prog_len);
for
(
int
pc =
0
; pc < prog_len; pc++) {
Instruction i = code[pc];
if
(prog[pc] == OP_JMP &&
GET_OPVAL
(i) <
0
)
yklocs[pc] =
yk_location_new
();
else
yklocs[pc] =
yk_location_null
();
}
And we would insert the control point call into the main loop as follows
(having suitably initialised yk with
yk_mt_new
):
YkMT mt =
yk_mt_new
(
NULL
);
Instruction code = ...;
YkLocation yklocs = ...;
int
pc =
0
;
Value stack = ...;
while
(
true
) {
Instruction i = code[pc];
yk_mt_control_point
(mt, &yklocs[pc]);
switch
(
GET_OPCODE
(i)) {
case
OP_LOOKUP: ...
...
At a basic level that’s really all that needs to be done in an interpreter.
Now, there are deeper questions such as “what’s the right place to start a
loop?”, “what is a loop?” (e.g. recursive function calls), and “what are good
traces?” (too many can be as bad as too few). Some of those are solely for
interpreter authors; some need help from yk; and we haven’t yet implemented all
the bits that interpreter authors need. I won’t dwell further on that here: I
hope to return to this in the near future, as we’ve experimented with various
promising designs recently.
Deoptimisation
Earlier I spoke quite a bit about “speculation” and “deoptimisation”. To recap:
when we speculate on a value we leave behind a guard; if that guard fails, we
deoptimise from the JIT-compiled code back to the C interpreter.
For those in the know “deoptimise … back to the C interpreter” will sound, to
use the technical term, bonkers. It implies that somehow we can jump back to an
arbitrary point in the binary that LLVM compiled, which seems impossible.
Fortunately, it is possible — though it required quite a bit of work, and the
acceptance of some trade-offs.
Before I go into detail, let’s think roughly about what needs to happen. When a
control point starts executing JIT-compiled code, it doesn’t create a new
function frame: instead it takes over the AOT (Ahead-Of-Time) LLVM-compiled
binary’s function frame and then adds stuff of its own choosing at the end.
When a guard fails, we need to: stop the program; adjust the stack so that it’s
in the right shape for the AOT binary; put values in the right register; and
jump to the right offset in the AOT binary.
You may not be surprised to learn that LLVM was not built with such a use-case
in mind. Fortunately it has the basic tool we need:
stackmaps
,
which tell us the layout of registers and the stack at a given point in the LLVM IR.
We then roll our own concept of
safepoints
24
built upon stackmaps:
in essence a safepoint pairs a normal LLVM instruction with a stackmap.
In essence, ykllvm records a safepoint at each conditional and function call. When a guard is about to be created in a
trace, yk then knows the stack of safepoints it relates to. The reason for a
“stack” of safepoints is due to inlining: the conditional that the guard
relates to may be many functions deep from where the trace started. By
recording a safepoint just before each function call, we know what the stack of
AOT frames (would have) look(ed) like.
Stackmaps are conceptually simple. For a given LLVM variable
%7
they record
information such as: this variable’s value is stored in registers
rdi
,
r8
,
and at stack offset
0x7F
. Except they don’t quite do so, probably because
this part of LLVM is still experimental
25
. Stackmaps
only record
one
location
for a
value (e.g.
%7
is stored in
rdi
) but we need to know
all
of the
locations. We also needed to know additional information such as the callee-saved
registers passed to inlined functions. Stackmaps in ykllvm are thus extended to do just that.
We then had to deal with the fact that LLVM happily moves stackmaps around,
allowing instructions to be inserted between them and their once-neighbours.
This means that they no longer contain accurate information, and certainly
don’t serve as valid safepoints anymore! To solve that
we introduced a
new pass
that reassociates stackmaps with the correct neighbouring instruction, updating the
stackmap as appropriate.
However, there are a few parts of LLVM
which mess with things beyond our pass’s ability to fix. Our solution, alas, is
simply to turn some optimisations off in functions that can be traced. This
slows down the interpreter somewhat, but fortunately yk’s trace optimiser is
able to recover the lost performance once code is traced.
Deoptimisation also raises a completely different, more fundamental, problem
which has nothing to do with LLVM: what happens to stack pointers? Consider this code:
int
x = ...;
int
y = &x;
if
(z)
printf
("
%d
\n
", *y);
Let’s imagine that
z
is
false
when we record the trace but
true
at some
point later: we’ll deoptimise then call
printf
with a stack address. However,
that stack address relates to when JIT-compiled code was running, when the
stack had a very different layout: the chances of the address being the same
on the AOT stack is slim.
Fortunately, the fix for this is simple: we introduce a shadow stack (creating
a fresh shadow stack for each newly created thread) and place all variables
whose address is taken on it. The shadow stack is shared between AOT and JIT
frames — the latter does not extend or alter it. So, whether we’re running in
JIT or AOT compiled code, the address of
x
will be the same
26
.
Conclusions
A couple of weeks ago I finally got around to porting yklua from Lua-5.4.6 to
Lua 5.5.0: that’s two year’s worth of changes to Lua
27
. How long did
it take me? Well under 2 hours. I lost 30 minutes to a boring linker-related
oddity
28
, and not much less time to some external Lua tests that no
longer work on Lua-5.5. The “hard” part of the work took me at most an hour:
in any previous system, such changes would probably have taken months, if
not longer.
Anecdote though it may be, I hope this gives you a sense that yk
is a genuinely new point in the language performance design space, and that
it has the potential to benefit many languages and users.
All that said, I don’t want to pretend that yk is the finished article: it most
definitely isn’t! There are all sorts of things that need improving. Some are
minor and will be noticed by few. Some will have much more profound effects.
For example, at the time of writing, ykmicropython happens to work in a way
that hits several gaps in yk: the resulting performance can be comically bad.
Most of these are known gaps: filling them in isn’t rocket science, but it is
work.
I hope, though, that you can see the promise that yk has to offer: that I
could update yklua so quickly still seems magical to me! That said, we still have plenty of work
to do to realise yk’s full promise, but we’re a very small team (as of April, 1½ people if
you count generously), so please be realistic about the pace at which that can
occur!
For those who are interested in playing with yk, I suggest starting at the
yk
book
. Enjoy!
Acknowledgements
: Thanks to Edd Barrett and Lukas Diekmann for comments.
2026-04-15 12:35
Older
If you’d like updates on new blog posts: follow me on
Mastodon
or
Twitter
;
or
subscribe to the RSS feed
;
or
subscribe to email updates
:
Failed to subscribe: please load page and try again
Sending...
Footnotes
1
A quick note on terminology. Formally speaking, a VM contains
one or more language implementations, for example both an interpreter and a
JIT compiler. “VM” is therefore for the umbrella term, and I’ll only use a
specific term (“interpreter” or “JIT”) where it’s necessary to do so.
☒
A quick note on terminology. Formally speaking, a VM contains
one or more language implementations, for example both an interpreter and a
JIT compiler. “VM” is therefore for the umbrella term, and I’ll only use a
specific term (“interpreter” or “JIT”) where it’s necessary to do so.
2
One unexpected outcome of this is that I have developed a huge
admiration for the PUC Lua interpreter. It’s written in an
intimidating style, with seemingly endless C macros. Once I got my head
around it, I realised that it is one of the most thoughtful codebases I
have ever seen. Everything is there for a reason; there is a consistency
which one rarely sees; and it is incredibly efficient.
☒
One unexpected outcome of this is that I have developed a huge
admiration for the PUC Lua interpreter. It’s written in an
intimidating style, with seemingly endless C macros. Once I got my head
around it, I realised that it is one of the most thoughtful codebases I
have ever seen. Everything is there for a reason; there is a consistency
which one rarely sees; and it is incredibly efficient.
3
Which, itself, isn’t as big as I would like.
☒
Which, itself, isn’t as big as I would like.
4
For example, if you have an allocation heavy program, you will
see only small benefits from yklua; if your program spends all of its time in the
interpreter loop, or certain libraries, you might be lucky enough to see
6x. As you might expect, all these figures have been gradually improving
over time – mostly in yk, but also in yklua itself – and will continue to
do so. The speedups I’ve shown are thus best thought of as a floor, not a
ceiling, for these particular benchmarks.
☒
For example, if you have an allocation heavy program, you will
see only small benefits from yklua; if your program spends all of its time in the
interpreter loop, or certain libraries, you might be lucky enough to see
6x. As you might expect, all these figures have been gradually improving
over time – mostly in yk, but also in yklua itself – and will continue to
do so. The speedups I’ve shown are thus best thought of as a floor, not a
ceiling, for these particular benchmarks.
5
Well, almost. It’s a slight superset of Lua-5.1, with, I believe,
a smattering of features of Lua-5.2.
☒
Well, almost. It’s a slight superset of Lua-5.1, with, I believe,
a smattering of features of Lua-5.2.
6
Which, in this context, is a posh way of saying “gamble”.
☒
Which, in this context, is a posh way of saying “gamble”.
7
Yes, there are/were additional Python JIT compilers that did not become
public knowledge.
☒
Yes, there are/were additional Python JIT compilers that did not become
public knowledge.
8
If you want to think of yk as “RPython for LLVM”, you will be
very close to the truth!
☒
If you want to think of yk as “RPython for LLVM”, you will be
very close to the truth!
9
This old post of
mine
is better
than nothing, but I would explain things in a different way now.
☒
This old post of
mine
is better
than nothing, but I would explain things in a different way now.
10
I say “far more” rather than “all” because
method-based JITs still speculate on things like object types, concrete
values, and so on.
☒
I say “far more” rather than “all” because
method-based JITs still speculate on things like object types, concrete
values, and so on.
11
Though there is a subset of real programs where a well-written
tracing JIT compilers will win out.
☒
Though there is a subset of real programs where a well-written
tracing JIT compilers will win out.
12
And, alas, enough money. I do know people who’ve got into compilers
for the money. As a route to riches, it seems to me quixotic.
☒
And, alas, enough money. I do know people who’ve got into compilers
for the money. As a route to riches, it seems to me quixotic.
13
We put a
huge
amount of effort into doing this in
hardware. One day I will write up why that didn’t work.
☒
We put a
huge
amount of effort into doing this in
hardware. One day I will write up why that didn’t work.
14
Though, as of yet, nowhere near as many as I would like it to have.
☒
Though, as of yet, nowhere near as many as I would like it to have.
15
i.e.
yk_promote(x) == x
.
☒
i.e.
yk_promote(x) == x
.
16
This technique here can be easily adapted to situations where
opcodes aren’t immutable, but aren’t expected to change often. There
is a small overhead from doing so, but in the grand scheme of things, it’s
insignificant.
☒
This technique here can be easily adapted to situations where
opcodes aren’t immutable, but aren’t expected to change often. There
is a small overhead from doing so, but in the grand scheme of things, it’s
insignificant.
17
Unlike the similar concept of a pure function, an idempotent function
can have side-effects, but those side-effects must not change the output
value that is produced. For example, an idempotent function might insert a
key into a hashmap if it is not present, and then always return the same
key subsequently.
☒
Unlike the similar concept of a pure function, an idempotent function
can have side-effects, but those side-effects must not change the output
value that is produced. For example, an idempotent function might insert a
key into a hashmap if it is not present, and then always return the same
key subsequently.
18
Idempotency in yk is a
strong
guarantee on the interpreter
author’s behalf: unlike promotion, if the interpreter author gets
yk_idempotent
wrong, the system is in the land of undefined behaviour.
☒
Idempotency in yk is a
strong
guarantee on the interpreter
author’s behalf: unlike promotion, if the interpreter author gets
yk_idempotent
wrong, the system is in the land of undefined behaviour.
19
Though very modern CPUs are doing an ever-better job at making
loads and stores to the stack as cheap as register accesses. I’ve been stunned by
some of the results I’ve seen on recent AMD CPUs. That said, the general
point holds: even if
some
stack accesses are now more-or-less free,
many, probably most, are still expensive.
☒
Though very modern CPUs are doing an ever-better job at making
loads and stores to the stack as cheap as register accesses. I’ve been stunned by
some of the results I’ve seen on recent AMD CPUs. That said, the general
point holds: even if
some
stack accesses are now more-or-less free,
many, probably most, are still expensive.
20
For me, the key idea – and I admit that I don’t know if this is
how LuaJIT does it – was to think about register allocation as finding
diffs between the register state for the
n+1
th and
n
th instructions
☒
For me, the key idea – and I admit that I don’t know if this is
how LuaJIT does it – was to think about register allocation as finding
diffs between the register state for the
n+1
th and
n
th instructions
21
We do have a very simple “hints” API that can tell us what
register a value earlier in the trace will be in. This is mostly used for
things like function return arguments. Happily, from a performance
perspective, it’s only used on demand, requiring fairly few calls, and
isn’t an entire pass that requires O(n) memory.
☒
We do have a very simple “hints” API that can tell us what
register a value earlier in the trace will be in. This is mostly used for
things like function return arguments. Happily, from a performance
perspective, it’s only used on demand, requiring fairly few calls, and
isn’t an entire pass that requires O(n) memory.
22
This isn’t a great name, because it incorrectly implies
“control flow” when that might not actually be the case for a given
location. We haven’t yet come up with a better name.
☒
This isn’t a great name, because it incorrectly implies
“control flow” when that might not actually be the case for a given
location. We haven’t yet come up with a better name.
23
Not only are some of these names in flux, but I will be
adding additional concepts to yk soon. We needn’t concern ourselves with
them for now.
☒
Not only are some of these names in flux, but I will be
adding additional concepts to yk soon. We needn’t concern ourselves with
them for now.
24
Not to be confused with LLVM’s safepoints, which are also called
statepoints (
yes, really
), which
aren’t quite the same thing as patchpoints. As this
might suggest, there are multiple experimental APIs in this part of LLVM, mostly
intended to help garbage collectors — but none of these APIs seems
complete in its implementation, with some having obvious flaws.
Although I’ve seen several references to these features being used by
production VMs, it’s unclear to me whether they’re using stock LLVM or a
local fork (perhaps like ykllvm) where these things are fixed. For example, see this
brave
soul
who has enumerated several of the sharp edges in stock LLVM.
☒
Not to be confused with LLVM’s safepoints, which are also called
statepoints (
yes, really
), which
aren’t quite the same thing as patchpoints. As this
might suggest, there are multiple experimental APIs in this part of LLVM, mostly
intended to help garbage collectors — but none of these APIs seems
complete in its implementation, with some having obvious flaws.
Although I’ve seen several references to these features being used by
production VMs, it’s unclear to me whether they’re using stock LLVM or a
local fork (perhaps like ykllvm) where these things are fixed. For example, see this
brave
soul
who has enumerated several of the sharp edges in stock LLVM.
25
The
stackmap documentation
says that it should have enough information to rebuild stack frames:
An important motivation for this design is to allow a runtime to
commandeer a stack frame when execution reaches an instruction address
associated with a stack map. The runtime must be able to rebuild a stack
frame and resume program execution using the information provided by the
stack map. For example, execution may resume in an interpreter or a
recompiled version of the same function.
but our experience suggests it does not currently have all the necessary
information.
☒
The
stackmap documentation
says that it should have enough information to rebuild stack frames:
An important motivation for this design is to allow a runtime to
commandeer a stack frame when execution reaches an instruction address
associated with a stack map. The runtime must be able to rebuild a stack
frame and resume program execution using the information provided by the
stack map. For example, execution may resume in an interpreter or a
recompiled version of the same function.
but our experience suggests it does not currently have all the necessary
information.
26
Interestingly this fix doesn’t work for
setjmp
: if you call
setjmp
in JIT-compiled code, then deoptimise, and call
longjmp
, bad
things happen. We have a reasonable design for a fix for this in the works.
☒
Interestingly this fix doesn’t work for
setjmp
: if you call
setjmp
in JIT-compiled code, then deoptimise, and call
longjmp
, bad
things happen. We have a reasonable design for a fix for this in the works.
27
Lua’s
download page
shows that Lua-5.4.6
was released on 2023-05-02 and Lua-5.5.0 on 2025-12-15.
☒
Lua’s
download page
shows that Lua-5.4.6
was released on 2023-05-02 and Lua-5.5.0 on 2025-12-15.
28
Arguably I got lucky here. Linkers are mysterious things,
understood by few people. I, alas, am not in that select few.
☒
Arguably I got lucky here. Linkers are mysterious things,
understood by few people. I, alas, am not in that select few.
Comments
(optional)
(used only to verify your comment: it is not displayed)
Notify me of further comments
Failed to submit comment: please load page and try again
Submitting...
Home
>
Blog
≡
Home
>
Blog
email
Bluesky
Mastodon
Twitter
Bluesky
Twitter