18th April 2023
Another update on the hobby JIT C compiler that I’ve been working on.
In trying to make updates faster and not-flaky, I previously hooked the compiler directly up to the editor’s buffer via RPC. This way, when the buffer is modified the changes to the code can be applied without mucking around waiting for file system notifications.
But, in the previous version there was still a mostly-unnecessary step
where the .c
file was provided over RPC, but it still generated an
object file (.dyo
) that went to disk. Now, the code and data are
generated directly to their final
location
in memory. There’s still a small “link” step to fix up the references
between .c
files. This made the code simpler, and also deleted
quite a bit of serialization code too. I did use the .dyo
files for
debugging sometimes, so I might need to re-add some introspection
functionality at some point.
… which leads to getting rid of the janky combination of nmake
and
GNU make
makefiles that I had been using to build the project. There’s
now a simple Ninja generator, which also
includes phony targets for running all the tests.
Previously, the tests had to be run sequentially because otherwise tests
that compiled a shared common .c
file would collide with where they were
writing the .dyo
, and would bork themselves1. Since there’s
now nothing written back to disk, ninja can trivially parallelize the test
running, which reduces test run time to about 20% of what it was before.
Additionally, there’s multiple reasonable build configs, so I don’t have
to keep fiddling with editing /Ox /GL
to /Od
or
/fsanitize=address
.
With the in-tree build made more tidy, I was trying to figure out a way
to reference its ninja file or Summon The Unspeakable Beast write a
CMakeLists.txt
to make it easy to use the compiler elsewhere. I
decided the simplest way to consume the compiler in an
embedded-into-a-larger-program context would be a single standalone .c
and .h
.
This was a bit messy, but with a little rearranging and some preprocessor substitutions while building the amalgamation, it also allows for creating an object file that only has four (non-static) exported functions along with a single data structure for configuration.
static
ing everything under the rug isn’t making the pig any
prettier2, but it made me happy to only see readelf
report 4
exports, rather than linking against a big pile of hoohaa, previously
including such outstanding function names as cast()
and link()
.
As an aside, mpack has an amalgamated build like this, and is a very nicely written C library, if you’re ever in need of MessagePack‘ing things around (that’s how the test shell does RPC with Neovim).
Once the compiler was more nicely embedded in the shell, I started thinking about debugging. I’m sort of trying to avoid doing a standard Attach Debugger, separate process-controlling-process that lets you single step, set breakpoints, etc.
Partly, I’m steering away from that because of course it’s a large
undertaking that involves lots of UI and threading and testing and
complexity. But also because it doesn’t feel like quite the right
solution when I can just as easily edit the running code to add a
print
as there’s no waiting around for a recompile, relink, relaunch.
There will certainly be situations where I’d want to “Step In/Out/Over”
to understand the flow of code, but being able to evaluate an arbitrary
expression in a particular context is a large chunk of my debugging
desires.
So I just added prints for a while, and realized what would be nice
(especially with this being C) would be to not have to fiddle around
with a whole mess of printf
format specifiers every time I wanted to
view something. It’s not too big a deal for a few counters or
values, but once you’ve got a tree of structures, a GUI debugger with
+
buttons to hop around in the structures is much better.
C of course, is notoriously just “code and ranges of memory” and doesn’t
know or care anything about types at runtime. But! we’re writing the
compiler, so now this particular C compiler has
<reflect.h>
.
That is a somewhat ambitious header name, but for now there’s a single
intrisinic function named _ReflectTypeOf(x)
that will return a
_ReflectType*
that describes the given expression or type x
.
I had a some moments of questioning my life choices when trying to create the user string version of names for types like these3. But after more attempts than I would admit to in a tech interview on either side of the table, I think it’s correct. -ish.
In the back of my mind while writing that was the acute awareness that
that type, and actually, most gibberish types (say, function pointers
taking a mess and returning worse) are, as far as C cares, all just 8
beautiful uncaring bytes. Once they’re jammed through a parser
and enough of a “type check” to satisfy someone that will happily
convert anything to or from a void*
: .quad 0
or sub rsp, 8
.
Anyway.
When starting on easing print debugging, at first I thought about doing
something like Python’s __repr__
. I think the functionality that’s in
reflect.h
now should allow adding something like __repr__
or closer
to the mostly-automatic Rust #[derive(Debug)]
(not including perhaps
some complexity dealing with string lifetimes).
Once that works, I was thinking it would be cool to have a single
keystroke that inserts a literal 🔍 directly in front of an expression.
Since we’re recompiling on every buffer change anyway, it’s not much
different than typing into a watch window. The compiler can do something
<handwave> to turn the funky Unicode into
__repr__
-plus-RPC-to-editor. And, if it’s serializing into a
structured format, the editor can also expand and collapse for viewing
the structures. And voila, a weird combination of printf
and a “Watch
Window”. Or maybe that won’t really work at all, I’m not sure.
I also thought more about updating data definitions when they change at
runtime. As previously described, only the functions and .rodata
(constants) are currently updated. Changes to structures apply to how
new code is compiled, but the compiler doesn’t try to go back to patch
your old data that already exists in memory.
In part this goes back to C being “code plus untyped memory ranges”: if
you’re sbrk
ing yourself some
memory and then saying that these bytes are that particular
structure, and then later you update the structure definition, the
compiler can’t plausibly help update objects (and I don’t really think
it would make a lot of sense to try).
But if you’re writing a simple game and have a global array of stuff:
typedef struct Particle {
double x;
double y;
double x_velocity;
double y_velocity;
} Particle;
Particle particles[MAX_PARTICLES];
then when you edit struct Particle
to include a pointer to a texture
and a colour, it doesn’t seem implausible that the particles
array
could be reallocated, spread out, and the new fields zero-initialized
for things to keep working as expected.
I’m not entirely sure about the details of that yet though. For example,
if you rename y_velocity
to y_speed
, you might expect it to “just”
be a rename and not touch any of your data. But if you renamed
y_velocity
to lifetime
with the intention of that “slot” in the
structure being something new, you might expect it to be zeroed. So I
think I’ll probably let that idea percolate some more. I assume
CLOS or
someone solved this 50 years ago, so perhaps someone can tell me how
It Shall Work.
Also emboldened by adding my first _[A-Z]
-prefixed extensions4 for
reflection, it might be time for some sort of blessed or partly-magic
(to achieve parametric polymorphism) versions of _Str
, _Vec
, and
_Dict
. Hacking away at this code really has increased my appreciation
for a carefully designed set of linked lists… but sometimes you really
do just want a vector of strings without the unpleasantness of writing
char buf[256]; sprintf(buf, ...);
just this one-last-time.
Of course, that could have been fixed in some other way, but it was never quite worth it. ↩
#BestSupportingMixedMetaphorsInAFeatureFilm
↩
I’m sure putting the half of the array return type on the far side of the function, and the seemingly-random extra required parens needed to disambiguate parsing made sense to someone in 1972. But only someone that was named dmr
. As the old UNIX-HATERS quote goes, substituting C
for UNIX
: “There are two major products that come out of Berkeley: LSD and UNIX. We don’t believe this to be a coincidence.” ↩
Section 7.1.3 Reserved for meeeeeeeeeee! Drunk on reserved identifier powers! ↩
Comments or corrections? Feel free to send me an email. Back to the front page.