Why not use gcroot?

In a couple of recent threads on llvmdev the question of what’s wrong with the existing garbage collection support in llvm has come up.  I’ve partially answered this in a couple of different places, but this post series is an attempt to group everything together into one consistent answer.

This post will focus on what I believe to be the fundamental issues with the gcroot mechanism used to support garbage collection in llvm.  In a follow-on post, I will discuss a few of the current implementation limits and how they might be addressed.  The second post is mostly for completeness' sake; as I hope you’ll gather from the first post, I believe pursuing the gcroot approach to be non-viable long term.

To give a bit of context, I’ve been looking into how to efficiently support GC in LLVM for several months.  We investigated the gcroot approach, but for the reasons described here have decided to pursue an alternate approach.  If this works out in practice, I’ll write a followup post sometime in the next few weeks.

To be clear, I do not mean this post as an attack on the authors or users of the current GC mechanism.  While I believe it can and should be enhanced, we likely would not have seriously considered LLVM if some form of GC support were not already present.  Having any support for GC at all says a lot about the openness of the LLVM community towards non-C family languages, and we were thrilled to see that.  Thank you.

Outline for this post:

  • What you need to know about GC
  • How gcroot works
  • gcroot can lose objects (for any GC)
  • gcroot is broken for GCs which relocate roots

GC Background

This section briefly introduces the key concepts of GC needed to explain my points.  I’m gearing this article towards folks who know compilers, but might not necessarily know the intricacies of GC very well.  Feel free to skip to the next section if desired.

A garbage collector is a tool for automatically managing memory.  At its most fundamental, a garbage collector recycles memory for an object when that object is no longer reachable through the object graph.  If the program provably can’t touch the memory again, it’s assumed to be safe to recycle.

Liveness of an object is generally established through reachability.  You start with a set of known ‘root’ values and recursively mark any object reachable from them.  Any object not reached during this traversal is assumed to be dead.  (For more of the basic terminology, you may find this previous post useful.)
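
To make the traversal concrete, here is a minimal mark-phase sketch in C++, assuming a hypothetical Object type that exposes its outgoing references; real collectors differ in many details:

#include <vector>

struct Object {
  bool marked = false;
  std::vector<Object*> fields;   // outgoing references to other objects
};

// Mark everything reachable from the roots; anything left unmarked
// afterwards is unreachable and can be reclaimed.
void markFrom(const std::vector<Object*>& roots) {
  std::vector<Object*> worklist(roots.begin(), roots.end());
  while (!worklist.empty()) {
    Object* obj = worklist.back();
    worklist.pop_back();
    if (obj == nullptr || obj->marked)
      continue;
    obj->marked = true;
    for (Object* field : obj->fields)
      worklist.push_back(field);
  }
}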

A relocating collector is one which moves objects around at runtime.  A collector doesn’t necessarily need to move objects to collect garbage, but relocating collectors have proven to be some of the most robust designs.  Explaining exactly why is beyond the scope of this post, but as a starting point, consider how a heap fragments over time and how this affects a non-relocating collector’s ability to reclaim free space.

To relocate objects that are reachable directly from running thread state, the collector needs points in the code where it can identify all the object pointers in the thread’s execution context, whether on the stack or in registers.  These points are called “safepoints”.
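
To picture the information a safepoint has to provide, here is a purely illustrative sketch (not any particular collector's or LLVM's actual data structures):

#include <cstdint>
#include <vector>

// Where one live pointer currently resides: either a machine register or a
// slot in the current stack frame.
struct PointerLocation {
  bool inRegister;
  int index;   // register number, or byte offset from the frame base
};

// What the collector needs at one safepoint: the instruction address it
// corresponds to, and where every live copy of every object pointer lives.
struct SafepointRecord {
  std::uintptr_t instructionAddress;
  std::vector<PointerLocation> liveRoots;
};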

Let’s restrict ourselves to single threaded programs for a moment so that we don’t have to do anything fancy to get atomicity.  In particular, let’s ignore the intricacies of concurrent collection.  There’s no need to get into read or write barriers for the purpose of this post.

A collector doesn’t necessarily have to move all objects in a particular collection cycle.  Many real world collectors don’t.  One common choice is to not move objects directly reachable from the thread state (stack + registers) of currently executing threads.  As we’ll see in a bit, relocating objects directly reachable from executing code requires a fair amount of compiler work.

A collector which chooses to not relocate objects reachable from thread state can also choose to be somewhat “conservative” about identifying such objects.  So long as every object so directly reachable doesn’t move, it doesn’t really matter (from a correctness perspective) if a few objects are falsely identified as being “roots” (i.e. reachable from thread state) or if some non-pointer value on the stack is treated like a possible root.

A collector which does relocate all objects can be called a fully relocating precise collector.  The “precise” bit comes from the fact that a collector which relocates everything doesn’t really have the option to be “conservative”.  Consider what might happen if you updated some random integer in the program which just happened to “look like” a pointer to an object.  Ouch.
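
A tiny, invented illustration of the hazard:

#include <cstdint>

// Ordinary program data that merely happens to share a bit pattern with a
// heap address; the value is made up for this example.
std::uint64_t checksum = 0x00007f3a12c48010;

// A conservative, non-relocating collector that mistakes `checksum` for a
// root merely keeps one object alive too long.  A relocating collector that
// "updated" it after moving the object would silently change the program's
// data, which is why a fully relocating collector must be precise.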

From a practical perspective, the company I work for has a fully relocating precise collector.  As such, this is where my primary interest lies.

How does gcroot work?

So how does the existing gcroot mechanism work?  At a conceptual level, it’s actually fairly straightforward.

The frontend is responsible for identifying all pointers in code being compiled which should be treated as roots.  This is done by creating a slot on the stack (via an alloca) and tagging that slot with a “gcroot” intrinsic.
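
As a rough illustration (not taken from any particular frontend), this is approximately what that setup looks like using LLVM's 3.x-era C++ API; Builder and TheModule stand in for the frontend's own IRBuilder<> and Module*, and exact signatures vary between versions:

// Rough sketch only: Builder is the frontend's IRBuilder<>, TheModule its
// llvm::Module*; exact signatures vary between LLVM versions.
llvm::AllocaInst *Slot =
    Builder.CreateAlloca(Builder.getInt8PtrTy(), nullptr, "gc.slot");
llvm::Function *GCRoot =
    llvm::Intrinsic::getDeclaration(TheModule, llvm::Intrinsic::gcroot);
llvm::Value *NoMetadata =
    llvm::ConstantPointerNull::get(Builder.getInt8PtrTy());
Builder.CreateCall(GCRoot, {Slot, NoMetadata});
// From here on, every update of the root must also store the pointer value
// (bitcast to i8* if necessary) into Slot so the collector can find it.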

The frontend also provides a GCStrategy which is responsible for deciding where safepoints are.  This information is used to record the address of a safepoint and the slot indices (i.e. stack offsets) into a side data structure during compilation.  This is done fairly late in the compilation process; in particular, it’s long after IR level optimization passes have run.  During optimization, there are no safepoints explicitly represented in the IR.
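
For reference, a bare-bones GCStrategy along the lines of the example in the LLVM 3.5-era GarbageCollection documentation might look like the sketch below; the field and enum names follow that documentation and may differ in other releases:

#include "llvm/CodeGen/GCStrategy.h"
#include "llvm/CodeGen/GCMetadata.h"

using namespace llvm;

namespace {
// Skeleton strategy: zero-initialize gcroot slots, request safepoints on
// loop back-edges, returns, and after calls, and emit stack maps.
class ExampleGCStrategy : public GCStrategy {
public:
  ExampleGCStrategy() {
    InitRoots = true;
    NeededSafePoints = 1 << GC::Loop | 1 << GC::Return | 1 << GC::PostCall;
    UsesMetadata = true;
  }
};

GCRegistry::Add<ExampleGCStrategy>
    X("gc name here", "example strategy for this post");
}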

We’ll skip most of the implementation details for the moment and return to those later on.  They’re not really relevant for the first post.  What is important to remember is that the gcroot mechanism only records the stack slots.  Any copy of that pointer value is not recorded.

Note: The documentation for gcroot can be found here.   The wording is somewhat confusing since it makes it sound like relocation of pointers tagged as gcroot is fully supported.  As we’ll explore in a moment, it’s not.

gcroot can miss roots (even for non-relocating collectors)

For gcroot to work, we need to preserve a non-trivial invariant.  One copy of a live pointer value must always be present in a location known to be a gcroot.  There can be other copies, but one copy must be in the gcroot location.  (This is true for any reachability based collector, not simply relocating ones.)

Consider the following bit of code from a made up language which happens to look like C++, but with garbage collection:

int* factory() {
  int* p = new int(0);
  return p;
}

For the sake of discussion, let’s assume we want to place a safepoint on method return.  (i.e. between the two lines of code)

This will get translated into LLVM which looks like this:

define i32* @factory() gc "gc name here" {
  ; This is the gc root setup
  %p = alloca i32*, align 8
  %tmp = bitcast i32** %p to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  store i32* null, i32** %p, align 8
  ; This is the allocation and initialization
  %1 = call noalias i8* @_Znwm(i64 4) #3
  %2 = bitcast i8* %1 to i32*
  store i32 0, i32* %2, align 4
  store i32* %2, i32** %p, align 8
  ; This is where we want the safepoint
  ; This is the return (including the required reload)
  %3 = load i32** %p, align 8
  ret i32* %3
}

Now so far everything looks fine.  The return value is reloaded from the known root location, and a safepoint inserted before the return would perform as expected.

The problem here is that compiler optimization passes are applied to this IR before the safepoint is inserted.  Running this IR through “opt -O3”, we get the following:

define i32* @factory() gc "gc name here" {
  ; Still the gcroot setup
  %p = alloca i8*, align 8
  call void @llvm.gcroot(i8** %p, i8* null)
  ; allocation, init, and return
  ; there is no store to the gcroot slot!
  %1 = call noalias i8* @_Znwm(i64 4) #2
  %2 = bitcast i8* %1 to i32*
  store i32 0, i32* %2, align 4
  ret i32* %2
}

When we try to insert the safepoint this time, we’ve got a serious problem.  All of the writes to the gcroot location have been optimized away.  There is no place in this function where we can insert a safepoint and capture the value of the pointer returned.  To put it another way, we have a live pointer which is untracked by the GC.  This is blatantly incorrect for any GC.

Now, what went wrong?  Essentially, LLVM was able to – correctly, given the semantics of this IR – eliminate the load from p right before the return.  The value of *p is easily inferred from the store immediately preceding it.  Given that, all of the stores to p become trivially dead.  Now to be clear, this is exactly what the optimizer is supposed to do with this IR.

Aside: One fix for the problems I’m describing would be to insert safepoints before any compiler optimization.  Assuming you constructed your safepoint correctly – use and update of every gcroot alloca slot, read/write all memory with volatile semantics, etc. – this would be adequate.

From everything I can gather from the current implementation and documentation, this is not the intended usage model of gcroot.  Adopting what I’ll call an “early safepoint insertion” model would be a modification to the existing scheme and suffers from one key problem of its own.  The inserted safepoints would absolutely cripple the optimizer.  Consider something simple like this bit of code:

int sum = 0;
for i in 0...n:
  sum += 1;

You’d expect the optimizer to convert this reduction into a single assignment of the form “sum = n”.  Without safepoints, that’s exactly what would happen.  If you inserted a safepoint on the backedge of the loop like so:

int sum = 0;
for i in 0...n:
  sum += 1;
  (all objects..) = safepoint(all objects...) clobber_all

The existing optimizer would no longer recognize the reduction.  Given this severe impact on practical performance, I consider early safepoint insertion a non-solution.

You could improve the behavior of the naive early safepoint scheme, but not without adjusting every single optimization pass one by one.  Even then, I strongly suspect you’d worsen overall optimization in the end.  And suffer a long bug tail as well.

The core issue here is that LLVM doesn’t know that stores and loads to gcroot locations are ‘special’.  For the invariant to be upheld, LLVM would need to know that another actor can observe the values in a gcroot location.  These additional semantics are not actually baked into the IR.  As a result, the optimizer is quite capable of completely invalidating the entire scheme gcroot relies on.

Now, we’ve shown this for one particular case.  I haven’t actually checked, but I’m fairly sure this could be easily extended to any case where a pointer value escapes from a local function.  A few such cases could be:

  • Storing into a global variable
  • Storing into an object field
  • Throwing an exception

The reason this doesn’t show up everywhere is that common code patterns inhibit the optimizations we’ve discussed.  If there had been a call – which can have unknown memory effects – anywhere after the final store, LLVM would not have been able to eliminate that key store.  That’s pure luck though, not something which can be relied upon to continue being true.

For example, what happens if LLVM decides to inline that call?  Now the previous unknown side effects become known.  If LLVM decides the inlined code doesn’t write to the gcroot alloca, it can once again eliminate the store.
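
To make that concrete, here is a hypothetical variant of the earlier source example; log_allocation() is an invented function whose body the optimizer initially cannot see:

int* factory_with_call() {
  int* p = new int(0);   // the frontend also stores p into its gcroot slot
  log_allocation();      // unknown memory effects keep that store alive...
  return p;              // ...until log_allocation() is inlined and LLVM can
                         // prove the inlined body never reads the slot
}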

There are a few ways I can think of to ‘fix’ this:

  1. You could mark every one of the stores to ‘p’ as volatile.  Given LLVM’s semantics for volatile access, this would be sufficient to preserve the invariant which gcroot relies on (a sketch follows this list).  However, it’s also going to greatly inhibit optimization of the IR.  From a performance perspective, this is not really a viable solution.
  2. You could extend LLVM to treat stores to gcroot locations differently.  (This would essentially be a limited form of volatile semantics.)  However, this would require a very non-trivial bit of analysis to determine the base object for every store.  In fact, doing so exactly is clearly impossible.  (e.g. *(cond ? &root : &nonroot) = pointer();)  As a result, your analysis would have to be somewhat conservative (i.e. occasionally classify things which aren’t gc roots as gc roots).  Beyond the major engineering effort required to make this work and modify all the places in LLVM which deal with stores, I suspect this would also have a detrimental effect on optimization.  You could restrict that to functions marked as GC, but still.  I really, really doubt you could get this accepted into the mainline of LLVM development.
  3. Place a call to an ‘extern’ function – one not known to the optimizer – at every site you might possibly want to insert a safepoint later.  This would have the effect of blocking the particular optimization we discussed.  You’d have to manually remove these fake calls when inserting safepoints, and you’d suffer most of the cost of early safepoint insertion.
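
To make option 1 concrete, here is a rough sketch of how a frontend might mark every access to the gcroot slot from the earlier sketch as volatile, again using the 3.x-era C++ API; PointerValue is assumed to already be an i8*:

// Rough sketch only; Slot is the gcroot-tagged alloca from the earlier
// sketch, and PointerValue is the object pointer, already bitcast to i8*.
Builder.CreateStore(PointerValue, Slot, /*isVolatile=*/true);
// ...
llvm::LoadInst *Reloaded =
    Builder.CreateLoad(Slot, /*isVolatile=*/true, "gc.reload");
// Volatile accesses cannot be deleted or freely reordered, which preserves
// the invariant, but it also prevents LLVM from optimizing around them.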

Worth explicitly noting is that the two options practically available to a frontend author – the first and third – are likely to have severe performance impact.  You might as well just do early safepoint insertion, and might actually be better off doing so.

I also want to emphasize that this problem is only one symptom of a broader issue.  The gcroot scheme is fundamentally problematic since it relies on special semantics which are not modelled in the IR – i.e. that stores to gcroot locations can be observed by the garbage collector.  This is one case where it breaks; there may be – and likely are – others as well.

gcroot is broken for GCs which relocate roots

The existing gcroot mechanism is not sufficient to support the relocation of objects directly reachable from executing thread state.

To implement a precise safepoint, you need to know the location of every pointer – in particular, every copy of every pointer value – which is currently in use.  Updating one copy of a pointer, but not another leads to very very nasty bugs.

Returning to our previous example, let’s take a variant of the optimized code where we assume that LLVM is not able to remove the stores (to avoid our previous issue), but did forward the value of the load.  We’d be left with:

define i32* @factory() gc "gc name here" {
  ; This is the gc root setup
  %p = alloca i32*, align 8
  %tmp = bitcast i32** %p to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  ; This is the allocation and initialization
  %1 = call noalias i8* @_Znwm(i64 4) #3
  %2 = bitcast i8* %1 to i32*
  store i32 0, i32* %2, align 4
  store i32* %2, i32** %p, align 8
  ; This is where we want the safepoint
  ; This is the return (with the load value forwarded)
  ret i32* %2
}

The core issue with the current gcroot mechanism is that it only tracks values on the stack.  In the above code, “%2” (the value forwarded in place of the load) is not visible to the GC at the safepoint.  More generally, any temporary virtual register copy of a pointer inserted by the compiler is not visible to the GC.  This value could easily be held in a register across the safepoint – which in gcroot exists only as a label and has no special semantics – and be used again afterwards.  Since the GC may have moved the object %2 pointed at, this is blatantly incorrect.

Now, all of the ‘solutions’ I described above could be applied here as well (this time to loads rather than stores).  As I said above, I believe these to be non-viable from a performance perspective.

Conclusion & Notes

I’ve highlighted two particular symptoms of what I believe to be a fundamental design issue with gcroot.  I believe we need a better mechanism which does not rely on a known set of ‘home’ locations for every root.  As I mentioned, using an early safepoint insertion model would solve some of the correctness concerns, but comes with a serious performance problem.

In the next few days, I will publish a separate post with a list of a few issues with the current implementation.  I consider these less interesting than the fundamental issues discussed here, but if you were to pursue a partial solution such as marking all reads and writes to gcroot locations as volatile, they could be important to address.

Notes

  • I used a recent snapshot of LLVM to create these examples.  Specifically, I used “clang version 3.5 (trunk 198264)”.
  • My thanks to Sanjoy, Michael, and Kris for feedback on an early version of this post.  As always, credit goes to those who help find mistakes and any remaining blame for uncaught mistakes goes to the author (me) for making them.

3 Comments

  1. Nice writeup! One thing which isn’t clear is – are safepoints not a fairly rare occurrence? (Say, somewhat akin to posix cancellation points) Would it be a big deal if these few points were specially tagged somehow and slower than the rest of the code?

    • A way to think about it is: *Taken* safepoints (cases where code is actually stopped and the safepoint information is used) are very rare dynamically. But when a thread is required to reach a safepoint, it needs to reach it with extreme urgency. Whenever a global safepoint operation is required (e.g. for GC, or for deoptimization on class loading, etc.), the whole world, including all other threads that have already reached their safepoints, is waiting for the longest-time-to-safepoint thread to reach its next safepoint.

      As a result, safepoint opportunities cannot be allowed to be far apart in time. This means the amount of code executed before the next safepoint opportunity is reached must be fairly small (think microseconds).

      Safepoint opportunities are dynamically very frequent in the generated code of most safepoint-capable runtimes, and occur at pretty much all method returns (or calls) and loop back-edges, except for some cases where they can be taken away for provably short running loops (short counted in usecs).

      So since safepoint opportunities are dynamically very frequent, and since the existence of a safepoint opportunity itself will usually dramatically restrict optimizations across it regardless of whether or not it is actually taken, the dynamic performance impact of early safepoint insertion would be huge (without all other optimizations being aware of their nature).

  2. @Stephen

    Any call which may trigger a safepoint is also a safepoint. As a result, there are a lot of call safepoints in most programs.

    Safepoints are actually triggered with fairly low dynamic frequency, but they’re encountered frequently – i.e. usually a safepoint does nothing when executed. You generally need the application threads to hit a safepoint “quickly” for the garbage collector to be able to make forward progress.

    Statically, safepoints are common in compiled code.
