For Christmas, please make a donation to the ACLU

I’ve had a multi-year tradition now of asking that all gifts to me be in the form of a donation to charity.  This year, I am continuing that trend, but with a specific immediate reason.  As all of you know, President-Elect Trump is a proud racist, misogynist, and homophobe.  Regardless of your politics*, we can all agree that the next couple of years are likely to be difficult ones on the civil rights front.  Please consider donating to the ACLU as a hedge against that likely future.

(*) If it sounds strange to you to hear that someone can support Trump without necessarily also supporting his bigotry, well, pay some attention to the world and read some of the perspectives being shared by the other side.  A large number of his supporters voted for him because they desperately wanted change and he was the only option they saw.  Some agree with him on all points, many don’t.

Crazy idea: how off the beaten path is your usage?

Warning: This is a potentially crazy idea; it may not work, and I’m probably not going to have time to investigate it further.  Feel free to take it and run with it if you’re interested.

Given a set of integration tests, a library which is used by the system under test, and a set of unit tests for that library, can we assess how much the usage of the library implied by the integration tests varies from the expectations of the library authors?  In a sense, we’re looking for a metric of how stable the usage is likely to be.

Approach 1: Untested Lines

Run the unit tests for the library and collect its line coverage.  We want the entire set of lines covered, not just a percentage.

Run the integration tests and collect line coverage for the library.

Subtract the lines covered by the unit tests from the lines covered by the integration tests.  The ratio of the size of this set to the total number of lines in the library is the metric we’re looking for.
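
To make the bookkeeping concrete, here’s a minimal C++ sketch of the metric computation, assuming coverage has already been collected into sets of (file, line) identifiers by whatever coverage tooling you already use; none of the names here come from a real tool.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <utility>

// A covered line, identified by file name and line number.
using Line = std::pair<std::string, unsigned>;

// Ratio of lines exercised only by the integration tests to the total
// number of executable lines in the library.
double untested_line_metric(const std::set<Line> &unit_covered,
                            const std::set<Line> &integ_covered,
                            std::size_t total_lines) {
  std::set<Line> only_integ;
  std::set_difference(integ_covered.begin(), integ_covered.end(),
                      unit_covered.begin(), unit_covered.end(),
                      std::inserter(only_integ, only_integ.begin()));
  return total_lines ? double(only_integ.size()) / total_lines : 0.0;
}

A score near zero suggests the integration tests stay on paths the library authors already test; a large score means much of the exercised library code has no unit test backing.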

Approach 2: Critical Lines

Capture a corpus of test inputs for the library from the integration tests.  This assumes the library has a serializable data format, and that API usage can be recorded in a replayable way.  (Most any library has this property if you apply enough creativity in how you look at it.)

For each line which is covered by the integration tests but not the unit tests (from above), introduce an assertion which always fails.  (This is obviously a step which should be automated!)

Run each input collected from the integration tests against the modified library.  Some of the inputs will crash when hitting untested lines; others will not.  The ratio of the number of tests which fail to the total number of extracted tests is what we’re looking for.

By tracking which uncovered lines cause failures – i.e. are the first encountered by each failing test – we can produce a histogram of which lines are most critical to the integration tests, and use it to prioritize test additions.
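
Here’s a hedged sketch of the replay loop and histogram.  The replay_input() hook is hypothetical: it stands in for running one recorded input against the trap-instrumented library and reporting which injected trap fired first.

#include <map>
#include <vector>

// Hypothetical hook: replays one recorded input against the modified
// library, returning the id of the first injected trap hit, or -1 if the
// input ran cleanly.
extern int replay_input(const std::vector<unsigned char> &input);

// Fraction of recorded inputs which hit an untested line, plus a histogram
// of which lines fail first (the highest-priority testing gaps).
double critical_line_metric(const std::vector<std::vector<unsigned char>> &corpus,
                            std::map<int, unsigned> &histogram) {
  unsigned failures = 0;
  for (const auto &input : corpus) {
    int line = replay_input(input);
    if (line >= 0) {
      ++failures;
      ++histogram[line];
    }
  }
  return corpus.empty() ? 0.0 : double(failures) / corpus.size();
}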

Approach 3: Untested Features

Take the corpus collected in approach 2, and run it through a coverage-based corpus reducer.  (i.e. try to remove tests which aren’t required to preserve test coverage)  The ratio of the number of tests left after reduction to the number we started with is our metric.
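
A single greedy pass gives the flavor of the reduction, though a real coverage-based reducer (e.g. something in the style of afl-cmin) works harder than this.  The coverage_of() hook is again hypothetical.

#include <cstddef>
#include <set>
#include <vector>

// Hypothetical hook: the set of line ids one input exercises.
extern std::set<int> coverage_of(const std::vector<unsigned char> &input);

// Keep a test only if it adds coverage beyond the already-kept tests;
// return the fraction of tests retained.
double feature_metric(const std::vector<std::vector<unsigned char>> &corpus) {
  std::set<int> seen;
  std::size_t kept = 0;
  for (const auto &input : corpus) {
    const std::size_t before = seen.size();
    const std::set<int> cov = coverage_of(input);
    seen.insert(cov.begin(), cov.end());
    if (seen.size() > before) // covered something new: keep it
      ++kept;
  }
  return corpus.empty() ? 0.0 : double(kept) / corpus.size();
}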

Approach 4: Critical Failure Points

Take our reduced corpus from approach 3, and our binary modification tool from approach 2 (we did automate that, right?).  For each test, run it, identify the first uncovered line, and remove it from the set of failure-causing lines.  (i.e. pretend it was covered in unit tests)  Repeat until all tests in the reduced corpus pass.  The number of lines we had to mark as tested is our metric.  Alternatively, the ratio of said lines to the number of reduced tests might be useful.  The ratio of said lines to the number of lines in the library might also be useful, but is likely to be misleading.

(It is probably a good idea to apply a minimal bit of static analysis and mark any obviously post-dominated lines as tested at each step.  This would reduce the number of iterations required markedly.)
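
The iterative loop might look like the following sketch, reusing the hypothetical replay_input() hook from approach 2, here extended to skip traps on lines we’ve already marked as tested.

#include <cstddef>
#include <set>
#include <vector>

// Hypothetical hook: replays one input, skipping traps on lines in
// marked_tested; returns the first trap hit otherwise, or -1 on a pass.
extern int replay_input(const std::vector<unsigned char> &input,
                        const std::set<int> &marked_tested);

// Mark first-failing lines as "tested" until the whole reduced corpus
// passes; the number of lines marked is the metric.
std::size_t critical_failure_points(
    const std::vector<std::vector<unsigned char>> &reduced_corpus) {
  std::set<int> marked;
  bool any_failed = true;
  while (any_failed) {
    any_failed = false;
    for (const auto &input : reduced_corpus) {
      int line = replay_input(input, marked);
      if (line >= 0) {
        marked.insert(line); // pretend this line had unit test coverage
        any_failed = true;
      }
    }
  }
  return marked.size();
}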

Quick thoughts on WebKit’s B3

I just finished reading Introducing the B3 JIT Compiler and wanted to jot down some thoughts.  Fair warning, this post is being written in a hurry.  I’m focusing on getting down my initial reactions rather than trying for a really well thought out post.  That may follow at a later time or it may not.

The first key bit is that the goal of this effort appears to be strictly compile time, not peak performance.  Understanding that makes the effort make a lot more sense.  It still seems a bit odd to me for the compile time of your *fourth tier* JIT to be that important, but given that I’m no expert in JavaScript, I’ll just accept that as a given.

In that context, I find the claimed 4.7x reduction in compile time surprisingly unexciting.  There’s enough low-hanging fruit in LLVM – in particular, a better representation for “check” nodes – that I would expect something of that magnitude to be possible within the existing framework they had.  Achieving a ~5x improvement in compile time with an entirely new compiler (and all of the engineering that implies) seems slightly disappointing.  Now it’s possible (heck, even likely!) that the new architecture will allow them to further drop compile time, but still…

The performance numbers quoted were uninformative at best.  From what I can gather from the write-up, the JetStream benchmark is highly influenced by compile time.  While I understand the goal (it’s a useful one), it doesn’t really say anything about the peak performance of the code generated by the two compilers.  Given that, it’s really hard to tell if B3 is actually break-even with the existing LLVM backend at peak.  It would have been really nice to see the same numbers with the compile time somehow removed or adjusted for.  (B3 with a sleep added to cancel out its compile time advantage?  A longer warmup period in a modified benchmark?)

Just to be clear, I’m not saying that the numbers presented are “wrong”.  Merely that they don’t answer the question I’d most like them to. 🙂

Other things that jumped out at me:

  • The points about matching the IR to the source language in an effort to reduce the number of nodes (and thus memory, and time) are spot on.  If what you’re going for is compile time above all else, using an IR which closely matches your source language is absolutely the right approach.  This same general idea (remove memory/nodes where they don’t provide enough value) is what’s motivating the removal of pointer-to-pointer bitcasts in LLVM right now.
  • The emphasis on the importance of the “check” node (i.e. early OSR exit if a condition fails) matches our experience as well.  You can see this in Swift’s IR too.  This is clearly an area that LLVM needs to improve.  I think we can do a lot better within LLVM, and I’m surprised they didn’t try to push that.  In particular, the aliasing problems mentioned could have been addressed with a custom AliasAnalysis instance.  Oh well.
  • The choice to use arrays (instead of lists) gives some interesting tradeoffs.  From a compile time perspective, a modify-and-compact scheme is likely a win.  Interestingly, this reminds me a lot of a mark-compact garbage collector (B3’s layout) vs a standard malloc/free allocator (LLVM’s layout).  Given typical object lifetimes in a compiler (short!), the collector approach is likely to be the right one.  It does raise some interesting challenges though: pointer equality can no longer be used, trivial dead code elimination is no longer trivial (which complicates various other transforms), and transforms have to deal with non-canonical forms due to extra identity nodes.  It’ll be really interesting to see where B3 goes on this basis alone.

A perspective on friendly C

I was talking about John Regehr’s Friendly C proposal and recent follow-on post tonight with a friend, and decided to jot down some thoughts in a sharable format.

I believe the idea of a friendly C variant is entirely feasible, but it poses an incredibly challenging design problem.  Every change considered needs to be validated against a deep knowledge of the implementation of the associated compiler, runtime environment, and the underlying hardware.

As a simple example, let’s consider trying to establish semantics for stray (i.e. out of bounds) reads and writes.  We can start by trying to define what happens for a stray read.  That’s fairly easy: we can simply return an undefined value.  We could even be a bit more restrictive and say that the value must be one which was written to that address by some part of the program.

(The vagueness in that last bit is to allow reordering under concurrent execution.  However, we’ve accidentally required atomic reads and writes, since we disallowed word tearing.  Is that a good thing or not?  There’s a cost to that, but maybe it’s a cost we’re willing to pay.  Or maybe not…)

Now let’s consider how to handle stray writes.  We could simply define them to be erroneous, but that just gets us back to undefined behavior in C/C++.  We’re trying to avoid that.  We either need to detect them, or provide a reasonable semantics.  Detecting arbitrary stray writes is a very hard problem.  We can easily handle specific categories of stray writes through techniques like implicit null checking, but detecting an arbitrary stray write requires something like full AddressSanitizer instrumentation (or possibly even more expensive checks).  I doubt anyone is willing to pay 2x performance for their C code to be more friendly.  If they were, why are they writing in C?

The challenge with giving stray writes defined semantics is deciding what a particular read returns.  Does it return the last value written to a particular address?  Or the last value written to the particular field of the given object?  With out of bounds writes, these are not necessarily the same.

It’s very tempting to have the read return the last value written to the underlying address, but that introduces a huge problem.  In particular, it breaks essentially all load-load forwarding.

int foo(int* p_int, float* p_float) {
  int a = *p_int;
  *p_float = 0.0f;  // under C's type-based aliasing rules, this store cannot alias *p_int
  return a - *p_int;
}

In the example above, your normal C compiler could return “0” because it assumes the intervening write can’t change the value at p_int.  An implementation of a friendly C variant with the semantics we’ve proposed could not.  In practice, this is probably unacceptable from a performance perspective; memory optimization (load-load forwarding and associated optimizations) is a huge part of what a normal C/C++ compiler does.  (see: BasicAA, MDA, GVN, DSE, EarlyCSE in LLVM)

If we want to avoid that problem, we could try to be more subtle in our definition.  Let’s say we instead defined a read as returning either the last value written to that field (i.e. by an in bounds write) or to the underlying memory address (i.e. by a stray write).  We still have the problem of requiring atomic memory access, but we seem to have allowed the compiler optimization we intended.

The problem with this definition is that we’ve introduced a huge amount of complexity into our language specification and compiler.  We now need separate definitions of our objects and their underlying addresses, along with all the associated implementation machinery.

Another approach would be to define a read as returning either the last value written to the field (if no stray write has occurred to that address) or an undefined value (if a stray write to that address has occurred).  Is that friendly enough?

Moreover, what happens if we improve our ability to detect stray writes?  Are we allowed to make that write suddenly fail?  Is a program which functions only because of a stray write correct?

(Before you dismiss this as ridiculous, I personally know of an emergency software release that did nothing but reintroduce a particular stray memory write in a C++ program because it happened to restore behavior that a client had been relying on for many many years.)

Hopefully, I’ve given you a hint of the complexities inherent in any friendly C proposal.  These are the same complexities involved in designing any new language.  If anything, designing a workable friendly C proposal is harder than designing a new language.  At least with a new language you’d have the freedom to change other aspects of the language to avoid having to deal with garbage pointers entirely; in practice, that’s often the much easier approach.

Remembering the Christmas spirit

Merry Christmas!  I hope you’re spending the day with family whether it be by blood, choice, or both.

This year, I’ve continued the tradition I’ve made over the last few years and made almost all of my gifts for friends and family in the form of a charitable donation.  I have also made a donation to charity my standing request whenever a friend or family member is considering gifts for me.

I would encourage anyone reading to consider doing the same in future years.  Do you really need that new gadget or gift card more than the good the same amount could do for those less fortunate?

This year, my holiday gifts went to two charities.

Avenues for Homeless Youth in Minneapolis, MN – Avenues provides housing and support services for homeless youth in the Twin Cities area.  They’re also a pioneer in community-based homeless services which aim to do more than provide emergency housing; they aim to unite homeless youth with a support system in the form of community volunteers and host homes.  If you wish to donate, here’s the direct link.  (The navigation from the homepage appears broken.)

Point Foundation – Point awards scholarships to LGBT youth, particularly those who’ve lost family support, who are actively giving back to their community.  This year, Point awarded 85 full-ride scholarships to students in need.  More than that, they provide a sense of hope and support to many college students considering coming out to their parents.  When weighing whether to come out to your parents – who likely control the purse strings to your college education and future – knowing someone has your back in the worst case is a major source of comfort.

Merry Christmas, Happy Holidays, and a Happy New Year!

Deoptimization terminology

Since I find myself talking about such things a good amount recently, I want to take a moment and define some deoptimization terminology as I use it.  There doesn’t seem to be a common lexicon for this stuff; every compiler community has developed its own vocabulary.

deoptimization – The event or process by which speculatively optimized code is corrected when the speculation made turns out to be wrong.

When a just-in-time compiler for a language like Java, JavaScript, Lisp, or Smalltalk is generating code, it’s faced with a variety of edge conditions which almost never occur.  For example, in most well behaved programs* you will never see a single range check violation.  It’d be really nice if we could optimize under the assumption that an out of bounds access doesn’t actually happen, but deal with it correctly when it does.

* Yes, there do exist bad programs which use range check violations to perform normal control flow.  They are evil, their developers should be ashamed, and don’t do that!

To my knowledge, there are two broad categories of deoptimization mechanisms.  One of those can possibly be subdivided, but the exact distinction is debatable.

The simplest mechanism is a simple side exit – also known as an uncommon trap, OSR exit, etc…  For this, you just emit a guard which checks the condition in question and branches to a bit of code which knows how to materialize the abstract state of the virtual machine and transition into running another piece of code that knows how to deal with the unusual condition.  In most cases, this ‘other piece of code’ is your interpreter.  There’s a lot of detail here, but at least in concept it’s pretty straightforward.
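
As a hand-waved illustration, here’s roughly what the compiled fast path looks like, written as C++ rather than generated code; InterpState and deopt_to_interpreter() are made-up stand-ins for the VM’s state-materialization machinery.

struct InterpState; // abstract VM state the side exit must materialize

// Stand-in for the runtime routine that rebuilds interpreter state and
// resumes execution of the slow path there.
extern int deopt_to_interpreter(InterpState *state, int index);

int speculated_element_load(const int *array, int length, int index,
                            InterpState *state) {
  if (index < 0 || index >= length)            // guard: the speculated condition
    return deopt_to_interpreter(state, index); // side exit / uncommon trap
  return array[index];                         // speculatively optimized fast path
}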

As an extension of the previous, we can also play tricks to represent the branch of control without actually using a branch instruction.  The most famous example here is the implicit null check, but nothing says this technique can’t be applied to other varieties of checks.  The general approach is to construct the speculatively optimized code in such a way as to generate a signal or other hardware fault if the condition turns out not to be true.  The virtual machine can then catch the signal, look up the faulting address in a side table, and branch to the handler.  Other than the indirection through the signal handler, and a lot of complexity in implementation, this is pretty much just another form of side exit.

Alternatively, we can generate code without any form of checks, but have a way to “shoot down” the generated code if our assumption turns out to be wrong.  In Java, we like to do what is called Class Hierarchy Analysis – essentially whole-program optimization based on the current inheritance structure, subject to the assumption that the structure isn’t going to change.  Since Java allows class loading, this is of course not an entirely safe assumption.

To make such a mechanism work, you need two things: a) the ability to stall an event such as class loading until the system can respond by invalidating code, and b) a mechanism to “quickly” invalidate any code which was compiled under now-broken assumptions.  As long as the class loading event isn’t observable (i.e. doesn’t complete) until the problematic code is invalidated, we’re good to go.
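
A minimal sketch of the bookkeeping this implies, with made-up types; the real thing also has to deal with activations of the invalidated code already on some thread’s stack, which is where most of the complexity hides.

#include <string>
#include <unordered_map>

struct CompiledMethod {
  bool valid = true; // a real VM would patch the code's entry point instead
};

// Map from "class C has no overriding subclass" style assumptions to the
// code compiled under them.
std::unordered_multimap<std::string, CompiledMethod *> dependencies;

// Called while the class loading event is stalled: nothing may observe the
// new class until every dependent method has been invalidated.
void on_class_load(const std::string &broken_assumption) {
  auto range = dependencies.equal_range(broken_assumption);
  for (auto it = range.first; it != range.second; ++it)
    it->second->valid = false;
  dependencies.erase(broken_assumption);
}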

I refer to such a process as an invalidation deoptimization.  I’m going to skip the implementation details for now to keep this post relatively short; I might return to them at another time.  As you can probably guess, there’s a lot of complexity in invalidating running code in a timely manner.

Observations on fuzzing in practice

I’ve been watching with interest over the last few months as first afl-fuzz, and more recently llvm-fuzz, have come into existence and gained – in some circles at least – prominence.  The interesting part to me is not the technology per se – fuzzing is pretty old news in academic circles – but the fact that both fuzzers appear to be effective in practice.

One interesting observation – which really isn’t surprising in retrospect, but sure wasn’t obvious beforehand – is that fuzzing works best when the programmer writes a fuzz driver which directly exercises functionality of interest based on a fuzzed binary input.  Analogous to the way you might have to factor your code a bit differently to make it testable, it appears we might need to start thinking about factoring out code to make it fuzzable.
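
For instance, a driver in the libFuzzer style might look like the sketch below, where parse_config() is a made-up stand-in for whatever library functionality you’ve factored out to accept raw bytes.

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical library entry point we want exercised.
extern bool parse_config(const std::string &text);

// libFuzzer-style entry point: the fuzzer calls this repeatedly with
// mutated inputs, using coverage feedback to guide mutation.
extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t *data,
                                      std::size_t size) {
  parse_config(std::string(reinterpret_cast<const char *>(data), size));
  return 0; // crashes, not return values, are the signal
}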

Another interesting result is that afl-fuzz has created what is effectively an unofficial standard for representing test corpora, and that cross-feeding fuzzers appears to make all of them more effective.  Pointing different fuzzers, or even different instances of the same fuzzer configured slightly differently, at the same set of output directories appears to find more crashing inputs than either fuzzer by itself.  It’ll be interesting to see what happens over the next couple of months as others start porting their bug finding tools to the same standard.  I suspect that the same effect would apply when adding test generators based on symbolic execution, concolic execution, or even bounded model checking; it’ll be interesting to see if that actually happens in practice.

One thing I’d like to see happen is the creation of an open fuzzing platform.  Something that a programmer could upload a github url to and get back – possibly a day or two later – a set of test cases which cause the program built from that repository to fail.  The tricky part of doing this wouldn’t actually be technical; it would be solving all of the configuration, reporting, and economic issues involved.  I’ve been playing with something in this direction myself, but have to admit that project has stalled.  Hopefully, someone out there will beat me to it.

How much does it cost to maintain LLVM?

In early October of 2014, I started collecting changes that I saw fly by on llvm-commits that I thought would be straightforward to automate.  I was trying to be pretty conservative, so these tend to be pretty basic things: fixing deceptive whitespace around an if clause, removing the name of a method from its doxygen comment, removing a couple of syntactically redundant semicolons, and things of similar complexity.  These weren’t chosen because they were interesting, but precisely because they were not.

In the 66 days since I started collecting, I’ve saved 105 unique commits.  That’s a bit less than 2 per day, and only about 1.6% of the 6,500 commits made to LLVM in that time.

Let’s assume that each of those changes took an average of 15 minutes on the part of their author.  That’s not too much more than a single build and test cycle, so it seems like a reasonable estimate.  At roughly $2 per developer minute, we can guesstimate that each of these changes cost about $30.  Taken together, these 105 changes consumed a bit over 26 hours of developer time at a cost of about $3,150.

This gives us a value for straightforward code maintenance activities of roughly $1,400 per month (or roughly $50 per day).

If anything, this is an extremely low estimate.  I know several of these changes required review, and at least a couple of them broke the build and had to be reverted.  We could probably add in several more hours of developer time just for that alone.

Now this is only a small fraction of the roughly $88,000 in development time going to the project as a whole each month*, but it’s still pretty material.

* Using the same logic as above: 15 minutes per change, $2 per developer minute, 6500 changes, llvm repository only.  It goes without saying that this is a massive understatement of the actual value of the contributed work.

A thought on the correctness of refactoring

The hard part about writing a refactoring tool isn’t the actual code transformation; it’s establishing that the code transformation you wrote didn’t break your code.

The traditional approach to validating the transform is to make it correct by construction.  The focus is on the correctness of the transformation in general, rather than a particular instance of the transformation as applied to a particular codebase.  Getting a transformation right for all possible code bases is hard and requires a lot of semantic knowledge about the language and all of its corner cases.  This is why semantics-aware tools like clang-tidy and clang-modernize are such a big deal.

It’s interesting to think about what the inverse might be: validating a particular application of a transformation, rather than the transformation in general.  There are certainly some code changes – regardless of how they’re performed – which are semantics preserving.  For example, replacing tabs with spaces in a C++ file probably doesn’t change a single bit in the final program.*  My strong suspicion is that there are entire classes of transformations which could be validated in this way.  In particular, I wonder how much of the program verification literature could be profitably adapted to this problem.

Assuming you could make this work, it would seem to make writing a one-off refactoring tool one heck of a lot simpler.  You could write it using any combination of grep/awk and other text processing tools, not worry about getting it right in general, apply it, and just check whether the result was correct.  If it wasn’t, you’d fix the bug (exclude that file, be a bit smarter about the language construct, whatever) and repeat.  A sketch of what that check might look like follows the footnote below.

* For the purposes of this post, let’s ignore undefined behaviour.
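
For a flavor of what the “just check” step could be, here’s a hedged sketch: compile the file before and after the textual transform and compare the output bits.  It assumes clang++ and cmp are on the path, and in practice incidental metadata (embedded source file names, debug info) would need normalizing before a byte comparison is meaningful.

#include <cstdio>
#include <cstdlib>

// Returns true if the refactored file produces byte-identical object code.
bool refactor_preserved_bits(const char *before_cpp, const char *after_cpp) {
  char cmd[512];
  std::snprintf(cmd, sizeof cmd, "clang++ -O2 -c %s -o before.o", before_cpp);
  if (std::system(cmd) != 0)
    return false;
  std::snprintf(cmd, sizeof cmd, "clang++ -O2 -c %s -o after.o", after_cpp);
  if (std::system(cmd) != 0)
    return false;
  return std::system("cmp -s before.o after.o") == 0; // 0 iff byte-identical
}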

Statepoint patches have landed

Early this week, I was able to submit the majority of the changes to support Statepoints in LLVM.  Before anyone gets too excited, let me note a few things:

  1. This is the less interesting part of our GC work.  The more interesting parts of late safepoint insertion have not landed yet and are likely a few months out.  Statepoints is about providing a robust implementation for generating stack maps for a relocating garbage collector, not about inserting the actual safepoints.
  2. The implementation is fairly robust, but is not yet really burnt in.  If you want to start playing with it, feel free, but expect to run into some bugs and have to submit some patches.  The good news is that those patches will go directly to llvm-commits and we’ll be reviewing them there.  No more of this weird split public/private branch stuff.  Yeah!
  3. There’s lots of tuning work to be done.  I’ve started filing bugs to track some of these items, but there’s a lot more to be filed.  In particular, there’s lots of room to tweak semantics to make statepoints less invasive for non-relocating GCs.
  4. The current interface for frontends is somewhat rough.  I’m in the process of working on a patch to add IRBuilder routines, but that hasn’t landed yet.  Similarly, there’s no in tree parser for the StackMap section.  (Patches for usability issues are extremely welcome!)

The best place to start is probably the documentation.  For a discussion on how statepoints compares with gcroot, see this previous post.  Questions are best directed to the LLVM mailing list.