crazy idea: how off the beaten path is your usage?

Warning: This is a potentially crazy idea; it may not work, and I’m probably not going to have time to investigate it further. Feel free to take it and run with it if you’re interested.

Given a set of integration tests, a library which is used by the system under test, and a set of unit tests for that library, can we assess how much the usage of the library implied by the integration tests varies from the expectations of the library authors?  In a sense, we’re looking for a metric of how stable the usage is likely to be.

Approach 1: Untested Lines

Run the unit tests for the library and collect line coverage for the library.  We want the entire set of lines covered, not just a percentage.

Run the integration tests, and collect line coverage for the library.

Subtract the lines covered by the unit tests from the lines covered by the integration tests.  The ratio of the size of this set to the total number of lines in the library is the metric we’re looking for.
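
Concretely, if the two coverage runs are exported as sets of (file, line) pairs – exactly how depends on your coverage tool – the metric is nearly a one-liner. A minimal Python sketch, with all names purely illustrative:

```python
# Sketch of the Approach 1 metric. Assumes line coverage from both runs has
# already been exported as sets of (file, line_number) pairs; how you get
# those sets depends on your coverage tool (coverage.py, gcov/lcov, ...).
def untested_usage_ratio(unit_covered, integration_covered, total_lines):
    """Fraction of library lines exercised by the integration tests but
    never touched by the library's own unit tests."""
    only_in_integration = integration_covered - unit_covered
    return len(only_in_integration) / total_lines

# Toy example with made-up file names:
unit = {("parser.c", 10), ("parser.c", 11)}
integration = {("parser.c", 10), ("parser.c", 42), ("lexer.c", 7)}
print(untested_usage_ratio(unit, integration, total_lines=1000))  # 0.002
```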

Approach 2: Critical Lines

Capture a corpus of test inputs for the library from the integration tests.  This assumes the library has a serializable data format, and that API usage can be recorded in a replayable way.  (Most any library has this property if you apply enough creativity in how you look at it.)
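
For a Python library whose entry points take picklable arguments, one crude way to capture such a corpus is to wrap the public API in a record/replay shim. This is only a sketch of the idea; the pickle-based format and the record/replay helpers are assumptions, and a real library will need whatever serialization it actually supports:

```python
# Hypothetical record/replay shim for capturing a corpus of library calls
# during an integration-test run. Assumes the entry points take picklable
# arguments; a real library needs whatever serialization it actually supports.
import functools
import pathlib
import pickle
import uuid

CORPUS_DIR = pathlib.Path("corpus")
CORPUS_DIR.mkdir(exist_ok=True)

def record(func):
    """Wrap a library entry point so every call is saved as a replayable input."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        path = CORPUS_DIR / f"{func.__name__}-{uuid.uuid4().hex}.pkl"
        path.write_bytes(pickle.dumps((func.__name__, args, kwargs)))
        return func(*args, **kwargs)
    return wrapper

def replay(path, api):
    """Re-issue a recorded call against api, a mapping of entry-point names to callables."""
    name, args, kwargs = pickle.loads(pathlib.Path(path).read_bytes())
    return api[name](*args, **kwargs)
```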

For each line which is covered by the integration tests but not by the unit tests (from above), introduce an assertion which always fails.  (This is obviously a step which should be automated!)
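
As a rough illustration of what that automation might look like for a Python source file, here’s a purely textual sketch; a real tool would want to work at the AST or bytecode level to cope with continuation lines, else/elif branches, and similar edge cases:

```python
# Purely textual sketch of the assertion-injection step for a Python source
# file: drop an always-failing assert, at matching indentation, in front of
# each line covered only by the integration tests. A real tool would work at
# the AST or bytecode level to handle continuation lines, else/elif branches,
# decorators, and so on.
def inject_failing_assertions(source_path, uncovered_lines):
    with open(source_path) as f:
        lines = f.readlines()
    out = []
    for lineno, line in enumerate(lines, start=1):
        if lineno in uncovered_lines:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f'{indent}assert False, "uncovered: {source_path}:{lineno}"\n')
        out.append(line)
    with open(source_path, "w") as f:
        f.writelines(out)
```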

Run each input collected from the integration tests against the modified library.  Some of the inputs will crash when hitting untested lines; others will not.  The ratio of the number of tests which fail to the total number of extracted tests is what we’re looking for.

By tracking which uncovered lines cause failures – i.e. are the first encountered by each failing test – we can produce a histogram of which lines are most critical to the integration tests, and use it to prioritize additional testing.
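
Putting the last two paragraphs together, a sketch of the replay-and-tally step might look like the following. It assumes the injected assertions raise AssertionError with the "uncovered: file:line" message from the sketch above, and reuses the hypothetical replay() helper:

```python
# Replay the captured corpus against the modified library and tally both the
# Approach 2 failure ratio and the per-line histogram.
from collections import Counter

def run_corpus(corpus_paths, api):
    first_failures = Counter()            # first uncovered line hit -> count
    failed = 0
    for path in corpus_paths:
        try:
            replay(path, api)             # replay() from the earlier sketch
        except AssertionError as e:
            failed += 1
            first_failures[str(e)] += 1   # message names the first uncovered line hit
    ratio = failed / len(corpus_paths)    # the Approach 2 metric
    return ratio, first_failures.most_common()
```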

Approach 3: Untested Features

Take the corpus collected in approach 2, and run it through a coverage-based corpus reducer.  (i.e. try to remove tests which aren’t required to preserve coverage)  The ratio of the number of tests left after reduction to the number we started with is our metric.
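
A greedy reducer in the spirit of tools like afl-cmin is enough to sketch the idea: keep an input only if it covers at least one line nothing kept so far covers. Here `coverage_of` is a stand-in for whatever measures per-input line coverage:

```python
# Greedy coverage-based corpus reducer. coverage_of() is assumed to return
# the set of (file, line) pairs exercised by one recorded input.
def reduce_corpus(corpus_paths, coverage_of):
    kept, seen = [], set()
    # Visit higher-coverage inputs first so small redundant inputs drop out.
    for path in sorted(corpus_paths, key=lambda p: len(coverage_of(p)), reverse=True):
        new_lines = coverage_of(path) - seen
        if new_lines:
            kept.append(path)
            seen |= new_lines
    return kept

# Approach 3 metric: len(reduce_corpus(corpus, coverage_of)) / len(corpus)
```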

Approach 4: Critical Failure Points

Take our reduced corpus from approach 3, and our binary modification tool from approach 2 (we did automate that, right?).  For each test, run it, identify the first uncovered line, and remove it from the set of failure-causing lines.  (i.e. pretend it was covered in unit tests)  Repeat until all tests in the reduced corpus pass.  The number of lines we had to mark as tested is our metric.  Alternatively, the ratio of said lines to the number of reduced tests might be useful.  The ratio of said lines to the number of lines in the library might also be useful, but is likely to be misleading.
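
A sketch of that loop, with `run_test` and `remove_assertion` as placeholders for replaying one input against the instrumented library and deleting one injected assertion, respectively:

```python
# Approach 4 loop. run_test() raises AssertionError naming the first
# uncovered line a test hits; remove_assertion() deletes that one injected
# assertion, i.e. pretends the line was covered by unit tests.
def critical_failure_points(reduced_corpus, run_test, remove_assertion):
    marked = set()
    while True:
        newly_marked = set()
        for test in reduced_corpus:
            try:
                run_test(test)
            except AssertionError as e:
                newly_marked.add(str(e))   # first uncovered line this test hit
        if not newly_marked:
            break                          # every test in the reduced corpus passes
        for line in newly_marked:
            remove_assertion(line)
        marked |= newly_marked
    return len(marked)                     # the Approach 4 metric
```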

(It is probably a good idea to apply a minimal bit of static analysis and mark any obviously post-dominated lines as tested at each step.  This would reduce the number of iterations required markedly.)