Unit-testing of inherently random/non-deterministic algorithms

by KeithS   Last Updated November 15, 2018 02:05 AM

My current project, succinctly, involves the creation of "constrainably-random events". I'm basically generating a schedule of inspections. Some of them are based on strict schedule constraints; you perform an inspection once a week on Friday at 10:00AM. Other inspections are "random"; there are basic configurable requirements such as "an inspection must occur 3 times per week", "the inspection must happen between the hours of 9AM-9PM", and "there should not be two inspections within the same 8-hour period", but within whatever constraints were configured for a particular set of inspections, the resulting dates and times should not be predictable.

Unit tests and TDD, IMO, have great value in this system as they can be used to incrementally build it while its full set of requirements is still incomplete, and make sure I'm not "over-engineering" it to do things I don't currently know I need. The strict schedules were a piece-o'-cake to TDD. However, I'm finding it difficult to really define what I'm testing when I write tests for the random portion of the system. I can assert that all times produced by the scheduler must fall within the constraints, but I could implement an algorithm that passes all such tests without the actual times being very "random". In fact that's exactly what's happened; I found an issue where the times, though not predictable exactly, fell into a small subset of the allowable date/time ranges. The algorithm still passed all assertions I felt I could reasonably make, and I could not design an automated test that would fail in that situation, but pass when given "more random" results. I had to demonstrate the problem was solved by restructuring some existing tests to repeat themselves a number of times, and visually check that the times generated fell within the full allowable range.

Does anyone have any tips for designing tests that should expect non-deterministic behavior?

EDIT: Thanks to all for the suggestions. The main opinion seems to be that I need a deterministic test in order to get deterministic, repeatable, assertable results. Makes sense.

I created a set of "sandbox" tests that contain candidate algorithms for the constraining process (the process by which a byte array that could be any long becomes a long between a min and max). I then run that code through a FOR loop that gives the algorithm several known byte arrays (values from 1 to 10,000,000 just to start) and has the algorithm constrain each to a value between 1009 and 7919 (I'm using prime numbers to ensure an algorithm wouldn't pass by some serendipitous GCF between the input and output ranges). The resulting constrained values are counted and a histogram produced. To "pass", all inputs must be reflected within the histogram (sanity to ensure we didn't "lose" any), and the difference between any two buckets in the histogram cannot be greater than 2 (it should really be <= 1, but stay tuned). The winning algorithm, if any, can be cut and pasted directly into production code and a permanent test put in place for regression.

Here's the code:

    private void TestConstraintAlgorithm(int min, int max, Func<byte[], long, long, long> constraintAlgorithm)
    {
        var histogram = new int[max-min+1];
        for (int i = 1; i <= 10000000; i++)
        {
            //This is the stand-in for the PRNG; produces a known byte array
            var buffer = BitConverter.GetBytes((long)i);

            long result = constraintAlgorithm(buffer, min, max);

            histogram[result - min]++;
        }

        var minCount = -1;
        var maxCount = -1;
        var total = 0;
        for (int i = 0; i < histogram.Length; i++)
        {
            Console.WriteLine("{0}: {1}".FormatWith(i + min, histogram[i]));
            if (minCount == -1 || minCount > histogram[i])
                minCount = histogram[i];
            if (maxCount == -1 || maxCount < histogram[i])
                maxCount = histogram[i];
            total += histogram[i];
        }

        Assert.AreEqual(10000000, total);
        Assert.LessOrEqual(maxCount - minCount, 2);
    }

    [Test, Explicit("sandbox, does not test production code")]
    public void TestRandomizerDistributionMSBRejection()
    {
        TestConstraintAlgorithm(1009, 7919, ConstrainByMSBRejection);
    }

    private long ConstrainByMSBRejection(byte[] buffer, long min, long max)
    {
        //Strip the sign bit (if any) off the most significant byte, before converting to long
        buffer[buffer.Length-1] &= 0x7f;
        var orig = BitConverter.ToInt64(buffer, 0);
        var result = orig;
        //Apply a bitmask to the value, removing the MSB on each loop until it falls in the range.
        var mask = long.MaxValue;
        while (result > max - min)
        {
            mask >>= 1;
            result &= mask;
        }
        result += min;

        return result;
    }

    [Test, Explicit("sandbox, does not test production code")]
    public void TestRandomizerDistributionLSBRejection()
    {
        TestConstraintAlgorithm(1009, 7919, ConstrainByLSBRejection);
    }

    private long ConstrainByLSBRejection(byte[] buffer, long min, long max)
    {
        //Strip the sign bit (if any) off the most significant byte, before converting to long
        buffer[buffer.Length - 1] &= 0x7f;
        var orig = BitConverter.ToInt64(buffer, 0);
        var result = orig;

        //Bit-shift the number 1 place to the right until it falls within the range
        while (result > max - min)
            result >>= 1;

        result += min;
        return result;
    }

    [Test, Explicit("sandbox, does not test production code")]
    public void TestRandomizerDistributionModulus()
    {
        TestConstraintAlgorithm(1009, 7919, ConstrainByModulo);
    }

    private long ConstrainByModulo(byte[] buffer, long min, long max)
    {
        buffer[buffer.Length - 1] &= 0x7f;
        var result = BitConverter.ToInt64(buffer, 0);

        //Modulo divide the value by the range to produce a value that falls within it.
        result %= max - min + 1;

        result += min;
        return result;
    }

... and here's the results:

enter image description here

LSB rejection (bit-shifting the number until it falls within the range) was TERRIBLE, for a very easy-to-explain reason; when you divide any number by 2 until it's less than a maximum, you quit as soon as it is, and for any non-trivial range, that will bias the results towards the upper third (as was seen in the detailed results of the histogram). This was exactly the behavior I saw from the finished dates; all of the times were in the afternoon, on very specific days.

MSB rejection (removing the most significant bit off the number one at a time until it's within the range) is better, but again, because you're chopping off very large numbers with each bit, it's not evenly distributed; you're unlikely to get numbers in the upper and lower ends, so you get a bias toward the middle third. That might benefit someone looking to "normalize" random data into a bell-ish curve, but a sum of two or more smaller random numbers (similar to throwing dice) would give you a more natural curve. For my purposes, it fails.

The only one that passed this test was to constrain by modulo division, which also turned out to be the fastest of the three. Modulo will, by its definition, produce as even a distribution as possible given the available inputs.



Answers 14


Apart from validating that your code doesn't fail, or throws right exceptions in right places you can create a valid input/response pairs (even calculating this manually), feed the input in the test and make sure it returns expected response. Not great, but that's pretty much all you can do, imho. However, in your case it's not really random, once you create your schedule you can test for rule conformity - must have 3 inspections per week, between 9-9; there's no real need or ability to test for exact times when inspection occured.

Evgeni
Evgeni
February 02, 2012 20:31 PM

What you actually want to test here, I assume, is that given a specific set of results from the randomiser, the rest of your method performs correctly.

If that's what you're looking for then mock out the randomiser, to make it deterministic within the realms of the test.

I generally have mock objects for all kinds of non-deterministic or unpredictable (at the time of writing the test) data, including GUID generators and DateTime.Now.

Edit, from comments: You have to mock the PRNG (that term escaped me last night) at the lowest level possible - ie. when it generates the array of bytes, not after you turn those into Int64s. Or even at both levels, so you can test your conversion to an array of Int64 works as intended and then test separately that your conversion to an array of DateTimes works as intended. As Jonathon said, you could just do that by giving it a set seed, or you can give it the array of bytes to return.

I prefer the latter because it won't break if the framework implementation of a PRNG changes. However, one advantage to giving it the seed is that if you find a case in production that didn't work as intended, you only need to have logged one number to be able to replicate it, as opposed to the whole array.

All this said, you must remember that it's called a Pseudo Random Number Generator for a reason. There may be some bias even at that level.

pdr
pdr
February 02, 2012 20:53 PM

what you can unit test is the logic determine if the random dates are valid or if another random date needs to be selected.

There is no way to test a random date generator short of getting a bunch of dates and deciding if they are suitably random.

Ryathal
Ryathal
February 02, 2012 21:46 PM

This is going to sound like a stupid answer, but I'm going to throw it out there because this is how I've seen it done before:

Decouple your code from the PRNG - pass the randomization seed into all of the code that uses randomization. Then you can determine the 'working' values from a single seed (or multiple seeds of that would make you feel better). This will give you the ability to adequately test your code without having to rely on the law of large numbers.

It sounds inane, but this is how the military does it (either that or they use a 'random table' that isn't really random at all)

Jonathan Rich
Jonathan Rich
February 02, 2012 22:19 PM

There's really no better way than running it a bunch of times and seeing if you get the distribution you want. If you have 50 allowed potential inspection schedules, you run the test 500 times and make sure each schedule is used close to 10 times. You can control your random generator seeds to make it more deterministic, but that will also make your tests more tightly coupled to the implementation details.

Karl Bielefeldt
Karl Bielefeldt
February 02, 2012 22:53 PM

You might want to take a look at Sevcikova et al: "Automated Testing of Stochastic Systems: A Statistically Grounded Approach" (PDF).

The methodology is implemented in various test cases for the UrbanSim simulation platform.

krlmlr
krlmlr
February 02, 2012 23:58 PM

It is not possible to test a nebulous condition that has no concrete definition. If the generated dates pass all tests then theoretically your application is functioning correctly. The computer cannot tell you if the dates are "random enough" because it cannot acknowledge the criteria for such a test. If all tests pass but the behavior of the application is still not suitable then your test coverage is empirically inadequate (from a TDD perspective).

In my view, your best best is to implement some arbitrary date generation constraints so that the distribution passes a human smell test.

leo
leo
February 03, 2012 04:57 AM

Just record the output of your randomizer (whether pseudo or quantum/chaotic or real world). Then save and replay those "random" sequences that fit your test requirements, or that expose potential issues and bugs, as you build your unit test cases.

hotpaw2
hotpaw2
February 03, 2012 05:45 AM

For the unit tests, replace the random generator with a class that generates predictable results covering all corner cases. I.e. make sure your pseudo-randomizer generates the lowest possible value and the hightest possible value, and the same result several times in a row.

You don't want your unit tests to overlook e.g. off-by-one bugs happening when Random.nextInt(1000) returns 0 or 999.

user281377
user281377
February 03, 2012 07:27 AM

"Is it random (enough)" turns out to be an incredibly subtle question. The short answer is that a traditional unit test just won't cut it - you'll need to generate a bunch of random values and submit them to various statistical tests that give you a high confidence that they are random enough for your needs.

There will be a pattern - we're using psuedo-random number generators after all. But at some point things will be "good enough" for your application (where good enough varies a LOT between say games at one end, where relatively simple generators suffice, all the way up to cryptography where you really need sequences to be infeasible to determine by a determined and well equipped attacker).

The Wikipedia article http://en.wikipedia.org/wiki/Randomness_tests and its follow up links have more information.

James Iry
James Iry
February 03, 2012 17:44 PM

I have two answers for you.

=== FIRST ANSWER ===

As soon as I saw the title of your question, I came to jump in and propose the solution. My solution was the same as what several others have proposed: to mock out your random number generator. After all, I have built several different programs that required this trick in order to write good unit tests, and I have begun making mockable access to random numbers a standard practice in all my coding.

But then I read your question. And for the particular issue that you describe, that is not the answer. Your problem was not that you needed to make predictable a process that used random numbers (so it would be testable). Rather, your problem was to verify that your algorithm mapped uniformly random output from your RNG to uniform-within-the-constraints output from your algorithm -- that if the underlying RNG was uniform it would result in evenly distributed inspection times (subject to the problem constraints).

That's a really hard (but fairly well-defined) problem. Which means it's an INTERESTING problem. Iimmediately began to think of some really great ideas for how to solve this. Back when I was a hotshot programmer I might have started doing something with these ideas. But I'm not a hotshot programmer anymore... I like to that I am more experienced and more skilled now.

So instead of diving in to the hard problem, I thought to myself: what is the value of this? And the answer was disappointing. Your bug got solved already, and you'll be diligent about this issue in the future. External circumstances cannot trigger the problem, only changes to your algorithm. The ONLY reason for tackling this interesting problem was in order to satisfy the practices of TDD (Test Driven Design). If there is one thing that I've learned it is that blindly adhering to any practice when it isn't valuable causes problems. My suggestion is this: Just don't write a test for this, and move on.


=== SECOND ANSWER ===

Wow... what a cool problem!

What you need to do here is to write a test that verifies that your algorithm for selecting inspection dates and times will produce output that is uniformly distributed (within the problem constraints) if the RNG it uses produces uniformly distributed numbers. Here are several approaches, sorted by level of difficulty.

  1. You can apply brute force. Just run the algorithm a WHOLE bunch of times, with a real RNG as input. Inspect the output results to see if they are uniformly distributed. Your test will need to fail if the distribution varies from perfectly uniform by more than a certain threshhold, and to ensure you catch problems the threshhold can't be set TOO low. That means that you will need a HUGE number of runs in order to be sure that the probability of a false positive (a test failure by random chance) is very small (well < 1% for a medium-sized code base; even less for a big code base).

  2. Consider your algorithm as a function that takes the concatenation of all RNG output as an input, then produces inspection times as an output. If you know that this function is piecewise continuous, then there is a way to test your property. Replace the RNG with a mockable RNG and run the algorithm numerous times, producing uniformly distributed RNG output. So if your code required 2 RNG calls, each in the range [0..1], you might have the test run the algorithm 100 times, returning the values [(0.0,0.0), (0.0,0.1), (0.0,0.2), ... (0.0,0.9), (0.1,0.0), (0.1,0.1), ... (0.9,0.9)]. Then you could check whether the output of the 100 runs was (approximately) uniformly distributed within the allowed range.

  3. If you REALLY need to verify the algorithm in a reliable fashion and you can't make assumptions about the algorithm OR run a large number of times, then you can still attack the problem, but you might need some constraints on how you program the algorithm. Check out PyPy and their Object Space approach as an example. You could create an Object Space which, instead of actually executing the algorithm, instead just calculated the shape of the output distribution (assuming that the RNG input is uniform). Of course, this requires that you build such a tool and that your algorithm be built in PyPy or some other tool where it is easy to make drastic modifications to the compiler and use it to analyze the code.

mcherm
mcherm
February 03, 2012 17:59 PM

A simple histogram approach is a good first step, but is not sufficient to prove randomness. For a uniform PRNG you would also (at the very least) generate a 2-dimensional scatter plot (where x is the previous value and y is the new value). This plot should also be uniform. This is complicated in your situation because there are intentional non-linearities in the system.

My approach would be:

  1. validate (or take as a given) that the source PRNG is sufficiently random (using standard statistical measures)
  2. verify that an unconstrained PRNG-to-datetime conversion is sufficiently random over the output space (this verifies a lack of bias in the conversion). Your simple first-order uniformity test should be sufficient here.
  3. verify that the constrained cases are sufficiently uniform (a simple first-order uniformity test over the valid bins).

Each of these tests is statistical and requires a large number of sample points to avoid false positives and false negatives with a high degree of confidence.

As to the nature of the conversion/constraint algorithm:

Given: method to generate pseudo-random value p where 0 <= p <= M

Need: output y in (possibly discontinuous) range 0 <= y <= N <= M

Algorithm:

  1. calculate r = floor(M / N), that is, the number of complete output ranges that fit within the input range.
  2. calculate the max acceptable value for p: p_max = r * N
  3. generate values for p until a value less than or equal to p_max is found
  4. calculate y = p / r
  5. if y is acceptable, return it, otherwise repeat with step 3

The key is to discard unacceptable values rather than folding non-uniformly.

in pseudo-code:

# assume prng generates non-negative values
def randomInRange(min, max, prng):
    range = max - min
    factor = prng.max / range

    do:
        value = prng()
    while value > range * factor
    return (value / factor) + min

def constrainedRandom(constraint, prng):
    do:
        value = randomInRange(constraint.min, constraint.max, prng)
    while not constraint.is_acceptable(value)
Frank Szczerba
Frank Szczerba
February 03, 2012 21:23 PM

Your goal is not to write unit tests and pass them, but to make sure that your program fits its requirements. The only way you can do this, is to precisely define your requirements in the first place. For example, you mentioned "three weekly inspections at random times". I'd say the requirements are: (a) 3 inspections (not 2 or 4), (b) at times that are not predictable by people who don't want to be inspected unexpectedly, and (c) not too close together - two inspections five minutes apart are probably pointless, maybe not too far apart either.

So you write down the requirements more precisely than I did. (a) and (c) are easy. For (b), you might write some code as clever as you can that tries to predict the next inspection, and to pass the unit test, that code must not be able to predict better than pure guess.

And of course you need to be aware that if your inspections are truly random, any prediction algorithm could be correct by pure chance, so you must be sure that you and your unit tests don't panic if that happens. Maybe perform a few more tests. I wouldn't bother testing the random number generator, because in the end it's the inspection schedule that counts, and it doesn't matter how it was created.

gnasher729
gnasher729
March 25, 2014 18:55 PM

This case seems ideal for Property Based Testing.

In a nutshell, it's a mode of testing where the testing framework generates inputs for code under test and test assertions validate properties of the outputs. The framework can then be smart enough to "attack" code under test and try to corner it into an error. The framework is usually also smart enough to hijack your random number generator seed. Typically you can configure the framework to generate at most N test cases or run at most N seconds, and remember failed test cases from last run and re-run these against newer code version first. This allows for fast iteration cycle during development and more comprehensive testing out of band / in CI.

Here's a (dumb, failing) example testing the sum function:

@given(lists(numbers))
def test_sum(alist):
    result = sum(alist)
    assert isinstance(result, float)
    assert result > 0
  • test framework will generate one list at a time
  • content of the list will be numbers (ints, floats, etc.)
  • sum is called and properties of the result are validated
  • result is always float (even if input is empty or all ints)
  • result is positive (it's not a real example)

In the OP case, properties validated will be more complex: inspection type A present, inspection type A thrice a week, inspection time B always at 12pm, inspection type C from 9 to 9, [given schedule is for a week] inspections of types A,B,C all present, etc.

The best-known library is QuickCheck for Haskell, see Wikipedia page below for a list of such libraries in other languages:

https://en.wikipedia.org/wiki/QuickCheck

Hypothesis (for Python) has a good write-up about this kind of testing:

https://hypothesis.works/articles/what-is-property-based-testing/

Dima Tisnek
Dima Tisnek
November 15, 2018 01:44 AM

Related Questions



Testing a function that uses random number generator

Updated August 29, 2017 10:05 AM

Unit testing Markov chain code

Updated August 06, 2018 18:05 PM