Thursday, August 8, 2013

Deterministic Miller-Rabin Primality Test in C# Continued

Note: 1/17/2016 - I've reworked the code described in this post. The new version is better, and complete enough that I recommend using it instead. The idea is the same, but the implementation is cleaner and faster. It's hosted on GitHub The most important change is that the Miller-Rabin code for ulong and long types (64 bit integers) now uses a UInt128 class I've added, instead of BigInteger for doing ModPow. This makes the code several times faster than the version in this blog post, although the general idea described here is the same. The only thing lacking in the new version is the BigInteger implementation of non-deterministic Miller-Rabin prime test. If you need that, get the original code. I'll be adding that to the new version in the near future.

The previous post presented a UInt32 (uint) implementation of a deterministic Miller-Rabin Primality test. This post presents a very similar algorithm, but for UInt64 (ulong) values as well as BigInteger values, although only probabilistic for the latter.

A key part of implementing the Miller-Rabin algorithm is having access to a ModPow (or PowMod) function that computes, without overflowing, the modulo value of some number raised to some power. The previous post contained an implementation using a technique that only needs to deal with intermediate values as high as the MaxValue of the type squared. That was easily achieved by using UInt64 operations for intermediate values. The Miller Rabin IsPrime function itself also used UInt64 for one of its calculations. When implementing a UInt64 ModPow and IsPrime function, there is no UInt128 type in .Net for us to use. There is however a BigInteger type in .Net we can use. Faced with the need to have BigInteger operations in both the IsPrime and ModPow functions, I decided to first just implement a standard probabilistic Miller Rabin IsPrime function as a BigInteger extension, and use the ModPow function that already exists in .Net’s BigInteger implementation. I wrote it in a way that would allow me to also use it to do deterministic primality testing of UInt64 values.

The code below is a BigInteger implementation of a Miller Rabin IsPrime function. Besides the value being tested, it takes an argument specifying the number of witness rounds to be performed (the more rounds, the more certain the primality test is, at the expense of CPU time), and also an optional set of witness values. If no witness values are provided, or the number of witness values provided are less than the specified witness count, the algorithm randomly generates the additional witness values it needs. For this, I wrote a simple BigInteger random number generator class that I won’t go into detail about here, but may cover in a future post. The real purpose of taking UInt64 witness values as optional parameters is to provide the mechanism for doing deterministic primality checks using predetermined witness values that guarantee a correct primality determination up to some maximum range. See the previous post for more detail on that concept.
// constants for common referenced BigInteger values
private static BigInteger two = 2;
private static BigInteger three = 3;

/// <summary>
/// Determines if the specified value is a prime number, deterministically for
/// values 64 bits and less, and probabilistically for values larger than 64 bits.
/// </summary>
/// <remarks>This method chooses the algorithm to use based on
/// the magnitude of the specified value. For smaller values, a
/// simple trial division algorithm is used. For larger values up to
/// and including 64 bit values, a deterministic version of the 
/// Miller-Rabin algorithm is used. For values more than 64 bits,
/// a probabilistic Miller-Rabin algorithm with a default of 64
/// witness iterations is used.</remarks>
/// <param name="value">The value to be tested for primality.</param>
/// <param name="maxWitnessCount">The maximum number of witness iterations. Default is 64.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
public static bool IsPrime(this BigInteger value, int maxWitnessCount = 64) {
    if (value < two) return false;
    if (value <= ulong.MaxValue) return ((ulong)value).IsPrime();
    return value.IsPrimeMR(maxWitnessCount);
}

/// <summary>
/// Determines if the specified value is a prime number, using the
/// probabilistic Miller-Rabin algorithm.
/// </summary>
/// <param name="value">The value to be tested for primality.</param>
/// <param name="maxWitnessCount">The maximum number of witness iterations.</param>
/// <returns>True if the value is probably prime, otherwise, false.</returns> 
public static bool IsPrimeMR(this BigInteger value, int maxWitnessCount = 64) {
    // take care of the simple cases of small primes and the
    // common composites having those primes as factors
    if (value <= BigInteger.One) return false;
    if ((value % two) == BigInteger.Zero) return value == two;
    if ((value % three) == BigInteger.Zero) return value == three;
    if ((value % 5) == BigInteger.Zero) return value == 5;
    if (((value % 7) == BigInteger.Zero) || ((value % 11) == BigInteger.Zero)
        || ((value % 13) == BigInteger.Zero) || ((value % 17) == BigInteger.Zero)
        || ((value % 19) == BigInteger.Zero) || ((value % 23) == BigInteger.Zero)
        || ((value % 29) == BigInteger.Zero) || ((value % 31) == BigInteger.Zero)
        || ((value % 37) == BigInteger.Zero) || ((value % 41) == BigInteger.Zero)
        || ((value % 43) == BigInteger.Zero)) {
        return (value <= 43);
    }
    return InternalIsPrimeMR(value, maxWitnessCount, new ulong[0]);
}

/// <summary>
/// Determines if the specified odd, >= 3 value is a prime number, using the
/// Miller-Rabin algorithm.
/// </summary>
/// <param name="value">The value to be tested for primality.</param>
/// <param name="witnesses">The witnesses to be used.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
internal static bool InternalIsPrimeMR(BigInteger value, int witnessCount, params ulong[] witnesses) {
    // compute n − 1 as (2^s)·d (where d is odd)
    BigInteger valLessOne = value - BigInteger.One;
    BigInteger d = valLessOne / two; // we know that value is odd and valLessOne is even, so unroll 1st iter of loop
    uint s = 1;
    while ((d % two) == BigInteger.Zero) {
        d /= two;
        s++;
    }

    // test value against each witness
    RandomBigInteger rand = null;
    for (int i = 0; i < witnessCount; i++) {
        BigInteger a;
        if (i < witnesses.Length) {
            a = witnesses[i];
            if (a >= valLessOne) {
                a %= value - three;
                a += three;
            }
        } else {
            if (rand == null) rand = new RandomBigInteger(3, valLessOne);
            a = rand.Next();
        }
        BigInteger x = BigInteger.ModPow(a, d, value);

        if (x == BigInteger.One) continue;
        for (uint r = 1; (r < s) && (x != valLessOne); r++) {
            x = BigInteger.ModPow(x, two, value);
            if (x == BigInteger.One) return false;
        }
        if (x != valLessOne) return false;
    }
    // witnesses confirm value is prime
    return true;
}

The algorithm is very similar to the one for UInt32 in the previous post. First a simple check of compositeness is performed against the common low prime values. If those primes are not a factor, then the Miller Rabin algorithm is executed. Checking to see if low prime values are factors is a trade-off. When benchmarking the performance of an IsPrime function, there are two cases to consider: determining that a composite number is composite, and determining that a prime number is prime. It typically takes much longer to handle the prime input than it does to handle the composite input. Checking low prime values will improve the speed of the composite case, but will slow down the speed of the prime case. That’s because all the low prime tests will be executed, all will fail, and then the entire Miller-Rabin logic gets executed. So there’s a bit of a balancing game to play deciding how many low primes to test against in that first step.

The benefit of using this single BigInteger IsPrime implementation for both BigInteger and UInt64 prime testing is that we kill two birds with one stone. Not only do we get something that can do deterministic tests of UInt64 values, but we also can use it to do something like testing primality of a 1024 bit BigInteger value.

The code below are the UInt64 extension methods that leverage off the BigInteger function above to do deterministic checks using pre-tested witness values. Besides that, they also tie into the UInt32 implementation to check values that fall within that range. The BigInteger function is quite a bit slower than the UInt32 function. So the idea is to have the code automatically use the function that will perform the fastest for the value being tested, regardless of the type of the original value.
/// <summary>
/// Determines if the specified value is a prime number.
/// </summary>
/// <remarks>This method chooses the best algorithm to use based on
/// the magnitude of the specified value. For smaller values, a
/// simple trial division algorithm is used. For larger values, a 
/// deterministic version of the Miller-Rabin algorithm is used.</remarks>
/// <param name="value">The value to be tested for primality.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
public static bool IsPrime(this ulong value) {
    if (value <= uint.MaxValue) {
        return ((uint)value).IsPrime();
    } else {
        if ((value & 1) == 0) return value == 2;
        return value.IsPrimeMR();
    }
}

// precomputed witnesses and the maximum value up to which they guarantee primality
private static readonly ulong[] witnesses1 = new ulong[] { 9345883071009581737 };
private const ulong witnesses1Max = 341531;
private static readonly ulong[] witnesses2 = new ulong[] { 336781006125, 9639812373923155 };
private const uint witnesses2Max = 1050535501;
private static readonly ulong[] witnesses3 = new ulong[] { 4230279247111683200, 14694767155120705706, 16641139526367750375 };
private const ulong witnesses3Max = 350269456337;
private static readonly ulong[] witnesses4 = new ulong[] { 2, 141889084524735, 1199124725622454117, 11096072698276303650 };
private const ulong witnesses4Max = 55245642489451;
private static readonly ulong[] witnesses5 = new ulong[] { 2, 4130806001517, 149795463772692060, 186635894390467037, 3967304179347715805 };
private const ulong witnesses5Max = 7999252175582851;
private static readonly ulong[] witnesses6 = new ulong[] { 2, 123635709730000, 9233062284813009, 43835965440333360, 761179012939631437, 1263739024124850375 };
private const ulong witnesses6Max = 585226005592931977;
private static readonly ulong[] witnesses7 = new ulong[] { 2, 325, 9375, 28178, 450775, 9780504, 1795265022 };

/// <summary>
/// Determines if the specified value is a prime number, using the
/// Miller-Rabin algorithm.
/// </summary>
/// <remarks>This implementation is guaranteed to deterministically
/// decide primality for all positive value inputs.</remarks>
/// <param name="value">The ulong value to be tested for primality.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
public static bool IsPrimeMR(this ulong value) {
    // take care of the simple cases of small primes and the
    // common composites having those primes as factors
    if ((value & 1) == 0) return value == 2;
    if (value == 1) return false;
    if ((value % 3) == 0) return value == 3;
    if ((value % 5) == 0) return value == 5;
    if (((value % 7) == 0) || ((value % 11) == 0) || ((value % 13) == 0)
        || ((value % 17) == 0) || ((value % 19) == 0) || ((value % 23) == 0)
        || ((value % 29) == 0) || ((value % 31) == 0) || ((value % 37) == 0)
        || ((value % 41) == 0) || ((value % 43) == 0)) {
        return (value <= 43);
    }
    // perform Miller-Rabin with most efficient witness array for the value being tested
    BigInteger bValue = (BigInteger)value;
    if (value >= witnesses6Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 7, witnesses7);
    else if (value >= witnesses5Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 6, witnesses6);
    else if (value >= witnesses4Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 5, witnesses5);
    else if (value >= witnesses3Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 4, witnesses4);
    else if (value >= witnesses2Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 3, witnesses3);
    else if (value >= witnesses1Max) return BigIntegerExtensions.InternalIsPrimeMR(bValue, 2, witnesses2);
    else return BigIntegerExtensions.InternalIsPrimeMR(bValue, 1, witnesses1);
}
With this code, we can determine absolutely whether an unsigned 64 bit integer (0 to 18,446,744,073,709,551,615) is a prime number or not, with at most 7 witness iterations.
Performance-wise this is not as fast as the 32 bit version in the previous post. BigInteger operations are simply more time consuming than 32 bit operations. But I’ve tested it against two other BigInteger Miller Rabin implementations available on the web, and this is faster for both composite and prime BigInteger values of various sizes. For values 64 bits and smaller, it is much faster, sometimes by as much as 100 to 1000 times when it's able to avoid BigInteger math. For BigInteger values larger than 64 bits,  the differences are more modest.

Tuesday, May 21, 2013

Deterministic Miller-Rabin Primality Test in C#

Note: 1/17/2016 - I've reworked the code described in this post. The new version is better, and complete enough that I recommend using it instead. The idea is the same, but the implementation is cleaner and faster. It's hosted on GitHub

Note: 7/12/2013 - I edited this post to replace the original Int32 implementation with a faster uint version.
8/8/2013 - added ulong (UInt64) and BigInteger versions of the algorithm described here, in the next post.

I’m working on a SparseArray class that doesn’t use Dictionary under the covers, which allows much better space efficiency, with speeds comparable to Dictionary. I’ll add it to this blog sometime in the near future, but first I’m going to post some of the pieces that went into it. One of the things it does is allow any capacity to be used, but optimizes its behavior when the capacity is a power of 2, or is a prime number. To support using prime capacities, I’ve written a NearestPrimeCeiling method which in turn relies on an IsPrime method. In the past, when I needed a primality test over the range of positive ints, I’ve used one of these quickie methods that uses trial division.
public static bool IsPrimeTrial(this uint value) {
    if (value < 2) return false;
    if ((value & 1) == 0) return value == 2;
    if ((value % 3) == 0) return value == 3;
    // for smaller values (up to several million), it's faster
    // to use i*i in the loop termination condition rather than
    // precomputing the square root of the value.
    //if (value == uint.MaxValue) return true;
    for (uint i = 5, skip = 2; (i <= sqrtMax) && (i*i <= value); i+=skip, skip=6-skip) {
        if (value % i == 0) return false;
    }
    return true;
}
public static bool IsPrimeTrialSqrt(this uint value) {
    if ((value & 1) == 0) return value == 2;
    if (value == 1) return false;
    if ((value % 3) == 0) return value == 3;
    uint endVal = (uint) ((value > int.MaxValue) ? Math.Sqrt(value) : Math.Sqrt((int)value));
    for (uint i = 5, skip = 2; i <= endVal; i+=skip, skip=6-skip) {
        if (value % i == 0) return false;
    }
    return true;
}
The two methods above both do Primality testing by trial division. A minor difference between the two is that one uses the square of the loop variable to determine the terminal condition, and the other pre-computes the terminal loop value using the time expensive Sqrt method. For values up to around 10 million, the i*i version is faster. Beyond that, the Sqrt version is faster. The difference is only significant at the extreme ends of the range of positive integer values.

Although I was aware of the Miller-Rabin primality test, I put off using it because the trial division methods were ‘good enough’, and I didn't like the fact that Miller-Rabin is a probabilistic test. Both those reasons were very misguided. Compared to Miller-Rabin, trial division is not all that good, and within the ranges of int, uint, long and ulong integers, it’s quite easy to slightly modify the Miller-Rabin algorithm so that it’s deterministic (guaranteed to correctly determine primality). I initially searched for existing C# implementations, but found problems in them such as not operating over the entire range of values because of overflow in the PowMod portion of the algorithm. So, I ended up writing one.

The Miller-Rabin Wikipedia entry adequately describes the algorithm. A key component of the algorithm is randomly choosing witness values. The more witness values you use, the less likely it will return a false positive for primality. However, there are known sets of witness values that are guaranteed to correctly determine primality over limited ranges of values. For example, there are 3-witness sets that cover the entire range of unsigned 32-bit integers, and 7-witness sets that cover unsigned 64-bit integers. By replacing the random selection of witness values with code that uses pre-tested witness values, it’s easy to make a deterministic version of the Miller-Rabin algorithm over these ranges of values. The speed of the algorithm is affected by the number of witness values that are used. So it’s best to use the smallest set of witness values that will give a definitive primality result for the value you’re testing. There is a great website devoted to discovering these witness sets, not just for 32 bit integers, but for 64 bit values as well. I found it very helpful.

The charts below show the speed comparison of trial division to Miller-Rabin. The first chart shows the times (in microseconds) to determine primality of the nearest prime number to the values displayed along the x-axis. This is where trial division is at its slowest, since it must exhaust the possible factors before it can declare the value as prime.

The second chart shows the time (in microseconds) to perform the primality test on 1000 consecutive values less than or equal to the values displayed along the x-axis. Since it takes much less time to determine composite values are not prime than it does to determine if a primes are prime, the average time for checking 1000 consecutive values is greatly affected by the prime values within that range.


As you can see, Miller-Rabin destroys trial division, outperforming it over almost the entire range of unsigned 32-bit integers. These benchmarks were done using a 32-bit build. When compiled for 64-bits, the difference becomes even greater. The trial division times remain about the same, but the Miller-Rabin times improve by around 33%, because there is some ulong operations in the ModPow function.

In both graphs you can see the small jog up in time where the Miller-Rabin algorithm has to switch from using 2 witnesses to 3 at around the 700,000,000 value. From that point to the max range of uint, the execution time is nearly linear, typically taking around 2.75 microseconds to determine that a prime number is prime, and averaging 230 nanoseconds per call to IsPrime when testing consecutive values that are a mix of composites and primes. I ran the tests on a not-new run of the mill computer (P8600 @ 2.4GHz).

The code below is my basic implementation of the Miller-Rabin algorithm. Also included is the ModPow function it needs, implemented in a way that's pretty fast, and safe from the danger of overflows.

private static readonly ulong[] witnesses1 = new ulong[] { 9345883071009581737 };
private const uint witnesses1Max = 341531;
private static readonly ulong[] witnesses2 = new ulong[] { 15, 13393019396194701 };
private const uint witnesses2Max = 716169301;
private static readonly ulong[] witnesses3 = new ulong[] { 2, 7, 61 };
/// <summary>
/// Determines if the specified value is a prime number, using the
/// Miller-Rabin algorithm.
/// </summary>
/// <remarks>This implementation is guaranteed to deterministically
/// decide primality for all positive value inputs.</remarks>
/// <param name="value">The value to be tested for primality.</param>
/// <returns>True if the value is prime, otherwise, false.</returns> 
public static bool IsPrimeMR(this uint value) {
    // take care of the simple cases of small primes and the
    // common composites having those primes as factors
    if ((value & 1) == 0) return value == 2;
    if (value == 1) return false;
    if ((value % 3) == 0) return value == 3;
    if ((value % 5) == 0) return value == 5;
    if (((value % 7) == 0) || ((value % 11) == 0) || ((value % 13) == 0)
        || ((value % 17) == 0) || ((value % 19) == 0) || ((value % 23) == 0)
        || ((value % 29) == 0) || ((value % 31) == 0) || ((value % 37) == 0) 
        || ((value % 41) == 0) || ((value % 43) == 0)) {
        return (value <= 43);
    }
    // perform Miller-Rabin with most efficient witness array for the value being tested
    if (value >= witnesses2Max) return InternalIsPrimeMR(value, witnesses3);
    else if (value >= witnesses1Max) return InternalIsPrimeMR(value, witnesses2);
    else return InternalIsPrimeMR(value, witnesses1);
}
/// <summary>
/// Determines if the specified odd, >= 3 value is a prime number, using the
/// Miller-Rabin algorithm.
/// </summary>
/// <param name="value">The int value to be tested for primality.</param>
/// <param name="witnesses">The witnesses to be used.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
private static bool InternalIsPrimeMR(uint value, ulong[] witnesses) {
    // compute n − 1 as (2^s)·d (where d is odd)
    uint valLessOne = value - 1;
    uint d = valLessOne / 2; // we know that value is odd and valLessOne is even, so unroll 1st iter of loop
    uint s = 1;
    while((d & 1) == 0) {
        d /= 2;
        s++;
    }
    // test value against each witness
    for (uint i = 0; i < witnesses.Length; i++) {
        uint a;
        ulong aL = witnesses[i];
        if (aL >= value) {
            aL %= value;
            if (aL < 2) continue;
        }
        a = (uint) aL;
        if (a == valLessOne) continue;
        uint x = InternalModPow(a, d, value); // overflow safe version of x = Math.Pow(a, d) % value
        if (x == 1) continue;
        for (uint r = 1; (x != valLessOne) && (r < s); r++) {
            x = (uint)((x * (ulong)x) % value); // overflow safe version of x = (x * x) % value
            if (x == 1) return false;
        }
        if (x != valLessOne) return false;
    }
    // witnesses confirm value is prime
    return true;
}
/// <summary>
/// Compute the mod of a number raised to the specified power.
/// </summary>
/// <remarks>Overflow safe for all input values. </remarks>
/// <param name="b">The base number.</param>
/// <param name="e">The exponent applied to the base number.</param>
/// <param name="m">The modulo value.</param>
/// <returns>The mod of the base number raised to the specified power.</returns>
public static uint ModPow(uint b, uint e, uint m) {
    if (b == 0 || m == 0) return 0;
    return InternalModPow(b, e, m);
}
private static readonly uint sqrtMax = (uint) Math.Sqrt(uint.MaxValue);
private static uint InternalModPow(uint b, uint e, uint m) {
    switch (e) {
    case 0: return 1;
    case 1: return b % m;
    case 2: return (b <= sqrtMax) ? (b * b) % m : (uint)((b * (ulong)b) % m);
    default:
        uint result = 1;
        while (true) {
            if ((e & 1) != 0) {
                result = (uint)((result * (ulong)b) % m);
                if (e == 1) return result;
            }
            e /= 2; 
            b = (uint)((b * (ulong)b) % m);
        }
    }
}

The final IsPrime method selects the fastest underlying algorithm based on the value being tested.

/// <summary>
/// Determines if the specified value is a prime number.
/// </summary>
/// <remarks>This method chooses the best algorithm to use based on
/// the magnitude of the specified value. For smaller values, a
/// simple trial division algorithm is used. For larger values, a 
/// deterministic version of the Miller-Rabin algorithm is used.</remarks>
/// <param name="value">The int value to be tested for primality.</param>
/// <returns>True if the value is prime, otherwise, false.</returns>
public static bool IsPrime(this uint value) {
    if ((value & 1) == 0) return value == 2;
    if (value < 1000000) {
        return value.IsPrimeTrial();
    } else {
        return value.IsPrimeMR();
    }
}
I did a similar implementation that works with Int32 values, but the uint version was faster, and operates over a larger range. So I scrapped the Int32 version and implemented a little int IsPrime method that just uses the uint version.

public static bool IsPrime(this int value) {
    if (value <= 1) return false;
    return ((uint) value).IsPrime();
}

Saturday, January 12, 2013

Benchmarking C# Struct and Object Sizes

Getting the codez:
In the previous post, I described a class you can use to do casual execution time benchmarking of C# code. That Bench class also contains a couple methods that can be used to obtain memory size information about instances of classes or structs.

These methods measure the total memory consumed by an object or struct, including all the objects created as part of its creation. The methods work by taking a parameter that is a Func<T>, which is expected to create and return an object or struct instance for measurement. The technique used to measure the memory consumption is pretty simple; it just asks the garbage collector what the total memory consumed is both before and after the object creation, and notes the difference between the two. This may not give accurate results in a busy system, but during development, it seems to work pretty reliably. It works in either 32 or 64 bits, the results of which are slightly different from each other (identical objects or structs will often consume a little bit more memory in 64 bit builds).

There are two methods for obtaining memory consumption data. ByteSize returns a long value of the memory consumed by the object created by the specified Func. ByteSizeDescription returns a string describing the total memory consumed, as well as further information about the breakdown of that memory.
ByteSize and ByteSizeDescription will work for both objects and value types (like structs). Value types are handled by boxing them as objects so they consume heap memory, and then backing out the memory associated with the boxing overhead. Note that just as it is when measuring objects, for structs containing references to objects, the size returned by Bench’s methods will include not only the size of the reference in the struct, but also the size of the object referred to, if those objects are created by the Func method passed in.

There is a little fuzziness about the value type sizes, because .Net likes things to align on 4 byte boundaries in 32 bit builds, and 8 byte boundaries in 64 bit builds. It's also not entirely straightforward how much of an object's size is overhead and how much is content. Technically, there are 8 bytes of overhead for objects in 32 bit builds. However, an object will never take less then 12 bytes of size. For an empty object, 4 bytes of that are unused. An object with a single int member variable will also occupy 12 bytes - 8 bytes of overhead, plus 4 bytes for the int. As an object has more member variables, the size will increase, but in jumps of 4 bytes at a time. For 64 bit builds, there are 16 bytes of overhead, and a minimum of 8 bytes of content, whether it's used or not. To read more detail about this topic, I recommend these blog posts and articles:
Of Memory And Strings
How much much memory does a C# string take up?
and for the real nitty gritty detail (although a bit old)
Drill Into .NET Framework Internals to See How the CLR Creates Runtime Objects

The ByteSize function will return the aligned size of the struct returned by the passed in Func method. So, for instance, a struct containing a single byte field in a 32 bit build will return a size of 4 bytes. The ByteSizeDescription function will give additional information. Besides the deep aligned size of the struct returned by the Func method, it reports the deep packed size, and also the aligned size and packed size of a default instance of the struct. The reason for displaying aligned and packed versions can be seen when you have an array of structs. Using the example of a struct with a single byte field, an array of 1 element will have a size of 16 bytes (12 bytes of object overhead of the array object, plus 4 bytes for the single struct element). However, an array of 2, 3, or 4 elements will have this same size of 16 bytes. An array of 5 elements jumps the memory to 20 bytes. So the actual memory consumed by a struct may depend on how the struct is being used.

The following snippet of code shows some examples of using the ByteSize functions.
struct S1 { byte b; }
struct S2 { int i; }
struct S3 { string s; public S3(string s) { this.s = s;} }

public class Program {
    public static void Main(string[] args) {
        var bench = new Bench();
        Console.WriteLine(bench.ByteSize(() => { return new S1(); }));
        Console.WriteLine(bench.ByteSize(() => { return new S2(); }));
        Console.WriteLine(bench.ByteSize(() => { return new string('a', 16); }));
        Console.WriteLine(bench.ByteSize(() => { return new S3(new string('a', 16)); }));
        Console.WriteLine(bench.ByteSize(() => { return new int[16]; }));
        Console.WriteLine(bench.ByteSize(() => { 
            var d = new Dictionary<int,int>(10000);
            for (int i = 0; i < 10000; i++) d.Add(i,i);;
            return d;
         }));

        Console.WriteLine();
        Console.WriteLine(bench.ByteSizeDescription(() => { return new S1(); }) + "\n");
        Console.WriteLine(bench.ByteSizeDescription(() => { return new S2(); }) + "\n");
        Console.WriteLine(bench.ByteSizeDescription(() => { return new string('a', 16); }) + "\n");
        Console.WriteLine(bench.ByteSizeDescription(() => { return new S3(new string('a', 16)); }) + "\n");
        Console.WriteLine(bench.ByteSizeDescription(() => { return new int[16]; }) + "\n");
        Console.WriteLine(bench.ByteSizeDescription(() => { 
            var d = new Dictionary<int,int>(10000);
            for (int i = 0; i < 10000; i++) d.Add(i,i);;
            return d;
         }));

        Console.Write("done");
        Console.ReadKey();
    }
}
This produces the following output in 32 bit builds,

4
4
4
48
52
76
202136

S1 (struct): deep aligned size= 4 bytes, deep packed size= 1 bytes
    aligned default size= 4 bytes, packed default size= 1 bytes

S2 (struct): deep aligned size= 4 bytes, deep packed size= 4 bytes
    aligned default size= 4 bytes, packed default size= 4 bytes

String: size= 48 bytes, objOverhead= 8 bytes, content= 40 bytes

S3 (struct): deep aligned size= 52 bytes, deep packed size= 52 bytes
    aligned default size= 4 bytes, packed default size= 4 bytes

Int32[]: size= 76 bytes, objOverhead= 8 bytes, content= 68 bytes
    count= 16, avg/item= 4 bytes

Dictionary`2: size= 202136 bytes, objOverhead= 8 bytes, content= 202128 bytes
    count= 10000, avg/item= 20 bytes

and the following in 64 bit builds.

8
8
8
64
72
88
202192

S1 (struct): deep aligned size= 8 bytes, deep packed size= 1 bytes
    aligned default size= 8 bytes, packed default size= 1 bytes

S2 (struct): deep aligned size= 8 bytes, deep packed size= 4 bytes
    aligned default size= 8 bytes, packed default size= 4 bytes

String: size= 64 bytes, objOverhead= 16 bytes, content= 48 bytes

S3 (struct): deep aligned size= 72 bytes, deep packed size= 72 bytes
    aligned default size= 8 bytes, packed default size= 8 bytes

Int32[]: size= 88 bytes, objOverhead= 16 bytes, content= 72 bytes
    count= 16, avg/item= 4 bytes

Dictionary`2: size= 202192 bytes, objOverhead= 16 bytes, content= 202176 bytes
    count= 10000, avg/item= 20 bytes

The source code for the two ByteSize methods is included below. It’s also in the the full Bench class that can be downloaded from the link at the top of this post.
static private long overheadSize = ComputeOverheadSize();

/// <summary>
/// Determines the size of the object or struct created and returned by the
///    specified method. </summary>
/// <remarks>Should not be used in production! This is meant for use during
/// development, not as a general purpose sizeof function.</remarks>
/// <param name="maker">The method that creates and returns the object or 
/// struct whose size will be determined.</param>
/// <returns>The size in bytes of the object created by the method.</returns>
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
public long ByteSize<T>(Func<T> maker) {
    if (maker == null) throw new ArgumentNullException("maker");

    long size = long.MinValue;
    long prevSize;
    int count = 0;
    // because we're using an unreliable method of obtaining size, repeat until
    // we get a confirmed result to help eliminate reporting spurious results.
    const int maxAttempts = 10;
    long dummy;
    do {
        prevSize = size;
        count++;
        if (typeof(T).IsValueType) {
            long objSize = ByteSize(() => { return (object) 1L; });
            long boxedSize = ByteSize(() => { return (object)maker(); });
            // size is the boxed size less the overhead of boxing
            size = (boxedSize - objSize) + sizeof(long);
        } else {
            object obj = null;
            long startSize = GC.GetTotalMemory(true);
            obj = maker(); 
            long endSize = GC.GetTotalMemory(true);
            size = endSize - startSize;
            // ensure object stays alive through measurement
            dummy = obj.GetHashCode();
        }
    } while ((count < maxAttempts) && ((size != prevSize) || (size <= 0)));
    return size;
}

/// <summary>
/// Returns a string describing details about the size of the object or struct
/// created by the specified method.
/// </summary>
/// <remarks>Should not be used in production! This is meant for use during
/// development, not as a general purpose sizeof function.</remarks>
/// <param name="maker">The method that creates and returns the object or struct
/// whose size will be determined.</param>
/// <returns>String describing details about the size of an object.</returns>
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
public string ByteSizeDescription<T>(Func<T> maker) {
    if (maker == null) throw new ArgumentNullException("maker");
                
    // get size of target
    long byteSize = ByteSize(() => { return maker(); });
    string s = typeof(T).Name;
    if (typeof(T).IsValueType) {
        // special handling of value types (i.e. structs)
        long emptyArray = ByteSize(() => { return new T[0]; });
        long deepPacked = (ByteSize(() => {
            var x = new T[16];
            for (int i = 0; i < x.Length; i++) x[i] = maker();
            return x;
        }) - emptyArray) / 16;
        long alignedSize = ByteSize(() => { return new T[1]; }) - emptyArray;
        long packedSize = (ByteSize(() => { return new T[16]; }) - emptyArray) / 16;
        s += " (struct): deep aligned size= " + byteSize + " bytes, deep packed size= "
            + deepPacked + " bytes"
            + Environment.NewLine + "    aligned default size= " + alignedSize
            + " bytes, packed default size= " + packedSize + " bytes";
    } else {
        // handling of objects
        s += ": size= " + byteSize + " bytes"
            + ", objOverhead= " + overheadSize 
            + " bytes, content= " + (byteSize - overheadSize) + " bytes";
        if (typeof(System.Collections.ICollection).IsAssignableFrom(typeof(T))) {
            // special handling of classes implementing ICollection
            var coll = maker() as System.Collections.ICollection;
            int count = coll.Count;
            s += Environment.NewLine + "    count= " + count;
            if (count > 0) s += ", avg/item= " + ((byteSize - overheadSize) / count) + " bytes";
        }
    }
    return s;
}

static private long ComputeOverheadSize() {
    var bench = new Bench();
    long simpleSize = bench.ByteSize(() => { return new SimpleObject(); });
    return simpleSize - SimpleObject.InternalSize;
}


private class SimpleObject {
    public const int InternalSize = sizeof(int) * 2;
    int one;
    int two;
}

Sunday, January 6, 2013

Benchmarking C# code

Getting the codez:
Since writing this post, I've made improvements to the code described below. You can read about them in a newer post. The NuGet and GitHub links to the right are for the new version.
When doing casual benchmarking, I have often just thrown together a StopWatch and loop. I hate to admit the number of times I’ve wasted time thinking that a tweak I made was an improvement (or the opposite), only to discover after some head scratching that I had forgotten I’d changed the loop limit, and was comparing apples and oranges with regard to the previous benchmarks.
I’m working on some code that has me doing a fair amount of benchmarking and optimizing, so I decided to write a little utility to make benchmarking easier, and remove the source of past red herrings. I had the following goals:
  • Simple to use, with minimal coding needed
  • Produce results that are comparable, and independent of loop counts
  • Automatically figure out an appropriate number of times to execute the target code
  • Provide the ability to exclude set up code from the timing
  • Be able to benchmark small bits of code with very short run times
  • Remove the benchmarking overhead time (loop processing, etc.) from results
This post presents my first pass at it, and it seems to do the things I was looking for. The benchmark class achieves this using the StopWatch class, and a main timing loop. It takes an Action delegate for the code being benchmarked, which can be a normal method, or an anonymous method. In the case of code that has very small run times (under 1 microsecond), it’s difficult to get accurate results with a single loop that repeatedly calls the target method. The substantial overhead of looping and calling the method drowns out the tiny amount of time used by the target code, which gets lost in the noise. For those cases, the target method can have its own simple inner for loop. The count for the loop is passed in to the target method. This has two benefits. Since the benchmark utility knows the loop count, it can better calculate the associated overhead of the for loop and remove that from the result. Another benefit is that you can let the benchmark utility figure out the appropriate loop count your target method should use. In some cases you’ll want to control the loop count yourself, in cases where the loop count is an intrinsic part of what you’re benchmarking. For example, if I was benchmarking adding to a list, I might want to always benchmark the time to add 10,000 items. So I’d want to specify the count myself.

This benchmarking utility is all contained within a class called Bench. The following code snippet shows some examples of using the Bench class.

public class Program
{
    public static void Main(string[] args) {
        TimeSleep();
        TimeStringConcat();
        TimeDictionaryRemove();
        Console.Write("done");
        Console.ReadKey();
    }
    
    public static void TimeSleep() {
        new Bench().Time("Sleep", () => { Thread.Sleep(2); });
    }

    public static void TimeStringConcat() {
        string s1 = DateTime.UtcNow.ToString();
        string s2 = DateTime.UtcNow.ToString();
        string s;
        new Bench().Time("String Concat", (count) => {
            for (int i = 0; i < count; i++) s = s1 + s2;
        } );
    }
    
    public static void TimeDictionaryRemove() {
        new Bench().Time("Dictionary.Remove", 1000, DictionaryRemove );
    }

    public static void DictionaryRemove(int count, Bench.TimeControl tc) {
        tc.Pause();
        var d = new Dictionary<int, string>();
        for (int i = 0; i < count; i++) d.Add(i, i.ToString());
        tc.Resume();
        for (int i = 0; i < count; i++) d.Remove(i);
    }
}
The TimeSleep method shows the simplest use of Bench, using a lambda expression. The TimeStringConcat method shows timing a chunk of code with a very small execution time, so it is wrapped in a for loop and using the overload of Time that will let Bench calculate an appropriate loop count. That loop count is passed in as a parameter to the Action method. The Time method will return only the time for the concatenation. The overhead to repeatedly call the Action method, as well as the time of the for loop are removed from the final timing results. The last example of DictionaryRemove shows the use of TimeControl’s Pause/Resume to exclude the set up from the timing. I’ve chosen in this case to explicitly state the loop count since I’m using the loop count to build the Dictionary, and the size of the dictionary likely will affect the time of the Remove I’m timing. By explicitly specifying the loop count, I ensure all timings are comparable. This example also shows using a regular method instead of an anonymous method or lambda expression. Again, the time returned will be for just the d.Remove(i) code. The code above produces the following output to the Console:

Name= Sleep
Millisecs/Op= 2.006, , Ops= 2,490, ElapsedMillisecs= 4,994.50
Name= String Concat
Nanosecs/Op= 67.871, , Ops= 73,200,000, ElapsedMillisecs= 4,968.17
Name= Dictionary.Remove
Nanosecs/Op= 26.370, , Ops= 169,354,000, ElapsedMillisecs= 4,465.93


The Bench class uses the term “Iteration” to describe one call to the target method. If the target method doesn’t have a for loop, then the term “Operation” is the same as Iteration. But if the target method has a for loop, then Operation is a single execution of the body of the loop, and the number of operations executed will be the Iteration count times the loop count.

I should mention the normal caveats for casual benchmarking. This utility uses StopWatch, which measures clock time, not CPU time. This is an important point, because your results will vary some from run to run because of the other things going on in Windows and in the .Net runtime. If you need to do serious CPU time profiling, it’s better to just go to the commercial profiling tools. I tend to use both. I use the profiler to identify where I need to optimize, and use casual benchmarking while optimizing a chunk of code, to quickly see the effect of my changes. You can benchmark small bits of code, but when the time per operation is below 100 nanoseconds, don’t take the numbers too literally. Numbers that small are more prone to the effects of outside factors. You should also keep in mind the possible effects of optimizations made by the just-in-time compiler, which may even optimize away some of the code you are trying to benchmark.

Caveats aside, you can use Bench to do some pretty fine grained benchmarking, as the following example shows. This benchmark shows the difference between a normal mod operation and a mod of a power of 2.
    var bench = new Bench();
    int x;
    bench.Time("Mod 13", (c) => { for (int i = 0; i < c; i++) x = i % 13; });
    bench.Time("Mod 2^4", (c) => { for (int i = 0; i < c; i++) x = i % 16; });
The code for the parts of Bench used for timing is below, with comments stripped out. Use the download link at the top of this post to get the complete file with all the comments and documentation. There are also some methods in Bench for benchmarking the size of objects and struct instance, which are described in the next post.
// Copyright ©2013 SoftWx, Inc.
// Released under the MIT License the text of which appears at the end of this file.
// 1/3/2013
// <authors> Steve Hatchett

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace SoftWx.Diagnostics
{
    /// <summary>
    /// Tools for benchmarking code. The Bench class helps do timing benchmarks, and
    /// also object size determiniation. 
    /// </summary>
    public class Bench{
        
        public int MinIterations { get; set; }
        public int MinMilliseconds { get; set; }
        public bool WriteToConsole { get; set; }
                
        public Bench() : this(5, 5000, true) { }
        
        public Bench(int minIterations, int minMilliseconds, bool writeToConsole) {
            MinIterations = minIterations;
            MinMilliseconds = minMilliseconds;
            WriteToConsole = writeToConsole;
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, Action target) {
            if (target == null) throw new ArgumentNullException("target");

            return Time(name, 1, 
                        (c, tc) => { target();},
                        (c, tc) => { });
        }

        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, Action<TimeControl> target) {
            if (target == null) throw new ArgumentNullException("target");

            return Time(name, 1,
                        (c, tc) => { target(tc);},
                        (c, tc) => { });
        }

        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, Action<int> target) {
            if (target == null) throw new ArgumentNullException("target");
            int count = CalculateGoodCount((c, tc) => { target(c); });
            return Time(name, count, target);
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, Action<int, TimeControl> target) {
            if (target == null) throw new ArgumentNullException("target");
            
            int count = CalculateGoodCount(target);
            return Time(name, count, target);
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, int count, Action<int> target) {
            if (count < 0) throw new ArgumentOutOfRangeException("count");
            if (target == null) throw new ArgumentNullException("target");

            return Time(name, count,
                        (c, tc) => { target(c);},
                        (c, tc) => { for(int i = 0; i < c; i++) { } });
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        public TimeResult Time(string name, int count, Action<int, TimeControl> target) {
            if (count < 0) throw new ArgumentOutOfRangeException("count");
            if (target == null) throw new ArgumentNullException("target");
            
            return Time(name, count, target, 
                        (c, tc) => { for (int i = 0; i < c; i++) { } });
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        private TimeResult Time(string name, int count, Action<int, TimeControl> target, 
                                Action<int, TimeControl> overhead) {
            if (count < 1) count = 1;
            GC.Collect();
            GC.WaitForPendingFinalizers();

            TimeResult targetResult = Time(count, target, this.MinIterations, this.MinMilliseconds);
            TimeResult overheadResult = Time(count, overhead, targetResult.Iterations, 0);
            var adjustedTime = targetResult.Elapsed.Subtract(overheadResult.Elapsed);
            if (adjustedTime.Ticks < 0) adjustedTime = new TimeSpan(0);
            targetResult = new TimeResult(name, targetResult.Operations, adjustedTime, 
                                          targetResult.Iterations);
            if (this.WriteToConsole) {
                Console.WriteLine(targetResult.ToString());
            }
            return targetResult;
        }
        
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        private TimeResult Time(int count, Action<int, TimeControl> target, 
                                long minIterations, long minMilliseconds) {
            TimeControl tc = new TimeControl();
            target(1, tc); // ensure the code being timed has been jitted
            tc.Reset();
            long iterations = 0;
            tc.StopWatch.Start(); // don't use tc Pause/Resume here, because we don't want to count overhead for that
            do {
                target(count, tc);
                iterations++;
            } while((tc.StopWatch.ElapsedMilliseconds < minMilliseconds) || (iterations < minIterations));
            tc.StopWatch.Stop();
            long totalOperations = (long) count * iterations;
            return new TimeResult(null, totalOperations, tc.Elapsed, iterations);
        }

        private int CalculateGoodCount(Action<int, TimeControl> target) {
            int count = 1;
            do {
                var result = Time(count, target, 0, 0);
                if (result.Elapsed.TotalMilliseconds > 1) {
                    break;
                } 
                count *= 10;
            } while (count < 1000000000);
            return count;
        }
        
        /// <summary>
        /// Allows control of what portions of the method are benchmarked, by
        /// pausing and resuming the timing through the TimeControl.
        /// </summary>
        public class TimeControl {
            private Stopwatch sw;
            private long pauseResumeOverhead; // ticks
            private long accumulatedOverhead; // ticks
            
            public TimeControl() {
                this.sw = new Stopwatch();
                CalculatePauseResumeOverhead(); // ensure it's jitted before using the calculated overhead
                CalculatePauseResumeOverhead();
            }
            
            public void Reset() {
                sw.Reset();
                this.accumulatedOverhead = 0;
            }
            
            public void Pause() { this.sw.Stop(); }
            
            public void Resume() { 
                this.accumulatedOverhead += pauseResumeOverhead;
                this.sw.Start(); 
            }

            internal TimeSpan Elapsed { 
                get {
                    var ticks = Math.Max(0L, this.sw.Elapsed.Ticks - this.accumulatedOverhead);
                    return new TimeSpan(ticks);
                } 
            }
            
            internal Stopwatch StopWatch { get { return this.sw; } }
                        
            [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
            private void CalculatePauseResumeOverhead() {
                Stopwatch sw = new Stopwatch();
                this.pauseResumeOverhead = 0;
                int count = 1000;
                long computed = 0;
                // try several times to get a non-zero result 
                for (int trys = 0; trys < 10; trys++) {
                    long ticks1 = 0;
                    sw.Reset();
                    sw.Start();
                    for (int i = 0; i < count; i++) { }
                    sw.Stop();
                    ticks1 = sw.Elapsed.Ticks;
                    sw.Reset();
                    sw.Start();
                    for (int i = 0; i < count; i++) {
                        sw.Stop();
                        sw.Start();
                    }
                    sw.Stop();
                    computed = (sw.Elapsed.Ticks - ticks1) / count;
                    if (computed >= 0) {
                        this.pauseResumeOverhead = computed;
                        break;
                    }
                }
            }            
        }
        
        /// <summary>Results of benchmark timing.</summary>
        public class TimeResult : IComparable<TimeResult> {
            private string name;
            private long operations;
            private TimeSpan elapsed;
            private long iterations;
            
            private TimeResult() { }
            
            public TimeResult(string name, long operations, TimeSpan elapsed, long iterations) {
                this.name = name;
                this.operations = operations;
                this.elapsed = elapsed;
                this.iterations = iterations;
            }
            
            public string Name { get { return this.name; } }
            
            public long Operations { get { return this.operations; } }
            
            public double ElapsedMilliseconds { get { return this.elapsed.TotalMilliseconds; } }
            
            public TimeSpan Elapsed { get { return this.elapsed; } }
            
            public long Iterations { get { return this.iterations; } }
            
            public double NanosecondsPerOperation {
                get { return (1000000.0 / this.operations) * this.elapsed.TotalMilliseconds; }
            }
            
            public double MicrosecondsPerOperation {
                get { return (1000.0 * this.elapsed.TotalMilliseconds) / this.operations; }
            }
            
            public double MillisecondsPerOperation {
                get { return this.elapsed.TotalMilliseconds / this.operations; }
            }
            
            public double NanosecondsPerIteration {
                get { return (1000000.0 / this.iterations) * this.elapsed.TotalMilliseconds; }
            }

            public double MicrosecondsPerIteration {
                get { return (1000.0 * this.elapsed.TotalMilliseconds) / this.iterations; }
            }

            public double MillisecondsPerIteration {
                get { return this.elapsed.TotalMilliseconds / this.iterations; }
            }
            
            public override string ToString() {
                string timePerOpsLabel;
                double timePerOps;
                if (MillisecondsPerOperation > 1) {
                    timePerOpsLabel = "Millisecs/Op";
                    timePerOps = MillisecondsPerOperation;
                } else if (MicrosecondsPerOperation > 1) {
                    timePerOpsLabel = "Microsecs/Op";
                    timePerOps = MicrosecondsPerOperation;
                } else {
                    timePerOpsLabel = "Nanosecs/Op";
                    timePerOps = NanosecondsPerOperation;
                }
                string debugger = System.Diagnostics.Debugger.IsAttached ? "(Debugger Attached)" : "";
                return string.Format("Name= {0} {1}\n{2}= {3}, , Ops= {4}, ElapsedMillisecs= {5}",
                                     name, debugger, timePerOpsLabel,
                                     timePerOps.ToString("0.000"),
                                     operations.ToString("#,##0"),
                                     ElapsedMilliseconds.ToString("#,##0.00"));

            }
            
            public int CompareTo(Bench.TimeResult other)    {
                return this.MillisecondsPerOperation.CompareTo(other.MillisecondsPerOperation);
            }
        }
    }
}