Jack Dermody

Block Allocation

Allocating objects in .NET is fast but certainly not free. When you are in the business of allocating lots of objects, the allocation time itself can become a burden. Block allocation is a technique to reduce that cost (and sometimes the total allocated size too).

It works by allocating a large block of memory - enough to fit many objects - and then doling out that memory over time, one slice per object allocation.

This technique is well established in C++, where the placement new operator makes it slightly easier. The same approach can also be used in .NET with a few caveats, and the performance improvements can be equally dramatic.

Scenario

Imagine we're processing some data and we're generating summaries as we go through it. When the data is large we can see (from profiling) that a lot of time is spent just allocating objects. Since we know we'll need lots of these objects, is there a way to make this more efficient?

interface IEvent
{
    ulong Id { get; }
    double Score { get; }
    byte Attribute1 { get; }
    byte Attribute2 { get; }
    float ConsolidatedScore { get; }
}

struct SomeStruct : IEvent
{
    public ulong Id { get; set; }
    public double Score { get; set; }
    public byte Attribute1 { get; set; }
    public byte Attribute2 { get; set; }
    public float ConsolidatedScore { get; set; }
}

class SomeClass : IEvent
{
    public ulong Id { get; set; }
    public double Score { get; set; }
    public byte Attribute1 { get; set; }
    public byte Attribute2 { get; set; }
    public float ConsolidatedScore { get; set; }
}

The first thing we might think of is to see whether things are faster with a struct. Structs are allocated inline - on the stack for local variables - and an array of structs occupies a single block of contiguous memory. It turns out that sometimes, yes, this can be faster.

The caveat is boxing - if we refer to the events via an interface, or store them in a data structure that is not backed by an array of the struct type, then the structs will be boxed and performance will be much worse than with a class (due mainly to their size - each boxed copy is much bigger than a single reference).
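We can see the copy semantics of boxing directly with the types above: assigning a struct to an interface-typed variable allocates a new heap object and copies the struct into it.

```csharp
var s = new SomeStruct { Id = 1 };
IEvent boxed = s;             // box: heap allocation plus a full copy of the struct

s.Id = 2;                     // mutates the original struct only
Console.WriteLine(boxed.Id);  // still 1 - the boxed copy is independent
```

Every Add to the List&lt;IEvent&gt; in the benchmark below pays for one of these boxes, which is where the extra time goes.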

static void TimeIterations(string name, Action<int> callback)
{
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0; i < 1024 * 1024 * 32; i++)
        callback(i);
    sw.Stop();
    Console.WriteLine($"{name}: {sw.ElapsedMilliseconds:N0}ms");
}

var blocks = new List<SomeClass>();
var blocks2 = new List<SomeStruct>();
var blocks3 = new List<IEvent>();
TimeIterations("gc allocator (class)", i => blocks.Add(new SomeClass()));
TimeIterations("gc allocator (struct)", i => blocks2.Add(new SomeStruct()));
TimeIterations("gc allocator (boxing struct)", i => blocks3.Add(new SomeStruct()));
gc allocator (class): 4,048ms
gc allocator (struct): 3,714ms
gc allocator (boxing struct): 7,414ms

So we gain a small improvement by converting to a struct, but pay a significant penalty if we ever need them boxed.

Block Allocator

It turns out that we can do better than the CLR if we take over the allocations ourselves using the unmanaged heap (which is not garbage collected). Of course, we'll need to remember to free the memory at some point to avoid leaks! We'll also need to enable unsafe compilation and write code that looks a little like good ole C++...
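Before the full allocator, here is a minimal, self-contained sketch of the two primitives it is built on - Marshal.AllocHGlobal and Marshal.FreeHGlobal (the sizes and values are purely illustrative):

```csharp
using System;
using System.Runtime.InteropServices;

// Allocate 16 bytes on the unmanaged heap, write through a raw pointer, free it.
IntPtr block = Marshal.AllocHGlobal(16);
long sum = 0;
try
{
    unsafe
    {
        var p = (long*) block;
        p[0] = 40;
        p[1] = 2;
        sum = p[0] + p[1];
    }
}
finally
{
    // Unmanaged memory is never garbage collected - we must free it ourselves.
    Marshal.FreeHGlobal(block);
}
Console.WriteLine(sum); // 42
```

The block allocator below is this same pattern, scaled up: one AllocHGlobal per 32,768 objects instead of one per object.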

unsafe struct DataWrapper
{
    readonly SomeStruct* _ptr;

    public DataWrapper(SomeStruct* ptr)
    {
        _ptr = ptr;
    }

    public ulong Id {
        get => _ptr->Id;
        set => _ptr->Id = value;
    }

    public double Score {
        get => _ptr->Score;
        set => _ptr->Score = value;
    }

    public byte Attribute1 {
        get => _ptr->Attribute1;
        set => _ptr->Attribute1 = value;
    }

    public byte Attribute2 {
        get => _ptr->Attribute2;
        set => _ptr->Attribute2 = value;
    }

    public float ConsolidatedScore {
        get => _ptr->ConsolidatedScore;
        set => _ptr->ConsolidatedScore = value;
    }
}

class UnmanagedAllocator<T> : IDisposable
{
    readonly int _size, _blockSize, _allocationSize;
    readonly List<IntPtr> _buffers = new List<IntPtr>();
    readonly Func<IntPtr, T> _wrapperCreator;
    IntPtr _curr = IntPtr.Zero;
    int _index = -1;

    public UnmanagedAllocator(int typeSize, Func<IntPtr, T> wrapperCreator, int blockSize = 32768)
    {
        _size = typeSize;
        _blockSize = blockSize;
        _allocationSize = _size * _blockSize;
        _wrapperCreator = wrapperCreator;
    }

    public T Allocate()
    {
        if (_index < 0 || _index >= _blockSize) {
            _buffers.Add(_curr = Marshal.AllocHGlobal(_allocationSize));
            _index = 0;
        }

        return _wrapperCreator(_curr + (_index++ * _size));
    }

    public void Dispose()
    {
        foreach (var buffer in _buffers)
            Marshal.FreeHGlobal(buffer);
        _buffers.Clear();

        // Reset so that any further Allocate call grabs a fresh block
        // rather than handing out pointers into freed memory.
        _curr = IntPtr.Zero;
        _index = -1;
    }
}

// Note: sizeof on a custom struct and the pointer cast both require an unsafe context.
var allocator = new UnmanagedAllocator<DataWrapper>(
    sizeof(SomeStruct), 
    ptr => new DataWrapper((SomeStruct*) ptr)
);

To avoid returning raw pointers from the allocator, we instead return a "safe" wrapper that hides the pointer. The block allocator reserves enough memory for 32,768 structs each time it runs out. In this scenario that means a total of 1,024 unmanaged allocations - down from 33,554,432!

The runtime performance of the block allocator is around half that of allocating structs via the CLR, and if we apply StructLayout (with Pack = 1) then the allocation size can also be reduced (potentially at some cost in performance, since unaligned field access is slower on most CPUs).
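For example, a hypothetical packed version of the event (PackedEvent is not part of the code above, and the sizes assume the usual x64 layout rules - SomeStruct typically occupies 24 bytes because its fields are padded to 8-byte alignment):

```csharp
using System;
using System.Runtime.InteropServices;

// With Pack = 1 the padding between fields is dropped, so the struct
// shrinks to the raw sum of its fields: 8 + 8 + 1 + 1 + 4 = 22 bytes.
Console.WriteLine(Marshal.SizeOf<PackedEvent>()); // 22

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PackedEvent
{
    public ulong Id;                 // 8 bytes
    public double Score;             // 8 bytes
    public byte Attribute1;          // 1 byte
    public byte Attribute2;          // 1 byte
    public float ConsolidatedScore;  // 4 bytes
}
```

Over 33 million events, those two bytes per struct add up to around 64MB of saved memory.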

Since the memory is unmanaged there is no need to pin it, and the pointers remain valid until the allocator is disposed. This lends itself to a unit of work pattern: create an allocator, allocate lots of objects while completing the unit of work, then dispose the allocator, freeing all of the allocated objects at once.
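The pattern can be sketched in a condensed, self-contained form - here with a plain long payload and a hypothetical block size of 1,024, standing in for the struct and allocator above:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Unit of work: allocate freely during the batch, free everything at the end.
var buffers = new List<IntPtr>();
long total = 0;
const int blockSize = 1024;               // values per unmanaged block
try
{
    unsafe
    {
        long* curr = null;
        var index = blockSize;            // force an allocation on first use
        for (var i = 0; i < 5000; i++)    // spans several blocks
        {
            if (index >= blockSize) {     // current block exhausted: grab a new one
                curr = (long*) Marshal.AllocHGlobal(blockSize * sizeof(long));
                buffers.Add((IntPtr) curr);
                index = 0;
            }
            long* p = curr + index++;     // dole out one slot from the block
            *p = i;
            total += *p;
        }
    }
}
finally
{
    // 5,000 logical allocations, but only a handful of actual frees.
    foreach (var buffer in buffers)
        Marshal.FreeHGlobal(buffer);
}
Console.WriteLine(total); // 12497500
```

Every pointer handed out stays valid until the finally block runs - exactly the lifetime guarantee the unit of work relies on.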