Misc | Daniel

I went and prototyped a queue data structure based on the mmu abusing priciples outlined here. The performance was fantastic but not faster than STL’s vector. If you have a very specific need for dynamic sized queues with direct indexing this data structure may help you. Otherwise I would suggest leveraging the existing STL containers like vector or queue which are cache friendly.

The prototype implementation is available here in C11: https://github.com/daniel-dressler/mbq

==———–==

This data structure is the result of reading Adam Martin’s contiguous memory data structure for entity systems.
Adam does a good job of designing his data structure but my mind soon ran with ideas. I’ve gone a decent amount of lower level programming and there are some aspects of modern kernels and CPUs we can exploit for a high performance entity-component system.

For ease of reference we’ll can the data structure Daniel Backed Entity Component Systems, or DBECS for short. Please forgive the poor name, my other attempts were only worse.

Consoles with MMUs

The current generation of consoles, the PS4 and Xbox One, bring to the console world 64bit memory spaces backed by hardware memory management units (MMU). The MMU is a magical device which can translate fake virtual addresses into real addresses. The primary purpose of an MMU is to enable security between separate programs. In days before MMUs any program could alter or read any other program’s memory.

The MMU does this by disconnecting the addresses a program sees from the real address on the ram. We can harness this to enable fast deletion in arrays.

Goals for DBECS

The overriding goal of DBECS is spatial reference locality. This means we want a component to consume data continuous in memory.

After our broad goal we must consider these further constraints:

Reallocations should be minimal
Random lookup must be fast
Deletion must be fast
Indirect dereferences are to be avoided

Who should use DBECS

Any game of reasonable size should not need DBECS. These games will have better room for optimization. To be honest if you’re a AAA developer you might want DBECS but I’d guess you already know how to do this and know of it under a better name than DBECS.

Basic Approach

Our overriding goal suggests an array.

Arrays are continuous, and random lookups cost only a multiply and an add. Provided we use an array of values and not an array of pointers we also avoid indirect dereferences.

Arrays would be perfect except for their horrible deletion cost. To delete element A at position X from an array you must copy every element at position greater than X down one position.

With clever garbage collection and exploitation of the MMU we can achieve fast amortized deletion but we will discuses that later.

First we must decide how to layout our data in the array.

Data Layout

In his post Adam Martin touches on the issue of laying out our data to maximize the kernel’s read ahead optimization. This optimization occurs if we request the contents of address Y. The kernel will then read the contents of address Y but also read N bytes ahead. This extra data is then stored in CPU cache. Accessing your cache is much faster than a full request to RAM so being fast depends on the data we want being in cache.

The simple case is an array for one component where all the relevant data is stored together. This is called an array of structs. An example follows:

// An array of structs
struct entity {
   position pos;
   velocity vel;
};
std::vector<struct entity> entities;

For object oriented languages you’ll have seen this as an array, or vector, of objects. A C++ example: std::vector<struct our_data>our_datas;

There exists another method for data layout called struct of arrays. This second method might look like

struct of_arrays {
    std::vector<position> positions;
    std::vector<velocity> velocities;
}

The downside of structures of arrays is code readability and design quality. The upside for us is no unused data fills the cache from other component’s data. Consider adding a score component’s data to the above example:

struct of_arrays {
    std::vector<position> positions;
    std::vector<velocity> velocities;
    std::vector<score> scores;
}

Since the scores are separate different components can each read ahead their own data.

This is because your kernel can read ahead into multiple arrays at the same time. So the component which takes a position and adds a velocity will read ahead into the positions array and also read ahead into the velocity array. Both arrays will hit into the cache and that is what we need.

This data layout provides the cache locality requirement we wanted but does not solve the expensive destruction. For this we’ll need a more concrete example.

Struct of Raw Arrays with the MMU

Instead of using C++’s Vector we are now going to use straight C arrays. At their heart a C array is a pointer to a span of allocated memory. Standard convention is to allocate dynamic arrays with malloc() or calloc(). Instead we’ll use mmap()

On Linux mmap() is an easy yet low level method for allocating pages of virtual memory. The consoles will have similar interfaces but we’ll focus on Linux.

Note: A page is how the kernel tracks memory. A single page can be various sizes but under linux it is 64KB. Along with every page are control bits which determine if the page contents can be executed or which process owns the page. Pages are low level things which change between architectures. The features we’ll use should be available on every architecture with an MMU.

What we will do is allocate a decent chunk of memory, a maybe a whole 100MB. We can do this even on mobile devices because what the kernel gives our program is only virtual memory. We can ask for 1TB and the kernel will gives us what looks like 1TB but this is all a lie. The kernel will only allocate real ram once we write data to it.

This means we can write to the last byte of our 100MB and the write will succeed. It succeeds because the kernel knows that everything but the last page is unused and thus only allocates one 64KB page out of our 100MB.

This is only possible because our computer has an MMU. The MMU can map the fake virtual addresses the kernel gave us to the real ram addresses. It does this for all memory in your computer all the time and is super fast. It is the kernel’s job to tell the MMU what virtual addresses map to real addresses. The kernel can also remove a mapping, which is how we will solve our deletion problem.

You see unlike malloc() and free() it is possible to un-mmap, with munmap(), only a subset of the memory we received from mmap(). This means of our 100MB array we can release the first 10MB and keep the rest. When this unmapping happens the kernel will release the pages it had used for that 10MB and return them to the pool of available memory.

Now you may have figured out how this all fits together. We will write to the end of our arrays and delete from the front by unmapping the memory. In effect we will be walking through our memory space!

Most programmers should have a heart-attack right now. This is almost by definition a memory leak. The world’s largest memory leak. 100MB is a lot for mobile devices and you’ll soon run out if you were wasting memory this fast.

Except what we are leaking is the virtual address space and not the real ram. Provided we let the kernel reclaim the pages from the front of our array it is safe to get new pages allocated at our array’s end.

Now at this point we would find ourselves walking across our allocation. At some point we will hit the allocations end. A good approach would be to copy our array into the beginning of a new allocation.

A better way is to further abuse the MMU and move our array without copying. Under linux we can use mremap(). In a single call to mremap() we can move our array and even allocate extra memory to compensate for the front of the array we freed with munmap().

You might have noticed that we’ve assumed entity deletions occur at the beginning of our array. In a real game this is true for many entities but not all, for this we need garbage collection.

Garbage Collection

Entities like the player will stay alive and find themselves at the beginning of the array. Meanwhile surrounding entities like enemies will have died and gotten deleted.

Our garbage collection will start at the beginning of our array and check the first N entities. We only want to check the first N entities to keep the garbage collection pause constrained to prevent the long pause you see in minecraft.

Now if Y out of N entities are not active then we’ll moving the alive entities to the end of our array and munmap the now free portion of the array. Which Y and which N depend on your game and should be tuned for your needs.

If your entity-component system requires entities to be processed in order of creation then you should alter garbage collection to move alive entities to the end of the portion of the array you’ve garbage collected. This is more expansive but preserve ordering and still free unused pages.

We can now allocate cheap to delete entities which maximize cache locality. Next we must consider sparse entities.

Handling Sparse Entities

Imagine if your game had 50 components. This means you must allocate a position in each of the 50 or more data arrays. Unless an entity uses all 50 components we now have holes in our data arrays.

This will increase memory usage depending on your game.

A game which allocates similar entities in batches will have large holes. If these holes are larger than a page size, 64KB under linux, the hole will not consider any memory!

If instead your game allocates entities interspersed you might have holes smaller than page size which will explode memory usage. Like all things in computer science this can be solved by abstracting the id from the index with another array.

Closing Words

I wrote this after reading Adam’s post and wanted to get my idea onto paper before it disappeared. I’ve explained the idea to two friends familiar with low level OS work and they thought the scheme could work.

Game Programming Patterns / Optimization Patterns Data Locality – http://gameprogrammingpatterns.com/data-locality.html#11

Pitfalls of Object Oriented Programming – http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

Re: Zero-fill-on-demand technique: why it is needed? – http://linux.derkeiler.com/Newsgroups/comp.os.linux.development.system/2006-11/msg00176.html

How to find CPU cache details with linux – http://stackoverflow.com/a/716229