# Computing AABB Trick (Loop Trick)

Lately I noticed a small trick that applies to loops when trying to find a minimum or maximum of some values. Usually I just apply the trick to loops where I need to compute an AABB over some geometry. I think I noticed this trick when reading a for loop Erin Catto wrote in some of Box2D’s internal code.

The trick is super simple: just process the first element outside of the loop to set up your initial conditions, then form your loop to skip the first element. An assumption would be made that there’s at least one element in the array to process. Here’s an example for computing an AABB:

Usually I myself would have written this kind of code like so and not given any more thought to it:

This second code chunk is arguably just slightly more esoteric and is definitely a little less efficient for no good reason.

One could also skip the first element when finding the min/max of any sort of array, like for example: dot product results. Though simple it’s pretty nice to find small ways to write slightly better code.

# Small C++ Reflection Demo

I created a small demonstration program that explains the core ideas behind implementing a custom reflection system for C++. More might be written in this post in the future — for now I’m just storing the demo right here on this webpage:

# Memory Management

Any competent software engineer will have spent significant time working with low level memory management. Even though the operating system code is written for will often provide some kind of allocation and deallocation mechanism, application specific assumptions can be made to increase memory related performance.

For example certain hardware doesn’t have virtual memory support, or the virtual memory support can be quite lacking. A lack of virtual memory means raw allocations from the OS return real addresses to the hardware RAM. Usually virtual memory can alleviate some effects of memory fragmentation through a level of indirection, though when dealing with physical memory yourself no such alleviation exists.

This is just one example of how a software memory manager can be written and used to control memory fragmentation in a way that makes sense for the application.

## Types of Allocators

There are a few main types of allocators that I myself have found pretty useful: paging, stack and heap based allocations. Each one makes specific assumptions about the types of allocations and how the memory ought be used. Due to these assumptions significant performance boosts can be reaped in ways that may not have been realistic with raw operating system allocations.

## Stack Based Allocation

My favorite type of allocation involves the use of a simple stack. The idea is to make one large call to malloc or new and hold this piece of memory. The Stack itself just holds a pointer to this large chunk of memory, and an integer representing an index into the stack with an element size in bytes.

Here is what a Stack implementation might look like (in pseudo code):

Allocation can work by moving the m_memory pointer forward in the stack. Deallocation can work by moving the m_memory pointer backwards in the stack. Notice that the Free function requires the user to pass back in the size of the allocation! This can be avoided by storing this size parameter from Allocate inside of the m_memory array itself, just before the location of the returned address. Upon deallocation this value can be retrieved by moving the data parameter of Free back in memory by 4 bytes.

The advantage of the stack allocator is that it’s extremely fast and dubiously simple to implement. The limitation is that deallocations must be performed in the reverse order of allocations, since the stack itself is in LIFO order. This makes the use cases for the stack allocator pretty limited. Usually resources, like images, level files, sounds, models, etc. can be loaded into memory with a stack based allocator. Anything that has a very clear and non-variable lifespan should be able to be allocated on a stack.

One last trick is that the last allocation can be trivially resized! Often times an algorithm will require a lot of temporary scratch memory to perform some calculations, or store some state. An initial guess as to how much memory is needed can often be calculated as the worst-case scenario. Once an algorithm finishes this scratch memory can be reduced to the size actually used, if it is the last allocation on the stack. Resizing the last stack allocation involves moving the index backwards in memory.

## Heap Allocation

Implementing your own heaps is pretty similar to the stack based allocator. A heap allocator will use the operating system to allocate a large chunk of memory. Subsequent calls to the heap’s Allocate and Free methods will just dip into this chunk and fetch a piece.

The heap is more versatile and general purpose than a stack allocator. The heap can be implemented with a linked list of nodes. Each node represents a piece of memory. A node can either be allocated or free. To keep track of these linked list pointers, allocation state, and size of the memory block some memory itself is required! This stuff can be stored in a separate array, or right inside the large raw chunk of memory (just like with the stack allocator).

Usually it is preferential to add a small header to each allocation to store this information. A heap node might look something like this:

When the heap is first constructed it will contain a linked list of HeapHeader structs, but only a single header will be present, and it holds the entire piece of raw memory originally allocated by the OS upon the Heap allocator’s construction.

Allocating from the heap involves splitting a free HeapHeader into an allocated piece, and a new HeapHeader for the leftover space. The details of this lay mostly in the linked list implementation, and is not the focus of this article.

In order to reduce memory fragmentation it is a good idea to merge adjacent free HeapHeader links into a single link. This ought to be handled in the Heap::Free function. The details of merging free links lay mostly in the linked list implementation, and is not the focus of this article.

Here’s an example of what the Heap may look like in implementation:

When Heap::Allocate is called a free link of appropriate size must be searched for. This has the time complexity of O( N ), and a lot of memory must be fetched into the cache upon allocation as the list itself is traversed. There are tricks to improve allocation performance of heaps, and a simple one would be the cache a single pointer to a free block in the heap itself. This pointer can be cached in Heap::FreeHeap::Allocate, or both. Once a new call to Heap::Allocate is made this cached pointer can be tested first to see it is an appropriate size.

There are two common ways to search through the links for an allocation: first fit and best fit. First fit will return the user with the first piece of memory large enough to hold the allocation. Best fit will return a chunk of memory that came from a HeapHeader with the smallest size that is still large enough to hold the requested allocation size.

First fit can be preferential for cache coherency, as it may prefer to allocate from the beginning of the heap and try to keep things closer together in memory. Best fit may be preferential for keeping the heap as un-fragmented as possible.

## Paged Allocation

The heap based allocator intends to fight memory fragmentation through fitting links to allocation sizes, and by merging adjacent free memory blocks. This type of fragmentation is called external fragmentation. Another type of memory fragmentation is called internal fragmentation.

An internal memory fragmentation is when an allocated piece of memory is given to the user that actually holds more memory than the user requested. The user is assumed to not know about this extra piece of memory. This can provide an advantage to the allocator: all allocations can be of a fixed size, and any allocation larger than this fixed size is denied.

This lets the allocator act like an array. When an allocation is requested an empty element can be returned to the user. Upon freeing a piece of memory, the element is simply marked as free and placed into a free list.

The free list is a linked list of array elements. The memory in the free elements themselves should be used to store the pointer of each subsequent free element.

Allocation and deallocation become constant in time complexity and there is zero external memory fragmentation. In this way internal memory fragmentation is traded for external memory fragmentation.

### Pages!

The term “pages” comes into play when the array is filled up. Once an array is full of allocated elements another array can be allocated. Once this array is filled up, another one is allocated. Each array (aka page) can be stored in a singly linked list of pages.

The free list itself can pointer across multiple pages without any problems.

A page containing only free elements can be deleted entirely, though this feature might not need to be supported.

A paged allocator can also hold an array of singly linked lists of pages. Each element of this array can hold a list of pages that corresponds to a different element size. This can allow the paged allocator to fit different allocation requests into the most appropriate page list. A common tactic is to have pages that represent arrays with an element size of 2^N bytes, where N is usually at least 2, and smaller than some value K.

The biggest advantage of a paged allocator is zero external fragmentation. The internal fragmentation does make memory more non-homogeneous. This type of allocator will probably lower your cache line utilization. Cache line utilization would be how much memory in each cache line fetched from main memory to the CPU cache is actually used. Since internal fragmentation is a feature of a paged allocator, cache line utilization will probably suffer.

The unused memory in the pages can be reduced drastically on a per-application basis; if the users of the allocator are able to specify the element sizes of different page lists, then zero internal fragmentation can be achieved.

## Handle-Based Array

Instead of thinking of a paged allocator in terms of separate arrays, one might think of a simpler allocator that holds just a single array. If the elements within this array of of POD nature the array elements can be referenced by index. This lets the array grow or shrink in size as necessary, as new sized arrays can still be accessed by an old index.

Whenever the user wants a pointer to an element they first give the array an index, and a pointer is returned. This pointer is never stored anywhere! Continuous translation from index to pointer occurs -this allows the internal array itself to moved around in memory as necessary.

Users might need a little more power to refer to elements than a simple integer. Some type of handle might be needed to translate from index to pointer. Read more about handles here.

## Conclusion

Given these three types of allocators an application should have all the variety of memory allocation necessary to run with pretty good performance. More advance allocation techniques definitely exist, and some are just combinations of the three basic allocators presented in this article.

Each allocator can be quite simple in isolation! I myself implemented a stack in about 100 lines, a paged allocator in 150, and a heap in about 250 lines of C++ code.

Further reading might include topics such as: cache coherency, memory alignment, garbage collection, virtual memory, page files (operating system pages).

# Distance Point to Line Segment

Understanding the dot product will enable one to come up with an algorithm to compute the distance between a point and line in two or three dimensions without much fuss. This is a math problem but discussion and focus is on writing a function to compute distance.

## Dot Product Intuition

The dot product comes from the law of cosines. Here’s the formula:

c^2 = a^2 + b^2 – 2ab * cos\;\gamma
\label{eq1}

This is just an equation that relates the cosine of an angle within a triangle to its various side lengths a, b and c. The Wikipedia page (link above) does a nice job of explaining this. Equation \eqref{eq1} can be rewritten as:

c^2 – a^2 – b^2 = -2ab * cos\;\gamma
\label{eq2}

The right hand side equation \eqref{eq2} is interesting! Lets say that instead of writing the equation with side lengths a, b and c, it is written with two vectors: u and v. The third side can be represented as u - v. Re-writing equation \eqref{eq2} in vector notation yields:

|u\;-\;v|^2 – |u|^2 – |v|^2 = -2|u||v| * cos\;\gamma
\label{eq3}

Which can be expressed in scalar form as:

(u_x\;-\;v_x)^2 + (u_y\;-\;v_y)^2 + (u_z\;-\;v_z)^2\;- \\
(u_{x}^2\;+\;u_{y}^2\;+\;u_{z}^2) – (v_{x}^2\;+\;v_{y}^2\;+\;v_{z}^2)\;= \\
-2|u||v| * cos\;\gamma
\label{eq4}

By crossing out some redundant terms, and getting rid of the -2 on each side of the equation, this ugly equation can be turned into a much more approachable version:

u_x v_x + u_y v_y + u_w v_w = |u||v| * cos\;\gamma
\label{eq5}

Equation \eqref{eq5} is the equation for the dot product. If both u and v are unit vectors then the equation will simplify to:

dot(\;\hat{u},\;\hat{v}\;) = cos\;\gamma
\label{eq6}

If u and v are not unit vectors equation \eqref{eq5} says that the dot product between both vectors is equal to cos( γ ) that has been scaled by the lengths of u and v. This is a nice thing to know! For example: the squared length of a vector is just itself dotted with itself.

If u is a unit vector and v is not, then dot( u, v ) will return the distance in which v travels in the u direction. This is useful for understanding the plane equation in three dimensions (or any other dimension):

ax\;+\;by\;+\;cz\;-\;d\;=\;0

The normal of a plane would be the vector: { a, b, c }. If this normal is a unit vector, then d represents the distance to the plane from the origin. If the normal is not a unit vector then d is scaled by the length of the normal.

To compute the distance of a point to this plane any point can be substituted into the plane equation, assuming the normal of the plane equation is of unit length. This operation is computing the distance along the normal a given point travels. The subtraction by d can be viewed as “translating the plane to the origin” in order to convert the distance along the normal, to a distance to the plane.

## Writing the Function: Distance Point to Line

The simplest function for computing the distance to a plane (or line in 2D) would be to place a point into the plane equation. This means that we’ll have to either compute the plane equation in 2D if all we have are two points to represent the plane, and in 3D find a new tactic altogether since planes in 3D are not lines.

In my own experience I’ve found it most common to have a line in the form of two points in order to represent the parametric equation of a line. Two points can come from a triangle, a mesh edge, or two pieces of world geometry.

To setup the problem lets outline the function to be created as so:

The two parameters a and b are used to define the line segment itself. The direction of the line would be the vector b – a.

After a brief visit to the Wikipedia page for this exact problem I quickly wrote down my own derivation of the formula they have on their page. Take a look at this image I drew:

The problem of finding the distance of a point to a line makes use of finding the vector that points from p to the closest point on the line ab. From the above picture: a simple way to calculate this vector would be to subtract away the portion of a – p that travels along the vector ab.

The part of a – p that travels along ab can be calculated by projecting a – p onto ab. This projection is described in the previous section about the dot product intuition.

Given the vector d the distance from p to ab is just sqrt( dot( d, d ) ). The sqrt operation can be omit entirely to compute a distance squared. Our function may now look like:

This function is quite nice because it will never return a negative number. There is a popular version of this function that performs a division operation. Given a very small line segment as input for ab it is entirely possible to have the following function return a negative number:

It’s very misleading to have a function called “square distance” or “distance” to return a negative number. Passing in the result of this function to a sqrt function call can result in NaNs and be really nasty to deal with.

## Barycentric Coordinates – Segments

A full discussion of barycentric coordinates is way out of scope here. However, they can be used to compute distance from a point to line segment. The segment portion of the code just clamps a point projected into the line within the bounds of a and b.

Assuming readers are a little more comfortable with the dot product than I was when I first started programming, the following function should make sense:

This function can be adapted pretty easily to compute the closest point on the line segment to p instead of returning a scalar. The idea is to use the vector from p to the closest position on ab to project p onto the segment ab.

The above function works by computing barycentric coordinates of p relative to ab. The coordinates are scaled by the length of ab so the second if statement must be adapted slightly. If the direction ab were normalized then the second if statement would be a comparison with the value 1, which should make sense for barycentric coordinates.

## Sample Program

Here’s a sample program you can try out yourself:

The output is: “Distance squared: 2.117647″.

# What is there to Hate about References?

I find most usage of references annoying cruft. Often the arguments I see or hear that are “pro-reference” make the same lame points that most of the internet makes:

• Pointers are dangerous
• Pointers are ambiguous and confusing
• NULL pointers lead to undefined behavior and crashes

Just google “pointers and references” and you’ll see bad advice everywhere. A new programmer seeing these bullet points is likely to get hyped about using references everywhere. Seeing advice like this just sort of upsets some part of me. Perhaps it’s because when the above statements are plastered onto websites they state them as fact.

In an effort to not make the same annoying mistake as every other article on the internet I’ll present my opinion as an opinion. By stating something as an opinion the reader will immediately begin to read with a certain amount of skepticism. This might coax newer readers into thinking for themselves, which ought to be the goal of writing an educational article on the first place. Writing step-by-step instructions on how not to use “dangerous pointers” is the worst way to write on the topic of pointers and references.

I know I sound pretty bitter. I recall a time when I browsed the internet and looked for advice on this exact topic. It takes time to unlearn bad things, and so this post was born.

## Memory Matters

Where things are in memory is a big deal. Memory access is commonly a bottleneck in real-time applications, and code that has ambiguous memory accesses patterns upsets me. Imagine peering into a large function that is operating on some kind of data. Scanning the middle of the function a few lines of code are encountered:

Just be reading this code it isn’t immediately understandable as what the variable d is. What is happening here? In order to know what the scope of d is some scrolling or manual code navigation will ensue. How long will this take? How much does it cut down on focus while the reader is just trying to understand the code?

Often times for member variables of an internal class or struct will be appended with the m_ prefix. This is nice as readers immediately know the scope and implications of all uses of a member variable. There’s an implicit this pointer being accessed somehow, and the variable’s scope is relative to this class’s definition.

In this case there’s no such nice prefix. d can either be a reference or value type. There’s no way to know without some kind of intellisense. Mousing over a variable to see what the type is, given a nice IDE, is not really that big of a deal. The big deal here is that if you have to mouse over something to get an idea of what sort of memory this represents. Just take a look at this code:

What sort of questions might the user have about this code? Clearly d is a pointer to some memory. This immediately informs the reader about the nature of the code. is likely to be some kind of output, or perhaps a specific element in an array. It is definitely not just a lone variable on the stack. No intellisense is needed to get this information, no mousing over or code navigation is needed just to understand the idea of assigning a value to some non-local stack memory. The programmer reading this code might be able to focus slightly better on understanding the code due to the use of a pointer.

Sure d could technically be a *NULL* pointer (gasp), but in reality this is a non-issue. The only times checking for NULL pointers is important is when user-input is being handled. A lot of code doesn’t deal with user input! For internal code I’d make the argument that memory not on the local stack scope (local to at least the function currently executing” should almost always be referred to by pointer. Internal code can make assumptions on how pointers are used and not care about the NULL case. Internal code often solves difficult problems, and needs to be efficient (in the scope of real-time applications). Anything that fragments reader focus is bad, even taking a moment to mouse of a variable to see if it’s a reference or not.

## Another Example

In the above snippet imagine that the joined AABB is being written to, by finding the AABB that bounds both a and b. Perhaps in this specific case it is fairly obvious that joined is memory that is being written to by the MergeAABBs function. This is probably because joined was quite well named, despite being passed by reference to MergeAABBs. However this function might have been written in a way that returns a new AABB entirely by value, and only operates on a local stack copy of joined. In this case the code would compile and run perfectly fine, but joined would have unitialized memory. This might lead to a crash or assert, thus lower iteration time and programmer focus.

Now lets look at the use of a “dangerous” pointer:

In this code snippet, no matter what the third parameter is named as, it is as obvious as possible that the function MergeAABBs is operating on some memory passed to it, and does not return anything useful in this particular use-case. The contents of the function MergeAABBs is probably obvious as well, I know I can imagine how it’s implemented without even looking; there’s just no ambiguity.

The name of variables should be meaningful to the problem the code is solving. Requiring a naming convention for code clarity simply because of an arbitrary reference function parameter is an unnecessary constraint! Naming things is hard enough without random convention limitations.

Sure if some idiot passed in a NULL pointer to MergeAABBs it will crash, but how often does this happen in practice? What kind of programmers are you hiring that actually get caught up in this kind of problem? Real-life competent engineers aren’t going pass in a NULL pointer and will appreciate good code written with “dangerous and ambiguous” pointers.

When a function takes only floats and writes to a float it’s pretty much worst-case for reference ambiguity. Which float is being written to in the next code snippet?

Is the triangle actually {a, b, c}, or some other combination of parameters? Which of the arguments are float arrays (vectors) or just floats? Which ones are written to, and which are read only? Some kind of code navigation is needed to know for sure. By convention uvw might represent the name of barycentric coordinates for a triangle, but perhaps this specific piece of code was solving a particular problem where the derivation named them something else? It’s just ambiguous without a pre-defined naming convention, of which is imposed in the middle of non-related algorithms.

Here’s the pointer version; note how non-ambiguous this code is:

## Useful References

I currently know of a single use of references that I really like, and that’s a const reference passed to a function as an argument, and sometimes the returning of a const reference.

Passing a const reference to a function means that this is a read-only value, and is definitely not pointing to an array. It is also a pretty common convention for operator overloading. The only downside is that the dot access operator may be mistaken as a value-type access, instead of a pointer access.

Returning a const reference might make sense sometimes, but usually I’m of the opinion that a pointer is better. Returning a pointer just abides by all my previous points about memory access. If the user retrieves a const pointer from a function, the explicit -> access makes it very clear that this memory came from somewhere else!

References are also able to capture temporary rvalues. This can make the life-time of such temporary values more explicit.

Sometimes a million dereferences is just too many. In some cases the lack of the dereference operator is nice and adds to code readability. However, in this case references are just an aid and live only on the stack. The pointer is what is actually kept and stored, in order to keep the code clean and up-front. Here’s an example:

An equivalent, but more verbose and annoying version can be constructed with pointers:

## “Advanced C++” and Generic Programming

“Advanced C++” features (in quotations for sarcasm, like much of the rest of the article) are useful sometimes, there’s no doubt about. Templates, classes, and all the weirdness therein is sometimes necessary. The amount of code duplication and boilerplate that these features can be remove makes them important.

However, a lot of code is type-static and very specific. Code often solves very particular, specific problems. A lot of newer students (me) and colleagues get all caught up in the features and just end up wasting their time. When I say waste time, I mean they were actually trying to finish a project, instead of just learn about C++ and the uses thereof.

One might view C++ from the perspective that all the “advanced features” just can egg-on a programmer into over engineering their code into a weird mess of indirection and verbose templated code. Crazy inlined callbacks, type agnostic code, verbose namespaces and whatnot. Many times it’s just useless cruft, and a specific implementation for a single problem will be simpler, easier to read, and smaller in code size.

All of this ranting comes down the the point of: references let code operate in a slightly more agonistic manner, which is great for templates. The dot operator does it all! This makes sense for code that needs templating, but often doesn’t make sense for a lot of code (which was what the previous portions of the article pointed out!).

However, good generic code is so incredibly difficult to design and come up with that hardly anybody should be doing it. Good code that is used by multiple people is at least an order of magnitude harder to write than good code that has a single specific use case. Templates, references, classes, these things might make it tempting to try out all the features to make a “generic program” that “can be re-used in the future”. I’ll tell it how it is: simple code that is type static, specialized for the specific problem it is solving, and doesn’t use advanced features is probably an order of magnitude more reusable (and performant) than “generic” code, simply because it’s easier to write.

## Form an Opinion

As a reader, think for yourself and make your own judgment calls based on your own experience. This means that a good way to take advantage of the knowledge of an experienced programmer is to try out their advice with an almost skeptical attitude. Just don’t look for step-by-step instructions on how to be a computer scientist. Nobody wants to work with a mindless programmer that writes bad code, because then the good programmers will be busy cleaning it up.

# Cache Aware Components

Special thanks to Danny Frisbie for a nice discussion on the PODHandler implementation!

Let me start off by saying that optimizations really only need to be applied to bottlenecks. In order to know where a bottleneck might occur (especially cache related ones) you’ll probably need some experience. The experience not need be your own, but the experience will come from someone. In my (limited) experience the only bottlenecks I’ve ever seen in any piece of game related software (aside from N^2 loops with a high N) are always due to waiting on things to be placed into the cache. It’s really easy to write bad code, and bad code is usually cache oblivious. Even conceptually clear and understandable code can still be cache oblivious!

I’ve seen some very nice 2D and 3D games, made in C++, that used only rudimentary memory allocation schemes and naive component implementations. They ran at 60 fps just fine. If you’re a hobbyist or just trying to learn, then thinking about how to write the fastest component framework ever might be fun but don’t expect to do it correctly on your first try. Expect to fail, and then iterate.

So when the time comes to actually optimize something, having some sort of idea of where to look to learn how solve cache related problems will be valuable.

## Data Lookups and Cache Lines

In general the cache line size for hardware nowadays seems to be mainly 64 bytes. A cache line is a 64 byte piece of memory that is on 64 byte boundaries. Whenever data is transferred from one cache to another (or to/from main memory (RAM)) the memory is transferred in a cache line. This keeps the memory bus busy. 16 32-bit integers would be the size of a single cache line, or 16 32-bit floating point numbers. This reduces to the size of a 4×4 matrix of 4 element floating point scalars.

How fast a cache line is transferred depends on which level it is being fetched from. In general terms: when a piece of memory is fetched by the CPU from a lower level cache it is hoisted up into the L1 data cache (L1 D, L1 I is the instruction cache for code). If this memory was not in the cache it will take forever (100 to 300 cycles, probably near the 300 range for PC). Here’s a nice diagram by Naughty Dog summarizing the common cache setup for PC CPUs:

When something is loaded into the L1 cache whatever was there before has to be evicted, and will be pushed into the L2 cache (again using the same 64 byte cache line size). This will probably evict something from the L2 cache back down into the L3 cache, and so on and so forth.

The implication here is that whenever a cache line is read it is up to the programmer to try to use as much of that cache line as possible. Even though we might have 8 gigabytes of RAM, if we aren’t running in the cache the CPU will be sitting there waiting. Even if a single byte is read from main memory and entire cache line will be fetched. Reading a single byte from a random location in main memory is about the worst possible way to use memory.

This tends toward the idea of using very compact and concise data structures. If a data structure is packed together in memory it can be operated upon by the CPU very quickly once it arrives to the CPU’s cache.

The cache isn’t very big. Here’s a nice slide by Scott Meyers on the topic:

32KB of L1 data cache is tiny. You don’t even get to use all of it as the operating system does need to do stuff too!

## Prefetching

Prefetching exists to try to hide the latency of fetching memory. A prefetch is when a cache line is preemptively fetched and placed into the cache, such that when the memory is actually requested a cache hit occurs.

Hardware can detect patterns in real-memory accesses, but it can only detect pretty simple patterns like array traversals. Scott Meyers describes (see resources section) that the hardware is made in such a way that it can detect iterating over arrays forwards, backwards and with variable (but constant) element step size. It can also do this for all hardware threads simultaneously. However, if you’re not looping over an array you can’t count on any intelligent prefetching. It will take two or more cache misses in a recognizeable pattern to start automatic prefetching.

Usually compilers provide a specific keyword to hint to the run-time to grab a specific cache line from somewhere in memory. This can be used by programmers to ease out a final bit of performance, given a proper implementation to prefetch for.

## Cache (Un)Aware Components

Hopefully by now readers are convinced that contiguous arrays of data are very friendly to the cache, and where performance matters this knowledge should be exploited.

In a component based game engine architecture looking up and operating on components is often the first bottleneck encountered. It might pay to learn a little about how this might be circumvented.

A memory naive implementation of components will look something like this:

Each component is allocated on the heap explicitly by the operating system, which is going to require a context switch and be very sensitive to memory fragmentation.

Ideally all data of a certain type will be packed together in a tight linear array. When this data needs to be operated upon, the fastest sort of transformation (without manual prefetching) will look something like this (for a generalized example):

Ideally the size of a given element will be below the size of a cache line, and often times this is possible if extraneous data can be removed from the Data type.

The explicit consumption of the data where a local copy is made is probably not necessary and will be compiled away. It is fairly easy to check the assembly by using a compiler to process to an assembly file to double check. However this kind of practice can be very helpful to let the compiler know that multiple pointers cannot possibly alias the same type. For more information search for “C++ aliasing Ericson” to find Christer’s old slides.

Now that the ideal computational situation for transforming a large data set has been described, lets look at a common (albeit contrived) data transformation that we’ve all been guilty of while first learning:

Conceptually this code is very concise and easy to reason about. Though the code readability and dynamic niceties aren’t very efficient. Random calls to delete occur, the inner loop contains a branch, and called Update on an object will go to who knows where in memory. All of these things are basically punching the cache right in the gut. Even the branch can be annoying for the CPU pipeline as it may have to eject code out of the L1 I cache if a branch is mispredicted!

How can this be solved? The first step is to make sure as much data is packed together in memory as possible. In the above code snippet the list can be changed to an array (perhaps std::vector). Okay, pretty trivial change, no big deal. Objects can perhaps just be placement new’d into the array and placement deleted. This will act like a memory pool.

The next step is to identify that the UpdateGameObjects function is performing two types of tasks (assuming the Update function performs a single task); deletion and update calling. This is a result of the container of objects not being sorted. It is a non-homogeneous collection of objects that are both alive and dead. If objects can be separated into sections of dead and alive, only the alive objects need to be looped over.

## Cache Aware Components

One way to implement this would be to have the beginning of the array contain a contiguous line of live elements. The rest of the array can contain “deleted” or “not yet allocated” elements. In order to uphold this invariant it might be best to design objects placed into these sorts of arrays not care if they are moved in memory. Usually this means making your data a plain old data type (POD).

Deleting things from the array is going to be a nice feature to support. A game wouldn’t be interactive for very long if it could only consume more and more memory. A simple and very effective scheme is to move the last element of your array into the location of a deleted element.

However in order to refer to a unique element within an array simple pointers are no longer going to cut it. When an element is moved from the last index into a deleted slot any pointers to the old spot (the last and now empty element) will be dangling. Some form of translation from one data type into a pointer must occur in order to ensure that the correct pointer is retrieved for a unique entry in the array.

Usually this translation comes in the form of a handle. A handle can be implemented as an integer divided into two different sections. The details of how to implement a handle should be known by the reader before continuing on, so please view Llopis as a resource.

Lets create a simple abstract data type that grants access to an array of POD elements, of which supports handle translation, allocation and deletion:

In order to implement these two functions please do refer to the Llopis resource referenced in the last paragraph.

In order to implement Release things start to get tricky. How does the PODHandler update the handle of the element it moved? Somehow the location of the internal handle entry needs to be accessible just by knowing where the element was moved from. The easiest solution is to place a handle inside the type T within each element of the array. However, it would be great if types that are held inside of PODHandlers can fit within a single cache line. Adding a handle to every single element lowers the density of the data in the array. For certain situations this data bloat, though only 4 bytes per elements, will reduce the effectiveness of every single cache line from 64 bytes to 60.

Clearly an alternative could be used! The solution is to yet again separate different types of data into different arrays. The internal array of the PODHandler should consist of homogeous data! Rip out that intrusive handle and place them all in their own arrayPODHandler can now consist of 3 arrays in total: an array of type T, an array of Handles, and an array of integers.

The array of integers share their indices with the array of PODs. This means a POD in element 3 will correspond to the integer in element 3 of the integer array. The integer array contains indices that map to the handle associated with a given POD element in the handle array m_entries.

Though readers may by now be wondering “wouldn’t three different arrays potentially have worse locality of reference than just two arrays?”, and this would be a good thing to wonder. It is true that an intrusive handle would be preferable if handle translations are extremely frequent. If they are, the original intrusive handle implementation may be ideal.

If an engine is architected to focus on cache utilization for transforming large data sets with expensive operations, then a homogeneous array will be preferential. Or in other words if you want to loop over a lot of stuff and do expensive math on each element, that array better be dense. This means that handle translations are more infrequent since the code focuses on looping over the data array itself rather than picking out individual elements at random.

## Open Source Implementation

The idea PODHandler represents is important. My implementation is just my own manifestation of the concepts described in this post. My implementation is not important! The concepts are important. Hopefully by allowing readers another piece of reference, in the form of some source code, the ideas presented here can be better realized.

PODHandler Source: Link (not up yet)

# Sane Usage of Components and Entity Systems

With some discussion going in a previous article about how to actually implement some sort of component system for a game engine, without vague theory or dogma, a need for some higher level perspective was reached, and so this article arose.

In general an aggregation model is often useful when piecing together bits of functionality or data to create something new. The ability to do so is very useful for writing game-specific gameplay code due the flexibility of code granted by aggregation. However as of late there’s been tremendous talk about OOP, Entity Systems, Inheritance, and blah blah blah within the online indie development community. More and more buzzwords get tossed around by big name writers and the audience really just looks for some guidelines to follow in hopes of writing good code.

Sadly there isn’t going to be a set of step by step rules for writing a game engine or coming up with a good architecture. Like many of said before me, writing a game is a specific task requiring specific solutions. Why do you think game engine developers such as Epic or the Unity guys have so many people working on the product? Because a generic game engine is a huge piece of software that requires a lot of features. Some features exist simply to let users add in custom features easily.

Components, aggregation, Entity Component Systems, Entity systems, these are just words and have various definitions (depending on who you ask).

## Some Definitions

To hopefully avoid silly arguments and confusion lets define some terms. If you don’t like the definitions here feel free to express so, I’m all up for criticism and debate.

• Component Based Architecture
• A preference for aggregation over inheritance. Is just a concept and does not lead to a single specific implementation. A game object is a collection of components. A component defines data and/or functionality for a concept.
• Entity Component System (ECS)
• A specific implementation of Component Based Architecture. A game object would be an ID (an integer). The ID is used to form an aggregate. Usually an ECS implies an implementation similar to a database, where components are entries into a database that are looked up through some identifier. The main goals of this implementation are efficiency and simplicity. Often times the term “ECS” is used just to describe a Component Based Architecture, often leading to confusion.
• Aggregation
• I like to think of this as a “has-a” relationship over an “is-a” relationship. Aggregation refers to one object “having” another object, which implies an aggregate is a collection (data structure) of other objects.

## Some Truth and History

Aggregation is useful from a game design perspective. It frees functionality from arbitrary classification (classes and inheritance). Classes were originally created in C++ to let a programmer tie together a piece of data and some functionality to represent some sort of real-life concept. This is in simplest terms the essence of Object Oriented Programming (OOP). Over time more features were added to help engineer relationships between classes, one such feature came in the form of inheritance.

There’s nothing inherently wrong with OOP and it makes sense in a lot of code. Problems can arise when there’s a mis-application of OOP that has implications that aren’t fully understood at the time of implementation that cause negative affects down the road. I’m sure we’ve all seen the code migration and mega-class example so commonly thrown around in articles arguing against OOP and inheritance abuse.

In response to such an abuse a new paradigm became popularized which focused on aggregation of functionality to form an object. This might be called a “component based architecture”. In general aggregation can be considered an appropriate alternative to inheritance.

## OOP Diatribe

Usually when an article spews forth caustic attacks against OOP it’s directed at naive implementations that disregard implications of how memory is accessed. Perhaps in the past the bottleneck of most everything was processor speed, so a lot of literature focuses on this. Nowadays CPUs on the PC have an architecture that have ridiculous computational power with extremely limited memory access. In general one might consider accessing memory from RAM 300 times slower than multiplying two floats together. Of course this last statement is extremely anecdotal without any evidence, but exists just to give a rough perspective of reality in many current (2014) cases.

If objects with associated code (classes) are just allocated and deallocated on the heap at will then a performance bottleneck of memory access is going to rear its ugly face, likely long before other performance issues are even on the radar. This is where much of the diatribe comes from.

It should be noted that pretty much all code bases that make use of the C++ language use classes and structures in some form or another. As long as a programmer has an understanding of memory, how it’s accessed, and what implications arise from given implementations, nothing will go wrong. Alas, actually doing these things and writing good code is super hard. It doesn’t matter if a class has some implementation code within it, so long as that bit of code makes sense for the purposes it is serving.

## Implementing Components, a First Draft

The most immediate implementation would be to make use of multiple inheritance. This has a clear definition of where the data goes, and it all goes in one class -the derived class. Multiple inheritance itself can get a bit tricky when dealing with pointer typecasting between derived and base types, though the C++ language itself handles the details much of the time.

Inheritance alone doesn’t provide a good mechanism to query whether a base class is apart of a specific derived aggregation and so the dynamic cast operator is born. Since the dynamic cast is a branching operation, usually implemented (afaik) by inspecting the vtable, it is avoided in general.

Multiple inheritance also does all sorts of work to member function pointers, and is just a sad part of C++. Additionally there isn’t any language feature that allows for dynamic dispatch for combinations of base classes, so if the need arises a custom solution will need to be implemented anyway.

Memory accessing, although defined, isn’t ideal. Multiple inheritance forms a blob of different data, and usually only a single piece of the blob is needed at any given time, meaning locality of reference will be poor in general. This leads to the idea of inheriting from multiple interfaces in order to decouple memory aggregation from functionality aggregation, which leads to the next draft.

## Second Draft – Run-Time Aggregation

Instead of using multiple inheritance on interfaces, which is a compile-time feature, run-time support can be added. Object aggregates can be formed during run-time, and modified thereafter. This is appealing for data driven applications, and game-design friendly development iteration speed.

So lets assume that some programmer wants to implement components, but doesn’t think much about memory access patterns the implications therein. Using a vector of pointers an implementation of components becomes super simple. Each pointer can point to an interface exposing a few functions like Update, Init and Shutdown.

Searching for a particular component is as simple as linearly looping over each pointer until a matching type is found. If these pointers are ordered in some way a search can be performed, perhaps a binary search could suffice. If the identifier of a component is hashable a hash table lookup can be used.

The implementation so far is an excellent one except that there is no definition of how memory is allocated and accessed! In the most naive of implementation each game object and each component will be allocated on the heap with separate calls to malloc.

Despite having no clear memory definition there are some nice benefits that have arisen. Data driving the composition of an aggregate becomes quite trivial as each component of an aggregation can have an entirely isolated lifetime. Adding, removing, modifying, or even creating new components at run-time are all now possibilities. This dynamic aggregate architecture is great for improving game development and design iteration time!

## Aggregation and Components and the Entity System Paradigm (ES/ECS)

As stated in the definitions section, an ECS is just a specific implementation of a component based architecture. A component based architecture game engine architecture would be a custom implementation of multiple inheritance. A clearly defined ECS can impose restrictions on how a component architecture is implemented and used in hopes of avoided poor memory access patterns, or in hopes of keeping code simple and orderly.

If a component is designed as a piece of memory without any code, and a game object defined as an integer ID then performance specifications can be easily imposed. Rules about where in memory components lay, and how components are actually accessed can be clearly defined in simple terms. Code can be written that operates upon arrays of components, transforming arrays linearly. This idea is actually a type of Data Oriented Design (DOD), which makes sense as DOD is just an idea! ECS is an application of the idea of DOD.

So with this type of implementation the benefits of dynamic composition can be paired with well-defined memory layout and access patterns. Suddenly prefetching and parallelism become much simpler to support.

## Aggregatize all the Things!

There’s a problem. Blindly shoving the idea of an ECS implementation into every nook and cranny of an engine during development is just silly (or any complex system, not just game engines or libraries). Often times a particular system is not best implemented with a component or aggregate paradigm in mind.

An obvious case is that of a physics engine. Often times a physics engine developer is worried about collision detection, solving systems of linear equations, rigid body mechanics and allowing the engine to easily be integrated into existing code bases. These details involve a lot of math and good API design. A developer of a physics engine is going to have their focus employed in full force in solving problems specific to physics engines. This means that the engineer’s focus is finite, so the implementation that is best is one that the engineer can actually bring to completion. An implementation that can come to completion is one that makes sense for the specific details of whatever is going on inside the physics engine. The specific paradigms used are often not aggregation or component based!

In order for a physics engine to run fast it needs to have efficient memory access patterns and memory usage, on modern PC hardware, requires some form of DOD. Since this complex (often black boxed) physics engine will have it’s own specific implementation and optimization it doesn’t make sense to force a component based model to its very core with some sort of idealistic zeal. It gets really bad when strict rules are imposed (like banning all code from classes and structures that define components) on the component model (like with an ECS) and the rules start permeating the deep recesses of the entire code base.

The same thing goes for any sort of complex system. The core facilities of a game engine often times just don’t really care about components or aggregation. This means that an engine architecture that implements components will usually have to deal with middleware graphics/physics engines/libraries that don’t subscribe to a component based model (simply because it’s easier to use a library than to write your own custom things, especially if those custom things religiously follow some silly methodology like ECS or even OOP). In practice light wrapper components can be created to let the functionality of such systems be presented in a component format, ready to be used in an aggregate object.

## What does this all mean? What should we all do?

Use components where it makes sense in code. Use inheritance where it makes sense in code. Use databases where they make sense. Use all the things where they should. This is a pretty sad answer but it’s the right one. There is no silver bullet paradigm that solves all the problems in the game engine architecture world, and there are no steps to follow to achieve a result that works in all cases. Specific problems require specific solutions. Good code is hard to write, and will require a lot of judgement calls. In order to make good judgement calls a lot of experience and perspective is required.

I recommend using aggregation where it really matters. Dynamic aggregation is important for gameplay specific code. Gameplay specific code, in this article, would refer to code that would not easily apply or work at all in a different game. It’s code that is your game and doesn’t define an isolated system or functionality.

Dynamic aggregation and the component based model are extremely important for game and object editors. Game design flourishes best when iteration times are driven to zero, and the ability to create new things from a composition of fundamentals is very valuable! Clearly composition is useful, but how it’s to be used is the hard part.

## What Components to Make?

I recommend making components concerned with providing access to game-independent functionality to be quite large. Every 3D game engine has a concept of a mesh, and will usually have some sort of file format to associate with, like FBX. Every 2D game engine will have the concept of a sprite. Each game using Box2D will have colliders and rigid bodies, and possibly joints. These fundamental pieces of functionality don’t change very often, so static compile-time relationships aren’t a bad thing since iteration time isn’t really all that relevant.

A 3D game might have a single Mesh component for example. A Mesh component can have renderable vertices, and possibly all the skeletal and animation information as well. There may be a single Rigid Body component, which encapsulates the idea of colliders or shapes, as well as the functionality of rigid body mechanics. The Rigid Body component might even contain all necessary code and data to hold multiple joints! Or joints may be a component themselves.

For high level and gameplay related features components can become much more granular (or not if you so choose). Gameplay should be iterated, tested and changed frequently, so having small and decomposed components will probably make a lot of sense in a lot of cases. Large components that encompass more broad ideas will be useful in many cases too. Even in the gameplay world judgement calls are essential.

Usually efficiency isn’t so important for much gameplay code, so any implementation that is decently performant will suffice. Scripting languages, dynamic memory allocation and virtual dispatch, or what have you can all work. The decisions of what requires flexibility, what requires performance and all between can be difficult to make. Please see the references section for some concrete examples.

## Further Readings

We live in a world of opinions and it takes time to sift through them! If you have recommendations please comment below :)

## Reference Source Code

The best reference I know of is an open source game engine in progress I myself am developing. Please do send me your recommendations!

# Deriving OBB to OBB Intersection and Manifold Generation

The Oriented Bounding Box (OBB) is commonly used all over the place in terms of simulation and game logic. Although often times OBBs aren’t actually bounding anything, and really just represent an implicitly defined box.

OBBs commonly come in 2D and 3D form, though there aren’t all that many resources on the internet describing how to derive the 3D case. The ability to write robust collision detection, and more importantly manifold generation (contact data needed to resolve the collision), relies on a good understanding of linear algebra and geometry.

Readers will need to know about affine and linear transformations, and have intuition about the dot and cross products. To learn these preliminaries I highly recommend the book “Essential Mathematics” by Jim Van Verth. I will also be assuming readers are familiar with the Separating Axis Theorem (SAT). If you aren’t familiar with SAT please do a short preliminary research on this topic.

The fastest test I’m aware of for detection collision between two OBBs, in either 2 or 3 dimensions, makes use of the Separating Axis Theorem. In both cases it is beneficial to transform one OBB into the space of the other in order to simplify particular calculations.

Usually an OBB can be defined something like this in C++:

## 6 Face Axes

As with collision detection against two AABBs, OBBs have x y and z axes that correspond to three different potential axes of separation. Since an OBB has local oriented axes they will most of the time not line up with the standard Euclidean basis. To simplify the problem of projecting OBB $$A$$ onto the the axes of OBB $$B$$, all computations ought to be done within the space of $$A$$. This is just a convention, and the entire derivation could have been done in the space of $$B$$, or even in world space (world space would have more complex, and slow, computations).

The translation vector $$t$$ is defined as $$t = R_{a}^{T} * (C_b – C_a)$$, where $$C_i$$ denotes the center of an OBB, and $$R$$ denotes the rotation of an OBB. $$t$$ points from $$A$$’s center to $$B$$’s center, within the space of $$A$$. The relative rotation matrix to transform from B’s frame to A’s frame is defined as $$C = R_{a}^{T} * R_b$$. $$C$$ can be used to transform $$B$$’s local axes into the frame of $$A$$. In $$A$$’s own frame it’s local axes line up with the standard Euclidean basis, and can be directly computed with the half width extents $$e_a$$.

To calculate the separation $$s$$ along the local axis $$l$$ from an OBB examine the following diagram and matching equation:

s = |t \cdot l| – (|a \cdot l| + |(C * b) \cdot l|)
\label{eq1}

Due to symmetry the absolute value of a projection can be used in order to gather a value that can be used to represent a projection along a given direction. In our case we aren’t really interested in the sign of the projection and always want the projection towards the other OBB. I understand that newer programmers may have some trouble translating math into code, so here’s an example of eq.\eqref{eq1} as C++:

## 9 Edge Axes

In 2D eq.\eqref{eq1} is all that is needed to detect overlap upon any given possible axis of separation. In 3D a new topological feature arises on the surface of geometry: edges. Because edges form entirely separate Voronoi regions from face surfaces they can also provide possible axes of separation. With OBBs it is fast to test all possible unique edge axes of separation.

Given two edges upon two OBBs in 3D space the vector perpendicular to both would be the possible axis of separation, and can be computed with a cross product. A possible 9 axes of separation arise from crossing each local axis of $$A$$ with the each local axis of $$B$$.

Since $$C$$ is composed of a concatenation between $$R_{a}^{T}$$ and $$R_b$$, the rows of this matrix can be thought of as the local axes of $$B$$ in the space of $$A$$. When a cross product between two axes is required to test separation, the axis can be pulled directly out of $$C$$. This is much faster than computing a cross product in world space between two arbitrary axis vectors.

Lets derive a cross product between $$A$$’s x axis and $$B$$’s z axis, all within the space of $$A$$. In $$A$$’s space $$A$$’s x axis is $$\begin{Bmatrix}1 & 0 & 0\end{Bmatrix}$$. In $$A$$’s space $$B$$’s z axis is $$\begin{Bmatrix}C_{20} & C_{21} & C_{22}\end{Bmatrix}$$. The axis $$l$$ is just the cross product of these two axes. Since $$A$$’s x axis is composed largely of the number $$0$$ elements of $$C$$ can be used to directly create the result of the cross product, without actually running a cross product function. We arrive to the conclusion that for our current case $$l = \begin{Bmatrix}0 & -C_{22} & C_{21}\end{Bmatrix}$$. This particular $$l$$ axis can be used in eq.\eqref{eq1} to solve for overlap.

Since the matrix $$C$$ is used to derive cross product results the sign of the elements will be dependent on the orientation of the two OBBs. Additionally, the cross product will negate on of the terms in the final axis. Since eq.\eqref{eq1} requires only absolute values it may be a good idea to precompute abs($$C$$), which takes the absolute value element by element. This computation can be interleaved between the different axis checks for the first 6 checks, only computed as needed. This may let an early out prevent needless computation.

The same idea can be used to derive this axis in the space of $$B$$ by crossing $$B$$’s local z axis with $$A$$’s local x axis transformed to the space of $$B$$. To do this it may be easy to use $$C^T$$, which represents $$R_{b}^{T} * A$$. More information about this can be found in my previous article about the properties of transposing a matrix concatenation.

Eventually all axes can be derived on paper and translated into code.

## Manifold Generation

A collision manifold would be a small bit of information about the nature of a collision. This information is usually used to resolve the collision during physical simulation. Once an axis of least penetration is found the manifold must be generated.

For SAT manifold generation is quite straight forward. The entirety of the details are expressed very well by Dirk Gregorius’ great online tutorial from GDC 2013. The slides are openly available for free, and currently hosted here generously by Erin Catto. Readers are encouraged to familiarize themselves with the idea of reference and incident faces, as well as the idea of reference face side planes. For a 2D analogue I have provided my own slides on this topic.

There are only two different cases of collision that should be handled: edge to edge and something to face. Vertex to edge, vertex to face, and vertex to vertex resolutions are ill-defined. On top of this numerical imprecision makes detecting such cases difficult. It makes more sense in every way to skip these three cases and stick to the original two.

## Something to Face

Something has hit the face of one of the OBBs. It can be anywhere from 1 to 4 vertices from the face of the opposing OBB. The entire intersecting volume between two OBBs does not need to be considered due to the nature of SAT and the axis of minimum penetration. Only two faces from each OBB need be considered for this case, and are by convention named the reference and incident faces, one from each OBB.

Finding the incident face given an axis of separation can be found in constant time with implicit geometry like an OBB; the problem reduces to that of finding a vector that most faces a given vector, or in other words the most negative dot product.

Once the incident face is identified it is clipped to the side planes of the reference face. There may be some tricks involving basis transformations here to simplify the clipping routine. In order to perform clipping the incident face vertices must be calculated. Given a face normal from an OBB all four face vertices can be computed with the normal and the half width extents of the OBB. In similar fashion the side planes of the reference face can also be directly computed.

## Edge to Edge

The edge to edge case requires a clever constant-time computation of a supporting edge. A supporting edge is the edge most pointed to by a given axis of separation. The computation of this edge may be tricky to derive yourself, though there do exist online references. I suggest checking out Open Dynamics Engine (ODE) by Russell Smith, or the Cyclone Engine by Ian Millington for information on how to compute a supporting edge.

Once both supporting edges are calculated the closest two points between each edge can be calculated and used as a single contact point of which to resolve upon.

## Numeric Stability

In all uses of OBB to OBB intersection cross products with nearly parallel vectors pose a problem. The best solution is to detect when any two axes from each OBB are parallel and skip all nine cross product axes. It can be visually shown that no false positives will come from skipping the cross product axes in this case.

A simpler solution, as proposed by Gottschalk ’00 in his paper “Collision Queries using Oriented Bounding Boxes”, which made its way into Ericson’s book “Real-Time Collision Detection”, is to bias all elements of $$C$$ with a small epsilon to drive near-zero axes away from a degenerate case. For other axes the epsilon value should be tuned such that it doesn’t have much an impact upon results.

If using this collision routine to resolve collision during physical simulation certain tuning my be appropriate. If an iterative resolver is used it may be preferred to have slightly more consistent manifold generation, as opposed to exact manifold generation. For details I suggest researching works by Erin Catto and comments made by others around the Bullet Forums.

## AABB to OBB

Hopefully by now readers realize that this whole time we’ve been simplifying the problem of OBB to OBB to the problem of AABB to OBB. When an actual AABB needs to be collided against an OBB simplifications can occur. Many simplifications revolve around the non-necessity to transform things into the basis of the AABB, since world coordinates should suffice.

## One Axis Aligned (or Two?)

Some games have OBBs that only orient on some of their axes instead of all of them. One can immediately realize that no cross product checks need to be performed in any case, if two OBBs have orientation locked on the same axis. Other optimizations will occur as well, making some axis tests reduced to that of AABB overlap testing.

## Conclusion

Collision detection is difficult. Writing robust collision detection code requires a good mathematical foundation as well as the ability to write efficient code. This slight mixture of fields of study may be difficult to learn all at once, but it all gets easier with time and practice and the rewards can be very fulfilling.

I’d like to thank my friend Nathan Carlson for teaching much of this information to me, and for his insightful discussions.

# Automated Lua Binding

Welcome to the fifth post in a series of blog posts about how to implement a custom game engine in C++. As reference I’ll be using my own open source game engine SEL. Please refer to its source code for implementation details not covered in this article. The folder of interest would be LuaInterface.

## Introduction

Binding things to Lua is twofold: objects and functions must be able to be sent to and retrieved from Lua. Functions can be either static C or struct/class methods. Objects can be sent “by value” or “by reference”. As you can imagine it is important to be able to unify and simplify the binding process as much as possible to reduce all manual dev-work and upkeep.

## Generic C++ Functor

As with many things in a modern C++ game engine it is critical to have a generic C++ functor. Ideally this functor can wrap around class/struct methods (not only static functions). It is also possible have this functor able to refer to a function within Lua as well.

Please see my article and slides on C++ Function Binding for implementation details not covered here.

## Prerequisites

This article is on the topic of automatic Lua binding; if you’re unfamiliar with how to bind simple C functions to Lua please do a little research and come back later. The deep end of the pool is actually pretty deep!

I also suggest a working knowledge of C++ templates before trying to implement these sort of features. A working knowledge of Lua is also essential.

## Setting the Boundaries

With a scripting language it’s important to clearly define what you want to expose to script. Is the entire game in Lua? Are only specific parts accessible? What are the boundaries. It’s all too easy to get very caught up in what to send, what to implement, what not to do. Having clear boundaries of exactly what you want to do is the best way to start coding.

## Passing Objects to Lua

Objects can be passed to Lua by reference and value. A reference would consist of 4 bytes of memory to contain a pointer to some C++ memory. This allows Lua to store a “reference” to an object in C++. Most of the work involved in this type of object binding is in allow Lua to call C++ methods or functions on the pointer its storing.

The benefits of this approach are such that: calling class methods is pretty fast and shouldn’t be a worry; fairly simple to implement as most of the work is finished by creating a generic functor in C++; no hassle or upkeep when wanting to send new types of objects to Lua -each object is just a 4 byte pointer.

## Passing by Reference with lightuserdata

There are two ways I’d recommend to pass an object to Lua with: userdata and lightuserdata. A lightuserdata represents a void * in Lua and can hold a reference to an object in C++.

Here’s how one might send and retrieve lightuserdata from Lua:

This method is very fast, simple to implement and has very minimal memory overhead. Additionally lightuserdata can be compared to one another, and are equal if the underlying address is equivalent. However, one cannot attach metatables to lightuserdata and there is no sense of type safety what so ever. A lack of type safety means that if someone passes a lightuserdata into an incorrect C function the host program will likely crash.

With lightuserdata the following code is possible:

This solution will work for one, maybe two people working on a smaller project or minimal amount of code. I can imagine that the lack of type safety will be the biggest issue as time goes on.

## Reflection for Type Safety

It is possible to implement type-safety in Lua. However this requires Lua code to be maintaining type information. Lua is a scripting language meaning it ought best be used to script things. Something so integral and common as type-safety might better be implemented in lower-level C++ code.

Implementing type safety on the C++ side has two benefits: efficiency of implementation; type-safety can optionally be compiled away in release mode.

I highly recommend building yourself a simple, custom introspection library in C++. All that is really needed to start is the ability to query a type’s size and name efficiently. Please see my older article on custom Introspection or the game engine SEL for examples on how to implement such a system.

With a simple macro-based registration system one can register and lookup type information via introspection like so:

After this is complete and working (if you don’t have an implementation of introspection yet this is fine, just think of it as a black box) a small generic Variable object ought to be created. Sample code of a functional Variable object is in this post.

A Variable can be used like so:

It is important to note that the Variable itself is not a templated type!

When passing an object to Lua we can send a pointer to a Variable. As long as the Variable exist in memory in C++ the lightuserdata within Lua will point to a valid Variable. Upon retrieval of the Lua object back to C++ a type assertion can be run:

## Generic Static Function Binding

Bind C-style static functions in a generic way makes heavy use of custom introspection. The way I was originally taught was to just throw the entire binding function (in C++) at you all at once and let you suffer. Prepare to suffer as I did!

This function isn’t doing the bind, it’s what is bound. Every time a function in C++ is called from Lua, this function is called first.

An upvalue in Lua is akin to static variables in C. Using this we can attach a pointer to a generic functor to a bound C function within Lua. As Lua calls a C function this upvalue is retrieved and eventually used to actually call the C function.

The rest is just a matter of handling variables to/from Lua. In the above example the Variable object contains some helper functions call ToLua and FromLua. The nice thing about my implementation of this within SEL is that no heap memory is used during this entire process! All this code boils down to a very efficient method of generically calling C functions.

I will leave binding C++ methods as an exercise for the reader. By now you ought to have an idea of where to look for example implementation! The idea is to handle type information for the “this pointer” of the method, and pass around an actual “this pointer” to call the method.

## Calling Methods from Lua

Lets say you have an implementation that allows Lua code like the following:

A few things need to happen here. The first is that the object in question should only call methods that are actually methods of that specific type of class; one cannot simply bind all C++ methods and place functions in Lua within the global scope. Any object type could call any method type making for a lack of type-safety and dangerous code.

At this point the lightuserdata will have to be upgraded to a full userdata. Full userdata in Lua enjoy benefits such as the ability to set and modify metatables. If you’re not familiar with Lua metatables please do a little research on the topic and come back later.

A full userdata allows us to place a copy of a Variable within Lua memory, instead of just a void *. This means a temporary Variable can be used to call ToLua, instead of requiring that the Variable sent stays valid in C++ for the duration of usage within Lua.

Currently a way to create metatables for all of our C++ types is required. Assuming a linked list of all TypeInfo objects from the introspection system is available:

This loop is just creating metatables given the string names of what each metatable should be called.

After the tables are created the actual C++ methods and functions should be bound. This turns out to be really simple! It is assumed that each function and method registered within the introspection system can be passed to the function at some point (perhaps during registration of the type information):

And that’s really all there is to it! The idea here is to make sure that a type with methods sent to Lua has its userdata fixed with a metatable containing the available methods to call. When the __index metamethod is called it will search within the metatable itself for an appropriate member. Members of the metatable are the functions we bound to Lua. After they are fetched they can be called. This is what happens behind the scenes when we do:

## Passing Object by Value

Passing objects by value is actually much more difficult. The idea is to utilize tables to to store representations of the members associated with a class or struct. A table can be used to represent state of an object.

The __index and __newindex metamethods of a userdata should be set to look into the state table first. This lets users assign new values, and lets your ToLua and FromLua functions copy members from C++ to/from this Lua state table.

If a member is not found in the state table the metatable can then be searched by setting the __index metamethod of the state table to refer to the proper metatable.

All of this table indirection does incur significant overhead, however it allows objects in Lua to be used like so:

I myself have not implemented this type of Lua binding, though it is entirely possible and can be quite nice to work with. I reiterate that adding this many tables incurs both memory and performance overhead not seen with the other styles. This seems to be the only drawback.

## Conclusion

Well this post turned out longer than I expected -over 2k words! Hopefully the information was clear. It’s really nice being able to refer people to a complete and working example such as the SEL engine; it makes writing articles much easier and simpler.

Hopefully this can help someone out there! As always feel free to ask questions or provide comments right here on this page.

Please see Game Programming Gems 6 ch. 4.2 for more information about binding C++ objects to Lua,