Texture Atlases

Texture atlases are a great optimization for 2D games: they let tons of different sprites (especially quads) be batched together in very few draw calls. Making them is a pain. For pixel art or other 2D art, DXT compression is often not a great choice because it is lossy. One good way to make atlases for 2D games is with the PNG format. Single-file loaders and savers can be used to load and save PNGs with a single function call each.

Here’s an example with the popular stb_image, along with stb_image_write:

This requires two different headers, one for loading and one for saving. Additionally the author of these headers also has a bin packing header, which could be used to make a texture atlas compiler.

However I have created a single-file header called tinydeflate. It does PNG loading, saving, and can create a texture atlas given an array of images. Internally it contains its own bin packing algorithm for sorting textures and creating atlases. Here’s an example:

The above example (if given a few more images) could output an atlas like so:

Followed by a very easy to parse text formatted file containing uv information for each atlas:

The nice thing about tinydeflate is the function to create atlases has a little bit of anti UV bleeding code inside of it. Most places on the internet will tell you to solve texture atlas bleeding by adding padding pixels around your textures. This might be important for 3D applications or when mip-mapping is important, however, I don’t have this kind of 3D experience. For 2D pixel art games it seems squeezing UVs ever so gently inside the borders of a texture works quite well, and so tinydeflate takes this approach. No complicated padding necessary.
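To make the idea of squeezing UVs concrete, here is a minimal sketch. It is not tinydeflate's actual code: the `UVRect` struct, the `MakeInsetUVs` helper, and the default inset of 0.01 texels are all assumptions made for illustration.

```cpp
#include <cassert>

// Normalized texture coordinates for one sprite inside an atlas.
struct UVRect { float u0, v0, u1, v1; };

// Hypothetical helper: computes UVs for a sprite occupying the pixels
// [x, x + w) by [y, y + h) in an atlas of atlasW by atlasH pixels.
// Each edge is pulled inward by `inset` texels so that sampling never
// touches the neighboring sprite. The default inset is an assumed value.
UVRect MakeInsetUVs(int x, int y, int w, int h,
                    int atlasW, int atlasH, float inset = 0.01f)
{
    assert(atlasW > 0 && atlasH > 0 && w > 0 && h > 0);
    UVRect r;
    r.u0 = (x + inset) / atlasW;
    r.v0 = (y + inset) / atlasH;
    r.u1 = (x + w - inset) / atlasW;
    r.v1 = (y + h - inset) / atlasH;
    return r;
}
```

The right inset depends on filtering mode and sprite scale, so treat the value as something to tune rather than a constant to copy.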

And finally here’s an older video I did with some more information on writing the code yourself to do some of the above tasks.


Essentials of Software Engineering – With a Game Programming Focus

I’ve been commissioned to write a big document about what I think it means, and what is necessary, to become a good software engineer! If anyone finds the document interesting please do email me (email on my resume) with any questions or comments, even if you hate it and deeply disagree with me! This is the first release so some sections may be a bit skimpy and I might have some false info in there here and there.

I’m a fairly young engineer so please do take all my opinions with a healthy dose of skepticism. Nonetheless I hope the document is useful for someone! Many of my opinions come from my interpretation of much more experienced engineers I’ve come in direct contact with, and this PDF is sort of a creation of what others taught me.

As time goes on I may make edits or add new chapters; when I do I’ll modify the version number (just above the table of contents). Here is the PDF:

Preprocessed Strings for Asset IDs

Mick West posted up on his site a really good overview of some different methods for hashing string ids and gave good motivation for optimizing this area early on in a project. Please do review his article as it’s a prerequisite to this post, and his article is just really good.

I’ve been primarily concerned with memory management of strings as they are extremely hairy to work with. For me specifically I’ve ruled out the option of a string class — they hide important details and it’s too easy to write poor (but functional) code with them. This is just my opinion. Based on that opinion I’d like to achieve these points:

• Avoid all dynamic memory allocation, or de-allocation
• Avoid complicated string data structures
• Little to no run-time string traversals (like strcmp)
• No annoying APIs that clutter the thought-space while writing/reading code
• Allow assets to refer to one another while on-disk or in-memory without complicated pointer translations

There’s a lot I wanted to avoid here, and for good reason. If code makes use of dynamic memory, complicated data structures, etc., that code is likely to suck in terms of both performance and maintenance. I’d like fewer features, less code, and some specific features to solve my specific problem: strings suck. The solution is to not use strings whenever possible, and when forced to use strings, hide them under the rug. Following the example Jason Gregory outlined about Naughty Dog’s code base in his book “Game Engine Architecture 2nd Ed”, I implemented the following solution, best shown via gif:

The SID (string id) is a macro that looks like (along with a typedef):

I’ve implemented a preprocessor in C that takes an input file, finds the SID macro instances, reads the string, hashes it, and then inserts the hash along with the string stored as a comment. Pretty much exactly what Mick talked about in his article.

Preprocessing files is fairly easy, though supporting this function might be pretty difficult, especially if there are a lot of source files that need to be pre-processed. Modifying build steps can be risky and sink a ton of time. This is another reason to hammer in this sort of optimization early on in a project in order to reap the benefits for a longer period of time, and not have to adjust heavy laid-in-stone systems after the fact.

Bundle Away the Woes

I’ve “stolen” Mitton’s bundle.pl program for use in my current project to recursively grab source files and create a single unified cpp file. This large cpp can then be fed to the compiler and compiled as a “unity build”. Since the bundle script looks for instances of the “#include” directive code can be written in almost the exact same manner as normal C++ development. Just make CPP files, include headers, and don’t worry about it.

The only real gotcha is if someone tries to do fancy inclusions by defining macros outside of files that affect the file inclusion. Since the bundle script is only looking for the #include directive (by the way it also comments out unnecessary code inclusions in the output bundled code) and isn’t running a full-blown C preprocessor, this can sometimes cause confusion.

It seems like a large relief on the linker and leaves me thinking that C/C++ really ought to be used as single-translation unit languages, while leaving the linker mostly for hooking together separate libraries/code bases…

Compiling code can now look more or less like this:

First collect all source into a single CPP, then preprocess the hash macros, and finally send the rest off to the compiler. Compile times should shrink, and I’ve even caught wind that modern compilers have an easier time with certain optimizations when fed only a single file (rumors! I can’t confirm this myself, at least not for a while).

Sweep it Under the Rug

Once some sort of string id is implemented, in-game strings themselves don’t really need to be used all too often from a programmer’s perspective. However, for visualization, tools, and editors, strings are essential.

One good option I’ve adopted is to place strings for these purposes into a global table in designated debug memory. This table can then be turned off or compiled away whenever the product is released. The idea is to allow tools and debug visualization to use strings fairly liberally, as long as they are stored inside the debug table. The game code itself, along with the assets, refers to identifiers in hash form. This allows product code to perform tiny translations from fully hashed values to asset indices, which is much faster and easier to manage compared to strings.

This can even be taken a step further; if all tools and debug visualizations are turned off and all that remains is a bunch of integer hash IDs, assets can then be “locked” for release. All hashed values can be translated directly into asset IDs such that no run-time translation is ever needed. For me specifically I haven’t quite thought how to implement such a system, and decided this level of optimization does not really give me a significant benefit.

Parting Thoughts

There are a couple of downsides to doing this style of compile-time preprocessing:

• Additional complexity in the build-step
• Layer of code opacity via SID macro

Some benefits:

• Huge optimization in terms of memory usage and CPU efficiency
• Can run switch statement on SID strings
• Uniquely identify assets in-code and on-disk without costly or complicated translation

If the costs can be mitigated through implementing some kind of code pre-processor/bundler early on then it’s possible to be left with just a bunch of benefits :)

Finally, I thought it was super cool how hashes like djb2 and FNV-1a use an initial value to start the hashing, typically a carefully chosen constant (5381 for djb2, the offset basis for FNV-1a). This makes it possible to hash a prefix string and then feed the result in as the starting value when hashing the suffix. Mick explains in his article how this idea of combining hashed values is a useful feature for supporting tools and assets. This can be implemented at compile-time or at run-time (though I haven’t quite thought of a need to do this at compile-time yet):
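As a rough run-time sketch of this prefix/suffix chaining, here is djb2 with the starting value exposed as a parameter. The names `HashDjb2` and `kDjb2Seed` are made up for this example and are not taken from my preprocessor.

```cpp
#include <cassert>
#include <cstdint>

// Standard djb2 starting value.
const std::uint32_t kDjb2Seed = 5381u;

// djb2: hash = hash * 33 + c, starting from `seed`.
// Passing the hash of a prefix as the seed continues hashing where
// the prefix left off.
std::uint32_t HashDjb2(const char* str, std::uint32_t seed = kDjb2Seed)
{
    assert(str);
    std::uint32_t h = seed;
    while (*str)
        h = h * 33u + static_cast<unsigned char>(*str++);
    return h;
}
```

Hashing "textures/" once and reusing that value as the seed for every file name in the folder gives the same result as hashing each full path from scratch.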

Freelist Concept

A freelist is a way of retrieving some kind of resource in an efficient manner. Usually a freelist is used when a memory allocation is needed, but searching for a free block should be fast. Freelists can be used inside of general purpose allocators, or embedded directly into an optimized algorithm.

Let’s say we have an array of elements, where each element is 16 bytes of memory. Our array has 32 elements. The program that the array resides in needs to request 16 byte elements, use them, and later give them back; we have allocation of elements and deallocation of elements.

The order in which elements are allocated is not related in any way to the order of deallocation. In order for deallocation to be a fast operation, a freed 16 byte element needs to be left in a state such that a future allocation can be handed this element for reuse.

A singly linked list can be used to hold onto all unused and free elements. Since the elements are 16 bytes each this is more than enough memory to store a pointer, or integer index, which points to the next block in the free list. We can use the null pointer, or a -1 index to signify the end of the freelist.

Allocating and deallocating can now look like:

Setting up the memory* will take some work. Each element needs to be linked together somehow, like through a pointer or integer index. If no more elements are available then more arrays of size 32 can be allocated — this means our memory is being managed with the style of a “paged allocator”, where each array can be thought of as a page.
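The allocation, deallocation, and link setup described above might be sketched as follows, using integer indices and -1 as the end marker. The names (`Pool`, `PoolInit`, `PoolAlloc`, `PoolFree`) are invented for this sketch, and it handles a single 32 element page only.

```cpp
#include <cassert>

const int kElementCount = 32;
const int kEndOfList = -1;

// Each 16 byte element doubles as a freelist link while it is unused.
union Element
{
    unsigned char payload[16]; // user data while the element is allocated
    int next;                  // index of the next free element while free
};

struct Pool
{
    Element elements[kElementCount];
    int freeHead; // index of the first free element, or kEndOfList
};

// Link every element to the one after it; the last one ends the list.
void PoolInit(Pool* p)
{
    for (int i = 0; i < kElementCount - 1; ++i)
        p->elements[i].next = i + 1;
    p->elements[kElementCount - 1].next = kEndOfList;
    p->freeHead = 0;
}

// Pop the head of the freelist. Returns nullptr when the page is full.
Element* PoolAlloc(Pool* p)
{
    if (p->freeHead == kEndOfList)
        return nullptr;
    Element* e = &p->elements[p->freeHead];
    p->freeHead = e->next;
    return e;
}

// Push the element back onto the head of the freelist.
void PoolFree(Pool* p, Element* e)
{
    int index = static_cast<int>(e - p->elements);
    assert(index >= 0 && index < kElementCount);
    e->next = p->freeHead;
    p->freeHead = index;
}
```

Both operations are a couple of loads and stores with no searching, which is the whole appeal. A paged version would allocate another `Pool` when `PoolAlloc` returns nullptr.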

The freelist is an important concept that can be embedded ad hoc into more complex algorithms. It is often important for little pieces of software to expose a very tiny C-like interface, usually just a function or two. Having these pieces of software contain their own internal freelists is one way to achieve a simple interface.

Example of Hiding the Freelist

For example say we are computing the convex hull of a point-set through the Quick Hull algorithm. The hypothetical algorithm exposes an interface like this:

This QHull function does no explicit memory allocation and forces the user to allocate an appropriate amount of memory to work with. The bound of this memory (how big it needs to be for the algorithm’s worst case scenario) is calculated by the ComputeMemoryBound function.

Inside of QHull often times the hull is expanded and many new faces are allocated. These faces are held on a free list. Once new faces are made, old ones are deleted. These deleted faces are pushed onto the free list. This continues until the algorithm concludes, and the user does not need to know about the details of the embedded memory management of the freelist.

Convex hull about to expand to point P. The white faces will be deleted. The see-through faces will be allocated.

A convex hull fully expanded to point P. All old faces were deleted.

The above images were found at this address: http://www.eecs.tufts.edu/~mhorn01/comp163/algorithm.html

Since linked lists are such an essential topic I’ve taken some extra care to learn efficient ways of using them. The simplest kind of linked list to conceptualize is the singly linked list. There are tons of online resources for learning the basics about linked lists, so I’ll assume readers are familiar with the concept.

Here’s a quick mock header of some linked list nodes for reference:

In general singly linked lists are more complicated to manage once removal of nodes is required. Since no explicit prev pointer is stored in memory a temporary variable is often kept on the stack while traversing a singly linked list. This means more complicated code that clogs the user’s focus.

Even though a doubly linked list requires twice the memory for its links, it is usually still preferred over a singly linked list, even when a singly linked list could get the job done without any additional time complexity. Linked lists are often useful in complex algorithms, and if there’s a chance to simplify the implementation of a complex algorithm by using a doubly linked list, then that chance is probably worth taking.

When I first implemented a doubly linked list and tested its performance out against std::list I couldn’t quite get it to perform well.

Naive insertion and removal of list nodes often has to check for NULL pointers, which represent the front and back of the linked list. Here’s an example of what removal might look like to give you the idea of how many if-statements could be necessary (code not tested, I just typed it out here on the spot):
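A sketch of such a removal, with NULL-terminated ends, could look like the following. The `Node` and `List` layouts are assumptions for this example, and `PushBack` is included only so the list can be built.

```cpp
#include <cassert>

struct Node
{
    Node* next;
    Node* prev;
    int value;
};

struct List
{
    Node* head; // nullptr when the list is empty
    Node* tail; // nullptr when the list is empty
};

void PushBack(List* list, Node* node)
{
    node->next = nullptr;
    node->prev = list->tail;
    if (list->tail) list->tail->next = node;
    else            list->head = node;
    list->tail = node;
}

// Two branches on every call: one for the front of the list,
// one for the back.
void Remove(List* list, Node* node)
{
    assert(list && node);
    if (node->prev) node->prev->next = node->next;
    else            list->head = node->next;

    if (node->next) node->next->prev = node->prev;
    else            list->tail = node->prev;

    node->next = nullptr;
    node->prev = nullptr;
}
```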

There are two if statements hit every single time this function is called. When the CPU comes across a branch it loads instructions based on which path of execution it deems most likely. This is called branch prediction. If the prediction is incorrect the speculatively loaded work must be thrown away, and then the appropriate code must be loaded instead.

A branch miss is probably going to be a fairly cheap CPU operation, since the executing code is almost definitely in the L1 instruction cache. Despite this, modern CPUs still operate through a pipeline, and branch misses can still garble up whatever pipelining is happening. In the end a branch miss is a performance hit, and should be avoided when appropriate.

A common linked list optimization is to use a dummy head and tail node. These nodes sit in memory along with the list data structure. Upon list initialization they point their next and previous pointers to one another, and NULL out the pointers to represent the front and back of the list.

With this optimization the only case that user nodes will ever encounter is the case in the first two if statements (assuming both were true). The removal code can now look something like (again, not tested):
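Here is a sketch of the dummy node version under the same caveat (typed out for illustration, with an assumed `Node`/`List` layout). The dummy head and tail live inside the list itself, so user nodes always have valid neighbors.

```cpp
#include <cassert>

struct Node
{
    Node* next;
    Node* prev;
    int value;
};

struct List
{
    Node head; // dummy node, never holds user data
    Node tail; // dummy node, never holds user data
};

// The dummies point at one another; their outer pointers stay null
// to mark the front and back of the list.
void ListInit(List* list)
{
    list->head.prev = nullptr;
    list->head.next = &list->tail;
    list->tail.prev = &list->head;
    list->tail.next = nullptr;
}

void PushBack(List* list, Node* node)
{
    node->prev = list->tail.prev;
    node->next = &list->tail;
    list->tail.prev->next = node;
    list->tail.prev = node;
}

// No branches: a user node always sits between two other nodes.
void Remove(Node* node)
{
    assert(node->prev && node->next); // must not be a dummy node
    node->prev->next = node->next;
    node->next->prev = node->prev;
}
```

Note that `Remove` no longer needs the list at all, only the node.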

This is one kind of optimization that std::list implementations use. After doing this myself my list performed on par with std::list.

Intrusive Lists

Intrusively linked lists invert the definition of what a node is. Traditionally a linked list node contains some data. An intrusive list has the data contain the node:

This scheme is nice since now nodes do not need to be allocated separately from the data. If the number of data elements is known, then the exact number of nodes needed can also be known.

C++ templates can be used to create a generic intrusively linked list implementation, able to define nodes inside of any data type. C macros can also be used to the same effect. In this way an intrusively linked list can be used in pretty much the same way a normal linked list is.

One major downside to intrusively linked lists is that they add in extra memory to your data. This can be a big deal if some code is very performance sensitive. If cache line utilization is important, then the percentage of data actually used in each line becomes important. Sometimes these pointers get in the way and clutter the lines. This cluttering is something to be aware of.

On the flip side many algorithms can run on arrays of data. Instead of storing explicit pointers to represent prev and next connections, indices into an array can be used. This can make entire data structures memcpy-able, or serializable just by dumping bits to a stream. Additionally, the pointers stored directly within data will often be accessed at the exact same time as the data (depending on the algorithm), which results in very high cache line utilization.

It all depends on the scenario.

Circular Lists (Sentinel)

When dealing with intrusive linked lists it can often be really weird to define where in memory dummy nodes would reside. Are we to create dummy pieces of Data? What if the algorithm needs lists to be constantly created and destroyed? What if the algorithm can have as many lists as there are nodes? Suddenly an algorithm might need twice as many dummy nodes as actual nodes!

It is possible to remove the dummy nodes in some cases. Data elements can be initialized to point to themselves. In this way each element is itself a doubly linked list with one node. To insert a second node is a matter of making both nodes point to each other. Inserting a third node should use the exact same code as inserting the second node (and not require any branching since NULL indices/pointers do not exist), and so on.

In many cases an intrusive circular doubly linked list (boy, isn’t that a mouthful) can be the perfect solution to a hard problem! I will leave it as an exercise to research or implement this circular style of linked list.

Another name for this type of list would be a “sentinel intrusive list”, where a sentinel node can be used to bound a list traversal. Since our linked lists are circular we can start at any node, traverse the list, and once we reach the node we started upon our traversal is complete.
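For readers who want a starting point, here is one possible sketch of an intrusive circular doubly linked list. The names (`Link`, `Data`, `DataFromLink`, `SumRing`) are invented for the example, and recovering the owning `Data` with `offsetof` assumes `Data` is a standard-layout type.

```cpp
#include <cassert>
#include <cstddef>

struct Link
{
    Link* next;
    Link* prev;
};

struct Data
{
    int value;
    Link link; // the node lives inside the data
};

// A lone element points to itself: a circular list with one node.
void LinkInit(Link* l)
{
    l->next = l;
    l->prev = l;
}

// Splice the lone node `l` in after `where`. No null checks needed.
void LinkInsertAfter(Link* where, Link* l)
{
    assert(l->next == l && l->prev == l); // `l` must not be in a list
    l->prev = where;
    l->next = where->next;
    where->next->prev = l;
    where->next = l;
}

// Unlink `l` and turn it back into a one node list.
void LinkRemove(Link* l)
{
    l->prev->next = l->next;
    l->next->prev = l->prev;
    LinkInit(l);
}

// Recover the Data that contains a given Link.
Data* DataFromLink(Link* l)
{
    return reinterpret_cast<Data*>(
        reinterpret_cast<char*>(l) - offsetof(Data, link));
}

// Start anywhere, walk until we arrive back at the start.
int SumRing(Data* start)
{
    int sum = 0;
    Link* l = &start->link;
    do
    {
        sum += DataFromLink(l)->value;
        l = l->next;
    } while (l != &start->link);
    return sum;
}
```

Inserting the second node and the third node run the exact same code path, and the traversal is bounded by the starting node rather than by a null pointer.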

Cache Aware Components

Special thanks to Danny Frisbie for a nice discussion on the PODHandler implementation!

Let me start off by saying that optimizations really only need to be applied to bottlenecks. In order to know where a bottleneck might occur (especially cache related ones) you’ll probably need some experience. The experience need not be your own, but it will have to come from someone. In my (limited) experience the only bottlenecks I’ve ever seen in any piece of game related software (aside from N^2 loops with a high N) are always due to waiting on things to be placed into the cache. It’s really easy to write bad code, and bad code is usually cache oblivious. Even conceptually clear and understandable code can still be cache oblivious!

I’ve seen some very nice 2D and 3D games, made in C++, that used only rudimentary memory allocation schemes and naive component implementations. They ran at 60 fps just fine. If you’re a hobbyist or just trying to learn, then thinking about how to write the fastest component framework ever might be fun but don’t expect to do it correctly on your first try. Expect to fail, and then iterate.

So when the time comes to actually optimize something, having some sort of idea of where to look to learn how to solve cache related problems will be valuable.

Data Lookups and Cache Lines

In general the cache line size for hardware nowadays seems to be mainly 64 bytes. A cache line is a 64 byte piece of memory aligned on a 64 byte boundary. Whenever data is transferred from one cache to another (or to/from main memory (RAM)) it is transferred a whole cache line at a time. This keeps the memory bus busy. A single cache line holds 16 32-bit integers, or 16 32-bit floating point numbers, which is exactly the size of a 4×4 matrix of 4 byte floating point scalars.

How fast a cache line is transferred depends on which level it is being fetched from. In general terms: when a piece of memory is fetched by the CPU from a lower level cache it is hoisted up into the L1 data cache (L1 D, L1 I is the instruction cache for code). If this memory was not in the cache it will take forever (100 to 300 cycles, probably near the 300 range for PC). Here’s a nice diagram by Naughty Dog summarizing the common cache setup for PC CPUs:

When something is loaded into the L1 cache whatever was there before has to be evicted, and will be pushed into the L2 cache (again using the same 64 byte cache line size). This will probably evict something from the L2 cache back down into the L3 cache, and so on and so forth.

The implication here is that whenever a cache line is read it is up to the programmer to try to use as much of that cache line as possible. Even though we might have 8 gigabytes of RAM, if we aren’t running in the cache the CPU will be sitting there waiting. Even if a single byte is read from main memory, an entire cache line will be fetched. Reading a single byte from a random location in main memory is about the worst possible way to use memory.

This tends toward the idea of using very compact and concise data structures. If a data structure is packed together in memory it can be operated upon by the CPU very quickly once it arrives to the CPU’s cache.

The cache isn’t very big. Here’s a nice slide by Scott Meyers on the topic:

32KB of L1 data cache is tiny. You don’t even get to use all of it as the operating system does need to do stuff too!

Prefetching

Prefetching exists to try to hide the latency of fetching memory. A prefetch is when a cache line is preemptively fetched and placed into the cache, such that when the memory is actually requested a cache hit occurs.

Hardware can detect patterns in memory accesses, but it can only detect pretty simple patterns like array traversals. Scott Meyers describes (see resources section) that the hardware is made in such a way that it can detect iterating over arrays forwards, backwards, and with an arbitrary (but constant) element step size. It can also do this for all hardware threads simultaneously. However, if you’re not looping over an array you can’t count on any intelligent prefetching. It will take two or more cache misses in a recognizable pattern to start automatic prefetching.

Usually compilers provide a specific intrinsic or builtin to hint to the hardware that a specific cache line should be grabbed from somewhere in memory. This can be used by programmers to eke out a final bit of performance, given an access pattern that actually benefits from manual prefetching.
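As a hedged illustration, GCC and Clang expose `__builtin_prefetch`; other compilers spell it differently (for example `_mm_prefetch` from `<xmmintrin.h>` on x86), so the sketch below hides it behind a made-up `PREFETCH` macro that compiles to nothing elsewhere. The `Particle` type, `SumX` function, and lookahead distance of 8 are assumptions picked for the example, not measured values.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
    #define PREFETCH(addr) __builtin_prefetch(addr)
#else
    #define PREFETCH(addr) ((void)0) // no-op on other compilers in this sketch
#endif

struct Particle { float x, y, vx, vy; };

// Visits particles in the arbitrary order given by `indices`. The
// hardware prefetcher cannot guess this pattern, so we request the
// element we will need a few iterations from now ahead of time.
float SumX(const Particle* particles, const int* indices, std::size_t count)
{
    assert(particles && indices);
    const std::size_t kLookahead = 8; // assumed tuning value
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (i + kLookahead < count)
            PREFETCH(&particles[indices[i + kLookahead]]);
        sum += particles[indices[i]].x;
    }
    return sum;
}
```

A prefetch is only a hint. Whether it helps at all has to be measured on the target hardware.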

Cache (Un)Aware Components

Hopefully by now readers are convinced that contiguous arrays of data are very friendly to the cache, and where performance matters this knowledge should be exploited.

In a component based game engine architecture looking up and operating on components is often the first bottleneck encountered. It might pay to learn a little about how this might be circumvented.

A memory naive implementation of components will look something like this:

Each component is allocated on the heap individually. Every allocation goes through the general purpose allocator (which may in turn have to ask the operating system for more memory), and the resulting scattered blocks are very sensitive to memory fragmentation.

Ideally all data of a certain type will be packed together in a tight linear array. When this data needs to be operated upon, the fastest sort of transformation (without manual prefetching) will look something like this (for a generalized example):
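A generalized sketch of that kind of loop follows. The `Data` layout and the `Integrate` function are placeholders invented for this example; the point is the shape of the loop: load an element, transform it, store it, move to the next.

```cpp
#include <cassert>
#include <cstddef>

// 16 bytes, so four elements fit in one 64 byte cache line.
struct Data { float x, y, vx, vy; };

void Integrate(Data* data, std::size_t count, float dt)
{
    assert(data || count == 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        Data d = data[i];  // consume: explicit local copy
        d.x += d.vx * dt;  // transform
        d.y += d.vy * dt;
        data[i] = d;       // write the result back
    }
}
```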

Ideally the size of a given element will be below the size of a cache line, and often times this is possible if extraneous data can be removed from the Data type.

The explicit consumption of the data, where a local copy is made, is probably not necessary and will be compiled away. It is fairly easy to double check by having the compiler emit an assembly file and reading it. However this kind of practice can be very helpful to let the compiler know that multiple pointers cannot possibly alias the same memory. For more information search for “C++ aliasing Ericson” to find Christer’s old slides.

Now that the ideal computational situation for transforming a large data set has been described, let’s look at a common (albeit contrived) data transformation that we’ve all been guilty of while first learning:
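A reconstruction of the kind of loop meant here might look like this. Only the function name `UpdateGameObjects` comes from the discussion; the `GameObject` class and its members are assumptions.

```cpp
#include <cassert>
#include <list>

struct GameObject
{
    bool alive = true;
    int updates = 0;
    virtual ~GameObject() {}
    virtual void Update(float dt) { (void)dt; ++updates; }
};

// Deletes dead objects and updates live ones in a single pass over
// a list of individually heap allocated objects.
void UpdateGameObjects(std::list<GameObject*>& objects, float dt)
{
    for (auto it = objects.begin(); it != objects.end(); )
    {
        GameObject* obj = *it;
        assert(obj);
        if (!obj->alive)
        {
            delete obj;             // random call to delete
            it = objects.erase(it);
        }
        else
        {
            obj->Update(dt);        // virtual call to who knows where
            ++it;
        }
    }
}
```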

Conceptually this code is very concise and easy to reason about, though the readability and dynamic niceties come at the cost of efficiency. Random calls to delete occur, the inner loop contains a branch, and calling Update on an object will go to who knows where in memory. All of these things are basically punching the cache right in the gut. Even the branch can be annoying for the CPU pipeline, as speculatively fetched work has to be thrown away if the branch is mispredicted!

How can this be solved? The first step is to make sure as much data is packed together in memory as possible. In the above code snippet the list can be changed to an array (perhaps std::vector). Okay, pretty trivial change, no big deal. Objects can perhaps just be placement new’d into the array and destroyed in place with an explicit destructor call. This will act like a memory pool.

The next step is to identify that the UpdateGameObjects function is performing two types of tasks (assuming the Update function performs a single task): deletion and update calling. This is a result of the container of objects not being sorted. It is a non-homogeneous collection of objects that are both alive and dead. If objects can be separated into sections of dead and alive, only the alive objects need to be looped over.

Cache Aware Components

One way to implement this would be to have the beginning of the array contain a contiguous run of live elements. The rest of the array can contain “deleted” or “not yet allocated” elements. In order to uphold this invariant it might be best to design objects placed into these sorts of arrays to not care if they are moved in memory. Usually this means making your data a plain old data type (POD).

Deleting things from the array is going to be a nice feature to support. A game wouldn’t be interactive for very long if it could only consume more and more memory. A simple and very effective scheme is to move the last element of your array into the location of a deleted element.
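The move-last-into-hole scheme can be sketched in a few lines. `Particle`, `ParticleArray`, `Add`, and `Remove` are illustrative names only, and the capacity of 64 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kCapacity = 64;

struct Particle { float x, y; int id; };

struct ParticleArray
{
    Particle items[kCapacity];
    std::size_t liveCount = 0; // items[0 .. liveCount) are alive
};

// Append to the end of the live range and return the new index.
std::size_t Add(ParticleArray* a, const Particle& p)
{
    assert(a->liveCount < kCapacity);
    a->items[a->liveCount] = p;
    return a->liveCount++;
}

// Overwrite the deleted slot with the last live element, then shrink.
// The live range stays contiguous, but the moved element changes index.
void Remove(ParticleArray* a, std::size_t index)
{
    assert(index < a->liveCount);
    a->items[index] = a->items[a->liveCount - 1];
    --a->liveCount;
}
```

Removal is constant time and never leaves a gap, at the price of not preserving element order.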

However in order to refer to a unique element within an array simple pointers are no longer going to cut it. When an element is moved from the last index into a deleted slot any pointers to the old spot (the last and now empty element) will be dangling. Some form of translation from one data type into a pointer must occur in order to ensure that the correct pointer is retrieved for a unique entry in the array.

Usually this translation comes in the form of a handle. A handle can be implemented as an integer divided into two different sections. The details of how to implement a handle should be known by the reader before continuing on, so please view Llopis as a resource.

Let’s create a simple abstract data type that grants access to an array of POD elements and supports handle translation, allocation, and deletion:

In order to implement these two functions please do refer to the Llopis resource referenced in the last paragraph.

In order to implement Release things start to get tricky. How does the PODHandler update the handle of the element it moved? Somehow the location of the internal handle entry needs to be accessible just by knowing where the element was moved from. The easiest solution is to place a handle inside the type T within each element of the array. However, it would be great if types that are held inside of PODHandlers can fit within a single cache line. Adding a handle to every single element lowers the density of the data in the array. For certain situations this data bloat, though only 4 bytes per element, will reduce the effectiveness of every single cache line from 64 bytes to 60.

Clearly an alternative could be used! The solution is to yet again separate different types of data into different arrays. The internal array of the PODHandler should consist of homogeneous data! Rip out those intrusive handles and place them all in their own array. PODHandler can now consist of 3 arrays in total: an array of type T, an array of Handles, and an array of integers.

The array of integers shares its indices with the array of PODs. This means a POD in element 3 will correspond to the integer in element 3 of the integer array. The integer array contains indices that map to the handle associated with a given POD element in the handle array m_entries.
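To tie the three arrays together, here is a compact sketch. It is not my actual PODHandler source: apart from the names `PODHandler`, `Release`, and `m_entries`, everything here (the 16/16 bit handle split, `Allocate`, `Get`, `m_entryOfData`, the fixed capacity `N`) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Low 16 bits: index into m_entries. High 16 bits: generation counter.
typedef std::uint32_t Handle;

template <typename T, int N>
class PODHandler
{
    static_assert(N > 0 && N <= 65536, "entry index must fit in 16 bits");

public:
    PODHandler() : m_count(0), m_freeHead(0)
    {
        for (int i = 0; i < N; ++i)
        {
            m_entries[i].dataIndex = -1;
            m_entries[i].nextFree = (i + 1 < N) ? i + 1 : -1;
            m_entries[i].generation = 0;
        }
    }

    Handle Allocate(const T& value)
    {
        assert(m_freeHead != -1); // out of entries
        int e = m_freeHead;
        m_freeHead = m_entries[e].nextFree;
        m_entries[e].dataIndex = m_count;
        m_entryOfData[m_count] = e;
        m_data[m_count] = value;
        ++m_count;
        return (static_cast<Handle>(m_entries[e].generation) << 16) |
               static_cast<Handle>(e);
    }

    // Handle translation. Returns nullptr for stale or invalid handles.
    T* Get(Handle h)
    {
        int e = EntryIndex(h);
        if (e >= N) return nullptr;
        if (m_entries[e].dataIndex == -1) return nullptr;
        if (m_entries[e].generation != Generation(h)) return nullptr;
        return &m_data[m_entries[e].dataIndex];
    }

    void Release(Handle h)
    {
        assert(Get(h) != nullptr);
        int e = EntryIndex(h);
        int hole = m_entries[e].dataIndex;
        int last = m_count - 1;

        // Move the last POD into the hole, then use the integer array
        // to find and patch the entry that referred to the moved POD.
        m_data[hole] = m_data[last];
        int movedEntry = m_entryOfData[last];
        m_entries[movedEntry].dataIndex = hole;
        m_entryOfData[hole] = movedEntry;
        --m_count;

        // Retire the released entry and push it on the entry freelist.
        m_entries[e].dataIndex = -1;
        ++m_entries[e].generation;
        m_entries[e].nextFree = m_freeHead;
        m_freeHead = e;
    }

    int Size() const { return m_count; }
    T* Data() { return m_data; } // dense array of live elements

private:
    struct Entry { int dataIndex; int nextFree; std::uint16_t generation; };

    static int EntryIndex(Handle h) { return static_cast<int>(h & 0xFFFFu); }
    static std::uint16_t Generation(Handle h)
    {
        return static_cast<std::uint16_t>(h >> 16);
    }

    T m_data[N];          // homogeneous, densely packed PODs
    Entry m_entries[N];   // handle entries, indexed by the handle
    int m_entryOfData[N]; // shares indices with m_data
    int m_count;
    int m_freeHead;
};
```

Code that transforms everything loops over `Data()` from 0 to `Size()` and never touches the other two arrays; only code that needs one specific element pays for the translation in `Get`.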

Readers may by now be wondering “wouldn’t three different arrays potentially have worse locality of reference than just two arrays?”, and this is a good thing to wonder. It is true that an intrusive handle would be preferable if handle translations are extremely frequent. If they are, the original intrusive handle implementation may be ideal.

If an engine is architected to focus on cache utilization for transforming large data sets with expensive operations, then a homogeneous array will be preferable. Or in other words, if you want to loop over a lot of stuff and do expensive math on each element, that array better be dense. In such an engine handle translations are infrequent, since the code focuses on looping over the data array itself rather than picking out individual elements at random.

Open Source Implementation

The idea PODHandler represents is important. My implementation is just my own manifestation of the concepts described in this post. My implementation is not important! The concepts are important. Hopefully by allowing readers another piece of reference, in the form of some source code, the ideas presented here can be better realized.

PODHandler Source: Link (not up yet)

Sane Usage of Components and Entity Systems

With some discussion going in a previous article about how to actually implement some sort of component system for a game engine, without vague theory or dogma, a need for some higher level perspective was reached, and so this article arose.

In general an aggregation model is often useful when piecing together bits of functionality or data to create something new. The ability to do so is very useful for writing game-specific gameplay code due to the flexibility of code granted by aggregation. However as of late there’s been tremendous talk about OOP, Entity Systems, Inheritance, and blah blah blah within the online indie development community. More and more buzzwords get tossed around by big name writers and the audience really just looks for some guidelines to follow in hopes of writing good code.

Sadly there isn’t going to be a set of step by step rules for writing a game engine or coming up with a good architecture. Like many have said before me, writing a game is a specific task requiring specific solutions. Why do you think game engine developers such as Epic or the Unity guys have so many people working on the product? Because a generic game engine is a huge piece of software that requires a lot of features. Some features exist simply to let users add in custom features easily.

Components, aggregation, Entity Component Systems, Entity systems, these are just words and have various definitions (depending on who you ask).

Some Definitions

To hopefully avoid silly arguments and confusion let’s define some terms. If you don’t like the definitions here feel free to say so; I’m all up for criticism and debate.

• Component Based Architecture: A preference for aggregation over inheritance. It is just a concept and does not lead to a single specific implementation. A game object is a collection of components. A component defines data and/or functionality for a concept.
• Entity Component System (ECS): A specific implementation of Component Based Architecture. A game object would be an ID (an integer). The ID is used to form an aggregate. Usually an ECS implies an implementation similar to a database, where components are entries into a database that are looked up through some identifier. The main goals of this implementation are efficiency and simplicity. The term “ECS” is often used just to describe a Component Based Architecture, which leads to confusion.
• Aggregation: I like to think of this as a “has-a” relationship over an “is-a” relationship. Aggregation refers to one object “having” another object, which implies an aggregate is a collection (data structure) of other objects.

Some Truth and History

Aggregation is useful from a game design perspective. It frees functionality from arbitrary classification (classes and inheritance). Classes were added to C++ (the idea itself dates back to Simula) to let a programmer tie together a piece of data and some functionality to represent some sort of real-life concept. This is in simplest terms the essence of Object Oriented Programming (OOP). Further features help engineer relationships between classes; one such feature comes in the form of inheritance.

There’s nothing inherently wrong with OOP and it makes sense in a lot of code. Problems can arise when OOP is misapplied in ways whose implications aren’t fully understood at the time of implementation, causing negative effects down the road. I’m sure we’ve all seen the code migration and mega-class example so commonly thrown around in articles arguing against OOP and inheritance abuse.

In response to such an abuse a new paradigm became popularized which focused on aggregation of functionality to form an object. This might be called a “component based architecture”. In general aggregation can be considered an appropriate alternative to inheritance.

OOP Diatribe

Usually when an article spews forth caustic attacks against OOP it’s directed at naive implementations that disregard implications of how memory is accessed. Perhaps in the past the bottleneck of most everything was processor speed, so a lot of literature focuses on this. Nowadays CPUs on the PC have an architecture with ridiculous computational power but comparatively slow memory access. In general one might consider accessing memory from RAM 300 times slower than multiplying two floats together. Of course this last statement is extremely anecdotal without any evidence, but exists just to give a rough perspective of reality in many current (2014) cases.

If objects with associated code (classes) are just allocated and deallocated on the heap at will then a performance bottleneck of memory access is going to rear its ugly face, likely long before other performance issues are even on the radar. This is where much of the diatribe comes from.

It should be noted that pretty much all code bases that make use of the C++ language use classes and structures in some form or another. As long as a programmer has an understanding of memory, how it’s accessed, and what implications arise from given implementations, nothing will go wrong. Alas, actually doing these things and writing good code is super hard. It doesn’t matter if a class has some implementation code within it, so long as that bit of code makes sense for the purposes it is serving.

Implementing Components, a First Draft

The most immediate implementation would be to make use of multiple inheritance. This has a clear definition of where the data goes, and it all goes in one class: the derived class. Multiple inheritance itself can get a bit tricky when dealing with pointer typecasting between derived and base types, though the C++ language itself handles the details much of the time.

Inheritance alone doesn’t provide a good mechanism to query whether a base class is a part of a specific derived aggregation and so the dynamic cast operator is born. Since the dynamic cast is a branching operation, usually implemented (afaik) by inspecting the vtable, it is avoided in general.
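As a small illustration of that query, here is a sketch using dynamic_cast on a multiple-inheritance aggregate. The class names (Sprite, RigidBody, Crate, Decal) and the QueryRigidBody helper are made up for this example and are not from any engine:

```cpp
// Hypothetical component-like base classes; the names are illustrative only.
// Each needs at least one virtual function so dynamic_cast can inspect it.
struct Sprite    { virtual ~Sprite() = default;    int textureId = 0; };
struct RigidBody { virtual ~RigidBody() = default; float mass = 1.0f; };

// Two aggregates built with multiple inheritance.
struct Crate : Sprite, RigidBody {};
struct Decal : Sprite {};

// Ask at run time whether the object behind a Sprite* also has a RigidBody.
// dynamic_cast performs a cross-cast and yields nullptr when the answer is no.
inline RigidBody* QueryRigidBody(Sprite* sprite)
{
    return dynamic_cast<RigidBody*>(sprite);
}
```

Calling QueryRigidBody on a Crate returns a usable pointer, while calling it on a Decal returns nullptr, so every call site ends up branching on the result.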

Multiple inheritance also does all sorts of work to member function pointers, and is just a sad part of C++. Additionally there isn’t any language feature that allows for dynamic dispatch for combinations of base classes, so if the need arises a custom solution will need to be implemented anyway.

Memory accessing, although defined, isn’t ideal. Multiple inheritance forms a blob of different data, and usually only a single piece of the blob is needed at any given time, meaning locality of reference will be poor in general. This leads to the idea of inheriting from multiple interfaces in order to decouple memory aggregation from functionality aggregation, which leads to the next draft.

Second Draft – Run-Time Aggregation

Instead of using multiple inheritance on interfaces, which is a compile-time feature, run-time support can be added. Object aggregates can be formed during run-time, and modified thereafter. This is appealing for data driven applications, and game-design friendly development iteration speed.

So let’s assume that some programmer wants to implement components, but doesn’t think much about memory access patterns or the implications therein. Using a vector of pointers, an implementation of components becomes super simple. Each pointer can point to an interface exposing a few functions like Update, Init and Shutdown.

Searching for a particular component is as simple as linearly looping over each pointer until a matching type is found. If these pointers are ordered in some way a search can be performed, perhaps a binary search could suffice. If the identifier of a component is hashable a hash table lookup can be used.
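A minimal sketch of this vector-of-pointers approach might look like the following. The TypeId enum, the Transform and Health components, and the GameObject class are all assumptions made for illustration; only the Update, Init and Shutdown names come from the text above:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical identifier scheme; a real engine might use hashes or strings.
enum class TypeId { Transform, Health };

// The interface every component exposes.
struct Component
{
    virtual ~Component() = default;
    virtual TypeId Type() const = 0;
    virtual void Init() {}
    virtual void Update(float /*dt*/) {}
    virtual void Shutdown() {}
};

struct Transform : Component
{
    float x = 0.0f, y = 0.0f;
    TypeId Type() const override { return TypeId::Transform; }
};

struct Health : Component
{
    int points = 100;
    TypeId Type() const override { return TypeId::Health; }
    void Update(float) override { if (points > 0) --points; }
};

// A game object is nothing more than a run-time collection of components.
struct GameObject
{
    std::vector<std::unique_ptr<Component>> components;

    void Add(std::unique_ptr<Component> c) { components.push_back(std::move(c)); }

    // Linear search: walk the pointers until the requested type turns up.
    Component* Find(TypeId id) const
    {
        for (const auto& c : components)
            if (c->Type() == id)
                return c.get();
        return nullptr;
    }

    void Update(float dt)
    {
        for (auto& c : components)
            c->Update(dt);
    }
};
```

Every component here lives in its own heap allocation, so this sketch has exactly the undefined memory layout discussed next.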

The implementation so far is an excellent one except that there is no definition of how memory is allocated and accessed! In the most naive implementation each game object and each component will be allocated on the heap with separate calls to malloc.

Despite having no clear memory definition there are some nice benefits that have arisen. Data driving the composition of an aggregate becomes quite trivial as each component of an aggregation can have an entirely isolated lifetime. Adding, removing, modifying, or even creating new components at run-time are all now possibilities. This dynamic aggregate architecture is great for improving game development and design iteration time!

Aggregation and Components and the Entity System Paradigm (ES/ECS)

As stated in the definitions section, an ECS is just a specific implementation of a component based architecture. A component based game engine architecture could be thought of as a custom implementation of multiple inheritance. A clearly defined ECS can impose restrictions on how a component architecture is implemented and used in hopes of avoiding poor memory access patterns, or in hopes of keeping code simple and orderly.

If a component is designed as a piece of memory without any code, and a game object defined as an integer ID then performance specifications can be easily imposed. Rules about where in memory components lay, and how components are actually accessed can be clearly defined in simple terms. Code can be written that operates upon arrays of components, transforming arrays linearly. This idea is actually a type of Data Oriented Design (DOD), which makes sense as DOD is just an idea! ECS is an application of the idea of DOD.
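To make the idea concrete, here is a deliberately tiny sketch under some simplifying assumptions: every entity has every component, and the entity ID doubles as the array index. The Position, Velocity, World and Integrate names are invented for this example:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Components are plain data with no code attached.
struct Position { float x, y; };
struct Velocity { float x, y; };

// A game object is just an integer ID. In this simplified sketch the ID
// doubles as the index into every component array.
using Entity = std::uint32_t;

struct World
{
    std::vector<Position> positions;
    std::vector<Velocity> velocities;

    Entity Create(Position p, Velocity v)
    {
        positions.push_back(p);
        velocities.push_back(v);
        return static_cast<Entity>(positions.size() - 1);
    }
};

// A "system" is a free function that walks the arrays front to back.
inline void Integrate(World& world, float dt)
{
    for (std::size_t i = 0; i < world.positions.size(); ++i)
    {
        world.positions[i].x += world.velocities[i].x * dt;
        world.positions[i].y += world.velocities[i].y * dt;
    }
}
```

A real ECS would also need a mapping from IDs to sparse component storage and a way to destroy entities; those parts are left out here to keep the linear array transform visible.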

So with this type of implementation the benefits of dynamic composition can be paired with well-defined memory layout and access patterns. Suddenly prefetching and parallelism become much simpler to support.

Aggregatize all the Things!

There’s a problem. Blindly shoving the idea of an ECS implementation into every nook and cranny of an engine (or any complex system, not just game engines or libraries) during development is just silly. Often times a particular system is not best implemented with a component or aggregate paradigm in mind.

An obvious case is that of a physics engine. Often times a physics engine developer is worried about collision detection, solving systems of linear equations, rigid body mechanics and allowing the engine to easily be integrated into existing code bases. These details involve a lot of math and good API design. A developer of a physics engine is going to have their focus employed in full force in solving problems specific to physics engines. This means that the engineer’s focus is finite, so the implementation that is best is one that the engineer can actually bring to completion. An implementation that can come to completion is one that makes sense for the specific details of whatever is going on inside the physics engine. The specific paradigms used are often not aggregation or component based!

In order for a physics engine to run fast it needs efficient memory access patterns and memory usage, which on modern PC hardware requires some form of DOD. Since this complex (often black boxed) physics engine will have its own specific implementation and optimization it doesn’t make sense to force a component based model to its very core with some sort of idealistic zeal. It gets really bad when strict rules are imposed (like banning all code from classes and structures that define components) on the component model (like with an ECS) and the rules start permeating the deep recesses of the entire code base.

The same thing goes for any sort of complex system. The core facilities of a game engine often times just don’t really care about components or aggregation. This means that an engine architecture that implements components will usually have to deal with middleware graphics/physics engines/libraries that don’t subscribe to a component based model (simply because it’s easier to use a library than to write your own custom things, especially if those custom things religiously follow some silly methodology like ECS or even OOP). In practice light wrapper components can be created to let the functionality of such systems be presented in a component format, ready to be used in an aggregate object.

What does this all mean? What should we all do?

Use components where it makes sense in code. Use inheritance where it makes sense in code. Use databases where they make sense. Use all the things where they should. This is a pretty sad answer but it’s the right one. There is no silver bullet paradigm that solves all the problems in the game engine architecture world, and there are no steps to follow to achieve a result that works in all cases. Specific problems require specific solutions. Good code is hard to write, and will require a lot of judgement calls. In order to make good judgement calls a lot of experience and perspective is required.

I recommend using aggregation where it really matters. Dynamic aggregation is important for gameplay specific code. Gameplay specific code, in this article, would refer to code that would not easily apply or work at all in a different game. It’s code that is your game and doesn’t define an isolated system or functionality.

Dynamic aggregation and the component based model are extremely important for game and object editors. Game design flourishes best when iteration times are driven to zero, and the ability to create new things from a composition of fundamentals is very valuable! Clearly composition is useful, but how it’s to be used is the hard part.

What Components to Make?

I recommend making components that provide access to game-independent functionality quite large. Every 3D game engine has a concept of a mesh, and will usually have some sort of file format to associate with, like FBX. Every 2D game engine will have the concept of a sprite. Each game using Box2D will have colliders and rigid bodies, and possibly joints. These fundamental pieces of functionality don’t change very often, so static compile-time relationships aren’t a bad thing since iteration time isn’t really all that relevant.

A 3D game might have a single Mesh component for example. A Mesh component can have renderable vertices, and possibly all the skeletal and animation information as well. There may be a single Rigid Body component, which encapsulates the idea of colliders or shapes, as well as the functionality of rigid body mechanics. The Rigid Body component might even contain all necessary code and data to hold multiple joints! Or joints may be a component themselves.

For high level and gameplay related features components can become much more granular (or not if you so choose). Gameplay should be iterated, tested and changed frequently, so having small and decomposed components will probably make a lot of sense in a lot of cases. Large components that encompass more broad ideas will be useful in many cases too. Even in the gameplay world judgement calls are essential.

Usually efficiency isn’t so important for much gameplay code, so any implementation that is decently performant will suffice. Scripting languages, dynamic memory allocation and virtual dispatch, or what have you can all work. The decisions of what requires flexibility, what requires performance and everything in between can be difficult to make. Please see the reference section for some concrete examples.

We live in a world of opinions and it takes time to sift through them! If you have recommendations please comment below :)

Reference Source Code

The best reference I know of is an open source game engine in progress (stalled until I graduate) I myself am developing. Please do send me your recommendations on references!

Automated Lua Binding

Welcome to the fifth post in a series of blog posts about how to implement a custom game engine in C++. As reference I’ll be using my own open source game engine SEL. Please refer to its source code for implementation details not covered in this article. The folder of interest would be LuaInterface.

Introduction

Binding things to Lua is twofold: objects and functions must be able to be sent to and retrieved from Lua. Functions can be either static C or struct/class methods. Objects can be sent “by value” or “by reference”. As you can imagine it is important to be able to unify and simplify the binding process as much as possible to reduce all manual dev-work and upkeep.

Generic C++ Functor

As with many things in a modern C++ game engine it is critical to have a generic C++ functor. Ideally this functor can wrap around class/struct methods (not only static functions). It is also possible to have this functor refer to a function within Lua as well.

Please see my article and slides on C++ Function Binding for implementation details not covered here.

Prerequisites

This article is on the topic of automatic Lua binding; if you’re unfamiliar with how to bind simple C functions to Lua please do a little research and come back later. The deep end of the pool is actually pretty deep!

I also suggest a working knowledge of C++ templates before trying to implement these sorts of features. A working knowledge of Lua is also essential.

Setting the Boundaries

With a scripting language it’s important to clearly define what you want to expose to script. Is the entire game in Lua? Are only specific parts accessible? What are the boundaries? It’s all too easy to get very caught up in what to send, what to implement, what not to do. Having clear boundaries of exactly what you want to do is the best way to start coding.

Passing Objects to Lua

Objects can be passed to Lua by reference and value. A reference would consist of a pointer-sized piece of memory (4 bytes on a 32-bit build, 8 bytes on a 64-bit one) holding a pointer to some C++ memory. This allows Lua to store a “reference” to an object in C++. Most of the work involved in this type of object binding is in allowing Lua to call C++ methods or functions on the pointer it’s storing.

The benefits of this approach are such that: calling class methods is pretty fast and shouldn’t be a worry; fairly simple to implement as most of the work is finished by creating a generic functor in C++; no hassle or upkeep when wanting to send new types of objects to Lua, since each object is just a single pointer.

Passing by Reference with lightuserdata

There are two ways I’d recommend to pass an object to Lua with: userdata and lightuserdata. A lightuserdata represents a void * in Lua and can hold a reference to an object in C++.

Here’s how one might send and retrieve lightuserdata from Lua:

This method is very fast, simple to implement and has very minimal memory overhead. Additionally lightuserdata can be compared to one another, and are equal if the underlying address is equivalent. However, one cannot attach metatables to lightuserdata and there is no sense of type safety whatsoever. A lack of type safety means that if someone passes a lightuserdata into an incorrect C function the host program will likely crash.

With lightuserdata the following code is possible:

This solution will work for one, maybe two people working on a smaller project or minimal amount of code. I can imagine that the lack of type safety will be the biggest issue as time goes on.

Reflection for Type Safety

It is possible to implement type-safety in Lua. However this requires Lua code to be maintaining type information. Lua is a scripting language meaning it ought best be used to script things. Something so integral and common as type-safety might better be implemented in lower-level C++ code.

Implementing type safety on the C++ side has two benefits: efficiency of implementation; type-safety can optionally be compiled away in release mode.

I highly recommend building yourself a simple, custom introspection library in C++. All that is really needed to start is the ability to query a type’s size and name efficiently. Please see my older article on custom Introspection or the game engine SEL for examples on how to implement such a system.

With a simple macro-based registration system one can register and lookup type information via introspection like so:
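A minimal sketch of what such a registration system could look like. The macro names REGISTER_TYPE and GET_TYPE, the TypeInfo layout and the helper functions are assumptions made for illustration and are not taken from SEL:

```cpp
#include <cstddef>
#include <cstring>

// Minimal type record: just a name and a size, as suggested above.
struct TypeInfo
{
    const char* name;
    std::size_t size;
    const TypeInfo* next; // intrusive linked list of every registered type
};

// Head of the global list of registered types.
inline const TypeInfo*& TypeListHead()
{
    static const TypeInfo* head = nullptr;
    return head;
}

// One TypeInfo instance per C++ type, created on first use.
template <typename T>
TypeInfo& TypeOf()
{
    static TypeInfo info = { nullptr, sizeof(T), nullptr };
    return info;
}

// Fills in the name and links the record into the list (only once per type).
template <typename T>
const TypeInfo* RegisterType(const char* name)
{
    TypeInfo& info = TypeOf<T>();
    if (info.name == nullptr)
    {
        info.name = name;
        info.next = TypeListHead();
        TypeListHead() = &info;
    }
    return &info;
}

// Lookup by string, e.g. for names arriving from a script or a data file.
inline const TypeInfo* FindTypeByName(const char* name)
{
    for (const TypeInfo* t = TypeListHead(); t != nullptr; t = t->next)
        if (std::strcmp(t->name, name) == 0)
            return t;
    return nullptr;
}

#define REGISTER_TYPE(T) RegisterType<T>(#T)
#define GET_TYPE(T) (&TypeOf<T>())
```

With this in place, REGISTER_TYPE(int) followed by GET_TYPE(int)->size yields sizeof(int), and FindTypeByName("int") returns the same record. The linked list also gives a way to walk every registered type.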

After this is complete and working (if you don’t have an implementation of introspection yet this is fine, just think of it as a black box) a small generic Variable object ought to be created. Sample code of a functional Variable object is in this post.

A Variable can be used like so:

It is important to note that the Variable itself is not a templated type!

When passing an object to Lua we can send a pointer to a Variable. As long as the Variable exists in memory in C++ the lightuserdata within Lua will point to a valid Variable. Upon retrieval of the Lua object back to C++ a type assertion can be run:
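The sketch below shows one way a non-templated Variable with such a type assertion could be put together. It does not touch the Lua API at all: the RoundTrip function merely imitates the trip through a void *, and TypeInfo, TypeOf, Is and Get are hypothetical names rather than SEL’s actual interface:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the introspection system: one TypeInfo per C++ type.
struct TypeInfo { std::size_t size; };

template <typename T>
const TypeInfo* TypeOf()
{
    static const TypeInfo info = { sizeof(T) };
    return &info;
}

// Variable itself is not a template: it erases the type down to a void*
// plus a TypeInfo pointer. Only its member functions are templated.
class Variable
{
public:
    template <typename T>
    explicit Variable(T* object) : m_data(object), m_type(TypeOf<T>()) {}

    const TypeInfo* Type() const { return m_type; }

    template <typename T>
    bool Is() const { return m_type == TypeOf<T>(); }

    // Type assertion on retrieval. assert compiles away when NDEBUG is
    // defined, so the check costs nothing in release builds.
    template <typename T>
    T& Get() const
    {
        assert(Is<T>() && "Variable holds a different type");
        return *static_cast<T*>(m_data);
    }

private:
    void* m_data;
    const TypeInfo* m_type;
};

// Imitates storing a Variable* as lightuserdata and reading it back:
// all that survives the trip is a raw void*.
inline Variable* RoundTrip(Variable* v)
{
    void* userdata = v;
    return static_cast<Variable*>(userdata);
}
```

After a round trip, Get<int>() on a Variable built from an int succeeds, while Get<float>() on the same Variable trips the assertion in a debug build instead of silently reinterpreting memory.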

Generic Static Function Binding

Binding C-style static functions in a generic way makes heavy use of custom introspection. The way I was originally taught was to just throw the entire binding function (in C++) at you all at once and let you suffer. Prepare to suffer as I did!

This function isn’t doing the bind, it’s what is bound. Every time a function in C++ is called from Lua, this function is called first.

An upvalue in Lua is akin to static variables in C. Using this we can attach a pointer to a generic functor to a bound C function within Lua. As Lua calls a C function this upvalue is retrieved and eventually used to actually call the C function.

The rest is just a matter of handling variables to/from Lua. In the above example the Variable object contains some helper functions called ToLua and FromLua. The nice thing about my implementation of this within SEL is that no heap memory is used during this entire process! All this code boils down to a very efficient method of generically calling C functions.

I will leave binding C++ methods as an exercise for the reader. By now you ought to have an idea of where to look for example implementation! The idea is to handle type information for the “this pointer” of the method, and pass around an actual “this pointer” to call the method.

Calling Methods from Lua

Let’s say you have an implementation that allows Lua code like the following:

A few things need to happen here. The first is that the object in question should only call methods that are actually methods of that specific type of class; one cannot simply bind all C++ methods and place functions in Lua within the global scope. Any object type could call any method type making for a lack of type-safety and dangerous code.

At this point the lightuserdata will have to be upgraded to a full userdata. Full userdata in Lua enjoy benefits such as the ability to set and modify metatables. If you’re not familiar with Lua metatables please do a little research on the topic and come back later.

A full userdata allows us to place a copy of a Variable within Lua memory, instead of just a void *. This means a temporary Variable can be used to call ToLua, instead of requiring that the Variable sent stays valid in C++ for the duration of usage within Lua.

Currently a way to create metatables for all of our C++ types is required. Assuming a linked list of all TypeInfo objects from the introspection system is available:

This loop is just creating metatables given the string names of what each metatable should be called.

After the tables are created the actual C++ methods and functions should be bound. This turns out to be really simple! It is assumed that each function and method registered within the introspection system can be passed to the function at some point (perhaps during registration of the type information):

And that’s really all there is to it! The idea here is to make sure that a type with methods sent to Lua has its userdata fixed with a metatable containing the available methods to call. When the __index metamethod is called it will search within the metatable itself for an appropriate member. Members of the metatable are the functions we bound to Lua. After they are fetched they can be called. This is what happens behind the scenes when we do:

Passing Object by Value

Passing objects by value is actually much more difficult. The idea is to utilize tables to store representations of the members associated with a class or struct. A table can be used to represent state of an object.

The __index and __newindex metamethods of a userdata should be set to look into the state table first. This lets users assign new values, and lets your ToLua and FromLua functions copy members from C++ to/from this Lua state table.

If a member is not found in the state table the metatable can then be searched by setting the __index metamethod of the state table to refer to the proper metatable.

All of this table indirection does incur significant overhead, however it allows objects in Lua to be used like so:

I myself have not implemented this type of Lua binding, though it is entirely possible and can be quite nice to work with. I reiterate that adding this many tables incurs both memory and performance overhead not seen with the other styles. This seems to be the only drawback.

Conclusion

Well this post turned out longer than I expected, over 2k words! Hopefully the information was clear. It’s really nice being able to refer people to a complete and working example such as the SEL engine; it makes writing articles much easier and simpler.

Simple Sprite Batching

Welcome to the fourth post in a series of blog posts about how to implement a custom game engine in C++. As reference I’ll be using my own open source game engine SEL. Please refer to its source code for implementation details not covered in this article. Files of interest are Graphics.cpp and Graphics.h.

Batching and Batches

Did I say “simple” sprite batching? I meant dead simple!

Modern graphics card drivers (except Mantle???) do a lot of stuff, which makes passing information over to the GPU (or retrieving it) really slow on the PC platform. Apparently this is more of a non-issue on consoles, but meh we’re not all working with consoles now are we?

The solution: only send data one time. It’s latency that kills performance and not so much the amount of data. This means that we’ll do better at utilizing a GPU if we send a lot of data all at once as opposed to sending many smaller chunks.

This is where “batching” comes in. A batch can be thought of as a function call. In OpenGL you’ll see something like DrawArrays, and in DirectX something else. These types of functions send chunks of data to the GPU in a “batch”.

Sprites and 2D

Luckily it’s really easy to draw sprites in 2D: you can use a static quad buffer and instance by sending transforms to the GPU, or you can precompute transformed quads on the CPU and pass along a large vertex buffer, or anything in-between.

However computing the batches is slightly trickier. For now let’s assume we have sprites with different textures and different zOrders.

Computing Batches

In order to send a batch to the GPU we must only draw sprites with the same texture. This is because we can only render instances (or a large vertex array) with a given texture in order to lower draw calls. So we must gather up all sprites with the same texture and render in the correct order according to their zOrders.

If you are able to store your sprites in pod-structures then you’ll be in luck: you can in-place sort a large array of pods really easily using std::sort. If not, then you can at least make an array of pointers or handles and sort those. You’ll have extra indirection, but so be it.

Using std::sort requires STL compatible iterators, and you’ll want one with random access (array index-style access). Here’s an example with a std::vector:

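The sort implementation within your package of the STL is likely going to be a quicksort variant (the common standard library implementations typically use introsort, a quicksort hybrid).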

This sort will sort by zOrder first, and if zOrders are matching then sort by texture. This packs a lot of the sprites with similar textures next to each other in the draw list.
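As a sketch of the ordering just described, assuming a hypothetical pod Sprite with zOrder and texture fields (the field names, SpriteLess and SortDrawList are illustrative, not SEL’s actual types):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical pod sprite; only the fields the sort cares about matter here.
struct Sprite
{
    int zOrder;
    unsigned texture; // e.g. an OpenGL texture name
    float x, y;
};

// Order by zOrder first; break ties by texture so that sprites sharing a
// texture end up adjacent within each layer.
inline bool SpriteLess(const Sprite& a, const Sprite& b)
{
    if (a.zOrder != b.zOrder)
        return a.zOrder < b.zOrder;
    return a.texture < b.texture;
}

// Sorts the draw list in place.
inline void SortDrawList(std::vector<Sprite>& sprites)
{
    std::sort(sprites.begin(), sprites.end(), SpriteLess);
}
```

Note that std::sort is not stable, so sprites with equal zOrder and texture may swap places between sorts; std::stable_sort is an option if that ever matters for drawing.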

From here it’s just a matter of looping over the sprite array and finding the beginning/end of each segment with the same texture.
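That loop could be sketched as follows, assuming the array is already sorted by zOrder and then texture. The Batch struct and ComputeBatches function are hypothetical names; each resulting Batch would correspond to one draw call:

```cpp
#include <cstddef>
#include <vector>

// Trimmed-down hypothetical sprite; only what batching needs.
struct Sprite { int zOrder; unsigned texture; };

// A batch is a contiguous run [first, first + count) of the sorted draw
// list in which every sprite uses the same texture.
struct Batch
{
    unsigned texture;
    std::size_t first;
    std::size_t count;
};

// Assumes `sprites` is already sorted by (zOrder, texture).
inline std::vector<Batch> ComputeBatches(const std::vector<Sprite>& sprites)
{
    std::vector<Batch> batches;
    for (std::size_t i = 0; i < sprites.size(); ++i)
    {
        // Start a new batch whenever the texture changes.
        if (batches.empty() || batches.back().texture != sprites[i].texture)
            batches.push_back(Batch{ sprites[i].texture, i, 0 });
        ++batches.back().count;
    }
    return batches;
}
```

Because the loop only compares neighbours, a run with one texture can span two zOrder layers when the last texture of one layer matches the first texture of the next, which is harmless since draw order is preserved.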

Optimizations

A few simple operations can be done here to ensure that computing batches goes as fast as possible. The first is to get all of your sprites together in a single array and sort the array in-place. This is easily done if your sprites are mere pods. This ensures very high locality of reference when transforming the sprite array.

The second optimization is to only sort the sprite array when it has been modified. If a new sprite is added to the list you can sort the whole thing. However there is no need to sort the draw list (sprite array) every single frame if no changes are detected.
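One simple way to get this behaviour is a dirty flag, sketched below. The DrawList class and its member names are assumptions for illustration; a real engine would also need to mark the list dirty when a sprite is removed or has its zOrder or texture changed:

```cpp
#include <algorithm>
#include <vector>

// Trimmed-down hypothetical sprite.
struct Sprite { int zOrder; unsigned texture; };

class DrawList
{
public:
    // Any modification marks the list as needing a sort.
    void Add(const Sprite& s)
    {
        m_sprites.push_back(s);
        m_dirty = true;
    }

    // Call once per frame before computing batches.
    // Returns true only when a sort actually ran.
    bool SortIfNeeded()
    {
        if (!m_dirty)
            return false;
        std::sort(m_sprites.begin(), m_sprites.end(),
                  [](const Sprite& a, const Sprite& b)
                  {
                      if (a.zOrder != b.zOrder)
                          return a.zOrder < b.zOrder;
                      return a.texture < b.texture;
                  });
        m_dirty = false;
        return true;
    }

    const std::vector<Sprite>& Sprites() const { return m_sprites; }

private:
    std::vector<Sprite> m_sprites;
    bool m_dirty = false;
};
```

On frames where nothing was added, SortIfNeeded returns immediately and the previously sorted array can be reused as is.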

Conclusion

Like I said, sprite batching is super simple in 2D. It can get much more complex if you add texture atlasing into the mix. If you wish to see an OpenGL example please see the SEL source code.

I was able to render well over 8k dynamic sprites on-screen in my own preliminary tests. I believe I actually ended up being fill-rate bound instead of anything else. This is much more than necessary for any game I plan on creating.