Category Archives: C++

Compute Basis with SIMD

A while back Erin Catto made a cool post about his short function to compute a basis given a vector. The cool thing was that the basis is axis aligned if the input vector is axis aligned.

Catto noted that the function can be converted to branchless SIMD by using a select operation. So, a while ago I sat down and wrote the function, and only used it once. It didn’t really seem all too helpful to have a SIMD version, but it was fun to write anyway (I made use of Microsoft’s __Vectorcall):



Single File Libraries – |

Sean T. Barrett makes a lot of very cool single-file libraries in C. Recently he’s also been making another big list of other single-file (or two files with src/header) that he likes.

The great thing about Sean’s libraries is that they contain functions that do exactly what they intend to accomplish, without doing anything more or less. They don’t contain any extra complexity other than specifically what is needed to get the job done. This helps when preventing his libraries from creating external dependencies, meaning his libs can be deployed as a single file inclusion into a pre-existing project, without linking to external libraries or requiring additional headers.

These qualities make libraries like Sean’s extremely easy to hookup and get going. If you want to learn how to make quality libraries just look at any of the STB libraries.

Writing Libraries

Sean writes libraries in an old version of Visual Studio (I think VC6?), and codes in C. He also keeps all of his code inside of a single file while writing it — I’ve seen this on For the rest of us that aren’t as crazy or hardcore as Sean we can just use is a Perl script written by an unknown author (probably written by Richard Mitton). I found the file inside of Mitton’s single-file library “tigr”, which stands for Tiny Graphics Library. Mitton uses to recursively include a bunch of files into one larger c file. Check out the script yourself:

The idea is to make a dummy C file that includes other source files, like this:

Then can be run from the command line (assuming you have Perl installed) like so:

The script outputs some nice and quick text to stdout displaying which files were visited and packages the entire source tree into a single source file. The output file contains nice comments indicated the beginning and ending of files, like this excerpt from one of my own single-file libraries:


Mitton writes about the old joys of using the incbin command in assembly, where the assembler would embed the binary contents of a file straight into your program. It sounds like this gets dubious when dealing with linkers nowadays (I’ve had some really long link times by including large files into source code…), though it still happens from time to time in small single-file libraries.

An example is in Mitton’s tigr library where he uses a makeshift perl script “” to embed shaders and a png file (containing raster font glyphs) straight into C source code. This concept can also be seen in Omar Ocornut’s imgui library where some font files are embedded into the source. Omar seems to have used a small C program to generate the binary data as C source.

Again, I’m not sure who originally wrote this script but it was probably Mitton himself.

Finding Heap Corruption Tip on Windows

If one is fortunate enough to have access to <crtdbg.h> from Microsoft’s Visual C++ header, when running in debug mode there are a few _Crt*** functions that can  be pretty useful. My favorite ones are _CrtCheckMemory and _CrtDbgBreak (which is just a macro for __debugbreak in my case).

This function checks padding bytes around allocations made by the debug heap for buffer overruns. Usually this is not too necessary if you just do bounds checking yourself in debug mode, though sometimes old code crops up that you didn’t even write yourself, and needs debugging.

Lucky enough for me I had a good guess as to where the corruption was coming from as I had some call stacks in-tact from my crash. Finding the heap corruption was easy in my particular case since the code is deterministic and can be re-run quickly. Sprinkling this macro around will let me back-trace through the code to find exactly where the first instance of corruption happened:

A few minutes later I ended up here:

Since my breakpoint activated just after this function call, but not before, I knew the corruption came from this pesky function. Turns out it was a weird typo that was really sneaky, just barely causing an off-by-one error.

See if you can spot the error below. Here’s the answer (text is white, so highlight it to read): on line 8 chromSize needs to be changed to i.

Good If Statements – Self Documenting Code

Here’s a quick trick for documenting those pesky if-statements, especially when littered with complicated mathematics.

Here’s an example from Timothy Master’s “Pragmatic Neural Network Recipes in C++” where the author needed to test some interval bounds to see if the conditions were appropriate to perform a parabolic estimate of a step-size. Sounds, and looks, complicated:

In my opinion this author writes extremely good code. In the above example the details of the code are very well documented, however that’s just due to wonderful (albeit dubiously placed) comments. When modified slightly the code can still maintain the same readability and perhaps gain a little clarity:

The trick is to just name those little integer values that collapse down to a 1 or 0 (the comparison operators in C, like && and ||, take their arguments and return a 1 or 0). One of the most common applications of this kind of naming is when there’s a complex end condition in an algorithm. Often times this kind of termination criterion can just be called “done”:



Circular Orbits | Planetary Motion | NBody Simulation

I took a few hours yesterday to start on a simulation for a job application. Part of the simulation involves trying to figure out how to create stable nbody orbits, at least stable for some amount of time to look interesting.

Turns out that the mathematics beyond constructing a circular orbit of just two bodies is a bit far-fetch’d. We can see in this link that solving for the velocity of each planet is very straightforward. Let me take a screenshot of the final equations here:


To me this makes a lot of sense and matches my own derivation, albeit with a slightly different final form (just due to some algebra differences). I plug in the equation into my simulation and hit run, however there’s just not enough velocity to keep an orbit. See for yourself (the simulation is slowed down a lot compared to later gifs):


Each white circle is a body (like a planet) and the red circle is the system’s barycenter.

So I tinkered a bit and found out the velocity is too dim by a factor of sqrt( 2 ). My best guess is that the equations I’m dealing with have some kind of approximation involved where one of the bodies is supposed to be infinitely large. I don’t quite have the time to look into this detail (since I have much more exciting things to finish in the simulation), so I just plugged in my sqrt( 2 ) factor and got this result:


Here’s the function I wrote down to solve for orbit velocity in the above gif:

I’m able to run the simulation for thousands of years without any destabilization.

If any readers have any clue as to why the velocity is off by a factor of rad 2 please let me know, I’d be very interested in learning what the hiccup was.

Next I’ll try to write some code for iteratively finding systems of more than 2 bodies that have a stable configuration for some duration of time. I plan to use a genetic algorithm, or maybe something based off of perturbation theory, though I think a genetic algorithm will yield better results. Currently the difficulty lies in measuring the system stability. For now the best metrics I’ve come up with is measuring movement of the barycenter and bounding the entire simulation inside of a large sphere. If the barycenter moves, or if bodies are too far from the barycenter I can deem such systems as unstable and try other configurations.

One particularly interesting aspect of the barycenter is if I create a group of stationary bodies and let them just attract each other through gravity, the barycenter seems to always be dead stationary! The moment I manually initialize a planet with some velocity the barycenter moves, but if they all start at rest the barycenter is 100% stable.


A Star

Recently I’ve found a nice piece of pseudo code for implementing A Star after searching through a few lesser or incorrect pseudo code passages:

And here is the Pop function pseudo code (written by me, so it probably has a small error here or there):

My favorite thing about the pseudo code is that the closed list can be implemented with just a flag. The open list becomes the most complicated part of the algorithm. Should the open list be a sorted array, an unsorted array, a binary heap? The answer largely depends on much more memory you need to traverse.

If a small portion of memory needs to be searched all dynamic memory can be allocated up-front on one shot. Otherwise bits of memory should probably be allocated up-front, and more as necessary during the algorithms run.

Just yesterday I implemented AStar in C where my full header file looked like:

In my internal C file I only have < 150 lines of code, including some file-scope variables and math functions. Implementation was nearly a one to one transcription of the above pseudo code (so those who are recursion impaired like myself shouldn’t have any problems). This style may not be thread-safe, but hey, most AI related coded can only be done in serial anyway. My grid size was maxed out at 20×15, so pre-allocating memory for your entire search area may not be so practical as it was for me.

Still, I hope this post can provide some bits of intuition that are useful to someone.

Freelist Concept

A freelist is a way of retrieving some kind of resource in an efficient manner. Usually a freelist is used when a memory allocation is needed, but searching for a free block should be fast. Freelists can be used inside of general purpose allocators, or embedded directly into an optimized algorithm.

Lets say we have an array of elements, where each element is 16 bytes of memory. Our array has 32 elements. The program that arrays resides in needs to request 16 byte elements, use them, and later give them back; we have allocation of elements and deallocation of elements.

The order the elements are allocated is not related in any way to order of deallocation. In order for deallocation to be a fast operation the 16 byte element needs to be in a state such that a future allocation be handed this element to be reused.

A singly linked list can be used to hold onto all unused and free elements. Since the elements are 16 bytes each this is more than enough memory to store a pointer, or integer index, which points to the next block in the free list. We can use the null pointer, or a -1 index to signify the end of the freelist.

Allocating and deallocating can now look like:

Setting up the memory* will take some work. Each element needs to be linked together somehow, like through a pointer or integer index. If no more elements are available then more arrays of size 32 can be allocated — this means our memory is being managed with the style of a “paged allocator”, where each array can be thought of as a page.

The freelist is an important concept that can be embedded ad-hoc into more complex algorithms. Often times it is important for little pieces of software to expose a very tiny C-like interface, usually just a function or two. Having these softwares self-contain their own internal freelists is one way to achieve a simple interface.

Example of Hiding the Freelist

For example say we are computing the convex hull of a point-set through the Quick Hull algorithm. The hypothetical algorithm exposes an interface like this:

This QHull function does no explicit memory allocation and forces the user to allocate an appropriate amount of memory to work with. The bounds of this memory (how big it needs to be for the algorithm’s worst case scenario) is calculated by the ComputeMemoryBound function.

Inside of QHull often times the hull is expanded and many new faces are allocated. These faces are held on a free list. Once new faces are made, old ones are deleted. These deleted faces are pushed onto the free list. This continues until the algorithm concludes, and the user does not need to know about the details of the embedded memory management of the freelist.

Convex hull about to expand to point P. The white faces will be deleted. The see-through faces will be allocated.

A convex hull fully expanded to point P. All old faces were deleted.

The above images were found at this address:


Parsing C Style Expressions

Turns out that constructing a hand-written C-style parser has a few parts that were very difficult for me.

The first thing was realizing that Backus Naur Form (BNF) largely sucks if you want to hand-write your own parser. BNF is really verbose and expressing simple things like optional terminals or lists is difficult. BNF is also poor for expressing operator precedence, as many intermediate and redundant non-terminals are required to be evaluated during parse-tree derivation. As an alternative Extended Backus Naur Form is perfect for languages that plan to use hand-written parsers instead of parsers created by parser generators. Left-factoring a BNF for LL parsing is also not very useful since handling infinite recursion with hand-written code is trivial.

The second thing is that parsing expressions with various types of operators can be really difficult, especially if there’s a lack of confidence in recursion (like myself). Creating a parse tree given a string representing an expression is a very recursive problem.

In C expressions consist of atoms and operators. An atom would be a literal, constant value, identifier, or an expression wrapped in parentheses. Operators are the usual + or – kind of tokens.

If each operator has an associated precedence value there are a few different algorithms out there with references for learning. I ended up face in the dirt and in the end derived what is known as “precedence climbing“. According to Eli Bendersky precedence climbing is what is currently used by Clang to parse C++ expressions! That ought to instill some perceived merit. From what I can tell Lua 5.3 uses (well, very close to) the same method.

The idea of precedence climbing is to think of two major recursive operations:

  • Compute righthand-side node
  • Make a binary operator node and connect lefthand-side and righthand-side children nodes

The first point is the complex one (that is, conceptually complex). The algorithm starts given a lefthand-side node, however, righthand-side nodes do not come in through the input stream in tree format; the next token can represent a node that should be much deeper in the tree — this means that computing the righthand-side node ought to be the main recursive path.

Realizing that the righthand-side node computation is the recursive path led me to notice a key observation that tipped me off to a working algorithm.

Say we have the following input string as an expression: A 2 B 1 C 4 D 3 E 7 F

Numbers are operators, and the number itself is precedence (higher number is higher precedence), letters are atoms (like a const int variable). Here’s the valid parse tree:


The lowest leaves are evaluated first. It’s easy to see that the tree itself encodes the operator precedence.

If we begin parsing our input string the first atom is A, which will always be a lefthand-side node for most any parsing algorithm used, and will likely be the left-most node in the tree. The next token is the 2 operator followed by B. It’s easy enough to construct the subtree of node 2 pointing to A and B.

The next input token is the operator 1 and atom C. C is bound by operator precedence to the operator 4, though the current state of the algorithm has yet to even read in the token 4. Studying this scenario is what tipped me off to a working solution; C must be treated as a lefthand-side node, though at the current state is considered a potential righthand-side node.

Wikipedia, and this link from earlier, both show great pseudo code for the precedence climbing algorithm. The main difference between the two links is wikipedia includes a nested for-loop in favor of less overall recursive calls. My own code ended up looking something like after I cleaned it up from influences of previous links:

In the end I’m quite happy with the result, and even hooked up a nice ascii-tree printer courtesy of a random stack-overflow user. Here’s a dot product and initialization trees in ascii:

My favorite part about the operator precedence climbing algorithm is how it handles parentheses and prefix unary operators: parentheses can be considered an atom, and when the atom function finds a parentheses is just calls the expression parsing function directly and returns the result! The same can be done for prefix unary operators (if they have really high precedence). The algorithm also trivially handles right-associativity. I haven’t yet thought about unary postfix operators, so if any reader has thoughts on this topic please do comment!

Here’s psuedo-y atom code:


Capsule to Convex – SAT Edge Prune via Gauss Map

In 2013 Dirk Gregorius of Valve presented at GDC on the topic of the Separating Axis Theorem. In his talk he studied an useful property of the Minkowski Difference between two 3D convexes: edge pairs from each shape may or may not contribute to the convex hull of the Minkowski Difference.

It is possible to derive simple predicate functions to reduce the number of edge pair queries, and to simplify implementation, during collision detection via the Separating Axis Theorem. This article shows a derivation of this predicate for the Capsule to Convex case.

Here is the PDF containing the derivation:

Download (PDF, Unknown)

SIMD – Matrix3x3 Transpose

Just recently I finished implementing my own personal SIMD math library using SSE intrinsics. There are two major resources for learning how to write effective SIMD code:

While inspecting the DirectXMath source I came across the implementation of transposing a 4×4 matrix:

Lately I have been working only with 3×3 matrices and vectors. This is nice since often times 4×4 matrices store mostly useless data in the bottom row. In effect some kind of 3×4 matrix can be stored in memory to represent an affine transformation:

Depending on what the code is used for the rotation matrix can have scaling built in, or not. Often times only uniform scaling is desired so that chains of transformations can easily be reversed an decomposed freely.

Since I’m only dealing with 3×3 matrices I decided to cut down on the number of shuffles as best I could, and ended up with this implementation:

Only 5 shuffles are used here instead of the 8 from DirectXMath for the 4×4 transpose. I did not really take care of handling the w component of any of the __m128’s during the whole process. In general I just left the shuffles for w as 0.

I really don’t think another shuffle can be removed in the 3×3 case, so any further optimizations would probably be outside my realm of knowledge. If anyone knows of anything else interesting as far as transposition goes feel free to comment below.

Note: On Windows if anyone is wondering why my function does not incur a compiler error complaining about parameter alignment, be sure to lookup __vectorcall for Visual Studio 2013.