Tutorial: Performance Engineering for Machine Learning and Scientific Computing


If you do machine learning, you’re probably familiar with the pain of waiting a long time for your code to run. This post is about making your code faster.

Specifically, I’m going to talk about performance engineering on CPUs with an emphasis on ideas most relevant for machine learning and data mining. GPUs are of course also important, but I won’t address them since:

  1. If I did, this post would be too long
  2. Many of the principles are the same
  3. If you use TensorFlow, PyTorch, etc (which you hopefully do), most of the details relevant for performance tuning are beyond your control anyway

For brevity and readability, this post contains many simplifications. For a more exhaustive treatment, see the slides for MIT’s excellent performance engineering class.

When and What to Optimize

Most of the time, you shouldn’t be optimizing for performance. And when you actually do have a performance problem, step 1 isn’t making everything faster—it’s figuring out what part of your code is slow. Only once you know what the bottlenecks are should you spend time optimizing. See this discussion of the classic maxim, “Premature optimization is the root of all evil.”

If you’re coding in C/C++, your IDE probably has good tools for performance monitoring. If you’re using Python, LineProfiler can be invaluable. If you’re using MATLAB, stop.1
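
For example, with LineProfiler (the line_profiler package), you decorate the functions you care about and run your script under kernprof. A minimal sketch, with illustrative file and function names:

# profile_me.py -- run with:  kernprof -l -v profile_me.py
import numpy as np

@profile                      # injected into builtins by kernprof
def training_step():
    X = np.random.randn(2000, 2000)
    U, s, Vt = np.linalg.svd(X)      # probably the line the profiler flags
    return s.sum()

if __name__ == "__main__":
    training_step()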

Computer Architecture in Three Paragraphs

Performance engineering is all about using the hardware more effectively. Unfortunately, if you come from a non-EECS background, you might not know much about how computers work. Here’s a quick overview.

Computers have a few basic operations they know how to perform—adding two numbers together, loading in data from memory, computing the logical OR of bits, etc. These operations are done in a section of the chip called the Arithmetic Logic Unit (ALU). The basic loop the computer carries out is:

  1. Load data from memory into the ALU
  2. Do some operations to that data
  3. Store the results back in memory

For code to be faster, the computer has to spend less time doing one or more of these steps.

Memory Hierarchy

As you might have noticed, 2/3 of the above steps involve memory. Consequently, the processor does everything it can to make the memory fast. There are a lot of low-level details that go into making it fast in general, but the trick that matters most for programmers is caching.

Caching is when the processor stores data you’re likely to access in a special segment of memory called the cache. This memory uses faster hardware, but has to be smaller because:

  1. It requires more chip area per bit stored.
  2. The bigger it is, the farther it is (on average) from the computation logic. This matters because signals can’t travel across the chip as fast as we would like.

Modern processors don’t just have one cache. They have several, each offering a different size/speed tradeoff. These tradeoffs, along with the relevant statistics for other storage media, are shown in Table 1. Registers, described in the first row, are the temporary storage that the ALU operates on directly.

Level         Size    Latency     Throughput
Register      100B    None        N/A
L1 Cache      64KB    4 cycles    2 64B loads/cycle
L2 Cache      256KB   10 cycles   2 cycle read*, 5 cycle write
L3 Cache      10MB    40 cycles   5 cycle read*, 10 cycle write
Main Memory   10GB    100 ns      10 GB/s
SSD           1TB     100 us      500 MB/s
Disk          1TB     10 ms       100 MB/s

* “Read” is defined as “load into L1 cache”

A few notes:

  • These numbers are approximate and vary from chip to chip
  • The above values have been rounded/approximated significantly for ease of recollection.
  • The latencies and throughputs are for random access; memory will be about 2x faster if read sequentially, and disk will be much faster.
  • The main memory, SSD, and disk are mostly independent of the processor, so the above are ballpark figures for comparison.
  • The L1 cache is often split so that half is reserved for storing code, and half for storing “data”, defined as anything that isn’t the code. This means that it effectively stores at most 32KB of parameters, variables, etc.

When the processor needs to load data, it first checks whether it’s in the L1 cache; if not, then it checks the L2 cache; if it’s not there, it checks the L3 cache, and so on. Finding the data in any cache is called a cache hit, and failing to find it is called a cache miss. Finding or failing to find it in a particular cache level N is called an LN cache {hit,miss} (e.g., “L1 cache hit”, “L2 cache miss”).

You can’t control what data resides in which level of cache directly, but the processor uses a few simple heuristics that generally work well. They’re also simple enough that we can structure our code to help them work even better.

The main heuristic is that, when you access data, it probably gets loaded into L1 or L2 cache, and when data gets kicked out of a given cache level (“evicted”) to make room for other data, it probably gets moved to the next level down. This means that you can keep much of your data in a given cache level by only operating on an amount of data that fits in that level. E.g., you could operate on <32KB of data to keep most of it in L1 cache.
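
As a rough illustration (exact numbers and cache sizes vary by machine), the sketch below does the same total amount of in-place work on a small array that fits in cache and on a large one that doesn’t; the small working set typically shows a few times higher effective bandwidth because it stays cache-resident:

import time
import numpy as np

def effective_bandwidth_gb_per_s(size_mb, total_mb=2048):
    a = np.ones(size_mb * 1024 * 1024 // 8)   # float64 array of size_mb MB
    reps = total_mb // size_mb                # touch the same total data either way
    start = time.perf_counter()
    for _ in range(reps):
        a *= 1.000001                         # reads and writes every element
    return reps * a.nbytes / (time.perf_counter() - start) / 1e9

print("4MB working set:   %.1f GB/s" % effective_bandwidth_gb_per_s(4))
print("512MB working set: %.1f GB/s" % effective_bandwidth_gb_per_s(512))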

Other heuristics are discussed below.

Cache Lines

Part of the challenge that caching hardware faces is keeping track of what data is in which cache. E.g., with a 256KB cache, there are roughly 256,000 individual bytes whose presence it has to track at any given time.

To make this easier, the CPU doesn’t allow individual bytes to be cached—it instead operates on cache lines, which are blocks of 64 adjacent bytes. Entire cache lines are moved into and out of cache levels as necessary.

Grouping bytes is also a heuristic to improve performance, thanks to locality of reference—i.e., if your program just accessed byte 974, it’s likely to access byte 975 (or other bytes near there) in the immediate future. Consequently, by loading all the bytes around byte 974 when you ask for 974, the processor can probably avoid some cache misses.
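
As a rough illustration (timings vary by machine), both sums below read the same number of float32 values, but the strided view touches a fresh 64-byte cache line for every single value, so it typically runs several times slower; try timing each line with IPython’s %timeit:

import numpy as np

a = np.ones(64_000_000, dtype=np.float32)   # 256MB, far larger than any cache
contiguous = a[:4_000_000]                  # 4M adjacent floats (16MB)
one_per_line = a[::16]                      # 4M floats spaced 64 bytes apart
contiguous.sum()
one_per_line.sum()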

Prefetching

Another important heuristic the processor uses is prefetching, which can roughly be stated as:

“If the program just accessed cache lines (i, i+1, i+2) in that order, load line (i+3) into cache immediately.”

The details of how many successive accesses are required and how far ahead the prefetching happens will vary, but the point is that the processor will notice when you scan through data in memory order and will be able to avoid cache misses when this happens.

It can also notice when there are multiple sequences of accesses at once; e.g., if you have code that looks like:

for (int i = 0; i < 100*1000*1000; i++) {
    c[i] = a[i] + b[i];
}

The processor can prefetch the memory for all three arrays a, b, and c. Note that this only works for a few sequences at a time (often less than 7). This means that if we needed four more arrays d, e, f, and g for this computation, we would probably be better off splitting it into two loops.

If you’re doing extremely low-level programming and you feel like suffering, you can manually tell the processor to prefetch using builtins such as GCC’s __builtin_prefetch(). I’ve never gotten a performance boost from this, but in theory it’s possible.

Storage Order

2D arrays can be stored in row-major (‘C’) or column-major (‘F’/‘Fortran’) order. The former means that elements within the same row are contiguous in memory, while the latter means that elements in the same column are contiguous in memory. For arrays with more than two dimensions, row-major means that later dimensions are stored closer together (with the final dimension contiguous), while column-major means that earlier dimensions are stored closer together.

Storage order matters because it affects whether we get the benefits of prefetching and cache lines loading nearby elements.

If you iterate through array elements that aren’t contiguous, you’ll suffer two penalties:

  1. Each cache line will only store one array element you care about. As a result, you’ll effectively have a much smaller cache.
  2. You’re unlikely to trigger prefetching.

In the case of huge arrays, traversing non-contiguous elements can even result in cache thrashing—i.e., repeatedly cache missing because new data kicks the old data out.

Fortunately, any sane array library will take storage order into account when carrying out its functions, so you typically don’t have to worry about this. The two most common cases where it might matter are:

  1. Slicing an array. If you want to examine a subset of the rows, it’s probably better if the array is in row-major order. Vice-versa for a subset of the columns.

  2. Iterating through array elements in a compiled language. You really want to scan through the elements in memory order. E.g.:

    // e.g., using Eigen: a dynamically-sized, row-major 2D array
    Matrix<float, Dynamic, Dynamic, RowMajor> A(10000, 1000);

    // this way is good
    for (size_t i = 0; i < A.rows(); i++) {
        for (size_t j = 0; j < A.cols(); j++) {
            doStuff(A(i, j));
        }
    }
    // reversed the loops--this way is bad
    for (size_t j = 0; j < A.cols(); j++) {
        for (size_t i = 0; i < A.rows(); i++) {
            doStuff(A(i, j));
        }
    }

NumPy and TensorFlow2 use row-major order by default, while Fortran, MATLAB, Julia, and R use column-major. In Python, you can specify how to store a matrix using the order parameter of the np.array function, or with the np.ascontiguousarray() and np.asfortranarray() functions.
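
For example, here’s a small sketch of controlling storage order in NumPy (the shapes are arbitrary); the .flags attribute tells you which layout an array actually has:

import numpy as np

A = np.zeros((10000, 1000))              # row-major ('C') by default
B = np.zeros((10000, 1000), order='F')   # column-major
C = np.asfortranarray(A)                 # copy of A in column-major layout
D = np.ascontiguousarray(B)              # copy of B in row-major layout
print(A.flags['C_CONTIGUOUS'], B.flags['F_CONTIGUOUS'])   # True True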

Loop Tiling

One way to keep your data in cache is to split computations on large blocks of data into several computations on smaller chunks of data. E.g.:

A = -log(abs(B + C))

could be replaced with something like:

num_chunks = 10
chunk_size = len(A) // num_chunks  # integer division; assume len(A) % num_chunks == 0
for i in range(num_chunks):
    start, end = i * chunk_size, (i + 1) * chunk_size
    A[start:end] = -log(abs(B[start:end] + C[start:end]))

This will result in only loading chunk_size elements of each array at once, and then doing the addition, absolute value, log, and negation operations to each chunk in turn. Ideally, the chunks will fit in cache, even though the original arrays wouldn’t have. This is unlikely to work as written, since the intermediate results (e.g. B[start:end] + C[start:end]) will still be allocated in new memory, but if you wrote these intermediate results to preallocated storage (see below), this wouldn’t be a problem.
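
Here’s a sketch of how the tiled version might look with a preallocated scratch buffer, assuming A, B, and C are equal-length 1-D NumPy arrays; the chunk size is arbitrary and would be tuned so a few chunks fit in L1/L2, and the out= arguments are explained in the “Reusing Memory” section below:

import numpy as np

chunk_size = 4096
scratch = np.empty(chunk_size)           # allocated once, reused every iteration
for start in range(0, len(A), chunk_size):
    end = min(start + chunk_size, len(A))
    s = scratch[:end - start]
    np.add(B[start:end], C[start:end], out=s)
    np.absolute(s, out=s)
    np.log(s, out=s)
    np.negative(s, out=s)
    A[start:end] = s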

In practice, loop tiling is hard to get a performance increase from, since prefetching usually does an extremely good job when operating on large arrays. It’s more likely to help when writing lower-level code that can use it to keep everything in L1 cache.

Subtree Clustering / Block Storage

If you’re implementing a data structure in a low-level language, you want to ensure that elements are stored contiguously to the greatest extent possible. For linked lists or search trees, this means storing pointers to blocks of elements, not individual elements (see unrolled linked lists and B-Trees). For trees that need smaller numbers of children, you might want to use subtree clustering (see top of page 6).

Memory Allocation

When your program starts, it only has access to a little bit of memory, provided by the operating system. To create new objects, load data, etc, it has to ask the operating system for more. In C/C++, you do this using malloc/calloc or new. In a higher-level language, you generally just create objects and let the language make the underlying malloc/calloc calls for you.

The overhead of allocating memory isn’t terrible, but it’s often worth avoiding in critical sections of code. According to some benchmarks, allocating N bytes is about twice as expensive as copying N bytes.

Reusing Memory

Consider the following operation:

A = -log(abs(B + C))

It appears that this creates one array, A. In fact, it is likely to create 4 arrays to hold all the incremental values needed to get to A. I.e., it is probably interpreted as:

temp0 = B + C
temp1 = abs(temp0)
temp2 = log(temp1)
A = -temp2

To avoid these extra memory allocations, you can often give functions an “output” argument that tells them where to store their results. In Python, for example, the above can become:

A = np.empty(B.shape)  # allocate storage once
np.add(B, C, out=A)
np.absolute(A, out=A)
np.log(A, out=A)
np.negative(A, out=A)

This is less concise, but can be worth it if you find that allocation of temporary arrays is slowing your code down. Note that good C++ array libraries bypass this problem using expression templates.3

Small-object Allocators

One way to reduce the cost of memory allocation is to allocate one large block of memory and then hand out pieces of that block yourself. If the pieces can be of all different sizes, it’s unlikely you’ll be able to do this any more efficiently than just calling new/malloc normally. However, if many of the pieces are small and the same size (probably because they’re instances of the same class), it can be easy to keep track of them, and you can manage their storage with far less overhead than the operating system. This is the idea behind writing a small-object allocator.
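
As a toy sketch of the idea in Python/NumPy terms (all of the names here are made up): preallocate one big block up front and hand out fixed-size slots from a free list, instead of allocating a new small array for every request.

import numpy as np

class SmallBufferPool:
    def __init__(self, slot_size, num_slots):
        self._block = np.empty((num_slots, slot_size))  # one big allocation
        self._free = list(range(num_slots))             # indices of free slots

    def acquire(self):
        i = self._free.pop()
        return i, self._block[i]     # a view into the block; no new allocation

    def release(self, i):
        self._free.append(i)

pool = SmallBufferPool(slot_size=128, num_slots=1024)
idx, buf = pool.acquire()
buf[:] = 0.0                         # use the buffer
pool.release(idx)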

Pipelining and Superscalar Execution

Processors have the ability to do several different kinds of operations—adds, subtracts, loads from memory, logical ANDs, and dozens of others. Yet each instruction only uses one of these operations. This means that the vast majority of the computation hardware sits idle at any given time.

Or does it?

It turns out that modern CPUs are intelligent about looking ahead at the next few instructions and, whenever possible, executing more than one of them at a time when they use different operations. This is called superscalar execution. More concretely, the processor has 5-10 “execution ports”, each of which handles a certain set of operations. In the best case, all the ports are always busy, but this is never achieved in practice.

Another trick the processor uses is pipelining. The idea here is to split each operation into multiple sub-operations, and work on sub-operations for different operations in parallel.

If you’ve ever been to Chipotle, you already understand pipelining. They decompose the process of making a burrito into a “pipeline” of steps:

  1. Getting the tortilla or bowl
  2. Putting rice and meat on it
  3. Putting other ingredients on it
  4. Having you pay

Because they split it into different steps, each employee can just do one step and then pass it on to the next person. To handle high demand, they can have many employees shoulder-to-shoulder cranking out burritos. Replace employees with logic circuits and burritos with operations, and you have CPU pipelining.

The advantage of splitting work into chunks this way is that you can “parallelize” a sequence of serial operations by working on different chunks of each one simultaneously.

Conditional Branches

Continuing the Chipotle analogy, suppose you have the following situation. A group of people walk in and all get in line. The first one orders a chicken burrito, but his friends say, “if he likes the chicken burrito, we want that; otherwise, we want a veggie bowl.”

This is a problem. Why? Because we can’t start working on their orders until the first order is done. This means that most of the employees will just be standing around until the first order finishes, instead of assembling pieces of orders. This is called a pipeline “stall.”

In code, this situation happens whenever you have an if statement (the formal term for which is “conditional branch”). The processor doesn’t know whether to execute the if block or not until the value of the condition is known. Consequently, it doesn’t get to leverage its pipeline.

Or so it seems.

But consider the following twist to the Chipotle scenario. What if the employees knew that almost everyone likes their chicken burritos? In this case, they could just assume that the first person would like it, and go ahead and start making everyone else’s order immediately. They might be wrong, but in the likely scenario that they weren’t, they will have spent no time idle. And if they were wrong, it might cost them some ingredients, but they won’t have lost any more time than they would have by waiting.

CPUs employ this exact strategy in the form of “branch prediction.” Whenever they encounter an if statement, they guess whether it will be true or not and start executing the corresponding instructions. In other words, what hurts you isn’t having conditional branches—it’s having conditional branches that aren’t predictable.

What makes a branch predictable? CPUs predict branches in two ways:

  1. Heuristics. CPUs assume that what’s in the if block will happen, and what’s in the else block, if one exists, won’t happen.
  2. Once they’ve seen a particular if statement, they remember what happened the last few times they encountered it and predict the same thing will happen this time.4 This is by far the more common case.

So the takeaway is that it’s not terrible to have an if statement in a critical section of code, as long as its condition tends to be the same many times in a row.

Loops

A special case of the conditional branch is the loop. Programming languages usually hide this, but loops boil down to checking whether the termination condition is met at the end of the loop body and going back to the top of the body if not. I.e., they’re all secretly while loops.

You can’t usually avoid having a loop, but, fortunately, they tend to be inexpensive for two reasons:

  1. Branch prediction for loops is extremely good. Just assuming that the loop happens again is correct (n-1) out of n times for an n-iteration loop. Furthermore, Intel chips (and likely others) can even avoid being wrong at the very end for n <= 64, if n is known when the loop begins.
  2. Loops with small, constant numbers of iterations are often “unrolled” by the compiler. I.e., it removes the loop and just pastes in the loop body the appropriate number of times. This has the added benefit of removing the conditional branch logic entirely, not just predicting its outcome correctly.

Another consequence of point 2 is that loop tiling with small, fixed-size blocks can often improve performance when using a compiled language. E.g.:

std::vector<int> a = some_array();  // some_array() assumed to return a std::vector<int>
int length = (int)a.size();
const int chunk_size = 4;
int num_chunks = length / chunk_size;  // assume length % chunk_size == 0
for (int c = 0; c < num_chunks; c++) {
    for (int i = 0; i < chunk_size; i++) {
        a[c * chunk_size + i] += 42;
    }
}

Without loop unrolling, the inner loop really says:

    i = 0;
loop_start:
    a[c * chunk_size + i] += 42;
    i++;
    if (i < chunk_size) {
        goto loop_start;
    }

With loop unrolling, it says:

a[c * chunk_size + 0] += 42;
a[c * chunk_size + 1] += 42;
a[c * chunk_size + 2] += 42;
a[c * chunk_size + 3] += 42;

which has eliminated all of the loop overhead and should run much faster.

Note that all of this only applies in a compiled language. In a scripting language, evaluating conditionals is more a matter of running code in the interpreter than stalling pipelines. Much the same holds for basically everything else in the section.

Store-to-load Forwarding

One final note about pipelining is that it sometimes lets you avoid touching even the cache when operating on the same data repeatedly. This is done through store-to-load forwarding, which basically means that output from one operation in the pipeline can be fed directly into another operation earlier in the pipeline.

Compiler Intelligence and Front-End Lookahead

Many modern processors will rearrange the sequence of instructions to allow better scheduling in the pipeline. Compilers will also rearrange your instructions to avoid unnecessary work and to get them in a decent order for the processor. As a result, it’s rarely worth it to think about how you order your lines of code. For example, it’s fine to do something like:

int M = 20;
for (int i = 0; i < N; i++) {
    int constant = M * M;
    a[i] = constant + i;
}

You might worry that it will compute M * M N times, but any sane compiler will pull that line out of the loop. The exception would be if the constant were determined by a function call and the compiler couldn’t tell whether calling the function has side effects.

SIMD Instructions

If you’ve written code in a scripting language like Python or R, you’ve probably heard about the importance of using “vectorized” operations instead of loops.

Part of why these vectorized operations are faster is that they hand the work to precompiled, lower-level code, which avoids the overhead of the scripting language’s interpreter and abstractions (e.g., Python ints and floats are reference-counted PyObject pointers, not raw numbers).

The other part is that it can use more efficient sequences of operations to do a given computation. Some of this is generic compiler optimizations such as common sub-expression elimination, but a good deal of it in scientific computing is the use of vector instructions.

Vector instructions (often called SIMD instructions) are special CPU operations designed for number crunching. They carry out the same operation on groups of adjacent numbers simultaneously. E.g., using x86 AVX2 instructions, one can replace:

for (int i = 0; i < 8; i++) {
    a[i] = b[i] + c[i];
}

with

__m256 b_vect = _mm256_loadu_ps(b); // load 8 floats from b
__m256 c_vect = _mm256_loadu_ps(c); // load 8 floats from c
__m256 sums = _mm256_add_ps(b_vect, c_vect); // sum the floats
_mm256_storeu_ps(a, sums); // store the results (unaligned, to match the loads)

This replaces 16 loads, 8 additions, and 8 stores with 2 loads, 1 addition, and 1 store, offering a huge speedup. In case you’re curious, the m256 means 256-bit (the width of 8 floats) and the ps means “packed single-precision”.

You likely won’t be invoking SIMD instructions directly, so here are tips for making sure that your code can leverage them:

  1. Set array dimensions to multiples of 32 bytes, which corresponds to 8 floats or 4 doubles. SIMD instructions usually operate on either 16B or 32B of data at once, so this ensures that operations can be done entirely using SIMD instructions (see the padding sketch after this list).
  2. Store your data contiguously. SIMD instructions need the values they operate on to be adjacent in memory.
  3. If using a compiled language, try to use arrays of sizes that are either large or compile-time constants. This makes it more likely that the compiler will use SIMD instructions to operate on them.
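
As a sketch of the first tip, here’s a hypothetical helper (the function name and shapes are made up) that zero-pads the last dimension of a float32 array up to a multiple of 8 values, i.e., 32 bytes:

import numpy as np

def pad_last_dim(X, multiple=8):
    d = X.shape[-1]
    padded = multiple * ((d + multiple - 1) // multiple)   # round up
    if padded == d:
        return X
    pad_widths = [(0, 0)] * (X.ndim - 1) + [(0, padded - d)]
    return np.pad(X, pad_widths)                           # zero-pads by default

X = np.random.rand(1000, 37).astype(np.float32)
X_padded = pad_last_dim(X)   # shape (1000, 40)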

Threads, Hyperthreads, and Coroutines

A tutorial on concurrency is beyond the scope of this post, so we’re just going to go over some basics to give you a feel for what threads, etc, can and can’t do for you. The main lessons are that:

  1. There are often fundamental limits to what parallelism can buy you.
  2. It often takes an enormous amount of data for multithreading (let alone using multiple machines) to be worth the overhead.
  3. Unless your problem is embarrassingly parallel and/or you have excellent abstractions, writing concurrent code will probably cost you a great deal of time debugging.

Definitions

Core. You’ve probably heard of computers having, say, a “four-core” processor. Each core is basically its own processor, except that the cores share the same memory (and sometimes L3 cache).

Threads are operating-system constructs that represent sequences of instructions to be executed. Each core has support for running one thread at a time, in the form of a hardware thread. The operating system can have many “OS threads” that take turns using the one hardware thread. Starting a new OS thread is an expensive operation, so it’s often better to use a thread pool rather than start up new threads for new tasks (see the sketch at the end of this section).

Hyperthreads. Remember how I just said each core has one hardware thread? I lied. Sort of. Cores on modern processors often have two hyperthreads instead of one regular hardware thread. The idea of hyperthreads is to feed two sequences of instructions to the same ALU. If neither sequence would be able to utilize half the ALU’s throughput (which is common since waiting for memory often takes most of the time; see above), this lets the hardware effectively execute two threads at once. Since the low utilization condition doesn’t necessarily hold, having a hyperthread is not as good as having a full core to yourself. This comes up when you use a cloud computing service—one EC2 “vCPU,” for example, is just one hyperthread, not one full core.

Coroutines, also called “lightweight threads,” are thread-like objects implemented in software on top of OS threads. They have far less overhead than OS threads (e.g., you can switch between them in <1000 cycles), and so can substitute for using a thread pool. If you’re familiar with Go, its Goroutines are basically coroutines. If you’re not using Go or doing systems programming, you probably don’t need to think about coroutines.
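
Here’s a minimal thread-pool sketch using Python’s concurrent.futures (the chunking and the function are illustrative). In CPython, pure-Python bytecode is serialized by the GIL, but NumPy releases the GIL inside large operations, so chunked numeric work like this can still benefit from a pool of OS threads:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_chunk(chunk):
    # NumPy releases the GIL inside these calls, so chunks can run in parallel
    return np.log(np.abs(chunk) + 1.0).sum()

chunks = np.random.randn(8, 2_000_000)        # 8 chunks of work
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_chunk, chunks))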

Calling C/C++ code

The best way to call low-level code from a high-level language is by using a well-maintained library, such as NumPy. Sometimes, though, there’s a critical section of code that can’t be implemented efficiently enough using these libraries. At other times, you may have low-level code of your own and want to create an easy interface in a higher-level language. In both cases, you have a few options:

  1. Language-specific FFI. E.g., in Python, this is ctypes (see the sketch after this list). In MATLAB, this is MEX files.
  2. Cython. This is a language that basically lets you write typed Python code and compile it to an ordinary Python module.
  3. SWIG. This takes in a set of C/C++ files and automatically generates wrapper functions and classes in any language you would probably care about except MATLAB or Julia.
  4. Weave. This lets you write C code in the middle of your Python code.
  5. Boost.Python. A C++ library that lets you expose your C++ code as a Python module.
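
As a sketch of option 1, here’s roughly what the ctypes route looks like. It assumes you’ve already compiled a shared library libfast.so exposing void scale_inplace(float *x, int n, float factor); both the library and the function are made up for illustration:

import ctypes
import numpy as np

lib = ctypes.CDLL("./libfast.so")   # hypothetical compiled library
lib.scale_inplace.argtypes = [ctypes.POINTER(ctypes.c_float),
                              ctypes.c_int, ctypes.c_float]
lib.scale_inplace.restype = None

x = np.arange(8, dtype=np.float32)
lib.scale_inplace(x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                  x.size, 2.0)      # scales x in place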

If you aren’t using Python, your options are probably either SWIG or your language’s FFI. I would definitely opt for SWIG if you have to wrap a lot of code, and/or if the code changes regularly, but likely the FFI if you just need something simple.

If you are using Python, I personally just use SWIG for everything, but I also like Cython. I haven’t used Weave but I’ve heard good things. If you want to use SWIG, you should use my example project, because otherwise you will suffer greatly.

Bit hacks

There are often sneaky ways to perform computations that are faster than the obvious ways. For example, if you want to compute the minimum of two integers, the obvious way would be:

int min(int x, int y) {
    if (x < y) {
        return x;
    }
    return y;
}

This has a conditional branch, and not even a predictable one. A better way is:

int min(int x, int y) {
    return y ^ ((x ^ y) & -(x < y));
}

where ^ denotes exclusive or. (See lecture 3 of 6.172 for an explanation of why this works.)

When writing C/C++, the compiler is probably better at these sorts of tricks than you are, so inserting them manually may not be a good idea. In interpreted languages, however, these tricks are often helpful.

See here and here for the best lists of bit hacks and similar low-level tricks.

Other Topics and Resources


1: Although I have to admit that the built-in profiler is great.

2: If you’re running code on a GPU, the storage order considerations are different, since GPUs are less dependent on caches and can parallelize many more operations at once.

3: Also the most effective means I know of ensuring that straightforward-looking code will subtly screw you over.

4: It’s actually more complicated than this, but what I described is a good approximation of what happens.
