SimonHF's Blog


Concurrency, performance, threads, and locking brain dump December 3, 2012

As a ‘hard-core’ C developer with over 30,000 hours of hands-on development experience, I feel qualified to comment on concurrency and performance issues relating to ‘threads and locks’. So here goes my brain dump, dividing the territory up into 7 common problem areas:

Problem #1: The hidden concurrency in single-threaded algorithms, *or* How developers think memory operations & mechanisms like bsearch() are fast:

It’s almost as if, by starting to think about concurrency in terms of threads and locks, we’re jumping the gun. Because *before* even considering the problems of multi-threading, there’s a different concurrency war going on behind the scenes in your single-threaded program! Your higher level code might look single-threaded, but at run-time it comes down to assembler op codes being executed and the associated memory accesses. Often our intention is to use concurrency in order to achieve run-time performance, in which case it’s critical for the developer to understand how memory access works even in a single-threaded program. Why? Believe it or not, it’s easily possible for an inexperienced developer to write a concurrent, multi-threaded algorithm which is way slower than a well-tuned single-threaded algorithm. Now imagine that much faster single-threaded algorithm made multi-threaded too. It’s going to be mega fast.

One common reason developers inadvertently make their single-threaded algorithm unnecessarily slow is the misconception that memory access is generally pretty fast. The reality is a bit different and more complicated. Why? Because memory is not as fast as you might think. Generally developers think memory is fast and disk is slow, so they try to access the disk as little as possible. This is a good rule of thumb. However, what very many developers don’t realize is that memory access time on modern CPUs has a split personality: uncached RAM is much, much slower than CPU-cached RAM. So much so that the time difference between accessing uncached RAM and cached RAM is analogous to the idea that one should try to minimize disk access in favour of memory access. So how is RAM cached? In simple terms, RAM is cached in ‘cache lines’, the size of which depends upon the CPU but is usually 64 bytes on modern 64 bit CPUs. In the old days, figuring out the speed of some assembler was as easy as determining and adding up the number of clock cycles for a series of op codes; memory access speed was always the same and therefore predictable. These days, op codes still have an individual execution cost associated with them, but the total time to execute is blurred by two factors: pipe-lining of instructions causing variable execution time, and *highly* variable memory access times depending upon how memory has been cached.

At this point I bet you’re thinking: okay, so a bit of caching goes on; big deal; it can’t make that much difference to the execution speed of my program; at most a few percent or 10% or something low like that. Well, you’re wrong, and an easy way to illustrate the problem is by using bsearch(). Yes, bsearch(). It’s easy to Google bsearch and find a whole bunch of developers willing to swear by their grandmother that it’s the fastest way to access a large chunk of data. The truth is that it might have been fast 10 years ago but not any more. And the reason it’s not fast is that if the data set being searched is much larger than the total cache line capacity of the CPU, then the bsearch() algorithm by its very nature touches much memory which isn’t currently cached in cache lines. Accessing even 1 byte of a 64 byte cache line causes the entire 64 byte cache line to be expensively ‘flown in’ from RAM to the cache on the CPU.

How expensive is this cache line gathering process? As an example, consider a simple C program which creates a sorted array of 30 million equal sized records taking up about 1 GB of RAM. If we use bsearch() to look up all keys from 1 to 30 million in order, then on my laptop’s Intel i7 CPU about 8 seconds of wall clock time go by. Fine, you say; 30 million is a lot of records and that sounds like good performance. However, now let’s look up the same 30 million keys but in random order. On the same CPU, about 40 seconds of wall clock time go by. That’s *five* times slower. Not a few percent slower. Not ten percent slower. But *five* times slower. Obviously this is a huge difference. And it’s all down to cache line exhaustion. In both experiments we accessed exactly the same amount of memory and looped the same number of times, but in experiment #2 the CPU had to have cache lines expensively ‘flown in’ far more often. As a reference, the same 30 million records can be accessed in the form of a hash table in under 5 seconds of wall clock time, in any random order. Why? Because fewer cache lines must be expensively ‘flown in’ 🙂

What is the moral of this story? There are several:
• Always be paranoid and question the status quo.
• Never assume that your algorithm is the fastest no matter how simple it looks, or that the only way to make it faster is to make it multi-threaded.
• When working with large data sets in memory, always be mindful of cache line exhaustion.
• Never try to measure small amounts of something; always performance test your algorithm working flat out, exercising 100% CPU, with realistic data access patterns, in order to determine the average tiny cost per iteration, e.g. 40 seconds divided by 30 million.
• Always try very hard to avoid multi-threaded code, because it’s evil in so many ways.

Problem #2: Turning a single-threaded algorithm into a locked multi-threaded algorithm:

In Problem #1 we patted ourselves on the back for optimizing the query algorithm so that instead of taking 40 seconds it now takes 5 seconds using a hash table. We did this by understanding the not-in-our-face cache line concurrency issues happening in the CPU even with a single thread. Now we want to make our hash table multi-threaded. Let’s say we have 8 physical CPU cores and want each core to be able to read and write hash table records. The problem is that we need to lock the hash table so that no two cores try to write to it at the same time.

Pitfall #1 (which, amazingly, I have seen even experienced developers fall into): Put as little code inside the lock as possible. For example, there’s no need to calculate the hash of the key inside the lock!

Pitfall #2: Use the hash of the key to reduce lock contention. For example, instead of locking the entire 1 GB hash table, have e.g. 100 x 10 MB hash tables, each with its own lock. Use the hash of the key to decide which key exclusively goes in which hash table. Now when accessing the hash tables concurrently, there’s a much bigger chance that a particular thread won’t block, because individual hash tables can be locked and accessed in parallel.

Pitfall #3: Avoid using fancy locks. Some developers are tempted to use fancy hybrid, spin-lock type constructs that do fancy operations, for example, not locking while only multi-threaded readers are accessing, but locking as soon as a single writer wants to write. These fancy locks sound good but are in fact very expensive to execute, even if no actual locking is performed, because multiple atomic assembler instructions must be used to implement the fancy algorithm.

Pitfall #4: Avoid using fancy lock-less algorithms. Some developers think that fancy lock-less algorithms are the way to get around having to use locks, and are therefore a kind of silver bullet. The problem is that all these lock-less algorithms rely on atomic assembler instructions, which are expensive to execute. Why? Because each one guarantees that a cache line must be expensively ‘flown in’ (see above), as well as doing a bunch of other expensive things like stalling the op code pipe-line.

Problem #3: Turning a locked multi-threaded algorithm into an unlocked multi-threaded algorithm:

Obviously using no locks is faster than using locks. Plus, using fancy lock-less algorithms is kind of cheating and also doesn’t deliver the ultimate performance that we’re looking for. So how do we turn our locked multi-threaded algorithm into an unlocked multi-threaded algorithm? One way is to let the kernel help us. Ultra high performance algorithms for reading and writing data often have many readers and relatively few writers. What if the readers could all read as much as they want, concurrently, *and* a writer could write as much as it wants, concurrently, *and* all without locks? Well, it turns out that this is possible. How? Everybody has heard about copy-on-write memory, but it’s mainly associated with fork()ing and multiple processes. What is little known is that it’s possible to do copy-on-write in a single process and even in a single thread! Let’s go back to our first bsearch() example with the 1 GB memory block. We can memory map that 1 GB memory block and use it for the concurrent readers. However, we can also memory map the same 1 GB to another address range as a copy-on-write copy of the first memory map. The second 1 GB memory map uses the same physical memory blocks as the first one and so takes up almost no extra physical memory. The concurrent writer can write to it, update multiple areas, and even apply a series of updates as an atomic transaction, all without disturbing the first memory map which the readers are happily reading from. When the writing is complete, the pointers to the memory maps get swapped and… tadaa! We have a multi-threaded and truly lock-less algorithm. A side benefit is that we can choose whether the memory map is backed by a file on disk or just by virtual memory; a file-backed memory map will survive our process restarting, which means that if the memory map is holding some sort of cache of data then it will be immediately ‘hot’ upon restart.

Problem #4: My (unlocked) multi-threaded algorithm works too fast for the network card:

After having taken the CPU cache lines into consideration, and after having removed lock contention using copy-on-write, we finally end up with monster performance which scales across all CPUs. The only thing is that this doesn’t really help us very much if we can’t speak to the outside world fast enough. Let’s say our multi-threaded algorithm can read and write at 10 million transactions per second with all CPU cores pinned at 100%… how does this help us if we’re deploying to a box which has a 100 Mbit or 1,000 Mbit NIC only capable of a few hundred thousand packets per second? And this is probably the type of NIC commonly available from a service like Amazon EC2. The truth is that unless your NIC is 10,000 Mbit then you probably don’t need a multi-threaded algorithm in the first place. It is even said that C gurus can write code which handles all the packets of a 10,000 Mbit NIC using a single thread; it depends upon your algorithm of course.

An exception to this is if you’re writing your cloud code in a language other than C. For example, node.js is fast to write but only relatively fast to run. A single-threaded node.js algorithm can easily be an order of magnitude slower than the same algorithm in C, mainly because in node.js the author has little control over the internal format of data structures and therefore over the efficiency of accessing them in terms of CPU cache line invalidation. You would have to hack node.js itself to take advantage of CPU cache lines and/or copy-on-write memory, which would be so complicated that you might as well use C in the first place. It’s a similar story for plain old Java and other higher level languages. This is also the main reason that the operating system itself and high performance software such as databases are generally written in C. This doesn’t mean you have to write everything in complicated C; just the generic, high performance components. Consider separating out the higher level business logic, the source code which will probably change most often, and writing it in a higher level language which leverages the generic, high performance components. If you want the higher level language to interface directly to C then think very carefully about which language to use. Most scripting languages have the ability to call C functions and vice-versa, but there can be enormous differences in speed when doing this. For example, some languages store their function names in an internal hash table, which means that if C calls a higher level function then the higher level language is going to do a hash table lookup for every call; expensive.

Problem #5: My unlocked multi-threaded algorithm is an efficient work of art but still under performing on the network:

Also of note, and you would have thought this problem was fixed long ago, is that it can’t be taken for granted that a particular operating system and NIC will operate as efficiently as expected. For example, your average 1,000 Mbit NIC will only deliver anything close to 1,000 Mbit if optimally sized, MTU-sized packets are being sent. Try to send 1,000 Mbits using smaller packets and watch system interrupt time go up, while through-put goes down to levels as low as 200 Mbit. This could be partly due to the NIC hardware, partly due to the NIC driver, and/or partly due to operating system network stack tuning. Fact is, you might only be able to tune it so high and no more. This is the point at which you might want to try a different NIC and/or a different kernel version. Always test the operating system / NIC performance independently from your concurrent code. As we have seen before, it may not even be necessary for you to make your code concurrent in order to fulfill the performance requirements.

Problem #6: My unlocked multi-threaded algorithm works amazingly on my 10,000 Mbit NIC on my 3 servers but took too long to develop:

Oh dear. You optimized everything so much and managed to develop something which is an order of magnitude or two faster than anything else available and it’s running on 3 beefy server boxes with high end NICs, but it took way too long to develop. All that CPU cache line analysis and lower level coding in C took much longer to develop than in other languages. Maybe it would have been financially better to develop everything in node.js which needs 30 servers instead of 3? This could well be the situation you find yourself in. Only your business model knows the answer to this conundrum. 27 extra servers could easily be much cheaper than paying more expensive C programmers to develop fiddly code for longer. However, if you’re expecting the business to grow e.g. 10 fold in a reasonable period of time then maybe it’s worth paying up front for the more complicated C code because suddenly the extra overhead of the C developers looks cheap compared to 300 – 30 = 270 extra servers for the node.js solution.

Problem #7: I’m told that the GalacticTurboNoSQLDB 2.0 is as fast as anything and the ultimate concurrent solution:

Don’t believe them! One solution says 10,000 transactions per second, while another says 100,000, and another says 1,000,000, and yet another says 10,000,000 transactions per second. Always evaluate performance by creating a real world performance / load test. Don’t be afraid to make the test as real as possible. For example, if you are expecting a million TCP connections to a particular server then have the test create a million TCP connections to each server; only then are we properly testing concurrency. Then ask yourself: could it be faster and/or use less memory and/or use less disk? Examine the CPU usage during the test. If the cores are not at 100% then maybe the algorithm is not optimal. If they are at 100% and it’s a network program then determine whether the maximum NIC through-put has been exhausted. If the NIC through-put has not been exhausted then there’s room for improvement. Once you have tested everything and compared all the test metrics and decided that concurrency and therefore performance is good, ensure that these metrics can be monitored 24/7 during live production. It may be that the GalacticTurboNoSQLDB 2.0 is blazingly fast for 7 hours and your performance test only lasted for 5 hours. Because GalacticTurboNoSQLDB 2.0 is written in Java, it seemed to work well on your 128 GB monster server until garbage collection kicked in and it went on a bender for half an hour 😦 When production metrics are found to no longer reflect the carefully crafted performance tests then carefully craft the performance tests a bit more!

Not the end of threads and locking concurrency issues, but the end of this brain dump.


libsxe, shared-memory, and parallel state-driven algorithms February 27, 2011

I recently came across the following paper: “Memory Models: A Case for Rethinking Parallel Languages and Hardware” by Sarita V. Adve and Hans-J. Boehm

The paper starts off:

The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current popular languages and systems that prevent achieving the vision of structured and safe parallel programming. This paper discusses the path to the above convergence, the hard lessons learned, and their implications. …

And then introduces the idea of “disciplined shared-memory models”:

Moving forward, we believe a critical research agenda to enable “parallelism for the masses” is to develop and promote disciplined shared-memory models that:
• are simple enough to be easily teachable to undergraduates; i.e., minimally provide sequential consistency to programs that obey the required discipline;
• enable the enforcement of the discipline; i.e., violations of the discipline should not have undefined or horrendously complex semantics, but should be caught and returned back to the programmer as illegal;
• are general-purpose enough to express important parallel algorithms and patterns; and
• enable high and scalable performance.

This is interesting because libsxe has a disciplined shared-memory model which goes a long way towards fulfilling the criteria above in the form of the sxe pool library. So what is a sxe pool and how does it offer us a disciplined shared-memory model?

The sxe pool library was invented for different reasons than offering a disciplined shared-memory model. A shared-memory option was added later as a pool construction option. In short, sxe pools offer a way to create C arrays of structs with the following generic benefits:
• The size of the array is persisted
• Each element of the array gets its own state which is persisted outside the element struct in a linked list
• Each element of the array gets its own timestamp which is persisted outside the element struct in a linked list
• Each element is accessed using regular & concise C code, e.g. myarray[myindex].mystructmember

The sxe pool library caller can manipulate the state & timestamp element properties using the following API:

sxe_pool_get_number_in_state(void * array, unsigned state)
sxe_pool_index_to_state(void * array, unsigned id)
sxe_pool_set_indexed_element_state(void * array, unsigned id, unsigned old_state, unsigned new_state)
sxe_pool_set_oldest_element_state(void * array, unsigned old_state, unsigned new_state)
sxe_pool_get_oldest_element_index(void * array, unsigned state)
sxe_pool_get_oldest_element_time(void * array, unsigned state)
sxe_pool_get_element_time_by_index(void * array, unsigned element)
sxe_pool_touch_indexed_element(void * array, unsigned id)

Converting the sxe pool library API to support shared-memory was relatively simple. The sxe_pool_new() function got an option to share the pool memory, and the API functions which change the pool element state use atomic assembler instructions if the pool was constructed as a shared-memory pool. It’s also interesting to note that sxe pools can be shared between processes as well as between threads in the same process. This is because the sxe pool library internal implementation avoids absolute pointers, which is also something that I encourage from libsxe developers and C developers in general.

This API is “general-purpose enough to express important parallel algorithms and patterns” and, most interestingly, is the same API whether the algorithm is threaded or not. It’s also “simple enough to be easily teachable to undergraduates”, or even junior developers, as we have found out at Sophos. The atomic sxe_pool_set_[indexed|oldest]_element_state() API functions “enable the enforcement of the discipline” by requiring both the old state and the new state of the array element; if the developer supplies the wrong old state then sxe pool will assert. Because the sxe pool library manages the element states itself, an assert is very unlikely when using a single pool. However, more complicated algorithms often make use of chains of pools in order to implement multiplexing and/or combining of parallel results, etc. In these cases, it is common to keep references to pool array element indexes and/or pool array element states in the caller supplied pool element structs. Finally, by implementing algorithms using the sxe pool API it is possible to “enable high and scalable performance” using a minimum of simple to understand C source code. The developer is forced into thinking about the algorithm as a state model, which often simplifies the hardest problems. And the generic part of the implementation complexity, e.g. locking, shared memory, keeping state, doubly linked lists, timeout handling, & memory allocation, is all handled by the sxe pool library and backed by automated tests with 100% library code coverage. The resulting performance is excellent, as can be seen from the figures published in earlier blog entries: tens of thousands of network requests per second per core.

As you can see, the sxe pool is an incredibly powerful and code saving and code simplifying generic data structure. It’s a sort of Swiss Army knife for parallel, event driven algorithms. In a future article I’ll show some of the implementation patterns.

