SimonHF's Blog

Just another WordPress.com site

Concurrency, performance, threads, and locking brain dump December 3, 2012

As a ‘hard-core’ C developer with over 30,000 hours of hands-on development experience, I feel qualified to comment on concurrency and performance issues relating to ‘threads and locks’. So here goes my brain dump, dividing the territory up into 7 common problem areas:

Problem #1: The hidden concurrency in single-threaded algorithms, *or* How developers think memory operations & mechanisms like bsearch() are fast:

It’s almost as if, by starting to think about concurrency in terms of threads and locks, we’re jumping the gun. Because *before* even considering the problems of multi-threading, there’s a different concurrency war going on behind the scenes in your single-threaded program! Your higher-level code might look single-threaded, but at run-time it comes down to assembler op codes being executed and the associated memory accesses. Often our intention is to use concurrency in order to achieve run-time performance, in which case it’s critical for the developer to understand how memory access works even in a single-threaded program. Why? Because it’s easily possible for an inexperienced developer to write a concurrent, multi-threaded algorithm which is far slower than a well-optimized single-threaded algorithm. Now imagine that much faster single-threaded algorithm made multi-threaded. It’s going to be mega fast.

One common reason developers inadvertently make their single-threaded algorithms unnecessarily slow is the misconception that memory access is generally pretty fast. The reality is more complicated. Developers tend to think memory is fast and disk is slow, so they try to access the disk as little as possible. That’s a good rule of thumb. However, what very many developers don’t realize is that memory access on modern CPUs has a split personality: uncached RAM is much, much slower than CPU-cached RAM. So much so that the time difference between accessing uncached RAM and cached RAM is analogous to the idea that one should minimize disk access in favour of memory access. So how is RAM cached? In simple terms, RAM is cached in ‘cache lines’, the size of which depends upon the CPU but is usually 64 bytes on modern 64-bit CPUs. In the old days, figuring out the speed of some assembler was as easy as determining and adding up the number of clock cycles for a series of op codes; memory access speed was always the same and therefore predictable. These days, op codes still have an individual execution cost associated with them, but the total time to execute is blurred by two factors: pipe-lining of instructions, which causes variable execution time, and *highly* variable memory access times, depending upon how the memory has been cached.

At this point I bet you’re thinking: okay, so a bit of caching goes on; big deal; it can’t make that much difference to the execution speed of my program; at most a few percent, or 10%, or something low like that. Well, you’re wrong, and an easy way to illustrate the problem is with bsearch(). Yes, bsearch(). It’s easy to Google bsearch() and find a whole bunch of developers willing to swear by their grandmother that it’s the fastest way to access a large chunk of data. The truth is that it might have been fast 10 years ago, but not any more. The reason it’s not fast is that if the data set being searched is much larger than the total cache line capacity of the CPU, then the bsearch() algorithm by its very nature touches a lot of memory which isn’t currently held in cache lines. Accessing even 1 byte of a 64-byte cache line causes the entire cache line to be expensively ‘flown in’ from RAM to the CPU’s cache.

How expensive is this cache line gathering process? As an example, consider a simple C program which creates a sorted array of 30 million equal-sized records taking up about 1 GB of RAM. If we use bsearch() to look up all keys from 1 to 30 million in order, then on my laptop’s Intel i7 CPU about 8 seconds of wall clock time go by. Fine, you say; 30 million is a lot of records and that sounds like good performance. However, now let’s look up the same 30 million keys but in random order. On the same CPU, about 40 seconds of wall clock time go by. That’s *five* times slower. Not a few percent slower. Not ten percent slower. *Five* times slower. Obviously this is a huge difference, and it’s all down to cache line exhaustion. In both experiments we accessed exactly the same amount of memory and looped the same number of times, but in experiment #2 the CPU had to have cache lines expensively ‘flown in’ far more often. As a reference, the same 30 million records can be accessed via a hash table in under 5 seconds of wall clock time, and in any random order. Why? Because fewer cache lines must be expensively ‘flown in’ 🙂
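
To make the experiment concrete, here is a minimal sketch of the kind of program described above: it builds a sorted array of 30 million fixed-size records (roughly 1 GB) and then bsearch()es every key, first in sequential order and then in random order, timing each pass. The record layout, key type, shuffle, and timing calls are my own assumptions for illustration, not the exact program behind the figures quoted above:

/* bsearch_cache_lines.c: sketch of the sequential versus random key order
 * experiment described above. Expect the random pass to be several times
 * slower than the sequential pass on a data set much bigger than the CPU
 * caches, because far more cache lines must be 'flown in' from RAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_RECORDS 30000000U

typedef struct RECORD {
    unsigned key;
    char     payload[28]; /* pad each record to 32 bytes; ~1 GB in total */
} RECORD;

static int compare_records(const void * a, const void * b)
{
    const RECORD * ra = a;
    const RECORD * rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

static double elapsed(clock_t since) { return (double)(clock() - since) / CLOCKS_PER_SEC; }

int main(void)
{
    RECORD   * records = malloc((size_t)NUM_RECORDS * sizeof(RECORD));
    unsigned * keys    = malloc((size_t)NUM_RECORDS * sizeof(unsigned));
    unsigned   i;
    clock_t    start;

    if (records == NULL || keys == NULL) { return 1; }

    for (i = 0; i < NUM_RECORDS; i++) { records[i].key = i; keys[i] = i; } /* already sorted */

    start = clock();
    for (i = 0; i < NUM_RECORDS; i++) {
        RECORD wanted = { .key = keys[i] };
        bsearch(&wanted, records, NUM_RECORDS, sizeof(RECORD), compare_records);
    }
    printf("sequential key order: %.1f seconds\n", elapsed(start));

    for (i = NUM_RECORDS - 1; i > 0; i--) { /* crude shuffle; good enough to defeat the caches */
        unsigned j = (unsigned)rand() % (i + 1);
        unsigned t = keys[i]; keys[i] = keys[j]; keys[j] = t;
    }

    start = clock();
    for (i = 0; i < NUM_RECORDS; i++) {
        RECORD wanted = { .key = keys[i] };
        bsearch(&wanted, records, NUM_RECORDS, sizeof(RECORD), compare_records);
    }
    printf("random key order    : %.1f seconds\n", elapsed(start));

    free(keys);
    free(records);
    return 0;
}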

What is the moral of this story? There are several: Always be paranoid and question the status quo. Never assume that your algorithm is the fastest no matter how simple it looks, or that the only way to make it faster is to make it multi-threaded. When working with large data sets in memory, always be mindful of cache line exhaustion. Never try to measure tiny amounts of work in isolation; instead, performance test your algorithm working flat out, exercising 100% CPU, with realistic data access patterns, in order to determine the average tiny cost per iteration, e.g. 40 seconds divided by 30 million. And always try very hard to avoid multi-threaded code, because it’s evil in so many ways.

Problem #2: Turning a single-threaded algorithm into a locked multi-threaded algorithm:

In problem #1 we patted ourselves on the back for optimizing the query algorithm so that instead of taking 40 seconds it now takes 5 seconds, using a hash table. We did this by understanding the not-in-our-face cache line concurrency issues happening in the CPU even with a single thread. Now we want to make our hash table multi-threaded. Let’s say we have 8 physical CPU cores and want each core to be able to read and write hash table records. The problem is that we need to lock the hash table so that no two cores try to write to it at the same time.

Pitfall #1 (which, amazingly, I have seen even experienced developers fall into): Put as little code inside the lock as possible. For example, there’s no need to calculate the hash of the key inside the lock!

Pitfall #2: Use the hash of the key to reduce lock contention. For example, instead of locking the entire 1 GB hash table, have e.g. 100 x 10 MB hash tables, each with its own lock. Use the hash of the key to decide which hash table each key exclusively goes into. Now when accessing the hash tables concurrently there’s a much bigger chance that a particular thread won’t block, because individual hash tables can be locked and accessed in parallel.
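
To make pitfalls #1 and #2 concrete, here is a minimal sketch of the lock striping idea: the hash of the key, computed outside any lock, picks one of many small tables, each with its own mutex. The table layout, stripe count, and hash function are illustrative assumptions, not libsxe or any particular library:

/* striped_hash.c: toy sketch of lock striping. Keys that hash to different
 * stripes never contend for the same mutex, so threads mostly run in
 * parallel. Compile with -pthread. */
#include <pthread.h>
#include <string.h>

#define NUM_STRIPES      100
#define SLOTS_PER_STRIPE 4096

typedef struct SLOT   { unsigned long key; unsigned long value; int used; } SLOT;
typedef struct STRIPE { pthread_mutex_t lock; SLOT slots[SLOTS_PER_STRIPE]; } STRIPE;

static STRIPE stripes[NUM_STRIPES];

static unsigned long hash_key(unsigned long key) { return key * 2654435761UL; } /* cheap multiplicative hash */

void striped_init(void)
{
    unsigned i;
    memset(stripes, 0, sizeof(stripes));
    for (i = 0; i < NUM_STRIPES; i++) { pthread_mutex_init(&stripes[i].lock, NULL); }
}

void striped_put(unsigned long key, unsigned long value)
{
    unsigned long hash   = hash_key(key);          /* pitfall #1: hash the key outside the lock */
    STRIPE      * stripe = &stripes[hash % NUM_STRIPES];
    unsigned      slot   = (unsigned)((hash / NUM_STRIPES) % SLOTS_PER_STRIPE);
    unsigned      i;

    pthread_mutex_lock(&stripe->lock);             /* only 1/100th of the table is blocked      */
    for (i = 0; i < SLOTS_PER_STRIPE; i++) {       /* linear probe within the stripe            */
        SLOT * s = &stripe->slots[(slot + i) % SLOTS_PER_STRIPE];
        if (!s->used || s->key == key) { s->key = key; s->value = value; s->used = 1; break; }
    }
    pthread_mutex_unlock(&stripe->lock);
}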

Pitfall #3: Avoid using fancy locks. Some developers are tempted to use fancy hybrid, spin-lock type constructs, for example: don’t lock if only readers are currently accessing, but do lock if a writer is writing. These fancy locks sound good but are in fact very expensive to execute — even if no actual locking is performed — due to the fact that multiple atomic assembler instructions must be used to implement the fancy algorithm.

Pitfall #4: Avoid using fancy lock-less algorithms. Some developers think that using fancy lock-less algorithms is the way to get around having to use locks, and is therefore a kind of silver bullet. The problem is that all these lock-less algorithms rely on atomic assembler instructions, which are expensive to execute. Why? Because they guarantee that a cache line must be expensively ‘flown in’ (see above), as well as doing a bunch of other expensive things like breaking the concurrent op code pipe-line.
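
For reference, this is the kind of atomic operation that such lock-less algorithms are built from, shown here with the GCC/Clang atomic built-ins; the point of pitfall #4 is that every one of these calls forces the cache line holding the variable to be owned exclusively by the executing core, i.e. expensively ‘flown in’:

/* The building blocks of most lock-less algorithms: atomic read-modify-write
 * and compare-and-swap. There is no lock, but each call still drags the cache
 * line containing 'counter' to the executing core and stalls the pipe-line. */
static unsigned long counter;

unsigned long lockless_increment(void)
{
    return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST); /* atomically add 1, return old value */
}

int lockless_compare_and_swap(unsigned long expected, unsigned long desired)
{
    /* the primitive that lock-less queues, stacks, etc. are built on */
    return __atomic_compare_exchange_n(&counter, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}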

Problem #3: Turning a locked multi-threaded algorithm into an unlocked multi-threaded algorithm:

Obviously using no locks is faster than using locks. Plus, using fancy lock-less algorithms is kind of cheating and also doesn’t deliver the ultimate performance that we’re looking for. So how do we turn our locked multi-threaded algorithm into an unlocked multi-threaded algorithm? One way is to let the kernel help us. Ultra-high-performance algorithms for reading and writing data often have many readers and relatively few writers. What if the readers could all read as much as they want, concurrently, *and* a writer could write as much as it wants, concurrently, *and* all without locks? Well, it turns out that this is possible. How? Everybody has heard about copy-on-write memory, but it’s mainly associated with fork()ing and multiple processes. What is little known is that it’s possible to do copy-on-write in a single process and even in a single thread! Let’s go back to our first bsearch() example with the 1 GB memory block. We can memory map that 1 GB block and use it for the concurrent readers. However, we can also memory map the same 1 GB at another address range as a copy-on-write copy of the first mapping. The second 1 GB mapping uses the same physical memory pages as the first one and so takes up almost no extra physical memory. The concurrent writer can write to it, update multiple areas, and even apply a series of updates as an atomic transaction, all without disturbing the first mapping which the readers are happily reading from. When the writing is complete, the pointers to the memory maps get updated and… tadaa! We have a multi-threaded and truly lock-less algorithm. A side benefit of using memory maps is that the mapping can survive our process restarting (we can choose whether it is backed by a file on disk or just by virtual memory), which means that if it is holding some sort of cache of data then that cache will be immediately ‘hot’ upon restart.
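
Here is a minimal sketch of the double-mapping trick described above, on Linux: the same file is mapped twice, once MAP_SHARED for the readers and once MAP_PRIVATE for the writer, so writes touch copy-on-write pages without disturbing the readers’ view, and publishing the new version is just a pointer swap. The file name, sizes, and publish mechanism are illustrative assumptions, and retiring the old view once all readers have moved on is left out:

/* cow_single_process.c: single-process copy-on-write via a second, private
 * mapping of the same file. Only the pages the writer touches consume extra
 * physical memory; the readers' MAP_SHARED view is never disturbed. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_SIZE (1UL << 30) /* 1 GB, as in the bsearch() example */

static char * reader_view; /* the pointer the reader threads dereference */

int main(void)
{
    int    fd          = open("records.dat", O_RDWR); /* hypothetical 1 GB data file */
    char * shared_view = (fd < 0) ? MAP_FAILED : mmap(NULL, DATA_SIZE, PROT_READ,              MAP_SHARED,  fd, 0);
    char * writer_view = (fd < 0) ? MAP_FAILED : mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    if (shared_view == MAP_FAILED || writer_view == MAP_FAILED) { return 1; }

    reader_view = shared_view;                 /* readers start on the original view           */

    memcpy(writer_view + 4096, "new data", 8); /* writer updates its copy-on-write view; only  */
                                               /* the touched pages consume extra physical RAM */

    __atomic_store_n(&reader_view, writer_view, __ATOMIC_RELEASE); /* tadaa: publish atomically */

    printf("readers now see: %.8s\n", reader_view + 4096);
    return 0;
}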

Problem #4: My (unlocked) multi-threaded algorithm works too fast for the network card:

After having taken the CPU cache lines into consideration, and after having removed lock contention using copy-on-write, we finally end up with monster performance which scales across all CPUs. The only thing is that this doesn’t really help us very much if we can’t speak to the outside world fast enough. Let’s say our multi-threaded algorithm can read and write at 10 million transactions per second with all CPU cores pinned at 100%… how does this help us if we’re deploying to a box which has a 100 Mbit or 1,000 Mbit NIC only capable of a few hundred thousand packets per second? And this is probably the type of NIC commonly available from a service like Amazon EC2. The truth is that unless your NIC is 10,000 Mbit, you probably don’t need a multi-threaded algorithm in the first place. It is even said that C gurus can write code which handles all the packets of a 10,000 Mbit NIC using a single thread; it depends upon your algorithm, of course.

An exception to this is if you’re writing your cloud code in a language other than C. For example, node.js is fast to write but only relatively fast to run. A single-threaded node.js algorithm can easily be an order of magnitude slower than the same algorithm in C, mainly because in node.js the author has little control over the internal format of data structures and therefore over how efficiently they are accessed in terms of CPU cache lines. You would have to hack node.js itself to take advantage of CPU cache lines and/or copy-on-write memory, which would be so complicated that you might as well use C in the first place. It’s a similar story for plain old Java and other higher-level languages. This is also the main reason that operating systems and high-performance software such as databases are generally written in C. This doesn’t mean you have to write everything in complicated C; just the generic, high-performance components. Consider separating out the higher-level business logic — the source code which will probably change most often — and writing it in a higher-level language which leverages the generic, high-performance components. If you want the higher-level language to interface directly with C, then think very carefully about which language to use. Most scripting languages can call C functions and vice versa, but there can be enormous differences in speed when doing so. For example, some languages store their function names in an internal hash table, which means that if C calls a higher-level function then the higher-level language is going to do a hash table lookup for every call; expensive.

Problem #5: My unlocked multi-threaded algorithm is an efficient work of art but still under-performing on the network:

Also of note — and you would have thought this problem was fixed long ago — is that it can’t be taken for granted that a particular operating system and NIC will operate as efficiently as expected. For example, your average 1,000 Mbit NIC will only deliver anywhere close to 1,000 Mbit if optimally sized (i.e. MTU-sized) packets are being sent. Try to send 1,000 Mbit using smaller packets and watch system interrupt time go up while through-put goes down, to levels as low as 200 Mbit. This could be partly due to the NIC hardware, partly due to the NIC driver, and/or partly due to operating system network stack tuning. The fact is that you might only be able to tune it so high and no more. This is the point at which you might want to try a different NIC and/or a different kernel version. Always test the operating system / NIC performance independently from your concurrent code. As we have seen before, it may not even be necessary for you to make your code concurrent in order to fulfill the performance requirements.

Problem #6: My unlocked multi-threaded algorithm works amazingly on my 10,000 Mbit NIC on my 3 servers but took too long to develop:

Oh dear. You optimized everything so much and managed to develop something which is an order of magnitude or two faster than anything else available, and it’s running on 3 beefy server boxes with high-end NICs, but it took way too long to develop. All that CPU cache line analysis and lower-level coding in C took much longer than it would have in other languages. Maybe it would have been financially better to develop everything in node.js, which needs 30 servers instead of 3? This could well be the situation you find yourself in. Only your business model knows the answer to this conundrum. 27 extra servers could easily be much cheaper than paying more expensive C programmers to develop fiddly code for longer. However, if you’re expecting the business to grow, say, 10-fold in a reasonable period of time, then maybe it’s worth paying up front for the more complicated C code, because suddenly the extra overhead of the C developers looks cheap compared to 300 – 30 = 270 extra servers for the node.js solution.

Problem #7: I’m told that the GalacticTurboNoSQLDB 2.0 is as fast as anything and the ultimate concurrent solution:

Don’t believe them! One solution claims 10,000 transactions per second, another 100,000, another 1,000,000, and yet another 10,000,000. Always evaluate performance by creating a real-world performance / load test. Don’t be afraid to make the test as real as possible. For example, if you are expecting a million TCP connections to a particular server, then have the test create a million TCP connections to each server; only then are we properly testing concurrency. Then ask yourself whether it could be faster and/or use less memory and/or use less disk. Examine the CPU usage during the test. If the cores are not at 100%, then maybe the algorithm is not optimal. If they are at 100% and it’s a network program, then determine whether the maximum NIC through-put has been exhausted. If it hasn’t, there’s room for improvement. Once you have tested everything, compared all the test metrics, and decided that concurrency and therefore performance is good, ensure that those metrics can be monitored 24/7 in live production. It may be that GalacticTurboNoSQLDB 2.0 is blazingly fast for 7 hours and your performance test only lasted for 5 hours. Because GalacticTurboNoSQLDB 2.0 is written in Java, it seemed to work well on your 128 GB monster server until garbage collection kicked in and it went on a bender for half an hour 😦 When production metrics are found to no longer reflect the carefully crafted performance tests, then carefully craft the performance tests a bit more!

Not the end of threads and locking concurrency issues, but the end of this brain dump.

 

G-WAN versus SXE “Hello World” April 26, 2012

Recently I’ve been very impressed reading about the performance figures for G-WAN:
http://gwan.ch/benchmark

G-WAN has quite the licensing model, with the G-WAN binary being freeware and support costing a great deal of money:
http://gwan.ch/buy

So I decided to do a simple libsxe versus G-WAN performance test, like I did for libsxe versus NGINX and libsxe versus node.js. However, for this test I decided to use weighttp, the multi-threaded load tool that G-WAN itself uses for benchmarking:
http://redmine.lighttpd.net/projects/weighttp/wiki

I modified the simple libsxe HTTP server to make it take advantage of multiple CPUs.

These tests were run on a Ubuntu 11.04 instance running on a dual quad core i7 processor.

First the G-WAN figures:

I don’t know why G-WAN is talking about 16 cores upon starting because my i7 only has 8!

simon@ubuntu:~/gwan_linux64-bit$ sudo ./gwan

allowed Cores: 8 (‘sudo ./gwan’ to let G-WAN use your 16 Core(s))

loading
> ‘all.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)
> ‘hello.mm’: to use Objective-C++ (*.mm) scripts, install ‘gobjc++’ (sudo apt-get install gobjc++)
> ‘loan.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)..
> ‘argv.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac).
> ‘hello.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac).
> ‘hello.m’: to use Objective-C (*.m) scripts, install ‘gobjc’ (sudo apt-get install gobjc)
> ‘report.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)..

G-WAN 3.3.28 (pid:3110)

simon@ubuntu:~/weighttp$ ./build/default/weighttp -n 10000000 -c 1000 -t 4 -k "http://127.0.0.1:8080/100.html"
weighttp – a lightweight and simple webserver benchmarking tool

host: ‘127.0.0.1’, port: 8080
starting benchmark…
spawning thread #1: 250 concurrent requests, 2500000 total requests
spawning thread #2: 250 concurrent requests, 2500000 total requests
spawning thread #3: 250 concurrent requests, 2500000 total requests
spawning thread #4: 250 concurrent requests, 2500000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done

finished in 61 sec, 501 millisec and 457 microsec, 162597 req/s, 59862 kbyte/s
requests: 10000000 total, 10000000 started, 10000000 done, 10000000 succeeded, 0 failed, 0 errored
status codes: 10000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 3770000000 bytes total, 2770000000 bytes http, 1000000000 bytes data

Now the libsxe figures:

simon@ubuntu:~/sxe-httpd/sxe-httpd$ ./build-linux-64-release/sxe-httpd 127.0.0.1 8080 10000
20120426 211759.525 T 10198 —— 1 – sxe-httpd starting // detected cpus: 8
20120426 211759.525 T 10198 —— 1 – sxe-httpd parent forking 7 times
20120426 211759.525 T 10199 —— 1 – sxe-httpd child created
20120426 211759.525 T 10200 —— 1 – sxe-httpd child created
20120426 211759.525 T 10201 —— 1 – sxe-httpd child created
20120426 211759.526 T 10202 —— 1 – sxe-httpd child created
20120426 211759.526 T 10203 —— 1 – sxe-httpd child created
20120426 211759.526 T 10204 —— 1 – sxe-httpd child created
20120426 211759.526 T 10205 —— 1 – sxe-httpd child created

simon@ubuntu:~/weighttp$ ./build/default/weighttp -n 10000000 -c 1000 -t 4 -k "http://127.0.0.1:8080/100.html"
weighttp – a lightweight and simple webserver benchmarking tool

host: ‘127.0.0.1’, port: 8080
starting benchmark…
spawning thread #1: 250 concurrent requests, 2500000 total requests
spawning thread #2: 250 concurrent requests, 2500000 total requests
spawning thread #3: 250 concurrent requests, 2500000 total requests
spawning thread #4: 250 concurrent requests, 2500000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done

finished in 34 sec, 79 millisec and 878 microsec, 293428 req/s, 108316 kbyte/s
requests: 10000000 total, 10000000 started, 10000000 done, 10000000 succeeded, 0 failed, 0 errored
status codes: 10000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 3780000000 bytes total, 2780000000 bytes http, 1000000000 bytes data

Conclusion:

At 162597 versus 293428 requests per second, libsxe is significantly — 1.8 times — faster than G-WAN for this simple performance test using 8 cores. Although G-WAN calls itself the fastest web server available — and admittedly is very fast — it obviously suffers internally from quite a bit of overhead even for a trivial performance test such as this one. And with libsxe the CPU bottleneck is really the networking layer in the kernel… so what is G-WAN doing with all those spare CPU cycles? Looks like G-WAN might have room for optimization yet? Or maybe it’s partly due to libsxe’s fixed memory model, which does away with the unnecessary and repetitive malloc() / free() cycle? I guess we’ll never know, since G-WAN is closed source.

EDIT: Since running this test we have found two potential problems with G-WAN which mean that these figures are unreliable (see thread below): (a) G-WAN’s performance seems highly tuned to particular processors, but it’s supplied as a single binary executable, meaning that performance tests may vary wildly; and (b) G-WAN doesn’t scale linearly as the number of cores increases, even with the simplest of performance tests.

 

What they didn’t tell you about creating scalable HTTP web services April 10, 2011

Filed under: Uncategorized — simonhf @ 3:22 am

Important (performance) factors to keep in mind when creating HTTP web services (with or without libSXE):

1. Accept()ing new sockets is an expensive and generally single-threaded kernel operation. Therefore it’s important to consider having as many long-lived ‘keep-alive’ HTTP 1.1 connections as possible. For example, your multi-threaded HTTP server might be able to have 500,000 simultaneous HTTP connections and perform at 250,000 keep-alive requests per second, but the server will still only be able to accept() new connections at a rate of 30,000 per second.

2. Parsing HTTP headers can be done with the sub-library lib-sxe-http. However, parsing the key/value structured headers is (unnecessarily) slow. Why slow? Because the parser must loop over and examine each and every character in the headers in order to determine e.g. whether it’s an end-of-line character, or one of the end-of-header characters, etc. Although such a loop seems simple and fast, when we loop over, say, 250,000 or more headers per second then all that looping adds up to a lot of CPU instructions. Why unnecessarily slow? If you have control over the peer sending the HTTP request — e.g. because the peer is another of your servers or your javascript library etc — then it’s possible to build the (never pipelined) requests and/or responses in such a way that it’s not necessary for your HTTP server to parse the headers at all. For simple GET requests, your server only needs to examine the last characters read in order to determine whether they are the end-of-header characters. For POST requests and responses, the ‘content-length’ header is useful, but in order to avoid parsing the headers we can embed this info at the end of the body at a fixed, known location. And this technique works even if third-party HTTP proxies manipulate the headers en route. Conclusion: Avoid parsing HTTP headers if possible. This blog post shows how parsing HTTP headers causes node.js to more than halve in performance: Node.Js Versus SXE “Hello World”; Node.Js Performance Revisited
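
As an illustration of the GET case in point 2, here is a sketch of the ‘only look at the tail of what was just read’ check, assuming (as above) that the peer never pipelines requests; this is my own sketch, not lib-sxe-http:

/* Sketch for point 2: instead of parsing every header character, only inspect
 * the tail of the bytes read so far for the blank line ("\r\n\r\n") that
 * terminates the HTTP headers of a simple GET request. */
#include <string.h>

/* Returns 1 if the buffer of 'length' bytes read so far ends with the
 * end-of-headers sequence, i.e. a complete request has arrived. */
int request_is_complete(const char * buffer, unsigned length)
{
    return length >= 4 && memcmp(buffer + length - 4, "\r\n\r\n", 4) == 0;
}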

3. The general rule of thumb is that the HTTP server should nearly never have to close() a socket. The close() function is not as slow as the accept() function, but it’s not free either. Using close() might just lead a malicious peer to connect again, causing an expensive accept(). So, assuming your server has a lot of memory, it’s probably a better trade-off to just keep a badly behaving HTTP peer connected, and maybe even ‘tar pit’ it, etc. The idea is that a little bit of memory for the peer buffers is a lesser evil than the potentially expensive close()/accept() bottleneck.

4. Another reason not to close() is that on very heavily loaded networks there is no guarantee that the TCP packets generated by a close() will even reach the peer. In fact, it doesn’t even have to be a heavily loaded network. For example, a client which suddenly powers off while it has an idle keep-alive connection to a server will not inform the server of the close; the server will continue to believe that the TCP connection is still alive unless it tries to send data to the client. So a good rule of thumb is not to rely on the close event — generated when a peer close()s — for cleaning up the resources / state / etc belonging to a particular HTTP connection. So how else do we clean up a connection if the server never close()s and we cannot 100% rely on the close event? The only answer which makes sense is to time out the particular connection. But there is an even simpler way: just close and reuse the oldest TCP connection. For example, the server is configured to handle a maximum of 500,000 simultaneous connections. When we want to accept() the 500,001st connection, we simply close() the oldest connection in the connection pool. Chances are that we’re just reaping one of those ‘zombie’ connections.
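
Below is a toy sketch of the ‘close and reuse the oldest connection’ idea from point 4, with a plain array standing in for the connection pool and a linear scan standing in for a pool ordered by age; the structure and names are illustrative only:

/* Toy sketch of point 4: when the pool is full, rather than refusing a new
 * connection, close() and recycle the connection that has been quiet the
 * longest, on the assumption that it is most likely a 'zombie'. A pool that
 * keeps connections on a list ordered by last activity makes this O(1); the
 * linear scan here just keeps the illustration short. */
#include <time.h>
#include <unistd.h>

#define MAX_CONNECTIONS 500000

typedef struct CONNECTION {
    int    fd;            /* -1 means the slot is free       */
    time_t last_activity; /* updated on every read and write */
} CONNECTION;

static CONNECTION connections[MAX_CONNECTIONS];

void connection_pool_init(void)
{
    unsigned i;
    for (i = 0; i < MAX_CONNECTIONS; i++) { connections[i].fd = -1; }
}

/* Find a slot for a freshly accept()ed socket, evicting the oldest if full. */
CONNECTION * connection_slot_for(int new_fd)
{
    unsigned i;
    unsigned oldest = 0;

    for (i = 0; i < MAX_CONNECTIONS; i++) {
        if (connections[i].fd < 0) { oldest = i; break; }                                     /* free slot        */
        if (connections[i].last_activity < connections[oldest].last_activity) { oldest = i; } /* or oldest-so-far */
    }

    if (connections[oldest].fd >= 0) { close(connections[oldest].fd); } /* reap the probable zombie */

    connections[oldest].fd            = new_fd;
    connections[oldest].last_activity = time(NULL);
    return &connections[oldest];
}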

 

12,000 views in 5 months! April 6, 2011

Filed under: Uncategorized — simonhf @ 4:49 am

Today the counter ticked up to 12,000 blog views. Yay! The blog was started just over 5 months ago. That’s way more views than I ever expected. Thank you to all the readers / anonymous lurkers / etc out there. And an extra special thank you to all those who left comments on the various postings or who contacted me directly. I have some really interesting posts coming up… if I can just find the time 🙂

 

Screencast: Building libsxe March 27, 2011

Filed under: Uncategorized — simonhf @ 8:09 pm

Screencast: Building libsxe
Screencast: Click to play HD full-screen

I thought I’d try something new. So here’s a screencast showing how to download, build, and test libsxe on 64 bit Ubuntu. I also explain a bit about the layout of the libsxe source files and the various sub-libraries. Building the release, debug, and coverage targets for libsxe from scratch — including running the well over 1,000 tests on each target and enforcing 100% code coverage — takes about 1 minute 20 seconds in total on my VMware installation of Ubuntu. Quite fast but could be faster. The build is currently executed using consecutive steps. It’s on the ‘to do’ list to parallelize the build and make use of multiple cores to speed things up even more. The tests run so fast already — even on a single core — because we do sneaky things like faking time and mocking system calls to easily reproduce the most difficult to reproduce error conditions.

 

libsxe, shared-memory, and parallel state-driven algorithms February 27, 2011

I recently came across the following paper: “Memory Models: A Case for Rethinking Parallel Languages and Hardware” by Sarita V. Adve and Hans-J. Boehm

The paper starts off:

The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current popular languages and systems that prevent achieving the vision of structured and safe parallel programming. This paper discusses the path to the above convergence, the hard lessons learned, and their implications. …

And then introduces the idea of “disciplined shared-memory models”:

Moving forward, we believe a critical research agenda to enable “parallelism for the masses” is to develop and promote disciplined shared-memory models that:
• are simple enough to be easily teachable to undergraduates; i.e., minimally provide sequential consistency to programs that obey the required discipline;
• enable the enforcement of the discipline; i.e., violations of the discipline should not have undefined or horrendously complex semantics, but should be caught and returned back to the programmer as illegal;
• are general-purpose enough to express important parallel algorithms and patterns; and
• enable high and scalable performance.

This is interesting because libsxe has a disciplined shared-memory model which goes a long way towards fulfilling the criteria above in the form of the sxe pool library. So what is a sxe pool and how does it offer us a disciplined shared-memory model?

The sxe pool library was invented for different reasons than offering a disciplined shared-memory model. A shared-memory option was added later as a pool construction option. In short, sxe pools offer a way to create C arrays of structs with the following generic benefits:
• The size of the array is persisted
• Each element of the array gets its own state which is persisted outside the element struct in a linked list
• Each element of the array gets its own timestamp which is persisted outside the element struct in a linked list
• Each element is accessed using regular & concise C code, e.g. myarray[myindex].mystructmember

The sxe pool library caller can manipulate the state & timestamp element properties using the following API:

sxe_pool_get_number_in_state(void * array, unsigned state)
sxe_pool_index_to_state(void * array, unsigned id)
sxe_pool_set_indexed_element_state(void * array, unsigned id, unsigned old_state, unsigned new_state)
sxe_pool_set_oldest_element_state(void * array, unsigned old_state, unsigned new_state)
sxe_pool_get_oldest_element_index(void * array, unsigned state)
sxe_pool_get_oldest_element_time(void * array, unsigned state)
sxe_pool_get_element_time_by_index(void * array, unsigned element)
sxe_pool_touch_indexed_element(void * array, unsigned id)

Converting the sxe pool library API to support shared memory was relatively simple. The sxe_pool_new() function got an option to share the pool memory, and the API functions that change a pool element’s state use atomic assembler instructions if the pool was constructed as a shared-memory pool. It’s also interesting to note that sxe pools can be shared between processes as well as between threads in the same process. This is because the sxe pool library’s internal implementation avoids absolute pointers, which is also something that I encourage from libsxe developers and C developers in general.

This API is “general-purpose enough to express important parallel algorithms and patterns” and most interestingly is the same API whether the algorithm is threaded or not. It’s also “simple enough to be easily teachable to undergraduates” or even junior developers as we have found out at Sophos. The atomic sxe_pool_set_[indexed|oldest]_element_state() API functions “enable the enforcement of the discipline” by requiring both the old state and the new state of the array element; if the developer supplies the wrong old state then sxe pool will assert. Because the sxe pool library manages the element states itself then an assert is very unlikely when using a single pool. However, more complicated algorithms often make use of chains of pools in order to implement multiplexing and/or combining of parallel results, etc. In these cases, it is common to keep references to pool array element indexes and/or pool array element states in the caller supplied pool element structs. Finally, by implementing algorithms using the sxe pool API then it is possible to “enable high and scalable performance” using a minimum of simple to understand C source code. The developer is forced into thinking about the algorithm as a state model which often simplifies the hardest problems. And the generic part of the implementation complexity — e.g. locking, shared memory, keeping state, double linked lists, timeout handling, & memory allocation — is all handled by the sxe pool library and backed by automated tests with 100% library code coverage. The resulting performance is excellent as can be seen by the figures published in earlier blog entries; tens of thousands of network requests per second per core.

As you can see, the sxe pool is an incredibly powerful and code saving and code simplifying generic data structure. It’s a sort of Swiss Army knife for parallel, event driven algorithms. In a future article I’ll show some of the implementation patterns.

 

a gcc bug: optimization (-O) of code using union unexpectedly causes essential code to be optimized away February 12, 2011

Filed under: Uncategorized — simonhf @ 6:49 pm

Having written hundreds of thousands of lines of C during my career, I’ve never found a bug in gcc unless I’ve been trying to do something well off the beaten track. So it came as a surprise to discover this nasty bug in gcc 4.5.1 and I share it here. Note: The code below works perfectly with Microsoft compilers and gcc 3.x compilers.

Reproduce with:


mingw32-gcc.exe -DWITH_UNNECESSARY_CODE    example.c -oexample.exe && example.exe
mingw32-gcc.exe                            example.c -oexample.exe && example.exe
mingw32-gcc.exe -DWITH_UNNECESSARY_CODE -O example.c -oexample.exe && example.exe
mingw32-gcc.exe                         -O example.c -oexample.exe && example.exe

Shows output:


C:\20110211-gcc-bug>mingw32-gcc.exe -DWITH_UNNECESSARY_CODE    example.c -oexample.exe   && example.exe
before  copy_bytes(): source     : abcd
before  copy_bytes(): destination: 1234
calling copy_bytes(): copy source to destination
after   copy_bytes(): source     : abcd
after   copy_bytes(): destination: abcd <-- expected

C:\20110211-gcc-bug>mingw32-gcc.exe                            example.c -oexample.exe   && example.exe
before  copy_bytes(): source     : abcd
before  copy_bytes(): destination: 1234
calling copy_bytes(): copy source to destination
after   copy_bytes(): source     : abcd
after   copy_bytes(): destination: abcd <-- expected

C:\20110211-gcc-bug>mingw32-gcc.exe -DWITH_UNNECESSARY_CODE -O example.c -oexample.exe   && example.exe
before  copy_bytes(): source     : abcd
before  copy_bytes(): destination: 1234
calling copy_bytes(): copy source to destination
after   copy_bytes(): source     : abcd
after   copy_bytes(): destination: abcd <-- expected

C:\20110211-gcc-bug>mingw32-gcc.exe                         -O example.c -oexample.exe   && example.exe
before  copy_bytes(): source     : abcd
before  copy_bytes(): destination: 1234
calling copy_bytes(): copy source to destination
after   copy_bytes(): source     : abcd
after   copy_bytes(): destination: 1234 <-- not expected: the bug in action

example.c:


#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for strcmp() used in main() */

typedef union PTR_UNION
{
    char * as_s08_ptr;
    unsigned int as_u32;
} PTR_UNION;

void copy_bytes(unsigned int address_as_u32, char * source, unsigned int size)
{
    unsigned int i;
    PTR_UNION destination;

    destination.as_u32 = address_as_u32;

    for (i=0 ; i<size; i++) {
        destination.as_s08_ptr[i] = source[i];
    }

#ifdef WITH_UNNECESSARY_CODE
    if ((size >= 1) && (destination.as_s08_ptr[0] != source[0])) {
        exit(1);
    }
#endif
}

void main(void)
{
    PTR_UNION address;
    char source[] = "abcd";
    char destination[] = "1234";

    printf("before  copy_bytes(): source     : %s\n", source);
    printf("before  copy_bytes(): destination: %s\n", destination);

    printf("calling copy_bytes(): copy source to destination\n");
    address.as_s08_ptr = destination;
    copy_bytes(address.as_u32, source, sizeof(source));

    printf("after   copy_bytes(): source     : %s\n", source);
    printf("after   copy_bytes(): destination: %s <-- %s\n", destination,
        strcmp(source, destination) ? "not expected: the bug in action" : "expected");
}

The bug is filed here.

 

 