SimonHF's Blog


Concurrency, performance, threads, and locking brain dump December 3, 2012

As a ‘hard-core’ C developer with over 30,000 hours of hands-on development experience, I feel qualified to comment on concurrency and performance issues relating to ‘threads and locks’. So here goes my brain dump, dividing the territory into 7 common problem areas:

Problem #1: The hidden concurrency in single-threaded algorithms, *or* How developers think memory operations & mechanisms like bsearch() are fast:

It’s almost like — by starting to think about concurrency in terms of threads and locks — we’re jumping the gun. Because *before* even considering the problems of multi-threading, there’s a different concurrency war going on behind the scenes in your single-threaded program! Your higher level code might look single-threaded, but at run-time it comes down to assembler op codes being executed and the associated memory accesses. Often our intention is to use concurrency in order to achieve run-time performance, in which case it’s critical for the developer to understand how memory access works even in a single-threaded program. Why? Believe it or not, it’s easily possible for an inexperienced developer to write a concurrent, multi-threaded algorithm which is way slower than a well-tuned single-threaded algorithm. So first make the single-threaded algorithm as fast as possible; *then* imagine that much faster single-threaded algorithm made multi-threaded. It’s going to be mega fast.

One common reason developers inadvertently make their single-threaded algorithm unnecessarily slow is the misconception that memory access is generally pretty fast. The reality is a bit different and more complicated. Generally developers think memory is fast and disk is slow, so they try to access the disk as little as possible. This is a good rule of thumb. However, what very many developers don’t realize is that memory access time on modern CPUs has a split personality: uncached RAM is much, much slower than CPU-cached RAM. So much so that the difference between accessing uncached RAM and cached RAM is analogous to the difference between disk and memory: minimize use of the slower one. So how is RAM cached? In simple terms, RAM is cached in ‘cache lines’, the size of which depends upon the CPU but is usually 64 bytes on modern 64 bit CPUs.

In the old days, figuring out the speed of some assembler was as easy as adding up the number of clock cycles for a series of op codes, and memory access speed was always the same and therefore predictable. These days, op codes still have an individual execution cost associated with them, but the total time to execute is blurred by two factors: pipe-lining of instructions causing variable execution time, and *highly* variable memory access times depending upon how memory has been cached. At this point I bet you’re thinking: okay, so a bit of caching goes on; big deal; it can’t make that much difference to the execution speed of my program; at most a few percent or 10% or something low like that. Well, you’re wrong, and an easy way to illustrate the problem is with bsearch(). Yes, bsearch(). It’s easy to Google bsearch and find a whole bunch of developers willing to swear by their grandmother that it’s the fastest way to access a large chunk of data. The truth is that it might have been fast 10 years ago but not any more. And the reason it’s not fast is that if the data set being searched is much larger than the total cache line capacity of the CPU, then the bsearch() algorithm by its very nature touches a lot of memory which isn’t currently cached. Even accessing 1 byte of a 64 byte cache line causes the entire 64 byte cache line to be expensively ‘flown in’ from RAM to the CPU’s cache.

How expensive is this cache line gathering process? As an example, consider a simple C program which creates a sorted array of 30 million equal-sized records taking up about 1 GB of RAM. If we use bsearch() to look up all keys from 1 to 30 million in order, then on my laptop’s Intel i7 CPU about 8 seconds of wall clock time goes by. Fine, you say; 30 million is a lot of records and that sounds like good performance. However, now let’s look up the same 30 million keys but in random order. On the same CPU, about 40 seconds of wall clock time goes by. That’s *five* times slower. Not a few percent slower. Not ten percent slower. But *five* times slower. Obviously this is a huge difference. And it’s all down to cache line exhaustion. In both experiments we accessed exactly the same amount of memory and looped the same number of times, but in experiment #2 the CPU had to have cache lines expensively ‘flown in’ way more often. For reference, the same 30 million records can be accessed in the form of a hash table in under 5 seconds of wall clock time and in any random order. Why? Because far fewer cache lines must be expensively ‘flown in’ 🙂
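
For the curious, here is a minimal sketch of the kind of experiment described above. The record layout, key generation and timing method are illustrative assumptions rather than the exact program used, but it shows the shape of the test: the same bsearch() lookups, first in key order and then in shuffled order.

/* Sketch of the experiment above: bsearch() over ~30 million sorted,
 * equal-sized records (~1 GB), looking up every key first in order and
 * then in random order. Record layout and sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RECORDS 30000000U

typedef struct record { unsigned key; char payload[28]; } record; /* 32 bytes */

static int compare_keys(const void *a, const void *b)
{
    unsigned ka = *(const unsigned *)a, kb = *(const unsigned *)b;
    return (ka > kb) - (ka < kb);
}

static void lookup_all(const record *table, const unsigned *keys, const char *label)
{
    time_t start = time(NULL);
    for (unsigned i = 0; i < RECORDS; i++)
        (void)bsearch(&keys[i], table, RECORDS, sizeof(record), compare_keys);
    printf("%s lookups: ~%.0f seconds wall clock\n", label, difftime(time(NULL), start));
}

int main(void)
{
    record   *table = malloc(sizeof(record)   * RECORDS); /* ~1 GB sorted array */
    unsigned *keys  = malloc(sizeof(unsigned) * RECORDS);

    for (unsigned i = 0; i < RECORDS; i++) { table[i].key = i; keys[i] = i; }

    /* Pass 1: ascending key order; successive probes mostly revisit cache
     * lines which are already resident, especially near the leaves. */
    lookup_all(table, keys, "sequential");

    /* Crude Fisher-Yates shuffle so pass 2 probes memory in random order. */
    srand(1);
    for (unsigned i = RECORDS - 1; i > 0; i--) {
        unsigned j = (unsigned)rand() % (i + 1), tmp = keys[i];
        keys[i] = keys[j]; keys[j] = tmp;
    }

    /* Pass 2: same lookups, same data, but now nearly every probe drags a
     * cold 64 byte cache line in from RAM. */
    lookup_all(table, keys, "random");

    free(keys);
    free(table);
    return 0;
}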

What is the moral of this story? There are several: Always be paranoid and question the status quo. Never assume that your algorithm is the fastest no matter how simple it looks, or that the only way to make it faster is to make it multi-threaded. When working with large data sets in memory, always be mindful of cache line exhaustion. Never try to measure tiny amounts of something; instead, performance test your algorithm working flat out, exercising 100% CPU and with realistic data access patterns, and derive the average tiny cost per iteration from that, e.g. 40 seconds divided by 30 million. And always try very hard to avoid multi-threaded code, because it’s evil in so many ways.

Problem #2: Turning a single-threaded algorithm into a locked multi-threaded algorithm:

In Problem #1 we patted ourselves on the back for optimizing the query algorithm so that instead of taking 40 seconds it now takes 5 seconds using a hash table. We did this by understanding the not-in-our-face cache line memory concurrency issues happening in the CPU even with a single thread. Now we want to make our hash table multi-threaded. Let’s say we have 8 physical CPU cores and want each core to be able to read and write hash table records. The problem is that we need to lock the hash table so that no two cores try to write to it at the same time.

Pitfall #1 (which amazingly I have seen even experienced developers fall into): putting more code inside the lock than necessary. Put as little code inside the lock as possible; for example, there’s no need to calculate the hash of the key while holding the lock!

Pitfall #2: failing to use the hash of the key to reduce lock contention. Instead of locking the entire 1 GB hash table, have e.g. 100 x 10 MB hash tables, each with its own lock, and use the hash of the key to decide which hash table a given key exclusively lives in. Now when accessing the hash tables concurrently there’s a much bigger chance that a particular thread won’t block, because individual hash tables can be locked and accessed in parallel; see the sketch below.
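
To make Pitfalls #1 and #2 concrete, here is a rough sketch; the names, sizes and toy hash function are my own illustrative choices and collision handling is omitted, so treat it as a shape rather than a real hash table. It hashes outside the lock and stripes the table so that each stripe has its own lock:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define STRIPES            128      /* e.g. 128 small tables instead of one 1 GB table */
#define BUCKETS_PER_STRIPE 8192

typedef struct bucket { char key[24]; unsigned long value; } bucket;

typedef struct stripe {
    pthread_mutex_t lock;                       /* one lock per stripe, not per table */
    bucket          buckets[BUCKETS_PER_STRIPE];
} stripe;

static stripe table[STRIPES];

void table_init(void)
{
    for (unsigned i = 0; i < STRIPES; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

static uint64_t hash_key(const char *key)       /* toy FNV-1a style hash */
{
    uint64_t h = 14695981039346656037ULL;
    while (*key) { h ^= (unsigned char)*key++; h *= 1099511628211ULL; }
    return h;
}

void table_put(const char *key, unsigned long value)
{
    uint64_t h = hash_key(key);                 /* Pitfall #1: hash *before* locking    */
    stripe  *s = &table[h % STRIPES];           /* Pitfall #2: the key picks its stripe */
    bucket  *b = &s->buckets[(h / STRIPES) % BUCKETS_PER_STRIPE];

    pthread_mutex_lock(&s->lock);               /* critical section is just the copy;   */
    strncpy(b->key, key, sizeof(b->key) - 1);   /* other stripes stay unlocked, so      */
    b->key[sizeof(b->key) - 1] = '\0';          /* other threads rarely contend         */
    b->value = value;
    pthread_mutex_unlock(&s->lock);
}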

Pitfall #3: using fancy locks. Some developers are tempted to use fancy hybrid, spin-lock type constructs that do clever things, for example, don’t lock if only readers are currently accessing, but do lock if a single writer is writing. These fancy locks sound good but are in fact very expensive to execute — even if no actual locking is performed — because multiple atomic assembler instructions must be used to implement the fancy algorithm.

Pitfall #4: relying on fancy lock-less algorithms. Some developers think that using fancy lock-less algorithms is the way to get around having to use locks and is therefore a kind of silver bullet. The problem is that all these lock-less algorithms rely on atomic assembler instructions, which are expensive to execute. Why? Because they guarantee that a cache line must be expensively ‘flown in’ (see above), as well as doing a bunch of other expensive things like stalling the op code pipe-line.
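
For readers who haven’t looked at what a lock-less algorithm actually costs, here is a tiny illustration using the GCC/Clang __atomic builtins (a toolchain assumption on my part); every one of these calls forces exclusive ownership of the underlying cache line plus an ordering penalty in the pipeline, which is exactly the overhead being described above:

#include <stdint.h>

static uint64_t counter;

uint64_t plain_increment(void)      /* cheap: no cross-thread guarantees at all */
{
    return ++counter;
}

uint64_t atomic_increment(void)     /* thread-safe, but each call is a LOCK-prefixed
                                       instruction on x86 which must own the cache
                                       line holding 'counter' exclusively */
{
    return __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
}

/* Compare-and-swap: the primitive underneath most lock-less data structures.
 * Returns non-zero if *slot still held 'expected' and was swapped to 'desired'. */
int try_publish(uint64_t *slot, uint64_t expected, uint64_t desired)
{
    return __atomic_compare_exchange_n(slot, &expected, desired, 0 /* strong */,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}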

Problem #3: Turning a locked multi-threaded algorithm into an unlocked multi-threaded algorithm:

Obviously using no locks is faster than using locks. Plus, using fancy lock-less algorithms is kind of cheating and also doesn’t deliver the ultimate performance that we’re looking for. So how do we turn our locked multi-threaded algorithm into an unlocked multi-threaded algorithm? One way is to let the kernel help us. Ultra high performance algorithms for reading and writing data often have many readers and relatively few writers. What if the readers could all read as much as they want, concurrently, *and* a writer could write as much as it wants, concurrently, *and* all without locks? Well, it turns out that this is possible. How? Everybody has heard about copy-on-write memory, but it is mainly associated with fork()ing and multiple processes. What is little known is that it’s possible to do copy-on-write in a single process and even in a single thread!

Let’s go back to our first bsearch() example with the 1 GB memory block. We can memory map that 1 GB block and use it for the concurrent readers. However, we can also memory map the same 1 GB to another address range as a copy-on-write copy of the first memory map. The second 1 GB memory map uses the same physical memory pages as the first one and so takes up almost no additional physical memory. The concurrent writer can write to it, update multiple areas, and even apply a series of updates as an atomic transaction, all without disturbing the first memory map which the readers are happily reading from. When the writing is complete, the pointers to the memory maps get updated and… tadaa! We have a multi-threaded and truly lock-less algorithm. A side benefit is that the memory map — which we can choose to back with a file on disk or just with virtual memory — can survive our process restarting, which means that if the memory map is holding some sort of cache of data then it will be immediately ‘hot’ upon restart.
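
Here is a minimal sketch of the double-mapping idea, assuming a file-backed data set and POSIX mmap(); the file name and offsets are made up, error handling is minimal, and the questions of when old readers can release the previous view and how the writer’s private pages eventually get persisted are deliberately glossed over:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_FILE "records.dat"   /* hypothetical ~1 GB data file */

static void *live_view;           /* base pointer the reader threads dereference */

int main(void)
{
    int fd = open(DATA_FILE, O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    off_t size = lseek(fd, 0, SEEK_END);

    /* Mapping #1: plain read-only view shared by all the reader threads. */
    void *read_view  = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED, fd, 0);

    /* Mapping #2: copy-on-write view of the *same* pages (MAP_PRIVATE);
     * physical memory is only duplicated for pages the writer touches. */
    void *write_view = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE, fd, 0);

    if (read_view == MAP_FAILED || write_view == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    __atomic_store_n(&live_view, read_view, __ATOMIC_RELEASE);

    /* Reader threads do:
     *     const char *base = __atomic_load_n(&live_view, __ATOMIC_ACQUIRE);
     * and then read as much as they like, with no locks. */

    /* The single writer applies a whole batch of updates to its private view;
     * the readers cannot see any of them yet. */
    memcpy((char *)write_view + 4096, "updated record", 15);

    /* Publish the batch atomically by swapping the base pointer. */
    __atomic_store_n(&live_view, write_view, __ATOMIC_RELEASE);

    close(fd);
    return EXIT_SUCCESS;
}

Note that a production version also has to decide when the old view can be thrown away (readers may still be holding the old base pointer) and how the modified private pages are written back, which is where the disk-backed versus virtual-memory-backed choice mentioned above comes in.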

Problem #4: My (unlocked) multi-threaded algorithm works too fast for the network card:

After having taken the CPU cache lines into consideration, and after having removed lock contention using copy-on-write, we finally end up with monster performance which scales across all CPUs. The only thing is that this doesn’t really help us very much if we can’t speak to the outside world fast enough. Let’s say our multi-threaded algorithm can read and write at 10 million transactions per second with all CPU cores pinned at 100%… how does this help us if we’re deploying to a box which has a 100 Mbit or 1,000 Mbit NIC only capable of a few hundred thousand packets per second? And this is probably the type of NIC commonly available from a service like Amazon EC2. The truth is that unless your NIC is 10,000 Mbit then you probably don’t need a multi-threaded algorithm in the first place. It is even said that C gurus can write code which handles all the packets of a 10,000 Mbit NIC using a single thread; it depends upon your algorithm, of course.
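
To put rough numbers on that (a back-of-envelope estimate, not a measurement): a 1,000 Mbit NIC moving full-size ~1,500 byte frames tops out around 1,000,000,000 / (1,500 x 8) ≈ 83,000 packets per second, and even with small ~200 byte packets the theoretical ceiling is only around 600,000 packets per second, before interrupt and network stack overhead eats into it. Set either figure against an algorithm capable of 10 million transactions per second and it’s obvious which side is the bottleneck.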

An exception to this is if you’re writing your cloud code in a language other than C. For example, node.js is fast to write but only relatively fast to run. A single-threaded node.js algorithm can easily be an order of magnitude slower than the same algorithm in C, mainly because in node.js the author has little control over the internal format of data structures and therefore over the efficiency of accessing them in terms of CPU cache line invalidation. You would have to hack node.js to take advantage of CPU cache lines and/or copy-on-write memory, which would be so complicated that you might as well use C in the first place. It’s a similar story for plain old Java or other higher level languages. This is also the main reason that the operating system itself and high performance software such as databases are generally written in C. This doesn’t mean you have to write everything in complicated C; just the generic, high performance components. Consider separating out the higher level business logic — the source code which will probably change more often — and writing it in a higher level language which leverages the generic, high performance components. If you want the higher level language to interface directly to C then think very carefully about which language to use. Most scripting languages have the ability to call C functions and vice versa, but there can be enormous differences in speed when doing this. For example, some languages store their function names in an internal hash table, which means that if C calls a higher level function then the higher level language is going to do a hash table lookup for every call; expensive.

Problem #5: My unlocked multi-threaded algorithm is an efficient work of art but still under-performing on the network:

Also of note — and you would have thought this problem was fixed long ago — is that it can’t be taken for granted that a particular operating system and NIC will operate as efficiently as expected. For example, your average 1,000 Mbit NIC will only deliver anywhere close to 1,000 Mbit if optimally-sized, MTU-sized packets are being sent. Try to send 1,000 Mbit using smaller packets and watch system interrupt time go up while through-put goes down, to levels as low as 200 Mbit. This could be partly due to the NIC hardware, partly due to the NIC driver, and/or partly due to operating system network stack tuning. The fact is, you might only be able to tune it so high and no more. This is the point at which you might want to try a different NIC and/or a different kernel version. Always test the operating system / NIC performance independently from your concurrent code. As we have seen before, it may not even be necessary to make your code concurrent in order to fulfill the performance requirements.

Problem #6: My unlocked multi-threaded algorithm works amazingly on my 10,000 Mbit NIC on my 3 servers but took too long to develop:

Oh dear. You optimized everything so much and managed to develop something which is an order of magnitude or two faster than anything else available, and it’s running on 3 beefy server boxes with high end NICs, but it took way too long to develop. All that CPU cache line analysis and lower level coding in C took much longer than it would have in other languages. Maybe it would have been financially better to develop everything in node.js, which needs 30 servers instead of 3? This could well be the situation you find yourself in. Only your business model knows the answer to this conundrum. 27 extra servers could easily be much cheaper than paying more expensive C programmers to develop fiddly code for longer. However, if you’re expecting the business to grow e.g. 10-fold in a reasonable period of time, then maybe it’s worth paying up front for the more complicated C code, because suddenly the extra overhead of the C developers looks cheap compared to 300 – 30 = 270 extra servers for the node.js solution.

Problem #7: I’m told that the GalacticTurboNoSQLDB 2.0 is as fast as anything and the ultimate concurrent solution:

Don’t believe them! One solution says 10,000 transactions per second, while another says 100,000, another says 1,000,000, and yet another says 10,000,000 transactions per second. Always evaluate performance by creating a real world performance / load test. Don’t be afraid to make the test as real as possible. For example, if you are expecting a million TCP connections to a particular server then have the test create a million TCP connections to each server; only then are we properly testing concurrency. Then ask yourself if it could be faster and/or use less memory and/or use less disk. Examine the CPU usage during the test. If the CPU cores are not at 100% then maybe the algorithm is not optimal. If they are at 100% and it’s a network program, then determine whether the maximum NIC through-put has been reached. If the NIC through-put has not been exhausted then there’s room for improvement. Once you have tested everything and compared all test metrics and decided that concurrency and therefore performance is good, ensure that these metrics can be monitored 24/7 during live production. It may be that the GalacticTurboNoSQLDB 2.0 is blazingly fast for 7 hours and your performance test only lasted for 5 hours. Because GalacticTurboNoSQLDB 2.0 is written in Java, it seemed to work well on your 128 GB monster server until garbage collection kicked in and went on a bender for half an hour 😦 When production metrics are found to no longer reflect the carefully crafted performance tests then carefully craft the performance tests a bit more!

Not the end of threads and locking concurrency issues, but the end of this brain dump.

 

G-WAN versus SXE “Hello World” April 26, 2012

Recently I’ve been very impressed reading about the performance figures for G-WAN:
http://gwan.ch/benchmark

G-WAN has quite the licensing model, with the G-WAN binary being freeware and support costing a great deal of money:
http://gwan.ch/buy

So I decided to do a simple libsxe versus G-WAN performance test like I did for libsxe versus NGINX and libsxe versus node.js. However, for this test I decided to use the multi-threaded load tool that G-WAN’s own benchmarks use, weighttp (developed by the lighttpd project):
http://redmine.lighttpd.net/projects/weighttp/wiki

I modified the simple libsxe HTTP server to make it take advantage of multiple CPUs.

These tests were run on a Ubuntu 11.04 instance running on a dual quad core i7 processor.

First the G-WAN figures:

I don’t know why G-WAN is talking about 16 cores upon starting because my i7 only has 8!

simon@ubuntu:~/gwan_linux64-bit$ sudo ./gwan

allowed Cores: 8 (‘sudo ./gwan’ to let G-WAN use your 16 Core(s))

loading
> ‘all.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)
> ‘hello.mm’: to use Objective-C++ (*.mm) scripts, install ‘gobjc++’ (sudo apt-get install gobjc++)
> ‘loan.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)..
> ‘argv.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac).
> ‘hello.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac).
> ‘hello.m’: to use Objective-C (*.m) scripts, install ‘gobjc’ (sudo apt-get install gobjc)
> ‘report.java’: to use Java (*.java) scripts, install ‘javac’ (sudo apt-get install javac)..

G-WAN 3.3.28 (pid:3110)

simon@ubuntu:~/weighttp$ ./build/default/weighttp -n 10000000 -c 1000 -t 4 -k "http://127.0.0.1:8080/100.html"
weighttp - a lightweight and simple webserver benchmarking tool

host: ‘127.0.0.1’, port: 8080
starting benchmark…
spawning thread #1: 250 concurrent requests, 2500000 total requests
spawning thread #2: 250 concurrent requests, 2500000 total requests
spawning thread #3: 250 concurrent requests, 2500000 total requests
spawning thread #4: 250 concurrent requests, 2500000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done

finished in 61 sec, 501 millisec and 457 microsec, 162597 req/s, 59862 kbyte/s
requests: 10000000 total, 10000000 started, 10000000 done, 10000000 succeeded, 0 failed, 0 errored
status codes: 10000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 3770000000 bytes total, 2770000000 bytes http, 1000000000 bytes data

Now the libsxe figures:

simon@ubuntu:~/sxe-httpd/sxe-httpd$ ./build-linux-64-release/sxe-httpd 127.0.0.1 8080 10000
20120426 211759.525 T 10198 ------ 1 - sxe-httpd starting // detected cpus: 8
20120426 211759.525 T 10198 ------ 1 - sxe-httpd parent forking 7 times
20120426 211759.525 T 10199 ------ 1 - sxe-httpd child created
20120426 211759.525 T 10200 ------ 1 - sxe-httpd child created
20120426 211759.525 T 10201 ------ 1 - sxe-httpd child created
20120426 211759.526 T 10202 ------ 1 - sxe-httpd child created
20120426 211759.526 T 10203 ------ 1 - sxe-httpd child created
20120426 211759.526 T 10204 ------ 1 - sxe-httpd child created
20120426 211759.526 T 10205 ------ 1 - sxe-httpd child created

simon@ubuntu:~/weighttp$ ./build/default/weighttp -n 10000000 -c 1000 -t 4 -k "http://127.0.0.1:8080/100.html"
weighttp - a lightweight and simple webserver benchmarking tool

host: ‘127.0.0.1’, port: 8080
starting benchmark…
spawning thread #1: 250 concurrent requests, 2500000 total requests
spawning thread #2: 250 concurrent requests, 2500000 total requests
spawning thread #3: 250 concurrent requests, 2500000 total requests
spawning thread #4: 250 concurrent requests, 2500000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done

finished in 34 sec, 79 millisec and 878 microsec, 293428 req/s, 108316 kbyte/s
requests: 10000000 total, 10000000 started, 10000000 done, 10000000 succeeded, 0 failed, 0 errored
status codes: 10000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 3780000000 bytes total, 2780000000 bytes http, 1000000000 bytes data

Conclusion:

At 162597 versus 293428 requests per second, libsxe is significantly — 1.8 times — faster than G-WAN in this simple performance test using 8 cores. Although G-WAN calls itself the fastest web server available — and admittedly it is very fast — it obviously suffers internally from quite a bit of overhead, even on a performance test as trivial as this one. And with libsxe the CPU bottleneck is really the networking layer in the kernel… so what is G-WAN doing with all those spare CPU cycles? Looks like G-WAN might have room for optimization yet? Or maybe it’s partly due to libsxe’s fixed memory model, which does away with the unnecessary and repetitive malloc() / free() cycle? I guess we’ll never know since G-WAN is closed source.

EDIT: Since running this test we have found two potential problems with G-WAN which mean that these figures are unreliable (see thread below): (a) G-WAN’s performance seems highly tuned to particular processors but it’s supplied as a single binary executable meaning that performance tests may vary wildly, and (b) G-WAN doesn’t scale linearly as the number of cores increase even with the simplest of performance tests.

 

libsxe, shared-memory, and parallel state-driven algorithms February 27, 2011

I recently came across the following paper: “Memory Models: A Case for Rethinking Parallel Languages and Hardware” by Sarita V. Adve and Hans-J. Boehm

The paper starts off:

The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current popular languages and systems that prevent achieving the vision of structured and safe parallel programming. This paper discusses the path to the above convergence, the hard lessons learned, and their implications. …

And then introduces the idea of “disciplined shared-memory models”:

Moving forward, we believe a critical research agenda to enable “parallelism for the masses” is to develop and promote disciplined shared-memory models that:
• are simple enough to be easily teachable to undergraduates; i.e., minimally provide sequential consistency to programs that obey the required discipline;
• enable the enforcement of the discipline; i.e., violations of the discipline should not have undefined or horrendously complex semantics, but should be caught and returned back to the programmer as illegal;
• are general-purpose enough to express important parallel algorithms and patterns; and
• enable high and scalable performance.

This is interesting because libsxe has a disciplined shared-memory model which goes a long way towards fulfilling the criteria above in the form of the sxe pool library. So what is a sxe pool and how does it offer us a disciplined shared-memory model?

The sxe pool library was invented for different reasons than offering a disciplined shared-memory model. A shared-memory option was added later as a pool construction option. In short, sxe pools offer a way to create C arrays of structs with the following generic benefits:
• The size of the array is persisted
• Each element of the array gets its own state which is persisted outside the element struct in a linked list
• Each element of the array gets its own timestamp which is persisted outside the element struct in a linked list
• Each element is accessed using regular & concise C code, e.g. myarray[myindex].mystructmember

The sxe pool library caller can manipulate the state & timestamp element properties using the following API:

sxe_pool_get_number_in_state(void * array, unsigned state)
sxe_pool_index_to_state(void * array, unsigned id)
sxe_pool_set_indexed_element_state(void * array, unsigned id, unsigned old_state, unsigned new_state)
sxe_pool_set_oldest_element_state(void * array, unsigned old_state, unsigned new_state)
sxe_pool_get_oldest_element_index(void * array, unsigned state)
sxe_pool_get_oldest_element_time(void * array, unsigned state)
sxe_pool_get_element_time_by_index(void * array, unsigned element)
sxe_pool_touch_indexed_element(void * array, unsigned id)

Converting the sxe pool library API to support shared memory was relatively simple. The sxe_pool_new() function got an option to share the pool memory, and the API functions which change the pool element state use atomic assembler instructions if the pool was constructed as a shared-memory pool. It’s also interesting to note that sxe pools can be shared between processes as well as between threads in the same process. This is because the sxe pool library’s internal implementation avoids absolute pointers, which is also something that I encourage from libsxe developers and C developers in general.
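
To illustrate the ‘no absolute pointers’ point (this is not the actual sxe pool internals, just a toy): if linked structures store element indexes relative to the array base instead of raw pointers, the same shared block stays valid no matter which virtual address each process happens to map it at.

/* Toy linked list which stores element *indexes* instead of pointers. Not
 * the real sxe pool internals, just an illustration of the principle: the
 * links stay valid even though every process maps the shared block at a
 * different virtual address, because they are relative to the array base. */

#define LIST_END 0xFFFFFFFFu

typedef struct node {
    unsigned next;          /* index of the next node, not a pointer to it */
    unsigned state;
    /* ... caller's payload ... */
} node;

/* 'pool' is wherever this particular process happened to map the shared block. */
static unsigned walk_state_count(const node *pool, unsigned head, unsigned wanted_state)
{
    unsigned count = 0;

    for (unsigned i = head; i != LIST_END; i = pool[i].next)
        if (pool[i].state == wanted_state)
            count++;

    return count;
}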

This API is “general-purpose enough to express important parallel algorithms and patterns” and, most interestingly, it is the same API whether the algorithm is threaded or not. It’s also “simple enough to be easily teachable to undergraduates” or even junior developers, as we have found out at Sophos. The atomic sxe_pool_set_[indexed|oldest]_element_state() API functions “enable the enforcement of the discipline” by requiring both the old state and the new state of the array element; if the developer supplies the wrong old state then sxe pool will assert. Because the sxe pool library manages the element states itself, an assert is very unlikely when using a single pool. However, more complicated algorithms often make use of chains of pools in order to implement multiplexing and/or combining of parallel results, etc. In these cases, it is common to keep references to pool array element indexes and/or pool array element states in the caller-supplied pool element structs. Finally, by implementing algorithms using the sxe pool API it is possible to “enable high and scalable performance” using a minimum of simple-to-understand C source code. The developer is forced into thinking about the algorithm as a state model, which often simplifies the hardest problems. And the generic part of the implementation complexity — e.g. locking, shared memory, keeping state, doubly linked lists, timeout handling, & memory allocation — is all handled by the sxe pool library and backed by automated tests with 100% library code coverage. The resulting performance is excellent, as can be seen from the figures published in earlier blog entries; tens of thousands of network requests per second per core.
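
As a flavour of what the state-model style looks like in practice, here is a hedged sketch built on the API listed above; the state enum, the struct, the header name, and the assumption that sxe_pool_get_oldest_element_index() returns a plain array index are mine, and pool construction is omitted, so treat it as pseudo-libsxe rather than copy-paste code.

#include "sxe-pool.h"   /* assumed header name */

typedef enum { STATE_FREE = 0, STATE_READING, STATE_PROCESSING, STATE_DONE } job_state;

typedef struct job {
    int      fd;
    unsigned bytes_read;
    /* ... caller's own fields ... */
} job;

static job *jobs;       /* = sxe_pool_new(...); construction omitted */

static void on_new_connection(int fd)
{
    /* Claim the element that has been FREE the longest. The old-state/new-state
     * pair is the discipline: if our idea of the element's current state is
     * wrong, the pool asserts instead of silently corrupting its state lists. */
    unsigned id = sxe_pool_get_oldest_element_index(jobs, STATE_FREE);

    sxe_pool_set_indexed_element_state(jobs, id, STATE_FREE, STATE_READING);
    jobs[id].fd         = fd;    /* plain, concise C access to the element */
    jobs[id].bytes_read = 0;
}

static void on_read_complete(unsigned id)
{
    sxe_pool_set_indexed_element_state(jobs, id, STATE_READING, STATE_PROCESSING);
}

static void on_timeout_sweep(void)
{
    /* The pool keeps per-element timestamps, so finding the job stuck longest
     * in STATE_READING is one call; comparing its timestamp against a timeout
     * threshold is elided here. */
    if (sxe_pool_get_number_in_state(jobs, STATE_READING) > 0) {
        unsigned id = sxe_pool_get_oldest_element_index(jobs, STATE_READING);
        /* ... inspect sxe_pool_get_element_time_by_index(jobs, id) ... */
        sxe_pool_set_indexed_element_state(jobs, id, STATE_READING, STATE_FREE);
    }
}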

As you can see, the sxe pool is an incredibly powerful, code-saving and code-simplifying generic data structure. It’s a sort of Swiss Army knife for parallel, event driven algorithms. In a future article I’ll show some of the implementation patterns.

 

Thought Experiment: Scaling Facebook January 30, 2011

I was recently asked the question:

“Let’s imagine it’s circa 2004 and we’ve just launched a website that is about to evolve into Facebook. We’re “pre-architecture” – we have a single instance of MySQL running on a linux machine out of a dorm room at Harvard. The site starts to take off. Status updates are hitting the database at ferocious pace. Now what???!!! If we don’t do something soon, the system will crash. What kinds of caching layers do we build, and how will they interact with the database backend? Do we transition to a colo, or keep everything in house? Let’s see if we can put our heads together and think through Facebook’s many scaling challenges, all the way from the ground up …”

Here’s my reply:

I love this thought experiment 🙂 It’s one of the things which I dream about when I go to bed at night and which my mind first thinks about when I wake up in the morning. No kidding. I also had to architect and lead the development of a smaller cloud system for my employer which scales up to the tens of millions of Sophos users… it would scale further but we don’t have e.g. one billion users! But I dream about much more complex clouds which scale to a billion or more users… even though such systems don’t exist yet. I believe that to create such systems we have to go back to computer science fundamentals and be prepared to think outside the box if necessary. For example, I love the emerging NoSQL technology… but all of the systems are flawed in some way, and performance varies enormously, by orders of magnitude. Few off-the-shelf technologies appear to come close to BerkeleyDB performance, which is a disappointment.

Back to scaling Facebook: One of the fundamental problems is striving to use the smallest number of servers while serving the largest population of users. The cost of the servers and the server traffic is going to play an enormous role in the business plan. Therefore I need to think in terms of creating servers or groups of servers which act as building blocks which can be duplicated in order to scale the cloud as large as required. Because we’ll be scaling up to hundreds or thousands of servers, it becomes cost effective to develop efficient technology which doesn’t exist yet. For example, the sexy new node.js technology is wonderful for prototyping cloud services or running clouds which don’t have to be scaled too big. However, node.js uses nearly ten times more memory than, say, C does (https://simonhf.wordpress.com/2010/10/01/node-js-versus-sxe-hello-world-complexity-speed-and-memory-usage/) for the same task… and this may mean having to use ten times more servers. So for a monster cloud service like Facebook I’d forget about all the scripting language solutions like node.js and Ruby On Rails, etc. Instead I’d go for the most efficient solution, which also happens to be the same challenge that my employer gave me: achieve the most efficient run-time performance per server while allowing as many servers as necessary to work in parallel. This can be done efficiently by using the C language mixed with kernel asynchronous events.

However, in order to make working in C fast and productive, some changes need to be made. The C code needs to work with a fixed memory architecture — almost unheard of in the computing world. This is really thinking out of the box. Without the overhead of constantly allocating memory and/or garbage collecting, the C code becomes faster than normally imaginable. The next thing is to make the development of the C code much faster… nearing the speed of development of scripting languages. Some of the things which make C development slow are the constant editing of header files and makefiles, so I designed a system where this is largely done automatically. Next, C pointers cause confusion for programmers young and old, so I removed the necessity to use pointers a lot in regular code. Another problem is how to develop protocols, keep state, and debug massively parallel and asynchronous code while keeping the code as concise and readable as possible. These problems have also been solved.

In short, Facebook helped to solve their scaling problem by developing HipHop technology which allowed them to program in PHP and compile the PHP to ‘highly optimized C++’. According to the blurb, compiled PHP runs 50% faster, so in theory this also reduces the number of servers necessary by 50%. My approach is from the other direction: make programming in C so comfortable that using a scripting language isn’t necessary. Also, use C instead of C++, because C++ (generally) relies on dynamic memory allocation, which is an unnecessary overhead at run-time. Languages which support dynamic memory allocation are great for general purpose programs which are generally not designed to use all the memory on a box. In contrast, in our cloud we will have clusters of servers running daemons which already have all the memory allocated at run-time that they will ever use. So there is no need, for example, to have any ‘swap’. If a box has, say, 16GB RAM then a particular daemon might be using 15.5GB of that RAM all the time. This technique also has some useful side-effects; we don’t have to worry about garbage collection or ever debug memory leaks because the process memory does not change at run-time. Also, DDOS attacks will not send the servers into some sort of unstable, memory swap nightmare. Instead, as soon as the DDOS passes then everything immediately returns to business as usual without instability problems etc. So being able to rapidly develop the Facebook business logic in C using a fixed memory environment is going to enable the use of fewer servers (because of faster code using less memory) and result in a commercial advantage.

But there is still a problem: Where do we store all the data that all the Facebook users are generating? Any SQL solution is not going to scale cost effectively. The NoSQL offerings also have their own limitations. Amazon storage looks good, but can I do it cheaper myself? Again, I’d create my own technology, which is a hierarchical, distributed, redundant hash table. But unlike a big virtual hard disk, the HDRHT can store both very small and very large files without being wasteful, and we can make it bigger as fast as we can hook up new drives and servers. The files would be compressed, chunked, and stored via SHA1 (similar-ish to Venti although more efficient; http://doc.cat-v.org/plan_9/4th_edition/papers/venti/) in order to avoid duplication of data. I’d probably take one of these (http://www.engadget.com/2009/07/23/cambrionix-49-port-usb-hub-for-professionals-nerds/) per server and connect 49 2TB external drives to it, giving 49TB of redundant data per server (although in deployment the data would be redundant across different servers in different geographic locations). There would be 20 such servers to provide one Petabyte of redundant data, 200 for ten Petabytes, etc. Large parts of the system can be and should be in a colo in order to keep the costs minimal. Other parts need to be in-house. The easy way to build robust & scalable network programs without compromising run-time performance or memory usage is called SXE (https://simonhf.wordpress.com/2010/10/09/what-is-sxe/) and is a work-in-progress recently open-sourced by my employer, Sophos. Much of what I’ve written about exists right now. The other stuff is right around the corner… 🙂
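
As a tiny illustration of what the fixed memory architecture means in practice (names and sizes below are made up for the example, not from any real Sophos or Facebook system): everything the daemon will ever use is sized and allocated once at start-up, and after that there are no malloc()/free() calls to fragment memory, leak, or need garbage collecting.

/* What a fixed-memory daemon start-up can look like: every structure the
 * process will ever use is sized from configuration and allocated exactly
 * once, before serving traffic. Names and sizes are illustrative only. */
#include <stdlib.h>

#define MAX_CONNECTIONS 1000000u
#define MAX_SESSIONS     500000u

typedef struct connection { int fd; unsigned session_id; char read_buf[4096]; } connection;
typedef struct session    { unsigned user_id; unsigned state; } session;

static connection *connections;
static session    *sessions;

int fixed_memory_init(void)
{
    connections = calloc(MAX_CONNECTIONS, sizeof(*connections)); /* several GB in this toy sizing */
    sessions    = calloc(MAX_SESSIONS,    sizeof(*sessions));

    if (connections == NULL || sessions == NULL)
        return -1;    /* refuse to start rather than degrade unpredictably later */

    return 0;         /* from here on, no further allocation ever happens */
}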

 

node.js versus Lua “Hello World” — Postmortem October 23, 2010


Last time I performance tested an experimental version of the soon to be released SXELua against node.js. What I found is that SXELua is 2.9 times faster with the simple “Hello World” test. However, what I didn’t mention is why SXELua is so much faster. When first running the test with a very early version of SXELua, the results were about half as good as the node.js+net+crcr results! The reason for this is the way that Lua handles its strings. During the simple “Hello World” test, HTTP queries are read using the following code path: libev C code calls SXE C code which read()s the HTTP query string and then passes the string to the Lua read event handler. During this last step Lua ‘internalizes’ the string. This means that Lua does a quick hash (alert: more expensive operation) over the string, allocates memory for it, and copies it… and later, unfortunately, garbage collects it too. All these operations take lots of CPU.

The simple “Hello World” benchmark sends 500,000 HTTP query strings which are each 199 bytes. So you can imagine how even heavily optimized string internalization code might get bogged down with such a large number of strings to internalize and eventually garbage collect. In fact, take any small overhead and multiply it by 500,000 and suddenly it becomes a large overhead. So in the end Neil changed SXELua so that Lua’s string handling code isn’t used and instead Lua calls back out to SXE C code to do all its string handling. This works very fast because C is very fast, calling C from Lua or vice versa is very fast, and SXE itself uses neither malloc() nor dynamic memory garbage collection techniques while processing queries. So does this mean that SXELua is limited to only using an if… then… call subset of Lua syntax? Yes! And if you think about it then this makes perfect sense. Why? Because it’s faster to do the generic ‘heavy lifting’ — for example, string — operations in C than it is in Lua… especially if this avoids internalization and garbage collection. The part that I’d like to rapidly code in Lua is the lighter-weight, non-generic, business logic of the program.

In the end this works out really well and SXELua is 2.9 times faster than the node.js simple “Hello World” test. This is probably mainly because node.js has three different types of overhead in this particular test: 1. JavaScript string manipulation code is slower than C string manipulation code and this difference gets multiplied 500,000 times. 2. JavaScript creates 500,000 strings. 3. JavaScript garbage collects 500,000 strings. So it’s easy to see why the SXELua and SXE simple “Hello World” tests are 2.9 times and 3.4 times faster respectively than the node.js equivalent.

 

node.js versus Lua “Hello World” October 13, 2010

Neil Watkiss — known among other things for many cool Perl modules — has created a non-optimized, experimental version of SXE (pronounced ‘sexy’) containing embedded Lua, called SXELua. So I thought it would be fun to redo the familiar – to readers of this blog – ‘Hello World’ benchmark using SXELua. And here is the Lua source code:

do
    local connect = function (sxe) end
    local read = function (sxe, content)
        if content:match("\r\n\r\n", -4) then sxe_write(sxe,"HTTP/1.0 200 OK\r\nConnection: Close\r\nContent-Type: text/html\r\nContent-Length: 14\r\n\r\nHello World\n\r\n")
        end
    end
    local close = function (sxe) end
    sxe_register(10001, function () sxe_listen(sxe_new_tcp("127.0.0.1", 8000, connect, read, close)) end)
end

Compare this with the slightly longer node.js equivalent from the last blog:

var net = require('net');
var server = net.createServer(function (stream) {
  stream.on('connect', function () {});
  stream.on('data', function (data) {
    var l = data.length;
    if (l >= 4 && data[l - 4] == 0xd && data [l - 3] == 0xa && data[l - 2] == 0xd && data[l - 1] == 0xa) {
      stream.write('HTTP/1.0 200 OK\r\nConnection: Keep-Alive\r\nContent-Type: text/html\r\nContent-Length: 13\r\n\r\nHello World\r\n');
    }
  });
  stream.on('end', function () {stream.end();});
});
server.listen(8000, 'localhost');

And now the updated results:

‚ÄúHello World‚Ä̬†¬†¬† Queries/ % Speed
Server           Second   of SXE
---------------- -------- -------
node.js+http     12,344    16%
Node.js+net+crcr 23,224    30% <-- *1
Node.js+net      28,867    37%
SXELua           66,731    85%
SXE              78,437   100%

In conclusion, calling Lua functions from C and vice-versa is very fast… close to the speed of C itself. I am very excited by how well Lua performed in the benchmark. The Lua “Hello World” program performed 3.6 times better than the node.js equivalent. After a quick Google it looks like this isn’t the first time that JavaScript V8 has gone up against Lua; these results suggest that SXELua could get even faster after optimization. It looks like Lua will become part of SXE soon. Lua seems ideal for creating tests for SXE & SXELua programs alike, and prototyping programs. Stay tuned…!

*1 Update: Somebody who knows JavaScript better than me offered faster code to detect the “\r\n\r\n”. I updated the script above along with the resulting queries per second and % speed of SXE.

 

Nginx versus SXE “Hello World” October 2, 2010

After my last post, a colleague offered the criticism that comparing C to a scripting language is a bit like shooting fish in a barrel 🙂 I think the colleague missed the point, which is that often the main reason for choosing a scripting language in the first place is to achieve rapid application development at the expense of run-time performance and memory usage. The purpose of the post was to try to dispel this myth and show how few lines of C source code can be necessary to achieve ultimate performance. However, in order to keep the colleague happy, here is a similar head to head between nginx and SXE. What is nginx? Here’s what Wikipedia says about nginx: “Nginx quickly delivers static content with efficient use of system resources.” Now on with the “Hello World” comparison…

Here is the nginx.conf:

# cat /etc/nginx/nginx.conf
worker_processes  1;
events {
    worker_connections  10240;
}
http {
    server {
        listen 8000;
        access_log off;
        server_name  localhost;
        location / {
            root   html;
            index  index.html index.htm;
        }
    }
}

 And here is the index.html file:

# cat /usr/html/index.html
Hello World

I use the same http.c from the previous post in order to load test nginx. Here are the results:

# ./http -i 127.0.0.1 -p 8000 -n 50 -c 10000
20101002 181142.250 P00006a5f ------ 1 - connecting via ramp 10000 sockets to peer 127.0.0.1:8000
20101002 181142.290 P00006a5f    999 1 - connected: 1000
20101002 181142.328 P00006a5f   1999 1 - connected: 2000
20101002 181142.367 P00006a5f   2999 1 - connected: 3000
20101002 181142.406 P00006a5f   3999 1 - connected: 4000
20101002 181142.445 P00006a5f   4999 1 - connected: 5000
20101002 181142.484 P00006a5f   5999 1 - connected: 6000
20101002 181142.523 P00006a5f   6999 1 - connected: 7000
20101002 181142.562 P00006a5f   7999 1 - connected: 8000
20101002 181142.602 P00006a5f   8999 1 - connected: 9000
20101002 181142.641 P00006a5f   9999 1 - connected: 10000
20101002 181142.641 P00006a5f ------ 1 - starting writes: 500000 (= 10000 sockets * 50 queries/socket) queries
20101002 181142.641 P00006a5f ------ 1 - using query of 199 bytes:
20101002 181142.641 P00006a5f ------ 1 - 080552a0 47 45 54 20 2f 31 32 33 34 35 36 37 38 39 2f 31 GET /123456789/1
20101002 181142.641 P00006a5f ------ 1 - 080552b0 32 33 34 35 36 37 38 39 2f 31 32 33 34 35 36 37 23456789/1234567
20101002 181142.641 P00006a5f ------ 1 - 080552c0 38 39 2f 31 32 33 34 35 36 37 38 39 2f 31 32 33 89/123456789/123
20101002 181142.641 P00006a5f ------ 1 - 080552d0 34 35 36 37 38 39 2f 31 32 33 34 35 36 37 38 39 456789/123456789
20101002 181142.641 P00006a5f ------ 1 - 080552e0 2f 31 32 33 34 35 36 37 38 39 2f 31 32 33 34 35 /123456789/12345
20101002 181142.641 P00006a5f ------ 1 - 080552f0 36 37 2e 68 74 6d 20 48 54 54 50 2f 31 2e 31 0d 67.htm HTTP/1.1.
20101002 181142.641 P00006a5f ------ 1 - 08055300 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 .Connection: Kee
20101002 181142.641 P00006a5f ------ 1 - 08055310 70 2d 41 6c 69 76 65 0d 0a 48 6f 73 74 3a 20 31 p-Alive..Host: 1
20101002 181142.641 P00006a5f ------ 1 - 08055320 32 37 2e 30 2e 30 2e 31 3a 38 30 30 30 0d 0a 55 27.0.0.1:8000..U
20101002 181142.641 P00006a5f ------ 1 - 08055330 73 65 72 2d 41 67 65 6e 74 3a 20 53 58 45 2d 68 ser-Agent: SXE-h
20101002 181142.641 P00006a5f ------ 1 - 08055340 74 74 70 2d 6c 6f 61 64 2d 6b 65 65 70 61 6c 69 ttp-load-keepali
20101002 181142.641 P00006a5f ------ 1 - 08055350 76 65 2f 31 2e 30 0d 0a 41 63 63 65 70 74 3a 20 ve/1.0..Accept:
20101002 181142.641 P00006a5f ------ 1 - 08055360 2a 2f 2a 0d 0a 0d 0a                            */*....
20101002 181202.794 P00006a5f   9128 1 - read all expected http responses
20101002 181202.794 P00006a5f   9128 1 - time for all connections: 0.391057 seconds or 25571.718778 per second
20101002 181202.794 P00006a5f   9128 1 - time for all queries    : 20.152358 seconds or 24810.992567 per second
20101002 181202.794 P00006a5f   9128 1 - time for all            : 20.543415 seconds or 24338.699486 per second

Where nginx manages 25,571 connections per second, the SXE implementation manages 25,009 connections per second; a performance tie. Further, where nginx manages 24,810 queries per second, the SXE implementation manages 59,171 queries per second; a 2.4-fold increase. This is an especially great result for SXE because there is still scope for optimizing its code further.

During the test I also monitored memory usage of both the client and server processes:

# top -b -d1 | egrep "(nginx|http)"
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 44712 5160  692 S    0  0.1   0:18.46 nginx
27231 root      15   0 18064  16m  516 R   79  0.4   0:00.79 http
27216 nobody    16   0 57468  17m  692 R   67  0.4   0:19.13 nginx
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    17   0 60064  20m  692 R   98  0.5   0:20.12 nginx
27231 root      15   0 18064  16m  516 S   58  0.4   0:01.37 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    19   0 60064  20m  692 R   96  0.5   0:21.09 nginx
27231 root      15   0 18064  16m  516 R   68  0.4   0:02.05 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    20   0 60064  20m  692 R   97  0.5   0:22.07 nginx
27231 root      15   0 18064  16m  516 R   64  0.4   0:02.69 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    24   0 60064  20m  692 R   97  0.5   0:23.05 nginx
27231 root      15   0 18064  16m  516 R   66  0.4   0:03.35 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   86  0.5   0:23.91 nginx
27231 root      15   0 18064  16m  516 R   42  0.4   0:03.77 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:24.90 nginx
27231 root      15   0 18064  16m  516 R   50  0.4   0:04.27 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:25.90 nginx
27231 root      15   0 18064  16m  516 R   50  0.4   0:04.77 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R  100  0.5   0:26.91 nginx
27231 root      15   0 18064  16m  516 R   50  0.4   0:05.27 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:27.91 nginx
27231 root      15   0 18064  16m  516 R   54  0.4   0:05.81 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:28.91 nginx
27231 root      15   0 18064  16m  516 R   53  0.4   0:06.34 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:29.91 nginx
27231 root      15   0 18064  16m  516 R   50  0.4   0:06.84 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   99  0.5   0:30.90 nginx
27231 root      15   0 18064  16m  516 R   51  0.4   0:07.35 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R  100  0.5   0:31.91 nginx
27231 root      15   0 18064  16m  516 R   50  0.4   0:07.85 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 D   54  0.5   0:32.45 nginx
27231 root      15   0 18064  16m  516 S   28  0.4   0:08.13 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    19   0 60064  20m  692 R   89  0.5   0:33.35 nginx
27231 root      15   0 18064  16m  516 S   61  0.4   0:08.74 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    21   0 60064  20m  692 R   97  0.5   0:34.33 nginx
27231 root      15   0 18064  16m  516 R   66  0.4   0:09.40 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    22   0 60064  20m  692 R   68  0.5   0:35.01 nginx
27231 root      15   0 18064  16m  516 R   34  0.4   0:09.74 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   98  0.5   0:36.00 nginx
27231 root      15   0 18064  16m  516 S   66  0.4   0:10.40 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 60064  20m  692 R   97  0.5   0:36.98 nginx
27231 root      15   0 18064  16m  516 R   52  0.4   0:10.92 http
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27216 nobody    25   0 44712 5160  692 S   63  0.1   0:37.61 nginx
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx
27215 root      25   0 40788  836  344 S    0  0.0   0:00.00 nginx

Unlike SXE, nginx uses a dynamic memory model, and top shows me that its peak memory usage is only 20MB, which is very similar to SXE’s peak memory usage of 16MB; another tie.

In conclusion, if you’re planning to serve static content and CPU is your bottleneck then using nginx could cause you to employ up to 2.4 times as many servers as if you had implemented with SXE. It would be interesting to create a real static content delivery system using SXE and post a more realistic head to head comparison. If anybody has ideas on what the more realistic head to head comparison might look like then please comment below.

 

 