SimonHF's Blog

Just another WordPress.com site

Thought Experiment: Scaling Facebook January 30, 2011

I was recently asked the question:

“Let’s imagine it’s circa 2004 and we’ve just launched a website that is about to evolve into Facebook. We’re “pre-architecture” – we have a single instance of MySQL running on a linux machine out of a dorm room at Harvard. The site starts to take off. Status updates are hitting the database at ferocious pace. Now what???!!! If we don’t do something soon, the system will crash. What kinds of caching layers do we build, and how will they interact with the database backend? Do we transition to a colo, or keep everything in house? Let’s see if we can put our heads together and think through Facebook’s many scaling challenges, all the way from the ground up …”

Here’s my reply:

I love this thought experiment 🙂 It’s one of the things which I dream about when I go to bed at night and which my mind first thinks about when I wake up in the morning. No kidding. I also had to architect and lead develop a smaller cloud system for my employer which scales up to the tens of millions of Sophos users… it would scale further but we don’t have e.g. one billion users! But I dream about much more complex clouds which scale to a billion or more users… even though such systems don’t exist yet. I believe that to create such systems we have to go back to computer science fundamentals and be prepared to think outside the box if necessary. For example, I love the emerging NoSQL technology… but all of the systems are flawed in some way, and performance varies enormously but orders of magnitude. Few off the shelf technologies appear to come close to BerkeleyDB performance which is a disappointment. Back to scaling Facebook: One of the fundamental problems is striving to use the smallest number of servers while serving the largest population of users. The cost of the servers and the server traffic is going to play an enormous role in the business plan. Therefore I need to think in terms of creating servers or groups of servers which act as building blocks which can be duplicated in order to scale the cloud as large as required. Because we’ll be scaling up to using hundreds or thousands of servers, then it becomes cost effective to develop efficient technology which doesn’t exist yet. For example, the sexy new node.js technology is wonderful for prototyping cloud services or running clouds which don’t have to be scaled too big. However, node.js uses nearly ten times more memory than, say, C does (https://simonhf.wordpress.com/2010/10/01/node-js-versus-sxe-hello-world-complexity-speed-and-memory-usage/) for the same task… and this may mean having to use ten times more servers. So for a monster cloud service like Facebook then I’d forget about all the scripting language solutions like node.js and Ruby On Rails, etc. Instead I’d go for the most efficient solution which also happens to be the same challenge that my employer gave me; to achieve the most efficient run-time performance per server while allowing as many servers as necessary to work in parallel. This can be done efficiently by using the C language mixed with kernel asynchronous events. However, in order to make working in C fast and productive then some changes need to be made. The C code needs to work with a fixed memory architecture — almost unheard of in the computing world. This is really thinking out of the box. Without the overhead of constantly allocating memory and/or garbage collecting then the C code becomes faster than normally imaginable. The next thing is to make the development of the C code much faster… nearing the speed of development of scripting languages. Some of the things which make C development slow is the constant editing of header and makefiles. So I designed a system where this is largely done automatically. Next, C pointers cause confusion for programmers young and old so I removed the necessity to use pointers a lot in regular code. Another problem, is how to develop protocols, keep state, and debug massively parallel and asynchronous code while keeping the code as concise and readable as possible. Theses problem have also been solved. In short, Facebook helped to solve their scaling problem by developing HipHop technology which allowed them to program in PHP and compile the PHP to ‘highly optimized C++’. According to the blurb then compiled PHP runs 50% faster. So in theory this reduces the number of servers necessary also by 50%. My approach is from the other direction; make programming in C so comfortable that using a scripting language isn’t necessary. Also, use C instead of C++ because C++ (generally) relies on dynamic memory allocation which is also an unnecessary overhead at run-time. Languages which support dynamic memory allocation are great for general purpose programs which are generally not designed to use all the memory on a box. In contrast, in our cloud we will have clusters of servers running daemons which already have all the memory allocated at run-time that they will ever use. So there is no need for example to have any ‘swap’. If a box has, say, 16GB RAM then a particular daemon might be using 15.5GB of that RAM all the time. This technique also has some useful side-effects; we don’t have to worry about garbage collection or ever debug memory leaks because the process memory does not change at run-time. Also, DDOS attacks will not send the servers into some sort of unstable, memory swap nightmare. Instead, as soon as the DDOS passes then everything immediately returns to business as usual without instability problems etc. So being able to rapidly develop the Facebook business logic in C using a fixed memory environment is going to enable the use of less servers (because of faster code using less memory) and result in a commercial advantage. But there is still a problem: Where do we store all the data that all the Facebook users are generating? Any SQL solution is not going to scale cost effectively. The NoSQL offerings also have their own limitations. Amazon storage looks good but can I do it cheaper myself? Again, I create an own technology which is a hierarchical, distributed, redundant hash table. But unlike a big virtual hard disk, the HDRHT can store both very small and very large files without being wasteful and we can make it bigger as fast as we can hook up new drives and servers. The files would be compressed and chunked and stored via SHA1 (similar-ish to Venti although more efficient; http://doc.cat-v.org/plan_9/4th_edition/papers/venti/) in order to avoid duplication of data. I’d probably take one of these (http://www.engadget.com/2009/07/23/cambrionix-49-port-usb-hub-for-professionals-nerds/) per server and connect 49 2TB external drives to it, giving 49TB of of redundant data per server (although in deployment the data would redundant across different servers in different geographic locations). There would be 20 such servers to provide one Petabyte of redundant data, 200 for ten Petabytes, etc. Large parts of the system can be and should be in a Colo in order to keep the costs minimal. Other parts need to be in-house. The easy way to build robust & scalable network programs without comprising run-time performance or memory usage is called SXE (https://simonhf.wordpress.com/2010/10/09/what-is-sxe/) and is a work-in-progress recently open-sourced by my employer, Sophos. Much of what I’ve written about exists right now. The other stuff is right around the corner… 🙂

Advertisements
 

2 Responses to “Thought Experiment: Scaling Facebook”

  1. jerng Says:

    Hi there, I’ve been very interested in the low-level abstraction approach that you’re taking with SXE. It’s encouraging to note that Sophos as open-sourced this. How do I peek in / get involved?

    • simonhf Says:

      Thanks for taking an interest. It’s a reasonable size and growing body of code and there are lots of edge points for possible involvement. Feel free to contact me at simonhf at gmail dot com to chat 🙂


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s