Thursday, 20 November 2008

QCON SF 08: Tim Bray on storage and persistence trends

The keynote this morning was given by Tim Bray, a bit of a guru in the Unix and web development scene, who helped write the original XML spec, among many other things.

Some quick notes before getting to the meat of his talk:
  • The Drizzle db project is worth keeping an eye on – a key MySQL committer forked MySQL's code to focus on the most minimal SQL engine possible: one date type, one float type, no triggers, as simple as possible, but fast and reliable. The idea is so compelling that people are apparently running it in production even though it's barely in alpha!

  • “Column oriented databases” are about to see their day – BigTable in Google AppEngine is probably the best indicator at the moment

  • CouchDB has just become a top-level Apache project – the author is now employed by IBM, and the project is going really well. I know Dirk and the guys back at the ranch have been looking at the product, so this is good news. Some quick CouchDB facts:
    • REST-based, built in Erlang
    • uses the “eventually consistent” model
    • it has a nifty way of using MapReduce functions on the server to do views! (which could even be adapted to do "stored procedure" type functionality I guess)
    • HTTP is the only access protocol - “the most debugged protocol on the internet”
    So it sounds like CouchDB is here to stay. Good news.

  • AtomPub vs WebDAV: performance is always questioned, but Bray is building an AtomPub server as an Apache module, mod_atom, that seems to perform pretty well, and he hasn't even started optimising it yet. Sounds like mod_atom is something else to keep an eye on.

  • Facebook gets 90,000 transactions/sec using memcached! (that's good... very good)
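
To make the CouchDB view idea from the notes above a bit more concrete, here is a rough sketch of the map/reduce pattern in plain Python. It is only an illustration of the concept: real CouchDB views are written as functions stored on the server and queried over HTTP, and the documents, field names, and helper functions below are all made up for the example.

```python
from collections import defaultdict

# Hypothetical documents, standing in for JSON documents stored in a database.
docs = [
    {"_id": "a", "type": "post", "author": "tim", "words": 300},
    {"_id": "b", "type": "post", "author": "dirk", "words": 120},
    {"_id": "c", "type": "comment", "author": "tim", "words": 15},
    {"_id": "d", "type": "post", "author": "tim", "words": 200},
]


def map_posts_by_author(doc):
    """Map step: emit (key, value) pairs for each document of interest."""
    if doc.get("type") == "post":
        yield doc["author"], doc["words"]


def reduce_sum(values):
    """Reduce step: collapse all values emitted for one key into one result."""
    return sum(values)


def build_view(documents, map_fn, reduce_fn):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Keys are sorted here so the result has a stable order.
    return {key: reduce_fn(grouped[key]) for key in sorted(grouped)}


view = build_view(docs, map_posts_by_author, reduce_sum)
print(view)  # {'dirk': 120, 'tim': 500}
```

The point is that the "view" is defined by two small functions rather than by a query language, which is why it could plausibly be bent towards stored-procedure-style work.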

The main guts of the talk was a walk through the different layers of storage required by modern computer systems, in order of performance, fastest first:
  1. registers on a CPU,
  2. Local cache (l-cache) in the processor,
  3. DRAM on the server,
  4. distributed hash table accessed over a network (eg memcached),
  5. solid-state storage (ie Flash memory),
  6. magnetic disk (or as Tim called it, "spinning rust"),
  7. tape (which as Tim reminded us is used more than ever due to regulations like Sarbanes-Oxley requiring everyone to keep everything practically forever)
The news here is (a) a validation of our approach at the BBC's Forge project, where we use memcached as a critical part of the scaling infrastructure for dynamic publishing, as do a growing number of others such as Facebook, Yahoo and many more, and (b) the introduction of solid-state storage to the list -- and so high up in the list!

But the thing that really got Tim excited was not just his impressive figures on how much faster solid state could be on the right filesystem (which was a bit of an ad for a new server released by his employer, Sun), but the fact that SSD has Moore's law on its side: unlike "spinning rust", SSD is all silicon, so its price/performance will only improve over time.

As Tim says, "Ladies and gentlemen, you are looking at the future."

Note: For the business students that might stumble upon this blog, here's a reward for reading through all that techy stuff: SanDisk owns a great many patents in solid-state storage, and they were strong enough to shrug off Samsung's offer a couple of months ago, so they could be an interesting stock to watch as solid-state disks become a key part of more and more high-end computer systems. But they're going down right now, and they might not have hit bottom yet, as it looks like all the analysts are downgrading them one by one (not to be taken as investment advice blah blah).

Tim's slides (warning: the /tmp/ in the URL gives the indication that they may not be there forever...)