Tuesday 13 January 2009

QCon SF 2008 Day Three - nothing like a timely post

I mean that literally. This is nothing like a timely post. Still, better than no post at all, right?

Day Three was my volunteering exercise -- hey I was a starving student at the time, so I got a free student rego in exchange for helping out with conference organisation, collecting feedback forms, making sure the speakers were actually in the rooms speaking to people etc -- not a bad gig really if you choose the right room. I missed out on the "architectures you've always wondered about" track but I heard most of those guys in 2007 anyway, so I chose the Data Storage Rethinking: Document Oriented Distributed Databases track which I was very happy about -- it was fascinating and very useful for me.

Some notes only barely converted from my rough typing in between pressing the little clicker to count people going in and out of the room:

  • A column-based bigtable clone
  • GPLed
  • Stores history of everytyhing– even deletes are just stored as new entries with a flag
  • Splits tables automatically across machines if you need to
  • Instrumentation for monitoring etc not there yet for 1.0 (jan/feb next year) (note as of blog posting date: it's at 0.9.2 right now, getting there...)
  • In 1.1 master-slave communication will work much better, including intelligent resource allocation
  • Keeps a write-ahead commit log as well as the data store, so can recover from failure if written to a distributed FS
  • Has "Hyperspace" distributed lock manager – equivalent to “chubby” at google (whatever that is?! presumably somebody reading this knows...)
  • -> currently a SPOF but will have “some form of replication” by release
  • Can run on any distributed FS: hadoop HDFS, KFS (Kosmos FS) etc
  • All communication is asynchronous
  • Languages: C++ plus Thrift bindings which will expose java, python, PHP etc... Release containing this stuff will come out in a few weeks
  • Concurrency: “it uses MVCC”, he skipped it.. What does this mean?? (Wikipedia tells me it's "multi-version concurrency control" which is used by CouchDB, BerkeleyDB, MySQL/InnoDB etc)
  • Achieved over 1m inserts/sec on AOL test data (1TB of 30-byte query log rows -- ie almost pathological but good for certain use cases)
  • Google has “megatable”, abstraction layer on top of bigtable — hypertable will have an equivalent eventually
  • Have their own communication protocol
  • From the guys behind homeaway.com – Bryon Jacob and chris Berry
  • Took Abdera from apache to build their own framework
  • Added Atom Publishing Protocol extensions, eg
  • open search - google
  • paging – mark nottingham - rfc5005
  • “atom store” - get, put, edit, search via APP – canonical example is gdata
  • Uses Abdera which graduated from the Apache incubator this week and will go 1.0 very soon
  • Provides a solid, scalable, etc implementation on top of Abdera
  • APP Spec doesn’t force you to make services and workspaces first-class objects with own RESTful interfaces and URIs, but they do anyway
  • POST for new content where you let the server assign the ID, or PUT if you know the URI you want
  • Uses model of starting at the beginning and following next links to get everything (a la GData, I think..?)
  • Incrementing index numbers for all changes, so you can see things twice, as each changes gives the item a higher inde number, good for syncing eg queues
  • Has APP categories (aka tags)
  • Can create tags specifically for items using category docs
  • Can create hooks for auto-categorisers
  • xpath one built in, can use to extract standard tags from custom XML into category tags for querying later
  • view feeds by category, atomserver specific but based on gdata implementation
  • can do boolean ANDs and ORs of tags, to do a vague equyivalent of SQL SELECT queries
  • Concurrency for edits: each edit must have the revision number appended to the URI for optimistic locking – if that’s not the correct revision, it is rejected (409 CONFLICT)
  • link rel=”edit” URI has the revision number built in
  • Example of “atom-based service architecture” using objects with states — eg could do a moderation service by querying for a feed of objects with “UNMODERATED” tags
  • “etags are the preferred way to pass query parameters to atom” -- I wrote it down but I don't really know what they mean?! I thought etags were about caching?!
  • Aggregate feeds ability (in AtomServer only) - “we join on categories in the same way that SQL would join tables based on a column”
  • Batch updates via one feed doc — (I thought mime multipart was supposed to be used for that??)
  • Custom (pluggable) content storage – only supports RDBMS now but planning to support key-value stores (eg couchdb) later (we at the BBC are very keen to see this happen!)
  • Scales with multiple front-ends using one database – can replicate etc but still requires one sql database (for now)
  • I think I heard them say at the end that they don’t support mysql because they use transactions!?!? would be good to know more about that...
  • it's a graph database
  • started off talking about growth in connected data, eg Facebook’s MySQL store: “facebook has hundreds of machines with 1TB RAM to keep their entire database in memory”
  • At first it sounded silly and just a replacement for RDF but when I could see that they can do depth-X pathExists() searches eg (friends of friends of people), 2ms for 1m people with average 50 connections, ie 25m connections! eg haven't you always wondered how LinkedIn could always say how many degrees away from each person you are when you do searches? that's hard! (I'm not saying that they use neo4j at LinkedIn, but they must use some similar algorithms -- I think they keep most of their social graph state in memory as well, from what I remember hearing at the QCon architectures track in 2007)
  • Has a NeoMock in-memory implementation for testing, but you can just put lots of RAM to your JVM and it uses memory for you
  • neo4j now has sparql support! v interesting
  • they are working on NeoRDF – have two customers but haven’t released as a product yet, it's coming in 1.1
  • They use the OSGi architecture for plugins -- it seems to be becoming a real standard now
  • They are thinking about releasing a standalone server, REST API etc
  • but they say that exposing domain-oriented services is better than exposing the database over the wire – as Ian yesterday was describing, Eran calls it “terrorist-oriented architecture” -- ie independent cells all capable of surviving on their own -- the extreme case of "small pieces loosely joined"
  • "it's hard to think of a good REST API for something as chatty as we are"
  • what's coming in v2.0? they are thinking of sharding (aka partitioning) on top of newton (infiniflow) from paremus
  • based on CAP theorem, BASE rather than ACID -- ie everything is synchronised eventually -- if you don't get this then google it, there are loads of presos about it
  • licensed under AGPLv3 – if you develop software with it it’s free, but if you use it to store more than 1m primitives, you have to pay
jan h, jan@apache.org
  • has the same optimistic locking approach as atomserver, keeps all revisions – ie nothing is locked, ever
  • uses mapreduce (with javascript as the scripting language!) for views, aggregations etc rather than inventing a new query language
  • is slow the first time, as it parses the javascript etc, builds a btree index
  • next time you query, it checks if anything updated and if so, gives the diff to the view server and builds a new diff with just the new data
  • therefore inserts are cheap, you only rebuild views when they are queried again (and even then only incrementally)
  • does syncing between DBs (based on lotus notes?!), one direction or bi-directional
  • books.couchdb.org/relax – drafts coming out in jan (looking now, they might be running a bit behind... but some intro chapters are up at least)
Couchdb in the real world
Jan again (who is available for consulting BTW, apparently we at the BBC have already employed him at least once!)

  • lots of standard storage design patterns change in a couchdb world...
  • eg views (sort of like stored procedures, but using map/reduce)
  • views are saved in “design documents” where your javascript goes
  • you can have multiple views in the same design document, but realise that they’re all updated each time any data changes
  • also put validation rules, authorisation, and more into a design document
  • CouchDB has no such thing as sequences, but you wouldn’t want to use sequences in a distributed env anyway – you have a system field called _id but that’s not a guaranteed sequence
  • if you do want to order your results, do it by a natural key such as time rather than some sequence id
  • CouchDB provides no transactions, no roundtripping, no multi-node transactions (they would be too expensive) - "use an http proxy if you need redundancy"!!? (syncing helps with that I guess? but how resilient is it really?)
  • Can have master-slave or master-master replication setups, eventually consistent but BASE, not ACID (see notes above)
  • You can add as many masters as you like, unlike mysql
  • Replication communication also happens over HTTP so you can use caches, proxies etc
  • because it’s all asynchronous, you can actually call couchdb directly via ajax

Phew, long and disjointed, sorry about that, but it's enough to get my notes down, I might expand on these topics later as we explore these technologies some more!

Hope it's useful to somebody.