Tuesday 3 November 2009

Deathlines - a small peek into the BBC's archive taxonomy

Well, first post in a long time. I have a new job now and everything. But that's for another post, now I want to talk about a great internal hack day we had at BBC R&D's offices at Kingswood Warren, based around cool things we could do with the BBC's archive.

I wasn't even really planning on doing a hack, I just wanted people to see the cool data I had available from my work on the DMI data migration project, such as our Lonclass categorisation taxonomy (or thesaurus if you will) and our P4A production reporting data which shows incidental music etc.

But Simon Delafond (from BBC Online Media Group) and I were chatting and Simon mentioned that he would love to take information about important events in history and put them on a timeline. As an example Simon mentioned the death of Queen Victoria. That sparked an idea, and I showed Simon my copy of the Lonclass database, including -- sure enough -- an entry in the taxonomy about the death of Queen Victoria:

I didn't have the references to programmes to hand, but even the term from the catalogue on its own was useful, especially as we realised that the way the Lonclass data was constructed, we could extract all "death events" from the database. Note that "612.673" at the start of the subject, that's the Lonclass term for death! I could go into more detail on how those terms work, and if Dan Brickley has his way I will do exactly that some time, but for now suffice to say that searching this file for ">612.673" was enough to find all "death events" in the taxonomy.
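
The search step above can be sketched in a few lines of Python. This is only an illustration: the post does not show the export format, so it assumes one record per line, and the sample records and the `find_death_events` name are invented placeholders rather than real Lonclass data.

```python
# The search string quoted in the post: ">" followed by the Lonclass code for death.
DEATH_MARKER = ">612.673"


def find_death_events(lines):
    """Return every line that contains the Lonclass death marker."""
    return [line.strip() for line in lines if DEATH_MARKER in line]


# Invented placeholder records (assumption: one record per line).
sample_lines = [
    "<subject>612.673 PLACEHOLDER PERSON ONE</subject>",
    "<subject>000.000 PLACEHOLDER UNRELATED TERM</subject>",
    "<subject>612.673 PLACEHOLDER PERSON TWO</subject>",
]

death_events = find_death_events(sample_lines)
for event in death_events:
    print(event)
```

Against a real file you would pass the open file object instead of `sample_lines`, since iterating over a file yields its lines.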

But we had no way of mapping those events to an actual date that we could put on a timeline. So, like the good proto-semweb geeks that we are, we thought dbpedia would have that info. Tim Dobson, sitting across the table being very helpful with servers and stuff, suggested that we use YQL and a short while later we had a script that took a name (Screaming Lord Sutch was our favourite for testing), performed a Yahoo search limited to wikipedia, took the link from the first result, turned it into a dbpedia resource, found the JSON version of the dbpedia page, and parsed out the "deathdate" from the JSON file.
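
The last three steps of that script (Wikipedia link to dbpedia resource, resource to JSON document, JSON to death date) might look roughly like the Python sketch below. It is not the script we wrote on the day. The search step is left out, the function names are made up, and the `/data/<name>.json` URL pattern and the subject/property/value layout of the JSON are assumptions about how dbpedia serves its data. The sample document is hand-written for the example, not a captured response.

```python
import json
from urllib.parse import urlparse

WIKI_PREFIX = "/wiki/"


def wikipedia_to_dbpedia(wikipedia_url):
    """Turn a Wikipedia article URL into the matching dbpedia resource URI."""
    path = urlparse(wikipedia_url).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError("not a Wikipedia article URL: " + wikipedia_url)
    return "http://dbpedia.org/resource/" + path[len(WIKI_PREFIX):]


def dbpedia_json_url(resource_uri):
    """Assumed URL pattern for the JSON version of a dbpedia resource."""
    return resource_uri.replace("/resource/", "/data/", 1) + ".json"


def extract_death_date(document, resource_uri):
    """Return the first value of any property whose name ends in 'deathdate'."""
    for prop, values in document.get(resource_uri, {}).items():
        if prop.lower().endswith("deathdate") and values:
            return values[0]["value"]
    return None


# Hand-written sample in the assumed layout: {subject: {property: [values]}}.
SAMPLE_JSON = """
{
  "http://dbpedia.org/resource/Screaming_Lord_Sutch": {
    "http://dbpedia.org/property/deathdate": [
      {"type": "literal", "value": "1999-06-16"}
    ]
  }
}
"""

resource = wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Screaming_Lord_Sutch")
print(dbpedia_json_url(resource))
print(extract_death_date(json.loads(SAMPLE_JSON), resource))
```

Matching on the end of the property name is a deliberate shortcut so the sketch does not depend on which dbpedia namespace the property lives in.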

From there, all we had to do was make an XML file with the results, and feed it to a nifty timeline flash app that Simon had commissioned when he was producer of the Memoryshare project on bbc.co.uk.

The results are in the screenshot above, and why not, I'll put them on my server for posterity.

So have a click around, and be sure to use the funky navigation tools on the left and right of the flash app. Remember I didn't write the app, I just provided the data! And I know the links don't work, but the data is all there in our huge Infax database, so one day we should be able to link to archive footage in this way.

I hope this can give a sense of what can be achieved with the amazing data we have on offer, a few pre-prepared tools, a few different minds being brought together, and a sense of mischief.

Thanks of course to Ant Miller for organising the day, John Z from the R&D Archive research team and Tony Ageh for sponsoring various aspects of the day, and all the hackers who turned up!

Friday 27 February 2009

My response to the Canvas PVT

I'm not sure whether I'm allowed / supposed to post a response to a Public Value Test by the BBC, but here's what I had to say about the Canvas PVT:

Speaking as someone who is considering starting up companies in this area, I think that this is a fantastic opportunity to create new markets and encourage a new wave of development and innovation in small businesses, in a similar way to how Facebook applications and the iPhone AppStore have created a new wave of companies exploiting those platforms by satisfying user needs.

In particular, the Canvas platform should offer a common payment mechanism so that people can download content and/or applications for the set-top box in a manner as simple as buying an application on the iTunes AppStore on an iPhone. eg you could pay 50p and get a great little TV-based application for seeing ski conditions on your favourite slopes, or a game, or an interactive TV-watching tool, etc. Either the Canvas JV itself could manage the payment interactions, or it could create a payment standard that is handled by the ISPs providing the broadband connections (users already have a financial relationship so this would be easy to manage, and would give ISPs a new revenue stream by taking a percentage of the fees paid).

Another area where innovation would be key is the EPG -- the Canvas system should allow for "pluggable EPGs" so people could choose to change their EPG, possibly using the payment model described above to purchase an EPG that exists in a 3D world, or whatever crazy ideas people think up!

BBC Backstage has taught us that without a commercial incentive, people building tools for BBC services are limited to hobbyists and dabblers, and user needs are not met in any mainstream way. Introduce a potential revenue model, and innovation flourishes. Canvas is a perfect opportunity to make this happen.

A thriving Canvas ecosystem could also encourage other countries to adopt the Canvas standards, giving the UK a lead in an important new market around the world.

[Also added in response to their questions about draft market analysis]

I don't believe that "freesat and freesat from sky penetration remain broadly constant" (annex p28). Wouldn't those users switch over to Canvas Freesat?

You only seem to have covered threats/substitutes from pay-TV providers, what about Microsoft/XBox 360, Apple TV, IP-only boxes like Roku and Boxee using internet services such as the Netflix API in the US? They are already growing rapidly. Also of course there are the TVs with direct internet connections, most of which are coming out of Asia working with the US, eg embedded YouTube. If the UK doesn't respond to this emerging market, people may end up watching more US-originated YouTube content (much of it of dubious legal standing) rather than UK-originated services.

The radio market may be affected when people can use IP-connected TVs to listen to radio from around the world, or services such as Last.fm or Spotify through their TVs in surround sound etc. This is negative for incumbent radio stations but a huge opportunity in general.

Tuesday 13 January 2009

QCon SF 2008 Day Three - nothing like a timely post

I mean that literally. This is nothing like a timely post. Still, better than no post at all, right?

Day Three was my volunteering exercise -- hey I was a starving student at the time, so I got a free student rego in exchange for helping out with conference organisation, collecting feedback forms, making sure the speakers were actually in the rooms speaking to people etc -- not a bad gig really if you choose the right room. I missed out on the "architectures you've always wondered about" track but I heard most of those guys in 2007 anyway, so I chose the Data Storage Rethinking: Document Oriented Distributed Databases track which I was very happy about -- it was fascinating and very useful for me.

Some notes only barely converted from my rough typing in between pressing the little clicker to count people going in and out of the room:

Hypertable

  • A column-based bigtable clone
  • GPLed
  • Stores history of everything – even deletes are just stored as new entries with a flag
  • Splits tables automatically across machines if you need to
  • Instrumentation for monitoring etc not there yet for 1.0 (jan/feb next year) (note as of blog posting date: it's at 0.9.2 right now, getting there...)
  • In 1.1 master-slave communication will work much better, including intelligent resource allocation
  • Keeps a write-ahead commit log as well as the data store, so can recover from failure if written to a distributed FS
  • Has "Hyperspace" distributed lock manager – equivalent to “chubby” at google (whatever that is?! presumably somebody reading this knows...)
  • -> currently a SPOF but will have “some form of replication” by release
  • Can run on any distributed FS: hadoop HDFS, KFS (Kosmos FS) etc
  • All communication is asynchronous
  • Languages: C++ plus Thrift bindings which will expose java, python, PHP etc... Release containing this stuff will come out in a few weeks
  • Concurrency: “it uses MVCC”, he skipped it.. What does this mean?? (Wikipedia tells me it's "multi-version concurrency control" which is used by CouchDB, BerkeleyDB, MySQL/InnoDB etc)
  • Achieved over 1m inserts/sec on AOL test data (1TB of 30-byte query log rows -- ie almost pathological but good for certain use cases)
  • Google has “megatable”, abstraction layer on top of bigtable — hypertable will have an equivalent eventually
  • Have their own communication protocol
AtomServer

  • From the guys behind homeaway.com – Bryon Jacob and Chris Berry
  • Took Abdera from apache to build their own framework
  • Added Atom Publishing Protocol extensions, eg
  • open search - google
  • paging – mark nottingham - rfc5005
  • “atom store” - get, put, edit, search via APP – canonical example is gdata
  • Uses Abdera which graduated from the Apache incubator this week and will go 1.0 very soon
  • Provides a solid, scalable, etc implementation on top of Abdera
  • APP Spec doesn’t force you to make services and workspaces first-class objects with own RESTful interfaces and URIs, but they do anyway
  • POST for new content where you let the server assign the ID, or PUT if you know the URI you want
  • Uses model of starting at the beginning and following next links to get everything (a la GData, I think..?)
  • Incrementing index numbers for all changes, so you can see things twice, as each change gives the item a higher index number, good for syncing eg queues
  • Has APP categories (aka tags)
  • Can create tags specifically for items using category docs
  • Can create hooks for auto-categorisers
  • xpath one built in, can use to extract standard tags from custom XML into category tags for querying later
  • view feeds by category, atomserver specific but based on gdata implementation
  • can do boolean ANDs and ORs of tags, to do a vague equivalent of SQL SELECT queries
  • Concurrency for edits: each edit must have the revision number appended to the URI for optimistic locking – if that’s not the correct revision, it is rejected (409 CONFLICT)
  • link rel=”edit” URI has the revision number built in
  • Example of “atom-based service architecture” using objects with states — eg could do a moderation service by querying for a feed of objects with “UNMODERATED” tags
  • “etags are the preferred way to pass query parameters to atom” -- I wrote it down but I don't really know what they mean?! I thought etags were about caching?!
  • Aggregate feeds ability (in AtomServer only) - “we join on categories in the same way that SQL would join tables based on a column”
  • Batch updates via one feed doc — (I thought mime multipart was supposed to be used for that??)
  • Custom (pluggable) content storage – only supports RDBMS now but planning to support key-value stores (eg couchdb) later (we at the BBC are very keen to see this happen!)
  • Scales with multiple front-ends using one database – can replicate etc but still requires one sql database (for now)
  • I think I heard them say at the end that they don’t support mysql because they use transactions!?!? would be good to know more about that...
Neo4j

  • it's a graph database
  • started off talking about growth in connected data, eg Facebook’s MySQL store: “facebook has hundreds of machines with 1TB RAM to keep their entire database in memory”
  • At first it sounded silly, just a replacement for RDF, but then I saw that they can do depth-X pathExists() searches (eg friends of friends of people) in 2ms for 1m people with an average of 50 connections each, ie 25m connections! eg haven't you always wondered how LinkedIn could always say how many degrees away from each person you are when you do searches? that's hard! (I'm not saying that they use neo4j at LinkedIn, but they must use some similar algorithms -- I think they keep most of their social graph state in memory as well, from what I remember hearing at the QCon architectures track in 2007)
  • Has a NeoMock in-memory implementation for testing, but you can just give your JVM lots of RAM and it will keep things in memory for you
  • neo4j now has sparql support! v interesting
  • they are working on NeoRDF – have two customers but haven’t released as a product yet, it's coming in 1.1
  • They use the OSGi architecture for plugins -- it seems to be becoming a real standard now
  • They are thinking about releasing a standalone server, REST API etc
  • but they say that exposing domain-oriented services is better than exposing the database over the wire – as Ian yesterday was describing, Eran calls it “terrorist-oriented architecture” -- ie independent cells all capable of surviving on their own -- the extreme case of "small pieces loosely joined"
  • "it's hard to think of a good REST API for something as chatty as we are"
  • what's coming in v2.0? they are thinking of sharding (aka partitioning) on top of newton (infiniflow) from paremus
  • based on CAP theorem, BASE rather than ACID -- ie everything is synchronised eventually -- if you don't get this then google it, there are loads of presos about it
  • licensed under AGPLv3 – if you develop software with it it’s free, but if you use it to store more than 1m primitives, you have to pay
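
To make the depth-X pathExists() idea from these notes concrete, here is a minimal Python sketch of a depth-limited breadth-first search over an in-memory friend graph. It is not Neo4j's API or its actual algorithm. The `path_exists` function and the tiny `friends` graph are invented for illustration, and a plain dict of sets stands in for the database.

```python
from collections import deque


def path_exists(graph, start, goal, max_depth):
    """Return True if goal is reachable from start in at most max_depth hops."""
    if start == goal:
        return True
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed number of hops
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return False


# Invented example graph: each person maps to the set of their direct friends.
friends = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
}

print(path_exists(friends, "alice", "carol", 2))  # friend of a friend
print(path_exists(friends, "alice", "dave", 2))   # three hops away, so not within 2
```

The 25m figure in the notes is consistent with counting each friendship once: 1m people times 50 connections is 50m endpoints, or 25m undirected edges.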
CouchDB
jan h, jan@apache.org
  • has the same optimistic locking approach as atomserver, keeps all revisions – ie nothing is locked, ever
  • uses mapreduce (with javascript as the scripting language!) for views, aggregations etc rather than inventing a new query language
  • is slow the first time, as it parses the javascript etc, builds a btree index
  • next time you query, it checks if anything updated and if so, gives the diff to the view server and builds a new diff with just the new data
  • therefore inserts are cheap, you only rebuild views when they are queried again (and even then only incrementally)
  • does syncing between DBs (based on lotus notes?!), one direction or bi-directional
  • books.couchdb.org/relax – drafts coming out in jan (looking now, they might be running a bit behind... but some intro chapters are up at least)
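
The "only rebuild views when they are queried, and only for what changed" behaviour in these notes can be imitated in plain Python. This is a toy model, not CouchDB code: real views are written in JavaScript and stored in a btree, whereas the `IncrementalView` class, the change-list format and the sample documents here are all assumptions made up for the sketch.

```python
class IncrementalView:
    """Toy map-only view that re-maps only documents changed since the last query."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}  # doc id -> rows emitted for that doc
        self.last_seq = 0      # highest change sequence already indexed

    def query(self, changes):
        """changes: list of (seq, doc_id, doc or None) in increasing seq order."""
        for seq, doc_id, doc in changes:
            if seq <= self.last_seq:
                continue  # already indexed, so inserts cost nothing until queried
            if doc is None:
                self.rows_by_doc.pop(doc_id, None)  # deleted document
            else:
                self.rows_by_doc[doc_id] = list(self.map_fn(doc))
            self.last_seq = seq
        rows = [row for doc_rows in self.rows_by_doc.values() for row in doc_rows]
        return sorted(rows)


mapped_names = []  # records which documents the map function actually ran on


def by_death_year(doc):
    """Map function: emit (year, name) for documents that have a 'died' field."""
    mapped_names.append(doc["name"])
    if "died" in doc:
        yield (doc["died"][:4], doc["name"])


changes = [
    (1, "doc-a", {"name": "A", "died": "1901-01-22"}),
    (2, "doc-b", {"name": "B"}),
]
view = IncrementalView(by_death_year)
first = view.query(changes)

changes.append((3, "doc-b", {"name": "B", "died": "1999-06-16"}))
second = view.query(changes)
print(first, second, mapped_names)
```

After the second query the map function has run on document B twice and on document A only once, which is the incremental part.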
CouchDB in the real world
Jan again (who is available for consulting BTW, apparently we at the BBC have already employed him at least once!)

  • lots of standard storage design patterns change in a couchdb world...
  • eg views (sort of like stored procedures, but using map/reduce)
  • views are saved in “design documents” where your javascript goes
  • you can have multiple views in the same design document, but realise that they’re all updated each time any data changes
  • also put validation rules, authorisation, and more into a design document
  • CouchDB has no such thing as sequences, but you wouldn’t want to use sequences in a distributed env anyway – you have a system field called _id but that’s not a guaranteed sequence
  • if you do want to order your results, do it by a natural key such as time rather than some sequence id
  • CouchDB provides no transactions, no roundtripping, no multi-node transactions (they would be too expensive) - "use an http proxy if you need redundancy"!!? (syncing helps with that I guess? but how resilient is it really?)
  • Can have master-slave or master-master replication setups, eventually consistent, ie BASE rather than ACID (see notes above)
  • You can add as many masters as you like, unlike mysql
  • Replication communication also happens over HTTP so you can use caches, proxies etc
  • because it’s all asynchronous, you can actually call couchdb directly via ajax
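
Both the AtomServer and CouchDB notes above lean on revision-based optimistic locking: nothing is ever locked, and a write that names a stale revision is rejected. A rough in-memory Python sketch of that idea follows. The `RevisionedStore` class and `ConflictError` exception are invented names, and this is neither product's real interface, just the pattern.

```python
class ConflictError(Exception):
    """Raised when a write names a stale revision (the 409 CONFLICT case)."""


class RevisionedStore:
    """Toy document store that keeps every revision and never takes a lock."""

    def __init__(self):
        self._history = {}  # doc id -> list of bodies, oldest first

    def get(self, doc_id):
        """Return (current revision number, current body)."""
        revisions = self._history[doc_id]
        return len(revisions), revisions[-1]

    def put(self, doc_id, body, expected_rev=0):
        """Write a new revision; expected_rev must match the current revision."""
        revisions = self._history.setdefault(doc_id, [])
        if expected_rev != len(revisions):
            raise ConflictError(
                "expected revision %d but current is %d" % (expected_rev, len(revisions))
            )
        revisions.append(body)
        return len(revisions)


store = RevisionedStore()
rev = store.put("entry-1", "first draft")               # new document, revision 1
rev = store.put("entry-1", "second draft", expected_rev=rev)

try:
    store.put("entry-1", "late edit", expected_rev=1)   # stale revision
except ConflictError as error:
    print("rejected:", error)
```

A client that gets the conflict is expected to re-read the document, merge its change and try again with the new revision number.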

Phew, long and disjointed, sorry about that, but it's enough to get my notes down, I might expand on these topics later as we explore these technologies some more!

Hope it's useful to somebody.

Interview with Vinod Khosla!

Presumably those reading this blog already know that I just returned from an exchange term at the Haas School of Business, UC Berkeley. One of the more fun subjects I studied was Venture Capital and Private Equity, taught by the inimitable trio: Jerry Engel of Monitor Venture Partners, Terry Opdendyk of ONSET Ventures, and Sean Foote of Labrador Ventures, who after seven years teaching together had all their lines honed to a tee but still played off each other, like a good standup act.

We were given an assignment to go and find a real VC or two, and interview them. We were quite proud of ourselves for managing to interview Silicon Valley scion Vinod Khosla, co-founder of Sun Microsystems, partner at Kleiner Perkins during their glory years of the dot-com boom, and proponent of all things clean-tech at his current venture, Khosla Ventures.

For posterity, I thought I would cut and paste our assignment as I think we asked some pretty good questions, considering we only had five minutes with him!

Interview date: 3 October 2008 (a brief 5 minute chat before Khosla's presentation at Berkeley Labs)

How do you think the venture capital industry will change as a result of the flattening world?

I think the emerging world gives us more points of innovation, there are many smart people in the world and we can be open to all of them — the pool of talent is larger. We can cultivate local venture capitalists all around the world.

We noticed that most of your investments have been in the US — why is this?

Venture capital is a high-touch industry. You need to be close to your businesses to mentor them properly.

How do you reconcile that view with the idea that innovation is coming from everywhere? Are you saying that, to be successful, an international business has to start an operation in the US?

Being in the US increases your probability of success by ten times. The opportunities are all here; the venture industry is all here. You have to be here to be a part of it.

Do you think that this will change as a result of the financial crisis and the changing economy?

The VC industry won’t change much, it might be smaller in a few years but in the short term nothing will change. Hopefully the economic situation will mean that we move away from synthetic goods, and back towards physical goods that make a difference.

What do you mean by synthetic goods? Do you mean moving away from software?

I mean financial instruments, investment bankers making things up rather than creating things that actually add value to the world.