Tuesday, 3 November 2009

Deathlines - a small peek into the BBC's archive taxonomy

Well, first post in a long time. I have a new job now and everything. But that's for another post; for now I want to talk about a great internal hack day we had at BBC R&D's offices at Kingswood Warren, based around cool things we could do with the BBC's archive.

I wasn't even really planning on doing a hack, I just wanted people to see the cool data I had available from my work on the DMI data migration project, such as our Lonclass categorisation taxonomy (or thesaurus if you will) and our P4A production reporting data which shows incidental music etc.

But Simon Delafond (from BBC Online Media Group) and I were chatting and Simon mentioned that he would love to take information about important events in history and put them on a timeline. As an example Simon mentioned the death of Queen Victoria. That sparked an idea, and I showed Simon my copy of the Lonclass database, including -- sure enough -- an entry in the taxonomy about the death of Queen Victoria:

I didn't have the references to programmes to hand, but even the term from the catalogue on its own was useful, especially as we realised that, because of the way the Lonclass data is constructed, we could extract all "death events" from the database. Note the "612.673" at the start of the subject: that's the Lonclass term for death! I could go into more detail on how those terms work, and if Dan Brickley has his way I will do exactly that some time, but for now suffice to say that searching this file for ">612.673" was enough to find all "death events" in the taxonomy.
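
To give a flavour of how simple that search is, here is a minimal sketch in Python. It is not the script from the day: the function name `find_death_events` is made up for illustration, and it assumes the Lonclass dump is a text file with one entry per line.

```python
# Marker quoted in the post: ">" followed by the Lonclass term for death.
DEATH_MARKER = ">612.673"


def find_death_events(lines):
    """Return every line that contains the death marker.

    Assumption: one taxonomy entry per line. This is a plain substring
    match, so it also picks up longer codes that merely start with 612.673.
    """
    return [line.rstrip("\n") for line in lines if DEATH_MARKER in line]
```

With a real dump you would call it on an open file handle, for example `find_death_events(open(path, encoding="utf-8"))`, where `path` points at wherever your copy of the data lives.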

But we had no way of mapping those events to an actual date that we could put on a timeline. So, like the good proto-semweb geeks that we are, we thought DBpedia would have that info. Tim Dobson, sitting across the table being very helpful with servers and stuff, suggested that we use YQL, and a short while later we had a script that took a name (Screaming Lord Sutch was our favourite for testing), performed a Yahoo search limited to Wikipedia, took the link from the first result, turned it into a DBpedia resource, found the JSON version of the DBpedia page, and parsed out the "deathdate" from the JSON file.
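
The last three steps of that chain (Wikipedia link to DBpedia resource to JSON to date) can be sketched roughly as follows. This is an illustration, not the original script: the function names are hypothetical, the search step is left out, and it assumes DBpedia's JSON is keyed first by resource URI and then by property URI, with each property holding a list of objects that carry a "value" field. The exact property name may differ, so the sketch matches any property whose last path segment is "deathdate", ignoring case.

```python
from urllib.parse import urlparse


def wikipedia_to_dbpedia(wiki_url):
    """Map a Wikipedia article URL to (DBpedia resource URI, JSON data URL)."""
    path = urlparse(wiki_url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a Wikipedia article URL: " + wiki_url)
    title = path[len(prefix):]
    resource = "http://dbpedia.org/resource/" + title
    data_url = "http://dbpedia.org/data/" + title + ".json"
    return resource, data_url


def extract_death_date(doc, resource_uri):
    """Pull the first death date out of an already-parsed DBpedia JSON document.

    Assumed shape: {resource_uri: {property_uri: [{"value": ...}, ...]}}.
    Returns None when no matching property is found.
    """
    properties = doc.get(resource_uri, {})
    for property_uri, values in properties.items():
        if property_uri.rsplit("/", 1)[-1].lower() == "deathdate":
            for entry in values:
                if isinstance(entry, dict) and "value" in entry:
                    return entry["value"]
    return None
```

Fetching `data_url` and running the response through `json.loads` would give the `doc` argument; that part is left out here so the sketch doesn't depend on the network.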

From there, all we had to do was make an XML file with the results, and feed it to a nifty timeline flash app that Simon had commissioned when he was producer of the Memoryshare project on bbc.co.uk.
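
Building that XML file is the easy bit. A minimal sketch, using Python's standard library, might look like this. The element and attribute names (`timeline`, `event`, `date`, `title`) are invented for illustration; the format the timeline app actually expects isn't described here.

```python
import xml.etree.ElementTree as ET


def build_timeline_xml(events):
    """Serialise (name, date) pairs into a small XML document.

    The element names are placeholders, not the real app's schema.
    """
    root = ET.Element("timeline")
    for name, date in events:
        event = ET.SubElement(root, "event", {"date": date})
        ET.SubElement(event, "title").text = name
    return ET.tostring(root, encoding="unicode")
```

ElementTree takes care of escaping, so names containing characters like "&" come out as valid XML.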

The results are in the screenshot above, and why not, I'll put them on my server for posterity.

So have a click around, and be sure to use the funky navigation tools on the left and right of the flash app. Remember I didn't write the app, I just provided the data! And I know the links don't work, but the data is all there in our huge Infax database, so one day we should be able to link to archive footage in this way.

I hope this can give a sense of what can be achieved with the amazing data we have on offer, a few pre-prepared tools, a few different minds being brought together, and a sense of mischief.

Thanks of course to Ant Miller for organising the day, John Z from the R&D Archive research team and Tony Ageh for sponsoring various aspects of the day, and all the hackers who turned up!