
Data archaeology - or the missing listings

Jake Berger

Ö÷²¥´óÐã Genome

The small-but-perfectly-formed Genome technical team have spent the last six months digging into the vast cache of data that was produced during the scanning of over 350,000 pages of the Radio Times. They’ve found quite a lot of interesting artefacts in unexpected places, and are currently brushing them down and cleaning them up in readiness for publication on the site...

I’ll describe a couple of these artefacts, and try to explain how they came to exist and what we are doing with them.

The Genome project set out to create an easily navigable and searchable broadcast history of the Ö÷²¥´óÐã - the public face of the Ö÷²¥´óÐã’s catalogue - rather than to simply digitise the Radio Times. This meant that substantial chunks of the magazines would not be made available online (e.g. the adverts, articles and letters), as these would have made the listings themselves harder to find.

To achieve this, algorithms were written during the design of the data extraction process to automagically identify which sections of the magazine, and which elements within a page, were likely to contain a listing (as opposed to anything else). These algorithms essentially looked for patterns in the layout of pages and in sequences of characters.
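By way of illustration, here is a minimal sketch (in Python) of the kind of heuristic this involves: treat a block of text as a likely listing if enough of its lines open with something that looks like a broadcast time. The regular expression, the threshold and the function name are illustrative assumptions, not the actual Genome pipeline.

```python
import re

# Illustrative pattern for a line that opens with a broadcast time,
# e.g. '7.30 Grandstand' or '8.15 pm The News'.
TIME_PATTERN = re.compile(r'^\s*\d{1,2}[.:]\d{2}\s*(a\.?m\.?|p\.?m\.?)?\b', re.IGNORECASE)

def looks_like_listing(block: str) -> bool:
    """Guess whether a block of scanned text is a programme listing.

    Listings tend to be runs of short lines that open with a time,
    whereas articles and letters read as ordinary prose paragraphs.
    """
    lines = [line for line in block.splitlines() if line.strip()]
    if not lines:
        return False
    timed = sum(1 for line in lines if TIME_PATTERN.match(line))
    # Call it a listing if a reasonable share of its lines start with a time.
    return timed / len(lines) >= 0.3

print(looks_like_listing("7.30 Grandstand\nIntroduced by David Coleman\n8.15 Racing from Newbury"))  # True
print(looks_like_listing("The Radio Times celebrates fifty years of broadcasting this week."))       # False
```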

As with most automated pattern recognition, there is always an element of error. Given the substantial variance in the design and layout of the Radio Times between 1923 and 2009 (I blame the graphic designers), these algorithms had to be continually tweaked to achieve a satisfactory accuracy rate, but a degree of error was inevitable. Finding and correcting these errors is hugely important if we are to make available the fullest and most accurate representation of the Ö÷²¥´óÐã’s broadcast history.

Simon Smith - our champion data archaeologist - spent a substantial amount of time poring over a million or so rows of data in two tables in the Genome database, called ‘genome_related’ and ‘genome_tables’. Fields in these rows contain blocks of text that the algorithms decided were not programme listings (they usually decided that these were articles and therefore not for publication).


This is what a Grandstand listing would look like.

Simon found that in many cases these fields contained one of three things: what was essentially an extended synopsis (as found in text boxes adjacent to the programme listing or in the longer photo captions), further details about the programme listed (such as which pieces by which composer would be played during a radio music programme), or ‘sub-listings’ (such as the start times of the various sports featured in an episode of Grandstand). He has painstakingly identified more than a hundred thousand items, which we are working on making available on the Genome website, adding a whole new level of detail to the listings.

We are very excited (yes - we do need to get out more) about something else that Simon has found: around 150,000 chunks of text that the algorithms misidentified as articles when in fact they are brief programme listings. We estimate that when the data is extracted and properly classified, we might be able to add nearly a million new programme listings to Genome. This will take a good while to get ready, but we’ll blog about it in more detail when we have made some more progress.
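To give a rough idea of what that re-classification pass might look like, here is a hedged sketch that walks through stored text blocks and flags any that match a listing-like heuristic. The ‘genome_related’ table name comes from above; the column names, and the use of SQLite, are assumptions made purely for illustration.

```python
import re
import sqlite3

# Same illustrative time-at-start-of-line heuristic as the earlier sketch.
TIME_PATTERN = re.compile(r'^\s*\d{1,2}[.:]\d{2}\s*(a\.?m\.?|p\.?m\.?)?\b', re.IGNORECASE)

def looks_like_listing(block: str) -> bool:
    lines = [line for line in block.splitlines() if line.strip()]
    if not lines:
        return False
    timed = sum(1 for line in lines if TIME_PATTERN.match(line))
    return timed / len(lines) >= 0.3

def find_missing_listings(db_path: str):
    """Yield (row id, text) for blocks filed as 'articles' that look like listings.

    The 'id' and 'body' column names are hypothetical stand-ins for
    whatever the real schema uses.
    """
    conn = sqlite3.connect(db_path)
    try:
        for row_id, body in conn.execute("SELECT id, body FROM genome_related"):
            if body and looks_like_listing(body):
                yield row_id, body
    finally:
        conn.close()
```

Candidates flagged this way would still need checking by hand before being added to the site, but it narrows a million rows down to something a person can realistically review.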
