BBC Online outage on Saturday 19th July 2014

Richard Cooper

Controller Digital Distribution

Hi, I'm Richard Cooper, the BBC's Controller of Digital Distribution for BBC Future Media.

As many of you will have noticed, we suffered a serious incident over the weekend which affected BBC iPlayer, BBC iPlayer Radio, and audio and video playback on other parts of BBC Online. We also had to use our emergency homepage for prolonged periods of time.

Here's what happened.

We have a system comprising 58 application servers and 10 database servers that provides programme and clip metadata. This data powers the various BBC iPlayer applications for the devices that we support (over 1,200 and counting), as well as modules of programme information and clips on many sites across BBC Online. The system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run from either one of those data centres at any time.
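To illustrate what a "hot-hot" arrangement means for a client of such a system, here is a minimal sketch in Python. It is not a description of the real routing, which this post does not cover. The function name `fetch_hot_hot` and the idea of modelling each data centre as a callable are assumptions made purely for the example.

```python
import zlib
from typing import Callable, Optional, Sequence

# Hypothetical: each data centre is modelled as a callable that returns
# metadata for a key, or raises an exception if that site cannot answer.
Site = Callable[[str], str]


def fetch_hot_hot(key: str, sites: Sequence[Site]) -> str:
    """Spread requests across all live sites; fall back to the others on failure."""
    if not sites:
        raise ValueError("at least one site is required")
    # Both sites take traffic in normal operation: pick a starting site
    # from a stable hash of the key rather than always preferring one.
    start = zlib.crc32(key.encode("utf-8")) % len(sites)
    last_error: Optional[Exception] = None
    for offset in range(len(sites)):
        site = sites[(start + offset) % len(sites)]
        try:
            return site(key)
        except Exception as exc:  # this site failed, try the next one
            last_error = exc
    raise RuntimeError("all data centres failed") from last_error
```

Note that a scheme like this only helps when one site is healthy. If the same overload hits the databases behind both sites, every attempt fails and the caller sees the error.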

At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail.

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.
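The difference between the two kinds of product can be sketched in a few lines of Python. This is an illustrative example only, not the code any of our products run. The class name `StaleOnErrorCache`, the `fetch` callable and the `ttl_seconds` parameter are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Product-level cache that keeps serving the last good value
    when the metadata backend fails during revalidation."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the metadata service
        self._ttl = ttl_seconds      # how long an entry counts as fresh
        self._clock = clock          # injectable so the example can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # fresh: no backend call needed
        try:
            value = self._fetch(key)  # expired or missing: try to revalidate
        except Exception:
            if entry is not None:
                return entry[0]      # backend failed: serve the stale copy
            raise                    # nothing cached, so the failure surfaces
        self._entries[key] = (value, now)
        return value
```

A product that calls the metadata service directly behaves like the final `raise` on every request, which is why those products failed as soon as the service did.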

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we're still doing the forensics on it), and it has recurred a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.

Restoring the metadata service was complex. Isolating the source of the additional load proved to be far from straightforward, and restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems). Performance of the system remained sufficiently poor that in the end we decided to do some significant remedial work on Saturday afternoon, which ran on until the evening. During that period, BBC iPlayer was effectively not useable.

After that work was complete we were in a "walking wounded" state that allowed close to normal operation for much of the site, though BBC iPlayer remained down on a number of devices. We chose to run it in this mode throughout the rest of the weekend while planning a full restoration of the service. By the time we were ready to do that we were entering the peak period on Sunday evening, so rather than risk the service further, we chose instead to do it on Monday morning.

We recognise that during this incident, with BBC iPlayer unavailable for some periods for some users, you may not have been able to watch or listen to the programmes you wanted. I'm afraid we can't simply turn back the clock, and as such the time available to watch some programmes within the normal seven-day catch-up window was reduced. Essentially, programmes aired on Saturday 12th July and Sunday 13th July were not available this last weekend for some users. It's small consolation, but that was the weekend of the World Cup Final, Scottish Open, Women's Open and other live sporting events, which are less likely to be viewed on catch-up. I should also stress that programmes aired this weekend, when the problems occurred, are available now on BBC iPlayer.

BBC iPlayer is an incredibly popular product. Last year alone we had 3 billion requests, and incidents like this are incredibly rare.

We will now be completing the forensics to make sure that we've fully understood the root causes, and will put in place the measures necessary to minimise the chances of such interruptions in the future.

We're sorry for the inconvenience.

Richard Cooper is Controller of Digital Distribution, BBC Future Media
