Between 18:45 UTC and 20:45 UTC on February 28, 2017, the Amazon Web Services Simple Storage Service (S3) suffered an outage in the US East (N. Virginia) region (us-east-1) that also partially affected Bitmovin encoding services. Details on what exactly happened at Amazon are still scarce, but the S3 outage also caused downstream disruptions for many seemingly unrelated services such as Slack, Nest, Adobe, Salesforce and Docker.
How this affected Bitmovin Products
Since all of our infrastructure at Bitmovin is designed to run in multiple clouds and in multi-AZ deployments, no customer-facing services were affected directly. At all times during the outage, our customers were able to interact with our API, deliver their videos through our HTML5 Player and record video Analytics without any interruptions.
However, parts of our encoding pipeline rely on Docker images during the encoding process. The S3 outage also affected Docker Hub, where we store these images, so we were unable to provision new encoding instances in the Bitmovin API for the duration of the incident. Running encodings were not affected, but newly created encodings failed to start and were subsequently marked as erroneous.
How we fixed the problem
Once we noticed the effects of the S3 outage, we immediately reached out to affected customers and informed them of the situation. Meanwhile, we started implementing new processes that allow us to pull images from a backup Docker registry in case Docker Hub becomes unavailable again. Because we had to update all of our machine images, which are distributed across multiple cloud infrastructures and cloud regions, this fix could not be applied in time to mitigate the immediate effect of the Docker Hub outage for every customer. In future cases, however, this will be handled gracefully and without any immediate effect on our customers.
Please accept our apologies for this recent issue affecting your service. Bitmovin is committed to continually and quickly improving our technology and operational processes to help prevent service disruptions.
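For readers who want to picture the backup-registry fallback mentioned above, here is a minimal sketch in Python. It is not our production code: the function names `docker_pull` and `pull_with_fallback`, the registry host `registry-backup.example.com` and the image name `acme/encoder:1.0` are all hypothetical and chosen for illustration only. The sketch assumes the `docker` CLI is installed on the instance and that the same image has been pushed to every registry in the list.

```python
import subprocess
from typing import Callable, Sequence


def docker_pull(image_ref: str, timeout_s: int = 600) -> bool:
    """Pull one image reference with the docker CLI; return True on success."""
    try:
        result = subprocess.run(
            ["docker", "pull", image_ref],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker is not installed, or the registry did not answer in time.
        return False
    return result.returncode == 0


def pull_with_fallback(
    image: str,
    registries: Sequence[str],
    pull: Callable[[str], bool] = docker_pull,
) -> str:
    """Try each registry in order and return the first reference that pulled.

    `image` is a repository plus tag (hypothetical example: "acme/encoder:1.0").
    `registries` lists registry hosts in priority order, primary first.
    """
    for registry in registries:
        image_ref = f"{registry}/{image}"
        if pull(image_ref):
            return image_ref
    raise RuntimeError(f"could not pull {image!r} from any of {list(registries)}")
```

A provisioning script could call `pull_with_fallback("acme/encoder:1.0", ["docker.io", "registry-backup.example.com"])` and start the container from the returned reference. Passing the pull function as a parameter keeps the fallback logic testable without a Docker daemon.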