On February 14, 2020, SDL Machine Translation Cloud was down for approximately 4 hours. This is the worst outage we have experienced since SDL MT Cloud v4 was released 16 months ago, and we sincerely apologize to all of our customers. We would like to provide a post-mortem of what happened and share what we have done since then, and what is still to be done, to prevent a repeat of this outage.
Our production monitors detected irregularities at 02:19 UTC and the on-call engineers immediately started investigating the errors. At this point, service errors were random and few, not affecting all customers and not consistently affecting any particular customer. The engineers identified that the faulty service was the cluster orchestrator we use. The problem was caused by a very high number of automated operations triggered while the orchestrator was trying to accommodate traffic irregularities. To clean up caches that had reached a memory limit, the service needed to be restarted. Since the service is set up in high availability mode, the restart should not have caused any disruption in service for our customers. However, the issue could not be fixed, and the result was a complete failure of the cluster management service, causing a full outage that started at 04:15 UTC.
We also apologize that the first service outage notification did not go out to customers (admin users) until 07:51 UTC. We are implementing a feature in the user interface that will allow customers to opt in to or out of email notifications, so that in the future notifications will be sent much sooner to customers who have opted in.
Partial service was restored at 08:36 UTC and complete service was restored by 12:05 UTC. During this period, we sent an update at 10:07 UTC and then a confirmation that services had been restored at 13:22 UTC.
Since then, our teams have been investigating the root cause of the outage to ensure that we implement the correct fixes to prevent it from occurring again. The teams have now confirmed that the root cause is an edge-case error that occurs only when a very high number of parallel operations run on the cluster management service. Investigating further, we uncovered a bug in the version of the cluster management system that we use. The teams have since established and validated a new procedure to mitigate an outage should the same conditions that triggered this one occur again. As part of the plan to remove the root cause, the teams are upgrading the cluster management system to the latest version, which contains a fix for that particular bug, and running tests to ensure that this fix indeed resolves the underlying cause.
We take the performance and reliability of our services very seriously and are working relentlessly to ensure that we exceed the SLA we have committed to you.