Around 10:00 AM UTC we noticed an increase in errors in our internal error reporting tool. A quick investigation revealed that one of our database clusters, managed by an external provider, could not handle the current load. This affected our dashboard as well as incoming chats.
Further investigation showed that the database had been automatically scaled down to a less performant instance based on load from the previous 24 hours. We reacted by manually scaling it back up to the previous cluster size using the external provider’s console; however, due to a malfunction on the provider’s side, this process did not complete as it normally would.
At 10:40 AM UTC we contacted the external provider’s support team, and they force-applied our change from their end.
At 11:57 AM UTC our dashboard became operational again.
At 12:12 PM UTC chats on all CRMs became operational again.
As a first measure to avoid a similar situation in the future, we now enforce a higher minimum cluster size to prevent excessive automatic downscaling. Furthermore, we are in contact with the provider’s support team to investigate why our changes were not applied.
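In practice, the safeguard amounts to clamping any automatic scaling decision to a configured floor. The sketch below is a simplified illustration under assumptions: the tier names and the clamping helper are hypothetical stand-ins, and the real setting is configured in the provider’s console rather than in our code.

```python
# Hypothetical sketch: tier names and this helper are illustrative only;
# the actual minimum cluster size is enforced via the provider's console.

# Ordered database tiers, smallest to largest (illustrative names).
TIERS = ["db-small", "db-medium", "db-large", "db-xlarge"]

# Enforced floor: never scale below this tier, regardless of what the
# autoscaler recommends based on the previous 24 hours of load.
MINIMUM_TIER = "db-large"


def clamp_to_minimum(recommended_tier: str, minimum_tier: str = MINIMUM_TIER) -> str:
    """Return the tier to apply, never smaller than the enforced minimum."""
    if TIERS.index(recommended_tier) < TIERS.index(minimum_tier):
        return minimum_tier
    return recommended_tier


if __name__ == "__main__":
    # After a quiet 24 hours the autoscaler may suggest scaling down;
    # the guard keeps the cluster at or above the enforced minimum.
    print(clamp_to_minimum("db-small"))   # -> db-large
    print(clamp_to_minimum("db-xlarge"))  # -> db-xlarge
```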