Bots using our Zero-Training AI model were experiencing slow responses and, in some extreme cases, were replying with “technical error” due to high latency on requests to the OpenAI Scale Tier API.
As a first attempt to mitigate the issue, we restarted our LLM-related services, but to no avail. Moving on to a different approach, we disabled our Scale Tier for a few minutes and then re-enabled it. Shortly afterwards we saw performance improve and request latency return to normal values.
Our initial investigation found no unusual behaviour on our systems, so we have opened an ongoing investigation together with OpenAI to determine the root cause.
To increase our awareness of this type of issue, we are implementing the following improvements:
We also found that some SNGP bots were affected as a side effect of this incident. This was because our internal processing queue is shared between models: the high latency on Zero-Training requests was blocking queued requests, resulting in a cascading failure.
To mitigate this in the future, we have increased the maximum number of queued requests, ensuring that SNGP requests can still be processed even if Zero-Training requests encounter delays due to OpenAI-related issues.
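For illustration, the sketch below is a hypothetical Python/asyncio model of this failure mode, not our actual service code; the model names, latencies, worker count, and queue size are assumptions. It shows how a small shared queue lets slow Zero-Training requests crowd out SNGP requests, and why raising the queue's maximum size allows SNGP requests to keep being accepted while Zero-Training requests are delayed.

```python
import asyncio

# A minimal sketch of the shared-queue failure mode described above.
# Model names, latencies, worker count, and queue size are illustrative
# assumptions, not our production configuration.

QUEUE_MAXSIZE = 8  # assumed small shared-queue capacity; raising this
                   # is the mitigation described above


async def worker(queue: asyncio.Queue) -> None:
    """Pull requests off the shared queue and simulate a model call."""
    while True:
        model, latency = await queue.get()
        # A slow upstream call (e.g. high Zero-Training latency) keeps
        # this worker busy while queued items pile up behind it.
        await asyncio.sleep(latency)
        print(f"processed {model} request ({latency:.1f}s)")
        queue.task_done()


def submit(queue: asyncio.Queue, model: str, latency: float) -> None:
    """Enqueue a request, rejecting it when the shared queue is full."""
    try:
        queue.put_nowait((model, latency))
    except asyncio.QueueFull:
        # With a small shared queue, slow Zero-Training requests fill it
        # up and unrelated SNGP requests are rejected as well -- the
        # cascading failure described above.
        print(f"rejected {model} request: queue full")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAXSIZE)
    workers = [asyncio.create_task(worker(queue)) for _ in range(2)]

    # Slow Zero-Training requests occupy the workers and the queue...
    for _ in range(10):
        submit(queue, "zero-training", latency=2.0)
    # ...so cheap SNGP requests are rejected even though the workers
    # could have handled them quickly.
    for _ in range(3):
        submit(queue, "sngp", latency=0.1)

    await queue.join()
    for w in workers:
        w.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

Re-running the sketch with a larger QUEUE_MAXSIZE (e.g. 32) shows the SNGP requests being accepted and processed despite the slow Zero-Training calls, which is the effect the queue-size increase is intended to achieve.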