Production Agent Reporting Outage

Incident Report for Cloud 66

Postmortem

At approximately 07:00 UTC we started to perform a minor system quality-of-life backend update. The update had unintended consequences on an older system component which hadn’t been changed in a while. The component was to do with agent server communications, and it ended up disabled after the update. Although the same component was unaffected during testing in dev/staging and passed UAT, it turned out that there was a necessarily different configuration around rewrites in production that caused the issue.

The nature of the update meant that rolling back wasn’t straightforward, so the team resolved to fix the incompatibility in-place, and as such were unable to stop the erroneous “server down” notifications that then went out. The issue was resolved approximate 1 hour later. at 08:00 UTC - after which time servers would have started to appear as “online” again.

Subsequent to this outage, the difference in configuration has been added to our operations processes, such that this will not occur again. Our apologies for the inconvenience and concern that this may have created!

Posted Sep 09, 2022 - 09:22 UTC

Resolved

This incident has been resolved.

Posted Sep 09, 2022 - 08:51 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 09, 2022 - 07:20 UTC

This incident affected: Upstream Dependencies (Github) and Core Systems, BuildGrid, Realtime Push Systems, ContainerNet.