Goldsky - Turbo pipelines stuck in restarting state during a security upgrade – Incident details

Turbo pipelines stuck in restarting state during a security upgrade

Resolved
Partial outage
Started 2 days ago
Lasted 17 minutes

Affected

Mirror

Major outage from 2:15 PM to 2:32 PM

Mirror Replication

Major outage from 2:15 PM to 2:32 PM

Updates
  • Postmortem
    After a thorough investigation, the root cause and impact are clear, and new processes and alerts are being set up to mitigate this in the future.

    During an infrastructure upgrade meant to harden our security posture, some pipelines failed to restart, resulting in data not being sent to the sinks.

    This affected fewer than 5% of customer projects, for between 5 and 17 minutes. In those projects, pipelines deployed before Monday were affected. No data was lost, and the pipelines resumed from where they left off once the issue was resolved.

    Accordingly, the incident has been downgraded from Major Outage to Partial Outage.

    New processes and monitors will be added by the end of the week to prevent these issues.

  • Resolved
    This incident has been resolved. During a node upgrade, a configuration change did not propagate fast enough and some nodes were caught in a crash loop while the rollout completed. We are working on a permanent fix to prevent this kind of failure from happening again.

    This affected fewer than 5% of customer projects, for between 5 and 17 minutes.

  • Identified
    We are continuing to work on a fix for this incident.