Goldsky - Partial Mirror Pipeline Downtime – Incident details

Partial Mirror Pipeline Downtime

Resolved
Partial outage 30 %
Started 15 days agoLasted about 4 hours

Affected

Mirror

Partial outage from 7:35 AM to 11:43 AM, Operational from 10:51 AM to 12:00 AM

Mirror Replication

Partial outage from 7:35 AM to 11:43 AM, Operational from 11:43 AM to 12:00 AM

Chain Ingestion

Partial outage from 7:35 AM to 10:51 AM, Operational from 10:51 AM to 12:00 AM

Subgraph Sources

Partial outage from 7:35 AM to 10:51 AM, Operational from 10:51 AM to 12:00 AM

Updates
  • Postmortem
    Postmortem

    The root cause was an EC2 nodepool scaling failure for a region.

    Our auto-cross-AZ redundancy networking didn’t recover the pipelines correctly in this scenario due to some edge cases. We have fixed this with a workaround and the auto-cross-az recovery will be fixed with an update in the coming weeks.

  • Resolved
    Resolved
    This incident has been resolved.
  • Monitoring
    Monitoring
    The nodes for the AZ are back up. Some customer pipelines may have been terminated due to repeated failures during this issue, and the team is reviving any terminated customer pipelines.
  • Identified
    Identified

    Nodes on one of our four availability zones went down. This is causing a portion of the mirror pipelines on that AZ to fail as they try to recover to other availability zones.