The AWS team has fixed the root cause, and we have confirmed stability in our underlying systems.
As a recap:
From 11:18 AM to 12:46 PM, 3% to 5% of requests to our Subgraph APIs returned errors (5xx) or timed out, in intermittent bursts lasting up to 3 minutes at a time. The impact was concentrated on a number of specific endpoints, so certain customers may have seen up to 50% of their requests erroring or timing out.
After 12:46 PM, under 0.01% of requests were affected, and customer impact was greatly reduced. The number of failed API requests across our whole system dropped to single digits in any 5-minute window, though errors still occurred.
By 3:40 PM, we no longer saw any errors.