CynderHost Status

Service Issue Impacting proton.cynderhost.com Wednesday 13th October 2021 00:23:35


We are aware of issues impacting some sites hosted on proton.cynderhost.com and are currently investigating and working to restore service to full capacity as soon as possible.

We will provide updates and an ETA on this page if possible and when available. We apologize for any inconvenience caused by this issue and we appreciate your patience as we work to restore all impacted services.

You may reach us at support@cynderhost.com for any questions.

Post-Mortem

At 00:45 AM PST, our team responded to multiple alerts indicating elevated backend error rates. After triaging the incident, we isolated the issue to a problem with Apache (HTTPD) failing and causing backend unreachability, and subsequently, 5XX errors.

Upon further investigation, we identified the root cause of the issue as a WAF rule channel vendor timing out on HTTPD startup, causing the server to be killed before starting up. One of our WAF rules vendors hosts their rules feed on OVH's network, which experienced a global backbone failure at the time and brought their entire network offline. Our team disabled the rulesets in question and brought the server online.

Resolution time was compounded by the fact that our primary telemetry instances were also hosted on OVH's network and affected by the outage, requiring us to use a backup platform.

We build our infrastructure to be extremely resilient, implementing multiple levels of redundancy and self-healing checks. While we don't utilize OVH's network for end clients and such outage should not have caused any noticeable issues, we sincerely regret that this was not the case and apologize for any inconvenience or impact caused.

Moving forward, we’ll first and foremost be stripping Apache’s startup process of any external dependencies. Fetching an external rule on startup provides several advantages. By combining these updates with startup, we save unnecessary restarts, while being able to deploy rules with an extremely low time-to-live as needed. There are built-in failsafes to ensure that an unavailable distribution channel is simply ignored instead of blocking. However, due to a behavior of Apache, this failsafe contains edge cases which cannot be resolved. Therefore, to ensure continued stability, we’ll be developing an alternative.

Furthermore, we’ll be auditing our reliance on third-party networks, to ensure similar incidents of this kind do not affect overall availability of our core platform, in addition to further training with various failover and incident recovery scenarios.

We’d like to thank everyone for their understand, and as always, encourage anyone with questions or feedback to reach us at support@cynderhost.com

We thank everyone for their patience. At this time all issues have been resolved.

We are continuing to investigate and resolve issues impacting proton.cynderhost.com