CynderHost Status

All systems are operational
Maintenance
Panel CDN Integration Rewrite

Over the next two weeks, from 12/19/2020 to the start of January, we will be conducting a full rewrite of our panel-CDN integration system to make it more extensible and configurable. The impact on panel usability will be low, and we don't expect any noticeable changes during this period. If any bugs are encountered, please report them to support@cynderhost.com.

This event is applicable to: Proton, Neutron

As of 12/23/2020 this rewrite has concluded. All panel features should be functional and fully stable.

Past Incidents

30th December 2020

No incidents reported

29th December 2020

No incidents reported

28th December 2020

No incidents reported

27th December 2020

No incidents reported

26th December 2020

No incidents reported

25th December 2020

No incidents reported

24th December 2020

No incidents reported

23rd December 2020

Gravity Connection Issues

We're aware of issues with sites on Gravity

  • Post Mortem

    At 9:13 PST, Apache conducted an automatic routine graceful restart, in which existing connections are allowed to finish before their workers are terminated and new workers are started. This normally completes within a few seconds with no noticeable impact. During this restart, however, Apache did not behave as expected and failed to recreate new child workers immediately, resulting in the rejection of new requests.
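
    To make the mechanism concrete, the minimal Python sketch below shows how a graceful restart is typically triggered and how the return of idle workers can be verified afterwards. It assumes mod_status is enabled and reachable at http://localhost/server-status?auto; the endpoint and timings are illustrative, not our production tooling.

      # Minimal sketch: trigger a graceful restart and confirm workers return.
      # Assumes mod_status is enabled at /server-status?auto (illustrative only).
      import subprocess
      import time
      import urllib.request

      STATUS_URL = "http://localhost/server-status?auto"  # assumed endpoint

      def idle_workers() -> int:
          """Read the IdleWorkers count from Apache's machine-readable status."""
          with urllib.request.urlopen(STATUS_URL, timeout=2) as resp:
              for line in resp.read().decode().splitlines():
                  if line.startswith("IdleWorkers:"):
                      return int(line.split(":")[1])
          return 0

      # Ask Apache to let in-flight requests finish, then start fresh workers.
      subprocess.run(["apachectl", "-k", "graceful"], check=True)

      # A healthy graceful restart brings idle workers back within seconds.
      for _ in range(10):
          time.sleep(1)
          if idle_workers() > 0:
              break
      else:
          raise RuntimeError("Workers did not return after the graceful restart")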

    At 9:13 PST, a few seconds after this happened, monitored properties began to error out and the number of Apache worker processes fell below normal thresholds. Shortly after, at 9:14 PST, our engineers were paged and notified, and initial diagnosis of the issue began. At 9:15, the incident was confirmed, the cause was narrowed down to an issue with Apache, and the incident was posted to our status page. At 9:16, the cause was identified as the unavailability of Apache workers. Coincidentally, an automated service health recovery daemon terminated the in-progress Apache restart and conducted a hard restart, bringing workers back to normal levels and restoring service. By 9:18, all services were fully functional and performing at pre-outage levels.

    Over the next hour, we investigated the root cause of the Apache graceful restart failure and conducted a cursory audit of our incident response. A fix resolving the issue was deployed, and Apache was reloaded to ensure the patch had been properly applied and addressed the issue. The total duration of this outage was 5 minutes.

    While this outage was brief, we take all outages, big or small, extremely seriously. We’ve identified several key improvements that we will be implementing to ensure this does not happen again:

    • Improved correlation analysis: Based on internal review, the largest area for improvement was the time spent diagnosing the issue. All functional services are monitored through New Relic, which is also responsible for issuing alerts. We will be increasing the sensitivity of our monitors by lowering various thresholds while at the same time implementing better correlation and grouping analysis to filter out false positives and identify when there is an active issue.
      • Estimated response improvement: (-1 minute)
    • Increased polling frequency: Polling intervals for self-recovery monitoring are set to ensure restart loops do not occur. However, based on testing, we believe the interval can be safely lowered slightly to improve recovery time (a rough sketch of this mechanism follows this list). This change will not be deployed until further review can be conducted.
      • Estimated response improvement: (-1 minute)
    • Improved Status Messages: While we’re happy with how quickly our status page was updated, we believe there is significant room for improvement in the level of detail provided. Incidents are fluid, and thorough updates often cannot be provided immediately, but we believe we can improve our workflow to provide a basic ETA and an incident outline sooner when possible.
      • Estimated response improvement: N/A
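
    As a rough illustration of the polling and false-positive trade-offs described above, the Python sketch below shows the general shape of a self-recovery watchdog: it polls a health endpoint at a fixed interval, only intervenes after several consecutive failed checks, and backs off after forcing a hard restart so restart loops cannot occur. The endpoint, interval, and thresholds are hypothetical, not our production values.

      # Hypothetical watchdog sketch; values and endpoint are illustrative only.
      import subprocess
      import time
      import urllib.request

      HEALTH_URL = "http://localhost/server-status?auto"  # assumed endpoint
      POLL_INTERVAL = 5        # seconds between health checks
      FAILURE_THRESHOLD = 3    # consecutive failures required before acting
      RESTART_COOLDOWN = 120   # seconds to wait after forcing a restart

      def healthy() -> bool:
          """Return True if the health endpoint answers within the timeout."""
          try:
              with urllib.request.urlopen(HEALTH_URL, timeout=2):
                  return True
          except OSError:
              return False

      failures = 0
      while True:
          if healthy():
              failures = 0
          else:
              failures += 1
              # Requiring consecutive failures filters out one-off blips so a
              # single slow poll does not trigger an unnecessary restart.
              if failures >= FAILURE_THRESHOLD:
                  subprocess.run(["apachectl", "-k", "restart"])  # hard restart
                  failures = 0
                  time.sleep(RESTART_COOLDOWN)  # guard against restart loops
          time.sleep(POLL_INTERVAL)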

    We apologize for any inconvenience caused by this incident. Please reach out if there is anything we can help with.

  • At this time, we've confirmed that the proper patches were applied. A post mortem is coming shortly.

  • To ensure that patches have been properly applied, we will be conducting a graceful restart of the web server. Service may be intermittently unavailable for a few seconds.

  • We're investigating the root cause of this issue as well as monitoring all services closely. A full post mortem will be provided at the conclusion of our investigation.

  • At this time, the issue has subsided. We're currently investigating the details around this and will provide an update shortly.