CynderHost Status

Network Disruption on Proton Friday 8th January 2021 00:15:00

We are aware of issues connecting to the backend of Proton

We've received a full update and investigation from our datacenter. The full report is below.

During this incident, customers with servers might have experienced increased latency and/or packet loss over the frontend public network.

On 08 January 2021 at 08:18, Network Specialists were alerted to a failover of the primary management module in the Frontend Customer Router (fcr06b.dal09). Initial investigation showed that the management module had previously attempted to failover from primary to secondary. When that failover occurred, the standby module halted the forwarding of traffic until the failover was complete. The module failover completed on 08 January 2021 at 08:43, without manual intervention.

Network Specialists, working with the hardware vendor, were able to identify a failing module that triggered the switchover and issued an RMA for the hardware. The hardware vendor, when reviewing the case, identified the failing module as the reason that the failover between management modules did not happen as quickly or cleanly as expected. Because of this hardware failure, High availability of the device was also impacted.

The failed hardware has been replaced, and the device will be brought back into full service during a planned maintenance window on or before 10 February 2021. A future update will be provided once this is complete.

As of 12:43 AM PST, 1/8/2021, this issue should have subsided. An RCA is below:

Total Outage Time: 25 minutes

On 08 January 2021, 12:18 AM PST, our datacenter located in Dallas experienced a network service disruption. The cause of this was a failure in the active chassis management module (supervisor) in the frontend router for the VLAN in which Proton is located in. This resulted in the inaccessibility of the backend server. At 12:43 AM PST, a secondary, backup chassis management module took over automatically and all public network routing was restored immediately. We are currently coordinating with our network provider, and data center engineers to isolate the cause of the failure and apply the necessary fixes to the primary router to ensure such failure does not happen again.

We apologize for any issue this has caused - we take the reliability of our platform, especially our premium high-performance platform, extremely seriously. While our SLA only governs downtime exceeding <99.9% a month, or roughly 50 minutes, this incident exceeds both our internal uptime targets of 99.99% and 99.95% - as such, we will be providing compensation on request as per our 800% compensation policy. Please open a support ticket with us if you'd like to request SLA credit.

A more detailed RCA will be provided once the primary router is fully restored.