Update: T+3 8:30 AM
We've now reached a final resolution.
RFO: A maintenance event performed by our upstream provider, which was supposed to be no-impact, caused several network issues due to human error. This knocked out public networking as well as the storage backend, leading to a lengthy cleanup and recovery process.
SLA: This event is covered under our 8x SLA. Given the severity, we're offering a credit of 15% of your monthly payment. Please open a ticket with billing to request the SLA credit.
We'd like to apologize once again - we don't take this incident lightly. While we initially made the move to the "cloud" in hopes of achieving a faster and more reliable platform, these types of incidents are unacceptable. We'll be taking a close look at that decision and re-evaluating as necessary.
Update: T+1 10:30 PM
- We've revised our policy to include posting all maintenance events (by both us and our upstream vendors) on our status page. While most, if not all, should be no-impact, they'll be outlined in the interest of transparency.
- We've worked with our vendors, and the final cause of the network outage (which triggered the storage backend errors) was human error that introduced a network loop in the switch configuration. We're auditing what happened and what the errors were to ensure they don't happen again.
Update: 8:30 PM
To follow up on our initial update, our first set of changes includes:
- We're expanding our team to include cross-regional system administrators. While operating on a 24/7 on-call basis serves us well, we believe having a team on standby would accelerate initial response times.
- We've set a target to distribute our client area to another data center by 6/15.
- We've added additional checks and policies to assist in the diagnosis of such issues while we continue to work on the RCFO (a rough sketch of one such check is below). Unfortunately, we don't have an ETA for the final resolution at this time.
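To give a concrete sense of what one of these checks can look like, here's a minimal sketch of a write probe that detects when a storage mount has been flipped to read-only, which is the failure mode we saw here. The mount path, interval, and alert hook are placeholders for illustration, not our production tooling.

```python
#!/usr/bin/env python3
"""Minimal read-only filesystem probe (illustrative sketch, not production tooling)."""
import errno
import os
import sys
import time

MOUNT_POINT = "/var/www"                        # placeholder: the networked-storage mount to probe
PROBE_FILE = os.path.join(MOUNT_POINT, ".rw-probe")
INTERVAL_SECONDS = 30                           # how often to re-check

def alert(message: str) -> None:
    # Placeholder: in a real deployment this would page the on-call team.
    print(f"ALERT: {message}", file=sys.stderr)

def probe_once() -> bool:
    """Return True if the mount accepts writes, False if it is read-only or unreachable."""
    try:
        with open(PROBE_FILE, "w") as f:
            f.write(str(time.time()))
            f.flush()
            os.fsync(f.fileno())
        os.remove(PROBE_FILE)
        return True
    except OSError as exc:
        if exc.errno == errno.EROFS:
            alert(f"{MOUNT_POINT} has been remounted read-only")
        else:
            alert(f"write probe on {MOUNT_POINT} failed: {exc}")
        return False

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(INTERVAL_SECONDS)
```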
Another update will be provided within 24 hours.
Update: 12:30 PM
Some details from our investigation into today's outage:
- The primary cause was the storage backend. One change we made fairly recently was the switch to "networked" storage from a typical local RAID array setup. In theory, this provides better reliability and redundancy.
- Previously, a similar issue occurred on 5/24 (post-mortem here: https://blog.cynderhost.com/5-24-proton-hp-server-outage-post-mortem/) - the network went out, so IO writes failed and the filesystem went "read-only", causing the errors (5XX, 4XX) seen on some sites.
- To provide a permanent fix, we've worked with our vendors to pin down exactly what was happening and outlined a plan for restructuring the internal network to make it more resilient.
- They attempted to make some changes to the network to resolve these issues. We had no prior notice of this, which we'll chalk up to a communication mishap, but the event was supposed to be "no-impact" - meaning exactly that.
- Unfortunately, it didn't go as expected, and we saw a recurrence of the above issue.
- This outage lasted longer than the previous one, as there were two issues that had to be rectified before recovery could take place. That said, prior experience with this failure mode did streamline and speed up recovery once the network was restored.
- We're still looking into this and evaluating the best way to proceed. There are a few items on our list for improvement:
- s1: Permanent, non-impacting resolution of the storage issues that were the RCFO
- s2: Communication around maintenance events
- s3: Moving our main website to a completely separate datacenter and finishing the transition as outlined previously
- s4: Evaluating the potential for an active-passive failover setup or cross-datacenter redundancy (a rough sketch of the health-check side of such a setup is below).
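To make s4 a bit more concrete, here's a rough sketch of the health-check half of an active-passive setup: a watcher polls the primary's health endpoint and, after a few consecutive failures, triggers promotion of the standby. The endpoint URL, thresholds, and the promotion step are assumptions for illustration only, not a finalized design.

```python
#!/usr/bin/env python3
"""Sketch of an active-passive failover watcher; hostnames and promotion logic are placeholders."""
import time
import urllib.error
import urllib.request

PRIMARY_URL = "https://primary.example.invalid/health"   # assumed health endpoint on the active site
FAILURE_THRESHOLD = 3                                     # consecutive failures before failing over
CHECK_INTERVAL = 10                                       # seconds between checks

def primary_healthy() -> bool:
    """Return True if the primary answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return False

def promote_standby() -> None:
    # Placeholder: in practice this would repoint DNS or a floating IP to the standby datacenter.
    print("Promoting standby to active")

def main() -> None:
    failures = 0
    while True:
        if primary_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                promote_standby()
                break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```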
We'll continue to provide updates as we look into this. The two outages in the past month are unacceptable, and we take full responsibility. If you have any questions or feedback, please reach out to our team.
Update: 8:50 AM
We"ll be posting progressive updates around of investigations of this incident.