100% of the affected email and database servers are now back online. If you are still experiencing any problems with your email or databases, please contact our support team. We’ve begun our formal investigation into the cause of the problem, and will publish the post-issue report on our main blog next week; we will link to that post here.
Sunday morning at 3:49am EST, we discovered a problem with the RAID system on one of our storage arrays. We determined that migrating the data from the failed array to new locations was the best way to get everything back up and running as quickly as possible.
Simplified description of the issue:
- RAID is a type of data storage system designed to handle drive failures by automatically rebuilding a failed drive’s data onto a spare; but when multiple RAID drives fail in too close succession, servers go down. This extremely unlikely event is what occurred early Sunday morning, and we are still investigating its root cause. (See the illustrative sketch after this list.)
- The rebuild takes a long time for each server because of the delicate state the RAID array is in. Our engineers are painstakingly going through all data, ensuring that it is in good condition, fixing any issues, and then migrating it to tested drives. Extreme caution must be taken to avoid heavy loads, which could cause additional drive failures in the array.
- It’s too risky to attempt a repair of the array in this delicate state until after the data has been migrated, as another failure could cause data loss.
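For the technically curious, here is a toy model of why one drive failure is recoverable but two near-simultaneous failures in the same parity group are not. This is a simplified XOR-parity sketch for illustration only, not our vendor’s actual RAID implementation.

```python
# Toy XOR-parity model: how a RAID array rebuilds one failed drive, and why
# two failures in the same parity group are unrecoverable. Illustration only.

def xor_blocks(blocks):
    """XOR corresponding bytes across blocks; this is how parity is computed."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

drives = [b"\x10\x22", b"\x35\x47", b"\x58\x6a"]  # data blocks on 3 drives
parity = xor_blocks(drives)                       # the stored parity block

# One drive fails: XOR of the survivors plus parity rebuilds it exactly.
failed = 1
survivors = [d for i, d in enumerate(drives) if i != failed]
assert xor_blocks(survivors + [parity]) == drives[failed]

# Two drives fail: the same XOR now yields only the combination of BOTH
# missing blocks (one equation, two unknowns), so neither can be recovered.
```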
Technical description of the issue (updated with further clarification):
- The affected cluster of servers uses a SAN, which is made up of 5 storage arrays in a tiered configuration. Each storage array consists of 14 enterprise-grade 15,000 RPM SAS iSCSI-connected hard drives with 2 hot spares. This is a ‘14+2 RAID 50’ storage array.
- During regular integrity scans, the RAID controller for one of the 5 storage arrays detected degraded service on Drive 6. As designed, it automatically activated a hot spare, Drive 10, and began rebuilding the RAID array, selecting Drive 0 as the spare’s parity source for restoring the data onto Drive 10. Almost immediately after this rebuild began, Drive 0 failed, which corrupted the data being rebuilt on Drive 10. Two unusable drives in the same parity group exceeded the fault tolerance of this storage array. This is the point where the outage began: up until the moment the source drive (Drive 0) failed, all servers in this cluster were online and functional.
- Working with our hardware vendor’s tier 4 engineers, we were able to coax Drive 0 back into active status while Drive 6 remained degraded. This satisfied the fault tolerance threshold of one degraded drive per parity group and allowed the storage array to be reactivated. This got the SAN back online, and we were able to start transferring your data from the affected volumes to stable ones.
- Due to the very fragile nature of the array and its dependency on Drive 0 remaining active (it may crash at any time, causing a loss of all data on the array), our hardware vendor’s senior tier 4 engineer stated that the evacuation process must be handled one volume (one server) at a time, in sequence, as sketched below. At this point, additional engineers or hardware cannot speed up the recovery: evacuating multiple volumes in parallel would be faster, but could trigger another fault and possibly data loss.
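To make the sequencing constraint concrete, here is a minimal sketch of that evacuation loop. Every name in it is a hypothetical stand-in; this is not our actual recovery tooling.

```python
# Illustrative sketch of the one-volume-at-a-time evacuation. All function
# and volume names are hypothetical stand-ins, not our real tooling.

def read_volume(volume_id):
    # Stand-in for a low-intensity read of one volume off the fragile array.
    return f"data-for-{volume_id}"

def verify_and_repair(data):
    # Stand-in for the per-volume integrity checks and fixes.
    return data

def migrate_to_stable_array(volume_id, data):
    # Stand-in for copying the verified data onto tested drives.
    print(f"{volume_id}: verified, migrated, and back online")

volumes = ["vol-001", "vol-002", "vol-003"]  # hypothetical volume names

# Strictly sequential: running these in parallel would raise the I/O load on
# the degraded array and risk another drive failure, which is why adding
# engineers or hardware cannot shorten the recovery.
for volume in volumes:
    migrate_to_stable_array(volume, verify_and_repair(read_volume(volume)))
```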
At this time, we believe that we will be able to restore most of the servers and their data, either from your live data or from a backup. It is possible that some data was lost, but so far we have been able to recover 100% of it.
We’re still working around the clock to get the remaining servers back online. Once we’re 100% up and running, we will be conducting an in-depth investigation and publishing a post-issue report on our main blog. This report will include the root cause of these issues, as well as documentation outlining the steps we’ll be taking to prevent any future incidents of this nature.
When the outage began, email sent to affected servers started bouncing back to the sender with a notification that it was not delivered. Systems administrators blocked port 25 on those mail servers very early Monday morning; since then, sending mail servers that cannot connect queue the mail and retry automatically instead of bouncing it. We expect all queued mail to be delivered once each mail server returns to service, though delivery could take 24 hours or more to complete. One exception is mail907: due to an error, this server was missed during the port 25 block. All mail already on that server is intact, but all mail sent to it during the outage was bounced back to the sender with a non-delivery notification. Admins have checked all remaining mail servers and have confirmed that mail907 was the only mail server affected by this error.
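If you would like to verify for yourself that your mail server is accepting connections again, here is a simple check using only Python’s standard library. The host name is a placeholder; substitute your own mail server (see the MX lookup tip below). Note that some residential ISPs block outbound port 25, which would make this check fail even against a healthy server.

```python
# Check whether a mail server is accepting SMTP connections on port 25.
# "mail907.example.com" is a placeholder host name for illustration.
import socket

host = "mail907.example.com"
try:
    with socket.create_connection((host, 25), timeout=10) as conn:
        banner = conn.recv(1024).decode(errors="replace").strip()
        print(f"{host} is accepting mail: {banner}")  # expect a "220 ..." greeting
except OSError as exc:
    print(f"{host} is not accepting connections on port 25: {exc}")
```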
There is a way to determine your email server through the control panel, but it’s easier to just go to mxtoolbox.com, enter your domain name, and click the MX Lookup button.
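If you prefer to script the lookup instead, the same MX query takes a few lines of Python, assuming the third-party dnspython package (pip install dnspython) is installed; "example.com" is a placeholder for your own domain.

```python
# Look up the MX (mail exchanger) records for a domain using dnspython.
import dns.resolver

for record in dns.resolver.resolve("example.com", "MX"):
    # record.exchange is the host name of your mail server.
    print(record.preference, record.exchange)
```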
To find your database server:
1. Log in to your HE Web Hosting account (https://my.hostexcellence.com/)
2. Go to ‘My Products,’ then click ‘Manage’ under your hosting product
3. Click the MySQL or PgSQL Server icon under ‘Databases’
4. The server name will be listed next to ‘Host Name’ at the top
All servers are now ONLINE:
- All Control Panels (CP9, 10, 11)
We are going to continue to monitor this situation and complete an overview of all servers to make sure everything is functioning properly. We greatly appreciate your patience and understanding with this issue, and will provide any other details we have as they become available. Thank you.