You might have experienced about 30 minutes of downtime on our web site yesterday. Here is the story.
While doing some software maintenance on our colocation server over RDP, I took a look at the RAID array, only to realise that one of the disks had failed. And it had been in that state for two weeks already. It is a 3ware RAID card, and of course it has a feature that alerts you via email if anything goes wrong. I just forgot to set it up when I installed the server.
So I grabbed an identical hard disk (we did have some spares), scheduled a visit to the colocation site for 22:00 CET, went in, replaced the hard drive, and went home. I spent about 30 minutes there, which could have been shorter if I had been allowed to go to the server myself, but that is not possible. I can only go to the visitor room, and the technical staff bring the server there, no matter whether it has hot-swappable disks.
The current config is RAID 5, but I am worried about the data loss that can occur if 1. two disks fail at the same time, or 2. another disk fails during the array rebuild. The first scenario is quite possible, given that the disks built into a server are usually manufactured together, with almost identical serial numbers. So in the future we will opt for RAID 6, which seems to be a good choice because it survives two failed disks instead of one. The cost of a gigabyte is ever decreasing, and there are 500 GB or even 750 GB drives already. One can easily put a number of terabytes into a server, which seemed unreachable even a few years ago.
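To put rough numbers on the RAID 5 versus RAID 6 trade-off, here is a minimal Python sketch. It assumes a four-disk array, which is a made-up figure for illustration only, and uses the 500 GB drive size mentioned above. The function name `raid_summary` is hypothetical, and the calculation ignores controller overhead and hot spares.

```python
# Parity disks' worth of capacity used by each RAID level.
PARITY_DISKS = {5: 1, 6: 2}


def raid_summary(level, disks, disk_gb):
    """Return usable capacity and tolerated disk failures for RAID 5 or 6.

    Assumes all disks have the same size (disk_gb).
    """
    if level not in PARITY_DISKS:
        raise ValueError("only RAID 5 and RAID 6 are covered by this sketch")
    parity = PARITY_DISKS[level]
    # RAID 5 needs at least 3 disks, RAID 6 at least 4.
    if disks < parity + 2:
        raise ValueError(f"RAID {level} needs at least {parity + 2} disks")
    return {
        "usable_gb": (disks - parity) * disk_gb,
        # The array survives as many failed disks as it has parity disks.
        "tolerated_failures": parity,
    }


if __name__ == "__main__":
    # Hypothetical array: 4 disks of 500 GB each.
    for level in (5, 6):
        print(f"RAID {level}:", raid_summary(level, disks=4, disk_gb=500))
```

With four 500 GB disks, RAID 5 leaves 1500 GB usable and survives one failed disk, while RAID 6 leaves 1000 GB usable and survives two. That second tolerated failure is exactly what covers both worries: two disks failing together, or one more failing during the rebuild.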
After restarting the server, the actual rebuild took about 6 hours to complete. I had to fight some other problems as well. One was that the NNTP server did not start because it required some indexing. This is a known problem, at least for us: that thing usually fails about once a month.