Hi there! Lenovo System x3650 M5 here. I recently picked up a job as a data centre employee! As part of my new responsibilities, I deal with any unplanned downtime that hits our servers. Seeing as I have a wealth of experience in predicting and handling downtime, I thought it’d be nice of me to share some of my know-how. So without further ado, I present my survival kit for unplanned server downtime.
A data centre that always runs on schedule is the dream of every IT professional. However, if wishes were fishes, we’d all swim in riches. It’s much better to prepare for the worst, which is the first step to surviving unplanned downtime – and Mondays.
Start by developing a disaster recovery plan (DRP), which will serve as your metronome during a crisis. The plan should outline what everybody needs to do to get the server back online in the event of a major outage, be it a fire, a distributed denial-of-service (DDoS) attack or a plain old dummy spit. Important actions include informing the CEO, assessing damage to infrastructure and estimating how long the recovery work will take.
A response plan of any kind helps reduce confusion and minimise unnecessary actions in the event of an emergency.
A lot of people ask me, “Lenny, why are you so good at being a server?” Aside from my God-given talents, it’s practice. That’s why disaster recovery testing is crucial to preparing effectively for the ‘big day’. Tests ensure staff are a well-oiled machine, prepared for when trouble brews. How many professional sports teams do you know that ‘show up on the day’ and win a championship?
Testing has several benefits, such as ensuring the validity of recovery procedures, verifying the capability of the personnel running the recovery procedures, verifying the time estimate for recovery, familiarising staff with the recovery plan and discovering potential threats to the recovery process.
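To show what verifying the time estimate could look like after a drill, here’s a minimal Python sketch. It’s an illustration only, not part of any Lenovo tooling: the `RecoveryStep` structure, the step owners, the minute values and the 20 percent tolerance are all assumptions I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    name: str               # what has to happen (example steps, not a real DRP)
    owner: str              # who is responsible (hypothetical roles)
    estimated_minutes: int  # the plan's time estimate for this step


def steps_over_estimate(plan, measured_minutes, tolerance=0.2):
    """Return (name, estimated, actual) for every step whose measured drill
    time exceeded its estimate by more than `tolerance` (assumed 20%)."""
    overruns = []
    for step in plan:
        actual = measured_minutes.get(step.name)
        if actual is None:
            continue  # step was not exercised in this drill
        if actual > step.estimated_minutes * (1 + tolerance):
            overruns.append((step.name, step.estimated_minutes, actual))
    return overruns


# Hypothetical plan and drill timings, in minutes.
plan = [
    RecoveryStep("Inform the CEO", "Incident lead", 10),
    RecoveryStep("Assess infrastructure damage", "Ops team", 45),
    RecoveryStep("Estimate recovery time", "Ops team", 20),
]

drill_results = {
    "Inform the CEO": 8,
    "Assess infrastructure damage": 70,
    "Estimate recovery time": 22,
}

for name, estimated, actual in steps_over_estimate(plan, drill_results):
    print(f"{name}: estimated {estimated} min, drill took {actual} min")
```

Any step the drill flags is a cue to revisit either the estimate or the procedure before the real thing happens.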
Walk-throughs of the DRP should be conducted at least once a year, while a full-blown active ‘this is not a drill’ test should happen at least once with every new DRP. Prepare the troops for the test, let them have a chance to digest all the information and give them every chance to succeed. Three, two, one, GO!
OK, so it happened. The server’s gone gaga and you don’t know why. The first thing to do is tell your users something is wrong. Resist the urge to hide it, because that will only reflect poorly on your brand. If people are going to lose work, it’s better they find out quickly so they can start planning accordingly.
However, don’t be too liberal with the information you give out. Just acknowledge that the server is having issues and that you’re working hard to find a solution. I like to ask for patience. No need to be rude! Even if the unplanned downtime was beyond your control, you’ll still have to apologise for the inconvenience. People like to know it’s not their fault.
How many staff will you need to recover effectively? Well, how long is a piece of string, really? The number of IT pros you need at the disaster site depends on the kind of disaster, the time of day and how effective the team members on site are. So stay flexible. I’m all for moving people on- or off-site depending on how big the problem is, but remember to stick to your DRP!
I took a few improvisation classes in college, and it’s good to remember to “yes, and” what everyone says. No one likes a Negative Nelly.
Whatever your DRP looks like, make sure your business takes a holistic approach to any server interruption with all troops, from IT to business development and customer service, doing their part.
The ITIC 2015–2016 Global Server Hardware, Server OS Reliability Report recently named Lenovo x86 servers as the most reliable in the field. Go me! Head on over to www.think-progress.com/lenovo-servers-your-most-reliable-team-member/ or download our awesome white paper here to find out more.