Last week, I was in Business Continuity training in Philadelphia with DRII.org. We were discussing RTOs, recovery time objectives, and it got me thinking about the backup/disaster recovery service we offer.
Our backup solution uses StorageCraft’s ShadowProtect to take snapshots of the server on a regular basis, typically hourly. Once a day, a snapshot is sent offsite to bi-coastal datacenters. In the event of a disaster, our clients can have their servers virtualized in the cloud and access them via a VPN, or via Citrix if that client’s Citrix servers are also being backed up.
The reason this came to mind during business continuity training, where we weren’t really discussing technology at all, is that although testing has shown we can recover the servers in the cloud, we haven’t had a consistent experience bringing those servers online. Several obstacles come up during failover testing. Some of them only matter during a test and wouldn’t be an issue in a live event.
First, we have management agents installed on each server for monitoring, maintenance, and support. Each agent has a unique ID. If we run a failover test that has internet access, the live server and the test failover server both report with the same ID, which leads to a ton of false alarms and confusion.
Second, when the servers come up in the virtual environment, they have a new NIC that doesn’t have the configuration it had in the live environment. It’s assigned an IP by the network you configure during the failover, which only gives you options for network, subnet mask, and gateway. As a result, none of the servers can find the domain controllers. You may get the server booted up virtually fairly quickly, but logging in can take quite some time. Then, after you log in, you need to reconfigure the network on the DCs, followed by all the other servers, and reboot.
So I’m sitting in the class thinking about how to fix this, so when we do a test failover, it’s quick and predictable. Here’s what I’m working on to resolve this. Let me know if you see any flaws or have any advice on the best way to code this.
First, I’m going to create a Python script that runs as a service, so that it runs before anyone ever logs in. For it to run on the servers we fail over to, it needs to be installed on the live servers as well, since the failover servers are restored from their backups. This is where I have to be careful, because the script must never change anything on a live server. Here’s what I have figured out so far about what the script will do.
1. It’s going to check the IP address to see if it’s in the range I’m setting up for the failover test network. This network will not be the same network as any network at any of our clients. If it determines it’s part of the network, it moves on to the rest of the script. If not, it exits.
2. The script goes through all the agent services that we don’t want running and disables them.
3. After disabling the services, it goes through those services and stops them.
4. As the last configuration change, the script will set the IP address to a predetermined IP address. These settings will be planned out beforehand and saved to a configuration file on the live server, so when the server is virtualized from a recent backup, the configuration file will be there.
5. Lastly, the script will reboot the server to make sure it refreshes communication with the domain controllers. When the service starts again after the reboot, it will have to check the configuration file to see if the IP has already been set. If so, it will exit. (Just thought of this as I was typing this up)
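The decision logic in the steps above could be sketched roughly like this. It is only an outline under some assumptions: the failover range `10.250.0.0/24`, the JSON layout of the configuration file, and all function and key names are placeholders made up for illustration. The functions only decide what should happen; they don't touch the system.

```python
import ipaddress
import json

# Hypothetical test range; the real one must not overlap any client network.
FAILOVER_NETWORK = "10.250.0.0/24"

# Hypothetical configuration layout, as it might be saved on the live server.
EXAMPLE_CONFIG = json.loads("""
{
    "ip": "10.250.0.10",
    "mask": "255.255.255.0",
    "gateway": "10.250.0.1",
    "agent_services": ["ExampleAgent"]
}
""")


def load_config(path):
    """Read the pre-planned settings from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


def in_failover_network(current_ip, network=FAILOVER_NETWORK):
    """Step 1: true only if the address is inside the failover test range."""
    return ipaddress.ip_address(current_ip) in ipaddress.ip_network(network)


def plan_actions(current_ip, config, network=FAILOVER_NETWORK):
    """Return the ordered steps to perform, or an empty list to exit."""
    if not in_failover_network(current_ip, network):
        return []  # live server: exit without changing anything
    if current_ip == config["ip"]:
        return []  # IP already set, so this is the boot after our reboot
    actions = [("disable_service", name) for name in config["agent_services"]]
    actions += [("stop_service", name) for name in config["agent_services"]]
    actions.append(("set_ip", config["ip"]))
    actions.append(("reboot", None))
    return actions
```

With this layout, `plan_actions("192.168.1.20", EXAMPLE_CONFIG)` returns an empty list because the address is outside the test range, while an address such as `10.250.0.57` yields the disable, stop, set IP, and reboot steps in that order. One design note: the "already set" check only works if the planned IP is itself inside the failover range and is not an address the failover network would hand out on its own.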
This should save us and clients a ton of setup time for failover testing and let us have a more predictable RTO in a live scenario.
So far I’m using the winreg module and the win32serviceutil module. I haven’t gotten to the IP configuration part yet. Once I get this coded up, I’ll put another post out with the code. If you have any input or recommendations now, let me know.
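For anyone curious how those two modules might fit together, here is one possible sketch, not the finished code. It assumes Python 3 with pywin32 installed, disables a service by setting its `Start` registry value to 4, and stops it with `win32serviceutil.StopService`. The IP part is a guess at one approach: shelling out to `netsh`, whose exact syntax can vary between Windows versions. Setting DNS as well is an assumption about why the servers can’t find the domain controllers. The helper names are invented for illustration.

```python
import subprocess

SERVICES_KEY = r"SYSTEM\CurrentControlSet\Services"
START_DISABLED = 4  # "Start" value that marks a Windows service as disabled


def service_key_path(service_name):
    """Registry path (under HKEY_LOCAL_MACHINE) for a service's settings."""
    return SERVICES_KEY + "\\" + service_name


def disable_service(service_name):
    """Disable a service so it does not start again after the reboot."""
    import winreg  # Windows-only, imported here so the helpers load anywhere
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                        service_key_path(service_name),
                        0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "Start", 0, winreg.REG_DWORD, START_DISABLED)


def stop_service(service_name):
    """Stop a running service through pywin32."""
    import win32serviceutil  # part of pywin32, Windows-only
    win32serviceutil.StopService(service_name)


def netsh_address_command(interface, ip, mask, gateway):
    """Build the netsh command that assigns a static address to a NIC."""
    return ["netsh", "interface", "ip", "set", "address",
            interface, "static", ip, mask, gateway]


def netsh_dns_command(interface, dns_server):
    """Build the netsh command that points the NIC at a DNS server."""
    return ["netsh", "interface", "ip", "set", "dns",
            interface, "static", dns_server]


def apply_network_settings(interface, ip, mask, gateway, dns_server):
    """Run both netsh commands; raises CalledProcessError if one fails."""
    subprocess.run(netsh_address_command(interface, ip, mask, gateway),
                   check=True)
    subprocess.run(netsh_dns_command(interface, dns_server), check=True)


def reboot():
    """Restart immediately so the server re-registers with the DCs."""
    subprocess.run(["shutdown", "/r", "/t", "0"], check=True)
```

Keeping the command builders separate from the functions that run them means the command lines can be checked on any machine before the script is ever installed on a live server.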
Jason Vanzin is the CEO at Vanzin Consulting Corp. He has over 15 years of IT experience and lives in Pittsburgh, PA. He blogs on topics related to Business Continuity, Python programming, and technology in general.