When data backup goes wrong
30 Jul 2021
Businesses know that they need backups as part of a data management and Disaster Recovery (DR) strategy. In reality, though, backup failures are all too common, and they happen for any number of reasons, some of which seem trivial. For DR and Business Continuity (BC) strategies to be effective, businesses need to understand how and where backups can go wrong so that they can mitigate the risks. Often, the easiest solution is to engage a specialist service provider to monitor and manage backups, so that backups are available when they are needed and the business can be up and running again with minimal disruption and downtime.
What is a backup failure?
Data backups are a failsafe for when a business needs to revert to a previous copy of its data. They are essentially an insurance policy. However, many things can cause a backup not to run, or not to complete successfully, and this is a backup failure. It means that a backup is not available when it is needed and, depending on how long backups have been failing, it could set the business back hours or even days.
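To make the "hours or even days" point concrete, here is a minimal Python sketch of the arithmetic, assuming a simple log of backup runs. The function name `data_loss_window`, the log format and the dates are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

def data_loss_window(runs, incident_time):
    """Return how far back the business must revert after an incident.

    `runs` is a hypothetical log: a list of (finished_at, succeeded) pairs.
    Returns None if no successful backup exists before the incident.
    """
    successful = [finished for finished, ok in runs
                  if ok and finished <= incident_time]
    if not successful:
        return None
    # Everything created after the last good backup is at risk.
    return incident_time - max(successful)

# Illustrative data: nightly backups, the last two of which failed.
runs = [
    (datetime(2021, 7, 27, 1, 0), True),
    (datetime(2021, 7, 28, 1, 0), False),
    (datetime(2021, 7, 29, 1, 0), False),
]
incident = datetime(2021, 7, 29, 15, 0)
window = data_loss_window(runs, incident)
print(window)  # 2 days, 14:00:00
```

In this made-up case, two unnoticed failures turn a loss of at most one working day into more than two and a half days.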
For backups to run successfully, both the production and backup environments need to be operational. This seems like common sense, but in the South African context it is often more challenging to achieve than it may appear.
Infrastructure availability is a challenge
Load shedding is an all-too-common reason why backups fail. If there is no electricity at the production site or the backup site, the backup cannot run. Similarly, if the network is down, whether because of a connectivity failure, a power failure or international cable damage, the backups cannot run. Malware attacks on the backup or production environments can also cause backups to fail or run incorrectly.
Backups may also fail because the backup media is not available: disk storage might be faulty, or tape storage may be full or corrupted. Infrastructure readiness can be a challenge too, because cable theft and accidental damage to cabling are unfortunate realities. Even the failure of seemingly unrelated equipment such as air conditioners and fire suppression systems can cause backups to fail. If the data centre overheats, or the air conditioner leaks and causes water damage, the environment will not be available for backup. If the fire system malfunctions and deploys unnecessarily, computer equipment can be damaged, which can in turn cause backups to fail.
Human error is always a factor
No matter how much Artificial Intelligence (AI) is injected into software, there is always the potential for human error to cause failure. If tapes need to be manually changed, and this cannot be done, perhaps because of lockdown regulations, then backups will fail when the tapes are full. If humans need to intervene to manage the backup solution or change files, this introduces potential for error.
Even if everything is automated, someone needs to set up and manage the automation. Human error extends to seemingly ridiculous mistakes, such as accidentally switching off the plug that powers the server, or turning off the cooling system over a long weekend. It could even be a matter of accidentally deleting data. The strategy needs to take human error into account and cater for it.
Evolve the strategy
Backup strategy needs to evolve along with the business. If a business has deployed its production environment in the cloud, but the backup is still running to an on-prem data centre, there is a mismatch between the production and backup strategies. Likewise, if a business still uses tape as a backup destination but the workforce is remote, who changes the tapes during lockdown, and how do users recover in the event of an incident?
Businesses need to ensure their backup strategy is aligned with business needs in a digital, cloud-driven and post-pandemic world. DR readiness requires backups to take place regularly and be up to date. If this does not happen, businesses may lose access to the latest copy of their data. There is no such thing as working hours anymore, so data must be available 24/7 and backups need to be able to run at any time. In addition, the impact of backup failure goes beyond BC: with the Protection of Personal Information Act (PoPIA) in effect, backup failures can mean a compliance breach, which could have dire consequences.
Mitigate the risk
Following basic data management principles is essential. It is imperative to perform readiness checks before backups run, to monitor backups actively while they run, and to review them afterwards to confirm that they succeeded. With these checks in place, failures are far more likely to be caught before they can cause problems. However, this requires constant time and attention, which many businesses do not have the capacity for. Engaging a managed service provider means that businesses will have dedicated skills proactively monitoring and maintaining the backup and DR environments.
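As a rough illustration of the checks described above, the Python sketch below shows what a pre-backup readiness check and a post-backup review might look like for a simple file-based backup. It is not tied to any particular backup product: the function names `readiness_check`, `sha256_of` and `post_backup_review` are hypothetical, and the sketch assumes the backup destination is a directory and that a backup is a plain copy of a source file. Monitoring during the run depends on the backup tool in use and is not shown.

```python
import hashlib
import os
import shutil
from pathlib import Path


def readiness_check(destination: Path, required_bytes: int) -> list:
    """Return the problems found before the backup starts (empty means ready)."""
    problems = []
    if not destination.is_dir():
        # Media missing or not mounted: nothing else can be checked.
        problems.append(f"backup destination {destination} is not available")
        return problems
    if not os.access(destination, os.W_OK):
        problems.append(f"backup destination {destination} is not writable")
    free = shutil.disk_usage(destination).free
    if free < required_bytes:
        problems.append(f"only {free} bytes free, {required_bytes} needed")
    return problems


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def post_backup_review(source: Path, copy: Path) -> bool:
    """Confirm the copy exists and matches the source byte for byte."""
    return copy.is_file() and sha256_of(source) == sha256_of(copy)
```

A scheduler could call `readiness_check` before each run and raise an alert if the list is not empty, then call `post_backup_review` afterwards so that a silent failure is noticed the same day rather than at recovery time.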