AlwaysOn Backup and Other Tips
SQL Server’s AlwaysOn technology (available since SQL Server 2012) provides high-availability and disaster-recovery database functionality which largely supplants mirroring and log-shipping – in fact, mirroring is now deprecated. Exactly what functionality is available and how robust it is, varies by release (2012+), edition (Standard versus Enterprise) and the physical infrastructure devoted to it. AlwaysOn is fairly easy to set up (though it requires cooperation from both Windows Server and networking admins) and, relative to the required effort, provides exceptional durability for SQL Server databases. AlwaysOn is not a complete OOTB durability solution and has significant gaps – e.g. it does not maintain SQL Server logins and roles synchronized across multiple servers – but it is an excellent start for the needs it caters to.
This post assumes the reader has at least a basic familiarity with SQL Server backups, as well as AlwaysOn technology, topologies, possibilities and terminology, and does not cover these subjects except as required for clarity. Both of these topics have countless resources online. For purposes of this post it is enough to know:
- AlwaysOn is a layer above Windows Server Failover Cluster technology. It comes in two variants, with this post focusing on AlwaysOn availability groups (AG).
- AlwaysOn provides durability by synchronizing selected database(s) on a “primary” server (aka node) to one or more SQL Server instances on “secondary” servers. The selected databases are collectively called an availability group (AG).
- At any given time, the primary node is where client activity occurs, with secondaries either read-only or not client-readable. If the primary node fails, operations switch seamlessly to one of the secondaries, which becomes primary – with defined levels of possible data loss, depending on configured latency among the nodes.
This post provides tips on some non-obvious AlwaysOn topics, with a particular emphasis on how traditional database backups should be managed when used with an AlwaysOn AG. This subject is not well covered by existing resources.
SQL Server AlwaysOn Availability Groups Licensing
An AlwaysOn installation will involve at least two instances of SQL Server on two servers, with a greater number for greater redundancy and reliability. It is easy to overlook the licensing implications of these multiple instances, but correct understanding of this is essential to planning and budgeting the AlwaysOn solution. Here are the most important facts:
- All nodes supporting an AG should be the same release and edition of SQL Server.
- Prior to SQL 2016 (and starting with SQL 2012), an AlwaysOn AG requires Enterprise licensing. Beginning with SQL 2016, Standard edition supports “basic availability groups” (BAG), which is considered a replacement for mirroring, the latter being deprecated with this release. BAG has the following significant limitations compared to the full functionality afforded by Enterprise licensing:
- Exactly two nodes, a primary and a secondary, are supported per AG.
- An AG may contain only one database. However, multiple AGs may be created, and each one’s secondary need not be the same node.
- Secondaries are not readable, so offloading of reporting and backups are not supported.
- As of the date of writing, a BAG must be created via TSQL, using special syntax – SSMS does not provide support.
- The primary node requires a full license – either Standard (SQL 2016+) or Enterprise. But what about the secondaries? Microsoft guidance states that secondaries do not require licenses as long as they are “truly passive”, defined as existing solely for failover support within the cluster. Any active use of a secondary node, which includes offloading reporting or backups from the primary (two of the advantages of using AlwaysOn), or any other non-passive SQL Server processing, requires a license. NOTE 1: the AG databases on an unlicensed secondary will not be readable. NOTE 2: consult your Microsoft rep to be certain of the details for your situation.
- Following the above licensing rule, one might have no secondaries licensed, all licensed, or some combination, depending on requirements. A typical configuration would have two licensed nodes operating synchronously for high-availability – also thereby allowing reporting and backups from the secondary node – and a third unlicensed, non-readable, asynchronous node for disaster recovery (DR).
- These licensing rules might make SQL Server Replication an attractive alternate technology for supporting reporting and other non-critical purposes.
Overview of SQL Server Backups on AlwaysOn
One might argue that with AlwaysOn, backups are not as important to resiliency and DR, at least for those databases participating in an AG, as they are traditionally. The more synchronous secondaries are configured and the more geographically dispersed they are, the truer this is. On the other hand, as a practical matter, it will often be the case that synchronous nodes must reside in the same data center, increasing the possibility of concurrent primary and secondary failure and therefore the need to restore from backups. Perhaps the most important reason to maintain a traditional backup regime when using AlwaysOn is to be able to restore to a point-in-time, as AlwaysOn cannot protect against functional corruption of a database introduced by application or SQL Server error, a new release, operator error or certain hardware errors. If backing up to protect against these, you will need to determine how far back in time backups will be maintained. The remainder of the post assumes that the need for backups is established.
AlwaysOn imposes restrictions on what types of backups may be performed on primary versus readable (i.e. licensed) secondary nodes. All forms of SQL Server backup – full, log and differential – may be performed against the primary node. Only copy-only full and log backups may be performed against readable secondary nodes. Within these restrictions, the standard literature implies that you may implement independent backup schemes against all readable nodes, and indeed, nothing will stop you from doing so. However, implementing backups across nodes incorrectly will lead to an untenable situation should you ever need to restore. As this would defeat the point of taking backups, it is very important to understand. The existing literature implies but does not make this explicit, which motivates this post.
One feature of AlwaysOn AGs that is much documented is the ability to configure “backup preferences”, which may be specified as “prefer secondary”, “secondary only” and “primary”. You may also configure relative priority % for backups against different nodes, and whether any nodes should be excluded from being backed up. What may not be obvious is that these configurations are only metadata – they alone have no effect on anything. It is up to your backup scheme to interrogate these settings and take action (or not) accordingly. The TSQL function sys.fn_hadr_backup_is_preferred_replica(@DBNAME)) is provided for this purpose.
On an AlwaysOn cluster, there will be at least two and possibly more nodes on which backup jobs would typically be configured and running simultaneously. You want a setup where backups will continue to be taken as desired, regardless of failovers and node-role changes that may occur – while at the same time (as will be explained), not taking backups where they should not be taken. For simplicity, the same script should be running on each candidate node. This script can call sys.fn_hadr_backup_is_preferred_replica(@DBNAME)) to evaluate the backup preferences on the node it runs on – including the node’s current status as primary or secondary – and take appropriate action, either exiting without action or performing some type of backup. This works because each node in the AG has the same metadata for this configuration. The code would look something like this (from MDSN):
IF (NOT sys.fn_hadr_backup_is_preferred_replica(@DBNAME))
Select ‘This is not the preferred replica, exiting with success’;
RETURN 0 – This is a normal, expected condition, so the script returns success
BACKUP DATABASE @DBNAME TO DISK=<disk> WITH COPY_ONLY;
Fortunately, you need not develop your own TSQL backup scripts. A highly recommended, tested and industry-accepted option is the free work of Ola Hallengren. The code available here is AlwaysOn-aware and will respect the meaning of the backup preference configuration on the node on which it runs. Installing this code, which takes the form of stored procedures (sprocs) and a logging table, will also create SQL Agent jobs for each type of backup – full, log and differential. All you need to do is add a schedule to each job you wish to run under your backup plan, and adjust the job parameters as desired. The Ola code takes many parameters to enable controlling all the parameters of the BACKUP command, for example doing a copy-only full backup on a secondary node, verifying the backup, and using checksums and compression. It also has others, such as one that allows overriding the AlwaysOn backup preference configuration and forcing a backup.
NOTE: by default, the Ola code and its logging table are installed in the master database. A standard best practice is to establish a “DBA utility” database on each SQL Server instance to house maintenance artifacts such as these. While this is easily done in general, there is a gotcha with AlwaysOn, if you choose to place this utility database in an availability group. Since AG secondaries are read-only, the Ola code will be unable to write to its logging table on secondaries to which you have offloaded backups. Whether it writes to this table is subject to a parameter and so can be turned off, but in general the information so captured can be useful for debugging. In such case, you’ll have to go with the default, or at least locate the logging table in a writable database and change the Ola code accordingly. Going with the default installation is the most practical, even though in general we would prefer not to have non-SQL Server artifacts in the master database.
General Planning for SQL Server Backups on AlwaysOn
The points in the prior section can be mapped to your situation to determine what backup strategy is feasible for your AlwaysOn installation. For example:
- If using BAG, all backup types can only be done against the primary node.
- Differential backups can only be done against the primary node, regardless of any other consideration.
- Differential backups can only be applied to non-copy-only full backups. Since AlwaysOn secondaries can produce only copy-only full backups, this means that if you want differential backups in your backup plan, you must ensure that non-copy-only full backups and subsequent differential backups are taken on the primary node – you cannot use the differentials against the copy-only backups you take against secondaries. You will also want to make sure that copy-only backups are never taken on the primary, which would preclude using any differential backups from prior to that time. Further, you will want to arrange for these backups to be copied elsewhere to eliminate the single point of failure you are otherwise exposed to.
- Log backups are not affected by either non-copy-only or copy-only full backups and can be applied to a restore of either type. Thus, if differential backups are not to be used, a viable backup scheme is to perform both copy-only full backups and log backups on an AlwaysOn secondary. As with BAG, appropriate copying of such backups to additional locations, on an appropriate schedule, is critical for maximum resilience.
Planning SQL Server Log Backups on AlwaysOn
Careful readers will have noticed that nothing in the foregoing precludes taking log backups on any readable AlwaysOn nodes. As distinct instances, log backups could be scheduled and taken independently on any or all such nodes. Nothing prevents this, and it might seem like a good idea, creating redundant log backups and thereby increasing resilience without any further special measures. This would be gravely mistaken – but you would never know this from online resources on the topic.
Some resources imply the problem when they state that log backup information is maintained across all AlwaysOn nodes. What this actually means is that all nodes know about any log backup taken on any node. This means that if a log backup is taken on node SS, all nodes will know what the last LSN (log sequence number) recorded by that backup is. If a log backup subsequently runs on node AS, its first LSN will be node SS’s last LSN. Now, if all nodes are configured to write their log backups to a shared location this would be OK, as the log backups needed to maintain the log chain unbroken would all be in the same location and accessible in order by whatever node needs to be restored. However, such a shared location may not be feasible, especially when nodes are disbursed geographically. It also introduces a single point of failure that has to be addressed.
More usual would be for each node to write its backups to local storage. In such case, when log backups are taken on multiple nodes, log backups end up distributed across the nodes. This would seriously complicate any restore scenario and quite possibly be impossible, depending on the failure scenario demanding a restore.
This can be illustrated by the following data collected from a three node AlwaysOn cluster – one primary P, one synchronous secondary SS (for HA), and one asynchronous secondary AS (for DR). Copy-only full and log backups of database TestDB were being taken to local storage on SS and AS on slightly different schedules (assume no backups being taken on P), using the Ola code. SS was configured as the preferred backup secondary, and the AS backups were forced using the Ola parameter for this. The data were generated by a query (following) against the msdb database on SS and AS.
The first thing to notice is that the log chain is not continuous on either node. “Continuous” means the first_lsn of one log backup equals the last_lsn of the most recent prior log backup. For example, for SS, the first_lsn of the 12:50 AM backup, 536621000000210000000, is not equal to 536621000000140000000, the last_lsn of the 12:35 AM backup. There is a gap, which means these log backups could not be applied to the full backup taken at 12:09 AM. These same observations are true for the AS backups.
The second thing to notice is that the SS first_lsn of the 12:50 AM backup, 536621000000210000000, is (for example) equal to the last_lsn of the most recent prior backup taken on AS, that at 12:45 AM. Similar correspondences between SS and AS are visible for other backup times. This means that there is an unbroken LSN chain for the AlwaysOn cluster overall, but the backups are distributed between SS and AS and would need to be used in the correct order from each node in a log restore. This would be complex, if even feasible.
At 2:28 PM on 5/6, log backups were stopped on AS. The following shows how the log chain becomes continuous on SS after this time (highlighted). Following the next SS full (copy-only) backup, the SS log backups could then be applied successfully in a restore to whatever point-in-time was desired:
Here is a query that can produce the above data (with additional columns of interest) – similar code is found in various places online – here is one: backup query:
SELECT CAST(s.database_name as char(20)) AS DB
,CAST(CAST(s.backup_size / (1024*1024) AS int) AS CHAR(12)) AS MBSize
,CAST(DATEDIFF(second, s.backup_start_date, s.backup_finish_date) AS CHAR(4)) AS SecsTaken
,CAST(s.first_lsn AS CHAR(30)) AS first_lsn
,CAST(s.last_lsn AS CHAR(30)) AS last_lsn
END AS BackupType
FROM msdb.dbo.backupset s
INNER JOIN msdb.dbo.backupmediafamily m
ON s.media_set_id = m.media_set_id
AND backup_start_date > ‘2018-05-06 00:00:00′
WHERE s.database_name = ‘TestDB’
ORDER BY DB, s.backup_start_date, BackupType
Database backup strategy on an AlwaysOn cluster requires considerably more consideration than when AlwaysOn is not being used. The technology will allow you to configure setups that will not serve you well. Just one more reason it is crucial to test restores from your backup strategy. This would quickly reveal problematic setups, though the reason for problems might be quite non-obvious. I hope that this post will help you avoid, or diagnose such cases.