
How to Force a Manual Failover of an AlwaysOn Availability Group

This article describes how to force a manual failover of an AlwaysOn Availability Group. A forced failover is a type of manual failover intended strictly for disaster recovery scenarios, where a planned manual failover is not possible.

Problem: 

For example, you have a SQL Server Availability Group (AG) on a Windows Server Failover Cluster (WSFC) with two nodes. Imagine a situation where one node is completely offline/unavailable, or the AG resource is offline in the cluster.

This may cause the AG to go into a Resolving state, and all the databases in the AG, as well as the listener, become inaccessible.

Now you must restore the service. In this scenario you have to perform a forced failover and accept the risk of losing data.

Solution:

First, bring up the WSFC using the forced quorum method. Second, force the AG failover with the allow-data-loss option.

For example, SQLP1 (primary) is unavailable, the WSFC is down, and the AG is inaccessible.

SQLP2 (replica) is up and running, and you can connect to SQLP2 from SSMS. The Always On Dashboard lists SQLP2 under Availability replica and shows the AG in the Resolving state.


Recover WSFC using Force Quorum

Keep in mind that this actually forces the cluster to start without a quorum. There are two ways of achieving this: PowerShell or the command prompt.

Fix Quorum Using PowerShell:

Start PowerShell using Run As Administrator. You can refer to the article Powershell Script to Create Folders, where I have shown this step.

Import-Module FailoverClusters
$node = "SQLP2"
# Stop the cluster service on the surviving node
Stop-ClusterNode -Name $node
# Start the cluster service again, forcing quorum on this node
Start-ClusterNode -Name $node -FixQuorum
# Give this node a vote in the quorum
(Get-ClusterNode $node).NodeWeight = 1

Force Quorum Using Command Prompt:

Open a command prompt using Run As Administrator. First, stop the cluster service on SQLP2.

C:\>Net stop clussvc


Now bring up the cluster service with forced quorum:

C:\>Net start clussvc /forcequorum


At this point, if you launch Failover Cluster Manager, the quorum will show a warning.


Now for the second part of the solution.

Force Always On Failover with Allow data loss:

Now that the WSFC is up, the user databases in the AG will appear in the AG Dashboard, but they will be in the Not Synchronizing state.
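You can also check this state from T-SQL. Below is a minimal sketch, assuming you run it on SQLP2; the DMVs are standard, but the column list is trimmed for readability:

```sql
-- Synchronization state of each availability database
SELECT ag.name AS ag_name
	,ar.replica_server_name
	,DB_NAME(drs.database_id) AS database_name
	,drs.synchronization_state_desc
	,drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
INNER JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
INNER JOIN sys.availability_groups ag ON drs.group_id = ag.group_id;
```

In this scenario you would expect synchronization_state_desc to show NOT SYNCHRONIZING for the databases in the AG.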


We need to perform a forced failover of the AG, using the allow-data-loss option, to the SQL Server replica that is still up and running. In our example, the replica we will force failover to is SQLP2.

The T-SQL command to force the failover, run on the target replica (SQLP2), is as follows:

ALTER AVAILABILITY GROUP AGTest FORCE_FAILOVER_ALLOW_DATA_LOSS; 

At this moment the AG should be up and running. Open the AlwaysOn Dashboard again; it should now show SQLP2 as primary and running. Applications can connect to the AG listener and the service is restored.


When the issue with the server SQLP1 is fixed, it will rejoin the WSFC automatically. As long as the AG has not been removed from the other SQL Server instances, they will be able to rejoin the AG as well.

After a forced failover, all secondary databases are suspended. This includes the former primary databases, after the former primary replica comes back online and discovers that it is now a secondary replica. You must manually resume each suspended database individually on each secondary replica.

Resuming data movement using SSMS:


T-SQL to resume data movement for a database in the AG (replace AGplaceHolder with the database name):
ALTER DATABASE [AGplaceHolder] SET HADR RESUME;
GO
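If the AG contains many databases, resuming them one by one is tedious. A small sketch like the one below, run on the secondary replica, generates the RESUME statement for every suspended database; review the generated commands before executing them:

```sql
-- Generate an ALTER DATABASE ... RESUME statement for each suspended database
SELECT 'ALTER DATABASE [' + DB_NAME(database_id) + '] SET HADR RESUME;' AS resume_cmd
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1
	AND is_suspended = 1;
```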
Summary:

Here we described a situation where only one node was up in a WSFC without quorum. We forced quorum to bring up the WSFC on a single node (the available replica) and brought the SQL AlwaysOn Availability Group back to a functional state. Once the problem node came back up, we resumed data movement to synchronize the availability databases.

Hope this helps. Please feel free to comment in case of any questions.

You may also be interested in the following articles:

How SQL Agent Job Schedule Got Changed Automatically

Interesting facts about SQL Server Startup Parameter -f

Script To List SQL Agent Jobs And Schedules


OS Patching in SQL Server Always On


This article is about the DBA steps during OS patching in SQL Server Always On and is a continuation of Steps in SQL Server Always On during OS patching. Please read the first article if you have landed directly on this page.

So far you have seen why database backups are important and how to Validate Synchronization State. Now let's see how to change the failover mode; why it needs to change, you already know from the first article.

How to change Failover Mode:

Right-click the AG on the primary > Properties:

Change Failover Mode
AG Dashboard Properties

Change the Failover Mode to Manual and click OK:

Change Failover Mode

4. Now you can apply the OS patch on the secondary node. If you are the DBA, request the Wintel (Windows server) team to patch the secondary server. Do not fail over the cluster resources to the other node. You can reboot the secondary server after patching.

5. After the secondary server comes back online, wait until the databases are in the SYNCHRONIZED state and make sure Failover Readiness shows No Data Loss, as described in Step 2) Validate Synchronization State in the first article.

6. Availability Group (AG) Failover:

Fail over the Availability Group (AG) to the secondary replica using SSMS (SQL Server Management Studio), not Failover Cluster Manager.

Right-click the AG on the primary and choose Failover:

AlwaysOn Failover

Click Next:

AlwaysOn Failover

Select the new Primary Replica and click Next:

AlwaysOn Failover

The next step is to connect to the replica you are failing over to.
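As an alternative to the wizard, the same planned failover can be done with T-SQL. A sketch, assuming the AG is named AGTest and you are connected to the target secondary replica:

```sql
-- Run on the secondary replica that should become the new primary
ALTER AVAILABILITY GROUP AGTest FAILOVER;
```

Unlike a forced failover, a planned failover requires the target replica to be in the SYNCHRONIZED state and causes no data loss.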

7. Wait until the synchronization state of all the databases becomes SYNCHRONIZED. Follow the instructions in Step 2) Validate Synchronization State (refer to the first article).

8. Once all the database statuses are SYNCHRONIZED, you can start the OS patching activity on the new secondary.

9. Keep monitoring the log file growth; if a log is about to become full, you may need to remove the database from the Always On AG to prevent it from going into suspect mode.
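For the log monitoring in step 9, a quick way is the classic DBCC command, which reports the log size and the percentage used for every database on the instance:

```sql
-- Log file size and space used (%) for all databases
DBCC SQLPERF(LOGSPACE);
```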

10. Once the activity is complete, monitor until the databases become SYNCHRONIZED. Follow the instructions in Step 2) Validate Synchronization State.

11. Fail back the AG to the original node and wait until all databases become SYNCHRONIZED. This step is not mandatory and depends on the application setup.

12. Lastly, on the node now acting as the primary replica, change the failover mode back to Automatic. Do not forget this step. Please follow step 3) Failover Mode.

Hope this helps. Please leave your comments or questions.

You may also like to read the following articles:

Tempdb space issue in AlwaysOn secondary Replica

SQL Database Restore using NetBackup


Steps in SQL Server Always On during OS patching

This article details the steps in SQL Server Always On during OS patching. Any minor mistake in an Always On Availability Group (AG) failover during OS patching can cause huge downtime, because a database can go into a recovery pending state and the application will not be able to connect.
Therefore it is critical to follow the proper steps. In addition, this article will help non-DBAs or accidental DBAs as well.
One of the main ideas behind implementing SQL Server Always On is to minimize downtime during any kind of patching activity.

Steps to Follow in SQL Server Always On During OS Patching
  1. Backup Confirmation:
    Please make sure you have good full/differential backups of all the databases. This is to avoid any unwanted situation; in the worst case you may need to restore from a database backup and re-establish Always On.
  2. Validate Synchronization State: Make sure that synchronization state for all databases is SYNCHRONIZED and Failover Readiness shows No Data Loss in Availability Group (AG) Dashboard.
  • Right Click on AG from Primary Server and go to Dashboard
Open Always On Dashboard
  • You need to check that the Synchronization State is Synchronized and Failover Readiness is No Data Loss.
    Also make sure no errors show in the AG Dashboard, as shown below:
AlwaysOn Dashboard: Synchronization status

The above step validates that you will not lose any data during failover of the Always On AG. The synchronization state could instead be "Synchronizing" with Failover Readiness "Data Loss"; in that case you have some work to do before failover.
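The same validation can be scripted. A minimal sketch using the cluster-state DMV; is_failover_ready = 1 corresponds to No Data Loss in the dashboard:

```sql
-- Failover readiness per availability database
SELECT dcs.database_name
	,ar.replica_server_name
	,dcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states dcs
INNER JOIN sys.availability_replicas ar ON dcs.replica_id = ar.replica_id;
```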

3. Change Failover Mode:

Change the failover mode of the AG to Manual. This gives you control during node reboots after patching; obviously you don't want the AG to fail over automatically during the activity.
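Besides SSMS, the failover mode can be changed with T-SQL. A sketch, where AGTest and the replica name SQLP1 are placeholders for your environment; repeat for each replica as needed:

```sql
-- Switch the replica's failover mode from Automatic to Manual
ALTER AVAILABILITY GROUP AGTest
MODIFY REPLICA ON 'SQLP1'
WITH (FAILOVER_MODE = MANUAL);
```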


Tempdb growth due to Version Store in AlwaysOn


This article is about fixing tempdb growth due to the version store on an Always On secondary replica. It is a continuation of the article Tempdb space issue in AlwaysOn Secondary Replica, so please go through the first article and then come back here so this makes more sense.

To summarize: so far we have found that the version store is consuming most of the space in tempdb. We then tried to find the transaction consuming the version store and detected an old system session id with lastwaittype "REDO_THREAD_PENDING_WORK".

As I mentioned in the first article, this is a hint that some transaction on the primary is causing it. Furthermore, you can check the below performance counters:

  • Longest transaction that's currently running: SQLServer:Transactions\Longest Transaction Running Time.
  • The current size of all the version stores: SQLServer:Transactions\Version Store Size (KB).
  • Rate of version store cleanup in tempdb: SQLServer:Transactions\Version Cleanup rate (KB/s).
  • Rate of version store generation: SQLServer:Transactions\Version Generation rate (KB/s).

Alternatively, you can use the below query on the secondary replica facing the issue.

Query to check the above-mentioned performance counters:
SELECT *
FROM sys.dm_os_performance_counters
WHERE counter_name IN (
	'Longest Transaction Running Time'
	,'Version Store Size (KB)'
	,'Version Cleanup rate (KB/s)'
	,'Version Generation rate (KB/s)'
	)
Sample output:
Result set of Longest Running Transaction
Fig1: Longest Running Transaction time

As you can see, the "Longest Transaction Running Time" counter has a very high cntr_value. This is compelling evidence that there is a long-running transaction on the primary replica.
Additionally, the other three version store counters show that the long-running transaction is consuming the version store.

Now execute the below query again, this time on the primary replica.

Query to find the transaction utilizing version store:
SELECT GETDATE() AS runtime
	,a.*
	,b.kpid
	,b.blocked
	,b.lastwaittype
	,b.waitresource
	,db_name(b.dbid) AS database_name
	,b.cpu
	,b.physical_io
	,b.memusage
	,b.login_time
	,b.last_batch
	,b.open_tran
	,b.STATUS
	,b.hostname
	,b.program_name
	,b.cmd
	,b.loginame
	,request_id
FROM sys.dm_tran_active_snapshot_database_transactions a
INNER JOIN sys.sysprocesses b ON a.session_id = b.spid
Sample Output:
Transactions In Primary Using Version Store In Secondary Replica
Fig2: Transactions In Primary Using Version Store In Secondary Replica

In the above sample output the most important columns are marked in red. Notice that the "login_time" and "last_batch" times are very old, and the status is "sleeping". Now you need to kill the corresponding session ids one by one and monitor the version store utilization as shown in the first article.

A point often overlooked: kill one session at a time and then monitor the version store utilization on the secondary replica. The next time the version cleanup process runs, it will be able to clear the version store if you killed the right one; this way you also identify the culprit transaction.
Also, you may need to contact the application team before killing any spid, depending on your environment.
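For example, if session_id 101 (a hypothetical number; use the spid you actually identified) turned out to be the culprit, you would kill it on the primary and then re-check the version store size on the secondary:

```sql
-- Kill one suspected session on the primary (101 is a placeholder spid)
KILL 101;

-- Then re-check version store consumption on the secondary replica
SELECT CAST(SUM(version_store_reserved_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Version_Store_MB
FROM tempdb.sys.dm_db_file_space_usage;
```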

Hope this helps. If you find any new scenario please put that in the comment section.


Tempdb space issue in AlwaysOn Secondary Replica

In this article I am going to discuss how to troubleshoot and fix a tempdb space issue on an AlwaysOn secondary replica.
Recently I faced an issue where tempdb on an AlwaysOn secondary replica had grown to the disk space limit. It was SQL Server 2014 Enterprise Edition.
As you know, tempdb is used by server operations in a SQL instance to store user objects, internal objects, and the version store (used primarily for snapshot isolation). Therefore we start troubleshooting just as we would for a regular tempdb space issue.

First of all, we need to check what is consuming space in tempdb. We query sys.dm_db_file_space_usage, which provides the details.

Query to find the tempdb space usage:
SELECT CAST(SUM(total_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Total_Size_MB
	,CAST(SUM(unallocated_extent_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Free_Space_MB
	,CAST(SUM(user_object_reserved_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS User_Objects_MB
	,CAST(SUM(internal_object_reserved_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Internal_Objects_MB
	,CAST(SUM(version_store_reserved_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Version_Store_MB
	,CAST(SUM(mixed_extent_page_count) * 8 / 1024. AS DECIMAL(25, 2)) AS Mixed_Extent_MB
FROM tempdb.sys.dm_db_file_space_usage -- works on SQL 2012 and later editions
GO
The output was as follows:
Tempdb Space Issue on AlwaysOn Secondary Replica
Objects consuming space in tempdb

As you can see, the version store is the highest consumer. The next step is to identify the transactions on the secondary replica maintaining an active version store. Execute the below query on the secondary replica.

Query to find the transaction utilizing version store:
SELECT GETDATE() AS runtime
	,a.*
	,b.kpid
	,b.blocked
	,b.lastwaittype
	,b.waitresource
	,db_name(b.dbid) AS database_name
	,b.cpu
	,b.physical_io
	,b.memusage
	,b.login_time
	,b.last_batch
	,b.open_tran
	,b.STATUS
	,b.hostname
	,b.program_name
	,b.cmd
	,b.loginame
	,request_id
FROM sys.dm_tran_active_snapshot_database_transactions a
INNER JOIN sys.sysprocesses b ON a.session_id = b.spid
Sample Output:
Tempdb space issue on Always On Secondary Replica
Result showing transactions utilizing version store

The most important columns to notice in the above result set are "session_id", "lastwaittype", "database_name", "login_time" and "last_batch".

In addition, remember that we ran the query on the secondary replica. That is why you see lastwaittype "REDO_THREAD_PENDING_WORK", which means the redo thread was trying to catch up with the primary while maintaining multiple versions of the updated rows to achieve read consistency.

You can also see that the "login_time" and "last_batch" of session_id 37 were the oldest of all. If they are recent we can ignore the session, but in my case they were two months old. Hence I could sense that there must be some transaction on the primary causing this. Besides that, session_id 37 is < 50, which means it is a system session. Now let's see how to find the culprit on the primary.


Synchronous Database Mirroring Misconceptions

This article is about Synchronous Database Mirroring misconceptions. Synchronous is one of the operating modes of SQL Server Database Mirroring, also known as High Safety mode: high safety of transactions in case of failover.

Let’s see what Microsoft TechNet says:

When the session begins, the principal server begins sending its active log to the mirror server. The mirror server writes all of the incoming log records to disk as quickly as possible. As soon as all of the received log records have been written to disk, the databases are synchronized. As long as the partners remain in communication, the databases remain synchronized.
After synchronization finishes, every transaction committed on the principal database is also committed on the mirror server, guaranteeing protection of the data. This is achieved by waiting to commit a transaction on the principal database, until the principal server receives a message from the mirror server stating that it has hardened the transaction’s log to disk. Note the wait for this message increases the latency of the transaction.

The common misconceptions/confusions are as follows:
  1. The principal server first commits the transaction, then sends the log records to the mirror and waits until the mirror server writes them to the database.
  2. The principal sends its log to the mirror, and only after it commits on the mirror does it commit on the principal.

If you know what happens during a commit, this is easier to follow. Commit does not mean the transaction is written to the database (data file). When you commit a transaction, SQL Server writes it to the log file on disk; in other words, it hardens the log. This ensures the durability of the transaction.

Let’s take an example of the following transaction:

Begin Tran

Statement 1

Statement 2

Commit;

SQL Server writes the log records to the log file of the principal database as part of the normal logging process. The mirror server follows the same process.

When the application fires a commit statement, the synchronous database mirroring mechanism must confirm that the mirror server has written all the log records related to the transaction before sending confirmation to the application.

So, in the example transaction above, it is possible that SQL Server has already written Begin Tran and Statement 1 to the log on both the principal and the mirror server.

The following diagram depicts the same:

Synchronous Database Mirroring

Now lets see what happens when application fires a Commit Tran:
    1. The principal server receives the commit from the client.
    2. The principal server writes the log records for the transaction to the log file and sends the log records to the mirror server at the same time, as shown below. Note: the principal server has already written the log records to the log file, so one misconception is cleared here: the principal server does not wait for acknowledgement from the mirror before writing the log records.
    3. The principal server completes the I/O, but it cannot send the confirmation to the client at this point.
    4. The mirror server hardens the log to disk.
    5. I/O on the mirror server completes.
    6. The mirror server returns an acknowledgement to the principal server.
    7. On receiving the acknowledgement from the mirror server, the principal server sends a confirmation/acknowledgement to the client.

The following diagram shows the complete cycle:

Synchronous Database Mirroring

Note: The misconception that the principal server waits until the mirror server writes the transactions to the mirror database is not true. Synchronous database mirroring confirms that the mirror writes the log records to the log file, not the data file, before sending the acknowledgement. Hence even if the principal crashes after that, all the log records of the committed transactions are durable and the mirror can take over the role of the principal. That is why it is also known as High Safety mode.

The mirror server continuously writes data to the mirror database from the redo queue. In asynchronous mirroring, the principal server does not wait for the acknowledgement from the mirror.

I hope this clears up a few misconceptions about synchronous database mirroring. Please let me know if you have any questions.

You may like the following articles as well:

Myth around Remote Query Timeout Option
Remove or Delete Tempdb data file without restarting SQL Server
Fact about Dedicated Administrator Connection (DAC)
SSRS Encryption Key Backup Information
SQL Database Restore using NetBackup
