This Friday I want to talk about some best practices for troubleshooting queries using waits and queues. Waits and queues are excellent tools for trying to figure out why a query is running slower than you think it should be running. When a query has to wait for some reason, the query enters a suspended state and registers a wait type. This wait type clues you in to what is making the query wait. Not all wait types are well documented or well understood, and how to use them is not always straight forward. These best practices should make this process easier for you.
- Use waiting tasks, not wait stats: One of the most common mistakes I see is people using sys.dm_os_wait_stats to troubleshoot slow running queries. This DMV (Dynamic Management View) is all cumulative waits on the SQL Server instance since the last service restart or manual clearing of the wait stats — I will talk about manually clearing the stats later, so let’s focus on the DMVs. This is not a good indicator of what is causing queries to run slowly right this minute.
If you are troubleshooting currently running queries, you need to use sys.dm_os_waiting_tasks. This DMV shows queries that are currently active and in a suspended state along with information about the current waits (wait type, wait time, wait description). This is where you need to look 99% of the time you are troubleshooting queries. Active waits, not cumulative waits.
I have seen a lot of people spend hours investigating waits that had absolutely nothing to do with their current problems. Don’t be led astray by cumulative waits. Focus on current waits.
- If it’s a queue, look away: You may be wondering what a queue is. A queue is a special kind of wait type that indicates that the only thing it’s waiting for is something to do. There are certain processes that are designed to do a specific piece of work, like the redo thread of an Availability Group (AG). If the process has no work to do, but is still running because it needs to be ready to process work as soon as it arrives, it is placed into a queue wait.
Simply put, if you see a high wait time on a queue or a thread waiting on a queue, it most often means it has nothing to do. This is good, performance-wise. Ignore these waits. Not all queue waits are easy to spot, but it is safe to assume that any wait type with queue or sleep in the name is a queue wait type and you can safely ignore it.
Exception: There are probably more exceptions, but the one I am familiar with is the SLEEP_TASK wait type. SLEEP_TASK is the most generic queue wait in SQL Server. Whenever they need a task to wait and there is no specific wait type for it, the use SLEEP_TASK. SLEEP_TASK is usually very quick and very benign. However, if you see active queries with long wait types on this and the wait description indicates tempdb, it is an indicator of a process hitting tempdb disk hard. On the rare occasions that I have seen this, it has been mixed with IO_COMPLETION waits.
- Unknown wait -> break it down: As I said before, wait types have not been documented well (see #5 below for more on this). When you come across an unknown wait type and you can’t find any documentation for it, break it down into pieces. Quite often, the wait name is a combination of words that gives you a clue on its meaning. Here are some examples:
- IO_COMPLETION — parts are IO and Completion. IO means Input/Output and Completion is self-explanatory. So it is waiting for IO to complete.
- SOS_SCHEDULER_YIELD — SOS means the SQL OS, the mechanism that controls SQL Server’s OS interface, Scheduler is the mechanism that manages SQL Server worker threads and the time they receive on an actual CPU thread, and Yield is the process where threads share the CPU threads for a specific given amount of time (a quantum) and then give up access to the CPU to another thread. So this is simply put threads for a query that have yielded their time on the CPU and are now waiting to get access to a CPU again.
- BACKUPTHREAD — Backup is a database back up process and Thread means the worker thread. This wait means a backup is running and this thread is waiting for it to finish.
Another subset of waits where this works well are PREEMPTIVE_xxxxxx waits. These are waits where the process has shelled out to an external OS layer. The key here is to ignore the PREEMPTIVE part and focus on the rest. A lot of PREEMPTIVE waits use the name of the OS layer API that it is calling for the wait name. So if you remove the preemptive part and do a search for just the rest, you can quite often find an OS API that matches that name, and the documentation for that API may tell you exactly what it is doing. If you don’t find an exact match, the name may be a generic reference to a set of APIs that it calls. If there are 3 parts to the name, the second part may be a reference to the particular set of APIs that it references (such as OS or COM) because there can be APIs with the same name in different parts of the OS.
Example 1: PREEMPTIVE_OS_ACCEPTSECURITYCONTEXT We ignore preemptive. OS indicates it is a Windows API being called. I then do a web search for ACCEPTSECURITYCONTEXT and the first result is for AcceptSecurityContext (General) function. If you follow the link, the page has way more info about this API, and therefore the wait type, than you had hoped for.
Example 2: PREEMPTIVE_OS_AUTHENTICATIONOPS We ignore preemptive. OS indicates it is a Windows API being called. If we do a web search for AUTHENTICATIONOPS, we get no hits. But the fast that the name is plural is a hint that this is a group of APIs that may be called. We may not be able to see the deep details on what is happening, but we can at least see that it is calling APIs related to authentication.
- Don’t delete good data: I mentioned manually clearing wait stats above. A lot of people will do this when troubleshooting slow queries. They reset them all back to zero, and then see how they are increasing. But why would you delete data on purpose? Wait stats are data for DBAs. I implore you not to do this. If you must use wait stats, insert the current stats into a temp table, wait a given amount of time, collect the wait stats again, and then simply subtract the first set of stats from the second. It’s just subtraction. It’s easy, and you’re not just dumping data. Here is a query you can use to do that:
Declare @Waits Table ( WaitID int identity(1, 1) not null primary key, wait_type nvarchar(60), wait_time_s decimal(12, 2)); WITH Waits AS (SELECT wait_type, wait_time_ms / 1000. AS wait_time_s, 100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct, ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn FROM sys.dm_os_wait_stats WHERE wait_type NOT IN( 'SLEEP_TASK', 'BROKER_TASK_STOP', 'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'LAZYWRITER_SLEEP')) -- filter out additional irrelevant waits Insert Into @Waits (wait_type, wait_time_s) SELECT W1.wait_type, CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s FROM Waits AS W1 INNER JOIN Waits AS W2 ON W2.rn <= W1.rn GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold WaitFor Delay '0:01:00'; WITH Waits AS (SELECT wait_type, wait_time_ms / 1000. AS wait_time_s, 100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct, ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn FROM sys.dm_os_wait_stats WHERE wait_type NOT IN( 'SLEEP_TASK', 'BROKER_TASK_STOP', 'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'LAZYWRITER_SLEEP')) -- filter out additional irrelevant waits Insert Into @Waits (wait_type, wait_time_s) SELECT W1.wait_type, CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s FROM Waits AS W1 INNER JOIN Waits AS W2 ON W2.rn <= W1.rn GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold Select wait_type, MAX(wait_time_s) - MIN(wait_time_s) WaitDelta From @Waits Group By wait_Type Order By WaitDelta Desc;
- Bookmark this: I've said a couple of times that wait stats are not well documented. Paul Randal (blog | @PaulRandal) has spent what I can only assume was many, many hours trying to fix this. Paul has given us the Wait Types Library and Latch Classes Library. I wouldn't be surprised if other members of the SQLSkills team have helped out. I hope he has had help.
This is great because it is a collection of documentation that you could spend hours scouring the internet and still not find it all. Even more than that, it is great because it includes a lot of Paul's extensive experiences with different wait types. It is the first place I look when I need to know more about a wait type or latch class.