SQL Server Optimisation: Why Less Is More


One of the most valuable lessons I’ve learned in performance tuning is the importance of breaking queries into smaller, more manageable sections. Who knew? The experts were right all along when they said, “If you’re confused by the query, SQL probably is too”.

In today’s post, we will cover a simple but powerful concept for writing queries. A common sight as a DBA is long, monolithic queries attempting to do ALL THE THINGS in one pass (including solving world peace, all powered by a table variable).
Joking aside, the longer and more complex a query is, the greater the chance of issues arising, especially as the database grows over time. That 10,000-line stored procedure may run fine in the early days, when you have 100 records in each table. However, it won’t last.


Predicting The Future

When you execute a query, SQL Server quickly determines what it believes to be the optimal execution plan. This is where things can go wrong with large, complex queries. SQL must estimate multiple factors, including how many records will pass through the query at different stages of processing. The more datasets being rendered, the more logic there is and the more complex the joins, the harder SQL is pushed to make good judgement calls within that tiny window of time.
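As an aside (not part of the demo below), you can see these row estimates for yourself by requesting the estimated plan without executing anything. This is a minimal sketch; the sample query is arbitrary:

```sql
-- Return the estimated plan XML (including row estimates) instead of
-- executing the query; SET SHOWPLAN_XML must be alone in its batch
SET SHOWPLAN_XML ON;
GO
SELECT TOP (10) p.Title FROM dbo.Posts p;  -- arbitrary sample query
GO
SET SHOWPLAN_XML OFF;
GO
```

The returned XML contains the EstimateRows figures the optimiser produced for each operator, which is exactly what it is juggling in that tiny window of time.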


Divide And Conquer

Fear not, intrepid reader - we can assist SQL in making smarter decisions. There is a concept which can help us here, called ‘materialising’ datasets. We can take an approach that guides SQL as it rapidly makes judgement calls about retrieving the tables and columns you’ve joined. Minimising the number of rows we need to work with, as early as possible, can have a significant effect on your query’s performance.


Demo: Setting The Stage

Let’s run a demo to show what I mean. I will be using:

  • SQL Server 2022
  • StackOverflow2013 database, in compatibility level 160

Here we have an ugly query: it joins dataset to dataset, adding various conditional logic along the way too. Reading this type of query hurts, but I am trying to make a point here, and it’s not ‘that’ different from what I commonly see out in the wild. You don’t need to understand it all - you don’t even need to read it all. I just want you to be mindful of the number of joins, hops, spin-arounds and touch-the-grounds we are doing throughout it.

USE StackOverflow2013;
GO

DROP TABLE IF EXISTS #GetQuestionAndAnswers;
DROP TABLE IF EXISTS #GetQuestionAndAnswers2;

;WITH GetQuestions AS
(
	SELECT 
		'Question' AS Q,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 1
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl <> '')
),
GetAnswers AS
(
	SELECT 
		'Answer' AS A,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 2
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl <> '')
),
GetQuestionAndAnswers AS
(
	SELECT 
		GetQuestions.Q,
		GetQuestions.Title,
		GetQuestions.OwnerUserId 
	FROM GetQuestions
	UNION ALL
	SELECT 
		GetAnswers.A,
		GetAnswers.Title,
		GetAnswers.OwnerUserId 
	FROM GetAnswers
)

SELECT
	SUM(CASE WHEN qa.Q = 'Question' THEN 1 ELSE 0 END) AS TotalQuestions,
	SUM(CASE WHEN qa.Q = 'Answer'	THEN 1 ELSE 0 END) AS TotalAnswers
INTO #GetQuestionAndAnswers
FROM GetQuestionAndAnswers qa


;WITH GetQuestions2 AS
(
	SELECT 
		'Question' AS Q,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 1
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl = '')
),
GetAnswers2 AS
(
	SELECT 
		'Answer' AS A,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 2
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl = '')
),
GetQuestionAndAnswers2 AS
(
	SELECT 
		GetQuestions2.Q,
		GetQuestions2.Title,
		GetQuestions2.OwnerUserId 
	FROM GetQuestions2
	UNION ALL
	SELECT 
		GetAnswers2.A,
		GetAnswers2.Title,
		GetAnswers2.OwnerUserId 
	FROM GetAnswers2
)

SELECT
	SUM(CASE WHEN qa.Q = 'Question' THEN 1 ELSE 0 END) AS TotalQuestions,
	SUM(CASE WHEN qa.Q = 'Answer'	THEN 1 ELSE 0 END) AS TotalAnswers
INTO #GetQuestionAndAnswers2
FROM GetQuestionAndAnswers2 qa


;WITH AllTogether AS
(
	SELECT 'Users With Website' AS [Category], 		* FROM #GetQuestionAndAnswers
	UNION ALL
	SELECT 'Users Without Website' AS [Category], 	* FROM #GetQuestionAndAnswers2
)

SELECT 
    MAX(CASE WHEN Category = 'Users With Website'		THEN TotalQuestions END)	AS TotalQuestionsWithURL,
    MAX(CASE WHEN Category = 'Users With Website'		THEN TotalAnswers END)		AS TotalAnswersWithURL,
    MAX(CASE WHEN Category = 'Users Without Website'	THEN TotalQuestions END)	AS TotalQuestionWithoutURL,
    MAX(CASE WHEN Category = 'Users Without Website'	THEN TotalAnswers END)		AS TotalAnswersWithoutURL
FROM AllTogether a;

Demo: We Can Do Better

As you have seen, we have a long, messy query which does a bit of everything along the way.

When tackling a query like this, I try to apply the following strategies:

  • Identify logical breaking points within the dataset rendering process: Think about how I can spoon-feed SQL information (food?) along the way. Can we try and ‘materialise’ any datasets early in the logical processing of the query? My primary goal here is to shore up the estimates as early as possible. Better estimates mean better-resourced queries, which means better performance.
  • Look for repeated work or the repeated accessing of objects: The old adage is correct: the fastest query is the one you don’t run. Too often I will see queries hitting the same table several times in different ways, when just one pass of the object would have done.

Within the mess that is our demo query above, there are two aspects which I would like to address:


1 Accessing the dbo.Users table - many times

We are repeatedly accessing the dbo.Users table to check whether the corresponding post has a user with or without a website:

-- Example 1
AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl = '')

-- Example 2
AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl <> '')

Sure, I have amplified this aspect of the query somewhat for demonstration purposes. But it’s not that dissimilar to the things I will often see out in the wild. I understand how it can easily happen, too: time-pushed, stressed developers tacking things on over time, until you end up with unnecessary work being done.

So how do we correct this issue? We ‘materialise’ a dataset early on within the query. Instead of querying the dbo.Users table four times, all we need to do is this, once:

INSERT INTO #UserDetail
SELECT 
    u.Id,
    CASE WHEN u.WebsiteUrl <> '' THEN 1 ELSE 0 END AS WebExists,
    CASE WHEN u.WebsiteUrl = ''  THEN 1 ELSE 0 END AS WebDoesNotExist
FROM dbo.Users u;

We are materialising the dataset into a Temporary Table. This is a simple but important concept because:

  • We only need to access the dbo.Users table once. One pass through it, and we have everything we need for the rest of the query
  • Materialising the dataset into a temporary table, instead of using a Common Table Expression (CTE) for example, is key. This gives us a dataset which is ‘static’ and does not need to be rendered each time it’s referenced. Statistics are also created on temporary objects, which further helps SQL make better judgement calls when allocating resources, as it should have a decent idea of how many records the temporary table holds.
  • We could even index the temporary table, should we need to.
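To sketch that last point (an illustrative aside, not part of the demo - the index name is made up), indexing the temporary table would look something like this:

```sql
-- Hypothetical example: index the materialised temp table if later
-- joins against it would benefit (the index name here is invented)
CREATE CLUSTERED INDEX CIX_UserDetail_UserId
    ON #UserDetail (UserId);
```

Whether this is worth doing depends on how the temp table is used afterwards; for a single scan-and-join like the one in this post, it may not pay for itself.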

2 Rendering the dbo.Posts data - all the ways, all the time

In a similar vein to the issue with the Users table, we are repeatedly rendering information from the dbo.Posts table. We have multiple CTEs rendering Posts information, based on whether a post is a question or an answer. Further down, we reassemble this collected information using a UNION ALL to bring the multiple datasets back together.

Issue 1 : Repeated dataset renders for questions and answers

-- Example 1
WITH GetQuestions AS
(
	SELECT 
		'Question' AS Q,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 1
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl <> '')
)

-- Example 2
GetAnswers AS
(
	SELECT 
		'Answer' AS A,
		p.Title,
		p.OwnerUserId
	FROM dbo.Posts p
	WHERE p.PostTypeId = 2
	AND EXISTS (SELECT 1 FROM dbo.Users u WHERE p.OwnerUserId = u.Id AND u.WebsiteUrl <> '')
)

Issue 2 : Unnecessary dataset combining

SELECT 
	GetQuestions.Q,
	GetQuestions.Title,
	GetQuestions.OwnerUserId 
FROM GetQuestions
UNION ALL
SELECT 
	GetAnswers.A,
	GetAnswers.Title,
	GetAnswers.OwnerUserId 
FROM GetAnswers

Once again, I have been a scamp here and have overdone it slightly to prove a point. However, it’s really not uncommon to see this type of process in the real world. I have often stared at my screen, reviewing somebody’s code, thinking “But…why?” (More worryingly, sometimes I realise it’s my own code.)

Despite having those six CTEs (each rendering information from the dbo.Posts table), all we ultimately need to produce the same output is the following:

SELECT 
    SUM(CASE WHEN p.PostTypeId = 1 AND u.WebExists = 1
            THEN 1 ELSE 0 END) AS TotalQuestionsWithURL,
    SUM(CASE WHEN p.PostTypeId = 2 AND u.WebExists = 1
            THEN 1 ELSE 0 END) AS TotalAnswersWithURL,
    SUM(CASE WHEN p.PostTypeId = 1 AND u.WebDoesNotExist = 1
            THEN 1 ELSE 0 END) AS TotalQuestionWithoutURL,
    SUM(CASE WHEN p.PostTypeId = 2 AND u.WebDoesNotExist = 1
            THEN 1 ELSE 0 END) AS TotalAnswersWithoutURL
FROM dbo.Posts p
INNER JOIN #UserDetail u ON p.OwnerUserId = u.UserId;

Key points about the above:

  • We are able to make one pass of the dbo.Posts table
  • Using logic (the CASE statements), we can simply get the status of the Post (question or answer) and SUM it accordingly
  • We have finally joined against the temporary table #UserDetail

By rewriting the original query, we have reduced it from ~106 lines of code to just ~30. This is the final improved query:

USE StackOverflow2013;
GO

DROP TABLE IF EXISTS #UserDetail;

CREATE TABLE #UserDetail
(
    UserId INT,
    WebExists BIT,
    WebDoesNotExist BIT
);

INSERT INTO #UserDetail
SELECT 
    u.Id,
    CASE WHEN u.WebsiteUrl <> '' THEN 1 ELSE 0 END AS WebExists,
    CASE WHEN u.WebsiteUrl = ''  THEN 1 ELSE 0 END AS WebDoesNotExist
FROM dbo.Users u;

SELECT 
    SUM(CASE WHEN p.PostTypeId = 1 AND u.WebExists = 1
            THEN 1 ELSE 0 END) AS TotalQuestionsWithURL,
    SUM(CASE WHEN p.PostTypeId = 2 AND u.WebExists = 1
            THEN 1 ELSE 0 END) AS TotalAnswersWithURL,
    SUM(CASE WHEN p.PostTypeId = 1 AND u.WebDoesNotExist = 1
            THEN 1 ELSE 0 END) AS TotalQuestionWithoutURL,
    SUM(CASE WHEN p.PostTypeId = 2 AND u.WebDoesNotExist = 1
            THEN 1 ELSE 0 END) AS TotalAnswersWithoutURL
FROM dbo.Posts p
INNER JOIN #UserDetail u ON p.OwnerUserId = u.UserId;

Sure, you will just have to take my word for it here (or run the demo yourself), but for completeness, here are the outputs of queries 1 and 2:

Query 1 Output -

[results screenshot]

Query 2 Output -

[results screenshot]



Does It Run Better?

After all that effort, imagine if it performed exactly the same or, even worse, slowed down instead. 😄 Fortunately, it does run better, significantly so, in fact.

Using the output from STATISTICS IO and TIME, we have the following:

  • Logical reads are down by 74%
  • Total CPU time spent is down by 72%
  • Total elapsed time is down by 68%
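For reference, those figures come from the standard session-level switches; enable them before running each version of the query and the measurements appear in the Messages tab:

```sql
-- Report I/O and timing statistics for each statement in this session
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the query under test here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```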

Drop it like it’s hawt




Thanks for reading!