Wednesday, February 17, 2010

Optimizing SQL Server Joins - Part 2

In my last post, I talked about covering the query and the join expression, with a single index, http://jahaines.blogspot.com/2010/01/optimizing-sql-server-joins.html. Covering the join expression allows the optimizer to seek the joining column and has the benefit of removing the very costly Key/RID lookup. In this post, I will be talking about optimizing join order.

Let’s start with a very simplistic definition of join order. Join order is the order in which the optimizer processes tables involved in the join clause. When tables are processed, the tables are placed in either the top or bottom input, which dictates how the data is searched and how many rows are affected. There is plenty of misinformation out there stating that the order you specify your joins in can directly impact query performance. While this is partially true, it is more the exception than the rule. This misinformation came about because the optimizer is a cost based optimizer. The optimizer has to determine the best plan in the shortest amount of time. The optimizer does not have all the time in the world to test each and every join order or permutation. Because the optimizer cannot evaluate every possible permutation (in some cases), it is possible for the optimizer to miss a more optimal join order, but this would only happen when a query has an exuberant number of joins. Like I said before this is not a regular occurrence. In almost all cases, the join order is of negligible concern. Let’s see this in action.

First I will create my sample tables and load them with data.

USE tempdb
GO

IF OBJECT_ID('tempdb.dbo.Customers') IS NOT NULL
BEGIN
    DROP TABLE dbo.Customers;
END
GO

CREATE TABLE dbo.Customers(
CustId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
FName VARCHAR(50),
LName VARCHAR(50)
);
GO

;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT dbo.Customers(FName,LName)
SELECT 
    'FName' + CAST(N AS VARCHAR(10)),
    'LName' + CAST(N AS VARCHAR(10))    
FROM Nums
WHERE N<=1000;
GO

IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
    DROP TABLE dbo.Orders;
END
GO

CREATE TABLE dbo.Orders(
OrderId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
CustId INT
);
GO

;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT dbo.Orders(CustId)
SELECT 
     ABS(CHECKSUM(NEWID())%250+65)   
FROM Nums
WHERE N<=100;
GO

CREATE NONCLUSTERED INDEX ncl_idx_CustId ON dbo.Orders(CustId);
GO
First and foremost, I really do not recommend using HINTS like FOREPLAN, Plan Guides, or OPTION(FORCE ORDER), unless it is a last resort. In most cases, performance problems can be attributed to non-scalable code, and/or missing indexes; therefore, the first step in solving performance problems is to optimize code. In this post, I will be using FORCEPLAN to illustrate how join order can affect performance. I will start by running a set of simple queries. One query will not use FORCEPLAN and the other will use FORCEPLAN. FORCEPLAN makes the optimizer maintain the JOIN order specified. One of the biggest drawbacks of FORCEPLAN is that a nested loop join will likely be used, which may or may not be efficient for your query plan. The optimizer can still use other types of joins if other hints are used, or the optimizer has to use another join to satisfy the query. You can find more information on FORCEPLAN here, http://msdn.microsoft.com/en-us/library/ms188344.aspx. SQL Server 2005 has more force options available that do not have the same join type limitation, such as OPTION(FORCE ORDER), Plan hints, and plan guides; however, these can be just as bad for performance, if used excessively or inappropriately.
 
SET STATISTICS IO ON;
SET STATISTICS PROFILE ON;
GO

SELECT o.CustId,c.FName,c.LName
FROM dbo.Orders o
INNER JOIN dbo.Customers c
    ON c.CustId = o.CustId
WHERE c.LName = 'LName151'

SET STATISTICS IO OFF;
SET STATISTICS PROFILE OFF;
GO

/*
Table 'Orders'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SELECT o.CustId,c.FName,c.LName  FROM dbo.Orders o  INNER JOIN dbo.Customers c   ON c.CustId = o.CustId  WHERE   c.LName = 'LName151'
  |--Nested Loops(Inner Join, OUTER REFERENCES:([c].[CustId]))
       |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[Customers].[PK__Customers__7E6CC920] AS [c]), WHERE:([tempdb].[dbo].[Customers].[LName] as [c].[LName]='LName151'))
       |--Index Seek(OBJECT:([tempdb].[dbo].[Orders].[ncl_idx_CustId] AS [o]), SEEK:([o].[CustId]=[tempdb].[dbo].[Customers].[CustId] as [c].[CustId]) ORDERED FORWARD)
*/
GO
 
As you can see, the Customers table is used as the OUTER table and Orders table is the INNER table, which supports my saying the optimizer is free to rearrange the predicate. For each row in the outer table the INNER table has to be scanned. If you use, FORCEPLAN the optimizer will make the OUTER TABLE Orders and the INNER table Customers. The first problem I encounter is the Orders table does not have a where clause, so the index seek on Orders will go away, but the optimizer will be able to seek Customers on CustId. Additionally, I have made the OUTER table (Orders) smaller, which means the optimizer will have less passes on the INNER table (Customers); however, the optimizer will have to scan/seek through a lot more data to return the results. Using FORCEPLAN can really degrade performance and in this case results more than 30 times more reads.
 
SET FORCEPLAN ON;
SET STATISTICS IO ON;
SET STATISTICS PROFILE ON;
GO

SELECT o.CustId,c.FName,c.LName
FROM dbo.Orders o
INNER JOIN dbo.Customers c
    ON c.CustId = o.CustId
WHERE c.LName = 'LName151'

SET FORCEPLAN OFF;
SET STATISTICS IO OFF;
SET STATISTICS PROFILE OFF;
GO

/*
Table 'Customers'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SELECT o.CustId,c.FName,c.LName  FROM dbo.Orders o  INNER JOIN dbo.Customers c   ON c.CustId = o.CustId  WHERE c.LName = 'LName151'
  |--Nested Loops(Inner Join, OUTER REFERENCES:([o].[CustId]))
       |--Index Scan(OBJECT:([tempdb].[dbo].[Orders].[ncl_idx_CustId] AS [o]))
       |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[Customers].[PK__Customers__7E6CC920] AS [c]), SEEK:([c].[CustId]=[tempdb].[dbo].[Orders].[CustId] as [o].[CustId]),  WHERE:([tempdb].[dbo].[Customers].[LName] as [c].[LName]='LName151') ORDERED FORWARD)
*/

As you can see, adding or misusing query hints can really degrade query performance. In my opinion, the worst part about using hints is you prohibit the optimizer from logically deducing the best way to solve a query. Developers often neglect to see the long term impact of using hints, as data distribution, indexes, and business requirements change over time. One of the best thing you can do is leave the optimizer alone and let it go to work. A lot of times developers think they know better than the optimizer or that they can trick it; however, this almost always ends in bad performance. Query hints are quite often band-aids to much larger problems; however, under certain circumstances there can be a benefit in using query hints and this is why they exist. For more information on how to help persuade the optimizer to do what you want, you can use thing following link, http://msdn.microsoft.com/en-us/library/cc917694.aspx. There are a more options in SQL 2005+ that give developers and DBAs more control over query execute plans; however, I still recommend that you solve performance problems by optimizing code and/or adding indexes. Only when you have evaluated all other options, should you consider using hints or guides. All-in-all, the most typical way to optimize join order is to leave it alone.

Until next time happy coding.