Tuesday, May 11, 2010

SQL Meme: Tagged: 5 things SQL Server should drop

I have been tagged by Denis Gobo (Blog/Twitter) in a SQL meme regarding the top 5 things I would drop from SQL Server (see Denis’s post). I am sure some of you could spit out a list a mile long, but I am going to focus on my 5 biggest pet peeves, or at least the ones that have not been listed yet :^).

sp_msforeachdb and sp_msforeachtable

First off, these two stored procedures are undocumented, so they can be deprecated or their functionality can change at any time. In my opinion, these two stored procedures are useless. If you look underneath the hood, they both use basic cursors. Sure, they make fancy work of the “?” placeholder, but you can do the same with your own cursor. Remove these from your code and roll your own cursors.
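
For illustration, a bare-bones replacement might look something like the sketch below, which loops over sys.databases and runs a statement against each one (the SELECT is just a placeholder for whatever per-database work you would otherwise hand to sp_msforeachdb):

DECLARE @DbName SYSNAME;
DECLARE @Sql NVARCHAR(MAX);

DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = 'ONLINE';

OPEN db_cur;
FETCH NEXT FROM db_cur INTO @DbName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build the statement yourself instead of relying on the "?" token.
    SET @Sql = N'USE ' + QUOTENAME(@DbName) + N'; SELECT DB_NAME() AS CurrentDb;';
    EXEC sp_executesql @Sql;

    FETCH NEXT FROM db_cur INTO @DbName;
END

CLOSE db_cur;
DEALLOCATE db_cur;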

PRINT

PRINT is a bit antiquated when it comes to newer versions of SQL Server. I am not saying drop PRINT altogether, but drop the 8000-character varchar / 4000-character nvarchar limitation. I have seen this byte (sorry, couldn’t resist) people over and over. It does not make sense to allow a developer to store 2 GB worth of data in a variable and then only print the first 8000 characters. Sure, we can roll our own print procedure or use XML, but why should we have to work around the issue? Allow PRINT to “print” up to the maximum variable size.
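
You can see the limitation for yourself with a minimal sketch like the one below (note the NVARCHAR(MAX) cast inside REPLICATE; without it the expression itself gets capped):

DECLARE @s NVARCHAR(MAX);
SET @s = REPLICATE(CAST(N'x' AS NVARCHAR(MAX)), 10000);

SELECT LEN(@s) AS VariableLength; -- 10000
PRINT @s; -- only the first 4000 characters show up in the messages tab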

ORDER BY Constant in Windowing Function

If you try to order by a constant value in a windowing function, such as ROW_NUMBER(), you will get a message stating constants are not allowed; however, there is a workaround: use a subquery (with a constant) in the ORDER BY clause. This behavior should be removed because it gives developers the impression that the data will be numbered in the order of the table. Before anyone says anything, a table does not have a predefined order. So what we have here is a number sequence that is not guaranteed to be the same each time the query is run. In my book, inconsistent behavior = remove the functionality and make the user order by an actual column.

**** UPDATE ****

I have been asked to provide a sample of what I am talking about here. Essentially create a very simple table with a few columns as such:

DECLARE @t TABLE(
Id INT IDENTITY(1,1) PRIMARY KEY,
SomeChar CHAR(1)
);

INSERT INTO @t (SomeChar) VALUES ('a') ;
INSERT INTO @t (SomeChar) VALUES ('b') ;
INSERT INTO @t (SomeChar) VALUES ('c') ;

Now try each of the following queries against the table and you will see that the windowing function says constants cannot be used, but clearly they can with a little ingenuity. The problem here is the illusion that the data will be sequenced in the order of the table without actually sorting it; what is really occurring is that the optimizer generates a sequence whose order may vary from execution to execution. A column should be supplied; otherwise, unexpected results may occur.

SELECT Id,SomeChar,ROW_NUMBER() OVER(ORDER BY 'a') AS seq
FROM @t
/*Windowed functions do not support constants as ORDER BY clause expressions.*/

SELECT Id,SomeChar,ROW_NUMBER() OVER(ORDER BY (SELECT 'a')) AS seq
FROM @t
/*
Id          SomeChar seq
----------- -------- --------------------
1           a        1
2           b        2
3           c        3
*/

Edit Top X Rows

This is an SSMS feature that I just do not find useful at all. Worse, it gives non-database professionals the ability to insert, modify, and delete data with no understanding of what is occurring in the background. In my opinion, those who use this feature are asking for trouble. I believe all insert/update/delete transactions should be done through CRUD (Create/Read/Update/Delete) stored procedures or TSQL batch operations, as in the simple sketch below. If you do not know how to do CRUD through TSQL, you do not need to be doing it at all.
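
For anyone wondering what I mean by CRUD through TSQL, a bare-bones update procedure is all it takes (the table and column names here are made up purely for illustration):

CREATE PROCEDURE dbo.usp_Customer_Update
    @CustomerId INT,
    @CustomerName VARCHAR(100)
AS
BEGIN
    SET NOCOUNT ON;

    -- One explicit statement, one predictable unit of work.
    UPDATE dbo.Customer
    SET CustomerName = @CustomerName
    WHERE CustomerId = @CustomerId;
END
GO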

SELECT *

I may take a little heat for this one, but SELECT * should be removed from SQL Server, and SQL Server IntelliSense should auto-expand the “*” into the column list. SELECT * is a prime candidate for performance problems and wasted network traffic: it affects the optimizer’s ability to use indexes, increases the bytes sent over the network, and breaks code when column ordinal positions change or columns are added or removed. Sure, we all use SELECT * for quick ad-hoc queries, but believe me, it also exists in production code. In my opinion, the benefits of expanding the “*” outweigh the cons because it makes developers/DBAs realize how many columns they are selecting, which may tip them off that they should restrict the column list. Expanding the “*” also prevents insert statements from breaking when columns are added or removed.
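
As a quick illustration of that breakage (the tables here are made up for the example), an insert that relies on SELECT * works right up until someone adds a column to the source table:

CREATE TABLE dbo.SourceTable (Id INT, SomeChar CHAR(1));
CREATE TABLE dbo.TargetTable (Id INT, SomeChar CHAR(1));
GO

INSERT INTO dbo.TargetTable SELECT * FROM dbo.SourceTable; -- works

ALTER TABLE dbo.SourceTable ADD AddedCol INT;
GO

INSERT INTO dbo.TargetTable SELECT * FROM dbo.SourceTable;
-- fails: column name or number of supplied values does not match table definition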

These are the items I would remove from SQL Server given the chance. I am sure I can come up with a lot more, but I will let others take a stab at this.

Until next time, happy coding.

Tuesday, May 4, 2010

Performance tuning Case Expressions With Correlated Subqueries

Today I want to talk about some potential pitfalls a developer may encounter when using correlated subqueries in a case expression. As you may recall, I have written about the potential performance pitfalls of correlated subqueries before: http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html. In this post, I will focus on case expressions that use correlated subqueries.

I will start by creating a sample table.

USE [tempdb]
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData')
BEGIN
    DROP TABLE dbo.[TestData];
END
GO

CREATE TABLE dbo.TestData(
RowNum INT PRIMARY KEY ,
SomeChar TINYINT
);
GO

INSERT INTO dbo.TestData
SELECT TOP 1000 
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID())%3+1)
FROM 
    Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2
GO

CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData(SomeChar);
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData2')
BEGIN
    DROP TABLE dbo.[TestData2];
END
GO

CREATE TABLE dbo.TestData2(
Id INT IDENTITY(1,1) PRIMARY KEY,
RowNum INT unique,
SomeChar TINYINT
);
GO

INSERT INTO dbo.TestData2
SELECT TOP 500
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID())%3+1)
FROM 
    Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2
GO

CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData2(SomeChar);
GO

A typical correlated subquery in a case expression may look something like this:

SELECT 
    RowNum,SomeChar,
    CASE (SELECT SomeChar FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) 
        WHEN 1 THEN 'Type1' 
        WHEN 2 THEN 'Type2'
        WHEN 3 THEN 'Type3'
    END
FROM dbo.TestData t1
WHERE [RowNum] <= 500

Let’s have a look at the execution plan to see what is going on underneath the hood.

[Execution plan screenshot]

IO:

Table 'Worktable'. Scan count 442, logical reads 2886, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 3, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Whoa!!! This query is extremely inefficient. As you can see, the TestData2 table was scanned 3 times and a worktable was created and scanned 442 times. The problem here is the optimizer chooses to spool the data from dbo.TestData2 twice. The even bigger problem with this method is scalability: this code does not scale well at all. For this query, the optimizer creates a number of index spools proportional to the number of elements in the case expression. The relationship can be defined as Number Of Spools = Number of Case Elements – 1. What does this mean? It means that if your case expression has 4 elements you get 3 spools, if your case expression has 5 elements you get 4 spools, and so on. Simply put, query performance decreases as the number of elements in the case expression increases. Take a look at the example below.
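
For example, adding a fourth branch to the case expression, along the lines of the query below, produces one more index spool and even more worktable activity:

SELECT 
    RowNum,SomeChar,
    CASE (SELECT SomeChar FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) 
        WHEN 1 THEN 'Type1' 
        WHEN 2 THEN 'Type2'
        WHEN 3 THEN 'Type3'
        WHEN 4 THEN 'Type4'
    END
FROM dbo.TestData t1
WHERE [RowNum] <= 500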

[Execution plan screenshot]

IO:

Table 'Worktable'. Scan count 539, logical reads 4081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 4, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So how should we change the query to help the optimizer make a better decision? 

The best solution is to let the optimizer compute the value while it is joining the TestData2 table, as shown below.

SELECT 
    RowNum,SomeChar,
    (SELECT CASE SomeChar WHEN 1 THEN 'Type1' WHEN 2 THEN 'Type2' WHEN 3 THEN 'Type3' END FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) 
FROM dbo.TestData t1
WHERE [RowNum] <= 500

[Execution plan screenshot]

IO:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, this is a much better query plan. The key here is that the optimizer is able to use a compute scalar operator upstream, while joining the tables. Because the computed value comes back as part of the join to the TestData table, we do not have to worry about spooling the data.

Conclusion

For me, correlated subqueries can exhibit inconsistent behavior and often introduce performance problems like this one. Do not get me wrong, correlated subqueries are not all bad, but they should be thoroughly tested. In my opinion, the best way to write this query is to LEFT OUTER JOIN dbo.TestData2. An outer join will provide more consistent performance.

SELECT 
    t1.RowNum,t1.SomeChar,
    CASE t2.SomeChar 
        WHEN 1 THEN 'Type1' 
        WHEN 2 THEN 'Type2'
        WHEN 3 THEN 'Type3'
    END
FROM dbo.TestData t1
LEFT JOIN dbo.TestData2 t2  ON t1.RowNum = t2.RowNum
WHERE t1.[RowNum] <= 500

Until next time, happy coding.