Monday, June 29, 2009

Concatenating Column Values (Part 1)

As you may know, SQL Server does not have native support for handling arrays.  This post is part one of a two-part series in which I will look at the most widely used methods to concatenate column values.  In this post, I will focus on creating delimited column values; in the second part of the series, I will explore the performance implications of each method.

The two methods most often used in SQL Server 2005+ to concatenate column values are the variable assignment method and the FOR XML PATH method.  The variable assignment method is most commonly wrapped in a scalar UDF, so there are usually some performance implications.  The FOR XML PATH method is more commonly used in a correlated subquery to concatenate column values inline.  This method has performance implications as well.

Let’s create a sample table with data.

USE [tempdb]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE  NAME = 't')
BEGIN
    DROP TABLE dbo.t;
END
GO
 
CREATE TABLE t(
id INT,
col CHAR(1)
);
 
INSERT INTO t VALUES (1,'a');
INSERT INTO t VALUES (1,'b');
INSERT INTO t VALUES (1,'c');
INSERT INTO t VALUES (1,'d');
INSERT INTO t VALUES (2,'e');
INSERT INTO t VALUES (2,'f');
INSERT INTO t VALUES (3,'g');
INSERT INTO t VALUES (4,'h');

We have our data, so let’s start with the scalar UDF code.  Essentially, we need to create a scalar UDF that takes a surrogate key parameter and uses variable assignment to concatenate all required values.

Here is the Scalar UDF code:

USE [tempdb]
GO
 
CREATE FUNCTION dbo.ConcatenateCols(@Id INT)
RETURNS VARCHAR(MAX)
AS
BEGIN
 
DECLARE @RtnStr VARCHAR(MAX)
 
SELECT @RtnStr = COALESCE(@RtnStr + ',','') + col
FROM dbo.t
WHERE id = @Id AND col > ''
 
RETURN @RtnStr
 
END
GO

Now that we have our function we can test the scalar UDF code.

SELECT 
    id,
    dbo.ConcatenateCols(id) AS Cols
FROM t
GROUP BY
    Id

Here are the results:

[screenshot: concatenated values per id]

Now let’s take a look at the FOR XML PATH method. 

SELECT 
    Id,
    STUFF(
            (
            SELECT ',' + CONVERT(VARCHAR(5),col)
            FROM t t2
            WHERE
                t1.id = t2.id AND t2.col > ''
            FOR XML PATH('') 
            )
    ,1,1,'') AS Cols
FROM [t] t1
GROUP BY
    Id

Results:

[screenshot: FOR XML PATH results]

Note: You can also use CROSS APPLY to achieve the same result, and in my opinion it makes the code a little easier to read, as shown below:

SELECT 
    t1.id,
    MAX(STUFF(t2.x_id,1,1,'')) AS Cols
FROM t t1
CROSS APPLY(
    SELECT ',' + t2.col
    FROM t t2
    WHERE t2.id = t1.id AND t2.col > ''
    FOR XML PATH('')
) AS t2 (x_id)
GROUP BY
    t1.id
GO

There you have it.  I have shown you the two most widely used methods to concatenate strings in SQL Server 2005+.  They are pretty simple to implement and maintain, but one should know how each method performs, so stay tuned… in the coming post, I will look at the performance implications associated with the scalar UDF and FOR XML PATH methods.

****************** UPDATE ******************

There was a bug in the code that would cause the concatenated value in the variable assignment method to become NULL.  When I was creating the sample code, I really didn’t think about using COALESCE and a filter predicate because I was using non-NULL values.  Best practice is to use COALESCE or a proper predicate filter to keep useless data from being concatenated.  I have opted to go with a predicate filter of > ‘’.   Thanks to all for pointing this out.
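To see the failure mode for yourself, here is a minimal sketch of my own (not from the original post) that reuses the dbo.t table from above and concatenates without the filter:

USE [tempdb]
GO
 
-- My own illustration of the bug: a NULL col value either wipes out or resets
-- the accumulated string (depending on the order rows are processed) when the
-- col > '' filter is omitted.
INSERT INTO t VALUES (5,'x');
INSERT INTO t VALUES (5,NULL);
 
DECLARE @RtnStr VARCHAR(MAX);
 
SELECT @RtnStr = COALESCE(@RtnStr + ',','') + col
FROM dbo.t
WHERE id = 5;              -- no col > '' filter
 
SELECT @RtnStr AS Result;  -- NULL (or a partial string), instead of 'x'
GO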

Sunday, June 28, 2009

Converting a delimited string of values into columns

I have seen a few questions asking how to transform delimited values into columns, so I thought I would talk about it here.  In most cases it is recommended to use a string parser function to split the string; however, today I want to talk about another method.  This method takes advantage of the XML data type that was introduced in SQL Server 2005.  What is nice about the XML data type is that it preserves document order.  Document order is critical because it guarantees the values stay in the same order when the string is converted to XML.  Let’s start by creating a sample table with a few rows of data.

DECLARE @t TABLE(
ProductId INT,
ProductName VARCHAR(25),
SupplierId INT,
Descr VARCHAR(50)
)
 
INSERT INTO @t VALUES (1,'Product1',1,'A1,10in,30in,2lbs');
INSERT INTO @t VALUES (2,'Product2',2,'T6,15in,30in,');
INSERT INTO @t VALUES (3,'Product3',1,'A2,1in,,0.5lbs');

Okay, now that we have our sample data, let’s talk about it.  The column “Descr” contains 4 attributes of the product: Type, Length, Height, and Weight.  Any one of these attributes can be empty in the string, but it MUST still be represented.  For example, “A1,10in,30in,” has a comma at the end because the weight is unknown, and a string like “A1,10in,,1lbs” has an empty slot for the height.  If we did not keep this placeholder, how could we ever determine which attribute maps to which column?  The answer is there would be no real way of knowing, with this method or any other.

Now that we have laid all the ground work, it is time to start building our query.  We will start off by creating a CTE (common table expression) making sure to convert our delimited string into valid XML.  Below is the first part of our CTE (this is not the complete code).

;WITH cte (ProductId, ProductName,SupplierId,Prod_Attributes)
AS
(
SELECT 
    [ProductId],
    [ProductName],
    [SupplierId],
    CONVERT(XML,'<Product><Attribute>' 
        + REPLACE([Descr],',', '</Attribute><Attribute>') 
        + '</Attribute></Product>') AS Prod_Attributes
FROM @t
)

You may be asking yourself, what does this code do?  The CTE selects all the data from the table @t, but it also converts the delimited string into valid XML.  How does it do that?  Let’s break down the code to figure it out.

This is the conversion code of interest:

CONVERT(XML,'<Product><Attribute>' 
    + REPLACE([Descr],',', '</Attribute><Attribute>') 
    + '</Attribute></Product>') AS Prod_Attributes

This code takes the input string and uses the REPLACE function to insert the XML tags, so that it has the look and feel of valid XML.  For example, the string “A1,10in,30in,5lbs” will be transformed into

<Product>
    <Attribute>A1</Attribute>
    <Attribute>10in</Attribute>
    <Attribute>30in</Attribute>
    <Attribute>5lbs</Attribute>
</Product>

Note: We are using a comma delimiter here.  If your delimiter is different, you will need to change the second argument of the REPLACE function.  In the code REPLACE([Descr],',', '</Attribute><Attribute>'), the comma is the delimiter being replaced.
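For instance, if the description were pipe-delimited instead (a hypothetical variant of the sample data), only the delimiter argument would change:

-- Hypothetical pipe-delimited value; note the '|' passed to REPLACE.
DECLARE @Descr VARCHAR(50);
SET @Descr = 'A1|10in|30in|2lbs';
 
SELECT CONVERT(XML,'<Product><Attribute>' 
    + REPLACE(@Descr,'|', '</Attribute><Attribute>') 
    + '</Attribute></Product>') AS Prod_Attributes;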

Now that the product description values are in a valid XML format, we can easily get the values by using a hardcoded singleton value, as shown below. 

;WITH cte (ProductId, ProductName,SupplierId,Prod_Attributes)
AS
(
SELECT 
    [ProductId],
    [ProductName],
    [SupplierId],
    CONVERT(XML,'<Product><Attribute>' 
        + REPLACE([Descr],',', '</Attribute><Attribute>') 
        + '</Attribute></Product>') AS Prod_Attributes
FROM @t
)
SELECT 
    [ProductID],
    [SupplierId],
    Prod_Attributes.value('/Product[1]/Attribute[1]','varchar(25)') AS [Type],
    Prod_Attributes.value('/Product[1]/Attribute[2]','varchar(25)') AS [Length],
    Prod_Attributes.value('/Product[1]/Attribute[3]','varchar(25)') AS [Height],
    Prod_Attributes.value('/Product[1]/Attribute[4]','varchar(25)') AS [Weight]
FROM cte

The singleton of each attribute element is denoted by a hardcoded value 1-4.  If you have more columns you will need to specify more singletons. Here are the results:

[screenshot: attribute values split into Type, Length, Height, and Weight columns]

So there you have it.  An easy to implement solution to “pivot” or transform delimited values into columns. 

**************** Update ********************

I have reworked my code to work with XML special characters.  XML special characters are quite problematic but Brad Schulz, http://bradsruminations.blogspot.com/, has come up with a great method to work around the encoding issues.  Here is the modified code:

SELECT 
    [ProductID],
    [SupplierId],
    NewXML.value('/Product[1]/Attribute[1]','varchar(25)') AS [Type],
    NewXML.value('/Product[1]/Attribute[2]','varchar(25)') AS [Length],
    NewXML.value('/Product[1]/Attribute[3]','varchar(25)') AS [Height],
    NewXML.value('/Product[1]/Attribute[4]','varchar(25)') AS [Weight]
FROM @t t1
CROSS APPLY (SELECT XMLEncoded=(SELECT Descr AS [*] FROM @t t2 WHERE t1.ProductId = t2.[ProductId] FOR XML PATH(''))) EncodeXML
CROSS APPLY (SELECT NewXML=CAST('<Product><Attribute>'+REPLACE(XMLEncoded,',','</Attribute><Attribute>')+'</Attribute></Product>' AS XML)) CastXML
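To see why the extra FOR XML PATH('') step is needed, here is a small sketch of my own showing the naive CONVERT failing on an unescaped special character:

-- My own illustration: an unescaped ampersand breaks the straight CONVERT,
-- because & must be entitized (&amp;) before the string can be cast to XML.
DECLARE @Descr VARCHAR(50);
SET @Descr = 'A&B,10in,30in,2lbs';
 
SELECT CONVERT(XML,'<Product><Attribute>' 
    + REPLACE(@Descr,',', '</Attribute><Attribute>') 
    + '</Attribute></Product>');  -- fails with an XML parsing error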

Wednesday, June 24, 2009

Are valid search arguments still relevant?

I have seen a lot of queries written on forums and by various developers over the past few years, and have noticed an increasing trend of “sloppy” code. Most of the “sloppy” code that I am referring to has either an invalid search argument or a function in the predicate. It seems developers these days are in such a rush to get code out that they forget the principal mechanics of a properly tuned query. A lot of the time a developer will put whatever in the predicate without knowing the performance implications. So how exactly do these types of queries affect performance? The short answer is the optimizer typically has to use an index/table scan to satisfy the query because the filters are not adequate to seek the row. In this post, I will be primarily focusing on invalid search arguments (non-SARGable queries).

We will start off by defining what valid search arguments are. Essentially, you need an inclusive operator to seek an index. So what the heck is an inclusive operator? Inclusive operators are =, BETWEEN, >, >=, <=, <, and LIKE (begins with only). Now that we have our valid search arguments, we know that our query predicate should include one of these operators; otherwise, an index seek is unlikely. You will notice that the list does not include IS NULL, IS NOT NULL, <>, or LIKE (contains). Does this mean these operators are incapable of using a seek? No, it does not. It should be noted, though, that the optimizer may not be able to generate accurate cardinality estimates for the “not equal to” operator. What I want to dig into is the implications of using invalid search arguments, and I will focus on the “<>” operator.

Let’s start by creating and populating a sample table.

USE [tempdb]
GO
 
IF OBJECT_ID('dbo.TestData') IS NOT NULL
BEGIN
    DROP TABLE dbo.[TestData];
END
GO
 
SELECT TOP 100000 
    RowNum   = IDENTITY(INT,1,1),
    SomeID   = ABS(CHECKSUM(NEWID()))%2500+1, 
    SomeCode = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
             + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
INTO dbo.TestData
FROM Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2 
GO
        
ALTER TABLE dbo.TestData
ADD PRIMARY KEY CLUSTERED (RowNum);
GO
 
CREATE INDEX IXC_TestData_Cover1 ON dbo.TestData(SomeId) ;
GO
CREATE INDEX IXC_TestData_Cover2 ON dbo.TestData(somecode) ;
GO
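Before getting to “<>”, here is a quick side-by-side of my own (not part of the original tests) showing a function-wrapped predicate versus a SARGable rewrite against the table we just built:

-- Non-SARGable: wrapping the column in a function prevents a seek on IXC_TestData_Cover2.
SELECT rownum
FROM dbo.TestData
WHERE LEFT(SomeCode,1) = 'A';
 
-- SARGable rewrite: LIKE 'begins with' leaves the column alone, so the index can be sought.
SELECT rownum
FROM dbo.TestData
WHERE SomeCode LIKE 'A%';
GO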

Now that we have our table, let’s start by running a simple query using “not equal to.”

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
 
SELECT
    rownum,
    someid, 
    [SomeCode]
FROM dbo.[TestData] t1
WHERE
    t1.[SomeId] <> 0

Here is the query plan and the IO/TIME stats

[screenshot: query plan]

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.

Table 'TestData'. Scan count 1, logical reads 238, physical reads 1, read-ahead reads 146, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 47 ms, elapsed time = 1477 ms.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

As you can see, the query performed like we thought it would, scanning the index to satisfy the query. But what happens when we cover the query with our index? The answer may be quite surprising. To test this we need to reduce the number of columns in the select list.

Here is the new query. Note: we can leave rownum in the select list because this column is part of the clustering key, which is included in nonclustered indexes.

SET NOCOUNT ON;
GO
 
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT
    rownum,
    someId
FROM dbo.[TestData] t1
WHERE
    t1.[SomeId] <> 0
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Here is the query plan and the IO/TIME stats

[screenshot: query plan]

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.

Table 'TestData'. Scan count 2, logical reads 179, physical reads 1, read-ahead reads 153, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 94 ms, elapsed time = 1207 ms.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

The results are pretty interesting. We actually get an index seek because of the covering index, but the IO and CPU time actually increased. You are probably screaming at the screen, “HOLD ON BUDDY, YOU JUST SAID THAT <> IS NOT A VALID SEARCH ARGUMENT!!!” and you would be right. The answer is actually quite simple: the “not equal to” operator did NOT seek the row. If you mouse over the Index Seek show plan operator, you will see that the seek predicate actually gets transformed into a > and a <. This is where I and other SQL Server enthusiasts debate. It raises the question: if the optimizer can change the “not equal to” operator into a ranged search, does this make “not equal to” a valid search argument? I leave that up to you to decide; I, however, say no. In my opinion, < and > are valid search operators, not “<>”. I cannot find any documentation on how or when the predicate transformation actually occurs. The main question I have is whether we can definitively say that the optimizer will ALWAYS transform the predicate. If the answer is yes, then we can say it is a valid search argument, but until then I have to say no. I will most definitely agree that the optimizer is usually pretty smart about using the index to search inequalities. Here is a screenshot of the Index Seek show plan operator.

[screenshot: Index Seek show plan operator with the transformed seek predicate]

Now let’s do what I like to do and rewrite the inequality as a valid SARG.

SET NOCOUNT ON;
GO
 
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT
    rownum,
    someId
FROM dbo.[TestData] t1
WHERE
    t1.[SomeId] < 0 OR t1.[SomeID] > 0
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The query plan and IO/Time Stats:

[screenshot: query plan]

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.

Table 'TestData'. Scan count 2, logical reads 179, physical reads 1, read-ahead reads 153, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 94 ms, elapsed time = 1207 ms.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

So what’s the verdict? Well, in this case the optimizer chose the exact same plan with the exact same IO/CPU time.  So where does that leave us?  I invite each of you to decide for yourselves, but my opinion is to stick with best practices and with valid search arguments.  While these queries are theoretically the same, I believe a predicate should be as SARG-friendly as possible. In doing so, the developer can ensure the optimizer is able to make the best decision possible. The questions on the table are: how consistent is the optimizer in transforming the query, especially if it cannot accurately measure cardinality, and are valid search arguments still relevant? I can’t answer the first question now, perhaps I can come up with an example, but that is a question maybe more appropriate for the query optimization team. I answer yes to SARG relevance. Valid search arguments give the optimizer better cardinality estimates and a more inclusive search, which usually translates into better and more consistent performance. With that said, the optimizer is able to handle many invalid search argument variants and successfully transform those queries into something useful, but do we really want to rely on the optimizer to “automagically” fix our code?


Monday, June 22, 2009

Correlated sub queries for good or evil?

Today on the MSDN forums I engaged in a good discussion on the innards of a correlated subquery, http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/0a4e384c-f8c6-45b7-a4db-c14827500fe3/. I will start off by saying that not all correlated subqueries are evil, just most of them :-). All kidding aside, it is not that correlated subqueries are bad; it is just that there are often better methods to handle these types of queries. Here is the BOL link on correlated subqueries: http://msdn.microsoft.com/en-us/library/ms187638.aspx. Let’s get started.

You are probably asking yourself, “How can a correlated subquery be bad?” The answer is the nested loop join. A nested loop join takes each row from the outer input and searches for matching rows in the inner input. This type of join can be very expensive, especially for queries with large outer results. For example, say the outer query returns 1 million rows and the inner table has 500 rows. In this example, each row returned by the outer table has to be looked up against the inner table, which means the inner table is queried 1 million times! Now that is a serious performance problem. Let’s have a look at some sample code to see how this really works.

First we will create the sample table: (Sample data compliments of Michael Asher)

use [tempdb]
GO
 
IF OBJECT_ID('dbo.item') IS NOT NULL
BEGIN
    DROP TABLE dbo.item;
END
GO
 
create table dbo.Item (
 [ItemID] int not null,
 [name] varchar(50) not null);
 
IF OBJECT_ID('dbo.Child') IS NOT NULL
BEGIN
    DROP TABLE dbo.Child;
END
GO
create table dbo.Child (
 Child1ID int identity(1,1) not null,
 ItemID int not null,
 txt varchar(100) not null );
 
INSERT INTO Item 
SELECT TOP(1000000) rn, 'Row: ' + CONVERT(VARCHAR,rn)
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rn
FROM sys.columns a
CROSS JOIN sys.columns b
CROSS JOIN sys.columns c
) AS A
CREATE CLUSTERED INDEX idx_Item ON Item(ItemID)

Correlated sub query example:

SELECT 
    i.[name],
    (SELECT c.[txt] FROM child c WHERE c.ItemID = i.ItemID)
FROM dbo.Item i
WHERE
    i.ItemId < 1000

Here is the execution plan and IO statistics:

[screenshot: execution plan]

Table 'Child'. Scan count 999, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Item'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, a nested loop join was used to retrieve the data for the correlated subquery, and the table “Child” was scanned 999 times. Obviously, the increased IO and CPU translates into performance degradation.

Note: You can mouse over the item table show plan operator to view the actual number of rows.

Now let’s change the query to use a join.

SELECT 
    i.[name],
    c.[txt]
FROM dbo.Item i
LEFT JOIN child c
    ON c.ItemID = i.ItemID
WHERE
    i.ItemId < 1000

Execution plan and IO statistics:

[screenshot: execution plan]

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Item'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Child'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The difference between the queries is night and day. The number of scans decreased to 1, which is a heck of a lot better than 999. Moving the correlated subquery into a join makes the query much more performant and is an easy-to-implement solution. This is one of the most common performance problems that I have encountered over the years. It is easy to fix, but it occurs a lot because developers do not understand the impact this type of query can have. The bottom line is that RDBMSs are optimized for joins, and best practice suggests using joins to relate data, not correlated subqueries. This does not mean there are no special cases where a correlated subquery may be faster, but those cases are the exception and not the rule.

Extra Content:

Note: There are many instances where the query optimizer can transform a correlated subquery into a join. The catch is that this usually occurs when the correlated subquery is used in conjunction with EXISTS/IN in the WHERE clause. A correlated subquery in the select list will more than likely use a nested loop join. Regardless of how the optimizer transforms the query text, I would rather code the T-SQL correctly the first time and not have to worry about the optimizer fixing the query for me.

Here is a sample of how the optimizer changes the correlated subquery to an inner join.

SELECT 
    i.[name]
FROM dbo.Item i
WHERE
    i.ItemId < 1000
    AND EXISTS(
        (SELECT c.[txt] FROM child c WHERE c.ItemID = i.ItemID)
    )

Thursday, June 18, 2009

Stored Procedure Set Options.. A potential gotcha

I have seen a lot of problems where stored procedures stop working correctly all of a sudden.  There can be numerous reasons why a stored procedure misbehaves, but one of the most overlooked culprits is SET options.  I want to talk about how SET options can affect query results and behavior. The behavior in question applies to two SET options and is fully documented in BOL, http://msdn.microsoft.com/en-us/library/ms187926.aspx. The two SET options are ANSI_NULLS and QUOTED_IDENTIFIER.  When a stored procedure is created, the SET options that are enabled for that session are stored in metadata.  For each subsequent run, the metadata settings are used and any explicit SET options issued outside of the stored procedure are ignored. In SQL Server 2005 and greater, you can use the catalog view sys.sql_modules to view the properties of a given object.  If you find yourself in a situation where you need to override the metadata, you should drop and recreate your stored procedure with the correct SET options. Here is a sample query to check the metadata of a given object.

SELECT *
FROM sys.sql_modules 
WHERE OBJECT_ID = OBJECT_ID('usp_TestProc')
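If you only care about the two settings in question, you can narrow the query down to the relevant columns, which exist in sys.sql_modules:

SELECT 
    OBJECT_NAME(object_id) AS proc_name,
    uses_ansi_nulls,
    uses_quoted_identifier
FROM sys.sql_modules 
WHERE OBJECT_ID = OBJECT_ID('usp_TestProc')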

Now that we know our objective, let’s get to the code.

We will start by creating a simple stored procedure that declares a table variable and selects some data.  We will turn ANSI_NULLS OFF prior to creating the stored procedure.

USE [tempdb]
GO
 
--drop the sp and recreate with the correct set options.
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_TestProc]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_TestProc]
GO
 
SET ANSI_NULLS OFF --<------------ANSI_NULLS OFF
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE PROCEDURE usp_TestProc
AS
BEGIN
 
DECLARE @t TABLE(
id INT,
col CHAR(1)
)
 
INSERT INTO @t VALUES (1,'a');
INSERT INTO @t VALUES (2,NULL);
 
SELECT *
FROM @t 
WHERE col = NULL
 
END
GO

Now that we have our stored procedure, let’s check the metadata in sys.sql_modules by running the code below.

SELECT *
FROM sys.sql_modules 
WHERE OBJECT_ID = OBJECT_ID('usp_TestProc')

You should see something like the screenshot below:

[screenshot: sys.sql_modules metadata for usp_TestProc]

Now we are going to execute the stored procedure, making sure to explicitly set the SET options differently than they were at creation time. Remember, ANSI_NULLS was OFF when the stored procedure was created; this time we are going to set it to ON.  We do not expect to see results, because with the option on, a comparison to NULL should evaluate to unknown.

SET ANSI_NULLS ON
GO
EXEC [usp_TestProc]
GO

Results: The stored procedure returned results even though we turned on ANSI_NULLS.  The query should not have returned anything, because comparing to NULL with = or <> should resolve to UNKNOWN; however, the stored procedure metadata overrode our SET option and thus we get a result.

[screenshot: stored procedure returning the NULL row]
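As a quick sanity check of my own, outside of a stored procedure the session setting is honored, which you can verify with an ad hoc batch like this:

-- Ad hoc batches honor the session setting (my own sketch, not part of the proc test).
SET ANSI_NULLS OFF;
GO
DECLARE @chk TABLE(id INT, col CHAR(1));
INSERT INTO @chk VALUES (1,'a');
INSERT INTO @chk VALUES (2,NULL);
SELECT * FROM @chk WHERE col = NULL;  -- returns the NULL row
GO
 
SET ANSI_NULLS ON;
GO
DECLARE @chk TABLE(id INT, col CHAR(1));
INSERT INTO @chk VALUES (1,'a');
INSERT INTO @chk VALUES (2,NULL);
SELECT * FROM @chk WHERE col = NULL;  -- returns nothing
GO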

Let’s switch this around by dropping the procedure, recreating it with ANSI_NULLS ON, and then explicitly turning the option OFF.

--drop the sp and recreate with the correct set options.
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_TestProc]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_TestProc]
GO
 
SET ANSI_NULLS ON
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE PROCEDURE usp_TestProc
AS
BEGIN
 
DECLARE @t TABLE(
id INT,
col CHAR(1)
)
 
INSERT INTO @t VALUES (1,'a');
INSERT INTO @t VALUES (2,NULL);
 
SELECT *
FROM @t 
WHERE col = NULL
 
END
GO

Now let’s execute the stored procedure, with the new SET options.

SET ANSI_NULLS OFF;
GO
EXEC [usp_TestProc]
GO

Results:  There were no results, because ANSI_NULLS was ON when the procedure was created, which means = or <> comparisons to NULL return no rows, regardless of the fact that we explicitly turned the option OFF.

[screenshot: empty result set]

As you can see, even though we explicitly set ANSI_NULLS, the option is ignored because the stored procedure’s metadata overrides it. This can make debugging a stored procedure a little more difficult, but it is a very good fact to know.  You can use a SET option within the stored procedure itself, which should override any global settings for the duration of the stored procedure.  I have seen this problem bite people on the $%^ a few too many times, and hopefully by reading this you will become more aware of the potential pitfalls regarding SET options.

Tuesday, June 16, 2009

Superfluous columns…. more than a bad habit?

Is using superfluous columns in a SELECT list bad? Sure it is, and for a multitude of reasons. The better question is, why do developers still use superfluous columns, knowing that they should not?  I believe developers get tunnel vision and forget to look at the big picture; perhaps it is laziness, or perhaps the developer does not care because company policy is to fix performance problems with bigger hardware, which is common these days. Whatever the reason, tunnel vision and laziness usually lead to rapidly developed, non-performant code. Developers often forget the overall impact bad code can have on a system, no matter how simple the code is.  The impact depends on varying factors including access frequency, bytes returned, query performance, etc.  I am going to explore why superfluous columns are not a best practice.

Let’s get started by creating a sample table with a million rows.

USE [tempdb]
GO
 
IF OBJECT_ID('dbo.t') IS NOT NULL
DROP TABLE dbo.t
GO
 
SELECT TOP 1000000 --<<<LOOK! CHANGE THIS NUMBER TO CHANGE THE NUMBER OF ROWS!
        RowNum   = IDENTITY(INT,1,1),
        SomeID   = ABS(CHECKSUM(NEWID()))%2500+1, --<<<LOOK! CHANGE THIS NUMBER TO 1/400th THE ROW COUNT
        SomeCode = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                 + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCode2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCode3 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
   INTO dbo.t
   FROM Master.dbo.SysColumns t1,
        Master.dbo.SysColumns t2 
 
--===== Add the index to support both types of queries
CREATE INDEX IXC_TestData_Cover1 ON dbo.t(SomeID,SomeCode)
GO

Now that we have our table and test data, let’s run a simple query using select * and another select statement that uses fewer columns.  You should enable client statistics to view the bytes returned to the client.

Note: You can enable this via the toolbar, shown below, or by clicking the option from the query menu.

[screenshot: client statistics toolbar button]

First, run the query with superfluous columns (make sure you enable client statistics):

SELECT *
FROM dbo.t
WHERE
    someid < 500
GO

Now let’s run the query with fewer columns:

SELECT 
    RowNum,
    SomeID,
    SomeCode
FROM dbo.t
WHERE
    someid < 500
GO

The client statistics between the two runs should look similar to the screenshot below.  You will notice that the bytes received from the server have drastically decreased for the query with fewer columns in the SELECT list. The bytes sent from the client is just the query text, which is irrelevant to our tests.

[screenshot: client statistics comparison of the two runs]

The value of interest is “Bytes received from server.”  This represents the number of bytes sent from the server to the client.   In our case, the bytes dropped from 1024209 to 551576.  To put the reduction into perspective, let’s crunch the numbers.  We know that the superfluous query used 1024209 bytes. We can convert this into the number of MBs returned, basing the numbers on a concurrent user load of 500 users. The calculation becomes ((number of bytes * number of concurrent users) / 1024.0) / 1024.0.  The result of this calculation is shown below.

 

                          Bytes        KB          MB
Superfluous Columns       512104500    500102.1    488.381
Non-Superfluous Columns   275788000    269324.2    263.012
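For reference, here is the same arithmetic expressed in T-SQL (the byte counts are from my test run and will vary slightly on your machine):

SELECT 
    (1024209 * 500) / 1024.0 / 1024.0 AS Superfluous_MB,      -- ~488 MB
    (551576  * 500) / 1024.0 / 1024.0 AS NonSuperfluous_MB;   -- ~263 MB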

As you can see, the number of bytes returned by the superfluous query is over 200 MB larger. This may not seem like much, but this is only one query; imagine if you had 1000 concurrent users.  The superfluous column query would return approximately 500 MB more data than the trimmed query, putting it at nearly 1 GB of data… now that is huge!

By selecting the columns we need, we are able to reduce the amount of data the web server has to process, which can lead to faster load times, less consumed resources, and a better end user experience.

So, you’re probably sold on the idea that superfluous columns are bad, but I have one more point to make.  Not only does the query return more bytes, but superfluous columns can also affect the optimizer’s ability to use a covering index, even in a derived table/CTE.

Let’s use the same sample table, with a slight modification to the query.

Superfluous query:

SELECT *
FROM dbo.t
WHERE
    someid = 100
GO

Below is the query plan.  What you see is that our index was used; however, the index does not contain all the columns in our SELECT list, so the optimizer has to go back to the heap to get the remaining column data.

Note: Depending on the indexes in place, the RID Lookup could be replaced by a Key Lookup, if a clustered index exists.

[screenshot: query plan with Index Seek + RID Lookup]
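If you want to see the Key Lookup variant mentioned in the note, a quick test of my own (not part of the original walkthrough) is to cluster the table, rerun the query, and then drop the constraint so the rest of the examples still behave as shown:

-- Hypothetical side test: dbo.t is a heap, so we see a RID Lookup. Clustering it
-- turns the lookup into a Key Lookup for the same query.
ALTER TABLE dbo.t ADD CONSTRAINT PK_t_RowNum PRIMARY KEY CLUSTERED (RowNum);
GO
 
SELECT *
FROM dbo.t
WHERE
    someid = 100;   -- now Index Seek + Key Lookup
GO
 
-- Put the heap back so the remaining examples match the screenshots.
ALTER TABLE dbo.t DROP CONSTRAINT PK_t_RowNum;
GO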

Here is the non-superfluous column query:

SELECT 
    SomeID,
    SomeCode
FROM dbo.t
WHERE
    someid = 100
GO

These are the results from the query with non-superfluous columns:

[screenshot: query plan with covering index seek]

You can test the same results with a CTE:

--this yields an index seek plus a RID lookup
;WITH cte
AS
(
    SELECT 
        [RowNum],
        [SomeId],
        [SomeCode],
        [SomeCode2],
        [SomeCode3]
    FROM t
    WHERE 
        [SomeId] = 100
)
SELECT 
    [SomeId],
    [SomeCode]
FROM cte
 
--this yields an index seek via covering index
;WITH cte
AS
(
    SELECT 
        [SomeId],
        [SomeCode]
    FROM t
    WHERE 
        [SomeId] = 100
)
SELECT 
    [SomeId],
    [SomeCode]
FROM cte

Results:

Superfluous columns:

[screenshot: query plan with RID Lookup]

Non-superfluous columns:

[screenshot: query plan with covering index seek]

So there you have it. Selecting more columns than you need not only affects application performance, it also increases the number of bytes returned to the client. All of these factors can directly affect the user’s experience, which should be very important to developer/DBA staff.  Additionally, superfluous columns can cause bookmark lookups, which will degrade database performance.

Friday, June 12, 2009

SQL Server Auto Statistics

I was asked a question on the MSDN forums regarding a custom function to automatically update statistics when a threshold is breached.  I went on to say that SQL Server already handles this for us automatically, and that sp_updatestats or UPDATE STATISTICS should be used if manual statistics updates are needed.  I wanted to briefly discuss some of the SQL Server built-in functionality regarding statistics. SQL Server has two options enabled by default, which help the database engine create and manage statistics.  These two options are AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS. 

First, let’s talk about what statistics are and how the optimizer uses them.  Statistics describe the distribution of column values within a table or indexed view.  The optimizer uses this statistical information to make cardinality estimates.  So what does this mean?  It means the optimizer uses the information to estimate the number of rows in the query result. The more accurate the estimate, the better the query plan.  In some cases, good statistics can make the difference between getting an index seek and getting an index scan.
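A quick way to eyeball those estimates (my own aside, not from the original post) is SET STATISTICS PROFILE, which adds Rows (actual) and EstimateRows (estimated) columns to the plan output; the query below assumes the Adventure Works database used later in this post:

SET STATISTICS PROFILE ON;
 
SELECT *
FROM Production.ProductProductPhoto
WHERE ProductID < 500;   -- any filtered query will do
 
SET STATISTICS PROFILE OFF;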

Now that we know more about what statistics are, let’s talk about how SQL Server creates them.  The AUTO_CREATE_STATISTICS option allows the database engine to automatically create statistics on individual columns that appear in query predicates.  You can check statistics by querying the sys.stats catalog view.  When statistics are automatically created, they are prefixed with “_WA” and also have the auto_created flag set to 1.

SELECT 
    OBJECT_NAME(s.object_id) AS object_name,
    COL_NAME(sc.object_id, sc.column_id) AS column_name,
    s.name AS statistics_name
FROM sys.stats s 
INNER JOIN sys.stats_columns sc
    ON  s.stats_id = sc.stats_id 
        AND s.OBJECT_ID = sc.Object_id    
INNER JOIN sys.tables t
    ON t.OBJECT_ID = s.object_id    
WHERE 
    s.name like '_WA%'
    AND s.auto_created = 1
    AND t.TYPE = 'U'
    AND t.is_ms_shipped = 0
ORDER BY 
    s.name;

You should note that auto-created statistics are only present on a single column; to obtain denser statistical data you will need to create an index or manual statistics (see the sketch below). Now that we have statistics, how does this help our query run faster? In short, if a query predicate contains a column with statistics, the optimizer does not have to guess at the number of rows affected by the query, and thus has enough information to intelligently create the query plan.
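Here is a hedged sketch of a manually created multi-column statistics object, using the Adventure Works table referenced later in this post (the column choice is purely for illustration):

-- Hypothetical multi-column statistics; gives the optimizer density information
-- across the column combination without the cost of maintaining an index.
CREATE STATISTICS st_ProductPhoto_ProductID_Primary
ON Production.ProductProductPhoto (ProductID, [Primary])
WITH FULLSCAN;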

Now that we have statistics, we can rely on the database engine to keep them up to date. SQL Server uses AUTO_UPDATE_STATISTICS (enabled by default) to manage statistics.  Statistics are checked before query compilation or before executing a cached query plan. If a table has just received a large insert/delete transaction, the next query may experience slowness or delay because the statistics have to be checked and updated. In that situation it is best to update the statistics manually, after the DML operation (an example follows the list below).  So how does the optimizer know the statistics are out of date? When checking statistics, the optimizer compares the number of data modifications since the last update against a threshold.  If the number of data modifications is greater than the threshold, the statistics are updated.  The technical threshold limits are listed below.

Statistics are considered out-of-date when:

  1. The table size has gone from 0 to >0 rows.

  2. The number of rows in the table when the statistics were gathered was 500 or less, and the column modification counter of the leading column of the statistics object has changed by more than 500 since then.

  3. The table had more than 500 rows when the statistics were gathered, and the column modification counter of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table when the statistics were gathered.

  4. If the statistics object is defined on a temporary table, it is out of date as discussed above, except that there is an additional threshold for recomputation at 6 rows, with a test otherwise identical to test 2 in the previous list.
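As promised above, here is a minimal sketch of a manual refresh after a large DML operation (the table name is the Adventure Works table used in the DBCC example below):

-- Refresh a single table's statistics with a full scan...
UPDATE STATISTICS Production.ProductProductPhoto WITH FULLSCAN;
 
-- ...or refresh every statistics object in the current database that needs it.
EXEC sp_updatestats;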

You can view the statistical information for an object by executing DBCC SHOW_STATISTICS.  Here is a sample using the Adventure Works database:

DBCC show_statistics('[Production].ProductProductPhoto','_WA_Sys_00000002_01142BA1')

DBCC SHOW_STATISTICS: http://msdn.microsoft.com/en-us/library/ms174384.aspx

This is how SQL Server creates and manages statistics.  Remember that just because SQL Server does this for you does not mean you can forget about it.  You always want to stay on top of statistics, just like indexes. In some cases AUTO_UPDATE_STATISTICS is not enough, and running sp_updatestats every so often may help alleviate query performance issues. The bottom line is, if your statistics are bad, there is a good chance the optimizer may miss index seeks or choose suboptimal joins. There is a lot more to know about statistics and I have hardly scratched the surface here.  If you would like to learn more about SQL Server statistics, take a look at the referenced links, as they have a wealth of information.

References:

http://msdn.microsoft.com/en-us/library/ms190397.aspx

http://technet.microsoft.com/en-us/library/cc966419.aspx