Tuesday, December 15, 2009

SQL Server 2005 – How To Move 10 Million Rows In 1 Millisecond

This blog post is more of a tip that I picked up while at PASS 2009.  Have you ever had the need to copy the contents of an entire table into another table?  Traditionally speaking, we as developers will use a SELECT INTO or an INSERT INTO statement to load a destination table.  This is still a great way of accomplishing the task at hand, but it is not nearly as fast as what I am about to show you.  The method I am about to show you is not for all scenarios, but it can be very handy.  I do not know how many Oracle guys are reading this, but I have one question for you: “can your RDBMS move 10 million rows of data in <= 1 millisecond?”  I would be willing to bet that most will answer no, but I have to admit that this is not really a fair fight.  Why is this a little unfair… let’s have a look at how this works under the hood. 

This method derives its power from the partitioning functionality introduced in SQL Server 2005.  If you have used partitioning in SQL Server 2005, you probably have a good idea where I am going with this.  If not, SQL Server 2005 has built-in functionality that allows tables to be split, or divided, into what I will call virtual tables, whose values are dependent on predefined boundaries.  When the partitioning function column is used in a query predicate, the optimizer knows which partition the data resides in, which makes queries more IO efficient.  This is amazing functionality because it does not require application changes and significantly reduces the amount of data SQL Server has to sift through.  The partitioning feature I will be focusing on is the one that allows SQL Server to switch, or trade, partitions out, aptly named SWITCH.  This is commonly used in situations where you want to move data to a different partition, either because the boundaries have changed or you need to phase data out.  The real benefit of using SWITCH is that SQL Server does not actually move the data; it updates the metadata pointers to the data.  Because I am not actually moving data, I am able to move it around nearly instantaneously, regardless of the number of rows.  This is why I said it is not fair, but hey, what in life is fair :^)

Okay let’s see an example.  I will start by creating a sample table.

USE [tempdb]
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData')
BEGIN
    DROP TABLE [dbo].[TestData];
END
GO

CREATE TABLE [dbo].[TestData](
RowNum INT PRIMARY KEY,
SomeId INT,
SomeCode CHAR(2)
);
GO

INSERT INTO [dbo].[TestData]
SELECT TOP 10000000 
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID()))%2500+1 AS SomeId, 
    CHAR(ABS(CHECKSUM(NEWID()))%26+65)
    + CHAR(ABS(CHECKSUM(NEWID()))%26+65) AS SomeCode
FROM 
    Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'NewTestData')
BEGIN
    DROP TABLE [dbo].[NewTestData];
END
GO

--Create New Table To Move Data To
CREATE TABLE [dbo].[NewTestData](
RowNum INT PRIMARY KEY,
SomeId INT,
SomeCode CHAR(2)
);
GO

Now for the fun part… behold the power of SQL Server!

--Move data to the new table
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

ALTER TABLE [dbo].[TestData] SWITCH to [dbo].[NewTestData];

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO

/*
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms.
*/

Next, I will verify the results.

SELECT COUNT(*) FROM [dbo].[TestData]; --0
SELECT COUNT(*) FROM [dbo].[NewTestData]; --10,000,000

/*
-----------
0

(1 row(s) affected)


-----------
10000000

(1 row(s) affected)
*/
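If you want a second opinion beyond the row counts, a quick peek at the catalog views tells the same story.  This is just a sketch I would use for verification; it is not required for the switch itself.

--Optional check: row counts per table straight from the catalog
SELECT OBJECT_NAME(p.[object_id]) AS TableName, p.index_id, p.rows
FROM sys.partitions AS p
WHERE p.[object_id] IN (OBJECT_ID('dbo.TestData'), OBJECT_ID('dbo.NewTestData'));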

There you have it.  I have successfully moved 10 million rows into a new table in 1 millisecond, and STATISTICS IO reported no IO; some IO does have to occur to update the metadata, but it is minimal.  This method has limited use, but it can be extremely advantageous.  There are stipulations that have to be met for SWITCH, such as the table you are switching to must be empty and must have the same schema as the source.  For a comprehensive list of requirements, please refer to this link: http://technet.microsoft.com/en-us/library/ms191160.aspx

Until next time, happy coding.

Wednesday, December 9, 2009

Splitting A Delimited String (Part 2)

This is part two of a two-part series.  In part 1 of this series I demonstrated the most popular methods used to parse/split a delimited string of values, http://jahaines.blogspot.com/2009/11/splitting-delimited-string-part-1.html.  In this article I will be focusing on the performance implications of each method presented.  I will start off by giving a disclaimer that your results may vary from the results presented in this article.  Although the numbers may differ, the data trend should be somewhat consistent with my results.  I will be tracking three key performance counters: CPU, Duration, and Reads, against varying string sizes and data loads.  In addition to varying the delimited string’s length and the data load, I ran each query 10 times and took an average, to ensure I get the most accurate results.  Enough talk, let’s dive into our first test.

Note: I am not going to walk through how I did each test, but I will link all my scripts at the bottom of this post.

The first test measures how CPU usage differs between methods.  I tested a delimited string of exactly 10 Ids against a table with 10,000 rows, 100,000 rows, and 1,000,000 rows.  As you can see, the permanent numbers table TVF is by far the best solution, while CPU is highest for the inline numbers TVF.  The inline numbers table is the most expensive because SQL Server has to do a lot of processing and calculation on the fly, whereas the permanent numbers table mostly has to read data, which means its IO will be much higher.  Both XML methods perform much better than the inline numbers table, but are nearly twice as slow as the permanent numbers TVF because they require more CPU-intensive processing, which comes from SQL Server converting the data to XML and transforming the XML back to a relational format.  It takes more processing power to encode and decode the XML than to simply convert it, so you should only use the encode/decode method if your data contains special XML characters, http://msdn.microsoft.com/en-us/library/aa226544(SQL.80).aspx.

Winner of Round 1: Numbers Table Split Function

[Chart: CPU usage – 10 Ids]

The next test was performed on tables with 10,000 rows, 100,000 rows, and 1,000,000 rows, using a string consisting of 100 Ids.  If you look at the chart below, you will see a trend: as the number of Ids increases, so does the cost of the XML method.  CPU usage actually increases at an exponential rate, which makes it the least scalable solution.  Obviously the numbers table TVF is the clear winner here. 

Winner of Round 2: Numbers Table Split Function

[Chart: CPU usage – 100 Ids]

The final test used the same load, but I supplied a string of 1,000 Ids.  As you can see, with 1,000 Ids the results depict more of the same behavior.  The big takeaway is that the XML method should only be used when the number of values in the delimited string is relatively small.

Winner of Round 3: Numbers Table Split Function

[Chart: CPU usage – 1,000 Ids]

The next counter I will be focusing on is Duration.  Duration is not a very reliable counter, as it is very dependent on other processes running on the machine; however, it does provide insight into performance.  The first test will be done over the same load.  I will begin with 10 Ids again.

The results are a little more closely aligned in this test.  The numbers table split TVF is the best performer on average, followed by the XML methods.  Again, there is a higher performance cost to encoding XML, so do so only when necessary.  Duration does give you a general idea about performance, but these results definitely do not carry as much weight as the other counters.

Winner of Round 4: Numbers Table Split Function

[Chart: Duration – 10 Ids]

The next step is to increase the number of Ids to 100.  I won’t repeat the same stuff over again, I promise.  This test yields more of the same.

Winner of Round 5: Numbers Table Split Function

[Chart: Duration – 100 Ids]

Next, I bump the number of Ids to 1,000.  Here we go… a different result :).  In this example, the permanent numbers table actually performed worse than the inline numbers TVF.  You may be wondering why the duration is worse for the numbers table split function.  I suspect the answer is that the number of reads makes the query take longer to execute; however, there could have been something running on my machine when Profiler captured the data.  This is a prime example of why duration can be an unreliable counter: I cannot tell whether the query is actually worse or whether an environmental factor on my laptop skewed the timing.  I will take the high road :^) and assume the reads impacted the timing, because I had no known programs running.   The XML results are just disturbing… For as much as I recommend XML split solutions on the forums, these results are just scary.

Winner of Round 6: Inline Numbers TVF Split

[Chart: Duration – 1,000 Ids]

The final counter I will be testing is reads.  This is by far one of the most important counters because it impacts so many facets of SQL Server performance.  Do not let the number of reads for the numbers table TVF persuade you to avoid it.  A permanent numbers table TVF is going to have more reads: essentially you are reading table values from cache/disk instead of calculating them, so the number of reads is greater.  The obvious winner of this test is the inline numbers table TVF.

Winner of Round 7: Inline Numbers TVF Split

[Chart: Reads – 10 Ids]

The next test increases the number of Ids to 100.  As for the results, we see more of the same.

Winner of Round 8: Inline Numbers TVF Split

[Chart: Reads – 100 Ids]

Finally, I will increase the number of Ids to 1,000.  Again, more of the same.  The number of reads stays relatively consistent across all solutions.  The XML solution does better than the numbers table TVF here, but it just does not scale well at any other level.  I used to be a firm believer in the XML method, but I think I am going to start primarily recommending the permanent numbers table TVF or an inline numbers table TVF.

Winner of Round 9: Inline Numbers TVF Split

[Chart: Reads – 1,000 Ids]

The verdict

So what is the verdict?  Well, again, I cannot reiterate enough that no one solution is always better than another.  Different solutions work best in different situations and environments.  In my tests there is a clear winner, the permanent numbers table TVF.  Even though this is the “winner,” I have overwhelming evidence that says I should be using a numbers table to split a string, regardless of whether the numbers table is permanent or inline.  I am happy with these results because the permanent numbers table split function performs well and is very easy to implement.  Another benefit of the permanent numbers table TVF solution is that it works in all versions of SQL Server.  Why would you not want to use the permanent numbers table?  You may be willing to accept more CPU consumption to reduce IO, or perhaps you do not want to maintain another table.  If this is the case, an inline numbers table solution is the way to go.  All in all, the XML method really did surprise me, and I find it sad that this method does not perform well.  There is just not enough incentive to use an XML solution when an easier and better-performing solution exists.  Remember that you should test each method to see which works best in your environment. 

I hope that you have learned something and can use this information to make more informed decisions when choosing a method to split delimited strings.

**** UPDATE *****

I did have some bugs in the code for the 1,000 Ids test.  I guess my relaxed brain never came home from vacation.  A special thanks goes out to Brad Schulz for spotting the bug.  I am glad that the end result is, for the most part, the same.  The only real deviation occurred in the reads category.  The number of reads should be comparable for all methods because the same amount of data is returned, but in my original result set they were not.  I resolved the bug and have since updated the scripts and the results.

****UPDATE*****

A good friend and fellow SQL Server enthusiast, Brad Schulz, recently posted a great entry on parsing delimited strings with XML.  His findings show how using the XML method in certain ways can cause the query to really bomb; however, you can avoid some performance penalties by casting and storing the delimited string in an XML data type variable, instead of casting and parsing the XML inline.  I will not go into detail about why the inline XML is slower because I want you to read it right from the horse’s mouth, http://bradsruminations.blogspot.com/2009/12/delimited-string-tennis-anyone.html.   When I changed my code to use an XML variable, the XML methods were just as performant as the numbers table methods.  I do not know about you, but I am ecstatic.  I for one love the XML method and am very excited to see that it can be just as performant when used in the right context.
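For those curious what that change looks like, here is a rough sketch of the XML variable approach against the dbo.t sample table from part 1.  This is my shorthand version without the special-character encoding step; see Brad’s post for his exact approach.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
--Cast the delimited string into an XML variable once, up front
DECLARE @x XML
SET @x = CAST('<i>' + REPLACE(@Ids,',','</i><i>') + '</i>' AS XML)
 
--Shred the XML variable instead of casting/parsing the XML inline
SELECT t.*
FROM dbo.t
INNER JOIN(
    SELECT x.i.value('.','INT') AS SomeId
    FROM @x.nodes('/i') x(i)
) AS Ids
    ON Ids.SomeId = t.SomeId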

Download the script files: http://cid-6f041c9a994564d8.skydrive.live.com/self.aspx/.Public/Split%20Delimited%20String/BlogPost^_UnpackDelimitedString.zip

Sunday, November 15, 2009

Splitting A Delimited String (Part 1)

In a previous post, I demonstrated how to split a finite array of elements using XML, http://jahaines.blogspot.com/2009/06/converting-delimited-string-of-values.html.  The method presented in my previous post can only be used when there is a known number of elements.  In this post, I will be focusing on the methods most commonly used today to parse an array when the number of elements is unknown.  This is part one of a two-part series where I look at the differing methods used to split an array of strings, using SQL Server 2005 and 2008.  There are three primary methods to parse an array.  The first method takes advantage of a numbers table to quickly parse the string.  The second method uses the XML functionality built into SQL Server 2005.  The final method uses a TVF that does not require a permanent numbers table. 

Let’s get started by creating our sample table.

SET NOCOUNT ON
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 't')
BEGIN
    DROP TABLE dbo.[t];
END
GO
 
CREATE TABLE dbo.t(
SomeId INT NOT NULL PRIMARY KEY,
SomeCode CHAR(2)
);
GO
 
;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT dbo.t (SomeId,SomeCode)
SELECT 
    N,
    CHAR(ABS(CHECKSUM(NEWID()))%26+65)
    + CHAR(ABS(CHECKSUM(NEWID()))%26+65) AS SomeCode    
FROM Nums
WHERE N<=10000;
GO
 
CREATE NONCLUSTERED INDEX ncl_idx_SomeCode ON dbo.t(SomeCode);
GO

The first method I will demonstrate is the numbers table method.  This is probably the most efficient of all the methods listed, but performance does vary among environments.  Another great benefit of this method is that it works with SQL Server 2000 and greater.

The first step in using this method is to create a table of numbers.  A numbers table is just what it sounds like… a table of natural numbers starting from 1 and going to n, where n is the maximum number you want in the table.  This method really performs well because of a clustered index on the number column, which allows for very fast index seeks.  Here is the code I use to generate my numbers table.

--=============================================================================
--      Setup
--=============================================================================
USE [tempdb]
GO
 
SET NOCOUNT ON 
GO
 
--=============================================================================
--      Create and populate a Numbers table
--=============================================================================
--===== Conditionally drop 
IF OBJECT_ID('dbo.Numbers') IS NOT NULL 
BEGIN
    DROP TABLE dbo.Numbers;
END
GO
 
CREATE TABLE dbo.[Numbers](
N INT NOT NULL
);
GO
 
;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT Numbers
SELECT N FROM Nums
WHERE N<=10000;
GO
 
ALTER TABLE dbo.Numbers ADD CONSTRAINT PK_N
PRIMARY KEY CLUSTERED ([N])WITH(FILLFACTOR = 100);
GO

Next, we will need to create an inline TVF (table-valued function) to split the array.  This function was written by SQL Server guru Itzik Ben-Gan.  This split function is very fast and very scalable. 

IF OBJECT_ID('dbo.fn_split') IS NOT NULL
DROP FUNCTION dbo.fn_split;
GO
CREATE FUNCTION dbo.fn_split(@arr AS NVARCHAR(2000), @sep AS NCHAR(1))
RETURNS TABLE
AS
RETURN
SELECT
(n - 1) - LEN(REPLACE(LEFT(@arr, n-1), @sep, N'')) + 1 AS pos,
SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
FROM dbo.Numbers
WHERE n <= LEN(@arr) + 1
AND SUBSTRING(@sep + @arr, n, 1) = @sep;
GO

This function’s logic is pretty straightforward, but I will walk through how it works.  The numbers table is used to step through each character position in the array.  As you can see, the first step is to pad the beginning of the string with the delimiter.  With all the delimiters in place, the code can determine the position of each delimiter.  Once the logic has the delimiter’s position, it uses the CHARINDEX() and SUBSTRING() system functions to extract each element and its corresponding position.
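Before moving on, here is a quick walkthrough I put together to make the mechanics concrete (my own illustration, not part of the original function).

--For @arr = N'10,20' and @sep = N',', the padded string @sep + @arr is ',10,20'.
--The Numbers table supplies n = 1 through 6, and only n = 1 and n = 4 land on a delimiter.
--  n = 1: SUBSTRING('10,20', 1, 2) = '10', pos = 1
--  n = 4: SUBSTRING('10,20', 4, 2) = '20', pos = 2
SELECT * FROM dbo.fn_split(N'10,20', N',');
 
/*
pos         element
----------- -------
1           10
2           20
*/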

Now that we know how this code works, let’s see it in action.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN dbo.fn_split(@Ids,',') AS fn
    ON t.SomeId = fn.Element
 
/*
SomeId      SomeCode
----------- --------
1           SQ
500         BO
5439        ZV
9999        RD
7453        IG
*/

Before we move on to the next method, I would like to point out that some developers love to use the EXISTS clause for this situation, especially when they do not need any columns from the TVF; however, EXISTS may degrade performance.  I am planning to do an in-depth post regarding the differences between INNER JOIN and EXISTS.  To give you an idea of how EXISTS can degrade performance, have a look at the screenshot below.  Please note that performance is not always black or white, and all execution plans will not deviate as much as the one below.
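For reference, the EXISTS flavor of the query would look something like this; this is a sketch of the general shape I compared, using the same split function and parameter as above.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
WHERE EXISTS(
    SELECT 1
    FROM dbo.fn_split(@Ids,',') AS fn
    WHERE fn.Element = t.SomeId
)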

Execution Plan:

[Screenshot: execution plans – INNER JOIN vs. EXISTS]

IO Stats:

********************* inner join *************************
 
Table 't'. Scan count 0, logical reads 14, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
********************* exists *************************
 
Table 'Worktable'. Scan count 2, logical reads 20018, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 2, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't'. Scan count 3, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, the difference between the INNER JOIN and the EXISTS query plans is night and day.  I will post on this at a later date, but I wanted to make you aware of possible performance problems.

The next method I will be discussing is the XML nodes method.  This method does not require an ancillary table, but it does require SQL Server 2005 or greater.  This method also handles XML special characters.  I picked up this encoding/decoding method from SQL Server enthusiast Brad Schulz, http://bradsruminations.blogspot.com/.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN(
    SELECT x.i.value('.','INT') AS SomeId
    FROM(SELECT XMLEncoded=(SELECT @Ids AS [*] FOR XML PATH(''))) AS EncodeXML
    CROSS APPLY (SELECT NewXML=CAST('<i>'+REPLACE(XMLEncoded,',','</i><i>')+'</i>' AS XML)) CastXML
    CROSS APPLY NewXML.nodes('/i') x(i)
) AS Ids
    ON Ids.SomeId = T.SomeId

The XML method looks a lot more complex than it really is.  Essentially, what I am doing is creating an XML structure that contains each of the elements of the array.  For example, the array “1,500,5439,9999,7453” becomes “<i>1</i><i>500</i><i>5439</i><i>9999</i><i>7453</i>”.  The first step is to encode the array by using FOR XML PATH.  Once the array is in an encoded string format, I explicitly cast the string into an XML data type.  Once I have the XML, I use the nodes() method to put the XML values into a relational format.  For more information about how this method works, you can view the following blog post by Brad Schulz, http://bradsruminations.blogspot.com/2009/10/un-making-list-or-shredding-of-evidence.html.  This post does a great job of breaking down the inner workings of this method.
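To see why the encoding step matters, here is a tiny illustration of my own: FOR XML PATH('') entitizes special XML characters, so a value that would otherwise break the CAST to XML comes through safely.

--'AT&T,Bob' would fail a direct CAST to XML once wrapped in <i> tags,
--but FOR XML PATH('') converts the ampersand to &amp; first
SELECT 'AT&T,Bob' AS [*] FOR XML PATH('');
 
/*
Returns: AT&amp;T,Bob
*/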

The final method I will be demonstrating uses a TVF with a virtual table of numbers.  This method does not require an ancillary table because a numbers table is generated on the fly.

IF OBJECT_ID('dbo.fn_TVF_Split') IS NOT NULL
DROP FUNCTION dbo.fn_TVF_Split;
GO
 
CREATE FUNCTION dbo.fn_TVF_Split(@arr AS NVARCHAR(2000), @sep AS NCHAR(1))
RETURNS TABLE
AS
RETURN
WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
SELECT
(n - 1) - LEN(REPLACE(LEFT(@arr, n-1), @sep, N'')) + 1 AS pos,
SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
FROM Nums
WHERE 
    n <= LEN(@arr) + 1
    AND SUBSTRING(@sep + @arr, n, 1) = @sep
    AND N<=1000
GO

Now that the function is in place, you can do the same join as before.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN dbo.fn_TVF_split(@Ids,',')
    ON t.SomeId = Element

This method is really the same as the one provided before, except it does not take advantage of a permanent numbers table. 

I have shown the three most common and best-performing methods for splitting a delimited string.  Which method do you use?  As you can imagine, the answer depends on the distribution of data, the size of tables, indexes, etc.  One method is not always going to be better than another, so my recommendation is to test each method and choose the one that makes the most sense for your environment.  In part two of this series, I am really going to dig into how each of these methods performs on varying sized tables and strings.  This should give you a better idea of which method to choose based on your data, but as stated before, the results may vary depending on several environmental factors.

Until next time, happy coding.

Wednesday, November 11, 2009

SSRS - Should I Use Embedded TSQL Or A Stored Procedure?

SSRS has become a highly scalable and performant reporting solution for most environments.  SSRS is becoming more and more popular because of its price tag; it is really tough to compete with free.  The only cost consideration that needs to be made is SQL Server licensing.  With the uproar over BI and SharePoint, SSRS is becoming the premier reporting platform.  As database professionals, we need to consider the performance consequences of all the code that executes on our production/reporting databases.  SSRS is a great reporting tool, but if left unchecked it can be quite problematic.  This post will strictly focus on the eternal question: should I use a stored procedure, or should I use embedded SQL?  I will be addressing this question from the DBA perspective, which is often neglected. 

I will be very frank and say that I really do not see any benefits to using embedded TSQL from the DBA perspective.  Embedded TSQL has a lot of cons that should deter any database professional from using it.  So what are some of the problems in using embedded TSQL?  Let’s name a few of the cons.

The Cons Of Embedded TSQL:
  • Harder to manage security
  • Report may break when schema changes
  • Difficult to make changes to embedded TSQL
  • Causes procedure cache to bloat

Note: I did not list the pros to using stored procedures, but the list is the inverse of the Embedded TSQL list. 

As you can see, there are a lot of problems in choosing to use embedded TSQL.  The first con of using embedded TSQL is security.  It is extremely difficult to manage security without the use of stored procedures.  When code logic is encapsulated in a stored procedure, the DBA can easily apply permissions to the stored procedure without elevating permissions on the underlying objects.  If embedded TSQL is used, the person executing the report must have permissions on all objects referenced in the embedded TSQL, which makes maintaining embedded TSQL complicated because you really have no idea what code is actually being executed against your database.  To get an idea of what permissions are needed to execute a report, you have to open the report or run a trace to get the TSQL.
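To put that in concrete terms, with a stored procedure the report’s data source login only needs EXECUTE permission on the procedure itself.  The object names below are made up purely for illustration.

--Hypothetical report procedure and login; only EXECUTE is required
GRANT EXECUTE ON OBJECT::dbo.usp_GetSalesReport TO ReportUser;
 
--With embedded TSQL, the same login would instead need SELECT on every
--table and view the report query touches, for example:
--GRANT SELECT ON dbo.Sales TO ReportUser;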

Embedding TSQL in an SSRS report can also become a problem when the underlying database schema changes.  The SSRS report has no dependencies on the database schema, and any change can break a report.  In this scenario, there may be additional downtime just to figure out and fix the problem.  In most cases, the DBA has no idea that a schema change broke the report; typically the problem does not surface until customers start complaining, which leads to the problem of changing the embedded TSQL.  This is not to say that reports will not break when stored procedures are used, but SSMS offers better dependency checks to determine which objects are dependent, which decreases the likelihood of a report outage. 

One of the biggest problems with embedded TSQL is modifying the TSQL code.  To modify the embedded TSQL, the developer has to download the report RDL and then make the change.  Once the change has been made, the developer has to redeploy the report.  These steps require a lot of time to implement, and additional downtime is incurred.  If a stored procedure is used, the only downtime incurred is the time taken to modify the stored procedure.  Another benefit of a stored procedure is that a developer or DBA can more easily test the report query within the confines of SSMS, instead of having to use BIDS.

The absolute worst aspect of using embedded TSQL is that it can bloat the procedure cache, which can severely degrade server performance.  SSRS tries, and does a good job, at parameterizing most TSQL, but there are certain aspects of SSRS that cause the procedure cache to bloat.  This is where I want to focus most of my attention because this is often the most overlooked aspect of embedded TSQL.  There are two scenarios that I am currently aware of that can directly cause the procedure cache to bloat.  The first scenario occurs when a multi-value parameter is used in conjunction with the IN clause.  Multi-value parameters are treated differently than standard parameters in SSRS.  When a multi-value parameter is used with the IN clause, the SSRS engine submits the query to SQL Server using literal values in the IN clause.  When literal values are used in the IN clause, the query is not considered parameterized, so the optimizer has to create a new query plan unless an exact binary match already exists.  Let’s have a look at this example in action.

First let’s create the sample table and populate the table with data.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF object_id('tempdb.dbo.SSRS_Cache_Bloat') IS NOT NULL
BEGIN
    DROP TABLE dbo.SSRS_Cache_Bloat;
END
GO
 
CREATE TABLE SSRS_Cache_Bloat(
ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
ColA VARCHAR(10),
ColB BIT
);
 
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Adam',0);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Bob',1);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Chad',0);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Dave',1);
GO
 
IF object_id('tempdb.dbo.Lookup_Vals') IS NOT NULL
BEGIN
    DROP TABLE dbo.Lookup_Vals;
END
GO
 
CREATE TABLE dbo.Lookup_Vals(
ColA CHAR(4)
);
 
INSERT INTO dbo.Lookup_Vals VALUES ('Adam');
INSERT INTO dbo.Lookup_Vals VALUES ('Bob');
INSERT INTO dbo.Lookup_Vals VALUES ('Chad');
INSERT INTO dbo.Lookup_Vals VALUES ('Dave');

Next create a new SSRS report. Create two data sets and a parameter, as defined below.

DataSet1 is the main report dataset.

select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)

DataSet2 is the parameter dataset.  You will need to make sure your parameter derives its values from this dataset.  

select colA from dbo.Lookup_Vals

Make sure the parameter is set to use multi-value parameters, as shown below.

[Screenshots: parameter dataset and multi-value parameter settings]

Once you have all the datasets configured, preview the report in SSRS.  When selecting your parameter values, make sure to select more than one value.  Here is a screenshot of my report.

[Screenshot: report preview with multiple parameter values selected]

As you can see the multi-value parameter returned a row for all three parameter values.  Let’s look at the query execution stats to see what SQL actually executed.

SELECT TOP 10
        qs.execution_count,
        qt.text,
        qt.dbid, dbname=db_name(qt.dbid),
        qt.objectid 
FROM sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) as qt
ORDER BY qs.[last_execution_time] DESC

You should see an entry similar to the screenshot below.

[Screenshot: query stats showing literal values in the IN clause]

As you can see, SSRS submitted TSQL with literal values specified in the IN clause.  What do you think will happen if we preview the report again with different parameter values?  If you guessed that we will get a completely new plan, you would be right.

[Screenshot: a second cached plan for the new parameter values]

Can you imagine what happens when you have hundreds or thousands of differing options and hundreds or thousands of users?  The plan cache will take a beating because so many plans with differing values have to be stored.  When so many plans exist in the procedure cache, you have less memory available to store data pages in cache.  Ultimately, nothing good comes of having a bloated procedure cache.  Multi-value parameters are not the only cause of cache bloat.  The next scenario that bloats the procedure cache is using hard-coded values in a parameter list.

Using hard-coded values in a parameter list seems like a harmless gesture, but the reality is that the SSRS engine guesses at the size of the string, which directly impacts whether an existing plan can be reused.  In our SSRS report, change the dataset that is used to query the lookup table to use fixed values, as shown below.

[Screenshots: parameter configured with hard-coded available values]

Note: I added x to some of the values so that the length varies among strings.

Let’s preview the report to see what happens.  I used the value “Adam” for the first execution and “Bob” for the second execution.  You should see entries like the ones below in your procedure cache.

[Screenshot: two cached plans, one per string length]

The primary difference between the two execution plans is the size of the declared variable.  In the case of “Adam” the variable was declared as an nvarchar(4), and for “Bob” an nvarchar(3).  Because the size is different, a new query plan was created. 
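To make that a little more concrete, the two cache entries presumably look something like the statements below; the only difference is the declared length of the parameter, which is enough to miss the cache.  This is an illustration, not a copy/paste from my cache.

/*
--Execution with 'Adam' (4 characters)
(@var nvarchar(4))select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)
 
--Execution with 'Bob' (3 characters)
(@var nvarchar(3))select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)
*/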

These are a couple of scenarios that I am currently aware of that cause the plan cache to behave in this manner.  I am sure there are other quirks that can cause this problem.  So the big question left on the table is… how do I fix this problem?  The answer is to use stored procedures. 

I will start by fixing scenario two.  Create the following procedure in the database.

CREATE PROCEDURE usp_Fix_Scenario2(@var varchar(10))
AS 
BEGIN
    SELECT id,ColA,[colB]
    FROM dbo.SSRS_Cache_Bloat
    WHERE ColA IN(@var)
END
GO

You will see a single entry in the procedure cache for both “Adam” and “Bob”.  As you can see, the execution count is at two, which means the optimizer reused the existing plan.

[Screenshot: single cached plan with an execution count of two]

Now let’s fix scenario one.

CREATE PROCEDURE usp_Fix_Scenario1(@var varchar(100))
AS 
BEGIN
    DECLARE @x XML
    SET @x = '<i>' + REPLACE(@var,',','</i><i>') + '</i>'
 
    SELECT id,ColA,[colB]
    FROM dbo.SSRS_Cache_Bloat
    WHERE ColA IN(
        SELECT x.i.value('.','varchar(10)')
        FROM @x.nodes('/i') x(i)
    )
END
GO

[Screenshot: single cached plan reused across parameter values]
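One quick usability note of my own: you can smoke test the procedure straight from SSMS with a comma-delimited string, and when wiring it up to the SSRS multi-value parameter, the selected values typically get joined into a single string with an expression along the lines of =Join(Parameters!var.Value, ",") — treat the exact expression as an assumption for your report.

--Hypothetical manual test of the procedure with a comma-delimited string
EXEC dbo.usp_Fix_Scenario1 @var = 'Adam,Bob';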

As you can see, using stored procedures is by far a best practice.  Stored procedures allow the greatest security, flexibility, manageability, and performance.  I really cannot see a reason to use embedded TSQL at all, and hopefully at this point you feel the same way.  All in all, we learned a valuable lesson in this post: you cannot always trust that easier is better, even if Microsoft says it is okay.  SSRS is a tool written with the developer in mind, and the DBA perspective is neglected.  If DBAs knew what SSRS is really doing behind the scenes, embedded TSQL would be outlawed.  We as DBAs have to know and expose potential performance problems for all applications, including Microsoft applications.  I hope that my exposing these flaws within SSRS will help you and your environment adhere to better SSRS practices.

Until next time, happy coding.

Wednesday, October 28, 2009

Locking A Table, While It Is Being Loaded, And Minimizing Down Time

I came across an interesting thread in the MVP newsgroups a few weeks back and thought I would share its content here.  The thread was about providing a reliable method to load a table while keeping the downtime to a minimum.  As an added bonus, the users still want to be able to query the old data while you are loading the new data.  I would like to point out that I did not come up with this solution.  I got it from Aaron Bertrand, http://sqlblog.com/blogs/aaron_bertrand/.  His solution is absolutely fantastic, and in my opinion it is by far the best way to attack this problem.

Instinctively, the first solution that comes to most of our minds is an INSERT INTO/SELECT statement with locking hints.  This solution is not a scalable one and does not adhere to our business requirements.  I will start by creating a sample table, and then we can dig into how to solve this problem.

USE [tempdb]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE name = 'TestData')
BEGIN
    DROP TABLE [dbo].[TestData];
END
GO
 
CREATE TABLE [dbo].[TestData](
RowNum INT PRIMARY KEY,
SomeCode CHAR(2)
);
GO
 
INSERT INTO [dbo].[TestData] (RowNum,SomeCode)
SELECT TOP 10000
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    CHAR(ABS(CHECKSUM(NEWID()))%26+65)
    + CHAR(ABS(CHECKSUM(NEWID()))%26+65) AS SomeCode
FROM 
    [Master].[dbo].[SysColumns] t1,
    [Master].[dbo].[SysColumns] t2
GO

Okay, with our table out of the way, we can start to really think about how to solve this problem.  The first step is to create two new schemas.  The first schema is called “Holder,” and it will act as a holder, or container, for the table that we will be loading with new data.

--create Holder schema
CREATE SCHEMA [Holder];
GO

The Holder schema does just what the name implies… it holds the table that I will be inserting into.  The next step is to create a table in the Holder schema that matches the definition of the table above. 

--Create TestData table in the Holder schema
CREATE TABLE [Holder].[TestData](
RowNum INT PRIMARY KEY,
SomeCode CHAR(2)
);
GO

With our schema and table created, I only have one other schema to create: the Switch schema.  The Switch schema is used as an intermediary to house the current source table (dbo.TestData) while the newly loaded table in the Holder schema (Holder.TestData) is transferred to the dbo schema (dbo.TestData). 

--Create Switch schema
CREATE SCHEMA [Switch];
GO

This solution adheres to all of our business rules and reduces downtime to the amount of time required to perform a schema metadata operation, which is nearly instantaneous.  This is a very scalable solution because the loading of the data is completely transparent to the users, all the while allowing them to query the stale data.  Let’s have a look at the final solution:

--Create procedure to load the table
CREATE PROCEDURE [dbo].[usp_LoadTestData]
AS
BEGIN
 
SET NOCOUNT ON;
 
--Truncate holder table
TRUNCATE TABLE [Holder].[TestData];
 
--load new data into holder table
INSERT INTO [Holder].[TestData] (RowNum,SomeCode)
SELECT TOP 500
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    CHAR(ABS(CHECKSUM(NEWID()))%26+65)
    + CHAR(ABS(CHECKSUM(NEWID()))%26+65) AS SomeCode
FROM 
    [Master].[dbo].[SysColumns] t1,
    [Master].[dbo].[SysColumns] t2
 
BEGIN TRANSACTION
    -- move "live" table into the switch schema
    ALTER SCHEMA [Switch] TRANSFER [dbo].[TestData];
 
    -- move holder populated table into the "live" or dbo schema
    ALTER SCHEMA [dbo] TRANSFER [Holder].[TestData];
 
    -- Move the prior table to the holder schema
    ALTER SCHEMA [Holder] TRANSFER [Switch].[TestData];
COMMIT TRANSACTION
 
END
GO

Let’s see how it works.  Execute the code below and then query the dbo.TestData table.  You will see that the table now contains 500 rows instead of the 10,000 I started with. 

EXEC dbo.[usp_LoadTestData];
SELECT * FROM dbo.TestData;

Note: If you are interested in seeing how the solution handles locks, you can strip all the code out of the stored procedure and run everything but the COMMIT TRANSACTION.  You can then open another window and try to query the table; the query will wait until the schema transfer is committed. 
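Here is a minimal sketch of that test, assuming you run the two halves from separate connections:

--Connection 1: run the schema transfers, but hold the transaction open
BEGIN TRANSACTION
    ALTER SCHEMA [Switch] TRANSFER [dbo].[TestData];
    ALTER SCHEMA [dbo] TRANSFER [Holder].[TestData];
    ALTER SCHEMA [Holder] TRANSFER [Switch].[TestData];
 
--Connection 2: this query will wait on a schema lock until connection 1 commits
SELECT COUNT(*) FROM [dbo].[TestData];
 
--Connection 1: release the lock
COMMIT TRANSACTION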

That’s it!  This is by far the best method I have seen to solve this business problem.  It is very scalable, has very little downtime, and is not affected by the NOLOCK hint.  I am really happy and thankful that Aaron shared his solution.  It has given me a lot of insight into solving this problem and similar problems.  Hopefully this post will have the same effect on you.

Until next time, happy coding.

Friday, October 23, 2009

Converting A Scalar User Defined Function To An Inline Table Valued Function

In my last post, http://jahaines.blogspot.com/2009/10/io-stats-what-are-you-missing.html, I talked about the performance problems associated with scalar user defined functions and how SSMS may report invalid IO statistics for scalar UDFs.  In this post, I will be focusing on how to transform those pesky scalar UDFs into more scalable functions.  When developing user defined functions, you have to keep a few things in mind.  First, scalar UDFs are evaluated for each row returned by the query.  Additionally, SQL Server is not able to maintain statistics on and optimize any function, except an inline table valued function.  Lastly, most code logic does not necessarily need to be encapsulated in a function.  You may get better performance if you choose to use a derived table instead of a function; however, the biggest problem with a derived table is that it cannot be encapsulated and reused across an application.  An inline TVF is really useful if you need to encapsulate business logic and reuse it throughout an application.  Let’s start by creating the sample table DDL.

USE [tempdb]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData')
BEGIN
    DROP TABLE dbo.[TestData];
END
GO
 
--Create Sample Table
CREATE TABLE dbo.TestData(
RowNum INT PRIMARY KEY,
SomeId INT,
SomeDate DATETIME
);
GO
 
INSERT INTO dbo.TestData 
SELECT TOP 1000 
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID()))%250 AS SomeId, 
    DATEADD(DAY,ABS(CHECKSUM(NEWID()))%1000,'2008-01-01') AS SomeDate
FROM 
    Master.dbo.SysColumns t1
GO

Next, I am going to create two functions.  One will be a scalar UDF and the other will be an inline table valued function.  If you do not know what an inline TVF is, it is like a parameterized view and is subject to the same restrictions as a view.  For more information you can read the following topic in BOL, http://msdn.microsoft.com/en-us/library/ms189294.aspx.

--Create Scalar Function
CREATE FUNCTION dbo.fn_SomeFunction(@SomeId INT)
RETURNS DATETIME
AS
BEGIN
    DECLARE @dt DATETIME
    SELECT @dt = MAX(SomeDate)
    FROM dbo.TestData
    WHERE SomeId = @SomeId
    
    RETURN @dt
END
GO
 
--Create Inline Table Valued Function
CREATE FUNCTION dbo.fn_SomeInlineFunction()
RETURNS TABLE
RETURN(
    SELECT SomeId, MAX(SomeDate) AS SomeDate
    FROM dbo.TestData
    GROUP BY SomeId
)
GO
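As an aside, the derived table alternative I mentioned at the top of this post would look roughly like this for the same data.  This is my own sketch, shown only for comparison; it cannot be reused across the application the way the inline TVF can.

--Derived table equivalent of the inline TVF logic
SELECT td.SomeId, max_dt.SomeDate
FROM dbo.TestData AS td
INNER JOIN(
    SELECT SomeId, MAX(SomeDate) AS SomeDate
    FROM dbo.TestData
    GROUP BY SomeId
) AS max_dt
    ON max_dt.SomeId = td.SomeId
GO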

All of our DDL is in place; all that is left is to test the performance.  If you read my last post, you should be expecting the inline TVF to outperform the scalar UDF.  Let’s see what actually transpires.  Discard the query results by clicking Tools –> Options –> Query Results –> SQL Server –> Results To Grid –> Discard results after execution.  Next, open SQL Server Profiler and use the standard template.  Run the code below to capture the performance counters in Profiler.

SELECT SomeId,dbo.fn_SomeFunction(SomeId)
FROM dbo.[TestData]
GO
 
SELECT TestData.SomeId,max_dt.SomeDate
FROM dbo.[TestData]
INNER JOIN dbo.fn_SomeInlineFunction() AS max_dt
    ON max_dt.SomeId = TestData.SomeId
GO

The results of the queries should look similar to my results below.  The things of note are the reads and the CPU required to satisfy each query.  The number of reads and the CPU required to satisfy the scalar UDF are astronomically greater than for the inline TVF. 

[Screenshot: Profiler results – scalar UDF vs. inline TVF]

If the above screenshot is not enough to discourage you from using scalar UDFs, I do not know what is.  The point is that there are all kinds of great alternatives for encapsulating code logic without the use of scalar functions.  Inline TVFs offer a set-based approach for encapsulating business logic; plus, the optimizer is able to use existing statistics and indexes to optimize inline TVFs.  It is my recommendation that you try to convert all scalar UDFs to inline TVFs.  I know this is not always possible, but it is a good start.  I typically try to stay away from scalar and multi-statement UDFs unless absolutely necessary.  I hope that you have learned something new and that you can use this example to get the needed sign-off to change those problematic scalar UDFs into inline TVFs.

Until next time, happy coding.