Sunday, November 15, 2009

Splitting A Delimited String (Part 1)

In a previous post, http://jahaines.blogspot.com/2009/06/converting-delimited-string-of-values.html, I demonstrated how to split a delimited string with a known, finite number of elements, using XML.  That method can only be used when the number of elements is known.  In this post, I will focus on the methods most commonly used today to parse an array when the number of elements is unknown.  This is part one of a two-part series in which I look at the different methods used to split an array of strings, using SQL Server 2005 and 2008.  There are three primary methods to parse an array.  The first method takes advantage of a numbers table to quickly parse the string.  The second method uses the XML functionality built into SQL Server 2005.  The final method uses a TVF that does not need a permanent numbers table.

Let’s get started by creating our sample table.

SET NOCOUNT ON
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 't')
BEGIN
    DROP TABLE dbo.[t];
END
GO
 
CREATE TABLE dbo.t(
SomeId INT NOT NULL PRIMARY KEY,
SomeCode CHAR(2)
);
GO
 
;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT dbo.t (SomeId,SomeCode)
SELECT 
    N,
    CHAR(ABS(CHECKSUM(NEWID()))%26+65)
    + CHAR(ABS(CHECKSUM(NEWID()))%26+65) AS SomeCode    
FROM Nums
WHERE N<=10000;
GO
 
CREATE NONCLUSTERED INDEX ncl_idx_SomeCode ON dbo.t(SomeCode);
GO

The first method I will demonstrate is the numbers table method.  This is probably the most efficient of the methods listed, although performance does vary among environments.  Another great benefit of this method is that it works with SQL Server 2000 and later.

The first step in using this method is to create a table of numbers.  A numbers table is just what it sounds like… a table of natural numbers starting at 1 and going to n, where n is the maximum number you want in the table.  This method performs well largely because of the clustered index on the number column, which allows for very fast index seeks.  Here is the code I use to generate my numbers table.

--=============================================================================
--      Setup
--=============================================================================
USE [tempdb]
GO
 
SET NOCOUNT ON 
GO
 
--=============================================================================
--      Create and populate a Numbers table
--=============================================================================
--===== Conditionally drop 
IF OBJECT_ID('dbo.Numbers') IS NOT NULL 
BEGIN
    DROP TABLE dbo.Numbers;
END
GO
 
CREATE TABLE dbo.[Numbers](
N INT NOT NULL
);
GO
 
;WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
INSERT Numbers
SELECT N FROM Nums
WHERE N<=10000;
GO
 
ALTER TABLE dbo.Numbers ADD CONSTRAINT PK_N
PRIMARY KEY CLUSTERED ([N])WITH(FILLFACTOR = 100);
GO

Next, we need to create an inline TVF (table-valued function) to split the array.  This function was written by SQL Server guru Itzik Ben-Gan.  It is very fast and very scalable.

IF OBJECT_ID('dbo.fn_split') IS NOT NULL
DROP FUNCTION dbo.fn_split;
GO
CREATE FUNCTION dbo.fn_split(@arr AS NVARCHAR(2000), @sep AS NCHAR(1))
RETURNS TABLE
AS
RETURN
SELECT
(n - 1) - LEN(REPLACE(LEFT(@arr, n-1), @sep, N'')) + 1 AS pos,
SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
FROM dbo.Numbers
WHERE n <= LEN(@arr) + 1
AND SUBSTRING(@sep + @arr, n, 1) = @sep;
GO

This function’s logic is pretty straightforward, but I will walk through how it works.  The numbers table is used to step through each character position in the array.  The first step is to prepend the delimiter to the string, so that every element, including the first, is preceded by a delimiter.  With all the delimiters in place, the code can determine the position of each delimiter, and then uses the CHARINDEX() and SUBSTRING() system functions to extract each element along with its corresponding position.
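For example, calling the function directly against a short delimited string returns each element with its ordinal position.  A quick sketch, with the expected output shown in the comment:

--Expected output: (pos 1, element '10'), (pos 2, element '20'), (pos 3, element '30')
SELECT pos, element
FROM dbo.fn_split(N'10,20,30', N',');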

Now that we know how the code works, let’s see it in action.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN dbo.fn_split(@Ids,',') AS fn
    ON t.SomeId = fn.Element
 
/*
SomeId      SomeCode
----------- --------
1           SQ
500         BO
5439        ZV
9999        RD
7453        IG
*/

Before we move on to the next method, I would like to point out that some developers love to use EXISTS in this situation, especially when no columns from the TVF are needed; however, EXISTS may degrade performance.  I am planning an in-depth post on the differences between INNER JOIN and EXISTS.  To give you an idea of how EXISTS can degrade performance, have a look at the example and statistics below.  Please note that performance is not always black and white, and not all execution plans will deviate as much as this one.
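For reference, here is a sketch of the EXISTS form of the query that I compared against the INNER JOIN version:

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
--EXISTS variant; no columns from the TVF are returned
SELECT t.*
FROM dbo.t
WHERE EXISTS(
    SELECT 1
    FROM dbo.fn_split(@Ids,',') AS fn
    WHERE fn.element = t.SomeId
);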

Execution Plan:

[screenshot: execution plan comparison of the INNER JOIN and EXISTS queries]

IO Stats:

********************* inner join *************************
 
Table 't'. Scan count 0, logical reads 14, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
********************* exists *************************
 
Table 'Worktable'. Scan count 2, logical reads 20018, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 2, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't'. Scan count 3, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, the difference between the INNER JOIN and EXISTS query plans is night and day.  I will post on this in more depth at a later date, but I wanted to make you aware of the possible performance problems.

The next method I will be discussing is the XML nodes method.  This method does not require an ancillary table, but it does require SQL Server 2005 or greater.  It also handles XML special characters.  I picked up this encoding/decoding technique from SQL Server enthusiast Brad Schulz, http://bradsruminations.blogspot.com/.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN(
    SELECT x.i.value('.','INT') AS SomeId
    FROM(SELECT XMLEncoded=(SELECT @Ids AS [*] FOR XML PATH(''))) AS EncodeXML
    CROSS APPLY (SELECT NewXML=CAST('<i>'+REPLACE(XMLEncoded,',','</i><i>')+'</i>' AS XML)) CastXML
    CROSS APPLY NewXML.nodes('/i') x(i)
) AS Ids
    ON Ids.SomeId = T.SomeId

The XML method looks a lot more complex than it really is.  Essentially, I am creating an XML structure that contains each element of the array.  For example, the array “1,500,5439,9999,7453” becomes “<i>1</i><i>500</i><i>5439</i><i>9999</i><i>7453</i>”.  The first step is to encode the array using FOR XML PATH.  Once the array is in an encoded string format, I explicitly cast the string to the XML data type.  With the data in XML form, I use the nodes() method to shred the XML values into a relational format.  For more information about how this method works, you can read the following blog post by Brad Schulz, http://bradsruminations.blogspot.com/2009/10/un-making-list-or-shredding-of-evidence.html, which does a great job of breaking down the inner workings of this method.
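If you want to convince yourself that the special-character handling works, here is a minimal sketch that runs the same encode/cast/shred logic against a string containing & and <:

DECLARE @Arr VARCHAR(1000)
SET @Arr = 'A&B,C<D'
 
SELECT x.i.value('.','VARCHAR(10)') AS element
FROM(SELECT XMLEncoded=(SELECT @Arr AS [*] FOR XML PATH(''))) AS EncodeXML
CROSS APPLY (SELECT NewXML=CAST('<i>'+REPLACE(XMLEncoded,',','</i><i>')+'</i>' AS XML)) CastXML
CROSS APPLY NewXML.nodes('/i') x(i)
--Returns the elements A&B and C<D, decoded back to their original form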

The final method I will be demonstrating uses a TVF with a virtual numbers table.  This method does not require an ancillary table because the numbers are generated on the fly.

IF OBJECT_ID('dbo.fn_TVF_Split') IS NOT NULL
DROP FUNCTION dbo.fn_TVF_Split;
GO
 
CREATE FUNCTION dbo.fn_TVF_Split(@arr AS NVARCHAR(2000), @sep AS NCHAR(1))
RETURNS TABLE
AS
RETURN
WITH 
   L0 AS (SELECT 1 AS C UNION ALL SELECT 1)       --2 rows
  ,L1 AS (SELECT 1 AS C FROM L0 AS A, L0 AS B)    --4 rows (2x2)
  ,L2 AS (SELECT 1 AS C FROM L1 AS A, L1 AS B)    --16 rows (4x4)
  ,L3 AS (SELECT 1 AS C FROM L2 AS A, L2 AS B)    --256 rows (16x16)
  ,L4 AS (SELECT 1 AS C FROM L3 AS A, L3 AS B)    --65536 rows (256x256)
  ,L5 AS (SELECT 1 AS C FROM L4 AS A, L4 AS B)    --4,294,967,296 rows (65536x65536)
  ,Nums AS (SELECT row_number() OVER (ORDER BY (SELECT 0)) AS N FROM L5)  
SELECT
(n - 1) - LEN(REPLACE(LEFT(@arr, n-1), @sep, N'')) + 1 AS pos,
SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
FROM Nums
WHERE 
    n <= LEN(@arr) + 1
    AND SUBSTRING(@sep + @arr, n, 1) = @sep
    AND N<=1000
GO

Now that the function is in place, you can do the same join as before.

DECLARE @Ids VARCHAR(1000)
SET @Ids = '1,500,5439,9999,7453'
 
SELECT t.*
FROM dbo.t
INNER JOIN dbo.fn_TVF_split(@Ids,',')
    ON t.SomeId = Element

This is really the same method as before, except it does not rely on a permanent numbers table.  Note that the WHERE clause caps N at 1000, so the function as written only handles strings up to 1000 characters; raise that limit if you need to split longer strings.

I have shown the three most common and best performing methods for splitting a delimited string.  Which method should you use?  As you can imagine, the answer depends on the distribution of data, the size of your tables, indexes, etc.  One method is not always going to be better than another, so my recommendation is to test each method and choose the one that makes the most sense for your environment.  In part two of this series, I will dig into how each of these methods performs on varying table and string sizes.  That should give you a better idea of which method to choose based on your data, but as stated before, results may vary depending on several environmental factors.

Until next time, happy coding.

Wednesday, November 11, 2009

SSRS - Should I Use Embedded TSQL Or A Stored Procedure?

SSRS has become a highly scalable and performant reporting solution for most environments.  SSRS is becoming more and more popular because of its price tag; it is really tough to compete with free.  The only cost consideration that needs to be made is SQL Server licensing.  With the uproar around BI and SharePoint, SSRS is becoming the premier reporting platform.  As database professionals, we need to consider the performance consequences of all the code that executes on our production/reporting databases.  SSRS is a great reporting tool, but left unchecked it can be quite problematic.  This post will focus strictly on the eternal question… should I use a stored procedure, or should I use embedded SQL?  I will be addressing this question from the DBA perspective, which is often neglected.

I will be very frank and say that I really do not see any benefit to using embedded TSQL, from the DBA perspective.  Embedded TSQL has a lot of cons that should deter any database professional from using it.  So what are some of the problems with embedded TSQL?  Let’s name a few of the cons.

The Cons Of Embedded TSQL:
  • Harder to manage security
  • Report may break when schema changes
  • Difficult to make changes to embedded TSQL
  • Causes procedure cache to bloat

Note: I did not list the pros of using stored procedures, but that list is essentially the inverse of the embedded TSQL list.

As you can see, there are a lot of problems with choosing embedded TSQL.  The first con is security.  It is extremely difficult to manage security without the use of stored procedures.  When code logic is encapsulated in a stored procedure, the DBA can easily grant permission on the stored procedure without elevating permissions on the underlying objects.  If embedded TSQL is used, the person executing the report must have permissions on every object referenced in the embedded TSQL, which makes maintaining security complicated because you really have no idea what code is actually being executed against your database.  To get an idea of what permissions are needed to execute a report, you have to open the report or run a trace to capture the TSQL.
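Here is a minimal sketch of the difference, using hypothetical object and role names:

--Hypothetical role and objects, for illustration only
CREATE ROLE ReportUsers;
GO
 
--Stored procedure approach: a single EXECUTE grant is enough
GRANT EXECUTE ON dbo.usp_SomeReport TO ReportUsers;
 
--Embedded TSQL approach: SELECT is needed on every referenced object
GRANT SELECT ON dbo.SomeTable1 TO ReportUsers;
GRANT SELECT ON dbo.SomeTable2 TO ReportUsers;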

Embedded TSQL in an SSRS report can also become a problem when the underlying database schema changes.  The report has no dependency tracking against the database schema, so any change can break it.  In this scenario, there may be additional downtime just to figure out and fix the problem.  In most cases, the DBA has no idea that a schema change broke the report; typically the problem does not surface until customers start complaining, which leads to the problem of changing the embedded TSQL.  This is not to say that reports will never break when stored procedures are used, but SSMS offers better dependency checks to determine which objects are dependent, which decreases the likelihood of a report outage.
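With stored procedures, you can also query the dependency metadata directly before making a schema change.  On SQL Server 2008, the sketch below lists the modules that reference a given table (the table name is hypothetical; on 2005, sp_depends or sys.sql_dependencies gives a rougher picture):

--Which procedures, views, or functions reference this table?
SELECT referencing_schema_name, referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.SomeTable1', 'OBJECT');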

One of the biggest problems with embedded TSQL is modifying the TSQL code.  To modify embedded TSQL, the developer has to download the report RDL, make the change, and then redeploy the report.  These steps take time to implement, and additional downtime is incurred.  If a stored procedure is used, the only downtime incurred is the time taken to modify the stored procedure.  Another benefit of a stored procedure is that a developer or DBA can more easily test the report query within the confines of SSMS, instead of having to use BIDS.
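By contrast, fixing a stored procedure is a single ALTER statement and the report itself never has to be redeployed.  A sketch, with hypothetical object names:

--Hypothetical report procedure; only the procedure changes, not the RDL
ALTER PROCEDURE dbo.usp_SomeReport
    @StartDate DATETIME
AS 
BEGIN
    SELECT ColA, ColB
    FROM dbo.SomeTable1
    WHERE CreatedDate >= @StartDate;
END
GO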

The absolute worst aspect of using embedded TSQL is that it can bloat the procedure cache, which can severely degrade server performance.  SSRS does a good job of parameterizing most TSQL, but there are certain aspects of SSRS that cause the procedure cache to bloat.  This is where I want to focus most of my attention, because it is often the most overlooked aspect of embedded TSQL.  There are two scenarios that I am currently aware of that directly cause the procedure cache to bloat.  The first scenario occurs when a multi-value parameter is used in conjunction with the IN clause.  Multi-value parameters are treated differently than standard parameters in SSRS.  When a multi-value parameter is used with the IN clause, the SSRS engine submits the query to SQL Server using literal values in the IN clause.  When literal values are used in the IN clause, the query is not considered parameterized, so the optimizer has to create a new query plan unless an exact binary match already exists.  Let’s have a look to see this example in action.

First let’s create the sample table and populate the table with data.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF object_id('tempdb.dbo.SSRS_Cache_Bloat') IS NOT NULL
BEGIN
    DROP TABLE dbo.SSRS_Cache_Bloat;
END
GO
 
CREATE TABLE SSRS_Cache_Bloat(
ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
ColA VARCHAR(10),
ColB BIT
);
 
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Adam',0);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Bob',1);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Chad',0);
INSERT INTO dbo.SSRS_Cache_Bloat VALUES ('Dave',1);
GO
 
IF object_id('tempdb.dbo.Lookup_Vals') IS NOT NULL
BEGIN
    DROP TABLE dbo.Lookup_Vals;
END
GO
 
CREATE TABLE dbo.Lookup_Vals(
ColA CHAR(4)
);
 
INSERT INTO dbo.Lookup_Vals VALUES ('Adam');
INSERT INTO dbo.Lookup_Vals VALUES ('Bob');
INSERT INTO dbo.Lookup_Vals VALUES ('Chad');
INSERT INTO dbo.Lookup_Vals VALUES ('Dave');

Next create a new SSRS report. Create two data sets and a parameter, as defined below.

DataSet1 is the main report dataset.

select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)

DataSet2 is the parameter dataset.  You will need to make sure your parameter derives its values from this dataset.  

select colA from dbo.Lookup_Vals

Make sure the parameter is set to use multi-value parameters, as shown below.

[screenshots: parameter properties configured to allow multiple values]

Once you have all the datasets configured, preview the report in SSRS.  When selecting your parameter values, make sure to select more than one value.  Here is a screenshot of my report.

[screenshot: report preview showing the selected parameter values]

As you can see, the multi-value parameter returned a row for each of the three selected parameter values.  Let’s look at the query execution stats to see what SQL was actually executed.

SELECT TOP 10
        qs.execution_count,
        qt.text,
        qt.dbid, dbname=db_name(qt.dbid),
        qt.objectid 
FROM sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) as qt
ORDER BY qs.[last_execution_time] DESC

You should see an entry similar to the screenshot below.

[screenshot: sys.dm_exec_query_stats output showing the submitted query text]
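The cached statement will look roughly like the following (illustrative only; the literals reflect whichever parameter values were selected):

select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in (N'Adam', N'Bob', N'Chad')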

As you can see, SSRS submitted the TSQL with literal values specified in the IN clause.  What do you think will happen if we preview the report again with different parameter values?  If you guessed that we will get a completely new plan, you would be right.

[screenshot: a second cached plan created for the new parameter values]

Can you imagine what happens when you have hundreds or thousands of differing parameter combinations and hundreds or thousands of users?  The plan cache takes a beating because so many plans with differing values have to be stored.  When that many plans exist in the procedure cache, you have less memory available to cache data pages.  Ultimately, nothing good comes from a bloated procedure cache.  Multi-value parameters are not the only cause of cache bloat.  The next scenario that bloats the procedure cache is using hard-coded values in a parameter list.

Using hard-coded values in a parameter list seems like a harmless gesture, but the reality is that the SSRS engine guesses at the size of the string, which directly impacts whether an existing plan can be reused.  In our SSRS report, change the dataset that is used to query the lookup table to use fixed values, as shown below.

[screenshots: parameter available values changed to hard-coded entries]

Note: I added x to some of the values so that the length varies among strings.

Let’s preview the report to see what happens.  I used the value “Adam” for the first execution and “Bob” for the second execution.  You should see entries like the ones below in your procedure cache.

[screenshot: two cached plans, one per parameter value]

The primary difference between the two execution plans is the size of the declared variable.  In the case of “Adam” the variable was declared as an nvarchar(4), and for “Bob” an nvarchar(3).  Because the sizes differ, a new query plan was created.
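Roughly speaking, the two cache entries differ only in the declared length of the parameter (illustrative; the exact cached text may vary slightly):

--Two separate cache entries for what is logically the same query
(@var nvarchar(4))select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)
(@var nvarchar(3))select Id, ColA, ColB from SSRS_Cache_Bloat where ColA in(@var)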

These are the two scenarios I am currently aware of that cause the plan cache to behave in this manner.  I am sure there are other quirks that can cause this problem.  So the big question left on the table is… how do I fix this problem?  The answer is to use stored procedures.

I will start by fixing scenario two.  Create the following procedure in the database.

CREATE PROCEDURE usp_Fix_Scenario2(@var varchar(10))
AS 
BEGIN
    SELECT id,ColA,[colB]
    FROM dbo.SSRS_Cache_Bloat
    WHERE ColA IN(@var)
END
GO
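To see the plan reuse, execute the procedure with both values; whether the calls come from SSRS or directly from SSMS, the parameterized plan is shared.

EXEC dbo.usp_Fix_Scenario2 @var = 'Adam';
EXEC dbo.usp_Fix_Scenario2 @var = 'Bob';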

You will see a single entry in the procedure cache for both “Adam” and “Bob”.  As you can see, the execution count is two, which means the optimizer reused the existing plan.

[screenshot: one cached plan with an execution count of two]

Now let’s fix scenario one.

CREATE PROCEDURE usp_Fix_Scenario1(@var varchar(100))
AS 
BEGIN
    DECLARE @x XML
    SET @x = '<i>' + REPLACE(@var,',','</i><i>') + '</i>'
 
    SELECT id,ColA,[colB]
    FROM dbo.SSRS_Cache_Bloat
    WHERE ColA IN(
        SELECT x.i.value('.','varchar(10)')
        FROM @x.nodes('/i') x(i)
    )
END
GO
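When wiring this procedure up to the report, the multi-value parameter typically has to be passed as a single comma-delimited string, for example with a Join(Parameters!var.Value, ",") expression on the dataset parameter (the parameter name is an assumption; adjust it to match your report).  Called directly, the procedure looks like this:

--SSRS would supply the delimited string built from the multi-value parameter
EXEC dbo.usp_Fix_Scenario1 @var = 'Adam,Bob';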

[screenshot: a single cached plan for the stored procedure]

As you can see, using stored procedures is by far the better practice.  Stored procedures allow the greatest security, flexibility, manageability, and performance.  I really cannot see a reason to use embedded TSQL at all, and hopefully at this point you feel the same way.  All in all, we learned a valuable lesson in this post: you cannot always trust that easier is better, even if Microsoft says it is okay.  SSRS is a tool written with the developer in mind, and the DBA perspective is neglected.  If DBAs knew what SSRS was really doing behind the scenes, embedded TSQL would be outlawed.  We as DBAs have to know about and expose potential performance problems for all applications, including Microsoft applications.  I hope that exposing these flaws within SSRS will help you and your environment adhere to better SSRS practices.

Until next time, happy coding.