Wednesday, September 30, 2009

SSIS – Performing An UPSERT

For those of you who may be wondering what the heck an UPSERT is, an UPSERT is a combined UPDATE and INSERT against a given table: existing rows are updated and new rows are inserted. Typically, an UPSERT operation requires two separate statements; usually the UPDATE is issued first and the INSERT is issued shortly after. In this post I will be extending the package I created in my previous post, http://jahaines.blogspot.com/2009/09/ssis-only-inserting-rows-that-do-not.html. Once your SSIS package resembles the package in the link, you can proceed with the next step. If you want to skip the prior post, you can download the package at the bottom of this post.
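To make the pattern concrete before moving into SSIS, here is a minimal T-SQL sketch of the classic two-statement UPSERT against the dbo.SSIS_Import table from the prior post. The staging table dbo.SSIS_Staging is hypothetical, standing in for wherever the incoming rows land:

--Sketch only; dbo.SSIS_Staging is an assumed staging table holding the incoming rows
--Step 1: update the rows that already exist
UPDATE import
SET import.SomeInt = stage.SomeInt,
    import.SomeChar = stage.SomeChar
FROM dbo.SSIS_Import import
INNER JOIN dbo.SSIS_Staging stage
    ON stage.SomeId = import.SomeId;
 
--Step 2: insert the rows that do not exist yet
INSERT INTO dbo.SSIS_Import (SomeId, SomeInt, SomeChar)
SELECT stage.SomeId, stage.SomeInt, stage.SomeChar
FROM dbo.SSIS_Staging stage
WHERE NOT EXISTS (SELECT 1 FROM dbo.SSIS_Import import
                  WHERE import.SomeId = stage.SomeId);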

The first item of interest is the Lookup transformation. I will drag an OLE DB Command onto the canvas and connect it to the Lookup transformation. A dialog box should appear; when it does, choose the Lookup Match Output in the Output dropdown list. An OLE DB Command is a parameterized TSQL statement that can be issued against a given data set. Before I start configuring the transformation, I would like to point out that this method is extremely easy to set up, but has a major drawback: the OLE DB Command has to be issued once for each row supplied by the input, so this type of process works iteratively, row by row. I will demonstrate a more set-based method later in this post. With that out of the way, I will start configuring the transformation. First, I will build an UPDATE command using parameters for each column and the predicate.

My canvas currently looks like the screenshot below.

[screenshot]

The next step is to configure the OLE DB Command transformation. Double-click the OLE DB Command transformation. In the Connection Manager dropdown list, choose the database where you created dbo.SSIS_Import. Click the “Component Properties” tab. Within the “Component Properties” tab, click the ellipsis next to the “SQL Command” property. Paste the code presented below into the box and click OK.

UPDATE dbo.SSIS_Import
SET 
     SomeInt = ?,
     SomeChar = ?
WHERE SomeId = ?

The code above is a pretty simplistic UPDATE statement. The key thing to note is the “?” placeholders. Each “?” will be given a parameter value in SSIS. The parameters are named automatically (Param_0, Param_1, and so on) in the order they appear in the command. Next, I will map these parameters to my input columns. Click the “Column Mappings” tab and align the input columns to each parameter. My column mapping is shown below. Once complete, click “Ok.”

[screenshot]

That's it!!! We have successfully created an UPSERT operation using SSIS… but wait, what is the catch? The catch is that the UPDATE behaves like a cursor, processing one UPDATE command at a time, which can be a nightmare from a performance standpoint. You may be wondering how we can make this process set based. There is not much more you can do in SSIS alone, so I will need to find other means. The best way to make this process more scalable is to leverage SQL Server and SSIS together.

The first step is to drop the OLE DB Command I just created. I will then drag an OLE DB Destination to the canvas and connect it to the Lookup Match Output. My canvas now looks like the screenshot below.

[screenshot]

Next, I will switch gears and write some TSQL code. I will need to open SSMS and connect to the database where dbo.SSIS_Import exists. First, I have to create a view. I will not use the view to select data; instead, the view will serve as an intermediate object to insert data through.

CREATE VIEW dbo.vw_SSIS_Import
AS
SELECT [SomeId],[SomeInt],[SomeChar]
FROM dbo.[SSIS_Import]
GO

Now that my view is created, I am going to create an INSTEAD OF INSERT trigger on the view. The INSTEAD OF trigger allows me to use my OLE DB Destination to bulk insert from SSIS. The bulk insert from SSIS gives me a mechanism to pass the rows for UPDATE as a single set of data. Once I have the rows in a set, the trigger can efficiently UPDATE the data.

CREATE TRIGGER trg_Update_SSIS_Import
ON dbo.vw_SSIS_Import
INSTEAD OF INSERT
AS 
BEGIN
 
    UPDATE import
    SET 
        SomeInt = i.SomeInt,
        SomeChar = i.SomeChar
    FROM dbo.SSIS_Import import
    INNER JOIN inserted i
        ON i.SomeId = import.SomeId
 
END
GO

In the code above, I am using set-based TSQL to update all rows that are inserted into the view from the SSIS package. I am now ready to configure the OLE DB Destination. Double-click the OLE DB Destination to launch the configuration dialog box. Make sure your package is using the right connection manager and choose the view. Click the “Column Mappings” tab and then click “Ok.” Right-click the OLE DB Destination and choose properties. In the “FastLoadOptions” property, type “FIRE_TRIGGERS.” This allows the SSIS insert to fire the INSTEAD OF trigger on the view. That’s it, we are done!!! The view/INSTEAD OF trigger method is extremely easy to configure and implement; however, it does require additional database objects to be created.

There you have it. I have demonstrated two methods to UPSERT data using SSIS. The first method relies on the OLE DB Command, which has the limitation of having to run for each and every row. The latter option is a more scalable solution that requires the creation of a view and a trigger. Both methods have pros and cons, and I leave it to you to determine which is best for your environment.

Until next time, Happy Coding.

PS: If you want to do all of the work in SQL, you can allow the trigger to perform both the UPDATE and the INSERT. This method reduces the complexity of the SSIS package because you only need a single source and destination, with no Lookup, as sketched below.
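Here is a hedged sketch of what that combined trigger might look like, reusing the view and table from this post (the trigger name is my own; adjust to taste):

--Sketch: one INSTEAD OF INSERT trigger that performs the full UPSERT
CREATE TRIGGER trg_Upsert_SSIS_Import
ON dbo.vw_SSIS_Import
INSTEAD OF INSERT
AS
BEGIN
    --Update the rows that already exist
    UPDATE import
    SET SomeInt = i.SomeInt,
        SomeChar = i.SomeChar
    FROM dbo.SSIS_Import import
    INNER JOIN inserted i
        ON i.SomeId = import.SomeId;
 
    --Insert the rows that do not exist
    INSERT INTO dbo.SSIS_Import (SomeId, SomeInt, SomeChar)
    SELECT i.SomeId, i.SomeInt, i.SomeChar
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.SSIS_Import import
                      WHERE import.SomeId = i.SomeId);
END
GO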

Download SSIS Package:

Note: The package was created in BIDS 2008, which is not compatible with BIDS 2005.

Sunday, September 27, 2009

SSIS – Only Inserting Rows That Do Not Exist

I have seen an overwhelming trend suggesting that today’s TSQL/BI developers are very interested in using Microsoft’s BI (Business Intelligence) product, SQL Server Business Intelligence Development Studio, aka BIDS.  For those of you who may not know, BIDS is installed when SQL Server is installed.  BIDS is nothing more than a Visual Studio add-in that uses the .Net Framework.  BIDS allows you to develop Analysis Services (cube design and management), Reporting Services, and Integration Services projects.  The BI platform that I will be focusing on today is SSIS (SQL Server Integration Services).  SSIS is used for ETL, which is an acronym for Extract-Transform-Load.  Essentially, SSIS is used to migrate data from many different sources, including MS Access, Excel, text files, Oracle, SQL, AS400, etc., while allowing data transformations between source and destination.  The problem most import processes have is flexibility.  Some bulk import APIs do not have an easy method to import only new data, or data that does not currently exist in a table.  In these cases, the import process has to import all source data, and then another command has to be issued to filter and insert the data.  A great example of this is BCP or the Bulk Insert command.  SSIS gives developers an easy and efficient method to insert new data only.  This is where I will be focusing my efforts today; in a future post, I will demonstrate how to do an UPSERT (Update existing and Insert new) operation using SSIS.  Okay, let’s start by creating a sample table.

--Switch DB context to Tempdb
USE [tempdb]
GO
 
SET NOCOUNT ON;
GO
 
--Drop table if exists
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'SSIS_Import')
BEGIN
    DROP TABLE dbo.SSIS_Import;
END
GO
 
--Create table
CREATE TABLE dbo.SSIS_Import(
SomeID INT PRIMARY KEY,
SomeInt INT,
SomeChar CHAR(1)
);
GO
 
--Insert some test data
INSERT INTO dbo.SSIS_Import ([SomeID],[SomeInt],[SomeChar]) VALUES (1,1,'A');
INSERT INTO dbo.SSIS_Import ([SomeID],[SomeInt],[SomeChar]) VALUES (2,5,'B');
INSERT INTO dbo.SSIS_Import ([SomeID],[SomeInt],[SomeChar]) VALUES (3,8,'C');
INSERT INTO dbo.SSIS_Import ([SomeID],[SomeInt],[SomeChar]) VALUES (4,2,'D');
GO

Now that our table exists, I am going to shift gears and start creating the SSIS package. Before creating the package, you should create a pipe-delimited text file on the C: drive, called SSIS_Import.txt.  Below is a sample of my file.

1|100|U
2|200|V
5|300|W
8|400|X
9|500|Y
10|600|Z

Okay, now we can start creating our SSIS package.  You should be looking at a blank canvas; the first objective is to drag a Data Flow Task onto the canvas.  Your canvas should look like the screenshot below.

[screenshot]

Note: I have renamed my data flow task to represent what it is doing.

I will now configure the Data Flow Task.  Double-click the data flow task to open the data flow task designer canvas. Drag a “Flat File Source” to the canvas.  Double-click the Flat File Source and create a new connection.  Your connection should be configured as shown below.

[screenshot]

You will see an error message at the bottom of the window, stating the columns are not defined.  All you need to do is click the Advanced tab and make sure the data types and lengths are correct.  Column 0 and Column 1 should be of type “four-byte signed integer” and Column 2 should be a string with a length of 1.  In addition to configuring the data types, you should rename the columns.  You can change the name of a column by typing a new name in the Name property field.  The names should align with the table, so the mappings should look like this:

  • Column 0 –> SomeId
  • Column 1 –> SomeInt
  • Column 2 –> SomeChar

After these changes, your configuration should look like the screenshot below.

[screenshot]

Once the Flat File Source is configured, click “Ok” and “Ok” again.  The next step is to drag a “Lookup” transformation to the canvas and connect the source to the Lookup transformation via the green arrow (the data path).  Double-click the Lookup transformation to configure it.  The first thing I have to do is change how errors are handled.  By default, if a single incoming value is not found in the lookup table, the entire Lookup component fails.  We need to make sure non-matching rows are redirected to the error output instead, as shown below.

[screenshot]

Next, I will configure the connection.  Click the connection tab and then click the “New” button.  The OLE DB connection should point to the database where the SSIS_Import table exists.  Your screen should look like the screenshot below.

[screenshot]

Next, we have to configure the key columns.  For the sample table the key column is SomeID, which maps to Column 0 (or SomeID, if you renamed the column in your flat file connection manager), as shown below.

Note: To create the relationship you will have to drag the source column to the destination column.

[screenshot]

Once you have finished configuring this component, click “Ok” to save the changes.  Your canvas should look like the screenshot below.

[screenshot]

Next, drag an OLE DB Destination to the canvas.  Make sure you drag the red arrow (the Lookup error output) to the OLE DB Destination, which should cause a dialog box to pop up; when it does, just click “Ok.”  The destination controls where the source data will be inserted.  Double-click the OLE DB Destination and make the destination point to the SSIS_Import table, as shown below.

[screenshot]

Now click the mappings tab to map the source columns to the destination columns.  Because I renamed my source columns, SSIS will automatically map the columns for me.  Click “Ok” once you have mapped all the columns.

[screenshot]

That is it!!! The final canvas will look like the screenshot below. 

[screenshot]

All that is left is testing.  Click the debug button (it looks like a “Play” button).  If you have correctly configured all the steps, the components will light up green.  If any turn red, something is not configured properly.  Once all components light up green, run a simple select statement over the table to make sure the package did what it is supposed to do.
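For example, with the sample table and text file used in this post, a quick check like the one below should show the original four rows plus the four new ids (5, 8, 9, and 10) from the file, while the matching ids 1 and 2 remain untouched:

SELECT * FROM dbo.SSIS_Import ORDER BY SomeID;
 
/*
SomeID      SomeInt     SomeChar
----------- ----------- --------
1           1           A
2           5           B
3           8           C
4           2           D
5           300         W
8           400         X
9           500         Y
10          600         Z
*/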

There you have it: a method that takes less than 5 minutes to configure and will migrate new data only.  I hope that you have enjoyed reading this post and can make use of this in your environment.  Stay tuned…… my next step is to update pre-existing rows using SSIS.  This is commonly known as an UPSERT.  I will show you a few ways to accomplish this task and the performance considerations for each method.

Until next time… Happy Coding.

Wednesday, September 23, 2009

Sequencing Data Using SQL Server 2000+ (Part 2)

This is part two of a two-part series where I will be demonstrating partitioned ranking, or what I call partitioned sequences.  In part one of this series, we looked at sequencing data, http://jahaines.blogspot.com/2009/09/sequencing-data-in-sql-server-part-1.html.  A sequence number is often a row number given to each row of a result set.  A partitioned sequence is a sequence of numbers that restarts for each grouping of data.  It is a lot easier to handle sequenced partitioning when you have a somewhat sequential surrogate key, like a date column or, even better, an identity column; such a key is not required, but it helps.  Let’s get started by creating our sample table.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Orders]
END;
GO
 
CREATE TABLE dbo.[Orders](
OrderId INT IDENTITY(1,1),
CustId INT,
OrderDate DateTIME
);
GO
 
INSERT INTO [dbo].[Orders] (CustId,OrderDate)
SELECT 1,'1/1/09 08:00 AM' UNION ALL 
SELECT 1,'1/1/09 10:00 AM' UNION ALL 
SELECT 1,'2/1/09 02:00 PM' UNION ALL
SELECT 1,'3/1/09 03:15 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL --Show that ties do not affect the output
SELECT 2,'1/1/09 07:30 AM' UNION ALL
SELECT 2,'1/1/09 05:45 PM' UNION ALL
SELECT 2,'2/2/09 11:38 PM';
GO

Now that we have our sample table, we can begin our first partitioning sequence, using the row_number function (SQL Server 2005+). 

--Row Sequence Partitioning: 2005+
SELECT *, ROW_NUMBER() OVER(PARTITION BY o1.CustId ORDER BY o1.OrderId ASC) AS seq
FROM [dbo].[Orders]o1
ORDER BY o1.OrderId, o1.CustId

In the above example, the Row_Number function, http://msdn.microsoft.com/en-us/library/ms186734.aspx, is driving the results.  As you can see, I am using the partition by clause to “group” the sequence by o1.CustId.  The partitioning clause creates a new sequence for each grouping.  The order by clause of the Row_Number function dictates the sequence order.  I have chosen ASC order, so the sequence will be generated for each CustId starting with the smallest OrderId and ending with the largest OrderId.

Results:

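Since the original screenshot is no longer available, here is the result set reconstructed from the sample data; note that the sequence restarts at 1 for CustId 2. The two SQL Server 2000 methods that follow return the same output.

/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- ----
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 1
8           2           2009-01-01 17:45:00.000 2
9           2           2009-02-02 23:38:00.000 3
*/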

Now, we will move on to solutions for SQL Server 2000.  The next example uses a correlated subquery, which can really degrade query performance.  I do not really like correlated subqueries, so while I list this as a possible method, I advise you to use another method, http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html

--Partitioning: 2000 slow
SELECT *,(SELECT COUNT(*) FROM [dbo].[Orders]o2 WHERE o2.OrderId <= o1.OrderId AND o1.CustId=o2.[CustId]) AS seq
FROM [dbo].[Orders]o1
ORDER BY o1.OrderId, o1.CustId

Results:

[screenshot]

There really is not a whole lot of magic to the correlated subquery solution.  All we are doing is counting, per customer, the number of rows with an OrderId less than or equal to the current OrderId, which effectively generates our partitioned sequence.  The next method is the self-join aggregate method.

--Partitioning: 2000 much faster
SELECT o1.OrderId,o1.CustId,o1.OrderDate,COUNT(*) AS seq
FROM [dbo].[Orders]o1
INNER JOIN [dbo].[Orders]o2 ON o2.OrderId <=o1.OrderId AND o1.[CustId]=o2.[CustId]
GROUP BY o1.OrderId,o1.CustId, o1.OrderDate
ORDER BY o1.OrderId, o1.CustId, o1.OrderDate

Again, there is no magic formula here.  We are using the same logic as the correlated subquery, but expressed as a self join with a GROUP BY.  The self-join aggregate method is a better solution than the correlated subquery method and CAN be more performant than the Row_Number method.  Whether or not it out-performs the Row_Number function depends on several factors, but the biggest factor is the number of sort operators.  The self-join aggregate method will mostly use one sort operator (for the ORDER BY clause) and a stream aggregate, whereas the Row_Number solution can use two sort operators if the query’s ORDER BY clause does not align with the ORDER BY clause of the Row_Number function.  In that scenario, the Row_Number solution can be more expensive than the self-join aggregate solution.  In most cases, though, the self-join aggregate method is more costly than the Row_Number function because it requires another join, which translates into higher IO, while the Row_Number function reads from a single table in this scenario.

Results:

[screenshot]

There you have it.  We have successfully implemented sequenced partitioning in SQL 2000 and greater versions.  The main thing to remember for solutions prior to SQL Server 2005 is that, in some cases, you may have to handle ties.  I find it best to handle sequence partitioning using a ranking function, such as Row_Number.  If you are using SQL 2000 and do not have a sequential key like an identity, you may have a little heartache, especially if you are trying to handle ties.  It is not impossible, but it is trickier and you have to be more creative with your solution.  I hope that you have learned something that you can use in your environment and, as always, happy coding.

Sunday, September 20, 2009

Sequencing Data Using SQL Server 2000+ (Part 1)

Due to the overwhelming number of data sequencing questions asked on the MSDN forums, I will demonstrate the most common solutions for sequencing data using SQL Server 2000 and greater.  Before starting, I would like to give a disclaimer.  I believe that data sequencing should be handled in the application/presentation layer.  SQL Server is not really designed to present data in a “pretty” format.  I am not saying data sequencing cannot be done in SQL Server, or that SQL Server cannot provide a highly performant data sequencing solution; I simply believe it is best to handle sequencing in the application/presentation layer.  With the disclaimer out of the way, let’s push forward.  This post is part one of a two-part series and will primarily focus on basic data sequencing.  Let us start by creating our sample table.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Orders]
END;
GO
 
CREATE TABLE dbo.[Orders](
OrderId INT IDENTITY(1,1),
CustId INT,
OrderDate DateTIME
);
GO
 
INSERT INTO [dbo].[Orders] (CustId,OrderDate)
SELECT 1,'1/1/09 08:00 AM' UNION ALL 
SELECT 1,'1/1/09 10:00 AM' UNION ALL 
SELECT 1,'2/1/09 02:00 PM' UNION ALL
SELECT 1,'3/1/09 03:15 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL --Show that ties do not affect the output
SELECT 2,'1/1/09 07:30 AM' UNION ALL
SELECT 2,'1/1/09 05:45 PM' UNION ALL
SELECT 2,'2/2/09 11:38 PM';
GO

The first method I will demonstrate is the Row_Number method, http://msdn.microsoft.com/en-us/library/ms186734.aspx.  Row_Number is a function that was introduced in SQL Server 2005, so it is not available in SQL Server 2000.  The Row_Number function allows developers to sequence/rank a result set.  For this sample, I will not be using the partition clause within the Row_Number function.  The partition clause creates a sequence based on a partition or grouping of data.  I will discuss partitioning sequences further in part two of this series.  Now to the first sample.

--Row Sequence Number- Row_Number() Function (2005+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,ROW_NUMBER() OVER(ORDER BY o1.OrderId ASC) AS seq
FROM [dbo].[Orders] o1
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- --------------------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

In the above sample, I used the Row_Number function and specified that the sequencing order is based on the OrderId ascending.  The query above is really straightforward and easy to implement.  The Row_Number function makes sequencing data really easy and is quite useful for advanced queries.  The next sample uses a correlated subquery to count the number of rows less than or equal to the current value.  The problem with this method is the subquery itself: a correlated subquery can really degrade performance and makes a solution less scalable.  You can get more information on the problems with correlated subqueries here: http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html.

--Row Sequence Number- Correlated Subquery (2000+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,(SELECT COUNT(*) FROM [dbo].[Orders]o2 WHERE o2.OrderId <= o1.OrderId) AS seq
FROM [dbo].[Orders]o1
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- -----------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

The next method I will be demonstrating uses the same logic as the subquery, but uses a self join with an aggregate function. I like this query a lot more than the subquery. This solution can significantly reduce IO, therefore allowing it to perform a lot better than the subquery method. The problem with this query is the triangular join. A triangular join occurs when you join a table to another table (which can be the same table) using a greater-than or less-than operator instead of an equality predicate. It is called a triangular join because of the shape the matched data makes; for the 9 sample rows above, the join produces 1+2+…+9 = 45 intermediate rows, and that n*(n+1)/2 growth is what hurts at scale. If you want more info on triangular joins, Jeff Moden has written a great article, http://www.sqlservercentral.com/articles/T-SQL/61539/.

--Row Sequence- Aggregate (2000+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,COUNT(*) AS seq
FROM [dbo].[Orders]o1
INNER JOIN [dbo].[Orders]o2 ON o2.OrderId <=o1.OrderId
GROUP BY o1.OrderId,o1.CustId, o1.OrderDate
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- -----------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

The final method I will demonstrate is the temp table method.  This method can be highly performant and can sometimes be the best solution, especially in versions of SQL Server prior to 2005.  The key to the temp table method is to use INSERT INTO/SELECT with an ORDER BY clause.  I also like to create a clustered index on the ORDER BY column.  This solution scales really well, especially when working with lots of data.  When dealing with large quantities of data, this is often the best performing solution because it does not use a triangular join, does not incur extra IO from subqueries, and can take advantage of bulk insertion and minimally logged transactions.  The negative aspect is that you will be creating a large temporary object, which could hinder other operations that use tempdb.

--Row Sequence- Temp Table (2000+)
IF OBJECT_ID('tempdb..#Row_Sequencing') IS NOT NULL
BEGIN
    DROP TABLE #Row_Sequencing;
END;
 
CREATE TABLE #Row_Sequencing(
seq INT IDENTITY(1,1),
OrderId INT PRIMARY KEY CLUSTERED, --Used to create a clustered index, which sorts the data,
CustId INT,
OrderDate DATETIME
);
 
INSERT INTO #Row_Sequencing ([OrderId],[CustId],[OrderDate])
SELECT o.OrderId,o.CustId,o.OrderDate
FROM [dbo].[Orders]o
ORDER BY o.OrderId ASC; --You must issue an order by to guarantee the data insertion order
 
SELECT * FROM [#Row_Sequencing] ORDER BY [seq];
 
/*
seq         OrderId     CustId      OrderDate
----------- ----------- ----------- -----------------------
1           1           1           2009-01-01 08:00:00.000
2           2           1           2009-01-01 10:00:00.000
3           3           1           2009-02-01 14:00:00.000
4           4           1           2009-03-01 15:15:00.000
5           5           1           2009-03-08 22:00:00.000
6           6           1           2009-03-08 22:00:00.000
7           7           2           2009-01-01 07:30:00.000
8           8           2           2009-01-01 17:45:00.000
9           9           2           2009-02-02 23:38:00.000
*/

There you have it.  I have demonstrated the most common solutions for sequencing data.  Stay tuned for part two of this series, in which I will focus on creating sequence partitions using SQL Server 2000 and greater.  I hope that you have learned something new and can make practical use of the samples provided.

Happy Coding.

Monday, September 14, 2009

Select The Most Current Row

I have seen a lot of questions regarding how to obtain the most current row for a given grouping.  This type of query is very typical for a one-to-many relationship.  In our example, we will be creating an Orders table, and our objective is to find the most recent order for each customer.  Let’s start by creating our table structure.

USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.Customers') IS NOT NULL
BEGIN 
    DROP TABLE dbo.Customers;
END
GO
 
CREATE TABLE dbo.Customers(
CustomerID INT IDENTITY(1,1) PRIMARY KEY,
EmpSalesID INT,
FName VARCHAR(25),
LName VARCHAR(25),
Phone CHAR(12)
);
 
INSERT INTO Customers VALUES (1,'Adam','Haines','555-555-5555');
INSERT INTO Customers VALUES (1,'John','Deere','555-555-7777');
INSERT INTO Customers VALUES (2,'Allision','Smith','555-555-8888');
 
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN 
    DROP TABLE dbo.Orders;
END
GO
 
CREATE TABLE dbo.Orders(
OrderId INT IDENTITY(1,1) PRIMARY KEY,
CustomerID INT,
Order_Amt NUMERIC(9,2),
Order_Dt DATETIME
);
 
INSERT INTO dbo.Orders VALUES (1,199.50,'2009-08-30');
INSERT INTO dbo.Orders VALUES (1,49.99,'2009-09-05');
INSERT INTO dbo.Orders VALUES (2,2500.00,'2009-09-08');
INSERT INTO dbo.Orders VALUES (3,1.00,'2009-08-08');
INSERT INTO dbo.Orders VALUES (3,2.00,'2009-09-07');
INSERT INTO dbo.Orders VALUES (3,3.00,'2009-09-08');

Now that we have our table structure, we can start writing our query.  As you can see, we have a one-to-many relationship between customers and orders.  When we join our Customers table to the Orders table, we will see multiple rows returned per customer, but our objective is to return only the most recent row for each customer.  I will illustrate two different methods for obtaining this data.  The first method uses a CTE (Common Table Expression).  The key to this method is the row_number function, as it can be used to partition and sort the data to our liking.  In my example, I partition by the customer id and order by the order_dt DESC because I want to return the newest order placed by each customer.  This gives the newest record a sequence value of 1 per CustomerID.  I can then filter the outer SELECT for all sequences that are equal to 1, or the most current rows.

--SQL 2005+
--create cte to select the data we want
;WITH Orders (seq,CustomerID,FName,LName,OrderId,Order_Amt,Order_Dt)
AS
(
SELECT --use the row_number function to get the newest record
    ROW_NUMBER() OVER(PARTITION BY c.[CustomerID] ORDER BY  Order_Dt DESC) AS seq, 
    c.CustomerID,c.FName,c.LName,o.OrderId,o.Order_Amt,o.Order_Dt
FROM dbo.Customers c
INNER JOIN dbo.Orders o
    ON c.[CustomerID] = o.[CustomerID]
)
SELECT CustomerID,FName,LName,OrderId,Order_Amt,Order_Dt
FROM Orders
WHERE seq = 1 --filter for the newest record only
ORDER BY CustomerID

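The original screenshot is unavailable, so here is the result set reconstructed from the sample data (one row per customer, each showing that customer's newest order):

/*
CustomerID  FName     LName   OrderId     Order_Amt   Order_Dt
----------- --------- ------- ----------- ----------- -----------------------
1           Adam      Haines  2           49.99       2009-09-05 00:00:00.000
2           John      Deere   3           2500.00     2009-09-08 00:00:00.000
3           Allision  Smith   6           3.00        2009-09-08 00:00:00.000
*/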

Next, I will show you a method that works on SQL Server versions prior to 2005.  This method works on the same principle, but uses the MAX aggregate to get the most current order_dt per CustomerID.  We can then join this order_dt back to the Orders table to get the rest of the most recent order's row data.  One caveat: if a customer ever has two orders with the exact same Order_Dt, this method returns both rows, whereas the row_number method returns only one.  Otherwise it produces the same three rows as above, ordered by OrderId.

--SQL 2000
SELECT o.CustomerID,c.FName,c.LName,o.OrderId,o.Order_Amt,o.Order_Dt
FROM dbo.Customers c
INNER JOIN dbo.Orders o
    ON c.[CustomerID] = o.[CustomerID]
INNER JOIN(
    SELECT CustomerId, MAX(Order_Dt) AS Order_Dt
    FROM dbo.Orders
    GROUP BY [CustomerId]
) AS Newest_Order
    ON  O.CustomerID = Newest_Order.CustomerId
        AND o.[Order_Dt] = Newest_Order.[Order_Dt] 
ORDER BY o.[OrderId]

[screenshot]

There you have it!!! I have provided two simplistic methods to get the most current row.  Both methods are very straightforward and very easy to implement.

Happy Coding.

Saturday, September 12, 2009

Using The Output Clause

In a prior post, http://jahaines.blogspot.com/2009/09/output-clause-great-alternative-to.html, I described a great alternative to scope identity: the Output clause, http://technet.microsoft.com/en-us/library/ms177564.aspx. In this post I will describe how to use the Output clause in a manner other than getting identity values.  The Output clause is new to SQL Server 2005.  The biggest benefit of this clause is that it gives a developer access to the DML pseudo tables, inserted and deleted.  Typically these pseudo tables are only available to DML triggers.  Now that I have laid the groundwork, I will create our table structures.

USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.t') IS NOT NULL
BEGIN
    DROP TABLE dbo.t;
END
GO
 
CREATE TABLE dbo.t(
Id INT IDENTITY(1,1) PRIMARY KEY,
Col CHAR(1)
);
 
IF OBJECT_ID('tempdb.dbo.DML') IS NOT NULL
BEGIN
    DROP TABLE dbo.DML;
END
GO
 
CREATE TABLE dbo.DML(
DML_Type CHAR(1),
Id INT,
Col CHAR(1)
);

Now that I have created our table structure, I can issue our first DML transaction.  The first will be an insert statement using the VALUES clause.  In this example and in the subsequent examples, I will be performing different DML transactions and inserting the affected row data and transaction type into the DML table.

--Insert a single value into the table dbo.t
INSERT INTO dbo.t (col) OUTPUT 'I',INSERTED.Id,INSERTED.Col INTO dbo.DML VALUES ('a');
 
--Select all rows from the table dbo.DML
SELECT * FROM dbo.DML
 
/*
DML_Type Id          Col
-------- ----------- ----
I        1           a
*/

As you can see from the output above, I have used the Output clause to successfully capture all the row data.  The next DML transaction I will demonstrate is an insert statement using a select statement.

--Insert multiple values into the table dbo.t
INSERT INTO dbo.t (col) OUTPUT 'I',INSERTED.Id, INSERTED.Col INTO dbo.DML
SELECT 'b' UNION ALL
SELECT 'c'
 
--Select all rows from the table dbo.DML
SELECT * FROM dbo.DML
 
/*
DML_Type Id          Col
-------- ----------- ----
I        1           a
I        2           b
I        3           c
*/

Next, I will demonstrate a delete statement.  I will be deleting all the rows except id 2.

--Delete all ids but 2 from dbo.t
DELETE FROM dbo.t OUTPUT 'D',DELETED.Id,DELETED.Col INTO dbo.DML
WHERE id <> 2
 
--Select all rows from the table dbo.DML
SELECT * FROM dbo.DML
 
/*
DML_Type Id          Col
-------- ----------- ----
I        1           a
I        2           b
I        3           c
D        1           a
D        3           c
*/

In the last example, I will be demonstrating an UPDATE statement.  One thing to remember about an UPDATE statement is that the Output clause has access to both the INSERTED and DELETED pseudo tables.  It should be noted that, in the example below, I am only getting data from the INSERTED table; see the follow-on sketch after the results for a variant that captures both.

--Update id = 2 the table dbo.t
UPDATE dbo.t
SET Col = 'E'
OUTPUT 'U', INSERTED.Id,INSERTED.[Col] INTO dbo.DML
WHERE Id = 2
 
--Select all rows from the table dbo.DML
SELECT * FROM dbo.DML
 
/*
DML_Type Id          Col
-------- ----------- ----
I        1           a
I        2           b
I        3           c
D        1           a
D        3           c
U        2           E
*/
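As a quick illustration of that point, below is a small follow-on sketch (my own addition, not part of the original example) that returns both the before and after values of Col directly to the client, rather than persisting them to dbo.DML:

--Capture both the old and new values of Col in a single statement
UPDATE dbo.t
SET Col = 'F'
OUTPUT 'U' AS DML_Type, INSERTED.Id, DELETED.Col AS Old_Col, INSERTED.Col AS New_Col
WHERE Id = 2;
 
/*
DML_Type Id          Old_Col New_Col
-------- ----------- ------- -------
U        2           E       F
*/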

There you have it.  The Output clause is a very powerful clause that gives developers a method to obtain data that was once difficult to get.  This clause can be used in a lot of different scenarios and situations, but I leave it to you to find out how it can be used in your environment.

Happy Coding.

Sunday, September 6, 2009

Output Clause – A Great Alternative To Scope_Identity()

In SQL Server 2005, a new clause was introduced that allows your query access to the inserted and deleted pseudo tables without the use of a trigger.  In prior versions of SQL Server, the primary way to get a newly inserted identity value was to use either @@IDENTITY or Scope_Identity().  The new clause that became available in SQL Server 2005 is the Output clause, http://technet.microsoft.com/en-us/library/ms177564.aspx.  The Output clause is a more reliable and cleaner solution for dealing with newly inserted identities.  The primary reason to use the Output clause is reliability: the scope_identity function can sometimes return a null value, and @@identity may return an incorrect identity value.  I won’t go into detail about how scope_identity() and @@identity can return invalid results, but trust me, it does happen.  Another great reason is scalability, and this is where I will be focusing my effort.  The Output clause is more scalable because @@Identity and Scope_Identity() are limited to single-row processing, while the Output clause allows you to work with sets of data.  You may be asking yourself how so?  The simple answer is that scope_identity() returns only the last identity value inserted within the scope (batch, trigger, stored procedure); when you need a collection of newly inserted identities, a recursive or iterative process has to be implemented.  Let’s have a look at how each method works and performs.

Sample Data:

IF OBJECT_ID('tempdb..#t') IS NOT NULL
BEGIN
    DROP TABLE #t;
END
GO
 
CREATE TABLE #t(
Id INT IDENTITY(1,1) PRIMARY KEY,
Col CHAR(1) NOT NULL
);
GO
 
IF OBJECT_ID('tempdb..#ValuesToInsert') IS NOT NULL
BEGIN
    DROP TABLE #ValuesToInsert;
END
GO
 
CREATE TABLE #ValuesToInsert(
Id INT,
val CHAR(1)
);
GO
 
--insert some values to insert
INSERT INTO #ValuesToInsert VALUES (1,'a');
INSERT INTO #ValuesToInsert VALUES (2,'b');
INSERT INTO #ValuesToInsert VALUES (3,'c');

Now let’s perform our first test, which will demonstrate how scope_identity/@@Identity yield incomplete results.

--INSERT into our table #t
INSERT INTO #t (col)
SELECT val FROM #ValuesToInsert;
 
--select the current identity, which does not include 1 and 2
--this means we need an iterative/recursive process to get each newly insert identity
SELECT @@IDENTITY AS [@@IDENTITY],SCOPE_IDENTITY() AS [SCOPE_IDENTITY()];
GO

Results:

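Reconstructed from the code above, since the original screenshot is unavailable:

/*
@@IDENTITY  SCOPE_IDENTITY()
----------- ----------------
3           3
*/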

We inserted three values into the table, yet only the last identity value is returned.  How do we accurately get the identity values for all inserted rows?  Without the Output clause, the answer is that we have to come up with an iterative or recursive process, as shown below.

/*Test 2 - using an iterative process*/
IF EXISTS(SELECT 1 FROM #t)
BEGIN 
    TRUNCATE TABLE #t;
END
GO
 
DECLARE @i INT
DECLARE @NbrIns INT
 
SET @i = 1
SET @NbrIns    = (SELECT COUNT(*) FROM [#ValuesToInsert])
 
--loop through values to insert
WHILE @i <= @NbrIns
BEGIN
    --insert current values
    INSERT INTO #t (col)
    SELECT val FROM [#ValuesToInsert] WHERE [Id] = @i;
    --select the new identity value
    SELECT @@IDENTITY AS [@@IDENTITY],SCOPE_IDENTITY() AS [SCOPE_IDENTITY()];
 
    --increment the counter
    SET @i = @i + 1
END
GO

Results:

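Reconstructed: the loop returns three separate one-row result sets, one per inserted row.

/*
@@IDENTITY  SCOPE_IDENTITY()
----------- ----------------
1           1
 
@@IDENTITY  SCOPE_IDENTITY()
----------- ----------------
2           2
 
@@IDENTITY  SCOPE_IDENTITY()
----------- ----------------
3           3
*/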

We all know that this is not highly scalable code.  This code will start to cause problems somewhere down the road, or is already affecting performance.  How can we make our code more scalable and better performing?  The saving grace is the Output clause.

/*Test 3 - Using the Output Clause*/
IF EXISTS(SELECT 1 FROM #t)
BEGIN 
    TRUNCATE TABLE #t;
END
GO
 
--Declare table to hold newly inserted ids
DECLARE @New_Ids TABLE(
Id INT
);
 
--INSERT into our table #t
INSERT INTO #t (col) OUTPUT inserted.Id INTO @New_Ids
SELECT val FROM #ValuesToInsert;
 
SELECT * FROM @New_Ids

Results:

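Reconstructed: the table variable now holds every newly inserted identity value.

/*
Id
-----------
1
2
3
*/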

There you have it.  We have successfully captured all of the newly inserted ids while using a set-based operation.  This just scratches the surface of the power of the Output clause.  In a future post, I will go over how to use the Output clause to capture inserted, updated, and deleted rows.  Thanks for reading and, as always, happy coding.

References: