Arrays and Lists in SQL Server 2005 and Beyond
When TVPs Do Not Cut it

An SQL text by Erland Sommarskog, SQL Server MVP. Latest revision: 2013-11-12.

Introduction

In the public forums for SQL Server, you often see people asking "How do I use arrays in SQL Server?" or "Why does SELECT * FROM tbl WHERE col IN (@list) not work?". The short answer to the first question is that SQL Server does not have arrays – SQL Server has tables. However, up to SQL 2005 you could not specify a table as input to SQL Server from a client; you had to pass a string with the values and unpack it into a table on the SQL Server end.

This article describes a number of different ways to do this, both good and bad. I first give a background to the problem (including a quick solution that is good enough in many cases). I then give a brief overview of the methods, whereupon I discuss general issues that apply no matter which method you use. Having dealt with these introductory topics, I devote the rest of the article to detailed descriptions of all methods, and I discuss their strengths and weaknesses.

SQL 2008 added a new feature that removes the need for this kind of kludge: table-valued parameters, TVPs. So if you are on SQL 2008, you may not need to use the techniques described in this article; instead you can head to the accompanying article Arrays and Lists in SQL Server 2008 which is entirely devoted to table-valued parameters. However, not all client APIs support TVPs, so you may still have use for this article, even if you are on SQL 2008. There is also a second accompanying article: Arrays and Lists in SQL 2000 and Earlier which covers the options in SQL 2000, SQL 7 and SQL 6.5.

Furthermore, there are two performance-test appendixes. The first is from 2006 when I originally published this article, and the second is from 2009 when I conducted new tests to cover new methods, including table-valued parameters. This time I also looked at call overhead and made an attempt to run multi-thread tests. The second appendix was updated in 2010 to reflect things that I had missed about XML.

If you feel deterred by the sheer length of this article, you should be relieved to know that this is the kind of article where you may come and go as you please. If you are a plain SQL programmer who wants to know "how do I?", you can drop off after the first solution if you are in a hurry. If you have a little more time, you can read the background, the overview and the General Considerations section, and study the methods that look the most appealing to you. True SQL buffs who are curious about the performance numbers might find the explanations of the methods a little tedious and may prefer to skim these parts and go directly to the performance-test appendixes.

Note: all samples in this article refer to the Northwind database. This database does not ship with SQL 2005 or later, but you can download the script to install it from Microsoft's web site.

Here is a table of contents:

   Introduction
   Background
      Comma-separated List of Values
      Inserting Many Rows
   Overview of the Methods
   General Considerations
      Interface
      Robustness
      Performance Considerations
         varchar vs. nvarchar
         Inline, Multi-Statement and Temp Tables
         MAX Types vs. Regular (n)varchar
         Collations
         Scalability in a Multithread Environment
      How To Use These Functions – JOIN vs. EXISTS
      Unpacking Lists in a Table
   The Iterative Method
      List-of-integers
      List-of-strings
   Using the CLR
      Introducing the CLR
      CLR Functions Using Split
      Rolling Our Own in the CLR
      What About Performance?
   XML
      Using XML for a List of Values
      Inserting Many Rows
      Element-centric XML
      Typed XML
      OPENXML
   Using a Table of Numbers
      The Concept
      An Inline Function
      A Chunking Multi-Statement Function
      Concluding Remarks
   Fixed-Length Array Elements
   Using a Function of Numbers
   XML Revisited
   Using Recursive CTEs
   Dynamic SQL
   Passing the List as Many Parameters
   Making the List into Many SELECT
   Really Slow Methods
   Concluding Performance Remarks
   Acknowledgements and Feedback
   Revisions

Background

Comma-separated List of Values

You have a number of key values, identifying a couple of rows in a table, and you want to retrieve these rows. If you are the sort of person who composes your SQL statements in client code, you might have something that looks like this:

cmd.CommandText = "SELECT ProductID, ProductName FROM Northwind..Products " & _
                  "WHERE ProductID IN (" & List & ")"
reader = cmd.ExecuteReader()

List is here a variable to which you have assigned a comma-separated list somewhere, for instance "9, 12, 27, 39".

This sort of code is bad practice, because you should never interpolate parameter values into your query string. (Why is beyond the scope of this article, but I discuss this in detail in my article The Curse and Blessings of Dynamic SQL, particularly in the sections on SQL Injection and Caching Query Plans.)

Since this is bad practice, you want to use stored procedures. However, at first glance there does not seem to be any apparent way of doing this. Many have tried with:

CREATE PROCEDURE get_product_names @ids varchar(50) AS
SELECT ProductID, ProductName
FROM   Northwind..Products
WHERE  ProductID IN (@ids)

But when they test this:

EXEC get_product_names '9, 12, 27, 37'

The reward is this error message:

Server: Msg 245, Level 16, State 1, Procedure get_product_names, Line 2
Syntax error converting the varchar value '9, 12, 27, 37' to a column
of data type int.

This fails, because we are no longer composing an SQL statement dynamically; @ids is just a single value in the IN clause – an IN clause that could just as well read:

... WHERE col IN (@a, @b, @c)

Or more to the point, consider this little script which will return one row:

CREATE TABLE #csv (a varchar(20) NOT NULL)
go
INSERT #csv (a) VALUES ('9, 12, 27, 37')
INSERT #csv (a) VALUES ('something else')
SELECT a FROM #csv WHERE a IN ('9, 12, 27, 37')

The preferred way to handle the situation is to use a table-valued parameter, provided you are on SQL 2008 and use an API that supports it. (See my article Arrays and Lists in SQL Server 2008 for a discussion on this.) If not, the way to go is to use a function that unpacks the string into a table. Here is a very simple such function:

CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos        int,
           @nextpos    int,
           @valuelen   int

   SELECT @pos = 0, @nextpos = 1

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
   RETURN
END

The function iterates over the string looking for commas, and extracts the values one by one. The code is straightforward, and makes use of some of the string functions in T-SQL. The most complex part is the CASE expression which exists to handle the last value in the string correctly. This example shows how you could use this function:

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter$simple_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9, 12, 27, 37'

So there, you have a solution to the problem. But let me say it directly that the function above is not extremely speedy, and almost all methods I will discuss in this article are faster. Nevertheless it's good enough for many situations, particularly if your list is short. So if you are in a hurry and want to move on with your project, feel free to stop here and come back later if you are curious or run into problems with performance and need to learn more.

If you are in the situation that your lists are in a table column and you are in a hurry, head for the section Unpacking Lists in a Table.

Inserting Many Rows

A related problem is to insert many rows at once. While you can send an INSERT statement or call a stored procedure for every row, this is not efficient because of the many network roundtrips and the way the transaction log works.

If you are looking into importing data on a major scale, you should consider SQL Server Integration Services (SSIS) or use bulk load with BCP or BULK INSERT (which are all out of the scope of this article). But these tools are too heavy artillery for loading a few thousand rows from a grid or a file.

The best choice in SQL 2008 is to use table-valued parameters, but if this option is not available to you, several of the methods that I describe in this article lend themselves to this purpose. Most of them are best fitted to handle "arrays" of single values, although they can be reworked to handle records with several fields. Two methods serve this purpose better than the others, namely XML and a trick with INSERT-EXEC that I discuss in the section Making the List into Many SELECT.

In this article I'm focusing on comma-separated lists, since most questions on the newsgroup are about this scenario. But I will occasionally touch the topic of inserting many rows.

Overview of the Methods

As I've already hinted, there are quite a few methods to unpack a list into a table. Here I will just give a quick overview of the methods, before I move on to the general considerations.

General Considerations

Interface

Most of the methods I present are packaged into functions that take an input parameter which is a list of values and returns a table, like this:

CREATE FUNCTION list_to_table (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS 

The reason the methods are in functions is obvious: this permits you to easily reuse the function in many queries. Here I will discuss some considerations about the interface of such functions.

The Input Parameters

In this article, as far as the input parameter is a string, it is always of the data type nvarchar(MAX). This is a new data type in SQL 2005. It can fit up to 2 GB of data, just like the old ntext data type, but nvarchar(MAX) does not have the many quirks of ntext.

I made this choice, because I wanted to make the functions as generally applicable as possible. By using nvarchar, the functions can handle Unicode input, and with MAX the functions permit unlimited input. With nvarchar(4000), they would silently yield incorrect results with longer input, which is very bad in my book.

Nevertheless, there is a performance cost for these choices. If you use an SQL collation, you should know that varchar gives you better performance (more on that in a minute). And some operations are slower with the MAX data types. Thus, if you know that your lists will never exceed 8000 bytes and you will only work with your ANSI code page, but you need all performance you can get, feel free to use varchar(8000) instead.

Some of the functions take a second parameter to permit you to specify the delimiter, or, in case of fixed-length, the element length. In some functions I have opted to hard-code the delimiter, but for all methods that use a delimiter (save dynamic SQL), you can always add such a parameter. To keep things simple, I have consistently used one-character delimiters. If you need multi-character delimiters, most methods can be extended to handle this.

The Output Table

The output from all functions is a table. For all methods, I've included one function for returning a table of strings, and for some methods also a function that returns a table of integers. (Which is likely to be the most common data type for this kind of lists.) If you have a list of integers and a function that returns strings, you can use it like this:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.id = convert(int, l.str)

The same applies to other data types. You can easily clone a version of the function that has the convert built in. (However, check the Robustness section for some things to look out for.)

The data type that requires most consideration is strings. Return nvarchar or varchar? Obviously when you work with Unicode data, you need to get nvarchar strings back. It may be tempting to always return nvarchar, but for reasons that I will return to in the performance section, you should make sure that you have a varchar string when you join with a varchar column. For some of the functions in this article, I have added both a varchar and an nvarchar column to the return table.

In some functions, I also return the position in the list for the elements. This can be handy when you have two or more lists that are horizontal slices of the same source, so you can say things like:

INSERT tbl(id, col2, col3)
   SELECT a.number, b.str, c.str
   FROM   intlist_to_table(@list1) a
   JOIN   charlist_to_table(@list2) b ON a.listpos = b.listpos
   JOIN   charlist_to_table(@list3) c ON a.listpos = c.listpos  

That is, this is a way to insert many rows in one go, although it's not really the best one. Sometimes this can be OK if you only have two or three columns per row to insert, but as the number of parallel lists grows, it gets out of hand, and you should investigate XML instead. The particular danger with the approach above is that if the lists get out of sync with each other, you will insert inconsistent data.

For some methods, the list position can easily be derived from the method itself, for others (but not all) you can use the row_number() function, a very valuable addition to SQL 2005.

Robustness

It can't be denied that parsing strings is a bit risky. As long as the input plays by the rules everything goes fine, but what happens if it doesn't? A good list-to-table function can protect you from some accidents, but not all. Here are a couple of situations to watch out for.

Delimiter in the Input

Say that you have a couple of city names like this: Berlin, Barcelona, Frankfurt (Main), Birmingham, København. From this you want to compose a comma-separated list that you pass to list-to-table function. With the names listed above, that works fine, but then some joker enters Dallas, TX. Oh-oh.

There are several ways to deal with this problem. One is to use a delimiter that is unlikely to appear in the input data, for instance a control character. Many programs put strings into quotes, so the above list would read "Berlin","Barcelona" etc. This latter format is not supported by any of the functions I present, but you could tweak some of them a bit to get there.

Sometimes you cannot really make any assumption about the delimiter at all, for instance if the source is user input or data on the wire. In such case you will need to use a method with a general escape mechanism, of which I present one, to wit XML. Or you can avoid the delimiter business entirely by using fixed-length strings.

When you work with lists of integers, this is not very likely to be a problem.

Extra Spacing

If you have an input list that goes:

ALFKI, VINET, BERGS,FRANK

Do you want those extra spaces to be included in the data returned by the list-to-table function? Probably not. All functions in this article strip trailing and leading spaces from list elements. However, there are some methods where this is not possible. (Or more precisely, they are not able to handle inconsistent spacing.)

Illegal Input

Say that you have a function that accepts a list of integers, and the input is 9, 12, a, 23, 12. What should happen?

With no particular coding, SQL Server will give you a conversion error, and the batch will be aborted. If you prefer, you can add checks to your function so that the illegal value is ignored or replaced with NULL.

To focus on the main theme, I have not added such checks to the functions in this article.
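If you want the NULL behaviour, one simple way is to guard the convert with a digits-only test, as in this minimal sketch (note that the output column must then allow NULL, and that a leading minus sign would need extra handling):

DECLARE @str nvarchar(10)
SET @str = N'a'

SELECT CASE WHEN @str NOT LIKE N'%[^0-9]%' AND @str <> N''
            THEN convert(int, @str)
       END AS number
-- Yields NULL instead of a conversion error; '12' would yield 12.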

Empty Elements

What if a function that accepts a list of integers is fed the input: 9, 12,, 23, 12. How should that double comma be interpreted? If you just do a simple-minded convert, you will get a 0 back, which is not really good. It would be better to return NULL or just leave out the element. (Raise an error? You cannot raise errors in functions.)

One approach I have taken in some functions in this article is to avoid the problem altogether by using space as delimiter. But since T-SQL does not provide a function to collapse internal spacing, the approach is not without problems. For methods that build on the logic of traditional programming, you can easily handle multiple spaces, but for methods that use a combination of charindex and set-based logic, you would still have to filter out empty elements in the WHERE clause. (Something I have not done in this article.)

Performance Considerations

While I have conducted performance tests and devoted long appendixes to them, a very important performance aspect is not with the methods themselves, but how you use them and how they are packaged. In this section I will look into some important issues.

varchar vs. nvarchar

As I discussed in the Interface section, it appears to be a good choice for a function that unpacks a list of strings to have an nvarchar column in its return table, so it can work with both Unicode and 8-bit data. Functionally, it's sound. Performancewise it can be a disaster. Say that you have:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexedvarcharcol = l.nvarcharcol

Why is this bad? Recall that SQL Server has a strict data-type precedence, which says that if two values of different data types meet, the one with lower precedence is converted to the higher type. varchar has lower precedence than nvarchar. (Quite naturally, since the set of possible values for varchar is a subset of the possible values for nvarchar.) Thus, in the query above indexedvarcharcol will be converted to nvarchar. The cost for this depends on the collation of the column.

If the column has a Windows collation, SQL Server is still able to use the index, because in a Windows collation the 255 possible varchar characters make up a true subset of all possible values for nvarchar. This permits SQL Server to do something known as a range seek. My tests indicate a doubling or tripling of the execution time, which of course is reason enough to avoid the conversion.

But the real disaster is if the column has an SQL collation: in this case the index is completely useless, because in an SQL collation the rules for varchar and nvarchar characters are different. This means that SQL Server will have to find some other way to run the query, most likely one that requires a scan of the entire table. Instead of a sub-second response, it may take minutes to run the query.

Thus, it is essential that you have separate functions to return varchar and nvarchar data. Or do as I have done in some functions in this article: have two return columns, one for varchar and one for nvarchar. You can also always try to remember to convert the output column to the appropriate data type – but it's human to forget.
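For example, the query above could be salvaged by casting the function output back to varchar at the call site (the same hypothetical names as above; pick a length that covers your data):

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l
  ON   t.indexedvarcharcol = convert(varchar(40), l.nvarcharcol)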

You can find out the collation of a column with sp_help.
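If you prefer a query over sp_help, the catalog views give you the same information; this is just one way to do it:

SELECT name, collation_name
FROM   Northwind.sys.columns
WHERE  object_id = object_id('Northwind.dbo.Customers')
  AND  collation_name IS NOT NULL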

There is one more thing to say on this theme, which I will come back to in a minute.

Inline, Multi-Statement and Temp Tables

So far I have used "table function" very loosely. There are several kinds of table functions, and with some methods you don't really use a function at all. There are four main themes: direct SQL without any function, inline table functions (T-SQL inline as well as what I call opaque inline), multi-statement functions, and intermediate storage in a table variable or a temp table. The table at the end of this section shows which strategies apply to which methods.

No method can be implemented in all these ways, but most methods permit you to use intermediate storage.

It may seem from the above that Direct SQL is the most preferable, and indeed table-valued parameters are. While I describe them in detail in another article, I would like to say a few words about their superiority here. For almost all other methods described in this article, the optimizer has very little information about your input list. A TVP is just a form of a table variable, so there are still no statistics – but the optimizer does get one piece of information: cardinality. That is, the first time you call a procedure with a TVP, the optimizer will look at the number of rows in the table, use that number when determining the query plan and then cache that plan. Not that this is always sufficient, but it does take you a long way. As for the other methods that permit direct SQL, they have a set of problems which I will cover when I describe these methods in detail.

At first glance, T-SQL Inline functions seem appealing, since there is no intermediate table and the optimizer has full information about the query. Unfortunately, the latter point is not really true. To be able to estimate the best way to access the other tables in the query, the optimizer would need to know: 1) how many rows the input string will generate 2) the distribution of the values. But the optimizer is not able to do that with the SQL inline functions in this article, because the information is buried too deep in the logic of these functions.

Thus, in practice the optimizer will apply blind assumptions no matter whether you use T-SQL inline, opaque inline, a multi-statement function or a table variable. So then it does not matter which one you use? Oh, no. The blind assumptions are different for the different inline methods, and if the function uses an auxiliary table of numbers, the size of that table will affect the blind assumptions. (Because the optimizer has information about that table.) And the blind assumptions for CLR functions and XML are different from each other and from those for T-SQL inline functions. The assumptions for multi-statement functions and table variables appear to be the same vis-à-vis each other, but yet different from the inline methods. With some luck, if the blind assumption for one method leads the optimizer astray, it may work better with another.

The T-SQL inline functions give the optimizer a higher degree of freedom as it works with the expanded query. I've seen cases where T-SQL inline gives unexpectedly good performance on multi-CPU machines, because the optimizer finds a parallel plan. However, there is a non-negligible risk that the optimizer outsmarts itself. Here is an extract from a mail that I received in response to my old article for SQL 2000:

After reading your article 'Arrays and list in SQL server' I tried to use the Fixed-Length Array Elements method in my application. Everything worked fine until I moved the code from a client batch to a stored procedure. When I looked at the query execution plan, I saw that the number of rows retrieved from the Numbers table was over 61 millions ! Instead of starting by joining the Numbers table with the source table to filter out the 500 rows included in the array, it processes the GROUP BY clause on the entire table (121 000 rows) and then it uses a nested loop to match each entry with the Number table.

With all other methods but T-SQL inline, the optimizer does not really have much other choice than to first compute the table from the list input, but some T-SQL inline functions open up the possibility of a reverse strategy – which is not likely to be successful. I suggested to my correspondent that he should try a multi-statement function instead, and that resolved his issue. I will return to this theme in the section JOIN vs. EXISTS below.

What about temp tables then? Initially, when the optimizer compiles a stored procedure, it makes a blind assumption about a temp table. But if a sufficient amount of data is inserted into the table this will trigger auto-statistics, and this will in its turn trigger a recompile of the statements where the temp table is referenced. This recompile is both a blessing and a curse. It's a blessing, because it gives the optimizer a second chance to find a better plan. But if the optimizer comes up with the same plan as it had before, it was just wasted cycles.
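For instance, a temp-table variant of the procedure from the beginning of the article could look like the sketch below. (With a list this short the recompile machinery hardly matters, but the pattern is the same for big lists.)

CREATE PROCEDURE get_product_names_temp @ids varchar(50) AS
   CREATE TABLE #ids (number int NOT NULL)

   INSERT #ids (number)
      SELECT number FROM iter$simple_intlist_to_tbl(@ids)

   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #ids i ON P.ProductID = i.number
go
EXEC get_product_names_temp '9, 12, 27, 37'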

At this point the reader may feel both confused and uncomfortable over all these complications. In practice, it is not really that bad. Often, these blind assumptions work fairly well, particularly if your input lists are small. So go for a method that you think fits your needs, and stick with it as long as you don't run into problems. If you do run into bad performance, come back and read this section again to get an idea of which alternatives you should try. One rule of thumb is that the bigger the input list, the more reason you have to consider using a temp table.

Here is an overview of which strategies are possible with which methods:

                         Direct SQL  T-SQL Inline  Opaque Inline  Multi-Statement  Table Variable  Temp Table
Table-valued Parameters  Yes         No            No             No               Yes             Yes
Iterative Method         No          No            No             Yes              Yes             Yes
CLR                      No          No            Yes            No               Yes             Yes
XML                      No          No            Yes            No               Yes             Yes
Table of Numbers         No          Yes           No             Yes              Yes             Yes
Fixed-Length             No          Yes           No             Yes              Yes             Yes
Function of Numbers      No          Yes           No             Yes              Yes             Yes
Recursive CTE            No          Yes           No             Yes              Yes             Yes
Dynamic SQL              Yes         No            No             No               No              No
Many Parameters          Yes         No            No             No               Yes             Yes
List to SELECT           No          No            No             No               Yes             Yes
Real Slow                Yes         No            No             No               No              No

MAX Types vs. Regular (n)varchar

In a previous section I discussed the problems with joining nvarchar and varchar. When I ran my performance tests and investigated some unexpected results, I discovered a second problem of a similar kind. Consider this:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexednvarcharcol = l.nvarcharmaxcol

The list-to-table function is here written in such a way that its return type is nvarchar(MAX). This too leads to an implicit conversion of the indexed column. It may not be apparent at first sight that it has to be that way, but when SQL Server evaluates an expression, it always works with the same data type for all operands. And apparently, nvarchar(4000) and shorter is a different data type from nvarchar(MAX). The result of the implicit conversion is not fatal. The optimizer applies a range-seek operator and is still able to use the index, but nevertheless there is an overhead. When I initially ran my tests, I had not observed this issue, and my inline functions returned nvarchar(MAX) (for the simple reason that the input string was nvarchar(MAX)). As a consequence, my tests in some cases seemed to indicate that inline functions performed worse than the corresponding multi-statement solutions.

Presumably, most of the time when you use list-to-table functions for a list of strings, the list elements are short, just a few characters long. Therefore, there is every reason to make sure that your list-to-table function returns a regular varchar or nvarchar. In particular, this means that for inline functions, you should make sure that the return value is explicitly converted to regular (n)varchar. You will see this in all inline functions in this article.
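If you cannot change the function itself, the same kind of call-site cast as in the previous section works here as well (again with hypothetical names; beware that a cast to nvarchar(4000) silently truncates longer elements):

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l
  ON   t.indexednvarcharcol = convert(nvarchar(4000), l.nvarcharmaxcol)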

Collations

All functions in this article use nvarchar for parameters, output and internal variables alike. If you never work with Unicode data, you may think that you should rewrite the functions to use varchar instead, assuming that 8-bit characters are faster for SQL Server to work with than the 16-bit Unicode characters.

This may or may not be the case, depending on which collation you are using. As I discussed above under varchar vs. nvarchar, there are two sorts of collations: Windows collations and SQL collations. If you use a Windows collation, you get a slight reduction in performance if you use varchar rather than nvarchar. This is because with a Windows collation, the Unicode rules and routines are always employed internally, so all that varchar buys you is extra conversions to and from Unicode.

On the other hand, with an SQL collation you can get some 30 % improvement in execution time by using varchar instead. This is because SQL collations are 8-bit only, for which there exists a separate set of 8-bit-only routines, and the rules for an 8-bit character set are far simpler than those for Unicode. (If you have an SQL collation and use nvarchar, you are in fact using a Windows collation under the cover.)

Note here that the exact gain depends on the type of operation. 30 % is what you can expect from a plain equality test. There are situations where the difference between varchar and nvarchar in an SQL collation can be as much as a factor of 7. We will look at such a case in the section on really slow methods.

If you have to use nvarchar, there is still a way to evade the overhead of the Unicode routines: use a binary collation. Now, if you opt to use a binary collation throughout your database, you will have to accept that all comparisons are case-sensitive and that sorting is funky, not least for languages other than English. So for most applications, using a binary collation is not a viable option. However, there exists a second possibility: force the collation for a certain expression. I have employed this throughout this article where it makes sense. You will see a lot of things like:

charindex(@delimiter COLLATE Slovenian_BIN2, @list, @pos + 1)

Since @delimiter is cast to an explicit collation, this also happens with @list. (This is discussed in Books Online in the topic Collation Precedence.) When using charindex to find a delimiter, odds are good that you are looking for the exact delimiter and you have no need for case- or accent-insensitive searches. Thus, using a binary collation in this situation does not lead to any loss in functionality. When I tested this for the iterative method, I got some 10 % improvement in execution time.

(Why Slovenian? And why BIN2? Well, I had to pick something, but which binary collation you chose does not matter. I used Slovenian because my test data happens to be a list of Slovenian words from a spelling dictionary. And I selected BIN2, since these newer binary collations resolve some oddities of the older BIN collations. But as I said, all that matters in this case is that you select a binary collation.)

Scalability in a Multithread Environment

When I discuss performance in this article, I'm mainly concerned with the method as such, since this is mainly what I have tested. But if you believe that you will have many users concurrently submitting comma-separated lists or similar, you will also be interested in how these methods scale. In my latest suite of performance tests in 2009, I made a feeble attempt to test this, but the tests were not very conclusive. You can read about my observations in the second performance appendix, but save for a few places, I'm generally quiet on this point in the main article.

How To Use These Functions – JOIN vs. EXISTS

Once you have such a function, how would you use it? Well, you have already seen examples of this:

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter$simple_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9, 12, 27, 37'

But you could also write the query as:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
WHERE  P.ProductID IN
      (SELECT i.number FROM iter$simple_intlist_to_tbl(@ids) i)

Or:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
WHERE  EXISTS (SELECT *
               FROM   iter$simple_intlist_to_tbl(@ids) i
               WHERE  P.ProductID = i.number)

In this article I almost exclusively use the first model, using joins. Does this mean that this is the best way to write such queries? It turns out that this is a more delicate question than it may seem at first glance.

Before I start, let me first say that there is no practical difference between IN and EXISTS: SQL Server rewrites them to the same internal representation before the optimizer gets to see the query. So it is only a question of whether to use a join on the one hand or IN/EXISTS on the other.

Next thing to note is that if you need to get the list position, you need to write a join, because then you are using the table function for more than just checking existence. But if we overlook that part, we can see that the join query above has a potential problem that the other two do not have. To wit, what if there are duplicates in the input list? Now, obviously, it could be the case that the business rules are such that if 27 appears multiple times in the list, Schoggi Schokolade should appear multiple times in the output. And in such case, the join is the only reasonable option.

But in most cases, you don't want duplicates in the output, and this means that if you use a join query, you need to eliminate duplicates with the DISTINCT keyword. And this means that SQL Server needs to put in extra work to sort (or hash) the data to get rid of the duplicates. With IN or EXISTS, you don't need DISTINCT, because you are only selecting rows from one table. Thus, EXISTS is leaner? Nah, not really. If you look at the IN/EXISTS queries above, they are in practice very similar to a join + DISTINCT, because that is how SQL Server will implement them.
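To make this concrete, here is what the join + DISTINCT variant could look like with the simple function from the beginning of the article; the duplicated 27 in the list yields only one row for that product thanks to DISTINCT:

DECLARE @ids varchar(50)
SET @ids = '9, 12, 27, 27, 37'

SELECT DISTINCT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   iter$simple_intlist_to_tbl(@ids) i ON P.ProductID = i.number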

So it doesn't matter which we use then? Maybe in some cases. The more complicated the query gets, the more likely there will be differences. If you use DISTINCT + JOIN, DISTINCT applies to the entire result set, which may yield a different result than using IN or EXISTS. In the latter case, SQL Server knows that it only has to eliminate duplicates from the table function. And the query plan may for other reasons include a hash or sort operator that can perform elimination of duplicates, so that there is no real extra cost.

Besides all the performance ponderings, there is the purist view: you should express your intent clearly, and if you mean to include the table function to check for existence, you should state so in the query.

After all this rambling, it seems that I have arrived at the conclusion that IN/EXISTS are better than joins. So why do I use joins in the article? There are two reasons. One is kind of shameful: when I first wrote the article for SQL 2000 many years ago, my brain was on auto-pilot, and I just used what first came to mind. The other reason is more heavy-duty: when I ran my performance tests in 2009 I added tests with EXISTS queries, and for all T-SQL inline functions without exception this resulted in disastrous query plans. The execution time for an EXISTS query with a list of 20 elements ranged from 25 to 40 seconds. The corresponding execution time for the join queries was a few milliseconds in most cases. This abysmal performance was far less common with join queries.

As I discussed above in the section Inline, Multi-Statement and Temp Tables, inline T-SQL functions give the optimizer a degree of freedom which is not really good for its health. I would like to stress that my tests do not prove that SQL Server always freaks out on T-SQL inline functions when EXISTS is involved. But nevertheless, it happened eight times out of eight in my tests, and while some functions are similar to each other, there are also some that are entirely different in nature.

So for this reason, I have retained the queries in the article as join queries. I have not added DISTINCT to them. After all, in many cases, you probably have control over how the lists are constructed, and can be reasonably sure that they don't have duplicates.

At the end of the article there is a section of concluding performance remarks, where I further discuss this topic.

Unpacking Lists in a Table

Most examples in this article work with a single list being passed as a parameter into a function, and this is probably the most common case. But sometimes you find yourself working with a table like this one:

Modelid   Colours
A200      Blue, Green, Magenta, Red
A220      Blue, Green, Magenta, Red, White
A230      Blue, Green, Magenta, Red, Cyan, White, Black
B130      Brown, Orange, Red
B150      Yellow, Brown, Orange, Red

That is, the available colours for a model appear as a comma-separated list. Let me directly emphasise that this is an extremely poor design that violates a very basic principle of relational databases: no repeating groups. As a consequence of this, tables with this design are often very painful to work with. If you encounter this design, you should seriously consider changing the data model, so there is a sub-table with one row for each model and colour. Then again, to get there, you need to be able to split up these lists into rows.

In SQL 2000, it was not possible to call a table-valued function and pass a table column as parameter, but SQL 2005 adds the APPLY operator that permits you to do this. Here is a script that creates the table above and then unpacks it with a query:

CREATE TABLE models (modelid  char(4)      NOT NULL,
                     -- other columns like modelname etc.
                     colours  varchar(200) NOT NULL,
                     CONSTRAINT pk_models PRIMARY KEY (modelid))
go
INSERT models (modelid, colours)
  SELECT 'A200', 'Blue, Green, Magenta, Red'
  UNION
  SELECT 'A220', 'Blue, Green, Magenta, Red, White'
  UNION
  SELECT 'A230', 'Blue, Green, Magenta, Red, Cyan, White, Black'
  UNION
  SELECT 'B130', 'Brown, Orange, Red'
  UNION
  SELECT 'B150', 'Yellow, Brown, Orange, Red'
go
SELECT m.modelid, t.str AS colour
FROM   models m
CROSS  APPLY  iter_charlist_to_tbl(m.colours, ',') AS t
ORDER  BY m.modelid, t.str

(The code for iter_charlist_to_tbl will appear shortly.)

Just like the JOIN operator, APPLY takes two table sources as its input. A table source is anything that exposes columns like a table: a view, a table-valued function, a derived table, a rowset function or a common table expression. (The latter is another new feature in SQL 2005 that I will return to in this article.) With JOIN, the table sources have to be autonomous from each other: for instance, a table-valued function cannot take a parameter from a table on the left side. But this is exactly what APPLY permits. The function is evaluated once for each row in the table source on the left side, and that row is exploded to as many rows as the table source on the right side evaluates to.

For completeness, I should add that there are two forms of APPLY: CROSS APPLY and OUTER APPLY. The difference lies in what happens when the table source on the right hand side returns no rows at all. With CROSS APPLY the row from the left side is lost, with OUTER APPLY it is retained.
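Here is the same query with OUTER APPLY. With the sample data above the result is identical, but had the function returned no rows for some model, that model would still appear in the output with NULL in the colour column:

SELECT m.modelid, t.str AS colour
FROM   models m
OUTER  APPLY iter_charlist_to_tbl(m.colours, ',') AS t
ORDER  BY m.modelid, t.str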

For more information on APPLY, see the topics FROM and Using Apply in Books Online.
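Coming back to the design issue raised above: if you do decide to normalize the model, the very same APPLY query can populate the new sub-table. A sketch, with a hypothetical modelcolours table:

CREATE TABLE modelcolours (modelid char(4)     NOT NULL,
                           colour  varchar(30) NOT NULL,
                           CONSTRAINT pk_modelcolours
                              PRIMARY KEY (modelid, colour))
go
INSERT modelcolours (modelid, colour)
   SELECT m.modelid, t.str
   FROM   models m
   CROSS  APPLY iter_charlist_to_tbl(m.colours, ',') AS t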

I will return to the topic of unpacking lists in a table in the final section Concluding Performance Remarks.

The Iterative Method

I will now describe the various methods to unpack lists into a table, one by one in detail.

I've opted to start with the iterative method. Together with using a recursive CTE, the iterative method needs the least amount of preparation. Just create the function and you are on the air. Most other methods require some extra step. Another advantage is that the code for the iterative method is very easy to understand, not least if you have a background in traditional programming languages. This makes things simple if you need to adapt the code to a special input format.

The method is far from the fastest in the bunch, but as long as you mainly work with lists of reasonable length, you will find performance acceptable. If you often work with lists of several thousand elements, you should probably investigate some of the faster methods.

List-of-integers

You have already seen an example of the iterative method in the beginning of this article, but I repeat it here:

CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos        int,
           @nextpos    int,
           @valuelen   int

   SELECT @pos = 0, @nextpos = 1

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
  RETURN
END

The idea is simple. We iterate over the string, look for commas, and then extract the values between the commas. Note the use of the third parameter to charindex; it specifies the position where to start searching for the next comma. The computation of @valuelen includes the only complexity: we must cater for the fact that charindex will return 0 when there are no more commas in the list.

However, this function is slower than it has to be. When I wrote the same function for SQL 2000 some years back, I applied a technique where I broke up the input into chunks. This was necessary, because in SQL 2000 there is no nvarchar(MAX), only ntext, and charindex operates only within the first 8 000 bytes of an ntext value. I had hoped that with nvarchar(MAX) chunking would not be necessary, but testing showed that by using chunks of nvarchar(4000) values, I could improve performance by 20-30 %. The culprit is charindex: it's slower on nvarchar(MAX) than on nvarchar(4000). Why, I don't know, but since nvarchar(MAX) values can be up to 2 GB in size, I assume that charindex needs a more complex implementation for nvarchar(MAX).

There is a second problem with iter$simple_intlist_to_tbl: if you for some reason feed it two consecutive commas, this will result in a 0 in the output, which isn't really good. While you can easily address this by adding some extra logic to the function, my preference is to avoid the problem by using space as a separator. The comma does not really fill any purpose for a list of integers.

So here is a better implementation of the iterative method for a list of integers:

CREATE FUNCTION iter_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       number  int NOT NULL) AS
BEGIN
   DECLARE @startpos int,
           @endpos   int,
           @textpos  int,
           @chunklen smallint,
           @str      nvarchar(4000),
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = ltrim(@leftover +
                    substring(@list, @textpos, @chunklen))
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @str = substring(@tmpstr, @startpos + 1,
                              @endpos - @startpos - 1)
         IF @str <> ''
            INSERT @tbl (number) VALUES(convert(int, @str))
         SET @startpos = @endpos
         SET @endpos = charindex(' ' COLLATE Slovenian_BIN2,
                                 @tmpstr, @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   IF ltrim(rtrim(@leftover)) <> ''
      INSERT @tbl (number) VALUES(convert(int, @leftover))

   RETURN
END

Here is an example on how you would use this function:

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9 12 27 37'

This function has two loops: an outer one that creates the chunks, and an inner one that iterates over the elements within each chunk. The first chunk is always 4 000 characters (provided that the input is that long, that is). As we come to the end of a chunk, we are likely to be in the middle of an element, which we save in @leftover. We bring @leftover with us to the next chunk, and for this reason we may grab fewer than 4 000 characters from @list this time. When we have come to the last chunk, @leftover is simply the last list element.

Multiple spaces are handled by simply ignoring @str if it's blank.

There are two things from the general considerations that I have added to this function that were not in iter$simple_intlist_to_tbl:

  1. The output table includes the list position.
  2. I use the COLLATE clause to force a binary collation to gain some further performance.

I would also like to note that there is a minor performance improvement over the version that appears in the SQL 2000 version of this article. Sam Saffron pointed out to me that I kept reallocating the string rather than using the third parameter of charindex.

List-of-strings

Here is a similar function, but one that returns a table of strings.

CREATE FUNCTION iter_charlist_to_tbl
                 (@list      nvarchar(MAX),
                  @delimiter nchar(1) = N',')
      RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                          str     varchar(4000)      NOT NULL,
                          nstr    nvarchar(2000)     NOT NULL) AS

BEGIN
   DECLARE @endpos   int,
           @startpos int,
           @textpos  int,
           @chunklen smallint,
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000),
           @tmpval   nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @tmpval = ltrim(rtrim(substring(@tmpstr, @startpos + 1,
                                             @endpos - @startpos - 1)))
         INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
         SET @startpos = @endpos
         SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2,
                                 @tmpstr, @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   INSERT @tbl(str, nstr)
      VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))
   RETURN
END

An example on how you would use this function:

CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   iter_charlist_to_tbl(@customers, DEFAULT) s
     ON   C.CustomerID = s.nstr
go
EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'

There are a few differences compared to iter_intlist_to_tbl. I've added a parameter to specify the delimiter, since you may need to use different delimiters depending on the data. Note that if you want to use the default value of a parameter to a table-valued function, you cannot simply leave out the parameter; you have to specify the keyword DEFAULT. Another addition is that I trim off space that appears directly adjacent to the delimiters, but space within the list elements is retained.
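For instance, with a pipe as delimiter, commas can appear inside the list elements (compare the section Delimiter in the Input above):

SELECT listpos, str
FROM   iter_charlist_to_tbl(N'Dallas, TX|Berlin|København', N'|')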

I would like to draw attention to the use of datalength. There are two system functions in T-SQL to return the length of a string: len returns the number of characters and does not count trailing spaces; datalength returns the number of bytes (whence all these / 2), and includes trailing spaces. I'm using datalength here, since there is no reason to ignore trailing spaces in the chunks – they could be in the middle of a list element.
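A small illustration of the difference (the numbers assume nvarchar, that is two bytes per character):

DECLARE @s nvarchar(20)
SET @s = N'ABC   '
SELECT len(@s) AS len, datalength(@s) AS datalength
-- Returns 3 and 12: len ignores the three trailing spaces,
-- while datalength counts all six characters at two bytes each.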

Using the CLR

Introducing the CLR

SQL 2005 added the possibility to create stored procedures, functions etc in .NET languages such as C# and Visual Basic .NET, or any language that supports the Common Language Runtime.

If you have never worked with the CLR before, you may find that this method goes a little over your head, and you may prefer to use a pure SQL method. On the other hand, if you are a seasoned C# or VB programmer, you will surely appreciate this method. Just like the iterative method, this method lends itself very easily to modifications to adapt to special input formats.

There are many ways you can abuse the CLR and use it when you should not, but a list-to-table function is a prime example of what the CLR is good for: operations that do not perform any data access, but that perform complex manipulations of strings or numbers. The reason for this is two-fold: 1) The CLR gives you a much richer set of functions to work with, regular expressions just to name one. 2) The CLR languages are compiled, while T-SQL is interpreted, leading to much better performance with the CLR. In the realm of table-valued functions there is another factor that improves performance: the output from a table-valued CLR function is not written into any intermediate storage, but the rows are fed into the rest of the query as soon as they are produced. So in that sense they are inline functions, but in contrast to T-SQL's own inline functions, the optimizer has no idea what they will produce, which is why I refer to them as opaque inline.

A special quirk of CLR functions is that they cannot return varchar data – you can only return nvarchar. This means that when you work with lists of strings, you must always remember to convert the output to varchar when you join with varchar columns in tables, as I discussed in the section varchar vs. nvarchar.
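For instance, with a hypothetical table tbl that has an indexed varchar column, a join against the CLR function created later in this chapter could read (pick a length that covers your data):

SELECT ...
FROM   tbl t
JOIN   CLR_charlist_split(@list, DEFAULT) s
  ON   t.indexedvarcharcol = convert(varchar(40), s.str)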

By default, running user-written CLR code is disabled in SQL Server. You can enable the CLR by running

EXEC sp_configure 'CLR enabled', 1
RECONFIGURE  

from a query window.

In the following I will try to give a crash course in how to write a table-valued function in the CLR. Seasoned .NET programmers may find it inaccurate in places – I will have to admit that these were the first CLR table functions I ever wrote. (I would have preferred to refer you to Books Online, but I found the topic on CLR table-valued functions in Books Online to be far too terse.) In the interest of brevity, I'm only including examples in C#.

CLR Functions Using Split

We will look at two ways of implementing a list-to-table function in the CLR. The first one, with very little of our own code, serves as an introduction to CLR table functions rather than as an example of a function you actually should use. In the second alternative, we will roll our own, which allows for a higher degree of flexibility and also better scalability.

The Code

A complete C# file that implements two list-to-table functions, one for strings and one for integers, need be no longer than this:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_split
{
    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerable CLR_charlist_split(SqlString str,
                                                 SqlString delimiter)
    {
        return str.Value.Split(delimiter.Value.ToCharArray(0, 1));
    }

    public static void CharlistFillRow(object row, out string str)
    {
        str = (string)row;
        str = str.Trim();
    }

    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerable CLR_intlist_split(SqlString str)
    {
        return str.Value.Split((char[]) null,
                               System.StringSplitOptions.RemoveEmptyEntries);
     }

    public static void IntlistFillRow(object row, out int n)
    {
        n = System.Convert.ToInt32((string) row);
    }
}

Compile and Install

To compile this, open a command-prompt window and make sure that you have C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727 or corresponding in your path. (Later versions of the .NET Framework will work as well, but 1.1 will not.) Assuming that the name of the file is CLR_split.cs, the command is:

csc /target:library CLR_split.cs

This gives you CLR_split.dll. If your SQL Server is not on your local machine, you will have to copy the DLL to the server box, or make the DLL visible from the server in some way. Then run from a query window:

CREATE ASSEMBLY CLR_split FROM 'C:\somewhere\CLR_split.dll'
go
CREATE FUNCTION CLR_charlist_split(@list nvarchar(MAX),
                                   @delim nchar(1) = N',')
RETURNS TABLE (str nvarchar(4000))
AS EXTERNAL NAME CLR_split.CLR_split.CLR_charlist_split
go
CREATE FUNCTION CLR_intlist_split(@list nvarchar(MAX))
RETURNS TABLE (number int)
AS EXTERNAL NAME CLR_split.CLR_split.CLR_intlist_split
go

(Note: it is also possible to deploy the functions from Visual Studio, but I'm not showing you that, because I don't know how to do it myself. Visual Studio mainly leaves me in a maze, and at the same time I find the command-line very simple to use. What I have been told is that VS may require you to add extra attributes to the functions if you want to deploy the functions that way.)

You have now created the functions and can use them from T-SQL. Here is an example for both:

CREATE PROCEDURE get_company_names_clr @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   listtest..CLR_charlist_split(@customers, DEFAULT) s
     ON   C.CustomerID = s.str
go
EXEC get_company_names_clr 'ALFKI, BONAP, CACTU, FRANK'
CREATE PROCEDURE get_product_names_clr @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   CLR_intlist_split(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_clr '9 12 27  37'

As with iter_intlist_to_tbl, CLR_intlist_split takes a space-separated list of integers.

What's Going On?

If you have never worked with CLR table functions before, you may at this point wonder how this all works, and I will try to explain.

CREATE ASSEMBLY loads the DLL into SQL Server. Note that it does not merely save a pointer to the file; the DLL as such is stored in the database. Since CREATE ASSEMBLY operates from SQL Server, the file path refers to the drives on the server, not on your local machine. (If you are loading the assembly from a network share, it's better to specify the location by \\servername than by drive letter.) It is also possible to load an assembly as a hex-string.

The CREATE FUNCTION statements look just like the statements for creating multi-statement functions. That is, you specify the parameter list and the return table. But instead of a body, AS is followed by EXTERNAL NAME where you specify the CLR method to use. This is a three-part name where the first part is the assembly, the second part is a class within the assembly, and the last part is the name of the method itself. In this example, I'm using the same name for the assembly in SQL Server as I do for the class.

There is one small detail on the return table: For a multi-statement function you can specify that a column is nullable, you can define CHECK and DEFAULT constraints and define a PRIMARY KEY. This is not possible for CLR functions.

If we turn to the C# code, the table-valued function is implemented through two C# methods. The first method is the one that we point to in the CREATE FUNCTION statement. The second method is specified through the attribute that comes first in the definition of the CLR_charlist_split method. That is, this line:

[SqlFunction(FillRowMethodName="CharlistFillRow")]

This line specifies that the method is a table-valued function and points to the second method of the function. CLR_charlist_split is the entry point and is called once. The entry point must return a collection or an enumerator, and the CLR will call the method specified in FillRowMethodName once for every element in the collection/enumerator, and each invocation produces a row in the output table of the function.

So this is what happens when you call CLR_charlist_split from T-SQL. The C# method calls the String method Split which splits the string into a collection over a delimiter. (For full details on Split I refer you to the .NET Framework SDK in MSDN Library.) Since you get a collection, you need do no more. The CLR calls CharlistFillRow for each element in the collection. And as I noted above, as soon as a row is produced, it can be consumed in the outer query, without waiting for the table function to complete.

What about parameters? As you may guess, the parameter list of the entry method must agree with the parameter list in the CREATE FUNCTION statement. The exact rules for mapping SQL data types to those of the CLR are beyond the scope of this text, please refer to Books Online for the full details.

The first parameter of the fill method (CharlistFillRow) is of the type object. This is the current element in the collection/enumeration and to use it, you will need to cast it to the real type. The remaining parameters to the fill method are all output parameters, and they map to the output table in the CREATE FUNCTION statement.

One more thing calls for attention: the return type of the entry function. In this example it is IEnumerable, since Split returns a collection. The only other alternative is IEnumerator, which we will look at shortly.

Back on Track

After this excursion into CLR, let's get back to the topic of list-to-table functions. What are the characteristics of the list-to-table functions that use Split?

As you can see, we don't return the list position. As far as I know, you cannot get the list position this way, but I have to admit that I have not dug into it. Overall, Split puts you very much in a straitjacket: you can only be as flexible as Split is. That includes specifying an alternate string delimiter, which can be multi-character, and specifying that empty elements should not be returned (which I make use of in the function for integers). But that's it.

There is another thing that is problematic with Split. As I said, it splits up the entire string into a collection, and that collection takes up memory. Of course, if you send in an occasional string with 10 elements, that is not an issue. But what if you have a busy website with 10 000 concurrent users? SQL Server MVPs Paul Nielsen and Adam Machanic have both reported that they have tried to use CLR functions based on Split with disastrous results. If SQL Server comes under memory pressure, the AppDomain may be unloaded, which may cause a CLR function based on Split to crash. Or at least the AppDomain will have to be reloaded, which takes a couple of seconds. (If you are not acquainted with the term AppDomain, I have to admit that it is a bit of mumbo-jumbo to me as well, but it is something that hosts the CLR execution. I kindly refer you to Books Online and the .NET documentation for details.) In short: methods based on Split can lead to scalability and stability issues on a busy system.

Rolling Our Own in the CLR

Instead of relying on Split, we can do the work ourselves. The advantage is that we gain flexibility, and we also avoid the stability issues with Split. Here is a C# file that is longer than the previous one:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_iter
{

    private class stringiter : IEnumerator
    {
        string  _str;
        char    delim;
        int     _start_ix;
        int     _end_ix;
        int     _listpos;


        public string str {
           get {
              return this._str;
           }
        }

        public int start_ix {
           get {
              return this._start_ix;
           }
        }

        public int end_ix {
           get {
              return this._end_ix;
           }
        }

        public int listpos {
           get {
              return this._listpos;
           }
        }


        public stringiter(SqlString  str,
                          SqlString  delimiter) {
           this._str  = str.IsNull ? "" : str.Value;
           this.delim = delimiter.IsNull ? '\0'
                               : delimiter.Value.ToCharArray(0, 1)[0];
           Reset();
        }

        public bool MoveNext() {
            this._start_ix = this._end_ix + 1;
            if (delim == ' ') {
               while (this._start_ix < this._str.Length &&
                      this.str[this._start_ix] == ' ') {
                  this._start_ix++;
               }
            }

            if (this._start_ix >= this._str.Length) {
               return false;
            }

            this._end_ix = this.str.IndexOf(this.delim, this._start_ix);
            this._listpos++;
            if (this.end_ix == -1) {
               this._end_ix = this._str.Length;
            }
            return true;
        }

        public Object Current {
          get {
            return this;
          }
        }

        public void Reset() {
           this._start_ix = -1;
           this._end_ix   = -1;
           this._listpos  = 0;
        }
    }

    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerator CLR_charlist_iter(SqlString str,
                                                SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void CharlistFillRow(object     obj,
                                       out int    listpos,
                                       out string str)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        str = iter.str.Substring(iter.start_ix,
                                 iter.end_ix - iter.start_ix);
        str = str.Trim();
    }


    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerator CLR_intlist_iter(SqlString str,
                                               SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void IntlistFillRow(object     obj,
                                      out int    listpos,
                                      out int    number)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        string str = iter.str.Substring(iter.start_ix,
                                        iter.end_ix - iter.start_ix);
        number = System.Convert.ToInt32(str);
    }
}

The key is the internal class stringiter. First note the class declaration itself:

private class stringiter : IEnumerator

This means that the class implements the IEnumerator interface, which is a requirement for a table-valued function. (That, or IEnumerable.) Next follow the internal class variables, and property methods to read these values from outside the class. The next piece of interest is the constructor:

public stringiter(SqlString  str,
                  SqlString  delimiter) {

This method creates an instance of the stringiter class. This constructor is called from the entry-point method of the table-valued function.

Next follow MoveNext, Current and Reset. These are the methods that implement IEnumerator, and they must have precisely the names and signatures that you see above. (For more details on IEnumerator, I refer you to the .NET Framework SDK.) The interesting action goes on in MoveNext. It is here we look for the next list element and it is here we determine whether we are at the end of the list. As long as MoveNext returns true, the fill method of the table function will be called. That is, MoveNext should not return false when it finds the last element, but the next time round. (I hope that I got that right. I was not really able to conclude that from the docs, but I had to play around myself.)

What is interesting from a list-to-table perspective is that MoveNext handles space as separator in a special way: multiple spaces are collapsed into one. This does not happen with other delimiters.

After the code for the stringiter class come two entry-point methods with accompanying fill methods, one for strings and one for integers. In contrast to the previous example with Split, the entry points here return IEnumerator, since we implement IEnumerator ourselves. But just as in the Split example, all the entry points do is create a stringiter object.

The fill methods, finally, extract the data from the current stringiter object. Most noticeable is that I grab hold of the list position, so that it can appear in the output table.

For completeness, here is the SQL declaration of the functions:

CREATE FUNCTION CLR_charlist_iter(@list nvarchar(MAX),
                                  @delim nchar(1) = ',')
RETURNS TABLE (listpos int,
               str     nvarchar(4000))
AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_charlist_iter
go
CREATE FUNCTION CLR_intlist_iter(@list nvarchar(MAX),
                                 @delim nchar(1) = ' ')
RETURNS TABLE (listpos int,
               number  int)
AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_intlist_iter
go
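
Assuming the assembly has been loaded and the functions created as above, a quick way to try them out could look like this (the test strings are only for illustration):

SELECT listpos, str    FROM CLR_charlist_iter(N'Alpha, Beta, Gamma', N',')
SELECT listpos, number FROM CLR_intlist_iter(N'9 12 27  37', N' ')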

At first glance, it may seem that these functions do not give you that much more than Split. You get the list position, and the intlist function has a delimiter parameter, but most of all you seem to get a lot more code. But there is more here than meets the eye. The above gives you a framework to work with if you need to handle more complex list formats or some other functionality not supported by the functions as given. Here, I have used the same enumerator class for strings and integers, but the fancier you want to be, the more likely it is that you are better off with separate classes.

What About Performance?

Overall, using a CLR function is one of the fastest methods around. Of the methods presented in this article, it is generally only the fixed-length method, which we will look at later, that can compete. On top of that there are table-valued parameters in SQL 2008, which I discuss in a separate article.

Since the CLR is fast, it's no wonder that there is an ongoing battle in the blogosphere about finding the very fastest function. SQL Server MVP Adam Machanic presented what he thinks is the best so far in this blog post. Instead of using the String class and the IndexOf method, he loops over a Char array and looks for the delimiters himself.

However, when I have run my performance tests, the performance of the CLR functions I have tested has been virtually identical. I tested the two functions above, Adam's, and one more which takes fixed-length input rather than a list of values (more about this format later in the article). Only with extremely long input, 50 000 list elements, can I sense a small advantage for the CLR_iter solution above, with Adam's solution tending to be the slowest. But the difference is very small, maybe 20 milliseconds when the total execution time is half a second or more.

I should immediately qualify this. Adam's function handles multi-character delimiters, and mine doesn't. And to be able to handle multi-character delimiters, Adam does the right thing. If you were to take my functions above and just change the definition, you would lose some performance directly. This is because when you pass IndexOf a Char parameter for the delimiter, the comparison is culture-insensitive. But if you instead pass the delimiter as a String – which you would need to do for a multi-character delimiter – you invoke an overload of IndexOf that is culture-sensitive. Exactly what this implies, I don't know, but my assumption is that culture-insensitive implies a binary comparison, whereas culture-sensitive buys you the full Unicode battery. And indeed, in the past I have tested functions with a String parameter to IndexOf, and they had a small performance penalty – about in the same range as the one I get when I don't use a binary collation in the T-SQL methods.

There is nevertheless one performance flaw in Adam's function as posted: he only provides a function that returns strings, so if you have a list of integers, you need to apply convert in your T-SQL code. When testing, I found that there was a small performance benefit to having the conversion done in the C# code rather than in the T-SQL code. (This is not covered in the performance appendixes, because by the time I made that particular test, I had already compiled the results.)

As I have mentioned, in my latest suite of tests, I added a multi-thread test to see how the various methods scale in a multi-user environment. I did not have the opportunity to run these tests on server-class hardware, and I have limited faith in the accuracy and the relevance of the results. However, the CLR methods emerged as clear winners: they were consistently among the methods that scaled the best. (I should immediately add that due to an issue in my test setup, I was not able to include table-valued parameters in the multithread tests.)

Interestingly enough, the functions based on Split performed well in my tests even when I ran with 100 parallel threads, and I did not see the issues that Adam and Paul have reported. I have a theory that these crashes are more likely to appear on 32-bit SQL Server, where the memory space beyond 2 GB is available only to the buffer cache, and everything else has to be shoehorned into the lower 2 GB. But that is mere speculation on my part.

XML

Rather than passing a comma-separated list, you can create an XML document to pass your list of values. Undeniably, this is more complex, so in most cases you may not find it worth the trouble as long as it's only a list of single values. Where XML really shines is when you need to insert many rows: you transform your data to an XML document in the client, and then you unpack it into your tables with the xml type methods.

Using XML for a List of Values

Let's go straight to the matter. Here are our get_company_names and get_product_names using XML.

CREATE PROCEDURE get_company_names_xml @customers xml AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   @customers.nodes('/Root/Customer') AS T(Item)
     ON   C.CustomerID = T.Item.value('@custid[1]', 'nchar(5)')
go
EXEC get_company_names_xml
    N'<Root><Customer custid="ALFKI"/>
            <Customer custid="BONAP"/>
            <Customer custid="CACTU"/>
            <Customer custid="FRANK"/>
      </Root>'
go
CREATE PROCEDURE get_product_names_xml @ids xml AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @ids.nodes('/Root/Num') AS T(Item)
     ON   P.ProductID = T.Item.value('@num[1]', 'int')
go
EXEC get_product_names_xml N'<Root><Num num="9"/><Num num="12"/>
                             <Num num="27"/><Num num="37"/></Root>'

The two xml type methods we use are nodes and value. nodes is a rowset function that returns a one-column table where each row is an XML fragment for the given path. That is, in the two examples, you get one row for each Customer or Num node. As with derived tables, you must specify an alias for the return table. You must also specify a name for the column that nodes produces. (You can name columns in derived tables in this way too, but it's not that common to do so.)

The sole operation you can perform on T.Item is to employ one of the four xml type methods exist, query, nodes and value. Of interest to us here is value. This method extracts a single value from the XML fragment. value takes two parameters: the first is a node specification for a single element or attribute. In this case we want an attribute – custid and num respectively – which is why we specify it with @ in front. (Without @ it would be an element specification.) As for the meaning of [1], I will come back to that. The second argument to value is the T-SQL data type for the return value.

List Position

I thought for a long time that you cannot retrieve the list position from an XML document, as there is no xml type method that returns this information. Then SQL Server MVP Alejandro Mesa reminded me that you can try:

row_number() OVER(ORDER BY T.c)

That is, you order by the XML fragment produced by the nodes method. Digging into the query plan, and looking at the Sort operator, I find that it sorts by this expression:

[XML Reader with XPath filter].id

Looks good, doesn't it? But beware! This is not documented, and it could be something that works only by chance. (That would not be the first time in the history of SQL Server when it comes to ordering.) If you look at it from a logical point of view, what would ordering by an XML fragment even mean? That is just nonsense! Furthermore, if you try:

SELECT ...
ORDER  BY T.c

you will be told that this is not legal. I suspect that it is accepted in the OVER() clause only due to an oversight. Thus, my recommendation is that you stay away from numbering by the context node.

Thankfully, there is a fully supported alternative, and again credit goes to Alejandro Mesa for finding a clever solution. But to explain it, we need to dig a little deeper into XPath, XQuery and the xml type methods. We also need another device that I will introduce in the next chapter. So I will return to the numbering of XML documents in the section XML Revisited, later in this article.

Creating the XML Document

It may seem simple to create XML documents by adding brackets etc. in code. Maybe all that is needed is some replace in T-SQL? OK, for a list of numbers this is not too difficult to achieve. But if you have a list of strings, don't consider crafting the XML by mere string manipulation. There are many characters that have a special meaning in XML and need to be encoded. For instance, the ampersand character (&) is encoded as &amp; in XML. Therefore, you should use library routines that know about all this to create your XML documents. Unfortunately, I cannot show you any examples, as I have never had the need to do this myself. But digging around in MSDN Library should give you something, no matter whether you are programming in .NET or in native code.
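
For the list-of-numbers case, a minimal replace-based sketch in T-SQL could look like this (it assumes the list really contains nothing but digits and commas – again, do not try this for strings):

DECLARE @list nvarchar(MAX),
        @x    xml
SELECT @list = N'9,12,27,37'
SELECT @x = convert(xml, N'<Root><Num num="' +
                         replace(@list, N',', N'"/><Num num="') +
                         N'"/></Root>')
SELECT @x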

Performance

Just like with the CLR, the data from nodes streams into the rest of the query, so XML is also an inline method. But since parsing XML is more complex than parsing comma-separated lists, it's slower. In my tests, the execution times for XML are 40-60 % higher than for the CLR, but it is twice as fast as the iterative method. In fact, XML is the fastest method that does not require any preparations in the server: you don't have to activate the CLR and create a function, nor do you need to define a table type for a table-valued parameter. On the other hand, if the client code already produces a comma-separated list, you need to change that code.

You may guess that there is a higher performance cost for passing an XML document, and indeed there is. In my tests I measured this to be 20-25 ms for 10 000 list elements, to compare with 5-7 ms for most other methods. It's hardly a cause for alarm.

That said, it cannot be denied that there are problems with the performance of the xml type methods and XQuery in SQL Server. When the optimizer gets it right, performance is often very good. But when the optimizer gets it wrong, performance can be very bad. From one point of view, this is not entirely surprising: the optimizer has very little idea of what is in that XML document, and rather than using estimates based on data, it has to rely on standard assumptions. Unfortunately, one cannot escape the feeling that the optimization of XML queries is an unfinished chapter in SQL Server. Sometimes, seemingly meaningless changes can have a big impact on performance. Above I wrote the expression to extract a number from the XML document as:

T.Item.value('@num[1]', 'int')

But I did not say what [1] means. The answer is that it means – nothing! @num and @num[1] mean the same thing: retrieve the attribute num in the current context node. Unfortunately, SQL Server does not seem to understand this, and will generate different query plans depending on whether [1] is there or not. In many cases the plan with only @num will be a teeny-weeny bit more efficient, because it has fewer operators. But because of its misperception of what @num implies, the optimizer sometimes produces a really bad plan. This is less likely to happen if you add [1]. Thus, my recommendation is that you always add [1] when you address attributes, although there is no logical reason why you should have to. (And don't be surprised if you find queries where removing [1] gives you a vast performance improvement. Query plans when XQuery is involved sometimes feel very much like a lottery.)

One practical conclusion of this is that you should be restrictive about entangling the shredding of the XML document with the rest of your query. It may be better to keep it simple: insert the data into a temp table first, and benefit from the fact that temp tables have statistics.
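
For instance, for the list of product IDs, a minimal sketch of this approach could look like this, assuming the same @ids xml parameter as in get_product_names_xml above (the temp-table name is only for illustration, and DISTINCT guards the primary key against duplicates in the list):

CREATE TABLE #prodids (ProductID int NOT NULL PRIMARY KEY)

INSERT #prodids (ProductID)
   SELECT DISTINCT T.Item.value('@num[1]', 'int')
   FROM   @ids.nodes('/Root/Num') AS T(Item)

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   #prodids i ON P.ProductID = i.ProductID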

Inserting Many Rows

I said that where XML is really good is when you need to insert many rows. So how would you do that? In fact you have already seen the basics. You use nodes and value. Here is an example where I unpack a document with orders and order details:

DECLARE @x xml
SELECT @x =
  N'<Orders>
      <Order OrderID="13000" CustomerID="ALFKI"
             OrderDate="2006-09-20Z" EmployeeID="2">
         <OrderDetails ProductID="76" Price="123" Qty = "10"/>
         <OrderDetails ProductID="16" Price="3.23" Qty = "20"/>
      </Order>
      <Order OrderID="13001" CustomerID="VINET"
             OrderDate="2006-09-20Z" EmployeeID="1">
         <OrderDetails ProductID="12" Price="12.23" Qty = "1"/>
      </Order>
    </Orders>'
SELECT OrderID    = T.Item.value('@OrderID[1]',    'int'),
       CustomerID = T.Item.value('@CustomerID[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('@OrderDate[1]',  'datetime'),
       EmployeeId = T.Item.value('@EmployeeID[1]', 'smallint')
FROM   @x.nodes('/Orders/Order') AS T(Item)
SELECT OrderID    = A.Item.value('@OrderID[1]',    'int'),
       ProductID  = B.Item.value('@ProductID[1]',  'smallint'),
       Price      = B.Item.value('@Price[1]',      'decimal(10,2)'),
       Qty        = B.Item.value('@Qty[1]',        'int')
FROM   @x.nodes('/Orders/Order') AS A(Item)
CROSS  APPLY A.Item.nodes('OrderDetails') AS B (Item)

As you see, the document can include data for several tables, and extracting the order-header information is straight-forward. We use nodes to get data from that level, and then we pick the attributes with value.
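In practice, you would of course wrap these SELECT statements in INSERT statements. A minimal sketch for the order headers, using a staging temp table (the table name and column definitions here are only for illustration):

CREATE TABLE #OrderHeaders (OrderID    int      NOT NULL PRIMARY KEY,
                            CustomerID nchar(5) NOT NULL,
                            OrderDate  datetime NOT NULL,
                            EmployeeID smallint NOT NULL)

INSERT #OrderHeaders (OrderID, CustomerID, OrderDate, EmployeeID)
   SELECT T.Item.value('@OrderID[1]',    'int'),
          T.Item.value('@CustomerID[1]', 'nchar(5)'),
          T.Item.value('@OrderDate[1]',  'datetime'),
          T.Item.value('@EmployeeID[1]', 'smallint')
   FROM   @x.nodes('/Orders/Order') AS T(Item)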

Extracting the detail rows is a little more interesting, because beside the information about product, quantity and price, we also need to get the order ID, which is in the node above. The way to approach this is to first run nodes on the Order level, and then use the CROSS APPLY operator to run nodes on the details for each order. (Recall that CROSS APPLY is like a join, but in contrast to a normal join, the right side can refer to columns from the left side.)

This latter may sound a little inefficient, so you may be tempted to try this instead:

SELECT OrderID    = T.Item.value('../@OrderID[1]', 'int'),
       ProductID  = T.Item.value('@ProductID[1]',  'smallint'),
       Price      = T.Item.value('@Price[1]',      'decimal(10,2)'),
       Qty        = T.Item.value('@Qty[1]',        'int')
FROM   @x.nodes('/Orders/Order/OrderDetails') AS  T (Item)

That is, you use the parent-axis notation to retrieve the order ID. But as I mentioned, XQuery optimization is not always the best in SQL Server, and parent access is a sore area. When I have tested, the above has worked well, but only if I included that [1] which does not mean anything. With only ../@OrderID I got a very bad query plan; in a test I ran, it took 10-30 seconds to extract 2155 detail rows from 830 orders, whereas when written as above, execution time was subsecond. I reported the poor performance in this Connect bug. Microsoft's response was that they recommend against using parent-axis notation, in favour of CROSS APPLY. And even if I was able to get good performance when I added the [1], I cannot guarantee that you will get a good query plan. The solution with CROSS APPLY seems to be more stable, which is why you should stick with it. (Nevertheless, I think Microsoft needs to clean up their act. This article suggested parent-axis notation – without [1] – for more than four years, since I was unaware of the shortcomings in performance.)

Element-centric XML

What you have seen above is attribute-centric XML. You can also use element-centric XML, where the XML document for product IDs would look like this:

<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>

For the first parameter to value you could give a single period to denote the current element to extract the values:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('.', 'int')

That looks really simple, but as you might guess from the fact that I've put it in red, you should not use this. When it works well, you can count on an overhead of 30 % compared to the query for attribute-centric XML. But it can also work really badly, particularly if the SELECT is part of an INSERT statement. When I first ran my tests on SQL 2005 SP1 in 2006, the results for element-centric XML were decent (that is, 30 % slower than attribute-centric), but when I ran my tests on SQL 2005 SP2 the results were disastrous. Unpacking a document with 10 000 numbers took four minutes – with SP1 this took less than 300 ms on the same machine. I filed this Connect bug, and it is still Active as of this writing (Oct 2010), but the issue appears to be addressed in SQL 2005 SP3. When I ran my tests for SQL 2008 SP1 in 2009, the performance was as bad as in SQL 2005 SP2, if not in all test cases. By chance I found in September 2010 that Microsoft had indeed fixed the bug in SQL 2008 as well, shortly after SQL 2008 shipped, and the fix is included in SQL 2008 SP1. But in their infinite wisdom, they have hidden the fix behind a trace flag. When this trace flag, 4130, is enabled, the performance is back to decent. You can also use trace flag 4199, which is the general trace flag to get all optimizer fixes.
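
For completeness, one way to enable the flag server-wide is with DBCC TRACEON (you can also add it as a -T startup parameter); a minimal sketch:

DBCC TRACEON (4130, -1)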

However, rather than enabling a trace flag to get better performance, there is a superior alternative. More than four years after I originally published this article, I learnt through SQL Server MVP Alejandro Mesa that there is a better way to write this query:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('(./text())[1]', 'int')

It sure is a strain on the eye, but this more complex expression has three advantages: 1) It is not subject to the bug that can cause bad query plans. 2) Execution time is on par with attribute-centric XML, or even somewhat better. 3) Empty elements are interpreted as NULL.

You may ask what that text() does, and whether it has a real meaning, or whether it is something equally silly as that [1]. The answer is that text() is for real. Consider this little batch:

DECLARE @x xml
SELECT @x = '<Root><Sub>One<Subsub>Two</Subsub>Three<Subsub>Four</Subsub></Sub></Root>'
SELECT T.c.value('.', 'nvarchar(20)') AS Dot,
       T.c.value('(./text())[1]', 'nvarchar(20)') AS Text
FROM   @x.nodes('Root/Sub') AS T(c)

The output is:

Dot                  Text
-------------------- --------------------
OneTwoThreeFour      One

That is, with '.' you get all text that is in the Sub node, including the subnodes. With text(), which is not a function but what is known as a node test, you only get the text outside the subnodes. Furthermore, you are compelled to specify which of the texts you want; that is, [] is compulsory. You can use [2] in place of [1], in which case you get back Three. (If you use [3]? Then you get back NULL.) Certainly, this XML document is a bit wretched, and for most XML documents in this world there is no practical difference between a mere '.' and the complex '(./text())[1]', but SQL Server has to assume the worst about the XML document you feed it, and this is why using only '.' will always be slower.

This example also displays the difference between attributes and elements. An attribute can only appear once within a node, and it cannot contain other elements or attributes, which is why attributes are quite a simpler affair. But it seems that SQL Server does not realise this: the query plans for attributes are very similar to those for element-centric XML. Presumably this explains why it may matter to add [1] for attributes.

Above I said that an advantage with text() is that empty elements are interpreted as NULL. In case you wonder what I mean, consider:

DECLARE @x xml
SELECT @x = '<Root><Value></Value></Root>'
SELECT T.c.value('.', 'int') AS "without", 
       T.c.value('(./text())[1]', 'int') AS "with text"
FROM   @x.nodes('/Root/Value') AS T(c)

The output from this is:

without     with text
----------- -----------
0           NULL

That is, if an element is present but has no actual text, you get a non-NULL value if you only use '.', which can yield unexpected results as demonstrated in this example, whereas with '(./text())[1]' you get back NULL, which is more likely to be what you want.

Inserting Many Rows from Element-centric Documents

In the examples with a list of values, the element-centric document is considerably simpler and shorter. But that is an exception. Normally, element-centric documents are far more verbose than the corresponding attribute-centric document. Here is an element-centric version of the order document above:

<Orders>
    <Order>
       <OrderID>13000</OrderID>
       <CustomerID>ALFKI</CustomerID>
       <OrderDate>2006-09-20Z</OrderDate>
       <EmployeeID>2</EmployeeID>
       <OrderDetails>
           <ProductID>76</ProductID>
           <Price>123</Price>
           <Qty>10</Qty>
       </OrderDetails>
       <OrderDetails>
           <ProductID>16</ProductID>
           <Price>3.23</Price>
           <Qty>20</Qty>
       </OrderDetails>
    </Order>
    <Order>
       <OrderID>13001</OrderID>
       <CustomerID>VINET</CustomerID>
       <OrderDate>2006-09-20Z</OrderDate>
       <EmployeeID>1</EmployeeID>
       <OrderDetails>
           <ProductID>12</ProductID>
           <Price>12.23</Price>
           <Qty>1</Qty>
       </OrderDetails>
    </Order>
</Orders>

To preserve space, I've set it in a small font. Here is how you would unpack it:

SELECT OrderID    = T.Item.value('(OrderID/text())[1]', 'int'),
       CustomerID = T.Item.value('(CustomerID/text())[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('(OrderDate/text())[1]',  'datetime'),
       EmployeeId = T.Item.value('(EmployeeID/text())[1]', 'smallint')
FROM   @x.nodes('Orders/Order') AS T(Item)

SELECT OrderID    = O.n.value('(OrderID/text())[1]',   'int'),
       ProductID  = D.n.value('(ProductID/text())[1]', 'int'),
       Price      = D.n.value('(Price/text())[1]',     'decimal(10,2)'),
       Qty        = D.n.value('(Qty/text())[1]',       'int')
FROM   @x.nodes('/Orders/Order') AS O(n)
CROSS APPLY O.n.nodes('OrderDetails') AS D(n)

In this case, too, you need to use text() to get best performance, and to escape the query-plan bug in SQL 2005 SP2 and SQL 2008. As with attribute-centric XML, the best way to get the order ID together with the details rows, is to first run nodes on Order level, and then descend to the details level for each order. Using parent-notation as in ../OrderID can easily yield very poor performance with Microsoft's unfinished XQuery implementation.

Typed XML

So far we have only looked at untyped XML, but it is possible to define an XML schema with the command CREATE XML SCHEMA COLLECTION. When you define a variable or column, you can then say xml(mycollection) rather than just xml. SQL Server will validate that the document adheres to that schema. This is known as typed XML. Using typed XML can have some positive impact on performance, since the schema informs the optimizer that some things cannot occur. For instance, this statement defines what the XML document with a list of values could look like:

CREATE XML SCHEMA COLLECTION elemnum AS
'<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root">
      <xs:complexType>
         <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Num" type="xs:int" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>'

(I'm afraid it's completely outside the scope of this article to explain this. Unfortunately Books Online does not really describe this either, but XSD is a standard defined by W3C.) Using this schema we can write:

DECLARE @ids xml(elemnum)
SELECT @ids = N'<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>'
SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('.', 'int')

You may note that I did not put the use of '.' in red this time. The XML schema states that a Num node includes one single number and that's all, which is why SQL Server does not need to consider the situation with text and subnodes interleaved, as in the example above. For this reason text() is not needed – in fact, text() is not even permitted with typed XML.

This performs well, and in some tests I was able to achieve significantly better performance than with untyped XML. But most of the time the schema appears to incur an overhead, and not least the validation needed to pass data to a typed XML parameter is quite expensive: 130-170 ms for 10 000 integers in my tests. Or about the same time it takes for the fastest methods to process that many elements at all. (And when I used typed element-centric XML in an EXISTS, the result was a disaster – more than two minutes to find 20 items in my test table. This beats all those inline functions by a wide margin.)

Thus, I would not consider using typed XML for a list of scalar values to be a serious option. What the schema buys you is protection against errors like misspelling an element or attribute name. But for a list of values you will probably find such errors quickly anyway. On the other hand, if you are using XML to pass more complex data sets around, using a schema can be worthwhile, not least if different teams are involved.

OPENXML

Already in SQL 2000 you could use XML, thanks to the OPENXML function. OPENXML is still around in SQL 2005 and SQL 2008, but about the only reason to use it would be that you need to support SQL 2000 as well. OPENXML is bulkier to use, and performance is nowhere near nodes. In fact, in my tests it was 30-50 % slower than the iterative method. One possible advantage with OPENXML, though, is that the query plan is simpler. Query plans for XQuery mix XML operators with relational operators, whereas OPENXML results in one single Remote Scan. This reduces the risk of the query-plan disasters you sometimes get with XQuery.

If you want to see an example of OPENXML, please refer to the SQL 2000 version of this article.

Using a Table of Numbers

The Concept

This is the fastest way to unpack a comma-separated list of numbers in pure T-SQL. That is, this is the section for you who want a fast function, but for some reason cannot use the CLR and do not want to change the format of the string either.

The trick is to use an auxiliary table of numbers: a one-column table with numbers from 1 and up. Here is how you can create a table with numbers from 1 to 999 999:

CREATE TABLE Numbers (Number int NOT NULL PRIMARY KEY);
WITH digits (d) AS (
   SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
   SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION
   SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION
   SELECT 0)
INSERT Numbers (Number)
   SELECT Number
   FROM   (SELECT i.d + ii.d * 10 + iii.d * 100 + iv.d * 1000 +
                  v.d * 10000 + vi.d * 100000 AS Number
           FROM   digits i
           CROSS  JOIN digits ii
           CROSS  JOIN digits iii
           CROSS  JOIN digits iv
           CROSS  JOIN digits v
           CROSS  JOIN digits vi) AS Numbers
   WHERE  Number > 0

First I set up a Common Table Expression (CTE, a concept that we will look more closely into in a later section) to define a virtual table that holds all digits from 0 to 9, and then I create a Cartesian product for as many powers of ten as I want numbers in my table.

Before I go on and show how to use this table to unpack a comma-separated list, I would like to point out that this sort of table has a much wider usage. Just to give you a taste of it, I will quickly show two examples. Say you have a long string where you want to count the frequency of the various characters. You can use Numbers to "loop" over the string to get the characters in tabular format:

DECLARE @str nvarchar(MAX)
SELECT @str = 'Insert your favourite novel here.'
SELECT substring(@str, Number, 1) AS ch, COUNT(*)
FROM   Numbers
WHERE  Number <= len(@str)
GROUP  BY substring(@str, Number, 1)
ORDER  BY ch

If this example is a little on the esoteric side, the next one is definitely a common one. You need to return the number of orders per day for a period, say a month. If there are days without any orders, they should still be included in the output, showing a zero for the count. To achieve this, you need something that spans the entire period for you, so that you can left join the period with the Orders table. This is exactly what Numbers is good for:

WITH dates AS (
   SELECT date = DATEADD(DAY, Number, '19000101')
   FROM   Numbers
)
SELECT d.date, COUNT(O.OrderDate)
FROM   dates d
LEFT   JOIN Northwind..Orders O ON O.OrderDate = d.date
WHERE  d.date BETWEEN '19970101' AND '19970131'
GROUP  BY d.date
ORDER  BY d.date

Since we need dates, I first define dates out of the numbers in a CTE, which I then work with. I'm making use of the fact that COUNT(col) will only count rows with non-NULL values in col.

An Inline Function

Let's now see how we can use Numbers to split a comma-separated list into a table. I will first present a T-SQL inline function, which captures the main logic. I then move on to present a multi-statement function that applies chunking. While I will study the inline function in detail, it is actually the chunking solution you should use in most cases.

Below is the core of the function inline_split_me, a function I originally got from SQL Server MVP Anith Sen. The version I got from Anith appeared in the SQL 2000 version of the article. The version below has a modification from Brian W Perrin.

CREATE FUNCTION inline_split_core(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT Value =
              substring(@param, Number,
                        charindex(',', @param + ',', Number) - Number)
       FROM   Numbers
       WHERE  Number <= len(@param)
         AND  substring(',' + @param, Number, 1) = ','
)          

Just below, I will make a couple of modifications to this function, but I wanted to show a "clean" version first, so that you can digest it more easily. As in the first example above, we use Numbers to "loop" over the string, and the first condition in the WHERE clause is our "loop" condition. The second condition looks for the string delimiter, but in order to find the first element in the list, we add a comma before the string. So with this WHERE condition, we stop at every comma in the string. Observe that since we inserted that initial comma, Number refers to the first position after a comma.

Then in the SELECT list, we extract the list element starting at position Number from where we look for the next comma using the charindex function. To handle the last element in the string correctly, this time we add a comma after the input string. We use the third parameter to charindex to start in the right place. charindex thus returns the position of the next delimiter. Since the third parameter to substring is the desired length of the substring, we simply subtract the current position from the return value of charindex to get the length of the extracted element.

While this function works, there are several enhancements we can make to it. The first is to trim leading and trailing blanks from the return value, as I have done in the other list-to-table functions in this article. Thus we wrap the return value in calls to ltrim and rtrim.

Next there are a couple of performance improvements. The first two I have already covered in this article. One is to force a binary collation for the comparisons, including the call to charindex. The other is to look at the return type of the function. This is nvarchar(MAX), which is not so good if you want to join the output to a regular nvarchar column, as I noted in the section MAX Types vs. Regular (n)varchar. Therefore we should also wrap the output in a convert to nvarchar(4000).

The next thing to look at is more subtle: we can improve the condition on the length of the parameter, that is, this part:

Number <= len(@param)

When the input parameter to len is a MAX value, the return type is bigint, but the Number column is int. Since int has lower precedence than bigint in SQL Server, Number is implicitly converted to bigint, and this introduces a range seek in the Seek operator, similar to what happens when varchar is converted to nvarchar in a Windows collation. The range seek does not seem to be costly in itself, but when I tested, I found that it seemed to prevent the optimizer from using a parallel plan. Therefore, it's a good idea to convert the return value from len to int.
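
That is, in the final version of the function below, the first condition in the WHERE clause reads:

Number <= convert(int, len(@param))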

The last performance issue to address is more obscure, as it is due to bug in SQL Server. If you take inline_split_core and use it in a stored procedure, you will find that the stored procedure gets compiled on every execution, or more exactly: the plan for the procedure is never put into the plan cache. This bad behaviour was originally discovered by SQL Server MVP Tony Rogerson. For a long time we suspected that something was fishy with the MAX data types in general. But as I worked with revising this article, I was able to isolate the expression @p + ',' as the culprit which I have described in this Connect bug. SQL Server MVP Simon Sabin then found the workaround: convert the comma to nvarchar(MAX), and the issue goes away.

All this leads to the real inline_split_me:

CREATE FUNCTION inline_split_me(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000),
                  substring(@param, Number,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      @param + convert(nvarchar(MAX), N','),
                                      Number) -
                            Number)
              ))) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(convert(nvarchar(MAX), N',') + @param, Number, 1) =
                        N',' COLLATE Slovenian_BIN2)

I've also added N before the string literals to not rely on implicit conversion. Compared to the core version above, you may find that you no longer can see the forest for all the trees!

Here is an example on how to use this particular function:

CREATE PROCEDURE get_company_names_inline @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   inline_split_me(@customers) s ON C.CustomerID = s.Value
go
EXEC get_company_names_inline 'ALFKI,BONAP,CACTU,FRANK'

An obvious problem with this function is robustness. A comma-separated list that has more than one million characters is a very long string, but nevertheless it is bound to happen sooner or later. And in that case, this function will return incorrect results, but there will be no error message.

While we have sorted out several performance issues, others still remain. The most serious is that the optimizer may go completely wrong and select a catastrophic plan. When I ran my performance tests and rewrote my join queries to use EXISTS, the time to get 20 rows from the test table rose from two milliseconds to 40 seconds. As I noted in the section JOIN vs. EXISTS, this is a problem with all T-SQL inline functions. We can easily avoid this disaster by making the function into a multi-statement function, but another problem that I also brought up with the iterative method still remains: charindex is slower on nvarchar(MAX) than on nvarchar(4000). To resolve this, we need to go chunking, which we will look at in the next section.

But before that, let's look at the brighter side of inline_split_me. The optimizer does not always go astray, but sometimes it comes up with quite slick plans. When I ran my original performance tests I noticed that I got a parallel plan when joining with my test table. For short lists, parallelism rather added overhead. But for my test strings with 10 000 elements, inline_split_me suddenly performed faster than the CLR on a 4-CPU machine. (For full details on this, please see the first performance appendix.) For even longer strings, 50 000 elements, it fell behind the CLR again, for reasons I will discuss in the section on fixed-length input.

I would like to stress that you cannot be sure to get a parallel plan, as that depends on many things. For one thing, if you think the optimizer will use the condition on len(@param), forget it. No matter the input string, the optimizer will assume that the first WHERE condition hits 30 % of the rows in Numbers, i.e. 300 000 in this case, and that the second condition with substring reduces the estimated number of rows to around 9 500. Apparently, in combination with my test table, this triggered a parallel plan. From this it follows that while the size of the input parameter does not matter, the size of Numbers does.

The bottom line is that in most cases you should avoid using inline_split_me. The one exception is when parallelism goes your way, and you want the best performance. I think in this case you should benchmark it against other methods. And keep in mind that parallelism is not always a blessing – not if there are other users who want some CPU power as well!

A Chunking Multi-Statement Function

As with the iterative method, breaking the input into chunks of nvarchar(4000) improves performance. Here is a multi-statement function that does that:

CREATE FUNCTION duo_chunk_split_me(@list  nvarchar(MAX),
                                   @delim nchar(1) = N',')
RETURNS @t TABLE (str   varchar(4000)  NOT NULL,
                  nstr  nvarchar(4000) NOT NULL) AS
BEGIN
   DECLARE @slices TABLE (slice nvarchar(4000) NOT NULL)
   DECLARE @slice nvarchar(4000),
           @textpos int,
           @maxlen  int,
           @stoppos int

   SELECT @textpos = 1, @maxlen = 4000 - 2
   WHILE datalength(@list) / 2 - (@textpos - 1) >= @maxlen
   BEGIN
      SELECT @slice = substring(@list, @textpos, @maxlen)
      SELECT @stoppos = @maxlen -
                        charindex(@delim COLLATE Slovenian_BIN2,
                                 reverse(@slice))
      INSERT @slices (slice)
         VALUES (@delim + left(@slice, @stoppos) + @delim)
      SELECT @textpos = @textpos - 1 + @stoppos + 2
      -- On the other side of the comma.
   END
   INSERT @slices (slice)
       VALUES (@delim + substring(@list, @textpos, @maxlen) + @delim)

   ;WITH stringget (str) AS (
      SELECT ltrim(rtrim(substring(s.slice, N.Number + 1,
                charindex(@delim COLLATE Slovenian_BIN2,
                          s.slice, N.Number + 1) -
                N.Number - 1)))
       FROM  Numbers N
       JOIN  @slices s
         ON  N.Number <= len(s.slice) - 1
        AND  substring(s.slice, N.Number, 1) = @delim COLLATE Slovenian_BIN2
   )
   INSERT @t (str, nstr)
      SELECT str, str
      FROM   stringget

   RETURN
END

We first split up the text in slices and put these in the table variable @slices. When we created chunks for the iterative method, we did not bother if a list element was split over two chunks. But for this method, we need to take precautions to avoid that, and we must make sure that the last character in a chunk is a delimiter. We first get a preliminary slice of the maximum length we can handle. Then we find the last delimiter in the slice, by feeding charindex the result of the reverse function, a neat little trick. When we insert the slice into the table variable, we make sure that there is a delimiter both before and after, so that we don't need to deal with that later. You may note that if the input text is within the limits of a regular nvarchar, we never enter the loop, but just insert the text directly into the @slices table.

Once @slices is populated, we apply the logic from the inline function, although it looks slightly different, since we know that all strings in @slices start and end with the delimiter. Now the numbers filtered out through the JOIN are the positions for the delimiters, and the elements start one position further ahead. Note here that we do not need to iterate over @slices; we can join directly with Numbers.

As with iter_charlist_to_table, this function returns a table with both a varchar and an nvarchar column. My tests indicate that there is a cost of 10-15 % over returning a table with only a nvarchar column.

This function is in most cases considerably faster than inline_split_me, up to a factor of 2. As I discussed in the previous section, inline_split_me can be very fast with a parallel plan and long input. The nice thing with the chunking multi-statement function is that you get consistent performance. Compared to the CLR, duo_chunk_split_me was 25-50 % slower when just unpacking the list into a table in my tests.
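
Here is a usage example, analogous to the earlier procedures (the procedure name is only for illustration; since CustomerID is nchar, we join on the nstr column):

CREATE PROCEDURE get_company_names_chunk @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   duo_chunk_split_me(@customers, DEFAULT) s ON C.CustomerID = s.nstr
go
EXEC get_company_names_chunk 'ALFKI,BONAP,CACTU,FRANK'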

Note that this function does not have the robustness problem of the inline function. By chunking we are protected against running out of numbers, as long as we make sure that there are 4000 numbers in the table. True, if you get more slices than you have numbers in the table, you have a problem. But with a million numbers in the table, that would mean an input length close to four billion characters – which exceeds the capacity of nvarchar(MAX) by a wide margin.

Concluding Remarks

While faster than the iterative function, this method is more difficult to grasp, and it's probably also more difficult to extend to more complex formats.

These functions do not have any listpos column. You can use the row_number function for this:

listpos = row_number() OVER (ORDER BY s.sliceno, N.Number)

where sliceno is an extra counter column you would have to add to the @slices table. It's likely that this would have some performance cost, but I have not examined this.

Compared to other methods, the performance depends a lot on the length of the total string, since we compare every character in the input string with the string delimiter. That is, extra spacing and the length of the elements will matter.

I did not include any function that unpacks a list of numbers. You could write a new function which uses convert in the right places, but you can also do it like this:

CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   duo_chunk_split_me(@ids, DEFAULT) i ON
          P.ProductID = convert(int, i.str)
go
EXEC get_product_names_tblnum '9, 12, 27, 37'

Despite the chunking, there are still some robustness issues with the table of numbers. Users who do not know better could delete numbers in the middle or add numbers that should not be there. And as the queries are written, they assume that the first number is one. At some point I had a zero in the table, whereupon I got incorrect results back. If you are paranoid, you can set up a check constraint for the minimum value, and add a trigger that cries foul if anyone meddles with the table. Then again, constraints and triggers can be disabled, so the really paranoid will probably prefer another method. In the section Using a Function of Numbers, I will discuss a variation where you use a table-valued function to generate the numbers when you need them.
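
A minimal sketch of such a constraint (the constraint name is only for illustration):

ALTER TABLE Numbers ADD CONSTRAINT ck_numbers_min CHECK (Number >= 1)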

In any case, a table of numbers is something that comes in handy in several other SQL problems than just unpacking a comma-separated list. So it's a good idea to keep one available in your database.

Fixed-Length Array Elements

This is a method that was proposed by SQL Server MVP Steve Kass, inspired by an idea in Ken Henderson's book The Guru's Guide to Transact-SQL.

Just like XML, this is a method that requires a special input format. Instead of using delimiters as in other methods, the list elements have a fixed length. There are two advantages with this: 1) You can never run into problems with delimiters appearing in the data. 2) Performance. Except for extremely long input strings, this is the fastest of all methods in this article. In fact, this method can even beat using table-valued parameters. However, for multi-user environments, it does not seem to scale as well as the CLR.

The Core

Here is a quick example where the method is employed directly, without a function:

CREATE PROCEDURE get_product_names_fix @ids varchar(8000),
                                       @itemlen tinyint AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   Numbers n ON P.ProductID = convert(int,
               substring(@ids, @itemlen * (n.Number - 1) + 1, @itemlen))
     AND  n.Number <= len(@ids) / @itemlen
go
EXEC get_product_names_fix '   9  12  27  37', 4

Each element in the "array" has the same length, as specified by the parameter @itemlen. We use the substring function to extract each individual element. And, yes, it's that Numbers table again, that we created in the beginning of the previous section Using a Table of Numbers.

Here is a function that embeds the method:

CREATE FUNCTION fixstring_single(@str nvarchar(MAX), @itemlen tinyint)
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              str = substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen)
       FROM   Numbers n
       WHERE  n.Number <= len(@str) / @itemlen +
              CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)

The purpose of the expression on the last line is to permit the last element in the array to be shorter than the others, in case trailing blanks have been stripped. You can see that this function returns the list position, which is very simple to achieve: it's the current number in the Numbers table. Here is an example using fixstring_single with a list of strings:

CREATE PROCEDURE get_company_names_fix @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   fixstring_single(@customers, 6) s ON C.CustomerID = s.str
go
EXEC get_company_names_fix 'ALFKI BONAP CACTU FRANK'

One possible disadvantage with fixed-length is that it's more sensitive to disruption in the input. If you lose one character somewhere or pass the wrong value in @itemlen, the entire list will be misinterpreted. But assuming that you construct the list programmatically, this should not be a big deal.

You may note that I did not convert len(@str) to int as I did in inline_split_me. While it would not be a bad idea, I was not able to detect any impact on performance in this case. In order to make the fixed-length procedures easier to read, I opted to leave out the convert in this case.

While this method can be very fast, keep in mind that as with all other methods that build on T-SQL inline functions, the optimizer may lose its grip entirely. It does not know to use the length of the string to estimate how many rows we will read from Numbers, but makes its standard guess of 30 %, that is, 300 000 rows with a million rows in the table. When I tested an EXISTS query, it took 30 seconds to get 20 elements from my test table; the corresponding join ran in a millisecond or less.

Passing Numbers as Binary

The fixed-length method opens the door to a different way of passing a list of numbers, to wit as a binary string:

CREATE FUNCTION fixbinary_single(@str varbinary(MAX))
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              n = convert(int, substring(@str, 4 * (n.Number - 1) + 1, 4))
       FROM   Numbers n
       WHERE  n.Number <= datalength(@str) / 4 )

(Do I need to add that I originally got this idea from Steve Kass as well?) Note that fixbinary_single does not take any @itemlen parameter; since an int always occupies four bytes, it would be superfluous. To handle the other integer types, you would write new functions.
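
For instance, a version for bigint could look like this sketch (the function name is mine for illustration; each element now occupies eight bytes):

CREATE FUNCTION fixbinary_single_big(@str varbinary(MAX))
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              n = convert(bigint, substring(@str, 8 * (n.Number - 1) + 1, 8))
       FROM   Numbers n
       WHERE  n.Number <= datalength(@str) / 8 )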

When using fixbinary_single from T-SQL, it looks less appealing:

CREATE PROCEDURE get_product_names_binary @ids varbinary(2000) AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   fixbinary_single(@ids) b ON P.ProductID = b.n
go
EXEC get_product_names_binary 0x000000090000000C0000001B00000025

But normally you would not do that. Rather, you would create your binary string in client code. SQL Server MVP Alex Kuznetzov was kind enough to share a C# function that builds a binary string from numbers for the bigint data type:

// Packs each ulong into the result in big-endian order, eight bytes per value.
static byte[] UlongsToBytes(ulong[] ulongs) {
   int ifrom = ulongs.GetLowerBound(0);
   int ito   = ulongs.GetUpperBound(0);
   int l = (ito - ifrom + 1)*8;
   byte[] ret = new byte[l];
   int retind = 0;
   for(int i=ifrom; i<=ito; i++)
   {
           ulong v = ulongs[i];
           ret[retind++] = (byte) (v >> 0x38);
           ret[retind++] = (byte) (v >> 0x30);
            ret[retind++] = (byte) (v >> 0x28);
           ret[retind++] = (byte) (v >> 0x20);
           ret[retind++] = (byte) (v >> 0x18);
           ret[retind++] = (byte) (v >> 0x10);
           ret[retind++] = (byte) (v >> 8);
           ret[retind++] = (byte) v;
   }

   return ret;
}

Thus, to use it with fixbinary_single above, you would have to modify it a bit.

In my original performance tests in 2006, I was not able to detect any difference in performance between fixbinary_single and fixstring_single. Interestingly enough, when I re-ran my tests in 2009, I got 15-20 % better results with binary input for lists of up to 10 000 elements. Whether this is because that test was on SQL 2008 or because I used different hardware, I don't know. In 2009, I also tested lists with 50 000 elements, and in this case the difference is dramatic – the binary version is three times faster, for reasons I will return to.

There is a second benefit that Alex pointed out to me: using binary format saves network bandwidth. Instead of passing 10 bytes per number, as you would need with a string that can fit all positive integers, with a binary string you only need four bytes per number. This can make a difference for longer lists, and I discuss this in the section Call Overhead for Plain Strings in the second performance appendix.

Fixed Length and the CLR

Since what is special about this method is the input format, not the algorithm itself, you could use the format with a CLR function instead. I wrote a CLR table-valued function that accepted fixed-length as input. There was very little difference in performance compared to the other CLR functions.

Unlimited Input

For some other methods, I have said that splitting up the string in chunks gives better performance. This does not apply to fixed-length input, because that gain is due to an issue with charindex, which we don't use here. The other reason to go chunking is to avoid running out of numbers in the Numbers table. This risk is smaller with fixed-length, since we only use one number per list element, not one per character. So with the Numbers table above, we can handle one million list elements, which is quite a few. Nevertheless, it is possible to write a function which is waterproof in this regard. Steve Kass proposed this function, which self-joins Numbers to square the maximum number. That is, with one million numbers in the table, you get 1E12 numbers in total to play with.

CREATE FUNCTION fixstring_multi(@str nvarchar(MAX), @itemlen tinyint)
   RETURNS TABLE AS
RETURN(SELECT listpos = n1.Number + m.maxnum * (n2.Number - 1),
              str = substring(@str,
                    @itemlen *
                        (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1,
                    @itemlen)
       FROM   Numbers n1
       CROSS  JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m
       JOIN   Numbers n2 ON
              @itemlen *
                   (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1 <=
              len(@str)
       WHERE  n2.Number <= len(@str) / (m.maxnum * @itemlen) + 1
         AND  n1.Number <= CASE WHEN len(@str) / @itemlen <= m.maxnum
                                THEN len(@str) / @itemlen +
                                     CASE len(@str) % @itemlen
                                          WHEN 0 THEN 0
                                          ELSE 1
                                     END
                                ELSE m.maxnum
                           END
)

This is a more complex function than fixstring_single, but to save space I leave it as an exercise to the reader to work out what's going on. I will only make a note about the line with CROSS JOIN: it saves me from hard-coding the number of rows in Numbers.

You may think that a self-join that results in 1E12 numbers must be expensive, and indeed, if your lists are mainly short, less than 200 elements, the overhead is considerable. However, in my tests there were cases where fixstring_multi outperformed fixstring_single by a wide margin on a 4-CPU machine. Just as for inline_split_me, the reason is parallelism.

Another approach to permit unlimited input is of course to go chunking. And with the fixed-length method, it's possible to do this with an inline function:

CREATE FUNCTION fixstring_multi2(@str nvarchar(MAX), @itemlen tinyint)
  RETURNS TABLE AS
RETURN(
     SELECT listpos = (s.sliceno - 1) * (s.maxnum / @itemlen) + n.Number,
            str = substring(s.slice, @itemlen * (n.Number - 1) + 1, @itemlen)
     FROM   (SELECT m.maxnum,
                    sliceno = n.Number,
                    slice = substring(@str,
                                      (m.maxnum - m.maxnum % @itemlen) *
                                            (n.Number - 1) + 1,
                                      m.maxnum - m.maxnum % @itemlen)
             FROM   Numbers n
             CROSS  JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m
             WHERE  n.Number <= len(@str) /
                    (m.maxnum - m.maxnum % @itemlen) +
                    CASE len(@str) % (m.maxnum - m.maxnum % @itemlen)
                         WHEN 0 THEN 0
                         ELSE 1
                    END) AS s
     JOIN   Numbers n ON n.Number <= len(s.slice) / @itemlen +
              CASE len(s.slice) % @itemlen WHEN 0 THEN 0 ELSE 1 END
   )
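
For a quick sanity check, assuming the items are padded to a width of six characters, you can call the function directly; this should return the four customer IDs in list order (with trailing blanks, since this version does not rtrim):

SELECT listpos, str
FROM   fixstring_multi2(N'ALFKI BONAP CACTU FRANK', 6)
ORDER  BY listpos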

Performance is similar to fixstring_multi, including the good performance on multi-processor machines thanks to parallelism but it takes a bigger toll as the length of the string increases. More about that in a second.

Performance with Extremely Long Input

At the beginning of this section I said that the fixed-length method is the fastest method, except for extremely long input. In my original tests I found that on one machine, fixed-length lagged behind several other methods when the input was a list of 10 000 strings. This puzzled me for a long time, and eventually I ran some special tests for this case. My conclusion is that there is some change in SQL Server's internal handling of nvarchar(MAX) above a certain limit, which makes processing slower. This limit is around 500 000 bytes on x64 machines, and 750 000 bytes on 32-bit machines. (I don't have access to any IA64 box, so I don't know what the limit is there.) When the input length exceeds this limit, the execution time for the fixed-length functions roughly doubles.

In my tests in 2006, I only saw this with fixed-length and only with string input, because that was the only case where I exceeded the limit. In 2009, I added a test with 50 000 list elements, and now I saw this effect with all methods that work on nvarchar(MAX) without splitting it up in chunks. The routines that apply chunking are fairly immune, since they perform relatively few operations on the long string. The issue does not arise at all with the non-SQL methods: the CLR handles the long string within the .NET Framework, and XML also has its own internal handling. This also explains why fixbinary_single is much faster than fixstring_single at 50 000 elements: the binary string is still below this mysterious limit. If you had a list with 150 000 elements, you would see the effect with binary input as well.

I have more details about this limit in the performance appendix. As a curiosity, in my test on 50 000 list elements it seems that I hit a second limit with fixstring_multi2. In the 2009 tests, simply unpacking a list of 10 000 elements took 195 ms. Unpacking 50 000 list elements with its twin fixstring_multi took 690 ms; with fixstring_multi2 it took 670 seconds. I have not analysed this much at all. It's not a change in query plan, but it may be a resource issue, so the results could be different on different hardware.

Multithread Scalability

When I ran my multithread tests, the fixed-length methods did not do well. In some tests, they even scaled negatively. That is, the total execution time for a certain number of calls was longer when the calls were spread over several threads than when they were all made from a single thread. The one exception was a multi-statement function based on fixstring_single, which did not exhibit this behaviour. However, neither did it scale as well as the CLR functions, particularly not with a high number of parallel threads.

Using a Function of Numbers

As I have discussed, the Numbers table comes with a robustness problem. You may run out of numbers, or there may inadvertently be gaps in the table. A second obstacle with these methods is that you may not want to (or be permitted to) create a table for this purpose. So what about a table-valued function you can use instead?

One of the smartest SQL people I know is SQL Server MVP Itzik Ben-Gan, and in his book Inside SQL Server 2005: T-SQL Programming he presented this function:

CREATE FUNCTION dbo.fn_nums(@n AS bigint) RETURNS TABLE AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)
  SELECT n AS Number FROM Nums WHERE n <= @n;
GO
SELECT Number FROM fn_nums(1000)

When called, this function generates numbers from 1 to the number you specify. And this function is fast! The query I presented to you initially to fill the Numbers table with a million numbers runs for some 20 seconds. Itzik's function generates those numbers in 1-2 seconds.

The function may seem unwieldy at first, but this is how it works: In his first CTE, L0, Itzik generates a virtual table with two rows. In the next CTE, he cross-joins L0 with itself, so that L1 generates four rows. In L2 he cross-joins L1 with itself, and he is now up to 16 rows. The procedure is repeated with L3 that produces 256 rows. L4 makes them 65536 rows, and then L5 explodes them into over four milliard rows. This is brought to the CTE Nums where he numbers the rows. The optimizer is smart enough to realise that all rows have the same value, so it does not waste time on sorting them. Finally he selects from Nums to get the numbers asked for.
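
If you want to verify the speed on your own hardware, a simple test is to count the numbers while timing the statement:

SET STATISTICS TIME ON
SELECT COUNT(*) FROM fn_nums(1000000)
SET STATISTICS TIME OFF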

The function is so fast that it is a viable alternative to side-step Numbers and use the function directly. I derived fn_inline_split_me from inline_split_me by changing the FROM and WHERE clauses like this:

FROM  fn_nums(len(@param))
WHERE substring(convert(nvarchar(MAX), N',') + @param, Number, 1) = N',' COLLATE Slovenian_BIN2)

Likewise, I created fn_fixstring_single from fixstring_single by skipping the WHERE clause entirely, and I wrote the FROM clause as:

FROM   fn_nums(len(@str) / @itemlen +
               CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)

In my performance tests, the functions based on fn_nums in many cases had performance equal to those that use Numbers, but there were also plenty of cases where the fn_nums-based functions were considerably slower, particularly with fixed-length. Only in a few tests with short input did I measure lower values for functions based on fn_nums.

As I have noted, the optimizer can easily stumble on functions based on the Numbers table. It does not get any easier with fn_nums, au contraire. The optimizer has very little idea of what is going on, and will in many cases assume that the query will produce four milliard rows. When I ran my tests, the optimizer lost control for all EXISTS queries using functions based on fn_nums, just like it did for the functions based on Numbers. And in contrast to all other methods, this also happened to fn_fixstring_single for some of the join queries.

I looked into writing fn_chunk_split_me, a version of duo_chunk_split_me that used fn_nums, but this was not easy at all. First I just replaced Numbers in the final join with fn_nums(4000), but I found that for each generated number, SQL Server would scan the @slices table. I also tried

FROM  @slices s
CROSS APPLY fn_nums(len(s.slice) - 1) AS N
WHERE substring(s.slice, N.Number, 1) = @delim COLLATE Slovenian_BIN2

with an equally poor plan. Even worse was:

fn_nums(SELECT MAX(len(slice)) FROM @slices)

This was a complete disaster. I did not have the patience to wait for it to complete, but we are talking minutes. Looking at the query plan, it seemed that SQL Server set out to generate all those four milliard rows, and then filter.

Eventually, I wrote fn_chunk_split_me as

INSERT @numbers(Number)
   SELECT Number FROM fn_nums(CASE WHEN len(@list) < @maxlen
                                   THEN len(@list)
                                   ELSE @maxlen
                              END + 2)

I then used @numbers in the final join instead of Numbers. The performance is clearly inferior to duo_chunk_split_me, particularly for shorter lists. For the very longest lists, with 50 000 elements, fn_chunk_split_me was only a few percent slower. (In a previous incarnation of this article I said that I had found that using fn_nums in duo_chunk_split_me did not impact performance at all. This incorrect statement was due to a sloppy error on my part: I had failed to change the new function to use fn_nums, so I tested two identical functions!)

When I had completed all my performance tests and published the article, I was made aware that Itzik had come up with a new version of fn_nums where the final SELECT goes:

SELECT TOP (@n) n AS Number FROM Nums ORDER BY n

and I also tried:

SELECT TOP (@n) n AS Number FROM Nums WHERE n <= @n

The advantage of using TOP is that the optimizer has less reason to believe that the query will generate four milliard rows. I ran some extra tests, and I saw some improvement with fixed-length input, but the fn_nums-based functions still lagged behind those based on the Numbers table. I now got better performance in fn_chunk_split_me using the query with CROSS APPLY above. However, for longer input, inserting the numbers into a table variable was still faster, and the function was still considerably slower than duo_chunk_split_me. And for fn_inline_split_me, the optimizer now found one of those crazy plans where it takes 30 seconds to unpack 20 elements for some of the join queries, in addition to the EXISTS queries.

This experience made me reconsider some of the enthusiasm I initially had for fn_nums. If you use it, you should feed it a constant or an expression based on variables only. Don't pass it the result of a scalar query or a column from the surrounding query, or you may be in for a disaster. The version with TOP may be less sensitive to this, but whichever version you use, test carefully. The optimizer is in for a tough game, and you run a higher risk of a really bad plan, since it works with such an incorrect estimate. Maybe it's best to play it safe, and extract the elements from the input list into a temp table or table variable and then use that in your query. Or write fn_nums as a multi-statement function to hide the Cartesian product from the optimizer.
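
Here is a minimal sketch of that last idea, a multi-statement shell around fn_nums. (fn_nums_multi is my own name for it; it is not part of the article's test suite.) Since the optimizer only sees the return table, it is not tempted by the Cartesian product, but you pay the cost of materialising the numbers:

CREATE FUNCTION dbo.fn_nums_multi(@n bigint)
RETURNS @numbers TABLE (Number bigint NOT NULL PRIMARY KEY) AS
BEGIN
   -- Materialise the numbers, so that the surrounding query only
   -- sees a plain table variable.
   INSERT @numbers (Number)
      SELECT Number FROM dbo.fn_nums(@n)
   RETURN
END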

A final note: the ultimate solution would be if SQL Server had a built-in function to generate these numbers, one the optimizer would have full understanding of. I have a Connect item that suggests this. Votes are always welcome!

XML Revisited

In the section on XML, I presented an undocumented way to get the list position in an XML document, and recommended against relying on it. I also promised to show you a supported and documented way, and now that we know about the Numbers table, we can look at that solution.

As I noted previously, there is no XQuery function or xml type method for this task. However, you can extract a specific element from a document by using an XPath expression where you index by number. For instance:

DECLARE @x xml
SELECT @x = N'<Root><Num num="9"/><Num num="12"/>
              <Num num="27"/><Num num="37"/></Root>'
SELECT @x.value('(/Root/Num)[3]/@num', 'int')  

This returns 27, the value of the third element in the document.

Instead of a constant, XPath permits you to insert the value of a column with the sql:column function (and of a variable with sql:variable). SQL Server MVP Alejandro Mesa decided to try to get the list position by passing numbers from the Numbers table. That is, instead of shredding the entire document at once, he would take it piece by piece.

Let's first look at using sql:column in a simple way. You cannot use it directly in the value function; you first have to use the query function, which returns an XML fragment, and then you can apply value to that fragment. The really interesting part is the call to sql:column.

DECLARE @x xml
SELECT @x = N'<Root><Num num="9"/><Num num="12"/>
              <Num num="27"/><Num num="37"/></Root>'
SELECT Number AS listpos,
       @x.query('(/Root/Num)[sql:column("Number")]').
          value('(/Num)[1]/@num', 'int') AS num
FROM   Numbers
WHERE  Number <= 4

That is, sql:column inserts the column value into the XPath expression. This returns:

listpos     num
----------- -----------
1           9
2           12
3           27
4           37

The above, however, is nothing that performs well with larger documents. Alejandro tried several variations, until he eventually arrived at this solution:

DECLARE @no_of_elem  int
SELECT @no_of_elem = @str.value('count(/Root/Num)', 'int')

INSERT #Int_UNPACK (number)
SELECT n = T.Item.value('@num[1]', 'int')
FROM   Numbers n
CROSS  APPLY @str.nodes('(/Root/Num)[sql:column("n.Number")]') AS T(Item)
WHERE  n.Number <= @no_of_elem

He first determines the total number of elements in the document with the XQuery function count and saves this in a variable. Next, he uses Numbers to "loop" over the document and get one node at a time. As you may guess, this is slower than shredding the entire document in one go. In my tests, the overhead was 30-80 % compared to the query I showed in the main XML section. A considerable overhead, but given what it's doing, it's amazingly quick.

It may not be necessary to use that extra variable, but you can see it as a safety precaution, so that the optimizer is not led astray. What is very important for performance, though, is that pair of parentheses around /Root/Num. Without them, the query is horribly slow. Why? I'm afraid that explaining the details of XPath performance is beyond the scope of this article. And, more importantly, beyond my knowledge.

Using Recursive CTEs

This is a method that is entirely new to SQL 2005. The method was originally suggested to me by SQL Server MVP Nigel Rivett. A function based on this idea also appears in SQL Server MVP Itzik Ben-Gan's book Inside Microsoft SQL Server 2005: T-SQL Querying.

This method is not among the quickest; it beats the iterative method by a mere 15 %. Nevertheless, Nigel Rivett says he prefers this method, because when he works as a consultant, he wants to leave as small a footprint as possible, preferably not even a function. And indeed, the CTE method is fairly easy to use directly in a stored procedure. Nevertheless, I will show the method packaged as an inline function. As with inline_split_me, I will first show a core function where I've stripped the performance tweaks to make it easier to digest:

CREATE FUNCTION cte_split_core (@list  nvarchar(MAX),
                                @delim nchar(1) = ',') RETURNS TABLE AS
RETURN
   WITH csvtbl(start, stop) AS (
     SELECT start = convert(bigint, 1),
            stop = charindex(@delim, @list + @delim)
     UNION ALL
     SELECT start = stop + 1,
            stop = charindex(@delim, @list + @delim, stop + 1)
     FROM   csvtbl
     WHERE  stop > 0
   )
   SELECT substring(@list, start,
                    CASE WHEN stop > 0 THEN stop - start ELSE 0 END)
          AS Value
   FROM   csvtbl
   WHERE  stop > 0

The thing that starts with WITH is a Common Table Expression (CTE). A plain CTE is just like a macro that you define before the query, and which you then can use in the query as if it were a table. We have already encountered a number of CTEs in this article. But the CTE above is a special form of a CTE, it's a recursive CTE. A recursive CTE consists of two SELECT statements combined by the UNION ALL operator. The first SELECT statement is the starting point. The second SELECT statement makes a reference to the CTE itself. You could see this as a long list of:

SELECT ...
UNION ALL
SELECT ...
FROM   CTE
UNION ALL
SELECT ...
FROM   CTE
...

where each SELECT statement takes its input for the CTE from the SELECT right above it. The recursion continues as long as the SELECT statements continue to generate new rows. The final result is the UNION ALL of all the SELECT statements. (This recursive query is a bit special since it does not refer to any tables. Normally, you use recursive CTEs to wind up bills-of-materials and other hierarchical structures.)

Thus, in this function, the first SELECT returns 1, the position where the list starts, and the position of the first delimiter. The second SELECT sets start to the position after the first delimiter and stop to the position of the second delimiter. The third SELECT, as implied by the recursion, returns the position after the second delimiter as the start position, and the position of the third delimiter. And so it continues until there are no more delimiters and stop is returned as 0, so that the last SELECT returns nothing at all.

All this produces a virtual table with the start and stop positions for all the list elements, and we can use the rows in this table as input to substring. The CASE in the third argument to substring may seem unnecessary, since we have the same condition in the WHERE clause. But SQL Server sometimes computes expressions in the SELECT list before filtering, so we need the CASE expression to be sure to not get a run-time error for negative input to substring.
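
Assuming you have created cte_split_core as above, you can see the splitting at work with a quick test (this short list stays well within the default recursion limit, more about which below):

SELECT Value FROM cte_split_core(N'9,12,27', N',')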

To get the real function, we now apply the same improvements as for inline_split_me (please refer to this section for details):

CREATE FUNCTION cte_split_inline (@list  nvarchar(MAX),
                                  @delim nchar(1) = ',') RETURNS TABLE AS
RETURN
   WITH csvtbl(start, stop) AS (
     SELECT start = convert(bigint, 1),
            stop = charindex(@delim COLLATE Slovenian_BIN2,
                             @list + convert(nvarchar(MAX), @delim))
     UNION ALL
     SELECT start = stop + 1,
            stop = charindex(@delim COLLATE Slovenian_BIN2,
                             @list + convert(nvarchar(MAX), @delim), stop + 1)
     FROM   csvtbl
     WHERE  stop > 0
  )
  SELECT ltrim(rtrim(convert(nvarchar(4000),
            substring(@list, start,
                      CASE WHEN stop > 0 THEN stop - start ELSE 0 END)
         ))) AS Value
  FROM   csvtbl
  WHERE  stop > 0

Here is an example of using this function; note that there is a little twist:

CREATE PROCEDURE get_company_names_cte @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   cte_split_inline(@customers, ',') s ON C.CustomerID = s.Value
   OPTION (MAXRECURSION 0)
go
EXEC get_company_names_cte 'ALFKI, BONAP, CACTU, FRANK'

Note the OPTION clause. Without this clause, SQL Server would terminate the function prematurely if there are more than 100 elements in the list, 100 being the default value for MAXRECURSION. MAXRECURSION serves as a safeguard in case you happen to write a recursive CTE which never terminates. For more "normal" uses of CTEs, like employee-manager relations, 100 is a lot, but for our purposes 100 is far too low a number. Here we set MAXRECURSION to 0, which turns off the check entirely.

But why is the OPTION clause not within the function? Simply because OPTION clauses are not permitted in inline functions. Recall that inline functions are not really functions at all, but just macros that are pasted into the referring query.

The function does not return the list position, but it would be easy to do so, by adding a counter column to the CTE:

SELECT ..., listpos = 1
...
UNION ALL
SELECT ..., listpos = listpos + 1
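
Spelled out against the core function, the whole query could look like this (a sketch only, without the collation and convert tweaks of cte_split_inline):

WITH csvtbl(start, stop, listpos) AS (
   SELECT start = convert(bigint, 1),
          stop  = charindex(@delim, @list + @delim),
          listpos = 1
   UNION ALL
   SELECT start = stop + 1,
          stop  = charindex(@delim, @list + @delim, stop + 1),
          listpos = listpos + 1
   FROM   csvtbl
   WHERE  stop > 0
)
SELECT listpos,
       substring(@list, start,
                 CASE WHEN stop > 0 THEN stop - start ELSE 0 END) AS Value
FROM   csvtbl
WHERE  stop > 0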

As you may guess, cte_split_inline takes a performance toll from the combination of nvarchar(MAX) and charindex. This can be addressed by writing a multi-statement function that breaks up the input into chunks, in a similar vein to duo_chunk_split_me. However, I found in my tests that the overhead of bouncing the data over the return table in a multi-statement function slightly exceeded the cost of charindex on nvarchar(MAX). Since the selling point of the CTE method is its low intrusiveness, the chunking function is of less interest, and I leave it as an exercise to the reader to implement it. Or peek at this link.

As with all other inline functions, the optimizer failed on the EXISTS queries in my test, and it took over 30 seconds to extract 20 elements from my test table. It may be better to just extract the list elements into a temp table and use that in the real query.

Dynamic SQL

For a list of numbers, this method appears deceptively simple:

CREATE PROCEDURE get_product_names_exec @ids nvarchar(4000) AS

EXEC('SELECT ProductName, ProductID
      FROM   Northwind..Products
      WHERE  ProductID IN (' + @ids + ')')
go
EXEC get_product_names_exec '9, 12, 27, 37'

But the full story is far more complex. There are several issues you need to consider. To start with, let's look at an example with a list of strings:

CREATE PROCEDURE get_company_names_exec @customers nvarchar(2000) AS
   EXEC('SELECT CustomerID, CompanyName
         FROM   Northwind..Customers
         WHERE  CustomerID IN (' + @customers + ')')
go
EXEC get_company_names_exec '''ALFKI'', ''BONAP'', ''CACTU'', ''FRANK'''

The nested quotes make procedures using this method difficult to call.

The next thing to consider is permissions. Normally, when you use stored procedures, users do not need direct permissions on the tables, but this does not apply when you use dynamic SQL. In SQL 2000 there was no way around it. In SQL 2005 you can arrange for permissions by signing the procedure with a certificate. I have a full article that describes how to do this, Giving Permissions through Stored Procedures. But to get there, you now have a method that requires a complexity that no other method in this article calls for.

Whenever you work with dynamic SQL, you must be aware of SQL injection. That is, if you simply accept user input, a malicious user may be able to insert SQL code that you did not intend to execute. This is something you have to be particularly aware of when you implement web applications for the Internet; SQL injection is a very common way for hackers to break into sites. The normal way to avoid SQL injection is to use parameterised statements, but since the number of parameters is variable, that is not workable here. Instead you need to employ other methods, which are more complex and less secure. I discuss SQL injection and other issues around dynamic SQL in my article The Curse and Blessings of Dynamic SQL.

When it comes to robustness and flexibility, we can note that you cannot choose the delimiter – comma is the only possibility. There is no way to get the list position. And you can only use the method inline – there is no way to unpack the data into a temp table or a table variable.

When it comes to performance, this can be the fastest method for your query. This is the only method – together with the one in the next chapter – where the optimizer has full information about the input, and this can lead to a better plan than one based on blind assumptions or the statistics of a temp table. But it can also go the other way. In fact, in my tests dynamic SQL proved to be one of the slowest methods, way behind the iterative method. The reason for this is that with dynamic SQL, you pay the cost of compilation with almost every query, and compilation of long IN lists is anything but cheap. Nevertheless, the situation has improved significantly since SQL 2000, where a list with 2 000 elements took over 10 seconds to process on all my test machines. On SQL 2005, I was able to process 10 000 elements in 4-6 seconds on reasonable hardware. Then again, when I ran my performance tests on SQL 2008 and added a test of 50 000 list elements, I had to exclude dynamic SQL from that length, because the query failed with an internal query-processor error.

If you reuse the same list, the plan is likely to be in the cache, and in this case execution will be fast the second time. But keep in mind that it has to be exactly the same list, as the cache is both case- and space-sensitive. And how likely is this? Rather, this points to a secondary problem with dynamic SQL: cache bloat. You will fill the plan cache with plans that are very rarely reused. On machines with lots of memory, this can lead to severe performance problems because of hash collisions in cache-lookups.

There is, however, a way to reduce the compilation costs for dynamic SQL. SQL 2005 adds a new database option, forced parameterisation. When this setting is in effect, SQL Server parameterises all ad-hoc statements. This has the effect that for a dynamically constructed IN expression like the one above, there is a single entry in the cache for a given number of list elements, and it can be reused by all queries with that many elements. That is, there is one entry for one element, one for two elements and so on. With this setting, dynamic SQL is indeed very fast – up to 2 100 list elements, that is. 2 100 is the maximum number of parameters for a stored procedure or parameterised statement. When this number is exceeded, the string cannot be fully parameterised, and there will again be one cache entry per string, with the same poor performance as before.
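
Forced parameterisation is a plain database option; if you want to experiment with it (be careful about flipping it on a busy production database), the commands are:

ALTER DATABASE Northwind SET PARAMETERIZATION FORCED
-- And to go back to the default behaviour:
ALTER DATABASE Northwind SET PARAMETERIZATION SIMPLE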

Keep in mind that cached plans are based on the parameter values from the first invocation (this is known as parameter sniffing). Sometimes this can backfire, if the parameters on the first invocation are atypical, or if different sets of parameters simply call for different plans. This applies if you use forced parameterisation as well. Also keep in mind that since forced parameterisation is a database setting, you as a programmer cannot rely on it being on or off. (There are query hints to control the behaviour, though.)

The conclusion is that as long as you get good plans with other methods, there is absolutely zero reason to use dynamic SQL. Only if you have exhausted all other methods and have not been able to get good performance is it worth looking at dynamic SQL. And even then you need to decide whether it is worth the hassle of handling permissions and protecting against SQL injection.

Passing the List as Many Parameters

It couldn't be unfancier:

CREATE PROCEDURE get_product_names_manyparams
       @p1 int = NULL, @p2 int = NULL, @p3 int = NULL, @p4 int = NULL,
       @p5 int = NULL, @p6 int = NULL, @p7 int = NULL, @p8 int = NULL AS
  SELECT ProductName, ProductID
  FROM   Northwind..Products
  WHERE  ProductID IN (@p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8)
go
EXEC get_product_names_manyparams 9, 12, 27, 37

That is, just have a stored procedure with lots of parameters, and then pass as many as you need.

The reader immediately realises that there is a maintenance problem here. No programmer likes to write code like that. And there is an obvious limitation: no matter how many parameters you put in, you run the risk of needing one more.

And, true, this solution is only an alternative when your upper limit is fairly low. After all, there are applications where 30 list elements would mean a lot. The advantage? The same as with dynamic SQL, but without its many drawbacks. The optimizer has full information about the parameter values each time it compiles the plan. Normally, it would only do so once, but if you know that different sets of input values call for different plans, you can use OPTION (RECOMPILE) to force a recompilation of the statement every time. And there is no issue with permissions, and no risk of cache bloat. (With this hint, the plan is never put into the cache at all.)
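
For instance, the demo procedure above could be altered like this to get a fresh plan, based on the actual values, on every execution. (A sketch only; whether this pays off depends on your data and queries.)

ALTER PROCEDURE get_product_names_manyparams
       @p1 int = NULL, @p2 int = NULL, @p3 int = NULL, @p4 int = NULL,
       @p5 int = NULL, @p6 int = NULL, @p7 int = NULL, @p8 int = NULL AS
  SELECT ProductName, ProductID
  FROM   Northwind..Products
  WHERE  ProductID IN (@p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8)
  OPTION (RECOMPILE)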

The absolute upper limit for this method is 2 100 list elements; that is how many parameters you can pass to a stored procedure. You may think such a procedure is completely impossible to write, but it is in fact quite easy. In order to be able to test the performance, I created my test procedures with 2 000 parameters, and I used T-SQL to generate the parameter lists and the IN list. This is possible with the help of the FOR XML PATH clause. And once again the Numbers table comes into service:

SELECT '@p' + ltrim(str(Number)) + ' int = NULL, '
FROM   Numbers
WHERE  Number <= 2000
ORDER  BY Number
FOR XML PATH('')

Normally, when you use FOR XML PATH, you specify the name of the enclosing element, but if you leave it as the empty string, the result is a plain text string, although it's still returned as if it were an XML document. Run the above in grid mode in SSMS, open the XML document, copy and paste. Go to the end of the line to remove the final comma and you're done. As for passing the parameters from client code, that should not constitute any problem – it is after all only a loop.
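
The IN list can be generated the same way; the only difference is that you leave out the data type and the default, and again you trim the trailing comma by hand:

SELECT '@p' + ltrim(str(Number)) + ', '
FROM   Numbers
WHERE  Number <= 2000
ORDER  BY Number
FOR XML PATH('')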

So how did my 2 000-parameter procedure perform? If I look at my main summary, it seems to do pretty well: 19 ms in execution time for 2 000 elements, which is better than any other method. But that result is based on timing within the procedure only. When I instead look at the total time as seen from the client, the situation is outright poor. The total time for the call in this case exceeds 550 ms, or more than double the execution time for dynamic SQL with equally many elements.

My regular test setup is a Perl script, and in order to investigate whether this extreme overhead was due to poor performance in my Perl-specific API (as is the case for table-valued parameters), I also wrote a test program in C#. The call overhead was lower with SqlClient, but still almost 400 ms for 2 000 parameters.

So having a 2 000-parameter procedure is very unlikely to be your best solution. But if you know that your lists will always be fairly short, and you want the optimizer to have full information about the values without the drawbacks of dynamic SQL, then this could be your best choice.

Making the List into Many SELECT

Inserting Many Rows

This method is mainly of interest in the situation you need to insert many rows, so I'll start looking at it from that angle.

Say that you don't want to use XML for some reason, and nor can you use table-valued parameters. (Because you are on SQL 2005, or your API does not support them.) What is left – if we overlook bulkload or SQL Server Integration Services – is to use INSERT statements in some way.

The simplest approach is to run a batch that goes:

SET NOCOUNT ON
INSERT tbl(...) VALUES (...)
INSERT tbl(...) VALUES (...)
...

SET NOCOUNT ON is needed to prevent each INSERT from generating a (1 row affected) message to the client and add a lot of network overhead. Still this is not terribly efficient, because each INSERT writes to the transaction log. So you would want a way to insert many rows in one statement. Here is a way:

INSERT tbl (...)
   SELECT val1a, val2a, val3a
   UNION ALL
   SELECT val1b, val2b, val3b
   UNION ALL
   ...

That is, instead of a lot of INSERT statements you have one big fat SELECT with lots of UNION ALL clauses. Alas, this does not perform well either. For small batches of around 20 elements, it's faster than many INSERT statements, as we reduce the transaction-log overhead. But as the batch size grows, the time it takes for the optimizer to compile the query plan explodes, and the total execution time can be several times that of a batch of INSERT VALUES. One way to handle this is to generate many INSERT SELECT UNION batches, with a reasonable batch size of 20-50 elements. That gives you the best of the two methods.

In SQL 2008, there is a new syntax, so that you can provide more than one row in INSERT VALUES:

INSERT tbl(a, b)
   VALUES (1, 2), (12, 12), (14, 99)

This relieves you from the bulkier SELECT UNION syntax. Performance-wise, there is no difference though. Internally, this is converted to the same representation as the SELECT UNION, with the same poor scaling as the number of rows increases. Except that Microsoft realised this and put a cap on INSERT VALUES, so it does not accept more than 1 000 rows. However, my tests indicate that you should stay well below that level. Things may change, though. According to Microsoft's response to a Connect item on this theme, it seems that this will be addressed in SQL 11, the next major version of SQL Server after SQL 2008.

Instead of using VALUES or SELECT UNION, there is a third method that is a lot faster. You get the benefit of a single INSERT statement, but you don't have to pay the excessive price for compilation. The trick is to use INSERT-EXEC, and it was suggested to me by Jim Ebbers:

INSERT sometable (a, b, c)
   EXEC('SELECT 1, ''New York'', 234
         SELECT 2, ''London'', 923
         SELECT 3, ''Paris'', 1024
         SELECT 4, ''Munich'', 1980')

Normally when you use INSERT-EXEC you have a stored procedure or a batch of dynamic SQL that returns one result set, and that result set is inserted into the table. But if the stored procedure or batch of dynamic SQL returns multiple result sets, this works as long as all result sets have the same structure. The reason this method is so much faster than using UNION is that every SELECT statement is compiled independently. One complexity with this method is that you have to be careful to get the nested quotes correct when you generate the EXEC statement.

It may come as a surprise that I propose dynamic SQL here, given how cool I was to it in the previous section. But the full story is that all three methods here presume that the client generates the full batch of INSERT statements, values and all, and what is that if not dynamic SQL? So you could only walk this route in situations where permissions are not an issue. (Which could be because you insert into a temp table, and later a stored procedure inserts from the temp table into a target table.) And obviously you would need to apply methods to deal with SQL injection.

Comma-Separated Lists

It may not seem apparent, but it is possible to use the methods above as a foundation for unpacking comma-separated lists, although it should be said directly that this sorts more under the title "Crazy things you can do in T-SQL". It's difficult to see a situation where this method would be the best choice.

Originally this idea was suggested to me by Steve Kass. The idea is that you use the replace function to replace the delimiter with binding elements. In Steve Kass's original proposal he uses UNION ALL SELECT and then he adds some more plumbing to the string so that you have an SQL batch. Then you can use INSERT EXEC to insert the result of the SQL batch into a temp table, which you then use in your target query. Although UNION has very poor performance, I first show a stored procedure that uses UNION, for the simple reason that this code is simpler to understand than what is to follow:

CREATE PROCEDURE unpack_with_union
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = N',' AS
DECLARE @sql nvarchar(MAX),
        @q1  char(1),
        @q2  char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT INTO ' + @tbl + ' SELECT ' +
               replace(replace(@list, @q1, @q2), @delimiter,
                       N' UNION ALL SELECT ')
--PRINT @sql
EXEC (@sql)

The inner replace is there to handle the potential risk that @list includes single quotes, which we double in an attempt to protect ourselves against SQL injection. The outer replace replaces the delimiter with the SQL plumbing. The name of the table to insert to is passed in the parameter @tbl. The variables @q1 and @q2 save me from having a mess of single quotes all over the place. (It will get worse than this.)

Here is an example of how to use it:

CREATE PROCEDURE get_product_names_union @ids varchar(50) AS
   CREATE TABLE #temp (id int NULL)

   EXEC unpack_with_union @ids, '#temp'

   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #temp t ON P.ProductID = t.id
go
EXEC get_product_names_union '9, 12, 27, 37'

The reason that this is a procedure and not a function is of course that you cannot use dynamic SQL in functions.

However, the performance of this procedure, and of an unpack_with_insert that uses INSERT VALUES, is not defensible. It takes unpack_with_union almost five seconds to unpack 2 000 numbers; unpack_with_insert performs the same task in 1.3 seconds.

Decent performance is possible to achieve if we use the trick with INSERT-EXEC and many small SELECT statements, as suggested by Jim Ebbers. Here is such a procedure, a bit more complex than the one above.

CREATE PROCEDURE unpack_with_manyselect
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = ',' AS
DECLARE @sql nvarchar(MAX),
        @q1     char(1),
        @q2     char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl + ' EXEC(' + @q1 + 'SELECT ' +
               replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                       @delimiter COLLATE Slovenian_BIN2,
                       ' SELECT ') + @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here, the inner replace replaces a single quote within @list with no less than four single quotes. This is because the INSERT EXEC itself is nested into an EXEC(). (You may guess why I have that commented-out PRINT @sql there!) In this procedure I have also added a COLLATE clause. (The other procedure is so slow that the COLLATE would not make any difference anyway.) This procedure unpacks 2 000 numbers in 75 ms – decent, but slower than most other methods in this article.
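
For instance, if you call unpack_with_manyselect with the list '9, 12, 27, 37' and the table #temp, uncommenting the PRINT shows roughly this (the double spaces are left behind by the blanks after the commas and are harmless for numbers):

INSERT #temp EXEC('SELECT 9 SELECT  12 SELECT  27 SELECT  37')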

If you look closer, you see that the procedures presented so far only work well with numeric values. If you want to pass a list of strings, you need to quote the strings yourself. Here is a version for handling strings to save you from this:

CREATE PROCEDURE unpackstr_with_manyselect
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = ',' AS
DECLARE @sql    nvarchar(MAX),
        @q1     char(1),
        @q2     char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl +
               ' EXEC(' + @q1 + 'SELECT ltrim(rtrim(' + @q2 +
               replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                       @delimiter COLLATE Slovenian_BIN2,
                       @q2 + ')) SELECT ltrim(rtrim(' + @q2) +
               @q2 + '))' + @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here is an example of usage:

CREATE PROCEDURE get_company_names_manyselect @custids nvarchar(2000) AS
   CREATE TABLE #temp (custid nchar(5) NULL)

   EXEC unpackstr_with_manyselect @custids, '#temp'

   SELECT C.CompanyName, C.CustomerID
   FROM   Northwind..Customers C
   JOIN   #temp t ON C.CustomerID = t.custid
go
EXEC get_company_names_manyselect 'ALFKI, BONAP, CACTU, FRANK'

Is this method really a good alternative for handling comma-separated lists? No. Performance for integer lists is on par with the iterative method, but it is significantly slower for lists of strings. The iterative method has the benefit that it is easy to adapt to support different input formats, and getting information like the list position is simple. unpack_with_manyselect does not lend itself to any of that. The strength here lies in inserting many values.

Note: all procedures here include an INSERT. But you could leave out the INSERT from the procedure and instead call the procedure from INSERT EXEC. In that case, you have the choice of using a temp table or using a table variable.

Really Slow Methods

In a Q&A column of an SQL journal several years ago, the following solution was suggested by one SQL Server MVP, referring to another MVP, who both shall go nameless:

CREATE PROCEDURE get_company_names_charindex @customers nvarchar(2000) AS
   SELECT CustomerID, CompanyName
   FROM   Northwind..Customers
   WHERE  charindex(',' + CustomerID + ',', ',' + @customers + ',') > 0
go
EXEC get_company_names_charindex 'ALFKI,BONAP,CACTU,FRANK'

By adding commas on both sides of the input string, we can use charindex to find ",ALFKI," etc. (Note that you cannot have embedded blanks here.)

The author noted in his column that this method would not have good performance, since embedding the table column in an expression precludes the use of any index on that column, leading to a table scan. But that's only a small part of the story. A plain table scan on my test table takes 800 ms, once the table is entirely in cache. This method needs 65 seconds for an input list of 20 strings!

Variations on this theme are illustrated by these WHERE clauses:

WHERE  patindex('%,' + CustomerID + ',%',   ',' + @customers + ',') > 0
WHERE  ',' + @customers + ','   LIKE    '%,' + CustomerID + ',%'

The solution with LIKE is equally slow or slower than charindex. I have never tested patindex for SQL 2005.

There is a "go faster" switch here: the COLLATE clause. Add a COLLATE clause to force a binary collation, and performance improves with a factor of 7 to 10. But the problem is that if you pass a string like alfki,bonap,cactu,frank you may still expect a hit, which you would not get with a binary collation. So in my tests, I only forced the collation when the input was a list of integers, as in this example:

CREATE PROCEDURE get_product_names_realslow @ids varchar(200) AS
SELECT ProductName, ProductID
FROM   Northwind..Products
WHERE  charindex(',' + ltrim(str(ProductID)) + ',' COLLATE Slovenian_BIN2,
                 ',' + @ids + ',' COLLATE Slovenian_BIN2) > 0
go
EXEC get_product_names_realslow '9,12,27,37'

If you use an SQL collation, you also get this huge performance improvement if you can stick to varchar (and then you could still be case-insensitive).

But while an improvement by a factor of ten is impressive, we are still talking seven seconds for a list of 20 integers, when no other method needs as much as fifty milliseconds (save when the optimizer loses itself in the inline functions).

Concluding Performance Remarks

In this section I bring up some more comments on performance that I was not able to fit in elsewhere, and which are a little too nitty-gritty to belong in the general considerations.

The Optimizer and len()

I have said in several places that the optimizer does not know to use len(@param) when estimating how many rows inline_split_me and fixstring_single will produce. This is true as long as the input parameter is (n)varchar(MAX). But if the parameter is varchar(8000) or nvarchar(4000), the optimizer is actually able to use it. This means that if you write fixstring_single as:

ALTER FUNCTION fixstring_single(@str nvarchar(4000), @itemlen tinyint)
RETURNS TABLE
RETURN(SELECT listpos = n.Number,
              str = rtrim(convert(nvarchar(4000),
                      substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen)))
       FROM   listtest..Numbers n
       WHERE  n.Number <= len(@str) / @itemlen)

The optimizer will make an exact estimate of how many values your string will produce. Note that the original fixstring_single I presented includes a CASE expression to permit the last element to be shorter. This CASE expression is missing here, since if I add it to the WHERE condition, the optimizer finds it too difficult and resorts to the standard estimate of 30 % of the rows returned.

Changing the input parameter to a limited length in inline_split_me does not yield equally perfect results. The optimizer makes a correct estimate of how many rows it will read from Numbers, but it will of course still have to guess how many rows in total the function will produce. The result is likely to be better than the standard estimate, though.

Keep in mind that while these changes may help you to improve performance, you are in for a gamble if the input can exceed 8000 bytes. Not least does this apply to fixed-length, which needs more space per element.

JOIN, EXISTS and DISTINCT values

You have already read a section on JOIN vs. EXISTS, but there is a little more to say on the topic, and how to deal with duplicates in the input string.

Above, I said that in my tests all inline functions resulted in abysmal plans with my EXISTS queries, and I suggested that it may sometimes be better to write a join query with DISTINCT. A different take on it is a tip that I got from SQL Server MVP Adam Machanic. He fought this problem, and his way out was to add a DISTINCT to the query in the inline function.

I constructed variations of the inline functions in this article and ran them through my test setup. Indeed, the optimizer now settled for reasonable plans for most of my EXISTS queries. But for fixstring_multi and fixstring_multi2 I got even worse plans in some cases, also for the join queries, so this trick may cut both ways. (When I killed one query, it had run for 20 minutes. For an input of 20 elements.)

There is one important thing to notice if you take this road. Assuming that you are using the function for a list of integer values, it becomes important to have the conversion to integer inside the function. That is:

CREATE FUNCTION inline_split_int_distinct(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT DISTINCT convert(int,
                  substring(@param, Number,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      @param + convert(nvarchar(MAX), N','),
                                      Number) -
                            Number)
              ) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(convert(nvarchar(MAX), N',') + @param, Number, 1) =
                  N',' COLLATE Slovenian_BIN2)

Else SQL Server will spend time on sorting and comparing a bunch of string values, which is far more costly than dealing with numbers.

It seems quite clear that putting DISTINCT into a list-to-table function in your standard library is not a very good idea, as that would give you the overhead of dealing with duplicates every time. And there may be occasions when you actually want those duplicates. But it is certainly a good idea to stow this in your bag of tricks.

A final note on the topic of eliminating duplicates: there is a third option beside DISTINCT and EXISTS, and that is to bounce the values from the input list over a table variable (including a multi-statement UDF) or a temp table. You can of course use DISTINCT when you do this, but an alternative is to add a primary key with the IGNORE_DUP_KEY option. This is a more efficient way of eliminating duplicates that are only occasional. See this blog post from SQL Server MVP Alex Kuznetsov for more details.
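
As a sketch of that last alternative, assuming a temp table and inline_split_me from earlier in the article (which returns a Value column), it could look like this:

CREATE TABLE #ids (id int NOT NULL,
                   PRIMARY KEY (id) WITH (IGNORE_DUP_KEY = ON))

-- Duplicates in @list are silently dropped by the index itself,
-- so there is no need for DISTINCT in the INSERT.
INSERT #ids (id)
   SELECT convert(int, Value) FROM inline_split_me(@list)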

Unpacking Lists in a Table, Take Two

When you have the dreadful situation that you need to unbundle a multi-valued column, as I discussed in the section Unpacking Lists in a Table, the presumptions for performance are somewhat different from when you unpack a list in an input parameter.

The first thing to note is that the size of your table column may permit you to change the input parameter in your function from MAX to varchar(8000) or nvarchar(4000) without any risk of data loss. Since the limited string types perform better with charindex, this may give you better performance.

The second thing is more surprising: the relative performance of the methods changes. Nikola Ilijev has published a blog post where he presents tests with different functions against a table, and in his tests the CTE-based methods were faster than inline_split_me and other functions based around the Numbers table. I was able to repeat his results, and interestingly enough, the difference also shows when I use a multi-statement version of the inline functions.

Exactly why this is, I don't know, but I think it is rather that the table-of-numbers functions are slow in this context than the other way round. To wit, when I tested the CLR functions in Nikola's test setup, they outperformed both the CTEs and the functions based on Numbers by a wide margin.

Final Words

You have now seen a number of methods of passing a list of values to SQL Server, and then use the values to find data in a table. Most methods transform the list of values to table format. I have also discussed general considerations on how to apply these methods. If you want to know more about the performance of these methods, there are two appendixes to this article where I present data from my performance tests.

But keep in mind that everything I say about performance here relates to my tests. With your queries, the situation may be different, and if you need the best performance, you need to run benchmarks with your queries. And you may find that you get the best performance with some variation that is not described in this article. On the other hand, in most situations the plain iterative method will serve you very well.

Acknowledgements and Feedback

I have not invented all these methods myself. This article summarises the work of a lot of people in the SQL community. There are those who came up with an original idea, those who have refined the method itself, the interfaces or some other detail. I cannot mention them all, because I simply don't know about all of them.

That said, here is a list of persons who have contributed to this article, directly or indirectly, knowingly or unknowingly. One great source has been current and former SQL Server MVPs: Itzik Ben-Gan, Fernando Guerrero, Roy Harvey, Umachandar Jaychandran, Steve Kass, Narayana Vyas Kondreddi, Bryant Likes, Alejandro Mesa, Simon Sabin, Adam Machanic, Tom Moreau, Tony Rogerson, Nigel Rivett, Anith Sen, Razvan Socol and Linda Wierzbicki. But there are also a lot of people who have contributed through mail or discussions in public newsgroups: Jim Ebbers, Brian W Perrin, Sam Saffron, Ivan Arjentinski, Joe Celko, David Craig, Ken Henderson, Paul Marsh, Chris Payne, Daniel Sebestyen, Dariusz Śpiewak, John Moreno, Bharathi Veeramac, James Vitale, Nikola Ilijev and David Satz. And there are probably a few more that I should have mentioned but have forgotten.

I originally published the SQL 2000 version of this article in 2003, and I planned the update in January 2010 to be the last major update of the article. It did not turn out that way – because of the shortcomings in the XML section, I had to make another update in October 2010. But hopefully, that is the final major update to the article. I will update the article if I am notified that the bugs I discuss are fixed. I will also make amendments if there are angles of security or usability that I have missed. However, it is not likely that I will run a new suite of performance tests to cover new methods or improvements to existing methods. With the advent of table-valued parameters, I simply don't find it interesting enough. All that said, I still welcome feedback on the article if you think something is not covered correctly. And not least, I welcome it if you point out spelling and grammar errors! The address is always esquel@sommarskog.se.

Revisions

2013-11-02 – David Satz pointed out that there is a difference how blank elements are interpreted depending whether you use '.' or '(./text())[1]' in element-centric XML and I have added a short discussion on this.

2010-10-10 – Rewrote the section on XML. I now suggest that it is best practice to use [1] with attribute-centric XML, although it's redundant. For element-centric XML, I point out that you need a trace flag to avoid query-plan problems, but even better is to use the text() node test. This gives you the same performance with element-centric XML as you get with attribute-centric XML. Updated the examples to insert many rows from XML so that they do not use parent-axis access, which often has poor performance. I have added a section on typed XML in more detail. Because of the many changes around XML, I have also rerun the performance tests and updated the second performance appendix with the new results.

Beside the update on XML, added a note on INSERT VALUES with many rows that performance should be better in the next version of SQL Server according to Microsoft.

2010-01-17 – Added comments on variations on fn_nums, and added the section Concluding Performance Remarks with some more tidbits that I had failed to include in the previous version.

2010-01-06 – Extensive rewrite to adapt for SQL 2008, but also new methods and revelations:

2007-03-03 – First version of the article for SQL 2005.
