Arrays and Lists in SQL Server

The Long Version

An SQL text by Erland Sommarskog, SQL Server MVP. Latest revision: 2016-08-21.
Copyright applies to this text. See here for font conventions used in this article.

Introduction

If you were referred to this article in answer to your question in an SQL Server forum, you should read the short version of this article first and chances are very good that the short version is all you need for the moment. This article covers the topic in depth, and you would read it if you have an "advanced" scenario or you are more the geeky sort of person. Even if you intend to read this article in full, read the short version as an introduction first, because I will not repeat it all here.

A problem that has been popular over the years with SQL Server is how to handle a list of values. In the majority of cases, people have a comma-separated list, because this format is produced by commonly used tools such as multi-choice controls in .NET, Reporting Services and elsewhere.

When I say that the problem is popular, I don't only mean that the questions are commonplace – so are the solutions. You can find no end of blog posts etc. that present string-splitting functions, including performance tests of such functions, and there are functions that are said to be the fastest, etc.

The aim of this article is two-fold: 1) to give a general discussion of how to design string-splitting functions, and 2) to present and discuss each method from the angles I bring up in the general discussion. This includes performance, but not only performance.

To back up my discussion on performance, I have run extensive performance tests, although it was a while ago now, and they don't cover the new possibilities that were added in SQL 2016. I have published the results from these tests in two appendixes. The first is from 2006 when I originally published the article for SQL 2005, and the second is from 2009 (and updated with some new findings for XML in 2010).

As you might sense, this article is of some age – I published the first version for SQL 2000 back in 2003, and I updated it several times over the years, with the last major update in 2010. What I find surprising as I write this new introduction in 2016 is that the topic is still current. After all, it was SQL 2008 that gave us what I thought would be the ultimate solution: table-valued parameters. I still think that table-valued parameters are the best way to do it, and I have written an article, Using Table-Valued Parameters in SQL Server and .NET, to demonstrate how to use them.

Nevertheless, questions on how to handle CSVs keep coming up in the forums, and partly this is due to the sad fact that not all environments support table-valued parameters.

So what you have here is the 2016 version of this article. I have moved the material for beginners to the short version. I've rewritten this new introduction and added material on string_split and OPENJSON, new features in SQL 2016. I have also given the contents a general overhaul to modernise the text a bit. (For instance, it was time to stop talking about features added in SQL 2005 as new features!)

Another topic I touch on in this article is how to insert many rows efficiently. While this is something that is best done with table-valued parameters, some form of bulk load, or SSIS, some of the methods in this text are interesting from this perspective as well.

Note: all samples in this article refer to the Northwind database. You can download the script to install it from Microsoft's web site.

Table of Contents

Introduction

Overview of the Methods

General Considerations

Interface

Robustness

Performance Considerations

How to Use These Functions – JOIN vs. EXISTS

The Iterative Method

List-of-integers

List-of-strings

Using the CLR

Introducing the CLR

CLR Functions Using Split

Rolling Our Own in the CLR

What About Performance?

string_split – the Built-in Solution

XML and JSON

Using XML for a List of Values

Inserting Many Rows

Element-centric XML

Typed XML

OPENXML

JSON

Using a Table of Numbers

The Concept

An Inline Function

A Chunking Multi-Statement Function

Concluding Remarks

Fixed-Length Array Elements

Using a Function of Numbers

XML Revisited

Using Recursive CTEs

Looking at Table-Valued Parameters

Performance in SQL Server

Client-side Performance

Dynamic SQL

Passing the List as Many Parameters

Making the List into Many SELECT

Really Slow Methods

Concluding Performance Remarks

The Optimizer and len()

JOIN, EXISTS and DISTINCT values

Unpacking Lists in a Table, Take Two

What About In-Memory OLTP?

Final Words

Acknowledgements and Feedback

Revisions

Overview of the Methods

As I've already hinted, there are quite a few methods to unpack a list into a table. Here I will just give a quick overview of the methods before I move on to the general considerations.

General Considerations

Interface

Most of the methods I present are packaged into functions that take an input parameter which is a list of values and returns a table, like this:

CREATE FUNCTION list_to_table (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS 

The reason the methods are packaged in functions is obvious: this permits you to easily reuse the function in many queries. Here I will discuss some considerations about the interface of such functions.

The Input Parameters

In this article, as far as the input parameter is a string, it is always of the data type nvarchar(MAX) which can fit up to 2 GB of data.

I made this choice, because I wanted to make the functions as generally applicable as possible. By using nvarchar, the functions can handle Unicode input, and with MAX the functions permit unlimited input. With nvarchar(4000), they would silently yield incorrect results with longer input, which is very bad in my book.

Nevertheless, there is a performance cost for these choices. If you use an SQL collation, you should know that varchar gives you better performance (more on that in a minute). And generally, there is an overhead for using the MAX types over strings that have an upper limit in size. The latter live in memory only, while the MAX strings can spill to disk and that incurs an overhead, even if there is no actual spill.

Thus, the fastest methods you can find typically have varchar(8000) as a parameter, but in my opinion that is an accident waiting to happen. And it can be questioned how much the performance of a string-splitting function matters for strings of that relatively short size – very likely the cost of splitting the string is dwarfed by the cost of running the rest of the query. If you have lists that are so long that the performance of the splitting matters, quite likely they will exceed 8000 characters in size every now and then. See however the section Unpacking Lists in a Table, Take Two at the end of the article.

Some of the functions take a second parameter to permit you to specify the delimiter, or, in case of fixed-length, the element length. In some functions I have opted to hard-code the delimiter, but for all methods that use a delimiter (save dynamic SQL), you can always add such a parameter. To keep things simple, I have consistently used one-character delimiters. If you need multi-character delimiters, most methods can be extended to handle this.
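As an illustration, a signature with a configurable delimiter could look like the sketch below. (The function name, the column and the comma default are my own choices for the sketch, not a function from the article proper.)

```sql
-- Sketch: same shape as list_to_table above, but with the delimiter as
-- a second parameter. The default of a comma is an assumption.
CREATE FUNCTION charlist_to_tbl (@list      nvarchar(MAX),
                                 @delimiter nchar(1) = N',')
   RETURNS @tbl TABLE (str nvarchar(4000) NOT NULL) AS
```

Note that when you call a T‑SQL function and want the default for a parameter, you cannot simply leave it out; you have to pass the keyword DEFAULT, as in charlist_to_tbl(@list, DEFAULT).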

The Output Table

The output from all functions is a table. For all methods, I've included one function for returning a table of strings, and for some methods also a function that returns a table of integers. (Which is likely to be the most common data type in this context.) If you have a list of integers and a function that returns strings, you can use it like this:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.id = convert(int, l.str)

The same applies to other data types. You can easily clone a version of the function that has the convert built in. (However, check the Robustness section for some things to look out for.)

The data type that requires most consideration is strings. Return nvarchar or varchar? Obviously when you work with Unicode data, you need to get nvarchar strings back. It may be tempting to always return nvarchar, but for reasons that I will return to in the performance section, you should make sure that you have a varchar string when you join with a varchar column. For some of the functions in this article, I have added both a varchar and an nvarchar column to the return table.

In some functions, I also return the position in the list for the elements. This can be handy when you have two or more lists that are horizontal slices of the same source, so you can say things like:

INSERT tbl(id, col2, col3)
   SELECT a.number, b.str, c.str
   FROM   intlist_to_table(@list1) a
   JOIN   charlist_to_table(@list2) b ON a.listpos = b.listpos
   JOIN   charlist_to_table(@list3) c ON a.listpos = c.listpos  

That is, this is a way to insert many rows in one go, although it's not really the best one. Sometimes this can be OK if you only have two or three columns per row to insert, but as the number of parallel lists grows, it gets out of hand, and you should investigate XML instead. The particular danger with the approach above is that if the lists get out of sync with each other, you will insert inconsistent data.

Knowing the list position is essential in situations when the list also implies the sorting for the output. Say that you get a list of US states like this: CA, TX, NY, FL, IL and the user wants the data to be sorted in that order. Without information about the list position, this is impossible.

For some methods, the list position can easily be derived from the method itself, for others (but not all) you can use the row_number() function.
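To make the point concrete with the state-list example above, a function that returns the list position makes the desired sort order a plain ORDER BY. (The States table and its columns are hypothetical; charlist_to_table is assumed to return the columns str and listpos, as in the INSERT example earlier.)

```sql
-- Sketch: sort the output in the order the values appear in the list.
DECLARE @states varchar(100) = 'CA,TX,NY,FL,IL'

SELECT s.StateName
FROM   States s
JOIN   charlist_to_table(@states) l ON s.StateCode = l.str
ORDER  BY l.listpos
```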

Robustness

It can't be denied that parsing strings is a bit risky. As long as the input plays by the rules everything goes fine, but what happens if it doesn't? A good list-to-table function can protect you from some accidents, but not all. Here are a couple of situations to watch out for.

Delimiter in the Input

Say that you have a couple of city names like this: Berlin, Barcelona, Frankfurt (Main), Birmingham, København. From this you want to compose a comma-separated list that you pass to list-to-table function. With the names listed above, that works fine, but then some joker enters Dallas, TX. Oh-oh.

There are several ways to deal with this problem. One is to use a delimiter that is unlikely to appear in the input data, for instance a control character. Many programs put strings into quotes, so the above list would read "Berlin","Barcelona" etc. This latter format is not supported by any of the functions I present, but you could tweak some of them a bit to get there.

Sometimes you cannot really make any assumption about the delimiter at all, for instance if the source is user input or data on the wire. In such case you will need to use a method with a general escape mechanism, of which I present one, to wit XML. Or you can avoid the delimiter business entirely by using fixed-length strings.

When you work with lists of integers, it is unlikely that the delimiter would appear in the data itself.
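To give a glimpse of what the XML escape mechanism buys you: with XML, a delimiter character in the data is a non-issue, since every value lives in its own element. (The element names in this sketch are my own choice; the XML methods are covered in detail later in the article.)

```sql
-- Sketch: "Dallas, TX" causes no trouble, as there is no delimiter to clash with.
DECLARE @x xml =
   N'<cities><c>Berlin</c><c>Dallas, TX</c><c>København</c></cities>'

SELECT T.c.value('.', 'nvarchar(100)') AS city
FROM   @x.nodes('/cities/c') AS T(c)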

Extra Spacing

If you have an input list that goes:

ALFKI, VINET, BERGS,FRANK

Do you want those extra spaces to be included in the data returned by the list-to-table function? Probably not. All functions in this article (but one!) strip leading and trailing spaces from list elements. However, there are some methods where this is not possible. (Or more precisely, they are not able to handle inconsistent spacing.)
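If you end up having to do the trimming yourself, for instance on the output of a method that does not strip spaces, ltrim and rtrim do the job. (The trim function arrived first in SQL 2017, after this article's vintage, so I stick to the older pair here.)

```sql
-- Sketch: strip leading and trailing spaces from a list element.
DECLARE @elem nvarchar(50) = N'  VINET '
SELECT ltrim(rtrim(@elem)) AS trimmed   -- gives N'VINET'
```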

Illegal Input

Say that you have a function that accepts a list of integers, and the input is 9, 12, a, 23, 12. What should happen?

With no particular coding, SQL Server will give you a conversion error, and the batch will be aborted. If you prefer, you can add checks to your function so that the illegal value is ignored or replaced with NULL. This is also very easy to achieve in SQL 2012 and up thanks to try_convert.

To focus on the main theme, I have not added such checks to the functions in this article.
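As a sketch of what the try_convert approach can look like (SQL 2012 and up), here applied to the built-in string_split from SQL 2016: values that fail the conversion come back as NULL and can be filtered out.

```sql
-- Sketch: ignore illegal values rather than dying with a conversion error.
DECLARE @list varchar(100) = '9, 12, a, 23, 12'

SELECT try_convert(int, value) AS number
FROM   string_split(@list, ',')
WHERE  try_convert(int, value) IS NOT NULL   -- the 'a' is silently dropped
```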

Empty Elements

What if a function that accepts a list of integers is fed the input: 9, 12,, 23, 12. How should that double comma be interpreted? If you just do a simple-minded convert, you will get a 0 back, which is not really good. It would be better to return NULL or just leave out the element. (Raise an error? You cannot raise errors in functions.)

One approach I have taken in some functions in this article is to avoid the problem altogether by using space as the delimiter. But since T‑SQL does not provide a function to collapse internal spacing, the approach is not without problems. For methods that build on the logic of traditional programming, you can easily handle multiple spaces, but for methods that use a combination of charindex and set-based logic, you would still have to filter out empty elements in the WHERE clause. (Something I have not done in this article.)

Then again, if your function accepts a list of strings, you may simply consider the empty string to be like any other value and you want your function to retain it.

Performance Considerations

While I have conducted performance tests and devoted long appendixes to them, a very important performance aspect lies not in the methods themselves, but in how you use them and how they are packaged. In this section I will look into some important issues.

varchar vs. nvarchar

As I discussed in the Interface section, it appears to be a good choice for a function that unpacks a list of strings to have an nvarchar column in its return table, so it can work with both Unicode and 8-bit data. Functionally, it's sound. Performance-wise it can be a disaster. Say that you have:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexedvarcharcol = l.nvarcharcol

Why is this bad? Recall that SQL Server has a strict data-type precedence, which says that if two values of different data types meet, the one with lower precedence is converted to the higher type. varchar has lower precedence than nvarchar. (Quite naturally, since the set of possible values for varchar is a subset of the possible values for nvarchar.) Thus, in the query above indexedvarcharcol will be converted to nvarchar. The cost for this depends on the collation of the column.

If the column has a Windows collation, SQL Server is still able to use the index, because in a Windows collation the 255 possible varchar characters make up a true subset of the possible values for nvarchar. This permits SQL Server to do something known as a range seek. My tests indicate a doubling or tripling of the execution time for the seek, which of course is reason enough to avoid the conversion. But to be fair: in many cases you would not notice.

On the other hand, with an SQL collation it is very likely that you would. In this case the index is completely useless, since in an SQL collation the rules for varchar and nvarchar are different. You will get a scan of some sort, and for a large table it may take minutes to run a query that would have taken milliseconds with an index seek.

Thus, it is instrumental that you have separate functions to return varchar and nvarchar data. Or do as I have done in some functions in this article: have two return columns, one for varchar and one for nvarchar. You can also always try to remember to convert the output column to the appropriate data type – but it's human to forget.

You can find out the collation of a column with sp_help.
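sp_help lists the collation among many other things; if you only want the collations, you can also query the catalog views directly, for instance like this:

```sql
-- Sketch: the collation of all character columns of a table,
-- here the Customers table in Northwind.
SELECT name, collation_name
FROM   Northwind.sys.columns
WHERE  object_id = object_id('Northwind.dbo.Customers')
  AND  collation_name IS NOT NULL
```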

There is one more thing to say on this theme, which I will come back to in a minute.

Inline, Multi-Statement and Temp Tables

This far I have used "table function" very loosely. There are several kinds of table functions, and with some methods you don't really even use a function. There are four main models:

   1. Direct SQL – the list is used directly in the query, with no function in between.
   2. Inline functions – either T‑SQL inline functions, whose bodies are expanded into the query, or opaque inline functions, that is, built-in or CLR table functions the optimizer cannot see into.
   3. Multi-statement functions – T‑SQL functions that fill a return table in their own scope.
   4. Intermediate storage – you first unpack the list into a table variable or a temp table, and then use that table in the query.

No method can be implemented in all these ways, but most methods permit you to use intermediate storage.

It may seem from the above that Direct SQL is the most preferable, and indeed table-valued parameters are. While I describe them in detail in another article, I would like to cover their superiority here. For almost all other methods described in this article, the optimizer has very little information about your input list. A TVP is just a form of a table variable, so there are still no statistics – save for one piece of information: cardinality. That is, the first time you call a procedure with a TVP, the optimizer will look at the number of rows in the table, use that number when determining the query plan, and then cache that plan. As long as there is no skew in the data, this is often sufficient.
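For reference, a minimal TVP setup on the server side looks like the sketch below; the client side is covered in my TVP article. (The type and procedure names are my own choices for the sketch.)

```sql
-- Sketch: a table type and a procedure that accepts it.
-- Note that TVP parameters must be declared READONLY.
CREATE TYPE integer_list_tbltype AS TABLE (n int NOT NULL PRIMARY KEY)
go
CREATE PROCEDURE get_products @ids integer_list_tbltype READONLY AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   @ids i ON P.ProductID = i.n
go
```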

At first glance, T‑SQL inline functions seem appealing, since there is no intermediate table and the optimizer has full information about the query. Unfortunately, the latter point is not really true. To be able to estimate the best way to access the other tables in the query, the optimizer would need to know: 1) how many rows the input string will generate, and 2) the distribution of the values. But the optimizer is not able to determine this for the SQL inline functions in this article, because the information is buried too deep in the logic of these functions.

Thus, in practice the optimizer will apply blind assumptions no matter whether you use SQL inline, opaque inline, a multi-statement function or a table variable. So it does not matter which one you use, then? Oh, no. The blind assumptions are different for the different methods, and if the function uses an auxiliary table of numbers, the size of that table will affect the blind assumptions. (Because the optimizer has information about that table.) And the blind assumptions for CLR functions and XML are different from each other and from those for T‑SQL inline functions.

One thing to keep in mind is this: up to SQL 2012 the blind estimate for a multi-statement function was a single row. This was advantageous in the fairly common case that the list is short, just a handful of values. However, in SQL 2014, the blind assumption for multi-statement functions was changed to 100. (More precisely, this applies if you use compatibility level 120 or higher and do not enable the trace flag to use the old cardinality estimator.) This is certainly good in many scenarios, but maybe not when you pass a list of two or three entries. Most other methods have a blind estimate of 50 to 200 rows and suffer from the same risk of overestimation with short lists.

The T‑SQL inline functions give the optimizer a higher degree of freedom as it works with the expanded query. I've seen cases where T‑SQL inline gives unexpectedly good performance, because the optimizer finds a parallel plan. However, there is a non-negligible risk that the optimizer outsmarts itself. Here is an extract from a mail that I received in response to my old article for SQL 2000:

After reading your article 'Arrays and list in SQL server' I tried to use the Fixed-Length Array Elements method in my application. Everything worked fine until I moved the code from a client batch to a stored procedure. When I looked at the query execution plan, I saw that the number of rows retrieved from the Numbers table was over 61 millions ! Instead of starting by joining the Numbers table with the source table to filter out the 500 rows included in the array, it processes the GROUP BY clause on the entire table (121 000 rows) and then it uses a nested loop to match each entry with the Number table.

With all other methods but T‑SQL inline, the optimizer does not really have much other choice than to first compute the table from the list input, but some T‑SQL inline functions open up the possibility of a reverse strategy – which is not likely to be successful. I suggested to my correspondent that he try a multi-statement function instead, and that resolved his issue. I will return to this theme in the section JOIN vs. EXISTS below.

Temp tables are a different matter. These have statistics, both cardinality and distribution. The latter is important if there is skew in the data. When you are not able to get a good plan on the blind assumptions, bouncing the data over a temp table may be required. There is an overhead for the bouncing, but it can pay off. If there is skew, or the number of elements can vary wildly, you may prefer to have a recompile every time, to get the best plan for the moment. You can achieve this by adding OPTION (RECOMPILE) to the end of the query. However, beware that autostats only kicks in after sufficiently many changes to the data, and because temp-table definitions are cached from execution to execution, the change counter carries over from one execution to another, which is why you may also need UPDATE STATISTICS. This is something that SQL Server MVP Paul White discusses in a lot more detail in his blog post Temporary Tables in Stored Procedures.
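A sketch of the temp-table pattern, with intlist_to_tbl standing in for any of the list-to-table functions in this article:

```sql
-- Sketch: bounce the list over a temp table to get real statistics.
CREATE TABLE #ids (number int NOT NULL)

INSERT #ids (number)
   SELECT number FROM intlist_to_tbl(@list)

UPDATE STATISTICS #ids          -- guard against stale cached statistics

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   #ids i ON P.ProductID = i.number
OPTION (RECOMPILE)              -- build the plan for this list's actual size
```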

Here is an overview of which strategies that are possible with which methods:

Method                     Direct SQL  T‑SQL Inline  Opaque Inline  Multi-Statement  Table Variable  Temp Table
Table-valued Parameters    Yes         No            No             No               Yes             Yes
string_split               No          No            Yes            No               Yes             Yes
Iterative Method           No          No            No             Yes              Yes             Yes
CLR                        No          No            Yes            No               Yes             Yes
XML and JSON               No          No            Yes            No               Yes             Yes
Table of Numbers           No          Yes           No             Yes              Yes             Yes
Fixed-Length               No          Yes           No             Yes              Yes             Yes
Function of Numbers        No          Yes           No             Yes              Yes             Yes
Recursive CTE              No          Yes           No             Yes              Yes             Yes
Dynamic SQL                Yes         No            No             No               No              No
Many Parameters            Yes         No            No             No               Yes             Yes
List to SELECT             No          No            No             No               Yes             Yes
Really Slow                Yes         No            No             No               No              No

MAX Types vs. Regular (n)varchar

In a previous section I discussed the problems with joining nvarchar and varchar. When I ran my performance tests and investigated some unexpected results, I discovered a second problem of a similar kind. Consider this:

SELECT ...
FROM   tbl t
JOIN   list_to_table(@list) l ON t.indexednvarcharcol = l.nvarcharmaxcol

The list-to-table function is here written in such a way that its return type is nvarchar(MAX). This too leads to an implicit conversion of the indexed column. At first sight it may not be apparent that it has to be this way, but when SQL Server evaluates an expression, it always works with the same data type for all operands. And nvarchar(n) is a different data type from nvarchar(MAX). The result of the implicit conversion is not fatal. The optimizer applies a range-seek operator and is still able to use the index, but nevertheless there is an overhead. When I initially ran my tests, I had not observed this issue, and my inline functions returned nvarchar(MAX) (for the simple reason that the input string was nvarchar(MAX)). As a consequence, my tests in some cases seemed to indicate that inline functions performed worse than the corresponding multi-statement solutions.

Presumably, most of the time when you use list-to-table functions for a list of strings, the list elements are short, just a few characters long. Therefore, there is all reason to make sure that your list-to-table function returns a regular varchar or nvarchar. Particularly, this means that for inline functions, you should make sure that the return value is explicitly converted to regular (n)varchar. You will see this in all inline functions in this article. But if you use the built-in string_split function, this is something you need to keep in mind yourself.
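For instance, as far as I can tell, string_split's value column inherits the type and length of the input string, so with an nvarchar(MAX) input you get nvarchar(MAX) back, and a convert in the query is in order. (The length of 50 is an arbitrary choice for the sketch; pick whatever fits your data.)

```sql
-- Sketch: cast string_split's output down to a regular nvarchar
-- to avoid the implicit conversion of the indexed column.
SELECT t.col1, t.col2
FROM   tbl t
JOIN   string_split(@list, N',') s
       ON t.indexednvarcharcol = convert(nvarchar(50), s.value)
```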

Collations

All functions in this article use nvarchar for parameters, output and internal variables alike. If you never work with Unicode data, you may think that you should rewrite the functions to use varchar instead, assuming that 8-bit characters are faster for SQL Server to work with than the 16-bit Unicode characters.

This may or may not be the case, depending on which collation you are using. As I discussed above under varchar vs. nvarchar, there are two sorts of collations: Windows collations and SQL collations. If you use a Windows collation, you get a slight reduction in performance if you use varchar rather than nvarchar. This is because with a Windows collation the Unicode rules and routines are always employed internally, so all that varchar buys you is an extra conversion to and from Unicode.

On the other hand, with an SQL collation you can get some 30 % improvement in execution time by using varchar instead. This is because for SQL collations there are separate internal routines for string handling that work only on 8-bit character data, and these 8-bit routines are far simpler than those for Unicode. (If you have an SQL collation and use nvarchar, you are in fact using a Windows collation under the cover.)

Note here that the exact gain depends on the type of operation. 30 % is what you can expect from a plain equality test. There are situations where the difference between varchar and nvarchar in an SQL collation can be as much as a factor of 7. We will look at such a case in the section on really slow methods.

If you have to use nvarchar, there is still a way to evade the overhead of the Unicode routines: use a binary collation. Now, if you opt to use a binary collation throughout your database, you will have to accept that all comparisons are case-sensitive and that sorting is funky, not least for languages other than English. So for most applications, using a binary collation is not a viable option. However, there exists a second possibility: force the collation for a certain expression. I have employed this throughout this article where it makes sense. You will see a lot of things like:

charindex(@delimiter COLLATE Slovenian_BIN2, @list, @pos + 1)

Since @delimiter is cast to an explicit collation, this also happens with @list. (This is discussed in Books Online in the topic Collation Precedence.) When using charindex to find a delimiter, odds are good that you are looking for the exact delimiter and you have no need for case- or accent-insensitive searches. Thus, using a binary collation in this situation does not lead to any loss in functionality. When I tested this for the iterative method, I got some 10 % improvement in execution time.

(Why Slovenian? And why BIN2? Well, I had to pick something, but which binary collation you chose does not matter. I used Slovenian because my test data happens to be a list of Slovenian words from a spelling dictionary. And I selected BIN2, since these newer binary collations resolve some oddities of the older BIN collations. But as I said, all that matters in this case is that you select a binary collation.)

Scalability in a Multithread Environment

When I discuss performance in this article, I'm mainly concerned with the method as such, since this is mainly what I have tested. But if you believe that you will have many users concurrently submitting comma-separated lists or similar, you will also be interested in how these methods scale. In my suite of performance tests in 2009, I made a feeble attempt to test this, but the tests were not very conclusive. You can read about my observations in the second performance appendix, but I only comment on it in a few places in this article.

How to Use These Functions – JOIN vs. EXISTS

Once you have such a function, how would you use it? One way is to use a join:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   intlist_to_tbl(@ids) i ON P.ProductID = i.number

But you could also write the query as:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
WHERE  P.ProductID IN (SELECT i.number FROM intlist_to_tbl(@ids) i)

Or:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
WHERE  EXISTS (SELECT *
               FROM   intlist_to_tbl(@ids) i
               WHERE  P.ProductID = i.number)

In this article I almost exclusively use the first model, using joins. Does this mean that this is the best way to write such queries? It turns out that this is a more delicate question than it may seem at first glance.

Before I start, let me first say that for all practical purposes there is no difference between IN and EXISTS; they will always generate the same plan. So the question is only whether to use a join on the one hand or IN/EXISTS on the other.

Next thing to note is that if you want to get the list position, you need to write a join, because then you are using the table function for more than just checking existence. But if we overlook that part, we can see that the join query above has a potential problem that the other two do not have. To wit, what if there are duplicates in the input list? Now, obviously, it could be the case that the business rules are such that if 27 appears multiple times in the list, Schoggi Schokolade should appear multiple times in the output. And in such case, the join is the only reasonable option.

But in most cases, you don't want duplicates in the output, and this means that if you use a join query, you need to eliminate duplicates with the DISTINCT keyword. And this means that SQL Server needs to put in extra work to sort (or hash) the data to get rid of the duplicates. With IN or EXISTS, you don't need DISTINCT, because you are only selecting rows from one table. Thus, EXISTS is leaner? Nah, not really. If you look at the IN/EXISTS queries above, they are in practice very similar to a join + DISTINCT, because that is how SQL Server will implement them.

So it doesn't matter which we use, then? Maybe in some cases. The more complicated the query gets, the more likely there will be differences. If you use DISTINCT + JOIN, DISTINCT applies to the entire result set, which may yield a different result than using IN or EXISTS. In the latter case, SQL Server knows that it only has to eliminate duplicates from the table function. And the query plan may for other reasons include a hash or sort operator that can perform the elimination of duplicates, so that there is no real extra cost.

Besides all the performance ponderings, there is the purist view: you should express your intent clearly, and if you mean to include the table function only to check for existence, you should state so in the query.

After all this rambling, it seems that I have arrived at the conclusion that IN/EXISTS are better than joins. So why do I use joins in the article, then? There are two reasons. One is kind of shameful: when I first wrote the article for SQL 2000 many years ago, my brain was on auto-pilot, and I just used what first came into my mind. The other reason is more heavy-duty: when I ran my performance tests in 2009, I added tests with EXISTS queries, and for all T‑SQL inline functions without exception this resulted in disastrous query plans. The execution time for an EXISTS query with a list of 20 elements ranged from 25 to 40 seconds. The corresponding execution times for the join queries were a few milliseconds in the very most cases. Such abysmal performance was far less common with join queries.

As I discussed above in the section Inline, Multi-Statement and Temp Tables, inline T‑SQL functions give the optimizer a degree of freedom which is not really good for its health. I would like to stress that my tests do not prove that SQL Server always freaks out on T‑SQL inline functions when EXISTS is involved. But nevertheless, it happened eight times out of eight in my tests, and while some functions are similar to each other, there are also some that are entirely different in nature.

So for this reason, I have retained the queries in the article as join queries. I have not added DISTINCT to them. After all, in many cases, you probably have control over how the lists are constructed, and you can be reasonably sure that they don't have duplicates.

At the end of the article there is a section of concluding performance remarks, where I further discuss this topic.

The Iterative Method

I will now describe the various methods to unpack lists into a table, one by one in detail.

I've opted to start with the iterative method. Together with using a recursive CTE, the iterative method needs the least amount of preparation. Just create the function and you are on the air. Most other methods require some extra step. Another advantage is that the code for the iterative method is very easy to understand, not least if you have a background in traditional programming languages. This makes things simple if you need to adapt the code to a special input format.

The method is far from the fastest in the bunch, but as long as you mainly work with lists of reasonable length, you will find performance acceptable. If you often work with lists of several thousand elements, you should probably investigate some of the faster methods.

List-of-integers

You have already seen an example of the iterative method in the introductory short version, but I repeat it here (with a different name, as all functions in this article have a name which includes the method used).

CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos        int,
           @nextpos    int,
           @valuelen   int

   SELECT @pos = 0, @nextpos = 1

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
   RETURN
END

The idea is simple. We iterate over the string, look for commas, and then extract the values between the commas. Note the use of the third parameter to charindex; it specifies the position where to start searching for the next comma. The computation of @valuelen includes the only complexity: we must cater for the fact that charindex returns 0 when there are no more commas in the list.

However, this function is slower than it has to be. Recall what I said in the interface section above about there being an overhead for MAX types because they need to be able to spill to disk. Originally, when I implemented the iterative method for SQL 2000, writing it like above was not possible, because there was no nvarchar(MAX), only the old ntext which was not fully supported by charindex. Therefore I had to implement a solution where I broke up the string into chunks of nvarchar(4000). When I ran my performance tests, I found that chunking was a meaningful thing to do on SQL 2005 and later as well, as it improves performance by 20-30 %.

There is a second problem with iter$simple_intlist_to_tbl: if you for some reason feed it two consecutive commas, this will result in a 0 in the output, which isn't really good. While you can easily address this by adding some extra logic to the function, my preference is to avoid the problem by using space as a separator. The comma does not really fill any purpose for a list of integers. Or so was my thought when I wrote the function many years ago. Little did I know that there are controls in .NET and SSRS that generate integer lists with comma as separator. So obviously a method that can handle commas is desirable. (And quite obviously it is reasonable to expect a list generated programmatically to obey the format and not have multiple consecutive commas.)
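To illustrate why two consecutive commas yield a 0, and one possible shape of that extra logic (my sketch, not part of the original function):

```sql
-- An empty string converts to 0, which is why ",," produces a 0 in the output:
SELECT convert(int, '') AS empty_to_int

-- One way to add the guard inside the function: only insert non-blank
-- elements. (The commented lines use the variable names from
-- iter$simple_intlist_to_tbl and would replace its INSERT statement.)
-- IF ltrim(substring(@list, @pos + 1, @valuelen)) <> ''
--    INSERT @tbl (number)
--       VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
```

This is essentially the same guard that the chunking function below applies with its test on @str.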

Nevertheless, I present the function as I originally wrote it, as that is what I tested in my performance tests. Tweaking it to handle comma or any other separator is left as an exercise to the reader. So here it is:

CREATE FUNCTION iter_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       number  int NOT NULL) AS
BEGIN
   DECLARE @startpos int,
           @endpos   int,
           @textpos  int,
           @chunklen smallint,
           @str      nvarchar(4000),
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = ltrim(@leftover +
                    substring(@list, @textpos, @chunklen))
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @str = substring(@tmpstr, @startpos + 1,
                              @endpos - @startpos - 1)
         IF @str <> ''
            INSERT @tbl (number) VALUES(convert(int, @str))
         SET @startpos = @endpos
         SET @endpos = charindex(' ' COLLATE Slovenian_BIN2,
                                 @tmpstr, @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   IF ltrim(rtrim(@leftover)) <> ''
      INSERT @tbl (number) VALUES(convert(int, @leftover))

   RETURN
END

Here is an example of how you would use this function:

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9 12 27 37'

This function has two loops. One which creates the chunks, and one that iterates over the chunks. The first chunk is always 4 000 characters (provided that the input is that long, that is). As we come to end of a chunk, we are likely to be in the middle of an element, which we save in @leftover. We bring @leftover with us to the next chunk, and for this reason we may grab fewer than 4 000 characters from @list this time. When we have come to the last chunk, @leftover is simply the last list element.

Multiple spaces are handled by simply ignoring @str if it's blank.

There are two things from the general considerations that I have added to this function that were not in iter$simple_intlist_to_tbl:

  1. The output table includes the list position.
  2. I use the COLLATE clause to force a binary collation to gain some further performance.

Before I close this section, I need to credit Sam Saffron who pointed out a performance flaw in my original implementation where I kept reallocating the string rather than using the third parameter of charindex.

List-of-strings

Here is a similar function that returns a table of strings instead.

CREATE FUNCTION iter_charlist_to_tbl
                 (@list      nvarchar(MAX),
                  @delimiter nchar(1) = N',')
      RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                          str     varchar(4000)      NOT NULL,
                          nstr    nvarchar(2000)     NOT NULL) AS

BEGIN
   DECLARE @endpos   int,
           @startpos int,
           @textpos  int,
           @chunklen smallint,
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000),
           @tmpval   nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @tmpval = ltrim(rtrim(substring(@tmpstr, @startpos + 1,
                                             @endpos - @startpos - 1)))
         INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
         SET @startpos = @endpos
         SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2,
                                 @tmpstr, @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   INSERT @tbl(str, nstr)
      VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))
   RETURN
END

An example of how you would use this function:

CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   iter_charlist_to_tbl(@customers, DEFAULT) s
     ON   C.CustomerID = s.nstr
go
EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'

There are a few differences from iter_intlist_to_tbl. I've added a parameter to specify the delimiter, since you may need to use different delimiters depending on the data. Note that if you want to use the default of a table-valued function, you cannot simply leave out the parameter; you have to specify the keyword DEFAULT to get the default value. Another addition is that I trim off space that appears directly adjacent to the delimiters, while space within the list elements is retained.

I would like to draw attention to the use of datalength. There are two system functions in T‑SQL that return the length of a string: len returns the number of characters and does not count trailing spaces; datalength returns the number of bytes (whence all these / 2) and includes trailing spaces. I'm using datalength here, since there is no reason to ignore trailing spaces in the chunks – they could be in the middle of a list element.
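A quick illustration of the difference (my example, not from the original text):

```sql
-- len counts characters and ignores trailing spaces; datalength counts
-- bytes (two per character for nvarchar) and includes trailing spaces.
DECLARE @s nvarchar(20)
SET @s = N'abc  '
SELECT len(@s)        AS len_chars,        -- 3
       datalength(@s) AS datalength_bytes  -- 10
```

With the two trailing spaces counted, the string is five characters, hence ten bytes.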

Using the CLR

Introducing the CLR

SQL 2005 added the possibility to create stored procedures, functions etc in .NET languages such as C# and Visual Basic .NET, or any other language that supports the Common Language Runtime. Even though this option has been in the product for over ten years, it is still something that most SQL programmers are unaware of, or at least very shy of.

If you have never worked with the CLR before, you may find that this method goes a little over your head, and you may prefer to use a pure SQL method. On the other hand, if you are a seasoned C# or VB programmer, you will surely appreciate this method. Just like the iterative method, this method lends itself very easily to modifications to adapt to special input formats.

There are many ways you can abuse the CLR and use it when you should not, but a list-to-table function is a prime example of what the CLR is good for: operations that do not perform any data access, but that perform complex manipulations of strings or numbers. The reason for this is two-fold: 1) The CLR gives you a much richer set of functions to work with, regular expressions to name just one. 2) The CLR languages are compiled, while T‑SQL is interpreted, leading to much better performance with the CLR. In the realm of table-valued functions there is another factor that improves performance: the output from a table-valued CLR function is not written into any intermediate storage; rather, the rows are fed into the rest of the query as soon as they are produced. So in that sense they are inline functions, but in contrast to T‑SQL's own inline functions, the optimizer has no idea what they will produce, which is why I refer to them as opaque inline.

A special quirk of CLR functions is that they cannot return varchar data – you can only return nvarchar. This means that when you work with lists of strings, you must always be careful to convert the output to varchar when you join with varchar columns in tables, as I discussed in the section varchar vs. nvarchar.
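As a sketch of what this means in practice (my example, with a hypothetical table orders that has a varchar(20) column code; CLR_charlist_split is the function created in the next section):

```sql
-- Sketch with hypothetical names: since CLR functions can only return
-- nvarchar, convert the output back to varchar when joining against a
-- varchar column, so that the table column is not implicitly converted.
DECLARE @codes nvarchar(200)
SET @codes = N'ABC,DEF,GHI'
SELECT o.code, o.amount
FROM   orders o
JOIN   CLR_charlist_split(@codes, DEFAULT) s
  ON   o.code = convert(varchar(20), s.str)
```

Without the convert, the nvarchar value would force a conversion of o.code instead, which can prevent an index on the column from being used efficiently.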

By default, running user-written CLR code is disabled in SQL Server. You can enable the CLR by running

EXEC sp_configure 'CLR enabled', 1
RECONFIGURE  

from a query window.

In the following I will try to give a crash course in how to write a table-valued function in the CLR. In the interest of brevity, I'm only including examples in C#.

CLR Functions Using Split

We will look at two ways of implementing a list-to-table function in the CLR. The first one, with very little code of our own, serves as an introduction to CLR table functions rather than as an example of a function you actually should use. In the second alternative, we roll our own, which opens up for a higher degree of flexibility and also better scalability.

The Code

A complete C# file that implements two list-to-table functions, one for strings and one for integers, need be no longer than this:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_split
{
    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerable CLR_charlist_split(SqlString str,
                                                 SqlString delimiter)
    {
        return str.Value.Split(delimiter.Value.ToCharArray(0, 1));
    }

    public static void CharlistFillRow(object row, out string str)
    {
        str = (string)row;
        str = str.Trim();
    }

    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerable CLR_intlist_split(SqlString str)
    {
        return str.Value.Split((char[]) null,
                               System.StringSplitOptions.RemoveEmptyEntries);
     }

    public static void IntlistFillRow(object row, out int n)
    {
        n = System.Convert.ToInt32((string) row);
    }
}

Compile and Install

To compile this, open a command-prompt window and make sure that you have C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727 or corresponding in your path. (Later versions of the .NET Framework will work as well, but 1.1 will not.) Assuming that the name of the file is CLR_split.cs, the command is:

csc /target:library CLR_split.cs

This gives you CLR_split.dll. If your SQL Server is not on your local machine, you will have to copy the DLL to the server box, or make the DLL visible from the server in some way. Then run from a query window:

CREATE ASSEMBLY CLR_split FROM 'C:\somewhere\CLR_split.dll'
go
CREATE FUNCTION CLR_charlist_split(@list nvarchar(MAX),
                                   @delim nchar(1) = N',')
RETURNS TABLE (str nvarchar(4000))
AS EXTERNAL NAME CLR_split.CLR_split.CLR_charlist_split
go
CREATE FUNCTION CLR_intlist_split(@list nvarchar(MAX))
RETURNS TABLE (number int)
AS EXTERNAL NAME CLR_split.CLR_split.CLR_intlist_split
go

(Note: it is also possible to deploy the functions from Visual Studio, but I'm not showing you that, because I don't know how to do it myself. Visual Studio leaves me in a maze, and at the same time I find the command-line very simple to use. What I have been told is that VS may require you to add extra attributes to the functions if you want to deploy the functions that way.)

You have now created the functions and can use them from T‑SQL. Here is an example for both:

CREATE PROCEDURE get_company_names_clr @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   CLR_charlist_split(@customers, DEFAULT) s
     ON   C.CustomerID = s.str
go
EXEC get_company_names_clr 'ALFKI, BONAP, CACTU, FRANK'
CREATE PROCEDURE get_product_names_clr @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   CLR_intlist_split(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_clr '9 12 27  37'

As with iter_intlist_to_tbl, CLR_intlist_split takes a space-separated list of integers.

What's Going On?

If you have never worked with CLR table functions before, you may at this point wonder how this all works, and I will try to explain.

CREATE ASSEMBLY loads the DLL into SQL Server. Note that it does not merely save a pointer to the file; the DLL as such is stored in the database. Since CREATE ASSEMBLY operates from SQL Server, the file path refers to the drives on the server, not on your local machine. (If you are loading the assembly from a network share, it's better to specify the location by \\servername than by drive letter.) It is also possible to load an assembly as a hex-string.

The CREATE FUNCTION statements look just like the statements for creating multi-statement functions. That is, you specify the parameter list and the return table. But instead of a body, AS is followed by EXTERNAL NAME where you specify the CLR method to use. This is a three-part name where the first part is the assembly, the second part is a class within the assembly, and the last part is the name of the method itself. In this example, I'm using the same name for the assembly in SQL Server as I do for the class.

There is one small detail on the return table: For a multi-statement function you can specify that a column is nullable, you can define CHECK and DEFAULT constraints and define a PRIMARY KEY. This is not possible for CLR functions.

If we turn to the C# code, the table-valued function is implemented through two C# methods. The first method is the one that we point to in the CREATE FUNCTION statement. The second method is specified through the attribute that comes first in the definition of the CLR_charlist_split method. That is, this line:

[SqlFunction(FillRowMethodName="CharlistFillRow")]

This line specifies that the method is a table-valued function and points to the second method of the function. CLR_charlist_split is the entry point and is called once. The entry point must return a collection or an enumerator, and the CLR will call the method specified in FillRowMethodName once for every element in the collection/enumerator, and each invocation produces a row in the output table of the function.

So this is what happens when you call CLR_charlist_split from T‑SQL. The C# method calls the String method Split which splits the string into a collection over a delimiter. (For full details on Split, I refer you to the MSDN documentation.) Since you get a collection, you need do no more. The CLR calls CharlistFillRow for each element in the collection. And as I noted above, as soon as a row is produced, it can be consumed in the outer query, without waiting for the table function to complete.

What about parameters? As you may guess, the parameter list of the entry method must agree with the parameter list in the CREATE FUNCTION statement. The exact rules for mapping SQL data types to those of the CLR are beyond the scope of this text, please refer to Books Online for the full details.

The first parameter of the fill method (CharlistFillRow) is of the type object. This is the current element in the collection/​enumeration and to use it, you will need to cast it to the actual data type. The remaining parameters to the fill method are all output parameters, and they map to the columns in the output table in the CREATE FUNCTION statement.

One more thing calls for attention: the return type of the entry function. In this example it is IEnumerable, since Split returns a collection. The only other alternative is IEnumerator, which we will look at shortly.

Back on Track

After this excursion into CLR, let's get back to the topic of list-to-table functions. What are the characteristics of the list-to-table functions that use Split?

As you can see, we don't return the list position. As far as I know, you cannot get the list position this way, but I will have to admit that I have not dug into it. Overall, Split puts you pretty much into a straitjacket: you can only be as flexible as Split is. That includes specifying an alternate string delimiter, which can be multi-character, and you can specify that empty elements should not be returned (which I make use of in the function for integers). But that's it.

There is another thing that is problematic with Split. As I said, it splits up the entire string into a collection at once, and that collection takes up memory. Of course, if you send in an occasional string with 10 elements, that is not an issue. But what if you have a busy website with 10 000 concurrent users? SQL Server MVPs Paul Nielsen and Adam Machanic have both reported that they have tried to use CLR functions based on Split with disastrous results. If SQL Server comes under memory pressure, the AppDomain may be unloaded, which may cause a CLR function based on Split to crash. Or at the very least the AppDomain will have to be reloaded, which takes a couple of seconds. (If you are not acquainted with the term AppDomain, I will have to admit that it is a bit of mumbo-jumbo to me as well, but it is something that hosts the CLR execution. I kindly refer you to Books Online and the .NET documentation for details.) In short: methods based on Split can lead to scalability and stability issues on a busy system.

Rolling Our Own in the CLR

Instead of relying on Split, we can do the work ourselves. The advantage of this is that we gain flexibility and we also avoid the stability issues with Split. Here is a C# file that is longer than the previous one:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_iter
{

    private class stringiter : IEnumerator
    {
        string  _str;
        char    delim;
        int     _start_ix;
        int     _end_ix;
        int     _listpos;


        public string str {
           get {
              return this._str;
           }
        }

        public int start_ix {
           get {
              return this._start_ix;
           }
        }

        public int end_ix {
           get {
              return this._end_ix;
           }
        }

        public int listpos {
           get {
              return this._listpos;
           }
        }


        public stringiter(SqlString  str,
                          SqlString  delimiter) {
           this._str  = str.IsNull ? "" : str.Value;
           this.delim = delimiter.IsNull ? '\0'
                               : delimiter.Value.ToCharArray(0, 1)[0];
           Reset();
        }

        public bool MoveNext() {
            this._start_ix = this._end_ix + 1;
            if (delim == ' ') {
               while (this._start_ix < this._str.Length &&
                      this.str[this._start_ix] == ' ') {
                  this._start_ix++;
               }
            }

            if (this._start_ix >= this._str.Length) {
               return false;
            }

            this._end_ix = this.str.IndexOf(this.delim, this._start_ix);
            this._listpos++;
            if (this.end_ix == -1) {
               this._end_ix = this._str.Length;
            }
            return true;
        }

        public Object Current {
          get {
            return this;
          }
        }

        public void Reset() {
           this._start_ix = -1;
           this._end_ix   = -1;
           this._listpos  = 0;
        }
    }

    [SqlFunction(FillRowMethodName="CharlistFillRow")]
    public static IEnumerator CLR_charlist_iter(SqlString str,
                                                SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void CharlistFillRow(object     obj,
                                       out int    listpos,
                                       out string str)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        str = iter.str.Substring(iter.start_ix,
                                 iter.end_ix - iter.start_ix);
        str = str.Trim();
    }


    [SqlFunction(FillRowMethodName="IntlistFillRow")]
    public static IEnumerator CLR_intlist_iter(SqlString str,
                                               SqlString delimiter)
    {
        return new stringiter(str, delimiter);
    }

    public static void IntlistFillRow(object     obj,
                                      out int    listpos,
                                      out int    number)
    {
        stringiter iter = (stringiter) obj;
        listpos = iter.listpos;
        string str = iter.str.Substring(iter.start_ix,
                                        iter.end_ix - iter.start_ix);
        number = System.Convert.ToInt32(str);
    }
}

The key is the internal class stringiter. First note the class declaration itself:

private class stringiter : IEnumerator

This means that the class implements the IEnumerator interface, which is a requirement for a table-valued function. (That, or IEnumerable.) Next follow the internal class variables, and property methods to read these values from outside the class. The next piece of interest is the constructor:

public stringiter(SqlString  str,
                  SqlString  delimiter) {

This method creates an instance of the stringiter class. This constructor is called from the entry-point method of the table-valued function.

Next follow MoveNext, Current and Reset. These are the methods that implement IEnumerator, and they must have precisely the names and signatures that you see above. (For more details on IEnumerator, see MSDN.) The interesting action goes on in MoveNext. It is here we look for the next list element and it is here we determine whether we are at the end of the list. As long as MoveNext returns true, the fill method of the table function will be called. That is, MoveNext should not return false when it finds the last element, but the next time round.

What is interesting from a list-to-table perspective is that MoveNext handles space as the separator in a special way: multiple spaces are collapsed into one. This does not happen with other delimiters.

After the code for the stringiter class come two entry-point methods with accompanying fill methods, one for strings and one for integers. In contrast to the previous example with Split, the entry points here return IEnumerator, since we implement IEnumerator ourselves. But similar to the Split example, all the entry points do is create a stringiter object.

The fill methods, finally, extract the data from the current stringiter object. I grab hold of the list position, so that it can appear in the output table.

For completeness, here is the SQL declaration of the functions:

CREATE FUNCTION CLR_charlist_iter(@list nvarchar(MAX),
                                  @delim nchar(1) = ',')
RETURNS TABLE (listpos int,
               str     nvarchar(4000))
AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_charlist_iter
go
CREATE FUNCTION CLR_intlist_iter(@list nvarchar(MAX),
                                 @delim nchar(1) = ' ')
RETURNS TABLE (listpos int,
               number  int)
AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_intlist_iter
go

At first glance, it may seem that these functions do not give you that much more than Split. You get the list position, and the intlist function has a delimiter parameter, but most of all you seem to get a lot more code. But there is more than meets the eye. The above gives you a framework to work with if you need to handle more complex list formats or some other functionality not supported by the functions as given. Here, I have used the same enumerator class for strings and integers, but the fancier you want to get, the better off you probably are with different classes.

What About Performance?

Overall, using a CLR function is one of the fastest methods around. Of the methods presented in this article, it is generally only the fixed-length method that we will look at later that can compete. To that come table-valued parameters, which I discuss in a separate article.

Since the CLR is fast, it's no wonder that there is an on-going battle in the blogosphere to find the very fastest function. SQL Server MVP Adam Machanic presented what he thinks is the best so far in this blog post. Instead of using the String class and the IndexOf method, he loops over a Char array and looks for the delimiters himself.

However, when I have run my performance tests, the performance of the different CLR functions I have tested has been virtually identical. I tested the two functions above, Adam's, and one more which takes fixed-length input rather than a list of values (more about this format later in the article). Only at extremely long input, 50 000 list elements, can I sense a small advantage for the CLR_iter solution above, with Adam's solution tending to be the slowest. But the difference is very small, maybe 20 milliseconds when the total execution time is half a second or more.

I should immediately qualify this. Adam's function handles multi-character delimiters, and mine doesn't. And to be able to handle multi-character delimiters, Adam does the right thing. If you were to take my functions above and just change the definition, you would lose some performance directly. This is because when you pass IndexOf a Char parameter for the delimiter, the comparison is culture-insensitive. (Which I assume is .NET-speak for binary.) But if you instead pass the delimiter as a String – which you would need to for a multi-character delimiter – you invoke an overload of IndexOf that is culture-sensitive, that is, you get the full Unicode overhead, which as I discussed above comes with a cost in performance.

There is nevertheless one performance flaw in Adam's function as posted: he only provides a function that returns strings, so if you have a list of integers, you need to apply convert in your T‑SQL code. When testing, I found that there was a small performance benefit in having the conversion done in the C# code rather than in the T‑SQL code. (This is not covered in the performance appendixes, because by the time I made that particular test, I had already compiled the results.)

As I have mentioned, in the later of my test suites, I included a multi-thread test to see how the various methods scale in a multi-user environment. I did not have the opportunity to run these tests on server-class hardware, and I have limited faith in the accuracy and the relevance of the results. However, the CLR methods emerged as clear winners: they were consistently among the methods that scaled the best. (Observe here that due to an issue in my test setup, I was not able to include table-valued parameters in the multi-thread tests.)

Interestingly enough, the functions based on Split performed well in my tests even when I ran with 100 parallel threads, and I did not see the issues that Adam and Paul have reported. I have a theory that these crashes were more likely to appear on 32-bit SQL Server, where the memory space beyond 2 GB was available only to the buffer cache, and everything else had to be shoehorned into the lower 2 GB. But that is mere speculation on my part. In any case, today 32-bit SQL Server is very much a thing of the past.

string_split – the Built-in Solution

SQL 2016 introduced a new built-in function, string_split, which saves you from writing your own list-to-table function. This is certainly convenient, not least if you are in the unfortunate situation that you cannot add your own user-defined functions to the database, but have to work with what is there. However, it is not that simple to use, not least if you want to avoid performance surprises. Generally, I would say the function comes across as half-baked to me.

Here is an example of how to use it for an integer list:

CREATE PROCEDURE get_product_names_builtin @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   string_split(@ids, ',') s ON P.ProductID = convert(int, s.value)
go
EXEC get_product_names_builtin '9, 12, 27,  37'

Because string_split always returns character data, you need to cast the return value to int (or some other numeric data type). It is true that some of the functions in this article are also of this kind, but given how common numeric lists are, I think that when they added a built-in function, they could have added a second one for numbers.
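If you cannot trust every element in the list to be a well-formed number, try_convert, available since SQL 2012, may come in handy. This is a sketch of my own, not something the article's tests cover:

```sql
-- Sketch: try_convert returns NULL instead of raising an error for
-- elements that are not valid integers, so bad elements can be filtered
-- out rather than aborting the query.
SELECT try_convert(int, s.value) AS number
FROM   string_split('9, 12, x, 37', ',') s
WHERE  try_convert(int, s.value) IS NOT NULL
```

Whether silently dropping malformed elements is the right behaviour depends on your application; sometimes a hard error is exactly what you want.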

As you see from the example, you can specify a delimiter – in fact, this is mandatory. Again, given how common comma-separated lists are, one would like to think that this could have been made the default. The input string can be of any varchar and nvarchar length, including MAX, so from that point of view, nothing is missing. On the other hand, the delimiter can only be char(1) or nchar(1). That is, there is no support for multi-character delimiters.

Here is an example of using string_split to crack a list of strings:

CREATE PROCEDURE get_company_names_builtin @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   string_split(@customers, ',') s
     ON   C.CustomerID = convert(char(5), ltrim(s.value))
go
EXEC get_company_names_builtin 'ALFKI, BONAP, CACTU, FRANK'

You may be surprised by the appearance of ltrim and the explicit convert to char(5). The call to ltrim is necessary here, because string_split does not trim leading or trailing spaces, but goes blindly by the delimiter. So without ltrim I would not get BONAP, but ' BONAP' with a leading space. (I don't use rtrim here, since trailing spaces are ignored in string comparisons in SQL anyway.)

Why convert then? The return type of the column value is the same as the type of the input list. This choice is not illogical, since in theory the list could be a single element, and then there should be no truncation. But this means that if the input is (n)varchar(MAX), the type of value is also MAX, and as I discussed in the section MAX Types vs. Regular (n)varchar, this introduces an implicit conversion of the table column which impairs performance. Hence, we need to use convert to prevent this from happening. Since it is not very common that you want to extract elements as long as 4000 characters, I think it would have been a better choice to limit the return type to never be more than 8000 bytes, no matter the input.

Another thing to note is that the result set from string_split has this single column value. That is, the list position is not there. While the position is not always critical, you will miss it often enough. And what is the point of a built-in function that you can only use sometimes?

When it comes to performance, the optimizer estimates that string_split always returns 50 elements. That is, it does not consider the length of the input string. Or at least that is the situation in SQL 2016 RTM; as always with the optimizer, this could change in the future.

I have not run any performance tests with this function, so I cannot say how well it fares, but I would expect it to do well, since there is no T‑SQL overhead. However, it has a go-slower button pressed. To wit, the delimiter is handled according to the current collation, which means that with a Windows collation you drag in the full Unicode comparison rules and the overhead that comes with them. You can avoid this by forcing a binary collation yourself, but then you are likely to get a collation conflict with the column you are comparing to. So you would end up with:

CREATE PROCEDURE get_company_names_builtin @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   string_split(@customers, ',' COLLATE Slovenian_BIN2) s
     ON   C.CustomerID = convert(char(5), ltrim(s.value)) COLLATE DATABASE_DEFAULT
go
EXEC get_company_names_builtin 'ALFKI, BONAP, CACTU, FRANK'

Note here that DATABASE_DEFAULT assumes that the C.CustomerID actually follows the database collation. That is, you must make sure that you use the collation of the column you are comparing to. If you force any other collation, any index on the column is dead and you may be in a performance disaster because of the scan. (The reason you would prefer DATABASE_DEFAULT over the actual collation is of course to avoid the hard-coding.)

By now the usage of string_split is so complex that the benefit of the function being built-in is almost lost.

XML and JSON

Rather than passing a comma-separated list, you can create an XML document to pass your list of values. Undeniably, this is more complex, so in most cases you may not find it worth the trouble as long as it's only a list of single values. Where XML really shines is when you need to insert many rows. You transform your data to an XML document in the client, and then you unpack it into your tables with the xml type methods. Until table-valued parameters entered the scene in SQL 2008, XML was the best method to pass a set of data from an application.

If you are on SQL 2016, you can instead use JSON, which I will discuss very briefly here.

Using XML for a List of Values

Let's go straight to the matter. Here are our get_company_names and get_product_names using XML.

CREATE PROCEDURE get_company_names_xml @customers xml AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   @customers.nodes('/Root/Customer') AS T(Item)
     ON   C.CustomerID = T.Item.value('@custid[1]', 'nchar(5)')
go
EXEC get_company_names_xml
    N'<Root><Customer custid="ALFKI"/>
            <Customer custid="BONAP"/>
            <Customer custid="CACTU"/>
            <Customer custid="FRANK"/>
      </Root>'
go
CREATE PROCEDURE get_product_names_xml @ids xml AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @ids.nodes('/Root/Num') AS T(Item)
     ON   P.ProductID = T.Item.value('@num[1]', 'int')
go
EXEC get_product_names_xml N'<Root><Num num="9"/><Num num="12"/>
                             <Num num="27"/><Num num="37"/></Root>'

The two xml type methods we use are nodes and value. nodes is a rowset function that returns a one-column table where each row is an XML fragment for the given path. That is, in the two examples, you get one row for each Customer or Num node. As with derived tables, you must specify an alias for the return table. You must also specify a name for the column that nodes produces. (You can name columns in derived tables in this way too, but it's not that common to do so.)

The sole operation you can perform on T.Item is to employ one of the four xml type methods: exist, query, nodes and value. Of interest to us here is value. This method extracts a single value from the XML fragment. value takes two parameters: the first is a node specification for a single element or attribute. In this case we want an attribute – custid and num respectively – which is why we put @ in front. (Without the @, it would be an element specification.) As for the meaning of [1], I will come back to that. The second argument to value is the T‑SQL data type for the return value.

Creating the XML Document

Normally, you create the XML document client side using some suitable library for the purpose, for instance the XmlWriter class in .NET. You should never attempt to create XML documents by just concatenating strings, since many characters need to be encoded when you put them in XML. For instance, the ampersand character (&) is encoded as &amp; in XML.

That said, for a list of numbers there are not really many things that can go wrong. So here is a procedure that receives a comma-separated list, transforms it into XML, and then shreds the XML to get the values:

CREATE PROCEDURE get_product_names_xmlcsv @csv varchar(100) AS
   DECLARE @xml xml = 
          '<Root><Num num="' + 
          replace(convert(varchar(MAX), @csv), 
                  ',' COLLATE Slovenian_BIN2, '"/><Num num="') + 
          '"/></Root>'
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @xml.nodes('/Root/Num') AS T(Item)
     ON   P.ProductID = T.Item.value('@num[1]', 'int')
go
EXEC get_product_names_xmlcsv '9, 12, 27, 37'

At first glance, this may seem more cute than useful, but if you are in the dire situation that you cannot create functions in the database, this is not a bad solution at all. After all, it is quite compact. The reason I cast @csv to varchar(MAX) is that if @csv is regular varchar, so is the return type from replace. If the input is close to 8000 characters, truncation could occur.

But please don't ever do this for a list of strings – this is bound to end in tears the day there are XML special characters in the data.

Performance

Just like with the CLR, the data from nodes streams into the rest of the query, so XML is also an inline method. But since parsing XML is more complex than parsing comma-separated lists, it's slower. In my tests, the execution times for XML are 40-60 % higher than for the CLR, but it is twice as fast as the iterative method. In fact, XML is the fastest method that does not require any preparations in the server: you don't have to activate the CLR and create a function, nor do you need to define a table type for a table-valued parameter.

You may guess that there is a higher performance cost for passing an XML document, and indeed there is. In my tests I measured this to be 20-25 ms for 10 000 list elements, to compare with 5-7 ms for most other methods. It's hardly a cause for alarm.

Note: these observations apply to the case where the XML document is passed from the client. I have not run any performance tests with the pattern in get_product_names_xmlcsv, since I added that example in the revision in 2016. But the only extra cost is the cost of replace, which I have tried to minimise by forcing a binary collation.

That said, it shall not be denied that there are certainly problems with performance with the xml type methods and XQuery in SQL Server. When the optimizer gets it right, performance is often very good. But if the optimizer gets it wrong, performance can be very bad. From one point of view, this is not entirely surprising: the optimizer has very little idea of what is in that XML document, and rather than using estimates based on data, it relies on standard assumptions. Unfortunately, one cannot escape the feeling that the optimization of XML queries is an unfinished chapter in SQL Server. Sometimes, seemingly meaningless changes can have a big impact on performance. Above I wrote the expression to extract a number from the XML document as:

T.Item.value('@num[1]', 'int')

But I did not say what [1] means. The answer is that it means – nothing! @num and @num[1] mean the same thing: retrieve the attribute num in the current context node. Unfortunately, SQL Server does not seem to understand this, and will generate different query plans depending on whether [1] is there or not. In many cases the plan with only @num will be a teeny-weeny bit more efficient, because it has fewer operators. But because of its misperception of what @num implies, the optimizer sometimes produces a really bad plan. This is less likely to happen if you add [1]. Thus, my recommendation is that you always add [1] when you address attributes, although there is no logical reason why you should have to. (And don't be surprised if you find queries where removing [1] gives you a vast performance improvement. Query plans when XQuery is involved sometimes feel very much like a lottery.)

One practical conclusion of this is that you should be restrictive about entangling the shredding of the XML document with the rest of your query. It may be better to keep it simple: insert the data into a temp table and take advantage of the fact that temp tables have statistics.
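
As a sketch of that pattern, here is a variation of get_product_names_xml (the procedure name and the temp-table layout are my own) where the shredding is kept apart from the join, so that the join runs against a temp table with statistics:

```sql
CREATE PROCEDURE get_product_names_xmltemp @ids xml AS
   -- Shred the XML document into a temp table first.
   CREATE TABLE #ids (id int NOT NULL PRIMARY KEY)
   INSERT #ids (id)
      SELECT T.Item.value('@num[1]', 'int')
      FROM   @ids.nodes('/Root/Num') AS T(Item)
   -- The rest of the query is now a plain join against a table
   -- that the optimizer has statistics for.
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #ids i ON P.ProductID = i.id
go
```

If the list can contain duplicates, you would have to drop the primary key or filter the duplicates out before the insert.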

List Position

I thought for a long time that you cannot retrieve the list position from an XML document, as there is no xml type method that returns this information. Then SQL Server MVP Alejandro Mesa reminded me that you can try:

row_number() OVER(ORDER BY T.c)

That is, you order by the XML fragment produced by the nodes method. Digging into the query plan, and looking at the Sort operator, I find that it sorts by this expression:

[XML Reader with XPath filter].id

Looks good, doesn't it? But beware! This is not documented, and it could be something that works only by chance. (That would not be the first time in the history of SQL Server when it comes to ordering.) If you look at it from a logical point of view, what would ordering by an XML fragment even mean? It is just nonsense! Furthermore, if you try:

SELECT ...
ORDER  BY T.c

you will be told that this is not legal. I suspect that the reason it is accepted in the OVER() clause is an oversight. Thus, my recommendation is that you stay away from numbering by the context node.

Thankfully, there is a fully supported alternative, and again credit goes to Alejandro Mesa for finding a clever solution. But to explain it, we need to dig a little deeper into XPath, XQuery and the xml type methods. We also need another device that I will introduce in the next chapter. So I will return to the numbering of XML documents in the section XML Revisited, later in this article.

Inserting Many Rows

I said that where XML is really good is when you need to insert many rows. So how would you do that? In fact you have already seen the basics. You use nodes and value. Here is an example where I unpack a document with orders and order details:

DECLARE @x xml
SELECT @x =
  N'<Orders>
      <Order OrderID="13000" CustomerID="ALFKI"
             OrderDate="2006-09-20Z" EmployeeID="2">
         <OrderDetails ProductID="76" Price="123" Qty = "10"/>
         <OrderDetails ProductID="16" Price="3.23" Qty = "20"/>
      </Order>
      <Order OrderID="13001" CustomerID="VINET"
             OrderDate="2006-09-20Z" EmployeeID="1">
         <OrderDetails ProductID="12" Price="12.23" Qty = "1"/>
      </Order>
    </Orders>'
SELECT OrderID    = T.Item.value('@OrderID[1]',    'int'),
       CustomerID = T.Item.value('@CustomerID[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('@OrderDate[1]',  'datetime'),
       EmployeeId = T.Item.value('@EmployeeID[1]', 'smallint')
FROM   @x.nodes('/Orders/Order') AS T(Item)
SELECT OrderID    = A.Item.value('@OrderID[1]',    'int'),
       ProductID  = B.Item.value('@ProductID[1]',  'smallint'),
       Price      = B.Item.value('@Price[1]',      'decimal(10,2)'),
       Qty        = B.Item.value('@Qty[1]',        'int')
FROM   @x.nodes('/Orders/Order') AS A(Item)
CROSS  APPLY A.Item.nodes('OrderDetails') AS B (Item)

As you see, the document can include data for several tables, and extracting the order-header information is straight-forward. We use nodes to get data from that level, and then we pick the attributes with value.

Extracting the details rows is a little more interesting, because beside the information about product, quantity and price, we also need to get the order ID which is in the node above. The way to approach this is to first run nodes on the Order level, and then use the CROSS APPLY operator to run nodes on the details for each order. (Recall that CROSS APPLY is like a JOIN, but unlike a normal join it can take arguments from the left side.)

The latter may sound a little inefficient, so you may be tempted to try this instead:

SELECT OrderID    = T.Item.value('../@OrderID[1]', 'int'),
       ProductID  = T.Item.value('@ProductID[1]',  'smallint'),
       Price      = T.Item.value('@Price[1]',      'decimal(10,2)'),
       Qty        = T.Item.value('@Qty[1]',        'int')
FROM   @x.nodes('/Orders/Order/OrderDetails') AS  T (Item)

That is, you use the parent-axis notation to retrieve the order ID. But as I mentioned, XQuery optimisation is not always the best in SQL Server, and parent access is a sore area. When I have tested, the above has worked well, but only if I included that [1] which does not mean anything. With only ../@OrderID, I got a very bad query plan; in a test I ran, it took 10-30 seconds to extract 2155 details rows from 830 orders, whereas when written as above, execution time was subsecond. I reported the poor performance in this Connect bug. Microsoft's response was that they recommend against using the parent-axis notation, in favour of CROSS APPLY. And even if I was able to get good performance when I added the [1], I cannot guarantee that you will get a good query plan. The solution with CROSS APPLY seems to be more stable, which is why you should stick with it.

Element-centric XML

What you have seen above is attribute-centric XML. You can also use element-centric XML, where the XML document for product IDs would look like this:

<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>

For the first parameter to value you could give a single period to denote the current element to extract the values:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('.', 'int')

That looks really simple, but as you might guess from the fact that I've put it in red, you should not use this. Even when this works well, you can count on an overhead of 30 % compared to the query for attribute-centric XML. But it can also work really badly, particularly if the SELECT is part of an INSERT statement. I have experienced that it could take four minutes to unpack a document with 10 000 numbers. I filed a Connect bug for this, and in 2016 this item is still active. However, when I test the repro in the Connect item in 2016, the response time is sub-second in all versions from SQL 2005 to SQL 2016 – and that is without enabling trace flag 4199, which you usually need to enable to get optimizer fixes.

No matter what, there is a way to get better and more predictable performance from element-centric XML, which I learnt from SQL Server MVP Alejandro Mesa:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('(./text())[1]', 'int')

It sure is a strain on the eye, but this more complex expression has three advantages: 1) It is not subject to the bug that can cause bad query plans. 2) Execution time is on par with attribute-centric XML, or even somewhat better. 3) Empty elements are interpreted as NULL.

You may ask what that text() does, and whether it has a real meaning, or if it is something equally meaningless as that [1]. The answer is that text() is for real. Consider this little batch:

DECLARE @x xml = 
       '<Root><Sub>One<Subsub>Two</Subsub>Three<Subsub>Four</Subsub></Sub></Root>'
SELECT T.c.value('.', 'nvarchar(20)') AS Dot,
       T.c.value('(./text())[1]', 'nvarchar(20)') AS Text
FROM   @x.nodes('Root/Sub') AS T(c)

The output is:

Dot                  Text
-------------------- --------------------
OneTwoThreeFour      One

That is, with '.' you get all text that is in the Sub node, including the Subsub nodes. With text(), which is not a function but what is known as a node test, you only get the text outside the Subsub nodes. Furthermore, you are compelled to specify which of the texts you want; that is, the [] is compulsory. You can use [2] in place of [1], in which case you get back Three, the second text value in the Sub element. Certainly, this XML document is a bit wretched, and for most XML documents in this world there is no practical difference between a mere '.' and the complex '(./text())[1]', but SQL Server has to assume the worst about the XML document you feed it, and this is why using only '.' will always be slower.

This example also displays the difference between attributes and elements. An attribute can only appear once within a node, and it cannot contain other elements or attributes, which makes attributes quite a simpler affair. But it seems that SQL Server does not realise this, as the query plans for attributes are very similar to those for element-centric XML. Presumably this explains why it may matter to add [1] to attributes.

Above I said that an advantage with text() is that empty elements are interpreted as NULL. In case you wonder what I mean, consider:

DECLARE @x xml
SELECT @x = '<Root><Value></Value></Root>'
SELECT T.c.value('.', 'int') AS "without", 
       T.c.value('(./text())[1]', 'int') AS "with text"
FROM   @x.nodes('/Root/Value') AS T(c)

The output from this is:

without     with text
----------- -----------
0           NULL

That is, if an element is present, but has no actual text, you get a non-NULL value if you only use '.', which can yield unexpected results as demonstrated in this example, whereas with '(./text())[1]' you get back NULL, which is more likely to be what you want.

Inserting Many Rows from Element-centric Documents

In the examples with a list of values, the element-centric document is considerably simpler and shorter. But that is an exception. Normally, element-centric documents are far more verbose than the corresponding attribute-centric document. Here is an element-centric version of the order document above:

<Orders>
    <Order>
       <OrderID>13000</OrderID>
       <CustomerID>ALFKI</CustomerID>
       <OrderDate>2006-09-20Z</OrderDate>
       <EmployeeID>2</EmployeeID>
       <OrderDetails>
           <ProductID>76</ProductID>
           <Price>123</Price>
           <Qty>10</Qty>
       </OrderDetails>
       <OrderDetails>
           <ProductID>16</ProductID>
           <Price>3.23</Price>
           <Qty>20</Qty>
       </OrderDetails>
    </Order>
    <Order>
       <OrderID>13001</OrderID>
       <CustomerID>VINET</CustomerID>
       <OrderDate>2006-09-20Z</OrderDate>
       <EmployeeID>1</EmployeeID>
       <OrderDetails>
           <ProductID>12</ProductID>
           <Price>12.23</Price>
           <Qty>1</Qty>
       </OrderDetails>
    </Order>
</Orders>

Here is how you would unpack it:

SELECT OrderID    = T.Item.value('(OrderID/text())[1]', 'int'),
       CustomerID = T.Item.value('(CustomerID/text())[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('(OrderDate/text())[1]',  'datetime'),
       EmployeeId = T.Item.value('(EmployeeID/text())[1]', 'smallint')
FROM   @x.nodes('Orders/Order') AS T(Item)

SELECT OrderID    = O.n.value('(OrderID/text())[1]',   'int'),
       ProductID  = D.n.value('(ProductID/text())[1]', 'int'),
       Price      = D.n.value('(Price/text())[1]',     'decimal(10,2)'),
       Qty        = D.n.value('(Qty/text())[1]',       'int')
FROM   @x.nodes('/Orders/Order') AS O(n)
CROSS APPLY O.n.nodes('OrderDetails') AS D(n)

In this case, too, you need to use text() to get the best performance. As with attribute-centric XML, the best way to get the order ID together with the details rows is to first run nodes on the Order level, and then descend to the details level for each order. Using parent-axis notation as in ../OrderID can easily yield very poor performance with Microsoft's unfinished XQuery implementation.

Typed XML

So far we have only looked at untyped XML, but it is possible to define an XML schema with the command CREATE XML SCHEMA COLLECTION. When you define a variable or column, you can say xml(mycollection) rather than just xml. SQL Server will then validate that the document adheres to that schema. This is known as typed XML. Using typed XML can have some positive impact on performance, since the schema informs the optimizer that some things cannot occur. For instance, this statement defines how the XML document with a list of values could look:

CREATE XML SCHEMA COLLECTION elemnum AS
'<?xml version="1.0" encoding="utf-8"?>
   <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" 
              xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root">
      <xs:complexType>
         <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Num" type="xs:int" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>'

(I'm afraid it's completely outside the scope of this article to explain this. Unfortunately Books Online does not really describe this either, but XSD is a standard defined by W3C.) Using this schema, we can write:

DECLARE @ids xml(elemnum)
SELECT @ids = N'<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>'
SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
  ON   P.ProductID = T.Item.value('.', 'int')

You may note that this time I did not put the use of '.' in red. The XML schema states that a Num node includes one single number and nothing else, so SQL Server does not need to consider the situation with text and subnodes interleaved, as in the example above. For this reason text() is not needed – in fact, text() is not even permitted with typed XML.

This performs well, and in some tests I was able to achieve significantly better performance than with untyped XML. But most of the time the schema appears to incur an overhead, and not least the validation needed to pass data to a typed XML parameter is quite expensive: 130-170 ms for 10 000 integers in my tests. That is about the same time as it takes for the fastest methods to process that many elements in total. (And when I used typed element-centric XML in an EXISTS query, the result was a disaster – more than two minutes to find 20 items in my test table. This beats all those inline functions by a wide margin.)

Thus, I would not consider using typed XML for a list of scalar values to be a serious option. What the schema buys you is protection against errors like misspelling an element or attribute name. But for a list of values you will probably find such errors quickly anyway. On the other hand, if you are using XML to pass more complex data sets around, using a schema can be worthwhile, not least if different teams are involved.

OPENXML

Already in SQL 2000 you could use XML, thanks to the OPENXML function. OPENXML is still around in SQL 2016, but about the only reason to use it would be that you still need to support SQL 2000. OPENXML is bulkier to use, and performance is nowhere near that of nodes. In fact, in my tests it is 30-50 % slower than the iterative method. One possible advantage with OPENXML, though, is that the query plan is simpler. Query plans for XQuery mix XML operators with relational operators, whereas OPENXML results in one single Remote Scan. This reduces the risk for the query-plan disasters you sometimes get with XQuery.

If you want to see an example of OPENXML, please refer to the SQL 2000 version of this article.

JSON

Before we leave this chapter, a little about JSON. Support for JSON was added in SQL 2016 on, as they say, popular request. There is no JSON data type, but there is an OPENJSON function and a few more JSON functions, as well as a FOR JSON operator.

Just like you can transform your comma-separated list of numbers to XML, you can transform it to JSON. There is one noticeable difference: it is a lot simpler. Here is an example:

CREATE PROCEDURE get_product_names_json @csv varchar(100) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   OPENJSON('[' + @csv + ']') j ON convert(int, j.value) = P.ProductID 
go
EXEC get_product_names_json '9, 12, 27, 37'

All you need to do is wrap the list in brackets. Well, almost: since the column value is always of the type nvarchar(MAX), you also need to cast it back to int.

I would not recommend doing this for a list of strings, as I assume that there could be accidents if there are characters special to JSON in the data, but I don't know JSON well enough myself to say this with authority.

It is worth noting that the result set from OPENJSON includes a column key, which for a list like the one above returns the list position, starting at 0. Just beware that the data type of key is nvarchar(4000), not int, since with other JSON documents it can hold string data.
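
Here is a sketch (the procedure name and the listpos alias are my own) of how you could use the key column to carry the list position along, casting it to int:

```sql
CREATE PROCEDURE get_product_names_json_pos @csv varchar(100) AS
   SELECT P.ProductName, P.ProductID,
          listpos = convert(int, j.[key]) + 1   -- key starts at 0
   FROM   Northwind..Products P
   JOIN   OPENJSON('[' + @csv + ']') j ON convert(int, j.value) = P.ProductID
go
EXEC get_product_names_json_pos '9, 12, 27, 37'
```

Note that key is a keyword in T‑SQL, which is why I bracket it as j.[key].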

I have not tested OPENJSON for performance, but I note that the estimate appears to always be 50 rows.

Since a JSON document, just like XML, can include data for several fields and records, and these can be nested, it lends itself to inserting many rows. You would probably not create a JSON document out of another data source, not least when you can use a table-valued parameter. On the other hand, if you already have the data as JSON, it makes sense to pass it to SQL Server as-is.
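
To give a taste of how that could look, here is a sketch (my own, mirroring the attribute-centric order document from the XML section, and not something I have performance-tested) that unpacks a nested JSON document with OPENJSON and an explicit WITH clause:

```sql
DECLARE @json nvarchar(MAX) = N'
  [{"OrderID": 13000, "CustomerID": "ALFKI",
    "OrderDetails": [{"ProductID": 76, "Price": 123,   "Qty": 10},
                     {"ProductID": 16, "Price": 3.23,  "Qty": 20}]},
   {"OrderID": 13001, "CustomerID": "VINET",
    "OrderDetails": [{"ProductID": 12, "Price": 12.23, "Qty": 1}]}]'
-- AS JSON returns the nested array as-is, so that we can run OPENJSON
-- on it again, in analogy with the nodes + CROSS APPLY pattern for XML.
SELECT O.OrderID, D.ProductID, D.Price, D.Qty
FROM   OPENJSON(@json)
       WITH (OrderID      int,
             CustomerID   nchar(5),
             OrderDetails nvarchar(MAX) AS JSON) AS O
CROSS  APPLY OPENJSON(O.OrderDetails)
       WITH (ProductID int,
             Price     decimal(10,2),
             Qty       int) AS D
```

With the WITH clause, each column defaults to the JSON property with the same name, so no explicit paths are needed here.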

Using a Table of Numbers

The Concept

This is the fastest way to unpack a comma-separated list of numbers in pure T‑SQL. That is, this is the section for you who want a fast function, but for some reason cannot use the CLR and do not want to change the format of the string.

The trick is to use an auxiliary table of numbers: a one-column table with numbers from 1 and up. Here is how you can create a table with numbers from 1 to 999 999:

CREATE TABLE Numbers (Number int NOT NULL PRIMARY KEY);
WITH digits (d) AS (
   SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
   SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION
   SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION
   SELECT 0)
INSERT Numbers (Number)
   SELECT Number
   FROM   (SELECT i.d + ii.d * 10 + iii.d * 100 + iv.d * 1000 +
                  v.d * 10000 + vi.d * 100000 AS Number
           FROM   digits i
           CROSS  JOIN digits ii
           CROSS  JOIN digits iii
           CROSS  JOIN digits iv
           CROSS  JOIN digits v
           CROSS  JOIN digits vi) AS Numbers
   WHERE  Number > 0

First I set up a CTE that defines a virtual table with all digits from 0 to 9, and then I create a Cartesian product for as many powers of ten as I want numbers in my table.

Before I go on and show how to use this table to unpack a comma-separated list, I would like to point out that this sort of table has a much wider usage. Just to give you a taste of it, I will quickly show two examples. Say that you have a long string where you want to count the frequency of the various characters. You can use Numbers to "loop" over the string to get the characters in tabular format:

DECLARE @str nvarchar(MAX)
SELECT @str = 'Insert your favourite novel here.'
SELECT substring(@str, Number, 1) AS ch, COUNT(*)
FROM   Numbers
WHERE  Number <= len(@str)
GROUP  BY substring(@str, Number, 1)
ORDER  BY ch

If this example is a little on the esoteric side, the next one is definitely common. You need to return the number of orders per day for a period, say a month. If there are days without any orders, they should be included in the output, showing a zero for the count. To achieve this, you need something that spans the entire period for you, so that you can left-join the period with the Orders table. This is exactly what Numbers is good for:

WITH dates AS (
   SELECT date = DATEADD(DAY, Number, '19000101')
   FROM   Numbers
)
SELECT d.date, COUNT(O.OrderDate)
FROM   dates d
LEFT   JOIN Northwind..Orders O ON O.OrderDate = d.date
WHERE  d.date BETWEEN '19970101' AND '19970131'
GROUP  BY d.date
ORDER  BY d.date

Since we need dates, I first define dates out of the numbers in a CTE, which I then work with. I'm making use of the fact that COUNT(OrderDate) will only count rows with non-NULL values in OrderDate.

An Inline Function

Let's now see how we can use Numbers to split a comma-separated list into a table. I will first present a T‑SQL inline function, which captures the main logic. I will then move on to present a multi-statement function that applies chunking. While I will study the inline function in detail, it is actually the chunking solution you should use in most cases.

Below is the core of the function inline_split_me, a function I originally got from SQL Server MVP Anith Sen. The version I got from Anith appeared in the SQL 2000 version of the article. The version below has a modification from Brian W Perrin.

CREATE FUNCTION inline_split_core(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT Value =
              substring(@param, Number,
                        charindex(',', @param + ',', Number) - Number)
       FROM   Numbers
       WHERE  Number <= len(@param)
         AND  substring(',' + @param, Number, 1) = ','
)          

Just below, I will make a couple of modifications to this function, but I wanted to show a "clean" version first, so that you can digest it more easily. As in the first example above, we use Numbers to "loop" over the string, and the first condition in the WHERE clause is our "loop" condition. The second condition looks for the string delimiter, but in order to find the first element in the list, we add a comma before the string. So with this WHERE condition, we stop at every comma in the string. Observe that since we prepended that initial comma, Number refers to the first position after a comma.

Then in the SELECT list, we extract the list element starting at position Number, looking for the next comma with the charindex function. To handle the last element in the string correctly, this time we add a comma after the input string. We use the third parameter to charindex to start the search in the right place. charindex thus returns the position of the next delimiter. Since the third parameter to substring is the desired length of the substring, we simply subtract the current position from the return value of charindex to get the length of the extracted element.
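
To see the mechanics on a concrete string, you can run the core expressions stand-alone (this is just the function body inlined for illustration, with an extra column for the position of the next delimiter):

```sql
-- For the string '9,12,27', the WHERE clause matches Number = 1, 3
-- and 6, which are exactly the start positions of the elements.
DECLARE @param nvarchar(MAX) = N'9,12,27'
SELECT Number,
       charindex(',', @param + ',', Number) AS next_delim,
       substring(@param, Number,
                 charindex(',', @param + ',', Number) - Number) AS Value
FROM   Numbers
WHERE  Number <= len(@param)
  AND  substring(',' + @param, Number, 1) = ','
```

This returns three rows with the values 9, 12 and 27, with next_delim pointing at the comma (or the appended final comma) that ends each element.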

While this function works, there are several enhancements we can make to it. The first is to trim leading and trailing blanks from the return value, like I have done in the other list-to-table functions in this article. Thus, we should wrap the return value in calls to ltrim and rtrim.

Next, there are a couple of performance improvements. The first two I have already covered in this article. One is to force a binary collation for the comparisons, including the call to charindex. The other is to observe the type of the return value of the function. This is nvarchar(MAX), which is not so good if you want to join the output to a regular nvarchar column, as I noted in the section MAX Types vs. Regular (n)varchar. Therefore we should also wrap the output in a convert to nvarchar(4000).

The next thing to look at is more subtle, but we can improve the condition on the length of the parameter, that is, this part:

Number <= len(@param)

When the input parameter to len is a MAX value, the return type is bigint, but the Number column is int. Since int has lower precedence than bigint in SQL Server, Number is implicitly converted to bigint, and this introduces a range seek in the Index Seek operator, similar to what happens when varchar is converted to nvarchar in a Windows collation. The range seek does not seem to be costly in itself, but I found when I tested, that it seemed to prevent the optimizer from using a parallel plan. Therefore, it's a good idea to convert the return value from len to int.

The last performance issue is an old bug in SQL Server. SQL Server MVP Tony Rogerson found that when a stored procedure used something akin to inline_split_core, the plan for the procedure was never put into cache, so it was compiled every time. I was able to isolate the expression @p + ',' as the culprit and I submitted this Connect bug. The bug is closed as Won't Fix, but that only seems to relate to the version in which they intended to fix the issue. When testing in 2016 on the latest service packs of all versions from SQL 2005 and up, the bug only appears in SQL 2005. Thus, this bug is largely a thing of the past, but nevertheless I have employed the workaround that SQL Server MVP Simon Sabin found: convert the comma to nvarchar(MAX).

All this leads to the real inline_split_me:

CREATE FUNCTION inline_split_me(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000),
                  substring(@param, Number,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      @param + convert(nvarchar(MAX), N','),
                                      Number) -
                            Number)
              ))) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(convert(nvarchar(MAX), N',') + @param, Number, 1) =
                        N',' COLLATE Slovenian_BIN2)

I've also added N before the string literals to not rely on implicit conversion. Compared to the core version above, you may find that you no longer can see the forest for all the trees!

Here is an example of how to use this particular function:

CREATE PROCEDURE get_company_names_inline @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   inline_split_me(@customers) s ON C.CustomerID = s.Value
go
EXEC get_company_names_inline 'ALFKI,BONAP,CACTU,FRANK'

An obvious problem with this function is robustness. A comma-separated list that has more than one million characters is a very long string, but nevertheless it is bound to happen sooner or later. And in that case, this function will return incorrect results, but there will be no error message.

While we have sorted out several performance issues, others still remain. The most serious is that the optimizer may go completely wrong and select a catastrophic plan. When I ran my performance tests and rewrote my join queries to use EXISTS, the time to get 20 rows from my test table rose from two milliseconds to 40 seconds. As I noted in the section JOIN vs. EXISTS, this is a problem with all T‑SQL inline functions. We can easily avoid this disaster by making the function into a multi-statement function, but another problem that I also brought up with the iterative method still remains: charindex is slower on nvarchar(MAX) than on nvarchar(4000). To resolve this, we need to go chunking, which we will look at in the next section.

But before that, let's look at the brighter side of inline_split_me. The optimizer does not always go astray, but sometimes it comes up with quite slick plans. When I ran my original performance tests I noticed that I got a parallel plan when joining with my test table. For short lists, parallelism rather added overhead. But for my test strings with 10 000 elements, inline_split_me suddenly performed faster than the CLR on a 4-CPU machine. (For full details on this, please see the first performance appendix.) For even longer strings, 50 000 elements, it fell behind the CLR again, for reasons I will discuss in the section on fixed-length input.

I'd like to stress that you cannot be sure to get a parallel plan, as that depends on many things. For one thing, if you think the optimizer will use the condition on len(@param), forget it. No matter the input string, the optimizer will assume that the first WHERE condition hits 30 % of the rows in Numbers, i.e. 300 000 in this case, and that the second condition with substring reduces the estimated number of rows to around 9 500. Apparently, in combination with my test table, this triggered a parallel plan. From this it follows that while the size of the input parameter does not matter, the size of Numbers does.

The bottom line is that in most cases you should avoid using inline_split_me. The one exception is when parallelism goes your way, and you want the best performance. I think in this case you should benchmark it against other methods. And keep in mind that parallelism is not always a blessing – not if there are other users who want some CPU power as well!

A Chunking Multi-Statement Function

As with the iterative method, breaking the input into chunks of nvarchar(4000) improves performance. Here is a multi-statement function that does that:

CREATE FUNCTION duo_chunk_split_me(@list  nvarchar(MAX),
                                   @delim nchar(1) = N',')
RETURNS @t TABLE (str   nvarchar(4000) NOT NULL,
                  nstr  nvarchar(4000) NOT NULL) AS
BEGIN
   DECLARE @slices TABLE (slice nvarchar(4000) NOT NULL)
   DECLARE @slice nvarchar(4000),
           @textpos int,
           @maxlen  int,
           @stoppos int

   SELECT @textpos = 1, @maxlen = 4000 - 2
   WHILE datalength(@list) / 2 - (@textpos - 1) >= @maxlen
   BEGIN
      SELECT @slice = substring(@list, @textpos, @maxlen)
      SELECT @stoppos = @maxlen -
                        charindex(@delim COLLATE Slovenian_BIN2,
                                 reverse(@slice))
      INSERT @slices (slice)
         VALUES (@delim + left(@slice, @stoppos) + @delim)
      SELECT @textpos = @textpos - 1 + @stoppos + 2
      -- On the other side of the comma.
   END
   INSERT @slices (slice)
       VALUES (@delim + substring(@list, @textpos, @maxlen) + @delim)

   ;WITH stringget (str) AS (
      SELECT ltrim(rtrim(substring(s.slice, N.Number + 1,
                charindex(@delim COLLATE Slovenian_BIN2,
                          s.slice, N.Number + 1) -
                N.Number - 1)))
       FROM  Numbers N
       JOIN  @slices s
         ON  N.Number <= len(s.slice) - 1
        AND  substring(s.slice, N.Number, 1) = @delim COLLATE Slovenian_BIN2
   )
   INSERT @t (str, nstr)
      SELECT str, str
      FROM   stringget

   RETURN
END

We first split up the text in slices and put these in the table variable @slices. When we created chunks for the iterative method, we did not care whether a list element was split over two chunks. But for this method, we need to take precautions to avoid that, and we must make sure that the last character in a chunk is a delimiter. We first get a preliminary slice of the maximum length we can handle. Then we find the last delimiter in the slice by feeding charindex the result of the reverse function, a neat little trick. When we insert the slice into the table variable, we make sure that there is a delimiter both before and after, so that we don't need to deal with that later. You may note that if the input text is within the limits of a regular nvarchar, we never enter the loop, but just insert the text directly into the @slices table.
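The slicing loop can be illustrated with a Python sketch (an illustration only, not the article's T-SQL; the name make_slices is mine, and like the T-SQL I assume every list element is shorter than the chunk size):

```python
def make_slices(text, delim=",", maxlen=3998):
    """Break text into slices that never split an element, each slice
    wrapped in delimiters, as in the first loop of duo_chunk_split_me."""
    slices = []
    textpos = 0                       # 0-based here; the T-SQL is 1-based
    while len(text) - textpos >= maxlen:
        slice_ = text[textpos : textpos + maxlen]
        # The reverse trick: the first delimiter in the reversed slice
        # is the last delimiter in the slice.
        stoppos = maxlen - (slice_[::-1].index(delim) + 1)
        slices.append(delim + slice_[:stoppos] + delim)
        textpos += stoppos + 1        # continue on the other side of the comma
    slices.append(delim + text[textpos : textpos + maxlen] + delim)
    return slices
```

Note how a short input never enters the loop and is emitted as a single delimiter-wrapped slice.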

Once @slices is populated, we apply the logic from the inline function, although it looks slightly different, since we know that all strings in @slices start and end with the delimiter. Now the numbers filtered out through the JOIN are the positions for the delimiters, and the elements start one position further ahead. Note here that we do not need to iterate over @slices; we can join directly with Numbers.

As with iter_charlist_to_tbl, this function returns a table with both a varchar and an nvarchar column. My tests indicate that there is a cost of 10-15 % over returning a table with only an nvarchar column.

This function is in most cases considerably faster than inline_split_me, up to a factor of two. As I discussed in the previous section, inline_split_me can be very fast with a parallel plan and long input. The nice thing with the chunking multi-statement function is that you get consistent performance. In my tests, duo_chunk_split_me was 25-50 % slower than the CLR when just unpacking the list into a table.

Note that this function does not have the robustness problem of the inline function. By chunking we are protected against running out of numbers, as long as we make sure that there are 4000 numbers in the table. OK, if you get more slices than you have numbers in the table, you have a problem. But with a million numbers in the table, that would mean an input length close to four billion characters – which exceeds the capacity of nvarchar(MAX) by a wide margin.

Concluding Remarks

While faster than the iterative function, this method is more difficult to grasp, and it's probably also more difficult to extend to more complex formats.

These functions do not have any listpos column. You can use the row_number function for this:

listpos = row_number() OVER (ORDER BY s.sliceno, N.Number)

where sliceno is an extra counter column you would have to add to the @slices table. It's likely that this would have some performance cost, but I have not examined it.

Compared to other methods, the performance depends a lot on the length of the total string, since we compare every character in the input string with the string delimiter. That is, extra spacing and the length of the elements will matter.

I did not include any function that unpacks a list of numbers. You could write a new function which uses convert in the right places, but you can also do it like this:

CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   duo_chunk_split_me(@ids, DEFAULT) i ON
          P.ProductID = convert(int, i.str)
go
EXEC get_product_names_tblnum '9, 12, 27, 37'

Despite the chunking, there are still some robustness issues with the Numbers table. Users that do not know better could delete numbers in the middle or add numbers which should not be there. And as the queries are written, they assume that the first number in the table is 1. At some point I had a zero in the table, whereupon I got incorrect results back. If you are paranoid, you can set up a check constraint for the minimum value, and add a trigger that cries foul if anyone meddles with the table. Then again, constraints and triggers can be disabled, so the really paranoid will probably prefer another method. In the section Using a Function of Numbers, I will discuss a variation where you can use a table-valued function to generate the numbers when you need them.

In any case, a table of numbers is something that comes in handy in several other SQL problems than just unpacking a comma-separated list. So it's a good idea to keep one available in your database.

Fixed-Length Array Elements

This is a method that was proposed by SQL Server MVP Steve Kass, inspired by an idea in Ken Henderson's book The Guru's Guide to Transact-SQL.

Just like XML, this is a method that requires a special input format. Instead of using delimiters as in other methods, the list elements have a fixed length. There are two advantages with this: 1) You can never run into problems with delimiters appearing in the data. 2) Performance. Except for extremely long input strings, this is the fastest of all methods in this article. In fact, this method can even beat using table-valued parameters. However, for multi-user environments, it does not seem to scale as well as the CLR.

The Core

Here is a quick example where the method is employed directly, without a function:

CREATE PROCEDURE get_product_names_fix @ids varchar(8000),
                                       @itemlen tinyint AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   Numbers n ON P.ProductID = convert(int,
               substring(@ids, @itemlen * (n.Number - 1) + 1, @itemlen))
     AND  n.Number <= len(@ids) / @itemlen
go
EXEC get_product_names_fix '   9  12  27  37', 4

Each element in the "array" has the same length, as specified by the parameter @itemlen. We use the substring function to extract each individual element. And, yes, it's that Numbers table again, that we created in the beginning of the previous section Using a Table of Numbers.

Here is a function that embeds the method:

CREATE FUNCTION fixstring_single(@str nvarchar(MAX), @itemlen tinyint)
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              str = substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen)
       FROM   Numbers n
       WHERE  n.Number <= len(@str) / @itemlen +
              CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)

The purpose of the expression on the last line is to permit the last element in the array to be shorter than the others, in case trailing blanks have been stripped. You can see that this function returns the list position, which is very simple to achieve: it's the current number in the Numbers table. Here is an example using fixstring_single with a list of strings:

CREATE PROCEDURE get_company_names_fix @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   fixstring_single(@customers, 6) s ON C.CustomerID = s.str
go
EXEC get_company_names_fix 'ALFKI BONAP CACTU FRANK'
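The position arithmetic of fixstring_single, including the allowance for a shorter last element, can be sketched in Python (an illustration only; the name fix_split is mine):

```python
def fix_split(s, itemlen):
    """Fixed-length split: element i (1-based) occupies positions
    itemlen*(i-1)+1 .. itemlen*i, mirroring fixstring_single's substring
    call. The last element may be shorter if trailing blanks were
    stripped, which is what the CASE on the remainder caters for."""
    count = len(s) // itemlen + (0 if len(s) % itemlen == 0 else 1)
    return [(number, s[itemlen * (number - 1) : itemlen * number])
            for number in range(1, count + 1)]
```

With fix_split('ALFKI BONAP CACTU FRANK', 6), each element comes back paired with its ordinal, just like the listpos column of the function.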

One possible disadvantage with fixed-length is that it's more sensitive to disruption in the input. If you lose one character somewhere or pass the wrong value in @itemlen, the entire list will be misinterpreted. But assuming that you construct the list programmatically, this should not be a big deal.

You may note that I did not convert len(@str) to int as I did in inline_split_me. While it would not be a bad idea, I was not able to detect any impact on performance in this case. In order to make the fixed-length procedures easier to read, I opted to leave out the convert in this case.

While this method can be very fast, keep in mind that, as with all other methods that build on T‑SQL inline functions, the optimizer may lose its grip entirely. It does not know to use the length of the string to estimate how many rows we will read from Numbers, but makes its standard guess of 30 %, that is, 300 000 rows with a million rows in the table. When I tested an EXISTS query, it took 30 seconds to get 20 elements from my test table; the corresponding join ran in a millisecond or less.

Passing Numbers as Binary

The fixed-length method opens up a different way to pass a list of numbers, namely as a binary string:

CREATE FUNCTION fixbinary_single(@str varbinary(MAX))
RETURNS TABLE AS
RETURN(SELECT listpos = n.Number,
              n = convert(int, substring(@str, 4 * (n.Number - 1) + 1, 4))
       FROM   Numbers n
       WHERE  n.Number <= datalength(@str) / 4 )

(Do I need to add that I originally got this idea from Steve Kass as well?) Note that fixbinary_single does not take any @itemlen parameter; that would be superfluous here. To handle the other integer types, you would write new functions.

When using this from T‑SQL it looks less appealing:

CREATE PROCEDURE get_product_names_binary @ids varbinary(2000) AS
   SELECT P.ProductID, P.ProductName
   FROM   Northwind..Products P
   JOIN   fixbinary_single(@ids) b ON P.ProductID = b.n
go
EXEC get_product_names_binary 0x000000090000000C0000001B00000025

But normally you would not do that. Rather, you would create the binary string in client code. SQL Server MVP Alex Kuznetzov was kind enough to share a C# function that builds a binary string from numbers of the bigint data type:

static byte[] UlongsToBytes(ulong[] ulongs) {
   int ifrom = ulongs.GetLowerBound(0);
   int ito   = ulongs.GetUpperBound(0);
   int l = (ito - ifrom + 1)*8;
   byte[] ret = new byte[l];
   int retind = 0;
   for(int i=ifrom; i<=ito; i++)
   {
      ulong v = ulongs[i];
      ret[retind++] = (byte) (v >> 0x38);
      ret[retind++] = (byte) (v >> 0x30);
      ret[retind++] = (byte) (v >> 0x28);
      ret[retind++] = (byte) (v >> 0x20);
      ret[retind++] = (byte) (v >> 0x18);
      ret[retind++] = (byte) (v >> 0x10);
      ret[retind++] = (byte) (v >> 8);
      ret[retind++] = (byte) v;
   }

   return ret;
}

Thus, to use it with fixbinary_single above, you would have to modify it a bit.
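For the int format that fixbinary_single expects – four bytes per number, most significant byte first, since that is how SQL Server converts binary(4) to int – the packing could look like this in Python (an illustrative sketch; the function names are my own):

```python
import struct

def ints_to_bytes(ints):
    """Pack 32-bit integers as four big-endian bytes each, the format
    that fixbinary_single's substring/convert arithmetic expects."""
    return b"".join(struct.pack(">i", n) for n in ints)

def bytes_to_ints(data):
    """The reverse operation: what fixbinary_single does in T-SQL,
    reading four bytes per list position."""
    return [struct.unpack(">i", data[4 * k : 4 * k + 4])[0]
            for k in range(len(data) // 4)]
```

ints_to_bytes([9, 12, 27, 37]) produces the same byte string as the 0x... literal in the example above.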

In my original performance tests in 2006, I was not able to detect any difference in performance between fixbinary_single and fixstring_single. Interestingly enough, when I re-ran my tests in 2009, I saw 15-20 % better results with binary input for lists up to 10 000 elements. Whether this is because that test was on SQL 2008, or because I used different hardware, I don't know. In 2009, I also tested lists with 50 000 elements, and in this case the difference is dramatic – the binary version is three times faster, for reasons I will return to.

There is a second benefit that Alex pointed out to me: using binary format saves network bandwidth. Instead of passing 10 bytes per number, as you would need with a string that can fit all positive integers, with a binary string you only need four bytes per number. This can make a difference for longer lists, and I discuss this in the section Call Overhead for Plain Strings in the second performance appendix.

Fixed Length and the CLR

Since what is special with this method is the input format, not the algorithm itself, you could use the format with a CLR function instead. I wrote a CLR table-valued function that accepted fixed-length as input. There was very little difference in performance to the other CLR functions.

Unlimited Input

For some other methods, I have said that we can get better performance by splitting up the string in chunks. This does not apply to fixed-length input, because the gain there is due to an issue with charindex, which we don't use here. The other reason to go chunking is to avoid running out of numbers in the Numbers table. This risk is smaller with fixed-length, since we only use one number per list element, not one per character. So with the Numbers table above, we can handle one million list elements, which is quite a few. Nevertheless, it is possible to write a function which is watertight in this regard. Steve Kass proposed this function which self-joins Numbers to square the maximum number. That is, with one million numbers in the table, you get 1E12 numbers in total to play with.

CREATE FUNCTION fixstring_multi(@str nvarchar(MAX), @itemlen tinyint)
   RETURNS TABLE AS
RETURN(SELECT listpos = n1.Number + m.maxnum * (n2.Number - 1),
              str = substring(@str,
                    @itemlen *
                        (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1,
                    @itemlen)
       FROM   Numbers n1
       CROSS  JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m
       JOIN   Numbers n2 ON
              @itemlen *
                   (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1 <=
              len(@str)
       WHERE  n2.Number <= len(@str) / (m.maxnum * @itemlen) + 1
         AND  n1.Number <= CASE WHEN len(@str) / @itemlen <= m.maxnum
                                THEN len(@str) / @itemlen +
                                     CASE len(@str) % @itemlen
                                          WHEN 0 THEN 0
                                          ELSE 1
                                     END
                                ELSE m.maxnum
                           END
)

This is a more complex function than fixstring_single, but to save space I leave it as an exercise to the reader to work out what is going on. I only make a note about the line with CROSS JOIN: it saves me from hard-coding the number of rows in Numbers.
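The heart of the exercise is the numbering scheme n1.Number + maxnum * (n2.Number - 1), which enumerates 1 .. maxnum² from two passes over Numbers. A Python sketch of just that part (an illustration only; squared_numbers is my own name):

```python
def squared_numbers(maxnum, limit):
    """Enumerate 1..limit using two copies of a 1..maxnum table,
    as fixstring_multi's self-join does: n1 + maxnum * (n2 - 1)."""
    out = []
    for n2 in range(1, maxnum + 1):         # Numbers n2: which "block"
        for n1 in range(1, maxnum + 1):     # Numbers n1: offset in the block
            number = n1 + maxnum * (n2 - 1)
            if number <= limit:             # the length conditions in the function
                out.append(number)
    return out
```

Each value of n2 contributes a block of maxnum consecutive numbers, so two passes over a million-row table cover 1E12 positions.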

You may think that a self-join that results in 1E12 numbers must be expensive, and indeed, if your lists are mainly short, less than 200 elements, the overhead is considerable. However, in my tests there were cases where fixstring_multi outperformed fixstring_single by a wide margin on a 4-CPU machine. Just as for inline_split_me, the reason is parallelism.

Another approach to permit unlimited input is of course to go chunking. And with the fixed-length method, it's possible to do this with an inline function:

CREATE FUNCTION fixstring_multi2(@str nvarchar(MAX), @itemlen tinyint)
  RETURNS TABLE AS
RETURN(
     SELECT listpos = (s.sliceno - 1) * (s.maxnum / @itemlen) + n.Number,
            str = substring(s.slice, @itemlen * (n.Number - 1) + 1, @itemlen)
     FROM   (SELECT m.maxnum,
                    sliceno = n.Number,
                    slice = substring(@str,
                                      (m.maxnum - m.maxnum % @itemlen) *
                                            (n.Number - 1) + 1,
                                      m.maxnum - m.maxnum % @itemlen)
             FROM   Numbers n
             CROSS  JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m
             WHERE  n.Number <= len(@str) /
                    (m.maxnum - m.maxnum % @itemlen) +
                    CASE len(@str) % (m.maxnum - m.maxnum % @itemlen)
                         WHEN 0 THEN 0
                         ELSE 1
                    END) AS s
     JOIN   Numbers n ON n.Number <= len(s.slice) / @itemlen +
              CASE len(s.slice) % @itemlen WHEN 0 THEN 0 ELSE 1 END
   )

Performance is similar to fixstring_multi, including the good performance on multi-processor machines thanks to parallelism, but it takes a bigger toll as the length of the string increases. More about that in a second.

Performance with Extremely Long Input

In the beginning of this section I said that the fixed-length method is the fastest method, except for extremely long input. In my original tests I found that on one machine, fixed-length lagged behind several other methods when the input was a list of 10 000 strings. This puzzled me for a long time, and eventually I ran some special tests for this case. My conclusion is that SQL Server changes its internal handling of nvarchar(MAX) at a certain size, which causes processing of it to be slower above that limit. The limit is around 500 000 bytes on x64 machines, and 750 000 bytes on 32-bit machines. When the input length exceeds this limit, the execution time for the fixed-length functions about doubles. Exactly what happens internally, I don't know, but it surely has something to do with the ability of MAX variables to spill to disk. (Although I doubt that they actually spilled to disk, as I would expect that to have a much greater performance impact.)

In my tests in 2006, I only saw this with the fixed-length method and only with string input, because that was the only case where I exceeded the limit. In 2009, I added a test with 50 000 list elements, and now I saw this effect with all methods that work on nvarchar(MAX) without splitting it up in chunks. The routines that apply chunking are fairly immune, since they perform relatively few operations on the long string. The issue does not arise at all with the non-SQL methods: the CLR handles the long string within the .NET Framework, and XML also has its own internal handling. This also explains why fixbinary_single is much faster than fixstring_single at 50 000 elements: the binary string is still below this limit. If you had a list with 150 000 elements, you would see it with binary input as well.

I have more details about this limit in the performance appendix. As a curiosity, in my test on 50 000 list elements it seems that I hit a second limit with fixstring_multi2. For 10 000 elements, simply unpacking the list took 195 ms in the 2009 tests. Unpacking 50 000 list elements with its twin fixstring_multi took 690 ms. With fixstring_multi2 it took 670 seconds. I have not analysed this much at all. It's not a change in query plan, but it may be a resource issue, so the results could be different on different hardware. (Reflection added in 2016: maybe something really did spill to disk this time. I did not exactly use SSDs in those tests.)

Multithread Scalability

When I ran my multithread tests, the fixed-length methods did not do well. In some tests, they even scaled negatively. That is, the total execution time to run a certain number of calls was longer when spreading these calls over several threads, than when making all calls from a single thread. The one exception was when I used a multi-statement function based on fixstring_single. This function did not exhibit this behaviour. However, neither did this function scale as well as the CLR functions, particularly not with a high number of parallel threads.

Using a Function of Numbers

As I have discussed, the Numbers table comes with a robustness problem. You may run out of numbers, or there may inadvertently be gaps in the table. A second obstacle with these methods is that you may not want to (or be permitted to) create a table for this purpose. So what about a table-valued function you can use instead?

One of the smartest SQL people I know is SQL Server MVP Itzik Ben-Gan, and in his book Inside SQL Server 2005: T‑SQL Programming he presented this function:

CREATE FUNCTION dbo.fn_nums(@n AS bigint) RETURNS TABLE AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)
  SELECT n AS Number FROM Nums WHERE n <= @n;
GO
SELECT Number FROM fn_nums(1000)

When called, this function generates numbers from 1 to the number you specify. And this function is fast! The query I presented to you initially to fill the Numbers table with a million numbers runs for some 20 seconds. Itzik's function generates those numbers in 1-2 seconds.

The function may seem unwieldy at first, but this is how it works: in his first CTE, L0, Itzik generates a virtual table with two rows. In the next CTE, he cross-joins L0 with itself, so that L1 generates four rows. In L2 he cross-joins L1 with itself, and he is now up to 16 rows. The procedure is repeated with L3, which produces 256 rows. L4 makes that 65 536 rows, and then L5 explodes them into over four billion rows. This is brought to the CTE Nums, where he numbers the rows. The optimizer is smart enough to realise that all rows have the same value, so it does not waste time on sorting them. Finally, he selects from Nums to get the numbers asked for.
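The row counts are easy to verify, since each level simply squares the previous one. A few lines of Python, just to check the arithmetic:

```python
# L0 has two rows; each of L1..L5 cross-joins the previous level with
# itself, squaring the row count at every step.
rows = 2
counts = [rows]
for _ in range(5):        # L1, L2, L3, L4, L5
    rows *= rows
    counts.append(rows)
# counts is now [2, 4, 16, 256, 65536, 4294967296]
```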

The function is so fast that it is a viable alternative to side-step Numbers and use the function directly. I derived fn_inline_split_me from inline_split_me by changing the FROM and WHERE clauses like this:

FROM  fn_nums(len(@param))
WHERE substring(convert(nvarchar(MAX), N',') + @param, Number, 1) = 
         N',' COLLATE Slovenian_BIN2

Likewise, I created fn_fixstring_single from fixstring_single by skipping the WHERE clause entirely, and I wrote the FROM clause as:

FROM   fn_nums(len(@str) / @itemlen +
               CASE len(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)

In my performance tests, the functions based on fn_nums in many cases had performance equal to those that use Numbers, but there were also plenty of cases where the fn_nums-based functions were considerably slower, particularly with fixed-length input. Only in a few tests with short input did I measure lower values for the functions based on fn_nums.

As I have noted, the optimizer can easily stumble on functions based on the Numbers table. It does not get any easier with fn_nums, au contraire. The optimizer has very little idea of what is going on, and will in many cases assume that the query will produce four billion rows. When I ran my tests, the optimizer lost control for all EXISTS queries using functions based on fn_nums, just like it did for the functions based on Numbers. And in contrast to all other methods, this also happened to fn_fixstring_single for some of the join queries.

I looked into writing fn_chunk_split_me, a version of duo_chunk_split_me that used fn_nums, but this was not easy at all. First I just replaced Numbers in the final join with fn_nums(4000), but I found that for each generated number, SQL Server would scan the @slices table. I also tried

FROM  @slices s
CROSS APPLY fn_nums(len(s.slice) - 1) AS N
WHERE substring(s.slice, N.Number, 1) = @delim COLLATE Slovenian_BIN2

with an equally poor plan. Even worse was:

fn_nums((SELECT MAX(len(slice)) FROM @slices))

This was a complete disaster. I did not have the patience to wait for it to complete, but we are talking minutes. Looking at the query plan, it seemed that SQL Server set out to generate all those four billion rows, and then filter.

Eventually, I wrote fn_chunk_split_me as

INSERT @numbers(Number)
   SELECT Number FROM fn_nums(CASE WHEN len(@list) < @maxlen
                                   THEN len(@list)
                                   ELSE @maxlen
                              END + 2)

I then used @numbers in the final join instead of Numbers. The performance is clearly inferior to duo_chunk_split_me, particularly for shorter lists. For the very longest lists, with 50 000 elements, fn_chunk_split_me was only a few percent slower.

When I had completed all my performance tests and published the article, I was made aware that Itzik had come up with a new version of fn_nums where the final SELECT reads:

SELECT TOP (@n) n AS Number FROM Nums ORDER BY n

and I also tried:

SELECT TOP (@n) n AS Number FROM Nums WHERE n <= @n

The advantage of using TOP is that the optimizer has less reason to believe that the query will generate four billion rows. I ran some extra tests, and I saw some improvement with fixed-length input, but these functions still lagged behind those based on the Numbers table. I now got better performance in fn_chunk_split_me using the query with CROSS APPLY above. However, for longer input, inserting the numbers into a table variable was still faster, and the function was still considerably slower than duo_chunk_split_me. And for fn_inline_split_me, the optimizer now found one of those crazy plans where it takes 30 seconds to handle 20 elements, for some of the join queries in addition to the EXISTS queries.

This experience made me reconsider some of the enthusiasm I initially had for fn_nums. If you use it, you should feed it a constant or an expression based on variables only. Don't pass it the result of a scalar query or a column from the surrounding query, or you may be in for a disaster. The version with TOP may be less sensitive to this, but whichever version you use, test carefully. The optimizer is in for a tough game, and you run a higher risk of a really bad plan, since the optimizer works with such an incorrect estimate. Maybe it is best to play it safe, and extract the elements from the input list into a temp table or table variable, and then use that in your query. Or write fn_nums as a multi-statement function to hide the Cartesian product from the optimizer.

A final note: the ultimate solution would be if SQL Server had a built-in function to generate these numbers, one the optimizer would have full understanding of. I have a Connect item that suggests this. Votes are always welcome!

XML Revisited

In the section on XML, I presented an undocumented way to get the list position in an XML document, and recommended against relying on it. I also promised to show you a supported and documented way, and now that we know about the Numbers table, we can look at that solution.

As I noted previously, there is no XQuery function or xml type method for this task. However, you can extract a specific element from a document by using an XPath expression where you index by number. For instance:

DECLARE @x xml
SELECT @x = N'<Root><Num num="9"/><Num num="12"/>
              <Num num="27"/><Num num="37"/></Root>'
SELECT @x.value('(/Root/Num)[3]/@num', 'int')  

This returns 27, the value of the third element in the document.

Instead of a constant, XPath permits you to pass the value of a variable or a column with the sql:column function. SQL Server MVP Alejandro Mesa decided to try to get the list position by passing numbers from the Numbers table. That is, instead of shredding the entire document at once, he would take it piece by piece.

Let's first look at using sql:column in a simple way. You cannot use it directly in the value function; you first have to use the query function, which returns an XML fragment, and then you can apply value to that fragment. The interesting part is the call to query:

DECLARE @x xml
SELECT @x = N'<Root><Num num="9"/><Num num="12"/>
              <Num num="27"/><Num num="37"/></Root>'
SELECT Number AS listpos,
       @x.query('(/Root/Num)[sql:column("Number")]').
          value('(/Num)[1]/@num', 'int') AS num
FROM   Numbers
WHERE  Number <= 4

That is, sql:column inserts the column value into the XPath expression. This returns:

listpos     num
----------- -----------
1           9
2           12
3           27
4           37

The above, however, does not perform well with larger documents. Alejandro tried several variations until he eventually arrived at this solution:

DECLARE @no_of_elem  int
SELECT @no_of_elem = @str.value('count(/Root/Num)', 'int')

INSERT #Int_UNPACK (number)
SELECT n = T.Item.value('@num[1]', 'int')
FROM   Numbers n
CROSS  APPLY @str.nodes('(/Root/Num)[sql:column("n.Number")]') AS T(Item)
WHERE  n.Number <= @no_of_elem

He first determines the total number of elements in the document using the XQuery function count and saves this in a variable. Next he uses Numbers to "loop" over the document to get one node at a time. As you may guess, this is slower than shredding the entire document in one go. In my tests, the overhead was 30-80 % compared to the query I showed in the main XML section. A considerable overhead, but given what it's doing, it's amazingly quick.

It may not be necessary to use that extra variable, but you can see it as a safety precaution, so that the optimizer is not led astray. What is very important for performance, though, is that pair of parentheses around /Root/Num. Without them, the query is horribly slow. Why? I'm afraid that explaining details on XPath performance is beyond the scope of this article. And, more importantly, beyond my knowledge.

Using Recursive CTEs

This is a method that was originally suggested to me by SQL Server MVP Nigel Rivett. A function based on this idea also appears in SQL Server MVP Itzik Ben-Gan's book Inside Microsoft SQL Server 2005: T‑SQL Querying.

This method is not among the quickest; it beats the iterative method by a mere 15 %. Nevertheless, Nigel Rivett says he prefers this method, because when he works as a consultant, he wants to leave as small a footprint as possible, preferably not even a function. And indeed, the CTE method is fairly easy to use directly in a stored procedure. Still, I will show the method packaged as an inline function. As with inline_split_me, I will first show a core function where I've stripped the performance tweaks to make it easier to digest:

CREATE FUNCTION cte_split_core (@list  nvarchar(MAX),
                                @delim nchar(1) = ',') RETURNS TABLE AS
RETURN
   WITH csvtbl(start, stop) AS (
     SELECT start = convert(bigint, 1),
            stop = charindex(@delim, @list + @delim)
     UNION ALL
     SELECT start = stop + 1,
            stop = charindex(@delim, @list + @delim, stop + 1)
     FROM   csvtbl
     WHERE  stop > 0
   )
   SELECT substring(@list, start,
                    CASE WHEN stop > 0 THEN stop - start ELSE 0 END)
          AS Value
   FROM   csvtbl
   WHERE  stop > 0

The query starts off with a recursive CTE. Recursive CTEs are intended to roll up hierarchical structures, but they can also be used to hide loops in what looks like a set-based statement, and that is sort of what happens here. You can think of a recursive CTE of this kind as a compact form of writing:

SELECT ...
UNION ALL
SELECT ...
FROM   CTE
UNION ALL
SELECT ...
FROM   CTE
...

where each SELECT statement takes its input for the CTE from the SELECT right above it. Thus, in this function, the first SELECT returns 1, where the list starts, and the position of the first delimiter. The second SELECT sets start to the position after the first delimiter and stop to the position of the second delimiter. The third SELECT, as implied by the recursion, returns the position after the second delimiter as the start position, and the position of the third delimiter. And so it continues until there are no more delimiters and stop is returned as 0, so that the last SELECT returns nothing at all, which stops the recursion.

All this produces a virtual table with the start and stop positions for all the list elements, and we can use the rows in this table as input to substring. The CASE in the third argument to substring may seem unnecessary, since we have the same condition in the WHERE clause. But SQL Server sometimes computes expressions in the SELECT list before filtering, so we need the CASE expression to be sure to not get a run-time error for negative input to substring.
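To make the start/stop bookkeeping concrete, here is a hedged sketch that runs just the recursive CTE on a short sample list, with a hard-coded comma delimiter, and shows every row it produces:

```sql
DECLARE @list nvarchar(MAX) = N'9,12,27';

WITH csvtbl(start, stop) AS (
   SELECT convert(bigint, 1), charindex(',', @list + ',')
   UNION ALL
   SELECT stop + 1, charindex(',', @list + ',', stop + 1)
   FROM   csvtbl
   WHERE  stop > 0
)
SELECT start, stop,
       substring(@list, start,
                 CASE WHEN stop > 0 THEN stop - start ELSE 0 END) AS Value
FROM   csvtbl
-- Produces (1, 2, '9'), (3, 5, '12'), (6, 8, '27') and a final (9, 0, ''),
-- which the WHERE stop > 0 filter in the real function removes.
```

Note that since @list is nvarchar(MAX), charindex returns bigint, so the anchor and recursive members agree on the column types without further conversion.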

To get the real function, we now apply the same improvements as for inline_split_me (please refer to this section for details):

CREATE FUNCTION cte_split_inline (@list  nvarchar(MAX),
                                  @delim nchar(1) = ',') RETURNS TABLE AS
RETURN
   WITH csvtbl(start, stop) AS (
     SELECT start = convert(bigint, 1),
            stop = charindex(@delim COLLATE Slovenian_BIN2,
                             @list + convert(nvarchar(MAX), @delim))
     UNION ALL
     SELECT start = stop + 1,
            stop = charindex(@delim COLLATE Slovenian_BIN2,
                             @list + convert(nvarchar(MAX), @delim), stop + 1)
     FROM   csvtbl
     WHERE  stop > 0
  )
  SELECT ltrim(rtrim(convert(nvarchar(4000),
            substring(@list, start,
                      CASE WHEN stop > 0 THEN stop - start ELSE 0 END)
         ))) AS Value
  FROM   csvtbl
  WHERE  stop > 0

Here is an example of using this function. There is a little twist:

CREATE PROCEDURE get_company_names_cte @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   cte_split_inline(@customers, ',') s ON C.CustomerID = s.Value
   OPTION (MAXRECURSION 0)
go
EXEC get_company_names_cte 'ALFKI, BONAP, CACTU, FRANK'

Note the OPTION clause. Without this clause, SQL Server would terminate the function prematurely if there are more than 100 elements in the list, 100 being the default value for MAXRECURSION. MAXRECURSION serves as a safeguard if you happen to write a recursive CTE which never terminates. For more "normal" recursive CTEs that wind up employee-manager relations, 100 is a lot, but for our purposes 100 is a far too low number. Here we set MAXRECURSION to 0, which turns off the check entirely.

But why is the OPTION clause not within the function? Simply because OPTION clauses are not permitted in inline functions. Recall that inline functions are not really functions at all, but just macros that are pasted into the referring query.

The function does not return the list position, but it would be easy to do so, by adding a counter column to the CTE:

SELECT ..., listpos = 1
...
UNION ALL
SELECT ..., listpos = listpos + 1
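As a hedged sketch of the complete change, here is what the core function might look like with such a counter. The function name cte_split_pos is my own invention for illustration; this is an untested adaptation of cte_split_core:

```sql
CREATE FUNCTION cte_split_pos (@list  nvarchar(MAX),
                               @delim nchar(1) = ',') RETURNS TABLE AS
RETURN
   WITH csvtbl(start, stop, listpos) AS (
     SELECT start = convert(bigint, 1),
            stop = charindex(@delim, @list + @delim),
            listpos = 1
     UNION ALL
     SELECT start = stop + 1,
            stop = charindex(@delim, @list + @delim, stop + 1),
            listpos = listpos + 1
     FROM   csvtbl
     WHERE  stop > 0
   )
   SELECT substring(@list, start,
                    CASE WHEN stop > 0 THEN stop - start ELSE 0 END)
          AS Value, listpos
   FROM   csvtbl
   WHERE  stop > 0
```

Each recursive step processes exactly one element, so the counter is guaranteed to follow list order, something the row_number-based methods cannot promise without an ORDER BY on a position expression.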

As you may guess, the function above takes a performance toll from the combination of nvarchar(MAX) and charindex. This can be addressed by writing a multi-statement function that breaks up the input into chunks, in a similar vein to duo_chunk_split_me. However, I found in my tests that the overhead for bouncing the data over the return table in a multi-statement function slightly exceeded the cost for charindex on nvarchar(MAX). Since the selling point of the CTE method is its low intrusiveness, the chunking function is of less interest, and I leave it as an exercise to the reader to implement it. Or peek at this link.

As with all other inline functions, the optimizer failed on the EXISTS queries in my test, and it took over 30 seconds to extract 20 elements from my test table. It may be better to just extract the list elements into a temp table and use that in the real query.

Looking at Table-Valued Parameters

I am not covering how to use table-valued parameters in this article, but instead there is a separate article Using Table-Valued Parameters with SQL Server and .NET where I have detailed examples. However, it is still worthwhile to discuss their performance in relation to the other methods, so let's take a look at that before we move to the methods that are less useful.

Performance in SQL Server

As I have said a few times already, using table-valued parameters is the preferred method for passing a list of values to SQL Server. One important reason is simplicity: writing a stored procedure that accepts a table-valued parameter is straightforward. Not that using a list-to-table function is a big deal, but relational databases are centred around tables.

If you read the other article, you will learn that passing a value to a TVP from ADO .NET is a very simple affair. TVPs also have the advantage that you can add constraints to the table type to enforce uniqueness or some other type of contract. Nor do you have to worry in your database code about format errors in a comma-separated list.

Does this also mean that this method gives you the best performance? In general, yes. In each and every case? No. When running the tests for the performance appendix, I did find situations where other methods outperformed TVPs. However, I believe that in the long run TVPs will give you better performance than any other method. There are two reasons for this:

Data is already in table format. With all other methods, cycles need to be spent on parsing a character string to get the data into table format. In my tests, the fixed-length method performed better in some tests with integer data. Indeed, this method just chops up a fixed-length string reading from a table of numbers, so it is very similar to reading from a table variable. However, TVPs have one more ace up the sleeve:

The optimizer gets a clue. For all other methods, the optimizer has no understanding of what is going on. In many situations you get a useful plan nevertheless, but with methods based on inline T‑SQL functions the optimizer often loses grip entirely and produces a plan that is nothing but a disaster. And even if the plan is useful, it may not be optimal, because the optimizer has no idea how many rows your list of values will produce, which means that if you use the list in a query with other tables, row estimates are likely to be way off.

This is different for table-valued parameters. Just like table variables, TVPs do not have distribution statistics, but there is nevertheless one piece of information: cardinality. That is, the first time you call a procedure that takes a TVP, the optimizer sniffs the TVP – as it sniffs all other parameters – and the optimizer sees that the TVP has so and so many rows. This gives the optimizer better odds for good estimates for the number of rows in the rest of the query.

Not that it is perfect: there is the general problem that the sniffed value may be atypical. (For a closer discussion on parameter sniffing, see my article Slow in the Application, Fast in SSMS?.) And it is not always the case that correct information leads to the best plan; in the performance appendix for SQL 2008 you can read about a case where SQL Server chooses an incorrect plan, despite having more accurate cardinality information. But as I discuss in the appendix, this concerns only a window of input sizes. Furthermore, cardinality is far from sufficient in all cases. Consider the query:

SELECT * FROM Orders WHERE CustomerID IN (SELECT custid FROM @custids)

Say that there are four values in @custids. If they are just four plain customers, seeking the non-clustered index on CustomerID is good. But if they are the top four customers accounting for 40 % of the volume, you want a table scan. But since a TVP does not have distribution statistics, the optimizer cannot distinguish the cases. The workaround is simple: bounce the data over a temp table and take advantage of the fact that temp tables have distribution statistics. Since that workaround is the same as for all list-to-table functions, you may argue that when you need to do this, there is no special performance advantage of TVPs.
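A hedged sketch of that workaround, assuming a table type whose single column is named custid:

```sql
-- Bounce the TVP into a temp table; the temp table gets distribution
-- statistics, which the TVP lacks, so the optimizer can tell four plain
-- customers from the four biggest ones.
CREATE TABLE #custids (custid nchar(5) NOT NULL PRIMARY KEY)

INSERT #custids (custid)
   SELECT custid FROM @custids

SELECT O.*
FROM   Orders O
WHERE  O.CustomerID IN (SELECT custid FROM #custids)
```

The cost is one extra write of the data, which for short lists is negligible next to the main query.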

Client-side Performance

A reasonable question is: do TVPs incur more calling overhead than regular parameters? The answer is yes. In my tests I found that passing 50 000 integer values to an unindexed TVP from ADO .NET took 40-50 ms, compared to 20-35 ms for a comma-separated list. (Note that these numbers apply to the specific hardware that I used for the tests.) For a TVP with a primary key, the overhead was around 150 ms, because in this case SQL Server must sort the incoming data to verify that there are no duplicates.

While this overhead may seem considerable, you need to put it in perspective and look at the total execution time, and in most cases, the server-side execution time exceeds the numbers in the previous paragraph by a wide margin. As just one data point: in my test, the server-side execution time for my join test over 50 000 list elements was 213 ms for a non-indexed TVP, while the best non-TVP method (fixed-length binary input) needed 420 ms. The performance appendix for SQL 2008 has more details.

As for having a primary key on the table type, this is something that cuts both ways. Without it, call overhead is lower, because there is no need for the sort, but now the Sort operator will typically appear in the query itself. However, if you can ensure that the data you pass to the TVP is sorted, you can avoid the sort altogether. How to do this is something I discuss in the chapter Looking Closer on the SqlMetaData Constructors in the article about TVPs.

Dynamic SQL

Let me say this directly: this is a method you should never use. Way back in SQL 7, when there were no table-valued functions, a case could be made for dynamic SQL, but today, no. Still, in SQL 2016 I see people using or suggesting this method in forums, which is very disheartening.

The reason why it still flourishes may be that for a list of numbers, it appears deceivingly simple:

CREATE PROCEDURE get_product_names_exec @ids nvarchar(4000) AS

EXEC('SELECT ProductName, ProductID
      FROM   Northwind..Products
      WHERE  ProductID IN (' + @ids + ')')
go
EXEC get_product_names_exec '9, 12, 27, 37'

And this has always been the curse with dynamic SQL: it looks simple to start with, but when you look more into it, there are quite a few complex matters to consider. To start with, let's look at an example with a list of strings:

CREATE PROCEDURE get_company_names_exec @customers nvarchar(2000) AS
   EXEC('SELECT CustomerID, CompanyName
         FROM   Northwind..Customers
         WHERE  CustomerID IN (' + @customers + ')')
go
EXEC get_company_names_exec '''ALFKI'', ''BONAP'', ''CACTU'', ''FRANK'''

The nested quotes make procedures using this method difficult to call.

The next thing is permissions. Normally, when you use stored procedures, users do not need direct permissions to the tables, but this does not apply when you use dynamic SQL. You can work around this with certificate signing, something I discuss in detail in my article Giving Permissions through Stored Procedures. But then you have a method that requires a complexity that no other method in this article calls for.

Another critical issue is that your code will be open to SQL injection. That is, a malicious caller could pass data in @ids that causes the dynamic SQL to execute something completely different than you intended to. For a discussion on SQL injection and other issues with dynamic SQL see my article The Curse and Blessings of Dynamic SQL.

Don't overlook the impact on the readability and maintainability of your code. The above may not look too scary, but imagine a query that runs well over 50 lines and has other variables in it. Should you make it into dynamic SQL, which makes it more difficult to read, understand and maintain and deprives you of the possibility to track dependency information, only to put in a comma-separated list? Of course not!

The final blow is performance. For long lists, dynamic SQL is slow, one of the slowest methods, way behind the iterative method. The reason for this is that with dynamic SQL, you get the cost for compilation with almost every query, and compilation of long IN lists is anything but cheap. In my first round of tests in 2006, processing 10 000 elements took 4-6 seconds on reasonable hardware. When I ran my performance tests in 2009 and added a test of 50 000 list elements, I had to exclude dynamic SQL from that test, because the query failed with an internal query-processor error.

If you reuse the same list, the plan is likely to be in the cache, and in this case execution will be fast the second time. But keep in mind that it has to be exactly the same list, as the cache is both case- and space-sensitive. And how likely is this? Rather, this points to a second problem with dynamic SQL: cache bloat. You will fill the plan cache with plans that are very rarely reused. On machines with lots of memory, this can lead to severe performance problems because of hash collisions in cache-lookups.
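If you want to see the bloat for yourself, here is a hedged sketch using the get_product_names_exec procedure from above. The two calls pass the same logical list, differing only in spacing, yet each compiles and caches its own plan:

```sql
EXEC get_product_names_exec '9,12,27,37'
EXEC get_product_names_exec '9, 12, 27, 37'

-- Both IN-list variants should now show up as separate entries in the
-- plan cache.
SELECT st.text
FROM   sys.dm_exec_cached_plans cp
CROSS  APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE  st.text LIKE '%ProductID IN%'
  AND  st.text NOT LIKE '%dm_exec_cached_plans%'
```

The final condition is only there to keep the diagnostic query itself out of the result.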

You could argue that since the optimizer sees the exact values, this increases your chances to a good plan, particularly if there is skew in the data. This is certainly true for short lists. However, there are better alternatives for that case which we will look at in the next section.

Passing the List as Many Parameters

It couldn't be less fancy:

CREATE PROCEDURE get_product_names_manyparams
       @p1 int = NULL, @p2 int = NULL, @p3 int = NULL, @p4 int = NULL,
       @p5 int = NULL, @p6 int = NULL, @p7 int = NULL, @p8 int = NULL AS
  SELECT ProductName, ProductID
  FROM   Northwind..Products
  WHERE  ProductID IN (@p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8)
go
EXEC get_product_names_manyparams 9, 12, 27, 37

That is, just have a stored procedure with lots of parameters, and then pass as many as you need.

The reader immediately realises that there is a maintenance problem here. No programmer likes to write code like that. And there is an obvious limitation: no matter how many parameters you put in, you run the risk of needing one more.

And, true, this solution is only an alternative when your upper limit is fairly low. After all, there are applications where 30 list elements would mean a lot. The advantage? If you stick OPTION (RECOMPILE) at the end of the query, the optimizer knows exactly which values you are looking for. If there is a skew, it may be that different values of the search parameters call for different plans. And since the plan is not cached, there is no cache bloat as with dynamic SQL. Or any permissions issue.

Still, you may not be comfortable with having this number of parameters. An alternative is to pass the regular comma-separated list and unpack it into a temp table or a table variable, and if the list is short, extract the few values into variables. Your query would then have both an IN over the variables and an operation against the temp table. For a further discussion of this idea, see the section Handling Multi-Valued Parameters in my article Dynamic Search Conditions.
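Here is a hedged sketch of that pattern. The three-variable cutoff, the #ids temp table and the row_number trick are my own illustration, not code from the article referenced:

```sql
DECLARE @v1 int, @v2 int, @v3 int, @cnt int
SELECT @cnt = COUNT(*) FROM #ids

IF @cnt <= 3
   -- Number the unpacked values and pick out up to three of them.
   SELECT @v1 = MAX(CASE rowno WHEN 1 THEN id END),
          @v2 = MAX(CASE rowno WHEN 2 THEN id END),
          @v3 = MAX(CASE rowno WHEN 3 THEN id END)
   FROM   (SELECT id, rowno = row_number() OVER (ORDER BY id)
           FROM   #ids) AS numbered

SELECT ProductName, ProductID
FROM   Northwind..Products
WHERE  (@cnt <= 3 AND ProductID IN (@v1, @v2, @v3))
   OR  (@cnt > 3  AND ProductID IN (SELECT id FROM #ids))
OPTION (RECOMPILE)
```

Thanks to OPTION (RECOMPILE), the optimizer knows the value of @cnt at compile time and can prune the branch that does not apply.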

The absolute upper limit for this method is 2 100 list elements; that is how many parameters you can pass to a stored procedure. You may think such a procedure is completely impossible to write, but it is in fact quite easy. In order to be able to test the performance, I created my test procedures with 2 000 parameters, and I used T‑SQL to generate the parameter lists and the IN list. This is possible with the help of the FOR XML PATH clause. And once again the Numbers table came into service:

SELECT '@p' + ltrim(str(Number)) + ' int = NULL, '
FROM   Numbers
WHERE  Number <= 2000
ORDER  BY Number
FOR XML PATH('')

Normally, when you use FOR XML PATH you specify the name for the enclosing element, but if you leave it as the empty string, the result is a plain text string although it's still returned as if it was an XML document. Run the above in grid mode in SSMS, open the XML document, copy and paste. Go to the end of the line to remove the final comma and you're done. As for passing the parameters from client code, that should not constitute any problem. That is after all only a loop.
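The IN list inside the procedure body can be generated the same way; only the string expression changes. A hedged sketch:

```sql
-- Generates "@p1, @p2, @p3, ..." for pasting into the IN clause.
-- Trim the trailing comma from the end of the result by hand.
SELECT '@p' + ltrim(str(Number)) + ', '
FROM   Numbers
WHERE  Number <= 2000
ORDER  BY Number
FOR XML PATH('')
```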

So how did my 2 000-parameter procedure perform? If I look at my main summary, it seems to do pretty well. 19 ms in execution time for 2 000 elements which is better than any other method. But that result is based on timing within the procedure only. When I instead look at the total time as seen from the client the situation is outright poor. The total time for the call in this case exceeds 550 ms. Or more than double the execution time for dynamic SQL for equally many elements.

My regular test setup is a Perl script, and in order to investigate whether this extreme overhead was due to poor performance in my Perl-specific API (as is the case for table-valued parameters), I also wrote a test program in C#. The call overhead was lower with SqlClient, but still almost 400 ms for 2 000 parameters.

Making the List into Many SELECTs

Inserting Many Rows

This method is mainly of interest in the situation you need to insert many rows, so I will introduce it from that angle.

Say that you don't want to use XML or JSON for some reason, and neither can you use table-valued parameters. What is left – if we overlook bulkload or SQL Server Integration Services – is to use INSERT statements in some way.

The simplest approach is to run a batch that goes:

SET NOCOUNT ON
INSERT tbl(...) VALUES (...)
INSERT tbl(...) VALUES (...)
...

SET NOCOUNT ON is needed to prevent each INSERT from generating a (1 row affected) message to the client and adding a lot of network overhead. Still, this is not terribly efficient, because each INSERT writes to the transaction log. So you would want a way to insert many rows in one statement. Here is a way:

INSERT tbl (...)
   SELECT val1a, val2a, val3a
   UNION ALL
   SELECT val1b, val2b, val3b
   UNION ALL
   ...

That is, instead of a lot of INSERT statements you have one big fat SELECT with lots of UNION ALL clauses. Alas, this does not perform well either. For small batches around 20 elements, it's faster than many INSERT statements, as we reduce the transaction-log overhead. But as the batch size grows, the time it takes for the optimizer to compile the query plan explodes, and the total execution time can be several times that of a batch of single-row inserts. One way to handle this is to generate many INSERT SELECT UNION batches, with a reasonable batch size of 20-50 elements. That gives you the best of those two methods.

Since SQL 2008 we can provide more than one row with INSERT VALUES:

INSERT tbl(a, b)
   VALUES (1, 2), (12, 12), (14, 99)

This relieves you from the bulkier SELECT UNION syntax. Performance-wise, there is no difference though. Internally, this is converted to the same representation as those SELECT UNION statements, with the same poor behaviour as the number of rows increases. Except that Microsoft realised this and put a cap on INSERT VALUES, so that it does not accept more than 1000 rows. However, my tests indicate that you should stay well below that level. Things may change, though. According to Microsoft's response to a Connect item on this theme, it seems that this was to be addressed in SQL 2012. However, I have not verified that it really happened. The cap is still in place in SQL 2016.

Instead of using VALUES or SELECT UNION, there is a third method that is a lot faster. You get the benefit of a single INSERT statement, but you don't have to pay the excessive price for compilation. The trick is to use INSERT-EXEC, and it was suggested to me by Jim Ebbers:

INSERT sometable (a, b, c)
   EXEC('SELECT 1, ''New York'', 234
         SELECT 2, ''London'', 923
         SELECT 3, ''Paris'', 1024
         SELECT 4, ''Munich'', 1980')

Normally when you use INSERT-EXEC you have a stored procedure or a batch of dynamic SQL that returns one result set, and that result set is inserted into the table. But if the stored procedure or batch of dynamic SQL returns multiple result sets, this works as long as all result sets have the same structure. The reason why this method is so much faster than using UNION is that every SELECT statement is compiled independently. One complexity with this method is that you have to be careful to get the nested quotes correct when you generate the EXEC statement.

It may come as a surprise that I propose dynamic SQL here, given that I slammed it earlier. But the full story is that all three methods here presume that the client would generate the full batch with INSERT statements, values and all, and what is that if not dynamic SQL? Obviously, the client would have to have safeguards against SQL injection, permissions would not be an issue etc. One scenario where this may fit in is when you export data in the form of a script to be run on a server elsewhere.

Comma-Separated Lists

It may not seem apparent, but it is possible to use the methods above as a foundation to unpack comma-separated lists, although it should be said directly that this belongs more under the heading Crazy things you can do in T‑SQL. It's difficult to see a situation where this method would be the best choice.

Originally this idea was suggested to me by Steve Kass in the days when SQL 2000 ruled the world. The idea is that you use the replace function to replace the delimiter with the binding elements. In Steve Kass's original proposal he used UNION ALL SELECT and then he added some more plumbing to the string so that you have an SQL batch. Then you can use INSERT EXEC to insert the result of the SQL batch into a temp table, which you then use in your target query. Although UNION has very poor performance, I first show a stored procedure that uses UNION, for the simple reason that this code is simpler to understand than what is to follow:

CREATE PROCEDURE unpack_with_union
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = N',' AS
DECLARE @sql nvarchar(MAX),
        @q1  char(1),
        @q2  char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT INTO ' + @tbl + ' SELECT ' +
               replace(replace(@list, @q1, @q2), @delimiter,
                       N' UNION ALL SELECT ')
--PRINT @sql
EXEC (@sql)

The inner replace is there to handle the potential risk that @list includes single quotes, which we double in an attempt to protect us against SQL injection. The outer replace replaces the delimiter with the SQL plumbing. The name of the table to insert to is passed in the parameter @tbl. The variables @q1 and @q2 save me from having a mess of single quotes all over the place. (It will get worse than this.)
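To make the two replaces concrete: for a call like EXEC unpack_with_union '9, 12, 27, 37', '#temp', the generated batch (visible if you uncomment the PRINT) comes out roughly as below. The doubled blanks are there because only the commas are replaced; the spaces in the input remain:

```sql
INSERT INTO #temp SELECT 9 UNION ALL SELECT  12 UNION ALL SELECT  27 UNION ALL SELECT  37
```

The stray blanks are harmless for integer lists, since SQL Server ignores leading whitespace when converting the literals.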

Here is an example of how to use it:

CREATE PROCEDURE get_product_names_union @ids varchar(50) AS
   CREATE TABLE #temp (id int NULL)

   EXEC unpack_with_union @ids, '#temp'

   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   #temp t ON P.ProductID = t.id
go
EXEC get_product_names_union '9, 12, 27, 37'

The reason that this is a procedure and not a function is that you cannot use dynamic SQL in functions.

However, the performance of this procedure and an unpack_with_insert that uses INSERT VALUES is not defensible. It took unpack_with_union almost five seconds to unpack 2 000 numbers in one of my performance tests. unpack_with_insert performed the same task in 1.3 seconds.

Decent performance is possible to achieve if we use the trick with INSERT-EXEC and many small SELECT statements, as suggested by Jim Ebbers. Here is such a procedure, a bit more complex than the one above.

CREATE PROCEDURE unpack_with_manyselect
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = ',' AS
DECLARE @sql nvarchar(MAX),
        @q1     char(1),
        @q2     char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl + ' EXEC(' + @q1 + 'SELECT ' +
               replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                       @delimiter COLLATE Slovenian_BIN2,
                       ' SELECT ') + @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here, the inner replace replaces a single quote within @list by no less than four single quotes. This is because the INSERT EXEC itself is nested into an EXEC(). (You may guess why I have that commented PRINT @sql there!) In this procedure I have also added a COLLATE clause. (The other procedure is so slow that the COLLATE would not make any difference anyway.) This procedure unpacks 2 000 numbers in 75 ms – decent, but slower than most other methods in this article.
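For reference, a call like EXEC unpack_with_manyselect '9,12,27', '#temp' builds roughly this nested statement, where each SELECT is compiled on its own:

```sql
INSERT #temp EXEC('SELECT 9 SELECT 12 SELECT 27')
```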

If you look closer, you see that the procedures presented so far only work well with numeric values. If you want to pass a list of strings, you need to quote the strings yourself. Here is a version for handling strings to save you from this:

CREATE PROCEDURE unpackstr_with_manyselect
                 @list      nvarchar(MAX),
                 @tbl       varchar(30),
                 @delimiter nchar(1) = ',' AS
DECLARE @sql    nvarchar(MAX),
        @q1     char(1),
        @q2     char(2)
SELECT @q1 = char(39), @q2 = char(39) + char(39)
SELECT @sql = 'INSERT ' + @tbl +
               ' EXEC(' + @q1 + 'SELECT ltrim(rtrim(' + @q2 +
               replace(replace(@list, @q1 COLLATE Slovenian_BIN2, @q2 + @q2),
                       @delimiter COLLATE Slovenian_BIN2,
                       @q2 + ')) SELECT ltrim(rtrim(' + @q2) +
               @q2 + '))' + @q1 + ')'
--PRINT @sql
EXEC (@sql)

Here is an example of usage:

CREATE PROCEDURE get_company_names_manyselect @custids nvarchar(2000) AS
   CREATE TABLE #temp (custid nchar(5) NULL)

   EXEC unpackstr_with_manyselect @custids, '#temp'

   SELECT C.CompanyName, C.CustomerID
   FROM   Northwind..Customers C
   JOIN   #temp t ON C.CustomerID = t.custid
go
EXEC get_company_names_manyselect 'ALFKI, BONAP, CACTU, FRANK'

Is this method really a good alternative for handling comma-separated lists? No. Performance for integer lists is on par with the iterative method, but it is significantly slower with lists of strings. The iterative method has the benefit that it is easy to adapt the method to support different input formats, and getting information like the list position is simple. unpack_with_manyselect does not lend itself to any of that. The strength here lies in inserting many values.

Note: all procedures here include an INSERT. But you could leave out the INSERT from the procedure and instead call the procedure from INSERT EXEC. In that case, you have the choice of using a temp table or using a table variable.

Really Slow Methods

In a Q&A column of an SQL journal some year or two after SQL 2000 and table-valued functions had hit the streets, the following solution was suggested by one SQL Server MVP, referring to another MVP, who both shall go nameless:

CREATE PROCEDURE get_company_names_charindex @customers nvarchar(2000) AS
   SELECT CustomerID, CompanyName
   FROM   Northwind..Customers
   WHERE  charindex(',' + CustomerID + ',', ',' + @customers + ',') > 0
go
EXEC get_company_names_charindex 'ALFKI,BONAP,CACTU,FRANK'

By adding commas on both sides of the input string, we can use charindex to find ",ALFKI," etc. (Note that you cannot have embedded blanks here.)

The author noted in his column that this method would not have good performance, since embedding the table column in an expression precludes the use of any index on that column, leading to a table scan. But that's only a small part of the story. A plain table scan on my test table takes 800 ms, once the table is entirely in cache. This method needs 65 seconds for an input list of 20 strings! Why is it so slow? Because not only does the table have to be scanned, but the string also has to be scanned from start to end, and with a Windows collation and the full Unicode rules, this hurts. Badly.

Variations on this theme are illustrated by these WHERE clauses:

WHERE  patindex('%,' + CustomerID + ',%',   ',' + @customers + ',') > 0
WHERE  ',' + @customers + ','   LIKE    '%,' + CustomerID + ',%'

The solution with LIKE is as slow as, or slower than, charindex. I have never tested patindex on SQL 2005 or later.

There is a "go faster" switch here: the COLLATE clause. Add a COLLATE clause to force a binary collation, and performance improves by a factor of 7 to 10. But the problem is that if you pass a string like alfki,bonap,cactu,frank you may still expect a hit, which you would not get with a binary collation. So in my tests, I only forced the collation when the input was a list of integers, as in this example:

CREATE PROCEDURE get_product_names_realslow @ids varchar(200) AS
SELECT ProductName, ProductID
FROM   Northwind..Products
WHERE  charindex(',' + ltrim(str(ProductID)) + ',' COLLATE Slovenian_BIN2,
                 ',' + @ids + ',' COLLATE Slovenian_BIN2) > 0
go
EXEC get_product_names_realslow '9,12,27,37'

If you use an SQL collation, you also get this huge performance improvement if you can stick to varchar (and then you could still be case-insensitive).

But while an improvement by a factor of ten is impressive, we are still talking seven seconds for a list of 20 integers, when no other method needs as much as fifty milliseconds (save when the optimizer loses itself in the inline functions).

Concluding Performance Remarks

In this section I bring up some more comments on performance that I was not able to fit in elsewhere, and which are a little too nitty-gritty to appear in the general considerations.

The Optimizer and len()

I have said in several places that the optimizer does not know to use len(@param) when estimating how many rows inline_split_me and fixstring_single will produce. This is true as long as the input parameter is (n)varchar(MAX). But if the parameter is varchar(8000) or nvarchar(4000), the optimizer is actually able to use it. This means that if you write fixstring_single as:

ALTER FUNCTION fixstring_single(@str nvarchar(4000), @itemlen tinyint)
RETURNS TABLE
RETURN(SELECT listpos = n.Number,
              str = rtrim(convert(nvarchar(4000),
                      substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen)))
       FROM   listtest..Numbers n
       WHERE  n.Number <= len(@str) / @itemlen)

The optimizer will make an exact estimate of how many values your string will produce. Note that the original fixstring_single I presented includes a CASE expression to permit the last element to be shorter. This CASE expression is missing here, since if I add it to the WHERE condition, the optimizer finds it too difficult and resorts to the standard estimate of 30 % of the rows being returned.

Changing the input parameter to limited length in inline_split_me does not yield equally perfect results. The optimizer makes a correct estimate of how many rows it will read from Numbers, but it will of course have to guess how many rows in total the function will produce. The result is likely to be better than the standard estimate, though.

Keep in mind that while these changes may help you to improve performance, you are in for a gamble if the input can exceed 8000 bytes. Not least does this apply to fixed-length strings, which need more space per element.

JOIN, EXISTS and DISTINCT values

You have already read a section on JOIN vs. EXISTS, but there is a little more to say on the topic, and how to deal with duplicates in the input string.

Above, I said that in my tests all inline functions resulted in abysmal plans with my EXISTS queries and I suggested that it may be better sometimes to write a join query with DISTINCT. A different take on it is a tip that I got from SQL Server MVP Adam Machanic. He fought this problem and his way out was to add a DISTINCT to the query in the inline function.

I constructed variations of the inline functions in this article and ran them through my test setup. Indeed, the optimizer now settled for reasonable plans for most of my EXISTS queries. But for fixstring_multi and fixstring_multi2 I got even worse plans in some cases, also for the join queries, so this trick may cut both ways. (When I killed one query, it had run for 20 minutes. For an input of 20 elements.)

There is one important thing to observe if you take this road. Assuming that you are using the function for a list of integer values, it becomes important to have the conversion to integer inside the function. That is:

CREATE FUNCTION inline_split_int_distinct(@param nvarchar(MAX))
RETURNS TABLE AS
RETURN(SELECT DISTINCT convert(int,
                  substring(@param, Number,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      @param + convert(nvarchar(MAX), N','),
                                      Number) -
                            Number)
              ) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(convert(nvarchar(MAX), N',') + @param, Number, 1) =
                  N',' COLLATE Slovenian_BIN2)

Else SQL Server will spend time on sorting and comparing a bunch of string values, which is far costlier than dealing with numbers.

It seems quite clear that putting DISTINCT into a list-to-table function in your standard library is not a very good idea, as that would give you the overhead of dealing with duplicates every time. And there may be occasions when you actually want those duplicates. But it is certainly a good idea to stow this in your bag of tricks.

A final note on the topic of eliminating duplicates: there is a third option besides DISTINCT and EXISTS, and that is to bounce the values from the input list over a table variable (including a multi-statement UDF) or a temp table. You can of course use DISTINCT when you do this, but an alternative is to add a primary key with the IGNORE_DUP_KEY clause. This is a more efficient way of eliminating duplicates that are only occasional. See this blog post from SQL Server MVP Alex Kuznetsov for more details.
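As a sketch of that approach – iter_intlist_to_tbl stands in for whatever list-to-table function you have in your library, and I assume it returns the values in a column called number:

CREATE TABLE #ids (id int NOT NULL,
                   PRIMARY KEY (id) WITH (IGNORE_DUP_KEY = ON))
INSERT #ids (id)
   SELECT number FROM iter_intlist_to_tbl('9,12,27,9,37,12')

The duplicates are silently dropped at insert time (SQL Server only emits the warning Duplicate key was ignored), and any query that joins to #ids can rely on the values being unique.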

Unpacking Lists in a Table, Take Two

When you have the dreadful situation that you need to unbundle a multi-valued column, as I discussed in the section Comma-separated Lists in a Table Column in the introductory article, the presumptions for performance are somewhat different from when you unpack a list in an input parameter.

The first thing to note is that the size of your table column may permit you to change the input parameter in your function from MAX to varchar(8000) or nvarchar(4000) without any risk of data loss. Since the limited string types perform better with charindex, this should give you a performance boost. And since you may be working with several hundred lists or more at the same time, the performance for splitting shorter lists may also have an impact.

The second thing is more surprising: the relation in performance between the methods changes. Nikola Ilijev has published a blog post where he presents tests with different functions against a table. And in his tests the CTE-based methods were faster than inline_split_me and other functions based around the Numbers table. I was able to repeat his results, and interestingly enough, the difference also shows when I use a multi-statement version of the inline functions.

Exactly why this is, I don't know, but I think it is rather the table-of-numbers functions that are slow in this context than the other way round. To wit, when I tested the CLR functions in Nikola's test setup, they outperformed both CTEs and functions based on Numbers by a wide margin.

What About In-Memory OLTP?

SQL 2014 introduced In-Memory OLTP and natively compiled stored procedures. In SQL 2014 the surface area was so small that there was not much you could do with it. SQL 2016 added more functionality for Hekaton (as In-Memory OLTP is also known), and you can now also create natively compiled scalar functions. But there are still no natively compiled table-valued functions, nor is charindex available. This makes it difficult to implement a reusable string-splitter in Hekaton. It is likely, though, that Microsoft will continue to develop In-Memory OLTP and the surface area will increase. And once these devices – charindex and table-valued functions – are in place, this could change the competition among the methods presented in this article quite a bit.

What if you need to do string-splitting inside a natively compiled module today? Since string_split and OPENJSON are also unavailable, these easy picks are not options. In SQL 2016, you can at least implement your own charindex as a natively compiled function (don't forget to cast to a binary collation!), and you could then use the iterative method directly in your stored procedure. If you have a memory-optimized version of the Numbers table, you could use that method, obviously. There certainly is a strong incentive to avoid it altogether and pass the data in a table-valued parameter from the client – TVPs are supported with natively compiled procedures.
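To give an idea of what such a home-grown charindex could look like, here is a sketch of a natively compiled scalar function for SQL 2016. The name my_charindex is made up, and I have not benchmarked this – treat it as an illustration, not as a tuned implementation:

CREATE FUNCTION dbo.my_charindex(@delim nchar(1), @str nvarchar(4000), @start int)
RETURNS int
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
                   LANGUAGE = N'English')
   -- Scan character by character; the binary collation keeps the
   -- comparisons cheap, just as for the methods earlier in the article.
   DECLARE @pos int = @start,
           @len int = len(@str)
   WHILE @pos <= @len
   BEGIN
      IF substring(@str, @pos, 1) = @delim COLLATE Latin1_General_100_BIN2
         RETURN @pos
      SET @pos = @pos + 1
   END
   RETURN 0   -- Not found, mimicking charindex.
END

With a function like this in place, the loop of the iterative method can be transplanted into a natively compiled procedure.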

Final Words

You have now seen a number of methods of passing a list of values to SQL Server to use these values to find data in a table. Most of the methods transform the list of values to table format. I have also discussed general considerations on how to apply these methods. If you want to know more about the performance of these methods, there are two appendixes to this article where I present data from my performance tests.

But keep in mind that everything I say about performance here relates to my tests on SQL 2005 and SQL 2008. With your queries, and a more recent version of SQL Server, the situation may be different, and if you need the best performance, you need to run benchmarks with your queries. And you may find that you get the best performance with some variation that is not described in this article. On the other hand, in most situations, the plain iterative method will serve you very well.

Acknowledgements and Feedback

I have not invented all these methods myself. This article summarises the work of a lot of people in the SQL community. There are those who came up with an original idea, those who have refined the method itself, the interfaces or some other detail. I cannot mention them all, because I simply don't know about all of them.

That said, here is a list of persons who have contributed to this article, directly or indirectly, knowingly or unknowingly. One great source has been current and former SQL Server MVPs: Itzik Ben-Gan, Fernando Guerrero, Roy Harvey, Umachandar Jaychandran, Steve Kass, Narayana Vyas Kondreddi, Bryant Likes, Alejandro Mesa, Simon Sabin, Adam Machanic, Tom Moreau, Tony Rogerson, Nigel Rivett, Anith Sen, Razvan Socol and Linda Wierzbicki. But there are also a lot of people who have contributed through mail or discussions in public newsgroups: Jim Ebbers, Brian W Perrin, Sam Saffron, Ivan Arjentinski, Joe Celko, David Craig, Ken Henderson, Paul Marsh, Chris Payne, Daniel Sebestyen, Dariusz Śpiewak, John Moreno, Bharathi Veeramac, James Vitale, Nikola Ilijev and David Satz. And there are probably a few more that I should have mentioned but that I have forgotten.

As always, I welcome feedback on the article if you think something is not covered correctly. Not the least do I welcome it if you point out spelling and grammar errors! The address is esquel@sommarskog.se.

Revisions

2016-08-21

Refreshed the article after the release of SQL 2016. This article is now known as The Long Version, and there is now a separate Short Version which serves as an introduction and as a quick answer for those who only need to know how to handle a comma-separated list, but do not need to learn about umpteen methods to do it. In the light of this, the introduction is entirely rewritten.

The article now covers two new methods available in SQL 2016: the built-in functions string_split and OPENJSON.

Overall, I have reviewed the entire article and refreshed passages where I felt it was called for. Not the least I have removed text "this is new in SQL 2005 compared to SQL 2000", as that is quite out of date by now.

There is also a short section on Hekaton (In-Memory OLTP) at the end.

 
2013-11-02
David Satz pointed out that there is a difference in how blank elements are interpreted depending on whether you use '.' or '(./text())[1]' in element-centric XML, and I have added a short discussion on this.
2010-10-10

Rewrote the section on XML. I now suggest that it is best practice to use [1] with attribute-centric XML, although it's redundant. For element-centric XML, I point out that you need a trace flag to avoid query-plan problems, but even better is to use the text() node test. This gives you the same performance with element-centric XML as you get with attribute-centric XML. Updated the examples to insert many rows from XML so that they do not use parent-axis access, which often has poor performance. I have added a section on typed XML in more detail. Because of the many changes around XML, I have also rerun the performance tests and updated the second performance appendix with the new results.

Besides the update on XML, I added a note on INSERT VALUES with many rows that performance should be better in the next version of SQL Server, according to Microsoft.

2010-01-17
Added comments on variations on fn_nums, and added the section Concluding Performance Remarks with some more tidbits that I had failed to include in the previous version.
2010-01-06
Extensive rewrite to adapt for SQL 2008, but also new methods and revelations.
2007-03-03
First version of the article for SQL 2005.

Back to my home page.