Arrays and Lists in SQL Server

The Short Version

An SQL text by Erland Sommarskog, SQL Server MVP. Latest Revision 2021-02-27.
Copyright applies to this text. See here for font conventions used in this article.

Introduction

This is a short article directed to readers with a limited experience of SQL Server programming that discusses how to handle a list of values delimited by commas or some other separator. There is an accompanying article, Arrays and Lists in SQL Server, The Long Version, which includes many more ways to crack lists into tables – and you could argue too many. This longer article is intended for an audience with a little more experience. Most likely, this short story should tell you all you need, but in a few places I will refer to the longer article for users with special needs.

Table of Contents

Introduction

The IN Misconception

How to Handle the List

Table-Valued Parameters

string_split – The Built-in Solution

Two Simple Multi-Statement Functions

What If You Cannot Use a Function?

An Anti-Pattern

Delimited Lists in a Table Column

Performance Tip

Revision History

The IN Misconception

I frequently see people in the SQL forums asking why this does not work:

DECLARE @list varchar(23) = '1,2,3,4'
SELECT ... FROM tbl WHERE col IN (@list)

The answer is that it does work: just look at this:

CREATE TABLE #test (id  int          NOT NULL,
                    col varchar(23)  NOT NULL)
INSERT #test(id, col)
   VALUES(1, 'Something'), (2, '1,2,3,4'), (3, 'Anything')
DECLARE @list varchar(23) = '1,2,3,4'
SELECT id FROM #test WHERE col IN (@list)

The SELECT returns the row with id = 2, not the others.

People who ask why IN does not work have a misconception about IN. IN is not a function. IN is an operator and the expression

col IN (val1, val2, val3, ...) 

is simply a shortcut for:

col = val1 OR col = val2 OR col = val3 OR ...

val1 etc. here can be table columns, variables or constants. The parser rewrites the IN expression to a list of OR conditions as soon as it sees it. (This explains why you get multiple error messages when you happen to misspell the name of the column to the left of IN.) There is no magical expansion of a variable's value. The value '1,2,3,4' means exactly that string, not a list of numbers.

How to Handle the List

Now you know why IN (@list) does not work as you hoped, but if you have a comma-separated list you still need to know how to work with it. That is what you will learn in this chapter.

Table-Valued Parameters

The best approach in my opinion is to reconsider having a delimited list at all. After all, you are in a relational database, so why not use a table? That is, you should pass the data in a table-valued parameter (TVP) instead of a delimited list. If you have never used TVPs before, I have an article, Using Table-Valued Parameters in SQL Server and .NET, where I give a tutorial on passing TVPs from .NET to SQL Server. The article includes a detailed description of passing a comma-separated list to a TVP. You will find that it is astonishingly simple.
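To give you a taste of the pattern, here is a minimal sketch. The names of the table type, the procedure and the products table are made up for this example, and this is not a complete tutorial:

```sql
-- A minimal sketch of the TVP pattern. The names integer_list_tbltype,
-- get_products and products are invented for this example.
CREATE TYPE integer_list_tbltype AS TABLE (n int NOT NULL PRIMARY KEY)
go
CREATE PROCEDURE get_products @prodids integer_list_tbltype READONLY AS
   SELECT p.prodid, p.name
   FROM   products p
   WHERE  p.prodid IN (SELECT n FROM @prodids)
go
-- From T-SQL you would call it like this:
DECLARE @ids integer_list_tbltype
INSERT @ids (n) VALUES (1), (2), (3)
EXEC get_products @ids
```

From a client API such as ADO .NET, you would pass a DataTable or similar structure as the parameter value, so no delimited list is ever built.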

Unfortunately, not all environments support TVPs, so using a TVP is not always an option. In that case you need to split the list into table format, and that is what we will look at in the rest of this chapter.

string_split – The Built-in Solution

If you are on SQL 2016 or later, there is a very quick solution:

SELECT ...
FROM   tbl 
WHERE  col IN (SELECT convert(int, value) FROM string_split('1,2,3,4', ','))

string_split is a built-in table-valued function that accepts two parameters. The first is a delimited list, and the second is the delimiter.

Unfortunately, though, string_split has a couple of shortcomings, so it does not always meet your needs.

In the following sections we will look at alternatives to string_split.
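One of these shortcomings is that string_split does not tell you the position of the elements in the list. If I read the documentation correctly, SQL Server 2022 and Azure SQL Database address this with an optional third parameter, enable_ordinal:

```sql
-- Requires SQL Server 2022 or Azure SQL Database. When the third
-- parameter is 1, the result set gains an ordinal column with the
-- position of each element in the list.
SELECT ordinal, convert(int, value) AS n
FROM   string_split('1,2,3,4', ',', 1)
```

If you are on an earlier version and need the list position, you need one of the alternatives below.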

Two Simple Multi-Statement Functions

If you search the web, there is no end of functions to split strings into table format. Here, I will present two simple functions that run on SQL 2008 or later, one for a list of integers and one for a list of strings. I should immediately warn you that these functions are not the most efficient and therefore not suitable if you have long lists with thousands of elements. But they are perfectly adequate if you are passing the contents of a multi-choice checkbox from a client where you would rarely have as many as 50 elements.

I opted to share these functions because they are simple and you can easily adapt them if you want different behaviour with regards to the choices that I have made. In my long article, I describe methods that are faster, but they all require more setup than just creating a function.

Below is a function to split a delimited list of integers. The function accepts a parameter for the delimiter which can be up to 10 characters long. The function returns the list positions for the elements. An empty element is returned as NULL. If there is a non-numeric value in the list, there will be a conversion error.

CREATE FUNCTION intlist_to_tbl (@list  nvarchar(MAX),
                                @delim nvarchar(10))
   RETURNS @tbl TABLE (listpos int NOT NULL IDENTITY(1,1),
                       n       int NULL) AS
BEGIN
   DECLARE @pos       int = 1,
           @nextpos   int = 1,
           @valuelen  int,
           @delimlen  int = datalength(@delim) / 2

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(@delim COLLATE Czech_BIN2, @list, @pos)
      SELECT @valuelen = CASE WHEN @nextpos > 0 THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos
      INSERT @tbl (n)
         VALUES (convert(int, nullif(substring(@list, @pos, @valuelen), '')))
      SELECT @pos = @nextpos + @delimlen
   END
   RETURN
END

You are likely to be puzzled by the COLLATE clause. This is a small speed booster. By forcing a binary collation, we avoid that SQL Server employs the full Unicode rules when searching for the delimiter. This pays off when scanning long strings. Why Czech? The language does not matter here, so I just picked one with a short name.

And why datalength divided by 2 and not len? datalength returns the length in bytes, whence the division. len does not count trailing spaces, so it does not work if the delimiter is a space.
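You can see the difference with a single-space delimiter:

```sql
-- With a single space, len() says 0 since trailing spaces are not
-- counted, while datalength() reports 2 bytes, i.e. one nvarchar
-- character. With len(), the loop above would never advance.
DECLARE @delim nvarchar(10) = N' '
SELECT len(@delim) AS len_result,                   -- 0
       datalength(@delim) / 2 AS datalength_result  -- 1
```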

Here are two examples:

SELECT * FROM intlist_to_tbl('1,2,3, 677,7 , ,-1', ',')
SELECT * FROM intlist_to_tbl('1<->2<->3<-> 677<->7<-><->-1', '<->')

Since the values are the same in both lists, the output is the same:

listpos     n
----------- -----------
1           1
2           2
3           3
4           677
5           7
6           NULL
7           -1

Here is an example of how you would use it in a simple query:

SELECT ...
FROM   tbl 
WHERE  col IN (SELECT n FROM intlist_to_tbl('1,2,3,4', ','))

If you find that you are only using comma-separated lists, you may grow tired of having to specify the delimiter every time. To that end, this wrapper can be handy:

CREATE FUNCTION intlisttotbl (@list nvarchar(MAX)) RETURNS TABLE AS
RETURN (
   SELECT listpos, n FROM intlist_to_tbl(@list, ',')
)

I leave it as an exercise to the reader to come up with a better name.

Here is a function for a list of strings. It accepts an input parameter of the type nvarchar(MAX), but the return table has both a varchar and an nvarchar column. I will return to why in a second. Like intlist_to_tbl it returns the list position. It trims leading and trailing spaces. In contrast to intlist_to_tbl, empty elements are returned as empty strings and not as NULL.

CREATE FUNCTION strlist_to_tbl (@list  nvarchar(MAX),
                                @delim nvarchar(10))
   RETURNS @tbl TABLE (listpos int NOT NULL IDENTITY(1,1),
                       str     varchar(4000) NOT NULL,
                       nstr    nvarchar(4000) NOT NULL) AS
BEGIN
   DECLARE @pos       int = 1,
           @nextpos   int = 1,
           @valuelen  int,
           @nstr      nvarchar(4000),
           @delimlen  int = datalength(@delim) / 2

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(@delim COLLATE Czech_BIN2, @list, @pos)
      SELECT @valuelen = CASE WHEN @nextpos > 0 THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos
      SELECT @nstr = ltrim(rtrim(substring(@list, @pos, @valuelen)))
      INSERT @tbl (str, nstr)
         VALUES (@nstr, @nstr)
      SELECT @pos = @nextpos + @delimlen
   END
   RETURN
END

Here are two examples:

SELECT * FROM strlist_to_tbl(N'Alpha (α) | Beta (β)|Gamma (γ)|Delta (δ)|', '|')
SELECT * FROM strlist_to_tbl(N'a///b///c///v///x', '///')

Here is the output:

listpos     str        nstr
----------- ---------- ----------
1           Alpha (a)  Alpha (α)
2           Beta (ß)   Beta (β)
3           Gamma (?)  Gamma (γ)
4           Delta (d)  Delta (δ)
5

listpos     str        nstr
----------- ---------- -----------
1           a          a
2           b          b
3           c          c
4           v          v
5           x          x

Note in the first result set that the Greek characters have been replaced by fallback characters in the str column. They are unchanged in the nstr column. (If you have a Greek or a UTF-8 collation, the two columns will be identical, though.)

Here are two examples of using this function:

SELECT ...
FROM   tbl 
WHERE  varcharcol IN (SELECT str FROM strlist_to_tbl('a,b,c', ','))

SELECT ...
FROM   tbl 
WHERE  nvarcharcol IN (SELECT nstr FROM strlist_to_tbl('a,b,c', ','))

These examples illustrate why there are two columns. If you are going to use the list against a varchar column, you need to use the str column. This is important because of the type-conversion rules in SQL Server. If you mistakenly compare varcharcol to nstr, varcharcol will be converted to nvarchar, and this can render any index on varcharcol ineligible for the query, leading to a performance disaster as the table must be scanned. Conversely, if you have an nvarchar column, you need to compare it to the nvarchar column nstr, because otherwise the result can be incorrect due to the character replacement in the conversion to varchar.

I would like to point out that these functions are by no means cast in stone; see them as suggestions. Feel free to modify them according to your preferences and needs.

What If You Cannot Use a Function?

If you are in the unfortunate situation that you don't have the permission or authorisation to create functions, what can you do? One option is of course to incorporate the procedure body in your code, but that is not really appealing.

An alternative that is popular with some people is to convert the list into an XML document. This works on all versions from SQL 2005 and up:

DECLARE @list nvarchar(MAX) = '1,99,22,33,45',
        @xml  xml
SELECT @xml = '<x>' + replace(@list COLLATE Czech_BIN2, ',', '</x><x>') + '</x>'
SELECT X.x.value('.', 'int') AS val
       --, row_number() OVER(ORDER BY X.x) AS listpos
FROM   @xml.nodes('/x/text()') X(x)

To give you an idea of what is going on, here is the resulting XML:

<x>1</x><x>99</x><x>22</x><x>33</x><x>45</x>

You can use the XML query in your main query directly, but it is probably easier to stick the result in a temp table and work from there.
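That is, something like this, reusing the query above:

```sql
-- Shred the list into a temp table first, and then use the temp table
-- in the main query.
DECLARE @list nvarchar(MAX) = '1,99,22,33,45',
        @xml  xml
SELECT @xml = '<x>' + replace(@list COLLATE Czech_BIN2, ',', '</x><x>') + '</x>'

CREATE TABLE #nums (n int NOT NULL PRIMARY KEY)
INSERT #nums (n)
   SELECT X.x.value('.', 'int')
   FROM   @xml.nodes('/x/text()') X(x)

-- SELECT ... FROM tbl WHERE col IN (SELECT n FROM #nums)
```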

As you can see, there is a listpos column in the query, but I have commented it out. This is because while it seems to give the desired result, it is to my knowledge not documented behaviour that you can rely on. That is, it could stop working at some point.

There is little reason to consider XML if you have access to string_split, so it is mainly an option if you are on compatibility level 120 or lower.

If you are on SQL 2016 or later, and you need the list position but you cannot write your own function, there is an option that is easier to use than XML, to wit JSON:

DECLARE @list nvarchar(MAX) = '1,99,22,33,45'
SELECT convert(int, [key]) + 1 AS listpos, convert(int, value) AS n
FROM   OPENJSON('[' + @list + ']')

That is, just wrap the list in brackets and off you go. If you have a delimiter other than comma, you will need to replace that delimiter with a comma to adhere to the JSON syntax.

OPENJSON returns a result set with three columns, but only key and value are of interest to you. Both are nvarchar(4000), so you need to cast them to int. Note that the values in key are zero-based.
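If you have a list of strings rather than integers, you can apply the same idea by also wrapping each element in double quotes, as long as the values are free of characters that are special to JSON:

```sql
-- A sketch for string lists. This assumes that the values contain no
-- double quotes, backslashes or other characters with a special
-- meaning in JSON.
DECLARE @list nvarchar(MAX) = N'Alpha,Beta,Gamma'
SELECT convert(int, [key]) + 1 AS listpos, value AS str
FROM   OPENJSON(N'["' + replace(@list, ',', '","') + '"]')
```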

In these examples, I used integer lists. I need to raise a word of warning if you are considering using XML or JSON for lists of strings. If the values are strictly alphanumeric, no sweat. But if there are characters that are special to XML or JSON, the method above will fall apart. It is possible to save the show with the help of CDATA sections that protect special characters, as in this example I got from Yitzhak Khabinsky:

DECLARE @list nvarchar(MAX),
        @xml  xml
SET @list = 'Dog & [Pony],Always < then,Glenn & Co. > 100';
SELECT @xml = '<x><![CDATA[' + 
                   replace(@list COLLATE Czech_BIN2, ',', ']]></x><x><![CDATA[') + 
                ']]></x>';
SELECT @xml
SELECT X.x.value('.', 'nvarchar(30)') AS val
       --, row_number() OVER(ORDER BY X.x) AS listpos
FROM   @xml.nodes('/x/text()') X(x);

The output is:

val

------------------------------

Dog & [Pony]

Always < then

Glenn & Co. > 100

If you feel that your head is starting to spin at this moment, you have my sympathy. Despite its complexity, it is probably the best solution when you cannot write a function. If you want an alternative, you can look at the CTE method that I describe in my longer article. This method can give you the list position in a guaranteed way.

When it comes to speed, XML and JSON are faster than the functions that I showed you in the previous section, and they should work well with lists of thousands of values. In particular, pay attention to the addition of the text() function in the .nodes method. Without it, shredding the XML takes about 50 % longer. (I owe this trick to Yitzhak Khabinsky.)

An Anti-Pattern

Amazingly enough, I still occasionally see people who use or propose dynamic SQL. That is, something like this:

SELECT @sql = 'SELECT ... FROM tbl WHERE col IN (' + @list + ')'

There are all sorts of problems here. Risk of SQL injection. It makes the code more difficult to read and maintain. (Just imagine that this is a large query spanning fifty lines that someone wrapped in dynamic SQL only because of the list.) Permissions can be a problem. It leads to cache littering. On top of that, performance is poor. Above I cautioned you that the functions I presented are not good for long lists – but they are certainly better than dynamic SQL. It takes SQL Server a long time to parse a long list of values for IN.

Do absolutely not use this!
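If you find yourself in dynamic SQL for some other reason, at least pass the list as a parameter to sp_executesql and split it inside the dynamic SQL, rather than inlining the values. Here is a sketch using the intlist_to_tbl function from above:

```sql
-- The list travels as a parameter, so there is no risk of SQL
-- injection and the query plan can be reused.
DECLARE @list nvarchar(MAX) = N'1,2,3,4'
EXEC sp_executesql
   N'SELECT ... FROM tbl WHERE col IN (SELECT n FROM intlist_to_tbl(@list, N'',''))',
   N'@list nvarchar(MAX)', @list
```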

Delimited Lists in a Table Column

Sometimes you may encounter a table column which holds a delimited list of values. This is an anti-pattern that appears to have become rather more popular over the years, despite the fact that relational databases are designed on the principle that a cell (that is, a column in a row) is supposed to hold one single atomic value. Storing delimited lists goes against that principle, and if you store data this way, you will have to pay a dear price in terms of performance and complex programming. The proper way is to store the data in the list in a child table.

Nevertheless, you may encounter a comma-separated list that someone else has designed. And even if you have the power to change the design, you still need to know how to handle it. Let's first get an example to work with:

CREATE TABLE orders (orderid    int          NOT NULL,
                     custid     int          NOT NULL,
                     orderdate  date         NOT NULL,
                     products   varchar(MAX) NOT NULL,
                     quantities varchar(MAX) NOT NULL,
                     prices     varchar(MAX) NOT NULL,
                     CONSTRAINT pk_orders PRIMARY KEY (orderid)
)
go
INSERT orders (orderid, custid, orderdate, products, quantities, prices)
   VALUES (1, 108, '20201215', 
           'A16769,B1234,B2679,DL123', '1,2,1,1', '100,123,9000,450'),
          (2, 985, '20201216', 
           'A16769,A8744,B1233,CBGB2,E98767', '3,4,1,1,7', '100,560,400,600,320'),
          (3, 254, '20201217', 
           'X5277', '19', '300')
go
SELECT * FROM orders

This is an unusually bad example with three comma-separated lists that are synchronised with each other. (Thankfully, I rarely see something this crazy in the wild!) To keep it simple, we first ignore the quantities and prices columns and run a query that lists the orders with one product per row:

SELECT o.orderid, o.custid, o.orderdate, p.str AS prodid
FROM   orders o
CROSS  APPLY strlist_to_tbl(o.products, ',') AS p
ORDER BY o.orderid, prodid

The key is the CROSS APPLY operator. APPLY is a kind of join operator. When you say A JOIN B, you add conditions with ON that correlate A and B, but B itself cannot refer to A. For instance, B cannot be a call to a table-valued function that takes a column from A as a parameter. But this is exactly what APPLY permits you to do. On the other hand, there is no ON clause with APPLY, as the relation between A and B is inside B. (B can also be a subquery.)

Here is the result set:

orderid     custid      orderdate  prodid
----------- ----------- ---------- ---------
1           108         2020-12-15 A16769
1           108         2020-12-15 B1234
1           108         2020-12-15 B2679
1           108         2020-12-15 DL123
2           985         2020-12-16 A16769
2           985         2020-12-16 A8744
2           985         2020-12-16 B1233
2           985         2020-12-16 CBGB2
2           985         2020-12-16 E98767
3           254         2020-12-17 X5277

Note: There is also OUTER APPLY. The difference between CROSS APPLY and OUTER APPLY is outside the scope of this article, though.

The normal design is of course to have two tables, orders and orderdetails. Here is a script to create a new table and move the data in the products, quantities and prices columns to this new table:

CREATE TABLE orderdetails (orderid   int NOT NULL,
                           prodid    varchar(10) NOT NULL,
                           qty       int         NOT NULL,
                           price     int         NOT NULL,
                           CONSTRAINT pk_orderdetails PRIMARY KEY (orderid, prodid)
)

INSERT orderdetails (orderid, prodid, qty, price)
   SELECT o.orderid, p.str, q.n, c.n AS price
   FROM   orders o
   CROSS  APPLY strlist_to_tbl(o.products, ',') AS p
   CROSS  APPLY intlist_to_tbl(o.quantities, ',') AS q
   CROSS  APPLY intlist_to_tbl(o.prices, ',') AS c 
   WHERE  p.listpos = q.listpos
     AND  p.listpos = c.listpos

ALTER TABLE orders DROP COLUMN products, quantities, prices

To pair the values from the lists, we synchronise them on the listpos column.

Say now that you need to update one of the values in the list. Well, didn't I tell you: you need to change the design so that the delimited list becomes a child table? But, OK, you are stuck with the design, so what do you do? Answer: you will have to unpack the data into a temp table, perform your updates and then reconstruct the list(s). As I said, relational databases are not designed for this pattern.

Here is how you would rebuild a list if you are on SQL 2017 or later. For brevity, I only show how to build the products column. The other two are left as an exercise to the reader.

SELECT orderid, string_agg(prodid, ',') WITHIN GROUP (ORDER BY prodid)
FROM   orderdetails
GROUP  BY orderid

The string_agg function is an aggregate function like SUM or COUNT and it builds a concatenated list of all the input values delimited by the string you specify in the second parameter. The WITHIN GROUP clause permits you to specify the order of the list.
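If you also need to write the rebuilt list back, for instance after updating orderdetails, you can combine string_agg with UPDATE. This sketch assumes that the products column has not yet been dropped from orders:

```sql
-- Rebuild the products column from the child table. This only works
-- while the delimited columns are still present on orders.
UPDATE o
SET    products = d.products
FROM   orders o
JOIN  (SELECT orderid,
              string_agg(prodid, ',') WITHIN GROUP (ORDER BY prodid) AS products
       FROM   orderdetails
       GROUP  BY orderid) AS d ON d.orderid = o.orderid
```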

If you are on SQL 2016 or earlier, you can use FOR XML PATH which is a more roundabout way and the syntax is not very intuitive:

; WITH CTE AS (
   SELECT orderid, p.products.value('.', 'nvarchar(MAX)') AS products
   FROM   orders o
   CROSS  APPLY (SELECT od.prodid + ','
                 FROM   orderdetails od
                 WHERE  o.orderid = od.orderid
                 ORDER  BY od.prodid
                 FOR XML PATH(''), TYPE) AS p(products)
)
SELECT orderid, substring(products, 1, len(products) - 1)
FROM   CTE

I refrain from trying to explain how it works. Just try to mimic the pattern if you need this. Or redesign the table after all...

Performance Tip

As long as you have only a handful of elements, the method you use to crack the list does not have any significant impact on performance. What is more important is how you get the values from the list into the rest of the query. For simplicity's sake, I have shown you examples like this:

SELECT ...
FROM   tbl 
WHERE  col IN (SELECT n FROM intlist_to_tbl('1,2,3,4', ','))

However, the optimizer has a hard time coming up with the best plan, since it does not know much about what is coming out of the function. This gets more pronounced if the query is complex and includes a couple of joins and whatnot. This can result in poor performance, because the optimizer settles on a table scan where it should use an index, or vice versa. This applies regardless of whether you use your own function, string_split, or something with XML or JSON.

For this reason, I recommend that you unpack the list of values into a temp table and then use that temp table in your query like this:

CREATE TABLE #values (n int NOT NULL PRIMARY KEY)
INSERT #values(n)
   SELECT n FROM intlist_to_tbl('1,2,3,4', ',')

SELECT ...
FROM   tbl 
WHERE  col IN (SELECT n FROM #values)

Because a temp table has statistics, the optimizer has knowledge about the actual values in the list, and therefore the chances are better that it will come up with a good plan.

Revision History

2021-02-27
Revised the section What If You Cannot Use a Function? after useful input from Yitzhak Khabinsky.
2021-02-15
Fixed an embarrassing bug in the functions intlist_to_tbl and strlist_to_tbl. They would get stuck in an infinite loop if the delimiter was a space.
2021-01-09
Revised the article to about twice its previous length to include more example functions and solutions for situations where you cannot create functions. I also added some quick examples of the reverse operation: building comma-separated lists from table values.
2018-08-26
Added a section Performance Tip about the most important performance aspect when using list-to-table functions.
2016-08-21
Previously this article was just a contents holder that pointed to the various other articles I have on the topic. It is now a short introduction for people who are in need of a quick solution and don't need (and should not) read the long article with all the performance considerations etc.
