An SQL text by Erland Sommarskog, SQL Server MVP. Latest revision: 2010-01-06.
Note: this article covers SQL 2000, SQL 7 and SQL 6.5. There are two more article in this series: Arrays and Lists in SQL Server 2005 and Beyond and Arrays and Lists in SQL 2008 (Using Table-Valued Parameters).
In the public forums for SQL Server, you often see people asking How do I
use arrays in SQL Server? Or Why does
SELECT * FROM tbl WHERE
col IN (@list) not work? This text describes several methods to do
this, both good and bad ones. I also present data from performance tests of
the various methods.
This is a quite a long article, and if all you want to know is how to handle a comma-separated list, then just click one these three quick links depending on which version of SQL Server you are using: SQL2000, SQL7, SQL6.5. If that answer does not meet your needs, come back and read from the beginning. :-)
If you feel deterred by the sheer length of this article, you should be relieved to know that this is the kind of article where you may come and go as you please. If you are a plain SQL programmer who want to know "how do I?", you may find the section on performance data be too much of nitty-gritty details. On the other hand, true SQL buffs who are curious about the performance numbers, might find the explanations of the methods a little tedious and may prefer to skim these parts.
Here is a table of contents:Introduction
You have a number of key values, identifying a couple of rows in a table, and you want to retrieve these rows. If you are composing the SELECT statement in the client code, you might have some code that looks like this:
SQL = "SELECT ProductID, ProductName FROM Northwind..Products " & _ "WHERE ProductID IN (" & List & ")" rs = cmd.Execute(SQL)
List is here a variable which you somewhere have assigned a string value of a comma-separated list, for instance "9, 12, 27, 39".
Now you want to use stored procedures. However, you don't seem to find that any apparent way of doing this. Some people try with:
CREATE PROCEDURE get_product_names @ids varchar(50) AS SELECT ProductID, ProductName FROM Northwind..Products WHERE ProductID IN (@ids)
Which they then try to call:
EXEC get_product_names '9, 12, 27, 37'
But this fails with:
Server: Msg 245, Level 16, State 1, Procedure get_product_names, Line 2 Syntax error converting the varchar value '9, 12, 27, 37' to a column of data type int.
This fails, because we are no longer composing an SQL statement dynamically, and @ids is just one value in the IN clause. An IN clause could also read:
... WHERE col IN (@a, @b, @c)
Others that run into the problem say to themselves: If T-SQL has arrays like any other normal language, then this is not a problem. Well, T-SQL does have arrays. They are called tables, and for all matters they are much more general in nature than arrays. If all you want is plain integer-indexed array with a single value in each cell, this is easily emulated in a table. However, you cannot pass a table as a parameter to a stored procedure. When one stored procedure calls another stored procedure this is not much of a problem, because they can agree on a common table to use for their interchange. This is not possible when you call a stored procedure from a client. But: a client can pass a string, and in SQL you can unpack that string into a table, and most of the methods I describe in this article use some technique to do this.
There are several possible solutions to this problem that can be implemented more or less generically. I present the methods I know of. Here is a summary of the methods that we shall look at, divided into two groups:
The good ones:
If you want a shorter review of the various methods, SQL Server MVP Anith Sen has a good summary on his web site at http://www.projectdmx.com/tsql/sqlarrays.aspx. His assessment of the methods does not always agree with mine, though.
Before I proceed to the various methods, I like to cover two general issues that both are related to performance.
Note: the corresponding section of the article for SQL 2005 is longer has more information of which most is applicable to SQL 2000 as well, so you may be interested in reading that part, even if you are on SQL 2000.
Most of these methods can be implemented as a table-valued functions or stored procedures. This is very nice, because it permits you to pick a solution and write one or two functions to put in your database, and then you can use it all over the place. Here is an example of how you would use such a function:
CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN charlist_to_table(@customers) s ON C.CustomerID = s.str go EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'
If you are on SQL7 you would have to use a stored procedure instead. Please see the SQL7 section for details on the technique.
Now, how should the interface of a function like charlist_to_table be like? The input parameter should be a long text type, preferably text or ntext, so that the function can handle input of any size. But some methods use T-SQL functions that do not support text/ntext, in which case a varchar(8000) or nvarchar(4000) will have to do. You may also prefer to add a parameter that permits you to define another delimiter than comma or similar type of usability parameters. You will occasionally see such parameters in the examples in this text.
How should the table returned from the function look like? Obviously, it should include a column which holds the value of the list elements, and we will return to this in a minute. Sometimes you may also want a column which indicates the position in the list. Such a column is easily achieved with some of the methods I present, but are very difficult to achieve with others.
OK, list-element value, but which data type? You could write a function for every possible data type in SQL Server. In this document I only discuss lists of integers and strings, because it is my assumption that this covers 99% of the cases.
If you have a list of integers, you will have to convert from string to integer somewhere, because the input is text. It may be a matter of taste whether you want to write a function that returns a table of integers, so you can join directly with the table-function column, or whether you prefer to write:
CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN charlist_to_table(@ids) i ON P.ProductID = convert(int, i.str) go EXEC get_product_names_tblnum '9, 12, 27, 37'
If you return a list of strings, you may think there is no conversion issue, but watch out! SQL Server has two families of string types: The 8-bit char/varchar/text data types and the 16-bit nchar/nvarchar/ntext data types for Unicode. Which of these should your function return? You may think that returning an nvarchar(4000) column is the best, because then you will get the right result with both 8-bit and 16-bit data, but not so fast! The answer is that if you are to join with a char/varchar column, your function must return varchar and if you join with an nchar/nvarchar column you must return nvarchar. The reason for this is that if you say:
SELECT * FROM tbl JOIN charlist_to_table(@list) c ON tbl.varcharcol = c.nvarcharcol
tbl.varcharcol will be implicitly converted to nvarchar and this prevents SQL Server from using any index on tbl.varcharcol. This can lead to a significant degrade in performance.
There are two ways to address this. One is to write two functions: one that returns varchar(8000) and one that returns nvarchar(4000). The alternative is to write one function that accepts ntext as input and returns two columns, one varchar and one nvarchar column. For some methods, there is some minor performance cost (5-10%) for the latter arrangement, whereas other methods do not seem to take a toll at all. You will see examples of both strategies in this text.
Even if you have the data types correct, there is another potential problem with joining directly to a table-valued function (or in case of XML to OPENXML). The optimizer has no information of what is in that table beforehand, and will have to build the query plan from general assumptions. In many situations this will work alright, but sometimes this will work better:
CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS CREATE TABLE #temp (id int NOT NULL) INSERT TABLE #temp (id) SELECT convert(int, i.str) FROM charlist_to_table(@ids) SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN #temp t ON P.ProductID = t.id go
The reason for this is that a temp table has statistics, so SQL Server may recompile the procedure prior to executing the SELECT statement and use the statistics about the temp table to build a better query plan for the SELECT statement. On the other hand, there is a cost for the recompilation, so you may as well lose performance with using a temp table. As a general rule, only consider a temp table, if you find that joining directly to the function gives you bad performance. (One reader suggested using a table variable instead. Since a table variable does not have statistics, it should not make much difference as to join directly to a table-valued function.)
If all you want is a solution and then continue with your coding, this is my suggestion for your pick. There are faster methods than this one, but you need to have very long input to find performance for this method to be intolerable. Where the iterative method wins is simplicity. You just load the functions and then you are on the air. The faster methods require a help table with correct data to work. While not a major issue, it's one more component to keep track of. Also, the code for the iterative method is very easy to understand, not the least if you have a background with C or Visual Basic. This is not the least important if you have an input list with a special format that you need to adapt the code for.
Here is you would use such a function in SQL 2000 for a list of integers.
CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN iter_intlist_to_table(@ids) i ON P.ProductID = i.number go EXEC get_product_names_iter '9 12 27 37'
Yeah, that's right, that's not a comma-separated list. I figured that for a list of integers, the comma does not really have any function, so I opted to use space as separator instead when I wrote this function. Here is the code for iter_intlist_to_table:
CREATE FUNCTION iter_intlist_to_table (@list ntext) RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL, number int NOT NULL) AS BEGIN DECLARE @pos int, @textpos int, @chunklen smallint, @str nvarchar(4000), @tmpstr nvarchar(4000), @leftover nvarchar(4000) SET @textpos = 1 SET @leftover = '' WHILE @textpos <= datalength(@list) / 2 BEGIN SET @chunklen = 4000 - datalength(@leftover) / 2 SET @tmpstr = ltrim(@leftover + substring(@list, @textpos, @chunklen)) SET @textpos = @textpos + @chunklen SET @pos = charindex(' ', @tmpstr) WHILE @pos > 0 BEGIN SET @str = substring(@tmpstr, 1, @pos - 1) INSERT @tbl (number) VALUES(convert(int, @str)) SET @tmpstr = ltrim(substring(@tmpstr, @pos + 1, len(@tmpstr))) SET @pos = charindex(' ', @tmpstr) END SET @leftover = @tmpstr END IF ltrim(rtrim(@leftover)) <> '' INSERT @tbl (number) VALUES(convert(int, @leftover)) RETURN END
The table returned from this function has two columns. number is list element, and listpos the position of the number in the list.
Since this function only deals with numbers, it may surprise you that I am using the Unicode string types, ntext and nvarchar. But I found that I got some 10% better performance with them than the 8-bit data types. Exactly why I don't know, but my guess is that charindex is internally implemented un Unicode, so the 8-bit data types require conversion.
The iteration is a two-level loop. This is because the input parameter is of the type ntext to permit unlimited input. Not all string functions functions in T-SQL accept text/ntext as parameters (and some that do, do not operate beyond the limit of varchar/nvarchar). Therefore, I chop up the input string in chunks of 4000 characters, before I move on to the main loop where I use charindex to search for space which is my delimiter. If you prefer comma, you can easily change that. When the inner loop is completed, there may be some leftover piece that should be joined to the start of the next chunk. And finally, when the last chunk is done, I pick up the last element in the list.
You might notice that I use both datalength and len. These are two functions that are deceivingly similar, but you need to apply them with care. Both return the length of a string, but datalength includes trailing spaces whereas len does not. More importantly, datalength is the only to support the text/ntext data types. datalength counts bytes, and len counts characters, which is why I divide the return value from datalength by 2, but not the result from len.
The function is naïve enough to assume that @list only includes numbers. If this is not the case, you will get a conversion error, and the batch will be aborted. If you create the list from internal identifiers in client code, you may feel confident that the list is valid, and this is not much of an issue for you. But what if you actually want to apply error-checking? For instance, what if you are a DBA, and you don't trust your client programmers? You can add code to verify that @str is a legal number, but if the test fails, you cannot use RAISERROR since you are in a function. The best you can do is to skip the illegal list item, or insert NULL (in which case you must change the return table to permit NULL). If you would like raise an error, then you need to implement the method in a stored procedure instead. You find an example in the SQL7 section.
Here is a similar function, but that returns a table of strings.
CREATE FUNCTION iter_charlist_to_table (@list ntext, @delimiter nchar(1) = N',') RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL, str varchar(4000), nstr nvarchar(2000)) AS BEGIN DECLARE @pos int, @textpos int, @chunklen smallint, @tmpstr nvarchar(4000), @leftover nvarchar(4000), @tmpval nvarchar(4000) SET @textpos = 1 SET @leftover = '' WHILE @textpos <= datalength(@list) / 2 BEGIN SET @chunklen = 4000 - datalength(@leftover) / 2 SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen) SET @textpos = @textpos + @chunklen SET @pos = charindex(@delimiter, @tmpstr) WHILE @pos > 0 BEGIN SET @tmpval = ltrim(rtrim(left(@tmpstr, @pos - 1))) INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval) SET @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr)) SET @pos = charindex(@delimiter, @tmpstr) END SET @leftover = @tmpstr END INSERT @tbl(str, nstr) VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover))) RETURN END
Here is an example on how you would use the function:
CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN iter_charlist_to_table(@customers, DEFAULT) s ON C.CustomerID = s.nstr go EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'
This function is very similar to iter_intlist_to_table. There is one difference in the interface: you can specify which delimiter to use. There is a default value for the delimiter, but even if you are satisfied with it, you must specify DEFAULT when you call the function, because T-SQL does not permit you to leave to a function argument out completely. No matter which delimiter you use, leading and trailing blanks are stripped from the resulting strings.
The return table has two columns beside listpos, namely str and nstr which both contain the list elements, one is varchar and the other is nvarchar, and you should use the column that matches the column you are joining with. Since Northwind..Customers.CustomerID is nchar(10), nstr is the choice in the example. Do not use nstr when you join with a varchar column, because this can give very bad performance! (I discuss this in the section General Interface Considerations above.) You may think that this extra column comes with a performance cost, but I was not able to detect this when I ran my performance tests.
One thing which is nice with this solution, is that it is fairly extensible. Say that your input string looks like something like this:
"Yes, I want to", "Charlie", "Give it to me"
That is the list items are enclosed by quotes, and the list delimiter can appear in the strings. The function above does not handle this format, but writing a function that handles this sort of input is no big deal.
The fastest way to unpack a comma-separated list is to use a table numbers. A table of numbers is simply a table with one single integer column, with the values 1, 2, 3, ... Here is an easy way to fill in such a table:
SELECT TOP 8000 Number = IDENTITY(int, 1, 1) INTO Numbers FROM pubs..authors t1, pubs..authors t2, pubs..authors t3
I first show you a version that is limited to an input string of 7998 characters to demonstrate the core algorithm. I culled this function from SQL Server MVP Anith Sen's web site.
CREATE FUNCTION inline_split_me (@param varchar(7998)) RETURNS TABLE AS RETURN(SELECT substring(',' + @param + ',', Number + 1, charindex(',', ',' + @param + ',', Number + 1) - Number - 1) AS Value FROM Numbers WHERE Number <= len(',' + @param + ',') - 1 AND substring(',' + @param + ',', Number, 1) = ',')
While the iterative solution was somewhat long-winding but straightforward,
this approach is compact, but not all readers may grasp this SQL in the first
go. (I had a hard time myself.) The first thing to observe is the expression
',' + @param + ',' which reoccurs no less than four times.
By adding the delimiter in beginning and at the end of the string, the first
and last items in the string appear in the same context as all other items.
Next, let's turn to the WHERE clause. The expression:
substring(',' + @param + ',', Number, 1) = ','
evaluates to TRUE for all positions in the string where the delimiter appears. The expression:
Number <= len(',' + @param + ',') - 1
simply sets an upper limit of which numbers we are using.
Let's now look at the SELECTed column Value. We extract the strings with substring, starting on the position after the delimiter. We find the length of the substring by searching for the next delimiter with charindex, taking use of its third parameter which tells charindex where to start searching. Once the next delimiter is found, we subtract the position for the current delimiter and then one more, as the delimiter itself should not be included in the extracted string, to get the length of the list item.
Here is an example on how to use this particular function:
CREATE PROCEDURE get_company_names_inline @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN inline_split_me(@customers) s ON C.CustomerID = s.Value go EXEC get_company_names_inline 'ALFKI,BONAP,CACTU,FRANK'
Note here that, since this function does not strip leading and trailing blanks, there must be no spaces in the list.
This function is an inline function, in difference from the iterative functions which are multi-step functions. An inline function is basically a macro, so that when the optimizer builds the query plan, it inserts the text of the function into the query, making further optimizations possible. I find it difficult to believe though, that for this particular function that you will get much benefit from the function being inline. The function is too convoluted for that, and I have not been able to detect any considerable advantages in my performance tests.
The function inline_split_me limits the input to 7998 characters (or
3998 for nvarchar). It is not possible to use text or ntext
for the input parameter, because you cannot use text/ntext parameters
in expressions such as
',' + @param + ','.
Even more importantly, charindex does not handle text/ntext
data beyond the limit of varchar/nvarchar.
However, this is no fatal obstacle. As with the iterative functions we can break up the text in chunks. Here is a function that does this:
CREATE FUNCTION duo_text_split_me(@list ntext, @delim nchar(1) = N',') RETURNS @t TABLE (str varchar(4000), nstr nvarchar(2000)) AS BEGIN DECLARE @slices TABLE (slice nvarchar(4000) NOT NULL) DECLARE @slice nvarchar(4000), @textpos int, @maxlen int, @stoppos int SELECT @textpos = 1, @maxlen = 4000 - 2 WHILE datalength(@list) / 2 - (@textpos - 1) >= @maxlen BEGIN SELECT @slice = substring(@list, @textpos, @maxlen) SELECT @stoppos = @maxlen - charindex(@delim, reverse(@slice)) INSERT @slices (slice) VALUES (@delim + left(@slice, @stoppos) + @delim) SELECT @textpos = @textpos - 1 + @stoppos + 2 -- On the other side of the comma. END INSERT @slices (slice) VALUES (@delim + substring(@list, @textpos, @maxlen) + @delim) INSERT @t (str, nstr) SELECT str, str FROM (SELECT str = ltrim(rtrim(substring(s.slice, N.Number + 1, charindex(@delim, s.slice, N.Number + 1) - N.Number - 1))) FROM Numbers N JOIN @slices s ON N.Number <= len(s.slice) - 1 AND substring(s.slice, N.Number, 1) = @delim) AS x RETURN END
We first split up the text in slices that we put in the table variable @slices. When we do this, we need to apply some care, so that the last character in a slice is a delimiter. To find the last delimiter in a chunk that we chopped off from the input parameter, we feed charindex the result of the reverse function, neat little trick. While we're at it, we make sure that all rows in @slices start and end with a delimiter. You may note that if the input text is within the limits of a regular nvarchar, we will never enter the loop, but just insert the text directly in to the @slices table.
Once @slices is populated, we apply the core expression. Note here that we do not need to iterate over @slices; we can join directly with Numbers. We use a derived table, to save us from repeating the complex expression with substring and charindex, and which we now have augmented with trim functions to remove leading and trailing spaces. (A derived table is a table constructed by a SELECT expression in the middle of a query, this is a very very useful tool when building complex queries. The alias AS x may not seem to serve a purpose, but the SQL syntax mandates that a derived table must have an alias.)
As with iter_charlist_to_table, this function returns a table with both a varchar and an nvarchar column. In this case, though, my tests indicate that there is a cost of 3-5% over returning a table with only a nvarchar column. Since we are talking execution times in milliseconds, this cost is not likely to have any importance. The performance data that I present, is from a function that only returns an nvarchar column, though.
There is no listpos column here. This is somewhat more difficult to achieve with this method. One way is to add a sliceno column to @slices and an IDENTITY column to the result table, and then order by sliceno and Number when selecting into the result table. However, it is not fully clear whether you can rely on that the IDENTITY values are actually assigned according to the ORDER BY clause. A more reliable way is to join Numbers with itself, as shown by an example on Anith Sen's web site.
You might now be waiting for a function that unpacks a list of numbers, but I am not presenting such a creature. You could write a new function which uses convert in the right places, but you can also do like this:
CREATE PROCEDURE get_product_names_tblnum @ids varchar(50) AS SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN duo_text_split_me(@ids, DEFAULT) i ON P.ProductID = convert(int, i.str) go EXEC get_product_names_tblnum '9, 12, 27, 37'
In this text I focus on the case that the list you like to process is a single list that comes from a client. Occasionally you might stumble on a table contains one or more columns with comma-separated lists. For instance:
CREATE TABLE jackets (model varchar(30), sizes varchar(200), colours varchar(200))
Here sizes and colours contain comma-separated lists of the sizes and colours the jackets are available in. This database design violates the first normal form, and let me stress that, in the very most cases, this is a very poor design. In any case, say that you need to unpack these lists into tables. You can of course use any of the functions that we have looked at here, but you would need to iterate over the table, and unpack row by row, because you cannot pass a table column as a parameter to a table-valued function. As you may know, iterating over a table can be magnitudes slower than applying a set-based operation on all rows in one statement, so therefore it may be a better idea to join directly with the Numbers table. I'm not giving an example here, but you saw the way to do it, in the code for duo_text_split_me, when we joined with the @slices table.
This is a method that was proposed by SQL Server MVP Steve Kass, inspired by an idea in Ken Henderson's book The Guru's Guide to Transact-SQL.
This basic idea is that rather than defining the array as a delimited string, the list elements have fixed length. An advantage from a usability point of view is that you don't have to worry about a delimiter appearing in the data. But the main reason for this approach is performance. This is the fastest of all methods presented here (but see the caveats below). Here is a demonstration of the technique, applied directly on a table:
CREATE PROCEDURE get_product_names_fix @ids varchar(8000), @itemlen tinyint AS SELECT P.ProductID, P.ProductName FROM Northwind..Products P JOIN Numbers n ON P.ProductID = convert(int, substring(@ids, @itemlen * (n.Number - 1) + 1, @itemlen)) AND n.Number <= len(@ids) / @itemlen GO EXEC get_product_names_fix ' 9 12 27 37', 4
Each element in the "array" has the same length, as specified by the parameter @itemlen. We use the substring function to extract each individual element. The table Numbers that appears here, is the same table that we created in the beginning of the section Using a table of Numbers.
Here is a function that embeds the unpacking of the string.
CREATE FUNCTION fixstring_single(@str text, @itemlen tinyint) RETURNS TABLE AS RETURN(SELECT listpos = n.Number, str = substring(@str, @itemlen * (n.Number - 1) + 1, @itemlen) FROM Numbers n WHERE n.Number <= datalength(@str) / @itemlen + CASE datalength(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END)
The purpose of the expression on the last line is to permit the last element in the array to be shorter than the others, in case trailing blanks have been stripped. Here is an example using fixstring_single with a list of strings:
CREATE PROCEDURE get_company_names_fix @customers nvarchar(2000) AS SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN fixstring_single(@customers, 6) s ON C.CustomerID = s.str go EXEC get_company_names_fix 'ALFKI BONAP CACTU FRANK'
This function uses text for the input parameter, and returns a varchar column, so for Unicode data you would need a twin function that accepts ntext for input and nvarchar for return data. For other methods in this article, I've opted to return varchar and nvarchar from the same function, but this has a bigger overhead for this method (because it's so fast). Beware that you don't use a Unicode function to join with a char/varchar column; this can wreck performance completely, see General Interface Considerations.
The result set also includes the listpos column, which gives the position of the item in the array. Very conveniently, this is the same as the number from the Numbers table.
fixstring_single handles input with as many elements as there are numbers in Numbers. The number I have used in this article, 8000, is likely to be sufficient for most applications. If you need to support more elements, you can just add more numbers to Numbers. If you really feel that you want to support unlimited input, you could write a multi-step function that chops up the input string in slices. Here's another twist, that Steve Kass came up with. This is a function that uses a self-join, so that with 8000 rows in Numbers fixstring_multi handles up to 64 million list items:
CREATE FUNCTION fixstring_multi(@str text, @itemlen tinyint) RETURNS TABLE AS RETURN(SELECT listpos = n1.Number + m.maxnum * (n2.Number - 1), str = substring(@str, @itemlen * (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1, @itemlen) FROM Numbers n1 CROSS JOIN (SELECT maxnum = MAX(Number) FROM Numbers) AS m JOIN Numbers n2 ON @itemlen * (n1.Number + m.maxnum * (n2.Number - 1) - 1) + 1 <= datalength(@str) WHERE n2.Number <= datalength(@str) / (m.maxnum * @itemlen) + 1 AND n1.Number <= CASE WHEN datalength(@str) / @itemlen <= m.maxnum THEN datalength(@str) / @itemlen + CASE datalength(@str) % @itemlen WHEN 0 THEN 0 ELSE 1 END ELSE m.maxnum END)
This is a more complex function that fixstring_single. I leave it as an exercise to the reader to understand what's going on, and I only make a note about the line with CROSS JOIN: this saves me from hard-coding the number of rows in Numbers.
Since what is special with this method is the input format, not the algorithm itself, you could also use this input format with the iterative method. This might appeal to you, if the table of numbers makes you nervous. (If the table does not have the numbers as advertised, methods relying on it will yield incorrect result.) However, the performance benefit of using the iterative method with fixed-length input over delimited input is not extraordinary, and it is slower than using a table with numbers and delimited input. I'm not including an example of an iterative function using fixed input format, but in the full test data, you can see how well it fares.
If you look in the performance section below, you will see that this method outsmarts everything else. But there are a few caveats you should keep in mind, of which the last one is the most important.
First caveat: The fixed format is more sensitive to disruptions. One extra character somewhere, and the the rest of the string is completely wrecked. When you compose a string in an application, this is not so much of an issue, once you have gotten it right. This leads, however, to the
Second caveat: The cost for composing the fixed-length may take more programming power in the client than composing a comma-separated list. I have not considered this for my performance tests, as it would have made those tests far more complex. However, consider these two points: 1) Client languages are generally better in string handling than SQL. 2) Moving execution from the server to the client means that the overall system scales better. So even if there is a certain toll on the client side, it is not likely to overshadow the gain on the server side.
Third caveat: The strings you pass from client to server when using fixed length are longer, so there is more network overhead. This could be an issue over slow links or on a crowded network. But it is not very likely that this is something to worry about. However, the observation on size leads to the
Fourth and most important caveat: It matters how you invoke a procedure using this method. My test script for the performance tests uses RPC (remote procedure call), and this is the recommended way to call a stored procedure from a client program. (For instance, if you use adCmdStoredProcedure for the CommandType on the Command object in ADO, you use RPC.) The other way to call a stored procedure is to send a command-text batch with an EXEC statement, which is what you do in Query Analyzer. I found that for a certain length of the input string, this method is actually somewhat slower than using a comma-separated list with the table-of-numbers function duo_text_split_me. In fact, I had an alternate version of fixstring_multi which was somewhat faster than the one above as long as I invoked it through RPC. But when I invoked it in a command batch, the response time was several seconds, far beyond what is acceptable. Why this is so, I don't know. But the string size surely has something to do with it. I have not examined at which size the effects sets in, but it might be around 8000, that is the limit for a regular varchar.
XML has emerged as a standard for data interchange in recent years. Just like HTML, XML is derived from the general SGML standard, so the visual appearance of XML and HTML is similar. But there are important differences. Unlike HTML, XML is case-sensitive and a single error in an XML document invalidates the document. The purpose of XML is not presentation, but transport of structured data. Microsoft added extensive support for XML in SQL Server 2000, and what interests us here is the OPENXML function which extracts data from an XML string and returns the data as a table.
This opens for the possibility to encode a list of values as an XML string, and then extract the values with OPENXML in T-SQL. But to be blunt: this is not the right way to do it. Not because of performance: while slower than any other method we have looked at so far, performance is still perfectly acceptable for most situations. No, what disqualifies XML is simply that it is too complex for the task. Unless, of course, you already have an XML document with the values you want to use. But to build an XML string only to mimic an array is overkill in my opinion.
But: if you need to send SQL Server an array of structured data – which even might be nested – to insert data into one or more tables, XML is truly a great asset.
Let's nevertheless first look at the case of a simple list. This is how the XML version of get_product_names looks like:
CREATE PROCEDURE get_product_names_xml @ids ntext AS DECLARE @idoc int, @err int EXEC @err = sp_xml_preparedocument @idoc OUTPUT, @ids SELECT @err = @@error + coalesce(@err, 4711) IF @err <> 0 RETURN @err SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN OPENXML(@idoc, '/Root/Num', 1) WITH (num int) AS a ON P.ProductID = a.num EXEC sp_xml_removedocument @idoc go EXEC get_product_names_xml N'<Root><Num num="9"/><Num num="12"/> <Num num="27"/><Num num="37"/></Root>
The important thing to notice here is that we first call sp_xml_prepraredocument which causes SQL Server to build an internal representation of the XML string. @idoc is a reference to that representation, which we use when we invoke OPENXML, which is kind of a table function, although the exact term used by Books Online is rowset provider. Before we exit the routine, we call sp_xml_removedocument to free up the memory allocated for the XML document. Failing to do this can lead to a memory leaks which eventually can prevent sp_xml_preparedocument from completing successfully. (By the way, while it is good habit to check return values from stored procedures, don't do this for sp_xml_removedocument. It always returns 1, which is a known bug. Check @@error instead.)
When you compose an XML string, you need to be careful, because there are several characters such as <, & and " serves as meta characters in XML, so you need to encode them in the same manner as in HTML. The same applies to non-ASCII data if you are not using Unicode. Also, you need to keep in mind that XML is very strict on that all tags must be closed. Thus, you are probably better off using some library routine to compose the XML string. This is not really my area, but MS XML seems to offer some methods for this, as does the XmlTextWriter class in the .Net Framework.
Where the power of XML comes into full play is when you have a bigger set of data that you want to insert into SQL Server, for instance orders and their order details. Before the advent of XML in SQL Server, you did not have much better choice than to call a stored procedure for each row to insert. (There has always been bulk-copy, but this is far from always a convenient solution.) With XML, you can compose an XML document of all that data, and you have a stored procedure which calls sp_xml_prepredocument once, and then you invoke OPENXML twice: first for the orders, and next for the order details. Replacing thousands of calls to stored procedures over the network with a single one, this can give a huge performance benefit.
I'm not including an example of this, as it would take up some space, and this article is already far too long. Rather I leave this as a teaser. If you are completely unacquainted with XML and using with XML with SQL Server, you might consider picking up a book or two on the subject. You might also find some useful information on SQL Server MVP Bryant Likes's site http://www.sqlxml.org.
Note: those who know XML, know that the above is attribute-centred XML. An alternative way to compose the XML string is to use element-centred XML (or a mix of the two). I'm not including any example of element-centred XML, as it there is little practical difference for unpacking a list, neither is there in terms of performance.
Occasionally I have seen the suggestion that you should have a stored procedure that accepts a delimited string as parameter, and which uses the replace function in T-SQL to convert this string to XML before passing it to OPENXML. In my opinion, this is a bad idea for several reasons:
For a list of numbers, this method may seem deceivingly palatable in its simplicity:
CREATE PROCEDURE get_product_names_exec @ids varchar(255) AS EXEC('SELECT ProductName, ProductID FROM Northwind..Products WHERE ProductID IN (' + @ids + ')') go EXEC get_product_names_exec '9, 12, 27, 37'
This example looks very similar to the client-code example in the beginning of this article. And in fact, this method is just a variant of sending down SQL statements from the client and it has the same issues, which we shall look at in a moment. First, though, let's look at using this method with a list of strings, to see that in this case the method is not equally attractive:
CREATE PROCEDURE get_company_names_exec @customers nvarchar(2000) AS EXEC('SELECT CustomerID, CompanyName FROM Northwind..Customers WHERE CustomerID IN (' + @customers + ')') go EXEC get_company_names_exec '''ALFKI'', ''BONAP'', ''CACTU'', ''FRANK'''
The nested quotes make procedures using this method difficult to call.
So what are the issues with this method?
For further details on dynamic SQL, see my web article, The Curse and Blessings of Dynamic SQL.
Often when people ask how to handle comma-separated lists in the SQL newsgroups, dynamic SQL is presented as a solution. Therefore, I like to stress that this is a poor solution, not the least when there are nice and simple generic methods for converting a list to a table. So if you see this advice on the newsgroups, ignore it. And if you have the habit of giving it, please stop doing it.
There is one exception: on SQL Server 6.5 the other methods can only support lists that fit into varchar(255), so dynamic SQL is probably the only viable solution there.
In this section I've collected methods where you transform the list into one or more SELECT statement which you execute with dynamic SQL. Here is the best of these three methods:
CREATE PROCEDURE unpack_without_union @list varchar(8000), @tbl varchar(30), @delimiter char(1) = ',' AS DECLARE @sql varchar(8000), @select varchar(8000) SET @select = 'SELECT ' + REPLACE(@list, @delimiter, ' SELECT ') SET @select = REPLACE(@select, '''', '''''') SET @sql = 'INSERT ' + @tbl + ' EXEC(''' + @select + ''')' EXEC (@sql)
This method builds on idea suggested by Jim Ebbers. Since the code is somewhat obscure with its double use of dynamic SQL, before I explain it, I directly proceed to a method suggested by SQL Server MVP Steve Kass, which is easier to understand, albeit not as good:
CREATE PROCEDURE unpack_with_union @list varchar(8000), @tbl varchar(30), @delimiter char(1) = ',' AS DECLARE @sql varchar(8000) SET @sql = 'INSERT INTO ' + @tbl + ' SELECT ''' + REPLACE(@list, @delimiter, ''' UNION ALL SELECT ''') + '''' EXEC (@sql)
The idea is that you transform the list to a SELECT statement with help of the UNION ALL operator. (Only UNION removes duplicates, UNION ALL includes them.) Then you use dynamic SQL to execute that statement and insert the data into a temp table, of which the name is passed to the procedure. Since the the dynamic SQL only refers to a temp table, there is no permissions issue here.
In the first method, I have made use of what Jim made me aware of: If you say
INSERT tbl EXEC('SELECT 1 SELECT 2 SELECT 3'), INSERT will
handle all those result sets from EXEC(), as if it was one and only. Notice
that I also cater for the fact that there may be single quotes in the input.
This is in fact necessary with unpack_with_union as well, but if you
do it once in unpack_with_union, you should double that up in
unpack_without_union. Why, I leave as an exercise to the reader to
Here is how you would use these methods:
CREATE PROCEDURE get_company_names_nounion @customers varchar(8000) AS CREATE TABLE #temp (custid nchar(10) NOT NULL) EXEC unpack_without_union @customers, '#temp' SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN #temp t ON C.CustomerID = t.custid go EXEC get_company_names_nounion 'ALFKI,BONAP,CACTU,FRANK'
These method can possibly get a reward for being cute, but that's about it. Both methods, as presented here, have a fairly low limit on the maximum number of elements, since all the text you splice in takes up space. Jim's method is better than Steve's since he splices in less text. But for a list with 500 elements, that's still 3500 extra characters caused by those SELECT. You could write a version that takes a ntext parameter and breaks it up in slices. I have not pursued this, since my performance tests indicate that these method are slower than OPENXML, which in its turn is slower than the iterative method, and using a table with numbers. Neither is there any other particular advantage with these methods.
When I originally conducted the performance tests for this article, I was not aware of Jim's method, and since then I have changed hardware, why it is not possible to paste in the numbers for Jim's method (code-name NOUNION) with the old data. However, re-testing only OPENXML, UNION and NOUNION, I find that NOUNION is twice the speed of UNION, but still lags behind OPENXML.
SQL Server MVP Anith Sen proposed a similar method, where he simple converts the list to a bunch of INSERT statements:
CREATE PROCEDURE unpack_with_insert @list varchar(8000), @tbl varchar(30), @delimiter char(1) = ',' AS DECLARE @sql varchar(8000) SET @sql = 'INSERT ' + @tbl + ' VALUES (' + REPLACE(@list, ',', ') INSERT ' + @tbl + ' VALUES (') + ')' EXEC (@sql)
This method has the same problems as the other two methods in this section. In my performance tests it fared even worse than UNION.
In a Q&A column of an SQL journal, the following solution was suggested:
CREATE PROCEDURE get_company_names_charindex @customers nvarchar(2000) AS SELECT CustomerID, CompanyName FROM Northwind..Customers WHERE charindex(',' + CustomerID + ',', ',' + @customers + ',') > 0 go EXEC get_company_names_charindex 'ALFKI,BONAP,CACTU,FRANK'
You may recognize the theme from when we used a table of numbers. By adding commas of both sides of the input string, we can use charindex to find ",ALFKI," etc. (Note that you cannot have embedded blanks here.)
The author noted in his column that this method would not have good performance, since embedding the table column in an expression precludes use of any index on that column, leading to a table scan. But that's only a small part of the story. A plain table scan on my test table takes less than 100 ms, once the table is entirely in cache. This method needs 42 seconds, even for my shortest test list of 200 characters and 15 items!
Variations on this theme are illustrated by these WHERE clauses:
WHERE patindex('%,' + CustomerID + ',%', ',' + @customers + ',') > 0 WHERE ',' + @customers + ',' LIKE '%,' + CustomerID + ',%'
The solution with patindex also needs another 42 seconds. The solution with LIKE was actually four times faster on one of my test machines (but exactly as slow as charindex and patindex on the other machines.) Unfortunately, it is still 100 times slower than dynamic SQL, and 200 times slower than the best methods.
If you are using SQL Server 7, you don't have access to user-defined functions and not to XML. Still you can implement the iterative method or use a table of numbers, with comma-separated or fixed-length input, in stored procedures instead. To demonstrate this, I'm including examples for the iterative method here. You could easily apply the same technique for the other two methods.
Here is a stored procedure which is very similar to the function iter_charlist_to_table above. Rather than returning a table variable, the procedure fills in the temp table #strings:
CREATE PROCEDURE charlist_to_table_sp @list ntext, @delimiter char(1) = N',' AS DECLARE @pos int, @textpos int, @chunklen smallint, @tmpstr nvarchar(4000), @leftover nvarchar(4000), @tmpval nvarchar(4000) SET NOCOUNT ON SELECT @textpos = 1, @leftover = '' WHILE @textpos <= datalength(@list) / 2 BEGIN SELECT @chunklen = 4000 - datalength(@leftover) / 2 SELECT @tmpstr = @leftover + substring(@list, @textpos, @chunklen) SELECT @textpos = @textpos + @chunklen SELECT @pos = charindex(@delimiter, @tmpstr) WHILE @pos > 0 BEGIN SELECT @tmpval = left(@tmpstr, @pos - 1) SELECT @tmpval = ltrim(rtrim(@tmpval)) INSERT #strings(str) VALUES (@tmpval) SELECT @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr)) SELECT @pos = charindex(@delimiter, @tmpstr) END SELECT @leftover = @tmpstr END INSERT #strings(str) VALUES(ltrim(rtrim(@leftover)))
And here is an example of how you use it:
CREATE PROCEDURE get_company_names_iterproc @customerids nvarchar(2000) AS CREATE TABLE #strings (str nchar(10) NOT NULL) EXEC charlist_to_table_sp @customerids SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN #strings s ON C.CustomerID = s.str go EXEC get_company_names_iterproc 'ALFKI, BONAP, CACTU, FRANK'
The function for unpacking a list of strings had a dual return table, with both a varchar and an nvarchar column. There is not really any reason for this when you use a procedure. The caller specifies the data type that fits with the kind of data he is working with. In the same manner, the function does not populate any listpos column; if you want to keep track of list position, the caller can include an IDENTITY column in the table.
In an earlier incarnation of this article, I suggested that you could have the table name to be a parameter, and the use dynamic SQL to insert into that table. However, this was poor advice. When I ran my performance tests, I found that the performance penalty for the use of dynamic SQL for each INSERT statement is too big to be acceptable. I discuss this further in the section special observations in the performance tests.
There is however a potential performance problem with the solution above as well. It depends on how you use it, but normally it will be a new temp table each time. In this case, charlist_to_table_sp will be recompiled on each invocation. In many situations, this would not be an issue. In fact, I found in my performance tests that this procedure was somewhat faster the corresponding function to unpack a string into table, despite the recompilation. However, on a busy system with an intensive call frequency you could experience blocking from compile locks, see KB article 263889.
One way to avoid this would be to use a permanent table, looking something like this:
CREATE TABLE stringarray (spid int NOT NULL, listpos int NOT NULL, str varchar(4000) NOT NULL, nstr nvarchar(2000) NOT NULL, CONSTRAINT pk_array PRIMARY KEY (spid, listpos))
For the column spid, you would use @@spid, the process id. A procedure populating this table would as its first statement include:
DELETE stringarray WHERE spid = @@spidAnd a caller would need to remember to include the spid column in all references as in this example:
CREATE PROCEDURE get_company_names_spid @customerids nvarchar(2000) AS EXEC charlist_to_table_spid @customerids SELECT C.CustomerID, C.CompanyName FROM Northwind..Customers C JOIN stringarray s ON C.CustomerID = s.nstr WHERE s.spid = @@spid DELETE stringarray WHERE spid = @@spid -- Housekeeping. go EXEC get_company_names_spid 'ALFKI, BONAP, CACTU, FRANK'
(Follow the link to see the source code of charlist_to_table_spid.)
Yet another alternative would be to have the temp table local to the procedure, and then produce a result set, which the caller can catch with INSERT EXEC. For more details about this technique, and its weaknesses, see my article How to share data between stored procedures.
The technique in the previous section can of course be applied to a list of integers as well, so what comes here is not a true port of the iter_intlist_to_table function, but a version that goes head over heels to validate that the list items are valid numbers to avoid a conversion error. And to be extra ambitious, the procedure permits for signed numbers such as +98 or -83. If a list item is not a legal number, the procedure produces a warning. The procedure fills in a temp table that has a listpos column; this column will show a gap if there is an illegal item in the input.
CREATE PROCEDURE intlist_to_table_sp @list ntext AS DECLARE @pos int, @textpos int, @listpos int, @chunklen smallint, @str nvarchar(4000), @tmpstr nvarchar(4000), @leftover nvarchar(4000) SET NOCOUNT ON SELECT @textpos = 1, @listpos = 1, @leftover = '' WHILE @textpos <= datalength(@list) / 2 BEGIN SELECT @chunklen = 4000 - datalength(@leftover) / 2 SELECT @tmpstr = ltrim(@leftover + substring(@list, @textpos, @chunklen)) SELECT @textpos = @textpos + @chunklen SELECT @pos = charindex(' ', @tmpstr) WHILE @pos > 0 BEGIN SELECT @str = rtrim(ltrim(substring(@tmpstr, 1, @pos - 1))) EXEC insert_str_to_number @str, @listpos SELECT @listpos = @listpos + 1 SELECT @tmpstr = ltrim(substring(@tmpstr, @pos + 1, len(@tmpstr))) SELECT @pos = charindex(' ', @tmpstr) END SELECT @leftover = @tmpstr END IF ltrim(rtrim(@leftover)) <> '' EXEC insert_str_to_number @leftover, @listpos go -- This is a sub-procedure to intlist_to_table_sp CREATE PROCEDURE insert_str_to_number @str nvarchar(200), @listpos int AS DECLARE @number int, @orgstr nvarchar(200), @sign smallint, @decimal decimal(10, 0) SELECT @orgstr = @str IF substring(@str, 1, 1) IN ('-', '+') BEGIN SELECT @sign = CASE substring(@str, 1, 1) WHEN '-' THEN -1 WHEN '+' THEN 1 END SELECT @str = substring(@str, 2, len(@str)) END ELSE SELECT @sign = 1 IF @str LIKE '%[0-9]%' AND @str NOT LIKE '%[^0-9]%' BEGIN IF len(@str) <= 9 SELECT @number = convert(int, @str) ELSE IF len(@str) = 10 BEGIN SELECT @decimal = convert(decimal(10, 0), @str) IF @decimal <= convert(int, 0x7FFFFFFF) SELECT @number = @decimal END END IF @number IS NOT NULL INSERT #numbers (listpos, number) VALUES (@listpos, @sign * @number) ELSE RAISERROR('Warning: at position %d, the string "%s" is not an legal integer', 10, -1, @listpos, @orgstr) go
Here is how you would use it:
CREATE PROCEDURE get_product_names_iterproc @ids varchar(50) AS CREATE TABLE #numbers (listpos int NOT NULL, number int NOT NULL) EXEC intlist_to_table_sp @ids SELECT P.ProductID, P.ProductName FROM Northwind..Products P JOIN #numbers n ON P.ProductID = n.number go EXEC get_product_names_iterproc '9 12 27 37'
The validation of the list item is in the sub-procedure insert_str_to_number. For many purposes it would be sufficient to have the test:
@str NOT LIKE '%[^0-9]%' AND len(@str) BETWEEN 1 AND 9
which checks that @str only contain digits and is at most nine digits long (that is, you disapprove ten-digit numbers as well as signed numbers).
You might guess that there is a performance cost for this extravaganza, and indeed the procedure needs about 50% more time than the corresponding function. Still, for many situations, the execution time is acceptable.
One note about the warning produced with RAISERROR: with ADO, this warning may be difficult or impossible to detect on client level. If you change the severity from 10 to 11, it will be an error, and raise an error in your client code.
If you are on SQL 6.5, you are far more limited in what you can do. The varchar data type is limited to 255 characters, and the substring function does not work on text, so you cannot split up in chunks or extract fixed-length elements. You can still implement stored procedures as for SQL 7, but since the input string would be limited to 255 characters, the risk that you every now and then would pass a list exceeding this limit is too considerable to be neglected.
Therefore, the only method that is robust enough on 6.5 is dynamic SQL. You can still use text to pass a long list as in this example:
CREATE PROCEDURE get_authors_exec @authors text AS EXEC('SELECT au_id, au_lname, au_fname FROM pubs..authors WHERE au_id IN (' + @authors + ')') go EXEC get_authors_exec '''172-32-1176'', ''427-17-2319'', ''724-08-9931'''
Testing performance in a DBMS is not a simple task. There are so many factors that affect performance. In my tests I have tested unpacking the list alone into a table (for the methods where this is possible) and using the list to get data from a table. Merely unpacking tells us something about the method as such, but the great impact comes when you try to use the method to get data from one or more tables, in which case the nature of the table(s) involved (index, statistics, size etc) can have great importance. You may find that some method that I claim to be fast, results in slow query plan for your query. Another issue is how submit your SQL statements to the server. For the fixed-string functions, I found that there was a considerable difference between RPC calls and command-text batches. (All data I present are from calls submitted through RPC.) Yet, another issue what happens when the system comes under load. For the sake of simplicity, I have run the tests on idle machines.
So take these numbers for what they are. They certainly serve to point out some methods as hopelessly bad. But you will find that several of them have response times that are perfectly reasonable when run on their own. If you are worried about performance in a certain context, you should probably run your own benchmarks.
As I have assembled data for this article, I have tested too many methods to include data for all of them in the main body of this text. Some of the methods have just been small variations of each other like returning varchar instead of nvarchar, and I have not even bothered to save data for every little variation. Eventually I have data for 16 test cases. Of these, I have selected the nine to include in the main section of the test results in the article. Yet a few more appear in the section for special observations where I highlight some details. The full data for all 16 methods on the three test machines I have used are available as text files, for which there are links further down in this text.
I have given each method that I've tested a name, and I will use these names as a convenient short-hand notation. All methods appear in the text above; most of them in code examples, but some are only mentioned in passing.
Here are the nine main competitors:
|ITER||The iterative functions: iter_intlist_to_table and iter_charlist_to_table. (Note: the data for iter_charlist_to_table if for an older version. The version presented in this article is some 10% faster.)|
|TBLNUM||A table of numbers: the function text_split_me which is the same as duo_text_split me above, save that it only returns a varchar column.|
|FIX$SINGLE||Fixed-length list elements: the function fixstring_single, joining once with the table of numbers.|
|FIX$MULTI||Fixed-length list elements: fixstring_multi, that joins the table of numbers twice to make it virtually unlimited.|
|XMLATTR||OPENXML, using attribute-centred XML.|
|EXEC$A||Dynamic SQL, when called with a new input string.|
|UNION||Making the list into a SELECT, the procedure unpack_with_union.|
|REALSLOW||The really slow method, using charindex to find the list elements.|
|SLOW$LIKE||Another really slow method, using LIKE to find the list elements.|
And here are the other seven:
|ITER$PROC||The iterative procedures iter_intlist_to_table_sp and iter_charlist_to_table_sp. (Note: the data for iter_charlist_to_table_sp if for an older version. The version presented in this article is some 10% faster.)|
|ITER$EXEC||A variation of iter_charlist_to_table_sp that accepts a table name as parameter and uses dynamic SQL to insert into it.|
|TBLNUM$IL||A table with numbers; the inline function inline_split_me which is limited in input size.|
|FIX$ITER||Fixed-length list elements, unpacked iteratively.|
|XMLELEM||OPENXML, using element-centred XML.|
|EXEC$B||Dynamic SQL, invoked for a second time for the same input string.|
|INSERT||A trick with INSERT, the procedure unpack_with_insert.|
In the following sections I describe how the tests were set up and what the presumptions were. If you find this too detailed, please feel free to jump directly to the results.
This is the table I have used for all tests.
CREATE TABLE usrdictwords (wordno int NOT NULL, word varchar(50) NOT NULL, guid char(36) NOT NULL) CREATE CLUSTERED INDEX wordno_ix ON usrdictwords(wordno) CREATE UNIQUE INDEX word_ix ON usrdictwords(word)
The table contains 202985 rows, and the average length of the column word is 9.7 characters. The longest word is 31 characters. The values in the column wordno are in fact unique; the reason that the index is not unique is simply a mistake that I didn't notice until I had ran all tests. The order of wordno is not correlated with the order of word. The column guid serves to make the table a little wider, and to be a token non-indexed column.
For each test, my test script randomly selected rows from usrdictwords, to construct one comma-separated list each of strings and integers from the columns word and wordno respectively. After each list item the script randomly added 0-3 spaces between the list elements. (The reason for this is that the test script also served to validate that the implementations of the methods yielded the correct result.) The script kept adding to the lists until the list of strings exceeded a pre-determined size. Thus, the integer lists were shorter in length, but the number of list elements was the same.
I used five different list sizes:
|Size||Length||Avg. no of list elements|
I ran all methods for the sizes Small and Medium. For Large and upwards, I excluded UNION, INSERT, REALSLOW and SLOW$LIKE, the former two because they cannot handle input of this size, the latter two because they are too slow for my patience. From X-Large and on, I excluded TBLNUM$IL (size constraint). For XX-Large I did not include FIX$SINGLE (size constraint) and ITER$EXEC, EXEC$A and EXEC$B (too slow).
Note that these list lengths apply to the list of string originally constructed by the test script (see above). For many methods, the input was adapted to the format required by the method, which could make the string both shorter (embedded blanks stripped) and longer (XML or fixed-length element). The actual list elements were still the same for all methods.
I have tested two types of operations:
|UNPACK||Unpacking the input list itself into a result set, without involving another table. This test cannot easily be implemented for all methods. You rarely have the need for this in real life, but the test isolates the performance of the method as such.|
|JOIN||Using the input list to extract data from a table. With most methods this means a join operation, and I denote this operation as JOIN also for a method like dynamic SQL even if there is no actual join taking place.|
Here is a sample of a typical UNPACK procedure:
CREATE PROCEDURE TBLNUM_Int_UNPACK_test @str text, @tookms int OUTPUT AS DECLARE @start datetime SELECT @start = getdate() SELECT number = convert(int, Value) INTO tmp FROM text_split_me(@str) SELECT @tookms = datediff(ms, @start, getdate()); SELECT number FROM tmp DROP TABLE tmp
Thus, the procedure starts a timer, issues the query, inserting the result into a table in the test database, stops the timer, and returns the data to the test script which then checks that the method returned the expected data. I chose to catch the data with SELECT INTO, as SELECT INTO is less logged than INSERT. Returning the data while the timer is going would mean that network latency could have had gross effect on the result. A side effect of the arrangement with SELECT INTO is that the procedure is always recompiled when it hits the last SELECT statement, but that is outside the timer. I also give an example of a JOIN procedure to highlight one important detail with the string join:
CREATE PROCEDURE ITER_Str_JOIN_test @str text, @tookms int OUTPUT AS DECLARE @start datetime SELECT @start = getdate() SELECT u.wordno, u.guid INTO tmp FROM usrdictwords u JOIN iter_charlist_to_tbl(@str, DEFAULT) AS a on u.word = a.str SELECT @tookms = datediff(ms, @start, getdate()); SELECT wordno FROM tmp DROP TABLE tmp
Note here that when joining over the string column in the test table, I am also retrieving the column guid which is not in any index. This forces SQL Server to access the data pages of the table. (Otherwise the query would be covered by the index on word.) This gives the optimizer the two choices: using the non-clustered index on word with bookmark lookups or scanning the table.
For most methods, there have been four procedures: UNPACK-Str, UNPACK-Int, JOIN-Str and JOIN-Int.
All test procedures are available in the accompanying zip file.
The test script is written in Perl, connecting to SQL Server using DB-Library (because this is the client library I know well). Using DB-Library has imposed some limitations; test procedures cannot have ntext parameters, nor varchar longer than 255 characters. Therefore all procedures use text for the input parameter. The test script issues the same set of SET commands that OLE DB and ODBC do when they connect. It also issues SET NOCOUNT ON. The test script is available in the accompanying zip file.
When running the tests, the test script creates list of strings and integers as described above, and then passes the assembled lists to all test procedures currently in the test database. The test script uses a fixed seed for the random generator, so different runs of the script generate the the same test data. This permitted to me test a method separately from the rest, yet using exactly the same data as for the others. (This was handy when I found an improvement for one method, and didn't want to rerun all the others.) I ran the tests 100 times for each string size, except for REALSLOW and SLOW$LIKE for which I only ran two test runs each, since these methods are excessively slow.
I ran the tests on three different machines:
|JAMIE4K||This is a server with 4 Pentium III 550 GHz, 3 GB of RAM, disk arrays. SQL 2000 SP3.|
|ABA0163||This is my office workstation with a single Pentium III 996 MHz, 512 MB of RAM (though SQL Server is constraint to use around 120 MB) with SCSI disks. SQL 2000 SP2.|
|KESÄMETSÄ||This is my home machine with a single Pentium 4, 2.266 GHz, 512 MB and IDE disks. SQL2000 SP3.|
For all tests, I ran the test script from KESÄMETSÄ. To ABA0163 and JAMIE4K I connected to our office over VPN. The machines were generally idle when I ran the scripts.
In the main presentation of the results in this article, I only present the results from JAMIE4K. I find JAMIE4K more representative of two reasons. 1) This is the only test machine which is a real server, and therefore has more resemblance with production machines in general. 2) As JAMIE4K has slower CPUs, the timings are somewhat more accurate. (Time in SQL Server has a resolution of 3.33 milliseconds, so timings below 50 ms are not very reliable.)
Since the test table is small enough to be in cache throughout the tests, it's mainly an issue of CPU power, and for most methods, ABA0163 is twice as fast as JAMIE4K and KESÄMETSÄ is four times as fast. But there are some exceptions from this rule, which may be due that JAMIE4K has been able to use all four processors in those cases.
Here I present average execution times in milliseconds for the nine main contenders on JAMIE4K. To see the full results all sixteen methods for each machine with minimum, average and maximum execution times, standard deviation and variation coefficient, here is one link each for JAMIE4K, ABA0163 and KESÄMETSÄ.
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
There is one safe conclusion you can draw from this data: the methods REALSLOW and SLOW$LIKE are not usable. (The reason SLOW$LIKE is four times faster than REALSLOW for JOIN-Str is obviously due to that for some reason SQL Server has not used all four processors on JAMIE4K for REALSLOW in this case. Compare with the results on ABA0163 and KESÄMETSÄ.)
For the other seven methods, results are very close to each other, although EXEC$A sticks out a little. Thus, for this small input size, all these methods are OK. I again like to stress that the resolution of the datetime data type in SQL Server does not lend itself for accuracy on this level. On the faster machines ABA0163 and KESÄMETSÄ all methods, save REALSLOW and SLOW$LIKE, have at least one run which is timed for 0 ms.
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
The results for REALSLOW and SLOW$LIKE are now even more ridiculous. You see now, why I didn't want to run them with string size Large.
The other seven methods are still showing results that most people would consider perfectly acceptable. Nevertheless, we can see that EXEC$A and UNION are starting to lag behind the other five. UNION has a size constraint, and cannot handle input of the size Large, unless we chop the input in slices. This is something that I have not pursued, and the reason is precisely the numbers above. It does not seem likely that UNION would be able to give the other methods serious competition.
One could note that on the other two machines, several of the methods still has individual runs that are timed at 0 ms, and that the data size is still too small for accurate conclusions.
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
The field is starting to spread out, and by now you can clearly rank the methods in performance order. When it comes to just unpacking the string, FIX$SINGLE is almost as fast with string size Small. But the method to pay most attention to here, is dynamic SQL, EXEC$A. 2½ seconds to get the string elements, when all other methods needs less than 350 ms, is simply not acceptable.
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
At this size dynamic SQL really gets out of hand. And just how much it gets out of hand, we will return to in a second when we look at trend analysis.
Since this is the last size at which we see FIX$SINGLE (with 8000 rows in Numbers, it does not handle size XX-Large), let's make a summary of methods in terms of speed. You can very clearly see that using fixed-length elements and unpacking them with a table of numbers is the fastest method. You can also see that the simple fixstring_single is distinctively faster than the more complicated fixstring_multi. True, the latter permits virtually unlimited input. Then again, you can add more numbers to Numbers to push the limits of what fixstring_single may handle. 8000 elements in a string array is a whole lot, already that.
Now, there are two issues that slightly reduce the superiority of the fixed-length methods. The first has to do with how the method was invoked. In my tests, I called all test procedures through RPC, which is the normal to do from an application. I found that when I instead passed a command batch with an EXEC statement, then over a certain length of the input parameter, the fixed-length methods, and only they, took a toll. FIX$SINGLE was still the fastest, but TBLNUM overtook FIX$MULTI for the #2 position.
The second issue arises when we look at the deltas between times for the corresponding JOIN and UNPACK operations:
Exactly what these numbers say I don't really know. They could say that with FIX$SINGLE, the time to get the data itself is longer, and thus with some other table, some other method would be faster. (The reason for this would be a different query plan.) Then again, since XML, the slowest method, has the smallest difference, it may be that the times for the unpack the string and getting the table data overlap, and this is an issue we should not pay any attention to at all. If top-most performance is a key issue for you, you need to investigate this further for your own case, as I pointed out in my general disclaimer.
Finally, it is worth remembering that performance is not everything. If you prefer to use a delimited list, you can still unpack with a table of numbers for good performance. And if you don't want the extra table, use the iterative method. Half a second to unpack a 24000 characters long list of 1980 elements is nothing to be alarmed of.
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
With this excessive input size, you cannot expect sub-second response time on this hardware.
This table shows how much the execution times grow when we move from one string size to another. I give the smallest and biggest growth for each method:
|ITER||1.7 – 2.3||2.2 – 2.4||2.5 – 2.7||4.6 – 4.7|
|TBLNUM||1.3 – 1.9||1.7 – 2.0||2.1 – 2.7||4.2 – 4.5|
|FIX$SINGLE||1.0 – 1.5||1.2 – 2.1||1.4 – 2.4|
|FIX$MULTI||1.1 – 1.6||1.4 – 1.7||1.9 – 2.4||3.4 – 4.1|
|XMLATTR||1.8 – 2.1||2.8 – 2.8||2.9 – 2.9||4.8 – 5.1|
|EXEC$A||3.6 – 4.3||5.1 – 6.1||5.7 – 8.0|
|UNION||3.3 – 4.1|
|REALSLOW||8.9 – 9.6|
|SLOW$LIKE||8.9 – 9.6|
A reasonable assumption is that the execution time for a certain string length follows the linear equation:
t(s) = y + k·sHere
sis the string size,
yrepresents some initial overhead, and
kis some constant fairly close to 1. The values of
kare of course different depending on method.
We can observe two things from this table: 1) it appears that the faster the
method the smaller is the slope (
k). Thus, if input size
grows even more, the faster methods will keep their position. 2) Dynamic SQL
does not live up the assumption that
k is close to 1, but
appears to have a slope somewhere near 2 (if the equation is linear at all).
Which means that if input size doubles, the execution time for dynamic SQL
quadruples. Did I say that this is a bad method?
Note: such a linear equation exists only as long as we don't run into some resource limit. As I've noted the fixed-length method sees a slowdown over a certain input size when they are invoked in a command-text batch. .
In this section I discuss assorted observations for the methods that were not among the main contenders.
In the test there were also the procedure versions of the iterative method. I didn't include them in the main result, because I tested three procedures that all were different to each other:
|ITER$PROC||Str||iter_charlist_to_table_sp, a straightforward transformation of the function iter_charlist_to_table.|
|ITER$PROC||Int||iter_intlist_to_table_sp, which performs extensive checking of the input elements, and thus performs a lot more work than the the function iter_intlist_to_table.|
|ITER$EXEC||Str||A version iter_charlist_to_table_sp that accepts the name of a temp table as a parameter and that uses dynamic SQL to insert the data in the table.|
Let's first look at the data for an array of strings:
We see here that the procedure version actually is faster than the function if we are only unpacking the string. This is somewhat surprising, not the least since the procedure is recompiled each time. The two possible reasons I can think is that a temp table in general is faster than a table variable, or that a procedure in general is faster than a multi-step function. But when we use the result to get data from the temp table, the procedure is no longer the faster for sizes Large and X-Large. It seems that joining with the temp table is more expensive than joining with the table variable. This could be due to different query plans (recall that there is a choice of table scan or index seek + bookmark lookup in this case). In any case for XX-Large, the temp table is faster also for joining, and also have a smaller delta.
Despite these numbers, personally I find the function interface much nicer to work with, as I can slap it right into the query, which is why I have focused on the functions in this text.
And ITER$EXEC... The idea was that you could pass a table name as a parameter, and then use dynamic SQL to insert the list items into the table like this:
SELECT @sql = 'SET QUOTED_IDENTIFIER OFF INSERT ' + @tbl + '(str) VALUES( "' + replace(@tmpval, '"', '""') + '")' EXEC (@sql)
The feature may look neat, but as you see in the table above, it is expensive. Too expensive to be defensible, in my opinion. The reason it is expensive, is that for each item in the list, SQL Server has to parse and build a query plan for the INSERT statement. This is yet another example that dynamic SQL in the wrong place gives you bad performance. (There are also places where dynamic SQL is the right thing for best performance.) But at least ITER$EXEC has a slope close to 1 in its equation, in difference to EXEC$A.
So what about that extravagant procedure for a list of integers?
Not surprisingly, we can conclude that the extravaganza has a price in performance. Still, it is not as fatal as that little piece of dynamic SQL in ITER$EXEC. And if you rewind to the main results, you see that this procedure is still faster than XML.
What is more remarkable is the difference in delta for sizes Large and X-Large. This can not be explained with the extravaganza. Again, possible causes for the difference is the query plans. Since the test table has a clustered index on its integer column, one may think that there is no alternative strategies, but SQL Server still have three types of joins to choose from. I have not investigated what exactly is going on here, but rather I again refer to my general disclaimer: you have to test with your case to get your exact data.
I like to stress that the differences between procedure and function that we have seen here relate to the iterative method, which performs more INSERT statements than any other method. I would be cautious to extrapolate these observations to the other methods.
There were two functions in the test for unpacking a delimited string with a table of numbers in the test: the multi-step TBLNUM, and the inline TBLNUM$IL, limited to 7998 characters in input. Here are the results for these two methods for string size Large:
|Method||UNPACK Str||UNPACK Int||JOIN Str||JOIN Int|
You might get the impression that the multi-step function is actually somewhat faster than the inline function, but partially this is a trick of the eye. The multi-step function uses nvarchar and the inline varchar (else it would not be able to handle this size). When both use varchar the inline function is maybe 5% faster. But since execution times at this size are very moderate anyway, this difference is mainly of academic interest.
Again, I like to stress that the difference (or lack of) between inline vs. multi-step relates to delimited input. I have not performed similar studies for fixed-length input, for which the revelations might be different.
EXEC$B is not really a method. EXEC$B is when you execute the exactly the same dynamic SQL statement for a second time. It's behaviour in the tests also defies everything else. For once, I include the results from the other two machines:
If we first look at the numbers for JAMIE4K and compare with the other methods, we see that only FIX$SINGLE is able to match EXEC$B, and not always; for size Large EXEC$B wins for Str and is tied with FIX$SINGLE for Int. But since there is a very large cost for EXEC$B, to wit EXEC$A, the initial query to get the plan into the cache, EXEC$B is not a real contender. You will have to submit that same SQL statement 20 times, to outperform XML for size X-Large. Even more to outperform ITER or TBLNUM.
Now look at the results for Int on ABA0163 and KESÄMETSÄ, and the row for Medium. What on Earth is going on here? Recall that this is the average of 100 runs, so it is not an occasional spurious result. Since there is a clustered index on the Int column, it is difficult to think of another query plan. ...but there is a non-clustered index that the optimizer could get the idea to scan?
If you feel like running these tests yourself, maybe testing your own method, you can download the file arraylist.zip that includes the test script, the test procedures, the functions for the various methods and the table of numbers. The test data is in a separate file, usrdictwords.zip, of the simple reason that this is 6 MB in size. (The arraylist.zip is a mere 30 KB.) Please refer to the README.HTML in the arraylist.zip for further instructions.
I have not invented all these methods myself. This article is a summarizes the work of a lot of people in the SQL community. There are those who came up with an original idea, those who have refined interfaces etc. I cannot mention them all, because I simply don't know about all of them.
That said, here is a list of persons who have contributed to this article, directly or indirectly, knowingly or unknowingly: SQL Server MVPs Steve Kass, Anith Sen, Linda Wierzbicki, Itzik Ben-Gan, Fernando Guerrero, Umachandar Jaychandran, Narayana Vyas Kondreddi, Tom Moreau and Bryant Likes, and Jim Ebbers, James Vitale, Bharathi Veeramac, Paul Marsh, David Craig, Daniel.Sebestyen, Ken Henderson, Ivan Arjentinski, Chris Payne and Joe Celko.
If you have suggestions for other methods, or twists of those listed, that you think ought to be mentioned in this article, or if you have other comments or questions, please drop me a line at firstname.lastname@example.org.
2010-01-06 – There are three versions of the article, the new one devoted entirely to table-valued parameters in SQL 2008. Also tidied up formatting and HTML.
2007-03-03 – There is now new version of the article for SQL 2005. This old version remains for users of SQL 2000.
2004-12-26 – Added a general note about the use of temp tables. Added unpack_without_union, based on a suggestion by Jim Ebbers.
Back to my home page.