Arrays and Lists in SQL Server
Data from Performance Tests in SQL 2008

An SQL text by Erland Sommarskog, SQL Server MVP Latest revision 2010-01-06.

Introduction

This is an appendix to the articles Arrays and Lists in SQL Server 2005 and Beyond and Arrays and Lists in SQL Server 2008 (Using Table-Valued Parameters).In this appendix, I present data and analysis from the suite of performance tests I ran in 2009 on SQL 2008 for the various list-to-table methods, including table-valued parameters. There is another appendix that with data from tests that I ran in 2006 on SQL 2005. If you have arrived at this page from a web search, you may want to at least browse the main articles first.

The main reason to run this set of tests was to capture the performance of new methods: table-valued parameters, Itzik Ben-Gan's clever fn_nums, Adam Machanic's new CLR function and the primitive method of passing many parameters. This time I also made the effort to measure call overhead and a feeble attempt to measure multi-thread performance.

This appendix is differently organised from the first. You will not see much numbers within the article itself, but instead I will refer you to big digit dumps in form of Excel books with the data, and the article focuses on analysis. The Excel books are saved in the old Excel 2003 format, however they display better in Excel 2007.

I first list all tested methods whereupon I describe the test setup for the main, single-threaded, tests. I then proceed with making various observations of the outcome of these tests. I continue with discussing my results from testing call overhead, and finally I turn to the multi-thread test, describing how it was set up, and observations I have made from it.

Table of Contents

   General Disclaimers
   The Contenders
   The Test Setup
      Used Hardware
      The Test Table
      The Test Script
      The Input Data
      Tested Operations and Sample Test Procedures
      Measuring Time
   Single-threaded Results
      Overview of single-summary.xls
      Query Plans
      Table-Valued Parameters
      CLR
      XML
      Looking at fn_nums
      JOIN vs. EXISTS and the Others
      The MAX Threshold
   Call Overhead
      Methodology
      Clientsummary.xls
      Table-valued Parameters
      Passing Many Parameters
      Plain Strings
      XML
   Multi-threaded Tests
      Test Setup
      What to Measure
      Tested Methods and Operations
      Hardware etc.
      multithreadsummary.xls
      General Pattern
      Winners...
      ... and Losers
      Parallelism?
   Conclusion
   Revisions

General Disclaimers

These tests are fairly simple in nature. I have run the tests on an idle machine, where I've stopped as many services I've dared, in order to minimise disturbing activity in the machine. The test table has been small enough to fit entirely into memory on the test box. The test machine itself is a laptop, not a server-class machine you would use for a production system.

This means that these while these tests give us some information, there are also pieces missing. The numbers should be informative enough to tell us how fast or slow the methods as such are. But they do not tell us how they will work with your tables, and nor do they say much about how they will work under load. Indeed, I did run multi-threaded tests this time, but the results of those tests are fairly inconclusive, not the least because of the poor choice of hardware.

Thus, if you need top performance, you need to conduct your own tests with your tables, preferably on a production-class machine. You should also decide whether you want parallelism or not and make sure you turn of parallelism if this is not desired. You should also conduct your tests under load, for instance running tests with parallel clients hammering the server.

I like to stress that you cannot use the results in this appendix to compare with my previous tests for SQL 2005 or SQL 2000. There are differences in the test setup, and I have used different hardware this time.

All stored procedures, user-defined functions and the test script are available through links in this appendix. Please contact me, if you also want the test data. This is 50 MB in zipped format, why I have not made it available for download.

The Contenders

For this suite of tests, I did not include the really slow methods and the poorly performing unpack_with_union and unpack_with_insert. I also excluded minor variations that I had in the tests in 2006 to test some special tweak. As with the old tests, I've have run tests for more methods that I present here, but I've excluded methods that performed less well, or were just a variation of some other method. Still there are 31 different methods, falling into 12 groups.

I have given each method that I've tested a name, and I will use these names as a convenient short-hand notation. Throughout the appendix, these names appear as links and if you click the link, you will be brought to page with the test procedures for the method in questions, and these pages in their turn have links to tested functions. Thus, in this way you can see exactly what I have tested.

Here are the 31 methods, listed in alphabetic order, and with a thicker line separating different groups of methods.

CLR$ADAM Adam Mechanic's CLR function that uses a char array and where he avoids IndexOf. For the tests of  integer lists, I added a function that returns int, so all conversion to integer are done in C#.
CLR$FIXFixed-length format, implemented with the CLR.
CLR$ITERRolling our own in the CLR.
CLR$SPLIT CLR using the split method.
CTE$CHUNKA chunking procedure using a recursive CTE.
CTE$IL cte_split_inline in the main article.
CTE$MSTMTA multi-statement function with the same contents as cte_split_inline.
EXEC$ADynamic SQL.
FIX$BINARY fixbinary_single
FIX$MSTMTA multi-statement version of fixstring_single.
FIX$MULTI fixstring_multi.
FIX$MULTI2 fixstring_multi2.
FIX$SINGLE fixstring_single.
FNFIX$BINARYfn_fixbinary_single, a version of fixbinary_single that uses fn_nums.
FNFIX$MSTMTA multi-statement version of fixstring_single that uses fn_nums.
FNFIX$SINGLEfn_fixstring_single, a version of fixstring_single that uses fn_nums.
FNNUM$CHUNKfn_chunk_split_me, a version of chunk_split_me that uses fn_nums.
FNNUM$ILfn_inline_split_me, a version of inline_split_me that uses fn_nums.
FNNUM$MSTMTA multi-statement version of FNNUM$IL.
ITER$CHUNK The iterative method with chunks.
ITER$SIMPLEThe iterative method without chunking.
MANYPARAMPassing many parameters.
MANYSELECTMaking the list into many SELECT statements.
TBLNUM$CHUNK chunk_split_me, same as duo_chunk_split_me, but with a single return column
TBLNUM$IL inline_split_me.
TBLNUM$MSTMTmstmt_split_me, a multi-statement version of inline_split_me.
TVP$NOPKA table-valued parameter without a primary key in the definition.
TVP$PKA table-valued parameter with a primary key.
XMLATTRAttribute-centred XML with nodes and input as xml.
XMLATTRPOSReturning the list position from an XML document by using the Numbers table.
XMLELEMElement-centred XML with nodes and input as xml.

New from the tests in 2006 are CLR$ADAM, all FNFIX and FNNUM methods, MANYPARAM, TBLNUM$MSTMT, TVP$NOPK, TVP$PK, XMLATTRPOS. TBLNUM$CHUNK was called TBLNUM in those tests. TBLNUM$MSTMT was probably among the methods I opted to exclude in 2006, but it happened to be included this time.

The methods that have been retained from the old suite of tests are mainly unchanged. The difference is the methods based on inline_split_me and cte_split_inline where I've applied the workaround for the caching bug.

The Test Setup

I here describe the test setup. It is similar to the 2006 tests, but there are also several differences and I note these differences as far it is relevant. The description here focuses on the single-threaded tests. The multi-threaded tests were set up somewhat differently, and I describe this setup later.

Used Hardware

I ran all tests on my laptop, which has a dual-core CPU, running at 2.2 GHz. The machine has 4 GB of memory and runs Vista Business x64 SP2. For the single-threaded tests I also ran the test script on this machine. The laptop has only one disk, so data and log files for the test database and tempdb are all on the same disk.

This time I did not find it worthwhile to run the tests on several machines. Unfortunately, I did not have access to server-class hardware.

The Test Table

I used the same test table as in the tests in 2006:

CREATE TABLE usrdictwords
  (wordno int          NOT NULL,
   word   nvarchar(50) NOT NULL,
   guid   char(36)     NOT NULL)
CREATE UNIQUE CLUSTERED INDEX wordno_ix ON usrdictwords(wordno)
CREATE UNIQUE INDEX word_ix ON usrdictwords(word)

The table contains 1 163 750 rows, and the average length of the column word is 10.6 characters. The longest word is 30 characters. wordno holds numbers in the range from 0 to 1 163 749. The order of wordno is not correlated with the order of word. The column guid serves to make the table a little wider, and to be a token non-indexed column. The data in guid is not used for anything in the tests.

The collation for the database is Slovenian_CS_AS. (The data is a taken from a Slovenian version of the Unix file /usr/dict/words, used for the spell utility, that I culled off the net somewhere.)

The Test Script

The test script is written in Perl, connecting to SQL Server using the SQL Server Native Client OLE DB provider.

The script creates the database with a size of 195 MB for the data file and 100 MB for the log file (the initial size is 750 MB during the data load, but the script shrinks it before running the tests). The database is in simple recovery. tempdb is not controlled by the script, but I made sure that tempdb had at least 200 MB for data and 50 MB for the log.

The test script runs with the default SET options you get with OLE DB, except that it also issues SET NOCOUNT ON.

When running the tests, the test script creates lists of strings and integers as described below, and passes the assembled lists to all test procedures currently in the test database in alphabetic order. That is, all methods see the same input data. For the single-threaded test, the script runs for six list lengths: 20, 200, 650, 2 000, 10 000 and 50 000 list elements. The script first runs all test procedures for lists with 20 elements 100 times (but see below), then all for 200 elements 100 times etc. All input lists have unique entries; there are no duplicates in them.

When starting on a new list length, the script performs an sp_recompile on usrdictwords to flush all plans. It then runs a test 0 for that length that is not included in the results, to avoid distortion from the cost of compilation.

Normally, a test procedure is run 100 times for a certain list length. However, since some methods are slow or excessively slow, the script stops running a procedure if its total execution time exceeds three minutes. Nevertheless, the procedure is still permitted to run three times more (including the initial test 0) on the next list length so that I can catch the development as the number of elements rises. However, if the total time for a procedure exceeds 15 minutes, this second chance is abandoned.

The script did not run MANYPARAM for 10 000 and 50 000 elements, as this method cannot support this length. I also excluded EXEC$A from running at 50 000 elements, since this failed an internal query-processor error.

The script starts a transaction before running a test procedure and commits when execution is completed.

You can view the complete test script here.

The Input Data

For each test round, the test script randomly selects a set of unique numbers in the range from 0 to 1 163 749. These numbers constitute the input for the integer lists. The script uses these numbers retrieve the corresponding words in usrdictwords and thereby it gets the input for the string lists.

The exact format of the lists depends on the method, with the most common being a comma-separated list with no extra spacing. (In difference to my tests for SQL 2000 and SQL 2005 where I randomly added extra spacing.) Other formats are as follows:

MethodsData typeFormat
CLR$ITER, CLR$SPLIT, ITER$CHUNK and ITER$SIMPLEIntegerSpace-separated list
EXEC$AString Elements are wrapped in single quotes to appease SQL syntax.
FIX$BINARY, FNFIX$BINARYInteger Binary format.
CLR$FIX and other FIX and FNFIX methodsInteger Fixed length with 9 positions per number.
CLR$FIX and other FIX and FNFIX methodsString Fixed length with 30 positions per string.
XML methods:All XML documents as shown in the XML section in the main article.
TVP methods and MANYPARAMAllData passed as the method requires. More details on table types for TVPs later.

Tested Operations and Sample Test Procedures

I extended the set of operations from the previous tests with two more, for a total of four. Here is a summary, followed by a sample of each type of procedure.

For most methods I've run the tests with lists of numbers and lists of strings. The exceptions are FIX$BINARY and FNFIX$BINARY, which are only meaningful with numeric input.

There is a stored procedure for each combination of method, operation, data type which means that for a single method there is up to eight test procedures. Below I show an example of test procedures for each operation and data types. To see the test procedures for a specific method, click on the link for the method in question.

UNPACK

Here is a typical procedure for the UNPACK operation:

CREATE PROCEDURE CTE$MSTMT_Int_UNPACK_test
                 @str      nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

INSERT #Int_UNPACK (number)
SELECT number = convert(int, str)
FROM   cte_split_mstmt(@str, DEFAULT)

SELECT @tookms = datediff(ms, @start, sysdatetime());

IF @retdata = 1 SELECT number FROM #Int_UNPACK
TRUNCATE TABLE #Int_UNPACK

The procedure: 1) Accepts an nvarchar(MAX) parameter. 2) Starts a timer. 3) Issues the query, inserting the result set into a empty temp table. 4) Stops the timer. 5) Returns the data to the test script if requested. 6) Truncates the table after the run.

Notes:

JOIN

Here is a sample procedure of the JOIN operation for an integer list.

CREATE PROCEDURE CLR$ITER_Int_JOIN_test
                 @str      nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

INSERT #Int_JOIN (word)
SELECT u.word
FROM   usrdictwords u
JOIN   CLR_intlist_iter(@str, DEFAULT) AS a ON u.wordno = a.number

SELECT @tookms = datediff(ms, @start, sysdatetime());

IF @retdata = 1 SELECT word FROM #Int_JOIN
TRUNCATE TABLE #Int_JOIN

The expected query plan is to retrieve the data through a Clustered Index Seek on wordno.

And here is one with a list of strings. In interest of brevity, I only show the INSERT statement:

INSERT #Str_JOIN (wordno, guid)
SELECT u.wordno, u.guid
FROM   usrdictwords u
JOIN   inline_split_me(@str) AS a ON u.word = a.Value

To make the tests more interesting, I designed the JOIN procedures for string data so that it retrieves both the columns wordno and guid. This gives the optimizer two obvious choices to access usrdictwords: 1) do an Index Seek on the index on wordno followed by a number of key lookups on the base table. 2) Scan usrdictwords as part of a hash or merge join.

COUNT

Here is how a COUNT procedure for integer data looks like:

CREATE PROCEDURE CLR$FIX_Int_COUNT_test
                 @str nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

DECLARE @cnt bigint
SELECT @cnt = SUM(len(word))
FROM   usrdictwords u
JOIN   CLR_intlist_fix(@str, 9) AS a ON u.wordno = a.number

SELECT @tookms = datediff(ms, @start, sysdatetime());

As far as it comes to retrieving data from the test table, the COUNT procedures are the same as the JOIN procedures. But instead of inserting the data into a temp table, I perform a SUM operation. In this way I reduce the overhead that caused by the insertion into the temp table. I particularly devised these procedures for the multi-threaded tests. While the access path to the table is the same for JOIN, it is a different operation, and the optimizer may still make different choices.

(Why do I call it COUNT when I use SUM? I first used COUNT, but then I decided to use SUM to force SQL Server to read all rows. But I never came around to change the name.)

Here is a COUNT procedure for strings:

CREATE PROCEDURE XMLATTR_Str_COUNT_test @str      xml,
                                        @retdata  bit = 1,
                                        @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

DECLARE @cnt bigint
SELECT @cnt = SUM(len(guid))
FROM   usrdictwords u
JOIN   @str.nodes('/Root/Word') AS T(Item)
  ON   u.word = T.Item.value('@Item', 'nvarchar(50)')

SELECT @tookms = datediff(ms, @start, sysdatetime());

As with the JOIN operation, I force SQL Server to read the data pages. (@cnt will always be 36 times number of rows, but SQL Server still has to read every row, since len() does not include trailing spaces and SQL Server does not know that I filled every char(36) to the max.)

EXISTS

Finally, here are samples of the EXISTS operation. I show the INSERT statements only:

INSERT #Int_JOIN (word)
SELECT u.word
FROM   usrdictwords u
WHERE  EXISTS (SELECT *
               FROM   inline_split_me(@str) AS a
               WHERE  u.wordno = convert(int, a.Value))
INSERT #Str_JOIN (wordno, guid)
SELECT u.wordno, u.guid
FROM   usrdictwords u
WHERE  EXISTS (SELECT *
               FROM   cte_split_chunk(@str, DEFAULT) AS a
               WHERE  u.word = a.str)

Measuring Time

As you can see from the sample procedures, I use the new system function sysdatetime() that returns datetime2 to measure execution time. While datetime2 permits a precision up to 100 ns, I only measure whole milliseconds. The reason is simple: if you run a loop like:

SET NOCOUNT ON
DECLARE @i int = 1000
DECLARE @t TABLE (d2 datetime2)
WHILE @i > 0
BEGIN
   INSERT @t VALUES(sysdatetime())
   SELECT @i -= 1
END
SELECT d2, COUNT(*)
FROM   @t
GROUP  BY d2

You will see an output like this:

2009-09-27 19:12:55.8656640    1
2009-09-27 19:12:55.8666640   25
2009-09-27 19:12:55.8676640   26
2009-09-27 19:12:55.8686640   38

That is, the resolution of the values returned from sysdatetime() is only a millisecond. I played with a CLR routine that uses a .NET class which in its turn calls the routine QueryPerformanceCounter in the Windows API, and which has a resolution of 100 ns. However, I decided not rely on it. Just because something has a certain resolution does not mean that it has the same accuracy. When SQL 2005 shipped, SQL Trace was able to return durations in microseconds. But if you run a trace today in SQL 2008 or SQL 2005 SP3, you will only see durations in whole milliseconds. The reason Microsoft did this change is that QueryPerformanceCounter relies on the CPU frequency being constant, but that is not always the case in a modern computer, because the BIOS often lower the frequency to preserve energy. For a longer discussion on this topic, read these blog entries from Bob Dorr, an Escalation Engineer for SQL Server, available through this link and this other link.

While it seems from Bob Dorr's posts, that we should be able to rely on a one-millisecond resolution, I have my doubts. I ran the test suite several times, mainly because I found a new twist, or identified an error somewhere. And sometimes when I had run the test suite, I saw timings of -1 ms. Why I don't know, but I assume it is related to these issues. Presumably the start value was read on one CPU, and the stop value on the other. Thankfully, the results I publish do not include any such anomaly. (Interestingly enough, many full runs of the test suite included no such values, and then I could run one where 85 out of 100 executions for a procedure clocked in at -1. I never saw -2 or even worse, though.)

So even if the resolution nominally is one millisecond in this data, which is better than I was able to use when I ran my tests on SQL 2000 and SQL 2005, there is all reason to be careful with the exact value.

Single-threaded Results

The tests produced far too many numbers to show here. You find the results from the single-threaded tests in the Excel file single-summary.xls. It's not that you absolutely must download and open it – but I will more or less assume that you do. In difference to the articles for the tests on SQL 2000 and SQL 2005, I am not including any numbers in the article itself.

I start with giving an overview of the Excel file, and then I go on and discuss various interesting facts from the results.

Overview of single-summary.xls

The Excel file has seven tabs. On each tab there is one column per data type, operation and list length. The method names are coloured by category as a reminder:

Blue – Direct SQL.
Dark redT-SQL Inline function.
Green – Opaque inline.
Black – Intermediate storage.

See the section Inline, Multi-Statement and Temp Tables in the main article for explanation of these terms.

These are the tabs:

Timings – The median execution time in milliseconds for the test.

Rankings – Ranking within the column. To reduce the impact of random flutter, the ranking ignores the last digit, so102 and 109 rank equal. (But 109 and 110 ranks differently.) This has the effect that for shorter lengths about all methods tie for first place.

Ratios to top – The ratio between the execution time for this method and the best method in this column. When the winner has a median of 0, this counts as 1 here. (When computing the ratios, I've used all digits, so 109 and 102 gives a ratio of 1.07.)

Ratios to prev – The ratio between the execution time at this list length and the execution time at the previous length for the same method, operation and data type. All methods have a blank for 20 elements obviously. Methods with a median of 0 for 20 elements, also has a blank for 200 elements. At the bottom, you can see the ratios between the lengths itself. If the ratio exceeds the length ratio, this is bad. The numbers are in red if the ratio exceeds the length ratio with a factor of 1.2.

Compare to JOIN – This tab compares the other operations to the JOIN operation. More about this tab later.

Comparisons – This tab has some special comparisons between selected methods, that I will return to.

No of runs – How many times the method was run for this data type and operation. 100 in most cases, but lower if the test procedure exceeded the permitted execution time. (See the description of the test script for details.)

Query Plans

Every now and then I will refer to the query plans and try to describe them as well as I can. If you are really geeky, you can download the plans from queryplans.zip. This file includes a BCP file and a format file. You need this table:

CREATE TABLE queryplans (method     varchar(20)   NOT NULL,
                         datatype   char(3)       NOT NULL
                            CHECK (datatype IN ('Str', 'Int')),
                         optype     char(6)       NOT NULL
                            CHECK (optype IN ('UNPACK', 'EXISTS', 'JOIN', 'COUNT')),
                         listlen    int           NOT NULL,
                         estrows    bigint        NOT NULL,
                         query_plan xml           NOT NULL,
                         PRIMARY KEY (method, datatype, optype, listlen)
)

To load the file, run:

BCP yourdb..queryplans in queryplans.bcp -f queryplans.fmt -T

(Add -S for server and change -T to -U and -P for username and password as needed.) To view the plans for a certain group of tests, run for instance:

SELECT * FROM queryplans
WHERE method = 'CLR$SPLIT' AND datatype = 'Int' AND optype = 'JOIN'
ORDER BY listlen

The column estrows, is the number of rows SQL Server thinks the query will return; the value is always 1 for COUNT. If you have Mgmt Studio 2008, just double-click the XML document and SSMS will recognize that this is an execution plan and display a graphical plan. If you have SSMS 2005, you need to save the XML with the .sqlplan extension and reopen it. The plans are actual execution plans and include actual number of rows and executions as well as estimated values.

I had to exclude the plans for the MANYPARAM method from the zip file – these plans are huge. When I first compiled the zip file, it was almost 50 MB, but when I excluded MANYPARAM, the size went below 2 MB. Note also that the plan for (CLR$ADAM, Int, COUNT, 20) is the plan for something else due to a bug in my test script.

Table-Valued Parameters

The most interesting method to check out is of course that new kid on the block: table-valued parameters. In the main articles, I tout this as the best method. So how does it fare?

First, only so know you what I'm talking about, these are the table types used by TVP$NOPK:

CREATE TYPE intlist_tbltype AS TABLE (n int NOT NULL)
CREATE TYPE stringlist_tbltype AS TABLE (str nvarchar(50) NOT NULL)

And TVP$PK uses these:

CREATE TYPE intlist_pktype AS TABLE (n int NOT NULL PRIMARY KEY)
CREATE TYPE stringlist_pktype AS TABLE (str nvarchar(50) NOT NULL PRIMARY KEY)

We turn to the Rankings sheet and we can see that TVP ranks very high. The two methods often hold positions 1 and 2, but there are a few notable exceptions. The most striking is that TVP$PK for the JOIN, COUNT and EXISTS for integer data falls through the field for 10 000 elements, only to return to the top for 50 000 elements. If you flip over to Timings or Ratios to prev, you can also see that there is a increase in execution time that by far exceeds the increase of number of elements. For JOIN, the execution time rises from 9 ms at 2 000 elements, to 264 ms at 10 000 elements, and then there is a modest increase to 339 ms for 50 000 elements.

What is going on? The answer lies in the query plan. For most methods, the query plan will be the same no matter the number of elements, because the optimizer does not understand to use the length of an input string for its considerations. But this is different for table-valued parameters, where the optimizer can sniff the number of elements. Usually an advantage, but this time it backfires. For TVP$NOPK, as well as most other methods, the optimizer settles for scanning the input source, and then perform a loop join against usrdictwords. But for TVP$PK, the optimizer decides to perform a hash join and scans both tables. And for 10 000 elements, this is an incorrect decision. But the more elements we send in, the more lookups the loop join needs to perform. Consider the extreme situation where we have as many elements as there are rows in the source table: the loop join will be many times slower than a table scan and a hash join.

At which point the loop join and hash join has equal performance I don't know, but it seems to be above 50 000 elements. Because at 50 000, TVP$NOPK still goes for the loop join and is significantly faster than TVP$PK. But the difference is smaller, and this permits TVP$PK to catch up and overtake most other methods. For 10 000 elements, TVP$PK is 5.7 slower than TVP$NOPK for JOIN, but at 50 000 elements this ratio is only 1.6.

Now, why the optimizer makes different decisions depending on the table parameter has a PK or not, I don't know. Maybe the optimizer thinks that since the values are known to be unique, we will hit more distinct pages in usrdictwords, and therefore a loop join will be costlier. And this could make sense, if none of the table had been in cache. (The way the test is setup, the table should be entirely in cache.) I should add that for strings, the optimizer makes a similar choice, but only at 50 000 elements.

The conclusion of this is not that having a PK for your TVP is a bad idea, this is just one of these oddities that can happen with a cost-based optimizer that makes estimates from statistics. If we dig a little more into the data, we can find a case where the primary key definitely helps. If you look at the execution times for the EXISTS operations, you see that TVP$NOPK falls a little behind from the absolute top. It ranks only #5 for 10 000 integer elements and #6 for 10 000 string elements. You can also see that it is noticeably slower than TVP$PK at 50 000 elements. The reason for this is in the query plan: The plan for TVP$NOPK for all EXISTS queries include a hash aggregate to weed out potential duplicates. This operator is not needed for TVP$PK, since here the optimizer knows that there are no duplicates, and TVP$PK gets the same query plan for JOIN and EXISTS. (Then again, as we shall see when we look at call overhead, the TVP$PK has this cost when it's populated.)

If you look in the Timings sheet, there are a few more differences between TVP$NOPK and TVP$PK. For COUNT on integer, TVP$NOPK is a lot faster at 50 000 elements, and there is a similar difference in the reverse direction for the COUNT of strings at 10 000 elements. These differences are due to that one of them gets a parallel plan, the other not. There are a few more cases where parallel plans are in sway. It seems that if I were to run these tests on a multi-core box, there would be more dramatic differences.

A final note: the misjudgement to use a hash join for 10 000 integer elements was not a complete surprise to me. I observed this in the tests in 2006 as well, when this happened to methods using temp tables as I discuss in the sections Inline, Multi-statement and Temp Tables and Dynamic SQL in the other performance appendix. In the 2009 tests, this is represented by the methods MANYSELECT and EXEC$A.

CLR

Next we look at the CLR, and the best sheet to study these methods is the Timings sheet. If you look for differences between the four CLR methods, you find that they are like Siamese quadruplets. Only for 50 000 string elements, the differences are big and consistent enough that we can place CLR$ITER and CLR$SPLIT in one group ahead of the other two. But the difference is still very small: 30 ms for a total execution time of 500 ms or more.

XML

Two things are of interest for XML: to compare XMLATTRPOS with XMLATTR, and compare XMLELEM to SQL 2005.

If you look in the Timings sheet, you can see that XMLATTRPOS is always slower than XMLATTR, which is not surprising, since it is a more inefficient method. But you can also see that only in one case it needs as much as the double the time that XMLATTR takes, and in that case the absolute difference is only 3 ms. And given that it "loops" over the XML document to extract one element at a time, this is astonishingly good. You can also note that XMLATTRPOS performs clearly better than the iterative method, and it plays an even game with the CTE-based methods.

In the sheet Comparisons, I have computed the ratios for the other XML methods, the CTE methods, the iterative methods and MANYSELECT with regards to XMLATTRPOS. Numbers in blue highlight where the other method is faster than XMLATTRPOS. (Ignore the lower half of this sheet for now.) You can see that  XMLATTR only has blue numbers, and so has XMLELEM for JOIN and COUNT, but apart from that, only CTE$IL has more than one blue cell.

XMLELEM on the other hand... For UNPACK and EXISTS, the execution times are absurd with 191 seconds to unpack 10 000 elements. Apparently, there are query-plan disasters in both cases. The plans for UNPACK and EXISTS include an Eager Spool operator which is missing from the other two. A eager-spool operator caches data in an internal temp table, which is good if the data can be reused, but this does not happen here, and all the operator contributes with is a massive overhead. If you look in the Ratios to prev sheet, you can see that for these bad cases, the execution times for XMLELEM grow far more than the number of elements does. (You may note that in this misery, EXISTS is still only half as slow as UNPACK; this is due to that EXISTS benefits from a parallel plan.)

While this is bad, there is still an improvement from SQL 2005 SP2. If you look at results for XMLELEM in the performance appendix from 2006, you can see that in this case the JOIN had bad performance too.

Looking at fn_nums

Another set of new contenders in this test was the functions based on Itzik Ben-Gan's fn_nums. The best way to evaluate them is to head to the Comparisons sheet, and look at the lower part. You can there see how they performed compared to the corresponding method that uses the Numbers table. I've highlighted in red when the ratio is > 1.2, that is when the execution time for the fn_nums method exceeds the other method with more than 20%. And I've highlighted in green when the ratio is < 1, and the fn_nums method thus is faster. Left in black are those that can count as "equal".

It does not take very long to see that there is a lot more red than green, and the green spots appear mainly for shorter lengths up to 2 000 elements. The best green value is 0.5 which is for an absolute difference of 1 ms. On the other hand there is no lack of red values > 2. Particularly this is noticeable for FNNUM$CHUNK, although it evens out as the list length grows.

In two cases the difference is drastic and that is for JOIN and COUNT on strings for FNFIX$SINGLE where the execution time for 20 elements is over 40 seconds. The query plan starts with a number of loop joins with a warning for missing join predicate; but so does all fn_nums plans. Then comes nested loops joins against usrdictwords. But instead of an index seek against the non-clustered index, there is a Lazy Spool operator and scan of the table. The actual plan tells me that the scan is only performed once, but I find this little difficult to believe. It should not take 40 seconds to scan that table. And the execution time should not increase considerably when we go up to 200 elements, but it does. Hm, maybe spool operator caches the entire table, and then the spool operator is scanned once for each input string?

Not that I understand how this ever could be a good idea, but the net effect is that for every list element the table is scanned. To see if I could stop this from happening, I forced the use of the non-clustered index, but this only bought me an index scan + key lookup. Only if I added guid as an included column, the optimizer would use the index. I also found that if I modified fn_nums to return only 65 536 numbers, I got a normal plan. But obviously this would not be wise to do in the general case, as you could run out of numbers.

JOIN vs. EXISTS and the others

New for this tests were the EXISTS and COUNT operations. It can be interesting to compare these against JOIN, and I've added a comparison of UNPACK vs. JOIN for completeness.

EXISTS and inline T-SQL functions

Since input lists are unique, JOIN and EXISTS are logically the same. However, SQL Server does not know about this, but in one single case: TVP$PK. For all other methods, SQL Server needs to ensure that a row in the target table does not appear more than once in the result set, even if it would happen to match multiple entries in the input list. A simple way to achieve this is to take the same plan as for the JOIN query, and insert an operator to weed to duplicates. This operator can be applied on the input list, before the operator that joins the list to the table, or it can be applied to the final result set. But as we shall see, SQL Server does not always produce this sort of plan.

If you look in Timings in the columns for EXISTS, you can directly see that there are a number of methods for which the execution time is excessive. And if you look in the name column, you see that all these names are in dark red. And if you look a little closer, you see that there is no method name in dark red for which the execution time for EXISTS is normal. Or in another words: for all methods based on an inline function, the optimizer goes seriously wrong. The plans it arrives at, are nowhere close to what I outlined above. I've looked at some of them, and there is a common pattern: there is an index scan of usrdictwords as the outer operator in a loop join and for each row in the table, there is a table spool operator, that may hold the values from the the input string. As with fn_nums, I am puzzled. A scan of the table should not take that long time, and furthermore is this was all there is to it, there should hardly be any difference in execution time between 20 and 200 elements, but in most cases the execution time rises almost with a factor of 10.

Most of these plans include parallelism, but it does not help to turn it off. All that changes is that the parallelism operators go away; else the plans are the same. And performance is not any better.

As for why the optimizer produces these plans, I don't know. I thought that the poor estimates could have something to do with it, so I reduced Numbers to 256 numbers and I modified fn_nums similarly. 256 is just above the standard assumption for XML (200 rows) and far lower the assumption for a CLR function (1 000 rows). But, no, still the same very poor plans.

EXISTS and the rest

For remaining comparisons, turn to the tab Compare to JOIN where you find the difference in ms between JOIN on the one hand and UNPACK, COUNT and EXISTS on the other. A positive number means that the operation is faster than the join, a negative that it is slower.

Above I said that the plan for EXISTS could be the same plan as for JOIN, but with one extra operator to weed out the duplicates. From this follows that we can expect EXISTS to perform slower than JOIN. That operator can be either a Distinct Sort or a Hash Aggregate. If you look at the numbers, you see that there mainly is a real cost only for the longest list length of 50 000 list elements.

Interesting enough, there are a few cases where EXISTS is faster: CLR, MANYSELECT, XMLATTR and TVP$PK. This can be explained by other plan differences for the first two. The CLR plans for JOIN includes a Table Spool operator which is missing from the EXISTS plans. MANYSELECT has a parallel plan for EXISTS over strings, but not for JOIN. For integers, it does not need any distinct operator, as it uses hash join. For XMLATTR and TVP$PK there are no plan differences, but keep in mind that compared to the total execution times, these differences are small overall.

If you look at the numbers in Compare to JOIN, you can see that TVP$NOPK has a larger cost for EXISTS than any other method. You can also see that for the CLR, EXISTS is faster over strings in all cases, but for integers it is slower for the longest input lists. These differences depend whether SQL Server uses sorting or hashing to weed out the duplicates. Sorting is the most common choice, but for TVP$NOPK and CLR over integers, SQL Server opts to use hashing, and the for 50 000 elements the latter is slower.

This was certainly a surprise to me, but I think I found the answer in the default trace. (A trace which starts when SQL Server starts, unless you configure it not to.) I found plenty of Hash Warning and Sort Warning events in the trace. Since my timings table includes a starttime column as well as the duration, I could correlate the events with the test procedure that was running at the time. All EXISTS queries that use any sort or hash operator to dedup generated the corresponding warning at 10 000 and 50 000 list elements. At 50 000 elements, there was a hash-recursion level of 3. All queries with sorting had to do multiple passes at 50 000 elements. (For XMLATTR this also happened for strings at 10 000 elements.)

Apparently hashing suffers more from spilling to disk than sorting. But it's important to note that whether spilling occurs depends on the amount of available memory. My laptop has 4 GB of memory and many production machines top that. On a more able machine, you may see much smaller differences between JOIN and EXISTS. Then again, on 32-bit SQL Server it does not help how much memory you cram in, as on 32-bit SQL Server sorting and hashing cannot used the extended memory space beyond the plain 2 GB.

COUNT

There is reason to believe that COUNT should always be faster than JOIN, because it is the same query, with the difference that we aggregate the result into a variable instead of inserting rows into a temp table, which of course yields some overhead.

And indeed there are no negative numbers for COUNT (save one value for FNFIX$SINGLE where the total execution time is over five minutes, and thus is completely uninteresting). There is a great deal of variation from method to method, and I will only make a few observations.

The methods that build on a multi-statement function, that is all with the name in black but MANYSELECT, have very similar numbers, and this is to be expected. The method to unpack the list is hidden into the function, and other mechanisms determine the difference between inserting the result from the JOIN query into a temp table, and running the COUNT query.

For some methods, parallelism plays a part, and it cuts both ways. FIX$BINARY, FIX$SINGLE, FIX$MULTI, FIX$MULTI2 and TBLNUM$IL all have a parallel plan for COUNT but not for JOIN; this applies to both data types. For TBLNUM$IL parallelism gives a tremendous boost, and the simple fixed-length methods also seem to benefit from it. But for FIX$MULTI and FIX$MULTI2, the numbers for COUNT are lower than for most other methods. (Please disregard FIX$MULTI2 over 50 000 strings; the total execution time is over ten minutes which is just absurd.)

For TVP methods, in most cases, the numbers for COUNT are lower than for the multi-statement functions. There is a mix of parallel and non-parallel plans, but there is no obvious pattern. Maybe there simply is a difference in how fast different methods can populate the temp table. Indeed, methods that involve a multi-statement function first need to fill the return table, whereas a query using a TVP can start to fill the temp table faster, and thus gets a lower overhead. But I am only speculating here.

For the CLR, there is a Table Spool operator in the plan for JOIN for both data types that is not present for the COUNT plans. However, the spool operator sits in different places. For strings, it comes between the index seek in the non-clustered index and the key lookup in the clustered index, whereas for integer it is just before the INSERT operator. Really what effect it has, and why it is needed, I don't know. But compared to the multi-statement functions, the CLR methods suffer less from the temp table for the integer queries, but takes a higher toll for the string queries.

UNPACK

Obviously, UNPACK should be faster than JOIN, which it also is, except for the tragical XMLELEM. But as with COUNT, there are some differences between the methods. UNPACK as such is rarely of much interest, but there are a few observations that can help us to understand how the joins work.

To get a frame of reference, we can start with looking at the methods based on multi-statement functions. As with COUNT and EXISTS, they have very similar numbers, as all that affects is the cost for the JOIN operation.

In the column for 10 000 integers, two methods stand out with high values, MANYSELECT and TVP$PK. But that is because of something we know already: the join for these is done with a hash join, which is a poor choice.

More interesting are the methods for which the numbers are conspicuously low: CTE$IL, the inline functions based on fn_nums, CLR and the TVP methods. When the numbers are low, the join has the same basic query plan as the methods building on multi-statement functions, and there is no parallelism in the game. Yet they have a far lower "overhead" for the join. Apparently these methods can deliver data into the join faster. And if you think of it: CTE$IL and the functions based in fn_nums produce the data elements without touching any sort of table. And likewise does the CLR stream data into the query.

There are some exceptions to this pattern, but for FNFIX$SINGLE over strings and some of the TVP numbers, the explanation is simple: the join plan is different from the "standard" plan. What is a little more mysterious is the CLR. Here the low numbers appear only for the integer queries. For the string queries, CLR has a higher "overhead" for the join than the multi-statement function. As I've already mentioned, the join plans for the CLR includes a table spool operator, and so does the UNPACK plans. But as I mentioned, this spool operator sits in different places in the plan, and presumably this affects how quickly CLR can feed the query with data.

The Bottom Line of all this?

So how interesting is this comparison of operators? Well, there are two bottom lines:

  1. Be very careful with using EXISTS with inline T-SQL functions.
  2. You need to benchmark in your own environment with your own queries to see what performance you actually get. The observations I have made here, mainly apply to my test and you cannot rely on them being generally applicable.

The MAX Threshold

In the sheet Ratios to prev, you can see the ratio between the execution time when moving from one list length to another. At the bottom of the sheet, below the coloured areas, you can see the actual ratios between the lengths. I've highlighted in red, if a time ratio exceeds the length ratio with more than 20 %.

In some cases, the high ratios are due to query-plan changes from list length to another. This applies to TVP$PK, MANYSELECT and EXEC$A. A few of the high ratios for short list lengths are probably due to the low resolution of 1 ms.

But beyond that you can see that for integer data, there are a couple of methods where the ratio for 50 000 is far above 5.0, and this happens consistently for all operations. If you look at string data, the pattern is a little a different. Here the jump occurs at 10 000 elements for some methods, and at 50 000 for some other. And if you study this closer, you see that the jump at 10 000 string elements happens for methods that use fixed-length input.

This is nothing new. This is something I also discuss in the older performance appendix, in the section Fixed-Length and the String-Length Threshold. A short recap: when the length of the value of any of the MAX data types exceeds a certain threshold, this leads to a jump in execution time. (Possibly because SQL Server starts spilling the variable to disk, or at least enables it to do so.) This threshold is around 500 000 bytes in 64-bit SQL Server. (And 750 000 bytes in 32-bit.). The results of the tests in 2009 agree with the predictions I made in the older performance appendix.

That is, for 50 000 elements all input lists exceed the limit, save for the methods that work on binary input. But effect only strikes at pure T-SQL methods; CLR and XML are unaffected. And so are the TVP methods, since they don't work with any string at all. Furthermore, methods that applies chunking are largely unaffected, because they don't perform very many operations on the MAX string. This leaves with the following methods being victims: CTE$IL and CTE$MSTMT, all fixed-length methods but the binary methods, FNNUM$IL, FNNUM$MSTMT, ITER$SIMPLE, TBLNUM$IL and TBLNUM$MSTMT. The one method that I cannot determine whether it is affected or not is MANYSELECT, since this method is also affected by plan changes.

It is also striking that some methods takes a larger toll of this effect than others. For ITER$SIMPLE the factor is 7-7.5, while for TBLNUM$IL it ranges from 23 to 33, with the net effect that TBLNUM$IL is slower than ITER$SIMPLE for 50 000 elements, and ITER$SIMPLE is not known for its efficiency.

There is one more observation in this area and that is the total breakdown for FIX$MULTI2 at 50 000 string elements. I have not investigated very closely why this happens, but so much is clear that it is not a query-plan change; the plan is the same as for other list lengths. I assume that SQL Server hits some resource limit causing it to read something to or from disk over and over again. In such case, the performance should be better if there is more memory available.

Call Overhead

In the tests for SQL 2000 and SQL 2005, I measured execution times server-side only, feeling that client-side timing would suffer from random network disturbances. But with the advent of table-valued parameters, this position became untenable. There is all reason to ask if there is any extra overhead with using TVPs.

Methodology

I used the same test setup to measure call overhead, as I used for the single-thread tests. In fact, the single-threaded tests includes two timers: one server-side within the procedure, and one client-side around the call to the stored procedure.

When investigating how I should measure time client-side, I looked at what the .NET Framework and the Windows API have to offer. In the end, though, I went for a simple-minded approach: I used SQL Server. Before calling a test procedure I called this procedure:

CREATE PROCEDURE start_client_timer AS
   DECLARE @d2 datetime2(3) = sysdatetime()
   DECLARE @b varbinary(128) = convert(varbinary(128), @d2)
   SET CONTEXT_INFO @b

SET CONTEXT_INFO is different from all other SET commands, as the effect of it is not reverted when the scope exists. Then to retrieve the execution-time for the call, I used this procedure:

CREATE PROCEDURE get_clientms @clientms  int          OUTPUT,
                              @starttime datetime2(3) OUTPUT AS
   DECLARE @now datetime2(3) = sysdatetime()
   SELECT @starttime = convert(datetime2(3), context_info())
   SELECT @clientms = datediff(ms, @starttime, @now)

While these calls in themselves add an overhead, this means that I was guaranteed that the client-side time never could be less than the time that I measured within the procedure. (With different timers on different computers, there is a risk that one of them would cross a boundary when the other doesn't, so the time measured client-side could appear as lower which of course is undesirable.)

Once equipped with tool, I computed the call overhead as the difference between the total execution time measured client-side and server-side divided by the number of calls. More precisely this overhead includes:

However, the overhead for compiling a procedure is not included, as the first call for each list length is not included in the numbers I present.

Clientsummary.xls

You find the data for the call overhead in the Excel file clientsummary.xls. When you open it, you will see less columns than in single-summary.xls, because I've factored out the operation type. Reasonably, the call overhead does not depend on whether the operation is UNPACK, JOIN, COUNT or EXISTS.

In the Excel book, there are two sheets. For now we only look at the first sheet, Call Overhead. Here you see two tables, one for integer data, one for string data. For each data type, I have sorted the methods per performance on the longest list length. I've drawn lines to divide the methods into groups. The numbers you see is the call overhead per call in milliseconds. They are presented with two decimals, which may seem brave, given that the timing resolution is only one millisecond. But the results are actually quite consistent when it comes to the division into groups.

Table-valued Parameters

Let's start with TVPs, since they are the most interesting here. As you can see the call overhead for TVP is considerable. Almost two seconds for 50 000 integers. If you add that to the values on the Timings sheet in single-summary.xls, TVP ends up in a non-descript middle. Still better than MANYSELECT and CTE$IL, but overtaken by ITER$CHUNK. So that is a really big disappointment?

Before you jump to conclusion, keep in mind that test script is written in Perl, and it uses a client API that sits on top of OLE DB, and which I have authored myself. That is, a quite untypical use case. Undoubtedly, most people will use TVPs through SqlClient and ADO .NET. Therefore, I wrote a test program in C# which is based on the same idea as my regular test script, but far less sophisticated and hardwired for the TVP methods. (In fact it is so unsophisticated that I wrote two programs, one for strings and one for numbers, tvptest.cs and tvptest_str.cs.). The only operation I tested was UNPACK, but as noted above the operation type should not matter. I tested both to pass a DataTable and a List<SqlDataRecord>. I measured two aspects of the call overhead: A) only the time for the call to ExecuteNonQuery and B) adding the time it takes to load the DataTable or the List. I made sure that for each test I had a new suite of numbers or words. I only ran each particular test 10 times. I did not test passing a DataReader, as I could not think of a way to test this fairly.

You can view the results of these tests in the sheet Call Overhead ADO. There are some new method names here: TVP$NOPK$DT$A, TVP$PK$LIST$B and so on. The bit DT means that the TVP was passed as a DataTable, and LIST that it was passed as a List<SqlDataRecord>. "Methods" ending in $A are the measurements for the call to ExecuteNonQuery only, whereas $B also includes the time it takes to load the DataTable or the List.

As you can see, with ADO .NET the picture is completely different. In ADO .NET, the overhead for TVP$NOPK is lower than for XML and for strings the results are also better than for fixed-length input. (But keep in mind that that the numbers for XML and fixed-length are measured from Perl, so it is a bit of apples-to-oranges comparison.)

For TVP$PK the situation is a bit bleaker with as much as 350 ms in overhead for 50 000 strings. This higher overhead comes from that SQL Server needs to sort the data to get it into the primary-key index. There are some funny relations going from one length to another: 20 ms for 10 000 integers explode to 150 ms for 50 000 integers, and a similar effect occurs when going from 2 000 to 10 000 strings. As when I explored the differences between EXISTS and JOIN, I looked in the default trace, and I found Sort Warnings that I could correlate to the calls for TVP$PK. Thus, these jumps are due to the sort spilling to disk. This indicates that with a different amount of available memory, I would have seen different results on a 64-bit machine. (A side note: earlier I said that TVP$PK gave better performance with EXISTS queries than TVP$NOPK, because it does not an operator to eliminate duplicates. A classic case of swings and roundabouts.)

If we also include the time it takes to load the DataTable or the List, the numbers rise quite considerably. This indicates that if you happen to have 50 000 numbers in a comma-separated list already, it may be better to use a method which accepts such input, rather than populating a TVP. (Or implement a DataReader that reads the CSV?)

As for DataTable vs. List, there is no significant difference between the two.

(Note: I first ran the ADO test on my workstation, only for integers. The numbers I got there was about the double than what I later received on my laptop, despite these machines being fairly similar in capacity and configuration. So take the numbers in Call Overhead ADO as a rough sketch.)

Passing Many Parameters

This far I have not covered the MANYPARAM method (except that I noted that its query plans are huge). If you look at the server-side results in the Timings or Rankings sheet in single-summary.xls you can see that it does very good for the list lengths where it is able to run. For the COUNT on strings it plays an even game with the TVP methods.

But all this changes when we look at the call overhead. The numbers you can see in the Call Overhead sheet in clientsummary.xls are nothing but gross. Already at 20 elements it's 12 ms, and for 2 000 elements, it's more than half a second! There is no method which has a server-side result which is that poor, so in fact that MANYPARAM is the slowest method of them all if we disregard the cases where optimizer goes astray.

To find out whether this was yet another issue with my client API, I wrote a test for MANYPARAM in ADO .NET, this time only for integers. (You find it in manyparamtest.cs.) You can see the result on the Call Overhead ADO sheet. As with TVP the performance in ADO .NET is better, but only with some 25%. Almost 10 ms for 20 elements, and just below 400 ms for 2 000 elements.

In the ADO sheet there are two entries, MANYPARAM$A and MANYPARAM$B. For $A, the timing covers only the call to ExecuteNonQuery, for $B it also includes the time it takes to fill the parameter collection. The difference between the two is very small, so setting up all those parameters is cheap. But passing them, no, that's darn expensive.

All in all, while using this technique may be acceptable in some limited cases, it's definitely not a viable choice when you need to work with many list elements. Unwieldy code, an upper limit of 2 100 elements and poor performance; there is just nothing to defend it.

Plain Strings

Let's take a quick look on the other methods but XML. That is, all methods that takes an nvarchar(MAX) or varbinary(MAX) parameter, and for which there cannot be any difference on the SQL Server side in receiving the parameter. All that should matter is the length of the parameter string – or so one would think.

Looking in the Call Overhead sheet, we see that the values fall in distinct groups, and if we look closer, we can see that within each group, the input format is the same. For integers the groups are, in order:

  1. Binary input.
  2. Space-separated strings.
  3. Fixed length.
  4. Comma-separated strings.

The first three groups are what we could expect. The shorter the string, the lesser the overhead. Keep in mind that these tests use shared memory, so with a real network link, these differences are likely to be more accentuated. But the last group is very funny. The space-separated and the comma-separated strings have exactly the same length, so there is no apparent reason why the overhead would be 12 milliseconds more for 50 000 integers only because the separator is a comma rather than a space. I don't have any answer for this, but I strongly suspect that this is something related my test setup or my Perl API, and this is not likely to appear with an ADO .NET application.

For strings, we see the following groups:

  1. Comma-separated strings.
  2. Fixed-length.
  3. Dynamic SQL.

Again, there is something strange going on. The input strings for dynamic SQL are longer than the regular comma-separated strings, since each value is quoted, so far so good. But even with these extra quotes, the strings for dynamic SQL are considerably shorter than the fixed-length strings with 30 positions per element. And again, I have no explanation to offer but the assumption that it is not likely to show in other APIs.(Compilation? No, the compilation overhead for dynamic SQL is included in the server-side times.)

In any case, while these differences are funny, the impact on the total execution time is limited.

XML

For XML two things matter: the length of the document and the fact that SQL Server has to parse the XML documents to be able to store them in its internal format. And not surprisingly, XML has a higher overhead than plain strings. For integers, this is a factor five for attributed-centred XML vs. fixed-length format for the longer input lists, despite the XML documents are only twice the size. For strings, XMLATTR has only 60% more overhead than fixed-length, but in this case the XML documents are shorter than the fixed-length strings. Even if a factor of five sounds a lot, the practical importance is not that big. If you add the call overhead to the times in the Timings sheet in singlesummary.xls you can see that the rankings are hardly affected.

Element-centred XML has a higher overhead, and it seems to roughly agree with the difference in size of the XML documents.

Multi-threaded Tests

It is with some doubt that I publish the results from my multi-threaded tests, because about the only safe conclusion I am able to draw is that the tests are inconclusive. So take what follows with several grains of salt.

Test Setup

The test script, testscript.pl, is also able to run multi-threaded tests, but there are significant differences how the tests are run.

When the test script runs multi-threaded tests, the first thing it does is to select 10 001 lists each of integer and string data. The lists vary in lengths, repeating in groups of 11. Lists 1-3, 12-14, 22-24 etc have 20 elements. List 4, 15, 26 etc have five elements. Then come two lists with 50 elements, two with 100 and then one each with 200, 650, and 2 000 elements. (The reason that it is not a random distribution is simply laziness.) I chose mainly short lists, since it seems unlikely that you will have thousands of users hammering the server with lists with 10 000 elements.

The test script iterates over the numbers of threads to test for – 1, 5, 25 and 100 threads – and runs each test procedure for these numbers of threads. Before spawning the threads, the script first clears both the buffer cache and the plan cache. To factor compilation out of the test, the script runs the test procedure in question, using the very first input list, to enter a plan for the procedure into cache. This means that the plan is always created for 20 elements in the input list. As you see, the multi-threaded tests runs with the opposite presumptions than the single-threaded tests, starting with a cold cache. I opted to do this for two reasons: 1) Some methods (read dynamic SQL) may happen to squeeze data out of the cache, thereby leaving more work to the next guy. 2) With reads from disk, I hoped to see a better scaling effect of using more threads.

The scripts proceeds to take an exclusive application lock on session level whereupon it spawns the threads. The first thing for a thread is to call a version of start_client_timer that first tries to get a shared application lock on the resource locked by the main thread of the test script. Once the main thread has spawned all threads, it releases the application lock (after a short wait), and thereby starting the test itself. The main thread then waits for all subthreads to complete.

The remaining 10 000 lists are divided in segments over the subthreads. For instance, when running with five threads, thread 1 uses lists 2-2001, thread 2 uses lists 2002-4001 and so on. That is, no matter if the number of threads is 1, 5, 25 or 100, the script makes the same number of calls to the test procedure. The subthreads collect the timings from the procedure, but save this in memory only, and move directly to the next call. When a subthread has run all its calls, it returns the data to the main thread, that saves the data to the database when the test is over.

What to Measure

A challenge for me was to determine what times to use. The server-side timings were basically useless, since I had opted to use short lists. I first tried to use the average or median of the total execution time for the threads, but I faced distribution problems. Most threads had about the same execution time, but an occasional thread could complete much faster – and I mean much faster: subsecond, when most other threads ran for ten seconds or more. Presumably, this was due to some weirdness in the scheduling.

Eventually, I decided to simply compute the difference between the starting time for the first thread and the end time for the last thread. After all, in the real world what counts is wall-clock time. But this choice is not entirely without problems. While a test runs for a several minutes, all I get is one single value, and for some methods the execution time for a test proved to be quite different from one execution to another. Running the entire test suite to collect one single value per test took 18-20 hours, so it was not realistic to run many times to even out these variations.

The results I publish are the average of six full runs. This was not my original plan, but because I found minor flaws with regards to tempdb size (see below) in my test setup, I had reason to rerun anyway. In the end I was never able to resolve the tempdb issue entirely, and I decided that the impact was small enough that using an average of the six runs I had was better that just presenting the results from a single potentially flawless run.

Tested Methods and Operations

For the multi-threaded tests, I tested UNPACK, JOIN and COUNT, but I left out EXISTS as it did not seem interesting enough. I ran the test for both integers and strings.

I made some exclusions. UNPACK for XMLELEM (both data types) and JOIN and COUNT for FNFIX$SINGLE on strings ran too slow in the single-threaded tests to permit them be included in this test. I also largely excluded the TVP methods and MANYPARAM due to their high call overhead. This is a pity, as it would be interesting to see how well the TVP methods scale, but because of my Perl API, this was not possible. I did however included them in occasional runs, but I am not comparing their results with the other methods.

Hardware etc.

As with the single-threaded tests, I used the SQL Server 2008 instance I have on my laptop. The test script ran on my desktop machine, which runs Vista Ultimate x64 SP2 and has 6 GB of RAM and a 2.2 GHz dual-core CPU. (The test script consumes quite a lot of memory for 100 threads, over 2 GB, so running both the test script and SQL Server on the laptop did not seem like a good idea.)

It goes without saying that this system is less than optimal for multi-thread tests: only two CPUs, and all database files, including tempdb, on a single disk.

As I mentioned there was an issue with tempdb in the tests: I discovered very late that I had not sized tempdb correctly, and the size I used for the single-threaded tests was insufficient, so SQL Server would autogrow the data and log files for tempdb while the test was running. As autogrow could affect on the measurements, I wanted to eliminate it. But no matter how big I sized the log, autogrow still occurred. In the final run, the data file for tempdb was 1 GB and the log was 250 MB. In this run, autogrow appeared in a place where it had not appeared before. This was when I gave up, and decided to use the average of the six runs I had completed.

So, yes, the test results include distortion caused by autogrow, but I the effect seems to be minor. I was able to correlate the autogrow events in the default trace with my test results, and I could see during which procedures that had been running when autogrow took place. Their values did not stand out as particularly extreme compared to other runs where autogrow did not appear for the same procedure.

multithreadsummary.xls

You can find the results of the multi-threaded tests in the Excel file multithreadsummary.xls. This file has a number of tabs:

Totaltime – The total execution time in milliseconds for each test.

Deltatime – The difference in execution time between this number of threads and the previous number. That is, in the column 100 you see the difference in execution time between 100 and 25 threads. I have put positive numbers in red, as the general assumption is that the more threads for the same work, the shorter the execution time.

Rankings – The ranking within this combination of operation, data type and number of threads for the total execution time.

Ratio to top – How many times slower this method ran compared to the fastest method for this combination.

Threadratio – The ratio in execution time between this test and the same test with one thread. The higher the value, the better does the method scale. I have put ratios less than 1.1 in red, as this is very poor .

Ranking threadratio – Ranking of the values in the Threadratio sheet.

Distance threadratio – The difference in threadratio between this method, and the method with the highest threadratio for this combination of operation, data type and number of threads. In this sheet I have shaded the values in stronger and stronger shades of red in steps of 0.4: values up to 0.4 are unshaded, values between 0.4 and 0.8 are pink, values between 0.8 and 1.2 are more reddish pink, values between 1.2 and 1.6 are light red, and values over 1.6 are fully red. To make the JOIN column easier to read, I'm using a different colour on this sheet for JOIN. (Note: if you are viewing the Excel book with Excel 2003, you will only see three different shades: values up 0.8 appear as unshaded, and values between 0.8 and 1.6 have the same light-red colour.)

Individual runs – This sheet gives the detail data from the six runs. The column totaltime is the same value as in the Totaltime sheet, that is the average of all six runs. The columns total1-total6 is the total execution time for that particular run, and the columns ratio1-ratio6 gives the ratio between the execution time for that run and the average. I've highlighted ratios above 1.1 or below 0.9 in red.

As in single-summary.xls, the methods appear in alphabetic order with the same colour coding. The TVP methods and MANYPARAM appear "outside competition" at the bottom of the Totaltime, Deltatime and Threadratio sheets only.

General Pattern

Let's first look at the Deltatime sheet. We can see that when we go from one to five threads, the execution time falls considerably in most cases, with the exception of fixed-lengths methods for integer data with JOIN and UNPACK.

When we move to 25 threads, on the other hand, there is a rise in execution time for most methods, particularly for JOIN and UNPACK. Curiously enough, the fixed-length methods that scaled negatively for five threads, are now able to improve their execution time. For COUNT there are more methods for which the execution time continues to decrease, if not that much.

Finally, for 100 threads the vast majority of the methods have a drop in performance from 25 threads for UNPACK and JOIN. For COUNT there are still a few that scale positively.

Overall, we get best scaling with COUNT, which seems reasonable to expect. With the other operations we write to temp tables, whereas with COUNT we only read from cache. On this modest hardware, it is likely that all these processes saturate the disk.

If you move over to the sheet Distance threadratio, there is another interesting observation. You can see that in the column for JOIN on strings there are only a handful on pink cells and none that is redder, whereas other datatype/optype columns have more cells that are shaded and in different shades of red. I guess this tells us that the join operation over strings in this test does not leave much scaling potential, as, say, the COUNT operation over integer does. Not only do we have the temp tables, but string-comparison is more CPU-intensive, so that resource too is saturated.

Winners...

To evaluate the results, we need look in several places in the Excel book. The Rankings sheet gives the ranking of the overall performance, but it is in the Threadratio and Ranking threadratio sheets, we can see the actual effect of adding many parallel processes. Both are important: there are some methods that ranks good on the threadratio, but still ranks mediocre or poorly in total execution time.

The results do not lend themselves to very many conclusions: as I said, the main conclusion is that the results are inconclusive. Nevertheless, it is fairly easy to identify a winner, and that is the CLR family. If you look in Rankings, you see that no CLR method ranks lower than sixth place for five or more threads. And for 25 and 100 threads, CLR holds places 1-4 for all tests. Likewise, if you look in Ranking threadratio, you can see a similar pattern. For integer data, other methods manage to sneak into the top 4 in three cases only. For string data the picture is more mixed, particularly for the JOIN operation where the winners in all three cases are elsewhere. But overall, the CLR scales very well.

Can we also draw any conclusion which of the CLR methods that is the best? It is difficult not to notice that CLR$ADAM tops all tests for integer data in Ranking threadratio. And if you look a little closer, you see that for string data CLR$FIX ranks best of the CLR methods. If you move over to the sheet Distance threadratio, you can see that in some cases they win only with a slim margin. But there are also COUNT on integers and UNPACK and COUNT for strings where CLR$ADAM and CLR$FIX make decisive wins. So do these methods scale better than the other two? No, I think this is a mirage. Move over to Totaltime, and the picture is different. The two "winners" tends to be the slowest CLR methods with a single thread, so they are just catching up. (As for why they are slower, there are hints in the previous sections in the article. In the single-threaded tests CLR$ITER and CLR$SPLIT were somewhat faster than the other two for the longest lists. Maybe, we are be able to see this small difference also for shorter lists when we make so many calls. A more important factor, though, may be the difference in call overhead. Recall that I've run CLR$ADAM with comma-separated input, which for some strange reason has a higher call overhead than space-separated. You can note that CLR$ADAM lags more behind on integer lists, and CLR$FIX on string lists.)

But it could also be that strange difference in call overhead between space-separated lists and comma-separated lists that)

Thus, from these tests, you cannot say that one CLR method is better or worse than the other, only that CLR in general scales well. (But recall the experiences of SQL Server MVPs Adam Machanic and Paul Nielsen when using String.Split.)

And, please keep in mind that these tests do not cover TVPs. You can see its data in the Threadratio sheet, and it's fairly drab. But due to the high call overhead in my Perl API, we don't know what we are measuring. As the time to make the call exceeds the execution time in SQL Server, we are not stressing SQL Server like we do with the other methods, that's for sure. So maybe CLR is only the winner in the absence of TVPs.

Are there any other winners? Well, there are a few surprise showings. Look at which method that gets to #1 for the JOIN on strings with 100 threads: CTE$CHUNK, who would have thought? In some individual runs I saw it make #1 also for the JOIN on integers. Now, what is going on here I don't know, but you can see that CTE$CHUNK ranks much better for JOIN and UNPACK that include a temp table, than for COUNT that does not. The same also applies for CTE$MSTMT, the winner for the join over strings with 25 threads. Maybe the CTE methods are slow to feed the temp table, so they have a better potential for scaling. After all, if you look in Rankings, CTE$CHUNK ranks no better than #13 for the integer join, and #10 for the string join, so it's #1 showing in threadratio is probably more a piece of curio than anything else.

While you are in the Rankings sheet, look at the row for MANYSELECT and see that it is in place 24 for all tests on strings, that is last or second last. You can also see that for integers, it never ranks better than #20. Now switch to Ranking threadratio: for the join over strings: it grabs places 2 and 3 for 25 and 100 threads respectively. It's rankings for COUNT are good too. But since the total time is poor, the method is still not of interest.

While not directly a winner, let me also note that XML does fairly well. For strings, XMLATTR ranks between places 5 and 9 for total execution time and it ranks decently also on threadratio.

... and Losers

The most noticeable loser is the fixed-length method, and more precisely when you use the Numbers table. As you can see in the Threadratio sheet, there are several entries in red, meaning that the ratio with regards to a single thread is less than 1.1. There are even values below 1, meaning that it took longer time to run the test with many threads instead of one!

Certainly, the picture is not clear-cut. If you look in Ranking Threadratio, there are some tests where the fixed-length method ranks fairly decently. FIX$MSTMT even manages to rank as #1 and #2 for the JOIN and COUNT over strings for 5 threads respectively. But then as the number of threads increases, it falls to lower rankings. And you can see the same pattern for FIX$BINARY and FIX$SINGLE over integers. However, there are also some cases with the opposite pattern. FIX$MULTI and FIX$MULTI2 ties for #8 for COUNT over integers for 100 threads while being at places 19 and 21 for 5 threads, and FIX$SINGLE has a similar development for COUNT over strings: #12 for 5 threads, and then #9 for both 25 and 100 threads.

But given that the fixed-length method is one of the fastest methods in the single-threaded test, the overall result is a disappointment. If you look in the Rankings sheet, you can see some really poor rankings, not the least for the join over integers for 5 threads, where FIX$BINARY is at #17, and the others (save FIX$MSTMT) fare even poorer.

It is interesting to compare with the FNFIX methods. Looking in Rankings Threadratio, it is true that FNFIX$MSTMT ranks lower than FIX$MSTMT, but if you look at FNFIX$BINARY and FNFIX$SINGLE they rank a whole lot better than FIX$BINARY and FIX$SINGLE, except for the join over integers where the FNFIX methods also scale poorly.

Another group that does not scale very well are the FNNUM methods. In Ranking Threadratio, they never make it to top 10, but in one single case and then just barely. The TBLNUM methods have a more mixed picture, but TBLNUM$IL does really bad on the two join operations. TBLNUM$CHUNK and TBLNUM$MSTMT are unable to make of use the good scaling potential for the COUNT over integers, and for 100 threads their threadratio is more than 2.0 less than the leader.

By now, I think I've mentioned all groups of methods but the iterative method and dynamic SQL: both make a mediocre appearance. Although, I will have to admit that I thought EXEC$A would lose big time, due to all its compilation. But it only hits the bottom for COUNT over integers.

Parallelism?

A final question is: could it be that the poor showing of some methods is due to a parallel plan? Conceivably, with a parallel plan, the CPUs get saturated more easily, as all threads try to use all cores at once. I investigated this, and the answer is no. Rather it is the other way round. That is, methods that have a parallel plan for one operation tend scale better for that operation than for operations where they have a serial plan. For the record, here are all cases of parallel plans in the multi-threaded test:

MethodData typeOperation
CTE$ILStrCOUNT
FIX$BINARYIntCOUNT
FIX$MULTIIntCOUNT
FIX$MULTIIntUNPACK
FIX$MULTIStrCOUNT
FIX$MULTIStrUNPACK
FIX$MULTI2IntCOUNT
FIX$MULTI2IntUNPACK
FIX$MULTI2StrCOUNT
FIX$MULTI2StrUNPACK
FIX$SINGLEIntCOUNT
FIX$SINGLEStrCOUNT
FNNUM$ILStrCOUNT
TBLNUM$ILIntCOUNT
TBLNUM$ILStrCOUNT

Many fixed-length methods in the pack, and generally they did fairly well on COUNT, and this applies to TBLNUM$IL as well.

Conclusion

Did you really make it this far? Then I don't know who is the most geeky, you who read all this – or I who wrote it. Anyway, you have now seen the results from tests of various list-to-table methods. The single-threaded tests are tested and tried – this is the third time I run them. New for this round were observations of call overhead, and multi-threaded tests.

In the single-threaded tests you saw that TVP, CLR and fixed-length were generally the fastest. But a poor choice of the optimizer can send a fast method down the list, or a successful parallel plan can send some other method to the top. And for all methods based on T-SQL, there is a threshold around 500 000 bytes (x64) or 750 000 bytes (x86) where working with nvarchar(MAX) becomes slower. The method most prone to take hit from this is fixed-length.

But what is more important are the really catastrophic results you can get with T-SQL inline functions. For many applications the iterative method will do just fine, never mind that in these tests it ranks poorly. But few applications can accept that it takes 35 seconds to handle a list of 20 elements.

The client-side tests, showed that TVPs have some more overhead than plain strings, but if you are using ADO .NET this is nothing to be alarmed of. (But if you use Win32::SqlServer, you may have reason for concern.) The client-side tests also showed that using many parameters is dead in the water, unless you only cap it at a very small number.

The multi-thread tests showed that the CLR scales well, and the fixed-length method less so. But overall the result from this test is dubious, given the system I ran the tests on. And the fact that TVPs could not compete also makes those tests less interesting.

And it cannot be stressed enough: if you think you need top-notch performance in your system, you need to run your own benchmarks with your queries, with your hardware. And with the environment you expect. That is, if you need it for long strings, test with long strings. If you expect many multiple processes to hammer the server, test with many processes. And keep in mind that much more than the method itself, what will be decisive is the query plan you get.

Revisions

2010-01-06 – First revision

Back to my home page.